Statistical Power in Cluster Randomized Trials

Material Information

Statistical Power in Cluster Randomized Trials an Evaluation of Observed and Latent Mean Covariate
Aydin, Burak
Place of Publication:
[Gainesville, Fla.]
University of Florida
Publication Date:
Physical Description:
1 online resource (116 p.)

Thesis/Dissertation Information

Doctorate ( Ph.D.)
Degree Grantor:
University of Florida
Degree Disciplines:
Research and Evaluation Methodology
Human Development and Organizational Studies in Education
Committee Chair:
Committee Co-Chair:
Committee Members:
Graduation Date:


Subjects / Keywords:
False positive errors ( jstor )
Mathematical variables ( jstor )
Missing data ( jstor )
Modeling ( jstor )
Multilevel models ( jstor )
Sample size ( jstor )
Simulations ( jstor )
Standard error ( jstor )
Statistical discrepancies ( jstor )
Statistical models ( jstor )
Human Development and Organizational Studies in Education -- Dissertations, Academic -- UF
covariates -- crt -- power
bibliography ( marcgt )
theses ( marcgt )
government publication (state, provincial, terriorial, dependent) ( marcgt )
born-digital ( sobekcm )
Electronic Thesis or Dissertation
Research and Evaluation Methodology thesis, Ph.D.


Cluster randomized trials typically represent a substantial investment of time and money. As the use of cluster randomized trials has become more common, it has become particularly important to investigate the impact of analytical methods on the results of cluster randomized trials. This study investigated Type I error rate and power for the test of the treatment effect in a 2-level cluster randomized trial as well the bias of the standardized treatment effect for four analytical methods. The four methods are the result of two factors (a) whether a level-two covariate was an observed mean or latent mean of a level-1 covariate and (b) whether the level-1 deviation scores was included in the model. Results showed that the treatment effect is estimated without bias; coverage rates might be problematic with latent mean method excluding the deviation score if the cluster size is small. Results also showed that the effect of excluding the deviation score with observed mean methods on statistical power to detect a treatment effect is negligible. However, excluding the deviation score with latent mean method is not suggested due to accuracy problems and reduction on statistical power. There was evidence supporting that, in general, the difference between observed mean methods and latent mean method with the deviation score is not substantial in terms of bias, statistical power and Type I error rate. ( en )
General Note:
In the series University of Florida Digital Collections.
General Note:
Includes vita.
Includes bibliographical references.
Source of Description:
Description based on online resource; title from PDF title page.
Source of Description:
This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Thesis (Ph.D.)--University of Florida, 2014.
Co-adviser: ALGINA,JAMES J.
Electronic Access:
Statement of Responsibility:
by Burak Aydin.

Record Information

Source Institution:
Rights Management:
Applicable rights reserved.
Embargo Date:
LD1780 2014 ( lcc )


This item has the following downloads:

Full Text




2 © 2014 BURAK AYDIN


3 To Mehmetali and K ymet Ayd n, Bur ç ak G ü ll ü , and my beautiful wife Vildan


4 ACKNOWLEDGMENTS First and foremost, I would like to thank my advisor and committee chair Dr. Leite for guiding me for six years . He taught me how to think as an academic, how to publish manuscrip ts and more importantly how to use R, an amazing software. These three aspects are simply the fundamentals of how I am going to earn my life. I will always be grateful to him. I would like to thank Dr. Algina. He is the reason that I wished to join Univer s ity of Florida back in 2008 . He is always incredibly resourceful and helpful. I have always admired h is academic skills. I would like to thank Dr. Miller and Dr. Snyder. Dr. Miller is a respected professor. H is knowledge and experience are extraordinary. He has also been influential on my philosophy of professional life. Dr. Snyder is an exceptional professor and a leader. I tried to learn from her about how to manage a research project . I would like to thank Dr. Khare for being a committee member and lett ing me obtain the minor degree. I will always appreciate him for his valuable time. I also would like to thank the REMarkable faculty members, staff and students that I met at the University of Florida since 2008. They all positively impacted my academic life. Finally, I would like to thank Turkish Government for the financial support.


5 TABLE OF CONTENTS page ACKNOWLEDGMENTS ................................ ................................ ................................ .. 4 LIST OF TABLES ................................ ................................ ................................ ............ 7 LIST OF FIGURES ................................ ................................ ................................ .......... 8 ABSTRACT ................................ ................................ ................................ ..................... 9 CHAPTER 1 INTRODUCTION ................................ ................................ ................................ .... 11 2 LITERATURE REVIEW ................................ ................................ .......................... 17 Cluster Randomized Trials ................................ ................................ ...................... 17 Design Difficulties in CRT ................................ ................................ ................. 19 Analysis Difficulties in CRT ................................ ................................ ............... 21 Power Difficulty in CRT ................................ ................................ ..................... 23 Statistical Analysis of CRT Data ................................ ................................ ............. 23 Multilevel Modeling for CRT ................................ ................................ ............. 29 The null model ................................ ................................ ........................... 29 The random intercept model ................................ ................................ ...... 32 Parameter Estimation Procedures ................................ ............................. 38 Hypothesis Testing ................................ ................................ .................... 43 Missing Values ................................ ................................ ........................... 44 Covariates in Cluster Randomized Trials ................................ ......................... 48 Aggregating Lower Level Covariates ................................ ................................ 49 Observed vs. La tent Mean ................................ ................................ ............... 50 Including Covariates in Multilevel Models for CRT Data ................................ ......... 52 Power and Sample Size ................................ ................................ .......................... 54 Power in Single Level Models ................................ ................................ .......... 54 Power in CRT ................................ ................................ ................................ ... 56 Software for Power and Sample Calculations ................................ ......................... 60 Research Questions ................................ ................................ ............................... 63 3 SIMULATION STUDY ................................ ................................ ............................. 66 Methodology ................................ ................................ ................................ ........... 66 Population Model ................................ ................................ .............................. 66 Simulated Conditions and Population Parameters ................................ ........... 69 D ata Generation ................................ ................................ ............................... 72 Analysis of Simulated Data ................................ ................................ ............... 74 Analysis ................................ ................................ ................................ .................. 74


6 4 RE SULTS ................................ ................................ ................................ ............... 76 Convergence Rates ................................ ................................ ................................ 76 Bias of the Between Coefficient ................................ ................................ .............. 76 Treatment Effect ................................ ................................ ................................ ..... 76 Type I Error ................................ ................................ ................................ ............. 78 Power ................................ ................................ ................................ ...................... 79 5 DISCUSSION AND C ONCLUSION ................................ ................................ ........ 83 General Discussions ................................ ................................ ............................... 83 Specific Discussions ................................ ................................ ............................... 85 The Effect of Level 1 Deviation Scores on Statistical Power ............................ 86 The Effect of Latent Mean on Statistical Power ................................ ................ 89 The Effect of Missi ngness on Statistical Power ................................ ................ 92 Empirical Analyses ................................ ................................ ........................... 94 OMML1L2 versus OMML2 ................................ ................................ ......... 95 LMML1L2 versus OMML1L2 ................................ ................................ ...... 97 Conclusion ................................ ................................ ................................ .............. 98 Suggestions to Researchers ................................ ................................ ............ 99 Limitations and Possible Future Research ................................ ..................... 100 VARIANCE COMPONENT APPROXIMATIONS ................................ ........................ 101 R AND M plus SCRIPTS ................................ ................................ .............................. 103 LIST OF REFERENCES ................................ ................................ ............................. 106 BIOGRAPHICAL SKETCH ................................ ................................ .......................... 116


7 LIST OF TABLES Table page 3 1: Average sample sizes for reviewed IES grants by goal type ................................ .. 70 4 1 Treatment bias summary by analysis method ................................ ......................... 77 4 2 Summary of Selected Effects on treatment bias ................................ ..................... 77 4 3 Relative bias of non zero treatment effects by analysis method ............................. 77 4 4 Coverage rates for the treatment population parameter ................................ .......... 78 4 5 Type I summary by analysis method ................................ ................................ ....... 79 4 6 Summary of statistical power by target and analysis method ................................ .. 80 4 7 Summary of Selected Effects on Power ................................ ................................ .. 80 4 8 Averag e power by method and ................................ ................................ .......... 81 4 9 Average power by method and missing data pattern ................................ .............. 81 4 10 Average power by method and cluster size ................................ .......................... 82 5 1 The Comparison of Two stage Adjustment Procedure to Latent Covariate Procedure ................................ ................................ ................................ ........... 91 5 2 Change in the Variance of the Group mean After the Two stage Adjustment ......... 92 5 3 the effect of handling missing data with FIML estimation on studied methods ........ 94 5 4 Empirical Comparison of OMML1L2 versus OMML2 ................................ .............. 96 5 5 Empirical Comparison of OMML1L2 versus LMML1L2 ................................ ........... 97


8 LIST OF FIGURES Figure page 5 1 The approximated increase in power as a function of and .......................... 88


9 Abstract of Dissertation Presented to t he Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy STATISTICAL POWER IN CLUSTER RANDOMIZED TRIALS: AN EVALUATION OF OBSERVED AND LATENT MEAN COVARIATE By Burak Ayd n August 2014 Chair: Walter L. Leite Major: Research and Evaluation Methodology Cluster randomized trials typically represent a substantial investment of time and money. As the use of cluster randomized trials has become more common, it has become part icularly important to investigate the impact of analytical methods on the results of these trials. This study investigated Type I error rate and power for the test of the treatment effect in a 2 level cluster randomized trial as well the bias of the standa rdized treatment effect for four analytical methods. The four methods are the result of two factors (a) whether a level two covariate was an observed mean or latent mean of a level 1 covariate and (b) whet her the level 1 deviation scores were included in t he model. R esults showed that the treatment effect is estimated without bias; coverage rates might be probl ematic with the latent mean method excluding the deviation score if the cluster size is small. R esults also showed that the effect of excluding the d eviation score with observed mean methods on statistical power to detect a treatment effect is negligible. E xcluding the deviation score with latent mean method is not suggested due to ac curacy problems and reduction of statistical power. There was evidenc e supporting that, in general, the difference between observed mean methods and latent mean method with the deviation score is not substantial in terms of


10 bias, statistical power and Type I error rate. Two empirical studies were conducted to demonstrate th e conclusions of the simulation study.


11 CHAPTER 1 INTRODUCTION A research design generally aims to collect information on a natural or manufactured process. In a research design, a researcher generally sets up an environment to conduct an experiment to di scover, evaluate or validate information. Achieving these goals might be possible by manipulating variables. One way to manipulate variables is to use the comparative experiment in which two or more different conditions are compared to each other. Comparat ive experiments are very common. In the social sciences, comparative experiments are generally conducted in a social environment rather than laboratories and the experimental units are usually individuals or groups of individuals. When compared to natural sciences, observing individuals typically causes a larger amount of experimental error variance. Statistical procedures are employed to achieve an estimate for the variance of experimental error to provide justifiable hypothesis testing to compare conditio ns and other kinds of statistical inference, for example confidence intervals. The estimate of the variance has to be valid. The random assignment of study conditions to individuals or groups might offer a baseline for a valid estimate of error variance. R onald A. Fisher, one of the most influential authorities on the statistical principle s of experimental design, stated ; One way of making sure that a valid estimate of error will be obtained is to arrange the plots deliberately at random, so that no distinc tion can creep in between pairs of plots treated alike and pairs treated differently; in such a case an estimate of error, derived in the usual way from the variations of sets of plots treated alike, may be applied to test the significance of the observed difference between the averages of plots treated differently. (Fisher, 1926 ,p.87 ) .


12 In the social sciences, conditions in a comparative experimental design are often termed control vs. intervention(s) or treatment(s). In the medical sciences, random assignment of interventions to participants is commonly called a randomized controlled trial which is a powerful, simple , and frequently used design. Nevertheless, especially in educational research, random assignment of interventions to clusters or to intact groups rather than individ uals might be more appropriate. Randomizing interventions to clusters is known as a Cluster Randomized Trial (CRT) or group randomized trial. In this type of design, experimental units are clusters. Examples of clusters are schools, classrooms, hospitals, communities, and geographic units. Compared to randomizing individuals, randomizing clusters is generally a more efficient approach to avoid contamination. Contamination occurs if information transfers sion of treatments. Contamination is known to be a threat to the validity of a study (Cook & Campbell, 1979). CRTs are also more useful to examine effectiveness of an intervention at the cluster level (Hayes & Bennett, 1999) . Randomizing clusters , rather than individuals , might be more feasible given education systems all around the world are organized into clusters such as classrooms, schools, and school districts. Using these existing clusters can assist cost reduction. Interventions in educational research are frequently planned at the cluster level. In the Education Sciences Reform Act (2002) , the scientifically based research standards section emphasizes random assignment a s an important step in making claims of casual relationships. I searched grants and contracts funded by the Institute of Education Sciences (IES) over the last deca de. A total of 357 abstracts were emerged


13 inquired, and that corresponds to 27.2% of al l available abstracts. Even though CRTs have considerable advantages, some difficulties might arise with conducting a CRT, for example, design difficulties, choosing among existing data analysis models and, as commonly mentioned as a main disadvantage, low statistical power. Cluster randomized trials are useful to obtain results from a sample of clusters that are generalizable to the target population. Endeavoring to reduce the bias of these results is essential. Bias can be introduced before, during or after the study. Jadad and Enkin (2007) described 29 different types of biases in a randomized controlled trial design that can also occur in a CRT design. Some of these biases cannot be controlled by the study design, because they are introduced by the researcher or readers; for example, outcome choice bias, publication bias or prominent author bias. However, some of these biases can occur due to a poor design choice; for example, selection bias. In a CRT design, sel ection bias does not exist if each cluster in the target population has an equal probability of being assigned to the study conditions. Selection bias was defined and discussed in detail by Guo and Fraser (2010). Selection of a sample of clusters should b e completed carefully to avoid bias. Ideally , clusters should be selected rand omly from the target population. H owever , participation in a study cannot often occur without informed consent in the social sciences. Hence clusters are selected from those who voluntarily consent . Compared to obtaining individual consent , obtaining consent from clusters can be more difficult.


14 CRTs can employ complete randomization, stratified , or matched pair designs to assign clusters to conditions. When compared to sampling i ndividuals, obtaining a large number of experimental units can be more difficult in a CRT. A limited number of clusters can make it diffic ult to match clusters or create homogenous strata and therefore negatively affect the performance of stratification an d pair matching. In order to obtain only cluster level inference, a researcher can employ classic approaches, for example a t test procedure to compare condition means using the cluster means. However, especially in educational research, interest in indiv idual level inference is more frequent. Responses provided by individuals within a specific cluster tend to be dependent. This violation of independence should be taken into account when analyzing the CRT data using within and between cluster correlations. One important step is to correctly establish the unit of randomization in the analysis. If the randomization unit cannot be addressed properly or, more critically, if randomization unit is completely ignored, between and withi n cluster variance will be co nfounded . This can cause incorrect inference. Furthermore, currently there is no single agreed upon method to analyze CRT data (Cunningham, 2010) . Researchers can choose among multileve l modeling, permutation tests , and generalized estimating equations (GEE) . In the present dissertation research study , the focus is on multilevel modeling. A researcher generally is more interested in discovering differences rather than failing to reject a null hypothesis and the probability of rejecting a false null is termed as statistical power (Cohen, 1977). Overpowere d studies can cause waste of resource s ; underpowered studies also can cause the same. Having the same number of


15 participants, CRTs oftentimes are not as powerful as corresponding individual dition is not independent within a cluster. This dependency decreases the flow of the unique information when compared to randomization of individuals, hence it reduces power. In order to increase power in a CRT design, a researcher can increase the number of clusters or include covariates. In this dissertation research, the focus is on power, covariates , and multilevel modeling methods with CRT data. Over the past decade or so, CRTs have become an important design in educational research. As an example of a two level CRT , consider a treatment control study in which N teachers, with one teacher from each of N schools, will be randomly assigned to two treatments. Even though CRTs have substantial advantages and sometimes are the mandatory design choice, diffi culties might arise before and after the data collection. One of these difficulties is power; compared to randomizing individuals, CRTs generally have lower power to detect a treatment effect. Covariates increase power by reducing the error variance if th e correlation between the outcome and the covariate is adequate. Including covariates in CRT designs can be and generally is more budgets friendly compared to increasing the number of clusters. In a CRT design, the researcher can measure covariates at diff erent levels. In their presentation of power analysis for two level CRTs, Spybrook et al. ( 2011) indicate that power can be enhanced by using a level 2 covariate . Level 1 covariates can be aggregated to create level 2 covari ates. L ü dtke et al. ( 2008) d iscussed two different types of aggregation: formative and reflective. In formative aggregation , an arithmetic average of a lower level variable


16 within an upper level can b e used as an upper level variable. In reflective aggregation the expected value of a lower variable is used a s an upper level variable. Formative aggregation results in an observed variable, whereas reflective aggregation results in a latent variable. T wo level CRT data with a level 1 covariate can be analyzed within a multilevel framework. The level 1 covariate can be excluded if it is aggregated to be used as a level 2 covariate; alternatively the 1 deviation can be used together with i ts aggregated projection. I refer to a model with formative aggregation as an observed mean model (OMM) and a model with reflective aggregation as a latent mean model (LMM). I t is of interest to ask if (a) the treatment effect estimator is unbiased, (b ) p ower for detecting the treatment effect is larger when an OMM or a LMM is used; and, (c ) if power is affected by including or excluding the level 1 deviation score. CRT s typically represent a substantial investment of time and money. As the use of CRT s h as become more common, it has become particularly important to investigate whether analytical methods affect bias of treatment effect estimates and Type I error rates and power for testing for treatment effects. My research has the potential to inform the selection of methods of analysis for two level CRT s such that the methods can be selected in light of the aims of minimizing bias, enhancing control of Type I error rates, and achieving the greatest power possible for the investment.


17 CHAPTER 2 LITERATUR E REVIEW Experimental design refers to form ulation of statistical hypothese s, determination of independent and dependent variables, specification of sample and population, specification of randomization process, and selecting an appropriate statistical ana lysis model. In most experimental designs, comparing different conditions is the leading motivation. In educational research, determined or dictated by the research question, individuals or group of individuals can constitute the experimental unit. In th is study, the focus is on assigning groups of individuals to treatments. If individuals are assigned to treatments, procedures associated with randomized controlled trials should be used. Moreover, I assume that randomly selecting the group of individual s and randomly assigning these selected groups to study conditions are feasible (or the choice) for the researcher. If the researcher randomly assigns individuals from a clustered population, in other words, if clusters are crossed with study conditions me aning that all study conditions can possibly take place in each cluster, multisite trial methods should be employed (Moerbeek , 2005; Raudenbush & Liu, 2000) . If randomization is not achievable at all, quasi experimental methods can be employed (Cook & Campbell, 1979). Cluster Randomized Tria ls Cluster randomized trials are common in educational and medical research. CRTs are also called group randomized trials or community trials. In his book, Murray (1998) defined CRTs as comparative studies in which the condition assignment occurs


18 at the gr oup level and observation occurs at the individual level. He provided a few CRT examples in the social sciences and stated that CRTs typically involve a limited number of clusters. He considers CRTs as more externally valid compared to individual randomiza tion stating CRTs more adequately reflect the real world. In his seminal book, Kish (1965) presented two chapters on sampling clusters within a survey sampling framework, and introduced the effect of clustering on the sample variance. Cornfield ( 1978) , in his two page article, studied the effect of clustering on creating equally distributed idio syncratic characteristics for two condition groups. He plainly stated that randomizing clusters and then analyzing the data as if randomization occurred at the individual level is an exercise of self deception. The design and analysis of cluster randomizat ion is discussed by several researchers; Murray ( 1998); Murray, Varnell, and Blitstein,( 2004) ; Donner ( 2009 ); Donner and Klar ( 2004) ; Puffer, Torgerson and Wats on ( 2005) ; Moerbeek ( 2006 a ) ; Chakraborty (2008). Compared to randomizing individuals, CRT designs can reduce the cost, avoid contamination, and increase the feasibility. In educational research randomizing clusters can be more feasible than randomizing individuals because educatio n systems all around the world are organized into clusters such as classrooms, schools, and school districts. Using these existing clusters can assist cost reduction. Kish (1965) compared cluster samplin g to element sampling and stated that cluster samplin g is cost efficient due to lower cost of listing or of locating. Compared to randomizing individuals, randomizing clusters is generally a more efficient approach to avoid contamination. Contamination occurs if information transfers


19 ditions and it can cause diffusion of treatment effects. Contamination is known to be a threat to validity of a study (Cook & Campbell, 1979). Moerbeek (2006a) provides an example in which the impact of vitamin A supplementation on childhood mortality was studied in Indonesia. Complete villages were randomly assigned to treatment or control group because randomizing children was considered politically unfeasible. Contamination is minimized because the children in different conditions are separated by geogra phic boundaries making it unlikely for control group children to access vitamin A via treatment group children. In addition, the CRT design assisted cost reduction given vitamin A capsules are delivered only to the treatment group villages. CRTs can somet imes be the only feasible approach to study effectiveness of an intervention; for example, randomizing individual patients to doctors who had received the guidelines and those who had not would not be feasible in practice ( Puffer, Torgerson & Watson, 2005) . This example is also valid in educationa l settings in which teachers can substitute for doctors and students for patients. The preceding section summarizes considerable advantages of a CRT design. However, CRTs might introduce complexity due to design difficulties, due to analysis model selecti on, and due to low statistical power to detect treatment effects. The following sections summarize these difficulties. Design Difficulties in CRT Sampling a large number of clusters might be expensive. A limited number of clusters can negatively affect th e performance of stratification and pair matching (Donner & Klar, 2004). In addition, loss of a cluster (i.e. attrition) can be more problematic compared to loss of an individual in terms of missing data.


20 After the assignment of clusters to study condition s, the researcher should carefully study the potential effects of subsampling of individuals within a specific cluster. When feasible, assessing all individuals within a cluster can be the choice. Donner ( 2009) stated that some investigators might be willing to provide the treatment to all members of a specific cluster. If the clusters are large in size, the researcher should consider at least two issues with subsampling; representing the population (Kish, 1965) and optimal allocation based on the study budget (Moerbeek et al. , 20 00) . Moreover educational research oftentimes does not allow allocation concealment. Even though this problem is not unique to CRTs and can occur with individually randomized trials, realization of study condition might introduce bias if recruitment wit hin a cluster is affected by the treatment allocation (Teerenstra et al.,2007) There are other designs that can be considered subtypes of CRT designs. For example, the pseudo cluster randomization designs were introduced by Teerensta et al. (2007) as a possible solution to trea tment concealment. Consider the following hypothetical situation; in a classic CRT, clusters are randomly assigned to treatment conditions (i.e. control and treatment), after the randomization, conditions are delivered to all clusters assigned to treatment . In pseudo cluster randomization, 80% of the selected assigned clusters receive the treatment and 20% receives the control; 80% of the assigned control clusters receives the control condition and the remaining receives the treatment. The authors also com pared several methods to analyze pseudo cluster randomization data.


21 Another design similar to CRT designs is the cluster randomized crossover designs. This design was introduced by Rietbergen and Moerbeek ( 2011) as an attem pt to increase statistical power. In this design, clusters receive more than one condition. The authors use the following hypothetical example: to study the effect of having diff the second group do not receive breakfast. Seven days later, the order is reversed. Analysis Difficulties in CRT A researcher might be interested in cluster level infere nce or individual level inference when CRT data are available. In order to obtain only cluster level inference, a researcher can employ classic approaches, for example a t test procedure to compare condition means using cluster means. However, especially i n educational research, interest in individual level inference is more frequent. Responses provided by individuals within a specific cluster tend to be dependent. This violation of independence should be taken into account when analyzing the CRT data using within and between cluster relations. Feng et al. (2001) reviewed the statistical issues in CRT designs and stated that these designs are too expensive to ignore associated statistical issues. There are mainly two classes of statistical models to account for cluster dependency; conditional and marginal models (Murray et al., 20 04) . Genera lized estimating equations are classified as marginal models and these models include two steps: the regression step and the correlation step to provide treatment effects with corrected standard errors. On the other hand, multilevel models ar e classified as conditional and aim to model variation at different levels. Permutation tests can also be employed to analyze CRT


22 data as an alternative to statistical models. The idea is simple; if there is no treatment effect, allocation of conditions do not affect the results. This null hypothesis can be studied by creating all possible treatment arrangements. As introduced above, several approaches can be employed to analyze CRT data. Each strategy has its strengths and shortcomings. A brief comparison of these approaches was provided in the next section. Nevertheless, the difficulty still exists, as stated by Chakraborty (2008) and Cunningham (2010) that methodological studies are not yet satisfactory to explicitly address the best strategies to analyz e CRT data for individual level inference. Another important step when analyzing CRT data is to correctly establish the unit of allocation in the analysis. If randomization unit cannot be addressed properly or, more critically, if randomization unit is com pletely ignored, between and within cluster variance will be mixed (Chakraborty 2008). This can cause incorrect inference. For example, consider a three level design in which students nested within classrooms nested within schools; if schools are randomly assigned to the study conditions but analyses treat classrooms as the allocation unit, p values would be artificially small and that might cause Type I errors. Using the t test procedures to compare two groups in a two level nested data, Dorman ( 2008) studied Type I error rates and concluded that when individual level scores are used, compared to a cl uster adjusted t test procedure, a conventional t test can dramatically inflate Type I error rate. This conclusion is of course not new and can be considered as a recent look to Cornfield (1978). Moerbeek ( 2004) reports statistical power to detect an effect of a predictor might be decreased if an upper level nesting is ignored, given that the allocation unit is correctly established.


23 Power Difficulty in CRT The d ependency introduced by clusters is studied in the concept of ICC and it plays an important role in analyzing clustered data and designing studies for a clustered population. Mathematical dimensions of the ICC were provided earlier in this chapter. From a very broad perspective, the ICC is calculated by decomposing the total variance into between and within group components. If the ICC is zero, similarity of individuals are independent from their cluster, hence each individual represents a random draw from the population. If the ICC is larger than zero, similarity of individuals within a cluster increases the variance of the sample mean (Kish, 1965). This increase in variance reduces the statistical power when compared to no cluster effect. To illustrate th is issue, I utilized Optimal Design Software (Raudenbush, et al., 2011; Spybrook et al., 2011 ) to compare power between some cluster effect (ICC =.10) versus no cluster effect for a design with 40 clusters and 20 individuals in each cluster. No cluster e ffect implies that CRT design is theoretically equivalent to individual randomization. When the effect size was kept at .20, power was .80 for no cluster effect and .36 with some cluster effect. Statistical Analysis of CRT Data Examining a two level struc ture in which individuals are nested in clusters and a continuous outcome is used, this section aims to provide (a) possible justifications to choose multilevel modeling among existing approaches t o analyze CRT data; and (b) a theoretical introduction to m ultilevel modeling , including ass umptions and hypothesis testing.


24 There are several influential books on multilevel analysis: Hox (2010), Kreft and de Leeuw (1998), Raudebush and Bryk (2002), and Snijders a nd Bosker (2012). Kreft and de Leeuw (1998) provi ded a b rief section on the history of multilevel models . Garson (2013) described multilevel modeling a s a variant term for mixed models and notes that the term multilevel modeling is preferred in sociology. The same class of models is known as random regr ession coefficient models in economics. The term covariance components models in statistics even though these models are not the same class of models as multilevel models are can be considered as an indirect but suitable term. Garson also classifies thr ee main situations that call for mixed models: random effects, repeated measures , and hierarchical effects. Random effects exist if the levels of a factor are randomly sampled from all possible values rather than constituting the complete set of all possib le values. Repeated measure models imply correlated observations, for example assessing students multiple times. Hierarchical effects refer to the existence of clustering effects. Raude nbush and Bryk (2002) identified five subm odels for a two level hiera rchy: random effects one way anal ysis of variance (ANOVA), means as outcomes regression, random effects one way analysis of covariance (ANCOVA), random coefficients regression model, and intercepts and slopes as outcomes model. Multilevel models are an ext ension of mixed ANOVA/ANCOVA models and when applied to the same design/da ta, the results will be similar. M ore specifically, as shown by Moerbeek (2006 a ) , the results can be derivable from each other. Heck and Thomas (2000) note that multilevel models might be formulated or discusse d as random coefficients or variance components as a result of the historical traditions. In this dissertation, the


25 terminology is based on Snijders and Bosker (2012). Before providing the theory, a brief comparison of multilevel modeling versus possible a lternatives to analyze CRT data is presented . When analyzing CRT data, the first step is to decide if statistical inference should be provided for the cluster level or the individual level. When inference is needed at the cluster level, summary measures c an be created by collapsing the data in each cluster and following analysis would not need to adjust for the clustering effect given the allocation unit and the analysis unit is the same (Donner, 2009) . If the cluster sizes differ, weighting should be used for summary measures (Moerbeek, Breukelen & Bergen, 2003) . When statistical inference is needed at the individual level, the researcher can choose summary measures or individual level data to analyze. Klar and Donner (2007) show ed that for a continuous outcome and with fixed cluster siz e, t test on cluster means is fully as efficient as an individual level analysis of variance. Moerbeek , Breukelen and Bergen (2003) compares t test on cluster means, linear regression ignoring cluster effect, fixed effect regression in which clusters are t reated as fixed effects and multilevel regression to obtain individual level inference. They state d that fixed effect regression and naïve regression generally result in incorrect estimates of the standard error of the treatment effect, causing Type I erro r. The t test on cluster means was found to be less efficient than m ultilevel modeling. Therefore, they conclude that multilevel modeling is the best strategy among these four approaches when cluster sizes vary or covariates are included. In 2007, Hedges i ntroduced a cluster adjusted t test that can be used with individual level data. This approach might compensate for the


26 inefficiency reported by Moerbeek , Breukelen and Berger (2003). H owever, it still is not able to adjust for covariates. When individual level inference is needed, and the individual level data are chosen for analysis , GEE can be used (Fen g, Diehr, Peterson & McLerran, 2001; Ghisletta & Spini, 2004; Halekoh, 2006; Liu & Liang, 1997 ) . This strategy aims to provide population averaged effects with cluster adjusted error estimates. The intra class correlation (ICC) is treated as a nuisance parameter in GEE , whereas in a multilevel framework , the ICC is modeled . Hence, GEE is not appropriate to examine the variance and cor relation structure of CRT data. Hox (2010) stated that standard errors obtained with GEE m ethods counteract the approximate estimation of the random structure in multilevel models. Snijders and Bosker (2012) briefly mention GEE as an alternative estimation method to maximum likelihood (ML) based multilevel models. They introduce a type of GEE t hat uses a sandwich estimator to obtain standard errors and ordinary least squares for the regression coefficients. They stated that when the model is a good approximation of reality , in other words, a correctly specified model, ML based multilevel modelin g is more efficient than GEE. Rau denbush and Bryk (2002) compares GEE based standard errors to multilevel model standard errors to check the severity of model m isspecification. If the standard errors are not very different, they conclude there is no signal that the constructed multilevel model is misspecified. When the model is misspecified, for example when the normality assumption of error terms in a multilevel structure does not hold, ML based multilevel model standard errors are not consistent but GEE b ased standard errors are (Hox 2010). A full comparison between multilevel modeling and GEE is beyond the scope of this dissertation. Feng et al.


27 (2001), using two data sets , compared GEE to multilevel modeling and permutation tests and concluded that GEE i s too liberal. Murray et al. (2004) summarized the limitations of both methods for a binary outcome and provide key references to these limitations , espec ially for GEE. Gardiner, Luo and Roman (2009) provided differences between multilevel models and marginal models in terms of theory and assumptions. Hubbard et al. (2010) , in their article titled to GEE or not to G that GEE provides a compelling alternative to multilevel models when the focus of the analysis is the estimation of mean effects. In this case, they f ound GEE more useful given the error distribution is not needed to be correctly specified. Moerbeek , Breukelen and Berger (2003) criticize d GEE because it assumes missing data occur completely at random whereas multilevel models are usable for both missing at random and missing completely at random. Overall, both methods acknowledge the dependen cy within a cluster. In this dissertation study, the focus is on multilevel modeling because it can estimate both regression coefficients and variance components. However, in certain circumstances an educational researcher might be interested only in regre ssion coefficients and their significance. Hence, when analyzing a n empirical data set, it would be important to carefully examine the multilevel model assumptions, especially for the error terms and to compare results with GEE. Basic permutation tests, introduced by Fisher (1925), are model free approaches and can be used to analyze individual level data to obtain individual level inference. The idea is simple; if there is no treatment effect, allocation of conditions do es not affect the results. This nu ll hypothesis can be studied by creating all possible treatment arrangements. However , the basic permutation tests are unable to adjust for covariates.


28 Gail, Mark, Carroll, Green and Pee (1996) adapted the permutation test approach to adjust for covariates for a single level model. The y examined the randomization distribution of the residuals of a linear regression model that excludes the treatment indicator. Feng et al. (2001) analyzed two CRT data sets in which each condition had the same number of groups, one with a categorical and o ne with a continuous outcome, and reported that permutation test on residuals produced larger p values for the treatment effect when compared to GEE and mixed model. However the authors recommend ed permutation test over GEE and mixed models due to its robu stness to model assumptions. Murray et al. (2006) conducted a simulation study to compare the permutation approach to the mixed model approach for the analysis of balanced CRT data, in which they used covariate adjusted summary measures in permutatio n test . One important point is that the covariate adjustment for summaries requires assumptions depending on the adjustment type, for example linearity for regression. When the cluster and individual level errors are distributed normally, the two approaches per formed roughly identically (Murray et al. , 2006). When individual level errors are distributed normally but cluster level errors have a substantially non normal distribution, mixed models became slightly less powerful (Murray et al. , 2006). One important d rawback of permutation tests is that it is not possible to examine interaction effects. Moreover, permutation tests are also partially model based approaches if covariates exist in the study. Multilevel modeling is the model choice in this dissertation sin ce any departures from normal a distribution are not included. The comparison of multilevel modeling to GEE and permutation tests in the preceding paragraphs was limited to continuous outcomes given the research questions


29 are also limited to continuous ou tcomes. All three methods can be modified to analyze binary and categorical endpoints. Multilevel Modeling for CRT When clusters are randomly assigned to conditions in a design with a continuous outcome and continuous predictor variables, a multil evel mod el is appropriate to examine the mean and variance structure. In the following sections, a two level design is examined; individuals constitute the level 1 units and clusters constitute the level 2 units. Following Snijders and Bosker (2012) notation, J de notes the number of clusters, the number of individuals in the groups is denoted by n j for group j ( j J ), the total number of individuals is denoted by , individuals within the group are denoted by i ( i n j ), the contin uous outcome is denoted by Y ij , the continuous level 1 predictor is denoted by X ij , the binary treatment indicator is denoted by W j , the continuous level 2 predictor is denoted by Z j . Two multilevel sub models are introduced briefly, the null model and the random intercept model. The null model The null model, aka the empty model, is the simplest possible multilevel model. Raudenbush and Bryk (2002) u se the term one way ANOVA with random e ffects for this set up. (2.1) where is the grand mean, the fixed effect; is the random effect at the cluster level; is the random effect at the individual level. The random effects are assumed to have a normal distribution wit h a mean of 0 and , and . The random


30 effects are assumed to be mutually independent thus . The null model provides a point estimate and confidence interval for the grand mean. However, the main estimate that can be obtained from this model is the ICC to understand the fraction of total variability due to the cluster level. I present the theory of multilevel models using the least squares (aka ANOVA) and ordinary least squares (OLS) estimati on procedure because it is instructive for demonstration . Using least square estimator with , the random effects can be estimated first and treated as known during the estimation of the fixed effect. Following Kuehl (2000); (2.2) (2.3) (2.4) (2.5) (2.6) (2.7) Hence, assuming a balanced design, the estimate of is M SW, and the estimate of is (MSB MSW)/n. The overall mean can be estimated with;


31 (2.8) The ICC can be estimated with; (2.9) Confidence intervals (CI) of these estimates can be computed , 100(1 effects and the ICC: (2.10) (2.11) (2.12) (2.13) The ICC is calculated by Equation 2.9 and measure s homogeneity between individua ls who are randomly selected from a specific cluster, given that this specific cluster is randomly selected from actual clusters of the physical distribution of the population (Snijders & Bosker, 2012). The standard error of the estimator in Equation 2.9 w hen all grou p sizes are equal (Fisher,1958) is (2.14) The null hypothesis of can be tested with (2.15)


32 where it follows an F distribution with J 1 and J(n 1) degrees of freedom. Snij ders and Bosker ( 2012) states that if the null hypothesis is rejected, inference on the grand mean will likely be more valid if multilevel analysis is used . It is not possible to be certain about the choice of multilevel analysis bec ause rejecting H 0 may be due to a Type I error. Failure to reject H 0 is also a fallible indicator for selecting a single level analysis because of the possibility of a Type II error. The ICC estimate can also be used in the calculation of a design effect ( DE) for estimating the grand mean ; (2.16) This statistic reveals how much the sample variance is affected by clustering, as compared with a simple random sample. For example, if DE=2 and M=Jn=100, the two stage sample is equivalent to sampling 50 individuals with simple random sampling, to be precise, the standard error of the mean for the two stage sample is equal to the standard error of the mean for simple random sampling of 50 individuals . A confidence interval for the design ef fect, using a normal approximation, can be calculated using Equation 2.14; (2.16a) The random intercept model The null model does not include any covariates. The random intercept model can include covariates at both levels and assu mes the effect of individual level covariates is fixed across clusters. Let p and q be the numbers of covariates at the individual and cluster level respectively; if p =0 and q , the random intercept model is classified as a means as o utcomes regression (Raudenbush & Bryk, 2002): (2.17)


33 Equation 2.17 can be divided into two parts: (2.18) It can be seen in Equation 2.18 that the coefficients of the level 1 covariates are assumed to be fixed across clusters; in other words, slope variances are fixed to be zero. This assumption can be relaxed and covariates can be added to explain slope variation. In that case, the model is no longe r called the random intercept model. Raudenbush and Bryk use the term intercepts and slopes as outcomes model . Snijders and Bosker use the term random slopes . Matrix notation for a multilevel model is provided by Swaminathan and Rogers (2008). (2.19)


34 (2.20) where each row in Z is a vector of predictors, for example . In terms of the observations, (2.21) (2.22) where is an vector of outcomes, is an matrix of predictor variables, is a vector of unknown parameters, is an identity matrix, is an vector of random errors assumed normally distributed with a mean vector of 0 and a variance covariance matrix in which all diagonal elements are an d remaining elements are 0. (2.23) (2.24) where is a matrix of predictors, is an vector of fixed effects, is a vector of level 2 errors and T is a


35 variance covariance matrix. The diagonal elements of T are the variances of the elements of , and the remaining elements are their covariances. As an illustration, here I provide matrix notation and equations for a random intercept model wit h a single level 1 covariate, one level 2 covariate and one level 2 dummy coded variable which is the most complex model used in the analysis of simulated data in this dissertation. This matrix notation approach can easily be modified for the null model to replace sum of squares notation provided before. (2.25) (2.26) During the estimation process Equation 2.25 and 2.26 are created for each cluster. For example, the equations for the s th cluster are (2.27)


36 (2.28) Followi ng Raudenbush and Bryk (2002, p. 42 44), assuming all elements of are random, each group has the equal number of observations and the same values of the predictor matrix X , two stage OLS estimation proceeds as follows (2.29) (2.30) (2.31) (2.32) If the estimates of and are available, the sampling variance of two stage OLS estimator can be calculated following Swaminathan and Rogers (2008) by defining T (Equation 2.24) as parameter dispersion, the variance covariance matrix of level 2 errors and as the error dispersion for cluster j. (2.33a) (2.33b) Given that Swaminathan and Rogers state d that tw o stage OLS is an appropriate procedure if is the same across the clusters, in Equation 2.33a should be the same for each cluster.


37 Estimates of the parameters are derived from equations and solving these equa tions might require assumptions. For example Equation 2.29 is derived assuming expected level 1 errors to be 0. If assumptions are violated, the validity of estimates is questionable. For a continuous outcome, a random intercept model requires, at least, n ormally distributed residuals at both levels, and constant variance of level 1 residuals across groups and examination of outliers. Snijders and Bosker (2012) advise d devoting some energy to examine assumptions of the multilevel models. Hox (2010) gives an example on how to inspect residuals plots. Kim and Frees ( 2006) studied the effect of omitted vari ables in multilevel models and found that given the varia bles have the same correlation with the dependent variable, omitted lower level predictors yield more serious bias in regression coefficients than omitted variables in higher levels. A recent R package, influence.ME (Rense , Manfred & Ben, 2012) offers dete cting influential data points in mixed models. Influence diagnostics for mixed models has been available in SAS for a longer period (SAS, 2004) Raudenbush and Bryk (2002) note d that if level 1 residuals are not normally distributed, the standard errors at both levels are biased. A violation of homogeneity of variance at level 1 might indicate unidentified slope heterogeneity at level 1. An unidentified slope heterogeneity might bias estimates of the level 2 coefficien ts. (Raudenbush & Bryk, 2002, p. 264). If level 2 residuals are not normally distributed, the standard errors at level 2 are biased. The assumption of normally distributed level 1 residuals might be violated if the outcome is not continuous. For discrete outcomes, readers are referred to Snijde rs and Bosker (2012). When one or more assumptions are questionable, researchers can use variable transformations or, they can study modified


38 estimators, aka robust estimators (Raudenbush & Bryk, 2002). Several estimators for a multilevel framework are int roduced in the next sections. Parameter Estimation P rocedures Parameters in multilevel models can be estimated with several methods Fixed effects and variance components can be estimated simultaneously with maximum likelihood (ML) or, they can be estimated separately with least squares estimation procedures. Ordinary least squares (OLS) estimation minimizes the sum of squared deviations of the observed and predicted data points. The OLS procedure produces unbiased estimates when the cluster sizes are equal (Kreft & de Leeuw 1998) and can be expressed in explicit formulas. If cluster sizes are unequal, weighted least squares (WLS) estimation should replace OLS (Murray, 1998). Generalized least squares (GLS), as an extension to OLS, is a more efficient estim ator when homogeneity of error variances is questionable (Swaminathan & Rogers, 2008). Moreover, if the multilevel model is constructed as random slopes model that includes level 2 predictors with random variation for different level 1 coefficients, GLS es timators are more appropriate than OLS estimators (Swaminathan & Rogers, 2008). Murray (1998) noted that quite often least squares and ML estimates converge to the same result in large samples with normal error distribution. Swaminathan and Rogers (2008) n ote d that if cluster sizes differ, fixed effects and variance components should be estimated together with maximum likelihood procedures. Estimating fixed and random parts simultaneously can be achieved with an iterative procedure aimed to maximize the lik elihood based on trial and error. The trial and error procedure can be guided using the expectation maximization (EM) algorithm. Alternatively, ML estimation


39 can be accomplished with the Newton Raphson procedure to solve likelihood equations with approxima tions rather than trial and error. Searle, Casel la and McCulloch (1992) provided mathematical explanations for ML with EM and ML with Newton Raphson procedures and report ed that ML with Newton Raphson can be faster due to fewer iterations but can be more vulnerable to poor starting values in terms of convergence. Swaminathan and Rogers (2008) suggests OLS fixed effect estimates and pooled within cluster variance to be used as starting points for ML with Newton Raphson. They also provide a formula to create a T matrix as a reasonable starting point for level 2 residuals. The restricted maximum likelihood (REML, aka RML) procedure separates the estimation of fixed and random paramet ers (Raudenbush & Bryk, 2002; Searle, Casella & McCulloch, 1992). Snijders a nd Bosker (2012) note d that when J q 1 is equal or larger than 50 (J is the number of clusters and q is the number of level 2 predictors), the difference between ML and REML estimates are negligible. If J q 1 is smaller than 50; ML estimates of the varianc e components are biased, generally downward (Hox, 2010). Raudenbush and Bryk (2002) posit that ML estimates for level 2 variances and covariances will be smaller than REML by a factor of approximately (J F)/J where F is the total number of elements in in Equation 2.20. However, when deviance tests are the choice to compare models with different fixed effects but the same variance components, the REML deviance should not be used because it is a deviance for the variance components onl y. Instead the ML deviance should be used (Snijders & Bosker, 2012). If the models differ in both the fixed effects and the variance components, neither deviance can be used to conduct test s on the fixed effects.


40 If the assumptions of normally distribute d residuals are violated for a continuous outcome, the OLS and GLS estimation can be modified using the Huber correction as an attempt to produce robust standard errors with large level 2 sample size. A brief introduction is provided b y Raudenbush and Bryk (2002, p. 276), the authors provide robust variance estimator under OLS and GLS. M plus software employs the MLR estimator not only to produce robust standard errors using an extension of Hubert White sandwich estimator but also robust model chi square usi ng Yuan Bentler correctio n (Muthen & Muthen, 1998 2012). As noted previously the GEE framework can also use sandwich estimators. The random intercept models are expected to yield very close estimates to GEE with an exchangeable correlation matrix (Hubbard et al., 2010) . Feng et al. (2001) reports the exchangeable correlation, which has the same interpretation as the ICC, is the most common function in GEE. In their notable article, Hox, Maas and Brinkhuis ( 2010) , showed that when the data simulation procedure did not include departures from normality or variance homogeneity in a multilevel structural equation framework, standard errors produced by the ML were more accurate than standard errors produced by MLR except when there were at least 200 clusters. These authors did not include only the ML and MLR in their study . T wo different diagonally weighted least square estimator s (DWLS) we re also compared. T hey concluded that DWLS can perform as well as ML. Unfortunately, Mplus does not implement DWLS estimators for the la tent mean model. Because the data simulation process for this dissertation does not include modeled departures from normality and homogeneity, the ML estimator is used. Before closing this section on


41 parameter estimation, it is necessary to mention other e stimation procedures, the bootstrapping and Bayesian methods. The bootstrapping procedure resamples from the observed sample using replacement. Estimation is applied to each sample. Snijders and Bosker (2012) mentioned the bootstrapping procedure as a prom ising method because it can improve the standard error estimates. Leeden, Meijer and Busing (2008 ) summarized three approaches to the bootstrap for two level models: the parametric bootstrap, the residual bootstrap and the cases bootstrap. Each pr ocedure h as its own assumptions: the parametric bootstrap assumes fixed explanatory variables with a correctly specified model and residuals at two levels with a multivariate normal distribution. The residual bootstrap approach relaxes the normal distribution of th e residuals assumption. The third approach, the cases bootstrap, requires only a correct specification of the hierarchical dependency. Given it does not assume fixed pr edictor variables, Hox (2010, p. 267) states resampling cases would be justifiable in mu ch of the social and behavioral science s . However, resampling entire cases in a multilevel structure is not as straightforward as it is for single level structures. Leeden, Meijer and Busing (2008) stated that, the cases bootstrap procedure should reflect the nature of the data. If for a CRT design, both clusters and individuals are randomly selected, subsampling clusters and individuals separately would be preferable. If all individuals in each cluster are included in the study, only the level 2 units sho uld be resampled. However , there might be several difficulties in follow ing these suggestions, F or example if cluster size is not equal across clusters, subsampling only level 2 units would not assure the same sample size for each bootstrap replicate and w ould introduce noise (Goldstein,2011).


42 Subsampling cases based on level 1 units would assure the same sample size for each bootstrap replicate, but it would produce variable number of level 2 units. Small cluster sizes might create difficulties due to resa mpling the same case with replacement when the cases bootstrap is utilized (Leeden, Meijer & Busing, 2008 ; Vallejo et al. 2013). An illustration of parametric and non parametric bootstrapping for multilevel models was provided by Hox (2010). Another proce dure to estimate parameters is Bayesian methods. ML, OLS and bootstrapping procedures aim to estimate specific population parameters; whereas Bayesian methods consider population parameters as random variables that represent uncertainty about the values of the parameters . Independent from the sample in hand, this procedure chooses a prior distribution, informative or uninformative, to indicate the uncertainty about the population parameter distribution prior to collecting the data . This prior distribution i s combined with the information from the sample in hand in order to create the posterior distribution of the population parameter which represents uncertainty after collecting the data. Posterior uncertainty is typically smaller than prior uncertainty . In this sense, Bayesian methods are an approach to obtain likely values of a parameter rather than point estimates, however, the mean or median of the posterior distribution can be interpreted as a point estimate, and the standard deviation of the posterior c an be used as the standard error (Snijders & Bosker, 2012). A short but effective introduction to Bayesian philosophy is provided by Wackerly Mendenhall and Scheaffer (2008). Hox (2010), Snijders and Bosker (2012), Swaminathian and Rogers (2008) introduce Bayesian methods for multilevel models.


43 Comparing the performance of different estimators or algorithms or procedures is well beyond the scope of this study. Overall, early in this chapter , GEE and permutation frameworks are left behind since multile vel modeling can focus on both regression coefficients and variance components. Within the multilevel framework, robust estimation procedures are not considered in this study given they do not perform better than non robust procedures if certa in assumption s are not violated. O rdinary least squares estimation is not addressed in this study due to its inefficiency, Bayesian and bootstrapping are also not addressed due to complexity and computational demand. Other rationales for choosing ML are that it is wide ly used and convenient. Hypothesis T esting Using large sample approximations, p values can be computed to separately examine the null hypotheses of whether each parameter is equal to zero; Raudenbush and Bryk (2002) shows that t and distributions can be utilized to compute p values for the fixed effects and the random effects respectively. Available with ML procedures, deviance tests can also be employed to test the fixed effects in some situations and one tailed likelihood ratio tes ts can be empl oyed to test the random effects (Snijders & Bosker 2012). LaHauis and Ferguson ( 2009) compared the performance of tests to deviance tests for random e ffects in a simulation study and conclude d that tests are more powerful, but corrected deviance tests (one tailed tests on deviance differences) exhibit slightly better Type I error rates. Another alternative t o test for the random effects is to use permutation tests (Lee & Braun, 2012) , which have been shown to perform well even when the sample sizes at both level s are small. For relatively small sample sizes, Manor and Zucker (2004) investigated eight different


44 procedures for testing a given fixed effect. Using a mixed model framework in their simulation study, sample sizes of 15, 20 , and 30 and approximately 6 observations for each individual were investigated. The authors utilized both REML and ML estimates to evaluate eight different procedures. In short, the authors concluded that the Satterthwaite procedure with REML estimates and the Bar t lett corrected likelihood ratio tests with ML estimates performed equally well by yielding appropriate Ty pe I error Type I error rate with correctly specified covariance structure in mixed modeling was also noted by Kowalchuk et al . (2004). The authors also investigated the performance of Kenward normal and heterogeneous covariance structure with small samples and concluded that this procedure generally provided good Type I error control even with the unstructured covariance matrix. Missing Values M y review of the abstracts of studies that were funded by the IES over the last decade that were designed to use a CRT showed that researchers generally plan for perfectly balanced data structure, where each cluster has the same number of students and each treatment condition has the same number of clusters. It is not common to obtain perfectly balanced data due to (a ) nonresponse, and (b ) failure recruit the same number of participants within clusters and the same number of clusters within study conditions. As a consequence, the information gathered with the collected data is less than what is planned. A lso, the representativeness of recorded data is questionable (Longford, 2008) . When analyzing an incomplete data set, any data analysis method makes an assumption of missingness type. Rubin ( 1976) categorizes three different


45 types of missingness: missing completely at ra ndom (MCAR), missing at random (MAR) and missing not at random (MNAR). Following Schaffer and Graham ( 2002) , consider a univariate missing pattern in which missingness occurs for a single variable ( Y ) in the data set and the remaining variables ( Xs ) are compl ete. Let R be an indicator variable where 1 indicates the datum is missing on Y and 0 in dicates it is not missing. L et the data set that would have been observed if there were no missing data be denoted by D com and let D com =(D obs, D mis ), meaning D com is pa rtitioned as observed data and missing data; MAR implies that the distribution of missingness is independent from D mis ; P(R|D com )=P(R|D obs ) (2.34) Equation 2.34 implies that the probability of missing Y is dependent on observed data , but not on missing d ata. If missingness is also independent from observed data, the data are MCAR: P(R|D com )=P(R) (2.35) If missingness is dependent on D mis , the case is called MNAR. For example, consider a CRT design in which teachers are randomly assigned to an interventi on group and pre intervention test scores are obtained for every single student in the sample, but some post intervention test scores are missing. If missingness occurred w ho scored poorly on the pretest refuses to take the post test, the case might correspond to MAR. If a teacher refuses to report low posttest scores, the case might correspond to MNAR.


46 nalyst is required to make assumptions about the causes of nonresponse, MCAR, MAR or MNAR. It is generally not realistic to assume MCAR unless missingness is planned (A tkins, 2005; Schafer & Graham, 2002) , that is, when all participants are not asked the same set of questions to avoid fatigue. In some cases, the r esearcher might have reason to assume MNAR , for example, when drop outs are likely to be related to the outc ome measures. Even though it might not be very feasible, researchers might conduct follow ups on nonrespondents to investigate the causes of missing data. In short, assumptions about the causes of nonresponse generally cannot be tested (Atkins, 2005; Graham, 2012; Schafer & Graham, 2002) . When analyzing an incomplete data set, the ultimate purpose is not to predict missing observations but to provide valid and efficient inferences about the population of interest (S chafer & Graham, 2002). When it is reasonable to assume MCAR, listwise deletion can yield unbiased estimates. When MAR or MCAR hold, in order to increase efficiency, hence to increase statistical power, there are several alternatives; single imputations (o bserved mean imputation, conditional mean imputation, imputing from conditional mean distributio n), multiple imputation (MI), EM algorithms and full information maximum likelihood (FIML) estimation . Multiple imputation creates several complete data sets u sing plausible values based on a missing data model. Each constructed complete data set is analyzed and results are processed to provide estimates. It is suggested to use as many variables as practicable in the missing data model (Schafer & Graham, 2002) . The EM algorithm to produce a maxim um likelihood estimate can be employed to handle missing data. In the expectation step of the EM


47 algorithm , the distribution of R|D obs is used in the calculation of the expected likelihood and the maximization step proceeds as usual (Longford, 2008) . Enders (2006), following a struc tural equation framework, stated that the use of an EM input matrix should provide parame ter estimates that are nearly identical to parameters provided by FIML. Enders and Bandalos (2001) refer to FIML as a casewise likelihood procedure, and the likelihoods for each case are summed to produce the likelihood for the sample. Graham (2012) referr ed to FIML as raw data maximum likelihood where the process reads the raw data one case at a time and uses all available information for each case. S nijders and Bosker (2012) stated , formulating the likelihood of the observed data using ML under MAR is ref erred to as FIML. A simple technical comparison between ML and FIML estimation methods available with the PROC CALIS procedure is provided in ML procedure would discard any observations with at least one missing value. On the other hand, the FIML procedure would use every bit of information, assuming that the incomplete observations are random. When MNAR is assumed, handling missing data in general is a more difficult task. Missing data tech niques for clustered data have been investigated by several authors, such as Black, Harel and McCoach ( 2011) , Graham (2012), Longford (2008), Snijders and Bosker ( 2012), and Yang, Kim and Zhu ( 2013) . Snijders and Bosker (2012) explain ed multiple imputation under MAR and MCAR ass umptions and provide an R code o n their website. Longford (2008) provides technical details about the EM and imputation appro aches for multile vel data. Graham (2012) mentioned an R pac kage PAN (Jing & Schafer,2013) as the best imputation solution for clustered data. Although


48 he does not provide any guidance to use the PAN package, he provided guidance to analyze multilevel data with MAR using both SPSS (IBM corp., 2012) and HLM (Raudenbush, Bryk & Congdon, 2004) accompanied with NORM 2.03 software (Schafer, 1999) to impute data. Black , Harel and McCoach (2011) investigated three different missing data techniques for multilevel d ata with MAR : multiple imputation ignoring the clustered structure with the R package NORM (Alvaro & Schafer,2013); multiple imputation taking into account the clustered structure with the R package PAN, and FIML in M plus software . The authors conclude d th at FIML and imputation with PAN performed similarly and better than imputation with NORM. Yang, Kim and Zhu ( 2013) introduced a new imputation method for mixed models under MNAR. Covariates in Cluster Randomized Trials Appropriate covariates are expected to increase power in randomized designs and also to reduce bias in non randomized designs. Thus, a strong relationship between the covariates and the dependent variable is desirable. The ultimate goal of measuring covariates is to maximize power given that the alternative, increasing the sample size, is generally more expensive. Moerbeek (2006b) studied the cost and benefit for measuring covariates vs. increasing the sample in terms of statistical power. In a two level design, covariates can be measured at two different levels. The impact of different level covariates on statistic al power is discussed later in this chapter. In a setup with students nested in schools, the researcher can measure student level covariates, for example pretest scores , socioeconomic status, gender etc. The researcher can also measure school level covaria tes, for example school budget, participation in a program. Hox (2010) referred to covariates that are used only in the


49 level they are defined as global covariates . For example , gender is a global variable at level 1 and school budget is a global variable for level 2. Ludtke et al. (2008) use d the term global for school level covariates that cannot be disaggregated to the student level, for example school budget. A common practice to create school level covariates is to aggregate the student level variable s; for example pretest scores, race, income. These types of variables are referred as contextual variables. Student level outcomes can introduce two different relation structures to the outcome; within group and between groups. Consider a random intercepts model, a simplified version of Equation 2.18, with a continuous level 1 covariate; (2.36) the regression coefficients of and are both related to the same variable X , and these two coe fficients can be completely different (Snijders & Bosker, 2012). The difference between two coefficients, is termed the contextual effect by Raundenbush and Bryk (2002). Aggregation alternatives and issues are the subject of the next section. Aggregating Lower Level C ovariates Ludtke et al. (2008) discussed two different types of mean aggregation: formative and reflective. Using a setup with students nested in schools. I n formative aggregation an arithmetic average of a student var iable within each cluster is used as a school level variable. In reflective aggregation the expected value of a student variable is used a s a school level variable , for example, pretest scores and the school mean of


50 pretest scores. Formative aggregation re sults in an observed level 2 variable, whereas reflective aggregation results in a latent level 2 variable. In other words, for formative aggregation it is assumed that level 1 covariates define level 2 characteristics; whereas for reflective aggregation it is assumed that a level 2 construct is accounted for a level 1 covariate. Aggregating lower level covariates to create higher level covariates might lead to two different issues: the ecological fallacy and reliability of the aggregated variables (Hox, 201 0). The correlation between group means cannot be interpreted to explain the individual level relationship. This issue is known as ecological fallacy. Using reflective aggregation and following Snijders and Bosker (2012) for a balanced design, the reliabil ity of an aggregated variable is (2.37) Equation 2.37 implies that the reliability of an aggregated variable is a function of ICC and group size . F or example if ICC=.10 there should be 21 participants in each cluster t o achieve a reliability of .7. I f ICC=.2 there should be 10 pa rticipants to achieve .71. Reliability is generally a concern for reflective aggregation. When formative aggregation is the case and all participants in each cluster are included in the study, it would be i nappropriate to use ICC to estimate the reliabil ity (Ludtke et al., 2008). Observed vs. Latent M ean When using reflective mean aggregation or formative mean aggregation with small sampling ratios (the percentage of sampled participants in each cluster), u sing arithmetic average might introduce bias when estimating the between level coefficient due to reliability of the aggregated variable (Croon & van Veldhoven, 2007 ; Ludtke et al.


51 2008, Shin & Raudenbush , 2010) . A strategy that takes reliability into account is provided by Lutdke et al. (2008). They found that the one stage latent mean approach outperformed the observed mean approach in terms of producing unbiased contextual effects in all simulat ed conditions (i.e. different covariate ICC, level 1 and level 2 sample size). Assuming a set up with two levels and a continuous contextual variable in a balanced design, calculating the arithmetic average, the observed mean, of the level 1 variable to b e used as a level 2 predictor is straightforward: (2.38) However , adjusting the cluster means to correct the regression analysis requires several steps when following the two stage approach (Croon & van Veldhoven, 2007 ). Consider the following model for a level 1 covariate (2.39) where is the contextual variable, measured for individual i in cluster j , refers to cluster s latent mean and this point , only the is observed, remaining are all unobserved. The variance of is assumed t o be constant for each cluster and is denoted by . The variance of is denoted by . Treating as an outcome and estimating an empty multilevel model provides estimates of these variance terms. In the second step, weights can be created based on these estimates ;


52 (2.40) (2.41) where is the grand mean of the observed covariate. If the clu ster sizes are equal w eight is constant across clusters. Equation 2.41 is termed as unbiased predictor of the latent group mean by Croon and van Veldhoven (2007), empirical B ayes estimate by Shin and Raudenbush ( 2010) and two stage latent variable approach by Lutdke et al. (2008). The latent mean model can be estimated by calculating empirical Bayes estimates of the latent means and using them as level 2 means in a multilevel analysis. Lutdke et al. (2008) state that the two stage approach is a limited information app roach as opposed to full information approach (one stage) utilized in M plus when the LMM is estimated . The comparison by Lutdke et al. (2008) between these two approaches indicated that the results are similar except under the condition with small sample s izes and low ICC , in which case the full information procedure outperformed the two stage procedure. They stated that the full information procedure should be preferred. The full information approach wa s utilized when an alyzing the simulated data sets in t he present . More details and historical background of latent variables in structural equations are provided elsewhere (Bollen, 1989). Including Covariates in Multilevel Models for CRT Data In their presentation of power analysis for two level CRTs , Spybroo k et al. (2011) indicate that power can be enhanced by using a level 2 covariate and that the data can be analyzed by using the following model


53 (2.42) where is the dependent variable for the i th level 1 unit in the j th cluster, is a level 2 covariate, the is an effect coded variable, ½ for treatment and ½ for control, is a level 2 residual, and is a level 1 residual. The covariate could be a variable that is not aggregated from scores on level or it could be a variable that is aggregated from scores on level 1 units, such as a pretest measure. L udtke et al. (2008) investigated estimators of the parameters of the model (2.43) where is the latent mean of for cluster j . An alternative to Equation 2.43 is to substitute for ; (2.44) I refer to a model in which is used as the aggregated variable as a OMM and a model in which is used as the aggregated variable as a LMM . M plus 7 can be used to estimate E quation 2.43 and 2.44 . Using both an analytical and an empirical (simulation) approach, Ludtke et al. (2008) showed that the OMM results in a biased estimator for regardless of the level 1 or leve l 2 sample size and that the LMM can be used to improve estimation of . M plus can also be used to estimate models obtained by excluding the level 1 deviation score s from the models in Equation 2.43 and 2.44.


54 The models studied by Lud tke et al. (2008) can be extended by adding a treatment effect. For example, the extended LMM with a level 1 covariate is (2.45) Power and S ample Size As mentioned earlier, researchers generally are more interested in discovering differences rather than failing to reject a null hypothesis. In a simple comparative experiment, the researcher sets up the environment and collects data on different treatment conditions followed by a test to compare means. The power of this test is the p robability of rejecting a false null hypothesis. Power analysis should be conducted as part of the design of a study. If the power is low, the researcher is likely to fail to show statistical difference between treatment means; hence likely to waste resour ces if treatment effect e xisted in reality. I f the power is unreasonably high, the researcher is also likely to waste resources. In order to avoid this conflict , it is essential to design studies while carefully examining the sample size requirements to a chieve a reasonable power. Starting with very simple designs, the following sections provide formulas to calculate statistical power or the required sample size to achieve reasonable power. Power in Single Level M odels The sample size in a study affects the precision of estimates and statistical power to detect differences among the treatment means. The simplest case to compare two treatment means might be collecting data on two independent samples. When planning this simple design, the following formula can be employed to determine the number of replications if the standard normal distribution test statistic, the normal z statistic for non directional test, is used (Kuehl, 2000)


55 (2.46) where r is the required sample size for each group, is the probability of Type I error, is the targeted power of test, is the standard deviation and D is the size of the difference between two means. This formula indicates that for a simple design, the required sample size is larger when is smaller or the target power is higher. Keeping remaining parameters constant, smaller D requires a larger sample. These basic attributes hold with relatively complex models ; for example a single level ANCOVA model (Spybrook et al. , 2011); (2.47) where Y is a continuous outcome, is the mean, is the treatment effect, is the coefficie nt for the covariate, is the random error, W is the effect coded treatment indicator, X is the grand mean centered covariate, is the between person variation after controlling for the covariate . An F test can b e employed to examine if is significantly different than zero. If the null hypothesis is true, this F statistic follows a central F distribution with 1 and ( N 2 ) degrees of freedom, where N refers to the total sample size. If the nu ll hypothesis is false, the distribution is non central, an approximation of the non centrality parameter is given by Spybrook et al. (2011) (2.48) w ith the aid of a computer , these two F distributions can be utilized to calculate the statistical power. The exact non centrality parameter for Equation 2.47 with two treatments, equal cell frequencies , and X regarded as fixed was provided by Algina and


56 Olejnik(2003). For the case of random covariates, mathematical derivations and num erical examples were provided by Shieh (2007). Power in CRT Calculating statistical power in multilevel studies is relatively more complex compared to single level studies. When designing a multilevel study, the researcher might be motivated to detect si gnificance for a covariate in a specific level, a variance component or an int eraction term; each choice lead to its own power study. For example, in a two level set up, if the interest is a level 1 parameter , the total number of observation s is important to reach a reasonable power; whereas, the total number of clusters are more important for the power of a level 2 parameter. This section provide s power calculations for a CRT design to detect a treatment effect. The simplest model for a balanced two level CRT is; (2.49) where Y is the continuous outcome, i for individual and j for clusters, W is the effect coded treatment indicator, is the grand mean, is the treatment effect, is the random effect at the cluster level, is the random effect at the individual level. Following Raudenbush (1997 ) when cluster sizes are equal; and (2.50) where and is the mean for the experimental and control group respectively, n is the cluster size and J is the number of clusters. The square root of the variance is the standard error of the treatmen t effect, as , the standard error ;


57 however, if the standard error 0. This shows that the number of clusters has a stronger influence on power to detect a treatment effect. When testing whether the treatment effect is zero, the F statistic follows a central distribution with 1 and J 2 degrees of freedom if the null is true. If the null is false distribution is non central with the same degrees of freedom. The non cent rality parameter is; (2.51) This formula is obtained with the raw metric of the dependent variable. Given it is common to report effects in a standardized metric we can modify Equatio n 2.51. In a standardized model, , , and the effect size is denoted by . (2.52) Given power is increased as is increased and keeping J, n and constant, power decrea ses as the ICC approaches to 1. Bloom, Hayes and Black (2007 ) show ed that the required number of cluster s to detect a treatment effect in a CRT can be reduced by covariates and including aggregated pretest scores in the model is as effective as including student level pretests. If a normally distributed cluster level predictor, Z , is included in Equation 2. 49, assuming there is no interaction between the treatment and the covariate, the variance of the main effect of treatment is approximated by ( Raudenbush, 1997) ; (2.53)


58 where is the conditional level 2 variance and it equals to . The quantity refers to the prop ortion of between cluster variance explained by the covariate. Using the standardized metric, the non centrality parameter is approximately (2.54) Power increases as increases, and incre ases as approaches to 1. Including a level 2 covariate costs losing one degree of freedom and F statistic is approximately distributed as F 1 and J 3 degrees of freedom. Hedges and Hedberg ( 2007) provide an approximation to the non centrality parameter for a two level balanced CRT with both level 2 and group centered level 1 covariates which can be written as follows when the number of clusters in each treatment is equal: (2.55) where refers to the proportion of within cluster variance explained by the group mean centered level 1 covariate and M is the total sample size . The F test for the treatment with covariates at both levels foll ows an F distribution with 1 and J q 2 degrees of freedom, q being the number of level 2 covariates. When there is no level 1 covariate, Equation 2. 55 is identical to Equation 2.54 . If and , ; however, If and , . W hen comparing the use of just a level 1 covariate to just a level 2 covariate, the same for either covariate, a nd with an exception of extreme cases ( )) , cluster level


59 variables are more effective than group centered level 1 variables in terms of power gain. This issue is discussed in detail by Konstantopoulos ( 2012) . The author use d a similar approach and concluded that when and , including just a level 2 covariate will have a larger impact on power when compared to i ncluding just a level 1 covariate with the same . In other words, the author concludes that if and level 1 variables might have a larger impact on power. A balanced design requires clus ter sizes to be equal, otherwise the harmonic mean of n should be used (Konstantopoulos, 2010; Raudenbush, 1997) . Also in all equations, is cal culated by assuming any covariate is uncorrelated with treatment. It is important to note that Kreft and Leeuw (1998, section 5.4) concluded maximum likelihood procedures give higher power for fixed effects compared to the OLS due t o its efficiency. Moreover , all calculations assume that assumptions are not violated. E ach type of violation would lead to a different power curve. Overall, in a CRT, power to detect the treatment effect is generally more influenced by the number of clust ers rather than the cluster size and more influenced by cluster level covariates rather than the group mean cente red level 1 covariates. As shown above, power increases as ICC decreases and as variance reduction by the covariates increases. Of course, as i n single level analyses, power increases as alpha increases and as the effect size increases. When designing a CRT, the researcher is expected to provide realistic parameters, an intelligent guesswork, for the effect size, ICC and the variance reduction by the covariates; Bloom , Hayes and Black (2007 ), Hedges and Hedberg (2007) might be helpful to provide realistic guesses for


60 intervention studies or pilot studies in educational settings. Equations in this section did not address the optimization of budget use based on sampling costs at each level. R eaders are referred to Raudenbush (1997) and Moerbeek, Breukelen, and Berger (2001) for additional information on this issue . Equations also did n ot address more than two levels, r eaders are referred to Konstantopoulos (2008,2009) ,and Schochet (2005) . Moreover the equations imply that the covariates are included in the analysis. Possible alt ernatives, blocking and matching are discussed in Bloom (2005 ) and Raudenbush and Spybrook (2007) . Power in multisite randomized trials ha s studied by Raudenbush and Liu (2000) and its comparison to CRT is studied by Moerbeek (2005) . Power in multilevel models with binary outcome is studied by Moerbeek and Wong (2008) and Paccagnella (2011) . One last limitation, this section is focused on statistical power to detect a fixed treatment effect, power to detect variance or cross level interactions a re not addressed. A review of previous simulation studies on power in multilevel models is provided by Scherbaum and Ferreter (2008) . Software for Power and Sample Calculations Studies should be designed wisely to detect any intervention effect when it exists. Before data are collected, statistical power tools accompanied with intelligent guesswork for parameters can provide a great amount of guidanc e. In designing a CRT, several materials are available to calculate the sufficient sample size for a targeted statistical power. This section aims to provide a brief review of some of these materials, specifically their use for CRT. Optimal Design (Raudenb ush, et al., 2011) is a user friendly software that can handle single, two or three level designs. For a two level CRT design with person level endpoint, Optimal Design can handle continuous and binary outcome s , budget


61 optimization and blocking for a rando m intercepts model. Optimal Design allows comparing two different treatment conditions with a restriction of equal number of clusters for each condition. For a two level CRT with continuous outcome and no budgetary and blocking considerations, the user can manipulate target power, ICC, number of clusters, cluster size, effect size, Type I error rate , and variance reduction by a single level 2 covariate. The software produces a graphical output. PowerUp! (Dong & Maynard, 2013) is another user friendly tool to compute minimum detectable effect size or minimum required sample size for single, two, three or four level models. The tool consists of Excel spreadshee ts for each different experiment al or quasi experimental design . For a two level CRT with continuous outcome and two treatment conditions in a random intercepts model, the user can manipulate, target power, ICC, cluster size, number of clusters or effect s ize, Type I error rate, proportion of clusters randomized to treatment, number of level 2 predictors, proportion of variance explained by level 1 and level 2 covariates. The tool produces a single numeric outcome, required number of clusters or minimally d etectable effect size, the smallest effect size that can be detected with a target level for power. PINT (Bosker, Snijders, & Guldemond, 2003) is a program for estimating standard errors in a two level regression set up. This program is relatively complex compared to above mentioned tools. However, the user c an input parameters for a random slopes model and estimate standard errors for regression coefficients of level 1, level 2 and cross level interaction predictors. This program is based on Snijders and Bosker (1993) in which matrix equations are derived using GLS estimator (equivalent to ML if multivariate normality is assumed). The authors suggest having at least 10


62 clusters and at least 6 participants in each cluster in order to achieve accurate estimates. The program can also accommodate monetary restrictions. The user can manipulate the number of covaria tes, residual/ correlation matrices and covariate means. The output is standard error estimates for the given sample size combinations with or without monetary considerations. Standard errors can be converted to statistical power using a z table. The progr am requires conditional error variances compared to Optimal Design and Power Up tools. However, for a random intercept model, required inputs for these two tools can easily be transformed into a PINT input. MLPowSim (Browne, Lahi, & Parker, 2009) is a tool that can create R scripts or MLwiN macros to simulate data sets and examine statistical power for single or multilevel models. This brief review is limited to the generated R scripts. However, MLwiN macros can offer a larger variety of estimation approach choices, including Markov chain Monte Carlo (MCMC) estimation. MLPow Sim tool can provide R code to simulate single, two or three level balanced or unbalanced data with continuous, binary or count outcomes. Assuming that the provided R script is executed without any modification, the user can obtain statistical power estima tes based on the proportion of rejected null or based on the average standard error estimates from the simulated data sets. In order to obtain the R script for a balanced two level random intercepts CRT with a continuous endpoint, the user should input sig nificance level, number of simulations, REML or ML to analyze simulated data sets, the number of level 1 and level 2 covariates along with their mean, variance and regression coefficient at each level, and error structure (conditional on predictors). The output contains statistical power


63 estimates for the given sample size combinations. The manual includes an additional R script to create graphs based on the output. This review section is limited to these four different tools. However , the literature indi cates there are several other tools or approaches to calculate power for a multilevel design. For example, Power and Precision (Borenstein & Cohen, 2001) is a proprietary software that can handle power and s ample analysis for a CRT. An R package PAMM (Martin, 2012) includes simulation functions to study statistical power of random effects in mixed models. A Monte Carlo study with M plus software can be designed to conduct power analysis. For non computer based approaches, there are several tables and sample size calculation formulas available (Hayes & Bennett, 1999; Konstantopoulos , 2009; Moerbeek & Wong, 2008) . Research Questions The preceding sections introduce d CRT designs with a focus on power calculations to detect the treatment effect. These designs are common to study efficacy of treatments . These designs generally require a larger sample t o successfully detect a treatment effect when compared to random assignment of individuals. Given the possible high costs of increasing the number of clusters, including covariates in these designs has a particular importance to increase statistical power. In a two level CRT design, the researcher can measure covariates at two different levels. When testing whether a treatment effect exists; as it is shown in Chapter 2 and discussed by Konstantopoulos ( 2012) , level 2 predictors are more effective to increase statistical power compared to level 1 predictors. A level 2 covariate might be measured at its own level, for example school budget, or it can be


64 manufactured using a level 1 covariate. Aggregating a level 1 variable is one of the common manufacturing processes. Ludtke et al. (2008) d iscussed two different types of aggregation , formative and reflective. I n a two level set up, for formative aggregation it is assumed that lev el 1 covariates define leve l 2 characteristics; whereas reflective aggregation assumes that an underlying structural model includes an arrow from the level 2 construct to the level 1 covariate. The authors investigated the consequences of using the arithme tic mean or the latent mean of the level 1 covariate for these two different types of aggregation. I refer to a model with formative aggregation as an observed mean model and a model with reflective aggregation as a latent mean model . It is of interest to investigate if these two models provide unbiased estimates of the treatment effect for a CRT; and, if they both provide unbiased estimates, it is also, and mainly, of interest to investigate the performance of these two models in terms of statistical power and the Type I error rate. In these two methods, the aggregated mean of the level 1 covariate is included as a level 2 predictor. When comparing OMM to LMM, the effect of level 1 deviation scores on power is also of interest. The effect might be inconseq uential for OMM . Power Up software provides similar number for a level 2 sample size for a given effect size when the level 1 covariate is included and excluded. For example, for a two tailed test targeting .8 power with an ICC of .2, class size of 20, exp lained variance of 25% at each level, the required number of clusters to detect an effect size of .3 is 65 when the level 1 deviation scores are included and the required number of clusters is 69 when the level 1 deviation scores are excluded. Hence, the e ffect of including or excluding the


65 level 1 deviation scores when evaluating the performance difference between OMM and LMM constitutes the second research question. Overall a comparison of four different models, OMM with a treatment variable and covariat es at levels 1 and 2 (OMML1L2), OMM with a treatment variable, a level 2 covariate but without the level 1 deviation score (OMML2), LMM with a treatment variable and covariates at levels 1 and 2 (LMML1L2) and LMM with a treatment variable, a level 2 covari ate but without the level 1 deviation scores (LMML2) to analyze the simulated data sets is of interest. In conclusion, the CRTs are important study designs and this research has the potential to inform the selection of methods of analysis for two level CRT such that the methods can be selected in light of the aims of minimizing bias, enhancing control of Type I error rates, and achieving the greatest power possible for the investment.


66 CHAPTER 3 SIMULATION STUDY Methodology The simulation study in thi s diss ertation research utilized a Monte Carlo simulation approach to compare performance of four multilevel methods for analyzing data collected in a CRT design. These data sets were created using R software ( R Core Team, 2013) and analyzed with M plus 7 (Muthen & Muthen, 1998 2012). The simulation design consists of a total of 3 456 conditions and each condition was replicated 1000 times. The four models compared in the study were OMML1L2, OMML2, LMML1L2 and LMML2 In this chapter I introduc e (a) the population mo del, (b ) simulation conditio ns and population parameters, (c ) data generation and analysis of the simulated data, and (d ) analysis of outputs. Population Model T he continuous dependent variable , , and the continuous independent var iable, , are assumed to be standard normal( ). Hence , for each variable , , resulting in . According to Snijders and Bosker (2012 ) a level 1 covariate can be decompose d as ; (3.1a) (3.1b) T he dependent variable can be similarly decomposed: (3.2a)


67 (3.2b) In this two level set up, there are two relations hips between and : the correlation between the cluster means and the co rrelation between the deviation variables for individuals . T hese correlations are de noted by (3.3) (3.4) T he classic linear regression equations for two variables are (3.5) Given that s and s are independent, Equation 3.5 can be applied separately to derive between cluster and within cluster l evel relations hips of and : (3.6a) (3.6b)


68 (3.7a) (3.7b) Using the notation from Equation 2.43, and and therefore, using Equations 3.6 and 3.7 yields (3.8a) (3.8b) (3.9a) (3.9b) Thus , the coefficients and can be determined from and . and were manipulated factors in the simulation. Equation 2.43 does not include a treatment effect. H owever, with a realistic assumption of no dependency between the treatment assignment and the level 1 covariate, ad ding a treatment effect yields E quation 2.45. Thus , the theoretical results reported by Snidjers and Bosker


69 ( 2012 ) can be used as the basis f or simulating data based on the LMM model in Equation 2.45 . I used this approach in the data simulation process. Simulated Conditions and Population Parameters As noted in the preceding section, the data simulation was based on a latent mean approach, repe ating Equation 2.45; (2.45 repeated) The design for investigating power has 2304 conditions arranged as a 4 ( L evel 1 S ample S ize) x 4 ( L evel 2 S ample S ize) x 2 ( ) x 1 ( ) x 3 ( ) x 3 ( ) x 4 (Missing Data Pattern ) x 2 (Target Power) factorial design . The levels of the factors are 3,6,10,20 for the level 1 sample size; 40,60, 80,140 for the level 2 sample size; .20 and .30 for ; .20 for ;. 10, .25 and .50 for each of and , 10% missing data at each level, at level1 only, at level 2 only and no missing data at both levels for missingness; .70, and .90 for target pow er for the model in Equation 2.42 with as the level 2 covariate. An additional 1152 conditions obtained by crossing the levels of the first seven factors and setting the treatment effect to zero is used to investigate Type I error rate. Each condition is replicated 1000 times. All 3456 conditions are used to estimate the bias of the treatment effect estimate. Mass and Hox ( 2005) investigated the effect of level 1 (5, 30, and 50) and level 2 (30, 50, and 100) sample sizes on the results for a model that included a level 1 variable, a level 2 variable, and a cross level interaction. They concluded that estimators of fixed effects, variance components, and standard errors of fixed effects are unbiased regardless of the sample size combination, but the standard errors of the level 2 variance components are too small when the level 2 sample size is substantially


70 lower th an 100. Bell, Ferron, and Kromery ( 2008) and Bell and Morgan ( 2010) found similar results. I have also searched research g rants and contracts funded by Institute of Education Sciences to identify realistic sample sizes in CRT studies . A total of 357 abstracts emerged when was searched. Among these 357 abstracts 131 reported the planned level 2 sampl e size. Sample sizes equal to 32, 48, 80 and 140 correspond to the 20th, 40th, 60th and 80th per centile respectively of the 131 sample sizes . Planned level 1 sample size is reported in 85 abstracts, 30% of these values were smaller than 6, where the median was 10 and the 3rd quartile was 20. Table 3 1 shows the average sample size at each level for different goal types. Sample size values of 40, 60, 80 and 140 for l evel 2 and 3,6,10, 20 for level 1 were selected in light of this review. Table 3 1: Average s ample size s for reviewed IES grants by goal type Goal Level 1 Level 2 N Average N Average Development 17 10.47 22 111.96 Efficacy and Replication 43 31.58 71 216.00 Exploration 3 121.33 3 70.00 Measurement 1 15.00 2 262.46 No Goal 5 13.59 11 69.60 R&D Center 4 11.00 5 321.14 Scale Up Evaluations 6 12.83 7 111.96 Not identified 6 8.83 10 111.40 N: number of available data points. Hedges and Hedberg (2007) reported that in the Early Childhood Longitudinal S tudy Kindergarten Cohort, the ICCs for reading and mathematics in grade K were .23 and .24, respectively. Results for data collected by Snyder et al. (2007) indicated that the average ICC across nine constructs was also .22. The values of and in the design (.20 and .30) were selected in light of these results.


71 The values and in the design were chosen to represent small, moderate and strong relationships. Bloom, Hayes and Black (2007) reported and values for pretest posttest relations for a sample including approximately 28000 students from 450 elementary schools in five different districts. District level aggregated and values were ranged between .12 and .78. The values in the design (.10, .25 and .50) were selected in light of these results. The purpose of the m issing data condition is to evaluate the effect of lack of balance in group sizes on bias, power and Type I error rates. Missing data were MCAR. Three missing data conditions were created by randomly selecting the 10% of the cases (a) at both levels (5% each level), (b) at level 1 only, and (c) at level 2 only and setting the dependent variable score to missing for the selected cases. When the dependent variable is missing, LMM and OMM methods delete the entire case by default. Thus, four levels of missing data condition respectively corresponds to an unbalanced design at both levels, meaning that the dependent variable might be missing for an entire cluster or students within a cluster ; an unbalanced design at level 1 only, meaning missing data occur only at student level ; an unbalanced design at level 2 only, meaning missing data occur only at cluste r level; or a balanced design, meaning there is no missing data. Black, Harel and McCoach (2011) conducted a simulation study examining missing data techniques in multilevel data. They compared performance of different techniques with multilevel data when values are missing at random on the dependent variable at rates 10%, 30 % and 50%. T heir results indicated that missing data techniques perform satisfactory with 10% missing data.


72 Target power was either .7 or .9 in all condition s . T he minimum detectable effect size (MDES) (Bloom, 2005 ) for any condition used to investigate power is the treatment effect that can be detected with a specified ta rget power. Given Equation 2.42 with as the level 2 covariate, in this study, t he effect si ze for each condition used to investigate power was the MDES for that condition. Due to the fact that our dependent variable and covariate was scaled to have standard deviation 1, the MDESs are on a z score scale. The level 2 sample size values in conjunct ion with the values for target power result in MDESs in the ranges .14 to .59, and .19 to .80 when target power is .70 and .90, respectively. Thus , the condition combinations used to investigate power, but particularly, the level 2 sample sizes, result in MDESs that seem likely to occur in practice. Manipulating target power rather than effect size allows investigating how well the LMM L2 performs in terms of achieving target power and also whether the other models are better or worse choices. Data Generati on The d ata simulation model and study conditions were presented in the preceding sections. The data generation process is explained in this section. Data generation was based on Equation 2.45 which is repeated here: (2.45 repeated) The inputs to generate the data sets were level 2 sample size, level 1 sample size, , , target power, and . The f ollowing steps were used to generate the data 1. Calcu late between coefficient, , based on Equation 3.9a 2. Calculate within coefficient, , based on Equation 3.8a


73 3. Calculate based on Equation 2.42 using Equation 2.54 for a one tailed test. 4. Simu late 5. Simulate based on Equation 3.9b 6. Calculate 7. Randomly assign binary treatment indicator, , each condition gets exactly J/2 clusters. 8. Simulate 9. Simulate based on Equation 3.8b 10. Calculate 11. Calculate 12. Calculate After creating the dependent variable , for each individual in each cluster, missi ngness condition was created by (a) saving all cases for the non missing data, (b ) randomly select ing 10 % of clusters and deleting the outcome for the missing at level 2 conditions, (c ) randomly select ing 10 % of the full sample and deleting the outcome for the missing at level 1 conditions, (d ) randomly select ing 5 % of the clusters and deleting the outcome and then randomly select ing 5 % if the individuals without missing scores and deleting the outcome for the missing at both levels condition s. Each data set included five columns for each missingness condition, a column for the dependent variables, independent variable, arithmetic average of the independent variable for each cluster, treatment indicator, and cluster identification.


74 Analysi s of Simulated Data The data were generated in R software which was also u tilized to call M plus 7 to estimate four different multilevel models with ML estimation method. In this dissertation study , the focus is on the estimate of a level 2 fixed effect (tr eatment), and its standard error; both can be biased if the assumptions of normally distributed errors or homogeneity of variance are violated. However, in the data simulation process neither non normality nor heterogeneous variances were included . Hox, Ma as and Brinkhuis ( 2010) has shown that when the nor mality and homogeneity assumptions are met, ML works as well as or better than robust ML. Therefore, robust ML was not included in the study. The four models, OMML2L1, OMML2, LMML2L1 and LMML2 were executed for all 1000 iterations of all 3456 study conditi ons. The M plus syntaxes for the four mo dels a re provided in Appendix B . Each run produced parameter estimates and the accompanied standard errors. Analysis Simulation outputs are analyzed to investigate the convergence rate, treatment effect bias, power and Type I error rate. M plus 7 provides information whether the estimation terminated normally for each iteration. This information is used for calculating the convergence rate. Moreover, the software also provides an iteration id, this feature enables th e user to find the iteration with non convergence and replace the relevant row with missing indicators. Parameter bias was calculated by using where is the mean estimate across all replications of a conditio n. I also examined accuracy of


75 interval estimation by using the coverage of the 95% confidence interval (CI); . In order to investigate the effects of the factors on power for the 2304 non zero treatment effect conditions, was tested by using and as the one tail critical value. The same strategy is followed for Type I error rates. The degrees of freedom to test the treatment effect with one explanatory variable a t level two , ( N 3) was chosen based on Snijders and Bosker (2012, p. 95) The variable was used in a mixed ANOVA to investigate the effect of the manipulated factors on bias. The factors of the simulation design were the between subje cts factors and the analysis model was the within subjects factor. Generalized eta squared (Olejnik & Algina,2003) and proportion of effect variance (PEV) for each factor and interaction were calculated as effect size measures. In an ANOVA set up, the total sum of squares (SS) can be divided into two parts as where (3.10 ) where k is the total number of effects. PEV was calculated by; (3.11) as the relative siz e of an effect. I attempted to use a repeated measures logistic regression model for manipulated factors and an alysis method on power and Type I error rate. However, estimation would not converge. Instead, I used the ANOVA model described above and calcula ted PEV for each effect.


76 CHAPTER 4 RESULTS Convergence Rates Convergence problems were minimal. In 91% of the simulated conditions, estimation converged for all replications. In the remaining, the minimum convergence rate was 99.5%. Bias of the Between C oefficient Ludtke et al. (2008) compared LMM and OMM in terms of bias of the between coefficient, that is of the coefficients of and respectively. These authors also provided a formula to calculate this bias. The estimated between coefficient ( ) bias in our simulated conditions were in perfect agreement with their results. For example, a study condition with 1 000 replications in which J = 140, n = 6, =.3, =.2, =.5, = .25 , target power =.7, and no missing values, the average bias for was .160. This bias was calculated to be .159 by the formula. Treatment Effec t Table 4 1 shows minimum and maximum bias by analysis methods as well as the quartiles of the bias distributions. Mean bias was equal to zero for all for models. Among all conditions, bias ranged from .033 to .021 with an approximately normal distributi on of mean 0 for each method. Overall these results indicate that bias tends to be small. A mixed ANOVA with seven between subjects factors and one within subjects factor resulted in 255 effects. A total of 112 effects were significant. Table 4 2 reports F , p and partial eta square for effects with PEV larger than .05. There are two complex


77 interactions that explained more than 5% of the total effect variance for bias. However, these two interactions did not include the method effect, indicating methods pe rformed equally well. Especially OMML1L2 and OMML2 produced almost identical treatment effect estimates, with differences occurring after the third decimal place. Table 4 1 Treatment bias summary by analysis method Percentiles Model Minimum 25th 50t h 75th Maximum LMML1L2 .033 .003 0 .003 .021 LMML2 .022 .003 0 .003 .020 OMML1L2 .022 .003 0 .003 .019 OMML2 .022 .003 0 .003 .019 Table 4 2 Summary of Selected Effects on treatment bias Source PEV F value p value Eta squared J x n x x Target x x .085 4.55 <.0001 <.0001 J x n x Target x x .074 3.97 <.0001 <.0001 J : Number of clusters, n : cluster size, : b etween association, :within association Out of 3456 conditions, 2304 included a non zero treatment effect. Table 4 3 shows relative bias by analysis method. Among all non zero treatment conditions and analysis method, relative bias ranged from .048 to .049 with a mean of 0. Following Hoogland and Boomsma (1998), relative bias was considered acceptable if the value was smaller than 0.05.Overall t hese results indicate that bias wa s acceptable. Table 4 3 Relative bias of non zero tre atment effects by analysis method Percentiles Model Minimum 25th 50th 75th Maximum LMML1L2 .048 .009 0 .009 .043 LMML2 .044 .009 0 .009 .049 OMML1L2 .044 .008 0 .008 .045 OMML2 .044 .008 0 .008 .044


78 Table 4 4 shows the summary of covera ge rates for all conditions by analysis method. Mean coverage rates were equal to the median coverage rates. Among all conditions , the mean and median coverage rates were equal and .943, .938, .944 and .945 for LMML1L2, LMML2, OMML1L2 and OMML2, respectiv ely. Coverage rates within .925 and .975 can be considered acceptable (Bradley, 1978). The mixed ANOVA model revealed that the method effect was the only effect with PEV larger than .05. Results indicate that the small coverage rates tended to occur when L MML2 was used. Approximately 14% of the conditions had a coverage rate below .925. Table 4 4 Coverage rates for the treatment population parameter Percentiles Model Minimum 25th 50th 75th Maximum LMML1L2 .909 .938 .943 .949 .970 LMML2 .886 .930 .9 38 .946 .990 OMML1L2 .912 .939 .944 .949 .974 OMML2 .912 .939 .945 .950 .972 Type I E rror A total of 1152 out of 3456 study conditions were designed with no treatment effe ct in order to examine the Type I error rates. Table 4 5 shows the summary of Typ e I error rate for all conditions by methods. On average, .054, .059, .053 and .053 for LMML1L2, LMML2, OMML1 L2 and OMML2 respectively, Type I error rates were acceptable (i.e., between .025 and .075). This is important because power would be misleading if the actual Type I error rate was not near the nominal Type I error rate. There was only one effect with a PEV larger than .05, a five way interaction between J , n , , and explained 8% of t he total effect variance. However, this interaction


79 did not include the method effect, possibly indicating that methods performed equally well. Table 4 5 Type I summary by analysis method Percentile Model Minimum 25th 50th 75th Maximum LMML1L2 .033 .04 9 .054 .059 .078 LMML2 .019 .052 .059 .066 .094 OMML1L2 .033 .049 .053 .059 .080 OMML2 .033 .048 .053 .058 .080 Power A total of 2304 out of 3456 study conditions included a non zero treatment effect to achieve a target power of .7 or .9. PEV revealed that 86% of the variance due to effects wa s explained by the target power factor. Table 4 6 shows the summary of statistical power for all conditions by methods and the target. For conditions with target power = .7, the mean statistical power was .722, .7 07, .742 and .740 for LMML1L2, LMML2, OMML1L2 and OMML2 respectively. With target power = .9, the mean statistical power was .902, .893, .920, .919 for LMML1L2, LMML2, OMML1L2 and OMML2 respectively. Overall, in terms of power, OMML1L2 and OMML2 performed very similarly and better than the latent mean models. Further, excluding the level 1 covariate from the latent mean model caused only a slight decrease in power. Because PEV for target power was so large, it may have obscured smaller effects. To address t his possibility, a mixed ANOVA was conducted separately for conditions with target power = .7 and target power =.9. Results are presented in Table 4 7 . The ANOVA models produced consistent findings for the two level s of target power. Further, , method, missing data pattern, and method by cluster size ( n ) effects


80 produced PEV larger than .05. Specifically, more than 30% of the total effect variance is explained by the effect of , each of the remaining three e ffects explained roughly 10%. Table 4 6 Summary of statistical power by target and analysis method Percentiles Target Power Model Minimum 25th 50th 75th Maximum .7 LMML1L2 .577 .696 .722 .748 .870 LMML2 .581 .684 .711 .733 .796 OMML1L2 .646 .715 .737 .764 .882 OMML2 .646 .714 .736 .764 .869 .9 LMML1L2 .763 .886 .906 .925 .978 LMML2 .781 .878 .898 .913 .958 OMML1L2 .848 .903 .919 .935 .990 OMML2 .848 .903 .919 .935 .987 Table 4 7 Summary of Selected Effects on Power Source PEV F value p value Eta squared Target=.7 .36 2065 <.0001 .003 Method .12 10139 <.0001 .001 Missing .12 441 <.0001 .001 Method x n .09 2476 <.0001 .001 Target=.9 .31 2287 <.0001 .003 Method .15 11 950 <.0001 .002 Missing .10 467 <.0001 .001 Method x n .11 2947 <.0001 .001 :within association Table 4 8 shows the average power by analysis method and As expected, an increase in is associated with an increase in statistical power independent of the analysis method.


81 Table 4 8 Average power by method and Model .10 .25 .50 Target =.7 LMML1L2 .690 .715 .759 LMML2 .686 .704 .732 OMML1L2 .712 .736 .777 OMML2 .711 .736 .774 Target =.9 LMML1L2 .880 .901 .925 LMML2 .874 .893 .912 OMML1L2 .900 .919 .940 OMML2 .900 .919 .939 :within association Table 4 9 shows the average power by an alysis method and missing data pattern. As expected, the highest power is achi eved when there is no missing for the dependent variable. The greatest power loss occurred when data are missing due to only deleted clusters. The smallest power loss occurred w hen data are missing due t o only deleted individuals. Power loss between the two extremes occurred when data are missing due to deleted cluster and individuals. Table 4 9 Average power by method and missing data pattern Model Missing Data Pattern No mi ssing Missing only at L1 Missing only at L2 Missing at both Target =.7 LMML1L2 .743 .725 .703 .716 LMML2 .729 .711 .689 .701 OMML1L2 .762 .743 .725 .736 OMML2 .761 .742 .724 .734 Target =.9 LMML1L2 .917 .904 .890 .898 LMML2 .908 .895 .881 .888 OMML1L2 .932 .922 .909 .915 OMML2 .932 .922 .909 .915


82 Table 4 10 shows the average power by method and cluster size. The results indicated that, OMML1L2 and OMML2 performed very similarly and better than the latent mean models for all cluster sizes. However, when using latent mean models, excluding the level 1 covariate was associated with a decrease in power except whe n the cluster size is 20. When the cluster size is 20, the difference between the four methods was negligible. Table 4 10 Aver age power by method and cluster size Model Cluster Size 3 6 10 20 Target =.7 LMML1L2 .703 .728 .731 .725 LMML2 .678 .708 .721 .723 OMML1L2 .757 .744 .737 .728 OMML2 .753 .744 .737 .728 Target =.9 LMML1L2 .877 .910 .912 .909 LMML2 . 868 .893 .904 .907 OMML1L2 .927 .923 .918 .911 OMML2 .926 .923 .918 .911


83 CHAPTER 5 DISCUSSION AND CONCLUSION I investigated four methods to detect a fixed treatment effect in a CRT design i n terms of bias, power and Type I error rate. The four method s are the result of two factors (a) whether the level two covariate was an observed mean or latent mean of a level 1 covariate and (b) whether a level 1 covariate was included in the model. General Discussions All four methods performed equally well in te rms of bias of the treatment effect. However, LMML2 performed less well than the other methods in terms of coverage rate. A ll four methods produced some coverage rates below the lower limit of .925, for LMML2 coverage rates below .925 were found for 14% of th e simulation conditions , indicating that either the standard errors were too small or the distribution of did not follow a t distribution. Further inspection revealed that the 86% of the liberal coverage rates (rates below .925) o ccurred when n = 3 or 6. The remaining three methods also produced coverage rates below the lower limit in approximately 1% of the conditions. These generally occurred when J = 40. Coverage rates above the upper limit occurred only with LMML2, 2% of the c onditions and almost all occurred with n = 3. OMM methods and LMML1L2 produced mainly accurate estimates for the fixed treatment effect even with small sample sizes in the simulation study. This finding agrees with the results reported by Bell, Ferron, and Kromery (2008) and Maas and Hox (2004). A new finding was LMML2 was associated with accur acy problems even with larger number of clusters. The results indicated that once target power was controlled has the largest effect, among the factors, on power. This finding is likely due to the difference between


84 the model used to generate the data and the model used to determine the coefficient for the treatment effect. The former model included both a level 1 and level 2 covariate and the latter model included only a level 2 covariate. Thus once target power was controlled in the analyses, it would be expected that which varied from .10 to .50, would have a large effect on power. As expected , power was almost identical for the two OMM methods (see Table 4 6 ). Examination of results for the two OMM metho ds with balanced data shows that both methods result in the same treatment effect estimate and, when OMML2 results in a non zero estimated variance component for level 2, the same standard error. When OMML2 results in a null estimated variance component fo r level 2, the standard error tends to be slightly smaller for OMML1L2 than for OMML2. Power was similar for the two LMM methods, but tended to be slightly larger for LMML1L2. Moreover, OMML2, OMML1L2, and LMML1L2 worked slightly better than expected becau se the average power with these models was slightly larger than the target power. The results showed that OMM methods were slightly more powerful than LMM methods. Inspection of the standard errors of the treatment effect revealed that on average LMM meth ods produced approximately 1.06 times larger standard errors. Given that the all methods resulted in small or no bias for the treatment effect, the larger standard errors are the likely source of the difference in power. These results are consistent with Ludtke et al. (2008) who stated that the estimates of the between coefficient in the LMM approach are more variable than those of the OMM approach. The results indicated that missing data results in a decrease in power with the largest decrease when all mi ssing data is due cluster dropping out of the study. It has


85 been shown in Equation 2.50 that level 2 sample size ( J ) has a larger impact than level 1 sample size ( n ) has on the sampling variance for the treatment effect estimator (Raudenbush, 1997) . This s imulation study included four different level 2 sample sizes; 40, 60, 80 and 140; 80 on average and four level 1 sample sizes; 3, 6, 10, 20; 10 on average. When the 10% of the cases were randomly del eted via clusters, the average J was 72 and average n was 10. When the cases were delete d via individuals, the average J was 80 and average n was 9. When the cases were deleted via both clusters and individuals, averag e J was 76 and n was 9.5. The results also showed that the missing data pattern did not interac t with analysis method choice for either bias or power. The effect of method by n interaction was also interpretable. Results showed that when n =20, all four methods performed equally well in terms of statistical power. This is not surprising given the rel iability of the level 2 covariate is high, .83. When the reliability (hence the weight in Equation 2.40) is high, the adjust ment on the arithmetic mean has relatively little effect (Croon & van Veldhoven, 2007 ) . In terms of Type I error rate, all four m ethods performed acceptably. For OMML1L2, OMML2 and LMML1L2 proportion of conditions with unacceptable Type I error rate was smaller than .01%. For LMML2, only 6% of the 1152 conditions produced unacceptable Type I error rates; tracing the coverage rate s, almost all occurred when n = 3 or 6. Difference due to the analysis method was not detected. Specific Discussions The p receding section aimed to provide a general discussion of the simulation study findings and some of these findings can be discussed in more detail. The


86 following sections aim to provide a bett er understanding of (a ) the effect of level 1 deviation scores, (b ) the effect of latent mean approximation , and (c ) the effect of missingness on statistical power. Further, I provide empirical da ta analyses to demonstrate the conclusions of the simulation study. The Effect of Level 1 Deviation Scores on Statistical Power Including covariates in a CRT design to achieve more statistical power can be more feasible when compared to increasing the sam ple sizes. Assuming a two level CRT with a student level covariate, the multilevel model can be constructed using an uncentered or grand mean centered level 1 covariate. In such case, an appropriate level 1 covariate is expected to reduce the error varianc e at both levels. In fact, when examining a treatment effect on student achievement, it is not unrealistic to encounter a student level pretest score that explains up to 50% of the variance at each level (Bloom, Hayes & Black, 2007). However, as noted by S nijders and Bosker (2012) ,the within group relation of a level 1 covariate with the outcome can be completely different from between group relation. Hence, as suggested by Raudenbush and Bryk (2002, p . 143) and concurred by Hedges and Hedberg (2007) and Mo erbeek (2006b) it is good practice to use both components of a covariate by using level 1 deviation scores as the level 1 predictor and group mean as the level 2 predictor. Thus, each predictor supposedly explains variance only at the corresponding level. As shown by Raudenbush (1997) , the reduced variance , in turn , is expected to reduce the variance component of the treatment effect, thus to increase the statistical power. The increase in power might be neutralized if losing 1 degree of freedom is also inf luential; however having such small number s of clusters is not good practice for a multilevel model. As showed in the theoretical framework section of this dissertation, variance reduction at


87 level 2, hence the effect of a level 2 predictor is likely to be more influential to increase power. Of course, the influence will vary as a function of ICC and sample size given a constant effect size. Konstantopoulos ( 2012) concluded tha t when and , including just a level 2 covariate will have a larger impact on power when compared to including just a level 1 covariate with the same . In the present study, the effect of adding the deviation score into the model that has already included the aggregated score was investigated. The Optimal D esign software does not allow including a level 1 covariate when conducting power analyses for a two level CRT design. However, the in fluence of a level 1 deviation score on statistical power via non centrality parameter for a two tailed t test was approximated by Hedges and Hedberg (2007). This approximation is provided in Equation 2.55. The authors define and , where the numera tor is unconditional error variance minus condi tional error variance. Figure 5 1 depicts the approximated increase in power as a function of and when the number of clust er s is 50, cluster size is 5, ICC is .2 and delta is .3. The blue line shows the increase in power as increases while is kept constant at .2, and the red line shows vice versa. Contradictory to this approximat ion, the simulation study of this dissertation did not provide any evidence to support the claim that including level 1 deviation scores can increase the statistical power. Empirical analyses showed that including the level 1 deviation score can cause an i ncrease in the level 2 variance component. This issue is explained by Snijd ers and Bosker (1994); the authors stated that the decrease in the


88 level 1 variance component must be balanced by an in crease in level 2 variance given the unexplained group variabi lity remains the same after the inclusion of the deviation scores. This behavior is seen for both ML and REML estimators. Approximations of what is estimated with RE ML based on Snijders and Bosker (1994) is provided in Appendix A. Figure 5 1 The approxi mated increase in power as a function of and In short, the non centrality parameter approximation in Equation 2.55 should be used with caution given it is not likely to keep constant while adding the deviation scores into the model. The influence of the deviation score on statistical power varies as a function of cluster size and strength of the within level relation. Keeping the strength of the within relation constant, the positive i nfluence of the deviation score on statistical power via reduction in level 1 variance is stronger with small sample sizes, at the same time, the negative influence via the increased level 2 variance component is also


89 stronger and neutralizes the total eff ect on power. When the cluster size is large, (i.e. 50), the positive and negative influence of the deviation score on statistical power is almost negligible. The s e conclusion s might imply that the Optimal D esign is a sufficient program when designing a tw o level CRT even though it does not allow the inclusion of level 1 covariates . The Effect of Latent Mean on Statistical Power Using a latent mean approximation ( ) instead of the arithmetic mean ( ) has been s hown to reduce the bias of ( Ludtke et al., 2008; Croon & van Veldhoven, 2007 ), and its effect on statistical power to detect a fixed treatment effect was one of the research questions in this dissertation. As explained in the gener al discussion section, even though LMML1L2 method was slightly less powerful compared to OMM methods, the difference can be considered minor. As an attempt to understand the effect of latent mean approximation, I selected two conditions from the main study and applied the two stage adjustment procedure (Croon & van Veldhoven, 2007 ). As noted earlier, this procedure might not be as efficient as the maximum likelihood procedure utilized by M plus . However, computation of the two stage adjustment procedure is relatively simple, especially when the treatment indicator is not correlated with the predictor. Please note that this assumption is made for the simulation study given random assignment is expected to be orthogonal to observed covariates in practice. I compared t he performance of the two stage adjustment procedure to the latent covariate procedure on two of the selected simulation conditions. These conditions were selected because they produced the largest bias w hen there is no mis sing data w ith a non zero treatment effect. T he largest positive bias occurred with J =140, n=3,


90 , target=.9, and . Average bias across 1000 repl ication was J =80, n=3, , target=.9, and . Average bias across 1000 replication was .329, by .326. One single dataset was simulated for each combination and the datasets were analyzed by M plus using LMML1L2, OMML1L2 and OMML1L2 with two stage ad justment (OMML1L2 2S). Table 5 1 shows the estimates , for example, when the positive b ias case was analyzed with three different procedures, estimates of the variance components and the within level coefficient were identical. Estimates for the between coefficient were .580, .578 and .634 for LMML1L2, OMML1L2 2S and OMML1L2 respectively. Ta ble 5 1 indicates that OMML1L2 2S can produce estimates that are roughly identical to LMML1L2. This is expected for a balanced data set without missing data. Based on this development, the effect of the latent mean on statistical power to detect the treat ment effect can be investigated by studying the two stage adjustment procedure. The o bserved vs. latent mean section of the literature review discussed the adjustment steps in the two stage procedure, showing that the adjustment does not affect the streng th of the relationship between the level 2 predictor and the group means of the outcom e; however, the variance is ICC and the cluster size. The variance of the unadjusted group mean is ; (5.1 ) I conducted a small simulation to show that the strength of the relationship between the level 2 predictor and the group means of the outcome does not change when using the two stage adjustment. I arbitrarily selected a condition; target=.9,


91 , and J = 80. One thousand iteration s were run and the observed between level relationship was calculated before and after the adjustments. The variance of unadjusted and adjusted mean was also calculated. Results are shown in Ta ble 5 2 . For example, the first row indicates that when n=3 and , the observed between level correlation remains unchanged after the adjustment ( ) . However, the variance of the adjusted covariate ( =.092) was smaller than the variance of the unadjusted covariate ( =.463). The last column in Table 5 2 shows the estimates from Equation 5 1 . Table 5 1 The Comparison of Two stage Adjustment Procedure to Latent Covariate Pro cedure Model Co ndition Estimates and Standard Errors LMML1L2 Positive Bias .295(.099) .580(.198) .668(.039) .212(.043) .396(.033) OMML1L2 2S Positive Bias .297(.099) .578(.198) .668(.039) .213(.043) .396(.033) OMML1L2 Positive Bias .297(.099) .634(.071) .668(.039) .213(.043) .396(.033) LMML1L2 Negative Bias .426(.156) .551(.197) .373(.069) .239(.073) .580(.065) OM ML1L2 2S Negative Bias .446 (.150) .545(.190) .373(.069) .243(.072) .580(.065) OMML1L2 Negative Bias .446(.150) .468(.100) .373(.069) .243(.072) .580(.065) :treatment, : level 2 predictor, :level 1 predictor, :level 2 error var, level 1 error var Tables 5 1 and 5 2 indicate that the latent mean adjustment, at least with the two stage procedure, reduces the bias of by adjusting the variance. Given the strength of the between level relation is unaffected; conditional error variance at level 2, thus the variance of the treatment effect, might remain unchanged after the adjustment. This


92 deduction is supported by Table 5 1 in which OMML1L2 2S and OMML1L2 has exactly the same variance components. Table 5 2 Change in the Variance of the Group mean After the Two stage Adjustment Eq 5 1 3 0.1 0.1 0.100 0.100 0.463 0.092 .466 6 0.1 0.1 0.097 0.097 0.331 0.121 .333 10 0.1 0.1 0.100 0.100 0.279 0.143 .280 20 0.1 0.1 0.101 0.101 0.237 0.165 .240 3 0.5 0.5 0.451 0.451 0.462 0.091 .466 6 0.5 0.5 0.459 0.459 0.330 0.120 .333 10 0.5 0.5 0.462 0.462 0.276 0.141 .280 20 0.5 0.5 0.468 0.468 0.236 0.164 .240 n :cluster size, pop. parameter, observed correlation, unadjusted mean, adjusted mean The Effect of Missingness on Statistical Power The main simulation study included four different missingness pattern s , not only to investigate the effect on statistical power , but also to examine the interaction with the method choice. A significant interaction might have indicated that the performance of differe nt methods vary when the data are not balanced. Nevertheless the PEV values for this interaction when the target power was .7 or .9 were smaller than .01. On the other hand , the main effect of missingness pattern had PEV values .12 and .10 for the target power .7 and .9 respectively. In this study, I used listwise deletion to handle missing data. However, t he M plus software also of fers full information maximum likelihood ( FIML ) as well as multiple imputation . FIML performs similar ly to multiple imputation for clustered data (Black , Harel & McCoach, 2011) . I conducted a small simulation study with 1000 replications to investigate th e performance of FIML by arbitrarily setting J = 60; n =3,10; ;


93 ; ; a nd target power of .7. Table 5 3 reports the summaries. Tabl e 5 3 provides information on (a ) a small sc ale replication of the main simulati on study and (b ) the effect of FIML on studied methods. When the listwise de letion is performed, as it was i n the main simulation study, these findings replicate that OMM methods were slightly more powerful than LMM meth ods and LMML2 was slightly less powerful than LMML1L2. These findings also replicate the effect of the missingness pattern. The effect of FIML estimation can be c ategorized in three ways . First, for OMML2 method, the FIML estimation procedure did not prov ide additional power . F or OMML1L2 method , additional power was only obtained when missingness occurred due to both levels or due to level 1 only. These findings indicate that when using the arithmetic mean of a covariate as a second level predictor, the FI ML estimation to handle missing data might provide additional power via bringing the deviation scores into the model if missingness did not occur only at the cluster level. Second, for the LMM methods, regardless o f the missingness pattern, the FIML estima tion procedure might provide additional power even when there is no missing data at all. Moreover, when the FIML estimation procedure is used, the power gap between LMML1L2 and OMML1L2 vanishes. Third and last, the FIML procedure might offer a partial solu tion to loss of power due to missingness for LMM methods and for OMML1L2 method if missingness did not occur only at the cluster level. Additional research is needed to understand the effect of FIML estimation on the studied methods , and the performance of FIML estimation in terms of Type I error .


94 Table 5 3 the effect of handling missing data with FIML estimation on studied methods No missing Missing only at L1 Missing only at L2 Missing at both Model min mean max min mean max min mean max min Mean max L MML1L2 0.642 0.737 0.803 0.623 0.713 0.784 0.613 0.695 0.759 0.613 0.703 0.771 LMML1L2 FIML 0.696 0.769 0.872 0.683 0.757 0.849 0.667 0.732 0.834 0.674 0.744 0.833 LMML2 0.640 0.716 0.764 0.622 0.695 0.754 0.610 0.676 0.719 0.602 0.682 0.73 6 LMML2 FIML 0.690 0.747 0.812 0.664 0.724 0.774 0.647 0.704 0.766 0.653 0.713 0.767 OMML1L2 0.696 0.770 0.872 0.686 0.748 0.822 0.669 0.732 0.834 0.672 0.739 0.822 OMML1L2 FIML 0.696 0.770 0.872 0.685 0.757 0.849 0.669 0.732 0.834 0.674 0 .744 0.834 OMML2 0.696 0.767 0.849 0.685 0.746 0.816 0.669 0.728 0.811 0.673 0.737 0.806 OMML2 FIML 0.696 0.767 0.850 0.685 0.746 0.815 0.669 0.728 0.811 0.673 0.737 0.806 Empirical Analyses This additional section aims to demonstrate th e model comparisons using an empirical data set. The comparisons are structured separately as (a) OMML1L2 versus OMML2 and (b ) OMML1L2 versus LMML1L2 . The empirical data set is borrowed from a special education project, a cognitive behavioral intervention implemented at the universal level in order to prevent emotional and behavioral problems (Daunic et al. 2006; Smith et al. 2005; Smith et al. 2009) . The intervention is titled Tools for Getting along (TFGA), a curriculum designed for upper elementary schoo l students. The sample included approximately 2000 students nested in 135 classrooms; the average class size was 15.45. The outcome for the empirical analyses is the Behavioral Regulation Index (BRI) score at the post intervention assessment . The outcome i s created based on the BRI of Executive Function Teacher Form, a standardized instrument that consists of 86 L ikert type items (1 3) compromising 8 clinical scales; the BRI score is calculated as a sum of two scales as an indicator of ability t o use inhibi tory control to manage emotions and behavior (Gioia, Isquith, Guy, & Kenworthy, 2000). The pre intervention BRI score is used as a covariate in the empirical analyses . The BRI scores ranged from 20 60 in which a high score


95 indicates higher risk. Details ab out the initial sample, setting, intervention, instruments, and study procedures can b e found in Daunic et al. (2012). OMML1L2 versus OMML2 The empirical data set is randomly subsampled from the original data in order to have equal cluster sizes (n=10 ). The subsample includes 60 clusters; 23 for control and 37 for treatment. For this subsample, using REML estimates, the ICC equals to .24 and .25 for pre and post intervention BRI score s respectively . Correlation between pre test and post test scores is .73, =.50 and =.55 . Analyses to compare OMML1L2 versus OMML2 were conducted in R, with the linear mixed effects model (lme) function of the nlme package (Pinhero et al., 2014) with both ML and REML estimators. R esults are shown in Table 5 4 . The combined model for OMML1L2 reads; (5 3) where is the post test BRI score for individual i in cluster j , is the pre test BRI score, is the observed pre test group mean for cluster j , and is the binary treatment indicator (control=0) . In main simulation study , estimator was ML , and as mentioned earlier (p. 33), ML estimates for level 2 variances will be small er than REML by a factor of approximately (J F)/J . For this particular data set the factor is equal to (60 4)/60=.933 for the OMML1L2 method , and indeed the ML estimate for the level 2 variance in Table 5 4 is .936 times smaller than the REML estimate. Whe n using the same estimator, both methods produced the same estimates and standard errors for the condition e ffect. The f ollowing paragraph interpret s the ML estimates.


96 The t value for condition using the two methods is equal to 1.88 and two tailed p valu e is .065 using 57 (60 2 1) degrees of freedom. The estimates of the observed mean coefficient and its standard error are also the same across two methods. The variance component at level 2 is larger for OMML1L2 ( = 15.68 ) compared to OMML2 ( =10.04 ) as discussed in prior sections. Even though statistical power for the treatment effect is remained unchanged, the OMML1L2 provided a better fit (AIC=4108, BIC=4135 ) compared to OMML2 (AIC=4534, BIC=455 6). Residuals at both levels were examined and no salient departure from the normal distribution was observed for both methods. Overall, the treatment group showed some improvement but the statistical signifi cance using a t test approach was not affected by including the d eviation score in the model. Table 5 4 Empirical Comparison of OMML1L2 versus OMML2 Model Estimator Fixed Effects (Std. Error) Random Effects Fit AIC BIC OMML1L2 REML 2.32 (1.26) .78 (.10) .75 (.03) 16.74 46.69 4113 4140 ML 2.32 (1.23) .78 (.09) .75 (.03) 15.68 46.60 4108 4135 OMML2 REML 2.32 (1.26) .78 (.10) NA 11.11 102.96 4534 4556 ML 2.32 (1.23) .78 (.09) NA 10.0 4 102.96 4534 4556 Empty REML NA NA NA 34.39 102.96 4575 4588 ML NA NA NA 33.65 102.96 4576 4589 Note: AIC: Akaike Information Criterion, BIC: Bayesian Information Criterion


97 LMML1L2 versus OMML1L2 The same data set wa s analyzed using M plus 7 software i n order to compare L MML1L2 versus OMML1L2. Table 5 5 shows the results . As expected, the estimates for the within coefficient ( ) is identical in both methods. For this particular data set, given the are almost identical for within and between level association, no substantial bias for the between coefficient ( ) is expected. Using the Ludtke et al. (2008) bias formula, the bias is calculated to be .004. Hence, the estimates of the between c oefficient is roughly identical for both methods. Table 5 5 Empirical Comparison of OMML1L2 versus LMML1L2 Model Fixed Effects (Std. Error) Random Effects OMML1L2 2.32 (1.23) .78 (.09) .75 (.03) 15.73 46.60 LMML1L2 2.34 (1.25) .79 (.13) .75 (.03) 15.69 46.60 The t test statistics for the condition effect are 1.871 and 1.884 for latent and observed mean model respectively. As discussed earlier, even though OMML1L2 might be slightly more powerful compared to LMML1L2, the difference is considered minor, and especially for this particular data set the difference is immaterial. Overall, the treatment group showed some improveme nt but the statistical power to detect this improvement remained roughly the same regardless of the choice of latent or observed level 2 covariate.


98 Conclusion Education systems all around the world are organized into clusters and CRTs are important design s in edu cational research. C ovariates in these designs play an important role to efficiently increase statistical power given that sampling more clusters is generally expensive. In a two level design, level 2 covariates are generally more important than le vel 1 covariate s in terms of power . There are at least two alternatives to create a level 2 predictor from a level 1 predictor, the arithmetic mean and the latent mean. The effect of these two alternatives along with the effect of including the deviation s cores on statistical power, bias and Type I error rate for a treatment effect in a CRT design was examined in this study . The magnitude of the treatment effect in the simulation study varied across the combinations of number of clusters, cluster size, stre ngth of between level relationship and ICC of the dependent variable in order to keep the target power constant. Results indicated that the treatment effect was generally estimated without bias; coverage rates might be problematic with LMM methods when the cluster size is small (i.e. , 3) , especially if the deviation score is not included in the model. R esults also showed that the effect of excluding the deviation score with OMM on statistical power to detect a treatment effect is negligible. However, with L MM methods excluding the level 1 deviation score might produce accuracy problems for the standard errors. The study found some evidence supporting that, in general, the difference between OMM methods and LMM with the deviation score is not substantial in t erms of statistical power. The Type I error rates were acceptable for all four methods with a slight departure for LMML2 method. And finally, assuming a constant rate of 10% missingness, the


99 negative effect of losing clusters com pared to losing individuals was larger in terms of statistical power. Suggestions to Researchers Based on the simulation results, the re are several suggestions for applied researchers deviation score and a . First, when the main interest is to detect a treatment effect, the latent group mean and arithmetic group mean perform similarly as covariates in terms of statistical power , bias and Type I error rate . When the latent group mean pro cedure is chosen, it is suggested to include level 1 deviation scores in the model to prevent accuracy problems with small cluster size s and to reach the maximum statistical power. With small cluster sizes, even though the difference might not be substanti al, OMM methods were shown to be more powerful than LMM. On the other hand, when the arithmetic mean is chosen, including the level 1 deviation score did not provide additional power, hence, the applied researcher might want to exclude the deviation scores when analyzing the data sets or when designing a CRT study. Second, the applied researcher should avoid losing an entire cluster as much as possible during a study. It is unfortunate tha t in many studies missing data are inevitable, but this dissertation study suggests that, given a constant rate of 10% missingness, losing cluster level data reduces power more than losing individual level data. Additional simulation study showed that, if available, the FIML procedure might offer some recovery for the reduc ed statistical power due to missingness . There are several of other suggestions b ased on the literature review for this study. When analyzing a dataset projecting a CRT design, the multilevel modeling framework might be more popular compared to other frame works, suc h as GEE and

PAGE 100

100 permutation tests but it would be good practice to compare results from different frameworks. For example, comparing GEE results to multilevel modeling results might help to evaluate the sensitivity of possible violation of assumptio ns . Further, when a multilevel modeling framework is chosen, there are alternative procedures to conduct hypothesis testing as briefly introduced in the literature review section of this study. Alternative procedures should be considered by the applied res earcher. When designing a CRT study, there are several available software and programs to calculate power . The Optimal Design software might be the most user friendly when the interest is only on the treatment effect. If the researcher was also interested in detecting interactions, PINT and MLpowSim are freely available on web. Limitations and Possible Future Research This study was restricted to conditions with two levels . Matching and blocking procedures were not studied and they are possible alternative s to using covariates to increase power in CRT. The study was also restr icted to a continuous outcome. P ower in multilevel models with categorical outcomes using latent or observed variable are an area for further research. Also, the study included missing data that was MCAR and listwise deletion was used . The small simul ation in the discussion section addressing the effect of FIML estimation requires further research. T his study wa s focused on statistical power to detect a fixed treatment effect, power to detect variance or cross level interactions are an area for further research.

PAGE 101

101 APPENDIX A VARIANCE COMPONENT APPROXIMATIONS The Effect of Deviation Scores on ML Variance Components for OMM models; an Approximation based on Snijders and Bosker (1994) Equa tions 3 to 7 . With level 1 and level 2 covariate The model Model for Predicted value of Residual for Variance of residual for

PAGE 102

102 With level 2 covariate The model Model for Predicted value of Residual for Variance of residual for

PAGE 103

103 APPENDIX B R AND M plus SCRIPTS The datasets were generated using the following R script; step1=function (clus=40, nclus=3, iccx=.2, iccy=.3, target=0.7,rhosqb=.50, rhosqw=.50) { # calculating between and within coefficie nts betab=sqrt(rhosqb*iccy/iccx) betaw=sqrt((rhosqw*(1 iccy))/(1 iccx)) # Calculating the fixed treatment effect mines=.1 maxes=2 res=seq(mines,maxes,by=.01) cval=qt(.95,clus 3) num=clus*res^2 fact=(clus 3)/(clus 4) den=4*((1 rhosqb)*i ccy+(1 iccy)/nclus)*fact nc=sqrt(num/den) powerd=1 pt(cval,clus 3,ncp=nc) es=ifelse(target==0,0,res[min(which(powerd>target))]) # calculating and id=seq(1:clus) mupre=matrix(c(rep(0,clus)),ncol =1) delta=matrix(c(rep(0,clus)),ncol=1) for (i in 1:clus){ mupre[i]=sqrt(iccx)*rnorm(1) delta[i]=sqrt(iccy ((betab^2)*iccx))*rnorm(1)} #calculating mupost=(betab*mupre)+delta mupost=data.frame(cbind(mupre,mupos t,id,c(rep(c(0,1),each=clus/2)))) colnames(mupost)=c("mupre","mupost","id","trt") # calculating and n=clus*nclus rpre=matrix(c(rep(0,n),rep(1:clus,each=nclus)),ncol=2) epsilon=matrix(c(rep(0,n)), ncol=1) for ( j in 1:n){ rpre[j,1]=sqrt(1 iccx)*rnorm(1) epsilon[j]=sqrt((1 iccy) ((betaw^2)*(1 iccx)))*rnorm(1) } rpre=data.frame(cbind(rpre,epsilon)) colnames(rpre)=c("rpre","id","epsilon")

PAGE 104

104 #calculating dat a=merge(mupost,rpre,by="id") data$x=data$mupre+data$rpre data$meanx=ave(data$x,data$id,FUN=mean) data$rpost=(betaw*data$rpre)+data$epsilon data$y.full=data$mupost+data$rpost+(data$trt*es) #deleting clusters del.size=length(id)/10 del.clus= sample(id,del.size,replace=F) cls.ind=data$id cls.ind=as.numeric(!cls.ind %in% del.clus) y.mcls=data$y.full*cls.ind y.mcls[which(y.mcls==0)] = ( 99) data$y.mcls=y.mcls #deleting level1 del.size2=length(data$y.full)/10 del.l1=sample(1:len gth(data$y.full),del.size2,replace=F) y.ml1=data$y.full for (i in del.l1){ y.ml1[i]=( 99) } data$y.ml1=y.ml1 # deleting both level del.size.l2=length(id)/20 del.clus.l2=sample(id,del.size.l2,replace=F) cls.ind.l2=data$id cls.ind.l2 =as.numeric(!cls.ind.l2 %in% del.clus.l2) y.both=data$y.full*cls.ind.l2 y.both[which(y.both==0)] = ( 99) ind=cbind(y.both,1:length(data$y.full)) nonmiss=ind[,2][ c(which(y.both== 99))] del.size.l1=length(data$y.full)/20 del.l1.l1=sample(nonm iss,del.size.l1,replace=F) for (i in del.l1.l1){ y.both[i]=( 99) } data$y.mboth=y.both data$clus=clus data$nclus=nclus data=data.frame(data) return(data)} The datasets were analyzed using the following M plus scripts;

PAGE 105

105 LMML1L2 for non missing pattern TITLE: Simulation for 2lvl regression DATA: FILE IS datalist.txt; type = montecarlo; VARIABLE: NAMES ARE id mupre mupst trt rpre epsil x meanx rpost yfull ymcls yml1 ymboth clus nclus; MISSING ARE ALL ( 99); usevariables ARE yfull x id trt; cluster is id; between trt; analysis: estimator=ml; type=twolevel; model: %within% yfull on x; %between% yfull on x trt; output: Tech9; savedata: results are results.txt; OMML1L2 for non missing pattern usevariabl es ARE yfull x id trt meanx; cluster is id; within x; between meanx trt; DEFINE: center x (groupmean); analysis: estimator=ml; type=twolevel; model: %within% yfull on x; %between% yfull on meanx trt; output: Tech9; savedata: results are results.txt;

PAGE 106

106 LIST OF REFERENCES Algina, J., & Olejnik, S. (2003). Conducting power analyses for ANOVA and ANCOVA in between subjects designs. Evaluation & the Health Professions , 26(3), 288 314. doi:10.1177/0163278703255248 Alvaro A. N . Original by Joseph L. Schafer (2013) . norm: Analysis of multivariate normal datasets with missing values. R package version 1.0 9.5. http://CRAN.R Atkins, D. C. (2005). Using multilevel models to analyze couple and family treatment data: basic and advanced issues. Journal of family psychology the Division of Family Psychology of the American Psychological Association (Division 43) , 19 (1), 98 110. doi:10.1037/0893 3200.19.1.98 Bell, B., Ferron, J., & Kromrey, J. (2008). Cluster size in multilevel models: the impact of sparse data structures on point and interval estimates in two level models. Section on Survey Research Methods . Retrieved from 2008/Files/300933.pdf Bell, B. A., Morgan, G. B. , Kromrey, J., & Ferron, J. (2010). The Impact of Small Cluster Size on Multilevel Models: A Monte Carlo Examination of Two Level Models with Binary and Continuous Predictors. Section on Survey Research Met hods. E source, available at df Black, A. C., Harel, O., & Betsy McCoach, D. (2011). M issing data techniques for multilevel data: implications of model misspecification. Journal of Applied Statistics , 38 (9), 1845 1865. doi:10.1080/02664763.2010.529882 Bloom, H. S. (2005 ). Randomizing Groups to Evaluate Place Based Programs. Bloom, H. S., Richburg Hayes, L., & Black, A. R. (2007). Using covariates to improve precision for studies that randomize schools to evaluate educational interventions. Educational Evaluation and Policy Analysis , 29(1), 30 59. doi:10.3102/0162373707299550 Bollen, K. A . (1989). Structural equations with latent variables . New York : Wiley, c1989 Borenstein, M., & Cohen, J. (2001). Power And Precision Software . Bosker, R. J., Snijders, T. A. B., & Guldemond, H. (2003). PINT Version 2.1, 1 37. Bradley, J. V. (1978). Robustness? The British Journal of Mathematical and Statistical Psychology, 31 (2), 144 152.

PAGE 107

107 Browne, W. J., Lahi, M. G., & Parker, R. M. A. (2009). A Guide to Sample Size Calculations for Random Effect Models via Simulati on and the MLPowSim Software Package , School of Clinical Veterinary Sciences : University of Bristol Tarbiat Modares University , Iran This draft March 2009 . Chakraborty H. (2008) The Design and Analysis Aspects of Cluster Randomized Trials. In Statisti cal advances in the biomedical sciences: clinical trials, epidemiology, survival analysis, and bioinformatics (p. 67 75). Wiley Interscience. Cohen, J. (1977). Statistical power analysis for the behavioral sciences (2nd ed.). New York: Academic Press . C ook, T. D., & Campbell, D. T. (1979). Quasi experimentation: Design and analysis issues for field settings. Boston: Houghton Mifflin . Cornfield, J. ( 1978 ) . Randomization by Group: A Formal Analysis. American Journal of Epidemiology 108 (2): 100 102. Croo n, M. a, & van Veldhoven, M. J. P. M. (2007). Predicting group level outcome variables from variables measured at the individual level: a latent variable multilevel model . Psychological methods , 12 (1), 45 57. doi:10.1037/1082 989X.12.1.45 Cunningham, T. D . (2010). Power and sample size for three level cluster designs . ProQues t, UMI Dissertations Publishing . Daunic, A. P., Smith, S.W., Brank, E. M., & Penfield, R. D. (2006). Classroom based cognitive behavioral intervention to prevent aggression: Efficacy and social validity. Journal of School Psychology , 44, 123 139. Daunic, A. P., Naranjo, A. H., Smith, S. W., Garvan, C. W., Barber, B. R., Bec ker, M. K., Li, W. (2012). Reducing developmental risk for emotional/behavioral problems: A randomized controlle d trial examining the tools for getting along curriculum. Journal of School Psychology , 50(2), 149 166. doi:10.1016/j.jsp.2011.09.003 Effect Sizes and Minimum Required Sample Sizes for Experimental and Quasi Experimental Design Studies. Journal of Research on Educational Effectiveness , 6 (1), 24 67. doi:1 0.1080/19345747.2012.673143 Donner, A . (2009). Cluster Unit Randomized Trials . E source: available at documents/public/donner_fullc hapter.pdf

PAGE 108

108 Donner, A., & Klar, N. (2004). Pitfalls of and controversies in cluster randomization trials. American journal of public health , 94 (3), 416 22. Retrieved from rez&rendertype=abstract Dorman, J. P. (2008). The effect of clustering on statistical tests: an illustration using c lassroom environment data. Educational Psychology , 28 (5), 583 595. doi:10.1080/01443410801954201 Educational Sciences Reform A ct , Government. (2002) . Available at: cy/rschstat/leg/PL107 279.pdf Enders, C., & Bandalos, D. (2001). The relative performance of full information maximum likelihood estimation for missing data in structural equation models. Structural Equation Modeling a Multidisciplinary Journal, 8 (3), 43 0 457. doi:10.1207/S15328007SEM0803_5 Enders, C. K. (2006) Analyzing Structural Equation Models with Missing Data. In Structural equation modeling; a second course . Portland: Ringgold Inc. Feng, Z., Diehr, P., Peterson, A., & McLerran, D. (2001). Selecte d statistical issues in group randomized trials. Annual Review of Public Health , 22, 167 187. Fisher RA. 1925. Statistical Methods for Research Workers . Edinburgh, UK: Oliver & Boyd Fisher, R. (1926). The arrangement of field experiments. Journal of the Ministry of Agriculture of Great Britain . Retrieved from 1 4612 4380 9_8 Fisher RA. (1958) . Statistical Methods for Research Workers . 13 th edn. London: Hafner Press. Gail, M. H., Mark, S. D., Carroll, R. J., Green, S. B., & Pee, D. (1996). On design considerations and randomization based inference for community intervention trials. Statistics in medicine , 15 (11), 1069 92. doi:10.1002/( SICI)1097 0258(19960615)15:11<1069::AID SIM220>3.0.CO;2 Q Gardiner, J. C., Luo, Z., & Roman, L. A. (2009). Fixed effects, random effects and GEE: What are the differences? Statistics in Medicine, 28 (2), 221 239. doi:10.1002/sim.3478 Garson, G. (2013). Hi erarchical linear modeling : guide and applications / G. David Garson, editor. Thousand Oaks, Calif. : Sage Publications, c2013.

PAGE 109

109 Ghisletta, P., & Spini, D. (2004). An Introduction to Generalized Estimating Equations and an Application to Assess Selectivit y Effects in a Longitudinal Study on Very Old Individuals. Journal of Educational and Behavioral Statistics , 29 (4), 421 437. doi:10.3102/10769986029004421 Gioia, G. A., Isquith, P. K., Guy, S. C., & Kenworthy, L. (2000). Behavior Rating Inventory of Execu tive Function professional manual . Lutz, FL: Psychological Assessment Resources, Inc. Goldstein, H. (2011). Bootstrapping in multilevel models. In Handbook of advanced multilevel analysis (p.163) . edited by Joop J. Hox, J. Kyle Roberts . New York : Routled ge, c2011. Graham, J. W. A. (2012). Missing data: Analysis and design . New York: Springer. Guo S. & M.W. Fraser (2010). Propensity Score Analysis: Statistical Methods and Applications . Thousand Oaks: SAGE Publications. 370+xviii pp. Halekoh, U. (2006). The R Package geepack for Generalized Estimating Equations, 15 (2), 9 10. Hayes, R. J., & Bennett, S. (1999). Simple sample size calculation for cluster randomized trials. International journal of epidemiology , 28 (2), 319 26. Retrieved from Heck, R. H., & Thomas, S. L. (2000). An introduction to multilevel modeling techniques. Mahwah, N.J: Lawrence Erlbaum Associates. Hedges, L. V. (2007). Effect sizes in cluster randomized designs. Journal of Educational Statistics, 32 (4), 341 370. doi:10.3102/1076998606298043 Hedges, L. V., & Hedberg, E. C. (2007). Intraclass Correlation Values for Planning Group Randomized Trials in Education. Educational Evaluation and Policy Analysis , 29 (1), 60 87. doi:10.3102/0162373707299706 Hoogland,J.J, Boomsma, A . (1998). Robustness studies in convariance structure modeling. Sociological Methods and Research, 26 (3), 329. Hox, J. J., Maas, C. J. M., & Brinkhuis, M. J. S. (2 010). The effect of estimation method and sample size in multilevel structural equation modeling. Statistica Neerlandica , 64 (2), 157 170. doi:10.1111/j.1467 9574.2009.00445.x Hox, J. J. (2010). Multilevel analysis: Techniques and applications. New York: R outledge.

PAGE 110

110 Hubbard, A. E., Ahern, J., Fleischer, N. L., Van der Laan, M., Lippman, S. a, Jewell, N., Satariano, W. a. (2010). To GEE or not to GEE: comparing population average and mixed models for estimating the associations between neighborhood risk fact ors and health. Epidemiology (Cambridge, Mass.) , 21 (4), 467 74. doi:10.1097/EDE.0b013e3181caeb90 IBM Corp. Released 2012. IBM SPSS Statistics for Windows, Version 21.0. Armonk, NY: IBM Corp. Jadad, A. R., & Enkin, M. (2007). Randomized controlled trials: Questions, answers, and musings. Malden , Mass: Blackwell/BMJ. Jing, H. Z. , & Schafer , J. L. (2013). pan: Multiple imputation for multivariate panel or clustered dataR package version 0.9. Kim, J., & Frees, E. W. (2006). Omitted variables in multilevel m odels. Psychometrika , 71(4), 659 690. doi:10.1007/s11336 005 1283 0 Kish, L. (1965). Survey sampling . New York, J. Wiley [1965]. Klar N., & Donner A. (2007 ) . Cluster Randomization. In: Massaro J, DAgost ino RB, Sullivan LM, eds. Wiley Encyclopedia of Cl inical Trials . United States: John Wiley & Sons, Inc.:329 345. Konstantopoulos, S. (2008). The power of the test for treatment effects in three level block randomized designs. Journal of Research on Educational Effectiveness, 1(4), 265 288. Konstantopoul os, S. (2009). Using power tables to compute statistical power in multilevel experimental designs. Practical Assessment, Research & Evaluation , 14(10) Konstantopoulos, S. (2010). Power Analysis in Two Level Unbalanced Designs. The Journal of Experimental Education , 78 (3), 291 317. doi:10.1080/00220970903292876 Konstantopoulos, S. (2012). The impact of covariates on statistical power in cluster randomized designs: Which level matters more? Multivariate Behavioral Research, 47 (3), 392 420. doi:10.1080/00273 171.2012.67389 Kowalchuk, R. K., Keselman, H. J., & Algina, J. (2004). The analysis of repeated measurements with mixed model adjusted F tests. Educational and Psychological Measurement [H.W. Wilson EDUC], 64 (2), 224. Kreft, I. G. G., & Leeuw, J. d. (1 998). Introducing multilevel modeling. Thousand Oaks, Calif: Sage

PAGE 111

111 Kuehl, R. O. ( 2000 ) . Design of Experiments: Statistical Principles of Research Design and Analysis , 2nd edition. Pacific G rove, California: Brooks/Cole. LaHuis, D. M., & Ferguson, M. W. (2009). The Accuracy of Significance Tests for Slope Variance Components in Multilevel Random Coefficient Models. Organizational Research Methods , 12 (3), 418 435. doi:10.1177/1094428107308984 Leeden, R. v. d., Meijer, E., & Busing, F. M. T. A. (2008). R esampling multilevel models. (p . 401 433) In Handbook of multilevel analysis . New York, NY: Springer New York. doi:10.1007/978 0 387 73186 5_11 Lee, O. E., & Braun, T. M. (2012). Permutation t ests for random effects in linear mixed models. Biometrics , 68 (2), 486 93. doi:10.1111/j.1541 0420.2011.01675.x Liu, G., & Liang, K. Y. (1997). Sample size calculations for studies with correlated observations. Biometrics , 53 (3), 937 47. Retrieved from Longford, N. T. (2008). Missing Data. In Handbook of multilevel analysis (2008) . DE: Springer New York. doi:10.1007/978 0 387 73186 5 Lüdtke, O., Marsh, H . W., Robitzsch, A., Trautwein, U., Asparouhov, T., & Muthén, B. (2008). The multilevel latent covariate model: a new, more reliable approach to group level effects in contextual studies. Psychological methods , 13 (3), 203 29. doi:10.1037/a0012869 Maas, C. J. M., & Hox, J. J. (2005). Sufficient sample sizes for multilevel modeling . Methodology , 1(3), 86 92. doi:10.1027/1614 2241.1.3.86 Martin , J. (2012). pamm: Power analysis for random effects in mixed models. R package version 0.7. http://CRAN.R Manor, O., & Zucker, D. M. (2004). Small sample inference for the fixed effects in the mixed linear model. Computational Statistics and Data Analysis, 46 (4), 801 817. doi:10.1016/j. csda.2003.10.005 Moerbeek, M., van Breukelen, Gerard J. P, & Berger, M. P. F. (2000). Design issues for experiments in multilevel populations. Journal of Educational and Behavioral Statistics,25 (3), 271. doi:10.2307/1165206 Moerbeek, M., Van Breukelen, G . J. P., & Berger, M. P. F. (2001). Optimal Experimental Designs for Multilevel Models With Covariates. Communications in Statistics Theory and Methods , 30 (12), 2683 2697. doi:10.1081/STA 100108453

PAGE 112

112 Moerbeek, M., van Breukelen, G. J. P., & Berger, M. P. F. (2003 ). A comparison between traditional methods and multilevel regression for the analysis of multicenter intervention studies. Journal of Clinical Epidemiology , p . 341 350. doi:10.1016/S0895 4356(03)00007 6 Moerbeek, M. (2004). The consequence of ign oring a level of nesting in multilevel analysis. Multivariate Behavioral Research, 39 (1), 129 149. doi:10.1207/s15327906mbr3901_5 Moerbeek, M. (2005). Randomization of clusters versus randomization of persons within clusters: Which is preferable? The Ameri can Statistician, 59 (2), 173 179. doi:10.1198/000313005X43542 Moerbeek, M. (2006a). Cluster Randomized Trials. In Springer Handbook of Materials Measurement Methods ( p. 705 718 ). Berlin: Springer. Moerbeek, M. (2006b). Power and money in cluster randomiz ed trials: when is it worth measuring a covariate? Statistics in medicine , 25 (15), 2607 17. doi:10.1002/sim.2297 Moerbeek, M., & Wong, W. K. (2008). Sample size formulae for trials comparing group and individual treatments in a multilevel model. Statistic s in Medicine, 27 (15), 2850 2864. doi:10.1002/sim.3115 Murray, D. M. (1998). Design and analysis of group randomized trials . New York : Oxford University Press, 1998 Murray, D. M., Varnell, S. P., & Blitstein, J. L. (2004). Design and analysis of group r andomized trials: a review of recent methodological developments. American journal of public health , 94 (3), 423 32. Retrieved from http://ww rez&rendertype=abstract Murray, D. M., Hannan, P. J., Pals, S. P., McCowen, R. G., Baker, W. L., & Blitstein, J. L. (2006). A comparison of permutation and mixed model regression methods for the analysis of simulated data in the context of a group randomized trial. Statistics in Medicine, 25 (3), 375 388. doi:10.1002/sim.2233 Muthén, L.K. & Muthén, B.O. (1998 2012). M . Seventh Edition. Los Angeles , CA: Muthén & Muthén Olejnik, S., & Algina, J. (2003). Generalized eta and omega squared statistics: Measures of effect size for some common research designs. Psychological Methods , 8, 434 447.

PAGE 113

113 Paccagnella, O. (2011). Sample Size and Accuracy of Estimat es in Multilevel Models. power Methodology: European Journal of Research Methods for the Behavioral and Social Sciences , 7 (3), 111 120. doi:10.1027/1614 2241/a000029 Pinheiro J, Bates D, DebRoy S, Sarkar D and R Core Team (2014). _nlme: Linear and Nonline ar Mixed Effects Models_. R package version 3.1 117, . Puffer, S., Torgerson, D. J., & Watson, J. (2005). Cluster randomized controlled trials. Journal of evaluation in clinical practice , 11 (5), 479 83. doi:10 .1111/j.1365 2753.2005.00568.x R Core Team (2013). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R . Raudenbush, S.W. , Bryk, A.S, & Congdon, R. (2004). HLM 6 for Windows [Computer software]. Skokie, IL: Scientific Software International, Inc. Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods. Thousand Oaks: Sage Publications. Raudenbush, S W, & Liu, X. (2000). Statistical power and optimal design f or multisite randomized trials. Psychological methods , 5 (2), 199 213. Retrieved from http://www.ncbi.nlm.nih. gov/pubmed/10937329 Raudenbush, S. W. (1997 ). Statistical analysis and optimal design for cluster randomized trials. Psychological Methods , 2 (2), 173 185. doi:10.1037//1082 989X.2.2.173 Raudenbush, S. W., et al. (2011). Optimal Design Software for Multi level and Longitudinal Research (Version 3.01) [Software]. Available from . Rense, N. , Manfred te , G . & Ben , P. (2012). Influence.ME: Tools for Detecting Influential Data in Mixed Effects Models. R Journal , 4(2): p . 38 47. Rietbergen, C., & Moerbeek, M. (2011). The Design of Cluster Randomized Crossover Trials. Journal of Educational and Behavioral Statistics , 36 (4), 472 490. doi:10.3102/1076998610379136 Rubin, D. (1976). Inferenc e and missing data. Biometrika , 63 (3), 581 592. Retrieved from SAS Institute Inc.

PAGE 114

114 SAS Institute Inc. 2004. Proceedings of the Twenty Ninth Annual SAS® Users Group International Conference . Cary, NC: SAS Institute Inc. Searle, S. R., Casella, G., & McCulloch, C. E. (1992). Variance components New York : Wiley, c1992 . Center, PennState. Retrieved from Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods , 7 (2), 147 177. doi:10.1037//1082 989X.7.2.147 Scherbaum, C. a., & Ferreter, J. M. (2008). Estimating Statistical Power and Required Sample Sizes for Organizational Research Using Multilevel Modeling. Organiz ational Research Methods , 12 (2), 347 367. doi:10.1177/1094428107308906 Schochet, P. Z., Institute of Education Sciences (ED), Washington, DC, & Mathematica Policy Research, Princeton, NJ. (2005). Statistical power for random assignment evaluations of educ ation programs. Mathematica Policy Research, Inc, Shieh, G. (2007). A unified approach to power calculation and sample size determination for random regression models. Psychometrika , 72(3), 347 360. doi:10.1007/s11336 007 9012 5 Shin, Y., & Raudenbush, S. W. (2010) A latent cluster mean approach to the contextual effects model with missing data. J ournal of Educational and Behavioral Statistics, 35(1), 26 53. doi:10.3102/1076998609345252 Smith, S. W., Lochman, J. E., & Daunic, A. P. (2005). Managing aggre ssion using cognitive behavioral interventions: State of the practice and future directions. Behavioral Disorders , 30, 227 240. Smith, S. W., Graber, J., & Daunic, A. P. (2009). Cognitive behavioral interventions for anger/aggression: Review of research a nd research to practice issues. In M. Mayer, R. Van Acker, J. Lochman, & F. Gresham (Eds.), Cognitive behavioral interventions for emotional and behavioral disorders: School based practice (pp. 111 142). New York: Guilford Press. Snijders, T., & Bosker, R . (1993). Standard errors and sample sizes for two level research. Journal of Educational Statistics , 18 (3), 237 259. Retrieved from Snijders, T. A . B., & Bosker, R. J. (2012). Multilevel analysis: An introduction to basic and a dvanced multilevel modeling . Los Angeles: Sage.

PAGE 115

115 Snyder, P., Hemmeter, M.L., Sandall, S., & McLean, M. (2007). Impact of professional de use of embedded instruction practices. Gainesville, FL: University of Florida. Spybrook, J. , Bloom, H., Congdon, R., Hill, C., Martinez, A., & Raudenbush, S. (2011). Swami nathan, H., Rogers , H.J. (2008). Estimation Procedures for HLM. In Multilevel modeling of educational data ( p . 469 519) . Charlotte, NC: IAP. Teerenstra, S., Moerbeek, M., Melis, R. J. F., & Borm, G. F. (2007). A comparison of methods to analyse continuous data from pseudo cluster randomized trials. Statistics in Medicine,26 (22), 4100 4115. doi:10.1002/sim.2851 Vallejo, S. , G., Ato García, M., Fernández García, M. P., & Livacic Rojas, P. E. (2013). Multilevel bootstrap analysis with assumptions violated. P sicothema, 25 (4), 520 528. doi:10.7334/psicothema2013.58 Wackerly, D. D., Mendenhall, W., & Scheaffer, R. L. (2008). Mathematical statistics with applications. Belmont, CA: Thomson Brooks/Cole. Yang, S., Kim, J. K., & Zhu, Z. (2013). Parametric fractional imputation for mixed models with nonignorable missing data. Statistics and Its Interface , 6 (3), 339 347. doi:10.4310/SII.2013.v6.n3.a4

PAGE 116

116 BIOGRAPHICAL SKETCH Burak AYDIN was born in Istanbul, Turkey. He received his B.A. in mathematics education from Koca eli University . He served for the Turkish Government as a mathe matics teacher for seven months and shortly after was qualified for a scholarship to study abroad. He joined Research and Evaluation Methodology (REM) program at the UF in 2008, obtained a MA a nd a Ph.D. degree in REM with a minor in statistics.