THE MANTEL-HAENSZEL METHOD FOR DETECTING DIFFERENTIAL ITEM FUNCTIONING IN DICHOTOMOUSLY SCORED ITEMS: A MULTILEVEL APPROACH

By JANN MARIE WISE MACINNES

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA 2009

© 2009 Jann Marie Wise MacInnes

To the loving memory of my mother, Peggy R. Wise

ACKNOWLEDGMENTS

I would like to take this opportunity to thank my dissertation supervisory committee chair, Dr. M. David Miller, whose guidance and encouragement have made this work possible. I would also like to thank all the members of my committee for their support: Dr. James Algina, Dr. Walter Leite, and Dr. R. Craig Wood.

This dissertation would not have been completed without the support of my friends and family. A special thank you goes to my son Joshua, and friends Jenny Bergeron, Steve Piscitelli, and Beth West, for it was their advice, encouragement, love, and friendship that kept me going. I would like to thank my parents, Peggy and Mac Wise, who taught me the value of dedication and hard work. And last, but certainly not least, I would like to remember my mother, who never stopped believing in me.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT
CHAPTER
1 INTRODUCTION
    Purpose of the Study
    Significance of the Study
2 LITERATURE REVIEW
    Dichotomous DIF Detection Procedures
        Mantel-Haenszel Method
        Logistic Regression
        Item Response Theory
        Logistic Regression
        Item Response Theory
        Mantel-Haenszel Procedure
3 METHODOLOGY
    Overview of the Study
    Research Questions
    Model Specification
        Two-level Multilevel Model for Dichotomously Scored Data
        Mantel-Haenszel Multilevel Model for Dichotomously Scored Data
    Simulation Design
        Simulation Conditions for Item Scores
        Simulation Conditions for Subjects
        Analysis of the Data
4 RESULTS
    Results
    Illustrative Examples
        Simulation Design
        Parameter recovery for the logistic regression model
        Parameter recovery of the Mantel-Haenszel log odds ratio
    Simulation Study: Parameter Recovery of the Multilevel Mantel-Haenszel
    Simulation Study: Performance of the Multilevel Mantel-Haenszel
        All items simulated as DIF free
        Items Simulated to Contain DIF
5 CONCLUSION
    Summary
    Discussion of Results
        Multilevel Equivalent of the Mantel-Haenszel Method for Detecting DIF
        Performance of the Multilevel Mantel-Haenszel Model
    Implication for DIF Detection in Dichotomous Items
    Limitations and Future Research
LIST OF REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

Table
2-1 Responses on a dichotomous item for ability level j
3-1 Generating conditions for the items
3-2 Simulation design
4-1 Item parameters for the illustrative example
4-2 A comparison of the logistic and multilevel logistic models
4-3 A comparison of the Mantel-Haenszel and Multilevel Mantel-Haenszel
4-4 A comparison of the standard errors for the illustrative example
4-5 P-values for the illustrative example
4-6 Item parameters for the condition of no DIF
4-7 Type I error: Items DIF free
4-8 Type I error: 10% DIF of size 0.2
4-9 Power: 10% DIF of size 0.2
4-10 Type I error: 10% DIF of size 0.4
4-11 Power: 10% DIF of size 0.4
4-12 Type I error: 20% DIF of size 0.2
4-13 Power: 20% DIF of size 0.2
4-14 Type I error: 20% DIF of size 0.4
4-15 Power: 20% DIF of size 0.4

LIST OF FIGURES

Figure
4-1 HLM logistic regression
4-2 HLM output for the logistic regression model
4-3 Multilevel Mantel-Haenszel HLM model
4-4 HLM results for the Mantel-Haenszel log-odds ratio
4-5 Graph of the log odds ratio estimates for both methods

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

THE MANTEL-HAENSZEL METHOD FOR DETECTING DIFFERENTIAL ITEM FUNCTIONING IN DICHOTOMOUSLY SCORED ITEMS: A MULTILEVEL APPROACH

By Jann Marie Wise MacInnes
December 2009
Chair: M. David Miller
Major: Research and Evaluation Methodology

Multilevel data often exist in educational studies. The focus of this study is to consider differential item functioning (DIF) for dichotomous items from a multilevel perspective. One of the most often used methods for detecting DIF in dichotomously scored items is the Mantel-Haenszel log odds ratio. However, the Mantel-Haenszel reduces the analyses to one level, thus ignoring the natural nesting that often occurs in testing situations. In this dissertation, a multilevel statistical model for detecting DIF in dichotomously scored items that is equivalent to the traditional Mantel-Haenszel method will be presented. This model is called the Multilevel Mantel-Haenszel model.
The reformulated Multilevel Mantel-Haenszel method is a special case of an item response theory (IRT) model embedded in a logistic regression model with discrete ability levels. Results for the Multilevel Mantel-Haenszel model were analyzed using the hierarchical generalized linear model (HGLM) framework of the HLM multilevel software program. Parameter recovery of the Mantel-Haenszel log odds ratio by the Multilevel Mantel-Haenszel model is first demonstrated by illustrative examples. A simulation study provides further support that (1) the Multilevel Mantel-Haenszel can fully recover the log odds ratio of the traditional Mantel-Haenszel, (2) the Multilevel Mantel-Haenszel is a method capable of properly detecting the presence of DIF in dichotomously scored items, and (3) the performance of the Multilevel Mantel-Haenszel compares favorably to that of the traditional Mantel-Haenszel.

CHAPTER 1
INTRODUCTION

Test scores are often used as a basis for making important decisions concerning an individual's future. Therefore, it is imperative that the tests used for making these decisions be both reliable and valid.
One threat to test validity is bias. Test bias results when performance on a test is not the same for individuals from different subgroups of the population, although the individuals are matched on the same level of the trait measured by the test. Since a test is composed of items, concerns about bias at the item level emerged from within the framework of test bias. Item bias exists if examinees of the same ability do not have the same probability of answering the item correctly (Holland & Wainer, 1993). Item bias implies the presence of some item characteristic that results in the differential performance of examinees from different subgroups of the population that have the same ability level. Removal or modification of items identified as biased will improve the validity of the test and result in a test that is fair for all subgroups of the population (Camilli & Congdon, 1999).

One method of investigating bias at the item level is differential item functioning (DIF). DIF is present for an item when there is a performance difference between individuals from two subgroups of the population that are matched on the level of the trait. Methods of DIF analysis allow test developers, researchers, and others to judge whether items are functioning in the same manner for various subgroups of the population. A possible consequence of retaining items that exhibit DIF is a test that is unfair for certain subgroups of the population.

A distinction should be made between item DIF, item bias, and item impact. DIF methods are statistical procedures for flagging items. An item is flagged for DIF if examinees from different subgroups of the population have different probabilities of answering the item correctly, after the examinees have been conditioned on the underlying construct measured by the item. Camilli and Shepard (1994) recommend that such items be investigated to uncover the source of the unintended subgroup differences. If the source of the subgroup difference is irrelevant
to the attribute that the item was intended to measure, then the item is considered biased. Item impact refers to subgroup differences in performance on an item. Item impact occurs when examinees from different subgroups of the population have different probabilities of answering an item correctly because true differences exist between the subgroups on the underlying construct being measured by the item (Camilli & Shepard, 1994). DIF analysis allows researchers to make group comparisons and rule out measurement artifacts as the source of any difference in subgroup performance.

Many statistical methods for detecting DIF in dichotomously scored items have been developed and empirically tested, resulting in a few preferred, often used, and powerful statistical techniques (Holland & Wainer, 1993; Clauser & Mazor, 1998). The Mantel-Haenszel procedure (Holland & Thayer, 1988), the logistic regression procedure (Swaminathan & Rogers, 1990), and several item response theory (IRT) techniques (Thissen, Steinberg, & Wainer, 1988) are members of this select group.

The increased use of various types of performance and constructed-response assessments, as well as personality, attitude, and other affective tests, has created a need for psychometric methods that can detect DIF in polytomously scored items. Generalized DIF procedures for polytomously scored items have been developed from the dichotomous methods. These include variations of the Mantel-Haenszel, logistic regression, and IRT procedures.

Once an item is identified as exhibiting DIF, it may be useful to identify the reason for the differential functioning.
Traditionally, the construction of the item was considered to be the source of the DIF. Items flagged for displaying DIF were analyzed on an item-by-item basis by content specialists and others to determine the possible reasons for the observed DIF. Item-by-item analysis of this type makes it more difficult (a) to identify common sources of DIF across items and (b) to provide alternative explanations for the DIF (Swanson et al., 2002). The matter of knowing why an item exhibited DIF led researchers to look for DIF detection methods that allow the inclusion of contextual variables as explanatory sources of the DIF.

A multilevel structure often exists in social science and educational data. Traditional methods of detecting DIF for both dichotomous and polytomous items ignore this natural hierarchical structure and reduce the analysis to a single level, thus ignoring the influence that an institution, such as a school, may have on the item responses of its members (Kamata, 2001). The natural nesting that exists in educational data may cause a lack of statistical independence among study subjects. For example, students nested within a group, such as a classroom, school, or school district, may have the same teacher and/or curriculum, and may be from similar backgrounds. These commonalities may affect student performance on any measure, including tests.

Multilevel models, also called hierarchical models, have been widely used in social science research. Recent research has demonstrated that multilevel modeling may be a useful approach for conducting DIF analysis. Multilevel models address the clustered characteristics of many data sets used in social science and educational research and allow educational researchers to study the effect of a nesting variable on students, schools, or communities.
Purpose of the Study

The purpose of this study is (1) to reformulate the Mantel-Haenszel technique for analyzing DIF in dichotomously scored items as a multilevel model, (2) to demonstrate that the newly reformulated multilevel Mantel-Haenszel approach is equivalent to the Mantel-Haenszel approach for DIF detection in dichotomous items when the data are item scores nested within persons, (3) to demonstrate that the estimate of the Mantel-Haenszel odds ratio can be recovered from the reformulated Mantel-Haenszel multilevel approach, and (4) to compare the performance of the Mantel-Haenszel technique for identifying differential item functioning in dichotomous items to the performance of the Mantel-Haenszel multilevel model for identifying differential item functioning in dichotomous items. To achieve this goal, data will be simulated to fit a multilevel situation in which item scores are nested within subjects, and a simulation study will be conducted to determine the adequacy of the aforementioned methods.

Significance of the Study

The assessment of DIF is an essential aspect of the validation of both educational and psychological tests. Currently, there are several procedures for detecting DIF in dichotomous items. These include the Mantel-Haenszel, logistic regression, and item response theory approaches. Multilevel equivalents of the logistic regression and item response theory methods of DIF detection have been formulated for use in both dichotomous and polytomous items (Kamata, 1998, 2001, 2002; Kamata & Binci, 2003; Rogers & Swaminathan, 2002; Swanson et al., 2002). Multilevel approaches are a valuable addition to the family of DIF detection procedures, as they take into consideration the natural nesting of item scores within persons and they allow for the contemplation of possible sources of differential functioning at all levels of the nested data.
Although multilevel approaches are promising, additional empirical testing is required to establish the theoretical soundness of the multilevel procedures that have been developed. The study has several unique and important applications to DIF detection of multilevel items from a multilevel perspective.

First, the Mantel-Haenszel method for DIF detection in dichotomous items will be reformulated as a multilevel approach for detecting DIF when items are nested in individuals. Furthermore, it will be demonstrated that the parameter estimate of the Mantel-Haenszel log odds ratio can be recovered from the Mantel-Haenszel multilevel reformulation. The multilevel reformulation will allow for a more thorough investigation into the source of the differential functioning and, therefore, the usefulness of the already popular Mantel-Haenszel procedure will increase.

Second, the Mantel-Haenszel technique for identifying differential item functioning in dichotomous items will be compared to the Mantel-Haenszel reformulated multilevel model for identifying differential item functioning in dichotomous items. A comparison of this type will provide valuable information that will give test developers and researchers confidence in selecting and using the multilevel approaches for DIF detection.
CHAPTER 2
LITERATURE REVIEW

One threat to test validity is bias, which has both a social and a statistical meaning (Angoff, 1993). From a social point of view, bias means that a difference exists in the performance of subgroups from the same population and that the difference is harmful to one or more of the subgroups. As a statistical term, bias means the expected test scores are not the same for individuals from different subgroups of the population, given that the individuals have the same level of the trait measured by the test (Kamata & Vaughn, 2004). In order to determine that bias exists, a difference between the performances of subgroups that have been matched on the level of the trait must be established, and the difference must be due to sources other than differences on the construct of interest. Generally, bias is investigated at the item level. Items identified as biased can then be removed or modified. The removal or modification of such items will improve the validity of the test and result in a test that is fair for all subgroups of the population (Camilli & Congdon, 1999).

Differential item functioning (DIF) is a common way of evaluating item bias. DIF refers to a difference in item performance between subgroups of the population that have been conditioned, or matched, on the level of the targeted trait or ability. Conditioning on the level of the targeted trait or ability is a very important part of the DIF analysis and is what distinguishes the detection of differential item functioning from item impact, which is the existence of true between-group differences on item performance (Dorans & Holland, 1997). DIF procedures assume that one controls for the trait or ability level. The trait or ability level is used to match subjects from the subgroups so that the effect of the trait or ability level is controlled.
Thus, by controlling for the trait or ability level, one may detect subgroup differences that are not confounded by trait or ability. The trait or ability level is called the matching criterion. The matching criterion is some estimate of the trait or ability level. Total test performance is often used as the matching criterion.

The presence of DIF is used as a statistical indicator of possible item bias. If an item is biased, then DIF is present. However, the presence of DIF does not always indicate bias. DIF may simply indicate the multidimensionality of the item and not item bias. An interpretation of the severity of the impact of any subgroup difference is necessary before an item can be considered biased.

Typically, two subgroups of the population are compared in a DIF analysis. The main group of interest, the subgroup of the population for which the item could be measuring unintended traits, is called the focal group. The other subgroup, the comparison group, is called the reference group.
The focal and reference groups are matched on the level of the intended trait as a part of DIF procedures. Therefore, any differences between the focal and reference groups are not confounded by differences in trait or ability levels.

There are two different types of DIF: uniform and nonuniform. Uniform DIF refers to differences in performance between the focal and reference groups that are the same in direction across all levels of the ability and indicates that one group has an advantage on the item across the continuum of ability. Nonuniform DIF refers to a difference in performance between the focal and reference groups whose direction changes across the levels of ability, so that the advantaged group depends on the ability level. The presence of nonuniform DIF means an interaction between ability level and item performance exists.

Current methods for DIF detection can be classified along two dimensions (Potenza & Dorans, 1995). The first of these dimensions is the nature of the ability estimate used for the matching, or conditioning, variable. The matching variable can be either an actual observed score, such as a total score, or a latent variable score, such as an estimate of the trait or ability level. The second dimension refers to the method used to estimate item performance at each level of the trait or ability. Methods of DIF detection can be categorized as parametric or nonparametric. Parametric procedures utilize a model, or function, to specify the relationship between the item score and ability level for each of the subgroups. In nonparametric procedures no such model is required because item performance is observed at each level of the trait or ability for each of the subgroups. Parametric procedures generally require larger data sets and carry the risk of model misspecification.

Dichotomous DIF Detection Procedures

A number of statistical methods have been developed over the years to detect differential item functioning in test items. Some of the first
methods included the analysis of variance procedure (Camilli & Shepard, 1987), the Golden Rule procedure (Faggen, 1987), and the delta-plot, or transformed item difficulty, method (Angoff, 1993). These methods utilized the item difficulty values for each of the subgroups and were found to be inaccurate detectors of DIF, especially if the item discrimination value was very high or low. More sophisticated and accurate statistical methods have replaced the earlier methods. The more common of these methods include the Mantel-Haenszel procedure (Holland & Thayer, 1988), the logistic regression procedure (Swaminathan & Rogers, 1990), and various item response theory procedures (Lord, 1980). First developed for use in dichotomously scored items, these methods have been generalized to polytomously scored items.

Mantel-Haenszel Method

The Mantel-Haenszel procedure (Mantel, 1963; Mantel & Haenszel, 1959) was first introduced by Holland (1985) and applied by Holland and Thayer (1988) as a statistical method for detecting DIF in dichotomous items. The Mantel-Haenszel is a nonparametric procedure that utilizes a discrete, observed score as the matching variable. The Mantel-Haenszel procedure provides both a statistical test of significance and an effect size estimate for DIF.

The Mantel-Haenszel procedure uses a 2 x 2 contingency table to examine the relationship between the focal and reference groups of the population and the two categories of item response, correct and incorrect, for each of the k ability levels. These tables have the format shown in Table 2-1.

Table 2-1. Responses on a dichotomous item for ability level j

                 Response to item i
Group        Correct        Incorrect      Total
Reference    $n_{1rij}$     $n_{0rij}$     $n_{.rij}$
Focal        $n_{1fij}$     $n_{0fij}$     $n_{.fij}$
Total        $n_{1.ij}$     $n_{0.ij}$     $n_{..ij}$
In the table, $n_{1rij}$ is the number of subjects in the reference group, at trait or ability level j, who answered item i correctly, and $n_{0rij}$ is the number of subjects in the reference group, at trait or ability level j, who answered item i incorrectly. Likewise, $n_{1fij}$ is the number of subjects in the focal group, at trait or ability level j, who answered item i correctly, and $n_{0fij}$ is the number of subjects in the focal group, at trait or ability level j, who answered item i incorrectly.

The first step in the analysis is to calculate the common odds ratio, $\alpha_{MH}$ (Mellenbergh, 1982). The common odds ratio is the ratio of the odds that an individual from the reference group will answer an item correctly to the odds that an individual from the focal group, of the same ability level, will answer the same item correctly. The values are combined across all levels of the trait or ability to create an effect size estimate for DIF. An estimate of $\alpha_{MH}$ can be obtained by the formula

$$\hat{\alpha}_{MH} = \frac{\sum_{j=1}^{k} n_{1rij}\, n_{0fij} / n_{..ij}}{\sum_{j=1}^{k} n_{0rij}\, n_{1fij} / n_{..ij}} \qquad (2.1)$$

where $n_{1rij}$, $n_{0rij}$, $n_{1fij}$, $n_{0fij}$, and $n_{..ij}$ are defined as in Table 2-1 and j represents the jth ability level. The estimate for $\alpha_{MH}$ has a range of zero to positive infinity. If the estimate for $\alpha_{MH}$ equals one, then there is no difference in performance between the reference and focal groups.
Values of $\alpha_{MH}$ between zero and one indicate the item favors the focal group, while values greater than one indicate the item favors the reference group. The common odds ratio is often transformed to the scale of differences in item difficulty used by the Educational Testing Service by the formula

$$\Delta_{MH} = -2.35 \ln(\alpha_{MH}) \qquad (2.2)$$

On the transformed scale, $\Delta_{MH}$ is centered about 0, and a value of 0 indicates the absence of DIF. Negative values of $\Delta_{MH}$ indicate the item favors the reference group, and positive values indicate the item favors the focal group.

The Mantel-Haenszel statistical test of significance tests for uniform DIF (Holland & Thayer, 1988), across all levels of the ability, under the null hypothesis of no DIF. Rejection of the null hypothesis indicates the presence of DIF. The test statistic, $\chi^2_{MH}$, has an approximate chi-squared distribution with one degree of freedom. The $\chi^2_{MH}$ test statistic is

$$\chi^2_{MH} = \frac{\left( \left| \sum_{j=1}^{k} n_{1rij} - \sum_{j=1}^{k} E(n_{1rij}) \right| - 0.5 \right)^2}{\sum_{j=1}^{k} Var(n_{1rij})} \qquad (2.3)$$

where

$$Var(n_{1rij}) = \frac{n_{.rij}\, n_{.fij}\, n_{1.ij}\, n_{0.ij}}{n_{..ij}^2 \,(n_{..ij} - 1)} \qquad (2.4)$$

and

$$E(n_{1rij}) = \frac{n_{.rij}\, n_{1.ij}}{n_{..ij}} \qquad (2.5)$$

The Educational Testing Service has proposed values of $\Delta_{MH}$ for classifying the magnitude of the DIF as negligible, moderate, or large (Zwick & Ercikan, 1989). Roussos and Stout (1996a, 1996b) modified the values and gave the following guidelines to aid in the interpretation of DIF:

Type A Items, negligible DIF: $|\Delta_{MH}| < 1$,
Type B Items, moderate DIF: MH test is statistically significant and $1.0 < |\Delta_{MH}| < 1.5$,
Type C Items, large DIF: MH test is statistically significant and $|\Delta_{MH}| > 1.5$.
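The quantities in equations 2.1 through 2.5 can be computed directly from the k stratified 2 x 2 tables. The following is a minimal sketch rather than the procedure used in this study: the function names `mantel_haenszel` and `classify`, the table layout, the 0.05 significance level, and the toy counts are all assumptions made for illustration. It also assumes that an item with a nonsignificant test is treated as Type A and that values falling exactly on 1.0 or 1.5 go to the higher category, neither of which is stated in the guidelines above.

```python
import math


def mantel_haenszel(tables):
    """Mantel-Haenszel DIF statistics for one item.

    tables: one 2x2 table per ability level j, laid out (by assumption) as
    [[ref_correct, ref_incorrect], [focal_correct, focal_incorrect]].
    Returns (alpha_hat, delta, chi2, p_value).
    """
    num = den = 0.0
    sum_obs = sum_exp = sum_var = 0.0
    for (n1r, n0r), (n1f, n0f) in tables:
        n_r, n_f = n1r + n0r, n1f + n0f      # group totals at this level
        n1, n0 = n1r + n1f, n0r + n0f        # correct / incorrect totals
        n = n_r + n_f                        # everyone at this level
        if n < 2:
            continue                         # variance term undefined; skip level
        num += n1r * n0f / n                 # numerator of eq. 2.1
        den += n0r * n1f / n                 # denominator of eq. 2.1
        sum_obs += n1r
        sum_exp += n_r * n1 / n                               # eq. 2.5
        sum_var += n_r * n_f * n1 * n0 / (n ** 2 * (n - 1))   # eq. 2.4
    alpha_hat = num / den                    # eq. 2.1 (fails if den == 0)
    delta = -2.35 * math.log(alpha_hat)      # eq. 2.2
    chi2 = (abs(sum_obs - sum_exp) - 0.5) ** 2 / sum_var      # eq. 2.3
    # Upper-tail probability of a chi-squared variable with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return alpha_hat, delta, chi2, p_value


def classify(delta, p_value, alpha_level=0.05):
    """A/B/C label; boundary and nonsignificant cases are assumptions."""
    if p_value >= alpha_level or abs(delta) < 1.0:
        return "A"
    return "B" if abs(delta) < 1.5 else "C"


# Toy data: two ability levels with identical, made-up counts.
tables = [[[30, 10], [20, 20]],
          [[30, 10], [20, 20]]]
alpha_hat, delta, chi2, p_value = mantel_haenszel(tables)
print(alpha_hat, round(delta, 3), round(chi2, 3), classify(delta, p_value))
```

In the toy data the reference group's odds of a correct response are three times those of the focal group at both levels, so the estimated common odds ratio is 3 and the transformed value is negative, which by the convention above means the item favors the reference group.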
The Mantel-Haenszel procedure is considered by some to be the most powerful test for uniform DIF for dichotomous items (Holland & Thayer, 1988). The Mantel-Haenszel procedure is easy to conduct, has an effect size measure and a test of significance, and works well for small sample sizes. However, the Mantel-Haenszel procedure detects uniform DIF only (Narayanan & Swaminathan, 1994; Swaminathan & Rogers, 1990). Research also indicates that the Mantel-Haenszel can indicate the presence of DIF when none is present if the data are generated by item response theory models (Meredith & Millsap, 1992; Millsap & Meredith, 1992; Zwick, 1990). Other factors that influence the performance of the Mantel-Haenszel include the amount of DIF, length of the test, sample size, and ability distributions of the focal and reference groups (Clauser & Mazor, 1998; Cohen & Kim, 1993; Fidalgo, Mellenbergh, & Muniz, 2000; French & Miller, 2007; Jodoin & Gierl, 2001; Narayanan & Swaminathan, 1994; Roussos & Stout, 1996; Uttaro & Millsap, 1994).

The Mantel-Haenszel method for detecting DIF in dichotomous items outlined above can be extended to polytomous items. This extension is often referred to as the Generalized Mantel-Haenszel, or GMH (Allen & Donoghue, 1996). The Generalized Mantel-Haenszel also compares the odds of a correct response for the reference group to the odds of a correct response for the focal group across all response categories of an item, after controlling for the trait or ability level.
The Mantel-Haenszel procedure is extended to the Generalized Mantel-Haenszel by modifying the contingency table to include more than two response categories. The Generalized Mantel-Haenszel uses a 2 x j contingency table to examine the relationship between the reference and focal groups and the j category responses for each item at each of the k levels of ability.

Logistic Regression

The logistic regression model for detecting DIF in dichotomous items was first proposed by Swaminathan and Rogers (1990) and is one of the most effective and recommended methods for detecting DIF in dichotomous items (Clauser & Mazor, 1998; Rogers & Swaminathan, 1993; Swaminathan & Rogers, 1990; Zumbo, 1999). The logistic regression model is a parametric method that can detect both uniform and nonuniform DIF. The logistic regression model, when applied to DIF detection, uses item response as the dependent variable. Independent variables include group membership, ability, and group-by-ability interaction variables. The logistic regression procedure uses a continuous observed score as the matching variable, which is usually the total scale, or subscale, score.
The logistic regression model is given by

$$\ln\left(\frac{p_j}{1 - p_j}\right) = \beta_0 + \beta_1 X_j + \beta_2 G_j + \beta_3 (XG)_j \tag{2.6}$$

In the model, $p_j$ represents the probability that individual $j$ provides a correct response. Therefore, the quantity $\ln\left(\frac{p_j}{1 - p_j}\right)$ represents the log odds, or logit, of individual $j$ providing a correct response. In the model, $X_j$ is the trait or ability level for individual $j$ and serves as the matching criterion, and $G_j$ represents group membership for individual $j$. The term $(XG)_j$ is the interaction between ability and group membership and is used to detect the presence of nonuniform DIF.

The logistic regression approach provides both a test of statistical significance and an effect size measure of DIF. An item is examined for the presence of DIF by testing the regression coefficients $\beta_1$, $\beta_2$, and $\beta_3$. If DIF is not present, then only $\beta_1$ should be significantly different from zero. If uniform DIF is present in an item, then $\beta_2$ is significantly different from zero but $\beta_3$ is not. If nonuniform DIF is present, then $\beta_3$ is significantly different from zero (Swaminathan & Rogers, 1990).

A model comparison test can be used to simultaneously detect both uniform and nonuniform DIF (Swaminathan & Rogers, 1990). Under this approach, the full model provided in equation 2.6, which includes ability, group membership, and their interaction as independent variables, is compared to a reduced model with ability as the only independent variable.
Such a model is

$$\ln\left(\frac{p_j}{1 - p_j}\right) = \beta_0 + \beta_1 X_j \tag{2.7}$$

A chi-square statistic, $\chi^2_{DIF}$, is computed as the difference in chi-square between the full model given in equation 2.6 and the reduced model given in equation 2.7:

$$\chi^2_{DIF} = \chi^2_{full} - \chi^2_{reduced} \tag{2.8}$$

The statistic follows a chi-square distribution with two degrees of freedom. Significant test results indicate the presence of uniform or nonuniform DIF. Exponentiation of the regression coefficients $\beta_2$ and $\beta_3$ in equation 2.6 provides an effect size measure of DIF. As with the Mantel-Haenszel, a value of one indicates there is no difference in performance between the reference and focal groups, values between zero and one indicate the item favors the focal group, and values greater than one indicate the item favors the reference group.

Swaminathan and Rogers (1990) contend that the Mantel-Haenszel procedure for dichotomous items is based on a logistic regression model in which the ability variable is a discrete, observed score and there is no interaction between group and ability level. They showed that, under these two conditions, the model expressed in equation 2.6 can be written as

$$\ln\left(\frac{p_j}{1 - p_j}\right) = \beta_0 + \sum_{k=1}^{I} \beta_k X_k + \tau G_j \tag{2.9}$$

In the above model, $X_k$ represents the discrete ability level categories $1, 2, \ldots, I$, where $I$ is the total number of items. $X_k$ is coded 1 for person $j$ if person $j$ is a member of ability level $k$, meaning person $j$'s matching criterion score is equal to $k$. If person $j$ is not a member of ability level $k$, then $X_k$ is coded 0. $X_k$ is coded 0 for all persons with a matching criterion score of 0. In equation 2.9 the coefficient of the group variable, $\tau$, is equal to $\ln \alpha$, where $\alpha$ is the odds ratio of the Mantel-Haenszel procedure. Therefore, in the logistic regression model presented in equation 2.9, the test of the hypothesis that $\tau = 0$ is equivalent to the test of the hypothesis that $\alpha = 1$ in the Mantel-Haenszel procedure, given there is no interaction.
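The model comparison test in equation 2.8 can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in this study: the function names and the model chi-square values are hypothetical, and the two model chi-square statistics are assumed to have been obtained already from software that fits equations 2.6 and 2.7. The p-value uses the closed-form upper tail of a chi-square distribution with two degrees of freedom.

```python
import math


def dif_model_comparison(chi2_full, chi2_reduced):
    """Return (chi2_dif, p_value) for the two-degree-of-freedom test in equation 2.8.

    chi2_full and chi2_reduced are the model chi-square statistics of the
    full model (equation 2.6) and the reduced model (equation 2.7).
    """
    chi2_dif = chi2_full - chi2_reduced
    if chi2_dif < 0:
        raise ValueError("the full model cannot fit worse than the reduced model")
    # For 2 degrees of freedom the chi-square upper-tail probability has the
    # closed form exp(-x / 2), so no statistics library is needed.
    p_value = math.exp(-chi2_dif / 2.0)
    return chi2_dif, p_value


def odds_ratio_effect_size(coefficient):
    """Exponentiate a logistic regression coefficient to obtain an odds ratio."""
    return math.exp(coefficient)


# Hypothetical model chi-square values, for illustration only.
chi2_dif, p_value = dif_model_comparison(chi2_full=130.0, chi2_reduced=122.0)
print(round(chi2_dif, 2), round(p_value, 4))
```

A coefficient of zero exponentiates to an odds ratio of one, which corresponds to the no-DIF case described above.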
Logistic regression methods for detecting DIF in dichotomous items can be extended to polytomous items (French & Miller, 1996; Wilson, Spray, & Miller, 1993; Zumbo, 1999). The extension is possible via a link function that is used to dichotomize the polytomous responses (French & Miller, 1996). In addition to the link, for each item, the probability of response for each of the response categories 1 through $K - 1$, where $K$ is the total number of response categories, is modeled using a separate logistic regression equation (Agresti, 1996; French & Miller, 1996).

Logistic regression procedures offer several advantages for identifying DIF in dichotomous items. They provide both a significance test and a measure of effect size, detect both uniform and nonuniform DIF, and use a matching variable that can be continuous in nature. Independent variables can be added to the model to explain possible causes of DIF, and all independent variables, including ability, can be linear or curvilinear (Swaminathan, 1990). Furthermore, the procedure can be extended to more than two examinee groups (Agresti, 1990; Miller & Spray, 1993). Swaminathan and Rogers (1990) compared the logistic regression procedure for dichotomous items to the Mantel-Haenszel procedure for dichotomous items and found that the logistic regression model is a more general and flexible procedure than the Mantel-Haenszel, is as powerful for detecting uniform DIF as the Mantel-Haenszel procedure, and, unlike the Mantel-Haenszel, is able to detect nonuniform DIF. However, if the data are modeled to fit a multi-parameter item response theory model, logistic regression methods produce poor results.

Several studies have shown that the logistic regression procedure is sensitive to changes in the sample size and to differences in the ability distributions of the reference and focal groups. Studies show that power and Type I error rates increase as the sample size increases (Rogers & Swaminathan, 1993; Swaminathan & Rogers, 1990). Jodoin and Gierl (2000) showed that differences in the ability distributions between the reference and focal groups degraded the power of the logistic regression procedure.

Item Response Theory

Item response theory (IRT), also known as latent trait theory, is a mathematical model for estimating the probability of a correct response to an item based on the latent trait level of the respondent and the characteristics of the item (Embretson & Reise, 2000). IRT procedures are a parametric approach to the classification of DIF in which a latent ability variable is used as the matching variable. The use of IRT models as a primary basis for psychological measurement has increased since it was first introduced by Lord and Novick (1968).

The graph of the IRT model is called an item characteristic curve, or ICC. The ICC represents the relationship between the probability of a correct response to an item and the latent trait of the respondent, or $\theta$. The latent trait usually represents some unobserved measure of cognitive ability. The simplest IRT model is the one-parameter (1P), or Rasch, model. In the 1P model the probability that a person with ability level $\theta$ responds correctly to an item is modeled as a function of the item difficulty parameter, $b_i$. The 1P model is given by the formula

$$P_i(\theta) = \frac{\exp(\theta - b_i)}{1 + \exp(\theta - b_i)} \tag{2.10}$$

The equation in 2.10 can also be written as

$$P_i(\theta) = \frac{1}{1 + \exp[-(\theta - b_i)]} \tag{2.11}$$

The two-parameter IRT model (2P) adds an item discrimination parameter to the one-parameter model. The item discrimination parameter, $a_i$, determines the steepness of the ICC and measures how well the item discriminates between persons of low and high levels of the latent trait. The 2P model is given by the formula

$$P_i(\theta) = \frac{\exp(a_i(\theta - b_i))}{1 + \exp(a_i(\theta - b_i))} \tag{2.12}$$

The three-parameter IRT model (3P) adds a pseudo-guessing parameter to the two-parameter model. The pseudo-guessing parameter, $c_i$, represents the probability that a person with extremely low ability will respond correctly to the item. The pseudo-guessing parameter provides the lower asymptote for the ICC. The 3P model is given by the formula

$$P_i(\theta) = c_i + (1 - c_i)\,\frac{\exp(a_i(\theta - b_i))}{1 + \exp(a_i(\theta - b_i))} \tag{2.13}$$

Three important assumptions concerning IRT models aid in their use as DIF detection tools. The first of these assumptions is unidimensionality. Unidimensionality means a single latent trait, often referred to as ability, is sufficient for characterizing a person's response to an item. Therefore, given the assumption of unidimensionality, if an item response is a function of more than one latent trait that is correlated with group membership, then DIF is present in the item. The second assumption is local independence. Local independence states that a response to any one item is independent of the response to any other item, controlling for ability and item parameters. The third assumption is item invariance, which states that item characteristics do not vary across subgroups of the population. Item invariance ensures that, in the absence of DIF, item parameters are invariant across subgroups of the population.
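The three item characteristic curves in equations 2.10 through 2.13 differ only in which parameters are free, so a single function can illustrate all of them. The sketch below is illustrative only; the function name and the parameter values are hypothetical and are not taken from this study.

```python
import math


def icc(theta, b, a=1.0, c=0.0):
    """Probability of a correct response under the 3P model (equation 2.13).

    With c = 0 this reduces to the 2P model (equation 2.12); with a = 1 and
    c = 0 it reduces to the 1P, or Rasch, model (equations 2.10 and 2.11).
    """
    logistic = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return c + (1.0 - c) * logistic


# Hypothetical item: difficulty 0.5, discrimination 1.2, pseudo-guessing 0.2.
for theta in (-3.0, 0.5, 3.0):
    print(theta, round(icc(theta, b=0.5, a=1.2, c=0.2), 3))
```

When ability equals the item difficulty, the 1P and 2P models give a probability of one half, and the 3P model gives $c_i + (1 - c_i)/2$, which is why $c_i$ acts as the lower asymptote of the curve.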
For IRT models, DIF detection is based on the relationship of the probability of a correct response to the item parameters for two subgroups of the population, after controlling for ability (Embretson & Reise, 2000). DIF analysis is a comparison of the item characteristic curves that have been estimated separately for the focal and reference groups. The presence of DIF means the parameters are different for the focal group and reference group, so the focal group has a different ICC than the reference group (Thissen & Wainer, 1985). Several methods are available for DIF detection using IRT models, including a test of the equality of the item parameters (Lord, 1980) and a measure of the area between ICCs (Kim & Cohen, 1995; Raju, 1988, 1990; Raju, van der Linden, & Fleer, 1992).

Lord's (1980) statistical test for detecting DIF in IRT models is based on the difference between the item difficulty parameters of the focal and reference groups. Lord's test statistic, $d_i$, is given by the formula

$$d_i = \frac{\hat{b}_{Fi} - \hat{b}_{Ri}}{\sqrt{\hat{\sigma}^2_{b_{Fi}} + \hat{\sigma}^2_{b_{Ri}}}} \tag{2.14}$$

where $\hat{b}$ is the maximum likelihood estimate of the item difficulty parameter for the focal and reference groups and $\hat{\sigma}^2$ is the variance component.

A second approach estimates the area between the ICCs of the focal and reference groups (Cohen & Kim, 1993; Raju, 1988, 1990; Raju, van der Linden, & Fleer, 1992). If no DIF is present, then the area between the ICCs is zero. When the item discrimination parameters $a_F$ and $a_R$ differ for the focal and reference groups but the pseudo-guessing parameters $c_F$ and $c_R$ are equal, the formula for calculating the difference between the item characteristic curves, also called the signed area, for the 3P model is

$$\text{Area} = (1 - c)(b_F - b_R) \tag{2.15}$$

where $c$ is the pseudo-guessing parameter with $c = c_F = c_R$, $b_F$ is the item difficulty for the focal group, and $b_R$ is the item difficulty for the reference group.
For the Rasch, or 1P IRT, model the area becomes

$$\text{Area} = b_F - b_R \tag{2.16}$$

Studies indicate that Lord's (1980) statistical test for DIF, based on the difference between the item difficulty parameters of the focal and reference groups, and the statistical test for DIF based on the area between the ICCs of the focal and reference groups produce similar results if the sample size and number of items are both large (Kim & Cohen, 1995; Shepard, Camilli, & Averill, 1981; Shepard, Camilli, & Williams, 1984).

Holland and Thayer (1988) demonstrated that the Mantel-Haenszel and item response theory models are equivalent under the following set of conditions:

1. All items follow the Rasch model;
2. All items, except the item under study, are free of DIF;
3. The matching variable includes the item under study; and
4. The data are random samples from the reference and focal groups.

Under the above set of conditions the total test score is a sufficient estimate for the latent ability parameter, $\theta$ (Lewis, 1993; Meredith & Millsap, 1992). Donoghue, Holland, and Thayer (1993) further demonstrated that the relationship between the Mantel-Haenszel and the two-parameter IRT model can be expressed by

$$\Delta_{MH} = -4a(b_F - b_R) \tag{2.17}$$

where $a$ is the estimate of the item discrimination parameter, $b_R$ is the estimate of the item difficulty parameter for the reference group, and $b_F$ is the estimate of the item difficulty parameter for the focal group. The relationship stated in equation 2.17 assumes all items except the item under study are free of DIF, the total test score is used as an estimate for $\theta$, and the total test score includes the item of interest.
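The two IRT-based DIF indices described in equations 2.14 through 2.16 are simple functions of the separately estimated item parameters. The following sketch assumes the standardized-difference form of Lord's statistic, in which the difference in difficulty estimates is divided by the square root of the sum of their sampling variances; the function names and the numeric values are hypothetical.

```python
import math


def lord_d(b_focal, b_reference, var_focal, var_reference):
    """Lord's statistic (equation 2.14), assuming the standardized-difference form.

    var_focal and var_reference are the sampling variances of the two
    difficulty estimates.
    """
    return (b_focal - b_reference) / math.sqrt(var_focal + var_reference)


def signed_area(b_focal, b_reference, c=0.0):
    """Signed area between the ICCs (equation 2.15).

    With c = 0 this is the Rasch case in equation 2.16.
    """
    return (1.0 - c) * (b_focal - b_reference)


# Hypothetical estimates: the item is harder for the focal group.
print(round(lord_d(0.5, 0.0, 0.02, 0.02), 3))
print(round(signed_area(0.5, 0.0, c=0.2), 3))
```

Both indices are zero when the focal and reference difficulty estimates coincide, which corresponds to the no-DIF case.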
The IRT approach for DIF detection can be expanded to polytomous items. The IRT approach in polytomous items uses the category response curve (CRC) for each response category. The approach used for the category response curves is similar to the approach used for the item characteristic curves in dichotomous items. The category response curves are estimated separately for the focal and reference groups. The presence of DIF in a response category means the parameters are different for the focal group and reference group and, therefore, the focal group has a different CRC than the reference group.

Multilevel Methods for Detecting DIF

Often, in social science research and educational studies, a natural hierarchical nesting of the data exists. This is also true for testing and evaluation, as item responses are naturally nested within persons and persons may be nested within groups, such as schools. Traditional DIF detection methods for both dichotomous and polytomous items ignore this natural hierarchical structure. Therefore, the DIF detection is reduced to a single-level analysis, and the influence that an institution, such as a school, may have on the item responses of its members is ignored (Kamata et al., 2005). Furthermore, the natural nesting that exists in educational data may cause a lack of statistical independence among study subjects.

The use of multilevel models for the purpose of detecting DIF in dichotomous and polytomous educational measurement data may be advantageous for several reasons. First, in social science and educational measurement data a natural nesting of the data often exists. Multilevel models allow researchers to account for the dependency among examinees that are nested within the same group.
Second, traditional methods of detecting DIF do not offer an explanation of the causes of the DIF. Researchers can use multilevel models to examine the effect of an individual-level or group-level characteristic variable on the performance of the examinees, as explanatory variables can be added to the individual-level or group-level equations to give reasons for the DIF. Third, traditional methods assume the degree of DIF is constant across group units. A multilevel random-effects model with three levels, however, allows the magnitude of DIF to vary across group units. Furthermore, individual-level or group-level characteristic variables can be added to the model to account for the variation in DIF among group units.

Recent research has demonstrated that traditional methods for conducting DIF analysis for both dichotomous and polytomous items can be expressed as multilevel models. Both the logistic regression methods and the IRT methods for detecting DIF in dichotomous and polytomous items have been formulated as multilevel models, and the dichotomous approaches are presented in the paragraphs that follow.

Logistic Regression

Swanson et al. (2002) proposed a multilevel logistic regression approach to analyzing DIF in dichotomous items in which persons are assumed to be nested within items. The two-level approach used the logistic regression DIF model proposed by Swaminathan and Rogers (1990) as the level 1, or person-level, model. Coefficients from the level 1 model were treated as random variables in the level 2, or item-level, model. Therefore, differences in variation among the items could be accounted for by the addition of explanatory variables in the level 2 model. Others (Adams & Wilson, 1996; Adams, Wilson, & Wang, 1997; Luppescu, 2002) have also investigated a multilevel logistic regression approach for the purpose of DIF detection.
The level 1 equation proposed by Swanson et al. (2002) for the purpose of detecting DIF in dichotomously scored item $j$ for person $i$ is formulated as a logistic regression equation:

$$\text{logit}[P(Y_{ij} = 1)] = b_{0j} + b_{1j} \cdot \text{proficiency} + b_{2j} \cdot \text{group} \tag{2.18}$$

where proficiency is a measure of ability and group is coded 0 for those persons in the reference group and 1 for those persons in the focal group. In the model, $b_{0j}$ is the item difficulty for the reference group, $b_{1j}$ is the item discrimination, and $b_{2j}$ is the difference in item difficulty between the reference and focal groups.

The level 2 equation considers the coefficients in the level 1 model as random variables with values that will be estimated from item characteristics included in the level 2 equations. The level 2 equation is formulated as:

$$
\begin{aligned}
b_{0j} &= G_{00} + U_{0j} \\
b_{1j} &= G_{10} + U_{1j} \\
b_{2j} &= G_{20} + G_{21} I_1 + G_{22} I_2 + G_{23} I_3 + \cdots + G_{2n} I_n + U_{2j}
\end{aligned}
\tag{2.19}
$$

where $G_{k0}$ is the grand mean of the $k$th level 1 coefficient, $U_{kj}$ represents the random variation in the $k$th level 1 coefficient, and $I_n$ is a dummy-coded item characteristic. If $U_{1j}$ is dropped from the model in 2.19, the item discriminations are forced to be equal and the resulting model is like a Rasch model.
Item Response Theory

Developments in multilevel modeling have made it possible to specify the relationship between item parameters and examinee performance within the multilevel modeling framework. Multilevel formulations of IRT models have been proposed for use in item analysis and DIF detection in both dichotomous and polytomous items. In 1998, Kamata made explicit connections between the hierarchical generalized linear model (HGLM) and the Rasch model to reformulate the Rasch model as a special case of the HGLM, which he called the one-parameter hierarchical generalized linear logistic model (1P-HGLLM). Kamata (1998, 2001) further demonstrated that the 1P-HGLLM could be formulated for use in a two-level hierarchical approach to item analysis for dichotomous items, where items are nested within people. Item and person parameters were estimated using the HLM software (Bryk, Raudenbush, & Congdon, 1996).

In Kamata's two-level hierarchical model, items are the level 1 units, which are naturally nested in persons, which are the level 2 units. The level 1 model, or item-level model, is a linear combination of predictors which can be expressed as

$$
\eta_{ij} = \log\left(\frac{p_{ij}}{1 - p_{ij}}\right)
= \beta_{0j} + \beta_{1j} X_{1j} + \beta_{2j} X_{2j} + \cdots + \beta_{(I-1)j} X_{(I-1)j}
= \beta_{0j} + \sum_{q=1}^{I-1} \beta_{qj} X_{qj}
\tag{2.20}
$$

where $\eta_{ij}$ is the logit, or log odds, of $p_{ij}$, which is the probability that person $j$ answers item $i$ correctly, and $X_{qj}$ is the $q$th dummy indicator variable for person $j$, with value 1 when $q = i$ and value 0 when $q \neq i$. In order to achieve full rank, one of the dummy indicator variables is dropped; therefore, there are $I - 1$ indicator variables, where $I$ is the total number of items. Item $I$ is coded 0 for all dummy codes and is called the comparison item.
Thus, the level 1 model for the $i$th item can be reduced to

$$\eta_{ij} = \beta_{0j} + \beta_{ij} \tag{2.21}$$

The coefficient $\beta_{0j}$ is the intercept term and represents the expected item effect of the comparison item for person $j$. The coefficient $\beta_{ij}$ represents the effect of the $i$th individual item compared to the comparison item. The level 2 model is the person-level model and is specified as:

$$
\begin{aligned}
\beta_{0j} &= \gamma_{00} + u_{0j} \\
\beta_{1j} &= \gamma_{10} \\
&\;\;\vdots \\
\beta_{(I-1)j} &= \gamma_{(I-1)0}
\end{aligned}
\tag{2.22}
$$

where $u_{0j}$, the person parameter, is the random component of $\beta_{0j}$ and is assumed to be normally distributed with a mean of 0 and variance $\tau$. Since the item parameters are assumed to be fixed across persons, $\beta_{1j}$ through $\beta_{(I-1)j}$ are modeled without a random component. The combined model for the $i$th item and $j$th person is

$$\eta_{ij} = \gamma_{00} + \gamma_{i0} + u_{0j} \tag{2.23}$$

The probability that the $j$th person answers the $i$th item correctly is

$$p_{ij} = \frac{1}{1 + \exp(-\eta_{ij})} \tag{2.24}$$

With the expression for $\eta_{ij}$ substituted in, the probability that the $j$th person answers the $i$th item correctly becomes

$$p_{ij} = \frac{1}{1 + \exp\{-[u_{0j} - (-\gamma_{i0} - \gamma_{00})]\}} \tag{2.25}$$

The above model is algebraically equivalent to the Rasch model (Kamata, 1998, 2001). In the above model, $u_{0j}$ corresponds to the person ability parameter $\theta_j$ of the Rasch model, and $-\gamma_{i0} - \gamma_{00}$ corresponds to the item difficulty parameter $b_i$.

Kamata (2002) added a third level to his two-level model to create a three-level hierarchical model. In the three-level model the level 1, or item-level, model for item $i$ nested in person $j$ nested in group $k$ is

$$\eta_{ijk} = \beta_{0jk} + \beta_{1jk} X_{1jk} + \beta_{2jk} X_{2jk} + \cdots + \beta_{(I-1)jk} X_{(I-1)jk} \tag{2.26}$$

where $i = 1, \ldots, I - 1$; $j = 1, \ldots, J$; and $k = 1, \ldots, K$. The level 2, or person-level, model is

$$
\begin{aligned}
\beta_{0jk} &= \gamma_{00k} + u_{0jk} \\
\beta_{1jk} &= \gamma_{10k} \\
&\;\;\vdots \\
\beta_{(I-1)jk} &= \gamma_{(I-1)0k}
\end{aligned}
\tag{2.27}
$$

where $\beta_{0jk}$ is assumed to be normally distributed with a mean of $\gamma_{00k}$ and a person-level variance component. The random component, $u_{0jk}$, is the deviation of the score for person $j$ in group $k$ from the intercept of group $k$. The effect of the dropped item in group $k$ is represented by $\gamma_{00k}$, and $\gamma_{i0k}$ represents the effect of item $i$ in group $k$ compared to the dropped item (Kamata, 2001). The level 3, or group-level, model is

$$
\begin{aligned}
\gamma_{00k} &= \pi_{000} + r_{00k} \\
\gamma_{10k} &= \pi_{100} \\
&\;\;\vdots \\
\gamma_{(I-1)0k} &= \pi_{(I-1)00}
\end{aligned}
\tag{2.28}
$$

where $r_{00k}$ is assumed to be normally distributed with a mean of 0 and a group-level variance component. The combined model for item $i$, person $j$, and group $k$ is

$$\eta_{ijk} = \pi_{000} + \pi_{i00} + r_{00k} + u_{0jk} \tag{2.29}$$

which can be written as

$$\eta_{ijk} = (r_{00k} + u_{0jk}) - (-\pi_{i00} - \pi_{000}) \tag{2.30}$$

Therefore, the probability that person $j$ in school $k$ will answer item $i$ correctly is

$$p_{ijk} = \frac{1}{1 + \exp\{-[(r_{00k} + u_{0jk}) - (-\pi_{i00} - \pi_{000})]\}} \tag{2.31}$$

where $(-\pi_{i00} - \pi_{000})$ is the item difficulty and $(r_{00k} + u_{0jk})$ is the person ability parameter. The random effect of the level 3 model, $r_{00k}$, is the average ability of students in the $k$th group. The random effect at the second level, $u_{0jk}$, represents the deviation in the ability of person $j$ from the average ability of all persons in group $k$. Therefore, the three-level model provides person and average group ability estimates. Kamata (2001) also extended the two-level and three-level models to latent regression models, with the advantage of adding person and/or group characteristic variables.

Kamata (2002) applied his two-level hierarchical model to the detection of DIF in dichotomously scored items. In Kamata's two-level hierarchical DIF model the level 1 model given in equation 2.20 remains the same.
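The algebraic equivalence between the combined two-level model in equations 2.23 through 2.25 and the Rasch model can be checked numerically. The sketch below is illustrative only: the function names and parameter values are hypothetical, and the three-level function simply adds the group effect $r_{00k}$ to the linear predictor as in equation 2.29.

```python
import math


def logistic(eta):
    """Inverse logit: convert a log odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def hglm_two_level(u_0j, gamma_00, gamma_i0):
    """Probability of a correct response under the combined model (2.23 to 2.25)."""
    eta_ij = gamma_00 + gamma_i0 + u_0j
    return logistic(eta_ij)


def rasch(theta_j, difficulty_i):
    """Rasch model probability for ability theta_j and item difficulty."""
    return logistic(theta_j - difficulty_i)


def hglm_three_level(u_0jk, r_00k, intercept_000, item_effect_i00):
    """Three-level analogue (2.29 to 2.31): person ability is r_00k + u_0jk."""
    return logistic(intercept_000 + item_effect_i00 + r_00k + u_0jk)


# Hypothetical parameter values, for illustration only.
u, g00, gi0 = 0.8, -0.3, 0.5
p_hglm = hglm_two_level(u, g00, gi0)
p_rasch = rasch(theta_j=u, difficulty_i=-(gi0 + g00))
print(math.isclose(p_hglm, p_rasch))
```

Setting the group effect to zero in the three-level function reproduces the two-level probability, which reflects the way the three-level model splits ability into a group average and a person deviation.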
However, the item parameters in the person-level model, or level 2 model, are decomposed into one or more group characteristic parameters. The purpose of the decomposition is to determine whether the item parameters function differently for different groups of examinees. The level 2 model for Kamata's DIF model is

$$
\begin{aligned}
\beta_{0j} &= \gamma_{00} + \gamma_{01} G_j + u_{0j} \\
\beta_{1j} &= \gamma_{10} + \gamma_{11} G_j \\
&\;\;\vdots \\
\beta_{(I-1)j} &= \gamma_{(I-1)0} + \gamma_{(I-1)1} G_j
\end{aligned}
\tag{2.32}
$$

where $G_j$, a group characteristic dummy variable, is assigned a 1 if the person is a member of the focal group and a 0 if the person is a member of the reference group. In the above level 2 model, the item effects, $\beta_{0j}$ to $\beta_{(I-1)j}$, are modeled to include a mean effect, $\gamma_{00}$ to $\gamma_{(I-1)0}$, and a group effect, $\gamma_{01}$ to $\gamma_{(I-1)1}$. The coefficient $\gamma_{01}$ represents the DIF common to all items, whereas the coefficient $\gamma_{i1}$ is the additional amount of DIF present in item $i$. The combined DIF model is

$$
\begin{aligned}
\eta_{ij} &= \gamma_{00} + \gamma_{01} G_j + u_{0j} + \gamma_{i0} + \gamma_{i1} G_j \\
&= u_{0j} + \gamma_{00} + \gamma_{i0} + (\gamma_{01} + \gamma_{i1}) G_j \\
&= u_{0j} - [-\gamma_{00} - \gamma_{i0} - (\gamma_{01} + \gamma_{i1}) G_j].
\end{aligned}
\tag{2.33}
$$

In the combined model, the term $-\gamma_{00} - \gamma_{i0} - (\gamma_{01} + \gamma_{i1})$ is the difficulty of item $i$ for the group labeled 1, and $-\gamma_{00} - \gamma_{i0}$ is the difficulty of item $i$ for the group labeled 0, for $i = 1, \ldots, I - 1$. For the comparison item, the term $-\gamma_{00} - \gamma_{01}$ is the difficulty for the group labeled 1, or focal group, and the term $-\gamma_{00}$ is the difficulty for the group labeled 0, or reference group. If any of the model estimates of $\gamma_{01} + \gamma_{i1}$ for items $i = 1, \ldots, I - 1$, or the estimate of $\gamma_{01}$ for the comparison item, are significantly different from zero, the item functions differently for the two groups, given that the groups have been conditioned on ability level. Therefore, an estimate that is statistically different from 0 indicates the item exhibits DIF.

The DIF model proposed by Kamata can be a valuable tool for detecting DIF in dichotomous items. First, the DIF model has a statistical test of significance.
Second, the DIF model provides a measure of the effect size of DIF through exponentiation of the estimates $\gamma_{01} + \gamma_{i1}$ for items $i = 1, \ldots, I - 1$ and $\gamma_{01}$ for the comparison item. And, third, the DIF model simultaneously estimates the DIF statistics for all items. However, Kamata's model is restricted to the detection of uniform DIF (Chaimongkol, 2005).

Cheong (2006), Kamata and Binici (2003), and Kamata, Chaimongkol, Genc, and Bilir (2005) expanded Kamata's two-level hierarchical random-effects DIF model to a three-level hierarchical random-effects DIF model. In the three-level model, one or more group characteristic variables are added to the equations in the level 3 model. The level 3 model necessary to extend the two-level model given in equation 2.32 is

$$
\begin{aligned}
\gamma_{00k} &= \pi_{000} + r_{00k} \\
\gamma_{10k} &= \pi_{100} \\
&\;\;\vdots \\
\gamma_{(I-1)1k} &= \pi_{(I-1)10} + \pi_{(I-1)11} X_{1k} + r_{(I-1)1k}
\end{aligned}
\tag{2.33}
$$

where $X_{1k}$ is a level 3 characteristic variable for the $k$th group. The addition of a level 3 characteristic variable allows for the investigation of the three-way interaction between the item difficulty, the group characteristic variable, and the level 3 characteristic variable.
The effect of this interaction for item $i$ is represented in the model by $\pi_{i11}$.

Kamata's two-level hierarchical model for detecting DIF in dichotomous items has been tailored for use with polytomous items by several researchers. Shin (2003), Williams (2003), Williams and Beretvas (2006), and Vaughn (2006) all proposed two-level fixed-effects hierarchical models for detecting DIF in polytomous items. Chu and Kamata (2005) expanded the two-level model for polytomous items to a three-level model for polytomous items.

Mantel-Haenszel Procedure

Since the Mantel-Haenszel statistic is conceptually simple, easy to use, and has both a test of significance and a measure of effect size, it has become one of the most widely used measures for detecting DIF in dichotomous items. As a result, the Mantel-Haenszel statistic has been researched extensively and its performance has been compared to many of the other DIF detection procedures. Furthermore, the Mantel-Haenszel statistic has been shown to be equivalent, under certain conditions, to both the logistic regression DIF detection procedure (Swaminathan & Rogers, 1990) and item response theory models (Holland & Thayer, 1988; Donoghue, Holland, & Thayer, 1993).
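The effect size measure referred to above can be sketched as the standard Mantel-Haenszel common odds ratio computed over the 2 x 2 tables formed at each level of the matching score. The sketch assumes this standard estimator is the quantity being discussed; the function name, the table layout, and the counts are hypothetical and serve only as an illustration.

```python
import math


def mantel_haenszel_odds_ratio(tables):
    """Mantel-Haenszel common odds ratio across matched 2 x 2 tables.

    Each table is a tuple (a, b, c, d) for one level of the matching score:
      a = reference group, correct      b = reference group, incorrect
      c = focal group, correct          d = focal group, incorrect
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        if n == 0:
            continue  # an empty score level contributes nothing
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0.0:
        raise ValueError("odds ratio is undefined for these tables")
    return numerator / denominator


# Hypothetical counts at two score levels, for illustration only.
tables = [(20, 10, 10, 20), (30, 10, 15, 20)]
alpha_mh = mantel_haenszel_odds_ratio(tables)
print(round(alpha_mh, 3), round(math.log(alpha_mh), 3))
```

With this table layout, a value greater than one means the odds of a correct response are higher for the reference group at matched score levels, and the natural logarithm of the estimate is the log odds ratio compared with the multilevel estimates later in the study.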
Multilevel models have been formulated for use in DIF detection in dichotomous items. Two very popular methods for detecting DIF in dichotomous items, the logistic regression and item response theory procedures, have multilevel equivalents. However, at this time, the most widely used measure for detecting DIF in dichotomous items, the Mantel-Haenszel procedure, has no multilevel equivalent.

A multilevel equivalent of the Mantel-Haenszel would be advantageous for several reasons. First, a multilevel Mantel-Haenszel model would allow for the extraction of an estimate of $\hat{\alpha}_{MH}$ that matches the estimate of $\hat{\alpha}_{MH}$ calculated using the formula in equation 2.1. In theory, an estimate of $\hat{\alpha}_{MH}$ can be calculated from the parameters estimated in the multilevel logistic regression and IRT models. In practice, however, the estimate of $\hat{\alpha}_{MH}$ calculated from the multilevel logistic regression and IRT models is not an exact match. Second, the reasons given for all multilevel DIF detection methods also apply. A natural nesting of the data often exists in educational testing situations, and a multilevel equivalent of the Mantel-Haenszel would allow researchers to account for the dependency among examinees that occurs when examinees are nested within the same group. The Mantel-Haenszel procedure for detecting DIF does not provide an explanation for the cause of the DIF. A multilevel Mantel-Haenszel approach would allow researchers to examine the effect of an individual-level or group-level characteristic variable on the performance of the examinees. The Mantel-Haenszel procedure assumes the degree of DIF is constant across group units. A multilevel Mantel-Haenszel model with three levels, however, would allow the magnitude of the DIF to differ across group units.

CHAPTER 3
METHODOLOGY

Overview of the Study

The assessment of DIF is an essential component of the validation of all forms of assessment and evaluation.
A multilevel structure often exists in educational settings, as students are nested within schools. Therefore, educational assessment data would also have a multilevel organization, with scores nested within students that are nested within schools. Recent research has demonstrated that multilevel modeling may be a useful approach for conducting DIF analysis. Multilevel models not only address the clustered characteristics of assessment data, but they also allow educational researchers to study the effect of a nesting variable on performance at the student and school level.

Some of the techniques used for DIF detection have also been formulated for use in multilevel modeling. Kamata (1998, 2001) demonstrated that item response theory could be incorporated into a multilevel logistic regression model, which led to the formulation of a two-level logistic regression approach for DIF detection in dichotomously scored items (Kamata, 2001, 2002). Kamata and Binici (2003) extended Kamata's two-level DIF detection model to a three-level DIF detection model for dichotomous items. Cheong (2006), Vaughn (2006), Williams (2003), and Williams and Beretvas (2006) expanded the three-level model for dichotomous items to a three-level model for polytomous items. Although great strides have been made in the use of multilevel modeling for the detection of DIF in both dichotomous and polytomous items, one of the most widely used methods for DIF detection in dichotomous items, the Mantel-Haenszel, is yet to be formulated as a multilevel approach for DIF detection.

Research Questions

This study addressed the following research questions:

1. Can the Mantel-Haenszel DIF detection procedure for dichotomous items be reformulated as a multilevel model where items are nested within individuals?
2. Is the log odds ratio of the reformulated multilevel Mantel-Haenszel approach for detecting DIF in dichotomous items equivalent to the log odds ratio of the Mantel-Haenszel approach for detecting DIF in dichotomous items for items that are nested within individuals?
3. How does the reformulated multilevel Mantel-Haenszel approach for detecting DIF in dichotomous items compare to the Mantel-Haenszel approach for detecting DIF in dichotomous items for items that are nested within individuals?

Model Specification

The Mantel-Haenszel and reformulated multilevel Mantel-Haenszel models for dichotomously scored items discussed in Chapter 2 will be used to detect DIF in dichotomous items that are nested within individuals. The results from each method will be compared on the basis of parameter recovery, Type I error rates, and power. Since the primary focus of this study is the multilevel model approach to DIF detection, a review of the two-level multilevel model for detecting DIF in dichotomously scored items, based on the 1P-HGLLM, is included in this section. A discussion of the reformulation of the Mantel-Haenszel procedure for detecting DIF in dichotomous items as a multilevel model is also included in the section.

Two-level Multilevel Model for Dichotomously Scored Data

The two-level multilevel HGLLM model for detecting DIF in dichotomously scored items that was discussed in Chapter 2 will be reviewed in the paragraphs that follow. The level 1, or item-level, model for the two-level multilevel model for DIF detection in items that are dichotomously scored is

$$
\eta_{ij} = \log\left(\frac{p_{ij}}{1 - p_{ij}}\right)
= \beta_{0j} + \beta_{1j} X_{1j} + \beta_{2j} X_{2j} + \cdots + \beta_{(I-1)j} X_{(I-1)j}
= \beta_{0j} + \sum_{q=1}^{I-1} \beta_{qj} X_{qj}
\tag{3.1}
$$

where $\eta_{ij}$ is the logit, or log odds, of the probability that person $j$ answers item $i$ correctly, and $X_{qj}$ is the $q$th dummy indicator variable for person $j$, with value 1 when $q = i$ and value 0 when $q \neq i$. For the comparison item, $X_{qj}$ equals 0.
The coefficient $\beta_{0j}$ represents the expected item effect of the comparison item for person $j$, and the coefficient $\beta_{ij}$ represents the effect of the $i$th individual item compared to the comparison item. The level 2 model is the person-level model and is specified as:

$$
\begin{aligned}
\beta_{0j} &= \gamma_{00} + \gamma_{01} G_j + u_{0j} \\
\beta_{1j} &= \gamma_{10} + \gamma_{11} G_j \\
&\;\;\vdots \\
\beta_{(I-1)j} &= \gamma_{(I-1)0} + \gamma_{(I-1)1} G_j
\end{aligned}
\tag{3.2}
$$

where $G_j$, a group characteristic dummy variable, is assigned a 1 if the person is a member of the focal group and a 0 if the person is a member of the reference group. In the above level 2 DIF detection model, the item effects, $\beta_{0j}$ to $\beta_{(I-1)j}$, are modeled to include a mean effect, $\gamma_{00}$ to $\gamma_{(I-1)0}$, and a group effect, $\gamma_{01}$ to $\gamma_{(I-1)1}$. The coefficient $\gamma_{01}$ represents the DIF common to all items, whereas the coefficient $\gamma_{i1}$ is the additional amount of DIF present in item $i$. In the model, $u_{0j}$ is the random component of $\beta_{0j}$ and is assumed to be normally distributed with a mean of 0 and variance $\tau$. Since the item parameters are assumed to be fixed across persons, $\beta_{1j}$ through $\beta_{(I-1)j}$ are modeled without a random component.

The level 1 and level 2 DIF detection models can be combined to form a two-level DIF detection model

$$\eta_{ij} = u_{0j} - [-\gamma_{00} - \gamma_{i0} - (\gamma_{01} + \gamma_{i1}) G_j]. \tag{3.3}$$

In the combined model, the term $-\gamma_{00} - \gamma_{i0} - (\gamma_{01} + \gamma_{i1})$ is the difficulty of item $i$ for the group labeled 1, or focal group, and $-\gamma_{00} - \gamma_{i0}$ is the difficulty of item $i$ for the group labeled 0, or reference group. For the comparison item, the term $-\gamma_{00} - \gamma_{01}$ is the difficulty for the focal group and the term $-\gamma_{00}$ is the difficulty for the reference group. Differential item functioning is indicated if any of the model estimates of $\gamma_{01} + \gamma_{i1}$ for items $i = 1, \ldots, I - 1$, or the estimate of $\gamma_{01}$ for the comparison item, are significantly different from zero.
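The interpretation of the combined model in equation 3.3 can be illustrated with a short sketch. The functions compute the group-specific item difficulty and the resulting response probability from given coefficient values; the function names and all coefficient values are hypothetical and are not estimates from this study.

```python
import math


def item_difficulty(gamma_00, gamma_i0, gamma_01, gamma_i1, group):
    """Difficulty of item i for a group (0 = reference, 1 = focal), from equation 3.3."""
    return -gamma_00 - gamma_i0 - (gamma_01 + gamma_i1) * group


def prob_correct(u_0j, gamma_00, gamma_i0, gamma_01, gamma_i1, group):
    """Probability that a person with ability u_0j answers item i correctly."""
    eta_ij = u_0j - item_difficulty(gamma_00, gamma_i0, gamma_01, gamma_i1, group)
    return 1.0 / (1.0 + math.exp(-eta_ij))


def dif_effect(gamma_01, gamma_i1):
    """Total group effect for item i: common DIF plus item-specific DIF."""
    return gamma_01 + gamma_i1


# Hypothetical coefficients: item i is harder for the focal group.
coefs = dict(gamma_00=0.2, gamma_i0=-0.4, gamma_01=0.0, gamma_i1=-0.5)
b_ref = item_difficulty(group=0, **coefs)
b_foc = item_difficulty(group=1, **coefs)
print(round(b_ref, 3), round(b_foc, 3), round(math.exp(dif_effect(0.0, -0.5)), 3))
```

The difference between the focal and reference difficulties equals the negative of the group effect, so two examinees of equal ability have different response probabilities exactly when that effect is nonzero.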
Mantel-Haenszel Multilevel Model for Dichotomously Scored Data

Swaminathan and Rogers (1990) demonstrated that the Mantel-Haenszel procedure for dichotomous items is based on a logistic regression model in which the ability variable is a discrete, observed score and there is no interaction between group and ability level. They started from the logistic regression model stated in equation 2.6 and restated here,

$$\ln\left(\frac{p_j}{1-p_j}\right) = \beta_0 + \beta_1 X_j + \beta_2 G_j + \beta_3 (XG)_j \tag{3.4}$$

where $X_j$ is the matching criterion score, or total score, for individual j, $G_j$ represents group membership for individual j, and $(XG)_j$ is the interaction between ability and group membership. They showed that this model can be written as a logistic regression model in which the group coefficient is equivalent to the Mantel-Haenszel log odds ratio if $X_j$ is replaced by I discrete ability categories and the interaction term $(XG)_j$ is removed. The resulting equation is

$$\eta_{ij} = \beta_0 + \sum_{k=1}^{I}\beta_k X_k + \tau G_j. \tag{3.5}$$

In the above model, $X_k$ represents the discrete ability categories $1, 2, \ldots, I$, where I is the total number of items. $X_k$ is coded 1 for person j if person j is a member of ability level k, meaning that person j's total score is equal to k. If person j is not a member of ability level k, then $X_k$ is coded 0. $X_k$ is coded 0 for all persons with a total score of 0. In the model, $\tau$ is the coefficient of the group variable and is equivalent to the log odds ratio of the Mantel-Haenszel.

The logistic regression model stated in equation 3.4 (Swaminathan & Rogers, 1990) can be embedded in the multilevel model for detecting DIF in dichotomous items to create a multilevel approach to DIF detection. To embed the logistic regression, the level 1 model would remain the same.
The change would occur in the level 2 model. The level 2, or person-level, model would be

$$\begin{aligned}
\beta_{0j} &= \gamma_{00} + \gamma_{01}\,Ability_j + \gamma_{02}G_j + u_{0j}\\
\beta_{1j} &= \gamma_{10} + \gamma_{11}\,Ability_j + \gamma_{12}G_j\\
\beta_{2j} &= \gamma_{20} + \gamma_{21}\,Ability_j + \gamma_{22}G_j\\
&\;\;\vdots\\
\beta_{(I-1)j} &= \gamma_{(I-1)0} + \gamma_{(I-1)1}\,Ability_j + \gamma_{(I-1)2}G_j.
\end{aligned} \tag{3.6}$$

In the above level 2 model, $Ability_j$ is the total score and $G_j$ is the group indicator variable, coded 1 for the focal group and 0 for the reference group.

For item i, the combined model can be written as

$$\eta_{ij} = u_{0j} + (\gamma_{01} + \gamma_{i1})\,Ability_j + \gamma_{00} + \gamma_{i0} + (\gamma_{02} + \gamma_{i2})G_j. \tag{3.7}$$

Using the combined equation, the difficulty for item i for a person in the reference group is

$$-\gamma_{00} - \gamma_{i0} \tag{3.8}$$

and the difficulty for item i for a person in the focal group is

$$-\gamma_{00} - \gamma_{i0} - \gamma_{02} - \gamma_{i2}. \tag{3.9}$$

Applying the findings of Holland and Thayer (1988) to equations 3.8 and 3.9 yields

$$(\gamma_{02} + \gamma_{i2}). \tag{3.10}$$

Therefore, an estimate of the DIF effect size obtained by logistic regression can be recovered through a multilevel logistic regression DIF model by

$$\gamma_{02} + \gamma_{i2} \tag{3.11}$$

where $\gamma_{02}$ and $\gamma_{i2}$ are the coefficients for the group variable.

The multilevel reformulation of the Mantel-Haenszel procedure for detecting DIF in dichotomous items uses the level 1 DIF detection model given in equation 3.1 with no changes. The change that is necessary to create a multilevel reformulation of the Mantel-Haenszel is the addition of discrete ability level indicators to the level 2 DIF detection model given in equation 3.2. The level 2 model given in equation 3.2 thus becomes

$$\begin{aligned}
\beta_{0j} &= \gamma_{00} + \sum_{k=1}^{I}\gamma_{0k}A_k + \gamma_{0(I+1)}G_j + u^{*}_{0j}\\
\beta_{1j} &= \gamma_{10} + \sum_{k=1}^{I}\gamma_{1k}A_k + \gamma_{1(I+1)}G_j\\
&\;\;\vdots\\
\beta_{(I-1)j} &= \gamma_{(I-1)0} + \sum_{k=1}^{I}\gamma_{(I-1)k}A_k + \gamma_{(I-1)(I+1)}G_j
\end{aligned} \tag{3.12}$$

where $u^{*}_{0j}$ denotes the person-level residual of this model. In the above model, $A_k$ represents the discrete ability level categories $1, 2, \ldots, I$, where I is the total number of items. $A_k$ is coded 1 for person j if person j is a member of ability level k, meaning that person j's matching criterion score is equal to k. If person j is not a member of ability level k, then $A_k$ is coded 0.
$A_k$ is coded 0 for all persons with a matching criterion score of 0. For item i and ability level k, the combined model can be written as

$$\eta_{ij} = u^{*}_{0j} + (\gamma_{0k} + \gamma_{ik})A_k + \gamma_{00} + \gamma_{i0} + (\gamma_{0(I+1)} + \gamma_{i(I+1)})G_j \tag{3.13}$$

where $u^{*}_{0j}$ is the person-level residual. Using the combined equation, the difficulty for item i for a person in the reference group is

$$-\gamma_{00} - \gamma_{i0} \tag{3.14}$$

and the difficulty for item i for a person in the focal group is

$$-\gamma_{00} - \gamma_{i0} - \gamma_{0(I+1)} - \gamma_{i(I+1)}. \tag{3.15}$$

Applying the findings of Holland and Thayer (1988) to equations 3.14 and 3.15 yields

$$(\gamma_{0(I+1)} + \gamma_{i(I+1)}) \tag{3.16}$$

where $\gamma_{0(I+1)}$ and $\gamma_{i(I+1)}$ are the coefficients of the group variable. Therefore, the log odds ratio of the Mantel-Haenszel procedure for detecting DIF in dichotomously scored items, when the data fit the Rasch model and represent items nested within individuals, can be recovered from an HGLM by the equation

$$\ln \alpha_{MH} = \gamma_{0(I+1)} + \gamma_{i(I+1)}, \tag{3.17}$$

where $\gamma_{0(I+1)}$ and $\gamma_{i(I+1)}$ are the coefficients of the group variable in the multilevel model.

The multilevel Mantel-Haenszel model can be used to flag items for DIF. The null hypothesis of no DIF would be tested using the standard t test for the coefficient of the group variable for item i (Kim, 2003). This test is part of the standard HGLM output. A rejection of the null hypothesis means that the item is functioning differently for the focal and reference groups and that an investigation into item bias may be warranted.
In equation 3.13, $u^{*}_{0j}$ is a residual and, as such, represents an adjustment to the ability parameter, $u_{0j}$, of the DIF model stated in equation 3.2. Fischer (1973) demonstrated that the person ability parameter of the Rasch model could be decomposed into a linear combination of one or more time-varying parameters. Fischer's decomposition of the ability parameter for item i is given by

$$a_i = \sum_{l=1}^{p} w_{il}\,\eta_l + c, \tag{3.18}$$

where $a_i$ is the decomposed person ability parameter, $w_{il}$ is a weight, such as a coefficient, for parameter $\eta_l$, and c is a normalization constant. The decomposition allows person parameters to be added to the Rasch model as linear constraints. Kamata (1998) applied Fischer's finding to the 1-P HGLLM model with a level 2 predictor to show that the residual for the 1-P HGLLM model without a level 2 predictor, $u_{0j}$, can be expressed as a linear combination of the level 2 predictors added to the model and $u^{*}_{0j}$. Thus, by combining the findings of Fischer (1973) and Kamata (1998), the relationship of $u^{*}_{0j}$ to $u_{0j}$ for the multilevel reformulation of the Mantel-Haenszel given in equation 3.13 can be expressed as

$$u_{0j} = u^{*}_{0j} + (\gamma_{0k} + \gamma_{ik})A_k. \tag{3.19}$$

Therefore, $u^{*}_{0j}$ represents an adjustment to the discrete ability score categories used as ability measures in the Mantel-Haenszel DIF detection procedure.
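The coding rule for the discrete ability level indicators used in the multilevel Mantel-Haenszel model can be illustrated with a few lines of code. This is a minimal sketch in Python rather than the software used in the study; the function name `ability_indicators` is hypothetical and is introduced only for illustration.

```python
def ability_indicators(total_score, n_items):
    """Return [A_1, ..., A_I] for one person.

    A_k is 1 only when the person's matching score equals k, so a
    person with a total score of 0 receives zeros on every indicator.
    """
    if not 0 <= total_score <= n_items:
        raise ValueError("total_score must lie between 0 and n_items")
    return [1 if total_score == k else 0 for k in range(1, n_items + 1)]


# Example with a hypothetical 5-item test.
print(ability_indicators(3, 5))  # [0, 0, 1, 0, 0]
print(ability_indicators(0, 5))  # [0, 0, 0, 0, 0]
```

Because a score of 0 is the category with all indicators equal to 0, it acts as the baseline ability level, which is why I indicators are enough to represent I + 1 possible total scores.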
Simulation Design

This simulation study manipulated several factors, including sample size, number of items, magnitude of DIF, and ability distribution, in order to explore the performance of the reformulated multilevel Mantel-Haenszel and the Mantel-Haenszel methods of DIF detection for dichotomous items. The results from each method will be compared on the basis of parameter recovery, empirical Type I error rates, and power. To simulate a two-level multilevel model, item scores will be simulated for subjects. The simulation will be constructed using the R statistical program (R Development Core Team, 2005).

Simulation Conditions for Item Scores

The dichotomous responses for the items, or level 1 units, were simulated to fit the Rasch model. In the Rasch model, the probability of a specific response (e.g., a correct or incorrect answer) is modeled as a function of the difference between the person parameter and the item parameter. Given the Rasch model, the probability that subject j will have a correct response for item i is given by the equation

$$P(X_{ij} = 1 \mid \theta_j) = \frac{\exp(\theta_j - \delta_i)}{1 + \exp(\theta_j - \delta_i)} \tag{3.20}$$

where $\theta_j$, the person parameter, represents the ability level for subject j and $\delta_i$, the item parameter, is the difficulty parameter for item i. The equation in 3.20 can also be written as

$$P(X_{ij} = 1 \mid \theta_j) = \frac{1}{1 + \exp[-(\theta_j - \delta_i)]}. \tag{3.21}$$

Probabilities were converted to item responses by comparing each probability to a random number between zero and one generated from the uniform probability distribution. If the probability was greater than the random number, the response was scored as correct (i.e., 1), and if the probability was less than or equal to the random number, the response was scored as incorrect (i.e., 0).
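The response-generation step described above can be sketched in code. The example below is written in Python using only the standard library, not the R program used for the study, and the seed, the number of persons, and the three item difficulties are hypothetical values chosen purely for illustration.

```python
import math
import random


def rasch_probability(theta, difficulty):
    """P(X = 1 | theta) under the Rasch model, in the form of equation 3.21."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def simulate_responses(abilities, difficulties, rng):
    """Return a persons-by-items list of 0/1 responses."""
    responses = []
    for theta in abilities:
        row = []
        for b in difficulties:
            p = rasch_probability(theta, b)
            u = rng.random()  # uniform draw on [0, 1)
            # Scored correct only when the probability exceeds the draw.
            row.append(1 if p > u else 0)
        responses.append(row)
    return responses


rng = random.Random(2009)  # arbitrary seed, for reproducibility only
abilities = [rng.gauss(0.0, 1.0) for _ in range(250)]  # N(0, 1) abilities
difficulties = [-1.0, 0.0, 1.0]  # hypothetical item difficulties
data = simulate_responses(abilities, difficulties, rng)
```

Passing a separate list of difficulties for each group is one way such a sketch could be extended to the focal and reference groups.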
DIF was introduced by changing the item difficulty parameters for the focal group using the formula

$$\delta_{F_i} = \delta_{R_i} + d_i \tag{3.22}$$

where $\delta_{F_i}$ is the item parameter for the focal group, $\delta_{R_i}$ is the item parameter for the reference group, and $d_i$ is the magnitude of the DIF for the ith item. Therefore, for the focal group, the equation in 3.21 becomes

$$P(X_{ij} = 1 \mid \theta_j) = \frac{\exp[\theta_j - (\delta_{R_i} + d_i)]}{1 + \exp[\theta_j - (\delta_{R_i} + d_i)]} \tag{3.23}$$

or

$$P(X_{ij} = 1 \mid \theta_j) = \frac{\exp(\theta_j - \delta_{F_i})}{1 + \exp(\theta_j - \delta_{F_i})}. \tag{3.24}$$

Items were simulated for 2 different levels of uniform DIF: 0.20 and 0.40. Therefore, in equation 3.22, the value of $d_i$ was either 0.20 or 0.40, and the corresponding item difficulty parameter for the focal group was either 0.20 or 0.40 larger than the item difficulty for the reference group.

Items were simulated under varying percentages of DIF items. Studies have shown that larger proportions of DIF items may result in contamination of the matching variable, thus resulting in increased Type I error rates (French & Miller, 2007; Miller & Oshima, 1992). The percentage of DIF items is generally between 5 and 20 percent. Therefore, 3 different conditions were simulated for the number of DIF items: 0%, 10%, and 20%.

To investigate the effect of the length of the test on the ability of the method to detect DIF, 2 different test lengths were simulated: 20 items and 40 items. Therefore, for the level 1, or item-level, units, three conditions were manipulated. These conditions are summarized in Table 3-1.
Table 3-1. Generating conditions for the items

Item Condition          Description
Magnitude of DIF        0.2, 0.4
Concentration of DIF    0%, 10%, 20%
Type of DIF             Uniform
Number of Items         20, 40

Simulation Conditions for Subjects

DIF exists when subjects from two different groups have different response probabilities on an item even though the subjects in the 2 groups have the same ability level. However, research indicates that a difference in the ability distributions of the focal and reference groups affects the performance of certain DIF detection methods, such as the logistic regression method (Jodoin & Gierl, 2001). Therefore, 2 conditions were simulated for the purpose of assessing a method's ability to properly flag items when ability distributions differ. First, subjects were simulated with no ability difference between the focal and reference groups. The ability distributions for both groups were simulated to fit a standard normal distribution (i.e., N(0, 1)). For the second case, subjects were simulated with a one standard deviation difference in means between the focal and reference groups. The focal group was simulated to fit a normal distribution with mean -1 and standard deviation 1 (i.e., N(-1, 1)), while the reference group was simulated to fit a standard normal distribution. A difference of one standard deviation in the means was selected because it approximates what is seen in real testing situations and has been used in prior DIF simulation studies (Clauser & Mazor, 1993; Cohen & Kim, 1993; French & Miller, 2007; Narayanan & Swaminathan, 1994; Roussos & Stout, 1996).

No theoretical guidelines exist about the number of subjects necessary for parameter estimation.
However, Raudenbush and Bryk (2002) recommend between 5 and 200 subjects per level 3 unit, and Mok (1995) suggests that the number of level 2 units (subjects) should be as large as the number of level 1 units (items) in order to have a two-level model with less bias. For certain DIF detection methods, power increases as the sample size increases. This is true for the logistic regression approach to detecting DIF (Rogers & Swaminathan, 1993; Swaminathan & Rogers, 1998). Therefore, data were simulated to approximate small and large sample sizes: n=250 and n=500. For both cases, the number of subjects was divided equally among the focal and reference groups.

Various factors were manipulated for this study, including magnitude of DIF, percentage of DIF items, number of items, ability distribution, and sample size. However, only one type of DIF, uniform DIF, was considered. A summary of the simulation design is provided in Table 3-2.

Analysis of the Data

The reformulated multilevel Mantel-Haenszel model will be analyzed using hierarchical generalized linear models, or HGLM. HGLM is incorporated in the HLM program (Bryk, Raudenbush & Congdon, 1996). The HGLM program is a combination of generalized linear models (GLM) and hierarchical linear models (HLM). The estimation procedures of the HLM are performed both between and within the GLM and HLM procedures, resulting in what Raudenbush (1995) refers to as a doubly iterative algorithm. HLM is a macro procedure; GLM is a micro procedure.
In GLM, the penalized quasi-likelihood (PQL) is maximized in order to obtain estimates of the linearized dependent variables, $Z_{ij}$, and the weights, $w_{ij}$, where $Z_{ij}$ is a function of $p_{ij}$, $w_{ij}$, and $u_{0j}$ (equation 3.23) and

$$w_{ij} = p_{ij}(1 - p_{ij}). \tag{3.24}$$

In HGLM, the level 2 parameters are estimated using two different procedures. The Empirical Bayes (EB) method is used to estimate $u_{0j}$. Generalized least squares, or GLS, is used to estimate the $\gamma$s. The HGLM procedure produces a joint posterior distribution of level 1 and level 2 parameters given a variance-covariance matrix based on a normal approximation to the restricted likelihood, or PQL.

Table 3-2. Simulation design

Variable                   Description
Magnitude of DIF           0.2, 0.4
Percentage of DIF Items    0%, 10%, 20%
Type of DIF                Uniform
Ability Distribution       Both N(0,1); Focal N(-1,1), Reference N(0,1)
Number of Items            20, 40
Number of Subjects         250, 500

CHAPTER 4
RESULTS

The main purpose of this study was to determine whether a multilevel equivalent to the Mantel-Haenszel method for detecting DIF in dichotomously scored items could be formulated. In order to do this, it was first necessary to establish that a multilevel model could be used to recover the Mantel-Haenszel log odds ratio, or $\ln \alpha_{MH}$. Second, it was necessary to confirm that the performance of the multilevel equivalent of the Mantel-Haenszel would be at least equal to that of the Mantel-Haenszel.

In this chapter, examples that illustrate the parameter recovery of both the logistic regression method proposed by Swaminathan and Rogers (1990) and the Mantel-Haenszel DIF log odds ratio using the HGLM method for detecting DIF are presented first. The illustrative examples provide evidence to support the first two research questions. Second, results from a simulation study for one set of conditions are provided to give additional evidence for the ability of the Multilevel Mantel-Haenszel to recover the log odds ratio of the Mantel-Haenszel.
Third, results of the simulation study that compared the Type I error rates and power of the Mantel-Haenszel and Multilevel Mantel-Haenszel methods for detecting DIF are presented. The simulation study results support the third research question.

Results

Illustrative Examples

The first two research questions posited that (1) the Mantel-Haenszel could be reformulated as a multilevel model and (2) the log odds ratio of the Mantel-Haenszel could be recovered through the use of a multilevel model. In this section, an example is presented to provide support for these research questions. Because the multilevel equivalent to the Mantel-Haenszel is based on the logistic regression equivalent to the Mantel-Haenszel (Swaminathan & Rogers, 1990), an example is first presented that illustrates the ability of a multilevel model to recover the logistic regression measure of DIF.

Simulation Design

In the examples, in order to demonstrate the parameter recovery capability of the multilevel approach, dichotomous responses for a 20-item test were simulated for 500 persons. The responses for the 20 items were simulated using the Rasch IRT model under the following conditions. The total number of examinees, 500, was split equally among the focal and reference groups, with n=250 for each group. The distributions of the ability levels for both the focal and reference groups were simulated to be normal with mean 0 and standard deviation 1. The item difficulties for the 20 items were simulated to fit a normal distribution and are provided in Table 4-1. Ten percent of the items, or 2 items, were simulated to contain DIF with a magnitude of 0.4. These two items were items 3 and 13. Both items were simulated to favor the focal group. All other items were simulated to be DIF free.
Parameter recovery for the logistic regression model

The logistic regression model for detecting uniform DIF in each of the items in the 20-item example has the form

$$\ln\left(\frac{p_j}{1-p_j}\right) = \beta_0 + \beta_1 X_j + \beta_2 G_j \tag{4.1}$$

where $\ln\left(\frac{p_j}{1-p_j}\right)$ is the log odds of person j, $j = 1, 2, \ldots, 500$, answering item i, $i = 1, 2, \ldots, 20$, correctly, $X_j$ is the total score, or sum of the correct responses, for person j, and $G_j$ is the dummy coded group membership variable, coded 0 for the reference group and 1 for the focal group. The coefficient of the group membership variable, $\beta_2$, is the measure of DIF. To estimate the measure of DIF for each of the items 1 through 20, it is necessary to construct 20 logistic regression equations.

Table 4-1. Item parameters for the illustrative example

Item Number    Item Parameter
 1             0.528
 2             1.186
 3             0.329
 4             1.187
 5             0.124
 6             2.292
 7             0.518
 8             0.111
 9             0.353
10             1.472
11             0.452
12             0.835
13             0.131
14             0.419
15             0.276
16             0.319
17             1.181
18             1.154
19             0.126
20             0.084

To confirm the recovery of the parameters of the logistic regression model for detecting uniform DIF (Swaminathan & Rogers, 1990) by a multilevel logistic regression model, a two-level multilevel logistic regression model was applied to the simulated data. The level 1, or item-level, model for the 20-item example was

$$\eta_{ij} = \log\left(\frac{p_{ij}}{1-p_{ij}}\right) = \beta_{0j} + \beta_{1j}X_{1j} + \beta_{2j}X_{2j} + \beta_{3j}X_{3j} + \cdots + \beta_{19j}X_{19j}. \tag{4.2}$$

The level 2, or person-level, model was

$$\begin{aligned}
\beta_{0j} &= \gamma_{00} + \gamma_{01}\,Ability_j + \gamma_{02}G_j + u_{0j}\\
\beta_{1j} &= \gamma_{10} + \gamma_{11}\,Ability_j + \gamma_{12}G_j\\
\beta_{2j} &= \gamma_{20} + \gamma_{21}\,Ability_j + \gamma_{22}G_j\\
&\;\;\vdots\\
\beta_{19j} &= \gamma_{19,0} + \gamma_{19,1}\,Ability_j + \gamma_{19,2}G_j.
\end{aligned} \tag{4.3}$$

In the above models, $\gamma_{02} + \gamma_{i2}$ is the multilevel equivalent of the logistic regression measure of DIF for items 1 through 19. The measure for item 20 is $\gamma_{02}$. The models as entered into HLM are given in Figure 4-1. An excerpt from the HLM output for the 20-item model is provided in Figure 4-2. The logistic regression DIF statistic estimated for item 20 is the coefficient for the variable Group, or $\gamma_{02}$, and, according to Figure 4-2, is equal to 0.059.
The measure of DIF for item i, $i = 1, 2, \ldots, 19$, is equal to $\gamma_{02} + \gamma_{i2}$. From the HLM output, the measure of DIF for item 1 is obtained by adding the group coefficient for item 1 (G12, 0.203 in absolute value) to the group coefficient for the intercept (G02, 0.059 in absolute value); the two coefficients have opposite signs, so the sum is 0.144 in absolute value. Table 4-2 contains the results for both the multilevel model analyzed using HLM and the logistic regression model analyzed using SPSS 16.0. The results illustrate the ability of the multilevel logistic model to completely recover the measure of DIF for all 20 items.

Parameter recovery of the Mantel-Haenszel log odds ratio

To illustrate the ability of the Multilevel Mantel-Haenszel model to recover the Mantel-Haenszel log odds ratio, the logistic regression model with discrete ability levels (Swaminathan & Rogers, 1990) was embedded in the HGLM DIF detection model proposed by Kamata (1999, 2001). The logistic regression model with discrete ability levels for each of the 20 items in the example is

$$\ln\left(\frac{p_j}{1-p_j}\right) = \beta_0 + \beta_1 A_1 + \beta_2 A_2 + \cdots + \beta_{20}A_{20} + \beta_{21}G_j. \tag{4.4}$$

$G_j$ is the dummy coded group membership variable, coded 0 for the reference group and 1 for the focal group, and $A_1$ through $A_{20}$ are the 20 discrete ability levels that correspond to the total scores, or sums of the correct responses, of 1 through 20. Therefore, if the total score for person j was 19, then $A_{19}$ would be coded 1 and all other ability levels would be coded 0 for person j. For a total score of 0, all ability levels were coded 0. The coefficient of the group membership variable, $\beta_{21}$, is the logistic regression equivalent to the Mantel-Haenszel log odds ratio.

The model stated in equation 4.4 was embedded in a multilevel model where the level 1, or item-level, model was

$$\eta_{ij} = \log\left(\frac{p_{ij}}{1-p_{ij}}\right) = \beta_{0j} + \beta_{1j}X_{1j} + \beta_{2j}X_{2j} + \beta_{3j}X_{3j} + \cdots + \beta_{19j}X_{19j} \tag{4.5}$$

and the level 2, or person-level, model was

$$\begin{aligned}
\beta_{0j} &= \gamma_{0,0} + \gamma_{0,1}\,Level1_j + \gamma_{0,2}\,Level2_j + \cdots + \gamma_{0,20}\,Level20_j + \gamma_{0,21}G_j + u_{0j}\\
\beta_{1j} &= \gamma_{1,0} + \gamma_{1,1}\,Level1_j + \gamma_{1,2}\,Level2_j + \cdots + \gamma_{1,20}\,Level20_j + \gamma_{1,21}G_j\\
\beta_{2j} &= \gamma_{2,0} + \gamma_{2,1}\,Level1_j + \gamma_{2,2}\,Level2_j + \cdots + \gamma_{2,20}\,Level20_j + \gamma_{2,21}G_j\\
&\;\;\vdots\\
\beta_{19j} &= \gamma_{19,0} + \gamma_{19,1}\,Level1_j + \gamma_{19,2}\,Level2_j + \cdots + \gamma_{19,20}\,Level20_j + \gamma_{19,21}G_j.
\end{aligned} \tag{4.6}$$

In model 4.6, $Levelk_j$ is the indicator of discrete ability level k for person j. Using the models stated in 4.6, the multilevel equivalent of the log odds ratio for the Mantel-Haenszel was estimated for items 1 through 19 as $\gamma_{0,21} + \gamma_{i,21}$. The multilevel equivalent for item 20, $\gamma_{0,21}$, was obtained directly from the HLM output. An excerpt from the HLM model is provided in Figure 4-3. A sample of the results from the HLM output for the 20-item multilevel model is provided in Figure 4-4. The Mantel-Haenszel log odds ratio for item 20 was the coefficient for the Group variable in the equation for the intercept and, according to Figure 4-4, was estimated to be equal to 0.119. The Mantel-Haenszel log-odds ratio for item i is estimated by adding the coefficient for the Group variable in the equation for item i to the coefficient for the Group variable for item 20. Therefore, for item 1, the Mantel-Haenszel log-odds ratio was estimated by adding the group coefficients G121 (0.127 in absolute value) and G021 (0.119 in absolute value), which have opposite signs, giving 0.008 in absolute value. Table 4-3 illustrates the ability of the multilevel model to recover the Mantel-Haenszel log-odds ratio for all 20 items. Estimates of the Mantel-Haenszel log-odds ratio were obtained through SPSS version 16.0.
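For readers who want to see how the benchmark value is formed, the Mantel-Haenszel common log odds ratio for one item can be computed directly from the 2 x 2 tables at each matching score. The sketch below uses the standard Mantel-Haenszel estimator, written in Python rather than the SPSS procedure used in the study; the function name and the tiny data set are hypothetical, and the sign of the result depends on which group is placed in the numerator (here the reference group), which may differ from the focal-coded group coefficient in the multilevel model.

```python
import math
from collections import defaultdict


def mh_log_odds_ratio(scores, groups, responses):
    """Mantel-Haenszel common log odds ratio for one studied item.

    scores:    matching total score for each person
    groups:    0 = reference group, 1 = focal group
    responses: 0/1 response of each person to the studied item
    """
    # One 2x2 table per score level, stored as [A, B, C, D] =
    # [ref correct, ref incorrect, focal correct, focal incorrect].
    tables = defaultdict(lambda: [0, 0, 0, 0])
    for s, g, y in zip(scores, groups, responses):
        tables[s][2 * g + (1 - y)] += 1

    num = den = 0.0
    for a, b, c, d in tables.values():
        n = a + b + c + d
        num += a * d / n
        den += b * c / n
    # Raises ZeroDivisionError or ValueError if either sum is zero,
    # i.e. when the data do not support an estimate.
    return math.log(num / den)


# Hypothetical data: one score level, six persons.
scores = [1, 1, 1, 1, 1, 1]
groups = [0, 0, 0, 1, 1, 1]
responses = [1, 1, 0, 1, 0, 0]
log_alpha = mh_log_odds_ratio(scores, groups, responses)  # log(4)
```

With a single score level the estimator reduces to the ordinary odds ratio of the one table, (2 x 2) / (1 x 1) = 4, which makes the small example easy to check by hand.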
Level 1 Model

  Prob(Y=1|B) = P
  log[P/(1-P)] = B0 + B1*(ITEMID1) + B2*(ITEMID2) + B3*(ITEMID3) + B4*(ITEMID4)
      + B5*(ITEMID5) + B6*(ITEMID6) + B7*(ITEMID7) + B8*(ITEMID8) + B9*(ITEMID9)
      + B10*(ITEMID10) + B11*(ITEMID11) + B12*(ITEMID12) + B13*(ITEMID13)
      + B14*(ITEMID14) + B15*(ITEMID15) + B16*(ITEMID16) + B17*(ITEMID17)
      + B18*(ITEMID18) + B19*(ITEMID19)

Level 2 Model

  B0 = G00 + G01*(ABILITY) + G02*(GROUP) + U0
  B1 = G10 + G11*(ABILITY) + G12*(GROUP)
  B2 = G20 + G21*(ABILITY) + G22*(GROUP)
  B3 = G30 + G31*(ABILITY) + G32*(GROUP)
  B4 = G40 + G41*(ABILITY) + G42*(GROUP)
  B5 = G50 + G51*(ABILITY) + G52*(GROUP)
  B6 = G60 + G61*(ABILITY) + G62*(GROUP)
  B7 = G70 + G71*(ABILITY) + G72*(GROUP)
  B8 = G80 + G81*(ABILITY) + G82*(GROUP)
  B9 = G90 + G91*(ABILITY) + G92*(GROUP)
  B10 = G100 + G101*(ABILITY) + G102*(GROUP)
  B11 = G110 + G111*(ABILITY) + G112*(GROUP)
  B12 = G120 + G121*(ABILITY) + G122*(GROUP)
  B13 = G130 + G131*(ABILITY) + G132*(GROUP)
  B14 = G140 + G141*(ABILITY) + G142*(GROUP)
  B15 = G150 + G151*(ABILITY) + G152*(GROUP)
  B16 = G160 + G161*(ABILITY) + G162*(GROUP)
  B17 = G170 + G171*(ABILITY) + G172*(GROUP)
  B18 = G180 + G181*(ABILITY) + G182*(GROUP)
  B19 = G190 + G191*(ABILITY) + G192*(GROUP)

Figure 4-1: HLM logistic regression

Final estimation of fixed effects (Unit specific model with robust standard errors)

  Fixed Effect              Coefficient   Standard Error   T-ratio   Approx. d.f.   P-value
  For INTRCPT1, B0
    INTRCPT2, G00           2.746069      0.333914         8.224     497            0.000
    ABILITY, G01            0.223584      0.026050         8.583     497            0.000
    GROUP, G02              0.059385      0.196874         0.302     497            0.763
  For ITEMID1 slope, B1
    INTRCPT2, G10           0.414986      0.460411         0.901     9940           0.368
    ABILITY, G11            0.059608      0.037825         1.576     9940           0.115
    GROUP, G12              0.203499      0.286892         0.709     9940           0.478
  For ITEMID2 slope, B2
    INTRCPT2, G20           0.740064      0.495784         1.493     9940           0.135
    ABILITY, G21            0.078153      0.043396         1.801     9940           0.071
    GROUP, G22              0.138881      0.312435         0.445     9940           0.656

Figure 4-2: HLM output for the logistic regression model

Table 4-2. A comparison of the logistic and multilevel logistic models

  Item     DIF Estimate,          gamma_i2   gamma_02 + gamma_i2   DIF Estimate, Multilevel
  Number   Logistic Regression                                     Logistic Regression
   1       0.144                  0.203      0.144                 0.144
   2       0.198                  0.139      0.198                 0.198
   3       0.498                  0.557      0.498                 0.498
   4       0.407                  0.348      0.407                 0.407
   5       0.282                  0.223      0.282                 0.282
   6       0.214                  0.273      0.214                 0.214
   7       0.188                  0.247      0.188                 0.188
   8       0.292                  0.351      0.292                 0.292
   9       0.124                  0.183      0.124                 0.124
  10       0.170                  0.229      0.170                 0.170
  11       0.077                  0.136      0.077                 0.077
  12       0.465                  0.406      0.465                 0.465
  13       0.747                  0.806      0.747                 0.747
  14       0.234                  0.175      0.234                 0.234
  15       0.386                  0.327      0.386                 0.386
  16       0.095                  0.036      0.095                 0.095
  17       0.002                  0.057      0.002                 0.002
  18       0.163                  0.222      0.163                 0.163
  19       0.329                  0.270      0.329                 0.329
  20       0.059                             0.059                 0.059

Level 1 Model

  Prob(Y=1|B) = P
  log[P/(1-P)] = B0 + B1*(ITEMID1) + B2*(ITEMID2) + B3*(ITEMID3) + B4*(ITEMID4)
      + B5*(ITEMID5) + B6*(ITEMID6) + B7*(ITEMID7) + B8*(ITEMID8) + B9*(ITEMID9)
      + B10*(ITEMID10) + B11*(ITEMID11) + B12*(ITEMID12) + B13*(ITEMID13)
      + B14*(ITEMID14) + B15*(ITEMID15) + B16*(ITEMID16) + B17*(ITEMID17)
      + B18*(ITEMID18) + B19*(ITEMID19)

Level 2 Model

  B0 = G00 + G01*(ABILITY1) + G02*(ABILITY2) + G03*(ABILITY3) + G04*(ABILITY4)
      + G05*(ABILITY5) + G06*(ABILITY6) + G07*(ABILITY7) + G08*(ABILITY8)
      + G09*(ABILITY9) + G010*(ABILIT10) + G011*(ABILIT11) + G012*(ABILIT12)
      + G013*(ABILIT13) + G014*(ABILIT14) + G015*(ABILIT15) + G016*(ABILIT16)
      + G017*(ABILIT17) + G018*(ABILIT18) + G019*(ABILIT19) + G020*(ABILIT20)
      + G021*(GROUP) + U0
  B1 = G10 + G11*(ABILITY1) + G12*(ABILITY2) + G13*(ABILITY3) + G14*(ABILITY4)
      + G15*(ABILITY5) + G16*(ABILITY6) + G17*(ABILITY7) + G18*(ABILITY8)
      + G19*(ABILITY9) + G110*(ABILIT10) + G111*(ABILIT11) + G112*(ABILIT12)
      + G113*(ABILIT13) + G114*(ABILIT14) + G115*(ABILIT15) + G116*(ABILIT16)
      + G117*(ABILIT17) + G118*(ABILIT18) + G119*(ABILIT19) + G120*(ABILIT20)
      + G121*(GROUP)

Figure 4-3: Multilevel Mantel-Haenszel HLM model

Final estimation of fixed effects (Unit specific model)

  Fixed Effect              Coefficient   Standard Error   T-ratio   Approx. d.f.   P-value
  For INTRCPT1, B0
    INTRCPT2, G00           12.493828     330.597674       0.038     478            0.970
    ABILITY1, G01           0.198757      382.108982       0.001     478            1.000
    ABILITY2, G02           0.209490      369.911614       0.001     478            1.000
    ABILITY3, G03           10.630998     330.599462       0.032     478            0.975
    ...
    ABILIT19, G019          25.070801     345.387075       0.073     478            0.943
    ABILIT20, G020          24.982757     405.523179       0.062     478            0.951
    GROUP, G021             0.119097      0.202652         0.602     478            0.547
  For ITEMID1 slope, B1
    INTRCPT2, G10           0.121190      468.907333       0.000     9560           1.000
    ABILITY1, G11           0.206119      541.625744       0.000     9560           1.000
    ABILITY2, G12           0.217277      524.400226       0.000     9560           1.000
    ABILITY3, G13           0.195146      468.909855       0.000     9560           1.000
    ...
    ABILIT19, G119          0.156620      489.799095       0.000     9560           1.000
    ABILIT20, G120          0.248342      574.615127       0.000     9560           1.000
    GROUP, G121             0.127152      0.297859         0.427     9560           0.669

Figure 4-4: HLM results for the Mantel-Haenszel log-odds ratio

Table 4-3. A comparison of the Mantel-Haenszel and Multilevel Mantel-Haenszel

  Item     Mantel-Haenszel   gamma_i21   gamma_021 + gamma_i21   Multilevel Mantel-Haenszel
  Number   Log Odds Ratio                                        Log Odds Ratio
   1       0.008             0.127       0.008                   0.008
   2       0.191             0.074       0.193                   0.193
   3       0.505             0.627       0.508                   0.508
   4       0.410             0.290       0.409                   0.409
   5       0.263             0.143       0.262                   0.262
   6       0.190             0.309       0.190                   0.190
   7       0.133             0.259       0.133                   0.133
   8       0.284             0.403       0.284                   0.284
   9       0.183             0.302       0.183                   0.183
  10       0.115             0.237       0.118                   0.118
  11       0.089             0.213       0.094                   0.094
  12       0.487             0.375       0.487                   0.487
  13       0.724             0.875       0.724                   0.724
  14       0.246             0.124       0.244                   0.244
  15       0.348             0.235       0.353                   0.353
  16       0.067             0.052       0.067                   0.067
  17       0.020             0.103       0.018                   0.018
  18       0.209             0.328       0.209                   0.209
  19       0.293             0.187       0.306                   0.306
  20       0.119                         0.119                   0.119

The DIF estimate for item 20 is $\gamma_{0,21}$ = 0.119.

The standard errors for the Mantel-Haenszel log odds ratio were recovered from the multilevel model by partitioning the variance term for $\gamma_{i,21}$. The variance term for the estimate of the log odds ratio of the Mantel-Haenszel, $\gamma_{0,21} + \gamma_{i,21}$, was partitioned using the following equation:

$$SE_{MH} = \sqrt{SE^2_{\gamma_{0,21}+\gamma_{i,21}}}$$

where

$$SE^2_{\gamma_{0,21}+\gamma_{i,21}} = SE^2_{\gamma_{i,21}} - SE^2_{\gamma_{0,21}} + 2\,Cov(\ln\hat{\alpha}_{MH}, \gamma_{0,21}). \tag{4.7}$$

If $Cov(\ln\hat{\alpha}_{MH}, \gamma_{0,21}) = 0$, then the equation in 4.7 reduces to

$$SE^2_{\gamma_{0,21}+\gamma_{i,21}} = SE^2_{\gamma_{i,21}} - SE^2_{\gamma_{0,21}}. \tag{4.8}$$

The results for the partitioning of the variance terms are provided in Table 4-4. Estimates of the standard errors for the Mantel-Haenszel log odds ratio were obtained from the SPSS version 16.0 output. The multilevel standard error term for each item was obtained from the HLM output (see Figure 4-4). The partitioning resulted in standard error terms that were almost identical to the standard error terms for the Mantel-Haenszel log odds ratio provided by SPSS (see Table 4-4).

An item was flagged for DIF if the coefficient of the group indicator variable was significantly different from 0. For item 20, the null hypothesis of

$$H_0: \gamma_{0,21} = 0, \tag{4.9}$$

which is equivalent to the Mantel-Haenszel test of $H_0: \alpha = 1$, was rejected if the p-value for the test statistic was less than or equal to 0.05. A rejection of the null hypothesis meant that the item functioned differently for the focal and reference groups. A similar hypothesis of

$$H_0: \gamma_{i,21} = 0 \tag{4.10}$$

was tested for item i, for $i = 1, 2, \ldots, 19$. For items $i = 1, 2, \ldots, 19$, the null hypothesis stated in 4.10 is not the same as the Mantel-Haenszel test of $H_0: \alpha = 1$. A rejection of the null hypothesis stated in 4.10 indicates that, for item i, there is a difference in performance between the focal and reference groups. The tests of DIF stated in 4.9 and 4.10 used the t test as the test statistic.
Items 3 and 13 were simulated to contain DIF. Based on the p-values, both items were flagged by the Multilevel Mantel-Haenszel method and none of the remaining 18 items were improperly flagged as containing DIF. The p-values for each of the 20 items are listed in Table 4-5.

Table 4-4. A comparison of the standard errors for the illustrative example

  Item     SE for      SE^2(gamma_i21) -   Mantel-Haenszel   Multilevel
  Number   gamma_i21   SE^2(gamma_021)     SE                Mantel-Haenszel SE
   1       0.298       0.046               0.216             0.214
   2       0.314       0.057               0.237             0.237
   3       0.297       0.055               0.214             0.216
   4       0.310       0.055               0.232             0.234
   5       0.289       0.042               0.203             0.205
   6       0.352       0.082               0.282             0.287
   7       0.296       0.046               0.213             0.215
   8       0.288       0.042               0.202             0.204
   9       0.300       0.049               0.219             0.220
  10       0.307       0.053               0.230             0.230
  11       0.297       0.047               0.213             0.216
  12       0.301       0.049               0.221             0.220
  13       0.295       0.046               0.211             0.214
  14       0.298       0.047               0.214             0.216
  15       0.297       0.055               0.214             0.216
  16       0.297       0.055               0.209             0.216
  17       0.311       0.055               0.232             0.235
  18       0.312       0.056               0.234             0.236
  19       0.295       0.046               0.211             0.214

  *The SE for gamma_021 is 0.203.

In summary, for the example, the Multilevel Mantel-Haenszel method for detecting DIF in dichotomously scored items provided an estimate of DIF equal to the log odds ratio of the Mantel-Haenszel method for detecting DIF for the same items. Furthermore, the standard error terms of the Mantel-Haenszel method were recovered by partitioning the error terms provided by the Multilevel Mantel-Haenszel.

Table 4-5. P-values for the illustrative example

  Item Number   p-value      Item Number   p-value
   1            0.669        11            0.470
   2            0.813        12            0.213
   3            0.034        13            0.003
   4            0.351        14            0.653
   5            0.617        15            0.429
   6            0.364        16            0.860
   7            0.383        17            0.742
   8            0.152        18            0.280
   9            0.304        19            0.544
  10            0.439        20            0.547

Simulation Study: Parameter Recovery of the Multilevel Mantel-Haenszel

Further support for the parameter recovery capability of the Multilevel Mantel-Haenszel will now be provided. A small simulation study will demonstrate that the parameter recovery of the Multilevel Mantel-Haenszel method is near perfect.
For the study, responses for n=500 persons (n=250 for the focal group and n=250 for the reference group) to a 20-item test were simulated using the Rasch IRT model. The ability distributions for both the focal and reference groups were simulated to fit a normal distribution with mean 0 and standard deviation 1. The item parameters used can be found in Table 4-1. Ten percent of the items, or 2 items, were simulated to display DIF; all others were simulated to be DIF free. The 2 items simulated to display DIF were items 3 and 13. The size of the DIF simulated was 0.4. All DIF was simulated to favor the focal group.

The correlation between the log odds ratio of the Mantel-Haenszel and the log odds ratio recovered by the Multilevel Mantel-Haenszel was 0.999977 for all 20 items. The relationship between the log odds ratios estimated by both methods, as seen in Figure 4-5, approximates the line $y = x$, indicating that the estimates of the log odds ratio for the Multilevel Mantel-Haenszel were essentially identical to the estimates for the Mantel-Haenszel.

Figure 4-5: Graph of the log odds ratio estimates for both methods

Simulation Study: Performance of the Multilevel Mantel-Haenszel

In the previous section, a numerical example, based on one set of simulated data and one set of simulated conditions, was presented to provide empirical support for the parameter recovery capabilities of the Multilevel Mantel-Haenszel. The illustrative examples were followed by a simulation of one set of conditions, designed to provide additional evidence of the parameter recovery capability of the Multilevel Mantel-Haenszel.
The results of the simulation study will now be presented. The purpose of the simulation study is to provide support for the third research question, which was to show that the Mantel-Haenszel reformulated as a multilevel model (MLMH) performs as well as the Mantel-Haenszel (MH) in terms of identifying, or flagging, items for possible DIF. The focus of the simulation study will be on the empirical Type I error rate and power. It is expected that the Multilevel Mantel-Haenszel's performance will be comparable to the performance of the Mantel-Haenszel.

For the simulation study, data were replicated 50 times under different conditions of amount and size of DIF, length of test, sample size and ability distribution. Three different concentrations of DIF (0%, 10% and 20%) were simulated, and two different magnitudes of DIF (0.2 and 0.4) were simulated for each nonzero concentration. The number of items was varied, with two different test lengths simulated: 20 items and 40 items. Two conditions of sample size (n=250 and n=500) were simulated, in which the ratio of the number of examinees in the focal group to the number of examinees in the reference group was kept at 1:1. Finally, the study considered two different conditions for the ability distributions of the focal and reference groups. In the first, the ability distributions of the two groups were simulated to be equal, with both fitting a normal distribution with mean 0 and standard deviation 1. In the second, the ability distributions were simulated to be different, with the focal group fitting a normal distribution with mean 1 and standard deviation 1 and the reference group fitting a normal distribution with mean 0 and standard deviation 1.
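As a reading aid, the crossed design just described can be enumerated explicitly. The sketch below is an assumption-laden illustration rather than the study's own code: the dictionary keys and variable names are hypothetical, and the 0% DIF concentration is treated as a single condition with no magnitude, which gives 5 DIF conditions crossed with 2 test lengths, 2 sample sizes and 2 ability conditions.

```python
from itertools import product

# (proportion of DIF items, DIF size); 0% DIF has no magnitude to vary.
dif_conditions = [(0.0, 0.0)] + [
    (pct, size) for pct in (0.10, 0.20) for size in (0.2, 0.4)
]
test_lengths = (20, 40)
sample_sizes = (250, 500)          # total n, split 1:1 between groups
ability_conditions = ("equal", "unequal")
N_REPLICATIONS = 50                # replications per design cell

design = [
    {
        "dif_pct": pct,
        "dif_size": size,
        "n_items": k,
        "n_dif_items": round(pct * k),
        "n_persons": n,
        "ability": ability,
    }
    for (pct, size), k, n, ability in product(
        dif_conditions, test_lengths, sample_sizes, ability_conditions
    )
]

print(len(design), "design cells,", len(design) * N_REPLICATIONS, "simulated data sets")
```

Laying the cells out this way makes it easy to see which table in the remainder of the chapter corresponds to which slice of the design.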
The reformulated multilevel Mantel-Haenszel model was analyzed using hierarchical generalized linear models, or HGLM. HGLM is incorporated in the HLM program (Bryk, Raudenbush & Congdon, 1996). Items were identified as performing differently if the coefficient of the group variable for the item was significantly different from zero.

Models ultimately converged for all replications under all conditions. In a few instances, due to the number of dummy-coded independent variables, convergence was initially hampered by the multicollinearity that existed between the independent variables for the fixed component of the multilevel model. For those situations of nonconvergence, the correlation matrix for the independent variables was examined and the independent variable exhibiting the strongest correlation with the other independent variables was removed. The model was then reanalyzed and the estimate of the Multilevel Mantel-Haenszel log odds ratio was examined. In no instance was the estimate of the log odds ratio compromised by the deletion of an independent variable. The removal of the independent variable allowed for the convergence of all models for all conditions for all replications.

Only 50 replications were used in this study. Therefore, there could be substantial variability between the empirical Type I errors and power reported in this study and the empirical Type I errors and power that exist in the population. Consequently, conclusions and generalizations drawn may be based on poorly estimated results.
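To give a rough sense of how much sampling variability a small number of replications implies, the binomial standard error of an estimated rejection rate can be computed directly. The sketch below is illustrative only. It assumes that each flagging decision is an independent Bernoulli trial with the nominal rate of 0.05, and that a reported rate pools decisions over items and replications; the function name `mc_standard_error` is hypothetical. Decisions on items within the same replication share data, so the pooled figure is optimistic.

```python
import math

def mc_standard_error(rate, n_trials):
    """Binomial standard error of an empirical rejection rate."""
    return math.sqrt(rate * (1.0 - rate) / n_trials)

nominal = 0.05

# One flagging decision per replication: 50 trials.
se_single_item = mc_standard_error(nominal, 50)        # roughly 0.031

# Decisions pooled over 20 items and 50 replications, treated
# (optimistically) as 1000 independent trials.
se_pooled = mc_standard_error(nominal, 50 * 20)        # roughly 0.007

print(f"SE with 50 trials:   {se_single_item:.4f}")
print(f"SE with 1000 trials: {se_pooled:.4f}")
```

Under these assumptions, differences between two reported rates of a few hundredths are within the range that sampling error alone could produce.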
All Items Simulated as DIF Free

For this case all item scores were simulated to be free of DIF. Results are presented for conditions of equal and unequal ability distributions, for sample sizes of n=250 and n=500, and for test lengths of 20 and 40 items. Table 4-6 provides the item parameters that were used to simulate the data. The item parameters were simulated to fit a normal distribution.

The empirical Type I error rates for the condition of all items simulated to be DIF free are presented in Table 4-7. The empirical Type I error rates for the Multilevel Mantel-Haenszel are less than or equal to the empirical Type I error rates for the Mantel-Haenszel for most conditions, indicating that the Multilevel Mantel-Haenszel performed as well as or better than the Mantel-Haenszel. For most conditions the rates were below 0.05. Exceptions include rates of 0.06 and 0.07, both for the Mantel-Haenszel, and 0.057 for the Multilevel Mantel-Haenszel. An increase in the sample size from n=250 to n=500 did not result in a decrease in the empirical Type I error rate for either method. For the condition of unequal ability distribution, for both methods, the empirical Type I error rates increased across all conditions of test length and were the highest for n=250 and 40 items.

Items Simulated to Contain DIF

In this section a subset of the item scores was simulated to function differently for the focal and reference groups. To study the effect that the amount of DIF had on the performance of the Multilevel Mantel-Haenszel and Mantel-Haenszel methods of DIF detection, the amount and magnitude of DIF were varied. Four different conditions were examined: 10% of the items were simulated to contain DIF of size 0.2, 10% of the items were simulated to contain DIF of size 0.4, 20% of the items were simulated to contain DIF of size 0.2, and 20% of the items were simulated to contain DIF of size 0.4. In all combinations each of the DIF items was simulated to favor the focal group.
The combinations of DIF were simulated across equal and unequal ability distributions of the focal and reference groups, sample sizes of n=250 and n=500, and test lengths of 20 items and 40 items. For both sample sizes (n=250 and n=500) the number in the focal group equaled the number in the reference group.

Table 4-6. Item parameters for the condition of no DIF
Item Number   Item Parameter   Item Number   Item Parameter
1             0.528            21            0.528
2             1.186            22            1.186
3             0.329            23            0.329
4             1.187            24            1.187
5             0.124            25            0.124
6             2.292            26            2.292
7             0.518            27            0.518
8             0.111            28            0.111
9             0.353            29            0.353
10            1.472            30            1.472
11            0.452            31            0.452
12            0.835            32            0.835
13            0.131            33            0.131
14            0.419            34            0.419
15            0.276            35            0.276
16            0.319            36            0.319
17            1.181            37            1.181
18            1.154            38            1.154
19            0.126            39            0.126
20            0.084            40            0.084

Table 4-7. Type I error: Items DIF free
                              n=250             n=500
Ability   Number of Items     MLMH    MH        MLMH    MH
Equal     20                  0.022   0.045     0.047   0.060
          40                  0.039   0.037     0.031   0.039
Unequal   20                  0.038   0.043     0.038   0.042
          40                  0.057   0.070     0.043   0.044

The empirical Type I error rates and power for the first condition, in which 10% of the items exhibit DIF of size 0.2, are presented in Table 4-8 and Table 4-9, respectively. For tests of length 20 items, items 3 and 13 were simulated to exhibit DIF. For tests of length 40 items, items 3, 13, 24 and 32 were simulated to exhibit DIF. For these items, 0.2 was added to the item parameters in Table 4-6. For most conditions, the empirical Type I error rates for the Multilevel Mantel-Haenszel and Mantel-Haenszel were similar and acceptable (below 0.05). The rates generally increased when the length of the test increased from 20 to 40 items and for unequal ability distributions of the focal and reference groups, but decreased when the sample size was increased to n=500.
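The empirical Type I error rates and power reported in the tables of this chapter are proportions of flagged items. One plausible way to compute them from a matrix of p-values is sketched below. It assumes that flags are pooled over items and replications, with the Type I error rate taken over the items simulated to be DIF free and power taken over the items simulated to contain DIF. The function name `error_rate_and_power`, the 0.05 threshold as a default argument, and the toy p-values are all hypothetical.

```python
import numpy as np

def error_rate_and_power(p_values, dif_items, alpha=0.05):
    """Empirical Type I error rate and power from a p-value matrix.

    p_values : array of shape (replications, items)
    dif_items: zero-based indices of the items simulated to contain DIF
    """
    p_values = np.asarray(p_values, dtype=float)
    flags = p_values < alpha                      # True = item flagged
    is_dif = np.zeros(p_values.shape[1], dtype=bool)
    is_dif[list(dif_items)] = True
    # Type I error: share of flags among DIF-free items.
    type1 = flags[:, ~is_dif].mean()
    # Power: share of flags among DIF items (undefined if there are none).
    power = flags[:, is_dif].mean() if is_dif.any() else float("nan")
    return type1, power

# Toy illustration: 2 replications, 3 items, first item contains DIF.
toy_p = [[0.01, 0.50, 0.20],
         [0.03, 0.04, 0.90]]
type1, power = error_rate_and_power(toy_p, dif_items=[0])
print(type1, power)
```

In the toy matrix one of the four decisions on DIF-free items is a false flag, and the DIF item is flagged in both replications.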
Table 4-8. Type I error: 10% DIF of size 0.2
                              n=250             n=500
Ability   Number of Items     MLMH    MH        MLMH    MH
Equal     20                  0.027   0.015     0.011   0.015
          40                  0.029   0.034     0.028   0.046
Unequal   20                  0.049   0.041     0.046   0.028
          40                  0.056   0.042     0.047   0.040

For all conditions, power for the Multilevel Mantel-Haenszel was at least as great as that of the Mantel-Haenszel (see Table 4-9). Power obtained for both the Multilevel Mantel-Haenszel and Mantel-Haenszel generally increased when the sample size increased from n=250 to n=500 and was larger for the 40-item test length, but decreased when the ability distributions of the focal and reference groups were not the same.

Table 4-9. Power: 10% DIF of size 0.2
                              n=250             n=500
Ability   Number of Items     MLMH    MH        MLMH    MH
Equal     20                  0.32    0.32      0.36    0.34
          40                  0.53    0.50      0.65    0.36
Unequal   20                  0.22    0.21      0.36    0.32
          40                  0.42    0.39      0.36    0.31

The empirical Type I error rates and power for the second condition, in which 10% of the items contain DIF of size 0.4, are presented in Table 4-10 and Table 4-11, respectively. As in the previous condition, items 3 and 13 for tests of length 20 and items 3, 13, 24 and 32 for tests of length 40 were simulated to function differently in favor of the reference group. For these items, 0.4 was added to the item parameters given in Table 4-4.

An examination of the results in Table 4-10 reveals that, similar to the previous condition of 10% DIF of size 0.2, the empirical Type I error rates were generally within acceptable limits for both the Multilevel Mantel-Haenszel and Mantel-Haenszel, but were larger for conditions of unequal ability distributions for the focal and reference groups. However, unlike the previous condition, the rates decreased when the length of the test was increased to 40 items if the ability distributions of the focal and reference groups were equal. If the distributions were unequal, the empirical Type I errors increased.
Although there were no systematic differences between the errors for the two different sample sizes of n=250 and n=500, in most cases the empirical Type I errors decreased when the sample size was increased to n=500.

Table 4-10. Type I error: 10% DIF of size 0.4
                              n=250             n=500
Ability   Number of Items     MLMH    MH        MLMH    MH
Equal     20                  0.021   0.021     0.020   0.032
          40                  0.018   0.020     0.023   0.015
Unequal   20                  0.030   0.033     0.029   0.028
          40                  0.062   0.040     0.031   0.035

As seen in the previous case, power for the Multilevel Mantel-Haenszel method generally exceeded that of the Mantel-Haenszel (see Table 4-11). For both methods, power generally increased when the sample size was increased to n=500. And, for both methods, power decreased when the length of the test was increased to 40 items. Power was the lowest for both the Multilevel Mantel-Haenszel and Mantel-Haenszel when the ability distributions of the focal and reference groups were simulated to be different.

Table 4-11. Power: 10% DIF of size 0.4
                              n=250             n=500
Ability   Number of Items     MLMH    MH        MLMH    MH
Equal     20                  0.54    0.48      0.58    0.40
          40                  0.58    0.36      0.53    0.48
Unequal   20                  0.39    0.36      0.52    0.48
          40                  0.35    0.35      0.36    0.38

Results for the condition of 20% DIF of size 0.2 are contained in Table 4-12 and Table 4-13. For this condition, items 3, 7, 13 and 19 were simulated to contain DIF for tests of length 20, and items 3, 7, 13, 19, 24, 32, 37 and 40 were simulated to contain DIF for tests of length 40. In all conditions the DIF items were simulated to favor the focal group. For these items, 0.4 was added to the item parameters given in Table 4-4 for the focal group.
The empirical Type I error rates were generally larger for this condition as compared to previous conditions (see Table 4-12), and 38% of the conditions (6 of 16) were above the acceptable rate of 0.05. As seen in previous conditions, the empirical Type I error rates generally increased when the ability distributions of the focal and reference groups were unequal and when the test length was increased to 40 items. An exception to this generalization was the combination of n=250 and a test length of 40 items for unequal ability distributions, in which case the empirical error rates decreased. For n=500 the empirical error rates for the Mantel-Haenszel were larger than the empirical error rates for the Multilevel Mantel-Haenszel, but for n=250 the empirical error rates for the Mantel-Haenszel were larger for only 25% of the cases.

Table 4-12. Type I error: 20% DIF of size 0.2
                              n=250             n=500
Ability   Number of Items     MLMH    MH        MLMH    MH
Equal     20                  0.044   0.022     0.015   0.032
          40                  0.046   0.052     0.040   0.060
Unequal   20                  0.058   0.040     0.020   0.068
          40                  0.048   0.035     0.085   0.080

Once again, power was greater for the Multilevel Mantel-Haenszel method, as evidenced by the results provided in Table 4-13. For the Multilevel Mantel-Haenszel, power generally increased when the test length increased to 40 items. As seen in previous conditions, power decreased for both methods when the ability distributions of the focal and reference groups were simulated to be different. Unlike previous conditions, an increase in the sample size to n=500 did not result in an increase in power.

Table 4-13. Power: 20% DIF of size 0.2
                              n=250             n=500
Ability   Number of Items     MLMH    MH        MLMH    MH
Equal     20                  0.50    0.43      0.46    0.45
          40                  0.74    0.48      0.41    0.34
Unequal   20                  0.34    0.27      0.37    0.34
          40                  0.61    0.44      0.35    0.35

Empirical Type I error rates and power for the final condition of 20% DIF of size 0.4 are presented in Tables 4-14 and 4-15, respectively. The empirical Type I error rates (see Table 4-14) were much higher for the Multilevel Mantel-Haenszel, and in many cases (5 of 16) the rate was above the acceptable rate of 0.05. Rates were generally higher for tests of length 40 items. But, unlike previous scenarios, the empirical Type I error rates did not increase for the condition of unequal ability distributions.

Table 4-14. Type I error: 20% DIF of size 0.4
                              n=250             n=500
Ability   Number of Items     MLMH    MH        MLMH    MH
Equal     20                  0.028   0.029     0.012   0.055
          40                  0.090   0.029     0.119   0.054
Unequal   20                  0.030   0.031     0.038   0.040
          40                  0.081   0.034     0.106   0.048

Like all previous conditions, power for the Multilevel Mantel-Haenszel exceeded that of the Mantel-Haenszel (see Table 4-15). Overall, this condition (20% DIF of size 0.4) resulted in the lowest power. Like previous conditions, power generally decreased for the condition of unequal ability distributions for the focal and reference groups. No general differences existed between the power for n=250 and n=500, and no systematic changes resulted from an increase in the number of items to 40.

Table 4-15. Power: 20% DIF of size 0.4
                              n=250             n=500
Ability   Number of Items     MLMH    MH        MLMH    MH
Equal     20                  0.32    0.28      0.36    0.33
          40                  0.55    0.35      0.61    0.31
Unequal   20                  0.34    0.31      0.32    0.31
          40                  0.48    0.26      0.33    0.18

CHAPTER 5
CONCLUSION

This chapter includes a summary of findings and explores the conclusions, implications, limitations and recommendations for future research to be drawn from the results discussed in the previous chapter.

Summary

The first section of the results provided support for the first two research questions, which were (1) Can the Mantel-Haenszel DIF detection procedure for dichotomous items be reformulated as a multilevel model where items are nested within individuals?
and (2) Is the log odds ratio of the reformulated multilevel Mantel-Haenszel approach for detecting DIF in dichotomous items equivalent to the log odds ratio of the Mantel-Haenszel approach for detecting DIF in dichotomous items for items that are nested within individuals? Parameter recovery of the Multilevel Mantel-Haenszel method was exemplary for both the logistic regression measure of DIF and the Mantel-Haenszel log odds ratio. For a 20-item example with 10% of the items simulated to exhibit DIF of size 0.4, 50% of the estimates for the log odds ratio provided by the Multilevel Mantel-Haenszel were identical to the corresponding estimates of the Mantel-Haenszel log odds ratio as provided by SPSS. For the other 10 items, the differences between the estimates for the log odds ratio provided by the Multilevel Mantel-Haenszel and the estimates provided by SPSS were negligible: a difference of .005 or less in most cases. The excellent parameter recovery of the Multilevel Mantel-Haenszel was further evidenced by a correlation of 0.999977 between the log odds ratio of the Multilevel Mantel-Haenszel and the log odds ratio of the Mantel-Haenszel for a small simulation study of 50 replications for a test of length 20 items, with 10% of the items exhibiting DIF of size 0.4, and a sample of size n=500.

A simulation study compared the Multilevel Mantel-Haenszel DIF model for detecting differential functioning in dichotomous data to the traditional Mantel-Haenszel approach for data simulated under three concentrations of DIF (0% DIF, 10% DIF, and 20% DIF), two different magnitudes of DIF (0.2 and 0.4), two different sample sizes (n=250 and n=500), two test lengths (20 and 40) and two different ability distributions for the focal and reference groups (equal and unequal).
The study provided empirical support for the third research question, which was: How does the reformulated multilevel Mantel-Haenszel approach for detecting DIF in dichotomous items compare to the Mantel-Haenszel approach for detecting DIF in dichotomous items for items that are nested within individuals? A summary of the findings is provided in the paragraphs that follow.

For the case where all items were simulated to be DIF free, the empirical Type I error rates for the Multilevel Mantel-Haenszel were generally lower than the empirical Type I error rates for the Mantel-Haenszel. The rates for both methods were within acceptable levels. The rates for the case of unequal ability distributions between the focal and reference groups were generally higher, especially for a sample size of n=250. And, the Multilevel Mantel-Haenszel was more sensitive to the differences in the ability distributions, as evidenced by the greater increase in the rates for the Multilevel Mantel-Haenszel as compared to the Mantel-Haenszel for the case of unequal ability distributions between the focal and reference groups.

For the case where 10% of the items were simulated to exhibit DIF of size 0.2, the Multilevel Mantel-Haenszel performed as well as the Mantel-Haenszel in terms of empirical Type I error rate and power. The empirical Type I error rates were within acceptable limits for both methods. For both the Multilevel Mantel-Haenszel and Mantel-Haenszel, the error rates generally decreased as the sample size increased from n=250 to n=500. The empirical Type I error rates also increased as the test length increased from 20 items to 40 items. For this case the Multilevel Mantel-Haenszel exhibited greater power than the Mantel-Haenszel. For both methods power was greater for the longer test (40 items) and for the larger sample size (n=500).
Power decreased for the Multilevel Mantel-Haenszel and Mantel-Haenszel when the ability distributions of the focal and reference groups were different.

The case of 10% DIF of size 0.4 yielded empirical Type I error rates that were lower and power that was higher than the corresponding rates and power from the previous case. Across the conditions of sample size, test length and ability distribution, the Multilevel Mantel-Haenszel and Mantel-Haenszel experienced similar empirical Type I error rates. The error rates were lower for both methods for the longer test length (40 items) if the ability distributions of the focal and reference groups were unequal. Consistent with the previous case, the error rates decreased when the sample size increased to n=500.

As in the previous condition of 10% DIF of size 0.2, power was higher for the Multilevel Mantel-Haenszel. Power increased for both the Multilevel Mantel-Haenszel and Mantel-Haenszel when the sample size was increased to n=500 and decreased for both methods when the ability distributions of the focal and reference groups differed. This too is consistent with the previous condition of 10% DIF of size 0.2.

The case of 20% DIF of size 0.2 produced higher empirical Type I error rates, with 38% of the rates above 0.05. Furthermore, the error rates for the Multilevel Mantel-Haenszel were higher than the error rates for the Mantel-Haenszel across all conditions of test length, sample size, and ability distribution. For both methods the error rates increased for the condition of unequal ability distributions of the focal and reference groups.

Like previous cases, power was higher for the Multilevel Mantel-Haenszel. However, unlike previous cases, for both the Multilevel Mantel-Haenszel and Mantel-Haenszel power decreased when the sample size was increased to n=500. This is inconsistent with the literature and no explanation for the results can be offered.
As in previous cases, power decreased for both methods for the condition of unequal ability distributions for the focal and reference groups.

Results for the case of 20% DIF of size 0.4 demonstrated even higher empirical Type I error rates for both the Multilevel Mantel-Haenszel and Mantel-Haenszel, with 6 of the 16 conditions resulting in error rates above 0.05. The error rates were higher for the longer test length (40 items) as well as the larger sample size (n=500). This was true for both methods; however, the increase was more pronounced for the Multilevel Mantel-Haenszel. Consistent with previous cases, for both methods the error rates increased for the condition of unequal ability distributions of the focal and reference groups.

Consistent with the previous cases, power was higher for the Multilevel Mantel-Haenszel. Power increased for both the Multilevel Mantel-Haenszel and Mantel-Haenszel when the test length increased to 40 items and the sample size increased to n=500. Also, consistent with previous conditions, power was lower for both methods for the condition of unequal ability distributions for the focal and reference groups.

In summary, the Multilevel Mantel-Haenszel DIF model compared favorably to the traditional Mantel-Haenszel and provided DIF detection with acceptable empirical Type I error rates and moderate power across most conditions. In general, empirical Type I error rates increased for both the Multilevel Mantel-Haenszel and Mantel-Haenszel when the amount of DIF present increased. And, across all cases, the error rates were higher for both methods for the condition of unequal ability distributions for the focal and reference groups. Across all cases and conditions the Multilevel Mantel-Haenszel exhibited greater power than the Mantel-Haenszel. However, for both methods power decreased when the ability distributions of the focal and reference groups differed.
Discussion of Results

The intention of this study was to investigate a multilevel equivalent of the Mantel-Haenszel method for identifying differential item functioning in dichotomous items. The study was based on the work of Kamata and others, who proposed a multilevel approach to detecting differential item functioning in both dichotomous and polytomous items. This study extended the idea of a multilevel approach for detecting DIF to the very popular Mantel-Haenszel method.

Multilevel Equivalent of the Mantel-Haenszel Method for Detecting DIF

The first research question of interest was whether a multilevel equivalent of the very popular Mantel-Haenszel method for detecting DIF in dichotomously scored items could be formulated. It was found that, by embedding a Rasch IRT model in a multilevel logistic regression model with discrete ability levels, a model equivalent to the Mantel-Haenszel could be formulated. Thus, one of the most widely used methods for detecting DIF in dichotomous items has been reformulated as a multilevel DIF model called the Multilevel Mantel-Haenszel model.

The second research question of interest was whether the log odds ratio of the Mantel-Haenszel DIF detection procedure for dichotomous items could be recovered fully by the Multilevel Mantel-Haenszel model. The results provided in Chapter 4 supported the parameter recovery capability of a multilevel model analyzed using HGLM. Parameter recovery was exemplary for the Mantel-Haenszel log odds ratio, as evidenced by the illustrative example and simulation study. For a 20-item example with 10% of the items simulated to exhibit DIF of size 0.4, half of the estimates of the log odds ratio provided by the Multilevel Mantel-Haenszel were identical to the corresponding estimates of the log odds ratio of the Mantel-Haenszel provided by SPSS.
For the other 10 items, very small differences (less than 0.005) existed between the log odds ratio of the Multilevel Mantel-Haenszel and the log odds ratio of the Mantel-Haenszel. The simulation study yielded a near perfect correlation coefficient of 0.999977 between the log odds ratio produced by the Mantel-Haenszel and the log odds ratio computed by the Multilevel Mantel-Haenszel.

Performance of the Multilevel Mantel-Haenszel Model

The third research question addressed the performance of the Multilevel Mantel-Haenszel model in terms of empirical Type I error and power. A simulation study applied both the Multilevel Mantel-Haenszel and Mantel-Haenszel methods for detecting differential functioning in dichotomous items to data simulated under three concentrations of DIF (0% DIF, 10% DIF, and 20% DIF), two different magnitudes of DIF (0.2 and 0.4), two different test lengths (20 items and 40 items), two different sample sizes (n=250 and n=500), and two different ability distributions for the focal and reference groups (equal and unequal).

The performance of the Multilevel Mantel-Haenszel was compared to the performance of the traditional Mantel-Haenszel, and, overall, as expected, the Multilevel Mantel-Haenszel performed as well as the Mantel-Haenszel. The Multilevel Mantel-Haenszel DIF model provided DIF detection comparable to the Mantel-Haenszel with acceptable Type I error rates and moderate power across most conditions. In all cases the power of the Multilevel Mantel-Haenszel exceeded that of the Mantel-Haenszel. However, when the ability distribution of the focal group differed from the ability distribution of the reference group, both the Multilevel Mantel-Haenszel and Mantel-Haenszel methods experienced higher empirical Type I error rates and lower power.
This is consistent with previous studies on the effect of the ability distribution on the Mantel-Haenszel log odds ratio (Clauser & Mazor, 1998; Cohen & Kim, 1993; Fidalgo, Mellenbergh, & Muniz, 2000; French & Miller, 2007; Jodoin & Gierl, 2001; Narayanan & Swaminathan, 1994; Roussos & Stout, 1996; Uttaro & Millsap, 1994).

In general, for the case of 10% DIF, empirical Type I error rates decreased for both the Mantel-Haenszel and Multilevel Mantel-Haenszel when the test length was increased to 40 items from 20 items. This trend in error rates was repeated when the sample size was increased from n=250 to n=500. For the same case, power increased for both methods as a result of an increase in test length to 40 items. Overall, power was higher for DIF of size 0.4. The literature supports these findings, as studies by others showed decreased empirical Type I error rates for tests of longer length and for larger samples, and increased power for tests of longer length, for larger samples, and for larger DIF effect sizes (Clauser & Mazor, 1993; Cohen & Kim, 1993; Fidalgo, Mellenbergh, & Muniz, 2000; French & Miller, 2007; Mazor, Clauser, & Hambleton, 1992; Narayanan & Swaminathan, 1994; Roussos & Stout, 1996; Uttaro & Millsap, 1994).

When the amount of DIF was increased to 20%, the empirical Type I error rates increased and power decreased for both the Multilevel Mantel-Haenszel and the Mantel-Haenszel. However, contrary to expectations based on previous studies, the error rates increased for the conditions of increased test length (40 items) and increased sample size (n=500) for both methods. Power decreased as a result of the increased concentration of items exhibiting DIF.
Although, in general, the findings of the study were as expected and supported by the literature, there are three circumstances that could prove problematic in terms of the accuracy of the estimates for the empirical Type I error rates and power. First, only 50 replications were used in this study. Therefore, there could be substantial variability between the empirical Type I errors and power reported in this study and the empirical Type I errors and power that exist in the population. This variability may account for the results that differed from what was expected, as the results may represent poorly estimated error rates and power.

Second, previous studies on the performance of the Mantel-Haenszel raise several issues regarding the matching criterion. These issues include (1) the inclusion of the item under consideration in the matching criterion and (2) the removal of all items that exhibit DIF from the matching criterion. Donoghue, Holland, and Thayer (1993) asserted that if the item under investigation is not included in the matching criterion, then the Mantel-Haenszel method may indicate that the item exhibits DIF when no DIF exists. And, Holland and Thayer (1988) specifically stated that, in order for the Mantel-Haenszel to be considered equivalent to the Rasch IRT model, the item under consideration must be included in the matching criterion. Furthermore, all other items should be DIF free if the Mantel-Haenszel is to be equivalent to the Rasch IRT model. And, according to Shealy and Stout (1993a, 1993b), the matching criterion should be purified in order to be free of DIF items. In this study, the item under consideration was included in the matching criterion, but the matching criterion did not undergo a purification process to rid it of all other items that exhibited DIF.
This could have negatively impacted the results for both the Multilevel Mantel-Haenszel and Mantel-Haenszel, resulting in higher empirical Type I error rates and lower power.

And, third, the Multilevel Mantel-Haenszel was estimated using HGLM. The HGLM program is a combination of generalized linear models (GLM) and hierarchical linear models (HLM). In GLM, penalized quasi-likelihood (PQL) is used to estimate the values for the linearized dependent variables. The PQL algorithm considers the linearized dependent variables to be approximately normally distributed. The algorithm provides reliable results except when the level 2 variances are large. Large level 2 variance results in variance estimates and fixed effect estimates that are negatively biased (Raudenbush & Bryk, 2002). Biased variance estimates could have contributed to the unexpected findings.

Implication for DIF Detection in Dichotomous Items

The assessment of DIF is an essential aspect of the validation of both educational and psychological tests. Currently, there are several procedures for detecting DIF in dichotomous items. These include the Mantel-Haenszel, logistic regression, and now the Multilevel Mantel-Haenszel model. The Multilevel Mantel-Haenszel approach is a valuable addition to the family of DIF detection procedures. First and foremost, by formulating the Mantel-Haenszel as a multilevel model, an already popular procedure for detecting DIF in dichotomous items is permitted to take into consideration the natural nesting of item scores within persons. Second, by acknowledging the nested nature of the data, the Multilevel Mantel-Haenszel provides educators, test developers and researchers the opportunity to contemplate possible sources of differential functioning at all levels of the data. Third, by choosing to use a multilevel model, the researcher is able to interpret the results without ignoring the hierarchical structure of the data and the lack of
statistical independence that often exists in such data. And fourth, by modeling the Mantel-Haenszel as a multilevel model, educators, test developers and researchers are provided the opportunity to more fully understand the cause of the differential functioning through the addition of contextual variables at the various levels of the model. Furthermore, a measure of each variable's effect on subgroup performance can be estimated by the multilevel model.

The Multilevel Mantel-Haenszel allows for item bias to be investigated in a completely new manner. Traditionally, investigation of item bias began with a procedure for identifying DIF. Once an item was flagged for exhibiting DIF, the construction, wording, and content of the item were closely examined as possible sources of the differential functioning. With the formulation of a Multilevel Mantel-Haenszel, the source of the differential item functioning is not limited to the item; instead, variables at all levels included in the multilevel model can be considered as possible sources. For example, variables related to study habits, learning or physical disabilities, or socioeconomic status may be added to the level 2, or person level, model as possible explanatory sources of the DIF. For a model with 3 levels, variables related to group membership can be added to the level 3 model to capture the differences in performance due to group membership. Variables at this level could include those related to school accommodations or neighborhood socioeconomic status. The use of a multilevel model for the purpose of DIF detection expands the definition of DIF to include all factors at all levels that result in a difference in the performance of two or more subgroups of the population that have been matched on ability.
Both the Mantel-Haenszel and logistic regression methods for identifying items that exhibit DIF require a separate analysis for each item. Therefore, for a test with 20 items, 20 analyses must be conducted, one for each item. The Multilevel Mantel-Haenszel method employs one model to analyze all items. Therefore, for a 20-item test one could test for DIF and obtain an effect size measure of the DIF simultaneously for all 20 items.

The results obtained from this study provide empirical support for the use of the Multilevel Mantel-Haenszel method for detecting DIF in dichotomously scored items. For most conditions the Multilevel Mantel-Haenszel demonstrated acceptable Type I error rates, indicating that the Multilevel Mantel-Haenszel would not improperly flag an item as functioning differently. This is important since an item flagged for DIF is carefully scrutinized for the source of the differential functioning. This process can be labor and time intensive and can result in the removal of an item that should not be removed. Based on the empirical evidence provided in this study, the Multilevel Mantel-Haenszel is more powerful than the Mantel-Haenszel. Therefore, the Multilevel Mantel-Haenszel properly identified items that were functioning differently at least as often as the Mantel-Haenszel. The combination of acceptable empirical Type I error rates and power allows test developers and psychometricians to confidently apply the Multilevel Mantel-Haenszel model to the detection of DIF in dichotomously scored items.

Limitations and Future Research

The limitations of this study and implications for future research will be discussed in this section.
First, the Multilevel Mantel-Haenszel model presented only considered dichotomous data. The increased use of various types of performance and constructed-response assessments, as well as personality, attitude, and other affective tests, has created a need for psychometric methods that can detect DIF in polytomously scored items. Thus, there is a need for further research focused on extending the Multilevel Mantel-Haenszel model presented in this study to a model for polytomously scored items.

This study focused only on uniform DIF, the type of DIF best detected by the Mantel-Haenszel. A study of the performance of the Multilevel Mantel-Haenszel under conditions of both uniform and nonuniform DIF would allow for an expanded application of the Multilevel Mantel-Haenszel to the detection of DIF. According to Swaminathan and Rogers (1990), nonuniform DIF can be detected using logistic regression by including an interaction term between ability and group in the model.

The simulation conditions for this study were limited to a two-level hierarchical model where the level 1 units were the items and the level 2 units were the examinees. The extension of the Multilevel Mantel-Haenszel to a three-level model would be beneficial, as it would allow for the investigation of the impact of level 3 units on the differential functioning of the items.

Although the results indicated that the Multilevel Mantel-Haenszel performed in a manner similar to the Mantel-Haenszel under the conditions examined, a more complete understanding of how the two methods compare requires that the performance of both methods be observed for an expanded set of conditions, especially conditions related to the size of the DIF and sample size. Since this study only considered small and moderate effect sizes for DIF, an inquiry into the influence that a large effect size, such as 0.6 or 0.8, would have on the empirical Type I error rate and power for both the Multilevel Mantel-Haenszel and Mantel-Haenszel is justified. A similar justification can be made for an inquiry into the impact of a large sample size, such as n=1000 or n=1500, on the empirical Type I error rates and power for both methods. Furthermore, since purification of the matching criterion was not considered in this study, an examination of the performance of both methods under the condition of a purified matching criterion is warranted.

The Multilevel Mantel-Haenszel was estimated using HGLM. Many other software packages, such as Mplus, SAS, and R, now have the capability to estimate multilevel models. An investigation into the advantages and disadvantages of the aforementioned packages is worthwhile, as the use of one of these packages may result in a more efficient process that overcomes the problem of biased variance and fixed effect estimates due to the PQL algorithm employed by HGLM (Raudenbush & Bryk, 2002).

In summary, although much research is still warranted, the development of the Multilevel Mantel-Haenszel method for detecting differential item functioning in dichotomously scored items adds a new dimension to DIF detection. The very popular and widely used Mantel-Haenszel procedure can now be used to investigate item bias at many levels.

LIST OF REFERENCES

Ackerman, T. (1992). A didactic explanation of item bias, item impact, and item validity from a multidimensional perspective. Journal of Educational Measurement, 29, 67-91.
Agresti, A. (1996). An introduction to categorical data analysis. New York: Wiley.
Allen, N. L., & Donoghue, J. R. (1996). Applying the Mantel-Haenszel procedure to complex samples of items. Journal of Educational Measurement, 33, 231-251.
Angoff, W. H. (1993). Perspectives on differential item functioning methodology. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 3-24). Hillsdale, NJ: Lawrence Erlbaum Associates.
Borsboom, D., Mellenbergh, G. J., & van der Linder, W. J. (2002). Different kinds of DIF: A distinction between absolute and relative forms of measurement invariance and bias. Applied Psychological Measurement, 26, 433-450.
Camilli, G., & Congdon, P. (1999). Application of a method of estimating DIF for polytomous test items. Journal of Educational and Behavioral Statistics, 24.
Camilli, G., & Shepard, L. (1994). Methods for identifying biased test items (Vol. 4). Thousand Oaks, CA: Sage Publications.
Cheong, Y. F. (2006). Analysis of school context effects on differential item functioning using hierarchical generalized linear models. International Journal of Testing, 6, 57-79.
Chaimongkol, S. (2005). Modeling differential item functioning (DIF) using multilevel logistic regression models: A Bayesian perspective. Unpublished doctoral dissertation, Florida State University, Tallahassee, FL.
Clauser, B. E., Nungester, R. J., & Swaminathan, H. (1996). Improving the matching for DIF analysis by conditioning on both test score and an educational background variable. Journal of Educational Measurement, 33, 453-464.
Clauser, B. E., & Mazor, K. M. (1998). Using statistical procedures to identify differential item functioning test items. Educational Measurement: Issues and Practice, 17, 31-44.
Cohen, A. S., & Kim, S. (1993). A comparison of Lord's chi-square and Raju's area measures in detection of DIF. Applied Psychological Measurement, 17, 39-52.
Donoghue, J. R., Holland, P. W., & Thayer, D. T. (1993). A Monte Carlo study of factors that affect the Mantel-Haenszel and standardization measures of differential item functioning. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 137-163). Hillsdale, NJ: Lawrence Erlbaum Associates.
Dorans, N. J., & Holland, P. W. (1993). DIF detection and description: Mantel-Haenszel and standardization. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 35-66). Hillsdale, NJ: Lawrence Erlbaum Associates.
Fidalgo, A., Mellenbergh, G., & Munoz, J. (2000). Effects of amount of DIF, test length and purification on robustness and power of Mantel-Haenszel procedures. Methods of Psychological Research Online, 5, 43-53.
Finch, W. H., & French, B. F. (2007). Detection of crossing differential item functioning: A comparison of four methods. Educational and Psychological Measurement, 67, 565-582.
Fox, J. P. (2005). Multilevel IRT using dichotomous and polytomous response data. British Journal of Mathematical and Statistical Psychology, 58, 145-172.
Fox, J. P. (2004). Applications of multilevel IRT modeling. School Effectiveness and School Improvement, 15, 261-280.
French, A., & Miller, T. (1996). Logistic regression and its use in detecting differential item functioning in polytomous items. Journal of Educational Measurement, 33, 315-332.
Guo, G., & Zhao, H. (2000). Multilevel modeling for binary data. American Sociological Review, 26, 441-462.
Hidalgo, M., & Perez Pina, J. (2004). Differential item functioning detection and effect size: A comparison between logistic regression and Mantel-Haenszel procedures. Educational and Psychological Measurement, 64, 903-913.
Holland, P. W., & Thayer, D. T. (1985). An alternative definition of the ETS delta scale of item difficulty. Educational Testing Service Report ETS RR-85-43 and ETS TR-85-64.
Holland, P. W., & Thayer, D. T. (1986, April). Differential item performance and the Mantel-Haenszel procedure. Paper presented at the annual meeting of the American Educational Research Association, San Francisco.
Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel-Haenszel procedure. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 129-145). Hillsdale, NJ: Lawrence Erlbaum Associates.
Holland, P. W., & Wainer, H. (Eds.). (1993). Differential item functioning. Hillsdale, NJ: Lawrence Erlbaum Associates.
Jodoin, M., & Gierl, M. (2001). Evaluating type I error and power rates using an effect size measure with the logistic regression procedure for DIF detection. Applied Measurement in Education, 14, 329-349.
Janssen, R., Tuerlinckx, F., Meulders, M., & De Boeck, P. (2000). A hierarchical IRT model for criterion-referenced measurement. Journal of Educational and Behavioral Statistics, 25, 285-306.
Jodoin, M. G., & Gierl, M. J. (2000, April). Reducing type I error using an effect size measure with the logistic regression procedure for DIF detection. Paper presented at the annual meeting of the National Council on Measurement in Education, New Orleans.
Kamata, A. (1998). Some generalizations of the Rasch model: An application of the hierarchical generalized linear model. Unpublished doctoral dissertation, Michigan State University, East Lansing.
Kamata, A. (2002). Procedures to perform item response analysis by hierarchical generalized linear models. Paper presented at the annual meeting of the American Educational Research Association, San Francisco.
Kamata, A. (2001). Item analysis by the hierarchical generalized linear model. Journal of Educational Measurement, 38, 79-93.
Kamata, A., & Binci, S. (2003). Random effect DIF analysis via hierarchical generalized linear models. Paper presented at the annual meeting of the Psychometric Society, Sardinia, Italy.
Kamata, Chaimongkol, Genc, & Bilir (2005). Random effect differential item functioning across group units by the hierarchical generalized linear models. Paper presented at the annual meeting of the American Educational Research Association, Montreal.
Kamata, A., & Vaughn, B. (2004). An introduction to differential item functioning analysis. Learning Disabilities: A Contemporary Journal, 2, 48-69.
Kim, S., & Cohen, A. (1995). A comparison of Lord's chi-square, Raju's area measures, and the likelihood ratio test on detection of differential item functioning. Applied Measurement in Education, 8, 291-312.
Kim, W. (2003). Development of a differential item functioning (DIF) procedure using the hierarchical generalized linear model: A comparison study with logistic regression procedure. Unpublished doctoral dissertation, Pennsylvania State University, University Park, PA.
Lewis, C. (1993). A note on the value of including the studied item in the test score when analyzing test items for DIF. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 317-320). Hillsdale, NJ: Lawrence Erlbaum Associates.
Linn, R. L. (1993). The use of differential item functioning statistics: A discussion of current practice and future implications. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 349-366). Hillsdale, NJ: Lawrence Erlbaum Associates.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum Associates.
Luppescu, S. (2002). DIF detection in HLM. Paper presented at the annual meeting of the American Educational Research Association, New Orleans.
Maier, K. S. (2001). A Rasch hierarchical measurement model. Journal of Educational and Behavioral Statistics, 26, 307-330.
Mantel, N. (1963). Chi-square tests with one degree of freedom; extensions of the Mantel-Haenszel procedure. Journal of the American Statistical Association, 58, 690-700.
Mantel, N., & Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute, 22, 719-748.
Mazor, K., Kanjee, A., & Clauser, B. (1992). Using logistic regression and the Mantel-Haenszel with multiple ability estimates to detect differential item functioning. Journal of Educational Measurement, 32, 131-144.
Mazor, K., Clauser, B., & Hambleton, R. (1992). The effect of sample size on the functioning of the Mantel-Haenszel statistic. Educational and Psychological Measurement, 58, 443-451.
Meredith, W., & Millsap, R. E. (1992). On the misuse of manifest variables in the detection of measurement bias. Psychometrika, 57, 289-311.
Meyer, J. P., Huynh, H., & Seaman, M. A. (2004). Exact small sample differential item functioning methods for polytomous items with illustration based on an attitude survey. Journal of Educational Measurement, 41, 331-344.
Miller, M. D., & Linn, R. L. (1988). Invariance of item characteristic function with variations in instructional coverage. Journal of Educational Measurement, 25, 205-219.
Miller, M. D., & Oshima, T. C. (1992). Effect of sample size, number of biased items, and magnitude of bias on a two-stage item bias estimation method. Applied Psychological Measurement, 16, 381-388.
Miller, T., & Spray, J. (1993). Logistic discriminant function analysis for DIF identification of polytomously scored items. Journal of Educational Measurement, 30, 107-122.
Millsap, R. E., & Everson, H. T. (1993). Methodology review: Statistical approaches for assessing measurement bias. Applied Psychological Measurement, 17, 297-334.
Millsap, R. E., & Meredith, W. (1992). Inferential conditions in the statistical detection of measurement bias. Applied Psychological Measurement, 16, 389-402.
Mok, M. (1995). Sample size requirements for 2-level designs in educational research. Multilevel Modeling Newsletter, 7, 11-15.
Narayanan, P., & Swaminathan, H. (1994). Performance of the Mantel-Haenszel and simultaneous item bias procedures for detecting differential item functioning. Applied Psychological Measurement, 18, 315-328.
Pastor, D. A. (2003). The use of multilevel item response theory modeling in applied research: An illustration. Applied Measurement in Education, 16, 223-243.
Penfield, R. (2001). Assessing differential item functioning among multiple groups: A comparison of three Mantel-Haenszel procedures. Applied Measurement in Education, 14, 235-259.
Penfield, R., & Lam, T. (2000). Assessing differential item functioning in performance assessment: Review and recommendations. Educational Measurement: Issues and Practice, 5-16.
Potenza, M. T., & Dorans, N. J. (1995). DIF assessment for polytomously scored items: A framework for classification and evaluation. Applied Psychological Measurement, 19, 23-37.
Raju, N. S. (1988). The area between two item characteristic curves. Psychometrika, 53, 495-502.
Raju, N. S. (1990). Determining the significance of estimated signed and unsigned areas between two item response functions. Applied Psychological Measurement, 14, 197-207.
Raju, N. S., van der Linden, W. J., & Fleer, P. J. (1995). IRT-based internal measures of differential item functioning in items and tests. Applied Psychological Measurement, 19, 353-368.
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods. Thousand Oaks, CA: Sage Publications.
Roberts, J. (2004). An introductory primer on multilevel and hierarchical linear modeling. Learning Disabilities: A Contemporary Journal, 2, 30-38.
Rogers, H. J., & Swaminathan, H. (1993). A comparison of logistic regression and Mantel-Haenszel procedures for detecting differential item functioning. Applied Psychological Measurement, 17, 105-116.
Roussos, L. A., & Stout, W. F. (1996a). A multidimensionality-based DIF analysis paradigm. Applied Psychological Measurement, 20, 355-371.
Roussos, L. A., & Stout, W. F. (1996b). Simulation studies of the effects of small sample size and studied item parameters on SIBTEST and Mantel-Haenszel Type I error performance. Journal of Educational Measurement, 33, 215-230.
Rudas, T., & Zwick, R. (1997). Estimating the importance of differential item functioning. Journal of Educational and Behavioral Statistics, 22(1), 31-45.
Scheuneman, J. (1979). A method of assessing bias in test items. Journal of Educational Measurement, 16, 143-152.
Shealy, R. T., & Stout, W. F. (1993a). An item response theory model for test bias and differential item functioning. In P. W. Holland & H. Wainer (Eds.), Differential item functioning. Hillsdale, NJ: Lawrence Erlbaum Associates.
Shealy, R. T., & Stout, W. F. (1993b). A model-based standardization approach that separates true bias/DIF from group ability differences and detects test bias/DTF as well as item bias/DIF. Psychometrika, 58, 159-194.
Shen, L. (1999). A multilevel assessment of differential item functioning. Paper presented at the annual meeting of the American Educational Research Association, Montreal.
Shepard, L., Camilli, G., & Averill, M. (1981). Comparison of procedures for detecting test item bias with both internal and external ability criteria. Journal of Educational Statistics, 6, 317-375.
Shepard, L., Camilli, G., & Williams, D. M. (1984). Accounting for statistical artifacts in item bias research. Journal of Educational Statistics, 9, 93-128.
Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27, 361-370.
Swanson, D. B., Clauser, B. E., Case, S. M., Nungester, R. M., & Featherman, C. (2002). Analysis of differential item functioning (DIF) using hierarchical logistic regression models. Journal of Educational and Behavioral Statistics, 27, 53-57.
Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential item functioning using the parameters of item response models. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 67-114). Hillsdale, NJ: Lawrence Erlbaum Associates.
Uttaro, T., & Millsap, R. (1994). Factors influencing the Mantel-Haenszel procedure in the detection of differential item functioning. Applied Psychological Measurement, 18, 16-25.
Van den Noortgate, W., & De Boeck, P. (2005). Assessing and explaining differential item functioning using logistic mixed models. Journal of Educational and Behavioral Statistics, 30(4), 443-464.
Vaughn, B. K. (2006). A hierarchical generalized linear model of random differential item functioning for polytomous items: A Bayesian multilevel approach. Unpublished doctoral dissertation, Florida State University, Tallahassee, FL.
Williams, N., & Beretvas, N. (2006). DIF identification using HGLM for polytomous items. Applied Psychological Measurement, 30, 22-42.
Wilson, A. W., Spray, J. A., & Miller, T. R. (1993). Logistic regression and its use in detecting nonuniform differential item functioning in polytomous items. Paper presented at the annual meeting of the National Council on Measurement in Education, Atlanta.
Zumbo, B. D. (1999). A handbook on the theory and methods of differential item functioning (DIF): Logistic regression modeling as a unitary framework for binary and Likert-type (ordinal) item scores. Ottawa, ON: Directorate of Human Resources Research and Evaluation, Department of National Defense.
Zwick, R. (1990). When do item response function and Mantel-Haenszel definitions of differential item functioning coincide? Journal of Educational Statistics, 15, 185-197.
Zwick, R., & Ercikan, K. (1989). Analysis of differential item functioning in the NAEP history assessment. Journal of Educational Measurement, 28, 55-66.
Zwick, R., Donoghue, J. R., & Grima, A. (1993). Assessing differential item functioning in performance tests (No. 93-14). Princeton, NJ: Educational Testing Service.
Zwick, R., Thayer, D. T., & Mazzeo, J. (1997). Descriptive and inferential procedures for assessing differential item functioning in polytomous items. Applied Measurement in Education, 10, 321-344.
Zwick, R., Donoghue, J. R., & Grima, A. Assessment of differential item functioning for performance tasks. Journal of Educational Measurement, 30(3), 233-251.

BIOGRAPHICAL SKETCH

Jann Marie Wise MacInnes, the oldest child of Peggy and Mac Wise, was born in Americus, Georgia, but grew up in Jacksonville Beach, Florida.
She graduated with honors from the University of North Florida, Jacksonville, Florida, in 1972 with a Bachelor of Arts degree in statistics and again in 1985 with a Master of Arts degree in mathematics with an emphasis in statistics. She was employed by the local electric authority as an electric rates analyst before she entered the field of education. Her first teaching position was with Florida Community College at Jacksonville. In 2003 she moved to the University of North Florida. She has more than 20 years of experience teaching freshman and sophomore mathematics and statistics. In 1995 she received an Outstanding Faculty Award in recognition of her teaching excellence. In 2003 her interests and goals changed and she entered the Ph.D. program in research and evaluation methodology at the University of Florida, Gainesville, Florida. Her current research interests include issues in testing and measurement as they relate to differential item functioning.