A COMPARISON OF LOGISTIC REGRESSION MODELS FOR DIF DETECTION IN POLYTOMOUS ITEMS: THE EFFECT OF SMALL SAMPLE SIZES AND NON-NORMALITY OF ABILITY DISTRIBUTIONS

By

YASEMIN KAYA

A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF ARTS IN EDUCATION

UNIVERSITY OF FLORIDA

2010
© 2010 Yasemin Kaya
To my undergraduate professors Dr. Huseyin Bag and Zeki Kasap, who strengthened my sense of scholarship, believed in me, and encouraged me to start this long journey
ACKNOWLEDGMENTS

I would like to thank all the people who have helped and inspired me. I especially want to thank my advisor, Dr. Walter Leite, for his guidance and endless help during my research and study, and my committee member Dr. David Miller for sharing his ideas and corrections. I was delighted to interact with Dr. James Algina by attending his classes. My deepest gratitude goes to my family for their constant support and love. I hope that this achievement will complete the dream they had for me all those many years ago when they chose to give me the best education they could. Thanks to my father Sefer Kaya for making me believe that all obstacles can be defeated with the right decisions, to my mother Firdevs Kaya for her endless encouragement and unconditional love, to my sister Aysun for listening to all my grumblings during her busiest work hours, and to my brother Tolga for making me smile in the worst moments. Last, thanks go out to Francisco Jimenez for his supportive intimacy.
TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
  1.1 Purpose of the Study
  1.2 Research Questions

2 LITERATURE REVIEW
  2.1 Differential Item Functioning (DIF)
  2.2 Description of DIF Detection Methods
    2.2.1 Mantel-Haenszel
    2.2.2 Logistic Discriminant Function Analysis
    2.2.3 Item Response Theory (IRT)
    2.2.4 Standardization
    2.2.5 Logistic Regression
      2.2.5.1 Logistic regression method in dichotomous items
      2.2.5.2 Logistic regression method in polytomous items
  2.3 Comparison of Logistic Regression DIF Detection Method with the Other Methods
  2.4 Groups with Skewed Ability Distributions
  2.5 Groups with Small Sample Sizes

3 METHODS
  3.1 Factors Manipulated
    3.1.1 Type of Differential Item Functioning (DIF)
    3.1.2 Sample Size
    3.1.3 Group Ability Distributions
  3.2 Data Generation
  3.3 Data Analysis

4 RESULTS

5 DISCUSSION
  5.1 Limitations and Suggestions for Future Research
  5.2 Conclusion

REFERENCES
BIOGRAPHICAL SKETCH
LIST OF TABLES

2-1 Data demonstration of kth level of the matching variable
3-1 Item parameters for items 1 to 24
3-2 Item parameters for the 25th item
4-1 Power of any kind of differential item functioning (DIF) detection for item 25
4-2 Non-uniform DIF detection rates for item 25
4-3 Uniform DIF detection rates for item 25
4-4 Type I error rates for the tests of any kind of DIF detection for items 1 to 24
4-5 Type I error rates for the tests of uniform DIF detection for items 1 to 24
4-6 Type I error rates for the tests of non-uniform DIF detection for items 1 to 24
LIST OF FIGURES

2-1 Item characteristic curves (ICCs) for non-uniform differential item functioning (DIF)
2-2 ICCs for uniform DIF
2-3 ICCs for non-uniform DIF due to differences in both discrimination and difficulty parameters
2-4 Category response curves (CRCs) of focal and reference groups for a non-uniform DIF condition
2-5 CRCs of focal and reference groups for a uniform DIF condition
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Arts in Education

A COMPARISON OF LOGISTIC REGRESSION MODELS FOR DIF DETECTION IN POLYTOMOUS ITEMS: THE EFFECT OF SMALL SAMPLE SIZES AND NON-NORMALITY OF ABILITY DISTRIBUTIONS

By

Yasemin Kaya

August 2010

Chair: Walter L. Leite
Major: Research and Evaluation Methodology

This study investigated the effectiveness of logistic regression models for detecting uniform and non-uniform DIF in polytomous items across small sample sizes and non-normal ability distributions. A simulation study was used to compare three logistic regression models: the cumulative logits model, the continuation ratio model, and the adjacent categories model. The results revealed that logistic regression was a powerful method for detecting DIF in polytomous items, but not useful for distinguishing the type of DIF. The continuation ratio model worked best for detecting uniform DIF, but the cumulative logits model gave more acceptable Type I error results. As sample size increased, Type I errors increased for the cumulative logits model. Skewness of the ability distributions reduced the power of logistic regression to detect non-uniform DIF. Small sample sizes reduced the power of logistic regression.
CHAPTER 1
INTRODUCTION

The use of standardized tests brings many considerations about the measurement tool, such as fairness and validity. Validity is one of the chief concerns of test developers and users, and detecting bias in a test is an important step in construct validation. In a measurement instrument, all items are expected to be fair to all test takers. If a test is not fair to all test takers, a test bias issue arises. Test bias is a systematic error that causes deviant results for the members of a certain group in a measurement (Camilli & Shepard, 1994). An item with bias is an item where individuals with the same ability level but from different groups do not have the same probability of getting the item correct (Holland & Wainer, 1993). Procedures to detect item bias started receiving attention decades ago, and a great deal of work has been done since then to deal with item bias. One way to deal with item bias is using differential item functioning statistics. Differential item functioning (DIF) simply describes the inspection of different operational functions of an item in different groups, after controlling for ability level (Holland & Wainer, 1993). Basically, DIF procedures consist of determining those items that function differently for different groups, analyzing the reasons why such items function differently for a particular group, identifying these items as biased, and finally, removing them from the test (Camilli & Shepard, 1994). Various DIF methods are used to identify item bias. Some of the most commonly used DIF methods are the Mantel-Haenszel (Mantel & Haenszel, 1959; Holland & Thayer, 1988), standardized difference (Dorans & Kulick, 1983; 1986; Dorans & Schmitt, 1991), logistic regression (Swaminathan & Rogers, 1990), ordinal logistic regression (Zumbo, 1999), logistic discriminant function analysis (Miller & Spray, 1993), IRT (Raju, 1988; 1990; Lord, 1980; Thissen, Steinberg & Wainer, 1988), and SIBTEST (Shealy & Stout, 1993; Chang, Mazzeo & Roussos, 1996).

The logistic regression procedure is a model-based DIF detection procedure, and it is one of the first techniques used to detect both uniform and non-uniform DIF (Swaminathan & Rogers, 1990). The logistic regression DIF detection technique works by comparing the probabilities of getting the item correct for different groups. The probability of getting an item correct is estimated with the model below (Swaminathan & Rogers, 1990):

P(u = 1 | θ) = e^(β₀ + β₁θ) / [1 + e^(β₀ + β₁θ)]    (1-1)

where β₀ is the intercept and β₁ is the slope. The item shows DIF if the probabilities are not the same for different groups at the same ability level (Swaminathan & Rogers, 1990).

The logistic regression method has become a widely used DIF detection method over the last two decades. Its capability to be used not only for both uniform and non-uniform DIF detection, but also for dichotomous and polytomous items, makes this method attractive in this area. Swaminathan and Rogers (1990) pointed out that one of the most important advantages of the logistic regression method for detecting DIF is that it is a model-based method providing information about the nature of DIF. Furthermore, the logistic regression DIF detection method for dichotomous items can be extended to polytomous items (Zumbo, 1999). This advantage of the logistic regression method simplifies the DIF detection procedure when questions are both polytomous
and dichotomous. Moreover, logistic regression can detect both uniform and non-uniform DIF (Swaminathan & Rogers, 1990).

Dichotomous and polytomous items are two common formats that differ in the scoring method. Dichotomous items are scored in two categories, such as correct or incorrect. On the other hand, polytomous items are scored with more than two categories. An increase in the use of polytomous items in assessment tools such as performance assessments, portfolios, and questions with more than one possible score category has focused researchers' attention on DIF detection methods for polytomous items (Thurman, 2009). DIF detection methods for polytomous items have received attention for the last twenty years. Several authors have addressed extensions of the logistic regression DIF detection method from dichotomous items to polytomous items (Miller & Spray, 1993; Welch & Hoover, 1993; French & Miller, 1996; Zumbo, 1999). The use of logistic regression models with polytomous items as a DIF detection method, and the comparison of three logistic regression models (cumulative logits, continuation ratio, and adjacent categories), were first examined by French and Miller (1996). Our study is an extension of the study of French and Miller (1996).

Skewness of group ability distributions in DIF detection is an important point that needs to be considered (Monaco, 1997; Kristjansson, Aylesworth, McDowell & Zumbo, 2005; Welkenhuysen-Gybels, 2004). Skewness of ability distributions in the logistic regression DIF detection method has not been studied much. Even if simulated conditions allow us to create normally distributed data, it is possible to encounter skewed ability distributions in real data situations. In order to evaluate the possible effect of skewness, the present simulation study investigated the effectiveness
of the logistic regression DIF detection method with polytomous items for groups with non-normal ability distributions.

DIF detection with the logistic regression method in small samples is another condition that needs to be examined. The common small sample size used in studies of the logistic regression DIF detection method is approximately 250 per group (Swaminathan & Rogers, 1990; Scott, Fayers, Aaronson, Bottomley, Graeff, Groenvold, Gundy, Koller, Petersen & Sprangers, 2008; Zumbo, 1999). The present study investigated the feasibility of logistic regression for small samples by simulating two small sample sizes of 100 and 250 per group.

1.1 Purpose of the Study

This study replicates the work done by French and Miller (1996), in which they examined how useful logistic regression was for the detection of DIF in polytomous items. French and Miller (1996) compared power and Type I error values for detecting uniform and non-uniform DIF by simulating item parameters and sample sizes across three models: cumulative logits, continuation ratio, and adjacent categories. The specified conditions consisted of normal ability distributions, sample sizes of 500 and 2000, and varied item parameters (a's and b's). They concluded that the cumulative logits model and the continuation ratio model had high power compared to the adjacent categories model. As expected, sample size had an effect on power. The power to detect non-uniform DIF was higher for the samples with a higher range of a parameters. They suggested further research to investigate the most efficient coding scheme under different settings. French and Miller's (1996) study was significant as the first simulation study comparing the performances of the three extensions of logistic modeling to polytomous
data in DIF detection. The primary purpose of this Monte Carlo simulation study was to investigate the feasibility of using three logistic regression models for polytomous items (cumulative logits, continuation ratio, and adjacent categories) as a DIF detection method in the case of small sample sizes and non-normality of ability distributions. The study also aimed to determine the most effective model among the three under certain conditions. Since not much work had been done on the impact of skewed ability distributions and small sample sizes on the logistic regression DIF detection models, we added new conditions to the study design. Thus this study expanded on French and Miller's (1996) research by examining the performance differences under the effect of small sample sizes and skewed ability distributions. This study has three goals: (1) to compare the power and Type I error values of the three extensions of logistic modeling to polytomous data in detecting DIF; (2) to investigate the effect of small sample sizes on the effectiveness of these three models; and (3) to study the effect of non-normality of ability distributions on the effectiveness of these three models.

1.2 Research Questions

The following research questions are addressed:

1. Do the cumulative logits, continuation ratio, and adjacent categories models differ with respect to power to detect DIF in polytomous items?
2. Do the cumulative logits, continuation ratio, and adjacent categories models differ with respect to Type I error when used to detect DIF in polytomous items?
3. Is the performance of logistic regression models for polytomous items affected by non-normality of the ability distribution?
4. How are logistic regression models for polytomous items affected by small sample sizes?
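The logistic regression machinery these questions refer to can be illustrated for the dichotomous case with a small simulation: nested models are fit, and likelihood-ratio statistics test the group term (uniform DIF) and the ability-by-group interaction (non-uniform DIF). The sketch below is illustrative only, not the code used in this study; in particular, the latent ability is used directly as the matching variable, whereas applications condition on the observed total score.

```python
import math
import numpy as np

def fit_logistic(X, y, iters=25):
    """Newton-Raphson fit of a logistic regression; returns (beta, log-likelihood)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        H = (X * (p * (1.0 - p))[:, None]).T @ X          # Fisher information
        beta += np.linalg.solve(H, X.T @ (y - p))         # Newton step
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return beta, float(np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

def lr_test_1df(ll_full, ll_reduced):
    """Likelihood-ratio chi-square (1 df) and its p-value."""
    g2 = 2.0 * (ll_full - ll_reduced)
    return g2, math.erfc(math.sqrt(max(g2, 0.0) / 2.0))   # P(chi2_1df > g2)

rng = np.random.default_rng(0)
n = 1000
theta = rng.normal(size=2 * n)                  # ability
g = np.repeat([0.0, 1.0], n)                    # 0 = reference, 1 = focal
# Simulate uniform DIF: the item is harder for the focal group (logit shifted by -1)
p_true = 1.0 / (1.0 + np.exp(-(1.2 * theta - 1.0 * g)))
y = (rng.random(2 * n) < p_true).astype(float)

ones = np.ones_like(theta)
_, ll_base = fit_logistic(np.column_stack([ones, theta]), y)
_, ll_grp  = fit_logistic(np.column_stack([ones, theta, g]), y)
_, ll_full = fit_logistic(np.column_stack([ones, theta, g, theta * g]), y)

g2_uniform, p_uniform = lr_test_1df(ll_grp, ll_base)   # tests the group term
g2_nonunif, p_nonunif = lr_test_1df(ll_full, ll_grp)   # tests the interaction term
print(p_uniform, p_nonunif)
```

With the strong simulated group effect, the uniform DIF test is highly significant, while the interaction test (no interaction was simulated) behaves like a null test.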
CHAPTER 2
LITERATURE REVIEW

The literature review chapter begins with a brief introduction to differential item functioning (DIF) and its importance in terms of validity and test fairness. An overview of DIF detection techniques and previous research on the effectiveness of these techniques is presented. The use of polytomous items in measurement and the detection of DIF in such items are then described. DIF methods for polytomous items and the logistic regression method to detect DIF with polytomous items are then compared. Finally, the need for research on the logistic regression method to detect DIF in polytomous items with skewed ability distributions and small sample sizes is explained.

2.1 Differential Item Functioning (DIF)

The term differential item functioning (DIF) was used by Holland and Wainer (1993) to describe different probabilities of an item being answered correctly by different groups at the same ability level. DIF analysis separates the whole examinee group into two subgroups, called the focal and reference groups. The focal group is the group of interest, such as racial groups, minorities, women, etc., whereas the reference group is the group the focal group is compared with, such as Whites, males, etc. (Holland & Wainer, 1993). Depending on the presence of an interaction between ability level and group membership, DIF can be categorized into two different types: uniform and non-uniform. Mellenbergh (1982) defines non-uniform DIF as the presence of an interaction between group and score category in the data. In non-uniform DIF, the existence of an interaction is needed to describe the data. Figure 2-1 below shows non-uniform DIF.
The curves cross each other, because they have different slopes. On the other hand, their locations in the graph are the same.

Figure 2-1. Item characteristic curves (ICCs) for non-uniform differential item functioning (DIF)

Uniform DIF is described by Mellenbergh (1982) as the independence of score categories from group membership. Figure 2-2 below illustrates uniform DIF. As we see, the curves are parallel because they have equal slopes. However, they are in different locations. Figure 2-3 demonstrates the case of an item showing non-uniform DIF due to differences in both discrimination and difficulty parameters. The curves of the reference and focal groups cross each other since they have different slopes. Moreover, the curves are located differently.
Figure 2-2. ICCs for uniform DIF.

Figure 2-3. ICCs for non-uniform DIF due to differences in both discrimination and difficulty parameters.
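The patterns in Figures 2-2 and 2-3 can be reproduced numerically from the 2PL model: when only the difficulty b differs, the ICCs never cross, whereas a difference in the discrimination a forces a crossing. A minimal sketch (the parameter values are arbitrary, chosen only for illustration):

```python
import numpy as np

def icc_2pl(theta, a, b):
    """Two-parameter logistic item characteristic curve."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-4, 4, 801)

# Uniform DIF: same discrimination, different difficulty -> non-crossing curves
diff_uniform = icc_2pl(theta, a=1.0, b=-0.5) - icc_2pl(theta, a=1.0, b=0.5)

# Non-uniform DIF: different discrimination -> the curves cross
diff_nonuniform = icc_2pl(theta, a=1.5, b=0.0) - icc_2pl(theta, a=0.7, b=0.0)

print(np.all(diff_uniform > 0),
      np.any(diff_nonuniform > 0) and np.any(diff_nonuniform < 0))  # → True True
```

The first check confirms the reference curve stays above the focal curve everywhere (uniform DIF); the second confirms the difference changes sign, i.e., the curves cross (non-uniform DIF).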
DIF detection in polytomous items is analogous to DIF detection in dichotomous items. However, since there are more than two possible response categories, DIF detection in polytomous items requires a different methodology than that followed for dichotomous items. The focal and reference groups may differ in one or more categories, but not in all categories. Thus each response category must be evaluated separately to detect DIF (Vaughn, 2006).

Figure 2-4. Category response curves (CRCs) of focal and reference groups for a non-uniform DIF condition

Item characteristic curves for polytomous items are drawn for each category separately for each group, and are called category response curves (CRCs) (Embretson & Reise, 2000). Figures 2-4 and 2-5 are example demonstrations of CRCs for a non-uniform DIF condition and a uniform DIF condition.
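One common way to obtain CRCs for an ordered polytomous item is Samejima's graded response model, in which each category probability is the difference between adjacent cumulative 2PL curves. The sketch below is illustrative only (the parameters are made up, and the thesis does not state that its CRCs were generated exactly this way):

```python
import numpy as np

def grm_crcs(theta, a, bs):
    """Category response curves under a graded response model.

    a: discrimination; bs: ordered category thresholds (length J-1).
    Returns an array of shape (J, len(theta)): one CRC per score category.
    """
    # Cumulative curves P(Y >= k), with P(Y >= 0) = 1 and P(Y >= J) = 0
    cum = [np.ones_like(theta)]
    cum += [1.0 / (1.0 + np.exp(-a * (theta - b))) for b in bs]
    cum.append(np.zeros_like(theta))
    # Category k probability is the difference of adjacent cumulative curves
    return np.array([cum[k] - cum[k + 1] for k in range(len(bs) + 1)])

theta = np.linspace(-4, 4, 101)
crcs = grm_crcs(theta, a=1.3, bs=[-1.0, 0.0, 1.2])  # four score categories, 0-3
print(crcs.shape, np.allclose(crcs.sum(axis=0), 1.0))  # → (4, 101) True
```

Plotting one such set of curves per group, with group-specific a (non-uniform DIF) or bs (uniform DIF), yields figures like 2-4 and 2-5.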
Figure 2-5. CRCs of focal and reference groups for a uniform DIF condition

2.2 Description of DIF Detection Methods

2.2.1 Mantel-Haenszel

The Mantel-Haenszel (M-H) DIF detection method for dichotomously scored items was introduced by Holland and Thayer (1988) as an adaptation of the data analysis procedure proposed by Mantel and Haenszel (1959). This method only detects uniform DIF. According to Holland and Wainer's (1993) description, the M-H DIF detection analysis starts with a 2x2xM contingency table arrangement. The first 2 represents group membership, focal or reference. The second 2 represents item response, right or wrong. Finally, M represents total test score, which is the matching variable. The examinee group is first divided into the focal and reference groups. In order to test the difference between the probabilities of the two groups answering the item correctly, a hypothesis test is conducted. The null hypothesis of M-H states that,
for a certain level of the matching variable, M, both reference and focal groups have equal odds of answering the item correctly. A chi-square test is used to test the null hypothesis of M-H (Holland & Wainer, 1993):

χ²(MH) = [ |Σₘ R_rm − Σₘ E(R_rm)| − 0.5 ]² / Σₘ Var(R_rm)    (2-2)

where

E(R_rm) = N_rm R_tm / N_tm    (2-3)

Var(R_rm) = N_rm N_fm R_tm W_tm / [N_tm² (N_tm − 1)]    (2-4)

In these equations, N represents the number of people at ability level m. R indicates right and W indicates wrong answers. The subscript r represents the reference group, f represents the focal group, and t represents the total group.

Commonly used item difficulty metrics for M-H are the delta metric and the p-metric (Holland & Wainer, 1993). The delta item difficulty metric is obtained by converting the proportion correct (p) to a z-score:

Δ = 13 + 4Φ⁻¹(1 − p)    (2-5)

where Φ⁻¹ is the inverse of the standard normal distribution function. Bigger values of delta correspond to more difficult items. Holland and Wainer (1993) cite Holland and Thayer's (1985) conversion formula of the constant odds ratio estimate into the delta metric as

MH D-DIF = −2.35 ln(α̂_MH)    (2-6)

where α̂_MH is the Mantel-Haenszel estimate of the constant odds ratio,

α̂_MH = (Σₘ R_rm W_fm / N_tm) / (Σₘ R_fm W_rm / N_tm)

The p-metric is called the proportion correct metric, and is obtained by subtracting the predicted proportion correct from the observed proportion correct in the focal group:

MH p-DIF = P_f − P̂_f    (2-7)

where P̂_f is the predicted focal-group proportion correct, obtained at each matching level by applying the common odds ratio to the reference-group proportion:

P̂_fm = α̂_MH P_rm / (1 − P_rm + α̂_MH P_rm)    (2-8)

(Holland & Wainer, 1993).

Several extensions of M-H to polytomous items with more than two response categories have been proposed. One of the methods, introduced by Mantel (1963), includes a 2xk contingency table for each level m of the matching variable. Here k represents the number of response categories, and the 2 is for the focal and reference groups. Zwick and Thayer (1996) used a 2xTxK notation for the contingency table, where T is the number of response categories and K is the number of levels of the matching variable. Table 2-1 shows their 2xT contingency table for the kth level of the matching variable.

Table 2-1. Data demonstration of kth level of the matching variable.

                          Item score
Group        y1     y2     y3     ...   yT     Total
Reference    nR1k   nR2k   nR3k   ...   nRTk   nR+k
Focal        nF1k   nF2k   nF3k   ...   nFTk   nF+k
Total        n+1k   n+2k   n+3k   ...   n+Tk   n++k

Zwick and Thayer (1996) reformulated Mantel's chi-square statistic as a Z statistic. The reformulated chi-square statistic of Mantel is expressed as
χ² = [Σₖ F_k − Σₖ E(F_k)]² / Σₖ Var(F_k)    (2-9)

where F_k is the sum of the scores for the focal group in the kth level of the matching variable. F_k is defined as

F_k = Σₜ yₜ n_Ftk    (2-10)

Under the null hypothesis H₀ of no difference, the expected value of F_k is expressed as

E(F_k) = (n_F+k / n_++k) Σₜ yₜ n_+tk    (2-11)

and under H₀ the variance of F_k is defined as

Var(F_k) = [n_R+k n_F+k / (n_++k² (n_++k − 1))] [n_++k Σₜ yₜ² n_+tk − (Σₜ yₜ n_+tk)²]    (2-12)

Rejection of the null hypothesis means that the item can be flagged as showing DIF: there is a significant difference between the performances of the reference and focal groups at the same ability level (Su & Wang, 2005).

2.2.2 Logistic Discriminant Function Analysis

Logistic discriminant function analysis (LDFA) is another method to detect DIF in polytomous items that is related to logistic regression. The LDFA method differs from logistic regression in the nature of its regression model. Miller and Spray (1993) proposed the use of the LDFA method for DIF detection in polytomous items. In their study, the probability equation of LDFA was presented as

P(G = 1 | X, U) = e^(β₀ + β₁X + β₂U + β₃XU) / [1 + e^(β₀ + β₁X + β₂U + β₃XU)]    (2-13)
where β₀, β₁, β₂, and β₃ are regression coefficients; G is the group membership variable, which is 1 for the reference group and 0 for the focal group; U is the item response variable; and X is the total score. The model predicts group membership from total score and item response, which is the main difference between LDFA and logistic regression: the logistic regression model predicts the item response from group membership and total score. The equation allows U, the item response variable, to take more than two categories. Hence, the LDFA model is eligible for use in DIF detection with polytomously scored items.

Miller and Spray (1993) evaluated the effectiveness of LDFA with an example dataset, which was a mathematics performance test consisting of 21 dichotomous and 6 polytomous items. Female and male students were compared on their total scores. Three regression models were estimated: (1) the first model with only total score as a variable, (2) the second model with total score and item score, and (3) the full model with total score, item score, and the score-by-item interaction. The LDFA model was compared with the Mantel-Haenszel test for dichotomously scored questions, and with the generalized Mantel-Haenszel test for polytomously scored items. The study indicated that LDFA had some advantages over the other methods, including Mantel-Haenszel. LDFA was an appropriate method for both uniform and non-uniform DIF detection, whereas Mantel-Haenszel is not suitable for non-uniform DIF detection. Furthermore, since the item score is an independent variable, no item needs more than one regression. Moreover, there is no limit on the number of items or variables that can be used as independent variables (Miller & Spray, 1993).
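The nested-model logic of LDFA can be sketched as follows. Everything here is an illustrative assumption, not Miller and Spray's actual analysis: the data are simulated, and the likelihood-ratio comparisons of the three models stand in for whatever tests a given implementation uses.

```python
import numpy as np
from scipy.optimize import minimize

def max_loglik(X, g):
    """Maximized log-likelihood of a logistic model predicting group membership g."""
    def nll(beta):
        z = X @ beta
        # Stable negative log-likelihood: sum of log(1 + e^z) - g*z
        return float(np.sum(np.logaddexp(0.0, z) - g * z))
    return -minimize(nll, np.zeros(X.shape[1]), method="BFGS").fun

rng = np.random.default_rng(1)
n = 500
g = np.repeat([1.0, 0.0], n)               # 1 = reference, 0 = focal
x = rng.normal(25.0, 5.0, size=2 * n)      # total test score (illustrative scale)
# Polytomous item response (0-3) that depends on group as well as total score -> DIF
p = 1.0 / (1.0 + np.exp(-(0.2 * (x - 25.0) + 0.8 * g)))
u = rng.binomial(3, p).astype(float)

ones = np.ones(2 * n)
ll_score = max_loglik(np.column_stack([ones, x]), g)             # model 1: total score only
ll_item  = max_loglik(np.column_stack([ones, x, u]), g)          # model 2: + item score
ll_full  = max_loglik(np.column_stack([ones, x, u, x * u]), g)   # model 3: + interaction

g2_uniform = 2.0 * (ll_item - ll_score)  # item predicts group beyond ability: uniform DIF
g2_nonunif = 2.0 * (ll_full - ll_item)   # score-by-item interaction: non-uniform DIF
print(g2_uniform > 3.84)  # → True
```

Because the item was simulated to favor the reference group at every total score, the item score sharply improves the prediction of group membership, and the uniform DIF statistic exceeds the 1-df critical value.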
2.2.3 Item Response Theory (IRT)

DIF in IRT is explained fundamentally with the item response function (IRF). The IRF gives the probability of getting the item correct for dichotomous items, and the probability of getting the expected score for polytomous items (Kim & Cohen, 1998). IRFs for different groups at the same ability level are expected to be identical. Any difference in the item parameters of two groups with the same ability level results in differences in the IRFs of these two groups and is explained as the existence of DIF (Kim & Cohen, 1998). The two-parameter logistic (2PL) model for the IRF, or trace line model, uses the following formula (Birnbaum, 1968):

P_i(θ) = e^(a_i(θ − b_i)) / [1 + e^(a_i(θ − b_i))]    (2-14)

where a_i is the item discrimination and b_i is the item difficulty parameter.

Different IRT approaches to detect DIF have been proposed (Raju, 1988; 1990; Lord, 1980; Thissen, Steinberg & Wainer, 1988). We can classify these approaches into two groups: parametric and nonparametric IRT methods. Raju (1990) proposed a nonparametric method to detect DIF by examining whether the areas between the IRFs of two groups are significantly different from 0. The areas between two IRFs are named the signed area (SA) and the unsigned area (UA). The formulas to calculate the exact values of these signed and unsigned areas were presented by Raju (1988). SA defines the difference between two IRFs, and UA defines the distance (Raju, 1988). A significance test for the signed area is established by computing a Z value under the assumption of normality,

Z = SA / SE(SA)    (2-15)
and comparing the result with the tabled value. A significant result means that the signed area between the two groups is significantly different from 0; in other words, the two groups significantly differ in their performance. For the unsigned area calculation, SA in the Z formula is replaced with UA, calculated with non-normality taken into account (Raju, 1990).

The parametric approach is the IRT likelihood ratio (IRT-LR), which compares the likelihood functions of two groups to test the significance of the difference between the groups (Thissen, Steinberg & Wainer, 1988). IRT-LR estimates the item parameters using the marginal maximum likelihood (MML) algorithm described by Bock and Aitkin (1981). After estimating the IRT logistic model, the likelihood functions are compared. Another parametric approach is the χ² estimation of logistic IRT models for the reference and focal groups described by Lord (1980).

2.2.4 Standardization

The standardization method was introduced by Dorans and Kulick (1986) as a DIF detection method for dichotomously scored items. The term standardization refers to controlling one variable while comparing the groups on another associated variable (Dorans & Kulick, 1986). Dorans and Kulick (1986) called the largest and most stable group on the estimates of conditional probabilities the base (b) group. The other group was called the focal (f) group. They defined the DIF measure at an individual score level s as

D_s = P_fs − P_bs    (2-16)

where P_fs and P_bs are the proportions correct in the focal and base groups at score level s.
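The standardized index built from these per-level differences is just a weighted average. A toy computation (all proportions and weights below are made-up numbers for illustration):

```python
import numpy as np

# Proportion correct at each matched score level s (made-up values)
p_focal = np.array([0.20, 0.35, 0.55, 0.70, 0.90])
p_base  = np.array([0.25, 0.45, 0.60, 0.80, 0.92])
w       = np.array([ 40., 120., 200., 150.,  60.])   # weights, e.g. focal counts per level

d_s = p_focal - p_base                  # per-level DIF measure (Equation 2-16)
d_std = np.sum(w * d_s) / np.sum(w)     # weighted average across score levels
print(round(float(d_std), 4))  # → -0.0705
```

The negative value indicates the focal group is, on average across matched score levels, less likely to answer the item correctly than the base group.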
Standardization's item discrepancy indices were defined as the standardized p-difference (D_STD) and the root mean weighted squared difference (RMWSD). The standardized p-difference was identified as

D_STD = Σₛ wₛ (P_fs − P_bs) / Σₛ wₛ    (2-17)

where wₛ is the weighting factor at the s score level. Dorans and Schmitt (1991) extended the standardization method to polytomous items. Basically, this method creates matching levels and calculates the expected item scores of the focal group and the reference group at each matching level. The expected item scores determine the empirical item-test regressions of the groups. DIF indices are calculated by using these empirical item-test regressions.

2.2.5 Logistic Regression

2.2.5.1 Logistic regression method in dichotomous items

Using logistic regression for DIF detection in dichotomous items was proposed by Swaminathan and Rogers (1990). Based on Bock's (1975) standard logistic regression model, the authors formulated the logistic regression model to detect DIF as

P(u = 1 | θ) = e^z / (1 + e^z)    (2-18)

where

z = β₀ + β₁θ + β₂g + β₃(θg)    (2-19)

In this model g is the group membership, defined as 0 for the reference group and 1 for the focal group; the term θg is the product of the two variables θ and g; β₂ is the group difference in performance on
the item, and β₃ is the interaction term between group membership and ability. Swaminathan and Rogers (1990) indicated that if β₂ ≠ 0 and β₃ = 0, the item shows uniform DIF. Furthermore, if β₃ ≠ 0, the item shows non-uniform DIF, no matter whether β₂ = 0 or not.

Swaminathan and Rogers (1990) compared the logistic regression procedure with the Mantel-Haenszel procedure in terms of power by manipulating sample size, test length, and type of DIF. Results indicated that logistic regression was as powerful as Mantel-Haenszel in detecting uniform DIF. Furthermore, logistic regression was able to detect non-uniform DIF, whereas Mantel-Haenszel was not. Conversely, the logistic regression procedure resulted in a higher Type I error rate than predicted.

2.2.5.2 Logistic regression method in polytomous items

The logistic regression DIF detection method for dichotomous items has been extended to polytomous items by several researchers (Miller & Spray, 1993; Welch & Hoover, 1993; French & Miller, 1996; Zumbo, 1999). French and Miller (1996) conducted a simulation study to examine the usefulness of different logistic regression models extended to polytomous data. Their study compared three extensions of logistic modeling to polytomous data: cumulative logits, adjacent categories, and continuation ratio logits. These three recoding procedures were discussed by Agresti (2002). Agresti (2002) defined the cumulative logits as

logit[P(Y ≤ j)] = log[(π₁ + … + π_j) / (π_{j+1} + … + π_J)],  j = 1, …, J − 1    (2-20)
where J is the number of response categories in an item. This model reformulates the responses into two categories: categories 1 to j form the outcome in the numerator, and categories j + 1 to J form the outcome in the denominator (Agresti, 2002). This is a model that does not lose data on dichotomization of the score categories. The adjacent categories logits are defined as (Agresti, 2002)

log( π_j / π_(j+1) ), j = 1, …, J − 1.   (2-21)

The probability of every response is compared with the probability of the adjacent response (French & Miller, 1996). This model preserves the order of the response categories (Agresti, 2002). The continuation ratio logits are described as (Agresti, 2002)

log( π_j / (π_(j+1) + … + π_J) ), j = 1, …, J − 1,   (2-22)

or as

log( π_(j+1) / (π_1 + … + π_j) ), j = 1, …, J − 1.   (2-23)

In this model, the numerator moves up one category at a time and the denominator accumulates along with it. Thus, gradually more, but still a relatively small amount of, data are lost when using this model (French & Miller, 1996). In their study, French and Miller (1996) evaluated the power of the logistic regression DIF detection method in polytomously scored items across three coding schemes: cumulative logits, adjacent categories, and continuation ratio logits. A 25-item test was simulated with a single item including DIF, and every item had four potential score
categories ranging from 0 to 3. Item parameters were varied in four different conditions. In order to create nonuniform DIF, the discrimination parameters were varied in three of the conditions. In the fourth condition, the difficulty parameters were varied to create uniform DIF. Two sample sizes, 500 and 2000, were simulated to represent small and large populations. Results can be summarized in three major points. First, as expected, sample size had a significant effect on power, and logistic regression had higher power with the largest sample size (2000). Second, the difference in item parameters between the two groups had an impact on the likelihood of detecting uniform and nonuniform DIF. Finally, cumulative logits coding and continuation ratio coding had high power, whereas adjacent categories coding had the lowest power, due to loss of data. Thus, loss of data affected the power and DIF detection ability of logistic regression. On the other hand, the study stated that the adjacent categories coding scheme might be used to determine the location of DIF. Zumbo (1999) proposed a DIF detection method that combines the ordinal logistic regression (OLR) method and an R² measure of effect size in order to detect DIF and determine the magnitude of DIF in ordinal data. OLR extends the logistic regression method for dichotomous data to ordinal data, and it uses the same modeling proposed by French and Miller (1996). In addition to the logistic regression method, Zumbo (1999) developed a DIF effect size measure for ordinal items (R²). In three steps, he showed the application of the method on two items from a simulated 20-item ordinal test data set for gender DIF. In the first step of modeling, only the conditioning variable, which was total score, was entered in the model. In the second step, the group variable, which was gender, was added to the model. In the third and last model, the interaction term
with both the gender and total score variables was added to the model. DIF statistics were calculated by using the chi-square values obtained from the analysis. Subtracting the chi-square value of step 1 from the chi-square value of step 3 gave the simultaneous statistical DIF test of uniform and nonuniform DIF. The result was compared to the critical chi-square value with 2 degrees of freedom. The same procedure was repeated between step 2 and step 1, subtracting the chi-square value of step 1 from that of step 2, to identify the existence of uniform DIF. Comparisons of R-squared values between step 3 and step 1, step 2 and step 1, and step 3 and step 2 were also made by subtracting the latter from the former. An R-squared difference of .002 was assumed as the criterion to show DIF. In addition to the DIF detection methods described above, various other methods can be used to detect DIF. For example, DIF can be detected using the simultaneous item bias test (SIBTEST) method (Shealy & Stout, 1993; Chang, Mazzeo & Roussos, 1996), the multiple indicators multiple causes (MIMIC) method (Finch, 2005; Oort, 1998), and the hierarchical generalized linear modeling (HGLM) method (Swanson, Clauser, Case, Nungester & Featherman, 2002).

2.3 Comparison of the Logistic Regression DIF Detection Method with Other Methods

Kristjansson et al. (2005) compared four different DIF detection methods for ordinal items on Type I error rate and power with a simulation study. The methods compared were the Mantel, generalized Mantel-Haenszel (GMH), logistic discriminant function analysis (LDFA), and unconstrained cumulative logits ordinal logistic regression (UCLOLR). The authors manipulated the presence and type of DIF, item discrimination, group ability distribution, sample size ratio, and skewness in ability distribution
conditions in the study. One item out of 26 presented DIF, and this item was simulated in three different conditions: null DIF with equal a and b parameters for both groups; uniform DIF with equal a parameters for both groups but b parameters increased by .25 for the focal group; and nonuniform DIF with varying a parameters and equal b parameters for both groups. Item discrimination values were .8 (low), 1.2 (moderate), and 1.6 (high) in three conditions. Group ability distributions differed at two levels. In one condition, the focal and reference groups had the same normal distribution with a mean of 0 and a standard deviation of 1. In the second condition, the focal group had a mean of −.5 and a standard deviation of 1, whereas the reference group had a mean of 0 and a standard deviation of 1. Total sample size was held at 4000, but the group sample size ratio for the focal and reference groups changed in two conditions: an equal ratio (1:1) and an unequal ratio (4:1), in which the reference group had 3200 and the focal group had 800 individuals. As a last factor, skewness in ability distributions varied in two conditions: no skewness, and a moderate negative skewness with a value of −.75 for both groups. 400 iterations were simulated. Results indicated that there were slight differences in the Type I error rates of the four methods, and the values were relatively close to .05. UCLOLR had the lowest Type I error value (.049) and Mantel had the highest value (.054). GMH and UCLOLR did not show any significant relationship with any of the study conditions in terms of Type I error, but the Type I error of Mantel and LDFA was significantly related to high item discrimination and group ability difference (Kristjansson et al., 2005). All the methods gave a high value of power for detecting uniform DIF. The power for detecting uniform DIF was related to item discrimination and sample size for the GMH and UCLOLR methods. High power was observed with
moderate item discrimination, and lower power with a sample size ratio of 4:1 (Kristjansson et al., 2005). For detection of nonuniform DIF, GMH and UCLOLR performed well and showed high average power (.999 and 1.0). Nevertheless, LDFA had a low power value (.55), and nonuniform DIF could not be identified by Mantel. Item discrimination had a slight positive effect on the power for detecting uniform DIF for GMH and UCLOLR, but no effect for detecting nonuniform DIF. Finally, moderate skewness did not have any significant effect on DIF detection for any of the four methods. Another simulation study, conducted by Welkenhuysen-Gybels (2004), compared different DIF detection techniques for dichotomous items to investigate the performance differences between the techniques. One of the conditions simulated was the ability distribution. The researcher simulated three different ability distribution conditions, which were normal/normal, normal/high positive skew, and normal/high negative skew for the reference and focal groups. The researcher simulated two levels of uniform DIF, small and large, by varying item difficulty parameters by .2 and .6 between the reference and focal groups, and two levels of nonuniform DIF, moderate and large, by varying item discrimination parameters by .4 and .8. Sample size varied at two levels, which were 1000/1000 and 300/1000 for the focal and reference groups, respectively. Two levels of test length (i.e., 20 items and 30 items) and three levels of the number of DIF items in the test (i.e., 10%, 20%, 50%) were manipulated. Eight different measures of DIF were compared, including the log-linear model, the logistic regression model, the signed area, the unsigned area, the sum of squares 1 (SOS1), and the sum of squares 3 (SOS3). The results of the study indicated that, among the
studied techniques, the logistic regression method showed the highest power and the lowest Type I error for the conditions with nonuniform DIF, and for those with uniform DIF when the number of items with DIF was small. An important result obtained by the researcher was that the logistic regression approach was the best method for nonuniform DIF detection because it had the same power as the log-linear method but a lower Type I error than the log-linear method in these study conditions. In addition, for negatively skewed ability distributions, the logistic regression and log-linear methods performed better than all the other techniques. In a comparison of the MH and logistic regression DIF detection methods for dichotomous items, Swaminathan and Rogers (1990) noted that logistic regression was as powerful as MH for detecting uniform DIF items. For the nonuniform DIF items, logistic regression was 50% accurate in the small-sample, short-test case and 75% accurate in the large-sample, long-test case, but MH was not able to detect any nonuniform DIF item.

2.4 Groups with Skewed Ability Distributions

Monaco (1997) conducted a Monte Carlo simulation study to investigate the role of skewed ability distributions in DIF detection procedures. The author applied the analysis to three different DIF detection methods with dichotomous data, which were Differential Functioning of Items and Tests (DFIT), Mantel-Haenszel (MH), and Lord's chi-square procedure. Skewness levels varied in five conditions: high positive, moderate positive, none, moderate negative, and high negative. The results showed that moderate skewness did not change the results of DIF detection, and a high level of skewness had only a slight effect on the detection results for any of the three DIF detection methods examined.
A simulation study compared four DIF detection methods for polytomous items, which were the Mantel, generalized Mantel-Haenszel, logistic discriminant function analysis, and unconstrained cumulative logits ordinal logistic regression, by using two levels of skewness in the group ability distribution (Kristjansson et al., 2005). In order to compare the effect of skewness in ability distribution, the authors set the skewness levels at −.75 (moderate negative) for both groups and at no skewness. The study results indicated that the skewness level used in this study (−.75) did not have a notable effect on the efficiency of the four methods. Welkenhuysen-Gybels (2004) examined the performance of different DIF detection methods on dichotomously scored items by varying the group ability distribution. To test the robustness of the logistic regression method, the author simulated three different ability distribution conditions: (1) a normal distribution for both groups, (2) a normal distribution for the reference group and a positively skewed distribution for the focal group (a beta distribution with parameters 1.5 and 5), and (3) a normal distribution for the reference group and a negatively skewed distribution for the focal group (a beta distribution with parameters 5 and 1.5). The type of DIF varied as uniform and nonuniform. In the case of uniform DIF, the logistic regression results indicated that both the false positive rate and the false negative rate for the normal/positively skewed ability distribution condition were higher than for the normal/normal ability condition. On the other hand, the false negative rate for the normal/negatively skewed ability distribution condition was lower than for the normal/normal ability distribution condition. In the case of nonuniform DIF, a skewed distribution always gave a higher false positive rate than a normal distribution.
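The skewed focal-group abilities described above can be sketched with a few lines of code. The following Python fragment is our illustration only (the cited studies used their own software); it draws abilities from the two beta distributions mentioned, standardizes them to the usual theta scale, and checks the direction of skew. The function names are ours, not from any cited study.

```python
import random
import statistics

def skewed_ability(n, alpha, beta, seed=None):
    """Draw n ability values from a Beta(alpha, beta) distribution,
    then standardize to mean 0 and SD 1 (the usual theta scale)."""
    rng = random.Random(seed)
    draws = [rng.betavariate(alpha, beta) for _ in range(n)]
    mu = statistics.fmean(draws)
    sd = statistics.stdev(draws)
    return [(x - mu) / sd for x in draws]

def sample_skewness(xs):
    """Simple moment-based skewness estimate."""
    mu = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return sum((x - mu) ** 3 for x in xs) / (len(xs) * sd ** 3)

# Beta(1.5, 5) is right-skewed; Beta(5, 1.5) is left-skewed.
pos = skewed_ability(5000, 1.5, 5, seed=1)
neg = skewed_ability(5000, 5, 1.5, seed=1)
print(sample_skewness(pos) > 0, sample_skewness(neg) < 0)
```

Standardizing the beta draws keeps the groups on a common scale while preserving the shape of the distribution, which is the point of these robustness conditions.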
However, the false negative rate for the skewed distribution conditions was not significantly different from that for the normal distribution condition. The majority of simulation studies of logistic regression have been conducted by simulating a normal ability distribution. As mentioned above, there are only a few studies, across all DIF detection methods, with skewed ability conditions. However, samples in real settings are not always normally distributed. Some applied studies from medical and educational fields have been conducted with real data that included skewed ability distributions (e.g., Wang & Lane, 1996). The lack of simulation studies examining the robustness of DIF detection results under skewed ability distributions for the logistic regression method, together with the fact that DIF detection studies of polytomously scored items with logistic regression already use real data from non-normal samples, required a deeper investigation of the effect of skewed ability distributions on the power of DIF detection.

2.5 Groups with Small Sample Sizes

Many studies have investigated how effectively a DIF method works with different sample sizes. Some of the methods allow small sample sizes, and some others only work well with large sample sizes. A simulation study examined the effectiveness of the MH procedure with small sample sizes (Mazor, Clauser & Hambleton, 1992). The study results indicated that use of the MH procedure with small samples was problematic. The lowest sample size with adequate power was 200 per group. A sample size of 500 gave more accurate results, but only sample sizes of 1000 and 2000 could detect all the items with DIF. The findings of Fidalgo, Ferreres and Muniz (2004) supported these results: as the sample size increased, the significance level and the power increased. For the same Type I error rate, MH chi-squared test results gave
higher detection rates at higher sample size levels. A simulation study compared the power and Type I error performances of the MH and SIBTEST procedures for small samples (Roussos & Stout, 1996). The results showed that as sample size increased, hypothesis testing rejection rates significantly increased for both the MH and SIBTEST procedures. The rate of increase was higher for MH than for SIBTEST. Rogers and Swaminathan (1993) compared the MH and logistic regression DIF detection procedures with small sample sizes. The study demonstrated the strong impact of sample size on detection rates: the results showed a 19% increase for the logistic regression procedure and an 11% increase for the MH procedure in detection rate as the sample size increased from 250 to 500. A Monte Carlo study performed by Herrera and Gomez (2008) examined the effect of different sample sizes of the reference and focal groups on the power and Type I error of logistic regression. The study manipulated 12 conditions with two sample sizes for the reference group (500, 1500) and six ratios of focal group to reference group sample size (1/5, 1/4, 1/3, 2/5, 1/2, 1/1). Error mean squares were used as the accuracy index. The results of the study indicated that the highest Type I error values of the logistic regression procedure were obtained for the condition with a sample size of 1500 and equal group sizes. In other words, the highest Type I error rate was obtained from the largest sample size condition, contrary to other findings in the literature. Another surprising result was that the increase in sample size from 500 to 1500 did not affect the Type I error of logistic regression. Welkenhuysen-Gybels (2004) examined the impact of sample size as a factor affecting DIF detection for different DIF methods. The sample size varied at two
different levels for the reference and focal groups. The first condition had equal sample sizes for both groups, 1000 for the reference group and 1000 for the focal group. The second condition included a sample size of 1000 for the reference group and 300 for the focal group. In the case of uniform DIF, the logistic regression results indicated that when the sample size decreased for the focal group, the false positive rate decreased but the false negative rate increased. For nonuniform DIF, the logistic regression method showed an increase in the false negative rate as the sample size decreased. The author summarized that, for both uniform and nonuniform DIF, smaller sample sizes decrease the power of the methods and, as a result, increase the Type I error rate. A simulation study (Swaminathan & Rogers, 1990) comparing the Mantel-Haenszel and logistic regression techniques for dichotomously scored items simulated two levels of sample size per group to investigate the effect of sample size on the power of DIF detection: 250 per group and 500 per group. The results indicated that, for samples of 250, logistic regression resulted in 75% correct detection of uniform DIF and 50% correct detection of nonuniform DIF. In the case of the larger sample size, 500 per group, the logistic regression procedure resulted in 100% accurate uniform DIF detection and 75% accurate nonuniform DIF detection. A simulation study investigating the adequate sample size for scales with a small number of items suggested that, for a power level of 80% or higher, ordinal logistic regression requires at least 200 observations per group (Scott et al., 2009). The study suggested 300 observations per group for a scale with two items.
Furthermore, for a smaller p-value, 500 was the suggested minimum sample size per group. Based on the findings in the literature, in order to reach an adequate power level, Zumbo (1999) suggests at least 200 observations per group. Another study (Lai, Teresi & Gershon, 2005) suggests at least 100 subjects per group for items with no skewness.
CHAPTER 3
METHODS

The logistic regression procedure was used in this study to conduct differential item functioning (DIF) analyses in polytomously scored items. A Monte Carlo simulation study was conducted to compare the performances of three extensions of logistic modeling to polytomous data for detecting items with DIF: the continuation ratio logits model, the cumulative logits model, and the adjacent categories model. This is a partial replication and extension of the study by French and Miller (1996). The performances of the three logit models in DIF detection were compared in terms of power and Type I error rates. A test of 25 items was simulated under the graded response model of Samejima (1969; 1996), which is an appropriate model for items with ordered polytomous responses. Each item had four score categories, so the possible scores an examinee could obtain ranged from 0 to 3. These item conditions were the same as the conditions French and Miller (1996) used in their simulation study. In this simulation study, similar to French and Miller's (1996) study, only 1 of the 25 items contained DIF. In other words, a single item in each test condition, the 25th item, was a DIF item. Therefore, items 1 to 24 did not contain DIF, and these items were used to estimate the actual Type I error rate of the study. We used the 25th item to investigate the power of the logistic regression DIF detection method in polytomous items.
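The three logit codings compared in this study can be illustrated numerically. The Python sketch below is our illustration only (the study itself fit these models in R with the VGAM package); it computes each set of logits directly from the category probabilities of a four-category item.

```python
import math

def cumulative_logits(probs):
    """log[(p1+...+pj) / (p(j+1)+...+pJ)] for j = 1..J-1."""
    J = len(probs)
    return [math.log(sum(probs[:j]) / sum(probs[j:])) for j in range(1, J)]

def adjacent_category_logits(probs):
    """log(pj / p(j+1)) for j = 1..J-1: each category against its neighbor."""
    return [math.log(probs[j] / probs[j + 1]) for j in range(len(probs) - 1)]

def continuation_ratio_logits(probs):
    """log(pj / (p(j+1)+...+pJ)) for j = 1..J-1."""
    J = len(probs)
    return [math.log(probs[j - 1] / sum(probs[j:])) for j in range(1, J)]

# Category probabilities for a four-category (0-3) item.
p = [0.1, 0.2, 0.3, 0.4]
print(cumulative_logits(p))
print(adjacent_category_logits(p))
print(continuation_ratio_logits(p))
```

Note that the first cumulative logit and the first continuation-ratio logit coincide (both contrast the lowest category with everything above it); the codings diverge from the second logit on.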
3.1 Factors Manipulated

3.1.1 Type of Differential Item Functioning (DIF)

DIF was simulated in four conditions. The differences between item parameters (i.e., discrimination and threshold) determined the existence and type of DIF in the 25th item. The first three conditions contained nonuniform DIF, and the last condition contained uniform DIF. These conditions are explained in detail in the data generation section.

3.1.2 Sample Size

In all conditions, the sample sizes for the reference group and the focal group were equal; therefore, the sample size ratio between the two groups was 1:1. Three levels of sample size were investigated. Similar to French and Miller's (1996) study, one of the sample sizes, 500 per group, was used to represent the medium sample size. Additionally, we investigated the effect of small sample sizes on DIF detection with logistic regression. In order to obtain sufficient power in DIF detection, findings in the literature suggest 200 per group as the smallest sample size for the logistic regression procedure (Zumbo, 1999; Scott et al., 2009). If there is no skewness, 100 examinees per group is the smallest sample size suggested by Lai, Teresi and Gershon (2005). Swaminathan and Rogers (1990) obtained 75% power for samples of 250. For the two conditions with small sample sizes, the present study simulated 100 and 250 examinees in each of the reference and focal groups. The sample size of 250 is common in simulation studies investigating DIF with small samples. Furthermore, we investigated the feasibility of the sample size of 100.
3.1.3 Group Ability Distributions

The effect of skewness of ability distributions has not been studied broadly. To examine the effect of skewness on the power to detect DIF in polytomous items, the present study compared different levels of skewness of the ability distribution. The ability distributions of examinees were simulated at three levels: normal distribution, moderate negative skewness (−.75), and high negative skewness (−1.75). Normal ability distribution values were simulated by using the rnorm function in the R software, with a mean of 0 and a standard deviation of 1. Fleishman's (1978) power transformation method for simulating non-normal distributions was followed to simulate the skewed distributions. The levels of skewness and kurtosis were simulated by using the coefficient values in Fleishman's power method weights table. Moderate negative skew in the ability distribution was studied by Kristjansson et al. (2005) at the level of −.75, and it showed only a slight effect on the performance of DIF detection methods. In the present study, we set the skewness levels of the ability distributions at −.75 for moderate skewness and −1.75 for high skewness. The kurtosis of the ability distributions was fixed at 3.75.

3.2 Data Generation

Each simulated data set included responses to 25 items, with a single item showing DIF. The other 24 items did not include DIF and were used to calculate Type I error. The item parameters used by French and Miller (1996) to generate the data were replicated in this study. Item parameters were varied in four conditions. DIF items were created in these four conditions by using differences between the item parameters a and b. In all four conditions, the item parameters remained the same for the first 24 items, which are the
items with no DIF. The parameters were varied between the focal group and the reference group only for the 25th item, the DIF item, in all conditions. The item parameters used for each condition are shown in Table 3-1. Item parameters were identical for the first 24 items in every condition. Five different values of the item discrimination parameter (a) were simulated.

Table 3-1. Item parameters for items 1 to 24

Item   a      b1    b2      b3      b4
1      .50    .00   -2.00    .00    2.00
2      .50    .00   -2.00   -1.00   2.00
3      .50    .00   -1.00    .00    1.00
4      .50    .00    .00    1.00    2.00
5      .50    .00   -2.00   -1.00    .00
6      .75    .00   -2.00    .00    2.00
7      .75    .00   -2.00   -1.00   2.00
8      .75    .00   -1.00    .00    1.00
9      .75    .00    .00    1.00    2.00
10     .75    .00   -2.00   -1.00    .00
11     1.00   .00   -2.00    .00    2.00
12     1.00   .00   -2.00   -1.00   2.00
13     1.00   .00   -1.00    .00    1.00
14     1.00   .00    .00    1.00    2.00
15     1.00   .00   -2.00   -1.00    .00
16     1.25   .00   -2.00    .00    2.00
17     1.25   .00   -2.00   -1.00   2.00
18     1.25   .00   -1.00    .00    1.00
19     1.25   .00    .00    1.00    2.00
20     1.25   .00   -2.00   -1.00    .00
21     1.50   .00   -2.00    .00    2.00
22     1.50   .00   -2.00   -1.00   2.00
23     1.50   .00   -1.00    .00    1.00
24     1.50   .00    .00    1.00    2.00

The values of a were .50 for items 1 to 5; .75 for items 6 to 10; 1.00 for items 11 to 15; 1.25 for items 16 to 20; and 1.50 for items 21 to 24. Since the value of the first score category was 0, French and Miller (1996) set the threshold parameter of the first score category (b1) to 0. Because a four-category item requires only three thresholds, we did not use the b1 parameters in the simulation. The parameters b2, b3, and
b4 were different for each item, and they were ordered increasingly within every item. In the first three conditions, the item discrimination parameter (a) of the DIF item (the 25th item) was varied between the focal group and the reference group, but the threshold parameters (b) remained the same, to create nonuniform DIF. The item discrimination parameters differed by .5, 1.0, and 1.5 points in the first, second, and third conditions, respectively. In the fourth condition, the threshold parameters (b) were varied between the focal group and the reference group, but the item discrimination parameters (a) remained the same, to create uniform DIF. The threshold parameters of the first score category (b1) and the fourth score category (b4) remained the same, while the threshold parameters of the second score category (b2) and the third score category (b3) differed by 1.0.

Table 3-2. Item parameters for the 25th item

                  Focal group               Reference group
Parameter     1     2     3     4        1     2     3     4
a             .5    .5    .5    1.0      1.0   1.5   2.0   1.0
b1            .0    .0    .0    .0       .0    .0    .0    .0
b2           -1.0  -1.0  -1.0  -1.0     -1.0  -1.0  -1.0  -2.0
b3            .0    .0    .0    .0       .0    .0    .0   -1.0
b4            1.0   1.0   1.0   2.0      1.0   1.0   1.0   2.0

The simulation design consisted of 4 DIF conditions (3 nonuniform and 1 uniform), 3 sample size conditions (100, 250, 500), 3 ability distribution conditions (normal, moderately skewed, highly skewed), and 3 logistic regression models (cumulative logits, continuation ratio, adjacent categories). For each condition, 1000 data sets were generated. This study used the R 2.10.1 statistical software (R Development Core Team, 2010) to simulate the data.
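The data generation just described can be sketched compactly. The Python code below is an illustrative re-expression only (the study itself generated data in R): it applies Fleishman's cubic transform to standard normal draws to obtain skewed abilities, then generates graded-response-model item scores. The Fleishman weights shown are placeholders chosen to illustrate the mechanics (a negative quadratic weight yields negative skew); the study's actual weights come from Fleishman's published table for skewness −.75 and −1.75.

```python
import math
import random

def fleishman(z, b, c, d):
    """Fleishman (1978) cubic transform y = a + b*z + c*z^2 + d*z^3 of a
    standard normal draw, with a = -c so that E[y] = 0."""
    return -c + b * z + c * z * z + d * z ** 3

def grm_probs(theta, a, thresholds):
    """Samejima graded response model category probabilities for a
    four-category (0-3) item; thresholds = [b2, b3, b4]."""
    def p_star(b):  # P(X >= k | theta), a 2PL boundary curve
        return 1.0 / (1.0 + math.exp(-a * (theta - b)))
    stars = [1.0] + [p_star(b) for b in thresholds] + [0.0]
    return [stars[k] - stars[k + 1] for k in range(len(stars) - 1)]

def draw_score(theta, a, thresholds, rng):
    """Sample one item score from the GRM category probabilities."""
    u, cum = rng.random(), 0.0
    for score, prob in enumerate(grm_probs(theta, a, thresholds)):
        cum += prob
        if u < cum:
            return score
    return len(thresholds)

rng = random.Random(2010)
# Placeholder Fleishman weights (illustrative, not the table values).
b, c, d = 0.9, -0.2, 0.03
thetas = [fleishman(rng.gauss(0.0, 1.0), b, c, d) for _ in range(1000)]
# Item 11 of Table 3-1: a = 1.00, thresholds (b2, b3, b4) = (-2, 0, 2).
responses = [draw_score(t, 1.00, [-2.0, 0.0, 2.0], rng) for t in thetas]
print(min(responses), max(responses))
```

Each examinee's four-category score is drawn from the difference of adjacent boundary curves, which is exactly how the graded response model assigns category probabilities.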
3.3 Data Analysis

The R software was used to fit the ordinal logistic regression models and to compute the power and Type I error rates of DIF detection. The VGAM package for categorical data analysis (Yee, 2010) was used to fit the logistic regression models. Three nested models were fit for each logistic regression coding scheme: the first model with only the total score variable, the second model with the total score and group variables, and the third model with the total score variable, the group variable, and the interaction term. Deviance chi-square values of these models were subtracted from each other in order to obtain the p-value of each DIF detection test. Three tests were applied in each condition of the study: the test of any kind of DIF, the test of uniform DIF, and the test of nonuniform DIF. To test the existence of any kind of DIF, we subtracted the deviance chi-square value of the third model from that of the first model. To test uniform DIF, the deviance value of the second model was subtracted from that of the first model. Finally, to test nonuniform DIF, the deviance chi-square value of the third model was subtracted from that of the second model. After the data generation and analysis were done, the p-values of each logistic regression model were obtained. Power was estimated by calculating the proportion of iterations in which DIF was correctly detected, while Type I error was calculated as the proportion of iterations in which DIF was falsely detected. Similar to French and Miller's (1996) study, the alpha level of .05 was divided by the number of items, and the resulting alpha level of .002 was used in order to control the familywise Type I error rate. The results of each model were compared across all the conditions to determine the best performing model.
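The deviance-difference tests just described can be sketched as follows. This Python fragment is our illustration only (the analysis itself was run in R with VGAM); it turns the deviances of the three nested models into chi-square p-values. The deviance numbers in the example are hypothetical, not study output.

```python
import math

def chi2_sf(x, df):
    """Chi-square survival function for df = 1 or 2, the only degrees of
    freedom these likelihood ratio DIF tests need."""
    if df == 2:
        return math.exp(-x / 2.0)
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    raise ValueError("only df = 1 or 2 supported here")

def dif_tests(dev1, dev2, dev3):
    """Deviances of model 1 (total score), model 2 (+ group), and model 3
    (+ group-by-score interaction).  Returns p-values for the overall,
    uniform, and nonuniform DIF tests."""
    return {
        "any":        chi2_sf(dev1 - dev3, df=2),
        "uniform":    chi2_sf(dev1 - dev2, df=1),
        "nonuniform": chi2_sf(dev2 - dev3, df=1),
    }

# Hypothetical deviances for one replication (not real study output).
p = dif_tests(dev1=2510.4, dev2=2498.9, dev3=2497.1)
alpha = 0.05 / 25  # Bonferroni-adjusted alpha = .002, as in the study
print({k: round(v, 4) for k, v in p.items()},
      {k: v < alpha for k, v in p.items()})
```

Because each added term reduces the deviance, each difference is a likelihood ratio statistic with degrees of freedom equal to the number of added parameters (two for the simultaneous test, one for each of the others).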
CHAPTER 4
RESULTS

The results of the study comprise the power values for the 25th item and the Type I error values for the 24 non-DIF items in each condition. Power values were calculated as the percentage of replications, out of 1000, in which differential item functioning (DIF) was detected in item 25, the only item showing DIF. Type I error values were calculated as the percentage of incorrect DIF detections in items 1 through 24, which are DIF-free items. The study had hypotheses similar to those of French and Miller's (1996) study. Four different DIF conditions were simulated for the 25th item. From condition 1 to condition 3, the difference between the item discrimination parameters (a) of the focal and reference groups was increased, but the item threshold parameters (b) were held constant, in order to create nonuniform DIF. In condition 4, the item discrimination parameters (a) of the focal and reference groups were held constant at the same level, but the difference between the item threshold parameters (b) was increased, in order to create uniform DIF. Larger differences between item discrimination parameters were expected to make DIF easier to identify in the first three conditions. Since more extreme differences between item discrimination parameters make the shapes of the item characteristic curves (ICCs) more distinguishable (Embretson & Reise, 2000), the power values were expected to increase from condition 1 through condition 3. Sample sizes were set at three levels representing small to medium sample sizes, 100, 250, and 500, in order to determine whether small sample sizes give the logistic regression method sufficient power to detect DIF in
polytomous items. Increases in sample size were expected to result in higher power for DIF detection. Finally, three levels of ability distribution were simulated: normal, moderately negatively skewed, and highly negatively skewed. Skewness of the ability distributions was expected to reduce the power of detecting DIF. Study results are presented in Tables 4-1 to 4-6. Overall, the results showed that most of the expectations were met. Research question 1 addresses whether the cumulative logits, continuation ratio, and adjacent categories models differ with respect to power to detect DIF in polytomous items. Research question 2 addresses whether these three models differ with respect to Type I error values in detecting DIF. Research question 3 addresses whether the performance of logistic regression models for polytomous items is affected by non-normality of the ability distribution. Finally, research question 4 addresses how logistic regression models for polytomous items are affected by small sample sizes. Table 4-1 shows the power of the three logistic regression models to flag the presence of any kind of DIF in item 25 across all levels of sample size and ability distribution, and all conditions. Results showed that the three models did not differ in their performance in detecting any kind of DIF in condition 1, which had the smallest difference between item parameters. A sample size of 100 did not provide sufficient power to detect DIF arising from a 0.5 difference between discrimination parameters; power values for the sample size of 100 remained at approximately 0.2 in condition 1. With the smallest level of DIF, the cumulative logits model had lower power with small sample sizes, but the difference in power between the models decreased as sample size increased. Conditions 2, 3, and 4 all provided close-to-perfect power results. The results
gathered from conditions 2 to 4 showed that a sample size of 100 worked well in the conditions with large differences between the item parameters of the focal and reference groups. Power values did not differ substantially between the normal distribution and the two levels of skewness; non-normality of the ability distributions did not have an effect on DIF detection with logistic regression in polytomous items.

Table 4-1. Power of any kind of differential item functioning (DIF) detection for item 25

Ability distribution  Model          Sample size  Condition 1  Condition 2  Condition 3  Condition 4
Normal                Cumulative     100          0.123        0.786        0.990        0.896
                      logits         250          0.793        1            1            1
                                     500          0.996        1            1            1
                      Continuation   100          0.206        0.876        0.997        0.939
                      ratio          250          0.813        1            1            1
                                     500          0.997        1            1            1
                      Adjacent       100          0.206        0.872        0.995        0.94
                      categories     250          0.801        1            1            1
                                     500          0.997        1            1            1
Moderately            Cumulative     100          0.110        0.749        0.988        0.931
skewed                logits         250          0.750        1            1            1
                                     500          0.995        1            1            1
                      Continuation   100          0.202        0.879        0.999        0.962
                      ratio          250          0.807        1            1            1
                                     500          0.996        1            1            1
                      Adjacent       100          0.210        0.885        0.997        0.962
                      categories     250          0.816        1            1            1
                                     500          0.997        1            1            1
Highly                Cumulative     100          0.141        0.826        0.995        0.954
skewed                logits         250          0.797        1            1            1
                                     500          1            1            1            1
                      Continuation   100          0.23         0.9          0.999        0.964
                      ratio          250          0.818        1            1            1
                                     500          1            1            1            1
                      Adjacent       100          0.229        0.906        0.999        0.963
                      categories     250          0.817        1            1            1
                                     500          1            1            1            1
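Values like those in Table 4-1 are simple rejection proportions across replications. A minimal sketch (the p-values below are hypothetical, not study output):

```python
# Power / Type I error as rejection proportions at the Bonferroni alpha.
alpha = 0.05 / 25  # .002, as used in the study

def rejection_rate(p_values, alpha=alpha):
    """Proportion of replications whose test p-value falls below alpha."""
    return sum(p < alpha for p in p_values) / len(p_values)

# Hypothetical p-values from four replications of the DIF test on item 25.
p_item25 = [0.0004, 0.0019, 0.0300, 0.0001]
print(rejection_rate(p_item25))  # 3 of 4 rejections -> 0.75
```

Applied to item 25 this proportion is power; applied to items 1 to 24 it is the empirical Type I error rate.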
Table 4-2 illustrates the power of the three logistic regression methods to flag the presence of non-uniform DIF in item 25 in conditions 1 to 3, across all levels of sample size and ability distribution. Because condition 4 included uniform DIF, the values in its column instead represent Type I error rates for detecting non-uniform DIF when uniform DIF exists. Results in Table 4-2 showed that power rates increased from condition 1 through condition 3,

Table 4-2. Non-uniform DIF detection rates for item 25

Ability distribution   Model                 Sample size   Condition 1   Condition 2   Condition 3   Condition 4
Normal                 Cumulative logits     100           0.019         0.120         0.484         0
                                             250           0.090         0.671         0.971         0
                                             500           0.339         0.978         1             0.006
                       Continuation ratio    100           0.059         0.443         0.867         0.005
                                             250           0.255         0.962         1             0.007
                                             500           0.695         1             1             0.030
                       Adjacent categories   100           0.031         0.203         0.567         0.002
                                             250           0.113         0.744         0.985         0.003
                                             500           0.386         0.987         1             0.014
Moderately skewed      Cumulative logits     100           0.013         0.055         0.272         0
                                             250           0.067         0.488         0.891         0
                                             500           0.168         0.947         1             0
                       Continuation ratio    100           0.035         0.273         0.660         0.007
                                             250           0.181         0.867         0.997         0.012
                                             500           0.505         1             1             0.016
                       Adjacent categories   100           0.023         0.148         0.433         0.003
                                             250           0.096         0.636         0.970         0.010
                                             500           0.268         0.977         1             0.006
Highly skewed          Cumulative logits     100           0.002         0.016         0.069         0.002
                                             250           0.016         0.067         0.373         0.001
                                             500           0.029         0.313         0.828         0.002
                       Continuation ratio    100           0.017         0.085         0.294         0.003
                                             250           0.046         0.392         0.863         0.006
                                             500           0.133         0.859         0.998         0.004
                       Adjacent categories   100           0.007         0.034         0.120         0.003
                                             250           0.017         0.127         0.494         0.003
                                             500           0.042         0.462         0.931         0.001
as expected. None of the logistic regression models had sufficient power to detect non-uniform DIF in condition 1. For groups with a normal ability distribution, none of the models was powerful enough to detect non-uniform DIF with a sample size of 100 in any of the conditions. For those groups, a sample size of 250 provided acceptable power in condition 2 and sufficient power in condition 3 with all logistic regression models, and a sample size of 500 was powerful in both conditions 2 and 3. Both moderate and high skewness of the ability distribution affected the power of the models for non-uniform DIF detection to some extent. Power values for groups with highly skewed ability distributions were not at a sufficient level, except in condition 3 with a sample size of 500. The power to detect DIF decreased consistently for all models as the level of skewness increased. The models also performed differently than expected: when compared, the continuation ratio model performed best and the cumulative logits model performed worst. Since condition 4 was designed as a uniform DIF condition, its non-uniform DIF detection rates were zero or close to zero.

Table 4-3 illustrates the percentage of replications flagging uniform DIF in item 25 for the three logistic regression methods across all levels of sample size, ability distribution, and condition. Item parameters were chosen to create non-uniform DIF in conditions 1 to 3 and uniform DIF in condition 4. Thus, the values in the condition 4 column of the table can be considered power, while the values in the condition 1, 2, and 3 columns are Type I error rates for detecting uniform DIF when non-uniform DIF exists. Although conditions 1 to 3 were simulated to present non-uniform DIF, the results in Table 4-3 indicated that all of the logistic regression models in conditions 2 and 3
detected uniform DIF in item 25 with a perfect power level. All three models in condition 1 provided high uniform DIF detection rates for sample sizes of 250 and 500. Condition 4 was designed to create a uniform DIF item.

Table 4-3. Uniform DIF detection rates for item 25

Ability distribution   Model                 Sample size   Condition 1   Condition 2   Condition 3   Condition 4
Normal                 Cumulative logits     100           0.145         0.723         0.917         0.967
                                             250           0.751         1             1             1
                                             500           0.997         1             1             1
                       Continuation ratio    100           0.147         0.649         0.881         0.973
                                             250           0.645         0.993         0.999         1
                                             500           0.974         1             1             1
                       Adjacent categories   100           0.206         0.799         0.996         0.976
                                             250           0.761         0.999         1             1
                                             500           0.996         1             1             1
Moderately skewed      Cumulative logits     100           0.152         0.764         0.964         0.968
                                             250           0.77          1             1             1
                                             500           0.992         1             1             1
                       Continuation ratio    100           0.168         0.755         0.96          0.987
                                             250           0.712         1             1             1
                                             500           0.989         1             1             1
                       Adjacent categories   100           0.217         0.843         0.982         0.986
                                             250           0.804         1             1             1
                                             500           0.994         1             1             1
Highly skewed          Cumulative logits     100           0.209         0.875         1             0.99
                                             250           0.877         1             1             1
                                             500           0.999         1             1             1
                       Continuation ratio    100           0.261         0.893         0.996         0.99
                                             250           0.833         1             1             1
                                             500           0.996         1             1             1
                       Adjacent categories   100           0.287         0.926         0.998         0.991
                                             250           0.878         1             1             1
                                             500           0.998        1              1             1

Study results in Table 4-3 showed that all three logistic regression models were powerful in detecting the uniform DIF item at all three sample size and skewness levels. No clear difference was detected between the models.
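The "any kind", "uniform", and "non-uniform" DIF tests reported in Tables 4-1 to 4-3 are chi-square comparisons of nested logistic regression models in the framework of Swaminathan and Rogers (1990): a base model with the matching trait score, a model adding group membership, and a model adding the group-by-trait interaction. A minimal sketch of this testing logic, using a dichotomous item and simulated data for brevity (the thesis applies the tests to polytomous logits; the variable names, effect sizes, and sample sizes below are illustrative assumptions, not the study's design values):

```python
import numpy as np
from scipy.stats import chi2

def fit_logistic(X, y, n_iter=50):
    """Newton-Raphson fit of a binary logistic regression; returns the deviance."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1.0 - p)
        hess = X.T @ (X * W[:, None]) + 1e-10 * np.eye(X.shape[1])
        beta += np.linalg.solve(hess, X.T @ (y - p))
    p = np.clip(1.0 / (1.0 + np.exp(-X @ beta)), 1e-12, 1.0 - 1e-12)
    return -2.0 * np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def dif_tests(theta, group, y):
    """P-values for the nested-model chi-square DIF tests."""
    ones = np.ones_like(theta)
    d1 = fit_logistic(np.column_stack([ones, theta]), y)                        # trait only
    d2 = fit_logistic(np.column_stack([ones, theta, group]), y)                 # + group
    d3 = fit_logistic(np.column_stack([ones, theta, group, theta * group]), y)  # + interaction
    return {
        "any DIF (2 df)": chi2.sf(d1 - d3, 2),          # uniform and/or non-uniform
        "uniform DIF (1 df)": chi2.sf(d1 - d2, 1),      # group main effect
        "non-uniform DIF (1 df)": chi2.sf(d2 - d3, 1),  # group-by-trait interaction
    }

# Simulated example: a group shift in the logit produces uniform DIF.
rng = np.random.default_rng(1)
n = 500
theta = rng.standard_normal(2 * n)
group = np.repeat([0.0, 1.0], n)             # reference / focal group indicator
logit = 1.2 * theta + 0.8 * group            # 0.8 group shift => uniform DIF
y = (rng.random(2 * n) < 1.0 / (1.0 + np.exp(-logit))).astype(float)
pvals = dif_tests(theta, group, y)
```

With 500 examinees per group and this group shift, the uniform DIF and any-kind-of-DIF p-values come out far below .002, mirroring the high detection rates in Table 4-3.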
Research question 2 addresses whether the cumulative logits, continuation ratio, and adjacent categories models differ with respect to Type I error when used to detect DIF in polytomous items. Items 1 to 24 were generated as DIF-free items in order to analyze the false-positive DIF detection rates of the models in the different condition settings. Tables 4-4 to 4-6 contain Type I error values for the 24 DIF-free items (items 1 to 24). In these tests, the Type I error rate is considered adequate if it is under 0.002.

Table 4-4 illustrates Type I error results for the test of any kind of DIF detection, regardless of which type of DIF exists, for all logistic regression models, sample size levels, and ability distribution levels. Table 4-5 illustrates Type I error results for the test of uniform DIF detection in all condition levels. Table 4-6 illustrates Type I error values for the test of non-uniform DIF detection in all condition levels.

At the alpha level of .002, all three tables showed common features. Results indicated that only the Type I error rates of the cumulative logits model were at acceptable levels. It was observed that as sample size increased, the Type I errors of the cumulative logits model increased as well. Type I error rates of the continuation ratio and adjacent categories models were mostly above the alpha level. No clear effect of skewness or sample size appeared for the continuation ratio and adjacent categories models.
Table 4-4. Type I error rates for the tests of any kind of DIF detection for items 1 to 24

Ability distribution   Model                 Sample size   Condition 1   Condition 2   Condition 3   Condition 4
Normal                 Cumulative logits     100           0.0005        0.0009        0.0002        0.0004
                                             250           0.0009        0.0015        0.0012        0.0014
                                             500           0.0017        0.0021        0.0019        0.0022
                       Continuation ratio    100           0.0034        0.0032        0.0026        0.0030
                                             250           0.0020        0.0026        0.0025        0.0021
                                             500           0.0024        0.0023        0.0021        0.0026
                       Adjacent categories   100           0.0032        0.0032        0.0029        0.0031
                                             250           0.0017        0.0024        0.0025        0.0020
                                             500           0.0023        0.0023        0.0020        0.0024
Moderately skewed      Cumulative logits     100           0.0009        0.0007        0             0.0003
                                             250           0.0009        0.0011        0.0010        0.0007
                                             500           0.0015        0.0010        0.0010        0.0008
                       Continuation ratio    100           0.0028        0.0034        0.0028        0.0030
                                             250           0.0024        0.0022        0.0020        0.0021
                                             500           0.0021        0.0020        0.0020        0.0015
                       Adjacent categories   100           0.0030        0.0036        0.0030        0.0030
                                             250           0.0024        0.0024        0.0020        0.0023
                                             500           0.0025        0.0020        0.0019        0.0017
Highly skewed          Cumulative logits     100           0.0003        0.0095        0.0009        0.0011
                                             250           0.0014        0.0015        0.0012        0.0010
                                             500           0.0016        0.0017        0.0014        0.0013
                       Continuation ratio    100           0.0027        0.0030        0.0023        0.0028
                                             250           0.0029        0.0022        0.0021        0.0019
                                             500           0.0017        0.0019        0.0020        0.0021
                       Adjacent categories   100           0.0027        0.0028        0.0026        0.0031
                                             250           0.0027        0.0023        0.0019        0.0019
                                             500           0.0018        0.0019        0.0019        0.0020
Table 4-5. Type I error rates for the tests of uniform DIF detection for items 1 to 24

Ability distribution   Model                 Sample size   Condition 1   Condition 2   Condition 3   Condition 4
Normal                 Cumulative logits     100           0.0011        0.0010        0.0013        0.0010
                                             250           0.0017        0.0013        0.0017        0.0020
                                             500           0.0018        0.0025        0.0023        0.0016
                       Continuation ratio    100           0.0026        0.0021        0.0025        0.0026
                                             250           0.0019        0.0018        0.0020        0.0022
                                             500           0.0020        0.0025        0.0020        0.0018
                       Adjacent categories   100           0.0025        0.0023        0.0029        0.0024
                                             250           0.0020        0.0020        0.0019        0.0021
                                             500           0.0022        0.0027        0.0021        0.0015
Moderately skewed      Cumulative logits     100           0.0026        0.0009        0.0013        0.0010
                                             250           0.0010        0.0016        0.0018        0.0017
                                             500           0.0014        0.0020        0.0019        0.0013
                       Continuation ratio    100           0.0026        0.0028        0.0020        0.0022
                                             250           0.0017        0.0020        0.0018        0.0022
                                             500           0.0019        0.0020        0.0021        0.0015
                       Adjacent categories   100           0.0027        0.0030        0.0027        0.0024
                                             250           0.0017        0.0019        0.0021        0.0020
                                             500           0.0017        0.0020        0.0023        0.0018
Highly skewed          Cumulative logits     100           0.0015        0.0013        0.0014        0.0017
                                             250           0.0018        0.0026        0.0014        0.0015
                                             500           0.0019        0.0019        0.0014        0.0017
                       Continuation ratio    100           0.0028        0.0027        0.0026        0.0025
                                             250           0.0025        0.0029        0.0020        0.0020
                                             500           0.0018        0.0019        0.0080        0.0020
                       Adjacent categories   100           0.0028        0.0027        0.0025        0.0028
                                             250           0.0023        0.0028        0.0020        0.0022
                                             500           0.0019        0.0020        0.0013        0.0019
Table 4-6. Type I error rates for the tests of non-uniform DIF detection for items 1 to 24

Ability distribution   Model                 Sample size   Condition 1   Condition 2   Condition 3   Condition 4
Normal                 Cumulative logits     100           0.0005        0.0007        0.0008        0.0002
                                             250           0.0010        0.0010        0.0010        0.0010
                                             500           0.0019        0.0017        0.0015        0.0019
                       Continuation ratio    100           0.0036        0.0035        0.0025        0.0023
                                             250           0.0023        0.0027        0.0029        0.0020
                                             500           0.0028        0.0025        0.0019        0.0030
                       Adjacent categories   100           0.0035        0.0028        0.0023        0.0024
                                             250           0.0018        0.0022        0.0024        0.0018
                                             500           0.0023        0.0023        0.0019        0.0024
Moderately skewed      Cumulative logits     100           0.0009        0.0007        0.0001        0.0002
                                             250           0.0009        0.0007        0.0004        0.0004
                                             500           0.0015        0.0006        0.0010        0.0010
                       Continuation ratio    100           0.0031        0.0034        0.0027        0.0035
                                             250           0.0018        0.0028        0.0020        0.0020
                                             500           0.0020        0.0020        0.0023        0.0024
                       Adjacent categories   100           0.0035        0.0036        0.0030        0.0033
                                             250           0.0020        0.0026        0.0019        0.0020
                                             500           0.0022        0.0020        0.0020        0.0024
Highly skewed          Cumulative logits     100           0.0005        0.0009        0.0013        0.0007
                                             250           0.0012        0.0012        0.0016        0.0010
                                             500           0.0017        0.0018        0.0017        0.0014
                       Continuation ratio    100           0.0028        0.0036        0.0033        0.0033
                                             250           0.0022        0.0023        0.0025        0.0022
                                             500           0.0019        0.0021        0.0023        0.0021
                       Adjacent categories   100           0.0026        0.0031        0.0025        0.0029
                                             250           0.0022        0.0017        0.0023        0.0020
                                             500           0.0021        0.0020        0.0020        0.0018
CHAPTER 5
DISCUSSION

This study focused on the effectiveness of three different logistic regression models in detecting uniform and non-uniform differential item functioning (DIF) in polytomous items by comparing their power and Type I error rates. The models compared were the cumulative logits model, the continuation ratio model, and the adjacent categories model. A secondary focus of this research was to examine how the performance of the models changes for groups with small sample sizes and non-normal ability distributions.

In summary, the power of the three models did not differ in detecting the presence of any kind of DIF; all the models performed similarly, regardless of which type of DIF existed. However, the three logistic regression models differed in performance when the test of non-uniform DIF detection was applied. The continuation ratio model was the most powerful model, and the cumulative logits model had the lowest power; the adjacent categories model performed better than the cumulative logits model. These results disconfirmed French and Miller's (1996) findings, in which the cumulative logits and continuation ratio models performed similarly and the adjacent categories model had the lowest power.

As the magnitude of DIF increased, the power for detecting non-uniform DIF increased as well. In other words, an increase in the difference between item discrimination parameters made it easier to detect DIF in an item, and so increased the power. This was an expected result, and it confirmed previous findings in the literature (French & Miller, 1996).
DIF detection with small sample sizes was less powerful than with moderate sample sizes, which confirmed previous findings (French & Miller, 1996). Logistic regression models did not have sufficient power to detect non-uniform DIF when the sample size was 100 per group. However, DIF detection in groups of 250 and 500 was sufficiently powerful with logistic regression when the difference between item discrimination parameters was 1.00 or above. Furthermore, in testing the presence of any kind of DIF, logistic regression models were sufficiently powerful for groups with a sample size of 100 when the difference between the item discrimination parameters for the focal and reference groups was 1.00 or more. In the case of uniform DIF, logistic regression models were powerful with small sample sizes (100 and 250). Thus, in contrast to previous studies (Zumbo, 1999; Scott et al., 2009), this study's results show that logistic regression has sufficient power to detect uniform DIF in datasets with a sample size of 100 per group. For non-uniform DIF, logistic regression would have sufficient power with a sample size as small as 100 if the magnitude of DIF is large enough.

One important piece of information gathered from the results of the study was that skewness of the ability distributions has a substantial effect on the power of DIF detection with logistic regression when the presence of non-uniform DIF is tested. As the skewness level increased, the power to detect non-uniform DIF decreased, and power was not sufficient to detect non-uniform DIF for highly skewed groups. This effect was not clear in the tests of uniform DIF and any kind of DIF. This finding disconfirms previous findings in the literature (Monaco, 1997; Kristjansson et al., 2005), which indicated that skewness of ability distributions does not have a notable effect on the power of non-uniform DIF detection with logistic regression.
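As a concrete illustration of such non-normal ability distributions: the references include Fleishman's (1978) polynomial method for simulating non-normal variates, but a simpler stand-in (an assumption for illustration only, not the method used in the study) is to standardize and reflect a chi-square variate, which yields a negatively skewed trait with mean 0, variance 1, and skewness -sqrt(8/df):

```python
import numpy as np

def neg_skew_theta(n, df, rng):
    """Negatively skewed ability scores with mean 0 and variance 1.
    A chi-square(df) variate, standardized and reflected, has skewness
    -sqrt(8/df); e.g., df = 8 gives skewness -1 (moderate negative skew).
    Illustrative stand-in only, not the Fleishman (1978) polynomial method."""
    x = rng.chisquare(df, size=n)
    return -(x - df) / np.sqrt(2.0 * df)

rng = np.random.default_rng(7)
theta = neg_skew_theta(100_000, df=8, rng=rng)
m, s = theta.mean(), theta.std()
skew = np.mean(((theta - m) / s) ** 3)   # sample skewness, near -1 for df = 8
```

Smaller `df` values give more extreme negative skew, so the "moderate" and "high" skewness levels of a simulation design can be produced by varying a single parameter.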
Another important finding was that the logistic regression models were not able to distinguish non-uniform DIF from uniform DIF in polytomous items when the test of uniform DIF was applied to an item with non-uniform DIF. This inference confirms French and Miller's (1996) findings. Non-uniform DIF conditions were created by varying the item discrimination parameters of the focal and reference groups while holding the item threshold parameters constant. The findings of the study indicate that the logistic regression method is not able to differentiate uniform and non-uniform DIF in polytomous items in the presence of non-uniform DIF. On the other hand, logistic regression is able to distinguish uniform and non-uniform DIF if only uniform DIF exists.

Finally, the logistic regression models differed in their Type I errors. Even though the cumulative logits model has the lowest power, it is the only model whose Type I errors are all at the acceptable level, although the values increase as sample size increases. This result disconfirms Herrera and Gomez's (2008) findings. The continuation ratio and adjacent categories models mostly have high Type I error values at the alpha level of .002.

Based on the overall results, it can be concluded that the logistic regression method is a powerful method for detecting DIF in polytomous items, but it is not useful for distinguishing the type of DIF. The power of logistic regression to detect non-uniform DIF drops when the ability distribution of the groups is skewed. Sample size is a factor that affects the power of logistic regression to detect DIF; however, if the difference between item discrimination parameters is larger than 1.0, logistic regression provides sufficient power to detect DIF within small samples, such as 100 per group. The continuation ratio model is the most powerful logistic regression model for detecting DIF in polytomous items. Although the difference between models was moderate in the non-uniform DIF detection test, it was small in the uniform DIF and any kind of DIF detection tests. On the other hand, even though the cumulative logits model gives the lowest power among the models, it has the lowest Type I error rates; its Type I error rate increases as the sample size increases, but remains at acceptable levels.

5.1 Limitations and Suggestions for Future Research

There are several limitations to this Monte Carlo simulation study. One limitation is that only a single item with DIF was simulated within a test of 25 items. A real test may have more than one item showing DIF; with multiple DIF items, the total score may be contaminated and a purification process may be necessary. Future simulations should include more than one DIF item.

Another limitation is that three different non-uniform DIF conditions were simulated to examine the trend in power, but only one condition was simulated to create uniform DIF. Thus, the trend in power as the magnitude of uniform DIF increases could not be examined. Moreover, DIF conditions were generated by changing only one of the item parameters while fixing the other at a certain value; no condition was simulated in which both the a and b parameters changed. Future research should examine different uniform DIF conditions by changing the threshold parameters, as well as conditions in which both parameters change.

Another important limitation is that only negatively skewed ability distributions were simulated; the effect of positive skewness in ability distributions on DIF detection was not examined in this study. Moreover, setting kurtosis to a fixed level was another limitation. The combination of different kurtosis levels with different skewness levels
might show an effect on power. Thus, future research should also examine the effect of positively skewed ability distributions and of varying kurtosis values.

Finally, only small and moderate sample size conditions were examined in this study. To make a better comparison, large sample sizes should be examined. Furthermore, the focal and reference groups were simulated with equal sample sizes; unequal group sample sizes could have been simulated to see the difference in power.

5.2 Conclusion

French and Miller (1996) indicated that running separate regressions for each model was time consuming in logistic regression. However, recent improvements in statistical programs allow all the separate regressions to be run at the same time, so this is no longer a disadvantage of logistic regression for polytomous items. Previous research indicates that the likelihood ratio DIF detection test for polytomous items is not powerful for sample sizes as small as 500 per group (Ankenmann, Witt, & Dunbar, 1999). However, logistic regression is powerful with a sample size of 250, and even with a sample size of 100 in the case of extreme differences between item discrimination parameters. Nevertheless, the IRT-LR approach allows direct omnibus tests of DIF hypotheses for all the item parameters, which is not possible with the logistic regression method.

The MH method has been shown to produce inflated Type I error rates when the item discrimination and difficulty parameters of an item are high (Chang, Mazzeo, & Roussos, 1996; Roussos & Stout, 1996). In contrast, the logistic regression model shows no clear inflation in Type I error values as item parameters increase. This feature makes logistic regression advantageous over MH.
The generalized Mantel-Haenszel procedure and logistic regression are similarly powerful in detecting uniform DIF in polytomous items (Kristjansson et al., 2005). Logistic regression's capability to detect both uniform and non-uniform DIF makes it advantageous over the Mantel test, because the Mantel test is not able to detect non-uniform DIF (Kristjansson et al., 2005). On the other hand, logistic regression for polytomous items is not able to distinguish non-uniform DIF from uniform DIF. Another disadvantage of logistic regression is that skewness of the ability distributions can reduce its power for non-uniform DIF detection, although this does not affect the power of the uniform DIF and any kind of DIF detection tests. Finally, small sample sizes reduce the power of most DIF methods, but logistic regression can reach sufficient power with small sample sizes, even 100 per group in this study, if the difference between item parameters is large.

The need to dichotomize polytomous response categories in order to compare the groups' response probabilities in the logistic regression method causes the loss of some information, which makes logistic regression less advantageous. Nevertheless, as French and Miller (1996) point out, the separate comparisons of score categories in the adjacent categories model help to identify the location of DIF in polytomous items, which is a unique feature of logistic regression for polytomous items.

The continuation ratio model is the most powerful logistic regression model for detecting non-uniform DIF in polytomous items. However, high Type I error rates occur in all the test results for the continuation ratio model. The cumulative logits model is the only model that provides acceptable Type I error rates in every condition. Hence, in non-uniform DIF detection, the continuation ratio model can be used due to its high power. On the other hand, since the power of the models does not differ in the tests of any kind of DIF or uniform DIF detection, the cumulative logits model is the more appropriate model for those tests due to its low Type I error rate.
REFERENCES

Agresti, A. (2002). Categorical data analysis. Hoboken, NJ: John Wiley.

Ankenmann, R. D., Witt, E. A., & Dunbar, S. B. (1999). An investigation of the power of the likelihood ratio goodness-of-fit statistic in detecting differential item functioning. Journal of Educational Measurement, 36, 277-300.

Bock, R. D. (1975). Multivariate statistical methods in behavioral research. New York: McGraw-Hill.

Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46, 443-459.

Camilli, G., & Shepard, L. A. (1994). Methods for identifying biased test items. Thousand Oaks, CA: Sage.

Chang, H. H., Mazzeo, J., & Roussos, L. (1996). Detecting DIF for polytomously scored items: An adaptation of the SIBTEST procedure. Journal of Educational Measurement, 33, 333-353.

Dorans, N. J., & Kulick, E. (1983). Assessing unexpected differential item performance of female candidates on SAT and TSWE forms administered in December 1977: An application of the standardization approach (RR-83-9). Princeton, NJ: Educational Testing Service.

Dorans, N. J., & Kulick, E. (1986). Demonstrating the utility of the standardization approach to assessing unexpected differential item performance on the Scholastic Aptitude Test. Journal of Educational Measurement, 23, 355-368.

Dorans, N. J., & Schmitt, A. P. (1991). Constructed response and differential item functioning: A pragmatic approach (ETS Research Report No. 91-47). Princeton, NJ: Educational Testing Service.

Embretson, S. E., & Reise, S. P. (2000). Psychometric methods: Item response theory for psychologists. Mahwah, NJ: Erlbaum.

Haenszel procedure for detecting differential item functioning in small samples. Educational and Psychological Measurement, 64, 925-936.

Finch, W. H. (2005). The MIMIC model as a method for detecting DIF: Comparison with Mantel-Haenszel, SIBTEST and the IRT likelihood ratio. Applied Psychological Measurement, 29, 278-295.

Fleishman, A. I. (1978). A method for simulating non-normal distributions. Psychometrika, 43, 521-532.

French, A. W., & Miller, T. R. (1996). Logistic regression and its use in detecting differential item functioning in polytomous items. Journal of Educational Measurement, 33, 315-332.

Herrera, A. N., & Gomez, J. (2008). Influence of equal or unequal comparison group sample sizes on the detection of differential item functioning using the Mantel-Haenszel and logistic regression techniques. Quality & Quantity, 42, 739-755.

Holland, P. W., & Thayer, D. T. (1985). An alternative definition of the ETS delta scale of item difficulty (ETS Research Report No. 85-43). Princeton, NJ: Educational Testing Service.

Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel-Haenszel procedure. In H. Wainer & H. I. Braun (Eds.), Test validity. Hillsdale, NJ: Lawrence Erlbaum.

Holland, P. W., & Wainer, H. (1993). Differential item functioning. Hillsdale, NJ: Lawrence Erlbaum.

Kim, S. H., & Cohen, A. S. (1998). Detection of differential item functioning under the graded response model with the likelihood ratio test. Applied Psychological Measurement, 22, 345-355.

Kristjansson, E., Aylesworth, R., McDowell, I., & Zumbo, B. D. (2005). A comparison of four methods for detecting differential item functioning in ordered response items. Educational and Psychological Measurement, 65, 933-953.

Lai, J. S., Teresi, J., & Gershon, R. C. (2005). Procedures for the analysis of differential item functioning (DIF) for small sample sizes. Evaluation & the Health Professions, 28, 283-294.

Lord, F. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum Associates.

Mantel, N. (1963). Chi-square tests with one degree of freedom: Extensions of the Mantel-Haenszel procedure. Journal of the American Statistical Association, 58, 690-700.

Mantel, N., & Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute, 22, 719-748.

Mazor, K. M., Clauser, B. E., & Hambleton, R. K. (1992). The effect of sample size on the functioning of the Mantel-Haenszel statistic. Educational and Psychological Measurement, 52, 443-451.

Mellenbergh, G. (1982). Contingency table models for assessing item bias. Journal of Educational Statistics, 7, 105-118.

Monaco, M. K. (1997). A Monte Carlo assessment of skewed theta distributions on differential item functioning indices. Dissertation Abstracts International: Section B: The Sciences and Engineering, 58(5-B), 2746.

Oort, F. J. (1998). Simulation study of item bias detection with restricted factor analysis. Structural Equation Modeling, 5, 107-124.

R Development Core Team (2010). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing.

Raju, N. (1988). The area between two item characteristic curves. Psychometrika, 53, 495-502.

Raju, N. (1990). Determining the significance of estimated signed and unsigned areas between two item response functions. Applied Psychological Measurement, 14, 197-207.

Rogers, H. J., & Swaminathan, H. (1993). A comparison of logistic regression and Mantel-Haenszel procedures for detecting differential item functioning. Applied Psychological Measurement, 17, 105-116.

Roussos, L. A., & Stout, W. F. (1996). Simulation studies of the effects of small sample and studied item parameters on SIBTEST and Mantel-Haenszel Type I error performance. Journal of Educational Measurement, 33, 215-230.

Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph No. 17.

Samejima, F. (1996). Evaluation of mathematical models for ordered polychotomous responses. Behaviormetrika, 23, 17-35.

Scott, N. W., Fayers, P. M., Aaronson, N. K., Bottomley, A., DeGraeff, A., Groenvold, M., Gundy, C., Koller, M., & Petersen, M. A. (2009). A simulation study provided sample size guidance for differential item functioning (DIF) studies using short scales. Journal of Clinical Epidemiology, 62, 288-295.

Shealy, R., & Stout, W. F. (1993). A model-based standardization approach that separates true bias/DIF from group ability differences and detects test bias/DIF as well as item bias/DIF. Psychometrika, 58, 159-194.

Su, Y. H., & Wang, W. C. (2005). Efficiency of the Mantel, generalized Mantel-Haenszel, and logistic discriminant function analysis methods in detecting differential item functioning for polytomous items. Applied Measurement in Education, 18, 313-350.

Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27, 361-370.

Swanson, D. B., Clauser, B. E., Case, S. M., Nungester, R. J., & Featherman, C. (2002). Analysis of differential item functioning (DIF) using hierarchical logistic regression models. Journal of Educational and Behavioral Statistics, 27, 53-75.

Thissen, D., Steinberg, L., & Wainer, H. (1988). Use of item response theory in the study of group differences in trace lines. In H. Wainer & H. Braun (Eds.), Test validity. Hillsdale, NJ: Lawrence Erlbaum Associates.

Thurman, B. A. (2009). Teaching of critical thinking skills in the English content area in South Dakota public high schools and colleges (Unpublished doctoral dissertation). Georgia State University, Atlanta, GA.

Vaughn, B. K. (2006). A hierarchical generalized linear model of random differential item functioning for polytomous items: A Bayesian multilevel approach (Unpublished doctoral dissertation). Florida State University, Tallahassee, FL.

Wang, N., & Lane, S. (1996). Detection of gender-related differential item functioning in a mathematics performance assessment. Applied Measurement in Education, 9, 175-199.

Welch, C. J., & Hoover, H. D. (1993). Procedures for extending item bias techniques to polytomously scored items. Applied Measurement in Education, 6, 1-19.

Welkenhuysen-Gybels, J. (2004). The performance of some observed and unobserved conditional invariance techniques for the detection of differential item functioning. Quality & Quantity, 38, 681-702.

Woods, C. M. (2008). Likelihood-ratio DIF testing: Effects of nonnormality. Applied Psychological Measurement, 32, 511-526.

Yee, T. W. (2010). The VGAM package for categorical data analysis. Journal of Statistical Software, 32(10), 1-34.

Zumbo, B. D. (1999). A handbook on the theory and methods for differential item functioning: Logistic regression modeling as a unitary framework for binary and Likert-type (ordinal) item scores. Ottawa, Canada: Directorate of Human Resources Research and Evaluation, Department of National Defense.

Zwick, R., & Thayer, D. T. (1996). Evaluating the magnitude of differential item functioning in polytomous items. Journal of Educational and Behavioral Statistics, 21, 187-201.
BIOGRAPHICAL SKETCH

Yasemin Kaya was born in Mudurnu, Turkey. She completed her Bachelor of Arts in Elementary Science Education in the Faculty of Education at Pamukkale University in Turkey in 2006. She received her Master of Arts in Education degree from the Research and Evaluation Methodology program at the University of Florida in the fall of 2010. She is currently enrolled in the Ph.D. program in Research and Evaluation Methodology at the University of Florida.