
DIF Detection across Two Methods of Defining Group Comparisons


Material Information

Title:
DIF Detection across Two Methods of Defining Group Comparisons: Pairwise and Composite Group Comparisons
Physical Description:
1 online resource (101 p.)
Language:
english
Creator:
Sari, Halil I
Publisher:
University of Florida
Place of Publication:
Gainesville, Fla.
Publication Date:

Thesis/Dissertation Information

Degree:
Master's (M.A.E.)
Degree Grantor:
University of Florida
Degree Disciplines:
Research and Evaluation Methodology, Human Development and Organizational Studies in Education
Committee Chair:
Huggins, Anne Corinne
Committee Members:
Leite, Walter Lana

Subjects

Subjects / Keywords:
composite -- dif -- fairness -- pairwise
Human Development and Organizational Studies in Education -- Dissertations, Academic -- UF
Genre:
Research and Evaluation Methodology thesis, M.A.E.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract:
Differential item functioning (DIF) analysis is a key component in the evaluation of fairness as a lack of bias in educational tests (AERA, APA, & NCME, 1999; Zwick, 2012). This study compares two methods of defining groups for the detection of differential item functioning (DIF): (a) pairwise comparisons and (b) composite group comparisons. The two methods differ in how they implicitly define fairness as a lack of bias, yet the vast majority of DIF studies use pairwise methods without justifying this decision and/or connecting the decision to the appropriate definition of fairness. This study aims to emphasize and empirically support the notion that our choice of pairwise versus composite group definitions in DIF is a reflection of how we define fairness in DIF studies. In this study, a simulation was conducted based on data from a 60-item ACT Mathematics test (ACT, 1997; Hanson & Beguin, 2002). The unsigned area measure (Raju, 1988) was utilized as the DIF detection method. Results indicate that the amount of flagged DIF was lower in composite comparisons than in pairwise comparisons. The results were discussed in connection to the differing definitions of fairness. Practitioners were recommended to explicitly define fairness as a lack of bias within their own measurement context, and to choose pairwise or composite methods in a manner that aligns with their definition of fairness. Limitations and suggestions for further research were also provided to researchers and practitioners.
General Note:
In the series University of Florida Digital Collections.
General Note:
Includes vita.
Bibliography:
Includes bibliographical references.
Source of Description:
Description based on online resource; title from PDF title page.
Source of Description:
This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility:
by Halil I Sari.
Thesis:
Thesis (M.A.E.)--University of Florida, 2013.
Local:
Adviser: Huggins, Anne Corinne.
Electronic Access:
RESTRICTED TO UF STUDENTS, STAFF, FACULTY, AND ON-CAMPUS USE UNTIL 2015-08-31

Record Information

Source Institution:
UFRGP
Rights Management:
Applicable rights reserved.
Classification:
lcc - LD1780 2013
System ID:
UFE0045994:00001

Full Text

DIF DETECTION ACROSS TWO METHODS OF DEFINING GROUP COMPARISONS: PAIRWISE AND COMPOSITE GROUP COMPARISONS

By

HALIL IBRAHIM SARI

A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF ARTS IN EDUCATION

UNIVERSITY OF FLORIDA

2013

© 2013 Halil Ibrahim Sari

To my family

ACKNOWLEDGMENTS

First of all, I would like to thank my patient advisor, Dr. Anne Corinne Huggins, for guiding me through my thesis. I would not have completed this thesis without her help. I am proud to be her graduating advisee. I would also like to thank my committee member, Dr. Walter Leite, for providing access to the computers used to run my analyses. Secondly, I am thankful to my beloved wife, Hasibe Yahsi Sari, who has always been there to provide support and motivation for my graduate studies. Lastly, I would like to thank the Turkish Government for providing a full scholarship to pursue my graduate studies in the U.S. Without this scholarship, I would not have been able to afford my graduate studies. I also thank Necla Sari, Kadriye Sari, and Suleyman Sari for trusting me.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
ABSTRACT

CHAPTER

1 INTRODUCTION

2 LITERATURE REVIEW
   Fairness
   Item Response Theory
   History of Bias and DIF Studies
   Importance of Detecting DIF
   IRT-Based DIF Detection Methods
      Area Measures
      Likelihood Ratio Test
      Lord's Chi-Square Test (χ²)
   Non-IRT DIF Detection Methods
      Mantel-Haenszel
      Logistic Regression
   Pairwise and Composite Group Comparisons in DIF Analysis
   Advantages to Using a Composite Group Approach in DIF Studies

3 OBJECTIVES AND RESEARCH QUESTIONS

4 RESEARCH DESIGN AND METHOD
   Data Generation
   Study Design Conditions
      Number of Groups
      Magnitude of True b Parameter Differences
      Nature of Group Differences in b Parameters
   Data Analysis

5 RESULTS
   Results of All Conditions Classified as All Groups Differ in b Parameters
      3-Group Results
      4-Group Results
      5-Group Results
   Results of All Conditions Classified as One Group Differs in b Parameters
      3-Group Results
      4-Group Results
      5-Group Results

6 CONCLUSIONS AND DISCUSSIONS

7 LIMITATIONS AND FURTHER RESEARCH

APPENDIX

A THE TRUE b PARAMETER DIFFERENCES
B SIMULATION RESULTS
C EXAMPLE TABLES
D FIGURES

LIST OF REFERENCES

BIOGRAPHICAL SKETCH

LIST OF TABLES

A-1 True item difficulty parameters across the groups
B-1 Effect size and percentage of statistically significant results: 3 groups, small true b DIF, and all groups differ
B-2 Effect size and percentage of statistically significant results: 3 groups, moderate true b DIF, and all groups differ
B-3 Effect size and percentage of statistically significant results: 3 groups, large true b DIF, and all groups differ
B-4 Effect size and percentage of statistically significant results: 4 groups, small true b DIF, and all groups differ
B-5 Effect size and percentage of statistically significant results: 4 groups, moderate true b DIF, and all groups differ
B-6 Effect size and percentage of statistically significant results: 4 groups, large true b DIF, and all groups differ
B-7 Effect size and percentage of statistically significant results: 5 groups, small true b DIF, and all groups differ
B-8 Effect size and percentage of statistically significant results: 5 groups, moderate true b DIF, and all groups differ
B-9 Effect size and percentage of statistically significant results: 5 groups, large true b DIF, and all groups differ
B-10 Effect size and percentage of statistically significant results: 3 groups, small true b DIF, and only one group differs
B-11 Effect size and percentage of statistically significant results: 3 groups, moderate true b DIF, and only one group differs
B-12 Effect size and percentage of statistically significant results: 3 groups, large true b DIF, and only one group differs
B-13 Effect size and percentage of statistically significant results: 4 groups, small true b DIF, and only one group differs
B-14 Effect size and percentage of statistically significant results: 4 groups, moderate true b DIF, and only one group differs
B-15 Effect size and percentage of statistically significant results: 4 groups, large true b DIF, and only one group differs
B-16 Effect size and percentage of statistically significant results: 5 groups, small true b DIF, and only one group differs
B-17 Effect size and percentage of statistically significant results: 5 groups, moderate true b DIF, and only one group differs
B-18 Effect size and percentage of statistically significant results: 5 groups, large true b DIF, and only one group differs
C-1 Example of a contingency table

LIST OF FIGURES

D-1 Item characteristic curve (ICC) for a 2PL item
D-2 The shaded area between two ICCs is a visual representation of DIF
D-3 A conceptual model of two different methods of group definition in DIF
D-4 DIF under pairwise and composite group comparisons for two groups
D-5 Item characteristic curves (ICCs) for three groups and the operational ICC across the groups
D-6 Effect size results for the three groups
D-7 Effect size results for the four groups
D-8 Effect size results for the five groups

LIST OF ABBREVIATIONS

CTT Classical Test Theory
DIF Differential Item Functioning
ICC Item Characteristic Curve
IRT Item Response Theory
MH Mantel-Haenszel

Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Arts in Education

DIF DETECTION ACROSS TWO METHODS OF DEFINING GROUP COMPARISONS: PAIRWISE AND COMPOSITE GROUP COMPARISONS

By

Halil Ibrahim Sari

August 2013

Chair: Anne Corinne Huggins
Major: Research and Evaluation Methodology

Differential item functioning (DIF) analysis is a key component in the evaluation of fairness as a lack of bias in educational tests (AERA, APA, & NCME, 1999; Zwick, 2012). This study compares two methods of defining groups for the detection of differential item functioning (DIF): (a) pairwise comparisons and (b) composite group comparisons. The two methods differ in how they implicitly define fairness as a lack of bias, yet the vast majority of DIF studies use pairwise methods without justifying this decision and/or connecting the decision to the appropriate definition of fairness. This study aims to emphasize and empirically support the notion that our choice of pairwise versus composite group definitions in DIF is a reflection of how we define fairness in DIF studies. In this study, a simulation was conducted based on data from a 60-item ACT Mathematics test (ACT, 1997; Hanson & Beguin, 2002). The unsigned area measure (Raju, 1988) was utilized as the DIF detection method. Results indicate that the amount of flagged DIF was lower in composite comparisons than in pairwise comparisons. The results were discussed in connection to the differing definitions of fairness. Practitioners were recommended to explicitly define fairness as a lack of bias within their own measurement context, and to choose pairwise or composite methods in a manner that aligns with their definition of fairness. Limitations and suggestions for further research were also provided to researchers and practitioners.

CHAPTER 1
INTRODUCTION

Studies on item bias were first undertaken in earnest in the 1960s (Holland & Wainer, 1993). These item bias studies were initially concerned with particular cultural groups (e.g., Blacks, Whites, and Hispanics) in order to study the possible influence of cultural differences on item measurement properties (Holland & Wainer, 1993). In earlier years, test bias and item bias were the preferred terminology used in these studies. However, as studies accumulated, the new expression of differential item functioning (DIF) came into use (Osterlind & Everson, 2009). Formally, DIF exists in an item if examinees who have the same standing on the trait of interest, but who are from separate subgroups of the population, differ in their expected score on the item (p. 107, as cited in Penfield & Algina, 2006). That is, an item is said to be differentially functioning if the probability of a correct response is different for examinees at the same ability level but from different groups (Pine, 1977).

It is necessary and important to assess for DIF in instruments because DIF is not only a form of possible measurement bias but also an ethical concern. DIF is an important part of understanding concerns related to test validity and fairness (Thurman, 2009). When test scores or test items create an advantage for one group over another, validity and test fairness are threatened (Kane, 2006; Messick, 1988). Thus, DIF analysis is a key component in the evaluation of fairness as a lack of bias in educational tests (AERA, APA, & NCME, 1999; Zwick, 2012).

DIF analyses compare the item performance of subgroups, but the comparison is conditioned on the level of ability. The comparison of groups conditioned on ability is almost always done by comparing the groups directly to each other, which can be called a pairwise comparison. The most common approach used by psychometricians is to define groups and then compare them directly using these pairwise comparisons. For example, if participants are defined as Black, White, or Hispanic, then one group is selected as a reference group to which the other, focal groups are compared. Historically, the reference group consists of individuals whom the test is expected to favor (e.g., a majority group such as Whites), and the focal groups consist of individuals who may be at risk of being disadvantaged by test items (e.g., minority groups such as Blacks and Hispanics) (Penfield & Camilli, 2007). Then, each focal group is directly compared to the reference group (e.g., Blacks compared to Whites, and Hispanics compared to Whites). This approach is seen throughout the literature on DIF (Kanjee, 2007; Coffman & Belue, 2009; Hidalgo & Pina, 2004; Penfield & Algina, 2006; Penfield & Camilli, 2007; Kim & Cohen, 1991; Guler & Penfield, 2009; Flowers et al., 1999; Woods, 2008; Hanson & Beguin, 2002; Kim & Cohen, 1995). However, Penfield (2001) states that multiple pairwise comparisons result in an increased Type I error rate and can prove to be very time consuming.

A less common approach of composite group comparisons was introduced by Ellis and Kimmel (1992). In this approach, each individual group is compared to the population, which can be called a composite group. After a thorough literature review, it appears that Ellis and Kimmel (1992) have been the only researchers who have conducted DIF analysis in this composite group manner. However, it is important to note that this approach to group comparisons is widely used in assessing equating invariance (Dorans & Holland, 2000), which is a form of testing for measurement invariance that focuses on equated test-level scores as opposed to the item-level focus of DIF analysis. When Ellis and Kimmel (1992) proposed this new method, they argued that they did not imply that pairwise comparisons are unimportant or should not be used in DIF studies. Moreover, they claimed that the two methods answer two different questions: pairwise comparisons enable us to draw conclusions about how one group differs from another group (or groups), whereas their proposed method of composite group comparisons enables us to answer a different question about how each group differs from the population (Ellis & Kimmel, 1992).

This study aims to emphasize and empirically support the notion that our choice of pairwise versus composite group definitions in DIF analysis affects the interpretation of results and reflects how we define fairness in DIF studies. In other words, the definition of fairness that is of most interest in a particular DIF investigation should determine whether pairwise or composite group DIF methods are utilized. Currently, there is no empirical comparison of these methods to guide researchers in making this choice, and there is no theoretical connection made between these different choices of group comparisons and the definition of fairness in measurement. While many researchers have compared the various methods of DIF detection (e.g., likelihood ratio tests, area measures, and contingency table approaches), no studies have focused on the approach toward defining and comparing groups during a DIF investigation.

The main goal of this study is to examine whether the approach to defining groups for DIF comparisons impacts the magnitude and/or significance of detected DIF in items. It is hypothesized that the significance and effect size of the detected DIF will vary across pairwise and composite group comparisons. In addition, this study discusses how the choice of these methods is connected to definitions of test fairness and draws some conclusions about which type of group comparison aligns best with the definition of fairness provided by the Standards for Educational and Psychological Testing (AERA, APA, NCME, 1999).

CHAPTER 2
LITERATURE REVIEW

Fairness

Measurement fairness issues first came to attention in the 1960s after the Civil Rights Act (Osterlind & Everson, 2009). The main element of these regulations was related to racial concerns (Camilli, 2006). Then, issues related to fairness were included in the Standards for Educational and Psychological Tests. Three eminent professional organizations, the American Educational Research Association (AERA), the American Psychological Association (APA), and the National Council on Measurement in Education (NCME), worked jointly to develop the first version of the Standards for Educational and Psychological Testing in 1966 (Osterlind & Everson, 2009). Since 1966, the Standards have been updated, and three other versions were published in 1974, 1985, and 1999. All editions have served as the authoritative source for test development and have been used by practitioners in the United States and in other countries in which national standards have not been published (Oakland, 2004). The current edition (AERA, APA, & NCME, 1999) is the fourth iteration of these publications and is currently being revised; the fifth draft is scheduled for release in 2013.

In the Standards (1999), fairness is elaborately discussed and handled under four subsections: (a) fairness as equitable treatment in the testing process, (b) fairness as opportunity to learn, (c) fairness as equality in outcomes of testing, and (d) fairness as a lack of bias. Equitable treatment refers to providing individuals equal opportunity to prepare for a test and ensuring that all participants receive the same appropriate test conditions. According to the Standards, fairness "requires that all examinees be given a comparable opportunity to demonstrate their standing on the construct(s) the test is intended to measure" (AERA, APA, & NCME, 1999, p. 74).

Fairness as opportunity to learn concerns the opportunity to learn the subject matter covered by the test content. It is exemplified in the Standards (1999) that when individuals have not had the opportunity to learn the subject matter, they will most likely receive lower scores. The lower scores may have resulted in part from not having had the opportunity to learn the material. In such circumstances, reporting lower test scores for these individuals can be viewed as a problematic lack of fairness (AERA, APA, & NCME, 1999). Fairness as equality in outcomes of testing ensures that test score distributions are comparable across groups regardless of group membership. If test score distributions differ from one group to another, it is generally not desirable to use the test, particularly if other tests that do not suffer from this problem are available (AERA, APA, & NCME, 1999). However, outcome differences across groups do not necessarily indicate a lack of fairness (AERA, APA, & NCME, 1999).

Fairness as a lack of bias, which is the main concern in this study, is handled as a technical term in the Standards (1999). The Standards (AERA, APA, NCME, 1999) state that bias "is said to arise when deficiencies in a test itself or the manner in which it is used result in different meanings for scores earned by members of different identifiable subgroups." Two types of bias, bias in measurement and predictive bias, have been introduced by the Standards (1999); comparing the pattern of association between test scores and external variables across different groups is related to predictive bias (AERA, APA, NCME, 1999). On the other hand, bias in measurement is examined within the test scores and item responses of the test in question, which is an internal process of the test. Even though these two types of bias seem different from one another, researchers would never consider either alone when examining validity evidence (Camilli & Shepard, 1994). Fairness as a lack of bias is a moral concern in testing and test use (Holland & Wainer, 1993). Thus, even though absolute fairness is not possible, since it is directly related to the validity of test scores, the maximum level of attention should be given to fairness as a lack of bias.

In order to assess fairness as a lack of bias, item performance is tested via DIF. DIF methods compare item statistics across individuals belonging to different groups but with the same latent ability levels to make a conditional comparison of the performance of the groups. DIF provides results that assist with decisions about eliminating or revising items that perform differently across groups (Gipps & Murphy, 1994). Traditionally, when examining DIF, the group of individuals that is potentially expected to be affected unfairly (e.g., a minority group) is selected first, and another group of individuals that is potentially expected to have an advantage (e.g., a majority group) is compared to that group. Although this pairwise approach is the typical way to examine DIF, no one to date has inquired whether or not it is an appropriate approach to examine fairness as a lack of bias. In fact, group comparison choices in DIF have never been directly related to fairness as a lack of bias in the Standards. It may be that the pairwise approach does not always provide us with information about a true lack of fairness.

Item Response Theory

Although classical test theory (CTT) has served test development well over several decades (Embretson & Reise, 2000), both psychometric theoreticians and practitioners have expressed some dissatisfaction with CTT methodology (Baker, 1992). In 1953, Dr. Frederic Lord (as cited in Hambleton & Jones, 1993) claimed that CTT focuses on true scores and observed scores, which are test dependent, and that latent ability scores have the advantage of being test independent. Lord (1980) later asserted that we must know something about an examinee's ability before a test is given to him or her, and that we must also know how effective each item in the pool is for measuring at each ability level (p. 12). Over the years, his claim has been echoed by several researchers and practitioners (see Loyd, 1988; Crocker & Algina, 1986; McKinley & Mills, 1989; Baker, 1992; Hambleton & Jones, 1993; Embretson & Reise, 2000; Brennan, 2011), and it has been emphasized that under CTT, examinee test scores will always depend on the selection of items in the task: test takers will always receive lower scores on harder tests and higher scores on easier tests despite the fact that their ability level remains constant over the tasks (Hambleton & Jones, 1993). Due to these concerns about CTT, item response theory (IRT) has rapidly become mainstream as the theoretical basis for measurement (Embretson & Reise, 2000, p. 3).

IRT is a general statistical theory about examinee item and test performance and how item performance relates to the latent abilities that are measured by the items in the test. Because we cannot directly observe an examinee's ability, we must attempt to determine his or her level of skills; in order to do this, the tester must figure out the relationship between the level of skills and the responses to a given item. This relationship is called an item response function or item characteristic curve (ICC). An ICC is the functional relationship between the probability of a given response to an item and a trait or ability level (θ) (Baker, 2000). The monotonicity assumption of IRT asserts that as ability level increases, the probability of success on an item increases. Figure D-1 shows an example ICC from an item modeled under the two-parameter logistic model (2PL), which is one of the most widely used IRT models.

In IRT, there are several models for dichotomous (i.e., binary) test data. Models such as the Rasch model (Rasch, 1960), the one-parameter logistic model (1PL) (Rasch, 1960, 1980), the 2PL (Lord & Novick, 1968), and the three-parameter logistic model (3PL) (Birnbaum, 1968) were developed to estimate parameters of binary items and examinee abilities. Each model has some advantages over the others; however, the 2PL model is the focus of this study. According to Birnbaum (1968), the 2PL model defines the conditional probability of a correct response to item j (X_ij = 1) as

    P(X_{ij} = 1 \mid \theta_i) = \frac{e^{a_j(\theta_i - b_j)}}{1 + e^{a_j(\theta_i - b_j)}}    (2-1)

where P(X_ij = 1 | θ_i) is the probability of a correct response on item j for individual i, θ_i is the latent trait (ability) level of individual i, a_j is the discrimination parameter for item j, and b_j is the difficulty parameter for item j. In this model, all items are defined by two parameters that are freely estimated across the items of the test, those of discrimination and difficulty.
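To make Equation 2-1 concrete, the following minimal Python sketch evaluates the 2PL response probability across a range of ability levels and exhibits the monotonicity described above. It is an illustration only, not software from this study, and the item parameters are hypothetical.

```python
import numpy as np

def p_2pl(theta, a, b):
    """Probability of a correct response under the 2PL model (Equation 2-1)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# A hypothetical item with discrimination a = 1.2 and difficulty b = 0.5.
theta = np.linspace(-3, 3, 7)
print(np.round(p_2pl(theta, a=1.2, b=0.5), 3))
# Probabilities increase monotonically with theta, tracing out the ICC.
```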

History of Bias and DIF Studies

Although item bias research appears to have begun in the early 1900s (Camilli & Shepard, 1994), little attention was given to issues of test fairness and item bias until the 1960s (Osterlind & Everson, 2009, p. 22). After the historic Civil Rights Act of 1964 was enacted (Osterlind & Everson, 2009), the attention given to bias increased. The focus in this regulation was bias in IQ tests, because IQ tests were extensively used in educational decisions and for employee selection until the 1970s (Camilli & Shepard, 1994). However, Jensen (1969) argued that IQ is determined mainly by genetic differences rather than environmental influences, and he implied that when detecting biased items in tasks, considering the differences in IQ between Blacks and Whites without considering past discrimination is misleading (Jensen, 1969). Under such circumstances, it is quite likely that some items will be flagged as biased even if they are not, and some items will be flagged as unbiased even if they are (Holland & Wainer, 1993). Arguments over the definition of bias persisted over the years because some people defined it as a social concept while others defined it as a statistical concept (Holland & Wainer, 1993). Formally, in the social context, item bias is defined as a kind of invalidity that harms one group more than another (Shepard et al., 1984), and in the statistical context, an item is biased if equally able individuals from different groups do not have equal probabilities of answering the item correctly (Holland & Wainer, 1993, p. 4). Then, with increased studies, the term differential item functioning (DIF) came into use (Holland & Wainer, 1993) and has become a common term among researchers. DIF occurs when test takers from different demographic groups do not share the same probability of answering an item correctly, even though they have been matched on the trait of interest (e.g., ability level) (Clauser & Mazor, 1998).

Although the terms bias and DIF are sometimes used interchangeably, they are distinguished in the measurement literature. An item is biased if performance on that given item is differentially difficult for two different groups of examinees because of some characteristic of the item that is irrelevant to the purpose of the test (Zumbo, 1999). The presence of DIF is required for an item to be biased, but it is not sufficient for determining bias. That is, detecting DIF in an item does not necessarily imply that the item is biased (Lord, 1980). For example, when two different cultural groups are being compared in linguistic comparisons, it is quite likely that some items will show DIF but not be considered biased or unfair to use (Alderman & Holland, 1981). Thus, it should be noted that DIF investigations should be examined from a sensible perspective that acknowledges the difference between DIF and item bias (Osterlind & Everson, 2009). Over the years, many researchers have made valuable contributions to the DIF literature and have developed many methods to detect DIF (see Jensen, 1969; Lord, 1980; Holland & Wainer, 1993; Crocker & Algina, 1986; Camilli & Shepard, 1994; Kim et al., 1995; Dorans & Holland, 2000; Penfield & Algina, 2006; Osterlind & Everson, 2009).

Importance of Detecting DIF

There is a broad consensus among researchers that validity is the most important element in any research (AERA, APA, & NCME, 1999). However, it is impossible to examine validity itself without considering other issues such as fairness and DIF because, as emphasized before, DIF, validity, and fairness issues are linked (Gipps & Murphy, 1994). DIF indicates a possible threat to test validity and test fairness. In other words, the investigation of DIF in instruments is a way of examining the validity and fairness evidence in educational tests. Thus, especially in high stakes tests, practitioners consider the presence of DIF to be a significant problem for accurate and fair measurement. Cole and Moss (1989) stated that if there is bias in test items, it can be concluded that the test is not equally valid for different groups (Cole & Moss, 1989). Although some researchers believe that bias has little impact on validity (Roznowski & Reith, 1999; Zumbo, 2003), many researchers think that biased test items create an advantage for one group over another (see Pae & Park, 2006; Penfield & Algina, 2006; Osterlind & Everson, 2009) and that interpretation of test scores is inappropriate when these advantages are present (Gipps & Murphy, 1994). Hence, although a completely DIF-free test is a rare occurrence, it is known that the number and magnitude of DIF items should be examined and the test should be refined as necessary (Messick, 1988). Otherwise, it can be said that the test does not accurately measure the same construct for different groups.

IRT-Based DIF Detection Methods

A variety of IRT DIF detection methods have been introduced in the field of psychometrics. Three such methods from the last three decades are the likelihood ratio test, the area measures, and Lord's chi-square test (Kim & Cohen, 1995). However, there is still debate about which method is more powerful for the detection of DIF in item performance studies. Several studies have investigated the relative performance of these methods for DIF detection. Some researchers suggested using the likelihood ratio test to evaluate the significance of observed differences between two groups (see Thissen, Steinberg, & Gerrad, 1986; Thissen, Steinberg, & Wainer, 1988; Kim & Cohen, 1995). Moreover, it was seen as an advantage of the area measure method that it is easier to compute and requires a smaller sample size than the likelihood ratio method and Lord's chi-square. Overall, it seems the likelihood ratio test is preferred over other IRT-based DIF detection methods (Kim & Cohen, 1995). On the other hand, its limitation is that it can be an extremely time-consuming procedure if the number of items to test is large. This can be particularly problematic in simulation studies in which thousands of data sets are being generated. However, some researchers agree that these three methods give close results in the DIF detection procedure. Kim and Cohen (1995) compared these three methods and found that, concerning error rates and power, they provide very similar DIF detection results.

Area Measures

This method was first introduced by Raju in 1988, and it estimates DIF via the area between two ICCs, one for each of the two groups being compared (Raju, 1988). In Figure D-2, the shaded area between two ICCs displays a visual representation of how area measures define DIF. One option is visually inspecting the differences between the ICCs and making a reasonable judgment, but a more accurate method is to use test statistics (Osterlind & Everson, 2009). Much research has been done on area measure DIF methods, and the signed area (SA) and the unsigned area (UA) methods (Raju, 1988) were proposed over a bounded (closed) interval or an open (exact) interval on the scale (Cohen & Kim, 1993).

Raju (1988) defined the SA as

    SA = \int_{-\infty}^{\infty} \left[ P(Y=1 \mid \theta, G=R) - P(Y=1 \mid \theta, G=F) \right] d\theta    (2-2)

In this equation, P(Y=1) is the probability of answering an item correctly, and G is the group membership (e.g., reference or focal). Assuming that the c parameter is invariant across groups (which would always be the case in a 2PL or 1PL model), the equation can be estimated by

    SA = (1 - c)(b_F - b_R)    (2-3)

This formula can be used under the assumption of invariant c parameters, even in those cases where a parameters are not invariant across the groups (Raju, 1988). In this case, it can be said that the SA formula examines DIF in b (difficulty) parameters (Osterlind & Everson, 2009). However, when a parameters are not invariant across the groups, using the SA is misleading (Penfield & Camilli, 2007). To address this issue, Raju (1988) suggested the UA method and defined it as follows (Raju, 1988):

    UA = \int_{-\infty}^{\infty} \left| P(Y=1 \mid \theta, G=R) - P(Y=1 \mid \theta, G=F) \right| d\theta    (2-4)

The UA integral (Equation 2-4) differs from the SA integral (Equation 2-2) in that the UA integral contains the absolute value, so it always produces mathematically positive effect sizes. Therefore, DIF that favors one group at a particular ability level cannot be cancelled out by DIF that favors the other group at a different ability level. The UA formula can be used when c parameters are invariant across the groups; however, there is no analytic solution of the integrals when the c parameter varies across groups.
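As an illustration of Equations 2-2 through 2-4, the unsigned area can be approximated by numerical integration over a bounded interval of θ. The sketch below assumes the 2PL model (so the c parameter is invariant at zero) and uses hypothetical item parameters; it is not the calibration software used in this study.

```python
import numpy as np

def icc_2pl(theta, a, b):
    # 2PL item characteristic curve (Equation 2-1)
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def unsigned_area(a_ref, b_ref, a_foc, b_foc, lo=-4.0, hi=4.0, n=2001):
    """Approximate Raju's UA (Equation 2-4) over a bounded theta interval."""
    theta = np.linspace(lo, hi, n)
    gap = np.abs(icc_2pl(theta, a_ref, b_ref) - icc_2pl(theta, a_foc, b_foc))
    return np.trapz(gap, theta)

# Hypothetical item parameters for a reference and a focal group.
print(round(unsigned_area(1.0, 0.0, 1.0, 0.5), 3))  # b-DIF only: UA close to |b_F - b_R| = 0.5
```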

Implementing area measure methodology requires separate calibrations of the item parameters in the reference and focal groups (Kim & Cohen, 1995). Under pairwise comparisons, one would estimate the item parameters from the reference group and the focal group and then directly compare the resultant group-level ICCs. Under composite group comparisons, one would estimate the item parameters for each group separately and also for the composite group. Then, one would utilize Equation 2-4.

Likelihood Ratio Test

The likelihood ratio test compares the likelihood of the data when the item parameters are constrained to be invariant for the reference and focal groups with the likelihood when the parameters are free to vary across the groups (p. 50). Thissen, Steinberg, and Wainer (1993) define the likelihood ratio test as

    G^2 = -2 \ln \left[ \frac{L(C)}{L(A)} \right]    (2-5)

where L(C) represents a model in which both groups are constrained to have the same item parameters, and L(A) represents a model in which the item parameters of the item being tested for DIF are free to vary across the groups. G² is distributed approximately as a chi-square variable with degrees of freedom equal to the difference in the number of parameters between the two models. In Equation 2-5, the L(C) model is calibrated across the overall population; thus, we can argue that, by nature, the likelihood ratio test is a form of the composite group comparison approach, although the results from this analysis would look different from the composite group approach used in the methodology of this study.
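A minimal sketch of the mechanics of Equation 2-5, assuming the constrained and free models have already been fitted by IRT software; the log-likelihood values here are hypothetical placeholders.

```python
from scipy.stats import chi2

# Hypothetical log-likelihoods from an IRT calibration:
# loglik_c: studied item's parameters constrained equal across groups, L(C)
# loglik_a: studied item's parameters free to vary across groups, L(A)
loglik_c = -10250.4
loglik_a = -10246.1

g2 = -2.0 * (loglik_c - loglik_a)   # Equation 2-5
df = 2                              # a and b freed for one 2PL item
p_value = chi2.sf(g2, df)
print(f"G2 = {g2:.2f}, df = {df}, p = {p_value:.4f}")
```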

Lord's Chi-Square Test (χ²)

This method is also known as Lord's chi-square method or the differences in item parameters procedure. The method is calculated by contrasting b parameters. Lord (1980) defined the following formula:

    d = \frac{\hat{b}_R - \hat{b}_F}{SE(\hat{b}_R - \hat{b}_F)}    (2-6)

where SE(b̂_R − b̂_F) is the standard error of the difference between the parameter estimates for the reference and focal groups (Lord, 1980), calculated as

    SE(\hat{b}_R - \hat{b}_F) = \sqrt{SE^2(\hat{b}_R) + SE^2(\hat{b}_F)}    (2-7)

However, when the 2PL or 3PL model is of interest, the difference in b parameters alone might be a misleading estimate of DIF. In this case, Lord (1980) suggested that a chi-square test of the simultaneous differences between a and b parameters may be a more appropriate test for DIF. Thus, the following vector of between-group differences (Lord, 1980) is computed as

    v' = \left( \hat{a}_R - \hat{a}_F, \; \hat{b}_R - \hat{b}_F \right)    (2-8)

and the test statistic can be computed as

    \chi^2 = v' S^{-1} v    (2-9)

where S represents the estimated variance-covariance matrix of the between-group differences in the a and b parameter estimates.

When using Equation 2-6, if b̂_R and b̂_F are calculated across only two specific groups from a set of groups, the test reflects a pairwise comparison approach. However, if b̂_R and b̂_F are calculated such that the focal group is a single group and the reference group includes all examinees, the test reflects a composite group comparison approach. Lord's chi-square test can be used in both pairwise and composite group comparison approaches, as seen in the Ellis and Kimmel (1992) study.

Non-IRT DIF Detection Methods

Non-IRT-based DIF detection methods such as the Mantel-Haenszel method (Mantel & Haenszel, 1959) and the logistic regression method (Swaminathan & Rogers, 1990) are also arguably popular in DIF studies. In fact, because of its computational ease, the Mantel-Haenszel (MH) statistic has previously been cited as the most widely used method to evaluate DIF (Clauser & Mazor, 1998). The logistic regression method is less commonly used in DIF studies, yet it is thought to be more powerful than the MH statistic (Hidalgo & Pina, 2004).

Mantel-Haenszel

The Mantel-Haenszel method was first introduced by Nathan Mantel and William Haenszel (1959) and further developed by Holland and Thayer (1988). This approach utilizes contingency tables to compare the item performance of groups that were previously matched on ability level (Hidalgo & Pina, 2004). The MH procedure is based on a chi-square distribution and involves the creation of K x 2 x 2 chi-square contingency tables, where K is the number of ability-level groups and the 2 x 2 tables represent the frequency counts of correct and incorrect responses for each of two groups (Zwick, 2012). Table C-1 shows an example of a 2 x 2 contingency table.

MH calculations begin with the odds p/q, where p indicates the probability of a correct response to an item and q indicates the probability of an incorrect response:

    \text{odds} = \frac{p}{q} = \frac{p}{1 - p}    (2-10)

Then, the common odds ratio in the MH procedure (α̂_MH) (Mantel & Haenszel, 1959), which expresses the linear association between the row and column variables in the table (Osterlind & Everson, 2009), is calculated as

    \hat{\alpha}_{MH} = \frac{\sum_k a_k d_k / N_k}{\sum_k b_k c_k / N_k}    (2-11)

where a_k and c_k represent the numbers of examinees who answered the item correctly in the reference and focal groups, respectively, b_k and d_k represent the numbers of examinees who answered the item incorrectly in the reference and focal groups, and N_k is the total number of participants within the kth score level. However, it is difficult to interpret the MH statistic (Osterlind & Everson, 2009). Thus, the MH D-DIF index was introduced by Holland and Thayer (1988) and is defined as

    MH\ D\text{-}DIF = -2.35 \ln(\hat{\alpha}_{MH})    (2-12)

For those who prefer using the MH D-DIF statistic, researchers from the Educational Testing Service (ETS) provided item categories labeled A, B, and C. Based on the ETS classification, items with absolute values of MH D-DIF less than 1 are labeled A items, items with absolute values of MH D-DIF between 1 and 1.5 are labeled B items, and items with absolute values of MH D-DIF of at least 1.5 are labeled C items (Zwick, 2012).
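To illustrate Equations 2-11 and 2-12, the sketch below computes the common odds ratio and the MH D-DIF index for hypothetical counts arranged in the contingency-table layout described above.

```python
import numpy as np

# Hypothetical K x 2 x 2 tables: each row is one score level k, with columns
# (a_k, b_k, c_k, d_k) = reference correct, reference incorrect, focal correct, focal incorrect.
tables = np.array([
    [30, 20, 22, 28],
    [45, 15, 35, 25],
    [55, 10, 48, 17],
], dtype=float)

a, b, c, d = tables.T
N = tables.sum(axis=1)

alpha_mh = (a * d / N).sum() / (b * c / N).sum()   # Equation 2-11
mh_d_dif = -2.35 * np.log(alpha_mh)                # Equation 2-12
print(f"alpha_MH = {alpha_mh:.3f}, MH D-DIF = {mh_d_dif:.3f}")
```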

Zwick and Ercikan (1989) pointed out that A and B items can be used in tests but that C items should be selected only if they are necessary to achieve test specifications (Zwick & Ercikan, 1989). However, these types of decisions are always dependent on the particular test use. Although the MH method is very effective even when the sample size is small, MH contingency table approaches assume group independence. In other words, based on the contingency table, the examinee groups being compared must be independent groups. Thus, pairwise methods can be used with the MH procedure because a given examinee is in only one of the groups being compared. However, a composite group approach would violate this assumption of the statistical tests underlying the MH procedure.

Logistic Regression

The logistic regression method was developed by Swaminathan and Rogers (1990) and is based on a probability function that is estimated by methods of maximum likelihood. This approach is a model-based procedure and models a nonlinear relationship between the probability of a correct response to the studied item and the observed test score (Penfield & Camilli, 2007). The general equation of the logistic regression can be expressed as

    P(Y = 1 \mid X, G) = \frac{e^z}{1 + e^z}    (2-13)

where X is the observed test score, G is the group membership (dummy coded), and z is

    z = \beta_0 + \beta_1 X + \beta_2 G + \beta_3 (XG)    (2-14)

where β₀ is the intercept and represents the probability of a response category when X and G are equal to zero; β₁ is the ability regression coefficient associated with the total test score; β₂ is the coefficient for the group variable; and β₃ is the interaction coefficient (Swaminathan & Rogers, 1990). In the case that β₂ = β₃ = 0, the null hypothesis of no DIF is retained, and in the case that β₂ ≠ 0 and/or β₃ ≠ 0, the null hypothesis of no DIF is rejected (Guler & Penfield, 2009). When there are only two groups, G is assigned a value of 0 for the focal group and 1 for the reference group. In the three-group case, two dummy-coded variables are needed in which two focal groups are compared to one reference group: one dummy variable compares one of the focal groups to the reference group, and the other dummy variable compares the other focal group to the reference group. As can be seen, the logistic regression DIF detection method is, by nature, a pairwise comparison.
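As a sketch of Equations 2-13 and 2-14, the code below simulates item responses with uniform DIF, fits the full and reduced logistic regression models with the statsmodels library, and jointly tests β₂ and β₃ with a likelihood ratio test. The data and effect sizes are invented for illustration; this is not the thesis's analysis code.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)                      # observed total test score (standardized)
g = rng.integers(0, 2, size=n)              # group membership (0 = focal, 1 = reference)
z = -0.2 + 1.0 * x + 0.5 * g                # uniform DIF: beta2 = 0.5, beta3 = 0
y = rng.binomial(1, 1 / (1 + np.exp(-z)))   # simulated item responses

X_full = sm.add_constant(np.column_stack([x, g, x * g]))   # Equation 2-14
X_null = sm.add_constant(x)                                # model with beta2 = beta3 = 0

fit_full = sm.Logit(y, X_full).fit(disp=0)
fit_null = sm.Logit(y, X_null).fit(disp=0)

g2 = 2 * (fit_full.llf - fit_null.llf)      # joint test of beta2 and beta3
print(f"G2 = {g2:.2f}, p = {chi2.sf(g2, df=2):.4f}")
```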

Pairwise and Composite Group Comparisons in DIF Analysis

Over the past three decades, much research has been conducted on DIF. However, as explained above, DIF analysis routinely compares two groups to each other, known as a pairwise comparison (see Liu & Dorans, 2013; Ellis & Kimmel, 1993; Yildirim & Berberoglu, 2009; Fidalgo et al., 2000; Ankenmann et al., 1999; Penfield & Algina, 2006; Guler & Penfield, 2009; Flowers et al., 1999; Woods, 2008). In fact, even when there are more than two groups of concern in a DIF analysis, the most commonly used approach is to select a reference group, define each of the other groups as focal groups, and compare each focal group directly and independently to the reference group (see Kim & Cohen, 1995; Penfield, 2001). In this approach and under an area measure DIF method, the group-level item parameter estimates are compared directly to each other, none of which are used in operational practice. Although the pairwise approach is extensively used, it has been criticized for low power, a high Type I error rate, and being time consuming (Penfield, 2001).

Another approach of composite group comparisons is less commonly used in DIF studies (Liu & Dorans, 2013). In this approach, each individual group is compared to the population, which can be called a composite group. For example, if a variable categorizes examinees into three groups (e.g., Black, White, and Hispanic), then when using an area measure approach, item parameter estimates calibrated using only the data from Black participants are compared to item parameter estimates calibrated using the entire population of participants (including Black, White, and Hispanic participants). The item parameter estimates based on the whole population are considered to be the operational item parameters, as these parameters would be used to estimate reported scores in practice. Figure D-3 shows a conceptual model of the two types of group comparisons explored in this study.

As noted above, the pairwise approach is done such that one group is chosen as a reference group, and all other groups are compared to that reference group. For example, focusing on the top part of Figure D-3, if group 1 is chosen as the reference group, then the top two pairwise comparisons in Figure D-3 would be used, but the bottom pairwise comparison (i.e., group 2 to group 3) would be omitted from the analysis. Omitting the pairwise comparison between group 2 and group 3 can be thought of as a limitation of the pairwise approach in multiple-group DIF studies.

Ellis and Kimmel (1992) are the only researchers (to the author's knowledge) who have conducted DIF analysis in the composite group manner. In their study, they investigated the presence of DIF among American, French, and German students by selecting each group as a focal group and the full population (i.e., the composite group) as the reference group. They used Lord's chi-square DIF detection method (Equation 2-9) for all composite group comparisons. The concern in their study was to examine omni-cultural differentiation and to find the relations between each group and the population. In contrast, the concern in this study is with respect to how one defines fairness in measurement.

Figure D-4 displays how pairwise and composite group comparisons define area measure DIF. As can be inferred from Figure D-4, the amount of flagged DIF is expected to be smaller in some composite group comparisons than in some pairwise comparisons. However, this will not always be the case. For example, when there are three groups in a DIF analysis, it is quite possible that two groups will have very similar ICCs and one group's ICC will differ from the others. In such circumstances, the operational ICC will be closer to some groups than others. Thus, the amount of flagged DIF will be smaller in a pairwise comparison between two similar groups than it will be in a composite group comparison between the population and a group that is quite different from that population. Figure D-5 displays a visual example of this possible situation.
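To make the contrast concrete, the sketch below computes unsigned areas (Equation 2-4) under both group definitions for three hypothetical groups, two of which have nearly identical ICCs. As a stand-in for a calibration on the pooled population, the operational (composite) ICC is approximated here by the equal-weight average of the group ICCs; that simplification, like the parameter values, is an assumption of this illustration.

```python
import numpy as np

def icc_2pl(theta, a, b):
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-4, 4, 2001)
# Hypothetical 2PL parameters: groups 1 and 2 are similar, group 3 differs.
groups = {"g1": (1.0, 0.0), "g2": (1.0, 0.1), "g3": (1.0, 0.8)}
iccs = {name: icc_2pl(theta, a, b) for name, (a, b) in groups.items()}

# Stand-in for the operational ICC: equal-weight average of group ICCs.
composite = np.mean(list(iccs.values()), axis=0)

ua = lambda p, q: np.trapz(np.abs(p - q), theta)
print("pairwise g1 vs g2  :", round(ua(iccs["g1"], iccs["g2"]), 3))
print("pairwise g1 vs g3  :", round(ua(iccs["g1"], iccs["g3"]), 3))
print("composite g3 vs pop:", round(ua(iccs["g3"], composite), 3))
# The similar pair shows little DIF; g3's composite comparison falls between
# the pairwise extremes because the population curve is pulled toward g1 and g2.
```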

The main goal for both pairwise and composite group DIF comparisons is to examine test items for fairness as a lack of bias. However, it can be argued that the two approaches evaluate different types of lack of bias that align with different definitions of fairness. In pairwise group comparisons, fairness is achieved if the function for one group on an item is the same relative to the other groups (i.e., comparing group-level functions to each other). In composite group comparisons, fairness is achieved if the function for one group on an item is the same relative to the function based on the composite group (i.e., the function used in operational practice). Within this framework, a question is begged: which definition of fairness should determine whether pairwise or composite group DIF methods are utilized?

The Standards (AERA, APA, NCME, 1999) provide definitions of fairness that are to permeate the field of educational measurement and drive decisions about how fairness is defined. Therefore, it is appropriate to connect the definition of fairness in the Standards (AERA, APA, NCME, 1999) to the choice of group definition in DIF analysis. The Standards state that bias is said to arise when deficiencies in a test itself or the manner in which it is used result in different meanings for scores earned by members of different identifiable subgroups. According to this definition, it can be argued that we are concerned with the scores that students receive based on the operational item characteristic curves (ICCs), as this is the manner in which the test is used, and that we want those scores to have the same meaning across groups. This aligns with comparing group-level ICCs to operational ICCs, which aligns with the definition of composite group comparisons.

Furthermore, it is emphasized in many chapters of the Standards (1999) that test scores are used to monitor individual student performance as well as to evaluate or compare groups (AERA, APA, & NCME, 1999). While this refers to unconditional group differences (e.g., mean differences on tests), the language could be used to support the notion that pairwise comparisons in general are important. On the other hand, the Standards state that many decisions, especially on high stakes tests, such as pass/fail or admit/reject, are made based on the full population of examinees taking the test (AERA, APA, & NCME, 1999). These statements about fairness suggest that student success is determined by test scores resulting from calibrations that include all examinees, and therefore issues of fairness as a lack of bias should relate to test responses that are compared to a composite group rather than to a single reference group of individuals. Thus, different concerns of a particular type of fairness would call for different methods of group comparisons in a DIF study.

To date, pairwise is the default method used in DIF studies (Liu & Dorans, 2013). This study examines whether or not this choice has an impact on the results of a DIF analysis. If the results are different for the different types of group comparisons, then a more informed decision must be made with respect to the concerns of fairness on a particular test and the choice of pairwise or composite group comparisons.

Advantages to Using a Composite Group Approach in DIF Studies

There are several potential benefits to using composite group approaches over pairwise approaches. First, for example, if it is found that Hispanic students are disadvantaged by some items, this disadvantage may not hold over different gender groups, for example, Hispanic females and Hispanic males (Liu & Dorans, 2013).


However, composite group comparisons, in which each group is compared to the population, can more easily allow for a fine-tuned definition of groups based on more than one grouping variable. For example, one could easily compare Hispanic females to the composite group of all examinees. In a pairwise comparison, one is left wondering what an appropriate reference group is for Hispanic females. Second, some have stated that it is problematic to consistently compare groups in a manner that requires defining one particular group as a reference to which all other groups are to be compared (APA, 2009). For example, choosing White examinees as a reference for all other non-White examinees has an underlying value statement about perspective. Composite group comparisons, by nature, overcome this problem; the reference group always consists of all examinees rather than a chosen group. Third, while examinees receive test scores based on parameters that are calibrated on the composite group of examinees, the pairwise approach compares only group-level estimates, none of which comes from the composite group calibration. Composite group comparisons can overcome this third problem: the item parameters used for operational test development purposes are used as reference parameters to which groups are compared. Fourth, the composite group approach allows for a separate DIF estimate for each group. For example, we can talk about fairness as a lack of bias for females without having to refer to a reference group. This makes it easier for practitioners to determine which groups might have bias problems in their reported scores.


In pairwise comparisons, particularly when there are four or more groups, one has to look through many pairs of results (e.g., Black vs. White, Hispanic vs. Black, Hispanic vs. White, and so on) to determine the nature of group differences. Not only can it be difficult to determine this nature, but one is also left with several results for each group (e.g., there are three DIF effects for Hispanics in the above example). Of course, one can select a reference group to minimize the comparisons, but the sacrifice is that the overall nature of group differences in the item is lost because all differences are relative to a single reference group (e.g., one would not directly compare Hispanics and Asians if the reference group is Blacks). Conversely, when using composite group approaches, a single DIF effect is estimated for each group, and it answers a single question: Is the group different from the overall population? Fifth, as mentioned previously, running multiple DIF tests results in an increased Type 1 error rate (Penfield, 2001). When a variable groups examinees into four or more groups, the number of pairwise comparisons needed to complete a DIF analysis on this variable is greater than the number of composite group comparisons. While composite group comparisons cannot overcome the problem of Type 1 error rate accumulation, they can reduce it relative to pairwise comparisons. The amount of relative reduction in Type 1 error rate increases as the number of groups being compared increases.
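To make this arithmetic concrete, the short R sketch below (illustrative, not from the thesis) counts the number of significance tests each approach requires and the nominal familywise Type 1 error rate under the usual independence approximation, 1 - (1 - alpha)^m; the group counts and alpha = .05 are assumptions for illustration.

```r
# Number of DIF tests and nominal familywise Type 1 error rate for
# pairwise versus composite group comparisons (independence assumed).
alpha  <- 0.05
groups <- 3:6

pairwise_tests  <- choose(groups, 2)  # one test per pair of groups
composite_tests <- groups             # one test per group vs. population

data.frame(
  groups,
  pairwise_tests,
  composite_tests,
  fw_error_pairwise  = 1 - (1 - alpha)^pairwise_tests,
  fw_error_composite = 1 - (1 - alpha)^composite_tests
)
```

With three groups the two approaches require the same number of tests, so the relative reduction appears only once four or more groups are compared, and it grows with the number of groups.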


CHAPTER 3
OBJECTIVES AND RESEARCH QUESTIONS

Researchers and practitioners tend to use pairwise comparisons in DIF analysis without considering other options, but this study will provide researchers and practitioners with detailed information on the effects of their choice of defining groups for DIF analysis. The results will contribute to the field of educational measurement by empirically examining the effect of defining groups on DIF detection, which is a unique contribution to the literature. As a result, researchers and practitioners will better understand how the definition of their groups has an impact on their DIF analysis, as well as gain some empirical evidence for choosing the most appropriate method of defining group comparisons for their DIF studies. Furthermore, although the purpose of DIF detection in instruments is to achieve fairness as a lack of bias, the type of achieved fairness that comes about from pairwise DIF analysis has never been discussed in the literature, nor in the Standards (1999). In this study, the definition of fairness is elaborately discussed in the pairwise and composite group comparison framework. Also, the definitions of fairness as a lack of bias achieved by these two approaches are compared to each other and to the definition of fairness in the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1999). A simulation study is utilized to examine the differences in pairwise and composite group DIF results under different sets of test conditions. In this simulation study, data is generated such that DIF is introduced into one test item on a 60-item test. The data from that test is subsequently analyzed with both pairwise and composite group approaches to the UA DIF detection method.


Differences in true b parameters are introduced across all conditions, and the magnitude and statistical significance of detected DIF are compared across pairwise and composite group approaches under the UA methodology. This study will address the following research questions:

1. Does the number of groups in a DIF analysis differentially impact the ability of pairwise and composite group comparisons to detect DIF?

2. Does the magnitude of true b parameter differences between groups differentially impact the ability of pairwise and composite group comparisons to detect DIF?

3. Does the nature of true b parameter differences between groups (i.e., all groups are different from each other versus a single group is different from all other groups) differentially impact the ability of pairwise and composite group comparisons to detect DIF?


CHAPTER 4
RESEARCH DESIGN AND METHOD

This chapter consists of three subsections: (a) data generation, (b) simulation conditions, and (c) data analysis. The data generation section includes the general descriptions for the test and simulation design. The simulation conditions section includes the factors manipulated in the study. The data analysis section describes the methods that were used to analyze the simulated data.

Data Generation

A 2PL IRT model (Birnbaum, 1968) was used for data generation (see Equation 2-1) in R version 2.15.1 (R Development Core Team, 2012). The item parameters used in this simulation study were based on estimated item parameters from the 1997 ACT Mathematics test that were used in a previous study on obtaining a common scale for item responses using separate versus concurrent estimation in the common item equating design (Hanson & Beguin, 2002). Aligned with Hanson and Beguin (2002), the true item parameters of 60 dichotomous items were used to generate item responses. Examinee ability parameters were randomly sampled from a normal distribution of N(0, 1), and the difficulty parameters were selected from a distribution of N(0.11, 1.11) (based on estimated item parameters of the 1997 ACT test). The discrimination parameters were generated from a random uniform distribution and ranged from min(a) = 0.42 to max(a) = 1.88 (again, based on estimated parameters of the 1997 ACT test). There were 37 unique conditions, and in each condition, 100 replications were performed, which resulted in 3,700 datasets in the study.
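A minimal R sketch of this generation step is given below. It is not the thesis's actual script: the seed and the per-group sample size of 500 are illustrative, and N(0.11, 1.11) is treated as a mean/variance pair, which is an assumption since the notation does not say whether 1.11 is a variance or a standard deviation.

```r
set.seed(2013)  # arbitrary seed, not from the thesis

n_items     <- 60
n_examinees <- 500

theta <- rnorm(n_examinees, mean = 0, sd = 1)          # abilities ~ N(0, 1)
b     <- rnorm(n_items, mean = 0.11, sd = sqrt(1.11))  # difficulties (1.11 read as variance)
a     <- runif(n_items, min = 0.42, max = 1.88)        # discriminations on the ACT range

# 2PL: P(X_ij = 1 | theta_i) = 1 / (1 + exp(-a_j * (theta_i - b_j)))
p <- plogis(sweep(outer(theta, b, "-"), 2, a, "*"))

# Bernoulli draws yield the dichotomous response matrix (examinees x items)
responses <- matrix(rbinom(length(p), 1, p), n_examinees, n_items)
```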


First, the null condition was specified based on the ACT Mathematics test estimated parameters. Then, other data sets were generated for each condition of the study.

Study Design Conditions

Number of Groups

Each data set was generated with respect to 3, 4, or 5 groups that were to be compared in the DIF analysis. A sample size of 500 examinees, which is an adequate sample size for power rates in UA DIF methods (see Kim & Cohen, 1991; Cohen & Kim, 1993; Holland & Wainer, 1993), was created within all subgroups. The manipulation of the number of groups factor addresses research question one.

Magnitude of true b parameter differences

The magnitude of true b parameter differences was manipulated in this study. Each condition of the study had a test item in which either small, moderate, or large true b parameter differences were introduced into the true group level item difficulty parameters. The manipulation of this factor addressed research question two. A lack of invariance in difficulty parameters was the focus of this study because previous research has shown that difficulty parameters (b) have a higher correlation with ability parameters (theta) than do discrimination parameters (a) and pseudo-guessing parameters (c) (Cohen & Kim, 1993). Furthermore, it was found that in real test administrations, statistically significant DIF was usually due to group differences in b parameters (e.g., Smith & Reise, 1998; Morales et al., 2000; Woods, 2008), whereas significant DIF was only sometimes found in a parameters (Morales et al., 2000; Woods, 2008). Also, DIF in b parameters has been stated to be the primary concern in many DIF studies (see Cohen & Kim, 1993; Santelices & Wilson, 2011; Ankenman et al., 1999; Flowers et al., 1999; Fidalgo et al., 2000).


The size of true b parameter differences was closely determined according to the ETS classifications of pairwise DIF effects (Zwick, 1993, 2012). As previously explained, this classification places items into three categories: small DIF (A items), moderate DIF (B items), and large DIF (C items). However, not all researchers follow this guidance when specifying magnitudes of b parameter differences. For example, Shepard, Camilli, and Williams (1985) used a difference of .20 and .35 in the b parameter to manipulate small and moderate group parameter differences, respectively. Moreover, Zilberberg et al. (n.d.) used a difference of .45 and .78 in the b parameters to represent moderate and large group parameter differences, respectively. In this study, a difference of either bF - bR = 0.3, bF - bR = 0.6, or bF - bR = 0.9 was introduced between b parameters to represent small, moderate, and large group parameter differences, respectively. Furthermore, the magnitude of true b parameter differences was manipulated so as to be the maximum amount of b parameter difference between any of the pairs of 3, 4, or 5 groups. Therefore, the magnitude of true b parameter differences served to control the maximum size of group parameter differences in any given condition. As a result, the magnitude of the true b parameter differences is more aptly stated as defining conditions of small or less DIF, moderate or less DIF, or large or less DIF.


This was necessary due to factor 3 in this study, and one can refer to Table A-1 to better understand this definition of magnitude of DIF.

Nature of group differences in b parameters

As illustrated in Figure D-5, when there are more than two groups, it is quite possible that some groups will have similar b parameters while another group is specified by very different b parameters (see Ellis & Kimmel, 1992, and Kim & Cohen, 1995). Otherwise stated, the nature of the group differences in b parameters is not always consistent. Thus, to take this situation into account, two levels of this factor were created: "all groups differ in b parameters" and "one group differs in b parameters." These types of b parameter differences were varied to address research question 3. Table A-1 describes the way that b parameter differences were introduced in each condition. In the "all groups differ in b parameters" conditions, the magnitudes of small, moderate, or large true b parameter differences were spread across all subgroups. In the "one group differs in b parameters" conditions, the small, moderate, or large b parameter difference was added only to the last subgroup, and the remaining subgroups were specified as having the same b parameters.

Data Analysis

The 2PL model (Birnbaum, 1968) (see Equation 2-1) was used for data calibration, and the UA statistic (see Equation 2-4) was used for all DIF analyses under all possible pairwise and composite group defining approaches. This was completed with the difR package (Magis et al., 2013) in R version 2.15 (R Development Core Team, 2012).


This method calculates the unsigned area between the two item characteristic curves for the reference and focal groups with an integral and gives the effect size associated with this area (see Equation 2-4). The magnitude of DIF is then determined based on the effect size. In all pairwise comparisons in this study, the lower coded subgroup within a pair was always selected as the reference group. In all composite group comparisons, the subgroups were always selected as the focal group, and the composite group was selected as the reference group. Each research question in the study focuses on comparing the effect size and statistical significance of DIF detected between the two approaches of group definitions (i.e., pairwise and composite). First, the effect size of detected DIF was averaged over the 100 trials in each condition. Next, the percentage of the 100 trials in each condition that resulted in statistically significant DIF for item 1 was calculated. Both the average effect size for each condition and the percentage of statistically significant DIF for each condition were compared across the pairwise and composite group approaches.
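A minimal sketch of how such an analysis can be set up with difR is shown below. The response matrices are toy placeholders, and the composite branch, which hands separately calibrated parameter matrices to difRaju through its irtParam argument, is one plausible way to implement the composite comparison; the thesis does not spell out its exact difR calls.

```r
library(difR)
set.seed(1)

# Toy stand-ins for two simulated subgroups (placeholders, not study data):
# 500 examinees per group, 60 items, b shifted by 0.6 on item 1 for group 2.
gen2pl <- function(n, a, b) {
  theta <- rnorm(n)
  p <- plogis(sweep(outer(theta, b, "-"), 2, a, "*"))
  matrix(rbinom(length(p), 1, p), n, length(b))
}
a <- runif(60, 0.42, 1.88)
b <- rnorm(60, 0.11, 1)
resp_g1  <- gen2pl(500, a, b)
resp_g2  <- gen2pl(500, a, b + c(0.6, rep(0, 59)))
resp_all <- rbind(resp_g1, resp_g2)

# Pairwise comparison: group 1 (reference, coded 0) vs. group 2 (focal, coded 1)
grp <- rep(c(0, 1), each = 500)
pair_res <- difRaju(Data = resp_all, group = grp, focal.name = 1, model = "2PL")

# Composite comparison: group 2 (focal) vs. the full population (reference).
# Calibrate each separately, then pass the stacked parameter matrix to difRaju;
# same.scale = FALSE asks difRaju to place the calibrations on a common scale.
par_pop  <- itemParEst(resp_all, model = "2PL")  # composite (operational) calibration
par_g2   <- itemParEst(resp_g2,  model = "2PL")
comp_res <- difRaju(irtParam = rbind(par_pop, par_g2), same.scale = FALSE)
```

Note that in the composite comparison the focal examinees are also contained in the reference calibration, which is exactly what distinguishes this definition of the comparison from a pairwise one.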


CHAPTER 5
RESULTS

This chapter includes the results of the simulation design described in Chapter 4 and has two main subsections: the results of all conditions classified as "all groups differ in b parameters" and the results of all conditions classified as "one group differs in b parameters." In each subsection, DIF effect sizes and percentages of statistical significance under the pairwise and composite group defining methods are provided for three, four, and five groups.

Results of All Conditions Classified as All Groups Differ in b Parameters

3 Group Results

Based on the pairwise comparisons for 3 groups under the condition of small true b parameter differences, average effect sizes of UA = 0.19, UA = 0.16, and UA = 0.19 were found in the comparisons of group 1 versus group 2, group 1 versus group 3, and group 2 versus group 3, respectively. Moreover, 4%, 8%, and 6% of trials showed significant DIF effects in these pairwise comparisons. Pairwise comparisons indicated that each group displayed a small amount of DIF relative to each of the other groups. When the composite group approach was used under the same conditions (i.e., small true b parameter differences, 3 groups), average effect sizes of UA = 0.19, UA = 0.09, and UA = 0.19 were found in the comparisons of group 1 versus population, group 2 versus population, and group 3 versus population, respectively. Furthermore, 12%, 1%, and 15% of trials showed significant DIF effects in these composite group comparisons. Composite comparisons indicated that all three groups had small DIF, but one group showed considerably less problematic DIF than the other two groups.


Please see Table B-1 for the detected DIF effects and the percentages of statistical significance, and see the left side of Figure D-6 for the visual representation of these effect sizes.

When the magnitude of moderate true b parameter differences was introduced across the groups, average effect sizes of UA = 0.32, UA = 0.65, and UA = 0.33 were found in the comparisons of group 1 versus group 2, group 1 versus group 3, and group 2 versus group 3, respectively. Furthermore, 2%, 56%, and 0% of trials showed significant DIF effects in these pairwise comparisons. In this condition, pairwise comparisons showed one pair of groups (group 1 compared to group 3) as having more problematic DIF than the other groups. When the composite group comparison approach was used under the same conditions (i.e., moderate true b parameter differences, 3 groups), average effect sizes of UA = 0.33, UA = 0.10, and UA = 0.32 were found in the comparisons of group 1 versus population, group 2 versus population, and group 3 versus population, respectively. Additionally, 88%, 2%, and 81% of trials showed significant DIF effects in these composite group comparisons. Composite group methods indicated groups 1 and 3 as being more problematic with DIF concerns than group 2. This is a similar finding to that of the pairwise approach. Table B-2 provides the effect size values and the percentages of statistical significance, and these DIF effects are shown in the left side of Figure D-6.

Lastly, when large magnitudes of true b parameter differences were spread across the groups, average effect sizes of UA = 0.50, UA = 0.97, and UA = 0.47 were found in the comparisons of group 1 versus group 2, group 1 versus group 3, and group 2 versus group 3, respectively. Additionally, 59%, 95%, and 60% of trials showed significant DIF effects in these pairwise comparisons.


However, when the composite group comparison approach was used as the group defining method under the same conditions (i.e., large true b parameter differences, 3 groups), average effect sizes of UA = 0.48, UA = 0.10, and UA = 0.49 were found in the comparisons of group 1 versus population, group 2 versus population, and group 3 versus population, respectively. Moreover, 99%, 1%, and 100% of trials showed significant DIF effects in these composite comparisons. At this large magnitude, the composite group method clearly showed groups 1 and 3 as having problematic DIF, but the pairwise approach showed some problems with all of the pairs. The left side of Figure D-6 shows the effect sizes, and Table B-3 summarizes these effect sizes and the percentages of statistical significance.

4 Group Results

Based on the pairwise comparisons for 4 groups under the condition of small true b parameter differences, average effect sizes of UA = 0.15, UA = 0.24, UA = 0.34, UA = 0.13, UA = 0.21, and UA = 0.12 were found in the comparisons of group 1 versus group 2, group 1 versus group 3, group 1 versus group 4, group 2 versus group 3, group 2 versus group 4, and group 3 versus group 4, respectively. Moreover, 3%, 1%, 6%, 4%, 2%, and 1% of trials showed significant DIF effects in these pairwise comparisons. Thus, the pairwise approach showed all pairs as having small, negligible amounts of DIF. When the composite group comparison approach was used under the same conditions (i.e., small true b parameter differences, 4 groups), average effect sizes of UA = 0.17, UA = 0.07, UA = 0.08, and UA = 0.16 were found in the comparisons of group 1 versus population, group 2 versus population, group 3 versus population, and group 4 versus population, respectively. Additionally, 2%, 2%, 1%, and 0% of trials showed significant DIF effects in these composite group comparisons. Similar to the pairwise approach, the composite group approach indicated negligible amounts of DIF for all groups.


Please see Table B-4 for the effect sizes and percentages of significance, and see the left side of Figure D-7 for the visual representation of these effect sizes.

Also, when the magnitude of moderate true b parameter differences was introduced across the groups, average effect sizes of UA = 0.21, UA = 0.43, UA = 0.65, UA = 0.23, UA = 0.44, and UA = 0.22 were found in the comparisons of group 1 versus group 2, group 1 versus group 3, group 1 versus group 4, group 2 versus group 3, group 2 versus group 4, and group 3 versus group 4, respectively. Moreover, 0%, 21%, 91%, 4%, 35%, and 3% of trials showed significant DIF effects in these pairwise comparisons. When the composite group comparison approach was used as the group defining method under the same conditions (i.e., moderate true b parameter differences, 4 groups), average effect sizes of UA = 0.32, UA = 0.12, UA = 0.13, and UA = 0.33 were found in the comparisons of group 1 versus population, group 2 versus population, group 3 versus population, and group 4 versus population, respectively. Additionally, 5%, 4%, 2%, and 21% of trials showed significant DIF effects in these composite group comparisons. In this condition, composite group methods showed groups 1 and 4 as having more problematic DIF than the other groups, but pairwise methods showed several problematic pairwise comparisons that involved combinations of groups 1, 3, and 4. Table B-5 summarizes the effect sizes and the percentages of significance associated with these effect sizes, and the left side of Figure D-7 visually represents these effect sizes.


Lastly, when the magnitude of large true b parameter differences was spread among the groups, average effect sizes of UA = 0.33, UA = 0.66, UA = 1.00, UA = 0.32, UA = 0.67, and UA = 0.34 were found in the comparisons of group 1 versus group 2, group 1 versus group 3, group 1 versus group 4, group 2 versus group 3, group 2 versus group 4, and group 3 versus group 4, respectively. Moreover, 0%, 85%, 100%, 4%, 97%, and 2% of trials showed significant DIF effects in these pairwise comparisons. When the composite group approach was used under the same conditions (i.e., large true b parameter differences, 4 groups), average effect sizes of UA = 0.50, UA = 0.19, UA = 0.20, and UA = 0.51 were found in the comparisons of group 1 versus population, group 2 versus population, group 3 versus population, and group 4 versus population, respectively. Furthermore, 69%, 12%, 11%, and 97% of trials showed significant DIF effects in these composite group comparisons. In this condition, composite group comparisons flagged groups 1 and 4, whereas pairwise comparisons flagged three pairwise comparisons that involved groups 1, 2, 3, and 4. Please see Table B-6 for the effect sizes and percentages, and see the left side of Figure D-7 for the visual representation of these effect sizes.

5 Group Results

Based on the pairwise comparisons for 5 groups under the condition of small true b parameter differences, average effect sizes of UA = 0.40, UA = 0.42, UA = 0.47, UA = 0.49, UA = 0.37, UA = 0.38, UA = 0.41, UA = 0.36, UA = 0.38, and UA = 0.37 were found in the comparisons of group 1 versus group 2, group 1 versus group 3, group 1 versus group 4, group 1 versus group 5, group 2 versus group 3, group 2 versus group 4, group 2 versus group 5, group 3 versus group 4, group 3 versus group 5, and group 4 versus group 5, respectively. Moreover, 15%, 22%, 28%, 50%, 12%, 24%, 39%, 13%, 24%, and 16% of trials showed significant DIF effects in these pairwise comparisons.


When the composite group comparison approach was used under the same conditions (i.e., small true b parameter differences, 5 groups), average effect sizes of UA = 0.31, UA = 0.24, UA = 0.22, UA = 0.25, and UA = 0.27 were found in the comparisons of group 1 versus population, group 2 versus population, group 3 versus population, group 4 versus population, and group 5 versus population, respectively. Additionally, 23%, 11%, 5%, 11%, and 31% of trials showed significant DIF effects in these composite group comparisons. Under this condition, pairwise and composite group methods showed similar results, with all groups having small, often negligible, amounts of DIF. Please see Table B-7 for the effect sizes and percentages, and see the left side of Figure D-8 for the visual representation of these effect sizes.

When the magnitude of moderate true b parameter differences was introduced across the groups, average effect sizes of UA = 0.36, UA = 0.41, UA = 0.55, UA = 0.67, UA = 0.35, UA = 0.47, UA = 0.54, UA = 0.38, UA = 0.43, and UA = 0.37 were found in the comparisons of group 1 versus group 2, group 1 versus group 3, group 1 versus group 4, group 1 versus group 5, group 2 versus group 3, group 2 versus group 4, group 2 versus group 5, group 3 versus group 4, group 3 versus group 5, and group 4 versus group 5, respectively. Furthermore, 15%, 45%, 78%, 88%, 17%, 51%, 77%, 26%, 53%, and 15% of trials showed significant DIF effects in these pairwise comparisons. When the composite group comparison approach was used under the same conditions (i.e., moderate true b parameter differences, 5 groups), average effect sizes of UA = 0.35, UA = 0.26, UA = 0.21, UA = 0.27, and UA = 0.36 were found in the comparisons of group 1 versus population, group 2 versus population, group 3 versus population, group 4 versus population, and group 5 versus population, respectively. Additionally, 68%, 21%, 8%, 25%, and 70% of trials showed significant DIF effects in these composite group comparisons.


In this condition, composite methods showed groups 1 and 5 as being more problematic, whereas pairwise methods flagged multiple pairs, in which each of the five groups was present in at least one of the pairs. Please see Table B-8 for the effect sizes and percentages, and see the left side of Figure D-8 for the visual representation of these effect sizes.

Lastly, when the magnitude of large true b parameter differences was spread among the groups, average effect sizes of UA = 0.43, UA = 0.56, UA = 0.81, UA = 0.99, UA = 0.43, UA = 0.61, UA = 0.78, UA = 0.46, UA = 0.57, and UA = 0.42 were found in the comparisons of group 1 versus group 2, group 1 versus group 3, group 1 versus group 4, group 1 versus group 5, group 2 versus group 3, group 2 versus group 4, group 2 versus group 5, group 3 versus group 4, group 3 versus group 5, and group 4 versus group 5, respectively. Also, 27%, 73%, 94%, 100%, 36%, 81%, 98%, 33%, 73%, and 29% of trials showed significant DIF effects in these pairwise comparisons. When the composite group comparison approach was used under the same conditions (i.e., large true b parameter differences, 5 groups), average effect sizes of UA = 0.51, UA = 0.33, UA = 0.22, UA = 0.35, and UA = 0.50 were found in the comparisons of group 1 versus population, group 2 versus population, group 3 versus population, group 4 versus population, and group 5 versus population, respectively. Additionally, 96%, 47%, 8%, 55%, and 92% of trials showed significant DIF effects in these composite group comparisons. In this condition, composite methods showed groups 1 and 5 as having DIF concerns, whereas pairwise comparisons showed all pairwise comparisons as having DIF concerns. Table B-9 summarizes the effect sizes and the percentages of statistical significance, and the left side of Figure D-8 shows the visual representation of these DIF effects.


Results of All Conditions Classified as One Group Differs in b Parameters

3 Group Results

In the comparisons of group 1 versus group 2, group 1 versus group 3, and group 2 versus group 3 under the condition of small true b parameter differences, the pairwise comparisons resulted in average effect sizes of UA = 0.10, UA = 0.27, and UA = 0.32, respectively. Moreover, 2%, 0%, and 2% of trials showed significant DIF effects in these pairwise comparisons. When the composite group approach was used under the same conditions (i.e., small true b parameter differences, 3 groups), average effect sizes of UA = 0.12, UA = 0.13, and UA = 0.20 were found in the comparisons of group 1 versus population, group 2 versus population, and group 3 versus population, respectively. Zero percent of trials showed significant DIF effects in all composite group comparisons. Recall that in these conditions, only the last group's b parameter was specified as differing from the b parameters of the other groups, which were invariant to each other. In this particular condition of small true b parameter differences, both pairwise and composite methods indicated no meaningful amount of DIF for any of the groups. Both the detected effect size values and the percentages of statistical significance are presented in Table B-10, and these DIF effects are visually presented in the right side of Figure D-6.

In the condition with a moderate magnitude of true b parameter differences introduced to the last group of three, average effect sizes of UA = 0.10, UA = 0.61, and UA = 0.60 were found in the comparisons of group 1 versus group 2, group 1 versus group 3, and group 2 versus group 3, respectively. Furthermore, 2%, 8%, and 9% of trials showed significant DIF effects in these pairwise comparisons.


In the composite group comparisons of group 1 versus population, group 2 versus population, and group 3 versus population, average effect sizes of UA = 0.22, UA = 0.21, and UA = 0.40 were found, respectively. Additionally, 1%, 0%, and 0% of trials showed significant DIF effects in these composite group comparisons. The methods converged in that composite comparisons flagged that group 3 had a larger area between its ICC and the composite group ICC, while pairwise methods indicated that group 3 had problematic DIF in relation to the other two groups. Please see Table B-11 for the effect sizes and percentages of significance, and see the right side of Figure D-6 for the visual representation of these DIF effects.

Lastly, in the case of a large magnitude of true b parameter differences added to the last group, average effect sizes of UA = 0.11, UA = 0.91, and UA = 0.89 were found in the comparisons of group 1 versus group 2, group 1 versus group 3, and group 2 versus group 3, respectively. Additionally, 1%, 30%, and 30% of trials showed significant DIF effects in these pairwise comparisons. Average effect sizes of UA = 0.31, UA = 0.30, and UA = 0.61 were found in the comparisons of group 1 versus population, group 2 versus population, and group 3 versus population, respectively. Moreover, 2%, 2%, and 1% of trials showed significant DIF effects in these composite comparisons. Both pairwise and composite comparisons were able to flag that group 3 had problematic DIF, with the pairwise approach indicating this by relating group 3 to the other two groups. Table B-12 summarizes the effect sizes and the percentages of statistical significance, and the right side of Figure D-6 shows these DIF effects.

4 Group Results

When the pairwise comparisons were used under the condition of small true b parameter differences, average effect sizes of UA = 0.17, UA = 0.17, UA = 0.32, UA = 0.17, UA = 0.35, and UA = 0.33 were found in the comparisons of group 1 versus group 2, group 1 versus group 3, group 1 versus group 4, group 2 versus group 3, group 2 versus group 4, and group 3 versus group 4, respectively.


Moreover, 0%, 1%, 3%, 1%, 3%, and 3% of trials showed significant DIF effects in these pairwise comparisons. These pairwise results showed negligible amounts of DIF across the groups. When the composite group comparison approach was used as the group defining method, average effect sizes of UA = 0.11, UA = 0.12, UA = 0.12, and UA = 0.22 were found in the comparisons of group 1 versus population, group 2 versus population, group 3 versus population, and group 4 versus population, respectively. Additionally, 3%, 1%, 2%, and 0% of trials showed significant DIF effects in these composite group comparisons. In this condition with small true b parameter differences, both composite and pairwise methods showed negligible amounts of DIF across the groups. Please see Table B-13 for the effect sizes and percentages of significance, and see the right side of Figure D-7 for the visual representation of these DIF effects.

Also, when the magnitude of moderate true b parameter differences was introduced to the last group, average effect sizes of UA = 0.16, UA = 0.18, UA = 0.65, UA = 0.17, UA = 0.64, and UA = 0.65 were found in the comparisons of group 1 versus group 2, group 1 versus group 3, group 1 versus group 4, group 2 versus group 3, group 2 versus group 4, and group 3 versus group 4, respectively. Moreover, 0%, 1%, 9%, 1%, 5%, and 5% of trials showed significant DIF effects in these pairwise comparisons. Pairwise comparisons involving group 4 had larger effect sizes than the other pairwise comparisons, but statistical significance showed few flags of this group 4 problem. When composite methods were used, group 4 had a larger effect size than all other groups, and more data sets showed a statistically significant effect for this group.


Specifically, average effect sizes of UA = 0.19, UA = 0.19, UA = 0.18, and UA = 0.51 were found in the comparisons of group 1 versus population, group 2 versus population, group 3 versus population, and group 4 versus population, respectively. Additionally, 5%, 6%, 7%, and 20% of trials showed significant DIF effects in these composite group comparisons. In this condition, composite comparisons flagged that group 4 had a larger area between its ICC and the composite group ICC. Table B-14 summarizes the effect sizes and the percentages of statistical significance, and the right side of Figure D-7 visually shows these DIF effects.

Lastly, when the magnitude of large true b parameter differences was added to the last group, average effect sizes of UA = 0.17, UA = 0.18, UA = 1.01, UA = 0.16, UA = 0.98, and UA = 0.99 were found in the comparisons of group 1 versus group 2, group 1 versus group 3, group 1 versus group 4, group 2 versus group 3, group 2 versus group 4, and group 3 versus group 4, respectively. Moreover, 1%, 1%, 67%, 0%, 51%, and 58% of trials showed significant DIF effects in these pairwise comparisons. When each group was compared to the population, average effect sizes of UA = 0.27, UA = 0.26, UA = 0.26, and UA = 0.78 were found in the comparisons of group 1 versus population, group 2 versus population, group 3 versus population, and group 4 versus population, respectively. Furthermore, 11%, 12%, 16%, and 98% of trials showed significant DIF effects in these composite group comparisons. In this condition, when group 4 was compared to the other groups, pairwise comparisons showed high DIF effects, and the composite group method clearly showed only group 4 as having problematic DIF. Please see Table B-15 for these effect sizes and the percentages of statistical significance, and see the right side of Figure D-7 for the visual representation of these DIF effects.


5 Group Results

When the pairwise comparisons were used under the condition of small true b parameter differences, average effect sizes of UA = 0.10, UA = 0.09, UA = 0.09, UA = 0.33, UA = 0.09, UA = 0.09, UA = 0.33, UA = 0.09, UA = 0.33, and UA = 0.32 were found in the comparisons of group 1 versus group 2, group 1 versus group 3, group 1 versus group 4, group 1 versus group 5, group 2 versus group 3, group 2 versus group 4, group 2 versus group 5, group 3 versus group 4, group 3 versus group 5, and group 4 versus group 5, respectively. Moreover, 3%, 4%, 1%, 11%, 1%, 1%, 13%, 3%, 12%, and 8% of trials showed significant DIF effects in these pairwise comparisons. When each group was compared to the population, average effect sizes of UA = 0.09, UA = 0.09, UA = 0.08, UA = 0.08, and UA = 0.26 were found in the comparisons of group 1 versus population, group 2 versus population, group 3 versus population, group 4 versus population, and group 5 versus population, respectively. Additionally, 1%, 1%, 1%, 0%, and 6% of trials showed significant DIF effects in these composite group comparisons. In this condition, when group 5 was compared to the other groups, which resulted in four pairwise comparisons, the pairwise comparisons detected high DIF effects. Similarly, composite group comparisons showed only group 5 as having problematic DIF. Table B-16 summarizes all effect sizes and the percentages of statistical significance for both methods, and the right side of Figure D-8 visually shows these DIF effects.

Also, when the magnitude of moderate true b parameter differences was added to the last group, average effect sizes of UA = 0.10, UA = 0.10, UA = 0.09, UA = 0.66, UA = 0.10, UA = 0.10, UA = 0.65, UA = 0.09, UA = 0.65, and UA = 0.66 were found in the comparisons of group 1 versus group 2, group 1 versus group 3, group 1 versus group 4, group 1 versus group 5, group 2 versus group 3, group 2 versus group 4, group 2 versus group 5, group 3 versus group 4, group 3 versus group 5, and group 4 versus group 5, respectively.


Furthermore, 8%, 7%, 3%, 86%, 8%, 5%, 92%, 5%, 87%, and 92% of trials showed significant DIF effects in these pairwise comparisons. When each group was compared to the population, average effect sizes of UA = 0.14, UA = 0.15, UA = 0.14, UA = 0.15, and UA = 0.52 were found in the comparisons of group 1 versus population, group 2 versus population, group 3 versus population, group 4 versus population, and group 5 versus population, respectively. Additionally, 5%, 6%, 3%, 1%, and 85% of trials showed significant DIF effects in these composite group comparisons. Again, in this condition, both pairwise and composite group comparisons isolated group 5 as having DIF concerns. Please see Table B-17 for the effect sizes and the percentages of statistical significance, and see the right side of Figure D-8 for the visual representation of these DIF effects.

Lastly, when the magnitude of large true b parameter differences was added to the last group, average effect sizes of UA = 0.09, UA = 0.09, UA = 0.09, UA = 0.99, UA = 0.09, UA = 0.09, UA = 1.00, UA = 0.09, UA = 0.99, and UA = 0.98 were found in the comparisons of group 1 versus group 2, group 1 versus group 3, group 1 versus group 4, group 1 versus group 5, group 2 versus group 3, group 2 versus group 4, group 2 versus group 5, group 3 versus group 4, group 3 versus group 5, and group 4 versus group 5, respectively. Also, 4%, 2%, 3%, 100%, 1%, 7%, 100%, 5%, 100%, and 100% of trials showed significant DIF effects in these pairwise comparisons.


On the other hand, when each group was compared to the population, average effect sizes of UA = 0.21, UA = 0.22, UA = 0.21, UA = 0.21, and UA = 0.80 were found in the comparisons of group 1 versus population, group 2 versus population, group 3 versus population, group 4 versus population, and group 5 versus population, respectively. Additionally, 10%, 14%, 9%, 11%, and 99% of trials showed significant DIF effects in these composite group comparisons. In this condition, both methods isolated problematic DIF concerns to the group 5 comparisons. Please see Table B-18 and Figure D-8 for these effect sizes and the percentages of statistical significance.


CHAPTER 6
CONCLUSIONS AND DISCUSSIONS

The main purpose of the study was to examine the impact of two methods of defining group comparisons in DIF detection: pairwise and composite group comparisons. A simulation study was conducted to answer the three research questions. Research question 1 asked whether the number of groups affects the detection of DIF differentially for pairwise versus composite group approaches. This factor was manipulated because the two-group case (i.e., a base and a second group) is almost always used in DIF studies; however, multiple group comparisons are frequently desirable (Penfield, 2001). Research question 2 asked whether the magnitude of true b parameter differences differentially affected DIF detection under composite versus pairwise approaches. Lastly, research question 3 asked whether the nature of the b parameter differences (i.e., all groups are different from each other versus a single group is different from all other groups that are the same) has a differential impact on the ability of pairwise versus composite group comparisons to detect DIF. This factor was manipulated because, in many circumstances, DIF does not spread systematically among the groups (see Ellis & Kimmel, 1992; Kim et al., 1995; Penfield, 2001).

The results showed that pairwise and composite method results differed across the nature of true b parameter differences factor. When one group was different in true b parameters by a moderate or large magnitude, both methods flagged the group of concern, just in different ways (e.g., see Table B-17). Under pairwise methods, every pair that included the group of concern was flagged, so a practitioner would easily be able to interpret that the last group had problematic concerns for fairness.


Under the composite group methods, the group of concern was flagged through its group-specific DIF effect. So even though the DIF estimates from the methods are different, because they are comparing different groups, a researcher or practitioner would draw the same conclusions regardless of the choice between pairwise and composite.

However, this was not the case in conditions in which the true b parameter differences were spread amongst the groups. Differences between the results of pairwise and composite methods were found across many conditions that had all groups differing in their true b parameters. For example, when there were small true b parameter differences, both pairwise and composite methods led to similar interpretations (i.e., that there is no DIF). The effect sizes and percentages of statistical significance do have some differences between the two methods, with the most consistent difference being that effect sizes are smaller for composite comparisons. However, the interpretation from a practitioner's standpoint would be similar across these two methods. When the differences in true b parameters were moderate or large, different conclusions were often drawn between the pairwise and composite methods. For example, Table B-3 shows that when the true b parameter differences were large across three groups, pairwise methods indicated some problems for all of the pairs, which would lead a practitioner to interpret that all groups have some concerns. If one was concerned with group-to-group invariance, this would be an appropriate interpretation of DIF. However, Table B-3 also shows that when true b parameter differences were large across three groups, composite methods indicated that group 1 and group 3 had DIF concerns, but not group 2.


The interpretation is different as compared to pairwise, but it is more appropriate if one is concerned with individual groups being invariant to the operational item parameters. This is just one example of when moderate to large b parameter differences resulted in different interpretations between pairwise and composite approaches. In these cases, it is critical to make an informed decision about which method to use, because the results of the DIF study will differ. More importantly, researchers do not know the nature of true b parameter group differences in their observed data. It is quite possible that there are some moderate to large parameter differences that are spread across more than two groups on a few items. Therefore, it is always important to make an informed choice between pairwise and composite methods.

This study also showed that one reason to consider using composite approaches is the ease of interpretation. For example, Table B-6 shows that pairwise comparisons flagged groups 1, 2, 3, and 4 in various pairwise sets. However, if one looks at the pairs without problems, these also include groups 1, 2, 3, and 4. In other words, all four groups are invariant to some groups and not invariant to others. So how does one decide where the problem lies? Which groups are being disadvantaged? A researcher could reduce the number of pairwise comparisons by using a single reference group, but the reference group he or she chooses will determine whether other groups are considered to be advantaged or disadvantaged. In such cases, researchers will only be able to interpret DIF between the reference group and each focal group, and not among the other groups that are selected as focal groups. If that is a concern, it is better to use the composite comparison approach, which will result in a smaller number of comparisons.


Composite comparisons make the group and direction of advantage very clear. In Table B-6, it is shown that group 1 and group 4 are not invariant to the operational item parameters, while groups 2 and 3 are invariant to these parameters. Looking at the item parameters indicates that group 1 is advantaged (i.e., a smaller b estimate than the composite b estimate) and group 4 is disadvantaged (i.e., a larger b estimate than the composite b estimate). When there are many groups, this can be an advantage of composite group comparisons. As mentioned before, when only one group was different in true b parameters, both comparison methods flagged the differentiated group as having more DIF concerns than the other groups (see Tables B-10 through B-18). However, pairwise methods achieved this with many more comparisons than composite group methods. Therefore, the composite group comparisons are less time consuming and have a lower familywise Type 1 error rate.

If this study had found that pairwise and composite methods provided the same interpretations about DIF in all conditions, then it would not be very important for practitioners to carefully choose between pairwise and composite. But this study showed that the methods do not always provide the same results. Therefore, it is important for practitioners to determine what one is trying to measure and how one defines fairness as a lack of bias on their assessment. If the practitioner wants to know whether two groups are invariant to each other, then they should use pairwise methods. If the practitioner is interested in whether or not the operational item parameters are invariant to individual groups, they should use the composite group approach.

There are a variety of methods that can be adapted to both pairwise and composite approaches. For example, this study demonstrated a composite group approach with the UA method. Also, Lord's chi-square is easily adapted to composite group approaches (Ellis & Kimmel, 1992).


Therefore, methods are currently available for the less commonly used composite group approach, so there are no methodological barriers to using this approach. More importantly, using the composite group approach in appropriate situations can be theoretically and practically justified. This approach is arguably aligned with the Standards definition of fairness as a lack of bias (which does not necessarily preclude an argument that both pairwise and composite group approaches are aligned with this definition of fairness). Again, the Standards (AERA, APA, & NCME, 1999) state that bias arises "when deficiencies in a test itself or the manner in which it is used result in different meanings for scores earned by members of different identifiable subgroups." This can be read as all groups having the same score meaning relative to the overall population, which aligns with the definition of fairness implied by the composite group approach. To be sure, some analysis and assessment conditions can easily justify pairwise approaches (e.g., one can answer research questions about direct group comparisons, and groups being compared are independent, allowing the use of standard contingency table approaches to DIF analysis). This study does not argue against pairwise comparisons, as the composite and pairwise methods are different rather than correct/incorrect. Rather, this study simply makes the case that the two methods can provide different interpretations of DIF results, that practitioners can decide which to use based on definitions of fairness, and that the uncommon approach of composite group comparisons has many advantages and deserves more consideration in future DIF studies.


CHAPTER 7
LIMITATIONS AND FURTHER RESEARCH

This study has some limitations. The first limitation is that simulated data, rather than operational test data, was used to detect DIF. Also, the 2PL model was used to generate and calibrate data in this study, but future research could replicate this study using other models, such as the 3PL model, which additionally employs a pseudo-guessing (c) item parameter. Moreover, DIF in this study was introduced by changing the item difficulty parameters of only one item, which is not realistic; oftentimes, more than one item in a test will display some DIF concerns. The study only examined DIF in b parameters. However, future studies should examine how defining groups for DIF studies has an impact on the detection of DIF in a parameters. Moreover, this study examined the percentage of iterations within conditions that flagged statistically significant DIF. However, similar to studies looking at Type 1 error rates, it would most likely be more appropriate to do this with 1,000 iterations rather than 100. Furthermore, only the UA method was utilized for DIF detection. Other detection methods should be used to test these conditions and to see the impact of the group defining method on DIF effect magnitude and interpretation. Some researchers strongly believe that the Mantel-Haenszel procedure is the most powerful method in DIF detection studies (Langer et al., 2008). However, the Mantel-Haenszel method assumes independent groups, which precludes its use in composite group comparisons. Other methods that are not readily classified as pairwise or composite, such as the likelihood ratio test, could also be investigated; such methods can sidestep composite/pairwise concerns because they have their own unique ways of performing group comparisons.


Lastly, it is strongly recommended that this study be replicated with different group ratios instead of equal group sample sizes, because in real test situations, unequal sample sizes are a frequent occurrence. It was found that when the group size ratio (reference/focal) is different from one (i.e., unequal group sizes), the Type 1 error rate is negatively affected (Awuor, 2008). Gierl et al. (2001) suggested that the difference in sample size for reference and focal groups should be controlled because, as it increases, power decreases and it becomes difficult to detect items that function differentially. Moreover, Wyse and Mapuranga (2009) showed that when the difference in sample size is very large, DIF detection studies produce low power even when the groups have adequate sample sizes. Thus, we believe that having unequal sample sizes for the reference and focal groups will affect detected effect sizes in both pairwise and composite group comparisons. However, it is important to note that, by nature, it is impossible to have equal group sizes in composite group comparisons, which calls for future research in this area. It should also be mentioned that when doing composite group comparisons, weighting groups within some DIF methods may be necessary, as is seen in many considerations of detecting equating invariance.


APPENDIX A
THE TRUE b PARAMETER DIFFERENCES

Table A-1. True item difficulty parameters across the groups

All groups differ in b parameters:
                 Small             Moderate          Large
3 Groups   G1    b1 = b* - .15     b1 = b* - .3      b1 = b* - .45
           G2    b2 = b*           b2 = b*           b2 = b*
           G3    b3 = b* + .15     b3 = b* + .3      b3 = b* + .45
4 Groups   G1    b1 = b* - .15     b1 = b* - .3      b1 = b* - .45
           G2    b2 = b* - .05     b2 = b* - .1      b2 = b* - .15
           G3    b3 = b* + .05     b3 = b* + .1      b3 = b* + .15
           G4    b4 = b* + .15     b4 = b* + .3      b4 = b* + .45
5 Groups   G1    b1 = b* - .15     b1 = b* - .3      b1 = b* - .45
           G2    b2 = b* - .075    b2 = b* - .15     b2 = b* - .225
           G3    b3 = b*           b3 = b*           b3 = b*
           G4    b4 = b* + .075    b4 = b* + .15     b4 = b* + .225
           G5    b5 = b* + .15     b5 = b* + .3      b5 = b* + .45

One group differs in b parameters:
                 Small             Moderate          Large
3 Groups   G1    b1 = b*           b1 = b*           b1 = b*
           G2    b2 = b*           b2 = b*           b2 = b*
           G3    b3 = b* + .3      b3 = b* + .6      b3 = b* + .9
4 Groups   G1    b1 = b*           b1 = b*           b1 = b*
           G2    b2 = b*           b2 = b*           b2 = b*
           G3    b3 = b*           b3 = b*           b3 = b*
           G4    b4 = b* + .3      b4 = b* + .6      b4 = b* + .9
5 Groups   G1    b1 = b*           b1 = b*           b1 = b*
           G2    b2 = b*           b2 = b*           b2 = b*
           G3    b3 = b*           b3 = b*           b3 = b*
           G4    b4 = b*           b4 = b*           b4 = b*
           G5    b5 = b* + .3      b5 = b* + .6      b5 = b* + .9

b* = the true item difficulty parameter that was sampled for the particular condition.
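For readers who prefer code to tables, the following R sketch (a hypothetical helper, not from the thesis) reproduces the offsets in Table A-1 from the two design factors: the magnitude (0.3, 0.6, or 0.9) is the maximum pairwise difference, either spread evenly across all groups or assigned entirely to the last group.

```r
# true_b() is a hypothetical helper illustrating Table A-1's construction.
true_b <- function(b_star, n_groups, magnitude, nature = c("spread", "one")) {
  nature <- match.arg(nature)
  if (nature == "spread") {
    # "All groups differ": equally spaced offsets whose range equals `magnitude`
    b_star + seq(-magnitude / 2, magnitude / 2, length.out = n_groups)
  } else {
    # "One group differs": only the last group is shifted, by `magnitude`
    c(rep(b_star, n_groups - 1), b_star + magnitude)
  }
}

true_b(0, 4, 0.3, "spread")  # -0.15 -0.05 0.05 0.15 (4 groups, small, all differ)
true_b(0, 5, 0.9, "one")     #  0 0 0 0 0.9          (5 groups, large, one differs)
```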


APPENDIX B
SIMULATION RESULTS

Table B-1. Effect size and percentage of statistically significant results: 3 groups, small true b DIF, and all groups differ

Comparison (Pairwise or Composite)   Effect Size   Percentage of Statistical Significance
Group 1 vs Group 2                   0.19          4
Group 1 vs Group 3                   0.19          8
Group 2 vs Group 3                   0.19          6
Group 1 vs Population                0.19          12
Group 2 vs Population                0.09          1
Group 3 vs Population                0.19          15


Table B-2. Effect size and percentage of statistically significant results: 3 groups, moderate true b DIF, and all groups differ

Comparison (Pairwise or Composite)   Effect Size   Percentage of Statistical Significance
Group 1 vs Group 2                   0.32          2
Group 1 vs Group 3                   0.65          56
Group 2 vs Group 3                   0.33          0
Group 1 vs Population                0.33          88
Group 2 vs Population                0.10          2
Group 3 vs Population                0.32          81


Table B-3. Effect size and percentage of statistically significant results: 3 groups, large true b DIF, and all groups differ

Comparison (Pairwise or Composite)   Effect Size   Percentage of Statistical Significance
Group 1 vs Group 2                   0.50          59
Group 1 vs Group 3                   0.97          95
Group 2 vs Group 3                   0.47          60
Group 1 vs Population                0.48          99
Group 2 vs Population                0.10          1
Group 3 vs Population                0.49          100


Table B-4. Effect size and percentage of statistically significant results: 4 groups, small true b DIF, and all groups differ

Comparison (Pairwise or Composite)   Effect Size   Percentage of Statistical Significance
Group 1 vs Group 2                   0.15          3
Group 1 vs Group 3                   0.24          1
Group 1 vs Group 4                   0.34          6
Group 2 vs Group 3                   0.13          4
Group 2 vs Group 4                   0.21          2
Group 3 vs Group 4                   0.12          1
Group 1 vs Population                0.17          2
Group 2 vs Population                0.07          2
Group 3 vs Population                0.08          1
Group 4 vs Population                0.16          0


Table B-5. Effect size and percentage of statistically significant results: 4 groups, moderate true b DIF, and all groups differ

Comparison (Pairwise or Composite)   Effect Size   Percentage of Statistical Significance
Group 1 vs Group 2                   0.21          0
Group 1 vs Group 3                   0.43          21
Group 1 vs Group 4                   0.65          91
Group 2 vs Group 3                   0.23          4
Group 2 vs Group 4                   0.44          35
Group 3 vs Group 4                   0.22          3
Group 1 vs Population                0.32          5
Group 2 vs Population                0.12          4
Group 3 vs Population                0.13          2
Group 4 vs Population                0.33          21


Table B-6. Effect size and percentage of statistically significant results: 4 groups, large true b DIF, and all groups differ

Comparison (Pairwise or Composite)   Effect Size   Percentage of Statistical Significance
Group 1 vs Group 2                   0.33          0
Group 1 vs Group 3                   0.66          85
Group 1 vs Group 4                   1.00          100
Group 2 vs Group 3                   0.32          4
Group 2 vs Group 4                   0.67          97
Group 3 vs Group 4                   0.34          2
Group 1 vs Population                0.50          69
Group 2 vs Population                0.19          12
Group 3 vs Population                0.20          11
Group 4 vs Population                0.51          97


Table B-7. Effect size and percentage of statistically significant results: 5 groups, small true b DIF, and all groups differ

Comparison (Pairwise or Composite)   Effect Size   Percentage of Statistical Significance
Group 1 vs Group 2                   0.40          15
Group 1 vs Group 3                   0.42          22
Group 1 vs Group 4                   0.47          28
Group 1 vs Group 5                   0.49          50
Group 2 vs Group 3                   0.37          12
Group 2 vs Group 4                   0.38          24
Group 2 vs Group 5                   0.41          39
Group 3 vs Group 4                   0.36          13
Group 3 vs Group 5                   0.38          24
Group 4 vs Group 5                   0.37          16
Group 1 vs Population                0.31          23
Group 2 vs Population                0.24          11
Group 3 vs Population                0.22          5
Group 4 vs Population                0.25          11
Group 5 vs Population                0.27          31


Table B-8. Effect size and percentage of statistically significant results: 5 groups, moderate true b DIF, and all groups differ

Comparison (Pairwise or Composite)   Effect Size   Percentage of Statistical Significance
Group 1 vs Group 2                   0.36          15
Group 1 vs Group 3                   0.41          45
Group 1 vs Group 4                   0.55          78
Group 1 vs Group 5                   0.67          88
Group 2 vs Group 3                   0.35          17
Group 2 vs Group 4                   0.47          51
Group 2 vs Group 5                   0.54          77
Group 3 vs Group 4                   0.38          26
Group 3 vs Group 5                   0.43          53
Group 4 vs Group 5                   0.37          15
Group 1 vs Population                0.35          68
Group 2 vs Population                0.26          21
Group 3 vs Population                0.21          8
Group 4 vs Population                0.27          25
Group 5 vs Population                0.36          70


Table B-9. Effect size and percentage of statistically significant results: 5 groups, large true b DIF, and all groups differ

Comparison (Pairwise or Composite)   Effect Size   Percentage of Statistical Significance
Group 1 vs Group 2                   0.43          27
Group 1 vs Group 3                   0.56          73
Group 1 vs Group 4                   0.81          94
Group 1 vs Group 5                   0.99          100
Group 2 vs Group 3                   0.43          36
Group 2 vs Group 4                   0.61          81
Group 2 vs Group 5                   0.78          98
Group 3 vs Group 4                   0.46          33
Group 3 vs Group 5                   0.57          73
Group 4 vs Group 5                   0.42          29
Group 1 vs Population                0.51          96
Group 2 vs Population                0.33          47
Group 3 vs Population                0.22          8
Group 4 vs Population                0.35          55
Group 5 vs Population                0.50          92


Table B-10. Effect size and percentage of statistically significant results: 3 groups, small true b DIF, and only one group differs

Comparison (Pairwise or Composite)   Effect Size   Percentage of Statistical Significance
Group 1 vs Group 2                   0.10          2
Group 1 vs Group 3                   0.27          0
Group 2 vs Group 3                   0.32          2
Group 1 vs Population                0.12          0
Group 2 vs Population                0.13          0
Group 3 vs Population                0.20          0


Table B-11. Effect size and percentage of statistically significant results: 3 groups, moderate true b DIF, and only one group differs

Comparison (Pairwise or Composite)   Effect Size   Percentage of Statistical Significance
Group 1 vs Group 2                   0.10          2
Group 1 vs Group 3                   0.61          8
Group 2 vs Group 3                   0.60          9
Group 1 vs Population                0.22          1
Group 2 vs Population                0.21          0
Group 3 vs Population                0.40          0


Table B-12. Effect size and percentage of statistically significant results: 3 groups, large true b DIF, and only one group differs

Comparison (Pairwise or Composite)   Effect Size   Percentage of Statistical Significance
Group 1 vs Group 2                   0.11          1
Group 1 vs Group 3                   0.91          30
Group 2 vs Group 3                   0.89          30
Group 1 vs Population                0.31          2
Group 2 vs Population                0.30          2
Group 3 vs Population                0.61          1


Table B-13. Effect size and percentage of statistically significant results: 4 groups, small true b-DIF, and only one group differs

Comparison (Pairwise or Composite)    Effect Size    Percentage Statistically Significant
Group 1 vs Group 2                        0.17              0
Group 1 vs Group 3                        0.17              1
Group 1 vs Group 4                        0.32              3
Group 2 vs Group 3                        0.17              1
Group 2 vs Group 4                        0.35              3
Group 3 vs Group 4                        0.33              3
Group 1 vs Population                     0.11              3
Group 2 vs Population                     0.12              1
Group 3 vs Population                     0.12              2
Group 4 vs Population                     0.22              0


Table B-14. Effect size and percentage of statistically significant results: 4 groups, moderate true b-DIF, and only one group differs

Comparison (Pairwise or Composite)    Effect Size    Percentage Statistically Significant
Group 1 vs Group 2                        0.16              0
Group 1 vs Group 3                        0.18              1
Group 1 vs Group 4                        0.65              9
Group 2 vs Group 3                        0.17              1
Group 2 vs Group 4                        0.64              5
Group 3 vs Group 4                        0.65              5
Group 1 vs Population                     0.19              5
Group 2 vs Population                     0.19              6
Group 3 vs Population                     0.18              7
Group 4 vs Population                     0.51             20


Table B-15. Effect size and percentage of statistically significant results: 4 groups, large true b-DIF, and only one group differs

Comparison (Pairwise or Composite)    Effect Size    Percentage Statistically Significant
Group 1 vs Group 2                        0.17              1
Group 1 vs Group 3                        0.18              1
Group 1 vs Group 4                        1.01             67
Group 2 vs Group 3                        0.16              0
Group 2 vs Group 4                        0.98             51
Group 3 vs Group 4                        0.99             58
Group 1 vs Population                     0.27             11
Group 2 vs Population                     0.26             12
Group 3 vs Population                     0.26             16
Group 4 vs Population                     0.78             98


Table B-16. Effect size and percentage of statistically significant results: 5 groups, small true b-DIF, and only one group differs

Comparison (Pairwise or Composite)    Effect Size    Percentage Statistically Significant
Group 1 vs Group 2                        0.10              3
Group 1 vs Group 3                        0.09              4
Group 1 vs Group 4                        0.09              1
Group 1 vs Group 5                        0.33             11
Group 2 vs Group 3                        0.09              1
Group 2 vs Group 4                        0.09              1
Group 2 vs Group 5                        0.33             13
Group 3 vs Group 4                        0.09              3
Group 3 vs Group 5                        0.33             12
Group 4 vs Group 5                        0.32              8
Group 1 vs Population                     0.09              1
Group 2 vs Population                     0.09              1
Group 3 vs Population                     0.08              1
Group 4 vs Population                     0.08              0
Group 5 vs Population                     0.26              6


Table B-17. Effect size and percentage of statistically significant results: 5 groups, moderate true b-DIF, and only one group differs

Comparison (Pairwise or Composite)    Effect Size    Percentage Statistically Significant
Group 1 vs Group 2                        0.10              8
Group 1 vs Group 3                        0.10              7
Group 1 vs Group 4                        0.09              3
Group 1 vs Group 5                        0.66             86
Group 2 vs Group 3                        0.10              8
Group 2 vs Group 4                        0.10              5
Group 2 vs Group 5                        0.65             92
Group 3 vs Group 4                        0.09              5
Group 3 vs Group 5                        0.65             87
Group 4 vs Group 5                        0.66             92
Group 1 vs Population                     0.14              5
Group 2 vs Population                     0.15              6
Group 3 vs Population                     0.14              3
Group 4 vs Population                     0.15              1
Group 5 vs Population                     0.52             85


Table B-18. Effect size and percentage of statistically significant results: 5 groups, large true b-DIF, and only one group differs

Comparison (Pairwise or Composite)    Effect Size    Percentage Statistically Significant
Group 1 vs Group 2                        0.09              4
Group 1 vs Group 3                        0.09              2
Group 1 vs Group 4                        0.09              3
Group 1 vs Group 5                        0.99            100
Group 2 vs Group 3                        0.09              1
Group 2 vs Group 4                        0.09              7
Group 2 vs Group 5                        1.00            100
Group 3 vs Group 4                        0.09              5
Group 3 vs Group 5                        0.99            100
Group 4 vs Group 5                        0.98            100
Group 1 vs Population                     0.21             10
Group 2 vs Population                     0.22             14
Group 3 vs Population                     0.21              9
Group 4 vs Population                     0.21             11
Group 5 vs Population                     0.80             99


APPENDIX C
EXAMPLE TABLES

Table C-1. Example of a contingency table

             Number of Correct     Number of Incorrect
             Responses (1)         Responses (0)         Totals
Reference    a_k                   b_k                   N_Rk
Focal        c_k                   d_k                   N_Fk
Totals       N_1k                  N_0k                  N_k
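Tables of this form underlie contingency-table DIF methods such as the Mantel-Haenszel procedure: a two-by-two table is built at each total-score level k, and the stratified tables are pooled into a common odds ratio. The R sketch below illustrates the computation; the counts at three score levels are hypothetical, not taken from this study.

a_k <- c(40, 55, 70)   # reference group, correct responses (1)
b_k <- c(20, 15, 10)   # reference group, incorrect responses (0)
c_k <- c(35, 50, 60)   # focal group, correct responses (1)
d_k <- c(25, 20, 20)   # focal group, incorrect responses (0)
N_k <- a_k + b_k + c_k + d_k          # total examinees at score level k

# Mantel-Haenszel common odds ratio: values near 1 suggest no DIF
alpha_MH <- sum(a_k * d_k / N_k) / sum(b_k * c_k / N_k)

# ETS delta scale; negative values indicate the item favors the reference group
delta_MH <- -2.35 * log(alpha_MH)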


APPENDIX D
FIGURES

Figure D-1. Item characteristic curve (ICC) for a 2PL item
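The curve in Figure D-1 comes from the two-parameter logistic (2PL) model, in which the probability of a correct response rises with ability theta at a rate set by the discrimination parameter a and is centered at the difficulty parameter b. A minimal R sketch, using the logistic form without the D = 1.7 scaling constant and hypothetical item parameters a = 1.2 and b = 0.5:

icc_2pl <- function(theta, a, b) {
  1 / (1 + exp(-a * (theta - b)))      # P(correct | theta)
}
theta <- seq(-4, 4, by = 0.1)
plot(theta, icc_2pl(theta, a = 1.2, b = 0.5), type = "l",
     xlab = "theta", ylab = "P(correct)")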


Figure D-2. The shaded area between two ICCs is a visual representation of DIF
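The shaded region corresponds to the unsigned area measure, the DIF statistic used in this study: the area between the reference- and focal-group ICCs, integrated over the ability scale. A minimal R sketch that approximates the area numerically on a theta grid; the two groups' item parameters are hypothetical:

icc_2pl <- function(theta, a, b) 1 / (1 + exp(-a * (theta - b)))
step  <- 0.01
theta <- seq(-6, 6, by = step)
p_ref   <- icc_2pl(theta, a = 1.0, b = 0.0)   # reference-group ICC
p_focal <- icc_2pl(theta, a = 1.0, b = 0.5)   # focal-group ICC

# Riemann-sum approximation of the integral of |P_ref - P_focal|
unsigned_area <- sum(abs(p_ref - p_focal)) * step
unsigned_area   # near the closed form |b_focal - b_ref| = 0.5 when the a's are equal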


Figure D-3. A conceptual model of two different methods of group definition in DIF (panel titles: "Pairwise Comparisons," in which Groups 1, 2, and 3 are compared with one another; "Composite Group Comparisons," in which each group is compared with the composite group)
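The practical difference between the two designs is the set of contrasts each generates: with G groups, pairwise comparisons test all G(G - 1)/2 pairs, while composite comparisons test each group against the composite (total) group, giving G contrasts. A minimal R sketch enumerating both sets for three illustrative groups:

groups <- paste("Group", 1:3)
pairwise_contrasts  <- t(combn(groups, 2))         # G*(G-1)/2 pairwise contrasts
composite_contrasts <- cbind(groups, "Composite")  # G group-vs-composite contrasts
pairwise_contrasts
composite_contrasts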


Figure D-4. DIF under pairwise and composite group comparisons for two groups


Figure D-5. Item characteristic curves (ICCs) for three groups and the operational ICC across the groups
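One common way to form an operational (composite) ICC like the one in Figure D-5 is a proportion-weighted average of the group ICCs; whether this weighting matches the exact definition used in this study is an assumption of the sketch below, as are the three groups' item parameters and the equal group proportions:

icc_2pl <- function(theta, a, b) 1 / (1 + exp(-a * (theta - b)))
theta <- seq(-4, 4, by = 0.1)
p1 <- icc_2pl(theta, a = 1.0, b = -0.5)   # Group 1 ICC
p2 <- icc_2pl(theta, a = 1.0, b =  0.0)   # Group 2 ICC
p3 <- icc_2pl(theta, a = 1.0, b =  0.5)   # Group 3 ICC
w  <- c(1, 1, 1) / 3                      # group proportions (assumed equal)
p_operational <- w[1] * p1 + w[2] * p2 + w[3] * p3
matplot(theta, cbind(p1, p2, p3, p_operational), type = "l",
        lty = c(1, 1, 1, 2), col = 1, xlab = "theta", ylab = "P(correct)")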


Figure D-6. Effect size results for the three groups


Figure D-7. Effect size results for the four groups


Figure D-8. Effect size results for the five groups



BIOGRAPHICAL SKETCH

Halil Ibrahim Sari was born in Kutahya, Turkey. He received his B.A. in mathematics education from Abant Izzet Baysal University, Turkey. He later qualified for a scholarship to study abroad and, in the fall of 2009, enrolled for graduate studies in the Department of Educational Psychology at the University of Florida. He received his M.A.E. in research and evaluation methodology from the Department of Educational Psychology in August 2013.