
A Monte Carlo Investigation of the Performance of Factor Mixture Modeling in the Detection of Differential Item Functioning

Permanent Link: http://ufdc.ufl.edu/UFE0041954/00001

Material Information

Title: A Monte Carlo Investigation of the Performance of Factor Mixture Modeling in the Detection of Differential Item Functioning
Physical Description: 1 online resource (109 p.)
Language: english
Creator: Jackman, Mary
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2010

Subjects

Subjects / Keywords: differential, factor, functioning, item, mixture, modeling
Human Development and Organizational Studies in Education -- Dissertations, Academic -- UF
Genre: Research and Evaluation Methodology thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: This dissertation evaluated the performance of factor mixture modeling in the detection of differential item functioning (DIF). Using a Monte Carlo simulation, the study first investigated the ability of the factor mixture model to recover the number of true latent classes existing in the population. Data were simulated based on the two-parameter logistic (2PL) item response theory (IRT) model for 15 dichotomous items for a two-group, two-class population. In addition, three simulation conditions (sample size, DIF magnitude, and mean latent trait differences) were manipulated. One-, two-, and three-class factor mixture models were estimated and compared using three commonly-used likelihood-based fit indices: the Akaike information criterion (AIC), Bayesian information criterion (BIC), and sample size adjusted Bayesian information criterion (ssaBIC). Overall, there was a high level of inconsistency between the indices with respect to the best-fitting model. Whereas the AIC tended to over-extract the number of latent classes and under most study conditions selected the three-class model, the BIC erred on the side of parsimony and consistently selected the simpler one-class model. On the other hand, the ssaBIC held the middle ground between these two extremes and tended to favor the 'true' two-class mixture model as the sample size or DIF magnitude was increased. In the second phase of the study, the factor mixture approach was assessed in terms of its Type I error rate and statistical power to detect uniform DIF. One thousand data sets were replicated for each of the 12 study conditions. The presence of uniform DIF was assessed via a significance test of each of the differences in item thresholds across latent classes. Overall, the results were not as encouraging as was hoped. Inflated Type I errors were observed under all of the study conditions, particularly when the sample size and DIF magnitude were reduced.
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by Mary Jackman.
Thesis: Thesis (Ph.D.)--University of Florida, 2010.
Local: Adviser: Miller, M David.
Local: Co-adviser: Leite, Walter.
Electronic Access: RESTRICTED TO UF STUDENTS, STAFF, FACULTY, AND ON-CAMPUS USE UNTIL 2011-08-31

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2010
System ID: UFE0041954:00001






A MONTE CARLO INVESTIGATION OF THE PERFORMANCE OF FACTOR
MIXTURE MODELING IN THE DETECTION OF DIFFERENTIAL ITEM FUNCTIONING




















By

MARY GRACE-ANNE JACKMAN


A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2010

































2010 Mary Grace-Anne Jackman
































To my parents, Enid and Kenmore Jackman, and my brother, Stephen









ACKNOWLEDGMENTS

First and foremost, I am most grateful to Almighty God through whom all things are

possible. I am also forever indebted to the Office of Graduate Studies and the Alumni

Fellowship Committee for the financial support that has made these four years of study

possible. The completion of this dissertation would also not have been possible without

the support and guidance of my dissertation committee: Dr. David Miller, Dr. Walter

Leite, Dr. James Algina, and Dr. Craig Wood. I would like to thank my committee chair

and academic adviser, Dr. Miller, for his expert guidance and insightful counsel during

this PhD program. I am also grateful to Dr. Leite, for his patient mentorship and his

uncanny ability to reduce my seemingly insurmountable programming mountains to

mere molehills. I must also acknowledge Dr. Algina, the epitome of teaching excellence.

I am indeed privileged to have been your student. I am also thankful to Dr. Wood for

agreeing to be a part of my dissertation committee in my time of need and for your quiet

vote of confidence.

Finally, I also need to express my immense gratitude and deepest appreciation to

my friends, in the US and at home in Barbados. This could not have been possible

without your unwavering support and unending encouragement. I am grateful for the

Skype chats, emails, texts and all the other modes of communication you used to

continually encourage, support me and keep me abreast of all the happenings back

home. This journey has been made easier because of your friendship, support and

prayers.









TABLE OF CONTENTS

ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTER

1 INTRODUCTION

2 LITERATURE REVIEW
    Differential Item Functioning
    Types of Differential Item Functioning
    DIF vs. Impact
    Frameworks for Examining DIF
    Observed score framework
    The latent variable framework
    SEM-based DIF Detection Methods
    Factor Analytic Models with Ordered Categorical Items
    Mixture Modeling as an Alternative Approach to DIF Detection
    Estimation of Mixture Models
    Class Enumeration
    Information Criteria Indices
    Mixture Model Estimation Challenges
    Purpose of Study

3 METHODOLOGY
    Factor Mixture Model Specification for Latent Class DIF Detection
    Data Generation
    Simulation Study Design
    Research Study 1
    Manipulated Conditions
    Sample size
    Magnitude of uniform DIF
    Ability differences between groups
    Fixed Simulation Conditions
    Test length
    Number of DIF items
    Sample size ratio
    Percentage of overlap between manifest and latent classes
    Mixing proportion
    Study Design Overview
    Evaluation Criteria
    Research Study 2
    Data Analysis
    Evaluation Criteria
    Model Estimation

4 RESULTS
    Research Study 1
    Convergence Rates
    Class Enumeration
    Akaike Information Criteria (AIC)
    Bayesian Information Criteria (BIC)
    Sample-size adjusted BIC (ssaBIC)
    Research Study 2
    Nonconvergent Solutions
    Type I Error Rate
    Magnitude of DIF
    Sample size
    Impact
    Variance components analysis
    Statistical Power
    Magnitude of DIF
    Sample size
    Impact
    Effect of item discrimination parameter values
    Variance components analysis

5 DISCUSSION
    Class Enumeration and Performance of Fit Indices
    Type I Error and Statistical Power Performance
    Type I Error Rate Study
    Statistical Power Study
    Reconciling the Simulation Results
    Limitations of the Study and Suggestions for Future Research
    Conclusion

APPENDIX

A MPLUS CODE FOR ESTIMATING 2-CLASS FMM

B MPLUS CODE FOR DIF DETECTION

LIST OF REFERENCES

BIOGRAPHICAL SKETCH









LIST OF TABLES


Table

3-1  Generating population parameter values for reference group
3-2  Fixed and manipulated simulation conditions used in study 1
3-3  Fixed and manipulated simulation conditions used in study 2
4-1  Number of converged replications for the three factor mixture models
4-2  Mean AIC values for the three mixture models
4-3  Mean BIC values for the three mixture models
4-4  Mean ssaBIC values for the three mixture models
4-5  Percentages of converged solutions across study conditions
4-6  Overall Type I error rates across study conditions
4-7  Type I error rates for DIF = 1.0
4-8  Type I error rates for DIF = 1.5
4-9  Type I error rates for sample size of 500
4-10 Type I error rates for sample size of 1000
4-11 Type I error rates for impact of 0 SD
4-12 Type I error rates for impact of 0.5 SD
4-13 Type I error rates for impact of 1.0 SD
4-14 Variance components analysis for Type I error
4-15 Overall power rates across study conditions
4-16 Power rates for DIF of 1.0
4-17 Power rates for DIF of 1.5
4-18 Power rates for sample size N of 500
4-19 Power rates for sample size N of 1000
4-20 Power rates for impact of 0 SD
4-21 Power rates for impact of 0.5 SD
4-22 Power rates for impact of 1.0 SD
4-23 Power rates for DIF detection based on item discriminations
4-24 Variance components analysis for power results









LIST OF FIGURES


Figure

2-1 Example of uniform DIF
2-2 Example of non-uniform DIF
2-3 Depiction of relationship between y* and y for a dichotomous item
2-4 Diagram depicting specification of the factor mixture model









Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy

A MONTE CARLO INVESTIGATION OF THE PERFORMANCE OF FACTOR
MIXTURE MODELING IN THE DETECTION OF DIFFERENTIAL ITEM FUNCTIONING

By

Mary Grace-Anne Jackman

August 2010

Chair: M. David Miller
Cochair: Walter Leite
Major: Research and Evaluation Methodology

This dissertation evaluated the performance of factor mixture modeling in the

detection of differential item functioning (DIF). Using a Monte Carlo simulation, the study

first investigated the ability of the factor mixture model to recover the number of true

latent classes existing in the population. Data were simulated based on the two-

parameter logistic (2PL) item response theory (IRT) model for 15 dichotomous items for

a two-group, two-class population. In addition, the three simulation conditions sample

size, DIF magnitude, and mean latent trait differences were manipulated. One-, two-,

and three-class factor mixture models were estimated and compared using three

commonly-used likelihood-based fit indices: the Akaike information criterion (AIC),

Bayesian information criterion (BIC), and sample size adjusted Bayesian information

criterion (ssaBIC). Overall, there was a high level of inconsistency between the indices

with respect to the best-fitting model. Whereas the AIC tended to over-extract the

number of latent classes and under most study conditions selected the three-class

model, the BIC erred on the side of parsimony and consistently selected the simpler

one-class model. On the other hand, the ssaBIC held the middle ground between these









two extremes and tended to favor the "true" two-class mixture model as the sample size

or DIF magnitude was increased.

In the second phase of the study, the factor mixture approach was assessed in

terms of its Type I error rate and statistical power to detect uniform DIF. One thousand

data sets were replicated for each of the 12 study conditions. The presence of uniform

DIF was assessed via a significance test of each of the differences in item thresholds

across latent classes. Overall, the results were not as encouraging as was hoped.

Inflated Type I errors were observed under all of the study conditions, particularly when

the sample size and DIF magnitude were reduced.









CHAPTER 1
INTRODUCTION

Given the ubiquity of testing in the United States coupled with the serious

consequences of high-stakes decisions associated with these assessments, it is critical

that conclusions drawn about group differences among examinee groups be accurate

and that the validity of interpretations is not compromised. One way of eliminating the

threat of invalid interpretations is to ensure that tests are fair and the items do not

disadvantage subgroups of examinees. In addressing the issue of fairness in testing,

the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1999)

outlines four widely used interpretations of fairness. The first defines a fair test as one

that is free from bias that either systematically favors or systematically disadvantages

one identifiable subgroup of examinees over another. In the second definition, fairness

refers to the belief that equal treatment should be afforded to all examinees during the

testing process. The third definition has sparked some controversy among testing

professionals. It defines fairness as the equality of outcomes as characterized by

comparable overall passing rates across examinee subgroups. However, it is widely

agreed that while the presence of group differences in ability distributions between

groups should not be ignored, it is not an indication of test bias. Therefore, a more

acceptable definition specifies that in a fair test, examinees possessing equal levels of

the underlying trait being measured should have comparable testing outcomes,

regardless of group membership. The fourth and final definition of fairness requires that

all examinees be afforded an equal, adequate opportunity to learn the tested material

(AERA, APA, & NCME, 1999). Clearly, the concept of fairness is a complex, multi-

faceted construct and therefore it is highly unlikely that consensus will be reached on all









aspects of its definition, interpretation and implementation (AERA, APA, & NCME,

1999). However, there is agreement that fairness must be paramount to test developers

during the writing and review of items as well as during the administration and scoring of

tests. In other words, the minimum fairness requirements are that items are free from

bias and that all examinees receive an equitable level of treatment during the testing

process (AERA, APA, & NCME, 1999).

In developing unbiased test items, one of the primary concerns is ensuring that the

items do not function differentially for different subgroups of examinees. This issue of

item invariance is investigated through the use of a statistical technique known as

differential item functioning (DIF). DIF detection is particularly critical if meaningful

comparisons are to be made between different examinee subgroups. The fundamental

premise of DIF is "that if test takers have approximately the same knowledge, then they

should perform in similar ways on individual test questions regardless of their sex, race

or ethnicity" (ETS, 2008). Therefore, the process of DIF assessment involves the

accumulation of empirical evidence to determine whether items function differentially for

examinees with the same ability. DIF analysis is widely regarded as the psychometric

standard in the investigation of bias and test fairness. Consequently, it has been the

topic of extensive research (Camilli & Shepard, 1994; Clauser & Mazor, 1998; Holland &

Thayer, 1988; Holland & Wainer, 1993; Millsap & Everson, 1993; Penfield & Lam, 2000;

Potenza & Dorans, 1995; Swaminathan & Rogers, 1990). As part of its evolution,

several statistical DIF detection techniques, including non-parametric (Holland &

Thayer, 1988; Swaminathan & Rogers, 1990), IRT-based methods (Thissen, Steinberg,

& Wainer, 1993; Wainer, Sireci, & Thissen, 1991) and SEM-based methods (Joreskog &









Goldberger, 1975; Macintosh & Hashim, 2003; Muthen, 1985, 1988; Muthen, Kao, &

Burstein, 1991) have been developed. Typically these methods have all focused on a

manifest approach to detecting DIF. In other words, they use pre-existing group

characteristics such as gender (e.g. males vs. females) or ethnicity (e.g. Caucasian vs.

African Americans) to investigate the occurrence of DIF in a studied item. And while this

approach to DIF testing has been widely practiced and accepted as the standard for DIF

testing, some believe that this emphasis on statistical DIF analysis has been less

successful in determining substantive causes or sources of DIF. For example, in

traditional DIF analyses, after an item has been flagged as exhibiting DIF, content-

experts may subsequently conduct a substantive review to determine the source of the

DIF (Furlow, Ross, & Gagne, 2009). However, while this is acknowledged as an

important step, some view its success at being able to truly understand the explanatory

sources of DIF as minimal, at best (Engelhard, Hansche, & Rutledge, 1990; Gierl et al.,

2001; O'Neill & McPeek, 1993; Roussos & Stout, 1996). More specifically,

inconsistencies in agreement between reviewers' judgments or between reviewers and

DIF statistics make it difficult to form any definitive conclusions explaining the

occurrence of DIF (Engelhard et al., 1990). Others suggest that the inability of traditional

DIF methods to unearth the causes of DIF is because the pre-selected grouping

variables on which these analyses are based are often not the real dimensions causing

DIF. Rather, they view these a priori variables as mere proxies for educational

disadvantage, attributes that, if identified, could better explain the pattern of differential

responses among examinees (Cohen & Bolt, 2005; De Ayala, Kim, Stapleton, & Dayton,

2002; Dorans & Holland, 1993; Samuelsen, 2005, 2008; Webb, Cohen, &









Schwanenflugel, 2008). Finally, the assumption of an inherent homogeneity in

responses among examinees in the subgroups has also been cited as another

weakness of the traditional DIF approach (De Ayala, 2009; De Ayala, Kim, Stapleton &

Dayton, 2002; Samuelsen, 2005). This view has been supported by the observation that

even within a seemingly homogenous manifest group (e.g. Hispanic or black) there can

be high levels of heterogeneity, resulting in segments which respond differently to the

item than other examinees in that group. Using race as an example, De Ayala (2009)

noted that a racial category such as Asian American would lump together examinees of

Filipino, Korean, Indonesian, Taiwanese descent as a single homogeneous group,

ignoring their intra-manifest variability. As a result, De Ayala (2009) argues that this

assumed homogeneity in traditional manifest DIF assessments may lead to false

conclusions about the existence or magnitude of DIF.

As an alternative to the traditional manifest approach, a latent mixture

conceptualization of DIF has been proposed. Rather than focusing on a priori examinee

characteristics, this method characterizes DIF as being the result of unobserved

heterogeneity in the population. A latent mixture conceptualization relaxes the

requirement that associates DIF with a specific preexisting variable and the assumption

that manifest groups are homogenous. Instead, examinees are classified into latent

subpopulations based on their differential response patterns. These latent

subpopulations or latent classes arise as a result of qualitative differences (e.g. use of

problem solving strategies, response styles or level of cognitive thinking) among

examinee subgroups (Mislevy et al., 2008; Samuelsen, 2005). Interestingly, more than a

decade before the latent mixture conceptualization had been proposed, Angoff (1993)









also voiced his concern about the inability of traditional methods to provide substantive

interpretations of DIF. In offering evidence supporting this view Angoff (1993) reported

that in attempting to account for DIF "test developers are often confronted by DIF results

that they cannot understand; and no amount of deliberation seems to explain why some

perfectly reasonable items have large DIF values" (Angoff, 1993, pg. 19).

This dissertation proposes the use of factor mixture models as an alternative

approach to investigating heterogeneity in item parameters when the source of

heterogeneity is unobserved. Factor mixture modeling blends factor analytic (Thurstone,

1947) and latent class (Lazarsfeld & Henry, 1968) models, two structural equation

modeling (SEM) based methods that provide a unique but complementary approach to

explaining the covariation in the data. The latent class model accounts for the item

relationships by assuming the existence of qualitatively different subpopulations or

latent classes (Bauer & Curran, 2004). However, since class membership is unobserved

individuals are categorized based not on an observed grouping variable but rather by

using probabilities to determine their most likely latent class assignment. On the other

hand, the factor analytic model assumes the existence of an underlying, continuous

factor structure in explaining the commonality among the item responses. As part of the

factor mixture estimation, the class-specific item parameters from the factor analytic

model that are estimated can be compared to determine their level of measurement

non-invariance or differential functioning. A significant DIF coefficient provides evidence

that the item is functioning differentially among latent classes after controlling for the

latent ability trait.









In estimating factor mixture models, one important decision to be made is the

determination of the number of latent classes. However, while there are several fit

criteria such as the Akaike Information Criterion (AIC; Akaike, 1987), Bayesian

Information Criterion (BIC; Schwarz, 1978), the adjusted BIC (ssaBIC; Sclove, 1987),

and the Lo-Mendell-Rubin adjusted likelihood ratio test (LMR aLRT; Lo, Mendell, &

Rubin, 2001) available to assist the researcher in making the determination, there is

seldom perfect agreement among these fit indices. Therefore, practitioners are

cautioned against applying the mixture approach without having theoretical support for

their hypothesis of unobserved heterogeneity in the population of interest.
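
To make the trade-off captured by these indices concrete, the short Python sketch below computes the AIC, BIC, and sample-size adjusted BIC from a model's log-likelihood, number of free parameters, and sample size; the ssaBIC penalty follows the Sclove (1987) adjustment of n to (n + 2)/24. The log-likelihoods and parameter counts in the example are illustrative placeholders, not values from this study.

import math

def information_criteria(log_likelihood, n_parameters, n_obs):
    """Return (AIC, BIC, ssaBIC) for a fitted mixture model."""
    aic = -2 * log_likelihood + 2 * n_parameters
    bic = -2 * log_likelihood + n_parameters * math.log(n_obs)
    # The sample-size adjusted BIC replaces n with (n + 2) / 24 in the penalty term.
    ssabic = -2 * log_likelihood + n_parameters * math.log((n_obs + 2) / 24)
    return aic, bic, ssabic

# Hypothetical log-likelihoods for 1-, 2-, and 3-class models fit to n = 1000.
for k, (ll, p) in enumerate([(-9250.0, 30), (-9180.0, 46), (-9165.0, 62)], start=1):
    aic, bic, ssabic = information_criteria(ll, p, 1000)
    print(f"{k}-class model: AIC={aic:.1f}  BIC={bic:.1f}  ssaBIC={ssabic:.1f}")

Because the BIC penalty grows with log(n) while the AIC penalty does not, the same pattern of log-likelihoods can lead the two indices to prefer different numbers of classes, which is the inconsistency examined in this study.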

Generally, when group membership is known a priori, SEM-based DIF detection

models can be specified using either a multiple indicators, multiple causes (MIMIC)

model or a multiple-group CFA approach (Allua, 2007). In this paper, the factor mixture

model will be specified using the mixture analog of the manifest multiple-group CFA.

Therefore, in the model specification since the observed group variable will now be

replaced by a latent categorical variable, not only will the heterogeneity of item

parameters be examined but a profile of the latent, unobserved subpopulations can be

examined as well. Finally, if needed, covariates can also be included in the model to

help in explaining the composition of the latent classes as well.

The primary purpose of this dissertation was to explore the utility of the factor

mixture models in the detection of DIF. In a 2006 paper, Bandalos and Cohen

commented that while the estimation of the factor mixture models had previously been

presented in the IRT literature, the models were not as frequently utilized with SEM-

based models. However, programming enhancements to software packages such as









Mplus (Muthen & Muthen, 1998-2008) have increased the likelihood that the estimation

of SEM-based factor mixture models will be more commonly estimated in practice

(Bandalos & Cohen, 2006). In this study, the performance of the factor mixture model

was evaluated primarily in terms of its ability to produce high convergence rates, control

Type I error rate, and enhance statistical power under a variety of realistic study

conditions. The simulated conditions examined sample size, magnitude of DIF, and

similarity of latent ability means. In sum, this Monte Carlo simulation was conducted to

determine the conditions under which a factor mixture approach performs best when

assessing item non-invariance and ultimately whether its use in practice should be

recommended.
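
As a minimal sketch of the kind of data generation such a simulation requires, the following Python code draws dichotomous responses from a 2PL model for two latent classes, with uniform DIF introduced as a shift in the difficulties of a few items for the focal class and impact introduced as a latent mean difference. The specific discriminations, difficulties, DIF items, and sample sizes are illustrative placeholders and are not the generating values used in this dissertation.

import numpy as np

rng = np.random.default_rng(2010)

def generate_2pl_responses(n, a, b, theta_mean=0.0, theta_sd=1.0):
    """Draw dichotomous responses from a 2PL model for n examinees."""
    theta = rng.normal(theta_mean, theta_sd, size=n)
    # P(correct) = 1 / (1 + exp(-a * (theta - b))) for each person-item pair
    p = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b)))
    return (rng.uniform(size=p.shape) < p).astype(int)

n_items = 15
a = rng.uniform(0.8, 2.0, size=n_items)        # item discriminations (illustrative)
b_ref = rng.normal(0.0, 1.0, size=n_items)     # reference-class difficulties
dif_magnitude = 1.0                            # uniform DIF size (illustrative)
dif_items = [0, 1, 2]                          # items simulated to show DIF
b_focal = b_ref.copy()
b_focal[dif_items] += dif_magnitude            # DIF items are harder for the focal class

impact = 0.5                                   # latent mean difference between classes
y_ref = generate_2pl_responses(500, a, b_ref)
y_focal = generate_2pl_responses(500, a, b_focal, theta_mean=-impact)
data = np.vstack([y_ref, y_focal])             # pooled sample handed to the mixture model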

SEM-based approaches are not commonly used in the discipline of educational

testing which traditionally has been considered the domain of techniques developed

within an IRT framework. And although the equivalence between factor analytic and IRT

approaches for categorical items has long been established and applied repeatedly in

the literature (Bock & Aitkin, 1981; Finch, 2005; Glockner-Rist & Hoijtink, 2003; Moustaki,

2000; Muthen, 1985; Muthen & Asparouhov, 2002; Takane & de Leeuw, 1987), these

methods are still utilized primarily within their respective discipline of origin. Therefore,

despite its obvious potential, it is unlikely that this latent conceptualization will gain

widespread acceptance, unless the applied community is convinced that (i) framing DIF

with respect to latent qualitative differences and (ii) using a SEM-based approach are

both worthwhile, practical options. In sum, if a mixture approach can be shown to add

substantial value in an area such as DIF detection, then research of this kind will









contribute positively to bridging the gap between SEM-based and IRT-based methods in

the area of testing and measurement.









CHAPTER 2
LITERATURE REVIEW

Differential Item Functioning

From a historical standpoint, the term item bias was coined in the 1960s during an

era when a public campaign for social equality, justice and fairness was being waged on

all fronts. The term referenced studies designed to investigate the claim that "the

principal, if not the sole, reason for the great disparity in test performance between

Black and Hispanic students and White students on tests of cognitive ability is that the

tests contain items that are outside the realms of the minority cultures" (Angoff, 1993,

pg. 3). Concerns about bias in testing were particularly relevant in cases where the

results were used in high-stakes decisions involving job selection and promotions,

certification, licensure and achievement. What followed in the early '60s and '70s was a

series of studies using rudimentary methods based on classical test theory (CTT)

techniques (Gelin, 2005). One of these early methods involved an analysis of variance

(ANOVA) approach (Angoff & Sharon, 1974; Cleary & Hilton, 1968) and focused on the

interaction between group membership and item performance as a means of identifying

outliers and detecting potentially biased items. Another method, the delta-plot

technique (Angoff, 1972; Thurstone, 1925), used plots of the transformed CTT index of

item difficulty (p-values) for each group as a means of detecting biased items.

However, the main weakness of these early methods was that they both failed to control

for the underlying construct that was purported to be measured (e.g. ability) by the test.

Another criticism was that because these methods only considered item difficulty, their

implicit assumption of equally discriminating items led to an increase in the incidence of

false negative or false positive identifications.









During this period, the term "bias" also came under heavy scrutiny. It was felt that

the strong emotional connotation which the word carried was creating a semantic rift

between the technical testing community and the general public. As a result of this

debate, the term differential item functioning (DIF) was proposed as a less value-laden,

more neutral replacement (Angoff, 1993; Cole, 1993). The DIF concept is defined as the

accumulation of empirical evidence to investigate whether there is a difference in

performance between comparable groups of examinees (Hambleton, Swaminathan, &

Rogers, 1991). More specifically, it refers to a difference in the probability of correctly

responding to an item between two subgroups of examinees of the same ability or

groups matched by their performance on the test representing the underlying construct

of interest (Kamata & Binici, 2003; Potenza & Dorans, 1995). These two groups are

referred to as the focal group, those examinees expected to be disadvantaged by the

item(s) of interest on the test (e.g. females or African Americans), and the reference

group, those examinees expected to be favored by the DIF items (e.g. males or

Caucasians).

Since the introduction of the early CTT methods, a variety of additional techniques

have been introduced for the detection of DIF (Clauser & Mazor, 1998). These include

the nonparametric Mantel-Haenszel chi-square method (Holland & Thayer, 1988;

Mantel & Haenszel, 1959), the standardization method (Dorans & Holland, 1993),

logistic regression (Swaminathan & Rogers, 1990), likelihood ratio tests (Wainer, Sireci,

& Thissen, 1991), and item response theory (IRT) approaches comparing parameter

estimates (Lord, 1980) or estimating the areas between the item characteristic curves

(Raju, 1988, 1990). Additionally, though not as popular in the testing and measurement









discipline, SEM-based methods such as the multiple-groups approach (Lee, 2009;

Sorbom, 1974) and the MIMIC model (Gallo, Anthony, & Muthen, 1994; Muthen, 1985,

1988) have also been used to detect DIF. The fact that these methods involve some

level of conditioning on the latent construct of interest differentiates them from the earlier

CTT ANOVA and p-value approaches. Furthermore, DIF assessment has now emerged

as an essential element in the investigation of test validity and fairness and for testing

companies such as Educational Testing Service (ETS), the use of DIF is a critical part

of the test validation process. Cole's (1993) comments in which she describes DIF as "a

technical tool in ETS to help us assure ourselves that tests are as fair as we can make

them" underscore the importance of this analytical approach as a standard practice in

the design and administration of fair tests.

Types of Differential Item Functioning

There are two primary types of DIF: uniform and non-uniform DIF. Uniform DIF

occurs when the probability of correctly responding to or endorsing a response category

is consistently higher or lower for either the reference or focal group across all levels of

the ability scale. As shown in Figure 2-1, when uniform DIF is present in an item, the two

ICCs do not cross at any point along the range of ability. In this example, the probability

of responding correctly to this dichotomous item is uniformly lower for group 2 than for

group 1 along the entire range of latent ability.

Conversely, when there is an interaction between the latent ability trait and group

membership and the ICCs cross at some point along the ability range, then this is

referred to as non-uniform DIF. This means that for a portion of the scale, one group is

more likely to correctly respond or endorse a response category. However, this

"advantage" is reversed for the second portion of the scale. An example of this is shown









in Figure 2-2. It is important to note that while some methods can detect both types of

DIF, others are capable of detecting uniform DIF only.
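
A small numeric illustration, not taken from this study, may help fix the distinction. In the Python sketch below, a common-discrimination item with a higher difficulty for the focal group yields probability differences with the same sign across the whole ability range (uniform DIF), whereas an item whose discrimination differs between groups yields differences that change sign where the curves cross (non-uniform DIF). All parameter values are illustrative.

import numpy as np

theta = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

def icc(a, b):
    # 2PL item characteristic curve evaluated at the theta grid
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Uniform DIF: equal discriminations, the focal-group item is harder at every theta,
# so the reference-minus-focal difference keeps the same sign (curves never cross).
print(icc(1.2, 0.0) - icc(1.2, 0.5))

# Non-uniform DIF: discriminations differ, so the curves cross and the sign flips.
print(icc(1.2, 0.0) - icc(0.6, 0.0))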

DIF vs. Impact

As was mentioned previously, DIF occurs when there is a significant difference in

the probability of answering an item correctly or endorsing an item category between

groups of the same ability level (Wainer, 1993). On the other hand, impact refers to

legitimate group differences in the probability of getting an item correct or endorsing an

item category (Wainer, 1993). Therefore, in distinguishing between these two concepts,

it is important to note that while DIF assessment includes comparable groups of

examinees with the same trait level, impact refers to group differences without

controlling or matching on the construct of interest.

Frameworks for Examining DIF

Traditionally, DIF is examined within one of two frameworks: (i) the observed score

framework and (ii) the latent variable framework.

Observed score framework

The observed score framework includes methods that adjust for ability by

conditioning on some observed or manifest variable designed to serve as a proxy for

the underlying ability trait (Ainsworth, 2007). When an observed variable such as the

total test score is used it is assumed that this internal criterion is an unbiased estimate

of the underlying construct. As an alternative to the aggregate test score, a purified

form of the total test score may also be used as the internal

matching criterion. This means that, if DIF is detected, the total test score is

adjusted by dropping those items identified as displaying DIF and calculating a revised

total score. This iterative process which is used to refine the criterion measure and limit









the impact of DIF contamination is known as purification (Kamata & Vaughn, 2004).

While internal criterion measures are most commonly used, external criteria consisting

of a set of items that were not part of the administered test could also function as an

external criterion measure. An external criterion may be an adequate choice particularly

when there are a high proportion of DIF items in the scale or the total score is deemed

to be an inappropriate measure (Gelin, 2005). However, external measures are seldom

used in practice due to the difficulty in finding an adequate set of items that would be

more appropriate at measuring the latent trait than the actual test that was designed

specifically for that use (Gelin, 2005; Shih & Wang, 2009). Procedures that use the

observed variable framework include contingency tabulation methods such as the

Mantel-Haenszel (MH) (Holland & Thayer, 1988), generalized linear models such as

logistic regression (Swaminathan & Rogers, 1990) or ordinal regression (Zumbo, 1999),

and the standardization method (Dorans & Kulick, 1983).
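
To illustrate the purification idea in the observed score framework, the Python sketch below flags items with the Mantel-Haenszel delta statistic and then recomputes the matching total score without the flagged items, repeating until the flagged set stabilizes. It is a simplification for illustration only: the 0.5 continuity constant, the fixed flagging threshold of 1.5 on the delta scale, and the exclusion of the studied item from its own matching score are not features of any particular operational procedure described in this dissertation.

import numpy as np

def mh_delta(item, group, match):
    """Mantel-Haenszel delta-DIF statistic for one item.
    item: 0/1 responses; group: 1 = reference, 0 = focal; match: matching scores."""
    num, den = 0.0, 0.0
    for s in np.unique(match):
        idx = match == s
        a = np.sum((group[idx] == 1) & (item[idx] == 1))  # reference, correct
        b = np.sum((group[idx] == 1) & (item[idx] == 0))  # reference, incorrect
        c = np.sum((group[idx] == 0) & (item[idx] == 1))  # focal, correct
        d = np.sum((group[idx] == 0) & (item[idx] == 0))  # focal, incorrect
        t = a + b + c + d
        if t > 0:
            num += a * d / t
            den += b * c / t
    alpha = (num + 0.5) / (den + 0.5)    # small constant guards against empty cells
    return -2.35 * np.log(alpha)         # ETS delta metric

def purified_dif(responses, group, threshold=1.5, max_iter=10):
    """Iteratively flag DIF items and remove them from the matching score."""
    n_items = responses.shape[1]
    flagged = set()
    for _ in range(max_iter):
        keep = [j for j in range(n_items) if j not in flagged]
        match = responses[:, keep].sum(axis=1)   # purified total score
        new_flags = {j for j in range(n_items)
                     if abs(mh_delta(responses[:, j], group, match)) >= threshold}
        if new_flags == flagged:
            break
        flagged = new_flags
    return sorted(flagged)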

The latent variable framework

Unlike the procedures implemented within the observed score framework, these

techniques do not control using an observed, manifest measure like the total test score

or purified total test score. Instead, latent variable DIF detection methods involve the

use of an assumed underlying latent trait such as ability. The two main classes of

methods which use the latent variable framework are: (i) item response theory (IRT) and

(ii) structural equation modeling (SEM) based approaches. IRT methods include

techniques in which comparisons are made either between item parameters (Lord,

1980), or between item characteristic curves (Raju, 1988, 1990) or likelihood ratio test

methods (IRT-LRT; Thissen, Steinberg, & Wainer, 1988). However, since the focus of









this dissertation is on the use of an SEM-based approach to detect DIF, no detailed

description of the IRT-based DIF detection methods will be given.

SEM-based DIF Detection Methods

While it is possible to specify an exploratory factor model where the measurement

model relating the item responses to the latent underlying factors is unknown, a CFA

approach in which the factor structure has been specified is more often used in DIF

detection. Typically, a CFA model is formulated as:

$Y = \nu + \Lambda\eta + \varepsilon$   (1)

$\eta = \alpha + \zeta$   (2)

where Y is a p × 1 vector of scores on the p observed variables, $\nu$ is a p × 1 vector of

measurement intercepts, $\Lambda$ is a p × m matrix of factor loadings, $\eta$ is an m × 1 vector of

factor scores on the m latent factors, and $\varepsilon$ is a p × 1 vector of residuals or measurement

errors representing the unique portion of the observed variables not explained by the

common factor(s). It is assumed that the $\varepsilon$s have zero mean and are uncorrelated not

only with the $\eta$, but with each other as well. Additionally, $\alpha$ denotes an m × 1 vector of

factor means and $\zeta$ is an m × 1 vector of factor residuals. The model-implied mean and

covariance structures are formulated as follows:

$\mu = \nu + \Lambda\kappa$   (3)

$\Sigma(\theta) = \Lambda\Phi\Lambda' + \Theta$   (4)

where $\mu$ is a p × 1 vector of means of the observed variables, $\kappa$ is an m × 1 vector of

factor means, $\Sigma(\theta)$ is a p × p matrix of variances and covariances of the p observed

variables, $\Phi$ is an m × m covariance matrix of the latent factors, and $\Theta$ is a square p × p

matrix of variances and covariances for the measurement errors. For a single-group









CFA model, it is also assumed that the independent observations are drawn from a

single, homogenous population.

The estimates of the parameters are found by minimizing the discrepancy between

the observed covariance matrix S and the model-implied covariance matrix $\Sigma(\theta)$. The general

form of the discrepancy function is denoted by $F(S, \Sigma(\theta))$, but there are several different

types of functions that can be used in the estimation process. For example, if

multivariate normality of the data is assumed, then the maximum likelihood (ML)

estimates of the parameters are obtained by minimizing the following:

$F_{ML}(\theta) = \left[\log|\Sigma(\theta)| + \mathrm{tr}\!\left(S\Sigma(\theta)^{-1}\right) - \log|S| - p\right] + (\bar{Y} - \mu)'\,\Sigma(\theta)^{-1}(\bar{Y} - \mu)$   (5)

where S is the sample covariance matrix, $\Sigma(\theta)$ is the model-implied covariance matrix,

and p is the number of observed variables. As presented, this formulation of the factor

model is described for items with a continuous response format. However, since the

focus of this dissertation is on DIF assessment with dichotomous items, what follows is

a respecification of the model to accommodate categorical items.

Factor Analytic Models with Ordered Categorical Items

In educational measurement, the ability trait is typically measured by dichotomous

or polytomous items rather than continuous items. One method of dealing with

categorical item responses is to specify a threshold model or a latent response variable

(LRV) formulation (Muthen & Asparouhov, 2002). The LRV formulation assumes

that underlying each observed item response y, is a continuous and normally distributed

latent response variable y*. This continuous latent variable can be thought of as a

response tendency with higher values indicating a greater propensity of answering the

item correctly. Further, it is assumed that when this tendency is sufficiently high thereby









exceeding a specific threshold value, the examinee will answer the item correctly.

Likewise, if it falls below the threshold, then an incorrect response is observed.

Therefore, based on this formulation, the observed items responses can be viewed as

discrete categorizations of the continuous latent variables. The relationship between

these two variables, y* and y, is represented by the following nonlinear function:

$y_j = c, \quad \text{if } \tau_{c-1} < y^*_j \leq \tau_c$   (6)

where c denotes the number of response categories for y and the threshold structure is

defined by $\tau_0 = -\infty < \tau_1 < \ldots < \tau_c = +\infty$ for c categories with c − 1 thresholds. In the case

of binary items, the mapping of $y^*_j$ onto $y_j$ is expressed as

$y_j = \begin{cases} 0, & \text{if } y^*_j \leq \tau_j \\ 1, & \text{if } y^*_j > \tau_j \end{cases}$

where $\tau_j$ denotes the threshold parameter for test item $y_j$. This relationship is illustrated in

Figure 2-3.
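
A few lines of Python may make the LRV mapping concrete: a continuous response tendency y* is generated from the factor model, and the observed binary response is simply an indicator of whether y* exceeds the item threshold. The loading, threshold, and logistic error assumption below are illustrative choices, not values from this study.

import numpy as np

rng = np.random.default_rng(0)
n, lam, tau = 1000, 0.9, 0.25              # illustrative loading and threshold
eta = rng.normal(size=n)                   # latent trait
ystar = lam * eta + rng.logistic(size=n)   # latent response tendency with logistic errors
y = (ystar > tau).astype(int)              # observed dichotomous response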

Because of the LRV formulation, the measurement component of the model which

relates, in this case, the continuous latent response variables to the latent factor and to

the group membership variable is respecified as:

$y^*_{ij} = \nu_j + \lambda_j\eta_i + \varepsilon_{ij}$   (7)

where $y^*_{ij}$ is individual i's latent response to item j. The distributional assumptions of the

p-vector of measurement errors determine the appropriate link function to be selected.

For example, if it is assumed that the measurement errors are normally distributed then

the probit link function, that is, the inverse of the cumulative normal distribution

function, $\Phi^{-1}[\cdot]$, is used. As a result, the thresholds and factor loadings are interpreted as

probit coefficients in the linear probit regression equation. The alternative is to assume a








logistic distribution function for the measurement errors which allows the coefficients to

be interpreted either in terms of logits or converted to changes in odds.

Under the LRV formulation, the single factor model for a continuous latent trait

measured by binary outcomes is expressed as in Equation 7. Therefore, the conditional

probability of a correct response as a function of $\eta$ is:

$P(y_{ij} = 1 \mid \eta_i) = P(y^*_{ij} > \tau_j \mid \eta_i) = 1 - F\!\left[(\tau_j - \nu_j - \lambda_j\eta_i)\,V(\varepsilon_{ij})^{-1/2}\right]$   (8)

where $V(\varepsilon_{ij})$ is the residual variance and F can be either the standard normal or logistic

distribution function depending on the distributional assumptions of the $\varepsilon$s (Muthen &

Asparouhov, 2002). Further, in addition to the LRV, latent variable models with

categorical variables can also be presented using an alternative formulation. The

conditional probability curve formulation focuses on directly modeling the nonlinear

relationship between the observed $y_{ij}$ and the latent trait, $\eta_i$, as:

$P(y_{ij} = 1 \mid \eta_i) = F\!\left[a_j(\eta_i - b_j)\right]$   (9)

where $a_j$ is the item discrimination, $b_j$ is the item difficulty, and the distribution of F is

either the standard normal or logistic distribution function. In their 2002 paper, Muthen

and Asparouhov illustrate the equivalence of results between these two conceptual

formulations of modeling factor analytic models with categorical outcomes. The authors

showed that equating the two formulations:

$1 - F\!\left[(\tau_j - \nu_j - \lambda_j\eta_i)\,V(\varepsilon_{ij})^{-1/2}\right] = F\!\left[a_j(\eta_i - b_j)\right]$   (10)









where, as previously indicated, F is either the standard normal or logistic distribution

depending on the distributional assumptions of the $\varepsilon$s. However, it should be noted that

in the case of factor mixture modeling with categorical variables, the default estimation

method used in Mplus (Muthen & Muthen, 1998-2008) is robust maximum likelihood

estimation (MLR), and the default distribution of F is the logistic distribution.

To allow for the estimation of the thresholds, the intercepts in the measurement

model are assumed to be zero. As a result, the factor analytic logistic parameters can

be converted to IRT parameters using:


$a_j = \dfrac{\lambda_j}{\sqrt{V(\varepsilon_j)}} \quad \text{and} \quad b_j = \dfrac{\tau_j}{\lambda_j}$   (11)

Additionally, to ensure model identification, it is necessary to assign a scale to the latent

trait. One method of setting the scale of the latent trait is to standardize it by setting the

mean equal to zero and fixing the variance at one, that is, $\mu_\eta = 0$ and $\sigma^2_\eta = 1$. In this case,

the factor loadings and thresholds can then be converted to item discrimination and

item difficulties using the following expressions (Muthen & Asparouhov, 2002):

$a_j = \lambda_j \quad \text{and} \quad b_j = \dfrac{\tau_j}{\lambda_j}$   (12)


The ease with which estimates of the factor analysis parameters can be converted

to the more recognizable IRT scale should increase the interpretability and utility of the

results to the applied researchers (Fukuhara, 2009). An alternative method of setting

the scale of the latent factor is by fixing one loading per factor to one. As a result, the

simplified conversion formulae will differ and by extension, the magnitude of the

parameter estimates will be affected as well (Bontempo, 2006; Kamata & Bauer, 2008).









Therefore, when running these models with factor analytic programs such as Mplus, the

user should be aware of the default scale setting methods since this invariably will affect

the parameter conversion formulae.
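
Assuming the standardized-factor scale setting just described and the logit metric, a conversion helper along the following lines can be written; it is an illustrative sketch whose formulas depend on the software's parameterization and are not presented here as the exact expressions used in this dissertation.

def loading_threshold_to_irt(lam, tau):
    """Convert a factor loading and threshold to 2PL a and b (logit metric),
    assuming the latent factor has mean 0 and variance 1."""
    a = lam          # discrimination equals the loading under this scaling
    b = tau / lam    # difficulty is the threshold divided by the loading
    return a, b

# Example: a loading of 1.2 and a threshold of 0.6 correspond to a = 1.2, b = 0.5.
print(loading_threshold_to_irt(1.2, 0.6))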

Mixture Modeling as an Alternative Approach to DIF Detection

Traditional DIF detection methods assume a manifest approach in which

examinees are compared based on demographic classifications such as gender

(females vs. males) or race (African Americans vs. Caucasians). While this approach

has a long history and has been used successfully in the past to assess DIF, recent

emerging research has suggested that this perspective may be limiting in scope (Cohen

& Bolt, 2005; De Ayala et al., 2002; Mislevy et al., 2008; Samuelsen, 2005, 2008). As an

alternative to the traditional manifest DIF approach, a latent DIF conceptualization,

rooted in latent class and mixture modeling methods, has been proposed (Samuelsen,

2005, 2008). In this conceptualization of DIF, rather than focusing on a priori examinee

characteristics, this approach assumes that the underlying population consists of a

mixture of heterogeneous, unidentified subpopulations, known as latent classes. These

latent classes exist because of perceived qualitative differences (e.g. different learning

strategies, different cognitive styles etc.) among examinees.

One technique that can be used to examine DIF from a latent perspective is factor

mixture modeling (FMM). Factor mixture models result from merging the factor analysis

model and the latent class model resulting in a hybrid model consisting of two types of

latent variables: continuous latent factors and categorical unobserved classes (Lubke &

Muthen, 2005, 2007; Muthen, Asparouhov, & Rebollo, 2006). The simultaneous

inclusion of these two types of latent variables allows both for the exploration of

unobserved heterogeneity that may exist in the population and for the examination of the









underlying dimensionality within these latent groups (Lubke & Muthen, 2005). As a

result, a primary advantage of factor mixture modeling is its flexibility in allowing for a

wider range of modeling options. For instance, models can be specified for multiple

factors, multiple latent classes and the structure of the within-class models can vary in

the complexity of its relationships not only with the latent factors but with observed

combinations of continuous and categorical covariates as well (Allua, 2007; Lubke &

Muthen, 2005, 2007; McLachlan & Peel, 2000). However, it is important to note that as

more specifications are introduced to a model, not only does the level of complexity of

the model increase, but so does the computational intensity of the estimation process.

Therefore, it is recommended that researchers should be guided by substantive theory

not only in supporting their hypothesis of population heterogeneity but also with regard

to the complexity of the specification of their mixture models (Allua, 2007; Jedidi,

Jagpal, & DeSarbo, 1997).

Previous applied studies highlight several fields where mixture modeling has been

applied successfully to investigate population heterogeneity (Bauer & Curran, 2004;

Jedidi et al., 1997; Kuo et al., 2008; Lubke & Muthen, 2005, 2007; Lubke & Neale, 2006;

Muthen & Asparouhov, 2006; Muthen, 2006). One area in which factor mixture models

have been utilized with some success is that of substance abuse research (Kuo et al.,

2008; Muthen et al., 2006). In their research on alcohol dependency, Kuo et al. (2008)

compared a factor mixture model approach against a latent class and a factor model

approach to determine which of the three best explained the alcohol dependence

symptomology patterns. What they found was that while a pure factor analytic model

provided an unsatisfactory solution, the latent class approach provided a better fit to the









alcohol dependency data. However, the single factor, three-class factor mixture model

provided the best fit to the data and best accounted for the covariation in both the

pattern of symptoms and the heterogeneity of the population. In another study from the

substance abuse literature, Muthen et al. (2006) compared two types of factor mixture

models to a factor analysis and latent class approach in analyzing the responses of 842

pairs of male twins to 22 alcohol criteria items. The findings showed that both factor

mixture models fit the data well and explained heritability both with regard to the

underlying dimensional structure of the data and the latent class profiles of the

heterogeneous population (Muthen et al., 2006).

With regard to DIF, factor mixture models may be used to investigate item

parameter differences between the latent classes of subpopulations of examinees. In

this conceptualization of DIF, the unobserved latent classes represent qualitatively

different groups of individuals whose item responses function differentially across the

classes (Bandalos & Cohen, 2006). To allow for the specification of these models, a

categorical latent variable is integrated into the common factor model specified in

Equations 1 and 2. As a result, the K-class factor mixture model is now expressed as:

y^*_k = \tau_k + \Lambda_k\eta_k + \varepsilon_k \qquad (13)

\eta_k = \alpha_k + \zeta_k \qquad (14)

where the subscript k indicates the parameters that can vary across the latent classes.

Figure 2-4 provides a depiction of a factor mixture model where the unidimensional

latent factor is measured by five observed items and there are K latent classes in the

population. In the diagram, the relationships are specified as follows:









* The arrows from the latent factor to the item responses (from η to the Ys)
represent the factor loadings or the Λ parameters measuring the relationship
between the latent factor and the items.

* The arrows from the latent class variable to the item responses (from Ck to the Ys)
are the class-specific item thresholds conditional on each of the K latent classes.

* The broken-line arrows from the latent class variable to the factor loadings arrows
indicate that these loadings are also class-specific and therefore can vary across
the K latent classes.

* The arrow from the latent class to the latent factor (i.e. from Ck to η) allows for
factor means and/or factor variances to be class-specific as well.

Since the model parameters are allowed to be class-specific, that is, both item

thresholds and factor loadings can be specified as non-invariant, this specification

allows for testing of both uniform and non-uniform DIF.

The focus of this dissertation is on assessing the performance of the factor mixture

model in the detection of uniform DIF. Therefore for this specification, while the item

thresholds are allowed to vary across the levels of the latent class variable, the factor

loadings are constrained to be equal across the K classes. In Mplus, the implementation of the

factor mixture model for DIF detection can be conceptualized as a multiple-group

approach where DIF is tested across latent classes rather than manifest groups.

Therefore using Equations 1 and 2, the CFA mixture model can be reformulated as:

y^*_k = \tau_k + \Lambda_k\eta_k + \varepsilon_k \quad \text{and} \quad \eta_k = \alpha_k + \zeta_k \qquad (15)

where the parameters are as previously defined for each of the k = 1, 2,...,K latent

classes. Once again, it is assumed that the measurement errors have an expected

value of zero and are independent of the latent trait(s), and of each other. Similarly, the

model implied mean and covariance structure for the observed variables in each of the

k = 1, 2,...,K latent classes can be defined as:









\mu_k = \tau_k + \Lambda_k\kappa_k \quad \text{and} \quad \Sigma_k = \Lambda_k\Phi_k\Lambda_k' + \Theta_k \qquad (16)
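To make this notation concrete, the short R sketch below computes the class-specific model-implied mean vector and covariance matrix of Equation 16. All parameter values in the sketch are hypothetical and included only for illustration; they are not taken from this study.

# Illustrative class-k parameter values for a hypothetical 5-item, single-factor model
Lambda_k <- matrix(c(1.0, 0.8, 1.2, 0.9, 1.1), ncol = 1)   # factor loadings
tau_k    <- c(0.1, -0.3, 0.2, 0.0, -0.1)                   # item intercepts/thresholds
kappa_k  <- 0.5                                            # factor mean in class k
Phi_k    <- matrix(1.0)                                     # factor variance in class k
Theta_k  <- diag(0.5, 5)                                    # residual (unique) variances

# Class-specific model-implied moments (Equation 16)
mu_k    <- tau_k + Lambda_k %*% kappa_k                     # implied mean vector
Sigma_k <- Lambda_k %*% Phi_k %*% t(Lambda_k) + Theta_k     # implied covariance matrix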

Three main strategies for using CFA-based approaches to investigate

measurement invariance have been proposed: (i) a constrained-baseline approach

(Stark, Chernyshenko, & Drasgow, 2006), (ii) a free-baseline approach (Stark et al.,

2006), and (iii) a more recent approach which tests the significance of the threshold

differences via the Mplus "model constraint" feature (Clark, 2010; Clark et al., 2009). In

the first two approaches, DIF testing is conducted via a series of tests of hierarchically

nested models (Lee, 2009). The constrained-baseline approach begins with a baseline

model in which all the parameters are constrained equal across groups; the parameters

of the studied item(s) are then freed one at a time. On the other hand, the free-baseline

approach starts with a model in which all parameters (except those needed for model

identification) are freely estimated across groups. After this model has been estimated,

the parameters of the item(s) of interest are constrained in a sequential manner.

With either of these two approaches, a series of nested comparisons is conducted to

determine the level of measurement invariance. For example, if testing for uniform DIF,

the baseline model is compared with models formed by either individually constraining

or releasing the item thresholds of interest across the groups/latent classes. Stark et al.

(2006) noted that whereas the constrained-baseline approach is typically used in IRT

research, the free-baseline approach is more common with CFA-based methods.

However, one disadvantage of these baseline approaches is that they require two

models (i.e. baseline and constrained or augmented) to be fitted to the data, which

increases the complexity of the model estimation procedure.
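As a rough illustration of the nested comparisons on which these baseline strategies rest, the R sketch below computes a chi-square (likelihood ratio) difference test from the log-likelihoods of a hypothetical free-baseline model and a model in which one studied item's thresholds are constrained equal. The numeric values are invented for illustration, and with robust ML estimation a scaled difference test would be required rather than this simple version.

# Hypothetical fit results for two nested models (illustrative values only)
logL_free <- -4821.6   # free-baseline model: studied item's thresholds estimated freely
p_free    <- 48        # number of estimated parameters in the free model
logL_con  <- -4828.9   # constrained model: studied item's thresholds held equal
p_con     <- 47

# Chi-square difference test of the equality constraint
chi_diff <- -2 * (logL_con - logL_free)        # difference in -2 log-likelihoods
df_diff  <- p_free - p_con                      # difference in number of parameters
p_value  <- pchisq(chi_diff, df = df_diff, lower.tail = FALSE)

# A significant result suggests the constraint does not hold, i.e., the studied
# item's thresholds differ across groups/classes (evidence of uniform DIF)
c(chi_square = chi_diff, df = df_diff, p = round(p_value, 4))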









Unlike the two previous methods which use a nested models approach, the third

method does not require the specification of two sets of models. The approach, which

has been credited to Mplus' Tihomir Asparouhov, has recently been used in factor

mixture studies conducted by Clark (2010) and Clark et al. (2009). The Mplus (Muthen

& Muthen, 1998-2008) implementation of this approach as described by Clark (2010) is

as follows:

* The thresholds of all items, except those of a referent needed for identification,
are allowed to vary freely across the latent classes.

* Next, the Mplus "model constraint" option is invoked to create a set of new
variables. Each new variable defines a threshold difference across classes for
each of the items to be tested for DIF. For example, the estimated threshold
difference for item 6 may be defined as: dif_it6 = t2_i6 - t1_i6, where t2_i6 and
t1_i6 are the user-supplied variable names for the threshold of item 6 in class 2
and class 1 respectively.

* The creation of the 14 threshold equations allows for the testing of the item
thresholds across classes via a Wald test to determine whether the differences are
significantly different from zero. A significant p-value provides evidence that the
item is functioning differentially while a non-significant result indicates that the item
is DIF-free.

* In the case where the referent item is not known to be invariant across classes, a
series of tests are undertaken in which each item's thresholds are successively
constrained equal across classes and the threshold differences are estimated for
the remaining items. Finally, a tally is made of the total number of times that each
item displayed a significant p-value, when its thresholds were not constrained.

Since this method does not require the formation of two models, one obvious advantage

is its simplicity. However, unlike the more established baseline procedures, it has not

been subjected to the methodological rigor that should precede the acceptance and

usage of an approach in applied settings.
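The threshold-difference test at the heart of this third approach is a Wald test. The R sketch below shows the computation for a single item, assuming hypothetical class-specific threshold estimates along with their sampling variances and covariance; in practice, Mplus carries out this test internally once the threshold-difference parameters are defined.

# Hypothetical class-specific threshold estimates for item 6 (illustrative values only)
t1_i6 <- -0.42; t2_i6 <- 0.61      # thresholds of item 6 in class 1 and class 2
var_t1 <- 0.045; var_t2 <- 0.052   # squared standard errors of the two estimates
cov_t1t2 <- 0.003                  # sampling covariance of the two estimates

# Wald test of the threshold difference (H0: t2_i6 - t1_i6 = 0)
dif_it6 <- t2_i6 - t1_i6
se_dif  <- sqrt(var_t1 + var_t2 - 2 * cov_t1t2)
wald_z  <- dif_it6 / se_dif
p_value <- 2 * pnorm(-abs(wald_z))   # two-sided p-value

# p < .05 would flag item 6 as displaying uniform DIF across the latent classes
c(difference = dif_it6, z = wald_z, p = round(p_value, 4))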

Estimation of Mixture Models

The purpose of the mixture modeling estimation process is to attempt to

disentangle the hypothesized mixture of distributions into the pre-specified number of









latent classes. Unlike the manifest situation where group membership is observed and

group proportions are known, class membership is unobserved. Therefore an additional

model parameter, known as the mixing proportion, φ, is estimated (Gagne, 2004). The

K-1 mixing proportions estimate the proportion of individuals comprising each of the K

hypothesized classes. Additionally, while individuals obtain a probability for being a

member in each of the K classes, they are assigned to a specific class based on their

highest posterior probability of class membership. To estimate the model parameters,

the joint log-likelihood of the mixture across all observations is maximized (Gagne,

2004). For a mixture of two latent subpopulations, the joint log-likelihood of the mixture

model can be expressed as the maximization of:


\sum_{i=1}^{N} \ln\left[\varphi L_{i1} + (1-\varphi)L_{i2}\right] \qquad (17)

where L_i1 and L_i2 represent the likelihood of the ith examinee being a member of

subpopulation 1 and subpopulation 2 respectively, φ represents the unknown mixing

proportion, and N is the total number of examinees in the sample. Likewise, for K

subpopulations, Gagne (2004) presents the expression for the joint log-likelihood of the

mixture model expressed as:


\sum_{i=1}^{N} \ln \sum_{k=1}^{K} \varphi_k (2\pi)^{-p/2} |\Sigma_k|^{-1/2} \exp\left[-\frac{1}{2}(x_i-\mu_k)'\Sigma_k^{-1}(x_i-\mu_k)\right] \qquad (18)

where μ_k = τ_k + Λ_kκ_k and Σ_k = Λ_kΦ_kΛ_k' + Θ_k.
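As a purely illustrative sketch of Equations 17 and 18, the R code below evaluates a two-class mixture log-likelihood and the posterior class probabilities used for class assignment. It assumes class-specific multivariate normal densities (via the mvtnorm package) and made-up parameter values; it is not the estimation routine used in this study.

# Minimal sketch of the two-class mixture log-likelihood and posterior probabilities
library(mvtnorm)

set.seed(1)
Y   <- matrix(rnorm(200 * 5), ncol = 5)       # hypothetical data: 200 cases, 5 variables
phi <- 0.5                                    # mixing proportion for class 1
mu1 <- rep(-0.3, 5); mu2 <- rep(0.3, 5)       # class-specific implied means
Sig <- diag(5)                                # common implied covariance, for simplicity

L1 <- dmvnorm(Y, mean = mu1, sigma = Sig)     # likelihood of each case under class 1
L2 <- dmvnorm(Y, mean = mu2, sigma = Sig)     # likelihood of each case under class 2

logL <- sum(log(phi * L1 + (1 - phi) * L2))   # joint log-likelihood (Equation 17)

# Posterior probability of class 1 membership; each case is assigned to the class
# with the highest posterior probability
post1      <- phi * L1 / (phi * L1 + (1 - phi) * L2)
assignment <- ifelse(post1 > 0.5, 1, 2)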

Class Enumeration

An important decision to be made is determining the number of latent classes

existing in the population (Bauer & Curran, 2004; Nylund, Asparouhov, & Muthen,

2006). Traditionally, researchers use standard chi-squared based statistics to compare









models. However, in mixture analysis when comparing models with differing numbers of

latent classes, the traditional likelihood ratio test for nested models is no longer

appropriate (Bauer & Curran, 2004; McLachlan & Peel, 2000; Muthen, 2007). Instead,

alternative model selection indices are used to compare competing models with

different numbers of latent classes. These include: (i) information-based criteria such as

the Akaike Information Criterion (AIC; Akaike, 1987), the Bayesian Information Criterion (BIC;

Schwarz, 1978), and the sample-size adjusted BIC (ssaBIC; Sclove, 1987), (ii) likelihood-based

tests such as the Lo-Mendell-Rubin adjusted likelihood ratio test (LMR aLRT; Lo,

Mendell, & Rubin, 2001) and the bootstrapped version of the LRT (BLRT; McLachlan &

Peel, 2000), and (iii) statistics based on the classification of individuals using estimated

posterior probabilities, such as entropy (Lubke & Muthen, 2007). While there has

been limited research comparing the performances of these various model

selection methods, no consistent guidelines have been established for determining which

model selection indices are most useful in comparing models or selecting the best-fitting

model (Lubke & Neale, 2006; Nylund, et al. 2006; Tofighi & Enders, 2008; Yang, 2006).

The reason for this is that there is seldom unanimous agreement across the various

model selection indices and, as a result, misspecification of the number

of classes is a likely occurrence (Bauer & Curran, 2004; Nylund et al., 2006). Therefore,

the researcher should not rely on these indices as the sole determinant of the number of

latent classes. Rather, it is advised that in addition to the statistical indices, a theoretical

justification should also guide not only the selection of the optimal number of classes

but the interpretation of the classes as well (Bauer & Curran, 2004; Gagne, 2006;









Muthen, 2003). The most common information criterion indices for model selection are

introduced below.

Information Criteria Indices

The information criteria measures (e.g. the AIC, BIC, and the sample-size adjusted

BIC) are all based on the log-likelihood of the estimated model and the number of free

parameters in the model. On their own, individual values of these information-based

criteria for a specified model are not very useful. Instead, the

indices are compared across models with varying numbers of

classes. For example, if the hypothesized model is one with two latent classes, this

would be successively compared with one-, three-, and four-class models. Typically,

the model with the lowest information criterion value, compared to the other models with

different numbers of classes, is selected as the best-fitting model (Lubke & Muthen, 2005;

Nylund et al., 2006).

The information criteria such as the AIC, BIC, and ssaBIC are all based on the log-

likelihood and adjust differently for the number of free parameters and sample size

(Lubke & Muthen, 2005; Lubke & Neale, 2006). The AIC, which is defined as a function

of the log-likelihood and the number of estimated parameters, penalizes for

overparameterization but not for sample size:

AIC = -2\log L + 2p \qquad (19)

On the other hand, the BIC and ssaBIC adjust for both the number of parameters and the

sample size (Nylund et al., 2006). The BIC and ssaBIC are given by:

BIC = -2\log L + p\log(N) \qquad (20)


ssaBIC = -2\log L + p\log\left(\frac{N+2}{24}\right) \qquad (21)
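Each index can be computed directly from a fitted model's log-likelihood, its number of free parameters, and the sample size. The R sketch below does so for hypothetical one-, two-, and three-class solutions; the log-likelihoods and parameter counts are invented purely for illustration.

# Information criteria from the log-likelihood (logL), number of free parameters (p),
# and sample size (N); values passed in below are illustrative only
ic_values <- function(logL, p, N) {
  aic    <- -2 * logL + 2 * p
  bic    <- -2 * logL + p * log(N)
  ssabic <- -2 * logL + p * log((N + 2) / 24)   # Sclove's sample-size adjustment
  c(AIC = aic, BIC = bic, ssaBIC = ssabic)
}

# Compare hypothetical one-, two-, and three-class solutions for N = 500;
# for each index, the model with the lowest value is preferred
fits <- rbind(
  "1-class" = ic_values(logL = -4460.2, p = 30, N = 500),
  "2-class" = ic_values(logL = -4418.7, p = 46, N = 500),
  "3-class" = ic_values(logL = -4405.9, p = 62, N = 500)
)
round(fits, 1)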









As was noted earlier, when comparing models with different numbers of latent classes,

lower values of AIC, BIC, and ssaBIC indicate better fitting models. A typical approach

to class enumeration begins with fitting a baseline one-class model and then

successively fitting models with additional classes, with the goal of identifying the mixture model with the

smallest number of latent classes that provides the best fit to the data (Liu, 2008).

However, previous research has found that results from different information criteria can

provide ambiguous evidence regarding the optimal number of classes. In addition,

across different mixture models, there is also inconsistency regarding the model

selection information criterion that performs best.

Nylund et al. (2006) conducted a simulation study comparing the performance of

commonly-used information criteria for three types of mixture models: latent class,

factor mixture, and growth mixture models. Overall, the researchers found that among

the information criteria measures, the AIC, which does not adjust for sample size,

performed poorly and identified the correct k-class model on fewer occasions than the

two sample-size adjusted indices, the BIC and the ssaBIC. Moreover, the AIC

frequently favored the k+1-class model over the correct k-class model. In

addition, whereas the ssaBIC generally performed well with smaller sample sizes

(N=200, 500), the BIC tended to be the most consistent overall performer, particularly

with larger sample sizes (N=1000). Based on their simulation results, Nylund et al.

(2006) concluded that the BIC was the most accurate and consistent of the IC

measures at determining the correct number of latent classes.

Yang (1998) evaluated the performance of eight information criteria in the

selection of latent class analysis (LCA) models for six simulated levels of sample size.









The results suggested that the ssaBIC outperformed the other five IC measures

including the AIC and the BIC. For instance, with smaller sample sizes (N =100, 200)

the ssaBIC had the highest accuracy rates of 62.7% and 77.5% respectively. In

addition, Yang (1998) found that both BIC and a consistent form of the AIC (CAIC)

tended to incorrectly select models with fewer latent classes than actually simulated.

The performance of the BIC and CAIC only improved after the sample size increased to

the largest condition of N=1000. The researcher concluded that in the case of LCA

models, the ssaBIC outperformed the AIC and BIC at determining the correct number of

latent classes (Yang, 1998).

Tofighi and Enders (2007) extended this line of simulation research by evaluating the

accuracy of information-based indices in identifying the correct number of latent classes

in growth mixture models (GMM). Manipulated factors included the number of repeated

measures, sample size, separation of latent classes, mixing proportions, and within-

class distribution shape simulated for a three-class population GMM. The researchers

found that of the ICs, the ssaBIC was most successful at consistently extracting the

correct number of latent classes. Once again, the BIC showed its sensitivity to small

sample sizes and frequently favored too few classes. The accuracy of the ssaBIC

persisted even when the latent classes were not well-separated: whereas the ssaBIC

extracted the correct three-class solution in 88% of the replications, the BIC and CAIC

correctly identified this solution only 11% and 4% of the time, respectively.

In examining the accuracy of model selection indices in multilevel factor mixture

models, Allua (2007) found that while the BIC and ssaBIC outperformed the AIC in

correct predictions when data were generated from a one-class model, none of the fit









indices performed credibly when a two-class model was used as the data-generating

model. In this case all of the fit indices tended to underestimate the number of latent

classes by continuing to favor the one-class model over the "correct" two-class model.

This inconsistency between model fit measures has also been evidenced in applied

studies. Using an illustrative example, Lubke and Muthen (2005) applied factor mixture

modeling to continuous observed outcomes from the Longitudinal Study of American

Youth (LSAY) as a means of exploring the unobserved population heterogeneity. A

series of increasingly invariant models were estimated and compared to a two-factor

single-class baseline model. For each of the models fit to the data, two- through

five-class solutions were specified. The commonly used relative fit indices (AIC, BIC,

ssaBIC and aLRT) were used in choosing the best fitting models. However, there were

several instances of disagreement between the IC results. For example, among the

non-invariant and fully invariant models, while the AIC and the ssaBIC identified the 4-

class solution as the best fitting model, the BIC and aLRT produced their lowest values

for the 3-class solution. In summarizing their results, the authors suggested that in

addition to relying on the model fit measures, researchers should explore the additional

classes, in a similar manner as additional factors are investigated in factor analysis, to

determine if their inclusion provides new, substantively meaningful interpretations to the

solution.

Overall, results from both simulation and applied studies highlight the lack of

agreement among the mixture model fit indices. Researchers have attributed this

inconsistency of performance to the heavy dependence of the indices on the type of

mixture model under consideration as well as the assumptions made about the









populations (Liu, 2008; Nylund et al., 2006). Therefore, rather than viewing any single

measure as superior, each should be seen as a contributory piece of evidence

when comparing one model with another. However, while the model fit

indices provide the statistical perspective, this should be augmented with a

complementary substantive theoretical justification to aid in

both the selection of the optimal number of classes and the interpretability of the latent

classes (Bauer & Curran, 2004).

Mixture Model Estimation Challenges

While factor mixture modeling is an attractive tool for simultaneously investigating

population heterogeneity and latent class dimensionality, it is not without its challenges.

The merging of the two types of latent variables into one integrated framework results in

a model that requires a high level of computational intensity during the estimation

process. As a result, factor mixture models require lengthy computation times which in

turn reduce the number of replications that can be simulated within a realistic time

frame. In addition to the increased computation times, the models are susceptible to

problems due to multiple maxima. Ideally, in ML estimation, as the iterative

procedure progresses, the log-likelihood should monotonically increase until it reaches

a final maximum. However, with mixture models the solution often converges on a local

rather than a global maximum, thereby producing biased parameter estimates. Whether

the expectation-maximization (EM) algorithm converges to a local or global maximum

largely depends on the set of different starting values that are used. Therefore, one

approach to mitigating this problem is to incorporate multiple random starts, a practice

that is permitted in Mplus. In the event that the default number of random starts (in

Mplus, the defaults are 10 random starting sets and the best 2 sets used for final









optimization) is insufficient to converge on a maximum likelihood solution, Mplus allows

the user the flexibility to increase the number of start values. Adjusting the random

starts option to include a larger number of start values both in the initial analysis and

final optimization phases allows for a more thorough investigation of multiple solutions

and should improve the likelihood of successful convergence (Muthen & Muthen, 1998-

2008). However, since the increase in the number of random starts will also increase

the computational load and estimation time, it is recommended that prior to conducting

a full study, researchers experiment with various sets of user-defined starting

values to determine an appropriate number of sets of starting values (Nylund et al.,

2006). During this process, it is important to examine the results from the final stage

solutions to determine whether the best log-likelihood is replicated multiple times. This

helps ensure that the solution converged on a global maximum, thus reducing the possibility

that the parameter estimates are derived from local solutions.

Purpose of Study

In the past, the lack of usage of SEM-based mixture models has been attributed to

an unavailability of commercial software (Bandalos & Cohen, 2006). However, given the

recent innovations integrated in software packages such as Mplus (Muthen & Muthen,

1998-2008), the estimation of SEM mixture models is now possible (Bandalos & Cohen,

2006). The purpose of this study was to evaluate the performance of factor mixture

modeling as a method for detecting items exhibiting manifest-group DIF. In this study,

manifest-group DIF was generated in a set of dichotomous data for a two-group, two-

class population. The questions addressed were as follows.

* How successful is the factor mixture modeling approach at recovering the correct
number of latent classes?









* If the number of classes is known a priori, how well does the factor mixture
model perform at detecting differentially functioning items? Specifically, how are
the (i) convergence rates, (ii) Type I error rates, and (iii) power to detect DIF
affected under various manipulated conditions characteristic of those that may be
encountered in DIF research?

















[Figure omitted: item characteristic curves for Item 1, Group 1 vs. Item 1, Group 2, plotted against the latent trait.]

Figure 2-1. Example of uniform DIF


[Figure omitted: item characteristic curves for Item 1, Group 1 vs. Item 1, Group 2, plotted against the latent trait.]

Figure 2-2. Example of non-uniform DIF



























[Figure omitted: latent response distribution with threshold τ separating the Incorrect and Correct response categories.]

Figure 2-3. Depiction of relationship between y* and y for a dichotomous item





























Figure 2-4. Diagram depicting specification of the factor mixture model









CHAPTER 3
METHODOLOGY

The simulation was conducted in two parts. The first part of the study focused on

the ability of the factor mixture model to recover the correct number of latent classes

under a variety of simulated conditions. In the second phase of the study, the number of

classes was assumed known and the emphasis was on evaluating the performance of

the mixture model at identifying differentially functioning items. Following is a description

of the model as implemented, as well as the study design used in evaluating the

performance of the factor mixture modeling approach to DIF detection.

Factor Mixture Model Specification for Latent Class DIF Detection

The factor mixture model was specified in its hybrid form as having both a single

factor measured by 15 dichotomous items and a categorical latent class variable. The

factor mixture model was formulated in the study as:

y^*_k = \tau_k + \Lambda\eta_k + \varepsilon_k \qquad (22)

\eta_k = \alpha_k + \zeta_k \qquad (23)

where the parameters are as previously defined in Chapter 2 and k = 1 to K indexes the

number of latent classes. To accommodate the testing of uniform DIF, the model was

formulated so that the factor loadings were constrained to be class-invariant but the

item thresholds were allowed to vary across classes. Therefore in Equation 22, the Λ

parameter is not indexed by the k subscript. Overall, the single-factor mixture model

was specified as follows:

1. The factor loadings were constrained equal across the latent classes. For scaling
purposes, the factor loadings of the referent (i.e. item 1) were fixed at one for each
of the latent classes.









2. To ensure identification, the item thresholds of the referent were also held equal
across the latent classes. The thresholds of the remaining 14 items were freely
estimated.

3. One of the factor means was constrained to zero while the remaining factor mean
was freely estimated. For K latent classes, the Mplus default is to fix the mean of
the last or highest numbered latent class to zero (i.e. α_K = 0). Therefore in this
case, the mean of the first class was freely estimated.

4. Factor variances were freely estimated for all latent classes.

Data Generation

The discrimination and difficulty parameters used in this study were adopted from

dissertation research conducted by Wanichtanom (2001). The original test

(Wanichtanom, 2001) consisted of 50 items; however, in this case, parameters for ten of

the 50 items were selected. These ten items from the Wanichtanom (2001) study

represented the DIF-free test items. In the original study, the item discrimination

parameters were drawn from a uniform distribution within a 0 to 2.0 range and the

difficulty parameters from a normal distribution within a -2.0 to 2.0 range (Wanichtanom,

2001). The remaining five DIF items that formed part of the scale reflected low (i.e. 0.5),

medium (i.e. 1.0) and high (i.e. 2.0) levels of discrimination. For the entire 15-item test,

the discrimination a parameters ranged from 0.4 to 2.0, with a mean of 0.98 while the

difficulty b parameters ranged from -1.2 to 0.7 with a mean of -0.34. Uniform DIF was

simulated against the focal group on Items two to six. The values of the item parameters

are presented in Table 3-1.

Data were generated using R statistical software (R Development Core Team,

2009). The ability parameters were drawn from normal distributions for both the

reference and focal groups. For these dichotomous items, the probability of a correct

response was computed using the 2PL IRT model as:










\Pr(Y_{ij} = 1 \mid \theta_j) = \frac{1}{1 + e^{-a_i(\theta_j - b_i)}} \qquad (24)

where a_i is the item discrimination parameter, b_i is the item difficulty parameter, and θ_j

is the latent ability trait for examinee j. To determine each examinee's item response,

the calculated probability Pr(Y_ij = 1 | θ_j) was compared to a randomly generated number

from a uniform U(0,1) distribution. If that probability exceeded the random number, the

examinee's item response was scored as correct (i.e. coded as 1). On the other hand, if

the probability of a correct response was less than the random number the item

response was scored as incorrect and coded as 0. Finally, 50 replications were run for

each set of simulation conditions and the dichotomous item response datasets were

exported to Mplus V5.1 (Muthen & Muthen, 1998-2008) for the analysis phase. Since

the data were generated externally, the Mplus Type=Montecarlo option was used to

analyze the multiple datasets and to save the results for the replications that converged

successfully.
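A condensed R sketch of this generation procedure is shown below. It follows Equation 24 and the item parameters in Table 3-1, but the particular DIF magnitude, impact level, and the way the 80% group-class overlap is imposed are illustrative assumptions rather than the exact study code.

# Sketch of 2PL response generation for one replication (illustrative settings)
set.seed(123)
N <- 500                                     # total sample size (reference + focal)
a <- c(1.0950, 0.5001, 0.5001, 1.0000, 2.0000, 2.0000, 0.5584, 0.9819,
       0.5724, 1.4023, 0.4035, 1.0219, 0.9989, 0.7342, 0.8673)
b <- c(-0.0672, -1.0, -0.5, 0.0, -1.0, 0.0, -0.7024, 0.6450, -0.5478,
       -0.3206, -1.1824, -0.4656, -0.2489, -0.4323, 0.7020)

group <- rep(c("reference", "focal"), each = N / 2)        # 1:1 sample size ratio
theta <- c(rnorm(N / 2, mean = 0.5, sd = 1),               # reference group (impact = .5 SD)
           rnorm(N / 2, mean = 0.0, sd = 1))               # focal group

delta_b     <- 1.0                                         # uniform DIF magnitude
dif_items   <- 2:6                                         # items simulated with DIF
focal_idx   <- which(group == "focal")
dif_targets <- sample(focal_idx, round(0.8 * length(focal_idx)))  # 80% overlap (assumed)

responses <- matrix(NA_integer_, nrow = N, ncol = length(a))
for (j in seq_along(a)) {
  b_j <- rep(b[j], N)
  if (j %in% dif_items) b_j[dif_targets] <- b[j] + delta_b # item harder for targeted focal examinees
  p_correct <- 1 / (1 + exp(-a[j] * (theta - b_j)))        # Equation 24
  responses[, j] <- as.integer(p_correct > runif(N))       # correct if p exceeds a U(0,1) draw
}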

Simulation Study Design

In their 1988 paper, Lautenschlager and Park reiterated the need for Monte Carlo

studies to be designed in such a way that they simulate real data conditions as closely

as possible. This advice was followed when selecting the conditions and levels for this

simulation study. The conditions were chosen to replicate those adopted in previous

latent DIF studies (Bolt, Cohen, & Wollack, 2001; Bilir, 2009; Cohen & Bolt, 2005; De

Ayala et al., 2002; Samuelsen, 2005) and mixture modeling studies (Gagne, 2004; Lee,

2009; Lubke & Muthen, 2005, 2007).









Research Study 1

In the first part of the study dichotomous item responses were generated for the

two-group, two-class scenario. The focus was on determining the success rate of the

specified factor mixture model to recover the correct number of latent classes. Solutions

for one- through three-class mixture models were estimated and three information-

based criteria values were compared across the models. The model with the lowest IC

value was selected as the best-fitting model (Lubke & Muthen, 2005; Nylund et al.,

2006). The fixed and manipulated factors used in this study are listed below.

Manipulated Conditions

Sample size

Previous findings have shown that as with pure CFA models, sample size affects

the convergence rates of mixture models as well (Gagne, 2004; Lubke, 2006). In

evaluating the performance of several CFA mixture models, Gagne (2004) reported a

significant increase in the convergence rates as the sample size was increased from a

minimum of 200 to 500 to 1000. A review of previous simulation and real data mixture

model research found that whereas only a few studies used as few as 200 simulees

(Gagne, 2004; Nylund et al., 2006), sample sizes of at least 500 were most frequently

used (Bolt et al., 2001; Bilir, 2009; Cho, 2007; De Ayala et al., 2002; Rost, 1990;

Samuelsen, 2005). In this study, the two levels of sample size (N=500, N=1000)

were chosen to be representative of realistic research samples and to reduce the

possibility of convergence problems. In addition, the sample size of 500 was used as a

lower limit to examine the effects of small sample size on the performance of the factor

mixture approach to DIF detection.









Magnitude of uniform DIF

In previous DIF studies (Camilli & Shepard, 1987; De Ayala et al., 2005; Meade,

Lautenschlager, & Johnson, 2007; Samuelsen, 2005) the manipulated difficulty shifts

have typically varied in magnitude from .3 to 1.5. Overall, these results have shown

higher DIF detection rates with items simulated to have moderate or strong amounts of

DIF. However, with mixture models, it may be necessary to simulate larger DIF

magnitudes to ensure the detection of DIF. This hypothesis was based on the results

from a preliminary small-scale simulation in which several levels of DIF magnitude were

manipulated. As a result, this study focused on DIF effects at the upper range of the

scale where the magnitude of manifest differential functioning is large, namely, Δb = 1.0

and Δb = 1.5. For items with no DIF, the item difficulties are defined as b_iF = b_iR. On the

other hand, when there is uniform DIF, the items were simulated to function differently

in favor of the reference group and the item difficulties are defined as b_iF = b_iR + Δb

(where Δb = 1.0 or 1.5).

Ability differences between groups

Several researchers have recommended the inclusion of latent ability differences

(i.e. impact) in DIF detection studies since they contend that in real data sets, the focal

and reference populations typically have different latent distributions (Camilli & Shepard,

1987; De Ayala et al., 2002; Donoghue, Holland, & Thayer, 1993; Duncan, 2006; Stark

et al., 2006). Simulation results of the effects of impact on DIF detection have varied.

For instance, some researchers have reported good control of Type I error rates with a

moderate difference of .5 SD (Stark et al., 2006) and even with differences as large as 1

SD (Narayanan & Swaminathan, 1994). On the other hand, others (Cheung &

Rensvold, 1999; Lee, 2009; Roussos & Stout, 1996; Uttaro & Millsap, 1994) have









reported inflated Type I error rates with unequal latent trait distributions. The results are

also mixed with respect to the presence of impact on power. Whereas some studies

have shown reduced power (Ankemann, Witt, & Dunbar, 1999; Clauser, Mazor, &

Hambleton, 1993; Narayanan & Swaminathan, 1996), others (Gonzalez-Roma et al.,

2006; Rogers & Swaminathan, 1993; Shealy & Stout, 1993; Stark et al., 2006) found

that DIF detection rates were not negatively affected by the dissimilarity of latent

distributions. In this first part of the study, two conditions of differences in mean latent

ability were manipulated:

1. Equal latent ability means, with the reference and focal groups both generated from
a standard normal distribution (i.e. θ_R ~ N(0,1), θ_F ~ N(0,1)), and

2. Unequal latent ability means, with the reference group having a latent ability mean
.5 standard deviation higher than the focal group (i.e. θ_R ~ N(0.5,1), θ_F ~ N(0,1)).

Fixed Simulation Conditions

Test length

The test was simulated for a fixed length of 15 dichotomous items. Previous

studies using factor mixture modeling have typically used shorter scale lengths varying

between 4 and 12 observed items for a single-factor model with categorical items

(Lubke & Neale, 2008; Nylund et al., 2006; Kuo et al., 2008; Reynolds, 2008; Sawatzy,

2007). This may be due to the fact that longer computation times are required when

fitting mixture models to categorical data (Lubke & Neale, 2008). Therefore, while more

test items could have been included, this length was chosen not only to be consistent

with previous research, but also to take into account the computational intensity of factor

mixture models.









Number of DIF items

In previous simulation studies, the percentage of DIF items has typically varied

from 0% to 50% as a maximum (Bilir, 2009; Cho, 2007; Samuelsen, 2005; Wang et al.,

2009). For example, Samuelsen (2005) considered cases with 10%, 30% and 50% of

DIF items, Cho (2007) investigated cases with 10% and 30% DIF items, and Wang et

al. (2009) manipulated the number of DIF items in increments of 10% from 0% to 40%.

With respect to real tests, Shih and Wang (2009) reported that they typically contain at

least 20% DIF items. In this study, the percentage of DIF items was 33.3% (five

items), with the DIF items all favoring the reference group. Items 2 through 6 were

selected to display uniform DIF.

Sample size ratio

With respect to the ratio of focal to reference groups, Atar (2007) reports that "in

actual testing situations, the sample size for the reference group may be as small as

the sample size for the focal group or the sample size for the reference group may be

larger than the one for the focal group" (pg. 29). In this study, a 1:1 sample size ratio of

focal to reference group was used for each of the two sample sizes. Using

comparison groups of equal size is representative of an evenly split manifest variable

frequently used in DIF studies such as gender (Samuelsen, 2005).

Percentage of overlap between manifest and latent classes

In the manifest DIF approach, when an item is identified as having DIF, there is an

implied assumption that all members of the focal group must have been disadvantaged

by this item. However, under a latent conceptualization, the view is that DIF is detected

based on the degree of overlap between the manifest groups and the latent classes. In

this context, overlap refers to the percentage of membership homogeneity between the









manifest groups and latent classes. For example, if each of the examinees in either the

manifest-focal or the manifest-reference group belongs to the same latent class, then

this is referred to as 100% overlap. Therefore, as the level of group-class overlap

decreases, there is a corresponding decrease in the level of homogeneity between

groups and classes as well. In Samuelsen's (2005) study, five levels of overlap

decreasing in increments of 10% from 100% to 60% were considered. Samuelsen

(2005) found that as the group-class overlap increased, the power of the mixture

approach to correctly detect DIF increased as well. In this study, the level of overlap

was fixed at 80%, a somewhat realistic expectation of what may be encountered in

practice. This means that DIF was simulated against 80% of the simulees in the focal

group.

Mixing proportion

The mixing proportion (φ_k) represents the proportion of the population in class k,

which was fixed at .50. Although the class membership was known, it was not used in

the simulation.

Study Design Overview

In sum, three fully crossed factors, resulting in eight simulation conditions

(2 sample sizes x 2 DIF magnitudes x 2 latent ability distributions), were manipulated to

determine their effect on the recovery of the correct number of latent classes. For each

of the eight conditions, a total of 50 replications were run. It is important to note that in

the original plan for this study, a larger number of replications was proposed. However,

initial simulation runs revealed that the computational time necessary to complete larger

numbers of replications was impractical for this dissertation. Therefore, given the timing









constraints, a smaller number of data sets (i.e. 50) were replicated. The list of study

conditions is provided in Table 3-2.

Evaluation Criteria

As previously noted, the objective of this first part of the simulation was to

determine the success rate of the factor mixture method in identifying the correct

number of classes. The three likelihood-based model fit indices (AIC, BIC, and ssaBIC)

provided by Mplus were compared, with smaller values indicating better model fit. The

outcome measures evaluated for each of the three (i.e. one- through three- class) factor

mixture solutions fit to the data were:

* Convergence rates: This was represented as the number of replications that
converged to proper solutions across the 50 simulations for each set of the eight
conditions. Data sets with improper or non-convergent solutions were not included
in the analysis.

* IC performance: Performance was evaluated by calculating the average IC
values and comparing the values for each index across the one-, two-, and three-
class models. For each of the simulated conditions, the lowest average IC value
and the corresponding model were identified.

Research Study 2

In the second part of the study, research was conducted to evaluate the Type I

error rate and power performance of the factor mixture model at detecting uniform DIF,

assuming that the correct number of classes is known. With respect to the study design,

two levels of DIF magnitude (DIF = 1.0, 1.5) and two levels of sample size (N = 500,

1000) were again simulated using the same levels as in Study 1. However, an additional

level was included for the impact condition. More specifically, in addition to the no-

impact and moderate impact conditions, a large level of impact (i.e. mean for the

reference group was 1.0 SD higher than the mean of the focal group) was included as

well. The inclusion of this new level permitted a more complete investigation of the









robustness of the factor mixture model in DIF detection to the influence of impact.

Overall, a total of 12 conditions (2 sample sizes x 2 DIF magnitudes x 3 latent trait

distributions) were simulated. In this second phase of the simulation, each condition

was replicated 1000 times. The full list of study design conditions is shown in Table 3-

3.

Data Analysis

The 1000 sets of dichotomous item responses were generated by R V2.9.0 (R

Development Core Team, 2009). The data sets for each of the 12 conditions were

saved and exported from R to Mplus V5.1 for analysis. As was done in the first part of

the study, the Type=Montecarlo facility was used to accommodate the analysis of the

multiple datasets generated external to Mplus and for saving the results for subsequent

analysis. To assess uniform DIF, a simultaneous significance test of the 14 threshold

differences (i.e. with the exception of the referent, Item 1) using a Wald test was

conducted. A p-value less than .05 provided evidence of DIF in the item.

Evaluation Criteria

The outcome measures used in evaluating the performance of this factor mixture

method for DIF detection were as follows:

* Convergence rates: This was measured by the number of replications that
converged to proper solutions across each of the 12 combinations of conditions.
Data sets with improper or non-convergent solutions were not included in the
analysis.

* Type I error rate: The Type I error rate (or false-positive rate) was computed as
the proportion of times the DIF-free items were incorrectly identified as having DIF.
Therefore, the overall Type I error rate was calculated by dividing the total number
of times the nine items (i.e. Items 7-15) were falsely rejected by the total number
of properly converged replications for each of the 12 study conditions. The nominal
Type I error rate used in this study was .05.









* Statistical power: Power (or the true-positive rate) was computed as the
proportion of times that the analysis correctly identified the DIF items as having
DIF. Therefore, the overall power rate was calculated by dividing the total number
of times any one of the five (i.e. Items 2-6) DIF items was correctly identified by
the total number of properly converged replications across each of the 12
simulated conditions. A small computational sketch of both rates follows this list.
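The R sketch below illustrates how these two rates could be tabulated from the replication results, assuming a matrix of Wald-test p-values with one row per properly converged replication and one column per item. The matrix here is filled with random numbers purely so the code runs; it does not reproduce the study's results.

# Tabulating Type I error and power from a (replications x items) matrix of p-values
set.seed(42)
pvals <- matrix(runif(1000 * 15), ncol = 15)   # placeholder for the actual p-values

dif_items   <- 2:6      # items simulated with uniform DIF
nodif_items <- 7:15     # DIF-free items (Item 1 is the referent and is not tested)
alpha       <- .05

flags <- pvals < alpha  # TRUE wherever an item was flagged as displaying DIF

type1 <- mean(flags[, nodif_items])   # false-positive rate across DIF-free items
power <- mean(flags[, dif_items])     # true-positive rate across DIF items

c(TypeI = round(type1, 3), Power = round(power, 3))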

In addition to the computation of the overall Type I error and Power rates of the factor

mixture method, a variance components analysis was also conducted to examine the

influence of each of the conditions and their interactions on the performance of the

method. In this analysis, which was conducted in R V2.9.0 (R Development Core Team,

2009), the independent variables were the three study conditions (DIF magnitude,

sample size and impact) and the dependent variables were the Type I error and power

rates. Eta-squared (η²), which calculates the percentage of variance explained by each

of the main effects and their interactions, was used as a measure of effect size.
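A minimal R sketch of this variance components computation is given below. It assumes the unit of analysis is the per-item rejection rate within each of the 12 design cells (an assumption made only for illustration), and the outcome values are random placeholders. Eta-squared for each effect is computed as its sum of squares divided by the total sum of squares.

# Variance components analysis with eta-squared effect sizes (illustrative data)
set.seed(7)
design <- expand.grid(item   = paste0("item", 7:15),
                      dif    = factor(c(1.0, 1.5)),
                      n      = factor(c(500, 1000)),
                      impact = factor(c(0, 0.5, 1.0)))
design$type1 <- runif(nrow(design), .05, .20)   # placeholder per-item rejection rates

fit    <- aov(type1 ~ dif * n * impact, data = design)   # main effects and interactions
ss     <- summary(fit)[[1]][["Sum Sq"]]
eta_sq <- ss / sum(ss)                                   # eta-squared per effect
names(eta_sq) <- rownames(summary(fit)[[1]])
round(eta_sq, 3)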

Model Estimation

The parameters of the mixture models were estimated in Mplus V5.1 (Muthen &

Muthen, 1998-2008) with robust maximum likelihood estimation (MLR) using the EM

algorithm, which is the default estimator for mixture analysis in Mplus. One of the main

limitations of running a mixture simulation study is the lengthy computation time

needed for model estimation. In the interest of time, the random starts feature, which

randomly generates sets of starting values, was not used in this part of the study.

Instead, true population parameters for the factor loadings, thresholds, and factor

variances were substituted for the starting values in this portion of the analysis. This

change reduced the computation time for model estimation considerably.









Table 3-1. Generating population parameter values for reference group

Item Number a b
1 1.0950 -0.0672
2 0.5001 -1.0000
3 0.5001 -0.5000
4 1.0000 0.0000
5 2.0000 -1.0000
6 2.0000 0.0000
7 0.5584 -0.7024
8 0.9819 0.6450
9 0.5724 -0.5478
10 1.4023 -0.3206
11 0.4035 -1.1824
12 1.0219 -0.4656
13 0.9989 -0.2489
14 0.7342 -0.4323
15 0.8673 0.7020
Note. Item 1 is the referent, therefore its loadings were fixed at 1 and its thresholds
constrained equal across classes. Uniform DIF against the focal group was simulated
on Items 2 to 6.









Table 3-2. Fixed and manipulated simulation conditions used in study 1

Manipulated conditions
  Sample size                  500, 1000
  Magnitude of DIF             1.0, 1.5
  Latent mean distributions    θ_R ~ N(0,1), θ_F ~ N(0,1)
                               θ_R ~ N(.5,1), θ_F ~ N(0,1)

Fixed conditions
  Test length                  15 items
  Number of DIF items          5 items (33.3%)
  Sample size ratio            1:1
  Class proportion             .5
  Overlap                      80%


Table 3-3. Fixed and manipulated simulation conditions used in study 2

Manipulated conditions
  Sample size                  500, 1000
  Magnitude of DIF             1.0, 1.5
  Latent mean distributions    θ_R ~ N(0,1), θ_F ~ N(0,1)
                               θ_R ~ N(.5,1), θ_F ~ N(0,1)
                               θ_R ~ N(1,1), θ_F ~ N(0,1)

Fixed conditions
  Test length                  15 items
  Number of DIF items          5 items (33.3%)
  Sample size ratio            1:1
  Class proportion             .5
  Overlap                      80%









CHAPTER 4
RESULTS

Research Study 1

In this section, the results of the first part of the simulation are presented. To

answer the research question, data were generated for a two-group, two-class

population with five of the 15 items simulated to display uniform DIF. The following

conditions were manipulated in this study: sample size (500, 1000), DIF magnitude (1.0,

1.5), and differences in latent ability means (0 SD, 0.5 SD). The factor mixture model as

formulated in Equations 22 and 23 was applied to determine how successful the method

was at recovering the correct number of classes. For each of the eight condition

combinations, one-, two- and three-class models were fit to the data. These results are

presented in two sections. First, the rates of model convergence for each of the eight

simulation conditions are reported. Second, the information criteria (IC) results, which

were used for model comparison and class enumeration, are discussed. The results for

Study 1 are summarized in Tables 4-1 through 4-4.

Convergence Rates

Table 4-1 presents the data on the number of convergent solutions for each

combination of the eight simulation conditions. As was previously mentioned in the

Methods section, non-convergent cases were excluded from the analysis; therefore, for

some conditions results were based on fewer than 50 replications. The results showed

that overall the convergence rates were very high (ranging from .82 to 1.0), and there

were minimal convergence problems. Of the 1200 (50x3x8) replications, 1147

successfully converged resulting in a 96% overall convergence rate. In addition, as the

number of latent classes was increased, there was a corresponding decrease, albeit









minimal, in the number of properly converged solutions. More specifically, while the

one-class model attained perfect convergence rates, the average convergence rates for

the two- and three-class mixture models were 96% and 91% respectively. An

inspection of the results also revealed a positive relationship between the convergence

rate and the DIF magnitude. Of the 16 replications that failed to converge in the two-class

model, 15 of them were for the smaller DIF condition. A similar trend was observed with

the three-class model. Namely, of the 37 replications that failed to produce a properly

convergent solution in the three-class model, 27 were associated with the smaller

DIF=1.0 condition. The cases with non-convergent solutions were excluded from the

second part of this analysis.

Class Enumeration

Summary data based on the three IC measures (AIC, BIC, and ssaBIC) for the

one-, two-, and three-class models are provided in Tables 4-2 through 4-4. In comparing

the fit of the models across classes, the smallest average IC value was used as the

criterion in selecting the "best-fitting" model. An examination of the average IC values

highlighted both overall and IC-specific patterns of results. First, as expected there is a

general increase in the average IC values as sample size increases. Second, it is

observed that the differences in average IC values between neighboring competing

models were generally not substantial, and even negligible under some conditions.

Third, with respect to the individual indices, a high level of inconsistency in model

selection patterns is observed. The results for the three indices are described in more

detail in the following sections.









Akaike Information Criterion (AIC)

The average AIC values across the three specified mixture models are presented

in Table 4-2. Overall, the pattern of results shows that the AIC tended to over-extract

the number of latent classes. This trend was observed for six of the eight simulated

conditions where the lowest AIC values corresponded to the three-class mixture model.

The only exceptions to this pattern occurred for two of the four conditions when the DIF

magnitude was increased to 1.5. In these cases, the lowest average AIC values

occurred at the "correct" two-class model. However, it is important to note that the

differences between neighboring class solutions were rather small, with the largest

absolute difference between values being less than 40 points. Moreover, the differences

are practically negligible between the two- and three-class models ranging in absolute

magnitude over the eight simulated conditions from .02 to 8.72. Although smaller IC

values are indicative of better model fit, given the minor differences between the

average AIC values, it makes selection between these two models a less "clear-cut"

decision.

Bayesian Information Criterion (BIC)

The BIC results are presented in Table 4-3. Based on the average BIC values, this

index consistently selected the simpler one-class model as the correct model for the

data. For each of the eight manipulated simulation conditions, the lowest values

corresponded to the one-class mixture model. The IC differences

between neighboring class models were generally larger for the BIC than for the

corresponding AIC solutions. More specifically, the differences between neighboring

class models ranged in absolute magnitude, between 40 and 88 points on average. The

differences between the one-class and the "correct" two-class model were minimized









when the DIF magnitude was increased from 1.0 to 1.5. In these cases, even though

the lowest average IC values corresponded to the one-class model, the average

values are so similar in magnitude that it is difficult to unequivocally choose the one-

class solution as the best-fitting model.

Sample-size adjusted BIC (ssaBIC)

Summary values for the ssaBIC compared across the three mixture models are

presented in Table 4-4. These results reflected patterns seen with both the AIC and the

BIC. First, similar to the BIC, the ssaBIC values suggested the simpler one-class model

under conditions where the magnitude of the simulated uniform DIF was at the lower value

of 1.0. However, the index also exhibited a pattern similar to that of the AIC by

associating the smallest average IC values with the "correct" two-class model when

larger uniform DIF of 1.5 was simulated. Finally, as was the case with the other two IC

measures, the magnitude of differences across the three models was small. This was

especially true of the differences between the one- and two- class solutions, which

ranged on average from 5.0 to 34.1 points.

Research Study 2

In the second part of the study, the objective was to evaluate the Type I error rate

and power of the factor mixture approach in the detection of DIF. The manipulated

conditions again included sample size, magnitude of DIF and differences in latent ability

means. In addition to the conditions used in the first phase of the study, one additional

level of latent mean differences was included. For a measure of large impact, the mean

for the reference group was simulated to be 1.0 SD higher than the focal

group. Therefore, a total of 12 conditions were manipulated: 2 DIF magnitudes (1.0, 1.5)

x 2 sample sizes (500, 1000) x 3 differences in latent trait means (0, 0.5 SD, 1.0 SD).









One thousand replications were generated for each of the 12 simulation conditions

examined in the Type I error rate and Power studies. The results for the Type I error

rate and statistical power are addressed in the sections below.

Nonconvergent Solutions

In this part of the study, population parameters replaced the starting values

randomly generated by Mplus. This change substantially reduced the computational

load and decreased the model estimation time. The convergence rates across each of

the conditions are presented in Table 4-5. Overall, results indicate no convergence

problems, with rates ranging between 99.4% and 100%.

Type I Error Rate

The factor mixture model was evaluated in terms of its ability to control the Type I

error rate under a variety of simulated conditions. Of the 15 items, nine were simulated

to be DIF-free. The Type I error rate was assessed by computing the proportion of times

the nine DIF-free items were incorrectly identified as having DIF. An item was considered to

display DIF if the differences in thresholds were significantly different from zero.

Therefore for the nine non-DIF items, the Type I error rate was computed as the

proportion of times that the items obtained p-values less than .05. The Type I error rates

across the 12 simulation conditions are presented in Table 4-6. The values in the table

represent the proportion of times that the method incorrectly flagged a non-DIF item as

displaying DIF.

The results in Table 4-6 indicate that the factor mixture analysis method did not

perform as well as expected in controlling the Type I error rate. The results showed

elevated Type I error rates across all the study conditions, which means that the

approach consistently produced false identifications at a rate exceeding the nominal









alpha level of .05. Overall, the average Type I error rate was 11.8%, which even after

accounting for random sampling error would still be considered unacceptably high.

Across the individual conditions, the error rates ranged from .09 to .16. Not surprisingly,

the factor mixture method exhibited its strongest control of the rate of incorrect

identifications for conditions of large DIF magnitude (DIF = 1.5), large sample size (N =

1000), and where there was either none or a moderate (0.5 SD) amount of impact. An

initial examination of the pattern of results suggested that while the sample size and DIF

magnitude are inversely related to Type I error rate, an increase in the mean latent trait

differences resulted in slightly higher Type I error rates. For example, for the cells with

DIF magnitude of 1.0, sample size of N = 500, and no impact, the Type I error rate was

0.12; however, when the latent trait means differed by 1.0 SD, the rate of false

identifications increased marginally to 0.16. A more detailed discussion of the effect of

each of the three conditions is presented in the following sections.

Magnitude of DIF

Tables 4-7 and 4-8 display the aggregated results for the effect of the two levels of

DIF magnitude (1.0 and 1.5) on Type I error rates. Overall, the rates of false

identifications showed a slight decrease as the magnitude of DIF was increased. For

example, when DIF of 1.0 was simulated, error rates across the conditions were

between .10 and .16, with an average rate of .13. However, for larger DIF of 1.5, the

rates ranged from .09 to .12, averaging at .10. Regardless of the size of DIF, the inflated

rates were most pronounced for the smaller sample size of N=500 and when the

difference in latent trait means was maximized (1.0 SD).









Sample size

The results in Tables 4-9 and 4-10 suggest a weak inverse relationship between

sample size and the ability of the factor mixture method to control Type I error rates. At

the smaller sample size (N=500), the rate of false identifications ranged from .10 to .16,

with an average rate of .12. Of the six cells associated with the smaller sample size, the

test showed greatest control of the Type I error when larger DIF (1.5) was simulated

and there was equality of the latent trait means. Increasing the sample size to 1000

decreased the Type I error rates marginally. Across the six conditions, the error rates

were now between .09 and .14, averaging at .11, a negligible decrease from the

average rate when N=500. However, the pattern of false identifications remained

consistent across sample sizes: poor Type I error control was observed when smaller

DIF (1.0) was simulated and there was large impact (1.0 SD); in contrast, improved

control was observed for larger DIF magnitude (1.5) and in the absence of impact.

Impact

Three levels of impact (0, .5 SD, and 1.0 SD) were simulated in favor of the

reference group. The aggregated Type I error rates, which are summarized in Tables 4-

11 through 4-13, showed that the differences in latent trait means between groups had

no appreciable effect on the rate of incorrect identifications. The Type I error rates for the

no-impact, 0.5 SD and 1.0 SD conditions increased marginally from .11 to .12 to .13, a

change that can be attributed to the presence of random error. Though not below the

nominal alpha value of .05, the Type I error rates were best controlled when both DIF

(1.5) and sample size (N=1000) were large.









Variance components analysis

Following the descriptive analysis of the pattern of Type I error rates across the

simulated conditions, a variance components analysis was conducted to specifically

examine the influence of each of the simulation conditions and interaction of the

conditions on the Type I error rates. The results of this analysis are presented in Table

4-14. Based on the eta-squared (η2) values, which ranged from 0.000 to 0.007, the only factor

contributing to the variance in Type I error rates was the magnitude of DIF, accounting

for a mere 0.7%. All other main effects and interactions produced trivial η2 values.
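
A variance components summary of this kind can be obtained by treating the condition factors as predictors of the replication-level error rates and expressing each effect's sum of squares as a share of the total. The sketch below is a minimal illustration that assumes a hypothetical file type1_rates.csv with columns dif, n, impact, and error_rate; it is not the analysis code used in this study.

import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Hypothetical input: one row per replication per cell of the 2 x 2 x 3 design,
# with that replication's proportion of false identifications in 'error_rate'.
df = pd.read_csv("type1_rates.csv")

# Full-factorial ANOVA with all main effects and two- and three-way interactions.
model = smf.ols("error_rate ~ C(dif) * C(n) * C(impact)", data=df).fit()
aov = anova_lm(model, typ=2)

# Eta-squared: each effect's sum of squares as a share of the total sum of squares.
eta_sq = aov["sum_sq"] / aov["sum_sq"].sum()
print(eta_sq.round(3))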

Statistical Power

In the analyses above, the proportion of false DIF detections produced by the

factor mixture approach consistently exceeded the nominal value of 0.05. Typically,

when Type I error rates are inflated to this degree, power rates are no longer interpretable in terms of the standard

alpha level. In this case, the power rates were still analyzed and are displayed in

Tables 4-15 through 4-24. However, it is important to note that these results should be

interpreted with caution given the elevated Type I rates.

Power was assessed as the proportion of times across the 1000 replications that

the factor mixture analysis correctly identified the five items (i.e. Items 2 to 6) simulated

as having uniform DIF. Typically, values of at least .80 indicate that the analysis method

is reasonably accurate in correctly detecting items with DIF. Results for the power

analysis are displayed in Tables 4-15 through 4-22.
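
For illustration, the corresponding power tabulation amounts to a per-condition mean of a detection indicator over the five studied items and the 1,000 replications. The sketch below assumes a hypothetical long-format file dif_detections.csv (columns dif, n, impact, item, detected) and simply flags which cells reach the .80 benchmark; it is not the study's own tabulation code.

import pandas as pd

# Hypothetical long-format results: one row per replication and studied DIF item,
# with 'detected' equal to 1 when the item's threshold difference was significant.
results = pd.read_csv("dif_detections.csv")

# Empirical power per condition: mean detection rate over the five DIF items
# and the 1000 replications within each cell of the design.
power = results.groupby(["dif", "n", "impact"])["detected"].mean()

print(power.round(3))
print("Cells reaching the .80 benchmark:")
print(power[power >= 0.80])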

The overall accuracy of DIF detection of factor mixture analysis was 0.447, with

the power of correct detection ranging from .264 to .801 across all simulated conditions.

The only combination of conditions for which an acceptable level of power was achieved

was when larger DIF (1.5) and sample size (N=1000) were simulated and impact was









absent. For all other conditions the test failed to maintain adequate power. An initial

examination of these results suggests that whereas higher rates of DIF detection were

positively associated with DIF magnitude and sample size, there was a seemingly weak

negative effect of impact. A more detailed discussion of the effect of each of the three

conditions on DIF power rates follows.

Magnitude of DIF

As expected, increasing the magnitude of DIF significantly improved the power

performance of factor mixture DIF detection (refer to Tables 4-16 and 4-17). On one

hand, when DIF of 1.0 was simulated, the detection rates ranged on average from .264

to .350. On the other, average detection rates ranged from .425 to .801 when larger DIF

(1.5) was simulated in the items. Overall, similar detection patterns were observed at

both levels of DIF: the accuracy of power detection was highest with larger sample sizes

(N=1000) and in the absence of impact. In direct contrast, power was notably reduced

when smaller sample sizes (N=500) and maximum impact (1.0 SD) were simulated.

Sample size

As shown in Tables 4-18 and 4-19, power rates were positively related to sample

size; a result that was not unexpected. For sample size conditions of N=500, the DIF

detection rate was .375, on average. However, a marked improvement in detection

performance was observed (.520) when the sample size was increased to N=1000. A

comparison across the two levels of sample size reveals that the factor mixture

procedure exhibited its greatest power to detect DIF under the combined conditions of

large DIF magnitude (1.5) and equality of latent trait means.









Impact

The effect of impact on DIF detection rates was also examined. The three levels

investigated were: (i) equal latent trait means, (ii) a 0.5 SD difference between latent

means, representing a moderate amount of impact, and (iii) a 1.0 SD difference

between latent trait means, representing a large amount of impact. The aggregated

results in Tables 4-20 through 4-22 show that as the difference in latent trait means

between the groups was increased there was a negligible decline in the accuracy of the

factor mixture method to detect DIF. For example, the average power rate decreased

marginally from .486 to .455 to .401 under the no-impact, 0.5 SD, and the 1.0 SD

conditions respectively. These results show that the presence of impact did not

adversely affect the ability of the factor mixture approach to detect DIF.

Effect of item discrimination parameter values

For the five items simulated to contain DIF, three different levels of item

discrimination were selected. For two items (Items 2 and 3), the discrimination

parameter value was set at 0.5 to mimic low discriminating items; the a-parameter for one item

(Item 4) was set at 1.0, representing a medium level of discrimination; and two items

(Items 5 and 6) with an a-parameter of 2.0 represented highly discriminating items. The

discrimination parameter values for the non-DIF items were randomly selected from a

normal distribution within a 2 range.
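
For readers who wish to see how such item-level specifications translate into generated data, the sketch below simulates dichotomous responses from a 2PL model for a reference and a focal class, with uniform DIF imposed as a difficulty shift on the studied items. The parameter values are illustrative placeholders (only the a-values for Items 2 through 6 follow the levels described above), not the exact generating values used in this study.

import numpy as np

rng = np.random.default_rng(1)

n_per_class = 500                 # examinees per latent class
a = np.array([1.0, 0.5, 0.5, 1.0, 2.0, 2.0] + [1.0] * 9)   # illustrative a-parameters
b = np.zeros(15)                                            # illustrative b-parameters
dif_items = np.arange(1, 6)       # Items 2-6 (0-based indices 1-5) carry uniform DIF
dif_shift = 1.0                   # uniform DIF: difficulty shift for the focal class
impact = 0.0                      # latent trait mean difference between classes

def gen_2pl(theta, a, b):
    # 2PL response probabilities: P(u = 1) = 1 / (1 + exp(-a * (theta - b)))
    p = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b)))
    return (rng.uniform(size=p.shape) < p).astype(int)

theta_ref = rng.normal(0.0, 1.0, n_per_class)       # reference class
theta_foc = rng.normal(-impact, 1.0, n_per_class)   # focal class (lower mean if impact > 0)

b_foc = b.copy()
b_foc[dif_items] += dif_shift     # DIF items are uniformly harder for the focal class

data = np.vstack([gen_2pl(theta_ref, a, b), gen_2pl(theta_foc, a, b_foc)])
print(data.shape, data.mean(axis=0).round(2))        # proportion endorsing each item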

The power rates for DIF detection categorized by the level of item discrimination

are shown in Table 4-23. These results show, as expected, that power is influenced by

the item discrimination parameter. More specifically, the accuracy of DIF detection

increased as the item discrimination values increased. The factor mixture method had

on average a .369 rate of detecting DIF in low discriminating items; this increased to









.495 and .502 when DIF was simulated in items with medium and high values on the a-

parameter respectively. Moreover, while there was generally a clear difference in the

accuracy of DIF detection between the low discriminating items (a=.5) and either the

medium or highly discriminating items; no discernible differences were evident when

comparing DIF detection rates between items with medium (a=1.0) and high values

(a=1.5) on the a-parameter. The patterns of DIF detection discussed earlier remained

consistent across the simulation conditions regardless of the items' discriminating

ability.

Variance components analysis

Finally, a variance components analysis was conducted to determine the influence

on power rates of the simulation conditions and their interactions. In this analysis, the

power rates across the five DIF items were used as the dependent variable while the

simulated conditions served as the independent variables. As was expected, the results

showed that of the main effects, DIF magnitude was the largest contributor,

accounting for 19% of the variance in the power rates. It was followed by sample size

with approximately 5% and the interaction between these two factors with 1.2%. Each of

the other terms contributed less than 1.0% to the variance in the power rates. These

results are shown in Table 4-24.









Table 4-1. Number of converged replications for the three factor mixture models
DIF Sample Ability One-class Two-class Three-class
Magnitude Size Differences
1.0 500 0 50 46 43
0.5 50 46 46
1000 0 50 45 41
0.5 50 48 43
1.5 500 0 50 50 47
0.5 50 49 46
1000 0 50 50 49
0.5 50 50 48









Table 4-2. Mean AIC values for the three mixture models
DIF        Sample  Ability       One-class    Two-class    Three-class
Magnitude  Size    Differences
1.0        500     0              9,520.64     9,509.62     9,501.86
                   0.5            9,310.67     9,305.81     9,298.37
           1000    0             18,956.98    18,961.67    18,961.03
                   0.5           18,578.46    18,578.01    18,569.40
1.5        500     0              9,495.81     9,473.12     9,473.10
                   0.5            9,314.87     9,284.66     9,293.38
           1000    0             18,953.93    18,916.12    18,918.62
                   0.5           18,559.08    18,556.92    18,550.50
Note: AIC = Akaike Information Criterion


Table 4-3. Mean BIC values for the three mixture models
DIF        Sample  Ability       One-class    Two-class    Three-class
Magnitude  Size    Differences
1.0        500     0              9,647.08     9,707.71     9,771.59
                   0.5            9,437.11     9,503.89     9,568.11
           1000    0             19,104.21    19,192.34    19,275.12
                   0.5           18,725.70    18,808.68    18,883.21
1.5        500     0              9,622.25     9,671.21     9,742.83
                   0.5            9,441.31     9,482.75     9,563.11
           1000    0             19,101.16    19,146.78    19,232.72
                   0.5           18,706.31    18,787.58    18,864.60
Note: BIC = Bayesian Information Criterion


Table 4-4. Mean ssaBIC values for the three mixture models
DIF        Sample  Ability       One-class    Two-class    Three-class
Magnitude  Size    Differences
1.0        500     0              9,551.86     9,558.53     9,568.45
                   0.5            9,341.89     9,354.71     9,364.97
           1000    0             19,008.93    19,043.06    19,071.86
                   0.5           18,630.42    18,659.40    18,679.95
1.5        500     0              9,527.02     9,522.03     9,539.69
                   0.5            9,346.09     9,333.56     9,359.97
           1000    0             19,005.88    18,997.51    19,029.45
                   0.5           18,611.03    18,638.31    18,661.33
Note: ssaBIC = sample size adjusted Bayesian Information Criterion









Table 4-5. Percentages of converged solutions across study conditions
DIF Magnitude Sample Size Ability Differences Percentage of converged
solutions
1.0 500 0 99.8
0.5 99.5
1.0 99.4
1000 0 100.0
0.5 99.7
1.0 99.6
1.5 500 0 99.8
0.5 99.7
1.0 99.7
1000 0 100.0
0.5 100.0
1.0 99.8


Table 4-6. Overall Type I error rates across study conditions
DIF   Sample Size   Impact   Error rates
1.0   500           0        0.123
                    0.5      0.126
                    1.0      0.159
      1000          0        0.131
                    0.5      0.129
                    1.0      0.138
1.5   500           0        0.097
                    0.5      0.112
                    1.0      0.116
      1000          0        0.092
                    0.5      0.092
                    1.0      0.100









Table 4-7. Type I error rates for DIF = 1.0
Sample size   Impact   Error rates
500           0        0.123
500           0.5      0.126
500           1.0      0.159
1000          0        0.131
1000          0.5      0.129
1000          1.0      0.138


Table 4-8. Type I error rates for DIF = 1.5
Sample size   Impact   Error rates
500           0        0.097
500           0.5      0.112
500           1.0      0.116
1000          0        0.092
1000          0.5      0.092
1000          1.0      0.100


Table 4-9. Type I error rates for sample size of 500
DIF   Impact   Error rates
1.0   0        0.123
1.0   0.5      0.126
1.0   1.0      0.159
1.5   0        0.097
1.5   0.5      0.112
1.5   1.0      0.116


Table 4-10. Type I error rates for sample size of 1000
DIF   Impact   Error rates
1.0   0        0.131
1.0   0.5      0.129
1.0   1.0      0.138
1.5   0        0.092
1.5   0.5      0.092
1.5   1.0      0.100









Table 4-11. Type I error rates for impact of 0 SD
DIF   Sample size   Error rates
1.0   500           0.123
1.0   1000          0.131
1.5   500           0.097
1.5   1000          0.092


Table 4-12. Type I error rates for impact of 0.5 SD
DIF   Sample size   Error rates
1.0   500           0.126
1.0   1000          0.129
1.5   500           0.112
1.5   1000          0.092


Table 4-13. Type I error rates for impact of 1.0 SD
DIF   Sample size   Error rates
1.0   500           0.159
1.0   1000          0.138
1.5   500           0.116
1.5   1000          0.100


Table 4-14. Variance components analysis for Type I error
Condition              η2
DIF Magnitude (D) .007
Sample size (S) .000
Impact (I) .001
D*S .000
D*I .000
S*I .000
D*S*I .000









Table 4-15. Overall power rates across study conditions
DIF Sample Size Impact Power
1.0 500 0 0.268
0.5 0.267
1.0 0.264
1000 0 0.350
0.5 0.324
1.0 0.291
1.5 500 0 0.525
0.5 0.498
1.0 0.425
1000 0 0.801
0.5 0.731
1.0 0.623



Table 4-16. Power rates for DIF of 1.0
Sample size Impact Power
500 0 0.268
0.5 0.267
1.0 0.264
1000 0 0.350
0.5 0.324
1.0 0.291




Table 4-17. Power rates for DIF of 1.5
Sample size Impact Power
500 0 0.525
0.5 0.498
1.0 0.425
1000 0 0.801
0.5 0.731
1.0 0.623









Table 4-18. Power rates for sample size N of 500
DIF   Impact   Power
1.0   0        0.268
1.0   0.5      0.267
1.0   1.0      0.264
1.5   0        0.525
1.5   0.5      0.498
1.5   1.0      0.425


Table 4-19. Power rates for sample size N of 1000
DIF   Impact   Power
1.0   0        0.350
1.0   0.5      0.324
1.0   1.0      0.291
1.5   0        0.801
1.5   0.5      0.731
1.5   1.0      0.623


Table 4-20. Power rates for impact of 0 SD
DIF   Sample Size   Power
1.0   500           0.268
1.0   1000          0.350
1.5   500           0.525
1.5   1000          0.801


Table 4-21. Power rates for impact of 0.5 SD
DIF   Sample Size   Power
1.0   500           0.267
1.0   1000          0.324
1.5   500           0.498
1.5   1000          0.731


Table 4-22. Power rates for impact of 1.0 SD
DIF   Sample Size   Power
1.0   500           0.264
1.0   1000          0.291
1.5   500           0.425
1.5   1000          0.623









Table 4-23. Power rates for DIF detection based on item discrimination
DIF   Sample  Impact   a = .5   a = .5   a = 1.0   a = 2.0   a = 2.0
      Size
1.0   500     0.0      .184     .194     .273      .332      .359
              0.5      .200     .220     .273      .321      .323
              1.0      .200     .227     .273      .326      .296
      1000    0.0      .261     .257     .387      .430      .416
              0.5      .248     .243     .363      .372      .392
              1.0      .226     .238     .332      .300      .357
1.5   500     0.0      .409     .425     .612      .598      .580
              0.5      .411     .401     .561      .546      .571
              1.0      .340     .360     .465      .481      .476
      1000    0.0      .731     .728     .885      .841      .819
              0.5      .640     .655     .814      .782      .763
              1.0      .531     .528     .697      .675      .685
Mean                   .365     .373     .495      .500      .503


Table 4-24. Variance components analysis for power results
Condition              η2
DIF Magnitude (D)      .190
Sample size (S)        .046
Impact (I)             .009
D*S                    .012
D*I                    .005
S*I                    .002
D*S*I                  .001









CHAPTER 5
DISCUSSION

This study was designed to evaluate the overall performance of the factor mixture

analysis in detecting uniform DIF. Specifically, there were two primary research goals,

namely: (i) to assess the ability of the factor mixture approach to correctly recover the

number of latent classes, and (ii) to examine the Type I error rates and statistical power

associated with the approach under various study conditions. Using data generated by

a 2PL IRT framework, a Monte Carlo simulation study was conducted to investigate the

properties of the proposed factor mixture model approach to DIF detection. First, a 15-

item dichotomous test was simulated for a two-group, two-class population.

In both parts of the study, the effect of DIF magnitude, sample size and differences in

latent trait means on the performance of the mixture approach were examined. First,

the major findings of each phase of the simulation are summarized. This will be followed

by a discussion of the limitations of this study and suggestions for future research.

Class Enumeration and Performance of Fit Indices

In assessing the ability of the factor mixture approach to recover the

correct number of latent classes, models with one through three latent classes were fit

to the simulated data. In addition, three commonly-used information criteria indices

(AIC, BIC, and ssaBIC) were used in the selection of the "correct" model. Overall, there

was a high level of inconsistency among the three ICs. In this study, the AIC tended to

over-extract the number of classes and under the majority of study conditions supported

the more complex but "incorrect" three-class model over the "true" two-class model.

This behavior was sharply contrasted with that of the BIC, which tended to

underestimate the correct number of latent classes and consistently favored the simpler









one-class model. In contrast to the distinctly different results produced by the AIC and

BIC, the ssaBIC produced more balanced results by showing a preference for the two-

class model over the one-class model as the magnitude of DIF simulated between groups

increased. Moreover, of the three factors examined (magnitude of DIF, sample size, and

presence of impact) the patterns of model selection were most affected by the change

in DIF magnitude. However, while the behavior of the three ICs was influenced when

larger amounts of DIF were simulated, the effect was different across ICs. For example,

when the DIF magnitude was increased from 1.0 to 1.5, the ssaBIC identified the two-

class model under three of the four conditions. In the case of the AIC, the two-class

model had its lowest average IC values for two of the four conditions. And while the BIC

still tended to favor the one-class model, the differences between the one-class and

two-class model were minimized on increasing the DIF magnitude from 1.0 to 1.5.

Therefore, the ssaBIC was most affected by the presence of larger DIF, followed by the

AIC and lastly the BIC.
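
For reference, all three indices are simple penalized functions of the maximized log-likelihood; they differ only in how heavily the number of free parameters is penalized, which is the source of the divergent class-enumeration behavior described above. The sketch below follows the standard definitions (with the ssaBIC based on Sclove's adjusted sample size, (n + 2)/24, as implemented in Mplus); the log-likelihoods and parameter counts are invented for illustration.

import math

def information_criteria(loglik, n_params, n):
    # AIC    = -2*logL + 2*p
    # BIC    = -2*logL + p*ln(n)
    # ssaBIC = -2*logL + p*ln((n + 2) / 24)
    aic = -2 * loglik + 2 * n_params
    bic = -2 * loglik + n_params * math.log(n)
    ssabic = -2 * loglik + n_params * math.log((n + 2) / 24)
    return aic, bic, ssabic

# Invented values: a one-class versus a two-class solution fit to the same data.
for label, loglik, p in [("one-class", -4730.0, 30), ("two-class", -4700.0, 46)]:
    aic, bic, ssabic = information_criteria(loglik, p, n=500)
    print(f"{label}: AIC = {aic:.1f}, BIC = {bic:.1f}, ssaBIC = {ssabic:.1f}")

Because ln((n + 2)/24) is much smaller than ln(n) at the sample sizes studied here, the ssaBIC penalizes additional classes less severely than the BIC but more than the AIC, which is consistent with its intermediate behavior.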

In discussing these findings, it is important to note that the results of this Monte

Carlo study, though disappointing, were not totally unexpected, since previous research

studies have also reported similar inconsistent performances for these fit indices (Li et

al., 2009; Lin & Dayton, 1997; Nylund et al., 2006; Reynolds, 2008; Tofighi & Enders,

2007; Yang, 2006). Additionally, the pattern of results exhibited in this study by the

indices has also been observed in other mixture model studies. For example, in

research conducted by Li et al. (2009), Lin & Dayton (1997), and Yang (1998), the

authors observed similar patterns of behavior, namely, the tendency of the AIC to

overestimate the true number of classes and the BIC to select simpler models with a









smaller number of latent classes. On the other hand, while simulation results from

Nylund et al. (2006) supported the finding of the AIC favoring models with more latent

classes, their study found the BIC to be the most consistent indicator of the true number of

latent classes. This latter result contrasted with other studies which touted the merits of

the ssaBIC for class enumeration over the BIC (Henson, 2004; Yang, 2006; Tofighi &

Enders, 2007). Therefore, given the inconsistencies in results, no single information

criterion can be regarded as the most appropriate for class enumeration across

all types of finite mixture models. Liu (2009) argued that because the performances of

the indices depend heavily on the estimation model and the population assumptions,

these inconsistencies should be expected. In addition, because to date no full-scale

study has been conducted comparing the performance of these indices for factor

mixture DIF applications, no definite conclusion can be reached regarding the index that

is best suited for this type of factor mixture application. Clearly, this represents an

opportunity for future research.

Results from this study also point to several instances where negligible differences

in IC values between neighboring class models were observed. Therefore, even though

a model may have produced the lowest average IC value, the IC value of the k+1 or k-1

class model did not differ substantially from that of the k-class model. In cases such as

this, the absence of an agreed-upon standard for calculating the significance of these IC

differences increases the ambiguity of the selection of the "correct" model. This

presents the opportunity for the creation of such a significance statistic, a possibility that

will be explored later as a potential area for further research.









Overall, the ambiguity of these findings serves to reinforce the point that was made

earlier, namely, that the IC results should never be relied upon as the sole determinant

of the number of classes. Several researchers have stressed the importance of

incorporating substantive theory in guiding the model selection decision (Allua, 2007;

Bauer and Curran, 2004; Kim, 2009; Nylund et al., 2007; Reynolds, 2008). Moreover,

Reynolds (2008) contends that the researcher often has some belief about the

underlying subpopulations; therefore, this belief should be taken into account in determining

which of the models best fits the data.

Type I Error and Statistical Power Performance

In this phase of the study, the performance of the factor mixture model was

evaluated in terms of its Type I error rate and power of DIF detection. As was done in

the first part of the study, data were again simulated for a 15-item test based on the 2PL

IRT model. However, in this case it was assumed that the number of classes was

known to be two. Five of the 15 items were simulated to contain uniform DIF in favor of

the reference group. In investigating the Type I error rate and power of the test, three

factors (DIF magnitude, sample size and impact) shown previously to affect DIF

detection were also manipulated and their effect on the test was noted. More

specifically, two levels of DIF magnitude (1.0 and 1.5) and of sample size (N=500 and N=1000)

were simulated. For the effect of impact, three levels (0, 0.5 SD, and 1.0 SD) were

chosen to reflect no, moderate, and large mean differences in the latent trait. For

each of the 12 conditions, a total of 1000 replications were run. The Type I error and

statistical power of the factor mixture method for DIF detection were investigated across

all conditions.
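
For clarity, the full design can be laid out as a 2 x 2 x 3 grid of cells, each replicated 1,000 times; the short sketch below simply enumerates the 12 conditions.

from itertools import product

dif_magnitude = [1.0, 1.5]
sample_size = [500, 1000]
impact = [0.0, 0.5, 1.0]
n_replications = 1000

cells = list(product(dif_magnitude, sample_size, impact))
print(f"{len(cells)} conditions x {n_replications} replications per condition")
for dif, n, imp in cells:
    print(f"DIF magnitude = {dif}, N = {n}, impact = {imp} SD")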









Type I Error Rate Study

With the exception of the referent (Item 1) whose thresholds were constrained

across latent classes for identification purposes, the remaining nine DIF-free items were

used in assessing the ability of the factor mixture approach to control the Type I error rate close to the

nominal alpha level of .05. However, the DIF factor mixture approach yielded inflated

error rates ranging in magnitude from .092 to .159 across all 12 study conditions.

Whereas the rates of incorrect detection improved with large DIF and sample size, the

effect of increasing impact had little effect in controlling the Type I error rates. In

assessing the performance of several DIF detection procedures, previous studies have

confirmed the inverse relationship between the inflation of Type I error rates and both

sample size and size of DIF, with tests attaining their optimal performance at controlling

Type I error rates when samples sizes are larger and with higher amounts of DIF

(Cohen, Kim & Baker, 1993; Dainis, 2008; Donoghue, Holland, & Thayer, 1993; Oort,

1998; Wanichtanom, 2001). Previous simulation results regarding the influence of

impact on Type I error rates have been divided. Whereas some studies have reported

Type I error inflation in the presence of impact (Cheung & Rensvold, 1999; Lee, 2009;

Roussos & Stout, 1996; Uttaro & Millsap, 1994), others have shown good control of the

error rates for moderate impact of .5 SD (Stark et al., 2006) and even for latent mean

differences as large as 1 SD (Shealy & Stout, 1993). Differences in latent ability

distributions are common for both cognitive and non-cognitive measures, hence it is

critical that DIF detection methods, particularly those that do not differentiate between the

presence of DIF and impact, be robust to the effects of group differences in latent trait

means.









Statistical Power Study

The study also evaluated the power of the factor mixture approach to detect

uniform DIF. In spite of the failure of the factor mixture analysis to adequately control

the Type I error rates across the study conditions, the power results were still reviewed

to get some sense of the pattern of DIF detection. Overall, these findings represent a

mix of the predictable and the unexpected. What was expected was that the power of

the factor mixture analysis method of DIF detection would increase as sample size and

magnitude of DIF increased. In addition, it was not a surprising outcome that the

magnitude of the discrimination parameter also influenced DIF detection rates; power

was highest when detecting DIF in the more highly discriminating items, followed by

studied items with medium and low discrimination parameters. Overall, these results are

not only intuitively appealing but have been consistently supported by prior research

conducted with different methods of DIF detection (Donoghue et al., 1993; Narayanan &

Swaminathan, 1994; Rogers & Swaminathan, 1993; Stark et al., 2006). On the other

hand, the surprising result was that even in the presence of large latent trait mean

differences of 1.0 SD, the rates of DIF detection were not adversely affected by impact.

While this finding was consistent with some studies (Gonzalez-Roma et al., 2006;

Narayanan & Swaminathan, 1994; Rogers & Swaminathan, 1993; Shealy & Stout,

1993; Stark et al., 2006), others have reported contradictory results with reductions in

power as the disparity in latent means increased (Ankemann et al., 1999; Clauser,

Mazor, & Hambleton, 1993; Finch & French, 2007; Narayanan & Swaminathan, 1996;

Tian, 1999; Zwick, Donoghue, & Grima, 1993). However, it is important to note that

these prior empirical studies all utilized standard DIF analyses rather than a mixture

approach, as was used in this simulation.









Reconciling the Simulation Results

On one hand, the overall pattern of findings across the simulation conditions

exhibits consistency with previous DIF results. On the other, the factor mixture approach

was not as successful as was hoped at controlling the rate of false identifications and as

a result in demonstrating power to detect DIF. However, if the factor mixture approach is

to be regarded as a viable DIF detection method, possible reasons for this deviation

from the expected performance must be addressed. Under a manifest approach to DIF,

an item is said to exhibit DIF if groups matched on the latent ability trait differ in their

probabilities of item response (Cohen et al., 1993). Therefore, in that context, DIF is

defined with respect to the manifest groups being considered. By contrast, the mixture

approach posits a different conceptualization of DIF. In this case, the underlying

assumption is that DIF is observed because of differences in item responses between

unobserved latent classes rather than known manifest groups. Moreover, it is further

assumed that unless there is perfect overlap between the manifest groups and the

latent classes then the two methods should not be expected to produce the same DIF

results (De Ayala et al., 2002). Perfect overlap implies that the composition in each of

the latent classes is exactly the same as in the two manifest groups. For instance, in the

case of a two-class, two-group population, 100% of the reference group would comprise

latent class 1, while 100% of the focal group would belong to latent class 2. However,

De Ayala et al. (2002) contend that it is unlikely that this perfect equivalence between

latent classes and manifest groups will occur. Because the composition of the latent

classes is likely to differ from that of the manifest groups, it should be expected

that the DIF results will differ, particularly as the level of overlap moves from 100% to

50%. Therefore, while there is expected to be some similarity in results between the two









approaches, the results are not necessarily identical except in the case of perfect group-

class correspondence. In this simulation, given that the overlap between the latent

classes and manifest groups was simulated to be 80%, the DIF results should be

expected to differ to some degree. Therefore, one possible reason for the Type I error

rate inflation may be this difference in the definition and

conceptualization of DIF. Additionally, the procedure used to test the invariance of the

items may also have contributed to this seemingly high rate of inflation. In testing the

significance of the differences in item thresholds, Mplus invokes a Wald test. An

examination of these estimates revealed several large coefficients which in turn would

have resulted in large z-statistics and an increased likelihood of significance. However,

the issue of whether the inflated error rate resulted from applying a factor mixture

approach to these data or from using significance tests of threshold

differences in testing for non-invariant items remains unresolved.
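
To make the testing step concrete, a Wald test of a threshold difference divides the estimated difference by the standard error of that difference and refers the ratio to a standard normal distribution. The sketch below illustrates the calculation with invented estimates; in the study these quantities were produced by the Mplus MODEL CONSTRAINT output rather than computed by hand.

from math import sqrt
from scipy.stats import norm

# Invented example values for one item: estimated thresholds in the two latent
# classes, their standard errors, and the covariance of the two estimates.
tau_c1, tau_c2 = -0.50, 0.45
se_c1, se_c2, cov_12 = 0.22, 0.25, 0.01

diff = tau_c2 - tau_c1
se_diff = sqrt(se_c1**2 + se_c2**2 - 2 * cov_12)   # SE of the threshold difference

z = diff / se_diff                  # Wald z statistic
p_value = 2 * norm.sf(abs(z))       # two-sided p-value
print(f"difference = {diff:.3f}, z = {z:.2f}, p = {p_value:.4f}")

Large threshold-difference estimates of the kind noted above translate directly into large z statistics under this procedure, which is one plausible contributor to the inflated rejection rates.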

Limitations of the Study and Suggestions for Future Research

As with all simulation research, there are several limitations to this study.

However, these limitations also point to the need for future research. First, in

determining the correct number of latent classes, the findings were limited by use of

only one type of model fit index. It would have been interesting to compare the results of

the information criteria indices (i.e. AIC, BIC and ssaBIC) with those of alternative tests

such as the Lo-Mendell-Rubin likelihood ratio test (LMR LRT) and the bootstrap LRT

(BLRT). In their simulation study, Nylund et al. (2006) found that the LMR LRT

was reasonably effective at identifying the correct mixture model. However, the BLRT

outperformed both the likelihood-based indices and the LMR LRT as the most

consistent indicator for choosing the correct number of classes. While these results are









promising, the LMR LRT and the BLRT are not without their potential drawbacks.

Jeffries (2003) has been critical of the LMR LRT's use in mixture modeling and has

suggested that the statistic be applied with caution. In addition, the BLRT which uses

bootstrap samples is a far more computationally intensive approach than the

information-based statistics. As a result, the BLRT though seemingly a reliable index, is

seldom used in practice by the applied researcher (Liu, 2008). Therefore, additional

attention may be focused on identifying alternative, robust model selection measures

that provide more consistency than the ICs but are less computationally demanding

than the BLRT. A second limiting factor in this part of the study was that the selection of

the best-fitting model was based on the average IC values. A more reliable approach

would have been to determine the percentage of times (out of the completed

replications) that each index identifies the correct model. However, in this study, it was

not possible to provide a one-to-one comparison of the IC values across the three class-

solutions when a 100% convergence rate was not achieved. Therefore, in future

research, this change should be implemented so that the percentage of correct model

identifications can be compared for each of the indices. It should also be mentioned that

while previous studies have evaluated the performance of model selection methods with

respect to a variety of mixture models (GMM, LCA, FMM), to date no research has been

conducted to evaluate the performance of these indices when used in the context of DIF

detection. To fill this gap in the methodology literature requires a more extensive study

focusing on the detection of DIF with mixture models.

As with all simulation research, the findings can only be generalized to the limited

number of conditions selected for this study. It should be noted that in the original









design of this study, several additional conditions were considered. However, given the

computational intensity of mixture modeling, and in the interest of time, it was decided to

reduce the number of study conditions to the smaller set that was studied. Therefore,

future research should consider a broader range of simulation conditions which would

make for a more realistic study. For example, in addition to sample sizes, it would be of

interest to investigate the ratios of focal to reference groups sample size as well. In this

study, a 1:1 sample size ratio of focal to reference group was considered. And while this

may be representative of an evenly split manifest variable such as gender (Samuelsen,

2005), unequal group sizes tend to mimic minority population characteristics such

as race (e.g. Caucasian vs. black or Hispanic). In traditional DIF assessments, power

rates are typically higher for equal focal and reference group sizes than with unequal

sample size ratios (Atar, 2007). Therefore it would be interesting to investigate whether

this finding is consistent with factor mixture DIF detection methods.

Other conditions, fixed in this current study, that could be manipulated in future

research include: (i) the nature of the items, (ii) the scale length, and (iii) the type of DIF.

In the study, data were simulated for dichotomous items only. An interesting extension

would be the evaluation of the model using categorical response data generated from

different IRT polytomous models (e.g. the graded response model or the partial credit

model). Another condition that could be manipulated is the number/proportion of items

simulated to contain DIF. In addition, assessing the performance of the model selection

indices and the mixture model with respect to varying scale lengths should also make

for a more complete, informative study. While it is expected that longer tests would

produce lower Type I error rates and increase power, it would be of interest to









determine how short the scale should be for the test to perform adequately. The focus

of this study was on the detection of uniform DIF. However, in future research, the type

of DIF factor can be extended to include both uniform and non-uniform DIF. To test for

the presence of non-uniform DIF, the factor mixture model as implemented in this study

must be reformulated so that in addition to the item thresholds, factor loadings are

allowed to vary across classes as well. The Type I error rates and power of the factor

mixture model to detect non-uniform DIF can then be evaluated and compared with the

corresponding results for uniform DIF. Additionally, in this study, the item discrimination

parameter was not included as a manipulated factor. Instead, its effect was examined on

its own as a single condition. Therefore, in future research, the effect of including this

study condition may be investigated.

In generating the data, the mixture proportion for the two classes was simulated to

be .50. However, after the model estimation phase, the ability of the factor mixture

approach to accurately recover the class proportions was not evaluated. This omission

should also be addressed in future research.

Finally, to the author's knowledge, the strategy used in testing the items for non-

invariance has been recently introduced to the factor mixture literature and to date has

been implemented in two studies. Its advantage is that it provides a simpler, more direct

alternative to DIF detection than the CFA baseline approaches which require the

estimation and comparison of two models. However, it has not yet been subjected to the

methodological rigor of more established methods. Therefore, a potential extension to

this study would be a comparison of the performance of the significance testing of the









threshold differences using the Mplus model constraint option versus either a

constrained- or a free-baseline strategy for testing DIF with mixture CFA models.

Conclusion

In the last decade, a burgeoning literature on mixture modeling and its applications

has emerged. And although several of these research efforts have been concentrated in

the area of growth mixture modeling, there is also a groundswell of interest in applying a

mixture approach in the study of measurement invariance. Therefore, in concluding this

dissertation it is important to reiterate the motivation that should precede the use of this

technique as well as some key concerns that applied researchers should keep in mind when

deciding whether mixture modeling is an appropriate approach for their research. The

intrinsic appeal of mixture models is that they allow for the exploration of unobserved

population heterogeneity using latent variables. Under the traditional conceptualization,

DIF is defined with respect to distinct, known sub-groups. Therefore, in using standard

DIF approaches, practitioners are seeking to determine whether, after controlling for latent

ability, differences in item response patterns are a result of a known variable

such as gender or race. However, when investigating DIF from a latent perspective,

there is an implied assumption that the presence of unobserved latent classes gives rise

to the pattern of differential functioning in the items. Advocates of this approach contend

that it allows for a better understanding of why examinees may be responding differently

to items and this is certainly an attractive inducement to practitioners. However, these

results suggest that unless large sample sizes and large amounts of DIF are simulated

in the data, the factor mixture approach is likely to be unsuccessful at disentangling the

population into distinct, distinguishable latent classes. Additionally, commonly-used fit

indices such as the AIC, BIC, and ssaBIC are likely to produce inconsistent results and









may cause the incorrect selection of more or fewer classes than actually exist in the

population. Therefore, it is critical that the practitioner has a strong theoretical

justification to support the assumption of population heterogeneity. This should

decrease the ambiguity in the selection of the best-fitting model for the data and in the

interpretation of the nature of the latent classes. However, when the data and the theory

support the existence of these latent classes, the technique can be used successfully to

detect qualitatively different subpopulations with differential patterns of response that

may otherwise have been overlooked using a traditional DIF procedure. In the

context of education research, the application of mixture models can provide valuable

diagnostic information that can be used to gain insight into students' cognitive strengths

and weaknesses.

This study was designed as a means of bridging the gap between the manifest

and latent approaches by examining the performance of the factor mixture approach in

detecting DIF in items generated via a traditional framework. And even though the

manifest approach will remain a staple in the DIF literature, it is expected that the

interest in factor mixture models in DIF will continue to grow. Therefore, further

exploring how these two approaches differ not only as concepts but also in results and

application will ensure that each is appropriately used in practice.









APPENDIX A
MPLUS CODE FOR ESTIMATING 2-CLASS FMM

TITLE:      Factor mixture model for a two-class solution.

DATA:       FILE IS allnames.txt;
            TYPE = montecarlo;

VARIABLE:   NAMES = u1-u15 class group;
            USEVARIABLES ARE u1-u15;
            CATEGORICAL = u1-u15;
            CLASSES = c (2);

ANALYSIS:   TYPE = MIXTURE;
            ALGORITHM = INTEGRATION;
            INTEGRATION = STANDARD (20);
            STARTS = 600 20;
            PROCESSORS = 2;

MODEL:      %OVERALL%
            f BY u1-u15;

            %c#1%
            [u2$1-u15$1];
            f;

            %c#2%
            [u2$1-u15$1];
            f;

OUTPUT:     TECH8 TECH9;
            STANDARDIZED;

SAVEDATA:   RESULTS ARE results.txt;









APPENDIX B
MPLUS CODE FOR DIF DETECTION

TITLE:      Factor mixture model for a two-class solution.
            Items = 15, DIF = 1.0

DATA:       FILE IS allnames.txt;
            TYPE = montecarlo;

VARIABLE:   NAMES = u1-u15 class group;
            USEVARIABLES ARE u1-u15;
            CATEGORICAL = u1-u15;
            CLASSES = c (2);

ANALYSIS:   TYPE = MIXTURE;
            ALGORITHM = INTEGRATION;
            INTEGRATION = STANDARD (20);
            STARTS = 0;
            PROCESSORS = 2;

MODEL:      %OVERALL%
            f BY u1@1 u2*0.500 ... u15*0.867;

            %c#1%
            [u1$1] (p1_1);          !Assigns names to indicators for constraint purposes
            [u2$1*-0.500] (p1_2);
            ...
            [u15$1*0.609] (p1_15);
            f;

            %c#2%
            [u1$1] (p1_1);          !Threshold of Item 1 constrained equal across classes
            [u2$1*0.000] (p2_2);    !Remaining 14 item thresholds freely estimated
            ...
            [u15$1*0.609] (p2_15);
            f;

MODEL CONSTRAINT:
            NEW(difi2 difi3 difi4 difi5 difi6 difi7 difi8 difi9 difi10 difi11 difi12
                difi13 difi14 difi15);
            !Declares new parameters (difi2, ..., difi15) which are functions of the
            !labeled thresholds
            difi2 = p2_2 - p1_2;    !Estimates threshold differences
            difi5 = p2_5 - p1_5;
            ...
            difi15 = p2_15 - p1_15;










LIST OF REFERENCES


Abraham, A. A. (2008). Model Selection Methods in the linear mixed model for
longitudinal data. Unpublished doctoral dissertation, University of North Carolina
at Chapel Hill.

Ackerman, T. A. (1992). A didactic explanation of item bias, item impact, and item
validity from a multidimensional perspective. Journal of Educational
Measurement, 29, 67-91.

Ainsworth, A.T. (2007). Dimensionality and invariance: Assessing DIF using bifactor
MIMIC models. Unpublished doctoral dissertation, University of California, Los
Angeles.

Agrawal, A., & Lynskey, M.T. (2007). Does gender contribute to heterogeneity in criteria
for cannabis abuse and dependence? Results from the national epidemiological
survey on alcohol and related conditions. Drug and Alcohol Dependence, 88,
300-307.

Akaike, H. (1987). Factor analysis and AIC. Psychometrika, 52, 317-332.

Allua, S.S. (2007). Evaluation of single- and multilevel factor mixture model estimation.
Unpublished doctoral dissertation, University of Texas: Austin.

Anderson, L. W. (1985). Opportunity to learn. In T. Husen & T. N. Postlethwaite (Eds.),
The international encyclopedia of education (Vol. 6, pp. 3682-3686). Oxford:
Pergamon Press.

Angoff, W.H. (1972). A technique for the investigation of cultural differences. Paper
presented at the annual meeting of the American Psychological Association,
Honolulu. (ERIC Document Reproduction Service No. ED 069686).

Angoff, W.H. (1993). Perspectives on differential item functioning methodology. In P.W.
Holland & H. Wainer (Eds.) Differential item functioning (pp. 3-23). Hillsdale, N.J.:
Lawrence Erlbaum.

Angoff, W. H., & Sharon, A. T. (1974). The evaluation of differences in test performance
of two or more groups. Educational and Psychological Measurement, 34, 807-
816.

Ankemann, R. D., Witt, E. A., & Dunbar, S. B. (1999). An investigation of the power of
the likelihood ratio goodness-of-fit statistic in detecting differential item
functioning. Journal of Educational Measurement, 36, 277-300.









Atar, B. (2007). Differential item functioning analyses for mixed response data using IRT
likelihood-ratio test, logistic regression, and GLLAMM procedures. Unpublished
doctoral dissertation, Florida State University.

Bandalos, D. L., & Cohen, A.S. (2006). Using factor mixture models to identify
differentially functioning test items. Paper presented at the annual meeting of the
American Educational Research Association, San Francisco.

Bauer, D. J., & Curran, P.J. (2004). The integration of continuous and discrete latent
variable models: Potential problems and promising opportunities. Psychological
Methods, 9, 3-29.

Bilir, M. K. (2009). Mixture item response theory-mimic model: Simultaneous estimation
of differential item functioning for manifest groups and latent classes.
Unpublished doctoral dissertation, Florida State University.

Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item
parameters: Application of an EM algorithm. Psychometrika, 46, 443-459.

Bolt, D. M., Cohen, A. S., & Wollack, J. A. (2001). A mixture item response model for
multiple-choice data. Journal of Educational and Behavioral Statistics, 26, 381-
409.

Bontempo, D. E. (2006). Polytomous factor analytic models in developmental research.
Unpublished doctoral dissertation, The Pennsylvania State University.

Camilli, G., & Shepard, L. (1994). Methods for identifying biased test items. Newbury
Park, CA: Sage.

Cheung, G.W. & Rensvold, R.B. (1999) Testing factorial invariance across groups: A
reconceptualization and proposed new method. Journal of Management, 25, 1-
27.

Cho, S.-J. (2007). A multilevel mixture IRT model for DIF analysis. Unpublished doctoral
dissertation, University of Georgia, Athens.

Chung, M.C., Dennis, I., Easthope, Y., Werrett, J., & Farmer, S. (2005). A multiple-
indicator multiple-cause model for posttraumatic stress reactions: Personality,
coping, and maladjustment. Psychosomatic Medicine, 67, 251-259.

Clauser, B. E., & Mazor, K. M. (1998). Using statistical procedures to identify
differentially functioning test items. Educational Measurement: Issues and
Practice, 17, 31-44.









Clauser, B. E., Mazor, K. M., & Hambleton, R. K. (1994). The effects of score group
width on the Mantel-Haenszel procedure. Journal of Educational Measurement,
57, 67-78.

Clark, S.L. (2010). Mixture modeling with behavioral data. Doctoral dissertation,
Unpublished doctoral dissertation. University of California, Los Angeles.

Clark, S.L., Muthen, B., Kaprio, J., D'Onofrio, B.M., Viken, R., Rose, R.J., Smalley, S. L.
(2009). Models and strategies for factor mixture analysis: Two examples
concerning the structure underlying psychological disorders. Manuscript
submitted for publication.

Cleary, T. A. & Hilton, T. J. (1968). An investigation of item bias. Educational and
Psychological Measurement, 5, 115-124.

Cohen, A.S., & Bolt, D.M. (2005). A mixture model analysis of differential item
functioning. Journal of Educational Measurement, 42, 133-148.

Cohen, A. S., Kim, S.-H., & Baker, F. B. (1993). Detection of differential item functioning
in the graded response model. Applied Psychological Measurement, 17, 335 -
350.

Cole, N. S. (1993). History and development of DIF. In P.W. Holland & H. Wainer (Eds.)
Differential item functioning (pp. 25-33). Hillsdale, N.J.: Lawrence Erlbaum.

Dainis, A. M. (2008). Methods for identifying differential item and test functioning: An
investigation of Type I error rates and power. Unpublished doctoral dissertation,
James Madison University.

De Ayala, R.J. (2009). Theory and practice of item response theory. Guilford Publishing.

De Ayala, R.J., Kim, S.-H., Stapleton, L.M., & Dayton, C.M. (2002). Differential item
functioning: A mixture distribution conceptualization. International Journal of
Testing, 2, 243-276.

Donoghue, J. R., Holland, P. W., & Thayer, D. T. (1993). A Monte Carlo study of factors
that affect the Mantel-Haenszel and standardization measures of differential item
functioning. In P. W. Holland, and H. Wainer (Eds.), Differential item functioning
(pp. 137-166). Hillsdale, NJ: Lawrence Erlbaum Associates, Publishers.

Dorans, N.J., & Holland, P.W. (1993). DIF detection and description: Mantel-Haenszel
and standardization. In P.W. Holland & H. Wainer (Eds.) Differential item
functioning. Hillsdale, N.J.: Lawrence Erlbaum.









Dorans NJ, & Kulick E. (1986). Demonstrating the utility of the standardization approach
to assessing unexpected differential item performance on the Scholastic Aptitude
Test. Journal of Educational Measurement, 23, 355-368.

Duncan, S. C. (2006). Improving the prediction of differential item functioning: A
comparison of the use of an effect size for logistic regression DIF and Mantel-
Haenszel DIF methods. Unpublished doctoral dissertation, Texas A&M
University.

Educational Testing Service (2008). What's the DIF? Helping to ensure test question
fairness. Retrieved December 8, 2009, from: http://www.ets.org/portal/site/ets/

Engelhard, G., Hansche, L., & Rutledge, K. E. (1990). Accuracy of bias review judges in
identifying differential item functioning on teacher certification tests. Applied
Measurement in Education, 3, 347-360.

Finch, H. (2005). The MIMIC model as a method for detecting DIF: Comparison with
Mantel-Haenszel, SIBTEST, and the IRT likelihood ratio. Applied Psychological
Measurement, 29, 278-295.

Finch, W. H., & French, B. F. (2007). Detection of crossing differential item functioning.
Educational and Psychological Measurement, 67, 565-582.

Fukuhara, H. (2009). A differential item functioning model for testlet-based items
using a bi-factor multidimensional item response theory model: A Bayesian
approach. Unpublished doctoral dissertation, Florida State University.

Furlow, C. F., Raiford Ross, T., & Gagne, P. (2009). The impact of multidimensionality
on the detection of differential bundle functioning using simultaneous item bias
test. Applied Psychological Measurement, 33, 441-464.

Gagne, P. (2004). Generalized confirmatory factor mixture models: A tool for assessing
factorial invariance across unspecified populations. Unpublished doctoral
dissertation. University of Maryland.

Gagne, P. (2006). Mean and covariance structure models. In G.R. Hancock & F.R.
Lawrence (Eds.), Structural Equation Modeling: A second course (pp. 197-224).
Greenwood, CT: Information Age Publishing, Inc.

Gallo, J. J., Anthony, J. C., & Muthen, B. O. (1994). Age differences in the symptoms of
depression: A latent trait analysis. Journal of Gerontology: Psychological
Sciences, 49, P251-P264.

Gelin, M. N. (2005). Type I error rates of the DIF MIMIC approach using Joreskog's
covariance matrix with ML and WLS estimation. Unpublished doctoral
dissertation, The University of British Columbia.









Gelin, M. N., & Zumbo, B.D. (2007). Operating characteristics of the DIF MIMIC
approach using Joreskog's covariance matrix with ML and WLS estimation for
short scales. Journal of Modern Applied Statistical Methods, 6, 573-588.

Glockner-Rist, A., & Hoijtink, H. (2003). The best of both worlds: Factor analysis of
dichotomous data using item response theory and structural equation modeling.
Structural Equation Modeling, 10, 544-565.

Gomez, R. & Vance, A. (2008). Parent ratings of ADHD symptoms: Differential
symptom functioning across Malaysian Malay and Chinese children. Journal of
Abnormal Child Psychology, 36, 955-967.

Gonzalez-Roma, V., Hernandez, A., & Gómez-Benito, J. (2006). Power and Type I error
of the mean and covariance structure analysis model for detecting differential
item functioning in graded response items. Multivariate Behavioral Research, 41,
29-53.

Gierl, M. J., Bisanz, J., Bisanz, G. L., Boughton, K. A., & Khaliq, S. N. (2001).
Illustrating the utility of differential bundle functioning analysis to identify and
interpret group differences on achievement tests. Educational Measurement:
Issues and Practice, 20, 26-36.

Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item
response theory. Thousand Oaks, CA: Sage Publications.

Hancock, G. R., Lawrence, F. R., & Nevitt, J. (2000). Type I error and power of latent
mean methods and MANOVA in factorially invariant and noninvariant latent
variable systems. Structural Equation Modeling, 7, 534-556.

Henson, J. M. (2004). Latent variable mixture modeling as applied to survivors of breast
cancer. Unpublished doctoral dissertation. University of California, Los Angeles.

Holland, P. W., & Thayer, D. T. (1985). An alternative definition of the ETS delta scale
of item difficulty (ETS-RR-94-13). Princeton, NJ; Educational Testing Service.

Holland, P.W., & Thayer, D.T. (1988). Differential item performance and the Mantel-
Haenszel procedure. In H. Wainer& H.I. Braun (Eds.), Test Validity. Hillsdale,
N.J.: Erlbaum.

Holland, P.W., & Wainer, H. (1993). Differential item functioning. Hillsdale, N.J.:
Lawrence Erlbaum.

Jeffries, N. (2003). A note on "Testing the number of components in a normal mixture."
Biometrika, 90, 991-994.











Joreskog K., & Goldberger, A. (1975). Estimation of a model of multiple indicators and
multiple causes of a single latent variable. Journal of the American Statistical
Association, 70, 631-639.

Kamata, A., & Bauer, D. J. (2008). A note on the relationship between factor analytic
and item response theory models. Structural Equation Modeling: A
Multidisciplinary Journal, 15, 136-153.

Kamata, A., & Binici, S. (2003). Random effect DIF analysis via hierarchical generalized
linear modeling. Paper presented at the annual International Meeting of the
Psychometric Society, Sardinia, Italy.

Kamata, A., & Vaughn, B.K. (2004). An introduction to differential item functioning
analysis. Learning Disabilities: A Contemporary Journal, 2, 49-69.

Kuo, P.H., Aggen, S.H., Prescott, C.A., Kendler, K.S., & Neale, M.C. (2008). Using a
factor mixture modeling approach in alcohol dependence in a general population
sample. Drug and Alcohol Dependence, 98, 105-114.

Larson, S. L. (1999). Rural-urban comparisons of item responses in a measure of
depression. Unpublished doctoral dissertation, University of Nebraska.

Lau, A. (2009). Using a mixture IRT model to improve parameter estimates when some
examinees are motivated. Unpublished doctoral dissertation, James Madison
university.

Lazarsfeld, P. F., & Henry, N. W. (1968). Latent structure analysis. Boston: Houghton
Mifflin.

Lee, J. (2009). Type I error and power of the mean and covariance structure
confirmatory factor analysis for differential item functioning detection:
Methodological issues and resolutions. Unpublished doctoral dissertation,
University of Kansas.

Leite, W. L. & Cooper, L. (2007). Diagnosing social desirability bias with structural
equation mixture models. Paper presented at the Annual Meeting of the
American Psychological Association.

Li, F., Cohen, A.S., Kim, S-H., & Cho, S-J. (2009). Model selection methods for mixture
dichotomous IRT models. Applied Psychological Measurement, 33, 353-373.

Lin, T. H., & Dayton, C. M. (1997). Model selection information criteria for non-nested
latent class models. Journal of Educational and Behavioral Statistics, 22, 249-
264.











Linn, R.L. (1993). The use of differential item functioning statistics: A discussion of
current practice and future implications. In P.W. Holland & H. Wainer (Eds.)
Differential item functioning (pp. 349 -364). Hillsdale, N.J.: Lawrence Erlbaum.

Linn, R. L., & Harnisch, D. L. (1981). Interactions between item content and group
membership on achievement test items. Journal of Educational Measurement,
18, 109-118.

Liu, C. Q. (2008). Identification of latent groups in Growth Mixture Modeling:
A Monte Carlo study. Unpublished doctoral dissertation, University of Virginia.

Lo, Y., Mendell, N., & Rubin, D. (2001). Testing the number of components in a normal
mixture. Biometrika, 88, 767-778.

Lord, F. (1980). Applications of item response theory to practical testing problems.
Hillsdale, NJ: Erlbaum.

Lubke, G. H. & Muthen, B. (2005). Investigating population heterogeneity with factor
mixture models. Psychological Methods, 10, 21-39.

Lubke, G. H. & Muthen, B. (2007). Performance of factor mixture models as a function
of model size, covariate effects, and class-specific parameters. Structural
Equation Modeling: A Multidisciplinary Journal, 14, 26-47.

Lubke, G. H. & Neale, M. C. (2006). Distinguishing between latent classes and
continuous factors: Resolution by maximum likelihood? Multivariate Behavioral
Research, 41, 499-532.

Macintosh, R., & Hashim, S. (2003). Variance estimation for converting MIMIC model
parameters to IRT parameters in DIF analysis. Applied Psychological
Measurement, 372-379.

Mantel, N., & Haenszel, W. (1959). Statistical aspects of the analysis of data from
retrospective studies of disease. Journal of the National Cancer Institute, 22,
719-748.

McDonald, R. P. (1967). Nonlinear factor analysis. Psychometric Monographs, 15, 1-
167.

McKnight, C. C., Crosswhite, F. J., Dossey, J. A., Kifer, E., Swafford, J. O., Travers, K.
J., & Cooney, T. J. (1987). The underachieving curriculum: Assessing U.S.
school mathematics from an international perspective. Champaign, IL: Stipes.

McLachlan, G., & Peel, D. (2000). Finite mixture models. New York: Wiley.











Meade, A. W., & Lautenschlager, G. J. (2004). A comparison of item response theory
and confirmatory factor analytic methodologies for establishing measurement
equivalence/invariance. Organizational Research Methods, 7, 361-388.

Meade, A. W., Lautenschlager, G. J., & Johnson, E. C. (2007). A Monte Carlo
examination of the sensitivity of the DFIT framework for tests of measurement
invariance with Likert data. Applied Psychological Measurement, 31, 430-455.

Meredith W. (1993). Measurement invariance, factor analysis and factorial invariance.
Psychometrika, 58, 525-543.

Millsap, R.E., & Everson, H.T. (1993). Methodology review: Statistical approaches for
assessing measurement bias. Applied Psychological Measurement, 17, 297-334.

Mislevy, R. J., Levy, R, Kroopnick, M., & Rutstein, D. (2008). Evidentiary foundations of
mixture item response theory models. In G. R. Hancock, K. M. Samuelsen (Eds.),
Advances in latent variable mixture models. (pp. 149-175). Charlotte, NC:
Information Age Publishing.

Mislevy, R. J., & Verhelst, N. (1990). Modeling item responses when different subjects
employ different solution strategies. Psychometrika, 55, 195-215.

Moustaki, I. (2000). A latent variable model for ordinal variables. Applied Psychological
Measurement, 24, 211-224.

Muthen, B. O. (1985). A method for studying the homogeneity of test items with respect
to other relevant variables. Journal of Educational Statistics, 10, 121-132.

Muthen, B. O. (1988). Some uses of structural equation modeling in validity studies:
Extending IRT to external variables. In H. Wainer and H. Braun (Eds.), Test
validity (pp. 213-238). Hillsdale, NJ:Lawrence Erlbaum.

Muthen, B. O. (1989). Using item-specific instructional information in achievement
modeling. Psychometrika, 54, 385-396.

Muthen, B. O., & Asparouhov, T. (2002). Latent variable analysis with categorical
outcomes: Multiple-group and growth modeling in Mplus (Mplus Web Note No.
4). Retrieved April 28, 2005, from
http://www.statmodel.com/mplus/examples/webnote.html

Muthen, B., Asparouhov, T. & Rebollo, I. (2006). Advances in behavioral genetics
modeling using Mplus: Applications of factor mixture modeling to twin data. Twin
Research and Human Genetics, 9, 313-324.











Muthen, B. O., Grant, B., & Hasin, D. (1993). The dimensionality of alcohol abuse and
dependence: Factor analysis of DSM-III-R and proposed DSM-IV criteria in the
1988 National Health Interview Survey. Addiction, 88, 1079-1090.

Muthen, B. O., Kao, C., & Burstein, L. (1991). Instructionally sensitive psychometrics:
An application of a new IRT-based detection technique to mathematics
achievement test items. Journal of Educational Measurement, 28, 1-22.

Muthen, B. O., & Lehman, J. (1985). Multiple group IRT modeling: Applications to item
bias analysis. Journal of Educational Statistics, 10, 133-142.

Muthen, L. K., & Muthen, B. O. (1998-2008). Mplus user's guide (5th ed.). Los Angeles,
CA: Muthen & Muthen.

Narayanan, P., & Swaminathan, H. (1994). Performance of the Mantel-Haenszel and
simultaneous item bias procedures for detecting differential item functioning.
Applied Psychological Measurement, 18, 315-328.

Navas-Ara, M. J., & Gomez-Benito, J. (2002). Effects of ability scale purification on
identification of DIF. European Journal of Psychological Assessment, 18, 9-15.

Nylund, K. L., Asparouhov, T., & Muthen, B. (2006). Deciding on the number of classes
in latent class analysis and growth mixture modeling: A Monte Carlo simulation
study. Structural Equation Modeling, 14, 535-569.

O'Neill, K. A., & McPeek, W. M. (1993). Item and test characteristics that are
associated with differential item functioning. In P. W. Holland, & H. Wainer (Eds.),
Differential item functioning (pp. 255-276). Hillsdale, NJ: Erlbaum.

Oort, F. J. (1998). Simulation study of item bias detection with restricted factor analysis.
Structural Equation Modeling, 5, 107-124.

Penfield, R.D., & Lam, T. C. M. (2000). Assessing differential item functioning in
performance assessment: Review and recommendations. Educational
Measurement: Issues and Practice, 19, 5-15.

Potenza, M. T., & Dorans, N. J. (1995). DIF assessment for polytomously scored items: A
framework for classification and evaluation. Applied Psychological Measurement,
19, 23-27.

Raju, N.S. (1988). The area between two item characteristic curves. Psychometrika, 54,
495-502.

Raju, N.S., Bode, R.K., & Larsen, V.S. (1989). An empirical assessment of the Mantel-
Haenszel statistic to detect differential item functioning. Applied Measurement in
Education, 2, 1-13.


Raju, N.S. (1990). Determining the significance of estimated signed and unsigned areas
between two item response functions. Applied Psychological Measurement, 14,
197-207.

Reynolds, M. R. (2008). The use of factor mixture modeling to investigate population
heterogeneity in hierarchical models of intelligence. Unpublished doctoral
dissertation, University of Texas, Austin.

Rindskopf, D. (2003). Mixture or homogeneous? Comment on Bauer and Curran
(2003). Psychological Methods, 8, 364-368.

Rogers, H. J., & Swaminathan, H. (1993). A comparison of logistic regression and
Mantel-Haenszel procedures for detecting differential item functioning. Applied
Psychological Measurement, 17, 105-116.

Rost, J. (1990). Rasch models in latent classes: An integration of two approaches to
item analysis. Applied Psychological Measurement, 14, 271-282.

Roussos, L. A., & Stout, W. F. (1996a). A multidimensionality-based DIF analysis
paradigm. Applied Psychological Measurement, 20, 355-371.

Roussos, L. A., & Stout, W. F. (1996b). Simulation studies of the effects of small sample
size and studied item parameters on SIBTEST and Mantel-Haenszel Type I error
performance. Journal of Educational Measurement, 33, 215-230.

Samuelsen, K. (2005). Examining differential item functioning from a latent class
perspective. Unpublished doctoral dissertation, University of Maryland, College
Park.

Samuelsen, K. (2008). Examining differential item functioning from a latent perspective.
In G. R. Hancock & K. M. Samuelsen (Eds.), Advances in latent variable mixture
models (pp. 177-197). Charlotte, NC: Information Age Publishing.

Sawatzky, R. (2007). The measurement of quality of life and its relationship with
perceived health status in adolescents. Unpublished doctoral dissertation, The
University of British Columbia.

Schwartz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6,
461-464.

Sclove, S. L. (1987). Application of model-selection criteria to some problems in
multivariate analysis. Psychometrika, 52, 333-343.


Shealy, R., & Stout, W. F. (1993a). A model-based standardization approach that
separates true bias/DIF from group differences and detects test bias/DIF as well
as item bias/DIF. Psychometrika, 58, 159-194.

Shealy, R., & Stout, W. F. (1993b). An item response theory model for test bias and
differential test functioning. In P. W. Holland & H. Wainer (Eds.), Differential item
functioning (pp. 197-239). Hillsdale, NJ: Erlbaum.

Shih, C-L., & Wang, W-C. (2009). Differential item functioning detection using multiple
indicators, multiple causes method with a pure short anchor. Applied
Psychological Measurement, 33, 184-199.

Sorbom, D. (1974). A general method for studying differences in factor means and
factor structures between groups. British Journal of Mathematical and Statistical
Psychology, 27, 229-239.

Standards for educational and psychological testing. (1999). Washington, DC:
American Educational Research Association, American Psychological
Association, National Council on Measurement in Education.

Stark, S., Chernyshenko, O. S., & Drasgow, F. (2006). Detecting differential item
functioning with confirmatory factor analysis and item response theory: Toward a
unified strategy. Journal of Applied Psychology, 91, 1291-1306.

Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using
logistic regression procedures. Journal of Educational Measurement, 27, 361-
370.

Takane, Y., & de Leeuw, J. (1987). On the relationship between item response theory
and factor analysis of discretized variables. Psychometrika, 52, 393-408.

Thissen, D. (1991). MULTILOG user's guide: Multiple, categorical item analysis and
test scoring using item response theory. Chicago: Scientific Software, Inc.

Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential item functioning
using the parameters of item response models. In P.W. Holland & H. Wainer
(Eds.), Differential item functioning (pp. 67-113). Hillsdale, NJ: Lawrence
Erlbaum.

Thurstone, L.L. (1925). A method of scaling educational and psychological tests.
Journal of Educational Psychology, 16, 263-278.

Thurstone, L. L. (1947). Multiple Factor Analysis. Chicago: University of Chicago Press.

Tian, F. (1999). Detecting differential item functioning in polytomous items. Unpublished
doctoral dissertation, University of Ottawa.


Tofighi, D., & Enders, C. K. (2007). Identifying the correct number of classes in a growth
mixture model. In G. R. Hancock (Ed.), Mixture models in latent variable research
(pp. 317-341). Greenwich, CT: Information Age.

Uttaro, T., & Millsap, R. E. (1994). Factors influencing the Mantel-Haenszel procedure in
the detection of differential item functioning. Applied Psychological Measurement,
18, 15-25.

Wainer, H. (1993). Model-based standardized measurement of an item's differential
impact. In P.W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 123-
135). Hillsdale, NJ: Lawrence Erlbaum.

Wainer, H., Sireci, S. G., & Thissen, D. (1991). Differential testlet functioning: Definitions
and detection. Journal of Educational Measurement, 28, 197-219.

Wang, W-C., Shih, C-L., & Yang, C-C. (2009). The MIMIC method with scale purification
for detecting differential item functioning. Educational and Psychological
Measurement, 69, 713-732.

Wanichtanom, R. (2001). Methods of detecting differential item functioning: A
comparison of item response theory and confirmatory factor analysis.
Unpublished doctoral dissertation, Old Dominion University.

Wasserman, L. (2000). Bayesian model selection and model averaging. Journal of
Mathematical Psychology, 44, 92-107.

Webb, M-Y., Cohen, A.S., & Schwanenflugel, P.J. (2008). A mixture model
analysis of differential item functioning on the Peabody Picture Vocabulary
Test-III. Educational and Psychological Measurement, 68, 335-351.

Woods, C. M. (2009). Evaluation of MIMIC-model methods for DIF testing with
comparison to two-group analysis. Multivariate Behavioral Research, 44, 1-27.

Yang, C. C. (1998). Finite mixture model selection with psychometric applications.
Unpublished doctoral dissertation, University of California, Los Angeles.

Yang, C. C. (2006). Evaluating latent class analysis models in qualitative phenotype
identification. Computational Statistics and Data Analysis, 50, 1090-1104.

Yoon, M. (2007). Statistical power in testing factorial invariance with ordinal measures.
Unpublished doctoral dissertation, Arizona State University.

Zumbo, B. D. (1999). A handbook on the theory and methods of differential item
functioning (DIF): Logistic regression modeling as a unitary framework for binary
and Likert-type (ordinal) item scores. Ottawa, ON: Directorate of Human
Resources Research and Evaluation, Department of National Defense.


Zumbo, B.D. (2007). Three generations of differential item functioning (DIF) analyses:
Considering where it has been, where it is now, and where it is going. Language
Assessment Quarterly, 4, 223-233.

Zumbo, B. D., & Gelin, M. N. (2005). A matter of test bias in educational policy
research: Bringing the context into picture by investigating sociological /
community moderated (or mediated) test and item bias. Journal of Educational
Research and Policy Studies, 5, 1-23.

Zwick, R., Donoghue, J., & Grima, A. (1993). Assessment of differential item functioning
for performance tasks. Journal of Educational Measurement, 30, 233-251.


BIOGRAPHICAL SKETCH

Mary Grace-Anne Jackman was born in Bridgetown, Barbados. In 1994, she graduated
from the University of the West Indies, Barbados, with a Bachelor of Science degree in
mathematics and computer science (first class honors). After being awarded an Errol
Barrow Scholarship, she entered Oxford University in 1996 and received a Master of
Science degree in Applied Statistics in 1997. In 2002, she graduated from the University
of Georgia with a master's degree in marketing research. Following four years as a
marketing research consultant in New York and Barbados, she began doctoral studies in
research and evaluation methodology at the University of Florida in the fall of 2006.


PAGE 13

CHAPTER 1 INTRODUCTION Given the ubiquity of testing in the United States coupled with the serious consequences of high-stakes decisions associated with these assessments, it is critical that conclusions drawn about group differences among examinee groups be accurate and that the validity of interpretations is not compromised. One way of eliminating the threat of invalid interpretations is to ensure that tests are fair and the items do not disadvantage subgroups of examinees. In addressing the issue of fairness in testing, the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1999) outlines four widely used interpretations of fairness. The first defines a fair test as one that is free from bias that either systematically favors or systematically disadvantages one identifiable subgroup of examinees over another. In the second definition, fairness refers to the belief that equal treatment should be afforded to all examinees during the testing process. The third definition has sparked some controversy among testing professionals. It defines fairness as the equality of outcomes as characterized by comparable overall passing rates across examinee subgroups. However, it is widely agreed that while the presence of group differences in ability distributions between groups should not be ignored, it is not an indication of test bias. Therefore, a more acceptable definition specifies that in a fair test, examinees possessing equal levels of the underlying trait being measured should have comparable testing outcomes, regardless of group membership. The fourth and final definition of fairness requires that all examinees be afforded an equal, adequate opportunity to learn the tested material (AERA, APA, & NCME, 1999). Clearly, the concept of fairness is a complex, multi-faceted construct and therefore it is highly unlikely that consensus will be reached on all 13

PAGE 14

aspects of its definition, interpretation and implementation (AERA, APA, & NCME, 1999). However, there is agreement that fairness must be paramount to test developers during the writing and review of items as well as during the administration and scoring of tests. In other words, the minimum fairness requirements are that items are free from bias and that all examinees receive an equitable level of treatment during the testing process (AERA, APA, & NCME, 1999). In developing unbiased test items, one of the primary concerns is ensuring that the items do not function differentially for different subgroups of examinees. This issue of item invariance is investigated through the use of a statistical technique known as differential item functioning (DIF). DIF detection is particularly critical if meaningful comparisons are to be made between different examinee subgroups. The fundamental premise of DIF is that if test takers have approximately the same knowledge, then they should perform in similar ways on individual test questions regardless of their sex, race or ethnicity (ETS, 2008). Therefore, the process of DIF assessment involves the accumulation of empirical evidence to determine whether items function differentially for examinees with the same ability. DIF analysis is widely regarded as the psychometric standard in the investigation of bias and test fairness. Consequently, it has been the topic of extensive research (Camilli & Shepard, 1994; Clauser & Mazor, 1998; Holland & Thayer, 1988; Holland & Wainer, 1993; Milsap & Everson, 1993; Penfield & Lam, 2000; Potenza & Dorans, 1995; Swaminathan & Rogers, 1990). As part of its evolution, several statistical DIF detection techniques, including non-parametric (Holland & Thayer, 1988; Swaminathan & Rogers, 1990), IRT-based methods (Thissen, Steinberg, & Wainer, 1993; Wainer, Sireci, & Thissen, 1991) and SEM-based methods (Jreskog & 14

PAGE 15

Goldberger, 1975; MacIntosh & Hashim, 2003; Muthn, 1985, 1988; Muthn, Kao, & Burstein, 1991) have been developed. Typically these methods have all focused on a manifest approach to detecting DIF. In other words, they use pre-existing group characteristics such as gender (e.g. males vs. females) or ethnicity (e.g. Caucasian vs. African Americans) to investigate the occurrence of DIF in a studied item. And while this approach to DIF testing has been widely practiced and accepted as the standard for DIF testing, some believe that this emphasis on statistical DIF analysis has been less successful in determining substantive causes or sources of DIF. For example, in traditional DIF analyses, after an item has been flagged as exhibiting DIF, content-experts may subsequently conduct a substantive review to determine the source of the DIF (Furlow, Ross, & Gagn, 2009). However, while this is acknowledged as an important step, some view its success at being able to truly understand the explanatory sources of DIF as minimal, at best (Engelhard, Hansche, & Rutledge, 1990; Gierl et al., 2001; ONeill & McPeek, 1993; Roussos & Stout, 1996). More specifically, inconsistencies in agreement between reviewers judgments or between reviewers and DIF statistics make it difficult to form any definitive conclusions explaining the occurrence of DIF (Engelhard et al., 1990). Others suggest that the inability of traditional DIF methods to unearth the causes of DIF is because the pre-selected grouping variables on which these analyses are based are often not the real dimensions causing DIF. Rather, they view these a priori variables as mere proxies for educational (dis)advantage attributes that if identified could better explain the pattern of differential responses among examinees (Cohen & Bolt, 2005; De Ayala, Kim, Stapleton, & Dayton, 2002; Dorans & Holland, 1993; Samuelsen, 2005, 2008; Webb, Cohen, & 15

PAGE 16

Schwanenflugel, 2008). Finally, the assumption of an inherent homogeneity in responses among examinees in the subgroups has also been cited as another weakness of the traditional DIF approach (De Ayala, 2009; De Ayala, Kim, Stapleton & Dayton, 2002; Samuelsen, 2005). This view has been supported by the observation that even within a seemingly homogenous manifest group (e.g. Hispanic or black) there can be high levels of heterogeneity, resulting in segments which respond differently to the item than other examinees in that group. Using race as an example, De Ayala (2009) noted that a racial category such as Asian American would lump together examinees of Filipino, Korean, Indonesian, Taiwanese descent as a single homogeneous group, ignoring their intra-manifest variability. As a result, De Ayala (2009) argues that this assumed homogeneity in traditional manifest DIF assessments may lead to false conclusions about the existence or magnitude of DIF. As an alternative to the traditional manifest approach, a latent mixture conceptualization of DIF has been proposed. Rather than focusing on a priori examinee characteristics, this method characterizes DIF as being the result of unobserved heterogeneity in the population. A latent mixture conceptualization relaxes the requirement that associates DIF with a specific preexisting variable and the assumption that manifest groups are homogenous. Instead, examinees are classified into latent subpopulations based on their differential response patterns. These latent subpopulations or latent classes arise as a result of qualitative differences (e.g. use of problem solving strategies, response styles or level of cognitive thinking) among examinee subgroups (Mislevy et al., 2008; Samuelsen, 2005). Interestingly, more than a decade before the latent mixture conceptualization had been proposed, Angoff (1993) 16

PAGE 17

also voiced his concern about the inability of traditional methods to provide substantive interpretations of DIF. In offering evidence supporting this view Angoff (1993) reported that in attempting to account for DIF test developers are often confronted by DIF results that they cannot understand; and no amount of deliberation seems to explain why some perfectly reasonable items have large DIF values (Angoff, 1993, pg. 19). This dissertation proposes the use of factor mixture models as an alternative approach to investigating heterogeneity in item parameters when the source of heterogeneity is unobserved. Factor mixture modeling blends factor analytic (Thurstone, 1947) and latent class (Lazarsfield & Henry, 1968) models, two structural equation modeling (SEM) based methods that provide a unique but complementary approach to explaining the covariation in the data. The latent class model accounts for the item relationships by assuming the existence of qualitatively different subpopulations or latent classes (Bauer & Curran, 2004). However, since class membership is unobserved individuals are categorized based not on an observed grouping variable but rather by using probabilities to determine their most likely latent class assignment. On the other hand, the factor analytic model assumes the existence of an underlying, continuous factor structure in explaining the commonality among the item responses. As part of the factor mixture estimation, the class-specific item parameters from the factor analytic model that are estimated can be compared to determine their level of measurement non-invariance or differential functioning. A significant DIF coefficient provides evidence that the item is functioning differentially among latent classes after controlling for the latent ability trait. 17

PAGE 18

In estimating factor mixture models, one important decision to be made is the determination of the number of latent classes. However, while there are several fit criteria such as the Akaike Information Criteria (AIC; Akaike, 1987), Bayesian Information Criteria (BIC, Schwartz, 1978), the adjusted BIC (ssaBIC; Sclove, 1987), and the Lo-Mendell Rubin adjusted likelihood ratio test (LMR aLRT; Lo, Mendell, & Rubin, 2001) available to assist the researcher in making the determination, there is seldom perfect agreement among these fit indices. Therefore, practitioners are cautioned against applying the mixture approach without having theoretical support for their hypothesis of unobserved heterogeneity in the population of interest. Generally, when group membership is known a priori, SEM-based DIF detection models can be specified using either a multiple indicators, multiple causes (MIMIC) model or a multiple-group CFA approach (Allua, 2007). In this paper, the factor mixture model will be specified using the mixture analog of the manifest multiple-group CFA. Therefore, in the model specification since the observed group variable will now be replaced by a latent categorical variable, not only will the heterogeneity of item parameters be examined but a profile of the latent, unobserved subpopulations can be examined as well. Finally, if needed, covariates can also be included in the model to help in explaining the composition of the latent classes as well. The primary purpose of this dissertation was to explore the utility of the factor mixture models in the detection of DIF. In a 2006 paper, Bandalos and Cohen commented that while the estimation of the factor mixture models had previously been presented in the IRT literature, the models were not as frequently utilized with SEM-based models. However, programming enhancements to software packages such as 18

PAGE 19

Mplus (Muthn & Muthn, 1998-2008) have increased the likelihood that the estimation of SEM-based factor mixture models will be more commonly estimated in practice (Bandalos & Cohen, 2006). In this study, the performance of the factor mixture model was evaluated primarily in terms of its ability to produce high convergence rates, control Type I error rate, and enhance statistical power under a variety of realistic study conditions. The simulated conditions examined sample size, magnitude of DIF, and similarity of latent ability means. In sum, this Monte Carlo simulation was conducted to determine the conditions under which a factor mixture approach performs best when assessing item non-invariance and ultimately whether its use in practice should be recommended. SEM-based approaches are not commonly used in the discipline of educational testing which traditionally has been considered the domain of techniques developed within an IRT framework. And although the equivalence between factor analytic and IRT approaches for categorical items has long been established and applied repeatedly in the literature (Bock & Aiken, 1981; Finch 2005; Glockner-Rist & Hoijtink 2003; Moustaki 2000; Muthn, 1985; Muthn & Asparouhov 2002; Takane & de Leeuw, 1987), these methods are still utilized primarily within their respective discipline of origin. Therefore, despite its obvious potential, it is unlikely that this latent conceptualization will gain widespread acceptance, unless the applied community is convinced that (i) framing DIF with respect to latent qualitative differences and (ii) using a SEM-based approach are both worthwhile, practical options. In sum, if a mixture approach can be shown to add substantial value in an area such as DIF detection, then research of this kind will 19

PAGE 20

contribute positively to bridging the gap between SEM-based and IRT-based methods in the area of testing and measurement. 20

PAGE 21

CHAPTER 2 LITERATURE REVIEW Differential Item Functioning From a historical standpoint, the term item bias was coined in the 1960s during an era when a public campaign for social equality, justice and fairness was being waged on all fronts. The term referenced studies designed to investigate the claim that the principal, if not the sole, reason for the great disparity in test performance between Black and Hispanic students and White students on tests of cognitive ability is that the tests contain items that are outside the realms of the minority cultures (Angoff, 1993, pg. 3). Concerns about bias in testing were particularly relevant in cases where the results were used in high-stakes decisions involving job selection and promotions, certification, licensure and achievement. What followed in the early s and s was a series of studies using rudimentary methods based on classical test theory (CTT) techniques (Gelin, 2005). One of these early methods involved an analysis of variance (ANOVA) approach (Angoff & Sharon, 1974; Cleary & Hilton, 1986) and focused on the interaction between group membership and item performance as a means of identifying outliers and detecting potentially biased items. Another method, the delta-plot technique, (Angoff, 1972; Thurstone, 1925) used plots of the transformed CTT index of item difficulty (p-values) for each group as a means of detecting biased items. However, the main weakness of these early methods was that they both failed to control for the underlying construct that was purported to be measured (e.g. ability) by the test. Another criticism was that because these methods only considered item difficulty, their implicit assumption of equally discriminating items led to an increase in the incidence of false negative or false positive identifications. 21

PAGE 22

During this period, the term bias also came under heavy scrutiny. It was felt that the strong emotional connotation which the word carried was creating a semantic rift between the technical testing community and the general public. As a result of this debate, the term differential item functioning (DIF) was proposed as a less value-laden, more neutral replacement (Angoff, 1993; Cole, 1993). The DIF concept is defined as the accumulation of empirical evidence to investigate whether there is a difference in performance between comparable groups of examinees (Hambleton, Swaminathan, & Rogers, 1991). More specifically, it refers to a difference in the probability of correctly responding to an item between two subgroups of examinees of the same ability or groups matched by their performance on the test representing the underlying construct of interest (Kamata & Binici, 2003; Potenza & Dorans, 1995). These two groups are referred to as the focal group, those examinees expected to be disadvantaged by the item(s) of interest on the test (e.g. females or African Americans), and the reference group, those examinees expected to be favored by the DIF items (e.g. males or Caucasians). Since the introduction of the early CTT methods, a variety of additional techniques have been introduced for the detection of DIF (Clauser & Mazor, 1998). These include the nonparametric Mantel-Haenszel chi-square method (Holland & Thayer, 1988; Mantel & Haenszel, 1959), the standardization method (Dorans & Holland, 1993), logistic regression (Swaminathan & Rogers, 1990), likelihood ratio tests (Wainer, Sireci, & Thissen, 1991), and item response theory (IRT) approaches comparing parameter estimates (Lord, 1980) or estimating the areas between the item characteristic curves (Raju, 1988, 1990). Additionally, though not as popular in the testing and measurement 22

PAGE 23

discipline, SEM-based methods such as the multiple-groups approach (Lee, 2009; Srbom, 1974) and the MIMIC model (Gallo, Anthony, & Muthn, 1994; Muthn, 1985, 1988) have also been used to detect DIF. The fact that these methods involve some level of conditioning on the latent construct of interest, differentiate them from the earlier CTT ANOVA and p-value approaches. Furthermore, DIF assessment has now emerged as an essential element in the investigation of test validity and fairness and for testing companies such as Educational Testing Service (ETS), the use of DIF is a critical part of the test validation process. Coles (1993) comments in which she describes DIF as a technical tool in ETS to help us assure ourselves that tests are as fair as we can make them underscore the importance of this analytical approach as a standard practice in the design and administration of fair tests. Types of Differential Item Functioning There are two primary types of DIF: uniform and non-uniform DIF. Uniform DIF occurs when the probability of correctly responding to or endorsing a response category is consistently higher or lower for either the reference or focal group across all levels of the ability scale. As shown in Fig 2-1, when uniform DIF is present in an item, the two ICCs do not cross at any point along the range of ability. In this example, the probability of responding correctly to this dichotomous item is uniformly lower for group 2 than for group 1 along the entire range of latent ability. Conversely, when there is an interaction between the latent ability trait and group membership and the ICCs cross at some point along the ability range, then this is referred to as non-uniform DIF. This means that for a portion of the scale, one group is more likely to correctly respond or endorse a response category. However, this advantage is reversed for the second portion of the scale. An example of this is shown 23

PAGE 24

in Figure 2-2. It is important to note that while some methods can detect both types of DIF, others are capable of detecting uniform DIF only. DIF vs. Impact As was mentioned previously, DIF occurs when there is a significant difference in the probability of answering an item correctly or endorsing an item category between groups of the same ability level (Wainer, 1993). On the other hand, impact refers to legitimate group differences in the probability of getting an item correct or endorsing an item category (Wainer, 1993). Therefore, in distinguishing between these two concepts, it is important to note that while DIF assessment includes comparable groups of examinees with the same trait level, impact refers to group differences without controlling or matching on the construct of interest. Frameworks for Examining DIF Traditionally, DIF is examined within one of two frameworks: (i) the observed score framework and (ii) the latent variable framework. Observed score framework The observed score framework includes methods that adjust for ability by conditioning on some observed or manifest variable designed to serve as a proxy for the underlying ability trait (Ainsworth, 2007). When an observed variable such as the total test score is used it is assumed that this internal criterion is an unbiased estimate of the underlying construct. Therefore, as an alternative to the aggregate test score, an alternative or a purified form of the total test score may also be used as the internal matching criteria. This means that provided that DIF is detected, the total test score is adjusted by dropping those items identified as displaying DIF and calculating a revised total score. This iterative process which is used to refine the criterion measure and limit 24

PAGE 25

the impact of DIF contamination is known as purification (Kamata & Vaughn, 2004). While internal criterion measures are most commonly used, external criteria consisting of a set of items that were not part of the administered test could also function as an external criterion measure. An external criterion may be an adequate choice particularly when there are a high proportion of DIF items in the scale or the total score is deemed to be an inappropriate measure (Gelin, 2005). However, external measures are seldom used in practice due to the difficulty in finding an adequate set of items that would be more appropriate at measuring the latent trait than the actual test that was designed specifically for that use (Gelin, 2005; Shih & Wang, 2009). Procedures that use the observed variable framework include contingency tabulation methods such as the Mantel-Haenszel (MH) (Holland & Thayer, 1998), generalized linear models such as logistic regression (Swaminathan & Rogers, 1990) or ordinal regression (Zumbo, 1999), and the standardization method (Dorans & Kulik, 1983). The latent variable framework Unlike the procedures implemented within the observed score framework, these techniques do not control using an observed, manifest measure like the total test score or purified total test score. Instead, latent variable DIF detection methods involve the use of an assumed underlying latent trait such as ability. The two main classes of methods which use the latent variable framework are: (i) item response theory (IRT) and (ii) structural equation modeling (SEM) based approaches. IRT methods include techniques in which comparisons are made either between item parameters (Lord, 1980), or between item characteristic curves (Raju, 1988, 1990) or likelihood ratio test methods (IRT-LRT; Thissen, Steinberg, & Wainer, 1988). However, since the focus of 25

PAGE 26

this dissertation is on the use of an SEM-based approach to detect DIF, no detailed description of the IRT-based DIF detection methods will be given. SEM-based DIF Detection Methods While it is possible to specify an exploratory factor model where the measurement model relating the item responses to the latent underlying factors is unknown, a CFA approach in which the factor structure has been specified is more often used in DIF detection. Typically, a CFA model is formulated as: Y (1) (2) where Y is a px1 vector of scores on the p observed variables, is a px1 vector of measurement intercepts, is a pxm matrix of factor loadings, is a mx1 vector of factor scores on the m latent factors, and is a px1 vector of residuals or measurement errors representing the unique portion of the observed variables not explained by the common factor(s). It is assumed that the s have zero mean and are uncorrelated not only with the s but with each other as well. Additionally, denotes an mx1 vector of factor means and is a m-vector of factor residuals. The model-implied mean and covariance structures are formulated as follows: (3) () (4) where is a px1 vector of means of the observed variables, is an mx1 vector of factor means, () is a pxp matrix of variances and covariances of the p observed variables, is an mxm covariance matrix of the latent factors and is a square pxp matrix of variances and covariances for the measurement errors. For a single-group 26

PAGE 27

CFA model, it is also assumed that the independent observations are drawn from a single, homogenous population. The estimates of the parameters are found by minimizing the discrepancy between the observed covariance matrix, S and the sample-implied covariance () The general form of the discrepancy function is denoted by F(S,()) but there are several different types of functions that can be used in the estimation process. For example, if multivariate normality of the data is assumed, then the maximum likelihood (ML) estimates of the parameters are obtained by minimizing the following: 1()log()(())log()(MLFtrSSpY 1)Y (5) where S is the sample covariance matrix, () is the model-implied covariance matrix and p is the number of observed variables. As presented, this formulation of the factor model is described for items with a continuous response format. However, since the focus of this dissertation is on DIF assessment with dichotomous items, what follows is a respecification of the model to accommodate categorical items. Factor Analytic Models with Ordered Categorical Items In educational measurement, the ability trait is typically measured by dichotomous or polytomous items rather than continuous items. One method of dealing with categorical item responses is to specify a threshold model or a latent response variable (LRV) Formulation (LRV; Muthn & Asparouhov, 2002). The LRV formulation assumes that underlying each observed item response y, is a continuous and normally distributed latent response variable y*. This continuous latent variable can be thought of as a response tendency with higher values indicating a greater propensity of answering the item correctly. Further, it is assumed that when this tendency is sufficiently high thereby 27

PAGE 28

exceeding a specific threshold value, then the examinee will answer the item correctly. Likewise, if it falls below the threshold, then an incorrect response is observed. Therefore, based on this formulation, the observed items responses can be viewed as discrete categorizations of the continuous latent variables. The relationship between these two variables y and y* are represented by the following nonlinear function: *1, if ,cycy c (6) where c denotes the number of response categories for y and the threshold structure is defined by 012... C for c categories with c-1 thresholds. In the case of binary items, the mapping of y1 onto y1* is expressed as: *111*110,,1, ifyyify where 1 denotes the threshold parameter for test item y1. This relationship is illustrated in Figure 2-3. Because of the LRV formulation, the measurement component of the model which relates, in this case, the continuous latent response variables to the latent factor and to the group membership variable is respecified as: *ijiijijy (7) where is individuals i latent response to item j. The distributional assumptions of the p-vector of measurement errors determine the appropriate link function to be selected. For example, if it is assumed that the measurement errors are normally distributed then the probit link function, that is, the inverse of the cumulative normal distribution function, is used. As a result the thresholds and factor loadings are interpreted as probit coefficients in the linear probit regression equation. The alternative is to assume a *ijy 1[ ] 28

PAGE 29

logistic distribution function for the measurement errors which allows the coefficients to be interpreted either in terms of logits or converted to changes in odds. Under the LRV formulation, the single factor model for a continuous latent trait measured by binary outcomes is expressed as in Equation 7. Therefore, the conditional probability of a correct response as a function of is: **1/21/21||1| 1() ()ijjijijijijiiijijiiijijPyPyPyFvVFvV (8) where ()ijV is the residual variance and F can be either the standard normal or logistic distribution function depending on the distributional assumptions of the ijs (Muthn & Asparouhov, 2002). Further, in addition to the LRV, latent variable models with categorical variables can also be presented using an alternative formulation. The conditional probability curve formulation focuses on directly modeling the nonlinear relationship between the observed s y and the latent factor trait, as: 1|ijjijiPyFab (9) where is the item discrimination, bi is the item difficulty, and the distribution of F is either the standard normal or logistic distribution function. In their 2002 paper, Muthn and Asparouhov illustrate the equivalence of results between these two conceptual formulations of modeling factor analytic models with categorical outcomes. The authors showed that equating the two formulations: ia 1/2()iiijijijiFvVFab (10) 29

PAGE 30

where as previously indicated, F is either the standard normal or logistic distribution depending on the distributional assumptions of the ij However, it should be noted that in the case of factor mixture modeling with categorical variables, the default estimation method used in Mplus (Muthn & Muthn, 1998-2008) is robust maximum likelihood estimation method (MLR) and the default distribution of F is the logistic distribution. To allow for the estimation of the thresholds, the intercepts in the measurement model are assumed to be zero. As a result, the factor analytic logistic parameters can be converted to IRT parameters using: ( )( ) and () iiijiabVar i (11) Additionally, to ensure model identification, it is necessary to assign a scale to the latent trait. One method of setting the scale of the latent trait is to standardize it by setting the mean equal to zero and fixing the variance at one, that is, =0 and =1. In this case, the factor loadings and thresholds can then be converted to item discriminations and item difficulties using the following expressions (Muthn & Asparouhov, 2002): and ()iiiijav ib (12) The ease with which estimates of the factor analysis parameters can be converted to the more recognizable IRT scale should increase the interpretability and utility of the results to the applied researchers (Fukuhara, 2009). An alternative method of setting the scale of the latent factor is by fixing one loading per factor to one. As a result, the simplified conversion formulae will differ and by extension, the magnitude of the parameter estimates will be affected as well (Bontempo, 2006; Kamata & Bauer, 2008). 30

PAGE 31

Therefore, when running these models with factor analytic programs such as Mplus, the user should be aware of the default scale setting methods since this invariably will affect the parameter conversion formulae. Mixture Modeling as an Alternative Approach to DIF Detection Traditional DIF detection methods assume a manifest approach in which examinees are compared based on demographic classifications such as gender (females vs. males) or race (African Americans vs. Caucasians). While this approach has a long history and has been used successfully in the past to assess DIF, recent emerging research has suggested that this perspective may be limiting in scope (Cohen & Bolt, 2005; De Ayala et al., 2002; Mislevy et al., 2008; Samuelsen, 2005, 2008). As an alternative to the traditional manifest DIF approach, a latent DIF conceptualization, rooted in latent class and mixture modeling methods, has been proposed (Samuelsen, 2005, 2008). In this conceptualization of DIF, rather than focusing on a priori examinee characteristics, this approach assumes that the underlying population consists of a mixture of heterogeneous, unidentified subpopulations, known as latent classes. These latent classes exist because of perceived qualitative differences (e.g. different learning strategies, different cognitive styles etc.) among examinees. One technique that can be used to examine DIF from a latent perspective is factor mixture modeling (FMM). Factor mixture models result from merging the factor analysis model and the latent class model resulting in a hybrid model consisting of two types of latent variables; continuous latent factors and categorical unobserved classes (Lubke & Muthn, 2005, 2007; Muthn, Asparouhov, & Rebollo, 2006). The simultaneous inclusion of these two types of latent variables allows first for the exploration of unobserved heterogeneity that may exist in the population and the examination of the 31


underlying dimensionality within these latent groups (Lubke & Muthn, 2005). As a result, a primary advantage of factor mixture modeling is its flexibility in allowing for a wider range of modeling options. For instance, models can be specified for multiple factors, multiple latent classes and the structure of the within-class models can vary in the complexity of its relationships not only with the latent factors but with observed combinations of continuous and categorical covariates as well (Allua, 2007; Lubke & Muthn, 2005, 2007; McLachlan & Peel, 2000). However, it is important to note that as more specifications are introduced to a model, not only does the level of complexity of model increase but the computational intensity of the estimation process as well. Therefore, it is recommended that researchers should be guided by substantive theory not only in supporting their hypothesis of population heterogeneity but also with regard to the complexity of the specification of their mixture models as well (Allua, 2007; Jedidi, Jagpal, & DeSarbo, 1997). Previous applied studies highlight several fields where mixture modeling has been applied successfully to investigate population heterogeneity (Bauer & Curran, 2004; Jedidi et al., 1997; Kuo et al., 2008; Lubke & Muthn, 2005, 2007; Lubke & Neale, 2006; Muthn & Asparouhov, 2006; Muthn, 2006). One area in which factor mixture models have been utilized with some success is that of substance abuse research (Kuo et al., 2008; Muthn et al., 2006). In their research on alcohol dependency, Kuo et al. (2008) compared a factor mixture model approach against a latent class and a factor model approach to determine which of the three best explained the alcohol dependence symptomology patterns. What they found was that while a pure factor analytic model provided an unsatisfactory solution, the latent class approach provided a better fit to the 32


alcohol dependency data. However, the single-factor, three-class factor mixture model provided the best fit to the data and best accounted for the covariation in both the pattern of symptoms and the heterogeneity of the population. In another study from the substance abuse literature, Muthén et al. (2006) compared two types of factor mixture models to a factor analysis and latent class approach in analyzing the responses of 842 pairs of male twins to 22 alcohol criteria items. The findings showed that both factor mixture models fit the data well and explained heritability both with regard to the underlying dimensional structure of the data and the latent class profiles of the heterogeneous population (Muthén et al., 2006). With regard to DIF, factor mixture models may be used to investigate item parameter differences between latent classes of subpopulations of examinees. In this conceptualization of DIF, the unobserved latent classes represent qualitatively different groups of individuals whose item responses function differentially across the classes (Bandalos & Cohen, 2006). To allow for the specification of these models, a categorical latent variable is integrated into the common factor model specified in Equations 1 and 2. As a result, the K-class factor mixture model is now expressed as:

\( y^*_{ik} = \nu_k + \Lambda_k \eta_{ik} + \varepsilon_{ik} \)   (13)

\( \eta_{ik} = \alpha_k + \zeta_{ik} \)   (14)

where the subscript k indicates the parameters that can vary across the latent classes. Figure 2-4 provides a depiction of a factor mixture model where the unidimensional latent factor is measured by five observed items and there are K latent classes in the population. In the diagram, the relationships are specified as follows:


- The arrows from the latent factor to the item responses (from η to the Ys) represent the factor loadings, or the parameters measuring the relationship between the latent factor and the items.
- The arrows from the latent class variable to the item responses (from Ck to the Ys) are the class-specific item thresholds conditional on each of the K latent classes.
- The broken-line arrows from the latent class variable to the factor-loading arrows indicate that these loadings are also class-specific and therefore can vary across the K latent classes.
- The arrow from the latent class variable to the latent factor (i.e., from Ck to η) allows the factor means and/or factor variances to be class-specific as well.

Since the model parameters are allowed to be class-specific, that is, both item thresholds and factor loadings can be specified as non-invariant, this specification allows for testing of both uniform and non-uniform DIF. The focus of this dissertation is on assessing the performance of the factor mixture model in the detection of uniform DIF. Therefore, for this specification, while the item thresholds are allowed to vary across the levels of the latent class variable, the factor loadings are constrained across the K classes. In Mplus, the implementation of the factor mixture model for DIF detection can be conceptualized as a multiple-group approach where DIF is tested across latent classes rather than manifest groups. Therefore, using Equations 1 and 2, the CFA mixture model can be reformulated as:

\( Y^*_{ik} = \nu_k + \Lambda_k \eta_{ik} + \varepsilon_{ik} \quad \text{and} \quad \eta_{ik} = \alpha_k + \zeta_{ik} \)   (15)

where the parameters are as previously defined for each of the k = 1, 2, ..., K latent classes. Once again, it is assumed that the measurement errors have an expected value of zero and are independent of the latent trait(s) and of each other. Similarly, the model-implied mean and covariance structure for the observed variables in each of the k = 1, 2, ..., K latent classes can be defined as:


\( \mu_k = \nu_k + \Lambda_k \alpha_k \quad \text{and} \quad \Sigma_k = \Lambda_k \Psi_k \Lambda_k' + \Theta_k \)   (16)

Three main strategies for using CFA-based approaches to investigate measurement invariance have been proposed: (i) a constrained-baseline approach (Stark, Chernyshenko, & Drasgow, 2006), (ii) a free-baseline approach (Stark et al., 2006), and (iii) a more recent approach which tests the significance of the threshold differences via the Mplus model constraint feature (Clark, 2010; Clark et al., 2009). In the first two approaches, DIF testing is conducted via a series of tests of hierarchically nested models (Lee, 2009). The constrained-baseline approach begins with a baseline model in which all the parameters are constrained equal across groups; the parameters of the studied item(s) are then freed one at a time. On the other hand, the free-baseline approach starts with a model in which all parameters (except those needed for model identification) are freely estimated across groups. After this model has been estimated, the parameters of the item(s) of interest are constrained in a sequential manner. With either of these two approaches, a series of nested comparisons is conducted to determine the level of measurement invariance. For example, if testing for uniform DIF, the baseline model is compared with models formed by either individually constraining or releasing the item thresholds of interest across the groups/latent classes. Stark et al. (2006) noted that whereas the constrained-baseline approach is typically used in IRT research, the free-baseline approach is more common with CFA-based methods. However, one disadvantage of these baseline approaches is that they require two models (i.e., baseline and constrained or augmented) to be fitted to the data, which increases the complexity of the model estimation procedure.


Unlike the two previous methods, which use a nested-models approach, the third method does not require the specification of two sets of models. The approach, which has been credited to Mplus developer Tihomir Asparouhov, has recently been used in factor mixture studies conducted by Clark (2010) and Clark et al. (2009). The Mplus (Muthén & Muthén, 1998-2008) implementation of this approach, as described by Clark (2010), is as follows. The thresholds of all items, except those of a referent needed for identification, are allowed to vary freely across the latent classes. Next, the Mplus model constraint option is invoked to create a set of new variables. Each new variable defines a threshold difference across classes for one of the items to be tested for DIF. For example, the estimated threshold difference for item 6 may be defined as: dif_it6 = t2_i6 - t1_i6, where t2_i6 and t1_i6 are the user-supplied variable names for the threshold of item 6 in class 2 and class 1, respectively. The creation of the 14 threshold-difference equations allows the item thresholds to be tested across classes via a Wald test to determine whether the differences are significantly different from zero. A significant p-value provides evidence that the item is functioning differentially, while a non-significant result indicates that the item is DIF-free. In the case where the referent item is not known to be invariant across classes, a series of tests is undertaken in which each item's thresholds are successively constrained equal across classes and the threshold differences are estimated for the remaining items. Finally, a tally is made of the total number of times that each item displayed a significant p-value when its thresholds were not constrained. Since this method does not require the formation of two models, one obvious advantage is its simplicity. However, unlike the more established baseline procedures, it has not been subjected to the methodological rigor that should precede the acceptance and usage of an approach in applied settings.
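The logic of the threshold-difference test can also be illustrated outside of Mplus. The R sketch below computes the Wald statistic for a single item's threshold difference, assuming the class-specific threshold estimates and their sampling covariance matrix have already been extracted from a fitted model; the input values shown are hypothetical.

wald_dif <- function(tau_class1, tau_class2, vcov2) {
  # vcov2 is the 2 x 2 covariance matrix of the two threshold estimates
  d  <- tau_class2 - tau_class1                              # threshold difference
  se <- sqrt(vcov2[1, 1] + vcov2[2, 2] - 2 * vcov2[1, 2])    # SE of the difference
  w  <- (d / se)^2                                           # Wald statistic, 1 df
  c(difference = d, wald = w,
    p.value = pchisq(w, df = 1, lower.tail = FALSE))
}

# hypothetical estimates: thresholds of -0.4 and 0.5 with small sampling variances
wald_dif(-0.4, 0.5, matrix(c(0.04, 0.01, 0.01, 0.05), nrow = 2))

A p-value below .05 from such a test corresponds to flagging the item as displaying uniform DIF across the latent classes.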


Estimation of Mixture Models

The purpose of the mixture modeling estimation process is to attempt to disentangle the hypothesized mixture of distributions into the pre-specified number of latent classes. Unlike the manifest situation, where group membership is observed and group proportions are known, class membership is unobserved. Therefore, an additional model parameter, known as the mixing proportion, is estimated (Gagné, 2004). The K-1 mixing proportions estimate the proportion of individuals comprising each of the K hypothesized classes. Additionally, while individuals obtain a probability of being a member of each of the K classes, they are assigned to a specific class based on their highest posterior probability of class membership. To estimate the model parameters, the joint log-likelihood of the mixture across all observations is maximized (Gagné, 2004). For a mixture of two latent subpopulations, the joint log-likelihood of the mixture model can be expressed as the maximization of:

\( \ln L = \sum_{i=1}^{N} \ln\left[ \pi L_{1i} + (1 - \pi) L_{2i} \right] \)   (17)

where L_{1i} and L_{2i} represent the likelihood of the ith examinee being a member of subpopulation 1 and subpopulation 2, respectively, π represents the unknown mixing proportion, and N is the total number of examinees in the sample. Likewise, for K subpopulations, Gagné (2004) presents the expression for the joint log-likelihood of the mixture model as:

\( \ln L = \sum_{i=1}^{N} \ln \sum_{k=1}^{K} \pi_k (2\pi)^{-p/2} \left|\Sigma_k\right|^{-1/2} \exp\left[ -0.5\,(x_i - \mu_k)' \Sigma_k^{-1} (x_i - \mu_k) \right] \)   (18)

where \( \mu_k = \nu_k + \Lambda_k \alpha_k \), \( \Sigma_k = \Lambda_k \Psi_k \Lambda_k' + \Theta_k \), and p is the number of observed variables.

Class Enumeration

An important decision to be made is determining the number of latent classes existing in the population (Bauer & Curran, 2004; Nylund, Asparouhov, & Muthén, 2006). Traditionally, researchers use standard chi-squared based statistics to compare


models. However, in mixture analysis when comparing models with differing numbers of latent classes, the traditional likelihood ratio test for nested models is no longer appropriate (Bauer, & Curran, 2004; McLachlan & Peel, 2000; Muthn, 2007). Instead, alternative model selection indices are used to compare competing models with different numbers of latent classes. These include: (i) information-based criteria such as Akaike Information Criteria (AIC; Akaike, 1987), Bayesian Information Criteria (BIC, Schwartz, 1978), and the adjusted BIC (ssaBIC; Sclove, 1987), (ii) likelihood-based tests such as the Lo-Mendell Rubin adjusted likelihood ratio test (LMR aLRT; Lo, Mendell, & Rubin, 2001) and the bootstrapped version of the LRT (BLRT; McLachlan & Peel, 2000), and (iii) statistics based on the classification of individuals using estimated posterior probabilities, such as entropy (Lubke & Muthn, 2007). And while there has been limited research conducted comparing the performances of these various model selection methods, no consistent guidelines have been established determining which model selection indices are most useful in comparing models or selecting the best-fitting model (Lubke & Neale, 2006; Nylund, et al. 2006; Tofighi & Enders, 2008; Yang, 2006). The reason for this is that there is seldom unanimous agreement across the various model selection indices and as a result the possibility of misspecification of the number of classes is a likely occurrence (Bauer & Curran, 2004; Nylund et al., 2006). Therefore, the researcher should not rely on these indices as the sole determinant of the number of latent classes. Rather, it is advised that in addition to the statistical indices, a theoretical justification should also guide not only the selection of the optimal number of classes but the interpretation of the classes as well (Bauer & Curran, 2004; Gagn, 2006; 38


Muthén, 2003). The most common information criterion indices for model selection are introduced below.

Information Criteria Indices

The information criteria measures (e.g., the AIC, BIC, and the sample-size adjusted BIC) are all based on the log-likelihood of the estimated model and the number of free parameters in the model. On their own, individual values of these information-based criteria for a specified model are not very useful. Instead, for a specified model, the indices for each of the measures are compared across models with varying numbers of classes. For example, if the hypothesized model is one with two latent classes, this would be successively compared with one-, three-, and four-class models. Typically, the model with the lowest information criteria value compared to the other models with different numbers of classes is selected as the best-fitting model (Lubke & Muthén, 2005; Nylund et al., 2006). The information criteria such as the AIC, BIC, and ssaBIC are all based on the log-likelihood and adjust differently for the number of free parameters and sample size (Lubke & Muthén, 2005; Lubke & Neale, 2006). The AIC, which is defined as a function of the log-likelihood and the number of estimated parameters, penalizes for overparameterization only but not for sample size:

\( AIC = -2\log L + 2p \)   (19)

On the other hand, the BIC and ssaBIC adjust for both the number of parameters and the sample size (Nylund et al., 2006). The BIC and ssaBIC are given by:

\( BIC = -2\log L + p\,\log(N) \)   (20)

\( ssaBIC = -2\log L + p\,\log\!\left(\frac{N+2}{24}\right) \)   (21)
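As a brief illustration of how these three indices are computed and compared across class solutions, the following R sketch uses made-up log-likelihoods and free-parameter counts for one-, two-, and three-class fits; the values are hypothetical and serve only to show the calculations in Equations 19 through 21.

ic_values <- function(logL, p, N) {
  c(AIC    = -2 * logL + 2 * p,
    BIC    = -2 * logL + p * log(N),
    ssaBIC = -2 * logL + p * log((N + 2) / 24))
}

# hypothetical maximized log-likelihoods and free-parameter counts
fits <- data.frame(classes = 1:3,
                   logL    = c(-4730.2, -4695.8, -4688.9),
                   p       = c(30, 46, 62))
ics <- t(mapply(ic_values, fits$logL, fits$p, MoreArgs = list(N = 500)))
cbind(fits, round(ics, 1))
# each index prefers the class solution with its lowest value
fits$classes[apply(ics, 2, which.min)]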


As was noted earlier, when comparing models with different numbers of latent classes, lower values of AIC, BIC, and ssaBIC indicate better fitting models. A typical approach to class enumeration begins with the fitting of a baseline one-class model and successively fitting models with the goal of identifying the mixture model with the smallest number of latent classes that provide the best fit to the data (Lui, 2008). However, previous research has found that results from different information criteria can provide ambiguous evidence regarding the optimal number of classes. In addition, across different mixture models, there is also inconsistency regarding the model selection information criterion that performs best. Nylund et al. (2006) conducted a simulation study comparing the performance of commonly-used information criteria for three types of mixture models: latent class, factor mixture, and growth mixture models. Overall, the researchers found that among the information criteria measures, the AIC, which does not adjust for sample size, performed poorly and identified the correct k-class model on fewer occasions than the two sample-sized adjusted indices, the BIC and the ssaBIC. Moreover, the AIC frequently favored the selection of the k+1-class model over the correct k-class. In addition, whereas the ssaBIC generally performed well with smaller sample sizes (N=200, 500), the BIC tended to be the most consistent overall performer, particularly with larger sample sizes (N=1000). Based on their simulation results, Nylund et al. (2006) concluded that the BIC was the most accurate and consistent of the IC measures at determining the correct number of latent classes. Yang (1998) evaluated the performance of eight information criteria in the selection of latent class analysis (LCA) models for six simulated levels of sample size. 40


The results suggested that the ssaBIC outperformed the other five IC measures including the AIC and the BIC. For instance, with smaller sample sizes (N =100, 200) the ssaBIC had the highest accuracy rates of 62.7% and 77.5% respectively. In addition, Yang (1998) found that both BIC and a consistent form of the AIC (CAIC) tended to incorrectly select models with fewer latent classes than actually simulated. The performance of the BIC and CAIC only improved after the sample size increased to the largest condition of N=1000. The researcher concluded that in the case of LCA models, the ssaBIC outperformed the AIC and BIC at determining the correct number of latent classes (Yang, 1998). Tofighi and Enders (2007) extended their simulation research to evaluating the accuracy of information-based indices in identifying the correct number of latent classes in growth mixture models (GMM). Manipulated factors included the number of repeated measures, sample size, separation of latent classes, mixing proportions, and within-class distribution shape simulated for a three-class population GMM. The researchers found that of the ICs, the ssaBIC was most successful at consistently extracting the correct number of latent classes. Once again, the BIC showed its sensitivity to small sample sizes and frequently favored too few classes. The accuracy of the ssaBIC persisted even when the latent classes were not well-separated, whereas the ssaBIC extracted the correct three-class solution in 88% of the replications, the BIC and CAIC only correctly identified this solution 11% and 4% of the time respectively. In examining the accuracy of model selection indices in multilevel factor mixture models, Allua (2007) found that while the BIC and ssaBIC outperformed the AIC in correct predictions when data were generated from a one-class model, none of the fit 41


indices performed credibly when a two-class model was used as the data-generating model. In this case, all of the fit indices tended to underestimate the number of latent classes by continuing to favor the one-class model over the correct two-class model. This inconsistency between model fit measures has also been evidenced in applied studies. Using an illustrative example, Lubke and Muthén (2005) applied factor mixture modeling to continuous observed outcomes from the Longitudinal Study of American Youth (LSAY) as a means of exploring the unobserved population heterogeneity. A series of increasingly invariant models were estimated and compared to a two-factor, single-class baseline model. For each of the models fit to the data, two- through five-class solutions were specified. The commonly used relative fit indices (AIC, BIC, ssaBIC, and aLRT) were used in choosing the best-fitting models. However, there were several instances of disagreement between the IC results. For example, among the non-invariant and fully invariant models, while the AIC and the ssaBIC identified the 4-class solution as the best-fitting model, the BIC and aLRT produced their lowest values for the 3-class solution. In summarizing their results, the authors suggested that in addition to relying on the model fit measures, researchers should explore the additional classes, in a similar manner as additional factors are investigated in factor analysis, to determine if their inclusion provides new, substantively meaningful interpretations to the solution. Overall, results from both simulation and applied studies highlight the lack of agreement among the mixture model fit indices. Researchers have attributed this inconsistency of performance to the heavy dependence of the indices on the type of mixture model under consideration as well as the assumptions made about the


populations (Liu, 2008; Nylund et al., 2006). Therefore, rather than viewing any single measure as being superior, each should be seen as a contributory piece of evidence in determining the comparison of one model versus another. However, while the model fit indices provide the statistical perspective this should be augmented with a complementary approach of incorporating a substantive theoretical justification to aid in both the selection of the optimal number as well as the interpretability of the latent classes as well (Bauer & Curran, 2004). Mixture Model Estimation Challenges While factor mixture modeling is an attractive tool for simultaneously investigating population heterogeneity and latent class dimensionality, it is not without its challenges. The merging of the two types of latent variables into one integrated framework results in a model that requires a high level of computational intensity during the estimation process. As a result, factor mixture models require lengthy computation times which in turn reduce the number of replications that can be simulated within a realistic time frame. In addition to the increased computation times, the models are susceptible to problems due to multiple maxima solutions. Ideally, in ML estimation as the iterative procedure progresses, the log-likelihood should monotonically increase until it reaches one final maxima. However, with mixture models the solution often converges on a local rather than a global maximum, thereby producing biased parameter estimates. Whether the expectation-maximization (EM) algorithm converges to a local or global maximum largely depends on the set of different starting values that are used. Therefore, one approach to mitigating this problem is to incorporate multiple random starts, a practice that is permitted in Mplus. In the event that the default number of random starts (in Mplus, the defaults are 10 random starting sets and the best 2 sets used for final 43


optimization) is insufficient to converge on a maximum likelihood solution, Mplus allows the user the flexibility to increase the number of start values. By adjusting the random starts option to include a larger number of start values both in the initial analysis and final optimization phases allows for a more thorough investigation of multiple solutions and should improve the likelihood of successful convergence (Muthn & Muthn, 1998-2008). However, since the increase in the number of random starts will also increase the computational load and estimation time, it is recommended that prior to conducting a full study researchers should experiment with various sets of user-defined starting values to determine an appropriate number of sets of starting values (Nylund et al., 2006). During this process, it is important to examine the results from the final stage solutions to determine whether the best log-likelihood is replicated multiple times. This ensures that the solution converged on a global maximum, thus reducing the possibility that the parameter estimates are derived from local solutions. Purpose of Study In the past, the lack of usage of SEM-based mixture models has been attributed to an unavailability of commercial software (Bandalos & Cohen, 2006). However, given the recent innovations integrated in software packages such as Mplus (Muthn & Muthn, 1998-2008), the estimation of SEM mixture models is now possible (Bandalos & Cohen, 2006). The purpose of this study was to evaluate the performance of factor mixture modeling as a method for detecting items exhibiting manifest-group DIF. In this study, manifest-group DIF was generated in a set of dichotomous data for a two-group, two-class population. The questions addressed were as follows. How successful is the factor mixture modeling approach at recovering the correct number of latent classes? 44


If the number of classes are known a priori, how well does the factor mixture model perform at detecting differentially functioning items. Specifically, how are the (i) convergence rates, (ii) Type I error rate, (iii) and power to detect DIF affected under various manipulated conditions characteristic of those that may be encountered in DIF research? 45


Figure 2-1. Example of uniform DIF

Figure 2-2. Example of non-uniform DIF


Figure 2-3. Depiction of relationship between y* and y for a dichotomous item


Figure 2-4. Diagram depicting specification of the factor mixture model


CHAPTER 3
METHODOLOGY

The simulation was conducted in two parts. The first part of the study focused on the ability of the factor mixture model to recover the correct number of latent classes under a variety of simulated conditions. In the second phase of the study, the number of classes was assumed known and the emphasis was on evaluating the performance of the mixture model at identifying differentially functioning items. Following is a description of the model as implemented, as well as the study design used in evaluating the performance of the factor mixture modeling approach to DIF detection.

Factor Mixture Model Specification for Latent Class DIF Detection

The factor mixture model was specified in its hybrid form as having both a single factor measured by 15 dichotomous items and a categorical latent class variable. The factor mixture model was formulated in the study as:

\( y^*_{ik} = \nu_k + \Lambda \eta_{ik} + \varepsilon_{ik} \)   (22)

\( \eta_{ik} = \alpha_k + \zeta_{ik} \)   (23)

where the parameters are as previously defined in Chapter 2 and k = 1 to K indexes the number of latent classes. To accommodate the testing of uniform DIF, the model was formulated so that the factor loadings were constrained to be class-invariant but the item thresholds were allowed to vary across classes. Therefore, in Equation 22, the parameter Λ is not indexed by the k subscript. Overall, the single-factor mixture model was specified as follows:

1. The factor loadings were constrained equal across the latent classes. For scaling purposes, the factor loading of the referent (i.e., item 1) was fixed at one for each of the latent classes.


2. To ensure identification, the item thresholds of the referent were also held equal across the latent classes. The remaining 14 item thresholds were freely estimated.

3. One of the factor means was constrained to zero while the remaining factor mean was freely estimated. For K latent classes, the Mplus default is to fix the mean of the last or highest-numbered latent class to zero (i.e., α_K = 0). Therefore, in this case, the mean of the first class was freely estimated.

4. Factor variances were freely estimated for all latent classes.

Data Generation

The discrimination and difficulty parameters used in this study were adopted from dissertation research conducted by Wanichtanom (2001). The original test (Wanichtanom, 2001) consisted of 50 items; however, in this case, parameters for ten of the 50 items were selected. These ten items from the Wanichtanom (2001) study represented the DIF-free test items. In the original study, the item discrimination parameters were drawn from a uniform distribution within a 0 to 2.0 range and the difficulty parameters from a normal distribution within a -2.0 to 2.0 range (Wanichtanom, 2001). The remaining five DIF items that formed part of the scale reflected low (i.e., 0.5), medium (i.e., 1.0), and high (i.e., 2.0) levels of discrimination. For the entire 15-item test, the discrimination a parameters ranged from 0.4 to 2.0, with a mean of 0.98, while the difficulty b parameters ranged from -1.2 to 0.7, with a mean of -0.34. Uniform DIF was simulated against the focal group on Items 2 to 6. The values of the item parameters are presented in Table 3-1. Data were generated using R statistical software (R Development Core Team, 2009). The ability parameters were drawn from normal distributions for both the reference and focal groups. For these dichotomous items, the probability of a correct response was computed using the 2PL IRT model as:


\( \Pr(Y_{ij} = 1 \mid \theta_j) = \frac{1}{1 + e^{-a_i(\theta_j - b_i)}} \)   (24)

where a_i is the item discrimination parameter, b_i is the item threshold parameter, and θ_j is the latent ability trait for examinee j. To determine each examinee's item response, the calculated probability Pr(Y_ij = 1 | θ_j) was compared to a randomly generated number from a uniform U(0,1) distribution. If that probability exceeded the random number, the examinee's item response was scored as correct (i.e., coded as 1). On the other hand, if the probability of a correct response was less than the random number, the item response was scored as incorrect and coded as 0. Finally, 50 replications were run for each set of simulation conditions, and the dichotomous item response datasets were exported to Mplus V5.1 (Muthén & Muthén, 1998-2008) for the analysis phase. Since the data were generated externally, the Mplus Type=Montecarlo option was used to analyze the multiple datasets and to save the results for the replications that converged successfully.
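For concreteness, the response-generation scheme just described can be sketched in R as follows. The item parameters below are a hypothetical subset rather than the full set in Table 3-1, and the snippet generates responses for a single group without DIF; in the actual design, the difficulties of the DIF items were shifted for most of the focal group.

set.seed(2010)
a <- c(1.10, 0.50, 1.00, 2.00)          # hypothetical discriminations
b <- c(-0.07, -1.00, 0.00, -1.00)       # hypothetical difficulties
N <- 500
theta <- rnorm(N, mean = 0, sd = 1)     # latent abilities for one group

# 2PL probability of a correct response (Equation 24), examinees by items
P <- sapply(seq_along(a), function(i) 1 / (1 + exp(-a[i] * (theta - b[i]))))

# score an item correct when its probability exceeds a U(0,1) draw
Y <- (P > matrix(runif(N * length(a)), nrow = N)) * 1
colMeans(Y)                             # observed proportions correct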


Simulation Study Design

In their 1988 paper, Lautenschlager and Park reiterated the need for Monte Carlo studies to be designed in such a way that they simulate real data conditions as closely as possible. This advice was followed when selecting the conditions and levels for this simulation study. The conditions were chosen to replicate those adopted in previous latent DIF studies (Bolt, Cohen, & Wollack, 2001; Bilir, 2009; Cohen & Bolt, 2005; De Ayala et al., 2002; Samuelsen, 2005) and mixture modeling studies (Gagné, 2004; Lee, 2009; Lubke & Muthén, 2005, 2007).

Research Study 1

In the first part of the study, dichotomous item responses were generated for the two-group, two-class scenario. The focus was on determining the success rate of the specified factor mixture model in recovering the correct number of latent classes. Solutions for one- through three-class mixture models were estimated, and three information-based criteria values were compared across the models. The model with the lowest IC value was selected as the best-fitting model (Lubke & Muthén, 2005; Nylund et al., 2006). The fixed and manipulated factors used in this study are listed below.

Manipulated Conditions

Sample size

Previous findings have shown that, as with pure CFA models, sample size affects the convergence rates of mixture models as well (Gagné, 2004; Lubke, 2006). In evaluating the performance of several CFA mixture models, Gagné (2004) reported a significant increase in the convergence rates as the sample size was increased from a minimum of 200 to 500 to 1000. A review of previous simulation and real-data mixture model research found that whereas only a few studies used as few as 200 simulees (Gagné, 2004; Nylund et al., 2006), sample sizes of at least 500 were most frequently used (Bolt et al., 2001; Bilir, 2009; Cho, 2007; De Ayala et al., 2002; Rost, 1990; Samuelsen, 2005). In this study, the two sample sizes (N=500, N=1000) were chosen to be representative of realistic research samples and to reduce the possibility of convergence problems. In addition, the sample size of 500 was used as a lower limit to examine the effects of small sample size on the performance of the factor mixture approach to DIF detection.


Magnitude of uniform DIF

In previous DIF studies (Camilli & Shepard, 1987; De Ayala et al., 2005; Meade, Lautenschlager, & Johnson, 2007; Samuelsen, 2005), the manipulated difficulty shifts have typically varied in magnitude from .3 to 1.5. Overall, these results have shown higher DIF detection rates for items simulated to have moderate or strong amounts of DIF. However, with mixture models, it may be necessary to simulate larger DIF magnitudes to ensure the detection of DIF. This hypothesis was based on the results from a preliminary small-scale simulation in which several levels of DIF magnitude were manipulated. As a result, this study focused on DIF effects at the upper range of the scale, where the magnitude of manifest differential functioning is large, namely, Δb = 1.0 and Δb = 1.5. For items with no DIF, the item difficulties are defined as b_iF = b_iR. On the other hand, when there is uniform DIF, the items were simulated to function differently in favor of the reference group, with the item difficulties defined as b_iF = b_iR + Δb (where Δb = 1.0 or 1.5).

Ability differences between groups

Several researchers have recommended the inclusion of latent ability differences (i.e., impact) in DIF detection studies since they contend that in real data sets, the focal and reference populations typically have different latent distributions (Camilli & Shepard, 1987; De Ayala et al., 2002; Donoghue, Holland, & Thayer, 1993; Duncan, 2006; Stark et al., 2006). Simulation results of the effects of impact on DIF detection have varied. For instance, some researchers have reported good control of Type I error rates with a moderate difference of .5 SD (Stark et al., 2006) and even with differences as large as 1 SD (Narayanan & Swaminathan, 1994). On the other hand, others (Cheung & Rensvold, 1999; Lee, 2009; Roussos & Stout, 1996; Uttaro & Millsap, 1994) have


reported inflated Type I error rates with unequal latent trait distributions. The results are also mixed with respect to the presence of impact on power. Whereas some studies have shown reduced power (Ankemann, Witt, & Dunbar, 1999; Clauser, Mazor, & Hambleton, 1993; Narayanan & Swaminathan, 1996), others (Gonzlez-Rom et al., 2006; Rogers & Swaminathan, 1993; Shealy & Stout, 1993; Stark et al., 2006) found that DIF detection rates were not negatively affected by the dissimilarity of latent distributions. In this first part of the study, two conditions of differences in mean latent ability were manipulated: 1. Equal latent ability means with the reference and focal groups both generated from a standard normal distribution (i.e. R~N(0,1), F~N(0,1)), and 2. Unequal latent ability means with the reference group having a latent ability mean .5 standard deviation higher than the focal group (i.e. R~N(0.5,1), F~N(0,1)). Fixed Simulation Conditions Test length The test was simulated for a fixed length of 15 dichotomous items. Previous studies using factor mixture modeling have typically used shorter scale lengths varying between 4 and 12 observed items for a single-factor model with categorical items (Lubke & Neale, 2008; Nylund et al., 2006; Kuo et al., 2008; Reynolds, 2008; Sawatzy, 2007). This may be due to the fact that longer computation times are required when fitting mixture models to categorical data (Lubke & Neale, 2008). Therefore, while more test items may have been included, this length was chosen not only to be consistent with previous research, but also taking into account the computational intensity of factor mixture models. 54


Number of DIF items In previous simulations studies, the percentage of DIF items has typically varied from 0% to 50% as a maximum (Bilir, 2009; Cho, 2007; Samuelsen, 2005; Wang et al., 2009). For example, Samuelsen (2005) considered cases with 10%, 30% and 50% of DIF items, Cho (2007) investigated cases with 10% and 30% DIF items, and Wang et al. (2009) manipulated the number of DIF items in increments of 10% from 0% to 40%. With respect to real tests, Shih and Wang (2009) reported that they typically contain at least 20% of DIF items. In this study, the percentage of DIF items was 33.3% (five items), with the DIF items all favoring the reference group. Items 2 through 6 were selected to display uniform DIF. Sample size ratio With respect to the ratio of focal to reference groups, Atar (2007) reports that in actual testing situations, the sample size for the reference group may be as small as the sample size for the focal group or the sample size for the reference group may be larger than the one for the focal group (pg. 29). In this study, a 1:1 sample size ratio of focal to reference group will be considered for each of the two sample sizes. Using comparison groups of equal size is representative of an evenly split manifest variable frequently used in DIF studies such as gender (Samuelsen, 2005). Percentage of overlap between manifest and latent classes In the manifest DIF approach when an item is identified as having DIF, there is an implied assumption that all members of the focal groups must have been disadvantaged by this item. However, under a latent conceptualization, the view is that DIF is detected based on the degree of overlap between the manifest groups and the latent classes. In this context, overlap refers to the percentage of membership homogeneity between the 55


manifest groups and latent classes. For example, if each of the examinees in either the manifest-focal or the manifest-reference group belongs to the same latent class, then this is referred to as 100% overlap. Therefore, as the level of group-class overlap decreases, there is a corresponding decrease in the level of homogeneity between groups and classes as well. In Samuelsen's (2005) study, five levels of overlap, decreasing in increments of 10% from 100% to 60%, were considered. Samuelsen (2005) found that as the group-class overlap increased, the power of the mixture approach to correctly detect DIF increased as well. In this study, the level of overlap was fixed at 80%, a somewhat realistic expectation of what may be encountered in practice. This means that DIF was simulated against 80% of the simulees in the focal group.

Mixing proportion

The mixing proportion (π_k) represents the proportion of the population in class k, which was fixed at .50. Although the class membership was known, it was not used in the simulation.

Study Design Overview

In sum, a total of three fully crossed factors, resulting in eight simulation conditions (2 sample sizes x 2 DIF magnitudes x 2 latent ability distributions), were manipulated to determine their effect on the recovery of the correct number of latent classes. For each of the eight conditions, a total of 50 replications were run. It is important to note that in the original plan for this study, a larger number of replications was proposed. However, initial simulation runs revealed that the computational time necessary to complete larger numbers of replications was impractical for this dissertation. Therefore, given the timing


constraints, a smaller number of data sets (i.e., 50) were replicated. The list of study conditions is provided in Table 3-2.

Evaluation Criteria

As previously noted, the objective of this first part of the simulation was to determine the success rate of the factor mixture method in identifying the correct number of classes. The three likelihood-based model fit indices (AIC, BIC, and ssaBIC) provided by Mplus were compared, with smaller values indicating better model fit. The outcome measures evaluated for each of the three (i.e., one- through three-class) factor mixture solutions fit to the data were:

Convergence rates. This was represented as the number of replications that converged to a proper solution across the 50 simulations for each set of the eight conditions. Data sets with improper or non-convergent solutions were not included in the analysis.

IC performance. Performance was evaluated by calculating the average IC values and comparing the values for each index across the one-, two-, and three-class models. For each of the simulated conditions, the lowest average IC value and the corresponding model are identified.

Research Study 2

In the second part of the study, research was conducted to evaluate the Type I error rate and power performance of the factor mixture model at detecting uniform DIF, assuming that the correct number of classes is known. With respect to the study design, two levels of DIF magnitude (DIF = 1.0, 1.5) and two levels of sample size (N = 500, 1000) were again simulated using the same levels as in Study 1. However, an additional level was included for the impact condition. More specifically, in addition to the no-impact and moderate-impact conditions, a large level of impact (i.e., the mean for the reference group was 1.0 SD higher than the mean of the focal group) was included as well. The inclusion of this new level permitted a more complete investigation of the


robustness of the factor mixture model in DIF detection to the influence of impact. Overall, a total of 12 conditions (2 sample sizes x 2 DIF magnitudes x 3 latent trait distributions) were simulated. In this second phase of the simulation, each condition was replicated 1000 times. The full list of study design conditions is shown in Table 3-3.

Data Analysis

The 1000 sets of dichotomous item responses were generated by R V2.9.0 (R Development Core Team, 2009). The data sets for each of the 12 conditions were saved and exported from R to Mplus V5.1 for analysis. As was done in the first part of the study, the Type=Montecarlo facility was used to accommodate the analysis of the multiple datasets generated external to Mplus and to save the results for subsequent analysis. To assess uniform DIF, a simultaneous significance test of the 14 threshold differences (i.e., all items with the exception of the referent, Item 1) was conducted using a Wald test. A p-value less than .05 provided evidence of DIF in the item.

Evaluation Criteria

The outcome measures used in evaluating the performance of this factor mixture method for DIF detection were as follows:

Convergence rates. This was measured by the number of replications that converged to proper solutions for each of the 12 combinations of conditions. Data sets with improper or non-convergent solutions were not included in the analysis.

Type I error rate. The Type I error rate (or false-positive rate) was computed as the proportion of times the DIF-free items were incorrectly identified as having DIF. Therefore, the overall Type I error rate was calculated by dividing the total number of times the nine DIF-free items (i.e., Items 7-15) were falsely rejected by the total number of properly converged replications for each of the 12 study conditions. The nominal Type I error rate used in this study was .05.


Statistical power. Power (or the true-positive rate) was computed as the proportion of times that the analysis correctly identified the DIF items as having DIF. Therefore, the overall power rate was calculated by dividing the total number of times any one of the five DIF items (i.e., Items 2-6) was correctly identified by the total number of properly converged replications across each of the 12 simulated conditions.

In addition to the computation of the overall Type I error and power rates of the factor mixture method, a variance components analysis was also conducted to examine the influence of each of the conditions and their interactions on the performance of the method. In this analysis, which was conducted in R V2.9.0 (R Development Core Team, 2009), the independent variables were the three study conditions (DIF magnitude, sample size, and impact) and the dependent variables were the Type I error and power rates. Eta-squared (η²), which calculates the percentage of variance explained by each of the main effects and their interactions, was used as a measure of effect size.

Model Estimation

The parameters of the mixture models were estimated in Mplus V5.1 (Muthén & Muthén, 1998-2008) with robust maximum likelihood estimation (MLR) using the EM algorithm, which is the default estimator for mixture analysis in Mplus. One of the main limitations of running a mixture simulation study is the lengthy computation time needed for model estimation. In the interest of time, the random starts feature, which randomly generates sets of starting values, was not used in this part of the study. Instead, true population parameters for the factor loadings, thresholds, and factor variances were substituted for the starting values in this portion of the analysis. This change reduced the computation time for model estimation considerably.
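As a rough sketch of how the Type I error and power rates defined above could be tallied from the saved significance results, the following R snippet assumes pvals is a replications-by-items matrix of Wald-test p-values from the converged runs of a single condition; the matrix used in the example call is randomly generated for illustration only.

dif_items   <- 2:6    # items simulated with uniform DIF
clean_items <- 7:15   # DIF-free items (Item 1 is the referent)

rates <- function(pvals, alpha = .05) {
  flagged <- pvals < alpha
  c(type1 = mean(flagged[, clean_items]),   # false positives among DIF-free items
    power = mean(flagged[, dif_items]))     # true positives among DIF items
}

# illustrative call with placeholder p-values for 1000 replications of 15 items
rates(matrix(runif(1000 * 15), nrow = 1000))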


Table 3-1. Generating population parameter values for reference group

Item Number      a         b
 1            1.0950   -0.0672
 2            0.5001   -1.0000
 3            0.5001   -0.5000
 4            1.0000    0.0000
 5            2.0000   -1.0000
 6            2.0000    0.0000
 7            0.5584   -0.7024
 8            0.9819    0.6450
 9            0.5724   -0.5478
10            1.4023   -0.3206
11            0.4035   -1.1824
12            1.0219   -0.4656
13            0.9989   -0.2489
14            0.7342   -0.4323
15            0.8673    0.7020

Note. Item 1 is the referent; therefore, its loadings were fixed at 1 and its thresholds constrained equal across classes. Uniform DIF against the focal group was simulated on Items 2 to 6.


Table 3-2. Fixed and manipulated simulation conditions used in study 1

Manipulated conditions:
  Sample size: 500, 1000
  Magnitude of DIF: 1.0, 1.5
  Latent mean distributions: R~N(0,1), F~N(0,1); R~N(.5,1), F~N(0,1)

Fixed conditions:
  Test length: 15 items
  Number of DIF items: 5 items (33.3%)
  Sample size ratio: 1:1
  Class proportion: .5
  Overlap: 80%

Table 3-3. Fixed and manipulated simulation conditions used in study 2

Manipulated conditions:
  Sample size: 500, 1000
  Magnitude of DIF: 1.0, 1.5
  Latent mean distributions: R~N(0,1), F~N(0,1); R~N(.5,1), F~N(0,1); R~N(1,1), F~N(0,1)

Fixed conditions:
  Test length: 15 items
  Number of DIF items: 5 items (33.3%)
  Sample size ratio: 1:1
  Class proportion: .5
  Overlap: 80%


CHAPTER 4 RESULTS Research Study 1 In this section, the results of the first part of the simulation are presented. To answer the research question, data were generated for a two-group, two-class population with five of the 15 items simulated to display uniform DIF. The following conditions were manipulated in this study: sample size (500, 1000), DIF magnitude (1.0, 1.5), and differences in latent ability means (0 SD, 0.5 SD). The factor mixture model as formulated in Equations 22 and 23 was applied to determine how successful the method was at recovering the correct number of classes. For each of the eight condition combinations, one-, twoand three-class models were fit to the data. These results are presented in two sections. First, the rates of model convergence for each of the eight simulation conditions are reported. Secondly, the information criteria (IC) results which were used for model comparison and class enumeration are discussed. The results for Study 1 are summarized in Tables 4-1 through 4-4. Convergence Rates Table 4-1 presents the data on the number of convergent solutions for each combination of the eight simulation conditions. As was previously mentioned in the Methods section, non-convergent cases were excluded from the analysis, therefore for some conditions results were based on fewer than 50 replications. The results showed that overall the convergence rates were very high (ranging from .82 to 1.0), and there were minimal convergence problems. Of the 1200 (50x3x8) replications, 1147 successfully converged resulting in a 96% overall convergence rate. In addition, as the number of latent classes was increased, there was a corresponding decrease, albeit 62


minimal, in the number of properly converged solutions. More specifically, while the one-class model attained perfect convergence rates, the average convergence rates for the twoand three-class mixture models were 96% and 91% respectively. An inspection of the results also revealed a positive relationship between the convergence rate and the DIF magnitude. Of the 16 cells that failed to converge in the two-class model, 15 of them were for the smaller DIF condition. A similar trend was observed with the three-class model. Namely, of the 37 cells that failed to produce a properly convergent solution in the three-class model, 27 were associated with the smaller DIF=1.0 condition. The cases with non-convergent solutions were excluded from the second part of this analysis. Class Enumeration Summary data based on the three IC measures (AIC, BIC, and ssaBIC) for the one-, two-, and three-class models are provided in Tables 4-2 through 4-4. In comparing the fit of the models across classes, the smallest average IC value was used as the criterion in selecting the best-fitting model. An examination of the average IC values highlighted both overall and IC-specific patterns of results. First, as expected there is a general increase in the average IC values as sample size increases. Second, it is observed that the differences in average IC values between neighboring competing models were generally not substantial, and even negligible under some conditions. Third, with respect to the individual indices, a high level of inconsistency in model selection patterns is observed. The results for the three indices are described in more detail in the following sections. 63


Akaike Information Criteria (AIC) The average AIC values across the three specified mixture models are presented in Table 4-2. Overall, the pattern of results shows that the AIC tended to over-extract the number of latent classes. This trend was observed for six of the eight simulated conditions where the lowest AIC values corresponded to the three-class mixture model. The only exceptions to this pattern occurred for two of the four conditions when the DIF magnitude was increased to 1.5. In these cases, the lowest average AIC values occurred at the correct two-class model. However, it is important to note that the differences between neighboring class solutions were rather small, with the largest absolute difference between values being less than 40 points. Moreover, the differences are practically negligible between the twoand three-class models ranging in absolute magnitude over the eight simulated conditions from .02 to 8.72. Although, smaller IC values are indicative of better model fit, given the minor differences between the average AIC values, it makes selection between these two models a less clear-cut decision. Bayesian Information Criteria (BIC) The BIC results are presented in Table 4-3. Based on the average BIC values, this index consistently selected the simpler one-class model as the correct model for the data. For each of the eight manipulated simulation conditions, the lowest values corresponded to the one-class mixture model. Compared to the AIC, the IC differences between neighboring class models are generally higher for the BIC than for the corresponding AIC solutions. More specifically, the differences between neighboring class models ranged in absolute magnitude, between 40 and 88 points on average. The differences between the one-class and the correct two-class model were minimized 64


when the DIF magnitude was increased from 1.0 to 1.5. In these cases, even though the average IC values corresponded to the one-class model, because the average values are so similar in magnitude, it makes it difficult to unequivocally choose the one-class solution as the best-fitting model. Sample-size adjusted BIC (ssaBIC) Summary values for the ssaBIC compared across the three mixture models are presented in Table 4-4. These results reflected patterns seen with both the AIC and the BIC. First, similar to the BIC, the ssaBIC values suggested the simpler one-class model under conditions where the magnitude of the simulated uniform DIF is at the lower value of 1.0. However, the index also exhibits a pattern similar to that of the AIC by associating the smallest average IC values with the correct two-class model when larger uniform DIF of 1.5 was simulated. Finally, as was the case with the other two IC measures, the magnitude of differences across the three models was small. This was especially true of the differences between the oneand twoclass solutions, which ranged on average from 5.0 to 34.1 points. Research Study 2 In the second part of the study, the objective was to evaluate the Type I error rate and power of the factor mixture approach in the detection of DIF. The manipulated conditions again included sample size, magnitude of DIF and differences in latent ability means. In addition to the conditions used in the first phase of the study, one additional level of latent mean differences was included. For a measure of large impact, the mean for the reference group was simulated to be 1.0 SD standard higher than the focal group. Therefore, a total of 12 conditions were manipulated: 2 DIF magnitude (1.0, 1.5) x 2 sample size (500, 1000) x 3 differences in latent trait means (0, 0.5 SD, 1.0 SD). 65


One thousand replications were generated for each of the 12 simulation conditions examined in the Type I error rate and Power studies. The results for the Type I error rate and statistical power are addressed in the sections below. Nonconvergent Solutions In this part of the study, population parameters replaced the starting values randomly generated by Mplus. This change substantially reduced the computational load and decreased the model estimation time. The convergence rates across each of the conditions are presented in Table 4-5. Overall, results indicate no convergence problems, with rates ranging between 99.4% and a perfect convergence rate. Type I Error Rate The factor mixture model was evaluated in terms of its ability to control the Type I error rate under a variety of simulated conditions. Of the 15 items, nine were simulated to be DIF-free. The Type I error rate was assessed by computing the proportion of times the nine DIF items were incorrectly identified as having DIF. An item was considered to display DIF if the differences in thresholds were significantly different from zero. Therefore for the nine non-DIF items, the Type I error rate was computed as the proportion of times that the items obtained p-values less than .05. The Type I error rates across the 12 simulation conditions are presented in Table 4-6. The values in the table represent the proportion of times that the method incorrectly flagged a non-DIF item as displaying DIF. The results in Table 4-6 indicate that the factor mixture analysis method did not perform as well as expected in controlling the Type I error rate. The results showed elevated Type I error rates across all the study conditions, which means that the approach consistently produced false identifications at a rate exceeding the nominal 66


alpha level of .05. Overall, the average Type I error rate was 11.8%, which even after accounting for random sampling error would still be considered unacceptably high. Across the individual conditions, the error rates ranged from .09 to .16. Not surprisingly, the factor mixture method exhibited its strongest control of the rate of incorrect identifications for conditions of large DIF magnitude (DIF = 1.5), large sample size (N = 1000), and where there was either none or a moderate (0.5 SD) amount of impact. An initial examination of the pattern of results suggested that while the sample size and DIF magnitude are inversely related to Type I error rate, an increase in the mean latent trait differences resulted in slightly higher Type I error rates. For example, for the cells with DIF magnitude of 1.0, sample size of N = 500, and no impact, the Type I error rate was 0.12; however, when the latent trait means differed by 1.0 SD, the rate of false identifications increased marginally to 0.16. A more detailed discussion of the effect of each of the three conditions is presented in the following sections. Magnitude of DIF Table 4-7 and 4-8 display the aggregated results for the effect of the two levels of DIF magnitude (1.0 and 1.5) on Type I error rates. Overall, the rates of false identifications showed a slight decrease as the magnitude of DIF was increased. For example, when DIF of 1.0 was simulated, error rates across the conditions were between .10 and .16, with an average rate of .13. However, for larger DIF of 1.5, the rates ranged from .09 to .12, averaging at .10. Regardless of the size of DIF, the inflated rates were most pronounced for the smaller sample size of N=500 and when the difference in latent trait means was maximized (1.0 SD). 67


Sample size The results in Tables 4-9 and 4-10 suggest a weak inverse relationship between sample size and the ability of the factor mixture method to control Type I error rates. At the smaller sample size (N=500), the rate of false identifications ranged from .10 to .16, with an average rate of .12. Of the six cells associated with the smaller sample size, the test showed greatest control of the Type I error when larger DIF (1.5) was simulated and there was equality of the latent trait means. Increasing the sample size to 1000 decreased the Type I error rates marginally. Across the six conditions, the error rates were now between .09 and .14, averaging at .11, a negligible decrease from the average rate when N=500. However, the pattern of false identifications remained consistent across sample sizes: poor Type I error control was observed when smaller DIF (1.0) was simulated and there was large impact (1.0 SD); in contrast, improved control was observed for larger DIF magnitude (1.5) and in the absence of impact. Impact Three levels of impact (0, .5 SD, and 1.0 SD) were simulated in favor of the reference group. The aggregated Type I error rates which are summarized in Tables 4-11 through 4-13 showed that the differences in latent trait means between groups had no appreciable effect on the rate of incorrect identifications The Type I error rates for the no-impact, 0.5 SD and 1.0 SD conditions increased marginally from .11 to .12 to .13, a change that can be attributed to the presence of random error. Though not below the nominal alpha value of .05, the Type I error rates were best controlled when both DIF (1.5) and sample size (N=1000) were large. 68


Variance components analysis

Following the descriptive analysis of the pattern of Type I error rates across the simulated conditions, a variance components analysis was conducted to examine the influence of each of the simulation conditions, and of their interactions, on the Type I error rates. The results of this analysis are presented in Table 4-14. Based on the values, which ranged from 0.000 to 0.007, the only factor contributing to the variance in Type I error rates was the magnitude of DIF, accounting for a mere 0.7%. All other main effects and interactions produced trivial values.

Statistical Power

In the analyses above, the proportion of false DIF detections produced by the factor mixture approach consistently exceeded the nominal value of 0.05. When the Type I error rate is inflated in this way, power rates are no longer interpretable in terms of the standard alpha level. In this case, the power rates have still been analyzed and are displayed in Tables 4-15 through 4-24. However, it is important to note that these results should be interpreted with caution given the elevated Type I error rates. Power was assessed as the proportion of times across the 1000 replications that the factor mixture analysis correctly identified the five items (i.e., Items 2 to 6) simulated as having uniform DIF. Typically, values of at least .80 indicate that the analysis method is reasonably accurate in correctly detecting items with DIF. Results for the power analysis are displayed in Tables 4-15 through 4-22. The overall accuracy of DIF detection of the factor mixture analysis was 0.447, with the power of correct detection ranging from .264 to .801 across all simulated conditions. The only combination of conditions for which an acceptable level of power was achieved was when larger DIF (1.5) and sample size (N = 1000) were simulated and impact was absent.


For all other conditions the test failed to maintain adequate power. An initial examination of these results suggests that whereas higher rates of DIF detection are positively associated with DIF magnitude and sample size, there was a seemingly weak negative effect of impact. A more detailed discussion of the effect of each of the three conditions on DIF power rates follows.

Magnitude of DIF

As expected, increasing the magnitude of DIF significantly improved the power performance of factor mixture DIF detection (refer to Tables 4-16 and 4-17). On one hand, when DIF of 1.0 was simulated, the detection rates ranged on average from .264 to .350. On the other, average detection rates ranged from .425 to .801 when larger DIF (1.5) was simulated in the items. Overall, similar detection patterns were observed at both levels of DIF: power was highest with larger sample sizes (N = 1000) and in the absence of impact. In direct contrast, power was notably reduced when smaller sample sizes (N = 500) and maximum impact (1.0 SD) were simulated.

Sample size

As shown in Tables 4-18 and 4-19, power rates were positively related to sample size, a result that was not unexpected. For sample size conditions of N = 500, the DIF detection rate was .375, on average. However, a marked improvement in detection performance was observed (.520) when the sample size was increased to N = 1000. A comparison across the two levels of sample size reveals that the factor mixture procedure exhibited its greatest power to detect DIF under the combined conditions of large DIF (1.5) and equality of latent trait means.


Impact

The effect of impact on DIF detection rates was also examined. The three levels investigated were: (i) equal latent trait means, (ii) a 0.5 SD difference between latent trait means, representing a moderate amount of impact, and (iii) a 1.0 SD difference between latent trait means, representing a large amount of impact. The aggregated results in Tables 4-20 through 4-22 show that as the difference in latent trait means between the groups was increased, there was a negligible decline in the accuracy of the factor mixture method in detecting DIF. For example, the average power rate decreased marginally from .486 to .455 to .401 under the no-impact, 0.5 SD, and 1.0 SD conditions, respectively. These results suggest that the presence of impact did not substantially reduce the ability of the factor mixture approach to detect DIF.

Effect of item discrimination parameter values

For the five items simulated to contain DIF, three different levels of item discrimination were selected. For two items (Items 2 and 3), the discrimination parameter value was set at 0.5 to mimic low-discriminating items; for one item (Item 4), the a-parameter was set at 1.0, a medium level of discrimination; and two items (Items 5 and 6), with an a-parameter of 2.0, represented highly discriminating items. The discrimination parameter values for the non-DIF items were randomly drawn from a normal distribution within a specified range. The power rates for DIF detection, categorized by the level of item discrimination, are shown in Table 4-23. These results show, as expected, that power is influenced by the item discrimination parameter. More specifically, the accuracy of DIF detection increased as the item discrimination values increased.


The factor mixture method detected DIF in the low-discriminating items at an average rate of .369; this increased to .495 and .502 when DIF was simulated in items with medium and high values of the a-parameter, respectively. Moreover, while there was generally a clear difference in the accuracy of DIF detection between the low-discriminating items (a = .5) and either the medium or highly discriminating items, no discernible differences were evident when comparing DIF detection rates between items with medium (a = 1.0) and high (a = 2.0) values of the a-parameter. The patterns of DIF detection discussed earlier remained consistent across the simulation conditions regardless of the items' discriminating ability.

Variance components analysis

Finally, a variance components analysis was conducted to determine the influence of the simulation conditions and their interactions on the power rates. In this analysis, the power rates across the five DIF items were used as the dependent variable while the simulated conditions served as the independent variables. As was expected, the results showed that of the main effects, the DIF magnitude was the most significant contributor, accounting for 19% of the variance in the power rates. It was followed by sample size, with approximately 5%, and the interaction between these two factors, with 1.2%. Each of the other terms contributed less than 1.0% to the variance in the power rates. These results are shown in Table 4-24.
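To make the variance components analyses in Tables 4-14 and 4-24 concrete, the sketch below fits a fully crossed ANOVA to item-level rates and reports each term's share of the total sum of squares, which is the role the variance components play here. The rates in the data frame are simulated placeholders rather than the study's results, and the statsmodels-based decomposition shown is only one of several ways such an analysis could be carried out.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Placeholder data: one power rate per studied DIF item (5 items) in each of the
# 12 design cells, mimicking the layout of Table 4-23; the values are made up.
rng = np.random.default_rng(7)
rows = []
for dif in (1.0, 1.5):
    for n in (500, 1000):
        for impact in (0.0, 0.5, 1.0):
            base = 0.25 + 0.3 * (dif - 1.0) + 0.0002 * (n - 500) - 0.05 * impact
            for item in range(5):
                rows.append({"dif": dif, "n": n, "impact": impact,
                             "rate": base + rng.normal(0, 0.03)})
df = pd.DataFrame(rows)

# Full-factorial ANOVA; each term's sum of squares as a share of the total
# plays the role of the variance component reported in the tables.
model = ols("rate ~ C(dif) * C(n) * C(impact)", data=df).fit()
anova = sm.stats.anova_lm(model, typ=2)
share = anova["sum_sq"] / anova["sum_sq"].sum()
print(share.round(3))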


Table 4-1. Number of converged replications for the three factor mixture models

DIF Magnitude  Sample Size  Ability Differences  One-class  Two-class  Three-class
1.0            500          0                    50         46         43
                            0.5                  50         46         46
               1000         0                    50         45         41
                            0.5                  50         48         43
1.5            500          0                    50         50         47
                            0.5                  50         49         46
               1000         0                    50         50         49
                            0.5                  50         50         48


Table 4-2. Mean AIC values for the three mixture models

DIF Magnitude  Sample Size  Ability Differences  One-class   Two-class   Three-class
1.0            500          0                     9,520.64    9,509.62    9,501.86
                            0.5                   9,310.67    9,305.81    9,298.37
               1000         0                    18,956.98   18,961.67   18,961.03
                            0.5                  18,578.46   18,578.01   18,569.40
1.5            500          0                     9,495.81    9,473.12    9,473.10
                            0.5                   9,314.87    9,284.66    9,293.38
               1000         0                    18,953.93   18,916.12   18,918.62
                            0.5                  18,559.08   18,556.92   18,550.50

Note: AIC = Akaike Information Criterion

Table 4-3. Mean BIC values for the three mixture models

DIF Magnitude  Sample Size  Ability Differences  One-class   Two-class   Three-class
1.0            500          0                     9,647.08    9,707.71    9,771.59
                            0.5                   9,437.11    9,503.89    9,568.11
               1000         0                    19,104.21   19,192.34   19,275.12
                            0.5                  18,725.70   18,808.68   18,883.21
1.5            500          0                     9,622.25    9,671.21    9,742.83
                            0.5                   9,441.31    9,482.75    9,563.11
               1000         0                    19,101.16   19,146.78   19,232.72
                            0.5                  18,706.31   18,787.58   18,864.60

Note: BIC = Bayesian Information Criterion

Table 4-4. Mean ssaBIC values for the three mixture models

DIF Magnitude  Sample Size  Ability Differences  One-class   Two-class   Three-class
1.0            500          0                     9,551.86    9,558.53    9,568.45
                            0.5                   9,341.89    9,354.71    9,364.97
               1000         0                    19,008.93   19,043.06   19,071.86
                            0.5                  18,630.42   18,659.40   18,679.95
1.5            500          0                     9,527.02    9,522.03    9,539.69
                            0.5                   9,346.09    9,333.56    9,359.97
               1000         0                    19,005.88   18,997.51   19,029.45
                            0.5                  18,611.03   18,638.31   18,661.33

Note: ssaBIC = sample size adjusted Bayesian Information Criterion


Table 4-5. Percentages of converged solutions across study conditions

DIF Magnitude  Sample Size  Ability Differences  Percentage of converged solutions
1.0            500          0                     99.8
                            0.5                   99.5
                            1.0                   99.4
               1000         0                    100.0
                            0.5                   99.7
                            1.0                   99.6
1.5            500          0                     99.8
                            0.5                   99.7
                            1.0                   99.7
               1000         0                    100.0
                            0.5                  100.0
                            1.0                   99.8

Table 4-6. Overall Type I error rates across study conditions

DIF  Sample Size  Impact  Error rate
1.0  500          0       0.123
                  0.5     0.126
                  1.0     0.159
     1000         0       0.131
                  0.5     0.129
                  1.0     0.138
1.5  500          0       0.097
                  0.5     0.112
                  1.0     0.116
     1000         0       0.092
                  0.5     0.092
                  1.0     0.100


Table 4-7. Type I error rates for DIF = 1.0

Sample size  Impact  Error rate
500          0       0.123
500          0.5     0.126
500          1.0     0.159
1000         0       0.131
1000         0.5     0.129
1000         1.0     0.138

Table 4-8. Type I error rates for DIF = 1.5

Sample size  Impact  Error rate
500          0       0.097
500          0.5     0.112
500          1.0     0.116
1000         0       0.092
1000         0.5     0.092
1000         1.0     0.100

Table 4-9. Type I error rates for sample size of 500

DIF  Impact  Error rate
1.0  0       0.123
1.0  0.5     0.126
1.0  1.0     0.159
1.5  0       0.097
1.5  0.5     0.112
1.5  1.0     0.116

Table 4-10. Type I error rates for sample size of 1000

DIF  Impact  Error rate
1.0  0       0.131
1.0  0.5     0.129
1.0  1.0     0.138
1.5  0       0.092
1.5  0.5     0.092
1.5  1.0     0.100


Table 4-11. Type I error rates for impact of 0 SD

DIF  Sample size  Error rate
1.0  500          0.123
1.0  1000         0.131
1.5  500          0.097
1.5  1000         0.092

Table 4-12. Type I error rates for impact of 0.5 SD

DIF  Sample size  Error rate
1.0  500          0.126
1.0  1000         0.129
1.5  500          0.112
1.5  1000         0.092

Table 4-13. Type I error rates for impact of 1.0 SD

DIF  Sample size  Error rate
1.0  500          0.159
1.0  1000         0.138
1.5  500          0.116
1.5  1000         0.100

Table 4-14. Variance components analysis for Type I error

Condition          Variance component
DIF Magnitude (D)  .007
Sample size (S)    .000
Impact (I)         .001
D*S                .000
D*I                .000
S*I                .000
D*S*I              .000


Table 4-15. Overall power rates across study conditions

DIF  Sample Size  Impact  Power
1.0  500          0       0.268
                  0.5     0.267
                  1.0     0.264
     1000         0       0.350
                  0.5     0.324
                  1.0     0.291
1.5  500          0       0.525
                  0.5     0.498
                  1.0     0.425
     1000         0       0.801
                  0.5     0.731
                  1.0     0.623

Table 4-16. Power rates for DIF of 1.0

Sample size  Impact  Power
500          0       0.268
             0.5     0.267
             1.0     0.264
1000         0       0.350
             0.5     0.324
             1.0     0.291

Table 4-17. Power rates for DIF of 1.5

Sample size  Impact  Power
500          0       0.525
             0.5     0.498
             1.0     0.425
1000         0       0.801
             0.5     0.731
             1.0     0.623


Table 4-18. Power rates for sample size N of 500

DIF  Impact  Power
1.0  0       0.268
     0.5     0.267
     1.0     0.264
1.5  0       0.525
     0.5     0.498
     1.0     0.425

Table 4-19. Power rates for sample size N of 1000

DIF  Impact  Power
1.0  0       0.350
     0.5     0.324
     1.0     0.291
1.5  0       0.801
     0.5     0.731
     1.0     0.623

Table 4-20. Power rates for impact of 0 SD

DIF  Sample Size  Power
1.0  500          0.268
1.0  1000         0.350
1.5  500          0.525
1.5  1000         0.801

Table 4-21. Power rates for impact of 0.5 SD

DIF  Sample Size  Power
1.0  500          0.267
1.0  1000         0.324
1.5  500          0.498
1.5  1000         0.731

Table 4-22. Power rates for impact of 1.0 SD

DIF  Sample Size  Power
1.0  500          0.264
1.0  1000         0.291
1.5  500          0.425
1.5  1000         0.623


Table 4-23. Power rates for DIF detection based on item discriminations

DIF  Sample Size  Impact  a = .5   a = .5   a = 1.0  a = 2.0  a = 2.0
1.0  500          0.0     .184     .194     .273     .332     .359
                  0.5     .200     .220     .273     .321     .323
                  1.0     .200     .227     .273     .326     .296
     1000         0.0     .261     .257     .387     .430     .416
                  0.5     .248     .243     .363     .372     .392
                  1.0     .226     .238     .332     .300     .357
1.5  500          0.0     .409     .425     .612     .598     .580
                  0.5     .411     .401     .561     .546     .571
                  1.0     .340     .360     .465     .481     .476
     1000         0.0     .731     .728     .885     .841     .819
                  0.5     .640     .655     .814     .782     .763
                  1.0     .531     .528     .697     .675     .685
Average                   .365     .373     .495     .500     .503

Note: The two columns under a = .5 and the two under a = 2.0 correspond to the two DIF items simulated at each of those discrimination levels (Items 2-3 and Items 5-6, respectively); the single a = 1.0 column corresponds to Item 4. The final row gives the column averages.

Table 4-24. Variance components analysis for power results

Condition          Variance component
DIF Magnitude (D)  .190
Sample size (S)    .046
Impact (I)         .009
D*S                .012
D*I                .005
S*I                .002
D*S*I              .001


CHAPTER 5 DISCUSSION This study was designed to evaluate the overall performance of the factor mixture analysis in detecting uniform DIF. Specifically, there were two primary research goals, namely: (i) to assess the ability of the factor mixture approach to correctly recover the number of latent classes, and (ii) to examine the Type I error rates and statistical power associated with the approach under various study conditions. Using data generated by a 2PL IRT framework, a Monte Carlo simulation study was conducted to investigate the properties of the proposed factor mixture model approach to DIF detection. First, a 15-item dichotomous test simulated for a two-group, two-class population was generated. In both parts of the study, the effect of DIF magnitude, sample size and differences in latent trait means on the performance of the mixture approach were examined. First, the major findings of each phase of the simulation are summarized. This will be followed by a discussion of the limitations of this study and suggestions for future research. Class Enumeration and Performance of Fit Indices In assessing the accuracy of the factor mixture approach to accurately recover the correct number of latent classes, models with one through three latent classes were fit to the simulated data. In addition, three commonly-used information criteria indices (AIC, BIC, and ssaBIC) were used in the selection of the correct model. Overall, there was a high level of inconsistency among the three ICs. In this study, the AIC tended to over-extract the number of classes and under the majority of study conditions supported the more complex but incorrect three-class model over the true two-class model. This behavior was sharply contrasted with that of the BIC, which tended to underestimate the correct number of latent classes and consistently favored the simpler 81


one-class model. In contrast to the distinctly different results produced by the AIC and BIC, the ssaBIC produced more balanced results by showing a preference for the two-class model over the 1-class model as the magnitude of DIF simulated between groups increased. Moreover, of the three factors examined (magnitude of DIF, sample size, and presence of impact) the patterns of model selection were most affected by the change in DIF magnitude. However, while the behavior of the three ICs was influenced when larger amounts of DIF were simulated, the effect was different across ICs. For example, when the DIF magnitude was increased from 1.0 to 1.5, the ssaBIC identified the two-class model under three of the four conditions. In the case of the AIC, the two-class model had its lowest average IC values for two of the four conditions. And while the BIC still tended to favor the one-class model, the differences between the one-class and two-class model were minimized on increasing the DIF magnitude from 1.0 to 1.5. Therefore, the ssaBIC was most affected by the presence of larger DIF, followed by the AIC and lastly the BIC. In discussing these findings, it is important to note that the results of this Monte Carlo study though disappointing were not totally unexpected since previous research studies have also reported similar inconsistent performances for these fit indices (Li et al., 2009; Lin & Dayton, 1997; Nylund et al., 2006; Reynolds, 2008; Tofighi & Enders, 2007; Yang, 2006). Additionally, the pattern of results exhibited in this study by the indices has also been observed in other mixture model studies. For example, in research conducted by Li et al. (2009), Lin & Dayton (1997), and Yang (1998), the authors observed similar patterns of behavior, namely, the tendency of the AIC to overestimate the true number of classes and the BIC to select simpler models with a 82


smaller number of latent classes. On the other hand, while simulation results from Nylund et al. (2006) supported the finding of the AIC favoring models with more latent classes, their study found the BIC to be the most consistent indicator of the true number of latent classes. This latter result contrasted with other studies which touted the merits of the ssaBIC for class enumeration over the BIC (Henson, 2004; Yang, 2006; Tofighi & Enders, 2007). Therefore, given the inconsistencies in results, no single information criterion can be regarded as the most appropriate for class enumeration for all types of finite mixture models. Liu (2008) argued that because the performances of the indices depend heavily on the estimation model and the population assumptions, these inconsistencies should be expected. In addition, because to date no full-scale study has been conducted comparing the performance of these indices for factor mixture DIF applications, no definite conclusion can be reached regarding the index that is best suited for this type of factor mixture application. Clearly, this represents an opportunity for future research. Results from this study also point to several instances where negligible differences in IC values between neighboring class models were observed. Therefore, even though a model may have produced the lowest average IC value, the IC value of the k+1 or k-1 class model did not differ substantially from that of the k-class model. In cases such as this, the absence of an agreed-upon standard for calculating the significance of these IC differences increases the ambiguity of the selection of the correct model. This presents the opportunity for the creation of such a significance statistic; a possibility that will be explored later as a potential area for further research.
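For reference, the three indices being compared are penalized forms of the maximized log-likelihood. Writing L for the maximized likelihood, p for the number of free parameters, and N for the sample size, they are typically computed as

\mathrm{AIC} = -2\ln L + 2p, \qquad \mathrm{BIC} = -2\ln L + p\,\ln N, \qquad \mathrm{ssaBIC} = -2\ln L + p\,\ln\!\left(\frac{N+2}{24}\right),

with lower values indicating a better penalized fit; the ssaBIC simply replaces N in the BIC penalty with the adjusted sample size (N + 2)/24 (Sclove, 1987). The IC differences discussed above are therefore differences in these penalized values between neighboring class solutions.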


Overall, the ambiguity of these findings serves to reinforce the point that was made earlier, namely, that the IC results should never be relied upon as the sole determinant of the number of classes. Several researchers have stressed the importance of incorporating substantive theory in guiding the model selection decision (Allua, 2007; Bauer & Curran, 2004; Kim, 2009; Nylund et al., 2007; Reynolds, 2008). Moreover, Reynolds (2008) contends that the researcher often has some belief about the underlying subpopulations; therefore, this should be taken into account in determining which of the models best fits the data.

Type I Error and Statistical Power Performance

In this phase of the study, the performance of the factor mixture model was evaluated in terms of its Type I error rate and power of DIF detection. As was done in the first part of the study, data were again simulated for a 15-item test based on the 2PL IRT model. However, in this case it was assumed that the number of classes was known to be two. Five of the 15 items were simulated to contain uniform DIF in favor of the reference group. In investigating the Type I error rate and power of the test, three factors (DIF magnitude, sample size, and impact) shown previously to affect DIF detection were also manipulated and their effect on the test was noted. More specifically, two levels of DIF magnitude (1.0 and 1.5) and of sample size (N = 500, N = 1000) were simulated. For the effect of impact, three levels (0, 0.5 SD, and 1.0 SD) were chosen to reflect no, moderate, and large mean differences in the latent trait. For each of the 12 conditions, a total of 1000 replications were run. The Type I error rate and statistical power of the factor mixture method for DIF detection were investigated across all conditions.
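To make the data-generating setup concrete, the sketch below simulates dichotomous responses from the 2PL model for a two-class population, with uniform DIF represented as a shift in the difficulties of the studied items for the focal class and impact represented as a shift in that class's latent trait mean. The specific parameter values (equal discriminations of 1.0, difficulties of 0, a DIF shift of 1.0) are illustrative placeholders and not the values used to generate the study's data.

import numpy as np

rng = np.random.default_rng(2024)

# Illustrative 2PL item parameters for a 15-item test (placeholders, not the study's values).
a = np.full(15, 1.0)          # discrimination parameters
b = np.zeros(15)              # difficulties for the reference latent class
dif_items = np.arange(1, 6)   # Items 2-6 (zero-based indices 1-5) carry uniform DIF
dif_size = 1.0                # size of the uniform DIF shift

def generate(n_per_class=500, impact=0.5):
    """Simulate item responses for a two-class population under the 2PL model."""
    theta_ref = rng.normal(0.0, 1.0, n_per_class)        # reference-class abilities
    theta_foc = rng.normal(-impact, 1.0, n_per_class)    # focal-class abilities shifted by 'impact'
    b_foc = b.copy()
    b_foc[dif_items] += dif_size                          # uniform DIF: studied items harder for the focal class
    p_ref = 1 / (1 + np.exp(-a * (theta_ref[:, None] - b)))
    p_foc = 1 / (1 + np.exp(-a * (theta_foc[:, None] - b_foc)))
    probs = np.vstack([p_ref, p_foc])
    return (rng.uniform(size=probs.shape) < probs).astype(int)

data = generate()
print(data.shape, data.mean(axis=0).round(2))   # 1000 x 15 response matrix and item proportions correct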


Type I Error Rate Study

With the exception of the referent (Item 1), whose thresholds were constrained across latent classes for identification purposes, the remaining nine DIF-free items were used in assessing the ability of the factor mixture model to control the Type I error close to the nominal alpha level of .05. However, the factor mixture DIF approach yielded inflated error rates ranging in magnitude from .092 to .159 across all 12 study conditions. Whereas the rates of incorrect detection improved with large DIF and sample size, increasing impact did little to change the control of the Type I error rates. In assessing the performance of several DIF detection procedures, previous studies have confirmed the inverse relationship between the inflation of Type I error rates and both sample size and size of DIF, with tests attaining their optimal performance at controlling Type I error rates when sample sizes are larger and amounts of DIF are higher (Cohen, Kim & Baker, 1993; Dainis, 2008; Donoghue, Holland, & Thayer, 1993; Oort, 1998; Wanichtanom, 2001). Previous simulation results regarding the influence of impact on Type I error rates have been divided. Whereas some studies have reported Type I error inflation in the presence of impact (Cheung & Rensvold, 1999; Lee, 2009; Roussos & Stout, 1996; Uttaro & Millsap, 1994), others have shown good control of the error rates for moderate impact of .5 SD (Stark et al., 2006) and even for latent mean differences as large as 1 SD (Shealy & Stout, 1993). Differences in latent ability distributions are common for both cognitive and non-cognitive measures; hence, it is critical that DIF detection methods, particularly those that do not differentiate between the presence of DIF and impact, be robust to the effects of group differences in latent trait means.
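For reference, the item-level flagging rule that generated these rates can be stated compactly. Letting the estimated thresholds of item j in the two latent classes be denoted by the symbols below, an item was flagged as displaying DIF when the z statistic for the between-class threshold difference (evaluated by Mplus as a Wald-type test, as discussed further below) exceeded the two-sided critical value at alpha = .05:

z_j = \frac{\hat{\tau}_{j2} - \hat{\tau}_{j1}}{\widehat{SE}\!\left(\hat{\tau}_{j2} - \hat{\tau}_{j1}\right)}, \qquad |z_j| > 1.96 .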


Statistical Power Study

The study also evaluated the power of the factor mixture approach to detect uniform DIF. In spite of the failure of the factor mixture analysis to adequately control the Type I error rates across the study conditions, the power results were still reviewed to get some sense of the pattern of DIF detection. Overall, these findings represent a mix of the predictable and the unexpected. What was expected was that the power of the factor mixture analysis method of DIF detection would increase as sample size and magnitude of DIF increased. In addition, it was not a surprising outcome that the magnitude of the discrimination parameter also influenced DIF detection rates; power was highest when detecting DIF in the more highly discriminating items, followed by studied items with medium and low discrimination parameters. Overall, these results are not only intuitively appealing but have been consistently supported by prior research conducted with different methods of DIF detection (Donoghue et al., 1993; Narayanan & Swaminathan, 1994; Rogers & Swaminathan, 1993; Stark et al., 2006). On the other hand, the surprising result was that even in the presence of large latent trait mean differences of 1.0 SD, the rates of DIF detection were not adversely affected by impact. While this finding was consistent with some studies (González-Romá et al., 2006; Narayanan & Swaminathan, 1994; Rogers & Swaminathan, 1993; Shealy & Stout, 1993; Stark et al., 2006), others have reported contradictory results, with reductions in power as the disparity in latent means increased (Ankemann et al., 1999; Clauser, Mazor, & Hambleton, 1993; Finch & French, 2007; Narayanan & Swaminathan, 1996; Tian, 1999; Zwick, Donoghue, & Grima, 1993). However, it is important to note that these prior empirical studies all utilized standard DIF analyses rather than the mixture approach used in this simulation.


Reconciling the Simulation Results On one hand, the overall pattern of findings across the simulation conditions exhibits consistency with previous DIF results. On the other, the factor mixture approach was not as successful as was hoped at controlling the rate of false identifications and as a result in demonstrating power to detect DIF. However, if the factor mixture approach is to be regarded as a viable DIF detection method, possible reasons for this deviation from the expected performance must be addressed. Under a manifest approach to DIF an item is said to exhibit DIF if groups matched on the latent ability trait differ in their probabilities of item response (Cohen et al., 1993). Therefore, in that context, DIF is defined with respect to the manifest groups being considered. By contrast, the mixture approach posits a different conceptualization of DIF. In this case, the underlying assumption is that DIF is observed because of differences in item responses between unobserved latent classes rather than known manifest groups. Moreover, it is further assumed that unless there is perfect overlap between the manifest groups and the latent classes then the two methods should not be expected to produce the same DIF results (De Ayala et al., 2002). Perfect overlap implies that the composition in each of the latent classes is exactly the same as in the two manifest groups. For instance, in the case of a two-class, two-group population, 100% of the reference group would comprise latent class 1, while 100% of the focal groups would belong to latent class 2. However, De Ayala et al. (2002) contend that it is unlikely that this perfect equivalence between latent classes and manifest groups will occur. Because the composition of the latent classes is likely to differ from that of the manifest groups, then it should be expected that the DIF results will differ, particularly as the level of overlap moves from 100% to 50%. Therefore, while there is expected to be some similarity in results between the two 87


approaches, the results are not necessarily identical except in the case of perfect group-class correspondence. In this simulation, given that the overlap between the latent classes and manifest groups was simulated to be 80%, the DIF results should be expected to differ to some degree. Therefore, one possible source of the Type I error rate inflation may be this difference in the definition and conceptualization of DIF. Additionally, the procedure used to test the invariance of the items may also have contributed to this seemingly high rate of inflation. In testing the significance of the differences in item thresholds, Mplus invokes a Wald test. An examination of these estimates revealed several large coefficients, which in turn would have resulted in large z-statistics and an increased likelihood of significance. However, the issue of whether the inflated error rate resulted from applying a factor mixture approach to these data or from the use of significance testing of threshold differences to identify non-invariant items remains unresolved.

Limitations of the Study and Suggestions for Future Research

As with all simulation research, there are several limitations to this study. However, these limitations also point to the need for future research. First, in determining the correct number of latent classes, the findings were limited by the use of only one type of model fit index. It would have been interesting to compare the results of the information criteria indices (i.e., AIC, BIC, and ssaBIC) with those of alternative tests such as the Lo-Mendell-Rubin likelihood ratio test (LMR LRT) and the bootstrap LRT (BLRT). In their simulation study, Nylund et al. (2006) found that the LMR LRT was reasonably effective at identifying the correct mixture model. However, the BLRT outperformed both the likelihood-based indices and the LMR LRT as the most consistent indicator for choosing the correct number of classes. While these results are


promising, the LMR LRT and the BLRT are not without their potential drawbacks. Jeffries (2003) has been critical of the LMR LRTs use in mixture modeling and has suggested that the statistic be applied with caution. In addition, the BLRT which uses bootstrap samples is a far more computationally intensive approach than the information-based statistics. As a result, the BLRT though seemingly a reliable index, is seldom used in practice by the applied researcher (Liu, 2008). Therefore, additional attention may be focused on identifying alternative, robust model selection measures that provide more consistency than the ICs but are less computationally demanding than the BLRT. A second limiting factor in this part of the study was that the selection of the best-fitting model was based on the average IC values. A more reliable approach would have been to determine the percentage of times (out of the completed replications) that each index identifies the correct model. However, in this study, it was not possible to provide a one-to-one comparison of the IC values across the three class-solutions when a 100% convergence rate was not achieved. Therefore, in future research, this change should be implemented so that the percentage of correct model identifications can be compared for each of the indices. It should also be mentioned that while previous studies have evaluated the performance of model selection methods with respect to a variety of mixture models (GMM, LCA, FMM), to date no research has been conducted to evaluate the performance of these indices when used in the context of DIF detection. To fill this gap in the methodology literature requires a more extensive study focusing on the detection of DIF with mixture models. As with all simulation research, the findings can only be generalized to the limited number of conditions selected for this study. It should be noted that in the original 89


design of this study, several additional conditions were considered. However, given the computational intensity of mixture modeling, and in the interest of time, it was decided to reduce the number of study conditions to the smaller set that was studied. Therefore, future research should consider a broader range of simulation conditions, which would make for a more realistic study. For example, in addition to sample size, it would be of interest to investigate the ratio of focal to reference group sample sizes as well. In this study, a 1:1 sample size ratio of focal to reference group was considered. And while this may be representative of an evenly split manifest variable such as gender (Samuelsen, 2005), unequal group sizes tend to mimic minority population characteristics such as race (e.g., Caucasian vs. black or Hispanic). In traditional DIF assessments, power rates are typically higher for equal focal and reference group sizes than with unequal sample size ratios (Atar, 2007). Therefore, it would be interesting to investigate whether this finding is consistent with factor mixture DIF detection methods. Other conditions, fixed in this current study, that could be manipulated in future research include: (i) the nature of the items, (ii) the scale length, and (iii) the type of DIF. In the study, data were simulated for dichotomous items only. An interesting extension would be the evaluation of the model using categorical response data generated from different polytomous IRT models (e.g., the graded response model or the partial credit model). Another condition that could be manipulated is the number or proportion of items simulated to contain DIF. In addition, assessing the performance of the model selection indices and the mixture model with respect to varying scale lengths should also make for a more complete, informative study. While it is expected that longer tests would produce lower Type I error rates and increase power, it would be of interest to


determine how short the scale should be for the test to perform adequately. The focus of this study was on the detection of uniform DIF. However, in future research, the type of DIF factor can be extended to include both uniform and non-uniform DIF. To test for the presence of non-uniform DIF, the factor mixture model as implemented in this study must be reformulated so that in addition to the item thresholds, factor loadings are allowed to vary across classes as well. The Type I error rates and power of the factor mixture model to detect non-uniform DIF can then be evaluated and compared with the corresponding results for uniform DIF. Additionally, in this study, the item discrimination parameter was not included as a factor in the study. Instead its effect was examined on its own as a single condition. Therefore, in future research, the effect of including this study condition may be investigated. In generating the data, the mixture proportion for the two-classes was simulated to be .50. However, after the model estimation phase, the ability of the factor mixture approach to accurately recover the class proportions was not evaluated. This omission should also be addressed in future research. Finally, to the authors knowledge, the strategy used in testing the items for non-invariance has been recently introduced to the factor mixture literature and to date has been implemented in two studies. Its advantage is that it provides a simpler more direct alternative to DIF detection than the CFA baseline approaches which require the estimation and comparison of two models. However, it has not yet been subjected to the methodological rigor of more established methods. Therefore, a potential extension to this study would be a comparison of the performance of the significance testing of the 91


threshold differences using the Mplus MODEL CONSTRAINT option versus either a constrained- or a free-baseline strategy for testing DIF with mixture CFA models.

Conclusion

In the last decade, a burgeoning literature on mixture modeling and its applications has emerged. And although several of these research efforts have been concentrated in the area of growth mixture modeling, there is also a groundswell of interest in applying a mixture approach in the study of measurement invariance. Therefore, in concluding this dissertation it is important to reiterate the motivation that should precede the use of this technique as well as some key concerns that applied researchers should keep in mind when deciding whether mixture modeling is an appropriate approach for their research. The intrinsic appeal of mixture models is that they allow for the exploration of unobserved population heterogeneity using latent variables. Under the traditional conceptualization, DIF is defined with respect to distinct, known sub-groups. Therefore, in using standard DIF approaches, practitioners are seeking to determine whether, after controlling for latent ability, differences in item response patterns are a result of a known variable such as gender or race. However, when investigating DIF from a latent perspective, there is an implied assumption that the presence of unobserved latent classes gives rise to the pattern of differential functioning in the items. Advocates of this approach contend that it allows for a better understanding of why examinees may be responding differently to items, and this is certainly an attractive inducement to practitioners. However, these results suggest that unless large sample sizes and large amounts of DIF are present in the data, the factor mixture approach is likely to be unsuccessful at disentangling the population into distinct, distinguishable latent classes. Additionally, commonly-used fit indices such as the AIC, BIC, and ssaBIC are likely to produce inconsistent results and


may cause the incorrect selection of more or fewer classes than actually exist in the population. Therefore, it is critical that the practitioner has a strong theoretical justification to support the assumption of population heterogeneity. This should decrease the ambiguity in the selection of the best-fitting model for the data and in the interpretation of the nature of the latent classes. However, when the data and the theory support the existence of these latent classes, the technique can be used successfully to detect qualitatively different subpopulations with differential patterns of response that may otherwise have been overlooked using a traditional DIF procedure. In the context of education research, the application of mixture models can provide valuable diagnostic information that can be used to gain insight into students' cognitive strengths and weaknesses. This study was designed as a means of bridging the gap between the manifest and latent approaches by examining the performance of the factor mixture approach in detecting DIF in items generated via a traditional framework. And even though the manifest approach will remain a staple in the DIF literature, it is expected that interest in factor mixture models for DIF will continue to grow. Therefore, further exploring how these two approaches differ, not only as concepts but also in results and application, will help ensure that each is appropriately used in practice.


APPENDIX A
MPLUS CODE FOR ESTIMATING 2-CLASS FMM

TITLE:    Factor mixture model for a two-class solution.

DATA:     FILE IS allnames.txt;
          TYPE = MONTECARLO;

VARIABLE: NAMES = u1-u15 class group;
          USEVARIABLES ARE u1-u15;
          CATEGORICAL = u1-u15;
          CLASSES = c (2);

ANALYSIS: TYPE = MIXTURE;
          ALGORITHM = INTEGRATION;
          INTEGRATION = STANDARD (20);
          STARTS = 600 20;
          PROCESSORS = 2;

MODEL:    %OVERALL%
          f BY u1-u15;

          %c#1%
          [u2$1-u15$1];
          f;

          %c#2%
          [u2$1-u15$1];
          f;

OUTPUT:   TECH8 TECH9 STANDARDIZED;

SAVEDATA: RESULTS ARE results.txt;


APPENDIX B
MPLUS CODE FOR DIF DETECTION

TITLE:    Factor mixture model for a two-class solution. Items = 15, DIF = 1.0

DATA:     FILE IS allnames.txt;
          TYPE = MONTECARLO;

VARIABLE: NAMES = u1-u15 class group;
          USEVARIABLES ARE u1-u15;
          CATEGORICAL = u1-u15;
          CLASSES = c (2);

ANALYSIS: TYPE = MIXTURE;
          ALGORITHM = INTEGRATION;
          INTEGRATION = STANDARD (20);
          STARTS = 0;
          PROCESSORS = 2;

MODEL:    %OVERALL%
          f BY u1@1
               u2*0.500
               ...
               u15*0.867;

          %c#1%
          [u1$1] (p1_1);           !Assigns names to indicators for constraint purposes
          [u2$1*-0.500] (p1_2);
          ...
          [u15$1*0.609] (p1_15);
          f;

          %c#2%
          [u1$1] (p1_1);           !Threshold of Item 1 constrained equal across classes
          [u2$1*0.000] (p2_2);     !Remaining 14 item thresholds freely estimated
          ...
          [u15$1*0.609] (p2_15);
          f;

MODEL CONSTRAINT:
          NEW(difi2 difi3 difi4 difi5 difi6 difi7 difi8 difi9 difi10
              difi11 difi12 difi13 difi14 difi15);
                                   !Declares new variables (difi2, ..., difi15), which are
                                   !functions of previously labeled parameters
          difi2 = p2_2 - p1_2;     !Estimates threshold differences
          ...
          difi15 = p2_15 - p1_15;


LIST OF REFERENCES Abraham, A. A. (2008). Model Selection Methods in the linear mixed model for longitudinal data. Unpublished doctoral dissertation, University of North Carolina at Chapel Hill. Ackerman, T. A. (1992). A didactic explanation of item bias, item impact, and item validity from a multidimensional perspective. Journal of Educational Measurement, 29, 67-91. Ainsworth, A.T. (2007). Dimensionality and invariance: Assessing DIF using bifactor MIMIC models. Unpublished doctoral dissertation, University of California, Los Angeles. Agrawal, A., & Lynskey, M.T. (2007). Does gender contribute to heterogeneity in criteria for cannabis abuse and dependence? Results from the national epidemiological survey on alcohol and related conditions. Drug and Alcohol Dependence, 88, 300. Akaike, H. (1987). Factor analysis and AIC. Psychometrika, 52, 317. Allua, S.S. (2007). Evaluation of singleand multilevel factor mixture model estimation. Unpublished doctoral dissertation, University of Texas: Austin. Anderson, L. W. (1985). Opportunity to learn. In T. Husen & T. N. Postlethwaite (Eds.), The international encyclopedia of education (Vol. 6, pp. 3682-3686). Oxford: Pergamon Press. Angoff, W.H. (1972). A technique for the investigation of cultural differences. Paper presented at the annual meeting of the American Psychological Association, Honolulu. (ERIC Document Reproduction Service No. ED 069686). Angoff, W.H. (1993). Perspectives on differential item functioning methodology. In P.W. Holland & H. Wainer (Eds.) Differential item functioning (pp. 3-23). Hillsdale, N.J.: Lawrence Erlbaum. Angoff, W. H., & Sharon, A. T. (1974). The evaluation of differences in test performance of two or more groups. Educational and Psychological Measurement, 34, 807-816. Ankemann, R. D., Witt, E. A., & Dunbar, S. B. (1999). An investigation of the power of the likelihood ratio goodness-of-fit statistic in detecting differential item functioning. Journal of Educational Measurement, 36, 277-300. 96


Atar, B. (2007). Differential item functioning analyses for mixed response data using IRT likelihood-ratio test, logistic regression, and GLLAMM procedures. Unpublished doctoral dissertation, Florida State University. Bandalos, D. L., & Cohen, A.S. (2006). Using factor mixture models to identify differentially functioning test items. Paper presented at the annual meeting of the American Educational Research Association, San Francisco. Bauer, D. J., & Curran, P.J. (2004). The integration of continuous and discrete latent variable models: Potential problems and promising opportunities. Psychological Methods, 9, 3-29. Bilir, M. K. (2009). Mixture item response theory-mimic model: Simultaneous estimation of differential item functioning for manifest groups and latent classes. Unpublished doctoral dissertation, Florida State University. Bock, R.D., & Aiken, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46, 443-459. Bolt, D. M., Cohen, A. S., & Wollack, J. A. (2001). A mixture item response model for multiple-choice data. Journal of Educational and Behavioral Statistics, 26, 381-409. Bontempo, D. E. (2006). Polytomous factor analytic models in developmental research. Unpublished doctoral dissertation, The Pennsylvania State University. Camilli, G., & Shepard, L. (1994). Methods for identifying biased test items. Newbury Park, CA: Sage. Cheung, G.W. & Rensvold, R.B. (1999) Testing factorial invariance across groups: A reconceptualization and proposed new method. Journal of Management, 25, 1-27. Cho, S.-J. (2007). A multilevel mixture IRT model for DIF analysis. Unpublished doctoral dissertation, University of Georgia, Athens. Chung, M.C., Dennis, I., Easthope, Y., Werrett, J., & Farmer, S. (2005). A multiple-indicator multiple-cause model for posttraumatic stress reactions: Personality, coping, and maladjustment. Psychosomatic Medicine, 67, 251. Clauser, B. E., & Mazor, K. M. (1998). Using statistical procedures to identify differentially functioning test items. Educational Measurement: Issues and Practice, 17, 31-44. 97


Clauser, B. E., Mazor, K. M., & Hambleton, R. K. (1994). The effects of score group width on the Mantel-Haenszel procedure. Journal of Educational Measurement, 57, 67-78. Clark, S.L. (2010). Mixture modeling with behavioral data. Doctoral dissertation, Unpublished doctoral dissertation. University of California, Los Angeles. Clark, S.L., Muthn, B., Kaprio, J., DOnofrio, B.M., Viken, R., Rose, R.J., Smalley, S. L. (2009). Models and strategies for factor mixture analysis: Two examples concerning the structure underlying psychological disorders. Manuscript submitted for publication. Cleary, T. A. & Hilton, T. J. (1968). An investigation of item bias. Educational and Psychological Measurement, 5, 115-124. Cohen, A.S., & Bolt, D.M. (2005). A mixture model analysis of differential item functioning. Journal of Educational Measurement, 42, 133-148. Cohen, A. S., Kim, S.-H., & Baker, F. B. (1993). Detection of differential item functioning in the graded response model. Applied Psychological Measurement, 17, 335 350. Cole, N. S. (1993). History and development of DIF. In P.W. Holland & H. Wainer (Eds.) Differential item functioning (pp. 25-33). Hillsdale, N.J.: Lawrence Erlbaum. Dainis, A. M. (2008). Methods for identifying differential item and test functioning: An investigation of Type I error rates and power. Unpublished doctoral dissertation, James Madison University. De Ayala, R.J. (2009). Theory and practice of item response theory. Guilford Publishing. De Ayala, R.J., Kim, S.-H., Stapleton, L.M., & Dayton, C.M. (2002). Differential item functioning: A mixture distribution conceptualization. International Journal of Testing, 2, 243-276. Donoghue, J. R., Holland, P. W., & Thayer, D. T. (1993). A Monte Carlo study of factors that affect the Mantel-Haenszel and standardization measures of differential item functioning. In P. W. Holland, and H. Wainer (Eds.), Differential item functioning (pp. 137-166). Hillsdale, NJ: Lawrence Erlbaum Associates, Publishers. Dorans, N.J., & Holland, P.W. (1993). DIF detection and description: Mantel-Haenszel and standardization. In P.W. Holland & H. Wainer (Eds.) Differential item functioning. Hillsdale, N.J.: Lawrence Erlbaum. 98


Dorans NJ, & Kulick E. (1986). Demonstrating the utility of the standardization approach to assessing unexpected differential item performance on the Scholastic Aptitude Test. Journal of Educational Measurement, 23, 355-368. Duncan, S. C. (2006). Improving the prediction of differential item functioning: A comparison of the use of an effect size for logistic regression DIF and Mantel-Haenszel DIF methods. Unpublished doctoral dissertation, Texas A&M University. Educational Testing Service (2008). What's the DIF? Helping to ensure test question fairness. Retrieved December 8, 2009, from: http://www.ets.org/portal/site/ets/ Engelhard, G., Hansche, L., & Rutledge, K. E. (1990). Accuracy of bias review judges in identifying differential item functioning on teacher certification tests. Applied Measurement in Education, 3, 347-360. Finch, H. (2005). The MIMIC model as a method for detecting DIF: Comparison with Mantel-Haenszel, SIBTEST, and the IRT likelihood ratio. Applied Psychological Measurement, 29, 278-295. Finch, W. H., & French, B. F. (2007). Detection of crossing differential item functioning. Educational and Psychological Measurement, 67, 565-582. Fukuhara, H. (2009). A differential item functioning model for testlet-based items using a bi-factor multidimensional item response theory model: A Bayesian approach. Unpublished doctoral dissertation, Florida State University. Furlow, C. F., Raiford Ross, T., & Gagn, P. (2009). The impact of multidimensionality on the detection of differential bundle functioning using simultaneous item bias test. Applied Psychological Measurement, 33, 441-464. Gagn, P. (2004). Generalized confirmatory factor mixture models: A tool for assessing factorial invariance across unspecified populations. Unpublished doctoral dissertation. University of Maryland. Gagn, P. (2006). Mean and covariance structure models. In G.R. Hancock & F.R. Lawrence (Eds.), Structural Equation Modeling: A second course (pp. 197-224). Greenwood, CT: Information Age Publishing, Inc. Gallo, J. J., Anthony, J. C., & Muthn, B. O. (1994). Age differences in the symptoms of depression: A latent trait analysis. Journal of Gerontology: Psychological Sciences, 49, P251-P264. Gelin, M. N. (2005). Type I error rates of the DIF MIMIC approach using Jreskogs covariance matrix with ML and WLS estimation. Unpublished doctoral dissertation, The University of British Columbia. 99


Gelin, M. N., & Zumbo, B.D. (2007). Operating characteristics of the DIF MIMIC approach using Jreskogs covariance matrix with ML and WLS estimation for short scales. Journal of Modern Applied Statistical Methods, 6, 573-588. Glockner-Rist, A., & Hoitjink, H. (2003). The best of both worlds: Factor analysis of dichotomous data using item response theory and structural equation modeling. Structural Equation Modeling, 10, 544-565. Gomez, R. & Vance, A. (2008). Parent ratings of ADHD symptoms: Differential symptom functioning across Malaysian Malay and Chinese children. Journal of Abnormal Child Psychology, 36, 955-967. Gonzlez-Rom, V., Hernndez, A., & Gmez-Benito, J. (2006). Power and Type I error of the mean and covariance structure analysis model for detecting differential item functioning in graded response items. Multivariate Behavioral Research, 41, 29-53. Gierl, M. J., Bisanz, J., Bisanz, G. L., Boughton, K. A., & Khaliq, S. N. (2001). Illustrating the utility of differential bundle functioning analysis to identify and interpret group differences on achievement tests. Educational Measurement: Issues and Practice, 20, 26-36. Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Thousand Oaks, CA: Sage Publications. Hancock, G. R., Lawrence, F. R., & Nevitt, J. (2000). Type I error and power of latent mean methods and MANOVA in factorially invariant and noninvariant latent variable systems. Structural Equation Modeling, 7, 534-556. Henson, J. M. (2004). Latent variable mixture modeling as applied to survivors of breast cancer. Unpublished doctoral dissertation. University of California, Los Angeles. Holland, P. W., & Thayer, D. T. (1985). An alternative definition of the ETS delta scale of item difficulty (ETS-RR-94-13). Princeton, NJ; Educational Testing Service. Holland, P.W., & Thayer, D.T. (1988). Differential item performance and the Mantel-Haenszel procedure. In H. Wainer & H.I. Braun (Eds.), Test Validity. Hillsdale, N.J.: Erlbaum. Holland, P.W., & Wainer, H. (1993). Differential item functioning. Hillsdale, N.J.: Lawrence Erlbaum. Jeffries, N. (2003). A note on Testing the number of components in a normal mixture. Biometrika, 90, 991. 100


Jreskog K., & Goldberger, A. (1975). Estimation of a model of multiple indicators and multiple causes of a single latent variable. Journal of the American Statistical Association, 10, 631 639. Kamata, A., & Bauer, D. J. (2008). A note on the relationship between factor analytic and item response theory models. Structural Equation Modeling: A Multidisciplinary Journal, 15, 136 153. Kamata, A., & Binici, S. (2003). Random effect DIF analysis via hierarchical generalized linear modeling. Paper presented at the annual International Meeting of the Psychometric Society, Sardinia, Italy. Kamata, A., & Vaughn, B.K. (2004). An introduction to differential item functioning analysis. Learning Disabilities: A Contemporary Journal, 2, 49-69. Kuo, P.H., Aggen, S.H., Prescott, C.A., Kendler, K.S., & Neale, M.C. (2008). Using a factor mixture modeling approach in alcohol dependence in a general population sample. Drug and Alcohol Dependence, 98, 105. Larson, S. L. (1999). Rural-urban comparisons of item responses in a measure of depression. Unpublished doctoral dissertation, University of Nebraska. Lau, A. (2009). Using a mixture IRT model to improve parameter estimates when some examinees are amotivated. Unpublished doctoral dissertation, James Madison university. Lazarsfeld, P. F., & Henry, N. W. (1968). Latent structure analysis. Boston: Houghton Mifflin. Lee, J. (2009). Type I error and power of the mean and covariance structure confirmatory factor analysis for differential item functioning detection: Methodological issues and resolutions. Unpublished doctoral dissertation, University of Kansas. Leite, W. L. & Cooper, L. (2007). Diagnosing social desirability bias with structural equation mixture models. Paper presented at the Annual Meeting of the American Psychological Association. Li, F., Cohen, A.S., Kim, S-H., & Cho, S-J. (2009). Model selection methods for mixture dichotomous IRT models. Applied Psychological Measurement, 33, 353-373. Lin, T. H., & Dayton, C. M. (1997). Model selection information criteria for non-nested latent class models. Journal of Educational and Behavioral Statistics, 22, 249-264. 101


Linn, R.L. (1993). The use of differential item functioning statistics: A discussion of current practice and future implications. In P.W. Holland & H. Wainer (Eds.) Differential item functioning (pp. 349 -364). Hillsdale, N.J.: Lawrence Erlbaum. Linn, R. L., & Harnisch, D. L. (1981). Interactions between item content and group membership on achievement test items. Journal of Educational Measurement, 18, 109-118. Liu, C. Q. (2008). Identification of latent groups in Growth Mixture Modeling: A Monte Carlo study. Unpublished doctoral dissertation, University of Virginia. Lo, Y., Mendell, N., & Rubin, D. (2001). Testing the number of components in a normal mixture. Biometrika, 88, 767. Lord, F. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum. Lubke, G. H. & Muthn, B. (2005). Investigating population heterogeneity with factor mixture models. Psychological Methods, 10, 21-39. Lubke, G. H. & Muthn, B. (2007). Performance of factor mixture models as a function of model size, covariate effects, and class-specific parameters. Structural Equation Modeling: A Multidisciplinary Journal, 14, 26-47. Lubke, G. H. & Neale, M. C. (2006). Distinguishing between latent classes and continuous factors: Resolution by maximum likelihood? Multivariate Behavioral Research, 41, 499532. MacIntosh, R., & Hashim, S. (2003). Variance estimation for converting MIMIC model parameters to IRT parameters in DIF analysis. Applied Psychological Measurement, 372-379. Mantel, N., & Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute, 22, 719-748. McDonald, R. P. (1967). Nonlinear factor analysis. Psychometric Monographs, 15, 1-167. McKnight, C. C., Crosswhite, F. J., Dossey, J. A., Kifer, E., Swafford, J. O., Travers, K. J., & Cooney, T. J. (1987). The underachieving curriculum: Assessing U.S. school mathematics from an international perspective. Champaign, IL: Stipes. McLachlan, G., & Peel, D. (2000). Finite mixture models. New York: Wiley. 102


Meade, A. W., & Lautenschlager, G. J. (2004). A comparison of item response theory and confirmatory factor analytic methodologies for establishing measurement equivalence/invariance. Organizational Research Methods, 7, 361. Meade, A. W., Lautenschlager, G. J., & Johnson, E. C. (2007). A Monte Carlo examination of the sensitivity of the DFIT framework for tests of measurement invariance with Likert data. Applied Psychological Measurement, 31, 430-455. Meredith W. (1993). Measurement invariance, factor analysis and factorial invariance. Psychometrika, 58, 525-543. Millsap, R.E., & Everson, H.T. (1993). Methodology review: Statistical approaches for assessing measurement bias. Applied Psychological Measurement, 17, 297-334. Mislevy, R. J., Levy, R, Kroopnick, M., & Rutstein, D. (2008). Evidentiary foundations of mixture item response theory models. In G. R. Hancock, K. M. Samuelsen (Eds.), Advances in latent variable mixture models. (pp. 149-175). Charlotte, NC: Information Age Publishing. Mislevy, R. J., & Verhelst, N. (1990). Modeling item responses when different subjects employ different solution strategies. Psychometrika, 55, 195-215. Moustaki, I. (2000). A latent variable model for ordinal variables. Applied Psychological Measurement, 24, 211-224. Muthn, B. O. (1985). A method for studying the homogeneity of test items with respect to other relevant variables. Journal of Educational Statistics, 10, 121-132. Muthn, B. O. (1988). Some uses of structural equation modeling in validity studies: Extending IRT to external variables. In H. Wainer and H. Braun (Eds.), Test validity (pp. 213-238). Hillsdale, NJ:Lawrence Erlbaum. Muthn, B. O. (1989). Using item-specific instructional information in achievement modeling. Psychometrika, 54, 385-396. Muthn, B. O., & Asparouhov, T. (2002). Latent variable analysis with categorical outcomes: Multiple-group and growth modeling in Mplus (Mplus Web Note No. 4). Retrieved April 28, 2005, from http://www.statmodel.com/mplus/examples/webnote.html Muthn, B., Asparouhov, T. & Rebollo, I. (2006). Advances in behavioral genetics modeling using Mplus: Applications of factor mixture modeling to twin data. Twin Research and Human Genetics, 9, 313-324. 103


Muthn, B. O., Grant, B., & Hasin, D. (1993). The dimensionality of alcohol abuse and dependence: Factor analysis of DSM-III-R and proposed DSM-IV criteria in the 1988 National Health Interview Survey. Addiction, 88, 1079-1090. Muthn, B. O., Kao, C., & Burstein, L. (1991). Instructionally sensitive psychometrics: An application of a new IRT-based detection technique to mathematics achievement test items. Journal of Educational Measurement, 28, 1-22. Muthn, B. O., & Lehman, J. (1985). Multiple group IRT modeling: Applications to item bias analysis. Journal of Educational Statistics, 10, 133-142. Muthn, L.K. and Muthn, B.O. (1998-2008). Mplus users guide. Fifth edition. Los Angeles, CA: Muthn & Muthn. Narayanan, P., & Swaminathan, H. (1994). Performance of the Mantel-Haenszel and simultaneous tem bias procedures for detecting differential item functioning. Applied Psychological Measurement, 18, 315-328. Navas-Ara, M. J., & Gomez-Benito, J. (2002). Effects of ability scale purification on identification of DIF. European Journal of Psychological Assessment, 18, 9-15. Nylund, K. L., Asparouhov, T., & Muthn, B. (2006). Deciding on the number of classes in latent class analysis and growth mixture modeling. A Monte Carlo simulation study. Structural Equation Modeling, 14, 535-569. ONeill, K. A., & McPeek, W. M. (1993). Item and test characteristics that are associated with differential item functioning. In P. W. Holland, & H. Wainer (Eds.), Differential item functioning (pp. 255-276). Hillsdale, NJ: Erlbaum. Oort, F. J. (1998). Simulation study of item bias detection with restricted factor analysis. Structural Equation Modeling, 5, 107-124. Penfield, R.D., & Lam, T. C. M. (2000). Assessing differential item functioning in performance assessment: Review and recommendations. Educational Measurement: Issues and Practice, 19, 5-15. Potenza, M.T. & Dorans, N. J. (1995). DIF assessment for polytomously scored items: A framework for classification and evaluation. Applied Psychological Measurement, 19, 23-27. Raju, N.S. (1988). The area between two item characteristic curves. Psychometrika, 54, 495-502. Raju, N.S., Bode, R.K., & Larsen, V.S. (1989). An empirical assessment of the MantelHaenszel statistic to detect differential item functioning. Applied Measurement in Education, 2, 1-13. 104


Raju, N. S. (1990). Determining the significance of estimated signed and unsigned areas between two item response functions. Applied Psychological Measurement, 14, 197-207.
Reynolds, M. R. (2008). The use of factor mixture modeling to investigate population heterogeneity in hierarchical models of intelligence. Unpublished doctoral dissertation, University of Texas, Austin.
Rindskopf, D. (2003). Mixture or homogeneous? Comment on Bauer and Curran (2003). Psychological Methods, 8, 364.
Rogers, H. J., & Swaminathan, H. (1993). A comparison of logistic regression and Mantel-Haenszel procedures for detecting differential item functioning. Applied Psychological Measurement, 17, 105-116.
Rost, J. (1990). Rasch models in latent classes: An integration of two approaches to item analysis. Applied Psychological Measurement, 14, 271-282.
Roussos, L. A., & Stout, W. F. (1996a). A multidimensionality-based DIF analysis paradigm. Applied Psychological Measurement, 20, 355-371.
Roussos, L. A., & Stout, W. F. (1996b). Simulation studies of the effects of small sample size and studied item parameters on SIBTEST and Mantel-Haenszel Type I error performance. Journal of Educational Measurement, 33, 215-230.
Samuelsen, K. (2005). Examining differential item functioning from a latent class perspective. Unpublished doctoral dissertation, University of Maryland, College Park.
Samuelsen, K. (2008). Examining differential item functioning from a latent perspective. In G. R. Hancock & K. M. Samuelsen (Eds.), Advances in latent variable mixture models (pp. 177-197). Charlotte, NC: Information Age Publishing.
Sawatzky, R. (2007). The measurement of quality of life and its relationship with perceived health status in adolescents. Unpublished doctoral dissertation, The University of British Columbia.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6, 461.
Sclove, S. L. (1987). Application of model-selection criteria to some problems in multivariate analysis. Psychometrika, 52, 333.


Shealy, R., & Stout, W. F. (1993a). A model-based standardization approach that separates true bias/DIF from group differences and detects test bias/DIF as well as item bias/DIF. Psychometrika, 58, 159-194.
Shealy, R., & Stout, W. F. (1993b). An item response theory model for test bias and differential test functioning. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 197-239). Hillsdale, NJ: Erlbaum.
Shih, C.-L., & Wang, W.-C. (2009). Differential item functioning detection using multiple indicators, multiple causes method with a pure short anchor. Applied Psychological Measurement, 33, 184-199.
Sörbom, D. (1974). A general method for studying differences in factor means and factor structures between groups. British Journal of Mathematical and Statistical Psychology, 27, 229-239.
Standards for educational and psychological testing. (1999). Washington, DC: American Educational Research Association, American Psychological Association, & National Council on Measurement in Education.
Stark, S., Chernyshenko, O. S., & Drasgow, F. (2006). Detecting differential item functioning with confirmatory factor analysis and item response theory: Toward a unified strategy. Journal of Applied Psychology, 91, 1291-1306.
Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27, 361-370.
Takane, Y., & de Leeuw, J. (1987). On the relationship between item response theory and factor analysis of discretized variables. Psychometrika, 52, 393-408.
Thissen, D. (1991). MULTILOG user's guide: Multiple, categorical item analysis and test scoring using item response theory. Chicago: Scientific Software, Inc.
Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential item functioning using the parameters of item response models. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 67-113). Hillsdale, NJ: Lawrence Erlbaum.
Thurstone, L. L. (1925). A method of scaling educational and psychological tests. Journal of Educational Psychology, 16, 263-278.
Thurstone, L. L. (1947). Multiple factor analysis. Chicago: University of Chicago Press.
Tian, F. (1999). Detecting differential item functioning in polytomous items. Unpublished doctoral dissertation, University of Ottawa.


Tofighi, D., & Enders, C. K. (2007). Identifying the correct number of classes in a growth mixture model. In G. R. Hancock (Ed.), Mixture models in latent variable research (pp. 317). Greenwich, CT: Information Age.
Uttaro, T., & Millsap, R. E. (1994). Factors influencing the Mantel-Haenszel procedure in the detection of differential item functioning. Applied Psychological Measurement, 18, 15-25.
Wainer, H. (1993). Model-based standardized measurement of an item's differential impact. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 123-135). Hillsdale, NJ: Lawrence Erlbaum.
Wainer, H., Sireci, S. G., & Thissen, D. (1991). Differential testlet functioning: Definitions and detection. Journal of Educational Measurement, 28, 197.
Wang, W.-C., Shih, C.-L., & Yang, C.-C. (2009). The MIMIC method with scale purification for detecting differential item functioning. Educational and Psychological Measurement, 69, 713-732.
Wanichtanom, R. (2001). Methods of detecting differential item functioning: A comparison of item response theory and confirmatory factor analysis. Unpublished doctoral dissertation, Old Dominion University.
Wasserman, L. (2000). Bayesian model selection and model averaging. Journal of Mathematical Psychology, 44, 92-107.
Webb, M.-Y., Cohen, A. S., & Schwanenflugel, P. J. (2008). A mixture model analysis of differential item functioning on the Peabody Picture Vocabulary Test-III. Educational and Psychological Measurement, 68, 335-351.
Woods, C. M. (2009). Evaluation of MIMIC-model methods for DIF testing with comparison to two-group analysis. Multivariate Behavioral Research, 44, 1-27.
Yang, C. C. (1998). Finite mixture model selection with psychometric applications. Unpublished doctoral dissertation, University of California, Los Angeles.
Yang, C. C. (2006). Evaluating latent class analysis models in qualitative phenotype identification. Computational Statistics and Data Analysis, 50, 1090.
Yoon, M. (2007). Statistical power in testing factorial invariance with ordinal measures. Unpublished doctoral dissertation, Arizona State University.
Zumbo, B. D. (1999). A handbook on the theory and methods of differential item functioning (DIF): Logistic regression modeling as a unitary framework for binary and Likert-type (ordinal) item scores. Ottawa, ON: Directorate of Human Resources Research and Evaluation, Department of National Defense.


Zumbo, B. D. (2007). Three generations of differential item functioning (DIF) analyses: Considering where it has been, where it is now, and where it is going. Language Assessment Quarterly, 4, 223-233.
Zumbo, B. D., & Gelin, M. N. (2005). A matter of test bias in educational policy research: Bringing the context into picture by investigating sociological/community moderated (or mediated) test and item bias. Journal of Educational Research and Policy Studies, 5, 1-23.
Zwick, R., Donoghue, J., & Grima, A. (1993). Assessment of differential item functioning for performance tasks. Journal of Educational Measurement, 30, 233-251.


BIOGRAPHICAL SKETCH

Mary Grace-Anne Jackman was born in Bridgetown, Barbados. In 1994, she graduated from the University of the West Indies, Barbados, with a Bachelor of Science degree in mathematics and computer science (first-class honors). After being awarded an Errol Barrow Scholarship, she entered Oxford University in 1996 and received a Master of Science degree in Applied Statistics in 1997. In 2002, she graduated from the University of Georgia with a master's degree in marketing research. Following four years as a marketing research consultant in New York and Barbados, she began doctoral studies in research and evaluation methodology at the University of Florida in the fall of 2006.