## Citation

- Permanent Link:
- http://ufdc.ufl.edu/UFE0041954/00001
## Material Information
- Title:
- A Monte Carlo Investigation of the Performance of Factor Mixture Modeling in the Detection of Differential Item Functioning
- Creator:
- Jackman, Mary
- Place of Publication:
- [Gainesville, Fla.]
- Publisher:
- University of Florida
- Publication Date:
- 2010
- Language:
- english
- Physical Description:
- 1 online resource (109 p.)
## Thesis/Dissertation Information
- Degree:
- Doctorate (Ph.D.)
- Degree Grantor:
- University of Florida
- Degree Disciplines:
- Research and Evaluation Methodology, Human Development and Organizational Studies in Education
- Committee Chair:
- Miller, M David
- Committee Co-Chair:
- Leite, Walter
- Committee Members:
- Algina, James J.
- Wood, R. Craig
- Graduation Date:
- 8/7/2010
## Subjects
- Subjects / Keywords:
- Error rates (jstor)
- False positive errors (jstor)
- Mathematical variables (jstor)
- Modeling (jstor)
- Parametric models (jstor)
- Psychometrics (jstor)
- Sample size (jstor)
- Simulations (jstor)
- Statistical models (jstor)
- Test bias (jstor)
- Human Development and Organizational Studies in Education -- Dissertations, Academic -- UF
- differential, factor, functioning, item, mixture, modeling
- Genre:
- Electronic Thesis or Dissertation
- bibliography (marcgt)
- theses (marcgt)
- government publication (state, provincial, territorial, dependent) (marcgt)
- Research and Evaluation Methodology thesis, Ph.D.
## Notes
- Abstract:
- This dissertation evaluated the performance of factor mixture modeling in the detection of differential item functioning (DIF). Using a Monte Carlo simulation, the study first investigated the ability of the factor mixture model to recover the number of true latent classes existing in the population. Data were simulated based on the two-parameter logistic (2PL) item response theory (IRT) model for 15 dichotomous items for a two-group, two-class population. In addition, three simulation conditions (sample size, DIF magnitude, and mean latent trait differences) were manipulated. One-, two-, and three-class factor mixture models were estimated and compared using three commonly-used likelihood-based fit indices: the Akaike information criterion (AIC), Bayesian information criterion (BIC), and sample size adjusted Bayesian information criterion (ssaBIC). Overall, there was a high level of inconsistency among the indices with respect to the best-fitting model. Whereas the AIC tended to over-extract the number of latent classes and under most study conditions selected the three-class model, the BIC erred on the side of parsimony and consistently selected the simpler one-class model. On the other hand, the ssaBIC held the middle ground between these two extremes and tended to favor the 'true' two-class mixture model as the sample size or DIF magnitude was increased. In the second phase of the study, the factor mixture approach was assessed in terms of its Type I error rate and statistical power to detect uniform DIF. One thousand data sets were replicated for each of the 12 study conditions. The presence of uniform DIF was assessed via a significance test of the differences in item thresholds across latent classes. Overall, the results were not as encouraging as was hoped. Inflated Type I errors were observed under all of the study conditions, particularly when the sample size and DIF magnitude were reduced. (en)
- General Note:
- In the series University of Florida Digital Collections.
- General Note:
- Includes vita.
- Bibliography:
- Includes bibliographical references.
- Source of Description:
- Description based on online resource; title from PDF title page.
- Source of Description:
- This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
- Thesis:
- Thesis (Ph.D.)--University of Florida, 2010.
- Local:
- Adviser: Miller, M David.
- Local:
- Co-adviser: Leite, Walter.
- Electronic Access:
- RESTRICTED TO UF STUDENTS, STAFF, FACULTY, AND ON-CAMPUS USE UNTIL 2011-08-31
- Statement of Responsibility:
- by Mary Jackman.
## Record Information
- Source Institution:
- UFRGP
- Rights Management:
- Applicable rights reserved.
- Embargo Date:
- 8/31/2011
- Resource Identifier:
- 004979715 (ALEPH)
- 705932442 (OCLC)
- Classification:
- LD1780 2010 ( lcc )
## Full Text
one-class model. In contrast to the distinctly different results produced by the AIC and BIC, the ssaBIC produced more balanced results by showing a preference for the two-class model over the one-class model as the magnitude of DIF simulated between groups increased. Moreover, of the three factors examined (magnitude of DIF, sample size, and presence of impact), the patterns of model selection were most affected by the change in DIF magnitude. However, while the behavior of the three ICs was influenced when larger amounts of DIF were simulated, the effect differed across ICs. For example, when the DIF magnitude was increased from 1.0 to 1.5, the ssaBIC identified the two-class model under three of the four conditions. In the case of the AIC, the two-class model had its lowest average IC values for two of the four conditions. And while the BIC still tended to favor the one-class model, the differences between the one-class and two-class models were minimized on increasing the DIF magnitude from 1.0 to 1.5. Therefore, the ssaBIC was most affected by the presence of larger DIF, followed by the AIC and lastly the BIC.

In discussing these findings, it is important to note that the results of this Monte Carlo study, though disappointing, were not totally unexpected, since previous research studies have also reported similarly inconsistent performances for these fit indices (Li et al., 2009; Lin & Dayton, 1997; Nylund et al., 2006; Reynolds, 2008; Tofighi & Enders, 2007; Yang, 2006). Additionally, the pattern of results exhibited by the indices in this study has also been observed in other mixture model studies. For example, in research conducted by Li et al. (2009), Lin & Dayton (1997), and Yang (1998), the authors observed similar patterns of behavior, namely, the tendency of the AIC to overestimate the true number of classes and the BIC to select simpler models with a smaller number of latent classes.

Gelin, M. N., & Zumbo, B. D. (2007).
Operating characteristics of the DIF MIMIC approach using Joreskog's covariance matrix with ML and WLS estimation for short scales. Journal of Modern Applied Statistical Methods, 6, 573-588.

Glockner-Rist, A., & Hoijtink, H. (2003). The best of both worlds: Factor analysis of dichotomous data using item response theory and structural equation modeling. Structural Equation Modeling, 10, 544-565.

Gomez, R., & Vance, A. (2008). Parent ratings of ADHD symptoms: Differential symptom functioning across Malaysian Malay and Chinese children. Journal of Abnormal Child Psychology, 36, 955-967.

Gonzalez-Roma, V., Hernandez, A., & Gomez-Benito, J. (2006). Power and Type I error of the mean and covariance structure analysis model for detecting differential item functioning in graded response items. Multivariate Behavioral Research, 41, 29-53.

Gierl, M. J., Bisanz, J., Bisanz, G. L., Boughton, K. A., & Khaliq, S. N. (2001). Illustrating the utility of differential bundle functioning analysis to identify and interpret group differences on achievement tests. Educational Measurement: Issues and Practice, 20, 26-36.

Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Thousand Oaks, CA: Sage Publications.

Hancock, G. R., Lawrence, F. R., & Nevitt, J. (2000). Type I error and power of latent mean methods and MANOVA in factorially invariant and noninvariant latent variable systems. Structural Equation Modeling, 7, 534-556.

Henson, J. M. (2004). Latent variable mixture modeling as applied to survivors of breast cancer. Unpublished doctoral dissertation, University of California, Los Angeles.

Holland, P. W., & Thayer, D. T. (1985). An alternative definition of the ETS delta scale of item difficulty (ETS-RR-94-13). Princeton, NJ: Educational Testing Service.

Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel-Haenszel procedure. In H. Wainer & H. I. Braun (Eds.), Test validity. Hillsdale, NJ: Erlbaum.
Holland, P. W., & Wainer, H. (1993). Differential item functioning. Hillsdale, NJ: Lawrence Erlbaum.

Jeffries, N. (2003). A note on "Testing the number of components in a normal mixture." Biometrika, 90, 991-994.

In estimating factor mixture models, one important decision to be made is the determination of the number of latent classes. However, while there are several fit criteria available to assist the researcher in making this determination, such as the Akaike information criterion (AIC; Akaike, 1987), the Bayesian information criterion (BIC; Schwartz, 1978), the sample size adjusted BIC (ssaBIC; Sclove, 1987), and the Lo-Mendell-Rubin adjusted likelihood ratio test (LMR aLRT; Lo, Mendell, & Rubin, 2001), there is seldom perfect agreement among these fit indices. Therefore, practitioners are cautioned against applying the mixture approach without theoretical support for their hypothesis of unobserved heterogeneity in the population of interest.

Generally, when group membership is known a priori, SEM-based DIF detection models can be specified using either a multiple indicators, multiple causes (MIMIC) model or a multiple-group CFA approach (Allua, 2007). In this paper, the factor mixture model will be specified using the mixture analog of the manifest multiple-group CFA. Because the observed group variable is replaced by a latent categorical variable in this specification, not only can the heterogeneity of item parameters be examined, but a profile of the latent, unobserved subpopulations can be examined as well. Finally, if needed, covariates can be included in the model to help explain the composition of the latent classes. The primary purpose of this dissertation was to explore the utility of factor mixture models in the detection of DIF.
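As a concrete illustration of how these information criteria trade off fit against complexity, the sketch below computes the AIC, BIC, and ssaBIC from a model's log-likelihood and picks, for each index, the class solution with the smallest value. The log-likelihoods and parameter counts are hypothetical values chosen only to show how the indices can disagree; they are not results from this study.

```python
from math import log

def information_criteria(log_lik: float, n_params: int, n: int) -> dict:
    """Likelihood-based fit indices used for class enumeration.

    AIC    = -2lnL + 2p                  (Akaike, 1987)
    BIC    = -2lnL + p*ln(N)             (Schwartz, 1978)
    ssaBIC = -2lnL + p*ln((N + 2) / 24)  (Sclove, 1987; sample size adjusted)
    """
    return {
        "AIC": -2 * log_lik + 2 * n_params,
        "BIC": -2 * log_lik + n_params * log(n),
        "ssaBIC": -2 * log_lik + n_params * log((n + 2) / 24),
    }

def best_model(candidates: dict) -> dict:
    """For each index, return the class solution with the smallest value."""
    return {ix: min(candidates, key=lambda m: candidates[m][ix])
            for ix in ("AIC", "BIC", "ssaBIC")}

# Hypothetical 1-, 2-, and 3-class solutions fitted to N = 500 examinees:
fits = {
    1: information_criteria(-4800.0, 30, 500),
    2: information_criteria(-4760.0, 47, 500),
    3: information_criteria(-4755.0, 64, 500),
}
print(best_model(fits))
```

With these illustrative numbers the BIC's heavier complexity penalty leads it to the one-class model while the AIC and ssaBIC prefer the two-class model, mirroring the kind of disagreement the dissertation reports.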
In a 2006 paper, Bandalos and Cohen commented that while the estimation of factor mixture models had previously been presented in the IRT literature, the models were not as frequently utilized with SEM-based models. However, programming enhancements to software packages such as

A MONTE CARLO INVESTIGATION OF THE PERFORMANCE OF FACTOR MIXTURE MODELING IN THE DETECTION OF DIFFERENTIAL ITEM FUNCTIONING

By MARY GRACE-ANNE JACKMAN

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA 2010

During this period, the term "bias" also came under heavy scrutiny. It was felt that the strong emotional connotation the word carried was creating a semantic rift between the technical testing community and the general public. As a result of this debate, the term differential item functioning (DIF) was proposed as a less value-laden, more neutral replacement (Angoff, 1993; Cole, 1993). The DIF concept is defined as the accumulation of empirical evidence to investigate whether there is a difference in performance between comparable groups of examinees (Hambleton, Swaminathan, & Rogers, 1991). More specifically, it refers to a difference in the probability of correctly responding to an item between two subgroups of examinees of the same ability, or groups matched by their performance on the test representing the underlying construct of interest (Kamata & Binici, 2003; Potenza & Dorans, 1995). These two groups are referred to as the focal group, those examinees expected to be disadvantaged by the item(s) of interest on the test (e.g., females or African Americans), and the reference group, those examinees expected to be favored by the DIF items (e.g., males or Caucasians). Since the introduction of the early CTT methods, a variety of additional techniques have been introduced for the detection of DIF (Clauser & Mazor, 1998).
These include the nonparametric Mantel-Haenszel chi-square method (Holland & Thayer, 1988; Mantel & Haenszel, 1959), the standardization method (Dorans & Holland, 1993), logistic regression (Swaminathan & Rogers, 1990), likelihood ratio tests (Wainer, Sireci, & Thissen, 1991), and item response theory (IRT) approaches comparing parameter estimates (Lord, 1980) or estimating the areas between the item characteristic curves (Raju, 1988, 1990). Additionally, though not as popular in the testing and measurement

approaches, the results are not necessarily identical except in the case of perfect group-class correspondence. In this simulation, given that the overlap between the latent classes and manifest groups was simulated to be 80%, the DIF results should be expected to differ to some degree. Therefore, one possible reason for the Type I error rate inflation may be this difference in the definition and conceptualization of DIF. Additionally, the procedure used to test the invariance of the items may also have contributed to this seemingly high rate of inflation. In testing the significance of the differences in item thresholds, Mplus invokes a Wald test. An examination of these estimates revealed several large coefficients, which in turn would have resulted in large z-statistics and an increased likelihood of significance. However, whether the inflated error rate resulted from applying a factor mixture approach to these data or from using significance testing of threshold differences to identify non-invariant items remains unresolved.

Limitations of the Study and Suggestions for Future Research

As with all simulation research, there are several limitations to this study. However, these limitations also point to the need for future research. First, in determining the correct number of latent classes, the findings were limited by the use of only one type of model fit index.
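The Wald test of threshold differences mentioned above can be sketched as a simple z-test. This is a simplified, stand-alone version that assumes independent threshold estimates with known standard errors (the numbers below are hypothetical); the actual test in Mplus operates on the joint covariance matrix of the parameter estimates.

```python
import math

def wald_threshold_test(tau1, tau2, se1, se2, alpha=0.05):
    """Wald z-test of the difference between an item's thresholds in two
    latent classes, assuming the two estimates are independent.

    Returns (z, two-sided p, flagged-as-DIF)."""
    z = (tau1 - tau2) / math.sqrt(se1 ** 2 + se2 ** 2)
    # Two-sided p-value from the standard normal CDF, via erf.
    p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
    return z, p, p < alpha

# Hypothetical threshold estimates for one item in the two latent classes:
z, p, dif = wald_threshold_test(tau1=-0.20, tau2=0.85, se1=0.18, se2=0.21)
```

A large absolute z (here roughly 3.8) flags the item as non-invariant across classes, which is exactly why a few large, noisy threshold coefficients can inflate the Type I error rate.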
It would have been interesting to compare the results of the information criteria indices (i.e., AIC, BIC, and ssaBIC) with those of alternative tests such as the Lo-Mendell-Rubin likelihood ratio test (LMR LRT) and the bootstrap LRT (BLRT). In their simulation study, Nylund et al. (2006) found that the LMR LRT was reasonably effective at identifying the correct mixture model. However, the BLRT outperformed both the likelihood-based indices and the LMR LRT as the most consistent indicator for choosing the correct number of classes. While these results are

BIOGRAPHICAL SKETCH

Mary Grace-Anne Jackman was born in Bridgetown, Barbados. In 1994, she graduated from the University of the West Indies, Barbados, with a Bachelor of Science degree in mathematics and computer science (first class honors). After being awarded an Errol Barrow Scholarship, she entered Oxford University in 1996 and received a Master of Science degree in applied statistics in 1997. In 2002 she graduated from the University of Georgia with a master's degree in marketing research. Following four years as a marketing research consultant in New York and Barbados, she began doctoral studies in research and evaluation methodology at the University of Florida in the fall of 2006.

logistic distribution function for the measurement errors, which allows the coefficients to be interpreted either in terms of logits or converted to changes in odds. Under the LRV formulation, the single factor model for a continuous latent trait measured by binary outcomes is expressed as in Equation 7. Therefore, the conditional probability of a correct response as a function of $\eta_j$ is:

$P(y_{ij} = 1 \mid \eta_j) = P(y^*_{ij} > \tau_i) = 1 - F\left[(\tau_i - \lambda_i \eta_j)\, V(\varepsilon_i)^{-1/2}\right] \qquad (8)$

where $V(\varepsilon_i)$ is the residual variance and F can be either the standard normal or logistic distribution function, depending on the distributional assumptions about the residuals (Muthen & Asparouhov, 2002).
Further, in addition to the LRV, latent variable models with categorical outcomes can also be presented using an alternative formulation. The conditional probability curve formulation focuses on directly modeling the nonlinear relationship between the observed $y_{ij}$ and the latent trait $\eta_j$ as:

$P(y_{ij} = 1 \mid \eta_j) = F\left[a_i(\eta_j - b_i)\right] \qquad (9)$

where $a_i$ is the item discrimination, $b_i$ is the item difficulty, and the distribution of F is either the standard normal or logistic distribution function. In their 2002 paper, Muthen and Asparouhov illustrated the equivalence of results between these two conceptual formulations of factor analytic models with categorical outcomes. The authors showed the equivalence by equating the two formulations:

$F\left[-(\tau_i - \lambda_i \eta_j)\, V(\varepsilon_i)^{-1/2}\right] = F\left[a_i(\eta_j - b_i)\right] \qquad (10)$

smaller number of latent classes. On the other hand, while simulation results from Nylund et al. (2006) supported the finding of the AIC favoring models with more latent classes, their study found the BIC to be the most consistent indicator of the true number of latent classes. This latter result contrasted with other studies which touted the merits of the ssaBIC over the BIC for class enumeration (Henson, 2004; Yang, 2006; Tofighi & Enders, 2007). Therefore, given the inconsistencies in results, no single information criterion can be regarded as the most appropriate for class enumeration across all types of finite mixture models. Liu (2009) argued that these inconsistencies should be expected because the performances of the indices depend heavily on the estimation model and the population assumptions. In addition, because to date no full-scale study has been conducted comparing the performance of these indices for factor mixture DIF applications, no definite conclusion can be reached regarding the index best suited for this type of factor mixture application. Clearly, this represents an opportunity for future research.
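Equating the arguments of F in Equation 10 yields the conversion $a_i = \lambda_i\, V(\varepsilon_i)^{-1/2}$ and $b_i = \tau_i / \lambda_i$ between the factor-analytic and IRT parameterizations. A minimal numeric sketch of that conversion, assuming the standard logistic residual variance $\pi^2/3$ and purely illustrative parameter values:

```python
import math

def fa_to_irt(tau, lam, residual_var=math.pi ** 2 / 3):
    """Convert a factor-analytic threshold/loading pair (tau, lambda) to
    2PL discrimination/difficulty (a, b) via Equation 10:

        a_i = lambda_i * V(eps_i)^(-1/2),    b_i = tau_i / lambda_i

    The default residual variance pi^2/3 corresponds to choosing the
    standard logistic distribution for F."""
    a = lam / math.sqrt(residual_var)
    b = tau / lam
    return a, b

# Illustrative values: threshold 1.2, loading 1.5.
a, b = fa_to_irt(tau=1.2, lam=1.5)
```

Because the conversion is monotone in both parameters, either parameterization can be recovered exactly from the other, which is what makes the LRV and conditional probability curve formulations interchangeable.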
Results from this study also point to several instances where negligible differences in IC values between neighboring class models were observed. Therefore, even though a model may have produced the lowest average IC value, the IC value of the k+1- or k-1-class model did not differ substantially from that of the k-class model. In such cases, the absence of an agreed-upon standard for judging the significance of these IC differences increases the ambiguity of the selection of the "correct" model. This presents the opportunity for the creation of such a significance statistic, a possibility that will be explored later as a potential area for further research.

As was noted earlier, when comparing models with different numbers of latent classes, lower values of the AIC, BIC, and ssaBIC indicate better-fitting models. A typical approach to class enumeration begins with the fitting of a baseline one-class model and then successively fitting models with additional classes, with the goal of identifying the mixture model with the smallest number of latent classes that provides the best fit to the data (Liu, 2009). However, previous research has found that results from different information criteria can provide ambiguous evidence regarding the optimal number of classes. In addition, across different mixture models, there is also inconsistency regarding which model selection information criterion performs best. Nylund et al. (2006) conducted a simulation study comparing the performance of commonly-used information criteria for three types of mixture models: latent class, factor mixture, and growth mixture models. Overall, the researchers found that among the information criteria measures, the AIC, which does not adjust for sample size, performed poorly and identified the correct k-class model on fewer occasions than the two sample-size adjusted indices, the BIC and the ssaBIC.
Moreover, the AIC frequently favored the selection of the k+1-class model over the correct k-class model. In addition, whereas the ssaBIC generally performed well with smaller sample sizes (N = 200, 500), the BIC tended to be the most consistent overall performer, particularly with larger sample sizes (N = 1000). Based on their simulation results, Nylund et al. (2006) concluded that the BIC was the most accurate and consistent of the IC measures at determining the correct number of latent classes. Yang (1998) evaluated the performance of eight information criteria in the selection of latent class analysis (LCA) models for six simulated levels of sample size.

## TABLE OF CONTENTS

- ACKNOWLEDGMENTS
- LIST OF TABLES
- LIST OF FIGURES
- ABSTRACT
- 1 INTRODUCTION
- 2 LITERATURE REVIEW
  - Differential Item Functioning
    - Types of Differential Item Functioning
    - DIF vs. Impact
    - Frameworks for Examining DIF
      - Observed score framework
      - The latent variable framework
  - SEM-based DIF Detection Methods
  - Factor Analytic Models with Ordered Categorical Items
  - Mixture Modeling as an Alternative Approach to DIF Detection
  - Estimation of Mixture Models
  - Class Enumeration
  - Information Criteria Indices
  - Mixture Model Estimation Challenges
  - Purpose of Study
- 3 METHODOLOGY
  - Factor Mixture Model Specification for Latent Class DIF Detection
  - Data Generation
  - Simulation Study Design
  - Research Study 1
    - Manipulated Conditions
      - Sample size
      - Magnitude of uniform DIF
      - Ability differences between groups
    - Fixed Simulation Conditions
      - Test length
      - Number of DIF items
      - Sample size ratio
      - Percentage of overlap between manifest and latent classes

Raju, N. S. (1990). Determining the significance of estimated signed and unsigned areas between two item response functions.
Applied Psychological Measurement, 14, 197-207.

Reynolds, M. R. (2008). The use of factor mixture modeling to investigate population heterogeneity in hierarchical models of intelligence. Unpublished doctoral dissertation, University of Texas, Austin.

Rindskopf, D. (2003). Mixture or homogeneous? Comment on Bauer and Curran (2003). Psychological Methods, 8, 364-368.

Rogers, H. J., & Swaminathan, H. (1993). A comparison of logistic regression and Mantel-Haenszel procedures for detecting differential item functioning. Applied Psychological Measurement, 17, 105-116.

Rost, J. (1990). Rasch models in latent classes: An integration of two approaches to item analysis. Applied Psychological Measurement, 14, 271-282.

Roussos, L. A., & Stout, W. F. (1996a). A multidimensionality-based DIF analysis paradigm. Applied Psychological Measurement, 20, 355-371.

Roussos, L. A., & Stout, W. F. (1996b). Simulation studies of the effects of small sample size and studied item parameters on SIBTEST and Mantel-Haenszel Type I error performance. Journal of Educational Measurement, 33, 215-230.

Samuelsen, K. (2005). Examining differential item functioning from a latent class perspective. Unpublished doctoral dissertation, University of Maryland, College Park.

Samuelsen, K. (2008). Examining differential item functioning from a latent perspective. In G. R. Hancock & K. M. Samuelsen (Eds.), Advances in latent variable mixture models (pp. 177-197). Charlotte, NC: Information Age Publishing.

Sawatzky, R. (2007). The measurement of quality of life and its relationship with perceived health status in adolescents. Unpublished doctoral dissertation, The University of British Columbia.

Schwartz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6, 461-464.

Sclove, L. (1987). Application of model-selection criteria to some problems in multivariate analysis. Psychometrika, 52, 333-343.

## LIST OF REFERENCES

Abraham, A. A. (2008).
Model selection methods in the linear mixed model for longitudinal data. Unpublished doctoral dissertation, University of North Carolina at Chapel Hill.

Ackerman, T. A. (1992). A didactic explanation of item bias, item impact, and item validity from a multidimensional perspective. Journal of Educational Measurement, 29, 67-91.

Ainsworth, A. T. (2007). Dimensionality and invariance: Assessing DIF using bifactor MIMIC models. Unpublished doctoral dissertation, University of California, Los Angeles.

Agrawal, A., & Lynskey, M. T. (2007). Does gender contribute to heterogeneity in criteria for cannabis abuse and dependence? Results from the national epidemiological survey on alcohol and related conditions. Drug and Alcohol Dependence, 88, 300-307.

Akaike, H. (1987). Factor analysis and AIC. Psychometrika, 52, 317-332.

Allua, S. S. (2007). Evaluation of single- and multilevel factor mixture model estimation. Unpublished doctoral dissertation, University of Texas, Austin.

Anderson, L. W. (1985). Opportunity to learn. In T. Husen & T. N. Postlethwaite (Eds.), The international encyclopedia of education (Vol. 6, pp. 3682-3686). Oxford: Pergamon Press.

Angoff, W. H. (1972). A technique for the investigation of cultural differences. Paper presented at the annual meeting of the American Psychological Association, Honolulu. (ERIC Document Reproduction Service No. ED 069686)

Angoff, W. H. (1993). Perspectives on differential item functioning methodology. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 3-23). Hillsdale, NJ: Lawrence Erlbaum.

Angoff, W. H., & Sharon, A. T. (1974). The evaluation of differences in test performance of two or more groups. Educational and Psychological Measurement, 34, 807-816.

Ankemann, R. D., Witt, E. A., & Dunbar, S. B. (1999). An investigation of the power of the likelihood ratio goodness-of-fit statistic in detecting differential item functioning. Journal of Educational Measurement, 36, 277-300.
Magnitude of uniform DIF

In previous DIF studies (Camilli & Shepard, 1987; De Ayala et al., 2005; Meade, Lautenschlager, & Johnson, 2007; Samuelsen, 2005), the manipulated difficulty shifts have typically varied in magnitude from .3 to 1.5. Overall, these results have shown higher DIF detection rates for items simulated to have moderate or strong amounts of DIF. However, with mixture models, it may be necessary to simulate larger DIF magnitudes to ensure the detection of DIF. This hypothesis was based on the results of a preliminary small-scale simulation in which several levels of DIF magnitude were manipulated. As a result, this study focused on DIF effects at the upper range of the scale where the magnitude of manifest differential functioning is large, namely, Δb = 1.0 and Δb = 1.5. For items with no DIF, the item difficulties are defined as b_iF = b_iR. On the other hand, when there is uniform DIF, the items are simulated to function differently in favor of the reference group, and the item difficulties are defined as b_iF = b_iR + Δb (where Δb = 1.0 or 1.5).

Ability differences between groups

Several researchers have recommended the inclusion of latent ability differences (i.e., impact) in DIF detection studies, contending that in real data sets the focal and reference populations typically have different latent distributions (Camilli & Shepard, 1987; De Ayala et al., 2002; Donoghue, Holland, & Thayer, 1993; Duncan, 2006; Stark et al., 2006). Simulation results on the effects of impact on DIF detection have varied. For instance, some researchers have reported good control of Type I error rates with a moderate difference of .5 SD (Stark et al., 2006) and even with differences as large as 1 SD (Narayanan & Swaminathan, 1994).
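The data-generation scheme described above, 2PL responses with a uniform difficulty shift Δb added for the focal group, can be sketched as follows. The item-parameter ranges, the choice of three DIF items, and the 0.5 SD impact value are illustrative assumptions for this sketch, not the study's exact fixed conditions.

```python
import numpy as np

rng = np.random.default_rng(2010)

def simulate_2pl(theta, a, b):
    """Dichotomous responses under the 2PL model:
    P(y = 1 | theta) = 1 / (1 + exp(-a * (theta - b)))."""
    p = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b[None, :])))
    return (rng.random(p.shape) < p).astype(int)

n_items, n_per_group, delta_b = 15, 500, 1.0   # the Δb = 1.0 condition
a = rng.uniform(0.8, 2.0, n_items)             # discriminations (illustrative)
b_ref = rng.uniform(-1.5, 1.5, n_items)        # reference-group difficulties
b_foc = b_ref.copy()
b_foc[:3] += delta_b                           # uniform DIF against the focal
                                               # group on the first 3 items

theta_ref = rng.normal(0.0, 1.0, n_per_group)  # reference-group abilities
theta_foc = rng.normal(-0.5, 1.0, n_per_group) # focal abilities (impact = .5 SD)

# Stack reference- and focal-group response matrices into one data set.
y = np.vstack([simulate_2pl(theta_ref, a, b_ref),
               simulate_2pl(theta_foc, a, b_foc)])
```

Shifting only the focal group's difficulties (b_iF = b_iR + Δb) while holding discriminations equal is what makes the simulated DIF uniform rather than nonuniform.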
On the other hand, others (Cheung & Rensvold, 1999; Lee, 2009; Roussos & Stout, 1996; Uttaro & Millsap, 1994) have

Joreskog, K., & Goldberger, A. (1975). Estimation of a model of multiple indicators and multiple causes of a single latent variable. Journal of the American Statistical Association, 70, 631-639.
Kamata, A., & Bauer, D. J. (2008). A note on the relationship between factor analytic and item response theory models. Structural Equation Modeling: A Multidisciplinary Journal, 15, 136-153.

Kamata, A., & Binici, S. (2003). Random effect DIF analysis via hierarchical generalized linear modeling. Paper presented at the annual International Meeting of the Psychometric Society, Sardinia, Italy.

Kamata, A., & Vaughn, B. K. (2004). An introduction to differential item functioning analysis. Learning Disabilities: A Contemporary Journal, 2, 49-69.

Kuo, P. H., Aggen, S. H., Prescott, C. A., Kendler, K. S., & Neale, M. C. (2008). Using a factor mixture modeling approach in alcohol dependence in a general population sample. Drug and Alcohol Dependence, 98, 105-114.

Larson, S. L. (1999). Rural-urban comparisons of item responses in a measure of depression. Unpublished doctoral dissertation, University of Nebraska.

Lau, A. (2009). Using a mixture IRT model to improve parameter estimates when some examinees are motivated. Unpublished doctoral dissertation, James Madison University.

Lazarsfeld, P. F., & Henry, N. W. (1968). Latent structure analysis. Boston: Houghton Mifflin.

Lee, J. (2009). Type I error and power of the mean and covariance structure confirmatory factor analysis for differential item functioning detection: Methodological issues and resolutions. Unpublished doctoral dissertation, University of Kansas.

Leite, W. L., & Cooper, L. (2007). Diagnosing social desirability bias with structural equation mixture models. Paper presented at the Annual Meeting of the American Psychological Association.

Li, F., Cohen, A. S., Kim, S.-H., & Cho, S.-J. (2009). Model selection methods for mixture dichotomous IRT models. Applied Psychological Measurement, 33, 353-373.

Lin, T. H., & Dayton, C. M. (1997). Model selection information criteria for non-nested latent class models. Journal of Educational and Behavioral Statistics, 22, 249-264.

Table 4-5.
Percentages of converged solutions across study conditions

DIF Magnitude  Sample Size  Ability Differences  Percentage converged
1.0            500          0                    99.8
1.0            500          0.5                  99.5
1.0            500          1.0                  99.4
1.0            1000         0                    100.0
1.0            1000         0.5                  99.7
1.0            1000         1.0                  99.6
1.5            500          0                    99.8
1.5            500          0.5                  99.7
1.5            500          1.0                  99.7
1.5            1000         0                    100.0
1.5            1000         0.5                  100.0
1.5            1000         1.0                  99.8

Table 4-6. Overall Type I error rates across study conditions

DIF Magnitude  Sample Size  Impact  Error rate
1.0            500          0       0.123
1.0            500          0.5     0.126
1.0            500          1.0     0.159
1.0            1000         0       0.131
1.0            1000         0.5     0.129
1.0            1000         1.0     0.138
1.5            500          0       0.097
1.5            500          0.5     0.112
1.5            500          1.0     0.116
1.5            1000         0       0.092
1.5            1000         0.5     0.092
1.5            1000         1.0     0.100

minimal, in the number of properly converged solutions. More specifically, while the one-class model attained perfect convergence rates, the average convergence rates for the two- and three-class mixture models were 96% and 91%, respectively. An inspection of the results also revealed a positive relationship between the convergence rate and the DIF magnitude. Of the 16 cells that failed to converge in the two-class model, 15 were for the smaller DIF condition. A similar trend was observed with the three-class model: of the 37 cells that failed to produce a properly convergent solution, 27 were associated with the smaller DIF (1.0) condition. The cases with non-convergent solutions were excluded from the second part of this analysis.

Class Enumeration

Summary data based on the three IC measures (AIC, BIC, and ssaBIC) for the one-, two-, and three-class models are provided in Tables 4-2 through 4-4. In comparing the fit of the models across classes, the smallest average IC value was used as the criterion for selecting the "best-fitting" model. An examination of the average IC values highlighted both overall and IC-specific patterns of results. First, as expected, there is a general increase in the average IC values as sample size increases. Second, the differences in average IC values between neighboring competing models were generally not substantial, and even negligible under some conditions.
Third, with respect to the individual indices, a high level of inconsistency in model selection patterns is observed. The results for the three indices are described in more detail in the following sections.

latent classes. Unlike the manifest situation, where group membership is observed and group proportions are known, class membership is unobserved. Therefore an additional model parameter, known as the mixing proportion, φ, is estimated (Gagne, 2004). The K − 1 mixing proportions estimate the proportion of individuals comprising each of the K hypothesized classes. Additionally, while individuals obtain a probability of membership in each of the K classes, they are assigned to a specific class based on their highest posterior probability of class membership. To estimate the model parameters, the joint log-likelihood of the mixture across all observations is maximized (Gagne, 2004). For a mixture of two latent subpopulations, the joint log-likelihood of the mixture model can be expressed as the maximization of:

ln L = Σ_{i=1}^{N} ln[ φ L_{i1} + (1 − φ) L_{i2} ]    (17)

where L_{i1} and L_{i2} represent the likelihood of the ith examinee being a member of subpopulation 1 and subpopulation 2, respectively, φ represents the unknown mixing proportion, and N is the total number of examinees in the sample. Likewise, for K subpopulations, Gagne (2004) presents the joint log-likelihood of the mixture model as:

ln L = Σ_{i=1}^{N} ln[ Σ_{k=1}^{K} φ_k f_k(x_i; μ_k, Σ_k) ]    (18)

where μ_k = ν_k + Λ_k κ_k and Σ_k = Λ_k Φ_k Λ_k′ + Θ_k.

Class Enumeration

An important decision to be made is determining the number of latent classes existing in the population (Bauer & Curran, 2004; Nylund, Asparouhov, & Muthen, 2006). Traditionally, researchers use standard chi-squared based statistics to compare

Zumbo, B. D. (2007). Three generations of differential item functioning (DIF) analyses: Considering where it has been, where it is now, and where it is going. Language Assessment Quarterly, 4, 223-233.

Zumbo, B. D., & Gelin, M. N.
(2005). A matter of test bias in educational policy research: Bringing the context into the picture by investigating sociological/community moderated (or mediated) test and item bias. Journal of Educational Research and Policy Studies, 5, 1-23.

Zwick, R., Donoghue, J., & Grima, A. (1993). Assessment of differential item functioning for performance tasks. Journal of Educational Measurement, 30, 233-251.

two extremes and tended to favor the "true" two-class mixture model as the sample size or DIF magnitude was increased. In the second phase of the study, the factor mixture approach was assessed in terms of its Type I error rate and statistical power to detect uniform DIF. One thousand data sets were replicated for each of the 12 study conditions. The presence of uniform DIF was assessed via significance tests of the differences in item thresholds across latent classes. Overall, the results were not as encouraging as was hoped. Inflated Type I errors were observed under all of the study conditions, particularly when the sample size and DIF magnitude were reduced.

Akaike Information Criterion (AIC)

The average AIC values across the three specified mixture models are presented in Table 4-2. Overall, the pattern of results shows that the AIC tended to over-extract the number of latent classes. This trend was observed for six of the eight simulated conditions, where the lowest AIC values corresponded to the three-class mixture model. The only exceptions to this pattern occurred for two of the four conditions in which the DIF magnitude was increased to 1.5. In these cases, the lowest average AIC values occurred at the "correct" two-class model. However, it is important to note that the differences between neighboring class solutions were rather small, with the largest absolute difference between values being less than 40 points.
Moreover, the differences between the two- and three-class models are practically negligible, ranging in absolute magnitude from .02 to 8.72 across the eight simulated conditions. Although smaller IC values are indicative of better model fit, the minor differences between the average AIC values make the choice between these two models a less "clear-cut" decision.

Bayesian Information Criterion (BIC)

The BIC results are presented in Table 4-3. Based on the average BIC values, this index consistently selected the simpler one-class model as the correct model for the data. For each of the eight manipulated simulation conditions, the lowest values corresponded to the one-class mixture model. The IC differences between neighboring class models are generally larger for the BIC than for the corresponding AIC solutions; more specifically, they ranged in absolute magnitude between 40 and 88 points on average. The differences between the one-class and the "correct" two-class model were minimized

* Statistical power

Power (or the true-positive rate) was computed as the proportion of times that the analysis correctly identified the DIF items as having DIF. The overall power rate was therefore calculated by dividing the total number of times any of the five DIF items (i.e., Items 2-6) was correctly identified by the total number of properly converged replications across each of the 12 simulated conditions. In addition to the computation of the overall Type I error and power rates of the factor mixture method, a variance components analysis was also conducted to examine the influence of each of the conditions and their interactions on the performance of the method.
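The rate computations described above can be sketched as follows. This is a toy illustration with hypothetical detection flags, not the dissertation's actual Mplus/R pipeline; the per-item averaging is one plausible reading of the overall-rate formulas.

```python
import numpy as np

# Toy detection flags: rows = properly converged replications, columns =
# the 15 items (0-based). True means the significance test flagged the
# item's threshold difference across latent classes.
# Items 2-6 (indices 1-5) carry simulated uniform DIF; Item 1 is the
# referent with constrained thresholds, so Type I error is judged on the
# DIF-free Items 7-15 (indices 6-14).
flags = np.zeros((4, 15), dtype=bool)
flags[0, 1:6] = True      # replication 0: all five DIF items caught
flags[1, [2, 3]] = True   # replication 1: two DIF items caught
flags[2, [4, 8]] = True   # replication 2: one hit, one false positive
# replication 3: nothing flagged

dif = np.arange(1, 6)     # indices of the five DIF items
clean = np.arange(6, 15)  # indices of the nine DIF-free items

power = flags[:, dif].mean()    # hits / (replications x 5)  -> 0.4
type1 = flags[:, clean].mean()  # false positives / (replications x 9)
```

With real output the flag matrix would be built from the per-replication threshold-difference tests rather than set by hand.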
In this analysis, which was conducted in R V2.9.0 (R Development Core Team, 2009), the independent variables were the three study conditions (DIF magnitude, sample size, and impact) and the dependent variables were the Type I error and power rates. Eta-squared (η²), the percentage of variance explained by each of the main effects and their interactions, was used as the measure of effect size.

Model Estimation

The parameters of the mixture models were estimated in Mplus V5.1 (Muthen & Muthen, 1998-2008) with robust maximum likelihood estimation (MLR) using the EM algorithm, the default estimator for mixture analysis in Mplus. One of the main limitations of running a mixture simulation study is the lengthy computation time needed for model estimation. In the interest of time, the random starts feature, which randomly generates sets of starting values, was not used in this part of the study. Instead, the true population parameters for the factor loadings, thresholds, and factor variances were substituted for the starting values, which reduced the computation time for model estimation considerably.

* If the number of classes is known a priori, how well does the factor mixture model perform at detecting differentially functioning items? Specifically, how are the (i) convergence rates, (ii) Type I error rates, and (iii) power to detect DIF affected under various manipulated conditions characteristic of those that may be encountered in DIF research?

Table 4-2. Mean AIC values for the three mixture models

DIF Magnitude  Sample Size  Ability Differences  One-class   Two-class   Three-class
1.0            500          0                    9,520.64    9,509.62    9,501.86
1.0            500          0.5                  9,310.67    9,305.81    9,298.37
1.0            1000         0                    18,956.98   18,961.67   18,961.03
1.0            1000         0.5                  18,578.46   18,578.01   18,569.40
1.5            500          0                    9,495.81    9,473.12    9,473.10
1.5            500          0.5                  9,314.87    9,284.66    9,293.38
1.5            1000         0                    18,953.93   18,916.12   18,918.62
1.5            1000         0.5                  18,559.08   18,556.92   18,550.50
Note: AIC = Akaike Information Criterion

Table 4-3.
Mean BIC values for the three mixture models

DIF Magnitude  Sample Size  Ability Differences  One-class   Two-class   Three-class
1.0            500          0                    9,647.08    9,707.71    9,771.59
1.0            500          0.5                  9,437.11    9,503.89    9,568.11
1.0            1000         0                    19,104.21   19,192.34   19,275.12
1.0            1000         0.5                  18,725.70   18,808.68   18,883.21
1.5            500          0                    9,622.25    9,671.21    9,742.83
1.5            500          0.5                  9,441.31    9,482.75    9,563.11
1.5            1000         0                    19,101.16   19,146.78   19,232.72
1.5            1000         0.5                  18,706.31   18,787.58   18,864.60
Note: BIC = Bayesian Information Criterion

Table 4-4. Mean ssaBIC values for the three mixture models

DIF Magnitude  Sample Size  Ability Differences  One-class   Two-class   Three-class
1.0            500          0                    9,551.86    9,558.53    9,568.45
1.0            500          0.5                  9,341.89    9,354.71    9,364.97
1.0            1000         0                    19,008.93   19,043.06   19,071.86
1.0            1000         0.5                  18,630.42   18,659.40   18,679.95
1.5            500          0                    9,527.02    9,522.03    9,539.69
1.5            500          0.5                  9,346.09    9,333.56    9,359.97
1.5            1000         0                    19,005.88   18,997.51   19,029.45
1.5            1000         0.5                  18,611.03   18,638.31   18,661.33
Note: ssaBIC = sample size adjusted Bayesian Information Criterion

Variance components analysis

Following the descriptive analysis of the pattern of Type I error rates across the simulated conditions, a variance components analysis was conducted to examine the influence of each simulation condition, and of their interactions, on the Type I error rates. The results of this analysis are presented in Table 4-14. Based on the η² values, which ranged from 0.000 to 0.007, the only factor contributing to the variance in Type I error rates was the magnitude of DIF, accounting for a mere 0.7%. All other main effects and interactions produced trivial η² values.

Statistical Power

In the analyses above, the proportion of false DIF detections produced by the factor mixture approach consistently exceeded the nominal value of 0.05. When Type I error is inflated at this level, power rates are no longer interpretable in terms of the standard alpha level. The power rates have nevertheless been analyzed, and are displayed in Tables 4-15 and 4-24.
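The η² computation just described can be sketched directly from the Table 4-6 cell means. Note that this toy version partitions variance among the 12 cell means only; the dissertation's analysis was run in R on the replication-level data, where within-cell variance enters the total and yields the much smaller η² values reported above.

```python
import numpy as np

# Type I error rates from Table 4-6, laid out as a 2 x 2 x 3 array:
# axis 0 = DIF magnitude (1.0, 1.5), axis 1 = N (500, 1000),
# axis 2 = impact (0, 0.5, 1.0).
rates = np.array([
    [[0.123, 0.126, 0.159],
     [0.131, 0.129, 0.138]],
    [[0.097, 0.112, 0.116],
     [0.092, 0.092, 0.100]],
])

grand = rates.mean()
ss_total = ((rates - grand) ** 2).sum()

def eta_sq(axis):
    """eta^2 for one main effect: between-level SS over total SS."""
    level_means = rates.mean(axis=tuple(a for a in range(3) if a != axis))
    n_per_level = rates.size / rates.shape[axis]
    ss_effect = (n_per_level * (level_means - grand) ** 2).sum()
    return ss_effect / ss_total

for name, ax in (("DIF", 0), ("N", 1), ("impact", 2)):
    print(name, round(eta_sq(ax), 3))
# prints: DIF 0.7 / N 0.047 / impact 0.146
```

Even on cell means alone, DIF magnitude dominates the between-condition variation, consistent with the direction of the dissertation's finding.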
However, it is important to note that these results should be interpreted with caution given the elevated Type I error rates. Power was assessed as the proportion of times across the 1,000 replications that the factor mixture analysis correctly identified the five items (i.e., Items 2 to 6) simulated as having uniform DIF. Typically, values of at least .80 indicate that the analysis method is reasonably accurate in correctly detecting items with DIF. Results for the power analysis are displayed in Tables 4-15 through 4-22. The overall accuracy of DIF detection of the factor mixture analysis was 0.447, with the power of correct detection ranging from .264 to .801 across all simulated conditions. The only combination of conditions for which an acceptable level of power was achieved was when larger DIF (1.5) and sample size (N = 1000) were simulated and impact was

4-21 Power rates for impact of 0.5 SD
4-22 Power rates for impact of 1.0 SD
4-23 Power rates for DIF detection based on item discriminations
4-24 Variance components analysis for power results

Figure 2-1. Example of uniform DIF (item characteristic curves for Item 1, Group 1 vs. Item 1, Group 2, plotted against the latent trait)

Figure 2-2. Example of non-uniform DIF (item characteristic curves for Item 1, Group 1 vs. Item 1, Group 2, plotted against the latent trait)

exceeding a specific threshold value, then the examinee will answer the item correctly. Likewise, if it falls below the threshold, an incorrect response is observed. Based on this formulation, the observed item responses can be viewed as discrete categorizations of the continuous latent variables.
The relationship between the two variables y* and y is represented by the following nonlinear function:

y = c, if τ_{c−1} < y* ≤ τ_c

where c indexes the response categories for y and the threshold structure is defined by τ_0 = −∞ < τ_1 < ... < τ_C = +∞, giving C − 1 estimated thresholds for C categories. In the case of binary items, the mapping of y*_i onto y_i is expressed as:

y_i = 0, if y*_i < τ_1
y_i = 1, if y*_i ≥ τ_1

where τ_1 denotes the threshold parameter for test item y_i. This relationship is illustrated in Figure 2-3. Because of the LRV formulation, the measurement component of the model, which relates, in this case, the continuous latent response variables to the latent factor and to the group membership variable, is respecified as:

y*_ij = ν_j + λ_j η_i + ε_ij    (7)

where y*_ij is individual i's latent response to item j. The distributional assumptions on the p-vector of measurement errors determine the appropriate link function. For example, if the measurement errors are assumed to be normally distributed, then the probit link function, that is, the inverse of the cumulative normal distribution function Φ⁻¹[·], is used. As a result, the thresholds and factor loadings are interpreted as probit coefficients in the linear probit regression equation. The alternative is to assume a

Muthen, B. O., Grant, B., & Hasin, D. (1993). The dimensionality of alcohol abuse and dependence: Factor analysis of DSM-III-R and proposed DSM-IV criteria in the 1988 National Health Interview Survey. Addiction, 88, 1079-1090.

Muthen, B. O., Kao, C., & Burstein, L. (1991). Instructionally sensitive psychometrics: An application of a new IRT-based detection technique to mathematics achievement test items. Journal of Educational Measurement, 28, 1-22.

Muthen, B. O., & Lehman, J. (1985). Multiple group IRT modeling: Applications to item bias analysis. Journal of Educational Statistics, 10, 133-142.

Muthen, L. K., & Muthen, B. O. (1998-2008). Mplus user's guide (5th ed.). Los Angeles, CA: Muthen & Muthen.
Narayanan, P., & Swaminathan, H. (1994). Performance of the Mantel-Haenszel and simultaneous item bias procedures for detecting differential item functioning. Applied Psychological Measurement, 18, 315-328.

Navas-Ara, M. J., & Gomez-Benito, J. (2002). Effects of ability scale purification on identification of DIF. European Journal of Psychological Assessment, 18, 9-15.

Nylund, K. L., Asparouhov, T., & Muthen, B. (2006). Deciding on the number of classes in latent class analysis and growth mixture modeling: A Monte Carlo simulation study. Structural Equation Modeling, 14, 535-569.

O'Neill, K. A., & McPeek, W. M. (1993). Item and test characteristics that are associated with differential item functioning. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 255-276). Hillsdale, NJ: Erlbaum.

Oort, F. J. (1998). Simulation study of item bias detection with restricted factor analysis. Structural Equation Modeling, 5, 107-124.

Penfield, R. D., & Lam, T. C. M. (2000). Assessing differential item functioning in performance assessment: Review and recommendations. Educational Measurement: Issues and Practice, 19, 5-15.

Potenza, M. T., & Dorans, N. J. (1995). DIF assessment for polytomously scored items: A framework for classification and evaluation. Applied Psychological Measurement, 19, 23-27.

Raju, N. S. (1988). The area between two item characteristic curves. Psychometrika, 54, 495-502.

Raju, N. S., Bode, R. K., & Larsen, V. S. (1989). An empirical assessment of the Mantel-Haenszel statistic to detect differential item functioning. Applied Measurement in Education, 2, 1-13.

LIST OF TABLES

3-1 Generating population parameter values for reference group
3-2 Fixed and manipulated simulation conditions used in study 1
3-3 Fixed and manipulated simulation conditions used in study 2
4-1 Number of converged replications for the three factor mixture models
4-2 Mean AIC values for the three mixture models
4-3 Mean BIC values for the three mixture models
4-4 Mean ssaBIC values for the three mixture models
4-5 Percentages of converged solutions across study conditions
4-6 Overall Type I error rates across study conditions
4-7 Type I error rates for DIF = 1.0
4-8 Type I error rates for DIF = 1.5
4-9 Type I error rates for sample size of 500
4-10 Type I error rates for sample size of 1000
4-11 Type I error rates for impact of 0 SD
4-12 Type I error rates for impact of 0.5 SD
4-13 Type I error rates for impact of 1.0 SD
4-14 Variance components analysis for Type I error
4-15 Overall power rates across study conditions
4-16 Power rates for DIF of 1.0
4-17 Power rates for DIF of 1.5
4-18 Power rates for sample size N of 500
4-19 Power rates for sample size N of 1000
4-20 Power rates for impact of 0 SD

Table 4-23.
Power rates for DIF detection based on item discrimination

DIF Magnitude  Sample Size  Impact  a = .5         a = 1.0        a = 2.0
1.0            500          0.0     .184   .194    .273   .332    .359
1.0            500          0.5     .200   .220    .273   .321    .323
1.0            500          1.0     .200   .227    .273   .326    .296
1.0            1000         0.0     .261   .257    .387   .430    .416
1.0            1000         0.5     .248   .243    .363   .372    .392
1.0            1000         1.0     .226   .238    .332   .300    .357
1.5            500          0.0     .409   .425    .612   .598    .580
1.5            500          0.5     .411   .401    .561   .546    .571
1.5            500          1.0     .340   .360    .465   .481    .476
1.5            1000         0.0     .731   .728    .885   .841    .819
1.5            1000         0.5     .640   .655    .814   .782    .763
1.5            1000         1.0     .531   .528    .697   .675    .685
Mean                                .365   .373    .495   .500    .503

Table 4-24. Variance components analysis for power results

Condition            η²
DIF Magnitude (D)    .190
Sample size (S)      .046
Impact (I)           .009
D*S                  .012
D*I                  .005
S*I                  .002
D*S*I                .001

underlying dimensionality within these latent groups (Lubke & Muthen, 2005). As a result, a primary advantage of factor mixture modeling is its flexibility in allowing for a wider range of modeling options. For instance, models can be specified with multiple factors and multiple latent classes, and the within-class models can vary in the complexity of their relationships not only with the latent factors but also with observed combinations of continuous and categorical covariates (Allua, 2007; Lubke & Muthen, 2005, 2007; McLachlan & Peel, 2000). However, it is important to note that as more specifications are introduced to a model, both the complexity of the model and the computational intensity of the estimation process increase. Therefore, it is recommended that researchers be guided by substantive theory not only in supporting their hypothesis of population heterogeneity but also with regard to the complexity of their mixture model specifications (Allua, 2007; Jedidi, Jagpal, & DeSarbo, 1997). Previous applied studies highlight several fields where mixture modeling has been applied successfully to investigate population heterogeneity (Bauer & Curran, 2004; Jedidi et al., 1997; Kuo et al., 2008; Lubke & Muthen, 2005, 2007; Lubke & Neale, 2006; Muthen & Asparouhov, 2006; Muthen, 2006).
One area in which factor mixture models have been utilized with some success is substance abuse research (Kuo et al., 2008; Muthen et al., 2006). In their research on alcohol dependence, Kuo et al. (2008) compared a factor mixture model against a latent class model and a factor model to determine which of the three best explained alcohol dependence symptomatology patterns. They found that while a pure factor analytic model provided an unsatisfactory solution, the latent class approach provided a better fit to the

Figure 2-4. Diagram depicting specification of the factor mixture model

CHAPTER 5
DISCUSSION

This study was designed to evaluate the overall performance of factor mixture analysis in detecting uniform DIF. Specifically, there were two primary research goals: (i) to assess the ability of the factor mixture approach to correctly recover the number of latent classes, and (ii) to examine the Type I error rates and statistical power associated with the approach under various study conditions. Using data generated under a 2PL IRT framework, a Monte Carlo simulation study was conducted to investigate the properties of the proposed factor mixture model approach to DIF detection. First, a 15-item dichotomous test was simulated for a two-group, two-class population. In both parts of the study, the effects of DIF magnitude, sample size, and differences in latent trait means on the performance of the mixture approach were examined. In what follows, the major findings of each phase of the simulation are summarized first, followed by a discussion of the limitations of this study and suggestions for future research.

Class Enumeration and Performance of Fit Indices

In assessing the ability of the factor mixture approach to accurately recover the correct number of latent classes, models with one through three latent classes were fit to the simulated data.
In addition, three commonly used information criteria (AIC, BIC, and ssaBIC) were used in the selection of the "correct" model. Overall, there was a high level of inconsistency among the three ICs. In this study, the AIC tended to over-extract the number of classes and, under the majority of study conditions, supported the more complex but "incorrect" three-class model over the "true" two-class model. This behavior contrasted sharply with that of the BIC, which tended to underestimate the correct number of latent classes and consistently favored the simpler

in Figure 2-2. It is important to note that while some methods can detect both types of DIF, others are capable of detecting uniform DIF only.

DIF vs. Impact

As was mentioned previously, DIF occurs when there is a significant difference in the probability of answering an item correctly, or endorsing an item category, between groups of the same ability level (Wainer, 1993). Impact, on the other hand, refers to legitimate group differences in the probability of getting an item correct or endorsing an item category (Wainer, 1993). In distinguishing between these two concepts, it is important to note that DIF assessment compares groups of examinees matched on the same trait level, whereas impact refers to group differences without controlling or matching on the construct of interest.

Frameworks for Examining DIF

Traditionally, DIF is examined within one of two frameworks: (i) the observed score framework and (ii) the latent variable framework.

Observed score framework

The observed score framework includes methods that adjust for ability by conditioning on some observed or manifest variable designed to serve as a proxy for the underlying ability trait (Ainsworth, 2007). When an observed variable such as the total test score is used, it is assumed that this internal criterion is an unbiased estimate of the underlying construct.
Therefore, as an alternative to the aggregate test score, a purified form of the total test score may be used as the internal matching criterion. That is, if DIF is detected, the total test score is adjusted by dropping those items identified as displaying DIF and calculating a revised total score. This iterative process, which is used to refine the criterion measure and limit

Figure 2-3. Depiction of the relationship between y* and y for a dichotomous item

Table 3-2. Fixed and manipulated simulation conditions used in study 1

Manipulated conditions
  Sample size: 500, 1000
  Magnitude of DIF: 1.0, 1.5
  Latent mean distributions: θR ~ N(0,1), θF ~ N(0,1); θR ~ N(.5,1), θF ~ N(0,1)
Fixed conditions
  Test length: 15 items
  Number of DIF items: 5 items (33.3%)
  Sample size ratio: 1:1
  Class proportion: .5
  Overlap: 80%

Table 3-3. Fixed and manipulated simulation conditions used in study 2

Manipulated conditions
  Sample size: 500, 1000
  Magnitude of DIF: 1.0, 1.5
  Latent mean distributions: θR ~ N(0,1), θF ~ N(0,1); θR ~ N(.5,1), θF ~ N(0,1); θR ~ N(1,1), θF ~ N(0,1)
Fixed conditions
  Test length: 15 items
  Number of DIF items: 5 items (33.3%)
  Sample size ratio: 1:1
  Class proportion: .5
  Overlap: 80%

APPENDIX A
MPLUS CODE FOR ESTIMATING 2-CLASS FMM

TITLE: Factor mixture model for a two-class solution.
DATA: FILE IS allnames.txt;
      TYPE = MONTECARLO;
VARIABLE: NAMES = u1-u15 class group;
      USEV ARE u1-u15;
      CATEGORICAL = u1-u15;
      CLASSES = c (2);
ANALYSIS: TYPE = MIXTURE;
      ALGORITHM = INTEGRATION;
      INTEGRATION = STANDARD (20);
      STARTS = 600 20;
      PROCESSORS = 2;
MODEL: %OVERALL%
      f BY u1-u15;
      %c#1%
      [u2$1-u15$1];
      f;
      %c#2%
      [u2$1-u15$1];
      f;
OUTPUT: TECH8 TECH9; STANDARDIZED;
SAVEDATA: RESULTS ARE results.txt;

CHAPTER 1
INTRODUCTION

Given the ubiquity of testing in the United States, coupled with the serious consequences of high-stakes decisions associated with these assessments, it is critical that conclusions drawn about differences among examinee groups be accurate and that the validity of interpretations is not compromised. One way of eliminating the threat of invalid interpretations is to ensure that tests are fair and that items do not disadvantage subgroups of examinees. In addressing the issue of fairness in testing, the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1999) outline four widely used interpretations of fairness. The first defines a fair test as one that is free from bias that systematically favors or systematically disadvantages one identifiable subgroup of examinees over another. In the second definition, fairness refers to the belief that equal treatment should be afforded to all examinees during the testing process. The third definition has sparked some controversy among testing professionals: it defines fairness as equality of outcomes, as characterized by comparable overall passing rates across examinee subgroups. However, it is widely agreed that while group differences in ability distributions should not be ignored, their presence is not an indication of test bias. Therefore, a more acceptable definition specifies that in a fair test, examinees possessing equal levels of the underlying trait being measured should have comparable testing outcomes, regardless of group membership.
The fourth and final definition of fairness requires that all examinees be afforded an equal, adequate opportunity to learn the tested material (AERA, APA, & NCME, 1999). Clearly, the concept of fairness is a complex, multifaceted construct, and it is therefore highly unlikely that consensus will be reached on all

discipline, SEM-based methods such as the multiple-groups approach (Lee, 2009; Sorbom, 1974) and the MIMIC model (Gallo, Anthony, & Muthen, 1994; Muthen, 1985, 1988) have also been used to detect DIF. The fact that these methods involve some level of conditioning on the latent construct of interest differentiates them from the earlier CTT ANOVA and p-value approaches. Furthermore, DIF assessment has now emerged as an essential element in the investigation of test validity and fairness, and for testing companies such as Educational Testing Service (ETS), DIF analysis is a critical part of the test validation process. Cole's (1993) description of DIF as "a technical tool in ETS to help us assure ourselves that tests are as fair as we can make them" underscores the importance of this analytical approach as a standard practice in the design and administration of fair tests.

Types of Differential Item Functioning

There are two primary types of DIF: uniform and non-uniform. Uniform DIF occurs when the probability of correctly responding to an item, or endorsing a response category, is consistently higher or lower for either the reference or the focal group across all levels of the ability scale. As shown in Figure 2-1, when uniform DIF is present in an item, the two ICCs do not cross at any point along the range of ability. In this example, the probability of responding correctly to this dichotomous item is uniformly lower for Group 2 than for Group 1 along the entire range of latent ability.
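The uniform-DIF pattern just described can be made concrete with the 2PL response function used to generate data in this study. The parameter values below are purely illustrative; uniform DIF is induced as a constant shift in the difficulty parameter b for the focal group, which lowers that group's success probability at every trait level so the two ICCs never cross.

```python
import math

def p_correct(theta, a, b):
    """2PL item response function: P(correct response | latent trait theta)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

a = 1.0                # illustrative discrimination
b_ref = 0.0            # illustrative difficulty for the reference group
b_focal = b_ref + 1.0  # uniform DIF: the item is harder for the focal group

# The focal group's success probability is lower at every trait level,
# so the ICCs do not cross -- the signature of uniform DIF.
for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    assert p_correct(theta, a, b_focal) < p_correct(theta, a, b_ref)
```

Non-uniform DIF would instead alter the discrimination a across groups, making the ICCs cross somewhere along the trait range.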
Conversely, when there is an interaction between the latent ability trait and group membership and the ICCs cross at some point along the ability range, the item is said to display non-uniform DIF. This means that for one portion of the scale, one group is more likely to respond correctly or endorse a response category; this "advantage" is reversed over the remaining portion of the scale. An example of this is shown

Tofighi, D., & Enders, C. K. (2007). Identifying the correct number of classes in a growth mixture model. In G. R. Hancock (Ed.), Mixture models in latent variable research (pp. 317-341). Greenwich, CT: Information Age.

Uttaro, T., & Millsap, R. E. (1994). Factors influencing the Mantel-Haenszel procedure in the detection of differential item functioning. Applied Psychological Measurement, 18, 15-25.

Wainer, H. (1993). Model-based standardized measurement of an item's differential impact. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 123-135). Hillsdale, NJ: Lawrence Erlbaum.

Wainer, H., Sireci, S. G., & Thissen, D. (1991). Differential testlet functioning: Definitions and detection. Journal of Educational Measurement, 28, 197-219.

Wang, W.-C., Shih, C.-L., & Yang, C.-C. (2009). The MIMIC method with scale purification for detecting differential item functioning. Educational and Psychological Measurement, 69, 713-732.

Wanichtanom, R. (2001). Methods of detecting differential item functioning: A comparison of item response theory and confirmatory factor analysis. Unpublished doctoral dissertation, Old Dominion University.

Wasserman, L. (2000). Bayesian model selection and model averaging. Journal of Mathematical Psychology, 44, 92-107.

Webb, M.-Y., Cohen, A. S., & Schwanenflugel, P. J. (2008). A mixture model analysis of differential item functioning on the Peabody Picture Vocabulary Test-III. Educational and Psychological Measurement, 68, 335-351.

Woods, C. M. (2009).
Evaluation of MIMIC-model methods for DIF testing with comparison to two-group analysis. Multivariate Behavioral Research, 44, 1-27.

Yang, C. C. (1998). Finite mixture model selection with psychometric applications. Unpublished doctoral dissertation, University of California, Los Angeles.

Yang, C. C. (2006). Evaluating latent class analysis models in qualitative phenotype identification. Computational Statistics and Data Analysis, 50, 1090-1104.

Yoon, M. (2007). Statistical power in testing factorial invariance with ordinal measures. Unpublished doctoral dissertation, Arizona State University.

Zumbo, B. D. (1999). A handbook on the theory and methods of differential item functioning (DIF): Logistic regression modeling as a unitary framework for binary and Likert-type (ordinal) item scores. Ottawa, ON: Directorate of Human Resources Research and Evaluation, Department of National Defense.

Table 3-1. Generating population parameter values for reference group

Item    a        b
1       1.0950   -0.0672
2       0.5001   -1.0000
3       0.5001   -0.5000
4       1.0000    0.0000
5       2.0000   -1.0000
6       2.0000    0.0000
7       0.5584   -0.7024
8       0.9819    0.6450
9       0.5724   -0.5478
10      1.4023   -0.3206
11      0.4035   -1.1824
12      1.0219   -0.4656
13      0.9989   -0.2489
14      0.7342   -0.4323
15      0.8673    0.7020

Note. Item 1 is the referent; therefore its loadings were fixed at 1 and its thresholds constrained equal across classes. Uniform DIF against the focal group was simulated on Items 2 to 6.

Number of DIF items

In previous simulation studies, the percentage of DIF items has typically varied from 0% to a maximum of 50% (Bilir, 2009; Cho, 2007; Samuelsen, 2005; Wang et al., 2009). For example, Samuelsen (2005) considered cases with 10%, 30%, and 50% DIF items, Cho (2007) investigated cases with 10% and 30% DIF items, and Wang et al. (2009) manipulated the number of DIF items in increments of 10% from 0% to 40%. With respect to real tests, Shih and Wang (2009) reported that they typically contain at least 20% DIF items.
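To make the generating model concrete, the following sketch (my illustration, not the dissertation's code; the seed, group sizes, and the restriction to the first three Table 3-1 items are hypothetical) draws dichotomous 2PL responses, adding a uniform shift to the focal group's difficulty on the DIF items.

```python
import math
import random

random.seed(12345)

# (a, b) pairs for the first three items of Table 3-1 (reference group).
items = [(1.0950, -0.0672), (0.5001, -1.0000), (0.5001, -0.5000)]
DIF_SHIFT = 1.0       # uniform-DIF shift added to the focal group's b
dif_items = {1, 2}    # illustrative: treat items 2 and 3 (0-indexed) as DIF items

def simulate(n, focal, trait_mean=0.0):
    """Draw n dichotomous response vectors from the 2PL generating model."""
    data = []
    for _ in range(n):
        theta = random.gauss(trait_mean, 1.0)
        row = []
        for j, (a, b) in enumerate(items):
            if focal and j in dif_items:
                b += DIF_SHIFT            # item is harder for the focal group
            p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
            row.append(1 if random.random() < p else 0)
        data.append(row)
    return data

ref = simulate(1000, focal=False)
foc = simulate(1000, focal=True)

# With equal latent trait distributions, the focal group should
# endorse a DIF item less often than the reference group.
ref_rate = sum(r[1] for r in ref) / len(ref)
foc_rate = sum(r[1] for r in foc) / len(foc)
assert ref_rate > foc_rate
```

Setting `trait_mean` below zero for the focal group would add impact (a true latent mean difference) on top of DIF, which is the third manipulated condition in the study.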
In this study, the percentage of DIF items was 33.3% (five items), with the DIF items all favoring the reference group. Items 2 through 6 were selected to display uniform DIF.

Sample size ratio

With respect to the ratio of focal to reference groups, Atar (2007) reports that in actual testing situations, "the sample size for the reference group may be as small as the sample size for the focal group or the sample size for the reference group may be larger than the one for the focal group" (p. 29). In this study, a 1:1 sample size ratio of focal to reference group was considered for each of the two sample sizes. Using comparison groups of equal size is representative of an evenly split manifest variable frequently used in DIF studies, such as gender (Samuelsen, 2005).

Percentage of overlap between manifest and latent classes

In the manifest DIF approach, when an item is identified as having DIF, there is an implied assumption that all members of the focal group must have been disadvantaged by this item. Under a latent conceptualization, by contrast, DIF is detected based on the degree of overlap between the manifest groups and the latent classes. In this context, overlap refers to the percentage of membership homogeneity between the populations (Liu, 2008; Nylund et al., 2006).

Therefore, rather than viewing any single measure as being superior, each should be seen as a contributory piece of evidence in the comparison of one model versus another. However, while the model fit indices provide the statistical perspective, this should be augmented with the complementary approach of incorporating a substantive theoretical justification to aid in both the selection of the optimal number of classes and the interpretability of those classes (Bauer & Curran, 2004).
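The notion of overlap above can be made concrete. The helper below (my illustration, not code from the study; the toy labels are hypothetical and assume the latent class labels have already been aligned to the manifest groups, since mixture class labels are arbitrary) computes the percentage of cases whose latent-class assignment matches their manifest group.

```python
def percent_overlap(manifest, latent):
    """Percentage of cases whose latent-class label matches their
    manifest-group label (labels assumed aligned beforehand)."""
    matches = sum(1 for m, c in zip(manifest, latent) if m == c)
    return 100.0 * matches / len(manifest)

# Toy example: 10 examinees, two manifest groups, two latent classes.
manifest = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
latent   = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1]
assert percent_overlap(manifest, latent) == 80.0  # 8 of 10 match
```

An overlap of 100% would mean the latent classes simply recover the manifest groups; lower values indicate that the DIF-relevant partition cuts across the observed grouping variable.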
Mixture Model Estimation Challenges

While factor mixture modeling is an attractive tool for simultaneously investigating population heterogeneity and latent class dimensionality, it is not without its challenges. The merging of the two types of latent variables into one integrated framework results in a model that is computationally intensive to estimate. As a result, factor mixture models require lengthy computation times, which in turn reduce the number of replications that can be simulated within a realistic time frame.

In addition to the increased computation times, the models are susceptible to problems with multiple local maxima. Ideally, in ML estimation, as the iterative procedure progresses the log-likelihood should monotonically increase until it reaches a final maximum. However, with mixture models the solution often converges on a local rather than a global maximum, thereby producing biased parameter estimates. Whether the expectation-maximization (EM) algorithm converges to a local or a global maximum largely depends on the set of starting values used. Therefore, one approach to mitigating this problem is to incorporate multiple random starts, a practice that is supported in Mplus (where the defaults are 10 random starting sets, with the best 2 sets retained for final optimization).

ACKNOWLEDGMENTS

First and foremost, I am most grateful to Almighty God through whom all things are possible. I am also forever indebted to the Office of Graduate Studies and the Alumni Fellowship Committee for the financial support that has made these four years of study possible. The completion of this dissertation would also not have been possible without the support and guidance of my dissertation committee: Dr. David Miller, Dr. Walter Leite, Dr. James Algina, and Dr. Craig Wood. I would like to thank my committee chair and academic adviser, Dr.
Miller, for his expert guidance and insightful counsel during this PhD program. I am also grateful to Dr. Leite for his patient mentorship and his uncanny ability to reduce my seemingly insurmountable programming mountains to mere molehills. I must also acknowledge Dr. Algina, the epitome of teaching excellence; I am indeed privileged to have been your student. I am also thankful to Dr. Wood for agreeing to be a part of my dissertation committee in my time of need and for your quiet vote of confidence. Finally, I must express my immense gratitude and deepest appreciation to my friends, in the US and at home in Barbados. This could not have been possible without your unwavering support and unending encouragement. I am grateful for the Skype chats, emails, texts, and all the other modes of communication you used to continually encourage and support me and keep me abreast of all the happenings back home. This journey has been made easier because of your friendship, support, and prayers.

Reliance on the fit indices alone may cause the incorrect selection of more or fewer classes than actually exist in the population. Therefore, it is critical that the practitioner has a strong theoretical justification to support the assumption of population heterogeneity. This should decrease the ambiguity in the selection of the best-fitting model for the data and in the interpretation of the nature of the latent classes. When the data and the theory support the existence of these latent classes, the technique can be used successfully to detect qualitatively different subpopulations with differential patterns of response that may otherwise have been overlooked using a traditional DIF procedure. In the context of education research, the application of mixture models can provide valuable diagnostic information that can be used to gain insight into students' cognitive strengths and weaknesses.
This study was designed as a means of bridging the gap between the manifest and latent approaches by examining the performance of the factor mixture approach in detecting DIF in items generated via a traditional framework. And even though the manifest approach will remain a staple in the DIF literature, it is expected that interest in factor mixture models for DIF will continue to grow. Therefore, further exploring how these two approaches differ, not only as concepts but also in results and application, will help ensure that each is appropriately used in practice.

Table 4-7. Type I error rates for DIF = 1.0

Sample size    Impact = 0    Impact = 0.5    Impact = 1.0
500            0.123         0.126           0.159
1000           0.131         0.129           0.138

Table 4-8. Type I error rates for DIF = 1.5

Sample size    Impact = 0    Impact = 0.5    Impact = 1.0
500            0.097         0.112           0.116
1000           0.092         0.092           0.100

Table 4-9. Type I error rates for sample size of 500

DIF    Impact = 0    Impact = 0.5    Impact = 1.0
1.0    0.123         0.126           0.159
1.5    0.097         0.112           0.116

Table 4-10. Type I error rates for sample size of 1000

DIF    Impact = 0    Impact = 0.5    Impact = 1.0
1.0    0.131         0.129           0.138
1.5    0.092         0.092           0.100

A related question for future work is how short the scale can be for the test to perform adequately. The focus of this study was on the detection of uniform DIF. However, in future research, the type-of-DIF factor can be extended to include both uniform and non-uniform DIF. To test for the presence of non-uniform DIF, the factor mixture model as implemented in this study must be reformulated so that, in addition to the item thresholds, factor loadings are allowed to vary across classes as well. The Type I error rates and power of the factor mixture model to detect non-uniform DIF can then be evaluated and compared with the corresponding results for uniform DIF. Additionally, the item discrimination parameter was not included as a design factor in this study; instead, its effect was examined on its own as a single condition.
Therefore, in future research, the effect of including this study condition may be investigated. In generating the data, the mixture proportion for the two classes was simulated to be .50; however, after the model estimation phase, the ability of the factor mixture approach to accurately recover the class proportions was not evaluated. This omission should also be addressed in future research. Finally, to the author's knowledge, the strategy used in testing the items for non-invariance has only recently been introduced to the factor mixture literature and to date has been implemented in two studies. Its advantage is that it provides a simpler, more direct alternative to DIF detection than the CFA baseline approaches, which require the estimation and comparison of two models. However, it has not yet been subjected to the methodological rigor of more established methods. Therefore, a potential extension to this study would be a comparison of the performance of the significance testing of the

Linn, R. L. (1993). The use of differential item functioning statistics: A discussion of current practice and future implications. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 349-364). Hillsdale, NJ: Lawrence Erlbaum.

Linn, R. L., & Harnisch, D. L. (1981). Interactions between item content and group membership on achievement test items. Journal of Educational Measurement, 18, 109-118.

Liu, C. Q. (2008). Identification of latent groups in growth mixture modeling: A Monte Carlo study. Unpublished doctoral dissertation, University of Virginia.

Lo, Y., Mendell, N., & Rubin, D. (2001). Testing the number of components in a normal mixture. Biometrika, 88, 767-778.

Lord, F. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.

Lubke, G. H., & Muthen, B. (2005). Investigating population heterogeneity with factor mixture models. Psychological Methods, 10, 21-39.

Lubke, G. H., & Muthen, B. (2007).
Performance of factor mixture models as a function of model size, covariate effects, and class-specific parameters. Structural Equation Modeling: A Multidisciplinary Journal, 14, 26-47.

Lubke, G. H., & Neale, M. C. (2006). Distinguishing between latent classes and continuous factors: Resolution by maximum likelihood? Multivariate Behavioral Research, 41, 499-532.

Macintosh, R., & Hashim, S. (2003). Variance estimation for converting MIMIC model parameters to IRT parameters in DIF analysis. Applied Psychological Measurement, 372-379.

Mantel, N., & Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute, 22, 719-748.

McDonald, R. P. (1967). Nonlinear factor analysis. Psychometric Monographs, 15, 1-167.

McKnight, C. C., Crosswhite, F. J., Dossey, J. A., Kifer, E., Swafford, J. O., Travers, K. J., & Cooney, T. J. (1987). The underachieving curriculum: Assessing U.S. school mathematics from an international perspective. Champaign, IL: Stipes.

McLachlan, G., & Peel, D. (2000). Finite mixture models. New York: Wiley.

Table 4-15. Overall power rates across study conditions

DIF    Sample size    Impact    Power
1.0    500            0         0.268
1.0    500            0.5       0.267
1.0    500            1.0       0.264
1.0    1000           0         0.350
1.0    1000           0.5       0.324
1.0    1000           1.0       0.291
1.5    500            0         0.525
1.5    500            0.5       0.498
1.5    500            1.0       0.425
1.5    1000           0         0.801
1.5    1000           0.5       0.731
1.5    1000           1.0       0.623

Table 4-16. Power rates for DIF of 1.0

Sample size    Impact    Power
500            0         0.268
500            0.5       0.267
500            1.0       0.264
1000           0         0.350
1000           0.5       0.324
1000           1.0       0.291

Table 4-17. Power rates for DIF of 1.5

Sample size    Impact    Power
500            0         0.525
500            0.5       0.498
500            1.0       0.425
1000           0         0.801
1000           0.5       0.731
1000           1.0       0.623

Statistical Power Study

The study also evaluated the power of the factor mixture approach to detect uniform DIF. In spite of the failure of the factor mixture analysis to adequately protect the Type I error rates across the study conditions, the power results were still reviewed to get some sense of the pattern of DIF detection.
Overall, these findings represent a mix of the predictable and the unexpected. What was expected was that the power of the factor mixture method of DIF detection would increase as sample size and magnitude of DIF increased. In addition, it was not a surprising outcome that the magnitude of the discrimination parameter also influenced DIF detection rates: power was highest when detecting DIF in the more highly discriminating items, followed by studied items with medium and low discrimination parameters. These results are not only intuitively appealing but have been consistently supported by prior research conducted with different methods of DIF detection (Donoghue et al., 1993; Narayanan & Swaminathan, 1994; Rogers & Swaminathan, 1993; Stark et al., 2006). On the other hand, the surprising result was that even in the presence of large latent trait mean differences of 1.0 SD, the rates of DIF detection were not adversely affected by impact. While this finding was consistent with some studies (Gonzalez-Roma et al., 2006; Narayanan & Swaminathan, 1994; Rogers & Swaminathan, 1993; Shealy & Stout, 1993; Stark et al., 2006), others have reported contradictory results, with reductions in power as the disparity in latent means increased (Ankemann et al., 1999; Clauser, Mazor, & Hambleton, 1993; Finch & French, 2007; Narayanan & Swaminathan, 1996; Tian, 1999; Zwick, Donoghue, & Grima, 1993). However, it is important to note that these prior empirical studies all utilized standard DIF analyses rather than the mixture approach used in this simulation.

* The arrows from the latent factor to the item responses (from η to the Ys) represent the factor loadings, the λ parameters measuring the relationship between the latent factor and the items.
* The arrows from the latent class variable to the item responses (from Ck to the Ys) are the class-specific item thresholds conditional on each of the K latent classes.
* The broken-line arrows from the latent class variable to the factor-loading arrows indicate that these loadings are also class-specific and therefore can vary across the K latent classes.
* The arrow from the latent class to the latent factor (i.e., from Ck to η) allows for factor means and/or factor variances to be class-specific as well.

Since the model parameters are allowed to be class-specific, that is, both item thresholds and factor loadings can be specified as non-invariant, this specification allows for testing of both uniform and non-uniform DIF. The focus of this dissertation is on assessing the performance of the factor mixture model in the detection of uniform DIF. Therefore, for this specification, while the item thresholds are allowed to vary across the levels of the latent class variable, the factor loadings are constrained equal across the K classes. In Mplus, the implementation of the factor mixture model for DIF detection can be conceptualized as a multiple-group approach where DIF is tested across latent classes rather than manifest groups. Therefore, using Equations 1 and 2, the CFA mixture model can be reformulated as:

Y_k = ν_k + Λ_k η_k + ε_k  and  η_k = α_k + ζ_k    (15)

where the parameters are as previously defined for each of the k = 1, 2, ..., K latent classes. Once again, it is assumed that the measurement errors have an expected value of zero and are independent of the latent trait(s) and of each other. Similarly, the model-implied mean and covariance structure for the observed variables in each of the k = 1, 2, ..., K latent classes can be defined as:

μ_k = ν_k + Λ_k α_k  and  Σ_k = Λ_k Ψ_k Λ_k′ + Θ_k

Table 4-11. Type I error rates for impact of 0 SD

DIF    Sample size    Error rate
1.0    500            0.123
1.0    1000           0.131
1.5    500            0.097
1.5    1000           0.092

Table 4-12. Type I error rates for impact of 0.5 SD

DIF    Sample size    Error rate
1.0    500            0.126
1.0    1000           0.129
1.5    500            0.112
1.5    1000           0.092

Table 4-13. Type I error rates for impact of 1.0 SD

DIF    Sample size    Error rate
1.0    500            0.159
1.0    1000           0.138
1.5    500            0.116
1.5    1000           0.100

Table 4-14. Variance components analysis for Type I error

Condition              r²
DIF magnitude (D)      .007
Sample size (S)        .000
Impact (I)             .001
D*S                    .000
D*I                    .000
S*I                    .000
D*S*I                  .000

Goldberger, 1975; Macintosh & Hashim, 2003; Muthen, 1985, 1988; Muthen, Kao, & Burstein, 1991) have been developed. Typically, these methods have all focused on a manifest approach to detecting DIF. In other words, they use pre-existing group characteristics such as gender (e.g., males vs. females) or ethnicity (e.g., Caucasians vs. African Americans) to investigate the occurrence of DIF in a studied item. And while this approach to DIF testing has been widely practiced and accepted as the standard, some believe that this emphasis on statistical DIF analysis has been less successful in determining substantive causes or sources of DIF. For example, in traditional DIF analyses, after an item has been flagged as exhibiting DIF, content experts may subsequently conduct a substantive review to determine the source of the DIF (Furlow, Ross, & Gagne, 2009). However, while this is acknowledged as an important step, some view its success at truly understanding the explanatory sources of DIF as minimal, at best (Engelhard, Hansche, & Rutledge, 1990; Gierl et al., 2001; O'Neill & McPeek, 1993; Roussos & Stout, 1996). More specifically, inconsistencies in agreement between reviewers' judgments, or between reviewers and DIF statistics, make it difficult to form any definitive conclusions explaining the occurrence of DIF (Engelhard et al., 1990). Others suggest that the inability of traditional DIF methods to unearth the causes of DIF is because the pre-selected grouping variables on which these analyses are based are often not the real dimensions causing DIF.
Rather, they view these a priori variables as mere proxies for underlying attributes of educational disadvantage that, if identified, could better explain the pattern of differential responses among examinees (Cohen & Bolt, 2005; De Ayala, Kim, Stapleton, & Dayton, 2002; Dorans & Holland, 1993; Samuelsen, 2005, 2008; Webb, Cohen, & Schwanenflugel, 2008).

APPENDIX B
MPLUS CODE FOR DIF DETECTION

TITLE: Factor mixture model for a two-class solution. Items = 15, DIF = 1.0
DATA: FILE IS allnames.txt;
  TYPE = MONTECARLO;
VARIABLE: NAMES = u1-u15 class group;
  USEV ARE u1-u15;
  CATEGORICAL = u1-u15;
  CLASSES = c (2);
ANALYSIS: TYPE = MIXTURE;
  ALGORITHM = INTEGRATION;
  INTEGRATION = STANDARD (20);
  STARTS = 0;
  PROCESSORS = 2;
MODEL: %OVERALL%
  f BY u1@1
    u2*0.500
    ...
    u15*0.867;
  %c#1%
  [u1$1] (p1_1);          ! assigns names to thresholds for constraint purposes
  [u2$1*-0.500] (p1_2);
  ...
  [u15$1*0.609] (p1_15);
  f;
  %c#2%
  [u1$1] (p1_1);          ! threshold of Item 1 constrained equal across classes
  [u2$1*0.000] (p2_2);    ! remaining 14 item thresholds freely estimated
  ...
  [u15$1*0.609] (p2_15);
  f;
MODEL CONSTRAINT:
  NEW(difi2 difi3 difi4 difi5 difi6 difi7 difi8 difi9 difi10
      difi11 difi12 difi13 difi14 difi15);   ! declares threshold-difference parameters
  difi2 = p2_2 - p1_2;                       ! estimates threshold differences across classes
  ...
  difi15 = p2_15 - p1_15;

Dorans, N. J., & Kulick, E. (1986). Demonstrating the utility of the standardization approach to assessing unexpected differential item performance on the Scholastic Aptitude Test. Journal of Educational Measurement, 23, 355-368.

Duncan, S. C. (2006). Improving the prediction of differential item functioning: A comparison of the use of an effect size for logistic regression DIF and Mantel-Haenszel DIF methods. Unpublished doctoral dissertation, Texas A&M University.

Educational Testing Service (2008). What's the DIF? Helping to ensure test question fairness. Retrieved December 8, 2009, from: http://www.ets.org/portal/site/ets/

Engelhard, G., Hansche, L., & Rutledge, K. E.
(1990). Accuracy of bias review judges in identifying differential item functioning on teacher certification tests. Applied Measurement in Education, 3, 347-360.

Finch, H. (2005). The MIMIC model as a method for detecting DIF: Comparison with Mantel-Haenszel, SIBTEST, and the IRT likelihood ratio. Applied Psychological Measurement, 29, 278-295.

Finch, W. H., & French, B. F. (2007). Detection of crossing differential item functioning. Educational and Psychological Measurement, 67, 565-582.

Fukuhara, H. (2009). A differential item functioning model for testlet-based items using a bi-factor multidimensional item response theory model: A Bayesian approach. Unpublished doctoral dissertation, Florida State University.

Furlow, C. F., Raiford Ross, T., & Gagne, P. (2009). The impact of multidimensionality on the detection of differential bundle functioning using the simultaneous item bias test. Applied Psychological Measurement, 33, 441-464.

Gagne, P. (2004). Generalized confirmatory factor mixture models: A tool for assessing factorial invariance across unspecified populations. Unpublished doctoral dissertation, University of Maryland.

Gagne, P. (2006). Mean and covariance structure models. In G. R. Hancock & F. R. Lawrence (Eds.), Structural equation modeling: A second course (pp. 197-224). Greenwich, CT: Information Age Publishing, Inc.

Gallo, J. J., Anthony, J. C., & Muthen, B. O. (1994). Age differences in the symptoms of depression: A latent trait analysis. Journal of Gerontology: Psychological Sciences, 49, P251-P264.

Gelin, M. N. (2005). Type I error rates of the DIF MIMIC approach using Joreskog's covariance matrix with ML and WLS estimation. Unpublished doctoral dissertation, The University of British Columbia.

Therefore, when running these models with factor-analytic programs such as Mplus, the user should be aware of the default scale-setting methods, since these will invariably affect the parameter conversion formulae.
Mixture Modeling as an Alternative Approach to DIF Detection

Traditional DIF detection methods assume a manifest approach in which examinees are compared based on demographic classifications such as gender (females vs. males) or race (African Americans vs. Caucasians). While this approach has a long history and has been used successfully to assess DIF, emerging research suggests that this perspective may be limiting in scope (Cohen & Bolt, 2005; De Ayala et al., 2002; Mislevy et al., 2008; Samuelsen, 2005, 2008). As an alternative to the traditional manifest DIF approach, a latent DIF conceptualization, rooted in latent class and mixture modeling methods, has been proposed (Samuelsen, 2005, 2008). Rather than focusing on a priori examinee characteristics, this conceptualization assumes that the underlying population consists of a mixture of heterogeneous, unidentified subpopulations, known as latent classes. These latent classes exist because of qualitative differences (e.g., different learning strategies, different cognitive styles) among examinees.

One technique that can be used to examine DIF from a latent perspective is factor mixture modeling (FMM). Factor mixture models merge the factor analysis model and the latent class model, resulting in a hybrid model consisting of two types of latent variables: continuous latent factors and categorical unobserved classes (Lubke & Muthen, 2005, 2007; Muthen, Asparouhov, & Rebollo, 2006). The simultaneous inclusion of these two types of latent variables allows both for the exploration of unobserved heterogeneity that may exist in the population and for the examination of the dimensionality underlying each latent class.

Overall, the average Type I error rate was 11.8%, which, even after accounting for random sampling error, would still be considered unacceptably high. Across the individual conditions, the error rates ranged from .09 to .16.
Not surprisingly, the factor mixture method exhibited its strongest control of the rate of incorrect identifications under conditions of large DIF magnitude (DIF = 1.5), large sample size (N = 1000), and either no impact or a moderate (0.5 SD) amount of impact. An initial examination of the pattern of results suggested that while sample size and DIF magnitude were inversely related to the Type I error rate, an increase in the mean latent trait differences resulted in slightly higher Type I error rates. For example, for the cells with DIF magnitude of 1.0, sample size of N = 500, and no impact, the Type I error rate was 0.12; however, when the latent trait means differed by 1.0 SD, the rate of false identifications increased marginally to 0.16. A more detailed discussion of the effect of each of the three conditions is presented in the following sections.

Magnitude of DIF

Tables 4-7 and 4-8 display the aggregated results for the effect of the two levels of DIF magnitude (1.0 and 1.5) on Type I error rates. Overall, the rates of false identifications showed a slight decrease as the magnitude of DIF was increased. For example, when DIF of 1.0 was simulated, error rates across the conditions were between .10 and .16, with an average rate of .13. However, for the larger DIF of 1.5, the rates ranged from .09 to .12, averaging .10. Regardless of the size of DIF, the inflated rates were most pronounced for the smaller sample size of N = 500 and when the difference in latent trait means was at its maximum (1.0 SD).

Detection rates were .495 and .502 when DIF was simulated in items with medium and high values on the a-parameter, respectively. Moreover, while there was generally a clear difference in the accuracy of DIF detection between the low-discriminating items (a = .5) and either the medium or highly discriminating items, no discernible differences were evident when comparing DIF detection rates between items with medium (a = 1.0) and high (a = 1.5) values on the a-parameter.
The patterns of DIF detection discussed earlier remained consistent across the simulation conditions regardless of the items' discriminating ability.

Variance components analysis

Finally, a variance components analysis was conducted to determine the influence of the simulation conditions and their interactions on the power rates. In this analysis, the power rates across the five DIF items served as the dependent variable, while the simulated conditions served as the independent variables. As expected, the results showed that, of the main effects, DIF magnitude was the largest contributor, accounting for 19% of the variance in the power rates. Next was sample size, with approximately 5%, followed by the interaction between these two factors, with 1.2%. Each of the other terms contributed less than 1.0% of the variance in the power rates. These results are shown in Table 4-24.

Shealy, R., & Stout, W. F. (1993a). A model-based standardization approach that separates true bias/DIF from group differences and detects test bias/DIF as well as item bias/DIF. Psychometrika, 58, 159-194.

Shealy, R., & Stout, W. F. (1993b). An item response theory model for test bias and differential test functioning. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 197-239). Hillsdale, NJ: Erlbaum.

Shih, C-L., & Wang, W-C. (2009). Differential item functioning detection using the multiple indicators, multiple causes method with a pure short anchor. Applied Psychological Measurement, 33, 184-199.

Sorbom, D. (1974). A general method for studying differences in factor means and factor structures between groups. British Journal of Mathematical and Statistical Psychology, 27, 229-239.

Standards for educational and psychological testing. (1999). Washington, DC: American Educational Research Association, American Psychological Association, National Council on Measurement in Education.

Stark, S., Chernyshenko, O. S., & Drasgow, F. (2006).
Detecting differential item functioning with confirmatory factor analysis and item response theory: Toward a unified strategy. Journal of Applied Psychology, 91, 1291-1306.

Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27, 361-370.

Takane, Y., & de Leeuw, J. (1987). On the relationship between item response theory and factor analysis of discretized variables. Psychometrika, 52, 393-408.

Thissen, D. (1991). MULTILOG user's guide: Multiple, categorical item analysis and test scoring using item response theory. Chicago: Scientific Software, Inc.

Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential item functioning using the parameters of item response models. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 67-113). Hillsdale, NJ: Lawrence Erlbaum.

Thurstone, L. L. (1925). A method of scaling educational and psychological tests. Journal of Educational Psychology, 16, 263-278.

Thurstone, L. L. (1947). Multiple factor analysis. Chicago: University of Chicago Press.

Tian, F. (1999). Detecting differential item functioning in polytomous items. Unpublished doctoral dissertation, University of Ottawa.

One thousand replications were generated for each of the 12 simulation conditions examined in the Type I error rate and power studies. The results for Type I error rate and statistical power are addressed in the sections below.

Nonconvergent Solutions

In this part of the study, population parameters replaced the starting values randomly generated by Mplus. This change substantially reduced the computational load and decreased the model estimation time. The convergence rates across each of the conditions are presented in Table 4-5. Overall, the results indicate no convergence problems, with rates ranging between 99.4% and 100%.
Type I Error Rate

The factor mixture model was evaluated in terms of its ability to control the Type I error rate under a variety of simulated conditions. Of the 15 items, nine were simulated to be DIF-free. The Type I error rate was assessed by computing the proportion of times these nine DIF-free items were incorrectly identified as having DIF. An item was considered to display DIF if the difference in its thresholds across classes was significantly different from zero; therefore, for the nine non-DIF items, the Type I error rate was computed as the proportion of times that the items obtained p-values less than .05. The Type I error rates across the 12 simulation conditions are presented in Table 4-6; the values in the table represent the proportion of times that the method incorrectly flagged a non-DIF item as displaying DIF. The results in Table 4-6 indicate that the factor mixture analysis method did not perform as well as expected in controlling the Type I error rate: elevated Type I error rates were observed across all the study conditions, which means that the approach consistently produced false identifications at a rate exceeding the nominal alpha level of .05.

Overall, the ambiguity of these findings serves to reinforce the point made earlier, namely, that the IC results should never be relied upon as the sole determinant of the number of classes. Several researchers have stressed the importance of incorporating substantive theory in guiding the model selection decision (Allua, 2007; Bauer & Curran, 2004; Kim, 2009; Nylund et al., 2007; Reynolds, 2008). Moreover, Reynolds (2008) contends that the researcher often has some belief about the underlying subpopulations, and therefore this should be taken into account in determining which of the models best fits the data.

Type I Error and Statistical Power Performance

In this phase of the study, the performance of the factor mixture model was evaluated in terms of its Type I error rate and power of DIF detection.
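The error-rate computation described above reduces to a proportion of sub-alpha p-values across replications. A minimal sketch (mine; the p-values are invented for illustration):

```python
def type1_error_rate(pvalues, alpha=0.05):
    """Proportion of replications in which a DIF-free item's
    threshold-difference test was (wrongly) significant."""
    return sum(1 for p in pvalues if p < alpha) / len(pvalues)

# Hypothetical p-values for one DIF-free item across 8 replications.
pvals = [0.40, 0.03, 0.62, 0.18, 0.74, 0.01, 0.55, 0.90]
assert type1_error_rate(pvals) == 0.25  # 2 of 8 replications flagged
```

In the study itself this proportion was computed over 1000 replications and averaged across the nine DIF-free items, and a well-calibrated test would keep it near the nominal .05.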
As was done in the first part of the study, data were again simulated for a 15-item test based on the 2PL IRT model. However, in this case it was assumed that the number of classes was known to be two. Five of the 15 items were simulated to contain uniform DIF in favor of the reference group. In investigating the Type I error rate and power of the test, three factors (DIF magnitude, sample size, and impact) shown previously to affect DIF detection were also manipulated and their effects noted. More specifically, two levels of DIF magnitude (1.0, 1.5) and two levels of sample size (N=500, N=1000) were simulated. For the effect of impact, three levels (0, 0.5 SD, and 1.0 SD) were chosen to reflect no, moderate, and large mean differences in the latent trait. For each of the 12 conditions, a total of 1000 replications were run. The Type I error and statistical power of the factor mixture method for DIF detection were investigated across all conditions.

However, in mixture analysis, when comparing models with differing numbers of latent classes, the traditional likelihood ratio test for nested models is no longer appropriate (Bauer & Curran, 2004; McLachlan & Peel, 2000; Muthen, 2007). Instead, alternative model selection indices are used to compare competing models with different numbers of latent classes. These include: (i) information-based criteria such as the Akaike information criterion (AIC; Akaike, 1987), the Bayesian information criterion (BIC; Schwarz, 1978), and the sample size adjusted BIC (ssaBIC; Sclove, 1987); (ii) likelihood-based tests such as the Lo-Mendell-Rubin adjusted likelihood ratio test (LMR aLRT; Lo, Mendell, & Rubin, 2001) and the bootstrapped version of the LRT (BLRT; McLachlan & Peel, 2000); and (iii) statistics based on the classification of individuals using estimated posterior probabilities, such as entropy (Lubke & Muthen, 2007).
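The three information criteria named above can all be computed from a model's maximized log-likelihood and its number of free parameters; the ssaBIC replaces N with Sclove's (1987) adjusted sample size (N+2)/24. A minimal sketch, where the log-likelihoods and parameter counts for the one- through three-class solutions are invented for illustration:

```python
import math

def information_criteria(loglik, n_params, n):
    """AIC, BIC, and sample-size-adjusted BIC for one fitted model."""
    aic = -2 * loglik + 2 * n_params
    bic = -2 * loglik + n_params * math.log(n)
    ssabic = -2 * loglik + n_params * math.log((n + 2) / 24)  # Sclove (1987)
    return {"AIC": aic, "BIC": bic, "ssaBIC": ssabic}

# Hypothetical (loglik, n_params) for 1-, 2-, and 3-class solutions, N = 500.
fits = {1: (-4810.2, 30), 2: (-4735.6, 46), 3: (-4715.0, 62)}
table = {k: information_criteria(ll, p, 500) for k, (ll, p) in fits.items()}
for index in ("AIC", "BIC", "ssaBIC"):
    best = min(table, key=lambda k: table[k][index])
    print(f"{index} prefers the {best}-class model")
```

With these invented values the AIC favors the three-class solution while the BIC and ssaBIC favor the two-class solution, illustrating how the indices can disagree for the same set of fitted models.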
While there has been limited research comparing the performances of these various model selection methods, no consistent guidelines have been established for determining which model selection indices are most useful in comparing models or selecting the best-fitting model (Lubke & Neale, 2006; Nylund et al., 2006; Tofighi & Enders, 2008; Yang, 2006). The reason for this is that there is seldom unanimous agreement across the various model selection indices, and as a result misspecification of the number of classes is a likely occurrence (Bauer & Curran, 2004; Nylund et al., 2006). Therefore, the researcher should not rely on these indices as the sole determinant of the number of latent classes. Rather, it is advised that in addition to the statistical indices, a theoretical justification should also guide not only the selection of the optimal number of classes but the interpretation of the classes as well (Bauer & Curran, 2004; Gagne, 2006).

When the default number of start values (in the initial analysis and final optimization phases) is insufficient to converge on a maximum likelihood solution, Mplus allows the user the flexibility to increase the number of start values. Adjusting the random starts option to include a larger number of start values in both the initial analysis and final optimization phases allows for a more thorough investigation of multiple solutions and should improve the likelihood of successful convergence (Muthen & Muthen, 1998-2008). However, since an increase in the number of random starts will also increase the computational load and estimation time, it is recommended that prior to conducting a full study researchers experiment with various sets of user-defined starting values to determine an appropriate number of sets of starting values (Nylund et al., 2006). During this process, it is important to examine the results from the final stage solutions to determine whether the best log-likelihood is replicated multiple times.
This ensures that the solution converged on a global maximum, thus reducing the possibility that the parameter estimates are derived from local solutions.

Purpose of Study

In the past, the limited use of SEM-based mixture models has been attributed to an unavailability of commercial software (Bandalos & Cohen, 2006). However, given the recent innovations integrated into software packages such as Mplus (Muthen & Muthen, 1998-2008), the estimation of SEM mixture models is now possible (Bandalos & Cohen, 2006). The purpose of this study was to evaluate the performance of factor mixture modeling as a method for detecting items exhibiting manifest-group DIF. In this study, manifest-group DIF was generated in a set of dichotomous data for a two-group, two-class population. The questions addressed were as follows.

* How successful is the factor mixture modeling approach at recovering the correct number of latent classes?

Research Study 1

In the first part of the study, dichotomous item responses were generated for the two-group, two-class scenario. The focus was on determining the success rate of the specified factor mixture model in recovering the correct number of latent classes. Solutions for one- through three-class mixture models were estimated, and three information-based criteria values were compared across the models. The model with the lowest IC value was selected as the best-fitting model (Lubke & Muthen, 2005; Nylund et al., 2006). The fixed and manipulated factors used in this study are listed below.

Manipulated Conditions

Sample size

Previous findings have shown that, as with pure CFA models, sample size affects the convergence rates of mixture models (Gagne, 2004; Lubke, 2006). In evaluating the performance of several CFA mixture models, Gagne (2004) reported a significant increase in convergence rates as the sample size was increased from a minimum of 200 to 500 to 1000.
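The replication check described above, namely verifying that the best final-stage log-likelihood recurs across independent random starts, can be sketched as follows; the log-likelihood values are invented for illustration:

```python
def best_loglik_replicated(final_logliks, tol=0.001, min_count=2):
    """Check whether the best final-stage log-likelihood was replicated.

    Replication of the best value across independent random starts is the
    usual evidence that the solution is a global rather than local maximum.
    """
    best = max(final_logliks)
    n_best = sum(1 for ll in final_logliks if abs(ll - best) < tol)
    return best, n_best, n_best >= min_count

# Hypothetical final-stage log-likelihoods from ten random starts.
logliks = [-4735.61, -4735.61, -4762.48, -4735.61, -4801.77,
           -4735.61, -4790.02, -4735.61, -4762.48, -4735.61]
best, n_best, ok = best_loglik_replicated(logliks)
print(best, n_best, ok)  # -4735.61 replicated 6 times -> True
```

If the best value appeared only once, the prudent response would be to rerun with more random starts rather than trust the solution.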
A review of previous simulation and real-data mixture model research found that whereas only a few studies used as few as 200 simulees (Gagne, 2004; Nylund et al., 2006), sample sizes of at least 500 were most frequently used (Bolt et al., 2001; Bilir, 2009; Cho, 2007; De Ayala et al., 2002; Rost, 1990; Samuelsen, 2005). In this study, the two levels of sample size (N=500, N=1000) were chosen to be representative of realistic research samples and to reduce the possibility of convergence problems. In addition, the sample size of 500 was used as a lower limit to examine the effects of small sample size on the performance of the factor mixture approach to DIF detection.

Table 4-1. Number of converged replications for the three factor mixture models

DIF magnitude  Sample size  Ability difference  One-class  Two-class  Three-class
1.0            500          0                   50         46         43
1.0            500          0.5                 50         46         46
1.0            1000         0                   50         45         41
1.0            1000         0.5                 50         48         43
1.5            500          0                   50         50         47
1.5            500          0.5                 50         49         46
1.5            1000         0                   50         50         49
1.5            1000         0.5                 50         50         48

Group-class overlap refers to the degree of correspondence between manifest groups and latent classes. For example, if each of the examinees in either the manifest-focal or the manifest-reference group belongs to the same latent class, this is referred to as 100% overlap. Therefore, as the level of group-class overlap decreases, there is a corresponding decrease in the level of homogeneity between groups and classes. In Samuelsen's (2005) study, five levels of overlap, decreasing in increments of 10% from 100% to 60%, were considered. Samuelsen (2005) found that as the group-class overlap increased, the power of the mixture approach to correctly detect DIF increased as well. In this study, the level of overlap was fixed at 80%, a realistic expectation of what may be encountered in practice. This means that DIF was simulated against 80% of the simulees in the focal group.

Mixing proportion

The mixing proportion (π_k) represents the proportion of the population in class k, which was fixed at .50.
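One way to operationalize the 80% group-class overlap and the .50 mixing proportion when generating data is sketched below; this is an illustrative assumption about the assignment mechanism, not the dissertation's exact generating code:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000  # simulees; the .50 mixing proportion gives two equal classes

# Manifest group: 0 = reference, 1 = focal, in a 1:1 ratio.
group = np.repeat([0, 1], n // 2)

# 80% group-class overlap: each simulee's latent class matches its manifest
# group with probability .80 and crosses over otherwise.
overlap = 0.80
stay = rng.random(n) < overlap
latent_class = np.where(stay, group, 1 - group)

print(round((latent_class == group).mean(), 2))
```

DIF item parameters would then be applied according to `latent_class` rather than `group`, so that roughly 80% of the focal group receives the DIF condition.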
Although the class membership was known, it was not used in the estimation.

Study Design Overview

In sum, three fully crossed factors resulting in eight simulation conditions (2 sample sizes x 2 DIF magnitudes x 2 latent ability distributions) were manipulated to determine their effect on the recovery of the correct number of latent classes. For each of the eight conditions, a total of 50 replications were run. It is important to note that in the original plan for this study, a larger number of replications was proposed. However, initial simulation runs revealed that the computational time necessary to complete larger numbers of replications was impractical for this dissertation. Therefore, given the timing constraints, a smaller number of data sets (i.e., 50) was replicated.

Research of this kind will contribute positively to bridging the gap between SEM-based and IRT-based methods in the area of testing and measurement.

CHAPTER 3
METHODOLOGY

The simulation was conducted in two parts. The first part of the study focused on the ability of the factor mixture model to recover the correct number of latent classes under a variety of simulated conditions. In the second phase of the study, the number of classes was assumed known and the emphasis was on evaluating the performance of the mixture model at identifying differentially functioning items. Following is a description of the model as implemented, as well as the study design used in evaluating the performance of the factor mixture modeling approach to DIF detection.

Factor Mixture Model Specification for Latent Class DIF Detection

The factor mixture model was specified in its hybrid form as having both a single factor measured by 15 dichotomous items and a categorical latent class variable. The factor mixture model was formulated in the study as:

y*_k = τ_k + Λη_k + ε_k    (22)
η_k = α_k + ζ_k            (23)

where the parameters are as previously defined in Chapter 2 and k = 1 to K indexes the latent classes.
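The fully crossed 2 x 2 x 2 design described above can be enumerated directly; a minimal sketch:

```python
from itertools import product

# Study 1's fully crossed design: 8 cells, 50 replications each.
sample_sizes = (500, 1000)
dif_magnitudes = (1.0, 1.5)
impact_levels = (0.0, 0.5)  # latent-trait mean difference, in SD units

conditions = list(product(sample_sizes, dif_magnitudes, impact_levels))
for n, dif, impact in conditions:
    print(f"N={n}, DIF magnitude={dif}, impact={impact} SD: 50 replications")
print(len(conditions))  # 8
```

Study 2 extends `impact_levels` to three values (0, 0.5, 1.0), which yields the 12 conditions used in the Type I error and power phase.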
To accommodate the testing of uniform DIF, the model was formulated so that the factor loadings were constrained to be class-invariant but the item thresholds were allowed to vary across classes. Therefore, in Equation 22 the Λ parameter is not indexed by the k subscript. Overall, the single-factor mixture model was specified as follows:

1. The factor loadings were constrained equal across the latent classes. For scaling purposes, the factor loading of the referent (i.e., item 1) was fixed at one for each of the latent classes.

Sample size

The results in Tables 4-9 and 4-10 suggest a weak inverse relationship between sample size and the ability of the factor mixture method to control Type I error rates. At the smaller sample size (N=500), the rate of false identifications ranged from .10 to .16, with an average rate of .12. Of the six cells associated with the smaller sample size, the test showed the greatest control of the Type I error when larger DIF (1.5) was simulated and the latent trait means were equal. Increasing the sample size to 1000 decreased the Type I error rates marginally. Across the six conditions, the error rates were now between .09 and .14, averaging .11, a negligible decrease from the average rate when N=500. However, the pattern of false identifications remained consistent across sample sizes: poor Type I error control was observed when smaller DIF (1.0) was simulated and there was large impact (1.0 SD); in contrast, improved control was observed for larger DIF magnitude (1.5) in the absence of impact.

Impact

Three levels of impact (0, .5 SD, and 1.0 SD) were simulated in favor of the reference group.
The aggregated Type I error rates, which are summarized in Tables 4-11 through 4-13, showed that the differences in latent trait means between groups had no appreciable effect on the rate of incorrect identifications. The Type I error rates for the no-impact, 0.5 SD, and 1.0 SD conditions increased marginally from .11 to .12 to .13, a change that can be attributed to the presence of random error. Though never below the nominal alpha value of .05, the Type I error rates were best controlled when both DIF magnitude (1.5) and sample size (N=1000) were large.

CHAPTER 4
RESULTS

Research Study 1

In this section, the results of the first part of the simulation are presented. To answer the research question, data were generated for a two-group, two-class population with five of the 15 items simulated to display uniform DIF. The following conditions were manipulated in this study: sample size (500, 1000), DIF magnitude (1.0, 1.5), and differences in latent ability means (0 SD, 0.5 SD). The factor mixture model as formulated in Equations 22 and 23 was applied to determine how successful the method was at recovering the correct number of classes. For each of the eight condition combinations, one-, two-, and three-class models were fit to the data. These results are presented in two sections. First, the rates of model convergence for each of the eight simulation conditions are reported. Second, the information criteria (IC) results, which were used for model comparison and class enumeration, are discussed. The results for Study 1 are summarized in Tables 4-1 through 4-4.

Convergence Rates

Table 4-1 presents the data on the number of convergent solutions for each combination of the eight simulation conditions. As previously mentioned in the Methods section, non-convergent cases were excluded from the analysis; therefore, for some conditions results were based on fewer than 50 replications.
The results showed that overall the convergence rates were very high (ranging from .82 to 1.0), and there were minimal convergence problems. Of the 1200 (50 x 3 x 8) replications, 1147 successfully converged, resulting in a 96% overall convergence rate. In addition, as the number of latent classes was increased, there was a corresponding, albeit small, decrease in the convergence rate.

None of the indices performed credibly when a two-class model was used as the data-generating model: in this case, all of the fit indices tended to underestimate the number of latent classes by continuing to favor the one-class model over the "correct" two-class model. This inconsistency between model fit measures has also been evidenced in applied studies. Using an illustrative example, Lubke and Muthen (2005) applied factor mixture modeling to continuous observed outcomes from the Longitudinal Study of American Youth (LSAY) as a means of exploring unobserved population heterogeneity. A series of increasingly invariant models were estimated and compared to a two-factor, single-class baseline model. For each of the models fit to the data, two- through five-class solutions were specified. The commonly used relative fit indices (AIC, BIC, ssaBIC, and aLRT) were used in choosing the best-fitting models. However, there were several instances of disagreement between the IC results. For example, among the non-invariant and fully invariant models, while the AIC and the ssaBIC identified the 4-class solution as the best-fitting model, the BIC and aLRT produced their lowest values for the 3-class solution. In summarizing their results, the authors suggested that in addition to relying on the model fit measures, researchers should explore the additional classes, in a similar manner as additional factors are investigated in factor analysis, to determine whether their inclusion provides new, substantively meaningful interpretations to the solution.
Overall, results from both simulation and applied studies highlight the lack of agreement among the mixture model fit indices. Researchers have attributed this inconsistency of performance to the heavy dependence of the indices on the type of mixture model under consideration as well as the assumptions made about the model.

Recent innovations in Mplus (Muthen & Muthen, 1998-2008) have increased the likelihood that SEM-based factor mixture models will be more commonly estimated in practice (Bandalos & Cohen, 2006). In this study, the performance of the factor mixture model was evaluated primarily in terms of its ability to produce high convergence rates, control the Type I error rate, and provide adequate statistical power under a variety of realistic study conditions. The simulated conditions examined sample size, magnitude of DIF, and similarity of latent ability means. In sum, this Monte Carlo simulation was conducted to determine the conditions under which a factor mixture approach performs best when assessing item non-invariance and, ultimately, whether its use in practice should be recommended.

SEM-based approaches are not commonly used in the discipline of educational testing, which traditionally has been considered the domain of techniques developed within an IRT framework. And although the equivalence between factor analytic and IRT approaches for categorical items has long been established and applied repeatedly in the literature (Bock & Aitkin, 1981; Finch, 2005; Glockner-Rist & Hoijtink, 2003; Moustaki, 2000; Muthen, 1985; Muthen & Asparouhov, 2002; Takane & de Leeuw, 1987), these methods are still utilized primarily within their respective disciplines of origin. Therefore, despite its obvious potential, it is unlikely that this latent conceptualization will gain widespread acceptance unless the applied community is convinced that (i) framing DIF with respect to latent qualitative differences and (ii) using a SEM-based approach are both worthwhile, practical options.
In sum, if a mixture approach can be shown to add substantial value in an area such as DIF detection, then research of this kind will contribute positively to bridging the gap between SEM-based and IRT-based methods.

Clauser, B. E., Mazor, K. M., & Hambleton, R. K. (1994). The effects of score group width on the Mantel-Haenszel procedure. Journal of Educational Measurement, 57, 67-78.

Clark, S. L. (2010). Mixture modeling with behavioral data. Unpublished doctoral dissertation, University of California, Los Angeles.

Clark, S. L., Muthen, B., Kaprio, J., D'Onofrio, B. M., Viken, R., Rose, R. J., & Smalley, S. L. (2009). Models and strategies for factor mixture analysis: Two examples concerning the structure underlying psychological disorders. Manuscript submitted for publication.

Cleary, T. A., & Hilton, T. J. (1968). An investigation of item bias. Educational and Psychological Measurement, 5, 115-124.

Cohen, A. S., & Bolt, D. M. (2005). A mixture model analysis of differential item functioning. Journal of Educational Measurement, 42, 133-148.

Cohen, A. S., Kim, S.-H., & Baker, F. B. (1993). Detection of differential item functioning in the graded response model. Applied Psychological Measurement, 17, 335-350.

Cole, N. S. (1993). History and development of DIF. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 25-33). Hillsdale, NJ: Lawrence Erlbaum.

Dainis, A. M. (2008). Methods for identifying differential item and test functioning: An investigation of Type I error rates and power. Unpublished doctoral dissertation, James Madison University.

De Ayala, R. J. (2009). The theory and practice of item response theory. New York: Guilford Press.

De Ayala, R. J., Kim, S.-H., Stapleton, L. M., & Dayton, C. M. (2002). Differential item functioning: A mixture distribution conceptualization. International Journal of Testing, 2, 243-276.

Donoghue, J. R., Holland, P. W., & Thayer, D. T. (1993). A Monte Carlo study of factors that affect the Mantel-Haenszel and standardization measures of differential item functioning. In P. W.
Holland and H. Wainer (Eds.), Differential item functioning (pp. 137-166). Hillsdale, NJ: Lawrence Erlbaum Associates.

Dorans, N. J., & Holland, P. W. (1993). DIF detection and description: Mantel-Haenszel and standardization. In P. W. Holland & H. Wainer (Eds.), Differential item functioning. Hillsdale, NJ: Lawrence Erlbaum.

The results suggested that the ssaBIC outperformed the other five IC measures, including the AIC and the BIC. For instance, with smaller sample sizes (N=100, 200), the ssaBIC had the highest accuracy rates of 62.7% and 77.5%, respectively. In addition, Yang (1998) found that both the BIC and a consistent form of the AIC (CAIC) tended to incorrectly select models with fewer latent classes than actually simulated. The performance of the BIC and CAIC improved only after the sample size increased to the largest condition of N=1000. The researcher concluded that in the case of LCA models, the ssaBIC outperformed the AIC and BIC at determining the correct number of latent classes (Yang, 1998).

Tofighi and Enders (2007) extended this line of simulation research to evaluating the accuracy of information-based indices in identifying the correct number of latent classes in growth mixture models (GMM). Manipulated factors included the number of repeated measures, sample size, separation of latent classes, mixing proportions, and within-class distribution shape simulated for a three-class population GMM. The researchers found that of the ICs, the ssaBIC was most successful at consistently extracting the correct number of latent classes. Once again, the BIC showed its sensitivity to small sample sizes and frequently favored too few classes. The accuracy of the ssaBIC persisted even when the latent classes were not well separated: whereas the ssaBIC extracted the correct three-class solution in 88% of the replications, the BIC and CAIC correctly identified this solution only 11% and 4% of the time, respectively.
In examining the accuracy of model selection indices in multilevel factor mixture models, Allua (2007) found that while the BIC and ssaBIC outperformed the AIC in correct predictions when data were generated from a one-class model, none of the fit indices performed credibly when the data were generated from a two-class model.

Schwanenflugel, 2008). Finally, the assumption of an inherent homogeneity in responses among examinees in the subgroups has also been cited as another weakness of the traditional DIF approach (De Ayala, 2009; De Ayala, Kim, Stapleton, & Dayton, 2002; Samuelsen, 2005). This view has been supported by the observation that even within a seemingly homogeneous manifest group (e.g., Hispanic or Black) there can be high levels of heterogeneity, resulting in segments that respond differently to an item than other examinees in that group. Using race as an example, De Ayala (2009) noted that a racial category such as Asian American would lump together examinees of Filipino, Korean, Indonesian, and Taiwanese descent as a single homogeneous group, ignoring their intra-manifest variability. As a result, De Ayala (2009) argues that this assumed homogeneity in traditional manifest DIF assessments may lead to false conclusions about the existence or magnitude of DIF.

As an alternative to the traditional manifest approach, a latent mixture conceptualization of DIF has been proposed. Rather than focusing on a priori examinee characteristics, this method characterizes DIF as the result of unobserved heterogeneity in the population. A latent mixture conceptualization relaxes the requirement that associates DIF with a specific preexisting variable and the assumption that manifest groups are homogeneous. Instead, examinees are classified into latent subpopulations based on their differential response patterns. These latent subpopulations, or latent classes, arise as a result of qualitative differences (e.g.,
use of problem-solving strategies, response styles, or level of cognitive thinking) among examinee subgroups (Mislevy et al., 2008; Samuelsen, 2005). Interestingly, more than a decade before the latent mixture conceptualization had been proposed, Angoff (1993) had already hinted at this possibility.

Given these timing constraints, a smaller number of data sets (i.e., 50) was replicated. The list of study conditions is provided in Table 3-2.

Evaluation Criteria

As previously noted, the objective of this first part of the simulation was to determine the success rate of the factor mixture method in identifying the correct number of classes. The three likelihood-based model fit indices (AIC, BIC, and ssaBIC) provided by Mplus were compared, with smaller values indicating better model fit. The outcome measures evaluated for each of the three (i.e., one- through three-class) factor mixture solutions fit to the data were:

* Convergence rates. This was represented as the number of replications that converged to a proper solution across the 50 simulations for each set of the eight conditions. Data sets with improper or non-convergent solutions were not included in the analysis.

* IC performance. Performance was evaluated by calculating the average IC values and comparing the values for each index across the one-, two-, and three-class models. For each of the simulated conditions, the lowest average IC value and the corresponding model were identified.

Research Study 2

In the second part of the study, research was conducted to evaluate the Type I error rate and power performance of the factor mixture model at detecting uniform DIF, assuming that the correct number of classes is known. With respect to the study design, two levels of DIF magnitude (DIF = 1.0, 1.5) and two levels of sample size (N = 500, 1000) were again simulated using the same levels as in Study 1. However, an additional level was included for the impact condition.
More specifically, in addition to the no-impact and moderate-impact conditions, a large level of impact (i.e., the mean for the reference group was 1.0 SD higher than the mean of the focal group) was included as well. The inclusion of this new level permitted a more complete investigation of the effect of impact.

In one application, factor mixture models were fit to alcohol dependency data; the single-factor, three-class factor mixture model provided the best fit to the data and best accounted for the covariation in both the pattern of symptoms and the heterogeneity of the population. In another study from the substance abuse literature, Muthen et al. (2006) compared two types of factor mixture models to factor analysis and latent class approaches in analyzing the responses of 842 pairs of male twins to 22 alcohol criteria items. The findings showed that both factor mixture models fit the data well and explained heritability both with regard to the underlying dimensional structure of the data and the latent class profiles of the heterogeneous population (Muthen et al., 2006).

With regard to DIF, factor mixture models may be used to investigate item parameter differences between latent classes of subpopulations of examinees. In this conceptualization of DIF, the unobserved latent classes represent qualitatively different groups of individuals whose item responses function differentially across the classes (Bandalos & Cohen, 2006). To allow for the specification of these models, a categorical latent variable is integrated into the common factor model specified in Equations 1 and 2. As a result, the K-class factor mixture model is expressed as:

y*_k = τ_k + Λ_k η_k + ε_k    (13)
η_k = α_k + ζ_k               (14)

where the subscript k indicates the parameters that can vary across the latent classes. Figure 2-4 provides a depiction of a factor mixture model where the unidimensional latent factor is measured by five observed items and there are K latent classes in the population.
In the diagram, the relationships are specified as follows:

Though promising, the LMR LRT and the BLRT are not without potential drawbacks. Jeffries (2003) has been critical of the LMR LRT's use in mixture modeling and has suggested that the statistic be applied with caution. In addition, the BLRT, which uses bootstrap samples, is a far more computationally intensive approach than the information-based statistics. As a result, the BLRT, though seemingly a reliable index, is seldom used in practice by the applied researcher (Liu, 2008). Therefore, additional attention may be focused on identifying alternative, robust model selection measures that provide more consistency than the ICs but are less computationally demanding than the BLRT.

A second limiting factor in this part of the study was that the selection of the best-fitting model was based on the average IC values. A more reliable approach would have been to determine the percentage of times (out of the completed replications) that each index identified the correct model. However, in this study it was not possible to provide a one-to-one comparison of the IC values across the three class solutions when a 100% convergence rate was not achieved. Therefore, in future research, this change should be implemented so that the percentage of correct model identifications can be compared for each of the indices. It should also be mentioned that while previous studies have evaluated the performance of model selection methods with respect to a variety of mixture models (GMM, LCA, FMM), to date no research has been conducted to evaluate the performance of these indices when used in the context of DIF detection. Filling this gap in the methodology literature requires a more extensive study focusing on the detection of DIF with mixture models. As with all simulation research, the findings can only be generalized to the limited number of conditions selected for this study.
Some previous studies have reported inflated Type I error rates with unequal latent trait distributions. The results are also mixed with respect to the effect of impact on power. Whereas some studies have shown reduced power (Ankemann, Witt, & Dunbar, 1999; Clauser, Mazor, & Hambleton, 1993; Narayanan & Swaminathan, 1996), others (Gonzalez-Roma et al., 2006; Rogers & Swaminathan, 1993; Shealy & Stout, 1993; Stark et al., 2006) found that DIF detection rates were not negatively affected by the dissimilarity of latent distributions. In this first part of the study, two conditions of differences in mean latent ability were manipulated:

1. Equal latent ability means, with the reference and focal groups both generated from a standard normal distribution (i.e., θ_R ~ N(0,1), θ_F ~ N(0,1)), and

2. Unequal latent ability means, with the reference group having a latent ability mean 0.5 standard deviation higher than the focal group (i.e., θ_R ~ N(0.5,1), θ_F ~ N(0,1)).

Fixed Simulation Conditions

Test length

The test was simulated with a fixed length of 15 dichotomous items. Previous studies using factor mixture modeling have typically used shorter scale lengths, varying between 4 and 12 observed items for a single-factor model with categorical items (Lubke & Neale, 2008; Nylund et al., 2006; Kuo et al., 2008; Reynolds, 2008; Sawatzky, 2007). This may be due to the fact that longer computation times are required when fitting mixture models to categorical data (Lubke & Neale, 2008). Therefore, while more test items could have been included, this length was chosen not only to be consistent with previous research but also to take into account the computational intensity of factor mixture models.

It should be noted that in the original design of this study, several additional conditions were considered. However, given the computational intensity of mixture modeling, and in the interest of time, it was decided to reduce the number of study conditions to the smaller set that was studied.
Therefore, future research should consider a broader range of simulation conditions, which would make for a more realistic study. For example, in addition to sample sizes, it would be of interest to investigate the ratio of focal to reference group sample sizes as well. In this study, a 1:1 sample size ratio of focal to reference group was considered. And while this may be representative of an evenly split manifest variable such as gender (Samuelsen, 2005), unequal group sizes tend to mimic majority/minority population comparisons such as race (e.g., Caucasian vs. Black or Hispanic). In traditional DIF assessments, power rates are typically higher for equal focal and reference group sizes than with unequal sample size ratios (Atar, 2007). Therefore, it would be interesting to investigate whether this finding holds for factor mixture DIF detection methods. Other conditions, fixed in this current study, that could be manipulated in future research include: (i) the nature of the items, (ii) the scale length, and (iii) the type of DIF. In the study, data were simulated for dichotomous items only. An interesting extension would be the evaluation of the model using categorical response data generated from different polytomous IRT models (e.g., the graded response model or the partial credit model). Another condition that could be manipulated is the number/proportion of items simulated to contain DIF. In addition, assessing the performance of the model selection indices and the mixture model with respect to varying scale lengths should also make for a more complete, informative study. While it is expected that longer tests would produce lower Type I error rates and greater power, it would be of interest to verify this expectation empirically.

The procedure of removing identified DIF items from the matching criterion to reduce the impact of DIF contamination is known as purification (Kamata & Vaughn, 2004).
While internal criterion measures are most commonly used, an external criterion consisting of a set of items that were not part of the administered test could also be employed. An external criterion may be an adequate choice particularly when there is a high proportion of DIF items in the scale or the total score is deemed to be an inappropriate measure (Gelin, 2005). However, external measures are seldom used in practice because of the difficulty of finding an adequate set of items that would measure the latent trait better than the actual test, which was designed specifically for that use (Gelin, 2005; Shih & Wang, 2009). Procedures that use the observed variable framework include contingency table methods such as the Mantel-Haenszel (MH) procedure (Holland & Thayer, 1988), generalized linear models such as logistic regression (Swaminathan & Rogers, 1990) or ordinal regression (Zumbo, 1999), and the standardization method (Dorans & Kulick, 1983).

The latent variable framework

Unlike the procedures implemented within the observed score framework, these techniques do not condition on an observed, manifest measure like the total test score or purified total test score. Instead, latent variable DIF detection methods involve the use of an assumed underlying latent trait such as ability. The two main classes of methods that use the latent variable framework are: (i) item response theory (IRT) and (ii) structural equation modeling (SEM) based approaches. IRT methods include techniques in which comparisons are made between item parameters (Lord, 1980) or between item characteristic curves (Raju, 1988, 1990), as well as likelihood ratio test methods (IRT-LRT; Thissen, Steinberg, & Wainer, 1988). For the dichotomous items in this study, the probability of a correct response under the 2PL model is

Pr(Yij = 1 | θj) = 1 / (1 + exp[−ai(θj − bi)])     (24)

where ai is the item discrimination parameter, bi is the item threshold parameter, and θj is the latent ability trait for examinee j.
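The study generated its item responses in R; purely as an illustration, the 2PL probability in Equation 24 and the scoring rule used in the data generation (comparing the probability with a U(0,1) draw) might be sketched in Python as follows (the function names are mine):

```python
import math
import random

def prob_correct(theta, a, b):
    """2PL probability of a correct response: 1 / (1 + exp(-a*(theta - b)))."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def simulate_response(theta, a, b, rng=random):
    """Score 1 (correct) if the 2PL probability exceeds a U(0,1) draw, else 0."""
    return 1 if prob_correct(theta, a, b) > rng.random() else 0

# An examinee of average ability (theta = 0) on an item with threshold b = 0
# answers correctly with probability .5, regardless of the discrimination a
p = prob_correct(theta=0.0, a=1.0, b=0.0)
```

Repeating `simulate_response` over all examinees and items yields one replication of the dichotomous response matrix.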
To determine each examinee's item response, the calculated probability Pr(Yij = 1 | θj) was compared to a randomly generated number from a uniform U(0,1) distribution. If that probability exceeded the random number, the examinee's item response was scored as correct (i.e. coded as 1). On the other hand, if the probability of a correct response was less than the random number, the item response was scored as incorrect and coded as 0. Finally, 50 replications were run for each set of simulation conditions, and the dichotomous item response datasets were exported to Mplus V5.1 (Muthen & Muthen, 1998-2008) for the analysis phase. Since the data were generated externally, the Mplus Type=Montecarlo option was used to analyze the multiple datasets and to save the results for the replications that converged successfully.

Simulation Study Design

In their 1988 paper, Lautenschlager and Park reiterated the need for Monte Carlo studies to be designed in such a way that they simulate real data conditions as closely as possible. This advice was followed when selecting the conditions and levels for this simulation study. The conditions were chosen to replicate those adopted in previous latent DIF studies (Bolt, Cohen, & Wollack, 2001; Bilir, 2009; Cohen & Bolt, 2005; De Ayala et al., 2002; Samuelsen, 2005) and mixture modeling studies (Gagne, 2004; Lee, 2009; Lubke & Muthen, 2005, 2007). A further avenue for future research is a comparison of the simultaneous test of threshold differences using the Mplus model constraint option versus either a constrained- or a free-baseline strategy for testing DIF with mixture CFA models.

Conclusion

In the last decade, a burgeoning literature on mixture modeling and its applications has emerged. And although several of these research efforts have been concentrated in the area of growth mixture modeling, there is also a groundswell of interest in applying a mixture approach to the study of measurement invariance.
Therefore, in concluding this dissertation, it is important to reiterate the motivation that should precede the use of this technique, as well as some key concerns that applied researchers should keep in mind when deciding whether mixture modeling is an appropriate approach for their research. The intrinsic appeal of mixture models is that they allow for the exploration of unobserved population heterogeneity using latent variables. Under the traditional conceptualization, DIF is defined with respect to distinct, known sub-groups. Therefore, in using standard DIF approaches, practitioners seek to determine whether, after controlling for latent ability, differences in item response patterns are a result of a known variable such as gender or race. However, when investigating DIF from a latent perspective, there is an implied assumption that the presence of unobserved latent classes gives rise to the pattern of differential functioning in the items. Advocates of this approach contend that it allows for a better understanding of why examinees may be responding differently to items, and this is certainly an attractive inducement to practitioners. However, these results suggest that unless large sample sizes and large amounts of DIF are present in the data, the factor mixture approach is likely to be unsuccessful at disentangling the population into distinct, distinguishable latent classes. Additionally, commonly-used fit indices such as the AIC, BIC, and ssaBIC are likely to produce inconsistent results.

2. To ensure identification, the item thresholds of the referent item were also held equal across the latent classes. The remaining 14 (i.e. p − 1) item thresholds were freely estimated.

3. One of the factor means was constrained to zero while the remaining factor mean was freely estimated. For K latent classes, the Mplus default is to fix the mean of the last or highest-numbered latent class to zero (i.e. μK = 0).
Therefore, in this case, the mean of the first class was freely estimated.

4. Factor variances were freely estimated for all latent classes.

Data Generation

The discrimination and difficulty parameters used in this study were adapted from dissertation research conducted by Wanichtanom (2001). The original test (Wanichtanom, 2001) consisted of 50 items; in this case, parameters for ten of the 50 items were selected. These ten items from the Wanichtanom (2001) study represented the DIF-free test items. In the original study, the item discrimination parameters were drawn from a uniform distribution within a 0 to 2.0 range and the difficulty parameters from a normal distribution within a -2.0 to 2.0 range (Wanichtanom, 2001). The remaining five DIF items that formed part of the scale reflected low (i.e. 0.5), medium (i.e. 1.0), and high (i.e. 2.0) levels of discrimination. For the entire 15-item test, the discrimination (a) parameters ranged from 0.4 to 2.0, with a mean of 0.98, while the difficulty (b) parameters ranged from -1.2 to 0.7, with a mean of -0.34. Uniform DIF was simulated against the focal group on Items 2 through 6. The values of the item parameters are presented in Table 3-1. Data were generated using R statistical software (R Development Core Team, 2009). The ability parameters were drawn from normal distributions for both the reference and focal groups. For these dichotomous items, the probability of a correct response was computed using the 2PL IRT model (Equation 24).

Reconciling the Simulation Results

On one hand, the overall pattern of findings across the simulation conditions is consistent with previous DIF results. On the other, the factor mixture approach was not as successful as was hoped at controlling the rate of false identifications and, as a result, at demonstrating power to detect DIF.
However, if the factor mixture approach is to be regarded as a viable DIF detection method, possible reasons for this deviation from the expected performance must be addressed. Under a manifest approach to DIF, an item is said to exhibit DIF if groups matched on the latent ability trait differ in their probabilities of item response (Cohen et al., 1993). Therefore, in that context, DIF is defined with respect to the manifest groups being considered. By contrast, the mixture approach posits a different conceptualization of DIF. In this case, the underlying assumption is that DIF is observed because of differences in item responses between unobserved latent classes rather than known manifest groups. Moreover, it is further assumed that unless there is perfect overlap between the manifest groups and the latent classes, the two methods should not be expected to produce the same DIF results (De Ayala et al., 2002). Perfect overlap implies that the composition of each of the latent classes is exactly the same as that of the two manifest groups. For instance, in the case of a two-class, two-group population, 100% of the reference group would comprise latent class 1, while 100% of the focal group would belong to latent class 2. However, De Ayala et al. (2002) contend that it is unlikely that this perfect equivalence between latent classes and manifest groups will occur. Because the composition of the latent classes is likely to differ from that of the manifest groups, it should be expected that the DIF results will differ, particularly as the level of overlap moves from 100% to 50%. Therefore, some similarity in results between the two approaches can be expected, but not exact agreement. As previously indicated, F is either the standard normal or the logistic distribution, depending on the distributional assumptions about the εs.
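The degree of overlap between manifest groups and latent classes can be made concrete with a toy calculation (this sketch is mine, not part of the study; it assumes binary 0/1 codes for both the manifest group and the latent class):

```python
def overlap_rate(manifest, latent):
    """Percentage of examinees whose latent class matches their manifest group.

    Because latent class labels are arbitrary, the better of the two possible
    labelings is reported; 100 means perfect overlap, 50 means chance-level.
    """
    n = len(manifest)
    matches = sum(g == c for g, c in zip(manifest, latent))
    return 100.0 * max(matches, n - matches) / n

# Perfect overlap: every reference examinee (0) falls in one class and
# every focal examinee (1) in the other
perfect = overlap_rate([0, 0, 1, 1], [0, 0, 1, 1])  # 100.0
# Chance-level overlap: class membership is unrelated to group membership
chance = overlap_rate([0, 0, 1, 1], [0, 1, 0, 1])   # 50.0
```

As the rate moves from 100 toward 50, the latent classes carry progressively less information about the manifest groups, and the two DIF definitions diverge.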
However, it should be noted that in the case of factor mixture modeling with categorical variables, the default estimation method in Mplus (Muthen & Muthen, 1998-2008) is the robust maximum likelihood (MLR) estimator, and the default distribution for F is the logistic distribution. To allow for the estimation of the thresholds, the intercepts in the measurement model are assumed to be zero. As a result, the factor analytic parameters can be converted to IRT parameters using:

ai = λi / √Var(δi)  and  bi = τi / λi     (11)

Additionally, to ensure model identification, it is necessary to assign a scale to the latent trait. One method of setting the scale of the latent trait is to standardize it by setting the mean equal to zero and fixing the variance at one, that is, κ = 0 and φ = 1. In this case, the factor loadings and thresholds can then be converted to item discriminations and item difficulties using the following expressions (Muthen & Asparouhov, 2002):

ai = λi  and  bi = τi / λi     (12)

The ease with which estimates of the factor analysis parameters can be converted to the more recognizable IRT scale should increase the interpretability and utility of the results for applied researchers (Fukuhara, 2009). An alternative method of setting the scale of the latent factor is to fix one loading per factor to one. As a result, the simplified conversion formulae will differ and, by extension, the magnitude of the parameter estimates will be affected as well (Bontempo, 2006; Kamata & Bauer, 2008).

CHAPTER 2
LITERATURE REVIEW

Differential Item Functioning

From a historical standpoint, the term item bias was coined in the 1960s, during an era when a public campaign for social equality, justice, and fairness was being waged on all fronts.
The term referenced studies designed to investigate the claim that "the principal, if not the sole, reason for the great disparity in test performance between Black and Hispanic students and White students on tests of cognitive ability is that the tests contain items that are outside the realms of the minority cultures" (Angoff, 1993, p. 3). Concerns about bias in testing were particularly relevant in cases where the results were used in high-stakes decisions involving job selection and promotions, certification, licensure, and achievement. What followed in the 1960s and 1970s was a series of studies using rudimentary methods based on classical test theory (CTT) techniques (Gelin, 2005). One of these early methods involved an analysis of variance (ANOVA) approach (Angoff & Sharon, 1974; Cleary & Hilton, 1968) and focused on the interaction between group membership and item performance as a means of identifying outliers and detecting potentially biased items. Another method, the delta-plot technique (Angoff, 1972; Thurstone, 1925), used plots of the transformed CTT index of item difficulty (p-values) for each group as a means of detecting biased items. However, the main weakness of these early methods was that they failed to control for the underlying construct (e.g. ability) that the test purported to measure. Another criticism was that because these methods only considered item difficulty, their implicit assumption of equally discriminating items led to an increase in the incidence of false negative or false positive identifications.

Table 4-18. Power rates for sample size N of 500
                Impact
DIF      0 SD     0.5 SD   1.0 SD
1.0      0.268    0.267    0.264
1.5      0.525    0.498    0.425

Table 4-19. Power rates for sample size N of 1000
                Impact
DIF      0 SD     0.5 SD   1.0 SD
1.0      0.350    0.324    0.291
1.5      0.801    0.731    0.623

Table 4-20. Power rates for impact of 0 SD
              Sample size
DIF      500      1000
1.0      0.268    0.350
1.5      0.525    0.801

Table 4-21. Power rates for impact of 0.5 SD
              Sample size
DIF      500      1000
1.0      0.267    0.324
1.5      0.498    0.731

Table 4-22. Power rates for impact of 1.0 SD
              Sample size
DIF      500      1000
1.0      0.264    0.291
1.5      0.425    0.623

There is no universal agreement on test fairness with respect to aspects of its definition, interpretation, and implementation (AERA, APA, & NCME, 1999). However, there is agreement that fairness must be paramount to test developers during the writing and review of items, as well as during the administration and scoring of tests. In other words, the minimum fairness requirements are that items are free from bias and that all examinees receive an equitable level of treatment during the testing process (AERA, APA, & NCME, 1999). In developing unbiased test items, one of the primary concerns is ensuring that the items do not function differentially for different subgroups of examinees. This issue of item invariance is investigated through the use of a statistical technique known as differential item functioning (DIF) analysis. DIF detection is particularly critical if meaningful comparisons are to be made between different examinee subgroups. The fundamental premise of DIF is "that if test takers have approximately the same knowledge, then they should perform in similar ways on individual test questions regardless of their sex, race or ethnicity" (ETS, 2008). Therefore, the process of DIF assessment involves the accumulation of empirical evidence to determine whether items function differentially for examinees with the same ability. DIF analysis is widely regarded as the psychometric standard in the investigation of bias and test fairness. Consequently, it has been the topic of extensive research (Camilli & Shepard, 1994; Clauser & Mazor, 1998; Holland & Thayer, 1988; Holland & Wainer, 1993; Millsap & Everson, 1993; Penfield & Lam, 2000; Potenza & Dorans, 1995; Swaminathan & Rogers, 1990).
As part of its evolution, several statistical DIF detection techniques have been developed, including non-parametric methods (Holland & Thayer, 1988; Swaminathan & Rogers, 1990), IRT-based methods (Thissen, Steinberg, & Wainer, 1993; Wainer, Sireci, & Thissen, 1991), and SEM-based methods (Joreskog &

Atar, B. (2007). Differential item functioning analyses for mixed response data using IRT likelihood-ratio test, logistic regression, and GLLAMM procedures. Unpublished doctoral dissertation, Florida State University.

Bandalos, D. L., & Cohen, A. S. (2006). Using factor mixture models to identify differentially functioning test items. Paper presented at the annual meeting of the American Educational Research Association, San Francisco.

Bauer, D. J., & Curran, P. J. (2004). The integration of continuous and discrete latent variable models: Potential problems and promising opportunities. Psychological Methods, 9, 3-29.

Bilir, M. K. (2009). Mixture item response theory-MIMIC model: Simultaneous estimation of differential item functioning for manifest groups and latent classes. Unpublished doctoral dissertation, Florida State University.

Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46, 443-459.

Bolt, D. M., Cohen, A. S., & Wollack, J. A. (2001). A mixture item response model for multiple-choice data. Journal of Educational and Behavioral Statistics, 26, 381-409.

Bontempo, D. E. (2006). Polytomous factor analytic models in developmental research. Unpublished doctoral dissertation, The Pennsylvania State University.

Camilli, G., & Shepard, L. (1994). Methods for identifying biased test items. Newbury Park, CA: Sage.

Cheung, G. W., & Rensvold, R. B. (1999). Testing factorial invariance across groups: A reconceptualization and proposed new method. Journal of Management, 25, 1-27.

Cho, S.-J. (2007). A multilevel mixture IRT model for DIF analysis. Unpublished doctoral dissertation, University of Georgia, Athens.
Chung, M. C., Dennis, I., Easthope, Y., Werrett, J., & Farmer, S. (2005). A multiple-indicator multiple-cause model for posttraumatic stress reactions: Personality, coping, and maladjustment. Psychosomatic Medicine, 67, 251-259.

Clauser, B. E., & Mazor, K. M. (1998). Using statistical procedures to identify differentially functioning test items. Educational Measurement: Issues and Practice, 17, 31-44.

Impact

The effect of impact on DIF detection rates was also examined. The three levels investigated were: (i) equal latent trait means, (ii) a 0.5 SD difference between latent trait means, representing a moderate amount of impact, and (iii) a 1.0 SD difference between latent trait means, representing a large amount of impact. The aggregated results in Tables 4-20 through 4-22 show that as the difference in latent trait means between the groups increased, there was a modest decline in the accuracy of the factor mixture method in detecting DIF. For example, the average power rate decreased from .486 to .455 to .401 under the no-impact, 0.5 SD, and 1.0 SD conditions, respectively. These results suggest that the presence of impact did not substantially affect the ability of the factor mixture approach to detect DIF.

Effect of item discrimination parameter values

For the five items simulated to contain DIF, three different levels of item discrimination were selected. For two items (Items 2 and 3), the discrimination parameter value was set at 0.5 to mimic low-discriminating items; one item's (Item 4) a-parameter was set at 1.0, representing a medium level of discrimination; and two items (Items 5 and 6), with an a-parameter of 2.0, represented highly discriminating items. The discrimination parameter values for the non-DIF items were randomly drawn from a uniform distribution within a 0 to 2.0 range. The power rates for DIF detection categorized by the level of item discrimination are shown in Table 4-23.
These results show, as expected, that power is influenced by the item discrimination parameter. More specifically, the accuracy of DIF detection increased as the item discrimination values increased. The factor mixture method had, on average, a .369 rate of detecting DIF in low-discriminating items, and this rate increased with the level of discrimination. The third manipulated condition, latent trait mean differences, examined the robustness of the factor mixture model in DIF detection to the influence of impact. Overall, a total of 12 conditions (2 sample sizes × 2 DIF magnitudes × 3 latent trait distributions) were simulated. In this second phase of the simulation, each condition was replicated 1000 times. The full list of study design conditions is shown in Table 3-3.

Data Analysis

The 1000 sets of dichotomous item responses were generated by R V2.9.0 (R Development Core Team, 2009). The data sets for each of the 12 conditions were saved and exported from R to Mplus V5.1 for analysis. As was done in the first part of the study, the Type=Montecarlo facility was used to accommodate the analysis of the multiple datasets generated externally to Mplus and to save the results for subsequent analysis. To assess uniform DIF, a simultaneous significance test of the 14 threshold differences (i.e. all items except the referent, Item 1) was conducted using a Wald test. A p-value less than .05 provided evidence of DIF in the item.

Evaluation Criteria

The outcome measures used in evaluating the performance of the factor mixture method for DIF detection were as follows:

* Convergence rates: This was measured by the number of replications that converged to proper solutions for each of the 12 combinations of conditions. Data sets with improper or non-convergent solutions were not included in the analysis.

* Type I error rate: The Type I error rate (or false-positive rate) was computed as the proportion of times the DIF-free items were incorrectly identified as having DIF.
Therefore, the overall Type I error rate was calculated by dividing the total number of times the nine DIF-free items (i.e. Items 7-15) were falsely rejected by the total number of properly converged replications for each of the 12 study conditions. The nominal Type I error rate used in this study was .05. Since the focus of this dissertation is on the use of an SEM-based approach to detect DIF, no detailed description of the IRT-based DIF detection methods will be given.

SEM-based DIF Detection Methods

While it is possible to specify an exploratory factor model in which the measurement model relating the item responses to the underlying latent factors is unknown, a CFA approach in which the factor structure has been specified is more often used in DIF detection. Typically, a CFA model is formulated as:

Y = ν + Λη + ε     (1)

η = α + ζ     (2)

where Y is a p×1 vector of scores on the p observed variables, ν is a p×1 vector of measurement intercepts, Λ is a p×m matrix of factor loadings, η is an m×1 vector of factor scores on the m latent factors, and ε is a p×1 vector of residuals or measurement errors representing the unique portion of the observed variables not explained by the common factors. It is assumed that the εs have zero means and are uncorrelated not only with the ηs but with each other as well. Additionally, α denotes an m×1 vector of factor means and ζ is an m×1 vector of factor residuals. The model-implied mean and covariance structures are formulated as follows:

μ = ν + Λκ     (3)

Σ(θ) = ΛΦΛ′ + Θ     (4)

where μ is a p×1 vector of means of the observed variables, κ is an m×1 vector of factor means, Σ(θ) is a p×p matrix of variances and covariances of the p observed variables, Φ is an m×m covariance matrix of the latent factors, and Θ is a square p×p matrix of variances and covariances of the measurement errors. For a single-group CFA model, it is also assumed that the independent observations are drawn from a single, homogeneous population.
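The model-implied moments in Equations 3 and 4 can be illustrated numerically; the following sketch uses a hypothetical single-factor model with three items, and every parameter value below is illustrative rather than taken from the study:

```python
import numpy as np

# Hypothetical single-factor model with p = 3 items (all values illustrative)
nu = np.array([0.0, 0.1, -0.1])        # nu: p x 1 measurement intercepts
Lam = np.array([[1.0], [0.8], [0.6]])  # Lambda: p x m factor loadings
kappa = np.array([0.5])                # kappa: m x 1 factor means
Phi = np.array([[1.0]])                # Phi: m x m factor covariance matrix
Theta = np.diag([0.4, 0.5, 0.6])       # Theta: p x p residual (co)variances

# Equation 3: model-implied means, mu = nu + Lambda * kappa
mu = nu + Lam @ kappa

# Equation 4: model-implied covariances, Sigma = Lambda Phi Lambda' + Theta
Sigma = Lam @ Phi @ Lam.T + Theta
```

Fitting the model then amounts to choosing the free parameters so that `mu` and `Sigma` reproduce the observed sample moments as closely as possible, which is the discrepancy-minimization problem described next.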
The estimates of the parameters are found by minimizing the discrepancy between the observed covariance matrix S and the model-implied covariance matrix Σ(θ). The general form of the discrepancy function is denoted by F(S, Σ(θ)), but there are several different types of functions that can be used in the estimation process. For example, if multivariate normality of the data is assumed, then the maximum likelihood (ML) estimates of the parameters are obtained by minimizing the following:

F(θ) = log|Σ(θ)| + tr(SΣ(θ)⁻¹) − log|S| − p + (ȳ − μ(θ))′Σ(θ)⁻¹(ȳ − μ(θ))     (5)

where S is the sample covariance matrix, Σ(θ) is the model-implied covariance matrix, p is the number of observed variables, ȳ is the vector of sample means, and μ(θ) is the model-implied mean vector. As presented, this formulation of the factor model is described for items with a continuous response format. However, since the focus of this dissertation is on DIF assessment with dichotomous items, what follows is a respecification of the model to accommodate categorical items.

Factor Analytic Models with Ordered Categorical Items

In educational measurement, the ability trait is typically measured by dichotomous or polytomous items rather than continuous items. One method of dealing with categorical item responses is to specify a threshold model or latent response variable (LRV) formulation (Muthen & Asparouhov, 2002). The LRV formulation assumes that underlying each observed item response y_i is a continuous and normally distributed latent response variable y_i*. This continuous latent variable can be thought of as a response tendency, with higher values indicating a greater propensity of answering the item correctly. Further, it is assumed that when this tendency is sufficiently high thereby


A MONTE CARLO INVESTIGATION OF THE PERFORMANCE OF FACTOR MIXTURE MODELING IN THE DETECTION OF DIFFERENTIAL ITEM FUNCTIONING

By

MARY GRACE-ANNE JACKMAN

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2010

2010 Mary Grace-Anne Jackman

To my parents, Enid and Kenmore Jackman, and my brother, Stephen

ACKNOWLEDGMENTS

First and foremost, I am most grateful to Almighty God, through whom all things are possible. I am also forever indebted to the Office of Graduate Studies and the Alumni Fellowship Committee for the financial support that has made these four years of study possible. The completion of this dissertation would also not have been possible without the support and guidance of my dissertation committee: Dr. David Miller, Dr. Walter Leite, Dr. James Algina, and Dr. Craig Wood. I would like to thank my committee chair and academic adviser, Dr. Miller, for his expert guidance and insightful counsel during this PhD program. I am also grateful to Dr. Leite for his patient mentorship and his uncanny ability to reduce my seemingly insurmountable programming mountains to mere molehills. I must also acknowledge Dr. Algina, the epitome of teaching excellence; I am indeed privileged to have been your student. I am also thankful to Dr. Wood for agreeing to be a part of my dissertation committee in my time of need and for your quiet vote of confidence. Finally, I also need to express my immense gratitude and deepest appreciation to my friends, in the US and at home in Barbados. This could not have been possible without your unwavering support and unending encouragement. I am grateful for the Skype chats, emails, texts, and all the other modes of communication you used to continually encourage me, support me, and keep me abreast of all the happenings back home.
This journey has been made easier because of your friendship, support and prayers.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION

2 LITERATURE REVIEW
  Differential Item Functioning
    Types of Differential Item Functioning
    DIF vs. Impact
    Frameworks for Examining DIF
      Observed score framework
      The latent variable framework
    SEM-based DIF Detection Methods
    Factor Analytic Models with Ordered Categorical Items
  Mixture Modeling as an Alternative Approach to DIF Detection
  Estimation of Mixture Models
    Class Enumeration
    Information Criteria Indices
    Mixture Model Estimation Challenges
    Purpose of Study

3 METHODOLOGY
  Factor Mixture Model Specification for Latent Class DIF Detection
  Data Generation
  Simulation Study Design
    Research Study 1
    Manipulated Conditions
      Sample size
      Magnitude of uniform DIF
      Ability differences between groups
    Fixed Simulation Conditions
      Test length
      Number of DIF items
      Sample size ratio
      Percentage of overlap between manifest and latent classes
      Mixing proportion
    Study Design Overview
    Evaluation Criteria
    Research Study 2
    Data Analysis
    Evaluation Criteria
    Model Estimation

4 RESULTS
  Research Study 1
    Convergence Rates
    Class Enumeration
      Akaike Information Criteria (AIC)
      Bayesian Information Criteria (BIC)
      Sample-size adjusted BIC (ssaBIC)
  Research Study 2
    Nonconvergent Solutions
    Type I Error Rate
      Magnitude of DIF
      Sample size
      Impact
      Variance components analysis
    Statistical Power
      Magnitude of DIF
      Sample size
      Impact
      Effect of item discrimination parameter values
      Variance components analysis

5 DISCUSSION
  Class Enumeration and Performance of Fit Indices
  Type I Error and Statistical Power Performance
    Type I Error Rate Study
    Statistical Power Study
    Reconciling the Simulation Results
  Limitations of the Study and Suggestions for Future Research
  Conclusion

APPENDIX

A MPLUS CODE FOR ESTIMATING 2-CLASS FMM
B MPLUS CODE FOR DIF DETECTION

LIST OF REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

3-1 Generating population parameter values for reference group
3-2 Fixed and manipulated simulation conditions used in study 1
3-3 Fixed and manipulated simulation conditions used in study 2
4-1 Number of converged replications for the three factor mixture models
4-2 Mean AIC values for the three mixture models
4-3 Mean BIC values for the three mixture models
4-4 Mean ssaBIC values for the three mixture models
4-5 Percentages of converged solutions across study conditions
4-6 Overall Type I error rates across study conditions
4-7 Type I error rates for DIF = 1.0
4-8 Type I error rates for DIF = 1.5
4-9 Type I error rates for sample size of 500
4-10 Type I error rates for sample size of 1000
4-11 Type I error rates for impact of 0 SD
4-12 Type I error rates for impact of 0.5 SD
4-13 Type I error rates for impact of 1.0 SD
4-14 Variance components analysis for Type I error
4-15 Overall power rates across study conditions
4-16 Power rates for DIF of 1.0
4-17 Power rates for DIF of 1.5
4-18 Power rates for sample size N of 500
4-19 Power rates for sample size N of 1000
4-20 Power rates for impact of 0 SD
4-21 Power rates for impact of 0.5 SD
4-22 Power rates for impact of 1.0 SD
4-23 Power rates for DIF detection based on item discriminations
4-24 Variance components analysis for power
results................................................80 9 PAGE 10 LIST OF FIGURES Figure page 2-1 Example of uniform DIF......................................................................................46 2-2 Example of non-uniform DIF...............................................................................46 2-3 Depiction of relationship between y* and y for a dichotomous item....................47 2-4 Diagram depicting specification of the factor mixture model...............................48 10 PAGE 11 Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy A MONTE CARLO INVESTIGATION OF THE PERFORMANCE OF FACTOR MIXTURE MODELING IN THE DETECTION OF DIFFERENTIAL ITEM FUNCTIONING By Mary Grace-Anne Jackman August 2010 Chair: M. David Miller Cochair: Walter Leite Major: Research and Evaluation Methodology This dissertation evaluated the performance of factor mixture modeling in the detection of differential item functioning (DIF). Using a Monte Carlo simulation, the study first investigated the ability of the factor mixture model to recover the number of true latent classes existing in the population. Data were simulated based on the two-parameter logistic (2PL) item response theory (IRT) model for 15 dichotomous items for a two-group, two-class population. In addition, the three simulation conditions sample size, DIF magnitude, and mean latent trait differences were manipulated. One-, two-, and three-class factor mixture models were estimated and compared using three commonly-used likelihood-based fit indices: the Akaike information criterion (AIC), Bayesian information criterion (BIC), and sample size adjusted Bayesian information criterion (ssaBIC). Overall, there was a high level of inconsistency between the indices with respect to the best-fitting model. 
Whereas the AIC tended to over-extract the number of latent classes and under most study conditions selected the three-class model, the BIC erred on the side of parsimony and consistently selected the simpler one-class model. On the other hand, the ssaBIC held the middle ground between these two extremes and tended to favor the "true" two-class mixture model as the sample size or DIF magnitude was increased. In the second phase of the study, the factor mixture approach was assessed in terms of its Type I error rate and statistical power to detect uniform DIF. One thousand data sets were replicated for each of the 12 study conditions. The presence of uniform DIF was assessed via significance tests of the differences in item thresholds across latent classes. Overall, the results were not as encouraging as was hoped. Inflated Type I errors were observed under all of the study conditions, particularly when the sample size and DIF magnitude were reduced.

CHAPTER 1
INTRODUCTION

Given the ubiquity of testing in the United States, coupled with the serious consequences of high-stakes decisions associated with these assessments, it is critical that conclusions drawn about differences among examinee groups be accurate and that the validity of interpretations is not compromised. One way of eliminating the threat of invalid interpretations is to ensure that tests are fair and the items do not disadvantage subgroups of examinees. In addressing the issue of fairness in testing, the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1999) outlines four widely used interpretations of fairness. The first defines a fair test as one that is free from bias that either systematically favors or systematically disadvantages one identifiable subgroup of examinees over another. In the second definition, fairness refers to the belief that equal treatment should be afforded to all examinees during the testing process.
The third definition has sparked some controversy among testing professionals. It defines fairness as the equality of outcomes, as characterized by comparable overall passing rates across examinee subgroups. However, it is widely agreed that while the presence of differences in ability distributions between groups should not be ignored, such differences are not in themselves an indication of test bias. Therefore, a more acceptable definition specifies that in a fair test, examinees possessing equal levels of the underlying trait being measured should have comparable testing outcomes, regardless of group membership. The fourth and final definition of fairness requires that all examinees be afforded an equal, adequate opportunity to learn the tested material (AERA, APA, & NCME, 1999). Clearly, the concept of fairness is a complex, multi-faceted construct, and it is therefore highly unlikely that consensus will be reached on all aspects of its definition, interpretation, and implementation (AERA, APA, & NCME, 1999). However, there is agreement that fairness must be paramount to test developers during the writing and review of items as well as during the administration and scoring of tests. In other words, the minimum fairness requirements are that items are free from bias and that all examinees receive an equitable level of treatment during the testing process (AERA, APA, & NCME, 1999). In developing unbiased test items, one of the primary concerns is ensuring that the items do not function differentially for different subgroups of examinees. This issue of item invariance is investigated through the use of a statistical technique known as differential item functioning (DIF) analysis. DIF detection is particularly critical if meaningful comparisons are to be made between different examinee subgroups.
The fundamental premise of DIF is that if test takers have approximately the same knowledge, then they should perform in similar ways on individual test questions regardless of their sex, race, or ethnicity (ETS, 2008). Therefore, the process of DIF assessment involves the accumulation of empirical evidence to determine whether items function differentially for examinees with the same ability. DIF analysis is widely regarded as the psychometric standard in the investigation of bias and test fairness. Consequently, it has been the topic of extensive research (Camilli & Shepard, 1994; Clauser & Mazor, 1998; Holland & Thayer, 1988; Holland & Wainer, 1993; Millsap & Everson, 1993; Penfield & Lam, 2000; Potenza & Dorans, 1995; Swaminathan & Rogers, 1990). As part of this evolution, several statistical DIF detection techniques have been developed, including non-parametric methods (Holland & Thayer, 1988; Swaminathan & Rogers, 1990), IRT-based methods (Thissen, Steinberg, & Wainer, 1993; Wainer, Sireci, & Thissen, 1991), and SEM-based methods (Jöreskog & Goldberger, 1975; MacIntosh & Hashim, 2003; Muthén, 1985, 1988; Muthén, Kao, & Burstein, 1991). Typically, these methods have all focused on a manifest approach to detecting DIF. In other words, they use pre-existing group characteristics such as gender (e.g., males vs. females) or ethnicity (e.g., Caucasians vs. African Americans) to investigate the occurrence of DIF in a studied item. And while this approach has been widely practiced and accepted as the standard for DIF testing, some believe that this emphasis on statistical DIF analysis has been less successful in determining substantive causes or sources of DIF. For example, in traditional DIF analyses, after an item has been flagged as exhibiting DIF, content experts may subsequently conduct a substantive review to determine the source of the DIF (Furlow, Ross, & Gagné, 2009).
However, while this is acknowledged as an important step, some view its success at truly uncovering the explanatory sources of DIF as minimal, at best (Engelhard, Hansche, & Rutledge, 1990; Gierl et al., 2001; O'Neill & McPeek, 1993; Roussos & Stout, 1996). More specifically, inconsistencies in agreement between reviewers' judgments, or between reviewers and DIF statistics, make it difficult to form any definitive conclusions explaining the occurrence of DIF (Engelhard et al., 1990). Others suggest that the inability of traditional DIF methods to unearth the causes of DIF arises because the pre-selected grouping variables on which these analyses are based are often not the real dimensions causing DIF. Rather, they view these a priori variables as mere proxies for educational (dis)advantage attributes that, if identified, could better explain the pattern of differential responses among examinees (Cohen & Bolt, 2005; De Ayala, Kim, Stapleton, & Dayton, 2002; Dorans & Holland, 1993; Samuelsen, 2005, 2008; Webb, Cohen, & Schwanenflugel, 2008). Finally, the assumption of inherent homogeneity in responses among examinees within the subgroups has also been cited as another weakness of the traditional DIF approach (De Ayala, 2009; De Ayala, Kim, Stapleton, & Dayton, 2002; Samuelsen, 2005). This view has been supported by the observation that even within a seemingly homogeneous manifest group (e.g., Hispanic or Black examinees) there can be high levels of heterogeneity, resulting in segments which respond differently to an item than other examinees in that group. Using race as an example, De Ayala (2009) noted that a racial category such as "Asian American" would lump together examinees of Filipino, Korean, Indonesian, and Taiwanese descent as a single homogeneous group, ignoring their intra-manifest variability.
As a result, De Ayala (2009) argues that this assumed homogeneity in traditional manifest DIF assessments may lead to false conclusions about the existence or magnitude of DIF. As an alternative to the traditional manifest approach, a latent mixture conceptualization of DIF has been proposed. Rather than focusing on a priori examinee characteristics, this method characterizes DIF as the result of unobserved heterogeneity in the population. A latent mixture conceptualization relaxes the requirement that associates DIF with a specific preexisting variable, as well as the assumption that manifest groups are homogeneous. Instead, examinees are classified into latent subpopulations based on their differential response patterns. These latent subpopulations, or latent classes, arise as a result of qualitative differences (e.g., use of problem-solving strategies, response styles, or level of cognitive thinking) among examinee subgroups (Mislevy et al., 2008; Samuelsen, 2005). Interestingly, more than a decade before the latent mixture conceptualization had been proposed, Angoff (1993) also voiced his concern about the inability of traditional methods to provide substantive interpretations of DIF. In offering evidence supporting this view, Angoff (1993) reported that, in attempting to account for DIF, test developers are often confronted by "DIF results that they cannot understand; and no amount of deliberation seems to explain why some perfectly reasonable items have large DIF values" (Angoff, 1993, p. 19). This dissertation proposes the use of factor mixture models as an alternative approach to investigating heterogeneity in item parameters when the source of heterogeneity is unobserved. Factor mixture modeling blends factor analytic (Thurstone, 1947) and latent class (Lazarsfeld & Henry, 1968) models, two structural equation modeling (SEM) based methods that provide a unique but complementary approach to explaining the covariation in the data.
The latent class model accounts for the item relationships by assuming the existence of qualitatively different subpopulations or latent classes (Bauer & Curran, 2004). However, since class membership is unobserved, individuals are categorized based not on an observed grouping variable but rather by using probabilities to determine their most likely latent class assignment. The factor analytic model, on the other hand, assumes the existence of an underlying, continuous factor structure in explaining the commonality among the item responses. As part of the factor mixture estimation, the class-specific item parameters estimated from the factor analytic model can be compared to determine their level of measurement non-invariance or differential functioning. A significant DIF coefficient provides evidence that the item is functioning differentially among latent classes after controlling for the latent ability trait. In estimating factor mixture models, one important decision to be made is the determination of the number of latent classes. However, while there are several fit criteria available to assist the researcher in making this determination, such as the Akaike information criterion (AIC; Akaike, 1987), the Bayesian information criterion (BIC; Schwarz, 1978), the sample size adjusted BIC (ssaBIC; Sclove, 1987), and the Lo-Mendell-Rubin adjusted likelihood ratio test (LMR aLRT; Lo, Mendell, & Rubin, 2001), there is seldom perfect agreement among these fit indices. Therefore, practitioners are cautioned against applying the mixture approach without having theoretical support for their hypothesis of unobserved heterogeneity in the population of interest. Generally, when group membership is known a priori, SEM-based DIF detection models can be specified using either a multiple indicators, multiple causes (MIMIC) model or a multiple-group CFA approach (Allua, 2007).
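As a concrete sketch of how these three indices trade off fit against complexity, the snippet below computes them from a model's log-likelihood, number of free parameters, and sample size. The log-likelihood values and parameter counts are hypothetical, not results from this study:

```python
import math

def aic(loglik, k):
    """Akaike information criterion: -2lnL penalized by 2 per free parameter."""
    return -2 * loglik + 2 * k

def bic(loglik, k, n):
    """Bayesian information criterion: the penalty grows with log(sample size)."""
    return -2 * loglik + k * math.log(n)

def ssabic(loglik, k, n):
    """Sample size adjusted BIC: BIC with n replaced by (n + 2)/24 (Sclove, 1987)."""
    return -2 * loglik + k * math.log((n + 2) / 24)

# Hypothetical 1-, 2-, and 3-class solutions: (log-likelihood, free parameters)
solutions = {1: (-9850.0, 30), 2: (-9790.0, 47), 3: (-9782.0, 64)}
n = 1000
for c, (ll, k) in sorted(solutions.items()):
    print(f"{c} classes: AIC={aic(ll, k):.1f}  BIC={bic(ll, k, n):.1f}  "
          f"ssaBIC={ssabic(ll, k, n):.1f}")
# Lower values indicate better relative fit; the indices can disagree because
# their per-parameter penalties (2, log n, log((n+2)/24)) differ.
```

Because log n exceeds 2 for any realistic sample size, while log((n+2)/24) falls between the two, the BIC penalizes extra classes most heavily and the ssaBIC occupies a middle ground, which is consistent with the behavior of the indices reported in this study.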
In this paper, the factor mixture model will be specified using the mixture analog of the manifest multiple-group CFA. Since the observed grouping variable is replaced in this specification by a latent categorical variable, not only can the heterogeneity of item parameters be examined, but a profile of the latent, unobserved subpopulations can be examined as well. Finally, if needed, covariates can also be included in the model to help explain the composition of the latent classes. The primary purpose of this dissertation was to explore the utility of factor mixture models in the detection of DIF. In a 2006 paper, Bandalos and Cohen commented that while the estimation of factor mixture models had previously been presented in the IRT literature, the models were not as frequently utilized with SEM-based models. However, programming enhancements to software packages such as Mplus (Muthén & Muthén, 1998-2008) have increased the likelihood that SEM-based factor mixture models will be more commonly estimated in practice (Bandalos & Cohen, 2006). In this study, the performance of the factor mixture model was evaluated primarily in terms of its ability to produce high convergence rates, control the Type I error rate, and provide adequate statistical power under a variety of realistic study conditions. The simulated conditions examined sample size, magnitude of DIF, and similarity of latent ability means. In sum, this Monte Carlo simulation was conducted to determine the conditions under which a factor mixture approach performs best when assessing item non-invariance and, ultimately, whether its use in practice should be recommended. SEM-based approaches are not commonly used in the discipline of educational testing, which traditionally has been considered the domain of techniques developed within an IRT framework.
And although the equivalence between factor analytic and IRT approaches for categorical items has long been established and applied repeatedly in the literature (Bock & Aitkin, 1981; Finch, 2005; Glöckner-Rist & Hoijtink, 2003; Moustaki, 2000; Muthén, 1985; Muthén & Asparouhov, 2002; Takane & de Leeuw, 1987), these methods are still utilized primarily within their respective disciplines of origin. Therefore, despite its obvious potential, it is unlikely that this latent conceptualization will gain widespread acceptance unless the applied community is convinced that (i) framing DIF with respect to latent qualitative differences and (ii) using an SEM-based approach are both worthwhile, practical options. In sum, if a mixture approach can be shown to add substantial value in an area such as DIF detection, then research of this kind will contribute positively to bridging the gap between SEM-based and IRT-based methods in the area of testing and measurement.

CHAPTER 2
LITERATURE REVIEW

Differential Item Functioning

From a historical standpoint, the term "item bias" was coined in the 1960s, during an era when a public campaign for social equality, justice, and fairness was being waged on all fronts. The term referenced studies designed to investigate the claim that "the principal, if not the sole, reason for the great disparity in test performance between Black and Hispanic students and White students on tests of cognitive ability is that the tests contain items that are outside the realms of the minority cultures" (Angoff, 1993, p. 3). Concerns about bias in testing were particularly relevant in cases where the results were used in high-stakes decisions involving job selection and promotions, certification, licensure, and achievement. What followed in the 1970s and 1980s was a series of studies using rudimentary methods based on classical test theory (CTT) techniques (Gelin, 2005).
One of these early methods involved an analysis of variance (ANOVA) approach (Angoff & Sharon, 1974; Cleary & Hilton, 1968) and focused on the interaction between group membership and item performance as a means of identifying outliers and detecting potentially biased items. Another method, the delta-plot technique (Angoff, 1972; Thurstone, 1925), used plots of the transformed CTT index of item difficulty (p-values) for each group as a means of detecting biased items. However, the main weakness of these early methods was that they failed to control for the underlying construct (e.g., ability) that the test purported to measure. Another criticism was that because these methods considered only item difficulty, their implicit assumption of equally discriminating items led to an increase in the incidence of false negative and false positive identifications. During this period, the term "bias" also came under heavy scrutiny. It was felt that the strong emotional connotation which the word carried was creating a semantic rift between the technical testing community and the general public. As a result of this debate, the term differential item functioning (DIF) was proposed as a less value-laden, more neutral replacement (Angoff, 1993; Cole, 1993). DIF is defined as the accumulation of empirical evidence to investigate whether there is a difference in performance between comparable groups of examinees (Hambleton, Swaminathan, & Rogers, 1991). More specifically, it refers to a difference in the probability of correctly responding to an item between two subgroups of examinees of the same ability, or groups matched by their performance on the test representing the underlying construct of interest (Kamata & Binici, 2003; Potenza & Dorans, 1995). These two groups are referred to as the focal group, those examinees expected to be disadvantaged by the item(s) of interest on the test (e.g.,
females or African Americans), and the reference group, those examinees expected to be favored by the DIF items (e.g., males or Caucasians). Since the introduction of the early CTT methods, a variety of additional techniques have been introduced for the detection of DIF (Clauser & Mazor, 1998). These include the nonparametric Mantel-Haenszel chi-square method (Holland & Thayer, 1988; Mantel & Haenszel, 1959), the standardization method (Dorans & Holland, 1993), logistic regression (Swaminathan & Rogers, 1990), likelihood ratio tests (Wainer, Sireci, & Thissen, 1991), and item response theory (IRT) approaches comparing parameter estimates (Lord, 1980) or estimating the areas between item characteristic curves (Raju, 1988, 1990). Additionally, though not as popular in the testing and measurement discipline, SEM-based methods such as the multiple-groups approach (Lee, 2009; Sörbom, 1974) and the MIMIC model (Gallo, Anthony, & Muthén, 1994; Muthén, 1985, 1988) have also been used to detect DIF. The fact that these methods involve some level of conditioning on the latent construct of interest differentiates them from the earlier CTT ANOVA and p-value approaches. Furthermore, DIF assessment has now emerged as an essential element in the investigation of test validity and fairness, and for testing companies such as Educational Testing Service (ETS), the use of DIF analysis is a critical part of the test validation process. Cole's (1993) comments, in which she describes DIF as "a technical tool in ETS to help us assure ourselves that tests are as fair as we can make them," underscore the importance of this analytical approach as a standard practice in the design and administration of fair tests.

Types of Differential Item Functioning

There are two primary types of DIF: uniform and non-uniform DIF.
Uniform DIF occurs when the probability of correctly responding to an item, or endorsing a response category, is consistently higher or lower for either the reference or focal group across all levels of the ability scale. As shown in Figure 2-1, when uniform DIF is present in an item, the two ICCs do not cross at any point along the range of ability. In this example, the probability of responding correctly to this dichotomous item is uniformly lower for group 2 than for group 1 along the entire range of latent ability. Conversely, when there is an interaction between the latent ability trait and group membership and the ICCs cross at some point along the ability range, this is referred to as non-uniform DIF. This means that for one portion of the scale, one group is more likely to correctly respond to or endorse a response category, while this advantage is reversed for the remaining portion of the scale. An example of this is shown in Figure 2-2. It is important to note that while some methods can detect both types of DIF, others are capable of detecting uniform DIF only.

DIF vs. Impact

As was mentioned previously, DIF occurs when there is a significant difference in the probability of answering an item correctly, or endorsing an item category, between groups of the same ability level (Wainer, 1993). On the other hand, impact refers to legitimate group differences in the probability of getting an item correct or endorsing an item category (Wainer, 1993). Therefore, in distinguishing between these two concepts, it is important to note that while DIF assessment compares groups of examinees with the same trait level, impact refers to group differences without controlling or matching on the construct of interest.

Frameworks for Examining DIF

Traditionally, DIF is examined within one of two frameworks: (i) the observed score framework and (ii) the latent variable framework.
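The uniform and non-uniform patterns described above can be illustrated numerically with the 2PL model used to generate data in this study. The parameter values below are hypothetical and chosen only to make the behavior of the curves visible:

```python
import math

def p_2pl(theta, a, b):
    """2PL item response function: probability of a correct response
    at ability theta for an item with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

thetas = [-2.0, -1.0, 0.0, 1.0, 2.0]

# Uniform DIF: equal discriminations, difficulties shifted by 0.5.
# The focal group's curve is lower at every ability level; the ICCs never cross.
for t in thetas:
    assert p_2pl(t, a=1.0, b=0.0) > p_2pl(t, a=1.0, b=0.5)

# Non-uniform DIF: discriminations differ, so the ICCs cross (here at theta = 0).
# Below the crossing the flatter curve is higher; above it, the steeper one is.
assert p_2pl(-1.0, a=0.8, b=0.0) > p_2pl(-1.0, a=1.5, b=0.0)
assert p_2pl(1.0, a=0.8, b=0.0) < p_2pl(1.0, a=1.5, b=0.0)
```

With equal discriminations the two curves differ only by a horizontal shift, so one group is disadvantaged everywhere; with unequal discriminations the sign of the group difference reverses at the crossing point, which is what makes non-uniform DIF undetectable for methods that test only a difficulty (threshold) difference.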
Observed score framework

The observed score framework includes methods that adjust for ability by conditioning on some observed or manifest variable designed to serve as a proxy for the underlying ability trait (Ainsworth, 2007). When an observed variable such as the total test score is used, it is assumed that this internal criterion is an unbiased estimate of the underlying construct. Therefore, as an alternative to the aggregate test score, a purified form of the total test score may also be used as the internal matching criterion. This means that when DIF is detected, the total test score is adjusted by dropping those items identified as displaying DIF and calculating a revised total score. This iterative process, which is used to refine the criterion measure and limit the impact of DIF contamination, is known as purification (Kamata & Vaughn, 2004). While internal criterion measures are most commonly used, a set of items that were not part of the administered test could also function as an external criterion measure. An external criterion may be an adequate choice, particularly when there is a high proportion of DIF items in the scale or the total score is deemed to be an inappropriate measure (Gelin, 2005). However, external measures are seldom used in practice due to the difficulty of finding a set of items that would measure the latent trait more appropriately than the actual test that was designed specifically for that use (Gelin, 2005; Shih & Wang, 2009). Procedures that use the observed score framework include contingency table methods such as the Mantel-Haenszel (MH) procedure (Holland & Thayer, 1988), generalized linear models such as logistic regression (Swaminathan & Rogers, 1990) or ordinal regression (Zumbo, 1999), and the standardization method (Dorans & Kulick, 1983).
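The purification loop described above can be expressed schematically. In the sketch below, `detect_dif` is a hypothetical stand-in for any observed-score DIF test (e.g., MH or logistic regression) applied while matching examinees on the total over the given criterion items:

```python
def purify(items, detect_dif, max_iter=10):
    """Iterative purification of the matching criterion (schematic sketch).

    items: identifiers of all scored items.
    detect_dif(item, criterion_items): any DIF test returning True if `item`
        is flagged when examinees are matched on the total over criterion_items.
    Returns the stable set of flagged items; the purified total score is then
    the sum over the remaining items.
    """
    flagged = set()
    for _ in range(max_iter):
        # Recompute the matching criterion without the currently flagged items,
        # then re-test every item against this purified criterion.
        criterion = [i for i in items if i not in flagged]
        new_flags = {i for i in items if detect_dif(i, criterion)}
        if new_flags == flagged:  # flagged set has stabilized
            break
        flagged = new_flags
    return flagged

# Toy illustration: a stand-in test that always flags item 3.
flagged = purify(list(range(15)), lambda item, criterion: item == 3)
print(flagged)  # {3}
```

The loop terminates when re-testing against the purified criterion flags exactly the same items as the previous pass; the `max_iter` guard prevents cycling if the flagged set oscillates.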
The latent variable framework

Unlike the procedures implemented within the observed score framework, these techniques do not condition on an observed, manifest measure such as the total test score or a purified total test score. Instead, latent variable DIF detection methods involve the use of an assumed underlying latent trait such as ability. The two main classes of methods which use the latent variable framework are (i) item response theory (IRT) approaches and (ii) structural equation modeling (SEM) based approaches. IRT methods include techniques in which comparisons are made between item parameters (Lord, 1980) or between item characteristic curves (Raju, 1988, 1990), as well as likelihood ratio test methods (IRT-LRT; Thissen, Steinberg, & Wainer, 1988). However, since the focus of this dissertation is on the use of an SEM-based approach to detect DIF, no detailed description of the IRT-based DIF detection methods will be given.

SEM-based DIF Detection Methods

While it is possible to specify an exploratory factor model in which the measurement model relating the item responses to the underlying latent factors is unknown, a CFA approach in which the factor structure has been specified is more often used in DIF detection. Typically, a CFA model is formulated as:

Y = ν + Λη + ε (1)
η = α + ζ (2)

where Y is a p×1 vector of scores on the p observed variables, ν is a p×1 vector of measurement intercepts, Λ is a p×m matrix of factor loadings, η is an m×1 vector of factor scores on the m latent factors, and ε is a p×1 vector of residuals or measurement errors representing the unique portion of the observed variables not explained by the common factor(s). It is assumed that the εs have zero mean and are uncorrelated not only with the ηs but with each other as well. Additionally, α denotes an m×1 vector of factor means and ζ is an m×1 vector of factor residuals.
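As a small numerical sketch of the covariance structure this model implies, the snippet below builds Σ = ΛΨΛ′ + Θ for a one-factor model with three indicators. All parameter values are hypothetical:

```python
# One-factor CFA (m = 1): lam holds the loadings (Lambda), psi the factor
# variance (Psi, a scalar here), and theta the error variances (the diagonal
# of Theta, off-diagonals zero because the errors are uncorrelated).
lam = [0.8, 0.7, 0.6]
psi = 1.0
theta = [0.36, 0.51, 0.64]

def implied_cov(lam, psi, theta):
    """Model-implied covariance matrix: Sigma = Lambda Psi Lambda' + Theta."""
    p = len(lam)
    return [[lam[i] * psi * lam[j] + (theta[i] if i == j else 0.0)
             for j in range(p)]
            for i in range(p)]

sigma = implied_cov(lam, psi, theta)
# Diagonal entries equal lam_i^2 * psi + theta_i; here each variance is 1.0
# because the error variances were chosen as 1 - lam_i^2.
# Off-diagonal entries equal lam_i * psi * lam_j, e.g. sigma[0][1] = 0.56.
```

Estimation then searches for the parameter values that make this implied matrix as close as possible to the observed sample covariance matrix under a chosen discrepancy function.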
The model-implied mean and covariance structures are formulated as follows:

μ = ν + Λα (3)
Σ(θ) = ΛΨΛ′ + Θ (4)

where μ is a p×1 vector of means of the observed variables, α is an m×1 vector of factor means, Σ(θ) is a p×p matrix of variances and covariances of the p observed variables, Ψ is an m×m covariance matrix of the latent factors, and Θ is a square p×p matrix of variances and covariances for the measurement errors. For a single-group CFA model, it is also assumed that the independent observations are drawn from a single, homogeneous population. The estimates of the parameters are found by minimizing the discrepancy between the observed covariance matrix S and the model-implied covariance Σ(θ). The general form of the discrepancy function is denoted by F(S, Σ(θ)), but there are several different types of functions that can be used in the estimation process. For example, if multivariate normality of the data is assumed, then the maximum likelihood (ML) estimates of the parameters are obtained by minimizing the following:

F_ML = log|Σ(θ)| + tr(SΣ(θ)⁻¹) − log|S| − p + (ȳ − μ(θ))′Σ(θ)⁻¹(ȳ − μ(θ)) (5)

where S is the sample covariance matrix, Σ(θ) is the model-implied covariance matrix, and p is the number of observed variables. As presented, this formulation of the factor model is described for items with a continuous response format. However, since the focus of this dissertation is on DIF assessment with dichotomous items, what follows is a respecification of the model to accommodate categorical items.

Factor Analytic Models with Ordered Categorical Items

In educational measurement, the ability trait is typically measured by dichotomous or polytomous items rather than continuous items. One method of dealing with categorical item responses is to specify a threshold model, also known as the latent response variable (LRV) formulation (Muthén & Asparouhov, 2002). The LRV formulation assumes that underlying each observed item response y is a continuous and normally distributed latent response variable y*. This continuous latent variable can be thought of as a response tendency, with higher values indicating a greater propensity of answering the item correctly. Further, it is assumed that when this tendency is sufficiently high, thereby
This continuous latent variable can be thought of as a response tendency, with higher values indicating a greater propensity to answer the item correctly. Further, it is assumed that when this tendency is sufficiently high, thereby exceeding a specific threshold value, the examinee will answer the item correctly. Likewise, if it falls below the threshold, an incorrect response is observed. Therefore, based on this formulation, the observed item responses can be viewed as discrete categorizations of the continuous latent variables. The relationship between the two variables y and y* is represented by the following nonlinear function:

y = c, if τ_c < y* ≤ τ_{c+1} (6)

where c denotes the response category of y and the threshold structure is defined by τ_0 = −∞ < τ_1 < τ_2 < … < τ_C = +∞ for C categories with C − 1 thresholds. In the case of binary items, the mapping of y₁* onto y₁ is expressed as:

y₁ = 1, if y₁* > τ₁
y₁ = 0, if y₁* ≤ τ₁

where τ₁ denotes the threshold parameter for test item y₁. This relationship is illustrated in Figure 2-3. Because of the LRV formulation, the measurement component of the model, which relates, in this case, the continuous latent response variables to the latent factor, is respecified as:

y*_ij = ν_j + λ_j η_i + ε_ij (7)

where y*_ij is individual i's latent response to item j. The distributional assumptions about the measurement errors determine the appropriate link function to be selected. For example, if it is assumed that the measurement errors are normally distributed, then the probit link function, that is, the inverse of the cumulative normal distribution function, is used. As a result, the thresholds and factor loadings are interpreted as probit coefficients in the linear probit regression equation. The alternative is to assume a logistic distribution for the measurement errors, which allows the coefficients to be interpreted either in terms of logits or converted to changes in odds.
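The dichotomization step of the LRV formulation can be sketched as follows (Python; the threshold value and sample size are arbitrary illustrative choices). Each observed response is 1 exactly when the latent response tendency exceeds the item threshold, so with standard normal y* the proportion of correct responses approximates 1 − Φ(τ).

```python
import numpy as np

def dichotomize(y_star, tau):
    """LRV rule: observed y = 1 if the latent response y* exceeds threshold tau."""
    return (np.asarray(y_star) > tau).astype(int)

rng = np.random.default_rng(0)
tau = 0.5                           # illustrative threshold
y_star = rng.normal(size=10_000)    # standard normal latent responses
y = dichotomize(y_star, tau)
print(y.mean())                     # approximates 1 - Phi(0.5), about 0.31
```
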
Under the LRV formulation, the single-factor model for a continuous latent trait measured by binary outcomes is expressed as in Equation 7. Therefore, the conditional probability of a correct response as a function of η_i is:

P(y_ij = 1 | η_i) = P(y*_ij > τ_j | η_i) = 1 − F[(τ_j − ν_j − λ_j η_i) V(ε_ij)^(−1/2)] = F[(−τ_j + ν_j + λ_j η_i) V(ε_ij)^(−1/2)] (8)

where V(ε_ij) is the residual variance and F can be either the standard normal or the logistic distribution function, depending on the distributional assumptions about the ε_ij's (Muthén & Asparouhov, 2002). In addition to the LRV, latent variable models with categorical variables can also be presented using an alternative formulation. The conditional probability curve formulation focuses on directly modeling the nonlinear relationship between the observed y's and the latent trait θ_i as:

P(y_ij = 1 | θ_i) = F[a_j(θ_i − b_j)] (9)

where a_j is the item discrimination, b_j is the item difficulty, and F is either the standard normal or logistic distribution function. In their 2002 paper, Muthén and Asparouhov illustrate the equivalence of results between these two conceptual formulations of factor analytic models with categorical outcomes. The authors showed that the two formulations can be equated as:

F[(−τ_j + ν_j + λ_j η_i) V(ε_ij)^(−1/2)] = F[a_j(θ_i − b_j)] (10)

where, as previously indicated, F is either the standard normal or logistic distribution depending on the distributional assumptions about the ε_ij. However, it should be noted that in the case of factor mixture modeling with categorical variables, the default estimation method in Mplus (Muthén & Muthén, 1998-2008) is the robust maximum likelihood (MLR) estimator and the default distribution of F is the logistic distribution. To allow for the estimation of the thresholds, the intercepts in the measurement model are assumed to be zero. As a result, the factor analytic parameters can be converted to IRT parameters, for a factor with mean α and variance ψ, using:

a_j = λ_j ψ^(1/2) / V(ε_ij)^(1/2) and b_j = (τ_j − λ_j α) / (λ_j ψ^(1/2)) (11)

Additionally, to ensure model identification, it is necessary to assign a scale to the latent trait.
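One common form of the conversion in Equation 11 can be sketched as a small helper (Python). Note that the exact residual-variance term depends on the link function and parameterization chosen (Kamata & Bauer, 2008), so the `resid_var` argument below is a simplifying placeholder, not the definitive Mplus conversion.

```python
import math

def factor_to_irt(lam, tau, alpha=0.0, psi=1.0, resid_var=1.0):
    """Convert a loading/threshold pair to IRT discrimination/difficulty.
    alpha and psi are the factor mean and variance; resid_var stands in
    for the link-dependent residual variance (a simplifying assumption)."""
    a = lam * math.sqrt(psi) / math.sqrt(resid_var)
    b = (tau - lam * alpha) / (lam * math.sqrt(psi))
    return a, b

# With a standardized factor (alpha = 0, psi = 1) the conversion reduces
# to a = lam / sqrt(resid_var) and b = tau / lam.
a, b = factor_to_irt(lam=1.2, tau=0.6)
print(a, b)  # -> 1.2 0.5
```
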
One method of setting the scale of the latent trait is to standardize it by fixing the mean at zero and the variance at one, that is, α = 0 and ψ = 1. In this case, the factor loadings and thresholds can be converted to item discriminations and item difficulties using the following expressions (Muthén & Asparouhov, 2002):

a_j = λ_j / V(ε_ij)^(1/2) and b_j = τ_j / λ_j (12)

The ease with which estimates of the factor analysis parameters can be converted to the more familiar IRT scale should increase the interpretability and utility of the results for applied researchers (Fukuhara, 2009). An alternative method of setting the scale of the latent factor is to fix one loading per factor to one. As a result, the conversion formulae differ and, by extension, the magnitude of the parameter estimates is affected as well (Bontempo, 2006; Kamata & Bauer, 2008). Therefore, when running these models with factor analytic programs such as Mplus, the user should be aware of the default scale-setting method, since this invariably affects the parameter conversion formulae.

Mixture Modeling as an Alternative Approach to DIF Detection

Traditional DIF detection methods assume a manifest approach in which examinees are compared on the basis of demographic classifications such as gender (females vs. males) or race (African Americans vs. Caucasians). While this approach has a long history and has been used successfully in the past to assess DIF, emerging research has suggested that this perspective may be limited in scope (Cohen & Bolt, 2005; De Ayala et al., 2002; Mislevy et al., 2008; Samuelsen, 2005, 2008). As an alternative to the traditional manifest DIF approach, a latent DIF conceptualization, rooted in latent class and mixture modeling methods, has been proposed (Samuelsen, 2005, 2008).
In this conceptualization of DIF, rather than focusing on a priori examinee characteristics, the approach assumes that the underlying population consists of a mixture of heterogeneous, unidentified subpopulations known as latent classes. These latent classes exist because of qualitative differences among examinees (e.g., different learning strategies, different cognitive styles). One technique that can be used to examine DIF from a latent perspective is factor mixture modeling (FMM). Factor mixture models result from merging the factor analysis model and the latent class model, producing a hybrid model with two types of latent variables: continuous latent factors and categorical unobserved classes (Lubke & Muthén, 2005, 2007; Muthén, Asparouhov, & Rebollo, 2006). The simultaneous inclusion of these two types of latent variables allows both for the exploration of unobserved heterogeneity that may exist in the population and for the examination of the underlying dimensionality within these latent groups (Lubke & Muthén, 2005). As a result, a primary advantage of factor mixture modeling is its flexibility in allowing a wider range of modeling options. For instance, models can be specified with multiple factors and multiple latent classes, and the within-class models can vary in the complexity of their relationships not only with the latent factors but also with observed combinations of continuous and categorical covariates (Allua, 2007; Lubke & Muthén, 2005, 2007; McLachlan & Peel, 2000). However, it is important to note that as more specifications are introduced to a model, not only does the level of model complexity increase but so does the computational intensity of the estimation process.
Therefore, it is recommended that researchers be guided by substantive theory, not only in supporting their hypothesis of population heterogeneity but also with regard to the complexity of the specification of their mixture models (Allua, 2007; Jedidi, Jagpal, & DeSarbo, 1997). Previous applied studies highlight several fields in which mixture modeling has been applied successfully to investigate population heterogeneity (Bauer & Curran, 2004; Jedidi et al., 1997; Kuo et al., 2008; Lubke & Muthén, 2005, 2007; Lubke & Neale, 2006; Muthén & Asparouhov, 2006; Muthén, 2006). One area in which factor mixture models have been utilized with some success is substance abuse research (Kuo et al., 2008; Muthén et al., 2006). In their research on alcohol dependency, Kuo et al. (2008) compared a factor mixture model against a latent class model and a factor model to determine which of the three best explained the observed patterns of alcohol dependence symptomology. They found that while a pure factor analytic model provided an unsatisfactory solution, the latent class approach provided a better fit to the alcohol dependency data. However, the single-factor, three-class factor mixture model provided the best fit to the data and best accounted for the covariation in the pattern of symptoms and the heterogeneity of the population. In another study from the substance abuse literature, Muthén et al. (2006) compared two types of factor mixture models to factor analysis and latent class approaches in analyzing the responses of 842 pairs of male twins to 22 alcohol criteria items. The findings showed that both factor mixture models fit the data well and explained heritability both with regard to the underlying dimensional structure of the data and the latent class profiles of the heterogeneous population (Muthén et al., 2006).
With regard to DIF, factor mixture models may be used to investigate item parameter differences between latent classes of subpopulations of examinees. In this conceptualization of DIF, the unobserved latent classes represent qualitatively different groups of individuals whose item responses function differentially across the classes (Bandalos & Cohen, 2006). To allow for the specification of these models, a categorical latent variable is integrated into the common factor model specified in Equations 1 and 2. As a result, the K-class factor mixture model is expressed as:

y_ik = ν_k + Λ_k η_ik + ε_ik (13)
η_ik = α_k + ζ_ik (14)

where the subscript k indicates the parameters that can vary across the latent classes. Figure 2-4 provides a depiction of a factor mixture model in which a unidimensional latent factor is measured by five observed items and there are K latent classes in the population. In the diagram, the relationships are specified as follows:

- The arrows from the latent factor to the item responses (from η to the Ys) represent the factor loadings, the parameters measuring the relationship between the latent factor and the items.
- The arrows from the latent class variable to the item responses (from Ck to the Ys) are the class-specific item thresholds, conditional on each of the K latent classes.
- The broken-line arrows from the latent class variable to the factor-loading arrows indicate that these loadings are also class-specific and therefore can vary across the K latent classes.
- The arrow from the latent class variable to the latent factor (i.e., from Ck to η) allows the factor means and/or factor variances to be class-specific as well.

Since the model parameters are allowed to be class-specific, that is, both item thresholds and factor loadings can be specified as non-invariant, this specification allows for testing of both uniform and non-uniform DIF.
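A two-class instance of Equations 13 and 14 with class-invariant loadings and class-specific thresholds, the configuration that produces uniform DIF, can be sketched as follows (Python; all parameter values, the equal class proportions, and the five-item design are illustrative assumptions, not the values used in this study).

```python
import numpy as np

rng = np.random.default_rng(1)

n, n_items = 2000, 5
lam = np.array([1.0, 0.8, 0.9, 0.7, 1.1])       # loadings, invariant across classes
tau = np.array([[0.0, -0.5, 0.2, 0.1, -0.3],     # class 1 thresholds
                [0.0, -0.5, 0.9, 0.1, -0.3]])    # class 2: item 3 shifted (uniform DIF)
alpha = np.array([0.0, 0.5])                     # class-specific factor means

cls = rng.integers(0, 2, size=n)                 # latent class membership
eta = alpha[cls] + rng.normal(size=n)            # Equation 14: eta = alpha_k + zeta
y_star = eta[:, None] * lam + rng.normal(size=(n, n_items))  # latent responses
y = (y_star > tau[cls]).astype(int)              # class-specific threshold cut
print(y.shape)  # -> (2000, 5)
```

Only the third column of `tau` differs between classes, so only item 3 functions differentially, and it does so uniformly (a threshold shift with identical loadings).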
The focus of this dissertation is on assessing the performance of the factor mixture model in the detection of uniform DIF. Therefore, for this specification, while the item thresholds are allowed to vary across the levels of the latent class variable, the factor loadings are constrained equal across the K classes. In Mplus, the implementation of the factor mixture model for DIF detection can be conceptualized as a multiple-group approach in which DIF is tested across latent classes rather than manifest groups. Therefore, using Equations 1 and 2, the CFA mixture model can be reformulated as:

Y_k = ν_k + Λ_k η_k + ε_k and η_k = α_k + ζ_k (15)

where the parameters are as previously defined for each of the k = 1, 2, …, K latent classes. Once again, it is assumed that the measurement errors have an expected value of zero and are independent of the latent trait(s) and of each other. Similarly, the model-implied mean and covariance structures for the observed variables in each of the k = 1, 2, …, K latent classes can be defined as:

μ_k = ν_k + Λ_k α_k and Σ_k = Λ_k Ψ_k Λ_k′ + Θ_k (16)

Three main strategies for using CFA-based approaches to investigate measurement invariance have been proposed: (i) a constrained-baseline approach (Stark, Chernyshenko, & Drasgow, 2006), (ii) a free-baseline approach (Stark et al., 2006), and (iii) a more recent approach that tests the significance of the threshold differences via the Mplus model constraint feature (Clark, 2010; Clark et al., 2009). In the first two approaches, DIF testing is conducted via a series of tests of hierarchically nested models (Lee, 2009). The constrained-baseline approach begins with a baseline model in which all parameters are constrained equal across groups; the parameters of the studied item(s) are then freed one at a time. The free-baseline approach, on the other hand, starts with a model in which all parameters (except those needed for model identification) are freely estimated across groups.
After this model has been estimated, the parameters of the item(s) of interest are constrained in a sequential manner. With either of these two approaches, a series of nested comparisons is conducted to determine the level of measurement invariance. For example, when testing for uniform DIF, the baseline model is compared with models formed by individually constraining or releasing the item thresholds of interest across the groups/latent classes. Stark et al. (2006) noted that whereas the constrained-baseline approach is typically used in IRT research, the free-baseline approach is more common with CFA-based methods. However, one disadvantage of these baseline approaches is that they require two models (i.e., baseline and constrained or augmented) to be fitted to the data, which increases the complexity of the model estimation procedure. Unlike the two previous methods, which use a nested-models approach, the third method does not require the specification of two sets of models. This approach, credited to Tihomir Asparouhov of Mplus, has recently been used in factor mixture studies conducted by Clark (2010) and Clark et al. (2009). The Mplus (Muthén & Muthén, 1998-2008) implementation of this approach, as described by Clark (2010), is as follows. The thresholds of all items, except those of a referent needed for identification, are allowed to vary freely across the latent classes. Next, the Mplus model constraint option is invoked to create a set of new variables. Each new variable defines a threshold difference across classes for one of the items to be tested for DIF. For example, the estimated threshold difference for item 6 may be defined as: dif_it6 = t2_i6 − t1_i6, where t2_i6 and t1_i6 are the user-supplied variable names for the threshold of item 6 in class 2 and class 1, respectively.
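The significance test applied to such a threshold difference can be sketched as a simple Wald-type z test (Python; the estimate and standard error below are hypothetical numbers, and `dif_it6` merely mirrors the user-supplied Mplus variable name from the example above).

```python
import math

def wald_z_test(diff, se):
    """Wald test of H0: threshold difference = 0.
    Returns the z statistic and the two-sided p-value (normal reference)."""
    z = diff / se
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p

# Hypothetical estimate for dif_it6 = t2_i6 - t1_i6 and its standard error.
z, p = wald_z_test(diff=0.45, se=0.18)
print(round(z, 2), p < 0.05)  # -> 2.5 True
```
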
The creation of the 14 threshold-difference equations allows for testing the item thresholds across classes via a Wald test to determine whether the differences are significantly different from zero. A significant p-value provides evidence that the item is functioning differentially, while a non-significant result indicates that the item is DIF-free. In the case where the referent item is not known to be invariant across classes, a series of tests is undertaken in which each item's thresholds are successively constrained equal across classes and the threshold differences are estimated for the remaining items. Finally, a tally is made of the total number of times that each item displayed a significant p-value when its thresholds were not constrained. Since this method does not require the formation of two models, one obvious advantage is its simplicity. However, unlike the more established baseline procedures, it has not been subjected to the methodological rigor that should precede the acceptance and usage of an approach in applied settings.

Estimation of Mixture Models

The purpose of the mixture modeling estimation process is to disentangle the hypothesized mixture of distributions into the pre-specified number of latent classes. Unlike the manifest situation, where group membership is observed and group proportions are known, class membership is unobserved. Therefore an additional model parameter, known as the mixing proportion, is estimated (Gagné, 2004). The K − 1 mixing proportions estimate the proportion of individuals comprising each of the K hypothesized classes. Additionally, while individuals obtain a probability of membership in each of the K classes, they are assigned to a specific class based on their highest posterior probability of class membership. To estimate the model parameters, the joint log-likelihood of the mixture across all observations is maximized (Gagné, 2004).
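In the two-class case, the quantity being maximized is a sum over examinees of the log of a mixture-weighted likelihood. A minimal sketch (Python; the per-examinee likelihood values are arbitrary toy inputs, not values computed from a fitted model):

```python
import math

def mixture_loglik(l1, l2, pi):
    """Two-class joint mixture log-likelihood:
    sum over examinees of ln(pi * L_1i + (1 - pi) * L_2i)."""
    return sum(math.log(pi * a + (1 - pi) * b) for a, b in zip(l1, l2))

# Toy per-examinee likelihoods under each class (arbitrary values).
l1 = [0.8, 0.3, 0.6]
l2 = [0.2, 0.7, 0.5]
print(mixture_loglik(l1, l2, pi=0.4))
```

A quick check on the formula: when the two class-specific likelihoods are identical, the mixing proportion drops out and the result is simply the ordinary log-likelihood.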
For a mixture of two latent subpopulations, the joint log-likelihood of the mixture model can be expressed as the maximization of:

log L = Σ_{i=1}^{N} ln(π L_{1i} + (1 − π) L_{2i}) (17)

where L_{1i} and L_{2i} represent the likelihood of the ith examinee being a member of subpopulation 1 and subpopulation 2, respectively, π represents the unknown mixing proportion, and N is the total number of examinees in the sample. Likewise, for K subpopulations, Gagné (2004) presents the joint log-likelihood of the mixture model as:

log L = Σ_{i=1}^{N} ln [ Σ_{k=1}^{K} π_k (2π)^(−p/2) |Σ_k|^(−1/2) exp(−½ (x_i − μ_k)′ Σ_k⁻¹ (x_i − μ_k)) ] (18)

where μ_k = ν_k + Λ_k α_k and Σ_k = Λ_k Ψ_k Λ_k′ + Θ_k.

Class Enumeration

An important decision to be made is determining the number of latent classes existing in the population (Bauer & Curran, 2004; Nylund, Asparouhov, & Muthén, 2006). Traditionally, researchers use standard chi-squared based statistics to compare models. However, in mixture analysis, when comparing models with differing numbers of latent classes, the traditional likelihood ratio test for nested models is no longer appropriate (Bauer & Curran, 2004; McLachlan & Peel, 2000; Muthén, 2007). Instead, alternative model selection indices are used to compare competing models with different numbers of latent classes. These include: (i) information-based criteria such as the Akaike information criterion (AIC; Akaike, 1987), the Bayesian information criterion (BIC; Schwarz, 1978), and the sample-size adjusted BIC (ssaBIC; Sclove, 1987); (ii) likelihood-based tests such as the Lo-Mendell-Rubin adjusted likelihood ratio test (LMR aLRT; Lo, Mendell, & Rubin, 2001) and the bootstrapped version of the LRT (BLRT; McLachlan & Peel, 2000); and (iii) statistics based on the classification of individuals using estimated posterior probabilities, such as entropy (Lubke & Muthén, 2007).
While there has been limited research comparing the performance of these various model selection methods, no consistent guidelines have been established as to which indices are most useful for comparing models or selecting the best-fitting model (Lubke & Neale, 2006; Nylund et al., 2006; Tofighi & Enders, 2008; Yang, 2006). The reason is that there is seldom unanimous agreement across the various model selection indices, and as a result, misspecification of the number of classes is a likely occurrence (Bauer & Curran, 2004; Nylund et al., 2006). Therefore, the researcher should not rely on these indices as the sole determinant of the number of latent classes. Rather, it is advised that, in addition to the statistical indices, theoretical justification should guide both the selection of the optimal number of classes and the interpretation of the classes (Bauer & Curran, 2004; Gagné, 2006; Muthén, 2003). The most common information criterion indices for model selection are introduced below.

Information Criteria Indices

The information criteria measures (e.g., the AIC, BIC, and sample-size adjusted BIC) are all based on the log-likelihood of the estimated model and the number of free parameters in the model. On their own, individual values of these information-based criteria for a specified model are not very useful. Instead, the indices for a specified model are compared with those of models with varying numbers of classes. For example, a hypothesized two-class model would be compared successively with one-, three-, and four-class models. Typically, the model with the lowest information criteria value, compared to the other models with different numbers of classes, is selected as the best-fitting model (Lubke & Muthén, 2005; Nylund et al., 2006).
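The selection rule just described, computing each criterion for the candidate models and choosing the minimum, can be sketched as follows (Python; the log-likelihoods and parameter counts are made-up illustrative values, and the criterion definitions are the standard AIC/BIC/ssaBIC formulas).

```python
import math

def aic(ll, p):
    """AIC = -2 log L + 2p."""
    return -2 * ll + 2 * p

def bic(ll, p, n):
    """BIC = -2 log L + p log N."""
    return -2 * ll + p * math.log(n)

def ssabic(ll, p, n):
    """Sample-size adjusted BIC = -2 log L + p log((N + 2) / 24)."""
    return -2 * ll + p * math.log((n + 2) / 24)

# Hypothetical fitted solutions: class count -> (log-likelihood, n free params).
fits = {1: (-5210.0, 30), 2: (-5150.0, 46), 3: (-5140.0, 62)}
n = 1000

for name, fn in [("AIC", lambda ll, p: aic(ll, p)),
                 ("BIC", lambda ll, p: bic(ll, p, n)),
                 ("ssaBIC", lambda ll, p: ssabic(ll, p, n))]:
    best = min(fits, key=lambda k: fn(*fits[k]))  # lowest value wins
    print(name, "selects", best, "classes")
```
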
The information criteria such as the AIC, BIC, and ssaBIC are all based on the log-likelihood and adjust differently for the number of free parameters and sample size (Lubke & Muthén, 2005; Lubke & Neale, 2006). The AIC, defined as a function of the log-likelihood and the number of estimated parameters p, penalizes for overparameterization but not for sample size:

AIC = −2 log L + 2p (19)

On the other hand, the BIC and ssaBIC adjust for both the number of parameters and the sample size (Nylund et al., 2006). The BIC and ssaBIC are given by:

BIC = −2 log L + p log(N) (20)
ssaBIC = −2 log L + p log((N + 2)/24) (21)

As noted earlier, when comparing models with different numbers of latent classes, lower values of the AIC, BIC, and ssaBIC indicate better-fitting models. A typical approach to class enumeration begins with fitting a baseline one-class model and successively fitting models with additional classes, with the goal of identifying the mixture model with the smallest number of latent classes that provides the best fit to the data (Liu, 2008). However, previous research has found that results from different information criteria can provide ambiguous evidence regarding the optimal number of classes. In addition, across different mixture models, there is also inconsistency regarding which model selection criterion performs best. Nylund et al. (2006) conducted a simulation study comparing the performance of commonly used information criteria for three types of mixture models: latent class, factor mixture, and growth mixture models. Overall, the researchers found that among the information criteria measures, the AIC, which does not adjust for sample size, performed poorly and identified the correct k-class model on fewer occasions than the two sample-size adjusted indices, the BIC and the ssaBIC. Moreover, the AIC frequently favored the selection of the k+1-class model over the correct k-class model.
In addition, whereas the ssaBIC generally performed well with smaller sample sizes (N = 200, 500), the BIC tended to be the most consistent overall performer, particularly with larger sample sizes (N = 1000). Based on their simulation results, Nylund et al. (2006) concluded that the BIC was the most accurate and consistent of the IC measures at determining the correct number of latent classes. Yang (1998) evaluated the performance of eight information criteria in the selection of latent class analysis (LCA) models for six simulated levels of sample size. The results suggested that the ssaBIC outperformed the other IC measures, including the AIC and the BIC. For instance, with smaller sample sizes (N = 100, 200), the ssaBIC had the highest accuracy rates, 62.7% and 77.5% respectively. In addition, Yang (1998) found that both the BIC and a consistent form of the AIC (CAIC) tended to incorrectly select models with fewer latent classes than actually simulated. The performance of the BIC and CAIC improved only after the sample size increased to the largest condition of N = 1000. The researcher concluded that in the case of LCA models, the ssaBIC outperformed the AIC and BIC at determining the correct number of latent classes (Yang, 1998). Tofighi and Enders (2007) extended this line of simulation research by evaluating the accuracy of information-based indices in identifying the correct number of latent classes in growth mixture models (GMMs). Manipulated factors included the number of repeated measures, sample size, separation of latent classes, mixing proportions, and within-class distribution shape, simulated for a three-class population GMM. The researchers found that, of the ICs, the ssaBIC was the most successful at consistently extracting the correct number of latent classes. Once again, the BIC showed its sensitivity to small sample sizes and frequently favored too few classes.
The accuracy of the ssaBIC persisted even when the latent classes were not well separated: whereas the ssaBIC extracted the correct three-class solution in 88% of the replications, the BIC and CAIC correctly identified this solution only 11% and 4% of the time, respectively. In examining the accuracy of model selection indices in multilevel factor mixture models, Allua (2007) found that while the BIC and ssaBIC outperformed the AIC in correct predictions when data were generated from a one-class model, none of the fit indices performed credibly when a two-class model was used as the data-generating model. In this case, all of the fit indices tended to underestimate the number of latent classes by continuing to favor the one-class model over the correct two-class model. This inconsistency between model fit measures has also been evidenced in applied studies. Using an illustrative example, Lubke and Muthén (2005) applied factor mixture modeling to continuous observed outcomes from the Longitudinal Study of American Youth (LSAY) as a means of exploring unobserved population heterogeneity. A series of increasingly invariant models was estimated and compared to a baseline two-factor, single-class model. For each of the models fit to the data, two- through five-class solutions were specified. The commonly used relative fit indices (AIC, BIC, ssaBIC, and aLRT) were used in choosing the best-fitting models. However, there were several instances of disagreement between the IC results. For example, among the non-invariant and fully invariant models, while the AIC and the ssaBIC identified the four-class solution as the best-fitting model, the BIC and aLRT produced their lowest values for the three-class solution.
In summarizing their results, the authors suggested that in addition to relying on the model fit measures, researchers should explore the additional classes, much as additional factors are investigated in factor analysis, to determine whether their inclusion provides new, substantively meaningful interpretations of the solution. Overall, results from both simulation and applied studies highlight the lack of agreement among the mixture model fit indices. Researchers have attributed this inconsistency of performance to the heavy dependence of the indices on the type of mixture model under consideration, as well as the assumptions made about the populations (Liu, 2008; Nylund et al., 2006). Therefore, rather than viewing any single measure as superior, each should be seen as a contributory piece of evidence in the comparison of one model against another. While the model fit indices provide the statistical perspective, this should be augmented with a complementary substantive theoretical justification to aid in both the selection of the optimal number of classes and their interpretability (Bauer & Curran, 2004).

Mixture Model Estimation Challenges

While factor mixture modeling is an attractive tool for simultaneously investigating population heterogeneity and latent class dimensionality, it is not without its challenges. The merging of the two types of latent variables into one integrated framework results in a model that requires a high level of computational intensity during the estimation process. As a result, factor mixture models require lengthy computation times, which in turn reduce the number of replications that can be simulated within a realistic time frame. In addition to the increased computation times, the models are susceptible to problems due to multiple-maxima solutions.
Ideally, in ML estimation, as the iterative procedure progresses the log-likelihood should monotonically increase until it reaches one final maximum. However, with mixture models the solution often converges on a local rather than a global maximum, thereby producing biased parameter estimates. Whether the expectation-maximization (EM) algorithm converges to a local or global maximum largely depends on the set of starting values used. Therefore, one approach to mitigating this problem is to incorporate multiple random starts, a practice supported in Mplus. In the event that the default number of random starts (in Mplus, the defaults are 10 random starting sets, with the best 2 sets used for final optimization) is insufficient to converge on a maximum likelihood solution, Mplus allows the user the flexibility to increase the number of start values. Adjusting the random starts option to include a larger number of start values, in both the initial analysis and final optimization phases, allows for a more thorough investigation of multiple solutions and should improve the likelihood of successful convergence (Muthén & Muthén, 1998-2008). However, since increasing the number of random starts also increases the computational load and estimation time, it is recommended that prior to conducting a full study researchers experiment with various sets of user-defined starting values to determine an appropriate number (Nylund et al., 2006). During this process, it is important to examine the results of the final-stage solutions to determine whether the best log-likelihood is replicated multiple times. Such replication provides evidence that the solution converged on a global maximum, reducing the possibility that the parameter estimates are derived from local solutions.
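The practice of checking whether the best log-likelihood replicates across random starts can be sketched as follows (Python; `toy_fit` is a made-up stand-in for an actual EM estimation run, which Mplus performs internally — it simply maps a starting value to a "converged" log-likelihood with a few basins of attraction).

```python
import random

def multi_start(fit_fn, n_starts=10, seed=0, tol=1e-6):
    """Run an estimator from several random starting values and report the
    best log-likelihood plus whether another start replicated it
    (replication is taken as evidence of a global maximum)."""
    rng = random.Random(seed)
    logliks = [fit_fn(rng.random()) for _ in range(n_starts)]
    best = max(logliks)
    replicated = sum(1 for ll in logliks if abs(ll - best) < tol) >= 2
    return best, replicated

# Toy stand-in "estimator" with several basins of attraction
# (purely illustrative, not a real mixture-model fit).
toy_fit = lambda s: float(int(s * 3)) - 10.0
best, replicated = multi_start(toy_fit, n_starts=20)
print(best, replicated)
```

If `replicated` comes back `False`, the analyst would increase `n_starts`, which mirrors the Mplus advice of raising the STARTS values until the best log-likelihood is reproduced.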
Purpose of Study

In the past, the limited usage of SEM-based mixture models has been attributed to an unavailability of commercial software (Bandalos & Cohen, 2006). However, given the recent innovations integrated into software packages such as Mplus (Muthén & Muthén, 1998-2008), the estimation of SEM mixture models is now possible (Bandalos & Cohen, 2006). The purpose of this study was to evaluate the performance of factor mixture modeling as a method for detecting items exhibiting manifest-group DIF. In this study, manifest-group DIF was generated in a set of dichotomous data for a two-group, two-class population. The questions addressed were as follows. How successful is the factor mixture modeling approach at recovering the correct number of latent classes? If the number of classes is known a priori, how well does the factor mixture model perform at detecting differentially functioning items? Specifically, how are the (i) convergence rates, (ii) Type I error rate, and (iii) power to detect DIF affected under various manipulated conditions characteristic of those that may be encountered in DIF research?

Figure 2-1. Example of uniform DIF
Figure 2-2. Example of non-uniform DIF
Figure 2-3. Depiction of the relationship between y* and y for a dichotomous item
Figure 2-4. Diagram depicting the specification of the factor mixture model

CHAPTER 3
METHODOLOGY

The simulation was conducted in two parts. The first part of the study focused on the ability of the factor mixture model to recover the correct number of latent classes under a variety of simulated conditions. In the second phase of the study, the number of classes was assumed known and the emphasis was on evaluating the performance of the mixture model at identifying differentially functioning items.
Following is a description of the model as implemented, as well as the study design used in evaluating the performance of the factor mixture modeling approach to DIF detection.

Factor Mixture Model Specification for Latent Class DIF Detection

The factor mixture model was specified in its hybrid form as having both a single factor measured by 15 dichotomous items and a categorical latent class variable. The factor mixture model was formulated in the study as:

y*_ik = ν_k + λ η_ik + ε_ik (22)
η_ik = α_k + ζ_ik (23)

where the parameters are as previously defined in Chapter 2 and k = 1 to K indexes the latent classes. To accommodate the testing of uniform DIF, the model was formulated so that the factor loadings were constrained to be class-invariant but the item thresholds were allowed to vary across classes. Therefore, in Equation 22, the parameter λ is not indexed by the k subscript. Overall, the single-factor mixture model was specified as follows:

1. The factor loadings were constrained equal across the latent classes. For scaling purposes, the factor loading of the referent (i.e., item 1) was fixed at one for each of the latent classes.
2. To ensure identification, the item thresholds of the referent were also held equal across the latent classes. The remaining 14 item thresholds were freely estimated.
3. One of the factor means was constrained to zero while the remaining factor mean was freely estimated. For K latent classes, the Mplus default is to fix the mean of the last or highest-numbered latent class to zero (i.e., α_K = 0). Therefore, in this case, the mean of the first class was freely estimated.
4. Factor variances were freely estimated for all latent classes.

Data Generation

The discrimination and difficulty parameters used in this study were adopted from dissertation research conducted by Wanichtanom (2001). The original test (Wanichtanom, 2001) consisted of 50 items; in this case, parameters for ten of the 50 items were selected.
These ten items from the Wanichtanom (2001) study represented the DIF-free test items. In the original study, the item discrimination parameters were drawn from a uniform distribution within a 0 to 2.0 range and the difficulty parameters from a normal distribution within a -2.0 to 2.0 range (Wanichtanom, 2001). The remaining five DIF items that formed part of the scale reflected low (i.e., 0.5), medium (i.e., 1.0), and high (i.e., 2.0) levels of discrimination. For the entire 15-item test, the discrimination (a) parameters ranged from 0.4 to 2.0 with a mean of 0.98, while the difficulty (b) parameters ranged from -1.2 to 0.7 with a mean of -0.34. Uniform DIF was simulated against the focal group on Items 2 to 6. The values of the item parameters are presented in Table 3-1.

Data were generated using R statistical software (R Development Core Team, 2009). The ability parameters were drawn from normal distributions for both the reference and focal groups. For these dichotomous items, the probability of a correct response was computed using the 2PL IRT model as:

Pr(Y_ij = 1 | θ_j) = 1 / (1 + e^(−a_i(θ_j − b_i)))    (2-4)

where a_i is the item discrimination parameter, b_i is the item difficulty parameter, and θ_j is the latent ability trait for examinee j. To determine each examinee's item response, the calculated probability Pr(Y_ij = 1 | θ_j) was compared to a randomly generated number from a uniform U(0,1) distribution. If that probability exceeded the random number, the examinee's item response was scored as correct (i.e., coded as 1). On the other hand, if the probability of a correct response was less than the random number, the item response was scored as incorrect and coded as 0. Finally, 50 replications were run for each set of simulation conditions and the dichotomous item response datasets were exported to Mplus V5.1 (Muthén & Muthén, 1998-2008) for the analysis phase.
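The generation rule just described (2PL probability compared against a uniform draw) can be sketched as follows. The study itself used R; this Python equivalent uses the first three (a, b) pairs from Table 3-1 purely for illustration, with a uniform DIF shift of 1.0 applied to the non-referent items for the focal group.

```python
import math
import random

def gen_responses(a, b, thetas, rng):
    """Score an item 1 ('correct') when the 2PL probability of a correct
    response exceeds a uniform(0,1) draw, per the rule described above."""
    data = []
    for theta in thetas:
        row = [1 if 1.0 / (1.0 + math.exp(-ai * (theta - bi))) > rng.random()
               else 0
               for ai, bi in zip(a, b)]
        data.append(row)
    return data

rng = random.Random(2009)
# First three (a, b) pairs from Table 3-1; the full study used 15 items.
a = [1.0950, 0.5001, 0.5001]
b_ref = [-0.0672, -1.0000, -0.5000]
# Uniform DIF of 1.0 against the focal group on the non-referent items;
# Item 1 is the referent and keeps its reference-group difficulty.
b_focal = [b_ref[0]] + [bi + 1.0 for bi in b_ref[1:]]

thetas_ref = [rng.gauss(0.0, 1.0) for _ in range(500)]    # reference ~ N(0,1)
thetas_focal = [rng.gauss(0.0, 1.0) for _ in range(500)]  # focal ~ N(0,1), no impact

ref_data = gen_responses(a, b_ref, thetas_ref, rng)
focal_data = gen_responses(a, b_focal, thetas_focal, rng)
```

Under the impact conditions described later, the focal-group abilities would simply be drawn from a distribution whose mean sits 0.5 or 1.0 SD below the reference group's.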
Since the data were generated externally, the Mplus Type=Montecarlo option was used to analyze the multiple datasets and to save the results for the replications that converged successfully.

Simulation Study Design

In their 1988 paper, Lautenschlager and Park reiterated the need for Monte Carlo studies to be designed in such a way that they simulate real data conditions as closely as possible. This advice was followed when selecting the conditions and levels for this simulation study. The conditions were chosen to replicate those adopted in previous latent DIF studies (Bolt, Cohen, & Wollack, 2001; Bilir, 2009; Cohen & Bolt, 2005; De Ayala et al., 2002; Samuelsen, 2005) and mixture modeling studies (Gagné, 2004; Lee, 2009; Lubke & Muthén, 2005, 2007).

Research Study 1

In the first part of the study, dichotomous item responses were generated for the two-group, two-class scenario. The focus was on determining the success rate of the specified factor mixture model in recovering the correct number of latent classes. Solutions for one- through three-class mixture models were estimated, and three information-based criterion (IC) values were compared across the models. The model with the lowest IC value was selected as the best-fitting model (Lubke & Muthén, 2005; Nylund et al., 2006). The fixed and manipulated factors used in this study are listed below.

Manipulated Conditions

Sample size

Previous findings have shown that, as with pure CFA models, sample size affects the convergence rates of mixture models as well (Gagné, 2004; Lubke, 2006). In evaluating the performance of several CFA mixture models, Gagné (2004) reported a significant increase in convergence rates as the sample size was increased from a minimum of 200 to 500 to 1000.
A review of previous simulation and real data mixture model research found that whereas only a few studies used as few as 200 simulees (Gagné, 2004; Nylund et al., 2006), sample sizes of at least 500 were most frequently used (Bolt et al., 2001; Bilir, 2009; Cho, 2007; De Ayala et al., 2002; Rost, 1990; Samuelsen, 2005). In this study, the two levels of sample size (N = 500, N = 1000) were chosen to be representative of realistic research samples and to reduce the possibility of convergence problems. In addition, the sample size of 500 was used as a lower limit to examine the effects of small sample size on the performance of the factor mixture approach to DIF detection.

Magnitude of uniform DIF

In previous DIF studies (Camilli & Shepard, 1987; De Ayala et al., 2005; Meade, Lautenschlager, & Johnson, 2007; Samuelsen, 2005), the manipulated difficulty shifts have typically varied in magnitude from .3 to 1.5. Overall, these results have shown higher DIF detection rates with items simulated to have moderate or strong amounts of DIF. However, with mixture models, it may be necessary to simulate larger DIF magnitudes to ensure the detection of DIF. This hypothesis was based on the results of a preliminary small-scale simulation in which several levels of DIF magnitude were manipulated. As a result, this study focused on DIF effects at the upper range of the scale where the magnitude of manifest differential functioning is large, namely, Δb = 1.0 and Δb = 1.5. For items with no DIF, the item difficulties are defined as b_iF = b_iR. On the other hand, when there is uniform DIF, the items were simulated to function differently in favor of the reference group and the item difficulties are defined as b_iF = b_iR + Δb (where Δb = 1.0 or 1.5).

Ability differences between groups

Several researchers have recommended the inclusion of latent ability differences (i.e.
impact) in DIF detection studies, since they contend that in real data sets the focal and reference populations typically have different latent distributions (Camilli & Shepard, 1987; De Ayala et al., 2002; Donoghue, Holland, & Thayer, 1993; Duncan, 2006; Stark et al., 2006). Simulation results on the effects of impact on DIF detection have varied. For instance, some researchers have reported good control of Type I error rates with a moderate difference of .5 SD (Stark et al., 2006) and even with differences as large as 1 SD (Narayanan & Swaminathan, 1994). On the other hand, others (Cheung & Rensvold, 1999; Lee, 2009; Roussos & Stout, 1996; Uttaro & Millsap, 1994) have reported inflated Type I error rates with unequal latent trait distributions. The results are also mixed with respect to the effect of impact on power. Whereas some studies have shown reduced power (Ankemann, Witt, & Dunbar, 1999; Clauser, Mazor, & Hambleton, 1993; Narayanan & Swaminathan, 1996), others (González-Romá et al., 2006; Rogers & Swaminathan, 1993; Shealy & Stout, 1993; Stark et al., 2006) found that DIF detection rates were not negatively affected by the dissimilarity of latent distributions.

In this first part of the study, two conditions of differences in mean latent ability were manipulated:

1. Equal latent ability means, with the reference and focal groups both generated from a standard normal distribution (i.e., θ_R ~ N(0,1), θ_F ~ N(0,1)), and

2. Unequal latent ability means, with the reference group having a latent ability mean .5 standard deviation higher than the focal group (i.e., θ_R ~ N(0.5,1), θ_F ~ N(0,1)).

Fixed Simulation Conditions

Test length

The test was simulated for a fixed length of 15 dichotomous items. Previous studies using factor mixture modeling have typically used shorter scale lengths varying between 4 and 12 observed items for a single-factor model with categorical items (Lubke & Neale, 2008; Nylund et al., 2006; Kuo et al., 2008; Reynolds, 2008; Sawatzky, 2007).
This may be due to the fact that longer computation times are required when fitting mixture models to categorical data (Lubke & Neale, 2008). Therefore, while more test items could have been included, this length was chosen not only to be consistent with previous research but also to take into account the computational intensity of factor mixture models.

Number of DIF items

In previous simulation studies, the percentage of DIF items has typically varied from 0% to a maximum of 50% (Bilir, 2009; Cho, 2007; Samuelsen, 2005; Wang et al., 2009). For example, Samuelsen (2005) considered cases with 10%, 30%, and 50% DIF items, Cho (2007) investigated cases with 10% and 30% DIF items, and Wang et al. (2009) manipulated the number of DIF items in increments of 10% from 0% to 40%. With respect to real tests, Shih and Wang (2009) reported that they typically contain at least 20% DIF items. In this study, the percentage of DIF items was 33.3% (five items), with the DIF items all favoring the reference group. Items 2 through 6 were selected to display uniform DIF.

Sample size ratio

With respect to the ratio of focal to reference groups, Atar (2007) reports that in actual testing situations, "the sample size for the reference group may be as small as the sample size for the focal group or the sample size for the reference group may be larger than the one for the focal group" (p. 29). In this study, a 1:1 sample size ratio of focal to reference group was considered for each of the two sample sizes. Using comparison groups of equal size is representative of an evenly split manifest variable frequently used in DIF studies, such as gender (Samuelsen, 2005).

Percentage of overlap between manifest groups and latent classes

In the manifest DIF approach, when an item is identified as having DIF, there is an implied assumption that all members of the focal group must have been disadvantaged by this item.
However, under a latent conceptualization, the view is that DIF is detected based on the degree of overlap between the manifest groups and the latent classes. In this context, overlap refers to the percentage of membership homogeneity between the manifest groups and latent classes. For example, if each of the examinees in either the manifest-focal or the manifest-reference group belongs to the same latent class, then this is referred to as 100% overlap. Therefore, as the level of group-class overlap decreases, there is a corresponding decrease in the level of homogeneity between groups and classes as well. In Samuelsen's (2005) study, five levels of overlap decreasing in increments of 10% from 100% to 60% were considered. Samuelsen (2005) found that as the group-class overlap increased, the power of the mixture approach to correctly detect DIF increased as well. In this study, the level of overlap was fixed at 80%, a somewhat realistic expectation of what may be encountered in practice. This means that DIF was simulated against 80% of the simulees in the focal group.

Mixing proportion

The mixing proportion (π_k) represents the proportion of the population in class k, which was fixed at .50. Although the true class membership was known, it was not used in the analysis.

Study Design Overview

In sum, a total of three fully crossed factors resulting in eight simulation conditions (2 sample sizes x 2 DIF magnitudes x 2 latent ability distributions) were manipulated to determine their effect on the recovery of the correct number of latent classes. For each of the eight conditions, a total of 50 replications were run. It is important to note that in the original plan for this study, a larger number of replications was proposed. However, initial simulation runs revealed that the computational time necessary to complete larger numbers of replications was impractical for this dissertation.
Therefore, given the timing constraints, a smaller number of data sets (i.e., 50) was replicated. The list of study conditions is provided in Table 3-2.

Evaluation Criteria

As previously noted, the objective of this first part of the simulation was to determine the success rate of the factor mixture method in identifying the correct number of classes. The three likelihood-based model fit indices (AIC, BIC, and ssaBIC) provided by Mplus were compared, with smaller values indicating better model fit. The outcome measures evaluated for each of the three (i.e., one- through three-class) factor mixture solutions fit to the data were:

Convergence rates

This was represented as the number of replications that converged to a proper solution across the 50 simulations for each set of the eight conditions. Data sets with improper or non-convergent solutions were not included in the analysis.

IC performance

Performance was evaluated by calculating the average IC values and comparing the values for each index across the one-, two-, and three-class models. For each of the simulated conditions, the lowest average IC value and the corresponding model were identified.

Research Study 2

In the second part of the study, research was conducted to evaluate the Type I error rate and power performance of the factor mixture model at detecting uniform DIF, assuming that the correct number of classes is known. With respect to the study design, two levels of DIF magnitude (DIF = 1.0, 1.5) and two levels of sample size (N = 500, 1000) were again simulated using the same levels as in Study 1. However, an additional level was included for the impact condition. More specifically, in addition to the no-impact and moderate-impact conditions, a large level of impact (i.e., the mean for the reference group was 1.0 SD higher than the mean of the focal group) was included as well.
The inclusion of this new level permitted a more complete investigation of the robustness of the factor mixture model in DIF detection to the influence of impact. Overall, a total of 12 conditions (2 sample sizes x 2 DIF magnitudes x 3 latent trait distributions) were simulated. In this second phase of the simulation, each condition was replicated 1000 times. The full list of study design conditions is shown in Table 3-3.

Data Analysis

The 1000 sets of dichotomous item responses were generated by R V2.9.0 (R Development Core Team, 2009). The data sets for each of the 12 conditions were saved and exported from R to Mplus V5.1 for analysis. As was done in the first part of the study, the Type=Montecarlo facility was used to accommodate the analysis of the multiple datasets generated external to Mplus and to save the results for subsequent analysis. To assess uniform DIF, a simultaneous significance test of the 14 threshold differences (i.e., all thresholds except that of the referent, Item 1) was conducted using a Wald test. A p-value less than .05 provided evidence of DIF in the item.

Evaluation Criteria

The outcome measures used in evaluating the performance of this factor mixture method for DIF detection were as follows:

Convergence rates

This was measured by the number of replications that converged to proper solutions across each of the 12 combinations of conditions. Data sets with improper or non-convergent solutions were not included in the analysis.

Type I error rate

The Type I error rate (or false-positive rate) was computed as the proportion of times the DIF-free items were incorrectly identified as having DIF. Therefore, the overall Type I error rate was calculated by dividing the total number of times the nine DIF-free items (i.e., Items 7-15) were falsely rejected by the total number of properly converged replications for each of the 12 study conditions. The nominal Type I error rate used in this study was .05.
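The Wald-based flagging and the Type I error tally described above can be sketched as follows. The study tested all 14 threshold differences simultaneously in Mplus; the single-difference z-test here (whose square is a 1-df Wald chi-square) is a simplified per-item analogue, and the estimates are hypothetical.

```python
import math

def wald_p_value(est_diff, se_diff):
    """Two-sided p-value for one threshold difference via the normal
    approximation z = estimate / SE; z**2 is a 1-df Wald chi-square."""
    z = est_diff / se_diff
    return math.erfc(abs(z) / math.sqrt(2.0))  # P(|Z| > |z|)

def type1_error_rate(p_values_for_dif_free_items, alpha=0.05):
    """Proportion of DIF-free item tests falsely rejected at alpha."""
    flags = [p < alpha for p in p_values_for_dif_free_items]
    return sum(flags) / len(flags)

# Hypothetical estimates: a small difference with a large SE (should not
# be flagged) and a large difference with a small SE (flagged).
ps = [wald_p_value(0.10, 0.20), wald_p_value(0.90, 0.25)]
rate = type1_error_rate(ps)  # here, one of two DIF-free tests rejected
```

In the actual study this proportion was accumulated over the nine DIF-free items and all properly converged replications within each condition.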
Statistical power

Power (or the true-positive rate) was computed as the proportion of times that the analysis correctly identified the DIF items as having DIF. Therefore, the overall power rate was calculated by dividing the total number of times any one of the five DIF items (i.e., Items 2-6) was correctly identified by the total number of properly converged replications across each of the 12 simulated conditions.

In addition to the computation of the overall Type I error and power rates of the factor mixture method, a variance components analysis was also conducted to examine the influence of each of the conditions and their interactions on the performance of the method. In this analysis, which was conducted in R V2.9.0 (R Development Core Team, 2009), the independent variables were the three study conditions (DIF magnitude, sample size, and impact) and the dependent variables were the Type I error and power rates. Eta-squared (η²), which quantifies the percentage of variance explained by each of the main effects and their interactions, was used as a measure of effect size.

Model Estimation

The parameters of the mixture models were estimated in Mplus V5.1 (Muthén & Muthén, 1998-2008) with robust maximum likelihood estimation (MLR) using the EM algorithm, which is the default estimator for mixture analysis in Mplus. One of the main limitations of running a mixture simulation study is the lengthy computation time needed for model estimation. In the interest of time, the random starts feature, which randomly generates sets of starting values, was not used in this part of the study. Instead, true population parameter values for the factor loadings, thresholds, and factor variances were substituted for the starting values in this portion of the analysis. This change reduced the computation time for model estimation considerably.

Table 3-1.
Generating population parameter values for the reference group

Item    a        b
1       1.0950   -0.0672
2       0.5001   -1.0000
3       0.5001   -0.5000
4       1.0000    0.0000
5       2.0000   -1.0000
6       2.0000    0.0000
7       0.5584   -0.7024
8       0.9819    0.6450
9       0.5724   -0.5478
10      1.4023   -0.3206
11      0.4035   -1.1824
12      1.0219   -0.4656
13      0.9989   -0.2489
14      0.7342   -0.4323
15      0.8673    0.7020

Note. Item 1 is the referent; therefore its loading was fixed at 1 and its thresholds were constrained equal across classes. Uniform DIF against the focal group was simulated on Items 2 to 6.

Table 3-2. Fixed and manipulated simulation conditions used in Study 1

Manipulated conditions:
  Sample size: 500, 1000
  Magnitude of DIF: 1.0, 1.5
  Latent mean distributions: θ_R ~ N(0,1), θ_F ~ N(0,1); θ_R ~ N(.5,1), θ_F ~ N(0,1)
Fixed conditions:
  Test length: 15 items
  Number of DIF items: 5 items (33.3%)
  Sample size ratio: 1:1
  Class proportion: .5
  Overlap: 80%

Table 3-3. Fixed and manipulated simulation conditions used in Study 2

Manipulated conditions:
  Sample size: 500, 1000
  Magnitude of DIF: 1.0, 1.5
  Latent mean distributions: θ_R ~ N(0,1), θ_F ~ N(0,1); θ_R ~ N(.5,1), θ_F ~ N(0,1); θ_R ~ N(1,1), θ_F ~ N(0,1)
Fixed conditions:
  Test length: 15 items
  Number of DIF items: 5 items (33.3%)
  Sample size ratio: 1:1
  Class proportion: .5
  Overlap: 80%

CHAPTER 4
RESULTS

Research Study 1

In this section, the results of the first part of the simulation are presented. To answer the research question, data were generated for a two-group, two-class population with five of the 15 items simulated to display uniform DIF. The following conditions were manipulated in this study: sample size (500, 1000), DIF magnitude (1.0, 1.5), and differences in latent ability means (0 SD, 0.5 SD). The factor mixture model as formulated in Equations 2-2 and 2-3 was applied to determine how successful the method was at recovering the correct number of classes. For each of the eight condition combinations, one-, two-, and three-class models were fit to the data. These results are presented in two sections.
First, the rates of model convergence for each of the eight simulation conditions are reported. Second, the information criteria (IC) results, which were used for model comparison and class enumeration, are discussed. The results for Study 1 are summarized in Tables 4-1 through 4-4.

Convergence Rates

Table 4-1 presents the data on the number of convergent solutions for each combination of the eight simulation conditions. As was previously mentioned in the Methods section, non-convergent cases were excluded from the analysis; therefore, for some conditions, results were based on fewer than 50 replications. The results showed that overall the convergence rates were very high (ranging from .82 to 1.0), and there were minimal convergence problems. Of the 1200 (50 x 3 x 8) replications, 1147 successfully converged, resulting in a 96% overall convergence rate. In addition, as the number of latent classes was increased, there was a corresponding decrease, albeit minimal, in the number of properly converged solutions. More specifically, while the one-class model attained perfect convergence rates, the average convergence rates for the two- and three-class mixture models were 96% and 91%, respectively. An inspection of the results also revealed a positive relationship between the convergence rate and the DIF magnitude. Of the 16 replications that failed to converge in the two-class model, 15 were for the smaller DIF condition. A similar trend was observed with the three-class model: of the 37 replications that failed to produce a properly convergent solution in the three-class model, 27 were associated with the smaller DIF = 1.0 condition. The cases with non-convergent solutions were excluded from the second part of this analysis.

Class Enumeration

Summary data based on the three IC measures (AIC, BIC, and ssaBIC) for the one-, two-, and three-class models are provided in Tables 4-2 through 4-4.
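The three indices being compared have standard closed forms: AIC = −2logL + 2p, BIC = −2logL + p·ln(N), and the ssaBIC replaces N with (N + 2)/24 in the penalty term (Sclove, 1987), where p is the number of free parameters. A minimal sketch with hypothetical log-likelihoods and parameter counts (not values from this study):

```python
import math

def info_criteria(loglik, p, n):
    """AIC, BIC, and sample-size adjusted BIC for a fitted model."""
    aic = -2.0 * loglik + 2.0 * p
    bic = -2.0 * loglik + p * math.log(n)
    ssabic = -2.0 * loglik + p * math.log((n + 2.0) / 24.0)
    return {"AIC": aic, "BIC": bic, "ssaBIC": ssabic}

# Hypothetical 1-, 2-, and 3-class fits at N = 500: each extra class
# buys some log-likelihood at the cost of extra free parameters.
fits = {1: info_criteria(-4300.0, 30, 500),
        2: info_criteria(-4270.0, 46, 500),
        3: info_criteria(-4260.0, 62, 500)}
best_by_bic = min(fits, key=lambda k: fits[k]["BIC"])
# With these made-up numbers, the heavier BIC penalty prefers the
# one-class model while the milder ssaBIC penalty prefers two classes,
# mirroring the parsimony ordering of the penalties.
```

The smallest value of a given index across the one-, two-, and three-class solutions identifies that index's preferred model, which is how the class enumeration results below were obtained.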
In comparing the fit of the models across classes, the smallest average IC value was used as the criterion in selecting the best-fitting model. An examination of the average IC values highlighted both overall and IC-specific patterns of results. First, as expected, there is a general increase in the average IC values as sample size increases. Second, it is observed that the differences in average IC values between neighboring competing models were generally not substantial, and even negligible under some conditions. Third, with respect to the individual indices, a high level of inconsistency in model selection patterns is observed. The results for the three indices are described in more detail in the following sections.

Akaike Information Criterion (AIC)

The average AIC values across the three specified mixture models are presented in Table 4-2. Overall, the pattern of results shows that the AIC tended to over-extract the number of latent classes. This trend was observed for six of the eight simulated conditions, where the lowest AIC values corresponded to the three-class mixture model. The only exceptions to this pattern occurred for two of the four conditions when the DIF magnitude was increased to 1.5. In these cases, the lowest average AIC values occurred at the correct two-class model. However, it is important to note that the differences between neighboring class solutions were rather small, with the largest absolute difference between values being less than 40 points. Moreover, the differences were practically negligible between the two- and three-class models, ranging in absolute magnitude over the eight simulated conditions from .02 to 8.72. Although smaller IC values are indicative of better model fit, the minor differences between the average AIC values make selection between these two models a less clear-cut decision.

Bayesian Information Criterion (BIC)

The BIC results are presented in Table 4-3.
Based on the average BIC values, this index consistently selected the simpler one-class model as the correct model for the data. For each of the eight manipulated simulation conditions, the lowest values corresponded to the one-class mixture model. The IC differences between neighboring class models were generally larger for the BIC than for the corresponding AIC solutions; more specifically, they ranged in absolute magnitude between 40 and 88 points on average. The differences between the one-class and the correct two-class model were minimized when the DIF magnitude was increased from 1.0 to 1.5. In these cases, even though the lowest average IC values corresponded to the one-class model, the average values are so similar in magnitude that it is difficult to unequivocally choose the one-class solution as the best-fitting model.

Sample-size adjusted BIC (ssaBIC)

Summary values for the ssaBIC compared across the three mixture models are presented in Table 4-4. These results reflected patterns seen with both the AIC and the BIC. First, similar to the BIC, the ssaBIC values suggested the simpler one-class model under conditions where the magnitude of the simulated uniform DIF was at the lower value of 1.0. However, the index also exhibited a pattern similar to that of the AIC by associating the smallest average IC values with the correct two-class model when larger uniform DIF of 1.5 was simulated. Finally, as was the case with the other two IC measures, the magnitude of differences across the three models was small. This was especially true of the differences between the one- and two-class solutions, which ranged on average from 5.0 to 34.1 points.

Research Study 2

In the second part of the study, the objective was to evaluate the Type I error rate and power of the factor mixture approach in the detection of DIF.
The manipulated conditions again included sample size, magnitude of DIF, and differences in latent ability means. In addition to the conditions used in the first phase of the study, one additional level of latent mean differences was included: for a measure of large impact, the mean for the reference group was simulated to be 1.0 SD higher than that of the focal group. Therefore, a total of 12 conditions were manipulated: 2 DIF magnitudes (1.0, 1.5) x 2 sample sizes (500, 1000) x 3 differences in latent trait means (0, 0.5 SD, 1.0 SD).

One thousand replications were generated for each of the 12 simulation conditions examined in the Type I error rate and power studies. The results for the Type I error rate and statistical power are addressed in the sections below.

Nonconvergent Solutions

In this part of the study, population parameter values replaced the starting values randomly generated by Mplus. This change substantially reduced the computational load and decreased the model estimation time. The convergence rates across each of the conditions are presented in Table 4-5. Overall, the results indicate no convergence problems, with rates ranging between 99.4% and a perfect convergence rate.

Type I Error Rate

The factor mixture model was evaluated in terms of its ability to control the Type I error rate under a variety of simulated conditions. Of the 15 items, nine were simulated to be DIF-free. The Type I error rate was assessed by computing the proportion of times these nine DIF-free items were incorrectly identified as having DIF. An item was considered to display DIF if the differences in its thresholds were significantly different from zero. Therefore, for the nine non-DIF items, the Type I error rate was computed as the proportion of times that the items obtained p-values less than .05. The Type I error rates across the 12 simulation conditions are presented in Table 4-6.
The values in the table represent the proportion of times that the method incorrectly flagged a non-DIF item as displaying DIF. The results in Table 4-6 indicate that the factor mixture analysis method did not perform as well as expected in controlling the Type I error rate. The results showed elevated Type I error rates across all the study conditions, which means that the approach consistently produced false identifications at a rate exceeding the nominal alpha level of .05. Overall, the average Type I error rate was 11.8%, which even after accounting for random sampling error would still be considered unacceptably high. Across the individual conditions, the error rates ranged from .09 to .16. Not surprisingly, the factor mixture method exhibited its strongest control of the rate of incorrect identifications for conditions of large DIF magnitude (DIF = 1.5), large sample size (N = 1000), and where there was either no impact or a moderate (0.5 SD) amount of impact. An initial examination of the pattern of results suggested that while the sample size and DIF magnitude were inversely related to the Type I error rate, an increase in the mean latent trait differences resulted in slightly higher Type I error rates. For example, for the cells with DIF magnitude of 1.0, sample size of N = 500, and no impact, the Type I error rate was 0.12; however, when the latent trait means differed by 1.0 SD, the rate of false identifications increased marginally to 0.16. A more detailed discussion of the effect of each of the three conditions is presented in the following sections.

Magnitude of DIF

Tables 4-7 and 4-8 display the aggregated results for the effect of the two levels of DIF magnitude (1.0 and 1.5) on Type I error rates. Overall, the rates of false identifications showed a slight decrease as the magnitude of DIF was increased. For example, when DIF of 1.0 was simulated, error rates across the conditions were between .10 and .16, with an average rate of .13.
However, for larger DIF of 1.5, the rates ranged from .09 to .12, averaging .10. Regardless of the size of DIF, the inflated rates were most pronounced for the smaller sample size of N = 500 and when the difference in latent trait means was maximized (1.0 SD).

Sample size

The results in Tables 4-9 and 4-10 suggest a weak inverse relationship between sample size and the Type I error rate of the factor mixture method. At the smaller sample size (N = 500), the rate of false identifications ranged from .10 to .16, with an average rate of .12. Of the six cells associated with the smaller sample size, the test showed the greatest control of the Type I error rate when larger DIF (1.5) was simulated and the latent trait means were equal. Increasing the sample size to 1000 decreased the Type I error rates marginally. Across the six conditions, the error rates were now between .09 and .14, averaging .11, a negligible decrease from the average rate when N = 500. However, the pattern of false identifications remained consistent across sample sizes: poor Type I error control was observed when smaller DIF (1.0) was simulated and there was large impact (1.0 SD); in contrast, improved control was observed for larger DIF magnitude (1.5) and in the absence of impact.

Impact

Three levels of impact (0, .5 SD, and 1.0 SD) were simulated in favor of the reference group. The aggregated Type I error rates, which are summarized in Tables 4-11 through 4-13, showed that the differences in latent trait means between groups had no appreciable effect on the rate of incorrect identifications. The Type I error rates for the no-impact, 0.5 SD, and 1.0 SD conditions increased marginally from .11 to .12 to .13, a change that can be attributed to the presence of random error. Though not below the nominal alpha value of .05, the Type I error rates were best controlled when both DIF (1.5) and sample size (N = 1000) were large.
Variance components analysis

Following the descriptive analysis of the pattern of Type I error rates across the simulated conditions, a variance components analysis was conducted to specifically examine the influence of each of the simulation conditions, and of their interactions, on the Type I error rates. The results of this analysis are presented in Table 4-14. Based on the η² values, which ranged from 0.000 to 0.007, the only factor contributing to the variance in Type I error rates was the magnitude of DIF, accounting for a mere 0.7%. All other main effects and interactions produced trivial values.

Statistical Power

In the analyses above, the proportion of false DIF detections produced by the factor mixture approach consistently exceeded the nominal value of .05. Typically, when Type I error inflation of this magnitude occurs, power rates are no longer interpretable in terms of the standard alpha level. In this case, the power rates have still been analyzed and are displayed in Tables 4-15 through 4-24. However, it is important to note that these results should be interpreted with caution given the elevated Type I error rates. Power was assessed as the proportion of times across the 1000 replications that the factor mixture analysis correctly identified the five items (i.e., Items 2 to 6) simulated as having uniform DIF. Typically, values of at least .80 indicate that the analysis method is reasonably accurate in correctly detecting items with DIF. Results for the power analysis are displayed in Tables 4-15 through 4-22. The overall accuracy of DIF detection of the factor mixture analysis was 0.447, with the power of correct detection ranging from .264 to .801 across all simulated conditions. The only combination of conditions for which an acceptable level of power was achieved was when larger DIF (1.5) and sample size (N = 1000) were simulated and impact was absent. For all other conditions, the test failed to maintain adequate power.
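The eta-squared measure used in the variance components analyses is simply the ratio of effect to total sum of squares. A one-way sketch with hypothetical Type I error rates grouped by DIF magnitude (the values are illustrative, not the study's):

```python
def eta_squared(groups):
    """Eta-squared for a one-way layout: SS_effect / SS_total, the share
    of variance in the outcome (e.g., a Type I error rate) explained by
    one simulation factor."""
    all_vals = [v for g in groups for v in g]
    grand = sum(all_vals) / len(all_vals)
    ss_total = sum((v - grand) ** 2 for v in all_vals)
    ss_between = sum(len(g) * ((sum(g) / len(g)) - grand) ** 2 for g in groups)
    return ss_between / ss_total

# Hypothetical per-condition error rates under DIF = 1.0 vs DIF = 1.5.
rates_dif10 = [0.12, 0.13, 0.16, 0.10, 0.14, 0.13]
rates_dif15 = [0.09, 0.10, 0.12, 0.09, 0.11, 0.10]
es = eta_squared([rates_dif10, rates_dif15])
```

The actual analysis was run in R with all three factors and their interactions; the same SS_effect / SS_total ratio underlies each reported η² value.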
An initial examination of these results suggests that whereas higher rates of DIF detection were positively associated with DIF magnitude and sample size, impact had a seemingly weak negative effect. A more detailed discussion of the effect of each of the three conditions on DIF power rates follows.

Magnitude of DIF

As expected, increasing the magnitude of DIF substantially improved the power of factor mixture DIF detection (refer to Tables 4-16 and 4-17). When DIF of 1.0 was simulated, the average detection rates ranged from .264 to .350; when the larger DIF (1.5) was simulated in the items, they ranged from .425 to .801. Overall, similar detection patterns were observed at both levels of DIF: the accuracy of detection was highest with the larger sample size (N=1000) and in the absence of impact. In direct contrast, power was notably reduced when the smaller sample size (N=500) and maximum impact (1.0 SD) were simulated.

Sample size

As shown in Tables 4-18 and 4-19, power rates were positively related to sample size, a result that was not unexpected. For the N=500 conditions, the DIF detection rate was .375 on average. A marked improvement in detection performance was observed (.520) when the sample size was increased to N=1000. A comparison across the two levels of sample size reveals that the factor mixture procedure exhibited its greatest power to detect DIF under the combined conditions of large DIF (1.5) and equality of latent trait means.

Impact

The effect of impact on DIF detection rates was also examined. The three levels investigated were: (i) equal latent trait means, (ii) a 0.5 SD difference between latent means, representing a moderate amount of impact, and (iii) a 1.0 SD difference between latent trait means, representing a large amount of impact.
The aggregated results in Tables 4-20 through 4-22 show that as the difference in latent trait means between the groups increased, there was a negligible decline in the accuracy of the factor mixture method in detecting DIF. For example, the average power rate decreased marginally from .486 to .455 to .401 under the no-impact, 0.5 SD, and 1.0 SD conditions, respectively. These results show that the presence of impact did not adversely affect the ability of the factor mixture approach to detect DIF.

Effect of item discrimination parameter values

For the five items simulated to contain DIF, three different levels of item discrimination were selected. For two items (Items 2 and 3), the discrimination parameter was set at 0.5 to mimic low-discriminating items; for one item (Item 4), the a-parameter was set at 1.0, a medium level of discrimination; and two items (Items 5 and 6), with an a-parameter of 2.0, represented highly discriminating items. The discrimination parameter values for the non-DIF items were randomly drawn from a normal distribution within a restricted range. The power rates for DIF detection, categorized by the level of item discrimination, are shown in Table 4-23. These results show, as expected, that power is influenced by the item discrimination parameter. More specifically, the accuracy of DIF detection increased as the item discrimination values increased. The factor mixture method had, on average, a .369 rate of detecting DIF in low-discriminating items; this increased to .495 and .502 when DIF was simulated in items with medium and high values on the a-parameter, respectively. Moreover, while there was generally a clear difference in the accuracy of DIF detection between the low-discriminating items (a=0.5) and either the medium or highly discriminating items, no discernible differences were evident when comparing DIF detection rates between items with medium (a=1.0) and high (a=2.0) values on the a-parameter.
The patterns of DIF detection discussed earlier remained consistent across the simulation conditions regardless of the items' discriminating ability.

Variance components analysis

Finally, a variance components analysis was conducted to determine the influence of the simulation conditions and their interactions on the power rates. In this analysis, the power rates across the five DIF items served as the dependent variable and the simulated conditions as the independent variables. As expected, the results showed that among the main effects, DIF magnitude was the largest contributor, accounting for 19% of the variance in the power rates. It was followed by sample size at approximately 5% and the interaction between these two factors at 1.2%. Each of the other terms contributed less than 1.0% of the variance in the power rates. These results are shown in Table 4-24.

Table 4-1. Number of converged replications for the three factor mixture models

DIF Magnitude  Sample Size  Ability Differences  One-class  Two-class  Three-class
1.0            500          0                    50         46         43
1.0            500          0.5                  50         46         46
1.0            1000         0                    50         45         41
1.0            1000         0.5                  50         48         43
1.5            500          0                    50         50         47
1.5            500          0.5                  50         49         46
1.5            1000         0                    50         50         49
1.5            1000         0.5                  50         50         48

Table 4-2. Mean AIC values for the three mixture models

DIF Magnitude  Sample Size  Ability Differences  One-class   Two-class   Three-class
1.0            500          0                    9,520.64    9,509.62    9,501.86
1.0            500          0.5                  9,310.67    9,305.81    9,298.37
1.0            1000         0                    18,956.98   18,961.67   18,961.03
1.0            1000         0.5                  18,578.46   18,578.01   18,569.40
1.5            500          0                    9,495.81    9,473.12    9,473.10
1.5            500          0.5                  9,314.87    9,284.66    9,293.38
1.5            1000         0                    18,953.93   18,916.12   18,918.62
1.5            1000         0.5                  18,559.08   18,556.92   18,550.50
Note: AIC = Akaike information criterion

Table 4-3.
Mean BIC values for the three mixture models

DIF Magnitude  Sample Size  Ability Differences  One-class   Two-class   Three-class
1.0            500          0                    9,647.08    9,707.71    9,771.59
1.0            500          0.5                  9,437.11    9,503.89    9,568.11
1.0            1000         0                    19,104.21   19,192.34   19,275.12
1.0            1000         0.5                  18,725.70   18,808.68   18,883.21
1.5            500          0                    9,622.25    9,671.21    9,742.83
1.5            500          0.5                  9,441.31    9,482.75    9,563.11
1.5            1000         0                    19,101.16   19,146.78   19,232.72
1.5            1000         0.5                  18,706.31   18,787.58   18,864.60
Note: BIC = Bayesian information criterion

Table 4-4. Mean ssaBIC values for the three mixture models

DIF Magnitude  Sample Size  Ability Differences  One-class   Two-class   Three-class
1.0            500          0                    9,551.86    9,558.53    9,568.45
1.0            500          0.5                  9,341.89    9,354.71    9,364.97
1.0            1000         0                    19,008.93   19,043.06   19,071.86
1.0            1000         0.5                  18,630.42   18,659.40   18,679.95
1.5            500          0                    9,527.02    9,522.03    9,539.69
1.5            500          0.5                  9,346.09    9,333.56    9,359.97
1.5            1000         0                    19,005.88   18,997.51   19,029.45
1.5            1000         0.5                  18,611.03   18,638.31   18,661.33
Note: ssaBIC = sample size adjusted Bayesian information criterion

Table 4-5. Percentages of converged solutions across study conditions

DIF Magnitude  Sample Size  Ability Differences  Converged (%)
1.0            500          0                    99.8
1.0            500          0.5                  99.5
1.0            500          1.0                  99.4
1.0            1000         0                    100.0
1.0            1000         0.5                  99.7
1.0            1000         1.0                  99.6
1.5            500          0                    99.8
1.5            500          0.5                  99.7
1.5            500          1.0                  99.7
1.5            1000         0                    100.0
1.5            1000         0.5                  100.0
1.5            1000         1.0                  99.8

Table 4-6. Overall Type I error rates across study conditions

DIF   Sample Size  Impact  Error rate
1.0   500          0       0.123
1.0   500          0.5     0.126
1.0   500          1.0     0.159
1.0   1000         0       0.131
1.0   1000         0.5     0.129
1.0   1000         1.0     0.138
1.5   500          0       0.097
1.5   500          0.5     0.112
1.5   500          1.0     0.116
1.5   1000         0       0.092
1.5   1000         0.5     0.092
1.5   1000         1.0     0.100

Table 4-7. Type I error rates for DIF = 1.0

Sample size  Impact  Error rate
500          0       0.123
500          0.5     0.126
500          1.0     0.159
1000         0       0.131
1000         0.5     0.129
1000         1.0     0.138

Table 4-8. Type I error rates for DIF = 1.5

Sample size  Impact  Error rate
500          0       0.097
500          0.5     0.112
500          1.0     0.116
1000         0       0.092
1000         0.5     0.092
1000         1.0     0.100

Table 4-9. Type I error rates for sample size of 500

DIF   Impact  Error rate
1.0   0       0.123
1.0   0.5     0.126
1.0   1.0     0.159
1.5   0       0.097
1.5   0.5     0.112
1.5   1.0     0.116

Table 4-10.
Type I error rates for sample size of 1000

DIF   Impact  Error rate
1.0   0       0.131
1.0   0.5     0.129
1.0   1.0     0.138
1.5   0       0.092
1.5   0.5     0.092
1.5   1.0     0.100

Table 4-11. Type I error rates for impact of 0 SD

DIF   Sample size  Error rate
1.0   500          0.123
1.0   1000         0.131
1.5   500          0.097
1.5   1000         0.092

Table 4-12. Type I error rates for impact of 0.5 SD

DIF   Sample size  Error rate
1.0   500          0.126
1.0   1000         0.129
1.5   500          0.112
1.5   1000         0.092

Table 4-13. Type I error rates for impact of 1.0 SD

DIF   Sample size  Error rate
1.0   500          0.159
1.0   1000         0.138
1.5   500          0.116
1.5   1000         0.100

Table 4-14. Variance components analysis for Type I error

Condition          Proportion of variance
DIF magnitude (D)  .007
Sample size (S)    .000
Impact (I)         .001
D*S                .000
D*I                .000
S*I                .000
D*S*I              .000

Table 4-15. Overall power rates across study conditions

DIF   Sample Size  Impact  Power
1.0   500          0       0.268
1.0   500          0.5     0.267
1.0   500          1.0     0.264
1.0   1000         0       0.350
1.0   1000         0.5     0.324
1.0   1000         1.0     0.291
1.5   500          0       0.525
1.5   500          0.5     0.498
1.5   500          1.0     0.425
1.5   1000         0       0.801
1.5   1000         0.5     0.731
1.5   1000         1.0     0.623

Table 4-16. Power rates for DIF of 1.0

Sample size  Impact  Power
500          0       0.268
500          0.5     0.267
500          1.0     0.264
1000         0       0.350
1000         0.5     0.324
1000         1.0     0.291

Table 4-17. Power rates for DIF of 1.5

Sample size  Impact  Power
500          0       0.525
500          0.5     0.498
500          1.0     0.425
1000         0       0.801
1000         0.5     0.731
1000         1.0     0.623

Table 4-18. Power rates for sample size N of 500

DIF   Impact  Power
1.0   0       0.268
1.0   0.5     0.267
1.0   1.0     0.264
1.5   0       0.525
1.5   0.5     0.498
1.5   1.0     0.425

Table 4-19. Power rates for sample size N of 1000

DIF   Impact  Power
1.0   0       0.350
1.0   0.5     0.324
1.0   1.0     0.291
1.5   0       0.801
1.5   0.5     0.731
1.5   1.0     0.623

Table 4-20. Power rates for impact of 0 SD

DIF   Sample Size  Power
1.0   500          0.268
1.0   1000         0.350
1.5   500          0.525
1.5   1000         0.801

Table 4-21. Power rates for impact of 0.5 SD

DIF   Sample Size  Power
1.0   500          0.267
1.0   1000         0.324
1.5   500          0.498
1.5   1000         0.731

Table 4-22. Power rates for impact of 1.0 SD

DIF   Sample Size  Power
1.0   500          0.264
1.0   1000         0.291
1.5   500          0.425
1.5   1000         0.623

Table 4-23.
Power rates for DIF detection based on item discriminations

DIF   Sample Size  Impact  a=0.5  a=0.5  a=1.0  a=2.0  a=2.0
1.0   500          0.0     .184   .194   .273   .332   .359
1.0   500          0.5     .200   .220   .273   .321   .323
1.0   500          1.0     .200   .227   .273   .326   .296
1.0   1000         0.0     .261   .257   .387   .430   .416
1.0   1000         0.5     .248   .243   .363   .372   .392
1.0   1000         1.0     .226   .238   .332   .300   .357
1.5   500          0.0     .409   .425   .612   .598   .580
1.5   500          0.5     .411   .401   .561   .546   .571
1.5   500          1.0     .340   .360   .465   .481   .476
1.5   1000         0.0     .731   .728   .885   .841   .819
1.5   1000         0.5     .640   .655   .814   .782   .763
1.5   1000         1.0     .531   .528   .697   .675   .685
Mean                       .365   .373   .495   .500   .503
Note: The five columns correspond to the five DIF items (Items 2-6): two with a = 0.5, one with a = 1.0, and two with a = 2.0.

Table 4-24. Variance components analysis for power results

Condition          Proportion of variance
DIF magnitude (D)  .190
Sample size (S)    .046
Impact (I)         .009
D*S                .012
D*I                .005
S*I                .002
D*S*I              .001

CHAPTER 5
DISCUSSION

This study was designed to evaluate the overall performance of factor mixture analysis in detecting uniform DIF. Specifically, there were two primary research goals: (i) to assess the ability of the factor mixture approach to correctly recover the number of latent classes, and (ii) to examine the Type I error rates and statistical power associated with the approach under various study conditions. Using data generated under a 2PL IRT framework, a Monte Carlo simulation study was conducted to investigate the properties of the proposed factor mixture model approach to DIF detection. First, responses to a 15-item dichotomous test were simulated for a two-group, two-class population. In both parts of the study, the effects of DIF magnitude, sample size, and differences in latent trait means on the performance of the mixture approach were examined. The major findings of each phase of the simulation are summarized first, followed by a discussion of the limitations of this study and suggestions for future research.

Class Enumeration and Performance of Fit Indices

In assessing the ability of the factor mixture approach to recover the correct number of latent classes, models with one through three latent classes were fit to the simulated data.
In addition, three commonly used information criteria (ICs: the AIC, BIC, and ssaBIC) were used in the selection of the correct model. Overall, there was a high level of inconsistency among the three ICs. In this study, the AIC tended to over-extract the number of classes and, under the majority of study conditions, supported the more complex but incorrect three-class model over the true two-class model. This behavior contrasted sharply with that of the BIC, which tended to underestimate the correct number of latent classes and consistently favored the simpler one-class model. In contrast to the distinctly different results produced by the AIC and BIC, the ssaBIC produced more balanced results, showing a preference for the two-class model over the one-class model as the magnitude of DIF simulated between groups increased. Moreover, of the three factors examined (magnitude of DIF, sample size, and presence of impact), the patterns of model selection were most affected by the change in DIF magnitude. However, while the behavior of all three ICs was influenced by larger amounts of simulated DIF, the effect differed across ICs. For example, when the DIF magnitude was increased from 1.0 to 1.5, the ssaBIC identified the two-class model under three of the four conditions. In the case of the AIC, the two-class model had its lowest average IC values for two of the four conditions. And while the BIC still tended to favor the one-class model, the differences between the one-class and two-class models narrowed as the DIF magnitude increased from 1.0 to 1.5. Therefore, the ssaBIC was most affected by the presence of larger DIF, followed by the AIC and lastly the BIC.
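For reference, the three ICs compared here are simple functions of a fitted model's log-likelihood, its number of free parameters, and the sample size, with smaller values indicating better fit; the ssaBIC uses Sclove's adjusted sample size (n + 2)/24 in place of n in the BIC penalty. A minimal sketch, with made-up log-likelihoods and parameter counts in the example loop:

```python
import math

def information_criteria(loglik: float, n_params: int, n: int) -> dict:
    """AIC, BIC, and sample-size-adjusted BIC for a fitted model.

    AIC    = -2LL + 2p
    BIC    = -2LL + p * ln(n)
    ssaBIC = -2LL + p * ln((n + 2) / 24)   # weaker penalty than the BIC
    """
    return {
        "AIC": -2 * loglik + 2 * n_params,
        "BIC": -2 * loglik + n_params * math.log(n),
        "ssaBIC": -2 * loglik + n_params * math.log((n + 2) / 24),
    }

# Illustrative comparison of one- vs. two-class models (hypothetical values):
for label, ll, p in [("1-class", -4745.0, 30), ("2-class", -4715.0, 46)]:
    ics = information_criteria(ll, p, n=500)
    print(label, {k: round(v, 1) for k, v in ics.items()})
```

Because the ssaBIC penalty lies between those of the AIC and BIC for the sample sizes studied here, its "middle ground" behavior in class enumeration follows directly from these formulas.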
In discussing these findings, it is important to note that the results of this Monte Carlo study, though disappointing, were not totally unexpected, since previous research has reported similarly inconsistent performances for these fit indices (Li et al., 2009; Lin & Dayton, 1997; Nylund et al., 2006; Reynolds, 2008; Tofighi & Enders, 2007; Yang, 2006). Additionally, the pattern of results exhibited by the indices in this study has also been observed in other mixture model studies. For example, in research conducted by Li et al. (2009), Lin and Dayton (1997), and Yang (1998), the authors observed similar patterns of behavior, namely, the tendency of the AIC to overestimate the true number of classes and of the BIC to select simpler models with a smaller number of latent classes. On the other hand, while simulation results from Nylund et al. (2006) supported the finding of the AIC favoring models with more latent classes, their study found the BIC to be the most consistent indicator of the true number of latent classes. This latter result contrasted with other studies that touted the merits of the ssaBIC over the BIC for class enumeration (Henson, 2004; Yang, 2006; Tofighi & Enders, 2007). Therefore, given the inconsistencies in results, no single information criterion can be regarded as the most appropriate for class enumeration across all types of finite mixture models. Liu (2009) argued that because the performances of the indices depend heavily on the estimation model and the population assumptions, these inconsistencies should be expected. In addition, because to date no full-scale study has been conducted comparing the performance of these indices for factor mixture DIF applications, no definite conclusion can be reached regarding the index that is best suited for this type of factor mixture application. Clearly, this represents an opportunity for future research.
Results from this study also point to several instances where negligible differences in IC values between neighboring class models were observed. That is, even though a model may have produced the lowest average IC value, the IC value of the k+1 or k-1 class model did not differ substantially from that of the k-class model. In such cases, the absence of an agreed-upon standard for judging the significance of these IC differences increases the ambiguity of selecting the correct model. This presents the opportunity for the creation of such a significance statistic, a possibility that will be explored later as a potential area for further research. Overall, the ambiguity of these findings serves to reinforce the point made earlier, namely, that IC results should never be relied upon as the sole determinant of the number of classes. Several researchers have stressed the importance of incorporating substantive theory in guiding the model selection decision (Allua, 2007; Bauer & Curran, 2004; Kim, 2009; Nylund et al., 2007; Reynolds, 2008). Moreover, Reynolds (2008) contends that the researcher often has some belief about the underlying subpopulations; this belief should therefore be taken into account in determining which of the models best fits the data.

Type I Error and Statistical Power Performance

In this phase of the study, the performance of the factor mixture model was evaluated in terms of its Type I error rate and power of DIF detection. As in the first part of the study, data were again simulated for a 15-item test based on the 2PL IRT model. However, in this case it was assumed that the number of classes was known to be two. Five of the 15 items were simulated to contain uniform DIF in favor of the reference group.
In investigating the Type I error rate and power of the test, three factors (DIF magnitude, sample size, and impact) shown previously to affect DIF detection were also manipulated, and their effects on the test were noted. More specifically, two levels of DIF magnitude (1.0 and 1.5) and of sample size (N=500, N=1000) were simulated. For the effect of impact, three levels (0, 0.5 SD, and 1.0 SD) were chosen to reflect no, moderate, and large mean differences in the latent trait. For each of the 12 conditions, a total of 1000 replications were run. The Type I error and statistical power of the factor mixture method for DIF detection were investigated across all conditions.

Type I Error Rate Study

With the exception of the referent (Item 1), whose thresholds were constrained across latent classes for identification purposes, the remaining nine DIF-free items were used to assess the ability of the factor mixture approach to keep the Type I error close to the nominal alpha level of .05. However, the factor mixture DIF approach yielded inflated error rates ranging in magnitude from .092 to .159 across the 12 study conditions. Whereas the rates of incorrect detection improved with larger DIF and sample size, increasing the impact had little effect on the control of Type I error rates. In assessing the performance of several DIF detection procedures, previous studies have confirmed the inverse relationship between the inflation of Type I error rates and both sample size and size of DIF, with tests attaining their best control of Type I error rates when sample sizes are larger and DIF is greater (Cohen, Kim, & Baker, 1993; Dainis, 2008; Donoghue, Holland, & Thayer, 1993; Oort, 1998; Wanichtanom, 2001). Previous simulation results regarding the influence of impact on Type I error rates have been divided.
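The generating model just described can be sketched as follows. The helper name, the difficulty draws, and the equal discriminations are illustrative assumptions, but the DIF mechanism matches the uniform-DIF design above: a constant shift in the difficulties of Items 2-6 for one group, with discriminations left untouched, plus an optional latent mean shift for impact.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_2pl(theta, a, b, rng):
    """Dichotomous responses under the 2PL model: P(x=1) = 1 / (1 + exp(-a(theta - b)))."""
    prob = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b[None, :])))
    return (rng.random(prob.shape) < prob).astype(int)

n_per_group, n_items, dif = 250, 15, 1.0   # illustrative values
a = np.full(n_items, 1.0)                  # discriminations (held equal here)
b_ref = rng.uniform(-1.5, 1.5, n_items)    # reference-group difficulties

# Uniform DIF: shift the difficulty of Items 2-6 (indices 1-5) against the
# focal group; discriminations stay equal, so the DIF is uniform.
b_foc = b_ref.copy()
b_foc[1:6] += dif

impact = 0.5                               # latent mean difference (SD units)
theta_ref = rng.normal(0.0, 1.0, n_per_group)
theta_foc = rng.normal(-impact, 1.0, n_per_group)

x = np.vstack([simulate_2pl(theta_ref, a, b_ref, rng),
               simulate_2pl(theta_foc, a, b_foc, rng)])
print(x.shape)
```

The resulting response matrix would then be analyzed with the two-class factor mixture model (e.g., the Mplus setup in Appendix A).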
Whereas some studies have reported Type I error inflation in the presence of impact (Cheung & Rensvold, 1999; Lee, 2009; Roussos & Stout, 1996; Uttaro & Millsap, 1994), others have shown good control of the error rates for moderate impact of .5 SD (Stark et al., 2006) and even for latent mean differences as large as 1 SD (Shealy & Stout, 1993). Differences in latent ability distributions are common for both cognitive and non-cognitive measures; hence, it is critical that DIF detection methods, particularly those that do not differentiate between the presence of DIF and impact, be robust to the effects of group differences in latent trait means.

Statistical Power Study

The study also evaluated the power of the factor mixture approach to detect uniform DIF. In spite of the failure of the factor mixture analysis to adequately control the Type I error rates across the study conditions, the power results were still reviewed to get some sense of the pattern of DIF detection. Overall, these findings represent a mix of the predictable and the unexpected. What was expected was that the power of the factor mixture method of DIF detection would increase as sample size and magnitude of DIF increased. In addition, it was not surprising that the magnitude of the discrimination parameter also influenced DIF detection rates; power was highest when detecting DIF in the more highly discriminating items, followed by studied items with medium and low discrimination parameters. These results are not only intuitively appealing but have been consistently supported by prior research conducted with different methods of DIF detection (Donoghue et al., 1993; Narayanan & Swaminathan, 1994; Rogers & Swaminathan, 1993; Stark et al., 2006). On the other hand, the surprising result was that even in the presence of large latent trait mean differences of 1.0 SD, the rates of DIF detection were not adversely affected by impact.
While this finding was consistent with some studies (González-Romá et al., 2006; Narayanan & Swaminathan, 1994; Rogers & Swaminathan, 1993; Shealy & Stout, 1993; Stark et al., 2006), others have reported contradictory results, with reductions in power as the disparity in latent means increased (Ankemann et al., 1999; Clauser, Mazor, & Hambleton, 1993; Finch & French, 2007; Narayanan & Swaminathan, 1996; Tian, 1999; Zwick, Donoghue, & Grima, 1993). However, it is important to note that these prior empirical studies all utilized standard DIF analyses rather than the mixture approach used in this simulation.

Reconciling the Simulation Results

On one hand, the overall pattern of findings across the simulation conditions is consistent with previous DIF results. On the other, the factor mixture approach was not as successful as hoped at controlling the rate of false identifications and, as a result, at demonstrating adequate power to detect DIF. If the factor mixture approach is to be regarded as a viable DIF detection method, possible reasons for this deviation from the expected performance must be addressed. Under a manifest approach, an item is said to exhibit DIF if groups matched on the latent ability trait differ in their probabilities of item response (Cohen et al., 1993). In that context, DIF is defined with respect to the manifest groups being considered. By contrast, the mixture approach posits a different conceptualization of DIF. In this case, the underlying assumption is that DIF is observed because of differences in item responses between unobserved latent classes rather than between known manifest groups. Moreover, it is further assumed that unless there is perfect overlap between the manifest groups and the latent classes, the two methods should not be expected to produce the same DIF results (De Ayala et al., 2002).
Perfect overlap implies that the composition of each latent class is exactly the same as that of the corresponding manifest group. For instance, in the case of a two-class, two-group population, 100% of the reference group would comprise latent class 1, while 100% of the focal group would belong to latent class 2. However, De Ayala et al. (2002) contend that this perfect equivalence between latent classes and manifest groups is unlikely to occur. Because the composition of the latent classes is likely to differ from that of the manifest groups, the DIF results should be expected to differ, particularly as the level of overlap moves from 100% toward 50%. Therefore, while some similarity in results between the two approaches is expected, the results are not necessarily identical except in the case of perfect group-class correspondence. In this simulation, given that the overlap between the latent classes and manifest groups was simulated to be 80%, the DIF results should be expected to differ to some degree. One possible source of the Type I error rate inflation, then, may be this difference in the definition and conceptualization of DIF. Additionally, the procedure used to test the invariance of the items may also have contributed to this seemingly high rate of inflation. In testing the significance of the differences in item thresholds, Mplus invokes a Wald test. An examination of these estimates revealed several large coefficients, which in turn would have resulted in large z-statistics and an increased likelihood of significance. However, the issue of whether the inflated error rate resulted from applying a factor mixture approach to these data or from using significance tests of threshold differences to identify non-invariant items remains unresolved.
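The threshold-difference test just described amounts to a Wald z-test on the between-class difference of an item's threshold estimates. A minimal sketch, with hypothetical estimates and standard errors; the independent-errors assumption in the standard error of the difference is a simplification, since a full Wald test would also use the covariance of the two estimates:

```python
import math

def wald_z(est1, est2, se1, se2):
    """Wald z statistic for the difference between two estimates,
    assuming the estimates are uncorrelated (a simplification)."""
    diff = est1 - est2
    se_diff = math.sqrt(se1**2 + se2**2)
    z = diff / se_diff
    # two-sided p-value from the standard normal distribution
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

# Hypothetical thresholds for one item in the two latent classes:
z, p = wald_z(0.80, 0.15, se1=0.20, se2=0.22)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A large between-class difference relative to its standard error, as in this hypothetical case, yields z > 1.96 and a rejection at the nominal .05 level, which is the mechanism behind the large z-statistics noted above.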
Limitations of the Study and Suggestions for Future Research

As with all simulation research, there are several limitations to this study; these limitations, however, also point to the need for future research. First, in determining the correct number of latent classes, the findings were limited by the use of only one type of model fit index. It would have been interesting to compare the results of the information criteria (i.e., AIC, BIC, and ssaBIC) with those of alternative tests such as the Lo-Mendell-Rubin likelihood ratio test (LMR LRT) and the bootstrap LRT (BLRT). In their simulation study, Nylund et al. (2006) found that the LMR LRT was reasonably effective at identifying the correct mixture model, but that the BLRT outperformed both the likelihood-based indices and the LMR LRT as the most consistent indicator for choosing the correct number of classes. While these results are promising, the LMR LRT and the BLRT are not without potential drawbacks. Jeffries (2003) has been critical of the LMR LRT's use in mixture modeling and has suggested that the statistic be applied with caution. In addition, the BLRT, which uses bootstrap samples, is a far more computationally intensive approach than the information-based statistics. As a result, the BLRT, though seemingly a reliable index, is seldom used in practice by the applied researcher (Liu, 2008). Therefore, additional attention may be focused on identifying alternative, robust model selection measures that provide more consistency than the ICs but are less computationally demanding than the BLRT. A second limiting factor in this part of the study was that the selection of the best-fitting model was based on average IC values. A more reliable approach would have been to determine the percentage of times (out of the completed replications) that each index identified the correct model.
However, in this study, it was not possible to provide a one-to-one comparison of the IC values across the three class solutions when a 100% convergence rate was not achieved. Therefore, in future research, this change should be implemented so that the percentage of correct model identifications can be compared for each of the indices. It should also be mentioned that while previous studies have evaluated the performance of model selection methods with respect to a variety of mixture models (GMM, LCA, FMM), to date no research has been conducted to evaluate the performance of these indices when used in the context of DIF detection. Filling this gap in the methodology literature requires a more extensive study focusing on the detection of DIF with mixture models. As with all simulation research, the findings can only be generalized to the limited number of conditions selected for this study. It should be noted that in the original design of this study, several additional conditions were considered. However, given the computational intensity of mixture modeling, and in the interest of time, it was decided to reduce the number of study conditions to the smaller set that was studied. Therefore, future research should consider a broader range of simulation conditions, which would make for a more realistic study. For example, in addition to sample size, it would be of interest to investigate the ratio of focal- to reference-group sample sizes as well. In this study, a 1:1 sample size ratio of focal to reference group was considered. While this may be representative of an evenly split manifest variable such as gender (Samuelsen, 2005), unequal group sizes tend to mimic minority population characteristics such as race (e.g., Caucasian vs. Black or Hispanic). In traditional DIF assessments, power rates are typically higher for equal focal and reference group sizes than for unequal sample size ratios (Atar, 2007).
Therefore, it would be interesting to investigate whether this finding holds for factor mixture DIF detection methods as well. Other conditions, fixed in the current study, that could be manipulated in future research include: (i) the nature of the items, (ii) the scale length, and (iii) the type of DIF. In this study, data were simulated for dichotomous items only. An interesting extension would be the evaluation of the model using categorical response data generated from polytomous IRT models (e.g., the graded response model or the partial credit model). Another condition that could be manipulated is the number or proportion of items simulated to contain DIF. In addition, assessing the performance of the model selection indices and the mixture model with respect to varying scale lengths would make for a more complete and informative study. While longer tests are expected to produce lower Type I error rates and greater power, it would be of interest to determine how short the scale can be while the test still performs adequately. The focus of this study was on the detection of uniform DIF. In future research, the DIF-type factor can be extended to include both uniform and non-uniform DIF. To test for the presence of non-uniform DIF, the factor mixture model as implemented in this study must be reformulated so that, in addition to the item thresholds, the factor loadings are allowed to vary across classes. The Type I error rates and power of the factor mixture model to detect non-uniform DIF can then be evaluated and compared with the corresponding results for uniform DIF. Additionally, the item discrimination parameter was not included as a crossed factor in the study design; instead, its effect was examined separately as a single condition. In future research, the effect of including this factor in the design may be investigated.
In generating the data, the mixing proportion for the two classes was simulated to be .50. However, after the model estimation phase, the ability of the factor mixture approach to accurately recover the class proportions was not evaluated. This omission should also be addressed in future research. Finally, to the author's knowledge, the strategy used to test the items for non-invariance has only recently been introduced to the factor mixture literature and to date has been implemented in two studies. Its advantage is that it provides a simpler, more direct alternative to DIF detection than the CFA baseline approaches, which require the estimation and comparison of two models. However, it has not yet been subjected to the methodological rigor of more established methods. Therefore, a potential extension to this study would be a comparison of the performance of significance testing of threshold differences using the Mplus model constraint option versus either a constrained- or a free-baseline strategy for testing DIF with mixture CFA models.

Conclusion

In the last decade, a burgeoning literature on mixture modeling and its applications has emerged. And although several of these research efforts have been concentrated in the area of growth mixture modeling, there is also a groundswell of interest in applying a mixture approach to the study of measurement invariance. Therefore, in concluding this dissertation, it is important to reiterate the motivation that should precede the use of this technique, as well as some key concerns that applied researchers should keep in mind when deciding whether mixture modeling is an appropriate approach for their research. The intrinsic appeal of mixture models is that they allow for the exploration of unobserved population heterogeneity using latent variables. Under the traditional conceptualization, DIF is defined with respect to distinct, known subgroups.
Therefore, in using standard DIF approaches, practitioners seek to determine whether, after controlling for latent ability, differences in item response patterns are attributable to a known variable such as gender or race. When investigating DIF from a latent perspective, however, there is an implied assumption that the presence of unobserved latent classes gives rise to the pattern of differential functioning in the items. Advocates of this approach contend that it allows for a better understanding of why examinees may be responding differently to items, and this is certainly an attractive inducement to practitioners. However, the present results suggest that unless large sample sizes and large amounts of DIF are present in the data, the factor mixture approach is likely to be unsuccessful at disentangling the population into distinct, distinguishable latent classes. Additionally, commonly used fit indices such as the AIC, BIC, and ssaBIC are likely to produce inconsistent results and may lead to the selection of more or fewer classes than actually exist in the population. Therefore, it is critical that the practitioner have a strong theoretical justification to support the assumption of population heterogeneity. This should decrease the ambiguity in the selection of the best-fitting model for the data and in the interpretation of the nature of the latent classes. When the data and the theory support the existence of these latent classes, however, the technique can be used successfully to detect qualitatively different subpopulations with differential patterns of response that might otherwise have been overlooked by a traditional DIF procedure. In the context of education research, the application of mixture models can provide valuable diagnostic information that can be used to gain insight into students' cognitive strengths and weaknesses.
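Because the three fit indices differ only in their penalty terms, the kind of disagreement just described is easy to reproduce. The following Python sketch computes all three from a fitted model's log-likelihood; the log-likelihoods and parameter counts are hypothetical illustration values, not results from the simulation:

```python
import math

def information_criteria(loglik, n_params, n_obs):
    """AIC, BIC, and sample-size-adjusted BIC (ssaBIC).

    The ssaBIC replaces n in the BIC penalty with (n + 2) / 24
    (Sclove, 1987); for all three indices, lower values indicate better fit.
    """
    return {
        "AIC": -2.0 * loglik + 2.0 * n_params,
        "BIC": -2.0 * loglik + n_params * math.log(n_obs),
        "ssaBIC": -2.0 * loglik + n_params * math.log((n_obs + 2.0) / 24.0),
    }

# Hypothetical log-likelihoods and parameter counts for 1-, 2-, and
# 3-class solutions fitted to n = 1000 observations.
solutions = {1: (-9450.0, 30), 2: (-9400.0, 46), 3: (-9380.0, 62)}
fits = {k: information_criteria(ll, p, 1000) for k, (ll, p) in solutions.items()}

# Which class solution each index prefers (smallest value wins)
best = {c: min(fits, key=lambda k: fits[k][c]) for c in ("AIC", "BIC", "ssaBIC")}
```

With these illustrative numbers the indices disagree (AIC prefers three classes, BIC one, and ssaBIC two), mirroring the pattern of inconsistency reported in the simulation results.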
This study was designed as a means of bridging the gap between the manifest and latent approaches by examining the performance of the factor mixture approach in detecting DIF in items generated via a traditional framework. Even though the manifest approach will remain a staple in the DIF literature, interest in factor mixture models for DIF is expected to continue to grow. Further exploring how these two approaches differ, not only as concepts but also in results and application, will ensure that each is appropriately used in practice.

APPENDIX A
MPLUS CODE FOR ESTIMATING A 2-CLASS FMM

TITLE: Factor mixture model for a two-class solution.
DATA: FILE IS allnames.txt;
    TYPE = montecarlo;
VARIABLE: NAMES = u1-u15 class group;
    USEVARIABLES ARE u1-u15;
    CATEGORICAL = u1-u15;
    CLASSES = c (2);
ANALYSIS: TYPE = MIXTURE;
    ALGORITHM = INTEGRATION;
    INTEGRATION = STANDARD (20);
    STARTS = 600 20;
    PROCESSORS = 2;
MODEL: %OVERALL%
    f BY u1-u15;
    %c#1%
    [u2$1-u15$1];
    f;
    %c#2%
    [u2$1-u15$1];
    f;
OUTPUT: TECH8 TECH9;
    STANDARDIZED;
SAVEDATA: RESULTS ARE results.txt;

APPENDIX B
MPLUS CODE FOR DIF DETECTION

TITLE: Factor mixture model for a two-class solution.
    Items = 15, DIF = 1.0
DATA: FILE IS allnames.txt;
    TYPE = montecarlo;
VARIABLE: NAMES = u1-u15 class group;
    USEVARIABLES ARE u1-u15;
    CATEGORICAL = u1-u15;
    CLASSES = c (2);
ANALYSIS: TYPE = MIXTURE;
    ALGORITHM = INTEGRATION;
    INTEGRATION = STANDARD (20);
    STARTS = 0;
    PROCESSORS = 2;
MODEL: %OVERALL%
    f BY u1@1
         u2*0.500
         ...
         u15*0.867;
    %c#1%
    [u1$1] (p1_1);          !Assigns names to thresholds for constraint purposes
    [u2$1*-0.500] (p1_2);
    ...
    [u15$1*0.609] (p1_15);
    f;
    %c#2%
    [u1$1] (p1_1);          !Threshold of item 1 constrained equal across classes
    [u2$1*0.000] (p2_2);    !Remaining 14 item thresholds freely estimated
    ...
    [u15$1*0.609] (p2_15);
    f;
MODEL CONSTRAINT:
    NEW(difi2 difi3 difi4 difi5 difi6 difi7 difi8 difi9
        difi10 difi11 difi12 difi13 difi14 difi15);
    !Declares new parameters (difi2, ..., difi15) as functions of the thresholds
    difi2 = p2_2 - p1_2;    !Estimates threshold differences
    ...
    difi15 = p2_15 - p1_15;
BIOGRAPHICAL SKETCH

Mary Grace-Anne Jackman was born in Bridgetown, Barbados. In 1994, she graduated from the University of the West Indies, Barbados, with a Bachelor of Science degree in mathematics and computer science (first class honors). After being awarded an Errol Barrow Scholarship, she entered Oxford University in 1996 and received a Master of Science degree in applied statistics in 1997. In 2002 she graduated from the University of Georgia with a master's degree in marketing research. Following four years as a marketing research consultant in New York and Barbados, she began doctoral studies in research and evaluation methodology at the University of Florida in the fall of 2006.