ASSESSING THE MODEL FIT AND CLASSIFICATION ACCURACY IN COGNITIVE DIAGNOSIS MODELS

By

MIAO GAO

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2014
© 2014 Miao Gao
To my parents
ACKNOWLEDGMENTS

It would not have been possible for me to complete this dissertation without the support and guidance of a number of individuals. First, I would like to express my gratitude to my advisor, Dr. M. David Miller, for leading and mentoring me through the dissertation, and for all the support he provided in every aspect of the entire program. I would also like to thank my committee members, Dr. James Algina, Dr. Walter Leite, Dr. Anne Corinne Manley, and Dr. Tim Jacobbe. I especially thank Dr. James Algina for always being so supportive of graduate students, and for the many opportunities and experiences he offered me during graduate school. I would like to thank Dr. Walter Leite for his training in simulation research and for helping me become a researcher. I am grateful to Dr. Corinne Manley for her insights into cognitive diagnostic models and her helpful comments and suggestions on my dissertation. I also want to thank Dr. Tim Jacobbe for being a very supportive external committee member. I want to express my gratitude to the Department of Research and Evaluation Methodology and the Office of the Graduate School for all their support during these four years. I also appreciate Dr. Cynthia Griffin, Dr. Nancy Dana, Dr. Stephen Pape, and Dr. Patricia Snyder for giving me the opportunity to work on their projects and gain valuable experience, and for providing financial support for my study. Last but not least, I am very thankful to my parents for their endless love and greatest support throughout my graduate studies. Without them, I could not have reached the end of the Ph.D. process.
TABLE OF CONTENTS

page

ACKNOWLEDGMENTS  4
LIST OF TABLES  7
LIST OF FIGURES  8
LIST OF ABBREVIATIONS  9
ABSTRACT  10

CHAPTER

1 INTRODUCTION AND LITERATURE REVIEW  12
  Background  12
  Cognitive Diagnostic Models  15
  The Generalized DINA Model Framework  17
  The Q Matrix  20
  Model Fit Evaluation  21
    Relative Fit  22
    Absolute Fit  23
  Classification Accuracy  25
  Review of Simulation Studies  27
  Review of Applied Studies  35
  Research Questions  38

2 DESIGN OF THE STUDY  40
  Data Generation  40
    Number of Respondents  40
    Number of Attributes  42
    Marginal Attribute Difficulty  43
    Correlation Between Attributes  43
    Number of Items  44
    Item Parameter Specification for Data Generation  45
  Data Analysis  46
    CDM Specification  46
    Q Matrix Specification  46
    Model Fit Evaluation  48
    Classification Accuracy  48

3 RESULTS  52
  Model Fit Evaluation  52
    Relative Fit Evaluation  52
    Absolute Fit Evaluation  55
  Classification Accuracy  57

4 DISCUSSION AND CONCLUSION  67
  Discussion  67
  Implications of the Study  73
  Future Directions and Conclusion  74

APPENDIX: REVIEW OF SIMULATION STUDIES  76
LIST OF REFERENCES  78
BIOGRAPHICAL SKETCH  82
LIST OF TABLES

Table  page

2-1 Simulation Conditions for Data Generation and Model Estimation  49
2-2 Correctly Specified Q Matrix for J = 14 and 28  49
2-3 True Item Parameters  50
2-4 The Q Matrix Misspecification and True Q Matrix  51
3-1 Selection Rates of Relative Fit Indices for CDM Misspecification  60
3-2 Selection Rates of Relative Fit Indices for Q Matrix Misspecification  62
3-3 Selection Rates of Relative Fit Indices for Both CDM and Q Matrix Misspecifications  63
3-4 Selection Rates of RMSEA for the Misspecification of CDM, Q Matrix, and Both  63
3-5 Correct Overall Classification Rates in All Conditions  64
3-6 Correct Overall and Class-Specific Classification Rates for Misspecifications of CDM and Q Matrix  65
A-1 Review of the Conditions in Simulation Studies  76
LIST OF FIGURES

Figure  page

3-1 Empirical sampling distributions of the RMSEA indexes for the different CDMs and Q matrices  66
3-2 Empirical sampling distributions of the RMSEA indexes for the G-DINA with the correct Q matrix by different factors  66
LIST OF ABBREVIATIONS

A-CDM Additive Cognitive Diagnostic Model
AHM Attribute Hierarchy Method
AIC Akaike Information Criteria
BIC Bayesian Information Criteria
CAIC Consistent Akaike Information Criteria
CDM Cognitive Diagnosis Model
CFA Confirmatory Factor Analysis
C-RUM Compensatory Reparameterized Unified Model
CTT Classical Test Theory
DINA Deterministic Input, Noisy "And" Gate
DINO Deterministic Input, Noisy "Or" Gate
GDM General Diagnostic Model
G-DINA Generalized Deterministic Input, Noisy "And" Gate
HO-DINA Higher-Order DINA
IRT Item Response Theory
L-CDM Log-Linear Cognitive Diagnosis Model
LL Log-Likelihood
MCLCM Multiple Classification Latent Class Model
NIDO Noisy Input, Deterministic "Or" Gate Model
RMSEA Root Mean Square Error of Approximation
RSM Rule Space Method
RUM Reparametrized Unified Model
SEM Structural Equation Modeling
UM Unified Model
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

ASSESSING THE MODEL FIT AND CLASSIFICATION ACCURACY IN COGNITIVE DIAGNOSIS MODELS

By

Miao Gao

August 2014

Chair: M. David Miller
Major: Research and Evaluation Methodology

This simulation study investigated the performance of model fit indices and examinees' classification accuracy for correctly specified and misspecified cognitive diagnostic models (CDMs) under various conditions within the generalized deterministic input, noisy "and" gate (G-DINA) model framework. The manipulated conditions included the number of respondents (500, 1,000, and 5,000), the attribute correlations (.4 and .8), and the number of items in a test (14 and 28). The data were generated under the saturated G-DINA model. In addition, two reduced models were used to fit the data: the additive CDM (A-CDM) and the DINA model. Five fit indices were considered: −2 log-likelihood (−2LL), Akaike Information Criteria (AIC), Bayesian Information Criteria (BIC), Consistent AIC (CAIC), and root mean square error of approximation (RMSEA). Two types of classification accuracy were examined: the proportion of examinees classified correctly for all K skills and the proportion of examinees classified correctly for each latent class.

Results showed that relative and absolute fit indices can be used conjunctively to detect misspecification effectively. With CDM misspecification, the AIC detected the correct saturated CDM well. With Q matrix misspecification, or with both CDM and Q matrix misspecification, the AIC and BIC had good selection rates. Fitting the data
with the saturated CDM was helpful for selecting the correct Q matrix by AIC and BIC. The RMSEA was sensitive to the CDM specification but not to the Q matrix. The over-specified Q matrix was more difficult to detect. More test items, more examinees, and a smaller attribute correlation allowed easier detection of the CDM and Q matrix misspecifications. Results also demonstrated that the models with better fit yielded higher correct classification rates. The CDM misspecification had little impact on classification accuracy. However, the Q matrix misspecification decreased the classification rates. The proportion of examinees classified correctly for each latent class was related to the type of Q matrix misspecification: under-fitting and balanced misfit of the Q matrix had more severe impacts than over-fitting the Q matrix. More test items had a greater positive impact on classification accuracy than more respondents taking the test.
CHAPTER 1
INTRODUCTION AND LITERATURE REVIEW

Background

Cognitive diagnosis models (CDMs) are psychometric models that measure respondents' specific knowledge structures and multiple attributes for the purpose of making classification-based decisions (Rupp, Templin, & Henson, 2012). Rupp et al. (2012) define the respondents as the individuals who provide the behavioral observations; items as the tasks that are presented to the respondents; attributes as the latent characteristics of respondents; and latent variables as the variables that represent the attributes. The terms respondents and examinees are used interchangeably here; attribute and skill are also used interchangeably.

The advantages of CDMs are most apparent in situations where diagnostic feedback needs to be provided to respondents, and criterion-referenced interpretations of multiple proficiencies are most needed (Rupp & Templin, 2008a). Large-scale tests often yield an overall scaled score with reference to a distribution or to a standard, which provides a summative outcome report but not enough meaningful information to foster learning activities. Recently there has been an increasing demand, from both researchers and educational stakeholders, for more formative test information (Mislevy, 2006; Rupp & Templin, 2008a; Roberts & Gierl, 2010). Researchers and stakeholders often wish to obtain the classification of respondents with respect to their skill mastery in order to address students' specific learning needs. A CDM yields a profile of scores for each individual based on the cognitive skills measured by the test. In large-scale assessments, the reported scores are often in the form of
mastery probabilities. These probabilities are meaningful because the interpretations of student performance are based on the skills measured by the test, which will ultimately promote teaching and learning activities.

Statistically speaking, CDMs are probabilistic confirmatory multidimensional latent variable models with a complex loading structure (Rupp & Templin, 2008b). There are many different types of CDMs, such as the attribute hierarchy method (AHM), the deterministic input, noisy "and" gate (DINA) model, the higher-order DINA (HO-DINA) model, the deterministic input, noisy "or" gate (DINO) model, the reparametrized unified model (RUM), and the fusion model (Rupp, Templin, & Henson, 2012). All of these CDMs are developed for the purpose of identifying the mastery or nonmastery of multiple attributes for solving problems, in order to create multivariate classifications of subjects (de la Torre & Lee, 2010; Rupp et al., 2012).

Despite the diverse types of CDMs, in recent years there has been a trend to specify the models in a general model framework. The three most commonly seen frameworks are the general diagnostic model (GDM; von Davier, 2008, 2010), the generalized DINA model (G-DINA; de la Torre, 2011), and the log-linear cognitive diagnosis model (L-CDM; Henson, Templin, & Willse, 2009). The models in these three frameworks should yield equivalent results (Chen, de la Torre, & Zhang, 2013). Given that these general frameworks are newly developed, limited studies have investigated the behavior and outcomes of these models; thus, more research is needed on the models in a general framework. This study explored the models in the G-DINA framework.
To implement the CDMs, two kinds of inputs are required for the analysis. One is the response data from the assessment, which can be either dichotomous or polytomous (Rupp & Templin, 2008a). The other input is the specification of the attributes measured by the assessment. This item-by-attribute specification is usually constructed by content experts and is called the Q matrix. In this matrix, each element q_jk indicates whether attribute k is measured by item j, where q_jk = 1 indicates that attribute k is measured by item j and q_jk = 0 indicates that it is not. The Q matrix reflects the loading structure of the multiple attributes on the items; thus, different Q matrix specifications reflect different theoretical hypotheses about the structure of the cognitive diagnostic analysis.

The Q matrix is a key component in CDMs, and the specification of the Q matrix is a crucial step in the diagnostic analysis. However, evaluating the correctness of a Q matrix is challenging. This is partially due to the complexity of CDMs and partially due to the relative paucity of research about how sensitive the fit indices are under various conditions. With the availability of various CDMs and Q matrices, choosing the most appropriate CDM in conjunction with the correct Q matrix for a particular application is important. This study investigated the performance of the fit indices for selecting the correct CDM and Q matrix specification in the G-DINA model framework. The primary purpose of CDMs is identifying the mastery or nonmastery of multiple attributes in order to create multivariate classifications of examinees. To assess the outcome of the CDM analysis, this study also evaluated examinees' classification accuracy under different CDM and Q matrix specifications.
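As an illustration of this input format, a Q matrix can be stored as a simple binary array. The items and attributes below are hypothetical and not taken from this study's design; this is only a sketch of how the q_jk entries encode the item-by-attribute structure:

```python
import numpy as np

# Hypothetical Q matrix: 4 items (rows) by 3 attributes (columns).
# Q[j, k] = 1 means item j requires attribute k; 0 means it does not.
Q = np.array([
    [1, 0, 0],   # item 1 measures attribute 1 only
    [0, 1, 0],   # item 2 measures attribute 2 only
    [1, 1, 0],   # item 3 measures attributes 1 and 2
    [1, 0, 1],   # item 4 measures attributes 1 and 3
])

# Attributes required by each item (zero-based column indices).
required = [np.flatnonzero(row) for row in Q]
print([r.tolist() for r in required])   # [[0], [1], [0, 1], [0, 2]]
```

Each distinct nonzero row is a different loading pattern, which is why a test measuring K attributes has at most 2^K − 1 usable row patterns.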
Cognitive Diagnostic Models

CDMs are psychometric models developed to identify examinees' mastery of multiple fine-grained attributes. The key features of CDMs are not unique but have their roots in many other psychometric and statistical frameworks, such as classical test theory (CTT), item response theory (IRT), structural equation modeling (SEM), confirmatory factor analysis (CFA), categorical data analysis, the linear logistic test model, and Bayesian statistics (Mislevy, 2006; Rupp et al., 2012). To compare and contrast CDMs with other well-known models, one can view the choice between discrete latent variable models such as CDMs and continuous latent variable models such as multidimensional IRT models or factor analysis models as a difference in the degree of approximation of continuous latent variable distributions, without resorting to model-external arguments about cognitively grounded diagnostic claims (Rupp & Templin, 2008b).

All CDMs share a few characteristic features: (1) the multidimensional nature, (2) the confirmatory nature, (3) the complex loading structure, (4) the nature of the manifest response variables, (5) the nature and interaction of the latent predictor variables, (6) the criterion-referenced interpretations, (7) the diagnostic purposes to which they are put, (8) the cognitive theory that is necessary for their meaningful application, and (9) the modeling of different response strategies by groups of subjects (Rupp & Templin, 2008b).

Statistically speaking, CDMs are latent structural models and, more specifically, restricted latent class models (Kunina-Habenicht, Rupp, & Wilhelm, 2012). They are suitable for modeling categorical response variables and contain categorical latent predictor variables that generate latent classes. They enable multiple criterion-referenced interpretations and feedback for diagnostic purposes that are referenced to a cognitively grounded theory of response processes at a fine grain size.

CDMs hold assumptions about the conditional independence of item responses given attribute profiles and about independence among examinees. Conditional independence means that item responses are independent conditionally on the latent class of the examinee, which is similar to local independence in IRT (Rupp et al., 2012). Besides this assumption, which all CDMs hold, another assumption, concerning how the latent predictor variables are combined across the different attributes to produce the manifest responses, classifies CDMs into two categories: compensatory and non-compensatory models.

The non-compensatory CDMs reflect the assumption that a deficit in one latent variable cannot be compensated for by a surplus in another latent variable (Rupp et al., 2012). Hence, an examinee must have all the attributes required by the item in order to get the item correct. The non-compensatory models use a conjunctive condensation function (Maris, 1995, 1999); statistically, this is represented by a product term over attributes in the model equation. Sometimes these two terms, non-compensatory and conjunctive, are used interchangeably. Examples of non-compensatory models include the rule space method (RSM), the attribute hierarchy method (AHM), the DINA model (Junker & Sijtsma, 2001), the noisy input, deterministic "and" gate (NIDA) model (Junker & Sijtsma, 2001), the non-compensatory reparameterized unified model (NC-RUM or RUM) and the reduced reparameterized unified model (rRUM; DiBello et al., 2007; Hartz, 2002), the unified model (UM; DiBello, Stout, & Roussos, 1995), and the conjunctive multiple classification latent class model (the conjunctive MCLCM; Maris, 1999).
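The conjunctive condensation function can be sketched in a few lines: the latent response is a product over the required attributes, so a single missing attribute drives the success probability down to the guessing level. The guessing and slipping parameter values below are made up purely for illustration; the DINA parameterization itself is introduced formally later in this chapter:

```python
def dina_prob(alpha, q, g, s):
    """P(correct) under the DINA model for one examinee and one item.

    alpha : 0/1 mastery vector for the examinee
    q     : 0/1 Q-matrix row for the item
    g, s  : guessing and slipping parameters (illustrative values)
    """
    # Conjunctive condensation: eta = product of alpha_k ** q_k, so
    # eta = 1 only if every required attribute is mastered.
    eta = 1
    for a, qk in zip(alpha, q):
        eta *= a ** qk
    return (1 - s) if eta == 1 else g

# Item requiring attributes 1 and 2 (hypothetical g = .2, s = .1).
q = [1, 1, 0]
print(dina_prob([1, 1, 0], q, g=0.2, s=0.1))  # masters both -> 0.9
print(dina_prob([1, 0, 1], q, g=0.2, s=0.1))  # lacks attribute 2 -> 0.2
```

The second call shows the non-compensatory property: mastering a surplus attribute (attribute 3) does nothing to offset the missing required attribute.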
The compensatory CDMs reflect the assumption that a deficit in one latent variable can be compensated for by a surplus in a different latent variable (Rupp et al., 2012). The term disjunctive is used interchangeably with compensatory. Statistically, these models can contain both sums and products of the attributes. Some examples of compensatory models include the DINO model (Templin & Henson, 2006), the compensatory reparameterized unified model (C-RUM), the general diagnostic model (GDM; von Davier, 2006), and the compensatory multiple classification latent class model (the compensatory MCLCM; Maris, 1999). There are other criteria considered to differentiate CDMs, such as the measurement scales of the manifest response variables they can model (dichotomous vs. polytomous) and the measurement scales of the latent predictor variables they contain (dichotomous vs. polytomous).

With all these different types of CDMs, it is not entirely clear to what extent these models are related to one another, and how the results yielded by these models are comparable to each other. Therefore, in recent years there has been a trend to specify the CDMs in a general framework. The three most commonly seen frameworks are the GDM, the G-DINA, and the L-CDM. This study introduced and investigated the models in the G-DINA framework.

The Generalized DINA Model Framework

The G-DINA model is a generalization of the DINA model with more relaxed assumptions. It serves as a general framework for deriving other CDM formulations, estimating some commonly used CDMs, and testing the adequacy of reduced models in place of the saturated model. Like all other CDMs, the G-DINA model also requires a
Q matrix, where each element q_{jk} is specified for item j and attribute k. The G-DINA model discriminates the latent classes into 2^{K_j^*} latent groups, where K_j^* = \sum_{k=1}^{K} q_{jk} represents the number of required attributes for item j. Each latent group is represented by a reduced attribute vector \alpha_{lj}^*, whose elements are the required attributes for item j. In this study, it suffices to consider the reduced attribute vector \alpha_{lj}^* in place of the full attribute vector \alpha_l. Each latent group has a probability of answering the item correctly, represented by P(\alpha_{lj}^*) (de la Torre, 2011):

P(\alpha_{lj}^*) = \delta_{j0} + \sum_{k=1}^{K_j^*} \delta_{jk}\alpha_{lk} + \sum_{k'=k+1}^{K_j^*} \sum_{k=1}^{K_j^*-1} \delta_{jkk'}\alpha_{lk}\alpha_{lk'} + \cdots + \delta_{j12\cdots K_j^*} \prod_{k=1}^{K_j^*} \alpha_{lk}   (1-1)

where \delta_{j0} is the intercept for item j; \delta_{jk} is the main effect due to \alpha_{lk}; \delta_{jkk'} is the interaction effect due to \alpha_{lk} and \alpha_{lk'}; and \delta_{j12\cdots K_j^*} is the interaction effect due to \alpha_{l1}, \ldots, \alpha_{lK_j^*}.

According to de la Torre (2011), the parameters can be interpreted as follows: \delta_{j0} is the baseline probability, that is, the probability of a correct response when none of the required attributes is present; \delta_{jk} is the change in the probability of a correct response as a result of mastering the single attribute \alpha_{lk}; \delta_{jkk'}, a first-order interaction effect, is the change in the probability of a correct response due to the mastery of both \alpha_{lk} and \alpha_{lk'} that is over and above the additive impact of the mastery of the same two attributes;
and \delta_{j12\cdots K_j^*} is the change in the probability of a correct response due to the mastery of all the required attributes that is over and above the additive impact of the main and lower-order interaction effects. The intercept is always non-negative and the main effects are typically non-negative, but the interaction effects can take on any values. This implies that mastering any one of the required attributes corresponds to some increase in the probability of getting the item correct.

Some reduced models can be seen as special cases of the G-DINA model. The most commonly seen reduced model is the DINA model. The item response function (IRF) for the DINA model can be defined as follows (de la Torre, 2011):

P(\alpha_{lj}^*) = g_j + (1 - s_j - g_j)\, I(\alpha_{lj}^* = \mathbf{1}_{K_j^*})   (1-2)

where \mathbf{1}_{K_j^*} is a vector of ones of length K_j^*, g_j is the probability that individuals who lack at least one of the prescribed attributes for item j will guess correctly, and 1 - s_j is the probability that individuals who have all the required attributes will not slip and get the item wrong. The DINA model can be obtained from the G-DINA model by setting all the parameters, except \delta_{j0} and \delta_{j12\cdots K_j^*}, to zero. In terms of the G-DINA parameters, g_j = \delta_{j0} and 1 - s_j = \delta_{j0} + \delta_{j12\cdots K_j^*}. The DINA model can also be expressed as follows:

P(Y_{ij} = 1 \mid \alpha_i) = (1 - s_j)^{\eta_{ij}} g_j^{1 - \eta_{ij}}, \quad \eta_{ij} = \prod_{k=1}^{K} \alpha_{ik}^{q_{jk}}   (1-3)

Thus, the DINA model has two parameters per item. We can see from the formula that the incremental probability can be expected only when all the required attributes are simultaneously mastered. As a consequence, for each item the examinees are classified into two
classes by the model: one class with examinees who possess all the attributes required by the item, and one class with examinees who lack at least one required attribute. It is worth noting that within the second class, no further differentiation is made between examinees who lack different attributes.

Another special case of the G-DINA model is the additive CDM (A-CDM), which contains only the intercept and main effects. The IRF for the A-CDM is defined as follows (de la Torre, 2011):

P(\alpha_{lj}^*) = \delta_{j0} + \sum_{k=1}^{K_j^*} \delta_{jk}\alpha_{lk}   (1-4)

This model indicates that mastering attribute \alpha_{lk} increases the probability of success on item j by \delta_{jk}, and the contribution of each attribute is independent. The A-CDM has K_j^* + 1 parameters for item j.

Different link functions can be used in specifying general models for CDMs. The three commonly used link functions are the identity, logit, and log links. The previously introduced formulas for the G-DINA, DINA, and A-CDM use the identity link function. The model using the logit link is referred to as the log-odds CDM, which is equivalent to the L-CDM. The G-DINA model and the logit CDM describe the additive impact of attribute mastery on the probability and on the logit of the probability of success, respectively, while the log CDM describes the multiplicative impact of attribute mastery on the probability of success.

The Q Matrix

One crucial step in cognitive diagnostic assessment is developing the Q matrix, since the Q matrix and the CDM are integral parts of the modeling process. The Q matrix
specifies the attribute structure measured by an assessment. An example of a Q matrix can be demonstrated as follows:

Q = \begin{bmatrix} q_{11} & q_{12} & \cdots & q_{1K} \\ q_{21} & q_{22} & \cdots & q_{2K} \\ \vdots & \vdots & & \vdots \\ q_{J1} & q_{J2} & \cdots & q_{JK} \end{bmatrix}   (1-5)

where j indexes the items and k indexes the attributes. The element q_{jk} is specified as 1 if the jth item requires the kth attribute to be answered correctly; otherwise, q_{jk} is 0. There are 2^K − 1 possible distinct rows for a Q matrix, because no item measures no attribute. As an input for a CDM, the Q matrix is a critical component in the cognitive analysis. The Q matrix is specified by content experts, and this specification process is a subjective activity. Different specifications of the Q matrix reflect different theoretical hypotheses about the structure of the diagnostic assessment (Kunina-Habenicht et al., 2012). Hence, the quality of the Q matrix determines the diagnostic information obtained from the CDM analysis.

Model Fit Evaluation

Model-data misfit can happen in a cognitive diagnostic assessment due to the nature of the attributes, the attribute structure, misspecified constraints of model parameters, Q matrix misspecification, the choice of CDMs, or a non-homogeneous population (Rupp & Templin, 2008a, 2008b). When these misfits happen, the conditional independence assumption of a CDM is likely to be violated. This assumption states that there should be no residual dependencies among item responses once the
CDM is fit, and the model fit statistics can be thought of as a check of this assumption (Rupp et al., 2012). Given the availability of different CDMs and Q matrices, the choice of the appropriate model and the evaluation of model fit using suitable fit statistics become increasingly important in cognitive diagnostic analysis. This study focused on CDM and Q matrix misspecification in the G-DINA model framework. Two types of model fit were the focus of this study: relative and absolute model fit indices.

Relative Fit

Relative fit refers to the process of selecting the best-fitting model among a set of competing models (Chen et al., 2013). Four commonly used relative fit indices for assessing model fit are the −2 log-likelihood (−2LL), Akaike Information Criteria (AIC), Bayesian Information Criteria (BIC), and Consistent AIC (CAIC). All these indices are computed as a function of the maximized likelihood (ML):

ML = \prod_{i=1}^{N} \sum_{l=1}^{L} L(\mathbf{X}_i \mid \alpha_l)\, p(\alpha_l)   (1-6)

where N is the sample size, L is the total number of attribute patterns, \mathbf{X}_i is the response vector for examinee i, \alpha_l is the lth attribute pattern, L(\mathbf{X}_i \mid \alpha_l) is the likelihood of the response vector of examinee i given \alpha_l, and p(\alpha_l) is the prior probability of \alpha_l (Chen et al., 2013). Thus,

−2LL = −2 \log(ML)   (1-7)

One of the most commonly used information criteria is the AIC. The usual form of the AIC is

AIC = −2LL + 2p   (1-8)
where LL is the log-likelihood and p is the number of parameters to be estimated (Akaike, 1987). The CAIC (Bozdogan, 1987), a derivative of the AIC, includes a penalty for models having larger numbers of parameters using the sample size n and is defined as

CAIC = −2LL + p(\log(n) + 1)   (1-9)

Another commonly used criterion, the Bayesian Information Criteria (BIC), was proposed by Schwarz (1978):

BIC = −2LL + p \log(n)   (1-10)

where n is the sample size and p is the number of parameters. In the DINA model, p equals 2J + 2^K − 1, where J is the number of items and K is the number of attributes. For all of the −2LL, AIC, BIC, and CAIC, the lower the value, the better the fit. The −2LL always favors the more complex model that has more parameters. The AIC, BIC, and CAIC are known as information criteria because their theoretical derivation draws heavily on the definition of statistical information in models (Rupp et al., 2012). They represent statistical compromises between model fit and model parsimony, which means that overly complex models that lead to only a small improvement in fit compared to a much simpler model will be penalized.

Absolute Fit

The relative fit statistics should be used cautiously in practice because they cannot tell whether the model fits the data in an absolute sense. Absolute fit refers to the process of determining whether the model fits the data adequately (Chen et al., 2013). Fit statistics that are sensitive to misspecifications under absolute fit evaluation should reject misspecified models with high probability. In practice, it is likely
that more than one model can fit the data adequately (Chen et al., 2013; Rupp et al., 2012). Therefore, Rupp et al. (2012) suggest first evaluating the absolute model fit in order to filter the models fitting the data well. If multiple models have acceptable absolute fit, then the relative model fit indices can be used to pick the best-fitting model among them.

The root mean square error of approximation (RMSEA) is one of the popular absolute fit indices and was first developed by Steiger and Lind (1980, cited in Steiger, 1990). In cognitive diagnosis analysis, the RMSEA was used as the absolute fit index in the study of Kunina-Habenicht et al. (2012). The RMSEA compares observed and predicted responses for the different latent classes and weights the differences unequally based on squared differences. The RMSEA item fit index is a chi-square-based measure of item fit in CDMs, similar to the RMSEA item fit in the software mdltm (von Davier, 2008; Kunina-Habenicht et al., 2012). For item j, the RMSEA is calculated as follows:

RMSEA_j = \sqrt{ \sum_{c} \hat{p}(\alpha_c) \sum_{k} \left( \hat{P}_{jk}(\alpha_c) - \frac{n_{jkc}}{N_{jc}} \right)^2 }   (1-11)

where c denotes the class of the skill vector \alpha_c, k is the item category, \hat{p}(\alpha_c) is the estimated class probability of \alpha_c, \hat{P}_{jk}(\alpha_c) is the estimated item response function, n_{jkc} is the expected number of students with skill vector \alpha_c on item j in category k, and N_{jc} is the expected number of students with skill vector \alpha_c on item j. In this study, the mean of the RMSEA item fit indexes over all the items in a test was used as the model-level fit statistic.

The RMSEA is sensitive to the number of estimated parameters in the model (Hooper, Coughlan, & Mullen, 2008). Specifically, the RMSEA usually favors parsimonious models in that it chooses the model with fewer parameters. Traditionally
in the SEM framework, the guideline for the RMSEA suggests values < .06 indicate good fit (Hu & Bentler, 1999). Slightly different from SEM, RMSEA < .05 is recommended as a guideline indicating good fit in the IRT framework (Gardner et al., 2002; Kunina-Habenicht, Rupp, & Wilhelm, 2009; McDonald & Mok, 1995).

Classification Accuracy

The primary purpose of CDMs is identifying the mastery or nonmastery of multiple attributes in order to create multivariate classifications of examinees (de la Torre & Lee, 2010; Rupp et al., 2012). Rather than an overall score, CDMs provide meaningful information about each examinee's proficiency on the multiple attributes measured by an assessment. This is aligned with educational policies, such as the No Child Left Behind policy in the United States or similar policies in Europe, where stakeholders often wish to obtain dichotomous or polytomous classifications of examinees with respect to individual skills rather than information about their performance with reference to a certain continuous distribution (Rupp & Templin, 2008b). Such classifications identify students' strengths and weaknesses in order to tailor instruction to the specific needs of the students, which in turn can help predict progress toward the summative tests administered.

To classify examinees into different latent classes, there are three common approaches for assigning an examinee to a latent class: the maximum likelihood estimation (MLE), the maximum a posteriori (MAP) estimate of the posterior distribution, and an expected a posteriori (EAP) estimate for each attribute (Huebner & Wang, 2011). For MLE classification, the likelihood is computed at each attribute profile, and the examinee is assigned the estimated attribute profile that maximizes the likelihood. Sometimes, when the distribution of attribute profiles is expected, the prior probabilities
are obtained. At this point, MAP classification can be applied by computing the posterior probability of each attribute profile and assigning the examinee the profile that maximizes the posterior. EAP provides a marginal probability of mastery for each individual attribute, whereas MLE and MAP do not (Huebner & Wang, 2011). EAP calculates the probability of mastery for each attribute for an examinee and sets up a cutoff probability value (usually .5) to classify the attribute as mastery or nonmastery (Huebner & Wang, 2011). Hence, this cutoff can be altered for different research purposes. However, computing MLE and MAP is more statistically straightforward.

To assess the classification results yielded by a CDM, we can evaluate the classification accuracy, which refers to the degree to which the classification of examinees' observed latent classes agrees with their true latent classes (Cui, Gierl, & Chang, 2012). Various criteria of classification accuracy can be considered: (a) the marginal correct classification rate for each skill, (b) the total correct classification rate for the I*K individual skills classified per replication, (c) the proportion of examinees classified correctly for at least K-1 skills, (d) the proportion of examinees classified correctly for all K skills, and (e) the proportion of examinees classified correctly for each latent class (Huebner & Wang, 2011; Rupp & Templin, 2008a). This study focused on the last two criteria to examine classification accuracy: the proportion of examinees classified correctly for all K skills is referred to as the overall classification accuracy, and the proportion of examinees classified correctly for each latent class is referred to as the class-specific classification accuracy.
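The three classification approaches can be contrasted with a small numerical sketch. This is a minimal illustration rather than code from the studies cited; the likelihood and prior values are made up, and `classify` is a hypothetical helper.

```python
import numpy as np

def classify(likelihoods, prior, profiles):
    """Classify one examinee given per-latent-class likelihoods.

    likelihoods : L(x | alpha_c) for each latent class c
    prior       : prior probability of each latent class
    profiles    : (n_classes, K) binary matrix of attribute profiles
    Returns the MLE, MAP, and EAP (cutoff .5) attribute profiles.
    """
    posterior = likelihoods * prior
    posterior = posterior / posterior.sum()
    mle = profiles[np.argmax(likelihoods)]          # ignores the prior
    map_ = profiles[np.argmax(posterior)]           # full-profile posterior mode
    p_mastery = posterior @ profiles                # marginal P(mastery) per attribute
    eap = (p_mastery >= 0.5).astype(int)
    return mle, map_, eap

# Toy example, K = 2 attributes (4 latent classes); numbers are illustrative.
profiles = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
likelihoods = np.array([0.08, 0.10, 0.09, 0.11])
prior = np.array([0.7, 0.1, 0.1, 0.1])   # prior strongly favors (0, 0)
mle, map_, eap = classify(likelihoods, prior, profiles)
# MLE picks (1, 1), while the strong prior pulls MAP and EAP to (0, 0).
```

The example shows how a strong prior can make MAP/EAP disagree with MLE, which is why the choice of method matters when the attribute distribution is informative.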
Review of Simulation Studies

In recent years, the trend in cognitive diagnostic analysis has been to develop models within a general model framework, so that the results yielded by different CDMs are comparable and practitioners can build the most appropriate CDM in a general model framework. The three most common frameworks raised recently are the GDM (von Davier, 2010), the G-DINA (de la Torre, 2008) and the LCDM (Henson et al., 2009). Since these are newly developed models, systematic simulation studies exploring the features of these general models are somewhat lacking, and only a few application studies have applied these models to real data analysis. This review of simulation studies focused on studies involving model fit evaluation and classification accuracy, especially when CDM misspecification and/or Q-matrix misspecification occurred. Due to the limited number of simulation studies in general CDM frameworks, the review was expanded to simulation studies conducted with other popular models such as the DINA model. The Appendix presents the specifications regarding sample size, number of attributes, and the assumed correlational structure of the skills in previous studies.

The Q-matrix plays an important role in CDMs and has a significant impact on the outcome of the analysis. Few simulation studies have investigated the Q-matrix in a systematic way. Several studies examined the Q-matrix from the perspective of development and validation of the Q-matrix (de la Torre, 2008; DeCarlo, 2012; Henson & Douglas, 2005; Liu, Douglas, & Henson, 2009). De la Torre (2008) investigated a method for the validation of the Q-matrix in the DINA model. He identified and corrected the misspecified Q-matrix by using Ox. This simulation study used
the method in conjunction with the DINA model. It found that Q-matrix misspecification resulted in more bias in the parameter estimates and was reflected in the value of the discrimination index δ. De la Torre discussed that the method was potentially viable for detecting an inappropriate Q-matrix; however, a more complete process of Q-matrix validation should draw on both statistical information and substantive knowledge.

DeCarlo (2012) showed that uncertainty in the Q-matrix can be recognized and explored via a Bayesian extension of the DINA model by using the software OpenBUGS. The Q-matrix is usually specified by content experts and considered as fixed. This study specified some elements in the Q-matrix as random variables, and the questionable elements were detected by comparing their posterior distributions with the prior distributions. The study examined the uncertainty in different degrees and situations: 1) 4 out of 60 elements were uncertain while the rest of the Q-matrix was correctly specified; 2) 12 out of 60 elements were uncertain with the rest of the elements correctly specified; and 3) 12 elements were uncertain and 6 other elements were misspecified in a Q-matrix with a total of 60 elements. They found the proposed Bayesian method was helpful for determining which attributes should or should not be included in the Q-matrix. Specifically, the situation that was easiest to detect was when the attributes were correctly specified for most of the elements in the Q-matrix and only a few attributes were questionable.

Liu et al. (2009) addressed the use of factor analysis for Q-matrix development. In one simulation study, they simulated data under the DINA model and analyzed the data omitting a column of the Q-matrix that was used to generate the data. They found the examinees had spuriously low scores when the Q-matrix was partly missing.
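To make the role of the Q-matrix concrete, the sketch below builds a small hypothetical Q-matrix and computes the DINA ideal response, η_ij = Π_k α_ik^{q_jk} (1 only if examinee i masters every attribute item j requires). The matrices are illustrative, and the column-dropping step only mirrors the spirit of the omission condition in Liu et al. (2009), not their actual design.

```python
import numpy as np

# Hypothetical Q-matrix: 3 items, K = 3 attributes (rows = items).
Q = np.array([[1, 0, 0],
              [1, 1, 0],
              [0, 1, 1]])

def ideal_response(alpha, Q):
    """DINA ideal response: eta_ij = 1 iff examinee i masters all attributes item j requires."""
    return (alpha @ Q.T == Q.sum(axis=1)).astype(int)

alpha = np.array([[1, 1, 0]])        # one examinee mastering attributes 1 and 2
eta_true = ideal_response(alpha, Q)  # item 3 requires attribute 3: eta = [1, 1, 0]

# Omitting a Q-matrix column (cf. Liu et al., 2009) misstates the requirements:
Q_missing = Q[:, :2]                             # attribute 3 dropped
eta_wrong = ideal_response(alpha[:, :2], Q_missing)  # item 3 now appears answerable
```

The omitted column changes which examinees are expected to answer item 3 correctly, which is exactly the mechanism by which a partly missing Q-matrix distorts scores and classifications.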
The studies that explicitly examined the effects of Q-matrix misspecification were conducted by Chen et al. (2013) and Rupp and Templin (2008a). Three types of Q-matrix misspecification are often considered when fitting a CDM to data: underfitting, overfitting, and a balanced misfit for the Q-matrix. In an underfitting Q-matrix, required attributes are omitted, so too few parameters are estimated for the item under consideration. In an overfitting Q-matrix, unnecessary attributes are added, so extra parameters are unduly estimated. In a balanced misfit for the Q-matrix, some required attributes are exchanged for unnecessary ones, so the overall number of 1s and 0s in the Q-matrix remains the same as in the original Q-matrix. In practice, it is not easy to evaluate the correctness of the Q-matrix due to its subjective nature and complexity when applied to the model, as well as the relative paucity of research on the evaluation of the Q-matrix at both the assessment and item levels.

Chen et al. (2013) investigated the relative and absolute fit statistics for selecting the correct CDM and Q-matrix in a general framework by using code written in Ox. The data were simulated under the DINA and A-CDM models and analyzed with the saturated model in addition to the generating models in the G-DINA framework. For the relative fit indices, they found the BIC performed better than the AIC in CDM and Q-matrix selection, and the -2LL always selected the saturated model. For the absolute fit indices, the univariate residual measure (item proportion correct) was not sensitive to CDM or Q-matrix misspecification; the bivariate residual measures based on the Fisher-transformed correlation of item pairs and the log-odds ratio of item pairs performed similarly in that they were sensitive to CDM or Q-matrix misspecification in some conditions, had low Type I error
rates, and had low power for the incorrect combination of the saturated model with the true or over-specified Q-matrix. Chen et al. (2013) provided a relatively comprehensive investigation of the fit indices. However, the study used reduced models as the generating models and only focused on fit performance. For the Q-matrix misspecification, they only altered one item for each type of misspecification; thus more levels of the manipulation could be further explored.

Rupp and Templin (2008a) focused on Q-matrix misspecification and its impact on parameter estimates and classification accuracy in the DINA model using Mplus. The study used a Q-matrix that mapped all possible attribute patterns for a four-attribute assessment and manipulated the items in such a way that sets of patterns were eliminated for different substantively motivated reasons. The results showed that different types of Q-matrix misspecification had different impacts on parameter estimates and classification accuracy. Two types of classification accuracy were assessed: overall and class-specific classification accuracy. In particular, the assessment of the class-specific classification was related to the way the Q-matrix was set up and manipulated, which was one of the strengths of this study. However, this study did not vary the levels of the conditions (e.g., the number of attributes was fixed at 4 and the sample size was fixed at 10,000) and the data were simulated and analyzed in the DINA model only, so no choice of CDM specification was assessed. Only one replication was used for assessing the parameter estimation and classification accuracy, with a relatively large sample size.

A few studies explored the higher-order structure of the attributes in reduced models such as the DINA model and the RUM (de la Torre & Douglas, 2004, 2008;
Leighton, Gierl, & Hunka, 2004; Templin, Henson, Templin, & Roussos, 2008). Although the attribute structure is not the focus of this study, the findings from these studies about attribute correlation and classification are related to this study. Templin, Henson, Templin, and Roussos (2008) investigated the robustness of hierarchical modeling of skill structures in the RUM by using the software Arpeggio. This simulation study examined three correlational structures: a general model using a multivariate normal distribution of the skills, a higher-order single-factor structure, and an independent attributes model as a baseline approach. The results showed that the general model and the higher-order model performed equally well with respect to classification and parameter estimation accuracy, regardless of the true correlational structure of the attributes. The higher-order model was preferred in practical applications because of its parsimony. The poor performance of the independence model indicated that at least a moderate positive correlation between the attributes is to be expected in practice.

De la Torre and Douglas (2004) investigated the DINA model and the linear logistic model (LLM) with a higher-order latent trait. They examined the effects of model misspecification on parameter estimates by using Markov chain Monte Carlo (MCMC) algorithms. The results showed that using the correct model improved the accuracy and stability of the parameter estimates. More specifically, if the model used to estimate parameters was the data-generating model, the correct classification rate of the attributes was high. They also pointed out that their results indicated specifying the
Q-matrix correctly was of greater importance than identifying the correct response model, but this needed to be investigated further.

De la Torre and Douglas (2008) compared three models for classifying examinees when modeling the joint distribution of latent attributes. The three comparison models were the NIDA, the single-strategy DINA, and the multiple-strategy DINA model. The results indicated that model misspecification had little effect on parameter estimation accuracy and correct classification rates. The fit indices log marginal likelihood, AIC, and BIC favored the single-strategy and multiple-strategy DINA models over the NIDA. De la Torre and Douglas suggested the single-strategy DINA rather than the multiple-strategy DINA because the single-strategy DINA model was more parsimonious given that the fit was the same.

Some studies compared methods of classifying examinees in cognitive diagnostic analysis (Cui et al., 2012; Huebner & Wang, 2011). Huebner and Wang (2011) compared the MLE and MAP classification methods (MLE/MAP) with the EAP method with respect to classification accuracy. Various criteria were used for the classification accuracy: (a) the marginal correct classification rate for each skill, (b) the total correct classification rate for the I*K individual skills classified per replication, (c) the proportion of examinees classified correctly for all K skills, (d) the proportion of examinees classified correctly for at least K-1 skills, and (e) the proportion of examinees classified incorrectly for at least K-1 skills (i.e., severely misclassified). The
results yielded by the MLE/MAP methods and the EAP method were consistent in general. EAP classified fewer examinees correctly on all K skills but classified more examinees almost or exactly correctly, with fewer severe misclassifications. EAP also yielded more correct total individual skill classifications than MLE/MAP.

Cui, Gierl, and Chang (2012) introduced two new classification indices: a classification consistency index (P_c) and a classification accuracy index (P_a). Classification consistency refers to the degree to which classifications are consistent across two independent administrations or two parallel forms. Classification accuracy refers to the degree to which the classification of the observed student latent classes agrees with the true latent classes. Their simulation study used the DINA model and examined the performance of the two indices under various conditions: item discrimination power, total number of attributes measured in a test, dependency among the attributes, and sample size. The findings revealed that the sample estimates and standard errors of P_c and P_a were fairly accurate even when the sample size was small (e.g., 100); the sampling distributions of the two indices matched closely with the empirical distributions across simulation conditions; and higher-discriminating items, a smaller number of attributes measured by a test, and more attribute dependency had positive effects on the two indices.

Few simulation studies have been conducted in a general model framework. Among these, two recent studies (Kunina-Habenicht, Rupp, & Wilhelm, 2012; Templin & Bradshaw, 2014) simulated data under a saturated model and analyzed it with reduced models, while other studies simulated data under reduced models and included the saturated model for analysis (Chen et al., 2013). The real data structure is usually
complex in practical applications even when simpler cases exist; simulation under a more complex saturated model mimics the real situation better.

Kunina-Habenicht et al. (2012) explored the effects of model misspecification on item parameter estimation, respondent classification, and model-data fit by using the software Mplus. The data were simulated under the saturated model in the LCDM framework, and then analyzed with the misspecified CDM with the interaction terms omitted as well as with the misspecified Q-matrix. They found that both AIC and BIC were sensitive to Q-matrix specification; AIC performed better than BIC when using the incorrect CDM with omitted interactions but with the correct Q-matrix; and the RMSEA and the mean absolute difference between observed and predicted response proportions within latent classes were not sensitive to model misspecification unless the Q-matrix was over-specified. They also revealed that Q-matrix misspecification affected classification accuracy more seriously than CDM misspecification; test length had a positive effect on classification, while sample size, marginal attribute means, and attribute correlations did not have a noticeable impact on classification accuracy. There are some further steps that can be taken from this study. The study only assessed the overall classification accuracy of each examinee for all the attributes, but not any other types of classification accuracy such as the class-specific classification accuracy. Also, the Q-matrix was misspecified by randomly permuting 30% of all matrix entries while matching the marginal distributional properties, which can increase the generalizability of the results. However, this sacrificed the control needed to detect the more specific effects of Q-matrix misspecification, for example, how different types of misspecification affect the model estimation.
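A simplified sketch of this kind of random Q-matrix perturbation is shown below: a chosen proportion of entries is permuted among themselves, which leaves the total number of 1s (and 0s) unchanged. `permute_entries` is a hypothetical helper and is not the exact procedure of Kunina-Habenicht et al. (2012), which also matched marginal distributional properties.

```python
import numpy as np

def permute_entries(Q, proportion, rng):
    """Randomly permute a proportion of Q-matrix entries among themselves.

    Because entries are only reshuffled, the total count of 1s in the
    matrix is preserved, mimicking a 'balanced' random misspecification.
    """
    Qm = Q.copy()
    flat = Qm.ravel()                                   # view into the copy
    n_perm = int(round(proportion * flat.size))
    idx = rng.choice(flat.size, size=n_perm, replace=False)
    flat[idx] = rng.permutation(flat[idx])              # shuffle the chosen entries
    return Qm

rng = np.random.default_rng(0)
Q = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0],
              [1, 1, 0, 0],
              [1, 1, 1, 0]])
Q_mis = permute_entries(Q, 0.30, rng)   # same number of 1s, possibly moved
```

Note the trade-off the review points out: a random scheme like this generalizes well across misspecification patterns but gives up control over which specific type of misfit (underfitting, overfitting, balanced) is induced for any given item.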
Templin and Bradshaw (2014) introduced the Hierarchical Diagnostic Classification Model (HDCM), which adapts the LCDM to cases where attribute hierarchies are present. They generated data from three different models: the full LCDM (3 attributes, resulting in an 8-profile model), the HDCM with one nested attribute (6-profile model), and the HDCM with two nested attributes (4-profile model). They assessed two measures of classification accuracy: the attribute profile classification rate and the marginal attribute classification rate for each model (4-profile, 6-profile, and 8-profile). The attribute profile classification rate was the proportion of examinees that were correctly classified on all the attributes. The marginal attribute classification rate was the proportion of times a single attribute was classified correctly across all examinees and attributes. Templin and Bradshaw found that when the estimation model matched the generating model, the classification rates were high; when the estimation model had more profiles than the true generating model, classification was roughly equal to what was found when the estimation and true models matched; and when the estimation model had fewer profiles than the true model, classification rates plummeted (by as much as 30% for the attribute profile classification rates and 12% for the average attribute classification rates). They suggested a top-down approach of estimating the general LCDM first and then reducing the model based on item parameters that are found not to be significant. This approach allows for the detection of an attribute hierarchy, in addition to identifying when constraints on the LCDM to yield item-based DINA- or DINO-equivalent specifications are appropriate.

Review of Applied Studies

In addition to the simulation studies, some practical studies applied CDMs to real data analysis. Although the application of CDMs is not the focus of this simulation study,
the findings from practical data analyses provide valuable information for the usage of CDMs and inform the questions to investigate with a simulation study. Quite a few studies applied reduced models such as the DINA model and, more recently, a few studies have arisen that apply the models in a general model framework. The review of applied studies in this section first introduces a popular dataset analyzed in the CDM literature, and then focuses on the studies involving the models in a general framework.

The fraction subtraction data have been analyzed in many articles, mainly using the DINA model (e.g., Chen et al., 2013; Cui et al., 2012; DeCarlo, 2011; de la Torre, 2009; de la Torre & Douglas, 2004, 2008). The fraction subtraction data were originally described and analyzed by Tatsuoka (1990) and recently reused in Tatsuoka (2002). The data contain the responses of 536 middle school students to 20 fraction subtraction items measuring 8 attributes. Similar findings were revealed across different studies using the DINA model; one illustrative study was conducted by DeCarlo (2011). DeCarlo analyzed the fraction subtraction data with the DINA model and found that the posterior probabilities of the skills were determined largely by the prior probabilities. The results suggested that the classification problems found in the DINA model might also arise for other CDMs, depending on the Q-matrix specification, and that further research was needed to examine the effects of Q-matrix misspecification on classification.

In recent years, the general model framework has been introduced (e.g., the G-DINA was introduced by de la Torre in 2011). A few application studies have applied the G-DINA model to real data analysis (Basokcu, Ogretmen, & Kelecioglu, 2013; Basokcu, 2014; Chen et al., 2013).

Basokcu et al. (2013) compared the model fit and item fit indices of the DINA and G-DINA models by using the responses from 408,692 examinees taking the grade 6 mathematics test of the Turkey 2008 OKS examination. The model fit indices used in this study were the -2LL, AIC, and BIC statistics; item fit was assessed using residual correlations and probabilities. Results showed that the G-DINA model had better fit than the DINA model for the data analyzed. The residual values in the DINA were higher than in the G-DINA, where higher residual values indicate poorer fit. The -2LL, AIC, and BIC all favored the G-DINA model as well.

Basokcu (2014) examined the effects of the Q-matrix and small sample sizes on attribute mastery by applying the DINA and G-DINA models in the Ox software. A mathematics test consisting of 18 multiple-choice questions measuring 4 attributes was used in this study, and a group of 1,000 examinees had taken the test. Five differently specified Q-matrices were examined. The small sample sizes were achieved by sub-sampling 30, 50, 100, 200, and 400 examinees, and each group was sub-sampled 25 times. The author stated that the contribution of each attribute to the probability of a correct answer was different in the G-DINA model. For students mastering one or more attributes, the probability of correctly answering an item depended on the weight of the attribute. The results showed that the Q-matrix had a significant impact on model fit and on decisions about students. Specifically, the -2LL, AIC, and BIC fit statistics were lower in the G-DINA than in the DINA model in this study, and the classification by the G-DINA model was affected less by changes in the Q-matrix.
Chen et al. (2013) also analyzed the fraction subtraction data to illustrate the findings from their simulation study about the fit statistics. In their simulation study, both the choice of CDM and Q-matrix misspecification were of concern. In the empirical example, they only focused on the choice of the best-fitting CDM using different fit indices. Six CDMs were used to fit the data: the saturated model, the DINA model, the DINO model, the A-CDM, the LLM, and the R-RUM. The -2LL favored the saturated model as expected, and both the AIC and BIC selected the LLM as the best-fitting model. Based on the BIC, three models (the DINA, LLM, and R-RUM) performed better than the saturated model. Chen et al. further checked the Q-matrix specification and noted from the study of de la Torre and Douglas (2004) that Item 8 does not necessarily require mastery of the only prescribed attribute. In other words, students who have not mastered the attribute but are familiar with the inverse property of addition can still answer the item correctly. Thus, Chen et al. removed Item 8 and re-analyzed the data with the same six models and the Q-matrix with Item 8 deleted. The results showed that all the models achieved better fit and the LLM was still the best-fitting model.

Research Questions

The purposes of this study are to evaluate the performance of model fit indices and the accuracy of examinees' classification when CDM and Q-matrix misspecification occur in various conditions. From the literature review, few studies have explicitly addressed the impact of Q-matrix misspecification in CDMs (Chen et al., 2013; DeCarlo, 2012; de la Torre, 2008; Kunina-Habenicht et al., 2012; Liu, Douglas, & Henson, 2009; Rupp & Templin, 2008a). Given the newly introduced general model framework, few studies have included the models in either analysis or simulation within the G-DINA framework.
Only two studies I found in the literature simulated data under the saturated model and analyzed it with the reduced models (Kunina-Habenicht et al., 2012; Templin & Bradshaw, 2014). Further, we know relatively little about the performance of the model fit indices, especially when the Q-matrix and/or CDM are misspecified (Chen et al., 2013; de la Torre & Douglas, 2008; Kunina-Habenicht et al., 2012). Examinees' classification accuracy, a primary interest in cognitive diagnostic analysis, needs more investigation in simulation studies (Huebner & Wang, 2011; Rupp & Templin, 2008a; Templin & Bradshaw, 2014). In particular, when both Q-matrix and CDM misspecification are of concern, no study has examined the class-specific classification accuracy even though it closely relates to the specification of the Q-matrix. Therefore, this study addressed the following research questions:

1. Which relative fit indices (-2LL, AIC, BIC, and CAIC) best select the correct model under CDM and/or Q-matrix misspecification in various conditions?

2. What are the effects of CDM and Q-matrix misspecification on the absolute fit index RMSEA in various conditions?

3. How accurate is examinees' overall classification when Q-matrix and CDM misspecification are of concern in various conditions?

4. How accurate is the class-specific classification when Q-matrix and CDM misspecification are of concern?

The various conditions investigated in this study are the number of respondents, the attribute correlation, and the number of items in an assessment.
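The relative fit indices named in Question 1 can be computed directly from a fitted model's maximized log likelihood using the definitions in Chapter 1. The sketch below uses a made-up log-likelihood value; `fit_indices` and `dina_n_params` are hypothetical helpers.

```python
import math

def fit_indices(loglik, p, n):
    """Relative fit indices from maximized log likelihood LL,
    number of parameters p, and sample size n (lower = better)."""
    return {
        "-2LL": -2 * loglik,
        "AIC":  -2 * loglik + 2 * p,
        "BIC":  -2 * loglik + p * math.log(n),
        "CAIC": -2 * loglik + p * (math.log(n) + 1),
    }

def dina_n_params(J, K):
    """DINA parameter count: 2 per item (slip, guess) plus 2^K - 1
    latent class proportions, i.e., p = 2J + 2^K - 1."""
    return 2 * J + 2 ** K - 1

# Illustrative values for a J = 14 item, K = 4 attribute test with N = 500.
p = dina_n_params(J=14, K=4)          # 43 parameters
ix = fit_indices(loglik=-5230.0, p=p, n=500)
```

Because the BIC and CAIC penalties grow with log(n), the three criteria can disagree: with large samples the BIC/CAIC push toward the reduced models while the -2LL alone always prefers the saturated model.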
CHAPTER 2
DESIGN OF THE STUDY

A simulation study was conducted to investigate the effects of Q-matrix misspecification and CDM misspecification on model fit and classification accuracy. All data generation and estimation were conducted using the free software R (version 3.1.1). The data were generated using the saturated G-DINA model. In addition to the G-DINA, two reduced models (the A-CDM and DINA models) were also used to fit the data. The number of respondents, the correlation between attributes, and the number of items in a test were manipulated, resulting in 12 data-generating conditions with 1,000 replications for each condition. Six different Q-matrices (1 true and 5 misspecified) combined with 3 analysis models were used to estimate each dataset, which resulted in a total of 216,000 estimations. Table 2-1 shows the overall design of the study; each factor is described in the subsections below.

Data Generation

Number of Respondents

Three levels of the number of respondents, reflecting small, moderate, and large samples, were investigated in this study: N = 500, 1,000, and 5,000. Previous research has shown this is a relevant factor that influences model fit, parameter estimates, and classification accuracy (Chen et al., 2013; Cui et al., 2012; de la Torre, 2009; de la Torre & Douglas, 2004; Shu et al., 2013). Several studies have shown that the number of respondents should be at least 500 in order to obtain an acceptable model fit and relatively accurate parameter estimates (Chen et al., 2013; Cui et al., 2012; de la Torre, 2009; de la Torre & Douglas, 2004; Shu et al., 2013). The estimates of the intercept and main effects were consistent with at
least 500 examinees in the log-linear CDM framework, and a larger sample size was required for the estimates of the interaction effects (Choi et al., 2010, cited in Kunina-Habenicht et al., 2012). Since this study used the saturated G-DINA as the generating model and the highest-order interaction in the G-DINA was a three-way interaction, I chose the lowest level of sample size to be 500. Studies exploring classification usually include larger sample sizes than studies focusing on fit and/or parameter estimation. Kunina-Habenicht et al. (2012) used sample sizes of 1,000 and 10,000 and found that the different sample sizes had significant effects on model fit but did not have a noticeable impact on the overall classification accuracy. A couple of studies included only one level of sample size to explore examinee classification; the sample sizes they chose were 5,000 and 10,000 (Huebner & Wang, 2011; Rupp & Templin, 2008a). Specifically, Huebner and Wang (2011) used 5,000 as their sample size to investigate the classification methods, and Rupp and Templin (2008a) simulated 10,000 examinees to investigate the impact of Q-matrix misspecification on parameter estimates and classification accuracy. Thus, for exploring model fit and classification accuracy in this study, I included N = 1,000 and 5,000 to reflect moderate and large sample sizes. Another study, by Cui et al. (2012), created two criteria for classification accuracy and consistency, and found that the two indices were fairly accurate and consistent even when the sample size was small (e.g., 100, 500, and 1,000). In a pilot study, I tried to include a small sample size of 200, so the four levels of sample size were 200, 500, 1,000, and 5,000. When the sample size was down to 200, the absolute fit index mean RMSEA was approximately .1 even when fitting the true model and
the correct Q-matrix under the optimal conditions, which indicated an inadequate fit of the model to the data. When the sample size increased to 500, the mean RMSEA was approximately .05 for the correct model and Q-matrix. RMSEA < .05 is recommended as a rule in IRT, indicating good fit (Gardner et al., 2002). Besides considering absolute fit, the classification by each latent class is another concern. When calculating the classification accuracy by each latent class, if the sample size is too small, there is a high probability that no respondents would be classified into certain latent classes, regardless of whether the model fits the data adequately. Of the sample sizes I tried, 200 was not sufficient to calculate the class-specific classification accuracy rates for a test with 4 attributes (which means 16 latent classes), while 500 could work as the minimal sample size for calculating the class-specific classification rates. Therefore, based on the literature review and the pilot study, 500 was chosen as the lowest level of sample size in this study, 5,000 represented the large sample size to contrast with the results yielded by a small sample size, and a moderate level of 1,000 was included in order to see the trend of the effects of sample size on model fit and classification accuracy.

Number of Attributes

This study focused on one level of the number of attributes, K = 4. A literature review of the CDM simulation studies indicates that there are usually three to eight attributes in a designed assessment, which also reflects the number of attributes in application examples (Chen, 2009; Chen, de la Torre, & Zhang, 2013; DeCarlo, 2012; de la Torre, 2009; de la Torre & Douglas, 2004; Huebner & Wang, 2011; Kunina-Habenicht et al., 2012; Rupp & Templin, 2008). When the number of attributes increases, the number of items in a test has to increase in order to provide reliable
information for each attribute being measured. Since the computation time increases as the number of attributes increases (de la Torre, 2009; Rupp & Templin, 2008b), the number of attributes and items in an assessment is limited in research from a statistical perspective. Considering all the other factors being manipulated in the simulation and the fairly large estimation process, the number of attributes was fixed at 4 in this study.

Marginal Attribute Difficulty

A multivariate normal distribution for the latent attributes, with mean vector \mu and correlation matrix \Sigma, was used in this study. The mean vector \mu = (0, 0, 0, 0) was used for the four-attribute test; this led to the same marginal mastery proportion of .50 for all attributes. This mean vector is also called the marginal attribute difficulty.

Correlation Between Attributes

Two levels of attribute correlation were set, \rho = .4 and \rho = .8, to represent moderate and high correlation. A range of .3 to .9 for the tetrachoric correlation is typical in educational assessment and CDM research (Cui et al., 2012; Henson, Roussos, Douglas, & He, 2008; Henson, Templin, & Douglas, 2007; Kunina-Habenicht et al., 2012). The detailed settings of the attribute correlations in these studies are shown in the Appendix. A weakly correlated attributes level could have been included as a contrast, but I chose not to do this to keep the overall simulation and estimation manageable. The correlations were set to be equal across all attribute pairs in the correlation matrix:
    Σ = | 1  ρ  ρ  ρ |
        | ρ  1  ρ  ρ |
        | ρ  ρ  1  ρ |
        | ρ  ρ  ρ  1 |                    (1-12)

where ρ is the defined attribute correlation. For example, for K = 4 and ρ = .4 in this study, the correlation matrix is as follows:

    Σ = | 1   .4  .4  .4 |
        | .4  1   .4  .4 |
        | .4  .4  1   .4 |
        | .4  .4  .4  1  |                (1-13)

Number of Items

The number of items in a test was set to two levels in this study: J = 14 and 28. The number of items is related to the number of attributes measured in a test. In order to understand the setup for this study, first consider that the number of all possible attribute patterns for 4 attributes is 2^4 = 16. If a test contains items reflecting all possible attribute patterns with at least one attribute measured per item, there are a total of 15 attribute patterns that could exist for items. In addition, Kunina-Habenicht et al. (2012) pointed out that when more than three attributes load on a single item, it leads to large standard errors of parameter estimates and is also computationally very time-consuming. A few previous studies have set the maximum number of attributes required by one item to three (Chen et al., 2013; de la Torre, 2008, 2009; de la Torre & Douglas, 2004; Kunina-Habenicht et al., 2012). Following these, I used a design in which a test contained items reflecting all possible attribute patterns with at least one attribute and a maximum of 3 attributes assessed by an item. Specifically, for an assessment measuring 4 attributes, there were a total of 14 items (4 one-attribute, 6 two-attribute, and 4 three-attribute items) as the minimum needed to represent all the different attribute patterns as fully as possible.
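The attribute-generation design described above (zero mean vector, exchangeable correlation matrix, dichotomization at zero so each marginal mastery proportion is .50) can be sketched in Python. This is an illustrative sketch, not code from the study; the function name `simulate_attributes` is my own.

```python
import numpy as np

def simulate_attributes(n, k=4, rho=0.4, seed=0):
    """Draw n attribute mastery patterns by thresholding a multivariate
    normal with mean 0 and exchangeable correlation rho at zero, so each
    attribute has a marginal mastery proportion of .50."""
    rng = np.random.default_rng(seed)
    # Exchangeable correlation matrix: 1 on the diagonal, rho elsewhere.
    sigma = np.full((k, k), rho)
    np.fill_diagonal(sigma, 1.0)
    z = rng.multivariate_normal(mean=np.zeros(k), cov=sigma, size=n)
    return (z > 0).astype(int)  # 1 = mastery, 0 = non-mastery

alphas = simulate_attributes(n=500, k=4, rho=0.4)
print(alphas.shape)         # (500, 4)
print(alphas.mean(axis=0))  # each marginal proportion close to .50
```

Thresholding at the mean of 0 is what makes the marginal attribute difficulty uniform at .50; shifting an element of the mean vector would make that attribute easier or harder to master.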
This simulation design also considered conditions where the test length is equal to, as well as greater than, the number of possible attribute patterns, in order to investigate the impact of these conditions on model fit and classification accuracy. An insufficient number of items per attribute would cause convergence and classification problems in the analysis (Kunina-Habenicht et al., 2012). Two levels of item number were examined in this study, J = 14 and 28, for the number of attributes K = 4. The correctly specified Q matrix for J = 28 is shown in Table 2-2. The Q matrix for J = 14 was embedded as a subset of this Q matrix.

Item Parameter Specification for Data Generation

In order to define meaningful and realistic parameters in the G-DINA model for the simulation, the parameter settings were referenced from a real-data analysis study, which used the DINA and G-DINA models to fit 4677 examinees' responses to a 2008 OKS examination of 6th grade mathematics and examined the model fit and parameter estimates (Basokcu et al., 2013). The G-DINA is a saturated model including all the possible interactions of the attributes, so that one-attribute items contain 2 parameters, two-attribute items contain 4 parameters, and three-attribute items contain 8 parameters. The true item parameters used in this simulation study are presented in Table 2-3. For simplicity, all the items measuring one attribute used the same parameter setting, all the items measuring two attributes used the same parameter setting, and all the items measuring three attributes used the same parameter setting as well.
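Reading the Table 2-3 parameters as G-DINA success probabilities for each reduced attribute pattern (my interpretation of the table; the lookup structure and function names below are illustrative, not from the study), response generation for a single item can be sketched as:

```python
import numpy as np

# Success probabilities by reduced attribute pattern (values from Table 2-3).
# Inner keys are the patterns of only the attributes the item requires.
P_TABLE = {
    1: {(0,): .21, (1,): .68},
    2: {(0, 0): .18, (1, 0): .25, (0, 1): .15, (1, 1): .59},
    3: {(0, 0, 0): .26, (1, 0, 0): .12, (0, 1, 0): .17, (0, 0, 1): .18,
        (1, 1, 0): .13, (1, 0, 1): .27, (0, 1, 1): .26, (1, 1, 1): .51},
}

def p_correct(q_vector, alpha):
    """P(X = 1 | alpha) for one item: reduce alpha to the attributes the
    item requires, then look up the success probability."""
    required = [i for i, q in enumerate(q_vector) if q == 1]
    reduced = tuple(alpha[i] for i in required)
    return P_TABLE[len(required)][reduced]

def simulate_response(q_vector, alpha, rng):
    """Bernoulli draw for one examinee on one item."""
    return int(rng.random() < p_correct(q_vector, alpha))

rng = np.random.default_rng(1)
print(p_correct([1, 1, 0, 0], [1, 1, 0, 1]))  # 0.59: both required attributes mastered
print(p_correct([1, 0, 0, 0], [0, 1, 1, 1]))  # 0.21: the one required attribute is missing
```

Because the G-DINA probabilities depend only on the reduced pattern of required attributes, a single lookup per item suffices regardless of the examinee's full 4-attribute pattern.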
Data Analysis

This study investigated the sensitivity of various model fit indices and assessed overall and class-specific classification accuracy under different settings. In total, there were 216 different settings for data analysis, comprising: 1) 18 different estimations, the combinations of 3 CDM specifications and 6 Q matrix specifications; and 2) 12 data-generating conditions, the combinations of number of respondents, number of items and attribute correlation. For each setting, the fit indices -2LL, AIC, BIC, CAIC and RMSEA were evaluated; the overall and class-specific classification accuracy were also assessed.

CDM Specification

For each of the generated datasets, a total of 3 CDMs within the G-DINA framework were applied for the data analysis: the G-DINA, A-CDM and DINA models. The G-DINA model was the true generating model, which included intercept, main effect, two-way interaction, and three-way interaction parameters for all conditions. In addition to the true model, two types of misspecified CDMs were used to analyze the data. CDM misspecification in this study refers to incorrect parameterization of the modeling process. As the two comparison models, the A-CDM contained only the intercept and main effects for each item, and the DINA model contained only the intercept and the highest-order interaction effect for each item.

Q Matrix Specification

A specified Q matrix was incorporated in each CDM for the cognitive diagnostic analysis. This study examined a total of 6 Q matrices, including 1 correctly specified Q matrix and 5 misspecified Q matrices.
Different types of Q matrix misspecification were investigated: underfitting the Q matrix (i.e., specifying 0s where there should be 1s), overfitting the Q matrix (i.e., specifying 1s where there should be 0s), and a balanced misfit (i.e., exchanging 0s and 1s while controlling the overall number of changes). These Q matrix misspecifications were created by randomly selecting the attributes to be altered in an item. Table 2-4 shows the Q matrix misspecification conditions along with the generating Q matrix. Taking the test with J = 14 items as an example, qt-14 was the true Q matrix for data generation. For the two under-specified Q matrices, qu3-14 changed all 3-attribute items into 2-attribute items, with the attribute to delete selected at random for each item, and qu2-14 changed all 2-attribute items into 1-attribute items, again with the deleted attribute selected at random for each item. Similarly, the two over-specified Q matrices, qo1-14 and qo2-14, were created by randomly selecting the attribute to be added. To create the balanced misfit Q matrix (qm-14), the items to be altered were first randomly selected; then the attributes to be altered were selected randomly within each item. The assessment with J = 28 items doubled the items of the J = 14 assessment. The Q matrix misspecification for J = 28 occurred only in items 1 to 14; items 15 to 28 always remained the same as in the true Q matrix (qt-28). In this way, the number of misspecified items for J = 28 was the same as for J = 14 when controlling the type of misspecification, which made the results comparable across test lengths.
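The qu- and qo-type perturbations can be sketched as below. This is an illustrative Python sketch under my reading of the procedure (one randomly chosen attribute deleted from, or added to, each targeted item); the function names are mine, not from the study.

```python
import numpy as np

def underspecify(q, rows, rng):
    """qu-type misspecification: for each targeted item, flip one
    randomly chosen required attribute from 1 to 0."""
    q = q.copy()
    for r in rows:
        ones = np.flatnonzero(q[r] == 1)
        q[r, rng.choice(ones)] = 0
    return q

def overspecify(q, rows, rng):
    """qo-type misspecification: for each targeted item, flip one
    randomly chosen non-required attribute from 0 to 1."""
    q = q.copy()
    for r in rows:
        zeros = np.flatnonzero(q[r] == 0)
        q[r, rng.choice(zeros)] = 1
    return q

# True Q matrix for J = 14 (Table 2-2): all patterns with 1-3 attributes.
qt = np.array([[1,0,0,0],[0,1,0,0],[0,0,1,0],[0,0,0,1],
               [1,1,0,0],[1,0,1,0],[1,0,0,1],[0,1,1,0],[0,1,0,1],[0,0,1,1],
               [1,1,1,0],[1,1,0,1],[1,0,1,1],[0,1,1,1]])

rng = np.random.default_rng(7)
qu3 = underspecify(qt, rows=range(10, 14), rng=rng)  # 3-attribute items -> 2 attributes
print(qu3[10:14].sum(axis=1))  # [2 2 2 2]
```

The balanced-misfit matrix qm would combine both operations over a random subset of items so the total counts of 1-to-0 and 0-to-1 changes match.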
Model Fit Evaluation

The sensitivity of five fit statistics to the specification of the CDM, the Q matrix, and both was investigated. The -2LL, AIC, BIC and CAIC were used as relative fit indices, and the mean RMSEA was the absolute fit index. Both types of fit indices were used because, if only relative fit indices were used, a best-fitting model could be selected even though none of the candidate models fit the data adequately; and if only absolute fit indices were used, it is likely that more than one model would fit the data adequately in practice (Rupp et al., 2012). For all the fit indices, the smaller the value, the better the model fit. For each fit index, the best-fitting model with the smallest value was compared and selected across the different specifications of the CDM, the Q matrix, and both, over all conditions. The proportion of times that the true CDM and/or true Q matrix was selected out of 1000 replications was reported as the selection rate for each fit index. Besides the selection rate, the values of RMSEA were also reported, since it is an absolute fit index and a value less than .05 indicates good fit (Gardner et al., 2002; Kunina-Habenicht et al., 2009).

Classification Accuracy

Classification accuracy refers to the degree to which the estimated latent classes match the true latent classes. The simulated attribute patterns were used as the true latent classes, and the attribute patterns estimated from the response data using the MLE method were used as the estimated latent classes. The simulated and estimated latent classes were then compared for each examinee: if they were identical, the examinee was coded 1 as correctly classified; otherwise, the examinee was coded 0 as
classified inaccurately. By taking the average of these 0/1 codes over all examinees and all replications, the overall correct classification rate was calculated for each condition. By taking the average of the 0/1 codes for the examinees within each latent class, the class-specific correct classification rates were calculated. The classification accuracy was then compared across all the estimation settings.

Table 2-1. Simulation Conditions for Data Generation and Model Estimation

Characteristics                      Number of Levels   Values of Levels
Data Generation
  Number of Respondents              3                  N = 500, 1000 and 5000
  Number of Attributes               1                  K = 4
  Marginal Attribute Difficulty      1                  μ = (0, 0, 0, 0)
  Attribute Correlations             2                  ρ = .4 and .8 (same for all pairs of attributes)
  Number of Items                    2                  J = 14 and 28
  Item Parameter Specification       1
Model Estimation
  CDMs                               3                  G-DINA (generating model), A-CDM, DINA
  Specified Q matrix                 6                  1 true Q matrix and 5 misspecified Q matrices
Number of Replications                                  1000
Number of Estimations

Table 2-2. Correctly Specified Q Matrix for J = 14 and 28

        Attribute                Attribute
Item    1  2  3  4       Item    1  2  3  4
 1      1  0  0  0        15     1  0  0  0
 2      0  1  0  0        16     0  1  0  0
 3      0  0  1  0        17     0  0  1  0
 4      0  0  0  1        18     0  0  0  1
 5      1  1  0  0        19     1  1  0  0
 6      1  0  1  0        20     1  0  1  0
 7      1  0  0  1        21     1  0  0  1
 8      0  1  1  0        22     0  1  1  0
 9      0  1  0  1        23     0  1  0  1
10      0  0  1  1        24     0  0  1  1
11      1  1  1  0        25     1  1  1  0
12      1  1  0  1        26     1  1  0  1
13      1  0  1  1        27     1  0  1  1
14      0  1  1  1        28     0  1  1  1

Note. Items 1-14 are used when J = 14.
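The classification-rate computation described in the Classification Accuracy section above amounts to exact-pattern agreement, averaged over all examinees and then within each true latent class. A minimal illustrative sketch (the function name is mine, and the toy arrays are not study data):

```python
import numpy as np

def classification_rates(true_alpha, est_alpha):
    """Overall and class-specific correct classification rates.
    An examinee counts as correctly classified only if the entire
    estimated attribute pattern matches the simulated one."""
    correct = (true_alpha == est_alpha).all(axis=1).astype(float)
    overall = correct.mean()
    by_class = {}
    for pattern in np.unique(true_alpha, axis=0):
        mask = (true_alpha == pattern).all(axis=1)
        by_class[''.join(map(str, pattern))] = correct[mask].mean()
    return overall, by_class

true_alpha = np.array([[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0], [1, 1, 1, 1]])
est_alpha  = np.array([[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0], [1, 1, 1, 1]])
overall, by_class = classification_rates(true_alpha, est_alpha)
print(overall)           # 0.75
print(by_class['1100'])  # 0.5
```

With small samples, some of the 16 latent classes may be empty, which is exactly why the pilot study found N = 200 insufficient for class-specific rates.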
Table 2-3. True Item Parameters

1-attribute item
  Attribute pattern:   0     1
  Parameter:          .21   .68

2-attribute item
  Attribute pattern:   00    10    01    11
  Parameter:          .18   .25   .15   .59

3-attribute item
  Attribute pattern:   000   100   010   001   110   101   011   111
  Parameter:          .26   .12   .17   .18   .13   .27   .26   .51
Table 2-4. The Q Matrix Misspecifications and True Q Matrix (K = 4)

J = 14
  qt-14    Data-generating Q matrix. Items altered: none. Changes 1→0: 0; 0→1: 0. Ave. attributes per item: 2; ave. items per attribute: 7.
  qu3-14   All 3-attribute items changed into 2-attribute items. Items altered: I11-I14. Changes 1→0: 4; 0→1: 0. Ave. attributes per item: 1.71; ave. items per attribute: 6.
  qu2-14   All 2-attribute items changed into 1-attribute items. Items altered: I5-I10. Changes 1→0: 6; 0→1: 0. Ave. attributes per item: 1.57; ave. items per attribute: 5.5.
  qo1-14   All 1-attribute items changed into 2-attribute items. Items altered: I1-I4. Changes 1→0: 0; 0→1: 4. Ave. attributes per item: 2.29; ave. items per attribute: 8.
  qo2-14   All 2-attribute items changed into 3-attribute items. Items altered: I5-I10. Changes 1→0: 0; 0→1: 6. Ave. attributes per item: 2.43; ave. items per attribute: 8.5.
  qm-14    Attributes deleted and added to balance out the overall number of changes. Items altered: 2 randomly selected from I1-I4, 3 from I5-I10, 2 from I11-I14. Changes 1→0: 7; 0→1: 7. Ave. attributes per item: 2; ave. items per attribute: 7.

J = 28
  qt-28    Data-generating Q matrix. Items altered: none. Changes 1→0: 0; 0→1: 0. Ave. attributes per item: 2; ave. items per attribute: 14.
  qu3-28   Half of the 3-attribute items changed into 2-attribute items. Items altered: I11-I14. Changes 1→0: 4; 0→1: 0. Ave. attributes per item: 1.86; ave. items per attribute: 13.
  qu2-28   Half of the 2-attribute items changed into 1-attribute items. Items altered: I5-I10. Changes 1→0: 6; 0→1: 0. Ave. attributes per item: 1.79; ave. items per attribute: 12.5.
  qo1-28   Half of the 1-attribute items changed into 2-attribute items. Items altered: I1-I4. Changes 1→0: 0; 0→1: 4. Ave. attributes per item: 2.14; ave. items per attribute: 15.
  qo2-28   Half of the 2-attribute items changed into 3-attribute items. Items altered: I5-I10. Changes 1→0: 0; 0→1: 6. Ave. attributes per item: 2.21; ave. items per attribute: 15.5.
  qm-28    Attributes deleted and added to balance out the overall number of changes. Items altered: 2 randomly selected from I1-I4, 3 from I5-I10, 2 from I11-I14. Changes 1→0: 7; 0→1: 7. Ave. attributes per item: 2; ave. items per attribute: 14.
CHAPTER 3
RESULTS

The results are organized into two parts. The first part illustrates the performance of the model fit indices, including relative and absolute fit evaluation, and the second part presents the examinee classification accuracy.

Model Fit Evaluation

Relative Fit Evaluation

First, the selection rates of -2LL, AIC, BIC and CAIC were evaluated when only CDM misspecification was of concern (see Table 3-1). The index -2LL almost always selected the correct G-DINA model in all conditions. This was expected because the G-DINA model is saturated: it has the most complex parameterization and always attains a higher maximized likelihood than any reduced model. The AIC had better selection rates than BIC and CAIC for picking the correct CDM, which was the G-DINA model, in all conditions. The AIC almost always selected the correct CDM under the Q matrices qt, qu3, qo1 and qo2. Under qm, the selection rates of AIC ranged from .709 to 1; under qu2, the AIC performed worse, ranging from .106 to 1. However, when test length was J = 28 or sample size was N = 5000 under qu2, the selection rates of AIC were .9 or above. The fit indices BIC and CAIC showed similar performance in all conditions, with BIC performing better than CAIC. This is not surprising because they share the same structure, both including sample size in the penalty term. The BIC and CAIC had overall acceptable selection rates (around .8 to 1) except in a few conditions: when the sample size was 500, the BIC and CAIC had low selection rates for detecting the correct CDM under all Q matrices, and under qu2 they could not detect the correct CDM at all when test length was J = 14.
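The four relative indices differ only in the penalty attached to the number of free parameters p at sample size N: AIC = -2LL + 2p, BIC = -2LL + p·ln N, CAIC = -2LL + p·(ln N + 1). A minimal sketch with illustrative numbers (not values from the study) shows why a saturated model can win on -2LL and AIC yet lose on BIC:

```python
import math

def fit_indices(neg2ll, n_params, n):
    """Relative fit indices; for all of them, smaller is better."""
    return {
        '-2LL': neg2ll,
        'AIC':  neg2ll + 2 * n_params,
        'BIC':  neg2ll + n_params * math.log(n),
        'CAIC': neg2ll + n_params * (math.log(n) + 1),
    }

# Hypothetical saturated vs. reduced model at N = 500.
sat     = fit_indices(neg2ll=9800.0, n_params=80, n=500)
reduced = fit_indices(neg2ll=9900.0, n_params=40, n=500)
print(sat['-2LL'] < reduced['-2LL'])  # True: the saturated model always fits the likelihood better
print(sat['AIC']  < reduced['AIC'])   # True: AIC's penalty of 2 per parameter is mild
print(sat['BIC']  > reduced['BIC'])   # True: BIC's ln(500) ≈ 6.2 per parameter flips the choice
```

This penalty structure is consistent with the pattern in Table 3-1: -2LL and AIC favor the saturated G-DINA almost everywhere, while BIC and CAIC hesitate at small N.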
Table 3-1 also shows that sample size and test length had notable effects on the performance of AIC, BIC and CAIC in detecting the correct CDM, while the attribute correlation had a relatively weaker impact. When N increased, the selection rates for the correct CDM by all the indices increased dramatically; when J increased, the selection rates went up as well. The correlation between attributes had a negative impact on the selection rates of AIC, BIC and CAIC. The effects of N and J were expected, because increasing either the sample size (from the person perspective) or the test length (from the item perspective) provides more information for evaluating the fit of the model. It is worth noting that when the test was longer (J = 28), AIC and -2LL could always detect the saturated G-DINA model regardless of the Q matrix misspecification. In summary, the AIC and -2LL had good selection rates for the correct saturated CDM except in a few conditions. The selection rates for detecting the correct CDM were higher under qt, qu3, qo1 and qo2, and more problematic under the misspecified Q matrices qu2 and qm. A larger sample size and/or a longer test helped in selecting the correct CDM, and the smaller the attribute correlation, the more easily the correct CDM was detected.

Second, the selection rates of -2LL, AIC, BIC and CAIC were evaluated when only Q matrix misspecification was of concern (see Table 3-2). The -2LL could not detect the correct Q matrix at all when the saturated G-DINA was the fitted model; when the A-CDM was fitted, the selection rates of -2LL were generally low, ranging from 0 to .7; and when the DINA model was fitted, the selection rates of -2LL for the correct Q matrix ranged from .599 to 1. For the AIC, the true Q matrix was selected most of the time (at least .863) when the generating G-DINA model was fitted, in all conditions;
when fitting the A-CDM and DINA, the selection rates of AIC varied from .004 to 1 across all the conditions. The performance of BIC and CAIC in detecting the correct Q matrix was generally better than that of AIC. Further, the BIC and CAIC showed similar selection rates in all conditions, with BIC slightly better than CAIC. When the G-DINA model was fitted, the selection rates of BIC and CAIC were high, ranging from .863 to 1, except in the condition N = 500, J = 14 and ρ = .8. When fitting the A-CDM and DINA, the selection rates of BIC and CAIC both varied from .599 to 1 across all conditions. It is noteworthy that when the DINA model was fitted, the -2LL, AIC, BIC and CAIC yielded identical selection rates. A diagnosis of the original fit index data found that the patterns of these relative fit indices were highly consistent across the different types of Q matrices; thus the smallest value of each fit index almost always occurred in the same type of Q matrix within each replication, which produced the same selection rates for each index. Table 3-2 also illustrates the effects of N, J and ρ on the performance of each relative fit index for selecting the correct Q matrix. When N increased, the selection rates of the correct Q matrix for AIC, BIC and CAIC increased when G-DINA and DINA were fitted, but the pattern was inconsistent when the A-CDM was fitted. When J increased, the selection rates for AIC, BIC and CAIC increased in general when the G-DINA and DINA were fitted; when the A-CDM was fitted, the selection rates for AIC, BIC and CAIC increased with J at N = 500 but decreased at N = 5000. As ρ increased, the selection rates for the correct Q matrix showed various patterns for each index in different conditions. When G-DINA and DINA were the fitted models, the selection rates of AIC, BIC and CAIC decreased in general as ρ increased; when the A-CDM was the fitted model, the selection rates of AIC increased at N = 500 and 1000 and decreased at N = 5000,
while the BIC and CAIC selected the correct Q matrix more frequently as ρ increased, at every level of sample size. In summary, the AIC had an overall stable ability to detect the correct Q matrix. The BIC generally performed better at selecting the correct Q matrix, except in the condition N = 500, J = 14 and ρ = .8, where AIC had a better selection rate. The selection rates for the correct Q matrix were higher under the saturated model than under the other two models, which suggests that the saturated model can be used to pick out the correct Q matrix if the sample size is large enough (at least 1000).

Third, the performance of -2LL, AIC, BIC and CAIC was evaluated when both CDM and Q matrix misspecification were of concern (Table 3-3). The -2LL could not select the correct CDM and Q matrix at the same time in any condition. The AIC had good overall selection rates in all conditions, ranging from .863 to .955. The selection rates of BIC and CAIC showed the same pattern, with BIC performing better, so the BIC is the focus rather than CAIC. The selection rates of BIC were higher than those of AIC when N was larger and/or ρ was smaller. More specifically, when the sample size was 1000 or larger, the BIC had remarkable selection rates for both the correct CDM and Q matrix; when the sample was smaller (e.g., 500), especially combined with a larger attribute correlation (e.g., ρ = .8), the AIC can serve as an alternative for detecting the correct CDM and Q matrix.

Absolute Fit Evaluation

As mentioned earlier, if only relative fit indices were used for selecting the model, there could be a situation in which none of the competing models fit the data
adequately. In this section, therefore, the absolute fit index RMSEA was evaluated in an absolute sense, to see whether the model fit the data well. Figure 3-1 shows boxplots of the empirical sampling distribution of RMSEA for the G-DINA, A-CDM and DINA models. These plots concern only the different CDMs and Q matrices, collapsing over the levels of sample size, test length and attribute correlation. For the effects of the CDM specifications, Figure 3-1 shows that the values of RMSEA for G-DINA were generally lower than for the A-CDM and DINA, indicating that G-DINA had a better model fit. This is expected because G-DINA is the saturated generating model. Comparing the A-CDM and DINA models, the A-CDM had lower values of RMSEA. It is also noticeable that the RMSEA values for the DINA model were more spread out, meaning the variation of RMSEA was larger. For the effects of the different types of Q matrices within each model, Figure 3-1 also illustrates that low values of RMSEA occurred for the true Q matrix (qt) and the over-specified Q matrices (qo1 and qo2) in the G-DINA model and A-CDM. The 75th percentile values of RMSEA for the G-DINA model under qt, qo1 and qo2 were generally under .05, which indicates good fit. The under-specified Q matrices qu3 and qu2 showed much higher RMSEA values than qt in the G-DINA model, slightly higher values in the A-CDM, and not much difference in DINA. The balanced-misfit Q matrix qm had the highest RMSEA values in all three models, indicating the worst fit to the data. In sum, the G-DINA and A-CDM fit better than the DINA model, and the true and over-specified Q matrices fit the data better. This also implies that an over-specified Q matrix is not easy to detect, even with the true CDM.
In order to investigate the effects of sample size, test length and attribute correlation on the absolute fit index, Figure 3-2 focuses on the empirical sampling distributions of the RMSEA indexes for the G-DINA with the correct Q matrix. The results show that as sample size increases, the values of RMSEA decrease dramatically, which is expected. They also show that the values of RMSEA increased as test length increased, and that the differences in RMSEA values between test lengths J = 14 and J = 28 decreased as sample size increased. When test length increases, the model is more complex, with a larger number of parameters, and this more complex model is harder to fit well unless more information is provided by more respondents. Thus the decline in fit due to the increase in test length is ameliorated as the sample size increases. The correlation between attributes did not have a notable impact on RMSEA. Besides assessing the values of RMSEA to evaluate the fit to the data, I also compared the RMSEA across settings for selecting the correct CDM, the correct Q matrix, and both. As can be seen in Table 3-4, RMSEA had overall good selection rates for picking the correct CDM under qt, qu3, qo1 and qo2, which is consistent with the low values of RMSEA in the preceding boxplots. The RMSEA failed to select the correct Q matrix, as well as both the correct CDM and Q matrix simultaneously.

Classification Accuracy

Classification is usually of primary interest in CDM estimation, as decisions about the examinees are made based on the classification. Two types of classification accuracy are illustrated in this part: the overall classification accuracy and the classification accuracy by each latent class. Table 3-5 presents the overall correct classification rates by the different levels of all factors. For the purpose of explaining the results more explicitly, the effects
of N, J and ρ on the overall classification accuracy are the focus of Table 3-5; the impact of CDM misspecification and Q matrix misspecification is then examined in Table 3-6, which includes both overall and class-specific classification. As shown in Table 3-5, when test length increased, the correct overall classification rates were much higher. For example, in the G-DINA model with the qt matrix, the correct overall classification rate went up from .711 to .886 as test length increased from J = 14 to 28, holding ρ = .4 and N = 500 constant. This is expected because more pieces of information provided by the items for each dimension can be used to sharpen the classification. Second, as the sample size increased, the overall classification rates slightly increased in all conditions. For example, again in the G-DINA model with the qt matrix, the overall classification accuracy increased from .886 to .893 as sample size increased from 500 to 5000, holding ρ = .4 and J = 28 constant. Comparing the effects of J and N on classification accuracy, more items in a test are more critical than more examinees for obtaining better classification accuracy. Third, an increase in attribute correlation slightly increased the overall classification accuracy. To investigate the effects of the misspecification of the CDM and the Q matrix, the correct overall and class-specific classification rates are tabulated in Table 3-6. The classification rates in this table are collapsed over the other factors N, J and ρ for simpler illustration. The CDM misspecification did not have much impact on the overall classification accuracy; however, the overall classification rates varied among the differently specified Q matrices. The class-specific classification rates were related to the different types of misspecified Q matrices as well.
For CDM misspecification, Table 3-6 shows that the overall classification accuracy was highest for G-DINA no matter which specified Q matrix was used. This makes sense because G-DINA was the generating model. Comparing the other two CDMs, the A-CDM had higher classification rates than DINA. The A-CDM yielded overall classification rates very similar to those of the true model G-DINA, even though the A-CDM contained only the main effects of the attributes and omitted all the interactions. The DINA model, which contained only the highest-order interaction among the attributes, showed the lowest classification rates among the three CDMs. For investigating Q matrix misspecification, the condition qt was the correct Q matrix and can be used as the baseline rate under optimal estimation conditions for each CDM. The correct overall classification rates under qt were higher than under the misspecified Q matrices for all three CDMs. The effects of the misspecified Q matrices on classification accuracy were then compared with the true Q matrix in the different CDMs as follows. The classification rates in G-DINA and A-CDM showed similar patterns for the Q matrix misspecification. Within these two models, the overall and class-specific classification rates for the conditions qu3, qo1 and qo2 were close to the rates under qt. The misspecified qu2 had lower overall classification rates, and its low class-specific classification rates occurred for the attribute patterns that matched the manipulated items (the 1-attribute and 2-attribute classes). The misspecified qm showed the lowest overall classification rates, with low class-specific classification rates in almost all attribute classes. This is not surprising because qm included all types of misspecification.
To compare the effects of the different Q matrices in the DINA model: the correct classification rates were highest under qt; the condition qu3 yielded almost the same results as qt; and the lowest classification still occurred under qm among all the conditions. Unlike in G-DINA and A-CDM, in the condition qo1, where the one-attribute items were changed to two-attribute items, the classification rates for the classes mastering one attribute were very low, matching the manipulated items. The Q matrices qo2 and qu2 in the DINA model yielded moderate classification rates. In summary, it is not only the manipulation of the overall number of 0s and 1s in the Q matrix that matters; the manner of the misspecification also impacts the classification accuracy. Finally, across all conditions it is worth noting that the latent classes with more attributes had higher classification accuracy. The attribute class in which all attributes were mastered (attribute pattern 1111) maintained very high correct classification rates no matter which CDM and Q matrix were used. In the DINA model especially, misclassification of examinees in this attribute class never occurred.

Table 3-1. Selection Rates of Relative Fit Indices for CDM Misspecification

Q matrix  ρ    J   N     -2LL   AIC    BIC    CAIC
qt        0.4  14  500   1      1      0.934  0.706
                   1000  1      1      1      1
                   5000  1      1      1      1
               28  500   1      1      1      1
                   1000  1      1      1      1
                   5000  1      1      1      1
          0.8  14  500   1      1      0.167  0.025
                   1000  1      1      0.950  0.793
                   5000  1      1      1      1
               28  500   1      1      0.967  0.708
                   1000  1      1      1      1
                   5000  1      1      1      1
qu3       0.4  14  500   1      1      0.923  0.731
                   1000  1      1      1      1
                   5000  1      1      1      1
               28  500   1      1      1      1
Table 3-1. Continued

Q matrix  ρ    J   N     -2LL   AIC    BIC    CAIC
qu3       0.4  28  1000  1      1      1      1
                   5000  1      1      1      1
          0.8  14  500   1      1      0.537  0.244
                   1000  1      1      0.988  0.962
                   5000  1      1      1      1
               28  500   1      1      0.987  0.894
                   1000  1      1      1      1
                   5000  1      1      1      1
qu2       0.4  14  500   1      0.121  0      0
                   1000  1      0.225  0      0
                   5000  1      0.911  0      0
               28  500   1      1      0.546  0.194
                   1000  1      1      1      1
                   5000  1      1      1      1
          0.8  14  500   1      0.106  0      0
                   1000  1      0.296  0      0
                   5000  1      0.961  0      0
               28  500   1      1      0      0
                   1000  1      1      0.358  0.066
                   5000  1      1      1      1
qo1       0.4  14  500   1      1      0.924  0.758
                   1000  1      1      1      1
                   5000  1      1      1      1
               28  500   1      1      1      1
                   1000  1      1      1      1
                   5000  1      1      1      1
          0.8  14  500   1      0.999  0.187  0.056
                   1000  1      1      0.785  0.493
                   5000  1      1      1      1
               28  500   1      1      0.890  0.513
                   1000  1      1      1      1
                   5000  1      1      1      1
qo2       0.4  14  500   1      1      0.340  0.079
                   1000  1      1      0.971  0.923
                   5000  1      1      1      1
               28  500   1      1      1      0.991
                   1000  1      1      1      1
                   5000  1      1      1      1
          0.8  14  500   1      0.998  0.047  0.010
                   1000  1      1      0.128  0.066
                   5000  1      1      1      1
               28  500   1      1      0.313  0.050
                   1000  1      1      1      0.995
                   5000  1      1      1      1
Table 3-1. Continued

Q matrix  ρ    J   N     -2LL   AIC    BIC    CAIC
qm        0.4  14  500   0.992  0.816  0.003  0
                   1000  0.977  0.905  0.072  0.019
                   5000  0.966  0.929  0.853  0.848
               28  500   1      1      0.894  0.638
                   1000  1      1      1      1
                   5000  1      1      1      1
          0.8  14  500   0.990  0.709  0      0
                   1000  0.990  0.955  0.012  0.004
                   5000  0.997  0.996  0.996  0.993
               28  500   1      1      0.024  0.003
                   1000  1      1      0.792  0.423
                   5000  1      1      1      1

Table 3-2. Selection Rates of Relative Fit Indices for Q Matrix Misspecification

Model   ρ    J   N     -2LL   AIC    BIC    CAIC
G-DINA  0.4  14  500   0      0.909  0.965  0.863
                 1000  0      0.940  1      1
                 5000  0      0.947  1      1
             28  500   0      0.933  1      1
                 1000  0      0.944  1      1
                 5000  0      0.945  1      1
        0.8  14  500   0      0.863  0.346  0.146
                 1000  0      0.940  0.981  0.923
                 5000  0      0.957  1      1
             28  500   0      0.944  0.997  0.991
                 1000  0      0.945  1      1
                 5000  0      0.955  1      1
A-CDM   0.4  14  500   0.290  0.557  0.723  0.703
                 1000  0.502  0.609  0.735  0.729
                 5000  0.700  0.705  0.733  0.741
             28  500   0.382  0.699  0.922  0.919
                 1000  0.345  0.561  0.963  0.974
                 5000  0.016  0.045  0.631  0.723
        0.8  14  500   0.190  0.813  0.876  0.822
                 1000  0.074  0.777  1      0.999
                 5000  0      0.308  0.995  0.996
             28  500   0.019  0.838  0.997  0.997
                 1000  0      0.769  0.999  0.999
                 5000  0      0.004  0.999  0.999
DINA    0.4  14  500   0.777  0.777  0.777  0.777
                 1000  0.861  0.861  0.861  0.861
                 5000  0.993  0.993  0.993  0.993
             28  500   0.967  0.967  0.967  0.967
Table 3-2. Continued

Model   ρ    J   N     -2LL   AIC    BIC    CAIC
DINA    0.4  28  1000  0.997  0.997  0.997  0.997
                 5000  1      1      1      1
        0.8  14  500   0.599  0.599  0.599  0.599
                 1000  0.667  0.667  0.667  0.667
                 5000  0.833  0.833  0.833  0.833
             28  500   0.933  0.933  0.933  0.933
                 1000  0.992  0.992  0.992  0.992
                 5000  1      1      1      1

Table 3-3. Selection Rates of Relative Fit Indices for Both CDM and Q Matrix Misspecifications

ρ    J   N     -2LL  AIC    BIC    CAIC
0.4  14  500   0     0.909  0.905  0.628
         1000  0     0.940  1      1
         5000  0     0.947  1      1
     28  500   0     0.933  1      1
         1000  0     0.944  1      1
         5000  0     0.945  1      1
0.8  14  500   0     0.863  0.078  0.010
         1000  0     0.940  0.936  0.747
         5000  0     0.957  1      1
     28  500   0     0.944  0.964  0.702
         1000  0     0.945  1      1
         5000  0     0.955  1      1

Table 3-4. Selection Rates of RMSEA for the Misspecification of CDM, Q Matrix and Both

               For CDM (by Q matrix)                      For Q matrix (by model)    For Both
ρ    J   N     qt     qu3    qu2    qo1    qo2    qm      G-DINA  A-CDM  DINA
0.4  14  500   1      0.997  0.172  1      1      0.427   0.010   0.078  0.003      0.010
         1000  1      1      0.282  1      1      0.624   0       0.137  0.001      0
         5000  1      1      0.381  1      1      0.951   0       0.276  0          0
     28  500   1      1      1      1      1      1       0       0.173  0.470      0
         1000  1      1      1      1      1      1       0       0.142  0.367      0
         5000  1      1      1      1      1      1       0       0.006  0.177      0
0.8  14  500   0.996  0.982  0.157  0.991  0.996  0.483   0.092   0.133  0.002      0.088
         1000  1      1      0.288  1      1      0.814   0.003   0.049  0          0.003
         5000  1      1      0.497  1      1      0.999   0       0      0          0
     28  500   0.997  1      0.954  0.997  0.997  0.908   0.006   0.044  0.169      0.006
         1000  1      1      1      1      1      1       0       0.003  0.100      0
         5000  1      1      1      1      1      1       0       0      0.016      0
Table 3-5. Correct Overall Classification Rates in All Conditions

                       Q Matrix Specification
Model   ρ    J   N     qt     qu3    qu2    qo1    qo2    qm
G-DINA  0.4  14  500   0.711  0.689  0.525  0.706  0.695  0.467
                 1000  0.719  0.698  0.533  0.717  0.713  0.477
                 5000  0.727  0.705  0.542  0.726  0.725  0.481
             28  500   0.886  0.879  0.837  0.885  0.883  0.815
                 1000  0.889  0.884  0.843  0.889  0.888  0.826
                 5000  0.893  0.888  0.850  0.893  0.893  0.834
        0.8  14  500   0.720  0.731  0.647  0.714  0.709  0.629
                 1000  0.723  0.734  0.651  0.720  0.720  0.639
                 5000  0.726  0.733  0.650  0.725  0.724  0.644
             28  500   0.873  0.876  0.846  0.871  0.870  0.824
                 1000  0.875  0.879  0.852  0.874  0.873  0.831
                 5000  0.877  0.881  0.858  0.877  0.876  0.837
A-CDM   0.4  14  500   0.669  0.645  0.536  0.657  0.641  0.460
                 1000  0.675  0.652  0.540  0.666  0.647  0.462
                 5000  0.689  0.653  0.543  0.678  0.654  0.453
             28  500   0.838  0.833  0.810  0.833  0.828  0.776
                 1000  0.850  0.846  0.816  0.847  0.845  0.792
                 5000  0.859  0.851  0.820  0.860  0.859  0.800
        0.8  14  500   0.738  0.726  0.641  0.730  0.717  0.617
                 1000  0.750  0.736  0.642  0.751  0.743  0.619
                 5000  0.755  0.742  0.643  0.760  0.758  0.609
             28  500   0.887  0.868  0.845  0.887  0.886  0.827
                 1000  0.888  0.868  0.846  0.889  0.888  0.831
                 5000  0.889  0.870  0.848  0.890  0.889  0.835
DINA    0.4  14  500   0.643  0.648  0.512  0.448  0.526  0.374
                 1000  0.647  0.652  0.513  0.454  0.530  0.375
                 5000  0.648  0.652  0.515  0.458  0.532  0.379
             28  500   0.851  0.855  0.767  0.736  0.773  0.737
                 1000  0.858  0.861  0.771  0.740  0.785  0.744
                 5000  0.865  0.867  0.781  0.746  0.794  0.753
        0.8  14  500   0.673  0.664  0.652  0.505  0.613  0.485
                 1000  0.678  0.666  0.653  0.509  0.613  0.488
                 5000  0.682  0.668  0.653  0.511  0.613  0.490
             28  500   0.870  0.872  0.828  0.712  0.811  0.750
                 1000  0.878  0.879  0.832  0.715  0.821  0.753
                 5000  0.882  0.883  0.836  0.717  0.825  0.759
Table 3-6. Correct Overall and Class-specific Classification Rates for Misspecifications of CDM and Q-matrix

                                                       Attribute classes
Model    Q-matrix  Overall  0000  1000  0100  0010  0001  1100  1010  1001  0110  0101  0011  1110  1101  1011  0111  1111
G-DINA   qt        .802     .587  .653  .666  .666  .673  .810  .816  .830  .822  .836  .847  .922  .934  .934  .940  1
         qu3       .798     .688  .637  .609  .609  .578  .756  .728  .726  .758  .756  .815  .840  .890  .890  .912  1
         qu2       .719     .792  .532  .568  .568  .305  .502  .745  .500  .532  .406  .492  .849  .807  .807  .672  .944
         qo1       .800     .586  .648  .661  .661  .668  .806  .812  .827  .818  .832  .843  .923  .933  .933  .939  1
         qo2       .797     .592  .631  .648  .648  .654  .792  .801  .816  .804  .817  .829  .921  .932  .932  .936  1
         qm        .692     .741  .290  .353  .353  .526  .438  .414  .556  .518  .471  .624  .673  .557  .557  .574  .994
A-CDM    qt        .791     .647  .628  .623  .623  .599  .731  .753  .759  .767  .771  .776  .864  .872  .872  .877  .995
         qu3       .774     .740  .612  .592  .592  .541  .636  .652  .637  .688  .680  .782  .712  .809  .809  .830  .991
         qu2       .711     .781  .545  .593  .593  .307  .488  .728  .485  .524  .375  .481  .826  .800  .800  .664  .936
         qo1       .780     .612  .608  .603  .603  .583  .723  .733  .746  .759  .764  .764  .869  .879  .879  .882  .994
         qo2       .787     .638  .608  .611  .611  .589  .726  .747  .754  .753  .762  .768  .867  .877  .877  .880  .997
         qm        .673     .735  .334  .362  .362  .481  .390  .355  .549  .434  .429  .516  .654  .553  .553  .531  .994
DINA     qt        .765     .597  .551  .593  .593  .608  .684  .690  .698  .710  .715  .733  .922  .908  .908  .925  1
         qu3       .764     .601  .505  .547  .547  .571  .715  .705  .727  .765  .749  .766  .887  .906  .906  .925  1
         qu2       .693     .735  .449  .482  .482  .253  .506  .705  .462  .463  .330  .384  .843  .726  .726  .675  1
         qo1       .604     .246  .242  .240  .240  .286  .664  .629  .668  .548  .724  .627  .892  .870  .870  .848  1
         qo2       .686     .555  .439  .455  .455  .467  .568  .557  .570  .579  .579  .573  .864  .820  .820  .610  1
         qm        .591     .396  .221  .376  .376  .419  .384  .433  .492  .462  .445  .626  .725  .517  .517  .635  1
Figure 3-1. Empirical sampling distributions of the RMSEA indices for the different CDMs and Q-matrices

Figure 3-2. Empirical sampling distributions of the RMSEA indices for the G-DINA with the correct Q-matrix by different factors
CHAPTER 4
DISCUSSION AND CONCLUSION

Discussion

This simulation study differed from previous studies in five aspects. First, the G-DINA model was used as a framework, which aligned with the trend in CDM development. The data were generated under the saturated model and fit with two reduced models as well as the saturated model, which better approximates the practice of real data analysis. Second, the Q-matrix misspecification and the CDM misspecification were investigated both separately and conjunctively. Third, the under-, over-, and mixed-misspecified Q-matrices were created by randomly selecting the attributes, so the results generalize more broadly. Fourth, both types of model fit indices, the relative and the absolute fit indices, were examined. Fifth, not only the overall classification accuracy but also the class-specific classification rates, often the primary interest in a CDM analysis, were investigated under different conditions.

When we analyze data in a cognitive diagnosis framework, there are often several alternative CDMs and Q-matrices to choose from. If either the CDM or the Q-matrix within the CDM is incorrect, the conditional independence assumption is likely to be violated (Rupp et al., 2012). If we assess only the relative fit, the comparative model fit can be misleading because the candidate models could all fit poorly in an absolute sense. Therefore, a more desirable strategy would be to first evaluate the absolute model fit of each model; if multiple models seem to fit the data reasonably well, information criteria could then be used to select the most parsimonious, best-fitting model (Rupp et al., 2012, p. 278). This study has not examined the stepwise procedures by
using the absolute and relative fit indices in sequence to achieve the best-fitting model. For future research, the sequencing and effectiveness of the model fit indices would be interesting to evaluate.

This study showed that the absolute fit index RMSEA can help to detect the correct CDM and Q-matrix. An RMSEA < .05 is recommended as a rule of thumb in IRT to indicate good fit (Gardner et al., 2002). The results demonstrated that the RMSEA had lower values in the G-DINA model, particularly with the true Q-matrix and the over-specified Q-matrices. The RMSEA can detect the CDM better than it can distinguish the Q-matrices; the over-specified Q-matrix, in particular, can be difficult to detect. Therefore, the absolute fit index needs to be used in conjunction with the relative fit indices to select the best-fitting model.

For the relative fit indices, if the candidate models only concern the misspecification of the CDM, the AIC yields better selection rates than the BIC. This may be because the BIC penalizes free parameters more strongly than the AIC and tends to favor parsimonious models. From my literature review, only one other study (Kunina-Habenicht et al., 2012) generated data using the saturated model; all the others used reduced models for data generation. Kunina-Habenicht et al. (2012) and my study both generated the data using the saturated model, but we used different general CDM frameworks (the LCDM versus the G-DINA). Our results consistently revealed that the AIC could detect the correct CDM more frequently than the BIC when the generating model was the saturated model. Thus, the AIC is suggested for selecting the correct CDM when the correct CDM is the saturated model.
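The two-step strategy discussed above (screen candidates on absolute fit, then compare the survivors on an information criterion) can be sketched as follows. This is a minimal illustration, not the routine used in this study or the CDM package's API: the candidate models, their RMSEA values, deviances, and parameter counts are all hypothetical, and the AIC, BIC, and CAIC formulas are the standard definitions (Akaike, 1987; Schwarz, 1978).

```python
import math

def info_criteria(neg2ll, n_params, n):
    """Standard definitions from the deviance (-2 log-likelihood):
    AIC = -2LL + 2p, BIC = -2LL + p*ln(N), CAIC = -2LL + p*(ln(N) + 1)."""
    return {
        "AIC": neg2ll + 2 * n_params,
        "BIC": neg2ll + n_params * math.log(n),
        "CAIC": neg2ll + n_params * (math.log(n) + 1),
    }

def select_model(candidates, criterion="AIC", rmsea_cutoff=0.05):
    """Two-step selection: keep models with acceptable absolute fit
    (RMSEA below the cutoff), then pick the smallest information
    criterion among the survivors; None if no model fits absolutely."""
    admissible = [m for m in candidates if m["rmsea"] < rmsea_cutoff]
    if not admissible:
        return None
    return min(admissible, key=lambda m: m[criterion])

# Hypothetical fit results for three candidate CDMs (N = 1000 examinees).
candidates = []
for name, neg2ll, p, rmsea in [("G-DINA", 14800.0, 60, 0.031),
                               ("A-CDM", 14950.0, 40, 0.044),
                               ("DINA", 15400.0, 28, 0.071)]:
    fit = {"model": name, "rmsea": rmsea}
    fit.update(info_criteria(neg2ll, p, n=1000))
    candidates.append(fit)

best = select_model(candidates, criterion="AIC")
```

In this toy example the DINA candidate is screened out on absolute fit before the information criteria are ever compared, which is exactly the point of evaluating absolute fit first.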
The −2LL always selects the saturated CDM because of its mathematical features: adding free parameters can never increase the deviance. In this study, the −2LL appeared to have a high selection rate for the correct saturated CDM, but only because the correct CDM happened to be the saturated model. In contrast, Chen et al. (2013) generated the data using a reduced model, and the −2LL still favored the saturated model. Thus, the −2LL is not suggested for selecting the correct CDM. A hypothesis test based on the −2LL could be conducted in future research, so that the decrease in fit of the reduced models can be tested for significance.

Furthermore, this study found that the Q-matrix specification had an impact on the detection of the correct CDM. When the true Q-matrix was used, the AIC always pointed to the correct saturated CDM. With a misspecified Q-matrix, the selection rates of the relative fit indices for the correct CDM were lower. Also, the types of Q-matrix misspecification had different effects on the correct CDM selection. For certain misspecified Q-matrices (e.g., qu2 and qm), the selection rates of the AIC were much lower than with the true Q-matrix, and the selection rates of the BIC and the CAIC were very low. The true Q-matrix is unknown in practice. It is worth noting that when the test length is long enough (e.g., J = 28 for a 4-attribute test in this study), the AIC can always detect the correct saturated CDM regardless of the Q-matrix misspecification. This is meaningful in practical applications, in which increasing the test length helps to detect the correct CDM when the true Q-matrix is unknown. Besides the positive effect of test length, an increase in sample size also improved the performance of the AIC, BIC and CAIC in picking the correct saturated model. When the sample size was large (e.g., 1000), the AIC had a remarkable
selection rate, and the performance of the BIC and the CAIC was also acceptable in most conditions. In sum, the AIC is suggested for practitioners to select the correct saturated CDM. More test items and larger sample sizes allow for easier detection of the CDM regardless of the Q-matrix misspecification.

If the candidate models only concern the misspecification of the Q-matrix, both the AIC and the BIC are sensitive to the Q-matrix specifications. The AIC has a stable performance overall, but its selection rates are lower than those of the BIC in most conditions. The BIC performs remarkably well in most conditions; specifically, the BIC is recommended as the fit index for selecting the correct Q-matrix when the sample size is large, the test length is long (e.g., J = 28), and the attribute correlation is small (e.g., ρ = .4). Otherwise, the AIC can be an alternative for selecting the correct Q-matrix. Another finding is that the saturated model can be used to fit the data in order to select the correct Q-matrix when the true CDM is unknown. Chen et al. (2013) used the reduced A-CDM and DINA models as generating models; this study used the saturated G-DINA model as the generating model. Combining the results of both studies, the selection rates of the BIC for detecting the correct Q-matrix remained high when the saturated model was fit, independent of the generating model. A longer test, a larger sample size and a smaller attribute correlation all had positive effects on the detection of the correct Q-matrix, and the increase in test length improved the selection rates more noticeably than the increase in sample size. Also, the
effects of these factors had more influence on the selection of the Q-matrix in the G-DINA model and the A-CDM than in the DINA model.

If the candidate models involve both CDM and Q-matrix misspecification, the rationale for choosing among the relative fit indices is similar to that for detecting the Q-matrix misspecification alone. The AIC has the most stable selection rates across all conditions, maintaining relatively high rates ranging from .863 to .957. The selection rates of the BIC can reach 1 under the optimal conditions of a larger sample, a longer test, and smaller attribute correlations. The CAIC had very similar selection patterns to the BIC in all conditions, but its performance was not as good. The effects of the three factors were similar to the other two situations: a larger sample size, a longer test, and a smaller attribute correlation are all helpful for detecting the correct model.

The examinees' classification accuracy was examined under the influence of the CDM misspecification, Q-matrix misspecification, number of respondents, test length and attribute correlation. The correct classification rates showed relatively small variation across the CDM misspecifications, although the G-DINA model still maintained the highest level of classification accuracy. The Q-matrix misspecification affected the classification accuracy more markedly, because the Q-matrix reflects the loading structure of the multidimensional model. As expected, the true Q-matrix yielded the most accurate classification. The misspecified Q-matrix qm was the most problematic: although qm had a similar number of altered attributes to the other types of misspecified Q-matrices, it contained all types of misspecification and thus represented the most severe case. The qm randomly
selected 2 items from the 1-attribute items, 3 items from the 2-attribute items, and 2 items from the 3-attribute items; meanwhile, the other types of misspecification altered only one type of item. If we further check the classification at the latent-class level, we can see that the class-specific classification accuracy is associated with the different types of Q-matrix misspecification. In particular, the deletion of certain attribute combinations lowered the correct classification rates of the corresponding latent classes (e.g., qu2).

Moreover, the effects of the differently specified Q-matrices on classification accuracy varied across the three CDMs. This may be due in part to their different features: the saturated G-DINA model contains all the main effects and interactions, the A-CDM contains the main effects only, and the DINA includes only the highest-order interaction. For example, the under-specified Q-matrix (qu2) lowered the classification rates more strongly in the G-DINA and the A-CDM, whereas the over-specified Q-matrix (qo1) influenced the DINA model more.

Besides the CDM and Q-matrix, the number of respondents and the test length both showed clear positive effects on classification accuracy; in particular, increasing the test length improves the classification accuracy more dramatically than increasing the sample size. The attribute correlation showed opposite effects on model fit and classification accuracy, but neither effect was prominent: when the attribute correlation increased, the model fit decreased slightly and the classification accuracy increased slightly.

Regardless of the different types of CDMs and Q-matrices, it was noteworthy that examinees in the latent classes with more attributes had higher classification accuracy, and examinees in the latent classes with fewer attributes could not be
classified accurately. This is an important consideration in practice when applying these CDMs to identify the mastery and non-mastery of multiple attributes, especially for the examinees at the lower end. A similar pattern of class-specific classification accuracy was shown in another study (Rupp & Templin, 2008a), which is the only study I found that examined the classification accuracy at the latent-class level. Rupp and Templin (2008a) generated and analyzed the data using the DINA model only, and involved different types of Q-matrix misspecification. They found that the attribute class with all attributes present (e.g., attribute pattern 1111) almost never showed any misclassification; the attribute class with all attributes absent (e.g., attribute pattern 0000) had low misclassification rates as well. To address the possible reasons for this phenomenon, future research should examine the impact of item difficulty and the distribution of attribute patterns.

The results for classification and model fit are consistent in that the models with better model fit usually have higher classification accuracy. For example, the generating G-DINA model with the true Q-matrix had both better fit and better classification accuracy, whereas the misspecified Q-matrices qm and qu2 had worse model fit and lower classification rates. This is important for practical cognitive diagnosis analysis: because the true generating condition is unknown, we can infer that the classification is more accurate when we achieve better model fit.

Implications of the Study

This study contributes to a better understanding of the effects of CDM and Q-matrix misspecification on model fit and classification accuracy. The different factors, such as the number of test items, the number of examinees and the attribute correlation, all have an impact on model fit and classification accuracy. Although we do not yet
perfectly understand the behavior of CDMs, this study has clearly shown that a comprehensive decision-making process is needed for model selection and classification evaluation. Both absolute and relative fit indices can be used conjunctively to select the correct CDM and Q-matrix when the generating model is the saturated model. The choice of the relative fit indices depends on whether the competing models involve CDM and/or Q-matrix misspecification, as well as on conditions such as the sample size, test length and attribute correlations. The classification accuracy is related to the degree of model fit and to these conditions. The specification of the Q-matrix and the CDM plays a critical role in achieving better classification accuracy, and increasing the test length is also essential for improving it.

This study also has important practical implications. The G-DINA model offers a more flexible framework for cognitive diagnosis analysis. The computational time is acceptable, and the CDM package in R is fairly straightforward for applied data analysis, which should encourage researchers to apply this approach in practical CDM analyses.

Future Directions and Conclusion

Although this study found some promising results, additional work is needed to further understand the behavior of CDMs. First, this study used only one absolute fit index, the RMSEA; other absolute fit indices can be assessed in future studies. Second, item-level fit indices can be considered in conjunction with model-level fit indices to improve the judgment of fit. Third, besides the overall and class-specific classification rates, other types of classification accuracy, such as the marginal correct classification rate for each attribute, can be investigated. Fourth, the estimation of the attribute patterns in this study used the MLE method, and other estimation
methods, such as MAP, can be evaluated as well. Fifth, the evaluation of the parameter estimates in the saturated G-DINA model would be interesting to investigate, since this is a relatively new and complex model.

Finding the precise sources affecting model fit and classification accuracy remains challenging, given all the different components required for a cognitive diagnosis analysis. However, this study has shown that the conjunction of different fit indices can provide a viable detection of CDM and Q-matrix misspecification in various conditions when the saturated model is the generating model. The CDMs with better model fit usually provide more accurate classifications. The Q-matrix for a test should be carefully specified and selected during the CDM analysis because of its dramatic impact on model fit and classification. More test items and more examinees taking the test help to achieve better model fit and classification accuracy; in particular, more test items have a greater impact than more examinees. In practice, a good test design with sufficient items and a careful specification of the Q-matrix play a critical role in achieving accurate CDM estimation.
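The overall and class-specific classification rates discussed throughout this chapter (and reported in Tables 3-5 and 3-6) can be computed from the true and estimated attribute patterns as follows. This is a minimal, self-contained sketch rather than the routine actually used in the simulations, and the two-attribute example patterns are hypothetical.

```python
def classification_rates(true_patterns, est_patterns):
    """Overall rate: the proportion of examinees whose whole estimated
    attribute pattern matches the true one.  Class-specific rate: the
    same proportion computed within each true latent class."""
    matches = [t == e for t, e in zip(true_patterns, est_patterns)]
    overall = sum(matches) / len(matches)
    per_class = {}
    for pattern in sorted(set(true_patterns)):
        idx = [i for i, t in enumerate(true_patterns) if t == pattern]
        per_class[pattern] = sum(matches[i] for i in idx) / len(idx)
    return overall, per_class

# Hypothetical 2-attribute example: four examinees, one misclassified.
true = ["00", "00", "11", "11"]
est = ["00", "01", "11", "11"]
overall, per_class = classification_rates(true, est)
# overall = 0.75; per_class = {"00": 0.5, "11": 1.0}
```

Note that the class-specific rates condition on the true latent class, which is why a single misclassified non-master lowers the "00" rate while leaving the "11" rate untouched.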
APPENDIX
REVIEW OF SIMULATION STUDIES

Table A-1. Review of the Conditions in Simulation Studies

Chen et al. (2013): 500 and 1000 examinees, 500 replications; 15 or 30 items; 5 attributes; generating models DINA and A-CDM; DINA parameters of .10 and .90.

Chen (2009): 2000 examinees; 36 items; 6 attributes; parameters ~ (.07, .06) and ~ (.58, .14).

Cui et al. (2012): 100, 500 and 1000 examinees, 2000 replications; 20 items; 3, 5 or 8 attributes; DINA; s, g ~ N(.10, .05) (negative values replaced with .01) and N(.25, .05); higher-order attribute structure.

DeCarlo (2012): 1000 examinees, 20 replications; 15 items; 4 attributes; RDINA; higher-order attribute structure.

de la Torre (2008): 5000 examinees; 30 items; 5 attributes; DINA; s = g = .20; equal probability for all attribute patterns.

de la Torre (2009): 2000 examinees, 100 replications; 30 items; 5 attributes; DINA; s = g = .20; higher-order attribute structure.

de la Torre and Douglas (2004): 1000 examinees, 25 replications; 30 items; 5 attributes; DINA and LLM; ~ N(0, 1); higher-order attribute structure.

de la Torre and Lee (2010): 1000 examinees, 100 replications; 20 items; 5 attributes; DINA; s = g = .10; higher-order and equal-probability structures.

Feng et al. (2013): 3000 examinees, 100 replications; 30 items; 4 attributes; RUM; parameters chosen to behave comparably to s and g in de la Torre (2009).

Henson and Douglas (2005): 10000 examinees, 2000 replications; 20 items; 4 or 8 attributes; DINA and RUM; s, g ~ U(.05, .40); multivariate normal attribute structure.

Huebner and Wang (2011): 5000 examinees, 25 replications; 15 or 63 items; 4 or 6 attributes; DINA; s, g ~ U(.05, .30) and U(.20, .45).

Kunina-Habenicht et al. (2012): 1000 and 10000 examinees, 150 replications; 25 or 50 items; 3 or 5 attributes; log-linear CDM; attribute correlations of .5 vs. .8 (same for all pairs of dimensions).

Liu, Douglas and Henson (2009): 2200 examinees (1800 normal and 400 aberrant); 30, 45, 60 or 90 items; 5 attributes; DINA; s, g ~ U(0, .30); equal probability for all attribute patterns.

Rupp and Templin (2008a): 10000 examinees, 1 replication; 15 items; 4 attributes; DINA; parameters ~ (.00, .25) and ~ (.00, .15); higher-order attribute structure.

Shu et al. (2013): 1000, 500, 200, 150, 50 and 20 examinees, 25 replications; 4 or 6 attributes; DINA; s, g ~ U(.10, .20) and U(.20, .40); higher-order attribute structure.
LIST OF REFERENCES

Akaike, H. (1987). Factor analysis and AIC. Psychometrika, 52, 317-332.

Basokcu, T. O. (2014). Classification accuracy effects of Q-matrix validation and sample size in DINA and G-DINA models. Journal of Education and Practice, 5(6), 220-230.

Basokcu, T. O., Ogretmen, T., & Kelecioglu, H. (2013). Model data fit comparison between DINA and G-DINA in cognitive diagnostic models. Education Journal, 2(6), 256-262.

Bozdogan, H. (1987). Model selection and Akaike's information criterion (AIC): The general theory and its analytical extensions. Psychometrika, 52, 345-370.

Chen, J., de la Torre, J., & Zhang, Z. (2013). Relative and absolute fit evaluation in cognitive diagnosis modeling. Journal of Educational Measurement, 50(2), 123-140.

Cheng, Y. (2009). When cognitive diagnosis meets computerized adaptive testing: CD-CAT. Psychometrika, 74(4), 619-632.

Choi, H. J., Templin, J. L., Cohen, A. S., & Atwood, C. H. (2010, April). The impact of model misspecification on estimation accuracy in diagnostic classification models. Paper presented at the meeting of the National Council on Measurement in Education (NCME), Denver, CO.

Cui, Y., Gierl, M. J., & Chang, H. H. (2012). Estimating classification consistency and accuracy for cognitive diagnostic assessment. Journal of Educational Measurement, 49(1), 19-38.

DeCarlo, L. T. (2011). On the analysis of fraction subtraction data: The DINA model, classification, latent class sizes, and the Q-matrix. Applied Psychological Measurement, 35(1), 8-26.

DeCarlo, L. T. (2012). Recognizing uncertainty in the Q-matrix via a Bayesian extension of the DINA model. Applied Psychological Measurement, 36(6), 447-468.

de la Torre, J. (2008). An empirically based method of Q-matrix validation for the DINA model: Development and applications. Journal of Educational Measurement, 45(4), 343-362.

de la Torre, J. (2009). DINA model and parameter estimation: A didactic. Journal of Educational and Behavioral Statistics, 34, 115-130.
de la Torre, J., & Douglas, J. A. (2004). Higher-order latent trait models for cognitive diagnosis. Psychometrika, 69(3), 333-353.

de la Torre, J., & Douglas, J. A. (2008). Model evaluation and multiple strategies in cognitive diagnosis: An analysis of fraction subtraction data. Psychometrika, 73(4), 595-624.

de la Torre, J., & Lee, Y. S. (2010). A note on the invariance of the DINA model parameters. Journal of Educational Measurement, 47(1), 115-127.

DiBello, L. V., Stout, W. F., & Roussos, L. A. (1995). Unified cognitive/psychometric diagnostic assessment likelihood-based classification techniques. Cognitively diagnostic assessment, 361-389.

Gardner, W., Kelleher, K. J., & Pajer, K. A. (2002). Multidimensional adaptive testing for mental health problems in primary care. Medical Care, 40(9), 812-823.

Henson, R., & Douglas, J. (2005). Test construction for cognitive diagnosis. Applied Psychological Measurement, 29(4), 262-277.

Henson, R., Roussos, L., Douglas, J., & He, X. (2008). Cognitive diagnostic attribute-level discrimination indices. Applied Psychological Measurement, 32(4), 275-288.

Henson, R., Templin, J., & Douglas, J. (2007). Using efficient model-based sum-scores for conducting skills diagnoses. Journal of Educational Measurement, 44(4), 361-376.

Henson, R. A., Templin, J. L., & Willse, J. T. (2009). Defining a family of cognitive diagnosis models using log-linear models with latent variables. Psychometrika, 74(2), 191-210.

Hu, L. T., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling: A Multidisciplinary Journal, 6(1), 1-55.

Huebner, A., & Wang, C. (2011). A note on comparing examinee classification methods for cognitive diagnosis models. Educational and Psychological Measurement, 71(2), 407-419.

Junker, B. W., & Sijtsma, K. (2001).
Cognitive assessment models with few assumptions, and connections with nonparametric item response theory. Applied Psychological Measurement, 25(3), 258-272.

Kunina-Habenicht, O., Rupp, A. A., & Wilhelm, O. (2009). A practical illustration of multidimensional diagnostic skills profiling: Comparing results from confirmatory
factor analysis and diagnostic classification models. Studies in Educational Evaluation, 35(2), 64-70.

Kunina-Habenicht, O., Rupp, A. A., & Wilhelm, O. (2012). The impact of model misspecification on parameter estimation and item-fit assessment in log-linear diagnostic classification models. Journal of Educational Measurement, 49, 59-81.

Leighton, J. P., Gierl, M. J., & Hunka, S. M. (2004). The attribute hierarchy method for cognitive assessment: A variation on Tatsuoka's rule-space approach. Journal of Educational Measurement, 41(3), 205-237.

Liu, Y., Douglas, J. A., & Henson, R. A. (2009). Testing person fit in cognitive diagnosis. Applied Psychological Measurement, 33(8), 579-598.

Maris, E. (1995). Psychometric latent response models. Psychometrika, 60(4), 523-547.

Maris, E. (1999). Estimating multiple classification latent class models. Psychometrika, 64(2), 187-212.

McDonald, R. P., & Mok, M. M. C. (1995). Goodness of fit in item response models. Multivariate Behavioral Research, 30(1), 23-40.

Mislevy, R. J. (2006). Cognitive psychology and educational assessment. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 257-305). Washington, DC: American Council on Education.

Roberts, M. R., & Gierl, M. J. (2010). Developing score reports for cognitive diagnostic assessments. Educational Measurement: Issues and Practice, 29(3), 25-38.

Rupp, A. A., & Templin, J. (2008a). The effects of Q-matrix misspecification on parameter estimates and classification accuracy in the DINA model. Educational and Psychological Measurement, 68(1), 78-96.

Rupp, A. A., & Templin, J. L. (2008b). Unique characteristics of diagnostic classification models: A comprehensive review of the current state-of-the-art. Measurement, 6(4), 219-262.

Rupp, A. A., Templin, J., & Henson, R. A. (2012). Diagnostic measurement: Theory, methods, and applications. Guilford Press.

Shu, Z., Henson, R., & Willse, J. (2013).
Using neural network analysis to define methods of DINA model estimation for small sample sizes. Journal of Classification, 30(2), 173-194.

Steiger, J. H. (1990). Structural model evaluation and modification: An interval estimation approach. Multivariate Behavioral Research, 25, 173-180.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2), 461-464.

Tatsuoka, K. (1990). Toward an integration of item response theory and cognitive error diagnosis. In N. Frederiksen, R. Glaser, A. Lesgold, & M. Shafto (Eds.), Monitoring skills and knowledge acquisition (pp. 453-488). Hillsdale, NJ: Erlbaum.

Tatsuoka, C. (2002). Data analytic methods for latent partially ordered classification models. Journal of the Royal Statistical Society: Series C (Applied Statistics), 51(3), 337-350.

Templin, J. L., & Bradshaw, L. (2014). Hierarchical diagnostic classification models: A family of models for estimating and testing attribute hierarchies. Psychometrika, 79(2), 317-339.

Templin, J. L., & Henson, R. A. (2006). Measurement of psychological disorders using cognitive diagnosis models. Psychological Methods, 11(3), 287-305.

Templin, J. L., Henson, R. A., Templin, S. E., & Roussos, L. (2008). Robustness of hierarchical modeling of skill association in cognitive diagnosis models. Applied Psychological Measurement, 32(7), 559-574.

von Davier, M. (2006). Multidimensional latent trait modeling (MDLTM) [Software program]. Princeton, NJ: Educational Testing Service.

von Davier, M. (2008). A general diagnostic model applied to language testing data. British Journal of Mathematical and Statistical Psychology, 61(2), 287-307.

von Davier, M. (2010). Hierarchical mixtures of diagnostic models. Psychology Science Quarterly, 52(1), 8-28.
BIOGRAPHICAL SKETCH

Miao Gao was born in Nanjing, China. She graduated from Nanjing Normal University in 2007 with a bachelor's degree in biological education. After that she moved to Australia and obtained a Master of Arts in education from Murdoch University in 2008. She then entered graduate school at Central Washington University in the United States and obtained a Master of Science degree with a major in science education in 2011. In the same year, she began her doctoral studies in the Department of Research and Evaluation Methodology at the University of Florida.