
Material Information

Title:
Equating Using Unidimensional Dichotomous and Polytomous IRT Models for Testlet-Based Tests under Common-Item Nonequivalent Groups Design
Physical Description:
1 online resource (126 p.)
Language:
english
Creator:
Zhang, Lidong
Publisher:
University of Florida
Place of Publication:
Gainesville, Fla.
Publication Date:

Thesis/Dissertation Information

Degree:
Doctorate ( Ph.D.)
Degree Grantor:
University of Florida
Degree Disciplines:
Research and Evaluation Methodology, Human Development and Organizational Studies in Education
Committee Chair:
Miller, David
Committee Members:
Leite, Walter Lana
Huggins, Anne Corinne
Algina, James J
Jacobbe, Timothy

Subjects

Subjects / Keywords:
equating -- irt -- testlet -- unidimensional
Human Development and Organizational Studies in Education -- Dissertations, Academic -- UF
Genre:
Research and Evaluation Methodology thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract:
The relative equating performance of the Graded Response Model (GRM) and the Generalized Partial Credit (GPC) model was compared with that of the two-parameter logistic (2PL) model using simulated testlet data under a common-item nonequivalent groups design. The impacts of various levels of testlet effects, calibration procedures, group differences, number of common items, and sample size were investigated. Three traditional linear equating methods were used as criteria for the IRT true score equating and IRT observed score equating results from the three item response theory models. In general, the two polytomous models yielded equating results that were more compatible with those of the traditional equating methods in the presence of testlet effects. Even in some conditions without testlet effects, the equating performance of the two polytomous models was more similar to that of the traditional methods than was that of the dichotomous 2PL model, particularly when the number of common items was larger. Of the two polytomous models, the GRM was found to render results in more agreement with those of the traditional linear methods in conditions of separate calibration with linking. The characteristic curve linking methods outperformed the moment methods in a majority of conditions. The separate calibration procedures were better than the concurrent calibration procedure in most of the conditions, especially when the number of common items was small.
General Note:
In the series University of Florida Digital Collections.
General Note:
Includes vita.
Bibliography:
Includes bibliographical references.
Source of Description:
Description based on online resource; title from PDF title page.
Source of Description:
This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility:
by Lidong Zhang.
Thesis:
Thesis (Ph.D.)--University of Florida, 2013.
Local:
Adviser: Miller, David.
Electronic Access:
RESTRICTED TO UF STUDENTS, STAFF, FACULTY, AND ON-CAMPUS USE UNTIL 2015-08-31

Record Information

Source Institution:
UFRGP
Rights Management:
Applicable rights reserved.
Classification:
lcc - LD1780 2013
System ID:
UFE0045728:00001





Full Text

EQUATING USING UNIDIMENSIONAL DICHOTOMOUS AND POLYTOMOUS IRT MODELS FOR TESTLET-BASED TESTS UNDER COMMON-ITEM NONEQUIVALENT GROUPS DESIGN

By

LIDONG ZHANG

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2013

2013 Lidong Zhang

To my family

ACKNOWLEDGMENTS

My sincere thanks go to my advisor and dissertation committee chair, Dr. David Miller, for his thoughtful guidance throughout each stage of my doctoral study. I am grateful for the research assistantship that he provided through CAPES and the valuable learning opportunities he offered on program evaluation. Without his continued support, I could not have succeeded in my pursuit of a doctoral degree. Thanks also go to Dr. James Algina, Dr. Walter Leite, Dr. Anne Corinne Huggins, and Dr. Tim Jacobbe, members of my dissertation committee, for their insightful suggestions on improving this study. I would like to express my gratitude to Dr. Algina and Dr. Leite for the academic training I obtained through doing my research project with them. To Dr. Huggins, I am thankful for her reviews of my proposal and dissertation, and the instructive discussions we had while I was doing my dissertation. I would like to thank Dr. Austin for proofreading my dissertation and encouraging me all the time. My thanks also go to my mom, for her timely help always. Thank God for giving me strength to persevere.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF ABBREVIATIONS
ABSTRACT

CHAPTER

1 INTRODUCTION

2 LITERATURE REVIEW
    Desirable Properties of Testlet
    Local Dependence and Its Modeling
    Equating and Linking
    IRT Linking and Equating for Testlet-Based Tests
    Comparison of Separate and Concurrent Estimation
    Separate and Concurrent Calibration under the Unidimensional IRT Models for Testlet Data
    Research Questions

3 METHOD
    Factors Manipulated
        Calibration Models
            The 2PL model
            Graded response model (GRM)
            Generalized partial credit (GPC) model
            Examinee group ability distribution
        Variance of Testlet Effect
        Number of Examinees
        Number of Common Testlets
        Calibration Methods
            Mean/mean and mean/sigma methods for polytomous models
            Characteristic curve methods
            Concurrent calibration
    Parameters Holding Constant
    Evaluation Criteria
        Tucker Observed Score Equating Method
        Levine Observed Score Equating Method
        Levine True Score Method
        IRT True Score Equating
        IRT Observed Score Equating
        Evaluation Indices for Equating Results
    Data Generation and Analysis Procedure

4 RESULTS
    Proper Solution Rates of the Three IRT Models
    Missing Category Rate of the Two Polytomous IRT Models
    Influence of Factors of Interest
    ANOVAs for IRT True Score Equating
        Mean URMSD for IRT True Score Equating
        Mean WRMSD of IRT True Score Equating
    Summary of the Results

5 DISCUSSION AND CONCLUSION
    Factors of Larger Impact
        Number of Common Items
        Model
        Variance of Testlet Factor
    Impacts of Other Factors
        Linking Method
        Group Mean Difference and Sample Size
    Conclusion
    Limitations and Future Studies

APPENDIX: MEAN URMSD AND WRMSD FOR IRT OBSERVED SCORE EQUATING
LIST OF REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

3-1 Parameters and manipulated factors for simulation
4-1 Proper solution rate with the three models using separate calibration
4-2 Convergence rate for three models using concurrent calibration
4-3 Rate of missing categories using separate calibration
4-4 Mean URMSD from 2PL model for IRT true score equating in two common testlets condition
4-5 Mean URMSD from GRM model for IRT true score equating in two common testlets condition
4-6 Mean URMSD from GPC model for IRT true score equating in two common testlets condition
4-7 Mean URMSD from 2PL model for IRT true score equating in four common testlets condition
4-8 Mean URMSD from GRM model for IRT true score equating in four common testlets condition
4-9 Mean URMSD from GPC model for IRT true score equating in four common testlets condition
4-10 Mean WRMSD from 2PL model for IRT true score equating in two common testlets condition
4-11 Mean WRMSD from GRM model for IRT true score equating in two common testlets condition
4-12 Mean WRMSD from GPC model for IRT true score equating in two common testlets condition
4-13 Mean WRMSD from 2PL model for IRT true score equating in four common testlets condition
4-14 Mean WRMSD from GRM model for IRT true score equating in four common testlets condition
4-15 Mean WRMSD from GPC model for IRT true score equating in four common testlets condition
5-1 Contrast of mean URMSD in C2 and C4 conditions
A-1 Mean URMSD from 2PL model for IRT observed score equating in two common testlets condition
A-2 Mean URMSD from GRM model for IRT observed score equating in two common testlets condition
A-3 Mean URMSD from GPC model for IRT observed score equating in two common testlets condition
A-4 Mean URMSD from 2PL model for IRT observed score equating in four common testlets condition
A-5 Mean URMSD from GRM model for IRT observed score equating in four common testlets condition
A-6 Mean URMSD from GPC model for IRT observed score equating in four common testlets condition
A-7 Mean WRMSD from 2PL model for IRT observed score equating in two common testlets condition
A-8 Mean WRMSD from GRM model for IRT observed score equating in two common testlets condition
A-9 Mean WRMSD from GPC model for IRT observed score equating in two common testlets condition
A-10 Mean WRMSD from 2PL model for IRT observed score equating in four common testlets condition
A-11 Mean WRMSD from GRM model for IRT observed score equating in four common testlets condition
A-12 Mean WRMSD from GPC model for IRT observed score equating in four common testlets condition

LIST OF ABBREVIATIONS

2PL     Two-parameter logistic
C2      Two common testlets
C4      Four common testlets
GRM     Graded Response Model
GPC     Generalized Partial Credit
HB      Haebara
IRT     Item Response Theory
IRT TE  Item Response Theory True Score Equating
IRT OE  Item Response Theory Observed Score Equating
LOE     Levine observed score equating
LTE     Levine true score equating
MM      Mean/Mean
MS      Mean/Sigma
SL      Stocking & Lord
TOE     Tucker observed score equating
URMSD   Unweighted Root Mean Squared Difference
WRMSD   Weighted Root Mean Squared Difference

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

EQUATING USING UNIDIMENSIONAL DICHOTOMOUS AND POLYTOMOUS IRT MODELS FOR TESTLET-BASED TESTS UNDER COMMON-ITEM NONEQUIVALENT GROUPS DESIGN

By

Lidong Zhang

August 2013

Chair: David Miller
Major: Research and Evaluation Methodology

The relative equating performance of the Graded Response Model (GRM) and the Generalized Partial Credit (GPC) model was compared with that of the two-parameter logistic (2PL) model using simulated testlet data under a common-item nonequivalent groups design. The impacts of various levels of testlet effects, calibration procedures, group differences, number of common items, and sample size were investigated. Three traditional linear equating methods were used as criteria for the IRT true score equating and IRT observed score equating results from the three item response theory models. In general, the two polytomous models yielded equating results that were more compatible with those of the traditional equating methods in the presence of testlet effects. Even in some conditions without testlet effects, the equating performance of the two polytomous models was more similar to that of the traditional methods than was that of the dichotomous 2PL model, particularly when the number of common items was larger. Of the two polytomous models, the GRM was found to render results in more agreement with those of the traditional linear methods in conditions of separate calibration with linking. The characteristic curve linking methods outperformed the moment methods in a majority of conditions. The separate calibration procedures were better than the concurrent calibration procedure in most of the conditions, especially when the number of common items was small.

CHAPTER 1
INTRODUCTION

Testlets, in which a bundle of items is designed around a common stimulus (e.g., a reading passage, a graph, or a laboratory scenario), have emerged as a popular and desirable item format in standardized tests owing to their unique characteristics (Wainer, Bradlow, & Du, 2000). Compared with traditional single independent items, the context-dependent nature of within-testlet items makes it possible to assess more complicated or higher-level skills (DeMars, 2006). However, the context-dependent nature of the testlet items might produce a certain level of local dependence among them and thereby introduce some nuisance testlet factors in addition to the primary trait measured by the whole test. Disregarding the within-testlet item dependence and using a standard independent-item model, such as a unidimensional dichotomous model, could lead to inaccurate item and primary trait parameter estimates, and the test reliability might be inflated (DeMars, 2006; Li, 2009; Sireci, Thissen, & Wainer, 1991; Wainer, Bradlow, & Wang, 2007; Yen, 1993). Therefore, various approaches have been proposed to model the dependencies among the testlet items. Treating the whole testlet as a single polytomous item by summing up the item scores within the testlet is one approach (Cook, Dodd, & Fitzpatrick, 1999; Wainer, Bradlow, & Wang, 2007; Yen, 1993). Applying one form of multidimensional item response theory (IRT) model (e.g., the bi-factor model (Gibbons & Hedeker, 1992) or the testlet model (Bradlow, Wainer, & Wang, 1999; Wainer & Wang, 2000)) is another approach, in which specific testlet-related dimensions are incorporated in addition to the primary dimension of interest. Using either simulated or empirical data, researchers have examined the performance of these approaches for testlet-based tests with respect to item and primary trait parameter estimates, test reliability, test

information, linking, and equating. Not only did they compare the unidimensional dichotomous model with the testlets-as-polytomous-item model (Lee, Kolen, Frisbie, & Ankenmann, 2001; Wainer, 1995; Yen, 1993; Zhang, 2007), but they also compared the unidimensional dichotomous model with the testlet effect model (Bradlow, Wainer, & Wang, 1999; Li, Bolt, & Fu, 2005), or they compared the unidimensional dichotomous model, the testlets-as-polytomous-item model, and the testlet effect model altogether (DeMars, 2006; Li, 2009). These studies provided insightful information from diverse perspectives. However, the findings about the performance of the testlets-as-polytomous-item model and the unidimensional dichotomous model for modeling testlet data were not consistent with one another. Some studies found that the testlets-as-polytomous-item model tended to produce less accurate primary trait estimates compared with the unidimensional dichotomous model. For instance, in DeMars (2006), three models, i.e., the three-parameter bi-factor model, the testlet model, and the unidimensional three-parameter logistic (3PL) model, were used to simulate three types of population data. The three models are nested, with the bi-factor model as the general model and the other two as its constrained versions. Five levels of testlet magnitude were considered in the data simulated with the bi-factor model and the testlet effect model. All three models plus the generalized partial credit (GPC) model (Muraki, 1992) were fit to the three types of data generated. Model fit, reliability, and accuracy of primary trait parameter estimates were compared for all four models, and accuracy of item parameter estimates was compared for the three nested models. As for the accuracy of the primary trait estimation, this study showed that although all four models had satisfactory estimates, the GPC model produced larger bias and root mean square error at the lower end of the primary trait distribution. Thus, the GPC model was the relatively weaker one for modeling data having testlet effects. In another study, Li (2009)

compared the item parameter estimates, linking, and primary trait estimation of the 3PL testlet model, the 3PL model, and the graded response model (GRM) (Samejima, 1969) for data simulated with the 3PL testlet model. This study also found that the primary trait estimates from the three models were similar, but the GRM produced comparatively less accurate trait estimates. When it comes to linking parameter recovery and equating effectiveness under the testlets-as-polytomous-item model and the unidimensional dichotomous model, the results in the extant literature are somewhat contradictory with one another. In Li (2009), recovery of the linking parameters for the GRM was found to be better than for the 3PL model and similar to the 3PL testlet model with a medium or high level of testlet effects. Zhang (2007) simulated testlet data based on the 3PL compensatory multidimensional model. Patterns of both within- and between-testlet item dependencies were considered, and the equating effectiveness of the 3PL model and the GPC model for this type of testlet data was compared. It was found that the GPC model worked better than the 3PL model if the test was slightly or moderately multidimensional. When the test had a high degree of multidimensionality and the between-testlet item dependence was high, the equating results of the GPC model were worse than those of the 3PL model. In contrast, Lee, Kolen, Frisbie, and Ankenmann (2001) used real testlet data to compare the equating performance of the 3PL model with that of the Nominal Model (Bock, 1972) and the GRM. Their study showed that the two polytomous models outperformed the dichotomous model when the violation of local independence was more serious, whereas there was not much difference between the polytomous and the dichotomous models when the violation was slight. Across these studies, it is hard to reach a conclusion about the conditions in which the polytomous models work consistently better than the dichotomous models.

Given the inconsistent findings in previous research, the current study aimed to specifically assess the equating performance of the unidimensional dichotomous model (2PL) and the polytomous IRT models (GRM and GPC) for testlet-based tests under the common-item nonequivalent groups design. The GRM and GPC models are commonly used to model the dependencies of testlet items in existing IRT application studies (Cook, Dodd, & Fitzpatrick, 1999; DeMars, 2006; Lee, Kolen, Frisbie, & Ankenmann, 2001; Li, 2009; Sireci et al., 1991; Wainer, 1995; Yen, 1993; Zhang, 2007). Although previous research showed that the two models produced similar results for polytomous data (Maydeu-Olivares, Drasgow, & Mead, 1994; Tang & Eignor, 1997), Cook, Dodd, and Fitzpatrick (1999) found that the GRM yielded more information than the GPC model over the whole range of the ability scale when both were fit to simulated testlet data. Thus they recommended future exploration of the practical implications of this intriguing phenomenon. In the literature, few studies have examined equating results using the two polytomous models simultaneously for testlet data under the common-item nonequivalent groups design. There is an obvious need for such research. Two procedures for obtaining a common scale for the item and ability parameter estimates using IRT models were considered, i.e., separate parameter estimation and concurrent estimation. Several studies have reported the linking or equating performance of the unidimensional dichotomous model versus the polytomous model or the testlet effect model for testlet-based tests using separate calibration (Lee, Kolen, Frisbie, & Ankenmann, 2001; Li, Bolt, & Fu, 2005; Li, 2009; Zhang, 2007). Nevertheless, none of these studies compared the separate calibration with the concurrent calibration, which is an alternative procedure to obtain a common scale. An inspection of studies comparing separate versus concurrent calibration for scale transformation suggested that the concurrent procedure is more accurate than the separate

procedure if the data fit the IRT model, but not when there are violations of the IRT assumptions (Petersen, Cook, & Stocking, 1983; Wingersky, Cook, & Eignor, 1986; Kim & Cohen, 1998; Béguin, Hanson, & Glas, 2000; Hanson & Béguin, 2002; Kim & Kolen, 2006; Lee & Ban, 2010), for instance, when the data present multidimensionality. Although testlet data are a type of data with multidimensionality, little research has been done comparing the separate and concurrent procedures of IRT models for obtaining a common scale with testlet data; Li et al. (2005) also recommended future research on this issue. In view of the lack of research in this area, the current study used a Monte Carlo simulation to compare the performance of the 2PL model, the GRM, and the GPC model in obtaining a common scale for a testlet-based test using both separate calibration and concurrent calibration under the common-item nonequivalent groups design. Four linking methods for the separate calibration were employed: Mean/Sigma (Marco, 1977), Mean/Mean (Loyd & Hoover, 1980), Haebara (Haebara, 1980), and Stocking and Lord (Stocking & Lord, 1983). IRT true score equating and IRT observed score equating results for the dichotomous and the two polytomous models, with both separate and concurrent calibrations, were used to detect the equating effectiveness of the three models in the various conditions of interest.

CHAPTER 2
LITERATURE REVIEW

Desirable Properties of Testlet

In standardized tests, such as a reading test, it is fairly common to see the testlet format, in which a series of multiple-choice questions is based on the same passage. Wainer and Kiely defined a testlet as "a group of items related to a single content area that is developed as a unit" (Wainer & Kiely, 1987, p. 190). The adoption of the testlet as an item format was originally regarded by Wainer and Kiely as a possible avenue for solving some of the issues arising in the test construction and administration of Computerized Adaptive Testing (CAT). These issues were concerned with differential context effects due to item location, cross-information, and unbalanced content. And these concerns, to a certain degree, are connected with the atomistic nature of the single independent small items used in the CAT item format. In contrast, since each item within a predesigned testlet carries its own context, and each testlet can represent a content area of interest, the balance of test content is more easily achieved. In addition, the path of within-testlet items could follow either a hierarchical branching scheme, which routes test takers to successive items adaptive to their previous responses, or a linear scheme, which delivers the same number of items to all test takers. The two types of branching schemes also apply to the testlets themselves. Thus, as a unit of test construction, the testlet brought forth a flexible alternative to address the item-ordering problem associated with the algorithmic methods of test construction in CAT (Wainer & Kiely, 1987).

Apart from providing a possible solution to problems encountered in CAT test construction and development per se, the testlet format also offers an efficient way to use an extended test stimulus, such as a reading passage, a graph, or a laboratory scenario. Given the amount of time a test taker spends processing the information in such an extended stimulus, it might not be reasonable to ask a single question and collect only a limited piece of information. For instance, a 250-word passage needs four to six questions for test takers to delve into the meaning it carries, and 10 to 12 questions are considered appropriate for a longer passage (Wainer, Bradlow, & Wang, 2007). Furthermore, in the context of statewide educational assessment, the testlet format allows for measuring the higher-order thinking skills and understanding stipulated in the No Child Left Behind Act (NCLB, 2001). The interrelated property of the items within a testlet provides a better vehicle for measuring such integrated skills than a single independent item does (DeMars, 2006). Where the problems in a test must be solved in a stepwise or sequential style, it is more suitable to use context-dependent testlet items. All these desirable properties of the testlet contribute to its popularity as an item format in standardized tests.

Local Dependence and Its Modeling

The context-dependent nature of the items within a testlet facilitates coping with problems in some testing situations. Nonetheless, its application is not without concerns. When the test is scored within an IRT framework and the testlet format is used, the interrelated testlet items are likely to violate one of the basic assumptions of unidimensional item response theory, i.e., local independence. This assumption dictates that, controlling for the latent construct of interest, such as reading ability in a reading test, the responses of an examinee to the items are independent. It indicates that the correctness of an examinee answering one item does not

depend on her/his response to another item in the test. The response of an examinee to an item is solely explained by the latent construct of interest. Mathematically, this could be expressed as

$$P(U_i, U_j \mid \theta) = P(U_i \mid \theta)\, P(U_j \mid \theta), \qquad (2\text{-}1)$$

where $U_i$ and $U_j$ are the responses to two items $i$ and $j$ in a test, and $\theta$ is the latent construct. Under local independence, the probability of answering item $i$ correctly is independent of the outcome of any other item $j$. However, in the case of testlet-based tests, each testlet might introduce a new dimension, such as knowledge or motivational factors specific to the testlet (DeMars, 2006). Then, within each testlet, the items indicate not only the general latent construct that the overall test measures but also a testlet-specific latent construct that is a nuisance factor to the test user. As mentioned in Chapter 1, there are three general approaches to handling the dependencies among testlet items. One approach is to ignore the dependence and use a binary IRT model anyway. Another approach is to treat each individual testlet as a polytomously scored item and use one of the polytomous IRT models, so that the local independence assumption is met at the testlet level. Still another approach incorporates the additional dimension introduced by each testlet and uses some sort of multidimensional IRT model or its equivalents. The impacts of the three approaches on latent trait and item parameter estimation, test reliability, and test information have been well studied by quite a few researchers (Bradlow, Wainer, & Wang, 1999; Cook, Dodd, & Fitzpatrick, 1999; DeMars, 2006; Sireci, Thissen, & Wainer, 1991). Nevertheless, there is a paucity of research investigating the impact of these approaches on equating for testlet-based tests.
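To make Equation 2-1 above concrete, here is a minimal Python sketch. The item parameters and the examinee ability are hypothetical values chosen only for illustration, and the 1.7 scaling constant is the conventional normal-ogive approximation rather than a value from this study.

```python
import numpy as np

def p_2pl(theta, a, b, D=1.7):
    """Probability of a correct response under the 2PL model."""
    return 1.0 / (1.0 + np.exp(-D * a * (theta - b)))

# Hypothetical parameters for two items and one examinee.
theta = 0.5
p_i = p_2pl(theta, a=1.2, b=-0.3)
p_j = p_2pl(theta, a=0.8, b=0.4)

# Equation 2-1: conditional on theta, the joint probability of answering
# both items correctly is the product of the item-level probabilities.
p_both = p_i * p_j
print(f"P(U_i=1|theta) = {p_i:.3f}, P(U_j=1|theta) = {p_j:.3f}, "
      f"P(U_i=1, U_j=1|theta) = {p_both:.3f}")
```

A testlet effect breaks exactly this factorization: items sharing a stimulus remain correlated even after the primary trait is held fixed.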

Equating and Linking

Many testing programs, such as the TOEFL (Test of English as a Foreign Language) or the GRE (Graduate Record Examinations), administer multiple forms of the test on multiple occasions, with each form containing a different set of questions. Doing so serves test security on the one hand and flexibility of test date options on the other. Although each test form is supposed to be constructed to be as similar as possible in content and statistical specifications, there still might be differences in difficulty among them. Thus, equating serves as a statistical means to adjust the scores on these alternate forms so that they can be used interchangeably (Kolen & Brennan, 2004). In this way, the comparability of scores from multiple forms of the same test is achieved. All equating methods consist of at least two elements: an equating design and a statistical procedure to analyze the data (Petersen, Cook, & Stocking, 1983). There are various options for data collection designs in equating, and the three commonly used ones are: 1) the random groups design; 2) the single group design; and 3) the common-item nonequivalent groups design. In the random groups design, examinees are randomly assigned one alternate test form to be administered. A spiraling process could be used to alternate the test forms, which results in randomly equivalent groups taking each of the forms administered. In the single group design, each examinee takes two test forms, and counterbalancing the order of administering the forms can be used to handle order effects. The common-item nonequivalent groups design involves using anchor items across test forms taken by groups of examinees with nonequivalent ability distributions. Anchor items are a set of common items representative of the total test in both content and statistical characteristics, and scores on the forms are equated by means of their relations to scores on the anchor items. Because this design administers one test form per test administration, it is widely used in many testing programs (Lee & Ban, 2010).

If the test is scored using item response theory methods, with responses modeled at the item level instead of the total test score level, equating generally involves three steps (Kolen & Brennan, 2004). First, estimate item parameters based on a corresponding IRT model. The item parameters for different test forms could be estimated separately by different runs of the estimation software, which is called separate calibration. Alternatively, multiple forms can be calibrated together as one large data set to obtain the item parameter estimates from a single run of the estimation software, which is called concurrent calibration. Second, if separate calibration is employed to obtain item parameter estimates for test forms based on examinees from different populations, then, due to the scale indeterminacy property, an IRT linking procedure should be used to transform the parameters of a target form to the scale of a base form. For example, the estimation software BILOG (Mislevy & Bock, 1990) sets the ability distribution for a given group of examinees as standard normal with a mean of 0 and a standard deviation of 1 to resolve the issue of scale indeterminacy. In the same manner, the ability distribution of another group will be defined as having a mean of 0 and a standard deviation of 1 as well, regardless of the fact that the two groups have different abilities. So an IRT linking procedure is needed to transform the ability parameter and item parameter estimates from different groups onto a common metric. The ability parameter estimates and item parameter estimates of IRT models from separate estimation procedures are only invariant up to a linear transformation. That is to say, when an IRT model fits a data set, then any linear transformation of the ability scale fits the same data, provided that the item parameters are transformed accordingly. For instance, in a two-parameter logistic (2PL) IRT model, the latent trait values for the two scales are related as follows (Kolen & Brennan, 2004):

$$\theta_{Ji} = A\,\theta_{Ii} + B, \qquad (2\text{-}2)$$

where $\theta_{Ji}$ and $\theta_{Ii}$ are the values of θ for individual i on Scale J and Scale I, and the item parameters are transformed as $a_{Jj} = a_{Ij}/A$ and $b_{Jj} = A\,b_{Ij} + B$. In this manner, parameter estimates on Form I are put on the scale of Form J using A and B. This linear relationship ensures that the 2PL function yields an identical probability of a correct response to item j using either set of parameter estimates. Therefore, computing the linking parameters A and B is the focus of IRT separate calibration. If the groups taking different test forms are equivalent in ability distribution, as in the random groups design or the single group design, IRT linking is not necessary. Nevertheless, linking for equivalent groups might adjust for minor differences in the latent variable scale between the groups taking the two forms caused by sampling error, which in turn reduces estimation error (Hanson & Béguin, 2002). Linking in the current study is operationally defined as the procedure used to put the IRT model parameter estimates of the forms to be equated on a common scale (Lee & Ban, 2010). There are several commonly used IRT linking methods to calculate the linking parameters A and B in the common-item nonequivalent groups design, as described in Kolen and Brennan (2004): Mean/Mean (Loyd & Hoover, 1980), Mean/Sigma (Marco, 1977), Haebara (1980), and Stocking/Lord (Stocking & Lord, 1983). For the unidimensional IRT model, the Mean/Mean method uses the means of the a parameter estimates and b parameter estimates from the common items of the two test forms to estimate the linking constants A and B, respectively, which can be expressed as

$$A = \frac{\mu(\hat{a}_I)}{\mu(\hat{a}_J)}, \qquad (2\text{-}3)$$

$$B = \mu(\hat{b}_J) - A\,\mu(\hat{b}_I). \qquad (2\text{-}4)$$
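As an illustration of Equations 2-2 through 2-4 (and of the Mean/Sigma variant given in Equation 2-5 below), here is a minimal Python sketch; the common-item parameter estimates are hypothetical numbers chosen only for demonstration, and the use of sample standard deviations is one possible convention.

```python
import numpy as np

def mean_mean(a_I, b_I, a_J, b_J):
    """Mean/Mean linking constants (Equations 2-3 and 2-4)."""
    A = np.mean(a_I) / np.mean(a_J)
    B = np.mean(b_J) - A * np.mean(b_I)
    return A, B

def mean_sigma(b_I, b_J):
    """Mean/Sigma linking constants (Equations 2-5 and 2-4)."""
    A = np.std(b_J, ddof=1) / np.std(b_I, ddof=1)
    B = np.mean(b_J) - A * np.mean(b_I)
    return A, B

# Hypothetical common-item estimates on the target (I) and base (J) scales.
a_I = np.array([1.10, 0.85, 1.30, 0.70]); b_I = np.array([-0.50, 0.20, 0.80, -1.00])
a_J = np.array([1.00, 0.80, 1.20, 0.65]); b_J = np.array([-0.30, 0.50, 1.15, -0.85])

A, B = mean_mean(a_I, b_I, a_J, b_J)

# Equation 2-2 and its item-parameter counterparts: rescale all Form I
# estimates (not just the common items) onto the Form J scale.
a_rescaled = a_I / A
b_rescaled = A * b_I + B
```

Because the moment methods use only the means and standard deviations of the common-item estimates, they are simple to compute but can be sensitive to a single outlying item estimate, which is part of the motivation for the characteristic curve methods described next.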

The Mean/Sigma method estimates the A and B constants by using the standard deviations and means of the b parameter estimates from the common items of the two test forms. The relationship is as follows:

$$A = \frac{\sigma(\hat{b}_J)}{\sigma(\hat{b}_I)}. \qquad (2\text{-}5)$$

The B constant is then calculated using Equation 2-4. The Haebara and Stocking-Lord methods use the item characteristic curves and test characteristic curves, respectively, to obtain the A and B constants. In IRT applications, the scale of the latent trait is unique only up to a linear transformation as well. This scale indeterminacy indicates that, for two latent ability Scales I and J (Kolen & Brennan, 2004),

$$p_{ij}\!\left(\theta_{Ji};\, a_{Jj},\, b_{Jj}\right) = p_{ij}\!\left(A\,\theta_{Ii} + B;\; a_{Ij}/A,\; A\,b_{Ij} + B\right). \qquad (2\text{-}6)$$

For person i and item j, the probability that person i with ability $\theta_i$ will answer item j correctly is identical regardless of the scale used to report the score. However, if the parameters are replaced by their estimates, then Equation 2-6 does not necessarily hold over all persons and items for any combination of A and B. That is to say, there will be a difference between the two probability estimates. The two characteristic curve methods estimate the A and B linking constants by minimizing this difference. Specifically, the Haebara method sums the squared differences between the item characteristic curves of each item for examinees of a particular ability; thus, it is also called the item characteristic curve method. For any given ability $\theta_i$, the function of this method could be expressed as

$$Hdiff(\theta_i) = \sum_{j:V}\left[\hat{p}_{ij}\!\left(\theta_{Ji};\, \hat{a}_{Jj},\, \hat{b}_{Jj}\right) - \hat{p}_{ij}\!\left(\theta_{Ji};\, \frac{\hat{a}_{Ij}}{A},\, A\,\hat{b}_{Ij} + B\right)\right]^2, \qquad (2\text{-}7)$$

where j:V indicates that the summation is over all the common items. This approach sums the squared difference between each item characteristic curve on the two scales. Then the difference over all abilities is cumulated using the criterion

$$Hcrit = \sum_{i} Hdiff(\theta_i). \qquad (2\text{-}8)$$

The criterion is not met until A and B values that minimize it are found. The Stocking-Lord method, instead, uses the squared difference of the test characteristic curves,

$$SLdiff(\theta_i) = \left[\sum_{j:V}\hat{p}_{ij}\!\left(\theta_{Ji};\, \hat{a}_{Jj},\, \hat{b}_{Jj}\right) - \sum_{j:V}\hat{p}_{ij}\!\left(\theta_{Ji};\, \frac{\hat{a}_{Ij}}{A},\, A\,\hat{b}_{Ij} + B\right)\right]^2. \qquad (2\text{-}9)$$

Here $SLdiff(\theta_i)$ is the squared difference between the test characteristic curves on the two scales for a given ability $\theta_i$. Then the difference is cumulated over all abilities using the criterion

$$SLcrit = \sum_{i} SLdiff(\theta_i). \qquad (2\text{-}10)$$

As in the Haebara approach, the criterion is met by finding the A and B values that minimize it. Once A and B are obtained using any of the linking methods, the IRT scale transformation can be accomplished. The distributions of the abilities for the two groups on the common scale would then be expected to differ. The transformed parameter estimates can be used in the third step to establish the equating relationship between the alternate test forms if number-correct scoring is employed: the number-correct scores on the target form are converted to the scale of number-correct scores on the base form and then to scale scores.
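The following Python sketch shows one way the Haebara and Stocking-Lord criteria (Equations 2-7 through 2-10) could be minimized numerically for the 2PL case. It is illustrative only: the ability points form an evenly spaced, equally weighted grid, whereas operational implementations typically weight the criterion by an estimated ability distribution, and the common-item estimates are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

def p_2pl(theta, a, b, D=1.7):
    # theta: column of ability points; a, b: arrays of item parameters.
    return 1.0 / (1.0 + np.exp(-D * a * (theta - b)))

def hcrit(x, theta, a_J, b_J, a_I, b_I):
    """Haebara criterion (Eqs. 2-7, 2-8): sum over abilities of the summed
    squared differences between common-item ICCs on the two scales."""
    A, B = x
    diff = p_2pl(theta, a_J, b_J) - p_2pl(theta, a_I / A, A * b_I + B)
    return np.sum(diff ** 2)

def slcrit(x, theta, a_J, b_J, a_I, b_I):
    """Stocking-Lord criterion (Eqs. 2-9, 2-10): sum over abilities of the
    squared difference between the test characteristic curves."""
    A, B = x
    tcc_J = p_2pl(theta, a_J, b_J).sum(axis=1)
    tcc_I = p_2pl(theta, a_I / A, A * b_I + B).sum(axis=1)
    return np.sum((tcc_J - tcc_I) ** 2)

# Hypothetical common-item estimates and an unweighted ability grid.
theta = np.linspace(-4.0, 4.0, 41)[:, None]
a_J = np.array([1.00, 0.80, 1.20, 0.65]); b_J = np.array([-0.30, 0.50, 1.15, -0.85])
a_I = np.array([1.10, 0.85, 1.30, 0.70]); b_I = np.array([-0.50, 0.20, 0.80, -1.00])

for crit in (hcrit, slcrit):
    res = minimize(crit, x0=[1.0, 0.0], args=(theta, a_J, b_J, a_I, b_I))
    print(crit.__name__, "A =", round(res.x[0], 3), "B =", round(res.x[1], 3))
```

The two criteria differ only in where the squaring happens: Haebara penalizes item-level discrepancies, while Stocking-Lord allows item-level differences to cancel within the test characteristic curve.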

IRT Linking and Equating for Testlet-Based Tests

Flexible as the IRT models are compared with classical test models, the strong statistical assumptions they require are hard to meet completely in real testing settings. When IRT models are applied to score a testlet-based test, the violation of the local independence assumption impacts not only the ability and item parameter estimation and test information, but the linking and final equating results as well (DeMars, 2006; Lee et al., 2001; Li, 2009; Li, Bolt, & Fu, 2005; Sireci, Thissen, & Wainer, 1991; Thissen, Steinberg, & Mooney, 1989; Yen, 1993; Zhang, 2007). Previous linking or equating studies for testlet-based tests usually focused on comparing the linking or equating performance of the IRT models used to model the local dependence within the testlet items. These models, as mentioned before, fall into three general categories: unidimensional dichotomous models, unidimensional polytomous models, and multidimensional models. Most of these studies emphasized how the magnitude of the testlet effect would affect the linking or equating results based on any of the three types of IRT models. Among these studies, some researchers examined IRT linking for testlet-based tests under the nonequivalent groups equating design. In Li, Bolt, and Fu (2005), the test characteristic curve linking procedure (Stocking & Lord, 1983) was extended to link IRT parameters from separate calibrations based on a two-parameter normal ogive (2PNO) testlet model. Data were generated under both the 2PNO and the 2PNO testlet models. Two levels of common testlets were considered, i.e., 2 and 4. For data generated under the 2PNO testlet model, the testlet effect, which is the variance of the random testlet factor, was defined as having three levels (0, .5, 1), representing no, medium, and large testlet effects, respectively. The three levels of testlet effects were randomly assigned to the six testlets for each pair of test forms. Then the two models were fit to each type of data set generated. The recovery of the linking parameters A and B obtained from the linking procedures based on the 2PNO testlet model and the 2PNO model was compared, using the absolute value of the bias for A and B as the evaluation criterion. For data simulated under the 2PNO testlet model, the linking parameter recovery with the 2PNO testlet model was superior to that with the 2PNO model. For data simulated under the 2PNO model, which represents the condition of meeting the local independence assumption, it turned out that the linking parameter recovery with the 2PNO testlet

model was comparable to that with the 2PNO model. The authors ascribed this counterintuitive finding to two possible reasons. One is the difference between the estimation methods used to estimate the two IRT models. The 2PNO testlet model was estimated using the Markov chain Monte Carlo (MCMC) algorithm implemented in WinBUGS 1.4 (Spiegelhalter, Thomas, & Best, 2003), while the 2PNO model was estimated using the marginal maximum likelihood (MML) method in BILOG (Mislevy & Bock, 2003). It was found in other studies as well that, under some conditions without local independence violation, the MCMC method yielded slightly better parameter estimates than BILOG (Du, 1998; Wainer, Bradlow, & Du, 2000). The other possible reason might be the use of BILOG to estimate parameters of the normal ogive model by applying the scale constant D = 1.7 to the logits from the BILOG estimates, because BILOG is supposed to fit the logistic model. The approximation of the normal ogive metric might produce less accurate parameter estimates for the 2PNO model. The impact of the magnitude of the testlet effect on the recovery of the linking parameters was not specified by the authors. Therefore, the overall conclusion of this study is that the linking procedure for the 2PNO testlet model results in more accurate linking parameter estimates than that for the 2PNO model when there is a medium or large testlet effect. Appealing as it is, the authors also pointed out that this linking procedure for the 2PNO testlet model took a relatively long time to compute due to the iterative and computationally intensive algorithm: about 55 minutes were needed to obtain the linking parameters for each pair of data sets under the four-common-testlet conditions, and 25 minutes under the two-common-testlet conditions, using a computer with a 2.8 GHz CPU and 512 MB of memory. Considering the IRT equating that must be conducted after the linking procedure, application of this characteristic curve linking method to testlet-based test equating is formidable (Zhang, 2007).

In another study on IRT linking for testlet-based tests under the common-item nonequivalent groups design, Li (2009) extended the Haebara item characteristic curve linking method (Haebara, 1980) to the 3PL testlet model. The linking parameter recovery was compared among three models (3PL, GRM, and 3PL testlet model) used to fit the same data generated from the 3PL testlet model. Three levels of the random testlet factor variance (0, 1, 2) were considered to investigate the impact of the magnitude of the testlet effect on the linking results from each model; the three levels were regarded as representing no, medium, and strong testlet effects, respectively. When the testlet factor variance is 0, the 3PL testlet model reduces to the 3PL model. Fifty replications were employed for each condition. Mean squared error (MSE) and bias were used to evaluate the recovery of the estimated linking parameters. WinBUGS was used to estimate parameters for the 3PL testlet model, BILOG-MG (Zimowski, Muraki, Mislevy, & Bock, 2005) for the 3PL model, and PARSCALE (Muraki & Bock, 1997) for the GRM. When the testlet effect was absent, the MSE and absolute value of bias of the linking parameter estimates with the 3PL model turned out to be lower than those with the 3PL testlet model and the GRM, which indicated better recovery of the linking parameters. With the increase of the testlet effect, the linking parameter recovery with the GRM and the 3PL testlet model tended to outperform that with the 3PL model, especially when the variance of the testlet factor was as high as 2. It should be noted that when the variance of the testlet factor was 1, one of the linking parameter estimates, i.e., B, was more accurate with the 3PL model than with the other two models, while the A parameter with this model was less accurate compared with that from the GRM and the 3PL testlet model, although the difference was slight in terms of the MSE values. Therefore, this study suggested that when the testlet effect was strong, the linking parameter recovery with the GRM and the 3PL testlet model was superior to that with the 3PL model, and the results from the GRM

and the testlet model were similar. However, when the testlet effect was medium, the linking parameter recovery from the three models was comparable. As mentioned before, after completing the IRT linking procedure, IRT equating should be conducted to convert the number-correct scores on the target form to the scale of number-correct scores on the base form. Some researchers have therefore compared the equating effectiveness of the unidimensional dichotomous models and polytomous models for testlet-based tests. In Lee, Kolen, Frisbie, and Ankenmann (2001), the equating performance based on two polytomous IRT models, i.e., the Nominal Model (NM; Bock, 1972) and the GRM, was compared with that based on the 3PL model for testlet-based tests. Real data collected following a random groups design were used in the equating analysis, where the two groups to be equated are supposed to have identical ability distributions. Therefore, item parameter estimates from each form need not be put on the same scale through linking. The magnitude of local dependence was detected by computing the Q3 statistic (Yen, 1984), which measures the residual correlation between item pairs after partialing out the general latent trait that the IRT model measures. Two types of Q3 statistics can be calculated for a test form, i.e., within-testlet and between-testlet. If the local independence assumption holds, then the mean values of the Q3 statistics for both the within- and between-testlet pairs would be close to the expected values of the measures (Lee et al., 2001). In addition, factor analysis was conducted to check the dimensionality. Results from the Q3 statistics and the factor analysis showed that the local independence assumption held at the testlet level but not at the individual item level. In other words, when testlet scores were used as the unit of analysis, both the local independence and unidimensionality assumptions held.
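Here is a minimal Python sketch of the Q3 computation, under the assumption of a 2PL model and with simulated data standing in for real responses: each scored response's residual from its model-expected value is computed, and Q3 for an item pair is the correlation of those residuals across examinees.

```python
import numpy as np

def q3_matrix(responses, theta_hat, a, b, D=1.7):
    """Yen's Q3: correlations between item residuals after the modeled
    latent trait is partialed out (off-diagonal entries of the result)."""
    expected = 1.0 / (1.0 + np.exp(-D * a * (theta_hat[:, None] - b)))
    residuals = responses - expected
    return np.corrcoef(residuals, rowvar=False)

# Simulated stand-in data: 500 examinees, 4 locally independent items.
rng = np.random.default_rng(7)
theta_hat = rng.normal(size=500)
a = np.array([1.0, 0.9, 1.1, 0.8]); b = np.array([-0.5, 0.0, 0.5, 1.0])
prob = 1.0 / (1.0 + np.exp(-1.7 * a * (theta_hat[:, None] - b)))
responses = (rng.uniform(size=prob.shape) < prob).astype(int)

q3 = q3_matrix(responses, theta_hat, a, b)
# With local independence, off-diagonal Q3 values are expected to be
# slightly negative (roughly -1/(n_items - 1)); within-testlet pairs
# affected by a testlet factor would instead show positive values.
print(np.round(q3, 3))
```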

Three traditional equating methods (mean, linear, and equipercentile) served as the baseline methods, since they use total test scores and are not subject to the influence of local independence violations. The IRT observed score equating and true score equating methods with the 3PL model, the NM, and the GRM were compared against the equating results from the three baseline methods. Two criteria were used for the comparison among these models. The first was equated score moments: the moments of the converted scores for each method were calculated, and the absolute values of the differences between these moments and the moments of the base form (the form to be equated to) were compared. The second was equating conditional on number-correct scores: the discrepancies between the equated scores from the IRT equating methods and the baseline equating methods were compared. The results of this study suggested that the more the local independence assumption was violated, the more similar the IRT equating results with the two polytomous models were to those of the three baseline equating methods relative to the dichotomous model. On the contrary, when the violation of the assumption was slight, the equating results from the dichotomous and polytomous IRT models were close to each other. The software RAGE (Zeng, Kolen, & Hanson, 1995) was used to obtain the equating results with the three baseline equating methods. BILOG (Mislevy & Bock, 1990) and MULTILOG (Thissen, 1991) were employed to estimate the item parameters under the dichotomous and polytomous models, respectively. IRT true score and observed score equating relationships were constructed using PIE (Hanson & Zeng, 1995). In another equating study for testlet-based tests, Zhang (2007) went beyond the conditions in Lee et al. (2001) by considering the impact of the pattern of within- and between-testlet local dependence on the equating performance of the unidimensional dichotomous model (3PL) and the polytomous model (GPC). Both simulated and real data were examined. The simulated data were generated based on a bivariate extension of the unidimensional 3PL model with compensatory abilities (Reckase & McKinley, 1983), which in

PAGE 30

fact is a special case of the multidimensional 3PL compensatory model (Reckase, 1985). Three levels of within-testlet local dependence were manipulated (i.e., low, medium, and high) together with a condition of no local dependence. Data were collected with a random groups design. IRT true score and observed score equating results under the 3PL and the GPC models were compared with the traditional equipercentile equating in terms of the first- and the second-order equity. This study found that the equating under the GPC model tended to perform better than that under the 3PL model when a moderate level of multidimensionality was present; in the condition of either high or no multidimensionality, equating under the 3PL model was closer to the equipercentile results. The pattern of the within- and between-testlet local dependence heavily influenced the equating performance of the GPC model. If the absolute value of the within-testlet local dependence is large and the between-testlet local dependence is low, the GPC model would outperform the 3PL model. Otherwise, the 3PL model works better. It should be noted that, for simulation purposes, the generated data included both within- and between-testlet item pair correlations, while in the original testlet structure proposed by Wainer and Kiely (1987) the between-testlet items should be conditionally independent. The inspection of these previous linking and equating studies on testlet-based tests showed that although the two proposed characteristic curve linking methods based on the testlet models (Li, Bolt & Fu, 2005; Li, 2009) were reported to have better linking parameter recovery than the dichotomous model if the testlet effect was present, the intensive computational algorithms used in these linking methods somewhat hinder their application in equating testlet-based tests due to long computation time. As for the effectiveness of the polytomous models,
the reviewed studies suggested that they performed comparably to the testlet model and better than the dichotomous model when there was a testlet effect, especially when the testlet effect was strong. However, the two studies on equating under the dichotomous and polytomous IRT models for testlet-based tests presented mixed results regarding their performance in conditions with different levels of testlet effects. Particularly, contradictory conclusions were drawn with respect to whether the polytomous model yielded better equating results compared to the dichotomous model when the testlet effect was strong or the level of multidimensionality was high. In Zhang (2007), the equating under the dichotomous 3PL model was reported as superior to that under the GPC model when the tests had a high level of multidimensionality, whereas Lee et al. (2001) found that the two polytomous models outperformed the dichotomous model as long as the testlet effect was of moderate or high level. Across these studies on linking and equating for testlet-based tests, different equating designs were adopted, and the population model used to generate the testlet data in each simulation study was not identical. Consequently, it is not surprising to see the inconsistent results. However, in order to provide comprehensive knowledge of this issue for practitioners and researchers, more conditions need to be investigated. Particularly, the relationship between the magnitude of the testlet effect and the equating performance under the dichotomous and the polytomous models requires further clarification. Moreover, another popular IRT parameter estimation procedure, concurrent calibration, has not been explored in any of these studies within the context of testlet-based tests. Therefore, the current study aims to investigate the equating effectiveness of the unidimensional dichotomous model and the polytomous models for testlet-based tests under a common-item nonequivalent groups design using both separate and concurrent calibrations.
Comparison of Separate and Concurrent Estimation

As described before, item parameters of the test forms can be estimated either in separate runs of IRT model estimation software for each form, or in a single run using the data of all the test forms simultaneously. The former is called separate calibration; the latter, concurrent calibration. In the case of separate calibration, if the groups are nonequivalent in ability distribution, an IRT linking procedure should be implemented to find the linear relationship between the item parameter estimates and put the item parameter estimates from one form on the scale of the parameter estimates from the other form. In contrast, item parameter estimates from different forms are put on a common metric directly in the case of concurrent calibration. Many studies have compared the two IRT estimation procedures in both unidimensional and multidimensional test contexts using either real or simulated data (Petersen, Cook, & Stocking, 1983; Wingersky, Cook, & Eignor, 1986; Kim & Cohen, 1998; Béguin, Hanson, & Glas, 2000; Béguin & Hanson, 2001; Hanson & Béguin, 2002; Kim & Kolen, 2006; Simon, 2008; Lee & Ban, 2010). In Petersen et al. (1983), several equating methods, both conventional and IRT, were compared graphically and analytically using real data. For the IRT equating methods, both the separate and the concurrent calibration were considered based on the three-parameter logistic IRT model. The verbal and mathematical portions of the Scholastic Aptitude Test were used to compare these equating methods. The IRT model was estimated by LOGIST (Wingersky, Barton, & Lord, 1982), which implements joint maximum likelihood estimation (JMLE). They found that when tests were not strictly parallel, e.g., when there was a difference in content and test length, equating methods for the three-parameter IRT model yielded more stable results. Among the IRT methods, equating based on the concurrent calibration provided the most stable equating results.
Wingersky, Cook, and Eignor (1986) investigated the effects of the characteristics of the linking items on IRT true score equating results. The study was carried out using the 3PL IRT model and Monte Carlo procedures. The characteristics of the items were investigated for two of the common scaling designs: concurrent calibration and the characteristic curve transformation method (Stocking & Lord, 1983). The effects on IRT true score equating results were studied for the two scaling designs using three different anchor test lengths (10, 20, and 40 items), two variations in the size of the standard errors of estimation of the linking items, and two distributions of examinee ability (peaked and uniform). The study showed very little difference in equating results based on placing item parameter estimates on the same scale using a concurrent calibration procedure or a characteristic curve method of scaling. For both scaling methods, the accuracy of the equating results improved as the number of linking items was increased. The characteristic curve transformation method seemed to require slightly more items than the concurrent calibration procedure. The equating results were slightly better when a uniform distribution of abilities, rather than a peaked distribution of abilities, was used to estimate the parameters of the linking items. Kim and Cohen (1998) compared the separate calibration and the concurrent calibration under the two-parameter logistic (2PL) IRT model using simulated data. Each form had 50 items in total. The test characteristic curve linking method was employed in the separate calibration. They manipulated the number of common items, which had four levels (5, 10, 25, 50); the target group ability distribution, which had two levels, i.e., N(0,1) and N(1,1); and the estimation methods for concurrent calibration, i.e., the marginal maximum a posteriori estimation (MMAPE) implemented in BILOG (Mislevy & Bock, 1990) and the marginal maximum likelihood estimation (MMLE) implemented in MULTILOG (Thissen, 1991). The main purpose of this
study was to investigate the impact of a small sample size (500) and different numbers of common items on obtaining a common metric from separate and concurrent calibration. Their review of previous research comparing the separate and the concurrent calibration with JMLE showed that concurrent calibration provided more stable results when the sample size was as large as 1000. However, their research found that with a small sample size of 500, the recovery of item parameters from separate calibration was more accurate than that from concurrent calibration if the number of common items was small, especially when the ability distributions of the groups were different. As the number of common items increased, similar results were obtained using either the separate or the concurrent calibration, regardless of the type of software used for the concurrent calibration. Overall, recovery of item parameters using BILOG was more accurate than that using MULTILOG. Because the two calibration procedures in that study were implemented in different computer programs, however, the results were susceptible to confounding with the difference between computer programs. To overcome this problem, Hanson and Béguin (2002) compared the separate calibration with the concurrent calibration under the common-item nonequivalent groups design using both BILOG-MG (Zimowski, Muraki, Mislevy, & Bock, 1996) and MULTILOG (Thissen, 1991). The 3PL model was used to fit data simulated with the same model. Four IRT linking methods, mean/mean, mean/sigma, Haebara, and Stocking-Lord, were employed in the separate calibration condition. Two levels of sample size were manipulated: 1000 and 3000. The ability distribution for the base group was N(0,1), and the ability distribution for the target group had two levels: N(0,1) and N(1,1). Each form had a total of 60 items, and the common items were manipulated as having two levels: 10 and 20. Two evaluation criteria were considered. One criterion was based on the IRT true score equating function from the target form to the base form, and the other was based
on how close the estimated item characteristic curves were to the true item characteristic curves for the target form items. This study found that with BILOG-MG, the concurrent estimation always turned out to be more accurate than the separate estimation, and the effect was larger in the condition in which the ability distributions for the two forms to be equated were different. In the case of MULTILOG, when the groups had the same ability distribution, the concurrent estimation resulted in lower error than separate estimation. When the groups were nonequivalent, results based on the IRT true score equating criterion showed that the separate estimation with the Stocking-Lord and, in some cases, the Haebara linking methods had less error than the concurrent estimation; this result occurred when the convergence criterion in MULTILOG was not met. If the criterion of item characteristic curves was applied, the concurrent estimation yielded less error than the separate estimation. Therefore, the concurrent estimation generally would lead to more accurate equating results as compared with separate estimation. However, the authors cautioned that the results of this study and other studies on this topic were not sufficient to recommend a complete preference for the concurrent calibration. The authors also suggested that the larger sample size might be a factor that accounted for the lower error obtained with the concurrent estimation. Future research on the performance of the separate versus the concurrent calibration when the model is misspecified to some degree was recommended. In another study, Lee and Ban (2010) compared the relative performance of the concurrent calibration with the separate calibration and proficiency transformation in a random groups equating design. Although the IRT linking procedure is not necessary in a random groups equating design, as the authors explained, it is required for purposes such as building up an item pool using the new forms developed over multiple years or comparing IRT-based statistics across years; thus the parameter estimates for all forms need to be put on the same scale. The
equating linkage plan in this study had three sets of data: Form A, a base form administered at Time 1 to Group 1, determined the basic scale. Then at Time 2, Form A and a new form, Form B, were administered to a group of examinees (Group 2) in a spiral manner, so the examinees taking Form A at Time 2 and the examinees taking Form B at Time 2 could be viewed as randomly equivalent groups from the same population. Consequently, the item parameter estimates for these two forms within Time 2 were on the same scale. Form A administered at Time 2 thus served as a link to put the item parameter estimates of Form B on the scale of the base form, Form A administered at Time 1. The 3PL model was used to simulate and analyze the data. The ability distribution for the base form was set as N(0,1), while the ability distribution for the other two forms was manipulated as having three levels: N(0,1), N(0.5,1), and N(1,1). Two sample sizes, 500 and 3000, were included. The total number of items per form had two levels: 25 and 75. BILOG-MG was used for parameter estimation in both the separate calibration and the concurrent calibration. Two evaluation criteria were employed: one based on the test characteristic curves (TCC), which evaluates how close the estimated TCC of Form B is to the one computed using the generating item parameters, and the other based on the expected observed score distributions. Separate calibration without linking served as a baseline criterion. The results showed that conditions with the larger sample size, 3000, yielded substantially lower linking errors than conditions with the small sample size, 500. As the difference in ability distribution between Group 1 and Group 2 enlarged, linking errors tended to increase, as did the errors for no scaling. In the condition of equivalent groups, the concurrent calibration outperformed the separate calibration and proficiency transformation procedures; however, no scaling would be a better choice. If there was a difference in the group ability distributions, separate calibration
appeared to be superior to the concurrent calibration and proficiency transformation. This result is consistent with what Kim and Cohen (1998) found. Apart from investigating this issue with dichotomous IRT models, polytomous IRT models have been explored as well. In Kim and Cohen (2002), a comparison of the separate calibration with the concurrent calibration was made using the GRM. Data were simulated with the graded response model. MULTILOG (Thissen, 1991) conducted the separate and the concurrent estimations. The Stocking-Lord method was implemented in the separate calibration to obtain the linking constants; linking for the graded response model was performed through the computer program EQUATE (Baker, 1993). A total of 30 items were generated under an anchor test design, and three levels of common items were used (5, 10, 30). The ability distribution for the base group was N(1,1). The ability distribution for the target group was manipulated as having two levels: N(0,1) and N(1,1). The sample sizes for the base and target groups were set at three levels: 300/300, 1000/1000, and 1000/300. After the item parameter estimates from the separate and the concurrent calibrations were put on the scale of the base form, and further on the metric of the generating item parameters, the recovery of item parameter estimates from the two estimation procedures was compared. Two measures were used to evaluate the recovery of the parameters from the two calibration procedures: root mean square differences (RMSD) between the generating parameters and the parameter estimates, i.e., item difficulty and item discrimination, and the mean distance measure (MDM), the average square root of the sum of the squared differences between the difficulty and discrimination parameter estimates and their generating parameter values. The RMSD was also employed to evaluate the recovery of the ability parameters from the two calibration procedures. Results for both the item parameter and the ability parameter recovery showed that the concurrent calibration
produced consistently better estimates than the separate calibration, but the difference was slight. When the number of common items increased, the recovery of the parameters improved. Comparison of the separate and the concurrent calibration with multidimensional IRT models has been addressed in the literature as well. In Simon (2008), the two estimation procedures were applied to the three-parameter compensatory multidimensional IRT model using data generated under the same model. Three sample sizes were used: 500, 1000, and 3000. Two total test lengths, i.e., 40 and 60 items, were employed. The number of common items was set at 20. Three levels of mean group ability were considered to simulate the condition of equivalent groups, with both groups having an ability distribution of N(0,1), and the conditions of nonequivalent groups, with group mean ability differences of 0.5 and 1, respectively. Three levels of correlation between the two dimensions were manipulated: r = 0, 0.5, and 0.8. Four types of multidimensional IRT separate linking methods were compared with the concurrent estimation procedure. RMSE (root mean squared error) and bias between the transformed item parameter estimates and the generating parameters were used to indicate the effectiveness of the two estimation procedures. TESTFACT (Wood et al., 1987) was used to implement the separate and concurrent calibrations. The results suggested that the concurrent calibration outperformed the separate calibration in the conditions in which the groups were equivalent and the correlation between the two dimensions was zero. In conditions of nonequivalent groups, one type of separate linking method, ICF (item characteristic function), produced more accurate estimates for the item difficulty parameters than the concurrent calibration. However, the author argued that the concurrent calibration had similar performance to the ICF method when the groups were nonequivalent, based on the correlation between the estimated and the generating item parameters.
The studies reviewed so far on the performance of the separate and the concurrent calibration with either unidimensional or multidimensional IRT models in various linking and equating conditions provide some practical guidelines for practitioners. One common feature of these studies is that they all investigated conditions with model-data fit. Therefore, exploration of the performance of the separate and the concurrent calibration under a lack of model-data fit was recommended by some researchers (Hanson & Béguin, 2002; Lee & Ban, 2010). In response to this lack of research, Béguin, Hanson, and Glas (2000) and Béguin and Hanson (2001) fit unidimensional IRT models to data simulated under multidimensional IRT models. In Béguin, Hanson, and Glas (2000), data were simulated with the three-parameter compensatory multidimensional IRT model according to equivalent and nonequivalent groups designs. The three-parameter normal ogive model (3PNO) and its two-dimensional counterpart were fit to the data. The separate and the concurrent estimation with the unidimensional IRT model were compared with the results from the concurrent estimation under the corresponding multidimensional IRT model. BILOG-MG was used to conduct both the separate and the concurrent estimation for the 3PNO model; a Markov chain Monte Carlo (MCMC) estimation procedure was applied to the concurrent calibration with the 3PNO model and the multidimensional model. The sample size for each form was fixed at 2000, and the total number of items was 60 for each form with 20 common items among them. Two levels of population mean proficiency difference between the two forms were assumed to simulate the conditions of equivalent and nonequivalent groups, respectively. Three levels of covariance between the two dimensions were manipulated. Two evaluation criteria were employed to compare the results. One was based on the differences between the estimated target form score distributions and the population target form score distribution; the other was based on the differences between the estimated equivalent
score points and the population equivalent score points. The results showed that when the groups were equivalent in ability distribution, errors from the unidimensional 3PNO model were similar to those from the multidimensional model, and the concurrent calibration turned out to be better than the separate calibration. When the groups were nonequivalent, both the separate and the concurrent estimation procedures under the unidimensional model yielded larger errors compared with the concurrent estimation under the multidimensional model. The concurrent estimation with the unidimensional model tended to produce larger errors than the separate estimation as the covariance between the two dimensions increased. The authors concluded that the two estimation procedures under the unidimensional IRT model were impacted by the multidimensionality of the data. Béguin and Hanson (2001) extended this study to data simulated with a three-parameter non-compensatory multidimensional model. The conditions examined were similar to those in Béguin, Hanson, and Glas (2000). Both BILOG-MG and EPDIRM (Hanson, 2000) were used for the separate and the concurrent calibrations under the unidimensional 3PNO model; MCMC estimation was applied to the concurrent calibration under the multidimensional model. The results showed that, in general, the concurrent calibration yielded less error than the separate calibration in most of the conditions. However, in this study only two levels of difference between the groups' ability distributions were included to simulate the conditions of equivalent and nonequivalent groups. In the condition of nonequivalent groups, only a group mean ability difference of 0.5 between the two groups was considered, while in Béguin, Hanson, and Glas (2000) a group mean ability difference of 1 between the two groups was examined too. The magnitude of the covariance between the two dimensions in the two studies was not identical. When the groups were equivalent in ability distributions, the concurrent calibration tended to
outperform the separate calibration under the unidimensional IRT model if the data presented multidimensionality. However, when the groups were nonequivalent, the results concerning the performance of the separate versus the concurrent calibration under the unidimensional IRT model were not consistent across the two studies. The magnitude of the group mean ability difference and the type of multidimensionality present would exert influence on the performance of the separate and the concurrent calibration under the unidimensional model. Kim and Kolen (2006) investigated the performance of concurrent and separate calibration under the common-item nonequivalent groups design for mixed-format tests with multidimensionality. Three levels of nonequivalence between the two groups were manipulated (group mean = 0, 0.5, 1). The results showed that the concurrent calibration generally outperformed the separate calibration with various linking methods in terms of linking accuracy and robustness to multidimensionality due to item formats. Characteristic curve linking methods were more accurate than the moment methods, regardless of the degree of multidimensionality. The differences in linking results between the concurrent calibration and the separate calibration with the characteristic curve methods were small.

Separate and Concurrent Calibration under the Unidimensional IRT Models for Testlet Data

From the literature reviewed on the comparison of the separate and the concurrent calibration for IRT models, it is still hard to reach a simple conclusion about the conditions in which each procedure works better. As Hanson and Béguin (2002) claimed, the results of the extant studies on this topic did not provide sufficient evidence to totally prefer the concurrent calibration to the separate calibration. Across the literature reviewed so far, the performance of the two procedures was influenced by factors such as sample size, number of common items, group ability distribution, estimation software, data collection design, and model-data fit.
Particularly, the amount of research on the performance of the separate and the concurrent calibration when the model is misspecified to some extent is quite small. More studies are needed to shed light on this issue (Hanson & Béguin, 2002).

Research Questions

Although the equating performance of the unidimensional dichotomous and polytomous IRT models for testlet data has been compared in some previous studies under the random groups design (Lee et al., 2001; Zhang, 2007), it has not been well studied under the common-item nonequivalent groups design considering various levels of testlet effects. Moreover, the results from the separate and the concurrent calibrations with these models for testlet-based tests have seldom been compared. Since testlet data are multidimensional in nature, the impact of this type of multidimensionality on the two estimation procedures with the unidimensional IRT models will add insightful information to this part of the literature. Therefore, the current study addressed this topic considering a series of factors that would potentially influence the equating results using the unidimensional models. The research questions are:

1. How well will the 2PL, the GRM, and the GPC models work for testlet data in terms of equating under the common-item nonequivalent groups design considering different levels of testlet effects?

2. Is there any difference between the equating using separate and concurrent calibrations with the three IRT models?

3. What factors will exert more impact on the final equating results using the three IRT models?
CHAPTER 3
METHOD

Monte Carlo simulation was used in this study to generate testlet data with desired features, so that the factors of interest could be manipulated. The software R 3.0 (R Development Core Team, 2013) executed the simulation. The population model for data generation was the 2PL testlet model (Bradlow, Wainer, & Wang, 1999), which incorporates testlet dependence into the standard 2PL model; it defines the probability that an examinee i answers item j correctly as

$P(y_{ij} = 1 \mid \theta_i) = \frac{\exp[a_j(\theta_i - b_j - \gamma_{id(j)})]}{1 + \exp[a_j(\theta_i - b_j - \gamma_{id(j)})]}$, (3-1)

where $\theta_i$ is the latent ability of examinee i, and $a_j$ and $b_j$ are the discrimination and difficulty parameters of item j. The new term $\gamma_{id(j)}$ is a random testlet effect parameter that represents the interaction of person i with item j nested in testlet d(j); it thus incorporates the extra dependence among the items within the same testlet. $\gamma_{id(j)}$ is fixed within a testlet for each person, but it varies across persons and testlets. The testlet effect is defined as normally distributed with mean 0 and variance $\sigma^2_{\gamma_d}$. The variance indicates the amount of within-testlet dependence, and it is allowed to vary across testlets. Six testlets with 5 items each (30 items for the whole test) were generated. Then, the 2PL, GRM, and GPC models were fit to the simulated data, and the equating results under the three models through separate and concurrent estimation procedures were evaluated.
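To make the generating model concrete, the following minimal R sketch draws dichotomous responses from Equation (3-1). It is an illustration only: the function name, the person-by-testlet layout, and the item parameter values are hypothetical rather than taken from the actual simulation code used in this study.

# Minimal sketch: simulate responses from the 2PL testlet model, Eq. (3-1).
# All names and the item parameter values below are hypothetical.
sim_testlet <- function(n_persons, a, b, testlet_id, var_gamma, theta_mean = 0) {
  n_items    <- length(a)
  n_testlets <- length(unique(testlet_id))
  theta <- rnorm(n_persons, mean = theta_mean, sd = 1)
  # one random testlet effect per person-by-testlet combination
  gamma <- matrix(rnorm(n_persons * n_testlets, 0, sqrt(var_gamma)),
                  n_persons, n_testlets)
  resp <- matrix(0L, n_persons, n_items)
  for (j in seq_len(n_items)) {
    eta <- a[j] * (theta - b[j] - gamma[, testlet_id[j]])
    resp[, j] <- rbinom(n_persons, 1, 1 / (1 + exp(-eta)))  # Eq. (3-1)
  }
  resp
}

# Example: 6 testlets of 5 items each, medium testlet effect (variance 1)
set.seed(123)
X <- sim_testlet(n_persons = 1000, a = runif(30, 0.7, 1.5), b = rnorm(30),
                 testlet_id = rep(1:6, each = 5), var_gamma = 1)

Setting var_gamma to 0 removes the testlet term and reproduces the standard 2PL model, which corresponds to the no-testlet-effect condition described below.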

Factors Manipulated

The major factors of interest manipulated in the simulation include: (1) the models used for calibration (2PL, GRM, GPC); (2) the mean of the ability distribution for the target test form (0, 0.5, 1), i.e., $\theta \sim N(0/0.5/1, 1)$ for the target form and $\theta \sim N(0, 1)$ for the base form; (3) the variance of the testlet effect $\sigma^2_{\gamma_d}$ (0, 0.5, 1, 2); (4) the number of examinees (500, 1000, 3000); (5) the number of common testlets (2, 4); and (6) the calibration procedures: concurrent calibration vs. separate calibration, the latter with four linking methods (mean/mean, mean/sigma, Stocking-Lord, and Haebara).

Calibration Models

Three unidimensional IRT models, the 2PL, GRM, and GPC, were fit to the simulated testlet data, so that the equating results from each of these models could be compared. The three models are defined in the following sections.

The 2PL model

The 2PL unidimensional IRT model dictates that the probability of examinee i responding correctly to item j is

$P(y_{ij} = 1 \mid \theta_i) = \frac{\exp[a_j(\theta_i - b_j)]}{1 + \exp[a_j(\theta_i - b_j)]}$, (3-2)

where $\theta_i$ is the latent ability of examinee i, and $a_j$ and $b_j$ are the discrimination and difficulty parameters of item j.

Graded response model (GRM)

Samejima (1969) proposed the GRM as an extension of the 2PL model to the multiple-category situation. In this model, the responses to an item j with $m_j + 1$ ordered categories are classified into those $m_j + 1$ categories, and the possible scores on item j are defined as $x = 0, 1, \ldots, m_j$. The response categories are ordered, with higher category scores representing more of the trait being measured than do lower scores (Cook, Dodd, & Fitzpatrick, 1999). The GRM specifies that the probability of an examinee with a certain trait level scoring in category x or higher is

$P^*_{jx}(\theta) = \frac{\exp[a_j(\theta - b_{jx})]}{1 + \exp[a_j(\theta - b_{jx})]}$, (3-3)
where $a_j$ is the discrimination parameter of item j and $b_{jx}$ is the difficulty parameter associated with category score x for item j. By definition, the probability of responding in the lowest category or higher is 1, while the probability of responding above the highest category, $P^*_{j,m_j+1}(\theta)$, is defined as 0. To compute the probability of an examinee responding in a particular category, we take the difference between the two adjacent cumulative probabilities,

$P_{jx}(\theta) = P^*_{jx}(\theta) - P^*_{j,x+1}(\theta)$. (3-4)

Generalized partial credit (GPC) model

Like the GRM, the GPC model (Muraki, 1992) is applicable to ordered polytomous data. Unlike the GRM, the probability of an examinee of a given ability level scoring in a specific category can be obtained directly. The category responses related to a given polytomous item j are viewed as a series of successive steps, and an examinee either passes or fails each step within an item. The step difficulties need not be ordered; reversals are allowed. This model defines the probability of a particular category score x on a polytomous item j as

$P_{jx}(\theta) = \frac{\exp\left[\sum_{k=0}^{x} a_j(\theta - b_{jk})\right]}{\sum_{c=0}^{m_j} \exp\left[\sum_{k=0}^{c} a_j(\theta - b_{jk})\right]}$, (3-5)

where $a_j$ is the item discrimination parameter of polytomous item j, $b_{jk}$ is the difficulty of the step associated with category k ($k = 1, \ldots, m_j$, with the k = 0 term in the sums defined as 0), and $m_j$ is the highest possible score on item j.
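To illustrate the difference between the two parameterizations, the following R sketch computes the category probabilities of Equations (3-3) through (3-5) for a single item; the function names and parameter values are hypothetical.

# GRM: cumulative 2PL curves differenced into category probabilities
grm_prob <- function(theta, a, b) {
  # b: vector of m ordered category boundaries; returns m + 1 probabilities
  p_star <- c(1, 1 / (1 + exp(-a * (theta - b))), 0)  # Eq. (3-3) plus endpoints
  -diff(p_star)                                       # Eq. (3-4)
}

# GPC: direct category probabilities from cumulated step exponents
gpc_prob <- function(theta, a, b) {
  # b: vector of m step difficulties (reversals allowed)
  z <- cumsum(c(0, a * (theta - b)))  # exponents for x = 0, ..., m
  exp(z) / sum(exp(z))                # Eq. (3-5)
}

grm_prob(theta = 0.5, a = 1.2, b = c(-1, 0, 1))  # four category probabilities
gpc_prob(theta = 0.5, a = 1.2, b = c(-1, 0, 1))

Both functions return probability vectors that sum to 1, but for the same parameter values the two models generally assign different probabilities to the middle categories, which is one reason their equating behavior can differ.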

The GRM and GPC models are commonly used to model the dependencies of testlet items in existing IRT application studies (Cook, Dodd, & Fitzpatrick, 1999; DeMars, 2006; Lee, Kolen, Frisbie, & Ankenmann, 2001; Li, 2009; Zhang, 2007). Comparisons of the two models on the same data have been conducted in some previous research (Cook, Dodd, & Fitzpatrick, 1999; Maydeu-Olivares, Drasgow, & Mead, 1994; Tang & Eignor, 1997). Maydeu-Olivares, Drasgow, and Mead (1994) simulated polytomous data using both the GRM and the GPC model, and fit the two models to each type of data simulated. They found that neither model was consistently better in modeling data generated by its own parametric form, nor in modeling data generated by the alternative model. Therefore, they concluded that the two models provided similar fit to the same set of polytomous data. Tang and Eignor (1997) applied both models to the TOEFL (Test of English as a Foreign Language) test; PARSCALE was used to implement the model calibration. They found that, on average, the GPC model performed better than the GRM in terms of the average root mean squared differences (RMSDs) and the correlations between the estimated true scores and the observed scores for each polytomous item: the RMSDs appeared to be smaller and the correlations higher for the generalized partial credit model. They pointed out that various comparisons across the two models showed some preference for using the GPC model where PARSCALE was used for model calibration. Moreover, the GPC model did better at capturing the exact patterns in the data compared with the GRM. Cook, Dodd, and Fitzpatrick (1999) simulated testlet data within an SEM (structural equation modeling) framework, and fit the Partial Credit model (PC; Masters, 1982), the GRM, and the GPC model to the testlet data. They found that the three models had similar performance in terms of parameter estimation and model fit, while the GRM produced more information than the PC and GPC models over the whole range of the latent trait. They recommended further investigation of the practical implications of this phenomenon. Although the two models have been employed in different previous linking or equating studies, few have applied both models simultaneously to testlet data under the common-item nonequivalent groups design. After comparing the equating results under the 3PL model and the GPC model for testlet-based tests, Zhang (2007) suggested examining how other polytomous models work for testlet-based tests. In particular, she found that the item
characteristic curves of the GPC model and the GRM were quite different at the lower and the higher ends of the ability distribution for one of the testlet items from the real data, and she suspected that the equating performance of the two polytomous models would differ for the same set of testlet data. In order to compare the research results across these studies, the current study adopted both models to investigate whether there is any difference between their final equating results.

Examinee group ability distribution

Although the common-item nonequivalent groups design was used in the data collection, both equivalent and nonequivalent group conditions were taken into account in the current study to detect the impact of group ability distribution on equating results. The influence of the ability distributions of the groups to be equated on equating results has been addressed in quite a few studies. Cook and Petersen (1987) reviewed the empirical studies that examined how equating results were affected by examinee characteristics, and pointed out that a lack of population invariance would affect the equating function for an anchor test design. Their review of conventional equating methods (Klein & Kolen, 1985; Cook, Eignor, & Taft, 1985; Klein & Jarjoura, 1985) showed that as the ability levels of the samples used to equate tests became more different, the properties of an anchor test became more of a concern: the more divergent the groups' ability levels, the more items should be included in an anchor test, and the more representative the anchor test content should be, to yield equating results that are in close agreement. In the IRT studies they reviewed, the issue of group ability differences was not specifically addressed; however, the authors speculated that the linking item properties used for IRT item parameter scaling interact with the level of group ability differences as well.
In another review of the impact of group invariance on equating results, Kolen (2004a) also found that unless the test forms to be equated were well paralleled (e.g., highly similar in content, difficulty, and reliability), both equating theory and empirical research indicated that equating was population dependent. Kim and Lee (2006) conducted a simulation study to investigate the performance of the four linking methods, mean/mean, mean/sigma, Haebara, and Stocking-Lord, for a mixed-format test. A random groups design was adopted for the linking scenario, and the impact of group ability difference was considered. Their research results showed that, keeping other factors constant, linking using equivalent groups produced less linking error than using nonequivalent groups. In another simulation study on mixed-format equating, Cao (2008) examined the effects of test dimensionality and common-item set properties on equating results using concurrent calibration with unidimensional IRT models. This study also found that group ability had the most notable and significant effects on the equating results: the equivalent groups conditions always outperformed the nonequivalent groups conditions based on various evaluation indices. Results from these studies make it salient that the group ability distributions of the forms to be equated are an important factor to consider in an equating study. Thus, both equivalent and nonequivalent groups conditions were included in the current study. Three levels of group ability distributions were manipulated. The base form group ability distribution was set as normally distributed with a mean of 0 and a standard deviation of 1, denoted as N(0, 1). The target form group ability distribution was manipulated as having three levels: the first level had a normal distribution with a mean of 0 and a standard deviation of 1, simulating a condition of equivalent groups; the second level had a normal distribution with a mean of 0.5 and a standard deviation of 1, representing a condition of nonequivalent
groups with a medium level of group mean difference; and the third level was fixed to have a normal distribution with a mean of 1 and a standard deviation of 1, indicating a condition of nonequivalent groups with a high level of group mean difference. Only the mean of the ability distributions was manipulated to vary in the current study. The selection of the values for the group ability distributions was based on a survey of the existing linking and equating simulation studies for testlet-based tests and of comparisons of separate and concurrent calibrations using anchor test or random groups designs (Kim & Cohen, 1998; Kim & Cohen, 2002; Kim & Kolen, 2006; Lee & Ban, 2010; Li, 2009; Hanson & Béguin, 2002).

Variance of Testlet Effect

As mentioned before, the 2PL testlet model was used to simulate the population data with a testlet structure. The testlet parameter $\gamma_{id(j)}$ denotes the testlet effect for person i on item j nested within testlet d(j). Therefore, the local dependence of the items within the same testlet for a given person is modeled in such a fashion that these within-testlet items share the specific testlet effect in their score predictor (Wainer, Bradlow, & Wang, 2007). If the local independence assumption is not violated in a test, then $\gamma_{id(j)} = 0$ for each item. For model identification, $\gamma_{id(j)}$ is defined by the prior specification as normally distributed with a mean of 0 and a variance of $\sigma^2_{\gamma_d}$. The size of the testlet variance, $\sigma^2_{\gamma_d}$, indicates the magnitude of the testlet effect. According to Wainer, Bradlow, and Wang (2007), the size of the testlet variance is normed based on its ratio to the variance of the group ability, $\sigma^2_{\theta}$. The variance of $\theta$ in the current study was fixed to 1 in each combination of conditions. Therefore, four levels of testlet effect were manipulated in the current study according to the ratio $\sigma^2_{\gamma_d}/\sigma^2_{\theta}$, i.e., 0/0.5/1/2, representing no, low, medium, and high levels of testlet effect. These values were also selected in the study of
Bradlow, Wainer, and Wang (1999), and empirical work has shown that these are plausible values (Wainer, Bradlow, & Wang, 2007).

Number of Examinees

The examinees taking the two forms to be equated in the current study were set to be equal in number. Three sample sizes were chosen, 500/1000/3000, to investigate the influence of sample size on the equating results. In Kim and Cohen (1998), a sample size of 500 was employed to examine the performance of separate and concurrent calibration in a small sample size condition. Reise and Yu (1990) suggested that a minimum of 500 examinees was needed for the graded response model to obtain an adequate calibration, and 500 was adopted in Liang and Wells (2009) as a small sample size to detect the effectiveness of a model fit statistic for the GPC model. Therefore, 500 was used to represent a condition of small sample size in the current study. The sample sizes of 1000 and 3000 were selected as adequate sample size conditions in linking or equating studies, as commonly used in published research (Cook & Petersen, 1987; Hanson & Béguin, 2002; Kim & Cohen, 2002; Kim & Lee, 2006; Lee & Ban, 2010).

Number of Common Testlets

For each test form, the number of common testlets had two levels, 2 and 4, which conformed with what Li et al. (2005) used in their simulation study. The number of common items necessary to put item parameter estimates on the same scale has been addressed in previous research. One of the common findings is that the accuracy of the equating results improves as the number of linking items increases (Wingersky, Cook, & Eignor, 1986; Kim & Cohen, 1998; Kim & Cohen, 2002; Hanson & Béguin, 2002). Wingersky and Lord (1984) found that not only the number of common items but also the properties of the common items affected the accuracy of linking. In the most extreme case, they claimed that two good linking items, i.e., items with small standard errors, worked almost as well as a set of 25 linking items. As a rule of
thumb, Cook and Eignor (1991) suggested that common items should make up 20% of the total items in a test. The numbers of common testlets used in the current study were above the 20% rule; thus, it is reasonable to select these values.

Calibration Methods

Separate and concurrent calibrations were used to estimate the item parameters for the three IRT models. For the separate calibration, four linking methods, i.e., mean/mean, mean/sigma, Haebara, and Stocking-Lord, were employed to find the linking parameters under each IRT model. Altogether, five methods were used to put the item parameter estimates on the same metric.

Mean/mean and mean/sigma methods for polytomous models

In Chapter 2, the four linking methods for the dichotomous IRT model were presented; they can be applied to the polytomous IRT models as well. With the GRM, for the mean/mean method, the means of the a parameter estimates over the common items on the two forms from the separate calibrations are calculated and substituted for the parameters in Equation (2-3) to get the linking parameter A. The means of the b parameter estimates are calculated over all the categories within all the common items on the two forms from the separate calibrations; these b parameter means for the two forms can then be plugged into Equation (2-4) to find the linking parameter B. For the mean/sigma method, the standard deviations of the b parameter estimates, calculated over all the categories within all the common items on the two forms from the separate calibrations, are substituted for the parameters in Equation (2-5) to get the linking parameter A; the linking parameter B is found in the same way as in the mean/mean method. Similarly, the linking parameters A and B can be found with the mean/mean and mean/sigma methods under the GPC model. For the mean/mean method, the mean of the a parameter estimates over all the common items and the mean of the category difficulty parameter estimates
over all categories within all common items are calculated and substituted into Equations (2-3) and (2-4), respectively, to get the linking parameters. For the mean/sigma method, the mean and standard deviation of the category difficulty parameter estimates over all categories within all common items are found and substituted into Equations (2-4) and (2-5) to get A and B.

Characteristic curve methods

The Haebara and Stocking-Lord methods can be used with polytomous IRT models just as with the dichotomous models. The only difference is that with polytomous models, both the Haebara and the Stocking-Lord methods consider category responses as well as item responses.

Concurrent calibration

For concurrent calibration, the item parameters for the items on both forms to be equated can be estimated simultaneously in a single run of the estimation software, and all item parameter estimates are automatically put on the same scale. The R package ltm (Rizopoulos, 2006) was used for both separate and concurrent calibrations in this study. MMLE (marginal maximum likelihood estimation) was implemented in ltm for item parameter estimation with all three models adopted in the current study.
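To make the moment methods concrete, the sketch below computes the mean/mean and mean/sigma constants from hypothetical common-item estimates, using the orientation in which the target-form estimates are transformed to the base-form scale (for the GRM and GPC, the b vectors would pool the category difficulty estimates over all categories of the common items). This is an illustration of the arithmetic only, not the study's analysis code.

# Hypothetical common-item estimates on each form
a_base <- c(0.9, 1.1, 1.3, 0.8);  b_base <- c(-0.5, 0.2, 0.9, 1.4)
a_tgt  <- c(1.0, 1.2, 1.5, 0.9);  b_tgt  <- c(-0.9, -0.2, 0.5, 1.0)

A_mm <- mean(a_tgt) / mean(a_base)         # mean/mean slope
A_ms <- sd(b_base) / sd(b_tgt)             # mean/sigma slope
B_ms <- mean(b_base) - A_ms * mean(b_tgt)  # intercept (shown for mean/sigma)

# Put the target-form estimates on the base-form scale
b_tgt_rescaled <- A_ms * b_tgt + B_ms
a_tgt_rescaled <- a_tgt / A_ms

The characteristic curve methods replace these moments with a loss function defined on the item (or category) response curves that is minimized numerically, which is how they are implemented in the plink package used in this study.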

Parameters Held Constant

The distributions of the item discrimination parameter $a_j$ and the item difficulty parameter $b_j$ in the 2PL testlet model were held constant across the simulation conditions. The distributions of $a_j$ and $b_j$ were set to the same ones used in Bradlow, Wainer, and Wang (1999), which were chosen to match the marginal distributions observed for the Scholastic Aptitude Test (SAT); therefore, they are representative of the real data used in standardized tests. The parameter values and distributions used to generate the population data for the 2PL testlet model and the other manipulated factors are summarized in Table 3-1.

Evaluation Criteria

The item parameter estimates are put on the same scale once the IRT linking process for the separate calibration is finished, or after the concurrent calibration is implemented. If a test is scored using estimated IRT abilities, it is not necessary to obtain an equating relationship between the number correct scores on the forms to be equated: the ability estimates can be transformed to scale scores so that the final scores reported to the examinees are positive integers rather than negative or non-integer values of the ability estimates (Kolen & Brennan, 2004). However, there are some problems with doing so. One problem is that the abilities are estimated using examinees' item response patterns rather than number correct scores; consequently, it is possible that different ability estimates are obtained for examinees who have the same number correct score. In addition, the computation of the IRT ability estimates is an intensive and costly process (Kolen & Brennan, 2004), and relatively large measurement error is produced for examinees whose abilities are at the two ends of the group ability distribution, i.e., in the low and high ranges of ability. As a result, number correct scores are still used, and they need to be equated across the two forms even when the tests are constructed and linked within an IRT framework. In this case, IRT true score equating and observed score equating can be applied to achieve this goal (Kolen & Brennan, 2004). Conventional linear equating methods for nonequivalent groups were used in the current study as the baseline methods, since they use total test scores and are not subject to the influence of violation of the local independence assumption. The literature showed that the conventional equipercentile equating method for observed scores does not work well for anchor test designs and was not recommended (Lord & Wingersky, 1984). Therefore, three linear
equating methods, Tucker (Gulliksen, 1950) and Levine (1955) observed score equating and Levine (1955) true score equating, were chosen to act as the baseline methods in the current study.

Tucker Observed Score Equating Method

Traditional linear equating for the random groups design sets the standardized deviation scores (z-scores) on the two forms to be equal, which allows the differences between the two test forms to vary along the score scale rather than being a constant. The relationship is defined as

$\frac{x - \mu(X)}{\sigma(X)} = \frac{y - \mu(Y)}{\sigma(Y)}$. (3-6)

Solving for y in Equation (3-6),

$l_Y(x) = \frac{\sigma(Y)}{\sigma(X)}\left[x - \mu(X)\right] + \mu(Y)$, (3-7)

where $l_Y(x)$ refers to a score x on Form X converted to the scale of Form Y using linear equating. When this linear conversion is applied to the common-item nonequivalent groups design, a single synthetic population, consisting of a combination of the two populations, Population 1 and Population 2, is employed by using weights $w_1$ and $w_2$, where $w_1 + w_2 = 1$ and $w_1, w_2 \geq 0$. Here, assume Form X is taken by Population 1 and Form Y is taken by Population 2. The linear equation for transforming the observed scores on Form X to the scale of the observed scores on Form Y is the same as Equation (3-7), except for the additional notation for the synthetic population, s:

$l_{Y_s}(x) = \frac{\sigma_s(Y)}{\sigma_s(X)}\left[x - \mu_s(X)\right] + \mu_s(Y)$. (3-8)

There are four parameters for this synthetic population, i.e., $\mu_s(X)$, $\mu_s(Y)$, $\sigma_s(X)$, and $\sigma_s(Y)$. In practice, these parameters are estimated from the data based on various statistical assumptions.
The Tucker observed score method makes two types of assumptions concerning the relationships between the total scores and the common-item scores on each form in order to solve for these parameter estimates. First, the regression of the Form X total score X on the common-item score V is assumed to be the same for both Populations 1 and 2. Thus, the regression slope of X on V across the two populations, denoted $\gamma_1$, should be equal and can be expressed in terms of quantities that are directly observable in Population 1,

$\gamma_1 = \frac{\sigma_1(X, V)}{\sigma_1^2(V)}$, (3-9)

where $\gamma_1$ is the regression slope of X on V in Population 1. In the same way, the regression slope of Y on V can be expressed in terms of quantities that are directly observable in Population 2,

$\gamma_2 = \frac{\sigma_2(Y, V)}{\sigma_2^2(V)}$. (3-10)

Second, the conditional variance of X given V is assumed to be equal across Populations 1 and 2, and the same assumption holds for Y given V. These assumptions can be expressed as

$\sigma_1^2(X)\left[1 - \rho_1^2(X, V)\right] = \sigma_2^2(X)\left[1 - \rho_2^2(X, V)\right]$ (3-11)

and

$\sigma_1^2(Y)\left[1 - \rho_1^2(Y, V)\right] = \sigma_2^2(Y)\left[1 - \rho_2^2(Y, V)\right]$, (3-12)

where $\rho$ indicates the correlation. These two assumptions make it possible to solve for all the parameters of the synthetic population that are not directly observable in terms of observable quantities. After some substitutions and algebraic operations, the synthetic population means and variances in Equation (3-8) are

$\mu_s(X) = \mu_1(X) - w_2\gamma_1\left[\mu_1(V) - \mu_2(V)\right]$, (3-13)

$\mu_s(Y) = \mu_2(Y) + w_1\gamma_2\left[\mu_1(V) - \mu_2(V)\right]$, (3-14)

$\sigma_s^2(X) = \sigma_1^2(X) - w_2\gamma_1^2\left[\sigma_1^2(V) - \sigma_2^2(V)\right] + w_1 w_2\gamma_1^2\left[\mu_1(V) - \mu_2(V)\right]^2$, (3-15)
and

$\sigma_s^2(Y) = \sigma_2^2(Y) + w_1\gamma_2^2\left[\sigma_1^2(V) - \sigma_2^2(V)\right] + w_1 w_2\gamma_2^2\left[\mu_1(V) - \mu_2(V)\right]^2$. (3-16)

Note that the means and variances of the synthetic population for Form X and Form Y are obtained through adjustments, involving $\gamma_1$ and $\gamma_2$, to the directly observable quantities in their respective populations.

Levine Observed Score Equating Method

The Levine observed score method also uses the same linear equating function as the Tucker observed score method to set up the relationship between the observed scores on X and on Y. However, this method makes assumptions about the true scores on X, Y, and V. First, the correlations between the true scores $T_X$ and $T_V$, and between $T_Y$ and $T_V$, in both populations are 1, because X, Y, and V are all measuring the same thing; thus

$\rho_1(T_X, T_V) = \rho_2(T_X, T_V) = 1$ (3-17)

and

$\rho_1(T_Y, T_V) = \rho_2(T_Y, T_V) = 1$. (3-18)

Second, the regression of $T_X$ on $T_V$ is assumed to be the same for both Populations 1 and 2, which also holds for the regression of $T_Y$ on $T_V$. Third, the Levine observed score method assumes that the measurement error variances for X, Y, and V are the same across the two populations. With these three assumptions, under the classical congeneric model (Feldt & Brennan, 1989) with an internal anchor, the estimates of $\gamma_1$ and $\gamma_2$ are derived as

$\gamma_1 = \frac{\sigma_1^2(X)}{\sigma_1(X, V)}$ (3-19)

and

$\gamma_2 = \frac{\sigma_2^2(Y)}{\sigma_2(Y, V)}$. (3-20)

(For a detailed description of this method, see Kolen and Brennan, 2004.)
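Because the Tucker and Levine observed score methods differ only in the $\gamma$ terms, both fit in one small function. The following R sketch, with hypothetical variable names (x1 and v1 are the total and common-item scores in Population 1; y2 and v2 are the Form Y analogues in Population 2; sample moments are used for simplicity), computes the synthetic moments of Equations (3-13) through (3-16) and returns the linear conversion of Equation (3-8).

tucker_linear <- function(x1, v1, y2, v2, w1 = 0.5) {
  w2 <- 1 - w1
  g1 <- cov(x1, v1) / var(v1)  # Tucker gamma_1, Eq. (3-9)
  g2 <- cov(y2, v2) / var(v2)  # Tucker gamma_2, Eq. (3-10)
  dmu <- mean(v1) - mean(v2)
  dvr <- var(v1) - var(v2)
  mu_sx  <- mean(x1) - w2 * g1 * dmu                            # Eq. (3-13)
  mu_sy  <- mean(y2) + w1 * g2 * dmu                            # Eq. (3-14)
  var_sx <- var(x1) - w2 * g1^2 * dvr + w1 * w2 * g1^2 * dmu^2  # Eq. (3-15)
  var_sy <- var(y2) + w1 * g2^2 * dvr + w1 * w2 * g2^2 * dmu^2  # Eq. (3-16)
  # returns the conversion function of Eq. (3-8)
  function(x) sqrt(var_sy / var_sx) * (x - mu_sx) + mu_sy
}
# Replacing g1 with var(x1) / cov(x1, v1) and g2 with var(y2) / cov(y2, v2)
# gives the Levine observed score method, Eqs. (3-19) and (3-20).

In the actual analyses, these conventional methods were obtained from the R package equate, described in the Data Generation and Analysis Procedure section, rather than hand-coded.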

Levine True Score Method

The Levine true score equating method (Levine, 1955) uses the same assumptions about true scores as Levine observed score equating. The major difference between the two equating methods is that the observed score method converts observed scores on Form X to the scale of observed scores on Form Y, while the true score method converts the true scores on the two forms,

$l_Y(t_X) = \frac{\gamma_2}{\gamma_1}\left[t_X - \mu_1(X)\right] + \mu_2(Y) + \gamma_2\left[\mu_1(V) - \mu_2(V)\right]$, (3-21)

where $t_X$ is a true score on Form X and $\gamma_1$ and $\gamma_2$ are the same as in the Levine observed score method. Equations (3-19) and (3-20) can be used to get $\gamma_1$ and $\gamma_2$ for the true score method under the classical congeneric model for an internal anchor. In practice, observed scores are used instead of true scores in this method, so that Equation (3-21) can be expressed as

$l_Y(x) = \frac{\gamma_2}{\gamma_1}\left[x - \mu_1(X)\right] + \mu_2(Y) + \gamma_2\left[\mu_1(V) - \mu_2(V)\right]$, (3-22)

where $l_Y(x)$ indicates the observed score x on Form X being converted to the scale of the observed scores on Form Y. The justification for this relationship is elaborated in Kolen and Brennan (2004).

IRT True Score Equating

As the term indicates, IRT true score equating relates number correct true scores on the test forms to be equated. After completing the procedure of putting the item parameters from the different test forms on the same scale, the equating relationship between the number correct true scores on the two forms is achieved through their mathematical association with the latent ability: a number correct true score on a given form within the IRT framework is defined as the value of the test characteristic curve at some corresponding ability value. Using
the notation in Kolen and Brennan (2004), a specific number correct true score on Form X can be expressed as equal to the test characteristic curve corresponding to an ability $\theta_i$, so that

$\tau_X(\theta_i) = \sum_{j \in X} p_{ij}(\theta_i; a_j, b_j)$, (3-23)

where $\tau_X(\theta_i)$ is a number correct true score associated with a particular ability $\theta_i$, and the right-hand side of the equation is a summation, over all the items on Form X, of the probabilities of answering each item correctly at ability $\theta_i$. In a similar fashion, the Form Y counterpart number correct true score associated with this ability is defined as

$\tau_Y(\theta_i) = \sum_{j \in Y} p_{ij}(\theta_i; a_j, b_j)$, (3-24)

where $\tau_Y(\theta_i)$ is the resulting Form Y number correct true score corresponding to the same ability $\theta_i$ as in Equation (3-23). Therefore, IRT true score equating involves three steps: 1) specify a true score on one of the forms to be equated, say Form X; 2) find the $\theta_i$ corresponding to this specified true score, using a mathematical method such as the Newton-Raphson method; and 3) find the Form Y counterpart number correct true score by substituting the ability value obtained in step 2 into the right-hand side of Equation (3-24). $\tau_X(\theta_i)$ and $\tau_Y(\theta_i)$ then represent identical levels of ability as long as the IRT model holds (Lord & Wingersky, 1984). In practice, estimated item parameters are used in Equations (3-23) and (3-24), since true scores are parameters that are never known, and a table of corresponding true scores on the forms to be equated can be constructed. Although using number correct observed scores as if they were true scores cannot be justified theoretically, Lord and Wingersky (1984) showed that results from the observed score conversion were empirically similar to the true score conversion.
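The three steps amount to numerically inverting one test characteristic curve and evaluating the other. A minimal R sketch under the 2PL is given below, with hypothetical parameter vectors assumed to be already on a common scale and with uniroot() standing in for the Newton-Raphson method; it also assumes that the specified true score is attainable for an ability in the search interval.

# Test characteristic curve under the 2PL, as in Eqs. (3-23) and (3-24)
tcc <- function(theta, a, b) sum(1 / (1 + exp(-a * (theta - b))))

irt_true_equate <- function(tau_x, aX, bX, aY, bY) {
  # Step 2: find the theta whose Form X true score equals tau_x
  theta <- uniroot(function(t) tcc(t, aX, bX) - tau_x,
                   interval = c(-10, 10))$root
  # Step 3: the Form Y equivalent is the Form Y TCC at that theta
  tcc(theta, aY, bY)
}

# Hypothetical 30-item forms; equate the Form X true scores 1 to 29
set.seed(7)
aX <- runif(30, 0.7, 1.5); bX <- rnorm(30)
aY <- runif(30, 0.7, 1.5); bY <- rnorm(30)
sapply(1:29, irt_true_equate, aX = aX, bX = bX, aY = aY, bY = bY)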

Likewise, the true score for the polytomous IRT models is defined as

$T(\theta) = \sum_{j=1}^{J}\sum_{k=1}^{K_j} W_{jk} P_{jk}(\theta)$, (3-25)

where J is the number of polytomous items, here the number of testlets; $K_j$ is the number of categories within polytomous item j; and $W_{jk}$ is the integer score associated with category k. Usually, a scoring function dictating the value of $W_{jk}$ is chosen from two alternatives: a) $W_{jk} = k$, so that a response associated with the first category gets a score of 1, one with the second category a score of 2, and so on up to a score of $K_j$ for the last category; or b) $W_{jk} = k - 1$, so that a response associated with the first category gets a score of 0, one with the second category a score of 1, and a score of $K_j - 1$ for the last category. The second scoring function was used in the current study. Here, in Equation (3-25), $W_{jk}$ can be viewed as a weight corresponding to response category k of testlet j. The same IRT true score equating procedures as described before were applied to relate the number correct scores on the two forms to be equated.

IRT Observed Score Equating

In IRT observed score equating, the IRT model is used to estimate the observed number correct score distribution on each of the two forms for a population of examinees, and conventional equipercentile equating is applied to convert the scores on the two forms. The observed score distribution on each form for examinees of a given ability is obtained by using the compound binomial distribution. For instance, for Form X, let $p_{ij}$ be the probability that an examinee of ability $\theta_i$ answers item j correctly, and let $q_{ij} = 1 - p_{ij}$. On a three-item test, the probability that this examinee answers all the items incorrectly (x = 0) is $q_{i1}q_{i2}q_{i3}$, the probability of answering all three items correctly (x = 3) is $p_{i1}p_{i2}p_{i3}$, and the probability of a score x = 2 is $p_{i1}p_{i2}q_{i3} + p_{i1}q_{i2}p_{i3} + q_{i1}p_{i2}p_{i3}$. The probabilities for all the observed scores thus form a conditional frequency distribution $f(x \mid \theta_i)$, which can be built up one item at a time with the recursion

$f_r(x \mid \theta_i) = f_{r-1}(x \mid \theta_i)(1 - p_{ir}) + f_{r-1}(x - 1 \mid \theta_i)\, p_{ir} \quad (x = 0, 1, \ldots, r)$, (3-26)

where $f_{r-1}(x \mid \theta_i) = 0$ if $x < 0$ or $x > r - 1$.
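This recursion (Lord & Wingersky, 1984) is compact in code. The sketch below, with hypothetical item probabilities for a single examinee, builds the conditional number correct distribution; the names are illustrative.

# Conditional number correct distribution via the recursion in Eq. (3-26)
lw_recursion <- function(p) {
  f <- 1                                    # after 0 items, P(x = 0) = 1
  for (pr in p) {
    f <- c(f, 0) * (1 - pr) + c(0, f) * pr  # scores 0..r after item r
  }
  f                                         # probabilities for x = 0..length(p)
}

theta <- 0.3                                                # one examinee
p <- 1 / (1 + exp(-1.1 * (theta - c(-1, -0.2, 0.4, 1.2))))  # hypothetical 2PL items
round(lw_recursion(p), 4)                                   # sums to 1

Averaging these conditional distributions over the examinees' abilities gives the marginal score distribution used in the equipercentile step, as shown next.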

60 where = 0 if x < 0 or x > r This recursive procedure could be appli ed to any number of n items to find If the abilities for a finite number of N examinees are provided, then the marginal distribution of the observed score x is: (3 22) In the same manner, for Form Y which has m items and is an alternate form of X the observed number correct score y marginal distribution for a group of M examinees is (3 23) Once the number correct score distribution for the two for ms to be equated was obtained, t he conventional equipercentile method could then be used to find score equivalents, which equates scores from one form to the other by finding the corresponding equal percentile ranks on both forms (Kolen & Brennan, 2004). Similarly, for the polytomous IRT models, the IRT observed score equating can be realized through finding the compound multinomial distribution of the forms to be equated using the recursive procedure. For instance, the probability of earning a score in the first category of item 1 is and the probability of earning a score in second category of item 1 is Following this algorithm, for r > 1 items, the probability of obtaining score x after adding up the r items is for x between and (3 24) where and are the minimum and the maximum scores after adding the r th item, respectively. if or The calculation of the minimum and maximum score after adding a new item is critical to carry out t his recursive procedure. Then for each given ability the observed number correct score distribution of the total scores on the test could be obtained. For a finite number of examinees, the observed score distribution for examinees of various s can be obtain ed by using Equation (3 22) and Equation


Likewise, these distributions were equated using the conventional equipercentile method, as was done with the dichotomous IRT model.

Evaluation Indices for Equating Results

Two indices, the Unweighted Root Mean Square Difference (URMSD) and the Weighted Root Mean Square Difference (WRMSD), were used to evaluate the discrepancy between each IRT equating method and each traditional equating method, as was done in Lee et al. (2001). The URMSD is defined as

\mathrm{URMSD} = \sqrt{ \frac{1}{k} \sum_{i} \left( \hat{y}_i - \tilde{y}_i \right)^2 }     (3-25)

where \hat{y}_i is the equivalent of a number-correct raw score i on the new test form established using the old test form, \tilde{y}_i is another equivalent of the same number-correct raw score i established using the old test form, k is the number of score points on the test, and i indexes the number-correct score points. This unweighted index enables examination of the differences occurring throughout the score scale (Harris & Crouse, 1993). The other index, WRMSD, is defined as

\mathrm{WRMSD} = \sqrt{ \frac{ \sum_{i} f_i \left( \hat{y}_i - \tilde{y}_i \right)^2 }{ \sum_{i} f_i } }     (3-26)

where f_i is the frequency of a number-correct raw score of i on the new test. This index thus attaches relatively more importance to score-point differences that occur with higher frequency, and less or no importance to differences in score ranges where few or no examinees scored.
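A minimal R sketch of the two indices, assuming the reconstructed forms of Equations (3-25) and (3-26) above; the input vectors and values are illustrative.

    # URMSD and WRMSD between two sets of equivalents of the same raw scores:
    # y_hat (e.g., IRT-based) and y_tilde (e.g., a traditional linear method).
    urmsd <- function(y_hat, y_tilde) {
      sqrt(mean((y_hat - y_tilde)^2))
    }
    # freq: new-form raw-score frequencies, used as weights
    wrmsd <- function(y_hat, y_tilde, freq) {
      sqrt(sum(freq * (y_hat - y_tilde)^2) / sum(freq))
    }

    # Illustrative values for a 5-point score scale
    y_hat   <- c(0.2, 1.1, 2.0, 3.2, 4.1)
    y_tilde <- c(0.0, 1.0, 2.1, 3.0, 4.0)
    freq    <- c(5, 20, 40, 25, 10)
    urmsd(y_hat, y_tilde)
    wrmsd(y_hat, y_tilde, freq)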


Data Generation and Analysis Procedure

The population testlet data were generated with the 2PL testlet model by substituting the population parameters for each combination of conditions into Equation (3-1). Dichotomous (0/1) examinee responses to 30 single items were generated using R functions (R Development Core Team, 2013) for both base and target forms. For each condition, 200 replications were used. In the literature comparing separate and concurrent calibration, 50 and 100 replications have been adopted by other researchers (Kim & Cohen, 1998; Kim & Cohen, 2002; Hanson & Béguin, 2002; Lee & Ban, 2010), so using 200 replications is justifiable. After the item responses for each pair of test forms were generated, the ltm package implemented the item parameter estimation using marginal maximum likelihood estimation in both the separate and concurrent calibration stages. The item parameter estimates for the target form from the separate calibration procedure were then placed on the scale of the base form using the R package plink (Weeks, 2010), which provides the four linking methods, i.e., Mean/Mean, Mean/Sigma, Haebara, and Stocking-Lord. Once the scale transformation was finished, IRT true score and observed score equating results were obtained with R functions in plink. The item parameter estimates from the concurrent calibration procedure could be used in plink directly to obtain the IRT true score and observed score equating results. The equating results from three conventional linear equating methods for the common-item nonequivalent groups design were used as baselines to evaluate the performance of the three IRT models for testlet data with separate and concurrent calibration. The R package equate (Albano, 2011) was used to conduct the three conventional equating methods. The final equating results were evaluated using the two indices, URMSD and WRMSD.
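A minimal R sketch of the kind of generation step described above. Equation (3-1) is not visible here, so the sketch assumes the common 2PL testlet model form P(y = 1) = 1 / (1 + exp(-a(theta - b - gamma))) with person-by-testlet effects gamma drawn with the manipulated variance; all parameter values and object names are illustrative, not the study's actual code.

    set.seed(1)
    N <- 500; n_testlet <- 6; n_within <- 5; var_testlet <- 1
    a <- matrix(runif(n_testlet * n_within, 0.7, 1.5), n_testlet, n_within)
    b <- matrix(rnorm(n_testlet * n_within), n_testlet, n_within)
    theta <- rnorm(N, 0, 1)                      # base-group ability, N(0, 1)
    gamma <- matrix(rnorm(N * n_testlet, 0, sqrt(var_testlet)), N, n_testlet)

    resp <- matrix(NA, N, n_testlet * n_within)  # N x 30 binary response matrix
    for (j in 1:n_testlet) {
      for (k in 1:n_within) {
        eta <- a[j, k] * (theta - b[j, k] - gamma[, j])
        p   <- 1 / (1 + exp(-eta))
        resp[, (j - 1) * n_within + k] <- rbinom(N, 1, p)
      }
    }

Setting var_testlet to 0, 0.5, 1, or 2 and shifting the mean of theta for the target group reproduces the manipulated factors in Table 3-1.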


Table 3-1. Parameters and manipulated factors for simulation

Factor                   Base Form                    Target Form
Ability (theta)          N(0,1)                       N(0,1) / N(0.5,1) / N(1,1)
Testlet effect variance  0 / 0.5 / 1 / 2              0 / 0.5 / 1 / 2
Sample size (N)          500 / 1000 / 3000            500 / 1000 / 3000
a parameter
b parameter
No. of single items      30                           30
No. of testlets          6                            6
No. of common testlets   2 / 4                        2 / 4
Models compared          2PL / GRM / GPC              2PL / GRM / GPC
Linking methods          Mean/Mean, Mean/Sigma, Haebara, Stocking & Lord, concurrent estimation (same for both forms)

Note. theta = latent trait; N = sample size; the a and b item parameters were specified by their population means and variances.


CHAPTER 4
RESULTS

The item responses for the 30 within-testlet items were generated from the 2PL testlet model for each combination of the levels of the four between-subjects factors, i.e., variance of the testlet factor, sample size, mean of the target group ability distribution, and number of common testlets. The full crossing of the four factors thus yielded 72 types of data sets (4 x 3 x 3 x 2). Each type of data set consists of two forms, the base form and the target form. Each pair of test forms had equal sample sizes and the same level of testlet effect across the 6 testlets. As presented in Chapter 3, the ability distribution for the base form was fixed at N(0,1), whereas the mean of the target form ability distribution was manipulated to vary at three levels (0, 0.5, 1) with the variance set at 1, which produced the equivalent-groups and nonequivalent-groups conditions. The data sets also varied by the two levels of the number of common testlets shared by the two forms to be equated. Once the 72 types of data sets were generated, each data set underwent item parameter estimation with each of the three unidimensional IRT models. When the 2PL model was fit to a data set, the item responses for the 30 testlet items were treated as binary responses (0/1) to 30 single independent items. Consequently, estimation of the 2PL model produced parameter estimates for the 30 items, including item discrimination and item difficulty. For the two polytomous IRT models, the 30 testlet items were viewed as 6 polytomous items by summing the scores of the five within-testlet items. In this fashion, the sum of the within-testlet items forms 6 possible categories for each testlet-as-polytomous-item (0, 1, 2, 3, 4, 5). Two model estimation procedures, separate and concurrent calibration, were implemented. With the concurrent calibration procedure, the item parameter estimates of the two forms were on the same scale and could be used directly to equate the number-correct scores in the next step.
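A short R sketch of the collapsing step for the polytomous analyses, reusing the illustrative N x 30 binary matrix resp from the earlier generation sketch (column order assumed to be testlet by testlet, five items each).

    # Collapse the five 0/1 within-testlet responses into one polytomous item
    # per testlet, giving the six ordered categories 0-5 used by the GRM and GPC.
    n_testlet <- 6; n_within <- 5
    poly <- sapply(1:n_testlet, function(j) {
      cols <- ((j - 1) * n_within + 1):(j * n_within)
      rowSums(resp[, cols])              # testlet sum score: 0, 1, ..., 5
    })
    colnames(poly) <- paste0("testlet", 1:n_testlet)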


As for the item parameter estimates of the two forms from the separate calibration, IRT linking was employed to put them on the same metric. Four IRT linking methods, Mean/Mean, Mean/Sigma, Stocking-Lord, and Haebara, were applied to obtain the linking constants A and B for this scale transformation. The transformed parameter estimates could then be used in the subsequent equating stage. Two equating results, IRT true score equating (IRT TE) and IRT observed score equating (IRT OE), were calculated for each pair of test forms linked with each of the four IRT linking methods and with the concurrent calibration. Meanwhile, the 72 conditions of generated data sets were also equated by means of the three conventional equating methods, i.e., Tucker observed score equating (TOE), Levine observed score equating (LOE), and Levine true score equating (LTE). Finally, the converted target form score equivalents of the base form yielded by the IRT TE and IRT OE were compared with those obtained from the three conventional equating methods, using URMSD and WRMSD as evaluation criteria. The comparisons were undertaken over equating results from all combinations of conditions using the three unidimensional IRT models with the five IRT parameter scaling procedures (four in the separate calibration plus the concurrent calibration).

Proper Solution Rates of the Three IRT Models

The item parameter estimation for the three unidimensional IRT models was implemented with the ltm package. All estimation runs that ended with error messages or warning messages were regarded as improper solutions. These improper solutions were not used in the following steps. Additional replications were run to replace them with proper solutions, ensuring that the parameter estimates used in the following steps came from 200 replications with proper solutions. For ease and clarity of presentation, the proper solution rate of the model


estimation for each IRT model was operationally defined as the rate at which both forms had proper solutions out of the 200 replications. In the separate calibration conditions, among the three IRT models used to fit the simulated testlet data, the 2PL model had the best proper solution rate across the conditions of interest. Improper solutions occurred in seven conditions, one in the two-common-testlets conditions and six in the four-common-testlets conditions. For these seven conditions, the proper solution rate out of 200 replications ranged from 98.5% to 99.5%, with the lowest occurring in the condition with four common testlets, sample size 3000, testlet variance 0.5, and group mean difference of 1. Improper solutions occurred more frequently in conditions using the two polytomous models. Of the 72 types of data sets, 62.5% had occurrences of improper solutions when estimated with the GRM: 22 in the two-common-testlets conditions and 23 in the four-common-testlets conditions. For these 45 conditions with improper solutions, the rate of proper solutions ranged from 96.5% to 99.5%. The lowest proper solution rate occurred in the condition with four common testlets, sample size 500, testlet variance 1, and group mean difference of 1. Data analyzed with the GPC model encountered relatively higher improper solution rates across conditions: 77.78% of the 72 types of data sets had occurrences of improper solutions, 29 in the two-common-testlets conditions and 27 in the four-common-testlets conditions. The proper solution rate for these 56 conditions ranged from 95.5% to 99.5%. The lowest proper solution rate, 95.5%, occurred in the condition with four common testlets, sample size 500, testlet variance 0, and group mean difference of 1. For data sets fitted with the three IRT models, the lowest proper solution rates all occurred in conditions with four common testlets and a group mean difference of 1. Table 4-1 presents the proper solution rates for data sets analyzed with the three models using separate calibration.


When the item responses of the two forms were estimated simultaneously using concurrent calibration, the occurrence of improper solutions decreased for estimation with all three IRT models, especially for the 2PL model and the GRM. Table 4-2 presents the proper solution rates in the concurrent calibration conditions. When the 2PL model was applied to the simulated data sets, there were no improper solutions in any of the two-common-testlets conditions; the only condition that encountered improper solutions was that with four common testlets, sample size 500, testlet variance 0.5, and group mean difference of 1, and even there the proper solution rate was still as high as 99%. When the data were estimated using the GRM, improper solutions occurred in only two conditions, both with two common testlets, sample size 3000, testlet variance 2, and nonequivalent groups. The proper solution rates of these two conditions were 99% and 98%, respectively. As Table 4-2 shows, the lowest rate, 98%, occurred in the condition with two common testlets, sample size 3000, testlet variance 2, and group mean difference of 1. Parameter estimation with the GRM for data sets containing four common testlets always produced proper solutions; the proper solution rates were 100% for all conditions in this situation. In contrast, more conditions encountered improper solutions when the GPC model was fit to the data sets. Among the 36 two-common-testlets conditions, 26 yielded improper solutions, and the proper solution rate for these conditions ranged from 90.5% to 99.5%. The 90.5% proper solution rate was found in the condition with two common testlets, sample size 3000, testlet variance 0, and group mean difference of 1. Of the 36 conditions with four common testlets, 22 encountered improper solutions. The proper solution rates for these 22 conditions ranged from 94.5% to 99.5%; the condition with four common testlets, sample size 3000, testlet variance 0, and group mean difference of 1 had a proper solution rate of 94.5%.


Inspection of the conditions with improper solutions during the parameter estimation procedure, using either concurrent or separate calibration, made it evident that conditions with larger group mean differences were susceptible to more occurrences of improper solutions; this applies to all three IRT models. Estimation using the GPC model incurred more improper solutions than the 2PL model and the GRM, particularly in conditions employing concurrent calibration. When the separate calibration procedure was carried out, conditions with the two polytomous models yielded more improper solutions than with the 2PL model. Nevertheless, when the concurrent calibration procedure was implemented, improper solutions were rare in conditions using the 2PL model and the GRM, while estimation with the GPC model produced improper solutions in relatively more conditions. On the whole, in terms of the proper solution rate, the 2PL model was superior to the two polytomous models, especially in conditions using separate calibration, and the GRM was superior to the GPC model, particularly in conditions using concurrent calibration.

Missing Category Rate of the Two Polytomous IRT Models

As stated previously, the testlet-as-a-polytomous-item approach treats the whole testlet as a polytomous item by summing the within-testlet item scores. In the current study, each testlet has five dichotomously scored within-testlet items. Under this approach, there are six possible sum scores ordered from 0 to 5, which can be viewed as six ordered categories of a polytomous item; therefore, each testlet-as-a-polytomous-item was supposed to have six categories. However, in the process of summing the within-testlet item scores for each examinee, missing categories sometimes occurred for a testlet-as-polytomous-item. For instance, there might be no sum score of 5 across all examinees for a given polytomous item, or no sum scores of 5 and 4. Tang and Eignor (1997) also reported this phenomenon when they applied the GRM and GPC models to the TOEFL


test. The occurrence of this phenomenon is associated with the sample size of the examinees: a small number of examinees leads to more cases of missing categories, whereas larger sample sizes tend to avoid this problem. If this happened in any replication for either of the two forms of the same data set, the replication was not used in the final equating. The percentage of occurrences for each sample size was documented for the conditions using separate calibration and concurrent calibration, respectively. Within each of the two calibration procedures, the percentage of missing categories was calculated separately for conditions containing two common testlets and four common testlets. Table 4-3 summarizes these missing category rates for the separate calibration and the concurrent calibration. For example, in Table 4-3, under the condition of two common testlets using separate calibration with a sample size of 500, the rate of occurrence of missing categories across the combinations of all other conditions (testlet effect, group mean difference), with 200 replications per condition, is 0.03083. Holding other factors constant, as the sample size increased, the missing category rate decreased to 0.00666 with a sample size of 1000 and to 0.00138 with a sample size of 3000. The same trend is found in the four-common-testlets conditions. Similarly, the rate of missing categories was reduced by larger sample sizes in conditions using concurrent calibration. The missing category rate was relatively lower in concurrent calibration conditions when the sample size was 500, owing to the combination of the responses on the two forms, which doubled the sample size in effect. A simple check of this kind is sketched after this paragraph.
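A minimal R sketch of the screening step, assuming N x 6 matrices of testlet sum scores for the two forms; the object names (poly_base, poly_target) are hypothetical.

    # Flag a replication if any of the six categories (0-5) is unobserved for
    # any testlet-as-polytomous-item on either form.
    has_missing_category <- function(poly) {
      any(apply(poly, 2, function(x) length(unique(x)) < 6))
    }
    drop_replication <- has_missing_category(poly_base) ||
                        has_missing_category(poly_target)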


Influence of Factors of Interest

The main purpose of this research was to investigate the equating performance of the three IRT models; five other factors, i.e., sample size, magnitude of testlet effect, group difference, number of common testlets, and linking method, were also of interest. In addition to these six major factors of interest, three other factors, i.e., the equating result statistic (IRT TE, IRT OE), the baseline equating statistic (TOE, LOE, LTE), and the evaluation criterion statistic (URMSD, WRMSD), were included to calculate and compare the equating results from the three IRT models. Therefore, nine factors were involved in this study. An inspection of the equating results made manifest the high similarity among the three baseline statistics. An overall ANOVA taking all nine factors into account was conducted, in a split-plot design with four between-subjects factors and five within-subjects factors. Eta squared was calculated as an index of effect size for each main effect and interaction appearing in the ANOVA tables. It is defined as

\eta^2 = \frac{SS_{\mathrm{effect}}}{SS_{\mathrm{total}}}     (4-1)

where SS_effect stands for the variation attributable to a specific factor and SS_total is the total variation. It turned out that the eta squared for the main effect of the baseline statistic and for any of its interactions with the other design factors were negligible; all were less than 0.00001. The two evaluation statistics, URMSD and WRMSD, were therefore collapsed over the three baseline statistics.

ANOVAs for IRT True Score Equating

The results using the IRT true score equating statistics and the IRT observed score equating statistics were quite similar, so the ANOVAs with URMSD and WRMSD are reported for IRT true score equating only. Eta squared values were recalculated for the effects in the ANOVA using URMSD and in the ANOVA using WRMSD, respectively. Although many effects were statistically significant at an alpha level of 0.05, the eta squared values in both ANOVAs indicated that only a few effects had substantial effect sizes. For the effects in the ANOVA using URMSD, the rank order of the eta squared values indicated that the number of common testlets had the largest effect size (0.083; F = 3827.09, p < 0.0001). The URMSD under the four-common-testlets conditions had smaller values than those under the two-common-testlets conditions.
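As an illustration of Equation (4-1), the following R sketch computes eta squared from the sums of squares of a fitted ANOVA. It is deliberately simplified to a one-way case with hypothetical names (results, n_common); the study's actual split-plot ANOVA would partition the sums of squares across several error strata.

    # Eta squared = SS_effect / SS_total, from an aov fit
    fit    <- aov(urmsd ~ n_common, data = results)  # `results` is hypothetical
    ss     <- summary(fit)[[1]][["Sum Sq"]]          # effect SS, then residual SS
    eta_sq <- ss[1] / sum(ss)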


This conforms with findings in previous studies that increasing the number of common items improves the accuracy of equating results (Wingersky, Cook & Eignor, 1986; Kim & Cohen, 2002; Hanson & Béguin, 2002). The eta squared values for testlet effect and model were close to each other and ranked next after that of the number of common items, and the effect of the group ability difference ranked fourth. All other effects had eta squared values around 0.001 or lower. The eta squared values for the effects in the ANOVA using WRMSD also showed that the number of common testlets had the largest effect size (0.075; F = 3810.73, p < 0.0001). The effect of model ranked second, followed by the effect size of the testlet effect. All other effects, whether main effects or interactions, had eta squared values smaller than 0.001. The means and standard deviations of URMSD and WRMSD were then calculated across the 200 replications for each combination of conditions to characterize the results. Tables 4-4, 4-5, and 4-6 present the mean URMSD for the IRT true score equating results in the two-common-testlets conditions; Tables 4-7, 4-8, and 4-9 display the mean URMSD for the IRT true score equating results in the four-common-testlets conditions.

Mean URMSD for IRT True Score Equating

From Tables 4-4, 4-5, and 4-6, we can see that in the two-common-testlets conditions the URMSD under the two polytomous models was smaller than that under the 2PL model whichever linking method was used, even in conditions with no testlet effects. Only in nine conditions (highlighted in Table 4-4) was the URMSD under the 2PL model lower than that


under one of the polytomous models. For example, when the sample size is 1000, the testlet effect is 0, and the group mean ability difference is 1, the URMSD with the MM (Mean/Mean) linking method under the 2PL model is 2.114, while in the corresponding condition under the GRM the URMSD is 2.128 and under the GPC model it is 2.132, both slightly higher. This is the only condition in which the equating results from the 2PL model outperformed both polytomous models. For the other eight cases in which the 2PL model had a URMSD lower than that under one of the polytomous models, the corresponding URMSD under the other polytomous model was always lower than that under the 2PL model. In other words, in the same condition there was almost always one polytomous model that outperformed the 2PL model in terms of equating accuracy. It is evident that, overall, the equating results under the 2PL model were inferior to those under the two polytomous models in the two-common-testlets conditions when URMSD was used as the evaluation criterion. Between the two polytomous models, comparisons of the URMSD across the conditions using IRT TE and two common testlets showed that the equating results from the GRM were more accurate than those from the GPC model with the separate linking methods. For instance, in conditions using the Mean/Mean linking method, approximately 67% of the URMSD values under the GRM were lower than those under the GPC model; in conditions using the Mean/Sigma linking method, 78% were smaller. Likewise, this proportion was 72% using the Stocking/Lord method and 75% using the Haebara method. Overall, the two models had similar performance in conditions using the concurrent calibration, but the GPC model performed slightly better, with about 53% of the conditions having lower URMSD.


Another factor with a relatively large effect size is the testlet effect, that is, the variance of the testlet factor. Four levels of this factor were manipulated, i.e., 0/0.5/1/2, representing conditions of no, small, medium, and large testlet effects. The URMSD for IRT TE in the two-common-testlets conditions showed that, holding other factors constant, the values of URMSD decreased as the testlet effect increased; this applies to all three IRT models. Take the conditions under the 2PL model for example: in Table 4-4 we can see that the URMSD under the MM linking method with sample size 500 and group mean difference 0 was 1.975 when the testlet effect was 0, and as the testlet effect increased from 0.5 to 2, the corresponding URMSD slid to 1.889, 1.776, and 1.741. This trend is somewhat counterintuitive for the 2PL model, which would be expected to yield more accurate equating results when there is no testlet effect. Among the conditions without testlet effect, only nine under the 2PL model had smaller URMSD than that under one of the polytomous models, and all of them had a group mean difference of 1. Of these nine conditions, merely one, using the MM linking method, had a smaller URMSD than that under both polytomous models; in almost every condition there is a polytomous model producing more accurate equating results. As the magnitude of the testlet effect increased, the two polytomous models tended to yield better equating results, holding other factors constant, and in conditions with testlet effects the URMSD under the two polytomous models was consistently lower than that under the 2PL model. The factor of group mean difference between the two groups to be equated had an eta squared ranking fourth largest among the effects in the ANOVA for IRT TE with the URMSD criterion. Holding other factors constant, as the group mean difference was enlarged, the URMSD tended to increase.


The effect size for linking method was smaller than 0.001, and the performance of the four linking methods for separate calibration was similar. Equating results from the concurrent calibration were no better than those from the four separate linking methods; in most conditions the concurrent calibration led to larger URMSD than the separate linking methods. For IRT TE with two common testlets, about 42% of the conditions using the concurrent calibration under the 2PL model had URMSD lower than that in the corresponding conditions using the four separate linking methods. This proportion is 22% under the GRM and 28% under the GPC model. The mean URMSD for IRT TE using four common testlets presented the same pattern of effects on the equating results (see Tables 4-7 to 4-9). The two polytomous models outperformed the 2PL model in terms of smaller mean URMSD. Considering the conditions using the four separate linking methods only, equating from the GRM was the most accurate compared with the 2PL and GPC models. Comparing the GRM and the GPC model under the separate linking methods, 64% of the URMSD values from the GRM in conditions using the Mean/Mean method were smaller than those in the corresponding conditions from the GPC model, and about 72% of the URMSD values from the GRM in conditions using each of the other three separate linking methods were lower than those from the GPC model. However, with the concurrent method the GRM was not better than the GPC model: about 56% of the conditions under the GPC model using the concurrent method had smaller URMSD values. When the number of common testlets was increased to four, a total of eight conditions under the 2PL model had URMSD values smaller than those in the corresponding conditions under one of the two polytomous models. Of the eight conditions, seven used the separate linking methods; four of these seven had a sample size of 1000, a testlet effect of 1, and a group mean difference of 0, and three had a sample size of 1000,


a testlet effect of 0, and a group mean difference of 0.5. One condition under the 2PL model using the concurrent calibration had a lower URMSD than the corresponding condition under the GRM. However, none of the URMSD values in the eight conditions was lower than those in the corresponding conditions under both the GRM and the GPC model. In line with the two-common-testlets conditions, the URMSD tended to decrease as the testlet effect became stronger with four common testlets, holding other factors constant, and the equating results tended to be less accurate as the group mean difference was magnified, controlling for other factors. Again, as in the two-common-testlets conditions, when there was no testlet effect most of the conditions under the 2PL model produced less accurate equating results than those under the two polytomous models. When there was no testlet effect, only three conditions under the 2PL model yielded better equating results than those under the GRM; they were conditions with sample size 1000 and a group mean difference of 0.5, using the MM, SL, and HB linking methods. Nevertheless, the corresponding three conditions under the GPC model still had more accurate equating results than the 2PL model. For the conditions under each IRT model, the URMSD values with the four separate linking methods were similar, and they were smaller than those with the concurrent method in most conditions. About 42% of the conditions using the concurrent calibration under the 2PL model had URMSD lower than that in the corresponding conditions using the four separate linking methods; this proportion is 33% under the GRM and 47% under the GPC model. Compared with the two-common-testlets conditions, the concurrent method yielded relatively more accurate equating results in the four-common-testlets conditions. Holding other factors constant, larger sample sizes tended to produce more accurate equating results, as indicated by smaller URMSD values.


Mean WRMSD of IRT True Score Equating

As mentioned before, the ANOVA results and eta squared values for the WRMSD of IRT TE were in close agreement with those for the URMSD. The number of common testlets was the factor with the largest effect size, which was evident in the smaller WRMSD values across conditions using four common items. The mean WRMSD of IRT TE under the three IRT models is summarized in Tables 4-10 to 4-12 for the two-common-testlets conditions and in Tables 4-13 to 4-15 for the four-common-testlets conditions. For data using two common testlets under the 2PL model, except for seven conditions (five with concurrent calibration and two with the MM method; see the highlighted values in Table 4-10), the WRMSD in all other conditions was larger than that under the two polytomous models, indicating less accurate equating results. The equating results under the GRM were most accurate in conditions using the four linking methods for separate calibration: for each separate linking method, the WRMSD under the GRM was smaller than that under the GPC model in about two thirds of the conditions. With concurrent calibration, the GPC model performed better than the GRM and the 2PL model, with more conditions under the GPC model having smaller WRMSD than those under the GRM and the 2PL model. However, within the conditions under each IRT model, the equating results from the concurrent calibration were less accurate than those from the four separate linking methods. For the 2PL model, approximately 39% of the conditions with the concurrent calibration had smaller WRMSD than the four separate linking methods; the percentage is 19% under the GRM and 17% under the GPC model. The four separate linking methods had similar performance under each IRT model investigated.


Controlling for other factors, enhancing the testlet effect led to more accurate equating results. When the testlet effect was absent, the 2PL model did not always yield better equating results than the two polytomous models. Only four conditions without testlet effect under the 2PL model had smaller mean WRMSD than those under one of the polytomous models, and all four had a group mean difference of 1: three used the concurrent calibration, and one used the MM method with a sample size of 1000. For any specific condition of the four, for example the one with the MM method and a sample size of 1000, the mean WRMSD was less only than that under one polytomous model, the GPC model; none of the four cases was better than the corresponding condition under both polytomous models. When the number of common testlets was increased to four, the superiority of the GRM was more salient. Compared with the GPC model, equating results from both separate and concurrent calibrations under the GRM were better, in the sense that more conditions had smaller mean WRMSD. For conditions with the Mean/Mean method under the GRM, about 67% had lower mean WRMSD than those under the GPC model; with the Mean/Sigma method the proportion was 75%, and it rose to 89% with both characteristic curve methods. With the concurrent calibration, the proportion was 58%. The 2PL model was still the weakest, with only two conditions having lower mean WRMSD than those under the GRM, and all the mean WRMSD values under the 2PL model were larger than those in the corresponding conditions under the GPC model. Under each model, the performance of the four separate linking methods differed little. The accuracy of the equating results improved as the number of common testlets increased. Under the 2PL model with four common testlets, 47% of the WRMSD values in conditions using the concurrent calibration were smaller than those using the four separate linking methods,


while it was 39% with two common testlets. Under the GRM, the proportion improved from 19% with two common testlets to 33% with four common testlets; under the GPC model, it rose from 17% to 50%. The equating results became somewhat better with an increase of the testlet effect, holding other factors constant, and this was found with all three models. When there was no testlet effect, only two conditions under the 2PL model had smaller mean WRMSD values than those in the corresponding conditions under the GRM (highlighted in Table 4-13): they used the Mean/Mean method and the Stocking/Lord method, respectively, with sample size 1000 and a group mean difference of 1. However, in the same conditions under the GPC model, the mean WRMSD values were even lower than the two under the 2PL model. The majority of the equating results under the 2PL model in the absence of testlet effect were less accurate than those under the two polytomous models.

Summary of the Results

Evaluation of the overall discrepancy between the equating results, including IRT TE and IRT OE, from each of the three IRT models and the three traditional equating methods was achieved using the two indices URMSD and WRMSD. Inspection of the results showed that, for each condition, the discrepancies from the three traditional equating methods were quite similar based on either URMSD or WRMSD. Therefore, the URMSD and WRMSD from each condition were collapsed over the three levels of the traditional-equating-methods factor. The two evaluation indices also made it evident that the equating results from IRT TE and IRT OE differed little for the same condition. ANOVAs were conducted for the IRT TE and IRT OE results using both URMSD and WRMSD collapsed over the three levels of the traditional equating methods. Only the ANOVA results and eta squared values for IRT TE are presented, given their high


similarity with those from IRT OE. Eta squared values were calculated for all main effects and interactions in the ANOVAs using the URMSD and the WRMSD, respectively, to quantify the magnitude of their effect sizes. According to the rank order of the eta squared values for effects in the ANOVAs using both the URMSD and WRMSD indices, the number of common testlets is the factor that exerted the largest effect on the equating results: results in conditions with four common testlets had smaller discrepancies from the three traditional equating methods than those in conditions with two common testlets. The effect sizes of the testlet effect and model factors were close to each other and ranked after the number of common testlets. The two polytomous models had better overall performance than the 2PL model, especially in conditions with testlet effects present. Even in conditions without testlet effects, the two polytomous models were superior to the 2PL model in most cases; only a few conditions under the 2PL model had more accurate equating results than the corresponding conditions under one of the polytomous models. Of the two polytomous models, equating results under the GRM were more accurate, particularly in conditions using the separate linking methods. With ascending testlet effect, the equating results tended to become more accurate, and this was found with all three IRT models. The effect sizes for all other factors and their interactions were rather small. However, the pattern of the URMSD and the WRMSD indicated that the equating results from the four separate linking methods differed little from one another and were better than those from the concurrent calibration, especially in conditions with two common testlets. The performance of the concurrent calibration improved as the number of common testlets increased to four; this trend was more conspicuous with the GPC model and the 2PL model. When the group mean difference widened, the equating results tended to be less accurate. It was also found that


larger sample sizes could improve the accuracy of the equating results, although the effect size for sample size was tiny. Because the IRT observed score equating results bear a high resemblance to the IRT true score equating results, the mean URMSD and mean WRMSD for IRT OE under the three IRT models are not presented in this chapter; for further reference, the corresponding tables are documented in the Appendix.


Table 4-1. Proper solution rate with the three models using separate calibration

                                  C2                          C4
Var   N     Mean    2PL     GRM     GPC       2PL     GRM     GPC
0     500   0       100%    100%    99.5%     100%    100%    99%
0     500   0.5     100%    99.5%   99.5%     100%    98%     100%
0     500   1       100%    97.5%   100%      99.5%   97.5%   99%
0     1000  0       100%    100%    99.5%     100%    100%    99.5%
0     1000  0.5     100%    99%     100%      100%    100%    99.5%
0     1000  1       100%    98%     100%      100%    99%     98.5%
0     3000  0       100%    100%    99.5%     100%    100%    99.5%
0     3000  0.5     100%    99.5%   99.5%     100%    100%    99%
0     3000  1       100%    98%     99.5%     100%    98.5%   95.5%
0.5   500   0       100%    99.5%   99.5%     100%    100%    99.5%
0.5   500   0.5     100%    99.5%   99.5%     100%    98.5%   99.5%
0.5   500   1       100%    97.5%   99%       99%     99.5%   97.5%
0.5   1000  0       100%    100%    100%      100%    99.5%   100%
0.5   1000  0.5     100%    100%    98.5%     100%    100%    99.5%
0.5   1000  1       100%    100%    99%       99.5%   97.5%   99.5%
0.5   3000  0       100%    100%    100%      99.5%   100%    99.5%
0.5   3000  0.5     100%    100%    99.5%     100%    99.5%   100%
0.5   3000  1       100%    100%    99%       98.5%   100%    99.5%
1     500   0       99.5%   100%    98.5%     100%    100%    100%
1     500   0.5     100%    99.5%   97.5%     100%    98.5%   99%
1     500   1       100%    99%     97.5%     100%    96.5%   98.5%
1     1000  0       100%    100%    99%       100%    100%    100%
1     1000  0.5     100%    100%    99%       100%    100%    99.5%
1     1000  1       100%    97%     98.5%     100%    98.5%   97.5%
1     3000  0       100%    100%    99.5%     100%    99.5%   99.5%
1     3000  0.5     100%    99.5%   100%      100%    99.5%   100%
1     3000  1       100%    98%     100%      100%    99.5%   100%
2     500   0       100%    99%     98.5%     100%    99%     99%
2     500   0.5     100%    99.5%   97%       99.5%   99.5%   99.5%
2     500   1       100%    99%     97.5%     99.5%   98.5%   99%
2     1000  0       100%    100%    99%       100%    99.5%   99.5%
2     1000  0.5     100%    99.5%   97.5%     100%    99.5%   100%
2     1000  1       100%    98.5%   99%       100%    98.5%   99%
2     3000  0       100%    99.5%   99.5%     100%    100%    99.5%
2     3000  0.5     100%    99%     98.5%     100%    100%    100%
2     3000  1       100%    98%     97.5%     100%    99.5%   99%

Note. Var = variance of the testlet factor; N = sample size; Mean = target group population mean ability; C2 = two common testlets; C4 = four common testlets.


Table 4-2. Proper solution rate for the three models using concurrent calibration

                                  C2                          C4
Var   N     Mean    2PL     GRM     GPC       2PL     GRM     GPC
0     500   0       100%    100%    100%      100%    100%    100%
0     500   0.5     100%    100%    100%      100%    100%    100%
0     500   1       100%    100%    99.5%     100%    100%    100%
0     1000  0       100%    100%    99.5%     100%    100%    100%
0     1000  0.5     100%    100%    99.5%     100%    100%    99%
0     1000  1       100%    100%    99%       100%    100%    100%
0     3000  0       100%    100%    98.5%     100%    100%    99.5%
0     3000  0.5     100%    100%    99.5%     100%    100%    100%
0     3000  1       100%    100%    90.5%     100%    100%    94.5%
0.5   500   0       100%    100%    100%      100%    100%    99.5%
0.5   500   0.5     100%    100%    100%      100%    100%    100%
0.5   500   1       100%    100%    99%       99%     100%    98.5%
0.5   1000  0       100%    100%    100%      100%    100%    99.5%
0.5   1000  0.5     100%    100%    100%      100%    100%    100%
0.5   1000  1       100%    100%    100%      100%    100%    99%
0.5   3000  0       100%    100%    100%      100%    100%    99%
0.5   3000  0.5     100%    100%    100%      100%    100%    100%
0.5   3000  1       100%    100%    98%       100%    100%    99%
1     500   0       100%    100%    99%       100%    100%    99%
1     500   0.5     100%    100%    98.5%     100%    100%    100%
1     500   1       100%    100%    99%       100%    100%    97%
1     1000  0       100%    100%    99.5%     100%    100%    99.5%
1     1000  0.5     100%    100%    99.5%     100%    100%    100%
1     1000  1       100%    100%    98.5%     100%    100%    98%
1     3000  0       100%    100%    100%      100%    100%    99.5%
1     3000  0.5     100%    100%    99.5%     100%    100%    100%
1     3000  1       100%    100%    99%       100%    100%    100%
2     500   0       100%    100%    98%       100%    100%    98%
2     500   0.5     100%    100%    97.5%     100%    100%    99.5%
2     500   1       100%    100%    92%       100%    100%    97.5%
2     1000  0       100%    100%    96.5%     100%    100%    99.5%
2     1000  0.5     100%    100%    97%       100%    100%    99.5%
2     1000  1       100%    100%    95%       100%    100%    99%
2     3000  0       100%    100%    99%       100%    100%    100%
2     3000  0.5     100%    99%     99%       100%    100%    99%
2     3000  1       100%    98%     95%       100%    100%    97.5%

Note. Var = variance of the testlet factor; N = sample size; Mean = target group population mean ability; C2 = two common testlets; C4 = four common testlets.


Table 4-3. Rate of missing categories

           Separate                 Concurrent
N          C2         C4           C2         C4
500        0.03083    0.03278      0.0175     0.01167
1000       0.00666    0.005        0.00792    0.0025
3000       0.00138    0.0025       0.00125    0

Note. N = sample size; C2 = two common testlets; C4 = four common testlets.


Table 4-4. Mean URMSD from the 2PL model for IRT true score equating in the two-common-testlets (C2) condition

N     Var   Mean   MM      MS      SL      HB      CC
500   0     0      1.975   1.978   1.927   1.956   1.965
1000  0     0      1.848   1.875   1.818   1.841   1.877
3000  0     0      1.771   1.776   1.766   1.775   1.775
500   0.5   0      1.889   1.919   1.886   1.898   1.856
1000  0.5   0      1.808   1.828   1.805   1.812   1.864
3000  0.5   0      1.940   1.946   1.943   1.943   1.856
500   1     0      1.776   1.811   1.778   1.794   1.762
1000  1     0      1.748   1.759   1.744   1.748   1.727
3000  1     0      1.727   1.738   1.732   1.733   1.634
500   2     0      1.741   1.739   1.735   1.738   1.690
1000  2     0      1.581   1.59    1.587   1.584   1.686
3000  2     0      1.704   1.707   1.705   1.701   1.620
500   0     0.5    1.932   2.022   1.942   1.968   1.914
1000  0     0.5    1.928   1.962   1.914   1.927   1.953
3000  0     0.5    1.785   1.803   1.793   1.798   2.217
500   0.5   0.5    1.957   1.979   1.93    1.959   1.881
1000  0.5   0.5    1.888   1.898   1.886   1.895   1.809
3000  0.5   0.5    1.919   1.919   1.915   1.915   1.900
500   1     0.5    1.816   1.828   1.797   1.815   1.852
1000  1     0.5    1.823   1.849   1.839   1.836   1.989
3000  1     0.5    1.89    1.883   1.886   1.885   1.804
500   2     0.5    1.741   1.766   1.742   1.741   1.709
1000  2     0.5    1.607   1.613   1.601   1.609   1.623
3000  2     0.5    1.574   1.57    1.572   1.57    1.702
500   0     1      2.127   2.162   2.099   2.125   2.436
1000  0     1      2.114   2.132   2.094   2.114   2.534
3000  0     1      1.997   1.994   1.985   1.988   2.563
500   0.5   1      2.115   2.167   2.116   2.131   2.232
1000  0.5   1      2.04    2.054   2.028   2.037   2.255
3000  0.5   1      1.918   1.916   1.914   1.915   2.080
500   1     1      1.916   1.920   1.898   1.904   2.313
1000  1     1      2.011   2.018   1.996   2.008   1.936
3000  1     1      1.856   1.843   1.852   1.847   1.989
500   2     1      1.81    1.838   1.829   1.822   1.864
1000  2     1      1.662   1.68    1.664   1.669   1.929
3000  2     1      1.704   1.712   1.710   1.706   1.718

Note. Var = variance of the testlet factor; N = sample size; Mean = target group population mean ability; C2 = two common testlets; MM = Mean/Mean; MS = Mean/Sigma; SL = Stocking/Lord; HB = Haebara; CC = concurrent calibration.


Table 4-5. Mean URMSD from the GRM for IRT true score equating in the two-common-testlets (C2) condition

N     Var   Mean   MM      MS      SL      HB      CC
500   0     0      1.657   1.624   1.624   1.625   1.691
1000  0     0      1.752   1.64    1.638   1.639   1.594
3000  0     0      1.686   1.655   1.659   1.652   1.573
500   0.5   0      1.485   1.581   1.589   1.575   1.532
1000  0.5   0      1.546   1.58    1.585   1.577   1.628
3000  0.5   0      1.558   1.476   1.478   1.477   1.664
500   1     0      1.528   1.557   1.556   1.553   1.456
1000  1     0      1.527   1.452   1.457   1.448   1.525
3000  1     0      1.497   1.517   1.515   1.514   1.513
500   2     0      1.410   1.367   1.372   1.362   1.551
1000  2     0      1.329   1.301   1.305   1.296   1.793
3000  2     0      1.452   1.359   1.367   1.351   1.559
500   0     0.5    1.762   1.735   1.731   1.734   1.919
1000  0     0.5    1.716   1.685   1.684   1.693   1.780
3000  0     0.5    1.703   1.685   1.680   1.680   1.750
500   0.5   0.5    1.727   1.660   1.663   1.673   1.652
1000  0.5   0.5    1.570   1.568   1.573   1.566   1.601
3000  0.5   0.5    1.665   1.696   1.698   1.695   1.634
500   1     0.5    1.537   1.544   1.549   1.541   1.531
1000  1     0.5    1.637   1.594   1.602   1.590   1.581
3000  1     0.5    1.677   1.593   1.598   1.590   1.595
500   2     0.5    1.461   1.391   1.395   1.386   1.465
1000  2     0.5    1.469   1.389   1.395   1.386   1.706
3000  2     0.5    1.414   1.450   1.455   1.448   1.333
500   0     1      1.956   1.918   1.933   1.947   2.849
1000  0     1      2.128   1.968   1.966   1.964   2.850
3000  0     1      2.031   1.930   1.927   1.921   2.820
500   0.5   1      1.953   1.910   1.919   1.933   2.347
1000  0.5   1      1.771   1.693   1.692   1.697   2.303
3000  0.5   1      1.696   1.713   1.720   1.717   1.949
500   1     1      1.791   1.658   1.662   1.658   1.945
1000  1     1      1.613   1.564   1.567   1.558   1.971
3000  1     1      1.622   1.643   1.647   1.643   1.835
500   2     1      1.523   1.447   1.449   1.448   1.714
1000  2     1      1.496   1.428   1.425   1.428   1.735
3000  2     1      1.553   1.448   1.453   1.439   1.522

Note. Var = variance of the testlet factor; N = sample size; Mean = target group population mean ability; C2 = two common testlets; MM = Mean/Mean; MS = Mean/Sigma; SL = Stocking/Lord; HB = Haebara; CC = concurrent calibration.


Table 4-6. Mean URMSD from the GPC model for IRT true score equating in the two-common-testlets (C2) condition

N     Var   Mean   MM      MS      SL      HB      CC
500   0     0      1.833   1.791   1.786   1.789   1.672
1000  0     0      1.655   1.698   1.691   1.696   1.734
3000  0     0      1.63    1.608   1.544   1.54    1.676
500   0.5   0      1.771   1.714   1.703   1.718   1.520
1000  0.5   0      1.623   1.569   1.567   1.576   1.637
3000  0.5   0      1.528   1.509   1.512   1.507   1.599
500   1     0      1.600   1.563   1.556   1.560   1.558
1000  1     0      1.615   1.596   1.593   1.599   1.504
3000  1     0      1.586   1.592   1.580   1.582   1.617
500   2     0      1.616   1.531   1.529   1.533   1.432
1000  2     0      1.489   1.433   1.469   1.461   1.512
3000  2     0      1.452   1.378   1.380   1.376   1.418
500   0     0.5    1.809   1.742   1.726   1.751   1.710
1000  0     0.5    1.751   1.697   1.699   1.703   1.881
3000  0     0.5    1.699   1.771   1.774   1.766   1.903
500   0.5   0.5    1.699   1.686   1.624   1.631   1.712
1000  0.5   0.5    1.703   1.736   1.789   1.808   1.705
3000  0.5   0.5    1.823   1.781   1.811   1.813   1.741
500   1     0.5    1.733   1.671   1.571   1.668   1.693
1000  1     0.5    1.609   1.561   1.560   1.562   1.623
3000  1     0.5    1.536   1.528   1.528   1.525   1.697
500   2     0.5    1.733   1.663   1.592   1.670   1.622
1000  2     0.5    1.561   1.512   1.510   1.512   1.515
3000  2     0.5    1.404   1.397   1.407   1.388   1.463
500   0     1      1.902   1.874   1.839   1.871   2.290
1000  0     1      2.132   2.073   2.102   2.123   2.174
3000  0     1      1.860   1.847   1.854   1.843   2.374
500   0.5   1      1.942   1.879   1.865   1.883   2.160
1000  0.5   1      1.790   1.797   1.790   1.811   2.012
3000  0.5   1      1.739   1.723   1.722   1.724   1.818
500   1     1      1.845   1.792   1.784   1.806   1.975
1000  1     1      1.726   1.745   1.774   1.781   1.717
3000  1     1      1.643   1.650   1.650   1.646   1.832
500   2     1      1.574   1.514   1.506   1.516   1.718
1000  2     1      1.513   1.521   1.505   1.523   1.641
3000  2     1      1.513   1.507   1.510   1.501   1.498

Note. Var = variance of the testlet factor; N = sample size; Mean = target group population mean ability; C2 = two common testlets; MM = Mean/Mean; MS = Mean/Sigma; SL = Stocking/Lord; HB = Haebara; CC = concurrent calibration.


Table 4-7. Mean URMSD from the 2PL model for IRT true score equating in the four-common-testlets (C4) condition

N     Var   Mean   MM      MS      SL      HB      CC
500   0     0      1.327   1.365   1.316   1.334   1.218
1000  0     0      1.237   1.259   1.23    1.244   1.278
3000  0     0      1.221   1.227   1.214   1.218   1.253
500   0.5   0      1.298   1.318   1.268   1.301   1.336
1000  0.5   0      1.336   1.34    1.318   1.33    1.340
3000  0.5   0      1.264   1.263   1.261   1.264   1.306
500   1     0      1.253   1.255   1.235   1.247   1.215
1000  1     0      1.164   1.171   1.148   1.165   1.328
3000  1     0      1.123   1.123   1.124   1.12    1.188
500   2     0      1.248   1.251   1.246   1.243   1.175
1000  2     0      1.178   1.189   1.174   1.178   1.147
3000  2     0      1.162   1.156   1.156   1.155   1.144
500   0     0.5    1.449   1.453   1.431   1.382   1.361
1000  0     0.5    1.235   1.268   1.236   1.250   1.305
3000  0     0.5    1.342   1.344   1.337   1.343   1.348
500   0.5   0.5    1.298   1.314   1.274   1.293   1.256
1000  0.5   0.5    1.319   1.329   1.308   1.315   1.305
3000  0.5   0.5    1.241   1.241   1.235   1.240   1.279
500   1     0.5    1.340   1.355   1.336   1.343   1.213
1000  1     0.5    1.189   1.195   1.183   1.186   1.228
3000  1     0.5    1.199   1.206   1.202   1.204   1.242
500   2     0.5    1.222   1.219   1.208   1.218   1.182
1000  2     0.5    1.239   1.245   1.232   1.241   1.227
3000  2     0.5    1.191   1.194   1.191   1.192   1.158
500   0     1      1.433   1.474   1.385   1.405   1.473
1000  0     1      1.383   1.411   1.387   1.392   1.399
3000  0     1      1.363   1.375   1.354   1.364   1.457
500   0.5   1      1.283   1.310   1.258   1.282   1.306
1000  0.5   1      1.341   1.338   1.320   1.945   1.357
3000  0.5   1      1.297   1.31    1.304   1.303   1.462
500   1     1      1.317   1.325   1.296   1.308   1.361
1000  1     1      1.296   1.304   1.29    1.295   1.263
3000  1     1      1.352   1.350   1.352   1.350   1.319
500   2     1      1.200   1.208   1.195   1.198   1.143
1000  2     1      1.213   1.217   1.204   1.211   1.245
3000  2     1      1.216   1.216   1.216   1.206   1.217

Note. Var = variance of the testlet factor; N = sample size; Mean = target group population mean ability; C4 = four common testlets; MM = Mean/Mean; MS = Mean/Sigma; SL = Stocking/Lord; HB = Haebara; CC = concurrent calibration.


Table 4-8. Mean URMSD from the GRM for IRT true score equating in the four-common-testlets (C4) condition

N     Var   Mean   MM      MS      SL      HB      CC
500   0     0      1.200   1.188   1.180   1.183   1.244
1000  0     0      1.068   1.067   1.068   1.063   1.162
3000  0     0      1.172   1.171   1.175   1.169   1.170
500   0.5   0      1.115   1.108   1.107   1.109   1.072
1000  0.5   0      1.106   1.100   1.099   1.098   1.054
3000  0.5   0      1.086   1.064   1.061   1.061   1.053
500   1     0      1.154   1.118   1.120   1.116   1.146
1000  1     0      1.102   1.096   1.098   1.092   1.065
3000  1     0      0.978   0.974   0.977   0.970   0.939
500   2     0      1.024   0.986   0.982   0.984   1.002
1000  2     0      1.025   1.010   1.013   1.008   1.063
3000  2     0      1.057   1.030   1.034   1.028   1.016
500   0     0.5    1.225   1.201   1.202   1.199   1.246
1000  0     0.5    1.273   1.263   1.253   1.253   1.142
3000  0     0.5    1.183   1.184   1.179   1.182   1.224
500   0.5   0.5    1.214   1.180   1.177   1.180   1.144
1000  0.5   0.5    1.128   1.123   1.122   1.121   1.167
3000  0.5   0.5    1.113   1.090   1.089   1.090   1.071
500   1     0.5    1.183   1.107   1.109   1.106   1.082
1000  1     0.5    1.088   1.069   1.071   1.067   1.031
3000  1     0.5    1.070   1.051   1.054   1.049   1.092
500   2     0.5    1.067   1.006   1.005   1.005   1.062
1000  2     0.5    0.971   0.947   0.951   0.946   1.096
3000  2     0.5    0.984   0.955   0.958   0.955   1.041
500   0     1      1.354   1.261   1.209   1.246   1.244
1000  0     1      1.291   1.249   1.241   1.257   1.285
3000  0     1      1.296   1.276   1.250   1.250   1.301
500   0.5   1      1.073   1.037   1.026   1.034   1.104
1000  0.5   1      1.233   1.159   1.161   1.164   1.225
3000  0.5   1      1.225   1.208   1.204   1.201   1.232
500   1     1      1.175   1.118   1.109   1.110   1.124
1000  1     1      1.124   1.077   1.083   1.076   1.199
3000  1     1      1.129   1.113   1.118   1.110   1.106
500   2     1      1.088   1.003   1.004   0.998   1.082
1000  2     1      1.080   1.053   1.054   1.048   1.198
3000  2     1      1.077   1.021   1.025   1.019   1.107

Note. Var = variance of the testlet factor; N = sample size; Mean = target group population mean ability; C4 = four common testlets; MM = Mean/Mean; MS = Mean/Sigma; SL = Stocking/Lord; HB = Haebara; CC = concurrent calibration.


Table 4-9. Mean URMSD from the GPC model for IRT true score equating in the four-common-testlets (C4) condition

N     Var   Mean   MM      MS      SL      HB      CC
500   0     0      1.325   1.267   1.274   1.27    1.133
1000  0     0      1.208   1.193   1.184   1.197   1.138
3000  0     0      1.122   1.119   1.115   1.116   1.102
500   0.5   0      1.169   1.169   1.143   1.178   1.088
1000  0.5   0      1.235   1.201   1.198   1.201   1.182
3000  0.5   0      1.109   1.099   1.098   1.100   1.049
500   1     0      1.188   1.189   1.180   1.185   1.122
1000  1     0      1.224   1.208   1.202   1.209   1.107
3000  1     0      1.056   1.041   1.041   1.043   1.093
500   2     0      1.163   1.105   1.105   1.106   1.011
1000  2     0      1.099   1.078   1.084   1.075   1.083
3000  2     0      1.027   1.014   0.983   1.014   0.956
500   0     0.5    1.331   1.296   1.214   1.224   1.196
1000  0     0.5    1.146   1.126   1.108   1.129   1.096
3000  0     0.5    1.162   1.221   1.155   1.153   1.223
500   0.5   0.5    1.210   1.164   1.162   1.178   1.272
1000  0.5   0.5    1.303   1.255   1.221   1.262   1.190
3000  0.5   0.5    1.050   1.044   1.045   1.044   1.187
500   1     0.5    1.155   1.124   1.111   1.124   1.147
1000  1     0.5    1.183   1.162   1.157   1.165   1.123
3000  1     0.5    1.092   1.090   1.088   1.088   1.128
500   2     0.5    1.151   1.092   1.054   1.091   1.044
1000  2     0.5    1.113   1.088   1.085   1.092   1.054
3000  2     0.5    1.004   1.000   0.958   1.000   1.033
500   0     1      1.259   1.245   1.203   1.241   1.294
1000  0     1      1.288   1.236   1.264   1.228   1.356
3000  0     1      1.238   1.240   1.239   1.235   1.280
500   0.5   1      1.215   1.198   1.180   1.200   1.236
1000  0.5   1      1.212   1.234   1.212   1.199   1.209
3000  0.5   1      1.196   1.184   1.203   1.183   1.245
500   1     1      1.238   1.174   1.161   1.179   1.163
1000  1     1      1.182   1.170   1.155   1.171   1.189
3000  1     1      1.138   1.125   1.124   1.127   1.186
500   2     1      1.142   1.097   1.092   1.096   1.053
1000  2     1      1.083   1.077   1.057   1.073   1.106
3000  2     1      1.168   1.159   1.161   1.157   1.089

Note. Var = variance of the testlet factor; N = sample size; Mean = target group population mean ability; C4 = four common testlets; MM = Mean/Mean; MS = Mean/Sigma; SL = Stocking/Lord; HB = Haebara; CC = concurrent calibration.


Table 4-10. Mean WRMSD from the 2PL model for IRT true score equating in the two-common-testlets (C2) condition

N     Var   Mean   MM      MS      SL      HB      CC
500   0     0      2.039   2.041   1.981   2.013   2.050
1000  0     0      1.925   1.950   1.886   1.913   1.942
3000  0     0      1.826   1.832   1.820   1.831   1.811
500   0.5   0      1.946   1.974   1.942   1.952   1.904
1000  0.5   0      1.860   1.882   1.858   1.865   1.926
3000  0.5   0      2.010   2.015   2.013   2.011   1.921
500   1     0      1.801   1.842   1.804   1.822   1.785
1000  1     0      1.780   1.790   1.775   1.778   1.756
3000  1     0      1.760   1.771   1.765   1.766   1.670
500   2     0      1.733   1.731   1.726   1.728   1.678
1000  2     0      1.571   1.582   1.577   1.575   1.676
3000  2     0      1.691   1.693   1.691   1.688   1.611
500   0     0.5    1.947   2.031   1.958   1.98    1.892
1000  0     0.5    1.948   1.976   1.935   1.942   1.954
3000  0     0.5    1.758   1.778   1.768   1.772   2.266
500   0.5   0.5    1.971   1.983   1.938   1.965   1.901
1000  0.5   0.5    1.896   1.901   1.895   1.9     1.829
3000  0.5   0.5    1.931   1.934   1.93    1.928   1.91
500   1     0.5    1.824   1.837   1.806   1.824   1.845
1000  1     0.5    1.822   1.85    1.839   1.834   1.976
3000  1     0.5    1.884   1.877   1.88    1.879   1.798
500   2     0.5    1.702   1.723   1.701   1.697   1.688
1000  2     0.5    1.575   1.582   1.569   1.578   1.583
3000  2     0.5    1.547   1.54    1.544   1.542   1.663
500   0     1      1.946   1.988   1.933   1.946   2.240
1000  0     1      1.896   1.907   1.871   1.885   2.275
3000  0     1      1.83    1.822   1.814   1.813   2.362
500   0.5   1      1.967   2.011   1.951   1.968   2.085
1000  0.5   1      1.921   1.947   1.914   1.924   2.139
3000  0.5   1      1.813   1.812   1.81    1.811   1.936
500   1     1      1.789   1.779   1.761   1.762   2.15
1000  1     1      1.892   1.893   1.872   1.884   1.827
3000  1     1      1.716   1.704   1.711   1.705   1.846
500   2     1      1.713   1.75    1.737   1.73    1.741
1000  2     1      1.579   1.609   1.587   1.594   1.802
3000  2     1      1.623   1.634   1.631   1.628   1.624

Note. Var = variance of the testlet factor; N = sample size; Mean = target group population mean ability; C2 = two common testlets; MM = Mean/Mean; MS = Mean/Sigma; SL = Stocking/Lord; HB = Haebara; CC = concurrent calibration.


Table 4-11. Mean WRMSD from the GRM for IRT true score equating in the two-common-testlets (C2) condition

N     Var   Mean   MM      MS      SL      HB      CC
500   0     0      1.647   1.608   1.609   1.611   1.685
1000  0     0      1.766   1.629   1.626   1.628   1.595
3000  0     0      1.682   1.656   1.660   1.651   1.578
500   0.5   0      1.485   1.598   1.606   1.590   1.537
1000  0.5   0      1.559   1.598   1.604   1.594   1.642
3000  0.5   0      1.564   1.477   1.480   1.477   1.687
500   1     0      1.533   1.559   1.558   1.555   1.460
1000  1     0      1.543   1.459   1.464   1.455   1.541
3000  1     0      1.510   1.530   1.528   1.526   1.527
500   2     0      1.407   1.364   1.369   1.359   1.546
1000  2     0      1.329   1.296   1.300   1.291   1.784
3000  2     0      1.450   1.356   1.363   1.348   1.563
500   0     0.5    1.714   1.681   1.678   1.678   1.831
1000  0     0.5    1.660   1.601   1.601   1.609   1.746
3000  0     0.5    1.670   1.642   1.636   1.636   1.717
500   0.5   0.5    1.713   1.619   1.622   1.631   1.630
1000  0.5   0.5    1.543   1.557   1.561   1.553   1.571
3000  0.5   0.5    1.633   1.671   1.672   1.668   1.604
500   1     0.5    1.521   1.523   1.528   1.520   1.503
1000  1     0.5    1.602   1.580   1.589   1.575   1.552
3000  1     0.5    1.644   1.563   1.568   1.559   1.566
500   2     0.5    1.447   1.373   1.377   1.369   1.433
1000  2     0.5    1.439   1.358   1.363   1.354   1.665
3000  2     0.5    1.398   1.431   1.436   1.429   1.314
500   0     1      1.739   1.729   1.734   1.750   2.448
1000  0     1      1.839   1.681   1.678   1.670   2.486
3000  0     1      1.812   1.734   1.731   1.722   2.455
500   0.5   1      1.813   1.754   1.763   1.768   2.124
1000  0.5   1      1.634   1.575   1.573   1.577   2.077
3000  0.5   1      1.555   1.598   1.602   1.598   1.749
500   1     1      1.658   1.533   1.537   1.532   1.776
1000  1     1      1.535   1.466   1.468   1.458   1.785
3000  1     1      1.486   1.504   1.506   1.500   1.684
500   2     1      1.449   1.394   1.394   1.393   1.589
1000  2     1      1.419   1.352   1.352   1.351   1.593
3000  2     1      1.476   1.396   1.403   1.386   1.428

Note. Var = variance of the testlet factor; N = sample size; Mean = target group population mean ability; C2 = two common testlets; MM = Mean/Mean; MS = Mean/Sigma; SL = Stocking/Lord; HB = Haebara; CC = concurrent calibration.


Table 4-12. Mean WRMSD from the GPC model for IRT true score equating in the two-common-testlets (C2) condition

N     Var   Mean   MM      MS      SL      HB      CC
500   0     0      1.852   1.820   1.813   1.816   1.682
1000  0     0      1.676   1.723   1.716   1.720   1.774
3000  0     0      1.638   1.614   1.547   1.542   1.697
500   0.5   0      1.802   1.746   1.733   1.750   1.533
1000  0.5   0      1.643   1.583   1.578   1.591   1.672
3000  0.5   0      1.542   1.524   1.528   1.521   1.62
500   1     0      1.609   1.567   1.559   1.564   1.574
1000  1     0      1.639   1.623   1.621   1.626   1.517
3000  1     0      1.607   1.614   1.602   1.603   1.643
500   2     0      1.618   1.531   1.530   1.532   1.428
1000  2     0      1.485   1.431   1.465   1.456   1.507
3000  2     0      1.448   1.373   1.376   1.371   1.411
500   0     0.5    1.757   1.706   1.675   1.702   1.671
1000  0     0.5    1.716   1.656   1.658   1.659   1.873
3000  0     0.5    1.693   1.775   1.780   1.770   1.904
500   0.5   0.5    1.684   1.675   1.603   1.608   1.711
1000  0.5   0.5    1.699   1.731   1.782   1.805   1.702
3000  0.5   0.5    1.821   1.765   1.801   1.803   1.714
500   1     0.5    1.705   1.664   1.546   1.655   1.684
1000  1     0.5    1.584   1.531   1.546   1.531   1.617
3000  1     0.5    1.509   1.500   1.500   1.497   1.665
500   2     0.5    1.703   1.632   1.56    1.638   1.589
1000  2     0.5    1.518   1.472   1.469   1.471   1.479
3000  2     0.5    1.388   1.384   1.393   1.374   1.431
500   0     1      1.702   1.668   1.626   1.652   2.071
1000  0     1      1.911   1.832   1.861   1.876   1.955
3000  0     1      1.682   1.644   1.648   1.636   2.210
500   0.5   1      1.800   1.734   1.721   1.729   2.033
1000  0.5   1      1.669   1.675   1.670   1.686   1.890
3000  0.5   1      1.606   1.591   1.587   1.592   1.676
500   1     1      1.721   1.652   1.648   1.662   1.828
1000  1     1      1.645   1.657   1.680   1.679   1.616
3000  1     1      1.483   1.491   1.491   1.485   1.680
500   2     1      1.513   1.441   1.436   1.442   1.615
1000  2     1      1.421   1.436   1.421   1.435   1.550
3000  2     1      1.436   1.435   1.435   1.428   1.416

Note. Var = variance of the testlet factor; N = sample size; Mean = target group population mean ability; C2 = two common testlets; MM = Mean/Mean; MS = Mean/Sigma; SL = Stocking/Lord; HB = Haebara; CC = concurrent calibration.


Table 4-13. Mean WRMSD from the 2PL model for IRT true score equating in the four-common-testlets (C4) condition

N     Var   Mean   MM      MS      SL      HB      CC
500   0     0      1.374   1.413   1.36    1.377   1.247
1000  0     0      1.290   1.311   1.282   1.295   1.337
3000  0     0      1.268   1.272   1.258   1.263   1.302
500   0.5   0      1.343   1.360   1.308   1.341   1.380
1000  0.5   0      1.372   1.377   1.352   1.365   1.381
3000  0.5   0      1.308   1.306   1.304   1.308   1.356
500   1     0      1.280   1.280   1.260   1.272   1.237
1000  1     0      1.181   1.187   1.163   1.180   1.364
3000  1     0      1.146   1.147   1.148   1.143   1.213
500   2     0      1.237   1.242   1.236   1.233   1.168
1000  2     0      1.170   1.181   1.166   1.169   1.140
3000  2     0      1.155   1.149   1.150   1.148   1.139
500   0     0.5    1.440   1.462   1.413   1.381   1.378
1000  0     0.5    1.233   1.280   1.241   1.257   1.335
3000  0     0.5    1.375   1.371   1.370   1.372   1.361
500   0.5   0.5    1.309   1.330   1.287   1.306   1.263
1000  0.5   0.5    1.328   1.340   1.320   1.324   1.326
3000  0.5   0.5    1.262   1.262   1.256   1.262   1.299
500   1     0.5    1.357   1.373   1.353   1.359   1.204
1000  1     0.5    1.189   1.196   1.185   1.186   1.230
3000  1     0.5    1.200   1.208   1.202   1.205   1.252
500   2     0.5    1.194   1.190   1.179   1.188   1.153
1000  2     0.5    1.215   1.220   1.208   1.215   1.199
3000  2     0.5    1.176   1.178   1.176   1.176   1.143
500   0     1      1.370   1.410   1.307   1.329   1.359
1000  0     1      1.347   1.370   1.342   1.349   1.323
3000  0     1      1.300   1.310   1.290   1.301   1.406
500   0.5   1      1.208   1.232   1.182   1.203   1.226
1000  0.5   1      1.268   1.269   1.243   1.843   1.277
3000  0.5   1      1.204   1.226   1.217   1.213   1.355
500   1     1      1.247   1.251   1.226   1.234   1.296
1000  1     1      1.221   1.225   1.211   1.217   1.184
3000  1     1      1.269   1.268   1.269   1.267   1.248
500   2     1      1.136   1.146   1.131   1.133   1.071
1000  2     1      1.142   1.143   1.135   1.139   1.184
3000  2     1      1.128   1.126   1.127   1.118   1.151

Note. Var = variance of the testlet factor; N = sample size; Mean = target group population mean ability; C4 = four common testlets; MM = Mean/Mean; MS = Mean/Sigma; SL = Stocking/Lord; HB = Haebara; CC = concurrent calibration.

Table 4-14. Mean WRMSD from GRM model for IRT true score equating in four common testlets condition

N     σ²γ   μ     MM     MS     SL     HB     CC
500   0     0     1.374  1.413  1.360  1.377  1.247
1000  0     0     1.290  1.311  1.282  1.295  1.337
3000  0     0     1.268  1.272  1.258  1.263  1.302
500   0.5   0     1.343  1.360  1.308  1.341  1.380
1000  0.5   0     1.372  1.377  1.352  1.365  1.381
3000  0.5   0     1.308  1.306  1.304  1.308  1.356
500   1     0     1.280  1.280  1.260  1.272  1.237
1000  1     0     1.181  1.187  1.163  1.180  1.364
3000  1     0     1.146  1.147  1.148  1.143  1.213
500   2     0     1.237  1.242  1.236  1.233  1.168
1000  2     0     1.170  1.181  1.166  1.169  1.140
3000  2     0     1.155  1.149  1.150  1.148  1.139
500   0     0.5   1.440  1.462  1.413  1.381  1.378
1000  0     0.5   1.233  1.280  1.241  1.257  1.335
3000  0     0.5   1.375  1.371  1.370  1.372  1.361
500   0.5   0.5   1.309  1.330  1.287  1.306  1.263
1000  0.5   0.5   1.328  1.340  1.320  1.324  1.326
3000  0.5   0.5   1.262  1.262  1.256  1.262  1.299
500   1     0.5   1.357  1.373  1.353  1.359  1.204
1000  1     0.5   1.189  1.196  1.185  1.186  1.230
3000  1     0.5   1.200  1.208  1.202  1.205  1.252
500   2     0.5   1.194  1.190  1.179  1.188  1.153
1000  2     0.5   1.215  1.220  1.208  1.215  1.199
3000  2     0.5   1.176  1.178  1.176  1.176  1.143
500   0     1     1.370  1.410  1.307  1.329  1.359
1000  0     1     1.347  1.370  1.342  1.349  1.323
3000  0     1     1.300  1.310  1.290  1.301  1.406
500   0.5   1     1.208  1.232  1.182  1.203  1.226
1000  0.5   1     1.268  1.269  1.243  1.843  1.277
3000  0.5   1     1.204  1.226  1.217  1.213  1.355
500   1     1     1.247  1.251  1.226  1.234  1.296
1000  1     1     1.221  1.225  1.211  1.217  1.184
3000  1     1     1.269  1.268  1.269  1.267  1.248
500   2     1     1.136  1.146  1.131  1.133  1.071
1000  2     1     1.142  1.143  1.135  1.139  1.184
3000  2     1     1.128  1.126  1.127  1.118  1.151

Note. σ²γ = variance of the testlet factor; N = sample size; μ = target group population mean ability; C4 = four common testlets; MM = Mean/Mean; MS = Mean/Sigma; SL = Stocking/Lord; HB = Haebara; CC = Concurrent Calibration.

Table 4-15. Mean WRMSD from GPC model for IRT true score equating in four common testlets condition

N     σ²γ   μ     MM     MS     SL     HB     CC
500   0     0     1.374  1.413  1.36   1.377  1.247
1000  0     0     1.290  1.311  1.282  1.295  1.337
3000  0     0     1.268  1.272  1.258  1.263  1.302
500   0.5   0     1.343  1.360  1.308  1.341  1.380
1000  0.5   0     1.372  1.377  1.352  1.365  1.381
3000  0.5   0     1.308  1.306  1.304  1.308  1.356
500   1     0     1.280  1.280  1.260  1.272  1.237
1000  1     0     1.181  1.187  1.163  1.180  1.364
3000  1     0     1.146  1.147  1.148  1.143  1.213
500   2     0     1.237  1.242  1.236  1.233  1.168
1000  2     0     1.170  1.181  1.166  1.169  1.140
3000  2     0     1.155  1.149  1.150  1.148  1.139
500   0     0.5   1.440  1.462  1.413  1.381  1.378
1000  0     0.5   1.233  1.280  1.241  1.257  1.335
3000  0     0.5   1.375  1.371  1.370  1.372  1.361
500   0.5   0.5   1.309  1.330  1.287  1.306  1.263
1000  0.5   0.5   1.328  1.340  1.320  1.324  1.326
3000  0.5   0.5   1.262  1.262  1.256  1.262  1.299
500   1     0.5   1.357  1.373  1.353  1.359  1.204
1000  1     0.5   1.189  1.196  1.185  1.186  1.230
3000  1     0.5   1.200  1.208  1.202  1.205  1.252
500   2     0.5   1.194  1.190  1.179  1.188  1.153
1000  2     0.5   1.215  1.220  1.208  1.215  1.199
3000  2     0.5   1.176  1.178  1.176  1.176  1.143
500   0     1     1.370  1.410  1.307  1.329  1.359
1000  0     1     1.347  1.370  1.342  1.349  1.323
3000  0     1     1.300  1.310  1.290  1.301  1.406
500   0.5   1     1.208  1.232  1.182  1.203  1.226
1000  0.5   1     1.268  1.269  1.243  1.843  1.277
3000  0.5   1     1.204  1.226  1.217  1.213  1.355
500   1     1     1.247  1.251  1.226  1.234  1.296
1000  1     1     1.221  1.225  1.211  1.217  1.184
3000  1     1     1.269  1.268  1.269  1.267  1.248
500   2     1     1.136  1.146  1.131  1.133  1.071
1000  2     1     1.142  1.143  1.135  1.139  1.184
3000  2     1     1.128  1.126  1.127  1.118  1.151

Note. σ²γ = variance of the testlet factor; N = sample size; μ = target group population mean ability; C4 = four common testlets; MM = Mean/Mean; MS = Mean/Sigma; SL = Stocking/Lord; HB = Haebara; CC = Concurrent Calibration.

CHAPTER 5
DISCUSSION AND CONCLUSION

As alternatives for modeling the within-testlet item dependency, the equating performance of the unidimensional dichotomous and polytomous IRT models for testlet-based tests under different levels of local dependency was not consistent in the extant research literature. Some studies found that the testlet-as-polytomous-item approach produced better equating results than the unidimensional dichotomous model when there was violation of local independence, particularly when the violation was serious (Lee et al., 2001), whereas others argued that in conditions with a high level of local dependency, the equating performance of the dichotomous model was even better than that of the testlet-as-polytomous-item approach (Zhang, 2007). These previous studies addressed this issue in the context of a random groups design. To clarify the influence of the magnitude of local dependency on the equating results from the dichotomous IRT model and the polytomous IRT models, the current study investigated this issue by manipulating the degree of local dependency for testlet-based tests using simulated data under a common-item nonequivalent groups design. This design is commonly used in standard testing programs. Furthermore, in order to obtain a more comprehensive understanding of this topic under various testing conditions, besides the magnitude of local dependency, examinee group characteristics (i.e., sample size and group mean difference), item parameter scaling methods (four separate linking methods and concurrent calibration), and number of common testlets were all examined. Both IRT true score equating results and IRT observed score equating results were considered. Equating results from three conventional linear equating methods for the common-item nonequivalent groups design served as baselines. The URMSD and WRMSD were used to quantify the overall discrepancy between the results under the three IRT models and the three baseline equating methods.
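For concreteness, the two discrepancy indices can be written as short R functions. The sketch below assumes their conventional definitions (URMSD weights every raw score point equally; WRMSD weights each point by its relative frequency in the data); the function and argument names are hypothetical, not taken from the simulation code itself.

    # irt_eq and base_eq: equated target-form scores at each raw score point
    # freq: observed frequency of each raw score, used as the WRMSD weights
    urmsd <- function(irt_eq, base_eq) {
      sqrt(mean((irt_eq - base_eq)^2))          # uniform weight at every score point
    }
    wrmsd <- function(irt_eq, base_eq, freq) {
      w <- freq / sum(freq)                     # relative frequencies as weights
      sqrt(sum(w * (irt_eq - base_eq)^2))
    }
    # e.g., wrmsd(eq_from_2pl, eq_from_linear, raw_score_freq)  # names hypothetical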

The IRT true score equating and IRT observed score equating results were very similar across conditions. This finding was in line with what Lord and Wingersky (1984) found concerning the performance of the two equating methods for the common-item nonequivalent groups design. In Lee et al. (2001), the two equating methods produced very close results using a random groups design as well. Different conclusions regarding the two methods were reported in Kolen (1981) and Han et al. (1997) for tests using the random groups design. More research comparing the two methods is needed, especially for equating testlet-based tests.

In each condition, the URMSD and the WRMSD resulting from the three baseline linear methods were very much alike. In the subsequent ANOVA analyses for the effects of interest, the URMSD and WRMSD were therefore collapsed over the three baseline methods, respectively, so that the analyses could focus on the major factors of interest. Two ANOVAs (one using URMSD and one using WRMSD) were conducted for both the IRT true score equating results and the observed score equating results. Because of the high similarity between the IRT true score equating and observed score equating results, only the analysis using the IRT true score equating is presented. Regardless of the evaluation criterion used, be it URMSD or WRMSD, the effects in the two ANOVAs were in close agreement with each other: number of common testlets, model, and testlet effect emerged as the three factors with the largest effect sizes.

Factors of Larger Impact

Number of Common Items

Number of common testlets ranked as the factor with the largest effect on the final equating results. As expected, equating results were more accurate when a larger number of common items was employed. The drop of the URMSD and WRMSD in the four-common-testlets conditions was conspicuous. Table 5-1 demonstrates the decreasing pattern using the mean URMSD of IRT true score equating as an example. The condition used had a sample size of 1000, a testlet effect of 0.5, and a group mean difference of 1, with the Stocking and Lord linking method. Equating under all three IRT models improved when the number of common testlets increased to four, which was about two thirds of the total number of items. Consequently, the discrepancy between the URMSD under each pair of models shrank. This finding echoes other research respecting the effect of the number of common items on the quality of equating results.

Model

As a factor of major interest, model displayed a relatively large effect on the equating results as well. The two polytomous models had consistently better overall performance than the 2PL model. The trend appeared to be more distinct in conditions with four common testlets. Only a paucity of conditions under the 2PL model had more accurate equating results than the corresponding conditions under one of the polytomous models. Among these limited conditions, including conditions without testlet effect, more often than not there was a polytomous model that outperformed the 2PL model in the same condition. One finding of the current study that disagreed with previous research (Lee et al., 2001; Zhang, 2007) was that the dichotomous model did not show an overwhelming advantage over the polytomous models when the testlet effect was absent. This advantage diminished as the number of common items increased. The few conditions without testlet effect in which the 2PL had better performance mostly had some level of group difference. For conditions with neither testlet effect nor group difference, that is, the conditions of equivalent groups without testlet effect, none of the equating results under the 2PL model using separate linking methods were better than those under the two polytomous models. However, in Lee et al. (2001) and Zhang (2007), where the random groups design was employed, the dichotomous model outperformed the polytomous models in conditions without testlet effect.

There are some potential reasons for the divergent findings. The dichotomous model used in the current study was the 2PL model, which does not have a pseudo-guessing parameter, whereas the 3PL model was adopted in the two referenced studies. Since there was no pseudo-guessing parameter for either the GRM or the GPC model, it was considered more appropriate in the current study to compare them with the 2PL model rather than the 3PL model. When IRT true score equating is applied to the 3PL model, the lowest possible true score is the sum of the pseudo-guessing parameters, instead of 0, so a procedure must be conducted to convert the out-of-range true scores on the target form. This could cause different equating results under the two different dichotomous models. Another reason might be ascribed to the practice of linking in the equivalent-groups conditions, albeit unnecessary, in the current study. Linking parameter estimates for equivalent groups might lead to different equating results compared with no linking (Hanson & Béguin, 1999; Béguin, Hanson & Glas, 2000; Lee & Ban, 2010). Lee and Ban (2010) found that separate calibration without linking outperformed separate calibration with linking in some conditions when the groups were equivalent. Although all these studies addressed the condition of equivalent groups, Lee et al. used real data, Zhang (2007) simulated data using a two-dimensional compensatory model, and the current study used the 2PL testlet model as the population model, which is a constrained version of the bi-factor model (Gibbons & Hedeker, 1992). The non-uniform data structure would inevitably induce deviation in the final equating results. Notwithstanding the lack of concordance with other studies in conditions of equivalent groups, the current study also detected that in conditions of nonequivalent groups, the equating performance of the 2PL model could be better than that of the polytomous models when there was no testlet effect, which agreed with other studies. But this agreement was rather limited; it changed once the levels of other factors changed, such as the number of common items. When the testlet effect was absent and the groups were nonequivalent, none of the conditions under the 2PL model was found consistently better than any of the polytomous models in conditions with either two or four common testlets. A close scrutiny revealed that once the number of common testlets increased to four and the group difference reached 1, the overall equating results from the two polytomous models were more accurate than those from the 2PL model even when there was no testlet effect; this was consistent across results from either IRT true score equating or IRT observed score equating, for either URMSD or WRMSD.

The current study showed that the two polytomous models were superior to the 2PL model in terms of the equating results across most of the conditions examined. As for the performance of the two polytomous models, the GRM was better than the GPC model in conditions using separate linking methods, while the GPC model demonstrated slight superiority in conditions using concurrent calibration. Previous comparisons of the GRM and GPC model (Cook, Dodd, & Fitzpatrick, 1999; Maydeu-Olivares, Drasgow, & Mead, 1994; Tang & Eignor, 1997) yielded mixed conclusions concerning parameter estimation. Some researchers found the two had similar performance (Maydeu-Olivares, Drasgow, & Mead, 1994), while others argued that the GPC had slightly better performance. Cook, Dodd, and Fitzpatrick (1999) pointed out that the GRM yielded more information across the whole range of the latent trait than the PC and GPC models, and they recommended further exploration of the practical implications of this phenomenon. As mentioned before, the two models were individually employed in different studies modeling the within-testlet item dependency, but they were never considered together. In Lee et al. (2001), the GRM and the Nominal model (NM; Bock, 1972) were compared with the 3PL model for testlet-based tests. Theoretically, the GPC model is a constrained form of the NM. For the Maps data they used, Lee et al. found that the overall equating with the GRM was better than that with the NM and the 3PL model, which was consistent with the findings of the current study in a sense. In terms of equating using separate calibration, the GRM is the best one among the three models examined for testlet-based tests.
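To make the testlet-as-polytomous-item approach compared throughout this chapter concrete, the following minimal R sketch sums the dichotomous responses within each testlet into one ordered score and fits the GRM and the GPC model with the ltm package, the estimation package used in this study. The toy data and the three-testlet layout are illustrative only, not the simulation design itself.

    library(ltm)
    set.seed(123)
    resp <- matrix(rbinom(500 * 12, 1, 0.6), nrow = 500)  # toy 0/1 data: 12 items
    testlet_of <- rep(1:3, each = 4)                      # three testlets of four items
    # Collapse each testlet into a single summed polytomous score (0..4)
    testlet_scores <- sapply(1:3, function(t)
      rowSums(resp[, testlet_of == t, drop = FALSE]))
    testlet_scores <- as.data.frame(testlet_scores + 1)   # shift categories to 1..5
    fit_grm <- grm(testlet_scores)    # testlets treated as graded response "items"
    fit_gpc <- gpcm(testlet_scores)   # testlets as generalized partial credit "items"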

PAGE 101

101 separate calibration, the GRM is the best one among the three models examined for testlet based tests. Variance of Testlet Factor In the current study, the local dependency of within testlet items was defined as the variance of the testlet factor, Four levels of local dependen cy were manipulated to represent conditions of no, low, medium and high levels of testlet effect. The mean URMSD and WRMSD indicated that as the size of testlet effect increased, the equating results from the three IRT models tended to be more accurate. Th is trend could be found in Lee et al (2001) when the traditional linear method was used as the baseline and the URMSD as the evaluation criteria. However, based on the URMSD from other baseline methods in Lee et al (2001), i.e., mean equating and equiperce ntile equating, only equating under the two polytomous models became more accurate with the increase of the local dependency, whereas the equating under the 3PL model got less accurate instead. Since the equipercentile method for non equivalent groups anch or design does not work well based on the literature (Lord & Wingersky, 1984), it was not selected as a baseline method in current study. Nevertheless, with the presence of testlet effect, equating under the two polytomous models was consistently better th an that under the 2PL model in almost all conditions. It is especially true with the GRM in conditions using four common testlets. It was evident that the polytomous models yielded more accurate equating results compared with the 2PL model if there were te stlet effects. Impact s of Other F actors Linking Method Both separate and concurrent calibrations were used in this study to obtain the parameter estimates for the three IRT models. Parameters estimates from the concurrent calibrations were automatically pu t on the same scale by a single run of the estimation software. It was
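The sketch below illustrates how responses with a given level of σ²γ can be generated under the 2PL testlet model (Bradlow, Wainer, & Wang, 1999), the population model of this study: P(y = 1) = logistic(a_i(θ_j - b_i - γ_jd(i))), with γ_jd(i) ~ N(0, σ²γ). All parameter values here are illustrative rather than the ones used in the simulation.

    set.seed(1)
    n_person <- 1000; n_item <- 12
    testlet_of <- rep(1:3, each = 4)        # d(i): the testlet item i belongs to
    sigma2_gamma <- 0.5                     # manipulated testlet-effect variance

    a <- rlnorm(n_item, 0, 0.3)             # discriminations
    b <- rnorm(n_item)                      # difficulties
    theta <- rnorm(n_person, mean = 0)      # shift the mean for a nonequivalent group
    gamma <- matrix(rnorm(n_person * 3, 0, sqrt(sigma2_gamma)), n_person, 3)

    # a_i * (theta_j - b_i - gamma_jd(i)), expanded term by term
    eta <- outer(theta, a) - matrix(a * b, n_person, n_item, byrow = TRUE) -
      sweep(gamma[, testlet_of], 2, a, `*`)
    resp <- matrix(rbinom(n_person * n_item, 1, plogis(eta)), n_person, n_item)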

PAGE 102

102 operationally used as a method for scale transformation in the current study. Four linking methods were employed to put parameter estimates from the separate calibration on the same sca le. Therefore, five methods were considered for scale transformation. It turned out equating with the four separate linking methods were more accurate than that with the concurrent method under all three IRT models, particularly in conditions of two common testlets. Relatively better performance of the concurrent method was found in conditions with more common testlets and small group differences. Yet it was still inferior to the separate linking methods. This result conformed with what B guin, Hanson, and Glas (2000) found in their study addressing the effect of multidimensionality on separate and concurrent unidimentional estimation in IRT equating. They pointed out that for data presenting multidimensionality the concurrent calibration under t he unidimensional IRT model had worse performance than the corresponding separate calibration. When the unidimensional IRT models were fit to the testlet data, which was a type of data with multidimensionality, equating results from the separate calibratio n were better than that from the concurrent calibration, particularly in conditions with higher degree of group difference. The equating results using the four separate linking methods were similar under each model. However, controlling other factors, the characteristic curve linking methods outperformed the moment methods in most of the conditions in terms of smaller mean URMSD and WRMSD Specifically, the Stocking & Lord method was the best performing separate linking method for both the 2PL and the GPC model, while the Haebara method was the best one for the GRM. This trend was more salient in four common items conditions. In literature, comparisons among the four separate linking methods for unidimensional dichotomous models concluded that the two chara cteristic curve methods generally yielded more stable results than the two moment methods (Kolen & Brennan, 2004). The current study showed that when used to
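For reference, the two moment methods reduce to closed-form slope and intercept constants computed from the common-item parameter estimates, as in the R sketch below for 2PL parameters (object names hypothetical); the two characteristic curve methods have no closed form and instead minimize a loss function over the item (Haebara) or test (Stocking & Lord) characteristic curves.

    # Transformation from the new-form scale to the base scale: theta* = A*theta + B
    mean_sigma <- function(b_new, b_old) {
      A <- sd(b_old) / sd(b_new)              # slope from difficulty SDs
      B <- mean(b_old) - A * mean(b_new)
      c(A = A, B = B)
    }
    mean_mean <- function(a_new, a_old, b_new, b_old) {
      A <- mean(a_new) / mean(a_old)          # slope from mean discriminations
      B <- mean(b_old) - A * mean(b_new)
      c(A = A, B = B)
    }
    # Rescaling the new-form estimates onto the base scale:
    #   b_star <- A * b_new + B;  a_star <- a_new / A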

PAGE 103

103 equate testlet data, the two characteristic methods worked somewhat differently for the three unid imensional IRT models. The Stocking & Lord method had the best performance under the 2PL model and the GPC model, while the Haebara method was the best one under the GRM. In conditions with only two common testlets, the two moment methods and the Haebara m ethod had similar results under the 2PL model and the GPC model, the two moment methods and the Stocking & Lord method had little difference under the GRM. Once the common testlets was up to four, the performance of the characteristic curve method other th an the best performing one under each model also improved. Therefore, the two characteristic curve linking methods were better than the two moment methods. Group Mean Difference and Sample Size The effect size of group mean difference and sample size were rather small although statistically significant. As can be seen that equating under all three IRT models became less accurate when the group difference was widened, which was consistent with previous research (Cohen & Kim, 1998; Kim and Kolen, 2006; B guin, Hanson & Glas, 2000; Li, 2009). When sample size is small, one potential problem for using polytomous models to fit testlet data by summing up the within testlet items scores is there might be more occurrence s of missing categories. But this can be overcome by using larger sample size. The URMSD or WRMSD tended to decrease as the sample size increased. Conclusion As alternative methods for modeling the potential local dependency among within testlet items, comparisons of the equating results from applying the unidimentional dichotomous and polytomous IRT models to testlet based tests were conducted by researchers. T he focus of these comparisons i n general was association between the magnitude of local dependency and the equating accuracy by us ing those unidimentional IRT models. However, findings from empirical
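The missing-category problem mentioned above can be checked directly: with a small sample, some summed testlet scores may never be observed. A small sketch, reusing the hypothetical testlet_scores object built in the earlier scoring sketch:

    n_possible <- 5   # a four-item testlet yields categories 1..5 after the shift
    n_observed <- sapply(testlet_scores, function(x) length(unique(x)))
    which(n_observed < n_possible)   # testlets with at least one unobserved category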

Conclusion

As alternative methods for modeling the potential local dependency among within-testlet items, comparisons of the equating results from applying unidimensional dichotomous and polytomous IRT models to testlet-based tests have been conducted by researchers. The focus of these comparisons, in general, was the association between the magnitude of local dependency and the equating accuracy obtained by using those unidimensional IRT models. However, findings from empirical and simulation studies did not concur with each other. Some (Zhang, 2007) reported that the dichotomous model had better equating performance than the polytomous models if strong local dependency was present, while others found the opposite (Lee et al., 2001). Different polytomous models were employed in these studies, which, to some extent, posed a barrier to comparing the results across them and to generalizing the findings. In addition, only the random groups equating design was examined in these previous equating studies, while another popular equating design, i.e., the common-item nonequivalent groups design, was scarcely addressed. As a result, the goal of the current study was to investigate this issue under the common-item nonequivalent groups design considering a broader number of conditions, so that a more comprehensive understanding of this topic could be achieved.

The results from both IRT true score equating and IRT observed score equating showed that among the three IRT models adopted, the two polytomous models, GRM and GPC, were consistently superior to the 2PL model in conditions with various levels of local dependency. Even in some conditions without local dependency, the two polytomous models were found to outperform the 2PL model, especially when more common testlets were used. In conditions using the separate linking methods, equating results under the GRM were more accurate than those under the GPC model. Equating with concurrent calibration was less accurate than that with separate calibration using linking in most of the conditions, particularly when the number of common testlets was small and the group difference was large. Equating results from the two characteristic curve linking methods turned out to be more accurate than those from the moment methods in conditions with four common testlets. In the two-common-testlets conditions, the Stocking & Lord method yielded more accurate equating results for the 2PL model and the GPC model, whereas the other three linking methods produced very similar results. Meanwhile, the Haebara method had the best performance for the GRM, and the other three linking methods barely had any difference. Number of common items had the largest effect size among all the factors explored, which reinforced the idea from previous studies that equating accuracy improves as the number of linking items increases for the common-item nonequivalent groups design (Wingersky, Cook & Eignor, 1986; Kim & Cohen, 1998; Kim & Cohen, 2002; Hanson & Béguin, 2002).

Limitations and Future Studies

The main purpose of this study was to clarify the conditions under which the unidimensional dichotomous and polytomous IRT models each had comparatively better equating performance when fit to testlet data. The 2PL testlet model (Bradlow, Wainer, & Wang, 1999) was used as the population model to generate testlet data with different levels of local dependency, but it was not used to fit the simulated data and to obtain the IRT true score and observed score equating results. One reason is the large number of conditions considered in the current study and the intensive computation needed for the testlet model estimation, linking, and equating, and hence the long time required for completing the whole simulation and computation. A further study comparing the equating performance of the testlet model, the testlet-as-a-polytomous-item model, and the dichotomous model using both separate and concurrent calibrations needs to be conducted.

Only the 2PL testlet model was used to simulate the population testlet data with various levels of local dependency. Other models that account for the local dependency, such as the bi-factor model, could be used in a future study to simulate and fit the testlet data. He, Li, Wolfe, and Mao (2012) compared the IRT true score equating of unidimensional IRT models including the 1PL, 2PL, and 3PL models, the bi-factor model, and testlet models including the 2PL testlet model and the 3PL testlet model (Wang, Bradlow, & Wainer, 2002) using empirical data under the common-item nonequivalent groups design. The Mean/Mean linking method was used, and the traditional equipercentile and linear equating methods were used as baselines. The data showed a small magnitude of item dependency, and the equating results indicated that the bi-factor model and the testlet models tended to produce equating results closer to those of the two traditional equating methods. The authors also contended that a simulation study considering various levels of local dependency was needed in future research.

Concurrent calibration in practice is not always possible when the scale of the new test should be converted to the scale of the base form, which has already been calibrated. Kim and Cohen (2002) cautioned as well that concurrent calibration might not always be possible or economical, and that combining existing data with new data as implemented in concurrent calibration might cause different equating errors. The concurrent calibration was carried out through the R package ltm in the current study using marginal maximum likelihood estimation (Bock & Aitkin, 1981), which is not appropriate in a strict sense. In practice, estimation software such as BILOG-MG, MULTILOG, or IRTPRO (Cai, Thissen, & du Toit, 2011) should be used to conduct the concurrent estimation in the nonequivalent groups design, since such software can handle multiple groups of examinees with different abilities. It would be informative to implement the concurrent calibration using these programs in the future for analyzing testlet data.

Given the prevalence of the testlet format in standard testing programs and the wide application of the common-item nonequivalent groups design, it would be useful to investigate further the equating performance of the possible approaches that can be used to model the local dependency of testlet data under the common-item nonequivalent groups design. In this way, knowledge of the conditions under which each possible approach works better could be gained.
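For completeness, the shortcut criticized above can be made explicit: responses from both groups are stacked, with each form's unique items treated as not administered, and calibrated in a single ltm run. The following is only a sketch of that practice on toy data (it assumes ltm's default treatment of NA responses as missing), not a recommended procedure:

    library(ltm)
    set.seed(2)
    resp_old <- matrix(rbinom(500 * 8, 1, 0.6), 500)  # old form: items 1-8
    resp_new <- matrix(rbinom(500 * 8, 1, 0.5), 500)  # new form: items 5-12
    pad <- function(m, cols, total) {                 # place a form's items into the
      out <- matrix(NA, nrow(m), total)               # full 12-column layout
      out[, cols] <- m
      out
    }
    combined <- data.frame(rbind(pad(resp_old, 1:8, 12), pad(resp_new, 5:12, 12)))
    fit <- ltm(combined ~ z1)   # one single-group 2PL run; items 5-8 anchor the scale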

Table 5-1. Contrast of mean URMSD in C2 and C4 conditions

Model  C2     C4
2PL    2.082  1.320
GRM    1.692  1.161
GPC    1.790  1.212

Note. C2 = two common testlets; C4 = four common testlets.

APPENDIX
MEAN URMSD AND WRMSD FOR IRT OBSERVED SCORE EQUATING

Table A-1. Mean URMSD from 2PL model for IRT observed score equating in two common testlets condition

N     σ²γ   μ     MM     MS     SL     HB     CC
500   0     0     1.983  1.984  1.934  1.963  1.972
1000  0     0     1.855  1.882  1.824  1.848  1.885
3000  0     0     1.778  1.782  1.772  1.782  1.782
500   0.5   0     1.894  1.922  1.891  1.901  1.861
1000  0.5   0     1.813  1.832  1.810  1.817  1.870
3000  0.5   0     1.948  1.953  1.950  1.950  1.862
500   1     0     1.778  1.811  1.780  1.796  1.764
1000  1     0     1.749  1.760  1.746  1.749  1.730
3000  1     0     1.728  1.739  1.733  1.734  1.635
500   2     0     1.740  1.738  1.735  1.737  1.688
1000  2     0     1.581  1.590  1.587  1.584  1.687
3000  2     0     1.704  1.707  1.705  1.702  1.620
500   0     0.5   1.938  2.027  1.948  1.973  1.922
1000  0     0.5   1.935  1.967  1.920  1.933  1.961
3000  0     0.5   1.791  1.809  1.800  1.804  2.233
500   0.5   0.5   1.962  1.979  1.935  1.963  1.884
1000  0.5   0.5   1.894  1.903  1.892  1.901  1.815
3000  0.5   0.5   1.924  1.924  1.920  1.920  1.905
500   1     0.5   1.819  1.829  1.799  1.817  1.855
1000  1     0.5   1.827  1.853  1.842  1.839  1.995
3000  1     0.5   1.895  1.888  1.891  1.889  1.806
500   2     0.5   1.741  1.766  1.743  1.741  1.708
1000  2     0.5   1.607  1.613  1.601  1.608  1.624
3000  2     0.5   1.575  1.570  1.572  1.571  1.705
500   0     1     2.132  2.165  2.105  2.131  2.452
1000  0     1     2.116  2.133  2.096  2.116  2.548
3000  0     1     2.000  1.996  1.988  1.990  2.580
500   0.5   1     2.119  2.170  2.120  2.134  2.239
1000  0.5   1     2.045  2.059  2.033  2.042  2.263
3000  0.5   1     1.919  1.917  1.915  1.917  2.086
500   1     1     1.914  1.917  1.896  1.901  2.321
1000  1     1     2.016  2.022  2.000  2.012  1.939
3000  1     1     1.859  1.845  1.855  1.849  1.992
500   2     1     1.810  1.839  1.830  1.823  1.865
1000  2     1     1.661  1.679  1.663  1.668  1.931
3000  2     1     1.705  1.714  1.711  1.707  1.720

Table A-2. Mean URMSD from GRM model for IRT observed score equating in two common testlets condition

N     σ²γ   μ     MM     MS     SL     HB     CC
500   0     0     1.651  1.627  1.629  1.631  1.602
1000  0     0     1.645  1.642  1.640  1.641  1.596
3000  0     0     1.680  1.661  1.665  1.658  1.575
500   0.5   0     1.601  1.583  1.591  1.577  1.534
1000  0.5   0     1.612  1.584  1.590  1.581  1.631
3000  0.5   0     1.481  1.478  1.480  1.479  1.669
500   1     0     1.574  1.558  1.558  1.555  1.456
1000  1     0     1.468  1.454  1.459  1.450  1.529
3000  1     0     1.531  1.518  1.516  1.515  1.517
500   2     0     1.380  1.369  1.374  1.364  1.557
1000  2     0     1.315  1.303  1.307  1.298  1.804
3000  2     0     1.371  1.360  1.367  1.352  1.567
500   0     0.5   1.742  1.735  1.732  1.734  1.780
1000  0     0.5   1.562  1.636  1.636  1.571  1.824
3000  0     0.5   1.707  1.687  1.682  1.682  1.752
500   0.5   0.5   1.720  1.665  1.667  1.697  1.656
1000  0.5   0.5   1.604  1.570  1.576  1.567  1.605
3000  0.5   0.5   1.737  1.701  1.704  1.700  1.637
500   1     0.5   1.569  1.547  1.552  1.544  1.535
1000  1     0.5   1.629  1.598  1.606  1.594  1.585
3000  1     0.5   1.626  1.594  1.599  1.591  1.598
500   2     0.5   1.406  1.391  1.395  1.386  1.468
1000  2     0.5   1.409  1.390  1.397  1.387  1.716
3000  2     0.5   1.493  1.454  1.459  1.452  1.339
500   0     1     1.987  1.938  1.953  1.975  2.328
1000  0     1     2.034  1.966  1.967  1.964  2.465
3000  0     1     2.020  1.925  1.930  1.923  2.501
500   0.5   1     1.959  1.943  1.928  1.941  2.047
1000  0.5   1     1.743  1.688  1.692  1.693  2.162
3000  0.5   1     1.738  1.657  1.742  1.716  1.851
500   1     1     1.704  1.659  1.665  1.659  1.947
1000  1     1     1.575  1.565  1.567  1.558  1.784
3000  1     1     1.665  1.645  1.650  1.645  1.839
500   2     1     1.456  1.447  1.449  1.448  1.715
1000  2     1     1.478  1.426  1.425  1.426  1.745
3000  2     1     1.465  1.447  1.453  1.438  1.527

Table A-3. Mean URMSD from GPC model for IRT observed score equating in two common testlets condition

N     σ²γ   μ     MM     MS     SL     HB     CC
500   0     0     1.841  1.800  1.795  1.798  1.682
1000  0     0     1.660  1.704  1.697  1.703  1.744
3000  0     0     1.637  1.615  1.549  1.545  1.685
500   0.5   0     1.776  1.720  1.709  1.724  1.523
1000  0.5   0     1.628  1.574  1.573  1.582  1.643
3000  0.5   0     1.532  1.513  1.517  1.512  1.605
500   1     0     1.603  1.567  1.559  1.564  1.562
1000  1     0     1.621  1.602  1.600  1.606  1.508
3000  1     0     1.590  1.598  1.586  1.587  1.623
500   2     0     1.618  1.534  1.532  1.536  1.434
1000  2     0     1.491  1.436  1.472  1.464  1.515
3000  2     0     1.454  1.379  1.382  1.377  1.421
500   0     0.5   1.812  1.745  1.731  1.756  1.716
1000  0     0.5   1.755  1.702  1.706  1.708  1.893
3000  0     0.5   1.705  1.779  1.783  1.775  1.915
500   0.5   0.5   1.701  1.692  1.629  1.635  1.715
1000  0.5   0.5   1.708  1.745  1.800  1.819  1.714
3000  0.5   0.5   1.830  1.788  1.819  1.822  1.748
500   1     0.5   1.737  1.677  1.575  1.673  1.699
1000  1     0.5   1.613  1.566  1.564  1.567  1.626
3000  1     0.5   1.538  1.529  1.529  1.527  1.702
500   2     0.5   1.736  1.667  1.594  1.674  1.626
1000  2     0.5   1.564  1.515  1.514  1.516  1.520
3000  2     0.5   1.407  1.400  1.410  1.391  1.467
500   0     1     1.901  1.873  1.839  1.869  2.310
1000  0     1     2.140  2.077  2.109  2.130  2.182
3000  0     1     1.859  1.849  1.857  1.846  2.391
500   0.5   1     1.947  1.883  1.871  1.887  2.172
1000  0.5   1     1.796  1.802  1.794  1.815  2.022
3000  0.5   1     1.741  1.726  1.725  1.727  1.824
500   1     1     1.852  1.798  1.790  1.812  1.982
1000  1     1     1.728  1.749  1.779  1.786  1.722
3000  1     1     1.647  1.655  1.654  1.650  1.838
500   2     1     1.575  1.516  1.508  1.518  1.722
1000  2     1     1.514  1.523  1.508  1.526  1.644
3000  2     1     1.514  1.509  1.512  1.503  1.501

Table A-4. Mean URMSD from 2PL model for IRT observed score equating in four common testlets condition

N     σ²γ   μ     MM     MS     SL     HB     CC
500   0     0     1.332  1.368  1.321  1.338  1.222
1000  0     0     1.243  1.263  1.235  1.249  1.285
3000  0     0     1.227  1.232  1.219  1.223  1.259
500   0.5   0     1.301  1.320  1.270  1.303  1.341
1000  0.5   0     1.340  1.343  1.321  1.333  1.344
3000  0.5   0     1.268  1.266  1.264  1.268  1.311
500   1     0     1.254  1.255  1.236  1.247  1.217
1000  1     0     1.165  1.171  1.148  1.165  1.330
3000  1     0     1.124  1.125  1.126  1.122  1.190
500   2     0     1.250  1.252  1.248  1.245  1.175
1000  2     0     1.178  1.189  1.175  1.178  1.147
3000  2     0     1.162  1.156  1.157  1.155  1.145
500   0     0.5   1.460  1.452  1.440  1.386  1.366
1000  0     0.5   1.239  1.270  1.240  1.253  1.310
3000  0     0.5   1.347  1.349  1.343  1.348  1.354
500   0.5   0.5   1.298  1.313  1.275  1.293  1.259
1000  0.5   0.5   1.322  1.331  1.311  1.318  1.308
3000  0.5   0.5   1.245  1.244  1.239  1.244  1.281
500   1     0.5   1.342  1.356  1.337  1.345  1.214
1000  1     0.5   1.190  1.196  1.185  1.187  1.230
3000  1     0.5   1.202  1.209  1.205  1.208  1.245
500   2     0.5   1.221  1.217  1.207  1.217  1.183
1000  2     0.5   1.239  1.244  1.232  1.241  1.229
3000  2     0.5   1.191  1.194  1.191  1.192  1.158
500   0     1     1.439  1.474  1.391  1.410  1.481
1000  0     1     1.387  1.413  1.392  1.395  1.404
3000  0     1     1.367  1.379  1.358  1.368  1.464
500   0.5   1     1.287  1.311  1.262  1.285  1.310
1000  0.5   1     1.343  1.341  1.323  1.947  1.361
3000  0.5   1     1.299  1.313  1.307  1.305  1.467
500   1     1     1.318  1.325  1.298  1.309  1.364
1000  1     1     1.296  1.304  1.290  1.295  1.263
3000  1     1     1.356  1.354  1.356  1.354  1.321
500   2     1     1.200  1.208  1.195  1.198  1.143
1000  2     1     1.213  1.216  1.204  1.211  1.246
3000  2     1     1.216  1.216  1.216  1.205  1.217

Table A-5. Mean URMSD from GRM model for IRT observed score equating in four common testlets condition

N     σ²γ   μ     MM     MS     SL     HB     CC
500   0     0     1.201  1.189  1.181  1.185  1.249
1000  0     0     1.068  1.067  1.069  1.064  1.153
3000  0     0     1.177  1.175  1.180  1.173  1.165
500   0.5   0     1.117  1.110  1.109  1.111  1.073
1000  0.5   0     1.107  1.101  1.101  1.099  1.054
3000  0.5   0     1.088  1.062  1.061  1.059  1.055
500   1     0     1.158  1.120  1.124  1.119  1.142
1000  1     0     1.104  1.098  1.101  1.094  1.067
3000  1     0     0.980  0.977  0.980  0.972  0.940
500   2     0     1.026  0.986  0.983  0.984  1.005
1000  2     0     1.027  1.011  1.015  1.009  1.065
3000  2     0     1.060  1.033  1.037  1.031  1.014
500   0     0.5   1.163  1.208  1.210  1.204  1.249
1000  0     0.5   1.278  1.264  1.256  1.253  1.134
3000  0     0.5   1.186  1.186  1.182  1.185  1.228
500   0.5   0.5   1.218  1.183  1.181  1.184  1.150
1000  0.5   0.5   1.129  1.124  1.124  1.122  1.169
3000  0.5   0.5   1.115  1.092  1.092  1.091  1.075
500   1     0.5   1.188  1.108  1.112  1.107  1.084
1000  1     0.5   1.091  1.072  1.074  1.070  1.035
3000  1     0.5   1.072  1.053  1.057  1.051  1.095
500   2     0.5   1.069  1.006  1.006  1.006  1.062
1000  2     0.5   0.973  0.948  0.953  0.947  1.098
3000  2     0.5   0.986  0.955  0.960  0.956  1.042
500   0     1     1.359  1.225  1.264  1.267  1.239
1000  0     1     1.280  1.253  1.252  1.261  1.275
3000  0     1     1.254  1.251  1.252  1.251  1.297
500   0.5   1     1.075  1.037  1.027  1.034  1.096
1000  0.5   1     1.234  1.165  1.169  1.171  1.226
3000  0.5   1     1.229  1.209  1.206  1.201  1.230
500   1     1     1.179  1.117  1.111  1.109  1.125
1000  1     1     1.126  1.080  1.087  1.079  1.175
3000  1     1     1.131  1.114  1.120  1.112  1.108
500   2     1     1.092  1.001  1.005  0.996  1.085
1000  2     1     1.084  1.053  1.056  1.048  1.207
3000  2     1     1.079  1.022  1.027  1.020  1.108

Table A-6. Mean URMSD from GPC model for IRT observed score equating in four common testlets condition

N     σ²γ   μ     MM     MS     SL     HB     CC
500   0     0     1.332  1.272  1.282  1.276  1.137
1000  0     0     1.215  1.201  1.192  1.204  1.144
3000  0     0     1.128  1.125  1.122  1.123  1.108
500   0.5   0     1.174  1.173  1.148  1.182  1.091
1000  0.5   0     1.241  1.206  1.204  1.207  1.185
3000  0.5   0     1.113  1.103  1.102  1.104  1.052
500   1     0     1.191  1.192  1.184  1.189  1.125
1000  1     0     1.229  1.212  1.206  1.213  1.111
3000  1     0     1.060  1.045  1.044  1.047  1.097
500   2     0     1.166  1.108  1.108  1.109  1.013
1000  2     0     1.101  1.079  1.085  1.076  1.085
3000  2     0     1.029  1.016  0.985  1.017  0.958
500   0     0.5   1.340  1.305  1.220  1.230  1.202
1000  0     0.5   1.148  1.127  1.109  1.131  1.099
3000  0     0.5   1.166  1.228  1.159  1.157  1.227
500   0.5   0.5   1.213  1.167  1.165  1.181  1.276
1000  0.5   0.5   1.308  1.259  1.225  1.267  1.193
3000  0.5   0.5   1.052  1.047  1.048  1.047  1.193
500   1     0.5   1.158  1.127  1.114  1.127  1.151
1000  1     0.5   1.187  1.166  1.161  1.168  1.127
3000  1     0.5   1.096  1.093  1.092  1.092  1.132
500   2     0.5   1.153  1.094  1.056  1.093  1.046
1000  2     0.5   1.116  1.091  1.088  1.095  1.057
3000  2     0.5   1.006  1.002  0.959  1.002  1.036
500   0     1     1.264  1.248  1.208  1.245  1.299
1000  0     1     1.294  1.240  1.270  1.233  1.363
3000  0     1     1.245  1.247  1.246  1.242  1.285
500   0.5   1     1.223  1.204  1.188  1.207  1.243
1000  0.5   1     1.215  1.237  1.215  1.202  1.214
3000  0.5   1     1.199  1.187  1.208  1.187  1.250
500   1     1     1.242  1.176  1.164  1.182  1.166
1000  1     1     1.185  1.172  1.157  1.174  1.192
3000  1     1     1.141  1.127  1.126  1.130  1.189
500   2     1     1.143  1.098  1.093  1.098  1.056
1000  2     1     1.084  1.078  1.058  1.074  1.108
3000  2     1     1.170  1.161  1.163  1.159  1.092

Table A-7. Mean WRMSD from 2PL model for IRT observed score equating in two common testlets condition

N     σ²γ   μ     MM     MS     SL     HB     CC
500   0     0     2.041  2.041  1.982  2.014  2.051
1000  0     0     1.927  1.951  1.887  1.914  1.945
3000  0     0     1.828  1.834  1.822  1.833  1.813
500   0.5   0     1.947  1.974  1.942  1.952  1.906
1000  0.5   0     1.863  1.883  1.860  1.866  1.930
3000  0.5   0     2.014  2.018  2.016  2.015  1.926
500   1     0     1.801  1.842  1.806  1.823  1.785
1000  1     0     1.780  1.790  1.775  1.778  1.759
3000  1     0     1.760  1.772  1.766  1.767  1.671
500   2     0     1.733  1.730  1.726  1.728  1.677
1000  2     0     1.572  1.581  1.578  1.575  1.677
3000  2     0     1.692  1.694  1.692  1.689  1.612
500   0     0.5   1.947  2.029  1.959  1.979  1.894
1000  0     0.5   1.951  1.977  1.937  1.943  1.957
3000  0     0.5   1.758  1.778  1.769  1.772  2.274
500   0.5   0.5   1.971  1.977  1.939  1.965  1.902
1000  0.5   0.5   1.898  1.903  1.897  1.902  1.833
3000  0.5   0.5   1.934  1.937  1.932  1.931  1.912
500   1     0.5   1.825  1.837  1.807  1.825  1.847
1000  1     0.5   1.823  1.851  1.840  1.835  1.980
3000  1     0.5   1.887  1.880  1.883  1.881  1.800
500   2     0.5   1.702  1.722  1.701  1.697  1.688
1000  2     0.5   1.575  1.582  1.569  1.577  1.583
3000  2     0.5   1.547  1.541  1.545  1.542  1.665
500   0     1     1.941  1.982  1.930  1.942  2.249
1000  0     1     1.894  1.903  1.869  1.882  2.283
3000  0     1     1.829  1.820  1.813  1.811  2.372
500   0.5   1     1.964  2.006  1.947  1.964  2.090
1000  0.5   1     1.919  1.945  1.912  1.922  2.145
3000  0.5   1     1.811  1.810  1.809  1.810  1.942
500   1     1     1.787  1.775  1.759  1.759  2.156
1000  1     1     1.891  1.892  1.871  1.883  1.830
3000  1     1     1.716  1.703  1.711  1.705  1.850
500   2     1     1.712  1.749  1.736  1.729  1.743
1000  2     1     1.577  1.607  1.585  1.592  1.803
3000  2     1     1.622  1.633  1.630  1.627  1.625

Table A-8. Mean WRMSD from GRM model for IRT observed score equating in two common testlets condition

N     σ²γ   μ     MM     MS     SL     HB     CC
500   0     0     1.631  1.609  1.610  1.613  1.568
1000  0     0     1.630  1.629  1.625  1.627  1.596
3000  0     0     1.672  1.658  1.663  1.654  1.579
500   0.5   0     1.616  1.599  1.607  1.591  1.539
1000  0.5   0     1.629  1.601  1.607  1.597  1.645
3000  0.5   0     1.482  1.477  1.480  1.477  1.690
500   1     0     1.577  1.560  1.560  1.556  1.461
1000  1     0     1.476  1.461  1.466  1.457  1.546
3000  1     0     1.545  1.531  1.529  1.528  1.530
500   2     0     1.380  1.365  1.371  1.361  1.552
1000  2     0     1.310  1.298  1.302  1.293  1.795
3000  2     0     1.368  1.356  1.364  1.348  1.570
500   0     0.5   1.698  1.681  1.679  1.679  1.712
1000  0     0.5   1.482  1.567  1.565  1.496  1.787
3000  0     0.5   1.668  1.641  1.635  1.635  1.720
500   0.5   0.5   1.682  1.621  1.625  1.663  1.632
1000  0.5   0.5   1.590  1.559  1.564  1.555  1.573
3000  0.5   0.5   1.701  1.675  1.678  1.673  1.606
500   1     0.5   1.553  1.525  1.531  1.522  1.506
1000  1     0.5   1.606  1.583  1.591  1.578  1.555
3000  1     0.5   1.590  1.564  1.570  1.560  1.569
500   2     0.5   1.393  1.375  1.379  1.370  1.435
1000  2     0.5   1.378  1.360  1.365  1.357  1.674
3000  2     0.5   1.474  1.434  1.440  1.432  1.320
500   0     1     1.756  1.762  1.762  1.775  2.067
1000  0     1     1.753  1.678  1.676  1.668  2.179
3000  0     1     1.817  1.733  1.732  1.725  2.216
500   0.5   1     1.810  1.786  1.774  1.780  1.853
1000  0.5   1     1.622  1.571  1.571  1.573  1.961
3000  0.5   1     1.562  1.543  1.577  1.557  1.666
500   1     1     1.590  1.533  1.540  1.532  1.778
1000  1     1     1.466  1.467  1.469  1.459  1.620
3000  1     1     1.513  1.505  1.508  1.502  1.689
500   2     1     1.397  1.395  1.395  1.394  1.591
1000  2     1     1.405  1.350  1.352  1.350  1.603
3000  2     1     1.416  1.397  1.405  1.387  1.433

Table A-9. Mean WRMSD from GPC model for IRT observed score equating in two common testlets condition

N     σ²γ   μ     MM     MS     SL     HB     CC
500   0     0     1.853  1.824  1.818  1.820  1.684
1000  0     0     1.677  1.725  1.718  1.722  1.778
3000  0     0     1.639  1.617  1.549  1.544  1.701
500   0.5   0     1.804  1.749  1.737  1.754  1.535
1000  0.5   0     1.645  1.585  1.581  1.594  1.677
3000  0.5   0     1.544  1.527  1.531  1.524  1.624
500   1     0     1.610  1.570  1.562  1.567  1.577
1000  1     0     1.643  1.628  1.626  1.631  1.520
3000  1     0     1.611  1.619  1.606  1.607  1.648
500   2     0     1.619  1.534  1.532  1.535  1.431
1000  2     0     1.487  1.434  1.468  1.459  1.510
3000  2     0     1.451  1.375  1.377  1.373  1.414
500   0     0.5   1.755  1.706  1.677  1.702  1.676
1000  0     0.5   1.717  1.656  1.659  1.660  1.879
3000  0     0.5   1.696  1.779  1.784  1.774  1.912
500   0.5   0.5   1.685  1.679  1.606  1.610  1.715
1000  0.5   0.5   1.702  1.736  1.788  1.811  1.708
3000  0.5   0.5   1.826  1.770  1.806  1.808  1.718
500   1     0.5   1.706  1.668  1.548  1.659  1.689
1000  1     0.5   1.586  1.533  1.548  1.534  1.621
3000  1     0.5   1.510  1.502  1.502  1.500  1.670
500   2     0.5   1.706  1.636  1.563  1.642  1.593
1000  2     0.5   1.521  1.476  1.474  1.475  1.483
3000  2     0.5   1.391  1.388  1.397  1.378  1.435
500   0     1     1.696  1.663  1.623  1.646  2.080
1000  0     1     1.913  1.831  1.862  1.876  1.964
3000  0     1     1.684  1.646  1.650  1.637  2.224
500   0.5   1     1.800  1.734  1.722  1.730  2.041
1000  0.5   1     1.671  1.676  1.672  1.687  1.898
3000  0.5   1     1.608  1.593  1.589  1.594  1.682
500   1     1     1.724  1.655  1.652  1.665  1.835
1000  1     1     1.647  1.659  1.684  1.682  1.621
3000  1     1     1.485  1.493  1.494  1.487  1.686
500   2     1     1.513  1.443  1.438  1.443  1.619
1000  2     1     1.422  1.438  1.423  1.437  1.555
3000  2     1     1.437  1.437  1.437  1.430  1.419

Table A-10. Mean WRMSD from 2PL model for IRT observed score equating in four common testlets condition

N     σ²γ   μ     MM     MS     SL     HB     CC
500   0     0     1.374  1.410  1.360  1.376  1.247
1000  0     0     1.291  1.312  1.283  1.296  1.339
3000  0     0     1.269  1.273  1.259  1.264  1.304
500   0.5   0     1.343  1.359  1.309  1.341  1.382
1000  0.5   0     1.374  1.378  1.353  1.366  1.383
3000  0.5   0     1.311  1.309  1.307  1.310  1.359
500   1     0     1.280  1.279  1.260  1.272  1.238
1000  1     0     1.181  1.187  1.162  1.180  1.366
3000  1     0     1.147  1.148  1.149  1.145  1.215
500   2     0     1.239  1.243  1.238  1.235  1.168
1000  2     0     1.170  1.181  1.166  1.169  1.139
3000  2     0     1.155  1.149  1.150  1.148  1.140
500   0     0.5   1.443  1.457  1.414  1.380  1.381
1000  0     0.5   1.232  1.278  1.241  1.256  1.338
3000  0     0.5   1.377  1.373  1.373  1.374  1.363
500   0.5   0.5   1.308  1.327  1.286  1.304  1.264
1000  0.5   0.5   1.329  1.339  1.321  1.325  1.328
3000  0.5   0.5   1.264  1.263  1.258  1.263  1.300
500   1     0.5   1.358  1.373  1.354  1.359  1.205
1000  1     0.5   1.189  1.196  1.185  1.186  1.231
3000  1     0.5   1.202  1.210  1.204  1.208  1.254
500   2     0.5   1.194  1.189  1.179  1.187  1.153
1000  2     0.5   1.215  1.219  1.208  1.215  1.200
3000  2     0.5   1.177  1.178  1.176  1.177  1.144
500   0     1     1.368  1.402  1.305  1.326  1.362
1000  0     1     1.346  1.368  1.341  1.348  1.325
3000  0     1     1.299  1.308  1.289  1.300  1.412
500   0.5   1     1.206  1.228  1.181  1.201  1.227
1000  0.5   1     1.266  1.267  1.242  1.842  1.279
3000  0.5   1     1.203  1.225  1.216  1.212  1.359
500   1     1     1.245  1.248  1.224  1.232  1.298
1000  1     1     1.219  1.222  1.209  1.215  1.184
3000  1     1     1.269  1.268  1.269  1.267  1.249
500   2     1     1.135  1.145  1.130  1.132  1.071
1000  2     1     1.141  1.142  1.134  1.138  1.185
3000  2     1     1.127  1.124  1.126  1.116  1.152

Table A-11. Mean WRMSD from GRM model for IRT observed score equating in four common testlets condition

N     σ²γ   μ     MM     MS     SL     HB     CC
500   0     0     1.197  1.189  1.179  1.183  1.271
1000  0     0     1.064  1.063  1.066  1.059  1.159
3000  0     0     1.193  1.192  1.197  1.190  1.178
500   0.5   0     1.117  1.108  1.107  1.108  1.074
1000  0.5   0     1.113  1.108  1.107  1.106  1.050
3000  0.5   0     1.089  1.063  1.062  1.060  1.064
500   1     0     1.175  1.134  1.139  1.133  1.149
1000  1     0     1.111  1.104  1.108  1.100  1.071
3000  1     0     0.988  0.985  0.988  0.981  0.951
500   2     0     1.022  0.980  0.978  0.979  1.001
1000  2     0     1.022  1.007  1.011  1.005  1.064
3000  2     0     1.054  1.027  1.030  1.024  1.011
500   0     0.5   1.122  1.177  1.176  1.169  1.211
1000  0     0.5   1.266  1.254  1.244  1.241  1.101
3000  0     0.5   1.138  1.145  1.139  1.143  1.193
500   0.5   0.5   1.194  1.167  1.163  1.165  1.129
1000  0.5   0.5   1.105  1.104  1.103  1.101  1.136
3000  0.5   0.5   1.105  1.084  1.083  1.083  1.068
500   1     0.5   1.166  1.097  1.101  1.096  1.071
1000  1     0.5   1.071  1.055  1.057  1.053  1.016
3000  1     0.5   1.053  1.039  1.043  1.036  1.086
500   2     0.5   1.051  0.988  0.988  0.987  1.035
1000  2     0.5   0.955  0.928  0.934  0.928  1.068
3000  2     0.5   0.974  0.944  0.949  0.945  1.028
500   0     1     1.251  1.113  1.128  1.141  1.136
1000  0     1     1.177  1.149  1.144  1.155  1.173
3000  0     1     1.157  1.170  1.161  1.161  1.209
500   0.5   1     0.984  0.954  0.943  0.947  1.010
1000  0.5   1     1.140  1.074  1.077  1.078  1.145
3000  0.5   1     1.145  1.115  1.115  1.106  1.127
500   1     1     1.114  1.045  1.041  1.037  1.049
1000  1     1     1.039  1.000  1.003  0.998  1.086
3000  1     1     1.060  1.049  1.054  1.042  1.017
500   2     1     1.039  0.955  0.957  0.948  1.023
1000  2     1     1.038  0.997  1.004  0.992  1.145
3000  2     1     1.024  0.967  0.972  0.965  1.031

Table A-12. Mean WRMSD from GPC model for IRT observed score equating in four common testlets condition

N     σ²γ   μ     MM     MS     SL     HB     CC
500   0     0     1.364  1.299  1.309  1.302  1.149
1000  0     0     1.244  1.233  1.222  1.237  1.168
3000  0     0     1.155  1.148  1.149  1.150  1.130
500   0.5   0     1.184  1.196  1.169  1.206  1.105
1000  0.5   0     1.264  1.232  1.230  1.232  1.208
3000  0.5   0     1.130  1.121  1.120  1.122  1.066
500   1     0     1.202  1.206  1.197  1.203  1.136
1000  1     0     1.243  1.229  1.223  1.229  1.121
3000  1     0     1.068  1.055  1.053  1.056  1.113
500   2     0     1.162  1.104  1.104  1.104  1.008
1000  2     0     1.097  1.074  1.081  1.071  1.078
3000  2     0     1.023  1.011  0.978  1.012  0.951
500   0     0.5   1.326  1.282  1.201  1.210  1.168
1000  0     0.5   1.148  1.126  1.103  1.131  1.071
3000  0     0.5   1.135  1.219  1.128  1.126  1.214
500   0.5   0.5   1.209  1.167  1.165  1.184  1.280
1000  0.5   0.5   1.307  1.255  1.217  1.262  1.187
3000  0.5   0.5   1.045  1.043  1.043  1.043  1.205
500   1     0.5   1.154  1.127  1.113  1.126  1.148
1000  1     0.5   1.184  1.160  1.161  1.164  1.119
3000  1     0.5   1.080  1.078  1.076  1.076  1.130
500   2     0.5   1.130  1.070  1.035  1.068  1.025
1000  2     0.5   1.093  1.068  1.065  1.072  1.036
3000  2     0.5   0.990  0.988  0.950  0.987  1.029
500   0     1     1.168  1.137  1.103  1.142  1.197
1000  0     1     1.189  1.137  1.178  1.127  1.278
3000  0     1     1.191  1.197  1.192  1.190  1.198
500   0.5   1     1.148  1.147  1.121  1.143  1.174
1000  0.5   1     1.131  1.163  1.135  1.118  1.137
3000  0.5   1     1.096  1.092  1.121  1.089  1.173
500   1     1     1.173  1.122  1.110  1.127  1.105
1000  1     1     1.106  1.088  1.076  1.090  1.114
3000  1     1     1.070  1.055  1.050  1.057  1.126
500   2     1     1.082  1.039  1.034  1.037  0.992
1000  2     1     1.036  1.025  1.004  1.021  1.052
3000  2     1     1.093  1.086  1.089  1.084  1.023

LIST OF REFERENCES

Baker, F. B. (1993a). Equate 2.0: A computer program for the characteristic curve method of IRT equating. Applied Psychological Measurement, 17, 20.

Béguin, A. A., & Hanson, B. A. (2001, April). Effect of noncompensatory multidimensionality on separate and concurrent estimation in IRT observed score equating. Paper presented at the annual meeting of the National Council on Measurement in Education, Seattle.

Béguin, A. A., Hanson, B. A., & Glas, C. A. W. (2000, April). Effect of multidimensionality on separate and concurrent estimation in IRT equating. Paper presented at the annual meeting of the National Council on Measurement in Education, New Orleans.

Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika, 37, 29-51.

Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46, 443-459.

Bradlow, E. T., Wainer, H., & Wang, X. (1999). A Bayesian random effects model for testlets. Psychometrika, 64, 153-168.

Cai, L., Thissen, D., & du Toit, S. H. C. (2011). IRTPRO for Windows [Computer software]. Lincolnwood, IL: Scientific Software International.

Cao, Y. (2008). Mixed-format test equating: Effects of test dimensionality and common item sets. Dissertation, University of Maryland.

Cook, K. F., Dodd, B. G., & Fitzpatrick, S. J. (1999). A comparison of three polytomous item response theory models in the context of testlet scoring. Journal of Outcome Measurement, 3, 1-20.

Cook, L. L., & Eignor, D. R. (1991). An NCME instructional module on IRT equating methods. Educational Measurement: Issues and Practice, 10, 37-45.

Cook, L. L., Eignor, D. R., & Taft, H. (1985). A comparative study of curriculum effects on the stability of IRT and conventional item parameter estimates (RR 85-38). Princeton, NJ: Educational Testing Service.

Cook, L. L., & Paterson, N. S. (1987). Problems related to the use of conventional and item response theory equating methods in less than optimal circumstances. Applied Psychological Measurement, 11, 225-244.

DeMars, C. E. (2006). Application of the bi-factor multidimensional item response theory model to testlet-based tests. Journal of Educational Measurement, 43, 145-168.

Du, Z. (1998). Modelling conditional item dependencies with a three-parameter logistic testlet model. Unpublished doctoral dissertation, Columbia University.

Gibbons, R. D., & Hedeker, D. R. (1992). Full-information bi-factor analysis. Psychometrika, 57, 423-436.

Gulliksen, H. (1950). Theory of mental tests. New York: Wiley.

Haebara, T. (1980). Equating logistic ability scales by a weighted least squares method. Japanese Psychological Research, 22, 144-149.

Han, T., Kolen, M. J., & Pohlmann, J. (1997). A comparison among IRT true- and observed-score equatings and traditional equipercentile equating. Applied Measurement in Education, 10(2), 105-121.

Hanson, B. A., & Béguin, A. A. (2002). Obtaining a common scale for item response theory item parameters using separate versus concurrent estimation in the common-item equating design. Applied Psychological Measurement, 26, 3-24.

Hanson, B. A., & Zeng, L. (1995). A computer program for IRT equating (PIE) (Version 1.0) [Computer program]. Iowa City, IA: ACT.

He, W., Li, F., Wolfe, E. W., & Mao, X. (2012, April). Model selection for equating testlet-based tests in the NEAT design: An empirical study. Paper presented at the annual meeting of the American Educational Research Association, Vancouver.

Kim, S. H., & Cohen, A. S. (1998). A comparison of linking and concurrent calibration under item response theory. Applied Psychological Measurement, 22, 131-143.

Kim, S. H., & Kolen, M. J. (2006). Robustness to format effects of IRT linking methods for mixed-format tests. Applied Measurement in Education, 19, 357-381.

Kim, S. H., & Lee, W. C. (2006). An extension of four IRT linking methods for mixed-format tests. Journal of Educational Measurement, 43, 53-76.

Klein, L. W., & Jarjoura, D. (1985). The importance of content representation for common-item equating with nonrandom groups. Journal of Educational Measurement, 22, 197-206.

Klein, L. W., & Kolen, M. J. (1985, April). Effect of number of common items in common-item equating with nonrandom groups. Paper presented at the annual meeting of the American Educational Research Association, Chicago.

Kolen, M. J. (1981). Comparison of traditional and item response theory methods for equating tests. Journal of Educational Measurement, 18, 1-11.

Kolen, M. J. (2004a). Population invariance in equating and linking: Concept and history. Journal of Educational Measurement, 41, 3-14.

Kolen, M. J., & Brennan, R. L. (2004). Test equating, linking, and scaling: Methods and practices (2nd ed.). New York: Springer.

Lee, G., Dunbar, S. B., & Frisbie, D. A. (1999, April). Measurement models for a testlet-based test. Paper presented at the annual meeting of the National Council on Measurement in Education, Montreal, Canada.

Lee, G., Kolen, M. J., Frisbie, D. A., & Ankenmann, R. D. (2001). Comparison of dichotomous and polytomous item response models in equating scores from tests composed of testlets. Applied Psychological Measurement, 25, 357-372.

Levine, R. (1955). Equating the score scales of alternate forms administered to samples of different ability (Research Bulletin 55-23). Princeton, NJ: Educational Testing Service.

Lee, W.-C., & Ban, J.-C. (2010). A comparison of IRT linking procedures. Applied Measurement in Education, 23, 23-48.

Li, D. (2009). Developing a common scale for testlet model parameter estimates under the common-item nonequivalent groups design. Dissertation, University of Maryland.

Li, Y. M., Bolt, D. M., & Fu, J. B. (2005). A test characteristic curve linking method for the testlet model. Applied Psychological Measurement, 29, 340-356.

Liang, T., & Wells, C. S. (2009). A model fit statistic for generalized partial credit model. Educational and Psychological Measurement, 69(6), 913-928.

Loyd, B., & Hoover, H. D. (1980). Vertical equating using the Rasch model. Journal of Educational Measurement, 17, 179-193.

Marco, G. L. (1977). Item characteristic curve solutions to three intractable testing problems. Journal of Educational Measurement, 14, 139-160.

Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149-174.

Maydeu-Olivares, A., Drasgow, F., & Mead, A. D. (1994). Distinguishing among parametric item response models for polychotomous ordered data. Applied Psychological Measurement, 18, 245-256.

Mislevy, R. J., & Bock, R. D. (1990). BILOG 3: Item analysis and test scoring with binary logistic models (2nd ed.) [Computer program]. Mooresville, IN: Scientific Software.

Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16, 159-176.

Muraki, E., & Bock, R. D. (1997). PARSCALE: IRT item analysis and test scoring for rating-scale data. Chicago, IL: Scientific Software International.

Petersen, N. S., Cook, L. L., & Stocking, M. L. (1983). IRT versus conventional equating methods: A comparative study of scale stability. Journal of Educational Statistics, 8, 137-156.

Reckase, M. D. (1985). The difficulty of items that measure more than one ability. Applied Psychological Measurement, 9, 401-412.

Reckase, M. D., & McKinley, R. L. (1983). The definition of difficulty and discrimination for multidimensional item response theory models. Paper presented at the meeting of the American Educational Research Association, Montreal, Quebec, Canada.

Reise, S. P., & Yu, J. (1990). Parameter recovery in the graded response model using MULTILOG. Journal of Educational Measurement, 27, 133-144.

Rizopoulos, D. (2006). ltm: An R package for latent variable modeling and item response theory analyses. Journal of Statistical Software, 17, 1-25. URL http://www.jstatsoft.org/v17/i05/.

Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika, Monograph Supplement, No. 17.

Simon, M. K. (2008). Comparison of concurrent and separate multidimensional IRT linking of item parameters. Unpublished dissertation, University of Minnesota.

Sireci, S. G., Thissen, D., & Wainer, H. (1991). On the reliability of testlet-based tests. Journal of Educational Measurement, 28, 237-247.

Spiegelhalter, D., Thomas, A., & Best, N. (2003). WinBUGS version 1.4 [Computer program]. Cambridge, UK: MRC Biostatistics Unit, Institute of Public Health.

Stocking, M. L., & Lord, F. M. (1983). Developing a common metric in item response theory. Applied Psychological Measurement, 7, 201-210.

Tang, K. L., & Eignor, D. R. (1997). Concurrent calibration of dichotomously and polytomously scored TOEFL items using IRT models. Educational Testing Service.

Thissen, D. (1991). MULTILOG: Multiple categorical item analysis and test scoring using item response theory (Version 6.0) [Computer program]. Chicago: Scientific Software International.

Thissen, D., Steinberg, L., & Mooney, J. (1989). Trace lines for testlets: A use of multiple categorical response models. Journal of Educational Measurement, 26, 247-260.

Wainer, H. (1995). Precision and differential item functioning on a testlet-based test: The 1991 Law School Admissions Test as an example. Applied Measurement in Education, 8, 157-186.

Wainer, H., Bradlow, E. T., & Du, Z. (2000). Testlet response theory: An analog for the 3PL useful in adaptive testing. In W. J. van der Linden & C. A. W. Glas (Eds.), Computerized adaptive testing: Theory and practice (pp. 245-270). Boston: Kluwer-Nijhoff.

Wainer, H., Bradlow, E. T., & Wang, X. (2007). Testlet response theory and its applications. New York: Cambridge University Press.

Wainer, H., & Kiely, G. L. (1987). Item clusters and computerized adaptive testing: A case for testlets. Journal of Educational Measurement, 24, 185-201.

Wainer, H., & Thissen, D. (1996). How is reliability related to the quality of test scores? What is the effect of local dependence on reliability? Educational Measurement: Issues and Practice, 15, 22-29.

Wainer, H., & Wang, C. (2000). Using a new statistical model for testlets to score TOEFL. Journal of Educational Measurement, 37, 203-220.

Wang, X., Bradlow, E. T., & Wainer, H. (2002). A general Bayesian model for testlets: Theory and applications. Applied Psychological Measurement, 26, 109-128.

Weeks, J. P. (2010). plink: An R package for linking mixed-format tests using IRT-based methods. Journal of Statistical Software, 35, 1-33. URL http://www.jstatsoft.org/v35/i12/.

Wingersky, M. S., Barton, M. A., & Lord, F. M. (1982). LOGIST user's guide. Princeton, NJ: Educational Testing Service.

Wingersky, M. S., Cook, L. L., & Eignor, D. R. (1986, April). Specifying the characteristics of linking items used for item response theory item calibration. Paper presented at the annual meeting of the American Educational Research Association, San Francisco.

Wingersky, M. S., & Lord, F. M. (1984). An investigation of methods for reducing sampling error in certain IRT procedures. Applied Psychological Measurement, 8, 347-364.

Wood, R., Wilson, D., Gibbons, R., Schilling, S., Muraki, E., & Bock, D. (1987). TESTFACT. Mooresville, IN: Scientific Software.

Yen, W. M. (1984). Effects of local item dependence on the fit and equating performance of the three-parameter logistic model. Applied Psychological Measurement, 19, 231-240.

Yen, W. M. (1993). Scaling performance assessments: Strategies for managing local item dependence. Journal of Educational Measurement, 30, 187-213.

Zeng, L., Kolen, M. J., & Hanson, B. A. (1995). Random groups equating program (RAGE) (Version 2.0) [Computer program]. Iowa City, IA: ACT.

Zhang, J. (2007). Dichotomous or polytomous model? Equating of testlet-based tests in light of conditional item pair correlations. Dissertation, University of Iowa.

Zimowski, M. F., Muraki, E., Mislevy, R. J., & Bock, R. D. (2005). BILOG-MG: Multiple-group IRT analysis and test maintenance for binary items. Lincolnwood, IL: Scientific Software International, Inc.

BIOGRAPHICAL SKETCH

Lidong Zhang was born in China. She received her education at Henan Normal University. In 2005, she obtained a master's degree in linguistics and applied linguistics from Shanghai Jiaotong University. Her Ph.D. study in research and evaluation methodology at the University of Florida began in 2008. In August 2013, she received her doctoral degree.