POLYTOMOUS IRT OR TESTLET MODEL: AN EVALUATION OF SCORING MODELS IN SMALL TESTLET SIZE SITUATIONS

By

OU ZHANG

A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF ARTS IN EDUCATION

UNIVERSITY OF FLORIDA

2010

© 2010 Ou Zhang

To my Dad, who has supported me, believed in me, and encouraged me to start this long journey. He is my hero!

ACKNOWLEDGMENTS

I would like to express my sincere appreciation to Dr. M. David Miller, my committee chair, for providing valuable guidance and continuous support. I would also like to thank Dr. James J. Algina, my committee member, for sharing his ideas and corrections on this project. My deepest gratitude goes to my parents and my wife, Bei Li, for their constant support and love. Thanks to my summer internship mentor, Dr. Feiming Li, and Vice President Dr. Linjun Shen for giving me such a valuable opportunity to enter the educational measurement industry. Thanks to my friend Yan Cao for her patience and help over the years. Last, thanks go out to Dr. Andrich for his comments and suggestions.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
ABSTRACT

CHAPTER

1 INTRODUCTION
   1.1 Model Selection
   1.2 Survey of the Testlet Size in Applications of Testlet
   1.3 Purpose of the Study
2 LITERATURE REVIEW
   2.1 Item Response Theory
      2.1.1 IRT Assumptions
      2.1.2 One Parameter Logistic Model (1-PL Model or Rasch Model)
      2.1.3 Polytomous Item Response Theory (IRT) Model: Partial Credit Model
      2.1.4 Testlet Model: Rasch Testlet Model
      2.1.5 Local Item Dependence
   2.2 Reliability
   2.3 Survey in Application of Testlet
3 METHOD
   3.1 Model Used to Generate Data
   3.2 Population Parameters
   3.3 Condition Manipulated
   3.4 Data Generation
   3.5 Parameter Estimation
   3.6 Ability Estimation
   3.7 Analysis
      3.7.1 Bias
      3.7.2 Root Mean Square Error (RMSE)
      3.7.3 Reliability
4 RESULTS
   4.1 MLE Nonconvergence Issue
   4.2 Test Reliability
   4.3 Standard Error of Measurement
   4.4 Bias and RMSE
   4.5 An Empirical Case
5 DISCUSSION
   5.1 General Discussion
   5.2 Limitations and Suggestions for Future Research
   5.3 Conclusion

LIST OF REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

1-1 Testlet size in the article reviews
2-1 The number of testlets in the dataset
2-2 Test length in the reviewed articles
2-3 Sample sizes in the reviewed articles
2-4 Fit indices in reviewed articles
2-5 Estimation method in reviewed articles
2-7 The number of simulation replications applied in the reviewed articles
3-1 Study design condition with 3 factors
4-1 MLE nonconvergence case and rate per condition: testlet size 3
4-2 MLE nonconvergence case and rate per condition: testlet size 5
4-3 Test reliability: testlet size 3 conditions
4-4 Test reliability: testlet size 5 conditions
4-5 Testlet size 3: the results of the Spearman-Brown prophecy
4-6 Testlet size 5: the results of the Spearman-Brown prophecy
4-7 Mean standard error of measurement for each condition (testlet size 3)
4-8 Mean standard error of measurement for each condition (testlet size 5)
4-9 Testlet size 3: bias and RMSE of ability estimate recovery (EAP)
4-10 Testlet size 5: bias and RMSE of ability estimate recovery (EAP)
4-11 Rasch testlet model (testlet size 3): bias of ability ($\theta$) estimate recovery (EAP) with 6 different ability intervals
4-12 Partial credit model (testlet size 3): bias of ability ($\theta$) estimate recovery (EAP) with 6 different ability intervals
4-13 Standard Rasch model (testlet size 3): bias of ability ($\theta$) estimate recovery (EAP) with 6 different ability intervals
4-14 Rasch testlet model (testlet size 5): bias of ability ($\theta$) estimate recovery (EAP) with 6 different ability intervals
4-15 Partial credit model (testlet size 5): bias of ability ($\theta$) estimate recovery (EAP) with 6 different ability intervals
4-16 Standard Rasch model (testlet size 5): bias of ability ($\theta$) estimate recovery (EAP) with 6 different ability intervals
4-17 Rasch testlet model (testlet size 3): RMSE of ability ($\theta$) estimate recovery (EAP) with 6 different ability intervals
4-18 Partial credit model (testlet size 3): RMSE of ability ($\theta$) estimate recovery (EAP) with 6 different ability intervals
4-19 Standard Rasch model (testlet size 3): RMSE of ability ($\theta$) estimate recovery (EAP) with 6 different ability intervals
4-20 Rasch testlet model (testlet size 5): RMSE of ability ($\theta$) estimate recovery (EAP) with 6 different ability intervals
4-21 Partial credit model (testlet size 5): RMSE of ability ($\theta$) estimate recovery (EAP) with 6 different ability intervals
4-22 Standard Rasch model (testlet size 5): RMSE of ability ($\theta$) estimate recovery with 6 different ability intervals
4-23 NBOME LEVEL 2 Block 1 Item WMSE
4-24 COMLEX Level 2 2008 block 1 local item dependence detection results

Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Arts in Education

POLYTOMOUS IRT OR TESTLET MODEL: AN EVALUATION OF SCORING MODELS IN SMALL TESTLET SIZE SITUATIONS

By

Ou Zhang

December 2010

Chair: M. David Miller
Major: Research and Evaluation Methodology

This study investigated how well three models recover ability parameters, in order to detect the influence of local item dependence across testlet items when testlet sizes are small. A simulation study was used to compare three Rasch-type models: the standard Rasch model, the partial credit model, and the Rasch testlet model. The results revealed that both the partial credit model and the Rasch testlet model performed better than the standard Rasch model when local item dependence existed within testlets. The results also indicated that as the sample size increases, the discrepancies between model estimates and the real data set increase. The study concluded that using the polytomous IRT model for testlet item analyses is still efficient for small testlet sizes and non-adaptive tests. Moreover, for small testlet sizes, polytomous IRT models are more stable than the Rasch testlet model when a large number of testlets are included in a test.
In sum, the polytomous IRT model and the Rasch testlet model offer an advantage over the standard Rasch model: in small testlet size situations they avoid underestimating the standard error of measurement and yield better ability parameter estimates.

CHAPTER 1
INTRODUCTION

Item response theory (IRT) models are commonly used in educational and psychological testing. Employing item response theory allows for assessing latent human characteristics and quantifying underlying traits. IRT rests on a major assumption: local item independence. Local item independence (LID) assumes that items in the test are unrelated to each other after controlling for the underlying trait. However, the LID assumption is commonly violated in real-world applications. In fact, many real-world tasks require solving related problems or solving a single problem in a stepwise fashion. In such circumstances, the exam includes subsets of items that share a single content stimulus. The items sharing the same stimulus are grouped as a unit, termed an item bundle (Rosenbaum, 1988) or testlet (Wainer & Kiely, 1987).

An item bundle or testlet, henceforth referred to as a testlet, is a scoring unit within a test that is smaller than the test (Wainer & Kiely, 1987). Items within testlets are locally dependent because they are associated with the same stimulus. Moreover, local item dependence introduces unintended dimensions into the test at the construct of [...] ([...] & Thissen, 1996). Thus, the challenge for the test developer is not to eliminate the item dependencies, but rather to find a proper solution so that such local item dependence does not impact the test reliability and the validity of inferences from the test.
More specifically, violating the assumption of local item independence may lead to underestimated standard errors and could result in (a) bias in item difficulty estimates, (b) inflated item discrimination estimates, (c) overestimation of the precision of examinee scores, and (d) overestimation of test reliability and test information. This last result can lead to inaccurate inferences and a greater chance of misclassification when making decisions regarding examinee ability categorization (Sireci, Thissen, & Wainer, 1991; Yen, 1993).

Therefore, some models were proposed as solutions to the violation of the local item independence assumption. One of the methods is to treat such testlet items as a single super polytomous item in the analysis (Sireci, Thissen, & Wainer, 1991; Thissen, [...]) [...] theorem of item bundles (Rosenbaum, 1988), using a polytomous IRT model to score the locally independent testlets. The key idea is that the items that form each testlet may have excessive local dependence, but that once the entire testlet is considered as a single unit and scored polytomously, these local dependencies may disappear. The item scores are summed within each testlet. Response patterns with identical total scores in a testlet are assigned to the same category. This method allows researchers to score testlets polytomously. Once the summed item scores are obtained, testlet-type item responses are calibrated by applying polytomous item response models, such as the Graded Response Model (Samejima, 1969), the Partial Credit Model (Masters, 1982), the Rating Scale Model (Andrich, 1978), or the Nominal Response Model (Bock, 1972). When a polytomous IRT model is used to score testlets, the data can be analyzed while maintaining local independence across different testlets.

This approach avoids overestimating test reliability and information, so that the statistics of the polytomous IRT model consistently perform better than those of the standard Rasch model in such circumstances.

However, this approach has some weaknesses when it is applied to testlets, as discussed in the literature (Thissen, Billeaud, McLeod, & Nelson, 1997; Yen, 1993; Wainer & Wang, 2000). First, when polytomous IRT models are applied, some test information, namely the precise pattern of responses the examinee generates, is lost. Second, some parameters are dropped from the polytomous model compared with individual dichotomous item scoring. Third, the approach is inappropriate if the test is administered adaptively. Finally, the test reliability might be underestimated (Yen, 1993). Wainer (1995) claimed that using a polytomous IRT model to manage testlets might be appropriate when the local dependence between items within a testlet is moderate and the testlet-type items make up only a small proportion of the entire test.

The other method, the testlet model (Wainer & Kiely, 1987), was explicitly introduced as an alternative to the polytomous IRT model and attempts to solve the same problem. IRT testlet models have been proposed in which a random effect parameter is added to model the local dependence among items within the same testlet (Bradlow, Wainer, & Wang, 1999; Wainer, Bradlow, & Du, 2000; Wang, Bradlow, & Wainer, 2002). Because a random effect parameter is added to the model, an additional latent trait is also added to the testlet model. Thus, the testlet model proposed by Wainer and Wang (2000) is a special case of a multidimensional IRT (MIRT) model.

Proposed by Wang and Wilson (2005b), the Rasch testlet model is a special case of the testlet model (Wainer & Wang, 2000) that combines special features of the Rasch model and the testlet model. In so doing, it makes use of several desirable measurement and psychometric properties of the Rasch model (Wang & Wilson, 2005).
First, the Rasch model has observable sufficient statistics for the model parameters and a relatively small sample size requirement for parameter estimation. Second, no distributional assumption on the item parameters is necessary in Rasch models, since the items are treated as fixed effects. Therefore, the Rasch model is widely applied in testing and scoring. Because of these advantages of the Rasch model, Wang and Wilson (2005a, 2005b) showed that it is possible to model locally dependent items in relation to testlets by using a Rasch testlet model, so that more precise and adequate estimates are obtained. The Rasch testlet model is a special case of the testlet model (Wainer & Wang, 2000).

Before the testlet model was proposed, polytomous IRT models were the primary method for analyzing testlets. Currently, both approaches are widely used for testlet analyses, and both have pros and cons. The theoretical reason to choose the testlet model (Wainer & Wang, 2000) over the polytomous IRT model in testlet analysis might be obvious. However, some potential caveats of the testlet model should also be considered. First, the testlet model is more complex than both the [...] an additional latent trait is also added to the model, so that multidimensionality occurs and increases the complexity of the analysis. Thus, the model analysis process is greatly prolonged and some potential issues emerge (e.g., the model calibration sometimes fails to converge). Therefore, the benefits of using the testlet model (Wainer & Kiely, 1987) should be weighed against the added complexity in data analysis.

1.1 Model Selection

Although Wainer and Wang (2000) addressed the advantages of the testlet model over the polytomous model in applied testlet analyses, it remains important to compare these two models under various conditions.
Pitt, Kim, and Myung (2003) indicated that the goal of model selection is not just to find the model that provides the maximum fit to a given data set, but to identify, from a set of competing models, the model that best captures the characteristics or trends underlying the cognitive process of interest. Briefly, the best model is the one that matches the purpose of the study and can explain all of the important features of the actual data without adding unnecessary complexity.

Two realistic circumstances should be noted in model selection for analyzing testlets. The first is testlet size. In previous testlet research, in order to obtain illustrative results to support hypotheses, testlet sizes were usually set from 5 to 10 or more items (e.g., Adams, Wilson, & Wang, 1997; Wang & Wilson, 2005; Brandt, 2008; Wainer & Wang, 2000; Wainer & Lewis, 1990). Small and medium testlet sizes (2-4 items) were rarely applied (Ip, Smits, & De Boeck, 2009; Tokar, Fischer, Snell, & Harik-Williams, 1999; DeMars, 2006). This is potentially problematic because in some exams, such as the National Medical Licensing Examinations (COMLEX-USA), testlet sizes are often small.

The second issue to consider is that of non-adaptive tests. Non-adaptive tests are still widely used in the educational and psychological measurement field. Because of the small testlet sizes and non-adaptive administration, the loss of response pattern information is not that serious for this kind of test. Accordingly, the comparison between the two models deserves more attention when the aforementioned shortcomings of applying the polytomous IRT model in testlet analysis are minimal.

In addition, the local dependence effect within testlets varies. Since the local dependence effect is avoided when polytomous IRT models are applied, the extent to which it influences the fit of the polytomous model in testlet analysis should also draw attention.

1.2 Survey of the Testlet Size in Applications of Testlet

Very little research has compared the polytomous IRT model with the testlet model initially proposed by Wainer and Wang (2000) in terms of model fit, ability parameter recovery, and test reliability as testlet conditions change, especially when testlet size and the local dependence effect are at a medium level.

A review of the literature to identify the application of testlets was [...] and 2009. A total of fifty-five articles relevant to testlets were found and reviewed (see the reference list in Appendix B). Among these fifty-five testlet-related articles, forty-five have specific descriptions of the factors that could influence testlet analysis in testlet research designs (i.e., testlet size, the number of testlets within a test, sample size, etc.). The remaining ten articles, which include two book reviews, conceptually describe testlet theory and application.

Issues of testlet size within the testlet have been well documented in the literature. Of these forty-five testlet-relevant articles, only four applied small testlet size designs exclusively (i.e., testlet size smaller than five). The other forty-one articles have a mixture of testlet size designs, although most of them included moderate and large testlet size designs (testlet size larger than 5) in their research. Of these forty-one, twelve articles considered small testlet size designs. Over thirty-five articles included testlet sizes between 5 and 10, and twelve articles included large testlet size conditions (larger than 10). Overall, 16 articles (35.6%) investigated small testlets and only 12 compared the small and medium testlet sizes. The detailed results are shown in Table 1-1.

In sum, this study adds to this literature by investigating the results of three different models of testlet-type data under the small and medium testlet size circumstances (i.e.,
testlet size smaller than or equal to 5). Testlet size, local dependence effect, sample size, and the ratio of testlet to independent items are the factors in this study. We examine model fit, test reliability, and ability parameter recovery for the three different models (i.e., the Rasch model, the Partial Credit model, and the Rasch testlet model) employed in a testlet-type data analysis.

1.3 Purpose of the Study

In accordance with previous testlet research, one purpose of this study is to explore the consequences of variation in testlet size and local dependence effects on test reliability, standard error of measurement, and ability parameter recovery for the standard Rasch model, the Partial Credit model, and the Rasch testlet model. By looking for trends in how changes in testlet factors (i.e., testlet size, local dependence effect, sample size, testlet/independent item ratio) affect the different models, a guide for model selection is expected to emerge.

The other essential goal of this study is to determine which model performs best at person ability parameter recovery, considering the trade-off between test reliability and analysis complexity. Answers to these questions will provide evidence and a reference for researchers interested in applying IRT models to measure tests appropriately. Furthermore, since we use data from the NBOME COMLEX-USA examination, the study will provide guidance for future improvements in the estimation of this exam.

Table 1-1. Testlet size in the article reviews
Testlet size (m): m < 5; 5 [...]

CHAPTER 2
LITERATURE REVIEW

This section gives the theoretical framework of this research. Several important parts are included: IRT theory, IRT assumptions, the IRT models used in this research, local item dependence, and test reliability.

2.1 Item Response Theory

Item Response Theory (IRT), proposed by Lord (1952), is a family of statistical models for analyzing item responses in a population of individuals. It depicts the relationship between examinees and items through mathematical models (Wainer & Mislevy, 2000). Many mathematical models can be developed within the IRT framework. There are two general types of IRT models: dichotomous IRT models and polytomous IRT models.

Dichotomous IRT models are used to model items with only correct or incorrect response options. The One Parameter Logistic (1PL), Two Parameter Logistic (2PL), and Three Parameter Logistic (3PL) IRT models are three common dichotomous IRT models. Items with more than two response options can be modeled with polytomous IRT models. Examples of polytomous IRT models include the Graded Response model (GRM; Samejima, 1969), the Rating Scale model (RSM; Andrich, 1978), the Partial Credit model (PCM; Masters, 1982), the generalized Partial Credit model (GPCM; Muraki, 1992), and the Nominal Response model (NRM; Bock, 1972).

The notable advantage of IRT over classical test theory is that item and ability parameters in IRT models are invariant (Hambleton, Swaminathan, & Rogers, 1991). According to this invariance feature, item parameters (e.g., difficulty, discrimination, and guessing) do not depend on the ability distribution of any particular group of examinees, and the examinee ability parameters ($\theta$s) do not depend on a specific set of test items.

2.1.1 IRT Assumptions

Item Response Theory holds two essential a priori assumptions. The first assumption of IRT is local item independence: the probability of a correct response to one item is independent of the responses to other items.
Local item independence means that the item responses are independent for a given value of the latent trait $\theta$. The joint probability of a response pattern for all items in the test is the product of the probabilities of the responses to the individual items for a given latent trait:

$$P(U_1, U_2, \ldots, U_I \mid \theta) = \prod_{i=1}^{I} P(U_i \mid \theta) \tag{2-1}$$

where $I$ is the total number of items.

The second assumption of most general IRT models (e.g., the 1PL, 2PL, and 3PL models) is unidimensionality. Early notions of IRT require that the same construct be measured by all test items (Loevinger, 1947). As such, all items in the test measure only a single latent trait (Hambleton & Murray, 1983; Lord, 1980).

2.1.2 One Parameter Logistic Model (1-PL Model or Rasch Model)

The Rasch model (Rasch, 1960) is the simplest of the unidimensional models. The Rasch model predicts the probability of success for person $n$ on item $i$ and can be given by the formula:

$$P_{ni} = \frac{\exp(\theta_n - b_i)}{1 + \exp(\theta_n - b_i)} \tag{2-2}$$

where
$n$ is the examinee;
$\theta_n$ is the ability of examinee $n$;
$b_i$ is the difficulty parameter of item $i$, which indicates the point on the ability continuum at which an examinee has a 50% probability of answering item $i$ correctly;
$P_{ni}$ is the probability that examinee $n$ answers item $i$ correctly, given proficiency level $\theta_n$.

An assumption that is implicit in the model is that all items have the same discrimination value.

2.1.3 Polytomous Item Response Theory (IRT) Model: Partial Credit Model

In this study, the Partial Credit model was selected as the polytomous IRT model for comparison with the standard Rasch model. The Partial Credit model (PCM; Masters, 1982) was originally developed for analyzing test items that require multiple steps and for which it is important to assign partial credit for completing several steps in the solution process. This model was designed to be used when partial credit can be awarded for degrees of success. The PCM is a divide-by-total model. The Partial Credit model can be considered an extension of the Rasch model and has all the standard Rasch model features.
The equation for the partial credit model is shown below:

$$P_{nix} = \frac{\exp \sum_{k=0}^{x} (\theta_n - b_{ik})}{\sum_{h=0}^{m_i} \exp \sum_{k=0}^{h} (\theta_n - b_{ik})} \tag{2-3}$$

where
item $i$ is scored $x = 0, 1, \ldots, m_i$ for an item with $m_i + 1$ response categories;
$b_{ik}$ ($k = 1, \ldots, m_i$) is called the item step difficulty; it is associated with a category score of $k$, and $x$ is the response category of interest;
$\theta_n$ is the ability of examinee $n$;
$P_{nix}$ is the probability that examinee $n$ responds to item $i$ in category $x$, given proficiency level $\theta_n$; and

$$\sum_{k=0}^{0} (\theta_n - b_{ik}) \equiv 0 \tag{2-4}$$

2.1.4 Testlet Model: Rasch Testlet Model

IRT testlet models have been proposed in which a random effect parameter is added to model the local dependence among items within the same testlet (Bradlow, Wainer, & Wang, 1999; Wainer, Bradlow, & Du, 2000; Wang, Bradlow, & Wainer, 2002). Following this general approach, a simplified testlet model was developed by Wang and Wilson (2005), and it can be written as

$$P_{ni1} = \frac{\exp(\theta_n - b_i + \gamma_{nd(i)})}{1 + \exp(\theta_n - b_i + \gamma_{nd(i)})} \tag{2-5}$$

where
$P_{ni1}$ is the probability that examinee $n$ answers item $i$ correctly (scoring 1);
$\theta_n$ is the ability of examinee $n$;
$b_i$ is the difficulty of item $i$; and
$\gamma_{nd(i)}$ is a random effect that represents the interaction of person $n$ with testlet $d(i)$ (i.e., the testlet that contains item $i$).

2.1.5 Local Item Dependence

As mentioned before, local item independence (LID) is the first a priori assumption of IRT models. It means that the item responses are conditionally independent given the latent trait. Therefore, there should not be any correlation between two items after controlling for the underlying trait. The items should only be correlated through the latent trait that the test is measuring (Lord & Novick, 1968). However, this LID assumption is nearly always violated in real applications. Sometimes, significant correlation among items remains after controlling for the effect of the latent trait. Because of these significant correlations, the items are locally dependent, or there is a subsidiary dimension in the measurement that is not accounted for by the overarching trait dimension. Locally dependent items are always a cause of information loss for IRT models (Chen & Thissen, 1997).
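
The three response functions of Sections 2.1.2 to 2.1.4 can be compared side by side in a short Python sketch. It assumes the standard forms of the Rasch, partial credit, and Rasch testlet models, with the testlet effect added on the logit scale; the function names and the numeric values are illustrative only and are not part of the study.

```python
import numpy as np

def rasch_prob(theta, b):
    """P(correct) under the Rasch model for ability theta and difficulty b."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def rasch_testlet_prob(theta, b, gamma):
    """Rasch testlet model; gamma is the person-by-testlet random effect."""
    return 1.0 / (1.0 + np.exp(-(theta - b + gamma)))

def pcm_probs(theta, steps):
    """Category probabilities for scores 0..m under the partial credit model.

    steps : the m step difficulties of one item (or one summed testlet score)
    """
    steps = np.asarray(steps, dtype=float)
    # Cumulative sums of (theta - step); the score-0 term is fixed at 0.
    log_numer = np.concatenate(([0.0], np.cumsum(theta - steps)))
    numer = np.exp(log_numer - log_numer.max())  # subtract max for stability
    return numer / numer.sum()

theta, b = 0.5, -0.2  # illustrative values

# With a single step, the PCM reduces to the Rasch model.
one_step = pcm_probs(theta, [b])

# With a zero testlet effect, the Rasch testlet model reduces to the Rasch model.
no_effect = rasch_testlet_prob(theta, b, 0.0)

# A 3-item testlet scored 0..3 has three step difficulties (hypothetical values).
testlet_categories = pcm_probs(theta, [-1.0, 0.0, 1.0])
print(one_step[1], no_effect, rasch_prob(theta, b), testlet_categories)
```

The two reductions printed here mirror the statements in the text: the partial credit model extends the Rasch model, and the Rasch testlet model adds only the random effect term to the Rasch logit.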

Several indices have been proposed to detect local item dependence for dichotomous item response models. Yen (1984, 1993) introduced the $Q_3$ statistic by comparing it with other traditional measures: $Q_1$ (Yen, 1981), $Q_2$ (Van den Wollenberg, 1982), and Signed $Q_2$ (Van den Wollenberg, 1982). The $Q_3$ statistic is the inter-item correlation between item pairs once the effect of the latent trait is removed. Although the $Q_3$ statistic has been commonly used for several years, it has two major deficiencies in applied settings. First, the $Q_3$ statistic requires a latent trait computation prior to calculating the item pair residual correlation. Second, the entire set of test data must be used to compute the $Q_3$ statistic.

Therefore, Chen and Thissen (1997) proposed four innovative LID indices that compute the expected frequencies from IRT models. The calculation of these four local dependence indices uses a subset of items. The indices are Pearson's $X^2$, the likelihood ratio $G^2$, the standardized $\phi$ coefficient difference, and the standardized log-odds ratio difference. These four indices are defined for a pair of items.

Ponocny (2001) proposed a general family of conditional nonparametric tests of local stochastic independence (e.g., [...]). By creating a two-by-two table for two items, the comparison can be subjected to the standard [...] ([...], 1974). This test is able to detect the difference of covariance between the item pairs. The test of the local independence assumption can be conducted via a suitable contingency table (Ponocny, 2001). The general family of conditional nonparametric tests is embedded in the following extension of the Rasch model:

[...] (2-6)

where [...] is the item marginal sum for the item difficulty parameter. The random variable [...] is a sufficient statistic for the parameter [...], which expresses a certain violation of the Rasch model (Ponocny, 2001). Based on the conditional nonparametric tests of Ponocny (2001), local item dependence is demonstrated by the inter-item correlation between item $i$ and item $j$.
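
Yen's $Q_3$ statistic, described earlier in this section as the inter-item correlation once the effect of the latent trait is removed, can be sketched as the correlation of Rasch residuals. The sketch below is a hedged illustration, not the procedure used in this study: the function name `q3_matrix` is hypothetical, the data are simulated, and the true abilities and difficulties stand in for estimates, which a real analysis would not have.

```python
import numpy as np

def q3_matrix(responses, theta_hat, b_hat):
    """Q3-style matrix: correlations of Rasch residuals for every item pair."""
    responses = np.asarray(responses, dtype=float)
    theta_hat = np.asarray(theta_hat, dtype=float)
    b_hat = np.asarray(b_hat, dtype=float)
    # Model-implied probability of a correct response for each person and item.
    expected = 1.0 / (1.0 + np.exp(-(theta_hat[:, None] - b_hat[None, :])))
    residuals = responses - expected
    return np.corrcoef(residuals, rowvar=False)

# Simulated illustration (all values are assumptions): items 0-2 share a
# testlet effect, items 3-5 are locally independent.
rng = np.random.default_rng(0)
n_persons = 5000
theta = rng.normal(size=n_persons)
b = np.zeros(6)
gamma = rng.normal(scale=1.0, size=n_persons)  # person-by-testlet effect
logits = theta[:, None] - b[None, :]
logits[:, :3] += gamma[:, None]
prob = 1.0 / (1.0 + np.exp(-logits))
responses = (rng.random((n_persons, 6)) < prob).astype(int)

q3 = q3_matrix(responses, theta, b)
print(np.round(q3[0, 1], 3), np.round(q3[3, 4], 3))
```

In this simulation the residual correlation for a pair of items inside the testlet is clearly positive, while the correlation for a pair of independent items stays near zero, which is the pattern $Q_3$ is meant to flag.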
The inter-item correlation is based on the ($2 \times 2$) table by counting the cases with equal responses on both items (Ponocny, 2001). The $T_1$ statistic is applied for the local item dependence detection as below:

$$T_1 = \sum_{v} \delta_{x_{vi} x_{vj}} \tag{2-7}$$

where $\delta$ indicates the Kronecker symbol with $\delta_{ab} = 1$ for $a = b$ and $\delta_{ab} = 0$ otherwise. Then, a goodness-of-fit test is conducted to compare the model-implied estimates with the observed value from the matrix for two specific items. The sum of the $T_1$'s over the item pairs serves as a test statistic when two or more item pairs are investigated simultaneously (Ponocny, 2001). In the meantime, an overall test statistic ($T_{11}$) for the local dependence of the test is given by summing the absolute deviations of all inter-item correlations in the test from their expected values. The test statistic is shown below (Ponocny, 2001):

$$T_{11} = \sum_{i < j} \left| r_{ij} - \tilde{r}_{ij} \right| \tag{2-8}$$

where $r_{ij}$ is the observed and $\tilde{r}_{ij}$ the expected inter-item correlation.

2.2 Reliability

In educational measurement, reliability is a statistical index that quantifies the consistency of test scores. If the local item independence assumption is violated, the measurement errors are underestimated, which yields an inflated reliability estimate. Violations of the local item independence assumption commonly occur in testlets. The test construct is subject to the impact of measurement errors that are not related to the latent traits the test construct intends to measure. Thus, these measurement errors determine how reliably the test measures the construct. Test reliability has been consistently mentioned in previous testlet research. A concern about test reliability was expressed regarding the creation of super polytomous items to manipulate testlets (Keller, Swaminathan, & Sireci, 2003). The approach that treats these testlets as polytomous items may lose the information contained in the response pattern, so that the measurement errors may increase and reduce the overall test reliability (Keller et al., 2003).
In addition, compared to the original dichotomous items, some parameters are dropped when the polytomous items are formed, so the test reliability may decrease (Zenisky, Hambleton & Sireci, 2002). Yen (1993) also claimed that, when items are combined into testlet scores and some of the items within a testlet are locally dependent, the reliability will be underestimated. Thus, the comparison of the test reliabilities among the three models is especially necessary for model selection.

2.3 Survey in Application of Testlet

The review of the applied testlet literature in the EBSCO Host and PsychInfo databases also identified other possible factors that affect the application of testlet models. The testlet/independent item ratio within a test, in terms of testlet number, is another important factor in testlet research. Among the forty-one articles in which testlet numbers are specified, the general mean of the testlet numbers, including sub-conditions within each article, is 7.9 and the standard deviation of the testlet number is 6.70. The largest testlet number design was (2000) article. There is one other study containing a large testlet number in its research design: Tokar, Fischer, Snell, and Harik Williams (1991) included twenty testlets in their research. Except for these two large testlet number designs, all the other articles (39) contained three to fifteen testlets (e.g., Wainer & Lewis, 1990; Thissen, Steinberg & Mooney, 1989; Wang, Cheng & Wilson, 2005; Wainer, 1995; Yang & Gao, 2008). This range gave clear guidance for our research design. Information on testlet numbers used in previous testlet studies is presented in Table 2-1.

Based on the same literature review, forty-three out of fifty-five studies identified test lengths in their research designs. Across these forty-three articles, the test length ranged from 13 to 899. The mean test length (64.74) was obtained by first removing the largest test length (i.e.
899) from Wa summing the remaining test lengths, and dividing by the number of articles in which the test length was included in the design (Table 2-2).

The research sample size is the third factor that can influence the analysis of testlets. From the testlet application literature, thirty-seven articles identified the sample size, with a mean of 2047.22 and a standard deviation of 2105.86. Since some studies in more illustrative. The range of the sample sizes provides a guideline for our research design. First, in twelve out of the thirty-seven articles reviewed, researchers included sample sizes smaller than 500 (e.g., Adams, Wilson & Wang, 1997; Wang, 2005; Schmitt, 2002). Second, eighteen articles included sample sizes between 500 and 1000 (e.g., Adams, Wilson & Wang, 1997; Stark, Chernyshenko & Drasgow, 2004). Finally, twenty studies included sample sizes larger than 1000 (e.g., Brandt, 2008; Wainer & Wang, 2000; Thissen, Steinberg & Mooney, 1989). Table 2-3 details the information on sample sizes used in previous studies.

Seventeen studies used the RMSE and the loglikelihood ratio coefficient as the extraction criteria (e.g., Stark, Chernyshenko & Drasgow, 2004; DeMars, 2006; Armstrong, 2004). The next most commonly used criteria were the reliability coefficient and bias (used by nine and five papers, respectively) (e.g., Stark, Chernyshenko & Drasgow, 2004; DeMars, 2006; Armstrong, 2004; Davis, 2003; Schmitt, 2002). Various other indices (e.g., AIC, WMSE, RMSEA, NNFI, CFI, GFI, Q3, RMS, etc.) were used in twenty studies (e.g., Gessaroli & Folske, 2002; Schmitt, 2002; Adams, Wilson & Wang, 1997). Clearly, most researchers relied on the RMSE and the loglikelihood ratio coefficient to compare the model fit and parameter estimates. Table 2-4 gives detailed information on the fit criteria used.

Finally, the estimation methods were designated in twenty-nine studies.
Twenty-four of these articles applied the Marginal Maximum Likelihood (MML) method (e.g., Lee, 2006; Wang & Wilson, 2005; Wainer, 1995). Only five articles used the Markov Chain Monte Carlo (MCMC) method (e.g., Li, 2005; Li, 2006; Wang, 2002; Wainer & Wang, 2000). The data analysis iterations were only acknowledged in eleven articles (e.g., Lee, 2000; Ip, Smits & De Boeck, 2009; Stark, Chernyshenko & Drasgow, 2004). Among these eleven articles, five applied 100 iterations (e.g., Stark, Chernyshenko & Drasgow, 2004; DeMars, 2006) and only two articles applied even more (200 and 600) iterations (Li, 2006; Zwick, 2002). Tables 2-5 and 2-6 include detailed information on the estimation method and iteration times used for all the studies reviewed.

Table 2-1. The number of testlets in the dataset

Article   Testlet number(s)   Mean testlet number per article
1         2, 3                2.5
2         4, 8                6.0
3         5                   5
4         6                   6
5         4                   4
6         15                  15
7         4                   4
8         5                   5
9         5                   5
10        5                   5
11        5, 10               7.5
12        7                   7
13        4, 8, 10            7.3
14        6                   6
15        6, 10               8.0
16        4                   4.0
17        2                   2.0
18        9, 16, 14, 11       12.5
19        8                   8.0
20        16                  16.0
21        8, 9                8.5
22        5, 10               7.5
23        6, 3, 2             3.7
24        4, 5, 7, 8          6.0
25        4                   4.0
26        11                  11.0
27        5, 7, 8             6.7
28        5, 7, 8             6.7
29        50, 36              43.0
30        7                   7.0
31        4, 5, 7, 8          6.0
32        3, 6                4.5
33        20                  20.0
34        4, 5, 9             6.0
35        5                   5.0
36        2, 5, 6, 10         5.8
37        4                   4.0
38        4                   4.0
39        10                  10.0
40        10                  10.0
41        10                  10.0
Testlet number general mean: 7.9
SD of mean: 6.70

Table 2-2. Test length in the reviewed articles

Article   Test length(s)                                   Mean length
1         64                                               64
2         194                                              194
3         76                                               76
4         120                                              120
5         22                                               22
6         50                                               50
7         60                                               60
8         125                                              125
9         30, 33, 24, 40, 47, 36, 40, 42, 50, 51, 54       40.64
10        25, 50, 15                                       30
11        150                                              150
12        20                                               20
13        13, 17, 18                                       17.50
14        30, 50                                           40
15        60                                               60
16        64                                               64
17        60, 90                                           75
18        35, 41, 35, 35                                   36.50
19        55                                               55
20        101                                              101
21        55, 63                                           59
22        150                                              150
23        30                                               30
24        38, 26, 46, 56, 40                               41.20
25        40                                               40
26        50                                               50
27        49, 33, 36                                       39.33
28        49, 36, 43, 33                                   40.25
29        690                                              290
30        42                                               42
31        38, 46, 26, 30, 43, 43                           37.67
32        60                                               60
33        60                                               60
34        44, 57, 26, 33, 44                               40.80
35        60                                               60
36        20                                               20
37        60, 125                                          92.50
38        30                                               30
39        75                                               75
40        75                                               75
41        20                                               20
42        120                                              120
43        137                                              137
General mean
64.77

Table 2-3. Sample sizes in the reviewed articles

Article No.   Sample sizes in paper (sets 1 to 6)    Mean per paper   Category indicators (<500; 500 to 1000; >1000)
1             700, 300                               500.00           1 1
2             200, 500                               350.00           1 1
3             8912                                   8912             1
4             3866                                   3866             1
5             1210, 589, 352                         717.00           1 1 1
6             4000                                   4000             1
7             500, 1000, 2000, 5000                  2125.00          1 1 1
8             1000                                   1000             1
9             500, 2000, 8000                        3500.00          1 1
10            2000, 5000                             3500.00          1
11            2000                                   2000             1
12            5000                                   5000             1
13            300, 500                               400.00           1 1
14            1000, 1392                             1196.00          1 1
15            2000                                   2000             1
16            570, 499, 522, 495                     521.50           1 1
17            1000                                   1000             1
18            8026, 8494                             8260.00          1 1
19            3000, 1000, 5000                       3000.00          1 1
20            1000, 266                              633.00           1 1
21            1996                                   1996             1
22            466                                    466              1
23            663, 632, 537, 680, 653, 561           621.00           1
24            663, 632, 537, 680, 653, 561           621.00           1
25            544                                    544              1
26            1000                                   1000             1
27            985, 629, 914, 682, 666, 1000          812.67           1
28            1000                                   1000             1
29            485                                    485              1
30            3000                                   3000             1
31            100                                    100              1
32            1040                                   1040             1
33            4028                                   4028             1
34            4028                                   4028             1
35            500                                    500              1
36            10, 15, 25, 50                         25.00            1
37            3000                                   3000             1
General mean: 2047.22; articles per category: 12 (<500), 18 (500 to 1000), 20 (>1000)
Standard deviation: 2105.86; share per category: 32.43%, 48.65%, 54.05%
General median: 681

Table 2-4. Fit indices in reviewed articles

              Articles   Bias     RMSE     Reliability coefficient   Loglikelihood ratio test   WMSE    AIC     Other index
Count         42         5        17       9                         17                         2       1       20
Percentage               11.90%   40.48%   21.43%                    40.48%                     4.76%   2.38%   47.62%

Table 2-5. Estimation method in reviewed articles

Estimation method   MML      MCMC     Total
Articles            24       5        29
Percentage          82.76%   17.24%

Table 2-7. The number of simulation replications applied in the reviewed articles

Replication number   10      100      200      600      1000    Total
Frequency            1       5        2        2        1       11
Percentage           9.09%   45.45%   18.18%   18.18%   9.09%

CHAPTER 3 METHOD

A comprehensive review of the testlet research from 1989 to 2009 provides a systematic framework for exploring the performance of three different IRT models in analyzing testlets.
These three models are part of two studies presented in this paper. The first is a series of simulation studies designed to investigate the extent to which fluctuations in testlet conditions (testlet size, local dependence effects, etc.) influence the model fitting results. Simulations are conducted to evaluate the model fit, test reliability, and parameter recovery of the three different IRT models. Next, a real data analysis of the COMLEX-USA exam dataset is presented as an empirical case by fitting the different models. The three one-parameter IRT models adopted in the study are the Rasch model, the Partial Credit Model, and the Rasch testlet model.

3.1 Model Used to Generate Data

The current study evaluates the effect of changes in the local effect of testlets on the model fit, ability parameter recovery, and test reliability of three different IRT models. In order to quantify the extent of the local effect, the Rasch testlet model is appropriate for simulating the research data. The Rasch testlet model (Wang & Wilson, 2005) includes a testlet parameter ($\gamma$), which is the random effect capturing the interaction of a person with a testlet when the overarching latent trait is held constant. According to the definition of the testlet, the sum of the testlet parameters over examinees within any testlet is zero. Thus, the local effects of testlets in the Rasch testlet model are simulated from the normal distribution with a mean of zero and a standard deviation equal to the square root of the given local effect value. The following prior model constraints are used to simulate the responses. With v V and V the total number of examinees, for all v V (3-1) for all (3-2) for all (3-3)

3.2 Population Parameters

Population item parameters for the Rasch testlet model and the ability parameters for the population are simulated from the normal distribution with a mean of zero and a standard deviation of one (i.e.
within a range from negative three to positive three). For each condition, the population item difficulty parameters are generated from a normal distribution with a mean of zero and a standard deviation of one over a bounded range. For simplicity, all simulated population parameters are rounded to three decimal places. The population item parameters and population ability parameters are randomly drawn from these two normal distributions ahead of each condition.

3.3 Condition Manipulated

In this study, we examine whether fluctuations of testlet size, local dependence effects, and item difficulty within testlets affect the reliabilities and the model fit of three different IRT models. Our study is a four-factor completely crossed design: 2 (changes in testlet size) × 4 (levels of local dependence effect) × 3 (ratio of testlet items and general items in test) × 3 (sample size). Table 3-1 lists all 72 conditions and the interactions of these four factors in the testlet research design.

1. The first factor is the testlet size. The testlet sizes chosen for this study are based on the purpose of the study and the sizes less often discussed in the applied literature. Thus, two patterns of testlet size, a small and a medium testlet size, are used in this study: three items and five items per testlet.
2. The second factor is the local dependence effect. Local dependence effects from the ten reviewed studies are within the range of zero to one (Wainer & Wang, 2000; Wang, 1999; Wang, 2002; Wang, 2005; Habing & Roussos, 2003; Adams, Wilson & Wang, 1997; Wang & Wilson, 2005; DeMars, 2006; Li, 2005; Zenisky, Hambleton & Sireci, 2002). Therefore, four levels of local dependence effect are examined: 0.25, 0.5, 0.75, and 1.
3. The third factor is the ratio of testlet items to general items in the test. Among all 60 items in the test, the ratio of testlet items to general items is 45:15, 30:30, or 15:45 (see the testlet numbers in Table 3-1).
4. The fourth factor is the sample size of the examinees.
Of seventy-four study groups in forty-five different articles from the applied literature, the sample size ranged from 10 to 8912, with two sample sizes greater than 8000 and four sample sizes smaller than 50. By dividing the remaining sixty-eight sample sizes into three groups according to their size ranking, and taking the approximate mean value of the sample sizes in each category, we selected three sample sizes for use in this study (250, 500, and 1000). These quantities represent rounded approximations of the most common sample sizes found in the applied literature.
5. Test length is the other issue that must be considered ahead of the research design. The test length of this simulation is set to sixty (60 items per test), the approximate general mean of the test length among the reviewed testlet literature.
6. For each condition, the number of replications is selected on the basis of the most frequently occurring number of iterations in the applied literature. Thus, one hundred replications are applied for each condition.

3.4 Data Generation

The Rasch testlet model response data are generated using the statistical software R 2.10. Response data were generated for 100 samples from a set of population item parameters (60 items) and population ability parameters (1000 trait values) for each condition. Local effects were assigned to each testlet accordingly. Each simulee was assigned a known trait value from the randomly selected population ability parameters. By comparing the combined effect of the local effects within testlets plus the randomly selected population item parameters with the known trait value of each simulee, the probability of observing the response matrix from a sample of independently responding examinees can be represented as (3-4), where the ability, item difficulty, and testlet parameters are all considered unknown, fixed parameters. Thus, a response matrix with all logical indicators was generated for each replication within every condition.
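One replication of this generation procedure can be sketched as follows. The thesis used R 2.10; this Python version is only an illustration under stated assumptions: the logit is taken as ability minus difficulty plus the testlet effect, a response is scored 1 when a uniform random number falls below the model probability, and the truncation of parameters to a bounded range is omitted. The function name, the `layout` encoding, and the seed are hypothetical.

```python
import math
import random

def simulate_testlet_data(n_persons, testlet_of, local_var, seed=0):
    """Simulate one 0/1 response matrix under a Rasch testlet model.

    testlet_of: for each item, its testlet index (0, 1, ...) or None if independent.
    local_var:  variance of the testlet effect (the "local effect" level).
    """
    rng = random.Random(seed)
    n_testlets = len({t for t in testlet_of if t is not None})
    # Population parameters drawn from N(0, 1) and rounded to three decimals.
    b = [round(rng.gauss(0.0, 1.0), 3) for _ in testlet_of]
    theta = [round(rng.gauss(0.0, 1.0), 3) for _ in range(n_persons)]
    sd = math.sqrt(local_var)
    data = []
    for v in range(n_persons):
        # One random testlet effect per person and testlet, drawn from N(0, local_var).
        gamma = [rng.gauss(0.0, sd) for _ in range(n_testlets)]
        row = []
        for i, t in enumerate(testlet_of):
            g = gamma[t] if t is not None else 0.0
            p = 1.0 / (1.0 + math.exp(-(theta[v] - b[i] + g)))
            row.append(1 if rng.random() < p else 0)
        data.append(row)
    return theta, b, data

# One design cell: 9 testlets of 5 items plus 15 independent items (60 items in total).
layout = [t for t in range(9) for _ in range(5)] + [None] * 15
theta, b, data = simulate_testlet_data(250, layout, local_var=0.5)
```

Repeating the call with 100 different seeds would give the 100 samples of one condition.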
Then, a series of random numbers was drawn from a uniform distribution ranging from 0 to 1 to match the logical response matrix accordingly. If the known trait value is less than the combined effect of the item and testlet, the logical indicator is false; if the known trait value is larger than the combined effect of the item and testlet, the logical indicator is true. This comparison is made for every item and every simulee in each of the 100 samples. Thus, 100 simulated response sets are generated for each condition.

3.5 Parameter Estimation

In the study, the parameters of the dataset under 3 different models (PCM, standard Rasch model, and Rasch testlet model) are estimated using Marginal Maximum Likelihood (MML) methods with ConQuest Version 2.0. The most frequently used approaches to item parameter estimation for unknown trait levels are Joint Maximum Likelihood (JML), Conditional Maximum Likelihood (CML), and Marginal Maximum Likelihood (MML). Holland (1990) compared the different sampling theory foundations of these three ML methods. CML is possible only for the 1PL model and is so computationally intensive as to be impractical in many situations. JML was used extensively in early IRT programs. However, JML estimation also has some drawbacks for estimating IRT models. First, the JML item parameter estimates are biased and inconsistent for fixed-length tests. Second, the JML standard errors are probably too small because they do not account for the unknown person trait levels (Holland, 1990). The most commonly used method for estimating the parameters of IRT models is Marginal Maximum Likelihood (MML). In MML estimation, unknown trait levels are handled by expressing the response pattern probabilities as expectations over a population distribution. MML has several advantages over the other two ML methods. First, MML is applicable to all types of IRT models. Second, MML is efficient for tests with different lengths.
Third, the MML estimates of item standard errors may be justified as good approximations of the expected sampling variance of the estimates. Fourth, estimates are available for perfect scores. In the previous literature, the Marginal Maximum Likelihood (MML) method is applied in 82.76% of the articles. Therefore, MML was chosen for the parameter estimation in this study.

The simplified mechanism of MML is shown below. The prior knowledge about the examinee distribution is treated as a prior and the item difficulty parameter is indicated as That is, MML estimates of the item difficulty parameter maximize (3-5) Therefore, a posterior distribution is obtained for item parameters by multiplying by (Mislevy, 1986): (3-6)

3.6 Ability Estimation

Ability estimation is performed by two different approaches in this study. These two approaches are Maximum Likelihood Estimation (MLE; Lord, 1980) and Expected a Posteriori Estimation (EAP; Bock & Mislevy, 1982). Maximum likelihood estimation (MLE) is the most commonly used estimation method. MLE finds the value of the latent trait that maximizes the likelihood of an item response pattern, under the assumption that the item parameter values are known. The likelihood of the latent trait given an item response pattern is denoted as

$$L(\theta \mid x_1, \ldots, x_I) = \prod_{i=1}^{I} P_i(x_i \mid \theta) \tag{3-7}$$

where $P_i(x_i \mid \theta)$ represents the probability of a given response to item $i$ and $I$ is the number of items in the test. Although MLE is the most common approach for ability estimation, some drawbacks of MLE must be addressed. First, MLE is not available for all-endorsed or all-not-endorsed item response patterns. If these two item patterns exist, the MLE results go to infinity. Second, MLE may not converge when some response patterns are abnormal (Bock & Mislevy, 1982).

The second approach is EAP estimation. EAP is a Bayesian estimator with a non-iterative process. Unlike MLE, EAP provides a finite estimate for all-endorsed or all-not-endorsed item response patterns.
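The contrast between the two estimators for a perfect response pattern can be illustrated with a small sketch. It assumes a dichotomous Rasch model with known item difficulties, a Newton-Raphson search for the MLE, and a standard normal prior evaluated on equally spaced quadrature nodes for the EAP; none of these choices is claimed to match ConQuest's implementation, and the item difficulties are hypothetical.

```python
import math

def rasch_prob(theta, b):
    """P(correct) under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def log_likelihood(theta, x, b):
    """Log of the pattern likelihood: product over items of P(x_i | theta)."""
    total = 0.0
    for xi, bi in zip(x, b):
        p = rasch_prob(theta, bi)
        total += math.log(p) if xi == 1 else math.log(1.0 - p)
    return total

def mle(x, b, iters=50):
    """Newton-Raphson MLE; returns None when no finite maximum exists."""
    r = sum(x)
    if r == 0 or r == len(x):        # all-not-endorsed or all-endorsed pattern
        return None
    theta = 0.0
    for _ in range(iters):
        p = [rasch_prob(theta, bi) for bi in b]
        grad = r - sum(p)                          # first derivative of the log likelihood
        info = sum(pi * (1.0 - pi) for pi in p)    # negative second derivative
        theta += grad / info
    return theta

def eap(x, b, n_nodes=41, lo=-4.0, hi=4.0):
    """Posterior mean over quadrature nodes with an (assumed) N(0, 1) prior."""
    nodes = [lo + k * (hi - lo) / (n_nodes - 1) for k in range(n_nodes)]
    prior = [math.exp(-0.5 * q * q) for q in nodes]   # unnormalized density weights
    like = [math.exp(log_likelihood(q, x, b)) for q in nodes]
    num = sum(q * l * w for q, l, w in zip(nodes, like, prior))
    den = sum(l * w for l, w in zip(like, prior))
    return num / den

b = [-1.0, -0.5, 0.0, 0.5, 1.0]          # hypothetical item difficulties
theta_mle = mle([1, 1, 1, 0, 0], b)      # finite for a mixed pattern
mle_perfect = mle([1, 1, 1, 1, 1], b)    # None: the likelihood has no finite maximum
eap_perfect = eap([1, 1, 1, 1, 1], b)    # finite, pulled toward the prior mean
```

For the all-endorsed pattern the likelihood keeps increasing with the trait value, which is why the sketch returns no MLE, while the prior keeps the EAP finite.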
In fact, the EAP estimate is the mean of the posterior distribution. For any test, a set of quadrature nodes ($Q_k$) is defined for a fixed number of specified trait levels. There is a probability density $W(Q_k)$ corresponding to each quadrature node. The EAP trait estimate is derived by

$$\hat{\theta}_{EAP} = \frac{\sum_{k} Q_k \, L(Q_k) \, W(Q_k)}{\sum_{k} L(Q_k) \, W(Q_k)} \tag{3-8}$$

where $L(Q_k)$ represents the exponent of the log likelihood function evaluated at each of the quadrature nodes. However, some shortcomings of EAP should be mentioned. First, there is a tendency for Bayesian estimates to regress toward the mean of the prior distribution (Kim & Nicewander, 1993; Weiss, 1982). Since ConQuest provides EAP estimates both with and without regression, the EAP estimates without regression were applied. The other shortcoming of EAP is that its estimation accuracy is reduced by an improper prior distribution (Bock & Mislevy, 1982). Since both the MLE and EAP ability estimation approaches have their pros and cons, this study applies both methods simultaneously.

3.7 Analysis

In this study, each simulated data set was analyzed using ConQuest. Since the polytomous response patterns used by the partial credit model differ from the dichotomous response patterns, the observed data differ between these two types of models (i.e., polytomous IRT model, dichotomous IRT model). Thus, using the loglikelihood ratio test and Akaike's information criterion (AIC) as measures of the goodness of fit of the models is inappropriate. Therefore, the accuracy of estimation for ability parameters under the three different models was quantified via bias and root mean square error (RMSE) across all replications. The local item dependences for the real data were examined by the conditional nonparametric tests (Ponocny, 2001). The test reliability coefficients were also calculated for the simulated data and the real data.

3.7.1 Bias

Bias is defined as the average difference between true and estimated parameters across all people and items.
An estimate of bias is calculated for each replication in each condition, and an average bias is then computed for each condition in the simulation. Bias is mathematically defined as:

$$\text{Bias} = \frac{\sum_{j=1}^{N} (\hat{\xi}_j - \xi_j)}{N} \tag{3-9}$$

where $\xi_j$ is the true value of an item or person parameter; $\hat{\xi}_j$ is the estimated value of that parameter; and $N$ is the total number of instances of that type of parameter within a replication (i.e., the sample size for ability).

3.7.2 Root Mean Square Error (RMSE)

RMSE is a measure of absolute accuracy in parameter estimation. RMSE is calculated for each parameter type in a replication, and an average is then computed for each condition. RMSE is the square root of the average squared difference between estimated and true parameters, and is mathematically defined as:

$$\text{RMSE} = \sqrt{\frac{\sum_{j=1}^{N} (\hat{\xi}_j - \xi_j)^2}{N}} \tag{3-10}$$

where the terms in the equation are defined as they are for bias.

3.7.3 Reliability

In this study, test reliability coefficients were computed for item responses scored dichotomously for both the Rasch testlet model and the standard Rasch model, as well as for item responses scored polytomously for the Partial Credit Model.
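The bias and RMSE definitions in equations (3-9) and (3-10) can be sketched for one replication and then averaged over replications. The sign convention (estimated minus true) is an assumption chosen to agree with the positive bias reported later at low ability levels; the function names and the numbers are hypothetical.

```python
import math

def bias(true_vals, est_vals):
    """Mean signed error within one replication (estimated minus true)."""
    return sum(e - t for t, e in zip(true_vals, est_vals)) / len(true_vals)

def rmse(true_vals, est_vals):
    """Square root of the mean squared error within one replication."""
    return math.sqrt(sum((e - t) ** 2 for t, e in zip(true_vals, est_vals)) / len(true_vals))

def condition_average(values):
    """Average of a per-replication statistic over all replications of a condition."""
    return sum(values) / len(values)

# Two hypothetical replications with three simulees each.
true_theta = [-2.0, 0.0, 2.0]
estimates = [[-1.5, 0.1, 1.6], [-1.7, -0.1, 1.8]]

rep_bias = [bias(true_theta, est) for est in estimates]
rep_rmse = [rmse(true_theta, est) for est in estimates]
mean_bias = condition_average(rep_bias)
mean_rmse = condition_average(rep_rmse)
```

Because positive and negative errors cancel in the bias but not in the RMSE, estimates that shrink toward the mean can show a small overall bias together with a noticeably larger RMSE.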
As we use MML estimation in ConQuest, the test reliability can be calculated as (3-11)

Table 3-1. Study design condition with 3 factors

Testlet size 5                                            Testlet size 3
Condition   Sample size   Testlet number   Local effect   Condition   Sample size   Testlet number   Local effect
1           1000          9                0.25           37          1000          15               0.25
2           1000          9                0.5            38          1000          15               0.5
3           1000          9                0.75           39          1000          15               0.75
4           1000          9                1              40          1000          15               1
5           1000          6                0.25           41          1000          10               0.25
6           1000          6                0.5            42          1000          10               0.5
7           1000          6                0.75           43          1000          10               0.75
8           1000          6                1              44          1000          10               1
9           1000          3                0.25           45          1000          5                0.25
10          1000          3                0.5            46          1000          5                0.5
11          1000          3                0.75           47          1000          5                0.75
12          1000          3                1              48          1000          5                1
13          500           9                0.25           49          500           15               0.25
14          500           9                0.5            50          500           15               0.5
15          500           9                0.75           51          500           15               0.75
16          500           9                1              52          500           15               1
17          500           6                0.25           53          500           10               0.25
18          500           6                0.5            54          500           10               0.5
19          500           6                0.75           55          500           10               0.75
20          500           6                1              56          500           10               1
21          500           3                0.25           57          500           5                0.25
22          500           3                0.5            58          500           5                0.5
23          500           3                0.75           59          500           5                0.75
24          500           3                1              60          500           5                1
25          250           9                0.25           61          250           15               0.25
26          250           9                0.5            62          250           15               0.5
27          250           9                0.75           63          250           15               0.75
28          250           9                1              64          250           15               1
29          250           6                0.25           65          250           10               0.25
30          250           6                0.5            66          250           10               0.5
31          250           6                0.75           67          250           10               0.75
32          250           6                1              68          250           10               1
33          250           3                0.25           69          250           5                0.25
34          250           3                0.5            70          250           5                0.5
35          250           3                0.75           71          250           5                0.75
36          250           3                1              72          250           5                1

CHAPTER 4 RESULTS

4.1 MLE Non-convergence Issue

A large number of non-convergence cases arose in the Rasch testlet model results via MLE estimation in the 1000 sample size condition. Additionally, this non-convergence pattern also occurred across the other sample sizes. The number and percentage of non-convergence cases are displayed in Tables 4-1 and 4-2. After checking the response patterns of the non-convergence cases, neither all-not-endorsed nor all-endorsed response patterns were found. Therefore, this phenomenon may occur because of the complexity of the multidimensionality of the Rasch testlet model. So, for precision purposes, only EAP estimate results are used in this study.

4.2 Test Reliability

A summary of the test reliability analyses is presented in Tables 4-3 and 4-4. Three columns of estimates are provided, one for each model, in each condition. For most of the conditions, the reliability estimates from the standard Rasch model are higher than the reliability estimates from both the Partial Credit model and the Rasch testlet model. The associations between test reliability and the other factors are described below.
First, the difference in test reliability estimates between the standard Rasch model and the other two models indicates a strong association between the ratio of independent items to testlet items within a test and test reliability overestimation. In general, the test reliability estimated from the standard Rasch model is higher than its corresponding coefficient from the other two models (by 0.01 to 0.08). As the ratio of independent to testlet items within a test decreases (i.e., a greater proportion of testlet items is included in a test), the extent of test reliability overestimation increases due to ignoring the local item dependence. This phenomenon occurs because locally dependent testlet items exist while the standard Rasch model assumes that items within a test are locally independent, which results in overestimated test reliability.

Second, the difference in test reliability estimates between the standard Rasch model and the other two models across different sample sizes indicates a strong association between the sample size and test reliability overestimation. As the sample size increases, the extent of test reliability overestimation increases as well. No evident patterns were found to disclose an association between test reliability, local effect, and testlet size.

As for the test reliability comparison between the Partial Credit model and the Rasch testlet model, no obvious differences were observed between the test reliability estimates computed from these two models when the testlet size was held to three items. Moreover, when the testlet size was set to five, under most circumstances the test reliability estimates from the Partial Credit model are slightly smaller than their corresponding estimates from the Rasch testlet model, but the differences are generally smaller than 0.01.
This is because more parameters are dropped from the polytomous model compared to the individual item scoring models as the testlet size increases, thereby decreasing the effective test length and decreasing the estimate of the test reliability (Zenisky, Hambleton & Sireci, 2002). In addition, any association between the magnitude of the local item effect and the variation of the test reliability estimates was not obvious in the results from this study.

The results of the Spearman-Brown prophecy are listed in Tables 4-5 and 4-6. The Spearman-Brown prophecy values from the three models also indicate the effect of the reliability overestimation caused by using the standard Rasch model. If the test administration claims that the testlet-based test satisfies some required test reliability level by using the overestimated test reliability coefficient from the standard Rasch model, these results provide an estimate of the amount by which a testlet-based test would need to be lengthened to achieve the same magnitude as the overestimated test reliability obtained when the standard Rasch model is applied. For the sample size 1000 conditions, an increase of approximately 4 times the test length or more (from 3.984 to 5.138) in the testlet-based test (i.e., PCM, Rasch testlet model) would be needed to achieve the (overestimated) level of test reliability indicated by applying the standard Rasch model. As the sample size decreases to half (500), the required increase in test length to achieve the overestimated test reliability is halved as well. As the sample size decreases to a quarter (250), the required increase in test length to achieve the overestimated test reliability is minimal but still positive.

4.3 Standard Error of Measurement

The magnitude of the standard error of measurement (SEM) for the three different models was also used for model comparison in this study. Tables 4-7 and 4-8 list the mean SEM for all 72 conditions over 100 replications.
In general, the values of mean SEM obtained from the standard Rasch model were smaller than the values of mean SEM obtained from the other two models (i.e., partial credit model, Rasch testlet model), but the differences were generally smaller than 0.02. This phenomenon occurs because ignoring the local dependency within a testlet leads to an underestimate of the standard errors.

The magnitude of the SEM underestimation might be influenced by different testlet sizes. In Tables 4-7 and 4-8, even when holding the same level of the independent/testlet item ratio, a larger testlet size (i.e., testlet size 5) on average led to a larger extent of underestimation in SEM than a smaller testlet size (i.e., testlet size 3). The quantitative differences in the SEM due to the testlet size difference were generally around 0.01. No obvious associations were observed between the SEM estimates and the local effect variations across conditions.

4.4 Bias and RMSE

Tables 4-9 and 4-10 list the mean of the bias estimates and the mean of the RMSE estimates for all 72 conditions over 100 replications. There are three sets of estimates in each table, corresponding to the three different models (i.e., standard Rasch model, partial credit model, Rasch testlet model), for the testlet size three and five conditions. The means of RMSE (within a range from 0.01 to 0.01) over all conditions for testlet sizes 3 and 5 are small for the three different models, especially when compared with the ability range of negative three to positive three. For all three models, the magnitude of the RMSE estimates is fairly satisfactory. However, throughout the entire ability interval (i.e., [-3.0, 3.0]), the magnitudes of the bias of all three models are relatively high for some conditions. In general, no obvious associations were found between the testlet size and the magnitude of the bias estimates.
An association between the ratio of independent to testlet items within a test (i.e., the number of testlets) and the bias estimates was not found either.

In order to reveal how bias and RMSE change as a function of ability, the ability range is split into 6 intervals and the bias and RMSE estimates are calculated accordingly. Tables 4-11 to 4-16 display the mean bias estimates of ability recovery (i.e., EAP estimates) in 6 different ability intervals for the three different models over all 72 conditions. According to the results listed in the tables, a relatively high magnitude of positive bias was observed at the lowest ability interval level for all three models across all conditions. Meanwhile, a relatively high magnitude of negative bias was also found at the highest ability interval level for all three models (i.e., standard Rasch model, partial credit model, Rasch testlet model) across all conditions. Since applying EAP estimation may result in the ability estimate distribution leaning towards its mean, a possible cause for this high magnitude of bias at both ends of the ability range might be the use of the EAP estimates. Other than the high magnitude of bias at both ends of the ability range, no obvious patterns or associations between mean bias variations and the major factors in this study were found across the three models.

In addition, Tables 4-17 to 4-22 display the RMSE estimates of ability recovery in 6 different ability intervals for the three different models (i.e., standard Rasch model, partial credit model, Rasch testlet model) over all 72 conditions. Similar to the bias estimates, except for the relatively high magnitude of RMSE estimates at both ends of the ability range, no obvious patterns or associations between RMSE estimate variations and the major factors in this study were found across the three models either. In sum, all three models (i.e.
standard Rasch model, partial credit model, Rasch testlet model) performed fairly well in recovering the ability estimates, as indicated by the relatively low magnitude of the bias and RMSE estimates.

4.5 An Empirical Case

The National Board of Osteopathic Medical Examiners (NBOME) offers the computer-based COMLEX-USA exams online. This computer-based exam series is designed to assess the osteopathic medical knowledge and clinical skills considered essential for osteopathic generalist physicians to practice medicine without supervision. The COMLEX-USA exam responses have been analyzed with the standard Rasch IRT model.

The 2008 NBOME COMLEX-USA Level 2 exam data are used as an empirical case for this study. The COMLEX-USA Level 2 exam consists of 350 items in 7 blocks, including 141 independent items and 209 testlet items grouped in 95 testlets (all testlet sizes are within 2 to 4 items). Each item's type is identified (A: single item; D: single item with graph; B: matching item; S: testlet item; F: testlet item with graph). The B, S, and F type items are categorized as testlet items. Among the 95 testlets, there are 4 testlets with matching items and 9 testlets with a graph. A total of 450 examinees were included in the examinee population, and there were no missing data.

The data from the first block of this exam (Block 1) are used for this study. Block 1 contains 50 items, including 27 independent items and 23 testlet items within 10 testlets. The data set was analyzed separately using the standard Rasch model, the partial credit model, and the Rasch testlet model.

Table 4-23 lists the weighted mean square errors (WMSE) for the 50 items under the three models. In the output of the Rasch testlet model, the WMSE ranges from 0.86 to 1.19.
The two items with the most extreme WMSE in the Rasch testlet model are item 47 and item 5, both with a nonsignificant p value (an item with a significant p value is treated as an item with a bad fit). That is, the item fit for all 50 items in this block is acceptable. In the output of the partial credit model, the WMSE ranges from 0.97 to 1.07. The two items with the most extreme WMSE in the partial credit model are item 40 and item 35. In the output of the standard Rasch model, the WMSE ranges from 0.97 to 1.18. The two items with the most extreme WMSE in the standard Rasch model are item 8 and item 37. The WMSE estimates from all three models indicate that these 50 items have an acceptable item fit. Therefore, all of them are kept in the test.

The estimates of test reliability for the overarching latent trait are 0.899 for the Rasch testlet model, 0.909 for the partial credit model, and 0.936 for the standard Rasch model. Thus, the standard Rasch model appears to overestimate the test reliability because it ignores the local item dependence within testlets. The Spearman-Brown prophecy formula is used to compute how much the test length would have to increase under the Rasch testlet model and the partial credit model to reach the test reliability estimated by the standard Rasch model (0.936). For the Rasch testlet model, the test length would have to be increased by approximately 63.65% (32 items) to achieve the overestimated test reliability. For the partial credit model, the test length would have to be increased by approximately 47.22% (24 items) to achieve the overestimated test reliability.

The NBOME COMLEX-USA exam has previously been analyzed for local item dependence (Shen & Yen, 1997). In this study, local item dependence was instead examined with the eRm package in R (Mair & Hatzinger, 2007), using the nonparametric Rasch model tests proposed by Ponocny (2001).
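The lengthening factors reported above follow from solving the Spearman-Brown prophecy formula for the length multiplier n, which gives n = ρ_target(1 − ρ_current) / [ρ_current(1 − ρ_target)]. The sketch below applies that expression to the reliabilities quoted in the text. The function name `spearman_brown_length_factor` is hypothetical, and because the inputs are the rounded reliabilities (0.899, 0.909, 0.936), the resulting percentages are close to, but not identical with, the reported 63.65% and 47.22%.

```python
def spearman_brown_length_factor(current_rel, target_rel):
    """Factor by which a test must be lengthened to move from current_rel
    to target_rel, obtained by solving the Spearman-Brown formula
    target = n * current / (1 + (n - 1) * current) for n."""
    return (target_rel * (1.0 - current_rel)) / (current_rel * (1.0 - target_rel))


# Reliabilities quoted in the text (rounded to three decimals).
target = 0.936  # standard Rasch model
for label, rel in [("Rasch testlet model", 0.899), ("partial credit model", 0.909)]:
    n = spearman_brown_length_factor(rel, target)
    print(f"{label}: length factor {n:.3f}, increase {100 * (n - 1):.1f}%")
```

The same expression appears to underlie the simulation results: for condition 1 of the testlet size 3 design, the reliabilities 0.90631 (testlet model) and 0.97843 (standard Rasch model) in Table 4-3 give a factor of about 4.689, which matches the corresponding entry in Table 4-5.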
The method used is "T1", which checks for local dependence via increased inter-item correlations: for each item pair, the cases with equal responses on both items are counted. For the 50-item test block, there are 1,225 possible item pairs. Among these 1,225 item pairs, 27 are detected to have significant local item dependence. Results are provided in Table 4-24. For items within the testlets, the local item dependence is evident (e.g., item pairs 28-30, 28-31, and 47-48). Fourteen of the twenty-seven item pairs (51.85%) in which local item dependence exists involve items within testlets.

An overall test statistic for the local dependence of the test is also given by the nonparametric Rasch model tests (Ponocny, 2001). This global test sums the absolute deviations of all inter-item correlations in the test from their expected values. The one-sided p value for this 50-item block is 0.371 (significance level 0.05), which indicates that the global test of local dependence is nonsignificant and that this test (i.e., NBOME COMLEX-USA Level 2 exam Block 1) satisfies the local independence assumption for the entire test block.

In sum, the partial credit model and the Rasch testlet model are the better model choices for analyzing NBOME COMLEX exams. The data from NBOME COMLEX-USA Level 2 exam Block 1 are better modeled using the PCM and the Rasch testlet model than the standard Rasch model. In addition, the test reliability discrepancy between the PCM and the Rasch testlet model is within 0.01, whereas the discrepancy between the standard Rasch model and the other two models is roughly 0.03 to 0.04 (0.936 versus 0.909 and 0.899). This result also supports the conclusion that the PCM and the Rasch testlet model are the better model choices for analyzing NBOME COMLEX exams.
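The pairwise count behind the T1 check described in this section can be sketched as follows. This is only an illustration of the observed statistic, not the eRm implementation: the function name `t1_equal_response_counts` and the toy response matrix are hypothetical. The significance test itself compares these counts with those from sampled response matrices that share the observed row and column totals (Ponocny, 2001), and that sampling step is not reproduced here.

```python
from itertools import combinations
from math import comb


def t1_equal_response_counts(responses):
    """For every item pair, count examinees with equal responses on both items.

    responses: list of rows, one row of 0/1 item scores per examinee.
    Returns a dict mapping (item_i, item_j) to the count.
    """
    n_items = len(responses[0])
    return {
        (i, j): sum(1 for row in responses if row[i] == row[j])
        for i, j in combinations(range(n_items), 2)
    }


# Toy data: 4 examinees, 3 dichotomous items (hypothetical values).
toy = [
    [1, 1, 0],
    [0, 0, 1],
    [1, 1, 1],
    [0, 1, 0],
]
counts = t1_equal_response_counts(toy)
print(counts)

# Number of distinct item pairs in a 50-item block.
print(comb(50, 2))
```

A high count for a pair, relative to what the sampled matrices produce, is what flags that pair as locally dependent. The final line confirms the 1,225 item pairs cited for the 50-item block.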
Table 4-1. MLE nonconvergence cases and rate per condition (testlet size 3)

Condition  Sample size  Testlet No.  Local effect  Nonconvergence cases  Percentage
37         1000         15           0.25          8469                  8.47%
38         1000         15           0.5           9701                  9.70%
39         1000         15           0.75          8010                  8.01%
40         1000         15           1             9955                  9.96%
41         1000         10           0.25          1216                  1.22%
42         1000         10           0.5           1007                  1.01%
43         1000         10           0.75          526                   0.53%
44         1000         10           1             680                   0.68%
45         1000         5            0.25          45                    0.05%
46         1000         5            0.5           131                   0.13%
47         1000         5            0.75          29                    0.03%
48         1000         5            1             99                    0.10%

Table 4-2. MLE nonconvergence cases and rate per condition (testlet size 5)

Condition  Sample size  Testlet No.  Local effect  Nonconvergence cases  Percentage
1          1000         9            0.25          3002                  3.00%
2          1000         9            0.5           3658                  3.66%
3          1000         9            0.75          2406                  2.41%
4          1000         9            1             3016                  3.02%
5          1000         6            0.25          289                   0.29%
6          1000         6            0.5           551                   0.55%
7          1000         6            0.75          262                   0.26%
8          1000         6            1             376                   0.38%
9          1000         3            0.25          26                    0.03%
10         1000         3            0.5           38                    0.04%
11         1000         3            0.75          40                    0.04%
12         1000         3            1             42                    0.04%

Table 4-3. Test reliability (testlet size 3 conditions)

Condition  Sample size  Testlet No.  Local effect  Testlet model  Partial Credit  Standard Rasch
1          1000         15           0.25          0.90631        0.90382         0.97843
2          1000         15           0.5           0.90521        0.90474         0.97785
3          1000         15           0.75          0.89974        0.90513         0.97649
4          1000         15           1             0.90663        0.90466         0.97828
5          1000         10           0.25          0.90406        0.90360         0.97821
6          1000         10           0.5           0.90201        0.90441         0.97772
7          1000         10           0.75          0.89167        0.90630         0.97463
8          1000         10           1             0.90059        0.90464         0.97725
9          1000         5            0.25          0.90274        0.90359         0.97809
10         1000         5            0.5           0.89987        0.90364         0.97740
11         1000         5            0.75          0.90053        0.90388         0.97756
12         1000         5            1             0.89820        0.90425         0.97692
13         500          15           0.25          0.89851        0.90440         0.95188
14         500          15           0.5           0.91206        0.90303         0.95908
15         500          15           0.75          0.89450        0.90663         0.94967
16         500          15           1             0.89666        0.90578         0.95095
17         500          10           0.25          0.90822        0.90235         0.95873
18         500          10           0.5           0.90155        0.90453         0.95481
19         500          10           0.75          0.89423        0.90583         0.95092
20         500          10           1             0.89932        0.90444         0.95408
21         500          5            0.25          0.90015        0.90488         0.95395
22         500          5            0.5           0.88961        0.90540         0.94962
23         500          5            0.75          0.90373        0.90395         0.95599
24         500          5            1             0.89208        0.90526         0.95064
25         250          15           0.25          0.91113        0.90304         0.91925
26         250          15           0.5           0.89811        0.90557         0.91223
27         250          15           0.75          0.90328        0.90645         0.90647
28         250          15           1             0.89212        0.90576         0.90664
29         250          10           0.25          0.89754        0.90427         0.90634
30         250          10           0.5           0.89718        0.90453         0.90629
31         250          10           0.75          0.89223        0.90594         0.90972
32         250          10           1             0.90038        0.90551         0.90789
33         250          5            0.25          0.90464        0.90451         0.91279
34         250          5            0.5           0.89704        0.90466         0.90720
35         250          5            0.75          0.89793        0.90554         0.90664
36         250          5            1             0.90469        0.90347         0.91438

Table 4-4. Test reliability (testlet size 5 conditions)

Condition  Sample size  Testlet No.  Local effect  Testlet model  Partial Credit  Standard Rasch
1          1000         9            0.25          0.90040        0.89809         0.97711
2          1000         9            0.5           0.90295        0.89787         0.97770
3          1000         9            0.75          0.90271        0.89796         0.97708
4          1000         9            1             0.90099        0.89850         0.97669
5          1000         6            0.25          0.89223        0.89873         0.97540
6          1000         6            0.5           0.90099        0.89812         0.97761
7          1000         6            0.75          0.88920        0.90030         0.97430
8          1000         6            1             0.90349        0.89769         0.97830
9          1000         3            0.25          0.90076        0.89734         0.97780
10         1000         3            0.5           0.89936        0.89840         0.97718
11         1000         3            0.75          0.89962        0.89817         0.97735
12         1000         3            1             0.89177        0.89918         0.97513
13         500          9            0.25          0.89885        0.89767         0.95250
14         500          9            0.5           0.89788        0.89884         0.95185
15         500          9            0.75          0.89535        0.89899         0.95086
16         500          9            1             0.90691        0.89752         0.95719
17         500          6            0.25          0.90195        0.89698         0.95603
18         500          6            0.5           0.89905        0.89893         0.95362
19         500          6            0.75          0.90251        0.89752         0.95585
20         500          6            1             0.90697        0.89763         0.95755
21         500          3            0.25          0.89304        0.89841         0.95112
22         500          3            0.5           0.90059        0.89817         0.95487
23         500          3            0.75          0.89500        0.89832         0.95236
24         500          3            1             0.90214        0.89692         0.95632
25         250          9            0.25          0.90941        0.89648         0.91701
26         250          9            0.5           0.89676        0.89995         0.90275
27         250          9            0.75          0.90618        0.89721         0.91367
28         250          9            1             0.90682        0.89774         0.91279
29         250          6            0.25          0.91078        0.89575         0.92096
30         250          6            0.5           0.90108        0.89767         0.91065
31         250          6            0.75          0.90156        0.89722         0.91182
32         250          6            1             0.88919        0.89977         0.90674
33         250          3            0.25          0.90417        0.89557         0.92314
34         250          3            0.5           0.89564        0.89826         0.90485
35         250          3            0.75          0.88549        0.90013         0.90393
36         250          3            1             0.89959        0.89755         0.90990

Table 4-5. Results of the Spearman-Brown prophecy (testlet size 3)

Condition  Sample size  Testlet No.  Local effect  Spearman-Brown (Testlet)  Spearman-Brown (Partial Credit)
1          1000         15           0.25          4.689                     4.827
2          1000         15           0.5           4.623                     4.648
3          1000         15           0.75          4.628                     4.353
4          1000         15           1             4.639                     4.747
5          1000         10           0.25          4.764                     4.789
6          1000         10           0.5           4.767                     4.638
7          1000         10           0.75          4.667                     3.972
8          1000         10           1             4.742                     4.528
9          1000         5            0.25          4.810                     4.763
10         1000         5            0.5           4.812                     4.612
11         1000         5            0.75          4.812                     4.633
12         1000         5            1             4.797                     4.482
13         500          15           0.25          2.234                     2.091
14         500          15           0.5           2.260                     2.517
15         500          15           0.75          2.225                     1.943
16         500          15           1             2.234                     2.017
17         500          10           0.25          2.348                     2.514
18         500          10           0.5           2.307                     2.230
19         500          10           0.75          2.292                     2.014
20         500          10           1             2.326                     2.195
21         500          5            0.25          2.298                     2.178
22         500          5            0.5           2.339                     1.969
23         500          5            0.75          2.314                     2.308
24         500          5            1             2.330                     2.016
25         250          15           0.25          1.110                     1.222
26         250          15           0.5           1.179                     1.084
27         250          15           0.75          1.038                     1.000
28         250          15           1             1.174                     1.010
29         250          10           0.25          1.105                     1.024
30         250          10           0.5           1.108                     1.021
31         250          10           0.75          1.217                     1.046
32         250          10           1             1.091                     1.029
33         250          5            0.25          1.103                     1.105
34         250          5            0.5           1.122                     1.030
35         250          5            0.75          1.104                     1.013
36         250          5            1             1.125                     1.141

Table 4-6. Results of the Spearman-Brown prophecy (testlet size 5)

Condition  Sample size  Testlet No.  Local effect  Spearman-Brown (Testlet)  Spearman-Brown (Partial Credit)
1          1000         9            0.25          4.722                     4.844
2          1000         9            0.5           4.712                     4.987
3          1000         9            0.75          4.594                     4.844
4          1000         9            1             4.604                     4.733
5          1000         6            0.25          4.789                     4.468
6          1000         6            0.5           4.798                     4.953
7          1000         6            0.75          4.724                     4.198
8          1000         6            1             4.816                     5.138
9          1000         3            0.25          4.853                     5.039
10         1000         3            0.5           4.792                     4.843
11         1000         3            0.75          4.815                     4.892
12         1000         3            1             4.759                     4.396
13         500          9            0.25          2.257                     2.286
14         500          9            0.5           2.248                     2.225
15         500          9            0.75          2.262                     2.174
16         500          9            1             2.295                     2.553
17         500          6            0.25          2.364                     2.497
18         500          6            0.5           2.309                     2.312
19         500          6            0.75          2.339                     2.472
20         500          6            1             2.314                     2.573
21         500          3            0.25          2.331                     2.200
22         500          3            0.5           2.336                     2.399
23         500          3            0.75          2.345                     2.263
24         500          3            1             2.375                     2.516
25         250          9            0.25          1.101                     1.276
26         250          9            0.5           1.069                     1.032
27         250          9            0.75          1.096                     1.213
28         250          9            1             1.075                     1.192
29         250          6            0.25          1.141                     1.356
30         250          6            0.5           1.119                     1.162
31         250          6            0.75          1.129                     1.185
32         250          6            1             1.212                     1.083
33         250          3            0.25          1.273                     1.401
34         250          3            0.5           1.108                     1.077
35         250          3            0.75          1.217                     1.044
36         250          3            1             1.127                     1.153

Table 4-7. Mean standard error of measurement for each condition (testlet size 3)

Condition  Sample size  Testlet No.  Local effect  Testlet   PC        Rasch
37         1000         15           0.25          0.300381  0.305033  0.289589
38         1000         15           0.5           0.296793  0.303545  0.288309
39         1000         15           0.75          0.295281  0.302921  0.287677
40         1000         15           1             0.291839  0.303677  0.288298
41         1000         10           0.25          0.301639  0.305362  0.289899
42         1000         10           0.5           0.299473  0.304075  0.288841
43         1000         10           0.75          0.298098  0.301055  0.285724
44         1000         10           1             0.296793  0.303715  0.288325
45         1000         5            0.25          0.303638  0.305379  0.290094
46         1000         5            0.5           0.303610  0.305306  0.289880
47         1000         5            0.75          0.302271  0.304918  0.289561
48         1000         5            1             0.301956  0.304322  0.288957
49         500          15           0.25          0.301526  0.304080  0.288464
50         500          15           0.5           0.297063  0.306254  0.290488
51         500          15           0.75          0.293822  0.300528  0.285145
52         500          15           1             0.293095  0.301892  0.286377
53         500          10           0.25          0.302328  0.307330  0.291530
54         500          10           0.5           0.299370  0.303877  0.288227
55         500          10           0.75          0.297841  0.301809  0.286284
56         500          10           1             0.297281  0.304030  0.288398
57         500          5            0.25          0.302365  0.303326  0.287746
58         500          5            0.5           0.303011  0.302495  0.286941
59         500          5            0.75          0.301661  0.304797  0.289131
60         500          5            1             0.301875  0.302717  0.287157
61         250          15           0.25          0.301321  0.306252  0.291301
62         250          15           0.5           0.297011  0.302224  0.286611
63         250          15           0.75          0.293531  0.300802  0.285815
64         250          15           1             0.294556  0.301925  0.286726
65         250          10           0.25          0.302382  0.304287  0.288809
66         250          10           0.5           0.300390  0.303883  0.288478
67         250          10           0.75          0.298503  0.301642  0.286330
68         250          10           1             0.295590  0.302319  0.286844
69         250          5            0.25          0.301868  0.303920  0.288364
70         250          5            0.5           0.302962  0.303675  0.288801
71         250          5            0.75          0.300649  0.302262  0.287081
72         250          5            1             0.301612  0.305566  0.290187

Table 4-8. Mean standard error of measurement for each condition (testlet size 5)

Condition  Sample size  Testlet No.  Local effect  Testlet   PC        Rasch
1          1000         9            0.25          0.299785  0.304136  0.275109
2          1000         9            0.5           0.295323  0.304440  0.275416
3          1000         9            0.75          0.292322  0.304307  0.275406
4          1000         9            1             0.289530  0.303489  0.274878
5          1000         6            0.25          0.302390  0.303154  0.274489
6          1000         6            0.5           0.298983  0.304078  0.275321
7          1000         6            0.75          0.297526  0.300803  0.272350
8          1000         6            1             0.296710  0.304706  0.275876
9          1000         3            0.25          0.303190  0.305241  0.276391
10         1000         3            0.5           0.301505  0.303647  0.274911
11         1000         3            0.75          0.301083  0.303993  0.275250
12         1000         3            1             0.300625  0.302478  0.273795
13         500          9            0.25          0.300613  0.304737  0.275479
14         500          9            0.5           0.295684  0.303001  0.273933
15         500          9            0.75          0.292964  0.302767  0.273778
16         500          9            1             0.288456  0.304967  0.275759
17         500          6            0.25          0.302246  0.305767  0.276519
18         500          6            0.5           0.298544  0.302856  0.273895
19         500          6            0.75          0.297919  0.304972  0.275769
20         500          6            1             0.294441  0.304801  0.275576
21         500          3            0.25          0.303535  0.303637  0.274653
22         500          3            0.5           0.300839  0.303986  0.274908
23         500          3            0.75          0.301813  0.303762  0.274750
24         500          3            1             0.300681  0.305853  0.276625
25         250          9            0.25          0.299585  0.306513  0.277211
26         250          9            0.5           0.294056  0.301339  0.272416
27         250          9            0.75          0.292320  0.305417  0.276170
28         250          9            1             0.288759  0.304640  0.275400
29         250          6            0.25          0.302050  0.307584  0.278153
30         250          6            0.5           0.299601  0.304732  0.275591
31         250          6            0.75          0.297984  0.305409  0.276203
32         250          6            1             0.296080  0.301589  0.272642
33         250          3            0.25          0.302384  0.307852  0.278436
34         250          3            0.5           0.301966  0.303866  0.274839
35         250          3            0.75          0.301620  0.301048  0.272269
36         250          3            1             0.300667  0.304903  0.275727

Table 4-9. Testlet size 3: bias and RMSE of ability estimate recovery (EAP)

                                                   Testlet model         Partial credit model  Standard Rasch model
Condition  Sample size  Testlet No.  Local effect  mean.Bias  mean.RMSE  mean.Bias  mean.RMSE  mean.Bias  mean.RMSE
37         1000         15           0.25          0.16783    0.01002    0.09522    0.00897    0.14925    0.01012
38         1000         15           0.5           0.04346    0.00711    0.03334    0.00690    0.02660    0.00687
39         1000         15           0.75          0.09725    0.00713    0.05011    0.00631    0.02341    0.00665
40         1000         15           1             0.09176    0.00853    0.17260    0.00970    0.13382    0.00917
41         1000         10           0.25          0.06458    0.00752    0.15846    0.00704    0.13656    0.00695
42         1000         10           0.5           0.10475    0.00796    0.10288    0.00792    0.11166    0.00792
43         1000         10           0.75          0.14755    0.00626    0.00742    0.00691    0.08861    0.00867
44         1000         10           1             0.00417    0.01003    0.02655    0.01034    0.01398    0.01000
45         1000         5            0.25          0.11195    0.00713    0.11097    0.00700    0.10368    0.00724
46         1000         5            0.5           0.11073    0.00726    0.06216    0.00687    0.03382    0.00674
47         1000         5            0.75          0.07001    0.00769    0.10917    0.00841    0.15884    0.00922
48         1000         5            1             0.09311    0.00689    0.13323    0.00728    0.18511    0.00840
49         500          15           0.25          0.11591    0.01240    0.07243    0.01268    0.14060    0.01139
50         500          15           0.5           0.19790    0.01497    0.06774    0.01321    0.16854    0.01560
51         500          15           0.75          0.07199    0.01209    0.01459    0.01071    0.07251    0.01139
52         500          15           1             0.13480    0.01271    0.07473    0.01080    0.05624    0.01054
53         500          10           0.25          0.14430    0.01326    0.10452    0.01271    0.17322    0.01378
54         500          10           0.5           0.01525    0.01222    0.05002    0.01292    0.10895    0.00991
55         500          10           0.75          0.04809    0.00981    0.01907    0.00944    0.02956    0.00953
56         500          10           1             0.03209    0.01093    0.03091    0.01121    0.15904    0.01326
57         500          5            0.25          0.16260    0.01096    0.19737    0.01107    0.22554    0.01164
58         500          5            0.5           0.18784    0.01273    0.22612    0.01345    0.24754    0.01406
59         500          5            0.75          0.13105    0.01311    0.09802    0.01204    0.24434    0.01617
60         500          5            1             0.11728    0.00977    0.03991    0.00996    0.14934    0.00980
61         250          15           0.25          0.03807    0.01502    0.04629    0.01401    0.03043    0.01421
62         250          15           0.5           0.29854    0.02358    0.24066    0.02088    0.10983    0.01594
63         250          15           0.75          0.07253    0.01541    0.05782    0.01348    0.14003    0.01505
64         250          15           1             0.03604    0.01275    0.05762    0.01021    0.04707    0.01062
65         250          10           0.25          0.07542    0.01475    0.00293    0.01605    0.00552    0.01547
66         250          10           0.5           0.24154    0.02061    0.13173    0.01512    0.20653    0.01843
67         250          10           0.75          0.05263    0.01521    0.02047    0.01530    0.05522    0.01517
68         250          10           1             0.03395    0.01608    0.18840    0.02246    0.25207    0.02576
69         250          5            0.25          0.14591    0.01773    0.15485    0.01743    0.22109    0.01918
70         250          5            0.5           0.25183    0.01889    0.20715    0.01614    0.20095    0.01647
71         250          5            0.75          0.05471    0.01360    0.02376    0.01360    0.04982    0.01329
72         250          5            1                                              0.02027    0.23503    0.02058
MAX                                                0.29854    0.02358    0.24066    0.02246    0.24754    0.02576
Overall mean                                       0.02846    0.01181    0.01140    0.01148    0.02249    0.01197
Standard deviation                                 0.12285    0.00420    0.11664    0.00415    0.14342    0.00440

Table 4-10. Testlet size 5: bias and RMSE of ability estimate recovery (EAP)

                                                   Testlet model         Partial credit model  Standard Rasch model
Condition  Sample size  Testlet No.  Local effect  mean.Bias  mean.RMSE  mean.Bias  mean.RMSE  mean.Bias  mean.RMSE
1          1000         9            0.25          0.19164    0.00867    0.19664    0.00843    0.15635    0.00793
2          1000         9            0.5           0.15941    0.00980    0.11543    0.00955    0.07367    0.00901
3          1000         9            0.75          0.11687    0.01159    0.03946    0.00972    0.09474    0.00840
4          1000         9            1             0.18283    0.00951    0.16801    0.00965    0.13127    0.00905
5          1000         6            0.25          0.00801    0.00638    0.06287    0.00583    0.03093    0.00670
6          1000         6            0.5           0.50256    0.02672    0.32653    0.01931    0.12354    0.01294
7          1000         6            0.75          0.10712    0.00810    0.15731    0.00737    0.04488    0.00689
8          1000         6            1             0.00526    0.00877    0.08249    0.00795    0.12016    0.00827
9          1000         3            0.25          0.06007    0.00712    0.00125    0.00679    0.00938    0.00665
10         1000         3            0.5           0.04465    0.00653    0.05095    0.00658    0.19904    0.00868
11         1000         3            0.75          0.26770    0.01191    0.20402    0.01028    0.25305    0.01157
12         1000         3            1             0.25933    0.00972    0.14790    0.00791    0.15701    0.00782
13         500          9            0.25          0.13080    0.01029    0.22012    0.01225    0.10559    0.00996
14         500          9            0.5           0.08486    0.01197    0.01553    0.01106    0.17483    0.01453
15         500          9            0.75          0.01278    0.01215    0.06039    0.01137    0.01830    0.01170
16         500          9            1             0.23682    0.01767    0.30371    0.02006    0.31090    0.02059
17         500          6            0.25          0.05779    0.00887    0.15835    0.01077    0.06598    0.00909
18         500          6            0.5           0.03734    0.01203    0.08598    0.01175    0.00903    0.01127
19         500          6            0.75          0.14255    0.01127    0.17055    0.01082    0.24755    0.01176
20         500          6            1             0.23726    0.01445    0.13707    0.01232    0.00799    0.01118
21         500          3            0.25          0.23119    0.01149    0.14785    0.00943    0.16435    0.00959
22         500          3            0.5           0.01187    0.00864    0.07000    0.00769    0.15143    0.00828
23         500          3            0.75          0.18909    0.01173    0.15529    0.01066    0.19302    0.01106
24         500          3            1             0.17958    0.01117    0.09252    0.00928    0.07032    0.00904
25         250          9            0.25          0.03497    0.01264    0.08511    0.01317    0.13961    0.01437
26         250          9            0.5           0.09268    0.01438    0.04789    0.01380    0.10150    0.01634
27         250          9            0.75          0.01476    0.01506    0.03546    0.01554    0.10886    0.01443
28         250          9            1             0.38487    0.02121    0.28872    0.01855    0.28307    0.01810
29         250          6            0.25          0.05265    0.01544    0.13213    0.01716    0.03681    0.01452
30         250          6            0.5           0.10775    0.01580    0.09121    0.01380    0.03764    0.01291
31         250          6            0.75          0.10867    0.01352    0.19622    0.01602    0.31813    0.02217
32         250          6            1             0.07402    0.01357    0.10687    0.01220    0.03626    0.01079
33         250          3            0.25          0.28391    0.01869    0.11141    0.01456    0.11499    0.01452
34         250          3            0.5           0.24272    0.10725    0.13467    0.11244    0.02008    0.12209
35         250          3            0.75          0.11239    0.01639    0.04690    0.01527    0.08978    0.01543
36         250          3            1             0.04974    0.01230    0.03783    0.01222    0.02647    0.01219
MIN                                                0.38487    0.00638    0.30371    0.00583    0.31813    0.00665
MAX                                                0.50256    0.10725    0.32653    0.11244    0.25305    0.12209
Overall mean                                       0.00144    0.01516    0.00765    0.01455    0.04165    0.01479
Standard deviation                                 0.18094    0.01634    0.14899    0.01718    0.14163    0.01878

Table 4-11. Rasch testlet model (testlet size 3): bias of ability (θ) estimate recovery (EAP) in 6 ability intervals

Condition  Sample size  Testlet No.  Local effect  mean.bias1  mean.bias2  mean.bias3  mean.bias4  mean.bias5  mean.bias6
37         1000         15           0.25          0.473167    0.248143    0.183455    0.149797    0.091889    0.111914
38         1000         15           0.5           0.354534    0.114570    0.052191    0.027920    0.028258    0.190016
39         1000         15           0.75          0.370821    0.154818    0.096164    0.085521    0.046400    0.086360
40         1000         15           1             0.172154    0.039595    0.095668    0.104551    0.138436    0.289969
41         1000         10           0.25          0.275666    0.152291    0.094051    0.046317    0.037948    0.209459
42         1000         10           0.5           0.321742    0.195670    0.127418    0.087044    0.014702    0.146453
43         1000         10           0.75          0.005361    0.072230    0.123801    0.164971    0.230953    0.376841
44         1000         10           1             0.175261    0.077036    0.011355    0.021059    0.070775    0.163715
45         1000         5            0.25          0.435356    0.205839    0.137949    0.095536    0.012910    0.141757
46         1000         5            0.5           0.395263    0.187794    0.127652    0.095096    0.028981    0.206536
47         1000         5            0.75          0.219189    0.015793    0.053097    0.089053    0.145936    0.299091
48         1000         5            1             0.242773    0.016552    0.076447    0.111252    0.172157    0.361526
49         500          15           0.25          0.384715    0.183385    0.137153    0.108710    0.044345    0.187429
50         500          15           0.5           0.477372    0.267986    0.211695    0.193633    0.131396    0.150944
51         500          15           0.75          0.237362    0.021731    0.067369    0.082894    0.132175    0.317696
52         500          15           1             0.097377    0.097434    0.134401    0.139968    0.177940    0.290971
53         500          10           0.25          0.356236    0.228351    0.168639    0.126469    0.055644    0.135595
54         500          10           0.5           0.212429    0.102587    0.040898    0.002519    0.078594    0.239629
55         500          10           0.75          0.166690    0.043431    0.030665    0.070784    0.135619    0.289627
56         500          10           1             0.211727    0.102454    0.045510    0.011864    0.035988    0.183253
57         500          5            0.25          0.110223    0.060440    0.136362    0.186604    0.257486    0.462181
58         500          5            0.5           0.462159    0.251401    0.213998    0.171382    0.075050    0.151453
59         500          5            0.75          0.358289    0.211521    0.164480    0.120848    0.032428    0.219213
60         500          5            1             0.444407    0.190858    0.133521    0.100451    0.028398    0.132833
61         250          15           0.25          0.425444    0.088707    0.038860    0.014501    0.043279    0.203645
62         250          15           0.5           0.539062    0.356469    0.319489    0.290043    0.250876    0.020486
63         250          15           0.75          0.216494    0.002026    0.075173    0.093850    0.143783    0.231325
64         250          15           1             0.174028    0.016443    0.042212    0.030380    0.074964    0.255932
65         250          10           0.25          0.303218    0.178595    0.097837    0.056333    0.012798    0.177952
66         250          10           0.5           0.043971    0.156407    0.226762    0.259869    0.310529    0.473968
67         250          10           0.75          0.142809    0.024524    0.040944    0.074493    0.129313    0.256615
68         250          10           1             0.141291    0.054279    0.019997    0.056531    0.113450    0.224637
69         250          5            0.25          0.368075    0.246382    0.178723    0.124299    0.040473    0.265577
70         250          5            0.5           0.555768    0.318019    0.273727    0.238187    0.132023    0.025779
71         250          5            0.75          0.306779    0.136065    0.077814    0.035790    0.039109    0.258084
72         250          5            1             0.352490    0.160919    0.116425    0.086989    0.011101    0.285605
Overall mean                                       0.290049    0.111529    0.053503    0.021610    0.042024    0.220320
Standard deviation                                 0.144891    0.124567    0.127104    0.124312    0.118086    0.107127

Table 4-12. Partial credit model (testlet size 3): bias of ability (θ) estimate recovery (EAP) in 6 ability intervals

Condition  Sample size  Testlet No.  Local effect  mean.bias1  mean.bias2  mean.bias3  mean.bias4  mean.bias5  mean.bias6
37         1000         15           0.25          0.336622    0.203823    0.118733    0.070286    0.009691    0.214690
38         1000         15           0.5           0.315843    0.153878    0.058782    0.002898    0.070801    0.259818
39         1000         15           0.75          0.324340    0.176789    0.072556    0.012976    0.055217    0.232093
40         1000         15           1             0.079975    0.054760    0.153193    0.210201    0.273728    0.474976
41         1000         10           0.25          0.370687    0.259955    0.189054    0.139130    0.056038    0.192075
42         1000         10           0.5           0.355115    0.221533    0.130672    0.078644    0.000630    0.233303
43         1000         10           0.75          0.214934    0.116662    0.026076    0.037418    0.125863    0.358937
44         1000         10           1             0.247635    0.149033    0.052803    0.000008    0.069264    0.220998
45         1000         5            0.25          0.381780    0.223662    0.135451    0.088816    0.016893    0.128480
46         1000         5            0.5           0.313348    0.168042    0.079470    0.038035    0.023125    0.252448
47         1000         5            0.75          0.155233    0.008629    0.087356    0.136890    0.195870    0.349476
48         1000         5            1             0.174679    0.018619    0.113778    0.163059    0.225781    0.413138
49         500          15           0.25          0.292574    0.178334    0.107760    0.053391    0.024299    0.321069
50         500          15           0.5           0.263480    0.169531    0.095938    0.049898    0.016077    0.315868
51         500          15           0.75          0.306953    0.110805    0.016032    0.053743    0.136533    0.383503
52         500          15           1             0.166832    0.041945    0.043852    0.108448    0.185566    0.335328
53         500          10           0.25          0.290882    0.192752    0.127494    0.087437    0.025865    0.211482
54         500          10           0.5           0.156712    0.059820    0.019303    0.072568    0.150827    0.372046
55         500          10           0.75          0.229094    0.111338    0.010573    0.053234    0.139269    0.376540
56         500          10           1             0.181234    0.075932    0.007133    0.062984    0.129794    0.355068
57         500          5            0.25          0.035895    0.075835    0.171228    0.228269    0.291249    0.486819
58         500          5            0.5           0.478965    0.325959    0.257307    0.199297    0.099948    0.137633
59         500          5            0.75          0.280798    0.197895    0.130120    0.078434    0.003367    0.224228
60         500          5            1             0.324746    0.154661    0.061032    0.009318    0.064787    0.239826
61         250          15           0.25          0.282271    0.032899    0.039813    0.077479    0.134475    0.290433
62         250          15           0.5           0.472681    0.359737    0.289173    0.218702    0.150106    0.141069
63         250          15           0.75          0.217801    0.073962    0.043076    0.106858    0.176700    0.288534
64         250          15           1             0.189103    0.057994    0.030791    0.088302    0.178435    0.440313
65         250          10           0.25          0.248906    0.128714    0.029408    0.018945    0.093579    0.359625
66         250          10           0.5           0.095283    0.016983    0.111733    0.154756    0.216678    0.491812
67         250          10           0.75          0.280561    0.145899    0.043611    0.017089    0.092554    0.342466
68         250          10           1             0.027517    0.066781    0.164813    0.221067    0.295618    0.509271
69         250          5            0.25          0.343087    0.267476    0.181118    0.129846    0.061416    0.212631
70         250          5            0.5           0.474667    0.308877    0.230810    0.184408    0.079458    0.034207
71         250          5            0.75          0.208783    0.092116    0.002041    0.052994    0.126155    0.350564
72         250          5            1             0.446835    0.315588    0.241987    0.204057    0.137092    0.138479
Overall mean                                       0.265718    0.134757    0.047276    0.006076    0.079215    0.296923
Standard deviation                                 0.114902    0.112429    0.119329    0.120615    0.117579    0.114145

Table 4-13. Standard Rasch model (testlet size 3): bias of ability (θ) estimate recovery (EAP) in 6 ability intervals

Condition  Sample size  Testlet No.  Local effect  mean.bias1  mean.bias2  mean.bias3  mean.bias4  mean.bias5  mean.bias6
37         1000         15           0.25          0.377349    0.256007    0.174512    0.123467    0.066063    0.158090
38         1000         15           0.5           0.288100    0.142818    0.053952    0.003419    0.074942    0.265710
39         1000         15           0.75          0.278403    0.145130    0.047520    0.012507    0.080783    0.257701
40         1000         15           1             0.101816    0.018816    0.113572    0.170834    0.232747    0.427665
41         1000         10           0.25          0.335641    0.235327    0.168284    0.116694    0.036510    0.205762
42         1000         10           0.5           0.344568    0.225106    0.140802    0.087759    0.012315    0.217320
43         1000         10           0.75          0.300918    0.212029    0.123561    0.057606    0.030350    0.256467
44         1000         10           1             0.193358    0.105935    0.014537    0.041603    0.108161    0.261854
45         1000         5            0.25          0.358078    0.212635    0.130214    0.080768    0.012205    0.136275
46         1000         5            0.5           0.270038    0.136912    0.052790    0.009639    0.051126    0.274681
47         1000         5            0.75          0.088620    0.045237    0.134873    0.186832    0.242925    0.400196
48         1000         5            1             0.106223    0.073708    0.163902    0.214979    0.276585    0.462143
49         500          15           0.25          0.367581    0.248949    0.175592    0.120463    0.044052    0.251761
50         500          15           0.5           0.374032    0.274207    0.197322    0.149253    0.082147    0.219028
51         500          15           0.75          0.244773    0.050650    0.042050    0.110614    0.193649    0.444443
52         500          15           1             0.188429    0.061434    0.025102    0.090149    0.168285    0.320278
53         500          10           0.25          0.364199    0.263265    0.196480    0.155625    0.092914    0.146044
54         500          10           0.5           0.322251    0.221667    0.140289    0.085397    0.005789    0.217023
55         500          10           0.75          0.226423    0.103118    0.000191    0.064436    0.151322    0.390154
56         500          10           1             0.056398    0.050612    0.134877    0.191533    0.259697    0.487368
57         500          5            0.25          0.010390    0.103156    0.199253    0.256550    0.320193    0.517764
58         500          5            0.5           0.503030    0.348689    0.279222    0.220309    0.119510    0.119608
59         500          5            0.75          0.434021    0.346219    0.276517    0.224106    0.148551    0.080009
60         500          5            1             0.439148    0.265474    0.170702    0.118328    0.043324    0.133062
61         250          15           0.25          0.333248    0.103105    0.038560    0.001722    0.053561    0.214062
62         250          15           0.5           0.334339    0.229383    0.161915    0.085442    0.019557    0.273484
63         250          15           0.75          0.115012    0.012460    0.124758    0.185956    0.258118    0.375915
64         250          15           1             0.276731    0.158015    0.074835    0.017737    0.073405    0.329492
65         250          10           0.25          0.237986    0.128453    0.034612    0.017241    0.092265    0.345118
66         250          10           0.5           0.012850    0.090193    0.183208    0.231655    0.295516    0.564457
67         250          10           0.75          0.303445    0.177303    0.079462    0.018418    0.060361    0.294325
68         250          10           1             0.045749    0.130230    0.226555    0.285692    0.361091    0.571902
69         250          5            0.25          0.401088    0.333323    0.249247    0.194643    0.126869    0.137617
70         250          5            0.5           0.446812    0.297214    0.225627    0.179150    0.076715    0.035870
71         250          5            0.75          0.266258    0.162389    0.077149    0.020855    0.051654    0.275809
72         250          5            1             0.447285    0.327455    0.257638    0.216147    0.151092    0.120725
Overall mean                                       0.269530    0.145772    0.060927    0.006098    0.066642    0.283033
Standard deviation                                 0.138194    0.138171    0.145588    0.147425    0.144426    0.137488

Table 4-14. Rasch testlet model (testlet size 5): bias of ability (θ) estimate recovery (EAP) in 6 ability intervals

Condition  Sample size  Testlet No.  Local effect  mean.bias1  mean.bias2  mean.bias3  mean.bias4  mean.bias5  mean.bias6
1          1000         9            0.25          0.094350    0.095255    0.174461    0.213346    0.281807    0.428692
2          1000         9            0.5           0.436489    0.237122    0.179649    0.147421    0.082498    0.098116
3          1000         9            0.75          0.122463    0.050085    0.104181    0.127667    0.196337    0.321018
4          1000         9            1             0.442925    0.250876    0.195557    0.171773    0.105662    0.004987
5          1000         6            0.25          0.379519    0.091610    0.030039    0.010922    0.073238    0.322554
6          1000         6            0.5           0.806265    0.587031    0.530736    0.488181    0.419653    0.115971
7          1000         6            0.75          0.421201    0.186316    0.125033    0.086932    0.021690    0.247222
8          1000         6            1             0.357342    0.082943    0.013761    0.023014    0.081773    0.347509
9          1000         3            0.25          0.362618    0.155299    0.084661    0.030937    0.039237    0.213096
10         1000         3            0.5           0.312980    0.065283    0.015409    0.071428    0.131333    0.272589
11         1000         3            0.75          0.560732    0.350580    0.290498    0.234278    0.171695    0.001140
12         1000         3            1             0.551760    0.339561    0.285196    0.233684    0.176530    0.038358
13         500          9            0.25          0.145720    0.033425    0.114898    0.143125    0.207488    0.400981
14         500          9            0.5           0.195187    0.009050    0.066950    0.102276    0.156287    0.310182
15         500          9            0.75          0.315093    0.077165    0.023492    0.001618    0.062240    0.183802
16         500          9            1             0.084424    0.159926    0.233283    0.259885    0.321462    0.424321
17         500          6            0.25          0.456119    0.138420    0.074700    0.034685    0.039098    0.245486
18         500          6            0.5           0.290231    0.128894    0.062876    0.020589    0.044900    0.247926
19         500          6            0.75          0.171234    0.057490    0.132013    0.164068    0.217203    0.436617
20         500          6            1             0.150614    0.149303    0.226614    0.265776    0.326516    0.478286
21         500          3            0.25          0.174309    0.132377    0.207516    0.258158    0.313038    0.455341
22         500          3            0.5           0.255474    0.088922    0.017856    0.042086    0.097710    0.226965
23         500          3            0.75          0.094036    0.091653    0.173193    0.223266    0.274488    0.389311
24         500          3            1             0.148404    0.087851    0.155808    0.202989    0.257186    0.429154
25         250          9            0.25          0.223756    0.047142    0.020527    0.060633    0.128812    0.269732
26         250          9            0.5           0.428441    0.184559    0.111482    0.065017    0.005950    0.094928
27         250          9            0.75          0.237617    0.041268    0.002304    0.029471    0.083137    0.206376
28         250          9            1             0.121960    0.317932    0.390940    0.413389    0.458683    0.459374
29         250          6            0.25          0.309336    0.135764    0.074997    0.035037    0.040391    0.297998
30         250          6            0.5           0.380690    0.168035    0.126960    0.096969    0.033264    0.191007
31         250          6            0.75          0.253362    0.020784    0.083982    0.115322    0.160088    0.540590
32         250          6            1             0.314140    0.133842    0.093918    0.053564    0.002582    0.247434
33         250          3            0.25          0.576460    0.384467    0.313977    0.263214    0.192252    0.044460
34         250          3            0.5           3.193850    1.675706    0.751623    0.237276    1.192136    1.719127
35         250          3            0.75          0.117816    0.015918    0.074050    0.135772    0.192788    0.400226
36         250          3            1             0.198452    0.029064    0.021486    0.078501    0.140366    0.262858
Overall mean                                       0.373374    0.121078    0.033167    0.033735    0.119941    0.306961
Standard deviation                                 0.514662    0.318963    0.218118    0.181103    0.253235    0.289446

Table 4-15. Partial credit model (testlet size 5): bias of ability (θ) estimate recovery (EAP) in 6 ability intervals

Condition  Sample size  Testlet No.  Local effect  mean.bias1  mean.bias2  mean.bias3  mean.bias4  mean.bias5  mean.bias6
1          1000         9            0.25          0.057694    0.074844    0.173165    0.226777    0.293046    0.493063
2          1000         9            0.5           0.357073    0.227225    0.146618    0.092851    0.026046    0.211425
3          1000         9            0.75          0.183253    0.067466    0.015145    0.064679    0.142554    0.330352
4          1000         9            1             0.414649    0.288070    0.197109    0.140516    0.057407    0.106933
5          1000         6            0.25          0.254284    0.056994    0.037817    0.089405    0.166245    0.386558
6          1000         6            0.5           0.566313    0.442467    0.356654    0.305465    0.225716    0.009284
7          1000         6            0.75          0.445097    0.286063    0.183949    0.122392    0.034249    0.192302
8          1000         6            1             0.179819    0.039463    0.057599    0.109310    0.178169    0.357742
9          1000         3            0.25          0.255866    0.110652    0.021129    0.021659    0.100628    0.311067
10         1000         3            0.5           0.255503    0.078655    0.022398    0.072849    0.147754    0.331184
11         1000         3            0.75          0.465425    0.307731    0.224380    0.172734    0.095564    0.136830
12         1000         3            1             0.424246    0.261036    0.176704    0.121079    0.041691    0.259527
13         500          9            0.25          0.005684    0.103587    0.196281    0.239268    0.298955    0.544116
14         500          9            0.5           0.257783    0.125218    0.046717    0.013190    0.070077    0.279283
15         500          9            0.75          0.214159    0.048767    0.033883    0.088606    0.162287    0.352665
16         500          9            1             0.052872    0.202811    0.289531    0.336450    0.396747    0.559555
17         500          6            0.25          0.435278    0.257183    0.176986    0.133877    0.062489    0.090830
18         500          6            0.5           0.290807    0.206287    0.116569    0.060745    0.011217    0.173629
19         500          6            0.75          0.059018    0.062851    0.155885    0.200646    0.256321    0.387814
20         500          6            1             0.118874    0.030462    0.119601    0.169003    0.232470    0.328117
21         500          3            0.25          0.198857    0.028118    0.121680    0.173547    0.245166    0.437178
22         500          3            0.5           0.133467    0.044319    0.042115    0.095718    0.158739    0.316826
23         500          3            0.75          0.089688    0.035384    0.139598    0.189674    0.257481    0.409490
24         500          3            1             0.162975    0.012675    0.068295    0.111421    0.172786    0.376298
25         250          9            0.25          0.283943    0.171075    0.102312    0.057526    0.003210    0.172672
26         250          9            0.5           0.322473    0.171561    0.077414    0.008952    0.061644    0.198336
27         250          9            0.75          0.165109    0.050177    0.006297    0.061494    0.109914    0.271395
28         250          9            1             0.058034    0.199634    0.286395    0.329072    0.374657    0.446141
29         250          6            0.25          0.291949    0.217848    0.150383    0.116097    0.052439    0.114647
30         250          6            0.5           0.293698    0.177147    0.115804    0.074011    0.006296    0.170862
31         250          6            0.75          0.045159    0.091565    0.168066    0.213752    0.266624    0.512562
32         250          6            1             0.321524    0.210881    0.136044    0.071290    0.000412    0.199886
33         250          3            0.25          0.303234    0.202762    0.132833    0.100714    0.039509    0.107170
34         250          3            0.5           3.077542    1.569886    0.643957    0.344942    1.302150    1.829526
35         250          3            0.75          0.149291    0.083300    0.000826    0.071642    0.152472    0.447450
36         250          3            1             0.244549    0.134084    0.064922    0.012723    0.058589    0.223248
Overall mean                                       0.311483    0.139437    0.031599    0.045337    0.138102    0.335443
Standard deviation                                 0.496357    0.285833    0.182890    0.157178    0.245830    0.291249

Table 4-16. Standard Rasch model (testlet size 5): bias of ability (θ) estimate recovery (EAP) in 6 ability intervals

Condition  Sample size  Testlet No.  Local effect  mean.bias1  mean.bias2  mean.bias3  mean.bias4  mean.bias5  mean.bias6
1          1000         9            0.25          0.099971    0.034221    0.133403    0.186551    0.252010    0.452877
2          1000         9            0.5           0.318107    0.186509    0.104965    0.050271    0.015544    0.253090
3          1000         9            0.75          0.310948    0.202532    0.120125    0.068132    0.008049    0.195075
4          1000         9            1             0.360180    0.246459    0.160534    0.104280    0.025703    0.138183
5          1000         6            0.25          0.333569    0.147097    0.057703    0.004335    0.071557    0.289829
6          1000         6            0.5           0.348352    0.236751    0.154771    0.100999    0.026762    0.203307
7          1000         6            0.75          0.319864    0.171062    0.072551    0.009943    0.077491    0.297234
8          1000         6            1             0.126029    0.002200    0.093594    0.147499    0.213030    0.391729
9          1000         3            0.25          0.248496    0.116344    0.031513    0.013947    0.091865    0.298587
10         1000         3            0.5           0.091450    0.073315    0.168065    0.222511    0.293491    0.471152
11         1000         3            0.75          0.502321    0.354719    0.274381    0.222110    0.145063    0.079361
12         1000         3            1             0.418671    0.266539    0.186816    0.130330    0.051680    0.236261
13         500          9            0.25          0.135963    0.020694    0.078640    0.127076    0.193024    0.445134
14         500          9            0.5           0.080448    0.057799    0.140910    0.206017    0.267658    0.482120
15         500          9            0.75          0.266265    0.095822    0.009320    0.048039    0.124395    0.318598
16         500          9            1             0.049814    0.205177    0.294881    0.344937    0.410871    0.582785
17         500          6            0.25          0.344499    0.166983    0.086702    0.040895    0.034596    0.191486
18         500          6            0.5           0.200928    0.113445    0.022554    0.034379    0.109346    0.276489
19         500          6            0.75          0.012288    0.136464    0.230731    0.278090
0.339425 0.480841 20 1 0.260057 0.105714 0.011334 0.042154 0.110330 0.210707 21 3 0.25 0.184735 0.044013 0.138392 0.190226 0.261472 0.453125 22 0.5 0.056858 0.034257 0.122437 0.177865 0.243289 0.404036 23 0.75 0.054648 0.072215 0.177272 0.227534 0.296165 0.450183 24 1 0.186734 0.036123 0.045174 0.089354 0.152590 0.358227 25 250 9 0.25 0.074001 0.047596 0.123366 0.169803 0.222756 0.401669 26 0.5 0.190891 0.030451 0.069547 0.142738 0.219874 0.367389 27 0.75 0.317890 0.197999 0.141188 0.083001 0.026529 0.145503 28 1 0.038382 0.186940 0.277641 0.325517 0.383315 0.475526 29 6 0.25 0.206191 0.127465 0.055987 0.019072 0.046796 0.218904 30 0.5 0.173974 0.052087 0.012475 0.056456 0.124616 0.303180 31 0.75 0.064167 0.206921 0.289422 0.337861 0.390709 0.638718 32 1 0.265819 0.147722 0.067491 0.002090 0.077987 0.285482 33 3 0.25 0.313663 0.208113 0.136700 0.103557 0.041626 0.105400 34 0.5 2.922879 1.415147 0.489106 0.499631 1.456934 1.983828 35 0.75 0.294057 0.222541 0.138180 0.064553 0.017974 0.315192 PAGE 75 75 Table 4 16 Continued condition Sample size Testlet No. Local effect mean.bias1 mean.bias2 mean.bias3 mean.bias4 mean.bias5 mean.bias6 36 1 0.185198 0.072606 0.002331 0.052120 0.127573 0.298190 Overall mean 0.278583 0.106661 0.001992 0.081137 0.175483 0.374983 Standard Deviation 0.473497 0.264457 0.167108 0.160390 0.260040 0.305778 PAGE 76 76 Table 4 17 Rasch testlet Model (testlet size 3) RMSE of a bility ( ) estimate r ecov ery (EAP) with 6 different ability i ntervals condition Sample size Testlet No. 
Local effect mean.RMSE1 mean.RMSE2 mean.RMSE3 mean.RMSE4 mean.RMSE5 mean.RMSE6 1000 15 0.25 0.078337 0.026778 0.013562 0.013996 0.022898 0.048603 38 0.5 0.058554 0.023512 0.013291 0.011807 0.020247 0.043903 39 0.75 0.109081 0.025007 0.015425 0.011851 0.024913 0.015921 40 1 0.042240 0.021694 0.016620 0.015158 0.024771 0.045446 41 10 0.25 0.078190 0.019787 0.011747 0.011594 0.021015 0.003677 42 0.5 0.093387 0.022832 0.013078 0.013353 0.021377 0.005111 43 0.75 0.073321 0.016861 0.013579 0.011982 0.026210 0.057016 44 1 0.070156 0.027408 0.014366 0.014792 0.020166 0.083354 45 5 0.25 0.084415 0.019277 0.012643 0.012258 0.018854 0.027661 46 0.5 0.084047 0.023888 0.012558 0.012447 0.020792 0.084284 47 0.75 0.059131 0.019724 0.011052 0.012973 0.024332 0.122048 48 1 0.066213 0.018199 0.012070 0.011800 0.021689 0.081700 49 500 15 0.25 0.192334 0.033274 0.019610 0.017669 0.023263 0.023885 50 0.5 0.147215 0.039461 0.020473 0.022246 0.027377 0.028104 51 0.75 0.088903 0.030005 0.018717 0.020319 0.038214 0.321320 52 1 0.086655 0.032271 0.022941 0.021482 0.034426 0.075319 53 10 0.25 0.119339 0.029762 0.020720 0.024285 0.024943 0.150809 54 0.5 0.080430 0.033565 0.018110 0.014816 0.041293 0.006453 55 0.75 0.082724 0.025981 0.017898 0.015908 0.038934 0.015708 56 1 0.098487 0.033379 0.020264 0.018472 0.030859 0.023337 57 5 0.25 0.086273 0.028639 0.018110 0.020713 0.035449 0.026388 58 0.5 0.143987 0.037974 0.019936 0.021455 0.032242 0.071056 59 0.75 0.156287 0.036786 0.022283 0.019618 0.028308 0.131062 60 1 0.184270 0.038350 0.019849 0.016660 0.029845 0.013516 61 250 15 0.25 0.292779 0.039214 0.027430 0.023809 0.034594 0.087950 62 0.5 0.218673 0.084971 0.042775 0.036884 0.055996 0.086146 63 0.75 0.144418 0.049953 0.025300 0.027950 0.047892 0.264669 64 1 0.140580 0.042917 0.021739 0.025272 0.049209 0.330666 65 10 0.25 0.178133 0.057100 0.027217 0.021679 0.040607 0.114075 66 0.5 0.120084 0.049634 0.029591 0.037874 0.055365 0.092475 67 0.75 0.113360 0.051918 0.024542 0.025663 0.053502 
0.196613 68 1 0.155949 0.036788 0.026793 0.027741 0.042362 0.004325 69 5 0.25 0.134323 0.057182 0.030049 0.026274 0.042960 0.265263 70 0.5 0.182210 0.067420 0.030899 0.032589 0.045741 0.266494 71 0.75 0.119217 0.047717 0.022420 0.024159 0.041848 0.088415 PAGE 77 77 Table 4 17 Continued condition Sample size Testlet No. Local effect mean.RMSE1 mean.RMSE2 mean.RMSE3 mean.RMSE4 mean.RMSE5 mean.RMSE6 72 1 0.108693 0.045403 0.029070 0.027282 0.041949 0.020853 Overall mean 0.118678 0.035962 0.020465 0.020134 0.033457 0.092323 Standard Deviation 0.052386 0.015056 0.006978 0.007164 0.011086 0.092709 PAGE 78 78 Table 4 18 Partial c redit m odel ( t estlet s ize 3) RMSE of a bility ( ) estimate recovery (EAP) with 6 different ability i ntervals condition Sample size Testlet No. Local effect mean.RMSE1 mean.RMSE2 mean.RMSE3 mean.RMSE4 mean.RMSE5 mean.RMSE6 37 1000 15 0.25 0.065677 0.023965 0.011388 0.011902 0.023170 0.067682 38 0.5 0.060728 0.023976 0.013179 0.011456 0.020410 0.043749 39 0.75 0.099028 0.024728 0.013634 0.010496 0.023721 0.010887 40 1 0.048958 0.019700 0.017045 0.017225 0.031831 0.086938 41 10 0.25 0.094635 0.027001 0.014954 0.013781 0.019667 0.012397 42 0.5 0.090130 0.024572 0.013120 0.013289 0.021730 0.009448 43 0.75 0.083380 0.018589 0.011610 0.010228 0.022013 0.024883 44 1 0.078642 0.028244 0.014527 0.013482 0.020061 0.096752 45 5 0.25 0.082899 0.019888 0.012796 0.012045 0.019864 0.030363 46 0.5 0.080121 0.023337 0.011771 0.011788 0.020182 0.079555 47 0.75 0.064722 0.019505 0.011721 0.014189 0.025794 0.124530 48 1 0.067890 0.017737 0.012473 0.012460 0.024051 0.097091 49 500 15 0.25 0.154701 0.035439 0.017829 0.016028 0.022127 0.020737 50 0.5 0.105886 0.034800 0.017357 0.017181 0.026016 0.025455 51 0.75 0.097656 0.027177 0.016228 0.018008 0.034151 0.270534 52 1 0.086244 0.028811 0.018696 0.018263 0.034193 0.068944 53 10 0.25 0.111764 0.029271 0.018249 0.023287 0.025120 0.139872 54 0.5 0.078844 0.030477 0.018031 0.014985 0.043723 0.016498 55 0.75 0.090798 
0.028041 0.017230 0.015321 0.039667 0.031949 56 1 0.098187 0.032181 0.018453 0.018955 0.032947 0.000127 57 5 0.25 0.091645 0.029466 0.018295 0.022806 0.038532 0.037234 58 0.5 0.147778 0.045405 0.021106 0.022665 0.033154 0.080828 59 0.75 0.137374 0.036145 0.020466 0.019162 0.028722 0.122420 60 1 0.154455 0.036054 0.016647 0.016980 0.029782 0.033431 61 250 15 0.25 0.228860 0.042316 0.025572 0.026715 0.039057 0.039881 62 0.5 0.190427 0.086004 0.037864 0.029954 0.044274 0.067882 63 0.75 0.132743 0.045510 0.021947 0.024455 0.049834 0.315181 64 1 0.139310 0.033277 0.017402 0.021319 0.052945 0.171415 65 10 0.25 0.169489 0.053985 0.023519 0.021089 0.044170 0.138256 66 0.5 0.121261 0.048928 0.024269 0.027797 0.046095 0.047430 67 0.75 0.120247 0.054696 0.024694 0.022111 0.050509 0.222708 68 1 0.136154 0.038438 0.026148 0.033798 0.059179 0.047885 69 5 0.25 0.138147 0.061613 0.029549 0.026912 0.045897 0.267080 70 0.5 0.175470 0.065379 0.026202 0.027847 0.042611 0.178891 71 0.75 0.110521 0.045314 0.022411 0.023566 0.046565 0.070210 72 1 0.147110 0.059226 0.037350 0.033784 0.045651 0.060984 PAGE 79 79 Table 4 18 Continued condition Sample size Testlet No. Local effect mean.RMSE1 mean.RMSE2 mean.RMSE3 mean.RMSE4 mean.RMSE5 mean.RMSE6 Overall mean 0.113386 0.036089 0.019270 0.019315 0.034095 0.087781 Standard Deviation 0.040422 0.015425 0.006583 0.006572 0.011399 0.079865 PAGE 80 80 Table 4 19 Standard Rasch m odel ( t estlet s ize 3) RMSE of a bility ( ) e stimate r ecovery (EAP) with 6 d ifferent a bility i ntervals condition Sample size Testlet No. 
Local effect mean.bias1 mean.bias2 mean.bias3 mean.bias4 mean.bias5 mean.bias6 37 1000 15 0.25 0.071970 0.027048 0.012636 0.012719 0.023061 0.060829 38 0.5 0.060926 0.023909 0.012678 0.011401 0.020533 0.045204 39 0.75 0.090384 0.023542 0.013484 0.011052 0.022812 0.019934 40 1 0.050341 0.019446 0.016044 0.016289 0.029408 0.083912 41 10 0.25 0.086566 0.025718 0.014281 0.012950 0.019438 0.008700 42 0.5 0.086802 0.025173 0.012795 0.013299 0.021531 0.011279 43 0.75 0.094025 0.023338 0.011782 0.010458 0.020305 0.004227 44 1 0.073421 0.027330 0.014314 0.014277 0.020211 0.100202 45 5 0.25 0.080224 0.019463 0.012707 0.012450 0.019174 0.027102 46 0.5 0.075284 0.022845 0.011400 0.011556 0.020183 0.071554 47 0.75 0.062711 0.020036 0.012618 0.015555 0.026669 0.131456 48 1 0.063622 0.018998 0.013537 0.014393 0.026115 0.105073 49 500 15 0.25 0.174828 0.040648 0.019626 0.017030 0.022111 0.000227 50 0.5 0.127802 0.041100 0.018820 0.018756 0.026461 0.005880 51 0.75 0.093012 0.025979 0.016592 0.019151 0.037344 0.303639 52 1 0.089175 0.028986 0.018560 0.017811 0.033151 0.060292 53 10 0.25 0.123712 0.033044 0.020933 0.025251 0.024102 0.154511 54 0.5 0.098728 0.040028 0.018633 0.015801 0.039165 0.021575 55 0.75 0.089436 0.027861 0.017481 0.015466 0.040531 0.034278 56 1 0.088494 0.027750 0.020219 0.022417 0.041304 0.037713 57 5 0.25 0.089638 0.029695 0.019246 0.024006 0.040744 0.044998 58 0.5 0.153260 0.047723 0.022317 0.023691 0.034507 0.088560 59 0.75 0.180469 0.048970 0.027495 0.022214 0.031881 0.162449 60 1 0.186296 0.042333 0.019634 0.016705 0.029095 0.007079 61 250 15 0.25 0.244394 0.045048 0.025937 0.023603 0.034617 0.039916 62 0.5 0.147001 0.065525 0.028917 0.025894 0.037003 0.022912 63 0.75 0.113811 0.045457 0.024075 0.027296 0.053748 0.376276 64 1 0.157073 0.034502 0.018100 0.019511 0.046642 0.240287 65 10 0.25 0.167921 0.055283 0.023304 0.021270 0.042570 0.130323 66 0.5 0.123050 0.048287 0.027207 0.033872 0.052401 0.093415 67 0.75 0.125161 0.056347 0.024480 0.022407 0.048606 
0.187861 68 1 0.134140 0.041803 0.028784 0.038777 0.067885 0.084995 69 5 0.25 0.154649 0.069085 0.032520 0.030000 0.047157 0.292877 70 0.5 0.168907 0.064514 0.025804 0.028420 0.042436 0.180015 71 0.75 0.121488 0.053159 0.021906 0.023698 0.041709 0.099905 PAGE 81 81 Table 4 19 Continued condition Sample size Testlet No. Local effect mean.bias1 mean.bias2 mean.bias3 mean.bias4 mean.bias5 mean.bias6 72 1 0.149801 0.060850 0.037591 0.034294 0.045378 0.056767 Overall mean 0.116626 0.037523 0.019902 0.020104 0.034166 0.094340 Standard Deviation 0.044101 0.014734 0.006456 0.007159 0.011939 0.092177 PAGE 82 82 Table 4 20 Rasch t estlet model (t estlet s ize 5) RMSE of a bility ( ) estimate r ecovery (eap ) with 6 d ifferent a bility i ntervals condition Sample size Testlet No. Local effect mean.RMSE1 mean.RMSE2 mean.RMSE3 mean.RMSE4 mean.RMSE5 mean.RMSE6 1 1000 9 0.25 0.048535 0.020547 0.013542 0.014905 0.025987 0.132773 2 0.5 0.104665 0.027705 0.016009 0.016919 0.019188 0.098599 3 0.75 0.048658 0.022476 0.013048 0.016071 0.030985 0.114531 4 1 0.081613 0.028248 0.015910 0.014095 0.021880 0.069132 5 6 0.25 0.070898 0.022451 0.010947 0.011000 0.024396 0.062862 6 0.5 0.172489 0.050252 0.031796 0.026443 0.036537 0.004791 7 0.75 0.093437 0.024567 0.012562 0.013693 0.022836 0.080497 8 1 0.119667 0.021661 0.015078 0.010693 0.021798 0.100457 9 3 0.25 0.087055 0.020796 0.012494 0.011860 0.019987 0.034308 10 0.5 0.088673 0.022802 0.011603 0.010394 0.022048 0.026025 11 0.75 0.114683 0.029492 0.020703 0.015341 0.022141 0.175543 12 1 0.119646 0.030957 0.016737 0.016704 0.028018 0.003056 13 500 9 0.25 0.073423 0.028199 0.017699 0.019940 0.030755 0.112060 14 0.5 0.104189 0.034109 0.018287 0.020583 0.034834 0.062598 15 0.75 0.110667 0.032705 0.017843 0.016505 0.026520 0.141207 16 1 0.072086 0.029727 0.024652 0.026943 0.048265 0.173372 17 6 0.25 0.093659 0.037446 0.020469 0.014655 0.026255 0.077015 18 0.5 0.088630 0.032820 0.018278 0.020875 0.030910 0.080459 19 0.75 0.098950 0.028540 
0.021342 0.022284 0.033999 0.126608 20 1 0.071756 0.040521 0.024634 0.027753 0.049917 0.147327 21 3 0.25 0.146400 0.027316 0.019713 0.022029 0.054734 0.101774 22 0.5 0.094459 0.034377 0.014855 0.018803 0.030531 0.068490 23 0.75 0.083904 0.028614 0.019628 0.022143 0.038706 0.030053 24 1 0.072267 0.031387 0.019253 0.018611 0.040400 0.066370 25 250 9 0.25 0.104921 0.036662 0.020843 0.030739 0.053081 0.013633 26 0.5 0.137340 0.059429 0.025746 0.024807 0.045477 0.180335 27 0.75 0.136443 0.040754 0.027155 0.025382 0.051492 0.075550 28 1 0.107111 0.047915 0.050709 0.045143 0.084078 0.391390 29 6 0.25 0.144626 0.045561 0.025586 0.027354 0.038179 0.026828 30 0.5 0.138057 0.055004 0.026784 0.026009 0.041708 0.362921 31 0.75 0.199151 0.049567 0.024088 0.024056 0.045059 0.011575 32 1 0.127239 0.046378 0.022620 0.025210 0.046614 0.234785 33 3 0.25 0.183628 0.067105 0.045098 0.031144 0.051958 0.060797 34 0.5 1.788592 0.244154 0.092046 0.186136 0.065807 0.192168 35 0.75 0.156430 0.046850 0.026329 0.027011 0.046847 0.061260 36 1 0.105701 0.042695 0.021221 0.021594 0.048607 0.138145 PAGE 83 83 Table 4 20 Continued condition Sample size Testlet No. Local effect mean.RMSE1 mean.RMSE2 mean.RMSE3 mean.RMSE4 mean.RMSE5 mean.RMSE6 Overall mean 0.155268 0.041383 0.023203 0.025662 0.037793 0.106647 Standard Deviation 0.282199 0.036636 0.014413 0.028410 0.014431 0.087627 PAGE 84 84 Table 4 21 Partial c redit m odel ( t estlet s ize 5) RMSE of a bility ( ) e stimate r ecovery (EAP) with 6 d ifferent a bility i ntervals condition Sample size Testlet No. 
Local effect mean.RMSE1 mean.RMSE2 mean.RMSE3 mean.RMSE4 mean.RMSE5 mean.RMSE6 1 1000 9 0.25 0.055078 0.019749 0.012548 0.014505 0.026118 0.128978 2 0.5 0.093575 0.027011 0.014290 0.015383 0.020417 0.085914 3 0.75 0.062688 0.019097 0.011141 0.013724 0.025965 0.079033 4 1 0.090919 0.029425 0.016153 0.012023 0.019412 0.075747 5 6 0.25 0.071424 0.022914 0.010005 0.011628 0.028386 0.072207 6 0.5 0.124677 0.038863 0.023997 0.018060 0.025961 0.021936 7 0.75 0.105384 0.028632 0.013467 0.012453 0.024295 0.116020 8 1 0.069542 0.019988 0.013672 0.012453 0.024686 0.120254 9 3 0.25 0.071793 0.019053 0.011908 0.011416 0.021645 0.052494 10 0.5 0.080230 0.022653 0.011691 0.010857 0.022459 0.044464 11 0.75 0.100348 0.027188 0.017871 0.013633 0.020760 0.158864 12 1 0.100129 0.025238 0.013624 0.013346 0.022882 0.052848 13 500 9 0.25 0.087658 0.030786 0.021063 0.021593 0.037533 0.131595 14 0.5 0.120225 0.035837 0.017506 0.018423 0.032209 0.072691 15 0.75 0.108335 0.030604 0.014933 0.016627 0.026840 0.091080 16 1 0.075110 0.030694 0.026972 0.032415 0.054812 0.181353 17 6 0.25 0.119522 0.049433 0.021642 0.017807 0.026338 0.109247 18 0.5 0.104570 0.037916 0.018646 0.020389 0.030663 0.073772 19 0.75 0.094708 0.027392 0.021657 0.023170 0.036143 0.138768 20 1 0.091668 0.037460 0.020997 0.021922 0.042133 0.108095 21 3 0.25 0.153340 0.032996 0.016169 0.017654 0.047098 0.080737 22 0.5 0.094481 0.032526 0.013223 0.019982 0.033307 0.096095 23 0.75 0.096467 0.026004 0.019063 0.019189 0.036804 0.054503 24 1 0.087235 0.031068 0.017224 0.015474 0.034156 0.001936 25 250 9 0.25 0.133246 0.043864 0.021702 0.027811 0.039083 0.049585 26 0.5 0.145823 0.058810 0.024323 0.023814 0.043880 0.212430 27 0.75 0.117474 0.038922 0.023324 0.026192 0.049893 0.067516 28 1 0.117828 0.041894 0.037692 0.036645 0.068336 0.253170 29 6 0.25 0.111735 0.051914 0.028435 0.029285 0.036509 0.048822 30 0.5 0.135587 0.055161 0.023390 0.024883 0.040852 0.313178 31 0.75 0.256954 0.051364 0.026250 0.028499 0.056166 0.027842 32 1 
0.153590 0.046054 0.020337 0.025486 0.048274 0.178885 33 3 0.25 0.117107 0.047913 0.032525 0.024273 0.046733 0.025307 34 0.5 1.727757 0.225916 0.080002 0.195150 0.050771 0.121486 35 0.75 0.170151 0.046210 0.023578 0.025170 0.042063 0.056915 36 1 0.127922 0.043790 0.021082 0.021712 0.044274 0.156180 PAGE 85 85 Table 4 21 Continued condition Sample size Testlet No. Local effect mean.RMSE1 mean.RMSE2 mean.RMSE3 mean.RMSE4 mean.RMSE5 mean.RMSE6 Overall mean 0.154841 0.040398 0.021169 0.024807 0.035774 0.101665 Standard Deviation 0.272132 0.033626 0.011842 0.029903 0.011965 0.066253 PAGE 86 86 Table 4 22 Standard R asch m odel ( t estlet s ize 5) RMSE of a bility ( ) e stimate r ecovery with 6 d ifferent a bility i ntervals condition Sample size T estlet No. Local effect mean.RMSE1 mean.RMSE2 mean.RMSE3 mean.RMSE4 mean.RMSE5 mean.RMSE6 1 1000 9 0.25 0.054505 0.019157 0.011612 0.013639 0.024171 0.123000 2 0.5 0.086199 0.025481 0.013019 0.015002 0.020782 0.077019 3 0.75 0.079972 0.019311 0.012218 0.013808 0.022459 0.055386 4 1 0.082276 0.027552 0.015155 0.011565 0.018216 0.072618 5 6 0.25 0.080914 0.025310 0.011499 0.010704 0.024551 0.058512 6 0.5 0.083499 0.026217 0.015746 0.011670 0.020758 0.055891 7 0.75 0.086500 0.023057 0.011474 0.011640 0.026323 0.098057 8 1 0.062397 0.020569 0.014227 0.013574 0.026134 0.123626 9 3 0.25 0.070911 0.019228 0.011657 0.011668 0.021136 0.050452 10 0.5 0.064980 0.022725 0.015423 0.013908 0.028543 0.072700 11 0.75 0.107659 0.030363 0.020102 0.015079 0.020511 0.172489 12 1 0.098714 0.025765 0.013472 0.013395 0.022484 0.045555 13 500 9 0.25 0.086103 0.027197 0.017137 0.019369 0.030920 0.104854 14 0.5 0.092530 0.034024 0.017266 0.022220 0.042295 0.115485 15 0.75 0.114248 0.031499 0.015170 0.015623 0.025589 0.100740 16 1 0.074132 0.030469 0.027548 0.032850 0.056242 0.195890 17 6 0.25 0.102211 0.040934 0.020133 0.015024 0.026044 0.075459 18 0.5 0.092445 0.032911 0.017773 0.019562 0.030586 0.092825 19 0.75 0.090080 0.029781 0.025252 0.026233 
0.043281 0.153303 20 1 0.101604 0.039027 0.019060 0.018473 0.035121 0.066013 21 3 0.25 0.150032 0.032273 0.016446 0.018240 0.048618 0.085607 22 0.5 0.088842 0.031174 0.014248 0.023129 0.038932 0.118433 23 0.75 0.096743 0.026986 0.020221 0.020743 0.040151 0.041003 24 1 0.088962 0.030766 0.016913 0.015060 0.033199 0.003464 25 250 9 0.25 0.108200 0.037172 0.023692 0.035562 0.060802 0.055020 26 0.5 0.118448 0.050578 0.023559 0.028191 0.052439 0.293215 27 0.75 0.152811 0.044221 0.025725 0.024325 0.041710 0.035497 28 1 0.112522 0.040882 0.037117 0.035945 0.068377 0.269680 29 6 0.25 0.097628 0.045396 0.024069 0.027508 0.038001 0.009918 30 0.5 0.115448 0.046363 0.021887 0.025772 0.045353 0.245964 31 0.75 0.261414 0.057456 0.033796 0.039445 0.070626 0.037632 32 1 0.131436 0.040350 0.017983 0.023282 0.051087 0.221120 33 3 0.25 0.118245 0.047826 0.032667 0.024198 0.046625 0.028095 34 0.5 1.664506 0.200187 0.062951 0.211889 0.040309 0.053299 35 0.75 0.210820 0.057664 0.026479 0.025437 0.038700 0.123125 PAGE 87 87 Table 4 22 Continued condition Sample size Testlet No. 
Local effect mean.RMSE1 mean.RMSE2 mean.RMSE3 mean.RMSE4 mean.RMSE5 mean.RMSE6
36 1 0.118271 0.041488 0.021033 0.021938 0.047101 0.131175
Overall mean 0.148506 0.038371 0.020659 0.025713 0.036894 0.101726
Standard Deviation 0.262839 0.029611 0.009735 0.032806 0.013939 0.070900

Table 4-23 NBOME Level 2 Block 1 item WMSE

Item   Testlet Model   Rasch Model   Partial Credit Model
1      1.03            1.01          1
2      1.03            1.03          1.02
3      1.01            1.01          1
4      1.19            1.04          1.06
5      1.06            1.03          1.02
6      0.99            0.99          0.99
7      0.99            0.98          0.97
8      1.08            1             1
9      0.97            0.98          0.98
10     1.04            1.02          1.01
11     1.03            1             0.99
12     1.03            1.01          1
13     1.04            1.01          1
14     1.04            1.04          1.02
15     1.09            1.01          0.99
16     1.03            1.03          1
17     1.05            1.03          1
18     1.1             1.04          1.01
19     1.04            1.01          0.99
20     1               1.01          1
21     1               0.98          0.98
22     1.02            1.04          1.01
23     1.06            1             0.99
24     1.04            1.03          1.01
25     1.07            1.02          1.02
26     1.05            1.02          1.02
27     1.02            1.02          1
28     1.14            1.02          1.03
29     1.03            0.99          1.03
30     0.97            1             1.01
31     1.06            1             1.01
32     1.01            0.99          0.97
33     1.09            0.99          0.99
34     1.13            1.18          1.03
35     1.04            1             1.02
36     1               1.04          1.07
37     0.93            0.99          1.02
38     1.04            1.01
39     1.05            0.97
40     1.07            0.98
41     1.03            1
42     1.01            0.97
43     0.96            1
44     0.96            1.03
45     1.01            0.99
46     0.86            0.99
47     1.06            1.02
48     1.16            1.02
49     1.02            1.01
50     1.08            1.02
MAX    1.19            1.18          1.07
MIN    0.86            0.97          0.97
MEAN   1.0362          1.012         1.007027
SD     0.055729        0.0309        0.021064

Table 4-24 COMLEX Level 2 2008 block 1 local item dependence detection results

Sequence   Significant item pair   P value
1          item3, item23           0.002
2          item5, item42           0.009
3          item6, item19           0.000
4          item6, item21           0.000
5          item7, item49           0.007
6          item9, item39           0.000
7          item9, item40           0.004
8          item10, item36          0.009
9          item11, item15          0.002
10         item11, item35          0.009
11         item12, item43          0.009
12         item19, item21          0.004
13         item20, item30          0.002
14         item28, item31          0.000
15         item29, item30          0.000
16         item29, item39          0.002
17         item30, item31          0.009
18         item32, item33          0.000
19         item35, item42          0.000
20         item37, item38          0.004
21         item37, item40          0.000
22         item39, item40          0.000
23         item41, item42          0.000
24         item43, item44          0.000
25         item45, item46          0.000
26         item47, item48          0.000

Note:
Number of sampled matrices: 450; Number of item pairs tested: 1225; Item pairs with one-sided p < 0.01

CHAPTER 5 DISCUSSION

5.1 General Discussion

Several findings related to testlet modeling emerged from the simulation results and the empirical case results.

First, the Partial Credit model and the Rasch testlet model performed better than the standard Rasch model under small and medium testlet sizes. There is not sufficient evidence to indicate which of the two, the Partial Credit model or the Rasch testlet model, performs better. The results also show that sample size has a significant effect on the analysis results for the three models. As the sample size increases, the discrepancies between the model estimates and the real data set increase. The degree of test reliability overestimation for the standard Rasch model also increases as the sample size increases. In addition, when the testlet size stays at a medium level (i.e., testlet size <5), the ratio of independent items to testlet items within a test, in terms of the number of testlets [...]

Second, the findings show no obvious difference in test reliability [...] polytomous IRT model to testlets is an approach that does not result in a reduction in test reliability. Previous concerns about a reduction in test reliability when a polytomous IRT model is applied to testlets (Keller, Swaminathan, & Sireci, 2003) are not as severe as we expected in small and medium testlet size situations. We believe that the small number of parameters dropped when the polytomous items are formed does not drastically hurt the reliability estimates for the entire test when the testlet size is small.
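The reliability overestimation that arises when testlet items are treated as locally independent can be illustrated with a small simulation. The following is a minimal sketch, not the procedure used in this study: it assumes coefficient alpha as the reliability index, generates dichotomous responses from a Rasch testlet model with hypothetical settings (2,000 examinees, 10 testlets of 3 items, testlet-effect standard deviation of 1), and compares alpha computed on the individual items with alpha computed on the testlet sum scores. All function names and parameter values are illustrative.

```python
import math
import random


def simulate_testlet_responses(n_persons, n_testlets, testlet_size, gamma_sd, rng):
    """Simulate 0/1 responses with logit P(x=1) = theta - b + gamma (Rasch testlet form)."""
    n_items = n_testlets * testlet_size
    difficulties = [rng.uniform(-1.5, 1.5) for _ in range(n_items)]
    data = []
    for _ in range(n_persons):
        theta = rng.gauss(0.0, 1.0)  # person ability
        row = []
        for d in range(n_testlets):
            gamma = rng.gauss(0.0, gamma_sd)  # person-specific effect for testlet d
            for j in range(testlet_size):
                b = difficulties[d * testlet_size + j]
                p = 1.0 / (1.0 + math.exp(-(theta - b + gamma)))
                row.append(1 if rng.random() < p else 0)
        data.append(row)
    return data


def sample_variance(values):
    """Unbiased sample variance."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def cronbach_alpha(scores):
    """Coefficient alpha for a persons-by-parts score matrix."""
    k = len(scores[0])
    part_variances = sum(sample_variance([row[j] for row in scores]) for j in range(k))
    total_variance = sample_variance([sum(row) for row in scores])
    return k / (k - 1) * (1.0 - part_variances / total_variance)


def collapse_to_testlet_scores(data, testlet_size):
    """Sum the items within each testlet to form one polytomous score per testlet."""
    return [
        [sum(row[s:s + testlet_size]) for s in range(0, len(row), testlet_size)]
        for row in data
    ]


rng = random.Random(2010)  # fixed seed so the sketch is reproducible
data = simulate_testlet_responses(2000, 10, 3, 1.0, rng)
alpha_items = cronbach_alpha(data)
alpha_testlets = cronbach_alpha(collapse_to_testlet_scores(data, 3))
print(f"alpha, items treated as independent: {alpha_items:.3f}")
print(f"alpha, testlet sum scores:           {alpha_testlets:.3f}")
```

Under these assumptions, alpha computed on the individual items exceeds alpha computed on the testlet scores, because the dependence within testlets is counted as true-score covariance at the item level.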
Third, the standard error of measurement results from the ability parameter estimation suggest that the standard Rasch model underestimates the standard error of measurement compared with the other two models.

Fourth, the bias and RMSE results from the ability parameter recovery reveal no evident pattern linking the manipulated factors (i.e., testlet size, sample size, and the number of testlets within a test) to changes in bias or RMSE. The magnitude of the local item effects does not have an evident impact on the accuracy of the ability estimation. However, this study investigates only a small range of local dependence effects (i.e., [0, 1]); a broader range is worthy of further investigation. Using EAP estimates has a major effect on the bias and RMSE results at both tails of the ability distribution.

In sum, because all three models are Rasch-type models, the precision of ability parameter recovery is relatively good for each of them. All three Rasch-type models show some robustness when the local item independence assumption is violated.

5.2 Limitations and Suggestions for Future Research

Although there is no obvious discrepancy in the test reliability estimates between the Rasch testlet model and the Partial Credit model, some parameters are dropped from the polytomous IRT model compared with the dichotomous IRT model application. Because of this parameter dropping, a decrease in reliability is still expected (Sireci et al., 1991). The question is whether the decrease in test reliability is due to the change of test format (i.e., from dichotomous items to a polytomous item within a testlet) or to the local item dependence within testlets. Therefore, the [...] 1993) and applied by Zenisky et al. (2002). The [...] answer the aforementioned question.
Thus, the true cause of the test reliability reduction could be identified. Because of limited time, we do not [...] these two aforementioned situations. For future research, it is worthwhile to include [...]

5.3 Conclusion

This study compares the performance of three models in small and medium testlet size situations across changes in sample size, in the ratio of independent items to testlet items within a test, and in the local item effects. The findings indicate that using the polytomous IRT model for testlet item analyses is still efficient for small testlet sizes and non-adaptive tests. Although the Rasch testlet model and the Partial Credit model both perform better than the standard Rasch model when testlets are small, a large proportion of testlet items in a test makes the Rasch testlet model unstable, as shown by the high rate of MLE non-convergence in its application. For small testlet sizes, polytomous IRT models are more stable than the Rasch testlet model when a test includes a large number of testlets. This instability may be caused by the multidimensionality of the Rasch testlet model. The relationship between the model's instability and its multidimensionality is worthy of further investigation.

Furthermore, computational efficiency should also be considered when selecting a model for testlet analysis. The simulations were conducted on a personal computer with a 2.83 GHz Intel Xeon processor. It took 1459771.40 seconds (approximately 405.5 hours) to complete 12 conditions of data simulation and analysis with the Rasch testlet model. It took only 48659.05 seconds (approximately 13.5 hours) for the Partial Credit model to complete the data simulation and analysis for the corresponding conditions.
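As a quick check of the run times reported above, the conversion from seconds to hours and the relative cost of the two models can be worked out directly. This is only an arithmetic sketch; the variable names are illustrative and were not part of the original analysis.

```python
SECONDS_PER_HOUR = 3600.0

# Total run times reported in the text, in seconds.
testlet_seconds = 1459771.40  # Rasch testlet model, 12 conditions
pcm_seconds = 48659.05        # Partial Credit model, corresponding conditions

testlet_hours = testlet_seconds / SECONDS_PER_HOUR
pcm_hours = pcm_seconds / SECONDS_PER_HOUR
time_ratio = testlet_seconds / pcm_seconds  # how many times longer the testlet model ran

print(f"Rasch testlet model:  {testlet_hours:.1f} hours")
print(f"Partial Credit model: {pcm_hours:.1f} hours")
print(f"Ratio (testlet / partial credit): {time_ratio:.1f}")
```

On these figures, the Rasch testlet model runs took roughly 30 times as long as the Partial Credit model runs.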
Typically, a single calibration of the Rasch testlet model in ConQuest took approximately 45 to 70 minutes, whereas a single calibration of the Partial Credit model took only approximately 5 to 8 minutes.

By examining the models used to analyze testlet items when testlets are small, this investigation provides guidance for model selection in future analyses of testlet-type data. The polytomous IRT model and the Rasch testlet model offer an advantage over the standard Rasch model: they avoid underestimating the standard error of measurement and yield better ability parameter estimates in small testlet size situations.

LIST OF REFERENCES

Adams, R. J., Wilson, M., & Wang, W. C. (1997). The multidimensional random coefficients multinomial logit model. Applied Psychological Measurement, 21(1), 1-23.
Armstrong, Ronald D. (2004). Computerized Adaptive Testing With Multiple-Form Structures. Applied Psychological Measurement, 28(3), 147-164.
Andrich, D. (1978). Application of a psychometric model to ordered categories which are scored with successive integers. Applied Psychological Measurement, 2, 581-594.
Ariel, A., Veldkamp, B. P., Breithaupt, K. (2006). Optimal Testlet Pool Assembly for Multistage Testing Designs. Applied Psychological Measurement, 30(3), 204-215.
Baldwin, S. G. (2007). A review of Testlet response theory and its applications. Journal of Educational and Behavioral Statistics, 32(3), 333-336.
Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika, 37, 29-51.
Bock, R. D., & Mislevy, R. J. (1982). Adaptive EAP estimation of ability in a microcomputer environment. Applied Psychological Measurement, 6, 431-444.
Bradlow, E. T., Wainer, H., & Wang, X. (1999). A Bayesian random effects model for testlets. Psychometrika, 64, 153-168.
Brandt, S. (2008). Estimation of a Rasch Model Including Subdimensions.
IERI Monograph Series: Issues and Methodologies in Large-Scale Assessments, 1, 51-70.
Breithaupt, K., Ariel, A., Veldkamp, B. P. (2005). Automated Simultaneous Assembly for Multistage Testing. International Journal of Testing, 5(3), 319-330.
Breithaupt, Krista (2007). Automated Simultaneous Assembly of Multistage Testlets for a High-Stakes Licensing Examination. Educational and Psychological Measurement, 67(1), 5-20.
Chen, W. H., Thissen, D. (1997). Local dependence indexes for item pairs using item response theory. Journal of Educational and Behavioral Statistics, 22(3), 265-289.
Davis, Laurie Laughlin (2003). Item Exposure Constraints for Testlets in the Verbal Reasoning Section of the MCAT. Applied Psychological Measurement, 27(5), 335-356.
DeMars, C. E. (2006). Application of the Bi-Factor Multidimensional Item Response Theory Model to Testlet-Based Tests. Journal of Educational Measurement, 43(2), 145-168.
Feldt, L. S. (2002). Estimating the internal consistency reliability of tests composed of testlets varying in length. Applied Measurement in Education, 15(1), 33-48.
Fischer, G. H. (1974). Einführung in die Theorie psychologischer Tests [Introduction to mental test theory]. Berne: Huber.
Gessaroli, M. E., Folske, J. C. (2002). Generalizing the Reliability of Tests Comprised of Testlets. International Journal of Testing, 2(3-4), 277-295.
Habing, B., Roussos, Louis A. (2003). On the need for negative local item dependence. Psychometrika, 68(3), 435-451.
Haertel, E. H. (2006). Reliability. In R. L. Brennan (Ed.), Educational Measurement (4th ed., pp. 65-110). Westport, CT: American Council on Education and Praeger.
Hambleton, R. K., & Murray, L. N. (1983). Some goodness of fit investigations for item response models. In R. K. Hambleton (Ed.), Applications of item response theory (pp. 71-94). Vancouver, BC: Educational Research Institute of British Columbia.
Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications.
Norwell, MA: Kluwer Academic Publishers.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory (Vol. 2). Newbury Park, CA: Sage Publications.
Hendrickson, A. (2007). An NCME instructional module on multistage testing. Educational Measurement: Issues and Practice, 26(2), 44-52.
Holland, P. W. (1990). On the sampling theory foundations of item response theory models. Psychometrika, 55, 577-602.
Ip, E. H., Smits, D. J. M., De Boeck, P. (2009). Locally dependent linear logistic test model with person covariates. Applied Psychological Measurement, 33(7), 555-569.
Jang, E. E., Roussos, L. (2007). An investigation into the dimensionality of TOEFL using conditional covariance-based nonparametric approach. Journal of Educational Measurement, 44(1), 1-21.
Keller, L. A., Swaminathan, H., & Sireci, S. G. (2003). Evaluating Scoring Procedures for Context-Dependent Item Sets. Applied Measurement in Education, 16(3), 207-222.
Kim, J. K., & Nicewander, W. A. (1993). Ability estimation for conventional tests. Psychometrika, 58, 587-599.
Lee, G. M., Frisbie, D. A. (1999). Estimating reliability under a generalizability theory model for test scores composed of testlets. Applied Measurement in Education, 12(3), 237-255.
Lee, G. M. (2000). A comparison of methods of estimating conditional standard errors of measurement for testlet-based test scores using simulation techniques. Journal of Educational Measurement, 37(2), 91-112.
Lee, G. M. (2000). Estimating conditional standard errors of measurement for tests composed of testlets. Applied Measurement in Education, 13(2), 161-180.
Lee, G. M. (2001). Comparison of dichotomous and polytomous item response models in equating scores from tests composed of testlets. Applied Psychological Measurement, 25(4), 357-372.
Lee, G. M., Dunbar, S. B., Frisbie, D. A. (2001). The relative appropriateness of eight measurement models for analyzing scores from tests composed of testlets.
Educational and Psychological Measurement, 61(6), 958-975.
Li, Y. M. (2005). A Test Characteristic Curve Linking Method for the Testlet Model. Applied Psychological Measurement, 29(5), 340-356.
Li, Y. M. (2006). A Comparison of Alternative Models for Testlets. Applied Psychological Measurement, 30(1), 3-21.
Loevinger, J. (1947). A systematic approach to the construction and evaluation of tests of ability (Psychological Monographs 61, No. 4). Richmond, VA: Psychometric Society.
Lord, F. M. (1952). A theory of test scores. Psychometric Monograph No. 7.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, Mass.: Addison-Wesley.
Luecht, R., Brumfield, T., & Breithaupt, K. (2006). A Testlet Assembly Design for Adaptive Multistage Tests. Applied Measurement in Education, 19(3), 189-202.
Mair, P., & Hatzinger, R. (2007). Extended Rasch Modeling: The eRm Package for the Application of IRT Models in R. Journal of Statistical Software, 20(9), 1-20.
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149-174.
Meijer, R. R. (2004). Using Patterns of Summed Scores in Paper-and-Pencil Tests and Computer Adaptive Tests to Detect Misfitting Item Score Patterns. Journal of Educational Measurement, 41(2), 119-136.
Mislevy, R. J. (1986). Bayes modal estimation in item response models. Psychometrika, 51, 177-195.
Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16, 159-176.
Pitt, M. A., Kim, W., & Myung, I. J. (2003). Flexibility versus Generalizability in Model Selection. Psychonomic Bulletin & Review, 10, 29-44.
Pomplun, M., & Ritchie, T. (2004). An Investigation of Context Effects for Item Randomization within Testlets. Journal of Educational Computing Research, 30(3), 243-254.
Ponocny, I.
(2001). Nonparametric goodness of fit tests for the Rasch model. Psychometrika, 66(3), 437-460.
Puhan, G., Moses, T. P., Grant, M. C., & McHale, F. (2009). Small sample equating using a single group nearly equivalent test (SiGNET) design. Journal of Educational Measurement, 46(3), 344-362.
R Development Core Team (2006). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org
Rae, G. (2008). A note on using alpha and stratified alpha to estimate the reliability of a test composed of item parcels. British Journal of Mathematical and Statistical Psychology, 61(2), 515-525.
Rivera, C., & Stansfield, C. W. (2003). The effect of linguistic simplification of science test items on score comparability. Educational Assessment, 9(3-4), 79-105.
Rosenbaum, P. R. (1988). Item bundles. Psychometrika, 53, 349-359.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph Supplement, 17, 1-100.
Schmitt, N. (2002). Do reactions to tests produce changes in the construct measured? Multivariate Behavioral Research, 37(1), 105-126.
Sheehan, K. M., & Lewis, C. (1992). Computerized mastery testing with nonequivalent testlets. Applied Psychological Measurement, 16(1), 65-76.
Shen, L., & Yen, J. (1997). Item dependency in medical licensing examinations. Academic Medicine, 72, S19-S21.
Sireci, S. G., Thissen, D., & Wainer, H. (1991). On the reliability of testlet-based tests. Journal of Educational Measurement, 28, 237-247.
Stark, S., Chernyshenko, O. S., & Drasgow, F. (2004). Investigating the effects of local dependence on the accuracy of IRT ability estimation. Technical Report Series Two. American Institute of Certified Public Accountants.
Steinberg, L., & Thissen, D. (1996). Uses of item response theory and the testlet concept in the measurement of psychopathology. Psychological Methods, 1(1), 81-97.
Thissen, D., Billeaud, K., McLeod, L., & Nelson, L. (1997). A brief introduction to item response theory for items scored in more than two categories. Paper presented at the National Assessment Governing Board Achievement Levels Workshop, Boulder, CO.
Thissen, D., Steinberg, L., & Mooney, J. (1989). Trace lines for testlets: A use of multiple-categorical-response models. Journal of Educational Measurement, 26, 247-260.
Thissen, D. (2008). Review of 'Testlet response theory and its applications.' Journal of Educational Measurement, 45(3), 305-308.
Tokar, D. M., Fischer, A. R., Snell, A. F., & Harik-Williams, N. (1999). Efficient assessment of the five-factor model of personality: Structural validity analyses of the NEO Five Factor.
Tong, Y., & Kolen, M. J. (2007). Comparisons of methodologies and results in vertical scaling for educational achievement tests. Applied Measurement in Education, 20(2), 227-253.
Van den Wollenberg, A. L. (1982). Two new test statistics for the Rasch model. Psychometrika, 47, 123-140.
Vitacco, M. J. (2005). A Comparison of Factor Models on the PCL-R with Mentally Disordered Offenders: The Development of a Four-Factor Model. Criminal Justice and Behavior, 32(5), 526-545.
Wainer, H., & Lewis, C. (1990). Toward a Psychometrics for Testlets. Journal of Educational Measurement, 27(1), 1-14.
Wainer, H., Lewis, C., Kaplan, B., & Braswell, J. (1991). Building algebra testlets: A comparison of hierarchical and linear structures. Journal of Educational Measurement, 28(4), 311-323.
Wainer, H., & Kiely, G. L. (1987). Item clusters and computerized adaptive testing: A case for testlets. Journal of Educational Measurement, 24, 185-201.
Wainer, H., & Mislevy, R. J. (2000). Item response theory, item calibration, and proficiency estimation. In H. Wainer (Ed.), Computerized adaptive testing: A primer (2nd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.
Wainer, H., Sireci, S. G., & Thissen, D. (1991). Differential testlet functioning: Definitions and detection.
Journal of Educational Measurement, 28(3), 197-219.
Wainer, H. (1995). Precision and differential item functioning on a testlet-based test: The 1991 Law School Admissions Test as an example. Applied Measurement in Education, 8, 157-186.
Wainer, H., & Wang, X. (2000). Using a new statistical model for testlets to score TOEFL. Journal of Educational Measurement, 37, 203-220.
Wainer, H., Bradlow, E. T., & Du, Z. (2000). Testlet response theory: An analog for the 3PL model useful in testlet-based adaptive testing. In W. van der Linden & C. A. W. Glas (Eds.), Computerized adaptive testing: Theory and practice (pp. 245-269). London: Kluwer.
Wang, W. C., & Wilson, M. (2005a). Exploring local item dependence using a random-effects facet model. Applied Psychological Measurement, 29(4), 296-318.
Wang, W. C., & Wilson, M. (2005b). The Rasch testlet model. Applied Psychological Measurement, 29(2), 126-149.
Wang, W. C. (2005). Assessment of Differential Item Functioning in Testlet-Based Items Using the Rasch Testlet Model. Educational and Psychological Measurement, 65(4), 549-579.
Wang, X., Bradlow, E. T., & Wainer, H. (2002). A general Bayesian model for testlets: Theory and applications. Applied Psychological Measurement, 26, 109-128.
Wang, X. H. (2002). A general Bayesian model for testlets: Theory and applications. Applied Psychological Measurement, 26(1), 109-128.
Weaver, C. M., Meyer, R. G., Van Nort, J. J., & Tristan, L. (2006). Two-, Three-, and Four-Factor PCL-R Models in Applied Sex Offender Risk Assessments. Assessment, 13(2), 208-216.
Wilson, M., & Adams, R. J. (1995). Rasch Models for Item Bundles. Psychometrika, 60(2), 181-198.
Wu, M. L., Adams, R. J., & Wilson, M. R. (1998). ACER ConQuest: Generalized item response modeling software. Melbourne, VIC: Australian Council for Educational Research.
Yang, W. L., & Gao, R. (2008). Invariance of Score Linkings Across Gender Groups for Forms of a Testlet-Based College-Level Examination Program Examination.
Applied Psychological Measurement, 32, 45.
Yen, W. (1993). Scaling performance assessment: Strategies for managing local item dependence. Journal of Educational Measurement, 30, 187-213.
Zenisky, A. L., Hambleton, R. K., & Sireci, S. G. (2002). Identification and Evaluation of Local Item Dependencies in the Medical College Admissions Test. Journal of Educational Measurement, 39(4), 291-309.
Zwick, R. (2002). Application of an empirical Bayes enhancement of Mantel-Haenszel differential item functioning analysis to a computerized adaptive test. Applied Psychological Measurement, 26(1), 57-77.

BIOGRAPHICAL SKETCH

Ou Zhang was born in Chengdu, China. He completed his Bachelor of Science in computer science at Chengdu University of Technology in 2001 and his Master of Education in Educational Research, Measurement, and Evaluation at Boston College in 2007. He received his Master of Arts in Education degree from the Research and Evaluation Methodology program at the University of Florida in the fall of 2010. He is currently enrolled in the Ph.D. program in Research and Evaluation Methodology at the University of Florida.