THE EFFECT OF MULTIDIMENSIONALITY ON UNIDIMENSIONAL EQUATING WITH ITEM RESPONSE THEORY

By

PATRICIA DUFFY SPENCE

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

1996

This dissertation is dedicated to the memory of my father, James F. Duffy, 1929-1992.

ACKNOWLEDGMENTS

An effort of this magnitude always involves many people. The author wishes to especially thank the chairman of her committee, Dr. M. David Miller, for his dedication and inspiration. Without his encouragement and good humor, this dissertation would not have been possible. The author would also like to thank her committee members for their guidance and patience, particularly Dr. James Algina and Dr. Linda Crocker. Their suggestions were always correct, if not always accepted. Also, without the inspiration of Dr. Charles Dziuban of the University of Central Florida, she would never have pursued studies in this field. In addition, the author recognizes her colleagues, past and present, at The Psychological Corporation, Volusia County District Schools, and the Florida Department of Education for the opportunities to apply her learning in practical situations. Gratitude is offered to her three parents, Jim, Joan, and Jeanne, who stressed the importance of learning and doing things well. Thanks also to special friends: Anne Seraphine for debating the meaning of life and monotonic curves; Nada Stauffer for quiet friendship; George Suarez for making her laugh; and Carlos Guffain for demanding her best. But the author is most indebted and grateful to her husband, Verne, who has supported and encouraged her through three degrees, and her daughter Cindy who is now left to carry on the Gator tradition alone.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTERS

1 INTRODUCTION
   Purpose
   Limitations
   Significance of the Study

2 REVIEW OF LITERATURE
   Test Equating
      Conditions for Equating
      Data Collection Designs
         Single-group designs
         Equivalent-group designs
         Anchor-test designs
   Equating Methods
      Conventional Methods of Equating
         Linear equating
         Equipercentile equating
      Equating Methods Based on Item Response Theory
         Item response theory
         IRT equating
   Multidimensionality
      Violation of the Unidimensionality Assumption
      Multidimensional Models
      Multidimensionality and Parameter Estimation
      Multidimensionality and IRT Equating

3 METHOD
   Purpose
   Introduction
   Research Questions
   Data Generation
      Design
      Model Description
      Item Parameters
      Response Data
      Noncompensatory Data
      Nonrandom Groups
   Estimation of Parameters
      Unidimensional IRT
      Analytical Estimation
   Equating
      Concurrent Calibration
      Equated bs
      Characteristic Curve Transformation
   Evaluation Criteria
      Comparison Conditions
      Statistical Criteria
   Summary

4 RESULTS AND DISCUSSION
   Simulated Data
      Item Parameters
      Analytical Estimation
      Simulated Ability Data
   Equating Results for Randomly Equivalent Groups
      Concurrent Calibration
      Equated bs
      Characteristic Curve Transformation
   Equating Results for Nonequivalent Groups
      Concurrent Calibration
      Equated bs and Characteristic Curve Transformation

5 CONCLUSIONS
   Effects of Multidimensional Model
   Effects of Equating Method
   Effects of the Number of Multidimensional Items
   Effects of Nonequivalent Examinee Groups
   Implications

APPENDIX: ITEM PARAMETER DATA
REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

Table
1 Summary of Recommendations for a Successful Equating
2 Summary of Unidimensional IRT Test Equating Studies
3 Summary of Studies of Unidimensional IRT Estimation with Multidimensional Data
4 Summary of Studies of Unidimensional Equating with Multidimensional Data
5 Simulated Compensatory Parameters for MD30, Form A
6 Simulated Noncompensatory Parameters for Multidimensional Items, MD30 Form A
7 Summary Statistics for Multidimensional Items in Compensatory and Noncompensatory Datasets
8 Summation of Research Equating Conditions
9 Analytical Estimates of the Unidimensional Parameters for Compensatory MD30, Form A
10 Descriptive Statistics for Compensatory Form A Item Parameters
11 Descriptive Statistics for Compensatory Form B Item Parameters
12 Descriptive Statistics for Multidimensional Item Parameters in Noncompensatory Form A
13 Descriptive Statistics for Multidimensional Item Parameters in Noncompensatory Form B
14 Descriptive Statistics for Analytical Unidimensional Estimates of Form A Item Parameters
15 Summary Statistics for Analytical Unidimensional Estimates of Form B Item Parameters
16 Descriptive Statistics for Simulated Examinees Taking MD10
17 Descriptive Statistics for Simulated Examinees Taking MD20
18 Descriptive Statistics for Simulated Examinees Taking MD30
19 Descriptive Statistics for Simulated Examinees Taking MD40
20 Descriptive Statistics for Simulated Low Ability Examinees
21 Summary of Concurrent Calibration Results with Randomly Equivalent Groups
22 Constants for Equated bs Equating of Compensatory Forms with Randomly Equivalent Groups
23 Constants for Equated bs Equating of Noncompensatory Forms with Randomly Equivalent Groups
24 Summary of Equated bs Results with Randomly Equivalent Groups
25 Summary of Characteristic Curve Transformation Results with Randomly Equivalent Groups
26 Summary of Equating Results with Nonequivalent Groups
27 Constants for Equated bs Equating of Compensatory Forms with Nonequivalent Examinee Groups
28 Simulated Compensatory Item Parameters for MD10 Form A
29 Simulated Compensatory Item Parameters for MD10 Form B
30 Simulated Compensatory Item Parameters for MD20 Form A
31 Simulated Compensatory Item Parameters for MD20 Form B
32 Simulated Compensatory Item Parameters for MD30 Form A
33 Simulated Compensatory Item Parameters for MD30 Form B
34 Simulated Compensatory Item Parameters for MD40 Form A
35 Simulated Compensatory Item Parameters for MD40 Form B
36 Noncompensatory Item Parameters for Multidimensional Items in MD10 Forms A and B
37 Noncompensatory Item Parameters for Multidimensional Items in MD20 Form A
38 Noncompensatory Item Parameters for Multidimensional Items in MD20 Form B
39 Noncompensatory Item Parameters for Multidimensional Items in MD30 Form A
40 Noncompensatory Item Parameters for Multidimensional Items in MD30 Form B
41 Noncompensatory Item Parameters for Multidimensional Items in MD40 Form A
42 Noncompensatory Item Parameters for Multidimensional Items in MD40 Form B
43 Analytical Estimates of Unidimensional Item Parameters for MD10 Form A
44 Analytical Estimates of Unidimensional Item Parameters for MD10 Form B
45 Analytical Estimates of Unidimensional Item Parameters for MD20 Form A
46 Analytical Estimates of Unidimensional Item Parameters for MD20 Form B
47 Analytical Estimates of Unidimensional Item Parameters for MD30 Form A
48 Analytical Estimates of Unidimensional Item Parameters for MD30 Form B
49 Analytical Estimates of Unidimensional Item Parameters for MD40 Form A
50 Analytical Estimates of Unidimensional Item Parameters for MD40 Form B
51 Descriptive Statistics for Compensatory MD10 Linking Items with Randomly Equivalent Groups
52 Descriptive Statistics for Compensatory MD20 Linking Items with Randomly Equivalent Groups
53 Descriptive Statistics for Compensatory MD30 Linking Items with Randomly Equivalent Groups
54 Descriptive Statistics for Compensatory MD40 Linking Items with Randomly Equivalent Groups
55 Descriptive Statistics for Noncompensatory MD10 Linking Items
56 Descriptive Statistics for Noncompensatory MD20 Linking Items
57 Descriptive Statistics for Noncompensatory MD30 Linking Items
58 Descriptive Statistics for Noncompensatory MD40 Linking Items
59 Descriptive Statistics for Compensatory Linking Items with Nonequivalent Groups

LIST OF FIGURES

Figure
1 An item characteristic curve (ICC) based on the three-parameter logistic model
2 An item response surface (IRS) based on the compensatory M2PL
3 Item response surfaces and contour plots for item 9, MD20, a = 20
4 Item response surfaces and contour plots for item 10, MD20, a = 30
5 Item response surfaces and contour plots for item 11, MD20, a = 45
6 Item response surfaces and contour plots for item 12, MD20, a = 60
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

THE EFFECT OF MULTIDIMENSIONALITY ON UNIDIMENSIONAL EQUATING WITH ITEM RESPONSE THEORY

By Patricia Duffy Spence

May 1996

Chairman: M. David Miller
Major Department: Foundations of Education

Test publishers apply unidimensional equating techniques to their products even though tests are expected to be multidimensional to some degree. This simulation study investigated the effects of ignoring multidimensionality when applying unidimensional item response theory (IRT) equating procedures. The specific effects studied were (a) multidimensional model, (b) type of equating procedure, (c) number of multidimensional items, and (d) distribution of examinee ability. Four test conditions were created by varying the number of multidimensional items contained in each test. The compensatory multidimensional two-parameter logistic model was selected for data generation. Four degrees of multidimensionality were spiraled throughout each test. The data were then transformed into corresponding noncompensatory items which had the same probability of success as the compensatory items for a given examinee. Four tests with 40 items each were simulated, with 12 common linking items and 28 unique items. For each experimental condition and form, responses for 1,000 simulees were generated. To examine the effects of nonrandom groups, responses for 1,000 less able examinees were also generated. Three unidimensional IRT equating methods were selected: (a) concurrent calibration, (b) equated bs, and (c) characteristic curve transformation. Parameters were calibrated with BILOG-386.
To evaluate the results of the research equatings, three comparison conditions were used: (1) the unidimensional approximations of the multidimensional item parameters calculated using an analytic procedure; (2) the simulated first ability dimension only; and (3) the averages of the two simulated abilities. Three statistical criteria (correlation, standardized differences between means, and standardized root mean square difference) were applied to the data. No significant effect on the unidimensional equating results was attributed to choice of multidimensional model. For randomly equivalent groups, there were also no effects due to choice of equating procedure. Concurrent calibration favored low-ability examinees when the ability distributions of the two groups were unequal. When the multidimensional composites described by the analytical estimation baseline are the data of interest, the number of multidimensional items had little effect on the unidimensional equating with randomly equivalent, normally distributed examinee groups. However, if the unidimensional factor is the trait of interest, the number of multidimensional items affected the equating outcomes, with results deteriorating as the number of multidimensional items increased. When examinee groups were not equivalent, equating results were affected in all conditions. Caution is advised in applying unidimensional equating procedures when the examinee groups are suspected of being from different ability levels.

CHAPTER 1
INTRODUCTION

In many large testing programs, examinees take one of multiple forms of the same test. Although the different editions are constructed to be as similar in content and difficulty as possible, it is inevitable that some differences will exist among the various forms (Petersen, Cook, & Stocking, 1983). Direct comparison of scores would, therefore, be unfair to an examinee who happened to take a more difficult form.
Because examinees are often in competition or are being directly compared, it is important to transform the scores in some way to make them equivalent. Equating is the statistical process of establishing equivalent raw or scaled scores on two or more test forms. Theoretically, the equating process adjusts for test and item characteristics so the propensity distributions would be the same regardless of which test form was administered. The application of equating to real data, however, can be full of problems and complications (Skaggs & Lissitz, 1986a). In practice, equating requires not only a knowledge of statistical models, but awareness and consideration of many other issues that have practical consequences for the use and interpretation of results. Brennan and Kolen (1987) discussed many of these issues, such as the presence of equating errors, specification of content, and security breaches. Many mathematical procedures have emerged to develop the equating transformations. Some are based on classical test theory while others arise from item response theory (IRT). Classical methods, including linear and equipercentile equating, do not seem robust to departures from optimal conditions (Cook & Eignor, 1983; Livingston, Dorans, & Wright, 1990; Skaggs & Lissitz, 1986b). Item response theory procedures, including equated bs, concurrent calibration, and characteristic curve transformation, present alternatives. Equating methods based on IRT have been found more accurate than those based on classical models (Harris & Kolen, 1985; Hills, Subhiyah, & Hirsch, 1988; Kolen, 1981; Marco, Petersen, & Stewart, 1983; Petersen, Cook, & Stocking, 1983). IRT models are grounded on strong assumptions, particularly that the item responses are unidimensional (Ansley & Forsyth, 1985). The unidimensionality assumption requires that each of the tests to be equated measures the same underlying ability. 
Any other factor that influences an examinee's score, such as guessing, speededness, cheating, item context, or instructional sensitivity, will violate the unidimensionality assumption. Some of these violations can be controlled, reduced, or eliminated, but the unidimensionality assumption will still be violated in many practical testing situations (Doody-Bogan & Yen, 1983). Attempts have been made to model multidimensional responses within the framework of IRT. Although these models describe multidimensional data more accurately than unidimensional models, estimation of parameters is complex and difficult in practice (Harrison, 1986). Test companies continue to apply unidimensional equating procedures to their products. The viability of using unidimensional models with multidimensional data must be explored to determine the effect on the equating outcomes. An understanding of what effect multidimensional data have on unidimensional equating results is of paramount importance. Empirical studies (Camilli, Wang, & Fesq, 1995; Cook & Eignor, 1988; Dorans & Kingston, 1985; Yen, 1984) indicate that violation of the unidimensionality assumption, while having some impact on results, may not be significant. However, each of these studies employed data from a different test, and their content may have influenced findings in an unknown manner. The number of multidimensional items and the degree of multidimensionality in each is also unknown. Therefore, the generalization of results across studies is difficult (Skaggs & Lissitz, 1986a). It is necessary to design research studies that permit manipulation of independent variables to understand exactly how violations of the unidimensionality assumption affect equating. Simulation studies present a technique to manipulate and control the desired variables.

Purpose

The purpose of the present study was to investigate the effect of multidimensional data in applying unidimensional IRT equating techniques.
The specific questions to be answered were:
1. Does the number of multidimensional items affect unidimensional equating results?
2. Does the equating procedure affect unidimensional equating results?
3. Do data simulated by using a compensatory model produce different unidimensional equating results than data simulated by using a noncompensatory model?
4. Are unidimensional equating results affected by the ability distribution of the two examinee groups?

Limitations

Results of this study are applicable only to the research conditions investigated. Generalizations to other item response theory models or other equating techniques are not justified.

Significance of the Study

In practice, test publishers today apply unidimensional equating techniques to their products. Because tests are expected to be multidimensional to some degree and it is difficult to identify multidimensionality accurately, it is important to investigate the effect of applying unidimensional equating techniques to multidimensional data. Previous studies have mainly explored unidimensional equating with empirical data that were suspected of being multidimensional. Although the results indicated the impact of violating the unidimensionality assumption may not be significant, the research designs did not allow manipulation of independent variables. In addition, the true multidimensionality of the underlying data was unknown in these empirical studies. The current simulation study allowed exploration of what effect multidimensionality had on the results obtained from a variety of unidimensional equating procedures while providing a means to manipulate variables. The techniques used to generate the data afforded a mechanism to control the dimensionality of the items and test forms. The specific questions investigated were selected as having the most value for current practitioners applying unidimensional equating procedures.
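To make the study design described above more concrete, the following is a minimal sketch of generating dichotomous responses from a compensatory multidimensional two-parameter logistic (M2PL) model, the model the abstract names for data generation. All parameter values and function names here are illustrative, not the actual values simulated in Chapter 3.

```python
# Sketch of response generation under the compensatory M2PL model.
# In this model the probability of a correct response is
#   P = 1 / (1 + exp(-(a1*theta1 + a2*theta2 + d))),
# so strength on one dimension can compensate for weakness on the other.
import math
import random

def m2pl_probability(theta, a, d):
    """Probability of a correct response for abilities theta,
    discriminations a, and intercept d (all illustrative)."""
    logit = sum(ai * ti for ai, ti in zip(a, theta)) + d
    return 1.0 / (1.0 + math.exp(-logit))

def simulate_response(theta, a, d, rng):
    """Draw a 0/1 response by comparing P(correct) to a uniform draw."""
    return 1 if rng.random() < m2pl_probability(theta, a, d) else 0

rng = random.Random(1996)
# One simulee with two ability dimensions; item parameters made up.
theta = (0.5, -0.2)
a, d = (1.2, 0.8), 0.1
p = m2pl_probability(theta, a, d)
responses = [simulate_response(theta, a, d, rng) for _ in range(1000)]
```

Repeating the draw over 1,000 simulees per form, as the study does, yields the simulated response matrices that are then calibrated with a unidimensional model.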
CHAPTER 2
REVIEW OF LITERATURE

Test Equating

Conditions for Equating

The purpose of equating is to establish a relationship between two test forms so that it becomes a matter of indifference to the examinee which form is taken. Petersen, Kolen, and Hoover (1989) stated that equating itself is simply an empirical procedure which imposes no restrictions on the properties of scores or on the method used to define the transformation. It is only when the purpose of equating and the definition of equivalent scores are considered that restrictions become necessary. Lord (1980) outlined four conditions that must be met for the successful equating of two test forms, X and Y. Briefly, the conditions are (a) equity, (b) population invariance, (c) symmetry, and (d) same ability. To satisfy the equity condition, it must make no difference to examinees at every ability level, θ, which form of the test is taken. The conditional frequency distribution f(x|θ) of the score on form X should be the same as the conditional frequency distribution f(x(y)|θ) of the transformed form Y score. Lord (1980) added that it is not sufficient for equity that f(x|θ) and f(x(y)|θ) have the same means; they must also have equal variances. If the tests are not equally reliable, it is no longer a matter of indifference which form is administered. The equity condition requires the standard error of measurement and the higher moments to be the same after transformation for examinees of identical ability. To fully satisfy this requirement, test forms X and Y must be strictly parallel (Kolen, 1981). However, if this condition is met, equating is no longer necessary. In practice, it is nearly impossible to construct multiple forms that are strictly parallel; therefore, equating is needed. Although the equity condition can never be met precisely, it serves to keep the purpose of equating in mind and guide the steps in the process.
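One conventional transformation reviewed later in this chapter, linear equating, can illustrate these conditions. A minimal sketch follows; the score data are fabricated for illustration only.

```python
# Sketch of conventional linear equating: a form-X score x is mapped to
# the form-Y scale by setting standardized scores equal,
#   (x - mean_X) / sd_X = (y - mean_Y) / sd_Y.
import statistics

def linear_equate(x, scores_x, scores_y):
    """Return the form-Y equivalent of form-X score x."""
    mx, my = statistics.mean(scores_x), statistics.mean(scores_y)
    sx, sy = statistics.pstdev(scores_x), statistics.pstdev(scores_y)
    return my + (sy / sx) * (x - mx)

form_x = [10, 12, 14, 16, 18]   # fabricated form-X scores
form_y = [11, 14, 17, 20, 23]   # fabricated form-Y scores
y_equiv = linear_equate(14, form_x, form_y)  # the form-X mean maps to the form-Y mean
```

The sketch also makes Lord's symmetry condition concrete: the transformation is linear and therefore invertible, so equating Y to X with the same data is exactly the inverse of equating X to Y.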
The population invariance and symmetry conditions also arise from the desire to achieve equivalent scores. If the scores from form X and form Y are equivalent, there is a one-to-one relationship between the two sets of scores. The transformation must be unique, independent of the groups used to derive the conversion (Petersen et al., 1989). The purpose of equating also requires that the equating function be invertible or symmetric. The equating must be the same regardless of which test is labelled X and which test is labelled Y (Lord, 1980). The two tests to be equated must also measure the same characteristic, whether defined as a latent trait, ability, or skill. This condition distinguishes true equating from scaling. Scores on X and Y can always be placed on the same scale, but they must measure the same construct to be considered equated (Dorans, 1990). It is unlikely that all conditions of equating can be met in practice. However, good approximations to this ideal can be achieved and are usually fairer to examinees than if no attempt at equating had occurred (Petersen et al., 1989). Research conducted over the past 20 years serves as a guide in the application and interpretation of equating transformations.

Data Collection Designs

Every equating consists of two parts: a data collection design and an analytical method to determine the appropriate transformation. Three basic sampling designs are most frequently described in the literature (Dorans, 1990; Dorans & Kingston, 1985; Petersen et al., 1989). The designs are classified as (a) single-group designs, (b) equivalent-groups designs, and (c) anchor-test designs.

Single-group designs

In single-group designs, both forms or tests to be equated are given to the same group of examinees. The difficulty levels of the tests are not confounded with the differences in the ability levels of the groups taking each test because the examinees are the same (Hambleton & Swaminathan, 1985).
However, Lord (1980) pointed out that the test administered second is not being given under typical conditions. Practice effects and fatigue may affect the equating process. To deal with this threat, the counterbalanced random-groups design may be employed. The single group is divided into two random half-groups. Both half-groups then take both tests in counterbalanced order, one group taking the old form first and the other taking the new form first (Petersen et al., 1989). Scores on both parallel forms are then equally affected by learning, fatigue, and practice.

Equivalent-groups designs

With single-group designs, it is also important to administer both tests on the same day so intervening experiences do not affect the results. However, it is difficult in practice to arrange the required time block. Equivalent-groups designs are a simple alternative. The two tests to be equated are given to two different random groups from the same population. However, differences in the ability distributions of the groups may introduce an unknown degree of bias (Hambleton & Swaminathan, 1985). Because there are no common data, it is impossible to adjust for any random differences (Petersen et al., 1989). Several researchers have studied the effects of these different group ability distributions on equating results. Harris and Kolen (1986) investigated the effect of differences in group ability on the equating of the American College Test (ACT) Math test. Although their results showed score equivalents somewhat higher for low-ability students and lower equivalent scores for high-ability examinees, the differences were not significant. The authors concluded that the equatings were robust to even large differences in group ability distributions. Similar results were found by Angoff and Cowell (1986) when they studied the population independence of equating transformations using Graduate Record Examination (GRE) data.
Some minor discrepancies were discovered, but the majority were not significant in horizontal equating situations. Cook, Eignor, and Taft (1988) hypothesized that differences in ability were expected when the groups took the two tests to be equated at different times of the year. Two forms of the Biology achievement test were administered. One form was given in the fall mainly to high school seniors, and the other form was administered predominantly to sophomores in the spring. Two fall administrations were also equated and studied. Because recency of instruction is important in some parts of this type of achievement test and most students study Biology in tenth grade, disparate results were attained from the fall/spring equating. The spring sample, containing mostly students who had just completed the subject tested, received higher scaled scores than the fall sample. In this study, the construct measured by the test depended on the sample of examinees to whom the test was administered. In contrast, the fall/fall equating was robust to group differences. This study demonstrates the importance of administering the test forms to be equated at the same time, especially when the content is instructionally sensitive.

Anchor-test designs

Lord (1980) stated the differences between two samples of examinees can be measured and controlled by administering to each examinee an anchor test measuring the same ability as tests X and Y. When an anchor test is used, equating may be carried out even when the two groups are not at the same ability level. The groups may be random groups from the same population or they may be nonequivalent or naturally occurring groups. The scores on the anchor test can be used to estimate the performance of the combined group (Cook & Petersen, 1987). The anchor test may be an internal part of both tests X and Y, or it may be an external separate test.
If an external anchor test is used, it should be administered after X or Y to avoid practice effects on the tests to be equated (Lord, 1980). The anchor-test design, while the most complicated of the data collection methods, is the most common in real testing situations. Constraints of time or available samples placed on large testing programs often require its use (Skaggs & Lissitz, 1986a). Properties of the anchor test can seriously affect the ensuing equating results. Klein and Jarjoura (1985) studied the properties and characteristics of anchor-test items in relation to the total test. A test of 250 items was equated using three different anchor tests. Although all anchors were similar to the total test in difficulty, only one of the anchor tests was representative of the total test content. The results confirmed the importance of including items on the anchor test that mirror as nearly as possible the content of the total test. In addition to content representativeness, the relative position of items in test books also seems to play an important role in anchor-test design. Kingston and Dorans (1984) examined relative position effects of items in a version of the GRE General Test. Although the equatings of the Verbal measure of the test were in close agreement, the Quantitative and Analytical measures showed sensitivity to relative item position. When possible, it is preferable to include the anchor items spiralled throughout the test in their operational positions. The length of the anchor test is another concern and the subject of several studies. Klein and Kolen (1985) used a certification test to examine the relationship between anchor test length and accuracy of equating results. The authors used anchor tests of varying lengths and examinee groups both similar and dissimilar in ability distribution. They concluded that when groups have similar ability distributions, the anchor test length has little effect.
However, as group ability distributions become more dissimilar, longer anchor tests work best. Klein and Kolen also found that anchor tests should correspond closely with the total test in content representation, difficulty, and discrimination. The study of Cook et al. (1988) is also pertinent to the question of anchor test length. When the groups differ in level of ability, as did the spring and fall samples, different anchor test lengths yielded disparate results. In contrast, when the groups have similar ability distributions, like the two fall samples, the equatings are similar for different anchor test lengths. When applying item response theory equating methods, anchor items are usually referred to as linking items. These linking items are used to scale the item parameter estimates. Equating with IRT requires that the item parameter estimates for the two test forms be on the same scale before equating. The quality of the equating depends largely on how well this item scaling is accomplished (Cook & Petersen, 1987). Wingersky and Lord (1984) studied the problem of the optimal number of linking items in the context of IRT concurrent calibration. The authors concluded that two linking items with small standard errors of estimation worked almost as well as a set of 25 linking items with large standard errors of estimation. Wingersky, Cook, and Eignor (1986) studied the characteristics of linking items and their effects on IRT equating. Monte Carlo procedures were used with parameter values set to imitate those estimated from the Verbal sections of the College Board Scholastic Aptitude Test (SATV). These values were selected to make the simulation as realistic as possible. Linking test lengths of 10, 20, and 40 items were used as well as variations in the size of the standard errors of estimation and distributions of examinee ability. Scaling was accomplished by both concurrent calibration and characteristic curve methods. 
The results of this study showed little difference between the two scaling methods, and the accuracy of both equating methods improved as the number of linking items increased. Unlike the findings of Wingersky and Lord (1984), linking items having standard errors of estimation similar to those found in actual SATV items provided slightly better equating outcomes than those chosen to have small errors of estimation. The studies reviewed clearly indicate that the properties of an anchor test are of great concern. Anchor or linking items should remain in the same relative positions in new and old forms, and as many anchor items as possible should be used (Cook & Eignor, 1988). The question of optimal anchor test length becomes even more important as the ability distributions of the samples used in equating become more dissimilar. Because anchor test designs are usually used in situations where ability distributions of the groups may vary to an unknown degree, the conclusions have important implications. The anchor test must also closely mirror the total test to be equated in statistical properties and content representativeness. As the correlation between scores on the anchor test and the scores on the new and old forms becomes higher, the ensuing equating also improves (Cook & Petersen, 1987). Many factors may affect equating results. Because the purpose of equating is to create a relationship between two tests so it makes no difference to the examinee which test is administered, each of these factors must be carefully considered in deciding on the equating design. Some general guidelines to successful equating are summarized in Table 1. Only after these factors have been carefully considered and the data have been collected, can a specific equating method be chosen.
Equating Methods

Conventional Methods of Equating

Once the data have been collected using one of the data collection designs reviewed, mathematical procedures are applied to the data to develop the equating transformation. Many such methods exist, some based on classical test theory and others on item response theory (IRT). The conventional methods, those arising from classical test theory, may be categorized as linear equating or equipercentile equating.

Table 1
Summary of Recommendations for a Successful Equating

Total Test
* Well-defined content specifications
* Item selection based on statistical data from field testing
* Length of at least 35 items

Examinees
* Sample size of at least 500
* Better results with groups similar in ability

Administrative
* Strictly controlled testing conditions
* Security of tests and items is maintained
* Scoring is controlled

Anchor Tests
* Representative of the total test in difficulty and discrimination
* Similar to the total test in content specifications
* Common items are in approximately the same position in the old and new forms
* Common items are identical in both forms
* About 20%-30% of total test length

Linear equating

In horizontal equating, the two tests to be equated are similar in difficulty. When administered to the same group of examinees, the raw score distributions are assumed to be different only with respect to the means and standard deviations (Hambleton & Swaminathan, 1985). Linear equating is based on this assumption. A transformation is identified such that scores on X and Y are considered to be equated if they correspond to the same number of standard deviations above or below the mean in some population. The two scores are equivalent if

(x - μx)/σx = (y - μy)/σy    (1)

These scores will have the same percentile rank if the distributions are the same (Crocker & Algina, 1986). Many variations of linear equating models exist whose details may be found in the literature (Angoff, 1971; Holland & Rubin, 1982; Marco et al., 1983).
Two of the more commonly used models are the Tucker model and the Levine equally reliable model. Both of these procedures produce an equating transformation of the form:

Lp(y) = Ay + B    (2)

where Lp(y) is the linear equating function for equating Y to X (Dorans, 1990). Adaptations of this formula exist for dealing with an anchor test, usually labelled V, when it is or is not part of the reported score. The difference between the Tucker model and the Levine equally reliable model lies in their underlying assumptions. Full discussions of these assumptions and derivations of the appropriate formulas may be found in Dorans (1990). Many studies have been conducted to assess the accuracy of linear equating methods. Skaggs and Lissitz (1986b) carried out a simulation study with an external anchor design. Both difficulty and discrimination values were manipulated. The authors discovered unacceptable results with linear equating when the discrimination means were unequal on the two tests. Marco, Petersen, and Stewart (1983) used 40 different linear equating models to transform SATV data. Both similar and dissimilar samples were used, as well as variations of anchor test designs and characteristics of the total tests. Some generalizations reached from the results of this ambitious study are as follows:

1. When a test is equated to a test or form like itself through a parallel anchor test and the ability distributions of the samples are identical, a linear model yields very good results.
2. When a test is equated to a test or form like itself through an easy or difficult anchor test with random samples, all of the models have a small mean square error.
3. When samples with dissimilar ability distributions are used, linear equating does not perform well.
4. When total tests differ in difficulty, linear models yield unsatisfactory results.
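In the single-group case with no anchor test, the slope and intercept of the transformation Lp(y) = Ay + B follow directly from matching means and standard deviations as in Equation 1: A is the ratio of the standard deviations and B aligns the means. A minimal Python sketch (the function name and score vectors are illustrative, not drawn from any of the studies cited):

```python
import statistics

def linear_equating(x_scores, y_scores):
    """Return L(y) = A*y + B placing Form Y scores on the Form X scale by
    matching means and standard deviations (single-group case, no anchor)."""
    mu_x, sd_x = statistics.mean(x_scores), statistics.pstdev(x_scores)
    mu_y, sd_y = statistics.mean(y_scores), statistics.pstdev(y_scores)
    A = sd_x / sd_y           # slope: ratio of standard deviations
    B = mu_x - A * mu_y       # intercept: aligns the two means
    return lambda y: A * y + B

# Hypothetical scores: Form Y runs 4 points easier than Form X
x_form = [10, 12, 14, 16, 18]
y_form = [14, 16, 18, 20, 22]
L = linear_equating(x_form, y_form)   # here A = 1 and B = -4
```

The Tucker and Levine models differ from this sketch in how they estimate the synthetic-population means and standard deviations from anchor-test data, but the final transformation has the same linear form.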
Two methods of selecting samples and five methods of equating, including two linear methods, were combined in a study by Livingston, Dorans, and Wright (1990). Again, when the samples differed in ability distributions the linear equatings were inaccurate, showing a large negative bias. Matching the samples on the basis of the anchor test did little to improve the results. The authors recommended dealing with ability differences by selecting a representative sample from each population and choosing an equating method that does not assume exchangeability for examinees based on their anchor test scores. Based on these studies, it can be seen that linear equating methods are distribution dependent. Although linear equating may perform satisfactorily in optimal conditions, it is likely to produce bias in real testing situations.

Equipercentile equating

In equipercentile equating, a transformation is chosen so that raw scores on two tests are considered to be equated if they have the same percentile rank (Angoff, 1971). This is based on the definition that score scales are comparable for two tests if their respective score distributions are identical in shape for some population (Braun & Holland, 1982). When this is true, a table of pairs of raw scores can be constructed. Because the pairs of raw scores are not necessarily numerically equal, it is necessary to transform one set of scores into the other set or to convert both sets to a new score (Petersen et al., 1989). In mathematical terms, the equipercentile equating function for equating Y to X on population P is

Ep(y) = Fp^-1(Gp(y))    (3)

where Gp(y) is the cumulative distribution of Y scores and Fp^-1() is the inverse of the cumulative distribution of X scores, Fp(x). A cumulative distribution function maps scores onto relative frequencies, while an inverse cumulative distribution function maps the relative frequencies onto scores (Dorans, 1990).
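Equation 3 can be sketched with empirical distributions: each Y score is mapped to its percentile rank under Gp, then to the X score holding that same rank under Fp. The following Python sketch is a simplified illustration (the function name and data are hypothetical, and operational programs smooth the discrete score distributions first):

```python
from bisect import bisect_right

def equipercentile_equating(x_scores, y_scores):
    """Return E(y) = F^-1(G(y)) built from empirical distributions, with
    linear interpolation between adjacent X order statistics."""
    xs, ys = sorted(x_scores), sorted(y_scores)

    def equate(y):
        p = bisect_right(ys, y) / len(ys)      # G(y): percentile rank of y
        k = p * (len(xs) - 1)                  # fractional position in sorted X
        i = int(k)
        if i >= len(xs) - 1:
            return float(xs[-1])
        return xs[i] + (k - i) * (xs[i + 1] - xs[i])

    return equate

# Hypothetical forms: every Y score is 5 points easier than its X counterpart
x_form = list(range(1, 101))
y_form = [v + 5 for v in x_form]
equate = equipercentile_equating(x_form, y_form)
```

Because the mapping is built only from the two score distributions, the sketch makes visible the point made above: nothing in the mathematics restricts it to tests measuring the same construct.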
As a mathematical model, equipercentile equating makes no assumptions about the tests to be equated. It simply compresses and stretches the score units on one test so that its raw score distribution matches the second test. It is only consideration of the purpose of equating and the desired condition of population invariance that prevents its application to tests measuring different constructs (Petersen et al., 1989). Generally, empirical studies have shown mixed results in assessing the accuracy of equipercentile equating. Livingston, Dorans, and Wright (1990) included an equipercentile equating method in their study. A composite of two equipercentile equatings, the procedure worked well in most situations. Similarly, the equipercentile equating produced acceptable results in all combinations of conditions in the Skaggs and Lissitz (1986b) study. On the other hand, in the investigation conducted by Petersen et al. (1983) using SAT data, equipercentile equating was studied along with the Tucker Equally Reliable and Levine Unequally Reliable linear models and three IRT methods. The equipercentile equating produced the worst results of all the methods investigated. This was especially true for the Verbal Test. In a 1983 study by Cook and Eignor reported in Skaggs and Lissitz (1986a), alternate forms of the biology, mathematics, and social studies achievement tests of the GRE were equated using various procedures. Again, results varied by test content, but the equipercentile method was inadequate in all cases. Cook and Eignor felt that equipercentile equating may have suffered from a lack of data at the extreme scores. The Cook et al. (1988) equatings with biology achievement test data also uncovered mixed results. Although the equipercentile equating method performed adequately with the parallel fall-to-fall samples, it was not sufficiently robust to the ability differences found in equating the fall and spring samples.
These mixed findings raise some concerns about the application of equipercentile equating. When raw scores are used, this method does not meet the conditions for equating. Hambleton and Swaminathan (1985) noted that a nonlinear transformation is needed to equalize the moments of the two distributions, resulting in a nonlinear relationship between the raw scores and the true scores. In turn, this implies that the tests are not equally reliable and it is no longer a matter of indifference to the examinee which form is taken. Besides violating the equity condition, the equipercentile equating process is population dependent. For the past forty years, large-scale testing programs publishing multiple forms of examinations have used an equating process. Until recently, most have employed one of the conventional linear or equipercentile procedures described. But recent psychometric developments have presented an alternative.

Equating Methods Based on Item Response Theory

Item response theory

A brief introduction to item response theory is essential to an understanding of the following equating procedures. Item response theory (IRT) is an attempt to model an examinee's performance on a test item as a function of the characteristics of the item and the examinee's ability on some unobserved, or latent, trait. The IRT model specifies the relationship between a latent trait and the observed performance on items designed to measure that trait. This relationship can then be depicted graphically by an item characteristic curve (ICC). The ICC depicts the probability that an examinee at any given ability level will make a correct response to an item. The graph is typically an S-shaped curve with ability, symbolized by θ, plotted on the horizontal axis and the probability of a correct response to item i, Pi(θ), plotted on the vertical axis. Many different mathematical models may be used to depict this functional relationship.
Most common in practice are the logistic class of models due to the ease of estimation. Birnbaum (1968) proposed a two-parameter logistic model (2PL) of the form

Pi(θ) = [1 + e^(-Dai(θ - bi))]^-1    (4)

where bi is the difficulty value, ai is the discrimination parameter, and D is a scaling factor, normally 1.7. The three-parameter logistic model (3PL) adds a third parameter, denoted ci, referred to as the lower asymptote. The mathematical form of the 3PL model is written as

Pi(θ) = ci + (1 - ci)[1 + e^(-Dai(θ - bi))]^-1    (5)

with ai, bi, and D defined as before. The value of ci is typically smaller than the value that would result if examinees were to make a random response to the item (Hambleton & Swaminathan, 1985). Figure 1 depicts an ICC based on the 3PL model. The one-parameter logistic model, or Rasch model, assumes all items have equal discrimination and no guessing occurs. This model is written

Pi(θ) = [1 + e^(-D(θ - bi))]^-1    (6)

where the parameters are defined as in the previous models. Cursory examination of the three IRT logistic models may lead to the conclusion that they form a type of hierarchy from least to most specific. However, the three models represent very different philosophical perspectives of measurement theory (Skaggs & Lissitz, 1986a). It is these differences that must be considered when selecting a model for a particular application.

Figure 1. An item characteristic curve (ICC) based on the three-parameter logistic model.

The use of any of the IRT models entails restrictive assumptions about the item response process. Briefly stated, the major assumptions of IRT are as follows:

1. The ICC accurately represents the data.
2. The data are unidimensional.
3. Responses are locally independent (Skaggs & Lissitz, 1986a).

An ICC is defined completely when its general form is specified and when the parameters of a particular item are known (Hambleton & Swaminathan, 1985). This leads to the basic advantage of IRT models.
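Equations 4 through 6 translate directly into code. The following Python sketch (function names are illustrative) computes Pi(θ) for each model and shows how the 2PL and Rasch models arise as special cases of the 3PL:

```python
import math

def p_3pl(theta, a, b, c, D=1.7):
    """Equation 5: 3PL probability of a correct response at ability theta,
    for an item with discrimination a, difficulty b, and lower asymptote c."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def p_2pl(theta, a, b, D=1.7):
    """Equation 4: the 2PL is the 3PL with the lower asymptote c fixed at 0."""
    return p_3pl(theta, a, b, 0.0, D)

def p_rasch(theta, b, D=1.7):
    """Equation 6: the one-parameter model further fixes a = 1."""
    return p_3pl(theta, 1.0, b, 0.0, D)

# At theta = b the 3PL gives (1 + c)/2, while the 2PL and Rasch curves give 0.5
example = p_3pl(0.0, 1.0, 0.0, 0.2)   # hypothetical item with c = 0.2
```

Note that this nesting is purely algebraic; as stated above, the three models nevertheless embody different measurement philosophies.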
When the data fit the model reasonably well, it is possible to demonstrate the invariance of item and ability parameters. When the item parameters are known, an examinee's ability may be estimated from any subset of the items. Also, item parameters may be calibrated with any sample drawn from a sufficiently large population (Skaggs & Lissitz, 1986a). These advantages cannot be derived from classical test theory and should have tremendous consequences for equating with item response theory. All of the practical IRT models are based on the unidimensionality assumption. This states that the probability of a correct response by examinees to a set of items can be mathematically modeled by using only one ability parameter (Kingston & Dorans, 1984). According to Lord (1980), while ability is probably not normally distributed for most groups of examinees, unidimensionality is a property of the items and does not cease to exist because the examinee group is changed in distribution. Because the items on a test are assumed to measure only one common trait, for all examinees with the same ability the item responses are independent of one another. This is the local independence assumption. The probability of success on any given item depends on the item parameters, examinee ability, and nothing else. In determining the probability of a correct response to a specific item, success or failure on other items will add no new information if ability is known (Lord, 1980). Good estimation of the item and ability parameters is of paramount importance in describing the data accurately. Many investigators have explored the effect of the number of items and the number of examinees on parameter estimation for IRT models. The results of these studies varied according to the estimation procedure used. 
Available estimation methods include (a) joint maximum likelihood estimation (JML), (b) conditional maximum likelihood estimation (CML), (c) marginal maximum likelihood estimation (MML), and (d) Bayesian estimation (BE). Full explanations of the various procedures may be found in Hambleton and Swaminathan (1985). Much of the research on parameter estimation employed the JML procedure as implemented by the computer program LOGIST (Wood, Wingersky, & Lord, 1976). These reports will not be reviewed here, but the interested reader is referred to Harrison (1986), Hulin et al. (1982), Lord (1968), Ree (1979), Swaminathan and Gifford (1983, 1985), and Wingersky and Lord (1984). In general, a sample size of at least 1,000 and test length of 50 or more items is required for acceptable estimation with the JML procedure of LOGIST. One major problem uncovered by these studies is that consistent estimates of the item parameters cannot be obtained in the presence of examinee (θ) parameters because the latter increase in number with sample size (Baker, 1990). This problem can be overcome by using the MML procedure implemented in the BILOG computer program (Mislevy & Bock, 1987). The examinees' θ parameters are removed from item parameter estimation by integrating them over an assumed unit normal prior distribution. At this point in the procedure, it is not the θ of each examinee that has been estimated, but the form of the θ distribution. The item parameters are first estimated, followed by the θ parameters at a later stage (Baker, 1990). In addition to MML, the BILOG program allows Bayesian maximum a posteriori estimation (MAP) and Bayesian expected a posteriori estimation (EAP) of θ parameters. Mislevy and Stocking (1989) have recommended the EAP procedure with a unit normal prior for the θ distribution. Specifying this prior for abilities limits extreme values of the θ estimates and the resulting variances will tend to be smaller than with MML.
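To illustrate the EAP idea, the following sketch computes a posterior mean of θ by numerical quadrature over a unit normal prior, given known 3PL item parameters. This is a simplified illustration, not BILOG's algorithm; the function name, quadrature grid, and item parameters are hypothetical:

```python
import math

def eap_theta(responses, items, n_quad=41):
    """EAP ability estimate: posterior mean of theta under a unit normal
    prior, with known (a, b, c) item parameters and 0/1 item responses.
    Uses a simple equally spaced quadrature grid over [-4, 4]."""
    D = 1.7
    def p(theta, a, b, c):
        return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

    grid = [-4 + 8 * k / (n_quad - 1) for k in range(n_quad)]
    num = den = 0.0
    for theta in grid:
        prior = math.exp(-theta * theta / 2)      # unit normal kernel
        like = 1.0
        for u, (a, b, c) in zip(responses, items):
            pi = p(theta, a, b, c)
            like *= pi if u == 1 else 1.0 - pi    # likelihood of the pattern
        w = prior * like
        num += theta * w
        den += w
    return num / den                              # posterior mean of theta

# Hypothetical items (a = 1, b = 0, c = 0); 4 of 5 correct gives a positive estimate
items = [(1.0, 0.0, 0.0)] * 5
theta_hat = eap_theta([1, 1, 0, 1, 1], items)
```

Because the prior pulls the posterior mean toward zero, extreme response patterns produce finite estimates, consistent with the shrinkage described above.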
When the value of the variance is smaller, the prior distribution becomes more concentrated and pulls the estimated parameters toward the mean of the distribution. Yen (1987) compared LOGIST and BILOG for accuracy of item parameter estimation. Test lengths of 10, 20, and 40 items were simulated with a sample of 1,000 examinees. The ability distributions examined were normal, positively skewed, negatively skewed, and symmetric. Item difficulty was also manipulated. The BILOG estimates were more accurate than those of LOGIST in almost every situation. The advantage of BILOG was even more pronounced for the small item set. Although ability distribution had no substantial effect on the estimation of the ICCs, discrimination and pseudo-chance parameters were somewhat inaccurate with BILOG in the case of the negatively skewed distribution. In addition to investigating the effect test length had on item and ability parameter estimates derived from LOGIST and BILOG procedures, Qualls and Ansley (1985) studied the sample size effect. Sample sizes of 200, 500, and 1,000 examinees with a normal ability distribution were combined with test lengths of 10, 20, and 30 items. As sample size increased, both procedures produced estimates more highly correlated with the simulated values. The BILOG estimates were slightly better in all cases and superior in the combination of small sample size with 10 items. Buhr and Algina (1986) used BILOG with four methods of estimation and sample sizes of 250, 500, 750, and 1,000 to study the similarity of estimation. The Bayesian procedures were the most robust in dealing with different ability distributions. Estimation with all procedures improved substantially as sample size increased to 500, but showed little additional effect as sample size increased further. Baker (1990) simulated item response data based on a 45-item test with 500 examinees to study the pattern of estimation results as a function of the various analysis operations.
The data were analyzed under the options available in BILOG and the obtained parameter estimates were equated back to the true metric. The equated results were generally very close to the true parameters. The item parameters were only slightly affected by the characteristics of various priors. The equated means of the estimated θs were somewhat higher than the true values, both when priors were and were not imposed on the item discrimination.

IRT equating

Nothing in IRT contradicts the basic conclusions of classical test theory. Additional assumptions are made that allow answers not available under classical test theory (Lord, 1980). The theoretical advantage of IRT models is that once a set of items has been fitted to an IRT model, it is possible to estimate the ability of examinees who have taken a different set of items. To accomplish this, the items must be measuring the same latent trait and must be on the same scale (Petersen et al., 1989). When this is true and the item parameters are known, it will make no difference to the examinee what subset of items is administered. Therefore, in the context of IRT, equating is not necessary (Hambleton & Swaminathan, 1985). However, when both item and ability parameters are unknown, it is necessary to choose an arbitrary metric for either the ability parameter θ or the item difficulty bi. Because all the models for Pi(θ) are functions of the quantity ai(θ - bi), the same constant may be added to every θ and bi without changing the item response function Pi(θ). Additionally, every θ and bi may be multiplied by a constant and every ai divided by the same constant without changing the quantities ai(θ - bi) and Pi(θ). Therefore, the origin and unit of measurement of the ability scale are arbitrary and any scale for θ may be chosen as long as the same scale is chosen for bi (Petersen et al., 1989). This is referred to as indeterminacy of the parameter scale.
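This indeterminacy is easy to verify numerically: rescaling θ and bi by any linear transformation, while dividing ai by the same slope, leaves Pi(θ) unchanged. A small Python check (all parameter values below are arbitrary):

```python
import math

def p_3pl(theta, a, b, c, D=1.7):
    """3PL item response function."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

# Arbitrary linear rescaling of the ability metric: theta* = alpha*theta + beta.
# Difficulties transform the same way; discriminations are divided by alpha.
alpha, beta = 1.5, -0.4
theta, a, b, c = 0.7, 1.2, -0.3, 0.18     # arbitrary ability and item values

p_original = p_3pl(theta, a, b, c)
p_rescaled = p_3pl(alpha * theta + beta, a / alpha, alpha * b + beta, c)
# The two probabilities agree because a*(theta - b) is unchanged:
# (a/alpha) * ((alpha*theta + beta) - (alpha*b + beta)) = a * (theta - b)
```

The same check works for the 1PL and 2PL models, since both are special cases of the function above.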
If the parameters of a set of items are estimated separately for two different groups of examinees, the item parameters may appear to be different due to the arbitrary fixing of the metric for θ or bi. However, the two sets of θs and bis should have a linear relationship to each other (Hambleton & Swaminathan, 1985). The ais should be the same except for differences in unit of measurement and, in the 3PL case, the cis remain unaffected (Petersen et al., 1989). The advantages of IRT equating are most useful in the case where groups taking the two tests are nonrandom or intact groups (Crocker & Algina, 1986). Consequently, the following discussion will emphasize uses of IRT equating with an anchor test design. However, item response theory procedures may also be used with single-group or equivalent-groups designs. An anchor or linking test is one method available to put the parameters for the two tests on the same scale. Four procedures commonly used with this method are (a) concurrent calibration, (b) the fixed bs method, (c) the equated bs method, and (d) the characteristic curve transformation method. In concurrent calibration, parameters for the two tests are estimated simultaneously. The linking items, or sometimes common subjects, serve to unite the two tests and result in item parameter estimates on a common scale. This allows direct equating of the two tests (Petersen et al., 1989). The parameters of each total test-anchor test combination are estimated sequentially in the fixed bs method. After the item parameters have been estimated for one test, the item difficulties of the linking items obtained from the first calibration are used as input for the estimation of parameters on the second test. The linking item parameters are not reestimated. The end result is item parameters for both tests being placed on the same scale (Petersen, Cook, & Stocking, 1983). In the equated bs method, the parameters for each test are estimated separately.
Then the means and standard deviations of the difficulties for the two sets of linking items are set to be equal. Ability estimates could also be used for this purpose. This linear transformation is then applied to the ai, bi, and θ parameters of the second test (Petersen et al., 1989). Several variations of the transformation, including the mean and sigma method and the robust mean and sigma method, are described in Hambleton and Swaminathan (1985). Also, Stocking and Lord (1983) described a modification which gives lower weights to poorly estimated parameters and outliers. It is most common in both the fixed bs and equated bs methods to use only the relationship for item difficulties to obtain the equating function (Hambleton & Swaminathan, 1985). The characteristic curve method can prevent the possible loss of information caused by ignoring the discrimination relationship. For the characteristic curve method, the parameters of each test are calibrated separately. All parameters are then placed on the same scale by using the two sets of parameter estimates from the common items. A linear transformation is obtained from minimizing the difference between the true scores on the linking items. This transformation is then applied to the ai, bi, and θ parameters of the second test (Stocking & Lord, 1983). Because it takes all information into account, this procedure is theoretically an improvement over the previous methods. Sometimes the reporting of abilities in terms of θ is unacceptable. In these situations, the θ value from a test may be converted to its corresponding true score τ through

τ = Σ(i=1 to n) Pi(θ)    (7)

where n is the number of items on the test. Equating of the true scores on the two tests is then possible (Hambleton & Swaminathan, 1985). The true score τ on one test is said to be equated to the true score η on a second test if each corresponds to the same ability level, or if

τ = Σ(i=1 to n) Pi(θ) and η = Σ(j=1 to m) Pj(θ)    (8)

where m is the number of items on the second test (Skaggs & Lissitz, 1986a).
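The true-score equating of Equations 7 and 8 can be sketched by sweeping a grid of θ values and recording the paired true scores on the two forms. In the Python sketch below, the item parameters are hypothetical and both forms are assumed to already be on a common ability scale:

```python
import math

def p_3pl(theta, a, b, c, D=1.7):
    """3PL item response function (Equation 5)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def true_score(theta, items):
    """Equations 7 and 8: the true score is the sum of Pi(theta) over a
    form's items, each given as an (a, b, c) tuple."""
    return sum(p_3pl(theta, a, b, c) for (a, b, c) in items)

# Hypothetical (a, b, c) parameters for two short forms on a common scale
form_x = [(1.0, -0.5, 0.2), (1.2, 0.0, 0.2), (0.8, 0.6, 0.2)]
form_y = [(0.9, -0.2, 0.2), (1.1, 0.3, 0.2), (1.0, 0.8, 0.2)]

# Sweep a grid of theta values; each theta yields one equated pair (tau, eta)
pairs = [(true_score(t / 10.0, form_x), true_score(t / 10.0, form_y))
         for t in range(-30, 31)]
```

Because each true score is a monotonic function of θ, the resulting table of pairs defines τ as a function of η; note that both true scores are bounded below by the sum of the ci values, which is why no equating is provided below the chance level.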
In practice, estimated item parameters are used to approximate Pi(θ) and Pj(θ). Paired values of τ and η are then computed by substituting a series of arbitrary values for θ into Equation 8 and calculating τ and η for each θ. These paired values define τ as a function of η and constitute an equating of these true scores (Lord, 1980). The relationship between raw scores and true scores on two tests is not necessarily the same, nor is an equating provided for individuals scoring below the chance level (Petersen et al., 1989). Observed-score equating provides a method of predicting the raw-score distribution of a test. This procedure uses probabilities of correct responses under an IRT model to generate a hypothetical joint distribution of item responses from all examinees taking both tests. Conventional equipercentile equating is then applied to the new distributions (Skaggs & Lissitz, 1986a). Neither true-score nor observed-score equating is applied often in practice. Both are complicated to calculate and expensive to implement. Many researchers have investigated the accuracy of IRT equating methods using the various IRT models and procedures. Comparison of IRT equating with conventional methods is also common. Marco, Petersen, and Stewart (1983) examined the Rasch and 3PL models along with the 40 linear and two equipercentile equating methods previously discussed. A variety of conditions, including random and dissimilar samples, internal and external anchors, and difficulty levels of the anchor tests were also studied. The two IRT methods worked well, both with an external anchor test equal in difficulty to the total test and with an internal anchor. With the external anchor test, the Rasch results were slightly better than with any of the other equating methods investigated.
Both IRT models were clearly superior to the conventional equating methods when the samples differed in ability distributions, but neither the Rasch nor the 3PL model showed superiority to the other under the conditions studied. Kolen (1981) explored true-score and observed-score equating methods as well as a linear and an equipercentile equating method. The Rasch, 2PL, and 3PL models were used for the IRT equatings. The two forms of the Iowa Test of Educational Development to be equated had no common items. Each test had been administered to a random sample. The true-score method for the 3PL model produced the best results. When only quantitative items were equated, the Rasch true-score combination also worked well. Kolen and Whitney (1982) used the General Educational Development Tests (GED) with the Rasch, 2PL, and 3PL IRT models and an equipercentile equating method. They found with small samples (N < 198) a number of extreme item parameter estimates were produced by the 3PL model which seriously affected the equating. In the Petersen, Cook, and Stocking (1983) study discussed earlier in the context of conventional equating, a 3PL model was also examined using concurrent calibration, the fixed bs method, and the characteristic curve transformation. For the SATV, all IRT models and methods outperformed linear and equipercentile equatings. Both conventional and IRT methods yielded acceptable results for the mathematics test. Concurrent calibration with the 3PL model produced the least amount of error. Harris and Kolen (1985) compared conventional equating methods with IRT 3PL model equating. The sample consisted of high- and low-ability examinees. The 3PL model was found to be slightly superior. The Cook, Eignor, and Taft (1988) study using biology achievement tests administered at different points in time included a 3PL model with the characteristic curve transformation in addition to the equipercentile equating method.
The authors concluded that the IRT results, although slightly superior for the fall-to-spring sample equating, basically paralleled the results obtained with the conventional method. A minimum-competency test, Florida's Statewide Student Assessment Test, Part II (SSAT-II), was equated by Hills, Subhiyah, and Hirsch (1988). Their purpose was to study the effect of anchor length on equating and to compare different equating methods using a sample with a negatively skewed distribution. The equating methods investigated were linear, Rasch, and 3PL. The IRT models were equated with concurrent calibration, the fixed bs method, and the equated bs method using a robust mean and sigma. The authors concluded that the 3PL model with concurrent calibration and the Rasch models gave similarly good results. Also, when using the 3PL model with concurrent calibration, an anchor test length of 10 items was found to be sufficient for good equating outcomes. Results of these studies indicate that the 3PL model tends to perform better than conventional and Rasch equating in a variety of situations. Equating with IRT appears to produce better results than conventional equating methods, especially when the ability distributions of the two groups are dissimilar. Concurrent calibration and the characteristic curve transformation were the preferred methods of scaling, although fewer linking items are required with concurrent calibration. Table 2 contains a summary of the equating studies reviewed here.

Multidimensionality

Violation of the Unidimensionality Assumption

The mathematical models upon which IRT is based are grounded on very strong assumptions, particularly that item responses are unidimensional (Ansley & Forsyth, 1985). The unidimensionality assumption requires that each of the tests to be equated onto a common scale measure the same underlying trait or ability. Any factor other than the one assumed latent trait that influences an examinee's score will violate the unidimensionality assumption.
Although IRT explicitly acknowledges this assumption, other commonly used procedures that transform scores, such as equipercentile equating, are also unidimensional even if this is not stated explicitly (Hirsch, 1989). This can be seen by reviewing the required conditions for equating. Many factors may cause multidimensionality, such as guessing, speededness, fatigue, cheating, random answering, instructional sensitivity, or item context and content. Two or more cognitive traits may influence an examinee's response to an item. For example, reading skill may be required to correctly answer a mathematical item. Some of these violations can be controlled, reduced, or eliminated, but the unidimensionality assumption will still be violated in many practical situations (Doody-Bogan & Yen, 1983). Achievement tests are not constructed using methods that yield factor-pure instruments. Instead, a table of specifications is customarily developed and items are written to match the specifications.

Table 2
Summary of Unidimensional IRT Test Equating Studies

Study | Tests | Equating Models | Independent Variables
Cook & Eignor (1983) | CB achievement | 3PL, equipercentile, linear | equating models; scaling methods; dissimilar samples
Cook, Eignor, & Taft (1988) | Biology achievement | 3PL, equipercentile | equating models
Harris & Kolen (1985) | ACT-Math | 3PL, equipercentile, linear | equating models; dissimilar samples
Hills, Subhiyah, & Hirsch (1988) | SSAT-II | Rasch, 3PL, linear | equating models; negatively skewed distribution; anchor length; scaling models
Kolen (1981) | ITED: Math & Vocabulary | Rasch, 2PL, 3PL, equipercentile, linear | equating models; item context
Kolen & Whitney (1982) | GED | Rasch, 2PL, 3PL, equipercentile | equating models
Marco, Petersen, & Stewart (1983) | SAT-V | Rasch, 3PL, linear, equipercentile | equating models; ability distribution; internal & external anchor; difficulty of anchor
Petersen, Cook, & Stocking (1983) | SAT-V, SAT-Q | 3PL, linear, equipercentile | equating models; scaling models (3PL)
These items rarely measure a single trait (Reckase, 1979). Given the many possible causes of violation of the unidimensionality assumption, it can be concluded that dimensionality is a joint property of both the item set and the particular sample of examinees (Hattie, 1985).

Multidimensional Models

Recently, attempts have been made to model multidimensional responses within the framework of IRT. Several multidimensional item response theory (MIRT) models have been proposed. Although multidimensional versions of all three logistic IRT models have been derived, only the multidimensional two-parameter logistic (M2PL) model will be discussed. Doody-Bogan and Yen (1983) described a multidimensional model of the form

P_j(θ_i) = 1 / {1 + exp[−D Σ_{h=1}^{m} a_jh(θ_ih − b_jh)]},   (9)

where m is the number of dimensions; θ_ih is the ability parameter for person i on dimension h; a_jh is the discrimination parameter for item j on dimension h; b_jh is the difficulty parameter for item j on dimension h; and D is the scaling constant, 1.7. Another model, discussed by Sympson (1978), is defined as

P_j(θ_i) = Π_{h=1}^{m} 1 / {1 + exp[−D a_jh(θ_ih − b_jh)]},   (10)

where all parameters are defined as above. These two models can be distinguished by comparing their denominators: the Doody-Bogan and Yen model is a single logistic function of a weighted sum across dimensions, whereas the Sympson model is a product of probabilities, one per dimension. Equation 9 can be classified as a compensatory model that permits high ability on one dimension to compensate for low ability on another dimension in terms of the probability of a correct response. If dimensionality is considered in the context of factor analysis, a two-dimensional test has a group of items measuring each dimension. A compensatory model seems reasonable because the test is being considered as a whole (Ansley & Forsyth, 1985). The second model, defined by Equation 10, is called a noncompensatory model because high ability on one factor is not allowed to supplement low ability on the second factor.
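The practical difference between the two model types can be illustrated with a short sketch (Python is used here purely for illustration; the item parameters and examinee abilities are hypothetical):

```python
import math

def p_compensatory(theta, a, b, D=1.7):
    """Doody-Bogan & Yen compensatory model (Equation 9): a single
    logistic of the sum across dimensions, so strength on one
    dimension can offset weakness on another."""
    z = sum(a[h] * (theta[h] - b[h]) for h in range(len(a)))
    return 1.0 / (1.0 + math.exp(-D * z))

def p_noncompensatory(theta, a, b, D=1.7):
    """Sympson noncompensatory model (Equation 10): a product of
    per-dimension logistics, so a deficit on any one dimension
    caps the overall probability of a correct response."""
    p = 1.0
    for h in range(len(a)):
        p *= 1.0 / (1.0 + math.exp(-D * a[h] * (theta[h] - b[h])))
    return p

# Hypothetical two-dimensional item and an examinee who is strong
# on the first trait but weak on the second
a, b = (1.2, 0.8), (0.0, 0.0)
theta = (2.0, -2.0)
p_c = p_compensatory(theta, a, b)      # moderate: the traits offset
p_nc = p_noncompensatory(theta, a, b)  # low: the weak trait dominates
```

For an examinee who is strong on one dimension and weak on the other, the compensatory probability stays moderate while the noncompensatory probability collapses toward zero, which is exactly the distinction drawn above.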
When a two-dimensional test is considered as one that requires simultaneous application of the two abilities to answer each item correctly, the noncompensatory model seems more appropriate (Ansley & Forsyth, 1985). Reckase (1985) alternately defined the compensatory M2PL to provide a simple framework for specifying and generating multidimensional item response data. This model defines the probability of a correct response as

P(x_ij = 1 | a_j, d_j, θ_i) = exp(a_j′θ_i + d_j) / [1 + exp(a_j′θ_i + d_j)],   (11)

where a_j is a vector of discrimination parameters; d_j is a scalar related to item difficulty; and θ_i is a vector of ability parameters. The exponent can also be written as

Σ_{h=1}^{m} a_jh(θ_ih − b_jh),   (12)

where m is the number of dimensions; a_jh is an element of a_j; θ_ih is an element of θ_i; and d_j = −Σ_{h=1}^{m} a_jh b_jh. When this form is used, the relationship to the more familiar expression in Equation 9 can be seen. The data described by a multidimensional IRT model can be depicted graphically by an item response surface (IRS). Figure 2 presents an IRS for an M2PL item. The IRS increases monotonically as the elements of θ_i increase (Reckase, 1985). To identify the multidimensional item difficulty (MID) for an item, the point on the IRS where the item is most discriminating must be found. This point, which provides the maximum information about an examinee, will have the greatest slope. Because the slope along the IRS can differ according to the direction taken, Reckase (1985) determined the slope using the direction from the origin of the θ space to the point of highest discrimination.

Figure 2. An item response surface (IRS) based on the compensatory M2PL.

To accomplish this analysis, the model given in Equation 11 is translated to polar coordinates, replacing each θ_ih by θ_i cos α_ih, where θ_i is the distance from the origin to the point θ_i and α_ih is the angle from the hth axis to the point of maximum information (Reckase, 1985).
In a two-dimensional item, the value of α_jh can range between 0° and 90°, depending on the degree to which the item measures the two traits. If the item measures only the first trait, α_j1 equals 0°, while α_j1 = 90° would depict an item measuring only the second trait. The relationship between α_jh and the discrimination element a_jh can then be stated as

cos α_jh = a_jh / √(Σ_{h=1}^{m} a_jh²).   (13)

The MID parameter can now be expressed as

MID_j = −d_j / √(Σ_{h=1}^{m} a_jh²).   (14)

Finally, an item that requires two abilities for a correct response can be represented as a vector in the two-dimensional latent ability space. The length of the vector for an item is equal to the degree of multidimensional discrimination (MDISC) (Ackerman, 1991). Reckase (1985) expressed MDISC as

MDISC_j = √(Σ_{h=1}^{m} a_jh²).   (15)

These equations provide an excellent framework for manipulating conditions during the generation of multidimensional data. Many indices have been developed to assess the dimensionality of a test and of test items. Hattie (1985) examined over 30 of these indices, grouping them into methods based on (a) answer patterns, (b) reliability, (c) principal components, (d) factor analysis, and (e) latent traits. Hattie concluded that none of the indices was satisfactory and that only four could even distinguish unidimensional from multidimensional data sets. A major problem encountered by Hattie in assessing the indices was that unidimensionality was often confused with reliability, internal consistency, and homogeneity. More recently, other procedures have been developed to assess the dimensionality of latent traits. Roznowski, Tucker, and Humphreys (1991) explored several of these indices. Procedures based on the shape of the curve of successive eigenvalues were found to be unsatisfactory under most conditions. A pattern index of second-factor loadings was accurate except under high obliqueness. The most accurate index in this study was based on local independence; its use is particularly recommended with large samples and many items.
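The directional quantities defined in Equations 13 through 15 are straightforward to compute for any discrimination vector; a minimal sketch with a hypothetical two-dimensional item:

```python
import math

def mdisc(a):
    """Multidimensional discrimination (Equation 15): the length of
    the item's discrimination vector."""
    return math.sqrt(sum(ah * ah for ah in a))

def mid(a, d):
    """Multidimensional item difficulty (Equation 14): signed distance
    from the origin to the point of steepest slope."""
    return -d / mdisc(a)

def direction_angles_deg(a):
    """Angles from each ability axis to the item's direction,
    in degrees (Equation 13)."""
    length = mdisc(a)
    return [math.degrees(math.acos(ah / length)) for ah in a]

# Hypothetical M2PL item that measures both traits equally
a_vec, d_param = (1.0, 1.0), -0.7
# direction_angles_deg(a_vec) is approximately [45.0, 45.0]
```

An item with equal discriminations lies at 45° between the two axes, mirroring the interpretation of the angles given above.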
Linear factor analysis has been widely used to assess the dimensionality of dichotomous items. However, use of phi correlations often leads to overestimation of the number of factors underlying the responses by confounding factor coefficients with item difficulties (Bock, Gibbons, & Muraki, 1988; Hambleton & Swaminathan, 1985). Tetrachoric correlations may be substituted, but these may still be confounded with item difficulty or guessing in real data (Camilli, 1992). Bock, Gibbons, and Muraki (1988) developed a maximum likelihood full-information factor analysis procedure as an attempt to deal with these problems. Another approach to dimensionality, taken by Stout (1990), replaced the strong assumptions of unidimensionality and local independence with the less restrictive assumptions of essential unidimensionality and essential independence. Stout contended that a dominant dimension results when an attribute overlaps many items; other dimensions common to only a few items are unavoidable in reality but are also not significant. These minor dimensions are rarely discussed in the IRT literature, but they are a frequent theme in classical factor analysis. While the IRT definition of dimensionality would take all factors, major and minor, into account, essential dimensionality is a mathematical conceptualization of the number of dominant dimensions, with minor dimensions ignored. An essentially unidimensional test is therefore any set of items selected from an infinite item pool that measures exactly one major dimension. When essential unidimensionality is assumed, latent ability is unique in an ordinal scaling sense, and this unique latent ability is estimated consistently. Stout presented theorems and proofs to show that dimensions distributed nondensely over items, or dimensions that have a minor influence on possibly many items, do not necessarily negate essential unidimensionality. He went on to present guidelines for the development of essentially unidimensional tests.
Among the recommendations are limiting the number of abilities per item; keeping the number of items dependent on the same ability, other than the intended-to-be-measured θ, small; and controlling the number of item pairs assigned to the same ability other than θ. These conditions are usually met by the carefully designed tests typically found in practice. Nandakumar (1991) used simulations to investigate Stout's statistical test of essential unidimensionality. When one dominant trait and one or more minor dimensions having little influence on item scores were present, Stout's test performed well in indicating essential unidimensionality. The test is more likely to reject the hypothesis of essential unidimensionality as the effect of the minor dimensions increases. To facilitate application of the test of essential unidimensionality, Stout developed the computer program DIMTEST. An investigation of the program revealed problems when a test consisted of difficult, highly discriminating items where guessing was also present (Nandakumar & Stout, 1993). Refinements were subsequently made to the program to make it more robust and beneficial to the measurement practitioner. Nandakumar (1994) studied three commonly used methodologies for assessing dimensionality in a set of item responses. The three procedures, DIMTEST, Holland and Rosenbaum's approach, and nonlinear factor analysis, were unreliable in detecting lack of unidimensionality in real data sets. Although the more recent procedures based on local independence, full-information factor analysis, and essential unidimensionality offer promise for assessing the dimensionality of dichotomous data, especially with large datasets, a satisfactory method has not yet been agreed upon by measurement researchers.
Because of the current lack of an acceptable index to detect multidimensionality, it becomes even more urgent to understand exactly what effect violation of the unidimensionality assumption may have on IRT applications. When a test measures several dimensions, examinees' scores will be influenced by all of these factors. As a result, systematic and unsystematic errors of equating might be expected from scaling and equating procedures that are applied to multidimensional tests (Yen, 1984). The estimation of ability and item parameters is likely to be affected as well.

Multidimensionality and Parameter Estimation

Violation of the unidimensionality assumption has been suggested as a problem in the estimation of item and ability parameters, the first step in IRT equating procedures. Thus, it is important to determine how robust estimation procedures are to this violation. Ansley and Forsyth (1985) used a noncompensatory M3PL model to simulate two-dimensional datasets. The two discrimination parameters were set to have respective means of 1.23 and .49 and respective standard deviations of .34 and .11. The b values were scaled to reflect fairly easy items (μ_b1 = .33, σ_b1 = .82, μ_b2 = 1.03, σ_b2 = .82). The c parameter was set to .2. A bivariate normal distribution was selected to generate the θ vectors, with both dimensions scaled to have mean 0 and standard deviation 1.0. The correlation ρ(θ1, θ2) was varied, with values of 0.0, .3, .6, .9, and .95 simulated. Four combinations of sample size (1,000 and 2,000) and test length (30 and 60 items) were examined. Corresponding unidimensional datasets were also simulated. Correlations of the estimated and simulated parameters showed that the a estimates appeared to be averages of the true a1 and a2 values. The b estimates overestimated the true b1 values, and the θ estimates were highly related to the averages of the true θ values.
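Correlated ability vectors of the kind used in this design can be drawn with the standard bivariate normal construction; a minimal stdlib sketch (the original study's generator is not documented here, so this is only one plausible implementation):

```python
import math
import random

def bivariate_normal_thetas(rho, n, seed=1):
    """Draw n (theta1, theta2) pairs with means 0, SDs 1, and
    correlation rho, via theta2 = rho*z1 + sqrt(1 - rho^2)*z2."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        pairs.append((z1, rho * z1 + math.sqrt(1.0 - rho * rho) * z2))
    return pairs

# One condition from the design: rho = .9, N = 1,000 simulated examinees
thetas = bivariate_normal_thetas(0.9, 1000)
```

With 1,000 draws the sample correlation lands very close to the target .9, so the construction reproduces the intended θ structure.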
The authors concluded that item parameter estimation was affected by violation of the unidimensionality assumption, but as the θ vectors became more highly correlated, the estimates derived from the two-dimensional datasets approached the results obtained from the unidimensional data. Sample size and test length had little effect on any of the relationships. Reckase (1979) studied five forms of the Missouri State Testing Program and five datasets simulated to match various factor structures to determine what characteristics are estimated by the unidimensional Rasch and 3PL models when the data are multidimensional. Reckase concluded that for tests with several equally strong dimensions, the Rasch estimates should be considered a sum or average of the abilities required for each dimension. For data with a dominant first factor, the Rasch and 3PL difficulty estimates were highly correlated with the scores for that factor. With the 3PL model and more than two potent factors, the b estimates correlated with just one of the common factors. The author concluded that good ability estimates can be obtained from unidimensional estimation procedures when the first factor accounts for at least 20 percent of the test variance, as is likely in practice. Yen (1984) used data simulated with a compensatory M3PL model and data from the Comprehensive Test of Basic Skills, Form U (CTBS/U) to study unidimensional parameter estimation with multidimensional data. A variety of a parameter configurations were used, and ρ(θ1, θ2) was set at .5 or .6. When multidimensionality was present, the a and b parameter estimates were larger than those of unidimensional sets of items. The unidimensional estimates of both the a and θ parameters appeared to be combinations of the respective two-dimensional parameters. Data simulated from a hierarchical factor model were used in a study by Drasgow and Parsons (1983). Item responses were generated from five oblique common factors.
Loadings were varied, producing diversity in the correlations between the common factors. Each simulated dataset consisted of a 50-item test and 1,000 simulees. The general latent trait was recovered well when the correlations between the common factors were .46 or higher. Harrison (1986) also used a hierarchical factor model to simulate data. The strength of the second-order general factor, the number of first-order common factors, the distribution of items loading on the common factors, and the number of test items were manipulated. The effect of test length was significant: as the number of items increased, the general trait was recovered more effectively regardless of the latent structure, the distribution of items across common factors, or the number of common factors. Estimation of the b parameters was found to be robust to violations of unidimensionality. The estimation of both the a and b parameters improved as test length and the strength of the general factor increased. In general, Harrison found unidimensional parameter estimation procedures to be robust in the presence of multidimensional data. The studies reviewed indicate that the IRT parameters implied by the general factor are recovered well when the common factors have sufficiently high correlations. Reckase, Ackerman, and Carlson (1988) used both simulated and empirical data to demonstrate that items can be selected to construct a test that meets the unidimensionality assumption even though more than one ability is required for a correct response. The authors showed that the unidimensionality assumption requires only that the items in a test measure the same composite of abilities. This condition seems to have been met in the previous investigations. Based on this study, it appears that the unidimensionality assumption is not as restrictive as formerly thought.
Although these studies explored the effect of multidimensionality on unidimensional parameter estimation, it is also important to understand what effect the choice between compensatory and noncompensatory multidimensional models may have on estimation. Ackerman (1989) simulated two-dimensional data using both compensatory and noncompensatory M2PL models. Forty two-dimensional items were generated using the compensatory model. Difficulty was confounded with dimensionality, and ρ(θ1, θ2) was set at 0.0, .3, .6, and .9. For each compensatory item, a corresponding noncompensatory item was created using a least-squares approach to minimize the quantity

Σ_{i=1}^{100} [P_C(θ_i | a, b) − P_NC(θ_i | a, b)]²,   (16)

where P_C is a given compensatory item's probability of a correct response and P_NC is the noncompensatory item's probability of a correct response, which varies as a function of a and b given θ. The unidimensional 2PL model was used to estimate parameters with both BILOG and LOGIST. Ackerman discovered minimal differences in the IRSs of the two models when the parameters were matched. The confounding of difficulty with dimensionality was detected only by BILOG. For both models, as ρ(θ1, θ2) increased, the response data became more unidimensional and the estimation of all parameters improved. Way, Ansley, and Forsyth (1988) also compared compensatory and noncompensatory models with simulated data. The values assigned to ρ(θ1, θ2) ranged from 0.0 to .95. Results showed that the number-right distributions for the two models were comparable. For the noncompensatory model, the unidimensional a estimates appeared to be averages of the a1 and a2 values, while the compensatory model provided a estimates best considered as sums of a1 and a2. The b estimates for the noncompensatory data were greater than the true b values, while the compensatory model seemed to average the b1 and b2 values. For both models, the θ estimates were related to the average of the two θ parameters.
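A crude illustration of the least-squares matching in Equation 16 is a grid search over a single noncompensatory difficulty value, holding the other parameters fixed (Ackerman's actual procedure fit all noncompensatory parameters with a general least-squares routine; the parameter values below are hypothetical):

```python
import math

def p_comp(t1, t2, a, b, D=1.7):
    """Compensatory M2PL probability (Equation 9, m = 2)."""
    return 1.0 / (1.0 + math.exp(-D * (a[0] * (t1 - b[0]) + a[1] * (t2 - b[1]))))

def p_noncomp(t1, t2, a, b, D=1.7):
    """Noncompensatory probability (Equation 10, m = 2)."""
    return (1.0 / (1.0 + math.exp(-D * a[0] * (t1 - b[0])))
            * 1.0 / (1.0 + math.exp(-D * a[1] * (t2 - b[1]))))

# Grid of theta points over which the two response surfaces are compared
grid = [(x / 2.0, y / 2.0) for x in range(-6, 7) for y in range(-6, 7)]

def loss(b_nc, a_c=(1.0, 1.0), b_c=(0.0, 0.0)):
    """Sum of squared surface differences (cf. Equation 16), searching
    only a common noncompensatory difficulty for simplicity."""
    return sum((p_comp(t1, t2, a_c, b_c)
                - p_noncomp(t1, t2, a_c, (b_nc, b_nc))) ** 2
               for t1, t2 in grid)

# The best-matching noncompensatory item turns out to be easier
# (lower b) than the compensatory item it mimics
best_b = min((k / 4.0 for k in range(-8, 9)), key=loss)
```

The search lowers the noncompensatory difficulties because, at matched parameters, the product form yields uniformly smaller probabilities than the single logistic.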
A summary of the studies investigating the effect of multidimensional data on unidimensional IRT parameter estimation is presented in Table 3. Generally, parameters appear to be recovered adequately under the data conditions usually found in practice. Both compensatory and noncompensatory models are apparently viable as MIRT models. Determining the adequacy of unidimensional parameter estimation of multidimensional data has important consequences for equating multidimensional tests.

Table 3
Summary of Studies of Unidimensional IRT Estimation with Multidimensional Data

Study | Tests | Simulating Model | Model for Estimation | Number of Dimensions | Independent Variables
Ackerman (1989) | Simulation | M2PL, Comp.; least-squares conversion | 2PL | 2 | ρ(θ1, θ2); difficulty confounded with dimensionality; Comp. vs Noncomp. models; BILOG vs LOGIST
Ansley & Forsyth (1985) | Simulation | M3PL, Noncomp. | 3PL | 2 | ρ(θ1, θ2); sample size; test length
Drasgow & Parsons (1983) | Simulation | Hierarchical factor model | 2PL | 5 | ρ(θ1, ..., θ5); general factor strength
Harrison (1986) | Simulation | Hierarchical factor model | 2PL | varied | general factor strength; number of common factors; test length
Reckase (1979) | Simulation, Missouri | Linear factor analysis | Rasch, 3PL | varied | number of dimensions; estimation methods
Reckase, Ackerman, & Carlson (1988) | Simulation, ACT | M2PL, Comp. | 2PL | 2 | violation of unidimensionality
Yen (1984) | Simulation, CTBS | M3PL, Comp. | 3PL | 2 | ρ(θ1, θ2); a parameters
Note. Comp. = compensatory model; Noncomp. = noncompensatory model.

In addition to the estimation procedures discussed, the relationship between multidimensional and unidimensional IRT models can also be approached from an analytical framework. Wang (1986), as reported in Ackerman (1988) and Oshima and Miller (1990), determined explicit algebraic relationships between unidimensional estimates and the true multidimensional parameters for the case in which the underlying response process is modeled by the compensatory M2PL model and the unidimensional 2PL model.
Using the results for unidimensional estimation of a multidimensional data matrix, Wang concluded that the unidimensional item parameter estimates are obtained as a weighted composite of the underlying traits. The weights are a function of the discrimination vectors for the items, the correlations among the latent traits, and the difficulty parameters of the items. For a group g that can be described as having a diagonal variance-covariance structure Σ and a mean ability vector μ, the 2PL item parameters for two-dimensional item j can be approximated by

a_j ≈ a_j′ξ_1,   (17)

b_j ≈ −d_j / (a_j′ξ_1),   (18)

where a_j is the discrimination vector for the M2PL model; d_j is the difficulty parameter for the M2PL model; and ξ_1 and ξ_2 are the first and second standardized eigenvectors of the matrix A′AΣ, where A is the matrix of discrimination parameters for all items in the test and X′X = Σ. Therefore, when the means, standard deviations, and item parameters of a two-dimensional distribution are known, the corresponding 2PL unidimensional item parameters can be approximated.

Multidimensionality and IRT Equating

In practice, test equating almost exclusively assumes unidimensionality: a single score from one test is transformed to a single score from another test. An understanding of the effect the presence of multidimensional data has on these unidimensional equating results is of paramount importance. Dorans and Kingston (1985) equated four forms of the Verbal GRE Aptitude Test using the 3PL model and an equated bs procedure. Two data collection designs, equivalent groups and anchor test, were investigated, as well as several variations in calibration procedures. Dimensionality was assessed through factor analyses conducted at the item level on inter-item tetrachoric correlations. Two highly related verbal dimensions were identified. To examine their results, the researchers first calibrated the whole test, then divided the test items into two homogeneous subgroups.
The subgroups were recalibrated separately and placed on the same scale as the original test. They were then recombined into an entire test, and their corresponding ICCs were compared. The authors discovered that differences in the magnitude of the discrimination parameter estimates had an impact on IRT equating results, affecting the symmetry of the equating. However, the different research combinations yielded very similar equatings, leading the authors to conclude that IRT equating may be sufficiently robust to the dimensionality displayed in their data. Cook and Eignor (1988) used SAT data that were suspected to be multidimensional to examine the robustness of 3PL model concurrent calibration and the characteristic curve transformation procedures. Scale drift was used as the criterion for evaluating equating results. Cook and Eignor concluded that both IRT equating methods produced acceptable results despite the multidimensionality present in the tests being studied. In addition to studying parameter estimation, Yen (1984) equated the LOGIST trait estimates for both real (CTBS/U) and simulated data. Several statistics were used to evaluate the results: (a) the correlation r; (b) the standardized difference between means (SDM); (c) the ratio of standard deviations; and (d) the standardized root mean squared difference (SRMSD). Trait estimates based on items that measured different dimensions had lower correlations and higher SDMs and SRMSDs. That is, when tests measuring different dimensions were equated, large unsystematic errors occurred. Systematic errors were found only when the tests measured several dimensions that differed in difficulty and were likely to be taught sequentially, as in a vertical equating situation. Camilli, Wang, and Fesq (1995) adapted the methodology of Dorans and Kingston (1985) to examine how multidimensionality may affect the equating of the Law School Admission Test (LSAT).
Two dimensions of the LSAT were identified using primary and secondary factor analyses, and the stability of the dimensions was established over six administrations. The test was divided into two homogeneous subtests to study the effect of multidimensionality on IRT true-score test equating. Item calibration was done with BILOG. The authors found very small differences in the equatings except at the ends of the raw score distribution. They concluded that, for the LSAT, IRT true-score equating was robust to the presence of multidimensionality. These empirical studies indicate that violations of the unidimensionality assumption, while having some impact on results, may not be significant. However, different tests were used in this research, and their content may have affected the findings in an unknown manner. Therefore, the generalization of results across studies is difficult (Skaggs & Lissitz, 1986a). Also, because indices designed to detect multidimensionality are generally unsatisfactory, it is necessary to design research studies that permit manipulation of independent variables to understand exactly how violations of the unidimensionality assumption affect equating. Simulation studies provide a technique for manipulating and controlling the desired variables. There has been little simulation research on the effects of multidimensionality on unidimensional IRT equating. One notable exception is a study by Doody-Bogan and Yen (1983). The main purpose of that paper was to examine the stability of several chi-square statistics for their ability to detect multidimensionality in vertical equating, but the findings are significant in the context of unidimensional equating with multidimensional data. Four multidimensional data configurations were simulated with the compensatory M3PL model described in Equation 9. One unidimensional 3PL dataset was also generated.
Three differences in mean ability between the two tests to be equated were simulated, with parameter estimates for all data modeled after the CTBS for realism. Correlations, the standardized difference between means (SDM), and the standardized root mean squared difference (SRMSD) were used to evaluate results. The findings of this study were mixed. When the correlations were examined, the results of the equatings, both horizontal and vertical, were as good for the tests with multidimensional configurations as for the unidimensional tests. On the other hand, when the means were used as the criterion for comparison, the multidimensional tests provided worse equatings than the unidimensional data, especially when the tests differed in difficulty. Another concern raised was that the equatings might deteriorate if the factors loaded differently on the two tests. More recently, attempts have been made to develop a multidimensional equating procedure. Hirsch (1989) conducted a study in which real and simulated data were equated with a multidimensional method. The procedure involves (a) estimating item parameters and abilities on both dimensions for both tests, (b) identifying common basis vectors, (c) aligning basis vectors through Procrustes rotation, and (d) equating means and standard deviations of the ability estimates for each dimension of the two tests. Results of this preliminary research indicated that effective equating was possible with these techniques, but the instability of the ability estimates makes it impractical at this time. While work on the development of MIRT equating continues (Hirsch & Miller, 1991), the procedure has little current value for the equating needs of testing companies. The results of the studies of unidimensional equating with multidimensional data are summarized in Table 4. The emphasis of the present study was to examine the effect of multidimensional data on unidimensional IRT equating through the use of a simulation study.
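Two of the evaluation statistics recurring in these studies, SDM and SRMSD, are simple to compute; the sketch below uses one common formulation (standardizing by the SD of the first set of estimates), which is assumed here rather than taken from the original studies:

```python
import math

def sdm(x, y):
    """Standardized difference between means: the mean difference of
    paired estimates divided by the SD of the first set."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((v - mx) ** 2 for v in x) / n)
    return (mx - my) / sx

def srmsd(x, y):
    """Standardized root mean squared difference of paired estimates,
    again standardized by the SD of the first set."""
    n = len(x)
    mx = sum(x) / n
    sx = math.sqrt(sum((v - mx) ** 2 for v in x) / n)
    rmsd = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / n)
    return rmsd / sx

# Hypothetical paired trait estimates from two equatings of the same group
est_a = [-1.5, -0.5, 0.0, 0.5, 1.5]
est_b = [-1.4, -0.6, 0.1, 0.4, 1.6]
```

SDM captures systematic (directional) error, while SRMSD also absorbs unsystematic error, which is why the studies above report them separately.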
The research questions chosen were those considered to be of most value to the practitioner.

Table 4
Summary of Studies of Unidimensional Equating with Multidimensional Data

Study | Tests | Model | Equating Method | Number of Dimensions | Independent Variables | Evaluation Criterion
Camilli, Wang, & Fesq (1995) | LSAT | 3PL | true-score equating | 2 | test dimensionality | split test
Cook & Eignor (1988) | SAT | 3PL | concurrent calibration; characteristic curve trans. | unknown | equating methods | scale drift
Doody-Bogan & Yen (1983) | Simulation | 3PL, M3PL | equated bs | 2 | criterion measures; ρ(θ1, θ2) | correlation; SDM; SRMSD
Dorans & Kingston (1985) | GRE-V | 3PL | equated bs | 2 | calibration procedures; data collection design | split test
Yen (1984) | CTBS/U, Simulation | 3PL | equated bs | CTBS: unknown; Sim.: 2 | ρ(θ1, θ2); a & b parameters | correlation; SDM; SRMSD; ratio of σ

CHAPTER 3
METHOD

Purpose

Introduction

The purpose of this study was to examine the effects of multidimensional data on unidimensional equating procedures. The effects of the number of multidimensional items, the type of multidimensional model, and the choice of equating procedure were investigated. Most investigations were conducted with randomly equivalent, normally distributed examinee groups having mean 0 and standard deviation 1. In addition, data from examinee groups of lower ability (M = −0.8, SD = 0.6) were equated to results obtained from the randomly equivalent groups. The methods applied to investigate these effects are described in this chapter. The methodology is discussed in the following sections: (a) data generation, (b) estimation of parameters, (c) equating, and (d) criteria for evaluation.

Research Questions

The specific questions to be answered in the present study were:
1. Does the number of multidimensional items in a test affect unidimensional equating results?
2. Does the equating procedure affect unidimensional equating results?
3.
Do data simulated by using a compensatory multidimensional model produce different unidimensional equating results than data simulated using a noncompensatory model?
4. Are unidimensional equating results affected by differing ability distributions of the two examinee groups?

Data Generation

Design

Data for two parallel forms, A and B, of each test condition were simulated. Four test conditions were created by varying the number of multidimensional items contained in each test. These conditions were created to mirror what might be found in published tests. For example, in a test of mathematics problem solving, all items might be multidimensional to some degree if reading skill were also required. However, relatively few multidimensional items might be found in a reading comprehension test containing only one graph-reading passage that also needed a math skill for completion. In the present study, 10, 20, 30, and 40 items of a 40-item test were two-dimensional. These conditions are referred to as MD10, MD20, MD30, and MD40, respectively. In addition to modifying the number of multidimensional items, the strength of each multidimensional item's first factor was manipulated. This was done within each test condition because it is unreasonable to expect a published test to contain multidimensional items which all have an identical factor structure. The angle of item direction was varied across 20°, 30°, 45°, and 60° to reflect items that predominantly measure the first trait (20° and 30°), both traits equally (45°), and the second trait (60°). Finally, data were originally generated using a compensatory multidimensional model. To investigate any variations due to the difference in modeling, each compensatory dataset was transformed into its corresponding noncompensatory parameters through application of the least-squares approach used by Ackerman (1989) and described in Chapter 2.
Noncompensatory parameters were considered corresponding if the probability of a correct response was the same as for the compensatory parameters. This was accomplished through the NLIN procedure in the Statistical Analysis System (SAS, 1989). Specific methodology is discussed later in this chapter.

Model Description

To avoid problems associated with estimating the lower asymptote, the compensatory multidimensional two-parameter logistic (M2PL) model (Reckase, 1985) was selected for data generation. Because this is a compensatory model, high abilities on one ability trait are allowed to compensate for lower abilities on the second ability trait. The multidimensional item difficulty (MID_i) parameter was defined by Reckase as in Equation 14, where a_ik is the kth element of a_i and m is the number of dimensions. The data of interest in this study were considered to be two-dimensional, so m equaled 2. Multidimensional item difficulty is the distance from the origin of the multidimensional ability space to the point where the item provides maximum examinee information, or where the item response surface (IRS) has the steepest slope. A line joins these points at angle α_ik. In a two-dimensional item, the value of α_ik can range between 0° and 90° depending on the degree to which the item measures the two traits. If the item measures only the first trait, α_i1 equals 0°, while α_i1 = 90° would depict an item measuring only the second trait. For this study, α_i1 was set to either 0°, 20°, 30°, 45°, or 60°.

Item Parameters

Four tests with 40 items each were simulated using the compensatory M2PL model described above. Forty items were selected as sufficient to provide good equating results. An anchor test design was chosen for data collection as it is widely used by practitioners (Skaggs & Lissitz, 1986a). Each test consisted of two forms with 12 common linking items and 28 unique items. The difficulty values were selected to be reasonable for published tests.
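For concreteness, the M2PL quantities introduced above can be sketched in code. This is an illustrative sketch rather than the study's SAS implementation; the function names are ours, and it assumes the usual Reckase (1985) parameterization in which the exponent is a·θ + d, MDISC is the Euclidean norm of the a-vector (Equation 15), and MID = -d / MDISC (Equation 14).

```python
import math

def m2pl_prob(a, theta, d):
    """Compensatory M2PL: P = 1 / (1 + exp(-(a . theta + d)))."""
    z = sum(ak * tk for ak, tk in zip(a, theta)) + d
    return 1.0 / (1.0 + math.exp(-z))

def mdisc(a):
    """Multidimensional discrimination (Equation 15): Euclidean norm of a."""
    return math.sqrt(sum(ak * ak for ak in a))

def mid(a, d):
    """Multidimensional difficulty (Equation 14): distance from the origin
    to the point of steepest slope, MID = -d / MDISC."""
    return -d / mdisc(a)

def item_angle_deg(a):
    """Angle of the item's direction of best measurement with the theta1 axis."""
    return math.degrees(math.acos(a[0] / mdisc(a)))
```

Under this parameterization, an item with a = (1, 1) lies at 45° and measures both traits equally, matching the α = 45° items described above.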
Lord (1968) found difficulties ranging from -1.5 to 2.5 (mean = 0.58, SD = 0.87) on SAT Verbal data. Doody-Bogan and Yen (1983) employed a range of b_i of -2.0 to 1.52 (mean = 0.028, SD = 0.818) in a simulation designed to imitate CTBS/U data. In a study using multidimensional data, Ackerman (1988) reported MID values ranging from -0.73 through 1.87 on an ACT Mathematics test. Oshima and Miller (1990) used MID values in the interval -2.0 to 2.0. For the purpose of this investigation, multidimensional item difficulty parameters (MID) were generated using the RANNOR function of SAS. Values were chosen randomly from a normal distribution within the range of -2.0 through 2.0 and to have mean 0 and standard deviation 1.0. The multidimensional discrimination parameters (MDISC) defined by Equation 15 were randomly selected from a lognormal distribution. A majority of MDISC values lay between .5 and 2.5, with mean 1.15 and standard deviation .60. These values correspond to those reported by Doody-Bogan and Yen (1983) of .5 to 2.00 with mean 1.03 and standard deviation .3387. Ackerman (1988) found an MDISC range of .58 through 2.39. To create two 40-item test forms, 68 items were generated for each test condition. The first 12 items in each set were identified as the linking items and were common to both forms. Items 13 through 40 were unique items for Form A, and items 41 through 68 were unique to Form B. In order to simulate two-dimensional items, the values of α_i1 as expressed in Equation 13 varied. In the case of unidimensional items, α_i1 was set to 0°. For two-dimensional items, α_i1 was either 20°, 30°, 45°, or 60°. Those items with α_i1 = 20° or 30° primarily measured the first trait. Items having α_i1 = 45° measured both traits equally, and those with α_i1 = 60° discriminated on the second factor more heavily. More multidimensional items in this study predominantly measured the first factor because it is reasonable to anticipate this to occur in a well-designed commercial test.
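The parameter-drawing step just described can be sketched as follows. This is a hypothetical Python stand-in for the SAS RANNOR and lognormal draws: the restriction of MID to [-2, 2] is implemented by redrawing, and the lognormal mu and sigma are back-solved from the reported MDISC moments (mean 1.15, SD .60), which is our assumption about how the distribution was parameterized.

```python
import math
import random

def draw_item_parameters(n_items, seed=7):
    """Draw (MDISC, MID) pairs: MID from N(0, 1) restricted to [-2, 2] by
    redrawing, MDISC from a lognormal whose underlying normal mean and SD
    are back-solved from the target moments (mean 1.15, SD 0.60)."""
    rng = random.Random(seed)
    target_mean, target_sd = 1.15, 0.60
    # If X = exp(N(mu, s2)), then mean = exp(mu + s2/2) and
    # SD^2 = (exp(s2) - 1) * mean^2; invert those two relations.
    s2 = math.log(1.0 + (target_sd / target_mean) ** 2)
    mu = math.log(target_mean) - s2 / 2.0
    items = []
    for _ in range(n_items):
        m = rng.gauss(0.0, 1.0)
        while abs(m) > 2.0:            # keep MID within -2.0 .. 2.0
            m = rng.gauss(0.0, 1.0)
        items.append((rng.lognormvariate(mu, math.sqrt(s2)), m))
    return items
```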
These four α_i1 values were spiraled throughout the items in each dataset. To illustrate, in MD40 α_i1 was 20° for item 1, 30° for item 2, 45° for item 3, and 60° for item 4. This pattern then repeated for the remaining items. For datasets containing both unidimensional and two-dimensional items, the last 3, 6, and 9 linking items were multidimensional for MD10, MD20, and MD30, respectively. Thus the linking test had the same proportion of unidimensional items as did the corresponding unique items in each condition. The last 7, 14, and 21 unique items for each of Forms A and B were also multidimensional. Table 5 presents the item parameters for Form A of MD30, with 75% of the items in each form being two-dimensional.

Response Data

For each experimental condition and form, response vectors for 1,000 simulees were generated. This sample size was selected as being adequate to provide stable parameter estimates. The ability values were randomly generated through the RANNOR normal-distribution function of SAS to range from approximately -3.00 to 3.00. The theta values were assumed to be uncorrelated. Probabilities of correctly answering an item were then calculated for each simulee through application of Equation 11. Finally, the SAS function RANUNI was used to produce a random number from the uniform distribution between 0 and 1. If this number was less than or equal to P(X_ij = 1 | a_i, d_i, θ_j), the simulee passed the item. If the random number was greater, the simulee failed. To increase confidence in results, twenty sets of response data were generated for each condition and form.
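The response-generation algorithm just described (compute the probability from Equation 11, then compare it to a RANUNI-style uniform draw) can be sketched as follows. This is an illustrative Python stand-in for the SAS routine, with function names of our own choosing.

```python
import math
import random

def simulate_responses(items, n_simulees=1000, seed=1):
    """Generate dichotomous response vectors under the compensatory M2PL.
    items: list of (a1, a2, d) tuples. Abilities are drawn N(0, 1) and
    uncorrelated, mimicking RANNOR; the pass/fail draw mimics the RANUNI
    comparison described in the text."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_simulees):
        t1, t2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        row = []
        for a1, a2, d in items:
            p = 1.0 / (1.0 + math.exp(-(a1 * t1 + a2 * t2 + d)))
            row.append(1 if rng.random() <= p else 0)   # pass if uniform <= P
        data.append(row)
    return data
```

Replicating this call twenty times with different seeds corresponds to the twenty response sets per condition and form.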
Table 5
Simulated Compensatory Parameters for MD30, Form A

Item Form α_i1 (degrees) a1 a2 d_i MDISC MID
1 A,B 0 0.475 0.000 0.584 0.475 1.231
2 A,B 0 0.563 0.000 0.173 0.563 0.308
3 A,B 0 0.515 0.000 0.652 0.515 1.266
4 A,B 60 0.736 1.275 1.199 1.472 0.814
5 A,B 20 1.159 0.422 0.681 1.234 0.552
6 A,B 30 0.706 0.407 0.054 0.815 0.066
7 A,B 45 0.936 0.936 0.939 1.323 0.709
8 A,B 60 0.291 0.504 0.618 0.582 1.062
9 A,B 20 0.684 0.249 0.599 0.728 0.822
10 A,B 30 0.882 0.510 1.652 1.019 1.621
11 A,B 45 1.129 1.129 2.676 1.597 1.675
12 A,B 60 0.881 1.526 1.018 1.763 0.578
13 A 0 0.973 0.000 0.549 0.973 0.565
14 A 0 1.358 0.000 0.324 1.358 0.239
15 A 0 1.857 0.000 1.417 1.857 0.763
16 A 0 0.860 0.000 0.524 0.860 0.609
17 A 0 1.448 0.000 1.538 1.448 1.062
18 A 0 1.517 0.000 0.448 1.517 0.295
19 A 0 0.663 0.000 0.142 0.663 0.214
20 A 60 0.480 0.832 0.723 0.961 0.753
21 A 20 0.648 0.236 0.550 0.689 0.798
22 A 30 1.944 1.122 0.992 2.244 0.442
23 A 45 1.120 1.120 0.654 1.584 0.413
24 A 60 0.268 0.464 0.122 0.535 0.228
25 A 20 0.790 0.288 0.295 0.841 0.351
26 A 30 0.442 0.255 0.159 0.510 0.313
27 A 45 1.452 1.452 0.019 2.053 0.009
28 A 60 0.328 0.568 0.243 0.656 0.370
29 A 20 0.744 0.271 0.055 0.792 0.070
30 A 30 0.398 0.230 0.315 0.460 0.686
31 A 45 0.355 0.355 0.924 0.502 1.840
32 A 60 0.465 0.806 1.060 0.930 1.140
33 A 20 1.442 0.525 1.014 1.535 0.661
34 A 30 1.031 0.595 0.284 1.191 0.238
35 A 45 0.879 0.879 1.320 1.244 1.061
36 A 60 0.431 0.747 0.965 0.862 1.119
37 A 20 0.589 0.214 0.533 0.627 0.850
38 A 30 1.144 0.661 2.296 1.321 1.738
39 A 45 0.810 0.810 1.050 1.145 0.917
40 A 60 0.147 0.254 0.135 0.293 0.461

Noncompensatory Data

For each compensatory item generated, a corresponding noncompensatory item was created. A noncompensatory item was considered corresponding if it had the same probability of success as the compensatory item (Ackerman, 1989). To accomplish this, the NLIN procedure of SAS was applied to Equation 16.
Specifically, the compensatory probability was calculated for each case and became the dependent variable. The independent variable in the NLIN model statement was the noncompensatory probability function. Only multidimensional items were transformed, as the compensatory/noncompensatory question was not applicable to unidimensional items. Starting values for noncompensatory parameter estimation were set to equal the compensatory parameters. The 1,000 theta vectors generated for the first of each compensatory response set were treated as known values. To ensure the program was converging to a unique solution, starting values were changed for several items in each set and the parameters reestimated. Any differences which appeared in the parameter estimates were contained in the fourth or fifth decimal place. For approximately 10% of the items in each dataset, the convergence criterion was not met within 40 iterations. In these cases, the final parameter estimates were substituted for the starting values and the program rerun. In all such cases, convergence was achieved with the second attempt. Response vectors were generated by applying Equation 10 and using the same (θ1, θ2) combinations utilized to produce the corresponding compensatory responses. Twenty response sets were simulated for each noncompensatory dataset. The item parameters for the multidimensional Form A items of noncompensatory MD30 are shown in Table 6. Summary statistics for datasets of both models are displayed in Table 7.

Nonequivalent Groups

One of the strongest theoretical advantages of IRT is its usefulness with groups of subjects who differ in abilities. One case where this may occur is when a second form of a test, such as a high school proficiency test, is administered only to examinees who failed the first attempt. To examine the effect of data from a lower ability group being equated to data gathered from a normally distributed group, sets of 1,000 less able simulees were generated.
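The probability-matching fit described above can be sketched as a least-squares problem. The sketch below is a Python stand-in for PROC NLIN, not the study's code: it assumes the noncompensatory model is a product of per-dimension 2PL terms (our reading of Equation 16, as described in Chapter 2) and uses a crude numeric-gradient descent in place of NLIN's algorithm; all function names are ours.

```python
import math

def p_comp(a1, a2, d, t1, t2):
    """Compensatory M2PL probability (the dependent variable)."""
    return 1.0 / (1.0 + math.exp(-(a1 * t1 + a2 * t2 + d)))

def p_noncomp(a1, a2, b1, b2, t1, t2):
    """Noncompensatory probability: product of per-dimension 2PL terms."""
    p1 = 1.0 / (1.0 + math.exp(-a1 * (t1 - b1)))
    p2 = 1.0 / (1.0 + math.exp(-a2 * (t2 - b2)))
    return p1 * p2

def sse(params, comp_item, thetas):
    """Sum of squared probability differences over the known theta vectors."""
    a1, a2, b1, b2 = params
    ca1, ca2, d = comp_item
    return sum((p_comp(ca1, ca2, d, t1, t2)
                - p_noncomp(a1, a2, b1, b2, t1, t2)) ** 2
               for t1, t2 in thetas)

def fit_noncomp(comp_item, thetas, start, lr=0.02, iters=200):
    """Crude numeric-gradient descent on the SSE (a stand-in for PROC NLIN).
    As in the study, starting values would be the compensatory parameters."""
    params = list(start)
    h = 1e-5
    for _ in range(iters):
        base = sse(params, comp_item, thetas)
        grad = []
        for i in range(4):
            bumped = params[:]
            bumped[i] += h
            grad.append((sse(bumped, comp_item, thetas) - base) / h)
        params = [p - lr * g for p, g in zip(params, grad)]
    return params
```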
Scores on θ1 for the lower group ranged between -3.00 and 0.00 with mean -0.80 and standard deviation 0.6. Abilities on the second dimension were normally distributed with mean 0 and standard deviation 1. Five replications of scores were generated for all four compensatory test conditions.

Estimation of Parameters

Unidimensional IRT

The responses of the 1,000 simulated examinees in each response set were analyzed by the computer program BILOG (Mislevy & Bock, 1990) to estimate the unidimensional item discrimination and difficulty parameters.

Table 6
Simulated Noncompensatory Parameters for Multidimensional Items, MD30, Form A

Item Form α_i1 (degrees) a1 a2 b1 b2
4 A,B 60 0.664 0.888 0.945 0.309
5 A,B 20 0.778 0.528 0.236 2.081
6 A,B 30 0.528 0.447 0.713 2.092
7 A,B 45 0.705 0.698 1.534 1.596
8 A,B 60 0.352 0.395 3.164 1.776
9 A,B 20 0.478 0.390 1.175 3.624
10 A,B 30 0.638 0.494 0.834 0.661
11 A,B 45 0.849 0.844 0.555 0.565
12 A,B 60 0.728 0.942 1.999 0.964
20 A 60 0.496 0.606 1.268 0.047
21 A 20 0.184 0.149 1.188 4.872
22 A 30 2.256 0.235 0.426 0.705
23 A 45 0.830 0.792 0.481 0.491
24 A 60 0.692 0.248 4.282 0.835
25 A 20 0.545 0.413 0.089 2.602
26 A 30 0.344 0.306 0.745 2.472
27 A 45 0.957 0.910 0.750 0.786
28 A 60 0.381 0.436 2.495 1.136
29 A 20 0.516 0.400 0.361 2.876
30 A 30 0.310 0.276 0.561 2.430
31 A 45 0.312 0.315 0.297 0.388
32 A 60 0.499 0.585 2.774 1.633
33 A 20 0.918 0.610 0.856 2.870
34 A 30 0.725 0.578 0.701 1.944
35 A 45 0.698 0.677 0.060 0.058
36 A 60 0.474 0.551 2.809 1.638
37 A 20 0.412 0.322 0.187 2.772
38 A 30 0.814 0.584 1.073 0.399
39 A 45 0.653 0.636 0.219 0.231
40 A 60 0.096 0.207 4.957 3.588

Table 7
Summary Statistics for Multidimensional Items in Compensatory and Noncompensatory Datasets

Parameter | MD10 | MD20 | MD30 | MD40
b1 (NC) Mean | 1.09 | 1.08 | 1.02 | .88
b1 (NC) SD | 1.03 | 1.09 | 1.28 | 1.16
b2 (NC) Mean | 1.37 | 1.54 | 1.35 | 1.23
b2 (NC) SD | .94 | 1.80 | 1.37 | 1.18

Note.
C = Compensatory item parameters; NC = Noncompensatory item parameters.

Program default values were used in the calibration of the two-parameter logistic model item parameters. Specifically, this involved marginal maximum likelihood estimation procedures, no priors specified for difficulties, and lognormal priors for discrimination parameters. For the randomly equivalent groups, each of the 160 response sets (20 replications each for the four compensatory and four noncompensatory multidimensional conditions) was analyzed twice. The procedure was repeated for the nonequivalent groups. First the responses for combined Forms A and B for each dataset were analyzed simultaneously. Then each form was analyzed separately. This resulted in a total of 520 BILOG runs.

Analytical Estimation

Unidimensional estimation of the multidimensional item parameters for the eight datasets was performed analytically using Wang's (1986) procedure. The SAS IML procedure was employed to determine the unidimensional estimates of the two-dimensional item parameters for each of the eight conditions.

Equating

In IRT, because the ICCs are population independent, item parameter estimates from two BILOG runs should theoretically be identical. However, P_i(θ) in the 2PL model is a function of the quantity a_i(θ - b_i). As such, the origin and unit of measurement of θ and b_i are arbitrary, or indeterminate. Any scale may be selected for θ as long as the same scale is chosen for b_i. Estimated abilities and item difficulties from two calibration runs should therefore have a linear relationship to each other (Petersen et al., 1989). Equating is a procedure used to place the item parameters from two tests on the same scale. Three unidimensional IRT equating methods were selected for this study: (a) concurrent calibration, (b) equated bs, and (c) characteristic curve transformation.

Concurrent Calibration

Concurrent calibration is the simplest of the IRT methods of equating to implement.
A common group of examinees or items is required to tie the information from the two tests together. For this study, the parameters of both forms were estimated simultaneously by BILOG. Twelve common items in each dataset served to link the forms, and the resulting item parameter estimates were therefore on the same scale. This process was repeated for each of the response sets in each condition.

Equated bs

The equated bs method is based on determining the linear relationship that exists between item difficulties estimated in two separate BILOG calibration runs, one for each form. The means and standard deviations of the b_i's for each set of linking items from Forms A and B were calculated. The linear transformation was determined by

b*_B = (SD_A / SD_B)(b_B - Mean_B) + Mean_A    (19)

Once the slope (A = SD_A / SD_B) and intercept (B = Mean_A - A x Mean_B) of the linear transformation were found, they were applied to all ability and item estimates for Form B, yielding

b*_j = A b_j + B    (20)

a*_j = a_j / A    (21)

θ* = A θ + B    (22)

All parameters were now transformed to the same scale. Although item discrimination or ability estimates could have been used to determine the linear transformation, item difficulty estimates are usually used in practice because they yield the most stable parameter estimates (Cook & Eignor, 1991).

Characteristic Curve Transformation

The parameter estimates computed separately for Form A and Form B were also used in the characteristic curve transformation. This equating method used both a_j and b_j estimates from the linking items to derive a linear transformation through an iterative process that minimized the difference between the item parameter estimates of the linking items. The process is based on the assumption that if the estimates were free of error, choosing the proper linear transformation would cause the true-score estimates of the linking items to correspond (Petersen et al., 1989; Stocking & Lord, 1983).
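Both separate-calibration methods end by applying a linear transformation of the form in Equations 20 through 22; for the equated bs method the slope and intercept come directly from the moments of the linking-item difficulties (Equation 19). A minimal sketch, with names of our own choosing:

```python
import math

def equated_bs(link_b_a, link_b_b):
    """Slope A and intercept B of the linear transformation (Equation 19)
    from the linking-item difficulties of Forms A and B:
    A = SD_A / SD_B, B = mean_A - A * mean_B."""
    def mean(x):
        return sum(x) / len(x)
    def sd(x):
        m = mean(x)
        return math.sqrt(sum((v - m) ** 2 for v in x) / len(x))
    A = sd(link_b_a) / sd(link_b_b)
    B = mean(link_b_a) - A * mean(link_b_b)
    return A, B

def rescale(a, b, theta, A, B):
    """Equations 20-22: a* = a / A, b* = A*b + B, theta* = A*theta + B."""
    return a / A, A * b + B, A * theta + B
```

Applying `rescale` to every Form B estimate places both forms on the Form A scale, which is exactly what the transformed b, a, and theta in Equations 20 through 22 express.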
The resulting transformation was then applied to all Form B parameters to create estimates on the same scale. The EQUATE computer program (Baker, Al-Karni, & Al-Dosary, 1991) was used to accomplish this. Data were examined at 80 points along the ICC, and the transformation was generally identified after approximately 8 to 10 iterations. All three equating procedures described were applied to each of the replications for each of the twelve data conditions. This resulted in 660 equatings for this study. A summary of the research equating conditions is presented in Table 8.

Table 8
Summary of Research Equating Conditions

Dataset | Concurrent Calibration | Equated bs | Characteristic Curve
Compensatory, Randomly Equivalent Groups
MD10 | ✓ | ✓ | ✓
MD20 | ✓ | ✓ | ✓
MD30 | ✓ | ✓ | ✓
MD40 | ✓ | ✓ | ✓
Noncompensatory, Randomly Equivalent Groups
MD10 | ✓ | ✓ | ✓
MD20 | ✓ | ✓ | ✓
MD30 | ✓ | ✓ | ✓
MD40 | ✓ | ✓ | ✓
Compensatory, Nonequivalent Groups
MD10 | ✓ | ✓ | ✓
MD20 | ✓ | ✓ | ✓
MD30 | ✓ | ✓ | ✓
MD40 | ✓ | ✓ | ✓

Evaluation Criteria

To establish a foundation for evaluating the results of the research equatings, the three comparison conditions described below were used. In addition, three statistical criteria (correlation, standardized mean difference, and standardized root mean square difference) were applied to the data.

Comparison Conditions

For the first comparison condition, the unidimensional approximations of the multidimensional item parameters were calculated using the analytic procedure described by Equations 17 and 18 (Wang, 1986). To compute these approximations for the eight research conditions, the SAS IML procedure was applied to each of the simulated parameter sets. The means and standard deviations of the responses for each condition were determined for inclusion in the formula. The resulting sets of unidimensional comparison item parameters were weighted composites of the item parameters for the two traits (Ackerman, 1988). Table 9 presents the analytical unidimensional item parameter approximations for compensatory MD30, Form A.
The resulting analytical item parameter estimates were then fixed in BILOG 386, and all compensatory and noncompensatory response sets were analyzed to establish the comparison ability estimates. For the next comparison condition, the second dimension of each multidimensional item was ignored. This would be reasonable if arguing that most published tests were designed to measure only the first factor. For example, although mathematics problem solving requires reading skills to understand the prompts, the reading level is usually well below the grade level being tested. In this study, the simulated ability parameters of the first dimension only from each compensatory and noncompensatory dataset were utilized. This comparison criterion would enable evaluation of how well the dominant first factor was recovered in the equatings. A third comparison condition was created which employed the averages of the two true θ values. This condition was based on the parameter estimation studies of Yen (1984) and Ansley and Forsyth (1985), in which the unidimensional estimates of the θ parameters appeared to be combinations of the true multidimensional abilities.

Table 9
Analytical Estimates of the Unidimensional Parameters for Compensatory MD30, Form A

Item Discrimination Difficulty
1 0.242 1.408
2 0.286 0.352
3 0.262 1.449
4 0.679 0.949
5 0.712 0.559
6 0.479 0.066
7 0.733 0.737
8 0.289 1.237
9 0.422 0.833
10 0.599 1.622
11 0.876 1.742
12 0.786 0.673
13 0.482 0.646
14 0.650 0.273
15 0.842 0.874
16 0.429 0.698
17 0.687 1.216
18 0.715 0.338
19 0.335 0.245
20 0.466 0.877
21 0.400 0.808
22 1.320 0.442
23 0.868 0.429
24 0.267 0.265
25 0.487 0.355
26 0.300 0.312
27 1.103 0.010
28 0.325 0.432
29 0.459 0.070
30 0.270 0.685
31 0.283 1.913
32 0.452 1.327
33 0.882 0.670
34 0.700 0.239
35 0.690 1.104
36 0.421 1.304
37 0.363 0.862
38 0.777 1.734
39 0.637 0.953
40 0.148 0.536
Statistical Criteria

Correlation coefficients between the simulated θ and the equated θ estimates were computed to establish the relationship between the comparison criterion and the research equatings for each condition. For concurrent calibration, the appropriate simulated θ parameters were correlated with the corresponding estimated ability parameters for both Form A and Form B. Only the equated form, Form B, was compared to the comparison conditions for all other equating procedures. The standardized difference between means (SDM) is the difference in mean scores for the two sets of ability traits divided by a pooled estimate of the standard deviation,

SDM = (Mean_1 - Mean_2) / sqrt[(S_1² + S_2²) / 2]    (23)

where S_1² and S_2² are the variances of the two sets of abilities (Yen, 1984). The means of the estimated ability parameters were subtracted from the means of each comparison condition to calculate this statistic. The standardized root mean square difference (SRMSD) is the square root of the mean squared difference between examinees' trait estimates, divided by S. Again, the estimated θ parameter values were subtracted from the appropriate comparison values to derive the criterion value.

Summary

Four test conditions with differing numbers of multidimensional items were simulated using the compensatory M2PL item response theory model. The item direction for multidimensional items was varied within each test. Comparable noncompensatory datasets were then created for each condition. Two 40-item forms were constructed for each situation, consisting of 12 linking and 28 unique items. Responses for 1,000 normally distributed simulated examinees were generated through application of the appropriate probability equation and replicated 20 times. The same (θ1, θ2) combinations were used to generate corresponding compensatory and noncompensatory response sets. In addition, responses for 1,000 low-ability examinees were generated with 5 replications for each compensatory test condition.
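The two difference statistics defined in the Statistical Criteria section can be sketched directly. This is a minimal Python sketch under two assumptions of ours: variances are computed with a divide-by-n convention, and the S that standardizes the SRMSD is the standard deviation of the criterion (comparison) values.

```python
import math

def sdm(x, y):
    """Standardized difference between means (Equation 23):
    (mean_x - mean_y) / sqrt((var_x + var_y) / 2)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    vx = sum((v - mx) ** 2 for v in x) / len(x)
    vy = sum((v - my) ** 2 for v in y) / len(y)
    return (mx - my) / math.sqrt((vx + vy) / 2.0)

def srmsd(criterion, estimate):
    """Standardized root mean square difference: RMS of the paired
    differences divided by the SD of the criterion trait values
    (the choice of S here is our assumption)."""
    n = len(criterion)
    rms = math.sqrt(sum((c - e) ** 2 for c, e in zip(criterion, estimate)) / n)
    m = sum(criterion) / n
    s = math.sqrt(sum((c - m) ** 2 for c in criterion) / n)
    return rms / s
```

Note the two statistics answer different questions: SDM detects a shift in the group mean, while SRMSD also penalizes person-level disagreement even when the means coincide.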
Parameter estimation was executed on all conditions using both unidimensional IRT procedures and analytical estimation. For the IRT parameter estimates, equating was performed through three techniques: (a) concurrent calibration, (b) equated bs, and (c) characteristic curve transformation. Three comparison conditions (the first simulated theta, the average of theta 1 and theta 2, and the analytical estimations of the unidimensional parameters) were selected for comparison with the equated ability estimates. Finally, the three statistical procedures of correlation, standardized mean difference, and standardized root mean square difference were applied to examine the comparisons.

CHAPTER 4
RESULTS AND DISCUSSION

Simulated Data

Item Parameters

Item parameters for two 40-item forms of a test were generated with a compensatory multidimensional 2PL model. Four conditions were created with either 10, 20, 30, or 40 multidimensional items in each form. Four degrees of dimensionality were spiraled throughout each test and form. Each form contained twelve linking items that mirrored the total test in psychometric properties. Additionally, Forms A and B were designed to be randomly parallel. Examination of the simulated compensatory item parameters confirms this was accomplished. Descriptive statistics for the four compensatory Form A conditions are presented in Table 10 and Form B data are shown in Table 11. All generated values are within the limits found in published tests and described in previous empirical studies (Doody-Bogan & Yen, 1983; Ackerman, 1988). For both forms and across all conditions, the means of the d_i parameters approach 0.0 with standard deviations of approximately 1.0. The means and standard deviations of all item parameters for both forms are similar. The multidimensional compensatory item parameters were then transformed into their noncompensatory correlates.
Table 10
Descriptive Statistics for Compensatory Form A Item Parameters

Parameter Condition Minimum Maximum Mean
a1 10 0.29 3.49 1.15
a1 20 0.30 2.41 0.89
a1 30 0.15 1.94 0.84
a1 40 0.28 2.45 0.98
a2 10 0.00 1.22 0.17
a2 20 0.00 1.87 0.42
a2 30 0.00 1.53 0.49
a2 40 0.21 1.63 0.71
d 10 -2.27 2.18 0.08
d 20 -2.44 2.76 0.20
d 30 -1.06 2.68 0.25
d 40 -2.90 2.78 0.17
MDISC 10 0.41 3.49 1.23
MDISC 20 0.30 2.41 1.08
MDISC 30 0.29 2.24 1.04
MDISC 40 0.57 2.61 1.25
MID 10 -1.94 1.62 0.11
MID 20 -1.86 1.83 0.09
MID 30 -1.84 1.23 0.17
MID 40 -1.43 1.73 0.10

Note. N = 40 items in each condition.

Table 11
Descriptive Statistics for Compensatory Form B Item Parameters

Parameter Condition Minimum Maximum Mean SD
a1 10 0.37 3.65 1.14 0.8
a1 20 0.27 2.41 0.96 0.5
a1 30 0.15 2.11 0.94 0.5
a1 40 0.27 2.45 0.88 0.5
a2 10 0.00 1.37 0.20 0.4
a2 20 0.00 1.58 0.36 0.5
a2 30 0.00 2.11 0.55 0.5
a2 40 0.18 2.27 0.71 0.4
d 10 -2.55 6.23 0.30 1.5
d 20 -1.87 2.76 0.00 1.1
d 30 -3.30 4.65 0.10 1.3
d 40 -2.90 2.78 0.20 1.2
MDISC 10 0.39 3.65 1.23 0.8
MDISC 20 0.32 2.41 1.12 0.5
MDISC 30 0.30 2.98 1.16 0.6
MDISC 40 0.42 2.62 1.18 0.5
MID 10 -1.71 1.96 0.13 0.9
MID 20 -1.88 1.58 0.08 0.9
MID 30 -1.68 1.77 0.02 0.9
MID 40 -1.79 1.94 0.09 0.8

Note. N = 40 items in each condition.

Descriptive statistics for the noncompensatory Form A conditions are presented in Table 12, and Form B information is given in Table 13. The item parameter values calculated from the noncompensatory transformations are within the ranges given by Ackerman (1989). For all conditions and in both forms, b2 is slightly less difficult than b1, and a2 is less discriminating than a1. In all cases, the noncompensatory b_i parameters are lower than the MID_i for the corresponding item.
This may be explained by considering the method used to calculate the transformations. A compensatory and a noncompensatory item were considered corresponding if, for each (θ1, θ2) combination, the probability of a correct response was the same on both items. Because the noncompensatory model does not allow a high ability on one trait to compensate for a low ability on the other dimension, the b_i parameters of a noncompensatory item must be smaller than the MID_i parameter of the corresponding compensatory item if the condition for items to be corresponding is to be met.

Table 12
Descriptive Statistics for Multidimensional Item Parameters in Noncompensatory Form A

Parameter Condition Minimum Maximum Mean SD
a1 10 0.27 0.99 0.63 0.3
a1 20 0.10 1.10 0.60 0.3
a1 30 0.10 2.26 0.63 0.4
a1 40 0.33 2.65 0.76 0.4
a2 10 0.00 1.22 0.17 0.3
a2 20 0.15 0.94 0.52 0.2
a2 30 0.00 1.53 0.49 0.4
a2 40 0.32 1.14 0.62 0.2
b1 10 -3.44 0.86 -1.20 1.1
b1 20 -2.94 1.68 -0.79 1.2
b1 30 -4.96 1.07 -1.07 1.4
b1 40 -3.55 1.93 -0.92 1.2
b2 10 -3.01 0.54 -1.43 0.8
b2 20 -5.75 2.34 -1.29 2.0
b2 30 -4.87 0.84 -1.45 1.4
b2 40 -3.10 3.98 -1.12 1.2

Note. The number of multidimensional items is the same as the condition number.

Table 13
Descriptive Statistics for Multidimensional Item Parameters in Noncompensatory Form B

Parameter Condition Minimum Maximum Mean SD
a1 10 0.33 1.54 0.69 0.4
a1 20 0.13 1.10 0.59 0.3
a1 30 0.33 1.33 0.69 0.3
a1 40 0.10 1.58 0.64 0.3
a2 10 0.29 0.97 0.64 0.3
a2 20 0.10 1.12 0.57 0.3
a2 30 0.16 0.83 0.63 0.4
a2 40 0.25 1.82 0.64 0.3
b1 10 -2.32 0.57 -0.96 0.8
b1 20 -3.22 0.27 -1.21 0.9
b1 30 -3.30 1.65 -0.10 1.3
b1 40 -4.06 1.86 -0.79 1.1
b2 10 -3.04 0.33 -1.25 1.0
b2 20 -4.90 0.69 -1.53 1.5
b2 30 -3.62 1.03 -1.24 1.3
b2 40 -3.51 0.82 -1.32 1.1

Note. The number of multidimensional items is the same as the condition number.

The differences between the compensatory and noncompensatory M2PL models can also be shown graphically.
Because the probability of a correct response varies as a function of the θs in each model, the item response surfaces (IRS) and contour plots of matched items should differ. The compensatory and corresponding noncompensatory model IRS and contour plot for an item of each degree of dimensionality are shown in Figures 4 through 7. In Figure 4, a matched item that discriminates predominantly on θ1 (α = 20°) is pictured. The differences between the two IRSs are minor. A similarity also exists in the two conditions where the degree of dimensionality is 15° from equally discriminating. Figure 5 shows the IRS for α = 30°, which discriminates slightly more on θ1 than on θ2. Conversely, Figure 7 presents the graphs for α = 60°, which discriminates slightly more on θ2 than on θ1. Although differences exist in the baselines, the curves of the IRSs remain similar. This is true both within each of the two matched sets and between the items with α = 30° and α = 60°. In Figure 6, where α = 45°, the corresponding compensatory and noncompensatory items discriminate equally along θ1 and θ2, and there is a sharp contrast between corresponding curves. Similar conclusions can be drawn from examination of the equiprobability lines of the contour plots. For the compensatory model, parallel lines join the (θ1, θ2) combinations that have an equal probability of a correct response. The incline of these lines is a function of the discrimination parameters. However, because the noncompensatory model does not allow a high ability on one dimension to compensate for a low ability on another dimension, the lines connecting the (θ1, θ2) combinations are curvilinear. The direction of these lines in the noncompensatory model is a function of the item's difficulty parameters.

(a) Compensatory IRS (a1 = .732, a2 = .266, d = .104); (b) Noncompensatory IRS (a1 = .526, a2 = .378, b1 = .595, b2 = 2.961); (c) Compensatory contour plot; (d) Noncompensatory contour plot

Figure 3.
Item response surfaces and contour plots for item 9, MD20, α = 20°

(a) Compensatory IRS (a1 = .934, a2 = .539, d = .650); (b) Noncompensatory IRS (a1 = .709, a2 = .526, b1 = .092, b2 = 1.177); (c) Compensatory contour plot; (d) Noncompensatory contour plot

Figure 4. Item response surfaces and contour plots for item 10, MD20, α = 30°

(a) Compensatory IRS (a1 = 1.223, a2 = 1.223, d = 1.994); (b) Noncompensatory IRS (a1 = .970, a2 = .933, b1 = .951, b2 = .913); (c) Compensatory contour plot; (d) Noncompensatory contour plot

Figure 5. Item response surfaces and contour plots for item 11, MD20, α = 45°
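The contrast between the two models' equiprobability contours can also be checked numerically. A minimal sketch with hypothetical parameter values (not items from the study): at a mixed ability point, the compensatory model lets the strong dimension offset the weak one, while the noncompensatory product cannot.

```python
import math

def p_comp(a1, a2, d, t1, t2):
    """Compensatory M2PL: one exponent sums across dimensions."""
    return 1.0 / (1.0 + math.exp(-(a1 * t1 + a2 * t2 + d)))

def p_noncomp(a1, a2, b1, b2, t1, t2):
    """Noncompensatory model: product of per-dimension 2PL terms."""
    p1 = 1.0 / (1.0 + math.exp(-a1 * (t1 - b1)))
    p2 = 1.0 / (1.0 + math.exp(-a2 * (t2 - b2)))
    return p1 * p2

# Evaluate hypothetical matched items at (theta1, theta2) = (2, -2):
# the compensatory exponent cancels to zero, so P = .5, while the
# noncompensatory product is dragged down by the weak dimension.
print(p_comp(1.0, 1.0, 0.0, 2.0, -2.0))            # 0.5
print(p_noncomp(1.0, 1.0, 0.0, 0.0, 2.0, -2.0))    # about 0.10
```

This is exactly why the compensatory contours are straight lines (constant a1·θ1 + a2·θ2) while the noncompensatory contours bend toward the axes.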