The effect of multidimensionality on unidimensional equating with item response theory

Material Information

Title:
The effect of multidimensionality on unidimensional equating with item response theory
Physical Description:
xiv, 158 leaves : ill. ; 29 cm.
Language:
English
Creator:
Spence, Patricia Duffy
Publication Date:

Subjects

Subjects / Keywords:
Foundations of Education thesis, Ph. D
Dissertations, Academic -- Foundations of Education -- UF
Genre:
bibliography   ( marcgt )
non-fiction   ( marcgt )

Notes

Thesis:
Thesis (Ph. D.)--University of Florida, 1996.
Bibliography:
Includes bibliographical references (leaves 151-157).
Statement of Responsibility:
by Patricia Duffy Spence.
General Note:
Typescript.
General Note:
Vita.

Record Information

Source Institution:
University of Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
aleph - 023272443
oclc - 35109046
System ID:
AA00014231:00001



THE EFFECT OF MULTIDIMENSIONALITY ON
UNIDIMENSIONAL EQUATING WITH
ITEM RESPONSE THEORY












By

PATRICIA DUFFY SPENCE


A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA


1996

This dissertation is dedicated to the memory of my father
James F. Duffy
1929-1992

ACKNOWLEDGMENTS


An effort of this magnitude always involves many people. The author

wishes to especially thank the chairman of her committee, Dr. M. David Miller,

for his dedication and inspiration. Without his encouragement and good humor,

this dissertation would not have been possible. The author would also like to

thank her committee members for their guidance and patience, particularly Dr.

James Algina and Dr. Linda Crocker. Their suggestions were always correct, if

not always accepted. Also, without the inspiration of Dr. Charles Dziuban of the

University of Central Florida, she would never have pursued studies in this field.

In addition, the author recognizes her colleagues, past and present, at

The Psychological Corporation, Volusia County District Schools, and the Florida

Department of Education for the opportunities to apply her learning in practical

situations. Gratitude is offered to her three parents--Jim, Joan, and Jeanne--
situations. Gratitude is offered to her three parents--Jim, Joan, and Jeanne-

who stressed the importance of learning and doing things well. Thanks also to

special friends: Anne Seraphine for debating the meaning of life and monotonic

curves; Nada Stauffer for quiet friendship; George Suarez for making her laugh;

and Carlos Guffain for demanding her best. But the author is most indebted and

grateful to her husband, Verne, who has supported and encouraged her through

three degrees, and her daughter Cindy who is now left to carry on the Gator

tradition alone.


TABLE OF CONTENTS

page

ACKNOWLEDGMENTS iii

LIST OF TABLES vii

LIST OF FIGURES xi

ABSTRACT xii

CHAPTERS

1 INTRODUCTION 1

     Purpose 3
     Limitations 4
     Significance of the Study 4

2 REVIEW OF LITERATURE 6

     Test Equating 6
          Conditions for Equating 6
          Data Collection Designs 7
               Single-group designs
               Equivalent-group designs 9
               Anchor-test designs 10
     Equating Methods 14
          Conventional Methods of Equating 14
               Linear equating 16
               Equipercentile equating 18
          Equating Methods Based on Item Response Theory 21
               Item response theory 21
               IRT equating 28
     Multidimensionality 35
          Violation of the Unidimensionality Assumption 35
          Multidimensional Models 37
          Multidimensionality and Parameter Estimation 45
          Multidimensionality and IRT equating 52

3 METHOD 58

     Purpose 58
     Introduction 58
     Research Questions 58
     Data Generation 59
          Design 59
          Model Description 60
          Item Parameters 61
          Response Data 63
          Noncompensatory Data 65
          Nonrandom Groups 66
     Estimation of Parameters 66
          Unidimensional IRT 66
          Analytical Estimation 69
     Equating 69
          Concurrent Calibration 70
          Equated bs 70
          Characteristic Curve Transformation 71
     Evaluation Criteria 73
          Comparison Conditions 73
          Statistical Criteria 75
     Summary 76

4 RESULTS AND DISCUSSION 78

     Simulated Data 78
          Item Parameters 78
          Analytical Estimation 88
          Simulated Ability Data 88
     Equating Results for Randomly Equivalent Groups 92
          Concurrent Calibration 92
          Equated bs 99
          Characteristic Curve Transformation 103
     Equating Results for Nonequivalent Groups 103
          Concurrent Calibration 103
          Equated bs and Characteristic Curve Transformation 108

5 CONCLUSIONS 111

     Effects of Multidimensional Model 111
     Effects of Equating Method 112
     Effects of the Number of Multidimensional Items 112
     Effects of Nonequivalent Examinee Groups 115
     Implications 116

APPENDIX

ITEM PARAMETER DATA 118

REFERENCES 151

BIOGRAPHICAL SKETCH 158

LIST OF TABLES

Table page

1 Summary of Recommendations for a Successful Equating 15

2 Summary of Unidimensional IRT Test Equating Studies 36

3 Summary of Studies of Unidimensional IRT Estimation with
  Multidimensional Data 50

4 Summary of Studies of Unidimensional Equating with
  Multidimensional Data 57

5 Simulated Compensatory Parameters for MD30, Form A 64

6 Simulated Noncompensatory Parameters for Multidimensional
  Items, MD30 Form A 67

7 Summary Statistics for Multidimensional Items in Compensatory
  and Noncompensatory Datasets 68

8 Summation of Research Equating Conditions 72

9 Analytical Estimates of the Unidimensional Parameters for
  Compensatory MD30, Form A 74

10 Descriptive Statistics for Compensatory Form A Item Parameters 79

11 Descriptive Statistics for Compensatory Form B Item Parameters 80

12 Descriptive Statistics for Multidimensional Item Parameters in
   Noncompensatory Form A 81

13 Descriptive Statistics for Multidimensional Item Parameters in
   Noncompensatory Form B 82

14 Descriptive Statistics for Analytical Unidimensional Estimates of
   Form A Item Parameters 89

15 Summary Statistics for Analytical Unidimensional Estimates of
   Form B Item Parameters 90

16 Descriptive Statistics for Simulated Examinees Taking MD10 91

17 Descriptive Statistics for Simulated Examinees Taking MD20 92

18 Descriptive Statistics for Simulated Examinees Taking MD30 93

19 Descriptive Statistics for Simulated Examinees Taking MD40 94

20 Descriptive Statistics for Simulated Low Ability Examinees 95

21 Summary of Concurrent Calibration Results with Randomly
   Equivalent Groups 96

22 Constants for Equated bs Equating of Compensatory Forms with
   Randomly Equivalent Groups 100

23 Constants for Equated bs Equating of Noncompensatory Forms
   with Randomly Equivalent Groups 101

24 Summary of Equated bs Results with Randomly Equivalent
   Groups 102

25 Summary of Characteristic Curve Transformation Results with
   Randomly Equivalent Groups 104

26 Summary of Equating Results with Nonequivalent Groups 106

27 Constants for Equated bs Equating of Compensatory Forms with
   Nonequivalent Examinee Groups 109

28 Simulated Compensatory Item Parameters for MD10 Form A 119

29 Simulated Compensatory Item Parameters for MD10 Form B 120

30 Simulated Compensatory Item Parameters for MD20 Form A 121

31 Simulated Compensatory Item Parameters for MD20 Form B 122

32 Simulated Compensatory Item Parameters for MD30 Form A 123

33 Simulated Compensatory Item Parameters for MD30 Form B 124

34 Simulated Compensatory Item Parameters for MD40 Form A 125

35 Simulated Compensatory Item Parameters for MD40 Form B 126

36 Noncompensatory Item Parameters for Multidimensional Items
   in MD10 Forms A and B 127

37 Noncompensatory Item Parameters for Multidimensional Items
   in MD20 Form A 128

38 Noncompensatory Item Parameters for Multidimensional Items
   in MD20 Form B 129

39 Noncompensatory Item Parameters for Multidimensional Items
   in MD30 Form A 130

40 Noncompensatory Item Parameters for Multidimensional Items
   in MD30 Form B 131

41 Noncompensatory Item Parameters for Multidimensional Items
   in MD40 Form A 132

42 Noncompensatory Item Parameters for Multidimensional Items
   in MD40 Form B 133

43 Analytical Estimates of Unidimensional Item Parameters for
   MD10 Form A 134

44 Analytical Estimates of Unidimensional Item Parameters for
   MD10 Form B 135

45 Analytical Estimates of Unidimensional Item Parameters for
   MD20 Form A 136

46 Analytical Estimates of Unidimensional Item Parameters for
   MD20 Form B 137

47 Analytical Estimates of Unidimensional Item Parameters for
   MD30 Form A 138

48 Analytical Estimates of Unidimensional Item Parameters for
   MD30 Form B 139

49 Analytical Estimates of Unidimensional Item Parameters for
   MD40 Form A 140

50 Analytical Estimates of Unidimensional Item Parameters for
   MD40 Form B 141

51 Descriptive Statistics for Compensatory MD10 Linking Items
   with Randomly Equivalent Groups 142

52 Descriptive Statistics for Compensatory MD20 Linking Items
   with Randomly Equivalent Groups 143

53 Descriptive Statistics for Compensatory MD30 Linking Items
   with Randomly Equivalent Groups 144

54 Descriptive Statistics for Compensatory MD40 Linking Items
   with Randomly Equivalent Groups 145

55 Descriptive Statistics for Noncompensatory MD10 Linking Items 146

56 Descriptive Statistics for Noncompensatory MD20 Linking Items 147

57 Descriptive Statistics for Noncompensatory MD30 Linking Items 148

58 Descriptive Statistics for Noncompensatory MD40 Linking Items 149

59 Descriptive Statistics for Compensatory Linking Items with
   Nonequivalent Groups 150

LIST OF FIGURES

Figure page

1 An item characteristic curve (ICC) based on the three-
  parameter logistic model 23

2 An item response surface (IRS) based on the compensatory
  M2PL 40

3 Item response surfaces and contour plots for item 9, MD20,
  a = 20 84

4 Item response surfaces and contour plots for item 10, MD20,
  a = 30 85

5 Item response surfaces and contour plots for item 11, MD20,
  a = 45 86

6 Item response surfaces and contour plots for item 12, MD20,
  a = 60 87

Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy

THE EFFECT OF MULTIDIMENSIONALITY ON
UNIDIMENSIONAL EQUATING WITH
ITEM RESPONSE THEORY

by

Patricia Duffy Spence

May, 1996


Chairman: M. David Miller
Major Department: Foundations of Education


Test publishers apply unidimensional equating techniques to their products

even though tests are expected to be multidimensional to some degree. This

simulation study investigated the effects of ignoring multidimensional data in

applying unidimensional item response theory equating procedures. The

specific effects studied were (a) multidimensional model, (b) type of equating

procedure, (c) number of multidimensional items, and (d) distribution of

examinee ability.

Four test conditions were created by varying the number of multidimensional

items contained in each test. The compensatory multidimensional two-

parameter logistic model was selected for data generation. Four degrees of

multidimensionality were spiraled throughout each test. The data were then

transformed into corresponding noncompensatory items which had the same

probability of success as the compensatory item for a given examinee.

Four tests with 40 items each were simulated with 12 common linking items

and 28 unique items. For each experimental condition and form, responses for

1,000 simulees were generated. To examine the effects of nonrandom groups,

responses for 1,000 less able examinees were also generated.

Three unidimensional IRT equating methods were selected: (a)

concurrent calibration, (b) equated bs, and (c) characteristic curve

transformation. Parameters were calibrated with BILOG386. To evaluate the

results of the research equatings, three comparison conditions were used: (1)

the unidimensional approximations of the multidimensional item parameters

calculated using an analytic procedure; (2) the simulated first ability dimension

only; and (3) the averages of the two simulated abilities. Three statistical

criteria--correlation, standardized differences between means, and standardized

root mean square difference--were applied to the data.

No significant effect on the unidimensional equating results was

attributed to choice of multidimensional model. For randomly equivalent groups,

there were also no effects due to choice of equating procedure. Concurrent

calibration favored low ability examinees when the ability distributions of the two

groups were unequal. When the multidimensional composites described by the

analytical estimation baseline are the data of interest, the number of

multidimensional items had little effect on the unidimensional equating with

randomly equivalent, normally distributed examinee groups. However, if the

unidimensional factor is the trait of interest, the number of multidimensional

items affected the equating outcomes, with results deteriorating as the number

of multidimensional items increased. When examinee groups were not

equivalent, equating results were affected in all conditions. Caution is advised

in applying unidimensional equating procedures when the examinee groups are

suspected of being from different ability levels.


CHAPTER 1
INTRODUCTION



In many large testing programs, examinees take one of multiple forms of

the same test. Although the different editions are constructed to be as similar in

content and difficulty as possible, it is inevitable that some differences will exist

among the various forms (Petersen, Cook, & Stocking, 1983). Direct

comparison of scores would, therefore, be unfair to an examinee who

happened to take a more difficult form. Because examinees are often in

competition or are being directly compared, it is important to transform the

scores in some way to make them equivalent.

Equating is the statistical process of establishing equivalent raw or scaled

scores on two or more test forms. Theoretically, the equating process adjusts

for test and item characteristics so the propensity distributions would be the

same regardless of which test form was administered. The application of

equating to real data, however, can be full of problems and complications

(Skaggs & Lissitz, 1986a). In practice, equating requires not only a knowledge

of statistical models, but awareness and consideration of many other issues that

have practical consequences for the use and interpretation of results. Brennan

and Kolen (1987) discussed many of these issues, such as the presence of

equating errors, specification of content, and security breaches.

Many mathematical procedures have emerged to develop the equating

transformations. Some are based on classical test theory while others arise

from item response theory (IRT). Classical methods, including linear and

equipercentile equating, do not seem robust to departures from optimal

conditions (Cook & Eignor, 1983; Livingston, Dorans, & Wright, 1990; Skaggs &

Lissitz, 1986b). Item response theory procedures, including equated bs,

concurrent calibration, and characteristic curve transformation, present

alternatives. Equating methods based on IRT have been found more accurate

than those based on classical models (Harris & Kolen, 1985; Hills, Subhiyah, &

Hirsch, 1988; Kolen, 1981; Marco, Petersen, & Stewart, 1983; Petersen, Cook,

& Stocking, 1983).

IRT models are grounded on strong assumptions, particularly that the

item responses are unidimensional (Ansley & Forsyth, 1985). The

unidimensionality assumption requires that each of the tests to be equated

measures the same underlying ability. Any other factor that influences an

examinee's score--such as guessing, speededness, cheating, item context, or

instructional sensitivity--will violate the unidimensionality assumption. Some of

these violations can be controlled, reduced, or eliminated, but the

unidimensionality assumption will still be violated in many practical testing

situations (Doody-Bogan & Yen, 1983).

Attempts have been made to model multidimensional responses within

the framework of IRT. Although these models describe multidimensional data

more accurately than unidimensional models, estimation of parameters is


complex and difficult in practice (Harrison, 1986). Test companies continue to

apply unidimensional equating procedures to their products. The viability of

using unidimensional models with multidimensional data must be explored to

determine the effect on the equating outcomes. An understanding of what effect

multidimensional data have on unidimensional equating results is of paramount

importance. Empirical studies (Camilli, Wang, & Fesq, 1995; Cook & Eignor,

1988; Dorans & Kingston, 1985; Yen, 1984) indicate that violation of the

unidimensionality assumption, while having some impact on results, may not be

significant. However, each of these studies employed data from a different test

and their content may have influenced findings in an unknown manner. The

number of multidimensional items and the degree of multidimensionality in each

is also unknown. Therefore, the generalization of results is difficult to interpret

across studies (Skaggs & Lissitz, 1986a). It is necessary to design research

studies that permit manipulation of independent variables to understand exactly

how violations of the unidimensionality assumption affect equating. Simulation

studies present a technique to manipulate and control the desired variables.

Purpose

The purpose of the present study was to investigate the effect of

multidimensional data in applying unidimensional IRT equating techniques. The

specific questions to be answered were:

1. Does the number of multidimensional items affect unidimensional

equating results?

2. Does the equating procedure affect unidimensional equating

results?

3. Do data simulated by using a compensatory model produce

different unidimensional equating results than data simulated by using a

noncompensatory model?

4. Are unidimensional equating results affected by the ability

distribution of the two examinee groups?

Limitations

Results of this study are applicable only to the research conditions

investigated. Generalizations to other item response theory models or other

equating techniques are not justified.

Significance of the Study

In practice, test publishers today apply unidimensional equating

techniques to their products. Because tests are expected to be

multidimensional to some degree and it is difficult to identify multidimensionality

accurately, it is important to investigate the effect of applying unidimensional

equating techniques to multidimensional data. Previous studies have mainly

explored unidimensional equating with empirical data that were suspected of

being multidimensional. Although the results indicated the impact of violating

the unidimensionality assumption may not be significant, the research designs

did not allow manipulation of independent variables. In addition, the true

multidimensionality of the underlying data was unknown in these empirical

studies.

The current simulation study allowed exploration of what effect

multidimensionality had on the results obtained from a variety of unidimensional

equating procedures while providing a means to manipulate variables. The

techniques used to generate the data afforded a mechanism to control the

dimensionality of the items and test forms. The specific questions investigated

were selected as having the most value for current practitioners applying

unidimensional equating procedures.


CHAPTER 2
REVIEW OF LITERATURE


Test Equating


Conditions for Equating

The purpose of equating is to establish a relationship between two test

forms so that it becomes a matter of indifference to the examinee which form is

taken. Petersen, Kolen, and Hoover (1989) stated that equating itself is simply

an empirical procedure which imposes no restrictions on the properties of scores

or on the method used to define the transformation. It is only when the purpose

of equating and the definition of equivalent scores are considered that restrictions

become necessary.

Lord (1980) outlined four conditions that must be met for the successful

equating of two test forms, X and Y. Briefly, the conditions are (a) equity, (b)

population invariance, (c) symmetry, and (d) same ability. To satisfy the equity

condition, it must make no difference to examinees at every ability level, θ, which

form of the test is taken. The conditional frequency distribution, f(x|θ), of the score

on form X should be the same as the conditional frequency distribution of the

transformed form Y score, f(y*|θ). Lord (1980) added that it is not sufficient for

equity that f(x|θ) and f(y*|θ) have the same means, but they must also have equal

variances. If the tests are not equally reliable, it is no longer a matter of

indifference which form is administered. The equity condition requires the

standard error of measurement and the higher moments to be the same after

transformation for examinees of identical ability. To fully satisfy this requirement,

test forms X and Y must be strictly parallel (Kolen, 1981). However, if this

condition is met, equating is no longer necessary.

In practice, it is nearly impossible to construct multiple forms that are

strictly parallel. Therefore, equating is needed. Although the equity condition can

never be met precisely, it serves to keep the purpose of equating in mind and

guide the steps in the process.

The population invariance and symmetry conditions also arise from the

desire to achieve equivalent scores. If the scores from form X and form Y are

equivalent, there is a one-to-one relationship between the two sets of scores.

The transformation must be unique, independent of the groups used to derive the

conversion (Petersen et al., 1989). The purpose of equating also requires that

the equating function be invertible or symmetric. The equating must be the same

regardless of which test is labelled X and which test is labelled Y (Lord, 1980).

The two tests to be equated must also measure the same characteristic,

whether defined as a latent trait, ability, or skill. This condition distinguishes true

equating from scaling. Scores on X and Y can always be placed on the same

scale, but they must measure the same construct to be considered equated

(Dorans, 1990).


It is unlikely that all conditions of equating can be met in practice.

However, good approximations to this ideal can be achieved and are usually

fairer to examinees than if no attempt at equating had occurred (Petersen et al.,

1989). Research conducted over the past 20 years serves as a guide in the

application and interpretation of equating transformations.

Data Collection Designs

Every equating consists of two parts--a data collection design and an

analytical method to determine the appropriate transformation. Three basic

sampling designs are most frequently described in the literature (Dorans, 1990;

Dorans & Kingston, 1985; Petersen et al., 1989). The designs are classified as

(a) single-group designs, (b) equivalent-groups designs, and (c) anchor-test

designs.

Single-group designs

In single-group designs, both forms or tests to be equated are given to

the same group of examinees. The difficulty levels of the tests are not

confounded with the differences in the ability levels of the groups taking each

test because the examinees are the same (Hambleton & Swaminathan, 1985).

However, Lord (1980) pointed out that the test administered second is not being

given under typical conditions. Practice effects and fatigue may affect the

equating process. To deal with this threat, the counterbalanced random-groups

design may be employed. The single group is divided into two random half-

groups. Both half-groups then take both tests in counterbalanced order, one

group taking the old form first and the other taking the new form first (Petersen

et al., 1989). Scores on both parallel forms are then equally affected by

learning, fatigue, and practice.

Equivalent-groups designs

With single-group designs, it is also important to administer both tests on

the same day so intervening experiences do not affect the results. However, it

is difficult in practice to arrange the required time block. Equivalent-groups

designs are a simple alternative. The two tests to be equated are given to two

different random groups from the same population. However, differences in the

ability distributions of the groups may introduce an unknown degree of bias

(Hambleton & Swaminathan, 1985). Because there are no common data, it is

impossible to adjust for any random differences (Petersen et al., 1989). Several

researchers have studied the effects of these different group ability distributions

on equating results.

Harris and Kolen (1986) investigated the effect of differences in group

ability on the equating of the American College Test (ACT) Math test. Although

their results showed score equivalents somewhat higher for low-ability students

and lower equivalent scores for high-ability examinees, the differences were not

significant. The authors concluded that the equatings were robust to even large

differences in group ability distributions.

Similar results were found by Angoff and Cowell (1986) when they

studied the population independence of equating transformations using

Graduate Record Examination (GRE) data. Some minor discrepancies were

discovered, but the majority were not significant in horizontal equating

situations.

Cook, Eignor, and Taft (1988) hypothesized that differences in ability

were expected when the groups took the two tests to be equated at different

times of the year. Two forms of the Biology achievement test were

administered. One form was given in the fall mainly to high school seniors, and

the other form was administered predominantly to sophomores in the spring.

Two fall administrations were also equated and studied. Because recency of

instruction is important in some parts of this type of achievement test and most

students study Biology in tenth grade, disparate results were attained from the

fall/spring equating. The spring sample, containing mostly students who had

just completed the subject tested, received higher scaled scores than the fall

sample. In this study, the construct measured by the test depended on the

sample of examinees to whom the test was administered. In contrast, the

fall/fall equating was robust to group differences. This study demonstrates the

importance of administering the test forms to be equated at the same time,

especially when the content is instructionally sensitive.

Anchor-test designs

Lord (1980) stated the differences between two samples of examinees

can be measured and controlled by administering to each examinee an anchor

test measuring the same ability as tests X and Y. When an anchor test is used,

equating may be carried out even when the two groups are not at the same ability

level. The groups may be random groups from the same population or they may

be nonequivalent or naturally occurring groups. The scores on the anchor test

can be used to estimate the performance of the combined group (Cook &

Petersen, 1987). The anchor test may be an internal part of both tests X and Y,

or it may be an external separate test. If an external anchor test is used, it should

be administered after X or Y to avoid practice effects on the tests to be equated

(Lord, 1980). The anchor-test design, while the most complicated of the data

collection methods, is the most common in real testing situations. Constraints of

time or available samples placed on large testing programs often require its use

(Skaggs & Lissitz, 1986a).

Properties of the anchor test can seriously affect the ensuing equating

results. Klein and Jarjoura (1985) studied the properties and characteristics of

anchor-test items in relation to the total test. A test of 250 items was equated

using three different anchor tests. Although all anchors were similar to the total

test in difficulty, only one of the anchor tests was representative of the total test

content. The results confirmed the importance of including items on the anchor

test that mirror as nearly as possible the content of the total test.

In addition to content representativeness, the relative position of items in

test books also seems to play an important role in anchor-test design. Kingston

and Dorans (1984) examined relative position effects of items in a version of the

GRE General Test. Although the equatings of the Verbal measure of the test

were in close agreement, the Quantitative and Analytical measures showed

sensitivity to relative item position. When possible, it is preferable to include the

anchor items spiralled throughout the test in their operational positions.

The length of the anchor test is another concern and the subject of several

studies. Klein and Kolen (1985) used a certification test to examine the

relationship between anchor test length and accuracy of equating results. The

authors used anchor tests of varying lengths and examinee groups both similar

and dissimilar in ability distribution. They concluded that when groups have

similar ability distributions, the anchor test length has little effect. However, as

group ability distributions become more dissimilar, longer anchor tests work best.

Klein and Kolen also found that anchor tests should correspond closely with the

total test in content representation, difficulty, and discrimination.

The study of Cook et al. (1988) is also pertinent to the question of anchor-

test length. When the groups differ in level of ability, as did the spring and fall

samples, different anchor test lengths yielded disparate results. In contrast, when

the groups have similar ability distributions, like the two fall samples, the

equatings are similar for different anchor test lengths.

When applying item response theory equating methods, anchor items are

usually referred to as linking items. These linking items are used to scale the

item parameter estimates. Equating with IRT requires that the item parameter

estimates for the two test forms be on the same scale before equating. The

quality of the equating depends largely on how well this item scaling is


accomplished (Cook & Petersen, 1987). Wingersky and Lord (1984) studied the

problem of the optimal number of linking items in the context of IRT concurrent

calibration. The authors concluded that two linking items with small standard

errors of estimation worked almost as well as a set of 25 linking items with large

standard errors of estimation.

Wingersky, Cook, and Eignor (1986) studied the characteristics of linking

items and their effects on IRT equating. Monte Carlo procedures were used with

parameter values set to imitate those estimated from the Verbal sections of the

College Board Scholastic Aptitude Test (SAT-V). These values were selected to

make the simulation as realistic as possible. Linking test lengths of 10, 20, and

40 items were used as well as variations in the size of the standard errors of

estimation and distributions of examinee ability. Scaling was accomplished by

both concurrent calibration and characteristic curve methods. The results of this

study showed little difference between the two scaling methods, and the accuracy

of both equating methods improved as the number of linking items increased.

Unlike the findings of Wingersky and Lord (1984), linking items having standard

errors of estimation similar to those found in actual SAT-V items provided slightly

better equating outcomes than those chosen to have small errors of estimation.

The studies reviewed clearly indicate that the properties of an anchor test

are of great concern. Anchor or linking items should remain in the same relative

positions in new and old forms and as many anchor items as possible should be

used (Cook & Eignor, 1988). The question of optimal anchor test length becomes

even more important as the ability distribution of the samples used in equating

become more dissimilar. Because anchor test designs are usually used in

situations where ability distributions of the groups may vary to an unknown

degree, the conclusions have important implications. The anchor test must also

closely mirror the total test to be equated in statistical properties and content

representativeness. As the correlation between scores on the anchor test and

the scores on the new and old forms becomes higher, the ensuing equating also

improves (Cook & Petersen, 1987).

Many factors may affect equating results. Because the purpose of

equating is to create a relationship between two tests so it makes no difference to

the examinee which test is administered, each of these factors must be carefully

considered in deciding on the equating design. Some general guidelines to

successful equating are summarized in Table 1. Only after these factors have

been carefully considered and the data have been collected, can a specific

equating method be chosen.

Equating Methods

Conventional Methods of Equating

Once the data have been collected using one of the data collection

designs reviewed, mathematical procedures are applied to the data to develop

the equating transformation. Many such methods exist, some based on classical

test theory and others on item response theory (IRT). The conventional methods,

those arising from classical test theory, may be categorized as linear equating or

equipercentile equating.

Table 1

Summary of Recommendations for a Successful Equating


Total Test

     Well-defined content specifications
     Item selection based on statistical data from field testing
     Length of at least 35 items

Examinees

     Sample size of at least 500
     Better results with groups similar in ability

Administrative

     Strictly controlled testing conditions
     Security of tests and items is maintained
     Scoring is controlled

Anchor Tests

     Representative of the total test in difficulty and discrimination
     Similar to the total test in content specifications
     Common items are in approximately the same position in the old and
       new forms
     Common items are identical in both forms
     About 20%-30% of total test length

Linear equating

In horizontal equating, the two tests to be equated are similar in

difficulty. When administered to the same group of examinees, the raw score

distributions are assumed to be different only with respect to the means and

standard deviations (Hambleton & Swaminathan, 1985). Linear equating is

based on this assumption. A transformation is identified such that scores on X

and Y are considered to be equated if they correspond to the same number of

standard deviations above or below the mean in some population. The two

scores are equivalent if


(x - μ_X) / σ_X = (y - μ_Y) / σ_Y        (1)

These scores will have the same percentile rank if the distributions are the

same (Crocker & Algina, 1986).

Many variations of linear equating models exist whose details may be

found in the literature (Angoff, 1971; Holland & Rubin, 1982; Marco et al.,

1983). Two of the more commonly used models are the Tucker model and the

Levine equally reliable model. Both of these procedures produce an equating

transformation of the form:


L_P(y) = Ay + B        (2)

where L_P(y) is the linear equating function for equating Y to X (Dorans, 1990).

Adaptations of this formula exist for dealing with an anchor test, usually

labelled V, when it is or is not part of the reported score. The difference

between the Tucker model and the Levine equally reliable model lies in their

underlying assumptions. Full discussions of these assumptions and

derivations of the appropriate formulas may be found in Dorans (1990).
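
To make the relationship concrete, the following minimal sketch (in Python, with hypothetical sample moments; it is not the Tucker or Levine procedure, which differ in how these moments are estimated) solves equation 1 for the form X equivalent of a form Y score, yielding the slope-intercept form of equation 2:

    def linear_equate(y, mean_x, sd_x, mean_y, sd_y):
        # Solve (x - mean_x)/sd_x = (y - mean_y)/sd_y for x, which gives
        # the slope-intercept form A*y + B of equation 2.
        a = sd_x / sd_y          # slope A
        b = mean_x - a * mean_y  # intercept B
        return a * y + b

    # Hypothetical moments: form X (M = 50, SD = 10), form Y (M = 47, SD = 8).
    # A form Y score of 55 is one SD above its mean, so its form X
    # equivalent is one SD above the form X mean: 60.0.
    print(linear_equate(55, 50, 10, 47, 8))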

Many studies have been conducted to assess the accuracy of linear

equating methods. Skaggs and Lissitz (1986b) carried out a simulation study

with an external anchor design. Both difficulty and discrimination values were

manipulated. The authors discovered unacceptable results with linear

equating when the discrimination means were unequal on the two tests.

Marco, Petersen, and Stewart (1983) used 40 different linear equating

models to transform SAT-V data. Both similar and dissimilar samples were

used, as well as variations of anchor test designs and characteristics of the

total tests. Some generalizations reached from the results of this ambitious

study are as follows:

1. When a test is equated to a test or form like itself through a parallel

anchor test and the ability distributions of the samples are identical, a linear

model yields very good results.

2. When a test is equated to a test or form like itself through an easy or

difficult anchor test with random samples, all of the models have a small mean

square error.

3. When samples with dissimilar ability distributions are used, linear

equating does not perform well.

4. When total tests differ in difficulty, linear models yield unsatisfactory

results.

Two methods of selecting samples and five methods of equating,

including two linear methods, were combined in a study by Livingston, Dorans,

and Wright (1990). Again, when the samples differed in ability distributions the

linear equatings were inaccurate, showing a large negative bias. Matching the

samples on the basis of the anchor test did little to improve the results. The

authors recommended dealing with ability differences by selecting a

representative sample from each population and choosing an equating method

that does not assume exchangeability for examinees based on their anchor

test scores.

Based on these studies, it can be seen that linear equating methods are

distribution dependent. Although linear equating may perform satisfactorily in

optimal conditions, it is likely to produce bias in real testing situations.

Equipercentile equating

In equipercentile equating, a transformation is chosen so that raw

scores on two tests are considered to be equated if they have the same

percentile rank (Angoff, 1971). This is based on the definition that score

scales are comparable for two tests if their respective score distributions are

identical in shape for some population (Braun & Holland, 1982). When this is

true, a table of pairs of raw scores can be constructed. Because the pairs of

raw scores are not necessarily numerically equal, it is necessary to transform

one set of scores into the other set or to convert both sets to a new score

(Petersen et al., 1989). In mathematical terms, the equipercentile equating

function for equating Y to X on population P is


E_P(y) = F_P^{-1}[G_P(y)]        (3)

where G_P(y) is the cumulative distribution of Y scores and F_P^{-1}(·) is the

inverse of the cumulative distribution of X scores, F_P(x). A cumulative

distribution function maps scores onto relative frequencies, while an inverse

cumulative distribution function maps the relative frequencies onto scores

(Dorans, 1990).
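
As a rough illustration of equation 3, the following minimal Python sketch uses hypothetical score distributions and linear interpolation; operational equipercentile equating instead works with percentile ranks, continuity corrections, and often smoothing:

    import numpy as np

    def equipercentile_equate(y, y_scores, y_freqs, x_scores, x_freqs):
        # G maps form Y scores onto cumulative proportions; F does the
        # same for form X. Equation 3 is then F^{-1}[G(y)], approximated
        # here by linear interpolation between adjacent score points.
        g = np.cumsum(y_freqs) / np.sum(y_freqs)
        f = np.cumsum(x_freqs) / np.sum(x_freqs)
        p = np.interp(y, y_scores, g)      # percentile rank of y under G
        return np.interp(p, f, x_scores)   # F^{-1} of that rank

    # Hypothetical five-point raw score distributions for the two forms.
    print(equipercentile_equate(2, [0, 1, 2, 3, 4], [5, 10, 20, 10, 5],
                                [0, 1, 2, 3, 4], [2, 8, 15, 15, 10]))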

As a mathematical model, equipercentile equating makes no

assumptions about the tests to be equated. It simply compresses and

stretches the score units on one test so that its raw score distribution matches

the second test. It is only consideration of the purpose of equating and the

desired condition of population invariance that prevents its application to tests

measuring different constructs (Petersen et al., 1989).

Generally, empirical studies have shown mixed results in assessing the

accuracy of equipercentile equating. Livingston, Dorans, and Wright (1990)

included an equipercentile equating method in their study. A composite of two

equipercentile equatings, the procedure worked well in most situations.

Similarly, the equipercentile equating produced acceptable results in all

combinations of conditions in the Skaggs and Lissitz (1986b) study.

On the other hand, in the investigation conducted by Petersen et al.

(1983) using SAT data, equipercentile equating was studied along with the

Tucker Equally Reliable and Levine Unequally Reliable linear models and three

IRT methods. The equipercentile equating produced the worst results of all

the methods investigated. This was especially true for the Verbal Test.

In a 1983 study by Cook and Eignor reported in Skaggs and Lissitz

(1986a), alternate forms of the biology, mathematics, and social studies

achievement tests of the GRE were equated using various procedures. Again,

results varied by test content, but the equipercentile method was inadequate in

all cases. Cook and Eignor felt that equipercentile equating may have suffered

from a lack of data at the extreme scores.

The Cook et al. (1988) equatings with biology achievement test data

also uncovered mixed results. Although the equipercentile equating method

performed adequately with the parallel fall-to-fall samples, it was not sufficiently

robust to the ability differences found in equating the fall and spring samples.

These mixed findings raise some concerns about the application of

equipercentile equating. When raw scores are used, this method does not

meet the conditions for equating. Hambleton and Swaminathan (1985) noted

that a nonlinear transformation is needed to equalize the moments of the two

distributions, resulting in a nonlinear relationship between the raw scores and

the true scores. In turn, this implies that the tests are not equally reliable and it

is no longer a matter of indifference to the examinee which form is taken.

Besides violating the equity condition, the equipercentile equating process is

population dependent.


For the past forty years, large scale testing programs publishing multiple

forms of examinations have used an equating process. Until recently, most

have employed one of the conventional linear or equipercentile procedures

described. But recent psychometric developments have presented an

alternative.

Equating Methods Based on Item Response Theory

Item response theory

A brief introduction to item response theory is essential to an

understanding of the following equating procedures. Item response theory

(IRT) is an attempt to model an examinee's performance on a test item as a

function of the characteristics of the item and the examinee's ability on some

unobserved, or latent, trait. The IRT model specifies the relationship between

a latent trait and the observed performance on items designed to measure that

trait.

This relationship can then be depicted graphically by an item

characteristic curve (ICC). The ICC depicts the probability that an examinee at

any given ability level will make a correct response to an item. The graph is

typically an S-shaped curve with ability, symbolized by θ, plotted on the

horizontal axis and the probability of a correct response to item i, P_i(θ), plotted

on the vertical axis.

Many different mathematical models may be used to depict this

functional relationship. Most common in practice are the logistic class of

models due to the ease of estimation. Birnbaum (1968) proposed a two-

parameter logistic model (2PL) of the form


P_i(θ) = [1 + e^{-Da_i(θ - b_i)}]^{-1}        (4)

where b_i is the difficulty value, a_i is the discrimination parameter, and D is a

scaling factor, normally 1.7.

The three-parameter logistic model (3PL) adds a third parameter,

denoted ci, referred to as the lower asymptote. The mathematical form of the

3PL model is written as


P_i(θ) = c_i + (1 - c_i)[1 + e^{-Da_i(θ - b_i)}]^{-1}        (5)

with the a_i, b_i, and D defined as before. The value of c_i is typically smaller

than the value that would result if examinees were to make a random response

to the item (Hambleton & Swaminathan, 1985). Figure 1 depicts an ICC based

on the 3PL model.

The one-parameter logistic model, or Rasch model, assumes all items

have equal discrimination and no guessing occurs. This model is written


P_i(θ) = [1 + e^{-D(θ - b_i)}]^{-1}        (6)

where the parameters are defined as in the previous models.
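
All three logistic models can be computed with one function, since the 2PL is the 3PL with c_i = 0, and the one-parameter model additionally fixes a_i = 1. A minimal Python sketch (the function name p_3pl and the example parameter values are illustrative only):

    import math

    D = 1.7  # scaling factor

    def p_3pl(theta, a, b, c=0.0):
        # Equation 5; with c = 0 it reduces to the 2PL of equation 4, and
        # with a = 1 and c = 0 to the one-parameter model of equation 6.
        return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

    # An examinee at theta = 0 and an item with a = 1.0, b = 0.0, c = 0.2:
    # P = 0.2 + 0.8 * 0.5 = 0.6.
    print(p_3pl(0.0, 1.0, 0.0, 0.2))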

Cursory examination of the three IRT logistic models may lead to the

conclusion that they form a type of hierarchy from least to most specific.

However, the three models represent very different philosophical perspectives


of measurement theory (Skaggs & Lissitz, 1986a). It is these differences that

must be considered when selecting a model for a particular application.


[Figure 1 appears here: an S-shaped item characteristic curve with ABILITY on the horizontal axis and the probability of a correct response on the vertical axis; the slope is maximized at b.]

Figure 1. An item characteristic curve (ICC) based on the three-parameter
logistic model


The use of any of the IRT models entails restrictive assumptions about

the item response process. Briefly stated, the major assumptions of IRT are

as follows:

1. The ICC accurately represents the data.

2. The data are unidimensional.

3. Responses are locally independent (Skaggs & Lissitz, 1986a).

An ICC is defined completely when its general form is specified and

when the parameters of a particular item are known (Hambleton &

Swaminathan, 1985). This leads to the basic advantage of IRT models. When

the data fit the model reasonably well, it is possible to demonstrate the

invariance of item and ability parameters. When the item parameters are

known, an examinee's ability may be estimated from any subset of the items.

Also, item parameters may be calibrated with any sample drawn from a

sufficiently large population (Skaggs & Lissitz, 1986a). These advantages

cannot be derived from classical test theory and should have tremendous

consequences for equating with item response theory.

All of the practical IRT models are based on the unidimensionality

assumption. This states that the probability of a correct response by

examinees to a set of items can be mathematically modeled by using only one

ability parameter (Kingston & Dorans, 1984). According to Lord (1980), while

ability is probably not normally distributed for most groups of examinees,

unidimensionality is a property of the items and does not cease to exist

because the examinee group is changed in distribution.

Because the items on a test are assumed to measure only one common

trait, for all examinees with the same ability the item responses are

independent of one another. This is the local independence assumption. The

probability of success on any given item depends on the item parameters,

examinee ability, and nothing else. In determining the probability of a correct

response to a specific item, success or failure on other items will add no new

information if ability is known (Lord, 1980).
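
Local independence is what makes the likelihood of an entire response pattern tractable: at a given θ it is simply the product of the item probabilities. A minimal sketch, reusing the hypothetical p_3pl function from the sketch above with illustrative item parameters:

    def pattern_likelihood(theta, items, responses):
        # Under local independence, the likelihood of a response pattern
        # at a given theta is the product over items of P_i(theta) for a
        # correct response (1) and 1 - P_i(theta) for an incorrect one (0).
        like = 1.0
        for (a, b, c), u in zip(items, responses):
            p = p_3pl(theta, a, b, c)
            like *= p if u == 1 else 1.0 - p
        return like

    # Three hypothetical items and the pattern (correct, correct, incorrect).
    items = [(1.0, -0.5, 0.2), (1.2, 0.0, 0.2), (0.8, 0.5, 0.2)]
    print(pattern_likelihood(0.0, items, [1, 1, 0]))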

Good estimation of the item and ability parameters is of paramount

importance in describing the data accurately. Many investigators have

explored the effect of the number of items and the number of examinees on

parameter estimation for IRT models. The results of these studies varied

according to the estimation procedure used. Available estimation methods

include (a) joint maximum likelihood estimation (JML), (b) conditional maximum

likelihood estimation (CML), (c) marginal maximum likelihood estimation

(MML), and (d) Bayesian estimation (BE). Full explanations of the various

procedures may be found in Hambleton and Swaminathan (1985).

Much of the research on parameter estimation employed the JML

procedure as implemented by the computer program LOGIST (Wood,

Wingersky, & Lord, 1976). These reports will not be reviewed here, but the

interested reader is referred to Harrison (1986), Hulin et al. (1982), Lord

(1968), Ree (1979), Swaminathan and Gifford (1983, 1985), and Wingersky

and Lord (1984). In general, a sample size of at least 1,000 and test length of

50 or more items is required for acceptable estimation with the JML procedure

of LOGIST. One major problem uncovered by these studies is that consistent

estimates of the item parameters cannot be obtained in the presence of

examinee (θ) parameters because the latter increase with sample size (Baker,

1990).

This problem can be overcome by using the MML procedure

implemented in the BILOG computer program (Mislevy & Bock, 1987). The

examinee's θ parameters are removed from item parameter estimation by

integrating them over an assumed unit normal prior distribution. At this point in


the procedure, it is not the θ of each examinee that has been estimated, but

the form of the θ distribution. The item parameters are first estimated, followed

by the θ parameters at a later stage (Baker, 1990).

In addition to MML, the BILOG program allows Bayesian maximum a

posteriori estimation (MAP) and Bayesian expected a posteriori estimation

(EAP) of θ parameters. Mislevy and Stocking (1989) have recommended the

EAP procedure with a unit normal prior for the θ distribution. Specifying this

prior for abilities limits extreme values of the θ estimates and the resulting

variances will tend to be smaller than with MML. When the value of the

variance is smaller, the prior distribution becomes more concentrated and pulls

the estimated parameters toward the mean of the distribution.

Yen (1987) compared LOGIST and BILOG for accuracy of item

parameter estimation. Test lengths of 10, 20, and 40 items were simulated

with a sample of 1,000 examinees. The ability distributions examined were

normal, positively skewed, negatively skewed, and symmetric. Item difficulty

was also manipulated. The BILOG estimates were more accurate than those

of LOGIST in almost every situation. The advantage of BILOG was even more

pronounced for the small item set. Although ability distribution had no

substantial effect on the estimation of the ICCs, discrimination and pseudo-

chance parameters were somewhat inaccurate with BILOG in the case of the

negatively skewed distribution.

In addition to investigating the effect test length had on item and ability

parameter estimates derived from LOGIST and BILOG procedures, Qualls and

Ansley (1985) studied the sample size effect. Sample sizes of 200, 500, and

1,000 examinees with a normal ability distribution were combined with test

lengths of 10, 20, and 30 items. As sample size increased, both procedures

produced estimates more highly correlated with the simulated values. The

BILOG estimates were slightly better in all cases and superior in the

combination of small sample size with 10 items.

Buhr and Algina (1986) used BILOG with four methods of estimation

and sample sizes of 250, 500, 750, and 1,000 to study the similarity of

estimation. The Bayesian procedures were the most robust in dealing with

different ability distributions. Estimation with all procedures improved

substantially as sample size increased to 500, but showed little additional

effect as sample size increased further.

Baker (1990) simulated item response data based on a 45-item test with

500 examinees to study the pattern of estimation results as a function of the

various analysis operations. The data were analyzed under the options

available in BILOG and the obtained parameter estimates were equated back

to the true metric. The equated results were generally very close to the true

parameters. The item parameters were only slightly affected by the

characteristics of various priors. The equated means of the estimated θs were








somewhat higher than the true values, both when priors were and were not

imposed on the item discrimination.

IRT Equating

Nothing in IRT contradicts the basic conclusions of classical test theory.

Additional assumptions are made that allow answers not available under

classical test theory (Lord, 1980). The theoretical advantage of IRT models is

that once a set of items has been fitted to an IRT model, it is possible to

estimate the ability of examinees who have taken a different set of items. To

accomplish this, the items must be measuring the same latent trait and must

be on the same scale (Petersen et al., 1989). When this is true and the item

parameters are known, it will make no difference to the examinee what subset

of items is administered. Therefore, in the context of IRT, equating is not

necessary (Hambleton & Swaminathan, 1985).

However, when both item and ability parameters are unknown, it is

necessary to choose an arbitrary metric for either the ability parameter θ or the

item difficulty bi. Because all the models for Pi(θ) are functions of the

quantity ai(θ - bi), the same constant may be added to every θ and bi without

changing the item response function Pi(θ). Additionally, every θ and bi may

be multiplied by a constant and every ai divided by the same constant without

changing the quantities ai(θ - bi) and Pi(θ). Therefore, the origin and unit of

measurement of the ability scale are arbitrary and any scale for θ may be








chosen as long as the same scale is chosen for bi (Petersen et al., 1989).

This is referred to as indeterminacy of the parameter scale.
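The indeterminacy is easy to verify directly. If every θ is replaced by θ* = Aθ + B and every bi by bi* = Abi + B, while every ai is replaced by ai* = ai/A, then

a_i^{*}(\theta^{*} - b_i^{*}) = \frac{a_i}{A}\left[(A\theta + B) - (Ab_i + B)\right] = a_i(\theta - b_i),

so every Pi(θ) is unchanged.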

If the parameters of a set of items are estimated separately for two

different groups of examinees, the item parameters may appear to be different

due to the arbitrary fixing of the metric for θ or bi. However, the two sets of θ

and bi estimates should have a linear relationship to each other (Hambleton &

Swaminathan, 1985). The ai values should be the same except for differences in

unit of measurement and, in the 3PL case, the ci values remain unaffected

(Petersen et al., 1989).

The advantages of IRT equating are most useful in the case where

groups taking the two tests are nonrandom or intact groups (Crocker & Algina,

1986). Consequently, the following discussion will emphasize uses of IRT

equating with an anchor test design. However, item response theory

procedures may also be used with single-group or equivalent groups designs.

An anchor or linking test is one method available to put the parameters

for the two tests on the same scale. Four procedures commonly used with this

method are (a) concurrent calibration, (b) the fixed bs method, (c) the equated

bs method, and (d) the characteristic curve transformation method.

In concurrent calibration, parameters for the two tests are estimated

simultaneously. The linking items, or sometimes common subjects, serve to

unite the two tests and result in item parameter estimates on a common

scale. This allows direct equating of the two tests (Petersen et al., 1989).








The parameters of each total test-anchor test combination are

estimated sequentially in the fixed bs method. After the item parameters have

been estimated for one test, the item difficulties of the linking items obtained

from the first calibration are used as input for the estimation of parameters on

the second test. The linking item parameters are not reestimated. The end

result is item parameters for both tests being placed on the same scale

(Petersen, Cook, & Stocking, 1983).

In the equated bs method, the parameters for each test are estimated

separately. Then the means and standard deviations of the difficulties for the

two sets of linking items are set to be equal. Ability estimates could also be

used for this purpose. This linear transformation is then applied to the ai, bi,

and θ parameters of the second test (Petersen et al., 1989). Several

variations of the transformation, including the mean and sigma method and the

robust mean and sigma method, are described in Hambleton and

Swaminathan (1985). Also, Stocking and Lord (1983) described a modification

which gives lower weights to poorly estimated parameters and outliers.
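To make the mean and sigma method concrete, the following is a minimal sketch in Python (the studies reviewed here worked with programs such as LOGIST and BILOG; the numpy code and example values are illustrative assumptions, not any study's implementation):

    import numpy as np

    def mean_sigma(b_target, b_from):
        # Choose A and B so that A*b_from + B matches the mean and standard
        # deviation of the linking-item difficulties on the target scale.
        A = np.std(b_target) / np.std(b_from)
        B = np.mean(b_target) - A * np.mean(b_from)
        return A, B

    def rescale(a, b, theta, A, B):
        # Apply the linear transformation to the second test's parameters.
        return a / A, A * b + B, A * theta + B

    # Difficulties of the same linking items from two separate calibrations.
    b_run1 = np.array([-1.2, -0.6, 0.0, 0.4, 0.9, 1.5])
    b_run2 = 0.8 * b_run1 - 0.4        # a linear distortion of the same scale
    A, B = mean_sigma(b_run1, b_run2)  # recovers A = 1.25, B = 0.5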

It is most common in both the fixed bs and equated bs methods to use

only the relationship for item difficulties to obtain the equating function

(Hambleton & Swaminathan, 1985). The characteristic curve method can

prevent the possible loss of information caused by ignoring the discrimination

relationship. For the characteristic curve method, the parameters of each test

are calibrated separately. All parameters are then placed on the same scale







by using the two sets of parameter estimates from the common items. A linear

transformation is obtained from minimizing the difference between the true

scores on the linking items. This transformation is then applied to the ai, bi,

and θ parameters of the second test (Stocking & Lord, 1983). Because it

takes all information into account, this procedure is theoretically an

improvement over the previous methods.
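A sketch of the characteristic curve idea in Python follows: it finds the linear transformation minimizing the squared difference between the linking items' true scores over a grid of ability values. The grid, the D = 1.7 constant, the toy parameter values, and the use of scipy's Nelder-Mead optimizer are assumptions for illustration, not details of Stocking and Lord's implementation:

    import numpy as np
    from scipy.optimize import minimize

    D = 1.7

    def tcc(theta, a, b):
        # True score on the linking items at each theta (sum of 2PL ICCs).
        z = D * a[None, :] * (theta[:, None] - b[None, :])
        return (1.0 / (1.0 + np.exp(-z))).sum(axis=1)

    def loss(params, theta, a1, b1, a2, b2):
        # Squared true-score difference after putting form 2 on form 1's scale.
        A, B = params
        return np.mean((tcc(theta, a1, b1) - tcc(theta, a2 / A, A * b2 + B)) ** 2)

    theta = np.linspace(-3, 3, 31)
    a1 = np.array([1.00, 1.30, 0.80]); b1 = np.array([-0.500, 0.200, 1.000])
    a2 = np.array([0.80, 1.04, 0.64]); b2 = np.array([-0.875, 0.000, 1.000])
    fit = minimize(loss, x0=[1.0, 0.0], args=(theta, a1, b1, a2, b2),
                   method="Nelder-Mead")
    A, B = fit.x   # close to A = 0.8, B = 0.2 for these made-up values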

Sometimes the reporting of abilities in terms of θ is unacceptable. In

these situations, the θ value from a test may be converted to its corresponding

true score ξ through


\xi = \sum_{i=1}^{n} P_i(\theta)          (7)


where n is the number of items on the test. Equating of the true scores on the

two tests is then possible (Hambleton & Swaminathan, 1985). The true score

on one test is said to be equated to the true score on a second test if each

corresponds to the same ability level, or if


\xi = \sum_{i=1}^{n} P_i(\theta), \qquad \eta = \sum_{j=1}^{m} P_j(\theta)          (8)

(Skaggs & Lissitz, 1986a). In practice, estimated item parameters are used to

approximate Pi(θ) and Pj(θ). Paired values of ξ and η are then computed by

substituting a series of arbitrary values for θ into Equation 8 and calculating ξ

and η for each θ. These paired values define ξ as a function of η and

constitute an equating of these true scores (Lord, 1980).
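A small numerical sketch in Python may help (the three-item tests and all parameter values are invented for illustration):

    import numpy as np

    D = 1.7

    def true_score(theta, a, b):
        # Equation 7: xi(theta) is the sum of the item response functions.
        z = D * a[None, :] * (theta[:, None] - b[None, :])
        return (1.0 / (1.0 + np.exp(-z))).sum(axis=1)

    a_x = np.array([0.9, 1.1, 1.4]); b_x = np.array([-0.6, 0.1, 0.9])  # test X
    a_y = np.array([1.0, 0.8, 1.2]); b_y = np.array([-0.2, 0.4, 1.1])  # test Y

    theta = np.linspace(-4, 4, 81)     # arbitrary series of theta values
    xi = true_score(theta, a_x, b_x)
    eta = true_score(theta, a_y, b_y)
    # Each pair (xi[k], eta[k]) corresponds to the same theta, so tabulating
    # the pairs defines xi as a function of eta: the true-score equating.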







The relationship between raw scores and true scores on two tests is not

necessarily the same, nor is an equating provided for individuals scoring below

the chance level (Petersen et al., 1989). Observed-score equating provides a

method of predicting the raw-score distribution of a test. This procedure uses

probabilities of correct responses under an IRT model to generate a

hypothetical joint distribution of item responses from all examinees taking both

tests. Conventional equipercentile equating is then applied to the new

distributions (Skaggs & Lissitz, 1986a). Neither true-score nor observed-score

equating is applied often in practice. Both are complicated to calculate and

expensive to implement.

Many researchers have investigated the accuracy of IRT equating

methods using the various IRT models and procedures. Comparison of IRT

equating with conventional methods is also common. Marco, Petersen, and

Stewart (1983) examined the Rasch and 3PL models along with the four linear

and two equipercentile equating methods previously discussed. A variety of

conditions, including random and dissimilar samples, internal and external

anchors, and difficulty levels of the anchor tests were also studied. The two

IRT methods worked well, both with an external anchor test equal in difficulty

to the total test and with an internal anchor. With the external anchor test, the

Rasch results were slightly better than with any of the other equating methods

investigated. Both IRT models were clearly superior to the conventional

equating methods when the samples differed in ability distributions, but neither








the Rasch nor the 3PL model showed superiority to the other under the

conditions studied.

Kolen (1981) explored true-score and observed-score equating methods

as well as a linear and an equipercentile equating method. The Rasch, 2PL,

and 3PL models were used for the IRT equatings. The two forms of the Iowa

Test of Educational Development to be equated had no common items. Each

test had been administered to a random sample. The true-score method for

the 3PL model produced the best results. When only quantitative items were

equated, the Rasch true-score combination also worked well.

Kolen and Whitney (1982) used the General Educational Development

Tests (GED) with the Rasch, 2PL, and 3PL IRT models and an equipercentile

equating method. They found with small samples (N < 198) a number of

extreme item parameter estimates were produced by the 3PL model which

seriously affected the equating.

In the Petersen, Cook, and Stocking (1983) study discussed earlier in

the context of conventional equating, a 3PL model was also examined using

concurrent calibration, the fixed bs method, and the characteristic curve

transformation. For the SAT-V, all IRT models and methods outperformed

linear and equipercentile equatings. Both conventional and IRT methods

yielded acceptable results for the mathematics test. Concurrent calibration

with the 3PL model produced the least amount of error.








Harris and Kolen (1985) compared conventional equating methods with

IRT 3PL model equating. The sample consisted of high and low ability

examinees. The 3PL model was found to be slightly superior.

The Cook, Eignor, and Taft (1988) study using biology achievement

tests administered at different points in time included a 3PL model with the

characteristic curve transformation in addition to the equipercentile equating

method. The authors concluded that the IRT results, although slightly superior

with the fall-to-spring sample equating, basically paralleled the results obtained

with the conventional method.

A minimum-competency test, Florida's Statewide Student Assessment

Test, Part II (SSAT-II) was equated by Hills, Subhiyah, and Hirsch (1988).

Their purpose was to study the effect of anchor length on equating and

compare different equating methods using a sample with a negatively skewed

distribution. The equating methods investigated were linear, Rasch, and 3PL.

The IRT models were equated with concurrent calibration, fixed bs method,

and equated bs method using robust mean and sigma. The authors concluded

that the 3PL model with concurrent calibration and the Rasch model gave similarly

good results. Also, when using the 3PL model with concurrent calibration, an

anchor test length of 10 items was found to be sufficient for good equating

outcomes.

Results of these studies indicate that the 3PL model tends to perform

better than conventional and Rasch equating in a variety of situations.








Equating with IRT appears to produce better results than conventional

equating methods, especially when the ability distribution of the two groups is

dissimilar. Concurrent calibration and characteristic curve transformation were

the preferred methods of scaling, although fewer linking items are required with

concurrent calibration. Table 2 contains a summary of the equating studies

reviewed here.

Multidimensionality

Violation of the Unidimensionality Assumption

The mathematical models upon which IRT is based are grounded on

very strong assumptions, particularly that item responses are unidimensional

(Ansley & Forsyth, 1985). The unidimensionality assumption requires that

each of the tests to be equated onto a common scale must measure the same

underlying trait or ability. Any factor that influences an examinee's score, other

than the one assumed latent trait, will violate the unidimensionality assumption.

Although IRT explicitly acknowledges this assumption, other commonly used

procedures that transform scores, such as equipercentile equating, are also

unidimensional even if not stated specifically (Hirsch, 1989). This can be seen

by reviewing the required conditions for equating.

There are many factors that may cause multidimensionality, such as

guessing, speededness, fatigue, cheating, random answering, instructional

sensitivity, or item context and content. Two or more cognitive traits may

influence an examinee's response to an item. For example, reading










Table 2

Summary of Unidimensional IRT Test Equating Studies

Study                      Tests                    Equating Models                Independent Variables

Cook & Eignor (1983)       CB-achievement           3PL, equipercentile, linear    equating models;
                                                                                   scaling methods

Cook, Eignor, &            Biology achievement      3PL, equipercentile            dissimilar samples;
Taft (1988)                                                                        equating models

Harris & Kolen (1986)      ACT-Math                 3PL, equipercentile, linear    equating models;
                                                                                   dissimilar samples

Hills, Subhiyah,           SSAT-II                  Rasch, 3PL, linear             equating models; negatively
& Hirsch (1988)                                                                    skewed distribution; anchor
                                                                                   length; scaling models

Kolen (1981)               ITED: Math &             Rasch, 2PL, 3PL,               equating models;
                           Vocabulary               equipercentile, linear         item context

Kolen & Whitney (1982)     GED                      Rasch, 3PL, linear,            equating models
                                                    equipercentile

Marco, Petersen,           SAT-V                    Rasch, 3PL, linear,            ability distribution; internal
& Stewart (1983)                                    equipercentile                 & external anchor; difficulty
                                                                                   of anchor

Petersen, Cook, &          SAT-V, SAT-Q             3PL, linear,                   equating models;
Stocking (1983)                                     equipercentile                 scaling models (3PL)








skill may be required to correctly answer a mathematical item. Some of these

violations can be controlled, reduced, or eliminated, but the unidimensionality

assumption will still be violated in many practical situations (Doody-Bogan &

Yen, 1983). Achievement tests are not constructed using methods that yield

factor pure instruments. Instead, a table of specifications is customarily

developed and items are written to match the specifications. These items

rarely measure a single trait (Reckase, 1979). Due to the many possible

causes leading to violation of the unidimensionality assumption, it can be

concluded that dimensionality is a joint property of both the item set and the

particular sample of examinees (Hattie, 1985).

Multidimensional Models

Recently, attempts have been made to model multidimensional

responses within the framework of IRT. Several multidimensional item

response theory (MIRT) models have been proposed. Although

multidimensional versions of all three logistic parameter IRT models have been

derived, only the multidimensional two-parameter logistic (M2PL) model will be

discussed.

Doody-Bogan and Yen (1983) described a multidimensional model of

the form


P_j(\theta_i) = \frac{1}{1 + \exp\left[-D \sum_{h=1}^{m} a_{jh}(\theta_{ih} - b_{jh})\right]}          (9)








where θih is the ability parameter for person i for dimension h; ajh is the

discrimination parameter for item j for dimension h; bjh is the difficulty

parameter for item j for dimension h; and D is the scaling constant, 1.7.

Another model discussed by Sympson (1978) is defined


P_j(\theta_i) = \prod_{h=1}^{m} \frac{1}{1 + \exp\left[-D\, a_{jh}(\theta_{ih} - b_{jh})\right]}          (10)

where all parameters are defined as above.

These two models can be distinguished by comparing their

denominators. The Doody-Bogan and Yen model contains no product of

probabilities in the denominator as does the Sympson model. Equation 9 can

be classified as a compensatory model that permits high ability on one

dimension to compensate for low ability on another dimension in terms of the

probability of a correct response. If dimensionality is considered in the context

of factor analysis, a two-dimensional test has a group of items measuring each

dimension. A compensatory model seems reasonable because the test is

being considered as a whole (Ansley & Forsyth, 1985).

The second model, defined by Equation 10, is called a

noncompensatory model where high abilities on one factor are not allowed to

supplement low abilities on the second factor. When a two-dimensional test is

considered as one that requires simultaneous application of the two abilities to

answer each item correctly, the noncompensatory model seems more

appropriate (Ansley & Forsyth, 1985).
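The difference is easy to see numerically. Below is a minimal Python sketch of Equations 9 and 10 for a single two-dimensional item (the parameter values are arbitrary):

    import numpy as np

    D = 1.7

    def p_comp(theta, a, b):
        # Compensatory (Equation 9): one logistic of the summed terms.
        return 1.0 / (1.0 + np.exp(-D * np.sum(a * (theta - b))))

    def p_noncomp(theta, a, b):
        # Noncompensatory (Equation 10): product of per-dimension logistics.
        return np.prod(1.0 / (1.0 + np.exp(-D * a * (theta - b))))

    a = np.array([1.0, 1.0]); b = np.array([0.0, 0.0])
    theta = np.array([2.0, -2.0])     # strong on trait 1, weak on trait 2
    p_comp(theta, a, b)               # 0.50: the high ability compensates
    p_noncomp(theta, a, b)            # about 0.03: the weak trait dominates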








Reckase (1985) has alternately defined the compensatory M2PL to

provide a simple framework for specifying and generating multidimensional

item response data. This model defines the probability of a correct response

as


P(x_{ij} = 1 \mid a_j, d_j, \theta_i) = \frac{\exp(a_j \theta_i' + d_j)}{1 + \exp(a_j \theta_i' + d_j)}          (11)


where aj is a vector of discrimination parameters; dj is related to item difficulty;

and θi is a vector of ability parameters. The exponent can also be written as



\sum_{h=1}^{m} a_{jh}(\theta_{ih} - b_{jh})          (12)

where m is the number of dimensions; ajh is an element of aj; θih is an element

of θi; and dj = -Σh ajh bjh. When this form is used, the relationship to the more

familiar expression in Equation 9 can be seen.

The data described by a multidimensional IRT model can be depicted

graphically by an item response surface (IRS). Figure 2 presents an IRS for

an M2PL item. The IRS increases monotonically as the elements of θi

increase (Reckase, 1985).

To identify the multidimensional item difficulty (MID) for an item, the

point in the IRS where the item is most discriminating must be found. This

point, which provides the maximum information about an examinee, will have

the greatest slope. Because the slope along the IRS can differ according to








the direction taken, Reckase (1985) determined the slope using the direction

from the origin of the θ space to the point of highest discrimination.


Figure 2. An item response surface (IRS) based on the compensatory M2PL.




To accomplish this analysis, the model given in Equation 11 is

translated to polar coordinates, replacing each θih by θi cos αh, where θi is the

distance from the origin to θ and αh is the angle from the hth axis to the

maximum information point (Reckase, 1985). In a two-dimensional item, the

value of αh can range between 0° and 90° depending on the degree to which

the item measures the two traits. If the item only measures the first trait, α1

equals 0°, while α1 = 90° would depict an item measuring only the second trait.

The relationship between αh and discrimination element ajh can then be stated

as












\cos \alpha_{h} = \frac{a_{jh}}{\sqrt{\sum_{k=1}^{m} a_{jk}^{2}}}          (13)

The MID parameters can now be expressed as


\mathrm{MID}_j = \frac{-d_j}{\sqrt{\sum_{h=1}^{m} a_{jh}^{2}}}          (14)


Finally, an item that requires two abilities for a correct response can be

represented as a vector in the two-dimensional latent ability space. The length

of the vector for an item is equal to the degree of multidimensional

discrimination (MDISC) (Ackerman, 1991). Reckase (1985) expressed MDISC

as


\mathrm{MDISC}_j = \sqrt{\sum_{h=1}^{m} a_{jh}^{2}}          (15)


These equations provide an excellent framework for manipulating conditions

during generation of multidimensional data.
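As a quick illustration, the following Python fragment computes these summaries for a single item; the check values come from an item that appears later, in Table 5 of Chapter 3, and numpy is an illustrative stand-in for the computations:

    import numpy as np

    def mirt_summary(a, d):
        # Equations 13-15 for an M2PL item.
        mdisc = np.sqrt(np.sum(a ** 2))            # MDISC (Equation 15)
        alpha = np.degrees(np.arccos(a / mdisc))   # angles from each axis (Eq. 13)
        mid = -d / mdisc                           # MID (Equation 14)
        return mdisc, alpha, mid

    # Check against item 4 of Table 5: a = (0.736, 1.275), d = 1.199
    mirt_summary(np.array([0.736, 1.275]), 1.199)
    # gives MDISC = 1.472, angles of about (60, 30) degrees, MID = -0.814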

Many indices have been developed to assess the dimensionality of a

test and test items. Hattie (1985) examined over 30 of these indices which

were grouped into methods based on (a) answer patterns, (b) reliability, (c)

principal components, (d) factor analysis, and (e) latent traits. Hattie

concluded that none of the indices were satisfactory and only four could even








distinguish unidimensional from multidimensional data sets. A major problem

encountered by Hattie in assessing the indices was that unidimensionality was

often confused with reliability, internal consistency, and homogeneity.

More recently, other procedures have been developed to assess the

dimensionality of latent traits. Roznowski, Tucker, and Humphreys (1991)

explored several of these indices. Procedures based on the shape of the

curve of successive eigenvalues were found to be unsatisfactory under most

conditions. A pattern index of second factor loadings was accurate except with

high obliqueness. The most accurate index in this study was based on local

independence. The use of this index is particularly recommended with large

samples and many items.

Linear factor analysis has been widely used to assess dimensionality of

dichotomous items. However, use of phi correlations often leads to

overestimation of the number of factors underlying the responses by

confounding factor coefficients with item difficulties (Bock, Gibbons, & Muraki,

1988; Hambleton & Swaminathan, 1985). Tetrachoric correlations may be

substituted, but may still be confounded with item difficulty or guessing in real

data (Camilli, 1992). Bock, Gibbons, and Muraki (1988) have developed a

maximum likelihood full information factor analysis procedure as an attempt to

deal with these problems.

Another approach to dimensionality taken by Stout (1990) replaced the

strong assumptions of unidimensionality and local independence with less








restrictive assumptions of essential unidimensionality and essential

independence. Stout contended that a dominant dimension results when an

attribute overlaps many items and other dimensions common to only a few

items are unavoidable in reality, but are also not significant. These minor

dimensions are rarely discussed in IRT literature, but are a frequent theme in

classical factor analysis. While the IRT definition of dimensionality would take

all factors, major and minor, into account, essential dimensionality is a

mathematical conceptualization of the number of dominant dimensions with

minor dimensions ignored. An essentially unidimensional test is therefore any

set of items selected from an infinite item pool that measures exactly one

major dimension. When essential unidimensionality is assumed, latent ability

is unique in an ordinal scaling sense and this unique latent ability is estimated

consistently. Stout presented theorems and proofs to show that dimensions

distributed nondensely over items or dimensions that have a minor influence

on possibly many items do not necessarily negate essential unidimensionality.

He went on to present guidelines for the development of essentially

unidimensional tests. Among the recommendations are limiting the number of

abilities per item; keeping the number of items dependent on the same ability,

other than the intended-to-be-measured θ, small; and controlling the number of

item pairs assigned to the same ability other than θ. These conditions are

usually met with the carefully designed tests usually found in practice.








Nandakumar (1991) used simulations to investigate Stout's statistical

test of essential unidimensionality. When one dominant trait and one or more

minor dimensions having little influence on item scores were present, Stout's

test performed well in indicating essential unidimensionality. The test is more

likely to reject the hypothesis of essential unidimensionality as the effect of the

minor dimensions increases.

To facilitate application of the test of essential unidimensionality, Stout

developed the computer program DIMTEST. An investigation of the program

revealed problems when a test consisted of difficult, highly discriminating items

where guessing was also present (Nandakumar & Stout, 1993). Refinements

were subsequently made to the program to make it more robust and beneficial

to the measurement practitioner.

Nandakumar (1994) studied three commonly used methodologies for

assessing dimensionality in a set of item responses. The three procedures

(DIMTEST, Holland and Rosenbaum's approach, and nonlinear factor analysis)

were unreliable in detecting lack of unidimensionality in real data sets.

Although the more recent procedures based on local independence, full

information factor analysis, and essential unidimensionality offer promise for

assessing the dimensionality of dichotomous data, especially with large

datasets, a satisfactory method has not yet been agreed upon by

measurement researchers. Because of the current lack of an acceptable index

to detect multidimensionality, it becomes even more urgent to understand







exactly what effect violation of the unidimensionality assumption may have on

IRT applications. When a test measures several dimensions, examinees'

scores will be influenced by all of these factors. As a result, systematic and

unsystematic errors of equating might be expected from scaling and equating

procedures that are applied to multidimensional tests (Yen, 1984). The

estimation of ability and item parameters is likely to be affected also.

Multidimensionality and Parameter Estimation

Violation of the unidimensionality assumption has been suggested as a

problem in the estimation of item and ability parameters, the first step in IRT

equating procedures. Thus, it is important to determine how robust estimation

procedures are to this violation.

Ansley and Forsyth (1985) used a noncompensatory M3PL model to

simulate a two-dimensional dataset. The two discrimination parameters were

set to have respective means of 1.23 and .49 and respective standard

deviations of .34 and .11. The b values were scaled to reflect fairly easy items

(μb1 = -.33, σb1 = .82, μb2 = -1.03, σb2 = .82). The c parameter was set to .2. A

bivariate normal distribution was selected to generate the 0 vectors with both

dimensions scaled to have mean 0 and standard deviation 1.0. The

correlation ρ(θ1, θ2) was varied with values of 0.0, .3, .6, .9, and .95 simulated.

Four combinations of sample size (1,000 and 2,000) and test length (30, 60)

were examined. Corresponding unidimensional datasets were also simulated.

Correlations of the estimated and simulated parameters showed the ai








estimates appeared to be averages of the true a1 and a2 values. The bi

estimates overestimated the true b1 values. The θ estimates were highly

related to the averages of the true θ values. The authors concluded that item

parameter estimation was affected by violation of the unidimensionality

assumption, but as the θ vectors became more highly correlated, the

estimations derived from the two-dimensional dataset approached results

obtained from the unidimensional data. Sample size and test length had little

effect on any of the relationships.

Reckase (1979) studied five forms of the Missouri State Testing

Program and five datasets simulated to match various factor structures to

determine what characteristics are estimated by the unidimensional Rasch and

3PL models when the data are multidimensional. Reckase concluded that for

tests with several equally strong dimensions, the Rasch estimates should be

considered as a sum or average of the abilities required for each dimension.

For data with a dominant first factor, the Rasch and 3PL difficulty estimates

were highly correlated with the scores for that factor. With the 3PL model and

more than two potent factors, the bi estimates correlated with just one of the

common factors. The author concluded good ability estimates can be obtained

from unidimensional estimation procedures when the first factor accounts for at

least 20 percent of the test variance, as is likely in practice.

Yen (1984) used data simulated with a compensatory M3PL model and

data from the Comprehensive Test of Basic Skills, Form U (CTBS/U) to study








unidimensional parameter estimation of multidimensional data. A variety of ai

parameters were configured and ρ(θ1, θ2) was set at .5 or .6. When

multidimensionality was present, the ai and bi parameter estimates were

larger than those of unidimensional sets of items. The unidimensional

estimates of both ai and θ parameters appeared to be a combination of the

respective two-dimensional parameters.

Data simulated from a hierarchical factor model were used in a study by

Drasgow and Parsons (1983). Item responses were generated from five

oblique common factors. Loadings were varied producing diversity in

correlations between the common factors. Each simulated dataset consisted

of 50-item tests and 1,000 simulees. The general latent trait was recovered

well when the correlations between the common factors were .46 or higher.

Harrison (1986) also used a hierarchical factor model to simulate data.

The strength of the second-order general factor, the number of first-order

common factors, the distribution of items loading on the common factors, and

the number of test items were manipulated. The effect of test length was

significant. As the number of items increased, the general trait was recovered

more effectively regardless of the latent structure, distribution of items across

common factors, or the number of common factors. Estimation of the bj

parameters was found to be robust to violations of unidimensionality. The

estimation of both the aj and bj parameters improved as test length and

strength of the general factor increased. In general, Harrison found







unidimensional parameter estimation procedures to be robust in the presence

of multidimensional data.

The studies reviewed indicate that IRT parameters implied by the

general factor are recovered well when the common factors have sufficiently

high correlations. Reckase, Ackerman, and Carlson (1988) used both

simulated and empirical data to demonstrate that items can be selected to

construct a test that meets the unidimensionality assumption even though

more than one ability is required for a correct response. The authors showed

that the unidimensionality assumption only requires the items in a test to

measure the same composite of abilities. This seems to have been met in the

previous investigations. Based on this study, it appears as if the

unidimensionality assumption is not as restrictive as formerly thought.

Although these studies explored the effect of multidimensionality on

unidimensional parameter estimation, it is also important to understand what

effect the choice between compensatory and noncompensatory

multidimensional models may have on estimation. Ackerman (1989) simulated

two-dimensional data using both compensatory and noncompensatory M2PL

models. Forty two-dimensional items were generated using the compensatory

model. Difficulty was confounded with dimensionality and ρ(θ1, θ2) was

selected at 0.0, .3, .6, and .9. For each compensatory item, a corresponding

noncompensatory item was created using a least-squares approach to

minimize the quantity









\sum_{k=1}^{100} \left[P_C(\theta_k \mid a, b) - P_{NC}(\theta_k \mid a, b)\right]^{2}          (16)

where Pc is a given compensatory item's probability of a correct response and

PNC is the noncompensatory item's probability of a correct response which

varies as a function of a and b given θ. The unidimensional 2PL model was

used to estimate parameters using both BILOG and LOGIST. Ackerman

discovered minimal differences in the IRSs for each model when the

parameters were matched. The confounding of difficulty with dimensionality was only

detected by BILOG. For both models, as ρ(θ1, θ2) increased, the response

data became more unidimensional and estimation of all parameters improved.

Way, Ansley, and Forsyth (1988) also compared compensatory and

noncompensatory models with simulated data. The values assigned ρ(θ1, θ2)

ranged from 0.0 to .95. Results showed the number-right distributions for the

two models were comparable. In the noncompensatory model, the

unidimensional ai estimates appeared to be averages of the a1 and a2 values,

while the compensatory model provided ai estimates best considered as sums

of a1 and a2. The bi estimates for the noncompensatory data were greater

than the b1 values, while the compensatory model seemed to average the b1 and

b2 values. For both models, the θ estimates were related to the average of the

two θ parameters.

A summary of the studies investigating the effect of multidimensional

data on unidimensional IRT parameter estimation is presented in Table 3.

Generally, parameters appear to be recovered adequately with data fit








Table 3

Summary of Studies of Unidimensional IRT Estimation with Multidimensional Data

Study                      Tests            Simulating Model        Estimation    Number of     Independent Variables
                                                                    Model         Dimensions

Ackerman (1989)            Simulation       M2PL, Comp.;            2PL           2             ρ(θ1, θ2); difficulty confounded
                                            least-squares                                       with dimensionality; comp. vs.
                                            conversion to Noncomp.                              noncomp. models; BILOG vs. LOGIST

Ansley & Forsyth (1985)    Simulation       M3PL, Noncomp.          3PL           2             ρ(θ1, θ2); sample size;
                                                                                                test length

Drasgow & Parsons (1983)   Simulation       Hierarchical            2PL           5             ρ(θ1, ..., θ5); general
                                            factor model                                        factor strength

Harrison (1986)            Simulation       Hierarchical            2PL           varied        general factor strength;
                                            factor model                                        # of common factors; test length

Reckase (1979)             Simulation,      Linear factor           Rasch, 3PL    varied        # of dimensions;
                           Missouri         analysis                                            estimation methods

Reckase, Ackerman,         Simulation,      M2PL, Comp.             2PL           2             violation of unidimensionality
& Carlson (1988)           ACT

Yen (1984)                 Simulation,      M3PL, Comp.             3PL           Sim. 2;       ρ(θ1, θ2); a parameters
                           CTBS/U                                   CTBS unknown

Note. Comp. = Compensatory model; Noncomp. = Noncompensatory model.









conditions usually found in practice. Both compensatory and

noncompensatory models are apparently viable as MIRT models. Determining

the adequacy of unidimensional parameter estimation of multidimensional data

has important consequences for equating multidimensional tests.

In addition to the estimation procedures discussed, the relationship

between multidimensional and unidimensional IRT models can also be

approached from an analytical framework. Wang (1986), as reported in

Ackerman (1988) and Oshima and Miller (1990), determined explicit algebraic

relationships between unidimensional estimates and the true multidimensional

parameters for the case in which the underlying response process is modeled

by the compensatory M2PL model and the unidimensional 2PL model. Using

the results for unidimensional estimation of a multidimensional data matrix,

Wang concluded that the unidimensional item parameter estimates are

obtained as a weighted composite of the underlying traits. The weights are a

function of the discrimination vectors for the items, the correlations among the

latent traits, and the difficulty parameters of the items. For group g who can be

described as having a diagonal variance-covariance structure Q_ and a mean

ability vector p, the 2PL item parameters for two-dimensional item j can be

approximated by


a_j^{*} \approx a_j' X \gamma_1          (17)

b_j^{*} \approx \frac{-(d_j + a_j' \mu)}{a_j' X \gamma_1}          (18)

where aj* and bj* are the unidimensional 2PL values; aj is the discrimination

vector for the M2PL model; dj is the difficulty parameter for the M2PL model;

γ1 and γ2 are the first and second standardized eigenvectors of the matrix

X'A'AX, where A is the matrix of discrimination parameters for all items in the

test and X'X = Ω. Therefore,

when the means, standard deviations, and item parameters of a two-

dimensional distribution are known, the corresponding 2PL unidimensional

item parameters can be approximated.
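The following Python fragment sketches this computation under simplifying assumptions: uncorrelated unit-variance traits, so that X reduces to the identity matrix and the eigenproblem is that of A'A. It is an illustration of the idea rather than a reproduction of Wang's derivation:

    import numpy as np

    def unidim_approx(A, d, mu):
        # A: items x 2 matrix of M2PL discriminations; d: difficulties;
        # mu: group mean ability vector. Assumes Omega = I, hence X = I.
        evals, evecs = np.linalg.eigh(A.T @ A)
        gamma1 = evecs[:, np.argmax(evals)]   # first standardized eigenvector
        a_uni = A @ gamma1                    # weighted-composite discriminations
        b_uni = -(d + A @ mu) / a_uni         # difficulties on the composite scale
        return a_uni, b_uni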

Multidimensionality and IRT Equating

In practice, test equating almost exclusively assumes unidimensionality.

A single score from one test is transformed to a single score from another test.

An understanding of what effect the presence of multidimensional data has on

these unidimensional equating results is of paramount importance.

Dorans and Kingston (1985) equated four forms of the Verbal GRE

Aptitude Test using the 3PL model and an equated bs procedure. Two data

collection designs, equivalent groups and anchor-test, were investigated as

well as several variations in calibration procedures. Dimensionality was

assessed through factor analyses conducted at the item level on interitem

tetrachoric correlations. Two highly related verbal dimensions were identified.







To examine their results, the researchers first calibrated the whole test,

then divided the test items into two homogeneous subgroups. The subgroups

were recalibrated separately and placed on the same scale as the original test.

They were then recombined back into an entire test and their corresponding

ICCs were compared. The authors discovered that differences in magnitude of

discrimination parameter estimates had an impact on IRT equating results,

affecting the symmetry of the equating. However, the different research

combinations yielded very similar equatings, leading the authors to conclude

that IRT equating may be sufficiently robust to the dimensionality displayed in

their data.

Cook and Eignor (1988) used SAT data that was suspected to be

multidimensional to examine the robustness of 3PL model concurrent

calibration and the characteristic curve transformation procedures. Scale drift

was used as the criterion for evaluating equating results. Cook and Eignor

concluded that both IRT equating methods produced acceptable results

despite the multidimensionality present in the tests being studied.

In addition to studying parameter estimation, Yen (1984) equated the

LOGIST trait estimates for both real (CTBS/U) and simulated data. Several

statistics were used to evaluate the results: (1) the correlation r; (2)

standardized difference between means (SDM); (3) ratio of standard

deviations; and (4) standardized root mean squared difference (SRMSD).

Trait estimates based on items that measured different dimensions had lower








correlations and higher SDMs and SRMSDs. That is, when tests measuring

different dimensions were equated, large unsystematic errors occurred.

Systematic errors were found only when the tests measured several

dimensions that differed in difficulty and were likely to be taught sequentially,

as in a vertical equating situation.

Camilli, Wang, and Fesq (1995) adapted the methodology of Dorans

and Kingston (1985) to examine how multidimensionality may affect the

equating of the Law School Admission Test (LSAT). Two dimensions of the

LSAT were identified using primary and secondary factor analyses, and the

stability of the dimensions was established over six administrations. The test

was divided into two homogeneous subtests to study the effect of

multidimensionality on IRT true-score test equating. Item calibration was done

with BILOG. The authors found very small differences in the equatings except

at the ends of the raw score distribution. They concluded that, for the LSAT,

IRT true-score equating was robust to the presence of multidimensionality.

These empirical studies indicate that violations of the unidimensionality

assumption, while having some impact on results, may not be significant.

However, different tests were used in this research and their content may have

affected findings in an unknown manner. Therefore, generalization of

results across studies is difficult (Skaggs & Lissitz, 1986a). Also,

because indices designed to detect multidimensionality are generally

unsatisfactory, it is necessary to design research studies that permit







manipulation of independent variables to understand exactly how violations of

the unidimensionality assumption affect equating. Simulation studies present a

technique to manipulate and control the desired variables.

There has been little simulation research on the effects of

multidimensionality on unidimensional IRT equating. One notable exception is

a study by Doody-Bogan and Yen (1983). The main purpose of this paper was

to examine the stability of several chi-square statistics for their ability to detect

multidimensionality in vertical equating, but the findings are significant in the

context of unidimensional equating with multidimensional data. Four

multidimensional data configurations were simulated with the compensatory

M3PL model described in Equation 9. One unidimensional 3PL dataset was

also generated. Three differences in mean ability between the two tests to be

equated were simulated with parameter estimates for all data modelled after

the CTBS for realism. Correlations, standardized difference between means

(SDM), and standardized root mean square differences (SRMSD) were used to

evaluate results. The findings of this study were mixed. When the correlations

were examined, the results of the equatings, both horizontal and vertical, were

as good for the tests with multidimensional configurations as for the

unidimensional tests. On the other hand, when the means were used as the

criterion for comparison, the multidimensional tests provided worse equatings

than the unidimensional data, especially when the tests differed in difficulty.








Another concern raised was that the equatings might deteriorate if the factors

loaded differently on the two tests.

More recently, attempts have been made to develop a multidimensional

equating procedure. Hirsch (1989) conducted a study in which real and

simulated data were equated with a multidimensional method. The procedure

involves (a) estimating item parameters and abilities on both dimensions for

both tests, (b) identifying common basis vectors, (c) aligning basis vectors

through Procrustes rotation, and (d) equating means and standard deviations of

the ability estimates for each dimension of the two tests. Results of this

preliminary research indicated that effective equating was possible with these

techniques, but the instability of the ability estimates make it impractical at this

time. While work on development of MIRT equating is continuing (Hirsch &

Miller, 1991), the procedure has little current value for the equating needs of

testing companies. The results of the studies of unidimensional equating with

multidimensional data are summarized in Table 4.

The emphasis of the present study was to examine the effect of

multidimensional data on unidimensional IRT equating through the use of a

simulation study. The research questions chosen were those considered to be

of most value to the practitioner.









Table 4

Summary of Studies of Unidimensional Equating with Multidimensional Data

Study              Tests        Model    Equating Method    Number of       Independent Variables     Evaluation Criterion
                                                            Dimensions

Camilli, Wang,     LSAT         3PL      true-score         2               test dimensionality       split test method
& Fesq (1995)                            equating

Cook & Eignor      SAT          3PL      concurrent         unknown         equating methods          scale drift
(1988)                                   calibration;
                                         characteristic
                                         curve trans.

Doody-Bogan        Simulation   3PL,     equated bs         2               criterion measures;       correlation; SDM;
& Yen (1983)                    M3PL                                        ρ(θ1, θ2)                 SRMSD

Dorans &           GRE-V        3PL      equated bs         2               calibration procedures;   split test
Kingston (1985)                                                             data collection design

Yen (1984)         CTBS/U,      3PL      equated bs         CTBS unknown;   a & b parameters;         correlation; SDM;
                   Simulation                               Sim. 2          ρ(θ1, θ2)                 SRMSD; ratio of σ













CHAPTER 3
METHOD


Purpose

Introduction

The purpose of this study was to examine the effects of

multidimensional data on unidimensional equating procedures. The effects of

the number of multidimensional items, type of multidimensional model, and

choice of equating procedure were investigated. Most investigations were

conducted with randomly equivalent, normally distributed examinee groups

having mean 0 and standard deviation 1. In addition, data from examinee

groups of lower ability (X̄1 = -0.8, SD1 = 0.6) were equated to results obtained

from the randomly equivalent groups.

The methods applied to investigate these effects are described in this

chapter. The methodology is discussed in the following sections: (a) data

generation, (b) estimation of parameters, (c) equating, and (d) criteria for

evaluation.

Research Questions

The specific questions to be answered in the present study were:

1. Does the number of multidimensional items in a test affect

unidimensional equating results?








2. Does the equating procedure affect unidimensional equating results?

3. Do data simulated by using a compensatory multidimensional model

produce different unidimensional equating results than data simulated using a

noncompensatory model?

4. Are unidimensional equating results affected by differing ability

distributions of the two examinee groups?

Data Generation

Design

Data for two parallel forms, A and B, of each test condition were

simulated. Four test conditions were created by varying the number of

multidimensional items contained in each test. These conditions were created

to mirror what might be found in published tests. For example, in a test of

mathematics problem solving, all items might be multidimensional to some

degree if reading skill were also required. However, relatively few

multidimensional items might be found in a reading comprehension test

containing only one graph-reading passage that also needed a math skill for

completion. In the present study, 10, 20, 30, and 40 items of a 40-item test

were two-dimensional. These conditions are referred to as MD10, MD20,

MD30, and MD40 respectively.

In addition to modifying the number of multidimensional items, the

strength of each multidimensional item's first factor was manipulated. This

was done within each test condition because it is unreasonable to expect a

published test to contain multidimensional items which all have an identical








factor structure. The angle of item direction was varied to 20°, 30°, 45°, and

60° to reflect items that predominantly measure the first trait (20° and 30°),

both traits equally (45°), and the second trait (60°).

Finally, data were originally generated using a compensatory

multidimensional model. To investigate any variations due to the difference in

modeling, each compensatory dataset was transformed into its corresponding

noncompensatory parameters through application of the least-squares

approach used by Ackerman (1989) and described in Chapter 2.

Noncompensatory parameters were considered corresponding if the probability

of a correct response was the same as for the compensatory parameters. This

was accomplished through the NLIN procedure in the Statistical Analysis

System (SAS,1989). Specific methodology is discussed later in this chapter.

Model Description

To avoid problems associated with estimating the lower asymptote, the

compensatory multidimensional two-parameter logistic (M2PL) model

(Reckase, 1985) was selected for data generation. Because this is a

compensatory model, high abilities on one ability trait are allowed to

compensate for lower abilities on the second ability trait.

The multidimensional item difficulty (MIDj) parameter was defined by

Reckase as in Equation 14, where ajh is the hth element of aj and m is the

number of dimensions. The data of interest in this study were considered to

be two-dimensional, so m equaled 2. Multidimensional item difficulty is the








distance from the origin of the multidimensional ability space to the point where

the item provides maximum examinee information, or where the IRS has the

steepest slope. A line joins these points at angle αh. In a two-dimensional

item, the value of αh can range between 0° and 90° depending on the degree

to which the item measures the two traits. If the item only measures the first

trait, α1 equals 0°, while α1 = 90° would depict an item measuring only the

second trait. For this study, α1 was set to either 0°, 20°, 30°, 45°, or 60°.

Item Parameters

Four tests with 40 items each were simulated using the compensatory

M2PL model described above. Forty items were selected as sufficient to

provide good equating results. An anchor test design was chosen for data

collection as it is widely used by practitioners (Skaggs & Lissitz, 1986a). Each

test consisted of two forms with 12 common linking items and 28 unique items.

The difficulty values were selected to be reasonable for published tests. Lord

(1968) found difficulties ranging from -1.5 to 2.5 (X̄ = 0.58, SD = 0.87) on SAT

Verbal data. Doody-Bogan and Yen (1983) employed a range of bi of -2.0 to

1.52 (X̄ = -0.028, SD = 0.818) in a simulation designed to imitate CTBS-U data.

In a study using multidimensional data, Ackerman (1988) reported MID values

ranging from -0.73 through 1.87 on an ACT Mathematics test. Oshima and

Miller (1990) used MID values in the interval -2.0 to 2.0. For the purpose of

this investigation, multidimensional item difficulty parameters (MID) were

generated using the RANNOR function of SAS. Values were chosen randomly








from a normal distribution within the range of -2.0 through 2.0 and to have

mean 0 and standard deviation 1.0.

The multidimensional discrimination parameters (MDISC) defined by

equation 15 were randomly selected from a lognormal distribution. A majority

of MDISC values lay between .5 and 2.5 with mean 1.15 and standard

deviation .60. These values correspond to those reported by Doody-Bogan

and Yen (1983) of .5 to 2.00 with mean 1.03 and standard deviation .3387.

Ackerman (1988) found an MDISC range of .58 through 2.39.
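A rough Python analogue of this generation step follows (the study used the SAS RANNOR function and a lognormal draw; the moment matching and the redraw loop enforcing the -2.0 to 2.0 bound are assumptions about the mechanics):

    import numpy as np

    rng = np.random.default_rng(42)   # arbitrary seed

    def draw_mid(n):
        # MID ~ N(0, 1), redrawn until every value falls in [-2.0, 2.0].
        mid = rng.standard_normal(n)
        while np.any(np.abs(mid) > 2.0):
            bad = np.abs(mid) > 2.0
            mid[bad] = rng.standard_normal(bad.sum())
        return mid

    def draw_mdisc(n, mean=1.15, sd=0.60):
        # Lognormal draw matched to the target mean and SD on the raw scale.
        sigma2 = np.log(1.0 + (sd / mean) ** 2)
        mu = np.log(mean) - sigma2 / 2.0
        return rng.lognormal(mu, np.sqrt(sigma2), n)

    mid = draw_mid(68)       # 68 items generated per test condition
    mdisc = draw_mdisc(68)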

To create two 40 item test forms, 68 items were generated for each test

condition. The first 12 items in each set were identified as the linking items

and were common to both forms. Items 13 through 40 were unique items for

Form A and items 41 through 68 were unique to Form B. In order to simulate

two-dimensional items, the values of αh as expressed in Equation 13 varied.

In the case of unidimensional items, α1 was set to 0°. For two-dimensional

items, α1 was either 20°, 30°, 45°, or 60°. Those items with α1 = 20° or 30°

primarily measured the first trait. Items having α1 = 45° measured both traits

equally, and those with α1 = 60° discriminated on the second factor more

heavily. More multidimensional items in this study predominantly measured

the first factor because it is reasonable to anticipate this to occur in a well-

designed commercial test. These four α1 values were spiraled throughout the

items in each dataset. To illustrate, in MD40 α1 was 20° for item 1, 30° for







item 2, 45" for item 3, and 60* for item 4. This pattern then repeated for the 64

remaining items.
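Given an angle, an MDISC value, and a MID value, the compensatory parameters follow by inverting Equations 13 through 15. A short Python sketch, checked against item 12 of Table 5:

    import numpy as np

    def make_item(mdisc, mid, alpha_deg):
        # a1 = MDISC*cos(alpha), a2 = MDISC*sin(alpha), d = -MID*MDISC.
        alpha = np.radians(alpha_deg)
        return mdisc * np.cos(alpha), mdisc * np.sin(alpha), -mid * mdisc

    make_item(1.763, 0.578, 60)   # approximately (0.881, 1.527, -1.019),
                                  # matching item 12 of Table 5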

For datasets containing both unidimensional and two-dimensional items,

the last 3, 6, and 9 linking items were multidimensional for MD10, MD20, and

MD30 respectively. Thus the linking test had the same proportion of

unidimensional items as did the corresponding unique items in each condition.

The last 7, 14, and 21 unique items for each of Forms A and B were also

multidimensional. Table 5 presents the item parameters for Form A of MD30

with 75% of the items in each form being two-dimensional.

Response Data

For each experimental condition and form, response vectors for 1,000

simulees were generated. This sample size was selected as being adequate

to provide stable parameter estimates. The ability values were randomly

generated through the normal distribution RANNOR function of SAS to range

from approximately -3.00 to 3.00. The theta values were assumed to be

uncorrelated. Probabilities of correctly answering an item were then calculated

for each simulee through application of Equation 11. Finally, the SAS function

RANUNI was used to produce a random number from the uniform distribution

between 0 and 1. If this number was less than or equal to P(Xij = 1 | aj, dj, θi),

the simulee passed the item. If the random number was greater, the simulee

failed. To increase confidence in results, twenty sets of response data were

generated for each condition and form.
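In outline, this loop corresponds to the following Python sketch (the study used SAS RANNOR and RANUNI; the seed is arbitrary, and items 4 and 5 of Table 5 are used only for illustration):

    import numpy as np

    rng = np.random.default_rng(1996)   # arbitrary seed

    def simulate_responses(theta, A, d):
        # Equation 11 probabilities; a simulee passes an item when a uniform
        # draw falls at or below P, mirroring the RANUNI comparison above.
        p = 1.0 / (1.0 + np.exp(-(theta @ A.T + d)))
        return (rng.uniform(size=p.shape) <= p).astype(int)

    theta = rng.standard_normal((1000, 2))          # uncorrelated abilities
    A = np.array([[0.736, 1.275], [1.159, 0.422]])  # items 4 and 5 of Table 5
    d = np.array([1.199, 0.681])
    responses = simulate_responses(theta, A, d)     # 1000 x 2 matrix of 0s and 1s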









Table 5

Simulated Compensatory Parameters for MD30, Form A


Item Form αj1 a1 a2 dj MDISC MID

1 A,B 0 0.475 0.000 -0.584 0.475 1.231
2 A,B 0 0.563 0.000 -0.173 0.563 0.308
3 A,B 0 0.515 0.000 0.652 0.515 -1.266
4 A,B 60 0.736 1.275 1.199 1.472 -0.814
5 A,B 20 1.159 0.422 0.681 1.234 -0.552
6 A,B 30 0.706 0.407 -0.054 0.815 0.066
7 A,B 45 0.936 0.936 -0.939 1.323 0.709
8 A,B 60 0.291 0.504 -0.618 0.582 1.062
9 A,B 20 0.684 0.249 -0.599 0.728 0.822
10 A,B 30 0.882 0.510 1.652 1.019 -1.621
11 A,B 45 1.129 1.129 2.676 1.597 -1.675
12 A,B 60 0.881 1.526 -1.018 1.763 0.578
13 A 0 0.973 0.000 0.549 0.973 -0.565
14 A 0 1.358 0.000 -0.324 1.358 0.239
15 A 0 1.857 0.000 1.417 1.857 -0.763
16 A 0 0.860 0.000 -0.524 0.860 0.609
17 A 0 1.448 0.000 1.538 1.448 -1.062
18 A 0 1.517 0.000 -0.448 1.517 0.295
19 A 0 0.663 0.000 -0.142 0.663 0.214
20 A 60 0.480 0.832 0.723 0.961 -0.753
21 A 20 0.648 0.236 -0.550 0.689 0.798
22 A 30 1.944 1.122 0.992 2.244 -0.442
23 A 45 1.120 1.120 0.654 1.584 -0.413
24 A 60 0.268 0.464 -0.122 0.535 0.228
25 A 20 0.790 0.288 0.295 0.841 -0.351
26 A 30 0.442 0.255 0.159 0.510 -0.313
27 A 45 1.452 1.452 0.019 2.053 -0.009
28 A 60 0.328 0.568 -0.243 0.656 0.370
29 A 20 0.744 0.271 0.055 0.792 -0.070
30 A 30 0.398 0.230 0.315 0.460 -0.686
31 A 45 0.355 0.355 0.924 0.502 -1.840
32 A 60 0.465 0.806 -1.060 0.930 1.140
33 A 20 1.442 0.525 -1.014 1.535 0.661
34 A 30 1.031 0.595 -0.284 1.191 0.238
35 A 45 0.879 0.879 1.320 1.244 -1.061
36 A 60 0.431 0.747 -0.965 0.862 1.119
37 A 20 0.589 0.214 0.533 0.627 -0.850
38 A 30 1.144 0.661 2.296 1.321 -1.738
39 A 45 0.810 0.810 1.050 1.145 -0.917
40 A 60 0.147 0.254 -0.135 0.293 0.461








Noncompensatory Data

For each compensatory item generated, a corresponding

noncompensatory item was created. A noncompensatory item was considered

corresponding if it had the same probability of success as the compensatory

item (Ackerman, 1989). To accomplish this, the NLIN procedure of SAS was

applied to Equation 16. Specifically, the compensatory probability was

calculated for each case and became the dependent variable. The

independent variable in the NLIN model statement was the noncompensatory

probability function. Only multidimensional items were transformed as the

compensatory/noncompensatory question was not applicable to

unidimensional items. Starting values for noncompensatory parameter

estimation were set to equal the compensatory parameters. The 1,000 theta

vectors generated for the first of each compensatory response set were

treated as known values. To check that the program was converging to a unique

minimum, starting values were changed for several items in each set and

reestimated. Any differences which appeared in the parameter estimates were

contained in the fourth or fifth decimal place. For approximately 10% of the

items in each dataset, the convergence criterion was not met within 40

iterations. In these cases, the final parameter estimates were substituted for

the starting values and the program rerun. In all such cases, convergence was

achieved with the second attempt.
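An analogue of the NLIN step in Python, with scipy's least_squares standing in for SAS PROC NLIN; the starting b values are arbitrary guesses, and the placement of the D constant follows Equations 10 and 11 as written:

    import numpy as np
    from scipy.optimize import least_squares

    D = 1.7
    rng = np.random.default_rng(0)
    theta = rng.standard_normal((1000, 2))   # theta vectors treated as known

    def p_comp(a, d):
        # Compensatory probabilities in the Equation 11 form.
        return 1.0 / (1.0 + np.exp(-(theta @ a + d)))

    def p_noncomp(params):
        # Noncompensatory probabilities (Equation 10).
        a1, a2, b1, b2 = params
        p1 = 1.0 / (1.0 + np.exp(-D * a1 * (theta[:, 0] - b1)))
        p2 = 1.0 / (1.0 + np.exp(-D * a2 * (theta[:, 1] - b2)))
        return p1 * p2

    target = p_comp(np.array([0.736, 1.275]), 1.199)    # item 4 of Table 5
    fit = least_squares(lambda p: p_noncomp(p) - target,
                        x0=[0.736, 1.275, -0.8, -0.8])  # start near comp. values
    a1, a2, b1, b2 = fit.x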








Response vectors were generated by applying Equation 10 and using

the same (01,02) combinations utilized to produce the corresponding

compensatory responses. Twenty response sets were simulated for each

noncompensatory dataset. The item parameters for the multidimensional Form

A items of noncompensatory MD30 are shown in Table 6. Summary statistics

for datasets of both models are displayed in Table 7.

Nonequivalent Groups

One of the strongest theoretical advantages of IRT is its usefulness with

groups of subjects who differ in abilities. One case where this may occur is

when a second form of a test, such as a high school proficiency test, is

administered only to examinees who failed to pass the first attempt. To

examine the effect of data from a lower ability group being equated to data

gathered from a normally distributed group, sets of 1,000 less able simulees

were generated. Scores on θ1 for the lower group ranged between -3.00 and

0.00 with mean -0.80 and standard deviation 0.6. Abilities on the second

dimension were normally distributed with mean 0 and standard deviation 1.

Five replications of scores were generated for all four compensatory test

conditions.

Estimation of Parameters

Unidimensional IRT

The responses of the 1,000 simulated examinees in each response set

were analyzed by the computer program BILOG (Mislevy & Bock, 1990) to

estimate the unidimensional item discrimination and difficulty parameters.









Table 6

Simulated Noncompensatory Parameters for Multidimensional Items, MD30 Form A


Item  Form  αj1    a1     a2      b1      b2

 4    A,B   60    0.664  0.888  -0.945   0.309
 5    A,B   20    0.778  0.528   0.236  -2.081
 6    A,B   30    0.528  0.447  -0.713  -2.092
 7    A,B   45    0.705  0.698  -1.534  -1.596
 8    A,B   60    0.352  0.395  -3.164  -1.776
 9    A,B   20    0.478  0.390  -1.175  -3.624
10    A,B   30    0.638  0.494   0.834  -0.661
11    A,B   45    0.849  0.844   0.555   0.565
12    A,B   60    0.728  0.942  -1.999  -0.964
20    A     60    0.496  0.606  -1.268   0.047
21    A     20    0.184  0.149  -1.188  -4.872
22    A     30    2.256  0.235  -0.426   0.705
23    A     45    0.830  0.792  -0.481  -0.491
24    A     60    0.692  0.248  -4.282   0.835
25    A     20    0.545  0.413  -0.089  -2.602
26    A     30    0.344  0.306  -0.745  -2.472
27    A     45    0.957  0.910  -0.750  -0.786
28    A     60    0.381  0.436  -2.495  -1.136
29    A     20    0.516  0.400  -0.361  -2.876
30    A     30    0.310  0.276  -0.561  -2.430
31    A     45    0.312  0.315  -0.297  -0.388
32    A     60    0.499  0.585  -2.774  -1.633
33    A     20    0.918  0.610  -0.856  -2.870
34    A     30    0.725  0.578  -0.701  -1.944
35    A     45    0.698  0.677  -0.060  -0.058
36    A     60    0.474  0.551  -2.809  -1.638
37    A     20    0.412  0.322   0.187  -2.772
38    A     30    0.814  0.584   1.073  -0.399
39    A     45    0.653  0.636  -0.219  -0.231
40    A     60    0.096  0.207  -4.957  -3.588









Table 7

Summary Statistics for Multidimensional Items in Compensatory and
Noncompensatory Datasets


Parameter             MD10     MD20     MD30     MD40

b1 (NC)   Mean       -1.09    -1.08    -1.02    -0.88
          SD          1.03     1.09     1.28     1.16

b2 (NC)   Mean       -1.37    -1.54    -1.35    -1.23
          SD          0.94     1.80     1.37     1.18

Note. C = Compensatory item parameters; NC = Noncompensatory item parameters








Program default values were used in the calibration of the two-parameter

logistic model item parameters. Specifically, this involved marginal maximum

likelihood estimation procedures, no priors specified for difficulties, and

lognormal priors for discrimination parameters. For the randomly equivalent

groups, each of the 160 response sets--20 replications each for four

compensatory and four noncompensatory multidimensional conditions--was

analyzed twice. The procedure was repeated for the nonequivalent groups.

First the responses for combined Forms A and B for each dataset were

analyzed simultaneously. Then each form was analyzed separately. This

resulted in a total of 520 BILOG runs.

Analytical Estimation

Unidimensional estimation of the multidimensional item parameters for

the eight datasets was performed analytically using Wang's (1986) procedure.

The SAS IML procedure was employed to determine the unidimensional

estimates of the two-dimensional item parameters for each of the eight

conditions.

Equating

In IRT, because the ICCs are population independent, item parameter

estimates from two BILOG runs should theoretically be identical. However,

P_i(θ) in the 2PL model is a function of the quantity a_i(θ - b_i). As such, the

origin and the unit of the θ and b_i scales are arbitrary, or indeterminate.

Any scale may be selected for θ as long as the same scale is chosen for b_i.

Estimated abilities and item difficulties from two calibration runs should have a









linear relationship to each other (Petersen et al., 1989). Equating is a

procedure used to place the item parameters from two tests on the same

scale.
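The indeterminacy can be made explicit: for any constants A > 0 and B, rescaling abilities and difficulties while dividing discriminations by A leaves the 2PL logit, and hence every ICC, unchanged:

\[
a_i(\theta - b_i) \;=\; \frac{a_i}{A}\left[(A\theta + B) - (Ab_i + B)\right].
\]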

Three unidimensional IRT equating methods were selected for this

study: (a) concurrent calibration, (b) equated bs, and (c) characteristic curve

transformation.

Concurrent Calibration

Concurrent calibration is the simplest of the IRT methods of equating to

implement. A common group of examinees or items is required to tie the

information from the two tests together. For this study, the parameters of both

forms were estimated simultaneously by BILOG. Twelve common items in

each dataset served to link the forms and the resulting item parameter

estimates were therefore on the same scale. This process was repeated for

each of the response sets in each condition.
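Conceptually, concurrent calibration amounts to stacking the two forms' responses into one calibration matrix, with items not presented to a group coded as missing. The schematic sketch below assumes separate groups of 1,000 simulees per form and places the 12 linking items in the first 12 columns; the not-presented code and layout are illustrative, not BILOG syntax.

    import numpy as np

    rng = np.random.default_rng(3)
    form_a = rng.integers(0, 2, size=(1000, 40))   # placeholder 0/1 responses
    form_b = rng.integers(0, 2, size=(1000, 40))

    NOT_PRESENTED = -9                              # illustrative missing code
    combined = np.full((2000, 68), NOT_PRESENTED)   # 12 common + 28 + 28 items
    combined[:1000, :40] = form_a                   # Form A: columns 0-39
    b_cols = list(range(12)) + list(range(40, 68))  # linking items + B uniques
    combined[1000:, b_cols] = form_b                # one run puts both forms
                                                    # on a common scale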

Equated bs

The equated bs method is based on determining the linear relationship that

exists between item difficulties estimated in two separate BILOG calibration

runs, one for each form. The means and standard deviations of the bis for

each set of linking items from Form A and B were calculated. The linear

transformation was determined by

\[
b_A = \frac{SD_A}{SD_B}\,(b_B - \bar{X}_B) + \bar{X}_A \qquad (19)
\]









Once the slope (A) and intercept (B) of the linear transformation were found,

they were applied to all ability and item estimates for Form B, yielding

\[
b^{*}_{B} = A\,b_B + B \qquad (20)
\]

\[
a^{*}_{B} = a_B / A \qquad (21)
\]

\[
\theta^{*}_{B} = A\,\theta_B + B \qquad (22)
\]

All parameters were now transformed to the same scale. Although item

discrimination or ability estimates could have been used to determine the linear

transformation, item difficulty estimates are usually used in practice because

they yield the most stable parameter estimates (Cook & Eignor, 1991).
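In code, the mean-and-sigma transformation of Equations 19 through 22 reduces to a few lines. The sketch below uses hypothetical linking-item difficulties for the two forms.

    import numpy as np

    def equated_bs(b_link_a, b_link_b):
        # slope and intercept of the line mapping Form B onto Form A (Eq. 19)
        A = np.std(b_link_a, ddof=1) / np.std(b_link_b, ddof=1)
        B = np.mean(b_link_a) - A * np.mean(b_link_b)
        return A, B

    b_link_a = np.array([-1.2, -0.4, 0.1, 0.8, 1.5, -0.9])   # hypothetical
    b_link_b = np.array([-1.0, -0.3, 0.3, 1.1, 1.9, -0.6])

    A, B = equated_bs(b_link_a, b_link_b)
    b_star = A * b_link_b + B        # Eq. 20: transformed difficulties
    # a_star = a_b / A               # Eq. 21: transformed discriminations
    # theta_star = A * theta_b + B   # Eq. 22: transformed abilities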

Characteristic Curve Transformation

The parameter estimates computed separately for Form A and Form B

were also used in the characteristic curve transformation. This equating

method used both a, and bj estimates from the linking items to derive a linear

transformation through an iterative process that minimized the difference

between the item parameter estimates of the linking items. The process is

based on the assumption that if the estimates were free of error, choosing the

proper linear transformation would cause the true-score estimates of the

linking items to correspond (Petersen et al., 1989; Stocking & Lord, 1983).

The resulting transformation was then applied to all Form B parameters to

create estimates on the same scale. The EQUATE (Baker, Al-Karni, & Al-

Dosary, 1991) computer program was used to accomplish this. Data were









examined at 80 points along the ICC and the transformation was generally

identified after approximately 8 to 10 iterations.
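The characteristic curve criterion can be sketched as minimizing, over the slope and intercept, the squared distance between the linking items' test characteristic curves evaluated on a grid of ability points. The 2PL form and the 80 evaluation points follow the description above; the grid range, the optimizer, and the parameter values are assumptions of this sketch.

    import numpy as np
    from scipy.optimize import minimize

    def tcc(theta, a, b):
        # 2PL test characteristic curve: expected score on the linking items
        p = 1.0 / (1.0 + np.exp(-a[None, :] * (theta[:, None] - b[None, :])))
        return p.sum(axis=1)

    def characteristic_curve_link(a_link_a, b_link_a, a_link_b, b_link_b):
        grid = np.linspace(-4.0, 4.0, 80)          # 80 evaluation points
        def loss(x):
            A, B = x
            # transform Form B estimates onto the Form A scale (Eqs. 20-21)
            return np.mean((tcc(grid, a_link_a, b_link_a)
                            - tcc(grid, a_link_b / A, A * b_link_b + B)) ** 2)
        return minimize(loss, x0=[1.0, 0.0], method="Nelder-Mead").x

    a_a = np.array([0.8, 1.2, 0.6]); b_a = np.array([-0.5, 0.2, 1.0])  # hypothetical
    a_b = np.array([0.7, 1.1, 0.5]); b_b = np.array([-0.2, 0.5, 1.4])
    A, B = characteristic_curve_link(a_a, b_a, a_b, b_b)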

All three equating procedures described were applied to each of the

replications for each of the twelve data conditions. This resulted in 660

equatings for this study. A summation of the research equating conditions is

presented in Table 8.



Table 8

Summation of Research Equating Conditions


                                             Equating Method

Dataset                         Concurrent      Equated      Characteristic
                                Calibration     bs           Curve

Compensatory, Randomly Equivalent Groups
  MD10                              ✓              ✓               ✓
  MD20                              ✓              ✓               ✓
  MD30                              ✓              ✓               ✓
  MD40                              ✓              ✓               ✓

Noncompensatory, Randomly Equivalent Groups
  MD10                              ✓              ✓               ✓
  MD20                              ✓              ✓               ✓
  MD30                              ✓              ✓               ✓
  MD40                              ✓              ✓               ✓

Compensatory, Nonequivalent Groups
  MD10                              ✓              ✓               ✓
  MD20                              ✓              ✓               ✓
  MD30                              ✓              ✓               ✓
  MD40                              ✓              ✓               ✓


Evaluation Criteria

To establish a foundation for evaluating the results of the research

equatings, the three comparison conditions described below were used. In

addition, three statistical criteria--correlation, standardized mean difference,

and standardized root mean square difference--were applied to the data.

Comparison Conditions

For the first comparison condition, the unidimensional approximations

of the multidimensional item parameters were calculated using the analytic

procedure described by Equations 17 and 18 (Wang, 1986). To compute

these approximations for the eight research conditions, the SAS IML procedure

was applied to each of the simulated parameter sets. The means and

standard deviations of the responses for each condition were determined for

inclusion in the formula. The resulting sets of unidimensional comparison item

parameters were weighted composites of the item parameters for the two traits

(Ackerman, 1988). Table 9 presents the analytical unidimensional item

parameter approximations for compensatory MD30, Form A. The resulting

analytical item parameter estimates were then fixed in BILOG 386 and all

compensatory and noncompensatory response sets were analyzed to establish

the comparison ability estimates.

For the next comparison condition, the second dimension of each

multidimensional item was ignored. This would be reasonable if one argues that

most published tests were designed to measure only the first factor. For









Table 9

Analytical Estimates of the Unidimensional Parameters for Compensatory MD30,
Form A


Item Discrimination Difficulty
1 0.242 1.408
2 0.286 0.352
3 0.262 -1.449
4 0.679 -0.949
5 0.712 -0.559
6 0.479 0.066
7 0.733 0.737
8 0.289 1.237
9 0.422 0.833
10 0.599 -1.622
11 0.876 -1.742
12 0.786 0.673
13 0.482 -0.646
14 0.650 0.273
15 0.842 -0.874
16 0.429 0.698
17 0.687 -1.216
18 0.715 0.338
19 0.335 0.245
20 0.466 -0.877
21 0.400 0.808
22 1.320 -0.442
23 0.868 -0.429
24 0.267 0.265
25 0.487 -0.355
26 0.300 -0.312
27 1.103 -0.010
28 0.325 0.432
29 0.459 -0.070
30 0.270 -0.685
31 0.283 -1.913
32 0.452 1.327
33 0.882 0.670
34 0.700 0.239
35 0.690 -1.104
36 0.421 1.304
37 0.363 -0.862
38 0.777 -1.734
39 0.637 -0.953
40 0.148 0.536









example, although mathematics problem solving requires reading skills to

understand the prompts, the reading level is usually well below the grade level

being tested. In this study, the simulated ability parameters of the first

dimension only from each compensatory and noncompensatory dataset were

utilized. This comparison criterion would enable evaluation of how well the

dominant first factor was recovered in the equatings.

A third comparison condition was created which employed the

averages of the two true θ values. This condition was based on the parameter

estimation studies of Yen (1984) and Ansley and Forsyth (1985) in which the

unidimensional estimates of the θ parameters appeared to be combinations of

the true multidimensional abilities.
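Taken together, the second and third comparison conditions are simple functions of the simulated trait matrix; the sketch below uses a placeholder theta matrix.

    import numpy as np

    rng = np.random.default_rng(5)
    theta = rng.standard_normal((1000, 2))   # simulated (theta1, theta2)

    first_factor = theta[:, 0]               # condition 2: dominant trait only
    average_theta = theta.mean(axis=1)       # condition 3: mean of the two traits
    # condition 1 instead uses abilities estimated with the analytically
    # derived unidimensional item parameters fixed in the calibration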

Statistical Criteria

Correlation coefficients between the simulated θ and the equated θ

estimates were computed to establish the relationship between the comparison

criterion and the research equatings for each condition. For concurrent

calibration, the appropriate simulated θ parameters were correlated to the

corresponding estimated ability parameters for both Form A and Form B. Only

the equated form, Form B, was compared to the comparison conditions for all

other equating procedures.

The standardized difference between means (SDM) is the difference in

mean scores for the two sets of ability traits divided by a pooled estimate of the

standard deviation










\[
SDM = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{(S_1^2 + S_2^2)/2}} \qquad (23)
\]

where S_1^2 and S_2^2 are the variances of the two sets of abilities (Yen, 1984).

The means of the estimated ability parameters were subtracted from the

means of each comparison condition to calculate this statistic.

The standardized root mean square difference is the square root of the

mean squared difference between examinees' trait estimates, divided by the pooled standard deviation S.

Again, the estimated θ parameter values were subtracted from the appropriate

comparison values to derive the criterion value.
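A compact implementation of the three criteria is sketched below, assuming, per the description above, that the same pooled standard deviation standardizes both the mean difference and the root mean square difference.

    import numpy as np

    def evaluation_criteria(theta_ref, theta_eq):
        # correlation between comparison values and equated estimates
        r = np.corrcoef(theta_ref, theta_eq)[0, 1]
        # pooled standard deviation (denominator of Eq. 23)
        s = np.sqrt((np.var(theta_ref, ddof=1) + np.var(theta_eq, ddof=1)) / 2)
        sdm = (np.mean(theta_ref) - np.mean(theta_eq)) / s
        srmsd = np.sqrt(np.mean((theta_ref - theta_eq) ** 2)) / s
        return r, sdm, srmsd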

Summary

Four test conditions with differing numbers of multidimensional items

were simulated using the compensatory M2PL item response theory model.

The item direction for multidimensional items was varied within each test.

Comparable noncompensatory datasets were then created for each condition.

Two 40-item forms were constructed for each situation, consisting of 12 linking

and 28 unique items. Responses for 1,000 normally distributed simulated

examinees were generated through application of the appropriate probability

equation and replicated 20 times. The same (θ1, θ2) combinations were used to

generate corresponding compensatory and noncompensatory response sets.

In addition, responses for 1,000 low-ability examinees were generated with 5

replications for each compensatory test condition.








Parameter estimation was executed on all conditions using both

unidimensional IRT procedures and analytical estimation. For the IRT

parameter estimates, equating was performed through three techniques: (a)

concurrent calibration, (b) equated bs, and (c) characteristic curve

transformation.

Three comparison conditions--the first simulated theta, the average of

theta 1 and theta 2, and the analytical estimations of the unidimensional

parameters--were selected for comparison with equated ability estimates.

Finally, the three statistical procedures of correlation, standardized mean

difference, and standardized root mean square difference were applied to

examine the comparisons.













CHAPTER 4
RESULTS AND DISCUSSION


Simulated Data

Item Parameters

Item parameters for two 40-item forms of a test were generated with a

compensatory multidimensional 2PL model. Four conditions were created with

either 10, 20, 30, or 40 multidimensional items in each form. Four degrees of

dimensionality were spiraled throughout each test and form. Each form

contained twelve linking items that mirrored the total test in psychometric

properties. Additionally, Forms A and B were designed to be randomly parallel.

Examination of the simulated compensatory item parameters confirms

this was accomplished. Descriptive statistics for the four compensatory Form A

conditions are presented in Table 10 and Form B data are shown in Table 11.

All generated values are within the limits found in published tests and described

in previous empirical studies (Doody-Bogan & Yen, 1983; Ackerman, 1988).

For both forms and across all conditions, the means of the di parameters

approach 0.0 with standard deviations of approximately 1.0. The means and

standard deviations of all item parameters for both forms are similar.

The multidimensional compensatory item parameters were then

transformed into their noncompensatory correlates. Descriptive statistics for

Table 10

Descriptive Statistics for Compensatory Form A Item Parameters


Parameter   Condition   Minimum   Maximum    Mean

a1             10          0.29      3.49     1.15
               20          0.30      2.41     0.89
               30          0.15      1.94     0.84
               40          0.28      2.45     0.98

a2             10          0.00      1.22     0.17
               20          0.00      1.87     0.42
               30          0.00      1.53     0.49
               40          0.21      1.63     0.71

d              10         -2.27      2.18     0.08
               20         -2.44      2.76     0.20
               30         -1.06      2.68     0.25
               40         -2.90      2.78     0.17

MDISC          10          0.41      3.49     1.23
               20          0.30      2.41     1.08
               30          0.29      2.24     1.04
               40          0.57      2.61     1.25

MID            10         -1.94      1.62    -0.11
               20         -1.86      1.83    -0.09
               30         -1.84      1.23    -0.17
               40         -1.43      1.73    -0.10

Note. N = 40 items in each condition.



Form A conditions are presented in Table 12 and Form B information is given in

Table 13. The item parameter values calculated from the noncompensatory

transformations are within the ranges given by Ackerman (1989).








Table 11

Descriptive Statistics for Compensatory Form B Item Parameters


Parameter   Condition   Minimum   Maximum    Mean     SD

a1             10          0.37      3.65     1.14    0.8
               20          0.27      2.41     0.96    0.5
               30          0.15      2.11     0.94    0.5
               40          0.27      2.45     0.88    0.5

a2             10          0.00      1.37     0.20    0.4
               20          0.00      1.58     0.36    0.5
               30          0.00      2.11     0.55    0.5
               40          0.18      2.27     0.71    0.4

d              10         -2.55      6.23     0.30    1.5
               20         -1.87      2.76     0.00    1.1
               30         -3.30      4.65     0.10    1.3
               40         -2.90      2.78     0.20    1.2

MDISC          10          0.39      3.65     1.23    0.8
               20          0.32      2.41     1.12    0.5
               30          0.30      2.98     1.16    0.6
               40          0.42      2.62     1.18    0.5

MID            10         -1.71      1.96    -0.13    0.9
               20         -1.88      1.58     0.08    0.9
               30         -1.68      1.77    -0.02    0.9
               40         -1.79      1.94    -0.09    0.8

Note. N = 40 items in each condition.




For all conditions and in both forms, b1 is slightly more difficult than b2, and a2 is less

discriminating than a1.

In all cases, the noncompensatory b1 parameters are lower than the MID

for the corresponding item. This may be explained by considering the method








Table 12

Descriptive Statistics for Multidimensional Item Parameters in Noncompensatory
Form A


Parameter   Condition   Minimum   Maximum    Mean     SD

a1             10          0.27      0.99     0.63    0.3
               20          0.10      1.10     0.60    0.3
               30          0.10      2.26     0.63    0.4
               40          0.33      2.65     0.76    0.4

a2             10          0.00      1.22     0.17    0.3
               20          0.15      0.94     0.52    0.2
               30          0.00      1.53     0.49    0.4
               40          0.32      1.14     0.62    0.2

b1             10         -3.44      0.86    -1.20    1.1
               20         -2.94      1.68    -0.79    1.2
               30         -4.96      1.07    -1.07    1.4
               40         -3.55      1.93    -0.92    1.2

b2             10         -3.01     -0.54    -1.43    0.8
               20         -5.75      2.34    -1.29    2.0
               30         -4.87      0.84    -1.45    1.4
               40         -3.10      3.98    -1.12    1.2

Note. The number of multidimensional items is the same as the condition number.



used to calculate the transformations. A compensatory and a noncompensatory

item were considered corresponding if, for each (θ1, θ2) combination, the

probability of a correct response was the same on both items. Because the

noncompensatory model does not allow a high ability on one trait to compensate

for a low ability on the other dimension, the bi parameters on a








Table 13

Descriptive Statistics for Multidimensional Item Parameters in Noncompensatory
Form B


Parameter   Condition   Minimum   Maximum    Mean     SD

a1             10          0.33      1.54     0.69    0.4
               20          0.13      1.10     0.59    0.3
               30          0.33      1.33     0.69    0.3
               40          0.10      1.58     0.64    0.3

a2             10          0.29      0.97     0.64    0.3
               20          0.10      1.12     0.57    0.3
               30          0.16      0.83     0.63    0.4
               40          0.25      1.82     0.64    0.3

b1             10         -2.32      0.57    -0.96    0.8
               20         -3.22      0.27    -1.21    0.9
               30         -3.30      1.65     0.10    1.3
               40         -4.06      1.86    -0.79    1.1

b2             10         -3.04      0.33    -1.25    1.0
               20         -4.90      0.69    -1.53    1.5
               30         -3.62      1.03    -1.24    1.3
               40         -3.51      0.82    -1.32    1.1

Note. The number of multidimensional items is the same as the condition number.



noncompensatory item must be smaller than the MID parameter of the

compensatory item if the condition for items to be corresponding is to be met.
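For reference, the multidimensional discrimination and difficulty statistics tabulated above follow Reckase's formulation for the compensatory model, assuming the tables' MDISC and MID columns use these standard definitions:

\[
MDISC_i = \sqrt{a_{1i}^2 + a_{2i}^2}, \qquad MID_i = \frac{-d_i}{MDISC_i}.
\]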

The differences between the compensatory and noncompensatory M2PL

models can also be shown graphically. Because the probability of a correct

response varies as a function of the θs in each model, the item response

surfaces (IRS) and contour plots of matched items should differ. The







compensatory and corresponding noncompensatory model IRS and contour plot

for an item of each degree of dimensionality are shown in Figures 3 through 6.

In Figure 3, a matched item that discriminates predominantly on θ1 (α = 20°)

is pictured. The differences between the two IRSs are minor. A similarity also

exists in the two conditions where the degree of dimensionality is 15° from

equally discriminating. Figure 4 shows the IRS for α = 30°, which discriminates

slightly more on θ1 than on θ2. Conversely, Figure 6 presents the graphs for α =

60°, which discriminates slightly more on θ2 than on θ1. Although differences

exist in the baselines, the curves of the IRSs remain similar. This is true both

within each of the two matched sets and between the items with α = 30° and

α = 60°. In Figure 5, where α = 45°, the corresponding compensatory and

noncompensatory items discriminate equally along θ1 and θ2, and there is a

sharp contrast between corresponding curves.

Similar conclusions can be drawn from examination of the equiprobability

lines of the contour plots. For the compensatory model, parallel lines join the

(θ1, θ2) combinations that have an equal probability of a correct response. The

incline of these lines is a function of the discrimination parameters. However,

because the noncompensatory model does not allow a high ability on one

dimension to compensate for a low ability on another dimension, the lines

connecting the (θ1, θ2) combinations are curvilinear. The direction of these lines

in the noncompensatory model is a function of the item's difficulty parameters.
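The contrast is easy to reproduce: setting each model's probability to a constant and solving, the compensatory contour is the straight line a1θ1 + a2θ2 + d = logit(p), while the noncompensatory contour bends. A short matplotlib sketch with hypothetical item parameters:

    import numpy as np
    import matplotlib.pyplot as plt

    t1, t2 = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))

    comp = 1 / (1 + np.exp(-(0.9 * t1 + 0.5 * t2 + 0.2)))      # compensatory
    nonc = (1 / (1 + np.exp(-0.9 * (t1 + 0.6)))                 # noncompensatory
            * 1 / (1 + np.exp(-0.5 * (t2 + 1.2))))

    fig, axes = plt.subplots(1, 2, figsize=(8, 4))
    axes[0].contour(t1, t2, comp, levels=np.arange(0.1, 0.9, 0.1))
    axes[0].set_title("Compensatory: parallel lines")
    axes[1].contour(t1, t2, nonc, levels=np.arange(0.1, 0.9, 0.1))
    axes[1].set_title("Noncompensatory: curvilinear")
    plt.show()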










Figure 3. Item response surfaces and contour plots for item 9, MD20, α = 20°.
(a) Compensatory IRS (a1 = .732, a2 = .266, d = -.104). (b) Noncompensatory IRS
(a1 = .526, a2 = .378, b1 = -.595, b2 = -2.961). (c) Compensatory contour plot.
(d) Noncompensatory contour plot.












Figure 4. Item response surfaces and contour plots for item 10, MD20, α = 30°.
(a) Compensatory IRS (a1 = .934, a2 = .539, d = .650). (b) Noncompensatory IRS
(a1 = .709, a2 = .526, b1 = -.092, b2 = -1.177). (c) Compensatory contour plot.
(d) Noncompensatory contour plot.












Figure 5. Item response surfaces and contour plots for item 11, MD20, α = 45°.
(a) Compensatory IRS (a1 = 1.223, a2 = 1.223, d = 1.994). (b) Noncompensatory IRS
(a1 = .970, a2 = .933, b1 = -.951, b2 = -.913). (c) Compensatory contour plot.
(d) Noncompensatory contour plot.