Citation
A Comparison of three types of item analysis in test development using classical and latent trait methods /

Material Information

Title:
A Comparison of three types of item analysis in test development using classical and latent trait methods /
Added title page title:
Item analysis in test development
Creator:
Benson, Iris G., 1946-
Publication Date:
Copyright Date:
1977
Language:
English
Physical Description:
x, 120 leaves ; 28 cm.

Subjects

Subjects / Keywords:
Analytics ( jstor )
Applied statistics ( jstor )
Consistent estimators ( jstor )
Factor analysis ( jstor )
Modeling ( jstor )
Population estimates ( jstor )
Sample size ( jstor )
Statistical models ( jstor )
Statistics ( jstor )
Test theory ( jstor )
Dissertations, Academic -- Educational Administration and Supervision -- UF ( lcsh )
Educational Administration and Supervision thesis Ph. D ( lcsh )
Educational tests and measurements ( lcsh )
Genre:
bibliography ( marcgt )
non-fiction ( marcgt )

Notes

Thesis:
Thesis--University of Florida.
Bibliography:
Bibliography: leaves 105-110.
Additional Physical Form:
Also available on World Wide Web
General Note:
Typescript.
General Note:
Vita.
Statement of Responsibility:
by Iris G. Benson.

Record Information

Source Institution:
University of Florida
Holding Location:
University of Florida
Rights Management:
Copyright Iris G. Benson. Permission granted to the University of Florida to digitize, archive and distribute this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
Resource Identifier:
026310933 ( AlephBibNum )
04054653 ( OCLC )
AAX3756 ( NOTIS )

Full Text














A COMPARISON OF THREE TYPES OF ITEM ANALYSIS
IN TEST DEVELOPMENT USING CLASSICAL
AND LATENT TRAIT METHODS









By

IRIS G. BENSON


A DISSERTATION PRESENTED TO THE GRADUATE COUNCIL OF
THE UNIVERSITY OF FLORIDA
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE
DEGREE OF DOCTOR OF PHILOSOPHY




UNIVERSITY OF FLORIDA


1977











ACKNOWLEDGMENTS


I am deeply indebted to two special people who have greatly influenced my graduate education, Dr. William Ware, chairman of my

guidance committee, and Dr. Linda Crocker, unofficial cochairman of my

committee. Their continued encouragement and support has resulted in

my reaching this point in my graduate studies. I shall always be extremely grateful to Dr. Ware and Dr. Crocker, for whatever skills I have developed as a researcher and as a teacher are in large part due to their advice and guidance. To them I owe the high value I place on

objective, quantitative research methods. Further, I would like to

acknowledge the tremendous amount of time they spent in molding the

final copy of this manuscript.

I would also like to express my appreciation to the members of my

committee, Dean John Newell and Dr. William Powell, for their suggestions

and editorial comments on this dissertation. Special thanks are extended

to Dr. Wilson Guertin for his assistance with portions of the study, and

as an unofficial member of my committee.

I would like to thank Dr. Jeaninne Webb, Director of the Office of

Instructional Resources, and Mr. Robert Feinberg and Ms. Arlene Barry

of the Testing Division, for providing the data used in this study.

Finally, I would like to express my sincere appreciation to my

friends and family who stood by me during very trying times in my

graduate education.










TABLE OF CONTENTS


ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTER

I. INTRODUCTION

      The Problem
      Purpose of the Study
      Significance of the Study
      Organization of the Study

II. REVIEW OF THE LITERATURE

      Item Analysis Procedures for the Classical Model
      Research Related to Classical Item Analysis in Test Development
      Simplified Methods of Obtaining Item Discrimination
      Item Analysis Procedures for the Factor Analytic Model
      Research Related to Factor Analysis in Test Development
      Comparison of Factor Analysis to Classical Item Analysis
      Item Analysis Procedures for the Latent Trait Model
      Research Related to Latent Trait Models in Test Development
      Comparison of the Rasch Model to Factor Analysis
      Summary

III. METHOD

      The Sample
      The Instrument
      The Procedure
      Design
      Item Selection
      Double Cross-Validation
      Statistical Analyses
      Summary









TABLE OF CONTENTS (Continued)


CHAPTER                                                        PAGE

IV. RESULTS . . . . . . . . . . . . . . . . . . . . . . . . . .  53

      Item Selection . . . . . . . . . . . . . . . . . . . . .  54
      Double Cross-Validation . . . . . . . . . . . . . . . . .  69
      Comparison of the 15 Item Tests on Precision . . . . . .  69
      Comparison of the 30 Item Tests on Precision . . . . . .  75
      Comparison of the 30 Item Tests on Efficiency . . . . . .  81
      Summary . . . . . . . . . . . . . . . . . . . . . . . . .  82

V. DISCUSSION AND CONCLUSIONS . . . . . . . . . . . . . . . . .  87

      The Precision of the Tests Produced by the
        Three Methods of Item Analysis . . . . . . . . . . . .  87
      Internal Consistency . . . . . . . . . . . . . . . . . .  88
      Standard Error of Measurement . . . . . . . . . . . . . .  89
      Types of Items Retained . . . . . . . . . . . . . . . . .  90
      Conclusions . . . . . . . . . . . . . . . . . . . . . . .  91
      The Efficiency of the Tests Produced by the
        Three Methods of Item Analysis . . . . . . . . . . . .  93
      Conclusions . . . . . . . . . . . . . . . . . . . . . . .  95
      Implications for Future Research . . . . . . . . . . . .  95

VI. SUMMARY . . . . . . . . . . . . . . . . . . . . . . . . . .  99

REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . 105

APPENDIX A: Mathematical Derivation of the Rasch Model . . . . 112

APPENDIX B: Relative Efficiency Values Used in Figure
            2 for the Comparisons Among Item Analytic
            Methods . . . . . . . . . . . . . . . . . . . . . . 119

BIOGRAPHICAL SKETCH . . . . . . . . . . . . . . . . . . . . . . 120









LIST OF TABLES


TABLE PAGE

1 DESCRIPTIVE DATA ON THE VERBAL APTITUDE SUBTEST
OF THE FLORIDA TWELFTH GRADE TEST 1975
ADMINISTRATION . . . . . . . . . . . . . . . . . . 41

2 SYSTEMATIC SAMPLING DESIGN OF THE STUDY N = 5,235 .. 43

3 DOUBLE CROSS-VALIDATION DESIGN OF THE STUDY . . .. 47

4 DEMOGRAPHIC BREAKDOWN BY ETHNIC ORIGIN AND
SEX FOR TOTAL SAMPLE . . . . ... ....... 55

5 SUMMARY STATISTICS ON THE 50 TEST ITEMS BASED
ON CLASSICAL ITEM ANALYSIS FOR EACH SAMPLE SIZE . . 56

6 ITEM LOADINGS ON THE FIRST UNROTATED FACTOR FOR
THE 50 TEST ITEMS BASED ON FACTOR ANALYSIS FOR
EACH SAMPLE SIZE . . . . . . . . . . 58

7 SUMMARY STATISTICS ON THE 50 TEST ITEMS BASED
ON THE RASCH MODEL FOR EACH SAMPLE SIZE . . . . . 61

8 DESCRIPTIVE DATA ON ITEM DISCRIMINATION ESTIMATES
BASED ON THE RASCH MODEL ACCORDING TO SAMPLE SIZE . 65

9 THE 15 BEST ITEMS SELECTED UNDER EACH ITEM ANALYTIC
PROCEDURE ACCORDING TO SAMPLE SIZE . . . .... 66

10 THE 30 BEST ITEMS SELECTED UNDER EACH ITEM ANALYTIC
PROCEDURE ACCORDING TO SAMPLE SIZE . ... ...... 67

11 DESCRIPTIVE STATISTICS FOR THE TEST COMPOSED OF
THE 15 BEST ITEMS SELECTED BY EACH ITEM
ANALYTIC PROCEDURE ACCORDING TO SAMPLE SIZE . . .. 70

12 CONFIDENCE INTERVALS FOR THE OBSERVED INTERNAL
CONSISTENCY ESTIMATES BASED ON THE 15 ITEM
TESTS ACCORDING TO SAMPLE SIZE . . . . ... 72

13 15 ITEM TESTS: DESCRIPTIVE STATISTICS FOR ITEM
DIFFICULTY BY PROCEDURE AND SAMPLE SIZE . . . . 74

14 15 ITEM TESTS: DESCRIPTIVE STATISTICS FOR ITEM
DISCRIMINATION BY PROCEDURE AND SAMPLE SIZE . . .. 75

15 POST HOC COMPARISONS OF THE DIFFERENCES BETWEEN
THE MEAN ITEM DISCRIMINATION FOR THE 15 ITEM TESTS . 76









LIST OF TABLES (Continued)


TABLE PAGE

16 DESCRIPTIVE STATISTICS FOR THE TEST COMPOSED
OF THE 30 BEST ITEMS SELECTED BY EACH ITEM
ANALYTIC PROCEDURE ACCORDING TO SAMPLE SIZE . . .. 77

17 CONFIDENCE INTERVALS FOR THE OBSERVED
INTERNAL CONSISTENCY ESTIMATES BASED ON
THE 30 ITEM TESTS ACCORDING TO SAMPLE SIZE .. ... 78

18 30 ITEM TESTS: DESCRIPTIVE STATISTICS FOR
ITEM DIFFICULTY BY PROCEDURE AND SAMPLE
SIZE . . . . . . . . ... . . .. . 80

19 30 ITEM TESTS: DESCRIPTIVE STATISTICS FOR
ITEM DISCRIMINATION BY PROCEDURE AND SAMPLE
SIZE . . . . . . . . . . . . 80









LIST OF FIGURES


FIGURE PAGE

1 HYPOTHETICAL ITEM CHARACTERISTIC CURVES FOR
THE FOUR LATENT TRAIT MODELS .. . . . . . .. 29

2 RELATIVE EFFICIENCY COMPARISONS FOR THE
THREE 30 ITEM TESTS N = 995 . . . . . ... 83










Abstract of Dissertation Presented to the Graduate
Council of the University of Florida in Partial Fulfillment
of the Requirements for the Degree of Doctor of Philosophy



A COMPARISON OF THREE TYPES OF ITEM ANALYSIS
IN TEST DEVELOPMENT USING CLASSICAL
AND LATENT TRAIT METHODS



By

Iris G. Benson

December 1977

Chairman: William B. Ware
Major Department: Foundations of Education

Test reliability and validity are determined by the quality of the

items in the tests. Through the application of item analysis procedures,

test constructors are able to obtain quantitative, objective information

useful in developing and judging the quality of a test and its items.

Classical test theory forms the basis for one method of test

development. An integral part of the development of tests based on the

classical model is selection of a final set of items from an item pool

based on classical item analysis or factor analysis. Classical item

analysis requires identification of single items which provide maximum

discrimination between individuals on the latent trait being measured.

The biserial correlation between item score and total score is commonly

used as an index of item discrimination.

An alternative method of test development, but based on the

classical model, is factor analysis. Factor analysis is a more complex

test development procedure than classical item analysis. It is a










statistical technique that takes into account the item correlation

with all other individual items in the test simultaneously. Thus,

classical item analysis can be viewed as a unidimensional basis for

item analysis, less sophisticated than the multidimensional procedure

of factor analysis.

Recently, the field of latent trait theory has provided a new approach

to test construction. Several latent trait models have been developed;

however, this study was concerned only with the one-parameter logistic

Rasch model. The Rasch model was chosen because it is the most parsi-

monious of the latent trait models and has recently been used in the

development and equating of tests.

A review of the literature revealed numerous studies conducted in

each of the three areas of item analysis, but no comparative studies

were reported among all three item analytic techniques. Therefore,

the present study was designed to compare the methods of classical

item analysis, factor analysis, and the Rasch model in terms of test

precision and relative efficiency.

An empirical study was designed to compare the effects of the

three methods of item analysis on test development across different

sample sizes of 250, 500, and 995 subjects. Item response data were

obtained from a sample of 5,235 high school seniors on a 50 item cogni-

tive test of verbal aptitude. The subjects were divided into nine

independent samples, one for each item analytic technique and sample

size. The study was conducted in three phases: item selection,

computation of item and test statistics for selected items on double

cross-validation samples, and statistical analyses of item characteris-

tics. For each item analytic procedure two tests were developed:








a 15 item test, and a 30 item test. Four dependent variables were

obtained for each test to assess precision: internal consistency

estimates, standard error of measurement, item difficulties, and item

discrimination. In addition, the relative efficiencies of the 30 item

tests developed by each item analytic technique were compared for the

sample of 995 subjects.

The results of the analysis revealed that there were no differences

between the tests developed by the three methods of item analysis,

in terms of the precision of measurement. In terms of efficiency,

substantive differences between the tests produced by the three item

analytic methods were observed. Specifically, the tests based on class-

ical test theory were more effective for measuring very low and very

high ability students. The Rasch developed test was more efficient for

assessing average and high ability students.
















CHAPTER I

INTRODUCTION



The systematic approach to test development was initiated by Binet

and Simon in 1916. Since that time psychometricians have been concerned

with the extent to which accurate measurement of a person's "ability"

is possible. Most measurement experts agree that upon repeated testing

an individual's observed score will vary even though his true ability

remains constant. This variability is the essence of classical test

theory.

Classical test theory is based upon the assumption that a person's

observed score (X) is made up of a true score (T) and error score (E)

denoted:

X = T + E. (1)

Limited by few assumptions, this theory has wide applications. The few

assumptions pertain to the error score (Magnusson, 1966, p. 64):

1. The mean of an examinee's error scores on an infinite

number of parallel tests is zero.

2. The correlation between examinee's error scores on parallel

tests is zero.

3. The correlation between examinees' error scores and true

scores is zero.

Relying upon these assumptions, psychometricians have used the observed

score (X) to represent the best estimate of a person's true score (T).








The accuracy of the observed score (X) in representing an examinee's

true score (T) is described by the reliability coefficient. One definition

of reliability is given by the coefficient of precision. This coefficient

is the correlation between truly parallel tests, assuming the examinee's

true score does not change between two measurements. Lord and Novick

(1968) have defined truly parallel tests to be those for which, "the

expected values [true scores] of parallel measurements are equal; and the

observed score variances of parallel measurements are equal (p. 48)."

The reliability coefficient for the population is defined as

(Lord and Novick, 1968, p. 134):
r_{xx'} = \frac{\sigma_T^2}{\sigma_X^2} = 1 - \frac{\sigma_E^2}{\sigma_X^2}                         (2)

where \sigma_T^2 is the true score variance, \sigma_X^2 is the observed score variance,

and \sigma_E^2 is the error score variance. When this expression is used to

represent the coefficient of precision, it can be interpreted as the

extent to which unreliability is due solely to inadequacies of the test

form and testing procedure rather than due to changes in examinees over

time.
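As a brief numerical illustration of equation 2 (the values are hypothetical, not taken from the study): if \sigma_T^2 = 80 and \sigma_E^2 = 20, then \sigma_X^2 = 100 and

r_{xx'} = \frac{80}{100} = 1 - \frac{20}{100} = .80,

so 80 percent of the observed score variance would be attributable to true score variance.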

The coefficient of precision is a theoretical value because the components \sigma_T^2 and \sigma_E^2 cannot be observed. The coefficient of precision

is usually estimated by internal consistency methods. Internal con-

sistency is a measure of the relationship between random parallel tests.

Random parallel tests are composed of items drawn from the same population

of items (Magnusson, 1966, pp. 102-103). Scores on these tests may differ somewhat from true scores in means, standard deviations, and

correlations because of random errors in the sampling of items. However,

random parallel tests are more often encountered in practice than are








truly parallel tests. Cronbach's coefficient alpha (1951) is the

internal consistency coefficient commonly used to represent the average

correlation among all possible tests created by dividing the domain

into random halves. Thus, the internal consistency coefficient indicates

the extent to which all the items are measuring the same ability or trait.

Psychological traits are often described as latent because they cannot

be directly observed. Therefore, psychological tests are developed in

an attempt to measure these latent traits.

Classical test theory forms the basis for one method of test

development. An integral part of the development of tests based on the

classical model is the utilization of classical item analysis or factor

analysis. Classical item analysis is a procedure to obtain a description

of the statistical characteristics of each item in the test. This

approach requires identification of single items which provide maximum

discrimination between individuals on the latent trait being measured.

Theoretically, selecting items which have high correlations with total

test score will result in a discriminating test which is homogeneous

with respect to the latent trait. Therefore, classical item analysis

is an aid to developing internally consistent tests.

An alternative method of test development, but based on the classi-

cal model, is factor analysis. Factor analysis is a more complex test

development procedure than classical item analysis. It is a statistical

technique that takes into account the item correlation with all other

individual items in the test simultaneously. Groups of similar items

tend to cluster together and comprise the latent traits (factors) under-

lying the test. Under the classical model then, classical item analysis

can be viewed as a unidimensional basis for item analysis, less sophisticated than the multidimensional procedure of factor analysis.







The purpose of factor analysis is to represent a variable in terms

of one or several underlying factors (Harman, 1967). Depending upon the

objective of the analysis, two general approaches are used in

factor analysis: (a) common factor analysis, and (b) principal com-

ponents analysis. A common factor solution would be warranted if the

researcher were interested in determining the number of common and

unique factors underlying a given test. A principal component solution

would be warranted if it were of interest to extract the maximum

amount of variance from a given test.

Regardless of the approach used, factor analysis is an item analytic

technique in which all test items are considered simultaneously to pro-

duce a matrix of item correlations with factors. It is these correlations

or item loadings that indicate the strength of the factor and also the

number of factors underlying the test. However, factor analysis shares

the weakness of classical item analysis, that of being sample dependent.

Critics of classical test theory contend that a major weakness of

tests developed from this model is that the item statistics vary when

the examinee group changes; item statistics may also vary if a different

set of items from the same domain is used with the same examinee group

(Hambleton and Cook, 1977; Wright, 1968). Thus, the selection of a final

set of test items will be sample dependent.

Until recently, classical item analysis and factor analysis were the

only techniques described in measurement texts for use in item analysis

and test development (Baker, 1977). However, with the publication of

Lord and Novick's Statistical Theories of Mental Test Scores (1968) and

the availability of computer programs, considerable attention is being

directed now toward the field of latent trait theory as a new area in








test development. Latent trait theory dates back to Lazarsfeld (1950)

who introduced the concept; however, Fredrick Lord is generally given

credit as the father of latent trait theory (Hambleton, Swaminathan,

Cook, Eignor, and Gifford, 1977). Proponents of this approach claim

that the advantages of latent trait theory over classical test theory

are twofold: (a) theoretically it provides item parameters which are

invariant across examinee samples which will differ with respect to

the latent trait, and (b) it provides item characteristic curves that

give insight into how specific items discriminate between students of

varying abilities. These properties of latent trait theory will be

presented in more detail in Chapter II.

Four latent trait models have been developed for use with

dichotomously scored data: the normal ogive, and the one-, two-, and

three-parameter logistic models (Hambleton and Cook, 1977; Lord and Novick,

1968). This study is concerned with the one-parameter logistic Rasch

model because it is the simplest of the four models.

Tests developed using the Rasch model are intended to provide

objective measurement of the examinee's true ability on the latent

trait in question, as well as providing for invariant item parameters

(Rasch, 1966; Wright, 1968). That is, any subset of items from a

population of items that have been calibrated by the Rasch model should

accurately measure the examinee's true ability regardless of whether the

items are very easy or very difficult; also, the item parameters should

remain constant over different examinees. In measurements obtained from

classical test theory this objective feature is rarely attained. The

item parameters associated with classical test theory are group and

item specific. That is, the item parameters are determined by the








ability of the people taking the test and the subset of items chosen.

Wright (1968) has stated, "The growth of science depends on the develop-

ment of objective methods for transforming an observation into measure-

ment (p. 86)." Latent trait theory is an attempt to develop mental

measurement into a technique similar to measurement in the physical

sciences.

Latent trait theory is based on strong assumptions that are re-

strictive and hence limit its application (Hambleton and Cook, 1977). The

assumptions required for the Rasch model are the following (Rasch, 1966):

1. The test is unidimensional, e.g., there is only one factor

or trait underlying test performance.

2. The item responses of each examinee are locally independent,

e.g., success or failure on one item does not hinder other item responses.

3. The item discriminations are equal, e.g., all items load

equally on the factor underlying the test.

Lord and Novick (1968) noted that the assumptions of unidimensionality

and local independence are synonymous. To say that only one underlying

ability is being tested means the items are statistically independent

for persons at the same ability level. The third assumption relates to

item characteristic curves. The item characteristic curve is a mathematical

function that relates the probability of success on an item to the

ability measured by the test. Curves vary in slope and intercept to

reflect how items vary in discrimination and difficulty. The one-parameter logistic Rasch model (the one parameter is item difficulty) assumes all item discriminations are equal. Thus all item characteristic

curves should be similar with respect to their slopes.







The Problem

Several studies have been conducted to verify the invariant prop-

erties of tests constructed using the Rasch model (Tinsley and Dawis,

1975; Whitely and Dawis, 1974; Wright, 1968). If we assume that tests

developed using latent trait theory possess the quality of invariant

item statistics, why then hasn't latent trait theory been more visible

in the psychometric community? There appear to be three main reasons

for this slow acceptance. First, the Rasch procedure is based on a

mathematical model involving restrictive assumptions, e.g., the uni-

dimensionality of the items, the local independence of the items, and

equal item discrimination. A further restriction of the Rasch model

is the assumption of minimal guessing. However, several researchers

have demonstrated the robustness of the model with regard to departures

from the basic assumptions (Anderson, Kearney and Everett, 1968; Dinero

and Haertel, 1976; Rentz, 1976). Second, latent trait theory has not been

used in practical testing situations because until recently there was a

lack of available computer programs to handle the complex mathematical

calculations. Hambleton et al. (1977) described four computer programs

now available to the consumer. Third, measurement experts who are

knowledgeable about latent trait models have been skeptical as to the

real gains that may be available through this line of research. Are

tests developed using latent trait models superior to tests developed

using classical item analysis or factor analysis?

The purpose of this study was to compare the precision and efficiency

of cognitive tests constructed by the three methods (classical item

analysis, factor analysis and the Rasch model) from a common item and

examinee population. Precision, as measured by internal consistency,







is an overall estimate of a test's homogeneity, but provides no infor-

mation on how the test as a whole discriminates for the various ability

groups taking the test. For that reason, measures of test efficiency

(Lord, 1974a, 1974b) were incorporated into the study. Test efficiency

provides information on the effectiveness of one test over another as

a function of ability level. A cognitive college admissions subtest was

used in this study for several reasons. First, tests of this type are

widely used by educational institutions for a large number of examinees

each year, in the areas of selection, placement, and academic counseling.

Most college admission examinations traditionally have been developed

using classical item analysis. Second, because of the importance of

the decisions made using such test scores, it would be worth investing

considerable time and expense in the development of these instruments.

Thus, the use of factor analysis or the Rasch model would be justified

if superiority of either of these methods over classical item analysis

could be determined. Third, the items on college admission tests have

been written by experts, and each subtest is intended to be unidimensional,

e.g., items measuring a single ability. Thus, assumptions from all

models should be met. Fourth, because of the time required to take such

examinations, it is important to maximize the precision and the effect-

iveness of the tests. The possibility of using fewer items while main-

taining precision would be desirable. Therefore, the question of which

test development procedure can best accomplish this is not a trivial one.

Purpose of the Study

The purpose of this study was to compare empirically the Rasch model

with classical item analysis and factor analysis in test development.

Five research questions guided this study.

1. Will the three methods of test development produce tests with

superior internal consistency estimates when compared to the projected









internal consistency of the population as the number of items decreases?

2. Will the three methods of test development produce tests

with stable estimates of internal consistency when the number of

examinees decreases?

3. Will the three methods of test development produce tests

with similar standard errors of measurement?1

4. Will the three methods of test development select items that

are similar in terms of difficulty and discrimination?

5. Will the three methods of test development produce equally

efficient tests for all ability levels?

Hypotheses

This study investigated the capacities of three methods of test

development to increase precision and efficiency of measurement in test

construction. The five questions posited in the previous section were

phrased as testable hypotheses:

1. There are no significant differences in the internal consistency

estimates of the tests produced by the three methods, as the number of items

decreases, when compared to the projected internal consistency estimates

for the population for tests of similar length.




1The standard error of measurement (SEM) is defined in the classical
sense as (Magnusson, 1966, p. 79):

SEM = S_X \sqrt{1 - r_{xx'}}

where S_X is the standard deviation of the test, and r_{xx'} is the
reliability coefficient.
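As a numerical illustration of this footnote (the values are hypothetical, not taken from the study): for S_X = 10 and r_{xx'} = .91, SEM = 10\sqrt{1 - .91} = 3.0, so observed scores would be expected to fall within about three points of the true score roughly two-thirds of the time under the usual normality assumptions.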









2. There are no differences in the internal consistency estimates

of the tests produced by the three methods when the number of examinees

is decreased.
3. There are no meaningful2 differences in the magnitude of the

standard error of measurement of the tests produced by the three

methods.

4. There are no significant differences in the difficulties or

discrimination of the items selected by the three methods.

5. There are no differences across ability levels in the efficiency

of the tests produced by the three methods.

Significance of the Study

Objective measurement has always been assumed in the physical

sciences. It has only been recently that objective measurement in the

behavioral sciences has been deemed possible with the advent of latent

trait theory. Since the introduction of latent trait theory by

Lazarsfeld (1950) and Lord (1952a, 1953a, 1953b) much of the research

on latent trait models has been confined to theoretical research journals.

Wright (1968), speaking at a conference on testing problems, discussed

at an applied level the need to seriously consider latent trait theory

and the Rasch model in particular as a major test development technique

far superior to classical item analysis and factor analysis. However,

even in 1968 computer programs were not yet available to run the analyses




2Because test scores are usually reported and interpreted in
whole numbers, a "meaningful" difference in the standard error of
measurement is defined as a difference of \geq 1.00.







should anyone beyond academicians be interested. Today this obstacle

has been overcome, but many test developers remain unconvinced of the

value of latent trait theory because its superiority to classical test

theory has not been conclusively demonstrated. This study is an attempt

to provide an empirical comparison of classical test theory and latent

trait theory methods of test construction.

Of the various logistic models that represent latent trait theory

the Rasch model was chosen for comparison with traditional item

analysis procedures in the present study because it is the most

parsimonious latent trait model and has been used recently in the

development and equating of tests (Rentz and Bashaw, 1977; Woodcock,

1974). The Rasch model provides a mathematical explanation for the

outcome of an event when an examinee attempts an item on a test. Rasch

(1966) stated that the outcome of an encounter is governed by the pro-

duct of the ability of the examinee and the easiness of the item and

nothing more. The implication of this simple concept (objectivity of

measurement) would seem to revolutionize mental measurement. If invariant

properties of items and ability scores can be identified and used to

improve the psychometric quality of tests to an extent greater than now

possible with classical and factor analytic procedures then we truly

are in the age of modern test theory.

Organization of the Study

The theoretical and empirical studies related to the three methods

of item analysis are described in Chapter II. An empirical investigation

to compare the three methods of item analysis under varying conditions

is described in Chapter III. The results of the study are reported in

Chapter IV. A discussion of the results, conclusions of the study, and







implications for future research in this area have been presented in

the fifth chapter. A summarization of the study has been provided in

Chapter VI.















CHAPTER II

REVIEW OF THE LITERATURE



The quality of the items in a test determines its validity and

reliability. Through the application of item analysis procedures, test

constructors are able to obtain quantitative objective information

useful in judging the quality of test items. Item analysis thus pro-

vides an empirical basis for revising the test, indicating which items

can be used again and which items have to be deleted or rewritten

(Lange, Lehmann, and Mehrens, 1967). Item analysis data also help

settle arguments and objections to specific items that might be raised

by administrators, test experts, examinees, or the public.

This study is focused on three approaches to item analysis (classical

item analysis, factor analysis, and the Rasch model) as test construction

techniques. It is assumed throughout this study that the test under

construction is unidimensional, e.g., all items are measuring only one

ability. These three approaches to item analysis and the relevant

research related to each method are discussed in this chapter.

Item Analysis Procedures for the Classical Model

Item analysis as a test development technique emerged at the begin-

ning of this century. Binet and Simon (1916) were among the first to

systematically validate test items. They noted the proportion of

students at particular age levels passing an item. This statistic was








measuring the relative difficulty of the items for different age groups.

The item difficulty index, defined as the percentage of persons passing

an item and denoted by p, is one of the statistics used in classical

item analysis.

Item difficulty is related to item variance and hence to the

internal consistency of the test. Test constructors are usually con-

cerned with achieving high test reliability, e.g., precision of measure-

ment. Therefore, an item difficulty of .50 is considered to be the ideal

value necessary to maximize test reliability. This is because half

the examinees are getting the item correct and half the examinees are

missing the item. The proportion missing an item is defined as 1-p or

q. Thus, when p is equal to .50, q is equal to .50. Because the

variance of a dichotomized item is p x q the maximum variation an item

can contribute to total test variance and ultimately to true-score

variance is .25. As an item's difficulty index deviates from .50, its

contribution to total test variance is always some value less than .25.
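As an illustration (hypothetical values, not from the original text): an item with p = .50 contributes pq = (.50)(.50) = .25 to the summed item variance, whereas an item with p = .80 contributes only (.80)(.20) = .16.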

Hence test constructors have been advised (Gulliksen, 1945) to select

items with difficulty indices at or near .50. However, when items

are presented in multiple choice or alternate choice format, the ideal level of difficulty is adjusted to accommodate for guessing.3

A second important item statistic in classical item analysis is

the item discrimination index. An item discrimination index is a measure




3The ideal value of p = .50 assumes there has been no guessing on
the item. The effects of guessing on item difficulty tend to increase
the ideal value of p. For example, on a four option multiple choice
item the chance of guessing the correct answer is (1/4)(.50) = .12. The
value of .12 is added to .50 to correct for the effect of guessing and
the ideal p would now be .62 (Lord, 1952b; Mehrens and Lehmann, 1975).








of how well the item discriminates between persons who have high test

scores and persons who have low test scores. The discrimination index

is often expressed as a correlation between the item and total test

score. When the criterion is total test score, the correlation coef-

ficient indicates the contribution that item makes to the test as a

whole. Thus, on tests of academic achievement it is a measure of item

validity as well as a contributor to internal consistency. Noting an

increasing use of item analytic procedures for the improvement of

objective examinations, Richardson (1936) pointed out that the

development of the procedures of item analysis had centered primarily

around the invention of various indices of association between the test

item and the total test score, e.g., item discrimination indices.

The two most popular item-test correlation indices are the biserial

and point biserial correlations. The point biserial was developed by

Pearson (1900) and is a special case of the more general Pearson Product

Moment (PPM) correlation coefficient (Magnusson, 1966). This index

is recommended when one of the variables being correlated (the item

score) represents a true dichotomy and the other variable (total test

score) is continuously distributed. Pearson (1909) also derived the

biserial correlation which is an estimate of the PPM. The biserial

correlation is recommended when one of the variables (the item score)

has an underlying continuous and normal distribution which has been

artificially dichotomized and the other variable (total test score) is

continuously distributed. The assumption for the point biserial

correlation is often hard to justify when it is suspected that knowledge

required to answer an item is continuously distributed.









In considering the dichotomized item (pass/fail), McNemar (1962)

has commented, "It is obvious that failing a test item represents any-

thing from a dismal failure up to a near pass, whereas passing the

item involves barely passing up to passing with the greatest of ease"

(p. 191). Thus, the biserial correlation is usually favored over the

point biserial correlation as a measure of item discrimination. Also,

the biserial is often chosen over the point biserial because the

magnitude of the point biserial correlation for an item is not in-

dependent of the item difficulty (Davis, 1951; Henrysson, 1971;

Swineford, 1936). Specifically, values of the point biserial are

systematically depressed as p approaches the extremes of .00 or 1.00.

Lord and Novick (1968) have pointed out that because of this bias, the

point biserial correlation tends to favor medium difficulty items over

easy or very difficult items.

The formulae for the biserial and point biserial correlation

respectively are (Magnusson, 1966, p. 200 & 203):

r_{bis} = \frac{\bar{X}_p - \bar{X}_q}{s_y} \cdot \frac{pq}{Y}                         (3)

r_{pbis} = \frac{\bar{X}_p - \bar{X}_q}{s_y} \cdot \sqrt{pq}                         (4)

where \bar{X}_p is the mean of y scores for persons who correctly solved the
item, \bar{X}_q is the mean of y scores for persons who incorrectly solved
the item, s_y is the standard deviation of the y test scores, p and q
have been previously defined, and Y is the ordinate of the dividing
line between the proportions p and q in a unit normal distribution
(Magnusson, 1966).
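To make the computation of these two indices concrete, the following is a minimal sketch in Python (not part of the original study; the function name, the 0/1 coding of item scores, and the use of SciPy's standard normal density for the ordinate Y are assumptions):

    import numpy as np
    from scipy.stats import norm

    def item_discrimination(item, total):
        # Biserial (equation 3) and point biserial (equation 4) correlations
        # between a dichotomously scored item and the total test score.
        item = np.asarray(item, dtype=float)
        total = np.asarray(total, dtype=float)
        p = item.mean()                       # proportion passing the item
        q = 1.0 - p
        mean_pass = total[item == 1].mean()   # mean total score of those passing
        mean_fail = total[item == 0].mean()   # mean total score of those failing
        s_y = total.std()                     # standard deviation of total scores
        Y = norm.pdf(norm.ppf(p))             # ordinate dividing p and q in a unit normal
        r_pbis = (mean_pass - mean_fail) / s_y * np.sqrt(p * q)
        r_bis = (mean_pass - mean_fail) / s_y * (p * q / Y)
        return r_bis, r_pbis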








One of the main objectives of classical test theory is to improve

the internal consistency of the test under construction where internal

consistency was defined as the extent to which all items are measuring

the same ability. To ensure high internal consistency the random error

in the test must be minimized. As stated previously in Equation 2,

reliability, in the classical model, was defined as:



r_{xx'} = \frac{\sigma_T^2}{\sigma_X^2} = 1 - \frac{\sigma_E^2}{\sigma_X^2}

Thus, the relationship among the test items can be noted in the
coefficient alpha formulae for estimating internal consistency for a
sample (Magnusson, 1966, pp. 116-117):

r_{xx'} = \frac{n}{n-1}\left(1 - \frac{\Sigma S_i^2}{S_X^2}\right)                         (5)

or

r_{xx'} = \frac{n^2 \bar{C}_{ik}}{S_X^2}                         (6)

where n is the number of test items, \Sigma S_i^2 is the sum of the item
variances, S_X^2 is the variance of the test, and \bar{C}_{ik} is the mean of the
item covariances. By comparing equation 2 with 5, it is seen that the
sum of the unique item variances is used as an estimate of \sigma_E^2, and that
when the unique item variation is minimized internal consistency will
be high. Furthermore, the mean of the item covariances (equation 6)
serves as an estimate of \sigma_T^2. The size of the covariance term is in
turn determined by the intercorrelations and standard deviations of
the items (Magnusson, 1966). Therefore, internal consistency is directly
dependent upon the correlation among the items in the test.
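The computation of coefficient alpha from equation 5 can be sketched as follows (an illustration only, not the routines used in the study; it assumes an examinee-by-item score matrix held in a NumPy array):

    import numpy as np

    def coefficient_alpha(scores):
        # Cronbach's coefficient alpha (equation 5) for an examinee-by-item matrix.
        scores = np.asarray(scores, dtype=float)
        n = scores.shape[1]                              # number of items
        item_variances = scores.var(axis=0, ddof=1)      # S_i^2 for each item
        total_variance = scores.sum(axis=1).var(ddof=1)  # S_X^2 of the total scores
        return (n / (n - 1)) * (1.0 - item_variances.sum() / total_variance)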









The item discrimination index provides a measure of how well an item contributes to what the test as a whole measures. When items with the highest item-test correlations are selected, the homogeneity of the test is increased; that is, coefficient alpha is increased. So it is the item

discrimination that directly affects test reliability. When items with

low item-test correlations are eliminated, the remaining item inter-

correlations are raised. When item-test correlations are high, the test

is able to discriminate between high and low scorers and hence internal

consistency is increased. If too few items are discarded in an item

analysis the internal consistency of the test tends to decrease because

items with little power of measuring what the entire test is intended

to measure will dilute the measuring power of the efficient items

(Beddell, 1950).

Research Related to Classical Item Analysis in Test Development

Several articles have been published concerning standards for item

selection to maximize test validity and increase internal consistency.

Flanagan (1939) stated two considerations in selecting test items:

(a) the item must be valid, that is, it should discriminate between

high and low scorers, and (b) the level of item difficulty should be

suitable for the examinee group. Gulliksen (1945) agreed with Flanagan on these two points and added a third: items selected with p = .50 would

produce the most valid tests; however, Gulliksen noted that current

practice was opposed to selecting items with difficulty near .50. Test

developers were selecting items based upon spreading difficulty indices

over a broad range.

Several studies have been conducted to examine the effects of

varying item difficulty on test development. Brogden (1946), in a







study of test homogeneity, has shown empirically that a test of 45

items with varying levels of item difficulty produced a reliability

of .96 (measured by the Kuder-Richardson 20 formula). However, a similar but longer test of 153 items, which had item difficulties at .50 for all items, produced a reliability of .99. Thus, Brogden con-

cluded that effective item selection was based more on selecting a

test with fewer items that possessed varying difficulty, than a longer

test with equal item difficulty.

Davis (1951), in commenting on item difficulty, stated that if

all test items had a difficulty of .50 and were uncorrelated then

maximum discrimination was achieved. But when test items were cor-

related, maximum discrimination would only be achieved when the

difficulty index for all test items was spread out, e.g., several

difficult items, several easy items, and several items with difficulty

near .50. Davis recommended the latter procedure for test development

because test items are usually correlated to some degree. Davis also

recognized the need for the approval of subject matter specialists in

addition to statistical criteria in item selection.

In a study of test validity, Webster (1956) found results similar

to Brogden (1946), but different from Gulliksen (1945). By selecting

fewer items with high discrimination indices and varying item difficulty

levels, a more valid test was produced. Webster's results indicated

that a test of 178 items with difficulty indices near .50 had a validity

coefficient of .66. However, a test of 124 similar items with varying

item difficulties had a validity coefficient of .76, statistically

significant at p < .03 (based on r to z transformations).








Myers (1962), concerned by the current practice of selecting

items based on varying item difficulties instead of the theoretical ideal of p = .50, compared the effect of the current practice to that of the theoretical ideal on the reliability and validity of a scholastic aptitude

test. The ideal item difficulty ranged from .40 to .74 in what he

called the peaked test. Items selected by the current practice were

outside the above range, and Myers called this the U-shaped test.

Two sets of items were selected for the peaked test and the U-shaped

test, four tests in all. Myers reported no statistically significant

differences in test validity when the different tests were correlated

with freshman grades. Test reliability was statistically significant

at P < .02 (using the Wilcoxon matched pairs sign test) in favor of the

peaked test. The reliability of the peaked test was .69. The

reliability of the U-shaped test was .63. The author noted that the

results above were based on a 24 item test, and that when test length

was projected to 48 items (via Spearman-Brown Prophecy Formula) there

were no significant differences in test reliability. The studies of

Brogden (1946) and Webster (1956) indicate that selecting items of

varying item difficulty tends to increase internal consistency and test

validity. The results from Myer's (1962) study indicated just the

opposite, that item difficulty near .50 produced the more internally

consistent test. But this was only true for a relatively short test

of 24 items, and that when the test length was projected to 48 items,

there were no differences in the reliability of either test based upon

the two methods of selecting items.








Simplified Methods of Obtaining Item Discriminations

A second major group of articles on classical item analysis has

dealt with simplified methods of obtaining indices of item discrimination.

Because of the lack of computers in the early years of test develop-

ment many psychometricians concerned themselves with devising tables

to provide quick estimates of item discrimination. Kelley (1939)

found that in the computation of item discrimination only 54 percent of

the examinee group (based on total test score) needed to be used.

Considering the top 27 percent and the bottom 27 percent of the test

scorers resulted in a considerable savings in computational time.

Flanagan (1939) developed a table of item discrimination to estimate

the PPM correlation between item and test score based on Kelley's

extreme score groups of top and bottom 27 percent.
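The extreme-groups idea itself is simple to state computationally. The sketch below is an illustration rather than Kelley's derivation or Flanagan's table (the function name and the 0/1 coding of item scores are assumptions); it returns the difference in item pass rates between the top and bottom 27 percent of total scorers:

    import numpy as np

    def extreme_group_discrimination(item, total, fraction=0.27):
        # Difference in item pass rates between the upper and lower 27 percent
        # of examinees ranked on total test score (Kelley's extreme groups).
        item = np.asarray(item, dtype=float)
        total = np.asarray(total, dtype=float)
        k = max(1, int(round(fraction * len(total))))
        order = np.argsort(total)             # examinees ordered by total score
        low, high = order[:k], order[-k:]
        return item[high].mean() - item[low].mean()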

Fan (1952) developed a table for the estimation of the tetrachoric

correlation coefficient using the upper and lower 27 percent of the

scorers. The tetrachoric correlation is similar to the biserial

correlation, where the correlation is between two variables, which are

assumed to have a normal and continuous underlying distribution, but

have been artificially dichotomized.

Guilford (1954) presented several short cut tabular and graphic

solutions for estimating various types of correlation coefficients to

measure test item validity. These methods result in saving a considerable amount of time when one is forced to use hand calculations.

Today these short cut methods can be used by classroom teachers who often

do not have the aid of calculators or computers. However, many test

constructors still use these classical methods of item analysis even







though computers are available with which more sophisticated item

analytic techniques such as factor analysis or latent trait models

can be used.

Item Analysis Procedures for the Factor Analytic Model

Charles Spearman (1904) proposed a theory of measurement based on

the idea that every test was composed of one general factor and a num-

ber of specific factors. In order to test his idea Spearman developed

the statistical procedure known as factor analysis.

"Factor analysis is a method of analyzing a set of
observations from their intercorrelations to determine
whether the variations represented can be accounted for
adequately by a number of basic categories smaller than
that with which the investigation started" (Fruchter,
1954, p. 1).

Factor analysis is a mathematical procedure which produces a linear

representation of a variable in terms of other variables (Harman, 1967).

In the case of test items being factor analyzed, a matrix of item

intercorrelations is obtained first. Subsequently, the matrix of item

correlations is submitted to the factoring process. There are two

basic alternatives within the framework of factor analysis for analyzing

a set of data: common factor analysis, based on the work of Spearman

and later Thurstone (1947); and principal components, developed by

Hotelling (1933). The major distinction between the two methods relates

to the amount of variance analyzed, e.g., the values placed in the

diagonal of the intercorrelation matrix. Factoring of the correlation

matrix with unities in the diagonal leads to principal components, while

factoring the correlation matrix with communalities4 in the diagonal




The communality (h of a2variable is defined as the sum of the
S2 (larman 1967, p. 17)
squared factor loadings h = aj + a. + .. a.2 (Harman, 1967, p. 17),
see formula 8. jn








leads to common factor analysis (Harman, 1967). If it is of interest

to know what the test items share in common, a common factor solution

is warranted. But if it is of interest to make comparisons to other

tests or other test development procedures, a principal components

solution is warranted. Since the present study was initiated to com-

pare three different test development techniques, a principal com-

ponents solution was used in this study to analyze the data under the

factor analytic model.

The linear model for the principal components procedure is

defined as (Harman, 1967, p. 15):

Z_{ji} = a_{j1}F_{1i} + a_{j2}F_{2i} + \dots + a_{jn}F_{ni}                         (7)

Z_{ji} is the variable (or item) of interest, and a_{j1} is the coefficient, more frequently referred to as the loading of variable Z_{ji} on component F_1. An important feature of principal components is that the

extracted components account for the maximum amount of variance from

the original variables. Each principal component extracted is a linear

combination of the original variables and is uncorrelated with sub-

sequent components extracted. Thus, the sum of the variances of all

n principal components is equal to the sum of the variances of the

original variables (Harman, 1967). According to Guertin and Bailey

(1970), the principal components solution was designed basically for

prediction, hence the need to use the maximum amount of variance in a

set of variables.
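To make the idea of loadings on the first component concrete, here is a minimal sketch (an illustration only; it assumes the item intercorrelation matrix, e.g., a matrix of tetrachoric correlations, has already been computed and has unities in the diagonal):

    import numpy as np

    def first_component_loadings(R):
        # Loadings of each item on the first principal component of the
        # correlation matrix R (unities in the diagonal).
        R = np.asarray(R, dtype=float)
        eigenvalues, eigenvectors = np.linalg.eigh(R)    # eigenvalues in ascending order
        loadings = eigenvectors[:, -1] * np.sqrt(eigenvalues[-1])
        if loadings.sum() < 0:                           # orient loadings positively
            loadings = -loadings
        return loadings

Items with the largest first-component loadings would be the natural candidates to retain when the goal is a unidimensional test.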

Since factor analysis is based upon a matrix of intercorrelations,

it is important that care be taken in selecting the appropriate

coefficient. Several item coefficients are available: phi, phi/phi

max, and the tetrachoric correlation coefficient. Carroll (1961)








pointed out several problems concerning the choice of a correlation

coefficient to be used in factor analysis. The phi coefficient (used

where both variables are true dichotomies) was found to be affected

by disparate marginal distributions and often underestimated the PPM.

The phi/phi max coefficient was developed to correct for the under-

estimation of phi, but the correction is not enough to counter the

effect of extreme dichotomizations. Carroll recommended the tetrachoric

coefficient as being the least biased by extreme marginal splits

providing the variable under consideration was normally distributed in

the population. Wherry and Winer (1953) had made conclusions similar

to Carroll, but went on to say that when the normality assumption was

met and the regression of test score on the item was linear the PPM

and tetrachoric are identical. The tetrachoric correlation was used in

the present study to obtain item intercorrelations.

Research Related to Factor Analysis in Test Development

The early use of factor analysis to construct and refine tests

was suggested by the work of McNemar (1942) in revising the Stanford-

Binet scales, and Burt and John (1943) in analyzing the Terman-Binet

scales.

Several contemporary psychometricians have advocated the use of

factor analysis in developing unidimensional tests (Cattell, 1957;

Hambleton and Traub, 1973; Henrysson, 1962; Lord and Novick, 1968). A

unidimensional test was defined briefly in the introduction to this

chapter, but a more precise definition is warranted. Lumsden (1961)

noted that a unidimensional test can be determined by the examinee

response patterns. If the test items are arranged from easiest to

hardest, person 1 who misses item 1 will miss all the other items, and person 2 who gets item 1 correct but misses item 2 will miss all the subsequent items, and so on. The above statement assumes infallible

items. However, most tests constructed today contain fallible items,

thus the response pattern will be disturbed by random error. Lumsden

suggested in developing unidimensional tests factorially that the items

be carefully selected on empirical grounds, thus reducing the problem of

too many heterogeneous items and the possibility of obtaining multiple

factors. By preselecting items one increases the chances of the items

converging on one factor.

The importance of developing unidimensional tests is demon-

strated most clearly in considering the concepts of test reliability

and validity. For a test to be valid it must actually measure the trait

it was intended to measure. For a test to be reliable it must provide

similar results upon repeated measurement. It should be easier to

estimate these two important aspects of a test when the test is

unidimensional than when the test is multidimensional, hence the use of

a unidimensional test in the present study.

Cattell (1957) has suggested that in the development of a factor

homogeneous scale, one should preselect items, carry out a preliminary

factor analysis, then select for further analysis those items which

load on the first factor. Cattell defined an index of unidimensionality

as the ratio of the variance of the first factor to the total test

variance. This index has no set criterion and the sampling distribution

is unknown.

Comparison of Factor Analysis to Classical Item Analysis

One measure of item validity, the biserial correlation, was described

for classical item analysis procedures. This same index is also obtained









by factor analysis. When the test items are factor analyzed, the factor

loading a_{j1}, is the item-factor association that is considered a measure of item validity, e.g., the higher the factor loading, the greater the relationship between the item and the factor it measures.

The factor loadings can be viewed as similar to the biserial correlations

discussed under classical test theory. This relationship between

factor loadings and biserial correlations has been discussed by several

authorities (Guertin and Bailey, 1970; Henrysson, 1962; Richardson,

1936).

Factor analysis as an item analytic technique was not realistically

possible for most psychometricians until the advent of high speed

computers. Guertin and Bailey (1970) have predicted that with the

increasing use of computers factor analysis will replace classical item

analysis as a test development technique. Because it is possible for

a test to reach the highest degree of homogeneity and yet be factorially

a very odd mixture of factors (Cattell and Tsujioka, 1964), classical

item analysis alone is not sufficient to determine if a test is

unidimensional. However, factor analysis not only provides a measure

of item-test correlation (the factor loading), it also provides an

indication of how many items form a unifactor test. Thus, factor

analysis has been advocated as a superior technique to classical item

analysis (Guertin and Bailey, 1970). Using factor analysis in test

development, psychometricians have advanced beyond an independent

analysis of item intercorrelations to a simultaneous analysis of item

intercorrelations with other individual items to obtain a measure of

test unidimensionality and item-factor association.








However, there is an inherent flaw in factor analysis as there

was in classical item analysis in test development. The flaw is that

both procedures are sample dependent. When an item analysis procedure,

or any procedure in general, is sample dependent, it means that the

results will vary from group to group. When the groups are very

dissimilar, there is much variability. Gulliksen (1950) noted that a

significant advance in item analysis theory would be made when a

method of obtaining invariant item parameters could be discovered. To

that end latent trait theory is an attempt to identify invariant

item parameters.

Item Analysis Procedures for the Latent Trait Model

Latent trait theory specifies a relationship between the observable

examinee test performance and the unobservable traits or abilities

assumed to underlie performance on a test (Hambleton et al., 1977). The

relationship is described by a mathematical function; hence latent

trait models are mathematical models. As noted earlier, there are four

major latent trait models for use with dichotomously scored data: the

normal ogive, and the one-, two-, and three-parameter logistic models

(Hambleton and Cook, 1977; Lord and Novick, 1968). All four models are

based on the assumption that the items in the test are measuring one

common ability and that the assumption of local independence exists

between the items and examinees. These two assumptions imply that a

test which measures only one trait or ability will have less measurement

error in the test score than a test that is multidimensional, and that

the response of an examinee to one item is not related to his response

on any other item. Where the latent trait models begin to differ is

with respect to the shape of their item characteristic curves.









The normal ogive, developed by Lord (1952a, 1953a), produces

an item characteristic curve based on the following formula:

$$ P_g(\theta) = \int_{-\infty}^{a_g(\theta - b_g)} \phi(t)\,dt \qquad (8) $$

where P_g(θ) is the probability that an examinee with ability θ correctly answers item g, φ(t) is the normal density function, b_g represents item difficulty, and a_g represents item discrimination.
The item characteristic curve of the two-parameter logistic model

developed by Birnbaum (1968) has the same shape as the normal ogive,

and Baker (1961) has shown them to be equivalent mathematical procedures.

The shape of the item characteristic curve of the two-parameter

logistic function is developed from the following formula:

$$ P_g(\theta) = \frac{e^{D a_g(\theta - b_g)}}{1 + e^{D a_g(\theta - b_g)}} \qquad (9) $$

P_g(θ), a_g, and b_g have the same interpretation as in the normal ogive. D is a scaling factor equal to 1.7 (the adjustment between the logistic function and the normal density function), and e is the base of the natural logarithm.
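For illustration, the role of the scaling factor D can be shown with a short computational sketch (not part of the original study; the item parameters a_g = 1.0 and b_g = 0 are arbitrary). With D = 1.7 the logistic curve of equation 9 differs from the normal ogive of equation 8 by less than .01 at every ability level.

    import numpy as np
    from scipy.stats import norm

    def normal_ogive(theta, a, b):
        # Equation 8: area under the standard normal density up to a(theta - b)
        return norm.cdf(a * (theta - b))

    def two_parameter_logistic(theta, a, b, D=1.7):
        # Equation 9: two-parameter logistic item characteristic curve
        z = D * a * (theta - b)
        return np.exp(z) / (1.0 + np.exp(z))

    theta = np.linspace(-3.0, 3.0, 121)                 # ability continuum
    gap = np.abs(normal_ogive(theta, 1.0, 0.0)
                 - two_parameter_logistic(theta, 1.0, 0.0))
    print(round(float(gap.max()), 4))                   # largest discrepancy, under .01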

In Figure 1a the shapes of the normal ogive and the two-parameter logistic curve have been illustrated. In the figure, item A is more discriminating than item B, as noted by the steepness of the slopes.

The three-parameter logistic model, also developed by Birnbaum (1968), includes as an additional parameter an index for guessing. The mathematical form of the three-parameter logistic curve is denoted

$$ P_g(\theta) = c_g + (1 - c_g)\,\frac{e^{D a_g(\theta - b_g)}}{1 + e^{D a_g(\theta - b_g)}} \qquad (10) $$

The parameter c_g, the lower asymptote of the item characteristic curve, represents the probability of low ability examinees correctly answering an item (Hambleton et al., 1977).









[Figure 1. Hypothetical item characteristic curves for the four latent trait models: (a) normal ogive and two-parameter logistic curve, (b) three-parameter logistic curve, (c) Rasch one-parameter logistic curve. Each panel plots the probability of a correct response (0 to 1.00) against the ability continuum (-3.0 to 3.0).]








In Figure 1b the shape of the three-parameter logistic curve has been illustrated. In the figure, item A is more discriminating and has less guessing involved than item B.
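A brief extension of the earlier sketch (the parameter values below are hypothetical, chosen only for illustration) shows how the guessing parameter c_g of equation 10 raises the lower asymptote: as ability becomes very low, the probability of a correct response approaches c_g rather than zero.

    import numpy as np

    def three_parameter_logistic(theta, a, b, c, D=1.7):
        # Equation 10: three-parameter logistic curve with lower asymptote c
        z = D * a * (theta - b)
        return c + (1.0 - c) * np.exp(z) / (1.0 + np.exp(z))

    # hypothetical items: A is more discriminating with less guessing than B
    for item, a, b, c in [("A", 1.5, 0.0, 0.10), ("B", 0.6, 0.0, 0.25)]:
        low = three_parameter_logistic(-3.0, a, b, c)
        high = three_parameter_logistic(3.0, a, b, c)
        print(item, round(float(low), 3), round(float(high), 3))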

The one-parameter logistic model, developed by Rasch (1960), is commonly referred to as the Rasch model. The Rasch model, though

similar to the other latent trait models, was developed independently

from the other models. The Rasch model is based upon two propositions:

(a) the smarter an examinee, the more likely he is to answer the item

correctly, and (b) an examinee is more likely to answer an easy item

correctly than a difficult item. Mathematically the above propositions

can be stated in terms of odds or probability of success on an item.

The odds of an examinee with ability θ correctly answering an item with difficulty ε is given by the ratio of θ to ε (Rasch, 1960):

$$ \text{odds} = \frac{\theta}{\varepsilon} \qquad (11) $$

The derivation of equation 11 is presented in Appendix A. Equation 11, more formally written in the following equation, is the Rasch model:

$$ P(X_{ki} = 1 \mid \beta_k, \delta_i) = \frac{e^{(\beta_k - \delta_i)}}{1 + e^{(\beta_k - \delta_i)}} \qquad (12) $$

In equation 12, the probability of examinee k making a correct response to item i, denoted X_ki = 1, given an examinee of ability β_k (where β_k is the log transformation of θ) taking an item of difficulty δ_i (where δ_i is the log transformation of ε), is a function of the difference between the examinee's ability and the item's difficulty. The derivation of equation 12 from equation 11 is also presented in Appendix A.
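The equivalence of equations 11 and 12 can be verified numerically, as in the sketch below (an illustration only; the values of θ and ε are arbitrary): converting the odds of equation 11 to a probability gives the same result as the logistic form of equation 12 after the log transformations β = ln θ and δ = ln ε.

    import math

    theta, epsilon = 4.0, 2.0                 # arbitrary ability and item difficulty
    odds = theta / epsilon                    # equation 11
    p_from_odds = odds / (1.0 + odds)         # odds converted to a probability

    beta, delta = math.log(theta), math.log(epsilon)                   # log transformations
    p_rasch = math.exp(beta - delta) / (1.0 + math.exp(beta - delta))  # equation 12

    print(round(p_from_odds, 4), round(p_rasch, 4))                    # both print 0.6667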

The assumptions for the Rasch model were discussed in Chapter I.

Essentially the three assumptions are as follows:








1. There is only one trait underlying test performance.

2. Item responses of each examinee are statistically independent.

3. Item discriminations are equal.

The first two assumptions can be checked by conducting a factor analysis

of the test items as suggested by Lord and Novick (1968), and Hambleton

and Traub (1973). The assumptions are met if one dominant factor

emerges from the analysis. The third assumption can be checked by

plotting item characteristic curves for each item. In Figure Ic the

item characteristic curves for two hypothetical items based on the Rasch

model have been illustrated. The difficulties for items A and B are .5 and 1.5, respectively (the point where p = .50), and the discriminations of the two items are equal. The assumption that all items have equal discriminations is quite restrictive; however, Rentz (1976) demonstrated, in a simulation study, that the item slopes can deviate from 1 (where all slopes are equal) by ±.25 and still fit the model. In a similar simulation study, Dinero and Haertel (1976) concluded that the lack of an item discrimination parameter in the Rasch model does not result in poor item calibrations when discriminations are varied by as much as .25.

The estimates for the Rasch parameters β_k and δ_i, the examinee ability estimate and item difficulty estimate respectively, are sufficient, consistent, efficient, and unbiased (Anderson, 1973; Bock and Wood, 1971). That is, the examinee's test score will contain all the information necessary to measure the person ability parameter β_k, and the sum of the right answers to a given item will contain all the information used to calibrate the item parameter δ_i (Wright, 1977). Of the latent trait models, the Rasch model is unique in this respect.








The mathematical rationale of the Rasch model is based upon the

separation of the ability and item difficulty parameters. As shown

in Appendix A, the estimation of the item parameters is independent of the distribution of ability, and the estimation of ability is independent of the distribution of item difficulty (Rasch, 1966). Several studies have demonstrated this (Anderson et al., 1968; Tinsley and Dawis, 1975; Whitely and Dawis, 1974; Whitely and Dawis, 1976; Wright, 1968; Wright and Panchapakesan, 1969). The separation of the ability and item parameters leads to what Rasch has termed specific objectivity. Specific

objectivity relates to the fact that the measurement of a person's

ability is not dependent upon the sample of items used, nor the

examinee group in which a person is tested. Once a set of items has been

calibrated to the Rasch model, any subset of the calibrated items will

produce the same estimate of the examinee's ability. This type of objectivity is possessed by the physical sciences and is the goal toward which mental measurement should be aimed in the future. Toward the

goal of objective measurement several researchers have conducted

empirical studies comparing classical factor analytic test development

procedures to the latent trait models, and also comparisons have been

made between the various latent trait models.

Research Related to Latent Trait Models in Test Development

Baker (1961) conducted one of the earlier comparative studies

between two latent trait models. He compared the effect of fitting the

normal ogive and the two-parameter logistic model to the same set of

data, a scholastic aptitude test. The two-parameter model as well as

the normal ogive provide item difficulty and item discrimination estimates.








The empirical results suggest there is little difference between the

two procedures as measured by a chi-square test of fit. However, Baker

noted the computer running time of the logistic model was one-third

that of the ogive model, thus he concluded the logistic model was

more efficient in terms of cost than the ogive.

Hambleton and Traub (1971) compared the efficiency of ability

estimates provided by the Rasch model and the two-parameter model to

the three-parameter logistic model using Birnbaum's concept of infor-

mation (1968). The three-parameter model provides item difficulty

and discrimination estimates as well as accounting for guessing on

each item. Eleven simulated tests of fifteen items each were generated

varying item discrimination and degree of guessing. The authors

sought to determine how efficient the one- and two-parameter logistic

models were under these conditions taking the three-parameter model to

be the true model. The results indicated that when guessing was a

factor the three-parameter model was most efficient in providing ability

estimates, but when guessing was not a factor all models were equally

efficient. Since the Rasch model has fewer parameters to estimate, and hence takes less computer time to run than the other two models, it would be preferred in the absence of guessing. In considering item

discrimination, when the guessing parameter was set to zero, the Rasch

model was as efficient as the two-parameter model when item discrimination

varied from .39 to .79. As item discrimination deviated from this range

the two-parameter model was more efficient.

Hambleton and Traub (1973) compared the one- and two-parameter

models with three sets of real data: the verbal and mathematics subtests of a scholastic aptitude test used in Ontario (45 and 20 items, respectively), and the verbal section of the Scholastic Aptitude Test (SAT, 80 items). Their results indicated that generally the two-

parameter model fit the data better than the one-parameter model. The

loss in predicting performance was greatest on the shorter mathematics

test and smallest on the longer SAT. These findings confirm Birnbaum's

conjecture (1968, p. 492) that if the number of items in a test is very

large the inferences that can be made about an examinee's ability will

be much the same whether the Rasch model or the two-parameter logistic

model is used. The authors questioned whether the gain obtained with the

two-parameter model is worth the increased computer cost of estimating

the item discrimination parameter. Based on the results of these studies,

it is concluded that the Rasch model is the most efficient of the latent

trait models and hence will be used in comparison to the more traditional

methods of test development included in the present study.

Comparison of the Rasch Model to Factor Analysis

Two recent studies have been completed comparing the Rasch model to

factor analysis. Anderson (1976) posed two questions concerning the

Rasch model and factor analysis: (a) what types of items would be

excluded in terms of difficulty and discrimination using Rasch and

factor analysis as item analytic techniques, and (b) what effect would

the two procedures have on validity? Anderson chose to use 235 middle

school students' responses to a 15 item Likert-type scale that was

dichotomized for use with the Rasch model and the factor analytic

procedures. A principal component factor analysis based upon tetrachoric

correlation coefficients was compared to the Rasch model using the

CALFIT computer program (Wright and Mead, 1975). Only items fitting

the model were used. His results indicated that the Rasch procedure









eliminated the more difficult items and the factor analytic procedure

eliminated the easier items, a statistically significant difference as determined by a chi-square test at p < .01. For item discrimination, the Rasch procedure eliminated items with very low and very high discriminations, while the factor analytic procedure tended to reject only items with very low discriminations. The difference here was not statistically significant.

The second question of test validity showed very similar results for

the two procedures when test score was correlated with course grade

point average.

In a similar study Mandeville and Smarr (1976) developed a two

stage design. First they compared the Rasch procedure to factor analysis,

then they combined the two analytic procedures. The authors felt the

combined approach would be a more effective item analytic approach

than any single method in determining which items fit the Rasch model.

Two cognitive data sets (one standardized and one classroom) and one

simulated set were used in the study. A rotated principal axis factor analysis based upon phi correlation coefficients was compared to the Rasch model using the CALFIT program.

The results indicated that for the standardized and simulated data

sets, the double procedure of factor analyzing the items and then submitting only the items loading on the first factor to the Rasch procedure was

not really useful. The Rasch procedure alone was just as effective as

the double procedure in selecting items that fit the model.

For the classroom data set the investigators found that 92 percent

of the items fit the Rasch model, but upon factor analyzing these

Items only seven percent of the total test variance was associated with

the first factor. Their results tend to indicate that factor analysis







and the Rasch procedure do not always identify the same unidimensional

trait underlying test performance. However, the results of the Mandeville and Smarr study may be suspect for three reasons. First,

the phi coefficient, which can be seriously affected when p and q

take on extreme values, was used as a basis to form the intercorrelation

matrix that was factor analyzed. The greater the difference in p and q

the smaller will be the maximum correlation, hence very easy and very

difficult items will have systematically lower coefficients and will

tend to bias the results of the analysis in favor of moderately difficult

items. Second, the factor analysis was based on a principal axis solution, using some value less than 1.00 in the diagonal; hence, less variance was used in the total solution than in the Rasch procedure, which utilizes all of the available test variance. Third, the principal axis solution was rotated, so the variance associated with the first factor was distributed among the other factors and the first factor was no longer as strong as it had been originally.

Summary

In the development of tests based upon classical item analysis

two main statistics are used in reviewing and revising test items, namely, item difficulty and item discrimination. The item discrimination index

provides information as to the validity of the item in relation to total

test score, while item difficulty indicates how appropriate the item was

for the group tested. A serious limitation of classical item analysis

is that the statistics obtained for examinees and items are sample dependent (Hambleton and Cook, 1977; Wright, 1968).

The same problem of sample dependency also exists for factor

analysis. However, factor analysis is viewed as a superior technique








to classical item analysis for two reasons: (a) factor analysis

compares item intercorrelations with other items simultaneously,

and (b) factor analysis provides an indication of how many factors

or abilities the test is measuring. Also in factor analysis, the

factor loading is comparable to the item discrimination index of

classical item analysis, thus providing a measure of item validity

for each item on each factor in the test.

Not until the development of latent trait models was a solution

suggested to the problem of sample dependency of the statistics for

items and examinees. The Rasch model in particular has been shown to

provide item statistics that are independent of the group on which they

were obtained, as well as examinee statistics that are independent of

the group of items on which they were tested. This feature of the

Rasch model provides for more objective mental measurement.

The Rasch model has been compared to other latent trait models

and has been shown to be as efficient in many cases as the more complex

models. The Rasch model has also been compared with factor analytic

procedures in determining test unidimensionality, validity, and types

of items retained and excluded by the two procedures. Missing from

this review is a comparative study of the three item analytic techniques

using the same data base and a comparison of the efficiency of tests

developed from the three techniques across ability levels. Also missing

from the literature is the effect of varying sample size and number of

items as well as the kinds of items each of the three procedures would

either retain or exclude in test development.







It is apparent that an empirical investigation into these areas is warranted to determine which procedure under the various

conditions would produce the superior test in terms of internal

consistency and efficiency. It was for this reason that the present

study was undertaken comparing the three methods of classical item

analysis, factor analysis, and the Rasch model used in test development.

The design of the study is described in Chapter III.















CHAPTER III

METHOD



An empirical study was designed to compare the effects of three

methods of item analysis on test development for different sample sizes.

The three methods of item analysis studied were classical item analysis,

factor analysis, and Rasch analysis. The sample sizes used to compare

the three item analytic methods were 250, 500, and 995 subjects. The

study was designed in three phases: (a) item selection, (b) a double

cross-validation of the selected items, and (c) statistical analyses

of the selected items. For each item analytic procedure two tests

were developed, a 15 item test, and a 30 item test. Four dependent

variables were obtained for each test: (a) an estimate of internal

consistency, (b) the standard error of measurement, (c) item difficulty,

and (d) item discrimination. A description of the subjects, instrument

used, research design, and statistical analyses is presented in this

chapter.

The Sample

In the fall of 1975, all high school seniors in the State of

Florida (N = 78,751) were tested as part of the State assessment program.

The population was from 435 high schools throughout the state. From

this population a 1 in 15 systematic sample of 5,250 subjects was

chosen (Mendenhall, Ott, and Scheaffer, 1971). A systematic sample was

selected to ensure samples from every high school in the state. The








types of data obtained on each subject were sex, race, item responses,

and total score.
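A 1 in 15 systematic sample of this kind can be sketched as follows (a schematic illustration only; the random starting point and the representation of the population file are assumptions, not details taken from the original sampling plan):

    import random

    def systematic_sample(records, interval=15, seed=1975):
        # take a random starting point, then every 15th record thereafter
        start = random.Random(seed).randrange(interval)
        return records[start::interval]

    population = list(range(78751))              # one entry per tested senior
    sample = systematic_sample(population)
    print(len(sample))                           # approximately 5,250 subjects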

The data file was edited to remove those subjects who either

answered all the items correctly or incorrectly. The rationale for

this procedure was that the Rasch model cannot calibrate items when

a person has a perfect score or, alternatively, when a person has no

items correct (Wright, 1977). Through the editing procedure 15 subjects

were removed, thus the available sample size was 5,235. Because such

a small number of subjects were removed, it seems unlikely that the

elimination of these subjects would bias the results in favor of any

of the three item analytic techniques.
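The editing step amounts to a simple filter on the item response matrix, sketched below under the assumption of a 0/1 scoring matrix (the variable names are hypothetical): examinees are retained only if their raw scores fall strictly between zero and the number of items, as required for Rasch calibration.

    import numpy as np

    def drop_zero_and_perfect(responses):
        """responses: an examinees x items matrix of 0/1 item scores."""
        totals = responses.sum(axis=1)
        keep = (totals > 0) & (totals < responses.shape[1])
        return responses[keep]

    # toy matrix: the second (perfect) and third (all wrong) examinees are removed
    data = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 0], [0, 1, 1]])
    print(drop_zero_and_perfect(data).shape[0])     # 2 examinees remain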

The Instrument

The instrument selected for use in this study was the Verbal

Aptitude subtest of the Florida Twelfth Grade Test, developed by the

Educational Testing Service. The Florida Twelfth Grade Test is a statewide assessment battery which has been administered every year since 1935 (Benson, 1975). The

Verbal Aptitude subtest is comprised of 50 verbal analogies, in a

multiple choice format, from which a single score based on the number

of items correct is reported. Descriptive information on the Verbal

Aptitude subtest for the population tested in 1975 is presented in

Table 1.

This particular instrument was selected for three reasons. First,

it is a cognitive measure of verbal ability and much of classical

test theory has been built upon tests in the cognitive domain. Second,

it is similar to and hence representative of other national aptitude

tests used for college admissions. Third, it has a large data pool

from which to sample.

















TABLE 1

DESCRIPTIVE DATA ON THE VERBAL APTITUDE SUBTEST
OF THE FLORIDA TWELFTH GRADE TEST
1975 ADMINISTRATION


Number of Schools = 435


Number of Students = 78,751


Number of items                          50
Mean                                  25.95
Standard Deviation                     8.23
Reliabilitya                            .88
Standard Error of Measurement          2.85


Note: Data obtained from the Florida Twelfth Grade Testing
Program, Report No. 1-75, Fall 1975.

aReliability based on the split-half method, and corrected by
the Spearman-Brown formula.








Classical test theory has been built mainly around the development

of cognitive tests. Therefore, it seemed desirable to compare the

new procedures of latent trait theory, via the Rasch model to the

procedures of classical test theory, e.g., factor analysis and classical

item analysis by using a cognitive test. Thus, the results may be

more generalizable to the major type of tests developed by practitioners

in the field.

The Procedure

Design

The sample of 5,235 was divided into nine systematic samples in

the following manner:

Group 1 = three independent samples of 250 students each;

Group 2 = three independent samples of 500 students each;

Group 3 = three independent samples of 995 students each.

From the initial editing of the data file, previously described, 15

subjects were removed from the total sample of 5,250. Therefore,

it was decided that this loss of subjects would only affect Group 3

since it was the largest. Thus, the number of subjects in each of

the three independent samples was reduced by five, resulting in three

independent samples of 995 subjects each.

The purpose of obtaining the three separate samples for the three

groups was to insure that each item analytic and double cross-validation

procedure used an independent sample, so that tests of statistical

significance could be performed. The scheme shown in Table 2 was used

to obtain the nine samples. In the present study the independent

variables were sample size and item analytic procedure.















TABLE 2

SYSTEMATIC SAMPLING DESIGN OF THE STUDYa
N = 5,235

Sampling   Sampling    Sample   Number     Item Analytic      Total Sample
Group      Procedure   Number   Selected   Procedure          Remaining

Group 1                  1        250      Classical             4,985
                         2        250      Factor Analysis       4,735
                         3        250      Rasch                 4,485

Group 2                  4        500      Classical             3,985
                         5        500      Factor Analysis       3,485
                         6        500      Rasch                 2,985

Group 3b   1 in 3        7        995      Classical             1,995
           1 in 2        8        995      Factor Analysis         995
           remaining     9        995      Rasch                     0

aThe sampling procedure was randomly assigned to item analytic technique in Group 1 and the same pattern carried out for Group 2 and Group 3.

bThose subjects edited from the data file were removed equally from Group 3, hence the reduced sample size.








The item data were analyzed in three phases: (a) selection of the

items, (b) computation of item and test statistics for selected items

on double cross-validation samples, and (c) statistical analyses of

item characteristics to test the hypotheses.

Item Selection

The three independent samples, within each of the groups of subjects

(N = 250, N = 500, N = 995), were submitted to one of the three item

analytic procedures (in accordance with Table 2) in order to select a

specified number of items, e.g., the "best" 15 and 30 items. Each of

these two sets of items comprised two separate tests; however, all of

the items on the 15 item tests were always included on each of the 30 item

tests. A different process for selecting the items was used with each

item analytic technique, and has been described in the following three

sections.

Classical item analysis. The definition of the "best" items was

based on the numerical magnitude of the items' biserial correlations.

The biserial correlation was defined as the correlation between the

artificially dichotomized item score (1 or 0) and total test score.

In using the biserial correlation the assumption was made that the

artificially dichotomized variable (the item) had a continuous and normal

distribution (Magnusson, 1966).
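For reference, the computation can be sketched as follows (an illustration of the usual biserial formula, not the GITAP routine itself; the toy data are hypothetical): the difference between the mean total score of examinees passing the item and the overall mean is divided by the standard deviation of total scores and multiplied by p/y, where p is the proportion passing and y is the ordinate of the standard normal density at the point cutting the distribution into proportions p and 1 - p.

    import numpy as np
    from scipy.stats import norm

    def biserial(item, total):
        """item: 0/1 item scores; total: total test scores for the same examinees."""
        item, total = np.asarray(item, float), np.asarray(total, float)
        p = item.mean()                      # proportion answering the item correctly
        y = norm.pdf(norm.ppf(p))            # normal ordinate at that split
        return (total[item == 1].mean() - total.mean()) / total.std() * p / y

    item_scores  = [1, 0, 1, 0, 1, 0, 1, 0]
    total_scores = [26, 24, 30, 21, 23, 27, 29, 20]
    print(round(biserial(item_scores, total_scores), 2))     # about .74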

In order to obtain biserial correlations for the items under the

classical item analysis procedure, the 50 verbal items were submitted to the item analysis program, GITAP,5 for each of the three sample sizes.




"The Generalized Item Analysis Program (GITAP) is a part of the
test analysis package developed by F. B. Baker and T. J. Martin,
Occasional Paper No. 10, Michigan State University, 1970.









The 15 and 30 items with the highest biserial correlations were selected

as the best items from the total subtest. Item difficulties were also

obtained for the "best" 15 and 30 items selected. Item difficulty has

been defined as the proportion of persons getting a particular item

correct out of the total number of persons attempting that item

(Mehrens and Lehman, 1973).

Factor analysis. Item selection based on factor analysis was

accomplished using the computer programs developed for the Education

Evaluation Laboratory at the University of Florida. These programs

have been described by Guertin and Bailey (1970). The present study

was concerned only with the items that load on the first principal

component, in order to adhere to the unidimensionality assumption of

the test. The principal components analysis was based on a matrix of

tetrachoric item intercorrelations with unities in the diagonal.

The tetrachoric correlation was chosen to produce the intercorrela-

tion matrix for the same reason the biserial correlation was chosen:

Knowledge of an item was assumed to be normal and continuously dis-

tributed. In the case of the tetrachoric correlation each item (scored

1 or 0) was correlated with every other item.
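The loadings on the first unrotated principal component can be sketched with an ordinary eigendecomposition (a schematic illustration, not the Education Evaluation Laboratory programs; the intercorrelation matrix below is hypothetical): the loadings are the elements of the leading eigenvector of the intercorrelation matrix scaled by the square root of the leading eigenvalue.

    import numpy as np

    def first_component_loadings(R):
        """R: item intercorrelation matrix with unities in the diagonal."""
        values, vectors = np.linalg.eigh(R)                   # eigenvalues in ascending order
        loadings = vectors[:, -1] * np.sqrt(values[-1])       # first principal component
        return -loadings if loadings.sum() < 0 else loadings  # resolve the arbitrary sign

    R = np.array([[1.0, 0.5, 0.4],
                  [0.5, 1.0, 0.6],
                  [0.4, 0.6, 1.0]])                           # toy three-item matrix
    loadings = first_component_loadings(R)
    print(np.round(loadings, 2))
    print(round(float((loadings ** 2).sum() / R.shape[0]), 2))  # share of variance on component 1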

The 15 and 30 items with the highest loadings on the first unrotated

principal component were selected from the total subtest. These com-

ponent loadings are analogous to biserial correlations previously

described, where the loading refers to the relationship of the item to

the principal component or factor (Guertin and Bailey, 1970; Henrysson, 1962).

Rasch analysis. The selection of items based on the Rasch model

was accomplished in two stages. First, in order to check the assumption








of a unidimensional test, a factor analysis using a principal components

solution was used. Items were selected with loadings between .39 and

.79 on the first unrotated factor, to hold the discrimination index

of the items constant. Hambleton and Traub (1971) have shown that the

efficiency of a test developed using the Rasch model will remain very
high (over 95 percent) when the range on the discrimination index was

held between .39 and .79. Second, the items selected from the principal

components solution using the above criteria were submitted to a

Rasch analysis using the BICAL program (Wright and Mead, 1976). Items

were selected based upon the mean square fit of the items to the

Rasch model. The best 15 and 30 items fitting the model were chosen

from the total subtest, and their corresponding item difficulties

reported.

Double Cross-Validation

A double cross-validation design (Mosier, 1951) was used to obtain

item parameter estimates for the best 15 and 30 items selected by the

three item analytic techniques for the three sample sizes. In this

study a 3 X 3 latin square was used to reassign samples. This procedure

ensured that the estimates of the item parameters would be based

upon a different sample of subjects than the original sample used to

identify the best items. Each item analytic technique was randomly

reassigned, using a latin square procedure (Cochran and Cox, 1957,

p. 121), to a different sample within each of the three groups (N = 250,

N = 500, N = 995). The double cross-validation design is shown in

Table 3.

The best 15 and 30 items selected by each item analytic procedure

in the first phase of the study were submitted to a standard item

















TABLE 3

DOUBLE CROSS-VALIDATION DESIGN OF THE STUDY

           Sample    Number     Item Analytic      Double Cross-
Group      Numbera   Selected   Procedure          Validation Procedureb

Group 1       1        250      Classical          Factor Analysis
              2        250      Factor Analysis    Rasch
              3        250      Rasch              Classical

Group 2       4        500      Classical          Rasch
              5        500      Factor Analysis    Classical
              6        500      Rasch              Factor Analysis

Group 3       7        995      Classical          Rasch
              8        995      Factor Analysis    Classical
              9        995      Rasch              Factor Analysis

aThe sample number is the same as referred to in Table 2.

bAssignment to sample was based on a randomized 3 X 3 latin square procedure.








analysis program (GITAP) from which were obtained the dependent

variables in the study:

-indices of internal consistency as measured by the analysis

of variance procedure (Hoyt, 1941)

-the standard error of measurement

-item difficulty

-biserial correlations

By submitting the best 15 and 30 items selected by each item analytic

procedure in the study to a common item analysis program comparable

measures of the dependent variables were obtained.

Statistical Analyses

The third phase of the study focused on obtaining measures of

statistical significance for three of the dependent variables: internal

consistency, item difficulties, and biserial correlations. Only visual

comparisons were made for the remaining dependent variable, the standard

error of measurement. The internal consistency estimates from each test
were compared to the projected population value for tests of similar

length via confidence intervals as suggested by Feldt (1965). (Projected

population values were obtained using the Spearman-Brown Prophecy Formula.)

Item difficulties for the 15 and 30 best items were submitted to a

two-way analysis of variance, the two factors being sample size and item

analytic technique. This procedure was used to test for differences in

the types of items selected, in terms of item difficulty, by each technique.

If statistical significance was observed, with α = .05, Tukey's HSD

(honestly significant difference) post hoc procedure (Kirk, 1968) was





6The analysis of variance procedure is appropriate only if the distribution of the item difficulties and (transformed) biserial correlations approximates normality and the variances are homogeneous (Ware and Benson, 1975).








employed to determine which item analytic technique(s) resulted in a

test with the highest item difficulties.

The biserial correlations were transformed to an interval scale

of measurement using a linear function of z suggested by Davis (1946).

The linear transformation was based upon converting the biserial

correlation to z values, and then eliminating the decimals and negative

values of z by multiplying the constant 60.241 to each z value (Davis,

1946, pp. 12-15). Thus, the range of the transformed biserials ranged

between 0 and 100. A two-way analysis of variance 7 (sample size by item

analytic technique) was performed on the transformed biserial correlations

for the best 15 items. This type of analysis was used to test for

differences in the types of items selected, in terms of biserial correla-

tions by each technique. If statistical significance was observed,

a = .05, Tukey's HSD post hoc procedure was employed to determine which

item analytic techniques) resulted in higher transformed biserial

correlations.

The two-way analysis of variance and post hoc analysis, where indicated, for the transformed biserial correlations were also performed on the 30 best items.
In addition to tests of statistical significance, a measure of

the efficiency of the 30 best items selected by each procedure was

compared for the sample of 995 subjects. Birnbaum (1968) defined the

relative efficiency of two testing procedures as the ratio of their




7The analysis of variance procedure is appropriate only if the distribution of the item difficulties and (transformed) biserial correlations approximates normality and the variances are homogeneous (Ware and Benson, 1975).









information curves. Lord (1974a) has described a procedure to compare

the relative efficiency of one test with another at different ability

levels. If two tests to be compared vary in difficulty, then the

relative efficiency of each will usually be different at different

ability levels (Lord, 1974b; 1977). In classical test theory it is

common to compare two tests that measure the same ability in terms of

their reliability coefficients, but this only gives a single overall

comparison. The formula developed by Lord for relative efficiency

provides a more precise way of comparing two tests that measure the

same ability. The formula for approximating relative efficiency is

(Lord, 1974b, p. 248):

$$ R.E.\{y, x\} = \frac{n_y\, x\,(n_x - x)\, f_x^{2}}{n_x\, y\,(n_y - y)\, f_y^{2}} \qquad (13) $$

where R.E. denotes the relative efficiency of y compared to x, n_y and n_x denote the number of items in the two tests, x and y are the number-right scores having the same percentile rank, and f_x^2 and f_y^2 are the squared observed frequencies of x and y. Lord has suggested that

formula 13 only be used with a large sample of examinees and tests that

are not extremely short, hence this comparison was restricted to the

case where N = 995 and the 30 item test.
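A direct computation of formula 13 can be sketched as follows (the scores and frequencies below are hypothetical, used only to show the form of the calculation); x and y must be number-right scores that fall at the same percentile rank in the two observed score distributions.

    def relative_efficiency(n_y, n_x, y, x, f_y, f_x):
        # Equation 13: efficiency of test y relative to test x at the ability level
        # corresponding to the percentile-equivalent scores y and x, with observed
        # score frequencies f_y and f_x.
        return (n_y * x * (n_x - x) * f_x ** 2) / (n_x * y * (n_y - y) * f_y ** 2)

    # hypothetical 30 item tests: score y = 18 (frequency 55) and score x = 20
    # (frequency 48) share the same percentile rank
    print(round(relative_efficiency(30, 30, 18, 20, 55, 48), 2))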

Three relative efficiency comparisons using the 30 item tests were

made: (a) the test based on factor analysis was compared to the test

based on classical item analysis, (b) the test based on the Rasch

analysis was compared to the test based on classical item analysis, and

(c) the test based on the Rasch analysis was compared to the test based

on factor analysis.









Summary

An empirical study was designed to compare the effects of classical

item analysis, factor analysis, and the Rasch model on test development.

Item response data were obtained from a sample of 5,235 high school

seniors on a cognitive test of verbal aptitude.

The subjects were divided into 9 samples: three independent

groups of 250 subjects each, three independent groups of 500 subjects

each, and three independent groups of 995 subjects each. The independent

groups were obtained so that tests of statistical significance could

be performed.

The item response data were then analyzed in three phases. First,

the "best" 15 and 30 items were selected using each item analytic

technique. Under classical item analysis, the best 15 and 30 items

were selected based on the highest biserial correlations. For factor

analysis, the best 15 and 30 items were selected based on the highest

item loadings on the first (unrotated) principal component. The

selections of the best 15 and 30 items using the Rasch model were

based upon the mean square fit of the items to the model. These

procedures were used for each group of subjects. Second, a double

cross-validation design was employed to obtain estimates on the item

parameters for the best 15 and 30 items. The three item analytic

techniques were reassigned randomly to different samples of subjects

within each level of sample size. Then, the best 15 and 30 items

chosen by each method were submitted to a common item analytic procedure

in order to obtain estimates for comparing the three item analytic

methods. Third, a two-way analysis of variance and a Tukey post hoc

comparison test, when indicated, were used to test for differences in







the properties of items selected by each item analytic procedure. Also

confidence intervals were calculated to compare the internal consistency

estimates to a population value. In addition, the relative efficiencies

of the 30 item tests developed by each item analytic technique were

compared for the sample of 995 subjects.
















CHAPTER IV

RESULTS



The study was designed to compare empirically the precision and

efficiency of tests developed using three item analytic techniques:

classical item analysis, factor analysis, and the Rasch model. The

following five hypotheses were generated to compare the three techniques:

1. There are no significant differences in the internal consistency

estimates of the tests produced by the three methods as the number of

items decreases when compared to the projected internal consistency

estimates for the population for tests of similar length.

2. There are no differences in the internal consistency estimates

of the tests produced by the three methods when the number of examinees

is decreased.

3. There are no meaningful differences in the magnitude of

the standard error of measurement of the tests produced by the three

methods.

4. There are no significant differences in the difficulties or discriminations of the items selected by the three methods.







A meaningful difference was previously defined to be > 1.00.









5. There are no differences across ability levels in the efficiency

of the tests produced by the three methods.

The Verbal Aptitude subtest of the Florida Twelfth Grade Test was

used to test the hypotheses. A sample of 5,235 examinees was

systematically selected from a population of 78,751. A demographic

breakdown of the sample by ethnic origin and sex is presented in Table 4.

The data were analyzed and reported in the following manner: item

selection, double cross-validation, comparison of the 15 item tests on

precision and comparison of the 30 item tests on precision and efficiency.

These results were then summarized with respect to the five hypotheses.

Item Selection

The 50 items on the Verbal Aptitude subtest were submitted to each

of the three item analytic techniques. The means, medians, and standard

deviations of the biserial correlations and item difficulties, based

on classical item analysis, are presented in Table 5. These descriptive

statistics appear equivalent across the varying sample sizes.

From the factor analysis, the percentage of total test variance

accounted for by the 50 verbal items on the first unrotated principal

component has been reported in Table 6. The percentage of variance

accounted for by the first principal component was obtained by summing

the squared item loadings and dividing by the total number of items.

The percentages of variance accounted for by the first principal component

in each sample were very similar. A check on the unidimensionality

of the test was made by rotating the principal components solution for

the sample of 995 subjects. Upon rotation, the results indicated

one dominant factor remained.












[Tables 4, 5, and 6 appeared here: Table 4, the demographic breakdown of the sample by ethnic origin and sex; Table 5, the means, medians, and standard deviations of the biserial correlations and item difficulties based on classical item analysis for each sample size; Table 6, the percentage of total test variance accounted for by the 50 verbal items on the first unrotated principal component for each sample size.]









Three statistics are reported in Table 7 for the Rasch item

analysis procedure. For each sample size, the percentage of total

variance accounted for by the first unrotated principal component

and the means and standard deviations of the mean square fit statistic

and Rasch difficulties are presented for the items selected.

In order to select the best 15 and 30 items from the Rasch analysis,

all 50 items were submitted to a principal components solution. This

procedure was used to ensure that the items selected measured one

trait, as required by the assumption of test unidimensionality. As

noted in Table 7, the percentage of total test variance accounted for

by the first principal component, based on 50 items, was nearly equal

for each sample size.

From the principal components solution only items with loadings

between .39 and .79 were selected for the Rasch analysis as suggested

by Hambleton and Traub (1971), to adhere to the assumption of equal

item discrimination. Using this procedure the number of items (out

of 50) retained for the Rasch analysis varied slightly with sample

size; when N = 250, 33 items were retained, when N = 500, 35 items

were retained, and when N = 995, 33 items were retained. These items,

loading between .39 and .79, were then submitted to the Rasch analysis

to obtain mean square fit statistics and Rasch item difficulties.

These statistics have been reported in Table 7.

Wright and Panchapakesan (1969) developed a measure to assess the

fit of the item to the Rasch model. The measure, defined as the mean

square fit statistic, is:


















[Table 7: For each sample size, the percentage of total test variance accounted for by the first unrotated principal component (based on all 50 items) and the means and standard deviations of the mean square fit statistics and Rasch item difficulties for the items retained.]
$$ \chi^2 = \sum_{i=1}^{k-1} \sum_{j=1}^{n} y_{ij}^{2} \qquad (14) $$

The quantity χ², defined above, has approximately the chi-square distribution with degrees of freedom equal to (k - 1)(n - 1). The value y_ij is the deviation of the item from the model, or item misfit,

and is determined by taking the difference between the observed and

expected frequency of the examinees at a given ability level who

answered a given item correctly. This difference was then divided

by the standard deviation of the observed frequency, squared and

summed over items and score groups. The BICAL program standardizes

these deviations (y_ij) in computing the mean square fit statistic; therefore, y_ij has a normal distribution with a mean of zero and

standard deviation of one (Hambleton et al., 1977). Items with large

mean square fit values are items which do not fit the model. As shown

in Table 7, the mean and standard deviation of the mean square fit

statistic increased with sample size.
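The fit computation of equation 14 can be sketched as follows (a schematic illustration, not the BICAL code; the counts below are hypothetical): for each score group and item, the standardized residual y_ij is the observed minus the expected number of correct responses divided by the standard deviation of the observed frequency, and the squared residuals are summed over score groups and items.

    import numpy as np

    def item_fit_chi_square(observed, expected, sd_observed):
        """Arguments are (score groups x items) arrays for the correct-response counts."""
        y = (observed - expected) / sd_observed      # standardized deviations y_ij
        return float((y ** 2).sum())                 # equation 14

    # toy example: three score groups and two items; the statistic is referred to a
    # chi-square distribution with (k - 1)(n - 1) degrees of freedom
    obs = np.array([[12.0,  9.0], [20.0, 18.0], [27.0, 25.0]])
    exp = np.array([[11.0, 10.0], [21.0, 17.0], [26.0, 26.0]])
    sd  = np.array([[ 2.6,  2.5], [ 2.9,  2.9], [ 2.4,  2.5]])
    print(round(item_fit_chi_square(obs, exp, sd), 2))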

The item difficulty estimates based on the Rasch model also have

an expected mean of zero and standard deviation of one (Wright and

Mead, 1975). These estimates remained very similar across sample

size and exceptionally close to the expected values (Table 7).

The Rasch model does not provide a parameter for item discriminating power, as all item discriminations are considered equal and centered at one (Wright and Mead, 1975). The BICAL program provided, as part of the normal output, estimates of the items' discriminating power to check the fit of the data to the model. The item discriminations were obtained by regressing the difficulty of the item for each ability group on the ability estimate of the group (Wright and Mead, 1975, p. 11).









The means and standard deviations for the item discrimination estimates

were shown in Table 8 for each sample size.



TABLE 8

DESCRIPTIVE DATA ON ITEM DISCRIMINATION ESTIMATES
BASED ON THE RASCH MODEL ACCORDING
TO SAMPLE SIZE



N = 250a N = 500 N = 995
K = 33b K = 35 K = 33

Mean 1.03 1.02 1.03

Standard .28 .19 .22
Deviation


aN = sample size

bK = number of items




From the data in Table 8, the mean item discrimination estimates

appear nearly equal for each sample size, and quite close to the mean

expected value of one.

The best 15 and 30 items were then selected by each item analytic

procedure based on the information in Tables 5-7, and have been listed

in Tables 9 and 10 respectively.

The items selected under classical item analysis were determined by

the magnitude of the biserial correlation, e.g., the 15 and 30 items

having the highest biserial correlations with total test score were

selected. Indices of item difficulty have been reported for inspection,

but in no way influenced the selection of items for classical item

analysis.










































[Tables 9 and 10: The best 15 and best 30 items, respectively, selected by each item analytic procedure for each sample size.]









The selection of items under factor analysis was determined by

the item loadings on the first unrotated principal component. The

15 and 30 items having the highest item-component biserial

correlation were selected.

The selection of the 15 and 30 items from the Rasch analysis was

determined by the mean square fit of the item to the Rasch model. The

closer the mean square fit was to zero the better the item fit the

model, thus items with the lowest mean square fit statistic were

selected.

Double Cross-Validation

After the tests of the best 15 and 30 items were developed by each

procedure, they were scored on independent samples, in a double cross-

validation procedure as noted in Table 3, Chapter III. Item and

test statistics, needed to test the five hypotheses, were obtained for

the 15 and 30 item tests based on the cross-validation samples using the

GITAP program (Baker and Martin, 1970).

The GITAP program provided the following output:

- each subject's total test score

- test mean and standard deviation

- internal consistency estimates as measured by Hoyt's analysis of variance procedure

- estimates of the standard error of measurement

- indices of item difficulty and biserial correlations

Comparison of the 15 Item Tests on Precision

The descriptive statistics based on the double cross-validation

samples for the 15 item tests have been presented in Table 11.












[Table 11: Descriptive statistics for the 15 item tests based on the double cross-validation samples, including internal consistency estimates and standard errors of measurement, by item analytic procedure and sample size.]








The values of the internal consistency estimates for the tests

developed using the Rasch model were consistently lower than the

internal consistency estimates of the tests developed by classical

item analysis and factor analysis across all sample sizes.

The observed internal consistency estimates were tested for

significance using confidence intervals described by Feldt (1965), to

see if they were statistically different from the internal consistency

estimate for the projected population using the Spearman-Brown Prophecy

Formula.
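The Feldt interval can be sketched as follows (a schematic implementation resting on the usual statement of Feldt's result, namely that the ratio of (1 minus the population coefficient) to (1 minus the observed coefficient) follows an F distribution with N - 1 and (N - 1)(k - 1) degrees of freedom; the numerical inputs below are hypothetical rather than values from Table 11):

    from scipy.stats import f

    def feldt_interval(alpha_hat, n_persons, k_items, level=0.95):
        # 100(level)% confidence interval for coefficient alpha (Feldt, 1965)
        df1, df2 = n_persons - 1, (n_persons - 1) * (k_items - 1)
        tail = (1.0 - level) / 2.0
        lower = 1.0 - (1.0 - alpha_hat) * f.ppf(1.0 - tail, df1, df2)
        upper = 1.0 - (1.0 - alpha_hat) * f.ppf(tail, df1, df2)
        return round(lower, 3), round(upper, 3)

    # e.g., a hypothetical observed estimate of .79 on a 15 item test, 250 examinees
    print(feldt_interval(0.79, 250, 15))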

The internal consistency estimate for the population based on the

original 50 item subtest was .88 (Table 1). By applying the

Spearman-Brown Prophecy Formula (Mehrens and Lehman, 1973) the projected

population internal consistency estimate for a 15 item test was found

to be .687. The value .687 was the expected internal consistency if 35

of the 50 items were randomly deleted. Thus, confidence intervals

were generated around the observed internal consistency estimates,

presented in Table 11, for each procedure across all sample sizes to

see if any of the three item analytic techniques would produce a more

reliable test than would be expected from mere random item deletion.
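The projected value itself is a direct application of the Spearman-Brown formula with a length ratio of 15/50; a minimal check using only the population reliability of .88 reported in Table 1 reproduces it.

    def spearman_brown(reliability, length_ratio):
        # projected reliability when test length is multiplied by length_ratio
        return (length_ratio * reliability) / (1.0 + (length_ratio - 1.0) * reliability)

    print(round(spearman_brown(0.88, 15 / 50), 4))   # 0.6875, reported as .687 in the text
    print(round(spearman_brown(0.88, 30 / 50), 4))   # about .815 for a 30 item test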

The confidence intervals for the observed consistency estimates

for each procedure have been reported in Table 12.

When the sample sizes were 250 and 995, each item analytic technique

produced an internal consistency estimate that was significantly different

from the projected population estimate (.687) at a confidence level of

95 percent. Each of the three techniques systematically retained

the 15 most homogeneous items. These tests were more precise

in terms of internal consistency than would have been found if the







items were randomly deleted as noted by comparisons to the projected

population reliability coefficient.



Table 12

CONFIDENCE INTERVALSa FOR THE OBSERVED INTERNAL
CONSISTENCY ESTIMATES BASED ON THE 15
ITEM TESTS ACCORDING TO SAMPLE SIZE



95% Confidence Interval

Procedure N = 250 N = 500 N = 995

Classical .748 .828* .792 .838* .810 .841*

Factor Analysis .760 .837* .786 .854* .797 .831*

Rasch .704 .799* .688 .757 .711 .759*

aThe F values used in calculating the confidence intervals were obtained from Marisculo (1971).

*Statistical significance is indicated when the population internal consistency estimate is not included in the confidence interval generated for each observed internal consistency estimate. The projected population value was .687.



Only two procedures produced tests with internal consistency

estimates significantly different from the projected population estimate

when the sample size was 500: classical item analysis and factor analysis.

As sample size decreased, in most cases, the internal consistency

for each method tended to decrease (Table 11). An exception was noted

for the Rasch tests: when the sample size decreased from 500 to 250,

internal consistency improved slightly.

The data reported in Table 11 indicated that the standard errors of measurement for the 15 item tests based on the Rasch model were consistently larger than the standard errors of measurement of the tests developed from classical item analysis and factor analysis for each sample size. However, these differences were not meaningful in that the difference did not equal or exceed 1.00 for any of the three procedures.
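Under the classical definition used earlier in the study, the standard error of measurement is the test standard deviation times the square root of one minus the reliability. A brief illustration with hypothetical values (the entries of Table 11 are not reproduced here) shows why reliability differences of a few hundredths translate into SEM differences far smaller than the 1.00 criterion.

    import math

    def sem(sd, reliability):
        # Classical standard error of measurement: SD of the test times
        # the square root of one minus its reliability coefficient.
        return sd * math.sqrt(1 - reliability)

    # Two hypothetical 15 item tests with the same spread but alphas of .82 and .75:
    print(round(sem(3.0, 0.82), 2))   # about 1.27
    print(round(sem(3.0, 0.75), 2))   # about 1.50 -> a difference well under 1.00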

The differences in mean item difficulties and discriminations were tested for statistical significance to determine whether there were differences in the types of items retained by each item analytic method. In this study, item discrimination was measured by biserial correlations.

A two-way analysis of variance (fixed effects model) was performed

separately for the two dependent variables of item difficulty and item

discrimination. A check was made on the assumptions for the analysis

of variance to ensure that they were met. In these analyses, item

analytic technique and sample size were the two independent factors, each

with three levels.
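The layout of these analyses can be sketched as follows; the data are simulated and the statsmodels calls are a present-day convenience assumed for illustration, not the software used in the original study. With 15 retained items per cell, the 3 x 3 design yields 135 observations and the 126 error degrees of freedom seen in the F ratios reported below.

    # Two-way fixed-effects ANOVA sketch: item analytic technique (3 levels) by
    # sample size (3 levels), with the 15 selected item difficulties as observations.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf
    from statsmodels.stats.anova import anova_lm

    rng = np.random.default_rng(0)
    rows = [{"method": m, "n": n, "difficulty": rng.uniform(0.3, 0.9)}
            for m in ("classical", "factor", "rasch")
            for n in (250, 500, 995)
            for _ in range(15)]                  # 15 retained items per cell
    df = pd.DataFrame(rows)

    model = smf.ols("difficulty ~ C(method) * C(n)", data=df).fit()
    print(anova_lm(model, typ=2))                # main effects, interaction, 126 residual df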

For item difficulty, no significant differences were found for

item analytic technique, sample size, or their interaction, F (2,126) =

2.57, p > .05; F (2,126) = .45, p > .05; F (4,126) = .33, p > .05

respectively.

The means, standard deviations, and ranges of the item difficulties

based upon the 15 item tests have been reported in Table 13.

For the analysis of variance performed on the transformed

biserial correlations, a significant F ratio was observed for the factor

of item analytic technique, F (2,126) = 14.862, p < .05. No significant

differences were observed for sample size or the interaction of sample

size and item analytic technique for the transformed biserial








correlations, F (2,126) = .30, p > .05; F (4,126) = 1.16, p > .05, respectively. The means, standard deviations, and ranges of the transformed biserial correlations based upon the 15 item tests have been presented in Table 14.


TABLE 13

15 ITEM TESTS:
DESCRIPTIVE STATISTICS FOR ITEM DIFFICULTY
BY PROCEDURE AND SAMPLE SIZE




                          Procedure                       Sample Size

               Classical    Factor      Rasch       250      500      995
                            Analysis

Mean             .65          .67        .60        .66      .63      .63

Standard
Deviation        .15          .14        .16        .15      .15      .16

Range          .31-.91      .35-.92    .27-.90    .31-.92  .32-.92  .27-.91


Post hoc comparisons were made to determine which of the three

item analytic procedures based upon their means contributed to the

significant F ratio for the transformed biserial correlations. Tukey's

HSD (honestly significant difference) test for multiple comparisons

was employed (Kirk, 1968, p. 88). The HSD value (α = .01) was 6.88.

Therefore, a difference between means had to exceed this value to be

significantly different. The results of the post hoc comparisons





9When the actual biserial correlations were tested in the two-way
analysis of variance design, similar F ratios were observed.








between the mean item discriminations have been reported in Table 15.


TABLE 14

15 ITEM TESTS:
DESCRIPTIVE STATISTICS FOR ITEM DISCRIMINATION
BY PROCEDURE AND SAMPLE SIZE



                          Procedure                       Sample Size

               Classical    Factor      Rasch       250      500      995
                            Analysis

Mean            53.60        54.49      43.51      51.20    49.47    50.98

Standard
Deviation       12.18        11.98       8.21      14.17    11.20    10.36

Range           29-82        34-91      28-64      34-91    28-78    30-73


aBased on transformed biserial correlations. The transformation was a
linear transformation of the Fisher z statistic and multiplication by
the constant 60.241, providing a range of 0-100 for the biserial
correlation (Davis, 1946).
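The footnote describes the Davis (1946) index only as a rescaled Fisher z; the sketch below assumes the transformation is simply 60.241 times the Fisher z (inverse hyperbolic tangent) of the biserial correlation, which is an assumption rather than a statement of Davis's exact scaling. Back-converting the Table 14 means on that assumption suggests average biserials near .71, .72, and .62 for the classical, factor analytic, and Rasch items respectively.

    import math

    def to_davis(r_biserial):
        # Assumed form of the Davis (1946) rescaling: 60.241 times Fisher's z.
        return 60.241 * math.atanh(r_biserial)

    def from_davis(d):
        # Back-convert a transformed discrimination to the biserial metric.
        return math.tanh(d / 60.241)

    for mean_d in (53.60, 54.49, 43.51):         # procedure means from Table 14
        print(mean_d, "->", round(from_davis(mean_d), 2))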



From Table 15, it is apparent that the mean transformed biserial

correlation from the Rasch developed test was significantly lower

than the mean biserial correlations from the tests developed by

classical item analysis and factor analysis.

Comparison of the 30 Item Tests on Precision

The descriptive statistics based on the double cross-validation

of the 30 items selected by each procedure, according to sample size,

have been presented in Table 16.








TABLE 15

POST HOC COMPARISONS OF THE DIFFERENCES
BETWEEN THE MEAN ITEM DISCRIMINATIONSa
FOR THE 15 ITEM TESTS


                                         Means
                                 54.49      53.60      43.51

Factor Analysis (54.49)          -----       .889      10.98**

Classical Item
Analysis (53.60)                            -----      10.09**

Rasch Analysis (43.51)                                  -----


aBased on transformed biserial correlations. The transformation was a
linear transformation of the Fisher z statistic and multiplication by
the constant 60.241, providing a range of 0-100 for the biserial
correlation (Davis, 1946).

**p < .01, HSD = 6.88.
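The pairwise decisions in Table 15 follow directly from the reported means and the HSD criterion of 6.88; the check below simply re-applies that decision rule.

    # Reproducing the Table 15 decisions from the reported means and the
    # HSD criterion of 6.88 (alpha = .01) given in the text.
    from itertools import combinations

    means = {"Factor Analysis": 54.49,
             "Classical Item Analysis": 53.60,
             "Rasch Analysis": 43.51}
    HSD = 6.88

    for (a, ma), (b, mb) in combinations(means.items(), 2):
        diff = abs(ma - mb)
        verdict = "significant" if diff > HSD else "not significant"
        print(f"{a} vs. {b}: difference = {diff:.2f} ({verdict})")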



By increasing the test length to 30 items, the internal consistency

estimate was increased across each method and sample size, but a pattern

similar to that for the 15 item test emerged. The internal consistency

estimates from the test based on the Rasch model were slightly lower

than the internal consistency estimates for the tests based on

classical item analysis and factor analysis. The observed internal

consistency estimates were tested for significance, using the confidence

intervals described in the previous section, to see if they were

statistically different from the internal consistency estimate for the

population.

The projected population internal consistency estimate for a 30

item test was found to be .814 (via the Spearman-Brown Prophecy Formula).












TABLE 16

DESCRIPTIVE STATISTICS FOR THE TEST COMPOSED OF THE
30 BEST ITEMS SELECTED BY EACH ITEM ANALYTIC
PROCEDURE ACCORDING TO SAMPLE SIZE









The value of .814 indicated the expected internal consistency

if 20 of the 50 items were randomly deleted.

Based on the observed internal consistency estimates reported

in Table 16, confidence intervals were generated for each item analytic

procedure and have been presented in Table 17.



TABLE 17

CONFIDENCE INTERVALSa FOR THE OBSERVED INTERNAL
CONSISTENCY ESTIMATES BASED ON THE 30
ITEM TESTS ACCORDING TO SAMPLE SIZE



95% Confidence Interval
Procedure           N = 250          N = 500          N = 995


Classical .800-.865 .845-.880* .850-.874*

Factor Analysis .825-.881* .834-.871* .848-.873*

Rasch .794-.860 .817-.858* .838-.863*


aThe F values used in calculating the confidence intervals were obtained
from Marisculo (1971).
*
Statistical significance is observed when the population internal
consistency estimate is not included in the confidence interval
generated for each observed internal consistency estimate. The
projected population value was .814.



For the sample of 250 examinees, only one item analytic technique

(factor analysis) produced an internal consistency estimate that was

statistically different from the projected population estimate at a

confidence level of 95 percent.

However, all three techniques produced tests with internal con-

sistency estimates significantly different from the projected population








estimate when the sample was increased to 500 and 995. Thus, when the number of examinees was large, each of the three techniques produced tests with higher internal consistency estimates than if the test had been produced by randomly deleting items.

For the 30 item tests, decreasing the sample size tended to decrease internal consistency for each method (Table 16), but the decrease was very slight.

The standard error of measurement was essentially the same for

the three methods of item analysis across the varying sample sizes.

Two-way analyses of variance were run on item difficulties and

item discrimination for the 30 item tests, similar to those run for

the 15 item tests. Again, the independent variables were item analytic

technique and sample size, each containing three levels.

No significant differences were observed for item difficulty

for the independent variables of item analytic technique, sample size,

or their interaction, F (2,261) = .46, p > .05; F (2,261) = .27, p > .05;

F (4,261) = .24, p > .05 respectively.

No significant differences were observed for the transformed

biserial correlations for the independent variables of item analytic technique, sample size, or their interaction, F (2,261) = 1.97, p > .05;

F (2,261) = .74, p > .05; F (4,261) = .48, p > .05 respectively.












10When the actual biserial correlations were tested in the two-way
analysis of variance design, similar F values were observed.









The means, standard deviations, and ranges of the item difficulties

and transformed biserial correlations based upon the 30 item tests

have been presented in Tables 18 and 19 respectively.



TABLE 18

30 ITEM TESTS: DESCRIPTIVE STATISTICS FOR ITEM
DIFFICULTY BY PROCEDURE AND SAMPLE SIZE



                          Procedure                       Sample Size

               Classical    Factor      Rasch       250      500      995
                            Analysis

Mean             .58          .60        .59        .58      .60      .59

Standard
Deviation        .17          .16        .17        .18      .16      .16

Range          .21-.91      .29-.92    .23-.90    .23-.92  .24-.92  .21-.91




TABLE 19

30 ITEM TESTS: DESCRIPTIVE STATISTICS FOR ITEM
DISCRIMINATIONSa BY PROCEDURE AND SAMPLE SIZE



                          Procedure                       Sample Size

               Classical    Factor      Rasch       250      500      995
                            Analysis

Mean            41.02        41.61      38.56      39.42    40.37    41.40

Standard
Deviation       11.82        10.85       9.93      12.45     9.87    10.34

Range            9-70        19-68      13-64       9-70    21-68    24-66


aBased on transformed biserial correlations. The transformation was a
linear transformation of the Fisher z statistic and multiplication by
the constant 60.241, providing a range of 0-100 for the biserial
correlation (Davis, 1946).









Comparison of the 30 Item Tests on Efficiency

Lord (1974a, 1974b) proposed the formula used for approximating

the relative efficiency for two tests, stated previously in equation

13 as:

$$\mathrm{R.E.}(y,x) = \frac{n_y \, x\,(n_x - x)\, f_x^{2}}{n_x \, y\,(n_y - y)\, f_y^{2}}\,,$$

where R.E.(y,x) denotes the relative efficiency of y compared to x, n_x and n_y are the numbers of items in the two tests, x and y are the number-right scores having the same percentile rank, and f_x^2 and f_y^2 are the squared observed frequencies of x and y obtained from frequency distributions for similar groups of examinees. A careful examination of the formula for relative efficiency indicated that when n_x = n_y and x = y, it was the number of examinees at the specified ability level (f_x^2 and f_y^2) that determined the efficiency of the test. That is,
the fewer examinees observed at a particular percentile rank, the better

the test discriminates at that percentile rank. Therefore, test

efficiency was equated with the level of discrimination the test

was able to make between examinees, at various scores or percentile

ranks.
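Equation 13 is simple to apply once the two frequency distributions have been tabled; the sketch below uses hypothetical scores and frequencies (not values from Appendix B) to show that a smaller frequency at a given percentile rank pushes the relative efficiency above 1.00.

    def relative_efficiency(x, y, n_x, n_y, f_x, f_y):
        # Lord's (1974) approximation in equation 13: R.E.(y, x) for number-right
        # scores x and y at the same percentile rank, with observed frequencies
        # f_x and f_y in comparable groups of examinees.
        return (n_y * x * (n_x - x) * f_x ** 2) / (n_x * y * (n_y - y) * f_y ** 2)

    # Hypothetical values for two 30 item tests at one percentile rank: fewer
    # examinees pile up at that score on test y, so y discriminates better there.
    print(relative_efficiency(x=21, y=22, n_x=30, n_y=30, f_x=40, f_y=25))   # about 2.7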

Three relative efficiency comparisons were made using the 30 item

tests based on the sample of 995 examinees. The three comparisons were:

(a) the test developed from factor analysis was compared to the test

developed by classical item analysis, (b) the test developed by Rasch

analysis was compared to the test developed by classical item analysis,

and (c) the test developed from the Rasch analysis was compared to the

factor analytically developed test.









The efficiency curves for the three comparisons were shown in

Figure 2. The relative efficiency value was plotted on the ordinate,

while the percentile rank (student ability level) was plotted

along the abscissa. Computed values for the relative efficiency

comparisons have been reported in Appendix B. A relative efficiency

of 1.00 would indicate that the tests are equally efficient.

The test developed by factor analysis was more efficient for the

lower tenth of the pupils when compared to the test developed from

classical item analysis. Both the tests were about equally efficient

for the middle ability groups and high ability groups.

The Rasch developed test was more efficient than the test based

on classical item analysis for average to high ability students

(40th-90th percentile rank). However, it was less efficient than the

classical item analysis test for students with very low or very high

abilities (1st-20th percentile rank and 98th percentile rank).

When compared to the factorially developed test, the Rasch test

was again more efficient for students of average to high abilities

(50th-90th percentile rank). The factorially developed test appeared

more efficient for the very low and very high ability students

(1st-20th percentile rank and 98th percentile rank).


FIGURE 2. RELATIVE EFFICIENCY COMPARISONS FOR THE THREE 30 ITEM TESTS, N = 995.
(Relative efficiency on the ordinate; percentile rank on the abscissa.)

KEY: factor analysis compared to classical item analysis; Rasch analysis
compared to classical item analysis; Rasch analysis compared to factor
analysis.


Summary

The results reported in this chapter are summarized for each of
the five hypotheses.

Hypothesis 1. There are no significant differences in the
internal consistency estimates of the tests produced by the three
methods, as the number of items decreases, when compared to the








projected internal consistency estimates for the population for tests

of similar length.

Confidence intervals were calculated to test for differences between

the observed internal consistency estimates and the internal consistency

estimate for the population. As reported in Tables 12 and 17, for the

15 and 30 item tests, 15 of the 18 confidence intervals (at the 95 percent

level) generated around the sample estimate did not contain the population

value. This means that 15 of the observed internal consistency estimates

were superior to the population values projected for subtests of similar

length created by random deletion of items. Therefore, hypothesis one

was not supported. The procedures that produced the three observed

internal consistency estimates that were not significantly different

from the population value, and hence no different than would be expected

by random item deletion, were the Rasch procedure (15 item test, N = 500;

30 item test, N = 250) and the classical item analysis procedure (30

item test, N = 250).

Hypothesis 2. There are no differences in the internal consistency

estimates of the tests produced by the three methods when the number

of examinees is decreased.

Hypothesis two was supported for the 15 and 30 item tests. Slight

decreases in internal consistency estimates were noted for the 15

item test (Table 11) as sample size decreased, but only decreases of

one or two one-hundredths of a point. Even smaller decreases were

observed on the 30 item test (Table 16).

Hypothesis 3. There are no meaningful differences in the

magnitude of the standard error of measurement of the tests produced

by the three methods.










Hypothesis three was supported for the 15 and 30 item tests.

Meaningful differences were defined to be > 1.00, but none of the

three methods produced tests with standard errors of measurement that

differed by that much. In each case, the difference was approximately

one-tenth of a point or less (Tables 11 and 16).

Hypothesis 4. There are no differences in the difficulties or

discrimination of the items selected by the three methods.

Hypothesis four was supported for the 15 and 30 item tests with

respect to item difficulty. That is, the two-way analysis of variance

revealed no significant differences for either the 15 or 30 item tests

with regard to item difficulty.

Hypothesis four was also supported for item discrimination, but

only for the 30 item tests. The two-way analysis of variance for item

discrimination indicated no significant differences for the 30 item

tests; however, on the 15 item tests, a significant F ratio (p < .05)

for item analytic procedure was observed for item discrimination.

Tukey's HSD test revealed that items selected by the Rasch procedure

had significantly lower average biserial correlations than the items

selected by factor analysis and classical item analysis (Table 15).

This could have been expected because the range of the biserial

correlation was restricted when the items were originally selected

for the Rasch model. This procedure was necessary to meet one of the

assumptions for the Rasch model.

Hypothesis 5. There are no differences across ability levels in

the efficiency of the tests produced by the three methods.







Hypothesis five was not supported. The efficiency curves

illustrated in Figure 2 generally indicated that the tests based on

classical test theory were more effective for measuring students with

very low ability (20th percentile rank or less) and students with very

high abilities (98th percentile rank). The Rasch developed test was

most efficient for assessing average and high ability students (40th-

90th percentile rank).
















CHAPTER V

DISCUSSION AND CONCLUSIONS



This study was conducted to determine which of the three item

analytic procedures (classical item analysis, factor analysis, and the

Rasch model) might produce the superior test in terms of the precision

and the efficiency of measurement. A common item and examinee population

was used to test five hypotheses. Of the five hypotheses, three dealt

with elements of test precision as measured by internal consistency

estimates. Another hypothesis treated the issue of item discrimination.

Thus, it too was related to internal consistency. The fifth hypothesis

focused on the relative efficiency of the tests produced by three item

analytic techniques. This hypothesis altered the emphasis of the study

from one overall specific measure of a test's accuracy, in terms of

internal consistency, to a general comparison of each method as a

function of ability level. The discussion of the results, then, has been focused on two major areas: (a) the precision of the tests, and (b)

the efficiency of the tests produced by the three methods of item analysis.

The Precision of the Tests Produced by the
Three Methods of Item Analysis

Each of the three item analytic techniques was applied to an in-

dependent sample to select the best 15 and 30 items. The stability of

the summary statistics across each sample size for the three item analytic

techniques indicated a tendency for the nine samples to be very homogeneous.








The similarity of the means, standard deviations, and percentages of

variance accounted for was noted in Tables 5-8, with the exception

of the mean square fit statistic (Table 7) which increased with

sample size. (This exception is discussed later in this chapter.)

From these samples, items were selected by each item analytic technique

to maximize internal consistency.

The data reported in Tables 11 and 16 indicated the effectiveness

of each item analytic technique in producing internally consistent

tests. Before an overall decision can be made as to the superiority of

one technique over another, each of the hypotheses relating to precision

must be considered.

Internal Consistency

Data in Tables 11 and 16 indicate that the two tests based on

classical test theory (factor analysis and classical item analysis)

appeared superior in terms of internal consistency when compared to the

tests developed by the Rasch model.

To test whether any of the three methods produced tests with

greater internal consistency than a test created by random item deletion,

the internal consistency estimates were compared to the projected internal

consistency value for the population by using confidence intervals as

suggested by Feldt (1965). In order for a given sample internal consis-

tency estimate to be significant, the population value could not be

included in the confidence interval generated around that sample value.

For the 15 item tests, nine confidence intervals were calculated for

the nine estimates of internal consistency, one for each method at each

sample size. Eight of the nine sample values were shown to be significantly

greater than the population estimate at the 95 percent confidence level

(Table 12). Only the internal consistency estimate of the Rasch test,









based on the sample of 500 examinees, failed to reach a level significantly

greater than would have been expected by chance.

For the 30 item tests, nine confidence intervals were also calculated

for the nine estimates of internal consistency, one for each method

at each sample size. Seven of the nine sample internal consistency

estimates were shown to be significantly greater than the population

estimate at the 95 percent confidence level (Table 17). The tests

based on classical item analysis and Rasch analysis, for the sample of

250 examinees, were not significantly different from the projected

population value for a 30 item test created by random item selection.

Therefore, for smaller samples (N = 250) factor analysis appeared

to be superior to classical item analysis and the Rasch analysis in

producing the most precise test.

Generally, as the number of examinees decreased so did the internal

consistency estimates. However, the tests based on factor analysis were

least affected by decreasing the sample sizes used in the cross-validation

for the 15 and 30 item tests (Tables 11 and 16).

Standard Error of Measurement

The standard error of measurement is the standard deviation of the

distribution of errors surrounding an individual's observed score on

an infinite number of parallel tests. Hence the smaller the standard

error of measurement, the greater the precision of the measurement. This

statistic is often considered a more meaningful measure of an instrument's

reliability than the reliability coefficient itself (Magnusson, 1966,

p. 82). Based on the data for this study, the standard errors of

measurement were consistently smaller for both the 15 and 30 item tests




Full Text

PAGE 1

n A COMPARISON OF THREE TYPES OF ITEM ANALYSIS IN TEST DEVELOPMENT USING CLASSICAL AND LATENT TRAIT METHODS By IRIS G. BENSON A DISSERTATION PRESENTED TO THE GRADUATE COUNCIL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY UNIVERSITY OF FLORIDA 1977

PAGE 2

^iiiiii

PAGE 3

ACKNOWLEDGMENTS I am deeply indebted to two special people who have greatly influenced by graduate education, Dr. William Ware, chairman of my guidance committee, and Dr. Linda Crocker, unofficial cochairman of my committee. Their continued encouragement and support has resulted in my reaching this point in my graduate studies. I shall always be extremely grateful to Dr. Ware and Dr. Crocker for whatever skills I have developed as a researcher and as a teacher are in large part due to their advice and guidance. To them 1 owe the high value I place on objective, quantitative research methods. Further, 1 would like to acknowledge the tremendous amount of time they spent in molding the final copy of this manuscript. 1 would also like to express my appreciation to the members of my committee. Dean John Newell and Dr. William Powell, for their suggestions and editorial coimnents on this dissertation. Special thanks are extended to Dr. Wilson Guertin for his assistance with portions of the study, and as an unofficial member of my committee. 1 would like to thank Dr. Jeaninne Webb, Director of the Office of Instructional Resources, and Mr. Robert Feinberg and Ms. Arlene Barry of the Testing Division, for providing the data used in this study. Finally, I would like to express my sincere appreciation to my friends and family who stood by me during very trying times in my graduate education. 11 y

PAGE 4

TABLE OF CONTENTS PAGE ACKNOWLEDGMENTS ii LIST OF TABLES y LIST OF FIGURES vii ABSTRACT ^iii CHAPTER I. INTE^ODUCTION 1 The Problem 7 Purpose of the Study 8 Significance of the Study 10 Organization of the Study H II. REVIEW OF THE LITERATURE 13 Item Analysis Procedures for the Classical Model .... 13 Research Related to Classical Item Analysis in Test Development 18 Simplified Methods of Obtaining Item Discrimination 21 Item Analysis Procedures for the Factor Analytic Model 22 Research Related to Factor Analysis in Test Development .. , 24 Comparison of .Factor Analysis to^-CTassical Item Analysis . . "/'"r'""!^ T'*". . . . 7"""7""T' 25 Item Analysis Procedures for the Latent Trait Model ... 27 Research Related to Latent Trait Models in Test Development 32 Comparison of the fesch Model to Factor Analysis ... 34 Summary . . '.' . . "7~": :"". 36 III. METHOD 39 The Sample 39 The Instrument 40 The Procedure 42 Design 42 Item Selection 44 Double Cross-Validation j:_„---.-^— •-— ' -•— ;. • • • 46 Statistical Analyses , T'T 48 Summary 51 111

PAGE 5

TABLE OF CONTENTS Continued CHAPTER PAGE IV. RESULTS 53 Item Selection 54 Double Cross-Validation 69 Comparison of the 15 Item Tests on Precision 69 Comparison of the 30 Item Tests on Precision 75 Comparison of the 30 Item Tests on Lfficienc)' 81 Summary 82 V. DISCUSSION AND CONCLUSIONS 87 The Precision of the Tests Produced by the Three Methods of Item Analysis 87 Internal Consistency 88 Standard Error of Measurement 89 Types of Items Retained 90 Conclusions 9I The Efficiency of the Tests Produced by the Three Methods of Item Analysis 93 Conclusions 95 Implications for Future Research 95 VI. SUMMARY 99 REFERENCES IO5 APPENDIX A: Mathematical Derivation of the Rasch Model .... 112 APPENDIX B: Relative Efficiency Values Used in Figure 2 for the Comparisons Among Item Analytic Methods 119 BIOGRAPHICAL SKETCH 120 IV

PAGE 6

LIST OH TABLES TABLE PAGE 1 DESCRIPTIVE DATA ON THE VERBAL APTITUDE SUBTEST OF THE FLORIDA TWELFTH GRADE TEST 1975 ADMINISTRATION 41 2 SYSTEMATIC SAMPLING DESIGN OF THE STUDY N 5,235 ... 43 3 DOUBLE CROSS-VALIDATION DESIGN OF THE STUDY 47 4 DEMOGRAPHIC BREAKDOWN BY ETHNIC ORIGIN AND SEX FOR TOTAL SAMPLE 55 5 SUMMARY STATISTICS ON THE 50 TEST ITEMS BASED ON CLASSICAL ITEM ANALYSIS FOR EACH SAMPLE SIZE .... 56 6 ITEM LOADINGS ON THE FIRST UNROTATED FACTOR FOR THE 50 TEST ITEMS BASED ON FACTOR ANALYSIS FOR EACH SAMPLE SIZE 58 7 SUMMARY STATISTICS ON THE 50 TEST ITEMS BASED ON THE RASCH MODEL FOR EACH SAMPLE SIZE 61 8 DESCRIPTIVE DATA ON ITEM DISCRIMINATION ESTIMATES BASED ON THE RASCU MODEL ACCORDING TO SAMPLE SIZE ... 65 9 THE 15 BEST ITEMS SELECTED UNDER EACH ITEM ANALYTIC PROCEDURE ACCORDING TO SAMPLE SIZE 66 10 THE 30 BEST ITEMS SELECTED UNDER EACH ITEM ANALYTIC PROCEDURE ACCORDING TO SAMPLE SIZE 67 11 DESCRIPTIVE STATISTICS FOR THE TEST COMPOSED OF THE 15 BEST ITEMS SELECTED BY EACH ITEM ANALYTIC PROCEDURE ACCORDING TO SAMPLE SIZE 70 12 CONFIDENCE INTERVALS FOR THE OBSERVED INTERNAL CONSISTENCY ESTIMATES BASED ON THE 15 ITEM TESTS ACCORDING TO SAMPLE SIZE 72 13 15 ITEM TESTS: DESCRIPTIVE STATISTICS FOR ITEM DIFFICULTY BY PROCEDURE AND SAMPLE SIZE 74 14 15 ITEM TESTS: DESCRIPTIVE STATISTICS FOR ITEM DISCRIMINATIONS BY PROCEDURE AND SAMPLE SIZE 75 15 POST HOC COMPARISONS OF THE DIFFERENCES BETWEEN THE MEAN ITEM DISCRIMINATIONS FOR THE 15 ITEM TESTS . . 76 V

PAGE 7

LIST OF TABLES Continued TABLE PAGE 16 DESCRIPTIVE STATISTICS 1-OR THE TEST COMPOSED OF THE 30 BEST ITEMS SELECTED BY EACH ITEM ANALYTIC PROCEDURE ACCORDING TO SAMPLE SIZE 77 17 CONFIDENCE INTERVALS FOR THE OBSERVED INTERNAL CONSISTENCY ESTIMATES BASED ON THE 30 ITEM TESTS ACCORDING TO SAMPLE SIZE 78 18 30 ITEM TESTS: DESCRIPTIVE STATISTICS FOR ITEM DIFFICULTY BY PROCEDURE AND SAMPLE SIZE 80 19 50 ITEM TESTS: DESCRIPTIVE STATISTICS FOR ITEM DISCRIMINATIONS BY PROCEDURE AND SAMPLE SIZE 80 VI

PAGE 8

LIST OF FIGURES FIGURE PAGE 1 HYPOTHETICAL ITEM CHARACTERISTIC CURVES FOR THE FOUR LATENT TRAIT MODELS 29 2 RELATIVE EFFICIENCY COMPARISONS FOR THE THREE 30 ITEM TESTS N = 995 S3 Vll

PAGE 9

Abstract of Dissertation Presented to tl\e Graduate Council of the University of Florida in Partial Fulfillment of tlie Requirements for the Degree of Doctor of P'ailosophy A COMPARISON OF THREE TYPES OF ITEM ANALYSIS IN TEST DEVELOPMENT USING CLASSICAL AND LATENT TRAIT METHODS By Iris G. Benson December 1977 Clia i rman : Willi am B . Wa r e Major Department: Foundations of Education Test reliability and validity are determined by the quality of the items in the tests. Tlirough the application of item analysis procedures, test constructors are able to obtain quantitative, objective information useful in developing and judging the quality of a test and its items. Classical test theory forms the basis for one method of test development. An integral part of the development of tests based on the classical model is selection of a final set of items from an item pool based on classical item analysis or factor analysis. Classical item analysis requires identification of single items which provide maximum discrimination between individuals on the latent trait being measured. The bi serial correlation between item score and total score is coimnonly used as an index of item discrimination. An alternative method of test development, but based on the classical model, is factor analysis. Factor analysis is a more complex test development procedure than classical item analysis. It is a VI 11

PAGE 10

statistical teclmique that takes into account the item correlation with all other individual items in the test simultaneously. Thus, classical item analysis can be viewed as a unidimensional basis for item analysis, less sophisticated tlian tlie multidimensional procedure of factor analysis. Recently, the field of latent trait theory has provided a new approach to test construction. Several latent trait models have been developed; however, this study was concerned only with the one-parameter logistic Rasch model. The Rasch model was chosen because it is the most parsimonious of the latent trait models and has recently been used in the development and equating of tests. A review of the literature revealed numerous studies conducted in each of the three areas of item analysis, but no comparative studies were reported among all three item analytic techniques. Therefore, the present study was designed to compare the methods of classical item analysis, factor analysis, and the Rasch model in terms of test precision and relative efficiency. An empirical study was designed to compare the effects of the three methods of item analysis on test development across different sample sizes of 250, 500, and 995 subjects. Item response data were obtained from a sample of 5,235 high school seniors on a 50 item cognitive test of verbal aptitude. The subjects were divided into nine independent samples, one for each item analytic technique and sample size. The study was conducted in three phases: item selection, computation of item and test statistics for selected items on double cross-validation samples, and statistical analyses of item characteristics. For each item analytic procedure two tests were developed: IX

PAGE 11

a 15 item test, and a 50 item test. Four dependent variables were obtained for eacli test to assess precision: internal consistency estimates, standard error of measurement, item difficulties, and item discriminations. In addition, tlic relative efficiencies of the 30 item tests developed by each item analytic technicjue were compared for the sample of 995 subjects. The results of the analysis revealed tliat there were no differences between the tests developed by the three methods of item analysis, in terms of the precision of measurement. In terms of efficiency, substantive differences between the tests produced by the three item analytic methods were observed. Specifically, the tests based on classical test theory were more effective for measuring very low and very high ability students. Tlie Rasch developed test was more efficient for assessing average and high ability students.

PAGE 12

CHAPTFR I INTRODUCTION The systematic approach to test development was initiated by Binet and Simon in 191(i. Since that time psychometricians have been concerned with the extent to wliich accurate measurement of a person's "ability" is possible. Most measurement experts agree that upon repeated testing an individual's observed score will vary even though his true ability remains constant. Tliis variability is the essence of classical test tlieory . Classical test theory is based upon the assumption that a person's observed score (X) is made up of a true score (T) and error score (E) denoted: X = T + E. (1) Limited by few assumptions, this theory has wide applications. The few assumptions pertain to the eri'or score (Magnusson, 1966, p. 64): 1. Tlie mean of an examinee's error scores on an infinite number of jiarallel tests is zero. 2. The correlation between examinee's error scores on parallel tests is zero. 3. The correlation between examinees' error scores and true scores is zero. Relying upon these assumptions, psychometricians liave used the observed score (X) to represent the best estimate of a person's true score (T) . 1

PAGE 13

The accuracy of the observed score [XJ in representing an examinee's true score (T) is described by the reliability coefficient. One definition of reliability is given by the coefficient of precision. This coefficient is the correlation between truly parallel tests, assuining the examinee's true score does not change between two measurements. Lord and Novick (1968] have defined truly parallel tests to be those for which, "the expected values [true scores] of parallel measurements are equal; and the observed score variances of parallel measurements are equal (p. 4S)." The reliability coefficient for the population is defined as (Lord and Novick, 1968, p. 134): 2 '' r = "^T^ 1 ^e" , (2) XX ^— 2 2 where o is the true score variance, o,, is the observed score variance, i A and Op is the error score variance. IVhen this expression is used to represent the coefficient of precision, it can be interpreted as the extent to which unreliability is due solely to inadequacies of the test form and testing procedure rather than due to changes in examinees over time. The coefficient of precision is a theoretical value because the 2 2 components o and o cannot be observed. The coefficient of precision is usually estimated by internal consistency methods. Internal consistency is a measure of the relationship between random parallel tests. Random parallel tests are composed of items drawn from the same population of items (Magnusson, 1966, p. 102-105). Scores on these tests may differ somewhat from true scores in means, standard deviations, and correlations because of random errors in the sampling of items. However, random parallel tests are more often encountered in practice than are

PAGE 14

truly parallel tests. Cronbach's coefficient alpha (1951) is the internal consistency coefficient commonly used to represent tlie average correlation among all possible tests created by dividing the domain into random lialves. Thus, the intern.al consistency coefficient indicates the extent to which all tlie items are measuring the same ability or trait, Psychological traits are often described as latent because they cannot be directly observed. Therefore, psychological tests are developed in an attempt to measure these latent traits. Classical test tlieory forms the basis for one metliod of test development. An integral part of the development of tests based on the classical model is the utilization of classical item analysis or factor analysis. Classical item analysis is a procedure to obtain a description of the statistical cliaracteristics of each item in the test. This approach requires identification of single items which provide maximum discrimination between individuals on the latent trait being measured. Theoretically, selecting items which have high correlations with total test score will result in a discriminating test winch is homogeneous with respect to tlie latent trait. Therefore, classical item analysis is an aid to developing internally consistent tests. An alternative method of test development, but based on the classical model, is factor analysis. Factor analysis is a more complex test development procedure than classical item analysis. It is a statistical technique that takes into account the item correlation with all other individual items in the test simultaneously. Groups of similar items tend to cluster together and comprise the latent traits (factors) underlying the test. Under the classical model then, classical item analysis can be viewed as a unidimensional basis for item analysis, less sophisi ticated than the multidimensional ]irocedure of factor analysis.

PAGE 15

The purpose of factor analysis is to represent a variable in terms of one or several underlying factors lHarman, 1967). Depending upon the objective of the analysis, two general approaches are used in factor analysis: (a) common factor analysis, and (h) principal components analysis. A common factor solution would he warranted if the researcher were interested in determining the number of common and unique factors underlying a given test. A principal component solution would be warranted if it were of interest to extract the maximum amount of variance from a given test. Regardless of the apju-oach used, factor analysis is an item analytic technique in whicli all test items are considered simultaneously to produce a matrix of item correlations with factors. It is these correlations or item loadings tliat indicate the strength of the factor and also the number of factors underlying the test. However, factor analysis shares the weakness of classical item analysis, that of being sample dependent. Critics of classical test theory contend that a major weakness of tests developed from this model is that the item statistics vary when the examinee group changes; item statistics may also vary if a different set of items from the same domain is used with the same examinee group (llambleton and Cook, 1977; Wright, 1968). Thus, the selection of a final set of test items will be sample dependent. Until recently, classical item analysis and factor analysis were tlie only techniques described in measurement texts for use in item analysis and test development (Baker, 1977). However, with the publication of Lord and Novick's Statistical Theories of Mental Test Scores (1968) and the availability of computer programs, considerable attention is being directed now toward the field of latent trait tlieory as a new area in

PAGE 16

test development. Latent trait theory dates back to Lazarsfeld (1950) who introduced the concept; however, Fredrick Lord is generally given credit as the father of latent trait theory (Hambleton, Swaminathan, Cook, Eignor, and Gifford, 1977). Proponents of this approach claim that the advantages of latent trait theory over classical test theory are twofold: (a) theoretically it provides item parameters which are invariant across examinee samples which will differ with respect to the latent trait, and (b) it provides item characteristic curves that give insight into how specific items discriminate between students of varying abilities. These properties of latent trait theory will be presented in more detail in Chapter II. Four latent trait models have been developed for use with dichotomously scored data: the normal ogive, and tlie one-, two-, and three-parameter logistic model (Hambleton and Cook, 1977; Lord and Novick, 1968). This study is concerned with the one-parameter logistic Rasch model because it is the simplest of the four models. Tests developed using the Rasch model are intended to provide objective measurement of the examinee's true ability on the latent trait in question, as well as providing for invariant item parameters (Rasch, 1966; Wright, 1968). lliat is, any subset of items from a population of items that have been calibrated by the Rasch model should accurately measure the examinee's true ability regardless of whether the items are very easy or very difficult; also, the item parameters should remain constant over different examinees. In measurements obtained from classical test theory this objective feature is rarely attained. The item parameters associated with classical test theory are group and item specific. That is, the item parameters are determined by the

PAGE 17

ability of the people taking the test and the subset of items chosen. Wright (196S) has stated, "llie growth of science depends on the development of objective methods for transforming an observation into measurement (p. S6)." Latent trait theory is an attempt to develop mental measurement into a technique similar to measurement in the physical sciences . Latent trait theory is based on strong assumptions that are restrictive and lience limit its application [Hambleton and Cook, 1977). The assumptions required for the Rasch model are the following (Rasch, 1966): 1. Tlie test is unidimensional, e.g., there is only one factor or trait underlying test performance. 2. The item responses of each examinee are locally independent, e.g., success or failure on one item does not hinder other item responses. 3. Tlie item discriminations are equal, e.g., all items load equally on the factor underlying the test. Lord and Xovick (1968) noted that the assumptions of unidimensionality and local independence are synonymous. To say that only one underlying ability is being tested means the items are statistically independent for persons at the same ability level. The third assumption relates to item characteristic curves. Tlie item characteristic curve is a mathematical function that relates the probiibility of success on an item to the ability measured by tlie test. Curves vary in slope and intercept to reflect how items vary in discrimination and difficulty. Tlie one-parameter logistic Rasch model (the one parameter is item difficulty) assumes all item discriminations are equal. Tlius all item characteristic curves should be similaiwith respect to their slopes.

PAGE 18

The Problem Several studies have been conducted to varify the invariant properties of tests constructed using the Rascli model (Tinsley and Dawis, 1975; IVhitely and Dawis, 1974; Wright, 1968). If we assume that tests developed using latent trait theory possess the quality of invariant item statistics, why then liasn't latent trait theory been more visible in the psychometric community? There appear to be three main reasons for this slow acceptance. First, the Rasch procedure is based on a mathematical model involving restrictive assumptions, e.g., the unidimensionality of the items, the local independence of the items, and equal item discriminations. A further restriction of the Rasch model is the assumption of minimal guessing. However, several researchers have demonstrated the robustness of the model with regard to departures from the basic assumptions (Anderson, Kearney and Everett, 196S; Dinero and Haertel, 1976; Rentz, 1976). Second, latent trait theory has not been used in practical testing situations because until recently there was a lack of available computer programs to handle the complex mathematical calculations. Hambleton et al. (1977) described four computer programs now available to the consumer. Third, measurement experts who are knowledgeable about latent trait models have been skeptical as to tlie real gains that may be available through this line of research. Are tests developed using latent trait models superior to tests developed using classical item analysis or factor analysis? The purpose of this study was to compare the precision and efficiency of cognitive tests constructed by tlie three methods (classical item analysis, factor analysis and the Rasch model) from a common item and examinee population. Precision, as measured by internal consistency,

PAGE 19

8 is an overall estimate of a test's homogeneity, but provides no information on how the test as a whole discriminates for tlie various ability groups taking tlie test. For that reason measures test efficiency (Lord, 1974a, 1974b) were incorporated into the study. Test efficiency provides information on tlie effectiveness of one test over another as a function of ability level. A cognitive college admissions subtest was used in this study for several reasons. First, tests of this t)^e are widely used by educational institutions for a large number of examinees each year, in the areas of selection, placement, and academic counseling. Most college admission examinations traditionally have been developed using classical item analysis. Second, because of the importance of the decisions made using such test scores, it would be worth investing considerable time and expense in the development of these instruments. Thus, the use of factor analysis or the Rascli model would be justified if superiority of either of these methods over classical item analysis could be determined. Third, tlie items on college admission tests have been written by experts, and each subtest is intended to be unidimensional , e.g., items measuring a single ability. Thus, assumptions from all models should be met. Fourth, because of the time required to take such examinations, it is important to maximize the precision and tlie effectiveness of tlie tests. Tlie possibility of using fewer items while maintaining precision would be desirable. Therefore, the question of which test development procedure can best accomplisli this is not a trival one. Purpose of the Study The purpose of this study was to compare empirically the Rasch model with classical item analysis and factor analysis in test development. Five research questions guided tiiis study. 1. Kill the three methods of test development produce tests with superior internal consistency estimates when compared to the jirojected

PAGE 20

internal consistency of the population as the number of items decreases? 2. Will the three methods of test development produce tests with stable estimates of internal consistency when the number of examinees decreases? 3. Will the tliree methods of test development produce tests with similar standard errors of measurement? 4. Will the three methods of test development select items that are similar in terms of difficulty and discrimination? 5. Will the three methods of test development produce equally efficient tests for all ability levels? Hypotheses This study investigated the capacities of three methods of test development to increase precision and efficiency of measurement in test construction. Tlie five questions posited in the previous section were phrased as testable hypotheses: 1. There are no significant differences in the internal consistency estimates of the tests produced by the three methods, as the number of items decreases, wlien compared to the projected internal consistency estimates for the population for tests of similar length. The standard error of measurement (SEM) is defined in the classical sense as (Magnusson, 1966, p. 79): SEM = Sv vT~~r XX, where Sv is the standard deviation of the test, and r is the XX reliability coefficient.

PAGE 21

10 2. There are no differences in the internal consistency estimates of the tests produced by the three methods when the number of examinees is decreased. 3. There are no meaningful" differences in the magnitude of the standard error of measurement of the tests produced by the three methods . 4. There are no significant differences in the difficulties or discriminations of the items selected by the three methods. 5. There are no differences across ability levels in the efficiency of the tests produced by the three methods. Significance of the Study Objective measurement has always been assumed in the physical sciences. It has only been recently that objective measurement in the behavioral sciences has been deemed possible with the advent of latent trait theory. Since the introduction of latent trait theory by Lazarsfeld (1950) and Lord (1952a, 1953a, 1953b) much of the research on latent trait models has been confined to theoretical research journals, Wright (1968), speaking at a conference on testing problems, discussed at an applied level the need to seriously consider latent trait theory and the Rasch model in particular as a major test development technique far superior to classical item analysis and factor analysis. However, even in 1968 computer programs were not yet available to run the analyses 2 Because test scores are usually reported and interpreted in whole numbers, a "meaningful" difference in the standard error of measurement is defined as a difference of ii 1.00.

PAGE 22

11 should anyone beyond academicians be intex'csted. Today this obstacle has been overcome, but many test developers remain unconvinced of the value of latent trait theory because its superiority to classical test theory has not been conclusively demonstrated. This study is an attempt to provide an empirical comparison of classical test theory and latent trait tlieory methods of test construction. Of the various logistic models that represent latent trait theory the Rasch model was chosen for comparison with traditional item analysis procedures in the present study because it is the most parsimonious latent trait model and has been used recently in the development of the equating of tests fRentz and Bashaw, 1977; Woodcock, 1974). Tlie Rascli model provides a matliematical explanation for the outcome of an event when an examinee attempts an item on a test. Rasch (1966) stated that the outcome of an encounter is governed by the product of tlie ability of tlie examinee and the easiness of the item and nothing more. The imi)lication of this simple concept (objectivity of measurement) would seem to revolutionize mental measurement. If invariant properties of items and ability scores can be identified and used to improv'e the psychometric quality of tests to an extent greater than now possible with classical and factor analytic procedures then \ip ti-uly are in the age of modern test theory. Organization of the Study The theoretical and empirical studies related to the three methods of item analysis are described in Chapter II. An empirical investigation to compare tlie three metliods of item analysis under varying conditions is described in Chapter III. The results of the study are reported in Chapter IV. A discussion of the results, conclusions of the study, and

PAGE 23

12 implications for future research in this area have been presented in the fiftli chapter. A sujiimarization of the study lias been provided in Chapter VI.

PAGE 24

CHAPTIZR II REVIEW OF THE LITEilATURE The quality of the items in a test determine its validity and reliability. Tlirough tlie application of item analysis procedures, test constructors are able to obtain quantitative objective information useful in judging the quality of test items. Item analysis thus provides an empirical basis for revising the test, indicating which items can be used again and which items have to be deleted or rewritten (Lange, Lehmann, and Mehrens , 1967). Item analysis data also help settle arguments and objections to specific items that might be raised by administrators, test experts, examinees, or the public. This study is focused on three approaches to item analysis (classical item analysis, factor analysis, and the Rasch model) as test construction techniques. It is assumed throughout this study that the test under construction is unidimensional , e.g., all items are measuring only one ability. These three approaches to item analysis and the relevant research related to each method are discussed in this chapter. Item Analysis Procedures for the Classi cal Model Item analysis as a test development technique emerged at the beginning of this century. Binet and Simon (1916) were among the first to systematically validate test items. They noted the proportion of students at particular age levels passing an item. This statistic was 13

PAGE 25

14 measuring the relative difficulty of the items for different age groups. The item difficulty index, defined as the percentage of persons passing an item and denoted by p, is one ' of the statistics used in classical item analysis. Item difficulty is related to item variance and hence to the internal consistency of the test. Test constructors are usually concerned with achieving high test reliability, e.g., precision of measurement. Therefore, an item difficulty of .50 is considered to be the ideal value necessary to maximize test reliability. Tliis is because half the examinees are getting the item correct and half the examinees are missing the item. The proportion missing an item is defined as 1-p or q. Tlius, when p is equal to .50, q is equal to .50. Uecause the variance of a dichotomized item is p x q tlie maximum variation an item can contribute to total test variance and ultimately to true-score variance is .25. As an item's difficulty index deviates from .50, its contribution to total test variance is always some value less than .25. Hence test constructors have been advised (Gulliksen, 1945) to select items with difficulty indices at or near .50. However, when items are presented in multiple choice or alternate choice format, the ideal 3 level of difficulty is adjusted to accommodate for guessing. A second important item statistic in classical item analysis is the item discrimination index. An item discrimination index is a measure • The ideal value of p = .50 assumes there has been no guessing on the item. Tlie effects of guessing on item difficulty tends to increase the ideal value of p. For example, on a four option multiple choice item the chance of guessing the correct answer is (J4) ( .501 = . 12. The value of .12 is added to .50 to correct for the effect of guessing and the ideal p would now be .62 (Lord, 1952b; Mehrens and Lehmann, 1975).

PAGE 26

15 of how well the item discriminates between persons who have high test scores and persons who have low test scores. Tlie discrimination index is often expressed as a correlation between the item and total test score. Ulien the criterion is total test score, the correlation coefficient indicates the contribution that item makes to the test as a whole, 'riius, on tests of academic achievement it is a measure of item validity as well as a contributor to internal consistency. Noting an increasing use of item analytic procedures for the improvement of objective examinations, Richardson (1936) pointed out that the development of tlie procedures of item analysis had centered primarily around the invention of various indices of association between the test item and tiie total test score, e.g., item discrimination indices. The two most popular item-test correlation indices are the biserial and point biserial correlations. The point biserial was developed by Pearson (1900) and is a special case of the more general Pearson Product Moment (PPM) correlation coefficient (Magnusson, 1966). Tliis index is recommended when one of the variables being correlated (the item score) represents a true dichotomy and the other variable (total test score) is continuously distributed. Pearson (1909) also derived the biserial correlation which is an estimate of the PPM. The biserial correlation is recoiimiended when one of the variables (the item score) has an underlying continuous and normal distribution which has been artifically dichotomized and the other variable (total test score) is continuously distributed. The assumption for the point biserial correlation is often hard to justify when it is suspected that knowledge required to answer an item is continuously distributed.

PAGE 27

16 In considei'ing the dichotomized item [pass/fail], McNemar (1962] has commented, "It is obvious that failing a test item represents anything from a dismal failure up to a near pass, whereas passing the item involves barely passing up to passing with the greatest of ease" Cp. 191). Thus, the biserial correlation is usually favored over the point biserial correlation as a measure of item discrimination. Also, the biserial is often chosen over the point biserial because the magnitude of the point biserial correlation for an item is not independent of the item difficulty (Davis, 1951; Henrysson, 1971; Swineford, 1956) . Specifically, values of the point biserial are systematically depressed as p approaches the extremes of .00 or 1.00. Lord and Novick (196S) have pointed out that because of this bias, the point biserial correlation tends to favor medium difficulty items over easy or very difficulty items. The formulae for the biserial and point biserial correlation respectively are (Magnusson, 1966, p. 200 § 203): ^bis =


One of the main objectives of classical test theory is to improve the internal consistency of the test under construction, where internal consistency was defined as the extent to which all items are measuring the same ability. To ensure high internal consistency the random error in the test must be minimized. As stated previously in Equation 2, reliability, in the classical model, was defined as:

$$ r_{xx} = \frac{\sigma_t^2}{\sigma_x^2} = 1 - \frac{\sigma_e^2}{\sigma_x^2} . $$

Thus, the relationship among the test items can be noted in the coefficient alpha formulae for estimating internal consistency for a sample (Magnusson, 1966, pp. 116-117):

$$ r_{xx} = \frac{n}{n-1}\left(1 - \frac{\sum S_i^2}{S_x^2}\right) \qquad (5) $$

or

$$ r_{xx} = \frac{n^2 \bar{C}_{ik}}{S_x^2} \qquad (6) $$

where n is the number of test items, $\sum S_i^2$ is the sum of the item variances, $S_x^2$ is the variance of the test, and $\bar{C}_{ik}$ is the mean of the item covariances. By comparing Equation 2 with 5, it is seen that the sum of the unique item variances is used as an estimate of $\sigma_e^2$, and that when the unique item variation is minimized internal consistency will be high. Furthermore, the mean of the item covariances (Equation 6) serves as an estimate of $\sigma_t^2$. The size of the covariance term is in turn determined by the intercorrelations and standard deviations of the items (Magnusson, 1966). Therefore, internal consistency is directly dependent upon the correlations among the items in the test.
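Equations 5 and 6 are algebraically equivalent, as the following Python sketch illustrates with an invented 0/1 score matrix (rows are examinees, columns are items); neither the data nor the code comes from the study itself.

```python
import numpy as np

def alpha_from_variances(scores):
    """Equation 5: alpha = n/(n-1) * (1 - sum(item variances) / test variance)."""
    n = scores.shape[1]
    item_var = scores.var(axis=0, ddof=1)
    test_var = scores.sum(axis=1).var(ddof=1)
    return n / (n - 1) * (1 - item_var.sum() / test_var)

def alpha_from_covariances(scores):
    """Equation 6: alpha = n^2 * (mean inter-item covariance) / test variance."""
    n = scores.shape[1]
    cov = np.cov(scores, rowvar=False)
    off_diag = cov[~np.eye(n, dtype=bool)]           # item covariances only
    test_var = scores.sum(axis=1).var(ddof=1)
    return n ** 2 * off_diag.mean() / test_var

rng = np.random.default_rng(0)
# Toy independent item responses: alpha will be near zero; the point is that
# the two formulas return the same value.
X = (rng.random((200, 10)) < 0.6).astype(int)
print(alpha_from_variances(X), alpha_from_covariances(X))
```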


The item discrimination index provides a measure of how well an item contributes to what the test as a whole measures. When items with the highest item-test correlations are selected, the homogeneity of the test is increased; that is, $\sigma_t^2$ is increased. So it is the item discrimination that directly affects test reliability. When items with low item-test correlations are eliminated, the remaining item intercorrelations are raised. When item-test correlations are high, the test is able to discriminate between high and low scorers and hence internal consistency is increased. If too few items are discarded in an item analysis, the internal consistency of the test tends to decrease, because items with little power of measuring what the entire test is intended to measure will dilute the measuring power of the efficient items (Beddell, 1950).

Research Related to Classical Item Analysis in Test Development

Several articles have been published concerning standards for item selection to maximize test validity and increase internal consistency. Flanagan (1959) stated two considerations in selecting test items: (a) the item must be valid, that is, it should discriminate between high and low scorers, and (b) the level of item difficulty should be suitable for the examinee group. Gulliksen (1945) agreed with Flanagan on these two points and added a third: items selected with p = .50 would produce the most valid tests. However, Gulliksen noted that current practice was opposed to selecting items with difficulty near .50; test developers were selecting items based upon spreading difficulty indices over a broad range. Several studies have been conducted to examine the effects of varying item difficulty on test development.


Brogden (1946), in a study of test homogeneity, has shown empirically that a test of 45 items with varying levels of item difficulty produced a reliability of .96 (measured by the Kuder-Richardson formula). However, a similar but longer test of 153 items, which had item difficulties at .50 for all items, produced a reliability of .99. Thus, Brogden concluded that effective item selection was based more on selecting a test with fewer items that possessed varying difficulty than on a longer test with equal item difficulty.

Davis (1951), in commenting on item difficulty, stated that if all test items had a difficulty of .50 and were uncorrelated, then maximum discrimination was achieved. But when test items were correlated, maximum discrimination would only be achieved when the difficulty indices for the test items were spread out, e.g., several difficult items, several easy items, and several items with difficulty near .50. Davis recommended the latter procedure for test development because test items are usually correlated to some degree. Davis also recognized the need for the approval of subject matter specialists, in addition to statistical criteria, in item selection.

In a study of test validity, Webster (1956) found results similar to Brogden (1946), but different from Gulliksen (1945). By selecting fewer items with high discrimination indices and varying item difficulty levels, a more valid test was produced. Webster's results indicated that a test of 178 items with difficulty indices near .50 had a validity coefficient of .66, whereas a test of 124 similar items with varying item difficulties had a validity coefficient of .76, a difference statistically significant at p < .03 (based on r to z transformations).


Myers (1962), concerned by the current practice of selecting items based on varying item difficulties instead of the theoretical ideal of p = .50, compared the effect of the current practice to the theoretical ideal on the reliability and validity of a scholastic aptitude test. The ideal item difficulty ranged from .40 to .74 in what he called the peaked test. Items selected by the current practice were outside the above range, and Myers called this the U-shaped test. Two sets of items were selected for the peaked test and the U-shaped test, four tests in all. Myers reported no statistically significant differences in test validity when the different tests were correlated with freshman grades. Test reliability was statistically significantly different at p < .02 (using the Wilcoxon matched pairs sign test) in favor of the peaked test. The reliability of the peaked test was .69; the reliability of the U-shaped test was .65. The author noted that the results above were based on a 24 item test, and that when test length was projected to 48 items (via the Spearman-Brown Prophecy Formula) there were no significant differences in test reliability.

The studies of Brogden (1946) and Webster (1956) indicate that selecting items of varying item difficulty tends to increase internal consistency and test validity. The results from Myers' (1962) study indicated just the opposite, that item difficulty near .50 produced the more internally consistent test. But this was only true for a relatively short test of 24 items; when the test length was projected to 48 items, there were no differences in the reliability of either test based upon the two methods of selecting items.


Simplified Methods of Obtaining Item Discriminations

A second major group of articles on classical item analysis has dealt with simplified methods of obtaining indices of item discrimination. Because of the lack of computers in the early years of test development, many psychometricians concerned themselves with devising tables to provide quick estimates of item discrimination. Kelley (1939) found that in the computation of item discrimination only 54 percent of the examinee group (based on total test score) needed to be used. Considering the top 27 percent and the bottom 27 percent of the test scorers resulted in a considerable savings in computational time. Flanagan (1939) developed a table of item discriminations to estimate the PPM correlation between item and test score based on Kelley's extreme score groups of top and bottom 27 percent. Fan (1952) developed a table for the estimation of the tetrachoric correlation coefficient using the upper and lower 27 percent of the scorers. The tetrachoric correlation is similar to the biserial correlation, in that the correlation is between two variables which are assumed to have a normal and continuous underlying distribution but have been artificially dichotomized. Guilford (1954) presented several short cut tabular and graphic solutions for estimating various types of correlation coefficients to measure test item validity. These methods result in saving a considerable amount of time when one is forced to use hand calculations. Today these short cut methods can be used by classroom teachers who often do not have the aid of calculators or computers.
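The flavor of these shortcut methods can be seen in a small Python sketch of the simplest upper-lower index: the difference between the proportions passing the item in the top and bottom 27 percent of scorers. This is an illustrative stand-in for the tabled estimates of Kelley, Flanagan, and Fan rather than a reproduction of them, and the data are simulated.

```python
import numpy as np

def upper_lower_discrimination(item, total, fraction=0.27):
    """Proportion passing in the top `fraction` of scorers minus the proportion
    passing in the bottom `fraction` of scorers."""
    item, total = np.asarray(item), np.asarray(total)
    k = max(1, int(round(fraction * len(total))))
    order = np.argsort(total)
    lower, upper = order[:k], order[-k:]
    return item[upper].mean() - item[lower].mean()

rng = np.random.default_rng(1)
ability = rng.normal(size=300)
total = (ability[:, None] > rng.normal(size=(300, 40))).sum(axis=1)  # toy test scores
item = (ability > rng.normal(size=300)).astype(int)                  # one toy item
print(round(upper_lower_discrimination(item, total), 3))
```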


However, many test constructors still use these classical methods of item analysis even though computers are available with which more sophisticated item analytic techniques, such as factor analysis or latent trait models, can be used.

Item Analysis Procedures for the Factor Analytic Model

Charles Spearman (1904) proposed a theory of measurement based on the idea that every test was composed of one general factor and a number of specific factors. In order to test his idea Spearman developed the statistical procedure known as factor analysis. "Factor analysis is a method of analyzing a set of observations from their intercorrelations to determine whether the variations represented can be accounted for adequately by a number of basic categories smaller than that with which the investigation started" (Fruchter, 1954, p. 1). Factor analysis is a mathematical procedure which produces a linear representation of a variable in terms of other variables (Harman, 1967). In the case of test items being factor analyzed, a matrix of item intercorrelations is obtained first. Subsequently, the matrix of item correlations is submitted to the factoring process. There are two basic alternatives within the framework of factor analysis for analyzing a set of data: common factor analysis, based on the work of Spearman and later Thurstone (1947); and principal components, developed by Hotelling (1933). The major distinction between the two methods relates to the amount of variance analyzed, i.e., the values placed in the diagonal of the intercorrelation matrix. Factoring of the correlation matrix with unities in the diagonal leads to principal components, while factoring the correlation matrix with communalities4 in the diagonal leads to common factor analysis (Harman, 1967).

4 The communality ($h^2$) of a variable is defined as the sum of its squared factor loadings, $h_j^2 = a_{j1}^2 + a_{j2}^2 + \ldots + a_{jn}^2$ (Harman, 1967, p. 17); see formula 7.


If it is of interest to know what the test items share in common, a common factor solution is warranted. But if it is of interest to make comparisons to other tests or other test development procedures, a principal components solution is warranted. Since the present study was initiated to compare three different test development techniques, a principal components solution was used in this study to analyze the data under the factor analytic model. The linear model for the principal components procedure is defined as (Harman, 1967, p. 15):

$$ Z_{ji} = a_{j1}F_1 + a_{j2}F_2 + \ldots + a_{jn}F_n \qquad (7) $$

where $Z_{ji}$ is the variable (or item) of interest, and $a_{j1}$ is the coefficient, more frequently referred to as the loading, of variable $Z_{ji}$ on component $F_1$. An important feature of principal components is that the extracted components account for the maximum amount of variance in the original variables. Each principal component extracted is a linear combination of the original variables and is uncorrelated with subsequent components extracted. Thus, the sum of the variances of all n principal components is equal to the sum of the variances of the original variables (Harman, 1967). According to Guertin and Bailey (1970), the principal components solution was designed basically for prediction, hence the need to use the maximum amount of variance in a set of variables.
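A principal components solution of the kind described by equation 7 can be sketched in a few lines of Python: the loadings are the eigenvectors of the item intercorrelation matrix (with unities in the diagonal) scaled by the square roots of their eigenvalues, and a communality is the sum of an item's squared loadings. The four-item correlation matrix below is invented for illustration; this is not the program used in the study.

```python
import numpy as np

def principal_components(R, n_components=None):
    """Return loadings (items x components) from a correlation matrix R."""
    eigvals, eigvecs = np.linalg.eigh(R)             # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]                # largest components first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    signs = np.sign(eigvecs.sum(axis=0))             # fix arbitrary eigenvector signs
    signs[signs == 0] = 1.0
    eigvecs = eigvecs * signs
    if n_components is not None:
        eigvals, eigvecs = eigvals[:n_components], eigvecs[:, :n_components]
    return eigvecs * np.sqrt(eigvals)

# Toy 4-item intercorrelation matrix (invented numbers):
R = np.array([[1.00, 0.45, 0.40, 0.35],
              [0.45, 1.00, 0.42, 0.38],
              [0.40, 0.42, 1.00, 0.33],
              [0.35, 0.38, 0.33, 1.00]])
A = principal_components(R)
print("first-component loadings:", np.round(A[:, 0], 3))
# With all n components retained, each communality equals 1.0 (unities in the diagonal).
print("communalities:", np.round((A ** 2).sum(axis=1), 3))
# Proportion of total test variance accounted for by the first component:
print("proportion of variance, first component:", round((A[:, 0] ** 2).sum() / R.shape[0], 3))
```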


Since factor analysis is based upon a matrix of intercorrelations, it is important that care be taken in selecting the appropriate coefficient. Several item coefficients are available: phi, phi/phi max, and the tetrachoric correlation coefficient. Carroll (1961) pointed out several problems concerning the choice of a correlation coefficient to be used in factor analysis. The phi coefficient (used where both variables are true dichotomies) was found to be affected by disparate marginal distributions and often underestimated the PPM. The phi/phi max coefficient was developed to correct for the underestimation of phi, but the correction is not enough to counter the effect of extreme dichotomizations. Carroll recommended the tetrachoric coefficient as being the least biased by extreme marginal splits, providing the variable under consideration was normally distributed in the population. Wherry and Winer (1955) had reached conclusions similar to Carroll's, but went on to say that when the normality assumption was met and the regression of test score on the item was linear, the PPM and tetrachoric are identical. The tetrachoric correlation was used in the present study to obtain item intercorrelations.
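The sensitivity of phi to disparate marginal splits is easy to demonstrate. The following Python sketch (illustrative only; the response vectors are invented) computes phi for two dichotomous items and the maximum phi attainable given their marginal proportions, which is the basis of the phi/phi max correction.

```python
import numpy as np

def phi(x, y):
    """Pearson correlation between two 0/1 variables (the phi coefficient)."""
    return np.corrcoef(x, y)[0, 1]

def phi_max(x, y):
    """Largest phi attainable given the two marginal proportions."""
    px, py = x.mean(), y.mean()
    max_cov = min(px, py) - px * py      # joint proportion cannot exceed min(px, py)
    return max_cov / np.sqrt(px * (1 - px) * py * (1 - py))

# Two invented items: a moderately difficult item and a very easy one.
x = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1])
y = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1])
print("phi         =", round(phi(x, y), 3))
print("phi max     =", round(phi_max(x, y), 3))   # capped well below 1.0 by the .50/.90 split
print("phi/phi max =", round(phi(x, y) / phi_max(x, y), 3))
```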


Research Related to Factor Analysis in Test Development

The early use of factor analysis to construct and refine tests was suggested by the work of McNemar (1942) in revising the Stanford-Binet scales, and Burt and John (1943) in analyzing the Terman-Binet scales. Several contemporary psychometricians have advocated the use of factor analysis in developing unidimensional tests (Cattell, 1957; Hambleton and Traub, 1975; Henrysson, 1962; Lord and Novick, 1968). A unidimensional test was defined briefly in the introduction to this chapter, but a more precise definition is warranted. Lumsden (1961) noted that a unidimensional test can be determined by the examinee response patterns. If the test items are arranged from easiest to hardest, a person who misses the first item will miss all the other items, a person who gets the first item correct but misses the second will miss all the subsequent items, and so on. The above statement assumes infallible items. However, most tests constructed today contain fallible items; thus the response pattern will be disturbed by random error. Lumsden suggested, in developing unidimensional tests factorially, that the items be carefully selected on empirical grounds, thus reducing the problem of too many heterogeneous items and the possibility of obtaining multiple factors. By preselecting items one increases the chances of the items converging on one factor.

The importance of developing unidimensional tests is demonstrated most clearly in considering the concepts of test reliability and validity. For a test to be valid it must actually measure the trait it was intended to measure. For a test to be reliable it must provide similar results upon repeated measurement. It should be easier to estimate these two important aspects of a test when the test is unidimensional than when the test is multidimensional, hence the use of a unidimensional test in the present study.

Cattell (1957) has suggested that in the development of a factor homogeneous scale, one should preselect items, carry out a preliminary factor analysis, then select for further analysis those items which load on the first factor. Cattell defined an index of unidimensionality as the ratio of the variance of the first factor to the total test variance. This index has no set criterion and its sampling distribution is unknown.

Comparison of Factor Analysis to Classical Item Analysis


One measure of item validity, the biserial correlation, was described for classical item analysis procedures. This same index is also obtained by factor analysis. When the test items are factor analyzed, the factor loading $a_{ij}$ is the item-factor association that is considered a measure of item validity, i.e., the higher the factor loading, the greater the relationship between the item and the factor it measures. The factor loadings can be viewed as similar to the biserial correlations discussed under classical test theory. This relationship between factor loadings and biserial correlations has been discussed by several authorities (Guertin and Bailey, 1970; Henrysson, 1962; Richardson, 1936).

Factor analysis as an item analytic technique was not realistically possible for most psychometricians until the advent of high speed computers. Guertin and Bailey (1970) have predicted that with the increasing use of computers factor analysis will replace classical item analysis as a test development technique. Because it is possible for a test to reach the highest degree of homogeneity and yet be factorially a very odd mixture of factors (Cattell and Tsujioka, 1964), classical item analysis alone is not sufficient to determine if a test is unidimensional. However, factor analysis not only provides a measure of item-test correlation (the factor loading), it also provides an indication of how many items form a unifactor test. Thus, factor analysis has been advocated as a superior technique to classical item analysis (Guertin and Bailey, 1970). Using factor analysis in test development, psychometricians have advanced beyond an independent analysis of item-test correlations to a simultaneous analysis of each item's intercorrelations with the other individual items, obtaining a measure of test unidimensionality as well as of item-factor association.


However, there is an inherent flaw in factor analysis, as there was in classical item analysis, in test development. The flaw is that both procedures are sample dependent. When an item analysis procedure, or any procedure in general, is sample dependent, it means that the results will vary from group to group; when the groups are very dissimilar, there is much variability. Gulliksen (1950) noted that a significant advance in item analysis theory would be made when a method of obtaining invariant item parameters could be discovered. To that end, latent trait theory is an attempt to identify invariant item parameters.

Item Analysis Procedures for the Latent Trait Model

Latent trait theory specifies a relationship between the observable examinee test performance and the unobservable traits or abilities assumed to underlie performance on a test (Hambleton et al., 1977). The relationship is described by a mathematical function; hence latent trait models are mathematical models. As noted earlier, there are four major latent trait models for use with dichotomously scored data: the normal ogive, and the one-, two-, and three-parameter logistic models (Hambleton and Cook, 1977; Lord and Novick, 1968). All four models are based on the assumption that the items in the test are measuring one common ability and that the assumption of local independence holds between the items and examinees. These two assumptions imply that a test which measures only one trait or ability will have less measurement error in the test score than a test that is multidimensional, and that the response of an examinee to one item is not related to his response on any other item. Where the latent trait models begin to differ is with respect to the shape of their item characteristic curves.


The normal ogive model, developed by Lord (1952a, 1955a), produces an item characteristic curve based on the following formula:

$$ P_g(\theta) = \int_{-\infty}^{a_g(\theta - b_g)} \phi(t)\,dt \qquad (8) $$

where $P_g(\theta)$ is the probability that an examinee with ability $\theta$ correctly answers item g, $\phi(t)$ is the normal density function, $b_g$ represents item difficulty, and $a_g$ represents item discrimination. The item characteristic curve of the two-parameter logistic model developed by Birnbaum (1968) has the same shape as the normal ogive, and Baker (1961) has shown them to be equivalent mathematical procedures. The item characteristic curve of the two-parameter logistic function is developed from the following formula:

$$ P_g(\theta) = \frac{e^{D a_g(\theta - b_g)}}{1 + e^{D a_g(\theta - b_g)}} \qquad (9) $$

where $a_g$ and $b_g$ have the same interpretation as in the normal ogive, D is a scaling factor equal to 1.7 (the adjustment between the logistic function and the normal density function), and e is the base of the natural logarithm. In Figure 1a the shapes of the normal ogive and the two-parameter logistic curve have been illustrated. In the Figure, item A is more discriminating than item B, as noted by the steepness of the slopes.

The three-parameter logistic model, also developed by Birnbaum (1968), includes as an additional parameter an index for guessing. The mathematical form of the three-parameter logistic curve is denoted:

$$ P_g(\theta) = c_g + (1 - c_g)\,\frac{e^{D a_g(\theta - b_g)}}{1 + e^{D a_g(\theta - b_g)}} \qquad (10) $$
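The item characteristic curves defined by equations 9 and 10 can be evaluated directly. The short Python sketch below, with invented item parameters purely for illustration, computes the probability of a correct response under the two- and three-parameter logistic models.

```python
import numpy as np

D = 1.7  # scaling factor aligning the logistic curve with the normal ogive

def p_2pl(theta, a, b):
    """Two-parameter logistic item characteristic curve (equation 9)."""
    z = D * a * (theta - b)
    return np.exp(z) / (1.0 + np.exp(z))

def p_3pl(theta, a, b, c):
    """Three-parameter logistic curve (equation 10); c is the lower asymptote."""
    return c + (1.0 - c) * p_2pl(theta, a, b)

theta = np.linspace(-3, 3, 7)                          # points on the ability continuum
print(np.round(p_2pl(theta, a=1.0, b=0.0), 3))         # item of medium difficulty
print(np.round(p_3pl(theta, a=1.0, b=0.0, c=0.2), 3))  # same item with guessing
```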


FIGURE 1. HYPOTHETICAL ITEM CHARACTERISTIC CURVES FOR THE FOUR LATENT TRAIT MODELS: (a) normal ogive and two-parameter logistic curve, (b) three-parameter logistic curve, (c) Rasch one-parameter logistic curve. Each panel plots the probability of a correct response against the ability continuum.


The parameter $c_g$, the lower asymptote of the item characteristic curve, represents the probability of low ability examinees correctly answering an item (Hambleton et al., 1977). In Figure 1b the shape of the three-parameter logistic curve has been illustrated. In the Figure, item A is more discriminating and has less guessing involved than item B.

The one-parameter logistic model, developed by Rasch (1960), is commonly referred to as the Rasch model. The Rasch model, though similar to the other latent trait models, was developed independently from the other models. The Rasch model is based upon two propositions: (a) the smarter an examinee, the more likely he is to answer the item correctly, and (b) an examinee is more likely to answer an easy item correctly than a difficult item. Mathematically the above propositions can be stated in terms of odds or probability of success on an item. The odds of an examinee with ability $\theta$ correctly answering an item with difficulty $\epsilon$ are given by the ratio of $\theta$ to $\epsilon$ (Rasch, 1960):

$$ \text{odds} = \frac{\theta}{\epsilon} \qquad (11) $$

The derivation of equation 11 was presented in Appendix A. Equation 11, more formally written in the following equation, is the Rasch model:

$$ P(X_{ik} = 1 \mid \beta_k, \delta_i) = \frac{e^{\beta_k - \delta_i}}{1 + e^{\beta_k - \delta_i}} \qquad (12) $$

In equation 12, the probability of examinee k making a correct response to item i, noted $X_{ik} = 1$, given an examinee of ability $\beta_k$ (where $\beta_k$ is the log transformation of $\theta$) taking an item of difficulty $\delta_i$ (where $\delta_i$ is the log transformation of $\epsilon$), is a function of the difference between the examinee's ability and the item's difficulty. The derivation of equation 11 to equation 12 is presented in Appendix A.
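Equations 11 and 12 translate directly into code. The sketch below (illustrative only; the ability and difficulty values are invented) shows that the odds of a correct response under the Rasch model equal the ratio of the multiplicative person and item parameters, which is the same as exponentiating the difference of the logit parameters.

```python
import math

def rasch_probability(beta, delta):
    """Equation 12: P(X = 1 | beta, delta) = exp(beta - delta) / (1 + exp(beta - delta))."""
    return math.exp(beta - delta) / (1.0 + math.exp(beta - delta))

def rasch_odds(theta, epsilon):
    """Equation 11: odds of success = theta / epsilon (multiplicative parameters)."""
    return theta / epsilon

beta, delta = 1.0, -0.5          # able examinee, easy item (logit scale)
p = rasch_probability(beta, delta)
print(round(p, 3))                                             # probability of a correct response
print(round(p / (1 - p), 3))                                   # odds of success ...
print(round(rasch_odds(math.exp(beta), math.exp(delta)), 3))   # ... equal theta / epsilon
```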


The assumptions for the Rasch model were discussed in Chapter I. Essentially the three assumptions are as follows:

1. There is only one trait underlying test performance.
2. Item responses of each examinee are statistically independent.
3. Item discriminations are equal.

The first two assumptions can be checked by conducting a factor analysis of the test items, as suggested by Lord and Novick (1968) and Hambleton and Traub (1973). The assumptions are met if one dominant factor emerges from the analysis. The third assumption can be checked by plotting item characteristic curves for each item. In Figure 1c the item characteristic curves for two hypothetical items based on the Rasch model have been illustrated. The difficulties for items A and B are .5 and 1.5, respectively (the point where p = .50), and the discriminations of the two items are equal. The assumption that all items have equal discriminations is quite restrictive; however, Rentz (1976) demonstrated, in a simulation study, that the item slopes can deviate from 1 (where all slopes are equal) by ±.25 and still fit the model. In a similar simulation study, Dinero and Haertel (1976) concluded that the lack of an item discrimination parameter in the Rasch model does not result in poor item calibrations when discriminations are varied as much as .25.

The estimates for the Rasch parameters $\beta_k$ and $\delta_i$, the examinee ability estimate and item difficulty estimate respectively, are sufficient, consistent, efficient, and unbiased (Anderson, 1973; Bock and Wood, 1971). That is, the examinee's test score will contain all the information necessary to measure the person ability parameter $\beta_k$, and the sum of the right answers to a given item will contain all the information used to calibrate the item parameter $\delta_i$ (Wright, 1977). Of the latent trait models, the Rasch model is unique in this respect.


The mathematical rationale of the Rasch model is based upon the separation of the ability and item difficulty parameters. As shown in Appendix A, the estimation of the item parameters is independent of the distribution of ability, and the estimation of ability is independent of the distribution of item difficulty (Rasch, 1966). Several studies have demonstrated this (Anderson et al., 1968; Tinsley and Dawis, 1975; Whitely and Dawis, 1974; Whitely and Dawis, 1976; Wright, 1968; Wright and Panchapakesan, 1969). The separation of the ability and item parameters leads to what Rasch has termed specific objectivity. Specific objectivity relates to the fact that the measurement of a person's ability is not dependent upon the sample of items used, nor upon the examinee group in which a person is tested. Once a set of items has been calibrated to the Rasch model, any subset of the calibrated items will produce the same estimate of the examinee's ability. This type of objectivity is possessed by the physical sciences and is the goal toward which mental measurement should be aimed in the future. Toward the goal of objective measurement, several researchers have conducted empirical studies comparing classical and factor analytic test development procedures to the latent trait models, and comparisons have also been made among the various latent trait models.

Research Related to Latent Trait Models in Test Development

Baker (1961) conducted one of the earlier comparative studies between two latent trait models. He compared the effect of fitting the normal ogive and the two-parameter logistic model to the same set of data, a scholastic aptitude test. The two-parameter model as well as the normal ogive provide item difficulty and item discrimination estimates.


The empirical results suggested there is little difference between the two procedures as measured by a chi-square test of fit. However, Baker noted the computer running time of the logistic model was one-third that of the ogive model; thus he concluded the logistic model was more efficient in terms of cost than the ogive.

Hambleton and Traub (1971) compared the efficiency of ability estimates provided by the Rasch model and the two-parameter model to the three-parameter logistic model using Birnbaum's (1968) concept of information. The three-parameter model provides item difficulty and discrimination estimates as well as accounting for guessing on each item. Eleven simulated tests of fifteen items each were generated, varying item discrimination and degree of guessing. The authors sought to determine how efficient the one- and two-parameter logistic models were under these conditions, taking the three-parameter model to be the true model. The results indicated that when guessing was a factor the three-parameter model was most efficient in providing ability estimates, but when guessing was not a factor all models were equally efficient. Since the Rasch model has fewer parameters to estimate, and hence takes less computer time to run than the other two models, it would be preferred in the absence of guessing. In considering item discrimination, when the guessing parameter was set to zero, the Rasch model was as efficient as the two-parameter model when item discrimination varied from .59 to .79. As item discrimination deviated from this range the two-parameter model was more efficient.


Hambleton and Traub (1975) compared the one- and two-parameter models with three sets of real data: the verbal and mathematics subtests of a scholastic aptitude test used in Ontario (45 and 20 items, respectively), and the verbal section of the Scholastic Aptitude Test (SAT, 80 items). Their results indicated that generally the two-parameter model fit the data better than the one-parameter model. The loss in predicting performance was greatest on the shorter mathematics test and smallest on the longer SAT. These findings confirm Birnbaum's conjecture (1968, p. 492) that if the number of items in a test is very large, the inferences that can be made about an examinee's ability will be much the same whether the Rasch model or the two-parameter logistic model is used. The authors questioned whether the gain obtained with the two-parameter model is worth the increased computer cost of estimating the item discrimination parameter. Based on the results of these studies, it is concluded that the Rasch model is the most efficient of the latent trait models, and hence it will be used in comparison to the more traditional methods of test development included in the present study.

Comparison of the Rasch Model to Factor Analysis

Two recent studies have been completed comparing the Rasch model to factor analysis. Anderson (1976) posed two questions concerning the Rasch model and factor analysis: (a) what types of items would be excluded, in terms of difficulty and discrimination, using Rasch and factor analysis as item analytic techniques, and (b) what effect would the two procedures have on validity? Anderson chose to use 235 middle school students' responses to a 15 item Likert-type scale that was dichotomized for use with the Rasch model and the factor analytic procedures. A principal component factor analysis based upon tetrachoric correlation coefficients was compared to the Rasch model using the CALFIT computer program (Wright and Mead, 1975). Only items fitting the model were used.


His results indicated that the Rasch procedure eliminated the more difficult items and the factor analytic procedure eliminated the easier items, a statistically significant difference as determined by a chi-square test at p < .01. For item discriminations, the Rasch procedure eliminated very low and very high item discriminations, while the factor analytic procedure tended to reject only very low discriminations. The difference here was not statistically significant. The second question, of test validity, showed very similar results for the two procedures when test score was correlated with course grade point average.

In a similar study Mandeville and Smarr (1976) developed a two stage design. First they compared the Rasch procedure to factor analysis, then they combined the two analytic procedures. The authors felt the combined approach would be a more effective item analytic approach than any single method in determining which items fit the Rasch model. Two cognitive data sets (one standardized and one classroom) and one simulated set were used in the study. A rotated principal axis factor analysis based upon phi correlation coefficients was compared to the Rasch model using the CALFIT program. The results indicated that for the standardized and simulated data sets the double procedure of factor analyzing the items, then submitting only the items loading on the first factor to the Rasch procedure, was not really useful. The Rasch procedure alone was just as effective as the double procedure in selecting items that fit the model. For the classroom data set the investigators found that 92 percent of the items fit the Rasch model, but upon factor analyzing these items only seven percent of the total test variance was associated with the first factor.


Their results tend to indicate that factor analysis and the Rasch procedure do not always identify the same unidimensional trait underlying test performance. However, the results of the Mandeville and Smarr study may be suspect for three reasons. First, the phi coefficient, which can be seriously affected when p and q take on extreme values, was used as the basis for forming the intercorrelation matrix that was factor analyzed. The greater the difference between p and q, the smaller will be the maximum correlation; hence very easy and very difficult items will have systematically lower coefficients and will tend to bias the results of the analysis in favor of moderately difficult items. Second, the factor analysis was based on a principal axis solution, using some value less than 1.00 in the diagonal; hence less variance is being used in the total solution for comparison with the Rasch procedure, which utilizes all the test variance available. Third, the principal axis solution was rotated, so that the total variance associated with the first factor was distributed out among the other factors and was no longer as strong as it once had been.

Summary

In the development of tests based upon classical item analysis, two main statistics are used in reviewing and revising test items: item difficulty and item discrimination. The item discrimination index provides information as to the validity of the item in relation to total test score, while item difficulty indicates how appropriate the item was for the group tested. A serious limitation of classical item analysis is that the statistics obtained for examinees and items are sample dependent (Hambleton and Cook, 1977; Wright, 1968). The same problem of sample dependency also exists for factor analysis.


However, factor analysis is viewed as a superior technique to classical item analysis for two reasons: (a) factor analysis compares item intercorrelations with other items simultaneously, and (b) factor analysis provides an indication of how many factors or abilities the test is measuring. Also, in factor analysis the factor loading is comparable to the item discrimination index of classical item analysis, thus providing a measure of item validity for each item on each factor in the test.

Not until the development of latent trait models was a solution suggested to the problem of sample dependency of the statistics for items and examinees. The Rasch model in particular has been shown to provide item statistics that are independent of the group on which they were obtained, as well as examinee statistics that are independent of the group of items on which they were tested. This feature of the Rasch model provides for more objective mental measurement. The Rasch model has been compared to other latent trait models and has been shown to be as efficient in many cases as the more complex models. The Rasch model has also been compared with factor analytic procedures in determining test unidimensionality, validity, and the types of items retained and excluded by the two procedures.

Missing from this review is a comparative study of the three item analytic techniques using the same data base, and a comparison of the efficiency of tests developed from the three techniques across ability levels. Also missing from the literature is the effect of varying sample size and number of items, as well as the kinds of items each of the three procedures would either retain or exclude in test development.


It is apparent that an empirical investigation into these areas is warranted to determine which procedure, under the various conditions, would produce the superior test in terms of internal consistency and efficiency. It was for this reason that the present study was undertaken, comparing the three methods of classical item analysis, factor analysis, and the Rasch model in test development. The design of the study is described in Chapter III.


CHAPTER III
METHOD

An empirical study was designed to compare the effects of three methods of item analysis on test development for different sample sizes. The three methods of item analysis studied were classical item analysis, factor analysis, and Rasch analysis. The sample sizes used to compare the three item analytic methods were 250, 500, and 995 subjects. The study was designed in three phases: (a) item selection, (b) a double cross-validation of the selected items, and (c) statistical analyses of the selected items. For each item analytic procedure two tests were developed, a 15 item test and a 30 item test. Four dependent variables were obtained for each test: (a) an estimate of internal consistency, (b) the standard error of measurement, (c) item difficulty, and (d) item discrimination. A description of the subjects, the instrument used, the research design, and the statistical analyses is presented in this chapter.

The Sample

In the fall of 1975, all high school seniors in the State of Florida (N = 78,751) were tested as part of the State assessment program. The population was from 435 high schools throughout the state. From this population a 1 in 15 systematic sample of 5,250 subjects was chosen (Mendenhall, Ott, and Scheaffer, 1971). A systematic sample was selected to ensure samples from every high school in the state.


The types of data obtained on each subject were sex, race, item responses, and total score. The data file was edited to remove those subjects who either answered all the items correctly or answered all the items incorrectly. The rationale for this procedure was that the Rasch model cannot calibrate items when a person has a perfect score or, the alternative, when a person has no items correct (Wright, 1977). Through the editing procedure 15 subjects were removed; thus the available sample size was 5,235. Because such a small number of subjects was removed, it seems unlikely that the elimination of these subjects would bias the results in favor of any of the three item analytic techniques.

The Instrument

The instrument selected for use in this study was the Verbal Aptitude subtest of the Florida Twelfth Grade Test, developed by the Educational Testing Service. It is a statewide assessment battery which has been administered every year since 1935 (Benson, 1975). The Verbal Aptitude subtest is comprised of 50 verbal analogies, in a multiple choice format, from which a single score based on the number of items correct is reported. Descriptive information on the Verbal Aptitude subtest for the population tested in 1975 is presented in Table 1. This particular instrument was selected for three reasons. First, it is a cognitive measure of verbal ability, and much of classical test theory has been built upon tests in the cognitive domain. Second, it is similar to, and hence representative of, other national aptitude tests used for college admissions. Third, it has a large data pool from which to sample.
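The editing step described above amounts to dropping the rows of the response matrix whose totals are zero or equal to the number of items. A minimal Python sketch with an invented 0/1 response matrix is given below; it is not the program actually used to edit the data file.

```python
import numpy as np

def drop_zero_and_perfect(responses):
    """Remove examinees with no items correct or with all items correct,
    since the Rasch model cannot calibrate from such records."""
    totals = responses.sum(axis=1)
    keep = (totals > 0) & (totals < responses.shape[1])
    return responses[keep]

rng = np.random.default_rng(2)
X = (rng.random((1000, 50)) < 0.55).astype(int)   # toy 50-item response matrix
X[0, :] = 1                                       # plant a perfect score
X[1, :] = 0                                       # plant a zero score
print(X.shape[0], "->", drop_zero_and_perfect(X).shape[0])
```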


TABLE 1

DESCRIPTIVE DATA ON THE VERBAL APTITUDE SUBTEST OF THE
FLORIDA TWELFTH GRADE TEST, 1975 ADMINISTRATION

    Number of schools                  435
    Number of students              78,751
    Number of items                     50
    Mean                             25.95
    Standard deviation                8.23
    Reliability*                       .88
    Standard error of measurement     2.85

Note: Data obtained from the Florida Twelfth Grade Testing Program, Report No. 1-75, Fall 1975.
*Reliability based on the split-half method, corrected by the Spearman-Brown formula.


Classical test theory has been built mainly around the development of cognitive tests. Therefore, it seemed desirable to compare the new procedures of latent trait theory, via the Rasch model, to the procedures of classical test theory, i.e., factor analysis and classical item analysis, by using a cognitive test. Thus, the results may be more generalizable to the major type of tests developed by practitioners in the field.

The Procedure

Design

The sample of 5,235 was divided into nine systematic samples in the following manner: Group 1, three independent samples of 250 students each; Group 2, three independent samples of 500 students each; Group 3, three independent samples of 995 students each. From the initial editing of the data file, previously described, 15 subjects were removed from the total sample of 5,250. Therefore, it was decided that this loss of subjects would only affect Group 3, since it was the largest. Thus, the number of subjects in each of its three independent samples was reduced by five, resulting in three independent samples of 995 subjects each. The purpose of obtaining three separate samples within each of the three groups was to ensure that each item analytic and double cross-validation procedure used an independent sample, so that tests of statistical significance could be performed. The scheme shown in Table 2 was used to obtain the nine samples. In the present study the independent variables were sample size and item analytic procedure.
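A systematic sampling scheme of this kind is easy to sketch: starting from a chosen point, every 15th record is drawn from the statewide file, and the resulting sample is then partitioned into groups. The code below is a hedged illustration of the idea with invented record indices and a simplified consecutive partition, not the sampling program used in the study.

```python
import numpy as np

def systematic_sample(n_population, step=15, start=0):
    """Indices of a 1-in-`step` systematic sample of a file of n_population records."""
    return np.arange(start, n_population, step)

def partition(indices, group_sizes):
    """Split the sampled indices, in order, into consecutive subsamples."""
    cuts = np.cumsum(group_sizes)[:-1]
    return np.split(indices, cuts)

sample = systematic_sample(78_751, step=15)
print(len(sample))                     # roughly 5,250 records before editing
# Three samples of 250, three of 500, and three of 995 subjects:
sizes = [250] * 3 + [500] * 3 + [995] * 3
groups = partition(sample[:sum(sizes)], sizes)
print([len(g) for g in groups])
```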


TABLE 2

SYSTEMATIC SAMPLING DESIGN OF THE STUDY (N = 5,235)


The item data were analyzed in three phases: (a) selection of the items, (b) computation of item and test statistics for the selected items on double cross-validation samples, and (c) statistical analyses of item characteristics to test the hypotheses.

Item Selection

The three independent samples, within each of the groups of subjects (N = 250, N = 500, N = 995), were submitted to one of the three item analytic procedures (in accordance with Table 2) in order to select a specified number of items, i.e., the "best" 15 and 30 items. Each of these two sets of items comprised two separate tests; however, all of the items on the 15 item tests were always included on each of the 30 item tests. A different process for selecting the items was used with each item analytic technique, as described in the following three sections.

Classical item analysis. The definition of the "best" items was based on the numerical magnitude of the items' biserial correlations. The biserial correlation was defined as the correlation between the artificially dichotomized item score (1 or 0) and total test score. In using the biserial correlation the assumption was made that the artificially dichotomized variable (the item) had a continuous and normal distribution (Magnusson, 1966). In order to obtain biserial correlations for the items under the classical item analysis procedure, the 50 verbal items were submitted to the item analysis program GITAP5 for each of the three sample sizes.

5 The Generalized Item Analysis Program (GITAP) is a part of the test analysis package developed by F. B. Baker and T. J. Martin, Occasional Paper No. 10, Michigan State University, 1970.


The 15 and 30 items with the highest biserial correlations were selected as the best items from the total subtest. Item difficulties were also obtained for the "best" 15 and 30 items selected. Item difficulty has been defined as the proportion of persons getting a particular item correct out of the total number of persons attempting that item (Mehrens and Lehmann, 1973).

Factor analysis. Item selection based on factor analysis was accomplished using the computer programs developed for the Education Evaluation Laboratory at the University of Florida. These programs have been described by Guertin and Bailey (1970). The present study was concerned only with the items that load on the first principal component, in order to adhere to the unidimensionality assumption of the test. The principal components analysis was based on a matrix of tetrachoric item intercorrelations with unities in the diagonal. The tetrachoric correlation was chosen to produce the intercorrelation matrix for the same reason the biserial correlation was chosen: knowledge of an item was assumed to be normally and continuously distributed. In the case of the tetrachoric correlation each item (scored 1 or 0) was correlated with every other item. The 15 and 30 items with the highest loadings on the first unrotated principal component were selected from the total subtest. These component loadings are analogous to the biserial correlations previously described, where the loading refers to the relationship of the item to the principal component or factor (Guertin and Bailey, 1970; Henrysson, 1962).

Rasch analysis. The selection of items based on the Rasch model was accomplished in two stages.


First, in order to check the assumption of a unidimensional test, a factor analysis using a principal components solution was used. Items were selected with loadings between .39 and .79 on the first unrotated factor, to hold the discrimination index of the items constant. Hambleton and Traub (1971) have shown that the efficiency of a test developed using the Rasch model will remain very high (over 95 percent) when the range of the discrimination index is held between .59 and .79. Second, the items selected from the principal components solution using the above criteria were submitted to a Rasch analysis using the BICAL program (Wright and Mead, 1976). Items were selected based upon the mean square fit of the items to the Rasch model. The best 15 and 30 items fitting the model were chosen from the total subtest, and their corresponding item difficulties reported.

Double Cross-Validation

A double cross-validation design (Mosier, 1951) was used to obtain item parameter estimates for the best 15 and 30 items selected by the three item analytic techniques for the three sample sizes. In this study a 3 x 3 latin square was used to reassign samples. This procedure ensured that the estimates of the item parameters would be based upon a different sample of subjects than the original sample used to identify the best items. Each item analytic technique was randomly reassigned, using a latin square procedure (Cochran and Cox, 1957, p. 121), to a different sample within each of the three groups (N = 250, N = 500, N = 995). The double cross-validation design is shown in Table 3.


TABLE 3

DOUBLE CROSS-VALIDATION DESIGN OF THE STUDY

    Group     Sample      N     Item Analytic      Double Cross-Validation
              Number(a)         Procedure          Procedure
    Group 1   1          250    Classical          Factor Analysis
              2          250    Factor Analysis    Rasch
              3          250    Rasch              Classical
    Group 2   4          500    Classical          Rasch
              5          500    Factor Analysis    Classical
              6          500    Rasch              Factor Analysis
    Group 3   7          995    Classical          Rasch
              8          995    Factor Analysis    Classical
              9          995    Rasch              Factor Analysis

(a) The sample number is the same as referred to in Table 2. Assignment to sample was based on a randomized 3 x 3 latin square procedure.


The best 15 and 30 items selected by each item analytic procedure in the first phase of the study were submitted to a standard item analysis program (GITAP), from which were obtained the dependent variables in the study:

• indices of internal consistency, as measured by the analysis of variance procedure (Hoyt, 1941)
• the standard error of measurement
• item difficulty
• biserial correlations

By submitting the best 15 and 30 items selected by each item analytic procedure in the study to a common item analysis program, comparable measures of the dependent variables were obtained.

Statistical Analyses

The third phase of the study focused on obtaining measures of statistical significance for three of the dependent variables: internal consistency, item difficulties, and biserial correlations. Only visual comparisons were made for the remaining dependent variable, the standard error of measurement. The internal consistency estimates from each test were compared to the projected population value for tests of similar length via confidence intervals, as suggested by Feldt (1965). (Projected population values were obtained using the Spearman-Brown Prophecy Formula.) Item difficulties for the 15 and 30 best items were submitted to a two-way analysis of variance, the two factors being sample size and item analytic technique. (The analysis of variance procedure is appropriate only if the distribution of the item difficulties and transformed biserial correlations approximates normality and the variances are homogeneous; Ware and Benson, 1975.) This procedure was used to test for differences in the types of items selected, in terms of item difficulty, by each technique.


If statistical significance was observed, with α = .05, Tukey's HSD (honestly significant difference) post hoc procedure (Kirk, 1968) was employed to determine which item analytic technique(s) resulted in a test with the highest item difficulties.

The biserial correlations were transformed to an interval scale of measurement using a linear function of z suggested by Davis (1946). The linear transformation was based upon converting the biserial correlations to z values, and then eliminating the decimals and negative values of z by multiplying each z value by the constant 60.241 (Davis, 1946, pp. 12-15). Thus, the transformed biserials ranged between 0 and 100. A two-way analysis of variance (sample size by item analytic technique) was performed on the transformed biserial correlations for the best 15 items. This type of analysis was used to test for differences in the types of items selected, in terms of biserial correlations, by each technique. If statistical significance was observed, α = .05, Tukey's HSD post hoc procedure was employed to determine which item analytic technique(s) resulted in higher transformed biserial correlations. The same two-way analysis of variance and post hoc analysis, where indicated, for the transformed biserial correlations was performed on the 30 best items.

In addition to tests of statistical significance, a measure of the efficiency of the 30 best items selected by each procedure was compared for the sample of 995 subjects.


Birnbaum (1968) defined the relative efficiency of two testing procedures as the ratio of their information curves. Lord (1974a) has described a procedure to compare the relative efficiency of one test with another at different ability levels. If two tests to be compared vary in difficulty, then the relative efficiency of each will usually be different at different ability levels (Lord, 1974b; 1977). In classical test theory it is common to compare two tests that measure the same ability in terms of their reliability coefficients, but this only gives a single overall comparison. The formula developed by Lord for relative efficiency provides a more precise way of comparing two tests that measure the same ability. The formula for approximating relative efficiency is (Lord, 1974b, p. 248):

$$ R.E.(y,x) = \frac{n_y\, x\,(n_x - x)\, f_x^2}{n_x\, y\,(n_y - y)\, f_y^2} \qquad (13) $$

where R.E. denotes the relative efficiency of y compared to x, $n_x$ and $n_y$ denote the numbers of items in the two tests, x and y are the number-right scores having the same percentile rank, and $f_x^2$ and $f_y^2$ are the squared observed frequencies of x and y. Lord has suggested that formula 13 only be used with a large sample of examinees and tests that are not extremely short; hence this comparison was restricted to the case where N = 995 and the 30 item tests. Three relative efficiency comparisons using the 30 item tests were made: (a) the test based on factor analysis was compared to the test based on classical item analysis, (b) the test based on the Rasch analysis was compared to the test based on classical item analysis, and (c) the test based on the Rasch analysis was compared to the test based on factor analysis.
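Formula 13, as reconstructed above, can be applied directly to the observed score distributions of two tests. The Python sketch below is a minimal, hedged illustration under those stated assumptions; the score pair and frequencies are invented, and this is not the computation reported in the study.

```python
def relative_efficiency(x, y, n_x, n_y, f_x, f_y):
    """Formula 13 (as stated above): relative efficiency of test y compared
    with test x at number-right scores x and y having the same percentile
    rank, where f_x and f_y are the observed frequencies of those scores."""
    return (n_y * x * (n_x - x) * f_x ** 2) / (n_x * y * (n_y - y) * f_y ** 2)

# Invented example: a 30-item test (y) compared with a 50-item test (x)
# at the score pair that falls at the same percentile rank.
print(round(relative_efficiency(x=28, y=17, n_x=50, n_y=30, f_x=40, f_y=55), 3))
```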


Summary

An empirical study was designed to compare the effects of classical item analysis, factor analysis, and the Rasch model on test development. Item response data were obtained from a sample of 5,235 high school seniors on a cognitive test of verbal aptitude. The subjects were divided into nine samples: three independent groups of 250 subjects each, three independent groups of 500 subjects each, and three independent groups of 995 subjects each. The independent groups were obtained so that tests of statistical significance could be performed.

The item response data were then analyzed in three phases. First, the "best" 15 and 30 items were selected using each item analytic technique. Under classical item analysis, the best 15 and 30 items were selected based on the highest biserial correlations. For factor analysis, the best 15 and 30 items were selected based on the highest item loadings on the first (unrotated) principal component. The selections of the best 15 and 30 items using the Rasch model were based upon the mean square fit of the items to the model. These procedures were used for each group of subjects. Second, a double cross-validation design was employed to obtain estimates of the item parameters for the best 15 and 30 items. The three item analytic techniques were reassigned randomly to different samples of subjects within each level of sample size. Then, the best 15 and 30 items chosen by each method were submitted to a common item analytic procedure in order to obtain estimates for comparing the three item analytic methods.


Third, a two-way analysis of variance and a Tukey post hoc comparison test, when indicated, were used to test for differences in the properties of the items selected by each item analytic procedure. Also, confidence intervals were calculated to compare the internal consistency estimates to a population value. In addition, the relative efficiencies of the 30 item tests developed by each item analytic technique were compared for the sample of 995 subjects.


CHAPTER IV
RESULTS

The study was designed to compare empirically the precision and efficiency of tests developed using three item analytic techniques: classical item analysis, factor analysis, and the Rasch model. The following five hypotheses were generated to compare the three techniques:

1. There are no significant differences in the internal consistency estimates of the tests produced by the three methods, as the number of items decreases, when compared to the projected internal consistency estimates for the population for tests of similar length.

2. There are no differences in the internal consistency estimates of the tests produced by the three methods when the number of examinees is decreased.

3. There are no meaningful differences in the magnitude of the standard error of measurement of the tests produced by the three methods. (A meaningful difference was previously defined to be ≥ 1.00.)

4. There are no significant differences in the difficulties or discriminations of the items selected by the three methods.


5. There are no differences across ability levels in the efficiency of the tests produced by the three methods.

The Verbal Aptitude subtest of the Florida Twelfth Grade Test was used to test the hypotheses. A sample of 5,235 examinees was systematically selected from a population of 78,751. A demographic breakdown of the sample by ethnic origin and sex is presented in Table 4. The data were analyzed and reported in the following manner: item selection, double cross-validation, comparison of the 15 item tests on precision, and comparison of the 30 item tests on precision and efficiency. These results are then summarized with respect to the five hypotheses.

Item Selection

The 50 items on the Verbal Aptitude subtest were submitted to each of the three item analytic techniques. The means, medians, and standard deviations of the biserial correlations and item difficulties, based on classical item analysis, are presented in Table 5. These descriptive statistics appear equivalent across the varying sample sizes. From the factor analysis, the percentage of total test variance accounted for by the 50 verbal items on the first unrotated principal component has been reported in Table 6. The percentage of variance accounted for by the first principal component was obtained by summing the squared item loadings and dividing by the total number of items. The percentages of variance accounted for by the first principal component in each sample were very similar. A check on the unidimensionality of the test was made by rotating the principal components solution for the sample of 995 subjects. Upon rotation, the results indicated that one dominant factor remained.


TABLE 4

DEMOGRAPHIC BREAKDOWN OF THE SAMPLE BY ETHNIC ORIGIN AND SEX


TABLE 5

MEANS, MEDIANS, AND STANDARD DEVIATIONS OF THE BISERIAL CORRELATIONS AND ITEM DIFFICULTIES FOR THE 50 VERBAL ITEMS, BY SAMPLE SIZE



PAGE 69

58 H H

PAGE 70

59 c •H CTl O II e 2: (U


Three statistics are reported in Table 7 for the Rasch item analysis procedure. For each sample size, the percentage of total variance accounted for by the first unrotated principal component and the means and standard deviations of the mean square fit statistics and Rasch difficulties are presented for the items selected. In order to select the best 15 and 30 items from the Rasch analysis, all 50 items were submitted to a principal components solution. This procedure was used to ensure that the items selected measured one trait, as required by the assumption of test unidimensionality. As noted in Table 6, the percentage of total test variance accounted for by the first principal component, based on 50 items, was nearly equal for each sample size. From the principal components solution, only items with loadings between .39 and .79 were selected for the Rasch analysis, as suggested by Hambleton and Traub (1971), to adhere to the assumption of equal item discriminations. Using this procedure, the number of items (out of 50) retained for the Rasch analysis varied slightly with sample size: when N = 250, 33 items were retained; when N = 500, 35 items were retained; and when N = 995, 33 items were retained. These items, loading between .39 and .79, were then submitted to the Rasch analysis to obtain mean square fit statistics and Rasch item difficulties. These statistics have been reported in Table 7.

TABLE 7
PERCENTAGE OF VARIANCE ACCOUNTED FOR BY THE FIRST PRINCIPAL COMPONENT AND DESCRIPTIVE DATA ON THE MEAN SQUARE FIT STATISTICS AND RASCH ITEM DIFFICULTIES FOR THE ITEMS RETAINED, BY SAMPLE SIZE

Wright and Panchapakesan (1969) developed a measure to assess the fit of an item to the Rasch model. The measure, defined as the mean square fit statistic, is:

\chi^2 = \sum_{i=1}^{k-1} \sum_{j=1}^{n} y_{ij}^2 .     (14)

The quantity \chi^2 defined above has approximately the chi-square distribution with degrees of freedom equal to (k - 1)(n - 1). The value y_{ij} is the deviation of the item from the model, or item misfit, and is determined by taking the difference between the observed and expected frequencies of the examinees at a given ability level who answered a given item correctly. This difference was then divided by the standard deviation of the observed frequency, squared, and summed over items and score groups. The BICAL program standardizes these deviations (y_{ij}) in computing the mean square fit statistic; therefore, y_{ij} has a normal distribution with a mean of zero and standard deviation of one (Hambleton et al., 1977). Items with large mean square fit values are items which do not fit the model. As shown in Table 7, the mean and standard deviation of the mean square fit statistic increased with sample size. The item difficulty estimates based on the Rasch model also have an expected mean of zero and standard deviation of one (Wright and Mead, 1975). These estimates remained very similar across sample sizes and exceptionally close to the expected values (Table 7). The Rasch model does not provide a parameter for item discriminating power, as all item discriminations are considered equal and centered at one (Wright and Mead, 1975). The BICAL program provided, as part of the normal output, estimates of each item's discriminating power to check the fit of the data to the model. The item discriminations were obtained by regressing the difficulty of the item for each ability group on the ability estimate of the group (Wright and Mead, 1975, p. 11).
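As a concrete illustration of the fit computation described above (not a reproduction of BICAL itself), the following sketch forms the standardized residuals y_ij from observed and expected numbers of correct responses in each raw-score group. The grouping of examinees, the ability estimates theta, and the item difficulties b are assumed to be available from a prior Rasch calibration; the function and argument names are illustrative.

```python
import numpy as np

def mean_square_fit(responses_by_group, theta, b):
    """Standardized residuals y_ij and their chi-square sum (cf. equation 14).

    responses_by_group: list of (persons-in-group x items) 0/1 arrays, one per raw-score group
    theta: ability estimate for each score group (same order as the list)
    b: Rasch difficulty estimate for each item
    """
    y_rows = []
    for data, ability in zip(responses_by_group, theta):
        n_i = data.shape[0]                          # examinees in this score group
        p = 1.0 / (1.0 + np.exp(-(ability - b)))     # Rasch probability of success for the group
        observed = data.sum(axis=0)                  # observed number correct per item
        expected = n_i * p                           # expected number correct per item
        sd = np.sqrt(n_i * p * (1.0 - p))            # standard deviation of the observed count
        y_rows.append((observed - expected) / sd)
    y = np.vstack(y_rows)
    return y, (y ** 2).sum()                         # chi-square with approximately (k-1)(n-1) df
```

Averaging the squared residuals over score groups for a single item yields an item-level mean square of the kind summarized in Table 7.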


The means and standard deviations of the item discrimination estimates are shown in Table 8 for each sample size.

TABLE 8
DESCRIPTIVE DATA ON ITEM DISCRIMINATION ESTIMATES BASED ON THE RASCH MODEL ACCORDING TO SAMPLE SIZE

                        N = 250    N = 500    N = 995
                        K = 33     K = 35     K = 33
Mean                     1.03       1.02       1.03
Standard Deviation        .28        .19        .22

Note: N = sample size; K = number of items.

From the data in Table 8, the mean item discrimination estimates appear nearly equal for each sample size, and quite close to the expected mean value of one. The best 15 and 30 items were then selected by each item analytic procedure based on the information in Tables 5-7, and have been listed in Tables 9 and 10 respectively. The items selected under classical item analysis were determined by the magnitude of the biserial correlation, i.e., the 15 and 30 items having the highest biserial correlations with total test score were selected. Indices of item difficulty have been reported for inspection, but in no way influenced the selection of items for classical item analysis.
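For readers who want to reproduce the classical selection criterion, a biserial correlation between a dichotomous item and the total score can be computed as in the sketch below. The function name and the use of the raw total score as the criterion are illustrative assumptions; the textbook formula r_bis = [(M1 - M0)/s] * (pq/h) is used, where h is the ordinate of the unit normal curve at the point dividing the proportions p and q.

```python
import numpy as np
from scipy.stats import norm

def biserial(item, total):
    """Biserial correlation of a 0/1 item with the total test score."""
    item = np.asarray(item, dtype=float)
    total = np.asarray(total, dtype=float)
    p = item.mean()                      # proportion passing the item (item difficulty index)
    q = 1.0 - p
    m1 = total[item == 1].mean()         # mean total score of those passing the item
    m0 = total[item == 0].mean()         # mean total score of those failing the item
    s = total.std()                      # standard deviation of total scores
    h = norm.pdf(norm.ppf(p))            # normal ordinate at the p/q split
    return (m1 - m0) / s * (p * q / h)
```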

TABLE 9
ITEMS SELECTED FOR THE 15 ITEM TESTS BY EACH ITEM ANALYTIC PROCEDURE ACCORDING TO SAMPLE SIZE

TABLE 10
ITEMS SELECTED FOR THE 30 ITEM TESTS BY EACH ITEM ANALYTIC PROCEDURE ACCORDING TO SAMPLE SIZE

The selection of items under factor analysis was determined by the item loadings on the first unrotated principal component. The 15 and 30 items having the highest item-component biserial correlations were selected. The selection of the 15 and 30 items from the Rasch analysis was determined by the mean square fit of the item to the Rasch model. The closer the mean square fit was to zero, the better the item fit the model; thus items with the lowest mean square fit statistics were selected.

Double Cross-Validation

After the tests of the best 15 and 30 items were developed by each procedure, they were scored on independent samples in a double cross-validation procedure, as noted in Table 3, Chapter III. Item and test statistics needed to test the five hypotheses were obtained for the 15 and 30 item tests based on the cross-validation samples using the GITAP program (Baker and Martin, 1970). The GITAP program provided the following output:

1. each subject's total test score
2. test mean and standard deviation
3. internal consistency estimates as measured by Hoyt's analysis of variance procedure
4. estimates of the standard error of measurement
5. indices of item difficulty and biserial correlations

Comparison of the 15 Item Tests on Precision

The descriptive statistics based on the double cross-validation samples for the 15 item tests have been presented in Table 11.
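Two of the precision indices in that output, Hoyt's analysis-of-variance reliability coefficient and the standard error of measurement, can be computed directly from the scored persons-by-items matrix. The sketch below illustrates the standard formulas rather than the GITAP program itself: for dichotomous items Hoyt's coefficient equals coefficient alpha (KR-20), and the standard error of measurement is taken as the test standard deviation times the square root of one minus the reliability.

```python
import numpy as np

def hoyt_reliability(X):
    """Hoyt (1941) ANOVA reliability: 1 - MS(residual) / MS(persons)."""
    X = np.asarray(X, dtype=float)               # persons x items matrix of 0/1 scores
    n, k = X.shape
    grand = X.mean()
    ss_persons = k * ((X.mean(axis=1) - grand) ** 2).sum()
    ss_items = n * ((X.mean(axis=0) - grand) ** 2).sum()
    ss_total = ((X - grand) ** 2).sum()
    ss_resid = ss_total - ss_persons - ss_items
    ms_persons = ss_persons / (n - 1)
    ms_resid = ss_resid / ((n - 1) * (k - 1))
    return 1.0 - ms_resid / ms_persons

def standard_error_of_measurement(X):
    total = np.asarray(X).sum(axis=1)            # number-right scores
    return total.std() * np.sqrt(1.0 - hoyt_reliability(X))
```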

TABLE 11
DESCRIPTIVE STATISTICS FOR THE 15 ITEM TESTS BASED ON THE DOUBLE CROSS-VALIDATION SAMPLES, BY PROCEDURE AND SAMPLE SIZE

The values of the internal consistency estimates for the tests developed using the Rasch model were consistently lower than the internal consistency estimates of the tests developed by classical item analysis and factor analysis across all sample sizes. The observed internal consistency estimates were tested for significance using confidence intervals described by Feldt (1965), to see if they were statistically different from the internal consistency estimate projected for the population by the Spearman-Brown Prophecy Formula. The internal consistency estimate for the population based on the original 50 item subtest was .88 (Table 1). By applying the Spearman-Brown Prophecy Formula (Mehrens and Lehmann, 1973), the projected population internal consistency estimate for a 15 item test was found to be .687. The value .687 was the expected internal consistency if 35 of the 50 items were randomly deleted. Thus, confidence intervals were generated around the observed internal consistency estimates presented in Table 11, for each procedure across all sample sizes, to see if any of the three item analytic techniques would produce a more reliable test than would be expected from mere random item deletion. The confidence intervals for the observed internal consistency estimates for each procedure have been reported in Table 12. When the sample sizes were 250 and 995, each item analytic technique produced an internal consistency estimate that was significantly different from the projected population estimate (.687) at a confidence level of 95 percent. Each of the three techniques systematically retained the 15 most homogeneous items.
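The .687 projection (and the .814 projection used later for the 30 item tests) follows directly from the Spearman-Brown Prophecy Formula with a length ratio of 15/50 (or 30/50) applied to the 50 item reliability of .88. A minimal check, written here only as an illustration, is:

```python
def spearman_brown(rho_full, k_new, k_full):
    """Projected reliability when a test of k_full items is shortened to k_new items."""
    ratio = k_new / k_full
    return ratio * rho_full / (1.0 + (ratio - 1.0) * rho_full)

print(spearman_brown(0.88, 15, 50))   # 0.6875, reported as .687 in the text
print(spearman_brown(0.88, 30, 50))   # 0.8148..., reported as .814 in the text
```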


These tests were more precise in terms of internal consistency than would have been found if the items were randomly deleted, as noted by comparisons to the projected population reliability coefficient.

TABLE 12
CONFIDENCE INTERVALS(a) FOR THE OBSERVED INTERNAL CONSISTENCY ESTIMATES BASED ON THE 15 ITEM TESTS ACCORDING TO SAMPLE SIZE

                          95% Confidence Interval
Procedure           N = 250        N = 500        N = 995
Classical           .748-.828*     .792-.838*     .810-.841*
Factor Analysis     .760-.837*     .786-.854*     .797-.831*
Rasch               .704-.799*     .688-.757      .711-.759*

(a) The F values used in calculating the confidence intervals were obtained from Marascuilo (1971).
* Statistical significance is indicated when the population internal consistency estimate is not included in the confidence interval generated for each observed internal consistency estimate. The projected population value was .687.

Only two procedures produced tests with internal consistency estimates significantly different from the projected population estimate when the sample size was 500: classical item analysis and factor analysis. As sample size decreased, in most cases, the internal consistency for each method tended to decrease (Table 11). An exception was noted for the Rasch tests: when the sample size decreased from 500 to 250, internal consistency improved slightly.


The data reported in Table 11 indicated that the standard errors of measurement for the 15 item tests based on the Rasch model were consistently larger than the standard errors of measurement of the tests developed from classical item analysis and factor analysis for each sample size. However, these differences were not meaningful, in that the difference did not equal or exceed 1.00 for any of the three procedures. The differences in mean item difficulties and discriminations were tested for statistical significance to determine whether there were differences in the types of items retained by each item analytic method. In this study item discriminations were measured by biserial correlations. A two-way analysis of variance (fixed effects model) was performed separately for the two dependent variables of item difficulty and item discrimination. A check was made on the assumptions for the analysis of variance to ensure that they were met. In these analyses, item analytic technique and sample size were the two independent factors, each with three levels. For item difficulty, no significant differences were found for item analytic technique, sample size, or their interaction, F (2,126) = 2.57, p > .05; F (2,126) = .45, p > .05; F (4,126) = .33, p > .05, respectively. The means, standard deviations, and ranges of the item difficulties based upon the 15 item tests have been reported in Table 13. For the analysis of variance performed on the transformed biserial correlations, a significant F ratio was observed for the factor of item analytic technique, F (2,126) = 14.862, p < .05.
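The two-way fixed-effects analyses reported here (and the parallel ones for the 30 item tests) can be reproduced with standard routines. The sketch below uses the statsmodels formula interface and assumes a long-format data frame with one row per selected item, holding its difficulty (or transformed discrimination), the item analytic technique that selected it, and the calibration sample size; the column names are illustrative, not taken from the original analysis.

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

# df is a pandas DataFrame with columns (illustrative):
#   'difficulty', 'discrimination', 'method', 'sample_size'
def two_way_anova(df, dependent):
    """Fixed-effects two-way ANOVA: item analytic technique x sample size."""
    model = smf.ols(f"{dependent} ~ C(method) * C(sample_size)", data=df).fit()
    return sm.stats.anova_lm(model, typ=2)      # F tests for main effects and the interaction

# e.g., two_way_anova(df, "difficulty") and two_way_anova(df, "discrimination")
```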


No significant differences were observed for sample size or for the interaction of sample size and item analytic technique for the transformed biserial correlations, F (2,126) = .30, p > .05; F (4,126) = 1.16, p > .05, respectively. The means, standard deviations, and ranges of the transformed biserial correlations based upon the 15 item tests have been presented in Table 14.

TABLE 13
15 ITEM TESTS: DESCRIPTIVE STATISTICS FOR ITEM DIFFICULTY BY PROCEDURE AND SAMPLE SIZE


The results of Tukey's HSD post hoc comparisons of the differences between the mean item discriminations have been reported in Table 15.

TABLE 14
15 ITEM TESTS: DESCRIPTIVE STATISTICS FOR ITEM DISCRIMINATIONS(a) BY PROCEDURE AND SAMPLE SIZE

Procedure            Mean
Classical            53.60
Factor Analysis      54.49
Rasch                43.51

(a) Based on transformed biserial correlations.
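The 0-100 discrimination scale used in Tables 14 and 15 is described in the note to Table 15 as a rescaling of the Fisher z transform by the constant 60.241 (Davis, 1946). Reading it that way, the tabled means convert back to ordinary biserial correlations of roughly .71, .72, and .62, close to the .62 and .71 values reported in Chapter V, so the assumed form seems adequate for illustration. The sketch below applies that assumed transformation and wraps a Tukey HSD comparison of the three procedures; the function and argument names are illustrative.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

def davis_scale(r_biserial):
    """Rescale a biserial correlation to the 0-100 scale via Fisher's z (assumed form)."""
    return 60.241 * np.arctanh(r_biserial)

def compare_procedures(disc_transformed, method_labels, alpha=0.05):
    """Tukey HSD comparison of transformed discriminations across the selecting procedures."""
    return pairwise_tukeyhsd(endog=disc_transformed, groups=method_labels, alpha=alpha)

# Recover the approximate mean biserials from the tabled means
print(np.tanh(np.array([53.60, 54.49, 43.51]) / 60.241))   # classical, factor, Rasch: ~ .71, .72, .62
```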


TABLE 15
POST HOC COMPARISONS OF THE DIFFERENCES BETWEEN THE MEAN ITEM DISCRIMINATIONS(a) FOR THE 15 ITEM TESTS

                                     Means
                                 54.49    53.60    43.51
Factor Analysis (54.49)            -       .89     10.98**
Classical Item Analysis (53.60)             -      10.09**
Rasch Analysis (43.51)                                -

(a) Based on transformed biserial correlations. The transformation was a linear transformation of the Fisher z statistic and multiplication by the constant 60.241, providing a range of 0-100 for the biserial correlation (Davis, 1946).
** p < .01, HSD = 6.88.

Comparison of the 30 Item Tests on Precision

By increasing the test length to 30 items, the internal consistency estimate was increased across each method and sample size, but a pattern similar to that for the 15 item tests emerged. The internal consistency estimates from the tests based on the Rasch model were slightly lower than the internal consistency estimates for the tests based on classical item analysis and factor analysis. The observed internal consistency estimates were tested for significance, using the confidence intervals described in the previous section, to see if they were statistically different from the internal consistency estimate for the population. The projected population internal consistency estimate for a 30 item test was found to be .814 (via the Spearman-Brown Prophecy Formula).

TABLE 16
DESCRIPTIVE STATISTICS FOR THE 30 ITEM TESTS BASED ON THE DOUBLE CROSS-VALIDATION SAMPLES, BY PROCEDURE AND SAMPLE SIZE

The value of .814 indicated the expected internal consistency if 20 of the 50 items were randomly deleted. Based on the observed internal consistency estimates reported in Table 16, confidence intervals were generated for each item analytic procedure and have been presented in Table 17.

TABLE 17
CONFIDENCE INTERVALS(a) FOR THE OBSERVED INTERNAL CONSISTENCY ESTIMATES BASED ON THE 30 ITEM TESTS ACCORDING TO SAMPLE SIZE

                          95% Confidence Interval
Procedure           N = 250        N = 500        N = 995
Classical           .800-.865      .845-.880*     .850-.874*
Factor Analysis     .825-.881*     .834-.871*     .848-.873*
Rasch               .794-.860      .817-.858*     .838-.863*

(a) The F values used in calculating the confidence intervals were obtained from Marascuilo (1971).
* Statistical significance is observed when the population internal consistency estimate is not included in the confidence interval generated for each observed internal consistency estimate. The projected population value was .814.

For the sample of 250 examinees, only one item analytic technique (factor analysis) produced an internal consistency estimate that was statistically different from the projected population estimate at a confidence level of 95 percent.


However, all three techniques produced tests with internal consistency estimates significantly different from the projected population estimate when the sample size was increased to 500 and to 995. Thus, when the number of examinees was large, each of the three techniques produced tests with higher internal consistency estimates than if the test were produced by randomly deleting items. For the 30 item tests, the effect of decreasing the sample size tended to decrease internal consistency for each method (Table 16), but the decrease was very slight. The standard error of measurement was essentially the same for the three methods of item analysis across the varying sample sizes. Two-way analyses of variance were run on item difficulties and item discriminations for the 30 item tests, similar to those run for the 15 item tests. Again, the independent variables were item analytic technique and sample size, each containing three levels. No significant differences were observed for item difficulty for the independent variables of item analytic technique, sample size, or their interaction, F (2,261) = .46, p > .05; F (2,261) = .27, p > .05; F (4,261) = .24, p > .05, respectively. No significant differences were observed for the transformed biserial correlations for the independent variables of item analytic technique, sample size, or their interaction, F (2,261) = 1.97, p > .05; F (2,261) = .74, p > .05; F (4,261) = .48, p > .05, respectively. When the actual biserial correlations were tested in the two-way analysis of variance design, similar F values were observed.


The means, standard deviations, and ranges of the item difficulties and transformed biserial correlations based upon the 30 item tests have been presented in Tables 18 and 19 respectively.

TABLE 18
30 ITEM TESTS: DESCRIPTIVE STATISTICS FOR ITEM DIFFICULTY BY PROCEDURE AND SAMPLE SIZE


Comparison of the 30 Item Tests on Efficiency

Lord (1974a, 1974b) proposed the formula used for approximating the relative efficiency of two tests, stated previously in equation 15 as:

R.E.(y,x) = \frac{n_y \, x (n_x - x) f_x^2}{n_x \, y (n_y - y) f_y^2} ,     (15)

where R.E.(y,x) denotes the relative efficiency of test y compared to test x, n_x and n_y are the numbers of items in the two tests, x and y are the number-right scores having the same percentile rank, and f_x^2 and f_y^2 are the squared observed frequencies of x and y obtained from frequency distributions for similar groups of examinees. A careful examination of the formula for relative efficiency indicated that when n_x = n_y and x = y, it was the number of examinees at the specified ability level (f_x and f_y) that determined the efficiency of the test. That is, the fewer examinees observed at a particular percentile rank, the better the test discriminates at that percentile rank. Therefore, test efficiency was equated with the level of discrimination the test was able to make between examinees at various scores or percentile ranks. Three relative efficiency comparisons were made using the 30 item tests based on the sample of 995 examinees. The three comparisons were: (a) the test developed from factor analysis was compared to the test developed by classical item analysis, (b) the test developed by the Rasch analysis was compared to the test developed by classical item analysis, and (c) the test developed from the Rasch analysis was compared to the factor analytically developed test.
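A computational reading of equation 15 is sketched below. The exact typeset form of the original equation is not fully legible in the source, so the arrangement of terms used here should be treated as an assumption consistent with the surrounding definitions: for each chosen percentile rank, the number-right scores x and y with that rank are located in the two score distributions, their observed frequencies are taken, and the ratio is formed.

```python
import numpy as np

def relative_efficiency(scores_y, scores_x, n_y, n_x, ranks=range(5, 100, 5)):
    """Quick estimate of R.E.(y, x) at selected percentile ranks (after Lord, 1974b)."""
    scores_y = np.sort(np.asarray(scores_y))
    scores_x = np.sort(np.asarray(scores_x))
    out = {}
    for pr in ranks:
        x = scores_x[int(pr / 100 * (len(scores_x) - 1))]   # number-right score at this rank on test x
        y = scores_y[int(pr / 100 * (len(scores_y) - 1))]   # number-right score at this rank on test y
        f_x = np.mean(scores_x == x)                        # observed frequency of score x
        f_y = np.mean(scores_y == y)                        # observed frequency of score y
        if 0 < x < n_x and 0 < y < n_y and f_y > 0:
            out[pr] = (n_y * x * (n_x - x) * f_x ** 2) / (n_x * y * (n_y - y) * f_y ** 2)
    return out                                              # values near 1.00 mean equal efficiency
```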


The efficiency curves for the three comparisons are shown in Figure 2. The relative efficiency value was plotted on the ordinate, while the percentile rank (student ability level) was plotted along the abscissa. Computed values for the relative efficiency comparisons have been reported in Appendix B. A relative efficiency of 1.00 would indicate that the tests are equally efficient. The test developed by factor analysis was more efficient for the lower tenth of the pupils when compared to the test developed from classical item analysis. Both tests were about equally efficient for the middle ability groups and high ability groups. The Rasch developed test was more efficient than the test based on classical item analysis for average to high ability students (40th-90th percentile rank). However, it was less efficient than the classical item analysis test for students with very low or very high abilities (1st-20th percentile rank and 98th percentile rank). When compared to the factorially developed test, the Rasch test was again more efficient for students of average to high abilities (50th-90th percentile rank). The factorially developed test appeared more efficient for the very low and very high ability students (1st-20th percentile rank and 98th percentile rank).


Figure 2. Relative efficiency curves for the three comparisons of the 30 item tests, plotted against percentile rank (ordinate: relative efficiency; abscissa: percentile ranks 1-99).


Summary

The results reported in this chapter are summarized for each of the five hypotheses.

Hypothesis 1. There are no significant differences in the internal consistency estimates of the tests produced by the three methods, as the number of items decreases, when compared to the projected internal consistency estimates for the population for tests of similar length.

Confidence intervals were calculated to test for differences between the observed internal consistency estimates and the internal consistency estimate for the population. As reported in Tables 12 and 17, for the 15 and 30 item tests, 15 of the 18 confidence intervals (at the 95 percent level) generated around the sample estimates did not contain the population value. This means that 15 of the observed internal consistency estimates were superior to the population values projected for subtests of similar length created by random deletion of items. Therefore, hypothesis one was not supported. The procedures that produced the three observed internal consistency estimates that were not significantly different from the population value, and hence no different than would be expected by random item deletion, were the Rasch procedure (15 item test, N = 500; 30 item test, N = 250) and the classical item analysis procedure (30 item test, N = 250).

Hypothesis 2. There are no differences in the internal consistency estimates of the tests produced by the three methods when the number of examinees is decreased.

Hypothesis two was supported for the 15 and 30 item tests. Slight decreases in internal consistency estimates were noted for the 15 item tests (Table 11) as sample size decreased, but only decreases of one or two one-hundredths of a point. Even smaller decreases were observed on the 30 item tests (Table 16).

Hypothesis 3. There are no meaningful differences in the magnitude of the standard error of measurement of the tests produced by the three methods.


Hypothesis three was supported for the 15 and 30 item tests. Meaningful differences were defined to be ≥ 1.00, but none of the three methods produced tests with standard errors of measurement that differed by that much. In each case, the difference was approximately one-tenth of a point or less (Tables 11 and 16).

Hypothesis 4. There are no differences in the difficulties or discriminations of the items selected by the three methods.

Hypothesis four was supported for the 15 and 30 item tests with respect to item difficulty. That is, the two-way analysis of variance revealed no significant differences for either the 15 or 30 item tests with regard to item difficulty. Hypothesis four was also supported for item discrimination, but only for the 30 item tests. The two-way analysis of variance for item discrimination indicated no significant differences for the 30 item tests; however, on the 15 item tests, a significant F ratio (p < .05) for item analytic procedure was observed for item discrimination. Tukey's HSD test revealed that items selected by the Rasch procedure had significantly lower average biserial correlations than the items selected by factor analysis and classical item analysis (Table 15). This could have been expected because the range of the biserial correlations was restricted when the items were originally selected for the Rasch model. This procedure was necessary to meet one of the assumptions for the Rasch model.

Hypothesis 5. There are no differences across ability levels in the efficiency of the tests produced by the three methods.


Hypothesis five was not supported. The efficiency curves illustrated in Figure 2 generally indicated that the tests based on classical test theory were more effective for measuring students with very low ability (20th percentile rank or less) and students with very high abilities (98th percentile rank). The Rasch developed test was most efficient for assessing average and high ability students (40th-90th percentile rank).


CHAPTER V
DISCUSSION AND CONCLUSIONS

This study was conducted to determine which of three item analytic procedures (classical item analysis, factor analysis, and the Rasch model) might produce the superior test in terms of the precision and the efficiency of measurement. A common item and examinee population was used to test five hypotheses. Of the five hypotheses, three dealt with elements of test precision as measured by internal consistency estimates. Another hypothesis treated the issue of item discriminations; thus, it too was related to internal consistency. The fifth hypothesis focused on the relative efficiency of the tests produced by the three item analytic techniques. This hypothesis altered the emphasis of the study from one overall specific measure of a test's accuracy, in terms of internal consistency, to a general comparison of each method as a function of ability level. The discussion of the results has therefore been focused in two major areas: (a) the precision of the tests, and (b) the efficiency of the tests produced by the three methods of item analysis.

The Precision of the Tests Produced by the Three Methods of Item Analysis

Each of the three item analytic techniques was applied to an independent sample to select the best 15 and 30 items. The stability of the summary statistics across each sample size for the three item analytic techniques indicated a tendency for the nine samples to be very homogeneous.


The similarity of the means, standard deviations, and percentages of variance accounted for was noted in Tables 5-8, with the exception of the mean square fit statistic (Table 7), which increased with sample size. (This exception is discussed later in this chapter.) From these samples, items were selected by each item analytic technique to maximize internal consistency. The data reported in Tables 11 and 16 indicated the effectiveness of each item analytic technique in producing internally consistent tests. Before an overall decision can be made as to the superiority of one technique over another, each of the hypotheses relating to precision must be considered.

Internal Consistency

Data in Tables 11 and 16 indicate that the two tests based on classical test theory (factor analysis and classical item analysis) appeared superior in terms of internal consistency when compared to the tests developed by the Rasch model. To test whether any of the three methods produced tests with greater internal consistency than a test created by random item deletion, the internal consistency estimates were compared to the projected internal consistency value for the population by using confidence intervals as suggested by Feldt (1965). In order for a given sample internal consistency estimate to be significant, the population value could not be included in the confidence interval generated around that sample value. For the 15 item tests, nine confidence intervals were calculated for the nine estimates of internal consistency, one for each method at each sample size. Eight of the nine sample values were shown to be significantly greater than the population estimate at the 95 percent confidence level (Table 12).


Only the internal consistency estimate of the Rasch test based on the sample of 500 examinees failed to reach a level significantly greater than would have been expected by chance. For the 30 item tests, nine confidence intervals were also calculated for the nine estimates of internal consistency, one for each method at each sample size. Seven of the nine sample internal consistency estimates were shown to be significantly greater than the population estimate at the 95 percent confidence level (Table 17). The tests based on classical item analysis and Rasch analysis, for the sample of 250 examinees, were not significantly different from the projected population value for a 30 item test created by random item selection. Therefore, for smaller samples (N = 250), factor analysis appeared to be superior to classical item analysis and the Rasch analysis in producing the most precise test. Generally, as the number of examinees decreased, so did the internal consistency estimates. However, the tests based on factor analysis were least affected by decreasing the sample sizes used in the cross-validation for the 15 and 30 item tests (Tables 11 and 16).

Standard Error of Measurement

The standard error of measurement is the standard deviation of the distribution of errors surrounding an individual's observed score on an infinite number of parallel tests. Hence, the smaller the standard error of measurement, the greater the precision of the measurement. This statistic is often considered a more meaningful measure of an instrument's reliability than the reliability coefficient itself (Magnusson, 1966, p. 82).


Based on the data for this study, the standard errors of measurement were consistently smaller for both the 15 and 30 item tests produced by classical test theory as compared to the 15 and 30 item tests based on the Rasch model; however, the differences in the standard errors of measurement did not equal or exceed 1.00 for any of the methods.

Types of Items Retained

Item difficulty. Item difficulties of the 15 and 30 item tests were analyzed in separate two-way analyses of variance. The two independent variables were sample size and item analytic technique. No significant F ratios were observed for either the 15 or 30 item tests on item difficulty. Therefore, each item analytic technique tended to select items which had similar item difficulties on the average.

Item discrimination. In this study, the item discriminations were measured by biserial correlations. Transformed biserial correlations for the 15 and 30 item tests were analyzed in separate two-way analyses of variance. The two independent variables were sample size and item analytic technique. For the 15 item tests, a significant F ratio (p < .05) was observed for the main effect of item analytic technique. The mean transformed biserial correlations for the 15 item tests were 44 for the Rasch test, 54 for the classical item analysis test, and 54 for the factorially developed test. (The actual mean biserial correlations corresponding to these transformed values were .62, .71, and .71, respectively.) Tukey's HSD post hoc comparison indicated that the items selected by the Rasch procedure had lower biserial correlations, on the average, than items selected on the basis of factor analysis or classical item analysis.


It should be noted that this finding was due to the fact that the range of the biserial correlations was restricted to .39-.79 on the items selected for the Rasch calibration. This was necessary to meet the assumption of equal item discriminations. However, when the test length was increased to 30 items, no significant F ratios were observed for the variable of item discrimination. The difference in these two findings for the 15 and 30 item tests can be explained by the way the tests were constructed. The 15 item test was made up of the 15 items with the highest biserial correlations. The 30 item test was made up of those 15 items plus an additional set of 15 items with the next highest biserial correlations. The addition of 15 more items meant that their average biserial correlation was somewhat lower than that of the original 15 items. Hence, the mean biserial correlations were reduced for the longer 30 item tests (Table 19).

Conclusions

From the data presented for each of the four areas above, it was concluded that the three item analytic techniques tended to produce tests that were really no different in terms of the precision of measurement.


Thus, the question to consider now is: Should practitioners in the field of measurement spend their time learning to use the Rasch model to develop cognitive norm-referenced tests (the field of test development is limited here to cognitive norm-referenced tests because that was the type of instrument used in this study), knowing the extra work and sophistication of knowledge required to use the Rasch procedures effectively? With the criterion of internal consistency as a measure of test superiority, it appeared from this study that time spent factorially developing tests or, if computer facilities were not available, the use of classical item analysis procedures seemed more than adequate for good test construction. However, it must be remembered that internal consistency may not be a fair and sufficient criterion. Internal consistency is an integral part of classical test theory and may be biased since it was derived from the classical model. Following that reasoning, Whitely and Dawis (1974) have commented on the precision of tests developed using classical item analysis and Rasch analysis. They stated that if the goal of item selection is to develop fixed-content tests, then the classical techniques of item selection will yield the more precise test, since precision is specific to the trait distribution in a given test. Whitely and Dawis indicated that the strength of the Rasch analysis lies in the individualized selection of items, as in tailored testing, rather than in the construction of fixed-content tests. In considering the above situation, Lord (1974b) has stated that internal consistency is an overall estimate of a test's homogeneity but provides no information on how the test as a whole discriminates for the various ability groups taking the test. Thus, the three techniques of test development were compared using an additional criterion, relative efficiency.


The Efficiency of the Tests Produced by the Three Methods of Item Analysis

The review of the literature concerning the relative efficiency of a test, presented in Chapter II, cited studies mainly dealing with latent trait theory (Birnbaum, 1968; Hambleton and Traub, 1971, 1973). Studies comparing the relative efficiency of tests developed by latent trait theory to tests developed by classical test theory appear to be missing from the literature on test efficiency. The relative efficiency formula (Lord, 1974a, 1974b) was not derived for any specific test development theory; therefore, relative efficiency estimates should be applicable to any test development technique. Lord (1974b) suggested his formula may not work well for extremely short tests and that it should only be used on large samples of examinees. Thus, in the present study, only the 30 item tests were compared, using the sample of 995 examinees. The three comparisons of relative efficiency were: (a) the test based on factor analysis was compared to the test based on classical item analysis, (b) the test based on the Rasch analysis was compared to the test based on classical item analysis, and (c) the test based on the Rasch analysis was compared to the test based on factor analysis. Generally, the results indicated that the Rasch test was superior to the two tests based on classical test theory for students of average and high ability. The two tests based on classical test theory, however, were superior in efficiency to the Rasch developed test for very low and very high ability students (Figure 2). Test efficiency has been defined as a measure of how well a test is able to discriminate between examinees of varying abilities.


Therefore, the test constructor must ask, for which segment(s) of the examinee population is the test intended to discriminate? In this study, the test under consideration was a verbal aptitude college admissions test. Usually, college admissions officers are interested in selecting students who will be successful once admitted to college. The examinees who score very high on college admissions tests will generally be admitted to college without any question. Thus, it is less important to be able to discriminate among the very high scoring examinees than to discriminate among the students who score near the mean or in the upper middle range on a college admissions test. For these students it is difficult to decide who should be admitted and who should be denied admittance. If it is known that the admissions test discriminates very well for average to high ability students, then the reliability of the selection process based on test scores should be increased. The data in this study, therefore, indicate that the test based on the Rasch analysis would be most efficient for selecting the average to high ability students for admission to college. Lord (1968) has illustrated a very important feature of test information and relative efficiency curves. He has shown that the contribution of each item to a test is independent of all other items. Thus, when information curves are available for a pool of items, items can be added to a test to achieve a prespecified information or relative efficiency curve for any subpopulation of examinees (Lord, 1968). Therefore, by using measures such as relative efficiency and information curves, psychometricians are able to develop very discriminating tests for any segment of the population.


Conclusions

It has been suggested that because efficiency and test information curves are a function of ability, these estimates ought to replace the use of classical reliability estimates and the standard error of measurement in test score information (Hambleton et al., 1977). This suggestion certainly deserves some consideration in light of the present study, where it was shown that the three methods of item analysis produced similar tests in terms of precision but very different tests in terms of efficiency. Today, with the increasing use of computers in test construction, perhaps the more meaningful question to be asked by psychometricians is: For which ability group is the test superior? Only test information curves and measures of relative efficiency can answer that question.

Implications for Future Research

The results of this empirical study revealed that tests developed using classical test theory, in spite of its inherent weaknesses, were no different with respect to precision of measurement than tests developed using one of the latent trait models, the one-parameter Rasch model. Comparisons of relative efficiency for the 30 item tests showed that the tests based on classical test theory were superior to the Rasch developed test for very low and very high scoring examinees, while the Rasch developed test was more efficient for average to high scoring examinees on the verbal aptitude college admissions subtest used in this study. Only one of the four latent trait models, the Rasch model, was used in this comparative study of test development techniques.


Perhaps it was the very nature of this simple model that resulted in the development of equivalent tests, in terms of precision, when compared to the tests developed by classical item analysis and factor analysis. It may be that the more technical two- and three-parameter logistic models would have produced tests comparable or superior to those developed by classical test theory in terms of precision of measurement and overall relative efficiency. The two-parameter model allows for varying item discriminations, so that the initial selection of items would not have to be restricted to a prespecified range. The three-parameter model not only allows for varying item discriminations, but also for the effects of guessing on the test. It is reasonable to suspect that guessing may have been a factor in the item scores for the type of cognitive test used in the present study. Thus, before the findings of the study can be generally accepted, replication is needed using not only other populations and other instruments, but also other latent trait models. If other latent trait models are to be considered in addition to the Rasch model, several points need to be evaluated. The Rasch model is the only latent trait model that provides for the direct calibration of items and abilities based on unweighted "number right" scoring (Wright, 1977). The two- and three-parameter models require a more complex scoring system in which the item response is weighted in order to estimate the discrimination and guessing parameters. The weighting is an iterative process that may never converge or stabilize unless arbitrary boundaries are established (Wright, 1977, p. 104). Because of this complex scoring system, the two- and three-parameter logistic models are less efficient for parameter estimation than the one-parameter model in terms of computer time.
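For reference, the item characteristic curves of this logistic family are usually written as follows in the latent trait literature (e.g., Birnbaum, 1968; Hambleton and Cook, 1977); the notation below is the conventional one, supplied here only for orientation and not reproduced from the original text:

P_i(\theta) \;=\; c_i + (1 - c_i)\,\frac{\exp[a_i(\theta - b_i)]}{1 + \exp[a_i(\theta - b_i)]},

where b_i is the item difficulty, a_i the item discrimination, and c_i the lower-asymptote (guessing) parameter. The two-parameter model sets c_i = 0; the one-parameter (Rasch) model further holds a_i constant across items, in which case the odds of success reduce to the product of a person term exp(\theta) and an item easiness term exp(-b_i), the multiplicative form referred to in the next paragraph.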


A further criticism of the two- and three-parameter logistic models has been offered by Wright (1977) concerning the additional item parameters: if item discrimination parameters and item guessing parameters are introduced into a theory of measurement, why not person parameters for sensitivity to difficult items and inclination toward guessing? Wright questions whether psychometricians really need to make measurement theory so complex. A final point to consider when comparing the latent trait models is that only the one-parameter Rasch model provides a ratio scale of measurement in terms of the calibrated item and ability scores (Hambleton et al., 1977). Success on a particular item is given by the product of the person's ability and the item's easiness. Thus, a person with no ability has zero odds of success on any item, and the same logic applies to items with no easiness: they cannot be solved. Measurements made with Rasch calibrated items are therefore on a ratio scale, and it is the ratio scale of measurement that leads to the concept of specific objectivity. It seems, then, that each latent trait model has its own set of advantages and disadvantages and should be considered if comparisons are to be made to the classical models for test development purposes. Additional areas for future research may lead to actual comparisons of the content of the items selected by each of the three item analytic techniques. Davis (1951) and Cox (1965) have criticized the selection of items solely on statistical criteria. The use of statistical criteria alone may result in changing the nature of the trait being measured by deleting items essential to adequate content coverage.


Whitely and Dawis (1976) found this to be true in a study of verbal analogy items, where the type of relationship and content of specific analogy items proved to be quite significant when studied in isolation. A restricting factor in this study was that a prespecified number of test items (15 and 30) was selected by each item analytic procedure. For the Rasch procedure, perhaps some number of items less than 30 but greater than 15 may have provided a better fit to the model. This procedure might have produced mean square fit statistics that were equivalent across the sample sizes, rather than increasing with sample size as found in this study. Thus, selecting the number of items precisely fitting the Rasch model and comparing that number of items with classical item analysis and factor analysis might have led to different conclusions than those made in the present study with regard to the precision and relative efficiency of measurement.


CHAPTER VI
SUMMARY

The quality of the items in a test determines its validity and reliability. Through the application of item analysis procedures, test constructors are able to obtain quantitative, objective information useful in developing and judging the quality of a test and its items. Classical test theory forms the basis for one method of test development. An integral part of the development of tests based on the classical model is the utilization of classical item analysis or factor analysis. Classical item analysis is a procedure to obtain a description of the statistical characteristics of each item in the test. This approach requires identification of single items which provide maximum discrimination between individuals on the latent trait being measured. Theoretically, selecting items which have a high correlation with total test score will result in a discriminating test which is homogeneous with respect to the latent trait. Therefore, classical item analysis is an aid to developing internally consistent tests. An alternative method of test development, but one based on the classical model, is factor analysis. Factor analysis is a more complex test development procedure than classical item analysis.


It is a statistical technique that takes into account each item's correlations with all other items in the test simultaneously. Groups of similar items tend to cluster together and comprise the latent traits (factors) underlying the test. Under the classical model, then, classical item analysis can be viewed as a unidimensional basis for item analysis, less sophisticated than the multidimensional procedure of factor analysis. Classical item analysis and factor analysis have long been the only techniques described in measurement texts for use in test development (Baker, 1977). However, with the publication of Lord and Novick's Statistical Theories of Mental Test Scores (1968), considerable attention is now being directed toward the field of latent trait theory as a new area in test development. Proponents of this approach claim that the advantages of latent trait theory over classical test theory are twofold: (a) theoretically it provides item parameters that are invariant across examinee samples which differ with respect to the latent trait, and (b) it provides item characteristic curves that give insight into how specific items discriminate between students of varying abilities. Four latent trait models have been developed for use with dichotomously scored data: the normal ogive and the one-, two-, and three-parameter logistic models (Hambleton and Cook, 1977; Lord and Novick, 1968). This study was concerned with the one-parameter logistic Rasch model because it is the simplest of the four models. A review of the literature revealed numerous studies conducted in each of the three areas of item analysis, but relatively few comparative studies were reported between the three methods.


Missing from the review were comparative studies among all three item analytic techniques. Therefore, the present study was designed to compare the methods of classical item analysis, factor analysis, and the Rasch model on measures of precision and relative efficiency in test development. An empirical study was designed to compare the effects of the three methods of item analysis on test development across different sample sizes. Item response data were obtained from a sample of 5,235 high school seniors on a cognitive test of verbal aptitude. The subjects were divided into nine samples: three independent groups of 250 subjects each, three independent groups of 500 subjects each, and three independent groups of 995 subjects each. The independent groups were obtained so that tests of statistical significance could be performed. The item response data were then analyzed in three phases. First, the "best" 15 and 30 items were selected using each item analytic technique. Under classical item analysis, the best 15 and 30 items were selected based on the highest biserial correlations. For factor analysis, the best 15 and 30 items were selected based on the highest item loadings on the first (unrotated) principal component. The selection of the best 15 and 30 items using the Rasch model was based upon the mean square fit of the items to the model. These procedures were used for each group of subjects. Second, a double cross-validation design was employed to obtain estimates of the item and test parameters for the best 15 and 30 items. The items selected from the three item analytic techniques were scored for different samples of subjects by randomly reassigning the samples which had been used in the original item analysis.
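The sampling and reassignment steps summarized here can be pictured with a small sketch. The exact reassignment scheme of the study is given in Table 3 of Chapter III (not reproduced here), so the pairing below, in which each group's item selections are scored on a different group of the same size, is only an illustrative assumption; the function names and group rotation are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def split_into_groups(subject_ids, sizes=(250, 250, 250, 500, 500, 500, 995, 995, 995)):
    """Randomly partition subjects into nine independent groups of the stated sizes."""
    ids = rng.permutation(subject_ids)
    groups, start = [], 0
    for size in sizes:
        groups.append(ids[start:start + size])
        start += size
    return groups

def cross_validation_pairs(groups):
    """Pair each calibration group with a different scoring group of the same size."""
    pairs = []
    for size in (250, 500, 995):
        same_size = [g for g in groups if len(g) == size]
        for i, calibration in enumerate(same_size):
            scoring = same_size[(i + 1) % len(same_size)]   # rotate within the size class
            pairs.append((calibration, scoring))
    return pairs
```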


Then, the best 15 and 30 items chosen by each method were submitted to a common item analytic procedure in order to obtain estimates for comparing the three item analytic methods. Third, a two-way analysis of variance and, when indicated, a Tukey HSD post hoc comparison test were used to test for differences in the properties of the items selected by each item analytic procedure. Confidence intervals were also calculated to compare the internal consistency estimates to a population value. In addition, the relative efficiencies of the 30 item tests developed by each item analytic technique were compared for the sample of 995 subjects. The results of the analysis showed that there were no apparent differences in the types of tests produced by the three methods of item analysis in terms of the precision of measurement. The three methods were compared on measures of internal consistency, the standard error of measurement, mean item difficulty, and mean item discrimination. Confidence intervals were generated around the observed internal consistency estimates for the 15 and 30 item tests produced by each method. The confidence intervals were obtained to compare the observed internal consistency estimates from each test to the projected population internal consistency for tests of similar length, as suggested by Feldt (1965). The projected population value (obtained via the Spearman-Brown Prophecy Formula) represented what the test's internal consistency would have been for a test created by deleting items at random. Of the 18 confidence intervals calculated at a 95 percent level of confidence, 15 did not contain the projected population value. Therefore, it was concluded that the three item analytic techniques were significantly different from random item deletion in producing tests with higher internal consistency estimates.


It was also noted that as the number of examinees decreased, so did the internal consistency estimates. The standard error of measurement was consistently smaller for both the 15 and 30 item tests produced by classical test theory when compared to the 15 and 30 item tests based on the Rasch model. However, the differences in the standard errors of measurement did not equal or exceed 1.00 for any of the methods. No significant F ratios were observed for either the 15 or 30 item tests on item difficulty. Therefore, each item analytic technique tended to select items which had similar item difficulties on the average. For the variable of item discrimination on the 15 item tests, a significant F ratio (p < .05) was observed for the main effect of item analytic technique. Tukey's HSD post hoc analysis indicated that the Rasch test tended to contain items with lower biserial correlations, on the average, than the tests produced by factor analysis and classical item analysis. This finding was probably due to the fact that the range of the biserial correlations was restricted to .39-.79 for items retained in the Rasch analysis. However, when the test length was increased to 30 items, no significant F ratios were observed for the variable of item discrimination. In terms of test efficiency, the results indicated substantive differences in the tests produced by the three methods of item analysis. The 30 item tests for the sample of 995 examinees were used in this comparison. It was found that the Rasch developed test was superior to the two tests based on classical test theory for students of average to high ability.


The tests based on classical test theory, however, were superior in efficiency to the Rasch test for students of very low or very high ability. In light of these findings, it was suggested that measures of test efficiency ought to be incorporated into test development procedures, as they provide much more detailed information on how a test discriminates for various ability groups than does a single overall estimate of a test's homogeneity.


REFERENCES

Anderson, E. B. Goodness of fit for the Rasch model. Psychometrika, 1973, 38, 123-140.

Anderson, J., Kearney, G., and Everett, A. An evaluation of Rasch's structural model for test items. British Journal of Mathematical and Statistical Psychology, 1968, 21, 231-238.

Anderson, L. W. A comparison of classical item analytic procedures with affective data. A paper presented at the annual meeting of the American Educational Research Association, San Francisco, April 1976.

Baker, F. B. Empirical comparison of the item parameters based on the logistic and normal functions. Psychometrika, 1961, 26, 235-246.

Baker, F. B. Advances in item analysis. Review of Educational Research, 1977, 47, 151-178.

Baker, F. B., and Martin, T. Fortap: A Fortran test analysis package. Occasional Paper No. 10, Office of Research Consultation, College of Education, Michigan State University, 1970.

Bedell, B. J. Determination of the optimum number of items to retain in a test measuring a single ability. Psychometrika, 1950, 15, 419-430.

Benson, I. G. The Florida Twelfth Grade Testing Program: A factor analytic study of the aptitude and achievement subtests. Unpublished Master of Arts in Education thesis, University of Florida, 1975.

Binet, A. and Simon, T. (The development of intelligence in children) (E. S. Kite, trans.). Baltimore, Md.: Williams & Wilkins, 1916.

Birnbaum, A. Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord and M. R. Novick, Statistical theories of mental test scores. Reading, Ma.: Addison-Wesley, 1968.

Bock, R. D. and Wood, R. Test theory. In P. H. Mussen and M. R. Rosenzweig (Eds.), Annual review of psychology (Vol. 22). Palo Alto, Ca.: Annual Reviews, Inc., 1971.

Brogden, H. E. Variation in test validity with variation in the distribution of item difficulty, number of items, and degree of their intercorrelation. Psychometrika, 1946, 11, 197-214.


Burt, C. and John, E. A factorial analysis of the Terman-Binet tests. British Journal of Educational Psychology, 1943, 12, 156-161.

Carroll, J. B. The nature of the data, or how to choose a correlation coefficient. Psychometrika, 1961, 26, 347-372.

Cattell, R. B. Personality and motivation: structure and measurement. New York: World Book, Inc., 1957.

Cattell, R. B. and Tsujioka, A. The importance of factor trueness and validity, versus homogeneity and orthogonality, in test scales. Educational and Psychological Measurement, 1964, 24, 3-30.

Cochran, W. G. and Cox, G. M. Experimental designs (2nd ed.). New York: John Wiley and Sons, Inc., 1957.

Cox, R. C. Item selection techniques and evaluation of instructional objectives. Journal of Educational Measurement, 1965, 2, 181-185.

Cronbach, L. J. Coefficient alpha and the internal structure of tests. Psychometrika, 1951, 16, 297-334.

Davis, F. B. Item analysis data: Their computation, interpretation, and use in test construction. Cambridge, Ma.: Harvard University, 1946.

Davis, F. B. Item selection techniques. In E. F. Lindquist (Ed.), Educational measurement. Washington, D. C.: American Council on Education, 1951.

Dinero, T. E. and Haertel, E. A computer simulation investigating the applicability of the Rasch model with varying item discrimination. A paper presented at the annual meeting of the National Council on Measurement in Education, San Francisco, April 1976.

Fan, C. T. Item analysis table. Princeton, N. J.: Educational Testing Service, 1952.

Feldt, L. S. The approximate sampling distribution of Kuder-Richardson reliability coefficient twenty. Psychometrika, 1965, 30, 357-365.

Flanagan, J. General considerations in the selection of test items and a short method of estimating the product-moment coefficient from data at the tails of the distribution. Journal of Educational Psychology, 1939, 30, 674-680.

Florida Twelfth Grade Testing Program (Report No. 1-75). Gainesville, Florida: Office of Instructional Resources, 1975.

Fruchter, B. Introduction to factor analysis. New York: D. Van Nostrand Co., Inc., 1954.

Guertin, W. and Bailey, J. Introduction to modern factor analysis. Ann Arbor, Mich.: Edward Bros., Inc., 1970.


Guilford, J. P. Psychometric methods. New York: McGraw-Hill, 1954.

Gulliksen, H. The relation of item difficulty and inter-item correlation to test variance and reliability. Psychometrika, 1945, 10, 79-91.

Gulliksen, H. Theory of mental tests. New York: John Wiley and Sons, Inc., 1950.

Hambleton, R. K. and Cook, L. Latent trait models and their use in the analysis of educational test data. Journal of Educational Measurement, 1977, 14, 75-96.

Hambleton, R. K. and Traub, R. E. Information curves and efficiency of three logistic test models. British Journal of Mathematical and Statistical Psychology, 1971, 24, 275-281.

Hambleton, R. K. and Traub, R. E. Analysis of empirical data using two logistic latent trait models. British Journal of Mathematical and Statistical Psychology, 1973, 26, 195-211.

Hambleton, R. K., Swaminathan, H., Cook, L., Eignor, D., and Gifford, J. Developments in latent trait theory: A review of models, technical issues, and applications. A paper presented at the annual meeting of the American Educational Research Association, New York, April 1977.

Harman, H. H. Modern factor analysis (2nd ed.). Chicago: University of Chicago Press, 1967.

Henrysson, S. The relationship between factor loadings and biserial correlations in item analysis. Psychometrika, 1962, 27, 419-424.

Henrysson, S. Gathering, analyzing, and using data on test items. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.). Washington, D. C.: American Council on Education, 1971.

Hotelling, H. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 1933, 24, 417-441, 498-520.

Hoyt, C. Test reliability estimated by analysis of variance. Psychometrika, 1941, 6, 153-160.

Kelley, T. L. The selection of upper and lower groups for the validation of test items. Journal of Educational Psychology, 1939, 30, 17-24.

Kirk, R. Experimental design: Procedures for the behavioral sciences. Belmont, Ca.: Wadsworth, 1968.

Lange, A., Lehmann, I., and Mehrens, W. Using item analysis to improve tests. Journal of Educational Measurement, 1967, 4, 65-68.


Lazarsfeld, P. F. The logical and mathematical foundation of latent structure analysis. In S. A. Stouffer et al., Measurement and prediction. Princeton, N. J.: Princeton University Press, 1950.
Lord, F. M. A theory of test scores. Psychometric Monographs, 1952, No. 7. (a)
Lord, F. M. The relationship of the reliability of multiple-choice tests to the distribution of item difficulties. Psychometrika, 1952, 18, 181-194. (b)
Lord, F. M. An application of confidence intervals and of maximum likelihood to the estimation of an examinee's ability. Psychometrika, 1953, 18, 57-75. (a)
Lord, F. M. The relation of test scores to the trait underlying the test. Educational and Psychological Measurement, 1953, 13, 517-548. (b)
Lord, F. M. An analysis of the Verbal Scholastic Aptitude Test using Birnbaum's three-parameter logistic model. Educational and Psychological Measurement, 1968, 28, 989-1020.
Lord, F. M. The relative efficiency of two tests as a function of ability level. Psychometrika, 1974, 39, 351-358. (a)
Lord, F. M. Quick estimates of the relative efficiency of two tests as a function of ability level. Journal of Educational Measurement, 1974, 11, 247-254. (b)
Lord, F. M. Practical applications of item characteristic curve theory. Journal of Educational Measurement, 1977, 14, 117-138.
Lord, F. M. and Novick, M. Statistical theories of mental test scores. Reading, Ma.: Addison-Wesley, 1968.
Lumsden, J. The construction of unidimensional tests. Psychological Bulletin, 1961, 58, 122-131.
McNemar, Q. The revision of the Stanford-Binet scale. Boston: Houghton Mifflin, 1942.
McNemar, Q. Psychological statistics. New York: John Wiley and Sons, Inc., 1969.
Magnusson, D. Test theory. Reading, Ma.: Addison-Wesley, 1966.
Mandeville, G. K. and Smarr, A. M. Rasch model analysis of three types of cognitive data. A paper presented at the annual meeting of the American Educational Research Association, San Francisco, April 1976.
Marascuilo, L. A. Statistical methods for behavioral science research. New York: McGraw-Hill, 1971.


Mehrens, W. and Lehmann, I. Measurement and evaluation in education and psychology. New York: Holt, Rinehart, and Winston, Inc., 1973.
Mendenhall, W., Ott, L., and Scheaffer, R. Elementary survey sampling. Belmont, Ca.: Wadsworth Publishing Co., 1971.
Mosier, C. I. Problems and designs of cross validation. Educational and Psychological Measurement, 1951, 11, 5-11.
Myers, C. T. The relationship between item difficulty and test validity and reliability. Educational and Psychological Measurement, 1962, 22, 565-571.
Pearson, K. On the correlation of characters not quantitatively measurable. Royal Society Philosophical Transactions, Series A, 1900, 195, 1-47.
Pearson, K. On a new method of determining a correlation between a measured character of A, and a character of B, of which only the percentage of cases wherein B exceeds (or falls short of) a given intensity is recorded for each grade of A. Biometrika, 1909, 7, 96-105.
Rasch, G. Probabilistic models for some intelligence and attainment tests. Copenhagen, Denmark: Danmarks Paedagogiske Institute, 1960.
Rasch, G. An item analysis which takes individual differences into account. British Journal of Mathematical and Statistical Psychology, 1966, 19, 49-57.
Rentz, C. Rasch model invariance as a function of the shape of the sample distribution and degree of model-data fit. A paper presented at the annual meeting of the Florida Educational Research Association, January 1976.
Rentz, R. R. and Bashaw, W. L. The national reference scale for reading: An application of the Rasch model. Journal of Educational Measurement, 1977, 14, 161-179.
Richardson, M. W. Notes on the rationale of item analysis. Psychometrika, 1936, 1, 69-76.
Ryan, J. P. The rationale for the Rasch model. A paper presented at the pre-convention training session for the Rasch model, the annual meeting of the American Educational Research Association, New York, April 1977.
Spearman, C. General intelligence objectively determined and measured. American Journal of Psychology, 1904, 15, 201-293.
Swineford, F. Biserial r versus Pearson r as measures of test-item validity. Journal of Educational Psychology, 1936, 27, 471-472.


Tinsley, H. and Dawis, R. An investigation of the Rasch simple logistic model: Sample-free item and test calibration. Educational and Psychological Measurement, 1975, 35, 325-339.
Thurstone, L. L. Multiple factor analysis. Chicago: University of Chicago Press, 1947.
Ware, W. B. and Benson, J. Appropriate statistics and measurement scales. Science Education, 1975, 59, 575-582.
Webster, H. Maximizing test validity by item selection. Psychometrika, 1956, 21, 153-164.
Wherry, R. J. and Winer, B. J. A method for factoring large numbers of items. Psychometrika, 1953, 18, 161-179.
Whitely, S. and Dawis, R. The nature of objectivity with the Rasch model. Journal of Educational Measurement, 1974, 11, 163-178.
Whitely, S. and Dawis, R. The influence of test context on item difficulty. Educational and Psychological Measurement, 1976, 36, 329-337.
Woodcock, R. W. Woodcock Reading Mastery Test. Circle Pines, Minn.: American Guidance Service, 1974.
Wright, B. Sample-free test calibration and person measurement. Proceedings of the 1967 Invitational Conference on Testing Problems. Princeton, N. J.: Educational Testing Service, 1968.
Wright, B. Solving measurement problems with the Rasch model. Journal of Educational Measurement, 1977, 14, 97-116.
Wright, B. D. and Mead, R. J. CALFIT: Sample-free item calibration with a Rasch measurement model. Research Memorandum No. 18. Chicago: Statistical Laboratory, Department of Education, University of Chicago, 1975.
Wright, B. D. and Mead, R. J. BICAL: Calibrating rating scales with the Rasch model. Research Memorandum No. 23. Chicago: Statistical Laboratory, Department of Education, University of Chicago, 1976.
Wright, B. D. and Panchapakesan, N. A procedure for sample-free item analysis. Educational and Psychological Measurement, 1969, 29, 23-48.


APPENDIX A

MATHEMATICAL DERIVATION OF THE RASCH MODEL

Summarized from Ryan (1977)


MATHEMATICAL DERIVATION OF THE RASCH MODEL

If ability is equal to \theta and item difficulty is equal to \xi, then the odds of correctly solving an item are given by

    \text{odds} = \theta / \xi .                                          (1)

When \theta > \xi the odds exceed one and the person is more likely to get the item right; when \theta < \xi the odds are less than one and the person is more likely to get the item wrong. Expressed as a probability, the chance of a correct response is

    P = \frac{\theta/\xi}{1 + \theta/\xi} = \frac{\theta}{\xi + \theta} ,                          (5)

and the chance of an incorrect response is

    Q = 1 - P = \frac{\xi}{\xi + \theta} .                                (6)
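For a brief numerical illustration (the values here are chosen only for this example and are not taken from the study), suppose \theta = 2 and \xi = 1. Then

    \text{odds} = \frac{\theta}{\xi} = \frac{2}{1} = 2 , \qquad
    P = \frac{\theta}{\xi + \theta} = \frac{2}{3} , \qquad
    Q = \frac{\xi}{\xi + \theta} = \frac{1}{3} .

The person is twice as likely to answer the item correctly as incorrectly, and the ratio P/Q = 2 recovers the odds.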


Formula 6 is the probability of an incorrect response. The relationship of these probability statements to the statement of the odds is given by

    \frac{P}{Q} = \frac{\theta/(\xi + \theta)}{\xi/(\xi + \theta)} = \frac{\theta}{\xi} .                          (7)

The right side of equation 7 is simply the odds as defined in equation 1. The left side of the equation is the probability of a correct response divided by the probability of an incorrect response. The probability of a correct response, P, is estimated by the proportion of examinees in a sample who correctly answer an item. On a test of k items, the probability of a person with a score of j (where 1 ≤ j ≤ k) correctly answering a particular item is simply the proportion of people with the raw score j who correctly answered the item. This is nothing more than the item difficulty for all the people who have a raw score of j. The probability can be calculated in terms of the item difficulty for all raw score groups from 1 to (k-1) and across all items. The probability of an incorrect response, Q, is the proportion of people answering an item incorrectly. This is simply one minus the proportion answering it correctly (1 - P). Both P and Q are easily calculated from a set of data; hence the value of P/Q in formula 7 is an easily derived statistic which estimates the odds.

Separating the Parameters

Consider again equation 7 and take the natural log (ln) of both sides of the equation. This gives

    \ln\left(\frac{P}{Q}\right) = \ln\left(\frac{\theta}{\xi}\right) .                          (8)


Since the ln of a ratio is the same as the difference of the lns, equation 8 becomes

    \ln\left(\frac{P}{Q}\right) = \ln\theta - \ln\xi .                          (9)

Person Free Item Difficulties

Next consider a score group with ability \theta and two test items with difficulties \xi_1 and \xi_2, respectively. The probability of a person with ability \theta correctly answering an item of difficulty \xi_1 is P_{11}. The probability of an incorrect response is Q_{11}. The probability of a person with ability \theta correctly answering an item of difficulty \xi_2 is P_{12}, and the probability of an incorrect response is Q_{12}. By equation 9,

    \ln\left(\frac{P_{11}}{Q_{11}}\right) = \ln\theta - \ln\xi_1 , and                          (10)

    \ln\left(\frac{P_{12}}{Q_{12}}\right) = \ln\theta - \ln\xi_2 .                          (11)

If equation 11 is subtracted from equation 10, term by term, the result is

    \ln\left(\frac{P_{11}}{Q_{11}}\right) - \ln\left(\frac{P_{12}}{Q_{12}}\right) = (\ln\theta - \ln\xi_1) - (\ln\theta - \ln\xi_2) .                          (12)

This is the same as

    \ln\left(\frac{P_{11}}{Q_{11}}\right) - \ln\left(\frac{P_{12}}{Q_{12}}\right) = \ln\theta - \ln\xi_1 - \ln\theta + \ln\xi_2 ,                          (13)

or,

    \ln\left(\frac{P_{11}}{Q_{11}}\right) - \ln\left(\frac{P_{12}}{Q_{12}}\right) = \ln\xi_2 - \ln\xi_1 .                          (14)

Equation 14 should be examined very carefully. On the left side of the equation is an easily calculated statistic: the difference between the ln odds of a correct response on item 1 compared to item 2.


The right side is significant for what it does not contain. There is no parameter for the person's ability on the right side of the equation. This same result occurs regardless of the ability of the person or group examined. The difference between the difficulty of item 1 and the difficulty of item 2 can be calculated independently of the subject or group of subjects involved. In general, for any two items of difficulty \xi_l and \xi_m, the difference between the difficulty of the two items is given by

    \ln\xi_m - \ln\xi_l = \ln\left(\frac{P_{jl}}{Q_{jl}}\right) - \ln\left(\frac{P_{jm}}{Q_{jm}}\right)                          (15)

for any score group j.

Item Free Person Abilities

The discussion of person ability estimates is an exact parallel to the discussion of item difficulty estimates. Instead of comparing two items across any group of examinees, the discussion of ability proceeds by comparing any two groups on any test item. Consider score group 1 with ability \theta_1, score group 2 with ability \theta_2, and item 1 with difficulty \xi_1. From equation 9,

    \ln\left(\frac{P_{11}}{Q_{11}}\right) = \ln\theta_1 - \ln\xi_1 , and                          (16)

    \ln\left(\frac{P_{21}}{Q_{21}}\right) = \ln\theta_2 - \ln\xi_1 .                          (17)

Subtracting equation 17 from equation 16 will yield

    \ln\left(\frac{P_{11}}{Q_{11}}\right) - \ln\left(\frac{P_{21}}{Q_{21}}\right) = \ln\theta_1 - \ln\theta_2 .                          (18)

The difference between the abilities of the examinees in score group 1 and score group 2 (the right side of equation 18) is described without reference to the item involved. In general, for any two groups with abilities \theta_i and \theta_j,


    \ln\theta_i - \ln\theta_j = \ln\left(\frac{P_{il}}{Q_{il}}\right) - \ln\left(\frac{P_{jl}}{Q_{jl}}\right)                          (19)

for any item with difficulty \xi_l. In this case the abilities are being compared independently of the difficulty of the item used to compare them. This is often referred to as item free person ability estimation.

Formalizing the Model

To describe the Rasch model, let \ln\theta and \ln\xi be re-defined. Specifically, let

    \beta = \ln\theta , and                          (20)

    \delta = \ln\xi .                          (21)

Equations 20 and 21 simply define the ln ability as \beta and the ln difficulty as \delta. If both sides of equations 20 and 21 are raised to the base of the natural log system, e, we get

    e^{\beta} = e^{\ln\theta} = \theta , and                          (22)

    e^{\delta} = e^{\ln\xi} = \xi .                          (23)

Recall equation 5,

    P = \frac{\theta/\xi}{1 + \theta/\xi} ,                          (24)

and substitute the equivalent terms for \theta and \xi as defined in equations 22 and 23. This gives

    P = \frac{e^{\beta}/e^{\delta}}{1 + e^{\beta}/e^{\delta}} ,                          (25)

or

    P = \frac{e^{(\beta - \delta)}}{1 + e^{(\beta - \delta)}} .                          (26)
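Using the same illustrative values as before (\theta = 2 and \xi = 1, chosen only for the example), the reparameterization leaves the probability unchanged: \beta = \ln 2 \approx 0.693 and \delta = \ln 1 = 0, so

    \frac{e^{(\beta - \delta)}}{1 + e^{(\beta - \delta)}} = \frac{e^{\ln 2}}{1 + e^{\ln 2}} = \frac{2}{1 + 2} = \frac{2}{3} ,

which is exactly the value given by equation 5.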


More formally this is

    P(X_{vi} = 1 \mid \beta_v, \delta_i) = \frac{e^{(\beta_v - \delta_i)}}{1 + e^{(\beta_v - \delta_i)}} .                          (27)

Equation 27 is the Rasch model.
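The derivation can also be illustrated computationally. The short Python sketch below is not part of the original study; the function names (rasch_p, log_odds) and all parameter values are invented for illustration. It evaluates equation 27 and then verifies the two invariance properties derived above: the log-odds difference between two items is the same for every score group (equation 14), and the log-odds difference between two score groups is the same for every item (equation 18).

    import math

    def rasch_p(beta, delta):
        """Equation 27: probability of a correct response for a person with
        log ability beta on an item with log difficulty delta."""
        return math.exp(beta - delta) / (1.0 + math.exp(beta - delta))

    def log_odds(p):
        """ln(P/Q), the natural log of the odds of a correct response."""
        return math.log(p / (1.0 - p))

    # Illustrative parameters (chosen arbitrarily for this sketch).
    betas = {"group 1": 0.5, "group 2": -1.0}    # log abilities
    deltas = {"item 1": -0.4, "item 2": 0.8}     # log difficulties

    # Person-free item calibration (equation 14): for every score group,
    # ln(P/Q) on item 1 minus ln(P/Q) on item 2 equals delta_2 - delta_1.
    for group, beta in betas.items():
        diff = (log_odds(rasch_p(beta, deltas["item 1"]))
                - log_odds(rasch_p(beta, deltas["item 2"])))
        print(group, "difficulty difference:", round(diff, 6))
    # Both groups yield 1.2, regardless of their abilities.

    # Item-free ability comparison (equation 18): for every item,
    # ln(P/Q) for group 1 minus ln(P/Q) for group 2 equals beta_1 - beta_2.
    for item, delta in deltas.items():
        diff = (log_odds(rasch_p(betas["group 1"], delta))
                - log_odds(rasch_p(betas["group 2"], delta)))
        print(item, "ability difference:", round(diff, 6))
    # Both items yield 1.5, regardless of their difficulties.

Because ln(P/Q) = \beta - \delta under the model, the differences reduce exactly to \delta_2 - \delta_1 and \beta_1 - \beta_2. With real data the same quantities would be estimated from the observed proportions P and Q within each raw score group, as described earlier in this appendix.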


APPENDIX B

RELATIVE EFFICIENCY VALUES USED IN FIGURE 2 FOR THE COMPARISONS AMONG ITEM ANALYTIC METHODS


TABLE B.1

RELATIVE EFFICIENCY VALUES USED IN FIGURE 2 FOR THE COMPARISONS AMONG ITEM ANALYTIC METHODS


BIOGRAPHICAL SKETCH

Iris Benson was born January 19, 1946, in Charleston, South Carolina. She graduated from Hialeah High School, Hialeah, Florida, in June 1964. She began attending college part-time in 1968, and graduated in August 1971 from Santa Fe Junior College in Gainesville, Florida. Iris then attended the University of Florida, and received a Bachelor of Arts degree with honors in March 1973 with a major in Psychology. In the fall of 1973, she began her graduate studies in the College of Education at the University of Florida. Upon completing her master's thesis entitled The Florida Twelfth Grade Testing Program: A factor analysis of the aptitude and achievement subtests, she received the degree Master of Arts in Education in June 1975.

Iris began her doctoral program at the University of Florida in the fall of 1975. While working on her graduate degrees she has held, at various times, a graduate assistantship in the area of evaluation and test development, and teaching assistantships in graduate level courses in educational measurement and statistics. She was also an evaluation intern at the Northwest Regional Educational Laboratory in Portland, Oregon from September 1976 to March 1977.

Iris is a member of the American Educational Research Association and the National Council on Measurement in Education. She has co-authored several articles and papers on the topics of examinee test-taking behavior, appropriate statistics for different measurement scales, and the adversary evaluation model.


I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

William B. Ware, Chairman
Professor of Foundations of Education

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Linda M. Crocker
Associate Professor of Foundations of Education

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

John M. Newell
Professor of Foundations of Education


I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

William R. Powell
Professor of Instructional Leadership and Support

This dissertation was submitted to the Graduate Faculty of the Department of Foundations of Education in the College of Education and to the Graduate Council, and was accepted as partial fulfillment of the requirements for the degree of Doctor of Philosophy.

December 1977

Chairman, Foundations of Education

Dean, Graduate School

