
Citation
 Permanent Link: http://ufdc.ufl.edu/AA00034640/00001

Material Information
 Title: The relationship between test anxiety and statistical measures of person fit on achievement tests
 Creator: Schmitt, Alicia P., 1952-
 Publication Date: 1984
 Language: English
 Physical Description: ix, 65 leaves : ill. ; 28 cm.

Subjects
 Subjects / Keywords: Achievement tests; Anxiety; Consistent estimators; Correlations; Educational research; Mathematics; Statistics; Test anxiety; Test scores; Test theory (jstor); Dissertations, Academic -- Foundations of Education -- UF; Educational tests and measurements (lcsh); Intelligence tests (lcsh); Test anxiety (lcsh); Foundations of Education thesis, Ph.D.
 Genre: bibliography (marcgt); nonfiction (marcgt)

Notes
 Thesis: Thesis (Ph.D.)--University of Florida, 1984.
 Bibliography: Bibliography: leaves 61-64.
 General Note: Typescript.
 General Note: Vita.
 Statement of Responsibility: by Alicia P. Schmitt.
Record Information
 Source Institution:
 University of Florida
 Holding Location:
 University of Florida
 Resource Identifier: 030595770 (ALEPH); 12015502 (OCLC)

THE RELATIONSHIP BETWEEN TEST ANXIETY AND STATISTICAL
MEASURES OF PERSON FIT ON ACHIEVEMENT TESTS By
ALICIA P. SCHMITT
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF
THE UNIVERSITY OF FLORIDA
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE
DEGREE OF DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
1984
To my father, who taught me
the value of education
ACKNOWLEDGMENTS
I would like to express my sincere appreciation to Dr. Linda
Crocker, chair of my doctoral committee. Her standard of excellence and sound advice have guided my doctoral studies. She has always provided invaluable opportunities for learning. I am also extremely grateful to Dr. James Algina who was continually available for consultation and guidance, and to Dr. Marvin E. Shaw who gave freely of his patient and quiet encouragement. I am grateful to each of
these members of my doctoral committee for helping me reach this point in my career.
My husband, Jeff, deserves special recognition since he lived with and stood by me through this special time in my life. He was always encouraging and helpful. This degree belongs to him as much as to me. To my sister, Amelia, I am indebted for providing a consistent model of perseverance and strength. I am also grateful to my family,
friends, and coworkers, who always encouraged me and knew that I would finish, and to Adele Koehler, my typist, who always, with a smile, helped me meet my deadlines.
Finally, I thank the Alachua County School Board for providing the data used in this study.
TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

I   INTRODUCTION
      Statement of the Problem
      Purpose of the Study
      Theoretical Rationale
      Definition of Technical Terms
         Student-Problem Table (SP Table)
         Modified Caution Index (MCI)
         Personal Biserial Correlation (PB)
         Norm Conformity Index (NCI)
         Rasch Person-Fit Statistic (Rx²)
         Extended Caution Index (ECI)
      Assumptions
      Educational Significance
      Summary

II  REVIEW OF LITERATURE
      Person-Fit Measures
         Historical Background
         Types of Person-Fit Indices
         Comparative Studies of Person-Fit Indices
         Person-Fit Indices Under Study
      Test Anxiety
      Summary

III METHODOLOGY
      Examinees
      Instruments
      Creation of the Data File
      Calculation of Person-Fit Statistics
      Analysis
      Summary

IV  RESULTS
      Descriptive Statistics
      Relationship Between Person-Fit, Ability, and Test Anxiety
      Relationship Between Person-Fit and a Linear Combination of Ability, Test Anxiety, and Their Interaction
      Internal Consistency of Person-Fit Statistics
      Summary

V   DISCUSSION
      Relationships Among Person-Fit Statistics
      Relationship Between Person-Fit, Ability, and Test Anxiety
      Reliability of Person-Fit Measures
      Summary

REFERENCES
BIOGRAPHICAL SKETCH
LIST OF TABLES

1  SP Table for Five Examinees and Six Items (Ideal Pattern)
2  Descriptive Statistics for Person-Fit Measures by Grade and Ability Test
3  Descriptive Statistics for Ability Tests and Test Anxiety by Grade
4  Person-Fit Measures Intercorrelations by Grade and Ability Test
5  Correlations Between Person-Fit Measures and Total Scores on Corresponding Ability Test and Test Anxiety
6  Percentage of Variance (R²) in Person-Fit Explained by the Combination of Examinee Ability, Test Anxiety, and Their Interaction
7  Significant Ability and Test Anxiety Interactions and R² Increases for Person-Fit Measures
8  Significant Main Effect of Ability as Predictor of Person-Fit Measures
9  Corrected Split-Half Reliability Estimates for Person-Fit Measures by Grade and Ability Test
LIST OF FIGURES

1. Relationship between the modified caution index and test anxiety for seventh-grade examinees at different reading ability levels
2. Relationship between the modified caution index and test anxiety for eighth-grade examinees at different science ability levels
3. Relationship between the personal biserial and test anxiety for eighth-grade examinees at different science ability levels
4. Relationship between the norm conformity index and test anxiety for eighth-grade examinees at different science ability levels
5. Relationship between the extended caution index and test anxiety for eighth-grade examinees at different science ability levels
Abstract of Dissertation Presented to the Graduate School of
the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
THE RELATIONSHIP BETWEEN TEST ANXIETY AND STATISTICAL
MEASURES OF PERSON FIT ON ACHIEVEMENT TESTS by
Alicia P. Schmitt
December, 1984
Chair: Linda M. Crocker
Major Department: Foundations of Education
The purpose of this study was to investigate the relationship
between an examinee's level of test anxiety and each of five different person-fit statistics and to establish whether this relationship is dependent on ability level. A secondary interest was to determine the relationship among person-fit indices within and across different subject areas of a standardized achievement test and to assess the internal consistency of person-fit indices.
An existing data set was analyzed to explore the nature of these relationships for the modified caution index, the personal biserial correlation, the norm conformity index, the Rasch person-fit index, and the extended caution index. Achievement test scores on the
reading, mathematics, and science subtests of the Metropolitan Achievement Test (MAT) and scores on the Test Anxiety Scale for Adolescents were used as estimates of ability and anxiety. The item scores and total test scores of 225 seventh-graders and 188 eighth-graders of a metropolitan middle school comprised this data set.
Intercorrelations among the measures of person fit were in the .80s to .90s within same-subject content areas. Across subject content areas little or no relationship was found. Low to moderate correlations were obtained between person-fit indices and their corresponding ability scores (|.00| to |.50|) and test anxiety (|.02| to |.22|). For four of the measures of person fit, on one or more of the subject tests, a significant proportion of variance was explained by the linear combination of ability, test anxiety, and their interaction. In these cases ability levels moderate the relationship between person-fit measures and test anxiety; for lower-ability examinees the relationship is direct, but for higher-ability examinees the relationship is inverse. Only the Rasch person-fit index was consistently unaffected by this interaction. Corrected split-half reliability estimates of person-fit indices were low (.20 to .56), indicating little consistency of the trait.
According to these results, the potential uses of person-fit indices are questionable at this time. More research is needed before these measures can be recommended for routine use in interpretation of achievement test scores for individual examinees.
CHAPTER I
INTRODUCTION
Statement of the Problem
Although total scores have been consistently used as the basis to
evaluate educational achievement, analysis of item-response patterns can contribute additional information that may be useful in the interpretation of an overall score. Analysis of response patterns can be based on two dimensions: item difficulty and examinee ability.
Ability is typically estimated by the total score on the test of interest, and item difficulty, by the proportion of examinees answering the item correctly. If the items are arranged in ascending
order of difficulty, an examinee with a given ability should answer items correctly until the point where his or her ability matches the difficulty of the items, and miss each item thereafter. Deviations from the expected response pattern occur when the pattern of passed and missed items is not consistent. If a person misses easier items
but then responds correctly to harder items, there is deviation from the expected response and misfit occurs.
With the introduction of the scalogram technique, Guttman (1941, 1950) was one of the first social scientists to suggest that some persons respond consistently to a given set of ordered stimuli (test items) while others do not. Under Guttman's scale theory, a response pattern in which a student passing a more difficult item also responds correctly to all easier items is called a perfect simplex, and the scale or test under such a situation is called a perfect scale.
During the late 1970s and early 1980s there has been a resurgence of interest in using information provided by response patterns. A
number of person-fit statistics have been developed to provide a measure of an individual examinee's deviation from the expected response pattern to a given set of items. Although some studies have shown that indices of person fit are highly correlated (Harnisch & Linn, 1981; Rudner, 1983), attempts to identify causes of person misfit (or even personality or demographic correlates of it) have remained mainly speculative. Some researchers, such as Frary (1982) and Harnisch and Linn (1981), have suggested that one factor which may contribute to person misfit on cognitive tests is test anxiety, but prior to this study there had been no empirical investigation to test this hypothesis.
Purpose of the Study
The present exploratory study was designed to investigate the nature of the relationship between measures of person fit and test anxiety. For each of five selected indices of fit (modified caution index, personal biserial correlation, norm conformity index, Rasch person-fit index, and an extended caution index), the following questions were asked:
1. What is the degree of linear relationship between test anxiety and an examinee's level of misfit?
2. What is the degree of linear relationship between ability (as defined by performance on the current achievement test) and level of misfit?
3. To what extent is variance in person misfit explained by a linear combination of the variables: ability level, test anxiety, and their interaction?
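Question 3 amounts to a moderated regression. As a rough illustration of the kind of analysis involved (this is not the study's actual code, and the data below are simulated, not the MAT data), one can compare the R² of a main-effects model with that of a model that adds the ability-by-anxiety product term, using ordinary least squares via the normal equations:

```python
import random

def solve(a, b):
    """Gauss-Jordan elimination with partial pivoting for a small system a x = b."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(n):
            if r != c and m[r][c] != 0.0:
                f = m[r][c] / m[c][c]
                m[r] = [x - f * y for x, y in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]

def r_squared(y, cols):
    """OLS R^2 of y regressed on the predictor columns (intercept added)."""
    xs = [[1.0] + [col[i] for col in cols] for i in range(len(y))]
    k = len(xs[0])
    xtx = [[sum(r[i] * r[j] for r in xs) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(xs, y)) for i in range(k)]
    beta = solve(xtx, xty)
    yhat = [sum(b * x for b, x in zip(beta, r)) for r in xs]
    ybar = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, yhat))
    ss_tot = sum((a - ybar) ** 2 for a in y)
    return 1 - ss_res / ss_tot

random.seed(1)
ability = [random.gauss(0, 1) for _ in range(200)]
anxiety = [random.gauss(0, 1) for _ in range(200)]
# Simulated misfit in which ability moderates the effect of anxiety.
misfit = [0.2 * x - 0.5 * x * a + random.gauss(0, 1)
          for x, a in zip(anxiety, ability)]
inter = [x * a for x, a in zip(anxiety, ability)]

r2_main = r_squared(misfit, [ability, anxiety])
r2_full = r_squared(misfit, [ability, anxiety, inter])
print(r2_main, r2_full)
```

The increase in R² from the first model to the second estimates the variance in misfit uniquely attributable to the ability-by-anxiety interaction.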
A secondary interest in this investigation was to explore the degree of relationship among the five selected person-fit indices within and across subject area tests of an achievement battery and to estimate their internal consistencies. This information was considered important because the tests used in this case were subtests from a well-known, nationally norm-referenced, standardized achievement test battery. Earlier studies of the interrelationship and reliability of person-fit indices have typically been based upon state minimal competency examinations (Harnisch & Linn, 1981) or locally developed teacher-made tests (Frary, 1982).
Theoretical Rationale
To date, most research on test anxiety has considered primarily its effects on examinees' total test scores. Recently, Frary (1982) and Harnisch and Linn (1981) have suggested that test anxiety may be a factor which contributes to erratic performance of an examinee within a given test (e.g., missing relatively easy items while answering more difficult items correctly). Careless errors and lack of concentration by high-test-anxious individuals could change the pattern of item responses from the pattern that would be expected.
Two theories predict the effect of anxiety on performance.
According to the cognitive attentional theory of test anxiety, highly anxious students attend to self-relevant variables instead of to task-relevant variables, negatively affecting their performance (Wine, 1980). In an analysis of Spielberger's (1966, 1971) extension of Spence-Taylor drive theory, Heinrich and Spielberger (1982) make several predictions about the effect of ability and anxiety on performance of tasks with varying levels of difficulty that seem relevant to this study of test anxiety and person fit. These are as follows:
1. For subjects with superior intelligence, high anxiety
will facilitate performance on most learning tasks. While high anxiety may initially cause performance decrements on
very difficult tasks, it will eventually facilitate the performance of bright subjects as they progress through
the task and correct responses become dominant.
2. For subjects of average intelligence, high anxiety will facilitate performance on simple tasks and, later in learning, on tasks of moderate difficulty. On very
difficult tasks, high anxiety will generally lead to
performance decrements.
3. For low intelligence subjects, high anxiety may facilitate performance on simple tasks that have been mastered.
However, performance decrements will generally be
associated with high anxiety on difficult tasks, especially in the early stages of learning. (Heinrich & Spielberger,
1982, p. 147)
According to these predictions, response patterns and person-fit statistics will be different for high-, average-, and low-ability examinees depending on their anxiety levels. The predicted effects for high- and low-anxious students at these three ability levels would be as follows:
1. Subjects with high ability and high test anxiety would be
expected to fail hard items initially; but since examinees receive no feedback under testing conditions, correct responses will not
be expected to become dominant. These examinees would continue to
have occasional difficulty on harder items, but their high levels of test anxiety would facilitate performance on easier items. Moderate to low misfit would be expected. For subjects with high ability and
low test anxiety, interference in performance of difficult items is not expected. Due to less attentional interference, use of test-taking strategies might also be more accessible. These students might be more open to guessing on harder items. Moderate to high misfit could be expected.
2. For subjects with average ability, high anxiety will help with easy to moderately difficult items, but will interfere with harder items. A low to moderate misfit would be predicted.
Similarly, low test anxiety is not expected to differentially affect item responses for average-ability examinees.
3. For low-ability subjects, high anxiety may help with the easier items but will interfere with performance on more difficult items. If these examinees do not feel very confident in their knowledge, high anxiety might not help but instead affect their concentration, making them answer in a more erratic manner. In this last case, higher misfit might occur. When low-ability subjects are also low in test anxiety, no interference is expected; these students will probably direct their attention to the easier items they have mastered. A low to moderate misfit is expected.
Definition of Technical Terms
Definitions and formulas required to explain major technical terms used in this study are as follows:
Student-Problem Table (SP Table)
The SP table is used to organize test information into a matrix of zeros and ones. The rows in this matrix represent the students ranked from highest to lowest according to total test score. The
columns represent the items arranged from left to right in ascending order of difficulty. Correct responses are represented by ones and incorrect responses are represented by zeros. Assuming that items are arranged in increasing order of difficulty (from easy to hard), a concordant response pattern is one in which an examinee answers the
items correctly until he or she reaches an item that is too difficult and answers the items incorrectly from then on. If all examinees had concordant response patterns, the SP matrix would have all ones in the upper left-hand corner and all zeros in the lower right-hand corner. A short illustrative table of the ideal response pattern is presented as Table 1.
Table 1
SP Table for Five Examinees and Six Items (Ideal Pattern)

                     Item
Examinee    1   2   3   4   5   6
    1       1   1   1   1   1   0
    2       1   1   1   1   0   0
    3       1   1   1   0   0   0
    4       1   1   0   0   0   0
    5       1   0   0   0   0   0
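The construction described above can be sketched in code. This is an illustrative sketch, not part of the dissertation; the `sp_table` helper and the data are hypothetical:

```python
# Construct an S-P table from a 0/1 item-score matrix: rows sorted by
# total score (high to low), columns by item difficulty (easy to hard,
# i.e., most frequently correct item first).
def sp_table(scores):
    """scores: list of lists of 0/1 item scores, one row per examinee."""
    n_items = len(scores[0])
    # Order items from easiest (most correct responses) to hardest.
    item_totals = [sum(row[j] for row in scores) for j in range(n_items)]
    item_order = sorted(range(n_items), key=lambda j: -item_totals[j])
    # Order examinees from highest to lowest total score.
    person_order = sorted(range(len(scores)), key=lambda i: -sum(scores[i]))
    return [[scores[i][j] for j in item_order] for i in person_order]

# The ideal pattern of Table 1: all ones in the upper left-hand corner.
ideal = [
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
]
assert sp_table(ideal) == ideal  # already in S-P order
```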
Modified Caution Index (MCI)
Harnisch and Linn (1981) introduced the modified caution index as a modification of an earlier caution index proposed by Sato in 1975 (cited in Harnisch & Linn, 1981). The MCI has a lower bound of 0 and an upper bound of 1. The higher the value of the index, the more divergent is the person's response pattern. This index is computed with data arranged into an SP table, using the following formula:
$$\mathrm{MCI} = \frac{\displaystyle\sum_{j=1}^{n_{i.}} (1 - u_{ij})\, n_{.j} \;-\; \sum_{j=n_{i.}+1}^{n} u_{ij}\, n_{.j}}{\displaystyle\sum_{j=1}^{n_{i.}} n_{.j} \;-\; \sum_{j=n-n_{i.}+1}^{n} n_{.j}} \tag{1}$$

where i is the examinee index in the SP matrix, j is the item index (items ordered from easiest to hardest), $u_{ij}$ is 1 if examinee i answers item j correctly and 0 if examinee i answers item j incorrectly, $n_{i.}$ is the total number of correct responses for examinee i, $n_{.j}$ is the number of correct responses to item j, and n is the number of items.
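A direct translation of equation (1) might look like the following sketch (hypothetical helper and data, not from the study); it assumes the responses and the item totals are already in S-P order, easiest item first:

```python
# Sketch of the modified caution index (equation 1). The index is
# undefined when the denominator is zero (perfect or zero scores).
def modified_caution_index(u, item_totals):
    """u: examinee's 0/1 responses; item_totals: n_.j per item (same order)."""
    n = len(u)
    k = sum(u)  # n_i., number of correct responses
    # Weighted "errors": misses among the k easiest items plus
    # passes among the remaining harder items.
    num = sum((1 - u[j]) * item_totals[j] for j in range(k)) \
        - sum(u[j] * item_totals[j] for j in range(k, n))
    # Normalizer: worst possible numerator given k correct responses.
    den = sum(item_totals[:k]) - sum(item_totals[n - k:])
    return num / den

# A perfect Guttman responder gets MCI = 0; a fully reversed one gets 1.
totals = [5, 4, 3, 2, 1]  # hypothetical n_.j, easiest item first
print(modified_caution_index([1, 1, 1, 0, 0], totals))  # → 0.0
print(modified_caution_index([0, 0, 0, 1, 1], totals))  # → 1.0
```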
Personal Biserial Correlation (PB)
Donlon and Fischer (1968) proposed this correlation. The
coefficient obtained represents the correlation between a person's item responses and the items' difficulty values. Donlon and Fischer define item difficulty as the proportion of examinees who respond incorrectly to an item. Large values correspond to difficult items and small values correspond to easy items. A positive correlation
represents good fit, indicating that a person tends to answer correctly items that are easy for the group and miss the more difficult items. Low or negative correlations represent more divergent response patterns. The formula to compute this correlation is
$$\mathrm{PB} = \frac{\bar{Q}_r - \bar{Q}_c}{S_{Q_r}} \cdot \frac{J_{r'}}{Y} \tag{2}$$

where $\bar{Q}_r$ is the mean item difficulty for items answered, $\bar{Q}_c$ is the mean item difficulty for items answered correctly, $S_{Q_r}$ is the standard deviation of the difficulty values of the items answered, $J_{r'}$ is the number of items answered correctly divided by the number of items answered, and $Y$ is the ordinate of the standard normal curve at the point separating the proportions $J_{r'}$ and $(1 - J_{r'})$.
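As an illustrative sketch (hypothetical function and data, not the authors' code), the personal biserial can be computed with Python's standard library, using `statistics.NormalDist` for the normal ordinate Y:

```python
# Sketch of the personal biserial (equation 2); hypothetical data.
# Undefined for perfect or zero scores (inv_cdf requires 0 < j_r < 1).
from statistics import NormalDist, pstdev

def personal_biserial(u, difficulty):
    """u: 0/1 responses to the items answered;
    difficulty: proportion of examinees answering each item incorrectly."""
    q_r = sum(difficulty) / len(difficulty)   # mean difficulty, all answered items
    correct = [d for d, x in zip(difficulty, u) if x == 1]
    q_c = sum(correct) / len(correct)         # mean difficulty, correct items
    s_qr = pstdev(difficulty)                 # SD of the difficulty values
    j_r = len(correct) / len(u)               # proportion answered correctly
    # Ordinate of the standard normal curve at the point
    # separating the proportions j_r and (1 - j_r).
    y = NormalDist().pdf(NormalDist().inv_cdf(j_r))
    return (q_r - q_c) / s_qr * (j_r / y)

# Concordant pattern (correct on the easy items) -> positive PB;
# a reversed pattern would yield a negative PB.
print(personal_biserial([1, 1, 1, 0, 0], [0.1, 0.2, 0.4, 0.6, 0.8]))
```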
Norm Conformity Index (NCI)

This index was developed by Tatsuoka and Tatsuoka (1980). The NCI indicates the degree of concordance with a group response pattern where items are arranged in descending order of difficulty, from hardest to easiest. Values of this index may range from −1 to 1. The
smaller or more negative the index, the more divergent is the individual's response pattern in comparison to the group norm. This index is undefined for either perfect or zero scores. Let S denote
the row vector of a person's response pattern; let S' denote the transpose of the complement of S; and let N = S'S. The formula to compute this index is
$$\mathrm{NCI} = \frac{2U_a}{U} - 1 \tag{3}$$

where $U_a = \sum_{i<j} n_{ij}$ is the sum of the above-diagonal elements of N, and $U = \sum_{i,j} n_{ij}$ is the sum of all the elements of N.
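A small sketch of equation (3) follows (hypothetical code; as the definition requires, items must be ordered from hardest to easiest):

```python
# Sketch of the NCI. With s the 0/1 response vector (hardest item
# first), the matrix N = (complement of s)^T s has elements
# n_ij = (1 - s_i) * s_j; only their sums are needed.
def norm_conformity_index(s):
    n = len(s)
    u_a = sum((1 - s[i]) * s[j] for i in range(n) for j in range(i + 1, n))
    u = sum((1 - s[i]) * s[j] for i in range(n) for j in range(n))
    return 2 * u_a / u - 1  # undefined (u == 0) for perfect or zero scores

# Guttman-consistent pattern (fails hard items, passes easy ones) -> 1;
# the fully reversed pattern -> -1.
print(norm_conformity_index([0, 0, 1, 1, 1]))  # → 1.0
print(norm_conformity_index([1, 1, 0, 0, 0]))  # → -1.0
```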
Rasch Person-Fit Statistic (Rx²)
The Rx², also referred to as the Maximum Likelihood Procedure (MAX) (Wright & Panchapakesan, 1969) and as the weighted total fit mean square (Rudner, 1983), was adopted for use with the one-parameter Rasch model and is calculated using the BICAL program (Wright, Mead, & Bell, 1979). This index has a high value for an examinee who has a response pattern that is inconsistent with the examinee's score and the Rasch model measure of item difficulty. The following formula is used to compute this index:
$$R_{x^2} = \frac{\displaystyle\sum_{j=1}^{J} (U_{ij} - P_{ij})^2}{\displaystyle\sum_{j=1}^{J} P_{ij}(1 - P_{ij})} \tag{4}$$

where $U_{ij}$ is the response of examinee i to item j, J is the number of items, and $P_{ij}$ is the probability of a correct response for examinee i on item j as predicted by the Rasch model:

$$P_{ij} = \frac{e^{(\theta_i - b_j)}}{1 + e^{(\theta_i - b_j)}}$$
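Equation (4) can be sketched as follows (the ability and difficulty values are hypothetical, not estimates from the study; in practice BICAL or another Rasch program supplies them):

```python
# Sketch of the Rasch person-fit statistic: sum of squared residuals
# from the Rasch probabilities, divided by the sum of their variances.
import math

def rasch_prob(theta, b):
    """Rasch model probability of a correct response."""
    return math.exp(theta - b) / (1 + math.exp(theta - b))

def rasch_person_fit(u, theta, bs):
    """u: 0/1 responses; theta: ability estimate; bs: item difficulties."""
    p = [rasch_prob(theta, b) for b in bs]
    num = sum((u_j - p_j) ** 2 for u_j, p_j in zip(u, p))
    den = sum(p_j * (1 - p_j) for p_j in p)
    return num / den

bs = [-2.0, -1.0, 0.0, 1.0, 2.0]  # hypothetical difficulties, easy to hard
print(rasch_person_fit([1, 1, 1, 0, 0], theta=0.0, bs=bs))  # near-model pattern
print(rasch_person_fit([0, 0, 1, 1, 1], theta=0.0, bs=bs))  # aberrant pattern
```

The aberrant pattern yields the larger value, reflecting its inconsistency with the model's predictions.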
Extended Caution Index (ECI)
The extended indices have been proposed by Tatsuoka and Linn (1981, 1983). They describe the extended indices as linear transformations of the distance between a person's response pattern and a theoretical curve. In the case of ECI, this curve is the group
response curve (GRC), which is "an average function of N different Person Response Curves" (Tatsuoka & Linn, 1981, p. 10). For the ECI, probabilities of success, calculated through item response theory logistic models, substitute for the zeros or ones in the SP table. For the purposes of this study, the Rasch one-parameter logistic model will be used to calculate these probabilities. The formula for the ECI is
$$\mathrm{ECI} = 1 - \frac{\displaystyle\sum_{j=1}^{J} (Y_{ij} - P_{i.})(Y_{.j} - P_{..})}{\displaystyle\sum_{j=1}^{J} (P_{ij} - \bar{P}_{i.})(Y_{.j} - P_{..})} \tag{5}$$

where $Y_{ij}$ is the response of examinee i to item j, $Y_{.j}$ is the proportion of correct responses across examinees for item j, $P_{i.}$ is the proportion of correct responses of examinee i, $P_{..}$ is the total proportion of correct responses, $P_{ij}$ is the probability of a correct response for examinee i on item j according to the Rasch model, and $\bar{P}_{i.} = \sum_{j} P_{ij}/J$ is the mean predicted probability of success for examinee i.
This formula is the ratio of two covariances. The higher the value of the ECI, the greater the variation from the expected response pattern. This index is also limited by perfect or zero scores: the denominator would become zero and the value infinite.
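Under one plausible reading of equation (5), one minus the ratio of two covariances, both taken against the group response curve, a sketch might look like the following (hypothetical data; treat the exact form as an assumption rather than the study's implementation):

```python
# Sketch of an extended caution index as 1 minus the ratio of
# cov(observed responses, group response curve) to
# cov(Rasch-predicted probabilities, group response curve).
def cov(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def extended_caution_index(y, p, group_p):
    """y: 0/1 responses; p: Rasch P_ij for this examinee;
    group_p: proportion correct per item (the group response curve)."""
    return 1 - cov(y, group_p) / cov(p, group_p)

group_p = [0.9, 0.7, 0.5, 0.3, 0.1]  # hypothetical GRC, easy to hard
p = [0.88, 0.73, 0.50, 0.27, 0.12]   # hypothetical Rasch probabilities
print(extended_caution_index([1, 1, 1, 0, 0], p, group_p))  # consistent
print(extended_caution_index([0, 0, 1, 1, 1], p, group_p))  # aberrant
```

A response pattern that tracks the group response curve keeps the ratio near one and the ECI near zero; deviation from the curve pushes the index up.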
Assumptions
The following underlying assumptions were held for this study:
1. Standardized testing situations are capable of inducing test anxiety among students who are disposed to be test anxious.
2. Students responded truthfully on the self-report instrument used to assess their level of test anxiety.
3. Total score on the achievement test used can be taken as an
estimate of the examinee's ability (substituting for any external measure of ability, such as an I.Q. or academic aptitude test score).
4. Each subtest of the achievement test measures a fairly
unidimensional trait (i.e., achievement in reading or mathematics or science). This assumption is critical for the indices using the Rasch
statistics, but it also underlies the assumptions made to calculate the other indices.
Educational Significance
The potential value of person-fit indices has been cited by Frary (1982), Harnisch (1983), Harnisch and Linn (1981), Levine and Rubin (1979), Rudner (1983), and Van der Flier (1982). These writers suggest that these indices could be useful for the following purposes:
1. To identify individuals for whom the test is inappropriate or invalid. Total test score interpretation can be misleading for examinees who come from different experiential backgrounds or take the test under different motivational dispositions, e.g., test anxiety.
2. To identify groups with different instructional practices or histories, which could change the difficulty of the items, e.g., schooling differences.
3. To identify items that are inadequate for particular groups of examinees.
Presently, person-fit indices are considered to be at a state of development where more research is needed to investigate their psychometric properties and establish their applicability. The reasons why some people are misfits are not clear. If test anxiety can be identified as a factor associated with person misfit, then the interpretive value of person-fit statistics would be enhanced. Another pragmatic contribution of this study is to extend the body of research on person-fit statistics by providing information about 1) the agreement of person-fit classifications across different subject matter content areas, as measured by the subtests of the Metropolitan Achievement Test, and 2) the degree of agreement of person-fit classifications by different indices.
Summary
Analysis of item-response patterns provides information not contained in a total test score. Although the idea of using response pattern information probably originated when Guttman (1941) introduced the scalogram technique, it was not until the late 1970s and early 1980s that a strong interest in person-fit statistics developed.
Person-fit statistics quantify the degree of deviation of an examinee's response pattern from the expected response pattern. The development and application of person-fit indices is at a fledgling stage. More research is needed to investigate their psychometric properties and establish their applicability. Attempts to identify
causes of person misfit have remained mainly speculative. Recently Frary (1982) and Harnisch and Linn (1981) have suggested that test anxiety may be one factor that can explain erratic performance of an examinee within a given test (e.g., missing easy items while answering more difficult items correctly). According to drive theory, the relationship between test anxiety and performance is moderated by level of ability and task difficulty. Performance on specific items might not only be dependent on the item's difficulty and the examinee's ability but also on the examinee's test anxiety.
The primary purpose of this study was to establish 1) the degree of linear relationship between test anxiety and an examinee's level of misfit, 2) the degree of linear relationship between ability (total score on achievement test) and level of misfit, and 3) the extent to which variance in person misfit can be explained by a linear combination of ability level, test anxiety, and their interaction. Five different indices of person fit were used in this study: the MCI, PB, NCI, Rx², and ECI.
A secondary interest was to investigate the relationship among the five selected person-fit indices within and across subtests of a norm-referenced achievement battery and to estimate their internal consistencies.
CHAPTER II
REVIEW OF LITERATURE
The two central aspects of this study are person-fit statistics and test anxiety. These two topics provide the major themes for the organization of the literature review presented in this chapter.
Person-Fit Measures
Historical Background
During the late 1970s and early 1980s there has been an
increasing interest in the development and application of statistical indices to identify examinees with aberrant item-response patterns. Proponents of person-fit statistics indicate that these indices add to
the information provided by total scores and can also be used to identify potentially inaccurate total scores (Frary, 1982; Harnisch, 1983; Harnisch & Linn, 1981; Rudner, 1983).
This trend toward using information from item-response patterns is not new. According to Gaier and Lee (1953),
one of the most promising trends in current psychometric
research is an increasing concern with methods of
evaluating patterns of test scores and test responses . . . our initial hypothesis is that consideration of
response configurations will yield more fruitful results
than the usual methods of reporting merely the total score for a test . . . a total score may thus carry
considerably less diagnostic significance than a direct
and detailed analysis of test responses per se.
(p. 140)
Guttman (1941, 1950) was one of the first writers to suggest that
some persons respond consistently to a given set of ordered stimuli (test items) while others do not. According to Guttman (1950), "a person who endorses a more extreme statement . . . should endorse all less extreme statements if the statements are to be considered a
scale" (p. 62). Guttman's description of the basic procedure for the scalogram technique of scale analysis is very similar to Sato's SP chart construction. He states that there are two basic steps in the scalogram pattern formation. These are
first, the questions are ranked in order of "difficulty"
with the "hardest" questions, i.e., the ones that fewest
persons got right, placed first and with the other
questions following in decreasing order of "difficulty."
Second, the people are ranked in order of "knowledge"
with the "most informed" persons, i.e., those who got
all questions right, placed first, the other individuals following in decreasing order of "knowledge." (Guttman,
1950, p. 70)
Sato's SP chart is also a two-dimensional matrix where the rows represent the students ranked from highest to lowest according to total test score (cited in Harnisch & Linn, 1981). The columns represent the items arranged from left to right in ascending order of difficulty. Construction of a scalogram pattern and an SP table follow the same two steps. Once the responses are organized in this fashion, a concordant response pattern is defined as the case when an
examinee answers the items correctly until he or she reaches an item that is too difficult and answers all items incorrectly thereafter. Some disruption of a perfect pattern can happen. As the response pattern deviates more from the expected pattern, the degree of aberrance increases. Visual identification of aberrant or erratic
response patterns becomes increasingly more difficult as the number of
items in a test increases. The number of possible response patterns
multiplies as the number of items increases. With the recent introduction of a variety of statistics to measure the degree of
deviation from a typical response pattern, there is a renewed interest in using response pattern information.
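The two ordering steps shared by Guttman's scalogram and Sato's S-P chart can be sketched in Python. The 0/1 response matrix below is hypothetical, and the sketch follows the S-P convention of placing the easiest items at the left (Guttman's description places the hardest first):

```python
# Hypothetical 0/1 response matrix: rows = examinees, columns = items.
responses = [
    [1, 1, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
]
n_persons, n_items = len(responses), len(responses[0])

# Step 1: rank examinees by "knowledge" (total score, descending).
row_order = sorted(range(n_persons), key=lambda i: -sum(responses[i]))

# Step 2: rank items by difficulty; higher proportion correct = easier,
# so sorting by descending p value puts the easiest items first.
p_values = [sum(r[j] for r in responses) / n_persons for j in range(n_items)]
col_order = sorted(range(n_items), key=lambda j: -p_values[j])

# Reorder both dimensions to obtain the S-P table.
sp_table = [[responses[i][j] for j in col_order] for i in row_order]
for row in sp_table:
    print(row)
```

A perfectly concordant examinee then shows an unbroken run of 1s followed by 0s in his or her row; breaks in that run are the visual signature of aberrance.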
Types of Person-Fit Indices
Indices measuring the degree of unusual response patterns can be categorized into three major types: norm-comparison indices, goodness-of-fit indices, and extended indices.
Norm-comparison indices, which are based on observed patterns of right and wrong answers and are calculated with summary statistics based on the norm group, include Sato's caution index (cited in Harnisch & Linn, 1981), the modified caution index (Harnisch & Linn, 1981), the agreement, disagreement, and dependability indices proposed by Kane and Brennan (1980), the U' index by Van der Flier (1977), the personal biserial by Donlon and Fischer (1968), and the norm-conformity index by Tatsuoka and Tatsuoka (1980). Van der Flier's U' index and Tatsuoka and Tatsuoka's norm-conformity index have been reported to have a perfect negative relationship (Harnisch & Linn, 1981). Norm-comparison indices are calculated by using information organized in an S-P table. They indicate the degree of aberrance from the expected response pattern, when examinee ability is defined as the total observed score on the test.
Goodness-of-fit or "appropriateness" indices are based on item response theory (IRT) (Levine & Rubin, 1979). As with norm-comparison indices, goodness-of-fit indices are also based on the expected
response pattern for an examinee at a given ability level. The distinction is that for goodness-of-fit indices a more sophisticated definition of "ability" is employed. Instead of simply equating ability with the examinee's observed raw score on the test, ability is defined in terms of his estimated score on a theoretical latent continuum underlying test performance. There are two popular IRT models that estimate examinee abilities based on the latent trait underlying test performance. For Rasch's one-parameter logistic model, examinee ability estimates are determined as a function of item difficulty parameters. A widely used computer program, the BICAL, written by Wright et al. (1979), provides examinee ability estimates, item difficulty parameter estimates, and a person-fit statistic (Rx2) which "indicates how well the individual's item response pattern and the Rasch model fit" (Rudner, 1982, p. 4).
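The flavor of such a Rasch-based fit statistic can be sketched as follows. This is only an illustration of an information-weighted fit mean square, not the BICAL computation itself; the ability, difficulty, and response values are hypothetical:

```python
import math

def rasch_p(theta, b):
    """Rasch probability of a correct response: exp(theta - b)/(1 + exp(theta - b))."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def weighted_fit_mean_square(responses, theta, difficulties):
    """Squared residuals (x - P) summed and weighted by the information
    P(1 - P); values near 1 suggest the pattern is consistent with the model,
    large values flag aberrance."""
    num = den = 0.0
    for x, b in zip(responses, difficulties):
        p = rasch_p(theta, b)
        num += (x - p) ** 2
        den += p * (1.0 - p)
    return num / den

# Hypothetical examinee of moderate ability; items ordered easy to hard.
difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]
consistent = [1, 1, 1, 0, 0]   # expected pattern for theta near 0
aberrant   = [0, 0, 1, 1, 1]   # easy items wrong, hard items right

print(weighted_fit_mean_square(consistent, 0.0, difficulties))
print(weighted_fit_mean_square(aberrant, 0.0, difficulties))
```

The aberrant pattern produces a much larger fit value than the consistent one, which is the behavior a person-fit statistic is designed to capture.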
The second widely used IRT model is Birnbaum's three-parameter logistic model (Lord & Novick, 1968, Ch. 17), for which examinee ability estimates are determined as a function of item difficulty, item discrimination, and guessing parameters. Levine and Rubin (1979) developed three types of appropriateness indices based on Birnbaum's three-parameter logistic model. These approaches are the marginal probability, the likelihood ratios, and the estimated ability variation indices. A practical limitation in using these procedures arises from the large sample sizes usually recommended to obtain stable estimates from the three-parameter model (Hambleton & Cook, 1977). Levine and Rubin devised a simulation of item response data on the Scholastic Aptitude Test (SAT) to conform to normal or aberrant response patterns. Their findings indicate that all three types of
goodness-of-fit indices demonstrate the capability to detect aberrance when present.
Extended caution indices have been proposed by Tatsuoka and Linn (1981, 1983) as a link between the norm-comparison and the goodness-of-fit indices. They linked Sato's S-P theory and item response theory by replacing the original observed zeros and ones of the item scores with IRT probabilities of passing the items. These probabilities were then used in the calculation of the caution indices. Five variations of the extended caution index were created. These extended caution indices are defined as "linear transformations of the covariance or correlation between a person's response pattern and a theoretical curve" (Tatsuoka & Linn, 1983, p. 95). Their findings support the effectiveness of the extended indices in identifying examinees who use erroneous rules in answering arithmetic test problems. Tatsuoka and Linn (1983) point out that these indices can have instrumental utility by identifying students who consistently make errors because of misconceptions.
Comparative Studies of Person-Fit Indices
There have been several studies in which the relationships and effectiveness of person-fit indices have been compared. Harnisch and Linn (1981) made a comparative analysis of ten norm-comparison indices. Using mathematics and reading tests from a statewide assessment program, they also examined school and regional differences. The intercorrelations between these indices ranged from |.13| to |.99| for mathematics and from |.34| to |.96| for reading. They found that Kane and Brennan's (1980) agreement index had the lowest correlation
with the other indices, but had the highest correlation (.99) with total score. The modified caution index (MCI) was found to have the
lowest correlation with total score (.02 for mathematics and .21 for reading). Harnisch and Linn (1981) found significant school and regional differences in students' response patterns as measured by the
MCI.
Rudner (1982, 1983) evaluated nine indices; four were norm-comparison indices, while five were goodness-of-fit indices. He generated data by simulating examinees and their responses through Birnbaum's three-parameter model. Response patterns were altered to simulate spuriously high or low respondents. Findings indicated that the norm-comparison indices (point biserial correlation, PB, NCI, and MCI) and the weighted total fit mean square, or Rx2, were highly intercorrelated (|.77| to |.99|). The goodness-of-fit indices using Birnbaum's three-parameter model and the unweighted total fit mean square had lower intercorrelations (|.17| to |.80|). Validity of the indices was tested by observing how sensitive they were to assessment accuracy. The MCI and the NCI identified comparable proportions of examinees with aberrant response patterns. According to Rudner, "these two approaches were the most stable of the statistics" (Rudner, 1983, p. 217). In general, indices based on IRT showed better detection rates of aberrant response patterns than the norm-comparison indices.
Frary (1982), using teacher-made multiple-choice tests, compared three person-fit measures: the Rx2, the MCI, and a weighted choice index. In the weighted choice index, distractor choice is considered as part of the estimation of person-fit. The Rx2 and the MCI were
found to be highly correlated (.75). The smallest relationship between any two of the three indices was between the Rx2 and the weighted choice index (.42). In this study Frary was the first to compute and report person-fit internal consistency estimates. He found low and even negative split-half coefficients for the person-fit measures (Frary, 1982).
Person-Fit Indices Under Study
The present study is the first to include all three different types of person-fit statistics (i.e., the norm-comparison indices, the goodness-of-fit indices, and the extended indices). The five indices under study are the modified caution index (MCI), the personal biserial correlation (PB), the norm-conformity index (NCI), the Rasch person-fit statistic (Rx2), and the extended caution index (ECI).
The MCI was chosen for use in this study because it was found to be least related to total test scores (Harnisch & Linn, 1981) and is considered stable with short and long tests (Rudner, 1983). The PB was selected because it has been in use for a longer period of time than more recent indices and is generally associated with classical test theory. Its computations are relatively simple, and it has been found to be very efficient with shorter classroom tests (Rudner, 1983). For these reasons the PB could be useful to a larger number of practitioners. The NCI has been found to correlate with total score somewhat higher than other indices (Harnisch & Linn, 1981), but it has nevertheless been used recurrently in different research studies (Harnisch & Linn, 1981; Rudner, 1982, 1983; Van der Flier, 1977, 1982). The NCI and the MCI are considered to be the most applicable
and stable in situations with long and short tests and with spuriously high or low scores (Rudner, 1983).
The Rx2 index was selected for this study due to the availability of the BICAL computer program. The convenience of having the Rx2 computations provided as part of the output from the BICAL computer program makes the Rx2 index more usable for practitioners. The Rx2 is a goodness-of-fit or appropriateness type of person-fit index. It uses the Rasch one-parameter logistic model to estimate ability and item difficulty. Appropriateness indices requiring use of the three-parameter logistic model were not feasible for the present study because of the larger sample size recommended to get consistent ability parameter estimates (Hambleton & Cook, 1977). The ECI represents a link between norm-comparison indices and goodness-of-fit indices. Since no comparisons of the ECI with other indices, or computations with actual data, are available in the literature, this index was included to evaluate its relationship to the other indices.
It is noteworthy that most previous studies of multiple measures of person-fit have focused primarily upon intercorrelations among these indices, without investigating how they correlate with measures of any trait other than achievement itself, as measured by the test. The present study is somewhat broader in scope, since it investigates how these indices relate to another variable, test anxiety.
Test Anxiety
Most research on test anxiety has considered the effects of test anxiety on total score. According to Tryon (1980) test anxiety
research findings show a consistent, moderate negative correlation between test anxiety and total score measures of achievement. High-test-anxious individuals tend to score lower on classroom and aptitude tests (Alpert & Haber, 1960; Harper, 1974; Mandler & Sarason, 1952; I. Sarason, 1963, 1975; Spielberger, Gonzalez, Taylor, Algaze, & Anton, 1978).
Several researchers have tried to explain why test anxiety
affects performance. According to the cognitive attentional theory of
test anxiety (CATTA), introduced by Sarason (1960) and extended by Wine (1971, 1980), the "major cognitive characteristics of test anxious persons are negative self-preoccupation, and attention to evaluative cues to the detriment of test cues" (Wine, 1980, p. 371). This misdirection of attention, both in the pre-evaluation stages (study phase) and in the test-taking situation, may limit coding, retention, and retrieval of information by high-test-anxious individuals. Difficulty of the task (e.g., difficult items) is expected to negatively affect attention. Thus, according to CATTA, performance of test-anxious persons will be negatively affected.
The Spence-Taylor drive theory also predicts the effect of
anxiety on performance of tasks with varying levels of difficulty. Heinrich and Spielberger (1982) summarize these predictions according to the difficulty of the task. They explain that for high-anxious students the performance of a task is dependent on its difficulty. High anxiety may facilitate performance on easy tasks, interfere with performance on harder tasks, and be dependent on the stage of learning for tasks of intermediate difficulty. Heinrich and Spielberger (1982) explain the relationship between performance and the learning stage.
According to these authors, "high anxiety will be detrimental to performance early in learning when the strength of correct responses is weak relative to competing error tendencies. Later in learning, high anxiety will begin to facilitate performance as correct responses are strengthened and error tendencies are extinguished" (Heinrich & Spielberger, 1982, p. 146).
Varying ability levels and their relationship with anxiety and task difficulty are also considered by the Spence-Taylor drive theory. According to Spielberger (1971), the effect of anxiety on subjects with different ability levels will depend on the task difficulty and the learning stage considered.
These two theories, the CATTA and the Spence-Taylor drive theory, point to the possibility that test anxiety might have an effect at the item level and that this effect might be dependent on ability level. Person-fit statistics measure the deviation from an expected response pattern. According to the theory of person-fit, in a good fit to a response pattern "high ability examinees are expected to get few easy items wrong," while "low ability examinees are expected to get few difficult items right" (Rudner, 1983, p. 207). If test anxiety affects performance at the item level, it might be a factor which contributes to erratic performance for high-anxiety examinees.
Summary
Literature pertinent to person-fit indices and test anxiety has been reviewed in this chapter. Recent literature on person-fit measures shows an increasing interest in the development and application of these indices. The idea of using information provided
by response patterns is not new; it was first introduced by Guttman (1941) with the scalogram technique. During the late 1970s and early 1980s a number of different person-fit statistics were developed as measures of the degree of deviation from a typical response pattern. These indices can be categorized into three major types: norm-comparison indices, goodness-of-fit indices, and extended indices.
Five person-fit indices were selected for use in this study. Three of these indices are norm-comparison indices (MCI, PB, and NCI), one is a goodness-of-fit index (Rx2), and one belongs to the extended category of indices (ECI). Three major research studies comparing person-fit indices were reviewed. Harnisch and Linn (1981) compared ten norm-comparison indices and concluded that the MCI seemed to be the most promising due to its lower correlation with total score. Rudner (1983) evaluated four norm-comparison indices and five goodness-of-fit indices. He found that goodness-of-fit indices showed better detection rates of aberrant response patterns. Frary (1982) contributed to the development of person-fit indices by being the first to study their internal consistency. He found low split-half reliabilities.
Test anxiety has been suggested as a factor that could affect performance within a test (Frary, 1982; Harnisch & Linn, 1981). Two theories, the CATTA and the Spence-Taylor drive theory, predict that high test anxiety will negatively affect performance. These two theories point to the possibility that test anxiety might have an effect at the item level and that this effect might be dependent on ability level. Test anxiety could thus be a factor that contributes to person misfit.
CHAPTER III
METHODOLOGY
The present study was designed to investigate the relationship between an examinee's level of test anxiety and each of five different person-fit statistics and to establish whether this relationship is dependent on ability level. A second purpose was to investigate the correlation of person-fit indices within and across different subject areas of a standardized achievement test battery and to assess the internal consistency of person-fit indices. An existing data set was analyzed to explore the nature of these relationships. A description of the examinee group, instruments, data-file creation, and data analysis methods is presented in this chapter.
Examinees
The data pool used in this study consisted of test scores and
item responses from 225 seventh-graders and 188 eighth-graders from a metropolitan middle school in north central Florida. There was an almost even distribution of boys and girls at each grade level. Approximately 70% of the examinees were white and 30% were black at each grade level. The school population is heterogeneous with respect to socioeconomic level.
Instruments
The Test Anxiety Scale for Adolescents (TASA) (Schmitt & Crocker, 1982) was used to measure test anxiety as a trait. This instrument is a modified version of the 37-item Mandler-Sarason Test Anxiety Scale (Sarason, 1972). This scale consists of 31 true-false items and is designed for use with examinees in middle school or junior-high grades. Unlike most other test anxiety scales for children, all items on the TASA deal exclusively with examinee feelings about tests. Sample items include
"I worry just before getting a test back"; and
"Sometimes on a test I just can't think."
Schmitt and Crocker (1982) have found the factor structure of the TASA to be fairly similar to that reported for the adult test anxiety scales. They reported a KR20 of .87 as a total score reliability estimate for these middle school examinees.
The Metropolitan Achievement Test (MAT) subscales (Form KS) were used to measure achievement in reading, mathematics, and science (Prescott, Balow, Hogan, & Farr, 1978). The TASA was administered in
March, 1981, approximately two weeks prior to the schoolwide administration of standardized achievement tests. The MAT was administered by school staff as part of the school district's regular testing program approximately two weeks after the test anxiety scale was given. The ranges of item difficulties of the MAT subscales for the seventh and eighth grades are, respectively: .21 to .99 and .28 to .99 for reading; .20 to .97 and .31 to .97 for mathematics; and .29 to .94 and .34 to .98 for science.
Creation of the Data File
The test anxiety scores, coded with student ID number, but no other identifying information, were obtained in conjunction with a
University of Florida College of Education in-service training project for school personnel on identifying and counseling test-anxious students. The researcher later obtained a set of MAT test item responses for students with those same ID numbers from the county school district testing office. This data file also contained some demographic information (i.e., sex, race, and grade level). The data file used for analysis in this study was created by matching the two examinee data files on student ID number and merging the files. Thirty-seven of the students' records (13 of the seventh graders and 24 of the eighth graders) in the merged file were later deleted because it was found that these students had been tested out of grade level, rather than on the test form taken by their grade peers.
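The matching-and-merging step can be sketched as follows; the record layouts, field names, and ID values here are hypothetical, since the actual file formats are not described:

```python
# Hypothetical records keyed by student ID number.
anxiety = {101: 14, 102: 22, 103: 9, 105: 18}          # TASA scores
mat = {101: {"grade": 7, "items": [1, 0, 1]},
       102: {"grade": 8, "items": [1, 1, 1]},
       104: {"grade": 7, "items": [0, 1, 0]}}

# Keep only students present in both files (the merge on ID number).
merged = {sid: {"tasa": anxiety[sid], **mat[sid]}
          for sid in anxiety.keys() & mat.keys()}

# Records for students tested out of grade level are then deleted;
# a hypothetical ID set stands in for that check here.
out_of_grade = {102}
merged = {sid: rec for sid, rec in merged.items() if sid not in out_of_grade}
print(sorted(merged))
```

Only students appearing in both files and tested on their grade-level form survive both filters, mirroring the deletions described above.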
Calculation of Person-Fit Statistics
For each examinee on each subtest, five different indices of person-fit were calculated. These were the modified caution index, the personal biserial correlation, the norm-conformity index, the Rasch person-fit index, and the extended caution index. To create the data file containing the five person-fit indices for each MAT subtest at each grade level, the original item-examinee response matrix was used. Each examinee's response* to each item was coded 0 or 1 in this matrix. The data on this matrix were used to
*Blanks or omitted responses were treated as incorrect responses and
assigned a 0 value.
1. compute total scores to get an ability estimate for each student;
2. compute a mean score for each item as an estimate of item difficulty (p value); and
3. reorganize the data into an S-P matrix by sorting by total score and by item difficulty.
The resulting matrix had students organized by ability (from most able to least able) and items organized by difficulty (from easiest to hardest). This S-P matrix was used to calculate the five person-fit indices, using computer programs written by this author for each person-fit statistic. Refer to Chapter I for definitions of the formulas. These statistics were programmed using the Statistical Analysis System (SAS) package (Helwig & Council, 1979). The accuracy of each programmed computation was tested using the dummy data set given by Harnisch and Linn (1981, p. 136).
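As one illustration of computing a fit index from such a matrix, the idea behind Donlon and Fischer's personal biserial — correlating one examinee's item scores with the item difficulties — can be sketched with a point-biserial (Pearson) approximation. This is not the exact biserial computation used in the study, and the p values below are hypothetical:

```python
def personal_point_biserial(responses, p_values):
    """Pearson correlation between one examinee's 0/1 item scores and the
    item p values (proportion correct in the norm group).  A conforming
    examinee passes easy items and fails hard ones, yielding a positive
    correlation; aberrant patterns drive the value toward zero or below."""
    n = len(responses)
    mean_x = sum(responses) / n
    mean_p = sum(p_values) / n
    cov = sum((x - mean_x) * (p - mean_p) for x, p in zip(responses, p_values)) / n
    sd_x = (sum((x - mean_x) ** 2 for x in responses) / n) ** 0.5
    sd_p = (sum((p - mean_p) ** 2 for p in p_values) / n) ** 0.5
    return cov / (sd_x * sd_p)

p_values = [0.9, 0.8, 0.6, 0.4, 0.2]    # hypothetical item difficulties
print(personal_point_biserial([1, 1, 1, 0, 0], p_values))  # conforming pattern
print(personal_point_biserial([0, 0, 1, 0, 1], p_values))  # aberrant pattern
```

The conforming pattern yields a strongly positive value and the aberrant pattern a negative one, matching the interpretation of the PB as a conformity measure.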
Analysis
Means, standard deviations, and minimum and maximum values by grade were computed for each person-fit statistic, ability measure, and test anxiety score. For the person-fit indices and ability measures, these descriptive statistics were calculated for each subtest of the MAT (reading, mathematics, and science). Correlations among fit statistics, between person-fit and test anxiety, and between person-fit and ability measures were calculated.
To investigate the relationship between examinees' level of test anxiety and degree of person-fit and to study whether this relationship is dependent upon ability level, a linear multiple regression analysis
was used. In this analysis person-fit measures were the dependent variables, and ability (reading, mathematics, or science) and TASA were the continuous independent variables. The model used for each ability measure and person-fit index is

Y' = b0 + b1X1 + b2X2 + b3X1X2

where Y' = person-fit predicted by the model, b0 = intercept value, b1 = regression slope for the ability independent variable, X1 = ability, estimated from total score, b2 = regression slope for the TASA independent variable, X2 = TASA, b3 = regression slope for the interaction of ability and TASA, and X1X2 = interaction between ability and TASA.
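A least-squares fit of this model can be sketched as follows; the data are simulated from hypothetical coefficients (the study itself used SAS, not this code):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
ability = rng.normal(35, 10, n)   # X1: hypothetical total-score scale
tasa = rng.normal(15, 6, n)       # X2: hypothetical TASA scale

# Generate person-fit values from known (hypothetical) coefficients;
# the data are noiseless, so least squares recovers them exactly.
true_b = np.array([0.30, -0.002, 0.004, -0.0001])

# Design matrix for Y' = b0 + b1*X1 + b2*X2 + b3*X1*X2.
X = np.column_stack([np.ones(n), ability, tasa, ability * tasa])
y = X @ true_b

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(coef, 6))
```

The interaction term enters the design matrix as an ordinary product column, which is all the "interaction between ability and TASA" amounts to computationally.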
To estimate the internal consistency of person-fit indices, items for each MAT subtest at each grade level were divided into odd and even subtests. The original sequential test item number was used for this split. Odd-item and even-item person-fit statistics were computed by following the sequence of steps previously described. The fit index for the odd items was correlated with the fit index for the even items, and the resulting correlation was corrected using the Spearman-Brown formula to obtain an internal consistency estimate for the full-length test.
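The split-half correction can be sketched as follows; the per-examinee odd- and even-half fit values are hypothetical:

```python
def pearson(a, b):
    """Pearson correlation between two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def spearman_brown(r_half):
    """Step a half-test correlation up to a full-length reliability estimate."""
    return 2 * r_half / (1 + r_half)

# Hypothetical person-fit indices computed on odd and even items.
odd_fit  = [0.21, 0.35, 0.18, 0.40, 0.25, 0.30]
even_fit = [0.25, 0.30, 0.20, 0.36, 0.22, 0.33]

r_half = pearson(odd_fit, even_fit)
print(round(spearman_brown(r_half), 3))
```

Because halving a test lowers its reliability, the Spearman-Brown value is always at least as large as the raw odd-even correlation (for positive correlations).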
Summary
A linear multiple regression analysis was used to investigate the relationship between an examinee's level of test anxiety and each of five different person-fit statistics and to study whether this
relationship is dependent on ability level. The five person-fit indices included in this study were the modified caution index, the personal biserial correlation, the norm-conformity index, the Rasch person-fit index, and the extended caution index. A data set of 225 seventh-grade and 188 eighth-grade examinees' responses to the Test Anxiety Scale for Adolescents and to the Metropolitan Achievement Test reading, mathematics, and science subscales was used to compute person-fit statistics and explore the nature of these relationships. Correlations of person-fit statistics between and within the different subject area tests were examined. Person-fit split-half reliabilities were also computed for each index for each content area and grade level.
CHAPTER IV
RESULTS
The present study was undertaken to answer the following questions for each of five selected indices of person-fit:
1. What is the degree of linear relationship between test anxiety and an examinee's level of misfit?
2. What is the degree of linear relationship between ability (as defined by performance on the current test) and level of misfit?
3. To what extent is variance in person misfit explained by a linear combination of the variables: ability level, test anxiety, and their interaction?
Analyses were also performed to explore the degree of relationship among the five selected person-fit indices and to determine their split-half reliabilities.
The results of the analyses presented in this chapter have been
organized beginning with results of simpler analyses and proceeding to those of greater complexity. The issue of the degree of relationship among the five person-fit indices within and across subtests of the MAT is addressed in the next section, Descriptive Statistics. Data relevant to questions 1 and 2 (dealing with the bivariate relationships between test anxiety and misfit and between ability level and misfit) are presented next. Results of multiple regression analyses, relevant to question 3, are presented in the third section of this chapter. The final section of this chapter contains the results
of the investigation of the internal consistency of person-fit indices.
Descriptive Statistics
The mean, standard deviation, and minimum and maximum values for person-fit measures by grade and ability subtest are presented in Table 2. Means and standard deviations for each person-fit index are very similar across grades and ability subject tests. For the reading test, person-fit measures show the greatest dispersion between the minimum and maximum scores. For this subtest of the MAT, the maximum score observed for the personal biserial correlation in both the seventh and eighth grades was greater than one. This was not a computation error but may be ascribed to sampling error or violation of the assumption of an underlying normal distribution for the dichotomous item response variable. Lord and Novick discuss conditions under which biserial correlations may exceed 1.00 (Lord & Novick, 1968, p. 339). Descriptive statistics for ability tests and test anxiety for the seventh and eighth grades are presented in Table 3.
Person-fit measures' intercorrelations for each grade level are presented in Table 4. Seventh-grade intercorrelations are shown above the diagonal and eighth-grade intercorrelations below the diagonal. Results of these correlations indicate that person-fit measures are highly related within the same subject test area. These correlations ranged from |.78| to |.99| and are all significant at an alpha level of .0001. However, across tests in different subject areas, there was little or no relationship between values of the same person-fit measure. The Rasch
Table 2
Descriptive Statistics for Person-Fit Measures by Grade and Ability Test
[Table body illegible in the scanned source. Columns: M, SD, Min., and Max. for Grade 7 and Grade 8; rows: MCI, PB, NCI, Rx2, and ECI for the Reading, Mathematics, and Science tests.]
Table 3
Descriptive Statistics for Ability Tests and Test Anxiety by Grade

                       Grade 7                    Grade 8
Test            M      SD    Min.  Max.    M      SD    Min.  Max.
Reading       35.59  11.62    9    54    39.47  10.48   15    54
Mathematics   30.32   9.44   13    49    33.02   8.88    8    48
Science       33.43   9.87    9    53    35.02   9.13   14    52
TASA          15.05   6.66    1    29    13.75   6.23    1    27
Table 4
Person-Fit Measures Intercorrelations by Grade and Ability Test

[Matrix body illegible in the scanned source: intercorrelations among the ECI, MCI, PB, NCI, and Rx2 within and across the Reading, Mathematics, and Science tests.]

Note: Seventh-grade results are shown above the diagonal and eighth-grade results below the diagonal.
person-fit measure (Rx2) had the lowest relationship to the other indices within the same subject test. Interestingly, the Rx2 is the only index with significant intercorrelations with other indices across subject areas.
Relationship Between Person-Fit, Ability, and Test Anxiety
Correlations between person-fit measures, total ability measures of reading, mathematics, and science, and test anxiety for each grade are presented in Table 5. Correlations between person-fit measures and their corresponding total ability scores ranged from |.00| to |.50|. In general the personal biserial (PB) index had the lowest correlation with total score. This lower correlation was consistently observed across subject test areas and grade levels. Correlation values between person-fit measures and test anxiety ranged from |.02| to |.22|. The highest correlations between person-fit measures and test anxiety occurred for seventh-grade science person-fit values and for eighth-grade reading person-fit values.
Relationship Between Person-Fit and a Linear Combination
of Ability, Test Anxiety, and Their Interaction
Multiple regression results for the model with ability, test anxiety, and their interaction are summarized in Table 6. Values of R2 for the model at each test area are reported. Models with significant ability and test anxiety interactions are identified by having their corresponding R2 underlined. For the seventh-grade sample, a significant proportion of variance in the modified caution index was explained for each of the three subject area tests by the
Table 5
Correlations Between Person-Fit Measures and Total Scores on Corresponding Ability Test and Test Anxiety

                    Grade 7            Grade 8
Person-Fit      Total    TASA      Total    TASA

Reading Test
MCI             .28**    .18**     .32**    .07
PB              .05      .10       .18*     .15*
NCI             .03      .03       .20*     .15*
Rx2             .11      .02       .16*     .17*
ECI             .02      .08       .05      .14

Mathematics Test
MCI             .16*     .03       .09      .06
PB              .04      .04       .01      .03
NCI             .20**    .05       .17*     .03
Rx2             .32**    .10       .30**    .11
ECI             .23*     .05       .20**    .08

Science Test
MCI             .25**    .17*      .40**    .03
PB              .08      .09       .29**    .02
NCI             .22**    .16*      .44**    .03
Rx2             .39*     .22**     .50**    .08
ECI             .28**    .18**     .50**    .04

Note: Higher misfit is represented by higher values on the MCI, Rx2, and ECI and by lower values on the PB and NCI.
*Significant at α = .05.
**Significant at α = .01.
Table 6
Percentage of Variance (R2) in Person-Fit Explained by the Combination of Examinee Ability, Test Anxiety, and Their Interaction

                            Person-Fit
Test            MCI      PB       NCI      Rx2      ECI

Seventh Grade
Reading         .11*     .04*     .01      .02      .03
Mathematics     .04*     .03      .05      —*       .06**
Science         .07**    .01      .06*     —**      .09**

Eighth Grade
Reading         .12**    .05*     .05      .05*     .02
Mathematics     .04      .04      .06*     .10**    .07**
Science         .23**    .13**    —**      .27**    .27*

Note: — = R2 value illegible in the scanned source (significance marks retained). In the original table, R2 values with a significant ability × test anxiety interaction (α = .05) are underlined.
*Significant at α = .05.
**Significant at α = .01.
linear* combination of test anxiety, ability, and their interaction. Only in reading, however, was the percentage of variance explained greater than 10%, and in this case the interaction of ability and test anxiety was significant. For the personal biserial, the norm-conformity index, and the extended caution index, although several significant R2 values were observed, these were all less than .10. Thus it is difficult to interpret these relationships as being substantially important. For the Rasch index of person-fit, substantial proportions of variation were explained by the model in the areas of mathematics and science. No significant interaction effect of ability and test anxiety was detected in either of these cases.
For the eighth grade, the modified caution index again appears to be sensitive to variation in examinees' level of test anxiety, ability, and their interaction. The significant R2 values exceeded 10% for this index in reading and science. Science had the largest R2 (.23) and a significant interaction between test anxiety and ability.
For the personal biserial, the norm-conformity index, and the extended caution index, significant R2 values greater than .10 were found only in the area of science, and for each of these cases the interaction between ability and test anxiety contributed significantly to the overall model. Significant (and meaningful) R2 values for the Rasch Rx2 index were found for the areas of science and mathematics without any interaction between ability and test anxiety.
*Curvilinear relationships between person-fit indices and the ability and TASA measures were tested and found not significant.
For further interpretation of these results, the nature of the interaction effect of test anxiety and ability on person-fit was examined. Significant interactions were followed up by plotting the relationship between person-fit measures and TASA at selected ability levels. The formula used to calculate the regression line at each ability level is
Y' = b0 + b1X1 + (b2 + b3X1)X2
At any value of ability (X1), a predicted person-fit measure (Y') was calculated for different values of TASA (X2). The intercept of this model equals (b0 + b1X1) and the slope equals (b2 + b3X1). Table 7 reports the slope and intercept estimates used in plotting these lines for all cases in which R2 exceeded .10.
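This computation can be sketched as follows. The coefficient values below are hypothetical, chosen only to illustrate the sign reversal of the TASA slope across ability levels; they are not the estimates reported in Table 7.

```python
def regression_line_at_ability(b0, b1, b2, b3, ability):
    """Intercept and slope of the person-fit-on-TASA line at a fixed ability X1."""
    return b0 + b1 * ability, b2 + b3 * ability

# Hypothetical coefficients: a positive TASA slope for low-ability examinees
# that turns negative as ability increases (b3 < 0).
b0, b1, b2, b3 = 0.03, 0.005, 0.007, -0.0002

for ability in (12.35, 35.59, 53.02):  # low, average, and high ability levels
    intercept, slope = regression_line_at_ability(b0, b1, b2, b3, ability)
    print(f"ability {ability:5.2f}: Y' = {intercept:.4f} + ({slope:.5f}) * TASA")
```

With a negative interaction coefficient, the slope on TASA is positive at low ability and negative at high ability, which is the pattern the figures below display.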
The regression lines resulting from these computations are shown in Figures 1-5. It should be noted that Figures 2-5 are based on a single grade level and a single subject area. These plots depict the nature of the interaction of ability and test anxiety on various indices of person-fit. Although the points of intersection of these lines may vary, the same general pattern of relationship between person-fit and test anxiety occurs in all cases. Namely, this pattern of interaction can be characterized as follows:
1. For examinees in the average ability ranges, there is little or no relationship between test anxiety and person misfit. (Note the "flat" slope of the regression line for Group E in all figures.)
2. As examinee ability level increases (see lines for Groups F, G, and H), the slope of the regression lines increases, generally
Table 7
Significant Ability and Test Anxiety Interactions and R2 Increases for Person-Fit Measures

[Table 7 reports, for each model with a significant ability-by-test-anxiety interaction (the seventh-grade reading MCI and the eighth-grade science MCI, PB, NCI, and ECI), the parameter estimate, standard error, t value, and R2 increase for the intercept, slope-ability, slope-TASA, and slope-ability*TASA terms. For the seventh-grade reading MCI, for example, the overall model R2 was .11 (p < .05).]
Figure 1. Relationship between the modified caution index and test anxiety for seventh-grade examinees at different reading ability levels. [Line plot: MCI on the vertical axis (higher values indicate more misfit), TASA on the horizontal axis, with separate regression lines for reading-ability groups A (12.35) through H (53.02).]
Figure 2. Relationship between the modified caution index and test anxiety for eighth-grade examinees at different science ability levels. [Line plot: MCI on the vertical axis (higher values indicate more misfit), TASA on the horizontal axis, with separate regression lines for science-ability groups A (16.76) through H (48.72).]
Figure 3. Relationship between the personal biserial and test anxiety for eighth-grade examinees at different science ability levels. [Line plot: PB on the vertical axis (lower values indicate more misfit), TASA on the horizontal axis, with separate regression lines for science-ability groups A (16.76) through H (48.72).]
Figure 4. Relationship between the norm conformity index and test anxiety for eighth-grade examinees at different science ability levels. [Line plot: NCI on the vertical axis (lower values indicate more misfit), TASA on the horizontal axis, with separate regression lines for science-ability groups A (16.76) through H (48.72).]
Figure 5. Relationship between the extended caution index and test anxiety for eighth-grade examinees at different science ability levels. [Line plot: ECI on the vertical axis (higher values indicate more misfit), TASA on the horizontal axis, with separate regression lines for science-ability groups A (16.76) through H (48.72).]
indicating an increasingly negative relationship between test anxiety and person-fit. Specifically, high-ability, low-anxious examinees show more misfit than high-ability, high-anxious examinees.
3. As examinee ability decreases (see lines for Groups C, B, and A), the opposite effect occurs. Namely, low-ability, low-anxious examinees show less misfit than low-ability, high-anxious examinees.
When no interactions were significant, only ability main effects were significant. Table 8 presents the Type I sums of squares, which measure the incremental sum of squares for the model as each variable is added. For the Rasch person-fit index, the ability main effects for mathematics and science were significant at both the seventh and eighth grades. There was also a significant main effect of reading ability for the model with the modified caution index at the eighth grade.
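The sequential logic behind Type I sums of squares can be sketched as follows, using simulated data; the variable names (ability, tasa) are illustrative, not the study's data. Each term's sum of squares is the reduction in residual sum of squares when that term is added to a model already containing the earlier terms.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
ability = rng.normal(35, 8, n)
tasa = rng.normal(14, 6, n)
fit = 0.005 * ability + rng.normal(0, 0.1, n)  # simulated person-fit, driven by ability only

def rss(X, y):
    """Residual sum of squares from an ordinary least-squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ beta) ** 2))

ones = np.ones(n)
steps = [
    ("ability", np.column_stack([ones, ability])),
    ("TASA", np.column_stack([ones, ability, tasa])),
    ("ability*TASA", np.column_stack([ones, ability, tasa, ability * tasa])),
]
prev = rss(ones[:, None], fit)  # intercept-only model
for name, X in steps:
    cur = rss(X, fit)
    print(f"Type I SS for {name}: {prev - cur:.4f}")
    prev = cur
```

Because the models are nested, each increment is nonnegative, and a variable entered later is credited only with variance not already explained by earlier terms.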
Internal Consistency of Person-Fit Statistics
Corrected split-half internal consistency reliability coefficients for person-fit measures, by grade and subject content area, are reported in Table 9.
For the seventh-grade sample, the internal consistency estimates ranged from .23 to .56. The highest person-fit split-half reliability estimates were found consistently for the reading subtest. For the eighth grade, the range of values was from .23 to .39, with a slight trend for reading to have the higher values.
Among person-fit indices, the Rx2 index had the highest internal consistency estimates for the eighth grade (ranging from .31 to .39), but for the seventh grade, no person-fit index was consistently more
Table 8
Significant Main Effect of Ability as Predictor of Person-Fit Measures

Dependent                                Type I
Variable  Source            df   Sum of Squares  Mean Square  F        R2

Seventh Grade: Mathematics
Rx2       Model              3   .50             .17                   .11*
          Mathematics        1   .47             .47          25.30*
          TASA               1   .01             .01          .47
          Mathematics*TASA   1   .02             .02          1.12
          Error            221   4.01            .02

Seventh Grade: Science
Rx2       Model              3   .50             .17          13.73*   .15*
          Science            1   .49             .49          39.95*
          TASA               1   .01             .01          1.21
          Science*TASA       1   .00             .00          .01
          Error            221   2.70            .01

Eighth Grade: Reading
MCI       Model              3   .31             .10          7.37*    .12*
          Reading            1   .27             .27          20.3*
          TASA               1   .04             .04          3.28
          Reading*TASA       1   .00             .00          .17
          Error            180   2.36            .01

Eighth Grade: Mathematics
Rx2       Model              3   .44             .15          7.21*    .10*
          Mathematics        1   .38             .38
          TASA               1   .01             .01          .25
          Mathematics*TASA   1   .06             .06
          Error            184   3.36            .02

Eighth Grade: Science
Rx2       Model              3   .67             .22
          Science            1   .63             .63          53.35*
          TASA               1   .01             .01
          Science*TASA       1   .02
          Error            182   1.31            .01

*Significant at α = .05.
Table 9
Corrected Split-Half Reliability Estimates for Person-Fit Measures by Grade and Ability Test

                       Person-Fit
Test           MCI     PB      NCI     Rx2     ECI

Seventh Grade
Reading        .56     .46     .49     .45
Mathematics    .28     .29     .25     .2      .29
Science        .25     .20     .23     .29     .25

Eighth Grade
Reading        .35     .37     .31     .39     .33
Mathematics    .29     .37     .33     .29     .26
Science        .25     .23     .26     .32     .25
reliable than the others. Overall, with the exception of the reading subtest, the reliabilities for person-fit indices for the mathematics and science subtests appear consistent across the various indices and are relatively low.
Summary
Results can be summarized as follows:
1. Descriptive statistics for person-fit measures, test anxiety, and ability scores were very similar across subject content areas and grades.
2. Intercorrelations between person-fit measures showed that these measures are highly related within the same subject content area. Across subject areas, little or no relationship was found.
3. Correlations between person-fit measures and their corresponding total ability score ranged from |.00| to |.50|. The PB index had the lowest correlation with total score.
4. Correlations between person-fit measures and test anxiety scores ranged from |.02| to |.22|. In science, four of the five indices were significantly related to test anxiety scores for seventh graders. For eighth graders in reading, three of the five indices were significantly related to test anxiety.
5. A significant proportion of variance in person-fit measures was explained by the linear combination of test anxiety, ability, and their interaction for the seventh- and eighth-grade reading MCI and for the eighth-grade science MCI, PB, NCI, and ECI. Regression lines depicting the nature of these interactions were presented for selected ability levels. Significant and meaningful R2 values (greater than R2 = .10) for the Rasch person-fit index were found for science and mathematics without any interaction between ability and test anxiety.
6. Corrected split-half internal consistency estimates for the person-fit indices ranged from .20 to .56.
CHAPTER V
DISCUSSION
This study was conducted to investigate the relationship between examinees' level of person-fit and test anxiety, and to study the effect of ability on this relationship. Five person-fit indices were calculated for seventh- and eighth-grade students who had taken a test-anxiety self-report measure and the reading, mathematics, and science subtests of the MAT.
Discussion of the results focuses on findings about (1) the interrelationships among the five person-fit indices, (2) the relationship between person-fit, test anxiety, and ability, and (3) the reliability of the five person-fit statistics under study.
Relationships Among Person-Fit Statistics
Intercorrelations among measures of person-fit were quite high within same-subject content areas. The correlation values ranged from |.78| to |.99|. These high intercorrelations among person-fit statistics confirm previous research findings by Harnisch and Linn (1981) and Rudner (1983). In particular, Harnisch and Linn found intercorrelations among the MCI, PB, and NCI that ranged from |.93| to |.97| for mathematics tests and from |.89| to |.95| for reading tests. Rudner's intercorrelations among the MCI, PB, NCI, and Rx2 ranged from |.80| to |.99| for a simulated SAT test and from |.77| to |.99| for a simulated teacher-made biology test. These consistently high intercorrelations indicate that the person-fit indices under study seem to be measuring a common construct.
Relatively low correlations were found for any given index (MCI, PB, NCI, Rx2, and ECI) across the reading, mathematics, and science tests. These correlations ranged from .03 to .24. There seems to be no persistence of misfit across the different tests. These results call into question the notion that a tendency to misfit is a stable trait which consistently manifests itself in examinee performance across various subject areas. Frary's (1982) correlations of the same person-fit index across several tests were also very low.
These results lead to two practical points to consider when interpreting person-fit results: if an examinee is identified as having poor person-fit by one index, another index will probably identify him or her as a misfit on the same test; but it cannot be concluded that he or she will also be a misfit on another test.
Relationship Between Person-Fit, Ability, and Test Anxiety
Low to moderate correlations were obtained between person-fit indices and their corresponding total ability scores. The Rasch person-fit index had the highest correlations with total mathematics and science ability scores. For the reading test, the MCI had the highest correlation with total score, but this correlation was positive, whereas a negative correlation would have been expected. The reading subtest was not typical of a power test, since most examinees did not finish all the items. More able examinees probably attempted more items, passing end items that were more "difficult" than some items which they had missed earlier in the test. This caused more able students to receive higher misfit classifications, as can be seen in Figure 1. The only person-fit index that did not seem to be affected by the speededness of the reading test was the Rx2. Although its relationship to total reading score was low, it was in the direction expected.
The PB index had the lowest correlations with total scores. Contrary to the present findings, Harnisch and Linn (1981) found that the PB was one of the indices with a high relationship to total score on the reading test (.63). The relationship of the PB to total mathematics score in the Harnisch and Linn study was nevertheless somewhat lower (.28) than for some other indices.
Correlations between person-fit measures and test anxiety were relatively low, ranging from |.02| to |.22|. The Rx2 index had the highest correlations with test anxiety. The only case in which the Rx2 correlated lowest with test anxiety was the seventh-grade reading test. This low correlation is ascribed to the speeded nature of the reading test, which was more noticeable at the seventh grade. These correlation values are disappointingly low from a practical perspective; however, the observed correlations may have been attenuated by the extremely low reliability of the person-fit measures and by the use of a general (trait) measure of test anxiety instead of a state-specific measure. As an exploratory exercise, one could speculate about the nature of this relationship if person-fit could be more reliably measured. A correction for attenuation (Magnusson, 1966, p. 148) was used to estimate what the correlation between person-fit and test anxiety would be if these measures were perfectly reliable. The corrected correlations between person-fit measures and test anxiety ranged from |.03| to |.44|. Even with the correction for attenuation, most of these correlations are still relatively low, and the present study offers little evidence that higher reliabilities can be achieved for person-fit measures.
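The correction applied here divides the observed correlation by the geometric mean of the two reliabilities. A minimal sketch follows; the observed correlation and reliability values in the example are hypothetical, not values from the study.

```python
import math

def disattenuated_r(r_xy, rel_x, rel_y):
    """Estimated correlation if both measures were perfectly reliable."""
    return r_xy / math.sqrt(rel_x * rel_y)

# Hypothetical values: observed r = .20, person-fit reliability = .30,
# test-anxiety reliability = .85.
print(round(disattenuated_r(0.20, 0.30, 0.85), 2))  # → 0.4
```

As the example shows, even a doubling of the correlation under perfect reliability leaves only a modest relationship, which is consistent with the corrected range reported above.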
To explore the extent to which variance in person misfit can be explained by a linear combination of ability level, test anxiety, and their interaction, this multiple regression model was tested for each person-fit statistic at each grade level and for each ability measure. Because of the large sample sizes, a number of fairly small R2 values were statistically significant, so only cases with a meaningful percentage of explained variance (greater than 10%) were considered.
The significant interactions between ability and test anxiety demonstrate that ability level moderates the relationship between person-fit measures and test anxiety. For lower-ability examinees, a direct relationship between person-fit measures and test anxiety was found. For higher-ability examinees, this relationship was found to be inverse.
Figures 1 and 2 present the two general pictures of these interactions. Figure 1 shows the relationship between the modified caution index and test anxiety, as measured by TASA, for seventh-grade examinees at different reading levels. The nature of the interaction is the same as previously described, but higher-ability students have higher overall MCI values (more misfit). For all the other significant interactions, which were for person-fit measures of science at the eighth grade, lower-ability students have higher values of misfit. Figure 2 shows this for the MCI. The difference found for these two content areas might be due to the performance of this sample on the reading subtest. More than 10 percent of the seventh graders missed the last fifteen items of the reading test, making it appear more like a speeded test than a power test. Higher-ability examinees probably attempted more end items, increasing their probability of receiving higher-misfit classifications.
The cognitive-attentional theory of test anxiety, the Spence-Taylor drive theory, and previous person-fit research findings suggest interpretive explanations of the interaction results. According to Tobias (1980) and Weinstein, Cubberly, and Richardson (1982), high-test-anxious students will perform worse on difficult materials than low-test-anxious individuals, while with easier material little difference between anxiety levels is expected. In reporting results on cognitive coping behavior and anxiety, Houston (1982) suggests that "highly trait-anxious (and test anxious) individuals tend to lack organized ways of coping with stress and instead ruminate about themselves and the situation in which they find themselves" (p. 198).
Since high-test-anxious students' performance on more difficult items is more affected by their anxiety, these examinees will have a harder time coping as they reach items whose difficulty approximates their ability level. Other testing strategies, such as test-wiseness, would not be readily available because of lack of concentration. Examinees with high ability and low test anxiety could take advantage of test-taking skills and answer items correctly beyond their ability level. Since these items would not be answered with the same degree of certainty as easier items, more deviation from the expected response pattern could occur and higher misfit would result. For examinees with high ability and high test anxiety, coping and test-taking skills could be blocked, making them consistently miss harder items. Lower misfit values would occur for this group. Examinees with lower ability and lower test anxiety would answer items correctly up to the point where they reach items at their ability level and then miss the harder items. Some misfit could occur from attempts at harder items. For examinees with low ability and high test anxiety, distracting thoughts might interfere with performance on almost all items. Even easy items (relative to their ability level) could be missed; this sporadic answering pattern would classify this group as high misfits.
These interaction effects between ability and test anxiety appear more consistently for science, especially at the eighth grade. One possible explanation is that the standardized science test fits the curriculum less well than the reading or mathematics tests. The mean item difficulty for the science test is lower, indicating a harder test. Examinees taking the science test might find themselves in a more ambiguous, and hence more anxiety-producing, situation.
These findings are primarily of theoretical interest to those who may want to learn more about the constructs of test anxiety or person-fit. At best, the combination of ability, test anxiety, and their interaction appears to account for only about one-fourth of the variance in person-fit indices, and the increments in R2 obtained by adding the interaction term to the regression model were small: the interaction term accounted for only about two percentage points of the R2 values.
Reliability of Person-Fit Measures
Corrected odd-even split-half reliability estimates for the person-fit indices were low, ranging from .20 to .56. These internal consistency estimates were highest for the person-fit indices computed on the seventh-grade reading test. Part of the reason for these higher reliabilities could be the speeded nature of the reading test at this grade level, since the original sequential item numbers were used to split the test into odd-even subtests. Magnusson (1966) cautions against using split-half methods on timed tests: "the time limit has the effect that in reliability computations with split-half methods the test's reliability tends to be overestimated" (p. 114).
Frary (1982) analyzed the internal consistency of person-fit indices and also found low, and even negative, split-half coefficients. His findings led him to conclude that unexpected responses to a small number of items contributed to high misfit classifications and that there was little consistency in which specific items contributed most to misfit.
These findings call into question the notion of person-fit as a stable trait that can be reliably measured, and they also question the potential utility of these indices. Frary summarizes this concern when he suggests "that use of person-fit measures for any decision-making purpose, especially with respect to individual examinees, should be undertaken only with extreme caution and that substantial additional research will be required before they can be used routinely" (Frary, 1982, p. 17).
Summary
The relationships among person-fit statistics were quite high within same-subject content areas but not across different-subject tests. It can be generalized that if an examinee is identified as having poor person-fit by one index, another index will probably identify this examinee as a misfit on the same test, but it cannot be expected that this examinee will also be a misfit on another test.
Significant interactions between ability and test anxiety demonstrate that ability level moderates the relationship between person-fit measures and test anxiety. For lower-ability examinees, a direct relationship between person-fit measures and test anxiety was found. For higher-ability examinees, this relationship was inverse.
The cognitive-attentional theory of test anxiety and the Spence-Taylor drive theory suggest interpretive explanations for these interaction results. These findings are of theoretical interest to those interested in learning more about the constructs of test anxiety and person-fit. At best, the combination of ability, test anxiety, and their interaction appears to account for only about one-fourth of the variance in person-fit indices.
Internal consistency (split-half) reliabilities were relatively low, and the present study offers little evidence to support the notion that higher reliability of person-fit indices could be achieved.
According to these results, the potential uses of person-fit indices are questionable at this time. More research is needed before person-fit indices can be recommended as routine measures with achievement tests.
REFERENCES
Alpert, R., & Haber, R.N. (1960). Anxiety in academic achievement situations. Journal of Abnormal and Social Psychology, 61, 207-215.
Donlon, T.F., & Fischer, F.E. (1968). An index of an individual's agreement with group-determined item difficulties. Educational and Psychological Measurement, 28, 105-113.
Frary, R.B. (1982). A comparison among person-fit measures. Paper presented at the annual meeting of the American Educational Research Association, New York.
Frary, R.B., & Giles, M.B. (1980). Multiple-choice test bias due to answering strategy variations. Paper presented at the annual meeting of the National Council on Measurement in Education, Boston, Mass.
Gaier, E.L., & Lee, M.C. (1953). Pattern analysis: The configural approach to predictive measurement. Psychological Bulletin, 50, 140-148.
Guttman, L. (1941). The quantification of a class of attributes: A theory and method of scale construction. In P. Horst, P. Wallin, & L. Guttman (Eds.), The prediction of personal adjustment. New York: Social Science Research Council, Committee on Social Adjustment.
Guttman, L. (1950). The basis for scalogram analysis. In S. Stouffer, L. Guttman, E. Suchman, P. Lazarsfeld, S. Star, & J. Clausen (Eds.), Measurement and prediction (Vol. 6). Princeton, N.J.: Princeton University Press, 60-90.
Hambleton, R.K., & Cook, L.L. (1977). Latent trait models and their use in the analysis of educational test data. Journal of Educational Measurement, 14, 75-96.
Harnisch, D.L. (1983). Item response patterns: Applications for educational practice. Journal of Educational Measurement, 20, 191-206.
Harnisch, D.L., & Linn, R.L. (1981). Analysis of item response patterns: Questionable test data and dissimilar curriculum practices. Journal of Educational Measurement, 18, 133-146.
Harper, F.B.W. (1974). The comparative validity of the Mandler-Sarason Test Anxiety Questionnaire and the Achievement Anxiety Test. Educational and Psychological Measurement, 34, 961-966.
Heinrich, D.L., & Spielberger, C.D. (1982). Anxiety and complex
learning. In H.W. Krohne & L. Laux (Eds.), Achievement, stress,
and anxiety. Washington, D.C.: Hemisphere.
Helwig, J.T., & Council, K.A. (Eds.). (1979). SAS user's guide, 1979
edition. Raleigh, N.C.: SAS Institute.
Houston, K. (1982). Trait anxiety and cognitive coping behavior. In
I.G. Sarason & C.D. Spielberger (Eds.), Achievement, stress, and
anxiety. Washington, D.C.: Hemisphere.
Kane, M.T., & Brennan, R.L. (1980). Agreement coefficients as indices of dependability for domain-referenced tests. Applied Psychological Measurement, 4, 105-126.
Levine, M.V., & Rubin, D.B. (1979). Measuring the appropriateness of multiple-choice test scores. Journal of Educational Statistics, 4, 269-290.
Lord, F.M., & Novick, M.R. (1968). Statistical theories of mental
test scores. Reading, Mass.: AddisonWesley.
Magnusson, D. (1966). Test theory. Reading, Mass.: AddisonWesley.
Mandler, G., & Sarason, S.B. (1952). A study of anxiety and learning. Journal of Abnormal and Social Psychology, 47, 166-173.
Prescott, G.A., Balow, I.H., Hogan, T.P., & Farr, R.C. (1978).
Teacher's manual for administering and interpreting Metropolitan Achievement Tests (Advanced 1, Forms JS and KS). New York: The
Psychological Corporation.
Rudner, L.M. (1982). Individual assessment accuracy. Paper presented at the annual meeting of the American Educational Research Association, New York.
Rudner, L.M. (1983). Individual assessment accuracy. Journal of Educational Measurement, 20, 207-220.
Sarason, I.G. (1960). Empirical findings and theoretical problems in the use of anxiety scales. Psychological Bulletin, 57, 403-415.
Sarason, I.G. (1963). Test anxiety and intellectual performance. Journal of Abnormal and Social Psychology, 66, 73-75.
Sarason, I.G. (1972). Experimental approaches to test anxiety:
Attention and the use of information. In C.D. Spielberger (Ed.),
Anxiety: Current trends in theory and research (Vol. 2). New
York: Academic Press.
63
Sarason, I.G. (1975). Anxiety and self-preoccupation. In I.G. Sarason & C.D. Spielberger (Eds.), Stress and anxiety (Vol. 2). Washington, D.C.: Hemisphere.
Schmitt, A.P., & Crocker, L. (1982). Test anxiety and its components for middle school students. Journal of Early Adolescence, 2, 267-275.
Spielberger, C.D. (1966). Theory and research on anxiety. In C.D. Spielberger (Ed.), Anxiety and behavior. New York: Academic Press.
Spielberger, C.D. (1971). Trait-state anxiety and motor behavior. Journal of Motor Behavior, 3, 265-279.
Spielberger, C.D., Gonzalez, H.P., Taylor, C.J., Algaze, B., & Anton,
W.D. (1978). Examination stress and test anxiety. In C.D.
Spielberger & I.G. Sarason (Eds.), Stress and anxiety (Vol. 5).
New York: Hemisphere/Wiley.
Tatsuoka, K., & Linn, R.L. (1981). Indices for detecting unusual item response patterns in personnel testing: Links between direct and item-response-theory approaches (Research Report 81-5). Urbana: University of Illinois, Computer-Based Education Research Laboratory.
Tatsuoka, K., & Linn, R.L. (1983). Indices for detecting unusual patterns: Links between two general approaches and potential applications. Applied Psychological Measurement, 7, 81-96.
Tatsuoka, K., & Tatsuoka, M.M. (1980). Detection of aberrant response patterns and their effect on dimensionality (Research Report 80-4). Urbana: University of Illinois, Computer-Based Education Research Laboratory.
Tobias, S. (1980). Anxiety and instruction. In I.G. Sarason (Ed.),
Test anxiety: Theory, research, and applications. Hillsdale,
N.J.: Lawrence Erlbaum Associates.
Tryon, G.S. (1980). The measurement and treatment of test anxiety. Review of Educational Research, 50, 343-372.
Van der Flier, H. (1977). Environmental factors and deviant response patterns. In Y.H. Poortinga (Ed.), Basic problems in cross-cultural psychology. Lisse, Netherlands: Swets & Zeitlinger.
Van der Flier, H. (1982). Deviant response patterns and comparability of test scores. Journal of Cross-Cultural Psychology, 13, 267-298.
Weinstein, C.E., Cubberly, W.E., & Richardson, F.C. (1982). The effects of test anxiety on learning at superficial and deep levels of processing. Contemporary Educational Psychology, 7, 107-112.
Wine, J.D. (1971). Test anxiety and direction of attention. Psychological Bulletin, 76, 92-104.
Wine, J.D. (1980). Cognitive-attentional theory of test anxiety. In I.G. Sarason (Ed.), Test anxiety: Theory, research, and applications. Hillsdale, N.J.: Lawrence Erlbaum Associates.
Wright, B.D., Mead, R., & Bell, S. (1979). BICAL: Calibrating items with the Rasch model (RM 23b). Chicago: University of Chicago.
Wright, B.D., & Panchapakesan, N.A. (1969). A procedure for sample-free item analysis. Educational and Psychological Measurement, 29, 23-48.
BIOGRAPHICAL SKETCH
Alicia P. Schmitt was born in Havana, Cuba, on September 28,
1952. She immigrated to the United States in 1961 and moved to Puerto Rico in 1962. In 1970 she graduated from high school and entered the
University of Puerto Rico, graduating in 1973 with a bachelor's degree and in 1975 with a master's degree in psychology. From 1975 to 1977 she worked as Evaluation Coordinator for Project Follow Through and taught evening courses at the University of Puerto Rico.
Alicia moved to Gainesville in 1977 and began her doctoral
program at the University of Florida in the fall of 1978. While in graduate school she held a variety of assistantships. For a period of two years, she served as research consultant for the Research Clinic of the College of Education. As part of the duties in other assistantships she developed and taught the Independent Study by Correspondence Course in Measurement and Evaluation in Education; assisted in measurement, research, and statistics courses in the
College of Education; and worked as research assistant for a graduate school dean.
In 1983 she became Assistant Director of Testing and Evaluation for the Office of Instructional Resources, University of Florida. In this capacity she administered the College of Education Basic Skills Testing Program and analyzed institutional studies used for educational planning. Alicia has accepted a position with Educational Testing Service and will be moving to Princeton, New Jersey.
I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation, and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.
Linda M. Crocker, Chair Professor of Foundations of Education
I certify that I have read this study and that in my opinion it
conforms to acceptable standards of scholarly presentation, and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.
James J. Algina
Associate Professor of Foundations of Education
I certify that I have read this study and that in my opinion it
conforms to acceptable standards of scholarly presentation, and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.
Professor of Psychology
This dissertation was submitted to the Graduate Faculty of the College of Education and to the Graduate School and was accepted as partial fulfillment of the requirements for the degree of Doctor of Philosophy.
December, 1984
Chairman, Department of Foundations of Education

Dean, College of Education
Dean for Graduate Studies and
Research

THE RELATIONSHIP BETWEEN TEST ANXIETY AND STATISTICAL MEASURES OF PERSON FIT ON ACHIEVEMENT TESTS By ALICIA P. SCHMITT A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY UNIVERSITY OF FLORIDA 1984
To my father, who taught me the value of education
ACKNOWLEDGMENTS I would like to express my sincere appreciation to Dr. Linda Crocker, chair of my doctoral committee. Her standard of excellence and sound advice have guided my doctoral studies. She has always provided invaluable opportunities for learning. I am also extremely grateful to Dr. James Algina who was continually available for consultation and guidance, and to Dr. Marvin E. Shaw who gave freely of his patient and quiet encouragement. I am grateful to each of these members of my doctoral committee for helping me reach this point in my career. My husband, Jeff, deserves special recognition since he lived with and stood by me through this special time in my life. He was always encouraging and helpful. This degree belongs to him as much as to me. To my sister, Amelia, I am indebted for providing a consistent model of perseverance and strength. I am also grateful to my family, friends, and coworkers, who always encouraged me and knew that I would finish, and to Adele Koehler, my typist, who always, with a smile, helped me meet my deadlines. Finally, I thank the Alachua County School Board for providing the data used in this study. iii
TABLE OF CONTENTS

                                                          Page

ACKNOWLEDGMENTS                                            iii
LIST OF TABLES                                              vi
LIST OF FIGURES                                            vii
ABSTRACT                                                  viii

CHAPTER I  INTRODUCTION                                      1
  Statement of the Problem                                   1
  Purpose of the Study                                       2
  Theoretical Rationale                                      3
  Definition of Technical Terms                              5
    Student-Problem Table (S-P table)                        6
    Modified Caution Index (MCI)                             7
    Personal Biserial Correlation (PB)                       7
    Norm Conformity Index (NCI)                              8
    Rasch Person-Fit Statistic (Rx²)                         9
    Extended Caution Index (ECI)                            10
  Assumptions                                               11
  Educational Significance                                  11
  Summary                                                   12

CHAPTER II  REVIEW OF LITERATURE                            14
  Person-Fit Measures                                       14
    Historical Background                                   14
    Types of Person-Fit Indices                             16
    Comparative Studies of Person-Fit Indices               18
    Person-Fit Indices Under Study                          20
  Test Anxiety                                              21
  Summary                                                   23

CHAPTER III  METHODOLOGY                                    25
  Examinees                                                 25
  Instruments                                               26
  Creation of the Data File                                 27
  Calculation of Person-Fit Statistics                      27
  Analysis                                                  28
  Summary                                                   29

iv
                                                          Page

CHAPTER IV  RESULTS                                         31
  Descriptive Statistics                                    32
  Relationship Between Person Fit, Ability, and Test Anxiety  36
  Relationship Between Person Fit and a Linear Combination of Ability, Test Anxiety, and Their Interaction  36
  Internal Consistency of Person-Fit Statistics             47
  Summary                                                   50

CHAPTER V  DISCUSSION                                       52
  Relationships Among Person-Fit Statistics                 52
  Relationship Between Person Fit, Ability, and Test Anxiety  53
  Reliability of Person-Fit Measures                        58
  Summary                                                   59

REFERENCES                                                  61

BIOGRAPHICAL SKETCH                                         65

v
LIST OF TABLES

Table                                                     Page

1. S-P Table for Five Examinees and Six Items (Ideal Pattern)  6
2. Descriptive Statistics for Person-Fit Measures by Grade and Ability Test  33
3. Descriptive Statistics for Ability Tests and Test Anxiety by Grade  34
4. Person-Fit Measures Intercorrelations by Grade and Ability Test  35
5. Correlations Between Person-Fit Measures and Total Scores on Corresponding Ability Test and Test Anxiety  37
6. Percentage of Variance (R²) in Person Fit Explained by the Combination of Examinee Ability, Test Anxiety, and Their Interaction  38
7. Significant Ability and Test Anxiety Interactions and R² Increases for Person-Fit Measures  41
8. Significant Main Effect of Ability as Predictor of Person-Fit Measures  48
9. Corrected Split-Half Reliability Estimates for Person-Fit Measures by Grade and Ability Test  49

vi
LIST OF FIGURES

Figure                                                    Page

1. Relationship between the modified caution index and test anxiety for seventh-grade examinees at different reading ability levels  42
2. Relationship between the modified caution index and test anxiety for eighth-grade examinees at different science ability levels  43
3. Relationship between the personal biserial and test anxiety for eighth-grade examinees at different science ability levels  44
4. Relationship between the norm conformity index and test anxiety for eighth-grade examinees at different science ability levels  45
5. Relationship between the extended caution index and test anxiety for eighth-grade examinees at different science ability levels  46

vii
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy THE RELATIONSHIP BETWEEN TEST ANXIETY AND STATISTICAL MEASURES OF PERSON FIT ON ACHIEVEMENT TESTS by Alicia P. Schmitt December, 1984 Chair: Linda M. Crocker Major Department: Foundations of Education The purpose of this study was to investigate the relationship between an examinee's level of test anxiety and each of five different person-fit statistics and to establish if this relationship is dependent on ability level. A secondary interest was to determine the relationship among person-fit indices within and across different subject areas of a standardized achievement test and to assess the internal consistency of person-fit indices. An existing data set was analyzed to explore the nature of these relationships for the modified caution index, the personal biserial correlation, the norm conformity index, the Rasch person-fit index, and the extended caution index. Achievement test scores on the reading, mathematics, and science subtests of the Metropolitan Achievement Test (MAT) and scores on the Test Anxiety Scale for Adolescents were used as estimates of ability and anxiety. The item scores and total test scores of 225 seventh-graders and 188 eighth-graders of a metropolitan middle school comprised this data set. viii
Intercorrelations among the measures of person fit were in the .80s to .90s within same-subject content areas. Across subject content areas little or no relationship was found. Low to moderate correlations were obtained between person-fit indices and their corresponding ability scores (|.00| to |.50|) and test anxiety (|.02| to |.22|). For four of the measures of person fit, on one or more of the subject tests, a significant proportion of variance was explained by the linear combination of ability, test anxiety, and their interaction. In these cases ability levels moderate the relationship between person-fit measures and test anxiety; for lower-ability examinees the relationship is direct, but for higher-ability examinees the relationship is inverse. Only the Rasch person-fit index was consistently unaffected by this interaction. Corrected split-half reliability estimates of person-fit indices were low (.20 to .56), indicating little consistency of the trait. According to these results, the potential uses of person-fit indices are questionable at this time. More research is needed before these measures can be recommended for routine use in interpretation of achievement test scores for individual examinees. ix
CHAPTER I INTRODUCTION Statement of the Problem Although total scores have been consistently used as the basis to evaluate educational achievement, analysis of item-response patterns can contribute additional information that may be useful in the interpretation of an overall score. Analysis of response patterns can be based on two dimensions: item difficulty and examinee ability. Ability is typically estimated by the total score on the test of interest, and item difficulty, by the proportion of examinees answering the item correctly. If the items are arranged in ascending order of difficulty, an examinee with a given ability should answer items correctly until the point where his or her ability matches the difficulty of the items, and miss each item thereafter. Deviations from the expected response pattern occur when the pattern of passed and missed items is not consistent. If a person misses easier items but then responds correctly to harder items, there is deviation from the expected response and misfit occurs. With the introduction of the scalogram technique, Guttman (1941, 1950) was one of the first social scientists to suggest that some persons respond consistently to a given set of ordered stimuli (test items) while others do not. Under Guttman's scale theory, a response pattern where a student passing a more difficult item also responds
correctly to all easier items is called a perfect simplex, and the scale or test in such a situation is called a perfect scale. During the late 1970s and early 1980s there has been a resurgence of interest in using information provided by response patterns. A number of person-fit statistics have been developed to provide a measure of an individual examinee's deviation from the expected response pattern to a given set of items. Although some studies have shown that indices of person fit are highly correlated (Harnisch & Linn, 1981; Rudner, 1983), attempts to identify causes of person misfit (or even personality or demographic correlates of it) have remained mainly speculative. Some researchers, such as Frary (1982) and Harnisch and Linn (1981), have suggested that one factor which may contribute to person misfit on cognitive tests is test anxiety, but prior to this study there has been no empirical investigation to test this hypothesis. Purpose of the Study The present exploratory study was designed to investigate the nature of the relationship between measures of person fit and test anxiety. For each of five selected indices of fit (modified caution index, personal biserial correlation, norm conformity index, Rasch person-fit index, and an extended caution index), the following questions were asked: 1. What is the degree of linear relationship between test anxiety and an examinee's level of misfit?
2. What is the degree of linear relationship between ability (as defined by performance on the current achievement test) and level of misfit? 3. To what extent is variance in person misfit explained by a linear combination of the variables: ability level, test anxiety, and their interaction? A secondary interest in this investigation was to explore the degree of relationship among the five selected person-fit indices within and across subject area tests of an achievement battery and to estimate their internal consistencies. This information was considered important because the tests used in this case were subtests from a well-known, nationally norm-referenced, standardized achievement test battery. Earlier studies of the interrelationship and reliability of person-fit indices have typically been based upon state minimal competency examinations (Harnisch & Linn, 1981) or locally developed teacher-made tests (Frary, 1982). Theoretical Rationale To date most research on test anxiety has considered primarily the effects on examinees' total test scores. Recently, Frary (1982) and Harnisch and Linn (1981) have suggested that test anxiety may be a factor which contributes to erratic performance of an examinee within a given test (e.g., missing relatively easy items while answering more difficult items correctly). Careless errors and lack of concentration by high-test-anxious individuals could change the pattern of item responses from the pattern that would be expected.
Two theories predict the effect of anxiety on performance. According to the cognitive-attentional theory of test anxiety, highly anxious students attend to self-relevant variables instead of to task-relevant variables, negatively affecting their performance (Wine, 1980). In an analysis of Spielberger's (1966, 1971) extension of Spence-Taylor drive theory, Heinrich and Spielberger (1982) make several predictions about the effect of ability and anxiety on performance of tasks with varying levels of difficulty that seem relevant to this study of test anxiety and person fit. These are as follows:

1. For subjects with superior intelligence, high anxiety will facilitate performance on most learning tasks. While high anxiety may initially cause performance decrements on very difficult tasks, it will eventually facilitate the performance of bright subjects as they progress through the task and correct responses become dominant.

2. For subjects of average intelligence, high anxiety will facilitate performance on simple tasks and, later in learning, on tasks of moderate difficulty. On very difficult tasks, high anxiety will generally lead to performance decrements.

3. For low intelligence subjects, high anxiety may facilitate performance on simple tasks that have been mastered. However, performance decrements will generally be associated with high anxiety on difficult tasks, especially in the early stages of learning. (Heinrich & Spielberger, 1982, p. 147)

According to these predictions, response patterns and person-fit statistics will be different for high-, average-, and low-ability examinees depending on their anxiety levels. The predicted effect for high and low anxious students at these three ability levels would be as follows: 1. Subjects with high ability and high test anxiety would be expected to initially fail hard items, but since during testing conditions examinees receive no feedback, correct responses will not
be expected to become dominant. These examinees would continue to have occasional difficulty on harder items, but their high levels of test anxiety would facilitate performance on easier items. Moderate to low misfit would be expected. For subjects with high ability and low test anxiety, interference in performance on difficult items is not expected. Due to less attentional interference, use of test-taking strategies might also be more accessible. These students might be more open to guessing on harder items. Moderate to high misfit could be expected. 2. For subjects with average ability, high anxiety will help with easy to moderately difficult items, but will interfere with harder items. A low to moderate misfit would be predicted. Similarly, low test anxiety is not expected to differentially affect item responses for average-ability examinees. 3. For low-ability subjects, high anxiety may help with the easier items but will interfere with performance on more difficult items. If these examinees do not feel very confident in their knowledge, high anxiety might not help but affect their concentration, making them answer in a more spurious manner. In this last case, higher misfit might occur. When low-ability subjects are also low in test anxiety no interference is expected; these students will probably direct their attention to the easier items they master. A low to moderate misfit is expected. Definition of Technical Terms Definitions and formulas required to explain major technical terms used in this study are as follows:
Student-Problem Table (S-P table) The S-P table is used to organize test information into a matrix of zeros and ones. The rows in this matrix represent the students ranked from highest to lowest according to total test score. The columns represent the items arranged from left to right in ascending order of difficulty. Correct responses are represented by ones and incorrect responses are represented by zeros. Assuming that items are arranged in increasing order of difficulty (from easy to hard), a concordant response pattern is one in which an examinee answers the items correctly until he or she reaches an item that is too difficult and answers the items incorrectly from then on. If all examinees had concordant response patterns, the S-P matrix would have all ones in the upper left-hand corner, and all zeros in the lower right-hand corner. A short illustrative table of the ideal response pattern is presented as Table 1.

Table 1
S-P Table for Five Examinees and Six Items (Ideal Pattern)

Examinee i   Item j:  1  2  3  4  5  6
    1                 1  1  1  1  1  0
    2                 1  1  1  1  0  0
    3                 1  1  1  0  0  0
    4                 1  1  0  0  0  0
    5                 1  0  0  0  0  0
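The two sorting steps described above can be sketched in a few lines of Python. This is an illustration only, not code from the dissertation; the helper name `sp_table` is hypothetical:

```python
import numpy as np

def sp_table(scores):
    """Arrange a 0/1 item-score matrix as an S-P table: rows (examinees)
    sorted by descending total score, columns (items) sorted by
    descending number of correct responses, i.e., ascending difficulty."""
    scores = np.asarray(scores)
    rows = np.argsort(-scores.sum(axis=1), kind="stable")
    cols = np.argsort(-scores.sum(axis=0), kind="stable")
    return scores[rows][:, cols]

# The ideal pattern of Table 1 is already an S-P table, so it is unchanged:
ideal = np.array([[1, 1, 1, 1, 1, 0],
                  [1, 1, 1, 1, 0, 0],
                  [1, 1, 1, 0, 0, 0],
                  [1, 1, 0, 0, 0, 0],
                  [1, 0, 0, 0, 0, 0]])
print(np.array_equal(sp_table(ideal), ideal))   # True
```

Any permutation of the rows and columns of the ideal matrix is restored to the Table 1 arrangement by this function.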
Modified Caution Index (MCI) Harnisch and Linn (1981) introduced the modified caution index as a modification of an earlier caution index proposed by Sato in 1975 (cited in Harnisch & Linn, 1981). The MCI has a lower bound of 0 and an upper bound of 1. The higher the value of the index, the more divergent is the person's response pattern. This index is computed with data arranged into an S-P table, using the following formula:

$$\mathrm{MCI} = \frac{\sum_{j=1}^{n_{i.}} (1-u_{ij})\,n_{.j} \;-\; \sum_{j=n_{i.}+1}^{J} u_{ij}\,n_{.j}}{\sum_{j=1}^{n_{i.}} n_{.j} \;-\; \sum_{j=J+1-n_{i.}}^{J} n_{.j}} \tag{1}$$

where i is the examinee index in the S-P matrix, j is the item index (items ordered from easiest to hardest), J is the number of items, u_ij is 1 if examinee i answers item j correctly and 0 if examinee i answers item j incorrectly, n_i. is the total number of correct responses for examinee i, and n_.j is the number of correct responses to item j. Personal Biserial Correlation (PB) Donlon and Fischer (1968) proposed this correlation. The coefficient obtained represents the correlation between a person's item responses and the items' difficulty values. Donlon and Fischer define item difficulty as the proportion of examinees who respond incorrectly to an item. Large values correspond to difficult items and small values correspond to easy items. A positive correlation
represents good fit, indicating that a person tends to answer correctly items that are easy for the group and miss the more difficult items. Low or negative correlations represent more divergent response patterns. The formula to compute this correlation is

$$\mathrm{PB} = \frac{\bar{Q}_r - \bar{Q}_c}{S_{Q_r}} \cdot \frac{J_r'}{Y} \tag{2}$$

where Q̄_r is the mean item difficulty for items answered, Q̄_c is the mean item difficulty for items answered correctly, S_{Q_r} is the standard deviation of Q_r, J_r' is the number of items answered correctly divided by the number of items answered, and Y is the ordinate from the standard normal curve at the point separating the proportions J_r' and (1 − J_r'). Norm Conformity Index (NCI) This index was developed by Tatsuoka and Tatsuoka (1980). The NCI indicates the degree of concordance with a group response pattern where items are arranged in descending order of difficulty, from hardest to easiest. Values of this index may range from −1 to 1. The smaller or more negative the index, the more divergent is the individual's response pattern in comparison to the group norm. This index is undefined for either perfect or zero scores. Let S denote the row vector of a person's response pattern; let T denote the transpose of the complement of S; and let N = TS. The formula to compute this index is
$$\mathrm{NCI} = \frac{2U_a}{U} - 1 \tag{3}$$

where U_a = Σ_{j<k} n_{jk} is the sum of the above-diagonal elements of N, and U = Σ_j Σ_k n_{jk} is the sum of all the elements of N. Rasch Person-Fit Statistic (Rx²) The Rx², also referred to as the Maximum Likelihood Procedure (MAX) (Wright & Panchapakesan, 1969) and as the weighted total fit mean square (Rudner, 1983), was adopted for use with the one-parameter Rasch model and is calculated using the BICAL program (Wright, Mead, & Bell, 1979). This index has a high value for an examinee who has a response pattern that is inconsistent with the examinee's score and the Rasch model measure of item difficulty. The following formula is used to compute this index:

$$Rx^2 = \frac{\sum_{j=1}^{J} (u_{ij} - p_{ij})^2}{\sum_{j=1}^{J} p_{ij}(1 - p_{ij})} \tag{4}$$

where u_ij is the response of examinee i to item j and p_ij is the probability of a correct response for examinee i on item j as predicted by the Rasch model:

$$p_{ij} = \frac{e^{(\theta_i - b_j)}}{1 + e^{(\theta_i - b_j)}}$$
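The three norm-comparison indices defined so far can be illustrated with a short Python sketch. It follows the definitions given in this section (items ordered easiest-to-hardest for the MCI, hardest-to-easiest for the NCI, and item difficulty taken as the proportion answering incorrectly for the PB); the function names are hypothetical and the code is not part of the original study:

```python
import numpy as np
from statistics import NormalDist

def mci(u, scores):
    """Modified caution index; columns of `scores` are assumed ordered
    easiest-to-hardest. 0 = Guttman-consistent, 1 = maximally divergent."""
    n_dot_j = scores.sum(axis=0)             # correct responses per item
    n_i, J = int(u.sum()), len(u)
    if n_i in (0, J):
        return 0.0                           # perfect/zero scores show no misfit
    guttman = n_dot_j[:n_i].sum()            # credit on the n_i easiest items
    reversed_g = n_dot_j[J - n_i:].sum()     # credit on the n_i hardest items
    observed = (u * n_dot_j).sum()
    return (guttman - observed) / (guttman - reversed_g)

def pb(u, q):
    """Personal biserial: correlation between a person's item scores and
    item difficulties q (proportion incorrect), signed so that positive
    values indicate good fit. Assumes all items were answered."""
    u, q = np.asarray(u), np.asarray(q, dtype=float)
    jr = u.mean()                            # proportion answered correctly
    y = NormalDist().pdf(NormalDist().inv_cdf(jr))
    return (q.mean() - q[u == 1].mean()) / q.std() * (jr / y)

def nci(u_desc):
    """Norm conformity index; u_desc holds responses with items ordered
    hardest-to-easiest. Ranges from -1 to 1 (1 = perfect conformity)."""
    s = np.asarray(u_desc)
    N = np.outer(1 - s, s)                   # n_jk = (1 - s_j) * s_k
    U = N.sum()
    if U == 0:
        raise ValueError("NCI is undefined for perfect or zero scores")
    return 2 * np.triu(N, k=1).sum() / U - 1

# Table 1 data: a Guttman-consistent examinee shows no misfit.
scores = np.array([[1, 1, 1, 1, 1, 0],
                   [1, 1, 1, 1, 0, 0],
                   [1, 1, 1, 0, 0, 0],
                   [1, 1, 0, 0, 0, 0],
                   [1, 0, 0, 0, 0, 0]])
print(mci(scores[2], scores))    # 0.0
print(nci(scores[2][::-1]))      # 1.0
```

An aberrant vector such as [0, 1, 1, 0, 1, 0] (an easy item missed, a hard item passed) yields a positive MCI and an NCI below 1 on the same data.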
Extended Caution Index (ECI) The extended indices have been proposed by Tatsuoka and Linn (1981, 1983). They describe the extended indices as linear transformations of the distance between a person's response pattern and a theoretical curve. In the case of the ECI, this curve is the group response curve (GRC), which is "an average function of N different Person Response Curves" (Tatsuoka & Linn, 1981, p. 10). For the ECI, probabilities of success, calculated through item response theory logistic models, substitute for the zeros or ones in the S-P table. For purposes of this study, the Rasch one-parameter logistic model will be used to calculate these probabilities. The formula for the ECI is

$$\mathrm{ECI} = 1 - \frac{\sum_{j=1}^{J} (Y_{ij} - P_{i.})(Y_{.j} - P_{..})}{\sum_{j=1}^{J} (p_{ij} - \bar{p}_{i.})(Y_{.j} - P_{..})} \tag{5}$$

where Y_ij is the response of examinee i to item j, Y_.j is the sum of responses across examinees for item j, P_i. is the proportion of correct responses of examinee i, P_.. is the total proportion of correct responses, p_ij is the probability of a correct response for examinee i on item j according to the Rasch model, and p̄_i. = Σ_j p_ij / J is the mean predicted probability of success for examinee i. This formula is computed as the ratio of two covariances. The higher the value of the ECI, the more variation from the expected response pattern. This index is also limited by perfect or zero scores: the denominator would become zero and the value infinite.
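The two IRT-based statistics can be sketched similarly. This is a minimal illustration under simplifying assumptions, not the study's implementation: the Rasch parameters are taken as known (BICAL would estimate them in practice), the ECI is computed as a ratio of sample covariances per the description above, and the group response curve is here stood in for by the model probabilities themselves:

```python
import numpy as np

def rasch_p(theta, b):
    """Rasch one-parameter probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def rasch_fit(u, theta, b):
    """Weighted total fit mean square (Rx^2): squared residuals between
    observed responses and Rasch-predicted probabilities, weighted by
    the binomial variance terms. Values near 1 indicate good fit."""
    p = rasch_p(theta, b)
    return ((u - p) ** 2).sum() / (p * (1 - p)).sum()

def eci(u, p, group_curve):
    """Extended caution index: one minus the ratio of two covariances,
    of the observed (numerator) and predicted (denominator) response
    patterns with the group response curve. Higher = more aberrant."""
    return 1.0 - np.cov(u, group_curve)[0, 1] / np.cov(p, group_curve)[0, 1]

# Five items of increasing difficulty; an examinee of middling ability.
b = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
theta = 0.0
p = rasch_p(theta, b)
good = np.array([1.0, 1.0, 1.0, 0.0, 0.0])   # Guttman-consistent pattern
bad = np.array([0.0, 0.0, 1.0, 1.0, 1.0])    # misses easy items, passes hard ones
print(rasch_fit(good, theta, b) < rasch_fit(bad, theta, b))   # True
```

On this toy example the aberrant pattern also produces the larger ECI, consistent with both indices flagging the same kind of misfit.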
Assumptions The following underlying assumptions were held for this study: 1. Standardized testing situations are capable of inducing test anxiety among students who would normally be test anxious. 2. Students responded truthfully on the self-report instrument used to assess their level of test anxiety. 3. Total score on the achievement test used can be taken as an estimate of the examinee's ability (substituting for any external measure of ability, such as an I.Q. or academic aptitude test score). 4. Each subtest of the achievement test measures a fairly unidimensional trait (i.e., achievement in reading or mathematics or science). This assumption is critical for the indices using the Rasch statistics, but it also underlies the assumptions made to calculate the other indices. Educational Significance The potential value of person-fit indices has been cited by Frary (1982), Harnisch (1983), Harnisch and Linn (1981), Levine and Rubin (1979), Rudner (1983), and Van der Flier (1982). These writers suggest that these indices could be useful for the following purposes: 1. To identify individuals for whom the test is inappropriate or invalid. Total test score interpretation can be misleading for examinees who come from different experiential backgrounds or take the test under different motivational dispositions, e.g., test anxiety. 2. To identify groups with different instructional practices or histories, which could change the difficulty of the items, e.g., schooling differences.
3. To identify items that are inadequate for particular groups of examinees. Presently person-fit indices are considered to be at a state of development where more research is needed to investigate their psychometric properties and establish their applicability. The reasons why some people are misfits are not clear. If test anxiety can be identified as a factor associated with person misfit, then the interpretive value of person-fit statistics would be enhanced. Another pragmatic contribution of this study is to extend the body of research on person-fit statistics by providing information about 1) the agreement of person-fit classifications across different subject matter content areas, as measured by the subtests of the Metropolitan Achievement Test, and 2) the degree of agreement of person-fit classifications by different indices. Summary Analysis of item-response patterns provides information not contained in a total test score. Although the idea of using response pattern information probably originated when Guttman (1941) introduced the scalogram technique, it was not until the late 1970s and early 1980s that a strong interest in person-fit statistics developed. Person-fit statistics quantify the degree of deviation of an examinee's response pattern from the expected response pattern. The development and application of person-fit indices is at a fledgling stage. More research is needed to investigate their psychometric properties and establish their applicability. Attempts to identify
causes of person misfit have remained mainly speculative. Recently Frary (1982) and Harnisch and Linn (1981) have suggested that test anxiety may be one factor that can explain erratic performance of an examinee within a given test (e.g., missing easy items while answering more difficult items correctly). According to drive theory, the relationship between test anxiety and performance is moderated by level of ability and task difficulty. Performance on specific items might not only be dependent on the item's difficulty and the examinee's ability but also on the examinee's test anxiety. The primary purpose of this study was to establish 1) the degree of linear relationship between test anxiety and an examinee's level of misfit, 2) the degree of linear relationship between ability (total score on achievement test) and level of misfit, and 3) the extent to which variance in person misfit can be explained by a linear combination of ability level, test anxiety, and their interaction. Five different indices of person fit were used in this study: the MCI, PB, NCI, Rx², and ECI. A secondary interest was to investigate the relationship among the five selected person-fit indices within and across subtests of a norm-referenced achievement battery and to estimate their internal consistencies.
CHAPTER II REVIEW OF LITERATURE The two central aspects of this study are person-fit statistics and test anxiety. These two topics provide the major themes for the organization of the literature review presented in this chapter. Person-Fit Measures Historical Background During the late 1970s and early 1980s there has been an increasing interest in the development and application of statistical indices to identify examinees with aberrant item-response patterns. Proponents of person-fit statistics indicate that these indices add to the information provided by total scores and can also be used to identify potentially inaccurate total scores (Frary, 1982; Harnisch, 1983; Harnisch & Linn, 1981; Rudner, 1983). This trend toward using information from item-response patterns is not new. According to Gaier and Lee (1953), one of the most promising trends in current psychometric research is an increasing concern with methods of evaluating patterns of test scores and test responses . . . our initial hypothesis is that consideration of response configurations will yield more fruitful results than the usual methods of reporting merely the total score for a test . . . a total score may thus carry considerably less diagnostic significance than a direct and detailed analysis of test responses per se. (p. 140) 14
Guttman (1941, 1950) was one of the first writers to suggest that some persons respond consistently to a given set of ordered stimuli (test items) while others do not. According to Guttman (1950), "a person who endorses a more extreme statement . . . should endorse all less extreme statements if the statements are to be considered a scale" (p. 62). Guttman's description of the basic procedure for the scalogram technique of scale analysis is very similar to Sato's S-P chart construction. He states that there are two basic steps in the scalogram pattern formation. These are first, the questions are ranked in order of "difficulty" with the "hardest" questions, i.e., the ones that fewest persons got right, placed first and with the other questions following in decreasing order of "difficulty." Second, the people are ranked in order of "knowledge" with the "most informed" persons, i.e., those who got all questions right, placed first, the other individuals following in decreasing order of "knowledge." (Guttman, 1950, p. 70) Sato's S-P chart is also a two-dimensional matrix where the rows represent the students ranked from highest to lowest according to total test score (cited in Harnisch & Linn, 1981). The columns represent the items arranged from left to right in ascending order of difficulty. Construction of a scalogram pattern and an S-P table follow the same two steps. Once the responses are organized in this fashion, a concordant response pattern is defined as the case when an examinee answers the items correctly until he or she reaches an item that is too difficult and answers all items incorrectly thereafter. Some disruption of a perfect pattern can happen. As the response pattern deviates more from the expected pattern, the degree of aberrance increases. Visual identification of aberrant or erratic response patterns becomes increasingly more difficult as the number of
items in a test increases. The number of possible response patterns multiplies as the number of items increases. With the recent introduction of a variety of statistics to measure the degree of deviation from a typical response pattern, there is a renewed interest in using response pattern information. Types of Person-Fit Indices Indices measuring the degree of unusual response patterns can be categorized into three major types: norm-comparison indices, goodness-of-fit indices, and extended indices. Norm-comparison indices, which are based on observed patterns of right and wrong answers and are calculated with summary statistics based on the norm group, include Sato's caution index (cited in Harnisch & Linn, 1981), the modified caution index (Harnisch & Linn, 1981), the agreement, disagreement, and dependability indices proposed by Kane and Brennan (1980), the U' index by Van der Flier (1977), the personal biserial by Donlon and Fischer (1968), and the norm conformity index by Tatsuoka and Tatsuoka (1980). Van der Flier's U' index and Tatsuoka and Tatsuoka's norm conformity index have been reported to have a perfect negative relationship (Harnisch & Linn, 1981). Norm-comparison indices are calculated by using information organized in an S-P table. They indicate the degree of aberrance from the expected response pattern, when examinee ability is defined as the total observed score on the test. Goodness-of-fit or "appropriateness" indices are based on item response theory (IRT) (Levine & Rubin, 1979). As with norm-comparison indices, goodness-of-fit indices are also based on the expected
response pattern for an examinee at a given ability level. The distinction is that for goodness-of-fit indices a more sophisticated definition of "ability" is employed. Instead of simply equating ability with the examinee's observed raw score on the test, ability is defined in terms of his estimated score on a theoretical latent continuum underlying test performance. There are two popular IRT models that estimate examinee abilities based on the latent trait underlying test performance. For Rasch's one-parameter logistic model, examinee ability estimates are determined as a function of item difficulty parameters. A widely used computer program, BICAL, written by Wright et al. (1979), provides examinee ability estimates, item difficulty parameter estimates, and a person-fit statistic (Rx²) which "indicates how well the individual's item response pattern and the Rasch model fit" (Rudner, 1982, p. 4). The second widely used IRT model is Birnbaum's three-parameter logistic model (Lord & Novick, 1968, Ch. 17), for which examinee ability estimates are determined as a function of item difficulty, item discrimination, and guessing parameters. Levine and Rubin (1979) developed three types of appropriateness indices based on Birnbaum's three-parameter logistic model. These approaches are the marginal probability, the likelihood ratio, and the estimated ability variation indices. A practical limitation in using these procedures arises from the large sample sizes usually recommended to obtain stable estimates from the three-parameter model (Hambleton & Cook, 1977). Levine and Rubin devised a simulation of item response data on the Scholastic Aptitude Test (SAT) to conform to normal or aberrant response patterns. Their findings indicate that all three types of
goodness-of-fit indices demonstrate the capability to detect aberrance when present. Extended caution indices have been proposed by Tatsuoka and Linn (1981, 1983) as a link between the norm-comparison and the goodness-of-fit indices. They linked Sato's S-P theory and item response theory by replacing the original observed zeros and ones of the item scores with IRT probabilities of passing the items. These probabilities were then used in the calculation of the caution indices. Five variations of the extended caution index were created. These extended caution indices are defined as "linear transformations of the covariance or correlation between a person's response pattern and a theoretical curve" (Tatsuoka & Linn, 1983, p. 95). Their findings support the effectiveness of the extended indices in identifying examinees who use erroneous rules in answering arithmetic test problems. Tatsuoka and Linn (1983) point out that these indices can have instrumental utility by identifying students who consistently make errors because of misconceptions. Comparative Studies of Person-Fit Indices There have been several studies in which the relationship and effectiveness of person-fit indices have been compared. Harnisch and Linn (1981) made a comparative analysis of ten norm-comparison indices. Using mathematics and reading tests from a statewide assessment program, they also examined school and regional differences. The intercorrelations between these indices ranged from |.13| to |.99| for mathematics and from |.34| to |.96| for reading. They found that Kane and Brennan's (1980) agreement index had the lowest correlation
with the other indices, but had the highest correlation (.99) with total score. The modified caution index (MCI) was found to have the lowest correlation with total score (.02 for mathematics and .21 for reading). Harnisch and Linn (1981) found significant school and regional differences in students' response patterns as measured by the MCI.

Rudner (1982, 1983) evaluated nine indices; four were norm-comparison indices, while five were goodness-of-fit indices. He generated data by simulating examinees and their responses through Birnbaum's three-parameter model. Response patterns were altered to simulate spuriously high or low respondents. Findings indicated that the norm-comparison indices (point biserial correlation, PB; the NCI; and the MCI) and the weighted total fit mean square, or Rx², were highly intercorrelated (|.77| to |.99|). The goodness-of-fit indices using Birnbaum's three-parameter model and the unweighted total fit mean square had lower intercorrelations (|.17| to |.80|). Validity of the indices was tested by observing how sensitive they were to assessment accuracy. The MCI and the NCI identified comparable proportions of examinees with aberrant response patterns. According to Rudner, "these two approaches were the most stable of the statistics" (Rudner, 1983, p. 217). In general, indices based on IRT showed better detection rates of aberrant response patterns than the norm-comparison indices.

Frary (1982), using teacher-made multiple-choice tests, compared three person-fit measures: the Rx², the MCI, and a weighted choice index. In the weighted choice index, distractor choice is considered as part of the estimation of person-fit. The Rx² and the MCI were
found to be highly correlated (.75). The smallest relationship between any two of the three indices was between the Rx² and the weighted choice index (.42). In this study Frary was the first to compute and report person-fit internal consistency estimates. Low and even negative split-half coefficients of the person-fit measures in his study were found (Frary, 1982).

Person-Fit Indices Under Study

The present study is the first to include all three different types of person-fit statistics (i.e., the norm-comparison indices, the goodness-of-fit indices, and the extended indices). The five indices under study are the modified caution index (MCI), the personal biserial correlation (PB), the norm-conformity index (NCI), the Rasch person-fit statistic (Rx²), and the extended caution index (ECI). The MCI was chosen for use in this study because it was found to be least related to total test scores (Harnisch & Linn, 1981) and is considered stable with short and long tests (Rudner, 1983). The PB was selected because it has been in use for a longer period of time than more recent indices and is generally associated with classical test theory. Its computations are relatively simple, and it has been found to be very efficient with shorter classroom tests (Rudner, 1983). For these reasons the PB could be useful to a larger number of practitioners. The NCI has been found to correlate with total score somewhat higher than other indices (Harnisch & Linn, 1981), but it has nevertheless been recurrently used in different research studies (Harnisch & Linn, 1981; Rudner, 1982, 1983; Van der Flier, 1977, 1982). The NCI and the MCI are considered to be the most applicable
and stable under situations with long and short tests and spuriously high or low scores (Rudner, 1983). The Rx² index was selected due to the availability of the BICAL computer program. The convenience of having the Rx² computations given as part of the output from the BICAL computer program makes the Rx² index more usable to practitioners. The Rx² is a goodness-of-fit or appropriateness type of person-fit index. It uses the Rasch one-parameter logistic model to estimate ability and item difficulty. Appropriateness indices requiring use of the three-parameter logistic model were not feasible for the present study because of the larger sample size recommended to get consistent ability parameter estimates (Hambleton & Cook, 1977). The ECI represents a link between norm-comparison indices and goodness-of-fit indices. Since no comparisons of the ECI with other indices or computations with actual data are available in the literature, this index was included to evaluate its relationship to the other indices.

It is probably noteworthy that most previous studies of multiple measures of person-fit have focused primarily upon intercorrelations among these indices without investigating how they correlate with measures of any trait other than achievement itself, as measured by the test. The present study is somewhat broader in scope, since it investigates how these indices relate to another variable, test anxiety.

Test Anxiety

Most research on test anxiety has considered the effects of test anxiety on total score. According to Tryon (1980) test anxiety
research findings present a consistent moderate negative correlation between test anxiety and total score measures of achievement. High-test-anxious individuals tend to score lower on classroom and aptitude tests (Alpert & Haber, 1960; Harper, 1974; Mandler & Sarason, 1952; I. Sarason, 1963, 1975; Spielberger, Gonzalez, Taylor, Algaze, & Anton, 1978).

Several researchers have tried to explain why test anxiety affects performance. According to the cognitive attentional theory of test anxiety (CATTA), introduced by Sarason (1960) and extended by Wine (1971, 1980), the "major cognitive characteristics of test anxious persons are negative self-preoccupation, and attention to evaluative cues to the detriment of test cues" (Wine, 1980, p. 371). This misdirection of attention, both in the pre-stages of evaluation (study phase) and in the test-taking situation, may limit coding, retention, and retrieval of information by high-test-anxious individuals. Difficulty of the task (e.g., difficult items) is expected to negatively affect attention. Thus, according to CATTA, performance of test-anxious persons will be negatively affected.

The Spence-Taylor drive theory also predicts the effect of anxiety on performance of tasks with varying levels of difficulty. Heinrich and Spielberger (1982) summarize these predictions according to the difficulty of the task. They explain that for high-anxious students the performance of a task is dependent on its difficulty. High anxiety may facilitate performance on easy tasks, interfere with performance on harder tasks, and be dependent on the stage of learning for tasks of intermediate difficulty. Heinrich and Spielberger (1982) explain the relationship between performance and the learning stage.
According to these authors, "high anxiety will be detrimental to performance early in learning when the strength of correct responses is weak relative to competing error tendencies. Later in learning, high anxiety will begin to facilitate performance as correct responses are strengthened and error tendencies are extinguished" (Heinrich & Spielberger, 1982, p. 146). Varying ability levels and their relationship with anxiety and task difficulty are also considered by the Spence-Taylor drive theory. According to Spielberger (1971), the effect of anxiety on subjects with different ability levels will depend on the task difficulty and the learning stage considered.

These two theories, the CATTA and the Spence-Taylor drive theory, point to the possibility that test anxiety might have an effect at the item level and that this effect might be dependent on ability level. Person-fit statistics measure the deviation from an expected response pattern. According to the theory of person-fit, in a good fit to a response pattern "high ability examinees are expected to get few easy items wrong," while "low ability examinees are expected to get few difficult items right" (Rudner, 1983, p. 207). If test anxiety affects performance at the item level, it might be a factor which contributes to erratic performance for high-anxiety examinees.

Summary

Literature pertinent to person-fit indices and test anxiety has been reviewed in this chapter. Recent literature on person-fit measures shows an increasing interest in the development and application of these indices. The idea of using information provided
by response patterns is not new; it was first introduced by Guttman (1941) with the scalogram technique. During the late 1970s and early 1980s a number of different person-fit statistics were developed as measures of the degree of deviation from a typical response pattern. These indices can be categorized into three major types: norm-comparison indices, goodness-of-fit indices, and extended indices. Five person-fit indices were selected for use in this study. Three of these indices are norm-comparison indices (MCI, PB, and NCI), one is a goodness-of-fit index (Rx²), and one belongs to the extended category of indices (ECI).

Three major research studies comparing person-fit indices were reviewed. Harnisch and Linn (1981) compared ten norm-comparison indices and concluded that the MCI seemed to be the most promising due to its lower correlation with total score. Rudner (1983) evaluated four norm-comparison indices and five goodness-of-fit indices. He found that goodness-of-fit indices showed better detection rates of aberrant response patterns. Frary (1982) contributed to the development of person-fit indices by being the first to study their internal consistency. He found low split-half reliabilities.

Test anxiety has been suggested as a factor that could affect performance within a test (Frary, 1982; Harnisch & Linn, 1981). Two theories, the CATTA and the Spence-Taylor drive theory, predict that high test anxiety will negatively affect performance. These two theories point to the possibility that test anxiety might have an effect at the item level and that this effect might be dependent on ability level. Test anxiety could thus be a factor that contributes to person-misfit.
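The common thread running through the norm-comparison indices reviewed above is the Guttman scalogram idea: once items are ordered from easiest to hardest, an examinee who misses easy items while passing hard ones deviates from the expected pattern. The sketch below is not one of the five indices computed in this study (whose formulas appear in Chapter I); it is a minimal illustration of the inversion-counting logic those indices build on, with the function name and example data invented for the illustration.

```python
import numpy as np

def guttman_inversions(responses, p_values):
    """Count Guttman-style inversions for one examinee: item pairs in which
    a harder item was answered correctly while an easier item was missed.
    `responses` is a 0/1 vector; `p_values` are item proportions correct
    (a higher p value means an easier item)."""
    # Order items from easiest to hardest, as in an S-P matrix.
    order = np.argsort(-np.asarray(p_values, dtype=float))
    r = np.asarray(responses)[order]
    inversions = 0
    for easy in range(len(r)):
        for hard in range(easy + 1, len(r)):
            if r[easy] == 0 and r[hard] == 1:  # missed easy, passed hard
                inversions += 1
    return inversions

# Hypothetical item difficulties (proportion correct) for four items.
p = [0.9, 0.7, 0.5, 0.3]
print(guttman_inversions([1, 1, 0, 0], p))  # Guttman-consistent pattern -> 0
print(guttman_inversions([0, 0, 1, 1], p))  # maximally aberrant pattern -> 4
```

A perfectly Guttman-consistent examinee produces zero inversions; the published indices differ mainly in how such deviations are weighted and normed.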
CHAPTER III
METHODOLOGY

The present study was designed to investigate the relationship between an examinee's level of test anxiety and each of five different person-fit statistics and to establish whether this relationship is dependent on ability level. A second purpose was to investigate the correlation of person-fit indices within and across different subject areas of a standardized achievement test battery and to assess the internal consistency of person-fit indices. An existing data set was analyzed to explore the nature of these relationships. A description of the examinee group, instruments, data-file creation, and data analysis methods is presented in this chapter.

Examinees

The data pool used in this study consisted of test scores and item responses from 225 seventh-graders and 188 eighth-graders from a metropolitan middle school in north central Florida. There was an almost even distribution of boys and girls at each grade level. Approximately 70% of the examinees were white and 30% were black at each grade level. The school population is heterogeneous with respect to socioeconomic level.
Instruments

The Test Anxiety Scale for Adolescents (TASA) (Schmitt & Crocker, 1982) was used to measure test anxiety as a trait. This instrument is a modified version of the 37-item Mandler-Sarason Test Anxiety Scale (Sarason, 1972). This scale consists of 31 true-false items and is designed for use with examinees in middle school or junior-high grades. Unlike most other test anxiety scales for children, all items on the TASA deal exclusively with examinee feelings about tests. Sample items include "I worry just before getting a test back" and "Sometimes on a test I just can't think." Schmitt and Crocker (1982) have found the factor structure of the TASA to be fairly similar to that reported for the adult test anxiety scales. They reported a KR-20 of .87 as a total score reliability estimate for these middle school examinees.

The Metropolitan Achievement Test (MAT) subscales (Form KS) were used to measure achievement in reading, mathematics, and science (Prescott, Balow, Hogan, & Farr, 1978). The TASA was administered in March 1981, approximately two weeks prior to the school-wide administration of standardized achievement tests. The MAT was administered by school staff as part of the school district's regular testing program approximately two weeks after the test anxiety scale was given. The ranges of item difficulties of the MAT subscales for the seventh and eighth grades are, respectively: .21 to .99 and .28 to .99 for reading; .20 to .97 and .31 to .97 for mathematics; and .29 to .94 and .34 to .98 for science.
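The TASA reliability cited above is a KR-20 coefficient, the Kuder-Richardson formula 20 for dichotomously scored items: KR-20 = k/(k-1) · (1 - Σpq / σ²), where k is the number of items, p and q are the proportions passing and failing each item, and σ² is the variance of total scores. A minimal sketch of the computation, using a small fabricated response matrix rather than the TASA data:

```python
import numpy as np

def kr20(item_scores):
    """Kuder-Richardson formula 20 for a persons-by-items 0/1 matrix:
    KR-20 = k/(k-1) * (1 - sum(p*q) / var(total)), using the population
    variance of total scores."""
    X = np.asarray(item_scores, dtype=float)
    k = X.shape[1]                      # number of items
    p = X.mean(axis=0)                  # proportion passing each item
    q = 1.0 - p
    total_var = X.sum(axis=1).var()     # population variance of total scores
    return (k / (k - 1)) * (1.0 - (p * q).sum() / total_var)

# Hypothetical 0/1 responses of five examinees to four true-false items.
resp = np.array([[1, 1, 1, 1],
                 [1, 1, 1, 0],
                 [1, 1, 0, 0],
                 [1, 0, 0, 0],
                 [0, 0, 0, 0]])
print(round(kr20(resp), 3))  # -> 0.8
```

With real scale data the same function applies directly to the scored item matrix; only the 0/1 coding of the true-false responses is assumed.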
Creation of the Data File

The test anxiety scores, coded with student ID number but no other identifying information, were obtained in conjunction with a University of Florida College of Education in-service training project for school personnel on identifying and counseling test-anxious students. The researcher later obtained a set of MAT test item responses for students with those same ID numbers from the county school district testing office. This data file also contained some demographic information (i.e., sex, race, and grade level). The data file used for analysis in this study was created by matching the two examinee data files on student ID number and merging the files. Thirty-seven of the students' records (13 of the seventh graders and 24 of the eighth graders) in the merged file were later deleted, because it was found that these students had been tested out of grade level, rather than on the test form taken by their grade peers.

Calculation of Person-Fit Statistics

For each examinee on each subtest, five different indices of person-fit were calculated: the modified caution index, the personal biserial correlation, the norm-conformity index, the Rasch person-fit index, and the extended caution index. To create the data file containing the five person-fit indices for each MAT subtest at each grade level, the original item-examinee response matrix was used. Each examinee's response* to each item was coded 0 or 1 in this matrix. The data on this matrix were used to

*Blanks or omitted responses were treated as incorrect responses and assigned a 0 value.
1. compute total scores to get an ability estimate for each student;

2. compute a mean score for each item as an estimate of item difficulty (p value); and

3. reorganize the data into an S-P matrix by sorting by total score and by item difficulty.

The resulting matrix had students organized by ability (from most able to least able) and items organized by difficulty (from easiest to hardest). This S-P matrix was used to calculate the five person-fit indices, using computer programs written by this author for each person-fit statistic. Refer to Chapter I for definitions of the formulas. These statistics were programmed using the Statistical Analysis System (SAS) package (Helwig & Council, 1979). The accuracy of each programmed computation was tested using the dummy data set given by Harnisch and Linn (1981, p. 136).

Analysis

Means, standard deviations, and minimum and maximum values by grade were computed for each person-fit statistic, ability measure, and test anxiety score. For the person-fit indices and ability measures these descriptive statistics were calculated for each subtest of the MAT (reading, mathematics, and science). Correlations among fit statistics, between person-fit and test anxiety, and between person-fit and ability measures were calculated. To investigate the relationship between examinees' level of test anxiety and degree of person-fit, and to study whether this relationship is dependent upon ability level, a linear multiple regression analysis
was used. In this analysis person-fit measures were the dependent variables, and ability (reading, mathematics, or science) and TASA were the continuous independent variables. The model used for each ability measure and person-fit index is

Y′ = b₀ + b₁X₁ + b₂X₂ + b₃X₁X₂

where

Y′ = person-fit predicted by the model,
b₀ = intercept value,
b₁ = regression slope for the ability independent variable,
X₁ = ability, estimated from total score,
b₂ = regression slope for the TASA independent variable,
X₂ = TASA,
b₃ = regression slope for the interaction of ability and TASA, and
X₁X₂ = interaction between ability and TASA.

To estimate the internal consistency of person-fit indices, items for each MAT subtest at each grade level were divided into odd and even subtests. The original sequential test-item number was used for this split. Odd-item and even-item person-fit statistics were computed by following the sequence of steps previously described. The fit index for the odd items was correlated with the fit index for the even items, and the resulting correlation was corrected using the Spearman-Brown formula to obtain an internal consistency estimate for the full-length test.

Summary

A linear multiple regression analysis was used to investigate the relationship between an examinee's level of test anxiety and each of five different person-fit statistics and to study whether this
relationship is dependent on ability level. The five person-fit indices included in this study were the modified caution index, the personal biserial correlation, the norm-conformity index, the Rasch person-fit index, and the extended caution index. A data set of 225 seventh-grade and 188 eighth-grade examinees' responses to the Test Anxiety Scale for Adolescents and to the Metropolitan Achievement Test reading, mathematics, and science subscales was used to compute person-fit statistics and explore the nature of these relationships. Correlations of person-fit statistics between and within the different subject area tests were examined. Person-fit split-half reliabilities were also computed for each index for each content area and grade level.
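The moderated regression at the heart of the analysis (a person-fit measure regressed on ability, TASA, and their product) can be sketched with ordinary least squares. The data below are fabricated stand-ins chosen to mimic the variables' rough scales, not the study's actual scores, and the coefficient values are invented for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical stand-ins for the study's variables (not the actual data):
ability = rng.normal(35, 8, n)          # total score on an MAT subtest
tasa = rng.normal(14, 6, n)             # TASA test anxiety score
# Simulate a misfit index whose anxiety slope depends on ability level.
misfit = (0.30 - 0.002 * ability + 0.004 * tasa
          - 0.0002 * ability * tasa + rng.normal(0, 0.05, n))

# Design matrix for Y' = b0 + b1*ability + b2*TASA + b3*(ability*TASA).
X = np.column_stack([np.ones(n), ability, tasa, ability * tasa])
b, *_ = np.linalg.lstsq(X, misfit, rcond=None)

fitted = X @ b
r_squared = 1 - ((misfit - fitted) ** 2).sum() / ((misfit - misfit.mean()) ** 2).sum()
print(b)          # estimated b0, b1, b2, b3
print(r_squared)  # proportion of misfit variance explained (the model R²)
```

The R² computed this way corresponds to the model R² values the study tabulates by grade, subtest, and index; significance testing of the b₃ interaction term would additionally require its standard error.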
CHAPTER IV
RESULTS

The present study was undertaken to answer the following questions for each of five selected indices of person-fit:

1. What is the degree of linear relationship between test anxiety and an examinee's level of misfit?

2. What is the degree of linear relationship between ability (as defined by performance on the current test) and level of misfit?

3. To what extent is variance in person misfit explained by a linear combination of the variables ability level, test anxiety, and their interaction?

Analyses were also performed to explore the degree of relationship among the five selected person-fit indices and to determine their split-half reliabilities. The results of the analyses presented in this chapter have been organized beginning with results of simpler analyses and proceeding to those of greater complexity. The issue of the degree of relationship among the five person-fit indices within and across subtests of the MAT is addressed in the next section, Descriptive Statistics. Data relevant to questions 1 and 2 (dealing with the bivariate relationships between test anxiety and misfit and between ability level and misfit) are presented next. Results of multiple regression analyses, relevant to question 3, are presented in the third section of this chapter. The final section of this chapter contains the results
of the investigation of the internal consistency of person-fit indices.

Descriptive Statistics

The mean, standard deviation, and minimum and maximum values for person-fit measures by grade and ability subtest are presented in Table 2. Means and standard deviations for each person-fit index are very similar across grades and ability subject tests. For the reading test, person-fit measures show the greatest dispersion between the minimum and maximum scores. For this subtest of the MAT, the maximum score observed for the personal biserial correlation in both the seventh and eighth grades was greater than one. This was not a computation error but may be ascribed to sampling error or violation of the assumption of an underlying normal distribution for the dichotomous item response variable. Lord and Novick discuss conditions under which biserial correlations may exceed 1.00 (Lord & Novick, 1968, p. 339).

Descriptive statistics for ability tests and test anxiety for the seventh and eighth grades are presented in Table 3. Person-fit measures' intercorrelations for each grade level are presented in Table 4. Seventh-grade intercorrelations are shown above the diagonal and eighth-grade intercorrelations below the diagonal. Results of these correlations indicate that person-fit measures are highly related within the same subject test area. These correlations ranged from |.78| to |.99| and are all significant at an alpha level of .0001. However, across tests in different subject areas, there was little or no relationship between the same person-fit measure. The Rasch
Table 2
Descriptive Statistics for Person-Fit Measures by Grade and Ability Test

[Table body not reliably recovered from the scan. For each grade (7 and 8) and each subtest (reading, mathematics, and science), the table reports the mean, SD, minimum, and maximum of the MCI, PB, NCI, Rx², and ECI. The footnotes give subtest-specific sample sizes: seventh grade n = 224 (reading) and 225 (mathematics and science); eighth grade n = 188 (mathematics), with the reading and science values only partially legible.]
Table 3
Descriptive Statistics for Ability Tests and Test Anxiety by Grade

[Table body not recovered from the scan.]
Table 4
Intercorrelations of Person-Fit Measures by Grade (Seventh Grade Above the Diagonal, Eighth Grade Below)

[Table body not recovered from the scan.]
person-fit measure (Rx²) had the lowest relationship to other indices within the same subject test. Interestingly, Rx² is the only index with significant intercorrelations with other indices across subject areas.

Relationship Between Person-Fit, Ability, and Test Anxiety

Correlations between person-fit measures, total ability measures of reading, mathematics, and science, and test anxiety for each grade are presented in Table 5. Correlations between person-fit measures and their corresponding total ability scores ranged from |.00| to |.50|. In general the personal biserial (PB) index had the lowest correlation with total score. This lower correlation was consistently observed across subject test areas and grade levels. Correlation values between person-fit measures and test anxiety ranged from |.02| to |.22|. The highest correlations between person-fit measures and test anxiety occurred for seventh-grade science person-fit values and for eighth-grade reading person-fit values.

Relationship Between Person-Fit and a Linear Combination of Ability, Test Anxiety, and Their Interaction

Multiple regression results for the model with ability, test anxiety, and their interaction are summarized in Table 6. Values of R² for the model in each test area are reported. Models with significant ability and test anxiety interactions are identified by having their corresponding R² underlined. For the seventh-grade sample, a significant proportion of variance in the modified caution index was explained for each of the three subject area tests by the
Table 5
Correlations Between Person-Fit Measures and Total Scores on the Corresponding Ability Test and Test Anxiety

                    Grade 7            Grade 8
Person-Fit      Total    TASA      Total    TASA

Reading Test
  MCI           .23**    .18**     .32**    .07
  PB            .05      .10       .18*     .15*
  NCI           .03      .03       .20*     .15*
  Rx²           .11      .02       .16*     .17*
  ECI           .02      .08       .05      .14

Mathematics Test
  MCI           .15*     .03       .09      .06
  PB            .04      .04       .01      .03
  NCI           .20**    .05       .17*     .08
  Rx²           .32**    .10       .30**    .11
  ECI           .23**    .05       .20**    .08

Science Test
  MCI           .25**    .17*      .45**    .03
  PB            .08      .09       .29**    .02
  NCI           .22**    .16*      .44**    .03
  Rx²           .39**    .22**     .50**    .08
  ECI           .28**    .18**     .50**    .04

Note: Higher misfit is represented by higher values on the MCI, Rx², and ECI and by lower values on the PB and NCI. The signs of the correlations were not recoverable from the scan; magnitudes are shown as printed.

*Significant at α = .05. **Significant at α = .01.
Table 6
Percentage of Variance (R²) in Person-Fit Explained by the Combination of Examinee Ability, Test Anxiety, and Their Interaction

                            Person-Fit
Test            MCI      PB       NCI      Rx²      ECI

Seventh Grade
Reading        .11**†   .04*     .01      .02      .03
Mathematics    .04*     .03      .05*     .11**    .06**
Science        .07**    .01      .06**    .16**    .09**

Eighth Grade
Reading        .12**    .05*     .05      .05*     .02
Mathematics    .04      .04      .06*     .10**    .07**
Science        .23**†   .13**†   .21**†   .27**    .25**†

*Significant at α = .05. **Significant at α = .01. †Significant ability × test anxiety interaction at α = .05 (underlined in the original).
linear* combination of test anxiety, ability, and their interaction. Only in reading, however, was the percentage of variance explained greater than 10%, and in this case the interaction of ability and test anxiety was significant. For the personal biserial, the norm-conformity index, and the extended caution index, although several significant R² values were observed, these were all less than .10. Thus it is difficult to interpret these relationships as being substantially important. For the Rasch index of person-fit, substantial proportions of variation were explained by the model in the areas of mathematics and science. No significant interaction effect of ability and test anxiety was detected in either of these cases.

For the eighth grade, the modified caution index again appears to be sensitive to variation in examinees' levels of test anxiety, ability, and their interaction. The significant R² values exceeded 10% for this index in reading and science. Science had the largest R² (.23) and a significant interaction between test anxiety and ability. For the personal biserial, the norm-conformity index, and the extended caution index, significant R² values greater than .10 were found only in the area of science, and for each of these cases the interaction between ability and test anxiety contributed significantly to the overall model. Significant (and meaningful) R² values for the Rasch Rx² index were found for the areas of science and mathematics without any interaction between ability and test anxiety.

*Curvilinear relationships between person-fit indices and ability and TASA measures were tested and found not significant.
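Significant ability × anxiety interactions of this kind are conventionally probed with simple slopes: for a fixed ability X₁, the model Y′ = b₀ + b₁X₁ + b₂X₂ + b₃X₁X₂ reduces to a line in TASA with intercept (b₀ + b₁X₁) and slope (b₂ + b₃X₁). A minimal sketch, with coefficient values invented purely for illustration (not the study's estimates):

```python
# Probing an ability-by-anxiety interaction via simple slopes:
# at fixed ability X1, Y' = (b0 + b1*X1) + (b2 + b3*X1)*X2.

def simple_slope_line(b0, b1, b2, b3, ability):
    """Return (intercept, slope) of predicted misfit on TASA at one ability."""
    return b0 + b1 * ability, b2 + b3 * ability

# Hypothetical coefficients with a negative interaction term, so the
# anxiety slope flips sign as ability rises.
b0, b1, b2, b3 = 0.30, -0.002, 0.006, -0.0002

for ability in (20, 35, 50):            # low, average, and high ability
    intercept, slope = simple_slope_line(b0, b1, b2, b3, ability)
    print(f"ability={ability}: intercept={intercept:.3f}, slope={slope:+.4f}")
```

With these invented values the TASA slope is positive at low ability, near zero at average ability, and negative at high ability, which is the qualitative pattern a negative b₃ produces.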
For further interpretation of these results, the nature of the interaction effect of test anxiety and ability on person-fit was examined. Significant interactions were followed up by plotting the relationship between person-fit measures and TASA at selected ability levels. The formula to calculate the regression line at each ability level is

Y′ = b₀ + b₁X₁ + (b₂ + b₃X₁)X₂

At any value of ability (X₁) a predicted person-fit measure (Y′) was calculated for different values of TASA (X₂). The intercept of this model equals (b₀ + b₁X₁) and the slope equals (b₂ + b₃X₁). Table 7 reports the slope and intercept estimates which were used in plotting these lines for all cases in which R² exceeded .10. The regression lines resulting from these computations are shown in Figures 1-5. It should be noted that Figures 2-5 are based on a single grade level and a single subject area. These plots depict the nature of the interaction of ability and test anxiety on various indices of person-fit. Although the points of intersection of these lines may vary, the same general pattern of relationship between person-fit and test anxiety occurs in all cases. Namely, this pattern of interaction can be characterized as follows:

1. For examinees in the average ability ranges, there is little or no relationship between test anxiety and person misfit. (Note the "flat" slope of the regression line for Group E in all figures.)

2. As examinee ability level increases (see lines for Groups F, G, and H), the slope of the regression lines increases, generally
Table 7
Significant Ability and Test Anxiety Interactions and R² Increases for Person-Fit Measures

[Table body largely not recovered from the scan. For each model with R² > .10 and a significant interaction (seventh-grade reading MCI; eighth-grade science MCI, PB, NCI, and ECI), it reports the intercept and the ability, TASA, and ability × TASA slope estimates with their standard errors and t values, along with the R² increase attributable to the interaction term.]
[Figure 1. Relationship between the modified caution index and test anxiety for seventh-grade examinees at different reading ability levels. Plot not reproduced: the y-axis is the MCI (higher = more misfit), the x-axis is TASA, and lines A-H correspond to reading ability levels from 12.35 to 53.02, with E (35.59) at the mean.]
[Figure 2. Relationship between the modified caution index and test anxiety for eighth-grade examinees at different science ability levels. Plot not reproduced: lines A-H correspond to science ability levels from 16.76 to 48.72, with E (35.02) at the mean.]
[Figure 3. Plot not reproduced; the caption was not legible in the scan. Based on the surrounding text and Table 7, this figure evidently shows the relationship between the personal biserial correlation and test anxiety for eighth-grade examinees at different science ability levels.]
[Figure 4. Relationship between the norm conformity index and test anxiety for eighth-grade examinees at different science ability levels. Plot not reproduced.]
[Figure 5. Relationship between the extended caution index and test anxiety for eighth-grade examinees at different science ability levels. Plot not reproduced: lines A-H correspond to science ability levels from 16.76 to 48.72, with E (35.02) at the mean.]
indicating an increasing negative relationship between test anxiety and person-fit. Specifically, high-ability, low-anxious examinees show more misfit than high-ability, high-anxious examinees.

3. As examinee ability decreases (see lines for Groups C, B, and A), the opposite effect occurs. Namely, low-ability, low-anxious examinees show less misfit than low-ability, high-anxious examinees.

When no interactions were significant, only ability main effects were significant. Table 8 presents the Type I sums of squares, which measure incremental sums of squares for the model as each variable was added. The ability main effect was significant in the models for the Rasch person-fit index on mathematics and science for the seventh and eighth grades. There was also a significant main effect of reading ability for the model with the modified caution index at the eighth grade.

Internal Consistency of Person-Fit Statistics

Corrected split-half internal consistency reliability coefficients for person-fit measures by grade and subject content area are reported in Table 9. For the seventh-grade sample, the internal consistency estimates ranged from .23 to .56. The highest person-fit split-half reliability estimates were found consistently for the reading subtest. For the eighth grade the range of values was from .23 to .39, with a slight trend for reading to have the higher values. Among person-fit indices, the Rx² index has the highest internal consistency estimates for the eighth grade (ranging from .31 to .39), but for the seventh grade, no person-fit index was consistently more
Table 8
Significant Main Effects of Ability as Predictors of Person-Fit Measures

Dependent variable / Source          df    Type I SS    Mean Square    F

Seventh Grade, Mathematics (Rx²; model R² = .11*)
  Model                               3      .50          .17          3.20**
  Mathematics                         1      .47          .47         25.30**
  TASA                                1      .01          .01           .47
  Mathematics*TASA                    1      .02          .02          1.12
  Error                             221     4.01          .02

Seventh Grade, Science (Rx²)
  Model                               3      .50          .17         13.73**
  Science                             1      .49          .49         39.35**
  TASA                                1      .01          .01          1.21
  Science*TASA                        1      .00          .00           .01
  Error                             221     2.70          .01

Eighth Grade, Reading (MCI)
  Model                               3      .31          .10          7.37**
  Reading                             1      .27          .27         20.38**
  TASA                                1      .04          .04          3.33
  Reading*TASA                        1      .00          .00           .17
  Error                             130     2.36          .01

Eighth Grade, Mathematics (Rx²; model R² = .10*)
  Model                               3      .44          .15          7.01**
  Mathematics                         1      .33          .33         13.06**
  TASA                                1      .01          .01           .40
  Mathematics*TASA                    1      .05          .05          2.73
  Error                             134     3.36          .02

Eighth Grade, Science (Rx²)
  Model                               3      .56          .19
  Science                             1      .53          .53         53.32**
  TASA                                1      .01          .01           .31
  Science*TASA                        1      .02          .02
  Error                                                   .01

*Significant at .05. **Significant at .01.
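The Type I (sequential) sums of squares reported in Table 8 are the reductions in residual sum of squares as each term enters the regression in a fixed order: ability, then test anxiety, then their interaction. A minimal sketch on simulated data follows; the variable scales and effect sizes are hypothetical, not the study's:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 225                                    # close to the seventh-grade error df
ability = rng.normal(50, 10, n)
anxiety = rng.normal(13, 5, n)
# hypothetical person-fit scores; not the study's data
fit_score = 0.004 * ability + 0.002 * anxiety + rng.normal(0, 0.12, n)

def rss(X, y):
    """Residual sum of squares from an ordinary least-squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ beta) ** 2))

one = np.ones(n)
# nested models, adding one term at a time in a fixed (Type I) order
designs = [
    np.column_stack([one]),
    np.column_stack([one, ability]),
    np.column_stack([one, ability, anxiety]),
    np.column_stack([one, ability, anxiety, ability * anxiety]),
]
rss_seq = [rss(X, fit_score) for X in designs]
# Type I SS for each added term = the drop in residual SS when it enters
type1_ss = [rss_seq[i] - rss_seq[i + 1] for i in range(3)]
print([round(v, 3) for v in type1_ss])
```

Because the models are nested, the three Type I sums of squares add up exactly to the overall model sum of squares, which is why the Model row in each panel equals the sum of the term rows.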
Table 9
Corrected Split-Half Reliability Estimates for Person-Fit Measures by Grade and Ability Test

                   MCI     PB     NCI     Rx²     ECI
Seventh Grade
  Reading          .56     .56    .49     .45
  Mathematics      .28     .28    .18     .11     .29
  Science          .25     .20    .22     .19     .25
Eighth Grade
  Reading          .35     .37    .31     .23     .33
  Mathematics      .29            .33     .39     .36
  Science          .25     .23    .26     .31     .25
reliable than the others. Overall, with the exception of the reading subtest, the reliabilities for person-fit indices on the mathematics and science subtests appear consistent across the various indices and are relatively low.

Summary

Results can be summarized as follows: 1. Descriptive statistics for person-fit measures, test anxiety, and ability scores were very similar across subject content areas and grades. 2. Intercorrelations between person-fit measures showed that these measures are highly related within the same subject content area; across subject areas, little or no relationship was found. 3. Correlations between person-fit measures and their corresponding total ability score ranged from |.00 to .50|. The PB index had the lowest correlation with total score. 4. Correlations between person-fit measures and test anxiety scores ranged from |.02 to .22|. In science, four of the five indices were significantly related to test anxiety scores for seventh graders. For eighth graders in reading, three of the five indices were significantly related to test anxiety. 5. A significant proportion of variance in person-fit measures was explained by the linear combination of test anxiety, ability, and their interaction for the seventh- and eighth-grade reading MCI index and for the eighth-grade science MCI, PB, NCI, and ECI. Regression lines depicting the nature of these interactions were presented for selected ability levels. Significant and meaningful R² values
(greater than R² = .10) for the Rasch person-fit index were found for science and mathematics, without any interaction between ability and test anxiety. 6. Corrected split-half internal consistency estimates for the person-fit indices ranged from .20 to .56.
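The corrected split-half estimates summarized above pair an odd-even split of the items with the Spearman-Brown correction r' = 2r / (1 + r). The computation can be sketched as follows, on a simulated 0/1 response matrix from a simple one-parameter model (the data, sample size, and item count are illustrative, not the study's):

```python
import numpy as np

def corrected_split_half(scores):
    """Odd-even split-half reliability with the Spearman-Brown
    correction r' = 2r / (1 + r); `scores` is examinees x items (0/1)."""
    odd = scores[:, 0::2].sum(axis=1)
    even = scores[:, 1::2].sum(axis=1)
    r = np.corrcoef(odd, even)[0, 1]
    return 2 * r / (1 + r)

# illustrative response matrix; not the study's data
rng = np.random.default_rng(1)
theta = rng.normal(size=200)                         # examinee abilities
b = rng.normal(size=40)                              # item difficulties
p = 1 / (1 + np.exp(-(theta[:, None] - b)))          # probability correct
responses = (rng.random((200, 40)) < p).astype(int)
print(round(corrected_split_half(responses), 2))
```

Note that splitting by the original sequential item number, as was done here, is what makes the estimate sensitive to speededness: on a timed test, adjacent unreached items inflate the agreement between halves.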
CHAPTER V
DISCUSSION

This study was conducted to investigate the relationship between examinees' level of person-fit and test anxiety, and to study the effect of ability on this relationship. Five person-fit indices were calculated for seventh- and eighth-grade students who had taken a test-anxiety self-report instrument and the reading, mathematics, and science subtests of the MAT. Discussion of results will focus on findings about (1) the interrelationships among the five person-fit indices, (2) the relationship between person-fit, test anxiety, and ability, and (3) the reliability of the five person-fit statistics under study.

Relationships Among Person-Fit Statistics

Intercorrelations among measures of person-fit were quite high within same-subject content areas. The correlation values ranged from |.78 to .99|. These high intercorrelations among person-fit statistics confirm previous research findings by Harnisch and Linn (1981) and Rudner (1983). In particular, Harnisch and Linn found intercorrelations among the MCI, PB, and NCI that ranged from |.93 to .97| for mathematics tests and from |.89 to .95| for reading tests. Rudner's intercorrelations among the MCI, PB, NCI, and the Rx² for the simulated SAT test ranged from |.80 to .99| and
from |.77 to .99| for the simulated teacher-made biology test. These consistently high intercorrelations among the person-fit indices under study indicate that they seem to be measuring a common construct. Relatively low correlations were found for any given index (MCI, PB, NCI, Rx², and ECI) across the reading, mathematics, and science tests. These correlations ranged from .03 to .24. There seems to be no persistence of misfit across the different tests. These results call into question the notion that a tendency to misfit is a stable trait which consistently manifests itself in examinee performance across various subject areas. Frary's (1982) correlations of the same person-fit index across several tests were also very low. These results lead to two practical points to consider when interpreting person-fit results: if an examinee is identified as having poor person-fit by one index, another index will probably identify him or her as a misfit on the same test; but it cannot be concluded that he or she will also be a misfit on another test.

Relationship Between Person-Fit, Ability, and Test Anxiety

Low to moderate correlation values were obtained between person-fit indices and their corresponding total ability scores. The Rasch person-fit index had the highest correlations with total mathematics and science ability scores. For the reading test, the MCI had the highest correlation with total score, but this correlation was positive, whereas a negative correlation would have been expected. The reading subtest was not typical of a power test, since most examinees did not finish all items. More able examinees probably
attempted more items, passing end items that were more "difficult" than some items which they had missed earlier in the test. This caused more able students to get higher misfit classifications, as can be seen in Figure 1. The only person-fit index that did not seem to be affected by the speededness of the reading test was the Rx². Although its relationship to total reading score was low, it was in the direction expected. The PB index had the lowest correlation with total scores. Contrary to the present findings, Harnisch and Linn (1981) found that the PB was one of the indices with a high relationship to total score on the reading test (.63). The relationship of the PB to the mathematics total score in the Harnisch and Linn study was nevertheless somewhat lower (.28) than for some other indices. Correlation values between person-fit measures and test anxiety were relatively low, ranging from |.02 to .22|. The Rx² index had the highest correlations with test anxiety. The only case in which the Rx² correlated lowest with test anxiety was the seventh-grade reading test. This low correlation is ascribed to the speeded nature of the reading test, which was more noticeable at the seventh grade. These correlation values are disappointingly low from a practical perspective; however, these observed correlations may have been attenuated by the extremely low reliability of the person-fit measures and by having a general or trait measure of test anxiety instead of a state-specific measure. As an exploratory exercise, one could speculate about the nature of this relationship if person-fit could be more reliably measured. A correction for attenuation method (Magnusson, 1966,
p. 148) was used to estimate what the correlation between person-fit and test anxiety would be if these measures were perfectly reliable. The corrected correlations between person-fit measures and test anxiety ranged from |.03 to .44|. Even with the correction for attenuation, most of these correlations are still relatively low, and the present study offers little evidence that higher reliabilities can be achieved for person-fit measures. In order to explore the extent to which variance in person misfit can be explained by a linear combination of ability level, test anxiety, and their interaction, this linear multiple regression model was tested for each person-fit statistic at each grade level and for each ability measure. Because of the large sample size, a number of fairly small R² values were statistically significant, so only cases where there was a meaningful percentage of variance explained (larger than 10%) were considered. The significant interactions between ability and test anxiety demonstrate that ability levels moderate the relationship between person-fit measures and test anxiety. For lower-ability examinees, a direct relationship between person-fit measures and test anxiety was found; for higher-ability examinees, this relationship was inverse. Figures 1 and 2 present the two general pictures of these interactions. Figure 1 shows the relationship between the modified caution index and test anxiety, as measured by the TASA, for seventh-grade examinees with different reading levels. The nature of the interaction is the same as previously described, but higher-ability students have higher overall MCI values (more misfit). For all
the other significant interactions, which were for person-fit measures of science at the eighth grade, lower-ability students have higher values of misfit. Figure 2 shows this for the MCI. The difference found between these two content areas might be due to the performance of this sample on the reading subtest. More than 10 percent of the seventh graders missed the last fifteen items of the reading test, making it appear more like a speeded test than a power test. Higher-ability examinees probably attempted more end items, increasing their probability of getting higher-misfit classifications. The cognitive-attentional theory of test anxiety, the Spence-Taylor drive theory, and previous person-fit research findings suggest interpretive explanations of the interaction results. According to Tobias (1980) and Weinstein, Cubberly, and Richardson (1982), high-test-anxious students will perform worse on difficult material than low-test-anxious individuals, while with easier material little difference between anxiety levels is expected. In reporting results about cognitive coping behavior and anxiety, Houston (1982) suggests that "highly trait-anxious (and test anxious) individuals tend to lack organized ways of coping with stress and instead ruminate about themselves and the situation in which they find themselves" (p. 198). Since high-test-anxious students' performance on more difficult items is more affected by their anxiety, as these examinees reach items with a level of difficulty that approximates their ability level, they will have a harder time coping. Other testing strategies, such as test-wiseness, would not be readily available due to lack of concentration. Examinees with high ability levels and low test
anxiety could take advantage of test-taking skills and answer items correctly beyond their ability level. Since these items would not be answered with the same degree of certainty as easier items, more deviation from the expected response pattern could occur and higher misfit would result. For examinees with high ability and high test anxiety, coping and test-taking skills could be blocked, making them consistently miss harder items; lower misfit values would occur for this group. Examinees with lower ability and lower test anxiety would answer items correctly up to the point where they reach items at their ability level and then miss the harder items. Some misfit could occur due to attempts at harder items. For examinees with low ability and high test anxiety, distracting thoughts might interfere with performance on almost all items. Even easy items (relative to their ability level) could be missed; this sporadic answering pattern would classify this group as high misfits. These interaction effects between ability and test anxiety seem to appear more consistently for science, especially at the eighth grade. One possible explanation is that the standardized science test fits the curriculum less well than the reading or mathematics tests. The mean item difficulty for the science test is lower, indicating a harder test. Examinees taking the science examination might find themselves in a more ambiguous, and hence more anxiety-producing, situation. These findings are primarily of theoretical interest to those who may be interested in learning more about the constructs of test anxiety or person-fit. At best, the combination of ability, test anxiety, and their interaction appears to account for only about one-fourth of the variance in person-fit indices, and the increments in R²
obtained by using the interaction term in the regression model were small; the interaction term accounted for only about two percent of the R² values.

Reliability of Person-Fit Measures

Corrected odd-even split-half reliability estimates of the person-fit indices were low, ranging from .20 to .56. These internal consistency estimates were highest for the person-fit indices computed for the seventh-grade reading test. Part of the reason for these higher reliabilities could be the speeded nature of the reading test at this grade level, since the original sequential item number was used to split the test into odd-even subtests. Magnusson (1966) cautions against using split-half methods on timed tests. He states, "the time limit has the effect that in reliability computations with split-half methods the test's reliability tends to be overestimated" (p. 114). Frary (1982) analyzed the internal consistency of person-fit indices and also found low, and even negative, split-half coefficients. His findings led him to conclude that unexpected responses to a small number of items contributed to high misfit classifications and that there was little consistency in the specific items contributing most to misfit. These findings call into question the notion of person-fit as a stable trait that can be reliably measured, and they also question the potential utility of these indices. Frary summarizes this concern when he suggests "that use of person-fit measures for any decision-making purpose, especially with respect to individual
examinees, should be undertaken only with extreme caution and that substantial additional research will be required before they can be used routinely" (Frary, 1982, p. 17).

Summary

Intercorrelations among person-fit statistics were quite high within same-subject content areas, but not across tests in different subjects. It can be generalized that if an examinee is identified as having poor person-fit by one index, another index will probably identify this examinee as a misfit on the same test, but it cannot be expected that this examinee will also be a misfit on another test. Significant interactions between ability and test anxiety demonstrate that ability levels moderate the relationship between person-fit measures and test anxiety. For lower-ability examinees, a direct relationship between person-fit measures and test anxiety was found; for higher-ability examinees, this relationship was inverse. The cognitive-attentional theory of test anxiety and the Spence-Taylor drive theory suggest interpretive explanations for these interaction results. These findings are of theoretical interest to those interested in learning more about the constructs of test anxiety and person-fit. At best, the combination of ability, test anxiety, and their interaction appears to account for only about one-fourth of the variance in person-fit indices. Internal consistency (split-half) reliabilities were relatively low, and the present study offers little evidence to support the notion that higher reliability of person-fit indices could be achieved.
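The correction for attenuation applied earlier in this chapter (Magnusson, 1966, p. 148) divides an observed correlation by the square root of the product of the two measures' reliabilities. A small sketch follows; the particular values are illustrative choices in the range reported, not the study's exact figures:

```python
import math

def disattenuate(r_xy, r_xx, r_yy):
    """Correlation corrected for attenuation:
    r_corrected = r_xy / sqrt(r_xx * r_yy)  (Magnusson, 1966)."""
    return r_xy / math.sqrt(r_xx * r_yy)

# e.g., an observed person-fit/anxiety correlation of .22, a person-fit
# reliability of .25, and an anxiety-scale reliability of .90 (all illustrative)
print(round(disattenuate(0.22, 0.25, 0.90), 2))  # -> 0.46
```

As the example shows, even a very unreliable person-fit measure roughly doubles, rather than transforms, the observed correlation, which is why the corrected values remain modest.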
According to these results, the potential uses of person-fit indices are questionable at this time. More research is needed before person-fit indices can be recommended as a routine measure in achievement testing.
REFERENCES

Alpert, R., & Haber, R.N. (1960). Anxiety in academic achievement situations. Journal of Abnormal and Social Psychology, 61, 207-215.

Donlon, T.F., & Fischer, F.E. (1968). An index of an individual's agreement with group-determined item difficulties. Educational and Psychological Measurement, 28, 105-113.

Frary, R.B. (1982). A comparison among person-fit measures. Paper presented at the annual meeting of the American Educational Research Association, New York.

Frary, R.B., & Giles, M.B. (1980). Multiple-choice test bias due to answering strategy variations. Paper presented at the annual meeting of the National Council on Measurement in Education, Boston, Mass.

Gaier, E.L., & Lee, M.C. (1953). Pattern analysis: The configural approach to predictive measurement. Psychological Bulletin, 50, 140-148.

Guttman, L. (1941). The quantification of a class of attributes: A theory and method of scale construction. In P. Horst, P. Wallin, & L. Guttman (Eds.), The prediction of personal adjustment. New York: Social Science Research Council, Committee on Social Adjustment.

Guttman, L. (1950). The basis for scalogram analysis. In S. Stouffer, L. Guttman, E. Suchman, P. Lazarsfeld, S. Star, & J. Clausen (Eds.), Measurement and prediction (Vol. 4). Princeton, N.J.: Princeton University Press, 60-90.

Hambleton, R.K., & Cook, L.L. (1977). Latent trait models and their use in the analysis of educational test data. Journal of Educational Measurement, 14, 75-96.

Harnisch, D.L. (1983). Item response patterns: Applications for educational practice. Journal of Educational Measurement, 20, 191-206.

Harnisch, D.L., & Linn, R.L. (1981). Analysis of item response patterns: Questionable test data and dissimilar curriculum practices. Journal of Educational Measurement, 18, 133-146.
Harper, F.B.W. (1974). The comparative validity of the Mandler-Sarason Test Anxiety Questionnaire and the Achievement Anxiety Test. Educational and Psychological Measurement, 34, 961-966.

Heinrich, D.L., & Spielberger, C.D. (1982). Anxiety and complex learning. In H.W. Krohne & L. Laux (Eds.), Achievement, stress, and anxiety. Washington, D.C.: Hemisphere.

Helwig, J.T., & Council, K.A. (Eds.). (1979). SAS user's guide, 1979 edition. Raleigh, N.C.: SAS Institute.

Houston, K. (1982). Trait anxiety and cognitive coping behavior. In I.G. Sarason & C.D. Spielberger (Eds.), Achievement, stress, and anxiety. Washington, D.C.: Hemisphere.

Kane, M.T., & Brennan, R.L. (1980). Agreement coefficients as indices of dependability for domain-referenced tests. Applied Psychological Measurement, 4, 105-126.

Levine, M.V., & Rubin, D.B. (1979). Measuring the appropriateness of multiple-choice test scores. Journal of Educational Statistics, 4, 269-290.

Lord, F.M., & Novick, M.R. (1968). Statistical theories of mental test scores. Reading, Mass.: Addison-Wesley.

Magnusson, D. (1966). Test theory. Reading, Mass.: Addison-Wesley.

Mandler, G., & Sarason, S.B. (1952). A study of anxiety and learning. Journal of Abnormal and Social Psychology, 47, 166-173.

Prescott, G.A., Balow, I.H., Hogan, T.P., & Farr, R.C. (1978). Teacher's manual for administering and interpreting Metropolitan Achievement Tests (Advanced 1, Forms JS and KS). New York: The Psychological Corporation.

Rudner, L.M. (1982). Individual assessment accuracy. Paper presented at the annual meeting of the American Educational Research Association, New York.

Rudner, L.M. (1983). Individual assessment accuracy. Journal of Educational Measurement, 20, 207-220.

Sarason, I.G. (1960). Empirical findings and theoretical problems in the use of anxiety scales. Psychological Bulletin, 57, 403-415.

Sarason, I.G. (1963). Test anxiety and intellectual performance. Journal of Abnormal and Social Psychology, 66, 73-75.
Sarason, I.G. (1972). Experimental approaches to test anxiety: Attention and the use of information. In C.D. Spielberger (Ed.), Anxiety: Current trends in theory and research (Vol. 2). New York: Academic Press.
Sarason, I.G. (1975). Anxiety and self-preoccupation. In I.G. Sarason & C.D. Spielberger (Eds.), Stress and anxiety (Vol. 2). Washington, D.C.: Hemisphere.

Schmitt, A.P., & Crocker, L. (1982). Test anxiety and its components for middle school students. Journal of Early Adolescence, 2, 267-275.

Spielberger, C.D. (1966). Theory and research on anxiety. In C.D. Spielberger (Ed.), Anxiety and behavior. New York: Academic Press.

Spielberger, C.D. (1971). Trait-state anxiety and motor behavior. Journal of Motor Behavior, 3, 265-279.

Spielberger, C.D., Gonzalez, H.P., Taylor, C.J., Algaze, B., & Anton, W.D. (1978). Examination stress and test anxiety. In C.D. Spielberger & I.G. Sarason (Eds.), Stress and anxiety (Vol. 5). New York: Hemisphere/Wiley.

Tatsuoka, K., & Linn, R.L. (1981). Indices for detecting unusual item response patterns in personnel testing: Links between direct and item-response-theory approaches (Research Report 81-5). Urbana: University of Illinois, Computer-Based Education Research Laboratory.

Tatsuoka, K., & Linn, R.L. (1983). Indices for detecting unusual patterns: Links between two general approaches and potential applications. Applied Psychological Measurement, 7, 81-96.

Tatsuoka, K., & Tatsuoka, M.M. (1980). Detection of aberrant response patterns and their effect on dimensionality (Research Report 80-4). Urbana: University of Illinois, Computer-Based Education Research Laboratory.

Tobias, S. (1980). Anxiety and instruction. In I.G. Sarason (Ed.), Test anxiety: Theory, research, and applications. Hillsdale, N.J.: Lawrence Erlbaum Associates.

Tryon, G.S. (1980). The measurement and treatment of test anxiety. Review of Educational Research, 50, 343-372.

Van der Flier, H. (1977). Environmental factors and deviant response patterns. In Y.H. Poortinga (Ed.), Basic problems in cross-cultural psychology. Lisse, Netherlands: Swets & Zeitlinger.

Van der Flier, H. (1982). Deviant response patterns and comparability of test scores.
Journal of Cross-Cultural Psychology, 13, 267-298.

Weinstein, C.E., Cubberly, W.E., & Richardson, F.C. (1982). The effects of test anxiety on learning at superficial and deep levels of processing. Contemporary Educational Psychology, 7, 107-112.
Wine, J.D. (1971). Test anxiety and direction of attention. Psychological Bulletin, 76, 92-104.

Wine, J.D. (1980). Cognitive-attentional theory of test anxiety. In I.G. Sarason (Ed.), Test anxiety: Theory, research, and applications. Hillsdale, N.J.: Lawrence Erlbaum Associates.

Wright, B.D., Mead, R., & Bell, S. (1979). BICAL: Calibrating items with the Rasch model (RM-23b). Chicago: University of Chicago.

Wright, B.D., & Panchapakesan, N.A. (1969). A procedure for sample-free item analysis. Educational and Psychological Measurement, 29, 23-48.
BIOGRAPHICAL SKETCH

Alicia P. Schmitt was born in Havana, Cuba, on September 28, 1952. She immigrated to the United States in 1961 and moved to Puerto Rico in 1962. In 1970 she graduated from high school and entered the University of Puerto Rico, graduating in 1973 and 1975 with a bachelor's and a master's degree in psychology. From 1975 to 1977 she worked as Evaluation Coordinator for Project Follow Through and taught evening courses at the University of Puerto Rico. Alicia moved to Gainesville in 1977 and began her doctoral program at the University of Florida in the fall of 1978. While in graduate school she held a variety of assistantships. For a period of two years, she served as research consultant for the Research Clinic of the College of Education. As part of her duties in other assistantships, she developed and taught the Independent Study by Correspondence course in Measurement and Evaluation in Education; assisted in measurement, research, and statistics courses in the College of Education; and worked as a research assistant for a graduate school dean. In 1983 she became Assistant Director of Testing and Evaluation for the Office of Instructional Resources, University of Florida. In this capacity she administered the College of Education Basic Skills Testing Program and analyzed institutional studies used for educational planning. Alicia has accepted a position with Educational Testing Service and will be moving to Princeton, New Jersey.
I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation, and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Linda M. Crocker, Chair
Professor of Foundations of Education

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation, and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

James J. Algina
Associate Professor of Foundations of Education

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation, and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Professor of Psychology
This dissertation was submitted to the Graduate Faculty of the College of Education and to the Graduate School and was accepted as partial fulfillment of the requirements for the degree of Doctor of Philosophy. December, 1984 Dean for Graduate Studies and Research

