Permanent Link: http://ufdc.ufl.edu/UF00100878/00001
Material Information
Title: Effect of immediate item feedback on the reliability and validity of verbal ability test scores
Physical Description: Book
Language: English
Creator: Benson, Gordon Guy Edward (Author, Primary)
Stoker, Howard W., 1925-
King, F. J., 1927-
Beard, Jacob G.
Kalin, Robert
Publisher: College of Education, The Florida State University
Place of Publication: Tallahassee, Fla.
Publication Date: 1979
Copyright Date: 1979
Record Information
Bibliographic ID: UF00100878
Volume ID: VID00001
Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: All rights reserved by the source institution and holding location.













THE FLORIDA STATE UNIVERSITY


COLLEGE OF EDUCATION



THE EFFECT OF IMMEDIATE ITEM FEEDBACK

ON THE RELIABILITY AND VALIDITY

OF VERBAL ABILITY TEST SCORES



by

GORDON GUY EDWARD BENSON



A Dissertation submitted to the Department of
Educational Research, Development, and Foundations
in partial fulfillment of the
requirements for the degree of
Doctor of Philosophy



Approved:




Professor Directing Dissertation








Department Head


December, 1979


Copyright © 1979
Gordon Guy Edward Benson
All rights reserved.















THE EFFECT OF IMMEDIATE ITEM FEEDBACK
ON THE RELIABILITY AND VALIDITY
OF VERBAL ABILITY TEST SCORES

(Publication No. )

Gordon Guy Edward Benson, Ph.D.
The Florida State University, 1979

Major Professor: Dr. Howard Stoker


The purpose of this study was to investigate the effects of immediate item feedback (knowledge of results) on the reliability and validity of total test scores. Two types of feedback were studied: partial feedback (knowledge of correctness obtained by means of one attempt per item) and full feedback (knowledge of the correct response obtained by means of one attempt per item). Total feedback, or knowledge of the correct response obtained by answering until correct, was not involved.

Much of the previously published research on immediate item feedback appeared to be in need of larger sample sizes, and many designs did not appear to be capable of isolating the effects of feedback on mean test scores and reliability and validity coefficients of the test administered under feedback conditions. Their results were possibly confounded by using different response devices, time limits, numbers of attempts per item, and scoring strategies in the treatment and control groups.

Nine junior high schools in a large urban-suburban school district in the southeastern United States were selected using a stratified, random sampling procedure. Ninth grade students were assigned to cells in a 3 x 3, treatment-by-ability design and were tested on an adapted version of the SCAT-3B Verbal using Trainer-Tester response devices. Total scores of 2,023 students were analyzed with a non-orthogonal ANOVA procedure and Scheffé comparisons. KR-20 reliability coefficients were analyzed using a k-sample test developed by Hakstian and Whalen (1976), and validity correlations with a subsequent reading achievement measure were analyzed with the usual tests for Pearson correlations.
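
For reference, KR-20 is the Kuder-Richardson internal consistency coefficient for dichotomously scored items. The sketch below computes it from a subjects-by-items matrix of 0/1 responses; it is a minimal illustration with invented data, not the analysis performed in this study, and the Hakstian and Whalen (1976) k-sample test is not reproduced here.

    import numpy as np

    def kr20(X):
        """KR-20 internal consistency for a subjects-by-items 0/1 matrix."""
        k = X.shape[1]                         # number of items
        p = X.mean(axis=0)                     # proportion correct per item
        q = 1.0 - p
        total_var = X.sum(axis=1).var(ddof=1)  # variance of total scores
        return (k / (k - 1.0)) * (1.0 - (p * q).sum() / total_var)

    # Invented responses: 200 examinees, 28 items (the adapted SCAT had 28 items).
    rng = np.random.default_rng(0)
    ability = rng.normal(size=(200, 1))
    X = (ability + rng.normal(size=(200, 28)) > 0).astype(int)
    print(round(kr20(X), 2))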

Statistically significant main effects were found for treatment and ability, and the interaction was also significant. Examination for simple main effects indicated consistently lower means for the nonfeedback groups across ability levels, and except for a reversal within the low ability level, full feedback means were generally lower than those for partial feedback. Differences in reliability coefficients among the three treatment groups were statistically significant (partial feedback was greater than no feedback, which was greater than full feedback), while the validity coefficients for partial and no feedback were significantly greater than that obtained for full feedback.

While a wealth of statistically significant findings were obtained, many of these significant differences were small. Criteria for judging educational, or practical, significance were discussed in terms of effect sizes (Cohen, 1969) and increased test length. After adopting suggested criteria, only one finding was judged to be educationally significant: for low ability students, there was a substantial increment in mean verbal ability scores in favor of full feedback over no feedback. Otherwise this study failed to show any substantial benefit or harm in students receiving knowledge of results while taking tests similar to those used in the study. The relevance of the study to previous research and suggestions for further research were also discussed.















Acknowledgements


Deepest gratitude is expressed to Dr. Howard Stoker, who directed the writing of this dissertation. The concern and care with which he has guided my studies and promoted my welfare have been unending. He has been a tremendous source of encouragement, while also being a true friend. I have been fortunate to know him and to have the opportunity to work with him.

Many other faculty, friends and coworkers have had a part in shaping my dissertation and my intellectual growth during my doctoral studies. Drs. F. J. King, Jacob Beard, and Robert Kalin are thanked for their assistance in serving on my committee, along with other faculty in their respective departments who have stimulated me professionally. Special thanks to Dr. Kalin for introducing me, as an undergraduate, to educational measurement and for later encouraging me to pursue the profession in graduate school.

Special thanks are also extended to Dr. John Hills, who contributed many hours of his time in thought-provoking discussions, who suggested the topic of this dissertation, and who in friendship accompanied me to the West Campus on numerous occasions.

My appreciation is also extended to Richard Neville, who assisted me in obtaining appropriate response devices, and to Educational Testing Service, which kindly permitted me to adapt and reproduce form 3B of the School and College Ability Tests. The approval, participation and encouragement of the professionals in the schools and administration of the Duval County School Board are acknowledged.

No lesser amount of appreciation and thanks is due to my wife Jeannie. Her assistance and encouragement in my educational and professional pursuits are irreplaceable. The amount of time and skill spent in editing and typing various drafts of this manuscript is evidence of her substantial contribution to the completion of this paper.
















TABLE OF CONTENTS


ABSTRACT
ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES

Chapter

  I.  INTRODUCTION

 II.  REVIEW OF THE LITERATURE
        Synopses of Feedback Contrasts
        Response Devices
        Summary

III.  STATEMENT OF THE PROBLEM

 IV.  METHOD
        Subjects
        Instruments
        Procedures

  V.  RESULTS

 VI.  DISCUSSION
        Practical Significance
        Threats to Validity
        Relevance
        Conclusions
        Recommendations

REFERENCES

VITA














List of Tables


Table
  1  Results of a Feedback Study by Hanna (1974) Comparing the Reliability and Validity of Partial and Inferred Number Right Scoring
  2  Results of a Feedback Study by Hanna (1975) Comparing the Reliability and Validity of Partial and Inferred Number Right Scoring
  3  Results of a Feedback Study by Hanna (1977) Comparing the Reliability and Validity of Total, Partial, and No Feedback
  4  Summaries of 13 Feedback Studies
  5  Numbers of Students Tested, Valid and Invalid, at Each School
  6  Statistics for the Dalenius and Hodges Procedure for Determining Ability Group Boundaries Using the 1978 Total Reading Scores on the Stanford Achievement Test
  7  Numbers of Subjects, Total Reading Scaled Score Means and Standard Deviations on the 1978 Administration of the Stanford Achievement Test by Treatment and Ability Group
  8  Percentages of Subjects by Treatment Group Who Attempted SCAT Items 26 to 50
  9  Numbers of Subjects, Raw Score Means, and Standard Deviations on the 28-Item SCAT by Treatment and Ability Group
 10  Analysis of Variance for the 28-Item SCAT
 11  Probability Levels for Statistically Significant Differences Between Pairs of Treatment Means
 12  Within Cells Reliability and Validity Coefficients and Standard Errors
 13  Response Percentages and Means for Questionnaire Questions by Treatment Groups















List of Figures


Figure
  1  Interaction between feedback and ability















Chapter I

Introduction


The purpose of this study was to investigate the effect of two kinds of immediate test item feedback on total test scores. The feedback was administered to examinees following their response to an item by informing them of the adequacy of their response prior to their responding to the next item.

Two categories of feedback referred to in the measurement literature can be called knowledge of correctness and knowledge of correct response. When examinees respond to an item and are informed only whether they are correct or incorrect before responding to the next item, then they are receiving feedback which can be called knowledge of correctness. However, if examinees who answer incorrectly are also informed of the correct answer to the item prior to continuing to the following item, then this type of feedback could be termed knowledge of the correct response.

Hanna (1974, 1975, 1976, 1977) utilized the terms partial feedback and total feedback. Partial feedback and knowledge of correctness are synonymous; however, Hanna defines total feedback as knowledge of the correct response wherein the examinee who is incorrect on the first response to an item continues to respond to that item until the correct answer is selected. Thus, Hanna's definition of total feedback is a combination of knowledge of the correct response and a response procedure, answer until correct, which is unconventional because in a paper and pencil mode examinees usually receive only one opportunity to respond to each item regardless of their apparent correctness on that attempt.

Knowledge of the correct response can be fed back to examinees by allowing the usual one attempt per item. In order to differentiate this from Hanna's total feedback, knowledge of the correct response with one attempt per item could be termed full feedback.
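
The distinctions among these conditions can be summarized operationally. The sketch below is a minimal illustration of the four testing conditions as defined above (no feedback, partial, full, and total feedback); the function name and the representation of an item are invented for this example.

    from enum import Enum

    class Feedback(Enum):
        NONE = "no feedback"          # conventional testing
        PARTIAL = "partial feedback"  # knowledge of correctness, one attempt
        FULL = "full feedback"        # knowledge of correct response, one attempt
        TOTAL = "total feedback"      # knowledge of correct response, answer until correct

    def message_after_attempt(correct_option, response, condition):
        """Return what the examinee learns after one attempt under each condition."""
        right = (response == correct_option)
        if condition is Feedback.NONE:
            return None
        if condition is Feedback.PARTIAL:
            return "correct" if right else "incorrect"
        if condition is Feedback.FULL:
            return "correct" if right else f"incorrect; answer was {correct_option}"
        if condition is Feedback.TOTAL:
            # Under total feedback an incorrect examinee keeps responding to the
            # same item until correct, so a single attempt reveals only correctness.
            return "correct" if right else "incorrect; try again"

    print(message_after_attempt("B", "C", Feedback.FULL))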

Hanna and other researchers have studied the effects of total and partial feedback on test scores by comparing means, reliability coefficients, and validity coefficients for total, partial and no feedback groups. But there have been only two investigations of full feedback. Utilizing special answer sheets, Strang and Rust (1973) reported that subjects who received full feedback scored, on the average, significantly lower than subjects who received no feedback. No other comparisons were reported. Betz and Weiss (1976a) found that on a computer-administered, conventional test, full feedback subjects scored significantly higher than the no feedback group, but the reliability and validity coefficients were unaffected by the feedback.

The object of this study was to fill a void that existed in the research on the immediate knowledge of results of noncomputerized tests by comparing means, reliabilities and validities for full, partial and no feedback groups. The study also investigated the effect of feedback on the reliability and validity of test scores of examinees within different ability levels.

Researchers have been designing and experimenting with individual devices which give examinees immediate knowledge of test results since 1915 (Pressey, 1946). The principal purpose for testing with these devices "has been the pursuit of learning that may accompany such testing" (Hanna, 1976, p. 202). However, while three basic conditions for learning--contiguity of stimulus and response, practice, and feedback--have been recognized (DeCecco, 1968, pp. 290-295), there seems to be no agreement on the optimal time differential between the learner's response and the emission of feedback.

When examinees receive partial feedback, or knowledge of correctness, each examinee receives the feedback at approximately the same time--immediately following the first attempt. But when examinees receive total feedback, or knowledge of the correct response by answering until correct, the only examinees who receive immediate knowledge of the correct response are those who are correct on the first attempt. Those who respond incorrectly merely receive immediate knowledge of correctness, and then they must continue to respond to the same item until a correct attempt is made in order to receive knowledge of the correct response.

The researchers who have studied the effects of total feedback on test scores have noted that examinees who receive total feedback utilize a different response procedure when they answer until correct, as opposed to the single attempt permitted examinees who receive partial feedback or no feedback. Examinees who answer until correct are allowed multiple attempts to respond to each item, whereas examinees are otherwise permitted one attempt per item. But these researchers have failed to state in their reports that the difference in response modes was a factor that could have had a confounding effect on the results, making it difficult, if not impossible, to draw a valid conclusion about the effect of knowledge of the correct response.

Other possible sources of confounding in most of the studies on the effects of immediate item feedback are the response devices utilized by the feedback and nonfeedback groups. Except for the research reported by Strang and Rust (1973) and Betz and Weiss (1976a), each of the remaining studies involved the use of uncommon response devices for the treatment groups which were not used by the control groups. The treatment groups usually responded with punchboards or specially constructed answer sheets which emitted feedback, while the control groups answered the items in a conventional manner. It is impossible to determine whether the significant treatment effects found in some of these studies are due to the effect of feedback, the effect of the special response devices, or a combination of both effects.

Therefore, this study was designed to eliminate these possible sources of confounding. All examinees included in the study responded to each item in the same manner. All examinees were allowed but one attempt at each item and utilized the same type of response device. The devices were different only in that they emitted either one of the two kinds of feedback or no feedback.

The psychological effects of immediate item feedback are not well documented. In a 1956 survey of the effects of knowledge of performance that included immediate item feedback, Ammons (1956) generalized that knowledge of performance affected motivation, usually increasing it, and that students receiving feedback while taking tests preferred this procedure over other examination procedures.

When taking multiple choice tests under normal circumstances, examinees receive some feedback when they determine an answer to an item and find that their answer is included among the item's alternatives. However, if the item was well constructed with attractive foils, then the examinees may be easily mistaken about their assumed correctness.

Formally providing the examinee with immediate item feedback should not affect the level of motivation unless the feedback indicated results contrary to the examinee's expectations. Examinees who expect to do poorly and examinees who expect to do well should not be surprised if their expectations are confirmed. But if examinees expect their answers to be correct or expect them to be incorrect and are told the opposite, then that knowledge may have a psychological effect on the examinees and an unknown effect on the test scores. Should the effect be inconsistent with what the test was measuring, then it could inadvertently affect the test's reliability and subsequently attenuate validity correlations.














Chapter II

Review of the Literature


The history of immediate test item feedback can be traced to the early utilization of self-scoring devices in programmed instruction (Face, 1964). Although he introduced a self-scoring, teaching-testing device to the public in 1926 (Face, 1964), Pressey had been designing and experimenting with individual, learner-operated machines which gave immediate knowledge of test results since 1915 (Pressey, 1946). The original apparatus was approximately as large as a portable typewriter with a small window for viewing test items (Pressey, 1926). When the examinee responded to an item by depressing a key, a new item moved into view and a counter kept a record of the number of correctly answered items (Pressey, 1926). The machine could also be adjusted so that it would not proceed to the next question until the correct answer had been selected (Face, 1964). Hence, examinees could be required to answer until correct.

In 1950, Pressey introduced a punchboard in the form of a card, three inches by five inches and -inch thick (Face, 1964). The examinee responded to each item by punching a pencil through a piece of paper, and if the response was the correct one, the pencil point continued down into a hole in the punchboard (Pressey, 1950). Thus the examinee was immediately informed as to the correctness of each response. If the answer was incorrect, additional attempts could be made until the correct alternative was selected, thus informing the examinee of the correct answer prior to responding to the next item.

Other teaching-testing machines and punchboards were developed by Angell (1949), Skinner (1954, 1958), and by Briggs and other researchers in the Armed Forces (Face, 1964). Latent image answer sheets or cards which gave knowledge of results by means of special inks, removable tabs, or erasable shields were available before 1930 (Face, 1964).

Prior to 1969, the research on the effects of feedback conducted by Pressey and others was concentrated on the study of feedback as an aid in learning. Subjects were assigned to one of two groups; one group took one or more tests or quizzes utilizing a special feedback device which would allow examinees to answer until correct, while the other group answered the same items in a conventional, paper-and-pencil mode. Both groups were subsequently examined again using conventional means. Typically, the feedback group scored significantly higher on the second testing than the no feedback group, and the difference in group means was attributed to the additional learning that resulted when the special feedback devices were utilized (Angell, 1949; Pressey, 1950).

Since 1969, very few studies involving feedback have investigated its effects on learning. Most of the research on feedback in the last decade has been concentrated on determining what happens to the scores of the examinees on a test during which feedback is received following each item. Means and reliability and validity coefficients have been compared for groups tested with and without feedback. Some investigators also attempted to determine what happened to the attitudes and the anxiety levels of the examinees who were given immediate item feedback. Earlier research on feedback indicated that students preferred the feedback devices (Pressey, 1950) even though they felt more nervous during testing (Angell, 1949).

Spencer and Barker (1969) investigated the effects of feedback on the test scores of students taking a junior high school biology course. A 60-item pretest was administered to six classes of biology students two days prior to Christmas vacation. Students in three of the classes used a punchboard device and were allowed one attempt at answering each item correctly. The remaining three classes of students were tested using conventional answer sheets.

Two days following their vacation, and without the control group receiving any formal knowledge of their performance on the pretest, all classes were retested utilizing conventional answer sheets. All scores on the pretest and posttest consisted of the total number of items answered correctly.

The results indicated that the feedback group received a significantly lower pretest mean score but a significantly higher posttest mean score than the control group. In addition, the experimental group gained 12 points, while the control group lost three points. Although Spencer and Barker (1969) concluded that this "obvious gain indicates that the test served as a learning tool" (p. 5), they were nevertheless cautious about the novelty of the punchboard for the feedback group.

Burgess (1970) divided an 80-item final examination in an extension psychology course into two 40-item subtests of approximately equal difficulty. The 23 students taking the course were separated into two groups. Both groups answered one subtest using a punchboard and the other subtest with a conventional answer sheet. However, the groups differed in the order in which the subtests were administered. While the punchboard permitted the examinees to receive immediate feedback, Burgess did not indicate explicitly whether they were allowed to answer until correct. It was implied from the report that each examinee attempted each item only once.

The results were combined across groups for each subtest. The reliability coefficients for the conventional and punchboard subtests were .59 and .58, respectively. The mean score on the punchboard subtest was significantly lower. There were no significant differences between examinee groups on the punchboard subtest or the conventional subtest. Seventy percent of the students felt that the inability to change responses made in error was a serious handicap, but most felt that the feedback was an aid in learning. Some students found the immediate feedback useful, but others said that immediate knowledge of their errors was upsetting and threatening. Burgess (1970) concluded that the lower mean scores on the punchboard subtest may have indicated that the punchboard "made the test more difficult" (p. 146).

Montor (1970) used a Trainer-Tester answering device on which examinees responded to each item by erasing a shield corresponding to the option they selected. Beneath the shield was revealed a symbol identifying the correctness of the answer.

The subjects in this study were two classes of United States Naval Academy midshipmen taking an introductory course in psychology and leadership. The two classes were administered, in the following order, a pretest, 12 individual quizzes, a surprise retest on all 12 quizzes combined into one test, and a midterm examination. For the administration of the 12 individual quizzes to the treatment class, the examinees answered until correct using a Trainer-Tester device. However, in order to make comparisons with the control class, right-wrong scoring was accomplished by inferring the number right on the first attempt from the erasures made on the Trainer-Tester. The answers to the quiz questions were discussed in both classes after the class had finished each quiz.

The results showed no significant difference between the two classes' means on the pretest, the total score across quizzes, the surprise retest or the midterm. Montor (1970) concluded that "it wasn't possible to prove that the Trainer-Tester was worthwhile in this particular application" (p. 436) and that in order to be effective in reinforcing learning, immediate feedback may have to be accompanied by knowledge of why a response is correct or not.

Heald (1970) administered a 40-item multiple choice test to three groups of 18 students each. The students were taking a course in educational administration, and the test was their midterm examination. One group served as a control, while partial feedback was received by the other two groups, one of which received a coded message for each incorrect response that referenced text passages for further study after the test. The feedback groups used Trainer-Testers while the control group used conventional answer sheets.

Each of the 54 students was pretested with the Sarason Test Anxiety Scale and was classified as high or low. Within anxiety levels, students were randomly assigned to treatment groups for the midterm. The students who received partial feedback with text referencing scored significantly higher than the control group, while the scores of the students who received only partial feedback were not significantly different from either of the other groups. There was no interaction with anxiety level.

Students were retested with the same test two weeks later in order to determine effects on learning. Both feedback groups scored significantly higher than the no feedback group. There was no interaction with anxiety, but an analysis of gain scores was conducted: students who received feedback with text referencing, regardless of anxiety level, and low anxiety, feedback-only students made a significant improvement, but high anxiety, feedback-only students did not.

Gilman and Ferry (1972) administered a 66-item, four-option, multiple choice test on test construction principles to 54 graduate students in education. All students used a Trainer-Tester answer sheet and were told to answer until correct. The answer sheets were scored twice: first by inferring, from the erasures on the answer sheet, the number of items each examinee answered correctly on the first response to each item, and second by counting the total number of erasures required to answer all items correctly. The former was a dichotomous scoring system while the latter constituted partial scoring.

The mean and standard deviation for the partial scoring method were much higher than for the inferred number right. The odd-even item correlations were .87 and .66, respectively. When the Spearman-Brown correction was applied, these reliability estimates were .93 and .79, respectively.
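
The Spearman-Brown correction for an odd-even split projects the half-test correlation to full test length, r' = 2r / (1 + r). A few lines of Python reproduce the corrected values reported above to within rounding:

    # Spearman-Brown correction applied to odd-even split-half correlations.
    def spearman_brown(r, factor=2):
        """Projected reliability when test length is multiplied by `factor`."""
        return factor * r / (1 + (factor - 1) * r)

    for r in (0.87, 0.66):
        print(spearman_brown(r))  # about 0.93 and 0.80 (reported as .93 and .79)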

The specific reason for the increase in reliability was unclear. The answer until correct procedure essentially lengthened the test by requiring multiple responses to items answered incorrectly. Increasing test length with similar items generally increases reliability estimates and variability simultaneously. But partial scoring can increase or decrease the variability of total scores depending on the method used to award partial credit.

Suppose Gilman and Ferry had used a 3-2-1-0 partial scoring scheme to award points to examinees answering items correctly on the first, second, third, and fourth attempts, respectively. Except for irrelevant special cases, the total score variance would be greater than that found when the items were dichotomously scored. However, if a 1-2/3-1/3-0 scheme were used, that variance would (with the same exceptions) be less than that resulting from the dichotomy. Since these partial scoring schemes are linear transformations of each other, correlation coefficients calculated between the total scores and a third variable, or among items, or among split-halves, would be unaffected by whichever of the partial schemes is used. Hence, increasing the variance by using the former scheme would not alone account for the increase in reliability discovered by Gilman and Ferry; the latter scheme would have decreased the variance, and the same increment in reliability would have been found. The larger reliability coefficient attributable to partial scoring indicated that a larger proportion of total variance could be accounted for through consistent discrimination among examinees who responded incorrectly on the first attempt.
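
The invariance argument is easy to verify numerically: the 1-2/3-1/3-0 scheme is exactly one third of the 3-2-1-0 scheme, so the rescaling changes variances but leaves Pearson correlations untouched. A minimal sketch with invented attempt counts:

    import numpy as np

    rng = np.random.default_rng(1)
    # Invented data: attempts (1-4) needed by 50 examinees on each of 20 items.
    attempts = rng.integers(1, 5, size=(50, 20))

    points_a = 4 - attempts          # 3-2-1-0 scheme
    points_b = points_a / 3.0        # 1-2/3-1/3-0 scheme: a linear rescaling
    totals_a = points_a.sum(axis=1)
    totals_b = points_b.sum(axis=1)
    criterion = totals_a + rng.normal(0.0, 5.0, size=50)  # arbitrary third variable

    print(totals_a.var(ddof=1), totals_b.var(ddof=1))     # variances differ ninefold
    print(np.corrcoef(totals_a, criterion)[0, 1])         # identical Pearson r ...
    print(np.corrcoef(totals_b, criterion)[0, 1])         # ... under either scheme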

Three separate groups of subjects were involved in a study conducted by Beeson (1973): 30 students in a college mathematics class for elementary school teachers, 15 students in remedial college mathematics, and 30 students in a junior high school general mathematics class. Each class took several examinations. Within each group, subjects were assigned to one of two subgroups for the duration of the experiment. One subgroup within each group received knowledge of correctness feedback on either the first or second half of each examination by using a punchboard. The other subgroups always received the opposite treatment.

On each of ten examinations across the three groups, no significant difference in mean scores of the feedback and no feedback groups was found. However, a final examination was also administered to the group of elementary school teachers under the experimental conditions, and a statistically significant difference was found in favor of the feedback subgroup. Beeson suggested that the one significant result could have been due to the longer final examination, but because of the large number of individual statistical tests, the difference could have occurred by chance alone.
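
Beeson's caution reflects the familywise error rate: with many tests each run at alpha = .05, the chance of at least one spurious "significant" result grows as 1 - (1 - alpha)^k when the tests are independent (a simplifying assumption here, since the study's tests were not strictly independent):

    # Probability of at least one false positive among k independent tests.
    alpha = 0.05
    for k in (1, 5, 10, 11):
        print(k, round(1 - (1 - alpha) ** k, 2))
    # With 11 tests (ten examinations plus a final), the rate is roughly 0.43.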

Wentling (1973) studied the effects of both partial and total feedback. A mental ability test was administered to 116 male high school students enrolled in six automotive mechanics classes. Students were classified into two ability groups based upon the results. Within the high and low ability levels, students were assigned to one of six cells in a two-by-three, instructional strategy by feedback design.

The instructional strategies were mastery and nonmastery. For each instructional unit, the student completed an instructional booklet and was tested. Nonmastery students received a unit grade and continued on to the next unit, while the mastery students were required to attain an 80% level of mastery in order to continue. Those in the mastery group who did not obtain an 80% level of mastery were recycled as many as three times and retested with parallel forms of the unit tests.

The feedback conditions were total feedback (answer until correct), partial feedback (one attempt with knowledge of correctness) and no feedback. The total and partial feedback conditions were accomplished by means of the same chemically treated answer sheet, with a minor variation in instructions. The no feedback condition was obtained by having students respond to the test items in a conventional manner.

Achievement on a final posttest, with all groups examined under the same conditions, showed that the mean achievement score for partial feedback was greater than that for no feedback, which was greater than that for total feedback (significant at the .07 level). Achievement was also significantly higher for the high ability and mastery groups. There was no significant interaction involving achievement.

In addition to the final achievement measure, an instrument which purportedly measured attitude toward instruction was administered on the same day. There was no difference in attitude between the mastery and nonmastery groups. The high ability group had a better attitude than the low ability group (significant at the .05 level). The partial feedback group had a better attitude than the no feedback group, which was better than the total feedback group (significant at the .001 level). There was no significant interaction involving attitude.








In the research of Strang and Rust (1973), 158 students in three sections of an introductory course in human growth and development took one of two 25-question multiple choice tests in which items had been matched for difficulty. On this test, 132 students answered more than 50% of the questions correctly and were assigned to one of four cells in a two-by-two, feedback by instructions design.

Those scoring higher than 50% took a second test using an answer sheet which necessitated erasing a spot for each response. All students were allowed one attempt per item. Students receiving no feedback uncovered a dot beneath the spot regardless of which option was selected. If a student receiving feedback answered an item correctly, a plus sign was revealed beneath the spot; otherwise, the letter of the correct answer was uncovered. In this way, the examinee could receive knowledge of correctness and knowledge of the correct response without having to answer until correct.

On the second test, two types of instructions were given to the students. One group was told that the test would count toward the final grade, while another group was told that the test would not count. A nervousness rating scale was also administered following the second test.

Using the scores on the first test as a covariate, Strang and Rust indicated that the feedback group's adjusted mean on the second test was lower than the no feedback group's adjusted mean (significant at the .05 level). There was no main effect for instruction and no interaction.
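
An analysis of covariance of this form adjusts the second-test means for first-test differences before testing the factors. As a rough sketch of how such a design might be analyzed today (the column names and data are hypothetical, and this is not the authors' analysis code):

    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    # Hypothetical frame: one row per student, with the first-test score
    # (the covariate), the second-test score, and the two factors.
    df = pd.DataFrame({
        "pretest":      [18, 20, 15, 22, 17, 21, 16, 19],
        "posttest":     [17, 21, 13, 22, 15, 22, 14, 20],
        "feedback":     ["full", "none"] * 4,
        "instructions": ["counts"] * 4 + ["nocount"] * 4,
    })

    # ANCOVA: second-test score adjusted for the pretest, with feedback,
    # instructions, and their interaction as factors.
    model = smf.ols("posttest ~ pretest + C(feedback) * C(instructions)",
                    data=df).fit()
    print(sm.stats.anova_lm(model, typ=2))

The F test for C(feedback) in this layout corresponds to the comparison of adjusted means that Strang and Rust reported.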

Analysis of the nervousness rating scale resulted in higher levels of nervousness for the group that was told that the test would count (significant at the .025 level) and for the feedback group (significant at the .01 level), but no interaction was detected.

Evans and Misfeldt (1974) used a different kind of latent image feedback answer sheet which was duplicated on a ditto machine. While the answer sheets were being prepared, an invisible mark was made on one of the options for each item. This mark was revealed when the examinee recorded the correct answer on the sheet by means of a special pen.

A 50-item test was administered twice to 52 students in an undergraduate educational psychology class at two successive class meetings. Half of the students recorded their answers on a conventional machine-scorable answer sheet, while the other half of the class was told to use the latent image answer sheet and answer until correct. Later, on the final examination for the course, each student used the latent image answer sheet.

Split-half reliability estimates with the Spearman-Brown correction were lowest for the conventional answer sheet, .67 on the midterm and .69 on the retest. On the same test, a partial scoring procedure used with the latent image answer sheets yielded consistently higher reliability estimates (.84 and .92) when compared to the same responses scored dichotomously (.71 and .73). On the final examination, the partial scoring reliability was .67, in comparison to .63 for the right-wrong scoring. Evans and Misfeldt (1974) concluded that the incremental reliability in favor of partial scoring was due to the increased variability under the partial scoring procedure; however, no statistics other than the reliability estimates were presented.

Hanna (1974) utilized Trainer-Tester answer sheets and answer until correct instructions to study their effects on the reliability and validity of multiple choice tests. The subjects involved were 38 undergraduate students taking a course in educational psychology.

The instruments involved were 11 quizzes each with 10 multiple choice items, two course papers, an 82-item interpretive exercise, and a 50-item, multiple choice final examination. Each of the items on the quizzes and final examination had four options, and these tests were administered with Trainer-Testers and answer until correct directions. The answer sheets were scored twice: once using a partial scoring scheme based upon the number of attempts necessary to answer each item correctly, and dichotomously by inferring the correctness of each first attempt at an item from the number of erasures made.
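
Both scores can be derived from the number of erasures each examinee made on each item. A minimal sketch (the 3-2-1-0 point weights are an assumption for illustration; Hanna's exact partial-credit weights are not restated here):

    import numpy as np

    # attempts[i, j] = attempts examinee i needed on item j, read from erasures.
    attempts = np.array([[1, 2, 1, 4],
                         [1, 1, 3, 2],
                         [2, 1, 1, 1]])

    inr = (attempts == 1).sum(axis=1)     # inferred number right: first-try items
    partial = (4 - attempts).sum(axis=1)  # assumed 3-2-1-0 partial-credit weights

    print(inr)      # [2 2 3]
    print(partial)  # [ 8  9 11]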

For each quiz and the final examination, the means, standard deviations, and corrected odd-even reliability coefficients were calculated for the partial and dichotomous scores. In addition, as measures of validity, the partial and dichotomous scores on the tests were correlated statistically with the corresponding scores on the interpretive exercise.

The results are shown in Table 1. For the quizzes, only the mean values were reported. No indications of statistical significance were given. However, the reliability and validity coefficients were consistently in favor of partial scoring, with the larger differences occurring for quiz reliabilities and validity correlations of the final examination with the interpretive exercise and the first paper. In a subsequent article, Hanna (1975) commented that "these criterion-related validity comparisons revealed a slight to substantial superiority" (p. 176), and "the consistent, but statistically nonsignificant, findings were interpreted to suggest that the AUC method merited further study" (p. 176).

Table 1

Results of a Feedback Study by Hanna (1974) Comparing the
Reliability and Validity of Partial and Inferred Number Right Scoring

                                    Validity correlations
Scoring       M      SD    Ra     Exercise  Paper 1  Paper 2

Quizzes
  INR        7.4    1.5   .18       .31       .22      .23
  Partial   26.1    2.7   .25       .33       .24      .25

Final
  INR       34.8    6.0   .76       .31       .22      .15
  Partial  125.8   11.0   .77       .42       .32      .19

aOdd-even, split-half reliability coefficient with Spearman-Brown correction.

Subsequently, Hanna (1975) replicated his previous study, reported in 1974. The procedures were the same, but a new textbook was introduced into the undergraduate educational psychology class. The quizzes were revised, and one of the course papers was replaced by an essay section on the final examination, which also contained a new 30-item multiple choice subtest.

As before, the quizzes and the multiple choice portion of the final examination were administered with a Trainer-Tester and answer until correct instructions. These tests were scored using dichotomous (right-wrong) and partial scoring methods, and the two scores were correlated statistically with the interpretive exercise, course paper, and essay section of the final examination.

The results (see Table 2) were generally consistent with Hanna's previous study in that most of the differences between the two scoring procedures were statistically nonsignificant. However, the reliability of the partial scoring procedure was significantly higher, and the validity correlations for the partial scores with the course paper and the essay subtest were significantly lower, than the same values determined for the right-wrong, inferred number right.

Table 2

Results of a Feedback Study by Hanna (1975) Comparing the
Reliability and Validity of Partial and Inferred Number Right Scoring

                                   Validity correlations
Scoring       M      SD    Ra     Exercise   Paper   Essay

Quizzes
  INR        7.2    1.6   .20       .22       .33     .32
  Partial   25.6    2.8   .20       .20       .31     .30

Final
  INR       19.7    3.0   .44       .24       .46     .51*
  Partial   74.4    5.5   .55       .22       .37     .43

aOdd-even, split-half reliability coefficient with Spearman-Brown correction.
*p < .05. **p < .01.








Hanna (1975) noted that before one becomes too concerned about these validity findings, consideration must be given to the subjectivity involved in the scoring of the paper and the essay, which tends to lower the reliability of these measures. It was also noted that the reliability results for the multiple choice subtest of the final examination were consistent with the results of Gilman and Ferry (1972).

Hanna (1976) developed three tests of equivalent content and difficulty which were designed to measure upper elementary school students' ability to interpret science, social studies and mathematics data in graphic and tabular formats. Two of the tests consisted of 18 completion items, while the third was composed of an equal number of multiple choice items, each having four alternatives.

Three experimental conditions were of interest for administering the multiple choice test: total feedback (answer until correct), partial feedback (knowledge of correctness, one attempt per item), and no feedback. Those subjects who received feedback used a Trainer-Tester answer sheet, while those receiving no feedback utilized a conventional answer sheet.

The three tests were administered to 1,391 fifth and sixth grade students in 15 schools in six Kansas school districts in the following order: a completion pretest, a multiple choice test under one of three treatment conditions, and a completion posttest. Each student completed the three tests in a single sitting, with additional time allowed for those in the two feedback groups to complete the multiple choice test. Each test was timed separately with a comfortable time limit, but based upon the results of a pilot study, Hanna decided to allow the partial and total feedback groups 25% and 50% longer, respectively, than the no feedback group. Over 90% of the students in each group attempted the last item.

The pretest scores were used to match 389 triads of students among feedback conditions so that within group means and standard deviations were equal. Within each feedback condition, and also based upon their pretest scores, the 389 students were separated into ability groups of approximately equal frequency.

In order to ascertain the effect of feedback on learning, the posttest scores were examined by analysis of variance in a 3 by 3 by 2, feedback by ability by sex design. Statistically significant main effects were found for ability and feedback. Overall, the no feedback group scored lower than each of the total and partial feedback groups.

Two significant interactions were also found. Within ability groups, posttest scores were higher for the high ability students who had received partial feedback, lower for medium ability students who had received no feedback, and higher for low ability students who had received total feedback on the multiple choice test. Male students who had received no feedback scored lower than other males, while there were no differences among groups of female students.

Subsequently, Hanna (1977) reported corrected, split-half reliability coefficients for the multiple choice test and validity correlations between the multiple choice test and the posttest under the three feedback conditions. The total feedback responses were scored using a partial scoring method. All others were scored dichotomously. The reliability of the test when administered with total feedback was significantly higher than without feedback (see Table 3). There were no significant differences among the validity coefficients.

Hanna (1976) concluded that within limits the effects of feedback on learning had been demonstrated in this study, and the results helped generalize earlier studies performed with much older subjects (p. 205). Hanna (1977) further concluded that the higher reliability obtained from the answer until correct procedure with partial scoring was consistent with other researchers' findings (p. 7).

Table 3

Results of a Feedback Study by Hanna (1977) Comparing the
Reliability and Validity of Total, Partial, and No Feedback

Group     Scoring      M       SD     Ra    Validity

Total     Partial    32.55    9.23   .82*     .66
Partial   NR         11.02    3.70   .75      .65
Control   NR         11.07    3.22   .70      .65

aOdd-even, split-half reliability coefficient with Spearman-Brown correction.
*p < .05.

The research of Betz and Weiss (1976a) involved high and low ability students in a feedback study in which the test items were administered by computer. The high ability group consisted of 239 students taking an introductory psychology course in the College of Liberal Arts at the University of Minnesota, while the low ability group consisted of 111 students taking various psychology courses in the General College at the same university.

Three tests were constructed from a pool of multiple choice verbal ability items: an adaptive test, a conventional 50-item test and a conventional 44-item posttest. Within ability groups, students were assigned to one of four cells in a test (adaptive or conventional) by feedback (knowledge of correct response or no feedback) design. Those receiving feedback were allowed one attempt per item, and after responding to an item, the examinee was informed as to the correctness of the answer. If incorrect, the examinee was told the correct alternative.

Students completed either the adaptive test or the conventional test, with or without feedback. Then the conventional posttest was administered without feedback to determine whether there were any carry-over effects from the treatments. Each test administration was controlled by a computer.

On the 50-item conventional test and on the adaptive test, there were significant main effects for ability and feedback, but no interactions. The feedback means were higher than the no feedback means. The reliability coefficients for the 50-item conventional test under feedback and no feedback conditions were not significantly different (.89 and .91, respectively). The corresponding validity correlations were likewise not significantly different (.69 and .76, respectively). The analysis of posttest scores yielded only a significant main effect for ability, apparently indicating that the feedback and test treatments had no effect on subsequent verbal ability test performance.
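
Comparisons like these (e.g., .69 versus .76 for independent groups) are typically made by converting each correlation with the Fisher z transformation and testing the difference of the transformed values. A small sketch, with hypothetical group sizes since the per-cell counts are not restated here:

    import numpy as np
    from scipy import stats

    def compare_correlations(r1, n1, r2, n2):
        """Two-tailed z test for two independent Pearson correlations."""
        z1, z2 = np.arctanh(r1), np.arctanh(r2)    # Fisher z transform
        se = np.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
        z = (z1 - z2) / se
        return z, 2 * stats.norm.sf(abs(z))

    # Hypothetical ns of 90 per group for the correlations reported above.
    z, p = compare_correlations(0.69, 90, 0.76, 90)
    print(round(z, 2), round(p, 3))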

Betz and Weiss (1976a) concluded that their study demonstrated that the use of feedback "can lead to significant increases in ability test scores" (p. 28). However, their generalization may be limited to computer-administered tests where examinees are not permitted to return to omitted items.

Several investigators of the effects of feedback have coincidentally asked their subjects to respond to additional questions in an attempt to determine the psychological and attitudinal effects of feedback. Some research has indicated that examinees liked receiving feedback while being tested (Pressey, 1950; Angell, 1949; Betz and Weiss, 1976b) and wanted to continue taking tests that way (Pressey, 1950; Angell, 1949). Many students preferred the use of feedback as a learning aid over other teaching devices (Pressey, 1950; Burgess, 1970), while one researcher concluded that students receiving feedback on formative evaluation instruments had a significantly better attitude toward instruction (Wentling, 1973).

Angell (1949), Burgess (1970) and Strang and Rust (1973) indicated that feedback made examinees more nervous or upset from the immediate knowledge of their errors on the test. However, Betz and Weiss (1976b) concluded that even though some examinees were bothered by the knowledge of an incorrect response, the examinees' responses to questions designed to gauge their anxiety and motivation levels indicated no differences for the full feedback and nonfeedback groups.

Synopses of Feedback Contrasts

The preceding research has been reorganized according to types of comparisons. Each type of comparison was summarized separately. Only the research on the effects of feedback on test scores and their reliability and validity has been included.









Type I Comparison: Total Feedback-Partial Scoring with Total Feedback-Inferred Number Right Scoring

Gilman and Ferry (1972), Evans and Misfeldt (1974) and Hanna (1974, 1975) reported the results of studies where a group of examinees were administered one or more tests with instructions to answer until correct, and the tests were scored utilizing two scoring procedures, partial scoring and the inferred number right. Mean test scores were not compared, since any reasonable partial scoring procedure will result in scores equal to or greater than the scores derived from the inferred number correct.

In each of the four studies, the reliability of the scores derived from the partial scoring procedure was greater than the reliability of the inferred number right scores. In only one of the four studies was the difference not statistically significant (Hanna, 1974).

To compare the validity of the respective scoring procedures, Hanna (1974, 1975) correlated the test scores with subjects' scores on three other measures that had been administered and scored in a traditional manner. It was found that each of the three criterion measures correlated higher with inferred number right scores than with partial scores. In one study, none of the six differences in pairs of validity correlations was significant (Hanna, 1974), while in the second study two out of six differences were significant (Hanna, 1975).









Type II Comparison: Total Feedback-Partial Scoring with No Feedback

Relatively few researchers have compared the test scores of a no feedback group with those of a group of subjects that received knowledge of correct response by means of an answer until correct procedure with partial scoring. Evans and Misfeldt (1974) and Hanna (1977) found the reliability of the scores of the feedback group to be significantly higher than the reliability of the no feedback group.

As a validity criterion, Hanna (1977) administered a completion test to both the feedback and no feedback groups and correlated the scores on the two measures within groups. The difference between the correlation coefficients was not significant.

To indicate the effect of feedback on learning, Hanna (1976) reported that the completion test mean was significantly higher for those low and medium ability students who received feedback. There was no difference for high ability students.

Type III Comparison: Total Feedback-Inferred Number Right Scoring with No Feedback

Montor (1970) compared the mean inferred number right score of a knowledge of correct response group that had answered until correct with the mean of a nonfeedback control group and found the difference to be nonsignificant. Evans and Misfeldt (1974) compared reliability coefficients and reported no significant differences.


Montor (1970) and Wentling (1973) investigated the

effect of feedback on learning. Montor (1970) found no

subsequent achievement gain from having received feedback,

but Wentling (1973) concluded that the feedback group scored

significantly lower than the no feedback group on his

learning measure.

Type IV Comparison: Full Feedback with No Feedback










One of two studies in this category was conducted by

Strang and Rust (1973). The subjects in both groups

utilized answer sheets wherein the examinee responded to an

item by erasing a shield in the appropriate answer space.

For the treatment group, the erasure revealed a plus sign if the correct alternative had been selected; if incorrect,

the letter corresponding to the correct alternative was

revealed. Erasure of a shield on the control group's answer

sheets revealed a dot, regardless of the correctness of the

selected alternative. Thus, each examinee responded only

once per item with the treatment group receiving full feed-

back and the control group receiving no feedback. Unfortu-

nately the study was limited to a comparison of means. The

full feedback group mean was found to be significantly lower

than the no feedback group mean.

Betz and Weiss (1976a) administered a verbal ability

test by computer to both feedback and nonfeedback groups.

The mean ability score for examinees who received knowledge

of correct response was significantly higher than that of

the no feedback examinees. There was no significant differ-

ence between the two groups' reliability and validity coef-

ficients, nor was there any learning difference.

Type V Comparison: Total Feedback-Partial Scoring with

Partial Feedback

Evidently only one researcher has investigated this

comparison. Hanna (1977) found that the knowledge of









correct response feedback group had scores with significantly higher reliability and nonsignificantly

higher validity coefficients than the scores of the knowl-

edge of correctness group.

Hanna (1976) found a disordinal interaction of feed-

back and ability on learning. The high ability group that

had answered until correct had significantly lower scores

on the learning measure, while the opposite result was

found for the low ability students. There was no difference

for the medium ability group.

Type VI Comparison: Total Feedback-Inferred Number Right

Scoring with Partial Feedback

Only one study was classified into this category. In

a study on the effects of feedback on learning, Wentling

(1973) found that the mean score on a subsequent test of

examinees who had received knowledge of correct response was

significantly lower than the mean score of examinees who had

received knowledge of correctness.

Type VII Comparison: Full Feedback with Partial Feedback

No research was found concerning this comparison.

Type VIII Comparison: Partial Feedback with No Feedback

Spencer and Barker (1969), Burgess (1970), Heald

(1970), Beeson (1973), and Hanna (1977) compared test score

means for the groups of interest in this category of the

research on feedback. Spencer and Barker (1969) and Burgess

(1970) found significant differences favoring the no feed-









back groups, while Heald (1970) and Beeson (1973) found

that only one of two and one of eight comparisons, respectively, was significant, but in the opposite direction. Hanna

(1977) found the difference to be nonsignificant.

Burgess (1970) and Hanna (1977) compared reliability

coefficients for their groups and found the differences to

be nonsignificant. Hanna (1977) compared criterion-related

validity coefficients and found the difference to be nonsig-

nificant. He also discovered another interaction of feed-

back and ability on learning. The high and medium ability

students who received feedback scored significantly higher

on the learning measure than the students who did not

receive feedback, while there was no difference for low

ability students (Hanna, 1976). The learning studies of

Spencer and Barker (1969), Heald (1970), and Wentling (1973)

had results consistent with Hanna (1976) in that students

who received knowledge of correctness scored significantly

higher than the students who received no feedback.

Response Devices

In the research conducted by Spencer and Barker (1969),

Burgess (1970), and Beeson (1973), the treatment groups

received feedback by using a punchboard. The feedback

groups in the study by Betz and Weiss (1976a, 1976b) received their

immediate knowledge of results by means of a computer-

assisted testing system.

In each of the remaining studies, feedback was received










by the treatment groups through the use of latent image

answer sheets. This kind of device concealed the feedback

message by a shield which was either erased with an ordi-

nary pencil eraser or was dissolved by special fluid that

was applied to the answer sheet by a pen.

In all except two of the studies, the no feedback

control group responded in a conventional pencil and paper

mode. Strang and Rust (1973) and Betz and Weiss (1976a, 1976b)

utilized the same response device with both the treatment

and the control groups, thus controlling for a response

device effect that could have confounded the results of the

other research.

Summary

Table 4 summarizes 13 feedback studies in which the researchers examined the scores of tests administered while examinees received immediate item feedback.

While some of these studies included research on the effects

of feedback on learning, it should not be inferred that all

such studies have been included in this review. This latter

topic was not of primary interest in this investigation.

Therefore, knowledge of the results of the research on

learning was thought to be of lesser importance in this

regard.

In the research of Evans and Misfeldt (1974), Gilman

and Ferry (1972), and Hanna (1974, 1975), groups of exami-

nees had received total feedback. Their responses were scored dichotomously, by inferring the number correct on the first attempt, and by using a partial scoring scheme.













Table 4

Summaries of 13 Feedback Studies

Author            Subjects                Feedback; Device; Scoring     Principal Conclusions

Beeson (1973)     30 elem ed majors;      Partial vs. none;             One of eight mean comparisons
                  15 col rem math stu;    punchboard vs. conv; NR       significant, favoring feedback
                  30 JHS gen math stu

Betz & Weiss      350 undergrad stu       Full vs. none; CRT; NR        Feedback mean higher; reliability,
(1976a, 1976b)                                                          validity, and learning differences ns

Burgess (1970)    23 col stu, psych       Partial vs. none;             No feedback mean higher;
                                          punchboard vs. conv; NR       reliability difference ns

Evans &           52 col stu, ed psych    Total vs. none;               Partial-scoring reliability higher
Misfeldt (1974)                           conv and pen;                 than inferred number right and no
                                          NR, INR, Part                 feedback, which did not differ

Gilman & Ferry    54 grad stu, ed         Total; erase;                 Partial-scoring reliability higher
(1972)                                    Part vs. INR                  than inferred number right

Hanna (1974)      38 col stu, psych       Total; erase;                 Partial-scoring reliability higher
                                          Part vs. INR                  (ns); all six validity differences ns

Hanna (1975)      85 col stu, psych       Total; erase;                 Partial-scoring reliability higher;
                                          Part vs. INR                  two of six validity differences
                                                                        significant, favoring INR

Hanna (1976,      1391 grade 5 and 6      Total vs. partial vs.         Total-partial reliability highest;
1977)             students                none; erase and conv;         partial vs. none means, reliability,
                                          Part, NR                      and validity ns; feedback-by-ability
                                                                        interactions on learning

Heald (1970)      54 stu, ed admin        Partial vs. none;             One of two mean comparisons
                                          erase vs. conv; NR            significant, favoring feedback;
                                                                        learning favored feedback

Montor (1970)     2 classes, psych        Total vs. none;               Mean and learning differences ns
                                          erase vs. conv; INR, NR

Spencer &         6 classes, JHS biol     Partial vs. none;             No feedback mean higher;
Barker (1969)                             punchboard vs. conv; NR       learning favored feedback

Strang & Rust     158 col students        Full vs. none;                Full feedback mean lower
(1973)                                    erase for both groups; NR

Wentling (1973)   116 male HS shop        Total vs. partial vs.         On the learning measure, partial
                  students                none; pen and conv;           higher than none and none higher
                                          INR, NR                       than total

Note. NR = number right scoring; INR = inferred number right scoring; Part = partial scoring; conv = conventional answer sheet; erase and pen = latent image answer sheets (erasure and soluble types); CRT = computer-assisted testing terminal; ns = not statistically significant.










The results of these studies seem to indicate that if the

test is administered with total feedback and if the test is

fairly reliable in its own right, then partial scoring will

result in an increment in reliability, but possibly also a

decrement in predictive validity. This does not imply good

or bad effects of feedback as Hanna (1975) seems to have

concluded, since different feedback treatments were not

compared. All subjects received total feedback, but the

responses of each subject were scored in two ways, and only

the effects of the two methods of scoring were compared.

Most of the research in which one or more different feedback groups were studied suffered from a lack of reasonable sample sizes. In only two studies (Betz and Weiss, 1976a, 1976b; Hanna, 1976, 1977) did the total number

of subjects across groups exceed 160. Due to the small

proportion of statistically significant and/or consistent

findings, one wonders whether there really is a treatment

effect, except perhaps where partial scoring is utilized, or

whether there is insufficient statistical power to detect

it.

It also appears to be difficult to compare total feed-

back, regardless of scoring procedure, with a group receiving

another type of feedback or with a nonfeedback group.

Researchers have reported that it takes much longer to










administer a test with total feedback than under any of the

other conditions mentioned. This time increase is necessary

in order to provide each group with a comfortable time

limit. But are the effects, if detected, a result of

receiving knowledge of the correct response, of having to

answer until correct, of the differential time limits, or of

a combination of these? The answer cannot be determined

from the research found.

The effects of the answering devices on the results of

the research reported here cannot be determined. Spencer

and Barker (1969) were concerned "that the punchboard was

novel and unique to the experimental group and may have

resulted in greater degrees of interest toward the exami-

nation" (p. 5). If one group receives feedback with special

answer sheets and the nonfeedback group does not use the

special answer sheets, the sole source of any treatment

effects cannot be discerned. In the research of Betz and

Weiss (1976a, 1976b) and Strang and Rust (1973), the effects

of different answering devices were controlled. The differ-

ences in means were in opposite directions, and Betz and Weiss (1976a) found no difference in either the reliability

or the validity coefficients.

There are many other problems with the research on

feedback. While several of the researchers used students

in college psychology classes, others involved students from

grade five to graduate school who were taking courses in










mathematics, biology, elementary education, and educational

administration. In one study, the college within the uni-

versity in which the student was enrolled was used to

determine ability level. But in most of the research, the

tests that were used were classroom tests. Considering all

of the variations involved, it is small wonder that the

results of the 13 studies are generally inconsistent.














Chapter III

Statement of the Problem



The examination reported herein of the research on the effects of feedback on test scores located only two investigations of the effect of full feedback on test scores. Strang and Rust (1973) indicated

that the knowledge of correct response resulted in lower

average test scores, while Betz and Weiss (1976a) concluded that the effect on mean scores was incremental and that feedback had no effect on reliability or validity coefficients. Strang and Rust (1973)

did not report any other results.

The work of Betz and Weiss (1976a) and Strang and Rust

(1973) is important for another reason. They were appar-

ently the only investigators to require their nonfeedback control groups to utilize the same response devices as the feedback treatment groups. The subjects in the former

study responded to the test items on a computer terminal,

while the subjects in the latter study used erasable latent

image answer sheets.

No evidence was found to indicate that research had

been designed and conducted in order to compare (a) the

effects of full and partial feedback on mean test scores,










or on the reliability or validity of test scores, (b) the

effects of full feedback and no feedback on the reliability

or validity of the scores of a test wherein the examinees

responded to each item on the same, noncomputerized devices,

(c) the effects of partial and no feedback on mean test

scores, or on the reliability or validity of the scores of

a test wherein examinees responded to each item on the same

devices, (d) the effects for examinees of different ability

of full, partial and no feedback on the mean scores of a

test administered under the previously mentioned conditions,

or (e) the effects for examinees of different ability of

full, partial and no feedback on the reliability or validity

of the scores of a test administered under any conditions.

These were the problems investigated by this study.















Chapter IV

Method



Subjects

A total of 2911 ninth grade students were selected

from nine junior high schools in the public school system of

Jacksonville, Florida. The participating schools were

selected by a stratified sampling procedure. All of the

public junior high schools in Jacksonville were rank-ordered

on the basis of their 1978 average eighth grade score on the

Total Reading section of the Stanford Achievement Test.

Four strata were formed. Each stratum contained five

schools. The schools in the top three strata were classi-

fied as high, medium and low, while the schools in the

fourth stratum were eliminated as contenders for selection.

Over 50% of the students in this latter group of schools had

scored at or below the first quartile, and it was judged

that students at this achievement level were adequately

represented in the other three strata.

Within each stratum, the five schools were randomly

ordered and listed, and the principals at the first three

were contacted in order to obtain their cooperation. One

school could not participate; it was replaced by the next











school on the random list. Within each school, the prin-

cipal obtained the cooperation of three or four ninth grade

language arts teachers who were to administer a test with

special answer sheets to the students in their ninth grade

language arts classes. A total of 2448 students were

tested out of 2911 students in the designated classes.

Instruments

A test of basic verbal ability from Series II of the

School and College Ability Tests was selected to be admin-

istered with three variations of Trainer-Tester response

devices. Form 3B of SCAT-II was selected because ninth

grade students had participated in its norming, and it was

known to provide a reliable and valid measure of general

school ability which would correlate well with measures of

achievement. The test was also believed to be relatively

insensitive to moderate differences in school curricula.

Trainer-Tester response devices were readily available

for partial feedback and no feedback (Z4d and Z110A, re-

spectively). Response devices for full feedback (Z20A)

were developed especially for this study in cooperation

with the manufacturer of Trainer-Testers, VanValkenburgh,

Nooger, and Neville, Incorporated. The Verbal subtest of

SCAT consisted of 50, four-option, analogy items, but the

options were reordered so that the correct answers would

coincide with the response devices and a constant order of

options would be maintained across feedback groups.










The SCAT directions were modified to instruct exami-

nees to erase the box that went with their answer. Exami-

nees receiving full feedback were instructed as follows:

Each question begins with two words.

These two words go together in a certain way. Under

them, there are four other pairs of words lettered a,

b, c, and d. Find the lettered pair of words that go

together in the same way as the first pair of words.

Then find the row of boxes on your answer sheet which

has the same number as the question. In this row of

boxes, erase the box that goes with the letter of the

pair of words you have chosen. Erase that box until

you see two letters. The letter on the left is the

correct answer to that question. The letter on the

right has no meaning for this test.

The partial feedback directions were:

Each question begins with two words.

These two words go together in a certain way. Under

them, there are four other pairs of words lettered a,

b, c, and d. Find the lettered pair of words that go

together in the same way as the first pair of words.

Then find the row of boxes on your answer sheet which

has the same number as the question. In this row of

boxes, erase the box that goes with the letter of the

pair of words you have chosen. Erase that box until

you see one letter. If the letter is E, then your










answer is right. Any other letter T, H, or L is wrong.

Examinees receiving no feedback were told:

Each question begins with two words.

These two words go together in a certain way. Under

them, there are four other pairs of words lettered a,

b, c, and d. Find the lettered pair of words that go

together in the same way as the first pair of words.

Then find the row of boxes on your answer sheet which

has the same number as the question. In this row of

boxes erase the box that goes with the letter of the

pair of words you have chosen. Erase that box until

you see a three-digit numeral. Remember, when you can

see all of the numeral, that is your signal to stop

erasing and go on.

In addition, the directions included two example items.

In the first example, the selection of a correct answer was

depicted, while in the second example, the selection of a

wrong answer was illustrated.

A 10-item questionnaire was constructed to gauge the

examinees' attitudes toward the test questions, the response

devices, and feedback. The questionnaire items were devel-

oped from the comments reported by other researchers of this

topic.

The test booklet for each examinee consisted of di-

rections with example items, the test itself and either

seven or 10 questionnaire questions. The test booklet for










the nonfeedback examinees did not contain the three items

on feedback.

Two forms of the Advanced level of the Stanford

Achievement Tests were also used. Forms A and B are ad-

ministered each spring in the ninth and eighth grades,

respectively, in all public junior high schools in

Jacksonville. The Stanford Achievement Tests are designed

to measure a student's scholastic achievement, particularly

in reading and mathematics.

Procedures

Within each stratum of schools, one of the selected

schools was randomly assigned to each of the three treatment

groups--full feedback, partial feedback, and no feedback--

for the purpose of administering SCAT. Rather than assign

students, classrooms of students, or teachers of students

to treatment groups, schools were assigned to treatments

because the administration of the test was conducted by the

language arts teachers in their classes and it would have

been difficult to control the interchange of test booklets

and response devices under any other assignment procedure.

SCAT was administered during the first two weeks of

March, 1979, in the ninth grade language arts classes in

nine Jacksonville junior high schools. Even though it was

known that the use of Trainer-Tester response devices re-

quired a longer testing time than conventional devices

(for example, see Hanna, 1976), the 20-minute time limit was










not altered due to the length of the allotted class period.

Students were told that not everyone was expected to answer

all of the questions, but they should try to do as well as

possible. They were told that their scores would not count

against them as this was a field-test of some new and dif-

ferent answer sheets, and that many of them would probably

find the field-test to be interesting. Since the Trainer-

Tester devices were new to these students, they were allowed

to practice by answering question one on the Trainer-Tester

immediately prior to the 20-minute administration period

and without penalty. Immediately following the test, the

students responded to their questionnaires.














Chapter V

Results



The raw responses to the SCAT items and questionnaire

questions were keypunched onto cards and entered into a

computer. Students without a valid student number were not entered; the valid numbers were used to search computerized test records for students' reading scores from the Spring, 1978 and 1979 administrations of the Stanford Achievement

Test. Table 5 contains the numbers of students at each

school who were tested and who were deleted for having no valid student number or no valid test record for 1978.

The reading scores from 1978 were used to establish

three ability groups by combining the reading scores across

schools and utilizing the Dalenius and Hodges (1959) pro-

cedure for minimum variance stratification. Other methods

for determining ability groups were attempted and the

results were distressing. The overlapping groups method

failed because the three distributions from the low-,

medium-, and high-scoring schools all intersected at the

same point instead of two pairs of distributions, each pair

having a different point in common. Dividing the range of

scores by three resulted in low frequencies in the low and











Table 5

Numbers of Students Tested,

Valid and Invalid, at Each School

                                            Invalid

                                     Student        Reading
School        Tested     Valid       Number         Score

High
  A             285        227          11             47
  B             244        188           7             49
  C             257        217           2             38
Medium
  D             261        224           8             29
  E             332        260           4             68
  F             302        249           8             45
Low
  G             254        213           5             36
  H             259        218           5             36
  I             254        227           6             21

Total          2448       2023          56            369



high ability groups, while dividing the total number of

subjects by three resulted in little variance in the middle

group. The Dalenius and Hodges procedure appeared to be a

valid compromise.

To apply the Dalenius and Hodges procedure, the grouped

frequency distribution in Table 6 was used. The square root









Table 6

Statistics for the Dalenius and Hodges Procedure

for Determining Ability Group Boundaries using the 1978

Total Reading Scores on the Stanford Achievement Test

Interval      f      √f      cum √f

245-249 3 1.73 212.89

240-244 29 5.39 211.16

235-239 36 6.00 205.78

230-234 17 4.12 199.78

225-229 33 5.79 195.65

220-224 36 6.00 189.91

215-219 67 8.19 183.91

210-214 67 8.19 175.73

205-209 92 9.59 167.54

200-204 42 6.48 157.95

195-199 144 12.00 151.47

190-194 168 12.96 139.47

185-189 182 13.49 126.50

180-184 138 11.75 113.01

175-179 185 13.60 101.27

170-174 173 13.15 87.67

165-169 129 11.36 74.51

160-164 136 11.66 63.15

155-159 139 11.79 51.49

150-154 59 7.68 39.70

145-149 28 5.29 32.02









Table 6 (Continued)

Interval      f      √f      cum √f

140-144 45 6.71 26.73

135-139 25 5.00 20.02

130-134 22 4.69 15.02

125-129 9 3.00 10.33

120-124 7 2.65 7.33

115-119 5 2.24 4.69

90-114 6 2.45 2.45



of each frequency was determined and these values were cu-

mulated. The total square root frequency (212.89) was

divided by the desired number of ability groups (three) and

multiples of this quotient (70.96) were used in determining

the group boundaries. In Table 6, the cumulative square

root frequencies closest to these multiples, 70.96 and

141.92, were 74.51 and 139.47, corresponding to the inter-

vals 165-169 and 190-194, respectively. The upper limits

of these intervals distinguished the ability groups. Table

7 contains the numbers of students and the means and stan-

dard deviations on the classification test for each of the

nine feedback by ability cells. As could be expected, the

standard deviations for the medium ability cells were much

lower than the standard deviations for the high and low

ability cells.
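Expressed as an algorithm, the boundary search is brief. The following sketch (Python; the function name and data layout are illustrative, not part of the original procedure) applies the cumulative square root of f rule to the Table 6 frequencies and returns the two cutting scores, 169 and 194.

    import math

    def dalenius_hodges_boundaries(intervals, n_groups):
        # intervals: (upper score limit, frequency) pairs in ascending order
        roots = [math.sqrt(f) for _, f in intervals]
        total = sum(roots)                        # 212.89 for Table 6
        cums, running = [], 0.0
        for r in roots:
            running += r
            cums.append(running)
        step = total / n_groups                   # 70.96 for three groups
        boundaries = []
        for k in range(1, n_groups):
            target = step * k                     # 70.96, then 141.92
            nearest = min(range(len(cums)), key=lambda i: abs(cums[i] - target))
            boundaries.append(intervals[nearest][0])
        return boundaries

    table6 = [(114, 6), (119, 5), (124, 7), (129, 9), (134, 22), (139, 25),
              (144, 45), (149, 28), (154, 59), (159, 139), (164, 136),
              (169, 129), (174, 173), (179, 185), (184, 138), (189, 182),
              (194, 168), (199, 144), (204, 42), (209, 92), (214, 67),
              (219, 67), (224, 36), (229, 33), (234, 17), (239, 36),
              (244, 29), (249, 3)]
    print(dalenius_hodges_boundaries(table6, 3))  # [169, 194]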











Table 7

Numbers of Subjects, Total Reading Scaled Score Means and

Standard Deviations on the 1978 Administration of the

Stanford Achievement Test by Treatment and Ability Group

Group n M SD

Full Feedback 664 180.355 21.369

High 160 209.169 11.810

Medium 305 180.616 7.076

Low 199 156.789 10.539

Partial Feedback 666 183.599 26.395

High 197 215.477 15.191

Medium 260 182.454 7.381

Low 209 154.976 12.442

No Feedback 693 183.056 25.713

High 210 212.719 13.087

Medium 281 182.302 7.517

Low 202 153.267 14.409


The time limit for the SCAT verbal, established by the developers of the test, was 20 minutes. Previous work with

erasure type answer sheets had indicated that fewer than

normal proportions of examinees would finish the SCAT verbal

in the allotted time. Consideration had been given to

extending the time limit. However, this would have meant

either dropping the questionnaire or extending the length of

the junior high school class period of forty minutes.

Neither of these two alternatives was acceptable, and so it










was decided to closely examine the item responses prior to

the score analyses to determine whether the test appeared

to be highly speeded.

A comfortable time limit for a test such as SCAT will

permit 90% of a typical group of examinees to finish

(Nunnally, 1967). Table 8 contains the proportions of sub-

jects in the three feedback groups who responded to each of

the last 25 items of the SCAT verbal. Fewer than 50% completed the test, and if the 90% criterion is applied to the 20 minutes allowed, approximately 90% of all subjects reached item 30.

Conflicting evidence exists as to the effect of

speeded-up conditions on the reliability of tests developed

with comfortable time limits (Nunnally, 1967). A decision

was made, therefore, to include 28 items only--those num-

bered from two to 29. Question one had been used as a

sample item and was not included in the 20-minute time

limit.

The total scores on the 28-item SCAT are summarized in

Table 9 for the nine feedback-by-ability cells. Unequal

cell sizes and heterogeneity of the cell variances existed.

Glass and Stanley (1970) concluded that if such inequal-

ities existed and the larger variances accompanied the

larger sample sizes, then the actual probability of Type I

error in an analysis of variance would be less than the

nominal level. Further examination of Table 9 indicated









that within ability levels, as the cell sizes increased, the standard deviations generally increased also.

In order to obtain independent estimates of effects, a nonorthogonal, fixed effects analysis of variance was performed in a treatments-by-blocks design. The results of the ANOVA are summarized in Table 10. A significant F-ratio was obtained for the interaction of feedback and ability in addition to the significant F-ratios for feedback and ability. The interaction is shown in Figure 1.

Table 8

Percentages of Subjects by Treatment Group
Who Attempted SCAT Items 26 to 50

                Feedback                              Feedback
Item     Full   Partial   None         Item    Full   Partial   None

 26       97       97      92           39      68       71      60
 27       95       96      91           40      64       68      58
 28       94       96      90           41      60       62      51
 29       92       95      88           42      58       58      50
 30       91       93      87           43      57       44      47
 31       88       92      84           44      54       48      44
 32       88       89      83           45      52       50      43
 33       86       87      79           46      48       45      40
 34       83       84      73           47      47       43      38
 35       81       82      69           48      46       41      35
 36       75       79      67           49      45       41      34
 37       74       78      66           50      47       39      34
 38       72       74      62










Table 9

Numbers of Subjects, Raw Score Means, and Standard Deviations
on the 28-Item SCAT by Treatment and Ability Group

Group                  n         M        SD

Full Feedback         664     20.202     4.89
  High                160     23.644     3.61
  Medium              305     20.741     4.00
  Low                 199     16.608     4.65
Partial Feedback      666     20.545     5.36
  High                197     24.716     2.21
  Medium              260     21.442     3.82
  Low                 209     15.498     5.11
No Feedback           693     19.081     5.53
  High                210     22.976     3.80
  Medium              281     19.719     4.60
  Low                 202     14.144     4.46



The simple main effects of feedback within ability

levels were examined using the Scheffe Test for differences

between pairs of means based on varying sample sizes. The

probability levels for the Scheffe comparisons are in

Table 11.
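A Scheffe comparison of two cell means with unequal sample sizes can be sketched as follows (Python). The mean square error and degrees of freedom come from Table 10 and the means from Table 9; treating the three feedback groups within an ability level as the family of contrasts is an assumption of this illustration, not a detail reported here.

    from scipy.stats import f

    def scheffe_pair(m1, n1, m2, n2, ms_error, k_groups, df_error, alpha):
        # F for the pairwise contrast, tested against (k - 1) * F(alpha)
        f_contrast = (m1 - m2) ** 2 / (ms_error * (1.0 / n1 + 1.0 / n2))
        f_critical = (k_groups - 1) * f.ppf(1 - alpha, k_groups - 1, df_error)
        return f_contrast, f_critical

    # Low ability, full feedback (M = 16.608, n = 199) vs. none (M = 14.144,
    # n = 202): the contrast F of about 35.5 exceeds the .01 criterion of
    # about 9.2, consistent with the corresponding entry in Table 11.
    print(scheffe_pair(16.608, 199, 14.144, 202, 17.125, 3, 2014, .01))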

For low ability subjects in the study, those receiving

feedback scored significantly higher than those who did not receive feedback, while those of low ability who received full feedback made significantly better scores, on the average, than similar subjects who received partial feedback.












Table 10

Analysis of Variance for the 28-Item SCAT

Source            SS        df         MS          F

Feedback       311.078        2     155.539      9.083*
Ability       4554.299        2    2277.150    132.973*
F x A          310.054        4      77.514      4.526*
Error        34489.631     2014      17.125
Total        57030.311     2022      28.205

*p < .001.


[Figure 1. Interaction between feedback and ability. Horizontal axis: ability (low, medium, high).]










Table 11

Probability Levels for Statistically Significant
Differences Between Pairs of Treatment Means

High Ability
                  Full      Partial
Full                          .10
None               NS

Medium Ability
                  Full      Partial
Full                          NS
None               .05        .01

Low Ability
                  Full      Partial
Partial            .05
None               .01        .01

Note. For each nonempty cell, row treatment is less than column treatment.


The means for the other partial and full feedback

groups were always in favor of partial feedback; however,

only in the high ability comparison was the increment sig-

nificant. Consistent with the low ability group, the means











for the nonfeedback subjects of medium and high ability

were consistently lower than those of subjects who received

feedback. For those of high ability, the full feedback and

nonfeedback means were not significantly different, while

within the medium ability level, only the full and partial

feedback comparison was not statistically significant.

Table 12 contains reliability and validity coeffi-

cients, and their associated standard errors, for each

feedback group and for each ability level therein. The

internal consistency reliability was estimated with KR-20,

rather than an odd-even, split-half correlation, because it

was apparent from investigation of the test manual that the

test developer had ordered the SCAT items on the basis of

item difficulty level. The resulting odd-even, split-half

coefficient, corrected with the Spearman-Brown formula

would have been an artificially high reliability estimate

since when utilizing split-half procedures it is assumed

that the halves are made equivalent through random assign-

ment of items to test halves. On the other hand, the KR-20

formula yields the mean split-half coefficient had all

possible split-halves been calculated, thus avoiding the

necessarily biased odd-even coefficient.
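For a dichotomously scored test, the KR-20 computation amounts to the following (a minimal sketch in Python; the 0/1 response matrix and the function name are illustrative assumptions).

    import numpy as np

    def kr20(responses):
        # responses: examinees x items array of 0/1 item scores
        n, k = responses.shape
        p = responses.mean(axis=0)                     # item difficulties
        pq_sum = (p * (1.0 - p)).sum()                 # sum of p * q over items
        total_var = responses.sum(axis=1).var(ddof=1)  # total score variance
        return (k / (k - 1.0)) * (1.0 - pq_sum / total_var)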

A procedure for testing the differences among k

independent alpha coefficients was applied. The method was

developed by Hakstian and Whalen (1976) to parallel the

Scheffe Test for multiple comparisons of means in that all










Table 12

Within Cells Reliability and Validity Coefficients
and Standard Errors

                        Reliability            Validity(a)

Group                 KR-20      SEM        r (n)         SEE

Full Feedback          .811      2.13     .565 (620)      17.6
  High                 .766      1.75     .405 (154)      14.0
  Medium               .718      2.13     .247 (283)      13.7
  Low                  .749      2.33     .329 (183)      15.3
Partial Feedback       .852      2.06     .690 (631)      18.7
  High                 .451      1.64     .440 (183)      15.6
  Medium               .707      2.07     .355 (246)      12.5
  Low                  .795      2.31     .380 (202)      15.9
No Feedback            .845      2.17     .689 (651)      18.8
  High                 .756      1.88     .407 (200)      14.4
  Medium               .773      2.20     .383 (262)      14.7
  Low                  .722      2.36     .448 (189)      15.3

(a) Sample sizes in parentheses are for validity coefficients only.



linear contrasts could be made simultaneously, subject to

an experimentwise Type I error rate equal to alpha and the

restriction of a zero sum of contrast coefficients. The

analysis indicated that for the three treatment groups,

without regard to ability, the three KR-20 coefficients

(.811, .852, and .845) are statistically significantly









different from each other (p < .001).

The validity coefficients are Pearson product-moment

correlation coefficients between SCAT and the Total Reading

score from the 1979 administration of the Stanford Achieve-

ment Test. The Stanford was administered one month fol-

lowing the administration of SCAT. In Table 12, the

numbers in parentheses immediately following the validity

coefficients indicate the sample size upon which each was

based.

Without regard to ability groups, the validity coef-

ficients for the three treatment groups were tested for

statistical significance. The test for differences between

independent sample correlations was applied. The results

of the test indicated that the coefficient for full feed-

back was significantly lower than those for the other two

treatment groups (p < .001).
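The test referred to is the usual one based on Fisher's z transformation of each correlation; a minimal sketch (Python), evaluated with the full and partial feedback coefficients from Table 12, follows.

    import math

    def independent_r_z(r1, n1, r2, n2):
        # difference between two independent correlations in Fisher z units
        z1, z2 = math.atanh(r1), math.atanh(r2)
        se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
        return (z1 - z2) / se                  # refer to the standard normal

    print(independent_r_z(0.565, 620, 0.690, 631))   # about -3.7, p < .001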

The responses to the questionnaire that was adminis-

tered immediately following SCAT are summarized in Table 13.

For each question, the proportion of each treatment group

selecting each response is displayed. Subjects not re-

sponding to a question were classified as neutral, except

for question five, in which case nonrespondents (approxi-

mately 3%) were not tallied.

In order to detect differences between the responses

of the three treatment groups to the questionnaire, a chi

square test of independence was performed for each question.










Table 13

Response Percentages and Means for Questionnaire Questions

by Treatment Groups

                                              Feedback
Question and Response Categories        Full   Partial   None     p(a)

1. I like answering questions .001
like the ones on the test.
Strongly agree 25.6 26.7 13.3
Agree 38.7 39.9 30.9
Neither agree nor disagree 9.6 9.2 9.8
Disagree 14.6 13.1 24.5
Strongly disagree 11.4 11.1 21.5
Mean .52 .58 -.10

2. I tried to answer the .793
questions as well as I could.
Strongly agree 62.3 60.5 59.9
Agree 30.6 32.3 32.3
Neither agree nor disagree 3.9 3.2 3.9
Disagree 2.0 2.6 1.7
Strongly disagree 1.2 1.5 2.2
Mean 1.51 1.47 1.46

3. After I erased each box, I .001
was able to read the printing.
Strongly agree 58.7 58.4 29.9
Agree 32.5 33.3 40.3
Neither agree nor disagree 4.2 3.6 6.6
Disagree 3.2 3.3 14.0
Strongly disagree 1.4 1.4 9.2
Mean 1.44 1.44 .68

4. I did not like being unable .049
to change my answer.
Strongly agree 44.0 41.7 50.1
Agree 25.6 25.5 22.4
Neither agree nor disagree 8.7 7.7 6.8
Disagree 12.2 13.5 9.5
Strongly disagree 9.5 11.6 11.3
Mean .82 .72 .90










Table 13 (Continued)

                                              Feedback
Question and Response Categories        Full   Partial   None     p(a)

5. On how many questions did you .001
want to change your answer?
0 9.8 9.0 30.2
1-3 35.2 34.2 48.7
4-6 24.5 25.7 12.5
7-9 13.5 13.8 3.0
10 or more 17.1 17.5 5.6

6. I was more nervous using the .031
new answer sheets than the
other kind used with stan-
dardized tests.
Strongly agree 16.1 17.9 17.9
Agree 20.2 21.5 20.6
Neither agree nor disagree 13.4 12.3 15.7
Disagree 23.8 28.8 26.0
Strongly disagree 26.5 19.5 19.8
Mean -.24 -.11 -.09

7. I like the new answer sheets. .001
Strongly agree 32.7 29.3 11.0
Agree 27.6 29.9 13.0
Neither agree nor disagree 10.4 10.7 8.1
Disagree 13.7 13.7 18.9
Strongly disagree 15.7 16.5 49.1
Mean .48 .42 -.82

8. I like knowing whether my .137
answer is correct right away.
Strongly agree 65.5 62.3
Agree 21.7 26.0
Neither agree nor disagree 6.8 6.2
Disagree 3.3 3.6
Strongly disagree 2.7 2.0
Mean 1.44 1.43

9. I learned from the questions .018
I got wrong.
Strongly agree 35.8 30.8
Agree 40.5 37.7
Neither agree nor disagree 10.1 11.4
Disagree 10.1 15.9
Strongly disagree 3.5 4.2
Mean .95 .75










Table 13 (Continued)

                                              Feedback
Question and Response Categories        Full   Partial   None     p(a)

10. The questions I got wrong .270
helped me get others right.
Strongly agree 21.2 19.7
Agree 29.4 33.8
Neither agree nor disagree 15.7 12.8
Disagree 23.2 24.3
Strongly disagree 10.7 9.5
Mean .27 .30
(a) Chi-square test of independence.


The probability obtained under the hypothesis that responses to each question are independent of treatment group is indicated in Table 13.

In order to further facilitate detection of specific dif-

ferences between treatments, means were calculated based

upon a scale of 2, 1, 0, -1, -2 for responses Strongly Agree

to Strongly Disagree.
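The per-question analysis can be sketched as follows (Python). The counts are reconstructed from the question one percentages in Table 13 and the group sizes, so they are approximate and illustrative; scipy's chi2_contingency supplies the chi-square test of independence.

    import numpy as np
    from scipy.stats import chi2_contingency

    weights = np.array([2, 1, 0, -1, -2])  # Strongly Agree ... Strongly Disagree

    counts = np.array([                    # rows: full, partial, no feedback
        [170, 257, 64,  97,  76],
        [178, 266, 61,  87,  74],
        [ 92, 214, 68, 170, 149],
    ])

    chi2, p, dof, _ = chi2_contingency(counts)   # p is well below .001 here
    means = (counts * weights).sum(axis=1) / counts.sum(axis=1)
    print(round(chi2, 1), dof, means.round(2))   # means: .52, .58, -.10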















Chapter VI

Discussion



Practical Significance

The results of this study appear to indicate a wealth

of statistically significant findings. However, before any

conclusions are made, it is important to make a reassess-

ment in terms of the educational, or practical, signifi-

cance of the results. Statistical significance is not

enough. One must ask whether the statistically significant

differences are large enough to be important.

Cohen (1969) uses the term effect size to mean "the

degree to which the phenomenon is present in the popu-

lation" (p. 9) and provides benchmarks for small, medium,

and large effect sizes in the context of the more popular

one-, two-, and k-sample statistical tests. While they may

not be satisfactory to all audiences, Cohen's relative

effect sizes provide useful operational definitions and a

frame of reference for practical significance.

To Cohen, a medium effect size is one that we would

be aware of in normal experience. When comparing indepen-

dent sample means, Cohen's recommendation for a medium

effect size is one-half of one standard deviation. In










Table 9, the means and standard deviations for SCAT were

reported for each ability-treatment combination and the

results of an examination of the simple main effects were

reported in Table 11. Using Cohen's criterion of .5 SD

within each ability group and the standard deviation of

each no feedback group, only one practically significant

result was obtained: within the low ability level, the

difference between the full feedback and no feedback groups

was in favor of full feedback.
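The arithmetic behind this judgment is brief (Python; the means come from Table 9, with the no feedback group's standard deviation as the scale).

    mean_full_low, mean_none_low, sd_none_low = 16.608, 14.144, 4.46
    effect = (mean_full_low - mean_none_low) / sd_none_low
    print(round(effect, 2), effect >= 0.5)   # 0.55 True: beyond the medium criterion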

Cohen (1969) does not discuss or make recommendations

concerning reliability coefficients. However, if we main-

tain Cohen's concept of a medium effect size as one being

apparent in normal practice or experience, we would cer-

tainly be aware of having to take 10 minutes longer to

administer an otherwise 20-minute test. Another way of

saying this would be: if a 20-minute test has to be made

50% longer under treatment A in order to obtain the same

measure of reliability as under normal conditions, then

that lengthening of the test is of practical significance.

Applying this rule of thumb to the treatment group reliabilities in Table 12 and using the generalized

Spearman-Brown prophecy formula, it was determined that

SCAT would require lengthening by 27% for the full feed-

back group in order to increase the test's reliability from

.811 to .845. Hence, the decrement in test reliability for

the full feedback was statistically, but not practically,










significant under the criterion proposed here.
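The 27% figure follows from the generalized Spearman-Brown prophecy formula; a minimal sketch (Python):

    def lengthening_factor(r_current, r_target):
        # factor by which a test must be lengthened to raise its reliability
        return (r_target * (1.0 - r_current)) / (r_current * (1.0 - r_target))

    print(round(lengthening_factor(0.811, 0.845), 2))   # 1.27, i.e., 27% longer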

An additional consideration in checking for practical

significance could be the standard error of measurement.

The reliability coefficient for the high ability subjects

who received partial feedback was inordinately low (see

Table 12), and the SCAT standard deviation for this cell

was also very low (see Table 9). However, when the stan-

dard errors of measurement in Table 12 were compared within

ability levels and across treatments, the largest differ-

ence was between partial and no feedback within high ability, but it amounted to only about one-fourth of a raw score point.

This was a small discrepancy, if indeed we can legitimately

compare raw score standard errors of measurement across

treatments. Since the feedback treatments were such an

integral part of the administration of SCAT, the test given under the three feedback conditions could be considered as three different tests. In this case, the raw score units

were not equivalent across treatments.

Cohen (1969) represents effect sizes for differences

between correlation coefficients in Fisher z-transformed

units. In these terms, a medium effect size is .30. Even

though the validity coefficient for full feedback was

statistically significantly less than those for partial

and no feedback, the difference was only .20 Fisher z-

transformed units and was judged not to be of practical

significance.
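The difference of roughly .20 z-transformed units can be recomputed from the Table 12 coefficients (Python; atanh is the Fisher z transform):

    import math
    print(round(math.atanh(0.690) - math.atanh(0.565), 2))   # about 0.21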










Threats to Validity

The principal threats to internal validity were

instrumentation and selection. Instrumentation was touched

upon previously in this chapter in reference to the

validity of comparing the within-cells raw score standard

errors of measurement (within ability levels and across

treatments) to determine practical and/or statistical sig-

nificance. With respect to instrumentation, it was also

likely that the no feedback subjects took less interest in

taking SCAT because of the drudgery of many erasures with-

out receiving any feedback as a reward for the effort.

Evidence to sustain this threat was found in the significantly lower mean scores for no feedback subjects within and across ability levels, in the negative comments about the answer sheets that a number of no feedback subjects wrote to the experimenter on the questionnaire form, and in the negative response of no feedback subjects to questionnaire question seven on how well they liked their answer sheets.

Another type of instrumentation problem resulted from

reducing the number of items to be included in the analysis

of SCAT after the entire test had been completed. Scores

of some subjects may have been different had they used all

of the allotted time to attempt only those 28 items used

in the analysis.

Selection was a possible threat to the internal










validity of this study due to the manner in which subjects

were sampled and separated into ability groups. Other

sampling and classification procedures and different

numbers of levels of ability might have produced different results.

Mortality constituted a minor threat to validity.

Some students were not tested on one or more of three

testing dates and as a result may have been dropped par-

tially or completely from the analysis.

External validity was hampered by the limitations

placed on the results due to obtaining subjects in only one

grade in only one school district and concentrating the

study on only one area, that of verbal ability. The extent

to which different results could have been obtained by

selecting subjects from other school systems and grade

levels and by testing other areas of ability/achievement

in nonexperimental settings limits the generalizability of

the results.

Relevance

In comparing the results of this study with others on

this topic, it can be seen that statistically significantly higher mean scores were generally found in favor of full feedback over no feedback (the lone exception was for high ability subjects, where

the higher mean for full feedback was not statistically

significant). This was consistent with Betz and Weiss










(1976a), but the opposite relationship was found by Strang

and Rust (1973). This discrepancy might be due to the

degree to which tests used in the research were relevant

to instruction: Betz and Weiss and this investigator

utilized verbal ability tests (SCAT), whereas Strang and

Rust used achievement tests which were developed specifi-

cally for a course in human growth and development.

Betz and Weiss (1976a) found no statistically significant differences in between-groups reliability and validity coefficients, but the statistically significant differences found in the current study should not necessarily be considered a discrepancy. Sample sizes almost four times as large as those of Betz and Weiss may have contributed to this result, and the criteria for practical, or educational, significance discounted those results. Strang

and Rust (1973) did not report reliability or validity

coefficients.

In this study, partial feedback means were consis-

tently found to be statistically significantly greater

than no feedback means, but the differences were judged not

to be practically, or educationally, significant. Other

studies typically purported to show either statistical

significance in the opposite direction or no difference at

all. However, only one of those studies (Hanna, 1977)

appeared to involve a reasonably large number of subjects,

and the conclusion therein was no difference.










The differences between reliability coefficients for

partial and no feedback groups were consistent with other

reports in the literature: not significant (Hanna, 1977; Burgess, 1970). A similar consistency was found for the not significantly different validity coefficients (Hanna, 1977).

Conclusions

Relative to the statement of the problem in Chapter

III and possibly subject to the threats and limitations cited

above, as well as other interpretations and criteria for

significance, the results of this study provide meager

support for past claims of significant effects of immediate

item feedback, or knowledge of results, on total test score

means, and reliability and validity coefficients. Several

findings of statistical significance were obtained but in

only one instance was practical significance sustained.

A substantial increment in mean ability scores was found in

favor of full feedback over no feedback for low ability

students.

Recommendations

Further research is needed in order to more com-

pletely isolate the effects of immediate item feedback on

total test score reliability and validity from other

variables which naturally accompany the administration of

tests. The effects of type of test (aptitude, achievement,

etc.), similarity between test content and course material,









subject matter, item easiness, and the age, maturity,

ability, and anxiety of examinees may be important vari-

ables to consider. What are the relative effects of the

different answering devices--punchboards, latent image

answer sheets (erasure and soluble types), and computerized

testing--and what are the effects of practice in relation

to each of these devices? Finally, even though some

researchers have studied the effects of feedback on test

validity by correlational means, and on reliability with

internal consistency coefficients, it appears as though no

one has examined the effects of feedback on long-range pre-

dictions, or on score stability, or on alternate forms

reliability. These are suggested as considerations for

those contemplating research in this area.














References


Ammons, R. B. Effects of knowledge of performance: a

survey and tentative theoretical formulation. Journal

of General Psychology, 1956, 54, 279-299.

Angell, G. W. The effect of immediate knowledge of quiz

results on final examination scores in freshman chem-

istry. Journal of Educational Research, 1949, 42,

391-394.

Beeson, R. C. Immediate knowledge of test results and test

performance. Journal of Educational Research, 1973, 66,

224-226.

Betz, N. E., & Weiss, D. J. Effects of immediate knowledge

of results and adaptive testing on ability test perfor-

mance (Research Report No. 76-3). Minneapolis: Psycho-

metric Methods Program, Department of Psychology,

University of Minnesota, June 1976. (a)

Betz, N. E., & Weiss, D. J. Psychological effects of

immediate knowledge of results and adaptive ability

testing (Research Report No. 76-4). Minneapolis:

Psychometric Methods Program, Department of Psychology,

University of Minnesota, June 1976. (b)









Burgess, T. C. Evaluation of an IBM card punchboard used

as a test answer sheet. Psychological Reports, 1970,

27, 146.

Cohen, J. Statistical Power Analysis for the Behavioral

Sciences. New York: Academic Press, 1969.

Dalenius, T., & Hodges, J. L. Minimum variance stratifi-

cation. Journal of the American Statistical Association,

1959, 54, 88-101.

DeCecco, J. P. Psychology of Learning and Instruction:

Educational Psychology. Englewood Cliffs, New Jersey:

Prentice-Hall, 1968.

Evans, R. M., & Misfeldt, K. Effect of self-scoring pro-

cedures on test reliability. Perceptual and Motor

Skills, 1974, 38, 1248.

Face, W. L. Self-scoring devices. Industrial Arts and

Vocational Education, 1964, 53(8), 44-47; 72; 74-76.

Gilman, D. A., & Ferry, P. Increasing test reliability

through self-scoring procedures. Journal of Educational

Measurement, 1972, 9, 205-207.

Glass, G. V, & Stanley, J. C. Statistical methods in

education and psychology. Englewood Cliffs, New Jersey:

Prentice-Hall, 1970.

Hakstian, A. R., & Whalen, T. E. A k-sample significance

test for independent alpha coefficients. Psychometrika,

1976, 41, 219-231.








Hanna, G. S. Improving reliability and validity of

multiple-choice tests with an answer-until-correct

procedure. Paper presented at the Annual Meeting of the

American Educational Research Association, Chicago,

April 1974. (ERIC Document Reproduction Service

No. ED 088 953)

Hanna, G. S. Incremental reliability and validity of

multiple choice tests with an answer-until-correct pro-

cedure. Journal of Educational Measurement, 1975, 12,

175-178.

Hanna, G. S. Effects of total and partial feedback in

multiple-choice testing upon learning. Journal of

Educational Research, 1976, 69, 202-205.

Hanna, G. S. A study of reliability and validity effects

of total and partial immediate feedback in multiple-

choice testing. Journal of Educational Measurement,

1977, 14, 1-7.

Heald, H. M. The effects of immediate knowledge of results

and correction of errors and test anxiety upon test per-

formance (Doctoral dissertation, University of Nebraska,

1970). Dissertation Abstracts International, 1970, 31,

1621A.

Montor, K. Effect of using a self-scoring answer sheet on

knowledge retention. Journal of Educational Research,

1970, 63, 435-437.









Nunnally, J. C. Psychometric Theory. New York: McGraw-

Hill, 1967.

Pressey, S. L. A simple device which gives tests and

scores and teaches. School and Society, 1926, 23,

373-376.

Pressey, S. L. Further attempts to develop a mechanical

teacher. American Psychologist, 1946, 1, 262.

Pressey, S. L. Development and appraisal of devices pro-

viding immediate automatic scoring of objective tests

and concomitant self-instruction. Journal of Psychology,

1950, 29, 417-447.

Skinner, B. F. The science of learning and the art of teaching. Harvard Educational Review, 1954, 24, 86-97.

Skinner, B. F. Teaching machines. Science, 1958, 128, 969-977.

Spencer, R. E., & Barker, B. An applied test of effective-

ness of an experimental feedback answer sheet (Research

Report No. 293). Urbana, Illinois: Measurement and

Research Division, Office of Instructional Resources,

University of Illinois, April 1969.

Strang, H. R., & Rust, D. J. The effects of immediate

knowledge of results and task definition on multiple-

choice answering. Journal of Experimental Education,

1973, 42, 77-80.








Wentling, T. L. Mastery versus nonmastery instruction with

varying test item feedback treatments. Journal of

Educational Psychology, 1973, 65, 50-58.














Vita


Gordon Guy Edward Benson was born on February 26, 1947

in Leeds, Yorkshire, England, an only child of British

parents. Gordon's primary schooling began before his fifth

birthday and it was briefly interrupted when he emigrated

at age 9, along with his parents, to Toronto, Canada. Sub-

sequently moving to the U.S.A. when 14, Gordon graduated in

the Fletcher Senior High School (Neptune Beach, Florida)

Class of 1965.

After attending the University of Florida for two

years and working in a managerial position in a

Jacksonville, Florida, store for an additional two years,

Gordon moved to Tallahassee in 1969 with his new bride, the

former Sylvia Jean Patten, and enrolled at Florida State

University. Two years later, Gordon graduated with a

Bachelor of Science degree in mathematics education.

Returning to Jacksonville, Gordon taught mathematics

between 1971 and 1974 in the public secondary schools and

at the University of North Florida, where he coincidentally

earned a graduate degree in mathematics education. While attending summer sessions at F.S.U., he developed an interest in statistics, and this led to three years of full-time










attendance with graduate coursework, teaching and research

in measurement, evaluation, research design and statistics.

In 1977, seeking full-time employment, Gordon once

again returned to the Jacksonville public schools, accepting

a position as Coordinator of Tests, Measurement, and

Statistics. Professional responsibilities in the Research

and Evaluation Division have included program evaluation,

test development and validation, test score analysis and

reporting, survey construction, and statistical analysis,

as well as providing technical assistance in tests and

measurement, evaluation, research, and statistics. In

addition, over a two-year period, Gordon developed

materials, and collected and analyzed data for his dis-

sertation, which was completed in December, 1979, when he

received the Ph.D.



