
Material Information

Title:
Perspectives on holistic scoring
Creator:
Wolcott, Willa Buckley
Copyright Date:
1989
Language:
English

Subjects

Subjects / Keywords:
Essays; First papers; Questionnaires; Sentence structure; Writers; Writing assignments; Writing instruction; Writing tables; Writing teachers; Writing tests; City of Gainesville

Record Information

Source Institution:
University of Florida
Holding Location:
University of Florida
Rights Management:
Copyright Willa Buckley Wolcott. Permission granted to the University of Florida to digitize, archive and distribute this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
Resource Identifier:
22215614 (OCLC)
AHD7005 (LTUF)
0021228577 (ALEPH)












PERSPECTIVES ON HOLISTIC SCORING:
THE IMPACT OF MONITORING ON WRITING EVALUATION

















By

WILLA BUCKLEY WOLCOTT


A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY





UNIVERSITY OF FLORIDA


1989

































Copyright 1989

by

Willa Buckley Wolcott


























To Edward,

Whose understanding and help made this project possible,





and to





Kedron, Charnley, Roma, Sharon, Bill, and Janet,

Who encouraged me throughout.














ACKNOWLEDGMENTS


I would like to thank the many people who assisted me with this study. I was most fortunate to have had the commitment, expertise, and interest of 17 outstanding participants in this study. Daniel Kelly, University of Florida, served as chief reader, and Anita Doyle, retired, Alachua County Schools, served as associate chief reader. The readers and table leaders were as follows: Kent Beyette, University of Florida; Janet Fisher, Jacksonville University; Noelle Geiger, Santa Fe Community College; Kay Gonsoulin, Buchholz High School; Anthe Hoffman, University of Florida; Gail Kanipe, Gainesville High School; Donald Kanipe, Santa Fe Community College; Robert Lauriault, University of Florida; Patrick McMahon, Tallahassee Community College; Daniel McPhail, Newberry High School; Wendy McPhail, Gainesville High School; Mary Morgan, Gainesville High School; Elizabeth Novinger, Tallahassee Community College; Vincent Puma, Flagler College; and Donald Tighe, Valencia Community College. I appreciate the help of William Wood in handling the logistics for the scoring, and I also appreciate the help of the holistic scorers at the Tallahassee scoring site who pilot tested the questionnaire during the fall of 1988.

I am grateful to Denise Standiford, Eastside High School, for independently validating the logs in this study, and to the following people who assisted me with the statistical programs: Robert Baskin, Center for Instructional and Research Computing Activities; Anne DePalma, Department of Foundations; and David Miller, Assistant Professor, Department of Foundations. I am indebted to Carolyn Lyons for transcribing the tapes and typing the final manuscript. I am also grateful to Jeaninne Webb, Director of the Office of Instructional Resources, and to my colleagues Sue Legg, Associate Director, and Dianne Buhr, Assistant Director of Testing and Evaluation, for their encouragement.

Finally, I would like to thank my committee members for all their help: Ruthellen Crews, my chair, who guided me and believed in me enough to let me undertake this study; Margaret Early, who asked insightful questions; and Robert Wright, Forrest Parkay, and Sandra Damico, who provided support and assistance along the way.















TABLE OF CONTENTS


ACKNOWLEDGMENTS ........................................... iv


LIST OF TABLES ..................................... viii


LIST OF FIGURES .................................... ix


ABSTRACT ....... ........................................... x


CHAPTERS

1 INTRODUCTION ................................. 1

Nature of the Problem ......................... 2
Purpose of the Study .......................... 2
Rationale for the Study........................ 4
Significance of the Study ................. 11
Limitations of the Study ..................... 15
Definition of Terms......................... 15
Organization of the Report .................. 18

2 REVIEW OF SELECTED LITERATURE ................... 19

Conditions and Procedures of
Holistic Scoring ........................ 19
The Effectiveness of Holistic
Scoring in Comparison to Other
Evaluation Systems .................... 29
Factors Involved in the Evaluation
of Writing .............................. 37
Summary of Literature Review ................ 57

3 METHODOLOGY .................................. 60

The Monitored Holistic Scoring .............. 60
The Questionnaire ......................... 70
The Unmonitored Holistic Scoring ............ 72
Methods Used for Analyzing Data ............. 73
















4 RESULTS AND DISCUSSION ......................... 78

Mean Scores in the Two Scoring
Conditions .............................78
Interrater Reliability... .................. 87
Impact of Chief Readers
on Scoring.. ............................ 90
Writing Criteria Across Different
Scoring Levels .......................... 99
Patterns Among Readers...................... 113
Nature of the Monitoring................... 142

5 SUMMARY AND CONCLUSIONS ...................... 164

Procedures Used............................ 165
Issues Explored............................. 166
Discussion .................................... 173
Recommendations for Research.................. 176
Conclusion.................................. 179

APPENDICES

A FORMS AND QUESTIONNAIRE ...................... 182

B SCORING CHARTS AND ADDITIONAL FIGURES ........ 192

REFERENCES ............................................ 200

BIOGRAPHICAL SKETCH ................................ 206
















LIST OF TABLES


TABLES Page


1. Descriptive Statistics for the Stratified Random Sampling .................. 64

2. Analysis of Variance Mixed Models Source Table .... ............................. 79

3. Cronbach's Alpha of Readers' Scores .......... 88

4. Cronbach's Alpha of Readers' and Chiefreaders' Scores ........................ 94

5. Results of the Questionnaire, Part III ....... 112

6. Results of Questionnaire on Biases and Preferences .............................. 134

7. Questionnaire Results Dealing with Training ................................ 158

8. Additional Questions for 12 Table Leaders and Chief Readers ................... 161
















LIST OF FIGURES


FIGURES Page

1. Monitored vs. unmonitored scores for
overall group................................. 81

2. Monitored vs. unmonitored scores of
individual readers........................... 82

3. Monitored vs. unmonitored scores:
Score type summary........................... 91

4. Results of rangefinders and samples
in the monitored scoring..................... 97

5. Category of comments assigned at each
score level................................... 105

6. Summary of points assigned to writing
criteria by individual readers............... 137

7. Readers' ratings of importance of
criteria in timed writings................... 138

8. Summary of points assigned to writing
criteria by chief readers and
table leaders................................. 145

9. Ratings by chief readers and table
leaders of importance of criteria
in timed writing............................. 146

B-1. Account of procedures.......................... 192

B-2. Summary of comments for paper 034.............. 193

B-3. Summary of comments for paper 088.............. 196

B-4. Results of table leaders' independent scoring of training samples.................. 199













Abstract of Dissertation Presented to the
Graduate School of the University of Florida
in Partial Fulfillment of the Requirements for
the Degree of Doctor of Philosophy

PERSPECTIVES ON HOLISTIC SCORING:
THE IMPACT OF MONITORING ON WRITING EVALUATION

By

Willa Buckley Wolcott

December 1989

Chairman: Ruthellen Crews
Major Department: Instruction and Curriculum

The purpose of this study was to examine the impact of monitoring on the reliability and validity of holistic scoring as a means to evaluate writing. Six questions were addressed concerning (a) mean scores in monitored versus unmonitored settings, (b) agreement among readers, (c) the chief readers' influence, (d) criteria for scoring, (e) patterns in readers' responses, and (f) the nature of the monitoring.

Qualitative and quantitative measures were used. In an unmonitored scoring, eight experienced holistic scorers with different teaching backgrounds rated over 50 expository essays written by college students for a state assessment program; the scorers recorded written responses to each essay in logs. Another four special readers rated a subset of the essays and orally responded to these papers through the use of taped protocols. Later two chief readers conducted a monitored scoring with customary training procedures; three table leaders each monitored four readers throughout. The readers scored another 50+ essays, matched with the first set on the basis of original scores awarded during the actual scoring two years previously. All participants completed a questionnaire devised for this study.

A mixed-model ANOVA for nested factors and repeated measures revealed a significant difference (p < .00005) in the mean scores assigned the matched essays in the two conditions; lower scores occurred in the monitored setting. The ANOVA also showed significant differences (p < .001) among the eight readers: These individual differences were reflected in the participants' logs and audiotapes. When two separate Cronbach's alphas were run with the chief readers' scores included in the second alpha, the reliability was consistently over .91. However, additional data suggested not only that readers' scores more closely approximated the chief readers' ratings in the monitored setting than in the unmonitored, but also that the potential for fewer noncontiguous scores existed in the monitored setting. Thus, monitoring appeared effective in increasing readers' reliability.

More significantly in terms of validity, the data showed readers clearly responding to rhetorical, as well as to mechanical, elements. Furthermore, all participants indicated that they assented to the holistic scoring standards and that they perceived monitoring as a helpful resource.















CHAPTER 1
INTRODUCTION



With the growth of writing assessment, holistic scoring has become widespread as one method for the large-scale evaluation of students' essays. Sometimes called general impression scoring after the work of Diederich, French, and Carlton (1961), holistic scoring is based on the premise that the whole is more than the sum of its parts (Myers, 1980). Holistic scoring requires readers to read an essay quickly but completely, mentally rank-ordering the essay as it compares in overall quality to other papers, and then to assign a score accordingly. In the course of the reading, holistic readers make no comments on the paper, nor do they tally any errors. Rather, they rate each paper in terms of its overall quality and record the score in coded form so that other readers will not know their ratings. Sample essays, which are selected from the same test administration as being representative of each point on a scoring scale, provide the readers with a frame of reference. In some modified forms of this procedure, a descriptive set of scoring criteria gives the reader an additional guide.










Nature of the Problem


The role that training and monitoring play in helping holistic scorers make these overall writing evaluations is yet to be fully explored. The importance of such training

is implied in Spandel and Stiggins' (1981) observation that training can help to eliminate biases on the part of

readers; it is further emphasized by White (1985) who notes as follows:

The training of readers, or 'calibration' as it is sometimes called, is not indoctrination into standards determined by those who know best (as
it is too often imagined to be) but rather the formation of an assenting community that feels a sense of ownership of the standards and the
process. (p. 164)*

Despite the seeming importance that these comments attach to training, the actual influence that such monitoring may have on readers' judgments within a structured holistic scoring has received little attention.


Purpose of the Study


The purpose of this study was to explore how the training and monitoring which readers undergo during a formal holistic scoring influence the writing judgments they make. By means of logs, holistic scores, protocol analyses, and the metacognitive awareness of the participants themselves as shown through a questionnaire, it was expected that the study would reveal to what extent the nature of this training embodies the "interpretive community" described by White as reflective of Fish's (1980) reader response theory.

*From Teaching and Assessing Writing by Edward White, 1985, San Francisco: Jossey-Bass. Copyright 1985 by Jossey-Bass. Reprinted by permission.

The following questions were addressed:

1. Do the mean scores for the essays differ when the
papers are evaluated by readers working in a monitored setting from when the papers are judged
by the readers working independently?

2. Do experienced readers participating in a monitored
scoring achieve greater agreement with each other
than when they evaluate essays independently?

3. What impact do the chief readers have on an
holistic scoring? How do they ensure both a
reliable and a collegial reading?

4. What criteria do readers use in assigning different
score levels as reported through their logs,
talking protocols, and responses to a questionnaire devised for this study? What standards are reflected in the score levels assigned across essays? How do readers respond to these standards?

5. Do any common patterns appear in the scorers'
written or audiotaped responses to the essays, or do their comments underscore the individuality of each reader's transaction with the text? Do readers' holistic judgments, as shown by their
written or verbal responses, correspond to the writing features they rate as important on a
questionnaire?

6. What is the nature of the monitoring that the
readers receive during a scoring as reported through the logs of table leaders and readers? Do the procedures noted in these logs, together with the protocols of the special readers, support the readers' perceptions of their own holistic scoring processes as noted on their responses to a
questionnaire?








Rationale for the Study


Training and monitoring comprise an essential part of a formal holistic scoring. In fact, the key to a

successful holistic scoring lies precisely in its system of "checks and balances" that ensures the greater likelihood of readers rating the same paper comparably. According to procedures established by the Educational Testing Service,

one check lies in the structured format of the reading itself. Readers assemble as a group to do the reading, hence reemphasizing the need for working toward a group consensus. They read at tables directed by table leaders, who gently function as consultants, guides, and monitors. The table leaders, in turn, are guided and monitored by a

head table consisting of a chief reader and associate chief readers.

A second check lies with the ongoing nature of the monitoring provided. Not only do new readers undergo a preliminary training session in which they experiment with

the holistic approach on old essays, but in subsequent scoring sessions, new and experienced readers alike receive additional practice. Each scoring is introduced with

general comments, in which information is provided about the examinees' testing conditions, and the readers are reminded to avoid potential problems with length,

handwriting, and other surface features. Through this








procedure the chief reader establishes expectations for the readers, expectations which, as Freedman and Calfee (1983)

hypothesize, alter the "text image" (p. 94) that readers create in their own minds before making a judgment. Then readers begin by working with an old and new set of rangefinders--that is, with the anchor papers selected

beforehand as illustrative of various scoring levels for that test administration. Together with the operational definitions that are usually included in a modified holistic scoring, the guidelines and rangefinders provide

the criteria--both explicit and implicit--against which readers can rate the essays.

The monitoring continues throughout a scoring, as

sample papers are used after each rest break to ensure the adherence of all readers to group standards; because the tally of sample scores is publicly recorded, the

readers are able to see where their own scores place. Finally, frequent "check readings" are conducted in which a random sample of current essays from each table is independently evaluated by a reader, a table leader, and a

chief reader. If any of the papers receive discrepant scores--that is, a noncontiguous score such as 1-3, 2-4, or 1-4 on a four-point scale--then the paper is returned to the party whose score is discrepant, and the reader is asked to review the essay. If the reader is unwilling to adjust the score after reviewing the paper, then it is








subsequently refereed. Thus, in a formal scoring, training adds another dimension to the complexity of writing evaluation. These issues of training and monitoring are critical, for they affect both how and why scorers react the way they do to essays.
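To illustrate the check-reading logic described above, the following is a minimal sketch (in Python; not part of the original study, and the names and scores are hypothetical) of how noncontiguous score pairs on a four-point scale might be flagged:

    def is_discrepant(score_a, score_b):
        # On a four-point scale, scores are discrepant (noncontiguous)
        # when they differ by more than one point, e.g., 1-3, 2-4, or 1-4.
        return abs(score_a - score_b) > 1

    # Hypothetical check reading: one essay scored independently by a
    # reader, a table leader, and a chief reader.
    scores = {"reader": 2, "table leader": 4, "chief reader": 3}

    names = list(scores)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if is_discrepant(scores[a], scores[b]):
                # The paper would be returned to the discrepant party for
                # review and refereed if the score is not adjusted.
                print(f"Discrepant scores: {a}={scores[a]}, {b}={scores[b]}")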

As a tool of writing evaluation, holistic scoring

attracts both strong support and serious criticism; the training of readers for the purpose of achieving a scoring consensus lies at the heart of the conflict. For example, in support of holistic scoring, Davis, Scriven, and Thomas

(1987) point out that the reliability and the relatively low cost of holistic assessment make it valuable for evaluating a school's writing program. Bamberg (1982), too,

argues that writing programs which focus on the writing process should use essays for evaluation purposes,

adding, "Holistically scored essays should, therefore, play a leading role in assessments of writing programs and writing competence" (p. 406). Cooper (Cooper & Odell,

1977) emphasizes as well the high interrater reliability that can be achieved in holistic scoring; he stresses that

with similar backgrounds and training, raters can obtain substantial agreement in scoring several essays of a student.

In addition, the theoretical assumptions behind

holistic scoring receive strong endorsement as White (1985) underscores the value of examining papers as a whole. He








observes that holisticism "is the most obvious example in

the field of English of the attempt to evoke and evaluate

wholes rather than parts, individual thought rather than

mere socialized correctness" (p. 19). White readily

acknowledges the limitations of holistic scoring, pointing

out that this evaluation approach is unable to provide

diagnostic information for individual students and that,

moreover, the scores represent rankings rather than

absolute values. But while stressing the need for using

this approach responsibly, White defends the underlying

principles behind this form of evaluation:

Holistic scoring is important for reasons beyond measurement, for reasons that return us to the nature of writing and to the importance of the study of writing itself. It is in our writing that we see ourselves thinking, and we ask our students to write so that they can think more clearly, learn more quickly, and develop more fully. Writing, like reading, is an exercise for the whole mind, including its most creative, individual, and imaginative faculties. The rapid
growth of holistic scoring in grading reflects this view of reading and writing as activities not describable through an inventory of their parts, and such scoring serves as a direct expression of that view: By maintaining that writing must be
seen as a whole and that the evaluating of writing cannot be split into a sequence of objective
activities, holistic scoring reinforces the vision of reading and writing as intensely individual
activities involving the full self. (p. 32)

In order for students to understand more clearly the

evaluation criteria used on their papers, White, in fact,

advocates the application of holistic scoring guides in the

classroom. That many teachers enthusiastically endorse

this practice (Mishler and Hogan, 1982; Paulis, 1985; and








Westcott and Gardner, 1984) gives added weight to the value of holistic scoring as an aid to teaching and revising.

At the same time, holistic scoring receives pronounced criticism from some researchers and scholars. Frequently cited as emblematic of reader reliability problems is the classic work of Diederich, French, and Carlton (1961). In

this study the researchers asked over 50 readers from 6 fields to grade 300 essays written by new freshmen at different colleges. All of the essays received at least five of the nine possible scores on the scale, and one-third even received the entire range of scores. Often overlooked in references to this study, however, is the absence of any criteria or scoring assistance provided for readers (White, 1985); such a lack of training represents a complete departure from the guidance given in most

holistic scorings of recent years. Nevertheless, even with such guidance given, the issue of score reliability--a broad term that reflects potential error sources in topics, tasks, conditions of examinees, and agreement among readers--remains, according to Breland, Camp, Jones,

Morris, and Rock (1987), the "Achilles Heel" of writing assessment.

In addition to questions of reliability, the issue of the validity of holistic scoring has come under attack.

For example, Charney (1984) argues that "a given set of criteria devised by one set of experts is no more valid







than a different set of standards, arrived at by a different group of experts" (p. 73). She suggests, moreover, that the very need for extensive training in holistic scoring implicitly illustrates the difficulties readers experience in adhering to the imposed criteria. This difficulty, according to Charney, is further shown in

those studies which have found such superficial features of writing as handwriting or spelling to be influential in the holistic scores assigned. Thus, Charney concludes, "Holistic ratings should not be ruled out as a method of evaluating writing ability, but those who use such ratings

must seriously consider the question of the validity of the scores that result" (p. 79).

Similar caution is reflected in the report by the CCC [College Composition and Communication] Committee on Teaching and Its Evaluation in Composition (1982); the report notes that holistic scoring is of "limited value" for evaluating either writing programs or courses. Several educators and composition specialists also express their concern. Hirsch (1977) states, for example, that some of

the greatest thinkers in history have been unable to establish holistic standards which encompass both intrinsic and extrinsic criteria. Hirsch insists that the Aristotelian mode of intrinsic evaluation, which judges how effectively

and how correctly writers carry out their intentions, is better for predicting writing ability than is the Platonic








mode, which judges the quality of intentions external to the writers; the intrinsic mode is also, he argues, "the only kind of assessment in which anyone should have confidence" (p. 186).

Elbow (1986) expresses reservations about an evaluation model which requires agreement among judges. Not only may

it result in an overemphasis on such measurable features as grammar and spelling, but it also requires readers to

suspend their own judgments in favor of other standards. For Elbow, "descriptive perceptions" (p. 255)--even when they conflict--provide a more valuable learning experience than those evaluations which merely rank or measure.

The need for agreement among holistic raters disturbs the educator Roberts (1983) as well, contributing, in his

view, to a limited, "product-centered and decontextualized" form of evaluation that disregards the writer's purpose, intentions, and environment. Roberts questions whether holistic scoring, like the empirical research to which he attributes its growth, can effectively measure writing quality, writing change, or "anything other than how well a writing sample simulates an Idealized Text" (p. 3); his latter observation directly contradicts the view expressed by Spandel and Stiggins (1981) that holistic scoring,

comparing as it does the relative quality of essays, has no "preconceived notion of the 'ideal' paper" (p. 24).







As can be seen, those who question the validity of holistic scoring imply that emphasizing agreement of scores through training destroys the individual perspective--an individuality endorsed in recent years by such reader response theorists as Bleich (1975). But White (1985) argues an opposing viewpoint just as emphatically. Calling attention to the importance of the nature of the community

that forms in an holistic scoring, White compares the "true community of assent," which is properly developed through

a formal essay scoring, to the "interpretive community" discussed by another theorist of reader response, Fish (1980). Although Fish's concept refers to the sense of agreement readers of literary texts strive to attain, White (1985) sees similarities between Fish's reader response theory and the need for establishing a responsive community in holistic scoring. Thus, a study is needed to explore the impact of training both on the nature of the community that develops among the scorers and on any scoring agreement that results.


Significance of the Study


In revealing the extent to which holistic scorers willingly adopt the criteria, the study should have educational, theoretical, and practical significance. First, as Keech (1982) notes, it should show the individual holistic scorer responding "not as an error-counter or a








conserver of threatened forms, but as a receiver of

intended communication" (p. 174). Holistic scorers attempt to derive the meaning through envisioning each text as a whole; they operate in a context in which they are

encouraged both to remember the limited conditions under which examinees write and to recognize that strong papers may not be perfect ones. As such, the study should verify White's (1985) observation expressed below:

The simple fact is that the definition of textuality and the reader's role in developing the meaning of a text that we find in recent theories
of reading happens to describe much of our experience of responding with professional care to the writing our students produce for us. Part of the problem of evaluating student writing comes out of our deep understanding that we need to consider the process of writing as well as the product before us and that much of what the student is trying to say did not get very clearly into the
words on the page. (p. 93)

The study should further serve to integrate writing assessment more closely with reader response theory by illuminating whether the particular sense of community that arises through training procedures influences the

holistic judgments made. The early composition researchers Braddock, Lloyd-Jones, and Schoer (1963) stress the

importance of agreeing to criteria. In a reference to analytic scoring, for example, the researchers link the effectiveness with which criteria are applied directly to "the commitment which each rater feels toward the criteria being employed" (p. 15).








As indicated by White (1985), reader response theorist Fish (1980) attributes the agreement which can occur among

readers of literature to a "stability in the makeup of interpretive communities," a stability arising from a commonality of goals (Fish, 1980, p. 15). According to Fish, the stability is due not to independent qualities within the texts, but rather to the "interpretive

strategies" which give shape to the event of reading and hence to the making of meaning of the texts themselves. The nature of the interpretive communities can change because the interpretive strategies are learned; they are

learned through persuasion, as writers invite readers to employ particular strategies.

The reader response theorist Rosenblatt (1985, 1988) finds Fish's view too narrow; however, she also underscores the value of agreement in her observation that "in any specific situation, given agreed-upon criteria, it is possible to decide that some readings are more defensible

than others" (Rosenblatt, 1985, p. 36). An important distinction Rosenblatt makes is between "efferent" and "aesthetic" reading. In efferent reading, the reader is

concerned with what can be taken away from the reading, whereas in aesthetic reading, the reader is involved with experiencing the reading event itself. The two types of reading fall on a continuum, requiring readers to select the primary elements to which they will give their








attention. Of special significance for this study is Rosenblatt's contention that "the need for grasping the author's purpose and for a consensus among readers is usually more stringent in efferent reading" (1988, p. 8).

As she elaborates, "In efferent reading, the student has to learn to focus attention mainly on the public, referential

aspects of consciousness and to ignore private aspects that might distort or bias the desired publicly verifiable or justifiable interpretation" (1988, p. 8). In this regard the training of holistic scorers can perhaps be perceived as a means for helping readers adopt an appropriate stance in which they overcome their biases and select agreed-upon public criteria.

Thus, the issue of agreement surfaces both in reader response theory and in writing assessment. Because this issue lies at the heart of questions concerning the validity of holistic scoring, this study should have theoretical implications in revealing the degree of

commitment holistic scorers feel to the standards they use.

Finally, the study should have practical significance. If the monitoring and training processes of the structured scoring have a noticeable impact on readers' evaluations,

then the need for continuing holistic scorings within a formal context will be apparent. If, on the other hand, readers' judgments do not appear to be unduly affected or altered by the group monitoring procedures, then an option







might be to have experienced readers follow what is

currently done in many state assessments and score some essays at home.


Limitations of the Study


Any conclusions to be drawn from the study will, of necessity, be limited, as the small number of participants--17 altogether, including readers, table leaders, and chief readers--and the limited number of essays involved--a little over 100--will prevent generalizations. Furthermore, the scorers involved in the study will be highly experienced readers; different results might be obtained with less experienced scorers who might not react the same way, especially in the at-home scoring. Finally, the study relies on the accuracy of scorers' self-reporting in logs, taped protocols, and questionnaire

responses; such self-reporting not only entails subjectivity but also, as Freedman and Calfee indicate (1983), depends upon the evaluators' abilities to articulate their own responses.


Definition of Terms


For the purpose of this study, the following definitions are used:

1. Scorer and reader are used interchangeably to

refer to those readers making the rating judgments








on each paper. Special scorer is used to refer to any of those four readers who do talking protocols

as they evaluate the papers.

2. Chief reader or trainer is used to refer to the one

or two individuals who conduct the holistic

scorings and who train the readers by providing

sample papers.

3. Table leader refers to the individual who is in

charge of a table of readers and who monitors those

readers' progress.

4. Monitored scoring, structured scoring, and formal

scoring are used interchangeably to refer to a

formal writing assessment approach in which a group of readers meets and follows the set procedures

described in this study.

5. Training and calibration of readers refer to the

processes whereby the readers are given initial exposure to selected sample papers and ongoing practice in reading and scoring those essays.

These processes also include the public tallying of

scores on those papers.

6. Monitoring refers to the ongoing process whereby

table leaders continuously check some of the actual

essays the readers have scored.

7. Check reading refers to the formal process whereby

the chief readers collect from the table leaders








two papers per reader which the table leaders have

also scored. The chief readers independently score

these papers and compare the results.

8. Rangefinders and anchor papers are used interchangeably to refer to the six essays which have been formally chosen in a previous sample selection process as representative of each scoring level from level 1 (the lowest) to level 4 (the highest).

These papers, which the readers must initially rank order according to quality, serve as guideposts for

the standards of any reading.

9. Sample refers to additional essays which have also

been selected from the same previous scoring as the rangefinders and which are used throughout a scoring to illustrate particular levels of scores.

10. Operational definitions is used to refer to those

written descriptors of each level of paper and the

qualities that the levels embody.

11. Log is used to refer to the running commentary

readers provide of their scores and the decisions for these scores. It is distinguished from the term Account of Procedures which is used to refer to the customary log of procedures that chief

readers maintain.









Organization of the Report


A review of selected literature is presented in Chapter 2. The methodology used in the study is addressed in Chapter 3. The qualitative and quantitative results are discussed in Chapter 4, and a summary of the findings and their implications is presented in Chapter 5.














CHAPTER 2
REVIEW OF SELECTED LITERATURE


Literature on holistic scoring and its related areas falls into three broad categories: One set of studies establishes and describes holistic scoring procedures and

conditions, primarily for the purpose of improving their implementation. A second category addresses such issues as the validity, reliability, and cost-effectiveness of

holistic scoring and explores the effectiveness of this evaluation system as it relates to other procedures. A third category, more theoretical in nature, explores the complexities entailed in making writing judgments.


Conditions and Procedures of Holistic Scoring


The classic study by Diederich, French, and Carlton (1961) cited in the introduction has influenced both the conceptual base and the method of holistic scoring. As noted previously, the researchers asked readers to make an

overall judgment as to the quality of a particular essay by implicitly rank-ordering the papers. In this study, 53 professionals from a variety of fields evaluated about 300 essays written at home by college freshmen from different

universities. The raters were first instructed to sort 50 papers into three piles signifying their level of quality--










average, above-average, and below-average. Next, they were to sort each pile into three more stacks, for a total of nine. Finally, they had to place the remaining papers in one of the appropriate piles and write comments about what they liked or disliked in the essays.

When the grades assigned to each paper were correlated, the median of correlations between readers was .31, indicating a low reliability for the reading. Precisely for this reason, the work of Diederich et al. has often been cited as indicative of the problems inherent in general impression scoring. Yet, as White (1985) notes, it

is important to recognize that the conditions of their study differed considerably from those typically used today: Not only did the readers come from diverse backgrounds, but neither training nor monitoring was provided; moreover, the papers were written outside class, a departure from normal testing conditions.
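For readers unfamiliar with the statistic Diederich et al. report, the following sketch (hypothetical ratings; Python with the numpy library assumed available) shows how a median inter-reader correlation might be computed from a matrix of essay scores:

    import numpy as np
    from itertools import combinations

    # Hypothetical ratings: rows are essays, columns are readers,
    # on a nine-point scale as in Diederich, French, and Carlton (1961).
    ratings = np.array([
        [5, 6, 3, 7],
        [2, 4, 5, 3],
        [8, 7, 6, 9],
        [4, 3, 2, 5],
        [6, 5, 7, 6],
    ])

    # Pearson correlation for every pair of readers (columns), then the median.
    pair_rs = [np.corrcoef(ratings[:, i], ratings[:, j])[0, 1]
               for i, j in combinations(range(ratings.shape[1]), 2)]
    print("Median inter-reader correlation:", round(float(np.median(pair_rs)), 2))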

In fact, because of the large variation that occurred

in their study, the authors conclude that reliability is crucial if scoring the essays is to serve any important purpose. They suggest that readers should be tested so that only those whose ratings correlate at .60 with the general

consensus be allowed to score; they speculate that some training and directions might be of help. Thus, the discoveries made in this major work have undoubtedly been








instrumental in the development of holistic scoring as it is known today.

A second influential study is The Measurement of

Writing Ability (1966) by Godshalk, Swineford, and Coffman. The authors used holistic scoring to evaluate each of five

essays written by nearly 650 grade 11 and grade 12 students throughout the country. Although the study was undertaken

for the purpose of validating multiple-choice items on a standardized test, the researchers conclude with several recommendations about holistic scoring.

For example, they note the importance of providing sufficient time for training early in a scoring session in order to avoid having readers assign overly high scores in the beginning. Finding the time of day to be a factor

in a scoring, they also emphasize the need for having multiple readings of a paper done at different stages of a scoring session. Such a practice can, according to the researchers, minimize both the variance among readers and the variance deriving from the time of day.

In later stages of their study, the researchers reduced the number of readers evaluating any one essay, and they experimented with changing their original 3-point scoring scale, denoting superior, average, and inferior, to a 4-point scale. The even-numbered scale required readers to choose the half of the scale that each paper exemplified, rather than resorting to the safety of the middle








score whenever in doubt. As a result of their experimentation with various conditions, the work of Godshalk et al. (1966) has formed the basis for many formal holistic scorings conducted today.

More recent literature reflects continued interest in the conditions under which an holistic scoring is conducted. For example, Paden (1986) explored the possible relationship between the context in which an essay is placed and the influence of floor and ceiling effects on the range of different score levels. Citing the

conclusions of other researchers who failed to minimize context effects on a scoring, Paden hypothesized a theoretical model to link the effect of context to

the potential for increase or decrease contained by each score level. Because the potential for change can differ substantially depending on whether a score is a 1 or a 3,

for example, the particular placement of a score on a scale can, according to Paden, affect the amount of change that

context can influence. She stresses the need for validating her hypothetical model.

Concern for training of the readers also appears in several studies during this decade. The role that

training of readers can play has been illustrated in a study by Freedman (1981), who studied the impact of three variables--essay, reader, and environment--on an holistic scoring. Four highly qualified scorers worked in pairs to








score holistically 64 argumentative papers composed on each of eight topics by college students; in a later session, they rated the same essays analytically. Two trainers trained the different pairs of raters, providing the readers with sample essays for each topic.

Freedman (1981) found that the four readers graded the papers consistently with each other and appeared to be unaffected by the rating session and the time. In addition, their holistic scores correlated significantly with all the analytic ratings except for the area of usage. The choice of topic did seem to affect the results in that one opinion topic received higher scores throughout. Most important, Freedman also found an apparent effect that the trainer could have on the scoring. Even though both trainers (one of whom was Freedman) agreed on

the scores to be given sample essays, on a replay of the taped training sessions, differences in the discussions appeared. Thus, while one trainer might state that two contiguous scores of 2 and 1 were appropriate for a given sample, the other trainer might state that the 1 score was not suitable for papers of that type.

Freedman (1981) speculates that such training differences can result in higher or lower scores being assigned

accordingly, and she suggests that researchers in small projects avoid conducting the training themselves in order to avoid influencing results in favor of their hypotheses.








The importance of training in writing evaluation was underscored by Hrach's dissertation (1983), in which she explored possible links between raters' previous writing experiences, their tolerance of ambiguity, and their evaluation approaches. Fifty-nine secondary English

teachers sorted 20 papers, written on the same topic by secondary school students, into whatever scoring categories they chose. Using a three-way multidimensional scaling system, Hrach identified the basis of the classifications to be style, organization, maturity of thought and expression, and substance. Thirty-nine other teachers who rated the compositions analytically confirmed the accuracy

of the classifications. In addition, the teachers completed two instruments addressing their experience with writing skills and their tolerance of ambiguity. Results of Hrach's study, like Diederich et al.'s (1961), confirmed that raters were influenced by the presence or absence of certain writing qualities in essays. Also like the findings of Diederich and his colleagues, Hrach discovered that the raters differed substantially on what they considered important. That is, only three raters used three of the four dimensions she identified, and almost half focused on one dimension alone. These differences in evaluations did not appear related either to the raters' previous experiences with writing skills or to their tolerance for

ambiguity, as these features were subsequently not found to be influential in writing judgments.








Because the raters were given neither criteria nor restrictions for sorting the papers, Hrach (1983)--who endorses the realism of this practice--nevertheless

suggests that a lack of training in writing evaluation might explain the wide variability in results. She speculates that even if the instructors had completed

coursework in writing, without training in the teaching or evaluating of writing, "they probably would not have developed common perceptions of what constitutes good

writing" (p. 171). In what seems a forerunner of this study, Hrach suggests that it might be useful to examine how raters trained in holistic scoring rate papers independently and as part of a group.

The importance of training in an holistic scoring has also been emphasized by Sweedler-Brown (1985), who sought to determine whether the amount of training and the experience that holistic scorers had with a grading scale

affected either their evaluations of writing quality or the consistency of the evaluations.

Using a six-point scale, 20 experienced writing

instructors and graduate students, whose experience with holistic scoring ranged from none to three years, holistically scored 897 essays written by university students. From this group of essays the 36 essays which had received discrepant scores were selected for analysis. Each of the readers involved in the discrepant scores, together








with one of the six trainers who had served as referees, was asked three days later to score the same essays

analytically. The eight criteria included content, organization, diction, development, mechanics, and spelling.

When the holistic scores were correlated with the total analytic scores, the trainers were found to give equivalent holistic and analytic scores over 60% of the time, whereas

readers assigned comparable scores only 37% of the time. Although both trainers and readers valued content and sentence structure (albeit in reverse order), the trainers tended to give lower holistic and analytic scores than did the readers.

Thus, the researcher concludes, "Graders with greater experience and training have significantly greater

consistency between their holistic and analytic evaluations of the same essay, from which we conclude that the amount

of training and experience does significantly affect the reliability of a grader's evaluation" (p. 54). The importance Sweedler-Brown (1985) attaches to training seems justifiable; however, some limitations in her study suggest that the results must be interpreted cautiously.

That Sweedler-Brown's (1985) conclusion derives from a small sample of 36 discrepantly scored papers seems troublesome: Not all readers would necessarily have been involved in these discrepant scores, and hence, the actual

number of readers from whom correlations were obtained--








while not clearly stated--might have been fewer than the 20 doing the holistic scoring. In addition, the training

provided for the analytic scoring was far more limited than that given to the holistic criteria, with the result that

agreement on the analytic scales might have been harder to achieve. Finally, as Sweedler-Brown acknowledges, the trainers scored far fewer papers than did the readers. The trainers might have remembered their original holistic

scores on the discrepant papers, thereby contributing to the higher correlation they achieved between analytic and holistic scores. Thus, the limitations of this study

militate against the conclusions, however strong the need for training in holistic scoring appears to be.

Differences between trained and untrained scorers were also examined by Huot (1988) in a recent dissertation somewhat related to the present study. Arguing that too much attention has been paid to the issue of agreement, Huot explored the validity of holistic scoring by comparing the protocols of four novice and four expert holistic raters. Each scorer rated 84 essays selected from a previous assessment and written in letter format by

college freshmen on two different topics. The scorers, who talked aloud for half the essays (42 each), were given

training in doing protocol analysis. Then, over a four-day period, pairs of expert readers and pairs of novice readers scored for two days apiece. The novices were given neither








training in holistic scoring nor any rubric to use; the experts, together with a scoring leader, first trained with anchor papers and the original rubric, and then they modified the rubric for the protocol scoring.

The researcher coded the number of responses that each scorer made, noting when the responses were made, whether the responses were positive, neutral, or negative, whether the responses were made to the writer or to the essay, and

the criteria on which the judgments were based. After each scoring session, the researcher interviewed the scorers.

Huot found that even though the novice raters made substantially more comments than did the expert raters, the experts' comments--many of which were made after, rather than during the scoring--reflected more varied viewpoints and more personal engagement with the student essays. Because the experts could use a scoring rubric whereas the

novices could not, Huot concludes that, contrary to his expectations, the rubric and other holistic training procedures did not intrude on the rating process. Rather, by providing scorers with "expectations, justification or

explanation" (p. 223), the rubric enabled the raters to read the essays more fully. Novice raters, on the other

hand, sought strategies that would work with a specific set of papers and concentrated on evaluation to the exclusion of any personal engagement with the essays. Huot suggests

that holistic scoring procedures, far from impeding true








reading, "actually promote the kind of rating process that insures a valid reading and rating of student writing" (p. 237).

As can be seen from this section of the literature review, several studies have reflected concern both for determining what occurs in holistic scoring and for improving the procedures under which writing is holistically scored. Because of these concerns, several of these

works have become the reference point for the practices currently used in a structured holistic scoring.


The Effectiveness of Holistic Scoring in
Comparison to Other Evaluation Systems


A number of studies have explored either the effectiveness or the cost efficiency of holistic scoring, especially as it relates to other forms of writing evaluation. One such study was the early undertaking of Follman and Anderson (1967), who randomly assigned five raters to use one of five evaluation approaches in rating ten compositions written by college students. The rating systems included The California Essay Scale, the Cleveland Composition Rating Scale, the Diederich Rating Scale, the Follman English Mechanics Guide, and the Everyman's Scale

in which the evaluators could use whatever system they wished. All but two of the evaluators were English education majors enrolled in the same English course.








Follman and Anderson found high correlations among the different systems except for the Diederich scale; they also found high reliability for each group, leading the researchers to conclude that the homogeneity of the raters might be a major contributing factor.

In a later study Winters (1978) compared four different scoring systems--one General Impression, two analytic, and

one a T-unit analysis--to determine how well each system classified four groups of students who had been previously

placed in high and low writing groups in high school and in college.

After six high school and college teachers were thoroughly trained in at least two of the scoring systems,

four of the readers used each system to score 80 papers. Interrater reliability was substantial on all four systems, with the General Impression system achieving the lowest rate at .81, in contrast to the .99 reliability rate of the T-unit analysis system. Winters (1978) attributes the relatively low reliability of the General Impression scale

both to the fact that the scorers used this system first and hence lacked the practice they subsequently experienced with the other systems and also to the fact that the rubric for this system was less defined than it was for any of the other procedures.

Of most concern to Winters (1978) is her finding that in three of the four systems--the General Impression








system, the Diederich Expository Scale, and the CSE Analytic Scale, which was developed at the Center for the Study of Evaluation--the low college group did better than did their high peers. Winters attributes this

unexpected occurrence to the small size of the sample, the atypical nature of summer students, and, most significantly, to the substantial number of foreign-born students in the college low group--students whose problems with syntax or with awkward wording might not be reflected by the scoring systems. That the T-unit did not discriminate among the four groups at all could, according to Winters,

be explained by the similarity of age in the students of the study, unlike those students in previous research on T-units.

The researcher speculates that three systems are better than two for classifying students' writing, and she notes that a combination of General Impression scoring, together with an analytic system, seems best. She concludes that

General Impression scoring, while not adequate alone for placement procedures, should be included in most writing assessments.

Like Winters, Shoaf (1985) also studied the effectiveness of two different methods--holistic scoring and T-unit

analysis--in evaluating the writing skill of high school students. An additional purpose of her study was to

determine whether students gained in writing proficiency








over a semester and retained that growth during the years following.

Shoaf (1985) had the students enrolled in her

sophomore-level average composition class write a 50-minute pre- and post-test on the same topic; one and two years later all students taking English wrote on the same topic for a delayed post-test. The researcher and an assistant then tallied the number of T-units in 388 samples. The essays were typed, and a team of 12 scorers,

after undergoing a training session with anchor papers, holistically scored the essays on a scale of 1-4.

A correlation of holistic scores with the T-unit results proved non-significant. When the holistic scores were analyzed for the four groups as a whole, the holistic scores increased over the semester and reflected a slight

decline on the delayed post-test. However, when the T-unit scores were analyzed for the same period of time, no significant results occurred.

Thus, Shoaf (1985) concludes that T-unit analysis is not effective in evaluating the overall writing progress of groups and that T-unit scoring should only be used for determining levels of syntactic maturity. Acknowledging that her study did not address the issue of individual writing proficiency, she states, "Holistic scoring is a useful technique for determining whether groups of students have made general progress in the development of writing" (p. 67).








In a study by Bauer (1981), the cost-effectiveness of three different scoring systems--analytic, primary

trait, and holistic--was explored, as well as the inter-reliability and intra-reliability of each system. Nine graduate students, none of whom were familiar with the scoring methods, were divided into groups of three and trained in one of the methods. The graduate assistants scored 118 essays previously written for the National

Assessment of Educational Progress. Results indicated that the analytic scoring method, which contained the most

specific scoring criteria and which required the longest training time, achieved the strongest inter- and intra-reliabilities. The holistic scoring method, though attaining the lowest intra-reliability rate, was the second strongest of the three methods in terms of inter-reliability. It also proved to be the most cost-efficient for scoring large numbers of essays.

Janopoulos (1987) explored the effectiveness of

holistic scoring from still another perspective. He sought to determine how well holistic scorers comprehended compositions written by nonnative speakers of English.

After receiving training in holistic scoring, 12 readers rated two compositions predetermined as representing higher and lower quality. In the first rating--the "naive" condition--readers were not told they would have to write

a recall protocol after the holistic scoring; in the second








rating--the "focused" condition--readers were told beforehand that another recall protocol would be required.

The readers operating in the naive condition were able

to recall the higher text more clearly than they did the lower, thereby illustrating, according to the researcher,

the role that comprehension can play in raters' holistic judgments. To his puzzlement, even though the readers operating in the focused condition recalled more overall content than they did in the naive condition, the focused readers did not recall more of the higher level text than

they did that of the lower; rather, they recalled about the same amount of information in both levels.

Janopoulos (1987) attributes the lack of impact that this higher text seemingly had on the focused readers to a possible ceiling effect and to the small sample size. He concludes, nevertheless, that holistic scoring is a valid way to assess non-native speakers' writing proficiency in terms of the comprehension component. Despite the problems

Janopoulos encountered in interpreting the results, his conclusion seems valid in that holistic scoring more closely resembles the naive condition under which the readers in his study were operating than it does the focused condition.

An altogether different stance toward holistic scoring appears in Roberts' dissertation (1982); he compared individualized writing instruction, an approach he strongly








endorses, to the more traditional classroom method of teaching composition in two West Virginia colleges.

Students wrote pre- and post-essays on a topic developed by researchers in another study; they took the Daly-Miller writing apprehension test at the beginning and end of their work; and they answered three questions regarding their view of writing. Then the essays were holistically scored and studied for T-unit length; they were also rated according to a forced-choice method. An increase in the T-unit length for the control group was the only significant difference that occurred.

Roberts (1982) questions the effectiveness of holistic

scoring as a means of evaluating the quality of student writing. He notes that one of his four raters dropped out

of the study altogether, unwilling to rate "Themes as Products" (p. 96), and he points out that still another rater failed to achieve acceptable reliability. Roberts observes, "All of the raters commented that the evaluation

techniques required product-centered evaluation based on an artificial rubric that, while developed specifically for the essay topics by prominent researchers, was inadequate

for evaluating what the papers really deserved, based on what the raters perceived as the students' intentions" (p. 96).

Although Roberts' (1982) disillusionment with holistic scoring may be warranted, his use of both a scoring rubric








and a topic from a different testing program is troublesome; as White (1985) suggests, the requirements of each testing population and program must be taken into consideration in the development of an holistic guide. Moreover, Roberts' study was problematic in that he controlled only

for the instructional mode and not for such other variables as teacher differences or course content. Although Roberts dismissed this lack of control by stating that his study was primarily naturalistic rather than experimental, he did not provide the extensive descriptions or observational data often associated with naturalistic studies. Therefore, despite the limitations which holistic scoring admittedly has, the problems in Roberts' dissertation weaken the impact of his criticism of this scoring method.

Taken together, the studies by Winters (1978), Shoaf (1985), Bauer (1981), Janopoulos (1987), and Roberts (1982) illustrate the potential, as well as the

limitations, of holistic scoring for writing evaluation. Their findings suggest that holistic scoring is more

meaningful--albeit somewhat less reliable and more time-consuming in terms of training required--than is T-unit analysis as a means of assessing overall writing quality,

including the writing of non-native speakers of English. It is also a cost-efficient approach for the large-scale assessment of essays. At the same time, as Winters (1978) points out, holistic scoring cannot reveal specific,








diagnostic information and hence, the purposes for which it is used must be clearly defined beforehand.


Factors Involved in the Evaluation of Writing


A third major component of the literature review encompasses those studies that explore the elements involved in the evaluation of writing. The focus of this

section is not on holistic scoring per se but rather on the larger issue of writing quality--and most importantly, on those elements that comprise that quality. Thus,

studies which address writing in a variety of contexts are included, as are studies which use assessment methods

other than holistic scoring. The studies are primarily categorized according to results although, of necessity, some overlapping among the categories occurs.

Content and Organization

The importance of content is emphasized by Diederich (1974), who, in discussing the factor analysis that was performed in the earlier study of Diederich, French, and Carlton (1961), states:

Then it became quite clear that the largest
cluster was most influenced by the ideas expressed: their richness, soundness, clarity, development, and relevance to the topic and the writer's purpose. Hence we must accept it as a fact that a high proportion of intelligent,
educated adults do pay attention to the quality, development, support, and relevance of the ideas expressed in student compositions and weight them heavily in their judgment of the general merit of
these papers. (p. 7)








Support for Diederich's views comes from two other studies in which content and organization proved to be significant determiners of writing quality. For example,

Freedman (1979) undertook to find which essay characteristics influenced judges most by rewriting four essays on each of eight topics composed by college freshmen. The

essays were rewritten to be strong or weak in the four broad categories of content, organization, sentence structure, and mechanics; then they were typed.

Unaware of the rewriting that had been done, 12

instructors of a college freshman English program holistically scored the papers and subsequently rated the papers according to their perceptions of the strength or weakness of the papers in each category. An analysis of variance revealed that, as Diederich (1974) had also found, essays

with stronger content received higher scores than did those with weaker content. Organization also proved to be a statistically significant factor. Mechanics appeared to be

influential as well in those papers with strong organization. When the perceptions of the evaluators toward the rewritten versions were examined, interestingly, the

evaluators did not always agree with the rewriters as to the strength or weakness of the categories of content and organization. In fact, two readers were removed from the study because their disagreement was substantial. The readers had better agreement for the more concrete categories of mechanics and sentence structure.







Freedman (1979) acknowledges as limitations of the study the breadth of the categories used for rewriting--a breadth which made it impossible to know what exact

qualities judges might be rating--and the homogeneity of the raters. Like Diederich (1974), she stresses the need for emphasizing more in classroom teaching the development and organization of ideas.

To explore the criteria that holistic scorers use in making their evaluations, Breland and Jones (1984) compared scores obtained from a regular scoring of the English

Composition Test (ECT) with analyses made nine months later by 20 college English professors on a sample of 806 essays. The samples contained equal numbers of papers written by blacks, whites, native Hispanics and nonnative Hispanic speakers of English. In the special scoring, the

evaluators first scored the papers holistically and then checked on an evaluation form the strong and weak features of each essay. During a subsequent session, the readers were also asked to write on the essays themselves.

Correlational procedures used to predict the original holistic score indicated that readers were most influenced

by the organization, support, and significant ideas in a paper, with organization correlating the most highly of

all discourse characteristics with the original English Composition Test score. Surface features, such as essay length, neatness, and spelling, contributed significantly








to predicting the ECT score as well, with essay length correlating most strongly at .43. Syntactic and lexical

characteristics influenced the scoring of the nonnative Hispanic speakers of English.

On a questionnaire given prior to the scoring, the special scorers indicated that organization, thesis,

support, and ideas were significant to them, characteristics that proved influential in their special scoring. Differences were also noted between experienced and

inexperienced scorers, with the experienced scorers tending to score more harshly. This finding was similar to Sweedler-Brown's (1985), as discussed in the previous section of the review.

Breland and Jones (1984) note that the score reliability of one writing sample rated by two readers has been found to range typically from .38 to .58. They stress the need for caution in interpreting the results of their study. They speculate that the special scorers may have been unduly influenced by the evaluation form and by the targeted groups of students, and they point out that their sample contained above-average students. The researchers

observe that in their study, length greatly influenced holistic scores, implying possibly the importance of

development in argumentative essays; they call attention to the importance that content and organization played in the holistic scores.







Mechanics, Sentence Structure, Vocabulary

As can be seen from these studies, content and organization appear to be influential factors in many

evaluators' writing judgments. At the same time, other elements, such as mechanics in Freedman's (1979) study or

spelling and length in Breland and Jones's study (1984), play a role as well. The extent to which these concrete,

nonrhetorical factors can influence writing evaluations comprises the focus of several other studies.

Allen (1976) investigated the influence of mechanical

and grammatical errors on teachers' content ratings by preparing four versions of a writing sample that contained different numbers of errors. Over 400 secondary English teachers scored one version apiece. Although Allen noted several issues that needed further exploring, the results

did not support his hypothesis that teachers' customary concerns with mechanical errors would affect their evaluation of rhetorical elements.

Rafoth and Rubin (1984) sought to determine the significance of content and mechanics on college

instructors' evaluation of writing by rewriting an essay to contain stronger or weaker content and stronger or weaker mechanics. The researchers composed three new versions of a timed expository essay originally written by a college freshman, adding spelling and punctuation errors to two of the versions and deleting propositions from other versions








in order to alter the quality of content. Of the four final versions, one was high in content and free of errors; one was high in content and full of errors; another was low in content and free of errors; and the last was low in content and full of errors.

Eighty composition instructors from four state universities voluntarily accepted one of the versions to grade. Some instructors were told to pay special attention to content and ignore mechanics, whereas others

were told to pay attention to mechanics instead of content. Still others were simply told to read the paper according to their normal practice. All the instructors were also asked to rate the paper according to the criteria on the Diederich scale.

A series of ANOVAs showed that mechanically correct versions received higher general impression scores than did those papers with errors; furthermore, the ratings according to the Diederich scale showed that the mechanically correct versions received higher scores for ideas, organization, and punctuation than did those versions with the errors inserted. According to the researchers, "The

present results strongly suggest that regardless of writing content or evaluative criteria, college instructors' perceptions of composition quality are most influenced by mechanics" (p. 455). They speculate that graders may not

distinguish clearly between the domains of content and mechanics in making their writing judgments.








The researchers acknowledge that the writing assignment was limited by timed conditions and by the inclusion in some versions of a substantial number (14) of errors.

However, an additional limitation seems to have been the use of only one essay per grader per evaluative condition.

If each grader had been given several essays to score or if the graders had been given some training in using the Diederich instrument, the results obtained by Rafoth and Rubin (1984) might appear more conclusive.

In another study of the way teachers' writing evaluations are influenced, Stewart and Grobe (1979) reexamined

232 samples from an earlier national writing assessment program. They found that students increased in the three

measures of syntactic maturity--words per T-unit, words per clause, and clauses per T-unit--from grade 5 to grade 11.

In addition, students improved in their command of spelling and in the avoidance of run-on sentences; they did not improve to the same extent in their avoidance of unclear pronoun reference or avoidance of sentence fragments.

The features which best predicted the quality ratings in grades 8 and 11 were the number of words and spelling;

only in grade 5 were the syntactic maturity measures at all significant. While speculating that the teachers in grades 8 and 11 may have been influenced more by content and organization than by sentence maturity, Stewart and Grobe

(1979) express dismay at the lack of concern seemingly shown for syntactic development.








In a subsequent study, Grobe (1981) compared analytic ratings completed by 18 trained graders to the holistic scores assigned narratives written by 437 5th, 8th, and 11th grade students. As in the earlier study, composition

length and the absence of spelling errors proved to be significant factors in predicting the holistic score.

To explain the holistic variance unaccounted for by the 14 syntax and mechanics variables, Grobe (1981) subsequently added several vocabulary measures to the analytic rating system. A computer program analyzed 50 essays selected at random from each grade level. Spelling continued to be important, but essay length was less significant once vocabulary variables were introduced.

Instead, the vocabulary variable which indicated the number of different words in a composition became significant, leading Grobe to conclude that vocabulary diversity is important in good narrative writing.

The influence of mature vocabulary and complex syntax on writing evaluation comprised the focus of a study by Neilsen and Piché (1981), who created four versions of a 250-word descriptive passage on a winter scene. One passage contained complex nominals and mature vocabulary; a second contained complex nominals and simple vocabulary (as in the use of the word "face" instead of "confront"); a third contained simple nominals (as in the phrase "like cattle in a barren field" instead of "like cattle in a barren, frozen field of blowing snow"); the fourth contained simple nominals and simple vocabulary.

Eighty high school English teachers were given folders with one version of the passage. They assigned holistic scores to the essays and rated them according to a scale containing bipolar descriptions of qualities, such as "logical/illogical." Results of an ANOVA indicated that nominal complexity did not significantly affect either the holistic scores or the composition scales; however, vocabulary did have a significant impact on both.

The authors note the following limitations of the study: The constructed passage might not resemble actual

student writing, verbs comprised the only basis for the vocabulary differences, and the findings they obtained from descriptive passages might not apply to other modes.

Despite these limitations, vocabulary seems clearly to have influenced the holistic scores for this descriptive essay,

just as it influenced the narrative writing in Grobe's (1981) study.


Length and Surface Features

Thus, the above studies suggest that such factors as mechanics, spelling, syntactic maturity, and vocabulary may affect some judgments of writing quality. As has been seen in the previously cited studies by Breland and Jones (1984), Grobe (1981), and Roberts (1982), length has also been a contributing factor.








Length proved similarly influential in a study

conducted by Nold and Freedman (1977) to explore whether certain elements could be identified as contributing to readers' evaluations of compositions. The researchers used four argumentative essays written by each of 22 Stanford freshmen. The essays were typed, and then six experienced teachers were trained to score the papers holistically.

Nold and Freedman (1977) hypothesized that four main categories might prove influential: the extent to which ideas were developed, the organization of those ideas, the complexity of syntax, and the adequacy of vocabulary.

Emphasizing countable, syntactic elements within the essay, Nold and Freedman analyzed the essays according to an instrument developed by Golub and supplemented by such variables as common verbs and the length of the essay.

The researchers found that the holistic scores assigned were distributed below the mean, with readers noting that

they had expected better writing of Stanford freshmen. Four variables, including shortness, overuse of modals and be verbs, and common vocabulary, negatively predicted quality, whereas final free modifiers positively predicted

quality ratings. According to the researchers, limitations included their focus on only those measurable elements

of writing quality and their use of a select group of students as a sample. Despite any potential problems with the study, length--in addition to vocabulary--appears as a







contributing factor for the writing judgments made in this

study in much the same way that it appeared in the research previously discussed by Grobe (1981) and by Neilsen and Piché (1981). Length is frequently considered a surface feature, and as White (1985) notes, is criticized whenever it is used as the basis for holistic judgments. However, as Freedman and Calfee (1983) thoughtfully observe, length cannot always be identified as a superficial quality of a paper. They note:

The problem with interpreting such findings is that length may or may not be an index of a
significant psycholinguistic category such as idea development. Longer essays with fuller development of ideas may deserve higher scores than shorter essays, but longer essays padded with redundant information may deserve lower scores than their shorter counterparts. Correlational studies do not reveal why longer essays receive
higher scores. (p. 85)

Even though the length of a paper may not always create a problem for writing evaluation, other surface features such as handwriting and neatness are clearly troublesome. Studies completed over 15 years ago (Chase,

1968; McColly, 1970; and Marshall, 1972) have suggested that poor handwriting or messy essays may affect the grades assigned to them.

For example, Chase (1968) found that 16 graduate

students gave "more generous" grades to essay test items done with good handwriting than they did to items done with poor handwriting; although the scorers tended to score papers equally on the first item, the negative "halo








effect" of poor handwriting adversely influenced the scoring of the second item.

Marshall (1972) introduced various numbers of spelling

errors into essays composed in response to one American history question. For each essay containing a set number of spelling errors (e.g., 0, 6, 12, and 18 errors), Marshall prepared a typewritten copy and had students copy

the essay over with three different degrees of neatness and legibility. The 16 resulting forms of the essays were sent to 480 classroom teachers who were asked to grade the papers according to content. Although Marshall, to his surprise, found no significant differences in mean scores

for the levels of spelling problems, he did find differences in the scores assigned to typed versus handwritten essays. That is, all the handwritten versions of essays containing zero to six errors received lower scores than did the typed versions of the same essays; the results for

essays containing 12 to 18 errors were less clear-cut and seemed to fall into a random pattern.

Handwriting is also labeled as a problem in McColly's

review (1970) of the issues comprised in writing evaluation. Citing several studies in which handwriting

influenced writing judgments, McColly warns that in such instances, "the validity is actually lowered, because handwriting ability and writing ability are not the same thing." He continues by suggesting that "the only cure








for this condition is to have examination essays typed or put into some other standard printed format" (p. 154).

Stach (1987), too, expresses concern in his dissertation about the influence of appearance on holistic scorers' judgments. In his study three college teachers were

trained to score holistically 140 essays written by college freshmen. The teachers then described what they considered good writing to be, and they rank-ordered the importance they placed on several factors in making writing evaluations. By their own report, they considered the factor of "presentation," which signified handwriting and neatness, to be totally unimportant and mechanics to be less

meaningful than many other qualities; however, a regression analysis revealed that appearance and mechanics were the only statistically significant predictors of holistic scores. According to Stach, the implication of such

findings was "that scorers in holistic procedures (and perhaps teachers in general) aspire to grade essays differently than they actually do, and that they hope

to be qualitatively better graders than they are, overlooking, or 'seeing beyond,' mechanics and appearance" (p. 113). Suggesting that the scorers' descriptive

statements reflected not "priorities, but aspirations," Stach concludes, "Certainly there is a great gulf between

what they say matters to them and what the best statistical predictors of holistic scores turned out to be" (p. 120).








As these studies indicate, writing evaluations are affected to varying degrees by such elements as mechanics, vocabulary, syntax, spelling, length, and even handwriting or neatness. Such a link between these elements of form

and the rhetorical elements of organization and content is, according to Harris (1977), almost inevitable. Referring to her own study, which will be discussed in the next section, Harris comments that there "came the conviction that form is so integral a part of content that in some ethereal way form is content and content is form" (pp. 180-181).


Other Factors Involved in Writing Judgments

In addition to elements of form and content, other--almost intangible--factors in writing evaluation have received increasing attention. One factor is the discrepancy between what readers say they value and what they actually reward; the second factor is the perspective that readers adopt toward the writers behind the essays.

Harris (1977) sought to determine those features that influenced English teachers in their evaluation of student writing. Thirty-six high school teachers read 12 student

essays, marking them according to their customary practice; they then ranked the essays according to merit and completed a questionnaire. They finally reevaluated the papers against five criteria.

Taken together, the four procedures revealed a discrepancy between the criteria teachers rated as important








and the criteria they actually demonstrated in their comments and markings. That is, on the questionnaire the

teachers indicated that content and organization were of great importance to them, while their annotations and manner of ranking the papers revealed the major role that mechanics and usage played. For these teachers, sentence structure and diction were less important. Additional findings by Harris (1977) included her discoveries that the teachers basically agreed with each other about the evaluation of writing and that many of the teachers' annotations and other comments were negative.

Hake and Williams (1981) raise the question of what teachers of writing actually do value: "Is it possible that despite our public declarations about clear, direct writing, we might somehow discourage our students from writing good prose and encourage them, through our own tacit behavior, to write bad?" (p. 434). Their question arises from four experiments they conducted in which they altered the style of similar essays--changing the direct, verbal style built on a subject/verb/object (or agent/action/goal) pattern to a nominalized, indirect style in which abstract nouns predominated. Approximately 80

teachers, from high school to the upper college classes, rated the heavily nominalized papers more highly than they did those essays which, though structurally similar,

were directly verbal in style. In one experiment, the








readers, who were unaware of the purpose of the study, wrote comments indicating that the nominalized versions contained better organization and support, even though the pairs of papers were identical in those respects. In another experiment, senior college graders rated the

nominalized papers higher than they did verbal versions even when they could find major errors in the nominalized version.

The authors speculate that the good nominalized papers

may have been associated with intellectual quality, in contrast to the perceived lower quality of the verbal versions. Thus, Hake and Williams (1981) suggest that despite what writing teachers claim to do, one cause of "stylistic infelicity" (p. 446) may be the practices of the teachers themselves.

Still another source of complexity in writing evaluation is the attitude or expectations of the readers toward the writers of the essays. For example, Freedman (1984) gave to four experienced holistic scorers packets of essays containing not only the writings of students from four different colleges but also a timed essay composed by a professional writer on the same topic. The scorers, who were unaware that professional writings had been included

in the study, gave only slightly higher mean holistic scores to the professionals than they did to the student writers. (In fact, student writers received the three highest holistic scores.)








The professional writers received higher analytic

scores than did the students in the categories of voice, sentence structure, word choice, and usage, but they

received lower scores in the categories of development and organization. Because of the low scores that had been given in these categories and because of wide differences in the holistic scores assigned to these papers, Freedman (1984) sought to discover what qualities characterized the professional essays. She found four common traits: (a) a tone of familiarity, (b) an initial rejection of the task with a subsequent acceptance of it, (c) a final commitment

to the topic with resulting forcefulness in the papers, and

(d) scholarly references.

As these traits are unlikely to appear in most students' writing, the author speculates that the scorers may have negatively reacted to what they viewed as "overstepping" of authority on the part of some students.

She advocates that teachers encourage students to write with authority and freedom.

Sullivan (1986) also explored whether holistic scorers' disagreements about discourse issues in problem papers

reflected certain attitudes toward the writer behind the essays. Stressing that "evaluation of writing ability is best viewed as a multifunctional social interaction" (p. 11), Sullivan sought to determine whether readers created writers in addition to the meaning of the texts.








Sullivan (1986) randomly selected for analysis 99 essays that had been written by entering freshmen and holistically scored. Topics for the essays had required students to argue to a specified audience a certain

position on a controversial issue. Using Prince's "Taxonomy of Assumed Familiarity," Sullivan classified the information contained in the noun phrases of the essays in terms

of assumptions made about readers' familiarity with the information--that is, whether it was assumed to fall under new, inferable, or old (Evoked) categories (p. 14).

A regression analysis indicated that three of the subcategories of information significantly correlated with holistic scores. According to Sullivan, these categories represent deviations from Grice's Cooperative Principle

in which writers are supposed to assume that the readers have reasonable familiarity with the information. He speculates that these deviations from expected norms

reflect three different identities--that of the "test-taker," the "knowledgeable student," and the "straightforwardly cooperative writer" (p. 33)--and that readers were responding either negatively or positively to these identities. He stresses the need for additional research

to determine whether readers are evaluating texts on the basis of their responses to the writers' identities.

Though intriguing, much of Sullivan's (1986) work appears highly speculative; for example, the basis behind








his claims that certain linguistic categories of information reflect particular social identities, such as that of the "test taker," seems arbitrary. Moreover, as Sullivan himself acknowledges, the hypothetical audience that

students were required by the topics to address and that conflicted with the real audience of holistic scorers may have compounded students' uncertainties about how much

information they needed to provide. But despite the problems that Sullivan's work contains, his research

illustrates the potential impact that the writers themselves may have on the readers' evaluation of their work.

Barritt, Stock, and Clark (1986) found a similar

attitude of unease held by readers toward writers who do not adhere to their expected role. A group of faculty members of the University of Michigan's English Composition Board met periodically over a two-year period to discuss how they holistically rated student placement essays and why they sometimes disagreed with each other. At each meeting, they read a selected essay, scored it privately, and noted their reasons for the score; then they discussed their findings together.

They found that on those essays which evoked the most

disagreement, the comments fell into several categories:

(a) "the written text"; (b) "the imagined student writer"; and (c) the "prospective student" (p. 319). The authors note that even though they initially urged readers to pay








attention to the texts, rather than to the writers behind

the work, their recommendations were in vain. They justify

the readers' reactions by emphasizing the importance of the

expectations the readers bring to the reading:

We had forgotten that reading is always an act of recreation and that what we have learned as students of literary theory has much to teach us about what we do as we read our students' assessment essays. In our case, as reader/evaluators asked to judge placement essays, we had to engage ourselves as active readers trying to make common sense--that is sense in common--with student
authors. We found ourselves working mentally with each student writer to compose a placement essay; as we overlaid the student's writing with our own
expectations, we completed incomplete arguments, supplied missing transitions, second-guessed
particular cases for general statements.

Like the readers Wolfgang Iser posits, we were trying to build consistency into students' texts by investing spaces of indeterminacy in them with our own expectations about what should fill the gaps. The teaching experience each of us brought to the task of evaluating student texts led us to expect in each text the writing of a 'typical' college freshman, and our expectations
influenced our readings. (p. 320)

Arguing against the need always to have consistency of

judgment, Barritt et al. (1986) suggest that it is more

important to accept and understand the basis behind those

judgments that are not in agreement.

In a similar vein, Martin (1987) explored the process

that occurs for readers in a placement scoring. She

examined the written responses that three faculty members

made to six placement essays composed by entering college

students. The instructors, all of whom were experienced








scorers, ranked the papers once and wrote comments intended for the students to use in revising; at a later time, they

ranked the papers again and wrote comments intended for the researcher. Martin studied the comments in the light of the readers' own backgrounds and their own experiences with reading and writing. She concludes by observing that readers, as well as writers, are individuals and that the essays do not necessarily contain features which all readers can assess; rather, in her view, placement scorers are primarily concerned with the extent to which the

writing samples indicate students' readiness for college tasks.


Summary of Literature Review


Together, the three major sections of the literature review reveal complex links between holistic scoring and the writing criteria on which it is based. The studies of

the first section (Diederich, French, and Carlton, 1961; Godshalk, Swineford, and Coffman, 1966; Freedman, 1981; Hrach, 1983; Sweedler-Brown, 1985; and Huot, 1988) illustrate both the development of and rationale for various holistic procedures, including the training of

readers. The studies of the second section (Winters, 1978; Roberts, 1982; Shoaf, 1985; and Janopoulos, 1987) depict the strengths and weaknesses of holistic scoring in

comparison to other evaluation systems. The studies of the








last section reflect researchers' attempts to explore--in a variety of contexts--the elements involved in the

evaluation of writing. From this section emerges a picture of readers in some contexts primarily influenced by organization and content (Diederich, 1974; Freedman, 1979; and Breland and Jones, 1984) and of readers in other contexts chiefly concerned with such features as mechanics,

spelling, vocabulary, and length (Harris, 1977; Nold and Freedman, 1977; Grobe, 1981; Rafoth and Rubin, 1984; and Stach, 1987). Still other studies convey how readers' perceptions and expectations of writers affect some evaluations (Freedman, 1984; Sullivan, 1986; and Barritt, Stock, and Clark, 1986). The involvement of so many factors in writing judgments underscores not only the need in writing assessment for such structured approaches as holistic

scoring but also the need for training and monitoring to ensure some similarity in the perspectives that readers bring to their evaluations. But if the need for training

is clear, the nature of that training and monitoring in holistic scoring has not yet been fully explored.

Holistic scoring of assessment essays entails special circumstances both for writers and for readers: That is,

just as writers in an assessment often have a limited time in which to discuss a given topic for an unfamiliar audience, so, too, do holistic readers have a short time in

which to determine the meaning of a text and respond by









evaluating it. Questions thus remain as to how the

training and monitoring of a structured holistic scoring help to create a community of readers who willingly accommodate their own writing criteria to the writing standards of the group as a whole.

The methodology used in the present study to explore the impact of monitoring on an holistic scoring is described in Chapter 3.














CHAPTER 3
METHODOLOGY


The impact that training and monitoring have on an holistic scoring of writing was explored from three perspectives: (a) A monitored holistic scoring was conducted in which 12 readers scored over 100 student-written

essays; 8 of the readers recounted their responses to these essays through the use of logs, and 4 of the readers recorded their reactions through the use of audio-taped protocols; (b) the same 12 readers also scored an equal number of essays at home in an unmonitored situation, again using logs and audiotapes; and (c) all participants in the study--including the 3 table leaders and 2 chief readers-were administered a questionnaire regarding their attitudes to writing evaluation and to the holistic scoring process.


The Monitored Holistic Scoring


A monitored holistic scoring was conducted to replicate on a small scale the structured scorings used in the

writing assessment of college sophomores throughout the state of Florida. The chief reader for the state of

Florida, together with an associate chief reader, conducted the scoring on Saturday, January 7, 1989. Permission was

obtained from the Department of Education in Tallahassee










to select by means of a stratified random sampling over 100 essays used two years previously in an administration of the College Level Academic Skills Test (CLAST).


Subjects in the Study

The subjects were 17 men and women, all highly experienced holistic scorers who have taught English at different levels. Earlier studies by Follman and Anderson (1967) and by Freedman (1979) had found the homogeneity of their scorers to be a factor in the results; in fact, Freedman cites such homogeneity as a limitation of her work. Because most CLAST scorings employ English instructors from diverse levels, it was assumed that subjects with a broad base of English teaching experience would more accurately reflect real-life conditions found in an holistic scoring session.

The chief reader for the study, a former director of freshman composition at a large university, is the current

chief reader for the state of Florida and has directed many large-scale holistic scorings. The assistant chief reader, a former chair of a high school English department and an Advanced Placement English teacher, has frequently served in the role of assistant chief reader.

Eight women and seven men participated in this study either as table leaders or as readers. Five taught in three local high schools, with some instructing Advanced Placement English classes or participating in the Writing








Enhancement Program; several had studied in the Florida Writing Project. Another five were English faculty from three community colleges within a 200-mile radius. The remaining five were from three universities or four-year colleges within a 100-mile radius. The teaching experience of the 15 participants, in addition to the 2 chief readers

described above, ranged from 8 to 37 years, with an average of 18 years; their holistic scoring experience averaged 7 years. All but one participant had holistically scored

other types of examinations as well as the CLAST. The subjects received an honorarium for their participation in the study.


Writing Samples

The essays used in the study were written by college

students nearing the end of their sophomore year as part of a state-wide mandatory test to assure minimal competencies in reading, writing, and mathematics. Students were given a choice of two topics, each of which required

them to draw upon their general knowledge, to create a thesis, and to support it during the 50 minutes allotted for writing. For test security reasons, the topics cannot be revealed. However, they followed the paradigm developed by Hoetker and Brossell (1986) and used in Florida for several years; the paradigm typically is a fragment, containing a class specification and two








differentiating criteria. The paradigm is exemplified by

such topic phrases as "a book/ that many students read/ that may affect them beneficially" or "a common practice/

in American colleges/ that should be changed" (p. 330) which Hoetker and Brossell describe in their research.

Procedure

As students taking CLAST have a choice of two topics, holistic scorers are accustomed to scoring sets of papers

in which two different topics are intermingled; consequently, the essays used in the study were not separated out by topic for scoring.

As shown in Table 1, a stratified random sampling procedure was used to select 112 essays which would

approximate the distribution of scores obtained in the actual scoring of these essays; hence, the essays reflected the writing of students from various institutions in different parts of the state. On the basis of scores originally assigned to the papers, the papers were randomly divided in half for the monitored and unmonitored scorings. Thus, the papers with scores of 8 were distributed equally

to the two treatments, as were the papers with scores of 2, 4, 5, 6, and 7. To ensure students' anonymity, all identifying information was removed, and each essay was labeled with a three-digit number; the essays were then reproduced so that each reader would have a copy of all







TABLE 1

Descriptive Statistics for the Stratified Random Sampling




Total                                Cumulative     Cumulative
Score     Frequency     Percent      Frequency      Percent

Frequency of Scores for the Actual Essays

  2          1779          9.9           1779           9.9
  4          4666         26.0           6445          36.0
  5          4535         25.3          10980          61.3
  6          4098         22.9          15078          84.1
  7          2172         12.1          17250          96.2
  8           675          3.8          17925         100.0

Frequency of Scores for the Sample Essays

  2            11          9.8             11           9.8
  4            29         25.9             40          35.7
  5            28         25.0             68          60.7
  6            26         23.2             94          83.9
  7            14         12.5            108          96.4
  8             4          3.6            112         100.0








56 papers. The papers were randomly distributed to each reader in 4 packets of approximately 14 papers each.
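The proportional logic of this stratified sampling can be pictured with a brief sketch. The original selection predates such tools, so the following Python fragment is only a hypothetical reconstruction: the pool counts come from Table 1, while the function name and essay identifiers are invented for illustration.

    import random

    # Score distribution of the full pool of 17,925 essays (see Table 1).
    pool_counts = {2: 1779, 4: 4666, 5: 4535, 6: 4098, 7: 2172, 8: 675}
    sample_size = 112

    def stratified_sample(essays_by_score, pool_total, n):
        """Draw a random sample within each score level, sized in
        proportion to that level's share of the full pool."""
        sample = []
        for score, essays in essays_by_score.items():
            quota = round(len(essays) / pool_total * n)
            sample.extend(random.sample(essays, quota))
        return sample

    # Hypothetical essay identifiers, one list per score level.
    essays_by_score = {s: [f"essay_{s}_{i}" for i in range(c)]
                       for s, c in pool_counts.items()}
    sample = stratified_sample(essays_by_score, sum(pool_counts.values()), sample_size)

With these pool counts, proportional rounding yields quotas of 11, 29, 28, 26, 14, and 4 essays, matching the sample frequencies reported in Table 1; in general, rounding may require adjusting one stratum by an essay to reach the target total.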

Three tables were established for the scoring, each consisting of four readers and an experienced table leader, all of whom were randomly assigned to their table. Two tables of four readers each followed regular holistic

scoring procedures and were used for comparative statistical purposes in the study; the third table was treated as a separate entity. That is, the four readers at the third table took part in the training procedures but then

adjourned to small, adjacent offices to tape record both their reading of the essays and their reactions to these essays.

Each of the eight readers participating in the regular scoring was assigned 56 papers to score, a number arbitrarily chosen for several reasons. It was manageable enough to facilitate the subsequent interpretation of data

and yet affordable. It also represented a large enough sample to reveal any scoring tendencies on the readers' part and to indicate any potential influence of training samples, breaks, and monitoring procedures. However, prior to any statistical analysis, five sets of data were

subsequently removed from the study: One set of matched papers was deleted because poor photocopying had made one of the two essays impossible to read; four sets were

removed as deviant data when the chief readers' independent








scoring beforehand of the entire group of papers revealed

that some papers had been incorrectly scored two years earlier and hence were inappropriately matched. Thus, the

actual data for the study comprised 51 sets of matched papers, or 102 essays altogether.

The four readers at the third table, hereafter referred to as "special readers," were given a subset of 20 papers

to score by means of talking protocols. Because they scored far fewer essays than did the eight regular readers, the special readers were not included in any statistical analysis. Results of the special readers' scoring will be

discussed separately from results obtained from the regular readers.


Procedures for Training

The scoring adhered to the customary procedures, with rangefinders provided initially for training purposes, followed by the presentation of several samples throughout

the scoring for the group to score and tally together. Reading breaks occurred at approximately 45-minute intervals, and table leaders monitored the scoring throughout. The chief readers conducted two check readings as an additional verification that all participants were scoring the papers comparably. In these respects, then, the scoring represented a replication of the procedures typically used in assessing writing on a large scale.







The training and monitoring procedures also followed custom insofar as readers were urged to employ the full range of scores (e.g., from 1-4) in assigning scores to the six rangefinders. Rangefinders from the original

reading were read first, with the readers asked to rank order the papers and to assign each of the four scores to

at least one essay. Then the readers' scores were publicly tallied. If one or two scores clearly differed from the scores assigned by other readers, readers whose scores were discrepant were urged to look the paper over again.

Once the rangefinders were tallied, table leaders,

who kept running accounts of the vote at their tables, led their table in a brief discussion of why papers received certain scores. They referred to the operational

definitions if necessary. Then pairs of sample essays were introduced, with readers again asked to read and score an essay and raise their hands as each score level was

announced by the chief reader. Samples were given until the group reached a consensus on most scores.


Special Measures Used for the Study

For the purpose of this study, several new measures were introduced. All eight readers at the two regular tables, Table 1 and Table 2, were reading the same papers arranged in random order in 4 packets of approximately 14 papers each. (The third table will be discussed subsequently.) Thus, for each essay, at least eight scores were obtained, four from each of two tables.








As the readers scored each paper, they were asked to jot down several brief comments about their overriding impression of the paper, its key strengths and weaknesses. The commentary, therefore, provided a running log that was used to explore the basis on which the readers made

particular scoring judgments. This "process log" resembled the log developed for writers by Faigley, Cherry, Jolliffe, and Skinner (1985). In addition, readers noted such

procedures as the time they began each reading session after a break, their scores on samples, and any adjustments they made after talking to their table leaders or after consulting the rangefinders. During the previous month readers had been given instructions in how to use the logs before they began their unmonitored scoring. A copy of the log is provided in Appendix A.

In actual scorings, chief readers customarily keep logs as part of their procedures, noting such details as the time of each reading, the samples used for training, and the start of check readings. For the purpose of this study, chief readers were asked to maintain their customary log, but it was labeled an "Account of Procedures" in order to

avoid being confused with the log or running commentary employed by the readers. The actual account is included in Appendix B.

In including the written observations of readers, this study partially followed the procedures used by Diederich

et al. (1961), who asked their readers to note comments as








they sorted papers into piles and assigned rankings.

However, as noted in the literature review, the readers in Diederich's study came from diverse backgrounds, received no training, and worked in an unstructured

situation. Logs have also been used in one study undertaken by Murphy, Carroll, Kinzer, and Robyns (1982) with the Bay Area Writing Project (pp. 397-410).

The use of written comments was selected for this study as opposed to the annotations used by Breland and Jones (1984) in their study of writing perceptions. Despite the

difficulties entailed in categorizing written observations, such comments are far less apt to disrupt the momentum of the holistic scoring than the more analytic checklist that Breland and other researchers have employed. Moreover, unlike the analytic checklists which provide readers with

lists of certain criteria, blank log sheets are not apt to influence readers' responses. Indeed, written comments

are currently used during the sample selection part of actual holistic scoring procedures, when the chief readers assemble to select the papers to be used as training samples during a scoring.

Table leaders were also asked to keep logs and to note such monitoring procedures as whose papers needed rereading, what discussions about writing ensued, and whether many scores needed to be altered. In addition, they described their readers' performance during, and reaction to, the use of training samples.








As noted previously, the third or special table, consisting of a table leader and four experienced readers, also participated with the other tables in the use of rangefinders and sample essays. However, at the conclusion of the training papers, the readers adjourned to separate small offices to record on audiotapes their ongoing reactions to a subset of the papers used in the monitored scoring. Such protocol analysis has been used in

composition research for a number of years and was recently employed by Huot (1988) in his study of holistic scorers.

The subset of 20 papers, like the larger one of 50+ essays, was deliberately selected to contain a range of score

levels and was assigned in random order to each of the four special readers. The number 20 was arbitrarily chosen to

allow for the extra time readers might need to read each paper aloud and record their impressions and observations.

The table leader for the special group moved among the four offices to monitor the scoring and to discuss any discrepancies. These talking protocols provided some in-depth insights into how the monitored scoring appeared to influence the scores that readers assigned.


The Questionnaire


A questionnaire devised for this study was given to all the participants immediately following the completion of the monitored holistic scoring in order that the respondents' written logs or protocols during the scoring








not be influenced by the nature of the questions. The questionnaire, a copy of which is included in Appendix A, contained three main categories of questions--(a) readers' ratings of the importance of certain features in writing,

(b) their self-report of their own biases in readings and their methods for dealing with these biases, and (c) their

reactions to the structured setting of an holistic scoring. Most items required closed responses, although several allowed for open-ended responses. An additional section enabled table leaders and chief readers to address questions dealing with their roles as monitors.

The questionnaire, designed in accordance with the

principles set forth by Berdie and Anderson (1974), was pilot tested two months previously by holistic scorers of the CLAST at another scoring site in the state. Over 60 percent of the readers and table leaders at the second site voluntarily completed the questionnaire and responded to specific questions regarding the substance, format, and

clarity of the instrument. (See Appendix A for a copy of the pilot questions.) A stamped, self-addressed envelope

was provided for the return of the pilot questionnaires. The respondents made specific suggestions for wording

changes, and they asked for additional items to be included in parts I (features of writing) and II (biases). In addition, several requested that the absolute categories of "never" and "always" be provided as options in parts III








and V. Many of the respondents indicated that answering

the questionnaire had been an interesting, challenging, or educational experience for them.


The Unmonitored Holistic Scoring


During the month prior to the monitored holistic

scoring, each of the eight regular readers was asked to score holistically at home four packets--over 50 papers--of matched essays written by different students on the same topics. Readers were asked to jot down their impressions of the papers in the running log, just as they were subsequently asked to do during the monitored scoring session. Readers were sent instructions on how to use the

log, as well as a copy of the operational definitions currently used in the CLAST administration (see Appendix A). These definitions, which describe the characteristics

typical of a certain level of essay, were the only training materials provided to the readers in the unmonitored setting.

The papers were scored over a four-week period. These papers with their scores and comments reflected how experienced holistic scorers scored without being

monitored and without being part of a group situation. During the unmonitored scoring, the table leaders and

the chief readers were assigned different tasks from the readers. For example, the chief readers met to review the








entire group of papers to be used in the study; they discussed each score until they agreed upon an appropriate rating for each essay. This practice, while certainly not

typical of an actual holistic scoring, was included to ascertain the chief readers' scores for the entire set, thereby helping to answer the question posed as to the role the chief readers play in influencing scores that are given. In addition, the table leaders read all the sample papers and rangefinders to be used in the subsequent monitored scoring, rating each paper and writing their responses to each essay. This procedure represented a

departure from typical procedures. That is, under normal circumstances, the table leaders meet with the chief

readers prior to a scoring to read and score the sample papers the chief readers have selected; then they discuss the results together.


Methods Used for Analyzing Data


The questions posed in Chapter 1 of the study are again listed below together with the methods used for analyzing the data; special attention has been paid to how well the

monitoring of an holistic scoring reflects the "true community of assent" as noted by White (1985).


1. Do the mean scores for the essays differ when the
papers are evaluated by readers working in a monitored setting from when they are judged by the
readers working independently?








To answer question 1, an analysis of variance, equal cell size mixed model, was used on a Biomedical program. The design was randomized, with the eight regular readers

comprising the repeated measure. (The four special readers who completed the protocols were not included in any

statistical analysis as they had scored far fewer essays than had the eight regular readers.) The model--P, T, E(P), R(T)--included the following four random factors: P, signifying the number of essay pairs (51); T, representing the number of tables of readers (2); E(P), signifying the monitored versus unmonitored essays nested within each

pair (2); and R(T), representing the number of readers nested within each table (4). Three Quasi F ratios were calculated according to the formula of B. J. Winer (1971) in Statistical Principles in Experimental Design for pairs

(Fp), tables (Ft), and pairs within tables (Fp(t)).
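For readers unfamiliar with quasi-F ratios, the general computation can be sketched as follows: composite numerator and denominator mean squares are summed, and approximate degrees of freedom are obtained with the Satterthwaite formula. The Python fragment below is offered only as an illustration; the placeholder mean squares are not results from this study, and the terms entering each ratio depend on the expected-mean-square table for the design described above.

    def quasi_f(numerator, denominator):
        """Quasi-F ratio from composite mean squares.
        Each argument is a list of (mean_square, df) pairs; the degrees
        of freedom follow the Satterthwaite approximation."""
        ms_num = sum(ms for ms, _ in numerator)
        ms_den = sum(ms for ms, _ in denominator)
        df_num = ms_num ** 2 / sum(ms ** 2 / df for ms, df in numerator)
        df_den = ms_den ** 2 / sum(ms ** 2 / df for ms, df in denominator)
        return ms_num / ms_den, df_num, df_den

    # Placeholder mean squares and degrees of freedom, not study results.
    f_ratio, df1, df2 = quasi_f([(14.2, 50)], [(3.1, 50), (2.4, 7)])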


2. Do experienced readers participating in a monitored
scoring achieve greater agreement with each other
than when they evaluate essays independently?

To answer question 2, Cronbach's alpha was used to indicate the degree of interrater reliability under the two different scoring conditions. The interrater reliability estimate was instrumental in showing both the extent to which

monitoring helped readers score alike and the extent to which readers may have internalized the standards.
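As an illustration of this reliability index, the following minimal sketch computes Cronbach's alpha from an essays-by-readers matrix of scores; the ratings shown are invented, and any standard statistical package provides an equivalent routine.

    import numpy as np

    def cronbach_alpha(scores):
        """Cronbach's alpha for an (essays x readers) score matrix:
        alpha = k/(k-1) * (1 - sum of reader variances / variance of essay totals)."""
        scores = np.asarray(scores, dtype=float)
        k = scores.shape[1]                      # number of readers
        reader_vars = scores.var(axis=0, ddof=1)
        total_var = scores.sum(axis=1).var(ddof=1)
        return k / (k - 1) * (1 - reader_vars.sum() / total_var)

    # Invented example: five essays each scored on a 1-4 scale by four readers.
    ratings = [[3, 3, 4, 3],
               [2, 2, 2, 3],
               [4, 4, 4, 4],
               [1, 2, 1, 1],
               [3, 4, 3, 3]]
    print(round(cronbach_alpha(ratings), 2))     # about 0.95 for these invented data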


3. What impact do the chief readers have on an holistic
scoring? How do they ensure both a reliable and a
collegial reading?








For question 3, the comments which the chief readers made during the training sessions were examined, and the check reading results were reviewed. In addition, because the chief readers had scored all the essays beforehand as part of their task in the unmonitored setting--a task not traditionally associated with their role--a second Cronbach's alpha was used to determine how well their scores correlated with those of the readers. The chief readers' comments, together with these scoring results, helped to indicate to what extent the chief readers were able to guide readers into assenting or "owning," as White (1985) indicates, the standards of the group.


4. What criteria do readers use in assigning different
score levels? What standards are reflected in the score levels assigned across the essays? How do
readers respond to these standards?

For the fourth category of questions, information from the eight regular readers' logs was transferred to a

Database 3 program; the readers' written comments were grouped in categories similar to those on Part I of the questionnaire (e.g., rhetoric, mechanics, grammar and usage). The database program (see Appendix A for a sample entry) not only indicated through pluses and minuses

whether the readers' comments were positive or negative but also allowed for paraphrases of each comment to be included. The database program was used to tally the positive and negative responses the readers made in their








logs at each score level; the program was also used to determine the exact nature of responses--e.g., whether rhetorical, mechanical, or grammatical--which readers gave to papers at varying score levels. The audiotaped protocols provided further corroboration of these criteria.
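The database program itself is not reproduced here; the short Python sketch below is a hypothetical illustration of the kind of tally it performed, with invented entries, field names, and category labels. Such a tally makes it straightforward to report, for example, how many negative mechanics comments accompanied papers scored 2.

```python
# Hypothetical illustration only: tally readers' log comments by score level,
# category, and polarity (plus/minus), in the spirit of the database program
# described above. All entries and category names are invented.
from collections import defaultdict

# Each record: (reader, paper_id, score, category, polarity, paraphrase)
log_entries = [
    ("1A", "034", 2, "rhetoric",      "-", "ideas thin and undeveloped"),
    ("1A", "034", 2, "mechanics",     "-", "frequent comma splices"),
    ("2C", "088", 4, "rhetoric",      "+", "clear organization, strong support"),
    ("2C", "088", 4, "grammar/usage", "+", "few usage errors"),
]

tally = defaultdict(lambda: {"+": 0, "-": 0})
for reader, paper, score, category, polarity, note in log_entries:
    tally[(score, category)][polarity] += 1

# Report positive and negative comment counts at each score level.
for (score, category), counts in sorted(tally.items()):
    print(f"score {score}  {category:<15} +{counts['+']}  -{counts['-']}")
```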

An English teacher with extensive training and experience in teaching writing served as an outside expert

to validate independently the accuracy of the database logs. She randomly reviewed 20% of the logs from each of the two scoring conditions and compared the readers' comments against each database entry. Whenever she found any errors, the database entries were adjusted accordingly before any analysis was done.


5. Do any common patterns appear in the scorers'
written responses to the essays, or do their comments underscore the individuality of each reader's transaction with the text? Do readers' holistic judgments, as shown by their written or oral responses, correspond to the writing features
they rate as important on the questionnaire?

For the fifth category of questions, the readers' comments--both written and oral--were studied for any common patterns that might emerge in either the monitored or unmonitored condition. The comments were examined to see whether readers giving an identical score to the same

essay cited similar or different reasons for doing so--such as organization, fluency of sentence style, or creativity.

It was hoped that identifying patterns of this nature would








help to answer whether a sense of community develops to influence readers' perceptions.


6. What is the nature of the monitoring that the
readers receive during a scoring as reported through the logs of table leaders and readers? Do the procedures noted in these logs, together with the protocols of the special readers, support the readers' perceptions of their own holistic scoring
processes as noted on the questionnaire?

For question 6, the logs of the three table leaders, together with the "procedure section" of the readers' logs (see Appendix A for a sample of the log) and the audiotapes, were examined for clues to the nature of monitoring. It was hoped that the logs would reveal how directive the table leaders were and what type of

relationship existed between the table leaders and the readers. Of special concern were how the readers responded

to different criteria and whether the data supported the readers' perceptions of the holistic scoring process as reflected through their responses to the questionnaire.

Thus, both qualitative and quantitative data were used

to determine whether monitoring in an holistic scoring reflects a congenial effort among scorers to arrive at common agreement throughout a scoring and whether this sense of community affects the judgments that scorers make.

An in-depth discussion of the results obtained in the study is presented in Chapter 4.














CHAPTER 4
RESULTS AND DISCUSSION



The results of the study are presented according to the questions raised in the first chapter. The first three sections deal primarily with the quantitative results and the last three, with the qualitative findings.


Mean Scores in the Two Scoring Conditions


Question 1: Do the mean scores for the essays differ
when the papers are evaluated by readers working in a structured setting from when they are judged by the readers working independently?

When the mixed-model analysis of variance for nested factors and repeated measures was computed, three statistically significant main effects were found and no interactions. Not surprisingly, as shown in Table 2, statistically significant differences were found (p < .05) among the pairs of essays. That is, each pair of matched

essays differed from the next pair of matched essays. Also not surprisingly, readers nested within tables differed to a statistically significant extent (p < .001). As will be seen in the discussion for question 5, the qualitative data highlighted the individuality of the readers, thereby confirming these differences among readers.










TABLE 2

Analysis of Variance Mixed Models Source Table




Source                               Error      Sum of        Degrees of   Mean       F
                                     Term       Squares       Freedom      Square
----------------------------------------------------------------------------------------
Mean                                            4470.71078         1
Pairs [P]                                        252.91422        50       5.0583    ^3.71*
Tables [T]                                         7.84314         1       7.8431    ^9.26
Essays within Pairs [E(P)]           ET(P)        72.37500        51       1.4191     5.03***
Readers within Tables [R(T)]         PR(T)         5.01471         6       0.8358     3.90**
Pairs crossed with Tables [PT]                    11.28186        50       0.2256    ^ .93
Tables crossed with Essays
  within Pairs [ET(P)]               ER(PT)       14.37500        51       0.2819     1.12
Pairs crossed with Readers
  within Tables [PR(T)]              ER(PT)       64.23529       300       0.2141     0.85
Error [ER(PT)]
----------------------------------------------------------------------------------------

Note: ^ signifies Quasi F ratios.

* p < .05.

** p < .001.

*** p < .00005.
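As an illustrative check not stated in the original text, the tabled quasi F values for pairs and for tables are consistent with composite denominators built from the mean squares above:

```latex
F'_{P} = \frac{MS_{P}}{MS_{E(P)} + MS_{PT} - MS_{ET(P)}}
       = \frac{5.0583}{1.4191 + 0.2256 - 0.2819} \approx 3.71,
\qquad
F'_{T} = \frac{MS_{T}}{MS_{R(T)} + MS_{PT} - MS_{PR(T)}}
       = \frac{7.8431}{0.8358 + 0.2256 - 0.2141} \approx 9.26.
```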








What seemed especially meaningful for the purposes of

this study was the statistically significant difference (p < .00005) found for essays nested within pairs E(P); these essays represented the two conditions of monitoring and non-monitoring. The overall mean for the 51 monitored essays was 2.279, with a standard deviation of .559, whereas the overall mean for the matched set of 51

unmonitored essays was 2.401, with a standard deviation of .690. Thus, not only was the mean score for the unmonitored essays significantly higher than the mean

for the monitored essays, but more of a spread existed among the mean scores on each essay in the unmonitored condition.

The overall higher mean for the unmonitored essays seemed due to the substantial number of upper-half scores (3's and 4's) awarded papers in the unmonitored condition:

Whereas only four essays out of the monitored set of 51 had a mean of 3 or better, 14 essays out of the unmonitored set of 51 had a mean of 3 or better. Figures 1 and 2 depict the breakdown of scores by reader and condition.

[Figure 1. Monitored vs. unmonitored scores for overall group. Bar chart of the number of times each score (1 through 4) was given, totaled across all the readers, with separate bars for unmonitored and monitored scores.]

[Figure 2. Monitored vs. unmonitored scores of individual readers (continued over a second page). One bar chart per reader, each showing the number of times that reader gave each score (1 through 4) under the unmonitored and monitored conditions.]

As can be seen, readers across the board gave fewer scores of 4 in the monitored condition than in the unmonitored; for several readers, the difference was dramatic. Admittedly, the zero scores of 4 for some readers in Figure 2 are misleading in that virtually all readers gave at least one score of 4 during the monitored scoring; some of those were included in the data deleted as deviant when the independent scoring of the chief readers showed four sets to be clearly mismatched. However, even

if these sets had been included, the monitored papers would still have had only half as many 3's and 4's as the

unmonitored set. This finding does not mean that the monitored scorers were awarding only lower-half scores, for several readers gave 3+ scores in their logs to show they

perceived some essays to be especially strong. However, such pluses and minuses could not be included in the data analysis because the actual scoring of an essay allows for

only a numerical score. Thus, the fact remains that during the monitored scoring, the spread of scores was tightened and the mean lowered. Why this tendency should have

occurred is intriguing. One possible explanation lies with the studies of Breland and Jones (1984) and of Sweedler-Brown (1985), who found that experienced scorers tended to score more strictly than did less experienced scorers. Perhaps this study reflected a similar trend, with the training procedures and the monitoring by table leaders lowering some individual readers' scores as the readers adhered strictly to the criteria under the monitored condition.

In fact, readers indicated on their questionnaires (see the open-ended question after item 24 in Appendix A)

that they tended to grade timed writings more leniently than they did papers written outside class. Without







rangefinders or sample papers to measure the actual essays against during the unmonitored scoring, some readers, such

as Readers 1B and 1C, may have awarded papers higher scores than they did the matched essays during the monitored

scoring when group standards became a constant focus of attention. One reader even wrote in her unmonitored logs

of several instances in which she would have consulted a table leader if she could have, and another reader

expressed regret at not having rangefinders to examine. Still others noted during their unmonitored scorings that

they consulted their operational definitions. Thus, during the unmonitored scoring, some readers felt the need for standards to anchor their evaluations against.

Part of the explanation for the lower mean score of the monitored scoring may lie with the nature of the monitoring itself. That is, in the course of either scoring training

samples together or individually discussing specific papers with table leaders, readers may have become more attuned to problems than when they were reading the essays impressionistically on their own. Indeed, most readers commented on

their questionnaires (item 50) that they tended to view problematic papers both holistically and analytically. In this sense, even the actual scoring process for this study may have contributed to a more analytic scoring than usual

in that readers were asked to note in their logs the elements to which they were responding.








The logs of the table leaders also suggest that the training process may have contributed to the significant difference in mean scores for the matched sets. The logs showed that on those four occasions in which two readers changed their scores after talking to their table leaders, the readers' scores were lowered,

rather than raised. Similarly, even though one check-reading paper was returned to a reader because it had been scored too low, three others were returned to two readers and to one table leader because they had been scored too high. It is conceivable that those readers who lowered their scores after they reviewed the essays under debate may have had their subsequent scores influenced--at least for a short period afterward--by

this experience; for example, many readers indicated on their questionnaires (item 42) that the return of a paper

"sometimes" affected their subsequent scoring processes. Admittedly, only a few readers were involved with returns; hence, such an explanation has limited application. Nevertheless, both the qualitative data and the quantitative data--which, as indicated by Figure 2, show

three readers' scores moving downward and five readers' scores clustering in the middle--illustrate a stricter adherence to criteria in the monitored condition than in the unmonitored.







Interrater Reliability


Question 2: Do experienced readers participating in
a structured scoring achieve greater agreement with each other than when they evaluate essays independently?

Cronbach's alpha was used to determine the extent of agreement among the eight readers in both scoring conditions. In the unmonitored condition the alpha was .936 for the 51 essays; in the monitored condition the alpha was .915 for the matched set of 51 essays. Thus, in both conditions the interrater reliability was high, and the readers scoring the essays independently appeared to achieve agreement with each other as great as, if not slightly greater than, when they scored the essays as a group.
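As a minimal sketch of how these coefficients can be computed, the following Python fragment assumes a matrix of scores with one row per essay and one column per reader; the data generated here are simulated for illustration and are not the study's scores. The item-deleted loop mirrors the "Alpha If Item Deleted" column of Table 3.

```python
# A minimal sketch, assuming a (essays x readers) score matrix; simulated data.
import numpy as np

def cronbach_alpha(scores):
    """scores: 2-D array, rows = essays, columns = readers."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1).sum()     # sum of per-reader variances
    total_var = scores.sum(axis=1).var(ddof=1)       # variance of essay totals
    return (k / (k - 1)) * (1 - item_vars / total_var)

rng = np.random.default_rng(0)
base = rng.integers(1, 5, size=(51, 1))              # a shared "true" rating, 1-4
noise = rng.integers(-1, 2, size=(51, 8))            # reader-to-reader variation
scores = np.clip(base + noise, 1, 4).astype(float)   # 51 essays x 8 readers

print("alpha (all 8 readers):", round(cronbach_alpha(scores), 3))
for r in range(scores.shape[1]):
    reduced = np.delete(scores, r, axis=1)           # drop one reader at a time
    print(f"alpha if reader {r} deleted:", round(cronbach_alpha(reduced), 3))
```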

As can be seen from Table 3, no one reader appeared to affect this high interrater reliability coefficient substantially: That is, if individual readers had been removed from the analysis, the lowest alpha in the unmonitored scoring would still have been .92; similarly, the lowest alpha in the monitored scoring would still have been .894 if individual readers had been removed. In the

unmonitored scorings, Readers 1C and 2C had the lowest correlations with the other readers (.74 and .73, respectively), whereas in the monitored scoring, Readers 1A and 1C had the lowest correlations with the other readers (.67 and .56, respectively).







TABLE 3

Cronbach's Alpha of Readers' Scores


                         Scale Mean     Scale Variance   Corrected       Alpha
                         If Item        If Item          Item-Total      If Item
Reader                   Deleted        Deleted          Correlation     Deleted
---------------------------------------------------------------------------------

                                  Unmonitored Scoring

1A                         16.902          25.050           .789           .927
1B                         16.549          23.093           .759           .929
1C                         16.667          23.627           .741           .930
1D                         16.726          23.443           .811           .924
2A                         16.882          23.346           .872           .920
2B                         16.824          24.468           .789           .926
2C                         16.863          25.561           .733           .931
2D                         17.098          23.850           .750           .929

                                   Monitored Scoring

1A                         15.863          16.521           .671           .908
1B                         15.745          14.954           .725           .904
1C                         15.824          16.628           .563           .916
1D                         16.020          15.060           .736           .903
2A                         16.020          15.660           .801           .898
2B                         16.196          15.961           .750           .902
2C                         15.941          15.817           .725           .904
2D                         16.039          14.278           .831           .894
---------------------------------------------------------------------------------








These correlations substantially exceed the .31 correlation among all the untrained readers in the study of Diederich et al. (1961); they exceed the .41 correlation among the English teachers in that same study.

That the readers of this study seemed to agree so strongly among themselves in the unmonitored scoring condition suggests that they had, from their years of scoring together, undoubtedly internalized the standards. Still another contributing factor may be the provision of

operational definitions for the readers' use during the unmonitored scoring; in this respect, the independent scoring condition differed substantially from the at-home scoring in the study by Diederich et al. (1961), in which readers were given few directions and no criteria on which to base their judgments.

Thus, even though the readers of this study had no table leaders to whom to turn for guidance, and even though they had no rangefinders or sample papers written on the applicable topics, the readers could consult the

definitions for each score level; in fact, the logs and tapes indicated that several readers did indeed do so.

At the same time, these results must be interpreted with caution. Because Cronbach's alpha was used with a substantial number of readers--namely, eight--the reliability rate is undoubtedly higher than might have occurred if the scores of only two readers had been correlated as




Full Text
xml version 1.0 encoding UTF-8
REPORT xmlns http:www.fcla.edudlsmddaitss xmlns:xsi http:www.w3.org2001XMLSchema-instance xsi:schemaLocation http:www.fcla.edudlsmddaitssdaitssReport.xsd
INGEST IEID EFN8U6GPS_4XRR0C INGEST_TIME 2017-07-11T21:20:52Z PACKAGE UF00101032_00001
AGREEMENT_INFO ACCOUNT UF PROJECT UFDC
FILES


xml version 1.0 encoding UTF-8
REPORT xmlns http:www.fcla.edudlsmddaitss xmlns:xsi http:www.w3.org2001XMLSchema-instance xsi:schemaLocation http:www.fcla.edudlsmddaitssdaitssReport.xsd
INGEST IEID EXAQUPK3R_6J06FY INGEST_TIME 2017-10-31T17:31:38Z PACKAGE UF00101032_00001
AGREEMENT_INFO ACCOUNT UF PROJECT UFDC
FILES



PAGE 1

PERSPECTIVES ON HOLISTIC SCORING: THE IMPACT OF MONITORING ON WRITING EVALUATION By WILLA BUCKLEY WOLCOTT A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY UNIVERSITY OF FLORIDA 1989

PAGE 2

Copyright 1989 by Willa Buckley Wolcott

PAGE 3

To Edward, Whose understanding and help made this project possible, and to Kedron, Charnley, Roma, Sharon, Bill, and Janet, Who encouraged me throughout.

PAGE 4

ACKNOWLEDGMENTS I would like to thank the many people who assisted me with this study. I was most fortunate to have had the commitment, expertise, and interest of 17 outstanding participants in this study. Daniel Kelly, University of Florida, served as chief reader and Anita Doyle, retired, Alachua County Schools, served as associate chief reader. The readers and table leaders were as follows: Kent Beyette, University of Florida; Janet Fisher, Jacksonville University; Noelle Geiger, Santa Fe Community College; Kay Gonsoulin, Buchholz High School; Anthe Hoffman, University of Florida; Gail Kanipe, Gainesville High School; Donald Kanipe, Santa Fe Community College; Robert Lauriault, University of Florida; Patrick McMahon, Tallahassee Community College; Daniel McPhail, Newberry High School; Wendy McPhail, Gainesville High School; Mary Morgan, Gainesville High School; Elizabeth Novinger, Tallahassee Community College; Vincent Puma, Flagler College; and Donald Tighe, Valencia Community College. I appreciate the help of William Wood in handling the logistics for the scoring, and I also appreciate the help of the holistic scorers at the Tallahassee scoring site who pilot tested the questionnaire during the fall of 1988. I am grateful to Denise Standiford, Eastside High School, for independently validating the logs in this study, and to

PAGE 5

the following people who assisted me with the statistical programs: Robert Baskin, Center for Instructional and Research Computing Activities; Anne DePalma, Department of Foundations; and David Miller, Assistant Professor, Department of Foundations I am indebted to Carolyn Lyons for transcribing the tapes and typing the final manuscript. I am also grateful to Jeaninne Webb, Director of the Office of Instructional Resources, and to my colleagues Sue Legg, Associate Director, and Dianne Buhr, Assistant Director of Testing and Evaluation, for their encouragement Finally, I would like to thank my committee members for all their help: Ruthellen Crews, my chair, who guided me and believed in me enough to let me undertake this study; Margaret Early, who asked insightful questions; and Robert Wright, Forrest Parkay, and Sandra Damico, who provided support and assistance along the way.

PAGE 6

TABLE OF CONTENTS Page ACKNOWLEDGMENTS iv LIST OF TABLES viii LIST OF FIGURES ix ABSTRACT x CHAPTERS 1 INTRODUCTION 1 Nature of the Problem 2 Purpose of the Study 2 Rationale for the Study 4 Significance of the Study 11 Limitations of the Study 15 Definition of Terms 15 Organization of the Report 18 2 REVIEW OF SELECTED LITERATURE 19 Conditions and Procedures of Holistic Scoring 19 The Effectiveness of Holistic Scoring in Comparision to Other Evaluation Systems 29 Factors Involved in the Evaluation of Writing 37 Summary of Literature Review 57 3 METHODOLOGY 60 The Monitored Holistic Scoring 60 The Questionnaire 70 The Unmonitored Holistic Scoring 72 Methods Used for Analyzing Data 73

PAGE 7

TABLE OF CONTENTS — (Continued) Page 4 RESULTS AND DISCUSSION 78 Mean Scores in the Two Scoring Conditions 78 Interrater Reliability 87 Impact of Chief Readers on Scoring 90 Writing Criteria Across Different Scoring Levels 99 Patterns Among Readers 113 Nature of the Monitoring 142 5 SUMMARY AND CONCLUSIONS 164 Procedures Used 165 Issues Explored 166 Discussion 173 Recommendations for Research 176 Conclusion 179 APPENDICES A FORMS AND QUESTIONNAIRE 182 B SCORING CHARTS AND ADDITIONAL FIGURES 192 REFERENCES 200 BIOGRAPHICAL SKETCH 206

PAGE 8

LIST OF TABLES TABLES Page 1. Descriptive Statistics for the Stratified Random Sampling 64 2. Analysis of Variance Mixed Models Source Table 79 3. Cronbach's Alpha of Readers' Scores 88 4. Cronbach's Alpha of Readers' and Chief readers Scores 94 5. Results of the Questionnaire, Part III 112 6 Results of Questionnaire on Biases and Preferences 134 7 Questionnaire Results Dealing with Training 158 8. Additional Questions for 12 Table Leaders and Chief Readers 161

PAGE 9

LIST OF FIGURES FIGURES Page 1 Monitored vs unmonitored scores for overall group 81 2 Monitored vs unmonitored scores of individual readers 82 3 Monitored vs unmonitored scores : Score type summary 91 4 Results of rangef inders and samples in the monitored scoring 97 5. Category of comments assigned at each score level 105 6. Summary of points assigned to writing criteria by individual readers 137 7 Readers ratings of importance of criteria in timed writings 138 8. Summary of points assigned to writing criteria by chief readers and table leaders 145 9 Ratings by chief readers and table leaders of importance of criteria in timed writing 146 B-l Account of procedures 192 B-2 Summary of comments for paper 034 193 B-3 Summary of comments for paper 088 196 B-4 Results of table leaders independent scoring of training samples 199

PAGE 10

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy PERSPECTIVES ON HOLISTIC SCORING: THE IMPACT OF MONITORING ON WRITING EVALUATION By Willa Buckley Wolcott December 1989 Chairman: Ruthellen Crews Major Department: Instruction and Curriculum The purpose of this study was to examine the impact of monitoring on the reliability and validity of holistic scoring as a means to evaluate writing. Six questions were addressed concerning (a) mean scores in monitored versus unmonitored settings, (b) agreement among readers, (c) the chief readers' influence, (d) criteria for scoring, (e) patterns in readers' responses, and (f) the nature of the monitoring. Qualitative and quantitative measures were used. In an unmonitored scoring, eight experienced holistic scorers with different teaching backgrounds rated over 50 expository essays written by college students for a state assessment program; the scorers recorded written responses to each essay in logs. Another four special readers rated a subset of the essays and orally responded to these papers through the use of taped protocols Later two chief readers conducted a monitored scoring with customary

PAGE 11

training procedures; three table leaders each monitored four readers throughout. The readers scored another 50+ essays, matched with the first set on the basis of original scores awarded during the actual scoring two years previously. All participants completed a questionnaire devised for this study. A mixed-model ANOVA for nested factors and repeated measures revealed a significant difference (p < .00005) in the mean scores assigned the matched essays in the two conditions; lower scores occurred in the monitored setting. The ANOVA also showed significant differences (p < .001) among the eight readers: These individual differences were reflected in the participants' logs and audiotapes. When two separate Cronbach's alphas were run with the chief readers' scores included in the second alpha, the reliability was consistently over .91. However, additional data suggested not only that readers' scores more closely approximated the chief readers ratings in the monitored setting than in the unmonitored, but also that the potential for fewer noncontiguous scores existed in the monitored setting. Thus, monitoring appeared effective in increasing readers' reliability. More significantly in terms of validity, the data showed readers clearly responding to rhetorical, as well as to mechanical, elements. Furthermore, all participants indicated that they assented to the holistic scoring standards and that they perceived monitoring as a helpful resource.

PAGE 12

CHAPTER 1 INTRODUCTION With the growth of writing assessment, holistic scoring has become widespread as one method for the largescale evaluation of students' essays. Sometimes called general impression scoring after the work of Diederich, French, and Carlton (1961), holistic scoring is based on the premise that the whole is more than the sum of its parts (Myers, 1980). Holistic scoring requires readers to read an essay quickly but completely, mentally rank ordering the essay as it compares in overall quality to other papers, and then to assign a score accordingly. In the course of the reading, holistic readers make no comments on the paper, nor do they tally any errors. Rather, they rate each paper in terms of its overall quality and record the score in coded form so that other readers will not know their ratings. Sample essays, which are selected from the same test administration as being representative of each point on a scoring scale, provide the readers with a frame of reference. In some modified forms of this procedure, a descriptive set of scoring criteria gives the reader an additional guide.

PAGE 13

Nature of the Problem The role that training and monitoring play in helping holistic scorers make these overall writing evaluations is yet to be fully explored. The importance of such training is implied in Spandel and Stiggins (1981) observation that training can help to eliminate biases on the part of readers; it is further emphasized by White (1985) who notes as follows: The training of readers, or x calibration' as it is sometimes called, is not indoctrination into standards determined by those who know best (as it is too often imagined to be) but rather the formation of an assenting community that feels a sense of ownership of the standards and the process, (p. 164)* Despite the seeming importance that these comments attach to training, the actual influence that such monitoring may have on readers' judgments within a structured holistic scoring has received little attention. Purpose of the Study The purpose of this study was to explore how the training and monitoring which readers undergo during a formal holistic scoring influence the writing judgments they make. By means of logs, holistic scores, protocol analyses, and the metacognitive awareness of the participants themselves as shown through a questionnaire, it was *From Teaching and Assessing Writing by Edward White, 1985, San Francisco: Jossey-Bass. Copyright 1985 by JosseyBass. Reprinted by permission.

PAGE 14

expected that the study would reveal to what extent the nature of this training embodies the "interpretive community" described by White as reflective of Fish's (1980) reader response theory. The following questions were addressed: 1. Do the mean scores for the essays differ when the papers are evaluated by readers working in a monitored setting from when the papers are judged by the readers working independently? 2 Do experienced readers participating in a monitored scoring achieve greater agreement with each other than when they evaluate essays independently? 3. What impact do the chief readers have on an holistic scoring? How do they ensure both a reliable and a collegial reading? 4. What criteria do readers use in assigning different score levels as reported through their logs, talking protocols, and responses to a questionnaire devised for this study? What standards are reflected in the score levels assigned across essays? How do readers respond to these standards? 5 Do any common patterns appear in the scorers written or audiotaped responses to the essays or do their comments underscore the individuality of each reader's transaction with the text? Do readers' holistic judgments, as shown by their written or verbal responses correspond to the writing features they rate as important on a questionnaire? 6 What is the nature of the monitoring that the readers receive during a scoring as reported through the logs of table leaders and readers? Do the procedures noted in these logs, together with the protocols of the special readers, support the readers' perceptions of their own holistic scoring processes as noted on their responses to a questionnaire?

PAGE 15

Rationale for the Study Training and monitoring comprise an essential part of a formal holistic scoring. In fact, the key to a successful holistic scoring lies precisely in its system of "checks and balances" that ensures the greater likelihood of readers rating the same paper comparably. According to procedures established by the Educational Testing Service, one check lies in the structured format of the reading itself. Readers assemble as a group to do the reading, hence reemphasizing the need for working toward a group consensus. They read at tables directed by table leaders, who gently function as consultants, guides, and monitors. The table leaders, in turn, are guided and monitored by a head table consisting of a chief reader and associate chief readers A second check lies with the ongoing nature of the monitoring provided. Not only do new readers undergo a preliminary training session in which they experiment with the holistic approach on old essays, but in subsequent scoring sessions, new and experienced readers alike receive additional practice. Each scoring is introduced with general comments in which information is provided about the examinees' testing conditions, and the readers are reminded to avoid potential problems with length, handwriting, and other surface features. Through this

PAGE 16

procedure the chief reader establishes expectations for the readers, expectations which, as Freedman and Calfee (1983) hypothesize, alter the "text image" (p. 94) that readers create in their own minds before making a judgment. Then readers begin by working with an old and new set of rangefinders — that is, with the anchor papers selected beforehand as illustrative of various scoring levels for that test administration. Together with the operational definitions that are usually included in a modified holistic scoring, the guidelines and rangefinders provide the criteria — both explicit and implicit — against which readers can rate the essays The monitoring continues throughout a scoring, as sample papers are used after each rest break to ensure the adherence of all readers to group standards; because the tally of sample scores is publicly recorded, the readers are able to see where their own scores place. Finally, frequent "check readings" are conducted in which a random sample of current essays from each table is independently evaluated by a reader, a table leader, and a chief reader. If any of the papers receive discrepant scores — that is, a noncontiguous score such as 1-3, 2-4, or 1-4 on a four-point scale — then the paper is returned to the party whose score is discrepant, and the reader is asked to review the essay. If the reader is unwilling to adjust the score after reviewing the paper, then it is

PAGE 17

subsequently refereed. Thus, in a formal scoring, training adds another dimension to the complexity of writing evaluation. These issues of training and monitoring are critical, for they affect both how and why scorers react the way they do to essays. As a tool of writing evaluation, holistic scoring attracts both strong support and serious criticism; the training of readers for the purpose of achieving a scoring consensus lies at the heart of the conflict. For example, in support of holistic scoring, Davis, Scriven, and Thomas (1987) point out that the reliability and the relatively low cost of holistic assessment make it valuable for evaluating a school's writing program. Bamberg, too, (1982) argues that writing programs which focus on the writing process should use essays for evaluation purposes, adding, "Holistically scored essays should, therefore, play a leading role in assessments of writing programs and writing competence" (p. 406). Cooper (Cooper & Odell, 1977) emphasizes as well the high interrater reliability that can be achieved in holistic scoring; he stresses that with similar backgrounds and training, raters can obtain substantial agreement in scoring several essays of a student In addition, the theoretical assumptions behind holistic scoring receive strong endorsement as White (1985) underscores the value of examining papers as a whole. He

PAGE 18

observes that holisticism "is the most obvious example in the field of English of the attempt to evoke and evaluate wholes rather than parts individual thought rather than mere socialized correctness" (p. 19). White readily acknowledges the limitations of holistic scoring, pointing out that this evaluation approach is unable to provide diagnostic information for individual students and that, moreover, the scores represent rankings rather than absolute values. But while stressing the need for using this approach responsibly, White defends the underlying principles behind this form of evaluation: Holistic scoring is important for reasons beyond measurement, for reasons that return us to the nature of writing and to the importance of the study of writing itself. It is in our writing that we see ourselves thinking, and we ask our students to write so that they can think more clearly, learn more quickly, and develop more fully. Writing, like reading, is an exercise for the whole mind, including its most creative, individual, and imaginative faculties. The rapid growth of holistic scoring in grading reflects this view of reading and writing as activities not describable through an inventory of their parts, and such scoring serves as a direct expression of that view: By maintaining that writing must be seen as a whole and that the evaluating of writing cannot be split into a sequence of objective activities, holistic scoring reinforces the vision of reading and writing as intensely individual activities involving the full self. (p. 32) In order for students to understand more clearly the evaluation criteria used on their papers, White, in fact, advocates the application of holistic scoring guides in the classroom. That many teachers enthusiastically endorse this practice (Mishler and Hogan, 1982; Paulis, 1985, and

PAGE 19

Westcott and Gardner, 1984) gives added weight to the value of holistic scoring as an aid to teaching and revising. At the same time, holistic scoring receives pronounced criticism from some researchers and scholars. Frequently cited as emblematic of reader reliability problems is the classic work of Diederich, French, and Carlton (1961). In this study the researchers asked over 50 readers from 6 fields to grade 300 essays written by new freshmen at different colleges All of the essays received at least five of the nine possible scores on the scale, and onethird even received the entire range of scores Often overlooked in references to this study, however, is the absence of any criteria or scoring assistance provided for readers (White, 1985); such a lack of training represents a complete departure from the guidance given in most holistic scorings of recent years. Nevertheless, even with such guidance given, the issue of score reliability — a broad term that reflects potential error sources in topics, tasks, conditions of examinees, and agreement among readers — remains, according to Breland, Camp, Jones, Morris, and Rock (1987), the "Achilles Heel" of writing assessment. In addition to questions of reliability, the issue of the validity of holistic scoring has come under attack. For example, Charney (1984) argues that "a given set of criteria devised by one set of experts is no more valid

PAGE 20

than a different set of standards, arrived at by a different group of experts" (p. 73). She suggests, moreover, that the very need for extensive training in holistic scoring implicitly illustrates the difficulties readers experience in adhering to the imposed criteria. This difficulty, according to Charney, is further shown in those studies which have found such superficial features of writing as handwriting or spelling to be influential in the holistic scores assigned. Thus, Charney concludes, "Holistic ratings should not be ruled out as a method of evaluating writing ability, but those who use such ratings must seriously consider the question of the validity of the scores that result" (p. 79). Similar caution is reflected in the report by the CCC [College Composition and Communication] Committee on Teaching and Its Evaluation in Composition (1982); the report notes that holistic scoring is of "limited value" for evaluating either writing programs or courses Several educators and composition specialists also express their concern. Hirsch (1977) states, for example, that some of the greatest thinkers in history have been unable to establish holistic standards which encompass both intrinsic and extrinsic criteria. Hirsch insists that the Aristotelian mode of intrinsic evaluation, which judges how effectively and how correctly writers carry out their intentions, is better for predicting writing ability than is the Platonic

PAGE 21

10 mode, which judges the quality of intentions external to the writers; the intrinsic mode is also, he argues, "the only kind of assessment in which anyone should have confidence" (p. 186). Elbow (1986) expresses reservations about an evaluation model which requires agreement among judges. Not only may it result in an overemphasis on such measurable features as grammar and spelling, but it also requires readers to suspend their own judgments in favor of other standards. For Elbow, "descriptive perceptions" (p. 255) — even when they conf lict--provide a more valuable learning experience than those evaluations which merely rank or measure. The need for agreement among holistic raters disturbs the educator Roberts (1983) as well, contributing, in his view, to a limited, "product-centered and decontextualized" form of evaluation that disregards the writer's purpose, intentions, and environment. Roberts questions whether holistic scoring, like the empirical research to which he attributes its growth, can effectively measure writing quality, writing change, or "anything other than how well a writing sample simulates an Idealized Text" (p. 3); his latter observation directly contradicts the view expressed by Spandel and Stiggins (1981) that holistic scoring, comparing as it does the relative quality of essays, has no "preconceived notion of the 'ideal' paper" (p. 24).

PAGE 22

11 As can be seen, those who question the validity of holistic scoring imply that emphasizing agreement of scores through training destroys the individual perspective — an individuality endorsed in recent years by such reader response theorists as Bleich (1975). But White (1985) argues an opposing viewpoint just as emphatically. Calling attention to the importance of the nature of the community that forms in an holistic scoring, White compares the "true community of assent, which is properly developed through a formal essay scoring, to the "interpretive community" discussed by another theorist of reader response, Fish (1980). Although Fish's concept refers to the sense of agreement readers of literary texts strive to attain, White (1985) sees similarities between Fish's reader response theory and the need for establishing a responsive community in holistic scoring. Thus, a study is needed to explore the impact of training both on the nature of the community that develops among the scorers and on any scoring agreement that results Significance of the Study In revealing the extent to which holistic scorers willingly adopt the criteria, the study should have educational, theoretical, and practical significance. First, as Keech (1982) notes, it should show the individual holistic scorer responding "not as an error-counter or a

PAGE 23

12 conserver of threatened forms but as a receiver of intended communication" (p. 174). Holistic scorers attempt to derive the meaning through envisioning each text as a whole; they operate in a context in which they are encouraged both to remember the limited conditions under which examinees write and to recognize that strong papers may not be perfect ones. As such, the study should verify White's (1985) observation expressed below: The simple fact is that the definition of textuality and the reader's role in developing the meaning of a text that we find in recent theories of reading happens to describe much of our experience of responding with professional care to the writing our students produce for us. Part of the problem of evaluating student writing comes out of our deep understanding that we need to consider the process of writing as well as the product before us and that much of what the student is trying to say did not get very clearly into the words on the page. (p. 93) The study should further serve to integrate writing assessment more closely with reader response theory by illuminating whether the particular sense of community that arises through training procedures influences the holistic judgments made. The early composition researchers Braddock, Lloyd-Jones, and Schoer (1963) stress the importance of agreeing to criteria. In a reference to analytic scoring, for example, the researchers link the effectiveness with which criteria are applied directly to "the commitment which each rater feels toward the criteria being employed" (p. 15).

PAGE 24

13 As indicated by White (1985), reader response theorist Fish (1980) attributes the agreement which can occur among readers of literature to a "stability in the makeup of interpretive communities," a stability arising from a commonality of goals (Fish, 1980, p. 15). According to Fish, the stability is due not to independent qualities within the texts, but rather to the "interpretive strategies which give shape to the event of reading and hence to the making of meaning of the texts themselves. The nature of the interpretive communities can change because the interpretive strategies are learned; they are learned through persuasion, as writers invite readers to employ particular strategies The reader response theorist Rosenblatt (1985, 1988) finds Fish's view too narrow; however, she also underscores the value of agreement in her observation that "in any specific situation, given agreed-upon criteria it is possible to decide that some readings are more defensible than others" (Rosenblatt, 1985, p. 36). An important distinction Rosenblatt makes is between "efferent" and "aesthetic" reading. In efferent reading, the reader is concerned with what can be taken away from the reading, whereas in aesthetic reading, the reader is involved with experiencing the reading event itself. The two types of reading fall on a continuum, requiring readers to select the primary elements to which they will give their

PAGE 25

14 attention. Of special significance for this study is Rosenblatt's contention that "the need for grasping the author's purpose and for a consensus among readers is usually more stringent in efferent reading" (1988, p. 8). As she elaborates, "In efferent reading, the student has to learn to focus attention mainly on the public, referential aspects of consciousness and to ignore private aspects that might distort or bias the desired publicly verifiable or justifiable interpretation" (1988, p. 8). In this regard the training of holistic scorers can perhaps be perceived as a means for helping readers adopt an appropriate stance in which they overcome their biases and select agreed-upon public criteria. Thus, the issue of agreement surfaces both in reader response theory and in writing assessment. Because this issue lies at the heart of questions concerning the validity of holistic scoring, this study should have theoretical implications in revealing the degree of commitment holistic scorers feel to the standards they use. Finally, the study should have practical significance. If the monitoring and training processes of the structured scoring have a noticeable impact on readers evaluations then the need for continuing holistic scorings within a formal context will be apparent. If, on the other hand, readers' judgments do not appear to be unduly affected or altered by the group monitoring procedures, then an option

PAGE 26

15 might be to have experienced readers follow what is currently done in many state assessments and score some essays at home. Limitations of the Study Any conclusions to be drawn from the study will, of necessity, be limited, as the small number of participants — 17 altogether, including readers, table leaders, and chief readers — and the limited number of essays involved — a little over 100 — will prevent generalizations. Furthermore, the scorers involved in the study will be highly experienced readers; different results might be obtained with less experienced scorers who might not react the same way, especially in the at-home scoring. Finally, the study relies on the accuracy of scorers' self -reporting in logs, taped protocols, and questionnaire responses; such self-reporting not only entails subjectivity but also, as Freedman and Calfee indicate (1983), depends upon the evaluators abilities to articulate their own responses Definition of Terms For the purpose of this study, the following definitions are used: 1 Scorer and reader are used interchangeably to refer to those readers making the rating judgments

PAGE 27

16 on each paper. Special scorer is used to refer to any of those four readers who do talking protocols as they evaluate the papers 2 Chief reader or trainer is used to refer to the one or two individuals who conduct the holistic scorings and who train the readers by providing sample papers 3 Table leader refers to the individual who is in charge of a table of readers and who monitors those readers' progress. 4. Monitored scoring structured scoring and formal scoring are used interchangeably to refer to a formal writing assessment approach in which a group of readers meets and follows the set procedures described in this study. 5. Training and calibration of readers refer to the processes whereby the readers are given initial exposure to selected sample papers and ongoing practice in reading and scoring those essays These processes also include the public tallying of scores on those papers 6. Monitoring refers to the ongoing process whereby table leaders continuously check some of the actual essays the readers have scored. 7 Check reading refers to the formal process whereby the chief readers collect from the table leaders

PAGE 28

17 two papers per reader which the table leaders have also scored. The chief readers independently score these papers and compare the results 8 Rangef inders and anchor papers are used interchangeably to refer to the six essays which have been formally chosen in a previous sample selection process as representative of each scoring level from level 1 (the lowest) to level 4 (the highest) These papers, which the readers must initially rank order according to quality, serve as guideposts for the standards of any reading. 9 Sample refers to additional essays which have also been selected from the same previous scoring as the rangefinders and which are used throughout a scoring to illustrate particular levels of scores. 10 Operational definitions is used to refer to those written descriptors of each level of paper and the qualities that the levels embody. 11. Log is used to refer to the running commentary readers provide of their scores and the decisions for these scores. It is distinguished from the term Account of Procedures which is used to refer to the customary log of procedures that chief readers maintain.

PAGE 29

18 Organization of the Report A review of selected literature is presented in Chapter 2 The methodology used in the study is addressed in Chapter 3. The qualitative and quantitative results are discussed in Chapter 4, and a summary of the findings and their implications are presented in Chapter 5.

PAGE 30

CHAPTER 2 REVIEW OF SELECTED LITERATURE Literature on holistic scoring and its related areas falls into three broad categories: One set of studies establishes and describes holistic scoring procedures and conditions, primarily for the purpose of improving their implementation. A second category addresses such issues as the validity, reliability, and cost-effectiveness of holistic scoring and explores the effectiveness of this evaluation system as it relates to other procedures A third category, more theoretical in nature, explores the complexities entailed in making writing judgments. Conditions and Procedures of Holistic Scoring The classic study by Diederich, French, and Carlton (1961) cited in the introduction has influenced both the conceptual base and the method of holistic scoring. As noted previously, the researchers asked readers to make an overall judgment as to the quality of a particular essay by implicitly rank-ordering the papers. In this study, 53 professionals from a variety of fields evaluated about 300 essays written at home by college freshmen from different universities. The raters were first instructed to sort 50 papers into three piles signifying their level of quality — 19

PAGE 31

20 average, above-average, and be lowaver age. Next, they were to sort each pile into three more stacks for a total of nine. Finally, they had to place the remaining papers in one of the appropriate piles and write comments about what they liked or disliked in the essays. When the grades assigned to each paper were correlated, the median of correlations between readers was .31, indicating a low reliability for the reading. Precisely for this reason, the work of Diederich et al has often been cited as indicative of the problems inherent in general impression scoring. Yet, as White (1985) notes, it is important to recognize that the conditions of their study differed considerably from those typically used today: Not only did the readers come from diverse backgrounds, but neither training nor monitoring was provided; moreover, the papers were written outside class, a departure from normal testing conditions In fact, because of the large variation that occurred in their study, the authors conclude that reliability is crucial if scoring the essays is to serve any important purpose. They suggest that readers should be tested so that only those whose ratings correlate at .60 with the general consensus be allowed to score; they speculate that some training and directions might be of help. Thus, the discoveries made in this major work have undoubtedly been

PAGE 32

21 instrumental in the development of holistic scoring as it is known today. A second influential study is The Measurement of Writing Ability (1966) by Godshalk, Swineford, and Coffman. The authors used holistic scoring to evaluate each of five essays written by nearly 650 grade 11 and grade 12 students throughout the country. Although the study was undertaken for the purpose of validating multiple-choice items on a standardized test, the researchers conclude with several recommendations about holistic scoring. For example, they note the importance of providing sufficient time for training early in a scoring session in order to avoid having readers assign overly high scores in the beginning. Finding the time of day to be a factor in a scoring, they also emphasize the need for having multiple readings of a paper done at different stages of a scoring session. Such a practice can, according to the researchers minimize both the variance among readers and the variance deriving from the time of day. In later stages of their study, the researchers reduced the number of readers evaluating any one essay, and they experimented with changing their original 3-point scoring scale, denoting superior, average, and inferior, to a 4-point scale. The even-numbered scale required readers to choose the half of the scale that each paper exemplified, rather than resorting to the safety of the middle

PAGE 33

22 score whenever in doubt. As a result of their experimentation with various conditions the work of Godshalk et al (1966) has formed the basis for many formal holistic scorings conducted today. More recent literature reflects continued interest in the conditions under which an holistic scoring is conducted. For example, Paden (1986) explored the possible relationship between the context in which an essay is placed and the influence of floor and ceiling effects on the range of different score levels. Citing the conclusions of other researchers who failed to minimize context effects on a scoring, Paden hypothesized a theoretical model to link the effect of context to the potential for increase or decrease contained by each score level. Because the potential for change can differ substantially depending on whether a score is a 1 or a 3, for example, the particular placement of a score on a scale can, according to Paden, affect the amount of change that context can influence. She stresses the need for validating her hypothetical model. Concern for training of the readers also appears in several studies during this decade. The role that training of readers can play has been illustrated in a study by Freedman (1981), who studied the impact of three variables — essay, reader, and environment — on an holistic scoring. Four highly qualified scorers worked in pairs to

PAGE 34

23 score holistically 64 argumentative papers composed on each of eight topics by college students; in a later session, they rated the same essays analytically. Two trainers trained the different pairs of raters, providing the readers with sample essays for each topic. Freedman (1981) found that the four readers graded the papers consistently with each other and appeared to be unaffected by the rating session and the time. In addition, their holistic scores correlated significantly with all the analytic ratings except for the area of usage. The choice of topic did seem to affect the results in that one opinion topic received higher scores throughout. Most important, Freedman also found an apparent effect that the trainer could have on the scoring. Even though both trainers (one of whom was Freedman) agreed on the scores to be given sample essays, on a replay of the taped training sessions, differences in the discussions appeared. Thus, while one trainer might state that two contiguous scores of 2 and 1 were appropriate for a given sample, the other trainer might state that the 1 score was not suitable for papers of that type Freedman (1981) speculates that such training differences can result in higher or lower scores being assigned accordingly, and she suggests that researchers in small projects avoid conducting the training themselves in order to avoid influencing results in favor of their hypotheses.

PAGE 35

24 The importance of training in writing evaluation was underscored by Hrach's dissertation (1983), in which she explored possible links between raters' previous writing experiences, their tolerance of ambiguity, and their evaluation approaches. Fifty-nine secondary English teachers sorted into whatever scoring categories they chose 20 papers written on the same topic by secondary school students. Using a three-way multidimensional scaling system, Hrach identified the basis of the classifications to be style, organization, maturity of thought and expression, and substance. Thirty-nine other teachers who rated the compositions analytically confirmed the accuracy of the classifications. In addition, the teachers completed two instruments addressing their experience with writing skills and their tolerance of ambiguity. Results of Hrach's study, like Diederich et al.'s (1961), confirmed that raters were influenced by the presence or absence of certain writing qualities in essays. Also like the findings of Diederich and his colleagues, Hrach discovered that the raters differed substantially on what they considered important. That is, only three raters used three of the four dimensions she identified, and almost half focused on one dimension alone. These differences in evaluations did not appear related either to the raters previous experiences with writing skills or to their tolerance for ambiguity, as these features were subsequently not found to be influential in writing judgments.

Because the raters were given neither criteria nor restrictions for sorting the papers, Hrach (1983) — who endorses the realism of this practice — nevertheless suggests that a lack of training in writing evaluation might explain the wide variability in results. She speculates that even if the instructors had completed coursework in writing, without training in the teaching or evaluating of writing, "they probably would not have developed common perceptions of what constitutes good writing" (p. 171). In what seems a forerunner of this study, Hrach suggests that it might be useful to examine how raters trained in holistic scoring rate papers independently and as part of a group. The importance of training in an holistic scoring has also been emphasized by Sweedler-Brown (1985), who sought to determine whether the amount of training and the experience that holistic scorers had with a grading scale affected either their evaluations of writing quality or the consistency of the evaluations. Using a six-point scale, 20 experienced writing instructors and graduate students, whose experience with holistic scoring ranged from none to three years, holistically scored 897 essays written by university students. From this group of essays, the 36 essays which had received discrepant scores were selected for analysis. Each of the readers involved in the discrepant scores, together
with one of the six trainers who had served as referees, was asked three days later to score the same essays analytically. The eight criteria included content, organization, diction, development, mechanics, and spelling. When the holistic scores were correlated with the total analytic scores, the trainers were found to give equivalent holistic and analytic scores over 60% of the time, whereas readers assigned comparable scores only 37% of the time. Although both trainers and readers valued content and sentence structure (albeit in reverse order), the trainers tended to give lower holistic and analytic scores than did the readers. Thus, the researcher concludes, "Graders with greater experience and training have significantly greater consistency between their holistic and analytic evaluations of the same essay, from which we conclude that the amount of training and experience does significantly affect the reliability of a grader's evaluation" (p. 54). The importance Sweedler-Brown (1985) attaches to training seems justifiable; however, some limitations in her study suggest that the results must be interpreted cautiously. That Sweedler-Brown's (1985) conclusion derives from a small sample of 36 discrepantly scored papers seems troublesome: Not all readers would necessarily have been involved in these discrepant scores, and hence, the actual number of readers from whom correlations were obtained —
while not clearly stated — might have been fewer than the 20 doing the holistic scoring. In addition, the training provided for the analytic scoring was far more limited than that given for the holistic criteria, with the result that agreement on the analytic scales might have been harder to achieve. Finally, as Sweedler-Brown acknowledges, the trainers scored far fewer papers than did the readers. The trainers might have remembered their original holistic scores on the discrepant papers, thereby contributing to the higher correlation they achieved between analytic and holistic scores. Thus, the limitations of this study militate against the conclusions, however strong the need for training in holistic scoring appears to be. Differences between trained and untrained scorers were also examined by Huot (1988) in a recent dissertation somewhat related to the present study. Arguing that too much attention has been paid to the issue of agreement, Huot explored the validity of holistic scoring by comparing the protocols of four novice and four expert holistic raters. Each scorer rated 84 essays selected from a previous assessment and written in letter format by college freshmen on two different topics. The scorers, who talked aloud for half the essays (42 each), were given training in doing protocol analysis. Then, over a four-day period, pairs of expert readers and pairs of novice readers scored for two days apiece. The novices were given neither
training in holistic scoring nor any rubric to use; the experts, together with a scoring leader, first trained with anchor papers and the original rubric, and then they modified the rubric for the protocol scoring. The researcher coded the number of responses that each scorer made, noting when the responses were made, whether the responses were positive, neutral, or negative, whether the responses were made to the writer or to the essay, and the criteria on which the judgments were based. After each scoring session, the researcher interviewed the scorers. Huot found that even though the novice raters made substantially more comments than did the expert raters, the experts' comments — many of which were made after rather than during the scoring — reflected more varied viewpoints and more personal engagement with the student essays. Because the experts could use a scoring rubric whereas the novices could not, Huot concludes that, contrary to his expectations, the rubric and other holistic training procedures did not intrude on the rating process. Rather, by providing scorers with "expectations, justification or explanation" (p. 223), the rubric enabled the raters to read the essays more fully. Novice raters, on the other hand, sought strategies that would work with a specific set of papers and concentrated on evaluation to the exclusion of any personal engagement with the essays. Huot suggests that holistic scoring procedures, far from impeding true
reading, "actually promote the kind of rating process that insures a valid reading and rating of student writing" (p. 237). As can be seen from this section of the literature review, several studies have reflected concern both for determining what occurs in holistic scoring and for improving the procedures under which writing is holistically scored. Because of these concerns, several of these works have become the reference point for the practices currently used in a structured holistic scoring.

The Effectiveness of Holistic Scoring in Comparison to Other Evaluation Systems

A number of studies have explored either the effectiveness or the cost efficiency of holistic scoring, especially as it relates to other forms of writing evaluation. One such study was the early undertaking of Follman and Anderson (1967), who randomly assigned five raters to use one of five evaluation approaches in rating ten compositions written by college students. The rating systems included the California Essay Scale, the Cleveland Composition Rating Scale, the Diederich Rating Scale, the Follman English Mechanics Guide, and the Everyman's Scale, in which the evaluators could use whatever system they wished. All but two of the evaluators were English education majors enrolled in the same English course.

Follman and Anderson found high correlations among the different systems except for the Diederich scale; they also found high reliability for each group, leading the researchers to conclude that the homogeneity of the raters might be a major contributing factor. In a later study, Winters (1978) compared four different scoring systems — one General Impression, two analytic, and one a T-unit analysis — to determine how well each system classified four groups of students who had been previously placed in high and low writing groups in high school and in college. After six high school and college teachers were thoroughly trained in at least two of the scoring systems, four of the readers used each system to score 80 papers. Interrater reliability was substantial on all four systems, with the General Impression system achieving the lowest rate at .81, in contrast to the .99 reliability rate of the T-unit analysis system. Winters (1978) attributes the relatively low reliability of the General Impression scale both to the fact that the scorers used this system first and hence lacked the practice they subsequently experienced with the other systems and also to the fact that the rubric for this system was less defined than it was for any of the other procedures. Of most concern to Winters (1978) is her finding that in three of the four systems — the General Impression
system, the Diederich Expository Scale, and the CSE Analytic Scale, which was developed at the Center for the Study of Evaluation — the low college group did better than did their high peers. Winters attributes this unexpected occurrence to the small size of the sample, the atypical nature of summer students, and, most significantly, to the substantial number of foreign-born students in the college low group — students whose problems with syntax or with awkward wording might not be reflected by the scoring systems. That the T-unit did not discriminate among the four groups at all could, according to Winters, be explained by the similarity of age in the students of the study, unlike those students in previous research on T-units. The researcher speculates that three systems are better than two for classifying students' writing, and she notes that a combination of General Impression scoring, together with an analytic system, seems best. She concludes that General Impression scoring, while not adequate alone for placement procedures, should be included in most writing assessments. Like Winters, Shoaf (1985) also studied the effectiveness of two different methods — holistic scoring and T-unit analysis — in evaluating the writing skill of high school students. An additional purpose of her study was to determine whether students gained in writing proficiency
over a semester and retained that growth during the years following. Shoaf (1985) had the students enrolled in her sophomore-level average composition class write a 50-minute pre- and post-test on the same topic; one and two years later all students taking English wrote on the same topic for a delayed post-test. The researcher and an assistant then tallied the number of T-units in 388 samples. The essays were typed, and a team of 12 scorers, after undergoing a training session with anchor papers, holistically scored the essays on a scale of 1-4. A correlation of holistic scores with the T-unit results proved non-significant. When the holistic scores were analyzed for the four groups as a whole, the holistic scores increased over the semester and reflected a slight decline on the delayed post-test. However, when the T-unit scores were analyzed for the same period of time, no significant results occurred. Thus, Shoaf (1985) concludes that T-unit analysis is not effective in evaluating the overall writing progress of groups and that T-unit scoring should only be used for determining levels of syntactic maturity. Acknowledging that her study did not address the issue of individual writing proficiency, she states, "Holistic scoring is a useful technique for determining whether groups of students have made general progress in the development of writing" (p. 67).

In a study by Bauer (1981), the cost-effectiveness of three different scoring systems — analytic, primary trait, and holistic — was explored, as well as the inter-reliability and intra-reliability of each system. Nine graduate students, none of whom were familiar with the scoring methods, were divided into groups of three and trained in one of the methods. The graduate assistants scored 118 essays previously written for the National Assessment of Educational Progress. Results indicated that the analytic scoring method, which contained the most specific scoring criteria and which required the longest training time, achieved the strongest inter- and intra-reliabilities. The holistic scoring method, though attaining the lowest intra-reliability rate, was the second strongest of the three methods in terms of inter-reliability. It also proved to be the most cost-efficient for scoring large numbers of essays. Janopoulos (1987) explored the effectiveness of holistic scoring from still another perspective. He sought to determine how well holistic scorers comprehended compositions written by nonnative speakers of English. After receiving training in holistic scoring, 12 readers rated two compositions predetermined as representing higher and lower quality. In the first rating — the "naive" condition — readers were not told they would have to write a recall protocol after the holistic scoring; in the second
rating — the "focused" condition — readers were told beforehand that another recall protocol would be required. The readers operating in the naive condition were able to recall the higher text more clearly than they did the lower, thereby illustrating, according to the researcher, the role that comprehension can play in raters' holistic judgments. To his puzzlement, even though the readers operating in the focused condition recalled more overall content than they did in the naive condition, the focused readers did not recall more of the higher level text than they did that of the lower; rather, they recalled about the same amount of information in both levels. Janopoulos (1987) attributes the lack of impact that this higher text seemingly had on the focused readers to a possible ceiling effect and to the small sample size. He concludes, nevertheless, that holistic scoring is a valid way to assess non-native speakers' writing proficiency in terms of the comprehension component. Despite the problems Janopoulos encountered in interpreting the results, his conclusion seems valid in that holistic scoring more closely resembles the naive condition under which the readers in his study were operating than it does the focused condition. An altogether different stance toward holistic scoring appears in Roberts' dissertation (1982); he compared individualized writing instruction, an approach he strongly
endorses, to the more traditional classroom method of teaching composition in two West Virginia colleges. Students wrote pre- and post-essays on a topic developed by researchers in another study, took the Daly-Miller writing apprehension test at the beginning and end of their work, and answered three questions regarding their view of writing. Then the essays were holistically scored and studied for T-unit length; they were also rated according to a forced-choice method. An increase in the T-unit length for the control group was the only significant difference that occurred. Roberts (1982) questions the effectiveness of holistic scoring as a means of evaluating the quality of student writing. He notes that one of his four raters dropped out of the study altogether, unwilling to rate "Themes as Products" (p. 96), and he points out that still another rater failed to achieve acceptable reliability. Roberts observes, "All of the raters commented that the evaluation techniques required product-centered evaluation based on an artificial rubric that, while developed specifically for the essay topics by prominent researchers, was inadequate for evaluating what the papers really deserved based on what the raters perceived as the students' intentions" (p. 96). Although Roberts' (1982) disillusionment with holistic scoring may be warranted, his use of both a scoring rubric
and a topic from a different testing program is troublesome; as White (1985) suggests, the requirements of each testing population and program must be taken into consideration in the development of an holistic guide. Moreover, Roberts' study was problematic in that he controlled only for the instructional mode and not for such other variables as teacher differences or course content. Although Roberts dismissed this lack of control by stating that his study was primarily naturalistic rather than experimental, he did not provide the extensive descriptions or observational data often associated with naturalistic studies. Therefore, despite the limitations which holistic scoring admittedly has, the problems in Roberts' dissertation weaken the impact of his criticism of this scoring method. Taken together, the studies by Winters (1978), Shoaf (1985), Bauer (1981), Janopoulos (1987), and Roberts (1982) illustrate the potential, as well as the limitations, of holistic scoring for writing evaluation. Their findings suggest that holistic scoring is more meaningful — albeit somewhat less reliable and more time-consuming in terms of training required — than is T-unit analysis as a means of assessing overall writing quality, including the writing of non-native speakers of English. It is also a cost-efficient approach for the large-scale assessment of essays. At the same time, as Winters (1978) points out, holistic scoring cannot reveal specific,
diagnostic information and hence, the purposes for which it is used must be clearly defined beforehand.

Factors Involved in the Evaluation of Writing

A third major component of the literature review encompasses those studies that explore the elements involved in the evaluation of writing. The focus of this section is not on holistic scoring per se but rather on the larger issue of writing quality — and most importantly, on those elements that comprise that quality. Thus, studies which address writing in a variety of contexts are included, as are studies which use assessment methods other than holistic scoring. The studies are primarily categorized according to results although, of necessity, some overlapping among the categories occurs.

Content and Organization

The importance of content is emphasized by Diederich (1974), who, in discussing the factor analysis that was performed in the earlier study of Diederich, French, and Carlton (1961), states:

Then it became quite clear that the largest cluster was most influenced by the ideas expressed: their richness, soundness, clarity, development, and relevance to the topic and the writer's purpose. Hence we must accept it as a fact that a high proportion of intelligent, educated adults do pay attention to the quality, development, support, and relevance of the ideas expressed in student compositions and weight them heavily in their judgment of the general merit of these papers. (p. 7)

Support for Diederich's views comes from two other studies in which content and organization proved to be significant determiners of writing quality. For example, Freedman (1979) undertook to find which essay characteristics influenced judges most by rewriting four essays on each of eight topics composed by college freshmen. The essays were rewritten to be strong or weak in the four broad categories of content, organization, sentence structure, and mechanics; then they were typed. Unaware of the rewriting that had been done, 12 instructors of a college freshman English program holistically scored the papers and subsequently rated them according to their perceptions of each paper's strength or weakness in each category. An analysis of variance revealed that, as Diederich (1974) had also found, essays with stronger content received higher scores than did those with weaker content. Organization also proved to be a statistically significant factor. Mechanics appeared to be influential as well in those papers with strong organization. When the perceptions of the evaluators toward the rewritten versions were examined, interestingly, the evaluators did not always agree with the rewriters as to the strength or weakness of the categories of content and organization. In fact, two readers were removed from the study because their disagreement was substantial. The readers had better agreement for the more concrete categories of mechanics and sentence structure.

Freedman (1979) acknowledges as limitations of the study the breadth of the categories used for rewriting — a breadth which made it impossible to know what exact qualities judges might be rating — and the homogeneity of the raters. Like Diederich (1974), she stresses the need for placing more emphasis in classroom teaching on the development and organization of ideas. To explore the criteria that holistic scorers use in making their evaluations, Breland and Jones (1984) compared scores obtained from a regular scoring of the English Composition Test (ECT) with analyses made nine months later by 20 college English professors on a sample of 806 essays. The samples contained equal numbers of papers written by blacks, whites, native Hispanics, and nonnative Hispanic speakers of English. In the special scoring, the evaluators first scored the papers holistically and then checked on an evaluation form the strong and weak features of each essay. During a subsequent session, the readers were also asked to write on the essays themselves. Correlational procedures used to predict the original holistic score indicated that readers were most influenced by the organization, support, and significant ideas in a paper, with organization correlating the most highly of all discourse characteristics with the original English Composition Test score. Surface features, such as essay length, neatness, and spelling, contributed significantly
to predicting the ECT score as well, with essay length correlating most strongly at .43. Syntactic and lexical characteristics influenced the scoring of the nonnative Hispanic speakers of English. On a questionnaire given prior to the scoring, the special scorers indicated that organization, thesis, support, and ideas were significant to them, characteristics that proved influential in their special scoring. Differences were also noted between experienced and inexperienced scorers, with the experienced scorers tending to score more harshly. This finding was similar to Sweedler-Brown's (1985), as discussed in the previous section of the review. Breland and Jones (1984) note that the score reliability of one writing sample rated by two readers has been found to range typically from .38 to .58. They stress the need for caution in interpreting the results of their study. They speculate that the special scorers may have been unduly influenced by the evaluation form and by the targeted groups of students, and they point out that their sample contained above-average students. The researchers observe that in their study, length greatly influenced holistic scores, possibly implying the importance of development in argumentative essays; they call attention to the importance that content and organization played in the holistic scores.
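The single-sample reliability of .38 to .58 that Breland and Jones report refers to the agreement between two readers who score the same essays. Purely as an illustration — the scores below are invented, not data from any study cited here, and the use of Python is an assumption made only for the sake of the example — an interrater correlation of that kind could be computed as follows:

# Illustrative sketch only: an interrater (Pearson) correlation between two
# readers' holistic scores on the same set of essays. All scores are invented.

def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x ** 0.5 * var_y ** 0.5)

reader_1 = [4, 3, 2, 4, 1, 3, 2, 2, 3, 4]   # hypothetical 4-point holistic scores
reader_2 = [3, 3, 2, 4, 2, 3, 1, 2, 4, 4]
print(round(pearson_r(reader_1, reader_2), 2))

A low value on such a correlation signals that the two readings of a single sample diverge, which is why the studies reviewed here rely on training, multiple readings, or multiple samples to raise reliability.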

Mechanics, Sentence Structure, Vocabulary

As can be seen from these studies, content and organization appear to be influential factors in many evaluators' writing judgments. At the same time, other elements, such as mechanics in Freedman's (1979) study or spelling and length in Breland and Jones's study (1984), play a role as well. The extent to which these concrete, nonrhetorical factors can influence writing evaluations comprises the focus of several other studies. Allen (1976) investigated the influence of mechanical and grammatical errors on teachers' content ratings by preparing four versions of a writing sample that contained different numbers of errors. Over 400 secondary English teachers scored one version apiece. Although Allen noted several issues that needed further exploring, the results did not support his hypothesis that teachers' customary concerns with mechanical errors would affect their evaluation of rhetorical elements. Rafoth and Rubin (1984) sought to determine the significance of content and mechanics on college instructors' evaluation of writing by rewriting an essay to contain stronger or weaker content and stronger or weaker mechanics. The researchers composed three new versions of a timed expository essay originally written by a college freshman, adding spelling and punctuation errors to two of the versions and deleting propositions from other versions
in order to alter the quality of content. Of the four final versions, one was high in content and free of errors; one was high in content and full of errors; another was low in content and free of errors; and the last was low in content and full of errors. Eighty composition instructors from four state universities voluntarily accepted one of the versions to grade. Some instructors were told to pay special attention to content and ignore mechanics, whereas others were told to pay attention to mechanics instead of content. Still others were simply told to read the paper according to their normal practice. All the instructors were also asked to rate the paper according to the criteria on the Diederich scale. A series of ANOVAs showed that mechanically correct versions received higher general impression scores than did those papers with errors; furthermore, the ratings according to the Diederich scale showed that the mechanically correct versions received higher scores for ideas, organization, and punctuation than did those versions with the errors inserted. According to the researchers, "The present results strongly suggest that regardless of writing content or evaluative criteria, college instructors' perceptions of composition quality are most influenced by mechanics" (p. 455). They speculate that graders may not distinguish clearly between the domains of content and mechanics in making their writing judgments.

The researchers acknowledge that the writing assignment was limited by timed conditions and by the inclusion in some versions of a substantial number (14) of errors. However, an additional limitation seems to have been the use of only one essay per grader per evaluative condition. If each grader had been given several essays to score or if the graders had been given some training in using the Diederich instrument, the results obtained by Rafoth and Rubin (1984) might appear more conclusive. In another study of the way teachers' writing evaluations are influenced, Stewart and Grobe (1979) reexamined 232 samples from an earlier national writing assessment program. They found that students increased in the three measures of syntactic maturity — words per T-unit, words per clause, and clauses per T-unit — from grade 5 to grade 11. In addition, students improved in their command of spelling and in the avoidance of run-on sentences; they did not improve to the same extent in their avoidance of unclear pronoun reference or avoidance of sentence fragments. The features which best predicted the quality ratings in grades 8 and 11 were the number of words and spelling; only in grade 5 were the syntactic maturity measures at all significant. While speculating that the teachers in grades 8 and 11 may have been influenced more by content and organization than by sentence maturity, Stewart and Grobe (1979) express dismay at the lack of concern seemingly shown for syntactic development.
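The three syntactic maturity measures Stewart and Grobe tallied are simple ratios of counted units. A minimal sketch — assuming the word, clause, and T-unit counts have already been tallied by hand, as they were in such studies; the counts and the Python rendering below are illustrative assumptions, not the researchers' data or procedure — might look like this:

# Illustrative sketch: the three syntactic maturity ratios discussed by
# Stewart and Grobe (1979), computed from hand-tallied counts.
# The counts below are hypothetical.

def syntactic_maturity(words, clauses, t_units):
    return {
        "words per T-unit": words / t_units,
        "words per clause": words / clauses,
        "clauses per T-unit": clauses / t_units,
    }

sample_counts = {"words": 312, "clauses": 41, "t_units": 26}  # hypothetical essay
print(syntactic_maturity(sample_counts["words"],
                         sample_counts["clauses"],
                         sample_counts["t_units"]))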

In a subsequent study, Grobe (1981) compared analytic ratings completed by 18 trained graders to the holistic scores assigned narratives written by 437 5th, 8th, and 11th grade students. As in the earlier study, composition length and the absence of spelling errors proved to be significant factors in predicting the holistic score. To explain the holistic variance unaccounted for by the 14 syntax and mechanics variables, Grobe (1981) subsequently added several vocabulary measures to the analytic rating system. A computer program analyzed 50 essays selected at random from each grade level. Spelling continued to be important, but essay length was less significant once vocabulary variables were introduced. Instead, the vocabulary variable which indicated the number of different words in a composition became significant, leading Grobe to conclude that vocabulary diversity is important in good narrative writing. The importance of mature vocabulary and complex syntax on writing evaluation comprised the focus of a study by Neilsen and Piche (1981), who created four versions of a 250-word descriptive passage on a winter scene. One passage contained complex nominals and mature vocabulary; a second contained complex nominals and simple vocabulary (as in the use of the word "face" instead of "confront"); a third contained simple nominals (as in the phrase "like cattle in a barren field" instead of "like cattle in a
barren, frozen field of blowing snow"); the fourth contained simple nominals and simple vocabulary. Eighty high school English teachers were given folders with one version of the passage. They assigned holistic scores to the essays and rated them according to a scale containing bipolar descriptions of qualities, such as "logical/illogical." Results of an ANOVA indicated that nominal complexity did not significantly affect either the holistic scores or the composition scales; however, vocabulary did have a significant impact on both. The authors note the following limitations of the study: The constructed passage might not resemble actual student writing, verbs comprised the only basis for the vocabulary differences, and the findings they obtained from descriptive passages might not apply to other modes. Despite these limitations, vocabulary seems clearly to have influenced the holistic scores for this descriptive essay, just as it influenced the narrative writing in Grobe's (1981) study.

Length and Surface Features

Thus, the above studies suggest that such factors as mechanics, spelling, syntactic maturity, and vocabulary may affect some judgments of writing quality. As has been seen in the previously cited studies by Breland and Jones (1984), Grobe (1981), and Roberts (1982), length has also been a contributing factor.

Length proved similarly influential in a study conducted by Nold and Freedman (1977) to explore whether certain elements could be identified as contributing to readers' evaluations of compositions. The researchers used four argumentative essays written by each of 22 Stanford freshmen. The essays were typed, and then six experienced teachers were trained to score the papers holistically. Nold and Freedman (1977) hypothesized that four main categories might prove influential: the extent to which ideas were developed, the organization of those ideas, the complexity of syntax, and the adequacy of vocabulary. Emphasizing countable, syntactic elements within the essay, Nold and Freedman analyzed the essays according to an instrument developed by Golub and supplemented by such variables as common verbs and the length of the essay. The researchers found that the holistic scores assigned were distributed below the mean, with readers noting that they had expected better writing from Stanford freshmen. Four variables, including shortness, overuse of modals and be verbs, and common vocabulary, negatively predicted quality, whereas final free modifiers positively predicted quality ratings. According to the researchers, limitations included their focus on only those measurable elements of writing quality and their use of a select group of students as a sample. Despite any potential problems with the study, length — in addition to vocabulary — appears as a
contributing factor for the writing judgments made in this study in much the same way that it appeared in the research previously discussed by Grobe (1981) and by Neilsen and Piche (1981). Length is frequently considered a surface feature and, as White (1985) notes, is criticized whenever it is used as the basis for holistic judgments. However, as Freedman and Calfee (1983) thoughtfully observe, length cannot always be identified as a superficial quality of a paper. They note:

The problem with interpreting such findings is that length may or may not be an index of a significant psycholinguistic category such as idea development. Longer essays with fuller development of ideas may deserve higher scores than shorter essays, but longer essays padded with redundant information may deserve lower scores than their shorter counterparts. Correlational studies do not reveal why longer essays receive higher scores. (p. 85)

Even though the length of a paper may not always create a problem for writing evaluation, other surface features such as handwriting and neatness are clearly troublesome. Studies completed over 15 years ago (Chase, 1968; McColly, 1970; and Marshall, 1972) have suggested that poor handwriting or messy essays may affect the grades assigned to them. For example, Chase (1968) found that 16 graduate students gave "more generous" grades to essay test items done with good handwriting than they did to items done with poor handwriting; although the scorers tended to score papers equally on the first item, the negative "halo
effect" of poor handwriting adversely influenced the scoring of the second item. Marshall (1972) introduced various numbers of spelling errors into essays composed in response to one American history question. For each essay containing a set number of spelling errors (e.g., 0, 6, 12, and 18 errors), Marshall prepared a typewritten copy and had students copy the essay over with three different degrees of neatness and legibility. The 16 resulting forms of the essays were sent to 480 classroom teachers who were asked to grade the papers according to content. Although Marshall, to his surprise, found no significant differences in mean scores for the levels of spelling problems, he did find differences in the scores assigned to typed versus handwritten essays. That is, all the handwritten versions of essays containing zero to six errors received lower scores than did the typed versions of the same essays; the results for essays containing 12 to 18 errors were less clear-cut and seemed to fall into a random pattern. Handwriting is also labeled as a problem in McColly's review (1970) of the issues involved in writing evaluation. Citing several studies in which handwriting influenced writing judgments, McColly warns that in such instances, "the validity is actually lowered, because handwriting ability and writing ability are not the same thing." He continues by suggesting that "the only cure
for this condition is to have examination essays typed or put into some other standard printed format" (p. 154). Stach (1987), too, expresses concern in his dissertation about the influence of appearance on holistic scorers' judgments. In his study, three college teachers were trained to score holistically 140 essays written by college freshmen. The teachers then described what they considered good writing to be, and they rank-ordered the importance they placed on several factors in making writing evaluations. Presumably, they considered the factor of "presentation," which signified handwriting and neatness, to be totally unimportant and mechanics to be less meaningful than many other qualities; however, a regression analysis revealed that appearance and mechanics were the only statistically significant predictors of holistic scores. According to Stach, the implication of such findings was "that scorers in holistic procedures (and perhaps teachers in general) aspire to grade essays differently than they actually do, and that they hope to be qualitatively better graders than they are, overlooking, or 'seeing beyond,' mechanics and appearance" (p. 113). Suggesting that the scorers' descriptive statements reflected not "priorities, but aspirations," Stach concludes, "Certainly there is a great gulf between what they say matters to them and what the best statistical predictors of holistic scores turned out to be" (p. 120).
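Stach's finding rests on a regression of holistic scores on rated essay factors. Purely as an illustration — the variables, ratings, and the ordinary least-squares setup below are invented for the example and are not Stach's data or his actual analysis — such a regression can be set up as follows:

# Illustrative sketch: an ordinary least-squares regression of holistic scores
# on two rated predictors (here labeled "appearance" and "mechanics").
# All numbers are hypothetical.

import numpy as np

appearance = np.array([1, 2, 2, 3, 4, 4, 5, 5])   # hypothetical 1-5 ratings
mechanics = np.array([2, 1, 3, 3, 3, 4, 4, 5])
holistic = np.array([1, 1, 2, 2, 3, 3, 4, 4])     # hypothetical holistic scores

X = np.column_stack([np.ones_like(appearance), appearance, mechanics])
coef, *_ = np.linalg.lstsq(X, holistic, rcond=None)
print("intercept and slopes:", np.round(coef, 2))

The size and significance of such coefficients, rather than the scorers' stated priorities, are what led Stach to his conclusion about the gap between what readers say they value and what predicts their scores.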

As these studies indicate, writing evaluations are affected to varying degrees by such elements as mechanics, vocabulary, syntax, spelling, length, and even handwriting or neatness. Such a link between these elements of form and the rhetorical elements of organization and content is, according to Harris (1977), almost inevitable. Referring to her own study, which will be discussed in the next section, Harris comments that there "came the conviction that form is so integral a part of content that in some ethereal way form is content and content is form" (pp. 180-181).

Other Factors Involved in Writing Judgments

In addition to elements of form and content, other — almost intangible — factors in writing evaluation have received increasing attention. One factor is the discrepancy between what readers say they value and what they actually reward; the second factor is the perspective that readers adopt toward the writers behind the essays. Harris (1977) sought to determine those features that influenced English teachers in their evaluation of student writing. Thirty-six high school teachers read 12 student essays, marking them according to their customary practice; they then ranked the essays according to merit and completed a questionnaire. They finally reevaluated the papers against five criteria. Taken together, the four procedures revealed a discrepancy between the criteria teachers rated as important
and the criteria they actually demonstrated in their comments and markings. That is, on the questionnaire the teachers indicated that content and organization were of great importance to them, while their annotations and manner of ranking the papers revealed the major role that mechanics and usage played. For these teachers, sentence structure and diction were less important. Additional findings by Harris (1977) included her discoveries that the teachers basically agreed with each other about the evaluation of writing and that many of the teachers' annotations and other comments were negative. Hake and Williams (1981) raise the question of what teachers of writing actually do value: "Is it possible that despite our public declarations about clear, direct writing, we might somehow discourage our students from writing good prose and encourage them, through our own tacit behavior, to write bad?" (p. 434). Their question arises from four experiments they conducted in which they altered the style of similar essays — changing the direct, verbal style that followed a subject/verb/object (or agent/action/goal) pattern to a nominalized, indirect style in which abstract nouns predominated. Approximately 80 teachers, from high school to the upper college classes, rated the heavily nominalized papers more highly than they did those essays which, though structurally similar, were directly verbal in style. In one experiment, the
readers, who were unaware of the purpose of the study, wrote comments indicating that the nominalized versions contained better organization and support, even though the pairs of papers were identical in those respects. In another experiment, senior college graders rated the nominalized papers higher than they did verbal versions even when they could find major errors in the nominalized version. The authors speculate that the good nominalized papers may have been associated with intellectual quality, in contrast to the perceived lower quality of the verbal versions. Thus, Hake and Williams (1981) suggest that despite what writing teachers claim to do, one cause of "stylistic infelicity" (p. 446) may be the practices of the teachers themselves. Still another source of complexity in writing evaluation is the attitude or expectations of the readers toward the writers of the essays. For example, Freedman (1984) gave four experienced holistic scorers packets of essays containing not only the writings of students from four different colleges but also a timed essay composed by a professional writer on the same topic. The scorers, who were unaware that professional writings had been included in the study, gave only slightly higher mean holistic scores to the professionals than they did to the student writers. (In fact, student writers received the three highest holistic scores.)

The professional writers received higher analytic scores than did the students in the categories of voice, sentence structure, word choice, and usage, but they received lower scores in the categories of development and organization. Because of the low scores that had been given in these categories and because of wide differences in the holistic scores assigned to these papers, Freedman (1984) sought to discover what qualities characterized the professional essays. She found four common traits: (a) a tone of familiarity, (b) an initial rejection of the task with a subsequent acceptance of it, (c) a final commitment to the topic with resulting forcefulness in the papers, and (d) scholarly references. As these traits are unlikely to appear in most students' writing, the author speculates that the scorers may have negatively reacted to what they viewed as "overstepping" of authority on the part of some students. She advocates that teachers encourage students to write with authority and freedom. Sullivan (1986) also explored whether holistic scorers' disagreement in problem papers about discourse issues reflected certain attitudes toward the writer behind the essays. Stressing that "evaluation of writing ability is best viewed as a multifunctional social interaction" (p. 11), Sullivan sought to determine whether readers created writers in addition to the meaning of the texts.

Sullivan (1986) randomly selected for analysis 99 essays that had been written by entering freshmen and holistically scored. Topics for the essays had required students to argue to a specified audience a certain position on a controversial issue. Using Prince's "Taxonomy of Assumed Familiarity," Sullivan classified the information contained in the noun phrases of the essays in terms of assumptions made about readers' familiarity with the information — that is, whether it was assumed to fall under new, inferable, or old (evoked) categories (p. 14). A regression analysis indicated that three of the subcategories of information significantly correlated with holistic scores. According to Sullivan, these categories represent deviations from Grice's Cooperative Principle, in which writers are supposed to assume that the readers have reasonable familiarity with the information. He speculates that these deviations from expected norms reflect three different identities — that of the "test-taker," the "knowledgeable student," and the "straightforwardly cooperative writer" (p. 33) — and that readers were responding either negatively or positively to these identities. He stresses the need for additional research to determine whether readers are evaluating texts on the basis of their responses to the writers' identities. Though intriguing, much of Sullivan's (1986) work appears highly speculative; for example, the basis behind
his claims that certain linguistic categories of information reflect particular social identities, such as that of the "test-taker," seems arbitrary. Moreover, as Sullivan himself acknowledges, the hypothetical audience that students were required by the topics to address and that conflicted with the real audience of holistic scorers may have compounded students' uncertainties about how much information they needed to provide. But despite the problems that Sullivan's work contains, his research illustrates the potential impact that the writers themselves may have on the readers' evaluation of their work. Barritt, Stock, and Clark (1986) found a similar attitude of unease held by readers toward writers who do not adhere to their expected role. A group of faculty members of the University of Michigan's English Composition Board met periodically over a two-year period to discuss how they holistically rated student placement essays and why they sometimes disagreed with each other. At each meeting, they read a selected essay, scored it privately, and noted their reasons for the score; then they discussed their findings together. They found that on those essays which evoked the most disagreement, the comments fell into several categories: (a) "the written text"; (b) "the imagined student writer"; and (c) the "prospective student" (p. 319). The authors note that even though they initially urged readers to pay
attention to the texts, rather than to the writers behind the work, their recommendations were in vain. They justify the readers' reactions by emphasizing the importance of the expectations the readers bring to the reading:

We had forgotten that reading is always an act of recreation and that what we have learned as students of literary theory has much to teach us about what we do as we read our students' assessment essays. In our case, as reader/evaluators asked to judge placement essays, we had to engage ourselves as active readers trying to make common sense — that is, sense in common — with student authors. We found ourselves working mentally with each student writer to compose a placement essay; as we overlaid the student's writing with our own expectations, we completed incomplete arguments, supplied missing transitions, second-guessed particular cases for general statements. Like the readers Wolfgang Iser posits, we were trying to build consistency into students' texts by investing spaces of indeterminacy in them with our own expectations about what should fill the gaps. . . . The teaching experience each of us brought to the task of evaluating student texts led us to expect in each text the writing of a 'typical' college freshman, and our expectations influenced our readings. (p. 320)

Arguing against the need always to have consistency of judgment, Barritt et al. (1986) suggest that it is more important to accept and understand the basis behind those judgments that are not in agreement. In a similar vein, Martin (1987) explored the process that occurs for readers in a placement scoring. She examined the written responses that three faculty members made to six placement essays composed by entering college students. The instructors, all of whom were experienced
scorers, ranked the papers once and wrote comments intended for the students to use in revising; at a later time, they ranked the papers again and wrote comments intended for the researcher. Martin studied the comments in the light of the readers' own backgrounds and their own experiences with reading and writing. She concludes by observing that readers, as well as writers, are individuals and that the essays do not necessarily contain features which all readers can assess; rather, in her view, placement scorers are primarily concerned with the extent to which the writing samples indicate students' readiness for college tasks.

Summary of Literature Review

Together, the three major sections of the literature review reveal complex links between holistic scoring and the writing criteria on which it is based. The studies of the first section (Diederich, French, and Carlton, 1961; Godshalk, Swineford, and Coffman, 1966; Freedman, 1981; Hrach, 1983; Sweedler-Brown, 1985; and Huot, 1988) illustrate both the development of and rationale for various holistic procedures, including the training of readers. The studies of the second section (Winters, 1978; Roberts, 1982; Shoaf, 1985; and Janopoulos, 1987) depict the strengths and weaknesses of holistic scoring in comparison to other evaluation systems. The studies of the
last section reflect researchers' attempts to explore — in a variety of contexts — the elements involved in the evaluation of writing. From this section emerges a picture of readers in some contexts primarily influenced by organization and content (Diederich, 1974; Freedman, 1979; and Breland and Jones, 1984) and of readers in other contexts chiefly concerned with such features as mechanics, spelling, vocabulary, and length (Harris, 1977; Nold and Freedman, 1977; Grobe, 1981; Rafoth and Rubin, 1984; and Stach, 1987). Still other studies convey how readers' perceptions and expectations of writers affect some evaluations (Freedman, 1984; Sullivan, 1986; and Barritt, Stock, and Clark, 1986). The involvement of so many factors in writing judgments underscores not only the need in writing assessment for such structured approaches as holistic scoring but also the need for training and monitoring to ensure some similarity in the perspectives that readers bring to their evaluations. But if the need for training is clear, the nature of that training and monitoring in holistic scoring has not yet been fully explored. Holistic scoring of assessment essays entails special circumstances both for writers and for readers: That is, just as writers in an assessment often have a limited time in which to discuss a given topic for an unfamiliar audience, so, too, do holistic readers have a short time in which to determine the meaning of a text and respond by
evaluating it. Questions thus remain as to how the training and monitoring of a structured holistic scoring help to create a community of readers who willingly accommodate their own writing criteria to the writing standards of the group as a whole. The methodology used in the present study to explore the impact of monitoring on an holistic scoring is described in Chapter 3.

CHAPTER 3
METHODOLOGY

The impact that training and monitoring have on an holistic scoring of writing was explored from three perspectives: (a) A monitored holistic scoring was conducted in which 12 readers scored over 100 student-written essays; 8 of the readers recounted their responses to these essays through the use of logs, and 4 of the readers recorded their reactions through the use of audio-taped protocols; (b) the same 12 readers also scored an equal number of essays at home in an unmonitored situation, again using logs and audiotapes; and (c) all participants in the study — including the 3 table leaders and 2 chief readers — were administered a questionnaire regarding their attitudes to writing evaluation and to the holistic scoring process.

The Monitored Holistic Scoring

A monitored holistic scoring was conducted to replicate on a small scale the structured scorings used in the writing assessment of college sophomores throughout the state of Florida. The chief reader for the state of Florida, together with an associate chief reader, conducted the scoring on Saturday, January 7, 1989. Permission was obtained from the Department of Education in Tallahassee
to select by means of a stratified random sampling over 100 essays used two years previously in an administration of the College Level Academic Skills Test (CLAST).

Subjects in the Study

Seventeen men and women who are highly experienced holistic scorers and who have taught English at different levels were the subjects. Earlier studies by Follman and Anderson (1967) and by Freedman (1979) had found the homogeneity of their scorers to be a factor in the results; in fact, Freedman cites such homogeneity as a limitation of her work. Because most CLAST scorings employ English instructors from diverse levels, it was assumed that subjects with a broad base of English teaching would more accurately reflect real-life conditions found in an holistic scoring session. The chief reader for the study, a former director of freshman composition at a large university, is the current chief reader for the state of Florida and has directed many large-scale holistic scorings. The assistant chief reader, a former chair of a high school English department and an Advanced Placement English teacher, has frequently served in the role of assistant chief reader. Eight women and seven men participated in this study either as table leaders or as readers. Five taught in three local high schools, with some instructing Advanced Placement English classes or participating in the Writing
Enhancement Program; several had studied in the Florida Writing Project. Another five were English faculty from three community colleges within a 200-mile radius. The remaining five were from three universities or four-year colleges within a 100-mile radius. The teaching experience of the 15 participants, in addition to the 2 chief readers described above, ranged from 8 to 37 years, with an average of 18 years; their holistic scoring experience averaged 7 years. All but one participant had holistically scored other types of examinations as well as the CLAST. The subjects received an honorarium for their participation in the study.

Writing Samples

The essays used in the study were written by college students nearing the end of their sophomore year as part of a state-wide mandatory test to assure minimal competencies in reading, writing, and mathematics. Students were given a choice of two topics, each of which required them to draw upon their general knowledge, to create a thesis, and to support it during the 50 minutes allotted for writing. For test security purposes, the topics cannot be revealed. However, they followed the paradigm developed by Hoetker and Brossell (1986) and used in Florida for several years; the paradigm typically is a fragment, containing a class specification and two
differentiating criteria. The paradigm is exemplified by such topic phrases as "a book/ that many students read/ that may affect them beneficially" or "a common practice/ in American colleges/ that should be changed" (p. 330), which Hoetker and Brossell describe in their research.

Procedure

As students taking CLAST have a choice of two topics, holistic scorers are accustomed to scoring sets of papers in which two different topics are intermingled; consequently, the essays used in the study were not separated out by topic for scoring. As shown in Table 1, a stratified random sampling procedure was used to select 112 essays which would approximate the distribution of scores obtained in the actual scoring of these essays; hence, the essays reflected the writing of students from various institutions in different parts of the state. On the basis of scores originally assigned to the papers, the papers were randomly divided in half for the monitored and unmonitored scorings. Thus, the papers with scores of 8 were distributed equally to the two treatments, as were the papers at each of the other score levels. To ensure students' anonymity, all identifying information was removed, and each essay was labeled with a three-digit number; the essays were then reproduced so that each reader would have a copy of all 56 papers.

TABLE 1
Descriptive Statistics for the Stratified Random Sampling
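The selection procedure that Table 1 summarizes — drawing essays so that the original score distribution is approximated and then dividing each score level evenly between the monitored and unmonitored scorings — can be illustrated with a brief sketch. The sketch is purely hypothetical: the essay identifiers, the counts, and the use of Python are illustrative assumptions, not part of the study's actual selection from the CLAST archive.

# Illustrative sketch of a stratified random split: essays grouped by their
# original score are divided evenly between a monitored and an unmonitored
# scoring. All identifiers and counts are hypothetical.

import random

def stratified_split(essays_by_score, seed=0):
    rng = random.Random(seed)
    monitored, unmonitored = [], []
    for score, essays in sorted(essays_by_score.items()):
        shuffled = essays[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        monitored.extend(shuffled[:half])
        unmonitored.extend(shuffled[half:])
    return monitored, unmonitored

# Hypothetical pool: essay IDs grouped by original combined score (2-8).
pool = {s: [f"{s}-{i:03d}" for i in range(16)] for s in range(2, 9)}
m, u = stratified_split(pool)
print(len(m), len(u))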

The papers were randomly distributed in 4 packets of approximately 14 to each reader. Three tables were established for the scoring, each consisting of four readers and an experienced table leader, all of whom were randomly assigned to their table. Two tables of four readers each followed regular holistic scoring procedures and were used for comparative statistical purposes in the study; the third table was treated as a separate entity. That is, the four readers at the third table took part in the training procedures but then adjourned to small, adjacent offices to tape-record both their reading of the essays and their reactions to these essays. Each of the eight readers participating in the regular scoring was assigned 56 papers to score, a number arbitrarily chosen for several reasons. It was manageable enough to facilitate the subsequent interpretation of data and yet affordable. It also represented a large enough sample to reveal any scoring tendencies on the readers' part and to indicate any potential influence of training samples, breaks, and monitoring procedures. However, prior to any statistical analysis, five sets of data were subsequently removed from the study: One set of matched papers was deleted because poor photocopying had made one of the two essays impossible to read; four sets were removed as deviant data when the chief readers' independent
scoring beforehand of the entire group of papers revealed that some papers had been incorrectly scored two years earlier and hence were inappropriately matched. Thus, the actual data for the study comprised 51 sets of matched papers, or 102 essays altogether. The four readers at the third table, hereafter referred to as "special readers," were given a subset of 20 papers to score by means of talking protocols. Because they scored far fewer essays than did the eight regular readers, the special readers were not included in any statistical analysis. Results of the special readers' scoring will be discussed separately from results obtained from the regular readers.

Procedures for Training

The scoring adhered to the customary procedures, with rangefinders provided initially for training purposes, followed by the presentation of several samples throughout the scoring for the group to score and tally together. Reading breaks occurred at approximately 45-minute intervals, and table leaders monitored the scoring throughout. The chief readers conducted two check readings as an additional verification that all participants were scoring the papers comparably. In these respects, then, the scoring represented a replication of the procedures typically used in assessing writing on a large scale.


The training and monitoring procedures also followed custom insofar as readers were urged to employ the full range of scores (i.e., from 1 to 4) in assigning scores to the six rangefinders. Rangefinders from the original reading were read first, with the readers asked to rank order the papers and to assign each of the four scores to at least one essay. Then the readers' scores were publicly tallied. If one or two scores clearly differed from the scores assigned by other readers, readers whose scores were discrepant were urged to look the paper over again. Once the rangefinders were tallied, table leaders, who kept running accounts of the vote at their tables, led their tables in a brief discussion of why papers received certain scores. They referred to the operational definitions if necessary. Then pairs of sample essays were introduced, with readers again asked to read and score an essay and raise their hands as each score level was announced by the chief reader. Samples were given until the group reached a consensus on most scores.

Special Measures Used for the Study

For the purpose of this study, several new measures were introduced. All eight readers at the two regular tables, Table 1 and Table 2, were reading the same papers arranged in random order in 4 packets of approximately 14 papers each. (The third table will be discussed subsequently.) Thus, for each essay, at least eight scores were obtained, four from each of two tables.
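The layout of the resulting score data may be easier to see in a small sketch. The following Python fragment is purely illustrative: the record format, field names, sample values, and helper functions are assumptions made for this example rather than part of the study's actual materials. It simply shows one way the scores produced by this design could be organized so that the per-essay and per-condition means reported in Chapter 4 can be computed.

    # Illustrative sketch only: field names and sample values are hypothetical,
    # not drawn from the study's score sheets.
    from statistics import mean

    # One record per (reader, essay) score. "pair" identifies the matched pair of
    # essays; "condition" is "monitored" or "unmonitored"; "table" is the reader's
    # table (1 or 2). The full data set would hold 8 readers x 51 pairs x 2 conditions.
    scores = [
        {"reader": "1A", "table": 1, "pair": 7, "condition": "monitored",   "score": 3},
        {"reader": "1A", "table": 1, "pair": 7, "condition": "unmonitored", "score": 4},
        {"reader": "2C", "table": 2, "pair": 7, "condition": "monitored",   "score": 2},
        # ...
    ]

    def essay_means(records):
        """Mean of the readers' scores for every individual essay."""
        by_essay = {}
        for r in records:
            by_essay.setdefault((r["pair"], r["condition"]), []).append(r["score"])
        return {essay: mean(values) for essay, values in by_essay.items()}

    def condition_mean(records, condition):
        """Grand mean of the per-essay means within one scoring condition."""
        return mean(v for (_, cond), v in essay_means(records).items() if cond == condition)

Laid out this way, the four random factors of the analysis of variance described later in this chapter (pairs, tables, essays within pairs, and readers within tables) correspond directly to the fields of each record.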


As the readers scored each paper, they were asked to jot down several brief comments about their overriding impression of the paper and its key strengths and weaknesses. The commentary, therefore, provided a running log that was used to explore the basis on which the readers made particular scoring judgments. This "process log" resembled the log developed for writers by Faigley, Cherry, Jolliffe, and Skinner (1985). In addition, readers noted such procedures as the time they began each reading session after a break, their scores on samples, and any adjustments they made after talking to their table leaders or after consulting the rangefinders. During the previous month, readers had been given instructions in how to use the logs before they began their unmonitored scoring. A copy of the log is provided in Appendix A.

In actual scorings, chief readers customarily keep logs as part of their procedures, noting such details as the time of each reading, the samples used for training, and the start of check readings. For the purpose of this study, chief readers were asked to maintain their customary log, but it was labeled an "Account of Procedures" in order to avoid confusion with the log, or running commentary, employed by the readers. The actual account is included in Appendix B.

In including the written observations of readers, this study partially followed the procedures used by Diederich et al. (1961), who asked their readers to note comments as


they sorted papers into piles and assigned rankings. However, as noted in the literature review, the readers in Diederich's study came from diverse backgrounds, received no training, and worked in an unstructured situation. Logs have also been used in one study undertaken by Murphy, Carroll, Kinzer, and Robyns (1982) with the Bay Area Writing Project (pp. 397-410). The use of written comments was selected for this study as opposed to the annotations used by Breland and Jones (1984) in their study of writing perceptions. Despite the difficulties entailed in categorizing written observations, such comments are far less apt to disrupt the momentum of the holistic scoring than the more analytic checklist that Breland and other researchers have employed. Moreover, unlike analytic checklists, which provide readers with lists of certain criteria, blank log sheets are not apt to influence readers' responses. Indeed, written comments are currently used during the sample selection part of actual holistic scoring procedures, when the chief readers assemble to select the papers to be used as training samples during a scoring.

Table leaders were also asked to keep logs and to note such monitoring procedures as whose papers needed rereading, what discussions about writing ensued, and whether many scores needed to be altered. In addition, they described their readers' performance during, and reaction to, the use of training samples.
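Because these written logs supply the raw material for several of the analyses reported in the next chapter, a brief sketch may help make their structure concrete. The Python fragment below is hypothetical: the entry format, category names, and field names are illustrative assumptions, not the study's actual log sheets or the Database 3 schema described under question 4. It shows how signed ("+" / "-") comments of the kind just described could later be tallied by category and by score level; the two sample comments paraphrase remarks about papers 058 and 074 that are quoted in Chapter 4.

    from collections import Counter

    # Hypothetical log entries; entry format and category labels are illustrative.
    log_entries = [
        {"reader": "2B", "paper": "058", "score": 3,
         "comments": [("rhetoric",  "+", "good beginning, related well to topic"),
                      ("mechanics", "-", "some awkwardness")]},
        {"reader": "2D", "paper": "074", "score": 1,
         "comments": [("grammar",     "-", "errors in grammar, punctuation, usage"),
                      ("development", "-", "development inadequate")]},
    ]

    def tally(entries):
        """Count positive and negative comments for each (score, category) cell."""
        counts = Counter()
        for entry in entries:
            for category, sign, _paraphrase in entry["comments"]:
                counts[(entry["score"], category, sign)] += 1
        return counts

    for (score, category, sign), n in sorted(tally(log_entries).items()):
        print(f"score {score}: {category} {sign} x{n}")

A tally of this kind corresponds to the counts of positive and negative responses by category and score level that are discussed in Chapter 4.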


As noted previously, the third or special table, consisting of a table leader and four experienced readers, also participated with the other tables in the use of rangefinders and sample essays. However, at the conclusion of the training, the readers adjourned to separate small offices to record on audiotapes their ongoing reactions to a subset of the papers used in the monitored scoring. Such protocol analysis has been used in composition research for a number of years and was recently employed by Huot (1988) in his study of holistic scorers. The subset of 20 papers, like the larger one of 50+ essays, was deliberately selected to contain a range of score levels and was assigned in random order to each of the four special readers. The number 20 was arbitrarily chosen to allow for the extra time readers might need to read each paper aloud and record their impressions and observations. The table leader for the special group moved among the four offices to monitor the scoring and to discuss any discrepancies. These talking protocols provided some in-depth insights into how the monitored scoring appeared to influence the scores that readers assigned.

The Questionnaire

A questionnaire devised for this study was given to all the participants immediately following the completion of the monitored holistic scoring, in order that the respondents' written logs or protocols during the scoring


not be influenced by the nature of the questions. The questionnaire, a copy of which is included in Appendix A, contained three main categories of questions: (a) readers' ratings of the importance of certain features in writing, (b) their self-report of their own biases in readings and their methods for dealing with these biases, and (c) their reactions to the structured setting of an holistic scoring. Most items required closed responses, although several allowed for open-ended responses. An additional section enabled table leaders and chief readers to address questions dealing with their roles as monitors.

The questionnaire, designed in accordance with the principles set forth by Berdie and Anderson (1974), was pilot tested two months previously by holistic scorers of the CLAST at another scoring site in the state. Over 60 percent of the readers and table leaders at the second site voluntarily completed the questionnaire and responded to specific questions regarding the substance, format, and clarity of the instrument. (See Appendix A for a copy of the pilot questions.) A stamped, self-addressed envelope was provided for the return of the pilot questionnaires. The respondents made specific suggestions for wording changes, and they asked for additional items to be included in parts I (features of writing) and II (biases). In addition, several requested that the absolute categories of "never" and "always" be provided as options in parts III


and V. Many of the respondents indicated that answering the questionnaire had been an interesting, challenging, or educational experience for them.

The Unmonitored Holistic Scoring

During the month prior to the monitored holistic scoring, each of the eight regular readers was asked to score holistically at home four packets — over 50 papers — of matched essays written by different students on the same topics. Readers were asked to jot down their impressions of the papers in the running log, just as they were subsequently asked to do during the monitored scoring session. Readers were sent instructions on how to use the log, as well as a copy of the operational definitions currently used in the CLAST administration (see Appendix A). These definitions, which describe the characteristics typical of a certain level of essay, were the only training materials provided to the readers in the unmonitored setting. The papers were scored over a four-week period. These papers, with their scores and comments, reflected how experienced holistic scorers scored without being monitored and without being part of a group situation.

During the unmonitored scoring, the table leaders and the chief readers were assigned different tasks from the readers. For example, the chief readers met to review the


entire group of papers to be used in the study; they discussed each score until they agreed upon an appropriate rating for each essay. This practice, while certainly not typical of an actual holistic scoring, was included to ascertain the chief readers' scores for the entire set, thereby helping to answer the question posed as to the role the chief readers play in influencing the scores that are given. In addition, the table leaders read all the sample papers and rangefinders to be used in the subsequent monitored scoring, rating each paper and writing their responses to each essay. This procedure represented a departure from typical procedures. That is, under normal circumstances, the table leaders meet with the chief readers prior to a scoring to read and score the sample papers the chief readers have selected; then they discuss the results together.

Methods Used for Analyzing Data

The questions posed in Chapter 1 of the study are again listed below, together with the methods used for analyzing the data; special attention has been paid to how well the monitoring of an holistic scoring reflects the "true community of assent" as noted by White (1985).

1. Do the mean scores for the essays differ when the papers are evaluated by readers working in a monitored setting from when they are judged by the readers working independently?


To answer question 1, an analysis of variance, equal-cell-size mixed model, was computed with a Biomedical computer program. The design was randomized, with the eight regular readers comprising the repeated measure. (The four special readers who completed the protocols were not included in any statistical analysis, as they had scored far fewer essays than had the eight regular readers.) The model — P, T, E(P), R(T) — included the following four random factors: P, signifying the number of essay pairs (51); T, representing the number of tables of readers (2); E(P), signifying the monitored versus unmonitored essays nested within each pair (2); and R(T), representing the number of readers nested within each table (4). Three quasi-F ratios were calculated according to the formula of B. J. Winer (1971) in Statistical Principles in Experimental Design for pairs (Fp), tables (Ft), and pairs within tables (Fp(t)).

2. Do experienced readers participating in a monitored scoring achieve greater agreement with each other than when they evaluate essays independently?

To answer question 2, Cronbach's alpha was used to indicate the degree of interrater reliability under the two different scoring conditions. The interrater reliability rate was instrumental in showing both the extent to which monitoring helped readers score alike and the extent to which readers may have internalized the standards. (The general textbook forms of the quasi-F ratio and of Cronbach's alpha are sketched at the end of this chapter.)

3. What impact do the chief readers have on an holistic scoring? How do they ensure both a reliable and a collegial reading?


For question 3, the comments which the chief readers made during the training sessions were examined, and the check reading results were reviewed. In addition, because the chief readers had scored all the essays beforehand as part of their task in the unmonitored setting — a task not traditionally associated with their role — a second Cronbach's alpha was used to determine how well their scores correlated with those of the readers. The chief readers' comments, together with these scoring results, helped to indicate to what extent the chief readers were able to guide readers into assenting to, or "owning," as White (1985) indicates, the standards of the group.

4. What criteria do readers use in assigning different score levels? What standards are reflected in the score levels assigned across the essays? How do readers respond to these standards?

For the fourth category of questions, information from the eight regular readers' logs was transferred to a Database 3 program; the readers' written comments were grouped in categories similar to those on Part I of the questionnaire (e.g., rhetoric, mechanics, grammar and usage). The database program (see Appendix A for a sample entry) not only indicated through pluses and minuses whether the readers' comments were positive or negative but also allowed for paraphrases of each comment to be included. The database program was used to tally the positive and negative responses the readers made in their


logs at each score level; the program was also used to determine the exact nature of responses — e.g., whether rhetorical, mechanical, or grammatical — which readers gave to papers at varying score levels. The audiotaped protocols provided further corroboration of these criteria. An English teacher with extensive training and experience in teaching writing served as an outside expert to validate independently the accuracy of the database logs. She randomly reviewed 20% of the logs from each of the two scoring conditions and compared the readers' comments against each database entry. Whenever she found any errors, the database entries were adjusted accordingly before any analysis was done.

5. Do any common patterns appear in the scorers' written responses to the essays, or do their comments underscore the individuality of each reader's transaction with the text? Do readers' holistic judgments, as shown by their written or oral responses, correspond to the writing features they rate as important on the questionnaire?

For the fifth category of questions, the readers' comments — both written and oral — were studied for any common patterns that might emerge in either the monitored or unmonitored condition. The comments were examined to see whether readers giving an identical score to the same essay cited similar or different reasons for doing so — such as organization, fluency of sentence style, or creativity. It was hoped that identifying patterns of this nature would


help to answer whether a sense of community develops to influence readers' perceptions.

6. What is the nature of the monitoring that the readers receive during a scoring, as reported through the logs of table leaders and readers? Do the procedures noted in these logs, together with the protocols of the special readers, support the readers' perceptions of their own holistic scoring processes as noted on the questionnaire?

For question 6, the logs of the three table leaders, together with the "procedure section" of the readers' logs (see Appendix A for a sample of the log) and the audiotapes, were examined for clues to the nature of monitoring. It was hoped that the logs would reveal how directive the table leaders were and what type of relationship existed between the table leaders and the readers. Of special concern were how the readers responded to different criteria and whether the data supported the readers' perceptions of the holistic scoring process as reflected through their responses to the questionnaire.

Thus, both qualitative and quantitative data were used to determine whether monitoring in an holistic scoring reflects a congenial effort among scorers to arrive at common agreement throughout a scoring and whether this sense of community affects the judgments that scorers make. An in-depth discussion of the results obtained in the study is presented in Chapter 4.
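For reference, the two statistics on which these analyses chiefly rely take the following standard textbook forms. This is a sketch only: the symbols are generic, and the particular mean squares entering each quasi-F ratio depend on the expected-mean-square table for the P, T, E(P), R(T) design, which is not reproduced here. Cronbach's alpha for k readers, with the readers treated as "items," is

    \alpha = \frac{k}{k-1} \left( 1 - \frac{\sum_{i=1}^{k} \sigma^{2}_{Y_i}}{\sigma^{2}_{X}} \right),

where \sigma^{2}_{Y_i} is the variance of reader i's scores across essays and \sigma^{2}_{X} is the variance of the essays' total (summed) scores. A quasi-F ratio, in the general form given by Winer (1971), combines mean squares so that the numerator and denominator have the same expectation under the null hypothesis,

    F' = \frac{MS_a + MS_d}{MS_b + MS_c},

with degrees of freedom approximated by Satterthwaite's procedure, for example

    df \approx \frac{(MS_a + MS_d)^2}{MS_a^2 / df_a + MS_d^2 / df_d}

for the numerator.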


CHAPTER 4
RESULTS AND DISCUSSION

The results of the study are presented according to the questions raised in the first chapter. The first three sections deal primarily with the quantitative results and the last three, with the qualitative findings.

Mean Scores in the Two Scoring Conditions

Question 1: Do the mean scores for the essays differ when the papers are evaluated by readers working in a structured setting from when they are judged by the readers working independently?

When the mixed-model analysis of variance for nested factors and repeated measures was computed, three statistically significant main effects and no interactions were found. Not surprisingly, as shown in Table 2, statistically significant differences were found (p < .05) among the pairs of essays. That is, each pair of matched essays differed from the next pair of matched essays. Also not surprisingly, readers nested within tables differed to a statistically significant extent (p < .001). As will be seen in the discussion for question 5, the qualitative data highlighted the individuality of the readers, thereby confirming these differences among readers.


[TABLE 2. Analysis of Variance, Mixed Model (columns: Source, Error Term, Sum of Squares, Degrees of Freedom, Mean Square, F); table values not preserved in this transcription.]


What seemed especially meaningful for the purposes of this study was the statistically significant difference (p < .00005) found for essays nested within pairs, E(P); these essays represented the two conditions of monitoring and non-monitoring. The overall mean for the 51 monitored essays was 2.279, with a standard deviation of .559, whereas the overall mean for the matched set of 51 unmonitored essays was 2.401, with a standard deviation of .690. Thus, not only was the mean score for the unmonitored essays significantly higher than the mean for the monitored essays, but more of a spread existed among the mean scores on each essay in the unmonitored condition. The overall higher mean for the unmonitored essays seemed due to the substantial number of upper-half scores (3's and 4's) awarded to papers in the unmonitored condition: Whereas only four essays out of the monitored set of 51 had a mean of 3 or better, 14 essays out of the unmonitored set of 51 had a mean of 3 or better. Figures 1 and 2 depict the breakdown of scores by reader and condition. As can be seen, readers across the board gave fewer scores of 4 in the monitored condition than in the unmonitored; for several readers, the difference was dramatic. Admittedly, the zero scores of 4 for some readers in Figure 2 are misleading in that virtually all readers gave at least one score of 4 during the monitored


[Figures 1 and 2. Breakdown of scores by reader and condition; figures not preserved in this transcription.]


scoring; some of those were included in the data deleted as deviant when the independent scoring of the chief readers showed four sets to be clearly mismatched. However, even if these sets had been included, the monitored papers would still have had only half as many 3's and 4's as the unmonitored set. This finding does not mean that the monitored scorers were awarding only lower-half scores, for several readers gave 3+ scores in their logs to show they perceived some essays to be especially strong. However, such pluses and minuses could not be included in the data analysis because the actual scoring of an essay allows for only a numerical score. Thus, the fact remains that during the monitored scoring, the spread of scores was tightened and the mean lowered.

Why this tendency should have occurred is intriguing. One possible explanation lies with the studies of Breland and Jones (1984) and of Sweedler-Brown (1985), who found that experienced scorers tended to score more strictly than did less experienced scorers. Perhaps this study reflected a similar trend, with the training procedures and the monitoring by table leaders lowering some individual readers' scores as readers strictly adhered to the criteria under the monitored condition. In fact, readers indicated on their questionnaires (see the open-ended question after item 24 in Appendix A) that they tended to grade timed writings more leniently than they did papers written outside class. Without


rangefinders or sample papers to measure the actual essays against during the unmonitored scoring, some readers, such as Readers 1B and 1C, may have awarded papers higher scores than they did the matched essays during the monitored scoring, when group standards became a constant focus of attention. One reader even wrote in her unmonitored logs of several instances in which she would have consulted a table leader if she could have, and another reader expressed regret at not having rangefinders to examine. Still others noted during their unmonitored scorings that they consulted their operational definitions. Thus, during the unmonitored scoring, some readers felt the need for standards against which to anchor their evaluations.

Part of the explanation for the lower mean score of the monitored scoring may lie with the nature of the monitoring itself. That is, in the course of either scoring training samples together or individually discussing specific papers with table leaders, readers may have become more attuned to problems than when they were reading the essays impressionistically on their own. Indeed, most readers commented on their questionnaires (item 50) that they tended to view problematic papers both holistically and analytically. In this sense, even the actual scoring process for this study may have contributed to a more analytic scoring than usual in that readers were asked to note in their logs the elements to which they were responding.


The logs of the table leaders also suggest that the training process may have contributed to the significant difference in mean scores for the matched sets. The logs showed that on those four occasions in which two readers changed their scores after talking to their table leaders, the readers' scores were lowered, rather than raised. Similarly, even though one check-reading paper was returned to a reader because it had been scored too low, three others were returned to two readers and to one table leader because they had been scored too high. It is conceivable that those readers who lowered their scores after they reviewed the essays under debate may have had their subsequent scores influenced — at least for a short period afterward — by this experience; for example, many readers indicated on their questionnaires (item 42) that the return of a paper "sometimes" affected their subsequent scoring processes. Admittedly, only a few readers were involved with returns; hence, such an explanation has limited application. Nevertheless, both the qualitative data and the quantitative data — which, as indicated by Figure 2, show three readers' scores moving downward and five readers' scores clustering in the middle — illustrate a stricter adherence to criteria in the monitored condition than in the unmonitored.


Interrater Reliability

Question 2: Do experienced readers participating in a structured scoring achieve greater agreement with each other than when they evaluate essays independently?

Cronbach's alpha was used to determine the extent of agreement among the eight readers in both scoring conditions. In the unmonitored condition the alpha was .936 for the 51 essays; in the monitored condition the alpha was .915 for the matched set of 51 essays. Thus, in both conditions the interrater reliability was high, and the readers scoring the essays independently appeared to achieve equally great, if not slightly greater, agreement with each other than when they scored essays as a group. As can be seen from Table 3, no one reader appeared to affect this high interrater reliability coefficient substantially: That is, if individual readers had been removed from the analysis, the lowest alpha in the unmonitored scoring would still have been .92; similarly, the lowest alpha in the monitored scoring would still have been .894 if individual readers had been removed. In the unmonitored scoring, Readers 1C and 2C had the lowest correlations with the other readers, .74 and .73 respectively, whereas in the monitored scoring, Readers 1A and 1C had the lowest correlations with the other readers, .67 and .56 respectively. These correlations substantially


[TABLE 3. Cronbach's Alpha of Readers' Scores, by reader; table values not preserved in this transcription.]


exceed the .31 correlation among all the untrained readers in the study of Diederich et al. (1961); they exceed the .41 correlation among the English teachers in that same study. That the readers of this study seemed to agree so strongly among themselves in the unmonitored scoring condition suggests that they had, from their years of scoring together, undoubtedly internalized the standards. Still another contributing factor may be the provision of operational definitions for the readers' use during the unmonitored scoring; in this respect, the independent scoring condition differed substantially from the at-home scoring in the study by Diederich et al. (1961), in which readers were given few directions and no criteria on which to base their judgments. Thus, even though the readers of this study had no table leaders to whom to turn for guidance, and even though they had no rangefinders or sample papers written on the applicable topics, the readers could consult the definitions for each score level; in fact, the logs and tapes indicated that several readers did indeed do so.

At the same time, these results must be interpreted with caution. Because Cronbach's alpha was used with a substantial number of readers — namely, eight — the reliability rate is undoubtedly higher than might have occurred if the scores of only two readers had been correlated, as


happens in a typical scoring. Furthermore, because the alpha was simultaneously comparing all the readers' scores for 51 essays, it masked the possibility of discrepant scores occurring on individual essays. For example, as shown by Figure 3, the potential for split scores was nearly twice as high in the unmonitored scoring as in the monitored scoring: That is, if two readers in the unmonitored scoring had been paired against each other on a given essay — as happens in an actual scoring — then on 33.3% of the 51 essays in the unmonitored scoring, discrepant or noncontiguous scores might have arisen. In contrast, if two readers had been paired against each other in the monitored scoring, on 15.6% of the papers, discrepant or noncontiguous scores conceivably could have occurred. To put the findings another way, 8 of the 51 papers in the monitored condition received 3 of the 4 possible scores; the remaining 43 papers received scores that were either identical or contiguous. On the other hand, in the unmonitored condition 17 of the 51 papers received 3 of the 4 possible scores; the remaining 34 papers received scores that were either identical or contiguous. Thus the monitored scoring clearly reduced the potential for having split scores arise.

Impact of Chief Readers on Scoring

Question 3: What impact do the chief readers have on an holistic scoring? How do they ensure both a reliable and a collegial reading?


[Figure 3. Potential for split scores in the monitored and unmonitored scorings; figure not preserved in this transcription.]


To determine the effect of the chief readers' scores on the reliability of the reading, a second Cronbach's alpha was run. The chief readers in this study had, prior to the monitored scoring, independently rated all the essays; 81% of the time their scores were identical with each other, and the remaining 19% of the time their scores were contiguous, as when one chief reader decided a paper was a weak 3 and the other chief reader described it as an upper 2. When the chief readers' scores were included in the monitored condition, the alpha was .9358. When the chief readers' scores were included in the unmonitored condition, the alpha was .9474. As might be expected, therefore, the inclusion of the chief readers' scores did not substantially affect the first Cronbach's alpha. However, the second alpha did reveal the extent to which the eight individual readers' scores corresponded to the ratings of the chief readers, whose scores are sometimes viewed as the "true" scores. This interpretation of chief readers' scores as "true" does not mean that chief readers are infallible in their scoring; however, their experience with and commitment to the standards, their involvement in all phases of an holistic scoring from sample selection to the refereeing of discrepant essays, and their responsibility for ensuring that each scoring runs effectively lend particular credence to most chief readers' ratings.


As can be seen from Table 4, the correlations with the average of the chief readers' scores appear lower in the monitored condition than in the unmonitored. However, the overall correlation may mask what occurred on individual essays. For example, the number of essays on which readers gave identical scores to those of the chief readers was higher for each reader in the monitored condition than in the unmonitored condition. Moreover, the number of readers who disagreed with the chief readers' scores, as well as the number of actual essays on which readers' scores differed the most from those of the chief readers (by more than one point), was smaller in the monitored condition than in the unmonitored condition, when readers were, virtually, on their own. These results suggest that in the monitored setting, the readers were more apt to score the same way as the chief readers than they were likely to do when scoring at home. This finding is not surprising in that during the monitored scoring, the chief readers were able to make their judgments known through the samples, the check readings, and their frequent interactions with the table leaders.

Qualitative Findings About the Chief Readers' Influence

The qualitative data, together with the researcher's observation of the monitored scoring, illustrated some of the interaction between the chief reader, the associate


[TABLE 4. Cronbach's Alpha of Readers' and Chief Readers' Scores (columns: Reader; Correlation with Average of Chief Readers' Scores; Number of Essays with Scores Identical to Chief Readers' Scores; Number of Essays with Scores 1/2 to 1 Point Lower; Number of Essays with Scores 1/2 to 1 Point Higher; Number of Essays with Scores Differing by More Than 1 Point; rows grouped by Monitored and Unmonitored Condition); table values not preserved in this transcription.]


chief reader, and the table leaders and readers in the study. Although the researcher's viewpoint is not entirely objective in that she has been involved with holistic scorings for a number of years, she simply recorded as objectively and as comprehensively as possible what took place externally during the monitored holistic scoring.

The scoring session began with the six rangefinders, which comprise papers selected as representative of the four scoring levels. All 12 readers were involved with the training; the special readers of Table 3, who were taping essay subsets, then adjourned to private offices to record their reactions to essays, returning after breaks to participate again in each training session. Each reader was asked to rank order these six essays in terms of quality, assigning one score from each score level (1 through 4) to at least one essay. The scores were then publicly tallied, and a brief discussion ensued in which the table leaders talked with readers about the essays. The central role such papers play in a reading was inadvertently conveyed in the taped comment of Reader 3A, who, during a period of hesitancy in her unmonitored scoring of essays at home, stated:

I realize at this time that I miss the rangefinders. Starting out just kind of cold with these first five papers, I seem to have a tendency to develop a range among these papers, and, uh, of course that really can't be done. ... I can see now what the purpose of the rangefinders is — to get an idea in my mind as to what I'm looking for in the different scores.


Thus, while operational definitions are important in describing typical characteristics of essays at each score level, the rangefinders stand as actual essays which exemplify the scoring criteria for current topics. As shown in Figure 4, in which circled numbers represent the number of readers assigning the accurate score, agreement among the 12 readers on the rangefinders was high. In addition to presenting the rangefinders, the chief reader introduced 11 pairs of samples to readers at set intervals. Again, as the partial set in Figure 4 indicates, agreement was consistently high.

The chief reader made general comments about the samples and rangefinders. For example, he noted that paper FF was not a "great paper," although he called attention to one positive feature about its structure. He agreed with the readers that one sample was certainly a 1, and he observed that it made a good training paper in that more blasé readers would choose not to struggle with it. He called attention to the deteriorating quality in sample N by noting that the first page was upper-half, the second page a 2, and the third page almost a 1; he agreed that U was indeed a 3/2 paper, as the presence of only one paragraph made it troublesome to some readers. Never singling out readers who were off target, he suggested that a score which was too high — as occurred in one instance with rangefinder LL — was "charitable" or a score which was


[Figure 4. Results of rangefinders and samples in the monitored scoring. Rangefinders D, I, M, T, W, and LL and samples FF, JJ, CC, E, N, U, BB, and V were each scored by all 12 readers; the score tallies are not preserved in this transcription.]


too low was rather harsh. Through these means he fulfilled what White (1985) calls the "heavy responsibility" of leaders "to ensure a reliable essay reading while at the same time respecting the professionalism, good will, and individuality of the readers who are grading the papers" (p. 31). Except for the rangefinders, for which the chief reader allowed a few minutes of conference time between the readers and table leaders, group discussions of the samples rarely occurred. The purpose of the training samples was clearly to have readers ascertain where they stood in relation to the other readers in the scoring and tallying of the same papers.

During the course of the scoring, the chief readers conducted two check readings. Each check reading consisted of a random set of eight essays which were independently scored by a reader, a table leader, and a chief reader, none of whom knew the others' scores. To have identical scores among all three was the goal; however, contiguous scores were considered acceptable if pluses and minuses on the record sheet suggested that the readers' scores approximated the chief reader's score. Thus, a score of 3 by a reader would be acceptable if the chief reader's score was a high 2; conversely, if the chief reader or associate chief reader had perceived an essay as a good 3, and the check reading showed a reader giving it a 2,


then the paper would be returned for a suggested rereading. During the monitored scoring only one noncontiguous score arose; however, the chief readers returned four papers to the tables, asking either a reader or a table leader simply to review the essay and reconsider the score. Thus, through the reading of common samples and through check readings, the chief readers ensured that everyone would remain aware of group standards.

Writing Criteria Across Different Scoring Levels

Question 4: What criteria do readers use in assigning different score levels? What standards are reflected in the score levels assigned across the essays? How do readers respond to these standards?

In answer to question 4, the questionnaires, logs, and tapes were examined for insights into readers' attitudes toward holistic scoring in general and toward the standards used in the CLAST program in particular. As Table 5 indicates, 14 of the 17 study participants acknowledged on the questionnaire that they "always" or "almost always" endorsed the evaluation of written products. Thirteen agreed that they "always" or "almost always" endorsed the concept of scoring papers as a whole. Eleven stated that they used timed writings at least occasionally in their classrooms; nine admitted that at least sometimes they used holistic scoring to evaluate classroom papers. Conceptually, then, readers of this study supported the value of holistic scoring.


Thirteen participants also strongly agreed that they felt comfortable with the standards adopted for CLAST, although two others stated that they believed the standards were too low. No matter what their attitude toward the standards, all the participants with one exception said they rarely had difficulty adhering to group standards. Readers concurred far less readily about whether they had expectations of what a CLAST paper should look like: Whereas eight agreed that they almost always had such expectations, six noted that they occasionally did, and three others wrote that they seldom, if ever, did. In fact, Table Leader 1 wrote in response to this question that she "assess[ed] the writing on the basis of the work present." The variety of readers' responses did not support Roberts' (1983) assertion that readers envisioned an idealized text to which they compared student essays.

While the questionnaire responses indicated that readers basically supported both the concept of holistic scoring and the actual standards used, the logs and tapes showed that applying group standards to actual papers was not easy. Some essays presented special difficulties for readers. Not only did the tapes reveal several readers' struggles to resolve whether such papers should be scored up or down, but the logs also reflected a similar process of adjustment through some readers' use of pluses, minuses, and arrows. In an article on criteria for determining


writing proficiency, Shaughnessy (1980) called attention to "the almost infinite number of possible combinations of strengths and weaknesses" (p. 118) which readers must balance in an attempt to decide whether a paper is incompetent. This study showed a similar balancing process occurring at all score levels. For example, Reader 3B's audiotaped comment about paper 024 revealed a typical struggle: "I don't know. I wish there weren't so many errors and yet it has so much imagery. It is well stated, and it is informative and thought provoking. I think I'll go ahead and give it a 4." Reader 3C experienced a similar difficulty with essay 052: "It's just a tough choice between a 3 and a 2 — the 2 because of the grammar problems, and a 3 because this person uses ... is very specific with a lot of detail, has a nice flair for writing, a nice style. The paper is just appealing. It's a real toss-up." The same term "toss-up" appeared in Reader 3A's tapes, as he, too, remarked about the "tough line" involved in distinguishing between 3/2 papers. As will be discussed under question 5, Reader 3A, like Reader 3D, perceived the debate in terms of rewarding a paper for its strengths or punishing an essay for its errors. For some essays, they both speculated as to what score the second reader might be likely to give. The tapes revealed that Readers 3B and 3C mentally rank-ordered such troublesome essays, comparing them to


previous papers or to the operational definitions. Once during the unmonitored scoring, Reader 3C went back to a previous paper and raised its score; confessing that she knew she was not supposed to alter her original evaluation, she stated, nevertheless, "The contrast in these papers is so great, and I feel so strongly about this being a 2, that I just can't possibly see giving the other one a 2 when it was so exact in detail." Clearly, then, those papers which did not fit the definitions of specific score levels or which contained discrepancies between form and content gave even these highly experienced scorers difficulty.

Standards at Score Levels

Although some essays presented special difficulties in scoring, several patterns were clearly discernible in papers at different score levels. As might be expected from their scarcity, essays given scores of 4 were viewed as strong papers. Favorable comments in the logs or on tapes centered on the quality of ideas, the solid development of 4-level essays, the good organization, and the coherence that typified the best papers. In their emphasis on such qualities, the readers of this study resembled those in Diederich, French, and Carlton's study (1961), in Freedman's (1979), and in Breland and Jones's (1984). To a lesser extent, readers noted the mature diction or the sentence variety that often existed in 4-level essays. Only a few negative comments were made


about 4 papers, especially with regard to "mechanical problems," the umbrella term used to refer to a variety of grammatical and mechanical errors; however, such problems were generally deemed minor. That 4's were rare was suggested by Reader 3B's references to the "stellar" qualities she expected in top papers; Table Leader 2 conveyed a similar expectation in her observation that her classroom standards were higher than those for CLAST but that "A CLAST 4 will be an A in my class any day!" (Notwithstanding this table leader's comment, it is important to note that the four points of the holistic scoring scale used in this study are not equivalent to the letter grades of A through D; in fact, scales of six or eight points are often used in holistic scorings to allow for finer distinctions.) Reader 3C's reasoning about paper 033, to which she assigned a 4, revealed the high standards expected for such essays: "The vocabulary is very good, sentence structure is complex, and the paper seems to have a lot of depth and carries the thought all the way through." Reader 3D gave the same paper a 4 because the detail was "sensible and alive," fulfilling his expectations as a reader; to him, the overall paper was "fluent, articulate, and organized." Like Reader 3C, Reader 3A commented that he expected to see in a 4 paper "something that shows me that this person's mind is in the top quartile. For me it's distinctive


phrasing, it's inventive details, someone showing superior knowledge." Emphasizing that such papers do exist, Reader 3A argued that readers should not give solid 3 papers a score of 4 simply because they have not seen an exceptionally strong essay in a while; rather, he observed, table leaders must keep readers aware of that distinction. These readers' perceptions of 4 papers as truly outstanding or distinctive in some way help to explain why relatively few essays were assigned that rating.

As Figure 5 indicates, readers did, however, readily assign scores of 3 to papers they deemed upper-half. Comments recorded in the logs about the 3-level papers were, as with the 4-level papers, largely positive about content, development, organization, style, and approach. However, unlike the top essays, readers often noted some problems with the 3 papers. The problems were varied, ranging from some rhetorical issues of focus, organization, or style to, more commonly, the mechanical elements of sentence structure, usage errors, sentence errors, punctuation, and spelling. Both the variety and the number of problems noted in the logs clearly differentiated the 3-level papers from the 4-level essays. Reader 3B's taped comment about paper 072 demonstrated the evaluation characteristic of 3 papers: "It's not a bad paper. It's well developed. There's a rational argument. I would give this paper a 3. It lacks polish sufficiently


[Figure 5; figure not preserved in this transcription.]


to prevent any higher grade, but it is more impressive than many other papers." In addition, the logs of Reader 2B expressed concerns common to other readers of 3-level papers. For example, he noted about paper 058, "Good beginning, text related well to topic, some awkwardness"; likewise, he wrote in reference to paper 093, "Good development of thesis, good supporting details — minor transitional problems and errors." Still another example of the range of comments reflective of 3-level responses appears in Reader 2D's notations about paper 041: "Fairly solid writing; content is good but not great, conclusion adequate, some errors in language." As can be seen, then, comments about 3-level papers typically acknowledged strengths in rhetorical areas and, at the same time, weaknesses in language skills.

Not surprisingly, papers given scores of 2 reflected many more weaknesses than did upper-half papers. As Figure 5 shows, readers made some positive comments regarding the rhetorical elements of content, focus, development, and organization. However, their comments about these elements were much more likely to be negative ones, and an even greater number of negative remarks focused on problems with sentence structure, mechanical problems, and usage errors in particular. Spelling errors were cited — both on the tapes and in the logs — but unlike some studies, such as Grobe's (1981), in which spelling, together with length,


was one of the most commonly noted elements, the elements of development, sentence structure, mechanical problems, and usage received the largest percentage of responses. The comments of Reader 1B about paper 053 were typical of the responses made for 2-level papers: "Assertions repeated rather than developed and supported. Fundamental errors in spelling and grammar." Reader 2B's response to paper 057 was also representative: "Sentences illogical, thesis barely relates to topic, lack of detail." Readers often responded negatively to the quality of thought, to a shallowness of content that sometimes characterized 2-level papers. Reacting to one student's statement that schools sometimes "choose any person off the street to come in and teach a class," Reader 3A noted on his tape, "That kind of extreme, simple-minded statement keeps it out of the upper half for me." Thus, 2-level papers were often perceived as pedestrian or mechanical. In one instance, Reader 3D speculated about the probable cause for such a mechanical quality:

This is a 2 because of the lack of detail. The introduction was terrific. I wish it had followed through with the detail, with more examples. Yes, there were a couple of comma splices, and I don't worry too much about that. But it does indicate [the student] was trying to hurry, trying to finish the exam, get it over with, so that he or she could get on to whatever is next.

The negative comments that were manifested in readers' responses to 2-level papers predominated at the 1-score level. Occasionally, some positive notes appeared, as in


Reader 1B's observations about paper 052, "Too many fundamental errors in English — punctuation, spelling, fragments. But details are good" [italics added]. Similarly, Reader 3C commented about the "coherent introduction" of paper 112, and she noted in another instance that the student's ideas were good but that he or she simply had not yet mastered English sentences. Thus, even though positive comments about 1-level papers were rare, a few occurred under both scoring conditions. In this respect, the holistic scorers of this study differed from those in Haswell's (1988) study, whose across-the-board agreement as to bottom papers caused him concern about the stereotyping and the oversimplification such agreement implied. Noting, for example, that "the error-ridden and unstylish surface of bottom writing glares shields the depths where the complexities are" (p. 311), Haswell argues that teachers can agree on the worst student writing because they have simplified its characteristics. Contrary to Haswell's (1988) finding, the readers of this study responded, as Figure 5 suggests, to a variety of problems in 1-level papers. The comment of Reader 1B about paper 064 reflects this varied response: "Poor logic, bad grammar. Poor introduction. Paper has little content, much confusing repetition of phrases." Similarly, Reader 2D's notations about paper 074 indicate a comprehensive assessment that refutes Haswell's assertion


of oversimplification: "Errors in grammar, punctuation, and usage qualify it for a 1; logic, however, is even more serious problem, development inadequate." Thus, even though the 1-level papers received 1's primarily because of grammatical and mechanical errors, the readers seemed alert to rhetorical qualities — or the absence thereof — in these papers.

With the 1-papers, in particular — the score of which clearly failed a student — the question arises as to how conscious readers remained concerning the consequences of their scoring actions. Certainly, an awareness of the writer exerted a varying impact on all the study participants. For example, three scorers admitted on item 51 of the questionnaire that their perception of the writer almost always affected their scoring; Table Leader 1 wrote that the voice of the writer often influenced her. Five other scorers indicated that their perception of the writer "often" or at least "sometimes" affected their judgments; the remaining nine stated that it "seldom" or "never" did. Despite the varying impact that the scorers' awareness of the writer had on their evaluations, the participants clearly seemed to distinguish between the responsibility of their task — namely, the assigning of a score — and the consequences that the score would have for a student. In fact, in many holistic scoring sessions, the chief reader's initial procedural comments often urge readers to make that


exact distinction. On item 43 of the questionnaire, 16 of the 17 participants answered that they "always" or "almost always" were able to separate their scoring assignment from the implication of the score for the student. (See Table 5.) Only one reader answered that just "sometimes" could he make that distinction. In addition, when the 13 participants who often served as referees were asked the additional question of whether they could separate their refereeing decisions from the consequences, 12 indicated that they "always" or "almost always" could. The same reader cited above said that he "never" could. His different viewpoint is understandable in that he was a reader who talked directly to the students on the tapes, and he saw scores in terms of reward and punishment. Overall, the responses suggest that the scorers of this study were willing to suspend their own standards in support of the group's.

Despite the readers' best intentions to observe the standards, and despite their sincere efforts to do so, the logs and tapes indicate that evaluating essays holistically is a complicated task. Often the essays did not exactly fit the operational definitions, nor did they always match the rangefinders. Moreover, each score level was broad, comprising a range of possibilities; a "high" 2 could substantially differ from a 2 that was "looking down." Thus, readers had to


TABLE 5
Results of the Questionnaire, Part III

Questionnaire item | Always/Almost Always | Often/Sometimes | Seldom/Never | Other
31) Endorse evaluation of written products | 14 (82%) | 2 (12%) | — | 1 (6%)
32) Use timed writings in own classes | 4 (24%) | 11 (65%) | 1 (6%) | 1 (6%)
33) Use holistic scoring to assess classroom papers | 5 (29%) | 9 (53%) | 1 (6%) | 2 (12%)
34) Believe in scoring papers as a whole | 13 (76%) | 4 (24%) | — | —
36) Have difficulty in adhering to group standards | — | 1 (6%) | 16 (94%) | —
37) Have expectations of what a CLAST paper should look like | 8 (47%) | 6 (35%) | 3 (18%) | —
43) Can separate scoring task from consequences for the student | 16 (94%) | 1 (6%) | — | —
45) Feel pressured by the speed of other scorers | 1 (6%) | 7 (41%) | 8 (47%) | 1 (6%)
46) Physical comfort affects scoring | 2 (12%) | 9 (53%) | 6 (35%) | —
47) Feel comfortable with the standards for CLAST | 13 (76%) | 2 (12%) | 2 (12%) | —
58) When refereeing papers, can separate scoring task from consequences for the student* | 11 (92%) | — | 1 (8%) | —

*Only the 12 participants who referee responded.


balance strengths against weaknesses and determine quickly which qualities — whether negative or positive — predominated in the final impression an essay made.

Patterns Among Readers

Question 5: Do any common patterns appear in the scorers' written responses to the essays, or do their comments underscore the individuality of each reader's transaction with the text? Do readers' holistic judgments, as shown by their written responses, correspond to the writing features they rate as important on the questionnaire?

In answer to the question of whether patterns occurred in the responses of individual readers, the comments which readers made in the logs and on tapes were examined under both the monitored and unmonitored conditions. Several patterns appeared. Without exception, all readers commented frequently on the extent of development reflected by the essays overall. Repeatedly, such comments as "thin on development," "not enough development for a 3," or "good supporting details" appeared in the logs. Similarly, the readers often responded to the quality of sentence structure. Their notations included such phrases as "awkward sentence structure," "clumsy sentences," "some sentences confusing," "syntax errors," or "syntactic sophistication." To varying degrees, all readers commented on the presence of errors in some essays — either by citing the specific mistakes, as in "homonyms," or by labeling problems with some umbrella


term, such as "severe language problems" or "needs proofing for errors." The taped protocols made by the four special readers revealed similar concerns about development and sentence structure. Reader 3D's observation that "It's such a blanket statement — I really wish I could see some details here" typified the special readers' concern with development. Most of the special readers' comments addressed the lack of sufficient or in-depth development, although occasionally a reader would make a positive remark, as when Reader 3C noted, "I liked the concrete detail of conversation provided in that paragraph." Reader 3C also called attention to the strengths, as well as to the weaknesses, of particular sentences. For example, after reading a sentence with strong parallelism, she observed, "Beautiful sentence there," and she commented frequently on varied sentence structure. Reader 3B likewise noted individual sentences, as reflected in her remark, "That sentence improved toward the end and got rather nice." More often, the four special readers commented on the negatives — on the awkwardness, confusion, tangled structure, and lack of flow reflected by some sentences. For all 12 readers, then, as for the readers in Sweedler-Brown's study (1985), development and sentence structure appeared as clear and consistent concerns.

At the same time, the logs and the tapes revealed the individuality of each reader. Although the readers


commented on a diverse multitude of features, certain recurring themes appeared in the responses of each reader, suggesting that readers brought their own lenses or frames of reference to each essay. Brief portraits of each reader's logs or tapes will illustrate these individual concerns.

Reader 1A commented often on the organization and structure of a paper, using the term "then/now organization," a response unique to her. Several times she noted the clarity of a thesis and the effectiveness of transitions, and she called attention to the emphasis appearing at the end of given essays. For this reader the quality of content seemed especially important; she noted problems in logic and commented favorably when "a reasoned argument" occurred, when "good information" was presented, or when a paper reflected "sophistication of thought." Her references to errors were limited to such occasional notations as "a few errors," "problems with expression," "not literate enough," or "ungrammatical phrasing." Just as she was apt to note approvingly if sentences were balanced and sophisticated, she also — more than any other reader — disapproved of the use of passive voice.

For Reader 1B, organization, together with supporting details, was critical. Such phrases as "organization acceptable," "organization okay," or "organization needs improvement" dominated his logs. Not only did he respond


to the quality of introductions, but he also took note of the quality of conclusions with such remarks as "weak conclusion" or "conclusion trails off." Occasional references to the thesis or to "cliched ideas" also appeared, as did comments on paragraph unity or coherence. He remarked on punctuation errors but otherwise tended to classify problems simply as "fundamental errors" or "careless errors."

The logs of Reader 1C, in contrast, rarely contained any references to organization. Rather, with such phrases as "simplistic thought," she commented often on content and referred frequently to the need for connectives between sentences and paragraphs. Her responses were tailored to the particular texts in that she cited specific errors, such as "past tense of verbs," "no articles," or "agreement errors," and she virtually never grouped errors overall. This reader was especially concerned with tone, as she commented several times on "lapses [of] informality" or, using an expression unique to her logs, referred to essays that needed to have a more "scholarly tone."

Like the other readers, Reader 1D responded often to development, syntax, and structure. Although he identified specific errors occasionally, he primarily referred to them simply as "mechanical problems." He made several observations about organization but, unlike Reader 1B, rarely referred to introductions or to conclusions; similarly, he

seldom mentioned the thesis of a paper. However, content was important to him, as he wrote "extremely superficial content" several times and noted "macro-level logic problems" in a few entries. His logs were especially distinctive in his strong responses to diction and style. Such comments as "poor diction," "fairly strong diction," or "sophisticated diction and construction" were sprinkled throughout his accounts. Similarly, he wrote of "smooth-flowing style," "breezy, creative style," and "engaging" or "plodding" styles. Of all the scorers at Table 1, Reader 1D was the only reader to respond frequently to the elements of diction and style.

At Table 2, however, style was also significant to Reader 2A. Its importance was revealed in such frequent comments as "style not distinguished," "perfunctory style," "awkward style hurts," or, conversely, in a lengthy approbation, "The vivid style with concrete images provides a good portrait." References to diction, coherence, and paragraph structure appeared in her logs as well; however, the frequency with which she wrote of content underscored its particular effect on her responses. Repeatedly, Reader 2A commented approvingly, "Content a plus" or "very fine content," or she wrote negatively, "Content is pedestrian" or "Content is not always coherent." Clearly for Reader 2A, as for Readers 1A, 1C, and 1D, the quality of ideas was integral to the evaluation. In these readers' concern with
content, they resembled the scorers of studies by Diederich et al. (1961), Freedman (1979), and Breland and Jones (1984).

For Reader 2B, content seemed far less significant. Although his responses covered a wide range of categories, his logs reflected special concern with diction, organization, and focus. Such phrases as "good word choice," "illogical word choice," or "high-level vocabulary" were scattered throughout his commentaries, as were his remarks, "fair organization," "poor organization," "Conclusion introduces new information," or "no organization." His logs from the unmonitored condition reflected a particular awareness of focus, as several times he wrote, "Thesis not tied to topic," "loses sight of topic," or "second page unrelated to thesis"; to a lesser extent, similar responses, such as "no thesis, rambles," appeared in his monitored logs.

The logs of Reader 2C also reflected a concern for focus, as she commented approvingly on those writers who focused their topics tightly, rather than writing in more general terms. For this reader content was important, too, in that such comments as "not really new ideas," "questionable logic," or "good ideas" dotted her records. She, like Readers 1D and 2B, also took note of word choice, and frequent remarks such as "misuses words," "abstract language," "good image," or simply the word "things" in
quotes appeared in her logs. For this reader organization was rarely an issue. However, like Reader 1C, she labeled specific errors, citing explicit occurrences of fragments, homonyms, and verb endings.

Reader 2D clustered his references to errors under the umbrella term of "language skills," although he often discussed the "clumsy style," the "unremarkable style," or the "crisp writing" that characterized some essays. He frequently referred to the content of essays, as well as to their introductions and conclusions. Such phrases as "content weak, illogical," "content mediocre," or "excellent content, interesting" permeated his records. Equally pervasive were his references to "competent," "tedious," or "superficial" introductions and to "pedestrian," "repetitious," or good conclusions that even "expand[ed] the subject."

As can be seen, distinctive patterns appeared in the logs of all readers. The patterns crossed boundaries of gender and of instructional level, revealing the individuality of each scorer's response. To be sure, the specific nature of these comments should not be overemphasized, for as Figures B-2 and B-3 in Appendix B indicate, a core of common responses underlies their evaluations. Nevertheless, these portraits suggest that each reader also brought an individual perspective or lens through which to view the student essays.

The individuality of each reader's response to the essays was revealed in the taped protocols of the special scorers as well. Like Reader 2D and several other regular readers, Reader 3B commented on the strengths and weaknesses both of introductions and of conclusions. Although she found the presence of titles "irrelevant," she repeatedly approved of good thesis statements and objected to thesis statements that were too vague or that obscured what the writer was talking about. Like Reader 1C, she remarked on the need for transitions in several papers and approved the use of anecdotes as a means to support some assertions. Echoing Reader 2C's dislike of the word "things," as well as other vague language, Reader 3B called attention to the effective imagery in one student's use of the phrase "domino effect of awareness." Her comments on handwriting dealt only with her occasional struggle to decipher certain words; the struggle seemed part of her effort to make meaning of the essays, as revealed in her references to "faulty logic," in her interpretations of what a given writer must have meant, or in the answers she gave to her own questions, "What helps? Technology, I suppose."

Like Reader 3B, Reader 3C was also concerned with introductions and conclusions. She commented frequently on the need for transitions and noted when the vocabulary was good or when a better word was needed. Her tapes, in
particular, reflected the attempt of a reader to follow various writers' explanations and to understand what was being said. For example, referring to one paper on oceanography, she expressed trouble with the idea of cameras and bells studying at ocean depths; later, she observed, "I'm not following the logic in that paragraph — maybe that's the problem — it really doesn't seem to say as much as at first I thought." Her references to logic were frequent, as were her indirect allusions to focus: On one occasion, she tried to determine why a paper was "rough reading" and concluded that her problems with it arose because it failed to adhere to its initial generalization. In her attempts to make meaning of the essays, Reader 3C, like Reader 3B, resembled the placement scorers cited in the study of Barritt et al. (1986).

For Reader 3A, dramatic scenes were preferable to more general, abstract introductions; similarly, he noted that he liked to "look at the drama — how people accomplish drama in conclusions." Admitting that length and handwriting were factors in his responses, he stressed the importance of style. He expressed irritation over the he/she indecision about gender, calling it "ungraceful," and he commented often on the "sophistication of diction," noting once with irritation, "an ill phrase, a vile phrase." Like Readers 3B and 3C, he remarked on the presence or lack of logic in several essays: "This person is just not
doing very clear thinking. I'm beginning to think it's myself, but I think it's the student." Such phrases as "really silly thinking" or "simplistic logic" were sprinkled throughout his tapes. At the same time, he explained that he did not "struggle with the progression of [a] person's logic" as did so many other readers; rather, he wanted to derive coherence from the overall flow of an essay. Reader 3A spoke of rewarding or penalizing essays for certain qualities and with some essays wondered aloud what score the second reader would be likely to give. The most distinctive trait of Reader 3A, however, was his tendency to respond to whatever students discussed in essays in terms of his own personal experiences, whether it be grocery stores, the dentist's office, or an old Woody Allen serial.

Reader 3D shared a number of response traits with Reader 3A even though the two taught at different types of institutions in different towns. Like Reader 3A, Reader 3D talked about rewarding or penalizing essays, a concept seldom mentioned by the other special readers. Reader 3D also, like Reader 3A, occasionally speculated as to what a second reader's score might be, and he, too, responded often to diction. His tapes were dominated by such comments as "I like the diction level here. I like the trouble this kid went through. I really appreciate it,"
or, conversely, "Some of the vocabulary is a little awkward, a little strange at times. . . . I'm waiting for a little more careful use of the language here." Like several of the other readers, Reader 3D was concerned with logic and meaning; occasionally, in referring to jumbled ideas, he used the term "semantic abbreviation," which he attributed to Collins and Williamson's work (1984), to indicate that the student had not said enough.

Unique to Reader 3D were his concerns both with text and with revision. Openly acknowledging that he was affected by handwriting, he observed that handwriting contributed to the visual impression of text that each essay made; he was similarly impressed by titles, and he took note of indentations of paragraphs and the straightness of margins. Saying that "part of the sense of text in writing must be visual," Reader 3D explained that he wanted to give each student writer "the opportunity to demonstrate that he has some sense of vision — some sense of the visual or visible quality of what a piece of writing needs to be." Another part of that visual impression entailed signs of revision; for Reader 3D, erasures and cross-outs signaled that some thinking was going on, along with the writing. So important was revision to this reader that he used the terms "bleeder" and "barfer" to distinguish between those writers who agonized over each word of each sentence and those who wrote first and then examined their work afterward.

Therefore, as can be seen, the tapes confirmed the individuality of perspective which each reader brought to the scoring. Although all the readers shared some concerns in common, they each had individual patterns of response to the essays. Such individuality is not new, for a number of studies in the literature review, including the recent works by Martin (1987) and by Huot (1988), have underscored the individuality of each reader's response.

The Writer Behind the Paper

The individuality extended to several readers' envisioning of or interaction with the writer behind the paper. The effect of readers' perceptions of the writer behind the essay has been the subject of recent concern for the researchers Barritt, Stock, and Clark (1986) and Sullivan (1986); as noted in the previous discussion of Question 4 and as shown in the logs, several readers in this study felt that their perceptions of the writer behind the paper affected their responses. For example, Reader 2A's responses often showed sympathy for students' "rough draft" performance, and Reader 2D wrote approvingly of several students' ability to write knowledgeably about their subjects.

Most of the readers' responses to writers were indirectly expressed. Only in the logs of Readers 1A and 2C did the readers respond directly to the writers behind the essays. Reader 1A occasionally reacted to the writer's ideas, with such comments as "What about those who
are not of a Judaeo-Christian background?" or "Help, the specialists are winning! So much for the generalists and the well-rounded individual with the curiosity to explore further than our own backyard." Reader 2C also reacted directly to some content and to the writer of those ideas. She wrote angrily, "Where does he get ideas like teachers are just anyone off the street?!" With an apology for commenting personally about one student's argument, she wrote about another paper, "[The writer is] a little uncompassionate to students who need to work and whose parents are already being responsible. May also need to consider if more will really benefit students who already want less." Thus, the logs reflected to a limited extent several readers' awareness of the writers behind the papers and an attempt to react to these students' ideas.

Such interaction or engagement with the writer is more readily apparent in the taped protocols of the special readers. The accounts of Reader 3D, in particular, reflected a running dialogue with the writers. Not only did he frequently interject such comments as "true," "interesting," or "okay," but he also responded directly to the student: "Well, show me what you mean by better, kid" or "I'm waiting for something more definite, young lady, young man." Similarly, as the three quotations below indicate, Reader 3D often stepped back from the papers to talk about the writers:

1. Nice basic piece here. Almost reads like a summary of something the kid studied in some detail and tends to care about, I think.

2. I'm not sure where this is headed at this point. This kid is trying hard, though. It's obvious he is trying hard.

3. My impression at this point is that this is a kid who is struggling to write — to take the inner speech that is working up inside the head and trying to put it down in a manner that is acceptable.

Throughout his tapes Reader 3D engaged in a dialogue with, as well as a commentary about, the writers themselves. He later observed that the process of verbalizing his comments on tape had made him more humane in his responses.

Like Reader 3D, Reader 3A also talked constantly to and about the writers. He argued with some of the writers' notions, declaring, "That's not true. I don't believe it," or questioning the source of another writer's statistics. He, too, stepped back on occasion to make observations about some writers: "Interesting small subject, but apparently not a good choice for this writer 'cause here she doesn't have anything to say about it." Thus, in the taped protocols of both men there is an ongoing display of interaction with the writer.

A similar awareness, albeit to a lesser extent, appeared in the protocols of the two women readers. Occasionally, Reader 3B responded directly to the writer's statements with a question or a simple phrase, such as "Oh, joy" to the idea
of a television in a dentist's office. More often, however, her comments were about the writer, as revealed in her remarks, "The person is obviously talking about something she knows about, very interesting" and "He is prejudiced clearly, and makes assertions that he does not adequately support," or in her observation, "This person has obviously never tried to teach a class outside on a pretty spring day."

The formality conveyed by Reader 3B's use of the term "person" rather than 3D's "kid" appeared in Reader 3C's tapes as well. Only rarely did Reader 3C respond directly to the content with such comments as "I didn't know that" or "Okay, if that's indeed a big benefit." Her remarks about the writers were infrequent, too, limited to such phrases as "I liked some of the reasoning this person gave."

Just as the logs revealed among the eight regular readers varying degrees of awareness of the writers behind the papers, so, too, did the taped protocols indicate a spectrum of response among the four special readers. Whereas the two men seemed directly and frequently involved with the writers, the women's interactions were less frequent and more formal.

The Causes of Errors and Their Remedies

Like some of the regular readers, the special readers noted specific errors as they talked their way through
the papers. In addition, they attempted to explore the reason for the difficulties. The logs did not fully address this concern, as the eight regular readers had neither the time nor the space to explore the causes of errors; only some tangential references to probable causes occurred with Readers 1B's and 1C's remarks about careless errors and with Reader 2A's frequent emphasis on the papers as rough drafts.

The special readers, however, often remarked on why a particular error might have occurred, frequently attributing such errors to haste. Reader 3C's comment was typical: "I think that this person just writes quickly and makes careless errors. At least, that's what I chalked it up to right now, because the content seems to be good, and the sentences seem to be well done." Likewise, Reader 3D speculated that a student probably thought "chances" instead of "changes"; the writer didn't even realize he or she was doing that in the haste of the situation. The special readers considered other possibilities as well. In one instance Reader 3A attributed the lack of an "n" on the article "an" to a revision in which the writer failed to make all the necessary changes. Still other sources of error cited were the writer's lack of sensitivity to audience and the interplay of written and spoken language, as shown when Reader 3D observed, "Seems to me that this is the person's first draft relying heavily
on speech." Often the readers were sympathetic to the students. In referring to one paper with an audience problem, Reader 3D stated, "Makes me feel bad because if this were intended for an audience of draftsmen, I'm sure they'd all be sitting and nodding right now and saying what a wonderful piece this is. But unfortunately, that wasn't his audience. His audience is an English teacher doing a test." This reader noted that he often put himself in the place of the students, remembering his own written and oral examinations.

But with other errors, the readers expressed frustration at students' lack of knowledge. Referring to an example of poor diction, Reader 3B commented, "An upper-half writer shouldn't be making errors like these — like installate." Similarly, Reader 3A, in expressing irritation over a lack of word endings, argued, "[People] should be required or taught not to abandon the word before they move on to the next one physically with their hands and put their minds on the next word."

The readers' remarks on the causes of errors carried over to a concern for probable remedies. Reader 3B noted that a brief comma review would help one student, and Reader 3D speculated that a 10-minute lesson on apostrophes would benefit another writer. Reader 3A suggested that one especially weak writer desperately needed a one-on-one conference. His frustration at being unable to help such
a writer was clearly evident in his probing question of how much good he could actually do: "How do you generalize the name of that error? Then, how do you get the person to understand and not do it again? You can't. Language skills sometimes seem to be so difficult to pinpoint."

For Reader 3A, as well as for the other readers to a lesser extent, the taped evaluations of student essays served as a point of departure for general speculations about writing. For example, Reader 3B discussed how one paper embodied "the very definition of good writing"; similarly, Reader 3D commented that for him, writing was truly effective when it "filled [his] questions and enabled [him] to read like a reader and not like a teacher." Both Readers 3D and 3B commented on the inappropriate familiarity with the reader that the use of "you" reflected and on its frequent occurrence as a characteristic of less sophisticated writers. Reader 3A labeled an inability to deal with ambivalence or ambiguity as another trait of unsophisticated writers. Stressing the need for using a "domino effect" whenever editing so that one change generates the subsequent changes also needed, Reader 3A observed about one paper: "The style is, on the one hand, sophisticated but, on the other, incorrectly done. I guess all writing has things that work and things that don't work." With frustration he reflected on the difficulty of teaching writing, especially in view
of its close link to thinking: "How would you begin to work with this person's mind?" For the special readers doing the tapes, then, the evaluation of individual essays led to more general speculations about writing.

Scoring Approaches and Preferences

Readers were also asked to speculate about their own approaches to scoring papers holistically. One open-ended question on the questionnaire (see Part V of the questionnaire in Appendix A) asked readers to describe their processes in holistically scoring papers; these processes again underscored the readers' individuality. For example, the tapes showed Reader 3A often announcing an immediate score based on what he described in his questionnaire as handwriting, the language of a sentence, the length, and the presence or absence of detail. Then he continued reading to determine whether the initial judgment would hold. Reader 1A, like several others, tended to read the first paragraph carefully and make a tentative judgment as to upper half or lower half. After reading the remainder of the paper, she would adjust the score accordingly. Reader 3C indicated that she first decided to which half the paper belonged, mentally assigned a score of 4 or 2, and scored down from there. Reader 1D wrote that he rarely changed his mind on the upper/lower half distinction after the first paragraph and often knew the final score by the
third paragraph. Reader 3B, in contrast, noted that sometimes her score changed two or three times in the course of reading a paper. Reader 2C responded that she "read the whole paper, and some papers just seem to be a particular score." Still other readers used a combination of methods. As Reader 2A explained, some papers were easy to score whereas others required a closer scrutiny or, occasionally, even rereading. Thus, the methods the various readers used underscored their different response patterns.

Still other explanations for readers' differing responses to the essays may lie in the additional answers they gave to some sections of the questionnaire. Three parts are especially applicable: (a) the readers' open-ended descriptions of their own scoring tendencies, (b) their ratings of their biases and preferences, and (c) their ratings of the criteria they considered important in judging timed writings.

Most of the readers described themselves as fair in their scoring; the majority also noted that they were strict, with several tending toward the low end. Four readers acknowledged that they were generous or charitable with better papers. Reader 3D, while describing his scoring tendencies as fair, noted that at times his scoring was affected by "the view of the student through his/her prose." Indeed, as has been discussed, the protocols
confirmed this reader's awareness of the writer behind the essay.

The readers frankly acknowledged their biases or preferences by placing a plus (+) or a minus (-) beside those items on the questionnaire that triggered either a strong positive or a strong negative response in their reading. They did not mark items toward which they were neutral, and they could add items not included. As Table 6 indicates, two-thirds of the 12 readers agreed that they reacted negatively to misinformation in papers, to shallow essays, to hard-to-read handwriting, and to extremely short papers. Eight or nine readers also agreed that they responded positively to creative papers, to humor, and to a delightful writer behind the essay. But the individuality of the readers appeared in the mixed response that most other categories generated: For example, whereas two readers acknowledged responding negatively to rhetorical devices and positively to first-person narratives, one reader reacted to each of these categories in the reverse. Other readers were, presumably, neutral about those areas. Whereas three readers liked technical/scientific papers, two did not; though six readers responded negatively to religious papers, one liked such essays. Even the category of "disagreeable writer" generated a mixed response, as two readers noted that they reacted positively to the sign of any writer behind an essay.

TABLE 6

Results of Questionnaire on Biases and Preferences (Part 2, Item 25)

Elements rated (with columns for number of positive responses and number of negative responses): political papers; social issues; religious papers; first-person narratives; technical papers; literary allusions; creative papers; misinformation; humor; severe misspellings; shallow papers; rhetorical devices; disagreeable writer; illegible handwriting; extremely short papers; weak conclusion/introduction; inductive papers; sentimental papers; delightful writer; slang.

Note: The 17 participants checked with pluses or minuses only those elements which triggered a strong personal reaction, either negative or positive. Other write-in categories included wit, irony, perceptiveness, attacks on the test, sarcasm, and unexamined values.

The taped protocols revealed other biases on the part of special readers — from one reader's dislike of the phrase "a lot of" or the use of the pronoun "one" to another reader's dislike of jargon and a third reader's dislike of a formal tone.

To adjust for their own biases — of which readers were obviously aware — the readers identified several strategies they used most frequently. Two-thirds indicated that they often slowed down when they encountered a paper that triggered a strong personal reaction, and one-third said they reread the paper. Four readers noted that they consulted "often" with the table leaders, four only "sometimes," and the remaining four "seldom" or "never." The readers were equally divided between "sometimes" and "seldom" in their tendency to reexamine the rangefinders. One reader wrote that he put the papers at the end of the stack to return to for reexamination after a break.

Readers were also asked to rate the writing criteria they considered important in evaluating timed essays; they checked which of 24 areas dealing with rhetoric, style, grammar, and mechanics they believed to be "very important" (4 points), "important" (3 points), "somewhat important" (2 points), or "not very important" (1 point). In this study, unlike that of Breland and Jones (1984), correlations were not obtained between the readers' ratings of criteria important to their evaluations and
their actual responses as reflected through logs and tapes. Nevertheless, Figures 6 and 7 illustrate the variations in importance that individual readers placed on the different writing criteria.

That creativity occurs as the least important feature in these ratings is somewhat surprising. However, it must be remembered that the criteria which readers were asked to rate were the features they valued most in assessing timed writings; moreover, the ratings were relative, ranging from "very important" to "not very important," with absolutes, such as "not at all important," excluded. In this context of timed assessment and relative comparisons, the placement of creativity at the bottom of the scale (see Figures 6 and 7) seems less disturbing. For example, even the lowest score on the scale (a total of 25) represents an average ranking of about 2 points for each of the 12 readers, a ranking which signifies "somewhat important." Their responses may thus indicate that creativity is simply not essential in timed assessments; that is, while students can be — and, in fact, typically are — rewarded for timed essays that are creative, students whose essays are otherwise strong and solid are not penalized for their lack of creativity in an assessment situation. That the criterion of the writer's commitment to the topic ranked similarly low was perhaps due to readers' recognition that assessment topics, assigned as they often are, will not engage the students in the same way that topics of choice for outside assignments may do.

Most Important
44* Development/Adequate Controlling Idea
43 Focus
41 Avoidance of Fragments and Run-ons
40 Depth of Thought
39 Fluent Sentence Style/Avoidance of Tangled Sentences/Length
38 Variety of Sentence Structure
37 Accurate Diction

Least Important
35 ESL Errors
34 Tone/Usage Errors
32 Dialect Errors
31 Mature Diction/Introduction/Conclusion
30 Punctuation
28 Commitment of Writer
   Spelling/Capitalization
25 Creativity

Note: Each reader could assign a maximum of 4 points per criterion.
*The numbers represent total points assigned by 12 readers.

Figure 7. Readers' ratings of importance of criteria in timed writings.

An informal survey of the questionnaires in comparison to the logs or tapes confirms the validity of many features which readers said they deemed to be important. Reader 1A, for example, who commented frequently in her logs on the thesis and on "then/now" organization, rated those rhetorical features as very important to her. Similarly, Readers 1D, 2B, 2C, and 3A, all of whom had commented repeatedly on diction in their responses to the essays, admitted on the questionnaire to valuing word choice highly. Reader 1C responded on the questionnaire that sentence variety was very important to her, and the frequency of her comments in the log corroborated its significance. In addition, Reader 2D showed through both his logs and his questionnaires the importance that he placed on content. Thus, the features which the readers rated either as "very important" or "important" on the questionnaire were often the same features to which they responded in the essays. In this respect, the readers of this study differed significantly from those in Harris' study (1977) and in Stach's (1987), in which the actual basis for writing judgments was far different from what the readers thought it would be.

At the same time, some discrepancies in this study could be found. For example, Reader 1C described thesis
and focus as being very important in her writing judgments, yet she rarely mentioned these elements in her logs. Reader 2A, responding to an optional item on the questionnaire, stated that topicality was crucial to her but never addressed this element in her logs. Thus, the questionnaires suggested that more elements were important to readers than their responses to the essays might have suggested; this finding is not surprising in that the logs comprised brief summaries and were written in the midst of an actual scoring process when readers were attempting to score substantial numbers of papers.

Conversely, the logs revealed that some features, believed by readers to be only "somewhat important" or "not very important," might have more significance than the readers necessarily realized. For example, Reader 1B rated the category Depth of Thought as only "somewhat important" to him. However, in the unmonitored scoring he commented in several instances on logic, cliched ideas, content, and intellect. Similarly, Reader 2B, who rated Introduction, Conclusion, and Tangled Sentences as only "somewhat important," responded to these particular features in his logs. Reader 2C, who had frequently specified the nature of misspelling in her logs by noting "homonyms," likewise rated spelling as only "somewhat important" to her. The discrepancies between what some readers said was important to them and what they showed as being significant
may be due to several factors: First, readers may have been unaware of what was truly involved in their own scoring judgments, although the metacognitive awareness reflected by their other self-reports makes such an explanation unlikely. Second, the subjectivity entailed in interpreting the degrees of importance — especially a category such as "somewhat important" — may account for the occurrence of some responses in the logs. Finally, the experience of recording comments in logs or on tapes was a new one for virtually all the readers. Consequently, in the process of an actual scoring, they may have been unable to note in either written or taped form all the features to which they were responding. As Freedman and Calfee (1983) note, articulating an evaluative response to a work is difficult; hence, this latter explanation seems most likely.

Summary of Data

Taken together, all the data — from the logs and taped protocols to the closed and open-ended questions on the questionnaire — underscore the individual perspective that each reader brought to the scoring task. The perspectives seemed influenced by a combination of scoring approaches, particular biases or preferences, and features the readers valued most in timed writings, as well as, undoubtedly, by personal and background factors not under consideration.
(In a recent study, Martin (1987) found these latter factors to be especially important.) These patterns of response could not be attributed to gender, race, or instructional level. Rather, the readers seemed randomly linked with one another in some of the writing features they valued or in some of the approaches they used. But the word "some" is important to emphasize in this regard, for as Figure 7 illustrates, readers agreed on the significance of focus, organization, unity, and fluent sentence style; they generally agreed on the relative unimportance in timed essays of creativity, the writer's commitment to the topic, and spelling or punctuation. Thus, the individuality of readers' perspectives on writing was clearly grounded in shared beliefs that undoubtedly contributed to the high interrater reliability discussed under Questions 2 and 3.

Nature of the Monitoring

Question 6: What is the nature of the monitoring that the readers receive during a scoring as reported through the logs of chief readers, table leaders, and readers? Do the procedures noted in these logs, together with the protocols of the special readers, support the readers' perceptions of their own holistic scoring processes as noted on the questionnaire?

The nature of the monitoring in an holistic scoring was determined through an examination of the following data: (a) the logs of the table leaders and chief readers, (b) their interactions with readers as reported both
through the readers' scoring logs and the special readers' tapes, and (c) the questionnaires given all participants. In addition, the researcher recorded observations of the monitored scoring in progress, the results of which were reported under Question 5.

Roles and Questionnaire Responses of the Chief Readers and Table Leaders

Like the readers, the five people responsible for their training and monitoring in the monitored holistic scoring — i.e., the two chief readers and three table leaders — came from universities, community colleges, and high schools. For the unmonitored portion of the scoring, all had been given different tasks from the readers: The table leaders read the 39 original samples and rangefinders, recording their comments in a log. The chief readers, who had been involved in the original selections of these samples two years previously, read the 112 papers initially chosen for the study, recording their comments about each paper in a log.

As in the readers' case, the chief readers' and table leaders' questionnaires reflected some individuality of response to various writing features. For example, on the questionnaire all five rated an adequate controlling idea and development as "very important" in their writing judgments; similarly, they judged sentence fluency, variety of sentence structure, and accurate diction as significant
also. However, as Figures 8 and 9 indicate, the chief readers and table leaders disagreed about the importance of fragments and run-ons, of dialect/ESL errors, and of a conclusion.

The logs corroborated at once the similarities and differences between the two chief readers. Although each one responded to the degree of development in papers and to sentence constructions, distinctive responses also appeared. Whereas the first chief reader referred to problems in general terms such as "multiple language errors" or "mechanical problems," the second chief reader was more likely to identify the exact nature of the errors, using such labels as "run-ons," "verb endings," and "apostrophes." Similarly, whereas the first chief reader noted the content of essays more frequently than did the associate chief reader, the second chief reader commented more often than the first on the nature of introductions or conclusions. At the same time, both chief readers were concerned about focus, and both characterized essays globally: Chief Reader 1 talked about the "thorough treatment of the topic" in given essays or, conversely, an essay that was "bland and superficial"; the second chief reader commented on papers which reflected either "competent writing" or "ho-hum writing."

The table leaders' logs showed their individuality as well. Table Leader 1 expressed concern with organization and vocabulary, with voice and style, with a writer's sense of control, and with a writer's ability to make a paper interesting.

Figure 8. Ratings by Table Leaders No. 1-3 and Chief Readers No. 1-2 of the importance of individual criteria (4 = most important, 1 = least important), beginning with Adequate Controlling Idea (total of 20 points) and Focus; one write-in category was logic/reasoning.

Most Important
20* Adequate Controlling Idea
19 Fluent Sentence Style
18 Focus/Development/Accurate Diction
17 Depth of Thought/Organization/Unity/Variety of Sentence Structure/Mature Diction/Avoidance of Tangled Sentence Structures
16 Tone/Avoidance of Fragments and Run-ons/Avoidance of ESL Errors

Least Important
15 Commitment of Writer/Avoidance of Dialect Errors
14 Punctuation/Length
13 Introduction/Avoidance of Usage Errors/Capitalization
12 Spelling
11 Conclusion/Creativity

*Total points assigned by 5 chief readers and table leaders. (The maximum number possible is 4 points per criterion per individual.)

Figure 9. Ratings by chief readers and table leaders of importance of criteria in timed writing.

Table Leader 2 remarked on introductions and conclusions, took note of pertinent details and logical support, and identified types of errors specifically. This table leader responded especially to repetition and diction. Table Leader 3, like Table Leader 2, also commented on repetition and on the presence of errors; in addition, he called attention to the thesis and focus of papers, as well as to diction and logic.

An excerpt from the logs illustrates the table leaders' shared interest in organization, as well as their different perspectives. About one sample, which received two scores of 3 and one score of 2+ from the table leaders, Table Leader 1 wrote the following comment: "organized — clear — held my interest — movement — strong voice (vocabulary and rhythm limitations make paper a 3, not a 4)." Table Leader 3 noted about the same sample, "Can't spell his subject. Organized, focused — but spelling!" Table Leader 2 commented, "Spelling problems; good details — obviously knows his topic. Many good sentences — logically well organized."

Thus, like the readers, the two chief readers and three table leaders brought their individual perspectives to the evaluation of each essay. At the same time, as Figure B-4 in Appendix B indicates, the table leaders' scores on the
samples showed substantial agreement. Only on one out of 39 scores did the table leaders disagree; on the remaining papers the scores were either identical or contiguous.

Training with Rangefinders and Samples

Before the monitored scoring began, the table leaders reported for a discussion of the samples and the actual scores that had been assigned two years before. This short meeting took the place of the formal table leaders' session which is customarily held the day before any holistic scoring and at which table leaders can disagree with any samples selected by the chief readers. The table leaders' monitoring tasks formally began with the discussion that ensued after the rangefinders were tallied at the start of the monitored scoring; on this occasion, most readers found the rangefinders to be good indicators of each scoring level, and the discussions were brief as a result. However, Table Leader 1 did work with Reader 1D, who inquired why paper T was not an upper-half paper.

Eleven other samples were presented at intermittent intervals throughout the scoring to prevent the readers from drifting away from the standards. Each time the scores were either identical or contiguous on these samples, and no discussions occurred. However, Table Leader 1 recorded in the log that one reader had scored
sample FF as a 3 and then silently changed his score to a 2 when no other readers at the table showed a similar reading to his. Referring to the score given by dozens of readers in the original scoring two years previously, the table leader concluded, "Sample FF was, in fact, a 3 paper. The reader had succumbed to group pressure and awarded the group's score, not his score." This incident illustrated the importance of readers' maintaining confidence in their own judgment and in their own ability to adhere to the standards.

The "procedures" section of some readers' logs, as well as the tapes, partially conveyed the extent to which the samples and rangefinders helped align readers with these standards. (Information from the procedures is limited in that only four of the eight readers actually completed this portion of their monitored logs; apparently, they did not all understand the importance of completing this section, nor could this part of their task be emphasized, as readers were not to know the nature of the study.) Nevertheless, nearly all the essays that were scored immediately following the rangefinders received scores of either 2 or 3. That so many of these early essays were middle-range papers is not surprising; in fact, during the unmonitored scoring, Reader 3D, upon giving his first paper a score of 4, noted that it was difficult to begin scoring with extremes. Notwithstanding the difficulty of initially assigning scores at
either end of the range, two of the essays scored immediately after the rangefinders received scores of 1. More important, in all cases the early scores proved to be accurate in that when the same essays were scored later during the session by other readers, the papers received the same scores.

The impact of the training samples could be seen in the scoring patterns that occurred immediately before or after breaks. At one point, Reader 3D, upon returning from a break, noticed that he was scoring low, and he expressed concern as to whether something had happened to his own sense of the standards. Moreover, Reader 3A speculated that grades before lunch might be lower than grades after lunch, when readers were satisfied. Indeed, other readers had implied similar concerns on their questionnaires when they noted that fatigue, a post-lunch slump, or even room temperature could affect their scoring processes (see item 46 of Table 5). However, the data did not support Reader 3A's conjecture: The scores before lunch were always comparable to those given at other times of day to the same essay; in fact, one score before lunch — a 4 — was higher than any other score the same essay received. Scores given after lunch, when samples were again provided, were also representative of the other scores assigned those particular papers at other times in the day.

Readers made virtually no comments about the samples in their logs or on their tapes: The only reference came
from Reader 3D, who noted that both a sample and an essay he read immediately after the sample had noticeably large handwriting. Thus, the inclusion of samples seemed an integral part of the holistic training procedures — almost taken for granted by the readers but, presumably, helping them to focus on the standards.

Monitoring by the Table Leaders

Readers were, however, frank about relying on the table leaders to confirm their judgments. Even during the unmonitored scoring, one reader noted that if it had been possible, she would have asked the table leader for help with certain essays. Similarly, in the monitored scoring, the tapes and logs revealed several instances in which readers also turned to the table leaders for advice: For example, Reader 3C deliberately sought out her table leader about paper 073, which she had found difficult to score, and Reader 3D mentioned on tape that he would have asked the table leader about a paper if the table leader (who was moving among four different offices during the taping part of the monitored scoring) had been available.

Several log notations of the readers at each table confirmed the readers' views of table leaders as helpful resource people. Reader 1A initiated a discussion with the table leader about paper 006, and, as both their logs indicated, a discussion ensued about organization, surface
errors, conciseness, and development. Late in the scoring Reader 1C inquired whether her score on paper 103 was too high; the table leader suggested that she consider "language and structural weaknesses" in determining her score. At Table 2 readers also turned to their table leader: Reader 2A pointed out a curious sentence from paper 045, and Reader 2B asked the table leader if paper 051 could possibly be a 4. Thus, part of the monitoring was clearly reader-initiated, as the readers sought out the table leaders for brief discussions of problematic papers.

More commonly, the table leaders would initiate the interaction after they reviewed essays selected at random from each reader's set of scored papers. The table leaders would read the selected papers, assign them an independent score, comment on the essays in their own logs, and then look at the particular reader's corresponding score and comment. If, as Table Leader 1 noted, there was "easy, rapid agreement" on certain papers that were classic representatives of the score levels, no conferences were likely to occur. Occasionally, a table leader would confirm the score aloud with the reader, as when Table Leader 2 said of a paper given a final score of 2, "I almost gave it a 1," and Reader 2C exclaimed, "I almost gave it a 1 also!" Similarly, Table Leader 3 conferred with Reader 3A about paper 010, on which they had given identical scores.
Table Leader 3 noted in his log, "Reviewed for 2-3 minutes — [we agreed] on positive aspects of the paper. Reader thinks facts are also a problem — invented." Thus, when papers were particularly noteworthy — either for being very good or very bad — discussions occasionally would occur even when identical scores were assigned. More often than not, however, no conferences would arise.

Few discussions appeared to take place when contiguous scores were considered accurate. For example, Table Leader 3 noted about paper 065, to which he had given a score of 2 and the reader a score of 3, the comment, "No need to review — I think it's a 2+ or 3-." Similarly, Table Leader 2 noted about paper 081, "Reader 2C scored this a 3 and I a 2+. This discrepancy does not seem major. The reader offered to look at this paper again, but I did not feel this to be necessary." Thus, although contiguous scores may appear to be quite different — and indeed they are at times — a score of 2 and another score of 3, or a score of 4 and another of 3, may be an accurate assessment of a paper. The score ranges are broad, and what one reader may perceive as a high 1, for example, another reader may see as a low 2.

Discussions did ensue when table leaders thought that scores were discrepant or when the chief readers returned papers with what they believed to be inaccurate scores. The log notation of Table Leader 1 was representative of
such an instance: "I discussed 002 with Reader 1C, a paper from a check reading which she had overrewarded. Little interaction after re-reading; she said she saw what I meant about the paper being a 1, not a 2." The log of Table Leader 2 revealed a similar occurrence:

The Chief Reader brought paper 045 back from check reading: Reader 2A 1; TL 2, 2; Head Table 3 — oops! Reader 2A reread and decided 's-v problems, slippery syntax' — would change her score to a 2 if they want her to. But I think there are enough problems to keep it in lower half: sp, punc, word endings, diction.

The change in scores that these returned papers generated was not a frequent occurrence: The logs of all three table leaders revealed only a few instances in which the readers actually changed scores after rereading and discussing questionable papers with the table leaders. What happened more frequently instead was that papers with contiguous or even identical scores served as a springboard for brief conversations about writing. For example, Table Leader 2 wrote about paper 054, which Reader 2C had scored a 3: "[The reader] liked the personal experience a bit more than I. Some of our comments were the same; she felt that we were 'on the same wave length.'" Similarly, Reader 1D asked his table leader about the probable source of troublesome sentences in paper 083; together they discussed whether it might be due to the writer's limited vocabulary or to logic problems. Still another instance occurred when Table
Leader 3 conferred with his reader about the strong organization and competent style of one paper that overcame its minor logic problems.

The tapes further illustrated the ongoing nature of discussions between readers and table leaders. In reviewing a probable score of 4 or 3 for paper 010, Table Leader 3 recounted to Reader 3C his own mental debate as to whether the strong development sufficiently offset the errors which pulled the paper down. When Reader 3C raised additional questions about the content of the paper, the table leader concluded, "Right. So if you want to give it a 4, that's fine, but if you want to knock it down because of the facts and the errors to a 3, that would be fine. There's no doubt it's an upper-half paper."

On no occasion were there any signs that the table leaders made readers change their scores on questionable papers. When Reader 3C asked the table leader whether she should change her score of 1 on paper 013 to a 2, Table Leader 3 laughingly replied, "No, I don't plan to beat you over the head and make you change it to a 2. It's not warranted. I just wanted to make sure we were talking more or less along the same lines."

The table leaders recognized that they themselves could be in error. For example, Table Leader 2, writing in her log about paper 081, to which she had given a 2+, noted, "Reader 2D gave this a 3. Since Reader 2C did the same
this morning, I guess I'm off on this one." The awareness that table leaders and chief readers could be wrong, too, was also shown when, as discussed previously, Table Leader 2 disagreed with the chief reader's judgment of a 3 for paper 045, concluding rather that the extent of problems kept that paper in the lower half.

But if the table leaders and chief readers did not perceive themselves as authority figures who were necessarily "right," a brief comment by one or two readers conveyed that some readers — on some occasions, at least — showed a deference for the leaders' judgment. Not only did such deference appear in Reader 2A's previously cited willingness to change her score after a check reading from a 1 to a 2 on paper 045 "if they wanted her to" [italics added], but it also was clearly expressed by Reader 3B in her taped response to paper 085:

[The paper] begins to break down a little in the end, and yet it is well said. I'm not sure; definitely, at least, a 3. I'm not sure if the table leader would accept a 4. I'm going to stop just a minute and consult my 4 rangefinder. The paper is not as good as the 4 rangefinder. It has a strong introduction, though. I'm going to consult my table leader and see what he thinks.

When Table Leader 3 agreed that the sophistication of the thesis overrode what he called "errors or inelegancies" to make the paper a 4, Reader 3B responded enthusiastically, "Yeah, okay. Great! I feel very comfortable with that."

Thus, as can be seen, the interaction between table leaders and readers could be broadly characterized as
pleasant and congenial. Tending to perceive table leaders as helpful, readers turned to them for guidance; they showed respect for the table leaders, and some readers expressed a willingness to alter their scores if necessary, even though the table leaders did not convey a need for doing so. This respectful attitude is especially noteworthy in view of the fact that many of the readers in this study had previously served as table leaders several times. Thus, their deference may derive from respect for a colleague's judgment rather than from perceiving someone in an authoritarian position.

Questionnaire Responses About Monitoring

Additional insight into the nature of monitoring was provided by all 17 participants' responses to one section of the questionnaire. As indicated in Table 7, over three-fourths of the participants believed that regular discussions of sample papers (item 48) helped to maintain their awareness of group standards; in addition, 10 respondents stated that they "almost always" or "always" reexamined rangefinders or operational definitions if they needed to realign their scoring standards. Nearly two-thirds felt that it was "almost always" helpful to consult with table leaders or chief readers on problem papers, and 59% indicated that they frequently felt free to disagree with table leaders or chief readers if they considered them wrong.

TABLE 7

Questionnaire Results Dealing with Training

38) Easier to score essays in a structured setting than at home. Always/Almost Always: 10 (59%); Often/Sometimes: 3 (18%); Seldom/Never: 3 (18%); Other: 1 (6%) "same."

39) Being "off" in scoring a sample shakes confidence. Always/Almost Always: —; Often/Sometimes: 5 (29%); Seldom/Never: 12 (71%).

40) Helpful to consult with table leaders on problem papers. Always/Almost Always: 11 (65%); Often/Sometimes: 4 (24%); Seldom/Never: 2 (12%).

41) Feel free to disagree with table leaders/chief readers if wrong. Always/Almost Always: 10 (59%); Often/Sometimes: 6 (35%); Seldom/Never: 1 (6%).

42) A returned paper affects subsequent scoring process. Always/Almost Always: 1 (6%); Often/Sometimes: 11 (65%); Seldom/Never: 5 (29%).

44) Knowing that essays are checked is troublesome. Always/Almost Always: —; Often/Sometimes: 2 (12%); Seldom/Never: 15 (88%).

48) Sample papers help to keep group standards in mind. Always/Almost Always: 13 (76%); Often/Sometimes: 3 (18%); Seldom/Never: 1 (6%).

49) Use the rangefinders or definitions to realign standards. Always/Almost Always: 10 (59%); Often/Sometimes: 4 (24%); Seldom/Never: 3 (18%).

52) Discussions with readers, table leaders, and chief readers are collegial. Always/Almost Always: 16 (94%); Often/Sometimes: 1 (6%).

Thus, the questionnaire responses confirmed the role of table leaders as revealed through the logs and tapes — namely, that table leaders served as guides or consultants, rather than as authority figures. This guiding role may explain why 59% of the study participants agreed that it was "always" or "almost always" easier to score papers in a structured, monitored setting than at home. Congeniality was also important, as 94% agreed that their discussions of essays with other readers could "always" or "almost always" be characterized as collegial. In view of White's emphasis (1985) on the importance of a supportive, congenial atmosphere, this finding was important.

A picture of self-confidence emerged from the questionnaires, with 71% admitting that they were "seldom" or "never" shaken by being incorrect in the scoring of a sample essay. In a similar vein, 88% indicated that they were "seldom" or "never" bothered by knowing that their scores were being checked. Despite the apparent self-confidence, nearly two-thirds admitted that the return of a paper at least sometimes affected their scoring of the subsequent few papers. Their response suggests that whenever they are asked to reread an essay because of an inaccurate score, they may become particularly attentive to their own scoring processes — at least for the immediate period afterward.

A similar portrait of self-confidence emerged from the additional questions asked of scorers who had frequently served as table leaders or chief readers in the past. (See Table 8.) Two-thirds of the 12 respondents indicated that they "seldom" or "never" had difficulty in identifying the more discrepant essay out of those they were asked to referee, nor did they find it difficult to deal with readers who might be unwilling to adjust scores at the table leaders' suggestions. That the table leaders acknowledged the possibility of readers' being right was implied in the response of "often" or "sometimes" which three-fourths of the scorers gave when asked whether disagreement with a reader's score could cause them to reconsider their own judgment.

Those 12 responding to the additional section of the questionnaire did not agree that part of the table leaders' role was that of arbitrating standards: Four table leaders said it "almost always" was, four said it "sometimes" was, two said it "never" was, and two did not answer. Whether the question was ambiguous or whether the respondents interpreted their role in connection with the standards differently is not clear. What they did agree on substantially — with 83% marking "almost always" — was that monitoring is an effective means for helping the group to adhere to group standards.


TABLE 8
Additional Questions for 12 Table Leaders and Chief Readers

55) Read problem papers holistically and analytically: Always/Almost Always 11 (92%); Often/Sometimes 1 (8%)
56) Part of role is to arbitrate standards: Always/Almost Always 4 (33%); Often/Sometimes 4 (33%); Seldom/Never 2 (17%); Other 2 (17%, no response)
57) Disagreement with reader's score causes a reconsideration of own judgment: Always/Almost Always 1 (8%); Often/Sometimes 9 (75%); Seldom/Never 2 (17%)
59) In refereeing papers, hard to identify the discrepant score: Often/Sometimes 4 (33%); Seldom/Never 8 (66%)
60) Difficult to deal with uncooperative reader about altering score: Always/Almost Always 1 (8%); Often/Sometimes 3 (25%); Seldom/Never 8 (66%)
61) Monitoring is effective in helping group to adhere to group standards: Always/Almost Always 10 (83%); Often/Sometimes 2 (17%)


Their perspectives on monitoring appeared more fully in the open-ended question that asked all participants to comment on how monitoring affects the interplay between the reader and the essay in an holistic scoring. Repeatedly, the comments stressed the importance of table leaders' dealing courteously with readers, the importance of, as Reader 2D noted, reinforcing positively what readers are doing, and the importance of minimizing any intrusions into the scorers' reading processes. Most significantly, the comments stressed the value of monitoring as a beneficial procedure. Several of those writing from only a reader's perspective acknowledged some anxiety at being checked, but as Reader 2C admitted, this discomfort was good, serving to keep them on their toes when they were tired or when their minds had wandered. Reader 2A emphasized that "monitoring by an excellent table leader — and almost always they are — is a real help and support." The thoughtful comments by Readers 1A and 3B, noted below, stand as eloquent testimony to the benefits of monitoring:

Reader 1A: As a table leader, I have observed the monitoring process as a tempering of our individual prejudices and preconceived notions about how the papers should be graded. We must set aside our whims, caprices, and dogmatism in the interest of fairness and competency. Readers, table leaders, and chief readers balance papers against group standards, adjusting skillfully as we proceed.


Reader 3B: As a reader, I find the structure useful, supportive, reassuring, and congenial. I feel in touch with the standards. I have resource people available to me when I have questions. I think the formal setting helps me deal with essays fairly. The monitoring process makes the effort a collegial attempt to establish and share certain standards and values among professional colleagues, and the students benefit ultimately from that.

Therefore, as can be seen from the logs, questionnaires, and tapes, the participants in this study perceived the monitoring process to be a positive source of guidance and support. Rather than considering it as dogmatic or authoritarian, they envisioned the monitoring as a resource for scorers and a springboard for a discussion among professionals of the elements of writing. Both explicitly and implicitly the participants conveyed that scoring students' essays accurately and fairly was their ultimate goal. The summary and conclusions of the study are presented in Chapter 5.


CHAPTER 5
SUMMARY AND CONCLUSIONS

In this study the impact of monitoring on the holistic scoring of essays was explored. Although previous research on this topic has been limited, monitoring is central to the issues of the validity and reliability of holistic scoring as a writing assessment tool. For example, Charney (1984) argues that the reliability of holistic scoring derives from agreement on such superficial features of writing as handwriting or spelling, features which render holistic ratings — expected as they are to measure "substantive skills" — invalid. Charney suggests, furthermore, that the very need for training to help holistic scorers adhere to writing criteria which are both preselected and imposed on them by others renders the validity of the resulting holistic scores questionable. In contrast, White (1985) envisions holistic scorers as comprising an "assenting community" much like the "interpretive community" depicted by the reader response theorist Fish (1980). In White's view, training helps holistic scorers to own both the standards and the process. Thus, the purpose of this study was to determine how training and monitoring influence the writing judgments holistic scorers make.


Procedures Used

Both quantitative and qualitative measures were used. Eight high school, community college, and university teachers, all of whom were highly experienced holistic scorers, first rated at home over 50 expository essays written by college undergraduates. The essays were selected by a stratified random sampling procedure from essays written for a statewide assessment program two years previously. In addition to recording their scores for each essay — from a high of 4 to a low of 1 — the readers recorded their responses to the essays in written logs. An additional four readers, comprising a team of "special readers," scored a subset of the same essays and recorded on audiotapes their responses to the essays. A month later all 12 readers assembled for a formal, monitored holistic scoring in which 2 chief readers provided the training typical of formal holistic scorings; furthermore, 3 table leaders each monitored the ongoing scoring processes of 4 readers. The readers rated another set of similar expository essays, matched beforehand with the first set of papers on the basis of scores originally assigned during the actual holistic scoring two years previously. Again the readers recorded their responses to the essays either in logs or on tapes; all participants answered a questionnaire designed specifically for the study.
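The sampling and matching steps can be illustrated with a short, hedged sketch. The code below is not the procedure actually used in the study; the field names (essay_id, orig_score), the equal allocation across the four score levels, and the pairing rule are assumptions introduced only for illustration.

    import random
    from collections import defaultdict

    def stratified_sample(essays, per_level, seed=0):
        """Draw the same number of essays from each original score level (1-4)."""
        rng = random.Random(seed)
        by_level = defaultdict(list)
        for essay in essays:                      # group essays by their original score
            by_level[essay["orig_score"]].append(essay)
        sample = []
        for level in sorted(by_level):
            pool = by_level[level]
            sample.extend(rng.sample(pool, min(per_level, len(pool))))
        return sample

    def match_sets(set_a, set_b):
        """Pair each essay in set A with a still-unused essay in set B that was
        originally assigned the same score."""
        remaining = list(set_b)
        pairs = []
        for a in set_a:
            partner = next(b for b in remaining if b["orig_score"] == a["orig_score"])
            remaining.remove(partner)
            pairs.append((a["essay_id"], partner["essay_id"]))
        return pairs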


Issues Explored

Six sets of questions were addressed in the study:

1. Do the mean scores for the essays differ when the papers are evaluated by readers working in a monitored setting from when the papers are judged by the readers working independently?

A mixed-model analysis of variance for nested factors and repeated measures was used to answer the first question. Statistically significant results (p < .00005) were obtained when the mean scores of the 51 pairs of matched essays were compared in the unmonitored (e.g., at-home) condition and the monitored holistic scoring condition. The mean scores given to essays in the monitored condition proved to be lower than those given to the matched essays when the readers evaluated the first set of papers at home. The qualitative data supported the quantitative findings: The logs of the table leaders, together with the check reading results, indicated that readers who tended to drift high in the unmonitored scoring were pulled back in line with the standards during the monitored scoring.
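A present-day analogue of this comparison can be sketched with a mixed-effects model in which scoring condition is a fixed effect and the matched essay pairs contribute random variation. This is only a rough sketch, not the analysis reported here: the file name, the column names, and the simplified random-effects structure (a random intercept per essay pair rather than the full nested, repeated-measures design) are assumptions made for illustration.

    import pandas as pd
    import statsmodels.formula.api as smf

    # Long-format data assumed for illustration: one row per reader-by-essay score,
    # with columns pair_id (matched essay pair), reader, condition, and score (1-4).
    scores = pd.read_csv("scores_long.csv")

    # Condition (monitored vs. unmonitored) enters as a fixed effect; a random
    # intercept for each matched essay pair absorbs essay-to-essay differences.
    model = smf.mixedlm("score ~ C(condition)", data=scores, groups=scores["pair_id"])
    result = model.fit()
    print(result.summary())  # the condition coefficient approximates the difference in mean scores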


2. Do experienced readers participating in a monitored scoring achieve greater agreement with each other than when they evaluate essays independently?

An interrater reliability coefficient of over .91 — that is, .936 in the unmonitored condition and .915 in the monitored — was obtained with Cronbach's alpha in both scoring conditions. Basically, then, the eight readers keeping the logs agreed on the scores they assigned the essays. (As noted previously, the four special readers taping their responses to a subset of the essays were not included in the statistical procedures.) However, additional analysis revealed that on twice as many essays in the unmonitored condition as in the monitored condition a potential existed for discrepant scores among certain readers if two readers were paired. Thus, monitoring appeared effective in increasing agreement among the readers.

3. What impact do the chief readers have on an holistic scoring? How do they ensure both a reliable and a collegial reading?

A second Cronbach's alpha was run with the chief readers' scores included. As might be expected, the results of the second alpha did not differ substantially from the results of the first: a coefficient of .9474 was obtained in the unmonitored condition and .9358 in the monitored condition. Additional analysis showed that readers in the monitored condition more closely approximated the chief readers' scores than they did in the unmonitored scoring. If chief readers' scores can be considered "true" scores because of the chief readers' experience with, and commitment to, the standards, the monitoring appeared effective in helping readers score more accurately.
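Cronbach's alpha for a group of raters can be computed directly from an essays-by-readers score matrix, treating the readers as the "items." The short function below is a generic sketch of that calculation; the tiny example matrix is invented and is not the study's data.

    import numpy as np

    def cronbach_alpha(scores):
        """scores: 2-D array with one row per essay and one column per reader."""
        scores = np.asarray(scores, dtype=float)
        k = scores.shape[1]                          # number of readers ("items")
        item_vars = scores.var(axis=0, ddof=1)       # variance of each reader's scores
        total_var = scores.sum(axis=1).var(ddof=1)   # variance of the essay totals
        return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

    # Invented example: five essays scored by three readers on a 1-4 scale.
    example = [[3, 3, 4], [2, 2, 2], [4, 4, 3], [1, 2, 1], [3, 3, 3]]
    print(round(cronbach_alpha(example), 3))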


During the monitored scoring, which the researcher observed, the chief reader drew the readers in line with the standards not only through the brief comments he made about the sample essays but also in the very samples he selected for reviewing. Although the public tallying of scores on rangefinders and other training papers obviously entailed some peer pressure, no criticisms were ever made. Rather, readers whose scores appeared to be discrepant were asked in a general manner to look a particular paper over or to reconsider their scores. Through these means the professionalism of the readers was acknowledged.

4. What criteria do readers use in assigning different score levels? What standards are reflected in the score levels assigned across essays? How do readers respond to these standards?

The logs and taped protocols demonstrated the criteria that readers used in assigning different scores to papers. Most readers considered 4-level papers as strong or distinctive essays, reflecting a depth of ideas, solid development, good organization, and coherence. Because of the strengths they associated with 4 papers, readers tended to assign that score sparingly. Readers also made positive comments about 3-level papers, although their responses to essays at this level included some criticism of problems in either rhetorical or mechanical areas.


More negative than positive comments appeared about essays given scores of 2; the comments reflected particular concern about sentence structure, mechanics, and usage. Several readers commented on the shallowness and mechanical quality of many 2 papers. Unlike some work (Haswell, 1988) in which readers' judgments of 1-level papers appeared overly restricted and simplistic, readers of this study responded to a variety of problems in 1 papers, occasionally even singling out some good quality, such as concrete details. As might be expected, however, the responses to 1 papers were generally negative. In some cases, the essays themselves defied ready categorization by score level. Whereas some essays appeared to be classic 2's or clear-cut 3's, other papers contained qualities of more than one level; hence, in evaluating such papers, readers were required to balance strengths against weaknesses in an effort to determine which qualities prevailed. The special readers' taped protocols revealed the difficult, ongoing process of adjustment entailed in some scoring decisions. Even in the logs, a few readers voluntarily added pluses or minuses or arrows to convey the direction of a particular score and their own difficulty in making that determination.


On the issue of ownership of standards, questionnaire results revealed that the 17 participants in the study — the 12 readers, 3 table leaders, and 2 chief readers — agreed conceptually with holistic scoring and, with the exception of two participants, endorsed the standards used in the statewide assessment procedure. Even the two who responded that these standards were too low stated that they could work comfortably with the existing standards. Virtually all participants believed they had little, if any, trouble adhering to the standards, and most emphasized that they were able to distinguish between their task of assigning a score and the consequences the scores would have for the students.

5. Do any common patterns appear in the scorers' written or audiotaped responses to the essays, or do their comments underscore the individuality of each reader's transaction with the text? Do readers' holistic judgments, as shown by their written or verbal responses, correspond to the writing features they rate as important on a questionnaire devised for this study?

Another finding of the mixed-model ANOVA was that statistically significant differences (p < .001) existed among the readers. The logs and tapes underscored this individuality. For example, whereas two or three readers often responded to introductions and conclusions, other readers seemed especially attuned to matters of diction and content. Similarly, whereas a few readers tended to identify each error by naming it specifically, others used such umbrella terms as "language skills" or "mechanical problems" to label problems.


The individuality of the participants was again apparent in their self-reporting of biases and preferences on one section of the questionnaire. Although virtually all responded positively to creativity, humor, and evidence of a delightful writer, only some scorers liked rhetorical devices or technical/scientific papers, whereas others clearly did not. Similarly, whereas some viewed first-person narratives positively, others disliked this approach. The respondents to the questionnaire identified a number of different strategies for dealing with papers that triggered their personal biases and preferences; the methods included slowing their reading down, occasionally rereading an essay, reexamining the operational definitions, or consulting with the table leader. Still other evidence of the individuality of the 12 readers occurred in the varying degree to which some acknowledged or interacted with the writer behind the paper; it also occurred in the varying degree to which some readers — especially those doing the talking protocols — speculated as to the causes of certain errors and their possible remedies. Despite the evidence of individuality, readers clearly shared certain beliefs, especially regarding the importance of such elements as development, focus, and sentence structure. In this respect the readers resembled those in Sweedler-Brown's study (1985).


The writing criteria which most participants marked on their questionnaires as being "very important" or "important" often appeared in the logs and on the tapes, thereby corroborating the significance of these features to the readers. For example, four readers who admitted to valuing word choice highly responded frequently to the diction they saw in the essays; similarly, another reader who rated content as very significant also commented often in his logs about the quality of the ideas he saw in essays. However, discrepancies appeared occasionally as well, in that criteria which some readers rated as being only "somewhat important" or "not very important" — criteria such as introductions or conclusions — appeared frequently in those readers' comments in the logs and on tapes. This discrepancy could possibly be attributed to such factors as the ambiguity of the terms "somewhat" and "not very" on the questionnaire or the momentum of the scoring task itself, which might have prevented readers from expanding on their responses.

6. What is the nature of the monitoring that the readers receive as reported through the logs of table leaders and readers? Do the procedures noted in these logs, together with the protocols of the special readers, support the readers' perceptions of their own holistic scoring processes as noted on the questionnaire?

The monitoring did not appear dictatorial at any level. In fact, readers indicated on their questionnaires that they generally found their table leaders' monitoring to be helpful, especially when it was done with sensitivity.


This perception of helpfulness was confirmed by those tapes and logs which showed some readers either consulting the table leaders about problematic papers or discussing larger writing principles. The table leaders never insisted their scores were right; rather, they tended to discuss the qualities in the papers on which they had based their scores. In fact, several readers indicated on their questionnaires that they felt free to disagree with the table leaders. Many readers perceived the monitoring as a resource: For example, not only did one reader comment during the unmonitored scoring that she would have turned to a table leader if she could have, but another reader also wrote on her questionnaire that "the monitoring process makes the effort a collegial attempt to establish and share certain standards and values among professional colleagues, and the students benefit ultimately from that."

Discussion

As revealed through this study, monitoring comprises both a source of guidance and a springboard for discussions about writing principles. Hence, far from rendering this evaluation approach invalid, the recalibration in holistic scoring indeed appears to serve as a re-creation of what White (1985) calls an "assenting community" similar to the "interpretive community" discussed by the reader response theorist Fish (1980).


In the case of a scoring, the community is comprised of chief readers, table leaders, and the readers themselves; through their individual interactions with table leaders and their group tallying of samples and rangefinders, the participants in a scoring negotiate their individual responses to student texts in accordance with a framework of standards they not only recognize but also, and more importantly, adopt as their own. Of course, the readers in the unmonitored condition of this study also comprised an "assenting community" to the extent that they had internalized the standards, as their agreement with one another on many essays indicates. However, what the unmonitored condition does not provide is the opportunity for readers to discuss, to share, to debate, to determine, in the words of Fish (1980), "the interpretive strategies" (p. 171) they will use in responding to the texts.

Obviously, reader response theory cannot be applied too extensively to the assessment context, in which the very purpose for reading — that is, the evaluation of student essays — differs from the purposes involved in reading literary or informational material.


Student writers of assessment essays are not likely to be consciously helping to mold the interpretive community in the manner Fish implies that authors do; neither are student assessment essays truly reflective of the informational texts on which efferent transactions are based and which, according to Rosenblatt, require some consensus among readers. Nevertheless, the stance holistic readers must adopt falls toward the "predominantly efferent" or public end of the continuum as readers attempt to reconstruct meaning through what Rosenblatt refers to as the extracting and ordering of the ideas to be retained and used afterward (Rosenblatt, 1988, p. 5). In this context, the purpose is the assigning of a score to each student essay. The readers become engaged to varying degrees with these student texts and, as one special reader noted, some become more humane in the process. Notwithstanding the necessarily limited application of reader response theory to holistic scoring, this study certainly supports the reliability of holistic scoring as a means of writing evaluation; it supports the validity of holistic scoring as well. That is, the tapes and logs, while admittedly underscoring the readers' individuality, also illustrate that readers' responses were clearly affected during both scoring conditions by substantive criteria.


Moreover, the monitoring that entailed discussions of writing, the readers' stated willingness to disagree upon occasion with the table leaders or chief readers, the readers' universal perception of group congeniality in the process, and the use of operational definitions that emanated inductively from actual essays support the commonality of the criteria and the criteria-selection process used. Together, these convey the image of holistic scoring as a vital, interpretive enterprise in which readers attempt to apply standards to actual essays and eagerly seek guidance in troublesome cases.

Recommendations for Research

Certainly, the limitations of the study cannot be overlooked: Not only was the study limited to a select group of highly experienced scorers evaluating a small percentage of expository essays, but also these scorers had, to quote from Diederich et al. (1961), bought into "the party line" (p. 10) in that they endorsed the standards, as well as the monitoring procedures. The methods used for the study may have further contributed to the limitations. That is, the practice of recording comments either on audiotape or in logs — a practice new to these participants — may have made them more conscious of individual writing traits than is customary in most holistic scorings. In fact, Reader 3A stated outright that taping his responses orally was different from scoring the essays silently in that he had to find labels for the various and often complex problems he perceived in some papers; similarly, Reader 1C commented aloud that identifying the strengths or weaknesses in particular essays for the purpose of maintaining a log was sometimes hard to do.


Still other limitations may arise from the subjectivity involved in the self-reporting required of all participants by the questionnaire. Subjectivity was also entailed in the interpretations the researcher needed to make in categorizing the readers' written responses onto a database system. To be sure, the impact of this subjectivity was checked by the random validation of 20% of these logs by an outside expert. Nevertheless, this research needs to be replicated with monitored and unmonitored scorings in which no new methodologies are introduced. Such a study should include a less experienced group of holistic readers, in that scorers generally bring a range of scoring experience to actual assessments, and the study would therefore reflect more typical scorings. Moreover, additional studies could include a broader scoring scale — e.g., a scale of six or eight points — to determine whether the additional scoring levels make the task of balancing strengths and weaknesses in each paper easier.

The study raises a number of other writing issues requiring further research. As noted in the literature review, Barritt, Stock, and Clark (1986) found the readers' perceptions of the freshman writers behind the essay to have an impact on their scoring judgments of placement papers.


The findings of this study also suggest that readers' awareness of the writer behind the essay — especially in terms of either the writer's voice or the writer's ideas — can affect, to varying degrees, some evaluations of competency essays. As this finding was a corollary of other emphases in the study, this variable should be explored more fully under controlled conditions. A second issue is the criteria that scorers use in evaluating writing. The readers in this study clearly demonstrated that they often based their judgments on substantive elements of writing, just as Diederich et al. (1961), Freedman (1979), Breland and Jones (1984), and Huot (1988) concluded from their studies. This finding contrasted with the importance that mechanics played in studies by Harris (1977), Rafoth and Rubin (1984), and Stach (1987). At the same time, readers in this study did not record in their logs all the features they rated as being "important" or "very important" to them on their questionnaires; conversely, some features that several rated as being only "somewhat important" did appear in their logs and on their tapes. Because of these discrepancies, the factors involved in the judging of writing continue to be a research issue. In addition, the potential benefits and drawbacks of the taped protocols used in this study should be more closely examined. The special readers' taped responses to the essays appear to reflect what Elbow (1986) endorses as an evaluative technique — namely, providing writers with the "movies of the mind of the observer" (p. 181).


Yet, at the same time, as Freedman and Calfee (1983) and Martin (1987) note, problems often exist in articulating such processes as the evaluation of writing. As was noted previously in this study, whereas Reader 3D found that taping his comments made him a more humane reader, Reader 3A found it easier to verbalize on his tapes the definable elements of writing rather than those features less easily labeled. Additional research is also needed to explore parallels between the taping of writing evaluations and the conferencing techniques that are practiced in writing centers and classrooms. For example, just as the taped protocols of this study showed the special readers speculating as to the causes and probable remedies for errors, so, too, have conferences been advocated (Kroll & Schafer, 1978) as a means of diagnosing students' individual problems. More significantly, as both the taped protocols of writing evaluations and conferences can be used to dramatize for students a reader's ongoing response to a composition, research into prospective links between these two approaches may help to bridge the gap between writing assessment and writing instruction.

Conclusion

Holistic scoring is not an exact or quantifiable science, nor can any comprehensive evaluation of writing be truly precise. Some subjectivity is usually involved.


However, because of this subjectivity, monitoring — when sensitively done — increases reliability by encouraging readers to sublimate their own criteria in the larger interest of commonly adopted standards. The holistic scoring approach as shown in this study conveys a mutuality that encompasses the individuality of each reader's perspective while, at the same time, endorsing group criteria in the interest of fairness toward the students. Monitoring, as an integral part of this process, helps to maintain a unified community of readers who willingly seek to respond at the highest reading level, the evaluative, to the whole of each writer's text.


APPENDIX A
FORMS AND QUESTIONNAIRE


W. Wolcott

QUESTIONNAIRE ON HOLISTIC SCORING

The following questionnaire contains several parts. Please answer each question as honestly as possible, and add any comments you wish. All responses will remain confidential.

Scorer number _____
How long have you taught composition/English? _____
How long have you been doing holistic scoring? _____
Have you holistically scored exams other than CLAST? _____
What is your most common role during an holistic scoring for CLAST (e.g., table leader, reader)? _____

PART I: Please check the importance of the following criteria to you in judging any TIMED writing:

(4) Very Important   (3) Important   (2) Somewhat Important   (1) Not Very Important

1. Adequate controlling idea
2. Focus throughout the paper
3. Depth of thought
4. Development of ideas
5. Organization of ideas
6. Adequate introduction
7. Adequate conclusion
8. Commitment of writer to the topic
9. Unity and coherence
10. Appropriateness of tone
11. Creativity
12. Fluency of sentence style
13. Variety of sentence structure
14. Accuracy of diction
15. Maturity of diction
16. Avoidance of such sentence errors as fragments and run-on sentences


17. Avoidance of tangled sentences
18. Avoidance of usage errors
19. Avoidance of dialect errors
20. Avoidance of ESL errors

(4) Very Important   (3) Important   (2) Somewhat Important   (1) Not Very Important

21. Accuracy of spelling
22. Accuracy of punctuation
23. Accuracy of basic capitalization
24. Adequate length
Other: _____

How does your evaluation of essays that are timed differ from your evaluation of papers that are written outside?

25. All of us bring some personal biases and/or preferences to our readings. If you find yourself reacting strongly in a positive way to any of the following items, please place a plus (+) in the line next to the item. If you react negatively to any of the items, please place a minus (-) in the line next to the item.

a) Political papers
b) Controversial social issues
c) Religious papers
d) First-person narratives
e) Technical/scientific papers
f) Papers with literary allusions
g) Unusually creative papers
h) Misinformation in papers
i) Humor
j) Severe misspellings
k) Shallow papers
l) Rhetorical devices, such as questions
m) A disagreeable writer behind the paper
n) Hard-to-read handwriting
o) A delightful writer behind the paper
p) Extremely short papers
q) A weak conclusion or introduction
r) Inductively written papers
s) Sentimental papers
t) Slang
u) Others


When you encounter papers that trigger a strong personal response, how do you handle them?

(4) Often   (3) Sometimes   (2) Seldom   (1) Never


(6) (5) (4) (3) (2) (1)
Almost Always   Always   Often   Sometimes   Seldom   Never

40. Do you find it helpful to consult with the table leader (or chief readers, if you are functioning as a table leader) on problem papers?
41. Do you feel free to disagree with table leaders/chief readers if you consider them wrong?
42. When a paper is returned to you for re-reading, does that procedure affect your subsequent scoring of the next papers?
43. Are you able to separate your task — e.g., the assignment of a particular score — from the consequences of that score?
44. Does knowing that your essay responses are checked bother you in any way?
45. Do you feel pressured by the speed of other readers at your table to read more essays?
46. Does your physical comfort — in terms of room temperature, time of day, or seating arrangements — affect your scoring in any way? If so, please explain:
47. Do you feel comfortable with the standards that have been adopted for CLAST?
48. Does discussing sample papers at regular intervals help to keep you aware of group standards?
49. Do you re-examine either the rangefinders or the operational definitions if you need to realign your scoring standards?
50. When you encounter a problem paper, do you find yourself examining it almost analytically?
51. Does your perception of the writer behind the paper affect your scoring?
52. Can your discussions on essays with other readers, table leaders, or chief readers be characterized as collegial?

From past scorings, how would you characterize your scoring tendencies? Are you generous, strict, or fair? High or low? Please comment briefly:


PART IV: The following two questions are open-ended. Please respond as frankly as you can.

53. How do you arrive at your holistic score? Do you make an immediate judgment as to score level, or do you read top-down, lowering the score as you encounter problems? Or do you first decide on the half that the paper falls into before you decide on the actual score?
54. Please comment briefly from the perspective you know best (e.g., reader, table leader, chief reader) on how the monitoring in a formal holistic scoring appears to affect the interaction between the reader, the essay, and the holistic scoring process.

PART V: If you have often served in the capacity of chief reader or table leader, would you please answer the following questions:

(6) (5) (4) (3) (2) (1)
Almost Always   Always   Often   Sometimes   Seldom   Never

55. When you are given problem papers to review, do you read them holistically AND analytically?
56. Do you see part of your role as that of arbitrating standards?
57. Does disagreement with a reader's score (or table leader's score) ever cause you to reconsider your own judgment?
58. When refereeing papers, can you separate your task of assigning a score from the consequences that your score will have on a student?
59. In refereeing papers with noncontiguous scores (as opposed to 2/1 papers), do you find it hard to identify which score is the more discrepant?
60. Do you find it difficult to deal with a reader who is unwilling to alter the score at your suggestion?
61. Do you believe that monitoring is effective in helping the group to adhere to group standards?


PILOT TEST OF QUESTIONNAIRE

Please try to complete the following questionnaire, and respond to the questions below:

1. Are the directions for each section clear?
2. Is the wording of the individual items clear?
3. Should any portions or items on the questionnaire be taken out? If so, please indicate below which items and why.
4. Have any important issues in holistic scoring been omitted from the questionnaire? Please indicate below items you think should have been included.
5. Does this questionnaire seem fair and comprehensive to you? Would you be able to answer it readily after participating in an holistic scoring?
6. Do you have any other concerns about the questionnaire?


SAMPLE OF A DATA BASE LOG ENTRY

THE RECORD NUMBER IS: 213
READER ID: 2A
PACKET NO: V
MON OR UN: M
PAPER ID: 005
SCORE: 3
TOPIC: CERTIFIED NURSE'S ASSISTANT

Remaining fields in the log entry: QUAL IDEA, COMM QUAL, FOCUS, COMM FOCUS, DEVELOPMT, COMM DEVEL, ORG STRUCT, COMM ORG, STYLE TONE, COMM STYLE, APPROACH, COMM APPR, DICTION, COMM DICTION, SENT STRUC, COMM STRUC, MECH PROB, COMM MECH, SENT ERROR, COMM SENT, USAGE, COMM USAGE, DIAL ESL, COMM DIAL, PUNCT CAPS, COMM PUNCT, SPELLING, COMM SPELL, LENGTH, COMM LGTH, HANDWRITING, COMM HAND, WRITR ROLE, COMM WRITR, OVERALL, COMM OVRLL, SCR B SMP, SCR A SMP, COMM SMP, SCR B TBLL, SCR A TBLL, COMM TBLL, B CHK READ, A CHK READ

Comments recorded in this entry: + VERY CONCRETE ESSAY; MANY GOOD EXAMPLES; AWK. SENT. STRUCT. AT TIMES; SOME ERRORS; TABLE LEADER READ AND AGREED
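For readers who want to see the shape of such a record outside the database software, it can be rendered as a simple field-to-value mapping. The sketch below is hypothetical: the lower-case field names and the placement of the handwritten comments in particular comment fields are assumptions made only for illustration, not a transcription of the study's database.

    # Hypothetical rendering of the sample record above (fields abridged).
    log_entry = {
        "record_number": 213,
        "reader_id": "2A",
        "packet_no": "V",
        "mon_or_un": "M",            # monitored ("M") or unmonitored scoring
        "paper_id": "005",
        "score": 3,
        "topic": "Certified nurse's assistant",
        "comm_devel": "+ very concrete essay; many good examples",
        "comm_struc": "awk. sent. struct. at times",
        "comm_mech": "some errors",
        "comm_tbll": "table leader read and agreed",
    }
    print(log_entry["reader_id"], log_entry["score"])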




Descriptions of the Levels of the CLAST Ratings

Score of 4: Writer purposefully and effectively develops a thesis. Writer uses relevant details, including concrete examples, that clearly support generalizations. Paragraphs carefully follow an organizational plan and are fully developed and tightly controlled. A wide variety of sentences occur, indicating that the writer has facility in the use of language, and diction is distinctive. Appropriate transitional words and phrases or other techniques make the essay coherent. Few errors in syntax, mechanics, and usage occur.

Score of 3: Writer develops a thesis but may occasionally lose sight of purpose. Writer uses some relevant and specific details that adequately support generalizations. Paragraphs generally follow an organizational plan and are usually unified and developed. Sentences are often varied, and diction is usually appropriate. Some transitions are used, and parts are usually related to each other in an orderly manner. Syntactical, mechanical, and usage errors may occur but usually do not affect clarity.

Score of 2: Writer may state a thesis, but the essay shows little, if any, sense of purpose. Writer uses a limited number of details, but they often do not support generalizations. Paragraphs may relate to the thesis but often will be vague, underdeveloped, or both. Sentences lack variety and are often illogical, poorly constructed, or both. Diction is pedestrian. Transitions are used infrequently, mechanically, and erratically. Numerous errors may occur in syntax, mechanics, and usage and frequently distract from clarity.

Score of 1: Writer's thesis and organization are seldom apparent, but, if present, they are unclear, weak, or both. Writer uses generalizations for support, and details, when included, are usually ineffective. Underdeveloped, ineffective paragraphs do not support the thesis. Sentences are usually illogical, poorly constructed, or both. They usually consist of a series of subjects and verbs with an occasional complement. Diction is simplistic and frequently not idiomatic. Transitions and coherence devices, when discernible, are usually inappropriate. Syntactical, mechanical, and usage errors abound and impede communication.


APPENDIX B
SCORING CHARTS AND ADDITIONAL FIGURES


Associate Chief Reader's Log
January 7, 1989

8:30   Comments about the study procedures
8:34   Orientation by the chief reader
8:35   Rangefinders D, I, M, T, W, LL
8:50   Break
9:07   Samples FF
9:15   Samples JJ, CC
9:22   Sample E
9:24   "Live" papers
10:00  Begin check reading #1
10:04  Break
10:16  Samples N, U
10:23  "Live" papers
11:13  Break
11:30  Samples BB, V
11:27  "Live" papers
11:48  Begin check reading #2
12:21  Lunch
1:30   Samples X, P
1:36   "Live" papers
2:26   Break
2:30   "Live" papers
3:40   End of reading

Note: Total reading time: 254 minutes = 4 hours, 23 minutes

Figure B-1. Account of procedures.


Reader 1A: Score 2.  Reader 1B: Score 1.  Reader 1C: Score 2.  Reader 1D: Score 2.

Quality of Ideas: Thesis clear/3 parts
Development: Thin on development; Paragraphs not dev.; Needs more dev. with specifics; Minimal development
Organization: Organization poor
Overall Comment: Several wrong words; Element. sentences

Figure B-2. Summary of comments for paper 034.


Reader 2A: Score 2.  Reader 2B: Score 2.  Reader 2C: Score 2.  Reader 2D: Score 1.

Quality of Ideas: Weak content; Well thought out; Content terrible
Development: Develops thesis; Not very developed; Does not support points
Organization: Organization OK
Sentence Structure: Awkward syntax; Errors in syntax
Mechanical Problems: Numerous errors; Language skills not strong
Spelling: Errors in spelling

Figure B-2. (Continued)


Reader 3A: Score 2.  Reader 3B: Score 2.  Reader 3C: Score 2.  Reader 3D: Score 2.

Quality of Ideas: Good, supported thesis; Not specific enough; First par. all thesis
Development: No sense of detail
Organization: Problems in introduction
Approach: Diction is a problem
Sentence Structure: Poor sentence logic; Some awkwardness; Competent sent. structure
Usage Errors: Unclear pronoun reference; Vague pronoun reference; Unclear pron. reference
Spelling: Misspellings; Incorrect spelling of "lured"; "lured," "field," "though"; Misspellings: "knowledge"
Overall Comment: Good, acceptable paper; Paper barely makes sense with vague pronoun reference

Figure B-2. (Continued)


Reader 1A: Score 3.  Reader 1B: Score 3.  Reader 1C: Score 2.  Reader 1D: Score 3.

Quality of Ideas: Problems in logic; A weak argument
Development: Much detail; Good supporting details
Organization/Structure: OK except for weak conclusion
Style/Tone: Good paragraph unity
Approach: Conjecture
Sentence Structure: Some diction problems
Mechanical Problems: Not many basic errors
Length: Some words omitted
Handwriting: Difficult to read in places

Figure B-3. Summary of comments for paper 088.


Reader 2A: Score 2.  Reader 2B: Score 2.  Reader 2C: Score 3.  Reader 2D: Score 3.

Quality of Ideas: Content seemingly OK; Content good
Development: Some generalities
Style/Tone: Some awkward phrasing; Some awkward phrasing; Awkward word choice; Wordiness hinders style
Sentence Structure: Awkward sentence structure hurts content; Awkward sentence structure
Mechanical Problems: Not many basic errors; A few language errors
Sentence Errors: Comma splice
Usage: Many errors in usage
Punctuation/Capitalization: Needs to review commas
Writer's Role: Difficult to follow

Figure B-3. (Continued)


Reader 3A: Score 2.  Reader 3B: Score 3.  Reader 3C: Score not legible.  Reader 3D: Score 2.

Quality of Ideas: Lack of certain clear logic; Thesis doesn't follow from introduction
Development: Numerous but innocuous details; There is development but some developing is not well done; Generalizations too long, need specifics
Organization/Structure: Terse conclusion; Weak conclusion
Style/Tone: Sophisticated but incorrectly done
Approach: Lists aren't effective; "Effectivity" horrible!; Diction isn't clear, isn't apt
Sentence Structure: Horrible use of passive; A good sentence; Sentences are often awkward
Sentence Errors: Run-on sentences
Dialect/ESL: Verbs/verb tenses
Punctuation/Capitalization: Needs to review commas
Spelling: Use of "than" when writer means "then"
Handwriting: Difficult to read copy; Xerox copy poor; Difficult to read copy
Writer's Role: Student needs help with conditionals; Writer knows what he wants to say and is saying it relatively well; Paper is shallow; Evidence of thinking, of revision

Figure B-3. (Continued)


[Chart: Scores of Table Leader 1, Table Leader 2, and Table Leader 3 on the rangefinders]


REFERENCES

Allen, C. L. (1976). A study of the effect of selected mechanical errors on teachers' evaluation of the nonmechanical aspects of students' writing. Dissertation Abstracts International, 37, 09A (p. 5554-A). (University Microfilms No. 76-15,827)

Bamberg, B. (1982). Multiple-choice and holistic essay scores: What are they measuring? College Composition and Communication, 33, 404-406.

Barritt, L., Stock, P. L., & Clark, F. (1986). Researching practice: Evaluating assessment essays. College Composition and Communication, 37, 315-327.

Bauer, B. A. (1981). A study of the reliabilities and the cost-efficiencies of three methods of assessment for writing ability. Research at the University of Illinois. (ERIC Document Reproduction Service No. ED 216 357)

Berdie, D. R., & Anderson, J. F. (1974). Questionnaires: Design and use. Metuchen, NJ: The Scarecrow Press, Inc.

Bleich, D. (1975). Readings and feelings: An introduction to subjective criticism. Urbana, IL: National Council of Teachers of English.

Braddock, R., Lloyd-Jones, R., & Schoer, L. (1963). Research in written composition. Urbana, IL: National Council of Teachers of English.

Breland, H. M., Camp, R., Jones, R. J., Morris, M. M., & Rock, D. A. (1987). Assessing writing skill (Research Monograph 11). New York: College Entrance Examination Board.

Breland, H. M., & Jones, R. J. (1984). Perceptions of writing skills. Written Communication, 1, 101-119.

CCCC Task Force on the Preparation of Teachers of Writing. (1982). Position statement on the preparation and professional development of teachers of writing. College Composition and Communication, 33, 446-449.


Charney, D. (1984). The validity of using holistic scoring to evaluate writing: A critical overview. Research in the Teaching of English, 18, 65-81.

Chase, C. I. (1968). The impact of some obvious variables on essay test scores. Journal of Educational Measurement, 5, 315-318.

Collins, J. L., & Williamson, M. M. (1984). Assigned rhetorical context and semantic abbreviation in writing. In R. Beach & L. S. Bridwell (Eds.), New directions in composition research (pp. 285-296). New York: The Guilford Press.

Cooper, C. R., & Odell, L. (1977). Evaluating writing: Describing, measuring, judging. Urbana, IL: National Council of Teachers of English.

Davis, B., Scriven, M., & Thomas, S. (1987). The evaluation of composition instruction (2nd ed.). New York: Teachers College Press.

Diederich, P. B. (1974). Measuring growth in English. Urbana, IL: National Council of Teachers of English.

Diederich, P., French, J., & Carlton, S. (1961). Factors in judgments of writing ability. Princeton, NJ: Educational Testing Service. (ERIC Document Reproduction Service No. ED 002 172)

Elbow, P. (1986). Embracing contraries. New York: Oxford University Press.

Faigley, L., Cherry, R. D., Jolliffe, D. A., & Skinner, A. M. (1985). Assessing writers' knowledge and processes of composing. Norwood, NJ: Ablex.

Fish, S. (1980). Is there a text in this class? The authority of interpretive communities. Cambridge, MA: Harvard Univ. Press.

Follman, J. C., & Anderson, J. A. (1967). An investigation of the reliability of five procedures for grading English themes. Research in the Teaching of English, 1, 190-200.

Freedman, S. W. (1979). How characteristics of student essays influence teachers' expectations. Journal of Educational Psychology, 71(3), 328-338.


Freedman, S. W. (1981). Influences on evaluators of expository essays: Beyond the text. Research in the Teaching of English, 15, 245-255.

Freedman, S. W. (1984). The registers of student and professional expository writing: Influences on teachers' responses. In R. Beach & L. S. Bridwell (Eds.), New directions in composition research (pp. 334-347). New York: The Guilford Press.

Freedman, S. W., & Calfee, R. C. (1983). Holistic assessment of writing: Experimental design and cognitive theory. In P. Mosenthal, L. Tamor, & S. A. Walmsley (Eds.), Research on writing: Principles and methods (pp. 75-98). New York: Longman.

Godshalk, F., Swineford, F., & Coffman, W. E. (1966). The measurement of writing ability (Research Monograph No. 6). New York: College Entrance Examination Board.

Grobe, C. H. (1981). Syntactic maturity, mechanics, and vocabulary as predictors of quality ratings. Research in the Teaching of English, 15, 75-85.

Hake, R. L., & Williams, J. M. (1981). Style and its consequences: Do as I do, not as I say. College English, 43, 433-451.

Harris, W. H. (1977). Teacher response to student writing: A study of the response patterns of high school English teachers to determine the basis for teacher judgment of student writing. Research in the Teaching of English, 11, 175-185.

Haswell, R. H. (1988). Dark shadows: The fate of writers at the bottom. College Composition and Communication, 39, 303-315.

Hirsch, E. D., Jr. (1977). The philosophy of composition. Chicago: The Univ. of Chicago Press.

Hoetker, J., & Brossell, G. (1986). A procedure for writing content-fair essay examination topics for large-scale writing assessments. College Composition and Communication, 33, 377-392.

Hrach, E. (1983). The influence of rater characteristics on composition evaluation practices. Dissertation Abstracts International, 45, 02A (p. 440). (University Microfilms No. DEQ 84-10956)


Huot, B. A. (1988). The validity of holistic scoring: A comparison of the talk-aloud protocols of expert and novice holistic raters. Dissertation Abstracts International, 49, 08A (p. 2188). (University Microfilms No. DA 8817872)

Janopoulos, M. (1987). The role of comprehension in holistic evaluation of second-language writing proficiency at the university level. Dissertation Abstracts International, 48, 05A (p. 1137). (University Microfilms No. DET 87-17654)

Keech, C. (1982). Practice in designing writing test prompts: Analysis and recommendations. In J. R. Gray & L. P. Ruth (Eds.), Properties of writing tasks: A study of alternative procedures for holistic writing assessment (pp. 132-214). University of California, Graduate School of Education, Bay Area Writing Project. (ERIC Document Reproduction Service No. ED 230 576)

Kroll, B. M., & Schafer, J. C. (1978). Error-analysis and the teaching of composition. College Composition and Communication, 29, 242-248.

Marshall, J. C. (1972). Writing neatness, composition errors, and essay grades reexamined. The Journal of Educational Research, 65, 213-215.

Martin, W. (1987). A study of reader process in the evaluation of English placement essays. Dissertation Abstracts International, 48, 08A. (University Microfilms No. 87-24759)

McColly, W. (1970). What does educational research say about the judging of writing ability? The Journal of Educational Research, 64, 148-156.

Mishler, C., & Hogan, T. P. (1982). Holistic scoring of essays: Remedy for evaluating the third R. Diagnostique, 8, 4-16.

Murphy, S., Carroll, K., Kinzer, C., & Robyns, A. (1982). A study of the construction of the meanings of a writing prompt by its authors, the student writers, and the raters. In J. R. Gray & L. P. Ruth (Eds.), Properties of writing tasks: A study of alternative procedures for holistic writing assessment (pp. 336-471). University of California, Graduate School of Education, Bay Area Writing Project. (ERIC Document Reproduction Service No. ED 230 576)


Myers, M. (1980). A procedure for writing assessment and holistic scoring. Urbana, IL: ERIC Clearinghouse on Reading and Communication Skills and the National Council of Teachers of English.

Neilsen, L., & Piche, G. L. (1981). The influence of headed nominal complexity and lexical choice on teachers' evaluation of writing. Research in the Teaching of English, 15, 65-73.

Nold, E. W., & Freedman, S. W. (1977). An analysis of readers' responses to essays. Research in the Teaching of English, 13, 164-174.

Paden, P. (1986). The potential dual effect of context effects and score level effects on the assignment of scores to essays (Report No. 143). Princeton, NJ: Educational Testing Service. (ERIC Document Reproduction Service No. ED 280 852)

Paulis, C. (1985). Holistic scoring: A revision strategy. Clearing House, 59, 57-60.

Rafoth, B. A., & Rubin, D. L. (1984). The impact of content and mechanics on judgments of writing quality. Written Communication, 1, 446-458.

Raymond, J. C. (1982). What we don't know about the evaluation of writing. College Composition and Communication, 33, 399-403.

Roberts, D. H. (1982). Individualized writing instruction in southern West Virginia colleges: A study of the acquisition of writing fluency. Dissertation Abstracts International, 43, 05A (p. 1525). (University Microfilms No. DD J82-24102)

Roberts, D. H. (1983). Experimental research in written composition: A critical view. (ERIC Document Reproduction Service No. ED 238 006)

Rosenblatt, L. M. (1985). The transactional theory of the literary work: Implications for research. In C. Cooper (Ed.), Researching response to literature and the teaching of literature. Norwood, NJ: Ablex.

Rosenblatt, L. M. (1988). Writing and reading: The transactional theory (Tech. Rep. No. 13). Berkeley: University of California and Carnegie Mellon University, Center for the Study of Writing.


Shaughnessy, M. (1980). Statement on criteria for writing proficiency. Journal of Basic Writing, 3, 115-119.

Shoaf, J. S. (1985). Measuring gain in writing proficiency qualitatively and quantitatively. Dissertation Abstracts International, 46, 11A (p. 3275). (University Microfilms No. DES 85-27322)

Spandel, V., & Stiggins, R. (1981). Direct measures of writing skill: Issues and applications (Rev. ed.). Portland, OR: Northwest Regional Library. (ERIC Document Reproduction Service No. ED 213 035)

Stach, C. L. (1987). The component parts of general impressions: Predicting holistic scores in college-level essays. Dissertation Abstracts International, 48, 07A. (University Microfilms No. DES 87-22706)

Stewart, M. F., & Grobe, C. H. (1979). Syntactic maturity, mechanics of writing, and teachers' quality ratings. Research in the Teaching of English, 13, 207-215.

Sullivan, F. J. (1986). Placing texts, placing writers: Sources of readers' judgments in university placement testing (Technical Research Report 143). (ERIC Document Reproduction Service No. ED 285 177)

Sweedler-Brown, C. O. (1985). The influence of training and experience on holistic essay evaluations. English Journal, 74(5), 49-55.

Westcott, W., & Gardner, P. (1984). Holistic scoring as a teaching device. Teaching English in the Two-Year College, 11(2), 35-39.

White, E. M. (1985). Teaching and assessing writing. San Francisco: Jossey-Bass.

Winer, B. J. (1971). Statistical principles in experimental design (pp. 375-378). New York: McGraw-Hill.

Winters, L. (1978). The effects of differing response criteria on the assessment of writing competence. Los Angeles, CA: Center for the Study of Evaluation. (ERIC Document Reproduction Service No. ED 212 659)


BIOGRAPHICAL SKETCH

Willa Buckley Wolcott was born and raised in the White Mountains in New Hampshire. She graduated as a Wellesley College Scholar from Wellesley College, Wellesley, Massachusetts, in 1964 with a B.A. in English, and she received her M.A. degree in English from the University of Denver in 1969. She taught at the secondary level in Massachusetts and Colorado and worked with adult education programs in Denver and Connecticut. She has coordinated the Writing Center in the Office of Instructional Resources at the University of Florida since founding the Center, together with a colleague, in 1977. She serves as one of the chief readers for the statewide holistic scorings of essays written for the College Level Academic Skills Test and the Florida Teacher Certification Examination. She has had articles published in the Writing Lab Newsletter, The Writing Center Journal, the Journal of Basic Writing, and College Composition and Communication. She is a member of the National Council of Teachers of English and of the Pi Lambda Theta and the Kappa Delta Pi honorary societies. She is married and has two children away at college.


I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Ruthellen Crews, Chair
Professor of Instruction and Curriculum

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Margaret Early
Professor of Instruction and Curriculum

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Wright
Associate Professor of Instruction and Curriculum

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Forrest Parkay
Professor of Educational Leadership


I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Sandra Damico
Professor of Foundations of Education

This dissertation was submitted to the Graduate Faculty of the College of Education and to the Graduate School and was accepted as partial fulfillment of the requirements for the degree of Doctor of Philosophy.

December 1989

Dean, College of Education

Dean, Graduate School

