AN APPLICATION OF GENERALIZABILITY THEORY
TO THE ASSESSMENT OF WRITING ABILITY
By
MARIA MAGDALENA LLABRE
A DISSERTATION PRESENTED TO THE GRADUATE COUNCIL OF
TIE UNIVERSITY OF FLORIDA
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE
DEGREE OF DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
1978
ACKNOWLEDGEMENTS
The members of my Doctoral Committee deserve special recognition
for their assistance with this dissertation. The chairman of my
committee, Dr. William B. Ware, has my deepest respect and admiration.
His standard of excellence has served as a model for me. To him I am
indebted for providing innumerable opportunities for learning.
Dr. Linda M. Crocker has also been most influential during my graduate
program. I appreciate her sound advice and consistent encouragement.
My sincere gratitude goes to Dr. Ramon C. Littell for the support he
has given me along with many explanations of statistical methods. I
also appreciate the continuous guidance of Dr. John M. Newell.
I would also like to thank Dr. James II. Goodnight of the SAS
Institute for his invaluable assistance with the data analysis.
To my friends Mary Lynn, Barbara Boss, and Shirley Bowes, I am
grateful for all the hours they spent rating the compositions without
losing their sense of humor. The help of Richard Thompson in facilitating
the analysis is gratefully recognized. Special thanks go to Louise
Stephenson for typing this manuscript.
Finally, I acknowledge the personal support of my husband, Brainard,
whose encouragement and understanding have given me the strength to carry
on.
TABLE OF CONTENTS
ACKNOWLEDGEMENTS . . . . . . . . . . . .
ABSTRACT . . . . . . . . ...... . .
CHAPTER
I. INTRODUCTION . . . . . . . . . . .
The Terminology of Generalizability Theory
Purpose of the Study .. . ......
Statement of the Problem .. . .....
Significance of the Study ...
II. REVIEW OF THE LITERATURE . . . . .
The Assessment of Writing Ability ..
Sources of Error in Essay Tests
Generalizability Theory .....
Variance Component Estimation ..
Summary . . . . .
III. METHOD . . . . . . . . .
The Sample . . . . . . . .
The Writing Samples: Data Collection .
The Facets . . . . . . . .
Design . . . . . . . . .
Variance Component Estimation ..
Generalizability Coefficients ..
The Error Variance 2(A) .. . ....
Summary . . . . .
IV. RESULTS . . . .
Estimates of Variance Components .
Test of Homoscedasticity Assumption
Generalizability Coefficients .
The Error Variance o2(A) .. ...
Supplementary Analysis ..
Summary . .. .
v
11
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
TABLE OF CONTENTS Continued
CHAPTER PAGE
V. DISCUSSION . . . . . . . . ... .. . 64
Interpretation of Variance Components . ... .... . 65
Usefulness of Generalizability Theory .... ...... 70
Summary and Conclusions .. ... . ... .. . 73
REFERENCES ... ............... .... . . 75
APPENDIX A: Point Estimates of the Variance Components
As Linear Combinations of Mean Squares for
the SplitPlot Factorial Design With
Balanced Data . ... .. ...... .81
BIOGRAPHICAL SKETCH ......... ..... .... 83
Abstract of Dissertation Presented to the Graduate Council
of the University of Florida in Partial Fulfillment of the Requirements
for the Degree of Doctor of Philosophy
AN APPLICATION OF GENERALIZABILITY THEORY
TO THE ASSESSMENT OF WRITING ABILITY
By
MARIA MAGDALENA LLABRE
August 1978
Chairman: William B. Ware
Major Department: Foundations of Education
Classical reliability theory, as used in the social sciences, has
been restricted by a model which specifies one undifferentiated error
component. This restriction has limited the applicability of the model
and has obscured its interpretation. Recent advancements in psychometric
theory provide more flexible models which permit the investigation of
multiple sources of error variation. Under the rubric of generalizability
theory, these methods are based on R. A. Fisher's work on the analysis of
variance and the factorial experiment.
Generalizability theory is potentially very useful in many areas
of research suffering from inconsistency of measurement. In particular,
the theory is applicable to the assessment of writing ability from
written compositions. However, applied studies in this area are lacking.
The literature on the measurement of writing ability has identified
several sources of error affecting the reliability of written compositions.
The most common sources of error noted are inconsistency across raters,
modes, and occasions. In spite of the recognition of these sources of
variation, most researchers who have studied the reliability of written
composition have examined the issue only in terms of interrater
reliability. Implicit in the concept of interrater reliability is the
assumption that fluctuations among raters are the only errors in the
model. This study incorporated three facets: raters, modes, and
occasions, in a splitplot factorial design in order to examine the
results obtained by taking into account more than one source of error
through the methodology of generalizability theory.
Samples of writing from 104 fourth graders were obtained under
selected mode and occasion conditions. Each sample was scored by four
trained raters. In the design, the students were considered as nested
within a higher classification, the classes. The number of students in
each class was not constant. Therefore, this study also extended the
principles of generalizability theory to unbalanced designs.
Point estimates of the variance components for all effects in the
model were obtained through the MIVQUE method. Negative estimates were
replaced by zeros. The relative magnitude of the estimates indicated
that students could be differentiated on the basis of their ratings.
However, the classes as units could not be distinguished. The estimates
also showed that errors resulting from variability in the quality of
writing across occasions and modes outweigh those stemming from differences
among raters. Furthermore, occasions represented a greater source of
error than modes. With training and practice, raters can consistently
score the writing samples of students using a general impression method.
Assuming homogeneity of variance, unbiased generalizability coeffi
cients were obtained for seven universes of generalization. These
universes represented generalization across one facet, two facets, or
all three facets simultaneously. The coefficients indicated that, to
obtain acceptable levels of generalizability, at least six samples of
writing from each person are necessary.
The standard error of measurement which may be used in constructing
confidence intervals around a person's universe score was also examined.
The results from this examination paralleled those based on the
generalizability coefficients.
A supplementary analysis which allowed a comparison of the estimates
obtained through the MIVQUE method to those derived using expected mean
squares, resulted in similar values for all estimates in a model without
the classes effect. These results were interpreted as lending support
to the MIVQUE method.
It was concluded that generalizability theory is very useful for
clarifying problems in estimating reliability in the area of writing
ability. Furthermore, the theory need not be limited to situations with
balanced data. Valid methods of variance component estimation documented
in the statistical literature may be used with unbalanced designs.
CHAPTER I
INTRODUCTION
The concept of reliability in educational research has undergone
notable refinements with a resulting increase in clarity and applica
bility. However, these conceptual developments have not been matched
by applications in many content areas. For example, the reliability
of essay tests still represents a confusing issue, partly because of
the continued use of the classical model for its investigation. This
study represents an attempt to "bridge the gap" between some recognized
methodological needs in the field of written language arts and advance
ments in measurement theory.
Classical reliability theory has been based on a model (originated
by Spearman in 1904) which states that a person's observed score is the
sum of a true score component and an undifferentiated error component
as shown below:
(1) X= + e
The true and error components are assumed to be independent of each
other. Therefore, the variance of the observed scores for a group of
individuals can be partitioned into the sum of independent variance
components as shown in equation (2).
(2) a2 = ( + oa .
The reliability of a test is then defined as the ratio of the true score
variance to observed score variance.
(3) rxx =
o;
2 2
Since oa and o0 are unknown, in practice reliability is estimated
by computing the correlation between parallel forms of the test. In
order for tests to be parallel, they must have equal means, equal
variances, and equal intercorrelations among items. From these
restrictions and the assumptions imposed on the model (1), it can be
shown that if two tests are parallel, their correlation equals (3)
above (for proof see Magnusson, 1967).
In addition to the restriction of parallelism, classical theory
considers the error component to be undifferentiated, that is, various
sources of inconsistency which may affect the reliability of the test
are grouped together in a single error term. Different procedures for
constructing parallel tests (e.g. testretest, splithalf) make different
assumptions about what constitutes the source of error in the model.
Therefore, following the classical model, more than one interpretation
of the same error component is possible.
The limitations of the classical model mentioned above render it
inefficient in many real life situations for several reasons. First,
the condition of parallelism is seldom met in the real world. It is
common to find that supposedly parallel tests have different means.
When tests have different means, the formulas which assume equality
provide an underestimate of the reliability (Ebel, 1951). Second, by
including only one error component which changes in meaning depending
on the method of obtaining parallel forms of the test, the classical
model can lead to some confusion. Unless the type of coefficient is
reported, the model provides no clues for the interpretation of the
error component. Finally, when more than one coefficient is desired
under the classical model, more than one study must be conducted. As a
result, the model does not allow for the consideration of error result
ing from interactions among sources.
In order to overcome these deficits inherent in the classical model,
some measurement specialists have adopted R. A. Fisher's conceptualiza
tion of the factorial experiment, a method of classifying observations
along more than one dimension; and the analysis of variance, a procedure
which partitions total variability into identifiable sources. These two
powerful tools have allowed for the possibility of releasing the restric
tion of parallelism and have provided a systematic approach to the simul
taneous consideration of multiple sources of error variation.
The applicability of these concepts to social science research and
specifically to the reliability problem was explicitly discussed by
Lindquist (1953). Since then, these techniques have been widely used in
testing hypotheses about group differences but only rarely in assessing
reliability.
More recently, Cronbach and his colleagues have assembled all of the
work which has been done along these lines under the rubric of generali
zability theory. The synthesis of their efforts is described in their
1972 book entitled The Dependability of Behavioral Measurements: Theory
of Generalizability for Scores and Profiles. Basically, generalizability
theory uses the analysis of variance approach in the estimation of relia
bility. Rather than emphasizing the computation of reliability coeffi
cients as the classical theory does, the emphasis is on the estimation of
variance components for all identifiable sources incorporated into the
design. The theory allows for unequal means, decomposes the error term
into separate sources, and requires the explicit consideration of the
factors identifying the population of measures being studied.
If desired, the variance components can be used in the computation
of "generalizability coefficients." These are intraclass correlations
analogous to reliability coefficients. Within a less restrictive model
which partitions the variance into several sources, more than one
coefficient is possible from just one study. One of the most important
advantages of this approach is that the analysis of variance technique
can be applied to many different types of experimental designs. When
the levels of the factors included in the design result from a factorial
experiment, the analysis of variance can provide estimates of the
variability due to interactions among factors. As was previously noted,
these interactions were undetectable under the classical model.
The Terminology of Generalizability Theory
Generalizability theory is considered by its developers as an
extension and liberalization of classical reliability theory. An
important distinction is made between two kinds of studies: G and D.
A Gstudy or generalizability study is one where the sources and
magnitude of the variability in one particular measurement instrument
are investigated. A Gstudy is analogous to a reliability study in the
classical sense.
A Dstudy or decision study is one which uses information concerning
the generalizability of a specific measurement tool for decision making
purposes. Two types of decisions are identified: absolute or comparative.
Absolute decisions are those which consider each individual separately.
Placement and classification decisions are both absolute decisions. A
specific example would be a decision made by a guidance counselor to
place a student in one of several curriculum programs on the basis of
the student's score on a test. A comparative decision is based on a
comparison of one individual to another or a comparison among groups
of individuals. Selection decisions, as well as decisions involving
group differences, fall under this category. An example of a compara
tive decision occurs when the scores on a test are used as the depen
dent variable for comparing the performance of two groups participating
in an experiment.
In generalizability theory an observation is considered to be a
sample from the total universe of observations which could have been
made. The observation is described in terms of the conditions under
which it is made. Two or more conditions of the same type constitute
a facet. With one exception, in Fisherian terms a facet is a factor;
conditions are simply the levels of a factor. The exception is that
"persons" is never considered a facet in a Gstudy even though it is
a factor.
When conducting a Gstudy, the investigator should include as many
of the facets which are considered to affect the reliability of the
measure as possible. From each facet, the investigator samples a set
of conditions under a particular design. The observations are then
made under the set of conditions sampled. The set of all possible ob
servations which could be included in the Gstudy is referred to as the
universe of admissible observations.
When the sampling of conditions is done at random and the universe
of conditions is sufficiently large, the investigator is operating under
a random effects model. This model is the one most commonly used in
the context of generalizability theory although fixed and mixed models
are also possible.
Regardless of the model used, the point estimates of the variance
components are obtained by computing the mean squares from the analysis
of variance and setting them equal to their corresponding expected mean
squares. A helpful but unnecessary restriction has been made in gene
ralizability theory that equal numbers of observations appear in the
subclassifications of the design. This restriction simplifies the pro
cedure for obtaining point estimates of the variance components but is
not absolutely necessary. With equal numbers of observations, the mean
squares from the analysis of variance are unique and are "the best"
estimates possible. Having unequal numbers of observations creates a
situation where the investigator must decide which of several sums of
squares (and, therefore, mean squares) to use. In either case, the
expected values of the sums of squares are linear combinations of the
variance components. Therefore, solving for a set of simultaneous
equations will result in point estimates.
The estimates of the variance components obtained under a Gstudy
can be used in subsequent Dstudies as long as the facets included in
the Dstudy were also included in the Gstudy. The conditions, however,
do not have to be the same if a random effects model is being considered.
The set of all possible observations to which an investigator
carrying out a Dstudy wishes to generalize is termed the universe of
generalization. This universe must then be a subset of the universe of
admissible observations of the Gstudy providing the variance component
estimates. This relationship between a Gstudy and subsequent Dstudies
implies that the utility of a Gstudy depends upon its ability to provide
estimates of as many components of variance as might arise in future D
studies. That is, a Gstudy providing estimates of three components of
variance is more useful than one where only one of those three components
is estimable, all other things being equal.
Purpose of the Study
The purpose of this study was to apply the principles of generaliza
bility theory to the assessment of written composition. The significance
of this study is twofold. First, it illustrates how theoretical measure
ment concepts can be applied and extended to fit specific problems
encountered in assessment. Second, it provides guidelines to applied
researchers and evaluators in the field of writing for improved methods
of estimating the reliability of their assessment procedures.
With the current movement toward teaching basic skills, the effec
tiveness of tests in assessing progress in reading, writing, and
arithmetic is under scrutiny. Of these three areas, writing presents
a paradoxical conflict. While objective tests of writing ability are
generally more reliable, essay tests or written compositions are con
sidered to be more valid measures of writing ability (Coffman, 1971).
The opinion of most specialists in the field of language arts is that
the validity of essay tests should not be traded for the higher relia
bility of objective tests (McColly, 1970).
Given this preference for the essay test in the assessment of
writing skill, any efforts to improve the quality of measurement in
this area should focus on this test form. Unfortunately, advancements
in measurement theory and practice have been, for the most part,
restricted to objective tests. However, generalizability theory offers
great potential usefulness for upgrading the reliability of measures of
written composition. Applied studies are needed to test the reality of
that potential.
Statement of the Problem
The need to study the application of generalizability theory to
the assessment of writing skills becomes apparent when the recommen
dations on research methodology from leading curriculum specialists in
written language arts are examined. Although the specific recommenda
tions will be discussed in the following chapter, at this point we will
note that more than one source of variability affecting the reliability
of written composition has been identified. The most common sources of
error noted are inconsistency across raters, modes, and occasions.
In spite of the recognition of these sources of variation, most
researchers who have studied the reliability of written composition
have examined the issue only in terms of interrater reliability.
Implicit to the concept of interrater reliability is the assumption that
fluctuations among raters is the only source of error in the model.
This study incorporated three facets in a splitplot factorial design
in order to examine the results obtained by taking into account more
than one source of error. Following the recommendations of Brennan (1975),
the students were seen as nested in a higher classification, the classes.
The number of students in each class was not constant. Therefore, this
study also explored procedures for obtaining estimates of the variance
components which are applicable to unbalanced designs. An unbalanced
design as defined here is one with unequal numbers of observations in
the subclassifications (Searle, 1971a).
Using the results from this study, we will be able to assess the
magnitude of each source of variability and determine which ones are
the most important to control in order to obtain reliable assessments
of writing skill. Based on these results, we will be able to make
recommendations for the design of future Dstudies using the same method
of assessment. These recommendations will include the nature of the
facets which must be considered as well as the frequency with which
each facet should be sampled. Both absolute and comparative decisions
will be taken into account.
In addition, the estimates of the variance components will be used
in the computation of several generalizability coefficients. The
coefficients to be considered are those which provide estimates of the
reliability when generalization is intended in either one dimension
(across raters, modes, or occasions), two dimensions (raters and modes,
etc.), or three dimensions (raters, modes, and occasions).
Significance of the Study
A renewed nationwide interest in the assessment of writing may be
evidenced by the following events:
1. A compositional writing subtest is being reinstated on the
Scholastic Aptitude Test (SAT) examination used by many colleges and
universities for student selection and placement.
2. Interest in expanding the base of knowledge on the writing
process has been underscored by National Institute of Education (NIE)
in the 1978 competition for Basic Skills Awards.
3. A number of states now include writing as a skill to be
tested in their efforts to establish statewide standards for minimum
educational competency.
Those responsible for the preparation of these examinations will
naturally be governed by practical considerations, such as the demon
strable quality of those examinations. Demonstrating the reliability
of their techniques must be one of the considerations. Generalizability
coefficients, which take into account various sources of error associated
with writing assessment, provide unambiguous estimates of reliability.
As a result, they are preferable to the traditional interrater
correlation coefficient.
CHAPTER II
REVIEW OF THE LITERATURE
The literature reviewed in this chapter has been selected from
three distinct fields: language arts, measurement theory, and sta
tistical methodology. The review is organized in the following manner.
First, selected literature pertinent to the assessment of writing
ability is presented to establish the rationale for the content area
of this study. Particular attention will be given to studies involving
primary grade children. Next, the development of generalizability
theory is traced, followed by references illustrating applications of
the theory. Finally, selected references from the literature on methods
of variance component estimation are reviewed with emphasis on methods
that are applicable to unbalanced designs. These designs have greatest
utility in determining generalizability coefficients for assessments
of written composition.
The Assessment of Writing Ability
Teachers and nonteachers alike would agree that writing is one of
the most important subjects taught in schools. But the importance of
the subject has not been accompanied by effective assessment. Evaluating
students' writing performance continues to be a problem for writing
specialists, English teachers, and researchers investigating this
complex area. Both objective tests and compositional writing (essay
tests) continue to be used (Coffman, 1971). However, the balance seems
to be on the side of essay tests. After reviewing several standardized
objective tests of writing, McCaig (1977) recommended "to evaluate
achievement in writing, evaluate the writing of children"(p.491).
Several other experts in the field also agree that writing ability
is best determined by looking at actual writing performance (Coffmnn,
1971 & McColly, 1970). The members of a recent Louisiana State Depart
ment of Education conference on minimum writing proficiency unanimously
recommended that any test of writing proficiency include a sample of
the student's writing (Suhor, 1977).
Objective tests are generally not recommended. A quote from
Braddock (1976) emphasizes the point:
At this stage of our understanding of writing
and of testing, it is difficult to believe
that any standardized test will be constructed
which can measure such ability. Therefore, anyone
who professes to evaluate "writing ability" with
a standardized test is either telling a false
hood or speaking from ignorance.(p.119)
At the present time, essay tests are included in a number of com
monly used tests of English. Examples of these are the Language Skills
Examination, the College Entrance Examination Board, and the writing
test developed by NAEP. These may be used for the prediction of
success in English, placement in special courses, exemption from required
courses, program evaluation, and experimental or correlational research
(Cooper and Odell, 1977).
Sources of Error in Essay Tests
The problem of the reliability of essay tests has been widely
recognized for some time (Meckel, 1963). Adequate reliability is
particularly important in required writing courses in which students
must earn a satisfactory grade and also in research, when essay tests
are used as a measure of gains or losses in skill which are to be attrib
uted to experiments in teaching methods.
Diederich (1957) suggested that the major problem of grading essays
has to do with variation in the grades assigned by different readers.
Commenting on the difficulties involved in grading such tests, he
pointed out that when 10 readers read a set of papers without discussing
standards, it is likely that average papers will receive the whole
range of grades. ie suggested three criteria for judging essay tests of
writing ability. First, the writing assignment should be like the
writing students do in the normal course of events. Second, the grading
should be independent of the writer's knowledge of the subject matter.
Finally, the topic must be within the student's comprehension. These
criteria were met in the selection of assignment and in the grading of
the samples used in this study.
To improve the reliability of essays, he recommended that all
students write on the same topic, that readers be trained, and that at
least two samples of writing be obtained from each student. This last
recommendation suggests a second source of variation related to the
reliability problem. Meckel was aware of this source when he said:
"samples of writing done over a semester are obviously a better index
of writing ability than a single essay" (p.988).
Braddock, LloydJones, and Schoer (1963), after screening and
reviewing 484 studies on writing, discussed four sources of variation
which should be taken into account when rating compositions. These
sources are: the writer variable, the assignment variable, the rater
variable, and the colleague variable. The writer variable refers to
daytoday fluctuations in the writing performance of individuals,
particularly the performance of better writers. On this issue these
authors recommend that each student write at least twice.
Under the assignment variable, Braddock et al. included four
aspects: topic, mode, time, and situation. They hypothesized that
variation in mode may have a stronger effect on the quality of writing
than variation in topic. The modes considered by these authors were:
narration, description, exposition, argument, and criticism. With
respect to time and condition, their recommendation was to allow as much
as 20 to 30 minutes of writing time for primary grade children and to
standardize the conditions across all children.
The rater variable, as defined by Braddock et al., refers to the
tendency of a rater to vary in his/her own standards of evaluation while
the colleague variable refers to variation in standards across different
raters. The existence of interrater variability has been substantiated
very frequently by research. Braddock et al. recommended that the raters
have a common set of criteria and that they practice together in applying
those criteria consistently. Two additional recommendations were offered
in order to reduce the interrater variation. One of them was to preserve
the anonymity of the writer. (These recommendations were previously made
by Diederich). The second one was to control for rater fatigue. As will
be shown in the next chapter, these recommendations were followed in the
rating of the samples used in this study.
McColly (1970) categorized the sources of error in grading essay
tests of writing ability into three general sources: students, readers,
and topics. In determining his classification scheme, he considered the
categories offered by Braddock et al. as well as those proposed by
French (1962). French's categories, almost identical to McColly's,
consist of student errors, test errors (the task and the topic), and
scale errors (reader disagreement).
Under the student source, McColly considered conditions such as
distractions (both internal and external) as well as the motivation of
the student. He recommended allowing the student at least 40 to 45
minutes of writing time.
With respect to readers, McColly concurred that readers must be
given the proper training and orientation as well as the opportunity to
practice. Practice is indispensable in establishing the proper speed
and rate. lie makes the following general statement in this regard:
"up to the point where the prose becomes ununderstandable, the faster
the rate and speed, the more valid and reliable the judgement"(p.150).
As far as the topic is concerned, McColly discussed the relation
ship between assessing writing ability and structuring the assignment.
In his view, by providing students with the content in a writing test,
one is filtering out, to some extent, the factor of subject matter
mastery. On the other hand, when all of the content is provided,
writing becomes simply an exercise in logic. He concluded that more
experimentation is needed in this area in order to determine to what
extent content should be provided in assessing writing ability and not
knowledge of subject matter nor logic.
It is important to make a distinction between the use of the essay
to assess ability to communicate within a subject area and the use of
written compositions to assess ability to write. Coffman (1971) has
addressed the former, but some of his ideas are relevant to the latter
use. In particular, Coffman's chapter deals with the essay examinations
when it is used by individual teachers in measuring the outcome of
instruction.
In his chapter, Coffman considered three sources of error affecting essay
scores: interrater variability, intrarater variability, and freedom
of responses. Not all three sources are pertinent to all uses of the
essay. The last source is related to McColly's concern on the structure
of the assignment. According to Coffman, if ratings are used only to
determine the rank order of the pupils, only the first source of error
is of concern. However, if the ratings are treated as direct measures
of quality, then all sources of error become critical.
More recently,Cooper and Odell (1977) have noted that to obtain
reliable measures of writing ability through essay tests, it is necessary
to have more than one piece of writing from more than one occasion and
involving two or more persons in rating each piece. Thus, these authors
implied that raters, occasions, and assignment are sources of error.
A line of empirical studies addressing the issue of factors affecting
specifically the writing of children clearly points out that writing
mode is an important source of variation. Seegars, as early as 1933,
cautioned teachers and researchers to be alert to the different impacts
of the modes in evaluating and analyzing children's writing. Several
experimental studies conducted in the 60's generally support Seegars'
contention in samples of first and third grade children (Johnson, 1967;
Anderson and Bashaw, 1968). More recent studies offer added evidence
that the mode is related to the quality of children's writing (Bortz,
1970; Veal and Tillman, 1971; Pope, 1974; Perron, 1976).
In most of these studies, a measure of syntactic complexity such as
number of clauses or number of words per clause was used as the dependent
variable. The modes investigated were descriptive, argumentative, narra
tive, and expository.
In spite of the recognition that occasion variability, assignment
variability, and mode variability are sources of error in assessing
writing ability, most researchers who study compositional writing have
considered the issue of instrument reliability in terms of interrater
reliability. For example, Cohen (1973) in evaluating the writing ability
of college students, determined reliability using percentage of agree
ment among raters. When Fagan, Cooper, and Jensen (1975) reviewed
several available measures for evaluation and research in written
language arts, interrater reliability or percentage of agreement between
raters constituted the most common type of reliability estimates
reported. The only other type of estimate, reported in only two cases,
was testretest reliability. More recent investigations of the relia
bility of specific instruments equate reliability with agreement across
raters. An example is Singleton's (1977) dissertation on the reliability
of ratings assigned on the essay portion of the Language Skills Examina
tion.
It seems that essay test reliability has practically become synony
mous with interrater reliability. A likely explanation for this phenom
enon is that nonstatistical psychologists find it easier to think in
terms of correlations. A Pearson productmoment correlation coefficient
may be easily computed between the scores assigned by two raters. But
this correlation coefficient does not adequately assess all of the sources
of variation (Coffman, 1971).
Coffman suggested using the analysis of variance approach to
adequately assess more than one source of error variation. Stanley
(1962) had previously discussed a specific design which could be used
to assess the reliability of raters and test forms.
A classic study by Finlayson (1951) is the first reliability study
to consider rater and test variability as sources of error in essays.
Based on a sample of 197 children who wrote two essays, he reported
mean coefficients of .697 and .810 for the reliability across tests and
raters, respectively. Each essay was rated by six raters, using a
general impression method of scoring with a 1 to 5 scale. In a second
part to his study, Finlayson used the analysis of variance in a 197x2x6
random effects design. In testing the significance of effects he found
the childbyessay interaction significant, suggesting that the perfor
mance of a child in one essay is not representative of his/her ability
to write in general. The childbyrater interaction was not significant.
From his results, it may be concluded that test variation represents a
greater source of error than rater variation.
In a followup study, Vernon and Millican (1954) investigated the
reliability across 7 raters and 7 topics for a sample of 224 college
students using a general impression 5point scale. They reported mean
correlations between raters on the same topic and between topics. These
were .509 and .366, respectively. In the authors words: "a still more
serious source of inconsistency in assessing English ability is the
varying performance of candidates when writing essays on different
topics"(p.73).
In view of the recommendations made by language arts specialists
and the results of the empirical studies reviewed, it appears that
extending the design of Finlayson to include raters, modes/topic, and
daytoday variation as possible sources of error is in order. To
best assess all sources simultaneously, the principles of generalizability
theory will be applied. The development of generalizability theory will
be discussed in the following section.
Generalizability Theory
The conceptual underpinnings of generalizability theory are based
on Fisher's (1925) work on the analysis of variance, the factorial
experiment, and the intraclass correlation.
The idea of using the analysis of variance to estimate the relia
bility of a test is due to Cyril Burt who translated the work of Fisher
for his students with the aid of P. 0. Johnson, J. Neyman, and R. W. B.
Jackson (Burt, 1955). Burt considered measurements as varying in three
dimensions; with respect to the person, the test form, and the occasion.
The reliability of the test is estimable from a comparison of individual
variance to group variance. In Burt's words:
On comparing the two variances it would then
seem possible, on intuitive grounds, to infer
that, when the variance of the measurements for
a single individual becomes as large as the
variance for the entire sample of different
individuals, the test used will be of no practi
cal value whatsoever: for the whole object of
such a test is to distinguish the ability as
measured for any given individual from the
abilities of the rest.(p.105)
Burt showed how the intraclass correlation provided an estimate of the
reliability.
The intraclass correlation was introduced by Fisher in the context
of the random effects model. Scheffe (1959) illustrates it using the
model
(4) Y. = p + a. + e..
where p is the grand mean and ai and eij are independent with zero means
and variance matrices o 2()Ii and o2(e)Ii
respectively. The v;aiam, 4" VO y rny bhe expressed as
(5) 2(y) = 2(a) + 2(e) .
The observations within any class are not statistically independent.
The statistical dependence between any two observations yij and yij in
the same class is expressed as
(6) r intraclass = E[(yij p)(yij I)] / o2(y)'
= E[(ai + eij)(ai + eij)] / c2(y)
= E(a 2) /o2(y)
= o2(a)/[a2(a) + o2(e)]
Thus, the intraclass correlation may be estimated by obtaining point
estimates of the variance components.
Pilliner (1952) compared the estimate of reliability obtained
from the intraclass correlation to that obtained from the Pearson
productmoment correlation for a situation where measures vary in two
dimensions: persons and tests (or items, etc.). Under homogeneity of
variance assumptions, the intraclass correlation provides an unbiased
estimate of reliability. But if variances are heterogeneous, the
estimates from the intraclass r are negatively biased. That is, they
represent a lower bound. Pilliner suggested extensions of the two
dimensional framework where components of variance are mostly needed.
His illustration was a three dimensional design using Finlayson's data,
for which his procedures were derived.
In the United States, Hoyt (1941) used the analysis of variance
approach in determining the internal consistency of a test from a subject
byitem design, where the items are dichotomously scored. He arrived
at reliability formulas identical to those derived by Kuder and
Richardson (1937).
Ebel (1951) made a case for the use of the intraclass correlation
for situations where the parallelism assumption was impractical due to
the inequality of means. He was interested in the reliability of
ratings which he estimated by applying the analysis of variance to a
subjectsbyratings design. The results from this approach were
compared to two other formulas proposed for estimating such reliability:
the generalized reliability and the average intercorrelation. Ebel
concluded that the intraclass formula was preferable because of its
flexibility with respect to the inclusion of the between raters variance
in the error term. In situations where the same raters are used to rate
all subjects, the between raters variance does not enter into the error.
On the other hand, when different raters are used, then that variance
should be considered as error.
In his 1953 textbook, Lindquist provided a clear and comprehensive
treatment of the use of variance components in the estimation of relia
bility. He discussed the possibility of obtaining negative estimates
particularly when the number of degrees of freedom is small for some
factors. A small number of degrees of freedom may not be crucial for
variance components which are not of interest (such as the between raters
variance discussed by Ebel in situations where all raters rate all
subjects). Lindquist also demonstrated that increasing the number of
observations in a study resulted in different effects, depending on the
levels of the factors sampled. In this regard, the SpearmanBrown
formula has limited utility. The limitations of the SpearmanBrown
formula for showing the effects on reliability from an increase in the
levels of a factor had been previously discussed by others (e.g.,
Pilliner ). Finally, Lindquist illustrated the added utility of
estimating variance components for determining the relative importance
of the various sources of error. This information could be useful in
suggesting designs for the construction of measurement schedules. The
idea of using variance component estimates for deciding among different
designs was later expanded by Vaughn and Corballis (1969).
Using the analysis of variance approach and extending the designs
used to estimate reliability to more than two dimensions implied a
conceptualization of reliability as a characteristic of a measurement
procedure rather than a measurement instrument. This was the position
taken by Rajaratnam (1960) and, more recently, discussed by Rowley (1976)
in the context of observational measures.
In his article, Rajaratnam introduced the notion of a reliability
coefficient as the ratio of true score variance to the observed score
variance expected in a set of observations obtained by using the same
measurement procedure in a specific way. lie formulated coefficients for
situations where every rater does not rate every subject. In this
situation, as Ebel had suggested, the systematic variance of raters is
part of the error term since it enters into the expected observed score
variance. Rajaratnam also introduced the distinction between G and D
studies which was discussed in the introduction.
In studying the reliability of classroom observational schedules,
Medley and Mitzel (1963) made use of the analysis of variance approach
in reliability estimation. Their application is extended to a four
way factorial without replications. These authors illustrate the vast
amount of reliability information which may be obtained from one care
fully designed study using analysis of variance methods.
Several articles published by Cronbach, Gleser, and Rajaratnam
(Cronbach et al., 1963; Gleser et al., 1965; Rajaratnam et al., 1965)
and culminating in the publication of their 1972 book, have summarized
the conceptualization of reliability estimation from the analysis of
variance. These authors presented a general framework which encompasses
the classical model and may be extended to include experimental designs
for fixed, random, and mixed models. They rely heavily on the paper by
Cornfield and Tukey (1956) dealing with variance component estimation
for factorials through the use of expected mean squares. Their treat
ment is limited to balanced designs, having equal numbers of observations
in the subclassifications.
In the introductory chapter, certain problems associated with the
classical theory were presented. One of these problems was discussed
by Guttman (1953) in his critique of Gulliksen's (1950) book. Guttman
observed that the notion of parallel tests, the heart of classical
reliability theory, does not provide a unique definition of reliability,
since there may be more than one reasonable basis for forming parallel
tests.
In their work, Cronbach et al. (1963) reformulated the theory of
reliability to overcome the inadequacies presented by the parallelism
assumption. They rephrased the reliability issue as follows: "an
investigator asks about the precision or reliability of a measure because
he wishes to generalize from the observation in hand to some class of
observations to which it belongs"(p.144). Their theory requires that
the investigator clearly specify a universe of conditions of observation
over which generalization is to be made. The problem of reliability
thus, becomes one of generalizability.
In terms of generalizability theory, a person's universe score,
(analogous to the classical true score), is defined as the expected
score over all admissible observations. This definition is equivalent
to Lord and Novick's (1968) "generic true score." The obtained score
is a sample from a universe of admissible observations and will generally
differ from the universe score.
A model is constructed where the observed score is expressed in
terms of the hypothesized effects. For example, consider the model
(7) j = iT + a. + e
S p J PJ
The observed score, X p, given to person p by judge j is the sum of
three components, namely Fp the effect for person p; aj the bias of
judge j; and an error component epj, which may, for example, represent
some idiosyncratic reaction of judge j to a particular person p. These
components are assumed to be independent. Models like (7) can be con
structed to fit any particular design.
The variation found among observed scores, o2(X), may be parti
tioned into variance components
(8) 02() = o2 () + a2(a) + 02(e).
o2(T) represents the variation due to persons and, in Cronbach's terms,
the universe score variance.
Cronbach et al. (1972) make a distinction between two error compo
nents 2 (A) and o2(6). (this distinction was previously noted by Ebel
(1951)). To illustrate the distinction assume that every judge con
sidered every person. The component 02(6), estimated from o2(e) in
our model, refers to the variance of each person's observed deviation
scores under each judge, (X X.), around the universe deviation score
(p p). These deviation scores eliminate the systematic variance
P
among judges, o2(a), since the mean for each judge is subtracted from
the raw score to obtain the deviation score. In general, the systematic
variance of facets where the same conditions are sampled for every person
is excluded from 02(6). The error variance a2(A), refers to the variance
of each person's observed scores, Xp, around their universe score, plp.
In our example, o2(A) = 02(0) + 12(e). In the classical sense, this
variance component is the only component of error. The square root of
02(A) is the standard error of measurement. It will be noted that 02(A)
will generally be greater than 02(6).
The emphasis of generalizability theory is on the estimation of the
variance components. These variance components have several uses, one
of which is the estimation of generalizability coefficients via intra
class correlations. The coefficient of generalizability is defined as
the ratio of the universe score variance to the expected observed score
variance. It is approximately the expected value of the squared
correlations of observed score and universe score, Ep2(X p ). The
intraclass correlation is a good approximation of p2(XI) if homogeneity
of variance assumptions are met. Maxwell and Pilliner (1968) and
Selvage (1976) have recommended performing transformations on the data
to achieve stability of variances when the assumptions are not met.
The variance components are also used in planning designs for D
studies. When making absolute decisions, it is desirable to reduce the
error 02(A). According to Cronbach et al. (1972) a nested design reduces
02(A) more than a crossed design with the same number of observations per
person since more conditions are sampled in the nested design. For
comparative decisions a2(6) is the appropriate error to consider in
determining the adequacy of the measurement procedure. The magnitudes
of the variance components provide an indication of the relative
contribution of the different effects to the error. This knowledge
is useful in determining the number of conditions to be sampled from
each facet in subsequent D studies in order to maintain the error at
a specified level.
Cronbach et al. (1972) consider a third type of error, o(E), the
error of estimate. It is the square root of the familiar variance for
errors of estimate in linear regression. The regression equation they
consider is that for predicting up from the observed score and group
information. According to Cronbach et al. (1972, p.15), the universe
score is "the ideal datum on which to base. .decision [s]." They
recommend estimating universe scores through linear regression and
setting confidence intervals around the estimated true score using o(A).
The estimated universe scores are not very useful if all scores are
regressed to the population mean; since they will be perfectly correlated
with the observed scores. But if subpopulations of persons exist with
different means, the universe score may be predicted from the observed
score and the subpopulation information.
In their book, Cronbach et al. provide detailed examples of the
application of generalizability theory to simple experimental designs
involving both crossed and nested facets. They also extended the theory
to encompass multivariate problems.
Since the publication of Cronbach's book several authors have
applied the principles of generalizability theory to various situations.
Levy (1974) applied the theory to studies of reliability in clinical
settings; and Gillmore, Kane, and Naccarato (1978) to student ratings
of instruction.
In the spirit of generality, Mellenbergh (1977) has recently
proposed a more extended view of reliability by considering all possible
replications of the design where in addition to replications of facets,
replications of subjects for fixed facets is also possible. He sug
gested using replicability coefficients which are defined as the
correlation between two replications of the design. His coefficients
include generalizability coefficients and also make use of estimates
of the variance components. Several of the possible coefficients,
however, serve no interesting purpose in most practical situations.
Brennan (1975) extended the idea of calculating reliability from
a personbyitem analysis of variance to a situation where persons are
nested within some higher order dimension. Assuming an equal number of
persons in each class, Brennan compared the generalizability coefficients
derived from a splitplot factorial design with students nested within
classes and crossed with items to those derived when the nesting clas
sification (i.e. classes) is ignored (a randomized blocks design). He
concluded that "the experimental model used to collect data for most
reliability studies is usually one where students are nested within
some dimension; therefore, the splitplot design would appear to be
more appropriate than a simple randomized block design" (p.780). In
addition, the splitplot design can be used to provide a basis for
estimating the reliability of scores for the units within which persons
are nested.
For his design, Brennan stated that the reliability of the test of
specified length calculated from the splitplot design would be less
than, equal to, or greater than that calculated from a randomized block
design depending upon whether the ratio of 2(p) (the person
variance component) to o2(e) (the error variance) is less than, equal
to, or greater than the ratio of o2(s) (the school variance component)
to o2(si) (the school by item variance component).
Thus, if one uses a randomized block design to
calculate reliability for persons when, in fact,
persons are nested within some dimension, such
as schools or classrooms, the resulting coeffi
cient will be biased, and, moreover, the
direction of bias will be unknown.(p.785)
Kane and Brennan (1977) extended generalizability theory to a split
plot design in which students were nested within classes and crossed
with items. Their purpose was to estimate the generalizability of a
class mean, where the class was the unit of analysis. They assumed an
equal number of students in each class. Four different coefficients
were formulated corresponding to four universes: an infinite universe
of students and items, a fixed universe of students and items, a
universe with fixed students and infinite items, and a universe with
infinite students and fixed items.
The situation where the students are fixed is somewhat artificial
since, in educational research, it is generally inappropriate to
restrict the universe of generalization for the student facet. Restrict
ing the set of both items and students is very unlikely. The universe
score variance in this case is estimable if the interaction effect for
students and items and the error in the model are not confounded, that
is, if there is moe than one replication of each classstudentitem
observation or if the studentitem interaction is assumed to be zero
and its estimate taken as the error estimate.
In a subsequent section, the authors showed how certain coeffi
cients may be estimated from mixed models. However, since the components
from a model with a fixed facet cannot be used to estimate a generali
zability coefficient that assumes generalization over that facet, the
authors recommended a random model in the estimation of variance compo
nents.
Kane and Brennan also related three coefficients, which appear in
the literature for estimating the reliability of class means, to their
four generalizability coefficients. None of the four reliability
coefficients is equivalent to their generalizability coefficient where
generalization is intended over students and items, a very common
situation.
Generalizability theory offers innumerable possibilities for well
designed studies to be conducted as part of instrument development.
Much information may be gained from one Gstudy, some of which is
unattainable under the classical approach. As the principles are
applied to various measurement problems, their strengths and limitations
will become apparent.More applications are needed in all areas. To this
author's knowledge the theory has not been applied to the assessment of
writing ability. The studies by Finlayson (1951) and Vernon and Millican
(1954) approximate this effort. However, these studies only reported
tests of hypotheses and interclass correlation coefficients and did not
use estimates of variance components. This applied study extended tli
design used by Finlayson and incorporated a method of estimating variance
components for unbalanced data.
Variance Component Estimation
Thus far, all references to generalizability theory,both theoretical
and applied,have assumed balanced designs. For balanced designs the
analysis of variance method of estimation is universally accepted. The
expected values of the mean squares may be expressed as linear combina
tions of the variance components. The coefficients of the components
are easy to obtain by rules developed by Cornfield and Tukey (1956) for
fixed, random, and mixed models. These rules appear in standard texts
such as Kirk (1968) and Winer (1971). The best method of estimation is
to equate the observed mean squares from the analysis of variance under
fixed effects, to the linear combination of variance components. Then
the resulting set of simultaneous equations is solved for the variance
components. These estimates are minimum variance and are unbiased
(Searle, 1971h).
Most methods of estimating variance components involve some
quadratic form of the observations. The mean squares from the analysis
of variance are the appropriate quadratics to use when the design is
balanced. Estimating variance components from unbalanced data is more
complex because there is no universally accepted method. According
to Searle (1971b, p.33) "no particular set of quadratics has been
established as being more optimal than any other set." For unbalanced
designs, using the analysis of variance procedure leads to the question
of which mean squares to use, since with unbalanced data the mean squares
may be unadjusted or adjusted for one or more effects.
A comprehensive review of methods of estimation based on the
analysis of variance has been given by Searle (1971a, 1971b) for both
balanced and unbalanced designs. For the latter case, Searle discussed
three methods proposed by Henderson (1953). Henderson's method 1 consists
of equating the unadjusted sums of squares from the fixed effects
analysis of variance to their expectations obtained under a random
effects model. These expectations are linear combinations of the
variance components. Thus, solving for the set of simultaneous equations
will yield estimates of the components. This method produces unbiased
estimates except for the random effects in mixed models.
llenderson's method 2 was developed to correct the inefficiency of
method 1 with mixed models. The procedure of the second method is to
"correct" the data by some previous least squares estimates of the fixed
effects. Using the "corrected" data in place of the original data,
method 2 proceeds as method 1. This method is inappropriate when there
are interactions between the fixed and random effects.
The method of fitting constants, or lenderson's method 3, uses the
adjusted sums of squaresadjusted sequentiallyand follows the same
pattern as the other methods. The adjusted sums of squares are similar
to those of Overall and Spiegel's (1969) "a priori ordering." All ex
pectations of these adjusted sums of squares are taken under the full
model. Under this condition, the expected value of any term involves
all of the variance components except those for the terms for which
this term was adjusted.
As Searle pointed out, the coefficients of the variance components
for these methods are not as easy to obtain as those with balanced data.
He gives several references which discuss numeric methods for obtaining
the coefficients.
More recently Rao (1971, 1972) has proposed a different approach
to the estimation of variance components. His methods called MINQUE
(minimum norm quadratic unbiased estimation) and MIVQUE (minimum variance
quadratic unbiased estimation) provide a general approach which is
applicable to both balanced and unbalanced designs and suitable for
either random or mixed models.
To summarize them, let us consider the model
(9) Y= XB+U 1 +"22 + + k
where Y is the n x 1 vector of observations, X is a n x m design matrix
for the fixed effects (in a random effects model X is just a column
vector of 1's), B is a vector of unknown parameters (the grand mean in
a random effects model), Ii. is a given n x c* matrix, the columns of
which are the coded variables for a particular factor, and is a c.
vector of uncorrelated variables for the ith random effects factor in
the model (which may be a main effects factor or an interaction factor).
2
The .'s have zero mean and variance matrix I i=l, ., k,
1 i ci
2
where o. are unknown. Furthermore, and (i/j) are uncorrelated.
i j
The kth factor is the error term. Then
(10) E(Y) = X B,
2 2
(11) V* = Var(Y) = o V1 +. .+ 2k Vk where
(12) V = U U
Rao defined
k
(13) V = Vi
i=1
The problem then is to estimate the variance components o21, . o2k
Rao considered the estimation of a linear function
2 2
(14) Pl1l + . + Pkok
of the variance components from a quadratic function Y A Y of the
observations.
A is symmetric and is chosen to satisfy the following conditions:
(a) AX = 0
2 2
(b) E(Y AY) = p . .+ pkk 
 1 k.+
Condition (a) is necessary for the estimator to be invariant to changes
in B (Rao, 1972). For condition (b) to be true (i.e. the estimator is
unbiased), then tr A V. = pi, i = i, ., k where tr represents the
 i"
trace of a matrix (the sum of the diagonal elements). To obtain the
MINQUE estimator, the Euclidean norm tr (V*A)2 is minimized. This
requires some a priori knowledge of the ratios of o2 To obtain the
MIVQUE estimator, the variance of YA Y is minimized for a particular
choice of o1, ok. That variance is Var (Y'A Y) = 2 tr(V*A) + a
term in A and kurtosis parameters. Under normality assumptions, the
kurtosis parameters are zero and MINQUE equals MIVQUE.
Rao's methods are preferable to those proposed by Henderson for
three reasons. First, they have a wider range of applicability since
they can accommodate mixed as well as random models. Second, the
computations involved are more efficiently programmable. Third, when
prior estimates of the components are available, the MIVQUE method
provides estimates which are locally minimum variance.
The second reason is relevant to Gstudies because the designs
used in such studies tend to be large. As was mentioned previously, a
Gstudy should include as many sources of error variance related to a
measurement procedure as possible. For each facet included, the
maximum number of conditions possible should be sampled. The resulting
design then requires the most efficient method for its analysis. Rao's
methods satisfy this criterion.
Summary
The literature pertinent to the measurement of writing ability
indicates that essay tests represent the most valid method of assessment.
Several sources of error have been identified as affecting the
reliability of this test form. Although variability among raters is the
source most commonly examined, daytoday and assignment variability are
considered to be equally or more important. Missing from the literature
are empirical studies which examine how these sources affect the
reliability of the essay.
Generalizahility theory offers a conceptual framework which is
applicable to the study of multiple sources of error variation. Based
on Fisher's work on the analysis of variance, in the theory, the problem
of reliability is considered as one of generalization from one observation
to a universe of admissible observations.
In a generalizability study, the observations are gathered under
a specific design characterized by facets, the identified sources of
error. The conditions of each facet included in the design may be
fixed or sampled from the total universe of conditions. The relative
magnitude of the sources of error variation is determined through the
estimation of variance components. For purposes of simplification in
the estimation process, generalizability theory has been restricted to
balanced designs. The literature on generalizability theory is lacking
in applied studies, although content areas such as writing could greatly
profit from its application.
Also missing from the psychometric literature are extensions of
the theory to unbalanced designs. These extensions are much needed
since these designs are typical in educational research. Psychometricians
could profit from methods of estimating variance components documented
in the statistical literature. In particular, the methods of Henderson
(1950) and Rao (1971, 1972) are applicable to unbalanced data. These
methods allow the principles of generalizability theory to be further
extended.
CHAPTER III
METHOD
This study was designed to demonstrate the application of generali
zability theory to the assessment of writing ability. The data were
collected in a natural setting on a sample of fourth grade children.
The study extended the application of the theory to a situation where
unequal but proportional numbers of subjects appeared in the sub
classifications.
The sample, the facets, the design, and the procedures for data
collection and analysis are described in this chapter.
The Sample
The sample used in this study consisted of 104 fourth grade
students from eight classes in two schools in Alachua county; four from
P. K. Yonge Laboratory School and four from Alachua Elementary School.
The data used in this study were collected as part of a research project
on creative writing conducted at those schools.
P. K. Yonge is a laboratory school associated with the College of
Education at the University of Florida. The student population at
each grade level is selected from a waiting list in such a way to
approximate, in each classroom, an equal balance between males and
females; a 20:80 racial balance between blacks and white or others,
respectively; and an equal balance from each of five income categories.
Fourth and fifth grades are combined in the classrooms at this school,
The four classrooms participating in this study exhausted those class
rooms containing fourth grade students. A total of 59 fourth grade
students are currently enrolled at P. K. Yonge. However, only 37 who
had complete data were used in this study.
Alachua Elementary is a public school in the rural town of Alachua.
The four classrooms from this school also exhausted the fourth grade
population. In this school, students at each grade level are assigned
to classrooms to maintain the sex and race balance previously described.
A total of 67 students in this school had complete data out of an initial
sample of 113. Thus, the writing samples used in this study were
obtained from a total of 104 individuals. The sample sizes for each
class are shown in Table 1, broken down by sex and race.
The Writing Samples: Data Collection
Samples of compositional writing, in two different writing modes,
were collected on three occasions. On each occasion, verbal and
written instructions were given to the children by one of the staff
members of the project. The same person collected the samples through
out the occasions at each school. Steps were taken to insure that the
children understood the task. Furthermore, praise was used in an
attempt to motivate the children to write. On each occasion, the
assignment and the instructions were standard for all students. Each
student was allowed sufficient time to complete the task. On the average,
the compositions were completed in approximately 45 minutes.
The Facets
The writing samples were characterized by two facets: modes and
occasion. A third facet, raters, was introduced in scoring the samples.
The levels of these facets which were used in this study are described next.
TABLE 1
SAMPLE SIZES BY CLASSROOM, SEX, AND RACE
Sex Race
Classroom Male Female Black White Total
1 3 5 1 7 8
2 4 4 2 6 8
3 8 3 2 9 11
4 3 7 3 7 10
5 6 11 4 13 17
6 8 11 7 12 19
7 6 5 2 9 11
8 8 12 4 16 20
Note: Classrooms 1 through
5 through 8 are from
4 are from P. K. Yonge and
Alachua Elementary.
Modes
The mode facet, as conceptualized in this study, was characterized
by two dimensions: the purpose of the writing sample and the type of
assignment. This use of the word mode is broader than the traditional
use. Generally, four basic writing modes are mentioned in the litera
ture related to factors which influence children's writing ability.
These are: narrative, declarative, argumentative, and expository.
Each of these modes constitutes a different purpose. For example,
the purpose of writing in the narrative mode is to tell a story; that
of the argumentative mode is to convince the audience. For each one
of these purposes, different types of assignments are possible. A
child who is asked to write in the narrative mode may tell his/her
story through a poem, a letter, a report, etc. Characterizing the
type of writing along these two dimensions allows for a large number of
possible conditions on this facet. In this study, generalization was
intended to all of the possible conditions thus identified.
Two types of writing assignment were used in this study, each
representing a different writing purpose. In one mode, children were
instructed to prepare a brief report about specific animals using a
standard set of facts supplied by the investigator. The facts were
presented either in written form or with the aid of a film. On the
first occasion a list of facts about bats was provided for the children.
Films about cows and pigs provided the facts used on the second and
third occasions, respectively. After the presentation of the stimuli,
the facts were discussed with the children.
In the second mode, the children were asked to write a creative
story explaining some imaginary phenomenon such as "how the camel got
the hump". On each occasion, a list of titles was provided for the
children from which they were to select one.
Occasions
The writing samples were collected on three occasions during the
197778 school year: Fall, Winter, and Spring. On each occasion, the
descriptive reports were collected one week before the narrative stories.
This order was maintained because the investigators felt that there
would be less carryover from a report to a story than vice versa. The
one week time period within an occasion was allowed for two reasons:
to minimize carryover effects and to maximize the motivation of the
children. With children at the elementary level, there is a loss in
motivation when similar tasks are assigned in the same day.
Writing performance is expected to fluctuate from day to day.
Furthermore, it is expected that children's writing ability will also
fluctuate (hopefully improve) during the year. In this study, genera
lization was intended to any time during the school year.
Raters
The four raters represent a sample of raters which could have been
used. Three of the raters were graduate students in educational research;
the fourth rater, an associate professor in the same department.
Generalization along this facet is intended to any person who would rate
a sample of writing for the purpose of making a decision about placement,
selection, grading, or for purposes of comparison in a research study,
The writing samples were collected and sorted into six modesby
occasion combinations. The children's names were covered and a number
was assigned and written on their sample for identification. Thus,
the anonymity of the samples was preserved. The raters scored the
samples on eight different days. Each day, the four raters scored
the samples using a general impression scoring method. At the begin
ning of each scoring session, the raters reviewed the criteria to be
used in scoring. After scoring several samples, the raters compared
their scores and discussed samples which had received divergent scores.
These discussions were an attempt to increase the interrater relia
bility. Each sample was scored independently.
Prior to the first rating session, the raters were trained
in using the general impression method. Samples from fifth grade
students were used for training. During training, the scaling points
were determined so as to obtain an approximation to a normal distribution.
Normality was not a consideration during the actual scoring of the
samples. A general impression method of scoring used in this study
involved assigning a score of 1 through 8 on the basis of the overall
quality of the writing sample. The method involves the rapid, impres
sionistic scoring of a sample. Generally, no more than two minutes are
spent on any one paper.
This procedure has been used by the Educational Testing Service (ETS)
and the College Entrance Examination Board, and was also used in the
first national assessment of writing conducted by the National Assessment
of Educational Progress (NAEP) (Mellon, 1975). The ETS research on
rater reliability in the 1960's revealed that multiple ratings based on
overall impressions were the best means of achieving interrater relia
bility (Suhor, 1977). An additional advantage to this method is the
fact that it requires less time than any other method.
Design
A schematic representation of the design used in this study is
FIGURE 1
SCHEMATIC REPRESENTATION OF THE
DESIGN INCLUDING CIASSES(C), STUDENTS(S),
OCCASIONS(O), MODES(M), AND RATERS(R)
01 02 03
H1 M2 H1 M2 H1 M2
R1R2R3R4 R1R2R3R4 R1R2R3R4 R1R2R3R4 R1R2R3R4 R1R2R3R4
S
C1l
S8
S9
C2
S16
S17
C3
S27
S28
C4
S37
S38
CS
S54
S55
C6
S73
S74
C7
S84
SBS
C8
S104
shown in Figure 1. The design is referred to as a splitplot factorial
design in standard texts (e.g.,Kirk, 1968 ; Winer, 1971) with the
classes being the main plots and the students being the subplots. In
this design students are nested within the classes; that is, each student
appears in only one class, This situation is typical in the natural
setting. The nesting of students within class results in the confound
ing of the student by class interaction with the student effect. As a
result, there is no way to estimate the student effect independent of
the classbystudent interaction. Similarly, any interaction term
involving the student effect is confounded with the corresponding
interaction term involving classbystudent. Since typically students
are nested within classrooms, the confounding of the effects mentioned
above does not present any problems.
The students, modes, occasions, and raters are factorially combined.
In terms of this study, this factorial combination means that each
student was measured in both modes on each occasion and that each rater
scored every writing sample. The crossing of students, modes, occasions,
and raters allows for the independent estimation of each main effect
and all interactions involving those effects. The levels of all factors
included in this study were considered to be random samples of all
possible levels which could have been included. Thus, the model used
is a random effects model.
Let X or denote the rating received by students in class c for
mode m occasion o and rater r Then the structural model used in
this study may be represented as:
(15) Xcsmor = + c + s+ (c) + Bcm + BTms(c) +
Yo + OYco + Y"os(c) + Or + aOcr + BJrs(c) +
By +By +By + BG + aBB +
Bmo + Bcmo+ BY mos(c) + Bmr + Bcmr
BOrmrs(c) + Yor + ty6cor + YBors(c) + BYIor +
BY6cmor + Bymors(c)
where p = the grand mean,
ac = the effect for class c (c = 1,. .., nc).
rs(c) = the effect for student s nested within class c
(s = 1,. ., ns(c)),
Bm = the effect for mode m (m = I, . ., n),
aBcm = the classbymode interaction effect,
BIms(c) = the modebystudent (nested within class c)
interaction effect,
Y = the effect for occasion o (o = . n),
aYco = the classbyoccasion interaction effect,
YTos(c) = the occasionbystudent (nested within class c)
interaction effect,
Or = the effect for rater r (r = 1, ... nr),
a cr = the classbyrater interaction effect,
OTrs(c) = the raterbystudent (nested within class c)
interaction effect,
BYmo = the modebyoccasion interaction effect,
aBYcmo = the classbymodebyoccasion interaction effect,
BYTImos(c) = the modebyoccasionbystudent (nested in class c)
interaction effect,
BOmr = the modebyrater interaction effect,
aBo = the classbymodebyrater interaction effect,
BO7mrs(c) = the modebyraterbystudent (nested in class c)
interaction effect,
y6or = the occasionbyrater interaction effect,
ayacor = the classbyoccasionbyrater interaction effect,
yenors(c) = the occasionbyraterbystudent (nested in class c)
interaction effect,
Bymor = the modebyoccasionbyrater interaction effect,
aBycmor = the classbymodebyoccasionbyrater interaction
effect, and
Bymors(c) = the modebyoccasionbyraterbystudent (nested in
class c) interaction effect.
It is assumed that each effect in the model (except for the grand
mean) is a random variable with a mean of zero and variance a2(effect).
The effects are assumed to be independent of each other so that the
total variance in the scores Xcsmor can be partitioned as
(16)
2 7 2 2 2 2 2 2
o (x) = o(a) + o (r) + o (B) + (aB) + (B) + o(y) + o (ay) +
2 2 2 2 2 2 2
o (yT) + 2 (8) + 2(a9) + a (Br) + o (By) + 2 (aBy) + o (ByT) +
2 2 2 2 2 2
0 (Be) + o (aBl) l o (Bro) + a (y0) + 0(a y) + a (y9) + a (Bye) +
o (aBYO) + a (ByOr).
The variances 2 a, G ByO are called variance components (Scheffe,
1959) and, therefore, the model is referred to as a variance component
model. To estimate variance components it is not necessary to assume
that the effects are normally distributed.
Variance Component Estimation
To estimate the variance components in (16), a new version of the
SAS VARCOMP procedure was used (Goodnight, 1978). This procedure,
called MIVQUEO, is based on the MIVQUE (minimum variance quadratic
estimator) method developed by Rao (1971). The method estimates linear
functions of the variance components through the use of quadratic
functions of the observations which have minimum variance for a particu
lar choice of l, . .,k.. The VARCOMP program selects a1, .,Ck so
as to minimize the ratio of the variance for each effect to the residual
variance. The resulting estimates are invariant, locally best (at zero)
quadratic unbiased estimates of the variance components (Goodnight,
1978). The program used was the only one available to handle the size
of the design matrix within a reasonable amount of computer time and
space.
For balanced splitplot factorial designs the expected mean
squares are linear combinations of the variance components. In this
case, the observed mean squares from the analysis of variance may be
used in the formulas shown in Appendix A to estimate the variance com
ponents.
Generalizability Coefficients
Tests for homogeneity of variances were performed on the basis of
warnings by Cronbach et al. (1972, p. 100101). In their words:
Where there is crossing of persons with facet i
(or j, etc.) observedscore variances may differ
from one application of the design to the next,
and intercorrelations between pairs of indepen
dently obtained observed scores may differ. The
intraclass correlation (our coefficient of gene
ralizability) truly equals the mean of p2(X,pp)
only if all observedscore variances are equal.
One must be hesitant, then, in taking the
coefficient of generalizability as representing
the parameter p (X,pp) for any particular D
study with crossed conditions.
To test for violations of homogeneity assumptions for this design,
a procedure suggested by Box (1950) and recommended by Kirk (1968) was
used. The procedure involved the following:
1. testing the equality of the variancecovariance matrices
across the eight classes; and if this hypothesis was not rejected,
2. testing the equality of the diagonal elements in the pooled
variancecovariance matrix.
The first test is performed by the DISCRIM procedure in SAS (Barr,
et al., 1976). The second test was done using Bartlett's test for
homogeneity of variance. It was recognized that this test is sensitive
to violations of normality assumptions. However, a visual inspection
of the frequencies within each subclassification revealed no serious
departure from normality.
Using the point estimates of the variance components, one can
derive the formulas for any desired coefficient of generalizability,
where generalization is intended to any subset of the universe of
generalization used in this study. The generalizability coefficient,
P2(X,p), is defined as the ratio of o2(1), the universe score variance,
to E(o2 (X)), the expected value of the observed score variance, the
expectation taken over repeated applications of this design.
The universe of generalization determines what constitutes the
universe score variance and the expected observed score variance. The
expected observed score variance is always made up of the universe score
variance plus error variance (Cronbach's o2(6)).
For deviation scores, the expected observed score variance is
(17)
E(o2(x)) = o2(r) + o2(TB) + o2(Ty) + 02(0) + o2(Byn) + a2(BOT) +
nm no nr nmno nmnr
a2 (y9 ) 2 (ByOn)
nonr nmronr
It includes all of the components of variance involving the student
effect. Other components in the model do not enter into the expected
observed score variance because they are constant for all students, and
in the formula, the students are considered in relation to the group's
universe score, Each component is divided by the number of conditions
entering the facet involved on that component. Given the formula for
the expected observed score variance, the universe score variance may
be obtained by taking the limit of E(o2(X)) as the number of conditions
approaches infinity. This is the case when generalization is intended
to an infinite number of levels, where all terms but o'(7T) disappear
from the formula. Thus, 2 (ir) is the universe score variance. In the
situation where generalization is intended to a fixed number of
conditions for a particular facet, the component involving that facet
is considered as part of the universe score variance.
In Table 2, seven coefficients are suggested for all possible
combinations of fixed and infinite generalizations across the three facets.
These formulas may be used in a Dstudy involving a similar population
of subjects and a subset of these facets by substituting the values for
the n's for that study and these estimates of the variance components.
The denominator of the formulas is the same for all universes. The terms
have been rearranged so that the error component is within the parenthesis.
The Error Variance o2 ()
The coefficients of generalizability exclude systematic facet
components from the error term and, therefore, from the expected
TABLE 2
FORMULAS FOR GENERALIZABILITY COEFFICIENTS FOR THE SPLITPLOT FACTORIAL DESIGN
WITH THREE FACETS RATERS(R), MODES(M), AND OCCASIONS(O) AND SEVEN UNIVERSES OF GENERALIZATION
Universe of
General nation Formula
R infinite 02()
M infinite 02( ) _2(B) o 2(y) 0 2() o2(By) o2(B) 2 z BP ) a2(ry )
O infinite + + + +
nm no nr nmno nnr nonr nonmnr
R fixed 32(,) + o2(,q)
M infinite nr
O infinite o2(r) + o2(ne) + [ (inB) + o'(y) + o2(TBy) + o2(inB) o2(rye) a 2(iB )
Ty"r n_"m T"o ~ "m' "mnTr no"r o7Mr
R infinite o2(7) + 02(nB)
M fixed nm
O infinite o2(a) o2(aB) o2+(y) o2(Tr) o2(nBy) o2(nB3 ) o2(r ij o2(aByi j
m [ + r + + +r q + orn.m r]
R infinite o2(_) + 2(ry)
R infinite n2
0 fixed 0 () + o2(ry) [ 2(TB) + oa2(8) (nTRy) + a2(rBe) 2+ o ) + o2(By)
no nm nr nno nmnr nonr nonmnr
Sfixed o2) + o nB) + 02(76) + (TB)
M fixednr nr
0 infinite 72() 2(nB) 02(i) o2(Cy) oc) o2([TBY) 2( rY) o2(TBy )
Snm + n + nmnr + [ no mno + nonr + onrn
R fixed o2( ) *' j2() o2(ne] + a2(Tr )
M infinite no nr nonr
0 fixed o2(7) 2(w) 2(uo) o2( ) 2(B) o2(rBy) 02( BO) 2( r Ti)
+ T"o + +T +nTr [ + 'r + n+ r
R infinite
M fixed
0 fixed
o2(0 ) + j2( B) + 02(ry) + o2(nBy)
nM no nmno
2) 2(B P 2 ) o2 By) + [ 2() + 02(aB) 32( e6) oI2(B )
n n n n n n n n n n n n
observed score variance. Two situations may arise where these variance
components should be considered as part of the error component. These
are:
1. Studies where the conditions of a facet are nested within the
student, rather than crossed. In other words, a different condition
or set of conditions is sampled for each student.
2. Situations which involve determining confidence intervals
around as individual's score for the purpose of making an absolute
decision.
The formulas for estimating a (A) from this design for different
universes of generalization may be obtained from the information in
Table 3. The entries in the table indicate those components which
enter into the error variance. The components are to be divided by
the frequencies shown in the last column of the table.
Summary
A total of 104 fourth grade students in eight classes participated
in this study. Samples of compositional writing, in two different
writing modes, were collected on three occasions. The samples were
scored by four trained raters using an 8point general impression method.
The design used, a splitplot factorial, considered the students
as nested in the classes and crossed with the raters, modes, and occasions.
A model was constructed which expressed the variance among all observa
tions as a linear combination of independent variance components.
Estimates of the variance components in the model were obtained
using the MIVQUEO method in SAS. This procedure is applicable to
unbalanced designs such as the one considered in this study. Prior to
TABLE 3
VARIANCE COMPONENTS ENTERING INTO THE ERROR VARIANCE a2 ()
FOR SEVEN UNIVERSES OF GENERALIZATION
UNIVERSE OF GENERALIZATION
R Infinite R Fixed R Infinite R Infinite R Fixed R Fixed R Infinite Number
Variance M Infinite M Infinite M Fixed M Infinite M Fixed M Infinite M Fixed Replica
Component 0 Infinite 0 Infinite 0 Infinite 0 Fixed O Infinite 0 Fixed 0 Fixed Within
o2(B) *
nm
2(y)
2 (Y)
0 (TTy)
02
o2(e)
o2(ire)
o2(By)
02 (rBy)
o2( 8e)
a2(rB6)
a2 (ye)
o2(ByO)
o2( Bye)
Note:
of
tions
T
m
*
n
0
n
n
0
n
r
n
r
nn
nmn
n n
*n
nn
mr
n n
nn
mr
no
.onr
or
nn nr
*rn n n
mor
the asterisks indicate those components which enter into the error variance.
using the estimates of the variance components in the estimation of
generalizability coefficients, tests for homogeneity of variance were
performed.
Formulas for generalizability coefficients corresponding to seven
universes of generalization were provided. In addition, components of
variance entering the formulas for the standard error of measurement
were listed for seven universes of generalization. These universes
represented generalization across one dimension (raters, modes, or
occasions), two dimensions (raters and modes, etc.), or three dimensions
(raters, modes, and occasions).
CHAPTER IV
RESULTS
This study was designed to apply the principles of generaliza
bility theory to the assessment of writing ability in young children.
Samples of writing from fourth grade children were collected in two
modes at each of three occasions during the school year. A general
impression method of scoring was used by four trained raters.
Because the children were nested in the classes, the observations
were first considered in a splitplot factorial design with unequal
numbers of subjects in the classes. The variance components for all
effects in this model were estimated using the MIVQUE method.
A model ignoring the class dimension was also considered. For
this second model, estimates of the variance components were obtained
through the analysis of variance mean squares. The results from these
methods are reported in this chapter. Also reported here are the
results of the homogeneity of variance tests as well as certain
coefficients of generalizability and error variances, o"(A).
Estimates of the Variance Components
The point estimates of the variance components in model (16),
obtained from the MNIVQUEO method of SAS are reported in Table 4 along
with their corresponding degrees of freedom. Negative estimates were
replaced by zeros, following the recommendation of Cronbach et al.
(1972) among others. These zero estimates are no longer unbiased
(Searle, 1971b, p.23) and are obviously bad estimates since a variance
is, by definition, nonnegative.
Searle (1971b) suggested six courses of action to follow when
negative estimates of variance components are obtained. Three of
these alternatives involve assuming that the true value is zero. The
first one is to report the negative estimate but use it as evidence that
the true value is zero. The second one is to change the negative estimate
to zero, as was done in this study. The third involves ignoring the
negative components from the model and reestimating the other components.
The fourth is to use the negative estimate as an indication of an
inappropriate model for the data and to reconsider the model, possibly
considering models with finite instead of infinite populations. The
fifth course of action is to use Bayesian or maximum likelihood
estimators. The last recommendation suggested by Searle is "the statis
tician's last hope", to collect more data.
As shown in Table 4, seven out of the 23 estimates are considered
to be zero. The actual estimates were very small. In general, all
estimates of the variance components were small. This may be partially
due to the restricted range imposed by the 1 to 8 rating scale,
The largest estimates were for the student effect (o (w) = .346),
'2
the studentbymodebyoccasion interaction (2 (By7) = .339), and the
studentbymodebyoccasionbyrater interaction which is confounded
2
with the error (a (By6r) = .235). Following in order of magnitude were
the student by occasion interaction (a (yr) = .073) and the occasion
2
main effect (a (y) = .070). All other estimates appear negligible.
TABLE 4
POINT ESTIMATES OF THE VARIANCE COMPONENTS
FOR THE MODEL (16)
VARIANCE COMPONENT df POINT ESTIMATE
o2(a) 7 0.000*
;2 ( 96 0.346
32(8) 1 0.000*
o2(aB) 7 0.000*
o2(B ) 96 0.024
32(y) 2 0.070
82(ay) 14 0.000*
a2(yu) 192 0.073
52( ) 3 0.008
52(,,) 21 0.000*
o2(e n) 288 0.002
o2(By) 2 0.010
82(oBy) 14 0.056
2 (By ) 192 0.339
2 (B ) 3 0.000*
o2(aBe) 21 0.002
o2(Ba T) 288 0.021
o2(y) 6 0.003
o2 (a) 42 0.000*
52(~ a) 576 0.007
o2(Bgy3) 6 0.006
2 (aBO ) 42 0.017
o2(Byl3 a) 576 0.235
*Negative estimate has been replaced by zero
T 0 nhi c = 1lP11
Test of Homoscedasticity Assumption
The generalizability coefficients obtained from the intraclass
correlation formulas are unbiased only if homogeneity of variance
assumptions are met. To test this assumption in the context of the
splitplot factorial design, a procedure described by Kirk (1968),
pp.258261) was used. The test for the equality of the eight variance
covariance matrices (corresponding to the eight classes) resulted in a
chisquare value of 35.11. With 2100 degrees of freedom, the observed
chisquare was not significant at the .10 level.
Since the eight matrices were not significantly different, a
pooled variancecovariance matrix was constructed. Testing for the
equality of the diagonal elements in the pooled matrix resulted in a
chisquare value of 30.78, which was not statistically significant at
the .10 level with 23 degrees of freedom. This result indicated that
differences among the diagonal elements in the pooled matrix were not
statistically significant. The results from these two tests lent
support to the homogeneity of variance assumption.
Generalizability Coefficients
The coefficients reported in this section were obtained by sub
stituting the point estimates from Table 4 into the formulas derived
in Table 2. Forty nine coefficients were estimated, corresponding to
seven different universes of generalization and seven different com
binations of condition frequency. These coefficients are reported in
Table 5. The first five represent combinations which yield a total of
24 observations on each person, Within that restriction, the combina
tions are included to show which facet needs to be sampled most frequently.
GENERALIZABILITY COEFFICIENTS FOR SEVEN UNIVERSES OF GENERALIZATION
AND SELECTED CONDITION COMBINATIONS
CONDITION COMBINATIONS
Universe of
Generalization
Raters
Modes
Occasions
Raters
Modes
Occasions
Raters
Mode
Occasions
Raters
Modes
Occasions
Raters
Modes
Occasion
Rater
Modes
Occasions
Rater
Mode
Occasion
R Infinite
M Infinite
O Infinite .765 .825 .761 .834 .703 .624 .330
R Fixed
M Infinite
0 Infinite .766 .829 .762 .836 .704 .628 .332
R Infinite
M Fixed
O Infinite .791 .840 .814 .862 .711 .646 .353
R Infinite
M Infinite
0 Fixed .819 .883 .788 .863 .851 .690 .400
R Fixed
M Fixed
O Infinite .798 .848 .879 .890 .714 .669 .375
R Fixed
M Infinite
O Fixed .822 .889 .790 .867 .855 .700 .409
R Infinite
M Fixed
O Fixed
.965 .965 .960 .973
.865 .747
The last two combinations are included to show the effect on the
coefficients of minimum sampling.
The smallest coefficient obtained, .330, corresponds to a situation
where generalization is intended across raters, modes, and occasions
but each facet is sampled only once. This situation may occur if a
classroom teacher were to base the student's writing scores for the
year on one sample of writing.
For the same universe of generalization, increasing the number of
conditions for the mode and occasion facets by one, results in an
increased coefficient of .624. The highest coefficient for that
universe, .834, is obtained when six conditions for the occasion facet
are sampled and the rater and mode facets are each sampled twice.
As the universe of generalization is restricted, by fixing one or
more facets, the generalizability coefficients tend to increase. In
all universes, the smallest coefficients are found when only one
condition of each facet is sampled.
In the three universes having only one facet fixed, the highest
coefficients correspond to the two situations where the mode by occasion
combinations are sampled the most. In the last universe, where
generalization is intended across raters only, all the coefficients are
high.
The Error Variance 2 (A)
The variance components were also used in estimating the error
variance o2(A), the square root of which may be used for obtaining
confidence intervals around an individual's universe score. Several
components, o (A), were estimated corresponding to the seven different
universes of generalization. The results of this estimation are
presented in Table b. For each universe of generalization, seven
JtDLLE 0
ESTIMATES OF THE ERROR VARIANCE u2(A) FOR
SEVEN UNIVERSES OF GENERALIZATION AND SELECTED CONDITION COMBINATIONS
CONDITION COMBINATIONS
Universe of
Generalization
Raters
Modes
Occasions
Raters
Modes
Occasions
Raters
Mode
Occasions
Raters
Modes
Occasions
Raters
Modes
Occasion
Rater
Modes
Occasions
Rater
Mode
Occasion
R Infinite
M Infinite
O Infinite .134 .101 .124 .086 .222 .257 .798
R Fixed
M Infinite
0 Infinite .130 .096 .121 .081 .219 .247 .788
R Infinite
M Fixed
0 Infinite .121 .095 .100 .074 .218 .245 .774
R Infinite
M Infinite
0 Fixed .086 .054 .078 .062 .078 .186 .655
R Fixed
M Fixed
0 Infinite .116 .087 .092 .064 .214 .225 .743
R Fixed
M Infinite
0 Fixed .083 .048 .097 .056 .073 .170 .635
R Infinite
M Fixed
0 Fixed .016 .020 .018 .021 .016 .087 .282
estimates are included. These estimates correspond to different sampl
ing combinations. The first five combinations yield a total of 24
observations. The last two represent minimal sampling of conditions
within each facet.
As shown on the table for the first three universes, and again
for the fifth,in the column where two raters, two modes, and six
occasions are sampled, the error variance is at a minimum. For the
fourth and sixth universes, the second combination is the one which
2
minimizes a (A). In the last universe, the first five combinations
yield small error variances.
Supplementary Analysis
Since five of the seven negative estimates obtained were associated
with the classes effect, a a followup analysis was done eliminating
the classes from the model. Dropping the classes resulted in a four
way balanced factorial design without replications. This was one of
the designs considered by Medley and Mitzel (1963), For this design,
the point estimates of the variance components were obtained using the
mean squares from the analysis of variance reported in Table 7. These
mean squares were substituted into the formulas for the point estimates
given by Medley and Mitzel (1963, p.312).
The resulting estimates of the variance components are reported
in Table 8. As shown in Table 8, three of the 15 point estimates were
negative and have been replaced by zeros, Of these, only the estimate
of the studentbyrater interaction had been positive in Table 4. The
ratio of negative estimates to the total number of estimates is smaller
for the model without the classes effects than for the initial model
TABLE 7
ANALYSIS OF VARIANCE
FROM A FOURWAY FACTORIAL
DESIGN WITHOUT REPLICATIONS
STUDFNT(S) X MODE(M) X OCCASION(O) X RATER(R)
Source df SS MS
S 103 1142.285 11.090
M 1 0.673 0.673
0 2 150.337 75.169
R 3 12.956 4.319
S x M 103 210.202 2.041
S x 0 206 500.079 2.428
S x R 309 100.335 .325
M x 0 2 21.073 10.536
M x R 3 0.149 0.050
0 x R 6 8.211 1.368
S x M x 0 206 387.677 1.882
S x M x R 309 101.143 0.327
S x 0 x R 618 161.372 0.261
M x 0 x R 6 5.821 0.970
S x M x 0 x R 618 152.762 0.247
TABLE 8
POINT ESTIMATES OF THE VARIANCE COMPONENTS FOR
THE FOURWAY FACTORIAL WITHOUT REPLICATIONS
POINT ESTIMATE
VARIANCE COMPONENT*
^2
o2 ( )
o(B)
.2
2 (B T)
.2
2 (y)
2 (y )
o2(a0 )
2 (By)
c2 (By T)
2 (B )
o2(B) if)
82 (0)
2 (Of a)
2
2 (BO )
82(By )
0.355
0.000**
0.006
0.076
0.066
0.006
0.000**
0.019
0.409
0.000**
0.027
0.002
0.007
0.007
0.007
*The same notation
**Negative estimate
used for model (16) will be used here.
has been replaced by zero.
including the classes effects. Therefore, using negative estimates as
the criterion, it appears that eliminating the effects involving classes,
a from model (16) results in a better model for these data.
The estimates in Table 4, obtained by the MIVQUEO method, and those
in Table 8 obtained through the analysis of variance mean squares,
are very close. The similarity between the estimates obtained from the
two different methods lends support to the validity of the MIVQUE as
a useful method when the data are unbalanced. The analyses of variance
approach, as was mentioned earlier, is universally accepted as the best
method for balanced data.
Summary
Point estimates for all variance components in the model were
obtained and reported in Table 4. Negative estimates were replaced by
zeros. The magnitude of the estimates indicated that students could
be differentiated on the basis of their ratings. However, the classes
as units could not be distinguished. Of the three sources of error
examined, the occasion facet constituted the greatest source. The mode
facet was next in magnitude. Raters represented an insignificant
source of errors.
The tests of homogeneity of variance lent support to the assumption
that the variances within each condition combination were equal.
Assuming homogeneity of variance, unbiased generalizability coefficients
were obtained for seven universes of generalization. These universes
represented generalization across one facet, two facets, or all three
facets simultaneously. For each universe, seven coefficients were
computed for possible Dstudies with various combinations of condition
frequencies. For most universes, the coefficients indicated that to
63
obtain acceptable levels of generalizability at least six samples of
writing from each person are necessary. The only exception was when
generalization was intended across raters only. The results for the
error corresponding to the standard error of measurement, were similar
to those based on the generalizability coefficients,
A supplementary analysis which compared the estimates obtained
through the MIVQUE method to those derived using expected mean squares,
resulted in similar values for all estimates in a model without the
classes effect. These results were interpreted as lending support to
the validity of the MIVQUE method.
CHAPTER V
DISCUSSION
In this study, generalizability theory was applied to the assessment
of writing ability in young children. A universe of generalization was
defined in terms of three facets: modes, occasions, and raters. Samples
of children's writing performance were obtained under selected conditions
from each facet. The design permitted the investigation of three main
sources of error and their interactions. These sources were considered
to affect the inference of writing ability from writing performance.
Formulas for generalizability coefficients were derived for seven
universes of generalization.
The first two sources of error were defined in terms of variability
in the quality of the writing samples. This variability may result from
changes in the subject's performance across time (occasions) and across
assignment (modes). The third source of error may result from differences
in the standard of judgement used by different raters when scoring the
samples. Using the principles of generalizability theory, the relative
contributions of these sources of error were examined via estimates of
the variance components. The discussion of the results is focused on
the interpretation of the variance components and the usefulness of
the theory. The limitations of this and similar studies are also
considered.
Interpretation of Variance Components
The largest component of variance was that associated with the
students, indicating that it was possible to rank order the students
on the basis of their ratings. This component represented the
universe score variance. The classes component, on the other hand,
was considered to be zero (the actual estimate was negative), indicating
that the eight classes could not be differentiated as units on the
basis of the ratings received by the students. All but three comnonents
of interactions involving the classes were also zero. The three non
zero components were: the classbymodebyoccasion, .056; the class
bymodebyrater, .002; and the interaction of the classes with all
three facets, .017.
Generalization Across One Facet
The point estimates for the studentbyfacet interactions for the
mode, occasion, and rater facets were .024, .073, and .002, respectively.
These interaction components reflect the relative contribution of each
source of error when generalization is done along that one dimension
only. No interaction would mean that students are similarly rank
ordered across all conditions of that facet, thus generalization across
all conditions would be possible. On comparing these three estimates,
it appears that occasions represented the greatest source of error while
raters represented the smallest. The large relative contribution of
occasions to error means that students are not ranked in the same manner for
all three occasions. Differential learning might have taken place during
the school year. An implication is that when making an assessment of
writing ability, it is important to note when, during the year, the
measure was obtained. If generalization is intended across different
occasion conditions, then several conditions should be sampled.
The small component associated with the studentbyrater interac
tion indicates that the four raters ranked the students similarly.
It seems possible, then, to train raters in applying the general impres
sion scoring method systematically. Since this scoring method is both
fast and efficient, large scale projects could confidently take
advantage of it. It is important to remember that, after scoring
several papers, the raters discussed those samples which received
differing scores. Thus, it is not surprising that this source of
error was minimal.
The studentbymode interaction was large enough to indicate that
changes in the task may result in different rankings of students.
Different modes of writing may demand different abilities from the
students. A piece of creative writing, for example, would require an
exercise of the imagination while writing a report would require the
ability to organize facts in a meaningful fashion.
The three main effect components associated with modes, occasions,
and raters were .000, .070, and .008, respectively. These components
reflect systematic changes and contribute to error only if absolute
decisions are being made or when different conditions are sampled for
different students. Again, the occasion component is the largest,
indicating that the overall ratings were greater on some occasions than
in others. It is possible that all students improved their writing
performance during the school year. The rater component is small but
higher than the studentbyrater interaction. This component
reflects any systematic rater bias. There appeared to be no systematic
variability due to modes.
Generalization Across Two or Three Facets
When generalization is intended across more than one dimension, in
addition to the components discussed in the previous section, those
components involving the interactions among facets must be considered.
The threeway interaction components involving the students and two
facets were .339, .021, and .007 for the modeoccasion, moderater, and
occasionrater combinations, respectively. The first one is relatively
large, almost equal in magnitude to the student component. The
interpretation of that component is that differences is students'
ranking across the mode conditions change as a function of the occasion
conditions. A large component indicates that, when generalization is
intended across these two facets, the conditions should be sampled
frequently, if error is to be minimized. This fact is reflected in
Table 7 where coefficients of generalizability are shown for several
condition combinations. The largest coefficients correspond to
situations where modes and occasions are sampled most frequently.
The studentbymodebyrater component reflects some variability
due to differential ranking of students by the raters as a function of
the mode. That is, raters were not as consistent in one mode as they
were in the other. The small studentbyoccasionbyrater interaction
indicates that raters were almost as consistent in one occasion as they
were in the others.
The twoway interaction components among facets were .010, .000,
and .003 for the modebyoccasion, modebyrater, and occasionbyrater
components, respectively. These components enter into the error
variance 2(A) but not o2(6). Of these, only the modebyoccasion
component is large enough to warrant consideration. This component
indicates that differences in the overall ratings across modes vary as
a function of the occasion. For example, it is possible that all students
performed better when writing the creative story at the beginning of
the year. On the other hand, at the end of the year they might have
done a better job on the factual reports. If all students had more
practice in one mode during the year, their improved ability in that
mode would be reflected in this component.
When generalizing across all three facets, two additional components
of variance must be considered. The fourway interaction component
involving students and all three facets was relatively large, .235.
Since there were no replications within any three facet combination,
this component was confounded with the error of replication. The
magnitude of this component indicates that generalization across all
three facets requires that more than one condition of at least one facet
be sampled in order to minimize the error. The threeway interaction
component among the three facets was relatively small, .006.
Based on the previous discussion, it may be concluded that the
occasion facet represented a greater source of error than the mode
facet. The mode facet, in turn, represented a greater source of error
than the rater facet. With proper training and practice, the rater
facet may be almost irrelevant. These findings agree with those of
Finlayson (1951) and Vernon and Millican (1954) who concluded that
differences in essays contributed more to unreliability than differ
ences in raters. The differences in essay were further investigated
in this study, since essays werecharacterized along two dimensions.
Both of those dimensions were found to be important in this study.
Furthermore, one of them was found to be more important than the other.
These findings also support the recommendations made by experts
in the field of language arts and discussed in Chapter II. To obtain
a reliable assessment of writing ability more than one sample of
writing should be collected on more than one occasion and on more than
one mode. How many is more than one? That depends on the intended
universe of generalization.
An examination of Tables 7 and 8 provides some guidelines for
answering that question. In those tables seven universes of generaliza
tion are considered. The first universe represents generalization across
all three facets. The next three reflect generalization across two
facets only, the third facet is held constant. The last three universes
correspond to generalization across one facet only: occasions, modes,
and raters, in that order. Several condition combinations are included
in each table.
The entries in Table 7 represent generalizability coefficients
obtained via intraclass correlation formulas. The error variance
entering into those coefficients is 2 (6). In general, the highest
coefficients, across all seven universes, correspond to situations where
12 writing samples are collected (the second and fourth condition combi
nations). Collecting six writing samples (first, third, and fifth
condition combinations) results in a decrease in the coefficients.
However, the decrease is not too drastic, except perhaps in situations
where all six samples are collected in one occasion. This situation
seems unrealistic since, in this case, writer fatigue would interfere
with writing ability. If only four samples are collected and only one
rater is used, the coefficients drop below .7 for most universes. With
only one sample, as shown in the last condition combination, most
coefficients would be unacceptable.
The entries in Table 8 represent the estimates of the error variance
o2(A) which takes into account systematic effects. The square root of
the entries, o(A), represents the standard error of measurement. Thus,
the information in Table 8 may be used in constructing confidence
intervals around individuals' true scores. In general, the conclusions
that may be made based on the results shown in this table are similar
to those based on Table 7. That is, for these estimates, those condition
combinations which maximize p2(x,u), also minimize a (A).
Usefulness of Generalizability Theory
On the basis of this study it may be said that generalizability
theory provides a useful method for estimating the reliability of
measures of writing ability. With a clear definition of error and using
repeated studies, it might have been possible to examine certain
reliabilities of essay using classical methods. Those reliabilities
which include components of interactions among facets would, of course,
be impossible to obtain under classical methods. For those reliabilities
which are estimable under classical methods, the treatment would be
more awkward. The basic requirement under the framework of generaliza
bility theory is that the source of error be identified as a facet and
that conditions of that facet be sampled and incorporated into the
design. In that manner, the components of variance associated with
that source are estimable. Including facets in a design is a
popular method of control in educational research since, typically,
this kind of research takes place in the natural setting. It follows
that generalizability theory provides a practical methodology in
those situations.
Given the applicability of the theory to problems of reliability,
it is surprising that applications of it are scarce in the literature.
Some possible explanations of this situation are considered here.
These are: (a) the unfamiliarity of applied educational researchers
with the methods, (b) the unavailability of formulas for more complex
designs, or (c) the limitation imposed by the restriction of balance.
This application of the theory is a step in making the methods
more familiar to a wider group of applied educational researchers. In
particular, researchers in the field of compositional writing have been
provided with estimates of variance components which may be useful in
the planning of both comparative and absolute Dstudies in that area.
In addition, formulas for the generalizability coefficients have been
derived for the design used in this study. Those formulas may be adapted
to fit other designs which represent subsets of our universe of admissible
observations. All that would be required is that those terms involving
facets not included in the design be dropped from the formula.
As was demonstrated in this study, the restriction of balance is
not necessary. Several methods are available for the estimation of
variance components in unbalanced designs. One of those methods was
used in this study. Computer programs in SAS may be used to obtain the
point estimates. The procedure available in the 1976 version of SAS
uses Henderson's method 3. A future version of SAS will include, in
addition to the current method, the MIVQUEO method which was used in
this study. The point estimates obtained from the MIVQUEO method were
very similar to those obtained for a reduced model via expected mean
squares. These results were presented in the supplementary analysis of
the previous chapter. Future research should focus on comparing the
"goodness" of these different methods when applied to specific situations.
These computer programs have certain limitations when large design
matrices are involved. For large design matrices, such as the one used
in this study, the current SAS program requires an excessive amount of
computer space and time. For example, approximately five hours would have
been required to get the point estimates for the components in this study
under the current version. The MIVQUEO method uses less time and memory
but for large design matrices it still represents an expensive process.
However, the estimates of the variance components from one G
study may be used in subsequent Dstudies involving a similar population
of individuals and similar facets. The estimates computed for this
study may be useful to persons working with fourth grade students of
similar characteristics. A limitation is introduced by the high rate
of attenuation in this sample. To the extent that the final sample is
representative of the fourth grade population, our estimates are useful.
An additional limitation of this study is introduced by the small
number of conditions sampled within each facet. As has been pointed out
by llenderson, among others, the sampling error of the estimates of
variance components is large when few conditions are used in the estimate.
On a different application of this design, then, it is possible that
the estimates obtained would vary from the ones in this study. As
the number of degrees of freedom increases, the accuracy of the estimate
also increases. It should be noted that the components used in the
generalizability coefficients have large numbers of degrees of freedom
since they involve the student effect.
Summary and Conclusions
This study examined the problem of reliability of measures of
writing ability in the context of generalizability theory. Three main
sources of error variance were considered: raters, modes, and occasions.
It may be concluded that errors resulting from variability in the
quality of writing across occasions and modes outweigh those stemming
from differences among raters. With training and practice, raters can
consistently score the writing samples of students using a general
impression method. This method proved to be both fast and easy to use.
To improve the reliability of measures of written composition and
decrease the standard error of measurement, the emphasis should be
placed on collecting several samples of writing. On the basis of the
estimates obtained in this study, collecting less than six samples would
result in coefficients below .70. Assessing the reliability of measures
of writing ability in terms of rater agreement, is skimming the problem.
It is unfortunate that this issue is most commonly addressed in terms
of interrater reliability.
This study demonstrated the potential of generalizability theory
for clarifying problems of reliability. In applying the theory, the
careful identification of potential sources of error is required. Also,
consideration must be given to the type of inference which is to be made
74
from the observations. On the basis of these considerations, the
universe of observations is defined. A carefully designed study will
allow the estimation of all sources of error variance identified. As
was shown in this study, it is not necessary to limit applications of
the theory to balanced designs. Methods of variance component estima
tion for unbalanced designs are documented in the statistical literature
and available in SAS, a popular package of statistical programs.
REFERENCES
Anderson, H. E., G Bashaw, W. L. An experimental study of first grade
theme writing. American Educational Research Journal, 1968, 5,
239247.
Barr, A. J., Goodnight, J. H., Sall, J. P., & Helwig, J. T. A user's
guide to SAS. Raleigh, N. C.: SAS Institute, 1976.
Bortz, D. E. The written language patterns of intermediate grade
children when writing compositions in three forms: descriptive,
expository, and narrative. Dissertation Abstracts International,
1970, 30, 5332A.
Box, G. E. P. Problems in the analysis of growth and wear curves.
Biometrics, 1950, 6, 362389.
Braddock, R. Evaluation of writing tests. In A. H. Grommon (Ed.),
Reviews of selected published tests in English. Urbana: National
Council of Teachers of English, 1976.
Braddock, R., LloydJones, R., & Schoer, L. Research in written composition.
Champaign, Ill.: National Council of Teachers of English, 1963.
Brennan, R. L. The calculation of reliability from a splitplot
factorial design. Educational and Psychological Measurement,
1975, 35(4), 779788.
Burt, C. Test reliability estimated by analysis of variance. British
Journal of Statistical Psychology, 1955, 8(2), 103118.
Coffman, W. E. Essay examinations. In R. L. Thorndike (Ed.),
Educational Measurement. Washington D. C.: American Council on
Education, 1971.
Coffman, W. E., F, Kurfman, D. A comparison of two methods of reading
essay examinations. American Educational Research Journal, 1968, 5,
99107.
Cohen, A. M. Assessing college students' ability to write compositions.
Research in the Teaching of English, 1973, 7, 356371.
Cooper, C. R., & Odell, L. (Eds.). Evaluating Writing: describing,
measuring, judging. Urbana, Ill.: National Council of Teachers of
English, 1977.
Cornfield, J., & Tukey, J. W. Average values of mean squares in
factorials. Annals of Mathematical Statistics, 1956, 27, 907949.
Cronbach, L. J., Gleser, G. C., Nanda, 11., & Rajaratnam, N.
The dependability of behavioral measurements: theory of generali
zability for scores and profiles. New York: Wiley F, Sons, 1972.
Cronbach, L. J., Rajaratnam, N., 6 Gleser, G. C. Theory of generali
zability: a liberization of reliability theory. British Journal
of Statistical Psychology, 1963, 16, 137163.
Diederich, P. The problem of grading essays. Princeton: Educational
Testing Service, 1957.
Ebel, R. L. Estimation of the reliability of ratings. Psychometrika,
1951, 16, 407424.
Fagan, W. T., Cooper, C. R., & Jensen, J. M. Measures for research
and evaluation in the English language arts. Urbana, Ill.:
National Council of Teachers of English, 1975.
Finlayson, D. S. The reliability of marking essays. British Journal
of Educational Psychology, 1951, 21, 126134.
Fisher, R. A. Statistical Methods for Research Workers. London:
Oliver & Boyd, 1925.
French, J. W. Schools of thought in judging excellence of English
themes. Proceedings of Invitational Conference on Testing Problems,
Princeton: Educational Testing Service, 1962.
Gillmore, G. M., Kane, M., & Naccarato, R. W. The generalizability
of student ratings of instruction: estimation of the teacher and
course components. Journal of Educational Measurement, 1978, 15(1),
114.
Gleser, G. C., Cronbach, L. J., & Rajaratnam, N. Generalizability of
scores influenced by multiple sources of variance. Psychometrika,
1965, 30, 395418.
Goodnight, J. II. Personal communication, June 14, 1978.
Gulliksen, H. Theory of Mental Tests, New York: Wiley 6 Sons, 1950.
Guttman, L. A special review of Harold Gulliksen, Theory of Mental
Tests. Psychometrika, 1953, 18, 123130.
Henderson, C. R. Estimation of variance and covariance components.
Biometrics, 1953, 9, 226252.
Hoyt, C. J. Test reliability estimated by analysis of variance.
Psychometrika, 1941, 6, 153160.
Johnson, L. V. Children's writing in three forms of composition.
Elementary English, 1967, 44, 265269.
Kane, M. T., & Brennan, R. L. The generalizability of class means.
Review of Educational Research, 1977, 47(2), 267292.
Kirk, R. Experimental design: procedures for the behavioral sciences.
Belmont, Ca.: Brooks/Cole, 1968.
Kuder, G. F., & Richardson, M. W. The theory of the estimation of
test reliability. Psychometrika, 1937, 2, 151160.
Levy, P. Generalizability studies in clinical settings. British
Journal of Social and Clinical Psychology, 1974, 13, 161172.
Lindquist, E. F. Design and Analysis of Experiments in Psychology
and Education. Boston: HoughtonMifflin, 1953.
Lord, F. M., & Novick, M. R. Statistical Theories of Mental Tests Scores.
Reading, Mass.: AddisonWesley, 1968.
LloydJones, R. Primary Trait Scoring. In C. R. Cooper, & L. Odell
(Eds.), Evaluating Writing: describing, measuring, judging.
Urbana, Ill.: National Council of Teachers of English, 1977.
Magnusson, P. Test theory. Reading, Ma.: AddisonWesley, 1967.
Maxwell, A. E., & Pilliner, A. E. G. Deriving coefficients of
reliability and agreement for ratings. British Journal of Matathematical
and Statistical Psychology, 1968, 21, 105116.
McCaig, R. A. What your director of instruction needs to know about
standardized English tests. Language Arts, 1977, 54, 491495.
McColly, W. What does educational research say about the judging of
writing ability? Journal of Educational Research, 1970, 64(4), 148156.
Meckel, H. C. Research on teaching composition and literature. In
N. L. Gage (Ed.), Handbook of Research on Teaching. Chicago:
Rand McNally, 1963.
Medley, D. M., Q Mitzel, 11. E. Measuring classroom behavior by
systematic observation. In N. L. Gage (Ed.), Handbook of Research
on Teaching. Chicago: Rand McNally, 1963.
Mellenbergh, G. J. The replicability of observational measures.
Psychological Bulletin, 1977, 84, 378384.
Mellon, J. C. National Assessment and the Teaching of English. Urbana,
Ill.: National Council of Teachers of English, 1975.
Overall, J. E., & Spiegel, D. K. Concerning least squares analysis
of experimental data. Psychological Bulletin, 1969, 72, 311322.
Perron, J. D. The impact of mode on written syntactic complexity.
Athens: University of Georgia, 1976. (ERIC Document Reproduction
Service No. ED 126 531)
Pilliner, A. E. G. The application of analysis of variance to
problems of correlation. British Journal of Psychology,
Statistical Section, 1952, 5(1), 3138.
Pope, M. The syntax of fourth graders' narrative and explanatory
speech. Research in the Teaching of English, 1974, 8, 219227.
Rajaratnam, N. Reliability formulas for independent decision data
when reliability data are matched. Psychometrika, 1960, 25, 261271.
Rajaratnam, N., Cronbach, L. J., & Gleser, G. C. Generalizability
of stratifiedparallel tests. Psychometrika, 1965, 30, 3956.
Rao, C. R. Estimation of variance and covariance components in
linear models. Journal of the American Statistical Association,
1972, 67, 112115.
Rao, C. R. Minimum variance quadratic unbiased estimation of variance
components. Journal of Multivariate Analysis, 1971, 1, 445456.
Rowley, G. L. The reliability of observational measures. American
Educational Research Journal, 1976, 13(1), 5159.
Scheffe, II. The analysis of variance. New York: Wiley & Sons, 1959.
Searle, S. R. Linear Models, New York: Wiley & Sons, 1971a.
Searle, S. R. Topics in variance component estimation. Biometrics,
1971b, 27, 176.
Seegars, J. C. Form of discourse and sentence structure. Elementary
English Review, 1933, 10(3), 5154.
Selvage, R. Comments on the ANOVA strategy for the computation of
intraclass reliability. Educational and Psychological Measurement,
1976, 36(3), 605609.
Singleton, D. J. The reliability of ratings of the essay portion of
the Language Skills Examination. Dissertation Abstracts International,
1977, 37, 7710A.
Stanley, J. C. Anova principles applied to the grading of essay tests.
Journal of Experimental Education, 1962, 30, 279283.
Suhor, C. Mass testing in composition: is it worth doing badly?
New Orleans: New Orleans Public Schools, June 1977.
't
80
Vaughn, G. M., & Corballis, M. C. Beyond tests of significance:
estimating strengths of effects in selected ANOVA designs.
Psychological Bulletin, 1969, 72, 204213.
Veal, L. R., & Tillman, M. Mode of discourse variation in the evalua
tion of children's writing. Research in the Teaching of English,
1971, 5, 3745.
Vernon, P. E., 6 Millican, G. D. A further study of the reliability
of English essays. British Journal of Statistical Psychology, 1954,
7(2), 6574.
Winer, B. J. Statistical principles in experimental design (2nd ed.).
New York: McGrawHill, 1971.
APPENDIX A
POINT ESTIMATES OF THE VARIANCE COMPONENTS AS
LINEAR COMBINATIONS OF MEAN SQUARES FOR
THE SPLITPLOT FACTORIAL DESIGN WITH BALANCED DATA
2 (a) = 1/ns,(c) nmnonr[ MS(a)MS (T)MS(aB)MS(ay)MS(aO)+MS (nB)
+MS (my) +MS (IO) +MS (aBy) +MS (aBO) +MS (aye) MIS (aBy) MS (aBe)
Ms(7iye)MS(aByOe)+MS(fByO)] .
2
o (T) = l/nnonr[ MS(1T)MS(sB) MS( y) MS (7e)+MS(TBy)+MS (aBO)
4MS(TyOe)MS(TByO)] .
2 (B) = 1/ns (c)ncnonr [MS (B) MS (By) MS (B) +MS (Bye) MS (aB) +MS (aBy)
+MS(aBe)MS (aCBY)] .
oa (B) = 1/ns (c)non, [MS (aB) MS (aBy) MS (aBe)+MS (aBye)MS (TrB)+MS (TBy)
+MS (rBe)MS(w Bye)]
2 (TB) = l/nonr [MS (TB)MS(aBy)MS (aBB)+MS(rBye)]
2 (y) 1/ns (c) ncnmnr [MS (y) MS (yB) MS (yO) +MS (By) MS (ay) +MS (ayO)
+MS(aBy) BS(aByO) .
o2 (ay) = i/ns(c) nnr [MS (ty)MS (ayB) MS (ayB)+MS (aBy9) MS (ry) +MS (iyB)
+MS (ryO)MS (aBy))] .
o2(Ty) = I/n nr [MS (y)MS (ryB)MS(Ty6)+MS(TrBy)] .
'2
2 (0) = /ns(c)ncn no [IMS()MS(eB)MS (ey)+MS (ByO)MS(ae)+MS (aeB)
+MS (acy)MS(aByO)] .
2 ( e) = 1/ns (c) nmino [MS (ae) MS (a6B) MS (acy) +MS (aBye) MS (nT9) +MS (reOB)
+M S(nOy)MS(TBye)] .
2 (Tfo)= 1/1, TI 1 0 [NS (T8) IS (Tr6B) NIS (7T~y) +NIS (TTBy6 )I
62 (By) = /n s (n Tn[MS( (By) NIS( (By)NIS( yo.By)I
2 (aBy) = 1/n n (C)n [MIS (aBy)MS (aByG)MS (cffBy)+MS (T ByO)]
2(ifly) = I/n r [,lS(Tffy)NIS 7fBy6)].
a2(aBe) = 1/n (cn1 [HMS(QByO)S(ByO)MS(naBy)+MSByO)
62(,BO = /ns(c) r I aO NS(~O)I T~)+I 7~8
'2 (iTrB)= 1/n [MS(TTBy)MS(TByO)].
a2(ye) = J/ns(c) n c in [S (yO NIS(Bye) MS (ayfl +NIS (aByO)]
2(ctya) = 1/n1 (C)n [MS(ayB)MIS((xByO)MS(TxBy)+MS(TTByO)]
62(BO)= I /n n,, [S(7ry6)MS (w6)By6)
2 (iBy) = 1/n s(C)n[M IS(By) MS(aoy8)]
a2(aByO) = /n nn[MS(aByO)MS(TBye)]
52 (Qoye)= 1/n[S(Tf6).
BIOGRAPHICAL SKETCH
Maria Magdalena Llabre was born in Matanzas, Cuba,on September 22,
1950. She and her family immigrated to the United States in 1962.
Upon graduating from Miami Senior High School in 1969, she enrolled
at the University of Florida where she received a Bachelor of Arts degree
with a double major in Psychology and Mathematics. Following graduation
Maria returned to Miami to teach mathematics at John F. Kennedy Junior
High School for one year.
In 1974 she was admitted to the doctoral program in the Foundations
of Education Department at the University of Florida. She received the
M. A. E. degree in Educational Psychology in 1976.
While in graduate school, Maria worked in the evaluation of Project
Follow Through and served as an evaluation consultant at P. K. Yonge
Laboratory School. She was also a teaching assistant in research and
statistics courses in the College of Education for three years.
She is currently a member of Phi Beta Kappa, the American Educa
tional Research Association, the American Statistical Association, and
the National Council for Measurement in Education.
Marfa and her husband Brainard Hines will be moving to Miami where
she has accepted a teaching position at the University of Miami starting
in August, 1978.
I certify that I have read this study and that in my opinion it
conforms to acceptable standards of scholarly presentation and is fully
adequate, in scope and quality, as a dissertation for the degree of
Doctor of Philosophy.
William B. Ware, Chairman
Professor of Foundations of Education
I certify that I have read this study and that in my opinion it
conforms to acceptable standards of scholarly presentation and is fully
adequate, in scope and quality, as a dissertation for the degree of
Doctor of Philosophy.
"'Linda M. Crocker
Associate Professor of Foundations
of Education
I certify that I have read this study and that in my opinion it
conforms to acceptable standards of scholarly presentation and is fully
adequate, in scope and quality, as a dissertation for the degree of
Doctor of Philosophy.
) /
Ramon C. Littell
Associate Professor of Statistics
I certify that I have read this study and that in my opinion it
conforms to acceptable standards of scholarly presentation and is fully
adequate, in scope and quality, as a dissertation for the degree of
Doctor of Philosophy.
L hn M. Newell
Professor of Foundations of
Education
This dissertation was submitted to the Graduate Faculty of the
Department of Foundations of Education in the College of Education
and to the Graduate Council, and was accepted as partial fulfillment of
the requirements for the degree of Doctor of Philosophy.
August 1978
Chairman, Foundations of Education
Dean, Graduate School
