DEVELOPMENT OF THE
DISRUPTIVE STUDENT BEHAVIOR SCALE
WILLIAM L. MOSES
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN
PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
William L. Moses
Who would have been proud
ACKNOWLEDGMENTS

I wish to express appreciation to my parents for their
love, understanding, and help; my committee chairperson,
Dr. McDavis, for his counsel and encouragement; my
committee members, Dr. Ziller for his confidence in my
ability to work independently and Dr. Loesch for stepping
into the breach and contributing so much so quickly while
continuing his friendship and support; and my employers at
Pasco-Hernando Community College for their financial support.
Special thanks go to my friend and colleague, Dr. Tom
Floyd, who listened for hours and encouraged for years,
and to my friends and lovers who were usually supportive,
sometimes distracting, and always worth it.
TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
ABSTRACT

CHAPTER ONE  INTRODUCTION
    Statement of the Problem
    Purpose of the Study
    Need for the Study
    Significance of the Study
    Definition of Terms
    Organization of the Study
CHAPTER TWO  REVIEW OF LITERATURE
    Definition of Disruptive School Behavior (DSB)
    Identification, Assessment, and Placement
    Rating Scale Development
    Psychometric Properties of Rating Scales
    Uses of Behavior Rating Scales
    Summary

CHAPTER THREE  METHODOLOGY
    Research Questions
    Construction of the DSBS
    Validation of the DSBS
    Reliability of the DSBS
    Field Study
    Data Analyses
    Validity
    Reliability
    Limitations

CHAPTER FOUR  RESULTS AND DISCUSSION
    Results
        The Severity Factor
        The Samples
        Research Question One
        Research Question Two
        Research Question Three
        Research Question Four
        Summary
    Discussion
CHAPTER FIVE  CONCLUSIONS, IMPLICATIONS, SUMMARY,
    AND RECOMMENDATIONS
    Conclusions
    Implications
    Summary
    Recommendations

APPENDICES
    A  CONSTRUCT DEVELOPMENT STUDY
    B  BEHAVIORS COLLECTED FROM DISCIPLINARY RECORDS
    C  ORAL INSTRUCTIONS FOR THE EDITING STUDY
    D  ITEMS DEVELOPED FROM CONTENT VALIDATION STUDY
    E  INSTRUCTIONS FOR CONTENT VALIDATION STUDY
    F  THE DISRUPTIVE STUDENT BEHAVIOR SCALE (DSBS)
    G  INSTRUCTIONS FOR SEVERITY FACTOR STUDY
    H  SCORING TEMPLATE FOR THE DSBS
    I  SUMMARY OF TEACHER RATINGS ON THE DSBS
    J  PRESCRIPTIVE PROFILE WORKSHEET FOR THE DSBS
    K  INSTRUCTIONS FOR THE PILOT STUDY
    L  RATERS EVALUATION OF THE DSBS
    M  ASSIGNMENT OF CONSTRUCTS BY ITEM NUMBER

REFERENCES
BIOGRAPHICAL SKETCH
LIST OF TABLES

 1. Domains of Student Life Influenced by the School Experience
 2. Potential Adverse Consequences of DSBS Behaviors
 3. Rating Form Distribution by Demographic Categories--Norming Group
 4. Rating Form Distribution by Demographic Categories--Disruptive Group
 5. Frequency of Observed DSBS Behaviors by Constructs
 6. DSBS Constructs by Number
 7. Assignment of Proposed Scale Items to Constructs for Content Validation
 8. Follow-up Study for Assignment of Proposed Scale Items to Constructs
 9. Comparison of Disruptiveness Ratings by Teachers and Non-teaching Personnel
10. DSBS Ratings and z-scores for the Disruptive Group
11. DSBS Ratings and z-scores for the Norming Group
12. Test-Retest Correlations
Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial
Fulfillment of the Requirements for the
Degree of Doctor of Philosophy
DEVELOPMENT OF THE
DISRUPTIVE STUDENT BEHAVIOR SCALE
William L. Moses
Chairperson: Roderick McDavis, Ph.D.
Major Department: Counselor Education
Disruptive behavior is currently seen by both educa-
tors and the public as a major problem in American
education. A procedure for quantitatively assessing
disruptive behavior in schools is required to show a need
for intervention programs and to select students for
placement in either special education or alternative
education programs. The purpose of this study was to
develop and validate an instrument, the Disruptive Student
Behavior Scale (DSBS). The DSBS is intended for use in
assessing quantitatively the disruptive school behaviors
of middle and junior high students referred for placement
in special education and alternative education programs.
This study investigated the position that disruptive
school behavior (DSB) can best be described in terms of
its type, frequency, and severity. The use of teachers as
observers and raters of disruptive school behavior is
discussed. Using teacher-generated behavioral statements
from disciplinary referrals to better describe DSB is
suggested. A review of various rating scale development
procedures attempted by business, industry, and government
is summarized.
A set of 10 constructs was selected to define DSB.
Scale items were developed from referral statements on
disciplinary records in a junior high school. A severity
factor was incorporated into the scoring system so that
behaviors rated as more detrimental to the student were
given a higher DSBS rating.
The DSBS was field tested in a public middle school.
Students in a norming group and a criterion, or disrup-
tive, group were rated by their classroom teachers using
the DSBS. A norm for disruptive behavior for the target
school was calculated and a criterion for classifying a
student as disruptive was established.
Results indicated the DSBS could identify the crite-
rion group of disruptive students, classify individual
students as disruptive, and exclude non-disruptive
students from the disruptive group. A follow-up study
suggested the results were consistent over time for all
DSBS ratings except those at the lowest end of the scale.
CHAPTER ONE
INTRODUCTION

The public school system in the United States has
been assigned a major role in socializing and enculturating
American youth (Filipczak, 1978). The U.S. Supreme Court
in its 1954 landmark civil rights decision (Brown v. Board
of Education of Topeka, 74 S.Ct. 686, 691) described
education as "a principal instrument in awakening the
child to cultural values, in preparing him for later
professional training, and in helping him to adjust
normally to his environment." The materialistic emphasis
of American society and culture ordains that the educa-
tional institution at all levels be driven by the broadly
defined goal of career success for its graduates (Bell,
1984; DiPrete, 1981, p. 199; National Education Associa-
tion (NEA), 1975, p. 108).
Unfortunately, a significant number of students are
detoured from this goal when educators describe them as
displaying behaviors inappropriate to the school environ-
ment and not attributable to legally-defined mental or
emotional handicaps. Suspensions, expulsions, and assign-
ments to alternative programs are evidence of failure by
the educational system to effect students' adherence to
current social norms and culturally-specified behaviors.
The consequences to the schools for this failure include
loss of both funds and credibility, neither of which the
educational system has in sufficient quantity to squander.
Attempts to correct this failure to convey norms and
behaviors effectively have included both exceptional
child education and alternative schooling programs. The
Education for All Handicapped Children Act of 1975 (P.L.
94-142)(Department of Health, Education, and Welfare,
1977) effectively administered the coup de grace to
exceptional child education approaches in Florida by
failing to include a category appropriate to disruptive
behavior (Florida Department of Education, 1975, 1985).
Alternative programs frequently fail to provide for
selection and discharge criteria, rendering evaluation
virtually impossible (Pinellas County School District,
1982). A primary reason for failure to specify behavioral
criteria for alternative schooling programs is the lack of
appropriate instruments for quantifying disruptive
behavior (Salvia & Ysseldyke, 1981, pp. 8, 9).
Inadequacies of existing behavioral assessment
instruments include failure to provide for local norming,
inclusion of inappropriate items, omission of the severity
factor, and inadequacy of prescriptive information
(Mesinger, 1982). An instrument providing both a
theoretical and a pragmatic rationale for identifying
disruptive students is a requirement for reconsidering the
inclusion of this category in special education legisla-
tion and enhancing the credibility of alternative
education programs (Reeves, Perkins, & Hollon, 1978).
Statement of the Problem
Disruptive behavior in the public school system is
not a new phenomenon (Garibaldi, 1979). That it remains a
problem is emphasized by Robert J. Rubel in introducing a
collection of papers on crime and violence in public schools:
The issue in the 1980's no longer centers on
whether or not violence in American schools is
serious; the issue no longer centers on whether
violence is increasing or decreasing; the issue
no longer centers on technical anomalies concern-
ing under- or over-reporting of incidents. In
the debate of the 1980's, the primary issue
before large proportions of our urban schools
(and sizeable numbers of our suburban and even
rural schools) revolves around the continued
viability of American education as it existed a
generation ago. (1980, p. 5)
The U. S. government has acknowledged the existence
of disruptive behavior by awarding federal grants for
alternative education pilot programs (Law Enforcement
Assistance Administration, 1979; Moses, 1976).
Included in definitions of disruptive school
behavior (DSB) are such varied activities as talking,
hitting, yelling (Mayer & Butterworth, 1979); defy-
ing rules and procedures (Walker, 1979); aggressive
behavior which interrupts the instructional program
(Foley, 1982); and conduct disorders (American Psychiatric
Association, 1980, pp. 45-50). Forness and Cantwell (1982)
and Forness, Sinclair, and Russell (1984) have identified
these categories as likely to be ineligible for special
education services under P.L. 94-142.
The U.S. government (Department of Health, Education,
and Welfare, 1977), in implementing P.L. 94-142, specif-
ically denied services to the "socially maladjusted."
Florida law provides essentially the same restrictions
(State Board of Education Rule 6A-6.3016), although Bower
(1982), whose research (Bower, 1958) formed the basis for
the P.L. 94-142 definition of emotionally disturbed,
called this exclusion "contradictory in intent and content
with . . . the research from which it came" (Bower, 1982).
The need for alternative education services for
disruptive students seems supported by reports of the
widespread existence of DSB. Individuals and institutions
reporting on the continuing crisis in school discipline
include the California Department of Education (1973),
the National Education Association (1975), the U.S.
Congress (Bayh, 1975; Tygart, 1980), the Michigan
Department of Education (Vergon & Williams, 1978), the
National Institute of Education (Feldhusen, 1978), Cross
and Kohl (1978), Duke (1978), the New York State United
Teachers (1979), and the National Education Association
(1980). The Safe School Study Report to Congress (National
Institute of Education, 1978) indicated 5,000 teacher
assaults per month occurred across the nation. The Gallup
Poll on Education (Gallup, 1984) continues to report lack
of student discipline as the number one concern of
Americans about the public school system.
In Florida, the Governor's Task Force on Disrupted
Youth (GTFDY) found 17,983 student-days lost to suspen-
sions over a 2-year period in the 10 school districts
studied (GTFDY, 1973, p. 11). An analysis of conduct code
violations in Duval County, Florida, schools for 1980-1981
revealed more than 33,000 violations resulting in 13,679
days lost from school (Moses, 1981).
The aversive consequences of chronic DSB for students
include lowered self-esteem and functioning level (Caliste,
1979); dropping out and underemployment (Grise, 1980; NEA,
1975; Safer, Heaton, & Parker, 1981); alienation
(Garbarino, 1980; Moyer & Motta, 1982); and criminal
activity (Edwards, Roundtree, Kent, & Parker, 1981;
Mitchell & Rosa, 1981). Likewise, from the perspective of
the school system DSB is undesirable, involving excessive
teacher attention (Rubel, 1977, Chap. 1), litigation
(Lufler, 1982), vandalism costs (Goldstein, Apter, &
Harootunian, 1984), teacher stress (Pettegrew & Wolf,
1982), and weakened public support (Amos, 1980). Conse-
quences for the community include criminal actions and
psychiatric referrals (Faretra, 1981). Levin (1972) esti-
mated the expense of inadequate education to be about 6
billion dollars a year (1972 dollars) for costs associated
with welfare and crime.
Researchers have identified the middle and junior
high school age student as particularly prone to behavior
disorder (Geiger & Turiel, 1983; Loeber, 1982; Nielsen &
Gerber, 1979; Quay, 1978). These studies suggest the
middle and junior high schools as a focus for identifying
and remediating disruptive school behavior. Unfortu-
nately, no adequate instruments are available specifically
for this population (Mesinger, 1982). Instruments
developed from clinical populations contain some items
irrelevant to the non-clinical population in the public
schools (Quay & Peterson, 1967). Instruments offered with
norms developed from research samples and no procedure for
developing local norms for disruptive behavior do not
consider the placement needs of local school districts
(Messick, 1980). Levels of disruptive behavior that can
be managed within the regular school environment vary
across settings because of differences in such factors as
facilities, experience of teachers and administrators, and
school board policies.
Current instruments fail to consider the widely
differing consequences of specific disruptive acts (Kane &
Bernardin, 1982). Some possible effects of this omission
may be to group together students whose behaviors differ
widely in their severity, to encourage conceptualizing all
disruptive behavior as equally deleterious, and to base
placement decisions on personal judgments about the seri-
ousness of a particular type of behavior. Neither does
any available instrument provide procedures for creating a
prescriptive profile of a student based on the authors'
conceptual model of disruptive school behavior (Salvia &
Ysseldyke, 1981). This failure may seriously limit the
interpretation and application of rating scale results.
Purpose of the Study
The purpose of this study was to develop and validate
an instrument, the Disruptive Student Behavior Scale
(DSBS). The DSBS would be used to assess quantitatively
the disruptive school behaviors of students referred for
placement in either special education or alternative education programs.
Need for the Study
Salvia and Ysseldyke (1981, pp. 443, 444, 450) have
called for norm-referenced instruments to support
placement decisions, evaluate student progress, evaluate
programs, provide intervention suggestions, and help
parents understand their children's abilities in relation
to other students. Reeves et al. (1978) called for
reliable instruments to use in placing handicapped
children. Also, Camp (1981) notes that
there is very little current, objective,
research-based information in existence to help
identify specific student behavior problems
occurring in the schools. A need exists for
research of this nature to quantitatively
establish the actual, current situation with
regard to student discipline problems in the
public secondary schools. (p. 48)
Presumably, these calls for reliable and valid instruments
apply both to special education and alternative schooling
programs, as both to some degree remove the student from
mainstream classroom activities. However, the Florida law
(State Board of Education Rule 6A-6.3017) providing for
special education programs for the socially maladjusted
was repealed July 24, 1981.
"Educational alternative programs" were created in
Florida in 1978 (Florida Statute 230.2315) specifically to
reduce disruptive behavior and truancy. Florida Statute
229.565 provides for the evaluation of "procedures for
identification and placement of students in educational
alternative programs." As an example of practice, in 1982
the alternative education program in the Pinellas County
School District did not require quantitative behavioral
assessment prior to placement.
Studies, however, have identified problems in using
subjective criteria for alternative education placement.
Disagreements in ranking behaviors (Pisarra & Giblette,
1981), value systems (Messick, 1980), labels applied to
students (Leyser & Abrams, 1982), teaching experience
(Rubel, 1977, p.51), level of frustration (Walker &
Holland, 1979), race (Arnove & Strout, 1978; Bennett &
Harris, 1982; Florida DOE, 1983; Goldsmith, 1982;
Mesinger, 1982), sex (Bennett & Harris, 1982), and
socioeconomic status (Arnove & Strout, 1978; NEA, 1975)
are variables that may confound perceptions of disruptive behavior.
One way to help neutralize these confounding vari-
ables is to use quantitative measures. A review of
current literature indicates that appropriate instruments
may not exist. After a major study of alternative educa-
tion programs, Mesinger (1982) was unable to recommend
even one instrument for use in selecting students.
Messick (1964, 1965, 1980) argued against applying to
local environments behavioral norms developed elsewhere.
Stott, Marston, and Neill (1975, p. 8), Wodarski and Pedi
(1978, p. 480), and Quay and Peterson (1975, 1979) advised
the setting of local norms. However, no instrument
located in this review provides a specific procedure for
determining local norms.
Another advantage of locally-developed norms is the
opportunity to compute the mean DSB level for individual
schools. Intervention program entry and exit criteria may
be defined by the deviation of an individual student's
mean DSB score from the school mean. This may provide the
type of quantitative assessment required by state (SBE
Rule 6A-6.3016) and federal (P.L. 94-142) law for special
education placement and may meet the need noted by
Mesinger (1982) for quantitative instruments to assist in
selecting students for alternative education programs.
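As a concrete illustration (not a prescribed procedure), the
local-norming logic described above might be sketched as
follows. The function names, the sample ratings, and the 1.5
standard deviation entry cutoff are hypothetical assumptions
for illustration only; they are not taken from the DSBS.

    # A minimal sketch of local norming: compute one school's
    # norm from a norming group's DSB ratings, then classify a
    # student by the deviation of his or her score from that
    # norm. All names and the cutoff are illustrative.
    from statistics import mean, stdev

    def compute_local_norm(norming_scores):
        """Mean and standard deviation of one school's
        norming-group DSB ratings."""
        return mean(norming_scores), stdev(norming_scores)

    def flag_disruptive(student_score, norm_mean, norm_sd,
                        cutoff=1.5):
        """Express the student's deviation from the school
        mean as a z-score and compare it with an entry
        criterion."""
        z = (student_score - norm_mean) / norm_sd
        return z >= cutoff, z

    # Example with invented ratings for one school.
    norming_group = [12, 8, 15, 10, 9, 14, 11, 7, 13, 10]
    m, sd = compute_local_norm(norming_group)
    exceeds, z = flag_disruptive(31, m, sd)

The same deviation test, applied in reverse, could serve as a
program exit criterion.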
A major need in intervention programs is prescriptive
information (Lovitt, 1967, p. 238; Spivack & Swift, 1977).
However, many instruments do not provide operationally-
defined items which are useful in the classroom. For
example, the Behavior Problem Checklist (Quay & Peterson,
1979) items used to identify conduct problem students
include "restlessness," "disruptiveness," and "irresponsi-
bility." These items originally were taken from the files
of a child guidance clinic (Quay, 1977).
Defining disruptive behavior on the dimensions of
type, frequency, and severity has received support from
numerous sources (American Psychiatric Association, 1980,
p. 45; Bernardin, LaShells, Smith, & Alvares, 1976; Camp,
1980, 1981; Grosek, 1979; Taylor, Warren, & Slocumb,
1979). Criticisms of assessment procedures not incorporat-
ing a severity factor have been made by Kane and Bernardin
(1982) and Pisarra and Giblette (1981). Nevertheless, no
instrument was located which specifically recommended
using a severity factor in assessing disruptive school behavior.
An instrument which provides for quantifying DSB may
help to protect students from placement in school programs
according to inappropriate criteria. To be most effec-
tive, the instrument should include provisions for
establishing locally-determined placement norms, for
comparing with those norms the scores of individual
students, for providing prescriptive information, and for
systematically considering the type, frequency, and
severity of the disruptive behaviors.
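A minimal sketch of such a type-by-frequency-by-severity
scoring rule follows. The behavior labels, severity weights,
and five-point frequency ratings are invented for
illustration; they are not actual DSBS items or weights.

    # Each item's rated frequency is weighted by a severity
    # factor, so behaviors judged more detrimental raise the
    # total score more steeply. All values are hypothetical.
    severity_weights = {
        "talking out of turn": 1,      # mild
        "defying the teacher": 3,      # moderate
        "hitting another student": 5,  # severe
    }

    def dsb_score(frequency_ratings):
        """frequency_ratings maps a behavior to its rated
        frequency, e.g., 0-4 on a five-point scale."""
        return sum(freq * severity_weights[behavior]
                   for behavior, freq in frequency_ratings.items())

    ratings = {"talking out of turn": 4,
               "defying the teacher": 1,
               "hitting another student": 0}
    total = dsb_score(ratings)  # 4*1 + 1*3 + 0*5 = 7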
Significance of the Study
This study investigated the theoretical position that
disruptive school behavior (DSB) can best be described
in terms of its type, frequency, and severity. Theoretical
considerations in the use of teachers as observers and
raters of disruptive school behavior were discussed.
The feasibility of using teacher-generated behavioral
statements from disciplinary referrals to better specify
the parameters of DSB was suggested. A review of various
rating scale development procedures attempted by business,
industry, and government was summarized.
The instrument developed by this study will initially
be most appropriate as a research tool for conducting
studies of DSB. The availability of a process for
establishing local norms for DSB may facilitate local
research studies in evaluating the effectiveness of
disciplinary measures, in-service training, and alterna-
tive education programs. This study will likely suggest
additional areas for other investigations.
The identification of disruptive students for inter-
ventions is not standardized. This instrument may assist
in establishing quantitative criteria for selection,
placement, and treatment of disruptive students. This, in
turn, may lead to recognition of DSB as a category for
exceptional student education funding.
A major premise in much of the literature concerning
DSB is the role of school personnel in exacerbating
disruptive behavior. It may be that an instrument which
provides a behavioral profile of the disruptive student
will suggest goals for in-service training programs.
Definition of Terms
For the purposes of this study, the following definitions apply:
Alternative education program. An educational
procedure which provides intervention outside the regular
classroom for students exhibiting some predetermined level
of disruptive or disinterested school behavior.
Disruptive school behavior (DSB). Behavior that
disrupts the learning of self and/or others and is not
attributable to severe emotional disturbance or other
exceptional education categories.
Delinquent behavior. Behavior by persons under 18
years of age which violates laws and regulations pertain-
ing to them.
Exceptional child (student) education programs.
Programs which receive additional funding in order to
better serve the needs of students meeting governmental
guidelines for special assistance.
Experienced teachers. Full-time, regular classroom
teachers who have held that position for at least two academic years.
Expulsions. Removal from school for at least the
remainder of the school year.
Locally developed norms. Criteria for comparing an
individual student's DSB with the expected DSB of a
specific reference population in the local school or school district.
Maladaptive social behavior. Behavior not of organic
origin which would be judged by impartial observers to be
inappropriate for the social situation and which ulti-
mately results in aversive consequences for the person
exhibiting the behavior.
Method bias. The influence on ratings of the type of
rating method used.
Non-quantitative assessment. See Qualitative assessment.
Qualitative assessment. Evaluation based on individ-
ual opinion and lacking a systematic basis.
Quantitative assessment. The use of numbers in
describing behavior so that a higher number indicates a
higher level of the behavior.
Severity. A prediction, stated quantitatively, of
the potentially detrimental consequences a disruptive
behavior would likely have for a student.
Special education programs. See Exceptional child (student) education programs.
Suspensions. Temporary removal from the regular
educational program of a school, usually involving
exclusion from school facilities for a specified number of days.
Organization of the Study
There are four remaining chapters in this
dissertation. Chapter Two will present a review of the
literature related to the development of an instrument to
assess disruptive school behavior (DSB). Specifically,
consideration will be given to disruptive behavior in the
schools, existing assessment methods, rating scale develop-
ment, the psychometric properties of rating scales, and
the possible uses of results from a disruptive behavior scale.
Chapter Three will present the methodology employed
in the development, validation, and field testing of the
Disruptive Student Behavior Scale (DSBS). Included are
the research questions, information on the population,
procedures used in developing the scale, pilot testing,
data analyses, and possible limitations of the study.
Chapter Four will present the results of this study,
including the data and the information inferred from the
data. An explanation of the results will be given and
they will be related to past research.
Chapter Five will include conclusions from this
study, along with implications for theory, research,
practice, and training. A summary of the entire study
will be presented, followed by recommendations for
additional research.

CHAPTER TWO
REVIEW OF LITERATURE
This study requires an investigation of the history
and current status of attempts to define disruptive
behavior in public schools; identification, assessment,
and placement efforts directed toward disruptive students;
rating scale development procedures; research into the
psychometric properties of rating scales; and the use by
schools of results obtained from rating scales. Accord-
ingly, this chapter will review research and opinion
covering both theoretical and applied considerations
relating to these topics.
Definition of Disruptive School Behavior (DSB)
According to Camp (1981), the major issue in student
discipline in the secondary schools is how to describe
quantitatively the kinds of disruptive behavior currently
occurring. Summarizing a 1978 survey of state directors
of special education, Hirshoren and Heller (1979) reported
that while individual states define emotional disturbance
consistently, there is considerable variation in the kinds
of children so identified. That is, children meeting
program criteria in one state appeared to be excluded in
another. Much has been written in an attempt to resolve
this situation. A review of the literature suggests the
emergence of five discrete perspectives: (a) empirical,
(b) clinical, (c) conceptual, (d) educational, and (e) school.
The empirical approach of applying factor analysis
(Cattell, 1978; Gorsuch, 1974) to a variety of items has
resulted in the identification of some common behaviors
associated with disruptive school behavior and has contri-
buted to defining DSB (Achenbach, 1978; Achenbach &
Edelbrock, 1978; Edelbrock, 1979; Peterson, 1961; Quay,
1964, 1978; Quay & Peterson, 1967). However, researchers
utilizing the empirical approach have included a broad
range of behaviors, including many which identify delin-
quency and personality disorders (Freemont & Wallbrown,
1979), and so the scales developed from these studies have
limited application for school personnel in defining the
specific category of DSB.
The classification of disorders contained in the
Diagnostic and Statistical Manual of Mental Disorders
(3rd ed.) (DSM-III) (American Psychiatric Association, 1980)
and research studies incorporating these classifications
and descriptions exemplify clinical efforts to define
disruptive school behavior. Hewett and Forness (1982)
pointed to the necessity of finding a common frame of
reference between educational and psychiatric diagnoses in
order for school personnel to accurately interpret
clinical reports. Forness and Cantwell (1982) concluded
that the respective diagnostic systems of psychiatry and
special education remain dissimilar. Likewise, other
studies (Loeber, 1982; Werry, Methuen, Fitzpatrick, &
Dixon, 1983) failed to find support for the use of
psychiatric diagnoses to assign students to special education programs.
The conceptual approach utilizes experience,
research, and opinion in formulating descriptions of what
is usually referred to in this perspective as "problem
behavior" (Jessor & Jessor, 1977, p. 4). Cullinan (1975),
Howell (1978), and Richard Jessor (1982) are among those
applying a psychosocial conceptualization of problem
behavior to the study of adolescent behavior. Neverthe-
less, while the conceptual perspective gives support to
the notion of comparing the behavior of an individual
student with the behavior of peers before declaring the
student to be deviant, this perspective fails to provide
specific criteria for making such a comparison.
The educational perspective includes the definitions
contained in federal and state statutes, guidelines
proposed by governmental agencies, and district codes of
student conduct. In 1977, the U.S. government, without
defining the term, specifically excluded the socially
maladjusted student from receiving exceptional child
education services under P.L. 94-142. The term "socially
maladjusted" is not defined in the latest Florida guide-
lines for providing special education for exceptional
students (Florida DOE, 1985). The U.S. Bureau of Educa-
tion for the Handicapped has sponsored the compilation of
a manual on behavior disorders (Yard, 1977). However,
these items are too general for use in a quantitative assessment.
Codes of student conduct contain lists of behaviors
for which punishment may be administered. Offenses listed
in the codes may be violations of either school rules
(e.g., inappropriate display of affection) (Duval County
Public Schools, 1980) or of law (e.g., vandalism) (Pinellas
County Schools, 1983). While these offenses must be
considered in defining disruptive school behavior, they
exclude many of the disruptive behaviors frequently
occurring within the classroom. Federal, state, and local
guidelines seem insufficient for operationally defining
DSB specifically enough to be useful in a selection procedure.
The school perspective focuses on the interactions of
students, teachers, and administrators within schools.
Disruptive school behavior is seen as a product of these
interactions. H. M. Walker, author of The Walker Problem
Behavior Identification Checklist (1970), described the
acting-out child as one who usually defies rules and
ignores classroom procedures, is difficult to manage,
avoids failure by attempting little academic work, and
alienates teachers and other students by behaving aggressively.
Specific behaviors often include hitting, yelling,
leaving seat, arguing, having temper tantrums, and provok-
ing others and often lead to confrontations. These
confrontations may be verbal, physical, or both. Acting-
out behavior may occur in the classroom, in nonclassroom
areas, or both. Walker (1970) proposed that acting-out
children are differentiated from other students by the
frequency, or quantity, of these behaviors, not by the
type of behaviors. Thus, a measuring instrument must
provide for a frequency component.
Camp (1981) explored the types of behavior considered
to be disciplinary problems, the perceived degree of
severity of these behaviors, and the frequency with which
these behaviors were observed. Camp found that the types
of behaviors rated most serious were rarely observed and
concluded that the most serious problem may be the
frequent, though mild, behaviors that undermine student
and teacher morale. A study of 21 secondary school
administrators' attitudes toward aggressive behavior
suggested that suspensions were awarded according to the
administrators' attitudes toward the referred behavior,
rather than according to a consistent standard for the
school district (Pisarra & Giblette, 1981).
An evaluation of literature of the school perspective
suggests that DSB can be defined in terms which students,
teachers, and administrators understand; that the three
factors of type, severity, and frequency need to be
considered; and that measures of DSB need to be standard-
ized. In this section five perspectives for defining
disruptive school behavior were presented. Each perspec-
tive offers some assistance in differentiating this
category from other behavioral categories. There appears
to be support for an instrument which operationally
defines types of behaviors occurring throughout the school
environment, assigns a quantity to each descriptive item
based on the perceived frequency of occurrence and sever-
ity, and provides for comparing the score of an individual
student to a predetermined norm for that environment.
Identification, Assessment, and Placement
"Measurement is the construction of a model of some
property of the world" (Fraser, 1980, p. 27) and in
education this property is often the behavior of a
student. One role of the model provided by a measure is
to give accurate prescriptive information for planning
interventions with students (Forness, 1983). Several
studies have suggested this is being performed
inadequately (Greenwood, Walker, & Hops, 1977; Schenck,
1980; Sinclair, 1980; Sinclair & Kheifets, 1982; Spivack &
Swift, 1973; Strain, Cooke, & Apolloni, 1976).
Fraser (1980) acknowledged that psychological mea-
surement has been regarded as being quantitatively and
qualitatively of a lower order than physical measurement.
To achieve improvement, Ysseldyke and Marston (1982) have
argued for the use of direct observations of target behav-
iors by either teachers or trained observers. However,
Jones, Reid, and Patterson (1975) found observer reli-
ability varied inversely with the complexity of the
behaviors being observed.
Attempts to improve the validity of observations have
included such sophisticated approaches as Multidimensional
Scaling (MDS) (Torgerson, 1958). Sanson-Fisher and
Mulligan (1977), using adolescent student models, found
only marginal improvement for this technique over ratings
by classroom teachers. A comparison of a computer-driven
program for selecting behavioral/emotional disorders with
two expert psychologists' selections indicated no mean-
ingful differences existed (McDermott & Hale, 1982).
Weinrott (1979) summarized studies that indicated global
ratings could be significantly influenced by expectations,
while post hoc ratings of the same children by the same
raters when recorded on an instrument accurately reflected
discrete behavioral events. Gaynor and Gaynor (1976)
argued for instruments written to define behaviors so they
may be described quantitatively by teachers.
Beltramini (1982) suggested that scale-item content
is more important than other variables in obtaining reli-
able and valid results. A review by Albaum, Best, and
Hawkins (1981) of measurement literature found evidence to
support the use of from five to seven categories on Likert-
type scales, with no significant losses in reliability,
validity, or discrimination when compared with instruments
using more intervals. Fewer intervals sometimes resulted
in a loss of discriminative power and validity. It
appears that teachers using instruments which operation-
ally describe disruptive behaviors can be effective post
hoc raters and are able to provide reliable and valid
identification of disruptive school behavior (Edelbrock,
1979; Gresham, 1982; O'Leary & Johnson, 1979).
A review of current assessment techniques suggests
the emergence of a quantitative/qualitative dichotomy,
which will now be explored. In two reviews (Spivack &
Swift, 1973, 1977) of instruments for measuring secondary
school classroom behaviors no instrument was located which
limits its focus to disruptive school behavior, uses only
behaviorally-stated items, and provides for calculating
local norms. Descriptions follow of representative
instruments currently in use.
The Behavior Problem Checklist (BPC) (Quay & Peterson,
1967, 1975, 1979) is a 55-item scale of behavioral traits
developed from a review of clinical records of kinder-
garten through eighth grade students referred for
psychiatric treatment (Quay, 1977). The items were
assigned by factor analysis to four scales plus a grouping
suggestive of psychosis. Epstein, Cullinan, and Rosemier
(1983, p. 172) and Gresham (1982, p. 137) reported that
the BPC is one of the behavior rating scales most widely
used in school studies.
The BPC has been used extensively both as a research
device (Eaves, 1975; Jacob, Grounds, & Haley, 1982;
Kelley, 1981; Touliatos & Lindholm, 1981) and in selecting
students for interventions (Algozzine, 1977; Balow, 1979;
R. Bower, 1969; Gerard, 1970; Ingram, Gerard, Quay, &
Levinson, 1970; McCarthy & Paraskevapoulas, 1969). Jacob
et al. (1982) reported that reviews of studies utilizing
the BPC suggested reliability and validity issues in need
of further study. The inability of the BPC to provide
other than broadband classifications has been noted
(Achenbach & Edelbrock, 1978).
Comprehensive normative data are not available for
the BPC for adolescents (Kelley, 1981). In an investiga-
tion of the effects of race on BPC ratings, Eaves (1975)
found that white teachers consistently rated black
students higher than white students on three of the
subscales. Black teachers showed no such bias. Eaves
(1975) concluded this bias could have a major effect on
the reported norms for the BPC. Touliatos and Lindholm
(1981) found that grade level, sex, and social class had a
significant effect on BPC ratings. However, differences
between schools and teachers contributed more variance in
the BPC ratings than grade, sex, and social class.
Touliatos and Lindholm (1981) suggested that Quay and
Peterson's (1967) recommendations be followed and
individual assessment be based on norms calculated for
particular schools and individual teachers.
Spivack and Swift (1973) concluded that the BPC was a
reasonably reliable measurement tool. Potential users
were cautioned, however, that most items are not specifi-
cally observable, but more like labels which imply
behaviors and designate traits. Likewise, Stott (1971,
p. 232) cited certain BPC items as requiring a teacher to
make inferences about students' feelings (e.g., "feelings
of inferiority"), being vague or ambiguous (e.g., "oddness,
bizarre behavior"), and relating to behaviors unobservable
by a teacher (e.g., "stays out late at night," "bed
wetting"). This review has identified several areas of
the BPC for which additional research has been suggested.
The Behavior Rating Profile (BRP) (Brown & Hammill,
1978) is composed of five rating scales and a sociogram.
Three of the scales (60 items) are completed by the target
student, one (30 items) by the teacher, and one (30 items)
by parents. The sociogram is a peer nominating techni-
que. The student scales provide self-ratings of behaviors
at home, at school, and with peers.
The BRP is based on an ecological approach which,
according to the authors, recognizes that students'
behaviors are dependent on the settings in which they
occur. Its purposes are the identification of students
with behavior problems and the differentiations among
learning disabled, emotionally disturbed, and behaviorally
disordered students in grades 1-12. Each of the six
measures is described as independent and individually
normed, allowing any scale to be used alone or in conjunc-
tion with any of the others.
The BRP manual (Brown & Hammill, 1978) reports
internal consistency reliability coefficients exceeding
.80. Concurrent validity was investigated by correlating
the BRP with measures obtained from other rating scales.
Adequate construct and content validity also are reported
by the authors. Norms are provided using scale scores
with means of 10 and standard deviations of 3, with scores
from 7 to 13 considered to be in the normal range.
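This scaled-score convention can be expressed directly: a raw
score is converted to a z-score and mapped onto a scale with
mean 10 and standard deviation 3, so that 7 to 13 covers one
standard deviation on either side of the mean. In the sketch
below the raw-score mean and standard deviation are invented
inputs, not BRP norms.

    # Convert a raw score to the BRP-style scaled metric
    # (mean 10, SD 3) and test the reported normal range.
    def to_scaled(raw, raw_mean, raw_sd):
        z = (raw - raw_mean) / raw_sd
        return 10 + 3 * z

    def in_normal_range(scaled):
        return 7 <= scaled <= 13

    s = to_scaled(raw=42, raw_mean=36, raw_sd=6)  # z = 1.0
    print(s, in_normal_range(s))                  # 13.0 True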
One study (Reisberg, Fudell, & Hudson, 1982) of
behavior disordered students indicated that regular
classroom teachers gave higher ratings than special
educators (X̄ = 8.85 vs. X̄ = 6.87). Thus, norms may vary
according to the type of respondent (e.g., regular teacher
or special education teacher). Also, students' self-
ratings were inflated relative to other respondents'
ratings. Other investigators have noted problems
associated with attempts at multiple and self-ratings.
Lessing and her associates (Lessing & Clarke, 1982;
Lessing, Williams, & Gil, 1982; Lessing, Williams, &
Revelle, 1981) have reported on their unsuccessful
attempts to develop parallel checklists for use by
parents, teachers, and clinicians in psychiatric diag-
noses. Lobitz and Johnson (1975) found low correlations
between parent ratings and observed behaviors. Variables
confounding self-ratings include halo effect (Holzbach,
1978), social desirability (Dunnett, Koun, & Barber, 1981;
Seidman, Rappaport, Kramer, Linney, Herzberger, & Alden,
1979), and lack of self-knowledge (Beitchman & Raman).
Ledingham, Younger, Schwartzman, and Bergeron (1982)
investigated teacher, peer, and self-ratings of 801
elementary school students. Self-ratings yielded the
lowest ratings for deviant behavior, aggression, and
withdrawal and the highest ratings for likability. Accu-
racy of self-evaluation has been found to be positively
correlated with high intelligence, high achievement
status, and internal locus of control, characteristics not
usually associated with DSB (Dunnett et al., 1981).
Reported research using the Behavior Rating Profile is
sparse. Additional verification of the assumptions of
equivalency of norms within respondent categories and the
validity of the self-report scales seems indicated.
The Bristol Social Adjustment Guides (5th ed.) (BSAG)
(Stott, 1972) consist of 110 behaviorally-stated items
from which teachers select those descriptive of a
student's behavior in the month prior to the rating. The
items were originally developed in 1955 from clinical
observations of children aged 6 to 14 and modified by
classroom teachers (Stott & Sykes, 1956). A primary goal
was to incorporate context into the behavioral descrip-
tions (Stott, 1971).
The BSAG has been used extensively in clinical and
research studies (Davis, Butler, & Goldstein, 1972;
McDermott, 1980; Stott, 1978; Stott & Wilson, 1977).
Reliability and validity data were obtained through
extensive research (Stott et al., 1975) but are not
reported in a manner that is easily abstracted. Normative
data are available only for elementary school populations
(Stott, 1972). More recent research (McDermott, 1980,
1981; McDermott & Hale, 1982) has questioned the
specificity of the core syndromes of the BSAG and called
for further investigation of construct and predictive
validities (Hale & Zuckerman, 1981). At present, it
appears that not all of the core syndromes of the BSAG
have the specificity required in an instrument to be used
in educational placement.
The Hahnemann High School Behavior Rating Scale (HHSB)
is a 13-factor, 45-item scale published in 1971 (Spivack &
Swift, 1971). The HHSB items were developed from observa-
tions of actual classroom behaviors, operationally stated
in educational terms. The items cover both academic and
interpersonal issues and can be rated by teachers in the
classroom. The intent is to provide prescriptive informa-
tion (Spivack & Swift, 1977). The factor scores for each
student are found by adding the raw scores for the three
or four items comprising each factor. These scores are
then combined into a profile, which is used to classify
students on the basis of their ability to adapt to the
total classroom environment.
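The scoring procedure just described amounts to summation
within factors, as the following sketch shows. The factor
names and item assignments are invented placeholders, not the
published HHSB factors.

    # Sum the raw ratings of the three or four items assigned
    # to each factor; the factor scores form the profile.
    factor_items = {
        "academic disinterest": [1, 7, 12],
        "verbal interaction": [3, 9, 14, 20],
    }

    def profile(item_ratings):
        """item_ratings maps item number to raw rating."""
        return {factor: sum(item_ratings[i] for i in items)
                for factor, items in factor_items.items()}

    ratings = {1: 2, 7: 3, 12: 1, 3: 4, 9: 2, 14: 3, 20: 1}
    print(profile(ratings))
    # {'academic disinterest': 6, 'verbal interaction': 10}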
According to the authors (Spivack & Swift, 1973),
validity studies suggest consistent and significant
relationships between factor scores and academic grades.
No data are available on test-retest or interrater reli-
ability (Spivack & Swift, 1973). Norms are available
separately for suburban and urban samples. The HHSB is
limited as a selection device for special education pro-
grams by lack of reliability data, use of only three or
four items per factor, and overlapping among profile categories.
The Behavior Evaluation Scale (BES) (McCarney, Leigh,
& Cornbleet, 1983) is a 52-item rating scale for use by
school personnel. Each item is assigned to a subscale
associated with one of the five characteristics of the
Bower (1958) definition of behavior disorders used in
Public Law 94-142. The BES was developed to aid in
diagnosis, placement, and program planning under federal
guidelines. Since federal criteria specifically exclude
the "socially maladjusted" student, the BES is inappropri-
ate for assessing DSB.
The Portland Problem Behavior Checklist (PPBC)
(Waksman & Loveland, 1980) was developed to aid in
assessment, evaluation, and intervention planning for
school children. The 29 items cover teacher-rated
behaviors for grade levels K-12. Norms are not avail-
able. Items are very generally stated (e.g., aggressive-
physical, destructive) and are rated on a scale of 0 (no
problem) to 5 (severe). It is not clear if this is a
rating of frequency of behavior or severity of the
consequences of the behavior. These features of the PPBC
would seem to limit the preciseness and reduce the confi-
dence level of quantitative scores intended to support
evaluation and placement for professional services.
The Pupil Classroom Behavior Scale (PCBS) (Dayton,
1967) is a 24-item, teacher-administered rating scale
intended to measure the effectiveness of special education
services for students displaying inappropriate classroom
behaviors. Most items are behaviorally stated and yield a
profile of three factors: achievement orientation, socio-
academic creativity, and socio-cooperativeness. Dayton
(1967) suggested using the scales for research on groups
rather than to describe individual students. Norms are
not available. Spivack and Swift (1973) concluded that
the PCBS is flawed by having overlapping items in the
factors and lacking data to support a relationship between
scale scores and emotional adjustment.
The 36-item Conners Teachers' Rating Scale (CTRS)
(Conners, 1969) has been used primarily in clinical diagno-
sis of children, particularly in the area of hyperactivity
(Goyette, Conners, & Ulrich, 1978). It does, however,
cover a wide range of school problem behaviors (Roberts,
Milich, Loney, & Caputo, 1981). There appears to be a
high intercorrelation between the Conduct Problem and
Hyperactivity subscales, limiting the usefulness of the
CTRS in identifying DSB.
The Brief Behavior Rating Scale (BBRS) (Kahn &
Ribner, 1982) was developed from the Devereux series of
rating scales (Spivack, Haimes, & Spotts, 1967). A cross-
validation study (Kahn & Ribner, 1982) reported that 61%
of a socially maladjusted group and 27% of an emotionally
handicapped group were correctly identified. These
results suggest that additional development is needed
to obtain support for the discriminant validity of the BBRS.
Some of the most complete research in instrument
development has been conducted in attempts to improve the
diagnosis of clinical populations in the school environ-
ment. Although these efforts are not directly comparable
to the intent of the present study, six instruments having
potential interest to researchers working in the school
setting will be summarized.
The Child Behavior Check List (CBCL) (Achenbach,
1978) contains 118 behavior problem items and 20 social
competence items. Parallel forms exist for parents and
teachers. A review by Achenbach and Edelbrock (1978) of
empirical attempts to derive syndromes of child behavior
problems concluded with the recommendation that these
efforts be linked to the existing mental health system.
Recent efforts by these researchers and their associates
(Edelbrock & Achenbach, 1980; Reed & Edelbrock, 1983)
continue to pursue this objective. At present the
applicability of this instrument for educational measure-
ment is limited.
The role of parent observations in describing chil-
dren's behavior is formalized in the Louisville Behavior
Check Lists (Miller, 1967, 1980). A study (Tarte, Vernon,
Luke, & Clark, 1982) confirmed the validity of parent
observations of clinical symptoms in their children.
The items require inferences and judgments by raters.
Eight subscales were created through factor analysis and
although several appear to relate to school activities
(e.g., hyperactivity, antisocial), the content of
individual items comprising the subscales renders them
only marginally useful for school assessments.
The Children's Behaviour Questionnaire (Rutter, 1967)
was developed for teachers' use in screening for psychi-
atric assessment large numbers of school children. Many
of the 26 items are vaguely stated and some appear to
require inferences by the rater. The two subscales are
labeled neurotic and antisocial, terms which lack direct
application to the school setting.
The Devereux Adolescent Behavior Rating Scale
(Spivack et al., 1967) was developed to measure behavior
requiring professional intervention. The subscales are
oriented to clinical diagnosis and offer little specific
information for use in placement decisions.
The Pupil Behavior Inventory: 7-12 Grades (Vinter,
Sarri, Vorwaller, & Schafer, 1966) is a 34-item, teacher-
administered rating scale intended to furnish information
on students referred for agency treatment. Behavioral
items were collected from teachers, screened and factor-
analyzed, and grouped into five factors. Lack of data on
reliability, validity, and norms suggests caution in
using this instrument to select students for special
services (Spivack & Swift, 1973).
The Mooney Problem Check List (MPCL) (Mooney, 1942),
has been widely used by counselors to identify problems of
individuals seeking counseling or to explore the problem
profile of a group of students (Sundberg, 1961). However,
two studies (Joshi, 1964; Stewart & Deiker, 1976) of the
underlying factors of the MPCL scales have identified only
a single general factor. The MPCL may be further limited
by utilizing items generated from problems mentioned by
high school students in 1942.
Several instruments designed for other populations
include behaviors often used in descriptions of disruptive
school behavior. The Adolescent Behavioral Classification
Project instrument (Dreger, 1980) was developed for
assessing problems of institutionalized adolescents. An
analysis of the first-order factors indicates some common-
alities with both the Hahnemann High School Behavior
Rating Scale (Spivack & Swift, 1977) and Achenbach and
Edelbrock's (1978) syndromes, but many are couched in
clinical terms that have little or no relevance to the school setting.
Ostrov and associates (Ostrov, Marohn, Offer, Curtiss,
& Feczko, 1980) developed and validated the Adolescent
Antisocial Behavior Check List (AABCL) for delinquents
housed in an institutional treatment setting. The authors
called for modification of the instrument for use in other
settings; however, extensive rewriting of items would seem
to be required.
The Jesness Inventory (Jesness, 1972) was created to
measure attitude change in youthful offenders undergoing
treatment. One study (Graham, 1981) found the Jesness
Inventory did not have the power to discriminate between
non-adjudicated and normal populations and thus would not
be useful in a school setting. The Jesness Inventory
appears best suited for research (Buros, 1978).
The Jesness Behavior Checklist (JBC) (Jesness, 1970)
is also a measure of delinquent behavior. The reliability
and validity of this instrument have been questioned and
the JBC is recommended only for research purposes (Buros,
1978, pp. 873-876).
Non-quantitative assessment often uses nonsystematic
observations to provide the information from which judg-
ments will be made. Judgments about individuals are
required in all assessment. Inaccurate, biased, or sub-
jective judgments can be misleading and harmful (Salvia &
Ysseldyke, 1981). The Russell Sage Foundation Conference
Guidelines (Goslin, 1969) and the 1974 Family Educational
Rights and Privacy Act (P.L. 93-380--the Buckley amend-
ment) established guidelines for the proper collection,
maintenance, and dissemination of data concerning students.
For data to be used in making judgments, they must be
verified. For standardized tests, this verification is
implicit in the psychometric qualities of the instrument.
For observational data, verification requires confirmation
by persons other than the original observers (Salvia &
Ysseldyke, 1981). When the observation is nonsystematic,
verification may be difficult to establish and support and
the assessment and resulting evaluation may be open to challenge.
After a classroom teacher nominates a child for
evaluation for exceptional child education services, that
teacher's observation is verified by required legal proce-
dures (P.L. 94-142). There may be no such procedures for
other interventions. The Duval County, Florida, School
District has used teacher and principal nominations as the
criteria for admittance and dismissal from a program to
intervene with students displaying inappropriate social
behaviors (Duval County Public Schools, 1980). Short-term
suspensions in many school districts do not require hear-
ings and are based solely on a judgment by the school
principal (Lines, 1972; Pisarra & Giblette, 1981).
Subjective assessment practices such as these may
allow extraneous variables to influence judgments
(Poulton, 1976). Four such variables are bias, the influ-
ence of observer expectations, inaccurate perceptions,
and vagueness of the criteria for intervention.
Pupil characteristics were found by Ysseldyke and
Marston (1982) to influence rater bias. Variables
contributing to bias include perceived physical attrac-
tiveness (Ross & Salvia, 1975); sex, socioeconomic status,
and reason for referral (Matusek & Oakland, 1979;
Ysseldyke & Algozzine, 1982; Ysseldyke, Algozzine, Regan,
& McGue, 1979, 1981); race (Florida Department of
Education Report on Public Schools, 1983; Sikes, 1975);
type of behavior displayed by the student (Algozzine,
1980); and the theoretical orientation of the observer
(Messick, 1980; Salvia & Ysseldyke, 1981).
Erickson (1974) and Shuller and McNamara (1976) found
naive observers' reports coincided with experimenter-
induced expectancies about problem behavior. After
observing decisions made by educators, Weinrott (1979);
Ysseldyke, Algozzine, and Richey (1982); and Algozzine and
Ysseldyke (1981) speculated that these judgments were
influenced by an expectancy factor created by the
situation itself. A more direct measure of expectation
was reported by Green and Brydon (1975). They found
teachers' attitudes were much more favorable toward
middle-income children than low-income children and that
43% of teachers' comments about black children were
negative as opposed to 17% of comments about white children.
Dunlap and Dillard (1980) investigated 164 school
principals' perceptions of the factors indicative of
emotional disturbance in children. The factor least
chosen by the principals was the one considered by the
researchers most predictive of emotional disturbance.
The vagueness of criteria for suspension in one
school district was investigated by Pisarra and Giblette
(1981). They found the criterion to be improper conduct,
which was not further defined. The researchers concluded
that a student reported for fighting would be suspended,
possibly suspended, or not suspended depending on the
individual administrator who had jurisdiction.
A few of the possible sources of error in nonsystem-
atic observation leading to inaccurate, biased, or
subjective judgments have been presented to suggest their
ubiquitous nature and the necessity of providing for
systematic observations in judgments leading to educa-
tional placement decisions.
Rating Scale Development
Designing a rating scale requires addressing four
major issues: (a) what to measure (parameters), (b) how
to measure (item content and format), (c) how to record
(response format), and (d) how to interpret the results
(statistical analysis). Literature pertaining to these
issues will be reviewed in this section.
In a frequently cited longitudinal study of deviant
behavior, Robins (1966) found the variables of type of
behavior, frequency of occurrences, and severity of
consequences to be indicators of future behavior pat-
terns. More recent studies supporting these criteria
include those of Kohn, Koretzky, and Haft (1979); Camp
(1980); Forness and Cantwell (1982); Gresham (1982);
Loeber (1982); and a United States Department of Justice
report (1982, p. 1).
The types of behavior to be measured by a rating
scale are determined by its authors, who must consider
content, sources, format, number, and order of presenta-
tion of the items to be included. Halo effects, or the
tendency to rate individuals holistically (Thorndike,
1920, p. 25; Willingham & Jones, 1958), were found by
Cooper (1981, 1983) to be reduced by having more specific
item content. Kreitler and Kreitler (1981) demonstrated
that items deemed irrelevant by raters tended to be scored
neutrally, thus limiting the derived information. Never-
theless, scales for rating disruptive behavior sometimes
include prosocial behavior content (Miller, 1980).
However, Deno (1979) suggested that to observe non-
disruptive behavior ignores the purpose of these ratings,
i.e., to determine whether inappropriate behaviors are
actually excessive. Schriesheim and Hill (1981) mixed
positive and negative statements on a questionnaire and
concluded that the effect was to reduce response validity.
Many scales do limit their items to behaviors that focus
on problem behavior (DiPrete, 1981; Duke, 1978; Governor's
Task Force on Disrupted Youth, 1974; Spivack & Swift, 1966;
Walker, 1979, p. 55), although not necessarily school
problems. Camp (1980) suggested that only school problems
directly observable by teachers and/or administrators be
included in scales for rating disruptive school behavior.
Logically, items taken from the setting in which the
ratings will be made best meet the criteria for relevant
content. Smith and Kendall (1963) used this premise in
devising Behavioral Expectation Scales (BES). Numerous
examples exist of the application of this premise in
education (Brown & Hammill, 1978; Camp, 1980; Duval County
School Board, 1979; Ross, Lacey, & Parton, 1965; Sherry,
1979; Spivack & Swift, 1977; Stott et al., 1975), mental
health (Kaufman, Swan, & Wood, 1979; Kohn et al., 1979;
Lachar & Gdowski, 1979; Miller, 1980) and industry (Vance,
Kuhnert, & Farr, 1978).
Item format refers to the various forms used in
presenting the information to which the rater is asked to
respond. It is often related to response format, which
refers to the methods of collecting information from the
raters. Response format literature will be presented in
the section covering the frequency characteristic.
Four types of item formats are currently in use in
behavioral rating scales. Behavioral Observation Scales
(BOS) describe the target behavior in specific terms that
require direct observation at the time the rating is made
(Latham & Wexley, 1977). Behaviorally Anchored Rating
Scales (BARS) provide a specific description of a behavior
for each successive rating point (anchor) of an item and
assess cumulative behavior over some time period (Smith &
Kendall, 1963). The Mixed Standard Scale (MSS) uses sev-
eral scales, with three levels of behavioral description
for each trait to be measured, and randomizes the order of
presentation (Blanz & Ghiselli, 1972).
Summated rating scales (Edwards, 1957), also referred
to as Likert (1932) scales (LT) or graphic rating scales
(Waters, Reardon, & Edwards, 1982), present for each item
one statement that may be specific or general. Likert
scales have been used with both direct and deferred obser-
vation. BOS scales are developed using summated rating
procedures (Likert, 1932), while BARS and MSS use the
Thurstone (Thurstone & Chave, 1929) scale development
process (Bruvold, 1969).
Conflicting conclusions have resulted from numerous
investigations into the advantages and disadvantages of
these scale formats. Fay and Latham (1982) found BOS to
be superior to BARS in rating video-taped behavior during
job interviews. However, Murphy, Martin, and Garcia
(1982) questioned the theoretical basis for BOS and found
evidence to suggest that BOS tapped recall for behavior
traits as well as immediate observation. Several studies
(Hom, DeNisi, Kinicki, & Bannister, 1982; Ivancevich,
1980; Keaveny & McGann, 1975; Lee, Malone, & Greco, 1981)
failed to find significant advantages for the BARS format
over summated rating scales or other alternative methods
(Jacobs, Kafry, & Zedeck, 1980; Kingstrom & Bass, 1981;
Schwab, Heneman, & DeCotiis, 1975).
In opposition to MSS theory, Finley, Osburn, Dubin,
and Jeanneret (1977) found evidence to suggest that an
obvious scale format may be superior to a hidden contin-
uum. Dickinson and Zellinger (1980) compared MSS, BARS,
and LT formats and found MSS produced less method bias,
BARS produced as much discriminant validity as MSS and
provided the best feedback to ratees, and LT scales were
easiest to understand and use. When Bruvold (1969) tested
the application of summated scales (Likert, 1932) and
successive interval scales (Edwards & Thurstone, 1952) to
the same data set, no significant differences were found
between the two scaling methods. According to Bernardin
and Smith (1981), one explanation may be that scale
constructors have deviated from the original procedures
(Smith & Kendall, 1963) in developing BARS instruments.
In addition to the Thurstone and Likert scaling
procedures, a third method is available. According to
Edwards (1957, p. 172), a Guttman (1944, 1945, 1947a,
1947b), or cumulative scale, requires that the construct
to be measured be unidimensional. Since disruptive school
behavior consists of many discrete behaviors, a Guttman
scale is not suitable for the instrument developed in this
study. At present, it appears that no item format is
superior enough to warrant relinquishing the clarity of
understanding and ease of use (Dickinson & Zellinger, 1980)
of the Likert scale, which presents one descriptive item
at a time to which the rater assigns a quantitative value
from a given range of values.
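To make the summation concrete, the following sketch (in Python, purely illustrative and not part of any cited study) computes a summated score from hypothetical item ratings on the 0-4 range described above:

    def summated_score(item_ratings, low=0, high=4):
        # Each item receives one integer value from the permitted range;
        # the ratee's scale score is the sum of the item values.
        if any(not (low <= r <= high) for r in item_ratings):
            raise ValueError("rating outside the permitted range")
        return sum(item_ratings)

    print(summated_score([0, 3, 1, 4, 2]))  # prints 10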
In determining the number of items to include in a
rating scale, some researchers (Quay & Peterson, 1967,
1979; Spivack & Swift, 1971; Stott, 1972) have relied on
factor analysis, using an arbitrarily chosen factor score
as the cut-off score. Edwards (1957) suggested an intui-
tive approach, utilizing 20-25 items that discriminate
between the groups at the extremes of the scale. A
comprehensive study (Achenbach & Edelbrock, 1978) of 18
rating scales found the range of items to be from 36 to
287 (median = 68 items; mean = 90.4 items). Of the 6
scales intended for use by teachers, 4 contained fewer
than 50 items and 2 between 50 and 100 items.
In a study of preferred scale length, Meredith (1981)
found half of the respondents preferred from 20 to 40
items, with 25 the median preferred length. In another
study, Meredith (1975) found a 52-item scale was judged
too long. Seidman and his associates (Seidman et al.,
1979) concluded their 46-item Teacher Behavior Description
Form was too cumbersome and reduced it to 23 items. While
item complexity is probably a factor (Meredith, 1981),
this review suggests a scale using no more than 40 items
would probably be acceptable to most teachers.
The ordering of items within a scale has been
suggested as a possible source of leniency error, halo
effect, and impaired discriminant validity (Blanz &
Ghiselli, 1972). Schriesheim and DeNisi (1980) and
Schriesheim (1981b) found that grouping according to
constructs rather than randomizing questionnaire items
resulted in impaired discriminant validity. Increased
leniency response bias was also found when items were
grouped (Schriesheim, 1981a).
Dickinson and Zellinger (1980) concluded that a
randomized scale contributed as much discriminant validity
as an ordered scale while displaying less method bias. In
a comparison of randomized and grouped scales, the
randomized scale engendered as much convergent and
discriminant validity as the grouped scale (Waters et al.,
1982). Thus, a
randomized order of presentation seems indicated.
Obtaining a meaningful measure of the frequency of
target behaviors requires attention to the variables of
response format, length of the observation period, and
type and number of raters. According to Tzeng (1983),
four response formats are most frequently cited in the
literature. They can be differentiated in terms of two
psychometric criteria. First, the existence of a neutral
response option defines the free choice format. Absence
of a neutral rating option defines the forced choice
format. Second, categorical (qualitative) ratings answer
the question "Does the ratee fit this category?" while
discriminatory (quantitative) ratings answer the question
"To what degree does the ratee fit?"
Tzeng (1983) criticized forced choice measures for
their omission of a valid response category, i.e., uncer-
tainty or neutrality of the raters' perceptions. King,
Hunter, and Schmidt (1980) concluded that a forced choice
format was ineffective in reducing rater halo. Dunnette
(1963, p. 96) reported that rater resistance to forced
choice formats led to their abandonment.
Categorical, or qualitative, formats used in
checklists cannot detect relative differences in degree
between two behaviors performed by the same ratee or
between the same behaviors among ratees (Tzeng, 1983).
Johnson, Smith, and Tucker (1982) found less response
skewness on a 5-point Likert discriminatory scale compared
to a yes/?/no categorical format. A zero-based discrimina-
tory, free choice response format seems most appropriate
(Likert, 1932). The absence of a behavior can be indicated
by the 0 position or, if present, the perceived frequency
can be indicated by choosing a value from the remainder of
the scale (Edwards, 1957).
The number of value choices permitted to the rater is
a critical issue. If few points are used some information
may be lost, but the scales are less ambiguous for the
rater. If there are too many points the discrimination
may be too fine for the rater to make. Albaum et al.
(1981) attempted to show superiority for a continuous
scale format, but concluded that equivalent aggregate
measurements were obtained from a 5-category, discrete scale.
Likewise, Bernardin et al. (1976) and Bardo and
Yeager (1982) failed to find continuous scales superior to
discrete scales. The superiority of a 5-point, discrete
rating scale has been suggested by Cowen, Dorr, Clarfield,
Kreling, McWilliams, Pokracki, Pratt, Terrell, and Wilson
(1973); Lissitz and Green (1975); McKelvie (1978); Neumann
and Neumann (1981); and Broadbent, Cooper, Fitzgerald,
and Parkes (1982).
Conversely, Bardo and his associates (Bardo & Yeager,
1982; Bardo, Yeager, & Klingsporn, 1982) found obtained
means and variances closer to the expected values for
4-point scales over 5- and 7-point scales. These results
appear contrary to most other studies. Edwards (1957, pp.
150-151) gives Likert's original statistical rationale for
the use of a 5-point scale, anchored with the integers 0
through 4, and the summation of scores for individual
items as a total score for each ratee. Current research
provides no compelling evidence for departing from this
practice.
An anchor, e.g., "always," "sometimes," "never," is
usually associated with each scale point of a Likert-type
summated rating scale (Pohl, 1981). While a variety of
anchors has been used, the basis for the selection is
often not stated (Beatty, Schneier, & Beatty, 1977;
Broadbent et al., 1982; Camp, 1980; Cowen et al., 1973;
Hunter, Hunter, & Lopis, 1979; Kassin & Wrightsman, 1983;
Moses, 1974; Siegel, Dragovich, & Marholin, 1976; Solomon
& Kendall, 1977; White, 1977).
Several studies have investigated the assumptions
involved in the selection of one popular set of anchors:
always, often, occasionally, seldom, and never. Parducci
(1968), Chase (1969), and Pepper and Prytulak (1974) con-
cluded that the meanings of anchor words were influenced
by context. The effects of individual differences among
raters on their interpretations of anchor words were
demonstrated by Helson (1969) and Goocher (1965). These
studies suggested that the above anchors may not define
perceptually equal intervals along the rating continuum.
Four studies (Bass, Cascio, & O'Connor, 1974;
Schriesheim & Schriesheim, 1974, 1978; Spector, 1976) have
sought to select five anchor words that would be perceived
by raters as defining equally spaced rating intervals.
However, the most definitive study appears to be Pohl's
(1981) partial replication of the Bass et al. (1974) and
Schriesheim and Schriesheim (1974, 1978) studies. Using
responses from 164 college students, Pohl (1981) calcu-
lated the means and standard deviations for 39 expressions.
Comparing these with the theoretical mean responses
for a 5-point equal interval scale, Pohl (1981) derived
the response set of always, quite often, sometimes, very
infrequently, and none of the time. The calculated mean
(26.71) for the mid-point term "sometimes" differed signi-
ficantly (p < .001) from the theoretical mean (29.05), but
nevertheless was the value closest to the optimal for a
5-point scale. The other calculated values were not
significantly different from the theoretical profile.
Thus, with the exception of the mid-point term, it appears
that the anchors produced by the Pohl (1981) study
adequately defined equal-appearing intervals on a 5-point
scale.
The length of the period for which behaviors are to
be rated has been little studied. For instance, the
manual for the Behavior Problem Checklist (Quay &
Peterson, 1967, 1975, 1979) does not specify for the rater
the inclusive time period to be considered in rating the
listed behaviors. The authors of the Devereux Elementary
School Behavior Rating Scales (Spivack & Swift, 1966)
instructed their raters to "consider recent and current
behavior" (p. 75). The same authors (Spivack & Swift,
1977), in developing the Hahnemann High School Behavior
Rating Scale, instructed teachers to base ratings on
behavior observed "over the past month" (p. 300).
A study (Hinton, Webster, & O'Neill, 1978) of hospi-
talized clinical patients used a 6-week time period. An
investigation (Beatty et al., 1977) of performance rating
in a data processing firm utilized three assessment
periods of two months each for a total of six months. In
a study of several response formats, Broadbent et al.
(1982) used a 6-month inclusive time period. However, in
none of these studies was a rationale given for selection
of the time period.
Two attempts at aggregating measures over specific
time periods have provided more precise instructions to
the rater. Cowen et al. (1973) defined each of five
rating points in terms of the inclusive time periods to be
considered when aggregating occurrences of behavior. For
example, the fourth anchor point, often, was defined as
"you have seen this behavior more often than once a week
but less often than daily" (p. 16). Camp (1980) used the
following "frequency of occurrence" anchors:
0 Never observed
1 Once or more in semester
2 Once or more monthly
3 Once or more weekly
4 Once or more daily (p. 11)
The work of Seymour Epstein (1980), in support of the
stability over time of personality traits, bears directly
on the issue of aggregating behavior ratings over some
time period. Epstein (1980) stated that "stability can be
demonstrated . . . as long as the behavior in question is
averaged over a sufficient number of occurrences" (p.
791). In testing this hypothesis, Epstein conducted four
studies in which he used, among other types, ratings
performed in classrooms by teachers. Epstein suggested
aggregating behavior over subjects, stimulus situations,
time, and modes of measurement in order to establish
predictive reliability and validity (p. 797).
Ratings of middle and junior high school students by
their teachers in different courses would meet the
conditions of subjects and situations. Epstein (1980)
suggested that ratings at a single time following multiple
or extended observations represent an intuitive averaging
that has the "potential for producing highly replicable
and valid results" (p. 802). Harrop (1979) also
challenged the common assumptions (Fay & Latham, 1982;
Latham, Fay, & Saari, 1979) that coding of directly
observed behaviors produced superior results to aggregat-
ing behaviors over time.
A related concern in the assessment of school-related
behavior is selection of the time of year in which the
ratings will be made. Several studies (Cowen et al.,
1973; Epstein et al., 1983; Larrivee & Bourque, 1980)
recommend allowing student behavior and teacher percep-
tions to stabilize. Supporting these decisions are data
from the Texas Junior High School Study (Evertson,
Anderson, & Brophy, 1979).
Evertson and Veldman (1981) found a moderate but
steady increase in serious misbehavior over the course of
the school year and an increase in general misbehavior in
April. Evertson and Veldman (1981) concluded that short-
term studies should avoid ratings made either early or
late in the school year. The available literature seems
to suggest the feasibility of aggregating behaviors over
time periods specified in the rating scale instructions
and after teachers have had at least two months to observe
students.
Deciding on the most appropriate type of rater to use
in assessing children's behavior has long been a problem.
In 1965, Ross et al. recognized the potential usefulness
of teacher ratings. Teachers' ratings have been found to
be more accurate than peer ratings of classroom behaviors
(Bailey, Bender, & Montgomery, 1983), other school profes-
sionals' ratings (Bower & Lambert, 1971, p. 143; Fremont
& Wallbrown, 1979), and institutional child care workers'
ratings (Kohn et al., 1979) and to be equivalent to the
ratings obtained by a multidimensional scaling technique
(MDS) applied to classroom behavior (Sanson-Fisher &
Mulligan, 1977).
A number of researchers have found support for
teacher ratings as appropriate measures of general class-
room behaviors (Solomon & Kendall, 1977), social behavior
(Loranger, Lacroix, & Kaley, 1982), assertive vs. aggres-
sive behavior (Roberts & Jenkins, 1982), acting out
behavior (Walker, 1970), and behavior that would likely
result in referrals for exceptional child education (Dean,
1980; Epstein et al., 1983; Horne & Larrivee, 1979; Lahey,
Green, & Forehand, 1980; McKinney & Forman, 1982; Roberts
et al., 1981).
Not all studies have yielded positive results. Morris
and Arrant (1978) found that regular classroom teachers
tended to see more behavior problems in students referred
for evaluation than did school psychologists. A study
(Kazdin, Esveldt-Dawson, & Loar, 1983) of psychiatric
inpatient children found extra-class raters' evaluations
of overt classroom behaviors to correspond more closely to
direct observational data than did teachers' ratings.
However, teachers were more accurate than the extra-class
raters in identifying hyperactive children using a behav-
ior checklist. Overall, the evidence suggests strong
support for the use of teachers as raters of classroom
behavior.
An associated issue is the use of multiple raters to
increase reliability and reduce halo effect (Epstein,
1980). Ratings of students commonly are obtained from all
teachers having direct classroom contact (Linton & Chavez,
1979; Wixson, 1980). This procedure could result in as
few as one or perhaps as many as seven ratings, depending
on the grade level and local practice.
More recent research efforts have focused on
empirically determining the most effective number of
raters. Prinz and Kent (1978) increased from 1 to 4 the
number of raters of parent-adolescent interactions in a
clinical setting and reported increased reliabilities.
Both reliability and concurrent validity of clinical
judgments were shown to increase when the number of judges
was increased from one to ten (Horowitz, Inouye, &
Siegelman, 1979). Strahan (1980) extended the Horowitz et
al. (1979) study and concluded that after using four
raters, adding additional ones contributed little to
measurement effectiveness. Another study (Green, Bigelow,
O'Brien, Stahl, & Wyatt, 1977) of inpatient clinical
behaviors found little improvement as the number of raters
increased.
Although in general agreement with the above studies,
a cautionary note was added by Kenny and Berman (1980),
who pointed out that if raters are completely unreliable,
increasing their numbers will not increase reliability.
The number of teachers usually available in a middle or
junior high school to serve as raters would appear to be
adequate to contribute to both improved reliability and
validity.
Various classifications of severity have been adopted
in school settings. Student conduct codes typically use
some method of indicating seriousness of offenses, such as
"serious misconduct" (Pinellas County Schools, 1983, p. 7)
and "minor, intermediate, and major" (Duval County Public
Schools, 1980, p. 16). Researchers (Pisarra & Giblette,
1981) have used categories emphasizing the targets of the
behavior (e.g., offenses against persons, offenses against
state laws). Teachers often focus on specific behaviors
(e.g., use of drugs, striking teacher) (Camp, 1981) and
administrators have used a combination of both (National
School Public Relations Association, 1973).
There is little consensus on the number of levels to
be used in assigning degrees of severity. Taylor et al.
(1979) used levels ranging from 1 (not very severe) to 4
(extremely severe). Camp (1980) used 0 for "not con-
cerned" through 4 for "extremely concerned." In an
earlier study, Moses (1974) used three levels, 1 (mild), 2
(moderate), and 3 (severe) in asking mental health and
criminal justice professionals to rate a list of problem
behaviors. To use too many levels may imply a degree of
confidence in discrimination not supported by the subjec-
tive nature of such ratings.
Not all rating scale authors and researchers accept
the necessity for including a severity rating (Searls,
Isett, & Bowders, 1981; Spivack & Swift, 1977). Even
when, as in the Behavior Problem Checklist (Quay &
Peterson, 1967, 1975, 1979), a severity factor is provided
for, the author does not always recommend its use. How-
ever, at the practitioners' level the degree of severity
of behaviors is a major concern.
Algozzine (1979), using items characteristic of
several behavior rating scales, developed the Disturbing
Behavior Checklist which asks teachers to rate the degree
of disturbance they experience as a result of different
student behaviors. This suggests a consequence to the
teacher based not on the frequency of the behavior, but on
the type and severity. After noting irregularities and
lower reliabilities, Taylor et al. (1979) had teachers
rate for severity 26 items of Part Two of the Adaptive
Behavior Scale (ABS) (Nihira, Foster, Shellhaas, & Leland,
1969). Teachers were able both to categorize behaviors
and rate them in terms of severity, leading Taylor et al.
(1979) to conclude that this additional information would
be useful in refining the scale and adding to its clinical
utility.
Inasmuch as the instrument developed in this study is
intended to have locally developed norms, the statistical
techniques used in the norming procedure and the comparing
of individual scores to the derived local norms are not
complex. While some more recent studies have focused on
problems associated with such common procedures as the
calculation of measures of central tendency (Mosteller &
Tukey, 1977; Stavig, 1978, 1982), many researchers
continue to rely on descriptive statistics utilizing raw
scores, arithmetic means, standard deviations, and standard
scores.
White (1977) compared individual student's scores on
classroom behavior to the computed mean score for five
classes of "Follow Through" program students in order to
identify immature students. In a business setting, Fay
and Latham (1982) used means and standard deviations in
comparing scores obtained using two different rating
methods. A study (Lyness & Cornelius, 1982) comparing
judgment strategies and ratings of college instructors
supported the use of a rating scale composed of discrete
items, with an overall rating calculated by weighting the
items and summing the weighted scores. To obtain mean
sub-scores for subjects, Algozzine (1980) summed scores
across the items defining each of four factors of
disturbing behaviors and used means and standard devia-
tions in analyzing the results.
The cited studies seem to support the use of descrip-
tive statistics in both obtaining individual scores (i.e.,
sum of weighted ratings) and deriving a local norm (i.e.,
mean) from ratings of a representative sample of a total
population. Salvia and Ysseldyke (1981, chap. 4) offer
definitions of common terms for descriptive statistics
applied to assessment.
Psychometric Properties of Rating Scales
Historically, rating techniques have aroused contro-
versy over estimations of validity and reliability (Ryan,
1958). Validity is the relevance of the scale to the
variables being measured. Most sources recognize three
types of validity, i.e., content, criterion-related or
concurrent, and construct (American Psychological
Association, 1966; Cronbach, 1970; Kerlinger, 1972).
Reliability is the accuracy or precision of a measuring
instrument and has been usually classified as either
temporal, inter-rater, or internal (Cronbach, 1970).
However, investigations (Epstein, 1980) into the
effects of situations on behavior have recently introduced
a fourth consideration, situational reliability, or the
consistency of behavior across settings. The development
of norms against which to compare results obtained from
individual administrations of rating scales is another
area of active investigation (Mendelsohn & Erdwins, 1978;
Messick, 1980). Research on these issues is reviewed in
this section.
Content validity refers to the relevance and repre-
sentativeness of the items used in construction of a scale
(Epstein, 1980). Often, this is determined by obtaining
judgments from experts not otherwise involved in the scale
construction (DiStefano, Pryer, & Erffmeyer, 1983; Jones
et al., 1975, p. 83; Lawshe, 1975; Thorne, 1978).
Kreitler and Kreitler (1981) found that item content
determined the rater's perception of the central theme of
an instrument. Items not perceived as relevant to the
central theme tended to be given neutral responses, thus
limiting the information contributed by the rater.
Criterion-related validity is studied by comparing
scores obtained from an instrument with one or more
external criteria of the variable being measured
(Kerlinger, 1972, p. 459). Criterion-related validity
encompasses both concurrent and predictive qualities
(Epstein, 1980). The comparison of scale results with an
independent judgment or diagnosis of a subject is an
example of an attempt at estimating criterion-related
validity. If the judgment or diagnosis confirms the scale
indications, the inference may be drawn that the scale is
in agreement with the concurrent diagnosis and is
predictive that others given a similar rating would also
be diagnosed similarly (Kohn et al., 1979; Mendelsohn &
Erdwins, 1978).
In one validation study, Harris, Kreil, and Orpet
(1977) used the school principal, guidance counselor, and
two teachers as judges in selecting both disruptive and
prosocial students for rating by the Behavior Coding
System (Patterson, Ray, Shaw, & Cobb, 1969). In develop-
ing the Pittsburgh Adjustment Survey Scales (Ross et al.,
1965), school principals were used to nominate adjusted,
withdrawn, and aggressive students for rating by their
teachers, and scale results were compared with these
nominations.
According to Kerlinger (1972, p. 461) and Cronbach
(1970), the significance of construct validity is its
concern with the theory behind the variable being
measured. Guion (1977) argues that construct validity
integrates both content and criterion considerations.
Likewise, the usefulness of content and concurrent
validity is questioned by Sanson-Fisher and Mulligan
(1977) and construct validity is supported.
A definition of construct validity as the process of
ascribing meaning to scores is offered by Stenner and
Smith (1982). Messick (1980) broadens the concept of
validity to include both test interpretation and test
use. Messick (1980) describes construct validity as
"interpretive meaningfulness" (p. 1015) and suggests that
it rests on four bases: convergent and discriminant
validity, ethical interpretation, relevance and utility
for the specific application, and the consequences follow-
ing use of the instrument.
To be interpretable, a rating scale must be reli-
able. That is, a scale must produce similar results when
applied to the same person over several administrations,
the instrument must be relatively free of errors of mea-
surement, and the results must closely approximate the
"true" value of the variable for the person being rated
(Cronbach, 1970; Kerlinger, 1972).
Typically, test-retest data are compiled for varying
time periods between administrations. The correlation
between the two obtained scores is used to justify esti-
mations of temporal stability and, in the case of rating
scales, intra-rater reliability. Examples of reported
test-retest intervals include one week (Duval County
School Board, 1979; Quay, 1977), two weeks (Mendelsohn &
Erdwins, 1978; Russell, Lankford, & Grinnell, 1981) and
two years (Quay, 1977). However, Masterson (1968) pointed
out that low test-retest correlation coefficients may
reflect the transitory nature of the measured variable and
suggested high coefficients of internal consistency may be
more indicative of reliability for some instruments.
Internal consistency has often been estimated by
inter-item and item-total analysis (Edwards, 1957;
Kerlinger, 1972). In these procedures, an individual's
rating on one item is compared with the rating on all
other items or with the total score from the scale or
subscale to estimate the degree to which each item is
similar to the other items. Item analysis may be impor-
tant in reducing errors of measurement attributable to the
composition of the instrument (Benson & Clark, 1982).
However, internal consistency may not provide good reliabi-
lity estimation for a rating scale assessing constructs
comprised of many discrete behaviors (Kerlinger, 1972).
Some research (Rosenthal & Jacobson, 1968; Sulzbacher,
1973) into observer bias has suggested that beliefs about
ratees may affect rater perceptions and, consequently, the
reliability of the ratings. In three studies (O'Leary &
Kent, 1973; Shuller & McNamara, 1976; Siegel et al., 1976)
of disruptive classroom behavior, while biasing informa-
tion experimentally introduced was found to influence
global ratings, it had no significant effect upon results
obtained from behaviorally stated scales. Siegel et al.
(1976) suggested that behaviorally specific items reduce
bias and improve inter-rater and intra-rater reliability.
The degrees of agreement among different raters on
measures of the same subjects at the same time in the same
setting have been used to indicate the inter-rater
reliability of an instrument (Cronbach, 1970). Also, the
agreement among different raters of subjects in the same
settings at different times has been used for the same
purpose (Cronbach, 1970). In middle and junior high
schools, these conditions do not usually occur naturally.
Fortunately, investigations of trait consistency in
subjects (Abikoff, Gittelman, & Klein, 1980; Epstein,
1980; Mischel, 1969) have encouraged the comparisons of
ratings by different raters over the same elapsed time
periods, but for different settings and situations,
conditions which do occur naturally in the secondary
school.
Epstein (1980) concluded that subjects do manifest
trait consistency, if aggregation techniques are applied
in assessing behaviors. Epstein (1980) suggested aggre-
gation over raters (e.g., teachers), situations (e.g.,
classrooms), occasions (e.g., class periods), and measures
(e.g., disciplinary records). Epstein further suggested
that when single ratings are made after extended periods
of observation, these ratings are similar to aggregated
ratings in that they represent an intuitive averaging of
ratings over many observations. Thus, reliability may be
improved by combining different teachers' ratings of the
same student over the same portion of the school year.
According to Cooper (1981), perhaps the most ubiqui-
tous challenge to inter-rater reliability is halo error
(Thorndike, 1920) or the tendency of a rater to allow
overall impressions of an individual to influence judgment
of specific areas of behavior (Holzbach, 1978). Attempts
(Landy, Vance, Barnes-Farrell, & Steele, 1980; Landy,
Vance, & Barnes-Farrell, 1982) to statistically control
for halo effects have apparently not succeeded (Harvey,
1982; Hulin, 1982; Mossholder & Giles, 1983; Murphy, 1982).
One exploration of ways to reduce halo error resulted in a
restatement of classic advice: do not use rating cate-
gories that are imprecise and overlapping (Cooper, 1983).
In an extensive review of the literature, Cooper (1981)
concluded that of nine methods currently employed to reduce
halo effect, all leave residual illusory halo.
Studies of variables affecting reliability have iden-
tified several other challenges to the accuracy of school
behavior ratings. The sex of the teacher was found in two
studies (Levine, 1977; Silvern, 1978) to be correlated
with ratings of classroom behavior, with male teachers
consistently reporting lower levels of disruptive behav-
ior. Teachers' ratings seemed to be influenced by special
education labels in one study (Fogel & Nelson, 1983). In
two studies (Marwit, 1982; Marwit, Marwit, & Walker, 1978),
perceived unattractiveness of students has been shown to
correlate with higher ratings of disruptive behavior.
While challenges to reliability from a variety of
sources have been observed, several studies (Bernardin &
Pence, 1980; Fay & Latham, 1982; Latham, Wexley, & Pursell,
1975; Madle, Neisworth, & Kurtz, 1980; Pursell, Dossett, &
Latham, 1980) have suggested that training in the use of
rating scales may be effective in reducing errors of
measurement. This review of studies of validity and
reliability has identified some sources of and counter-
measures for errors of measurement. Next, studies of the
variables affecting the norming of rating scales will be
reviewed.
Several writers have shown concern for the relation-
ship between behavior and the context in which it occurs.
The social value of a test, according to Messick (1980),
is determined by its instrumental value for a particular
setting. Willems (1975) stated that few phenomena have
meaning independent of the context in which they occur.
Likewise, researchers were cautioned by Dickinson (1978)
to evaluate behavior only in an environmental context.
Epstein (1980) referred to the "extreme situational
specificity of behavior" (p. 794) and warned that experi-
ments conducted in a single situation cannot be relied on
to generalize across even minor variations in stimulus
conditions. Others supporting this psychosocial approach
include Sherif (1954); Erickson (1963), quoted in Tinto,
Paclilio, and Cullen (1978); Salvia and Ysseldyke (1981,
p. 378); and Zammuto, London, and Rowland (1982).
Schools were described by Garbarino (1980) as "con-
texts for behavior and development" (p. 19). Some of the
characteristics of schools which may influence levels and
interpretations of disruptive behavior are size of enroll-
ment (DiPrete, 1981, p. 86; Garbarino, 1980; Kowalski,
Adams, & Gundlach, 1983); public or private administration
(DiPrete, 1981, p. 81); control orientation (e.g., human-
istic vs. custodial) (Deibert & Hoy, 1977; Gaynor &
Gaynor, 1976); degree of person-environment fit (Kulka,
Klingel, & Mann, 1980); traditional vs. open classrooms
(Solomon & Kendall, 1975); length of faculty tenure
(DiPrete, 1981, p. 107); socioeconomic level of the host
community (Kowalski et al., 1983); and region of the
country (DiPrete, 1981, p. xx; Kowalski et al., 1983).
Researchers advocating the use of local norms for behav-
ioral measurements include Fremont and Wallbrown (1979);
Mendelsohn and Erdwins (1978); Quay and Peterson (1967);
Smith (1976); Walker and Hops (1976); and Wallbrown,
Wallbrown, and Blaha (1976).
The effects of sex, age, race, and socioeconomic
status on ratings of disruptive behavior have been
frequently studied. The types of disruptive behavior
displayed in both educational and clinical settings have
not been found to be significantly different for the
variables of sex (Behar & Stewart, 1984; Epstein et al.,
1983; Morris & Arrant, 1978; Stott et al., 1975, p. 166),
age (Behar & Stewart, 1984; Ghodsian, Fogelman, Lambert, &
Tibbenham, 1980; Stott et al., 1975, p. 83), race (Gajar &
Hale, 1982), or socioeconomic status (Behar & Stewart,
1984; Stott et al., 1975, p. 97). Thus, providing for
separate norms for these variables seems unnecessary in
any scale rating only disruptive behaviors.
Uses of Behavior Rating Scales
Bailey et al. (1983) supported the use of rating
scales in program planning and evaluation. Likewise, the
lack of effective measurement devices was seen by Hirshoren
and Heller (1979) as limiting the evaluation of program
effectiveness. Mesinger (1982) called for the use of
appropriate measurement devices in providing services for
deviant youth within the public school setting. Cooper
(1983), Peed and Pinsker (1978), and Beatty et al. (1977)
have suggested providing rating scale results to ratees to
influence behavior changes. Using rating scales to pro-
vide a standardized description of behavioral problems has
been suggested (Edelbrock & Achenbach, 1978).
In a study comparing resource room delivery models,
Wixson (1980) used a behavior rating scale in developing
and evaluating intervention programs for various cate-
gories of handicapped children. Morton Bortner (Buros,
1978, p. 493), reviewing the AAMD Adaptive Behavior Scale,
pointed out its usefulness for evaluating the progress of
individuals and evaluating program goals. The Duval
County School Board (1979) used a locally constructed
behavior checklist to evaluate their grant-funded program
for disruptive students.
Several programs which retained students in their
regular classrooms have used behavior scales for evalua-
tion purposes. Walker and Holland (1979) and Linton and
Chavez (1979) developed and used rating scales for this
purpose in elementary and junior high schools, respec-
tively. The Hahnemann High School Behavior Rating Scale
(Spivack & Swift, 1977) was intended to provide teachers
with a practical means of describing disruptive classroom
behavior to parents and other school personnel. In a
study of junior high school truants, Nielsen and Gerber
(1979) used a behavior rating scale to match school inter-
ventions with student needs.
A quantitative measure of disruptive behavior was
developed by Mendelsohn and Erdwins (1978) to assist
community agencies in devising programs for expelled
students. Haskell (1979) developed a method of quantify-
ing clinical behavior in institutional settings to provide
a basis for planning individual programs and evaluating
results. McSweeney and Trout (1979) used the Jesness
Behavior Checklist (Jesness, 1970) to evaluate the social
progress of deviant children in a wilderness camp pro-
gram. Five reasons for obtaining measures of students
are offered by Salvia and Ysseldyke (1981): "Screening,
placement, program planning, program evaluation, and
assessment of individual programs" (p. 14). Behavior
rating scales have been used to obtain measures for each
of these needs.
Summary
This review has identified five approaches in the
literature to define disruptive school behavior (DSB). A
conceptualization of DSB based on the interactions of
students, teachers, and administrators within the school
setting was suggested as most relevant for the development
of an instrument to quantify DSB.
Psychometric challenges to the use of rating scales
for identifying behavioral characteristics were consid-
ered. Research was cited to suggest that teachers using
reliable and valid scales could accurately identify DSB.
Nineteen instruments available for assessing problem
behaviors were reviewed. None appeared to meet the
psychometric criteria required for educational placement
decisions. Possible sources of error in nonsystematic
observations were presented with the suggestion that
inaccurate, biased, or subjective judgments may result.
Type, frequency, and severity of behaviors were
related to item content, item format, and response
format. Support was found for the inclusion of these
measurement parameters in assessing DSB. The use of
descriptive statistics in current research for obtaining
individual behavior ratings and deriving local norms was
supported.
A review of the sources of error in measurement was
conducted and counter-measures for improving validity and
reliability estimations were suggested. A number of
variables affecting the norming of rating scales were
investigated. Research evidence rejected separate norms
based on gender, age, race, or socioeconomic status. The
effective use of behavioral instruments in a variety of
settings was documented, suggesting the suitability of
such a device for describing students who display DSB.
CHAPTER THREE
METHODOLOGY
The purpose of this study was to develop and validate
an instrument, the Disruptive Student Behavior Scale
(DSBS). The DSBS is intended to be used to assess
quantitatively the disruptive school behaviors of students
referred for placement in either special education or
alternative education programs. This chapter presents the
research questions, defines the target population,
presents a plan for constructing the scale, describes
procedures for a pilot study, details statistical tests
and procedures for the data analyses, and discusses
possible limitations of the study.
Research Questions
1. Does the content of the DSBS represent behaviors
recognized and accepted by educators as occurring
in and disruptive to the school environment?
2. In the judgment of experts, does the DSBS contain
an equitable distribution of items descriptive of
the underlying theoretical constructs that
identify disruptive students and discriminate them
from non-disruptive students?
3. To what degree does the DSBS demonstrate
criterion, convergent, and discriminant validity?
4. To what degree does the DSBS provide ratings which
are stable over time?
Construction of the DSBS
The following plan is a modification of a suggested
procedure (Benson & Clark, 1982) for rating scale construc-
tion. A review of disruptive school behavior (DSB)
literature provided a research base for defining the
constructs comprising DSB. A total of 303 descriptive
items and 22 categories were found in 36 studies. After
eliminating duplications and items not pertaining directly
to DSB, 56 items remained. Combining similar categories
resulted in a total of 13 potential categories of behav-
iors associated with DSB.
In a project conducted by the Research Committee of
the Psychological Services Department of the Duval County,
Florida School District, the 56 items and 13 categories
were presented to 16 teachers of middle school students
enrolled in a behavior management program for disruptive
students. The rating group was composed of 10 females and
6 males, and all had at least two years' full-time teaching
experience. Group members were instructed to assign each
item to one or more of the 13 categories. Instructions
and results are reproduced in Appendix A.
The judges' ratings and comments resulted in the
retention of 10 categories, which were considered to be
one set of constructs which could be used in identifying
DSB. A tentative definition of each construct was
formulated using the descriptive items assigned by the
teachers. An attempt was then made to verify the
inclusiveness of these derived constructs. A frequency
distribution was prepared for all of the conduct code
violations reported for chronic violators in a sample of
Duval County, Florida, elementary, middle/junior high,
high, and alternative schools (Moses, 1981). Of 7717
behavior violations, 7686, or 99.6%, were included within
the definitions of the proposed constructs.
The items, as taken from the studies and used in
developing the constructs, were not considered specific
enough for use in a quantitative rating scale. However, a
readily available pool of potential items was located in
the disciplinary referral records of an inner-city junior
high school in a metropolitan Florida school district.
Verbatim transcriptions were made of the reasons recorded
on the referral forms by teachers when sending students to
the deans. All active folders for the 1980-1981 school
year were reviewed. A total of 395 items, including dupli-
cations, were recorded without regard for gender, age,
race, or grade level. Combining obvious duplications and
similarities resulted in 66 items (Appendix B) to be
considered for inclusion in a scale for rating DSB.
All of the 66 items were then presented individually
to six male and five female volunteers, experienced
secondary school regular classroom teachers from suburban
Florida middle and junior high schools. Instructions are
reproduced in Appendix C. These teachers were asked to
verify the specificity of the items and edit those consid-
ered ambiguous. This review yielded 40 items for possible
use on an instrument. These items were then stated in the
past tense to reflect the intention to measure students'
past behavior (Appendix D). This preliminary study indi-
cated the feasibility of using research-based constructs
and teacher-generated items as the basis for a rating scale
for disruptive school behavior.
In order to reduce halo and leniency errors, it has
been suggested (Blanz & Ghiselli, 1972) that a scale be
arranged so that items from the same construct will not be
contiguous. Accordingly, items were initially randomly
ordered, then inspected and rearranged to meet this crite-
rion (Appendix M). Research studies previously cited
suggested that in addition to specifying the type, a
quantitative measure of disruptive behavior must provide
for rating both frequency and severity. Frequency rating
was provided for by the choice of response format selected
for the instrument. The literature review suggested the
suitability of a 5-point, equal interval, summated rating
scale (Likert, 1932) using the following anchors:
0 None of the time
1 Very infrequently
2 Sometimes
3 Quite often
4 Always (Pohl, 1981, p. 239)
The rating scale (Appendix F) utilized this response
format.
The severity rating for each scale item was estab-
lished with assistance from the faculty, staff, and
administrators of two alternative schools located in two
metropolitan Florida school districts. From their experi-
ence with disruptive students, these educators were
particularly aware of the consequences for students who
display DSB. Respondents were selected from volunteers,
including the principal and assistant principal, school
psychologist, social worker, educational evaluator, and
faculty members. This group contained both males and
females in approximately equal numbers. All had more than
two years' experience working with disruptive students.
The school experience may be conceptualized as influ-
encing the social, personal, and academic domains of a
student's life. Each of these domains may be subdivided
to facilitate closer study of the consequences of the
school experience (See Table 1). One way for educators to
assign a severity factor to a disruptive activity is to
have them estimate which domains of student life would
likely be affected adversely by that particular behavior.
Instructions for this procedure are reproduced in Appendix
G. The number of adverse consequences assigned by at
least 50% of the raters, divided by a constant of three to
keep the numbers small and with fractions rounded up to
the next whole number, gave a severity rating of 1, 2, or
3 to each of the items on the rating scale. Results are
reported in Chapter Four.
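The following Python sketch illustrates this rule with hypothetical consequence counts; the consequence labels are invented, and the floor of 1 for an item with no majority-endorsed consequence is an assumption rather than a documented provision:

    import math

    def severity_factor(votes_per_consequence, n_raters):
        # Count the adverse consequences assigned by at least 50% of
        # raters, divide by three, and round fractions up to the next
        # whole number, giving a severity rating of 1, 2, or 3.
        endorsed = sum(1 for v in votes_per_consequence.values()
                       if v >= n_raters / 2)
        return max(1, math.ceil(endorsed / 3))

    votes = {"social": 9, "personal": 7, "academic": 3, "legal": 8}
    print(severity_factor(votes, n_raters=10))  # 3 endorsed -> severity 1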
A scoring template incorporating the severity factor
was prepared for the DSBS (Appendix H). This template has
five holes, one corresponding to each possible frequency
rating (i.e., 0, 1, 2, 3, 4) for each rating scale item.
Through the holes are read the rater's mark (X) indicating
the frequency rating assigned. Above each hole is printed
a number which is the product of that frequency rating and
the previously determined severity factor for that item.
Thus, the weighted score for that item may be read by the
scorer directly from the scoring template and recorded on
the DSBS rating form beside each item.
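The template arithmetic reduces to a single product, sketched below with invented values:

    def weighted_item_score(frequency_rating, severity_factor):
        # The printed template value is the product of the teacher's
        # frequency rating (0-4) and the item's severity factor (1-3).
        return frequency_rating * severity_factor

    # A "quite often" (3) rating on an item of severity 2:
    print(weighted_item_score(3, 2))  # prints 6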
These item scores were then added to give the page
score and form score (see Appendix F) and recorded onto a
summary sheet (Appendix I). The Summary of Teacher
Ratings form (Appendix I) contains for each student the
DSBS rating; the deviation, in z-scores, from the local
norm; a comparison of ratings by each teacher; and the
basis for constructing a DSBS profile for prescriptive use
(Appendix J). These data are intended to provide local
school authorities with criteria for estimating the devia-
tion of any student's rating from the local DSB norm and
are intended to assist in determining a student's need for
an intervention program. The DSBS is normed locally
within each school district. Norms from this study are
reported in Chapter Four for information, but are not to
be used as criteria for judgments about students in other
settings.
Validation of the DSBS
To assure content validity, the 40 items and the 10
constructs developed from this preliminary study were
presented to a group of 24 teachers with instructions to
assign each item to a construct category or to no category.
The instructions are reproduced in Appendix E. Each judge
had at least two years of regular classroom teaching
experience in a middle or junior high school. Thirteen
male and 11 female teachers participated. The judges were
also asked to verify the specificity of the retained items
and reword those considered ambiguous. Revisions were
made as suggested and confirmed by a follow-up study using
another group of eight similarly-qualified teachers.
As described in the field study section, at a Florida
middle school a criterion group of disruptive students was
selected by nomination by seven non-teaching school person-
nel, including two deans, three guidance counselors, and
two administrators. Students in the disruptive group were
ranked numerically on a continuum from non- to severely
disruptive, based on subjective ratings from all the
nominating personnel. DSBS ratings from teachers were
compared to these subjective ratings to determine how well
high DSBS teacher ratings correlated with high levels of
disruptiveness as perceived by non-teaching school
personnel.
To estimate how well the DSBS identified the disrup-
tive group, the mean of DSBS ratings for the disruptive
group students was compared with the mean of DSBS ratings
for a norming group representing a sample, stratified by
grade, of the school population. If the DSBS demonstrated
agreement with the concurrent judgments of disruptiveness
made by non-teaching school officials, a prima facie case
could be made for predicting that students in other
settings identified by the DSBS as disruptive would also
be judged disruptive by non-teaching school officials.
Messick (1980) described construct validity as based
on convergent and discriminant validity, ethical
interpretation, relevance and utility for the specific
application, and the consequences following use of the
instrument. Convergent validity requires that the DSBS be
able to identify all students who are considered exces-
sively disruptive. To demonstrate satisfactory convergent
validity, the DSBS ratings of 100% of the students in the
disruptive group would have to be significantly above the
local DSB norm. The disruptive group ratings are reported
in Chapter Four.
Discriminant validity requires that the DSBS be able
to reject those students who are not considered exces-
sively disruptive. To demonstrate satisfactory discrimi-
nant validity, the DSBS ratings of only those students in
the disruptive group, or eligible for inclusion, could be
significantly above the local DSB norm. Ratings of the
norming group are reported in Chapter Four.
Ethical interpretation of DSBS ratings requires an
understanding of both the theoretical and practical
concepts underlying development of this instrument. There-
fore, a manual will be prepared before the DSBS is offered
for research use. Relevance was supported by the theoreti-
cal basis on which the 10 constructs were chosen to define
DSB for this study. Utility was provided by the proce-
dures used to select appropriate items, score the forms,
interpret the ratings, and present the results. The conse-
quences of using the DSBS cannot be predicted until it is
thoroughly researched. The intent is to improve the
validity of the selection process for programs assisting
disruptive students.
Reliability of the DSBS
The DSBS rating for each student is an aggregate of
scores from at least four teachers. A test-retest measure
compared two DSB ratings obtained from individual teachers.
Fourteen days after the receipt of teacher ratings, a
follow-up rating by the same teachers of approximately 10%
of both the norming and disruptive groups was made. These
results are reported in Chapter Four.
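As noted earlier in the review, temporal stability is typically estimated by correlating the two sets of scores. The sketch below assumes a product-moment correlation, which the present chapter does not name explicitly; all rating values are invented:

    from statistics import mean, stdev

    def pearson_r(x, y):
        # Product-moment correlation between first ratings and reratings.
        mx, my = mean(x), mean(y)
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
        return cov / (stdev(x) * stdev(y))

    first_rating = [12, 5, 30, 8, 22, 15]  # hypothetical teacher ratings
    rerating     = [14, 6, 27, 9, 20, 16]  # same teachers, 14 days later
    print(round(pearson_r(first_rating, rerating), 3))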
The internal consistency of the DSBS was protected by
choosing only items previously used by teachers to
describe DSB. Item analysis is not an effective technique
for establishing reliability of individual administrations
of the DSBS. Patterns of disruptive behavior are often
narrow and stereotypical, while the DSBS contains items
descriptive of a broad range of possible behaviors. Thus,
item scores were not likely to correlate with each other.
No attempt was made to assess inter-rater reliability.
Classroom settings are conceptualized as discrete environ-
ments, whose norms for behavior are determined by the
personality of the teacher. The behavior of interest is
the totality of each student's interactions with his or
her teachers.
Field Study
The purpose of the field study was to identify and
correct any problems, actual or potential, with item
content, response format, or administration and scoring
procedures of the DSBS. Following a successful field
study, the instrument may be offered to the profession for
further research and development (Benson & Clark, 1982).
Accordingly, the operational goal of this present effort
was to conduct a field study to determine the readiness of
the DSBS for use as a research instrument.
The target population consisted of students enrolled
in grades six through nine (i.e., middle and junior high
school grades) in public schools anywhere in the United
States. No restrictions were placed on age, gender, race
or socioeconomic status. The selection criteria for the
host school were a heterogeneous ethnic population, an
urban or suburban location, public middle (grades 6, 7, 8)
or junior high (grades 7, 8, 9) school status, random
assignment of students to basic courses, and an average
daily attendance figure of at least 500 students. Special
schools, such as alternative schools and special education
centers, were not considered.
A public middle school meeting these criteria was
located in a predominately urban school district on the
west coast of Florida. The student enrollment was
approximately 76% white, 22% black, and 2% Asian- and
Hispanic-American, with an average daily attendance of
733. Socioeconomic status was said by the principal to be
primarily upper-lower class and lower-middle class.
For the norming group, a sample consisting of 90 stu-
dents was selected using one English and one mathematics
class, with randomly-assigned enrollments, at each grade
level. A total of six classes containing 203 students and
ranging from 32 through 35 students each were sampled.
The numbers 1 through 35 were written on individual slips
of paper and 15 numbers drawn randomly using the replace-
ment procedure. For each class, the students whose class
roll numbers matched the 15 randomly selected numbers were
included in the norming groups.
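The draw itself can be sketched as follows. Fifteen distinct numbers are assumed (the slips-of-paper replacement procedure is not described in enough detail to settle whether duplicate draws were possible), and the class size is invented:

    import random

    def draw_roll_numbers(n_slips=35, n_drawn=15, seed=None):
        # Draw 15 of the numbers 1 through 35 at random.
        rng = random.Random(seed)
        return sorted(rng.sample(range(1, n_slips + 1), n_drawn))

    drawn = draw_roll_numbers(seed=1)
    class_roll = range(1, 34)  # a hypothetical class of 33 students
    # Students whose class roll numbers match the drawn numbers
    # enter the norming group for that class.
    print([n for n in drawn if n in class_roll])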
The disruptive group was selected by nomination by
non-teaching school personnel, who were asked to list the
names of all of the excessively disruptive students
encountered during the current school year. It was thus
possible for a student's name to be included in both the
norming and the disruptive groups. The nominating process
initially produced a group of 64 students. After a confer-
ence among the raters, this group was reduced to 36
students.
All students finally nominated into the disruptive
group were assigned to one of four levels of disruptive-
ness (none, mild, moderate, or severe) by each nominating
person working independently. Nominated students were
then given a numerical rating corresponding to the
assigned level of disruptiveness.
Students were ranked according to the average of these rat-
ings. This ranking permitted the correlation, reported in
Chapter Four, of levels of disruptiveness between the DSBS
results and the qualitative assessments by school person-
nel for each disruptive group student.
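The ranking step reduces to averaging each student's independent nominator ratings and sorting on the averages, as in the sketch below; the names and ratings are invented:

    from statistics import mean

    nominator_ratings = {  # one rating per nominating person
        "Student A": [3, 3, 2, 3, 3, 2, 3],
        "Student B": [1, 2, 1, 1, 2, 1, 1],
        "Student C": [2, 3, 2, 2, 2, 3, 2],
    }
    # Students are ranked by the average of the nominators' ratings.
    ranked = sorted(nominator_ratings,
                    key=lambda s: mean(nominator_ratings[s]), reverse=True)
    for rank, student in enumerate(ranked, start=1):
        print(rank, student, round(mean(nominator_ratings[student]), 2))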
Schedules for the sample students were obtained from
school records. No contact was made with any student.
Training of all participating teachers took place in a
meeting at which a DSBS form for each period of a sample
student's current schedule was distributed. Appendix K
contains these instructions. The purpose of the study was
explained and a date and procedure for returning the forms
agreed upon. Emphasis was placed on the need to respond
to only the behaviors actually mentioned on the instrument
and to perform the ratings independently of other teachers.
Provision was made for a faculty member either to answer or to refer questions that might arise during the rating process.
Teachers not submitting all their DSBS forms by the
agreed upon date were contacted and reminded of the
importance of their participation. Upon receipt of at
least four completed DSBS forms for each student, the DSBS
rating for that student was calculated. In all, 622 scorable forms, at least four per student, were received for 108 students: 76 in the norming group and 32 in the disruptive group. The scoring template (Appendix H) provided for
calculating item scores weighted for severity.
The item scores were totaled to produce a form score,
which was entered on the Summary of Teacher Ratings form
(Appendix I). This summary form contains spaces for the
student's name, grade, age, and sex; school name; evalua-
tor's name and title; individual form scores; each rater's
name, subject, and class period; and calculation of the
student's DSBS rating and z-score. Each sample student's
form scores were summed to give a total score. The total
score was divided by the number of raters to yield the
average score, which is the student's DSBS rating.
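The scoring arithmetic just described can be summarized in a short sketch. The severity weights and item responses shown are placeholders; the actual weights are those of the scoring template in Appendix H.

    # Hypothetical severity weights and rater responses for a
    # four-item excerpt of the scale.
    severity_weights = [1, 2, 3, 1]

    def form_score(responses):
        # Item scores weighted for severity, summed into a form score.
        return sum(r * w for r, w in zip(responses, severity_weights))

    # One completed form per rater; at least four were required.
    forms = [[0, 1, 2, 0], [1, 1, 1, 0], [0, 2, 2, 1], [0, 0, 1, 0]]
    form_scores = [form_score(f) for f in forms]
    dsbs_rating = sum(form_scores) / len(form_scores)  # the student's DSBS rating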
After the DSBS ratings for all the norming group stu-
dents were calculated, the mean DSBS rating and standard
deviation for the group were obtained. This mean of the
means is the local DSBS rating, or norm, for the target
school. The local DSBS norm was subtracted from each student's DSBS rating, giving that student's deviation from the norm.
Dividing this deviation by the local standard deviation gave the number of standard deviation units, or the z-score, by which the student's DSBS rating differed from the local DSBS norm. The criterion of two standard deviation
units above the local DSBS norm translates to a disrup-
tiveness score higher than approximately 98% of the
predicted scores from the school population. The distri-
bution of scores obtained from the norming group was
inspected to assure the existence of sufficient variance
to make the z-scores meaningful.
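The norming computation reduces to the familiar standardization formula, sketched below with hypothetical norming-group ratings.

    from statistics import mean, stdev

    norming_ratings = [4.2, 5.0, 3.8, 6.1, 4.4, 5.5]  # hypothetical ratings
    local_norm = mean(norming_ratings)                # the mean of the means
    local_sd = stdev(norming_ratings)

    def z_score(rating):
        # Deviation from the local norm in standard deviation units.
        return (rating - local_norm) / local_sd

    # Under a normal distribution, a z-score of 2 exceeds about 97.7%
    # (approximately 98%) of predicted scores.
    flagged = [r for r in norming_ratings if z_score(r) >= 2.0]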
A reliability check was performed. Fourteen days
after all the rating forms were collected, approximately
10% of the students from both the norming and disruptive
groups were selected to be representative of the range of
scores. New forms were submitted to the original raters
for rerating the same students and the results compared.
These results are reported in Chapter Four. After comple-
tion of the data analyses, all participants were invited
to a meeting to discuss the results, offer comments, and
receive appreciation for their participation.
To establish content validity, 24 expert judges
assigned proposed DSBS items to construct categories.
Results of the judges' assignments were totaled for each
item. An item was dropped if not assigned to at least one
category by each judge. If this content validation pro-
cedure had resulted either in fewer than 30 items being
assigned to at least one construct or in having a con-
struct with fewer than three items assigned by 80% of the
respondents, enough items would have been constructed and
validated to meet these criteria. The judges' item assign-
ments are reported in Chapter Four.
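The two retention rules can be expressed as a short sketch. The assignment data are hypothetical, and one construct choice per judge per item is assumed.

    from collections import Counter

    n_judges = 24
    assignments = {                  # item number -> judges' construct choices
        1: ["defiance"] * 24,
        2: ["defiance"] * 20 + ["aggression"] * 4,
    }

    # Rule 1: drop any item not assigned to a category by every judge.
    retained = [i for i, c in assignments.items() if len(c) == n_judges]

    # Rule 2: each construct needs at least three items assigned to it
    # by 80% of the judges (at least 20 of 24).
    items_per_construct = Counter()
    for item in retained:
        for construct, votes in Counter(assignments[item]).items():
            if votes >= 0.8 * n_judges:
                items_per_construct[construct] += 1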
To ascertain how well the DSBS identified the
disruptive group, the t-test was used to estimate the
significance of the difference between the means of the
norming group and disruptive group. An obtained prob-
ability level of .05 or less was considered evidence of
statistical significance. The magnitude of the difference
between the means was used to evaluate the practical
significance of the instrument and its potential for
identifying disruptive students. These results are
reported and discussed in Chapter Four.
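A minimal sketch of this comparison, with hypothetical ratings, using the independent-samples t-test from the SciPy library:

    from scipy import stats

    norming = [4.2, 5.0, 3.8, 6.1, 4.4, 5.5]   # hypothetical DSBS ratings
    disruptive = [11.3, 9.8, 12.6, 10.4]       # hypothetical DSBS ratings

    t_stat, p_value = stats.ttest_ind(norming, disruptive)
    significant = p_value <= 0.05              # the criterion stated above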
To estimate convergent validity for the DSBS, the DSBS
rating for each disruptive group member was compared with
the mean DSBS rating of the norming group. For the
purposes of this study, a DSBS rating of at least two
z-scores above the norming group mean was accepted as
evidence that the DSBS had correctly identified a disrup-
tive group member. The standard error of the mean was
used to decide whether borderline cases should be included.
The criterion for satisfactory convergent validity was the
correct identification of 100% of the disruptive group.
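A sketch of the identification criterion follows. Exactly how the standard error of the mean was applied to borderline cases is an interpretation, shown here as relaxing the cutoff by one standard error.

    from math import sqrt
    from statistics import mean, stdev

    norming = [4.2, 5.0, 3.8, 6.1, 4.4, 5.5]    # hypothetical ratings
    local_norm, local_sd = mean(norming), stdev(norming)
    sem = local_sd / sqrt(len(norming))          # standard error of the mean

    def identified(rating):
        # Two standard deviations above the norm, relaxed by one
        # standard error for borderline cases (an assumed reading).
        return rating >= local_norm + 2 * local_sd - sem

    disruptive = [11.3, 9.8, 12.6, 10.4]         # hypothetical ratings
    hit_rate = sum(identified(r) for r in disruptive) / len(disruptive)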
Discriminant validity also was estimated by using
ratings, means, and z-scores. The DSBS rating for each
norming group member was compared with the mean of that
group. Any norming group member whose DSBS rating
exceeded the mean by at least two z-scores was considered
identified by the DSBS as excessively disruptive. Identified cases that were neither members of nor eligible for the disruptive group were considered challenges to the discriminant
validity of the DSBS. All cases not meeting the construct
validity criteria were investigated. Construct validity
results are reported and discussed in Chapter Four.
The Pearson product-moment correlation statistic was
used to compare the original ratings on approximately 10%
of the completed forms with follow-up ratings made after
14 days. Individual coefficients of at least .80 were set arbitrarily to establish an acceptable level of test-retest reliability.
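A minimal sketch of the test-retest comparison, with hypothetical paired form scores, using the Pearson correlation from the SciPy library:

    from scipy import stats

    original = [12, 5, 20, 7, 15, 3]    # hypothetical original form scores
    retest   = [11, 6, 19, 8, 14, 4]    # hypothetical 14-day reratings

    r, p = stats.pearsonr(original, retest)
    acceptable = r >= 0.80              # the criterion stated above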
Limitations
1. The school for the field study was selected based on the willingness of both the school and the faculty to cooperate. This may have masked problems that would arise in a less favorable environment.
2. Teacher resistance and/or concerns about this type of research may have biased or limited their participation.
3. The study was limited to exploration and the results
are not intended to generalize beyond the administra-
tion and scoring procedures. Specifically, the
calculated DSBS norm is valid only for this school.
4. No provision was made to assess the possible effects
of grade and sex on DSBS norms. Studies have indicated
the influences are not significant, but at some point
this should be investigated.
5. The disruptive sample group was likely composed of
students who had been referred to the dean. The same
teachers who referred these students to the dean may
have rated their behaviors, with bias a possibility.
6. The use of expert judges in the validation procedures
may have introduced personal bias into the items used
on the instrument.
CHAPTER FOUR
RESULTS AND DISCUSSION
The purpose of this study was to develop and validate
an instrument, the Disruptive Student Behavior Scale
(DSBS). The study focused on identifying components of
disruptive school behavior as perceived by middle and
junior high school teachers and constructing an instrument
to quantify these behaviors. To accomplish this, an
instrument was constructed using behaviors taken from
disciplinary referrals and field tested on a representa-
tive sample of students from a Florida middle school.
Teacher ratings for a norm group and a disruptive group
were collected and analyzed as outlined in Chapter Three.
These results are reported in this chapter.
The Severity Factor
Results of the assignment of potential adverse conse-
quences resulting from DSBS behaviors are reported in
Table 2. Twenty packets containing 40 DSBS items and an
instruction sheet were distributed and 16 were returned.
At least 50% of the raters, or 8, had to assign a DSBS item to a particular domain before that domain was counted for that item.
[Table 2. Potential Adverse Consequences of DSBS Behaviors: the tabulated rater assignment counts are not legible in the source.]
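The 50% rule can be stated as a short sketch; the domain names and assignment counts below are hypothetical.

    from collections import Counter

    n_raters = 16    # packets returned
    # Each rater's set of domains judged adversely affected by one
    # DSBS item's behavior (hypothetical data).
    assignments = [{"academic", "social"}] * 9 + [{"legal"}] * 7

    votes = Counter()
    for chosen in assignments:
        votes.update(chosen)

    # A domain counts for the item only if at least 50% (8 of 16)
    # of the raters assigned it.
    counted = [domain for domain, n in votes.items() if n >= n_raters / 2]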