Title: Development of the disruptive student behavior scale
Full Citation
Permanent Link: http://ufdc.ufl.edu/UF00102775/00001
 Material Information
Title: Development of the disruptive student behavior scale
Physical Description: Book
Language: English
Creator: Moses, William L., 1936-
Copyright Date: 1986
 Record Information
Bibliographic ID: UF00102775
Volume ID: VID00001
Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: All rights reserved by the source institution and holding location.
Resource Identifier: ltuf - AEJ3538
oclc - 15111825

Full Text






MAY 1986

Copyright 1986


William L. Moses

To Billy

Who would have been proud


I wish to express appreciation to my parents for their

love, understanding, and help; my committee chairperson,

Dr. McDavis, for his counsel and encouragement; my

committee members, Dr. Ziller for his confidence in my

ability to work independently and Dr. Loesch for stepping

into the breach and contributing so much so quickly while

continuing his friendship and support; and my employers at

Pasco-Hernando Community College for their financial


Special thanks go to my friend and colleague, Dr. Tom

Floyd, who listened for hours and encouraged for years,

and to my friends and lovers who were usually supportive,

sometimes distracting, and always worth it.


ACKNOWLEDGMENTS

LIST OF TABLES

ABSTRACT

Statement of the Problem
Purpose of the Study
Need for the Study
Significance of the Study
Definition of Terms
Organization of the Study

Definition of Disruptive School Behavior (DSB)
Identification, Assessment, and Placement
Rating Scale Development
Psychometric Properties of Rating Scales
Uses of Behavior Rating Scales
Summary

Research Questions
Construction of the DSBS
Validation of the DSBS
Reliability of the DSBS
Field Study
Data Analyses
Validity
Reliability
Limitations

Results
The Severity Factor
The Samples
Research Question One
Research Question Two
Research Question Three
Research Question Four
Summary
Discussion

RECOMMENDATIONS
Conclusions
Implications
Summary
Recommendations

REFERENCES

BIOGRAPHICAL SKETCH


Table

1. Domains of Student Life Influenced by the School Experience

2. Potential Adverse Consequences of DSBS Behaviors

3. Rating Form Distribution by Demographic Categories--Norming Group

4. Rating Form Distribution by Demographic Categories--Disruptive Group

5. Frequency of Observed DSBS Behaviors by Constructs

6. DSBS Constructs by Number

7. Assignment of Proposed Scale Items to Constructs for Content Validation

8. Follow-up Study for Assignment of Proposed Scale Items to Constructs

9. Comparison of Disruptiveness Ratings by Teachers and Non-teaching Personnel

10. DSBS Ratings and z-scores for the Disruptive Group

11. DSBS Ratings and z-scores for the Norming Group

12. Test-Retest Correlations


Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial
Fulfillment of the Requirements for the
Degree of Doctor of Philosophy



William L. Moses

May 1986

Chairperson: Roderick McDavis, Ph.D.
Major Department: Counselor Education

Disruptive behavior is currently seen by both educators

and the public as a major problem in American

education. A procedure for quantitatively assessing

disruptive behavior in schools is required to show a need

for intervention programs and to select students for

placement in either special education or alternative

education programs. The purpose of this study was to

develop and validate an instrument, the Disruptive Student

Behavior Scale (DSBS). The DSBS is intended for use in

assessing quantitatively the disruptive school behaviors


of middle and junior high students referred for placement

in special education and alternative education programs.

This study investigated the position that disruptive

school behavior (DSB) can best be described in terms of

its type, frequency, and severity. The use of teachers as

observers and raters of disruptive school behavior is

discussed. Using teacher-generated behavioral statements

from disciplinary referrals to better describe DSB is

suggested. A review of various rating scale development

procedures attempted by business, industry, and government

is summarized.

A set of 10 constructs was selected to define DSB.

Scale items were developed from referral statements on

disciplinary records in a junior high school. A severity

factor was incorporated into the scoring system so that

behaviors rated as more detrimental to the student were

given a higher DSBS rating.

The DSBS was field tested in a public middle school.

Students in a norming group and a criterion, or disruptive,

group were rated by their classroom teachers using

the DSBS. A norm for disruptive behavior for the target

school was calculated and a criterion for classifying a

student as disruptive was established.

Results indicated the DSBS could identify the criterion

group of disruptive students, classify individual

students as disruptive, and exclude non-disruptive

students from the disruptive group. A follow-up study

suggested the results were consistent over time for all

DSBS ratings except those at the lowest end of the scale.



The public school system in the United States has

been assigned a major role in socializing and enculturating

American youth (Filipczak, 1978). The U.S. Supreme Court

in its 1954 landmark civil rights decision (Brown v. Board

of Education of Topeka, 74 S.Ct. 686, 691) described

education as "a principal instrument in awakening the

child to cultural values, in preparing him for later

professional training, and in helping him to adjust

normally to his environment." The materialistic emphasis

of American society and culture ordains that the educational

institution at all levels be driven by the broadly

defined goal of career success for its graduates (Bell,

1984; DiPrete, 1981, p. 199; National Education Association

(NEA), 1975, p. 108).

Unfortunately, a significant number of students are

detoured from this goal when educators describe them as

displaying behaviors inappropriate to the school environment

and not attributable to legally-defined mental or

emotional handicaps. Suspensions, expulsions, and assignments

to alternative programs are evidence of failure by

the educational system to effect students' adherence to

current social norms and culturally-specified behaviors.

The consequences to the schools for this failure include

loss of both funds and credibility, neither of which the

educational system has in sufficient quantity to squander.

Attempts to correct this failure to convey norms and

behaviors effectively have included both exceptional

child education and alternative schooling programs. The

Education for All Handicapped Children Act of 1975 (P.L.

94-142)(Department of Health, Education, and Welfare,

1977) effectively administered the coup de grace to

exceptional child education approaches in Florida by

failing to include a category appropriate to disruptive

behavior (Florida Department of Education, 1975, 1985).

Alternative programs frequently fail to provide for

selection and discharge criteria, rendering evaluation

virtually impossible (Pinellas County School District,

1982). A primary reason for failure to specify behavioral

criteria for alternative schooling programs is the lack of

appropriate instruments for quantifying disruptive

behavior (Salvia & Ysseldyke, 1981, pp. 8, 9).

Inadequacies of existing behavioral assessment

instruments include failure to provide for local norming,

inclusion of inappropriate items, omission of the severity

factor, and inadequacy of prescriptive information

(Mesinger, 1982). An instrument providing both a

theoretical and a pragmatic rationale for identifying

disruptive students is a requirement for reconsidering the

inclusion of this category in special education legislation

and enhancing the credibility of alternative

education programs (Reeves, Perkins, & Hollon, 1978).

Statement of the Problem

Disruptive behavior in the public school system is

not a new phenomenon (Garibaldi, 1979). That it remains a

problem is emphasized by Robert J. Rubel in introducing a

collection of papers on crime and violence in public schools:


The issue in the 1980's no longer centers on
whether or not violence in American schools is
serious; the issue no longer centers on whether
violence is increasing or decreasing; the issue
no longer centers on technical anomalies concerning
under- or over-reporting of incidents. In
the debate of the 1980's, the primary issue
before large proportions of our urban schools
(and sizeable numbers of our suburban and even
rural schools) revolves around the continued
viability of American education as it existed a
generation ago. (1980, p. 5)

The U.S. government has acknowledged the existence

of disruptive behavior by awarding federal grants for

alternative education pilot programs (Law Enforcement

Assistance Administration, 1979; Moses, 1976).

Included in definitions of disruptive school

behavior (DSB) are such varied activities as talking,

hitting, yelling (Mayer & Butterworth, 1979); defying

rules and procedures (Walker, 1979); aggressive

behavior which interrupts the instructional program

(Foley, 1982); and conduct disorders (American Psychiatric

Association, 1980, pp. 45-50). Forness and Cantwell (1982)

and Forness, Sinclair, and Russell (1984) have identified

these categories as likely to be ineligible for special

education services under P.L. 94-142.

The U.S. government (Department of Health, Education,

and Welfare, 1977), in implementing P.L. 94-142, specifically

denied services to the "socially maladjusted."

Florida law provides essentially the same restrictions

(State Board of Education Rule 6A-6.3016), although Bower

(1982), whose research (Bower, 1958) formed the basis for

the P.L. 94-142 definition of emotionally disturbed,

called this exclusion "contradictory in intent and content

with . . . the research from which it came" (1982, p.


The need for alternative education services for

disruptive students seems supported by reports of the

widespread existence of DSB. Individuals and institutions

reporting on the continuing crisis in school discipline

include the California Department of Education (1973),

the National Education Association (1975), the U.S.

Congress (Bayh, 1975; Tygart, 1980), the Michigan

Department of Education (Vergon & Williams, 1978), the

National Institute of Education (Feldhusen, 1978), Cross

and Kohl (1978), Duke (1978), the New York State United

Teachers (1979), and the National Education Association

(1980). The Safe School Study Report to Congress (National

Institute of Education, 1978) indicated 5,000 teacher

assaults per month occurred across the nation. The Gallup

Poll on Education (Gallup, 1984) continues to report lack

of student discipline as the number one concern of

Americans about the public school system.

In Florida, the Governor's Task Force on Disrupted

Youth (GTFDY) found 17,983 student-days lost to suspensions

over a 2-year period in the 10 school districts

studied (GTFDY, 1973, p. 11). An analysis of conduct code

violations in Duval County, Florida, schools for 1980-1981

revealed more than 33,000 violations resulting in 13,679

days lost from school (Moses, 1981).

The aversive consequences of chronic DSB for students

include lowered self-esteem and functioning level (Caliste,

1979); dropping out and underemployment (Grise, 1980; NEA,

1975; Safer, Heaton, & Parker, 1981); alienation

(Garbarino, 1980; Moyer & Motta, 1982); and criminal

activity (Edwards, Roundtree, Kent, & Parker, 1981;

Mitchell & Rosa, 1981). Likewise, from the perspective of

the school system DSB is undesirable, involving excessive

teacher attention (Rubel, 1977, Chap. 1), litigation

(Lufler, 1982), vandalism costs (Goldstein, Apter, &

Harootunian, 1984), teacher stress (Pettegrew & Wolf,

1982), and weakened public support (Amos, 1980). Consequences

for the community include criminal actions and

psychiatric referrals (Faretra, 1981). Levin (1972) estimated

the expense of inadequate education to be about 6

billion dollars a year (1972 dollars) for costs associated

with welfare and crime.

Researchers have identified the middle and junior

high school age student as particularly prone to behavior

disorder (Geiger & Turiel, 1983; Loeber, 1982; Nielsen &

Gerber, 1979; Quay, 1978). These studies suggest the

middle and junior high schools as a focus for identifying

and remediating disruptive school behavior. Unfortunately,

no adequate instruments are available specifically

for this population (Mesinger, 1982). Instruments

developed from clinical populations contain some items

irrelevant to the non-clinical population in the public

schools (Quay & Peterson, 1967). Instruments offered with

norms developed from research samples and no procedure for

developing local norms for disruptive behavior do not

consider the placement needs of local school districts

(Messick, 1980). Levels of disruptive behavior that can

be managed within the regular school environment vary

across settings because of differences in such factors as

facilities, experience of teachers and administrators, and

school board policies.

Current instruments fail to consider the widely

differing consequences of specific disruptive acts (Kane &

Bernardin, 1982). Some possible effects of this omission

may be to group together students whose behaviors differ

widely in their severity, to encourage conceptualizing all

disruptive behavior as equally deleterious, and to base

placement decisions on personal judgments about the seriousness

of a particular type of behavior. Neither does

any available instrument provide procedures for creating a

prescriptive profile of a student based on the authors'

conceptual model of disruptive school behavior (Salvia &

Ysseldyke, 1981). This failure may seriously limit the

interpretation and application of rating scale results.

Purpose of the Study

The purpose of this study was to develop and validate

an instrument, the Disruptive Student Behavior Scale

(DSBS). The DSBS would be used to assess quantitatively

the disruptive school behaviors of students referred for

placement in either special education or alternative

education programs.

Need for the Study

Salvia and Ysseldyke (1981, pp. 443, 444, 450) have

called for norm-referenced instruments to support

placement decisions, evaluate student progress, evaluate

programs, provide intervention suggestions, and help

parents understand their children's abilities in relation

to other students. Reeves et al. (1978) called for

reliable instruments to use in placing handicapped

children. Also, Camp (1981) notes that

there is very little current, objective,
research-based information in existence to help
identify specific student behavior problems
occurring in the schools. A need exists for
research of this nature to quantitatively
establish the actual, current situation with
regard to student discipline problems in the
public secondary schools. (p. 48)

Presumably, these calls for reliable and valid instruments

apply both to special education and alternative schooling

programs, as both to some degree remove the student from

mainstream classroom activities. However, the Florida law

(State Board of Education Rule 6A-6.3017) providing for

special education programs for the socially maladjusted

was repealed July 24, 1981.

"Educational alternative programs" were created in

Florida in 1978 (Florida Statute 230.2315) specifically to

reduce disruptive behavior and truancy. Florida Statute

229.565 provides for the evaluation of "procedures for

identification and placement of students in educational

alternative programs." As an example of practice, in 1982

the alternative education program in the Pinellas County

School District did not require quantitative behavioral

assessment prior to placement.

Studies, however, have identified problems in using

subjective criteria for alternative education placement.

Disagreements in ranking behaviors (Pisarra & Giblette,

1981), value systems (Messick, 1980), labels applied to

students (Leyser & Abrams, 1982), teaching experience

(Rubel, 1977, p. 51), level of frustration (Walker &

Holland, 1979), race (Arnove & Strout, 1978; Bennett &

Harris, 1982; Florida DOE, 1983; Goldsmith, 1982;

Mesinger, 1982), sex (Bennett & Harris, 1982), and

socioeconomic status (Arnove & Strout, 1978; NEA, 1975)

are variables that may confound perceptions of disruptive behavior.


One way to help neutralize these confounding variables

is to use quantitative measures. A review of

current literature indicates that appropriate instruments

may not exist. After a major study of alternative education

programs, Mesinger (1982) was unable to recommend

even one instrument for use in selecting students.

Messick (1964, 1965, 1980) argued against applying to

local environments behavioral norms developed elsewhere.

Stott, Marston, and Neill (1975, p. 8), Wodarski and Pedi

(1978, p. 480), and Quay and Peterson (1975, 1979) advised

the setting of local norms. However, no instrument

located in this review provides a specific procedure for

determining local norms.

Another advantage of locally-developed norms is the

opportunity to compute the mean DSB level for individual

schools. Intervention program entry and exit criteria may

be defined by the deviation of an individual student's

mean DSB score from the school mean. This may provide the

type of quantitative assessment required by state (SBE

Rule 6A-6.3016) and federal (P.L. 94-142) law for special

education placement and may meet the need noted by

Mesinger (1982) for quantitative instruments to assist in

selecting students for alternative education programs.
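
The comparison described above, in which an individual student's mean DSB score is judged by its deviation from a locally computed school mean, can be sketched in a few lines. This is an illustrative sketch only and is not the procedure used in this study: the function names, the sample scores, and the cutoff of 1.5 standard deviations are all assumptions chosen for demonstration.

```python
from statistics import mean, stdev


def local_norm(norming_scores):
    """Return the (mean, sample standard deviation) of a school's norming-group scores."""
    if len(norming_scores) < 2:
        raise ValueError("at least two norming scores are needed")
    return mean(norming_scores), stdev(norming_scores)


def z_score(student_score, norm_mean, norm_sd):
    """Express a student's mean DSB score in standard deviations from the school mean."""
    if norm_sd == 0:
        raise ValueError("norming scores show no variation")
    return (student_score - norm_mean) / norm_sd


def meets_entry_criterion(student_score, norm_mean, norm_sd, cutoff=1.5):
    """True when the score lies at least `cutoff` SDs above the school mean.

    The default cutoff is hypothetical; a district would set its own.
    """
    return z_score(student_score, norm_mean, norm_sd) >= cutoff


# Hypothetical mean DSB scores for a norming group in one school.
norming_scores = [2.0, 3.0, 4.0, 5.0, 6.0]
school_mean, school_sd = local_norm(norming_scores)

print(round(z_score(8.0, school_mean, school_sd), 2))
print(meets_entry_criterion(8.0, school_mean, school_sd))
print(meets_entry_criterion(5.0, school_mean, school_sd))
```

Because the mean and standard deviation are computed from the local norming group, the same raw score may or may not meet the criterion in different schools, which is the point of establishing local norms.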

A major need in intervention programs is prescriptive

information (Lovitt, 1967, p. 238; Spivack & Swift, 1977).

However, many instruments do not provide operationally-defined

items which are useful in the classroom. For

example, the Behavior Problem Checklist (Quay & Peterson,

1979) items used to identify conduct problem students

include "restlessness," "disruptiveness," and "irresponsibility."

These items originally were taken from the files

of a child guidance clinic (Quay, 1977).

Defining disruptive behavior on the dimensions of

type, frequency, and severity has received support from

numerous sources (American Psychiatric Association, 1980,

p. 45; Bernardin, LaShells, Smith, & Alvares, 1976; Camp,

1980, 1981; Grosek, 1979; Taylor, Warren, & Slocumb,

1979). Criticisms of assessment procedures not incorporating

a severity factor have been made by Kane and Bernardin

(1982) and Pisarra and Giblette (1981). Nevertheless, no

instrument was located which specifically recommended

using a severity factor in assessing disruptive school behavior.


An instrument which provides for quantifying DSB may

help to protect students from placement in school programs

according to inappropriate criteria. To be most effective,

the instrument should include provisions for

establishing locally-determined placement norms, for

comparing with those norms the scores of individual

students, for providing prescriptive information, and for

systematically considering the type, frequency, and

severity of the disruptive behaviors.

Significance of the Study

This study investigated the theoretical position that

disruptive school behavior (DSB) can best be described

in terms of its type, frequency, and severity. Theoretical

considerations in the use of teachers as observers and

raters of disruptive school behavior were discussed.

The feasibility of using teacher-generated behavioral

statements from disciplinary referrals to better specify

the parameters of DSB was suggested. A review of various

rating scale development procedures attempted by business,

industry, and government was summarized.

The instrument developed by this study will initially

be most appropriate as a research tool for conducting

studies of DSB. The availability of a process for

establishing local norms for DSB may facilitate local

research studies in evaluating the effectiveness of

disciplinary measures, in-service training, and alternative

education programs. This study will likely suggest

additional areas for other investigations.

The identification of disruptive students for interventions

is not standardized. This instrument may assist

in establishing quantitative criteria for selection,

placement, and treatment of disruptive students. This, in

turn, may lead to recognition of DSB as a category for

exceptional student education funding.

A major premise in much of the literature concerning

DSB is the role of school personnel in exacerbating

disruptive behavior. It may be that an instrument which

provides a behavioral profile of the disruptive student

will suggest goals for in-service training programs.

Definition of Terms

For the purposes of this study, the following

definitions apply:

Alternative education program. An educational

procedure which provides intervention outside the regular

classroom for students exhibiting some predetermined level

of disruptive or disinterested school behavior.

Disruptive school behavior (DSB). Behavior that

disrupts the learning of self and/or others and is not

attributable to severe emotional disturbance or other

exceptional education categories.

Delinquent behavior. Behavior by persons under 18

years of age which violates laws and regulations pertaining

to them.

Exceptional child (student) education programs.

Programs which receive additional funding in order to

better serve the needs of students meeting governmental

guidelines for special assistance.

Experienced teachers. Full-time, regular classroom

teachers who have held that position at least two academic years.


Expulsions. Removal from school for at least the

remainder of the school year.

Locally developed norms. Criteria for comparing an

individual student's DSB with the expected DSB of a

specific reference population in the local school or district.


Maladaptive social behavior. Behavior not of organic

origin which would be judged by impartial observers to be

inappropriate for the social situation and which ultimately

results in aversive consequences for the person

exhibiting the behavior.

Method bias. The influence on ratings of the type of

rating method used.

Non-quantitative assessment. See Qualitative assessment.

Qualitative assessment. Evaluation based on individual

opinion and lacking a systematic basis.

Quantitative assessment. The use of numbers in

describing behavior so that a higher number indicates a

higher level of the behavior.

Severity. A prediction, stated quantitatively, of

the potentially detrimental consequences a disruptive

behavior would likely have for a student.

Special education programs. See Exceptional child

education programs.

Suspensions. Temporary removal from the regular

educational program of a school, usually involving

exclusion from school facilities for a specified number of days.


Organization of the Study

There are four remaining chapters in this

dissertation. Chapter Two will present a review of the

literature related to the development of an instrument to

assess disruptive school behavior (DSB). Specifically,

consideration will be given to disruptive behavior in the

schools, existing assessment methods, rating scale development,

the psychometric properties of rating scales, and

the possible uses of results from a disruptive behavior

rating scale.

Chapter Three will present the methodology employed

in the development, validation, and field testing of the

Disruptive Student Behavior Scale (DSBS). Included are

the research questions, information on the population,

procedures used in developing the scale, pilot testing,

data analyses, and possible limitations of the study.

Chapter Four will present the results of this study,

including the data and the information inferred from the

data. An explanation of the results will be given and

they will be related to past research.

Chapter Five will include conclusions from this

study, along with implications for theory, research,

practice, and training. A summary of the entire study

will be presented, followed by recommendations for additional

research.



This study requires an investigation of the history

and current status of attempts to define disruptive

behavior in public schools; identification, assessment,

and placement efforts directed toward disruptive students;

rating scale development procedures; research into the

psychometric properties of rating scales; and the use by

schools of results obtained from rating scales. Accordingly,

this chapter will review research and opinion

covering both theoretical and applied considerations

relating to these topics.

Definition of Disruptive School Behavior (DSB)

According to Camp (1981), the major issue in student

discipline in the secondary schools is how to describe

quantitatively the kinds of disruptive behavior currently

occurring. Summarizing a 1978 survey of state directors

of special education, Hirshoren and Heller (1979) reported

that while individual states define emotional disturbance

consistently, there is considerable variation in the kinds

of children so identified. That is, children meeting

program criteria in one state appeared to be excluded in

another. Much has been written in an attempt to resolve

this situation. A review of the literature suggests the

emergence of five discrete perspectives: (a) empirical,

(b) clinical, (c) conceptual, (d) educational, and (e) school.


The empirical approach of applying factor analysis

(Cattell, 1978; Gorsuch, 1974) to a variety of items has

resulted in the identification of some common behaviors

associated with disruptive school behavior and has contributed

to defining DSB (Achenbach, 1978; Achenbach &

Edelbrock, 1978; Edelbrock, 1979; Peterson, 1961; Quay,

1964, 1978; Quay & Peterson, 1967). However, researchers

utilizing the empirical approach have included a broad

range of behaviors, including many which identify delinquency

and personality disorders (Freemont & Wallbrown,

1979), and so the scales developed from these studies have

limited application for school personnel in defining the

specific category of DSB.

The classification of disorders contained in the

Diagnostic and Statistical Manual of Mental Disorders,

3/e. (DSM-III) (American Psychiatric Association, 1980)

and research studies incorporating these classifications

and descriptions exemplify clinical efforts to define

disruptive school behavior. Hewett and Forness (1982)

pointed to the necessity of finding a common frame of

reference between educational and psychiatric diagnoses in

order for school personnel to accurately interpret

clinical reports. Forness and Cantwell (1982) concluded

that the respective diagnostic systems of psychiatry and

special education remain dissimilar. Likewise, other

studies (Loeber, 1982; Werry, Methuen, Fitzpatrick, &

Dixon, 1983) failed to find support for the use of

psychiatric diagnoses to assign students to special

education programs.

The conceptual approach utilizes experience,

research, and opinion in formulating descriptions of what

is usually referred to in this perspective as "problem

behavior" (Jessor & Jessor, 1977, p. 4). Cullinan (1975),

Howell (1978), and Richard Jessor (1982) are among those

applying a psychosocial conceptualization of problem

behavior to the study of adolescent behavior. Nevertheless,

while the conceptual perspective gives support to

the notion of comparing the behavior of an individual

student with the behavior of peers before declaring the

student to be deviant, this perspective fails to provide

specific criteria for making such a comparison.

The educational perspective includes the definitions

contained in federal and state statutes, guidelines

proposed by governmental agencies, and district codes of

student conduct. In 1977, the U.S. government, without

defining the term, specifically excluded the socially

maladjusted student from receiving exceptional child

education services under P.L. 94-142. The term "socially

maladjusted" is not defined in the latest Florida guidelines

for providing special education for exceptional

students (Florida DOE, 1985). The U.S. Bureau of Education

for the Handicapped has sponsored the compilation of

a manual on behavior disorders (Yard, 1977). However,

these items are too general for use in a quantitative


Codes of student conduct contain lists of behaviors

for which punishment may be administered. Offenses listed

in the codes may be violations of either school rules

(e.g., inappropriate display of affection) (Duval County

Public Schools, 1980) or of law (e.g., vandalism) (Pinellas

County Schools, 1983). While these offenses must be

considered in defining disruptive school behavior, they

exclude many of the disruptive behaviors frequently

occurring within the classroom. Federal, state, and local

guidelines seem insufficient for operationally defining

DSB specifically enough to be useful in a selection


The school perspective focuses on the interactions of

students, teachers, and administrators within schools.

Disruptive school behavior is seen as a product of these

interactions. H. M. Walker, author of The Walker Problem

Behavior Identification Checklist (1970), described the

acting-out child as one who usually defies rules and

ignores classroom procedures, is difficult to manage,

avoids failure by attempting little academic work, and

alienates teachers and other students by behaving


Specific behaviors often include hitting, yelling,

leaving seat, arguing, having temper tantrums, and provoking

others, and often lead to confrontations. These

confrontations may be verbal, physical, or both. Acting-out

behavior may occur in the classroom, in nonclassroom

areas, or both. Walker (1970) proposed that acting-out

children are differentiated from other students by the

frequency, or quantity, of these behaviors, not by the

type of behaviors. Thus, a measuring instrument must

provide for a frequency component.

Camp (1981) explored the types of behavior considered

to be disciplinary problems, the perceived degree of

severity of these behaviors, and the frequency with which

these behaviors were observed. Camp found that the types

of behaviors rated most serious were rarely observed and

concluded that the most serious problem may be the

frequent, though mild, behaviors that undermine student

and teacher morale. A study of 21 secondary school

administrators' attitudes toward aggressive behavior

suggested that suspensions were awarded according to the

administrators' attitudes toward the referred behavior,

rather than according to a consistent standard for the

school district (Pisarra & Giblette, 1981).

An evaluation of literature of the school perspective

suggests that DSB can be defined in terms which students,

teachers, and administrators understand; that the three

factors of type, severity, and frequency need to be

considered; and that measures of DSB need to be standard-

ized. In this section five perspectives for defining

disruptive school behavior were presented. Each perspec-

tive offers some assistance in differentiating this

category from other behavioral categories. There appears

to be support for an instrument which operationally

defines types of behaviors occurring throughout the school

environment, assigns a quantity to each descriptive item

based on the perceived frequency of occurrence and sever-

ity, and provides for comparing the score of an individual

student to a predetermined norm for that environment.
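
A minimal sketch of the kind of scoring such an instrument implies is given below. Each item's frequency rating is multiplied by a severity weight, the products are summed, and the total is expressed in standard deviation units against a local norm. The multiplicative rule, the item labels, the weights, and the norm values are illustrative assumptions only; they are not the scoring rules of any instrument reviewed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ItemRating:
    behavior: str     # hypothetical item label
    frequency: int    # rater's frequency rating, e.g., 0 (never) to 4 (very often)
    severity: float   # assumed severity weight for this type of behavior


def weighted_total(ratings):
    """Sum of frequency x severity across items (assumed combination rule)."""
    return sum(r.frequency * r.severity for r in ratings)


def standard_score(total, norm_mean, norm_sd):
    """Distance of a student's total from the local norm, in SD units."""
    if norm_sd <= 0:
        raise ValueError("norm_sd must be positive")
    return (total - norm_mean) / norm_sd


# Hypothetical ratings for one student.
ratings = [
    ItemRating("leaves seat without permission", frequency=4, severity=1.0),
    ItemRating("argues with teacher", frequency=2, severity=2.0),
    ItemRating("hits another student", frequency=1, severity=3.0),
]
total = weighted_total(ratings)                         # 4*1.0 + 2*2.0 + 1*3.0
z = standard_score(total, norm_mean=5.0, norm_sd=3.0)   # hypothetical local norm
```

With these made-up values the frequent but mild behavior contributes as much to the total as the rare but serious ones, which is the pattern Camp (1981) described.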

Identification, Assessment, and Placement

"Measurement is the construction of a model of some

property of the world" (Fraser, 1980, p. 27) and in

education this property is often the behavior of a

student. One role of the model provided by a measure is

to give accurate prescriptive information for planning

interventions with students (Forness, 1983). Several

studies have suggested this is being performed

inadequately (Greenwood, Walker, & Hops, 1977; Schenck,

1980; Sinclair, 1980; Sinclair & Kheifets, 1982; Spivack &

Swift, 1973; Strain, Cooke, & Apolloni, 1976).

Fraser (1980) acknowledged that psychological mea-

surement has been regarded as being quantitatively and

qualitatively of a lower order than physical measurement.

To achieve improvement, Ysseldyke and Marston (1982) have

argued for the use of direct observations of target behav-

iors by either teachers or trained observers. However,

Jones, Reid, and Patterson (1975) found observer reli-

ability varied inversely with the complexity of the

behaviors being observed.

Attempts to improve the validity of observations have

included such sophisticated approaches as Multidimensional

Scaling (MDS) (Torgerson, 1958). Sanson-Fisher and

Mulligan (1977), using adolescent student models, found

only marginal improvement for this technique over ratings

by classroom teachers. A comparison of a computer-driven

program for selecting behavioral/emotional disorders with

two expert psychologists' selections indicated no mean-

ingful differences existed (McDermott & Hale, 1982).

Weinrott (1979) summarized studies that indicated global

ratings could be significantly influenced by expectations,

while post hoc ratings of the same children by the same

raters when recorded on an instrument accurately reflected

discrete behavioral events. Gaynor and Gaynor (1976)

argued for instruments written to define behaviors so they

may be described quantitatively by teachers.

Beltramini (1982) suggested that scale-item content

is more important than other variables in obtaining reli-

able and valid results. A review by Albaum, Best, and

Hawkins (1981) of measurement literature found evidence to

support the use of from five to seven categories on Likert-

type scales, with no significant losses in reliability,

validity, or discrimination when compared with instruments

using more intervals. Fewer intervals sometimes resulted

in a loss of discriminative power and validity. It

appears that teachers using instruments which operation-

ally describe disruptive behaviors can be effective post

hoc raters and are able to provide reliable and valid

identification of disruptive school behavior (Edelbrock,

1979; Gresham, 1982; O'Leary & Johnson, 1979).

A review of current assessment techniques suggests

the emergence of a quantitative/qualitative dichotomy,

which will now be explored. In two reviews (Spivack &

Swift, 1973, 1977) of instruments for measuring secondary

school classroom behaviors no instrument was located which

limits its focus to disruptive school behavior, uses only

behaviorally-stated items, and provides for calculating

local norms. Descriptions follow of representative

instruments currently in use.

The Behavior Problem Checklist (BPC) (Quay & Peterson,

1967, 1975, 1979) is a 55-item scale of behavioral traits

developed from a review of clinical records of kinder-

garten through eighth grade students referred for

psychiatric treatment (Quay, 1977). The items were

assigned by factor analysis to four scales plus a grouping

suggestive of psychosis. Epstein, Cullinan, and Rosemier

(1983, p. 172) and Gresham (1982, p. 137) reported that

the BPC is one of the behavior rating scales most widely

used in school studies.

The BPC has been used extensively both as a research

device (Eaves, 1975; Jacob, Grounds, & Haley, 1982;

Kelley, 1981; Touliatos & Lindholm, 1981) and in selecting

students for interventions (Algozzine, 1977; Balow, 1979;

R. Bower, 1969; Gerard, 1970; Ingram, Gerard, Quay, &

Levinson, 1970; McCarthy & Paraskevapoulas, 1969). Jacob

et al. (1982) reported that reviews of studies utilizing

the BPC suggested reliability and validity issues in need

of further study. The inability of the BPC to provide

other than broadband classifications has been noted

(Achenbach & Edelbrock, 1978).

Comprehensive normative data are not available for

the BPC for adolescents (Kelley, 1981). In an investiga-

tion of the effects of race on BPC ratings, Eaves (1975)

found that white teachers consistently rated black

students higher than white students on three of the

subscales. Black teachers showed no such bias. Eaves

(1975) concluded this bias could have a major effect on

the reported norms for the BPC. Touliatos and Lindholm

(1981) found that grade level, sex, and social class had a

significant effect on BPC ratings. However, differences

between schools and teachers contributed more variance in

the BPC ratings than grade, sex, and social class.

Touliatos and Lindholm (1981) suggested that Quay and

Peterson's (1967) recommendations be followed and

individual assessment be based on norms calculated for

particular schools and individual teachers.
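
The recommendation to base individual assessment on teacher-level norms can be illustrated with a short sketch. The code below computes a mean and standard deviation for each teacher's own ratings and then expresses a student's score relative to that teacher's norm. The record layout, the teacher and student labels, the scores, and the use of a population standard deviation are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def teacher_norms(records):
    """Return {teacher: (mean, sd)} from (teacher, student, score) records."""
    by_teacher = defaultdict(list)
    for teacher, _student, score in records:
        by_teacher[teacher].append(score)
    return {t: (mean(s), pstdev(s)) for t, s in by_teacher.items()}


def relative_score(score, norm):
    """Score in SD units relative to one teacher's norm (0.0 if no spread)."""
    norm_mean, norm_sd = norm
    return 0.0 if norm_sd == 0 else (score - norm_mean) / norm_sd


# Hypothetical ratings: teacher_a rates everyone higher than teacher_b does.
records = [
    ("teacher_a", "s1", 10), ("teacher_a", "s2", 14), ("teacher_a", "s3", 18),
    ("teacher_b", "s4", 2), ("teacher_b", "s5", 4), ("teacher_b", "s6", 6),
]
norms = teacher_norms(records)
high_a = relative_score(18, norms["teacher_a"])
high_b = relative_score(6, norms["teacher_b"])
```

In this example a raw score of 18 from one teacher and a raw score of 6 from the other occupy the same relative position once each is compared with the rater's own norm.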

Spivack and Swift (1973) concluded that the BPC was a

reasonably reliable measurement tool. Potential users

were cautioned, however, that most items are not specifi-

cally observable, but more like labels which imply

behaviors and designate traits. Likewise, Stott (1971,

p. 232) cited certain BPC items as requiring a teacher to

make inferences about students' feelings (e.g., "feelings

of inferiority"), being vague or ambiguous (e.g., "oddness,

bizarre behavior"), and relating to behaviors unobservable

by a teacher (e.g., "stays out late at night," "bed

wetting"). This review has identified several areas of

the BPC for which additional research has been suggested.

The Behavior Rating Profile (BRP) (Brown & Hammill,

1978) is composed of five rating scales and a sociogram.

Three of the scales (60 items) are completed by the target

student, one (30 items) by the teacher, and one (30 items)

by parents. The sociogram is a peer nominating techni-

que. The student scales provide self-ratings of behaviors

at home, at school, and with peers.

The BRP is based on an ecological approach which,

according to the authors, recognizes that students'

behaviors are dependent on the settings in which they

occur. Its purposes are the identification of students

with behavior problems and the differentiations among

learning disabled, emotionally disturbed, and behaviorally

disordered students in grades 1-12. Each of the six

measures is described as independent and individually

normed, allowing any scale to be used alone or in conjunc-

tion with any of the others.

The BRP manual (Brown & Hammill, 1978) reports

internal consistency reliability coefficients exceeding

.80. Concurrent validity was investigated by correlating

the BRP with measures obtained from other rating scales.

Adequate construct and content validity also are reported

by the authors. Norms are provided using scale scores

with means of 10 and standard deviations of 3, with scores

from 7 to 13 considered to be in the normal range.
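
A scale score with a mean of 10 and a standard deviation of 3 can be sketched as a linear transformation of a raw score, with the normal range corresponding to one standard deviation on either side of the mean. The linear conversion and the raw-score statistics below are assumptions for illustration; the conversion procedure actually used in the BRP manual is not described in this review.

```python
def scaled_score(raw, sample_mean, sample_sd, target_mean=10.0, target_sd=3.0):
    """Linearly rescale a raw score to a scale with the target mean and SD."""
    if sample_sd <= 0:
        raise ValueError("sample_sd must be positive")
    z = (raw - sample_mean) / sample_sd
    return target_mean + target_sd * z


def in_normal_range(scaled, low=7.0, high=13.0):
    """True when a scale score falls within the 7 to 13 band."""
    return low <= scaled <= high


# Hypothetical normative sample: raw mean 50, raw SD 10.
borderline = scaled_score(40, sample_mean=50, sample_sd=10)
extreme = scaled_score(25, sample_mean=50, sample_sd=10)
```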

One study (Reisberg, Fudell, & Hudson, 1982) of

behavior disordered students indicated that regular

classroom teachers gave higher ratings than special

educators (mean = 8.85 vs. mean = 6.87). Thus, norms may vary

according to the type of respondent (e.g., regular teacher

or special education teacher). Also, students' self-

ratings were inflated relative to other respondents'

ratings. Other investigators have noted problems

associated with attempts at multiple and self-ratings.

Lessing and her associates (Lessing & Clarke, 1982;

Lessing, Williams, & Gil, 1982; Lessing, Williams, &

Revelle, 1981) have reported on their unsuccessful

attempts to develop parallel checklists for use by

parents, teachers, and clinicians in psychiatric diag-

noses. Lobitz and Johnson (1975) found low correlations

between parent ratings and observed behaviors. Variables

confounding self-ratings include halo effect (Holzbach,

1978), social desirability (Dunnett, Koun, & Barber, 1981;

Seidman, Rappaport, Kramer, Linney, Herzberger, & Alden,

1979), and lack of self-knowledge (Beitchman & Raman,


Ledingham, Younger, Schwartzman, and Bergeron (1982)

investigated teacher, peer, and self-ratings of 801

elementary school students. Self-ratings yielded the

lowest ratings for deviant behavior, aggression, and

withdrawal and the highest ratings for likability. Accu-

racy of self-evaluation has been found to be positively

correlated with high intelligence, high achievement

status, and internal locus of control, characteristics not

usually associated with DSB (Dunnett et al., 1981).

Reported research using the Behavior Rating Profile is

sparse. Additional verification of the assumptions of

equivalency of norms within respondent categories and the

validity of the self-report scales seems indicated.

The Bristol Social Adjustment Guides, 5/e. (BSAG)

(Stott, 1972) consist of 110 behaviorally-stated items

from which teachers select those descriptive of a

student's behavior in the month prior to the rating. The

items were originally developed in 1955 from clinical

observations of children aged 6 to 14 and modified by

classroom teachers (Stott & Sykes, 1956). A primary goal

was to incorporate context into the behavioral descrip-

tions (Stott, 1971).

The BSAG has been used extensively in clinical and

research studies (Davis, Butler, & Goldstein, 1972;

McDermott, 1980; Stott, 1978; Stott & Wilson, 1977).

Reliability and validity data were obtained through

extensive research (Stott et al., 1975) but are not

reported in a manner that is easily abstracted. Normative

data are available only for elementary school populations

(Stott, 1972). More recent research (McDermott, 1980,

1981; McDermott & Hale, 1982) has questioned the

specificity of the core syndromes of the BSAG and called

for further investigation of construct and predictive

validities (Hale & Zuckerman, 1981). At present, it

appears that not all of the core syndromes of the BSAG

have the specificity required in an instrument to be used

in educational placement.

The Hahnemann High School Behavior Rating Scale (HHSB)

is a 13-factor, 45-item scale published in 1971 (Spivack &

Swift, 1971). The HHSB items were developed from observa-

tions of actual classroom behaviors, operationally stated

in educational terms. The items cover both academic and

interpersonal issues and can be rated by teachers in the

classroom. The intent is to provide prescriptive informa-

tion (Spivack & Swift, 1977). The factor scores for each

student are found by adding the raw scores for the three

or four items comprising each factor. These scores are

then combined into a profile, which is used to classify

students on the basis of their ability to adapt to total

classroom demands.
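
The factor-scoring step described above, in which the raw scores of the three or four items belonging to a factor are added and the sums are assembled into a profile, can be sketched as follows. The factor names, item identifiers, and ratings are hypothetical placeholders and do not reproduce the HHSB item assignments.

```python
def factor_scores(item_scores, factor_items):
    """Add the raw scores of the items assigned to each factor."""
    return {
        factor: sum(item_scores[item] for item in items)
        for factor, items in factor_items.items()
    }


# Hypothetical assignment of items to two factors (three and four items).
factor_items = {
    "factor_1": ["item_01", "item_02", "item_03"],
    "factor_2": ["item_04", "item_05", "item_06", "item_07"],
}

# Hypothetical raw ratings for one student.
item_scores = {
    "item_01": 3, "item_02": 5, "item_03": 4,
    "item_04": 1, "item_05": 2, "item_06": 2, "item_07": 1,
}

profile = factor_scores(item_scores, factor_items)
```

Because each factor rests on so few items, a one-point change on a single item shifts the factor score noticeably, which is one way to read the limitation noted in the following paragraph.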

According to the authors (Spivack & Swift, 1973),

validity studies suggest consistent and significant

relationships between factor scores and academic grades.

No data are available on test-retest or interrater reli-

ability (Spivack & Swift, 1973). Norms are available

separately for suburban and urban samples. The HHSB is

limited as a selection device for special education pro-

grams by lack of reliability data, use of only three or

four items per factor, and overlapping among profile


The Behavior Evaluation Scale (BES) (McCarney, Leigh,

& Cornbleet, 1983) is a 52-item rating scale for use by

school personnel. Each item is assigned to a subscale

associated with one of the five characteristics of the

Bower (1958) definition of behavior disorders used in

Public Law 94-142. The BES was developed to aid in

diagnosis, placement, and program planning under federal

guidelines. Since federal criteria specifically exclude

the "socially maladjusted" student, the BES is inappropri-

ate for assessing DSB.

The Portland Problem Behavior Checklist (PPBC)

(Waksman & Loveland, 1980) was developed to aid in

assessment, evaluation, and intervention planning for

school children. The 29 items cover teacher-rated

behaviors for grade levels K-12. Norms are not avail-

able. Items are very generally stated (e.g., aggressive-

physical, destructive) and are rated on a scale of 0 (no

problem) to 5 (severe). It is not clear if this is a

rating of frequency of behavior or severity of the

consequences of the behavior. These features of the PPBC

would seem to limit the preciseness and reduce the confi-

dence level of quantitative scores intended to support

evaluation and placement for professional services.

The Pupil Classroom Behavior Scale (PCBS) (Dayton,

1967) is a 24-item, teacher-administered rating scale

intended to measure the effectiveness of special education

services for students displaying inappropriate classroom

behaviors. Most items are behaviorally stated and yield a

profile of three factors, achievement orientation, socio-

academic creativity, and socio-cooperativeness. Dayton

(1967) suggested using the scales for research on groups

rather than to describe individual students. Norms are

not available. Spivack and Swift (1973) concluded that

the PCBS is flawed by having overlapping items in the

factors and lacking data to support a relationship between

scale scores and emotional adjustment.

The 36-item Conners Teachers' Rating Scale (CTRS)

(Conners, 1969) has been used primarily in clinical diagno-

sis of children, particularly in the area of hyperactivity

(Goyette, Conners, & Ulrich, 1978). It does, however,

cover a wide range of school problem behaviors (Roberts,

Milich, Loney, & Caputo, 1981). There appears to be a

high intercorrelation between the Conduct Problem and

Hyperactivity subscales, limiting the usefulness of the

CTRS in identifying DSB.

The Brief Behavior Rating Scale (BBRS) (Kahn &

Ribner, 1982) was developed from the Devereux series of

rating scales (Spivack, Haimes, & Spotts, 1967). A cross-

validation study (Kahn & Ribner, 1982) reported that 61%

of a socially maladjusted group and 27% of an emotionally

handicapped group were correctly identified. These

results suggest that additional development is needed

to obtain support for the discriminant validity of the


Some of the most complete research in instrument

development has been conducted in attempts to improve the

diagnosis of clinical populations in the school environ-

ment. Although these efforts are not directly comparable

to the intent of the present study, six instruments having

potential interest to researchers working in the school

setting will be summarized.

The Child Behavior Check List (CBCL) (Achenbach,

1978) contains 118 behavior problem items and 20 social

competence items. Parallel forms exist for parents and

teachers. A review by Achenbach and Edelbrock (1978) of

empirical attempts to derive syndromes of child behavior

problems concluded with the recommendation that these

efforts be linked to the existing mental health system.

Recent efforts by these researchers and their associates

(Edelbrock & Achenbach, 1980; Reed & Edelbrock, 1983)

continue to pursue this objective. At present the

applicability of this instrument for educational measure-

ment is limited.

The role of parent observations in describing chil-

dren's behavior is formalized in the Louisville Behavior

Check Lists (Miller, 1967, 1980). A study (Tarte, Vernon,

Luke, & Clark, 1982) confirmed the validity of parent

observations of clinical symptoms in their children.

The items require inferences and judgments by raters.

Eight subscales were created through factor analysis and

although several appear to relate to school activities

(e.g., hyperactivity, antisocial), the content of

individual items comprising the subscales renders them

only marginally useful for school assessments.

The Children's Behaviour Questionnaire (Rutter, 1967)

was developed for teachers' use in screening large numbers

of school children for psychiatric assessment. Many

of the 26 items are vaguely stated and some appear to

require inferences by the rater. The two subscales are

labeled neurotic and antisocial, terms which lack direct

application to the school setting.

The Devereux Adolescent Behavior Rating Scale

(Spivack et al., 1967) was developed to measure behavior

requiring professional intervention. The subscales are

oriented to clinical diagnosis and offer little specific

information for use in placement decisions.

The Pupil Behavior Inventory: 7-12 Grades (Vinter,

Sarri, Vorwaller, & Schafer, 1966) is a 34-item, teacher-

administered rating scale intended to furnish information

on students referred for agency treatment. Behavioral

items were collected from teachers, screened and factor-

analyzed, and grouped into five factors. Lack of data on

reliability, validity, and norms suggests caution in

using this instrument to select students for special

services (Spivack & Swift, 1973).

The Mooney Problem Check List (MPCL) (Mooney, 1942)

has been widely used by counselors to identify problems of

individuals seeking counseling or to explore the problem

profile of a group of students (Sundberg, 1961). However,

two studies (Joshi, 1964; Stewart & Deiker, 1976) of the

underlying factors of the MPCL scales have identified only

a single general factor. The MPCL may be further limited

by utilizing items generated from problems mentioned by

high school students in 1942.

Several instruments designed for other populations

include behaviors often used in descriptions of disruptive

school behavior. The Adolescent Behavioral Classification

Project instrument (Dreger, 1980) was developed for

assessing problems of institutionalized adolescents. An

analysis of the first-order factors indicates some common-

alities with both the Hahnemann High School Behavior

Rating Scale (Spivack & Swift, 1977) and Achenbach and

Edelbrock's (1978) syndromes, but many are couched in

clinical terms that have little or no relevance to the

classroom setting.

Ostrov and associates (Ostrov, Marohn, Offer, Curtiss,

& Feczko, 1980) developed and validated the Adolescent

Antisocial Behavior Check List (AABCL) for delinquents

housed in an institutional treatment setting. The authors

called for modification of the instrument for use in other

settings; however, extensive rewriting of items would seem

to be required.

The Jesness Inventory (Jesness, 1972) was created to

measure attitude change in youthful offenders undergoing

treatment. One study (Graham, 1981) found the Jesness

Inventory did not have the power to discriminate between

non-adjudicated and normal populations and thus would not

be useful in a school setting. The Jesness Inventory

appears best suited for research (Buros, 1978, pp.


The Jesness Behavior Checklist (JBC) (Jesness, 1970)

is also a measure of delinquent behavior. The reliability

and validity of this instrument have been questioned and

the JBC is recommended only for research purposes (Buros,

1978, pp. 873-876).

Non-quantitative assessment often uses nonsystematic

observations to provide the information from which judg-

ments will be made. Judgments about individuals are

required in all assessment. Inaccurate, biased, or sub-

jective judgments can be misleading and harmful (Salvia &

Ysseldyke, 1981). The Russell Sage Foundation Conference

Guidelines (Goslin, 1969) and the 1974 Family Educational

Rights and Privacy Act (P.L. 93-380--the Buckley amend-

ment) established guidelines for the proper collection,

maintenance, and dissemination of data concerning students.

For data to be used in making judgments, they must be

verified. For standardized tests, this verification is

implicit in the psychometric qualities of the instrument.

For observational data, verification requires confirmation

by persons other than the original observers (Salvia &

Ysseldyke, 1981). When the observation is nonsystematic,

verification may be difficult to establish and support and

the assessment and resulting evaluation may be open to


After a classroom teacher nominates a child for

evaluation for exceptional child education services, that

teacher's observation is verified by required legal proce-

dures (P.L. 94-142). There may be no such procedures for

other interventions. The Duval County, Florida, School

District has used teacher and principal nominations as the

criteria for admittance and dismissal from a program to

intervene with students displaying inappropriate social

behaviors (Duval County Public Schools, 1980). Short-term

suspensions in many school districts do not require hear-

ings and are based solely on a judgment by the school

principal (Lines, 1972; Pisarra & Giblette, 1981).

Subjective assessment practices such as these may

allow extraneous variables to influence judgments

(Poulton, 1976). Four such variables are bias, the influ-

ence of observer expectations, inaccurate perceptions,

and vagueness of the criteria for intervention.

Pupil characteristics were found by Ysseldyke and

Marston (1982) to influence rater bias. Variables

contributing to bias include perceived physical attrac-

tiveness (Ross & Salvia, 1975); sex, socioeconomic status,

and reason for referral (Matusek & Oakland, 1979;

Ysseldyke & Algozzine, 1982; Ysseldyke, Algozzine, Regan,

& McGue, 1979, 1981); race (Florida Department of

Education Report on Public Schools, 1983; Sikes, 1975);

type of behavior displayed by the student (Algozzine,

1980); and the theoretical orientation of the observer

(Messick, 1980; Salvia & Ysseldyke, 1981).

Erickson (1974) and Shuller and McNamara (1976) found

naive observers' reports coincided with experimenter-

induced expectancies about problem behavior. After

observing decisions made by educators, Weinrott (1979);

Ysseldyke, Algozzine, and Richey (1982); and Algozzine and

Ysseldyke (1981) speculated that these judgments were

influenced by an expectancy factor created by the

situation itself. A more direct measure of expectation

was reported by Green and Brydon (1975). They found

teachers' attitudes were much more favorable toward

middle-income children than low-income children and that

43% of teachers' comments about black children were

negative as opposed to 17% of comments about white



Dunlap and Dillard (1980) investigated 164 school

principals' perceptions of the factors indicative of

emotional disturbance in children. The factor least

chosen by the principals was the one considered by the

researchers most predictive of emotional disturbance.

The vagueness of criteria for suspension in one

school district was investigated by Pisarra and Giblette

(1981). They found the criterion to be improper conduct,

which was not further defined. The researchers concluded

that a student reported for fighting would be suspended,

possibly suspended, or not suspended depending on the

individual administrator who had jurisdiction.

A few of the possible sources of error in nonsystem-

atic observation leading to inaccurate, biased, or

subjective judgments have been presented to suggest their

ubiquitous nature and the necessity of providing for

systematic observations in judgments leading to educa-

tional placement decisions.

Rating Scale Development

Designing a rating scale requires addressing four

major issues: (a) what to measure (parameters), (b) how

to measure (item content and format), (c) how to record

(response format), and (d) how to interpret the results

(statistical analysis). Literature pertaining to these

issues will be reviewed in this section.

In a frequently cited longitudinal study of deviant

behavior, Robins (1966) found the variables of type of

behavior, frequency of occurrences, and severity of

consequences to be indicators of future behavior pat-

terns. More recent studies supporting these criteria

include those of Kohn, Koretzky, and Haft (1979); Camp

(1980); Forness and Cantwell (1982); Gresham (1982);

Loeber (1982); and a United States Department of Justice

report (1982, p. 1).

The types of behavior to be measured by a rating

scale are determined by its authors, who must consider

content, sources, format, number, and order of presenta-

tion of the items to be included. Halo effects, or the

tendency to rate individuals holistically (Thorndike,

1920, p. 25; Willingham & Jones, 1958), were found by

Cooper (1981; 1983) to be reduced by having more specific

item content. Kreitler and Kreitler (1981) demonstrated

that items deemed irrelevant by raters tended to be scored

neutrally, thus limiting the derived information. Never-

theless, scales for rating disruptive behavior sometimes

include prosocial behavior content (Miller, 1980).

However, Deno (1979) suggested that to observe non-

disruptive behavior ignores the purpose of these ratings,

i.e., to determine whether inappropriate behaviors are

actually excessive. Schriesheim and Hill (1981) mixed

positive and negative statements on a questionnaire and

concluded that the effect was to reduce response validity.

Many scales do limit their items to behaviors that focus

on problem behavior (DiPrete, 1981; Duke, 1978; Governor's

Task Force on Disrupted Youth, 1974; Spivack & Swift, 1966;

Walker, 1979, p. 55), although not necessarily school

problems. Camp (1980) suggested that only school problems

directly observable by teachers and/or administrators be

included in scales for rating disruptive school behavior.

Logically, items taken from the setting in which the

ratings will be made best meet the criteria for relevant

content. Smith and Kendall (1963) used this premise in

devising Behavioral Expectation Scales (BES). Numerous

examples exist of the application of this premise in

education (Brown & Hammill, 1978; Camp, 1980; Duval County

School Board, 1979; Ross, Lacey, & Parton, 1965; Sherry,

1979; Spivack & Swift, 1977; Stott et al., 1975), mental

health (Kaufman, Swan, & Wood, 1979; Kohn et al., 1979;

Lachar & Gdowski, 1979; Miller, 1980) and industry (Vance,

Kuhnert, & Farr, 1978).

Item format refers to the various forms used in

presenting the information to which the rater is asked to

respond. It is often related to response format, which

refers to the methods of collecting information from the

raters. Response format literature will be presented in

the section covering the frequency characteristic.

Four types of item formats are currently in use in

behavioral rating scales. Behavioral Observation Scales

(BOS) describe the target behavior in specific terms that

require direct observation at the time the rating is made

(Latham & Wexley, 1977). Behaviorally Anchored Rating

Scales (BARS) provide a specific description of a behavior

for each successive rating point (anchor) of an item and

assess cumulative behavior over some time period (Smith &

Kendall, 1963). The Mixed Standard Scale (MSS) uses sev-

eral scales, with three levels of behavioral description

for each trait to be measured, and randomizes the order of

presentation (Blanz & Ghiselli, 1972).

Summated rating scales (Edwards, 1957), also referred to

as Likert (1932) scales (LT) or graphic rating scales

(Waters, Reardon, & Edwards, 1982), present for each item

one statement that may be specific or general. Likert

scales have been used with both direct and deferred obser-

vation. BOS scales are developed using summated rating

procedures (Likert, 1932), while BARS and MSS use the

Thurstone (Thurstone & Chave, 1929) scale development

process (Bruvold, 1969).

Conflicting conclusions have resulted from numerous

investigations into the advantages and disadvantages of

these scale formats. Fay and Latham (1982) found BOS to

be superior to BARS in rating video-taped behavior during

job interviews. However, Murphy, Martin, and Garcia

(1982) questioned the theoretical basis for BOS and found

evidence to suggest that BOS tapped recall for behavior

traits as well as immediate observation. Several studies

(Hom, DeNisi, Kinicki, & Bannister, 1982; Ivancevich,

1980; Keaveny & McGann, 1975; Lee, Malone, & Greco, 1981)

failed to find significant advantages for the BARS format

over summated rating scales or other alternative methods

(Jacobs, Kafry, & Zedeck, 1980; Kingstrom & Bass, 1981;

Schwab, Heneman, & DeCotiis, 1975).

In opposition to MSS theory, Finley, Osburn, Dubin,

and Jeanneret (1977) found evidence to suggest that an

obvious scale format may be superior to a hidden contin-

uum. Dickinson and Zellinger (1980) compared MSS, BARS,

and LT formats and found MSS produced less method bias,

BARS produced as much discriminant validity as MSS and

provided the best feedback to ratees, and LT scales were

easiest to understand and use. When Bruvold (1969) tested

the application of summated scales (Likert, 1932) and

successive interval scales (Edwards & Thurstone, 1952) to

the same data set, no significant differences were found

between the two scaling methods. According to Bernardin

and Smith (1981), one explanation may be that scale

constructors have deviated from the original procedures

(Smith & Kendall, 1963) in developing BARS instruments.

In addition to the Thurstone and Likert scaling

procedures, a third method is available. According to

Edwards (1957, p. 172), a Guttman (1944, 1945, 1947a,

1947b), or cumulative scale, requires that the construct

to be measured be unidimensional. Since disruptive school

behavior consists of many discrete behaviors, a Guttman

scale is not suitable for the instrument developed in this

study. At present, it appears that no item format is

superior enough to warrant relinquishing the clarity of

understanding and ease of use (Dickinson & Zellinger, 1980)

of the Likert scale, which presents one descriptive item

at a time to which the rater assigns a quantitative value

from a given range of values.
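
The Likert procedure described here, in which the rater assigns each item a value from a fixed range and the values are summed, can be sketched in a few lines. The five-point range and the example ratings are assumptions chosen for illustration.

```python
def summated_score(ratings, low=1, high=5):
    """Sum item ratings after checking each falls in the allowed range."""
    for rating in ratings:
        if not low <= rating <= high:
            raise ValueError(f"rating {rating} is outside {low}-{high}")
    return sum(ratings)


def possible_range(n_items, low=1, high=5):
    """Smallest and largest totals obtainable on an n-item scale."""
    return n_items * low, n_items * high


ratings = [1, 3, 5, 2, 4]             # hypothetical ratings on five items
score = summated_score(ratings)
bounds = possible_range(len(ratings))
```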

In determining the number of items to include in a

rating scale, some researchers (Quay & Peterson, 1967,

1979; Spivack & Swift, 1971; Stott, 1972) have relied on

factor analysis, using an arbitrarily chosen factor score

as the cut-off score. Edwards (1957) suggested an intui-

tive approach, utilizing 20-25 items that discriminate

between the groups at the extremes of the scale. A

comprehensive study (Achenbach & Edelbrock, 1978) of 18

rating scales found the range of items to be from 36 to

287 (median = 68 items; mean = 90.4 items). Of the 6

scales intended for use by teachers, 4 contained fewer

than 50 items and 2 between 50 and 100 items.

In a study of preferred scale length, Meredith (1981)

found half of the respondents preferred from 20 to 40

items, with 25 the median preferred length. In another

study, Meredith (1975) found a 52-item scale was judged

too long. Seidman and his associates (Seidman et al.,

1979) concluded their 46-item Teacher Behavior Description

Form was too cumbersome and reduced it to 23 items. While

item complexity is probably a factor (Meredith, 1981),

this review suggests a scale using no more than 40 items

would probably be acceptable to most teachers.

The ordering of items within a scale has been

suggested as a possible source of leniency error, halo

effect, and impaired discriminant validity (Blanz &

Ghiselli, 1972). Schriesheim and DeNisi (1980) and

Schriesheim (1981b) found that grouping according to

constructs rather than randomizing questionnaire items

resulted in impaired discriminant validity. Increased

leniency response bias was also found when items were

grouped (Schriesheim, 1981a).

Dickinson and Zellinger (1980) concluded that a

randomized scale contributed as much discriminant validity

as an ordered scale while displaying less method bias. In

a comparison of randomized and grouped scales, the

randomized scale engendered as much convergent and

discriminant validity (Waters et al., 1982). Thus, a

randomized order of presentation seems indicated.

Obtaining a meaningful measure of the frequency of

target behaviors requires attention to the variables of

response format, length of the observation period, and

type and number of raters. According to Tzeng (1983),

four response formats are most frequently cited in the

literature. They can be differentiated in terms of two

psychometric criteria. First, the existence of a neutral

response option defines the free choice format. Absence

of a neutral rating option defines the forced choice

format. Second, categorical (qualitative) ratings answer

the question "Does the ratee fit this category?" while

discriminatory (quantitative) ratings answer the question

"To what degree does the ratee fit?"

Tzeng (1983) criticized forced choice measures for

their omission of a valid response category, i.e., uncer-

tainty or neutrality of the raters' perceptions. King,

Hunter, and Schmidt (1980) concluded that a forced choice

format was ineffective in reducing rater halo. Dunnette

(1963, p. 96) reported that rater resistance to forced

choice formats led to their abandonment.

Categorical, or qualitative, formats used in

checklists cannot detect relative differences in degree

between two behaviors performed by the same ratee or

between the same behaviors among ratees (Tzeng, 1983).

Johnson, Smith, and Tucker (1982) found less response

skewness on a 5-point Likert discriminatory scale compared

to a yes/?/no categorical format. A zero-based discrimina-

tory, free choice response format seems most appropriate

(Likert, 1932). The absence of a behavior can be indicated

by the 0 position or, if present, the perceived frequency

can be indicated by choosing a value from the remainder of

the scale (Edwards, 1957).

The number of value choices permitted to the rater is

a critical issue. If few points are used, some information

may be lost, but the scales are less ambiguous for the

rater. If there are too many points, the discrimination

may be too fine for the rater to make. Albaum et al.

(1981) attempted to show superiority for a continuous

scale format, but concluded that equivalent aggregate

measurements were obtained from a 5-category, discrete

rating scale.

Likewise, Bernardin et al. (1976) and Bardo and

Yeager (1982) failed to find continuous scales superior to

discrete scales. The superiority of a 5-point, discrete

rating scale has been suggested by Cowen, Dorr, Clarfield,

Kreling, McWilliams, Pokracki, Pratt, Terrell, and Wilson

(1973); Lissitz and Green (1975); McKelvie (1978); Neumann

and Neumann (1981); and Broadbent, Cooper, Fitzgerald, and

Parkes (1982).

Conversely, Bardo and his associates (Bardo & Yeager,

1982; Bardo, Yeager, & Klingsporn, 1982) found obtained

means and variances closer to the expected values for

4-point scales over 5- and 7-point scales. These results

appear contrary to most other studies. Edwards (1957, pp.

150-151) gives Likert's original statistical rationale for

the use of a 5-point scale, anchored with the integers 0

through 4, and the summation of scores for individual

items as a total score for each ratee. Current research

provides no compelling evidence for departing from this

original format.
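
As a minimal sketch of the summated format described above, the Python fragment below adds 0 through 4 item ratings into one total score per ratee. Only the 0 through 4 anchoring and the summation come from the text; the function name summated_score and the sample ratings are hypothetical and are not taken from Likert (1932) or Edwards (1957).

```python
# Illustrative sketch only: summated (Likert-type) scoring on a 0-4 scale.
# The names and the sample data are assumptions, not part of any cited scale.
SCALE_MIN, SCALE_MAX = 0, 4


def summated_score(item_ratings):
    """Return one ratee's total score: the sum of the 0-4 item ratings."""
    for rating in item_ratings:
        if not isinstance(rating, int) or not SCALE_MIN <= rating <= SCALE_MAX:
            raise ValueError(f"rating {rating!r} is outside the 0-4 scale")
    return sum(item_ratings)


# Hypothetical ratings of one student on five items (0 = behavior absent).
ratings = [0, 3, 1, 4, 2]
total = summated_score(ratings)
```

Rejecting out-of-range values before summing keeps a mis-keyed rating from silently inflating a ratee's total.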

An anchor, e.g., "always," "sometimes," "never," is

usually associated with each scale point of a Likert-type

summated rating scale (Pohl, 1981). While a variety of

anchors has been used, the basis for the selection is

often not stated (Beatty, Schneier, & Beatty, 1977;

Broadbent et al., 1982; Camp, 1980; Cowen et al., 1973;

Hunter, Hunter, & Lopis, 1979; Kassin & Wrightsman, 1983;

Moses, 1974; Siegel, Dragovich, & Marholin, 1976; Solomon

& Kendall, 1977; White, 1977).

Several studies have investigated the assumptions

involved in the selection of one popular set of anchors:

always, often, occasionally, seldom, and never. Parducci

(1968), Chase (1969), and Pepper and Prytulak (1974) con-

cluded that the meanings of anchor words were influenced

by context. The effects of individual differences among

raters on their interpretations of anchor words were

demonstrated by Helson (1969) and Goocher (1965). These

studies suggested that the above anchors may not define

perceptually equal intervals along the rating continuum.

Four studies (Bass, Cascio, & O'Connor, 1974;

Schriesheim & Schriesheim, 1974, 1978; Spector, 1976) have

sought to select five anchor words that would be perceived

by raters as defining equally spaced rating intervals.

However, the most definitive study appears to be Pohl's

(1981) partial replication of the Bass et al. (1974) and

Schriesheim and Schriesheim (1974, 1978) studies. Using

responses from 164 college students, Pohl (1981) calcu-

lated the means and standard deviations for 39 expressions

of frequency.

Comparing these with the theoretical mean responses

for a 5-point equal interval scale, Pohl (1981) derived

the response set of always, quite often, sometimes, very

infrequently, and none of the time. The calculated mean

(26.71) for the mid-point term "sometimes" differed signi-

ficantly (p < .001) from the theoretical mean (29.05), but

nevertheless was the value closest to the optimal for a

5-point scale. The other calculated values were not

significantly different from the theoretical profile.

Thus, with the exception of the mid-point term, it appears

that the anchors produced by the Pohl (1981) study

adequately defined equal-appearing intervals on a 5-point

rating scale.

The length of the period for which behaviors are to

be rated has been little studied. For instance, the

manual for the Behavior Problem Checklist (Quay &

Peterson, 1967, 1975, 1979) does not specify for the rater

the inclusive time period to be considered in rating the

listed behaviors. The authors of the Devereux Elementary

School Behavior Rating Scales (Spivack & Swift, 1966)

instructed their raters to "consider recent and current

behavior" (p. 75). The same authors (Spivack & Swift,

1977), in developing the Hahnemann High School Behavior

Rating Scale, instructed teachers to base ratings on

behavior observed "over the past month" (p. 300).

A study (Hinton, Webster, & O'Neill, 1978) of hospi-

talized clinical patients used a 6-week time period. An

investigation (Beatty et al., 1977) of performance rating

in a data processing firm utilized three assessment

periods of two months each for a total of six months. In

a study of several response formats, Broadbent et al.

(1982) used a 6-month inclusive time period. However, in

none of these studies was a rationale given for selection

of the time period.

Two attempts at aggregating measures over specific

time periods have provided more precise instructions to

the rater. Cowen et al. (1973) defined each of five

rating points in terms of the inclusive time periods to be

considered when aggregating occurrences of behavior. For

example, the fourth anchor point, often, was defined as

"you have seen this behavior more often than once a week

but less often than daily" (p. 16). Camp (1980) used the

following scale:

Frequency of occurrence

0 Never observed
1 Once or more in semester
2 Once or more monthly
3 Once or more weekly
4 Once or more daily (p. 11)

The work of Seymour Epstein (1980), in support of the

stability over time of personality traits, bears directly

on the issue of aggregating behavior ratings over some

time period. Epstein (1980) stated that "stability can be

demonstrated . . . as long as the behavior in question is

averaged over a sufficient number of occurrences" (p.

791). In testing this hypothesis, Epstein conducted four

studies in which he used, among other types, ratings

performed in classrooms by teachers. Epstein suggested

aggregating behavior over subjects, stimulus situations,

time, and modes of measurement in order to establish

predictive reliability and validity (p. 797).

Ratings of middle and junior high school students by

their teachers in different courses would meet the

conditions of subjects and situations. Epstein (1980)

suggested that ratings at a single time following multiple

or extended observations represent an intuitive averaging

that has the "potential for producing highly replicable

and valid results" (p. 802). Harrop (1979) also

challenged the common assumptions (Fay & Latham, 1982;

Latham, Fay, & Saari, 1979) that coding of directly

observed behaviors produced superior results to aggregat-

ing behaviors over time.

A related concern in the assessment of school-related

behavior is selection of the time of year in which the

ratings will be made. Several studies (Cowen et al.,

1973; Epstein et al., 1983; Larrivee & Bourque, 1980)

recommend allowing student behavior and teacher percep-

tions to stabilize. Supporting these decisions are data

from the Texas Junior High School Study (Evertson,

Anderson, & Brophy, 1979).

Evertson and Veldman (1981) found a moderate but

steady increase in serious misbehavior over the course of

the school year and an increase in general misbehavior in

April. Evertson and Veldman (1981) concluded that short-

term studies should avoid ratings made either early or

late in the school year. The available literature seems

to suggest the feasibility of aggregating behaviors over

time periods specified in the rating scale instructions

and after teachers have had at least two months to observe

student behavior.

Deciding on the most appropriate type of rater to use

in assessing children's behavior has long been a problem.

In 1965, Ross et al. recognized the potential usefulness

of teacher ratings. Teachers' ratings have been found to

be more accurate than peer ratings of classroom behaviors

(Bailey, Bender, & Montgomery, 1983), other school

professionals' ratings (Bower & Lambert, 1971, p. 143; Fremont

& Wallbrown, 1979), and institutional child care workers'

ratings (Kohn et al., 1979) and to be equivalent to the

ratings obtained by a multidimensional scaling technique

(MDS) applied to classroom behavior (Sanson-Fisher &

Mulligan, 1977).

A number of researchers have found support for

teacher ratings as appropriate measures of general class-

room behaviors (Solomon & Kendall, 1977), social behavior

(Loranger, Lacroix, & Kaley, 1982), assertive vs. aggres-

sive behavior (Roberts & Jenkins, 1982), acting out

behavior (Walker, 1970), and behavior that would likely

result in referrals for exceptional child education (Dean,

1980; Epstein et al., 1983; Horne & Larrivee, 1979; Lahey,

Green, & Forehand, 1980; McKinney & Forman, 1982; Roberts

et al., 1981).

Not all studies have yielded positive results. Morris

and Arrant (1978) found that regular classroom teachers

tended to see more behavior problems in students referred

for evaluation than did school psychologists. A study

(Kazdin, Esveldt-Dawson, & Loar, 1983) of psychiatric

inpatient children found extra-class raters' evaluations

of overt classroom behaviors to correspond more closely to

direct observational data than did teachers' ratings.

However, teachers were more accurate than the extra-class

raters in identifying hyperactive children using a behav-

ior checklist. Overall, the evidence suggests strong

support for the use of teachers as raters of classroom

behavior.


An associated issue is the use of multiple raters to

increase reliability and reduce halo effect (Epstein,

1980). Ratings of students commonly are obtained from all

teachers having direct classroom contact (Linton & Chavez,

1979; Wixson, 1980). This procedure could result in as

few as one or perhaps as many as seven ratings, depending

on the grade level and local practice.

More recent research efforts have focused on

empirically determining the most effective number of

raters. Prinz and Kent (1978) increased from 1 to 4 the

number of raters of parent-adolescent interactions in a

clinical setting and reported increased reliabilities.

Both reliability and concurrent validity of clinical

judgments were shown to increase when the number of judges

was increased from one to ten (Horowitz, Inouye, &

Siegelman, 1979). Strahan (1980) extended the Horowitz et

al. (1979) study and concluded that after using four

raters, adding additional ones contributed little to

measurement effectiveness. Another study (Green, Bigelow,

O'Brien, Stahl, & Wyatt, 1977) of inpatient clinical

behaviors found little improvement when using more than

four raters.

Although in general agreement with the above studies,

a cautionary note was added by Kenny and Berman (1980),

who pointed out that if raters are completely unreliable,

increasing their numbers will not increase reliability.

The number of teachers usually available in a middle or

junior high school to serve as raters would appear to be

adequate to contribute to both improved reliability and

concurrent validity.
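
The diminishing return from additional raters can be illustrated with the Spearman-Brown formula, a standard psychometric result that is not drawn from the studies cited above. The sketch below assumes interchangeable (parallel) raters and an arbitrary single-rater reliability of .50; the function name is hypothetical.

```python
# Illustrative sketch only: Spearman-Brown projection of the reliability of
# a composite of k parallel raters. The .50 starting value is an assumption
# chosen for illustration, not a figure reported in the literature reviewed.
def spearman_brown(single_rater_reliability, n_raters):
    """Projected reliability of the pooled ratings of n_raters raters."""
    if n_raters < 1:
        raise ValueError("n_raters must be at least 1")
    r = single_rater_reliability
    return n_raters * r / (1 + (n_raters - 1) * r)


# Projected reliability for 1, 2, 4, 7, and 10 raters.
projections = {k: spearman_brown(0.50, k) for k in (1, 2, 4, 7, 10)}

# With completely unreliable raters, adding raters changes nothing.
unreliable = spearman_brown(0.0, 10)
```

Under these assumptions, four raters yield a projected reliability of about .80 and ten raters about .91, so most of the gain is realized by the fourth rater. A single-rater reliability of zero remains zero for any number of raters, which is the caution raised by Kenny and Berman (1980).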

Various classifications of severity have been adopted

in school settings. Student conduct codes typically use

some method of indicating seriousness of offenses, such as

"serious misconduct" (Pinellas County Schools, 1983, p. 7)

and "minor, intermediate, and major" (Duval County Public

Schools, 1980, p. 16). Researchers (Pisarra & Giblette,

1981) have used categories emphasizing the targets of the

behavior (e.g., offenses against persons, offenses against

state laws). Teachers often focus on specific behaviors

(e.g., use of drugs, striking teacher) (Camp, 1981) and

administrators have used a combination of both (National

School Public Relations Association, 1973).

There is little consensus on the number of levels to

be used in assigning degrees of severity. Taylor et al.

(1979) used levels ranging from 1 (not very severe) to 4

(extremely severe). Camp (1980) used 0 for "not con-

cerned" through 4 for "extremely concerned." In an

earlier study, Moses (1974) used three levels, 1 (mild), 2

(moderate), and 3 (severe) in asking mental health and

criminal justice professionals to rate a list of problem

behaviors. To use too many levels may imply a degree of

confidence in discrimination not supported by the subjec-

tive nature of such ratings.

Not all rating scale authors and researchers accept

the necessity for including a severity rating (Searls,

Isett, & Bowders, 1981; Spivack & Swift, 1977). Even

when, as in the Behavior Problem Checklist (Quay &

Peterson, 1967, 1975, 1979), a severity factor is provided

for, the author does not always recommend its use. How-

ever, at the practitioners' level the degree of severity

of behaviors is a major concern.

Algozzine (1979), using items characteristic of

several behavior rating scales, developed the Disturbing

Behavior Checklist which asks teachers to rate the degree

of disturbance they experience as a result of different

student behaviors. This suggests a consequence to the

teacher based not on the frequency of the behavior, but on

the type and severity. After noting irregularities and

lower reliabilities, Taylor et al. (1979) had teachers

rate for severity 26 items of Part Two of the Adaptive

Behavior Scale (ABS) (Nihira, Foster, Shellhaas, & Leland,

1969). Teachers were able both to categorize behaviors

and rate them in terms of severity, leading Taylor et al.

(1979) to conclude that this additional information would

be useful in refining the scale and adding to its clinical


Inasmuch as the instrument developed in this study is

intended to have locally developed norms, the statistical

techniques used in the norming procedure and the comparing

of individual scores to the derived local norms are not

complex. While some more recent studies have focused on

problems associated with such common procedures as the

calculation of measures of central tendency (Mosteller &

Tukey, 1977; Stavig, 1978, 1982), many researchers

continue to rely on descriptive statistics utilizing raw

scores, arithmetic means, standard deviations, and stan-

dard scores.

White (1977) compared individual students' scores on

classroom behavior to the computed mean score for five

classes of "Follow Through" program students in order to

identify immature students. In a business setting, Fay

and Latham (1982) used means and standard deviations in

comparing scores obtained using two different rating

methods. A study (Lyness & Cornelius, 1982) comparing

judgment strategies and ratings of college instructors

supported the use of a rating scale composed of discrete

items, with an overall rating calculated by weighting the

items and summing the weighted scores. To obtain mean

sub-scores for subjects, Algozzine (1980) summed scores

across the items defining each of four factors of

disturbing behaviors and used means and standard devia-

tions in analyzing the results.

The cited studies seem to support the use of descrip-

tive statistics in both obtaining individual scores (i.e.,

sum of weighted ratings) and deriving a local norm (i.e.,

mean) from ratings of a representative sample of a total

population. Salvia and Ysseldyke (1981, chap. 4) offer

definitions of common terms for descriptive statistics

applied to assessment.
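
The descriptive statistics named above can be sketched in a few lines of Python. The fragment computes a weighted total for one ratee, a local norm (mean and standard deviation) from a sample of totals, and a standard score against that norm. All function names, weights, and sample values are hypothetical, and the use of the sample standard deviation is an assumption rather than a procedure stated in the studies cited.

```python
# Illustrative sketch only: weighted totals, a local norm, and z scores.
# Names, weights, and data are assumptions made for this example.
from statistics import mean, stdev


def weighted_total(ratings, weights):
    """Sum of weighted item ratings for one ratee."""
    if len(ratings) != len(weights):
        raise ValueError("ratings and weights must have the same length")
    return sum(r * w for r, w in zip(ratings, weights))


def local_norm(sample_totals):
    """Mean and sample standard deviation of totals from a local sample."""
    return mean(sample_totals), stdev(sample_totals)


def standard_score(total, norm_mean, norm_sd):
    """Distance of one total from the local mean, in SD units (z score)."""
    if norm_sd == 0:
        raise ValueError("standard deviation is zero; z is undefined")
    return (total - norm_mean) / norm_sd


# Hypothetical totals from a representative local sample of students.
sample_totals = [10, 12, 14, 16, 18]
norm_mean, norm_sd = local_norm(sample_totals)

# Standard score of a hypothetical student whose total is 20.
z = standard_score(20, norm_mean, norm_sd)
```

Expressing an individual total as a standard score lets the same raw score be interpreted differently in schools whose local means and variability differ.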

Psychometric Properties of Rating Scales

Historically, rating techniques have aroused contro-

versy over estimations of validity and reliability (Ryan,

1958). Validity is the relevance of the scale to the

variables being measured. Most sources recognize three

types of validity, i.e., content, criterion-related or

concurrent, and construct (American Psychological

Association, 1966; Cronbach, 1970; Kerlinger, 1972).

Reliability is the accuracy or precision of a measuring

instrument and has usually been classified as

temporal, inter-rater, or internal (Cronbach, 1970).

However, investigations (Epstein, 1980) into the

effects of situations on behavior have recently introduced

a fourth consideration, situational reliability, or the

consistency of behavior across settings. The development

of norms against which to compare results obtained from

individual administrations of rating scales is another

area of active investigation (Mendelsohn & Erdwins, 1978;

Messick, 1980). Research on these issues is reviewed in

this section.

Content validity refers to the relevance and repre-

sentativeness of the items used in construction of a scale

(Epstein, 1980). Often, this is determined by obtaining

judgments from experts not otherwise involved in the scale

construction (DiStefano, Pryer, & Erffmeyer, 1983; Jones

et al., 1975, p. 83; Lawshe, 1975; Thorne, 1978).

Kreitler and Kreitler (1981) found that item content

determined the rater's perception of the central theme of

an instrument. Items not perceived as relevant to the

central theme tended to be given neutral responses, thus

limiting the information contributed by the rater.

Criterion-related validity is studied by comparing

scores obtained from an instrument with one or more

external criteria of the variable being measured

(Kerlinger, 1972, p. 459). Criterion-related validity

encompasses both concurrent and predictive qualities

(Epstein, 1980). The comparison of scale results with an

independent judgment or diagnosis of a subject is an

example of an attempt at estimating criterion-related

validity. If the judgment or diagnosis confirms the scale

indications, the inference may be drawn that the scale is

in agreement with the concurrent diagnosis and is

predictive that others given a similar rating would also

be diagnosed similarly (Kohn et al., 1979; Mendelsohn &

Erdwins, 1978).

In one validation study, Harris, Kreil, and Orpet

(1977) used the school principal, guidance counselor, and

two teachers as judges in selecting both disruptive and

prosocial students for rating by the Behavior Coding

System (Patterson, Ray, Shaw, & Cobb, 1969). In develop-

ing the Pittsburgh Adjustment Survey Scales (Ross et al.,

1965), school principals were used to nominate adjusted,

withdrawn, and aggressive students for rating by their

teachers and scale results were compared with these


According to Kerlinger (1972, p. 461) and Cronbach

(1970), the significance of construct validity is its

concern with the theory behind the variable being

measured. Guion (1977) argues that construct validity

integrates both content and criterion considerations.

Likewise, the usefulness of content and concurrent

validity is questioned by Sanson-Fisher and Mulligan

(1977) and construct validity is supported.

A definition of construct validity as the process of

ascribing meaning to scores is offered by Stenner and

Smith (1982). Messick (1980) broadens the concept of

validity to include both test interpretation and test

use. Messick (1980) describes construct validity as

"interpretive meaningfulness" (p. 1015) and suggests that

it rests on four bases: convergent and discriminant

validity, ethical interpretation, relevance and utility

for the specific application, and the consequences follow-

ing use of the instrument.

To be interpretable, a rating scale must be reli-

able. That is, a scale must produce similar results when

applied to the same person over several administrations,

the instrument must be relatively free of errors of mea-

surement, and the results must closely approximate the

"true" value of the variable for the person being rated

(Cronbach, 1970; Kerlinger, 1972).

Typically, test-retest data are compiled for varying

time periods between administrations. The correlation

between the two obtained scores is used to justify esti-

mations of temporal stability and, in the case of rating

scales, intra-rater reliability. Examples of reported

test-retest intervals include one week (Duval County

School Board, 1979; Quay, 1977), two weeks (Mendelsohn &

Erdwins, 1978; Russell, Lankford, & Grinnell, 1981) and

two years (Quay, 1977). However, Masterson (1968) pointed

out that low test-retest correlation coefficients may

reflect the transitory nature of the measured variable and

suggested high coefficients of internal consistency may be

more indicative of reliability for some instruments.
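
A test-retest coefficient of the kind described above is a correlation between scores from two administrations. The following sketch computes a Pearson product-moment correlation; the choice of Pearson's r, the function name, and the sample totals are assumptions, since the studies cited do not all specify the coefficient used.

```python
# Illustrative sketch only: test-retest correlation between two
# administrations of a scale. Data and names are hypothetical.
from math import sqrt


def pearson_r(first, second):
    """Pearson correlation between paired scores from two administrations."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equal-length lists of at least 2 scores")
    n = len(first)
    mean_1, mean_2 = sum(first) / n, sum(second) / n
    cross = sum((a - mean_1) * (b - mean_2) for a, b in zip(first, second))
    ss_1 = sum((a - mean_1) ** 2 for a in first)
    ss_2 = sum((b - mean_2) ** 2 for b in second)
    if ss_1 == 0 or ss_2 == 0:
        raise ValueError("correlation is undefined when scores are constant")
    return cross / sqrt(ss_1 * ss_2)


# Hypothetical totals for five students rated twice, two weeks apart.
time_1 = [12, 30, 7, 22, 15]
time_2 = [14, 27, 9, 20, 18]
retest_r = pearson_r(time_1, time_2)
```

A low coefficient from such a calculation does not by itself separate an unstable instrument from a behavior that actually changed between administrations, which is the point made by Masterson (1968).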

Internal consistency has often been estimated by

inter-item and item-total analysis (Edwards, 1957;

Kerlinger, 1972). In these procedures, an individual's

rating on one item is compared with the rating on all

other items or with the total score from the scale or

subscale to estimate the degree to which each item is

similar to the other items. Item analysis may be impor-

tant in reducing errors of measurement attributable to the

composition of the instrument (Benson & Clark, 1982).

However, internal consistency may not provide good reliabi-

lity estimation for a rating scale assessing constructs

comprised of many discrete behaviors (Kerlinger, 1972).
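
The item-total procedure can be sketched as follows. Each item is correlated with the total of the remaining items, a "corrected" variant chosen here so that an item is not correlated with itself; that choice, the function names, and the sample ratings are assumptions and are not taken from Edwards (1957) or Kerlinger (1972).

```python
# Illustrative sketch only: corrected item-total correlations.
# Rows are ratees, columns are items; all data are hypothetical.
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when scores are constant")
    return cross / sqrt(ss_x * ss_y)


def item_total_correlations(ratings):
    """Correlate each item with the sum of all other items, per ratee."""
    n_items = len(ratings[0])
    results = []
    for j in range(n_items):
        item = [row[j] for row in ratings]
        rest = [sum(row) - row[j] for row in ratings]
        results.append(pearson_r(item, rest))
    return results


# Hypothetical 0-4 ratings of four students on three items.
ratings = [
    [0, 1, 0],
    [1, 1, 2],
    [2, 3, 2],
    [4, 3, 4],
]
correlations = item_total_correlations(ratings)
```

Items with low or negative values are candidates for revision or removal, although, as noted above, low values are to be expected when a scale deliberately samples many discrete behaviors.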

Some research (Rosenthal & Jacobson, 1968; Sulzbacher,

1973) into observer bias has suggested that beliefs about

ratees may affect rater perceptions and, consequently, the

reliability of the ratings. In three studies (O'Leary &

Kent, 1973; Shuller & McNamara, 1976; Siegel et al., 1976)

of disruptive classroom behavior, while experimentally

introduced biasing information was found to influence

global ratings, it had no significant effect upon results

obtained from behaviorally stated scales. Siegel et al.

(1976) suggested that behaviorally specific items reduce

bias and improve inter-rater and intra-rater reliability.

The degrees of agreement among different raters on

measures of the same subjects at the same time in the same

setting have been used to indicate the inter-rater

reliability of an instrument (Cronbach, 1970). Also, the

agreement among different raters of subjects in the same

settings at different times has been used for the same

purpose (Cronbach, 1970). In middle and junior high

schools, these conditions do not usually occur naturally.

Fortunately, investigations of trait consistency in

subjects (Abikoff, Gittelman, & Klein, 1980; Epstein,

1980; Mischel, 1969) have encouraged the comparisons of

ratings by different raters over the same elapsed time

periods, but for different settings and situations,

conditions which do occur naturally in the secondary

school setting.

Epstein (1980) concluded that subjects do manifest

trait consistency, if aggregation techniques are applied

in assessing behaviors. Epstein (1980) suggested aggre-

gation over raters (e.g., teachers), situations (e.g.,

classrooms), occasions (e.g., class periods), and measures

(e.g., disciplinary records). Epstein further suggested

that when single ratings are made after extended periods

of observation, these ratings are similar to aggregated

ratings in that they represent an intuitive averaging of

ratings over many observations. Thus, reliability may be

improved by combining different teachers' ratings of the

same student over the same portion of the school year.

According to Cooper (1981), perhaps the most ubiqui-

tous challenge to inter-rater reliability is halo error

(Thorndike, 1920) or the tendency of a rater to allow

overall impressions of an individual to influence judgment

of specific areas of behavior (Holzbach, 1978). Attempts

(Landy, Vance, Barnes-Farrell, & Steele, 1980; Landy,

Vance, & Barnes-Farrell, 1982) to statistically control

for halo effects have apparently not succeeded (Harvey,

1982; Hulin, 1982; Mossholder & Giles, 1983; Murphy, 1982).

One exploration of ways to reduce halo error resulted in a

restatement of classic advice: do not use rating cate-

gories that are imprecise and overlapping (Cooper, 1983).

In an extensive review of the literature, Cooper (1981)

concluded that of nine methods currently employed to reduce

halo effect, all leave residual illusory halo.

Studies of variables affecting reliability have iden-

tified several other challenges to the accuracy of school

behavior ratings. The sex of the teacher was found in two

studies (Levine, 1977; Silvern, 1978) to be correlated

with ratings of classroom behavior, with male teachers

consistently reporting lower levels of disruptive behav-

ior. Teachers' ratings seemed to be influenced by special

education labels in one study (Fogel & Nelson, 1983). In

two studies (Marwit, 1982; Marwit, Marwit, & Walker, 1978),

perceived unattractiveness of students has been shown to

correlate with higher ratings of disruptive behavior.

While challenges to reliability from a variety of

sources have been observed, several studies (Bernardin &

Pence, 1980; Fay & Latham, 1982; Latham, Wexley, & Pursell,

1975; Madle, Neisworth, & Kurtz, 1980; Pursell, Dossett, &

Latham, 1980) have suggested that training in the use of

rating scales may be effective in reducing errors of

measurement. This review of studies of validity and

reliability has identified some sources of and counter-

measures for errors of measurement. Next, studies of the

variables affecting the norming of rating scales will be


Several writers have shown concern for the relation-

ship between behavior and the context in which it occurs.

The social value of a test, according to Messick (1980),

is determined by its instrumental value for a particular

setting. Willems (1975) stated that few phenomena have

meaning independent of the context in which they occur.

Likewise, researchers were cautioned by Dickinson (1978)

to evaluate behavior only in an environmental context.

Epstein (1980) referred to the "extreme situational

specificity of behavior" (p. 794) and warned that experi-

ments conducted in a single situation cannot be relied on

to generalize across even minor variations in stimulus

conditions. Others supporting this psychosocial approach

include Sherif (1954); Erickson (1963), quoted in Tinto,

Paclilio, and Cullen (1978); Salvia and Ysseldyke (1981,

p. 378); and Zammuto, London, and Rowland (1982).

Schools were described by Garbarino (1980) as "con-

texts for behavior and development" (p. 19). Some of the

characteristics of schools which may influence levels and

interpretations of disruptive behavior are size of enroll-

ment (DiPrete, 1981, p. 86; Garbarino, 1980; Kowalski,

Adams, & Gundlach, 1983); public or private administration

(DiPrete, 1981, p. 81); control orientation (e.g., human-

istic vs. custodial) (Deibert & Hoy, 1977; Gaynor &

Gaynor, 1976); degree of person-environment fit (Kulka,

Klingel, & Mann, 1980); traditional vs. open classrooms

(Solomon & Kendall, 1975); length of faculty tenure

(DiPrete, 1981, p. 107); socioeconomic level of the host

community (Kowalski et al., 1983); and region of the

country (DiPrete, 1981, p. xx; Kowalski et al., 1983).

Researchers advocating the use of local norms for behav-

ioral measurements include Fremont and Wallbrown (1979);

Mendelsohn and Erdwins (1978); Quay and Peterson (1967);

Smith (1976); Walker and Hops (1976); and Wallbrown,

Wallbrown, and Blaha (1976).

The effects of sex, age, race, and socioeconomic

status on ratings of disruptive behavior have been

frequently studied. The types of disruptive behavior

displayed in both educational and clinical settings have

not been found to be significantly different for the

variables of sex (Behar & Stewart, 1984; Epstein et al.,

1983; Morris & Arrant, 1978; Stott et al., 1975, p. 166),

age (Behar & Stewart, 1984; Ghodsian, Fogelman, Lambert, &

Tibbenham, 1980; Stott et al., 1975, p. 83), race (Gajar &

Hale, 1982), or socioeconomic status (Behar & Stewart,

1984; Stott et al., 1975, p. 97). Thus, providing for

separate norms for these variables seems unnecessary in

any scale rating only disruptive behaviors.

Uses of Behavior Rating Scales

Bailey et al. (1983) supported the use of rating

scales in program planning and evaluation. Likewise, the

lack of effective measurement devices was seen by Hirshoren

and Heller (1979) as limiting the evaluation of program

effectiveness. Mesinger (1982) called for the use of

appropriate measurement devices in providing services for

deviant youth within the public school setting. Cooper

(1983), Peed and Pinsker (1978), and Beatty et al. (1977)

have suggested providing rating scale results to ratees to

influence behavior changes. Using rating scales to pro-

vide a standardized description of behavioral problems has

been suggested (Edelbrock & Achenbach, 1978).

In a study comparing resource room delivery models,

Wixson (1980) used a behavior rating scale in developing

and evaluating intervention programs for various cate-

gories of handicapped children. Morton Bortner (Buros,

1978, p. 493), reviewing the AAMD Adaptive Behavior Scale,

pointed out its usefulness for evaluating the progress of

individuals and evaluating program goals. The Duval

County School Board (1979) used a locally constructed

behavior checklist to evaluate their grant-funded program

for disruptive students.

Several programs which retained students in their

regular classrooms have used behavior scales for evalua-

tion purposes. Walker and Holland (1979) and Linton and

Chavez (1979) developed and used rating scales for this

purpose in elementary and junior high schools, respec-

tively. The Hahnemann High School Behavior Rating Scale

(Spivack & Swift, 1977) was intended to provide teachers

with a practical means of describing disruptive classroom

behavior to parents and other school personnel. In a

study of junior high school truants, Nielsen and Gerber

(1979) used a behavior rating scale to match school inter-

ventions with student needs.

A quantitative measure of disruptive behavior was

developed by Mendelsohn and Erdwins (1978) to assist

community agencies in devising programs for expelled

students. Haskell (1979) developed a method of quantify-

ing clinical behavior in institutional settings to provide

a basis for planning individual programs and evaluating

results. McSweeney and Trout (1979) used the Jesness

Behavior Checklist (Jesness, 1970) to evaluate the social

progress of deviant children in a wilderness camp pro-

gram. Five reasons for obtaining measures of students

are offered by Salvia and Ysseldyke (1981): "Screening,

placement, program planning, program evaluation, and

assessment of individual programs" (p. 14). Behavior

rating scales have been used to obtain measures for each

of these needs.

Summary


This review has identified five approaches in the

literature to define disruptive school behavior (DSB). A

conceptualization of DSB based on the interactions of

students, teachers, and administrators within the school

setting was suggested as most relevant for the development

of an instrument to quantify DSB.

Psychometric challenges to the use of rating scales

for identifying behavioral characteristics were consid-

ered. Research was cited to suggest that teachers using

reliable and valid scales could accurately identify DSB.

Nineteen instruments available for assessing problem

behaviors were reviewed. None appeared to meet the

psychometric criteria required for educational placement

decisions. Possible sources of error in nonsystematic

observations were presented with the suggestion that

inaccurate, biased, or subjective judgments may result.

Type, frequency, and severity of behaviors were

related to item content, item format, and response

format. Support was found for the inclusion of these

measurement parameters in assessing DSB. The use of

descriptive statistics in current research for obtaining

individual behavior ratings and deriving local norms was


A review of the sources of error in measurement was

conducted and counter-measures for improving validity and

reliability estimations were suggested. A number of

variables affecting the norming of rating scales were

investigated. Research evidence rejected separate norms

based on gender, race, or socioeconomic status. The

effective use of behavioral instruments in a variety of

settings was documented, suggesting the suitability of

such a device for describing students who display DSB.



The purpose of this study was to develop and validate

an instrument, the Disruptive Student Behavior Scale

(DSBS). The DSBS is intended to be used to assess

quantitatively the disruptive school behaviors of students

referred for placement in either special education or

alternative education programs. This chapter presents the

research questions, defines the target population,

presents a plan for constructing the scale, describes

procedures for a pilot study, details statistical tests

and procedures for the data analyses, and discusses

possible limitations of the study.

Research Questions

1. Does the content of the DSBS represent behaviors

recognized and accepted by educators as occurring

in and disruptive to the school environment?

2. In the judgment of experts, does the DSBS contain

an equitable distribution of items descriptive of

the underlying theoretical constructs that

identify disruptive students and discriminate them

from non-disruptive students?

3. To what degree does the DSBS demonstrate

criterion, convergent, and discriminant validity?

4. To what degree does the DSBS provide ratings which

are stable over time?

Construction of the DSBS

The following plan is a modification of a suggested

procedure (Benson & Clark, 1982) for rating scale construc-

tion. A review of disruptive school behavior (DSB)

literature provided a research base for defining the

constructs comprising DSB. A total of 303 descriptive

items and 22 categories were found in 36 studies. After

eliminating duplications and items not pertaining directly

to DSB, 56 items remained. Combining similar categories

resulted in a total of 13 potential categories of behav-

iors associated with DSB.

In a project conducted by the Research Committee of

the Psychological Services Department of the Duval County,

Florida School District, the 56 items and 13 categories

were presented to 16 teachers of middle school students

enrolled in a behavior management program for disruptive

students. The rating group was composed of 10 females and

6 males, and all had at least two years' full-time teaching

experience. Group members were instructed to assign each

item to one or more of the 13 categories. Instructions

and results are reproduced in Appendix A.

The judges' ratings and comments resulted in the

retention of 10 categories, which were considered to be

one set of constructs which could be used in identifying

DSB. A tentative definition of each construct was

formulated using the descriptive items assigned by the

teachers. The inclusiveness of these derived constructs

was then verified. A frequency

distribution was prepared for all of the conduct code

violations reported for chronic violators in a sample of

Duval County, Florida, elementary, middle/junior high,

high, and alternative schools (Moses, 1981). Of 7717

behavior violations, 7686, or 99.6%, were included within

the definitions of the proposed constructs.

The items, as taken from the studies and used in

developing the constructs, were not considered specific

enough for use in a quantitative rating scale. However, a

readily available pool of potential items was located in

the disciplinary referral records of an inner-city junior

high school in a metropolitan Florida school district.

Verbatim transcriptions were made of the reasons recorded

on the referral forms by teachers when sending students to

the deans. All active folders for the 1980-1981 school

year were reviewed. A total of 395 items, including dupli-

cations, were recorded without regard for gender, age,

race, or grade level. Combining obvious duplications and

similarities resulted in 66 items (Appendix B) to be

considered for inclusion in a scale for rating DSB.

All of the 66 items were then presented individually

to six male and five female volunteers, experienced

secondary school regular classroom teachers from suburban

Florida middle and junior high schools. Instructions are

reproduced in Appendix C. These teachers were asked to

verify the specificity of the items and edit those consid-

ered ambiguous. This review yielded 40 items for possible

use on an instrument. These items were then stated in the

past tense to reflect the intention to measure students'

past behavior (Appendix D). This preliminary study indi-

cated the feasibility of using research-based constructs

and teacher-generated items as the basis for a rating scale

for disruptive school behavior.

In order to reduce halo and leniency errors, it has

been suggested (Blanz & Ghiselli, 1972) that a scale be

arranged so that items from the same construct will not be

contiguous. Accordingly, items were initially randomly

ordered, then inspected and rearranged to meet this criterion

(Appendix M). Research studies previously cited

suggested that in addition to specifying the type, a

quantitative measure of disruptive behavior must provide

for rating both frequency and severity. Frequency rating

was provided for by the choice of response format selected

for the instrument. The literature review suggested the

suitability of a 5-point, equal interval, summated rating

scale (Likert, 1932) using the following anchors:

0 None of the time

1 Very infrequently

2 Sometimes

3 Quite often

4 Always (Pohl, 1981, p. 239)

The rating scale (Appendix F) utilized this response format.


The severity rating for each scale item was estab-

lished with assistance from the faculty, staff, and

administrators of two alternative schools located in two

metropolitan Florida school districts. From their experi-

ence with disruptive students, these educators were

particularly aware of the consequences for students who

display DSB. Respondents were selected from volunteers,

including the principal and assistant principal, school

psychologist, social worker, educational evaluator, and

faculty members. This group contained both males and

females in approximately equal numbers. All had more than

two years' experience working with disruptive students.

The school experience may be conceptualized as influ-

encing the social, personal, and academic domains of a

student's life. Each of these domains may be subdivided

to facilitate closer study of the consequences of the

school experience (See Table 1). One way for educators to

assign a severity factor to a disruptive activity is to

have them estimate which domains of student life would

likely be affected adversely by that particular behavior.

Instructions for this procedure are reproduced in Appendix

G. The number of adverse consequences assigned by at

least 50% of the raters, divided by a constant of three to

keep the numbers small and with fractions rounded up to

the next whole number, gave a severity rating of 1, 2, or

3 to each of the items on the rating scale. Results are

reported in Chapter Four.
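
The severity calculation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure reproduced in Appendix G: the function name, the domain labels, and the example tallies are hypothetical, and each rater is assumed to return the set of domains judged to be adversely affected by one item.

```python
import math
from collections import Counter

def severity_rating(rater_domains, divisor=3):
    """Return the severity factor for one item.

    rater_domains: one set of domain labels per rater (hypothetical
    labels; Table 1 defines the actual subdivisions).
    """
    n_raters = len(rater_domains)
    counts = Counter(d for domains in rater_domains for d in domains)
    # A domain counts as an adverse consequence only if at least
    # 50% of the raters assigned the item to it.
    agreed = sum(1 for c in counts.values() if 2 * c >= n_raters)
    # Divide by the constant of three and round fractions up.
    return math.ceil(agreed / divisor)

# Hypothetical tallies from four raters for a single item.
example = [
    {"peer relations", "grades"},
    {"peer relations", "grades", "self-concept"},
    {"peer relations"},
    {"grades", "self-concept", "attendance"},
]
print(severity_rating(example))  # three agreed domains -> 1
```

An item with no domain reaching 50% agreement would receive 0 under this sketch, which falls outside the 1 to 3 range reported in the text; how such an item was handled is not stated in the source.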

A scoring template incorporating the severity factor

was prepared for the DSBS (Appendix H). This template has

five holes, one corresponding to each possible frequency

rating (i.e., 0, 1, 2, 3, 4) for each rating scale item.

The rater's mark (X) indicating the frequency rating assigned

is read through the holes. Above each hole is printed

a number which is the product of that frequency rating and

the previously determined severity factor for that item.

Thus, the weighted score for that item may be read by the

scorer directly from the scoring template and recorded on

the DSBS rating form beside each item.
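
The arithmetic performed by the scoring template may be illustrated with a short sketch. The function names and the three-item example are hypothetical; only the rule that each weighted item score is the product of the frequency rating (0 to 4) and the severity factor (1 to 3) is taken from the text.

```python
FREQUENCIES = (0, 1, 2, 3, 4)  # "None of the time" through "Always"

def template_row(severity):
    """Numbers printed above the five holes for one item."""
    return [f * severity for f in FREQUENCIES]

def form_score(marks, severities):
    """Sum the weighted item scores for one completed form.

    marks: frequency rating (0-4) marked by the rater for each item
    severities: severity factor (1-3) for each item
    """
    if len(marks) != len(severities):
        raise ValueError("one mark is needed per item")
    return sum(template_row(s)[m] for m, s in zip(marks, severities))

print(template_row(3))                   # [0, 3, 6, 9, 12]
print(form_score([0, 2, 4], [1, 3, 2]))  # 0 + 6 + 8 = 14
```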

These item scores were then added to give the page

score and form score (see Appendix F) and recorded onto a

summary sheet (Appendix I). The Summary of Teacher

Ratings form (Appendix I) contains for each student the

DSBS rating; the deviation, in z-scores, from the local

norm; a comparison of ratings by each teacher; and the

basis for constructing a DSBS profile for prescriptive use

(Appendix J). These data are intended to provide local

school authorities with criteria for estimating the devia-

tion of any student's rating from the local DSB norm and

are intended to assist in determining a student's need for

an intervention program. The DSBS is normed locally

within each school district. Norms from this study are

reported in Chapter Four for information, but are not to

be used as criteria for judgments about students in other


Validation of the DSBS

To assure content validity, the 40 items and the 10

constructs developed from this preliminary study were

presented to a group of 24 teachers with instructions to

assign each item to a construct category or to no category.

The instructions are reproduced in Appendix E. Each judge

had at least two years of regular classroom teaching

experience in a middle or junior high school. Thirteen

male and 11 female teachers participated. The judges were

also asked to verify the specificity of the retained items

and reword those considered ambiguous. Revisions were

made as suggested and confirmed by a follow-up study using

another group of eight similarly-qualified teachers.

As described in the field study section, at a Florida

middle school a criterion group of disruptive students was

selected by nomination by seven non-teaching school person-

nel, including two deans, three guidance counselors, and

two administrators. Students in the disruptive group were

ranked numerically on a continuum from non- to severely

disruptive, based on subjective ratings from all the

nominating personnel. DSBS ratings from teachers were

compared to these subjective ratings to determine how well

high DSBS teacher ratings correlated with high levels of

disruptiveness as perceived by non-teaching school personnel.


To estimate how well the DSBS identified the disruptive

group, the mean of DSBS ratings for the disruptive

group students was compared with the mean of DSBS ratings

for a norming group representing a sample, stratified by

grade, of the school population. If the DSBS demonstrated

agreement with the concurrent judgments of disruptiveness

made by non-teaching school officials, a prima facie case

would be made for predicting that students in other

settings identified by the DSBS as disruptive would also

be judged disruptive by non-teaching school officials.

Messick (1980) described construct validity as based

on convergent and discriminant validity, ethical

interpretation, relevance and utility for the specific

application, and the consequences following use of the

instrument. Convergent validity requires that the DSBS be

able to identify all students who are considered exces-

sively disruptive. To demonstrate satisfactory convergent

validity, the DSBS ratings of 100% of the students in the

disruptive group would have to be significantly above the

local DSB norm. The disruptive group ratings are reported

in Chapter Four.

Discriminant validity requires that the DSBS be able

to reject those students who are not considered exces-

sively disruptive. To demonstrate satisfactory discrimi-

nant validity, the DSBS ratings of only those students in

the disruptive group, or eligible for inclusion, could be

significantly above the local DSB norm. Ratings of the

norming group are reported in Chapter Four.

Ethical interpretation of DSBS ratings requires an

understanding of both the theoretical and practical

concepts underlying development of this instrument. There-

fore, a manual will be prepared before the DSBS is offered

for research use. Relevance was supported by the theoreti-

cal basis on which the 10 constructs were chosen to define

DSB for this study. Utility was provided by the proce-

dures used to select appropriate items, score the forms,

interpret the ratings, and present the results. The conse-

quences of using the DSBS cannot be predicted until it is

thoroughly researched. The intent is to improve the

validity of the selection process for programs assisting

disruptive students.

Reliability of the DSBS

The DSBS rating for each student is an aggregate of

scores from at least four teachers. A test-retest measure

compared two DSB ratings obtained from individual teachers.

Fourteen days after the receipt of teacher ratings, a

follow-up rating by the same teachers of approximately 10%

of both the norming and disruptive groups was made. These

results are reported in Chapter Four.

The internal consistency of the DSBS was protected by

choosing only items previously used by teachers to

describe DSB. Item analysis is not an effective technique

for establishing reliability of individual administrations

of the DSBS. Patterns of disruptive behavior are often

narrow and stereotypical, while the DSBS contains items

descriptive of a broad range of possible behaviors. Thus,

item scores were not likely to correlate with each other.

No attempt was made to assess interrater reliability.

Classroom settings are conceptualized as discrete environ-

ments, whose norms for behavior are determined by the

personality of the teacher. The behavior of interest is

the interaction of students with their teachers totally,

not individually.

Field Study

The purpose of the field study was to identify and

correct any problems, actual or potential, with item

content, response format, or administration and scoring

procedures of the DSBS. Following a successful field

study, the instrument may be offered to the profession for

further research and development (Benson & Clark, 1982).

Accordingly, the operational goal of this present effort

was to conduct a field study to determine the readiness of

the DSBS for use as a research instrument.

The target population consisted of students enrolled

in grades six through nine (i.e., middle and junior high

school grades) in public schools anywhere in the United

States. No restrictions were placed on age, gender, race

or socioeconomic status. The selection criteria for the

host school were a heterogeneous ethnic population, an

urban or suburban location, public middle (grades 6, 7, 8)

or junior high (grades 7, 8, 9) school status, random

assignment of students to basic courses, and an average

daily attendance figure of at least 500 students. Special

schools, such as alternative schools and special education

centers, were not considered.

A public middle school meeting these criteria was

located in a predominately urban school district on the

west coast of Florida. The student enrollment was

approximately 76% white, 22% black, and 2% Asian- and

Hispanic-American, with an average daily attendance of

733. Socioeconomic status was said by the principal to be

primarily upper-lower class and lower-middle class.

For the norming group, a sample consisting of 90 stu-

dents was selected using one English and one mathematics

class, with randomly-assigned enrollments, at each grade

level. A total of six classes containing 203 students and

ranging from 32 through 35 students each were sampled.

The numbers 1 through 35 were written on individual slips

of paper and 15 numbers drawn randomly using the replace-

ment procedure. For each class, the students whose class

roll numbers matched the 15 randomly selected numbers were

included in the norming group.

The disruptive group was selected by nomination by

non-teaching school personnel, who were asked to list the

names of all of the excessively disruptive students

encountered during the current school year. It was thus

possible for a student's name to be included in both the

norming and the disruptive groups. The nominating process

initially produced a group of 64 students. After a conference

among the raters, this group was reduced to 36 students.


All students finally nominated into the disruptive

group were assigned to one of four levels of disruptive-

ness (none, mild, moderate, or severe) by each nominating

person working independently. Nominated students were

assigned a numerical rating according to the following


Level of Disruptiveness Rating

None 0

Mild 1

Moderate 2

Severe 3

Students were ranked according to the average of these rat-

ings. This ranking permitted the correlation, reported in

Chapter Four, of levels of disruptiveness between the DSBS

results and the qualitative assessments by school person-

nel for each disruptive group student.
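
The ranking step described above may be sketched as follows. The student labels and the three nominators per student are hypothetical (the study used seven nominating personnel); the level-to-number mapping is the one given in the text.

```python
LEVELS = {"none": 0, "mild": 1, "moderate": 2, "severe": 3}

def average_level(levels):
    """Mean numerical rating across the nominating personnel."""
    return sum(LEVELS[level] for level in levels) / len(levels)

def rank_students(nominations):
    """Return (student, average) pairs, most disruptive first."""
    averages = {s: average_level(lv) for s, lv in nominations.items()}
    return sorted(averages.items(), key=lambda pair: pair[1], reverse=True)

# Hypothetical independent judgments for three nominated students.
nominations = {
    "student_a": ["mild", "moderate", "mild"],
    "student_b": ["severe", "severe", "moderate"],
    "student_c": ["none", "mild", "none"],
}
ranking = rank_students(nominations)
for student, average in ranking:
    print(student, round(average, 2))
```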

Schedules for the sample students were obtained from

school records. No contact was made with any student.

Training of all participating teachers took place in a

meeting at which a DSBS form for each period of a sample

student's current schedule was distributed. Appendix K

contains these instructions. The purpose of the study was

explained and a date and procedure for returning the forms

agreed upon. Emphasis was placed on the need to respond

to only the behaviors actually mentioned on the instrument

and to perform the ratings independently of other teachers.

Provision was made for a faculty member to either answer

or refer questions that might arise during the rating


Teachers not submitting all their DSBS forms by the

agreed-upon date were contacted and reminded of the

importance of their participation. Upon receipt of at

least four completed DSBS forms for each student, the DSBS

rating for that student was calculated. At least four

scorable forms, totaling 622, were received for 108 stu-

dents, 76 in the norming group and 32 in the disruptive

group. The scoring template (Appendix H) provided for

calculating item scores weighted for severity.

The item scores were totaled to produce a form score,

which was entered on the Summary of Teacher Ratings form

(Appendix I). This summary form contains spaces for the

student's name, grade, age, and sex; school name; evalua-

tor's name and title; individual form scores; each rater's

name, subject, and class period; and calculation of the

student's DSBS rating and z-score. Each sample student's

form scores were summed to give a total score. The total

score was divided by the number of raters to yield the

average score, which is the student's DSBS rating.

After the DSBS ratings for all the norming group stu-

dents were calculated, the mean DSBS rating and standard

deviation for the group were obtained. This mean of the

means is the local DSBS rating, or norm, for the target

school. The local DSBS norm was subtracted from each

student's DSBS rating, giving that student's deviation from the

local norm.

Dividing this deviation by the local standard deviation

gave the number of standard deviation units, or z-scores,

by which the student's DSBS rating differed from the

local DSBS norm. The criterion of two standard deviation

units above the local DSBS norm translates to a disrup-

tiveness score higher than approximately 98% of the

predicted scores from the school population. The distri-

bution of scores obtained from the norming group was

inspected to assure the existence of sufficient variance

to make the z-scores meaningful.
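
The calculations in the preceding paragraphs can be summarized in a brief sketch. All names and data below are hypothetical, and the use of the sample (n - 1) standard deviation is an assumption, since the source does not state which form was used.

```python
from statistics import mean, stdev

def dsbs_rating(form_scores):
    """Average of a student's form scores (at least four raters)."""
    if len(form_scores) < 4:
        raise ValueError("at least four scorable forms are required")
    return sum(form_scores) / len(form_scores)

def local_norm(norming_ratings):
    """Mean and standard deviation of the norming group's ratings.

    Assumption: sample standard deviation (n - 1 denominator).
    """
    return mean(norming_ratings), stdev(norming_ratings)

def z_score(rating, norm, sd):
    """Deviation from the local norm in standard deviation units."""
    return (rating - norm) / sd

# Hypothetical form scores for four norming group students.
norming = [dsbs_rating(f) for f in
           ([2, 4, 6, 4], [10, 12, 8, 10], [0, 2, 2, 0], [6, 8, 6, 4])]
norm, sd = local_norm(norming)

# Hypothetical referred student rated by four teachers.
z = z_score(dsbs_rating([18, 22, 20, 20]), norm, sd)
print(z >= 2.0)  # True: meets the two standard deviation criterion
```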

A reliability check was performed. Fourteen days

after all the rating forms were collected, approximately

10% of the students from both the norming and disruptive

groups were selected to be representative of the range of

scores. New forms were submitted to the original raters

for rerating the same students and the results compared.

These results are reported in Chapter Four. After comple-

tion of the data analyses, all participants were invited

to a meeting to discuss the results, offer comments, and

receive appreciation for their participation.

Data Analyses


To establish content validity, 24 expert judges

assigned proposed DSBS items to construct categories.

Results of the judges' assignments were totaled for each

item. An item was dropped if not assigned to at least one

category by each judge. If this content validation pro-

cedure had resulted either in fewer than 30 items being

assigned to at least one construct or in having a con-

struct with fewer than three items assigned by 80% of the

respondents, enough items would have been constructed and

validated to meet these criteria. The judges' item assign-

ments are reported in Chapter Four.
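
One possible reading of these content validation criteria is sketched below. The data layout, the function name, and the rule that an item counts toward the construct most judges chose are assumptions made for illustration; the thresholds (30 items, three items per construct, 80% agreement) are taken from the text.

```python
from collections import Counter

def check_content_validity(assignments, constructs, min_items=30,
                           min_per_construct=3, agreement_pct=80):
    """assignments: {item: [category chosen by each judge, or None]}."""
    # Drop any item that some judge assigned to no category.
    retained = {item: cats for item, cats in assignments.items()
                if all(c is not None for c in cats)}
    per_construct = Counter()
    for cats in retained.values():
        category, votes = Counter(cats).most_common(1)[0]
        # Count the item only if enough judges agreed on the construct.
        if votes * 100 >= agreement_pct * len(cats):
            per_construct[category] += 1
    weak = [c for c in constructs if per_construct[c] < min_per_construct]
    return {"retained": sorted(retained),
            "enough_items": len(retained) >= min_items,
            "weak_constructs": weak}

# Hypothetical example with five judges, reduced thresholds for brevity.
assignments = {
    "item1": ["A", "A", "A", "A", "A"],
    "item2": ["A", "A", "A", "A", "B"],
    "item3": ["B", "B", None, "B", "B"],
    "item4": ["B", "A", "B", "A", "B"],
}
result = check_content_validity(assignments, ["A", "B"],
                                min_items=3, min_per_construct=1)
print(result)
```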

To ascertain how well the DSBS identified the

disruptive group, the t-test was used to estimate the

significance of the difference between the means of the

norming group and disruptive group. An obtained prob-

ability level of .05 or less was considered evidence of

statistical significance. The magnitude of the difference

between the means was used to evaluate the practical

significance of the instrument and its potential for

identifying disruptive students. These results are

reported and discussed in Chapter Four.
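
The comparison of group means may be illustrated with a sketch of the t statistic. The source does not state which form of the t-test was used; the pooled-variance (equal variances) form shown here is an assumption, and the ratings are hypothetical.

```python
import math
from statistics import mean, variance

def pooled_t(group_a, group_b):
    """Student's t statistic and degrees of freedom.

    Assumption: independent groups with equal variances.
    """
    na, nb = len(group_a), len(group_b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(group_a)
                  + (nb - 1) * variance(group_b)) / df
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (mean(group_a) - mean(group_b)) / se, df

# Hypothetical DSBS ratings.
disruptive = [20, 24, 22, 26]
norming = [4, 6, 5, 7, 3]
t, df = pooled_t(disruptive, norming)
print(round(t, 2), df)
```

The obtained t would then be compared with the tabled critical value for the given degrees of freedom at the .05 level.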

To estimate convergent validity for the DSBS, the DSBS

rating for each disruptive group member was compared with

the mean DSBS rating of the norming group. For the

purposes of this study, a DSBS rating of at least two

z-scores above the norming group mean was accepted as

evidence that the DSBS had correctly identified a disrup-

tive group member. The standard error of the mean was

used to include students when evaluating borderline cases.

The criterion for satisfactory convergent validity was the

correct identification of 100% of the disruptive group.

Discriminant validity also was estimated by using

ratings, means, and z-scores. The DSBS rating for each

norming group member was compared with the mean of that

group. Any norming group member whose DSBS rating

exceeded the mean by at least two z-scores was considered

identified by the DSBS as excessively disruptive. Identi-

fied cases, not members of or eligible for the disruptive

group, were considered challenges to the discriminant

validity of the DSBS. All cases not meeting the construct

validity criteria were investigated. Construct validity

results are reported and discussed in Chapter Four.
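
The convergent and discriminant criteria can be expressed as a simple classification check. The sketch below is illustrative only: the student labels and z-scores are hypothetical, and it omits both the standard error allowance for borderline cases and the "eligible for inclusion" judgment described in the text.

```python
def identified(z_scores, cutoff=2.0):
    """Students rated at least `cutoff` z-scores above the local norm."""
    return {s for s, z in z_scores.items() if z >= cutoff}

def validity_summary(z_scores, disruptive_group):
    flagged = identified(z_scores)
    hits = flagged & disruptive_group
    return {
        # Convergent criterion: 100% of the disruptive group identified.
        "convergent_rate": len(hits) / len(disruptive_group),
        "missed": sorted(disruptive_group - flagged),
        # Discriminant challenges: flagged but not in the disruptive group.
        "challenges": sorted(flagged - disruptive_group),
    }

# Hypothetical z-scores for six students.
z_scores = {"s1": 3.1, "s2": 2.0, "s3": 1.4,
            "s4": 0.2, "s5": 2.6, "s6": -0.5}
summary = validity_summary(z_scores, {"s1", "s2", "s3"})
print(summary)
```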


The Pearson product-moment correlation statistic was

used to compare the original ratings on approximately 10%

of the completed forms with follow-up ratings made after

14 days. Individual coefficients of at least .80 were set

arbitrarily as the acceptable level of test-retest

reliability.
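
The test-retest comparison may be sketched with the standard Pearson product-moment formula. The paired ratings below are hypothetical; the .80 criterion is the one stated in the text.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation of two paired lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("two paired lists of equal length are required")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical original and 14-day follow-up form scores.
original = [4, 10, 22, 15, 7]
retest = [5, 9, 20, 16, 8]
r = pearson_r(original, retest)
print(r >= 0.80)  # True for these illustrative data
```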


1. The school for the field study was selected based on

the willingness of both the school and the faculty to

cooperate. This may have mitigated problems that

would occur in a less favorable environment.

2. Teacher resistance and/or concerns about this type of

research may have biased or limited their participation.


3. The study was limited to exploration and the results

are not intended to generalize beyond the administra-

tion and scoring procedures. Specifically, the

calculated DSB norm is valid only for this school.

4. No provision was made to assess the possible effects

of grade and sex on DSB norms. Studies have indicated

the influences are not significant, but at some point

this should be investigated.

5. The disruptive sample group was likely composed of

students who had been referred to the dean. The same

teachers who referred these students to the dean may

have rated their behaviors, with bias a possibility.

6. The use of expert judges in the validation procedures

may have introduced personal bias into the items used

on the instrument.



The purpose of this study was to develop and validate

an instrument, the Disruptive Student Behavior Scale

(DSBS). The study focused on identifying components of

disruptive school behavior as perceived by middle and

junior high school teachers and constructing an instrument

to quantify these behaviors. To accomplish this, an

instrument was constructed using behaviors taken from

disciplinary referrals and field tested on a representa-

tive sample of students from a Florida middle school.

Teacher ratings for a norm group and a disruptive group

were collected and analyzed as outlined in Chapter Three.

These results are reported in this chapter.


The Severity Factor

Results of the assignment of potential adverse conse-

quences resulting from DSBS behaviors are reported in

Table 2. Twenty packets containing 40 DSBS items and an

instruction sheet were distributed and 16 were returned.

At least 50%, or 8, of the raters had to assign a DSBS item

to a particular domain before that domain was
