Citation
- Permanent Link:
- https://ufdc.ufl.edu/UFE0004828/00001
Material Information
- Title:
- Accountability and the Florida Comprehensive Assessment Test : effects of item format on low performing students in measuring one year's growth
- Creator:
- Elizondo, Stewart Francis ( Dissertant )
Miller, M. David. ( Thesis advisor )
Seraphine, Anne ( Reviewer )
- Place of Publication:
- Gainesville, Fla.
- Publisher:
- University of Florida
- Publication Date:
- 2004
- Copyright Date:
- 2004
- Language:
- English
Subjects
- Subjects / Keywords:
- Disabilities ( jstor )
Grade levels ( jstor ) Mathematical tables ( jstor ) Mathematics ( jstor ) Mathematics education ( jstor ) Reading tables ( jstor ) Sample mean ( jstor ) Schools ( jstor ) Sex linked differences ( jstor ) Special needs students ( jstor ) Dissertations, Academic -- UF -- Educational Psychology Educational Psychology thesis, M.A.E City of Tallahassee ( local )
- Genre:
- bibliography ( marcgt )
theses ( marcgt ) non-fiction ( marcgt )
Notes
- Abstract:
- The impact of constructed response items in the measurement of one year's growth on the Florida Comprehensive Assessment Test for low performing students is analyzed. Growth was measured from the 2001 to the 2002 Sunshine State Standards assessment on the vertically equated developmental scores as defined by the Florida School Accountability System. Data in reading and mathematics for 6th through 10th grade were examined for a mid-sized school district and disaggregated into several subgroups as defined in the state No Child Left Behind Accountability Plan. These results were compared across grade levels to see if growth differed in the grade levels that are assessed with constructed response items. Results indicated that constructed response items may be of benefit to many low performing students in mathematics, but to a lesser extent in reading. Differential results between subgroups were most evident in reading.
- Subject:
- accountability, FCAT
- General Note:
- Title from title page of source document.
- General Note:
- Document formatted into pages; contains 59 pages.
- General Note:
- Includes vita.
- Thesis:
- Thesis (M.A.E.)--University of Florida, 2004.
- Bibliography:
- Includes bibliographical references.
Record Information
- Source Institution:
- University of Florida
- Holding Location:
- University of Florida
- Rights Management:
- Copyright Elizondo, Stewart Francis. Permission granted to the University of Florida to digitize, archive and distribute this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
- Embargo Date:
- 4/30/2004
- Resource Identifier:
- 55898851 ( OCLC )
Full Text
ACCOUNTABILITY AND THE FLORIDA COMPREHENSIVE ASSESSMENT
TEST: EFFECTS OF ITEM FORMAT ON LOW PERFORMING STUDENTS IN
MEASURING ONE YEAR'S GROWTH
By
STEWART FRANCIS ELIZONDO
A THESIS PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF ARTS IN EDUCATION
UNIVERSITY OF FLORIDA
2004
Copyright 2004
by
Stewart Francis Elizondo
This work is dedicated to Stephanie, who has unconditionally supported me not only in
this endeavor, but also in all of my undertakings over the last seventeen years.
ACKNOWLEDGMENTS
I acknowledge my committee chair, Dr. M. David Miller, and committee member,
Dr. Anne E. Seraphine, for their skilled mentorship and expertise. They have allowed
this process to be challenging, yet enjoyable and rewarding. Of course, any errors
remaining in this paper are entirely my own responsibility.
Gratitude is also given to Dr. Bridget A. Franks, graduate coordinator, and Dr.
James Algina, Director of the Research and Evaluation Program, for their assistance in
helping me secure an Alumni Fellowship, without which this project may not have been possible, or at least would have been protracted and less cohesive.
Enough appreciation cannot be expressed to my family for their undying support
of my graduate studies. Repaying the concessions made by my sons, Spencer and Seth, is
something I cannot do in my lifetime, but I look forward to making the effort.
Encouragement and support given to me by my in-laws, Jeanne and John Garrod, are also
not taken for granted and I am most thankful. I would like to thank my mother, Elizabeth
Marder. It is because of her that I realize and celebrate the fact that education is truly a
lifelong process. Finally, the many thanks owed to my wife, Stephanie, are beyond
words. I would not be half of what I am if it were not for her.
TABLE OF CONTENTS
ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
    The Florida Comprehensive Assessment Test
    State and Federal Accountability
        The Florida School Accountability System
        The No Child Left Behind Act of 2001
    Low Performing Students and One Year's Growth

2 REVIEW OF LITERATURE
    The Movement Towards Constructed Response
    Trait Equivalence Between Item Formats
    Psychometric Differences Between Item Formats
    Combining Item Formats
    Differential Effects of Item Formats on Various Groups

3 METHODS
    Data
    Sample
    Measures
    Analysis Approach

4 RESULTS
    Aggregate Developmental Scale Score Analysis
        Reading
        Mathematics
    Developmental Scale Score Contrasts by Sex
        Reading
        Mathematics
    Developmental Scale Score Contrasts by Sex and Race
        Reading
        Mathematics
    Developmental Scale Score Contrasts by Socio-economic Status
        Reading
        Mathematics
    Developmental Scale Score Contrasts for Students with Disabilities and Regular Education Students
        Reading
        Mathematics
    Summary

5 DISCUSSION
    Trends by Grade
        Reading
        Mathematics
    Item Format Effects
    Closing Remarks

APPENDIX

A GRADING FLORIDA PUBLIC SCHOOLS

B ANNUAL AYP OBJECTIVES FOR READING

C ANNUAL AYP OBJECTIVES FOR MATHEMATICS

LIST OF REFERENCES

BIOGRAPHICAL SKETCH
LIST OF TABLES
Table
1-1 Florida Comprehensive Assessment Test Item Formats by Grade
1-2 Florida Comprehensive Assessment Test Achievement Levels
3-1 Matched Low Performing Students by Grade for Reading and Mathematics
3-2 Disaggregation of Matched District and Low Performing Students by Subgroup for Grades 6 through 10 for Reading and Mathematics
4-1 Mean Aggregate Reading DSS for Matched Low Performing Students not Making Gains by Achievement Level Criteria
4-2 Mean Differences for Aggregate Reading DSS for Matched Low Performing Students not Making Gains by Achievement Level Criteria from 2001 to 2002
4-3 Mean Aggregate Mathematics DSS for Matched Low Performing Students not Making Gains by Achievement Level Criteria
4-4 Mean Differences for Aggregate Mathematics DSS for Matched Low Performing Students not Making Gains by Achievement Level Criteria from 2001 to 2002
4-5 Mean Reading DSS by Sex for Matched Low Performing Students not Making Gains by Achievement Level Criteria
4-6 Mean Differences for Reading DSS by Sex for Matched Low Performing Students not Making Gains by Achievement Level Criteria from 2001 to 2002
4-7 Mean Mathematics DSS by Sex for Matched Low Performing Students not Making Gains by Achievement Level Criteria
4-8 Mean Differences for Mathematics DSS by Sex for Matched Low Performing Students not Making Gains by Achievement Level Criteria from 2001 to 2002
4-9 Mean Reading DSS by Sex and Race for Matched Low Performing Students not Making Gains by Achievement Level Criteria
4-10 Mean Differences for Reading DSS by Sex and Race for Matched Low Performing Students not Making Gains by Achievement Level Criteria from 2001 to 2002
4-11 Mean Mathematics DSS by Sex and Race for Matched Low Performing Students not Making Gains by Achievement Level Criteria
4-12 Mean Differences for Mathematics DSS by Sex and Race for Matched Low Performing Students not Making Gains by Achievement Level Criteria from 2001 to 2002
4-13 Mean Reading DSS by SES for Matched Low Performing Students not Making Gains by Achievement Level Criteria
4-14 Mean Differences for Reading DSS by SES for Matched Low Performing Students not Making Gains by Achievement Level Criteria from 2001 to 2002
4-15 Mean Mathematics DSS by SES for Matched Low Performing Students not Making Gains by Achievement Level Criteria
4-16 Mean Differences for Mathematics DSS by SES for Matched Low Performing Students not Making Gains by Achievement Level Criteria from 2001 to 2002
4-17 Mean Reading DSS for Matched Low Performing Students with Disabilities and Regular Education Students not Making Gains by Achievement Level Criteria
4-18 Mean Differences for Reading DSS for Matched Low Performing Students with Disabilities and Regular Education Students not Making Gains by Achievement Level Criteria from 2001 to 2002
4-19 Mean Mathematics DSS for Matched Low Performing Students with Disabilities and Regular Education Students not Making Gains by Achievement Level Criteria
4-20 Mean Differences for Mathematics DSS for Matched Low Performing Students with Disabilities and Regular Education Students not Making Gains by Achievement Level Criteria from 2001 to 2002
4-21 Subgroups of Low Performing Students Demonstrating More than One Year's Growth in Reading and Mathematics
LIST OF FIGURES
Figure
3-1 FCAT Achievement Levels for the Developmental Scale
3-2 Expected Growth under Gain Alternative 3
Abstract of Thesis Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Master of Arts in Education
ACCOUNTABILITY AND THE FLORIDA COMPREHENSIVE ASSESSMENT
TEST: EFFECTS OF ITEM FORMAT ON LOW PERFORMING STUDENTS IN
MEASURING ONE YEAR'S GROWTH
By
Stewart Francis Elizondo
May 2004
Chair: M. David Miller
Major Department: Educational Psychology
The impact of constructed response items in the measurement of one year's growth
on the Florida Comprehensive Assessment Test for low performing students is analyzed.
Growth was measured from the 2001 to the 2002 Sunshine State Standards assessment on
the vertically equated developmental scores as defined by the Florida School
Accountability System. Data in reading and mathematics for 6th through 10th grade
were examined for a mid-sized school district and disaggregated into several subgroups
as defined in the state No Child Left Behind Accountability Plan. These results were
compared across grade levels to see if growth differed in the grade levels that are
assessed with constructed response items. Results indicated that constructed response
items may be of benefit to many low performing students in mathematics, but to a lesser
extent in reading. Differential results between subgroups were most evident in reading.
CHAPTER 1
INTRODUCTION
Enacted in 1968, The Educational Accountability Act, Section 229.51, Florida
Statutes, empowered the Commissioner of Education to use
all appropriate management tools, techniques, and practices which will cause the
state's educational programs to be more effective and which will provide the
greatest economics in the management and operation of the state's system of
education.
Subsequent changes to the Florida Department of Education's (FDOE) capabilities toward that end can be traced through a stream of legislation that has produced the current environment of accountability. Utilizing components of the current accountability systems, this study analyzes questions regarding Florida's lowest performing students.
The Florida Comprehensive Assessment Test
The Florida Comprehensive Assessment Test (FCAT) currently serves as the
measure for the statewide School Accountability System as envisioned by the A+ Plan
for Education. It also serves as the measure for the federal No Child Left Behind Act of
2001 (NCLB) for all public schools. Both accountability systems include requirements
for annual student growth. Section 229.57, Florida Statutes was amended in 1999 and
specifies that school performance grade category designations shall be based on the
school's current year's performance and the school's annual learning gains (FDOE, 2001).
Section 1111(b)(2)(H) of NCLB mandates that "intermediate goals for annual yearly
progress" shall be established (NCLB, 2001).
Reading and mathematics components of FCAT based on the Sunshine State
Standards (SSS) are administered in grades 3 through 10. These components are
commonly referred to as FCAT SSS Reading and FCAT SSS Mathematics. Emphasis on
these standards-based assessments by the accountability systems makes their
construction, scoring, and reporting of special interest to stakeholders. Constructed
response (CR) items are utilized on the FCAT SSS Reading assessment in grades 4, 8,
and 10 and on the FCAT SSS Mathematics in grades 5, 8, and 10. CR items are either
Short-Response (SR) or Extended-Response (ER). Not only does item format vary, but
weight given to item formats can also vary across grades (see Table 1-1).
Table 1-1. Florida Comprehensive Assessment Test Item Formats by Grade
Grade  Reading: Item Formats (% of Items)    Mathematics: Item Formats (% of Items)
3      MC 100%                               MC 100%
4      MC 85-90%; SR & ER 10-15%             MC 100%
5      MC 100%                               MC 60-70%; GR 20-25%; SR & ER 10-15%
6      MC 100%                               MC 60-70%; GR 30-40%
7      MC 100%                               MC 60-70%; GR 30-40%
8      MC 85-90%; SR & ER 10-15%             MC 50-60%; GR 25-30%; SR & ER 10-15%
9      MC 100%                               MC 60-70%; GR 30-40%
10     MC 85-90%; SR & ER 10-15%             MC 50-60%; GR 25-30%; SR & ER 10-15%
Note. MC: Multiple-Choice, GR: Gridded-Response, SR: Short-Response, and ER:
Extended-Response.
Student scores have been reported on a scale of 100 to 500 for both the FCAT SSS
Reading and Mathematics since they were first reported in 1998. Though these scores
yield Achievement Levels that are used for the School Accountability System and NCLB
(see Table 1-2), they are not intended to be interpreted from grade to grade.
Table 1-2. Florida Comprehensive Assessment Test Achievement Levels
5 Advanced Performance at this level indicates that the student has success
with the most challenging content of the Sunshine State
Standards. A Level 5 student answers most of the test questions
correctly, including the most challenging questions.
4 Proficient Performance at this level indicates that the student has success
with the challenging content of the Sunshine State Standards. A
Level 4 student answers most of the questions correctly, but may
have only some success with questions that reflect the most
challenging content.
3 Proficient Performance at this level indicates that the student has partial
success with the challenging content of the Sunshine State
Standards, but performance is inconsistent. A Level 3 student
answers many of the questions correctly, but is generally less
successful with questions that are most challenging.
2 Basic Performance at this level indicates that the student has limited
success with the challenging content of the Sunshine State
Standards.
1 Below Basic Performance at this level indicates that the student has little
success with the challenging content of the Sunshine State
Standards.
To address this, FCAT results began to be reported on a vertically equated scale
that ranges from 86 to 3008 across grades 3 through 10. This facilitates interpretations of
annual progress from grade to grade. Developmental Scale Scores (DSS), as well as DSS
Change, the difference between consecutive years, appear on student reports as well as
school, district, and state reports for the aggregate means (FDOE, 2003a). It has been
proposed that NCLB encourages the use of vertical scaling procedures (Ananda, 2003).
Though FCAT field-testing began in 1997, reports including DSS have only been
available since 2001.
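For example, a student who earned a DSS of 1,257 on the 2001 assessment and 1,353 on the 2002 assessment would show a DSS Change of 96.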
State and Federal Accountability
The Florida School Accountability System
In 1999, school accountability became more visible to the public with the advent of
school performance grades. Category designations range from "A," making excellent
progress, to "F," failing to make adequate progress (see Appendix A). Schools are
evaluated on the basis of aggregate student performance on FCAT SSS Reading and
FCAT SSS Mathematics in grades 3 through 10, and the Florida Writes statewide writing
assessment administered in grades 4, 8, and 10. Currently, three major components
contribute to their calculation:
1. Yearly achievement of high standards in reading, mathematics and writing,
2. Annual learning gains in reading and mathematics, and
3. Annual learning gains in reading for the lowest 25% of students in each school.
As of 2002, annual learning gains on FCAT SSS Mathematics and FCAT SSS
Reading account for half of the point system that yields school performance grades, with
special attention given to the reading gains of the lowest 25% of students in each school.
There are three ways that schools can be credited for the annual learning gains of their
students:
1. When students improve their FCAT Achievement Levels (1-2, 2-3, 3-4, 4-5); or
2. When students maintain a relatively high Achievement Level (3, 4 or 5); or
3. When students demonstrate more than one year's growth within Levels 1 or 2, as
measured by an increase in their FCAT Developmental Scale Scores from one year
to the next (FDOE, 2003b).
Incentives for schools graded "A" and those that improve at least one grade level include monetary awards under the Florida School Recognition Program, Section 231.2905, Florida Statutes. Sanctions for schools graded "F" for two years in a
four-year period include eligibility for Opportunity Scholarships under Section 229.0537,
Florida Statutes, so that students can attend higher performing public or private schools
(FDOE, 2001).
The No Child Left Behind Act of 2001
The inclusion of all students when making accountability calculations was not a
common practice prior to the passing of the federal No Child Left Behind Act of 2001
(Linn, 2000). Approved in April of 2003, Florida's State Accountability Plan meets this
defining requirement and builds on the Florida School Accountability System. It is also
compliant with the Adequate Yearly Progress (AYP) components of the federal law. To
make AYP, the percentage of students earning a score of Proficient or above in reading
and mathematics has to meet or exceed the annual objectives for the given year (see
Appendix B and C).
Furthermore, not only will the AYP criteria be applied to the aggregate, but they will also be applied separately to subgroups that disaggregate the data by race, socio-economic status, disability, and limited English proficiency. If a school fails to meet the annual objectives for any subgroup in either content area, the school is
designated as not making AYP. The expectation for growth is such that all students are
Proficient or above on the FCAT SSS Reading and Mathematics no later than 2014
(FDOE, 2003c).
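As a minimal sketch of this disaggregated check (the subgroup labels, data layout, and the 60% objective used below are illustrative assumptions, not values from Florida's plan), the share of students scoring Proficient or above can be computed for each subgroup and compared against the annual objective:

    # Illustrative AYP-style check; the subgroup labels, annual objective, and
    # data layout are assumptions for this sketch, not FDOE specifications.
    from collections import defaultdict

    def makes_ayp(students, annual_objective):
        """students: iterable of dicts with 'subgroup' and 'achievement_level' (1-5).
        Returns True only if every subgroup meets the annual objective."""
        counts = defaultdict(lambda: [0, 0])            # subgroup -> [proficient, total]
        for s in students:
            counts[s["subgroup"]][1] += 1
            if s["achievement_level"] >= 3:             # Levels 3-5 count as Proficient
                counts[s["subgroup"]][0] += 1
        return all(prof / total >= annual_objective for prof, total in counts.values())

    # Hypothetical example with a 60% annual objective.
    roster = [{"subgroup": "ED", "achievement_level": 3},
              {"subgroup": "ED", "achievement_level": 2},
              {"subgroup": "SWD", "achievement_level": 4}]
    print(makes_ayp(roster, 0.60))                      # False: the ED subgroup falls short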
Low Performing Students and One Year's Growth
The fundamental question addressed in this paper is the following: Do grades with
constructed response items show differential growth for various subgroups of students
within Achievement Levels 1 or 2, as measured by an increase in their FCAT
Developmental Scale Scores from one year to the next? The subgroups of interest are
those defined by NCLB. These annual gains directly impact the calculation of school
performance grades under the Florida School Accountability System, and can serve as a
method of monitoring the progress and composition of those non-Proficient students for
the AYP calculations under NCLB. Comparisons of item format effects will be explored
for FCAT SSS Reading and Mathematics in grades 6 through 10.
Other issues, though very relevant, such as: a) the longstanding debate on how to
yield gain scores (Cronbach & Furby, 1970; Lord, 1963), b) the discrepancies
surrounding the various methods of vertically equating tests, especially when including
multiple formats, and their respective implications (Crocker & Algina, 1986; Slinde &
Linn, 1977; Thissen & Wainer, 2001), and c) the more recent debate over the effects of high-stakes testing (Amrein & Berliner, 2002; Carnoy & Loeb, 2002) will be set aside in an
effort to work within the existing framework of the current accountability systems.
Unintended item format effects, especially on the FCAT SSS Reading and
Mathematics assessments, given their emphasis by both the Florida School
Accountability System and the federal No Child Left Behind Act, only grow in
importance as trends in Florida's student population continue. In 2002, 14 of Florida's
67 school districts reported minority enrollment of 50% or more. Many of these were
among the most populous districts across the state. Furthermore, minority representation
statewide has steadily increased from 29.9% in 1976 to 49.4% in 2002 (FDOE, 2003d).
CHAPTER 2
REVIEW OF LITERATURE
The Movement Towards Constructed Response
By 1999, some type of constructed response (CR) item was used in over three-quarters of all
statewide assessment programs (Council of Chief State School Officers [CCSSO], 1999).
This is a marked change from the assessment practices of just a decade ago when Barton
and Coley (1994), as cited in Linn (1994), stated that,
The nation is entering an era of change in testing and assessment. Efforts at both
the national and state levels are now directed at greater use of performance
assessment, constructed response questions, and portfolios based on actual student
work (p. 3).
The hypothesis is that these types of items are more closely aligned with the
standards-based reform movement currently underway in education and that they are
better able to detect its effects (Hamilton et al., 2003; Klein et al., 2000). Some have
recently tested whether the relationship between achievement and instruction varies as a
function of item format (McCaffrey et al., 2001).
One of the first and most prominent statewide programs to implement performance-
based assessments in a high-stakes environment was the Maryland School Performance
Assessment Program (MSPAP), first administered in 1991. Though replaced by the
Maryland School Assessment in 2002, Maryland still utilizes CR items in all grades
tested. Although virtually no information was available on the psychometric properties
of performance assessment during the design stages of the MSPAP, results were
encouraging and allowed for innovations in later years (Yen & Ferrara, 1997).
The use of CR items has since become a hallmark of quality testing programs to the
extent that a National Academy of Education panel repeatedly applauded the National
Assessment of Educational Progress' continued move to include CR items. In a later
review, the Committee on the Evaluation of National and State Assessments of
Educational Progress, in conjunction with the Board on Testing and Assessment, called
for an enhancement of the use of testing results, especially CR items, to provide better
interpretation (Pellegrino et al., 1999).
Trait Equivalence Between Item Formats
The debate regarding the existence and interpretation of psychological differences
tapped by CR items, as opposed to multiple choice (MC) items, has been equivocal.
Thissen and Wainer (2001) make clear a basic assumption of psychometrics:
Before the responses of any set of items are combined into a single score that is
taken to be, in some sense, representative of the responses to all of the items, we
must ascertain the extent to which the items measure the same thing (p. 10).
Factor analysis techniques, among other methodologies, have been employed to
analyze the extent to which CR items measure a different trait than do MC items.
Evidence of a free-response factor was found by Bennett, Rock, and Wang (1991),
though a single-factor solution was reported as providing the most parsimonious fit.
Thissen, Wainer, and Wang (1994) reported that despite a small degree of
multidimensionality, it would seem meaningful to combine the scores of MC and CR
items as they, for the most part, measure the same underlying proficiency. These authors
suggest that proficiency may be better estimated with only MC items due to a high
correlation with the more reliable multiple-choice factor.
Still others have used factor analysis techniques with MC and CR items and
found that CR items were not measuring anything beyond what was measured by the MC
items (Bridgeman & Rock, 1993). With item response theory methodology, others have
concluded that CR items yielded little information over and above that provided by MC
items (Lukhele et al., 1994).
A recent meta-analysis framed the question of construct (trait) equivalence as a
function of stem equivalence between the two item types. It was found that when items
are constructed in both formats using the same stem, the mean correlation between the
two formats approaches unity and is significantly higher than when using non-stem
equivalent items (Rodriguez, 2003).
It has also been suggested that a better aim of CR items is not to measure the same
trait, but to measure some cognitive processes better than MC items can (Traub, 1993).
This is consistent with a stronger positive relationship between reform practices and
student achievement when measured with CR items rather than MC items (Klein et al.,
2000).
Psychometric Differences Between Item Formats
The psychometric properties of CR items have also been a point of contention in
the literature. Messick (1993) asserted that the question of trait equivalence was
essentially a question of construct validity of test interpretation and use. Furthermore,
this pragmatic view is tolerated to keep testing effective for its intended purposes. The two
major threats to construct validity, construct-irrelevant variance and construct
underrepresentation (Messick, 1993), can be judiciously negotiated for the purposes of
demonstrating construct relevant variance with due representation of the construct.
Messick posits that if the hypothesis of perfectly correlated true scores across
formats is not rejected, then either format, or a combination of both, could be employed as construct indicators. If, however, the hypothesis is rejected, likely due to differential method variance, a dominant construct-relevant factor, or a second-order factor, may still cut across formats (Messick, 1993). Messick (1995) also later provided a thorough treatment
on the broader systematic validation of performance assessments within the framework of
the unified concept of validity (AERA, APA, & NCME, 1999).
Returning to the issue of reliability raised earlier, Thissen and Wainer's (2001)
suggestion that MC items more reliably measure a CR factor may, to paraphrase, come
about because,
measuring something that is not quite right accurately may yield far better
measurement than measuring the right thing poorly.
Lastly, we are warned that however appealing CR items may be as exemplars of
instruction or student performance, we must exercise caution when drawing inferences
about people or changes in instruction due to their relatively limited generalizability
across tasks (Brennan, 1995; Dunbar et al., 1991). In an analysis of state-mandated
programs, Miller (2002) concluded that other advantages, such as consequences for
instruction, are necessary to justify these concessions.
Combining Item Formats
Methods on how to best combine the use of CR items with MC items have also
garnered much needed attention. Mehrens (1992) put it succinctly when stating that we
have known for decades that MC items measure some things well and efficiently, but
they do not measure everything and they can be overemphasized. Performance
assessments, on the other hand, have the potential to measure important instructional
objectives that cannot be measured by MC items. He also notes that most large-scale
assessments have added performance assessments to the more traditional tests, and not
replaced them.
Wainer and Thissen (1993) add that combining the two formats may allow for the
concatenation of their strengths while compensating for their weaknesses. More
specifically, if a test has various components, it is sensible to weigh the components as a
function of their reliabilities, or modify the lengths of the components to make them
equally reliable. Given difficulties in execution of the latter, it was recommended that
serious consideration be given to IRT weighting in the construction and scoring of tests
with mixed item formats. Just such a process is employed in the construction of the
Florida Comprehensive Assessment Test (FCAT) via an interactive test construction
system, which makes available the capability to store, retrieve, and manipulate IRT test
information, test error, and expected scores (FDOE, 2002).
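As a simple illustration of this idea (a sketch of reliability weighting in general, not the FCAT's operational scoring rule), a composite of a multiple-choice section score and a constructed-response section score could be written as

    X = w_MC * X_MC + w_CR * X_CR, with w_MC + w_CR = 1,

where one plausible choice sets each weight in proportion to that section's reliability coefficient. IRT weighting refines the same idea by drawing the weights from the test information of each section rather than from a single reliability coefficient per section.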
Differential Effects of Item Formats on Various Groups
Another major segment of the literature seeks differential effects that CR items may have across gender, race, socio-economic status, and disability.
These contrasts are most timely, given the nature of the Adequate Yearly Progress
requirements of the No Child Left Behind Act of 2001 where, by design, data is
disaggregated by such groups.
In an analysis of open-ended counterparts to a set of items from the quantitative
section of the Graduate Record Exam, Bridgeman (1992) concluded that gender and
ethnicity differences were neither lessened nor exaggerated and that there were no
significant interactions of test format with either gender or ethnicity. Similarly, in a
meta-analysis of over 30 studies, Ryan and DeMark (2002) conclude that females typically
outperform males in mathematics and language when students must construct their
responses, but effect sizes are small or smaller by Cohen's (1988) standards.
There have been mixed results on the inclusion of students with disabilities in high-
stakes standards-based assessments. DeFur (2002) summarizes one such example in
Virginia's experiences with its Standards of Learning and an inclusion policy.
Conspicuous by their absence are studies on the differential effects of item formats involving
students with disabilities. This may soon draw needed attention with the March 2003
announcement of a statutory NCLB provision. Effective for the 2003-2004 academic
year, it states that in calculating AYP, at most 1% of all students tested can be held to
alternative achievement standards at the district and state levels (Paige, 2003). The
provision was published in the Federal Register, Vol. 68, No. 236, on December 9, 2003.
CHAPTER 3
METHODS
Data
The electronic file received by a mid-sized school district in May 2002 from the
Florida Department of Education (FDOE) was analyzed. This file contained Florida
Comprehensive Assessment Test (FCAT) scores from the 2001 and 2002 administrations.
The data for Florida had changed in 2001 in two key ways. Beginning with the 2001
administration, the FCAT program was expanded to include all grades 3 through 10.
Prior to this, reading was assessed only in grades 4, 8, and 10, while mathematics was
assessed only in grades 5, 8, and 10. Secondly, these changes offered the opportunity to
introduce a Developmental Scale Score (DSS) that linked adjacent grades together
allowing for progress to be tracked over time.
Sample
The data set consisted of 2,402 low performing students in reading and 2,506 low
performing students in mathematics in grades 6 through 10 that had a score from both the
2001 and 2002 FCAT administrations. Table 3-1 shows the composition of the students
by grade. Low performing is defined as remaining within the Basic or Below Basic
performance standards commonly referred to as Level 1 and Level 2 (FDOE, 2003c).
Table 3-1. Matched Low Performing Students by Grade for Reading and Mathematics
Grade Reading Mathematics
6 498 569
7 475 557
8 477 533
9 489 462
10 463 385
Disaggregation of low performing students by sex, race, socio-economic status,
and disability is shown in Table 3-2 for reading and mathematics. For comparison,
disaggregation of all matched district students in grades 6 through 10 is also provided in Table 3-2.
Table 3-2. Disaggregation of Matched District and Low Performing Students by
Subgroup for Grades 6 through 10 for Reading and Mathematics
Reading Mathematics
Low Low
Performing Performing
All Students Students All Students Students
Agg 8,399 2,402 8,404 2,506
Sex F 4,307 (51%) 1,158 (48%) 4,309 (51%) 1,300 (52%)
M 4,092 (49%) 1,244 (52%) 4,095 (49%) 1,206 (48%)
Race B 3,158 (38%) 1,646 (69%) 3,161 (38%) 1,767 (71%)
H 326 (4%) 77 328 (4%) 70
W 4,915 (59%) 756 (31%) 4,915 (58%) 739 (29%)
SES ED 3,271 (39%) 1,543 (64%) 3,275 (39%) 1,671 (67%)
N-ED 5,128 (61%) 859 (36%) 5,128 (61%) 835 (33%)
SWD Gift 853 (10%) 16 852 (10%) 11
RES 5,928 (71%) 1,311 (55%) 5,928 (71%) 1,430 (57%)
SWD 1,618 (19%) 1,075 (45%) 1,624 (19%) 1,065 (42%)
Note. Agg = Aggregate, F = Female, M = Male, B = Black, H = Hispanic, W = White,
SES = Socio-economic Status, ED = Economically Disadvantaged, N-ED = Non-
Economically Disadvantaged, SWD = Students With Disabilities, Gift = Gifted Students,
RES = Regular Education Students.
Gifted students and Hispanic students were omitted from analysis due to small
sample sizes. Disaggregated reporting for Adequate Yearly Progress (AYP) is not
required when the number of students is insufficient to yield statistically reliable
information or would reveal personally identifiable information (NCLB, 2002). Linn et
al. (2002) suggested a minimum number of 25. The FDOE subscribes to a minimum
group size of 30 for school performance grades (FDOE, 2003b) and for AYP calculations
(FDOE, 2003c). In the analyses that follow, subgroups of low performing students are
further disaggregated by grade and, in all cases, gifted students and Hispanic students fail
to meet even Linn's suggestion. In several cases, group size was as low as zero or one.
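This screening rule can be summarized in a short sketch; the grade-level counts and dictionary layout below are hypothetical values for illustration, and only the 30-student threshold comes from the FDOE rule cited above:

    # Drop subgroups whose matched, grade-level sample falls below the minimum
    # reporting size. The 30-student threshold is the FDOE rule cited above;
    # the counts shown are hypothetical values for a single grade level.
    MIN_GROUP_SIZE = 30

    grade_6_counts = {"Gifted": 3, "Hispanic": 14, "Black female": 189, "White male": 64}
    reportable = {group: n for group, n in grade_6_counts.items() if n >= MIN_GROUP_SIZE}
    print(reportable)  # Gifted and Hispanic students are excluded from the analysis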
Figure 3-1. FCAT Achievement Levels for the Developmental Scale. Source: Understanding FCAT Reports (FDOE, 2003a).
The developmental scale links FCAT performance across grades 3 through 10 and was established with the Spring 2001 administration. IRT metrics have been shown to be preferable to grade equivalents when examining such longitudinal patterns, as they are sensitive to varying rates of growth over time (Seltzer et al., 1994).
Measures
Reading and mathematics components of FCAT based on the Sunshine State
Standards (SSS) were examined. Commonly referred to as FCAT SSS Reading and
FCAT SSS Mathematics, they are the criterion-referenced components of the statewide
assessment program. The resulting Developmental Scale Scores (DSS) were analyzed.
Figure 3-1 shows the relationship between these scores and the corresponding
Achievement Levels by content area.
Analysis Approach
Components of both the Florida School Accountability System, as envisioned by
the A+ Plan for Education, and the federal No Child Left Behind Act of 2001 (NCLB)
were used to approach the question of differential growth for various groups of students
within the Achievement Levels of Basic and Below Basic focusing on grades with
constructed response items.
Within the Florida School Accountability System, making annual learning gains on
FCAT SSS Mathematics and FCAT SSS Reading account for three out of six categories
that are used in calculating school performance grades. Special attention is given to the
FCAT SSS Reading gains of the lowest 25% of students in each school (see Appendix
A). As of 2002, students can demonstrate gains via three different alternatives (FDOE,
2003b).
1. When students improve their FCAT Achievement Levels (1-2, 2-3, 3-4, 4-5); or
2. When students maintain a relatively high Achievement Level (3, 4 or 5); or
3. When students demonstrate more than one year's growth within Levels 1 or 2, as
measured by an increase in their FCAT Developmental Scale Scores from one year
to the next.
The focus of the analysis will be on the students eligible to demonstrate gains via Gain Alternative 3, because these students have demonstrated two consecutive years of Basic or Below Basic performance. Students that demonstrated
gains via Gain Alternative 1 have shown a substantive improvement in their performance.
For example, a student that has improved from Achievement Level 2 to Achievement
Level 3 has gone from a classification of Basic to Proficient (FDOE, 2003c). Students
that demonstrated gains via Gain Alternative 2 have shown consistently high
performance as Achievement Level 3 or higher is classified as Proficient or Advanced
(FDOE, 2003c). Furthermore, this approach speaks directly to the policy at hand and utilizes the FCAT Developmental Scale Scores as they are intended.
Five analyses were performed, the first of which is for all matched low performing
students not making gains by achievement level criteria. The remaining four address
subgroups that are suggested (CCSSO, 2002a) or required (NCLB, 2002) when
disaggregating school, district, and state data in calculating the Adequate Yearly Progress
(AYP) components of NCLB. To make AYP, the percentage of students earning a score
of Proficient or above in reading and mathematics has to meet or exceed the annual
objectives for the given year (see Appendix B and C). Though all students in the
analyses are non-Proficient, tracking the composition and progress of these groups could
lead to an understanding that may facilitate policy and instruction.
Comparisons of item format effects in measuring one year's growth will be made
between FCAT SSS Reading and FCAT SSS Mathematics as well as across grades 6
through 10 within each content area. Lastly, differential growth within each grade for
both content areas will be discussed.
To further specify what constitutes one year's growth as defined in the Florida
School Accountability System, it is necessary to revisit the developmental scale,
specifically as it applies to Gain Alternative 3. The definition is based on the numerical
cut-scores for the FCAT Achievement Levels that have been approved by the State Board
of Education. The following steps were applied to the cut scores, separately, for each
subject and each grade-level pair (FDOE, 2003b). Per State Board Rule 6A-1.09422,
there are four cut scores that separate DSS into five Achievement Levels. The increase in
the DSS necessary to maintain the same relative standing within Achievement Levels
from one grade to the next was calculated for each of the four cut scores between the five
Achievement Levels. The median value of these four differences was determined to best
represent the entire student population. Median gain expectations were then calculated
for each grade and a logarithmic curve was fitted. Others were considered but the
logarithmic curve was adopted because it best captures the theoretical expectation of
greater gains in the early grades due to student maturation. Graphs of these curves and
the resulting expected growth on the developmental scale are shown in Figure 3-2 for
reading and mathematics.
Gain Alternative 3: Expected Growth
           Gr 3-4  Gr 4-5  Gr 5-6  Gr 6-7  Gr 7-8  Gr 8-9  Gr 9-10
Reading      230     186     133     110      92      77       77
Math         182     110      95      78      64      54       48
Figure 3-2. Expected Growth under Gain Alternative 3. Source: Guide to Calculating School Grades Technical Assistance Paper (FDOE, 2003b).
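One way such a curve could be produced is sketched below; the fit uses numpy's least-squares polyfit on the logarithm of the grade, with the published reading expectations from Figure 3-2 standing in for the underlying median gains, so it illustrates the shape of the procedure rather than reproducing the FDOE's exact computation.

    # Fit a logarithmic curve, gain ~ a + b*ln(grade), to grade-pair gain values.
    # The values below are the published reading expectations from Figure 3-2,
    # used here only as stand-in data for the underlying median gains.
    import numpy as np

    upper_grade = np.array([4, 5, 6, 7, 8, 9, 10])
    gain = np.array([230, 186, 133, 110, 92, 77, 77], dtype=float)

    b, a = np.polyfit(np.log(upper_grade), gain, 1)   # slope, intercept
    for g, fitted in zip(upper_grade, a + b * np.log(upper_grade)):
        print(f"grade {g - 1} to {g}: fitted expected growth = {fitted:.0f}")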
To be denoted as making gains under Gain Alternative 3, students must
demonstrate more than the expected growth on the developmental scale. Therefore, they
must score at least one developmental scale score point more than the values listed above.
These criteria were applied to mean differences for the matched low performing students.
This was done for the aggregate for each grade 6 through 10 and, in the interest of the
NCLB requirements, by sex, sex and race, socio-economic status, and disability.
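A minimal sketch of applying that criterion follows; the expected-growth values are taken from Figure 3-2, while the function name and the shape of the input data are assumptions made for illustration:

    # Gain Alternative 3 check: the mean DSS change for a group must exceed the
    # expected growth for its grade pair (values from Figure 3-2). The data
    # layout and function name are illustrative, not the FDOE's implementation.
    EXPECTED_GROWTH = {
        "reading":     {"5-6": 133, "6-7": 110, "7-8": 92, "8-9": 77, "9-10": 77},
        "mathematics": {"5-6": 95,  "6-7": 78,  "7-8": 64, "8-9": 54, "9-10": 48},
    }

    def met_expectations(dss_2001, dss_2002, subject, grade_pair):
        """Return True if the mean DSS gain exceeds the expected growth."""
        mean_diff = sum(b - a for a, b in zip(dss_2001, dss_2002)) / len(dss_2001)
        return mean_diff > EXPECTED_GROWTH[subject][grade_pair]

    # Example: a matched group whose mean reading DSS rose from 1,257 to 1,353
    # (a gain of 96 points) between grade 7 and grade 8.
    print(met_expectations([1257] * 10, [1353] * 10, "reading", "7-8"))  # True, 96 > 92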
As noted earlier, other issues, though admittedly relevant, will be set aside in an effort to work within the existing framework of the state and federal accountability systems. These include, but are not limited to: the longstanding debate on how to yield
gain scores (Cronbach & Furby, 1970; Lord, 1963); the discrepancies surrounding the
various methods of vertically equating tests, especially when including multiple formats,
and their respective implications (Crocker & Algina, 1986; Slinde & Linn, 1977; Thissen
& Wainer, 2001); and the more recent debate over the effects of high-stakes testing (Amrein &
Berliner, 2002; Carnoy & Loeb, 2002).
CHAPTER 4
RESULTS
Aggregate Developmental Scale Score Analysis
Reading
Sample sizes, means, and standard deviations for grade level aggregate data for
the Florida Comprehensive Assessment Test Sunshine State Standards (FCAT SSS)
Reading Developmental Scale Scores (DSS) are shown in Table 4-1.
Table 4-1. Mean Aggregate Reading DSS for Matched Low Performing Students not
Making Gains by Achievement Level Criteria
Grade Year n M (SD)
6 2002 498 1,074 (266)
2001 498 1,031 (294)
7 2002 475 1,245 (245)
2001 475 1,116 (255)
8 2002 477 1,353 (258)
2001 477 1,257 (254)
9 2002 489 1,525 (264)
2001 489 1,374 (251)
10 2002 463 1,590 (275)
2001 463 1,592 (221)
Mean differences and comparisons to expected growth for grade level aggregate
data for the FCAT SSS Reading DSS are shown in Table 4-2.
Table 4-2. Mean Differences for Aggregate Reading DSS for Matched Low Performing
Students not Making Gains by Achievement Level Criteria from 2001 to 2002
Grades n DSS Mean Difference Expected Growth Met Expectations
5 to 6 498 43 133 No
6 to 7 475 129 110 Yes
7 to 8a 477 96 92 Yes
8 to 9 489 150 77 Yes
9 to 10a 463 -2 77 No
a Grades 8 and 10 utilize constructed-response items.
Mathematics
Sample sizes, means, and standard deviations for grade level aggregate data for
the FCAT SSS Mathematics DSS are shown in Table 4-3.
Table 4-3. Mean Aggregate Mathematics DSS for Matched Low Performing Students
not Making Gains by Achievement Level Criteria
Grade Year n M (SD)
6 2002 569 1,314 (235)
2001 569 1,123 (227)
7 2002 557 1,360 (262)
2001 557 1,290 (257)
8 2002 533 1,459 (242)
2001 533 1,307 (272)
9 2002 462 1,587 (220)
2001 462 1,438 (241)
10 2002 385 1,689 (190)
2001 385 1,636 (206)
Mean differences and comparisons to expected growth for grade level aggregate
data for the FCAT SSS Mathematics DSS are shown in Table 4-4.
Table 4-4. Mean Differences for Aggregate Mathematics DSS for Matched Low
Performing Students not Making Gains by Achievement Level Criteria from
2001 to 2002
Grades n DSS Mean Difference Expected Growth Met Expectations
5 to 6 569 190 95 Yes
6 to 7 557 70 78 No
7 to 8a 533 151 64 Yes
8 to 9 462 149 54 Yes
9 to 10a 385 53 48 Yes
a Grades 8 and 10 utilize constructed-response items.
Developmental Scale Score Contrasts by Sex
Reading
Sample sizes, means, and standard deviations for grade level data for the FCAT SSS Reading DSS by sex are shown in Table 4-5.
Table 4-5. Mean Reading DSS by Sex for Matched Low Performing Students not Making Gains by Achievement Level Criteria
Female Male
Grade Year n M (SD) n M (SD)
6 2002 241 1,115 (247) 257 1,035 (277)
2001 241 1,060 (269) 257 1,003 (313)
7 2002 225 1,288 (234) 250 1,206 (247)
2001 225 1,156 (237) 250 1,079 (266)
8 2002 223 1,388 (259) 254 1,321 (253)
2001 223 1,304 (236) 254 1,215 (262)
9 2002 237 1,583 (231) 252 1,470 (281)
2001 237 1,413 (235) 252 1,338 (261)
10 2002 232 1,601 (284) 231 1,580 (266)
2001 232 1,608 (208) 231 1,576 (233)
Mean differences and comparisons to expected growth for grade level data for the
FCAT SSS Reading DSS by sex are shown in Table 4-6.
Table 4-6. Mean Differences for Reading DSS by Sex for Matched Low Performing
Students not Making Gains by Achievement Level Criteria from 2001 to 2002
Expected Met
Grades Group n DSS Mean Difference Growth Expectations
5 to 6 F 241 55 133 No
M 257 31 133 No
6 to 7 F 225 132 110 Yes
M 250 126 110 Yes
7 to 8a F 223 84 92 No
M 254 105 92 Yes
8 to 9 F 237 170 77 Yes
M 252 132 77 Yes
9 to 10a F 232 -7 77 No
M 231 4 77 No
Note. F = Female, M = Male. a Grades 8 and 10 utilize constructed-response items.
Mathematics
Sample sizes, means, and standard deviations for grade level data for the FCAT
SSS Mathematics DSS by sex are shown in Table 4-7.
Table 4-7. Mean Mathematics DSS by Sex for Matched Low Performing Students not
Making Gains by Achievement Level Criteria
Female Male
Grade Year n M (SD) n M (SD)
6 2002 303 1,332 (213) 266 1,294 (256)
2001 303 1,127 (223) 266 1,120 (231)
7 2002 295 1,398 (245) 262 1,317 (273)
2001 295 1,325 (237) 262 1,251 (274)
8 2002 255 1,487 (238) 278 1,433 (244)
2001 255 1,357 (254) 278 1,262 (280)
9 2002 235 1,604 (210) 227 1,569 (229)
2001 235 1,476 (220) 227 1,397 (255)
10 2002 212 1,692 (184) 173 1,686 (198)
2001 212 1,643 (201) 173 1,628 (212)
Mean differences and comparisons to expected growth for grade level data for the
FCAT SSS Mathematics DSS by sex are shown in Table 4-8.
Table 4-8. Mean Differences for Mathematics DSS by Sex for Matched Low Performing
Students not Making Gains by Achievement Level Criteria from 2001 to 2002
Expected Met
Grades Group n DSS Mean Difference Growth Expectations
5 to 6 F 303 205 95 Yes
M 266 174 95 Yes
6 to 7 F 295 72 78 No
M 262 67 78 No
7 to 8a F 255 130 64 Yes
M 278 171 64 Yes
8 to 9 F 235 128 54 Yes
M 227 172 54 Yes
9 to 10a F 212 49 48 Yes
M 173 57 48 Yes
Note. F = Female, M = Male. a Grades 8 and 10 utilize constructed-response items.
Developmental Scale Score Contrasts by Sex and Race
Reading
Sample sizes, means, and standard deviations for grade level data for the FCAT SSS Reading DSS by sex and race are shown in Table 4-9.
Table 4-9. Mean Reading DSS by Sex and Race for Matched Low Performing Students not Making Gains by Achievement Level Criteria
Female Male
Black White Black White
Grade Year n M n M n M n M
6 2002 189 1,090 52 1,206 193 1,016 64 1,094
(246) (231) (264) (307)
2001 189 1,034 52 1,153 193 986 64 1,059
(277) (216) (305) (335)
7 2002 166 1,266 59 1,350 156 1,178 94 1,251
(241) (202) (250) (237)
2001 166 1,130 59 1,229 156 1,028 94 1,164
(244) (200) (271) (235)
8 2002 148 1,352 75 1,460 166 1,277 88 1,404
(259) (245) (262) (213)
2001 148 1,293 75 1,326 166 1,189 88 1,267
(211) (279) (253) (273)
9 2002 157 1,541 80 1,665 172 1,410 80 1,597
(228) (216) (278) (242)
2001 157 1,362 80 1,514 172 1,286 80 1,449
(237) (197) (253) (244)
Table 4-9 Continued.
Female Male
Black White Black White
Grade Year n M n M n M n M
10 2002 148 1,533 84 1,720 151 1,512 80 1,708
(303) (196) (274) (194)
2001 148 1,563 84 1,688 151 1,540 80 1,644
(211) (177) (220) (241)
Mean differences and comparisons to expected growth for grade level data for the
FCAT SSS Reading DSS by sex and race are shown in Table 4-10.
Table 4-10. Mean Differences for Reading DSS by Sex and Race for Matched Low
Performing Students not Making Gains by Achievement Level Criteria from
2001 to 2002
Expected Met
Grades Group n DSS Mean Difference Growth Expectations
5 to 6 BF 189 56 133 No
BM 193 30 133 No
WF 52 54 133 No
WM 64 34 133 No
6 to 7 BF 166 136 110 Yes
BM 156 150 110 Yes
WF 59 121 110 Yes
WM 94 87 110 No
7 to 8a BF 148 59 92 No
BM 166 89 92 No
WF 75 133 92 Yes
WM 88 138 92 Yes
8 to 9 BF 157 179 77 Yes
BM 172 125 77 Yes
WF 80 152 77 Yes
WM 80 148 77 Yes
9 to 10a BF 148 -30 77 No
BM 151 -28 77 No
WF 84 33 77 No
WM 80 64 77 No
Note. BF = Black Female, BM = Black Male, WF = White Female, WM = White Male.
a Grades 8 and 10 utilize constructed-response items.
Mathematics
Sample sizes, means, and standard deviations for grade level data for the FCAT SSS Mathematics DSS by sex and race are shown in Table 4-11.
Table 4-11. Mean Mathematics DSS by Sex and Race for Matched Low Performing Students not Making Gains by Achievement Level Criteria
Female Male
Black White Black White
Grade Year n M n M n M n M
6 2002 220 1,294 83 1,433 185 1,247 81 1,403
(220) (153) (262) (206)
2001 220 1,087 83 1,233 185 1,082 81 1,208
(227) (174) (230) (210)
7 2002 203 1,342 92 1,520 176 1,263 86 1,428
(259) (155) (280) (224)
2001 203 1,276 92 1,433 176 1,199 86 1,356
(244) (179) (286) (212)
8 2002 180 1,462 75 1,546 183 1,390 95 1,516
(239) (226) (241) (227)
2001 180 1,334 75 1,412 183 1,208 95 1,365
(252) (252) (270) (271)
9 2002 170 1,578 65 1,672 178 1,543 49 1,666
(214) (184) (238) (163)
2001 170 1,439 65 1,573 178 1,370 49 1,496
(222) (184) (252) (243)
10 2002 144 1,654 68 1,771 128 1,656 45 1,770
(192) (134) (207) (140)
2001 144 1,601 68 1,732 128 1,604 45 1,697
(207) (155) (219) (171)
Mean differences and comparisons to expected growth for grade level data for the FCAT SSS Mathematics DSS by sex and race are shown in Table 4-12.
Table 4-12. Mean Differences for Mathematics DSS by Sex and Race for Matched Low Performing Students not Making Gains by Achievement Level Criteria from 2001 to 2002
Grades    Group  n    DSS Mean Difference  Expected Growth  Met Expectations
5 to 6    BF     220  207                  95               Yes
          BM     185                       95               Yes
          WF      83                       95               Yes
          WM      81                       95               Yes
6 to 7    BF     203   66                  78               No
          BM     176   64                  78               No
          WF      92   87                  78               Yes
          WM      86   72                  78               No
7 to 8a   BF     180  128                  64               Yes
          BM     183  182                  64               Yes
          WF      75  134                  64               Yes
          WM      95  150                  64               Yes
8 to 9    BF     170  139                  54               Yes
          BM     178  172                  54               Yes
          WF      65   99                  54               Yes
          WM      49  170                  54               Yes
9 to 10a  BF     144   53                  48               Yes
          BM     128   52                  48               Yes
          WF      68   40                  48               No
          WM      45   73                  48               Yes
Note. BF = Black Female, BM = Black Male, WF = White Female, WM = White Male.
a Grades 8 and 10 utilize constructed-response items.
Developmental Scale Score Contrasts by Socio-economic Status
Reading
Sample sizes, means, and standard deviations for grade level data for the FCAT
SSS Reading DSS by socio-economic status (SES) are shown in Table 4-13.
Table 4-13. Mean Reading DSS by SES for Matched Low Performing Students not
Making Gains by Achievement Level Criteria
Economically Non-Economically
Disadvantaged Disadvantaged
Grade Year n M (SD) n M (SD)
6 2002 398 1,055 (266) 100 1,149 (253)
2001 398 1,004 (295) 100 1,134 (262)
7 2002 337 1,223 (254) 138 1,297 (212)
2001 337 1,093 (257) 138 1,172 (243)
8 2002 323 1,314 (264) 154 1,433 (225)
2001 323 1,239 (247) 154 1,294 (266)
9 2002 280 1,465 (272) 209 1,604 (231)
2001 280 1,329 (252) 209 1,434 (238)
10 2002 205 1,510 (295) 258 1,655 (240)
2001 205 1,525 (235) 258 1,645 (194)
Mean differences and comparisons to expected growth for grade level data for the
FCAT SSS Reading DSS by SES are shown in Table 4-14.
Table 4-14. Mean Differences for Reading DSS by SES for Matched Low Performing
Students not Making Gains by Achievement Level Criteria from 2001 to 2002
Expected Met
Grades Group n DSS Mean Difference Growth Expectations
5 to 6 ED 398 50 133 No
N-ED 100 15 133 No
6 to 7 ED 337 130 110 Yes
N-ED 138 125 110 Yes
7 to 8a ED 323 75 92 No
N-ED 154 138 92 Yes
8 to 9 ED 280 135 77 Yes
N-ED 209 169 77 Yes
9 to 10a ED 205 -16 77 No
N-ED 258 9 77 No
Note. ED = Economically Disadvantaged, N-ED = Non-Economically Disadvantaged.
a Grades 8 and 10 utilize constructed-response items.
Mathematics
Sample sizes, means, and standard deviations for grade level data for the FCAT SSS Mathematics DSS by SES are shown in Table 4-15.
Table 4-15. Mean Mathematics DSS by SES for Matched Low Performing Students not Making Gains by Achievement Level Criteria
Economically Disadvantaged Non-Economically Disadvantaged
Grade Year n M (SD) n M (SD)
6 2002 436 1,294 (237) 133 1,381 (217)
2001 436 1,103 (227) 133 1,191 (212)
7 2002 385 1,319 (265) 172 1,452 (229)
2001 385 1,248 (264) 172 1,386 (212)
8 2002 365 1,426 (245) 168 1,529 (220)
2001 365 1,277 (265) 168 1,372 (277)
9 2002 293 1,558 (225) 169 1,637 (203)
2001 293 1,417 (228) 169 1,473 (258)
10 2002 192 1,649 (198) 193 1,728 (174)
2001 192 1,588 (231) 193 1,684 (164)
Mean differences and comparisons to expected growth for grade level data for the FCAT SSS Mathematics DSS by SES are shown in Table 4-16.
Table 4-16. Mean Differences for Mathematics DSS by SES for Matched Low Performing Students not Making Gains by Achievement Level Criteria from 2001 to 2002
Grades Group n DSS Mean Difference Expected Growth Met Expectations
5 to 6 ED 436 191 95 Yes
N-ED 133 190 95 Yes
6 to 7 ED 385 71 78 No
N-ED 172 66 78 No
7 to 8a ED 365 149 64 Yes
N-ED 168 157 64 Yes
8 to 9 ED 293 141 54 Yes
N-ED 169 164 54 Yes
9 to 10a ED 192 61 48 Yes
N-ED 193 45 48 No
Note. ED = Economically Disadvantaged, N-ED = Non-Economically Disadvantaged.
a Grades 8 and 10 utilize constructed-response items.
Developmental Scale Score Contrasts for Students with Disabilities and Regular
Education Students
Reading
Sample sizes, means, and standard deviations for grade level data for the FCAT
SSS Reading DSS for students with disabilities and regular education students are shown
in Table 4-17. Gifted students were omitted due to a small sample size.
Table 4-17. Mean Reading DSS for Matched Low Performing Students with Disabilities and Regular Education Students not Making Gains by Achievement Level Criteria
Students with Disabilities Regular Education Students
Grade Year n M (SD) n M (SD)
6 2002 252 980 (271) 240 1,165 (223)
2001 252 899 (317) 240 1,162 (189)
7 2002 240 1,148 (252) 234 1,344 (192)
2001 240 1,018 (254) 234 1,215 (216)
8 2002 219 1,259 (241) 252 1,428 (246)
2001 219 1,132 (249) 252 1,360 (208)
9 2002 212 1,376 (279) 276 1,638 (184)
2001 212 1,251 (256) 276 1,468 (203)
10 2002 152 1,435 (255) 309 1,666 (252)
2001 152 1,444 (232) 309 1,664 (175)
Mean differences and comparisons to expected growth for grade level data for the
FCAT SSS Reading DSS for students with disabilities and regular education students are
shown in Table 4-18. Gifted students were omitted due to a small sample size.
Table 4-18. Mean Differences for Reading DSS for Matched Low Performing Students
with Disabilities and Regular Education Students not Making Gains by
Achievement Level Criteria from 2001 to 2002
Expected Met
Grades Group n DSS Mean Difference Growth Expectations
5 to 6 SWD 252 82 133 No
RES 240 3 133 No
6 to 7 SWD 240 130 110 Yes
RES 234 129 110 Yes
7 to 8a SWD 219 126 92 Yes
RES 252 68 92 No
8 to 9 SWD 212 124 77 Yes
RES 276 170 77 Yes
9 to 10a SWD 152 -9 77 No
RES 309 2 77 No
Note. SWD = Students With Disabilities, RES = Regular Education Students.
a Grades 8 and 10 utilize constructed-response items.
Mathematics
Sample sizes, means, and standard deviations for grade level data for the FCAT
SSS Mathematics DSS for students with disabilities and regular education students are
shown in Table 4-19. Gifted students were omitted due to a small sample size.
Table 4-19. Mean Mathematics DSS for Matched Low Performing Students with
Disabilities and Regular Education Students not Making Gains by
Achievement Level Criteria
Students with Disabilities Regular Education Students
Grade Year n M (SD) n M (SD)
6 2002 246 1,208 (254) 319 1,393 (182)
2001 246 1,026 (234) 319 1,196 (191)
7 2002 244 1,215 (266) 309 1,471 (194)
2001 244 1,157 (270) 309 1,392 (191)
8 2002 233 1,341 (243) 297 1,548 (199)
2001 233 1,159 (261) 297 1,419 (220)
9 2002 198 1,473 (247) 264 1,673 (149)
2001 198 1,307 (236) 264 1,536 (194)
10 2002 144 1,573 (206) 241 1,758 (141)
2001 144 1,513 (240) 241 1,710 (137)
Mean differences and comparisons to expected growth for grade level data for the
FCAT SSS Mathematics DSS for students with disabilities and regular education students
are shown in Table 4-20. Gifted students were omitted due to a small sample size.
Table 4-20. Mean Differences for Mathematics DSS for Matched Low Performing
Students with Disabilities and Regular Education Students not Making Gains
by Achievement Level Criteria from 2001 to 2002
Expected Met
Grades Group n DSS Mean Difference Growth Expectations
5 to 6 SWD 246 182 95 Yes
RES 319 197 95 Yes
6 to 7 SWD 244 58 78 No
RES 309 79 78 Yes
7 to 8a SWD 233 182 64 Yes
RES 297 128 64 Yes
8 to 9 SWD 198 169 54 Yes
RES 264 137 54 Yes
9 to 10a SWD 144 61 48 Yes
RES 241 48 48 Yes
Note. SWD Students With Disabilities, RES Regular Education Students.
a Grades 8 and 10 utilize constructed-response items.
Summary
Subgroups of low performing students demonstrating more than one year's
growth, as measured by an increase in their FCAT Developmental Scale Scores from one
year to the next, are indicated in Table 4-21.
Table 4-21. Subgroups of Low Performing Students Demonstrating More than One
Year's Growth in Reading and Mathematics
Group Reading Mathematics
F 6-7, 8-9 5-6, 7-8, 8-9, 9-10
M 6-7, 7-8, 8-9 5-6, 7-8, 8-9, 9-10
BF 6-7, 8-9 5-6, 7-8, 8-9, 9-10
BM 6-7, 8-9 5-6, 7-8, 8-9, 9-10
WF 6-7, 7-8, 8-9 5-6, 6-7, 7-8, 8-9
WM 7-8, 8-9 5-6, 7-8, 8-9, 9-10
ED 6-7, 8-9 5-6, 7-8, 8-9, 9-10
N-ED 6-7, 7-8, 8-9 5-6, 7-8, 8-9
SWD 6-7, 7-8, 8-9 5-6, 7-8, 8-9, 9-10
RES 6-7, 8-9 5-6, 6-7, 7-8, 8-9, 9-10
Note. Entries are the grade transitions (e.g., 5-6 = grade 5 to grade 6) for which the
subgroup demonstrated more than one year's growth. F = Female, M = Male, B = Black,
W = White, ED = Economically Disadvantaged, N-ED = Non-Economically
Disadvantaged, SWD = Students With Disabilities, RES = Regular Education Students.
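The gain criterion applied throughout the tables in this chapter can be summarized
computationally. The following Python sketch is illustrative only and is not the district's
or the FDOE's actual procedure; the function name and the sample scores are
hypothetical, while the expected-growth values are the ones reported in the tables above.

EXPECTED_GROWTH = {
    ("reading", "5 to 6"): 133, ("reading", "6 to 7"): 110, ("reading", "7 to 8"): 92,
    ("reading", "8 to 9"): 77, ("reading", "9 to 10"): 77,
    ("mathematics", "5 to 6"): 95, ("mathematics", "6 to 7"): 78,
    ("mathematics", "7 to 8"): 64, ("mathematics", "8 to 9"): 54,
    ("mathematics", "9 to 10"): 48,
}

def met_expected_growth(subject, transition, dss_2001, dss_2002):
    # Mean DSS difference for matched students, compared with the expected-growth value.
    mean_diff = sum(b - a for a, b in zip(dss_2001, dss_2002)) / len(dss_2001)
    return mean_diff, mean_diff >= EXPECTED_GROWTH[(subject, transition)]

# Hypothetical matched scores for one subgroup moving from grade 9 to grade 10:
diff, met = met_expected_growth("mathematics", "9 to 10",
                                [1580, 1600, 1620], [1645, 1650, 1660])
print(round(diff), "Yes" if met else "No")   # 52 Yes

Because the students are matched across years, the mean of the individual differences
equals the difference of the yearly means reported in the tables.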
CHAPTER 5
DISCUSSION
What makes the isolation of item format effects in an existing accountability
system difficult is that comparison groups cannot be established without compromising
the very structure of that accountability system. The Florida Comprehensive Assessment
Test (FCAT) is administered to all eligible students, and the results are used for decisions
that can be of high stakes. Altering the composition of item formats for some students
while leaving it unchanged for all others, in order to draw such comparisons, would not
only have political repercussions but would simply not be tenable.
The approach taken here, instead, seeks the emergence of patterns from data
obtained from test administrations under actual operational conditions. This approach
has the added advantage of incorporating the first two years of results from what was, at
the time, a change in the FCAT testing program and its application to the Florida School
Accountability System. Despite the changes, including the addition of annual learning
gains to the calculation of school performance grades, continued improvement was
widely reported in 2002, and more so in 2003, when almost half of all schools earned an
A grade.
The first results under the federal No Child Left Behind Act of 2001 (NCLB) also
became available in 2003. For Florida, these results were not as encouraging: just over
400 schools out of 3,000 met all requirements for making Adequate Yearly Progress
(AYP). These mixed messages, along with recent results of the National Assessment of
Educational Progress, lead to the conclusion that, while Florida as a whole is making
progress, it is not yet enough, and much remains to be done.
Toward this end, the analyses presented here focused on students within
Achievement Level 1 or 2, that is, students performing at the Basic and Below Basic
levels. The 2003 NCLB results show that the district from which the data for these
analyses were drawn did not make AYP. A closer look at the subgroups shows that Black
students and students with disabilities did not meet the required percentage proficient in
reading, with economically disadvantaged students coming close to not meeting the
criterion. For mathematics, none of these subgroups made AYP. Although the analyses
are from 2001 and 2002 and focus on grades 6 through 10, the discussion below may
help clarify these and other patterns.
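As a rough illustration of the AYP determination described above, the sketch below
checks each reportable subgroup's percent proficient against an annual objective. It is a
deliberate simplification: participation rates, safe-harbor provisions, and the other
elements of the state plan are ignored, the subgroup counts are hypothetical, and the 38%
objective is borrowed from the mathematics starting point in Appendix C.

MIN_GROUP_SIZE = 30  # minimum reportable subgroup size used in the AYP calculations

def subgroup_meets_objective(n_tested, n_proficient, objective_pct):
    # Subgroups smaller than the minimum size are not evaluated separately.
    if n_tested < MIN_GROUP_SIZE:
        return True
    return 100.0 * n_proficient / n_tested >= objective_pct

def makes_ayp(subgroup_counts, objective_pct):
    # The school or district misses AYP if any reportable subgroup misses the objective.
    return all(subgroup_meets_objective(n, p, objective_pct)
               for n, p in subgroup_counts.values())

# Hypothetical (tested, proficient) counts checked against a 38% objective:
counts = {"all students": (2500, 1100), "black": (900, 300), "swd": (400, 120)}
print(makes_ayp(counts, 38.0))   # False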
Trends by Grade
Reading
Grade 6. Mean differences for all subgroups of low performing students did not
demonstrate one year's growth.
Grade 7. Mean differences for all subgroups of low performing students
demonstrated one year's growth except white males.
Grade 8. Mean differences for all subgroups of low performing students
demonstrated one year's growth except females, Black females, Black males,
economically disadvantaged students, and regular education students.
Grade 9. Mean differences for all subgroups of low performing students
demonstrated one year's growth.
Grade 10. Mean differences for all subgroups of low performing students did not
demonstrate one year's growth.
Mathematics
Grade 6. Mean differences for all subgroups of low performing students
demonstrated one year's growth.
Grade 7. Mean differences for all subgroups of low performing students except
white females and regular education students did not demonstrate one year's growth.
Grade 8. Mean differences for all subgroups of low performing students
demonstrated one year's growth.
Grade 9. Mean differences for all subgroups of low performing students
demonstrated one year's growth.
Grade 10. Mean differences for all subgroups of low performing students
demonstrated one year's growth except white females and non-economically
disadvantaged students.
Item Format Effects
Results indicated that constructed response (CR) items may be of benefit to many
low performing students in mathematics, but to a lesser extent in reading. All subgroups
made one year's growth in mathematics in both grades analyzed that utilize CR items,
with the exception of white females and non-economically disadvantaged students in
grade 10. Further inspection shows that developmental scale scores for both of these
subgroups were higher than those of all other subgroups in their respective analyses in
both 2001 and 2002.
Differential results between subgroups within grades where CR items are utilized
were most evident in reading. In grade 8, the aggregate demonstrated one year's growth,
though not by much. Every other disaggregation had at least one subgroup, and in most
cases more than half of all students, not demonstrating one year's growth. Females did
not make one year's growth, though they demonstrated higher developmental scale
scores than males for both years of interest. Black females and males did not make one
year's growth and scored lower than white females and males for both years of interest.
Economically disadvantaged students did not make one year's growth and scored lower
than non-economically disadvantaged students for both years of interest. Regular
education students did not make one year's growth, though they demonstrated higher
developmental scale scores than students with disabilities. In grade 10, no subgroup of
low performing students demonstrated one year's growth.
Closing Remarks
Utilizing constructed response items on statewide assessments has been reported
by those in the classroom to a) place a greater emphasis on higher-level thinking and
problem solving (Huebert & Hauser, 1998), b) provide a more transparent link to the
everyday curriculum, and c) increase motivation (CCSSO, 2002b). All of these are
aligned with the standards-based reform that is at the very root of the accountability
movement (NCTE, 1996; NCTM, 1995). These implications for teaching and learning
are too important to be ignored and lend evidence of consequential validity to the
inclusion of constructed response items on high-stakes tests (AERA, APA, & NCME,
1999).
Lastly, and most importantly, it has also been reported that constructed response
items lend construct validity to interpretations made from test scores (Messick, 1989;
Snow, 1993), though the more contemporary paradigm frames this as evidence for
content validity, inclusive of item format (AERA, APA, & NCME, 1999). Constructed
response items may approach the original sense of the term assessment, from the Latin
assidere, which means "to sit beside." Teacher and student working alongside one
another is what large-scale assessment yearns to be, and constructed response items may
well come closest to painting that picture.
APPENDIX A
GRADING FLORIDA PUBLIC SCHOOLS
Scoring High on the FCAT
The Florida Comprehensive Assessment Test (FCAT) is the primary measure of
students' achievement of the Sunshine State Standards. Student scores are classified
into five achievement levels, with 1 being the lowest and 5 being the highest.
Schools earn one point for each percent of students who score in achievement
levels 3, 4, or 5 in reading and one point for each percent of students who score 3, 4, or
5 in math.
The writing exam is scored by at least two readers on a scale of 1 to 6. The percent
of students scoring "3" and above is averaged with the percent scoring "3.5" and above
to yield the percent meeting minimum and higher standards. Schools earn one point for
each percent of students on the combined measure.
Which students are included in school grade calculations? As in previous years,
only standard curriculum students who were enrolled in the same school in both October
and February are included. Speech impaired, gifted, hospital/homebound, and Limited
English Proficient students with more than two years in an ESOL program are also
included.
What happens if the lowest 25% of students in the school do not make "adequate
progress" in reading? Schools that aspire to be graded "C" or above, but do not make
adequate progress with their lowest 25% in reading, must develop a School Improvement
Plan component that addresses this need. If a school, otherwise graded "C" or "B", does
not demonstrate adequate progress for two years in a row, the final grade will be reduced
by one letter grade.
Making Annual Learning Gains
Since FCAT reading and math exams are given in grades 3-10, it is now possible to
monitor how much students learn from one year to the next.
Schools earn one point for each percent of students who make learning gains in
reading and one point for each percent of students who make learning gains in math.
Students can demonstrate learning gains in any one of three ways:
(1) Improve achievement levels from 1-2, 2-3, 3-4, or 4-5; or
(2) Maintain within the relatively high levels of 3, 4, or 5; or
(3) Demonstrate more than one year's growth within achievement levels 1 or 2.
Special attention is given to the reading gains of students in the lowest 25% in
levels 1, 2, or 3 in each school. Schools earn one point for each percent of the lowest
performing readers who make learning gains from the previous year. It takes at least
50% to make "adequate progress" for this group.
SCHOOL PERFORMANCE GRADING SCALE
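The three learning-gain criteria reproduced in this appendix can be expressed as a
simple decision rule. The sketch below is a minimal illustration rather than the FDOE's
scoring code; the student records are hypothetical, all students are assumed to share one
grade transition, and the expected-growth value of 77 is the grade 9-to-10 reading value
reported in Chapter 4.

def made_learning_gain(level_2001, level_2002, dss_2001, dss_2002, expected_growth):
    if level_2002 > level_2001:                        # (1) improved an achievement level
        return True
    if level_2002 == level_2001 and level_2002 >= 3:   # (2) maintained level 3, 4, or 5
        return True
    if level_2002 <= 2 and (dss_2002 - dss_2001) > expected_growth:
        return True                                    # (3) more than one year's growth in level 1 or 2
    return False

def gain_points(students, expected_growth):
    # One school point per percent of students making a learning gain in the subject.
    gains = sum(made_learning_gain(*s, expected_growth) for s in students)
    return round(100.0 * gains / len(students))

# Hypothetical (2001 level, 2002 level, 2001 DSS, 2002 DSS) records:
students = [(1, 2, 1100, 1190), (2, 2, 1200, 1210), (3, 3, 1500, 1520), (2, 2, 1300, 1400)]
print(gain_points(students, expected_growth=77))   # 75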
APPENDIX B
ANNUAL AYP OBJECTIVES FOR READING
[Figure: Starting Point and Annual Objectives for Reading, 2001-02 through 2013-14.
The chart plots the annual percent-proficient objective in reading for each year,
beginning with the 2001-02 base year.]
APPENDIX C
ANNUAL AYP OBJECTIVES FOR MATHEMATICS
[Figure: Starting Point and Annual Objectives for Mathematics, 2001-02 through
2013-14. The chart plots the annual percent-proficient objective in mathematics for each
year, rising from a starting point of 38% proficient in the 2001-02 base year to 100%
proficient in 2013-14.]
LIST OF REFERENCES
American Educational Research Association, American Psychological Association, &
National Council on Measurement in Education. (1999). Standards for educational
and psychological testing. Washington, DC: American Educational Research
Association.
Amrein, A. L., & Berliner, D. C. (2002, March 28). High-stakes testing, uncertainty, and
student learning. Education Policy Analysis Archives, 10(18). Retrieved February
11, 2003, from http://epaa.asu.edu/epaa/v10n18/
Ananda, S. (2003). Rethinking Issues of Alignment Under No Child Left Behind. San
Francisco: WestEd.
Barton, P. E., & Coley, R. J. (1994). Testing in America's schools. Princeton, NJ:
Educational Testing Service, Policy Information Center.
Bennett, R. E., Rock, D. A., & Wang, M. (1991). Equivalence of free-response and
multiple-choice items. Journal of Educational Measurement, 28(1), 77-92.
Brennan, R. L. (1995). Generalizability in performance assessments. Educational
Measurement: Issues and Practice, 14(4), 9-12, 27.
Bridgeman, B. (1992). A comparison of quantitative questions in open-ended and
multiple choice formats. Journal of Educational Measurement, 29(3), 253-271.
Bridgeman, B., & Rock, D. A. (1993). Relationships among multiple-choice and open-
ended analytical questions. Journal of Educational Measurement, 30(4), 313-329.
Carnoy, M., & Loeb, S. (2002). Does external accountability affect student outcomes: A
cross-state analysis. Educational Evaluation and Policy Analysis, 24(4), 305-331.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). New
York: Academic Press.
Council of Chief State School Officers. (1999). Annual survey of State student
assessment programs: A summary report, Fall, 1999. Washington, DC: Author.
Council of Chief State School Officers. (2002a). A guide to effective accountability
reporting. Washington, DC: Author.
Council of Chief State School Officers. (2002b). The role of performance-based
assessments in large-scale accountability systems: Lessons learned from the inside.
Washington, DC: Author.
Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. New
York: Wadsworth.
Cronbach, L. J., & Furby, L. (1970). How do we measure change-or should we?
Psychological Bulletin, 74, 68-80.
DeFur, S. H. (2002). Education reform, high-stakes assessment, and students with
disabilities: One state's approach. Remedial and Special Education, 23(4), 203-211.
Dunbar, S. B., Koretz, D. M., & Hoover, H. D. (1991). Quality control in the
development and use of performance assessments. Applied Measurement in
Education, 4(4), 289-303.
Florida Department of Education [FDOE]. (2001). FCAT briefing book. Tallahassee, FL:
Author.
Florida Department of Education [FDOE]. (2002). Technical report: For operational test
administrations of the 2000 Florida Comprehensive Assessment Test. Tallahassee,
FL: Author.
Florida Department of Education [FDOE]. (2003a). Understanding FCAT reports.
Tallahassee, FL: Author.
Florida Department of Education [FDOE]. (2003b). 2003 Guide to calculating school
grades: Technical assistance paper. Tallahassee, FL: Author.
Florida Department of Education [FDOE]. (2003c). Consolidated state application
Accountability Workbook for State grants under Title IX, Part C, Sec. 9302 for the
Elementary and Secondary Education Act (Pub. L. No. 107-110). March 26.
Florida Department of Education [FDOE]. (2003d). Growth of minority student
populations in Florida's public schools. Tallahassee, FL: Author.
Hamilton, L. S., McCaffrey, D. F., Stecher, B. M., Klein, S. P., Robyn, A., & Bugliari, D.
(2003). Studying large-scale reforms of instructional practice: An example from
mathematics and science. Educational Evaluation and Policy Analysis, 20(2), 95-
113.
Huebert, J. P., & Hauser, R. M. (Eds.). (1998). High-stakes testing for tracking,
promotion, and graduation. Washington, DC: National Academy Press.
Klein, S. P., Hamilton, L. S., McCaffrey, D. F., Stecher, B. M., Robyn, A., & Burroughs,
D. (2000). Teaching practices and student achievement: Report of first-year results
from the Mosaic Study of Systemic Initiatives in Mathematics and Science (MR-
1233-EDU). Santa Monica, CA: RAND.
Linn, R. L. (1994). Performance Assessment: Policy promises and technical measurement
standards. Educational Researcher, 23(9), 4-14.
Linn, R. L. (2000). Assessments and Accountability. Educational Researcher, 29(2), 4-
16.
Linn, R. L., Baker, E. L., & Herman, J. L. (2002, Fall). Minimum group size for
measuring adequate yearly progress. The CRESST Line, 1, 4-5. (Newsletter of the
National Center for Research on Evaluation, Standards, and Student Testing
[CRESST]. University of California, Los Angeles)
Lord, F. M. (1963). Elementary models for measuring change. In C. W. Harris (Ed.),
Problems in measuring change (pp. 21-38). Madison, WI: University of Wisconsin
Press.
Lukhele, R., Thissen, D., & Wainer, H. (1994). On the relative value of multiple-choice,
constructed-response, and examinee-selected items on two achievement tests.
Journal of Educational Measurement, 31(3), 234-250.
McCaffrey, D. F., Hamilton, L. S., Stecher, B. M., Klein, S. P., Bugliari, D., & Robyn, A.
(2001). Interactions among instructional practices, curriculum, and student
achievement: The case of standards-based high school mathematics. Journal for
Research in Mathematics Education, 32(5), 493-517.
Mehrens, W. A. (1992). Using performance assessment for accountability purposes.
Educational Measurement: Issues and Practice, 11(1), 3-9, 20.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp.
13-103). New York: Macmillan.
Messick, S. (1993). Trait equivalence as construct validity of score interpretation across
multiple methods of measurement. In R. E. Bennett & W. C. Ward (Eds.),
Construction versus choice in cognitive measurement (pp. 61-74). Mahwah, NJ:
Lawrence Erlbaum.
Messick, S. (1995). Standards of validity and the validity of standards in performance
assessment. Educational Measurement: Issues and Practice, 14(4), 5-8.
Miller, M. D. (2002). Generalizability of performance-based assessments. Washington,
DC: Council of Chief State School Officers.
National Council of Teachers of English. (1996). Standards for the English language arts:
A joint project of the National Council of Teachers of English and the International
Reading Association. Urbana, IL: Author.
National Council of Teachers of Mathematics. (1995). Assessment standards for school
mathematics. Reston, VA: Author.
No Child Left Behind Act of 2001, Public Law 107-110, 115 Stat. 1425, 107th Congress
(2002).
Paige, R. (2003, June 27). Key policy letters signed by the Education Secretary or Deputy
Secretary. Retrieved February 10, 2004, from
http://www.ed.gov/policy/speced/guid/secletter/030627.html
Pellegrino, J. W., Jones, L. R., & Mitchell, K. J. (1999). Grading the nation's report
card: Evaluating NAEP and transforming the assessment of educational progress.
Washington, DC: National Academy Press.
Rodriguez, M. C. (2003). Construct equivalence of multiple-choice and constructed-
response items: A random effects synthesis of correlations. Journal of Educational
Measurement, 40(2), 163-184.
Ryan, J. M., & DeMark, S. (2002). Variation in achievement scores related to gender,
item format, and content area tested. In J. Tindal & T. Haladyna (Eds.), Large-scale
assessment programs for all students: Validity, technical quality, and
implementation (pp. 67-88). Mahwah, NJ: Lawrence Erlbaum Associates.
Seltzer, M. H., Frank, K. A. & Bryk, A. S. (1994). The metric matters: The sensitivity of
conclusions about growth in student achievement to choice of metric. Educational
Evaluation and Policy Analysis, 16(1), 41-49.
Slinde, J. A., & Linn, R. L. (1977). Vertically equated tests: Fact or phantom? Journal of
Educational Measurement, 14(1), 23-32.
Snow, R. E. (1993). Construct validity and constructed-response tests. In R. E. Bennett &
W. C. Ward (Eds.), Construction versus choice in cognitive measurement: Issues in
constructed response, performance testing, and portfolio assessment (pp. 45-60).
Hillsdale, NJ: Lawrence Erlbaum.
Thissen, D. (1991). Multilog user's guide. Lincolnwood, IL: Scientific Software.
Thissen, D., & Wainer, H. (2001). Test scoring. Hillsdale, NJ: Lawrence Erlbaum.
Thissen, D., Wainer, H., & Wang, X. (1994). Are tests comprising both multiple-choice
and free-response items necessarily less unidimensional than multiple-choice tests?
An analysis of two tests. Journal of Educational Measurement, 31(2), 113-123.
Traub, R. E. (1993). On the equivalence of the traits assessed by multiple-choice and
constructed-response tests. In R. E. Bennett & W. C. Ward (Eds.), Construction
versus choice in cognitive measurement: Issues in constructed response,
performance testing, and portfolio assessment (pp. 45-60). Hillsdale, NJ: Lawrence
Erlbaum.
Wainer, H., & Thissen, D. (1993). Combining multiple-choice and constructed-response
test scores: Towards a Marxist theory of test construction. Applied Measurement in
Education, 6(2), 103-118.
Yen, W. M., & Ferrara, S. (1997). The Maryland school assessment program:
Performance assessment with psychometric quality suitable for high stakes usage.
Educational and Psychological Measurement, 57(1), 60-84.
BIOGRAPHICAL SKETCH
Stewart Francis Elizondo was born in San Jose, Costa Rica. At six years of age he
moved to Old Bridge, New Jersey. At eighteen years of age, he moved to Ocala, Florida,
where he attended Lake Weir High School. Before graduating, he had the honor of
receiving a nomination to the United States Military Academy from Kenneth Hood
"Buddy" MacKay Jr., former Governor of Florida. After pursing the honor to its end, he
worked his way through Central Florida Community College in Ocala with assistance
from a scholarship from the Silver Springs Shores Lion's Club. After two years, he was
fortunate enough to be awarded a Critical Teacher Scholarship in mathematics from the
Florida Department of Education (FDOE). He transferred to the University of South
Florida in Tampa and graduated with a Bachelor of Arts in mathematics and education.
Returning to Ocala, Stewart then taught for ten years in the public school system.
He is proud to have taught in a state recognized, high-performing "A" school. His
service includes mentoring beginning teachers in the county's Peer Orientation Program,
sitting on county textbook adoption committees, and facilitating school and district-wide
workshops. His students also twice nominated him for Teacher of the Year.
Concurrent with his teaching, he served on the Florida Mathematics Curriculum
Frameworks and Sunshine State Standards (SSS) Writing Team. The FDOE approved
the SSS as Florida's academic standards and the Florida Comprehensive Assessment Test
(FCAT) as its assessment tool. These changes were also incorporated into Governor Jeb
Bush's "A+ Education Plan," which was approved by the state legislature by amending
Section 229.57, F. S.
While still teaching, he worked with the FDOE and its Test Development Center
(TDC) on several statewide assessment endeavors. They include collaborations with
Harcourt Educational Measurement in development of the FCAT, specifically grade nine
and ten mathematics item reviews. He later worked with the TDC and NCS Pearson in
the scoring of FCAT items, specifically in developing and adjusting training guidelines
for the scoring of grade ten mathematics performance tasks using student responses from
field-tested and operational items. Letters expressing appreciation from Florida
Commissioner of Education James Wallace Horne, as well as former Commissioners
Charles J. Crist Jr., Frank T. Brogan, and the late Douglas L. Jamerson, mark the timeline
of service to these state projects.
These activities rekindled a long-standing passion for the assessment field and led
him to pursue various roles in the field, then ultimately an advanced degree. Of note, he
has been an item writer for Harcourt Educational Measurement and has written
mathematics items for the Massachusetts Comprehensive Assessment System and the
Connecticut Academic Performance Test. He has also served as both proctor and
administrator for the Florida Teacher Certification Examination and the College-Level
Academic Skills Test.
Stewart currently is an Alumni Fellow and graduate teaching assistant at the
University of Florida. He is grateful to still have the opportunity to be in the classroom
as he has taught undergraduate courses in elementary mathematics methods and
graduate courses in measurement and assessment while continuing to pursue a doctorate
in research and evaluation methodology.
His greatest joy always has been and still remains spending time with his sons,
Spencer and Seth, and his wife, Stephanie.
|
Full Text |
PAGE 1
ACCOUNTABILITY AND THE FLORIDA COMPREHENSIVE ASSESSMENT TEST: EFFECTS OF ITEM FORMAT ON LOW PERFORMING STUDENTS IN MEASURING ONE YEARS GROWTH By STEWART FRANCIS ELIZONDO A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF ARTS IN EDUCATION UNIVERSITY OF FLORIDA 2004
PAGE 2
Copyright 2004 by Stewart Francis Elizondo
PAGE 3
This work is dedicated to Stephanie, who has unconditionally supported me not only in this endeavor, but also in all of my undertakings over the last seventeen years.
PAGE 4
ACKNOWLEDGMENTS I acknowledge my committee chair, Dr. M. David Miller, and committee member, Dr. Anne E. Seraphine, for their skilled mentorship and expertise. They have allowed this process be to challenging, yet enjoyable and rewarding. Of course, any errors remaining in this paper are entirely my own responsibility. Gratitude is also given to Dr. Bridget A. Franks, graduate coordinator, and Dr. James Algina, Director of the Research and Evaluation Program, for their assistance in helping me secure an Alumni Fellowship without which this project may not be or at least would have been protracted and less cohesive. Enough appreciation cannot be expressed to my family for their undying support of my graduate studies. Repaying the concessions made by my sons, Spencer and Seth, is something I cannot do in my lifetime, but I look forward to making the effort. Encouragement and support given to me by my in-laws, Jeanne and John Garrod, are also not taken for granted and I am most thankful. I would like to thank my mother, Elizabeth Marder. It is because of her that I realize and celebrate in fact that education is truly a lifelong process. Finally, the many thanks owed to my wife, Stephanie, are beyond words. I would not be half of what I am if it were not for her. iv
PAGE 5
TABLE OF CONTENTS page ACKNOWLEDGMENTS.................................................................................................iv LIST OF TABLES............................................................................................................vii LIST OF FIGURES...........................................................................................................ix ABSTRACT.........................................................................................................................x CHAPTER 1 INTRODUCTION........................................................................................................1 The Florida Comprehensive Assessment Test..............................................................1 State and Federal Accountability..................................................................................4 The Florida School Accountability System...........................................................4 The No Child Left Behind Act of 2001.................................................................5 Low Performing Students and One Years Growth......................................................5 2 REVIEW OF LITERATURE.......................................................................................7 The Movement Towards Constructed Response..........................................................7 Trait Equivalence Between Item Formats....................................................................8 Psychometric Differences Between Item Formats.......................................................9 Combining Item Formats............................................................................................10 Differential Effects of Item Formats on Various Groups...........................................11 3 METHODS.................................................................................................................13 Data.............................................................................................................................13 Sample........................................................................................................................13 Measures.....................................................................................................................15 Analysis Approach......................................................................................................16 4 RESULTS...................................................................................................................20 Aggregate Developmental Scale Score Analysis.......................................................20 Reading................................................................................................................20 Mathematics........................................................................................................21 v
PAGE 6
Developmental Scale Score Contrasts by Sex............................................................22 Reading................................................................................................................22 Mathematics........................................................................................................23 Developmental Scale Score Contrasts by Sex and Race............................................24 Reading................................................................................................................24 Mathematics........................................................................................................26 Developmental Scale Score Contrasts by Socio-economic Status.............................27 Reading................................................................................................................27 Mathematics........................................................................................................28 Developmental Scale Score Contrasts for Students with Disabilities and Regular Education Students.................................................................................................29 Reading................................................................................................................29 Mathematics........................................................................................................31 Summary.....................................................................................................................32 5 DISCUSSION.............................................................................................................33 Trends by Grade.........................................................................................................34 Reading................................................................................................................34 Mathematics........................................................................................................35 Item Format Effects....................................................................................................35 Closing Remarks.........................................................................................................36 APPENDIX A GRADING FLORIDA PUBLIC SCHOOLS.............................................................38 B ANNUAL AYP OBJECTIVES FOR READING......................................................40 C ANNUAL AYP OBJECTIVES FOR MATHEMATICS...........................................41 LIST OF REFERENCES...................................................................................................42 BIOGRAPHICAL SKETCH.............................................................................................47 vi
PAGE 7
LIST OF TABLES Table page 1-1 Florida Comprehensive Assessment Test Item Formats by Grade............................2 1-2 Florida Comprehensive Assessment Test Achievement Levels................................3 3-1 Matched Low Performing Students by Grade for Reading and Mathematics.........13 3-2 Disaggregation of Matched District and Low Performing Students by Subgroup for Grades 6 through 10 for Reading and Mathematics...........................................14 4-1 Mean Aggregate Reading DSS for Matched Low Performing Students not Making Gains by Achievement Level Criteria......................................................................20 4-2 Mean Differences for Aggregate Reading DSS for Matched Low Performing Students not Making Gains by Achievement Level Criteria from 2001 to 2002.....21 4-3 Mean Aggregate Mathematics DSS for Matched Low Performing Students not Making Gains by Achievement Level Criteria........................................................21 4-4 Mean Differences for Aggregate Mathematics DSS for Matched Low Performing Students not Making Gains by Achievement Level Criteria from 2001 to 2002.....22 4-5 Mean Reading DSS by Sex for Matched Low Performing Students not Making Gains by Achievement Level Criteria......................................................................22 4-6 Mean Differences for Reading DSS by Sex for Matched Low Performing Students not Making Gains by Achievement Level Criteria from 2001 to 2002...................23 4-7 Mean Mathematics DSS by Sex for Matched Low Performing Students not Making Gains by Achievement Level Criteria........................................................23 4-8 Mean Differences for Mathematics DSS by Sex for Matched Low Performing Students not Making Gains by Achievement Level Criteria from 2001 to 2002.....24 4-9 Mean Reading DSS by Sex and Race for Matched Low Performing Students not Making Gains by Achievement Level Criteria........................................................24 4-10 Mean Differences for Reading DSS by Sex and Race for Matched Low Performing Students not Making Gains by Achievement Level Criteria from 2001 to 2002......................................................................................................................25 vii
PAGE 8
4-11 Mean Mathematics DSS by Sex and Race for Matched Low Performing Students not Making Gains by Achievement Level Criteria..................................................26 4-12 Mean Differences for Mathematics DSS by Sex and Race for Matched Low Performing Students not Making Gains by Achievement Level Criteria from 2001 to 2002......................................................................................................................26 4-13 Mean Reading DSS by SES for Matched Low Performing Students not Making Gains by Achievement Level Criteria......................................................................27 4-14 Mean Differences for Reading DSS by SES for Matched Low Performing Students not Making Gains by Achievement Level Criteria from 2001 to 2002...................28 4-15 Mean Mathematics DSS by SES for Matched Low Performing Students not Making Gains by Achievement Level Criteria........................................................28 4-16 Mean Differences for Mathematics DSS by SES for Matched Low Performing Students not Making Gains by Achievement Level Criteria from 2001 to 2002.....29 4-17 Mean Reading DSS for Matched Low Performing Students with Disabilities and Regular Education Students not Making Gains by Achievement Level Criteria.....29 4-18 Mean Differences for Reading DSS for Matched Low Performing Students with Disabilities and Regular Education Students not Making Gains by Achievement Level Criteria from 2001 to 2002.............................................................................30 4-19 Mean Mathematics DSS for Matched Low Performing Students with Disabilities and Regular Education Students not Making Gains by Achievement Level Criteria......................................................................................................................31 4-20 Mean Differences for Mathematics DSS for Matched Low Performing Students with Disabilities and Regular Education Students not Making Gains by Achievement Level Criteria from 2001 to 2002......................................................32 4-21 Subgroups of Low Performing Students Demonstrating More than One Years Growth in Reading and Mathematics.......................................................................32 viii
PAGE 9
LIST OF FIGURES Figure page 3-1 FCAT Achievement Levels for the Developmental Scale.......................................15 3-2 Expected Growth under Gain Alternative 3.............................................................18 ix
PAGE 10
Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Arts in Education ACCOUNTABILITY AND THE FLORIDA COMPREHENSIVE ASSESSMENT TEST: EFFECTS OF ITEM FORMAT ON LOW PERFORMING STUDENTS IN MEASURING ONE YEARS GROWTH By Stewart Francis Elizondo May 2004 Chair: M. David Miller Major Department: Educational Psychology The impact of constructed response items in the measurement of one years growth on the Florida Comprehensive Assessment Test for low performing students is analyzed. Growth was measured from the 2001 to the 2002 Sunshine State Standards assessment on the vertically equated developmental scores as defined by the Florida School Accountability System. Data in reading and mathematics for 6th through 10th grade were examined for a mid-sized school district and disaggregated into several subgroups as defined in the state No Child Left Behind Accountability Plan. These results were compared across grade levels to see if growth differed in the grade levels that are assessed with constructed response items. Results indicated that constructed response items may be of benefit to many low performing students in mathematics, but to a lesser extent in reading. Differential results between subgroups were most evident in reading. x
PAGE 11
CHAPTER 1 INTRODUCTION Enacted in 1968, The Educational Accountability Act, Section 229.51, Florida Statutes, empowered the Commissioner of Education to use all appropriate management tools, techniques, and practices which will cause the state's educational programs to be more effective and which will provide the greatest economics in the management and operation of the state's system of education. Subsequent changes to the Florida Department of Educations (FDOE) capabilities towards that end can be marked by a stream of legislation that can currently be characterized as an environment of accountability. Utilizing components of the current accountability systems, questions regarding Floridas lowest performing students will be analyzed. The Florida Comprehensive Assessment Test The Florida Comprehensive Assessment Test (FCAT) currently serves as the measure for the statewide School Accountability System as envisioned by the A+ Plan for Education. It also serves as the measure for the federal No Child Left Behind Act of 2001 (NCLB) for all public schools. Both accountability systems include requirements for annual student growth. Section 229.57, Florida Statutes was amended in 1999 and specifies that school performance grade category designations shall be based on the school's current years performance and the school's annual learning gains (FDOE, 2001). Section 1111(b)(2)(H) of NCLB mandates that intermediate goals for annual yearly progress shall be established (NCLB, 2001). 1
PAGE 12
2 Reading and mathematics components of FCAT based on the Sunshine State Standards (SSS) are administered in grades 3 through 10. These components are commonly referred to as FCAT SSS Reading and FCAT SSS Mathematics. Emphasis on these standards-based assessments by the accountability systems makes their construction, scoring, and reporting of special interest to stakeholders. Constructed response (CR) items are utilized on the FCAT SSS Reading assessment in grades 4, 8, and 10 and on the FCAT SSS Mathematics in grades 5, 8, and 10. CR items are either Short-Response (SR) or Extended-Response (ER). Not only does item format vary, but weight given to item formats can also vary across grades (see Table 1-1). Table 1-1. Florida Comprehensive Assessment Test Item Formats by Grade Reading Mathematics Grade Item Formats % of Items Item Formats % of Items 3 MC 100% MC 100% 4 MC SR & ER 85-90% 10-15% MC 100% 5 MC 100% MC GR SR & ER 60-70% 20-25% 10-15% 6 MC 100% MC GR 60-70% 30-40% 7 MC 100% MC GR 60-70% 30-40% 8 MC SR & ER 85-90% 10-15% MC GR SR & ER 50-60% 25-30% 10-15% 9 MC 100% MC GR 60-70% 30-40% 10 MC SR & ER 85-90% 10-15% MC GR SR & ER 50-60% 25-30% 10-15% Note. MC: Multiple-Choice, GR: Gridded-Response, SR: Short-Response, and ER: Extended-Response. Student scores have been reported on a scale of 100 to 500 for both the FCAT SSS Reading and Mathematics since they were first reported in 1998. Though these scores
PAGE 13
3 yield Achievement Levels that are used for the School Accountability System and NCLB (see Table 1-2), they are not for interpretation from grade to grade. Table 1-2. Florida Comprehensive Assessment Test Achievement Levels 5 Advanced Performance at this level indicates that the student has success with the most challenging content of the Sunshine State Standards. A Level 5 student answers most of the test questions correctly, including the most challenging questions. 4 Proficient Performance at this level indicates that the student has success with the challenging content of the Sunshine State Standards. A Level 4 student answers most of the questions correctly, but may have only some success with questions that reflect the most challenging content. 3 Proficient Performance at this level indicates that the student has partial success with the challenging content of the Sunshine State Standards, but performance is inconsistent. A Level 3 student answers many of the questions correctly, but is generally less successful with questions that are most challenging. 2 Basic Performance at this level indicates that the student has limited success with the challenging content of the Sunshine State Standards. 1 Below Basic Performance at this level indicates that the student has little success with the challenging content of the Sunshine State Standards. To address this, FCAT results began to be reported on a vertically equated scale that ranges from 86 to 3008 across grades 3 through 10. This facilitates interpretations of annual progress from grade to grade. Developmental Scale Scores (DSS), as well as DSS Change, the difference between consecutive years, appear on student reports as well as school, district, and state reports for the aggregate means (FDOE, 2003a). It has been proposed that NCLB encourages the use of vertical scaling procedures (Ananda, 2003). Though FCAT field-testing began in 1997, reports including DSS have only been available since 2001.
PAGE 14
4 State and Federal Accountability The Florida School Accountability System In 1999, school accountability became more visible to the public with the advent of school performance grades. Category designations range from A, making excellent progress, to F, failing to make adequate progress (see Appendix A). Schools are evaluated on the basis of aggregate student performance on FCAT SSS Reading and FCAT SSS Mathematics in grades 3 through 10, and the Florida Writes statewide writing assessment administered in grades 4, 8, and 10. Currently, three major components contribute in their calculation: 1. Yearly achievement of high standards in reading, mathematics and writing, 2. Annual learning gains in reading and mathematics, and 3. Annual learning gains in reading for the lowest 25% of students in each school. As of 2002, annual learning gains on FCAT SSS Mathematics and FCAT SSS Reading account for half of the point system that yields school performance grades, with special attention given to the reading gains of the lowest 25% of students in each school. There are three ways that schools can be credited for the annual yearly gains of their students: 1. When students improve their FCAT Achievement Levels (1-2, 2-3, 3-4, 4-5); or 2. When students maintain a relatively high Achievement Level (3, 4 or 5); or 3. When students demonstrate more than one years growth within Levels 1 or 2, as measured by an increase in their FCAT Developmental Scale Scores from one year to the next (FDOE, 2003b). Incentives for schools graded A and those that improve at least one grade level include monetary incentives under the Florida School Recognition Program under Section 231.2905, Florida Statutes. Sanctions for schools graded F for two years in a four-year period include eligibility for Opportunity Scholarships under Section 229.0537,
PAGE 15
5 Florida Statutes, so that students can attend higher performing public or private schools (FDOE, 2001). The No Child Left Behind Act of 2001 The inclusion of all students when making accountability calculations was not a common practice prior to the passing of the federal No Child Left Behind Act of 2001 (Linn, 2000). Approved in April of 2003, Floridas State Accountability Plan meets this defining requirement and builds on the Florida School Accountability System. It is also compliant with the Adequate Yearly Progress (AYP) components of the federal law. To make AYP, the percentage of students earning a score of Proficient or above in reading and mathematics has to meet or exceed the annual objectives for the given year (see Appendix B and C). Furthermore, not only will AYP criteria be applied to the aggregate, but it will also be applied separately to subgroups that disaggregate the data by race, socio-economic status, disability, and limited English proficiency. If a school fails to meet the annual objectives in reading or mathematics for any subgroup in any content area, the school is designated as not making AYP. The expectation for growth is such that all students are Proficient or above on the FCAT SSS Reading and Mathematics no later than 2014 (FDOE, 2003c). Low Performing Students and One Years Growth The fundamental question addressed in this paper is the following: Do grades with constructed response items show differential growth for various subgroups of students within Achievement Levels 1 or 2, as measured by an increase in their FCAT Developmental Scale Scores from one year to the next? The subgroups of interest are those defined by NCLB. These annual gains directly impact the calculation of school
PAGE 16
6 performance grades under the Florida School Accountability System, and can serve as a method of monitoring the progress and composition of those non-Proficient students for the AYP calculations under NCLB. Comparisons of item format effects will be explored for FCAT SSS Reading and Mathematics in grades 6 through 10. Other issues, though very relevant, such as: a) the longstanding debate on how to yield gain scores (Cronbach & Furby, 1970; Lord, 1963), b) the discrepancies surrounding the various methods of vertically equating tests, especially when including multiple formats, and their respective implications (Crocker & Algina, 1986; Slinde & Linn, 1977; Thissen & Wainer, 2001), and c) the most recent debate of effects of high-stakes testing (Amrein & Berliner, 2002; Carnoy & Loeb, 2002) will be set-aside in an effort to work within the existing framework of the current accountability systems. Unintended item format effects, especially on the FCAT SSS Reading and Mathematics assessments, given their emphasis by both the Florida School Accountability System and the federal No Child Left Behind Act, only grow in importance as trends in Floridas student population continue. In 2002, 14 of Floridas 67 school districts reported minority enrollment of 50% or more. Many of these were among the most populous districts across the state. Furthermore, minority representation statewide has steadily increased from 29.9% in 1976 to 49.4% in 2002 (FDOE, 2003d).
PAGE 17
CHAPTER 2 REVIEW OF LITERATURE The Movement Towards Constructed Response Some type of constructed response (CR) item was used in over three quarters of all statewide assessment programs (Council of Chief State School Officers [CCSSO], 1999). This is a marked change from the assessment practices of just a decade ago when Barton and Coley (1994), as cited in Linn (1994), stated that, The nation is entering an era of change in testing and assessment. Efforts at both the national and state levels are now directed at greater use of performance assessment, constructed response questions, and portfolios based on actual student work (p. 3). The hypothesis is that these types of items are more closely aligned with the standards-based reform movement currently underway in education and that they are better able to detect its effects (Hamilton et al., 2003; Klein et al., 2000). Some have recently tested whether the relationship between achievement and instruction varies as a function of item format (McCaffrey et al., 2001). One of the first and most prominent statewide programs to implement performance-based assessments in a high-stakes environment was the Maryland School Performance Assessment Program (MSPAP), first administered in 1991. Though replaced by the Maryland School Assessment in 2002, Maryland still utilizes CR items in all grades tested. Although virtually no information was available on the psychometric properties of performance assessment during the design stages of the MSPAP, results were encouraging and allowed for innovations in later years (Yen & Ferrara, 1997). 7
PAGE 18
8 The use of CR items has since become a hallmark of quality testing programs to the extent that a National Academy of Education panel repeatedly applauded the National Assessment of Educational Progress continued move to include CR items. In a later review, the Committee on the Evaluation of National and State Assessments of Educational Progress, in conjunction with the Board on Testing and Assessment, called for an enhancement of the use of testing results, especially CR items, to provide better interpretation (Pellegrino et al., 1999). Trait Equivalence Between Item Formats The debate regarding the existence and interpretation of psychological differences tapped by CR items, as opposed to multiple choice (MC) items, has been equivocal. Thissen and Wainer (2001) make clear a basic assumption of psychometrics: Before the responses of any set of items are combined into a single score that is taken to be, in some sense, representative of the responses to all of the items, we must ascertain the extent to which the items measure the same thing (p. 10). Factor analysis techniques, among other methodologies, have been employed to analyze the extent to which CR items measure a different trait than do MC items. Evidence of a free-response factor was found by Bennett, Rock, and Wang (1991), though a single-factor solution was reported as providing the most parsimonious fit. Thissen, Wainer, and Wang (1994) reported that despite a small degree of multidimensionality, it would seem meaningful to combine the scores of MC and CR items as they, for the most part, measure the same underling proficiency. These authors suggest that proficiency may be better estimated with only MC items due to a high correlation with the more reliable multiple-choice factor. Still yet, others have used factor analysis techniques with MC and CR items and found that CR items were not measuring anything beyond what was measured by the MC
PAGE 19
9 items (Bridgeman & Rock, 1993). With item response theory methodology, others have concluded that CR items yielded little information over and above that provided by MC items (Lukhele et al., 1994). A recent meta-analysis framed the question of construct (trait) equivalence as a function of stem equivalence between the two item types. It was found that when items are constructed in both formats using the same stem, the mean correlation between the two formats approaches unity and is significantly higher than when using non-stem equivalent items (Rodriguez, 2003). It has also been suggested that a better aim of CR items is not to measure the same trait, but to better measure some cognitive processes better than MC items (Traub, 1993). This is consistent with a stronger positive relationship between reform practices and student achievement when measured with CR items rather than MC items (Klein et al., 2000). Psychometric Differences Between Item Formats The psychometric properties of CR items have also been a point of contention in the literature. Messick (1993) asserted that the question of trait equivalence was essentially a question of construct validity of test interpretation and use. Furthermore, this pragmatic view is tolerated to facilitate effectiveness of testing purposes. The two major threats to construct validity, irrelevant test variance and construct underrepresentation (Messick, 1993), can be judiciously negotiated for the purposes of demonstrating construct relevant variance with due representation of the construct. Messick posits that if the hypothesis of perfectly correlated true scores across formats is not rejected, then either format, or a combination of both, could be employed as construct indicators. Although, if the hypothesis is rejected, likely due to differential
PAGE 20
10 method variance, a dominant construct relevant factor or second order factor, may cut across formats (Messick, 1993). Messick (1995) also later provided a thorough treatment on the broader systematic validation of performance assessments within the framework of the unified concept of validity (AERA, APA, & NCME, 1999). Returning to the issue of reliability raised earlier, Thissen and Wainers (2001) suggestion that MC items more reliably measure a CR factor may, to paraphrase, come about because, measuring something that is not quite right accurately may yield far better measurement than measuring the right thing poorly. Lastly, we are warned that however appealing CR items may be as exemplars of instruction or student performance, we must exercise caution when drawing inferences about people or changes in instruction due to their relatively limited generalizability across tasks (Brennan, 1995; Dunbar et al., 1991). In an analysis of state-mandated programs, Miller (2002) concluded that other advantages, such as consequences for instruction, are necessary to justify these concessions. Combining Item Formats Methods on how to best combine the use of CR items with MC items have also garnered much needed attention. Mehrens (1992) put it succinctly when stating that we have known for decades that MC items measure some things well and efficiently, but they do not measure everything and they can be overemphasized. Performance assessments, on the other hand, have the potential to measure important instructional objectives that cannot be measured by MC items. He also notes that most large-scale assessments have added performance assessments to the more traditional tests, and not replaced them.
PAGE 21
11 Wainer and Thissen (1993) add that combining the two formats may allow for the concatenation of their strengths while compensating for their weaknesses. More specifically, if a test has various components, it is sensible to weigh the components as a function of their reliabilities, or modify the lengths of the components to make them equally reliable. Given difficulties in execution of the latter, it was recommended that serious consideration be given to IRT weighting in the construction and scoring of tests with mixed item formats. Just such a process is employed in the construction of the Florida Comprehensive Assessment Test (FCAT) via an interactive test construction system, which makes available the capability to store, retrieve, and manipulate IRT test information, test error, and expected scores (FDOE, 2002) Differential Effects of Item Formats on Various Groups Another major component to the literature is a segment that seeks differential effects that CR items may have on gender, race, socio-economic status, and disability. These contrasts are most timely, given that nature of the Adequate Yearly Progress requirements of the No Child Left Behind Act of 2001 where, by design, data is disaggregated by such groups. In an analysis of open-ended counterparts to a set of items from the quantitative section of the Graduate Record Exam, Bridgeman (1992) concluded that gender and ethnicity differences were neither lessened nor exaggerated and that there were no significant interactions of test format with either gender or ethnicity. Similarly, in a metanalysis of over 30 studies, Ryan and DeMark (2002) conclude that females typically outperform males in mathematics and language when students must construct their responses, but effect sizes are small or less by Cohens (1988) standards.
There have been mixed results on the inclusion of students with disabilities in high-stakes standards-based assessments. DeFur (2002) summarizes one such example in Virginia's experience with its Standards of Learning and an inclusion policy. Conspicuous by their absence are studies on the differential effects of item formats involving students with disabilities. This may soon draw needed attention with the March 2003 announcement of a statutory NCLB provision. Effective for the 2003-2004 academic year, it states that in calculating AYP, at most 1% of all students tested can be held to alternative achievement standards at the district and state levels (Paige, 2003). The provision was published in the Federal Register, Vol. 68, No. 236, on December 9, 2003.
CHAPTER 3
METHODS

Data

The electronic file received by a mid-sized school district in May 2002 from the Florida Department of Education (FDOE) was analyzed. This file contained Florida Comprehensive Assessment Test (FCAT) scores from the 2001 and 2002 administrations. The FCAT data for Florida had changed in 2001 in two key ways. First, beginning with the 2001 administration, the FCAT program was expanded to include all grades 3 through 10; prior to this, reading was assessed only in grades 4, 8, and 10, while mathematics was assessed only in grades 5, 8, and 10. Second, these changes offered the opportunity to introduce a Developmental Scale Score (DSS) that linked adjacent grades together, allowing progress to be tracked over time.

Sample

The data set consisted of 2,402 low performing students in reading and 2,506 low performing students in mathematics in grades 6 through 10 who had a score from both the 2001 and 2002 FCAT administrations. Table 3-1 shows the composition of the students by grade. Low performing is defined as remaining within the Basic or Below Basic performance standards, commonly referred to as Level 1 and Level 2 (FDOE, 2003c).

Table 3-1. Matched Low Performing Students by Grade for Reading and Mathematics

Grade   Reading   Mathematics
6       498       569
7       475       557
8       477       533
9       489       462
10      463       385
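The matching and filtering steps described above can be illustrated with a short, hypothetical sketch. This is not the district's or the FDOE's actual processing; the file names and column names (student_id, grade, reading_level, and so on) are invented for illustration, and only the logic of keeping students with scores in both administrations who remained in Achievement Level 1 or 2 is shown.

import pandas as pd

# Hypothetical extracts of the 2001 and 2002 FCAT administrations; each file is
# assumed to contain student_id, grade, and reading_level columns.
scores_2001 = pd.read_csv("fcat_2001.csv")
scores_2002 = pd.read_csv("fcat_2002.csv")

# "Matched" students: a score from both administrations.
matched = scores_2001.merge(scores_2002, on="student_id", suffixes=("_2001", "_2002"))

# Low performing: remaining within Achievement Level 1 or 2 in both years.
low_reading = matched[
    matched["reading_level_2001"].isin([1, 2])
    & matched["reading_level_2002"].isin([1, 2])
]

# Composition by 2002 grade level, as summarized in Table 3-1.
print(low_reading.groupby("grade_2002")["student_id"].count())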
Disaggregation of low performing students by sex, race, socio-economic status, and disability is shown in Table 3-2 for reading and mathematics. For comparison, the disaggregation of all matched district students in grades 6 through 10 is also provided.

Table 3-2. Disaggregation of Matched District and Low Performing Students by Subgroup for Grades 6 through 10 for Reading and Mathematics

                   Reading                            Mathematics
                   All Students    Low Performing     All Students    Low Performing
Agg                8,399           2,402              8,404           2,506
Sex    F           4,307 (51%)     1,158 (48%)        4,309 (51%)     1,300 (52%)
       M           4,092 (49%)     1,244 (52%)        4,095 (49%)     1,206 (48%)
Race   B           3,158 (38%)     1,646 (69%)        3,161 (38%)     1,767 (71%)
       H           326 (4%)        77                 328 (4%)        70
       W           4,915 (59%)     756 (31%)          4,915 (58%)     739 (29%)
SES    ED          3,271 (39%)     1,543 (64%)        3,275 (39%)     1,671 (67%)
       N-ED        5,128 (61%)     859 (36%)          5,128 (61%)     835 (33%)
SWD    Gift        853 (10%)       16                 852 (10%)       11
       RES         5,928 (71%)     1,311 (55%)        5,928 (71%)     1,430 (57%)
       SWD         1,618 (19%)     1,075 (45%)        1,624 (19%)     1,065 (42%)

Note. Agg = Aggregate, F = Female, M = Male, B = Black, H = Hispanic, W = White, SES = Socio-economic Status, ED = Economically Disadvantaged, N-ED = Non-Economically Disadvantaged, SWD = Students With Disabilities, Gift = Gifted Students, RES = Regular Education Students.

Gifted students and Hispanic students were omitted from analysis due to small sample sizes. Disaggregated reporting for Adequate Yearly Progress (AYP) is not required when the number of students is insufficient to yield statistically reliable information or would reveal personally identifiable information (NCLB, 2002). Linn et al. (2002) suggested a minimum of 25 students. The FDOE subscribes to a minimum group size of 30 for school performance grades (FDOE, 2003b) and for AYP calculations (FDOE, 2003c). In the analyses that follow, subgroups of low performing students are further disaggregated by grade and, in all cases, gifted students and Hispanic students fail to meet even the minimum suggested by Linn et al. In several cases, group size was as low as zero or one.
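As a small illustration of this reporting rule, the sketch below flags subgroup-by-grade cells that fall below a minimum group size. It is a hypothetical example rather than FDOE code; the only values taken from the passage above are the thresholds of 30 (FDOE) and 25 (Linn et al., 2002), and the column names continue the invented layout used earlier.

import pandas as pd

MIN_N_FDOE = 30   # FDOE minimum group size for school grades and AYP
MIN_N_LINN = 25   # minimum suggested by Linn et al. (2002)

def reportable_groups(df, by, min_n=MIN_N_FDOE):
    """Return subgroup counts with a flag for cells too small to report."""
    counts = df.groupby(by)["student_id"].count().rename("n").reset_index()
    counts["reportable"] = counts["n"] >= min_n
    return counts

# Example usage with the hypothetical matched file built earlier:
# counts = reportable_groups(low_reading, by=["race", "grade_2002"])
# print(counts[~counts["reportable"]])   # e.g., Hispanic and gifted cells by grade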
Measures

Reading and mathematics components of the FCAT based on the Sunshine State Standards (SSS) were examined. Commonly referred to as FCAT SSS Reading and FCAT SSS Mathematics, they are the criterion-referenced components of the statewide assessment program. The resulting Developmental Scale Scores (DSS) were analyzed. Figure 3-1 shows the relationship between these scores and the corresponding Achievement Levels by content area.

Figure 3-1. FCAT Achievement Levels for the Developmental Scale. Source: Understanding FCAT Reports (FDOE, 2003a).

The vertical scales that the DSS are based on were built through a special comparison of FCAT performance across grades 3 through 10 during the spring of 2001. Facilitated by a calibration step (Thissen & Wainer, 2001), items from adjacent grade-level tests were embedded into field-test item positions. The scale was anchored so that the average score at grade 3 would be 1300 and the average score at grade 10 would be 2000. Analysis was accomplished using Item Response Theory (IRT) models (Thissen, 1991). IRT metrics have been shown to be preferable to grade equivalents when examining such longitudinal patterns, as they are sensitive to varying rates of growth over time (Seltzer et al., 1994).
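The passage above does not give the scaling constants themselves, but a two-point anchoring of this kind has a simple general form. Assuming the IRT ability estimates for all grades have been placed on a common vertical metric, with grade-level means $\bar{\theta}_{3}$ and $\bar{\theta}_{10}$, the linear transformation to the developmental scale can be written as

\[
\mathrm{DSS} = a\,\theta + b, \qquad
a = \frac{2000 - 1300}{\bar{\theta}_{10} - \bar{\theta}_{3}}, \qquad
b = 1300 - a\,\bar{\theta}_{3},
\]

so that, by construction, the grade 3 mean maps to 1300 and the grade 10 mean maps to 2000. The actual FCAT transformation and its constants are documented by the FDOE and are not reproduced here.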
Analysis Approach

Components of both the Florida School Accountability System, as envisioned by the A+ Plan for Education, and the federal No Child Left Behind Act of 2001 (NCLB) were used to approach the question of differential growth for various groups of students within the Achievement Levels of Basic and Below Basic, focusing on the grades assessed with constructed response items. Within the Florida School Accountability System, making annual learning gains on FCAT SSS Mathematics and FCAT SSS Reading accounts for three of the six categories used in calculating school performance grades. Special attention is given to the FCAT SSS Reading gains of the lowest 25% of students in each school (see Appendix A). As of 2002, students can demonstrate gains via three different alternatives (FDOE, 2003b):

1. When students improve their FCAT Achievement Levels (1-2, 2-3, 3-4, 4-5); or
2. When students maintain a relatively high Achievement Level (3, 4, or 5); or
3. When students demonstrate more than one year's growth within Levels 1 or 2, as measured by an increase in their FCAT Developmental Scale Scores from one year to the next.

The focus of the analysis is on the students eligible to demonstrate gains via Gain Alternative 3, the reasoning being that these students have demonstrated two consecutive years of Basic or Below Basic performance. Students who demonstrated gains via Gain Alternative 1 have shown a substantive improvement in their performance; for example, a student who has improved from Achievement Level 2 to Achievement Level 3 has gone from a classification of Basic to Proficient (FDOE, 2003c). Students who demonstrated gains via Gain Alternative 2 have shown consistently high performance, as Achievement Level 3 or higher is classified as Proficient or Advanced
(FDOE, 2003c). Furthermore, this focus speaks directly to the policy at hand and utilizes the FCAT Developmental Scale Scores as they are intended. Five analyses were performed, the first of which is for all matched low performing students not making gains by achievement level criteria. The remaining four address subgroups that are suggested (CCSSO, 2002a) or required (NCLB, 2002) when disaggregating school, district, and state data in calculating the Adequate Yearly Progress (AYP) components of NCLB. To make AYP, the percentage of students earning a score of Proficient or above in reading and mathematics has to meet or exceed the annual objectives for the given year (see Appendices B and C). Though all students in the analyses are non-Proficient, tracking the composition and progress of these groups could lead to an understanding that may facilitate policy and instruction. Comparisons of item format effects in measuring one year's growth will be made between FCAT SSS Reading and FCAT SSS Mathematics, as well as across grades 6 through 10 within each content area. Lastly, differential growth within each grade for both content areas will be discussed. To further specify what constitutes one year's growth as defined in the Florida School Accountability System, it is necessary to revisit the developmental scale, specifically as it applies to Gain Alternative 3. The definition is based on the numerical cut scores for the FCAT Achievement Levels that have been approved by the State Board of Education. The following steps were applied to the cut scores, separately, for each subject and each grade-level pair (FDOE, 2003b). Per State Board Rule 6A-1.09422, there are four cut scores that separate DSS into five Achievement Levels. The increase in the DSS necessary to maintain the same relative standing within Achievement Levels
from one grade to the next was calculated for each of the four cut scores between the five Achievement Levels. The median of these four differences was determined to best represent the entire student population. Median gain expectations were then calculated for each grade, and a logarithmic curve was fitted. Other curves were considered, but the logarithmic curve was adopted because it best captures the theoretical expectation of greater gains in the early grades due to student maturation. Graphs of these curves and the resulting expected growth on the developmental scale are shown in Figure 3-2 for reading and mathematics.

Figure 3-2. Expected Growth under Gain Alternative 3. Source: Guide to Calculating School Grades Technical Assistance Paper (FDOE, 2003b).

To be denoted as making gains under Gain Alternative 3, students must demonstrate more than the expected growth on the developmental scale; therefore, they must score at least one developmental scale score point more than the expected values. These criteria were applied to the mean differences for the matched low performing students. This was done for the aggregate at each grade, 6 through 10, and, in the interest of the NCLB requirements, by sex, sex and race, socio-economic status, and disability.
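The steps above can be summarized in a short sketch. The cut scores used below are invented placeholders (the official values are set in State Board Rule 6A-1.09422 and the FDOE technical assistance paper); only the logic of taking the median cut-score difference and comparing a student's DSS change against the expected growth is drawn from the text. The expected growth of 77 points in the example call is the grade 9-to-10 reading value reported later in Table 4-2.

import numpy as np

def median_cut_difference(cuts_lower_grade, cuts_upper_grade):
    """Median of the four cut-score differences between adjacent grades."""
    diffs = np.array(cuts_upper_grade) - np.array(cuts_lower_grade)
    return float(np.median(diffs))

# Hypothetical DSS cut scores separating the five Achievement Levels.
cuts_grade_9 = [1500, 1700, 1900, 2100]
cuts_grade_10 = [1580, 1770, 1980, 2170]
print(median_cut_difference(cuts_grade_9, cuts_grade_10))   # 75.0 with these placeholders

# After the grade-pair medians are smoothed with a logarithmic curve, a student
# makes gains under Gain Alternative 3 only by exceeding the expected growth for
# that grade pair, i.e., gaining at least one DSS point more than the expected value.
def makes_gain_alternative_3(dss_prior, dss_current, expected_growth):
    return (dss_current - dss_prior) > expected_growth

print(makes_gain_alternative_3(1500, 1580, expected_growth=77))   # True: a gain of 80 exceeds 77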
As noted earlier, other issues, though admittedly relevant, will be set aside in an effort to work within the existing framework of the state and federal accountability systems. These include, but are not limited to, the longstanding debate on how to derive gain scores (Cronbach & Furby, 1970; Lord, 1963); the discrepancies surrounding the various methods of vertically equating tests, especially when multiple item formats are included, and their respective implications (Crocker & Algina, 1986; Slinde & Linn, 1977; Thissen & Wainer, 2001); and the more recent debate over the effects of high-stakes testing (Amrein & Berliner, 2002; Carnoy & Loeb, 2002).
CHAPTER 4 RESULTS Aggregate Developmental Scale Score Analysis Reading Sample sizes, means, and standard deviations for grade level aggregate data for the Florida Comprehensive Assessment Test Sunshine State Standards (FCAT SSS) Reading Developmental Scale Scores (DSS) are shown in Table 4-1. Table 4-1. Mean Aggregate Reading DSS for Matched Low Performing Students not Making Gains by Achievement Level Criteria Grade Year n M (SD) 2002 498 1,074 (266) 6 2001 498 1,031 (294) 2002 475 1,245 (245) 7 2001 475 1,116 (255) 2002 477 1,353 (258) 8 2001 477 1,257 (254) 2002 489 1,525 (264) 9 2001 489 1,374 (251) 2002 463 1,590 (275) 10 2001 463 1,592 (221) Mean differences and comparisons to expected growth for grade level aggregate data for the FCAT SSS Reading DSS are shown in Table 4-2. 20
21 Table 4-2. Mean Differences for Aggregate Reading DSS for Matched Low Performing Students not Making Gains by Achievement Level Criteria from 2001 to 2002 Grades n DSS Mean Difference Expected Growth Met Expectations 5 to 6 498 43 133 No 6 to 7 475 129 110 Yes 7 to 8 a 477 96 92 Yes 8 to 9 489 150 77 Yes 9 to 10 a 463 -2 77 No a Grades 8 and 10 utilize constructed-response items. Mathematics Sample sizes, means, and standard deviations for grade level aggregate data for the FCAT SSS Mathematics DSS are shown in Table 4-3. Table 4-3. Mean Aggregate Mathematics DSS for Matched Low Performing Students not Making Gains by Achievement Level Criteria Grade Year n M (SD) 2002 569 1,314 (235) 6 2001 569 1,123 (227) 2002 557 1,360 (262) 7 2001 557 1,290 (257) 2002 533 1,459 (242) 8 2001 533 1,307 (272) 2002 462 1,587 (220) 9 2001 462 1,438 (241) 2002 385 1,689 (190) 10 2001 385 1,636 (206)
22 Mean differences and comparisons to expected growth for grade level aggregate data for the FCAT SSS Mathematics DSS are shown in Table 4-4. Table 4-4. Mean Differences for Aggregate Mathematics DSS for Matched Low Performing Students not Making Gains by Achievement Level Criteria from 2001 to 2002 Grades n DSS Mean Difference Expected Growth Met Expectations 5 to 6 569 190 95 Yes 6 to 7 557 70 78 No 7 to 8 a 533 151 64 Yes 8 to 9 462 149 54 Yes 9 to 10 a 385 53 48 Yes a Grades 8 and 10 utilize constructed-response items. Developmental Scale Score Contrasts by Sex Reading Sample sizes, means, and standard deviations for grade level data for the FCAT SSS Reading DSS by sex are shown in Table 4-5. Table 4-5. Mean Reading DSS by Sex for Matched Low Performing Students not Making Gains by Achievement Level Criteria Female Male Grade Year n M (SD) n M (SD) 2002 241 1,115 (247) 257 1,035 (277) 6 2001 241 1,060 (269) 257 1,003 (313) 2002 225 1,288 (234) 250 1,206 (247) 7 2001 225 1,156 (237) 250 1,079 (266) 2002 223 1,388 (259) 254 1,321 (253) 8 2001 223 1,304 (236) 254 1,215 (262) 2002 237 1,583 (231) 252 1,470 (281) 9 2001 237 1,413 (235) 252 1,338 (261) 2002 232 1,601 (284) 231 1,580 (266) 10 2001 232 1,608 (208) 231 1,576 (233)
23 Mean differences and comparisons to expected growth for grade level data for the FCAT SSS Reading DSS by sex are shown in Table 4-6. Table 4-6. Mean Differences for Reading DSS by Sex for Matched Low Performing Students not Making Gains by Achievement Level Criteria from 2001 to 2002 Grades Group n DSS Mean Difference Expected Growth Met Expectations 5 to 6 F M 241 257 55 31 133 133 No No 6 to 7 F M 225 250 132 126 110 110 Yes Yes 7 to 8 a F M 223 254 84 105 92 92 No Yes 8 to 9 F M 237 252 170 132 77 77 Yes Yes 9 to 10 a F M 232 231 -7 4 77 77 No No Note. F = Female, M = Male. a Grades 8 and 10 utilize constructed-response items. Mathematics Sample sizes, means, and standard deviations for grade level data for the FCAT SSS Mathematics DSS by sex are shown in Table 4-7. Table 4-7. Mean Mathematics DSS by Sex for Matched Low Performing Students not Making Gains by Achievement Level Criteria Female Male Grade Year n M (SD) n M (SD) 2002 303 1,332 (213) 266 1,294 (256) 6 2001 303 1,127 (223) 266 1,120 (231) 2002 295 1,398 (245) 262 1,317 (273) 7 2001 295 1,325 (237) 262 1,251 (274) 2002 255 1,487 (238) 278 1,433 (244) 8 2001 255 1,357 (254) 278 1,262 (280) 2002 235 1,604 (210) 227 1,569 (229) 9 2001 235 1,476 (220) 227 1,397 (255) 2002 212 1,692 (184) 173 1,686 (198) 10 2001 212 1,643 (201) 173 1,628 (212) Mean differences and comparisons to expected growth for grade level data for the FCAT SSS Mathematics DSS by sex are shown in Table 4-8.
24 Table 4-8. Mean Differences for Mathematics DSS by Sex for Matched Low Performing Students not Making Gains by Achievement Level Criteria from 2001 to 2002 Grades Group n DSS Mean Difference Expected Growth Met Expectations 5 to 6 F M 303 266 205 174 95 95 Yes Yes 6 to 7 F M 295 262 72 67 78 78 No No 7 to 8 a F M 255 278 130 171 64 64 Yes Yes 8 to 9 F M 235 227 128 172 54 54 Yes Yes 9 to 10 a F M 212 173 49 57 48 48 Yes Yes Note. F = Female, M = Male. a Grades 8 and 10 utilize constructed-response items. Developmental Scale Score Contrasts by Sex and Race Reading Sample sizes, means, and standard deviations for grade level data for the FCAT SSS Reading DSS by sex and race are shown in Table 4-9. Table 4-9. Mean Reading DSS by Sex and Race for Matched Low Performing Students not Making Gains by Achievement Level Criteria Female Male Black White Black White Grade Year n M (SD) n M (SD) n M (SD) n M (SD) 2002 189 1,090 (246) 52 1,206 (231) 193 1,016 (264) 64 1,094 (307) 6 2001 189 1,034 (277) 52 1,153 (216) 193 986 (305) 64 1,059 (335) 2002 166 1,266 (241) 59 1,350 (202) 156 1,178 (250) 94 1,251 (237) 7 2001 166 1,130 (244) 59 1,229 (200) 156 1,028 (271) 94 1,164 (235) 2002 148 1,352 (259) 75 1,460 (245) 166 1,277 (262) 88 1,404 (213) 8 2001 148 1,293 (211) 75 1,326 (279) 166 1,189 (253) 88 1,267 (273) 2002 157 1,541 (228) 80 1,665 (216) 172 1,410 (278) 80 1,597 (242) 9 2001 157 1,362 (237) 80 1,514 (197) 172 1,286 (253) 80 1,449 (244)
25 Table 4-9 Continued. Female Male Black White Black White Grade Year n M (SD) n M (SD) n M (SD) n M (SD) 2002 148 1,533 (303) 84 1,720 (196) 151 1,512 (274) 80 1,708 (194) 10 2001 148 1,563 (211) 84 1,688 (177) 151 1,540 (220) 80 1,644 (241) Mean differences and comparisons to expected growth for grade level data for the FCAT SSS Reading DSS by sex and race are shown in Table 4-10. Table 4-10. Mean Differences for Reading DSS by Sex and Race for Matched Low Performing Students not Making Gains by Achievement Level Criteria from 2001 to 2002 Grades Group n DSS Mean Difference Expected Growth Met Expectations 5 to 6 BF BM WF WM 189 193 52 64 56 30 54 34 133 133 133 133 No No No No 6 to 7 BF BM WF WM 166 156 59 94 136 150 121 87 110 110 110 110 Yes Yes Yes No 7 to 8 a BF BM WF WM 148 166 75 88 59 89 133 138 92 92 92 92 No No Yes Yes 8 to 9 BF BM WF WM 157 172 80 80 179 125 152 148 77 77 77 77 Yes Yes Yes Yes 9 to 10 a BF BM WF WM 148 151 84 80 -30 -28 33 64 77 77 77 77 No No No No Note. BF = Black Female, BM = Black Male, WF = White Female, WM = White Male. a Grades 8 and 10 utilize constructed-response items.
26 Mathematics Sample sizes, means, and standard deviations for grade level data for the FCAT SSS Mathematics DSS by sex and race are shown in Table 4-11. Table 4-11. Mean Mathematics DSS by Sex and Race for Matched Low Performing Students not Making Gains by Achievement Level Criteria Female Male Black White Black White Grade Year n M (SD) n M (SD) n M (SD) n M (SD) 2002 220 1,294 (220) 83 1,433 (153) 185 1,247 (262) 81 1,403 (206) 6 2001 220 1,087 (227) 83 1,233 (174) 185 1,082 (230) 81 1,208 (210) 2002 203 1,342 (259) 92 1,520 (155) 176 1,263 (280) 86 1,428 (224) 7 2001 203 1,276 (244) 92 1,433 (179) 176 1,199 (286) 86 1,356 (212) 2002 180 1,462 (239) 75 1,546 (226) 183 1,390 (241) 95 1,516 (227) 8 2001 180 1,334 (252) 75 1,412 (252) 183 1,208 (270) 95 1,365 (271) 2002 170 1,578 (214) 65 1,672 (184) 178 1,543 (238) 49 1,666 (163) 9 2001 170 1,439 (222) 65 1,573 (184) 178 1,370 (252) 49 1,496 (243) 2002 144 1,654 (192) 68 1,771 (134) 128 1,656 (207) 45 1,770 (140) 10 2001 144 1,601 (207) 68 1,732 (155) 128 1,604 (219) 45 1,697 (171) Mean differences and comparisons to expected growth for grade level data for the FCAT SSS Mathematics DSS by sex and race are shown in Table 4-12. Table 4-12. Mean Differences for Mathematics DSS by Sex and Race for Matched Low Performing Students not Making Gains by Achievement Level Criteria from 2001 to 2002 Grades Group n DSS Mean Difference Expected Growth Met Expectations 5 to 6 BF BM WF WM 220 185 83 81 207 165 200 195 95 95 95 95 Yes Yes Yes Yes
27 Table 4-12. Continued. Grades Group n DSS Mean Difference Expected Growth Met Expectations 6 to 7 BF BM WF WM 203 176 92 86 66 64 87 72 78 78 78 78 No No Yes No 7 to 8 a BF BM WF WM 180 183 75 95 128 182 134 150 64 64 64 64 Yes Yes Yes Yes 8 to 9 BF BM WF WM 170 178 65 49 139 172 99 170 54 54 54 54 Yes Yes Yes Yes 9 to 10 a BF BM WF WM 144 128 68 45 53 52 40 73 48 48 48 48 Yes Yes No Yes Note. BF = Black Female, BM = Black Male, WF = White Female, WM = White Male. a Grades 8 and 10 utilize constructed-response items. Developmental Scale Score Contrasts by Socio-economic Status Reading Sample sizes, means, and standard deviations for grade level data for the FCAT SSS Reading DSS by socio-economic status (SES) are shown in Table 4-13. Table 4-13. Mean Reading DSS by SES for Matched Low Performing Students not Making Gains by Achievement Level Criteria Economically Disadvantaged Non-Economically Disadvantaged Grade Year n M (SD) n M (SD) 2002 398 1,055 (266) 100 1,149 (253) 6 2001 398 1,004 (295) 100 1,134 (262) 2002 337 1,223 (254) 138 1,297 (212) 7 2001 337 1,093 (257) 138 1,172 (243) 2002 323 1,314 (264) 154 1,433 (225) 8 2001 323 1,239 (247) 154 1,294 (266) 2002 280 1,465 (272) 209 1,604 (231) 9 2001 280 1,329 (252) 209 1,434 (238) 2002 205 1,510 (295) 258 1,655 (240) 10 2001 205 1,525 (235) 258 1,645 (194)
28 Mean differences and comparisons to expected growth for grade level data for the FCAT SSS Reading DSS by SES are shown in Table 4-14. Table 4-14. Mean Differences for Reading DSS by SES for Matched Low Performing Students not Making Gains by Achievement Level Criteria from 2001 to 2002 Grades Group n DSS Mean Difference Expected Growth Met Expectations 5 to 6 ED N-ED 398 100 50 15 133 133 No No 6 to 7 ED N-ED 337 138 130 125 110 110 Yes Yes 7 to 8 a ED N-ED 323 154 75 138 92 92 No Yes 8 to 9 ED N-ED 280 209 135 169 77 77 Yes Yes 9 to 10 a ED N-ED 205 258 -16 9 77 77 No No Note. ED = Economically Disadvantaged, N-ED = Non-Economically Disadvantaged. a Grades 8 and 10 utilize constructed-response items. Mathematics Sample sizes, means, and standard deviations for grade level data for the FCAT SSS Mathematics DSS by SES are shown in Table 4-15. Table 4-15. Mean Mathematics DSS by SES for Matched Low Performing Students not Making Gains by Achievement Level Criteria Economically Disadvantaged Non-Economically Disadvantaged Grade Year n M (SD) n M (SD) 2002 436 1,294 (237) 133 1,381 (217) 6 2001 436 1,103 (227) 133 1,191 (212) 2002 385 1,319 (265) 172 1,452 (229) 7 2001 385 1,248 (264) 172 1,386 (212) 2002 365 1,426 (245) 168 1,529 (220) 8 2001 365 1,277 (265) 168 1,372 (277) 2002 293 1,558 (225) 169 1,637 (203) 9 2001 293 1,417 (228) 169 1,473 (258) 2002 192 1,649 (198) 193 1,728 (174) 10 2001 192 1,588 (231) 193 1,684 (164) Mean differences and comparisons to expected growth for grade level data for the FCAT SSS Mathematics DSS by SES are shown in Table 4-16.
29 Table 4-16. Mean Differences for Mathematics DSS by SES for Matched Low Performing Students not Making Gains by Achievement Level Criteria from 2001 to 2002 Grades Group n DSS Mean Difference Expected Growth Met Expectations 5 to 6 ED N-ED 436 133 191 190 95 95 Yes Yes 6 to 7 ED N-ED 385 172 71 66 78 78 No No 7 to 8 a ED N-ED 365 168 149 157 64 64 Yes Yes 8 to 9 ED N-ED 293 169 141 164 54 54 Yes Yes 9 to 10 a ED N-ED 192 193 61 45 48 48 Yes No Note. ED = Economically Disadvantaged, N-ED = Non-Economically Disadvantaged. a Grades 8 and 10 utilize constructed-response items. Developmental Scale Score Contrasts for Students with Disabilities and Regular Education Students Reading Sample sizes, means, and standard deviations for grade level data for the FCAT SSS Reading DSS for students with disabilities and regular education students are shown in Table 4-17. Gifted students were omitted due to a small sample size. Table 4-17. Mean Reading DSS for Matched Low Performing Students with Disabilities and Regular Education Students not Making Gains by Achievement Level Criteria Students with Disabilities Regular Education Students Grade Year n M (SD) n M (SD) 2002 252 980 (271) 240 1,165 (223) 6 2001 252 899 (317) 240 1,162 (189) 2002 240 1,148 (252) 234 1,344 (192) 7 2001 240 1,018 (254) 234 1,215 (216)
30 Table 4-17 Continued. Students with Disabilities Regular Education Students Grade Year n M (SD) n M (SD) 2002 219 1,259 (241) 252 1,428 (246) 8 2001 219 1,132 (249) 252 1,360 (208) 2002 212 1,376 (279) 276 1,638 (184) 9 2001 212 1,251 (256) 276 1,468 (203) 2001 152 1,435 (255) 309 1,666 (252) 10 2002 152 1,444 (232) 309 1,664 (175) Mean differences and comparisons to expected growth for grade level data for the FCAT SSS Reading DSS for students with disabilities and regular education students are shown in Table 4-18. Gifted students were omitted due to a small sample size. Table 4-18. Mean Differences for Reading DSS for Matched Low Performing Students with Disabilities and Regular Education Students not Making Gains by Achievement Level Criteria from 2001 to 2002 Grades Group n DSS Mean Difference Expected Growth Met Expectations 5 to 6 SWD RES 252 240 82 3 133 133 No No 6 to 7 SWD RES 240 234 130 129 110 110 Yes Yes 7 to 8 a SWD RES 219 252 126 68 92 92 Yes No 8 to 9 SWD RES 212 276 124 170 77 77 Yes Yes 9 to 10 a SWD RES 152 309 -9 2 77 77 No No Note. SWD = Students With Disabilities, RES = Regular Education Students. a Grades 8 and 10 utilize constructed-response items.
31 Mathematics Sample sizes, means, and standard deviations for grade level data for the FCAT SSS Mathematics DSS for students with disabilities and regular education students are shown in Table 4-19. Gifted students were omitted due to a small sample size. Table 4-19. Mean Mathematics DSS for Matched Low Performing Students with Disabilities and Regular Education Students not Making Gains by Achievement Level Criteria Students with Disabilities Regular Education Students Grade Year n M (SD) n M (SD) 2002 246 1,208 (254) 319 1,393 (182) 6 2001 246 1,026 (234) 319 1,196 (191) 2002 244 1,215 (266) 309 1,471 (194) 7 2001 244 1,157 (270) 309 1,392 (191) 2002 233 1,341 (243) 297 1,548 (199) 8 2001 233 1,159 (261) 297 1,419 (220) 2002 198 1,473 (247) 264 1,673 (149) 9 2001 198 1,307 (236) 264 1,536 (194) 2002 144 1,573 (206) 241 1,758 (141) 10 2001 144 1,513 (240) 241 1,710 (137) Mean differences and comparisons to expected growth for grade level data for the FCAT SSS Mathematics DSS for students with disabilities and regular education students are shown in Table 4-20. Gifted students were omitted due to a small sample size.
32 Table 4-20. Mean Differences for Mathematics DSS for Matched Low Performing Students with Disabilities and Regular Education Students not Making Gains by Achievement Level Criteria from 2001 to 2002 Grades Group n DSS Mean Difference Expected Growth Met Expectations 5 to 6 SWD RES 246 319 182 197 95 95 Yes Yes 6 to 7 SWD RES 244 309 58 72 78 78 No Yes 7 to 8 a SWD RES 233 297 182 128 64 64 Yes Yes 8 to 9 SWD RES 198 264 169 137 54 54 Yes Yes 9 to 10 a SWD RES 144 241 61 48 48 48 Yes Yes Note. SWD = Students With Disabilities, RES = Regular Education Students. a Grades 8 and 10 utilize constructed-response items. Summary Subgroups of low performing students demonstrating more than one years growth, as measured by an increase in their FCAT Developmental Scale Scores from one year to the next, are indicated in Table 4-21. Table 4-21. Subgroups of Low Performing Students Demonstrating More than One Years Growth in Reading and Mathematics Reading Mathematics 5-6 6-7 7-8 8-9 9-10 5-6 6-7 7-8 8-9 9-10 Agg Sex F M Sex/ Race BF BM WF WM SES ED N-ED SWD RES SWD Note. = demonstrated more than one years growth, Agg = Aggregate, F = Female, M = Male, B = Black, W = White, SES = Socio-economic Status, ED = Economically Disadvantaged, N-ED = Non-Economically Disadvantaged, SWD = Students With Disabilities, RES = Regular Education Students.
CHAPTER 5
DISCUSSION

What makes the isolation of item format effects in an existing accountability system difficult is that comparison groups cannot be established without compromising the very structure of that accountability system. The Florida Comprehensive Assessment Test (FCAT) is administered to all eligible students, and the results are used for decisions that can carry high stakes. To alter the composition of item formats for some students from what is currently used for all others, and still be able to draw these conclusions, would not only have political repercussions but would not be tenable. The approach taken here instead seeks the emergence of patterns from data obtained from test administrations under actual operating conditions. This also has the added advantage of incorporating the first two years of results from what, at the time, was a change in the FCAT testing program and its application to the Florida School Accountability System. Despite the changes, including the addition of annual learning gains to the calculation of school performance grades, continued improvement was widely reported in 2002 and even more so in 2003, with almost half of all schools earning an A grade. The first results under the federal No Child Left Behind Act of 2001 (NCLB) were also made available in 2003. For Florida, these results were not as encouraging: only a little over 400 of 3,000 schools met all requirements for making Adequate Yearly Progress (AYP). These mixed messages, along with recent results of the National Assessment of
Educational Progress, lead to the conclusion that, while Florida as a whole is making progress, it simply is not enough for many, and much remains to be done. Toward this end, the analyses presented here focused attention on students within Achievement Level 1 or 2, that is, students performing at the Basic or Below Basic level. The 2003 NCLB results show that the district from which the data for these analyses were drawn did not make AYP. A closer look at the subgroups shows that Black students and students with disabilities did not meet the required percentage proficient, i.e., AYP, in reading, with economically disadvantaged students coming close to not meeting the criterion. In mathematics, none of these subgroups made AYP. Although the analyses are from 2001 and 2002 and focus on grades 6 through 10, the discussion below may help clarify these and other patterns.

Trends by Grade

Reading

Grade 6. Mean differences for all subgroups of low performing students failed to demonstrate one year's growth.
Grade 7. Mean differences for all subgroups of low performing students demonstrated one year's growth, except for white males.
Grade 8. Mean differences for all subgroups of low performing students demonstrated one year's growth, except for females, Black females, Black males, economically disadvantaged students, and regular education students.
Grade 9. Mean differences for all subgroups of low performing students demonstrated one year's growth.
Grade 10. Mean differences for all subgroups of low performing students failed to demonstrate one year's growth.
Mathematics

Grade 6. Mean differences for all subgroups of low performing students demonstrated one year's growth.
Grade 7. Mean differences for all subgroups of low performing students failed to demonstrate one year's growth, except for white females and regular education students.
Grade 8. Mean differences for all subgroups of low performing students demonstrated one year's growth.
Grade 9. Mean differences for all subgroups of low performing students demonstrated one year's growth.
Grade 10. Mean differences for all subgroups of low performing students demonstrated one year's growth, except for white females and non-economically disadvantaged students.

Item Format Effects

Results indicated that constructed response (CR) items may be of benefit to many low performing students in mathematics, but to a lesser extent in reading. All subgroups made one year's growth in mathematics in both of the grades analyzed that utilize CR items, with the exception of white females and non-economically disadvantaged students in grade 10. Further inspection shows that the developmental scale scores for both of these subgroups were higher than those of all other subgroups in their respective analyses in both 2001 and 2002. Differential results between subgroups within the grades where CR items are utilized were most evident in reading. In grade 8, the aggregate demonstrated one year's growth, though not by much. Under every other disaggregation, at least one subgroup, and in most cases more than half of all students, did not demonstrate one year's growth.
Females did not make one year's growth, though they demonstrated higher developmental scale scores than males in both years of interest. Black females and males did not make one year's growth and scored lower than white females and males in both years of interest. Economically disadvantaged students did not make one year's growth and scored lower than non-economically disadvantaged students in both years of interest. Regular education students did not make one year's growth, though they demonstrated higher developmental scale scores than students with disabilities. In grade 10, no subgroup of low performing students demonstrated one year's growth.

Closing Remarks

Utilizing constructed response items on statewide assessments has been reported by those in the classroom to a) place a greater emphasis on higher-level thinking and problem solving (Huebert & Hauser, 1998), b) provide a more transparent link to the everyday curriculum, and c) increase motivation (CCSSO, 2002b). All of these are aligned with the standards-based reform that is at the very root of the accountability movement (NCTE, 1996; NCTM, 1995). These implications for teaching and learning are too important to be ignored and lend evidence of consequential validity to the inclusion of constructed response items on high-stakes tests (AERA, APA, & NCME, 1999). Lastly, and most importantly, it has also been reported that constructed response items lend construct validity to interpretations made from test scores (Messick, 1989; Snow, 1993), though the more contemporary paradigm frames this as evidence of content validity, inclusive of item format (AERA, APA, & NCME, 1999). Constructed response items may approach the original definition of the term assessment, from the Latin assidere, which means to sit beside. Teacher and student working alongside one
another is what large-scale assessment yearns to be, and it may well be that constructed response items best paint that picture.
APPENDIX A
GRADING FLORIDA PUBLIC SCHOOLS
APPENDIX B
ANNUAL AYP OBJECTIVES FOR READING
APPENDIX C
ANNUAL AYP OBJECTIVES FOR MATHEMATICS
LIST OF REFERENCES

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Amrein, A. L., & Berliner, D. C. (2002, March 28). High-stakes testing, uncertainty, and student learning. Education Policy Analysis Archives, 10(18). Retrieved February 11, 2003, from http://epaa.asu.edu/epaa/v10n18/
Ananda, S. (2003). Rethinking issues of alignment under No Child Left Behind. San Francisco: WestEd.
Barton, P. E., & Coley, R. J. (1994). Testing in America's schools. Princeton, NJ: Educational Testing Service, Policy Information Center.
Bennett, R. E., Rock, D. A., & Wang, M. (1991). Equivalence of free-response and multiple-choice items. Journal of Educational Measurement, 28(1), 77-92.
Brennan, R. L. (1995). Generalizability in performance assessments. Educational Measurement: Issues and Practice, 14(4), 9-12, 27.
Bridgeman, B. (1992). A comparison of quantitative questions in open-ended and multiple-choice formats. Journal of Educational Measurement, 29(3), 253-271.
Bridgeman, B., & Rock, D. A. (1993). Relationships among multiple-choice and open-ended analytical questions. Journal of Educational Measurement, 30(4), 313-329.
Carnoy, M., & Loeb, S. (2002). Does external accountability affect student outcomes? A cross-state analysis. Educational Evaluation and Policy Analysis, 24(4), 305-331.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). New York: Academic Press.
Council of Chief State School Officers. (1999). Annual survey of state student assessment programs: A summary report, fall 1999. Washington, DC: Author.
Council of Chief State School Officers. (2002a). A guide to effective accountability reporting. Washington, DC: Author.
Council of Chief State School Officers. (2002b). The role of performance-based assessments in large-scale accountability systems: Lessons learned from the inside. Washington, DC: Author.
Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. New York: Wadsworth.
Cronbach, L. J., & Furby, L. (1970). How we should measure "change" - or should we? Psychological Bulletin, 74, 68-80.
DeFur, S. H. (2002). Education reform, high-stakes assessment, and students with disabilities: One state's approach. Remedial and Special Education, 23(4), 203-211.
Dunbar, S. B., Koretz, D. M., & Hoover, H. D. (1991). Quality control in the development and use of performance assessments. Applied Measurement in Education, 4(4), 289-303.
Florida Department of Education [FDOE]. (2001). FCAT briefing book. Tallahassee, FL: Author.
Florida Department of Education [FDOE]. (2002). Technical report: For operational test administrations of the 2000 Florida Comprehensive Assessment Test. Tallahassee, FL: Author.
Florida Department of Education [FDOE]. (2003a). Understanding FCAT reports. Tallahassee, FL: Author.
Florida Department of Education [FDOE]. (2003b). 2003 Guide to calculating school grades: Technical assistance paper. Tallahassee, FL: Author.
Florida Department of Education [FDOE]. (2003c). Consolidated state application accountability workbook for state grants under Title IX, Part C, Sec. 9302 of the Elementary and Secondary Education Act (Pub. L. No. 107-110). March 26.
Florida Department of Education [FDOE]. (2003d). Growth of minority student populations in Florida's public schools. Tallahassee, FL: Author.
Hamilton, L. S., McCaffrey, D. F., Stecher, B. M., Klein, S. P., Robyn, A., & Bugliari, D. (2003). Studying large-scale reforms of instructional practice: An example from mathematics and science. Educational Evaluation and Policy Analysis, 20(2), 95-113.
Huebert, J. P., & Hauser, R. M. (Eds.). (1998). High-stakes testing for tracking, promotion, and graduation. Washington, DC: National Academy Press.
Klein, S. P., Hamilton, L. S., McCaffrey, D. F., Stecher, B. M., Robyn, A., & Burroughs, D. (2000). Teaching practices and student achievement: Report of first-year results from the Mosaic Study of Systemic Initiatives in Mathematics and Science (MR-1233-EDU). Santa Monica, CA: RAND.
Linn, R. L. (1994). Performance assessment: Policy promises and technical measurement standards. Educational Researcher, 23(9), 4-14.
Linn, R. L. (2000). Assessments and accountability. Educational Researcher, 29(2), 4-16.
Linn, R. L., Baker, E. L., & Herman, J. L. (2002, Fall). Minimum group size for measuring adequate yearly progress. The CRESST Line, 1, 4-5. (Newsletter of the National Center for Research on Evaluation, Standards, and Student Testing [CRESST], University of California, Los Angeles)
Lord, F. M. (1963). Elementary models for measuring change. In C. W. Harris (Ed.), Problems in measuring change (pp. 21-38). Madison, WI: University of Wisconsin Press.
Lukhele, R., Thissen, D., & Wainer, H. (1994). On the relative value of multiple-choice, constructed-response, and examinee-selected items on two achievement tests. Journal of Educational Measurement, 31(3), 234-250.
McCaffrey, D. F., Hamilton, L. S., Stecher, B. M., Klein, S. P., Bugliari, D., & Robyn, A. (2001). Interactions among instructional practices, curriculum, and student achievement: The case of standards-based high school mathematics. Journal for Research in Mathematics Education, 32(5), 493-517.
Mehrens, W. A. (1992). Using performance assessment for accountability purposes. Educational Measurement: Issues and Practice, 11(1), 3-9, 20.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). New York: Macmillan.
Messick, S. (1993). Trait equivalence as construct validity of score interpretation across multiple methods of measurement. In R. E. Bennett & W. C. Ward (Eds.), Construction versus choice in cognitive measurement (pp. 61-74). Mahwah, NJ: Lawrence Erlbaum.
Messick, S. (1995). Standards of validity and the validity of standards in performance assessment. Educational Measurement: Issues and Practice, 14(4), 5-8.
Miller, M. D. (2002). Generalizability of performance-based assessments. Washington, DC: Council of Chief State School Officers.
National Council of Teachers of English. (1996). Standards for the English language arts: A joint project of the National Council of Teachers of English and the International Reading Association. Urbana, IL: Author.
National Council of Teachers of Mathematics. (1995). Assessment standards for school mathematics. Reston, VA: Author.
No Child Left Behind Act of 2001, Public Law 107-110, Stat. 1425, 107th Congress (2002).
Paige, R. (2003, June 27). Key policy letters signed by the Education Secretary or Deputy Secretary. Retrieved February 10, 2004, from http://www.ed.gov/policy/speced/guid/secletter/030627.html
Pellegrino, J. W., Jones, L. R., & Mitchell, K. J. (1999). Grading the nation's report card: Evaluating NAEP and transforming the assessment of educational progress. Washington, DC: National Academy Press.
Rodriguez, M. C. (2003). Construct equivalence of multiple-choice and constructed-response items: A random effects synthesis of correlations. Journal of Educational Measurement, 40(2), 163-184.
Ryan, J. M., & DeMark, S. (2002). Variation in achievement scores related to gender, item format, and content area tested. In J. Tindal & T. Haladyna (Eds.), Large-scale assessment programs for all students: Validity, technical quality, and implementation (pp. 67-88). Mahwah, NJ: Lawrence Erlbaum Associates.
Seltzer, M. H., Frank, K. A., & Bryk, A. S. (1994). The metric matters: The sensitivity of conclusions about growth in student achievement to choice of metric. Educational Evaluation and Policy Analysis, 16(1), 41-49.
Slinde, J. A., & Linn, R. L. (1977). Vertically equated tests: Fact or phantom? Journal of Educational Measurement, 14(1), 23-32.
Snow, R. E. (1993). Construct validity and constructed-response tests. In R. E. Bennett & W. C. Ward (Eds.), Construction versus choice in cognitive measurement: Issues in constructed response, performance testing, and portfolio assessment (pp. 45-60). Hillsdale, NJ: Lawrence Erlbaum.
Thissen, D. (1991). Multilog user's guide. Lincolnwood, IL: Scientific Software.
Thissen, D., & Wainer, H. (2001). Test scoring. Hillsdale, NJ: Lawrence Erlbaum.
Thissen, D., Wainer, H., & Wang, X. (1994). Are tests comprising both multiple-choice and free-response items necessarily less unidimensional than multiple-choice tests? An analysis of two tests. Journal of Educational Measurement, 31(2), 113-123.
Traub, R. E. (1993). On the equivalence of the traits assessed by multiple-choice and constructed-response tests. In R. E. Bennett & W. C. Ward (Eds.), Construction versus choice in cognitive measurement: Issues in constructed response, performance testing, and portfolio assessment (pp. 45-60). Hillsdale, NJ: Lawrence Erlbaum.
Wainer, H., & Thissen, D. (1993). Combining multiple-choice and constructed-response test scores: Towards a Marxist theory of test construction. Applied Measurement in Education, 6(2), 103-118.
Yen, W. M., & Ferrara, S. (1997). The Maryland school assessment program: Performance assessment with psychometric quality suitable for high stakes usage. Educational and Psychological Measurement, 57(1), 60-84.
BIOGRAPHICAL SKETCH

Stewart Francis Elizondo was born in San Jose, Costa Rica. At six years of age he moved to Old Bridge, New Jersey. At eighteen, he moved to Ocala, Florida, where he attended Lake Weir High School. Before graduating, he had the honor of receiving a nomination to the United States Military Academy from Kenneth Hood "Buddy" MacKay Jr., former Governor of Florida. After pursuing that honor to its end, he worked his way through Central Florida Community College in Ocala with the assistance of a scholarship from the Silver Springs Shores Lions Club. After two years, he was fortunate enough to be awarded a Critical Teacher Scholarship in mathematics from the Florida Department of Education (FDOE). He transferred to the University of South Florida in Tampa and graduated with a Bachelor of Arts in mathematics and education. Returning to Ocala, Stewart then taught for ten years in the public school system. He is proud to have taught in a state-recognized, high-performing A school. His service includes mentoring beginning teachers in the county's Peer Orientation Program, sitting on county textbook adoption committees, and facilitating school- and district-wide workshops. His students also twice nominated him for Teacher of the Year. Concurrent with his teaching, he served on the Florida Mathematics Curriculum Frameworks and Sunshine State Standards (SSS) Writing Team. The FDOE approved the SSS as Florida's academic standards and the Florida Comprehensive Assessment Test (FCAT) as its assessment tool. These changes were also incorporated into Governor Jeb
Bush's A+ Education Plan, which was approved by the state legislature through an amendment to Section 229.57, F.S. While still teaching, he worked with the FDOE and its Test Development Center (TDC) on several statewide assessment endeavors. These include collaborations with Harcourt Educational Measurement in the development of the FCAT, specifically grade nine and ten mathematics item reviews. He later worked with the TDC and NCS Pearson on the scoring of FCAT items, specifically in developing and adjusting training guidelines for the scoring of grade ten mathematics performance tasks using student responses from field-tested and operational items. Letters expressing appreciation from Florida Commissioner of Education James Wallace Horne, as well as former Commissioners Charles J. Crist Jr., Frank T. Brogan, and the late Douglas L. Jamerson, mark the timeline of his service to these state projects. These activities rekindled a long-standing passion for the assessment field and led him to pursue various roles in the field and, ultimately, an advanced degree. Of note, he has been an item writer for Harcourt Educational Measurement and has written mathematics items for the Massachusetts Comprehensive Assessment System and the Connecticut Academic Performance Test. He has also served as both proctor and administrator for the Florida Teacher Certification Examination and the College-Level Academic Skills Test. Stewart currently is an Alumni Fellow and graduate teaching assistant at the University of Florida. He is grateful to still have the opportunity to be in the classroom, as he has taught undergraduate courses in elementary mathematics methods and
graduate courses in measurement and assessment while continuing to pursue a doctorate in research and evaluation methodology. His greatest joy has always been, and still remains, spending time with his sons, Spencer and Seth, and his wife, Stephanie.