
Accountability and the Florida Comprehensive Assessment Test


ACCOUNTABILITY AND THE FLORIDA COMPREHENSIVE ASSESSMENT TEST: EFFECTS OF ITEM FORMAT ON LOW PERFORMING STUDENTS IN MEASURING ONE YEAR'S GROWTH

By

STEWART FRANCIS ELIZONDO

A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF ARTS IN EDUCATION

UNIVERSITY OF FLORIDA
2004


Copyright 2004 by Stewart Francis Elizondo


This work is dedicated to Stephanie, who has unconditionally supported me not only in this endeavor, but also in all of my undertakings over the last seventeen years.

ACKNOWLEDGMENTS

I acknowledge my committee chair, Dr. M. David Miller, and committee member, Dr. Anne E. Seraphine, for their skilled mentorship and expertise. They have allowed this process to be challenging, yet enjoyable and rewarding. Of course, any errors remaining in this paper are entirely my own responsibility. Gratitude is also given to Dr. Bridget A. Franks, graduate coordinator, and Dr. James Algina, Director of the Research and Evaluation Program, for their assistance in helping me secure an Alumni Fellowship, without which this project might not have come to be, or at least would have been protracted and less cohesive. Enough appreciation cannot be expressed to my family for their undying support of my graduate studies. Repaying the concessions made by my sons, Spencer and Seth, is something I cannot do in my lifetime, but I look forward to making the effort. The encouragement and support given to me by my in-laws, Jeanne and John Garrod, are also not taken for granted, and I am most thankful. I would like to thank my mother, Elizabeth Marder. It is because of her that I realize and celebrate the fact that education is truly a lifelong process. Finally, the many thanks owed to my wife, Stephanie, are beyond words. I would not be half of what I am if it were not for her.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER 1  INTRODUCTION
    The Florida Comprehensive Assessment Test
    State and Federal Accountability
        The Florida School Accountability System
        The No Child Left Behind Act of 2001
    Low Performing Students and One Year's Growth

CHAPTER 2  REVIEW OF LITERATURE
    The Movement Towards Constructed Response
    Trait Equivalence Between Item Formats
    Psychometric Differences Between Item Formats
    Combining Item Formats
    Differential Effects of Item Formats on Various Groups

CHAPTER 3  METHODS
    Data
    Sample
    Measures
    Analysis Approach

CHAPTER 4  RESULTS
    Aggregate Developmental Scale Score Analysis
        Reading
        Mathematics

    Developmental Scale Score Contrasts by Sex
        Reading
        Mathematics
    Developmental Scale Score Contrasts by Sex and Race
        Reading
        Mathematics
    Developmental Scale Score Contrasts by Socio-economic Status
        Reading
        Mathematics
    Developmental Scale Score Contrasts for Students with Disabilities and Regular Education Students
        Reading
        Mathematics
    Summary

CHAPTER 5  DISCUSSION
    Trends by Grade
        Reading
        Mathematics
    Item Format Effects
    Closing Remarks

APPENDIX A  GRADING FLORIDA PUBLIC SCHOOLS
APPENDIX B  ANNUAL AYP OBJECTIVES FOR READING
APPENDIX C  ANNUAL AYP OBJECTIVES FOR MATHEMATICS

LIST OF REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

1-1   Florida Comprehensive Assessment Test Item Formats by Grade
1-2   Florida Comprehensive Assessment Test Achievement Levels
3-1   Matched Low Performing Students by Grade for Reading and Mathematics
3-2   Disaggregation of Matched District and Low Performing Students by Subgroup for Grades 6 through 10 for Reading and Mathematics
4-1   Mean Aggregate Reading DSS for Matched Low Performing Students not Making Gains by Achievement Level Criteria
4-2   Mean Differences for Aggregate Reading DSS for Matched Low Performing Students not Making Gains by Achievement Level Criteria from 2001 to 2002
4-3   Mean Aggregate Mathematics DSS for Matched Low Performing Students not Making Gains by Achievement Level Criteria
4-4   Mean Differences for Aggregate Mathematics DSS for Matched Low Performing Students not Making Gains by Achievement Level Criteria from 2001 to 2002
4-5   Mean Reading DSS by Sex for Matched Low Performing Students not Making Gains by Achievement Level Criteria
4-6   Mean Differences for Reading DSS by Sex for Matched Low Performing Students not Making Gains by Achievement Level Criteria from 2001 to 2002
4-7   Mean Mathematics DSS by Sex for Matched Low Performing Students not Making Gains by Achievement Level Criteria
4-8   Mean Differences for Mathematics DSS by Sex for Matched Low Performing Students not Making Gains by Achievement Level Criteria from 2001 to 2002
4-9   Mean Reading DSS by Sex and Race for Matched Low Performing Students not Making Gains by Achievement Level Criteria
4-10  Mean Differences for Reading DSS by Sex and Race for Matched Low Performing Students not Making Gains by Achievement Level Criteria from 2001 to 2002

4-11  Mean Mathematics DSS by Sex and Race for Matched Low Performing Students not Making Gains by Achievement Level Criteria
4-12  Mean Differences for Mathematics DSS by Sex and Race for Matched Low Performing Students not Making Gains by Achievement Level Criteria from 2001 to 2002
4-13  Mean Reading DSS by SES for Matched Low Performing Students not Making Gains by Achievement Level Criteria
4-14  Mean Differences for Reading DSS by SES for Matched Low Performing Students not Making Gains by Achievement Level Criteria from 2001 to 2002
4-15  Mean Mathematics DSS by SES for Matched Low Performing Students not Making Gains by Achievement Level Criteria
4-16  Mean Differences for Mathematics DSS by SES for Matched Low Performing Students not Making Gains by Achievement Level Criteria from 2001 to 2002
4-17  Mean Reading DSS for Matched Low Performing Students with Disabilities and Regular Education Students not Making Gains by Achievement Level Criteria
4-18  Mean Differences for Reading DSS for Matched Low Performing Students with Disabilities and Regular Education Students not Making Gains by Achievement Level Criteria from 2001 to 2002
4-19  Mean Mathematics DSS for Matched Low Performing Students with Disabilities and Regular Education Students not Making Gains by Achievement Level Criteria
4-20  Mean Differences for Mathematics DSS for Matched Low Performing Students with Disabilities and Regular Education Students not Making Gains by Achievement Level Criteria from 2001 to 2002
4-21  Subgroups of Low Performing Students Demonstrating More than One Year's Growth in Reading and Mathematics

LIST OF FIGURES

3-1   FCAT Achievement Levels for the Developmental Scale
3-2   Expected Growth under Gain Alternative 3

Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Arts in Education

ACCOUNTABILITY AND THE FLORIDA COMPREHENSIVE ASSESSMENT TEST: EFFECTS OF ITEM FORMAT ON LOW PERFORMING STUDENTS IN MEASURING ONE YEAR'S GROWTH

By Stewart Francis Elizondo

May 2004

Chair: M. David Miller
Major Department: Educational Psychology

The impact of constructed response items in the measurement of one year's growth on the Florida Comprehensive Assessment Test for low performing students is analyzed. Growth was measured from the 2001 to the 2002 Sunshine State Standards assessment on the vertically equated developmental scores as defined by the Florida School Accountability System. Data in reading and mathematics for 6th through 10th grade were examined for a mid-sized school district and disaggregated into several subgroups as defined in the state No Child Left Behind Accountability Plan. These results were compared across grade levels to see if growth differed in the grade levels that are assessed with constructed response items. Results indicated that constructed response items may be of benefit to many low performing students in mathematics, but to a lesser extent in reading. Differential results between subgroups were most evident in reading.

CHAPTER 1
INTRODUCTION

Enacted in 1968, the Educational Accountability Act, Section 229.51, Florida Statutes, empowered the Commissioner of Education to use all appropriate management tools, techniques, and practices which will cause the state's educational programs to be more effective and which will provide the greatest economies in the management and operation of the state's system of education. Subsequent changes to the Florida Department of Education's (FDOE) capabilities toward that end are marked by a stream of legislation that has produced what can currently be characterized as an environment of accountability. Utilizing components of the current accountability systems, questions regarding Florida's lowest performing students will be analyzed.

The Florida Comprehensive Assessment Test

The Florida Comprehensive Assessment Test (FCAT) currently serves as the measure for the statewide School Accountability System as envisioned by the A+ Plan for Education. It also serves as the measure for the federal No Child Left Behind Act of 2001 (NCLB) for all public schools. Both accountability systems include requirements for annual student growth. Section 229.57, Florida Statutes, was amended in 1999 and specifies that school performance grade category designations shall be based on the school's current year's performance and the school's annual learning gains (FDOE, 2001). Section 1111(b)(2)(H) of NCLB mandates that intermediate goals for adequate yearly progress shall be established (NCLB, 2001).

Reading and mathematics components of FCAT based on the Sunshine State Standards (SSS) are administered in grades 3 through 10. These components are commonly referred to as FCAT SSS Reading and FCAT SSS Mathematics. Emphasis on these standards-based assessments by the accountability systems makes their construction, scoring, and reporting of special interest to stakeholders. Constructed response (CR) items are utilized on the FCAT SSS Reading assessment in grades 4, 8, and 10 and on the FCAT SSS Mathematics assessment in grades 5, 8, and 10. CR items are either Short-Response (SR) or Extended-Response (ER). Not only does item format vary, but the weight given to item formats can also vary across grades (see Table 1-1).

Table 1-1. Florida Comprehensive Assessment Test Item Formats by Grade

                 Reading                       Mathematics
Grade   Item Formats   % of Items     Item Formats   % of Items
3       MC             100%           MC             100%
4       MC             85-90%         MC             100%
        SR & ER        10-15%
5       MC             100%           MC             60-70%
                                      GR             20-25%
                                      SR & ER        10-15%
6       MC             100%           MC             60-70%
                                      GR             30-40%
7       MC             100%           MC             60-70%
                                      GR             30-40%
8       MC             85-90%         MC             50-60%
        SR & ER        10-15%         GR             25-30%
                                      SR & ER        10-15%
9       MC             100%           MC             60-70%
                                      GR             30-40%
10      MC             85-90%         MC             50-60%
        SR & ER        10-15%         GR             25-30%
                                      SR & ER        10-15%

Note. MC: Multiple-Choice, GR: Gridded-Response, SR: Short-Response, and ER: Extended-Response.

Student scores have been reported on a scale of 100 to 500 for both the FCAT SSS Reading and Mathematics since they were first reported in 1998.

Though these scores yield Achievement Levels that are used for the School Accountability System and NCLB (see Table 1-2), they are not intended for interpretation from grade to grade.

Table 1-2. Florida Comprehensive Assessment Test Achievement Levels

Level 5 (Advanced)      Performance at this level indicates that the student has success with the most challenging content of the Sunshine State Standards. A Level 5 student answers most of the test questions correctly, including the most challenging questions.
Level 4 (Proficient)    Performance at this level indicates that the student has success with the challenging content of the Sunshine State Standards. A Level 4 student answers most of the questions correctly, but may have only some success with questions that reflect the most challenging content.
Level 3 (Proficient)    Performance at this level indicates that the student has partial success with the challenging content of the Sunshine State Standards, but performance is inconsistent. A Level 3 student answers many of the questions correctly, but is generally less successful with questions that are most challenging.
Level 2 (Basic)         Performance at this level indicates that the student has limited success with the challenging content of the Sunshine State Standards.
Level 1 (Below Basic)   Performance at this level indicates that the student has little success with the challenging content of the Sunshine State Standards.

To address this, FCAT results began to be reported on a vertically equated scale that ranges from 86 to 3008 across grades 3 through 10. This facilitates interpretations of annual progress from grade to grade. Developmental Scale Scores (DSS), as well as DSS Change, the difference between consecutive years, appear on student reports as well as on school, district, and state reports for the aggregate means (FDOE, 2003a). It has been proposed that NCLB encourages the use of vertical scaling procedures (Ananda, 2003). Though FCAT field-testing began in 1997, reports including DSS have only been available since 2001.

State and Federal Accountability

The Florida School Accountability System

In 1999, school accountability became more visible to the public with the advent of school performance grades. Category designations range from A, making excellent progress, to F, failing to make adequate progress (see Appendix A). Schools are evaluated on the basis of aggregate student performance on FCAT SSS Reading and FCAT SSS Mathematics in grades 3 through 10, and the Florida Writes statewide writing assessment administered in grades 4, 8, and 10. Currently, three major components contribute to their calculation:

1. Yearly achievement of high standards in reading, mathematics, and writing;
2. Annual learning gains in reading and mathematics; and
3. Annual learning gains in reading for the lowest 25% of students in each school.

As of 2002, annual learning gains on FCAT SSS Mathematics and FCAT SSS Reading account for half of the point system that yields school performance grades, with special attention given to the reading gains of the lowest 25% of students in each school. There are three ways that schools can be credited for the annual learning gains of their students:

1. When students improve their FCAT Achievement Levels (1-2, 2-3, 3-4, 4-5); or
2. When students maintain a relatively high Achievement Level (3, 4, or 5); or
3. When students demonstrate more than one year's growth within Levels 1 or 2, as measured by an increase in their FCAT Developmental Scale Scores from one year to the next (FDOE, 2003b).

Incentives for schools graded A, and for those that improve at least one grade level, include monetary awards under the Florida School Recognition Program under Section 231.2905, Florida Statutes.

Sanctions for schools graded F for two years in a four-year period include eligibility for Opportunity Scholarships under Section 229.0537, Florida Statutes, so that students can attend higher performing public or private schools (FDOE, 2001).

The No Child Left Behind Act of 2001

The inclusion of all students when making accountability calculations was not a common practice prior to the passing of the federal No Child Left Behind Act of 2001 (Linn, 2000). Approved in April of 2003, Florida's State Accountability Plan meets this defining requirement and builds on the Florida School Accountability System. It is also compliant with the Adequate Yearly Progress (AYP) components of the federal law. To make AYP, the percentage of students earning a score of Proficient or above in reading and mathematics has to meet or exceed the annual objectives for the given year (see Appendices B and C). Furthermore, not only will AYP criteria be applied to the aggregate, but they will also be applied separately to subgroups that disaggregate the data by race, socio-economic status, disability, and limited English proficiency. If a school fails to meet the annual objectives in reading or mathematics for any subgroup in any content area, the school is designated as not making AYP. The expectation for growth is such that all students are Proficient or above on the FCAT SSS Reading and Mathematics no later than 2014 (FDOE, 2003c).
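Stated as a decision rule, the AYP designation described above reduces to a comparison of each reported subgroup's percentage of Proficient students against the annual objective. The following is a minimal sketch of that rule in Python; the subgroup names, percentages, and objective are hypothetical placeholders (the actual objectives appear in Appendices B and C), not FDOE data or software.

```python
# Minimal sketch of the AYP decision rule described above.
# All subgroup percentages and the objective below are hypothetical placeholders.

def makes_ayp(percent_proficient_by_subgroup, annual_objective):
    """Return True only if every reported subgroup meets or exceeds
    the annual objective for percent Proficient or above."""
    return all(pct >= annual_objective
               for pct in percent_proficient_by_subgroup.values())

# Hypothetical reading results for one school (percent Proficient or above).
reading_results = {
    "Aggregate": 44.0,
    "Black": 31.0,
    "White": 52.0,
    "Economically Disadvantaged": 29.0,
    "Students with Disabilities": 25.0,
    "Limited English Proficient": 27.0,
}

# Hypothetical annual objective for the given year.
reading_objective = 31.0

# A single subgroup below the objective designates the school as not making AYP.
print(makes_ayp(reading_results, reading_objective))  # False: several subgroups fall short
```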

Low Performing Students and One Year's Growth

The fundamental question addressed in this paper is the following: Do grades with constructed response items show differential growth for various subgroups of students within Achievement Levels 1 or 2, as measured by an increase in their FCAT Developmental Scale Scores from one year to the next? The subgroups of interest are those defined by NCLB. These annual gains directly impact the calculation of school performance grades under the Florida School Accountability System, and can serve as a method of monitoring the progress and composition of those non-Proficient students for the AYP calculations under NCLB. Comparisons of item format effects will be explored for FCAT SSS Reading and Mathematics in grades 6 through 10. Other issues, though very relevant, such as: a) the longstanding debate on how to yield gain scores (Cronbach & Furby, 1970; Lord, 1963), b) the discrepancies surrounding the various methods of vertically equating tests, especially when including multiple formats, and their respective implications (Crocker & Algina, 1986; Slinde & Linn, 1977; Thissen & Wainer, 2001), and c) the more recent debate over the effects of high-stakes testing (Amrein & Berliner, 2002; Carnoy & Loeb, 2002) will be set aside in an effort to work within the existing framework of the current accountability systems. Unintended item format effects, especially on the FCAT SSS Reading and Mathematics assessments, given their emphasis by both the Florida School Accountability System and the federal No Child Left Behind Act, only grow in importance as trends in Florida's student population continue. In 2002, 14 of Florida's 67 school districts reported minority enrollment of 50% or more. Many of these were among the most populous districts across the state. Furthermore, minority representation statewide has steadily increased from 29.9% in 1976 to 49.4% in 2002 (FDOE, 2003d).

CHAPTER 2
REVIEW OF LITERATURE

The Movement Towards Constructed Response

Some type of constructed response (CR) item was used in over three quarters of all statewide assessment programs (Council of Chief State School Officers [CCSSO], 1999). This is a marked change from the assessment practices of just a decade ago when Barton and Coley (1994), as cited in Linn (1994), stated that, "The nation is entering an era of change in testing and assessment. Efforts at both the national and state levels are now directed at greater use of performance assessment, constructed response questions, and portfolios based on actual student work" (p. 3). The hypothesis is that these types of items are more closely aligned with the standards-based reform movement currently underway in education and that they are better able to detect its effects (Hamilton et al., 2003; Klein et al., 2000). Some have recently tested whether the relationship between achievement and instruction varies as a function of item format (McCaffrey et al., 2001). One of the first and most prominent statewide programs to implement performance-based assessments in a high-stakes environment was the Maryland School Performance Assessment Program (MSPAP), first administered in 1991. Though replaced by the Maryland School Assessment in 2002, Maryland still utilizes CR items in all grades tested. Although virtually no information was available on the psychometric properties of performance assessment during the design stages of the MSPAP, results were encouraging and allowed for innovations in later years (Yen & Ferrara, 1997).

The use of CR items has since become a hallmark of quality testing programs, to the extent that a National Academy of Education panel repeatedly applauded the National Assessment of Educational Progress's continued move to include CR items. In a later review, the Committee on the Evaluation of National and State Assessments of Educational Progress, in conjunction with the Board on Testing and Assessment, called for an enhancement of the use of testing results, especially CR items, to provide better interpretation (Pellegrino et al., 1999).

Trait Equivalence Between Item Formats

The debate regarding the existence and interpretation of psychological differences tapped by CR items, as opposed to multiple choice (MC) items, has been equivocal. Thissen and Wainer (2001) make clear a basic assumption of psychometrics: "Before the responses of any set of items are combined into a single score that is taken to be, in some sense, representative of the responses to all of the items, we must ascertain the extent to which the items measure the same thing" (p. 10). Factor analysis techniques, among other methodologies, have been employed to analyze the extent to which CR items measure a different trait than do MC items. Evidence of a free-response factor was found by Bennett, Rock, and Wang (1991), though a single-factor solution was reported as providing the most parsimonious fit. Thissen, Wainer, and Wang (1994) reported that despite a small degree of multidimensionality, it would seem meaningful to combine the scores of MC and CR items as they, for the most part, measure the same underlying proficiency. These authors suggest that proficiency may be better estimated with only MC items due to a high correlation with the more reliable multiple-choice factor.

Still others have used factor analysis techniques with MC and CR items and found that CR items were not measuring anything beyond what was measured by the MC items (Bridgeman & Rock, 1993). With item response theory methodology, others have concluded that CR items yielded little information over and above that provided by MC items (Lukhele et al., 1994). A recent meta-analysis framed the question of construct (trait) equivalence as a function of stem equivalence between the two item types. It was found that when items are constructed in both formats using the same stem, the mean correlation between the two formats approaches unity and is significantly higher than when using non-stem-equivalent items (Rodriguez, 2003). It has also been suggested that a better aim of CR items is not to measure the same trait, but to measure some cognitive processes better than MC items can (Traub, 1993). This is consistent with a stronger positive relationship between reform practices and student achievement when measured with CR items rather than MC items (Klein et al., 2000).

Psychometric Differences Between Item Formats

The psychometric properties of CR items have also been a point of contention in the literature. Messick (1993) asserted that the question of trait equivalence was essentially a question of construct validity of test interpretation and use. Furthermore, this pragmatic view is tolerated to facilitate the effectiveness of testing purposes. The two major threats to construct validity, construct-irrelevant variance and construct underrepresentation (Messick, 1993), can be judiciously negotiated for the purposes of demonstrating construct-relevant variance with due representation of the construct. Messick posits that if the hypothesis of perfectly correlated true scores across formats is not rejected, then either format, or a combination of both, could be employed as construct indicators.

If the hypothesis is rejected, however, likely due to differential method variance, a dominant construct-relevant factor or second-order factor may cut across formats (Messick, 1993). Messick (1995) also later provided a thorough treatment on the broader systematic validation of performance assessments within the framework of the unified concept of validity (AERA, APA, & NCME, 1999). Returning to the issue of reliability raised earlier, Thissen and Wainer's (2001) suggestion that MC items more reliably measure a CR factor may, to paraphrase, come about because measuring something that is not quite right accurately can yield far better measurement than measuring the right thing poorly. Lastly, we are warned that however appealing CR items may be as exemplars of instruction or student performance, we must exercise caution when drawing inferences about people or changes in instruction due to their relatively limited generalizability across tasks (Brennan, 1995; Dunbar et al., 1991). In an analysis of state-mandated programs, Miller (2002) concluded that other advantages, such as consequences for instruction, are necessary to justify these concessions.

Combining Item Formats

Methods for how best to combine the use of CR items with MC items have also garnered much-needed attention. Mehrens (1992) put it succinctly when stating that we have known for decades that MC items measure some things well and efficiently, but they do not measure everything and they can be overemphasized. Performance assessments, on the other hand, have the potential to measure important instructional objectives that cannot be measured by MC items. He also notes that most large-scale assessments have added performance assessments to the more traditional tests, and not replaced them.

Wainer and Thissen (1993) add that combining the two formats may allow for the concatenation of their strengths while compensating for their weaknesses. More specifically, if a test has various components, it is sensible to weight the components as a function of their reliabilities, or to modify the lengths of the components to make them equally reliable. Given difficulties in the execution of the latter, it was recommended that serious consideration be given to IRT weighting in the construction and scoring of tests with mixed item formats. Just such a process is employed in the construction of the Florida Comprehensive Assessment Test (FCAT) via an interactive test construction system, which makes available the capability to store, retrieve, and manipulate IRT test information, test error, and expected scores (FDOE, 2002).
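To make the weighting idea concrete, the following sketch illustrates one simple reading of it: each section of a mixed-format test contributes to a composite in proportion to its estimated reliability. The section names, reliabilities, and scores are hypothetical, and the sketch illustrates reliability-based weighting in general rather than the IRT-based procedure actually used to construct and score the FCAT.

```python
# Illustrative sketch of reliability-based weighting of mixed-format sections.
# All section names, reliability estimates, and scores are hypothetical.

sections = {
    # section: (estimated reliability, student raw score, maximum possible score)
    "multiple_choice":      (0.91, 38, 45),
    "gridded_response":     (0.82, 10, 15),
    "constructed_response": (0.68,  6, 12),
}

# Weight each section in proportion to its estimated reliability,
# then combine the proportion-correct scores into a single composite.
total_reliability = sum(rel for rel, _, _ in sections.values())

composite = sum(
    (rel / total_reliability) * (score / max_score)
    for rel, score, max_score in sections.values()
)

print(f"Reliability-weighted composite (0-1 scale): {composite:.3f}")
```

A section with lower reliability (here, the constructed-response section) thus contributes less to the composite, which is the general trade-off Wainer and Thissen (1993) describe.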

Differential Effects of Item Formats on Various Groups

Another major component of the literature is a segment that seeks differential effects that CR items may have by gender, race, socio-economic status, and disability. These contrasts are most timely, given the nature of the Adequate Yearly Progress requirements of the No Child Left Behind Act of 2001 where, by design, data are disaggregated by such groups. In an analysis of open-ended counterparts to a set of items from the quantitative section of the Graduate Record Exam, Bridgeman (1992) concluded that gender and ethnicity differences were neither lessened nor exaggerated and that there were no significant interactions of test format with either gender or ethnicity. Similarly, in a meta-analysis of over 30 studies, Ryan and DeMark (2002) conclude that females typically outperform males in mathematics and language when students must construct their responses, but effect sizes are small or smaller by Cohen's (1988) standards. There have been mixed results on the inclusion of students with disabilities in high-stakes standards-based assessments. DeFur (2002) summarizes one such example in Virginia's experiences with their Standards of Learning and an inclusion policy. Conspicuous by their absence are studies on differential effects of item formats involving students with disabilities. This may soon draw needed attention with the March 2003 announcement of a statutory NCLB provision. Effective for the 2003-2004 academic year, it states that in calculating AYP, at most 1% of all students tested can be held to alternative achievement standards at the district and state levels (Paige, 2003). The provision was published in the Federal Register, Vol. 68, No. 236, on December 9, 2003.

CHAPTER 3
METHODS

Data

The electronic file received by a mid-sized school district in May 2002 from the Florida Department of Education (FDOE) was analyzed. This file contained Florida Comprehensive Assessment Test (FCAT) scores from the 2001 and 2002 administrations. The FCAT data for Florida had changed in 2001 in two key ways. First, beginning with the 2001 administration, the FCAT program was expanded to include all grades 3 through 10. Prior to this, reading was assessed only in grades 4, 8, and 10, while mathematics was assessed only in grades 5, 8, and 10. Second, these changes offered the opportunity to introduce a Developmental Scale Score (DSS) that linked adjacent grades together, allowing progress to be tracked over time.

Sample

The data set consisted of 2,402 low performing students in reading and 2,506 low performing students in mathematics in grades 6 through 10 that had a score from both the 2001 and 2002 FCAT administrations. Table 3-1 shows the composition of the students by grade. Low performing is defined as remaining within the Basic or Below Basic performance standards, commonly referred to as Level 1 and Level 2 (FDOE, 2003c).

Table 3-1. Matched Low Performing Students by Grade for Reading and Mathematics

Grade   Reading   Mathematics
6       498       569
7       475       557
8       477       533
9       489       462
10      463       385

Disaggregation of low performing students by sex, race, socio-economic status, and disability is shown in Table 3-2 for reading and mathematics. For comparison, disaggregation of all matched district students, also in grades 6 through 10, is provided in Table 3-2.

Table 3-2. Disaggregation of Matched District and Low Performing Students by Subgroup for Grades 6 through 10 for Reading and Mathematics

                          Reading                              Mathematics
                All Students   Low Performing       All Students   Low Performing
Agg             8,399          2,402                8,404          2,506
Sex    F        4,307 (51%)    1,158 (48%)          4,309 (51%)    1,300 (52%)
       M        4,092 (49%)    1,244 (52%)          4,095 (49%)    1,206 (48%)
Race   B        3,158 (38%)    1,646 (69%)          3,161 (38%)    1,767 (71%)
       H          326 (4%)        77                  328 (4%)        70
       W        4,915 (59%)      756 (31%)          4,915 (58%)      739 (29%)
SES    ED       3,271 (39%)    1,543 (64%)          3,275 (39%)    1,671 (67%)
       N-ED     5,128 (61%)      859 (36%)          5,128 (61%)      835 (33%)
SWD    Gift       853 (10%)       16                  852 (10%)       11
       RES      5,928 (71%)    1,311 (55%)          5,928 (71%)    1,430 (57%)
       SWD      1,618 (19%)    1,075 (45%)          1,624 (19%)    1,065 (42%)

Note. Agg = Aggregate, F = Female, M = Male, B = Black, H = Hispanic, W = White, SES = Socio-economic Status, ED = Economically Disadvantaged, N-ED = Non-Economically Disadvantaged, SWD = Students With Disabilities, Gift = Gifted Students, RES = Regular Education Students.

Gifted students and Hispanic students were omitted from analysis due to small sample sizes. Disaggregated reporting for Adequate Yearly Progress (AYP) is not required when the number of students is insufficient to yield statistically reliable information or would reveal personally identifiable information (NCLB, 2002). Linn et al. (2002) suggested a minimum number of 25. The FDOE subscribes to a minimum group size of 30 for school performance grades (FDOE, 2003b) and for AYP calculations (FDOE, 2003c). In the analyses that follow, subgroups of low performing students are further disaggregated by grade and, in all cases, gifted students and Hispanic students fail to meet even Linn's suggested minimum. In several cases, group size was as low as zero or one.
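The sample-selection and subgroup-suppression rules described above can be summarized in a short sketch. The record layout and field names below are hypothetical and are not the FDOE file format; only the minimum group sizes (30 per FDOE, 25 per Linn et al., 2002) come from the sources cited.

```python
# Minimal sketch of the sample selection and subgroup suppression described above.
# Record layout and field names are hypothetical placeholders, not the FDOE file format.

MIN_GROUP_SIZE = 30  # FDOE minimum for school grades and AYP; Linn et al. (2002) suggest 25

def matched_low_performing(records):
    """Keep students with scores in both years who remained in Achievement Level 1 or 2."""
    return [r for r in records
            if r.get("dss_2001") is not None
            and r.get("dss_2002") is not None
            and r["level_2001"] in (1, 2)
            and r["level_2002"] in (1, 2)]

def reportable_subgroups(records, key):
    """Group records by a subgroup key and drop groups below the minimum size."""
    groups = {}
    for r in records:
        groups.setdefault(r[key], []).append(r)
    return {name: members for name, members in groups.items()
            if len(members) >= MIN_GROUP_SIZE}

# Any subgroup with fewer than MIN_GROUP_SIZE matched students per grade (as happened
# here for gifted and Hispanic students) would drop out of the disaggregated analyses.
```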

Measures

Reading and mathematics components of FCAT based on the Sunshine State Standards (SSS) were examined. Commonly referred to as FCAT SSS Reading and FCAT SSS Mathematics, they are the criterion-referenced components of the statewide assessment program. The resulting Developmental Scale Scores (DSS) were analyzed. Figure 3-1 shows the relationship between these scores and the corresponding Achievement Levels by content area.

Figure 3-1. FCAT Achievement Levels for the Developmental Scale. Source: Understanding FCAT Reports (FDOE, 2003a).

The vertical scales that the DSS are based on were built by a special comparison of FCAT performance across grades 3 through 10 during the Spring of 2001. Facilitated by a calibration step (Thissen & Wainer, 2001), items from adjacent grade level tests were embedded into field test item positions. It was decided that the scale would be anchored at grade 3, where the average score would be 1300, and at grade 10, where the average score would be 2000. Analysis was accomplished using Item Response Theory (IRT) models (Thissen, 1991). IRT metrics have been shown to be preferred over grade equivalents when examining such longitudinal patterns, as they are sensitive to varying rates of growth over time (Seltzer et al., 1994).
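Because the analyses that follow depend on where a DSS falls relative to grade-specific cut scores, a brief sketch of that mapping may be useful. The cut-score values below are hypothetical placeholders; the actual values are set by State Board Rule 6A-1.09422 and are depicted in Figure 3-1.

```python
# Sketch of mapping a Developmental Scale Score (DSS) to an FCAT Achievement Level.
# The cut scores below are hypothetical placeholders, not the State Board values.

import bisect

# Four cut scores separate the developmental scale into five Achievement Levels.
HYPOTHETICAL_CUTS = {
    # (subject, grade): [Level 2 cut, Level 3 cut, Level 4 cut, Level 5 cut]
    ("reading", 8): [1342, 1622, 1860, 2016],
}

def achievement_level(subject, grade, dss):
    """Return the Achievement Level (1-5) implied by a DSS for a subject and grade.
    Assumes a score exactly at a cut belongs to the higher level."""
    cuts = HYPOTHETICAL_CUTS[(subject, grade)]
    return 1 + bisect.bisect_right(cuts, dss)

print(achievement_level("reading", 8, 1500))  # -> 2 with these placeholder cuts
```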

Analysis Approach

Components of both the Florida School Accountability System, as envisioned by the A+ Plan for Education, and the federal No Child Left Behind Act of 2001 (NCLB) were used to approach the question of differential growth for various groups of students within the Achievement Levels of Basic and Below Basic, focusing on grades with constructed response items. Within the Florida School Accountability System, annual learning gains on FCAT SSS Mathematics and FCAT SSS Reading account for three of the six categories that are used in calculating school performance grades. Special attention is given to the FCAT SSS Reading gains of the lowest 25% of students in each school (see Appendix A). As of 2002, students can demonstrate gains via three different alternatives (FDOE, 2003b):

1. When students improve their FCAT Achievement Levels (1-2, 2-3, 3-4, 4-5); or
2. When students maintain a relatively high Achievement Level (3, 4, or 5); or
3. When students demonstrate more than one year's growth within Levels 1 or 2, as measured by an increase in their FCAT Developmental Scale Scores from one year to the next.

The focus of the analysis will be on the students eligible for demonstrating gains via Gain Alternative 3. The reasoning is that these students have demonstrated two consecutive years of Basic or Below Basic performance. Students that demonstrated gains via Gain Alternative 1 have shown a substantive improvement in their performance. For example, a student who has improved from Achievement Level 2 to Achievement Level 3 has gone from a classification of Basic to Proficient (FDOE, 2003c).

Students that demonstrated gains via Gain Alternative 2 have shown consistently high performance, as Achievement Level 3 or higher is classified as Proficient or Advanced (FDOE, 2003c). Furthermore, this focus speaks directly to the policy that is at hand and utilizes the FCAT Developmental Scale Scores as they are intended. Five analyses were performed, the first of which is for all matched low performing students not making gains by achievement level criteria. The remaining four address subgroups that are suggested (CCSSO, 2002a) or required (NCLB, 2002) when disaggregating school, district, and state data in calculating the Adequate Yearly Progress (AYP) components of NCLB. To make AYP, the percentage of students earning a score of Proficient or above in reading and mathematics has to meet or exceed the annual objectives for the given year (see Appendices B and C). Though all students in the analyses are non-Proficient, tracking the composition and progress of these groups could lead to an understanding that may facilitate policy and instruction. Comparisons of item format effects in measuring one year's growth will be made between FCAT SSS Reading and FCAT SSS Mathematics as well as across grades 6 through 10 within each content area. Lastly, differential growth within each grade for both content areas will be discussed.

To further specify what constitutes one year's growth as defined in the Florida School Accountability System, it is necessary to revisit the developmental scale, specifically as it applies to Gain Alternative 3. The definition is based on the numerical cut scores for the FCAT Achievement Levels that have been approved by the State Board of Education. The following steps were applied to the cut scores, separately, for each subject and each grade-level pair (FDOE, 2003b). Per State Board Rule 6A-1.09422, there are four cut scores that separate DSS into five Achievement Levels.

The increase in the DSS necessary to maintain the same relative standing within Achievement Levels from one grade to the next was calculated for each of the four cut scores between the five Achievement Levels. The median value of these four differences was determined to best represent the entire student population. Median gain expectations were then calculated for each grade and a logarithmic curve was fitted. Other curves were considered, but the logarithmic curve was adopted because it best captures the theoretical expectation of greater gains in the early grades due to student maturation. Graphs of these curves and the resulting expected growth on the developmental scale are shown in Figure 3-2 for reading and mathematics.

Figure 3-2. Expected Growth under Gain Alternative 3. Source: Guide to Calculating School Grades Technical Assistance Paper (FDOE, 2003b).

To be denoted as making gains under Gain Alternative 3, students must demonstrate more than the expected growth on the developmental scale. Therefore, they must score at least one developmental scale score point more than the values shown in Figure 3-2. These criteria were applied to mean differences for the matched low performing students. This was done for the aggregate for each grade 6 through 10 and, in the interest of the NCLB requirements, by sex, sex and race, socio-economic status, and disability.
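A compact sketch of this procedure follows. The cut scores are hypothetical placeholders, the choice of the upper grade as the predictor in the logarithmic fit is an assumption, and the ordinary least-squares fit stands in for whatever fitting routine FDOE used; the published values in Figure 3-2 remain the authoritative expected-growth figures.

```python
# Sketch of the expected-growth calculation under Gain Alternative 3.
# Cut scores are hypothetical placeholders; the fitted values are illustrative only.

import math
from statistics import median

# Hypothetical cut scores (four per grade) separating the five Achievement Levels.
cuts_by_grade = {
    6: [1450, 1620, 1800, 1920],
    7: [1540, 1710, 1880, 2000],
    8: [1600, 1770, 1940, 2060],
}

# Step 1: for each adjacent grade pair, take the median increase across the four cuts.
median_gain = {
    (g, g + 1): median(hi - lo for lo, hi in zip(cuts_by_grade[g], cuts_by_grade[g + 1]))
    for g in sorted(cuts_by_grade) if g + 1 in cuts_by_grade
}

# Step 2: fit a logarithmic curve, expected_gain = a + b * ln(upper grade), by ordinary
# least squares, reflecting the expectation of greater gains in the earlier grades.
xs = [math.log(pair[1]) for pair in median_gain]
ys = [float(gain) for gain in median_gain.values()]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)
a = y_bar - b * x_bar

def expected_growth(upper_grade):
    """Expected one-year DSS growth for the grade pair ending at upper_grade."""
    return a + b * math.log(upper_grade)

def made_gain_alternative_3(dss_prior, dss_current, upper_grade):
    """Gain Alternative 3: more than one year's growth, i.e., at least one DSS
    point beyond the expected growth for the grade pair."""
    return (dss_current - dss_prior) >= expected_growth(upper_grade) + 1
```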

As noted earlier, other issues, though admittedly relevant, will be set aside in an effort to work within the existing framework of the state and federal accountability systems. These include, but are not limited to: the longstanding debate on how to yield gain scores (Cronbach & Furby, 1970; Lord, 1963); the discrepancies surrounding the various methods of vertically equating tests, especially when including multiple formats, and their respective implications (Crocker & Algina, 1986; Slinde & Linn, 1977; Thissen & Wainer, 2001); and the more recent debate over the effects of high-stakes testing (Amrein & Berliner, 2002; Carnoy & Loeb, 2002).

CHAPTER 4
RESULTS

Aggregate Developmental Scale Score Analysis

Reading

Sample sizes, means, and standard deviations for grade level aggregate data for the Florida Comprehensive Assessment Test Sunshine State Standards (FCAT SSS) Reading Developmental Scale Scores (DSS) are shown in Table 4-1.

Table 4-1. Mean Aggregate Reading DSS for Matched Low Performing Students not Making Gains by Achievement Level Criteria

Grade   Year   n     M (SD)
6       2002   498   1,074 (266)
        2001   498   1,031 (294)
7       2002   475   1,245 (245)
        2001   475   1,116 (255)
8       2002   477   1,353 (258)
        2001   477   1,257 (254)
9       2002   489   1,525 (264)
        2001   489   1,374 (251)
10      2002   463   1,590 (275)
        2001   463   1,592 (221)

Mean differences and comparisons to expected growth for grade level aggregate data for the FCAT SSS Reading DSS are shown in Table 4-2.

Table 4-2. Mean Differences for Aggregate Reading DSS for Matched Low Performing Students not Making Gains by Achievement Level Criteria from 2001 to 2002

Grades      n     DSS Mean Difference   Expected Growth   Met Expectations
5 to 6      498    43                   133               No
6 to 7      475   129                   110               Yes
7 to 8 a    477    96                    92               Yes
8 to 9      489   150                    77               Yes
9 to 10 a   463    -2                    77               No

a Grades 8 and 10 utilize constructed-response items.

Mathematics

Sample sizes, means, and standard deviations for grade level aggregate data for the FCAT SSS Mathematics DSS are shown in Table 4-3.

Table 4-3. Mean Aggregate Mathematics DSS for Matched Low Performing Students not Making Gains by Achievement Level Criteria

Grade   Year   n     M (SD)
6       2002   569   1,314 (235)
        2001   569   1,123 (227)
7       2002   557   1,360 (262)
        2001   557   1,290 (257)
8       2002   533   1,459 (242)
        2001   533   1,307 (272)
9       2002   462   1,587 (220)
        2001   462   1,438 (241)
10      2002   385   1,689 (190)
        2001   385   1,636 (206)
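The entries in the Met Expectations column of Table 4-2 follow directly from the means in Table 4-1 and the expected-growth values, as the sketch below illustrates using the published (rounded) means; small discrepancies from the tabled differences (for example, 151 rather than 150 for grades 8 to 9) reflect rounding of the means.

```python
# Reproducing the aggregate reading comparison of Tables 4-1 and 4-2 from published values.
# Means are the rounded values from Table 4-1; expected growth is from Table 4-2.

reading_aggregate = {
    # grade pair: (2001 mean DSS, 2002 mean DSS, expected growth)
    "5 to 6":  (1031, 1074, 133),
    "6 to 7":  (1116, 1245, 110),
    "7 to 8":  (1257, 1353,  92),
    "8 to 9":  (1374, 1525,  77),
    "9 to 10": (1592, 1590,  77),
}

for grades, (prior, current, expected) in reading_aggregate.items():
    diff = current - prior
    met = "Yes" if diff > expected else "No"
    print(f"{grades:>7}: mean difference {diff:>4}, expected {expected:>3}, met expectations: {met}")
```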

Mean differences and comparisons to expected growth for grade level aggregate data for the FCAT SSS Mathematics DSS are shown in Table 4-4.

Table 4-4. Mean Differences for Aggregate Mathematics DSS for Matched Low Performing Students not Making Gains by Achievement Level Criteria from 2001 to 2002

Grades      n     DSS Mean Difference   Expected Growth   Met Expectations
5 to 6      569   190                   95                Yes
6 to 7      557    70                   78                No
7 to 8 a    533   151                   64                Yes
8 to 9      462   149                   54                Yes
9 to 10 a   385    53                   48                Yes

a Grades 8 and 10 utilize constructed-response items.

Developmental Scale Score Contrasts by Sex

Reading

Sample sizes, means, and standard deviations for grade level data for the FCAT SSS Reading DSS by sex are shown in Table 4-5.

Table 4-5. Mean Reading DSS by Sex for Matched Low Performing Students not Making Gains by Achievement Level Criteria

                     Female                Male
Grade   Year     n     M (SD)          n     M (SD)
6       2002     241   1,115 (247)     257   1,035 (277)
        2001     241   1,060 (269)     257   1,003 (313)
7       2002     225   1,288 (234)     250   1,206 (247)
        2001     225   1,156 (237)     250   1,079 (266)
8       2002     223   1,388 (259)     254   1,321 (253)
        2001     223   1,304 (236)     254   1,215 (262)
9       2002     237   1,583 (231)     252   1,470 (281)
        2001     237   1,413 (235)     252   1,338 (261)
10      2002     232   1,601 (284)     231   1,580 (266)
        2001     232   1,608 (208)     231   1,576 (233)

Mean differences and comparisons to expected growth for grade level data for the FCAT SSS Reading DSS by sex are shown in Table 4-6.

Table 4-6. Mean Differences for Reading DSS by Sex for Matched Low Performing Students not Making Gains by Achievement Level Criteria from 2001 to 2002

Grades      Group   n     DSS Mean Difference   Expected Growth   Met Expectations
5 to 6      F       241    55                   133               No
            M       257    31                   133               No
6 to 7      F       225   132                   110               Yes
            M       250   126                   110               Yes
7 to 8 a    F       223    84                    92               No
            M       254   105                    92               Yes
8 to 9      F       237   170                    77               Yes
            M       252   132                    77               Yes
9 to 10 a   F       232    -7                    77               No
            M       231     4                    77               No

Note. F = Female, M = Male.
a Grades 8 and 10 utilize constructed-response items.

Mathematics

Sample sizes, means, and standard deviations for grade level data for the FCAT SSS Mathematics DSS by sex are shown in Table 4-7.

Table 4-7. Mean Mathematics DSS by Sex for Matched Low Performing Students not Making Gains by Achievement Level Criteria

                     Female                Male
Grade   Year     n     M (SD)          n     M (SD)
6       2002     303   1,332 (213)     266   1,294 (256)
        2001     303   1,127 (223)     266   1,120 (231)
7       2002     295   1,398 (245)     262   1,317 (273)
        2001     295   1,325 (237)     262   1,251 (274)
8       2002     255   1,487 (238)     278   1,433 (244)
        2001     255   1,357 (254)     278   1,262 (280)
9       2002     235   1,604 (210)     227   1,569 (229)
        2001     235   1,476 (220)     227   1,397 (255)
10      2002     212   1,692 (184)     173   1,686 (198)
        2001     212   1,643 (201)     173   1,628 (212)

Mean differences and comparisons to expected growth for grade level data for the FCAT SSS Mathematics DSS by sex are shown in Table 4-8.

Table 4-8. Mean Differences for Mathematics DSS by Sex for Matched Low Performing Students not Making Gains by Achievement Level Criteria from 2001 to 2002

Grades      Group   n     DSS Mean Difference   Expected Growth   Met Expectations
5 to 6      F       303   205                   95                Yes
            M       266   174                   95                Yes
6 to 7      F       295    72                   78                No
            M       262    67                   78                No
7 to 8 a    F       255   130                   64                Yes
            M       278   171                   64                Yes
8 to 9      F       235   128                   54                Yes
            M       227   172                   54                Yes
9 to 10 a   F       212    49                   48                Yes
            M       173    57                   48                Yes

Note. F = Female, M = Male.
a Grades 8 and 10 utilize constructed-response items.

Developmental Scale Score Contrasts by Sex and Race

Reading

Sample sizes, means, and standard deviations for grade level data for the FCAT SSS Reading DSS by sex and race are shown in Table 4-9.

Table 4-9. Mean Reading DSS by Sex and Race for Matched Low Performing Students not Making Gains by Achievement Level Criteria

                     Female: Black        Female: White        Male: Black          Male: White
Grade   Year     n     M (SD)         n     M (SD)         n     M (SD)         n     M (SD)
6       2002     189   1,090 (246)    52    1,206 (231)    193   1,016 (264)    64    1,094 (307)
        2001     189   1,034 (277)    52    1,153 (216)    193     986 (305)    64    1,059 (335)
7       2002     166   1,266 (241)    59    1,350 (202)    156   1,178 (250)    94    1,251 (237)
        2001     166   1,130 (244)    59    1,229 (200)    156   1,028 (271)    94    1,164 (235)
8       2002     148   1,352 (259)    75    1,460 (245)    166   1,277 (262)    88    1,404 (213)
        2001     148   1,293 (211)    75    1,326 (279)    166   1,189 (253)    88    1,267 (273)
9       2002     157   1,541 (228)    80    1,665 (216)    172   1,410 (278)    80    1,597 (242)
        2001     157   1,362 (237)    80    1,514 (197)    172   1,286 (253)    80    1,449 (244)

Table 4-9. Continued

                     Female: Black        Female: White        Male: Black          Male: White
Grade   Year     n     M (SD)         n     M (SD)         n     M (SD)         n     M (SD)
10      2002     148   1,533 (303)    84    1,720 (196)    151   1,512 (274)    80    1,708 (194)
        2001     148   1,563 (211)    84    1,688 (177)    151   1,540 (220)    80    1,644 (241)

Mean differences and comparisons to expected growth for grade level data for the FCAT SSS Reading DSS by sex and race are shown in Table 4-10.

Table 4-10. Mean Differences for Reading DSS by Sex and Race for Matched Low Performing Students not Making Gains by Achievement Level Criteria from 2001 to 2002

Grades      Group   n     DSS Mean Difference   Expected Growth   Met Expectations
5 to 6      BF      189    56                   133               No
            BM      193    30                   133               No
            WF       52    54                   133               No
            WM       64    34                   133               No
6 to 7      BF      166   136                   110               Yes
            BM      156   150                   110               Yes
            WF       59   121                   110               Yes
            WM       94    87                   110               No
7 to 8 a    BF      148    59                    92               No
            BM      166    89                    92               No
            WF       75   133                    92               Yes
            WM       88   138                    92               Yes
8 to 9      BF      157   179                    77               Yes
            BM      172   125                    77               Yes
            WF       80   152                    77               Yes
            WM       80   148                    77               Yes
9 to 10 a   BF      148   -30                    77               No
            BM      151   -28                    77               No
            WF       84    33                    77               No
            WM       80    64                    77               No

Note. BF = Black Female, BM = Black Male, WF = White Female, WM = White Male.
a Grades 8 and 10 utilize constructed-response items.

Mathematics

Sample sizes, means, and standard deviations for grade level data for the FCAT SSS Mathematics DSS by sex and race are shown in Table 4-11.

Table 4-11. Mean Mathematics DSS by Sex and Race for Matched Low Performing Students not Making Gains by Achievement Level Criteria

                     Female: Black        Female: White        Male: Black          Male: White
Grade   Year     n     M (SD)         n     M (SD)         n     M (SD)         n     M (SD)
6       2002     220   1,294 (220)    83    1,433 (153)    185   1,247 (262)    81    1,403 (206)
        2001     220   1,087 (227)    83    1,233 (174)    185   1,082 (230)    81    1,208 (210)
7       2002     203   1,342 (259)    92    1,520 (155)    176   1,263 (280)    86    1,428 (224)
        2001     203   1,276 (244)    92    1,433 (179)    176   1,199 (286)    86    1,356 (212)
8       2002     180   1,462 (239)    75    1,546 (226)    183   1,390 (241)    95    1,516 (227)
        2001     180   1,334 (252)    75    1,412 (252)    183   1,208 (270)    95    1,365 (271)
9       2002     170   1,578 (214)    65    1,672 (184)    178   1,543 (238)    49    1,666 (163)
        2001     170   1,439 (222)    65    1,573 (184)    178   1,370 (252)    49    1,496 (243)
10      2002     144   1,654 (192)    68    1,771 (134)    128   1,656 (207)    45    1,770 (140)
        2001     144   1,601 (207)    68    1,732 (155)    128   1,604 (219)    45    1,697 (171)

Mean differences and comparisons to expected growth for grade level data for the FCAT SSS Mathematics DSS by sex and race are shown in Table 4-12.

Table 4-12. Mean Differences for Mathematics DSS by Sex and Race for Matched Low Performing Students not Making Gains by Achievement Level Criteria from 2001 to 2002

Grades      Group   n     DSS Mean Difference   Expected Growth   Met Expectations
5 to 6      BF      220   207                   95                Yes
            BM      185   165                   95                Yes
            WF       83   200                   95                Yes
            WM       81   195                   95                Yes

Table 4-12. Continued

Grades      Group   n     DSS Mean Difference   Expected Growth   Met Expectations
6 to 7      BF      203    66                   78                No
            BM      176    64                   78                No
            WF       92    87                   78                Yes
            WM       86    72                   78                No
7 to 8 a    BF      180   128                   64                Yes
            BM      183   182                   64                Yes
            WF       75   134                   64                Yes
            WM       95   150                   64                Yes
8 to 9      BF      170   139                   54                Yes
            BM      178   172                   54                Yes
            WF       65    99                   54                Yes
            WM       49   170                   54                Yes
9 to 10 a   BF      144    53                   48                Yes
            BM      128    52                   48                Yes
            WF       68    40                   48                No
            WM       45    73                   48                Yes

Note. BF = Black Female, BM = Black Male, WF = White Female, WM = White Male.
a Grades 8 and 10 utilize constructed-response items.

Developmental Scale Score Contrasts by Socio-economic Status

Reading

Sample sizes, means, and standard deviations for grade level data for the FCAT SSS Reading DSS by socio-economic status (SES) are shown in Table 4-13.

Table 4-13. Mean Reading DSS by SES for Matched Low Performing Students not Making Gains by Achievement Level Criteria

                     Economically Disadvantaged   Non-Economically Disadvantaged
Grade   Year     n     M (SD)                 n     M (SD)
6       2002     398   1,055 (266)            100   1,149 (253)
        2001     398   1,004 (295)            100   1,134 (262)
7       2002     337   1,223 (254)            138   1,297 (212)
        2001     337   1,093 (257)            138   1,172 (243)
8       2002     323   1,314 (264)            154   1,433 (225)
        2001     323   1,239 (247)            154   1,294 (266)
9       2002     280   1,465 (272)            209   1,604 (231)
        2001     280   1,329 (252)            209   1,434 (238)
10      2002     205   1,510 (295)            258   1,655 (240)
        2001     205   1,525 (235)            258   1,645 (194)

Mean differences and comparisons to expected growth for grade level data for the FCAT SSS Reading DSS by SES are shown in Table 4-14.

Table 4-14. Mean Differences for Reading DSS by SES for Matched Low Performing Students not Making Gains by Achievement Level Criteria from 2001 to 2002

Grades      Group   n     DSS Mean Difference   Expected Growth   Met Expectations
5 to 6      ED      398    50                   133               No
            N-ED    100    15                   133               No
6 to 7      ED      337   130                   110               Yes
            N-ED    138   125                   110               Yes
7 to 8 a    ED      323    75                    92               No
            N-ED    154   138                    92               Yes
8 to 9      ED      280   135                    77               Yes
            N-ED    209   169                    77               Yes
9 to 10 a   ED      205   -16                    77               No
            N-ED    258     9                    77               No

Note. ED = Economically Disadvantaged, N-ED = Non-Economically Disadvantaged.
a Grades 8 and 10 utilize constructed-response items.

Mathematics

Sample sizes, means, and standard deviations for grade level data for the FCAT SSS Mathematics DSS by SES are shown in Table 4-15.

Table 4-15. Mean Mathematics DSS by SES for Matched Low Performing Students not Making Gains by Achievement Level Criteria

                     Economically Disadvantaged   Non-Economically Disadvantaged
Grade   Year     n     M (SD)                 n     M (SD)
6       2002     436   1,294 (237)            133   1,381 (217)
        2001     436   1,103 (227)            133   1,191 (212)
7       2002     385   1,319 (265)            172   1,452 (229)
        2001     385   1,248 (264)            172   1,386 (212)
8       2002     365   1,426 (245)            168   1,529 (220)
        2001     365   1,277 (265)            168   1,372 (277)
9       2002     293   1,558 (225)            169   1,637 (203)
        2001     293   1,417 (228)            169   1,473 (258)
10      2002     192   1,649 (198)            193   1,728 (174)
        2001     192   1,588 (231)            193   1,684 (164)

Mean differences and comparisons to expected growth for grade level data for the FCAT SSS Mathematics DSS by SES are shown in Table 4-16.

Table 4-16. Mean Differences for Mathematics DSS by SES for Matched Low Performing Students not Making Gains by Achievement Level Criteria from 2001 to 2002

Grades      Group   n     DSS Mean Difference   Expected Growth   Met Expectations
5 to 6      ED      436   191                   95                Yes
            N-ED    133   190                   95                Yes
6 to 7      ED      385    71                   78                No
            N-ED    172    66                   78                No
7 to 8 a    ED      365   149                   64                Yes
            N-ED    168   157                   64                Yes
8 to 9      ED      293   141                   54                Yes
            N-ED    169   164                   54                Yes
9 to 10 a   ED      192    61                   48                Yes
            N-ED    193    45                   48                No

Note. ED = Economically Disadvantaged, N-ED = Non-Economically Disadvantaged.
a Grades 8 and 10 utilize constructed-response items.

Developmental Scale Score Contrasts for Students with Disabilities and Regular Education Students

Reading

Sample sizes, means, and standard deviations for grade level data for the FCAT SSS Reading DSS for students with disabilities and regular education students are shown in Table 4-17. Gifted students were omitted due to a small sample size.

Table 4-17. Mean Reading DSS for Matched Low Performing Students with Disabilities and Regular Education Students not Making Gains by Achievement Level Criteria

                     Students with Disabilities   Regular Education Students
Grade   Year     n     M (SD)                 n     M (SD)
6       2002     252     980 (271)            240   1,165 (223)
        2001     252     899 (317)            240   1,162 (189)
7       2002     240   1,148 (252)            234   1,344 (192)
        2001     240   1,018 (254)            234   1,215 (216)

Table 4-17. Continued

                     Students with Disabilities   Regular Education Students
Grade   Year     n     M (SD)                 n     M (SD)
8       2002     219   1,259 (241)            252   1,428 (246)
        2001     219   1,132 (249)            252   1,360 (208)
9       2002     212   1,376 (279)            276   1,638 (184)
        2001     212   1,251 (256)            276   1,468 (203)
10      2002     152   1,435 (255)            309   1,666 (252)
        2001     152   1,444 (232)            309   1,664 (175)

Mean differences and comparisons to expected growth for grade level data for the FCAT SSS Reading DSS for students with disabilities and regular education students are shown in Table 4-18. Gifted students were omitted due to a small sample size.

Table 4-18. Mean Differences for Reading DSS for Matched Low Performing Students with Disabilities and Regular Education Students not Making Gains by Achievement Level Criteria from 2001 to 2002

Grades      Group   n     DSS Mean Difference   Expected Growth   Met Expectations
5 to 6      SWD     252    82                   133               No
            RES     240     3                   133               No
6 to 7      SWD     240   130                   110               Yes
            RES     234   129                   110               Yes
7 to 8 a    SWD     219   126                    92               Yes
            RES     252    68                    92               No
8 to 9      SWD     212   124                    77               Yes
            RES     276   170                    77               Yes
9 to 10 a   SWD     152    -9                    77               No
            RES     309     2                    77               No

Note. SWD = Students With Disabilities, RES = Regular Education Students.
a Grades 8 and 10 utilize constructed-response items.

PAGE 41

Mathematics

Sample sizes, means, and standard deviations for grade level data for the FCAT SSS Mathematics DSS for students with disabilities and regular education students are shown in Table 4-19. Gifted students were omitted due to a small sample size.

Table 4-19. Mean Mathematics DSS for Matched Low Performing Students with Disabilities and Regular Education Students not Making Gains by Achievement Level Criteria
                      Students with Disabilities    Regular Education Students
Grade   Year    n     M (SD)                        n     M (SD)
6       2002    246   1,208 (254)                   319   1,393 (182)
        2001    246   1,026 (234)                   319   1,196 (191)
7       2002    244   1,215 (266)                   309   1,471 (194)
        2001    244   1,157 (270)                   309   1,392 (191)
8       2002    233   1,341 (243)                   297   1,548 (199)
        2001    233   1,159 (261)                   297   1,419 (220)
9       2002    198   1,473 (247)                   264   1,673 (149)
        2001    198   1,307 (236)                   264   1,536 (194)
10      2002    144   1,573 (206)                   241   1,758 (141)
        2001    144   1,513 (240)                   241   1,710 (137)

Mean differences and comparisons to expected growth for grade level data for the FCAT SSS Mathematics DSS for students with disabilities and regular education students are shown in Table 4-20. Gifted students were omitted due to a small sample size.

PAGE 42

Table 4-20. Mean Differences for Mathematics DSS for Matched Low Performing Students with Disabilities and Regular Education Students not Making Gains by Achievement Level Criteria from 2001 to 2002
Grades      Group   n     DSS Mean Difference   Expected Growth   Met Expectations
5 to 6      SWD     246   182                   95                Yes
            RES     319   197                   95                Yes
6 to 7      SWD     244   58                    78                No
            RES     309   72                    78                Yes
7 to 8 a    SWD     233   182                   64                Yes
            RES     297   128                   64                Yes
8 to 9      SWD     198   169                   54                Yes
            RES     264   137                   54                Yes
9 to 10 a   SWD     144   61                    48                Yes
            RES     241   48                    48                Yes
Note. SWD = Students With Disabilities, RES = Regular Education Students.
a Grades 8 and 10 utilize constructed-response items.

Summary

Subgroups of low performing students demonstrating more than one year's growth, as measured by an increase in their FCAT Developmental Scale Scores from one year to the next, are indicated in Table 4-21.

Table 4-21. Subgroups of Low Performing Students Demonstrating More than One Year's Growth in Reading and Mathematics
                      Reading                        Mathematics
                      5-6   6-7   7-8   8-9   9-10   5-6   6-7   7-8   8-9   9-10
Agg                         ✓     ✓     ✓            ✓           ✓     ✓     ✓
Sex        F                ✓           ✓            ✓           ✓     ✓     ✓
           M                ✓     ✓     ✓            ✓           ✓     ✓     ✓
Sex/Race   BF               ✓           ✓            ✓           ✓     ✓     ✓
           BM               ✓           ✓            ✓           ✓     ✓     ✓
           WF               ✓     ✓     ✓            ✓     ✓     ✓     ✓
           WM                     ✓     ✓            ✓           ✓     ✓     ✓
SES        ED               ✓           ✓            ✓           ✓     ✓     ✓
           N-ED             ✓     ✓     ✓            ✓           ✓     ✓
SWD        SWD              ✓     ✓     ✓            ✓           ✓     ✓     ✓
           RES              ✓           ✓            ✓     ✓     ✓     ✓     ✓
Note. ✓ = demonstrated more than one year's growth, Agg = Aggregate, F = Female, M = Male, B = Black, W = White, SES = Socio-economic Status, ED = Economically Disadvantaged, N-ED = Non-Economically Disadvantaged, SWD = Students With Disabilities, RES = Regular Education Students.
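Table 4-21 is, in effect, the "Met Expectations" columns of the preceding tables pivoted into a subgroup-by-grade grid. A minimal, self-contained sketch of that pivot is shown below; the DataFrame layout and column names are illustrative rather than taken from the study's data files, and the example values are the ED and N-ED reading outcomes from Table 4-14.

    import pandas as pd

    # "Met Expectations" outcomes for the ED and N-ED reading contrasts, copied
    # from Table 4-14; the layout and column names are illustrative only.
    summary = pd.DataFrame({
        "subgroup": ["ED", "N-ED"] * 5,
        "grades": ["5 to 6", "5 to 6", "6 to 7", "6 to 7", "7 to 8", "7 to 8",
                   "8 to 9", "8 to 9", "9 to 10", "9 to 10"],
        "met_expectations": [False, False, True, True, False, True,
                             True, True, False, False],
    })

    # Rows become subgroups, columns become grade pairs; True cells correspond
    # to the check marks in Table 4-21.
    grid = summary.pivot(index="subgroup", columns="grades",
                         values="met_expectations")
    print(grid)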

PAGE 43

CHAPTER 5
DISCUSSION

What makes the isolation of item format effects in an existing accountability system difficult is that comparison groups cannot be established without compromising the very structure of that accountability system. The Florida Comprehensive Assessment Test (FCAT) is administered to all eligible students, and the results are used for decisions that can be of high stakes. Altering the composition of item formats for some students while all others take the current forms, and then drawing conclusions from that contrast, would not only have political repercussions but would not be tenable. The approach taken here instead looks for patterns to emerge from data obtained from test administrations under actual operating conditions. This has the added advantage of incorporating the first two years of results from what, at the time, was a change in the FCAT testing program and its application to the Florida School Accountability System.

Despite the changes, including the addition of annual learning gains to the calculation of school performance grades, continued improvement was widely reported in 2002 and even more so in 2003, when almost half of all schools earned an A grade. The first results under the federal No Child Left Behind Act of 2001 (NCLB) were also made available in 2003. For Florida, these results were not as encouraging: only slightly more than 400 schools out of roughly 3,000 met all requirements for making Adequate Yearly Progress (AYP). These mixed messages, along with recent results of the National Assessment of

PAGE 44

Educational Progress, lead to the conclusion that, while Florida as a whole is making progress, it is simply not enough for many, and much remains to be done.

Toward this end, the analyses presented here focused attention on students within Achievement Level 1 or 2, that is, those performing at the Below Basic and Basic levels. The 2003 NCLB results show that the district from which the data for these analyses were drawn did not make AYP. A closer look at the subgroups shows that Black students and students with disabilities did not meet the required percentage proficient (i.e., AYP) in reading, with economically disadvantaged students coming close to missing the criterion as well. In mathematics, none of these subgroups made AYP. Although the analyses are from 2001 and 2002 and focus on grades 6 through 10, the discussion below may help clarify these and other patterns.

Trends by Grade

Reading

Grade 6. Mean differences for all subgroups of low performing students did not demonstrate one year's growth.

Grade 7. Mean differences for all subgroups of low performing students demonstrated one year's growth except white males.

Grade 8. Mean differences for all subgroups of low performing students demonstrated one year's growth except females, Black females, Black males, economically disadvantaged students, and regular education students.

Grade 9. Mean differences for all subgroups of low performing students demonstrated one year's growth.

Grade 10. Mean differences for all subgroups of low performing students did not demonstrate one year's growth.

PAGE 45

Mathematics

Grade 6. Mean differences for all subgroups of low performing students demonstrated one year's growth.

Grade 7. Mean differences for all subgroups of low performing students did not demonstrate one year's growth, with the exception of white females and regular education students.

Grade 8. Mean differences for all subgroups of low performing students demonstrated one year's growth.

Grade 9. Mean differences for all subgroups of low performing students demonstrated one year's growth.

Grade 10. Mean differences for all subgroups of low performing students demonstrated one year's growth except white females and non-economically disadvantaged students.

Item Format Effects

Results indicated that constructed response (CR) items may be of benefit to many low performing students in mathematics, but to a lesser extent in reading. In mathematics, all subgroups made one year's growth in both grades that utilize CR items, with the exception of white females and non-economically disadvantaged students in grade 10. Further inspection shows that the developmental scale scores for both of these subgroups were higher than those of all other subgroups in their respective analyses in both 2001 and 2002.

Differential results between subgroups within grades where CR items are utilized were most evident in reading. In grade 8, the aggregate demonstrated one year's growth, though not by much. Every other disaggregated analysis had at least one subgroup, and in most cases more than half of all students, not demonstrating one year's growth.

PAGE 46

Females did not make one year's growth in grade 8, though they demonstrated higher developmental scale scores than males in both years of interest. Black females and Black males did not make one year's growth and scored lower than white females and white males in both years of interest. Economically disadvantaged students did not make one year's growth and scored lower than non-economically disadvantaged students in both years of interest. Regular education students did not make one year's growth, though they demonstrated higher developmental scale scores than students with disabilities. In grade 10, none of the subgroups of low performing students demonstrated one year's growth.

Closing Remarks

Utilizing constructed response items on statewide assessments has been reported by those in the classroom to a) place a greater emphasis on higher-level thinking and problem solving (Huebert & Hauser, 1998), b) provide a more transparent link to the everyday curriculum, and c) increase motivation (CCSSO, 2002b). All of these are aligned with the standards-based reform that is at the very root of the accountability movement (NCTE, 1996; NCTM, 1995). These implications for teaching and learning are too important to be ignored, and they lend consequential validity evidence to the inclusion of constructed response items on high-stakes tests (AERA, APA, & NCME, 1999). Lastly, and most importantly, it has also been reported that constructed response items lend construct validity to interpretations made from test scores (Messick, 1989; Snow, 1993), though the more contemporary paradigm frames this as evidence for content validity, inclusive of item format (AERA, APA, & NCME, 1999). Constructed response items may also come closest to the original meaning of the term assessment, from the Latin assidere, "to sit beside." Teacher and student working alongside one

PAGE 47

another is what large-scale assessment yearns to be, and it may well be that constructed response items best paint that picture.

PAGE 48

APPENDIX A GRADING FLORIDA PUBLIC SCHOOLS

PAGE 49


PAGE 50

APPENDIX B ANNUAL AYP OBJECTIVES FOR READING

PAGE 51

APPENDIX C ANNUAL AYP OBJECTIVES FOR MATHEMATICS

PAGE 52

LIST OF REFERENCES

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

Amrein, A. L., & Berliner, D. C. (2002, March 28). High-stakes testing, uncertainty, and student learning. Education Policy Analysis Archives, 10(18). Retrieved February 11, 2003, from http://epaa.asu.edu/epaa/v10n18/

Ananda, S. (2003). Rethinking issues of alignment under No Child Left Behind. San Francisco: WestEd.

Barton, P. E., & Coley, R. J. (1994). Testing in America's schools. Princeton, NJ: Educational Testing Service, Policy Information Center.

Bennett, R. E., Rock, D. A., & Wang, M. (1991). Equivalence of free-response and multiple-choice items. Journal of Educational Measurement, 28(1), 77-92.

Brennan, R. L. (1995). Generalizability in performance assessments. Educational Measurement: Issues and Practice, 14(4), 9-12, 27.

Bridgeman, B. (1992). A comparison of quantitative questions in open-ended and multiple-choice formats. Journal of Educational Measurement, 29(3), 253-271.

Bridgeman, B., & Rock, D. A. (1993). Relationships among multiple-choice and open-ended analytical questions. Journal of Educational Measurement, 30(4), 313-329.

Carnoy, M., & Loeb, S. (2002). Does external accountability affect student outcomes? A cross-state analysis. Educational Evaluation and Policy Analysis, 24(4), 305-331.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). New York: Academic Press.

Council of Chief State School Officers. (1999). Annual survey of state student assessment programs: A summary report, fall 1999. Washington, DC: Author.

Council of Chief State School Officers. (2002a). A guide to effective accountability reporting. Washington, DC: Author.

PAGE 53

Council of Chief State School Officers. (2002b). The role of performance-based assessments in large-scale accountability systems: Lessons learned from the inside. Washington, DC: Author.

Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. New York: Wadsworth.

Cronbach, L. J., & Furby, L. (1970). How we should measure "change": Or should we? Psychological Bulletin, 74, 68-80.

DeFur, S. H. (2002). Education reform, high-stakes assessment, and students with disabilities: One state's approach. Remedial and Special Education, 23(4), 203-211.

Dunbar, S. B., Koretz, D. M., & Hoover, H. D. (1991). Quality control in the development and use of performance assessments. Applied Measurement in Education, 4(4), 289-303.

Florida Department of Education [FDOE]. (2001). FCAT briefing book. Tallahassee, FL: Author.

Florida Department of Education [FDOE]. (2002). Technical report: For operational test administrations of the 2000 Florida Comprehensive Assessment Test. Tallahassee, FL: Author.

Florida Department of Education [FDOE]. (2003a). Understanding FCAT reports. Tallahassee, FL: Author.

Florida Department of Education [FDOE]. (2003b). 2003 guide to calculating school grades: Technical assistance paper. Tallahassee, FL: Author.

Florida Department of Education [FDOE]. (2003c). Consolidated state application accountability workbook for state grants under Title IX, Part C, Sec. 9302 for the Elementary and Secondary Education Act (Pub. L. No. 107-110). March 26.

Florida Department of Education [FDOE]. (2003d). Growth of minority student populations in Florida's public schools. Tallahassee, FL: Author.

Hamilton, L. S., McCaffrey, D. F., Stecher, B. M., Klein, S. P., Robyn, A., & Bugliari, D. (2003). Studying large-scale reforms of instructional practice: An example from mathematics and science. Educational Evaluation and Policy Analysis, 20(2), 95-113.

Huebert, J. P., & Hauser, R. M. (Eds.). (1998). High-stakes testing for tracking, promotion, and graduation. Washington, DC: National Academy Press.

PAGE 54

Klein, S. P., Hamilton, L. S., McCaffrey, D. F., Stecher, B. M., Robyn, A., & Burroughs, D. (2000). Teaching practices and student achievement: Report of first-year results from the Mosaic Study of Systemic Initiatives in Mathematics and Science (MR-1233-EDU). Santa Monica, CA: RAND.

Linn, R. L. (1994). Performance assessment: Policy promises and technical measurement standards. Educational Researcher, 23(9), 4-14.

Linn, R. L. (2000). Assessments and accountability. Educational Researcher, 29(2), 4-16.

Linn, R. L., Baker, E. L., & Herman, J. L. (2002, Fall). Minimum group size for measuring adequate yearly progress. The CRESST Line, 1, 4-5. (Newsletter of the National Center for Research on Evaluation, Standards, and Student Testing [CRESST], University of California, Los Angeles)

Lord, F. M. (1963). Elementary models for measuring change. In C. W. Harris (Ed.), Problems in measuring change (pp. 21-38). Madison, WI: University of Wisconsin Press.

Lukhele, R., Thissen, D., & Wainer, H. (1994). On the relative value of multiple-choice, constructed-response, and examinee-selected items on two achievement tests. Journal of Educational Measurement, 31(3), 234-250.

McCaffrey, D. F., Hamilton, L. S., Stecher, B. M., Klein, S. P., Bugliari, D., & Robyn, A. (2001). Interactions among instructional practices, curriculum, and student achievement: The case of standards-based high school mathematics. Journal for Research in Mathematics Education, 32(5), 493-517.

Mehrens, W. A. (1992). Using performance assessment for accountability purposes. Educational Measurement: Issues and Practice, 11(1), 3-9, 20.

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). New York: Macmillan.

Messick, S. (1993). Trait equivalence as construct validity of score interpretation across multiple methods of measurement. In R. E. Bennett & W. C. Ward (Eds.), Construction versus choice in cognitive measurement (pp. 61-74). Mahwah, NJ: Lawrence Erlbaum.

Messick, S. (1995). Standards of validity and the validity of standards in performance assessment. Educational Measurement: Issues and Practice, 14(4), 5-8.

Miller, M. D. (2002). Generalizability of performance-based assessments. Washington, DC: Council of Chief State School Officers.

PAGE 55

National Council of Teachers of English. (1996). Standards for the English language arts: A joint project of the National Council of Teachers of English and the International Reading Association. Urbana, IL: Author.

National Council of Teachers of Mathematics. (1995). Assessment standards for school mathematics. Reston, VA: Author.

No Child Left Behind Act of 2001, Public Law 107-110, Stat. 1425, 107th Congress (2002).

Paige, R. (2003, June 27). Key policy letters signed by the Education Secretary or Deputy Secretary. Retrieved February 10, 2004, from http://www.ed.gov/policy/speced/guid/secletter/030627.html

Pellegrino, J. W., Jones, L. R., & Mitchell, K. J. (1999). Grading the nation's report card: Evaluating NAEP and transforming the assessment of educational progress. Washington, DC: National Academy Press.

Rodriguez, M. C. (2003). Construct equivalence of multiple-choice and constructed-response items: A random effects synthesis of correlations. Journal of Educational Measurement, 40(2), 163-184.

Ryan, J. M., & DeMark, S. (2002). Variation in achievement scores related to gender, item format, and content area tested. In J. Tindal & T. Haladyna (Eds.), Large-scale assessment programs for all students: Validity, technical quality, and implementation (pp. 67-88). Mahwah, NJ: Lawrence Erlbaum Associates.

Seltzer, M. H., Frank, K. A., & Bryk, A. S. (1994). The metric matters: The sensitivity of conclusions about growth in student achievement to choice of metric. Educational Evaluation and Policy Analysis, 16(1), 41-49.

Slinde, J. A., & Linn, R. L. (1977). Vertically equated tests: Fact or phantom? Journal of Educational Measurement, 14(1), 23-32.

Snow, R. E. (1993). Construct validity and constructed-response tests. In R. E. Bennett & W. C. Ward (Eds.), Construction versus choice in cognitive measurement: Issues in constructed response, performance testing, and portfolio assessment (pp. 45-60). Hillsdale, NJ: Lawrence Erlbaum.

Thissen, D. (1991). Multilog user's guide. Lincolnwood, IL: Scientific Software.

Thissen, D., & Wainer, H. (2001). Test scoring. Hillsdale, NJ: Lawrence Erlbaum.

Thissen, D., Wainer, H., & Wang, X. (1994). Are tests comprising both multiple-choice and free-response items necessarily less unidimensional than multiple-choice tests? An analysis of two tests. Journal of Educational Measurement, 31(2), 113-123.

PAGE 56

Traub, R. E. (1993). On the equivalence of the traits assessed by multiple-choice and constructed-response tests. In R. E. Bennett & W. C. Ward (Eds.), Construction versus choice in cognitive measurement: Issues in constructed response, performance testing, and portfolio assessment (pp. 45-60). Hillsdale, NJ: Lawrence Erlbaum.

Wainer, H., & Thissen, D. (1993). Combining multiple-choice and constructed-response test scores: Towards a Marxist theory of test construction. Applied Measurement in Education, 6(2), 103-118.

Yen, W. M., & Ferrara, S. (1997). The Maryland school assessment program: Performance assessment with psychometric quality suitable for high stakes usage. Educational and Psychological Measurement, 57(1), 60-84.

PAGE 57

BIOGRAPHICAL SKETCH

Stewart Francis Elizondo was born in San Jose, Costa Rica. At six years of age he moved to Old Bridge, New Jersey. At eighteen years of age, he moved to Ocala, Florida, where he attended Lake Weir High School. Before graduating, he had the honor of receiving a nomination to the United States Military Academy from Kenneth Hood "Buddy" MacKay Jr., former Governor of Florida. After pursuing the honor to its end, he worked his way through Central Florida Community College in Ocala with assistance from a scholarship from the Silver Springs Shores Lions Club. After two years, he was fortunate enough to be awarded a Critical Teacher Scholarship in mathematics from the Florida Department of Education (FDOE). He transferred to the University of South Florida in Tampa and graduated with a Bachelor of Arts in mathematics and education.

Returning to Ocala, Stewart then taught for ten years in the public school system. He is proud to have taught in a state-recognized, high-performing "A" school. His service includes mentoring beginning teachers in the county's Peer Orientation Program, sitting on county textbook adoption committees, and facilitating school- and district-wide workshops. His students also twice nominated him for Teacher of the Year.

Concurrent with his teaching, he served on the Florida Mathematics Curriculum Frameworks and Sunshine State Standards (SSS) Writing Team. The FDOE approved the SSS as Florida's academic standards and the Florida Comprehensive Assessment Test (FCAT) as its assessment tool. These changes were also incorporated into Governor Jeb

PAGE 58

Bush's A+ Education Plan, which was approved by the state legislature by amending Section 229.57, F.S.

While still teaching, he worked with the FDOE and its Test Development Center (TDC) on several statewide assessment endeavors. These include collaborations with Harcourt Educational Measurement in the development of the FCAT, specifically grade nine and ten mathematics item reviews. He later worked with the TDC and NCS Pearson on the scoring of FCAT items, specifically in developing and adjusting training guidelines for the scoring of grade ten mathematics performance tasks using student responses from field-tested and operational items. Letters expressing appreciation from Florida Commissioner of Education James Wallace Horne, as well as former Commissioners Charles J. Crist Jr., Frank T. Brogan, and the late Douglas L. Jamerson, mark the timeline of service to these state projects.

These activities rekindled a long-standing passion for the assessment field and led him to pursue various roles in the field, and ultimately an advanced degree. Of note, he has been an item writer for Harcourt Educational Measurement and has written mathematics items for the Massachusetts Comprehensive Assessment System and the Connecticut Academic Performance Test. He has also served as both proctor and administrator for the Florida Teacher Certification Examination and the College-Level Academic Skills Test.

Stewart currently is an Alumni Fellow and graduate teaching assistant at the University of Florida. He is grateful to still have the opportunity to be in the classroom, as he has taught undergraduate courses in elementary mathematics methods and

PAGE 59

graduate courses in measurement and assessment, while continuing to pursue a doctorate in research and evaluation methodology. His greatest joy has always been, and still remains, spending time with his sons, Spencer and Seth, and his wife, Stephanie.


Permanent Link: http://ufdc.ufl.edu/UFE0004828/00001

Material Information

Title: Accountability and the Florida Comprehensive Assessment Test : effects of item format on low performing students in measuring one year's growth
Physical Description: Mixed Material
Language: English
Creator: Elizondo, Stewart Francis ( Dissertant )
Miller, M. David. ( Thesis advisor )
Seraphine, Anne ( Reviewer )
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2004
Copyright Date: 2004

Subjects

Subjects / Keywords: Educational Psychology thesis, M.A.E
Dissertations, Academic -- UF -- Educational Psychology
Genre: bibliography   ( marcgt )
theses   ( marcgt )
non-fiction   ( marcgt )

Notes

Abstract: The impact of constructed response items in the measurement of one year's growth on the Florida Comprehensive Assessment Test for low performing students is analyzed. Growth was measured from the 2001 to the 2002 Sunshine State Standards assessment on the vertically equated developmental scores as defined by the Florida School Accountability System. Data in reading and mathematics for 6th through 10th grade were examined for a mid-sized school district and disaggregated into several subgroups as defined in the state No Child Left Behind Accountability Plan. These results were compared across grade levels to see if growth differed in the grade levels that are assessed with constructed response items. Results indicated that constructed response items may be of benefit to many low performing students in mathematics, but to a lesser extent in reading. Differential results between subgroups were most evident in reading.
Subject: accountability, FCAT
General Note: Title from title page of source document.
General Note: Document formatted into pages; contains 59 pages.
General Note: Includes vita.
Thesis: Thesis (M.A.E.)--University of Florida, 2004.
Bibliography: Includes bibliographical references.

Record Information

Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: All rights reserved by the source institution and holding location.
System ID: UFE0004828:00001



This item has the following downloads:


Full Text











ACCOUNTABILITY AND THE FLORIDA COMPREHENSIVE ASSESSMENT
TEST: EFFECTS OF ITEM FORMAT ON LOW PERFORMING STUDENTS IN
MEASURING ONE YEAR' S GROWTH















By

STEWART FRANCIS ELIZONDO


A THESIS PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF ARTS IN EDUCATION

UNIVERSITY OF FLORIDA


2004

































Copyright 2004

by

Stewart Francis Elizondo


































This work is dedicated to Stephanie, who has unconditionally supported me not only in
this endeavor, but also in all of my undertakings over the last seventeen years.
















ACKNOWLEDGMENTS

I acknowledge my committee chair, Dr. M. David Miller, and committee member,

Dr. Anne E. Seraphine, for their skilled mentorship and expertise. They have allowed

this process be to challenging, yet enjoyable and rewarding. Of course, any errors

remaining in this paper are entirely my own responsibility.

Gratitude is also given to Dr. Bridget A. Franks, graduate coordinator, and Dr.

James Algina, Director of the Research and Evaluation Program, for their assistance in

helping me secure an Alumni Fellowship without which this proj ect may not be or at

least would have been protracted and less cohesive.

Enough appreciation cannot be expressed to my family for their undying support

of my graduate studies. Repaying the concessions made by my sons, Spencer and Seth, is

something I cannot do in my lifetime, but I look forward to making the effort.

Encouragement and support given to me by my in-laws, Jeanne and John Garrod, are also

not taken for granted and I am most thankful. I would like to thank my mother, Elizabeth

Marder. It is because of her that I realize and celebrate in fact that education is truly a

lifelong process. Finally, the many thanks owed to my wife, Stephanie, are beyond

words. I would not be half of what I am if it were not for her.





















TABLE OF CONTENTS


page


ACKNOWLEDGMENT S .............. .................... iv


LI ST OF T ABLE S .........__.. ..... .___ .............._ vii..


LIST OF FIGURES .............. .................... ix


AB S TRAC T ......_ ................. ............_........x


CHAPTER


1 INTRODUCTION ................. ...............1.......... ......


The Florida Comprehensive Assessment Test............... ...............1..
State and Federal Accountability .................. ...............4................
The Florida School Accountability System ................. .............................4
The No Child Left Behind Act of 2001 ................ ...............5............ .
Low Performing Students and One Year' s Growth. ........ ................. ...............5


2 REVIEW OF LITERATURE ................. ...............7.......... .....


The Movement Towards Constructed Response .............. ...............7.....
Trait Equivalence Between Item Formats .............. ...............8.....
Psychometric Differences Between Item Formats .............. ...............9.....
Combining Item Formats ................. ........... ... ...............10....
Differential Effects of Item Formats on Various Groups ................ ............... .....11


3 M ETHODS ................. ...............13.......... .....


Data............... ...............13..

Sam ple .............. ...............13....
M measures ................ ...............15.......... ......

Analysis Approach............... ...............16


4 RE SULT S .............. ...............20....


Aggregate Developmental Scale Score Analysis .............. ..... ............... 2
Reading ................. ...............20.................
M them atics .............. ...............21....












Developmental Scale Score Contrasts by Sex ................. ............... ......... ...22
Reading ................. ...............22.................
M them atics .............. ... .... ....... .. ........2
Developmental Scale Score Contrasts by Sex and Race .............. .....................2
Reading ................. ...............24.................
M them atics .............. ... .... ........ .. .... ........2

Developmental Scale Score Contrasts by Socio-economic Status .............................27
Reading ............. ...._._. ...............27.....
M them atics .............. .. ............... .. .. .. ..................2
Developmental Scale Score Contrasts for Students with Disabilities and Regular
Education Students............... ...............29
Reading ........._..... ...._... ...............29.....
M them atics .............. ...............3 1....
Summary ........._..... ...._... ...............32.....


5 DI SCUS SSION ........._..... ...._... ...............3 3....


Trends by Grade .............. ...............34....
Reading ........._..... ...._... ...............34.....
M them atics .............. ...............35....
Item Format Effects .............. ...............35....

Closing Remarks ........._..... ...._... ...............36.....


APPENDIX


A GRADING FLORIDA PUBLIC SCHOOLS .............. ...............38....


B ANNUAL AYP OBJECTIVES FOR READING ................. .......... ...............40


C ANNUAL AYP OBJECTIVES FOR MATHEMATICS ................. .........._ .....41


LIST OF REFERENCES ............. ...... ._ ...............42....


BIOGRAPHICAL SKETCH .............. ...............47....

















LIST OF TABLES


Table pg

1-1 Florida Comprehensive Assessment Test Item Formats by Grade ............................2

1-2 Florida Comprehensive Assessment Test Achievement Levels .............. ................3

3-1 Matched Low Performing Students by Grade for Reading and Mathematics .........13

3-2 Disaggregation of Matched District and Low Performing Students by Subgroup
for Grades 6 through 10 for Reading and Mathematics............__ ..........__ .....14

4-1 Mean Aggregate Reading DSS for Matched Low Performing Students not Making
Gains by Achievement Level Criteria............... ...............20

4-2 Mean Differences for Aggregate Reading DSS for Matched Low Performing
Students not Making Gains by Achievement Level Criteria from 2001 to 2002.....21

4-3 Mean Aggregate Mathematics DSS for Matched Low Performing Students not
Making Gains by Achievement Level Criteria .............. ...............21....

4-4 Mean Differences for Aggregate Mathematics DSS for Matched Low Performing
Students not Making Gains by Achievement Level Criteria from 2001 to 2002.....22

4-5 Mean Reading DSS by Sex for Matched Low Performing Students not Making
Gains by Achievement Level Criteria............... ...............22

4-6 Mean Differences for Reading DSS by Sex for Matched Low Performing Students
not Making Gains by Achievement Level Criteria from 2001 to 2002 .................. .23

4-7 Mean Mathematics DSS by Sex for Matched Low Performing Students not
Making Gains by Achievement Level Criteria .............. ...............23....

4-8 Mean Differences for Mathematics DSS by Sex for Matched Low Performing
Students not Making Gains by Achievement Level Criteria from 2001 to 2002.....24

4-9 Mean Reading DSS by Sex and Race for Matched Low Performing Students not
Making Gains by Achievement Level Criteria .............. ...............24....

4-10 Mean Differences for Reading DSS by Sex and Race for Matched Low
Performing Students not Making Gains by Achievement Level Criteria from 2001
to 2002 ........._._ ....._.._ ........_._......25..










4-11 Mean Mathematics DSS by Sex and Race for Matched Low Performing Students
not Making Gains by Achievement Level Criteria .............. ....................2

4-12 Mean Differences for Mathematics DSS by Sex and Race for Matched Low
Performing Students not Making Gains by Achievement Level Criteria from 2001
to 2002 ................. ...............26........._.....

4-13 Mean Reading DSS by SES for Matched Low Performing Students not Making
Gains by Achievement Level Criteria............... ...............27

4-14 Mean Differences for Reading DSS by SES for Matched Low Performing Students
not Making Gains by Achievement Level Criteria from 2001 to 2002 ........._.._......28

4-15 Mean Mathematics DSS by SES for Matched Low Performing Students not
Making Gains by Achievement Level Criteria .............. ...............28....

4-16 Mean Differences for Mathematics DSS by SES for Matched Low Performing
Students not Making Gains by Achievement Level Criteria from 2001 to 2002.....29

4-17 Mean Reading DSS for Matched Low Performing Students with Disabilities and
Regular Education Students not Making Gains by Achievement Level Criteria.....29

4-18 Mean Differences for Reading DSS for Matched Low Performing Students with
Disabilities and Regular Education Students not Making Gains by Achievement
Level Criteria from 2001 to 2002............... ...............30..

4-19 Mean Mathematics DSS for Matched Low Performing Students with Disabilities
and Regular Education Students not Making Gains by Achievement Level
Criteria............... ...............31

4-20 Mean Differences for Mathematics DSS for Matched Low Performing Students
with Disabilities and Regular Education Students not Making Gains by
Achievement Level Criteria from 2001 to 2002 ................. .......... ...............32

4-21 Subgroups of Low Performing Students Demonstrating More than One Year' s
Growth in Reading and Mathematics............... ..............3

















LIST OF FIGURES


Figure pg

3-1 FCAT Achievement Levels for the Developmental Scale. .................. ...............15

3-2 Expected Growth under Gain Alternative 3 .............. ...............18....
















Abstract of Thesis Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Master of Arts in Education

ACCOUNTABILITY AND THE FLORIDA COMPREHENSIVE ASSESSMENT
TEST: EFFECTS OF ITEM FORMAT ON LOW PERFORMING STUDENTS IN
MEASURING ONE YEAR' S GROWTH

By

Stewart Francis Elizondo

May 2004

Chair: M. David Miller
Maj or Department: Educational Psychology

The impact of constructed response items in the measurement of one year' s growth

on the Florida Comprehensive Assessment Test for low performing students is analyzed.

Growth was measured from the 2001 to the 2002 Sunshine State Standards assessment on

the vertically equated developmental scores as defined by the Florida School

Accountability System. Data in reading and mathematics for 6th through 10th grade

were examined for a mid-sized school district and disaggregated into several subgroups

as defined in the state No Child Left Behind Accountability Plan. These results were

compared across grade levels to see if growth differed in the grade levels that are

assessed with constructed response items. Results indicated that constructed response

items may be of benefit to many low performing students in mathematics, but to a lesser

extent in reading. Differential results between subgroups were most evident in reading.















CHAPTER 1
INTRODUCTION

Enacted in 1968, The Educational Accountability Act, Section 229.51, Florida

Statutes, empowered the Commissioner of Education to use

all appropriate management tools, techniques, and practices which will cause the
state's educational programs to be more effective and which will provide the
greatest economics in the management and operation of the state's system of
education.

Subsequent changes to the Florida Department of Education' s (FDOE) capabilities

towards that end can be marked by a stream of legislation that can currently be

characterized as an environment of accountability. Utilizing components of the current

accountability systems, questions regarding Florida's lowest performing students will be

analyzed.

The Florida Comprehensive Assessment Test

The Florida Comprehensive Assessment Test (FCAT) currently serves as the

measure for the statewide School Accountability System as envisioned by the A+ Plan

for Education. It also serves as the measure for the federal No Child Left Behind Act of

2001 (NCLB) for all public schools. Both accountability systems include requirements

for annual student growth. Section 229.57, Florida Statutes was amended in 1999 and

specifies that school performance grade category designations shall be based on the

school's current year' s performance and the school's annual learning gains (FDOE, 2001).

Section 1 1 11(b)(2)(H) of NCLB mandates that "intermediate goals for annual yearly

progress" shall be established (NCLB, 2001).










Reading and mathematics components of FCAT based on the Sunshine State

Standards (SSS) are administered in grades 3 through 10. These components are

commonly referred to as FCAT SSS Reading and FCAT SSS Mathematics. Emphasis on

these standards-based assessments by the accountability systems makes their

construction, scoring, and reporting of special interest to stakeholders. Constructed

response (CR) items are utilized on the FCAT SSS Reading assessment in grades 4, 8,

and 10 and on the FCAT SSS Mathematics in grades 5, 8, and 10. CR items are either

Short-Response (SR) or Extended-Response (ER). Not only does item format vary, but

weight given to item formats can also vary across grades (see Table 1-1).

Table 1-1. Florida Comprehensive Assessment Test Item Formats by Grade
Reading Mathematics
Grade Item Formats % of Items Item Formats % of Items
3 MC 100% MC 100%
4 MC 85-90% MC 100%
SR & ER 10-15%
5 MC 100% MC 60-70%
GR 20-25%
SR & ER 10-15%
6 MC 100% MC 60-70%
GR 30-40%
7 MC 100% MC 60-70%
GR 30-40%
8 MC 85-90% MC 50-60%
SR & ER 10-15% GR 25-30%
SR & ER 10-15%
9 MC 100% MC 60-70%
GR 30-40%
10 MC 85-90% MC 50-60%
SR & ER 10-15% GR 25-30%
SR & ER 10-15%
Note. MC: Multiple-Choice, GR: Gridded-Response, SR: Short-Response, and ER:
Extended-Response.

Student scores have been reported on a scale of 100 to 500 for both the FCAT SSS

Reading and Mathematics since they were first reported in 1998. Though these scores










yield Achievement Levels that are used for the School Accountability System and NCLB

(see Table 1-2), they are not for interpretation from grade to grade.

Table 1-2. Florida Comprehensive Assessment Test Achievement Levels
5 Advanced Performance at this level indicates that the student has success
with the most challenging content of the Sunshine State
Standards. A Level 5 student answers most of the test questions
correctly, including the most challenging questions.

4 Proficient Performance at this level indicates that the student has success
with the challenging content of the Sunshine State Standards. A
Level 4 student answers most of the questions correctly, but may
have only some success with questions that reflect the most
challenging content.

3 Proficient Performance at this level indicates that the student has partial
success with the challenging content of the Sunshine State
Standards, but performance is inconsistent. A Level 3 student
answers many of the questions correctly, but is generally less
successful with questions that are most challenging.

2 Basic Performance at this level indicates that the student has limited
success with the challenging content of the Sunshine State
Standards.

1 Below Basic Performance at this level indicates that the student has little
success with the challenging content of the Sunshine State
Standards.


To address this, FCAT results began to be reported on a vertically equated scale

that ranges from 86 to 3008 across grades 3 through 10. This facilitates interpretations of

annual progress from grade to grade. Developmental Scale Scores (DSS), as well as DSS

Change, the difference between consecutive years, appear on student reports as well as

school, district, and state reports for the aggregate means (FDOE, 2003a). It has been

proposed that NCLB encourages the use of vertical scaling procedures (Ananda, 2003).

Though FCAT field-testing began in 1997, reports including DSS have only been

available since 2001.









State and Federal Accountability

The Florida School Accountability System

In 1999, school accountability became more visible to the public with the advent of

school performance grades. Category designations range from "A," making excellent

progress, to "F," failing to make adequate progress (see Appendix A). Schools are

evaluated on the basis of aggregate student performance on FCAT SSS Reading and

FCAT SSS Mathematics in grades 3 through 10, and the Florida Writes statewide writing

assessment administered in grades 4, 8, and 10. Currently, three major components

contribute in their calculation:

1. Yearly achievement of high standards in reading, mathematics and writing,
2. Annual learning gains in reading and mathematics, and
3. Annual learning gains in reading for the lowest 25% of students in each school.

As of 2002, annual learning gains on FCAT SSS Mathematics and FCAT SSS

Reading account for half of the point system that yields school performance grades, with

special attention given to the reading gains of the lowest 25% of students in each school.

There are three ways that schools can be credited for the annual yearly gains of their

students :

1. When students improve their FCAT Achievement Levels (1-2, 2-3, 3-4, 4-5); or

2. When students maintain a relatively high Achievement Level (3, 4 or 5); or

3. When students demonstrate more than one year's growth within Levels 1 or 2, as
measured by an increase in their FCAT Developmental Scale Scores from one year
to the next (FDOE, 2003b).

Incentives for schools graded "A" and those that improve at least one grade level

include monetary incentives under the Florida School Recognition Program under

Section 231.2905, Florida Statutes. Sanctions for schools graded "F" for two years in a

four-year period include eligibility for Opportunity Scholarships under Section 229.0537,









Florida Statutes, so that students can attend higher performing public or private schools

(FDOE, 2001).

The No Child Left Behind Act of 2001

The inclusion of all students when making accountability calculations was not a

common practice prior to the passing of the federal No Child Left Behind Act of 2001

(Linn, 2000). Approved in April of 2003, Florida' s State Accountability Plan meets this

defining requirement and builds on the Florida School Accountability System. It is also

compliant with the Adequate Yearly Progress (AYP) components of the federal law. To

make AYP, the percentage of students earning a score of Proficient or above in reading

and mathematics has to meet or exceed the annual obj ectives for the given year (see

Appendix B and C).

Furthermore, not only will AYP criteria be applied to the aggregate, but it will also

be applied separately to subgroups that disaggregate the data by race, socio-economic

status, disability, and limited English proficiency. If a school fails to meet the annual

obj ectives in reading or mathematics for any subgroup in any content area, the school is

designated as not making AYP. The expectation for growth is such that all students are

Proficient or above on the FCAT SSS Reading and Mathematics no later than 2014

(FDOE, 2003c).

Low Performing Students and One Year's Growth

The fundamental question addressed in this paper is the following: Do grades with

constructed response items show differential growth for various subgroups of students

within Achievement Levels 1 or 2, as measured by an increase in their FCAT

Developmental Scale Scores from one year to the next? The subgroups of interest are

those defined by NCLB. These annual gains directly impact the calculation of school









performance grades under the Florida School Accountability System, and can serve as a

method of monitoring the progress and composition of those non-Proficient students for

the AYP calculations under NCLB. Comparisons of item format effects will be explored

for FCAT SSS Reading and Mathematics in grades 6 through 10.

Other issues, though very relevant, such as: a) the longstanding debate on how to

yield gain scores (Cronbach & Furby, 1970; Lord, 1963), b) the discrepancies

surrounding the various methods of vertically equating tests, especially when including

multiple formats, and their respective implications (Crocker & Algina, 1986; Slinde &

Linn, 1977; Thissen & Wainer, 2001), and c) the most recent debate of effects of high-

stakes testing (Amrein & Berliner, 2002; Carnoy & Loeb, 2002) will be set-aside in an

effort to work within the existing framework of the current accountability systems.

Unintended item format effects, especially on the FCAT SSS Reading and

Mathematics assessments, given their emphasis by both the Florida School

Accountability System and the federal No Child Left Behind Act, only grow in

importance as trends in Florida' s student population continue. In 2002, 14 of Florida' s

67 school districts reported minority enrollment of 50% or more. Many of these were

among the most populous districts across the state. Furthermore, minority representation

statewide has steadily increased from 29.9% in 1976 to 49.4% in 2002 (FDOE, 2003 d).















CHAPTER 2
REVIEW OF LITERATURE

The Movement Towards Constructed Response

Some type of constructed response (CR) item was used in over three quarters of all

statewide assessment programs (Council of Chief State School Officers [CCSSO], 1999).

This is a marked change from the assessment practices of just a decade ago when Barton

and Coley (1994), as cited in Linn (1994), stated that,

The nation is entering an era of change in testing and assessment. Efforts at both
the national and state levels are now directed at greater use of performance
assessment, constructed response questions, and portfolios based on actual student
work (p. 3).

The hypothesis is that these types of items are more closely aligned with the

standards-based reform movement currently underway in education and that they are

better able to detect its effects (Hamilton et al., 2003; Klein et al., 2000). Some have

recently tested whether the relationship between achievement and instruction varies as a

function of item format (McCaffrey et al., 2001).

One of the first and most prominent statewide programs to implement performance-

based assessments in a high-stakes environment was the Maryland School Performance

Assessment Program (MSPAP), first administered in 1991. Though replaced by the

Maryland School Assessment in 2002, Maryland still utilizes CR items in all grades

tested. Although virtually no information was available on the psychometric properties

of performance assessment during the design stages of the MSPAP, results were

encouraging and allowed for innovations in later years (Yen & Ferrara, 1997).









The use of CR items has since become a hallmark of quality testing programs to the

extent that a National Academy of Education panel repeatedly applauded the National

Assessment of Educational Progress' continued move to include CR items. In a later

review, the Committee on the Evaluation of National and State Assessments of

Educational Progress, in conjunction with the Board on Testing and Assessment, called

for an enhancement of the use of testing results, especially CR items, to provide better

interpretation (Pellegrino et al., 1999).

Trait Equivalence Between Item Formats

The debate regarding the existence and interpretation of psychological differences

tapped by CR items, as opposed to multiple choice (MC) items, has been equivocal.

Thissen and Wainer (2001) make clear a basic assumption of psychometrics:

Before the responses of any set of items are combined into a single score that is
taken to be, in some sense, representative of the responses to all of the items, we
must ascertain the extent to which the items measure the same thing (p. 10).

Factor analysis techniques, among other methodologies, have been employed to

analyze the extent to which CR items measure a different trait than do MC items.

Evidence of a free-response factor was found by Bennett, Rock, and Wang (1991),

though a single-factor solution was reported as providing the most parsimonious fit.

Thissen, Wainer, and Wang (1994) reported that despite a small degree of

multidimensionality, it would seem meaningful to combine the scores of MC and CR

items as they, for the most part, measure the same underling proficiency. These authors

suggest that proficiency may be better estimated with only MC items due to a high

correlation with the more reliable multiple-choice factor.

Still yet, others have used factor analysis techniques with MC and CR items and

found that CR items were not measuring anything beyond what was measured by the MC









items (Bridgeman & Rock, 1993). With item response theory methodology, others have

concluded that CR items yielded little information over and above that provided by MC

items (Lukhele et al., 1994).

A recent meta-analysis framed the question of construct (trait) equivalence as a

function of stem equivalence between the two item types. It was found that when items

are constructed in both formats using the same stem, the mean correlation between the

two formats approaches unity and is significantly higher than when using non-stem

equivalent items (Rodriguez, 2003).

It has also been suggested that a better aim of CR items is not to measure the same

trait, but to better measure some cognitive processes better than MC items (Traub, 1993).

This is consistent with a stronger positive relationship between reform practices and

student achievement when measured with CR items rather than MC items (Klein et al.,

2000).

Psychometric Differences Between Item Formats

The psychometric properties of CR items have also been a point of contention in

the literature. Messick (1993) asserted that the question of trait equivalence was

essentially a question of construct validity of test interpretation and use. Furthermore,

this pragmatic view is tolerated to facilitate effectiveness of testing purposes. The two

maj or threats to construct validity, irrelevant test variance and construct

underrepresentation (Messick, 1993), can be judiciously negotiated for the purposes of

demonstrating construct relevant variance with due representation of the construct.

Messick posits that if the hypothesis of perfectly correlated true scores across

formats is not rej ected, then either format, or a combination of both, could be employed

as construct indicators. Although, if the hypothesis is rejected, likely due to differential









method variance, a dominant construct relevant factor or second order factor, may cut

across formats (Messick, 1993). Messick (1995) also later provided a thorough treatment

on the broader systematic validation of performance assessments within the framework of

the unified concept of validity (AERA, APA, & NCME, 1999).

Returning to the issue of reliability raised earlier, Thissen and Wainer' s (2001)

suggestion that MC items more reliably measure a CR factor may, to paraphrase, come

about because,

measuring something that is not quite right accurately may yield far better
measurement than measuring the right thing poorly.

Lastly, we are warned that however appealing CR items may be as exemplars of

instruction or student performance, we must exercise caution when drawing inferences

about people or changes in instruction due to their relatively limited generalizability

across tasks (Brennan, 1995; Dunbar et al., 1991). In an analysis of state-mandated

programs, Miller (2002) concluded that other advantages, such as consequences for

instruction, are necessary to justify these concessions.

Combining Item Formats

Methods on how to best combine the use of CR items with MC items have also

garnered much needed attention. Mehrens (1992) put it succinctly when stating that we

have known for decades that MC items measure some things well and efficiently, but

they do not measure everything and they can be overemphasized. Performance

assessments, on the other hand, have the potential to measure important instructional

obj ectives that cannot be measured by MC items. He also notes that most large-scale

assessments have added performance assessments to the more traditional tests, and not

replaced them.









Wainer and Thissen (1993) add that combining the two formats may allow for the

concatenation of their strengths while compensating for their weaknesses. More

specifically, if a test has various components, it is sensible to weigh the components as a

function of their reliabilities, or modify the lengths of the components to make them

equally reliable. Given difficulties in execution of the latter, it was recommended that

serious consideration be given to IRT weighting in the construction and scoring of tests

with mixed item formats. Just such a process is employed in the construction of the

Florida Comprehensive Assessment Test (FCAT) via an interactive test construction

system, which makes available the capability to store, retrieve, and manipulate IRT test

information, test error, and expected scores (FDOE, 2002)

Differential Effects of Item Formats on Various Groups

Another maj or component to the literature is a segment that seeks differential

effects that CR items may have on gender, race, socio-economic status, and disability.

These contrasts are most timely, given that nature of the Adequate Yearly Progress

requirements of the No Child Left Behind Act of 2001 where, by design, data is

disaggregated by such groups.

In an analysis of open-ended counterparts to a set of items from the quantitative

section of the Graduate Record Exam, Bridgeman (1992) concluded that gender and

ethnicity differences were neither lessened nor exaggerated and that there were no

significant interactions of test format with either gender or ethnicity. Similarly, in a

metanalysis of over 30 studies, Ryan and DeMark (2002) conclude that females typically

outperform males in mathematics and language when students must construct their

responses, but effect sizes are small or smaller by Cohen's (1988) standards.









There have been mixed results on the inclusion of students with disabilities in high-stakes, standards-based assessments. DeFur (2002) summarizes one such example in Virginia's experience with its Standards of Learning and an inclusion policy.

Conspicuous by their absence are studies on the differential effects of item formats involving

students with disabilities. This may soon draw needed attention with the March 2003

announcement of a statutory NCLB provision. Effective for the 2003-2004 academic

year, it states that in calculating AYP, at most 1% of all students tested can be held to

alternative achievement standards at the district and state levels (Paige, 2003). The

provision was published in the Federal Register, Vol. 68, No. 236, on December 9, 2003.















CHAPTER 3
METHODS

Data

The electronic file received by a mid-sized school district in May 2002 from the

Florida Department of Education (FDOE) was analyzed. This file contained Florida

Comprehensive Assessment Test (FCAT) scores from the 2001 and 2002 administrations.

The data for Florida had changed in 2001 in two key ways. Beginning with the 2001

administration, the FCAT program was expanded to include all grades 3 through 10.

Prior to this, reading was assessed only in grades 4, 8, and 10, while mathematics was

assessed only in grades 5, 8, and 10. Secondly, these changes offered the opportunity to

introduce a Developmental Scale Score (DSS) that linked adjacent grades together

allowing for progress to be tracked over time.

Sample

The data set consisted of 2,402 low performing students in reading and 2,506 low

performing students in mathematics in grades 6 through 10 who had a score from both the

2001 and 2002 FCAT administrations. Table 3-1 shows the composition of the students

by grade. Low performing is defined as remaining within the Basic or Below Basic

performance standards commonly referred to as Level 1 and Level 2 (FDOE, 2003c).

Table 3-1. Matched Low Performing Students by Grade for Reading and Mathematics
Grade Reading Mathematics
6 498 569
7 475 557
8 477 533
9 489 462
10 463 385









Disaggregation of low performing students by sex, race, socio-economic status,

and disability is shown in Table 3-2 for reading and mathematics. For comparison,

disaggregation of all matched district students, also in grades 6 through 10, is provided.

Table 3-2. Disaggregation of Matched District and Low Performing Students by
Subgroup for Grades 6 through 10 for Reading and Mathematics
Reading Mathematics
Low Low
Performing Performing
All Students Students All Students Students
Agg 8,399 2,402 8,404 2,506
Sex F 4,307 (51%) 1,158 (48%) 4,309 (51%) 1,300 (52%)
M 4,092 (49%) 1,244 (52%) 4,095 (49%) 1,206 (48%)
Race B 3,158 (38%) 1,646 (69%) 3,161 (38%) 1,767 (71%)
H 326 (4%) 77 328 (4%) 70
W 4,915 (59%) 756 (31%) 4,915 (58%) 739 (29%)
SES ED 3,271 (39%) 1,543 (64%) 3,275 (39%) 1,671 (67%)
N-ED 5,128 (61%) 859 (36%) 5,128 (61%) 835 (33%)
SWD Gift 853 (10%) 16 852 (10%) 11
RES 5,928 (71%) 1,311 (55%) 5,928 (71%) 1,430 (57%)
SWD 1,618 (19%) 1,075 (45%) 1,624 (19%) 1,065 (42%)
Note. Agg = Aggregate, F = Female, M = Male, B = Black, H = Hispanic, W = White,
SES = Socio-economic Status, ED = Economically Disadvantaged, N-ED = Non-
Economically Disadvantaged, SWD = Students With Disabilities, Gift = Gifted Students,
RES = Regular Education Students.

Gifted students and Hispanic students were omitted from analysis due to small

sample sizes. Disaggregated reporting for Adequate Yearly Progress (AYP) is not

required when the number of students is insufficient to yield statistically reliable

information or would reveal personally identifiable information (NCLB, 2002). Linn et

al. (2002) suggested a minimum number of 25. The FDOE subscribes to a minimum

group size of 30 for school performance grades (FDOE, 2003b) and for AYP calculations

(FDOE, 2003c). In the analyses that follow, subgroups of low performing students are

further disaggregated by grade and, in all cases, gifted students and Hispanic students fail

to meet even Linn's suggestion. In several cases, group size was as low as zero or one.
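As a concrete illustration of this screening rule, the minimal sketch below drops subgroups that fall under the minimum reporting size. The counts and the dictionary are invented for illustration and this is not an FDOE routine; only the rule itself is the point.

    # Minimal sketch of the subgroup-size screen described above; counts are invented.
    MIN_GROUP_SIZE = 30   # FDOE minimum for school grades and AYP; Linn et al. (2002) suggest 25

    subgroup_counts = {"Black": 164, "White": 83, "Hispanic": 14, "Gifted": 2}

    reportable = {group: n for group, n in subgroup_counts.items() if n >= MIN_GROUP_SIZE}
    print(reportable)   # groups such as Hispanic and gifted students would be omitted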

































Figure 3-1. FCAT Achievement Levels for the Developmental Scale Scores. Source: Understanding FCAT Reports (FDOE, 2003a).


The developmental scale links FCAT performance across grades 3 through 10, established during the Spring of 2001 and facilitated by item response theory (IRT) scaling software (Thissen, 1991). IRT metrics have been shown to be preferred over grade equivalents when examining such longitudinal patterns, as they are sensitive to varying rates of growth over time (Seltzer et al., 1994).


Measures

Reading and mathematics components of FCAT based on the Sunshine State

Standards (SSS) were examined. Commonly referred to as FCAT SSS Reading and

FCAT SSS Mathematics, they are the criterion-referenced components of the statewide

assessment program. The resulting Developmental Scale Scores (DSS) were analyzed.

Figure 3-1 shows the relationship between these scores and the corresponding

Achievement Levels by content area.










Analysis Approach

Components of both the Florida School Accountability System, as envisioned by

the A+ Plan for Education, and the federal No Child Left Behind Act of 2001 (NCLB)

were used to approach the question of differential growth for various groups of students

within the Achievement Levels of Basic and Below Basic focusing on grades with

constructed response items.

Within the Florida School Accountability System, making annual learning gains on

FCAT SSS Mathematics and FCAT SSS Reading account for three out of six categories

that are used in calculating school performance grades. Special attention is given to the

FCAT SSS Reading gains of the lowest 25% of students in each school (see Appendix

A). As of 2002, students can demonstrate gains via three different alternatives (FDOE,

2003b).

1. When students improve their FCAT Achievement Levels (1-2, 2-3, 3-4, 4-5); or

2. When students maintain a relatively high Achievement Level (3, 4 or 5); or

3. When students demonstrate more than one year's growth within Levels 1 or 2, as
measured by an increase in their FCAT Developmental Scale Scores from one year
to the next.

The focus of the analysis will be on the students eligible for demonstrating gains

via Gain Alternative 3. The reasoning being that these students have demonstrated two

consecutive years of Basic or Below Basic performance. Students that demonstrated

gains via Gain Alternative 1 have shown a substantive improvement in their performance.

For example, a student that has improved from Achievement Level 2 to Achievement

Level 3, has gone from a classification of Basic to Proficient (FDOE, 2003c). Students

that demonstrated gains via Gain Alternative 2 have shown consistently high

performance as Achievement Level 3 or higher is classified as Proficient or Advanced










(FDOE, 2003c). Furthermore, this speaks directly to the policy that is at hand and

utilizes the FCAT Developmental Scale Scores as they are intended.
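A minimal sketch of the gain-alternative logic described above is given below, applied to one invented student record. The function and field names are mine, not FDOE code; the expected DSS growth within Levels 1 and 2 uses the reading values reported in Figure 3-2 (FDOE, 2003b).

    # Sketch of the three learning-gain alternatives, applied to an invented student.
    EXPECTED_GROWTH_READING = {6: 133, 7: 110, 8: 92, 9: 77, 10: 77}   # keyed by 2002 grade

    def gain_alternative(level_2001, level_2002, dss_2001, dss_2002, grade_2002,
                         expected=EXPECTED_GROWTH_READING):
        """Return the gain alternative (1, 2, or 3) a student satisfies, or None."""
        if level_2002 > level_2001:                          # Alternative 1: improve a level
            return 1
        if level_2001 >= 3 and level_2002 >= 3:              # Alternative 2: maintain Level 3, 4, or 5
            return 2
        if level_2001 <= 2 and level_2002 <= 2:              # Alternative 3: more than one year's growth
            if dss_2002 - dss_2001 > expected[grade_2002]:   # within Levels 1 or 2
                return 3
        return None

    # Example: a grade 8 reader who stayed in Level 2 but gained 120 DSS points.
    print(gain_alternative(level_2001=2, level_2002=2, dss_2001=1250, dss_2002=1370, grade_2002=8))

Students handled by the last branch, those who remain in Levels 1 or 2 in both years, are the group whose mean DSS gains the analyses that follow compare against the expected values.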

Five analyses were performed, the first of which is for all matched low performing

students not making gains by achievement level criteria. The remaining four address

subgroups that are suggested (CCSSO, 2002a) or required (NCLB, 2002) when

disaggregating school, district, and state data in calculating the Adequate Yearly Progress

(AYP) components of NCLB. To make AYP, the percentage of students earning a score

of Proficient or above in reading and mathematics has to meet or exceed the annual

objectives for the given year (see Appendix B and C). Though all students in the

analyses are non-Proficient, tracking the composition and progress of these groups could

lead to an understanding that may facilitate policy and instruction.

Comparisons of item format effects in measuring one year's growth will be made

between FCAT SSS Reading and FCAT SSS Mathematics as well as across grades 6

through 10 within each content area. Lastly, differential growth within each grade for

both content areas will be discussed.

To further specify what constitutes one year's growth as defined in the Florida

School Accountability System, it is necessary to revisit the developmental scale,

specifically as it applies to Gain Alternative 3. The definition is based on the numerical

cut-scores for the FCAT Achievement Levels that have been approved by the State Board

of Education. The following steps were applied to the cut scores, separately, for each

subject and each grade-level pair (FDOE, 2003b). Per State Board Rule 6A-1.09422,

there are four cut scores that separate DSS into five Achievement Levels. The increase in

the DSS necessary to maintain the same relative standing within Achievement Levels










from one grade to the next was calculated for each of the four cut scores between the five

Achievement Levels. The median value of these four differences was determined to best

represent the entire student population. Median gain expectations were then calculated

for each grade and a logarithmic curve was fitted. Other curves were considered, but the logarithmic curve was adopted because it best captures the theoretical expectation of greater gains in the early grades due to student maturation. Graphs of these curves and the resulting expected growth on the developmental scale are shown in Figure 3-2 for reading and mathematics.


Gain Alternative 3: Expected Growth

Gr 3-4 Gr 4-5 Gr 5-6 Gr 6-7 Gr 7-8 Gr 8-9 Gr 9-10
Reading 230 186 133 110 92 77 77
Math 182 110 95 78 64 54 48

Figure 3-2. Expected Growth under Gain Alternative 3. Source: Guide to Calculating
School Grades Technical Assistance Paper (FDOE, 2003b).
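The derivation just described can be sketched in a few lines. The cut scores below are invented placeholders rather than the adopted State Board Rule 6A-1.09422 values, so only the procedure (differences at each cut, median per grade pair, logarithmic fit) should be read from the example.

    import numpy as np

    # Sketch of the expected-growth derivation; the cut scores are invented placeholders.
    cut_scores = {                    # DSS cuts separating Achievement Levels 1-5
        6: [946, 1108, 1321, 1509],
        7: [1031, 1197, 1411, 1566],
        8: [1123, 1299, 1490, 1640],
    }

    grades, median_gains = [], []
    for g in sorted(cut_scores)[1:]:
        diffs = np.array(cut_scores[g]) - np.array(cut_scores[g - 1])  # growth needed at each cut
        grades.append(g)                                               # grade pair (g-1) -> g
        median_gains.append(np.median(diffs))

    # Fit a logarithmic curve, gain = a + b * ln(grade), to the median gain expectations.
    slope, intercept = np.polyfit(np.log(grades), median_gains, 1)
    for g, m in zip(grades, median_gains):
        fitted = intercept + slope * np.log(g)
        print(f"grades {g - 1}-{g}: median cut-score difference {m:.0f}, fitted expectation {fitted:.0f}")

With only two grade pairs the fit is exact; the FDOE procedure uses all adjacent grade pairs from grades 3 through 10 and yields the values shown in Figure 3-2.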

To be denoted as making gains under Gain Alternative 3, students must

demonstrate more than the expected growth on the developmental scale. Therefore, they

must score at least one developmental scale score point more than the values listed above.

These criteria were applied to mean differences for the matched low performing students.

This was done for the aggregate for each grade 6 through 10 and, in the interest of the

NCLB requirements, by sex, sex and race, socio-economic status, and disability.
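The comparison itself reduces to a one-line rule. The sketch below applies it to the aggregate reading mean differences reported later in Table 4-2, using the expected-growth values from Figure 3-2; the helper name is mine.

    # The "met expectations" rule of Gain Alternative 3, applied to the aggregate
    # reading mean differences from Table 4-2 and the expected growth from Figure 3-2.
    EXPECTED_READING = {"5-6": 133, "6-7": 110, "7-8": 92, "8-9": 77, "9-10": 77}
    MEAN_DIFF_READING = {"5-6": 43, "6-7": 129, "7-8": 96, "8-9": 150, "9-10": -2}

    def met_expectations(mean_difference, expected_growth):
        # More than one year's growth means exceeding the expected value,
        # i.e., scoring at least one DSS point above it.
        return mean_difference > expected_growth

    for pair, diff in MEAN_DIFF_READING.items():
        print(pair, "Yes" if met_expectations(diff, EXPECTED_READING[pair]) else "No")

The same rule, applied to each subgroup mean difference, generates the Met Expectations columns throughout Chapter 4.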









As noted earlier, other issues, though admittedly relevant, will be set aside in an effort to work within the existing framework of the state and federal accountability systems. These include, but are not limited to, the longstanding debate on how to yield gain scores (Cronbach & Furby, 1970; Lord, 1963); the discrepancies surrounding the various methods of vertically equating tests, especially when including multiple formats, and their respective implications (Crocker & Algina, 1986; Slinde & Linn, 1977; Thissen & Wainer, 2001); and the most recent debate over the effects of high-stakes testing (Amrein & Berliner, 2002; Carnoy & Loeb, 2002).














CHAPTER 4
RESULTS

Aggregate Developmental Scale Score Analysis

Reading

Sample sizes, means, and standard deviations for grade level aggregate data for

the Florida Comprehensive Assessment Test Sunshine State Standards (FCAT SSS)

Reading Developmental Scale Scores (DSS) are shown in Table 4-1.

Table 4-1. Mean Aggregate Reading DSS for Matched Low Performing Students not
Making Gains by Achievement Level Criteria
Grade Year n M (SD)
6 2002 498 1,074 (266)

2001 498 1,031 (294)

7 2002 475 1,245 (245)

2001 475 1,116 (255)

8 2002 477 1,353 (258)

2001 477 1,257 (254)

9 2002 489 1,525 (264)

2001 489 1,374 (251)

10 2002 463 1,590 (275)

2001 463 1,592 (221)


Mean differences and comparisons to expected growth for grade level aggregate

data for the FCAT SSS Reading DSS are shown in Table 4-2.









Table 4-2. Mean Differences for Aggregate Reading DSS for Matched Low Performing
Students not Making Gains by Achievement Level Criteria from 2001 to 2002
DSS Mean Expected Met
Grades n Difference Growth Expectations
5 to 6 498 43 133 No

6 to 7 475 129 110 Yes

7 to 8a 477 96 92 Yes

8 to 9 489 150 77 Yes

9 to 10a 463 -2 77 No

a Grades 8 and 10 utilize constructed-response items.

Mathematics

Sample sizes, means, and standard deviations for grade level aggregate data for

the FCAT SSS Mathematics DSS are shown in Table 4-3.

Table 4-3. Mean Aggregate Mathematics DSS for Matched Low Performing Students
not Making Gains by Achievement Level Criteria
Grade Year n M (SD)
6 2002 569 1,314 (235)

2001 569 1,123 (227)

7 2002 557 1,360 (262)

2001 557 1,290 (257)

8 2002 533 1,459 (242)

2001 533 1,307 (272)

9 2002 462 1,587 (220)

2001 462 1,438 (241)

10 2002 385 1,689 (190)

2001 385 1,636 (206)









Mean differences and comparisons to expected growth for grade level aggregate

data for the FCAT SSS Mathematics DSS are shown in Table 4-4.

Table 4-4. Mean Differences for Aggregate Mathematics DSS for Matched Low
Performing Students not Making Gains by Achievement Level Criteria from
2001 to 2002
Expected Met
Grades n DSS Mean Difference Growth Expectations
5 to 6 569 190 95 Yes
6 to 7 557 70 78 No
7 to 8a 533 151 64 Yes
8 to 9 462 149 54 Yes
9 to 10a 385 53 48 Yes
a Grades 8 and 10 utilize constructed-response items.

Developmental Scale Score Contrasts by Sex

Reading

Sample sizes, means, and standard deviations for grade level data for the FCAT SSS Reading DSS by sex are shown in Table 4-5.

Table 4-5. Mean Reading DSS by Sex for Matched Low Performing Students not
Making Gains by Achievement Level Criteria
Female Male
Grade Year n M (SD) n M (SD)
6 2002 241 1,115 (247) 257 1,035 (277)
2001 241 1,060 (269) 257 1,003 (313)
7 2002 225 1,288 (234) 250 1,206 (247)
2001 225 1,156 (237) 250 1,079 (266)
8 2002 223 1,388 (259) 254 1,321 (253)
2001 223 1,304 (236) 254 1,215 (262)
9 2002 237 1,583 (231) 252 1,470 (281)
2001 237 1,413 (235) 252 1,338 (261)
10 2002 232 1,601 (284) 231 1,580 (266)
2001 232 1,608 (208) 231 1,576 (233)









Mean differences and comparisons to expected growth for grade level data for the

FCAT SSS Reading DSS by sex are shown in Table 4-6.

Table 4-6. Mean Differences for Reading DSS by Sex for Matched Low Performing
Students not Making Gains by Achievement Level Criteria from 2001 to 2002
Expected Met
Grades Group n DSS Mean Difference Growth Expectations
5 to 6 F 241 55 133 No
M 257 31 133 No
6 to 7 F 225 132 110 Yes
M 250 126 110 Yes
7 to 8a F 223 84 92 No
M 254 105 92 Yes
8 to 9 F 237 170 77 Yes
M 252 132 77 Yes
9 to 10a F 232 -7 77 No
M 231 4 77 No
Note. F = Female, M = Male. a Grades 8 and 10 utilize constructed-response items.

Mathematics

Sample sizes, means, and standard deviations for grade level data for the FCAT

SSS Mathematics DSS by sex are shown in Table 4-7.

Table 4-7. Mean Mathematics DSS by Sex for Matched Low Performing Students not
Making Gains by Achievement Level Criteria
Female Male
Grade Year n M (SD) n M (SD)
6 2002 303 1,332 (213) 266 1,294 (256)
2001 303 1,127 (223) 266 1,120 (231)
7 2002 295 1,398 (245) 262 1,317 (273)
2001 295 1,325 (237) 262 1,251 (274)
8 2002 255 1,487 (238) 278 1,433 (244)
2001 255 1,357 (254) 278 1,262 (280)
9 2002 235 1,604 (210) 227 1,569 (229)
2001 235 1,476 (220) 227 1,397 (255)
10 2002 212 1,692 (184) 173 1,686 (198)
2001 212 1,643 (201) 173 1,628 (212)

Mean differences and comparisons to expected growth for grade level data for the


FCAT SSS Mathematics DSS by sex are shown in Table 4-8.









Table 4-8. Mean Differences for Mathematics DSS by Sex for Matched Low Performing
Students not Making Gains by Achievement Level Criteria from 2001 to 2002
Expected Met
Grades Group n DSS Mean Difference Growth Expectations
5 to 6 F 303 205 95 Yes
M 266 174 95 Yes
6 to 7 F 295 72 78 No
M 262 67 78 No
7 to 8a F 255 130 64 Yes
M 278 171 64 Yes
8 to 9 F 235 128 54 Yes
M 227 172 54 Yes
9 to 10a F 212 49 48 Yes
M 173 57 48 Yes
Note. F = Female, M = Male. a Grades 8 and 10 utilize constructed-response items.

Developmental Scale Score Contrasts by Sex and Race

Reading

Sample sizes, means, and standard deviations for grade level data for the FCAT SSS Reading DSS by sex and race are shown in Table 4-9.

Table 4-9. Mean Reading DSS by Sex and Race for Matched Low Performing Students
not Making Gains by Achievement Level Criteria
Female Male
Black White Black White
Grade Year n M (SD) n M (SD) n M (SD) n M (SD)
6 2002 189 1,090 (246) 52 1,206 (231) 193 1,016 (264) 64 1,094 (307)
2001 189 1,034 (277) 52 1,153 (216) 193 986 (305) 64 1,059 (335)
7 2002 166 1,266 (241) 59 1,350 (202) 156 1,178 (250) 94 1,251 (237)
2001 166 1,130 (244) 59 1,229 (200) 156 1,028 (271) 94 1,164 (235)
8 2002 148 1,352 (259) 75 1,460 (245) 166 1,277 (262) 88 1,404 (213)
2001 148 1,293 (211) 75 1,326 (279) 166 1,189 (253) 88 1,267 (273)
9 2002 157 1,541 (228) 80 1,665 (216) 172 1,410 (278) 80 1,597 (242)
2001 157 1,362 (237) 80 1,514 (197) 172 1,286 (253) 80 1,449 (244)
10 2002 148 1,533 (303) 84 1,720 (196) 151 1,512 (274) 80 1,708 (194)
2001 148 1,563 (211) 84 1,688 (177) 151 1,540 (220) 80 1,644 (241)

Mean differences and comparisons to expected growth for grade level data for the

FCAT SSS Reading DSS by sex and race are shown in Table 4-10.

Table 4-10. Mean Differences for Reading DSS by Sex and Race for Matched Low
Performing Students not Making Gains by Achievement Level Criteria from
2001 to 2002
Expected Met
Grades Group n DSS Mean Difference Growth Expectations
5 to 6 BF 189 56 133 No
BM 193 30 133 No
WF 52 54 133 No
WM 64 34 133 No
6 to 7 BF 166 136 110 Yes
BM 156 150 110 Yes
WF 59 121 110 Yes
WM 94 87 110 No
7 to 8a BF 148 59 92 No
BM 166 89 92 No
WF 75 133 92 Yes
WM 88 138 92 Yes
8 to 9 BF 157 179 77 Yes
BM 172 125 77 Yes
WF 80 152 77 Yes
WM 80 148 77 Yes
9 to 10a BF 148 -30 77 No
BM 151 -28 77 No
WF 84 33 77 No
WM 80 64 77 No
Note. BF = Black Female, BM = Black Male, WF = White Female, WM = White Male.
a Grades 8 and 10 utilize constructed-response items.












































Mathematics

Sample sizes, means, and standard deviations for grade level data for the FCAT SSS Mathematics DSS by sex and race are shown in Table 4-11.

Table 4-11. Mean Mathematics DSS by Sex and Race for Matched Low Performing
Students not Making Gains by Achievement Level Criteria
Female Male
Black White Black White
Grade Year n M (SD) n M (SD) n M (SD) n M (SD)
6 2002 220 1,294 (220) 83 1,433 (153) 185 1,247 (262) 81 1,403 (206)
2001 220 1,087 (227) 83 1,233 (174) 185 1,082 (230) 81 1,208 (210)
7 2002 203 1,342 (259) 92 1,520 (155) 176 1,263 (280) 86 1,428 (224)
2001 203 1,276 (244) 92 1,433 (179) 176 1,199 (286) 86 1,356 (212)
8 2002 180 1,462 (239) 75 1,546 (226) 183 1,390 (241) 95 1,516 (227)
2001 180 1,334 (252) 75 1,412 (252) 183 1,208 (270) 95 1,365 (271)
9 2002 170 1,578 (214) 65 1,672 (184) 178 1,543 (238) 49 1,666 (163)
2001 170 1,439 (222) 65 1,573 (184) 178 1,370 (252) 49 1,496 (243)
10 2002 144 1,654 (192) 68 1,771 (134) 128 1,656 (207) 45 1,770 (140)
2001 144 1,601 (207) 68 1,732 (155) 128 1,604 (219) 45 1,697 (171)

Mean differences and comparisons to expected growth for grade level data for the FCAT SSS Mathematics DSS by sex and race are shown in Table 4-12.

Table 4-12. Mean Differences for Mathematics DSS by Sex and Race for Matched Low
Performing Students not Making Gains by Achievement Level Criteria from
2001 to 2002
Expected Met
Grades Group n DSS Mean Difference Growth Expectations
5 to 6 BF 220 207 95 Yes
BM 185 – 95 Yes
WF 83 – 95 Yes
WM 81 – 95 Yes
6 to 7 BF 203 66 78 No
BM 176 64 78 No
WF 92 87 78 Yes
WM 86 72 78 No
7 to 8a BF 180 128 64 Yes
BM 183 182 64 Yes
WF 75 134 64 Yes
WM 95 150 64 Yes
8 to 9 BF 170 139 54 Yes
BM 178 172 54 Yes
WF 65 99 54 Yes
WM 49 170 54 Yes
9 to 10a BF 144 53 48 Yes
BM 128 52 48 Yes
WF 68 40 48 No
WM 45 73 48 Yes
Note. BF = Black Female, BM = Black Male, WF = White Female, WM = White Male.
a Grades 8 and 10 utilize constructed-response items.


Developmental Scale Score Contrasts by Socio-economic Status

Reading

Sample sizes, means, and standard deviations for grade level data for the FCAT

SSS Reading DSS by socio-economic status (SES) are shown in Table 4-13.

Table 4-13. Mean Reading DSS by SES for Matched Low Performing Students not
Making Gains by Achievement Level Criteria
Economically Non-Economically
Disadvantaged Disadvantaged
Grade Year n M (SD) n M (SD)
6 2002 398 1,055 (266) 100 1,149 (253)
2001 398 1,004 (295) 100 1,134 (262)
7 2002 337 1,223 (254) 138 1,297 (212)
2001 337 1,093 (257) 138 1,172 (243)
8 2002 323 1,314 (264) 154 1,433 (225)
2001 323 1,239 (247) 154 1,294 (266)
9 2002 280 1,465 (272) 209 1,604 (231)
2001 280 1,329 (252) 209 1,434 (238)
10 2002 205 1,510 (295) 258 1,655 (240)
2001 205 1,525 (235) 258 1,645 (194)




























































Mean differences and comparisons to expected growth for grade level data for the FCAT SSS Reading DSS by SES are shown in Table 4-14.

Table 4-14. Mean Differences for Reading DSS by SES for Matched Low Performing
Students not Making Gains by Achievement Level Criteria from 2001 to 2002
Expected Met
Grades Group n DSS Mean Difference Growth Expectations
5 to 6 ED 398 50 133 No
N-ED 100 15 133 No
6 to 7 ED 337 130 110 Yes
N-ED 138 125 110 Yes
7 to 8a ED 323 75 92 No
N-ED 154 138 92 Yes
8 to 9 ED 280 135 77 Yes
N-ED 209 169 77 Yes
9 to 10a ED 205 -16 77 No
N-ED 258 9 77 No
Note. ED = Economically Disadvantaged, N-ED = Non-Economically Disadvantaged.
a Grades 8 and 10 utilize constructed-response items.

Mathematics

Sample sizes, means, and standard deviations for grade level data for the FCAT SSS Mathematics DSS by SES are shown in Table 4-15.

Table 4-15. Mean Mathematics DSS by SES for Matched Low Performing Students not
Making Gains by Achievement Level Criteria
Economically Non-Economically
Disadvantaged Disadvantaged
Grade Year n M (SD) n M (SD)
6 2002 436 1,294 (237) 133 1,381 (217)
2001 436 1,103 (227) 133 1,191 (212)
7 2002 385 1,319 (265) 172 1,452 (229)
2001 385 1,248 (264) 172 1,386 (212)
8 2002 365 1,426 (245) 168 1,529 (220)
2001 365 1,277 (265) 168 1,372 (277)
9 2002 293 1,558 (225) 169 1,637 (203)
2001 293 1,417 (228) 169 1,473 (258)
10 2002 192 1,649 (198) 193 1,728 (174)
2001 192 1,588 (231) 193 1,684 (164)

Mean differences and comparisons to expected growth for grade level data for the FCAT SSS Mathematics DSS by SES are shown in Table 4-16.

Table 4-16. Mean Differences for Mathematics DSS by SES for Matched Low
Performing Students not Making Gains by Achievement Level Criteria from
2001 to 2002
Expected Met
Grades Group n DSS Mean Difference Growth Expectations
5 to 6 ED 436 191 95 Yes
N-ED 133 190 95 Yes
6 to 7 ED 385 71 78 No
N-ED 172 66 78 No
7 to 8a ED 365 149 64 Yes
N-ED 168 157 64 Yes
8 to 9 ED 293 141 54 Yes
N-ED 169 164 54 Yes
9 to 10a ED 192 61 48 Yes
N-ED 193 45 48 No
Note. ED = Economically Disadvantaged, N-ED = Non-Economically Disadvantaged.
a Grades 8 and 10 utilize constructed-response items.

Developmental Scale Score Contrasts for Students with Disabilities and Regular
Education Students

Reading

Sample sizes, means, and standard deviations for grade level data for the FCAT SSS Reading DSS for students with disabilities and regular education students are shown in Table 4-17. Gifted students were omitted due to a small sample size.

Table 4-17. Mean Reading DSS for Matched Low Performing Students with Disabilities
and Regular Education Students not Making Gains by Achievement Level
Criteria
Students with Disabilities Regular Education Students
Grade Year n M (SD) n M (SD)
6 2002 252 980 (271) 240 1,165 (223)
2001 252 899 (317) 240 1,162 (189)
7 2002 240 1,148 (252) 234 1,344 (192)
2001 240 1,018 (254) 234 1,215 (216)
8 2002 219 1,259 (241) 252 1,428 (246)
2001 219 1,132 (249) 252 1,360 (208)
9 2002 212 1,376 (279) 276 1,638 (184)
2001 212 1,251 (256) 276 1,468 (203)
10 2002 152 1,435 (255) 309 1,666 (252)
2001 152 1,444 (232) 309 1,664 (175)

Mean differences and comparisons to expected growth for grade level data for the FCAT SSS Reading DSS for students with disabilities and regular education students are shown in Table 4-18. Gifted students were omitted due to a small sample size.

Table 4-18. Mean Differences for Reading DSS for Matched Low Performing Students
with Disabilities and Regular Education Students not Making Gains by
Achievement Level Criteria from 2001 to 2002
Expected Met
Grades Group n DSS Mean Difference Growth Expectations
5 to 6 SWD 252 82 133 No
RES 240 3 133 No
6 to 7 SWD 240 130 110 Yes
RES 234 129 110 Yes
7 to 8a SWD 219 126 92 Yes
RES 252 68 92 No
8 to 9 SWD 212 124 77 Yes
RES 276 170 77 Yes
9 to 10a SWD 152 -9 77 No
RES 309 2 77 No
Note. SWD = Students With Disabilities, RES = Regular Education Students.
a Grades 8 and 10 utilize constructed-response items.









Mathematics

Sample sizes, means, and standard deviations for grade level data for the FCAT

SSS Mathematics DSS for students with disabilities and regular education students are

shown in Table 4-19. Gifted students were omitted due to a small sample size.

Table 4-19. Mean Mathematics DSS for Matched Low Performing Students with
Disabilities and Regular Education Students not Making Gains by
Achievement Level Criteria
Students with Disabilities Regular Education Students


Grade Year n M (SD) n M (SD)
6 2002 246 1,208 319 1,393
(254) (182)
2001 246 1,026 319 1,196
(234) (191)
7 2002 244 1,215 309 1,471
(266) (194)
2001 244 1,157 309 1,392
(270) (191)
8 2002 233 1,341 297 1,548
(243) (199)
2001 233 1,159 297 1,419
(261) (220)
9 2002 198 1,473 264 1,673
(247) (149)
2001 198 1,307 264 1,536
(236) (194)
10 2002 144 1,573 241 1,758
(206) (141)
2001 144 1,513 241 1,710
(240) (137)

Mean differences and comparisons to expected growth for grade level data for the

FCAT SSS Mathematics DSS for students with disabilities and regular education students

are shown in Table 4-20. Gifted students were omitted due to a small sample size.









Table 4-20. Mean Differences for Mathematics DSS for Matched Low Performing
Students with Disabilities and Regular Education Students not Making Gains
by Achievement Level Criteria from 2001 to 2002
Expected Met
Grades Group n DSS Mean Difference Growth Expectations
5 to 6 SWD 246 182 95 Yes
RES 319 197 95 Yes
6 to 7 SWD 244 58 78 No
RES 309 79 78 Yes
7 to 8a SWD 233 182 64 Yes
RES 297 128 64 Yes
8 to 9 SWD 198 169 54 Yes
RES 264 137 54 Yes
9 to 10a SWD 144 61 48 Yes
RES 241 48 48 Yes
Note. SWD Students With Disabilities, RES Regular Education Students.
a Grades 8 and 10 utilize constructed-response items.

Summary

Subgroups of low performing students demonstrating more than one year's

growth, as measured by an increase in their FCAT Developmental Scale Scores from one

year to the next, are indicated in Table 4-21.

Table 4-21. Subgroups of Low Performing Students Demonstrating More than One
Year's Growth in Reading and Mathematics
Reading Mathematics
5-6 6-7 7-8 8-9 9-10 5-6 6-7 7-8 8-9 9-10
Sex F No Yes No Yes No Yes No Yes Yes Yes
M No Yes Yes Yes No Yes No Yes Yes Yes
Sex/ BF No Yes No Yes No Yes No Yes Yes Yes
Race BM No Yes No Yes No Yes No Yes Yes Yes
WF No Yes Yes Yes No Yes Yes Yes Yes No
WM No No Yes Yes No Yes No Yes Yes Yes
SES ED No Yes No Yes No Yes No Yes Yes Yes
N-ED No Yes Yes Yes No Yes No Yes Yes No
SWD SWD No Yes Yes Yes No Yes No Yes Yes Yes
RES No Yes No Yes No Yes Yes Yes Yes Yes
Note. F = Female, M = Male, B = Black, W = White, SES = Socio-economic Status,
ED = Economically Disadvantaged, N-ED = Non-Economically Disadvantaged, SWD =
Students With Disabilities, RES = Regular Education Students.















CHAPTER 5
DISCUSSION

What makes the isolation of item format effects in an existing accountability

system difficult is that comparative groups cannot be established without compromising

the very structure of that accountability system. The Florida Comprehensive Assessment

Test (FCAT) is administered to all eligible students and the results are used for decisions

that can be of high stakes. To alter the composition of item formats for some students from what is currently used for all others, and still be able to draw these conclusions, would not only have political repercussions, but would also be untenable.

The approach taken here, instead, seeks the emergence of patterns from data

obtained from test administrations under actual operational conditions. This also has the added advantage of incorporating the first two years of results from what, at the time, was a

change in the FCAT testing program and its application to the Florida School

Accountability System. Despite the changes, including the addition of annual learning

gains in the calculation of school performance grades, continued improvement was

widely reported in 2002 and more so in 2003 with almost half of all schools earning an A

grade.

The first results of the federal No Child Left Behind Act of 2001 (NCLB) were

also made available in 2003. For Florida, these results were not as encouraging. Just over 400 schools out of 3,000 met all requirements for making Adequate Yearly Progress

(AYP). These mixed messages, along with recent results of the National Assessment of









Educational Progress, lead to the conclusion that, as a whole, while Florida is making

progress, it simply is not enough for many and much remains to be done.

Towards this end, the analyses presented here focused attention on students within

Achievement Level 1 or 2, i.e. performing at the Basic and Below Basic level. The 2003

NCLB results show that the district from which the data for these analyses were drawn did not make AYP. A closer look at the subgroups shows that Black students and students with disabilities did not meet the required percentage proficient, i.e., AYP, in reading, with economically disadvantaged students coming close to missing the criteria. For

mathematics, none of these subgroups made AYP. Although the analyses are from 2001

and 2002 and focus on grades 6 through 10, the discussion below may help clarify these

and other patterns.

Trends by Grade

Reading

Grade 6. Mean differences for all subgroups of low performing students did not demonstrate one year's growth.

Grade 7. Mean differences for all subgroups of low performing students demonstrated one year's growth except white males.

Grade 8. Mean differences for all subgroups of low performing students demonstrated one year's growth except females, Black females, Black males, economically disadvantaged students, and regular education students.

Grade 9. Mean differences for all subgroups of low performing students demonstrated one year's growth.

Grade 10. Mean differences for all subgroups of low performing students did not demonstrate one year's growth.










Mathematics

Grade 6. Mean differences for all subgroups of low performing students demonstrated one year's growth.

Grade 7. Mean differences for all subgroups of low performing students did not demonstrate one year's growth except white females and regular education students.

Grade 8. Mean differences for all subgroups of low performing students demonstrated one year's growth.

Grade 9. Mean differences for all subgroups of low performing students demonstrated one year's growth.

Grade 10. Mean differences for all subgroups of low performing students demonstrated one year's growth except white females and non-economically disadvantaged students.

Item Format Effects

Results indicated that constructed response (CR) items may be of benefit to many low performing students in mathematics, but to a lesser extent in reading. All subgroups made one year's growth in mathematics in both grades analyzed that utilize CR items, with the exception of white females and non-economically disadvantaged students in grade 10. Further inspection shows that developmental scale scores for both of these subgroups were higher than those of all other subgroups in their respective analyses in both 2001 and 2002.

Differential results between subgroups within grades where CR items are utilized were most evident in reading. In grade 8, the aggregate demonstrated one year's growth, though not by much. Disaggregation by all other analyses had at least one subgroup and, in most cases, more than half of all students not demonstrating one year's growth.









Females did not make one year's growth, though they demonstrated higher developmental scale scores than males for both years of interest. Black females and males did not make one year's growth and scored lower than white females and males for both years of interest. Economically disadvantaged students did not make one year's growth and scored lower than non-economically disadvantaged students for both years of interest. Regular education students did not make one year's growth, though they demonstrated higher developmental scale scores than students with disabilities. In grade 10, all subgroups of low performing students did not demonstrate one year's growth.

Closing Remarks

Utilizing constructed response items on statewide assessments has been reported

by those in the classroom to a) place a greater emphasis on higher-level thinking and

problem solving (Huebert & Hauser, 1998), b) provide a more transparent link to

everyday curriculum, and c) increase motivation (CCSSO, 2002b). All of these are

aligned with the standards based reform that is at the very root of the accountability

movement (NCTE, 1996; NCTM, 1995). These implications for teaching and learning

are too important to be ignored and lend evidence for consequential validity to the

inclusion of constructed response items on high-stakes tests (AERA, APA, & NCME,

1999).

Lastly, and most importantly, it has also been reported that constructed response

items lend construct validity to interpretations made from test scores (Messick, 1989;

Snow, 1993), though the more contemporary paradigm frames it as evidence for content

validity, inclusive of item format (AERA, APA, & NCME, 1999). Constructed response

items may approach the original definition of the term assessment, from the Latin

assidere, which means "to sit beside." Teacher and student working alongside one








another is what large-scale assessment yearns to be and it certainly is a possibility that

constructed response items best paint that picture.














APPENDIX A
GRADING FLORIDA PUBLIC SCHOOLS




























Scoring High on the FCAT

The Florida Comprehensive Assessment Test (FCAT) is the primary measure of students' achievement of the Sunshine State Standards. Student scores are classified into five achievement levels, with 1 being the lowest and 5 being the highest.

Schools earn one point for each percent of students who score in achievement levels 3, 4, or 5 in reading and one point for each percent of students who score 3, 4, or 5 in math.

The writing exam is scored by at least two readers on a scale of 1 to 6. The percent of students scoring "3" and above is averaged with the percent scoring "3.5" and above to yield the percent meeting minimum and higher standards. Schools earn one point for each percent of students on the combined measure.

Which students are included in school grade calculations? As in previous years, only standard curriculum students who were enrolled in the same school in both October and February are included. Speech impaired, gifted, hospital/homebound, and Limited English Proficient students with more than two years in an ESOL program are also included.

What happens if the lowest 25% of students in the school do not make "adequate progress" in reading? Schools that aspire to be graded "C" or above, but do not make adequate progress with their lowest 25% in reading, must develop a School Improvement Plan component that addresses this need. If a school, otherwise graded "C" or "B", does not demonstrate adequate progress for two years in a row, the final grade will be reduced by one letter grade.

GRADING FLORIDA SCHOOLS 2002-2003

Making Annual Learning Gains

Since FCAT reading and math exams are given in grades 3 through 10, it is now possible to monitor how much students learn from one year to the next. Schools earn one point for each percent of students who make learning gains in reading and one point for each percent of students who make learning gains in math. Students can demonstrate learning gains in any one of three ways:

(1) Improve achievement levels from 1-2, 2-3, 3-4, or 4-5; or
(2) Maintain within the relatively high level of 3, 4, or 5; or
(3) Demonstrate more than one year's growth within achievement levels 1 or 2.

Special attention is given to the reading gains of students in the lowest 25% in levels 1, 2, or 3 in each school. Schools earn one point for each percent of the lowest performing readers who make learning gains from the previous year. It takes at least 50% to make "adequate progress" for this group.

SCHOOL PERFORMANCE GRADING SCALE
















APPENDIX B
ANNUAL AYP OBJECTIVES FOR READING


Starting Point and Annual Objectives for Reading, 2001-02 to 2013-14 (percent of students scoring Proficient, by year; Year 1 = 2001-02 base year).





















APPENDIX C
ANNUAL AYP OBJECTIVES FOR MATHEMATICS


Starting Point and Annual Objectives for Mathematics, 2001-02 to 2013-14 (percent of students scoring Proficient, by year; Year 1 = 2001-02 base year). The annual objective begins at 38% proficient and rises to 100% proficient in 2013-14.
















LIST OF REFERENCES


American Educational Research Association, American Psychological Association, &
National Council on Measurement in Education. (1999). Standards for educational
and psychological testing. Washington, DC: American Educational Research
Association.

Amrein, A. L., & Berliner, D. C. (2002, March 28). High-stakes testing, uncertainty, and
student learning. Education Policy Analysis Archives, 10(18). Retrieved February
11, 2003, from http://epaa.asu.edu/epaa/v10n18/

Ananda, S. (2003). Rethinking Issues of Alignment Under No Child Left Behind. San
Francisco: WestEd.

Barton, P. E., & Coley, R. J. (1994). Testing in America 's schools. Princeton, NJ:
Educational Testing Service, Policy Information Center.

Bennett, R. E., Rock, D. A., & Wang, M. (1991). Equivalence of free-response and
multiple-choice items. Journal of Educational Measurement, 28(1), 77-92.

Brennan, R. L. (1995). Generalizability in performance assessments. Educational
Measurement: Issues and Practice, 14(4), 9-12, 27.

Bridgeman, B. (1992). A comparison of quantitative questions in open-ended and
multiple-choice formats. Journal of Educational Measurement, 29(3), 253-271.

Bridgeman, B., & Rock, D. A. (1993). Relationships among multiple-choice and open-
ended analytical questions. Journal of Educational Measurement, 30(4), 313-329.

Carnoy, M., & Loeb, S. (2002). Does external accountability affect student outcomes? A
cross-state analysis. Educational Evaluation and Policy Analysis, 24(4), 305-331.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.).
Hillsdale, NJ: Lawrence Erlbaum Associates.

Council of Chief State School Officers. (1999). Annual survey of state student
assessment programs: A summary report, Fall 1999. Washington, DC: Author.

Council of Chief State School Officers. (2002a). A guide to effective accountability
reporting. Washington, DC: Author.










Council of Chief State School Officers. (2002b). The role of performance-based
assessments in large-scale accountability systems: Lessons learned from the inside.
Washington, DC: Author.

Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. New
York: Wadsworth.

Cronbach, L. J., & Furby, L. (1970). How we should measure "change": Or should we?
Psychological Bulletin, 74, 68-80.

DeFur, S. H. (2002). Education reform, high-stakes assessment, and students with
disabilities: One state's approach. Remedial and Special Education, 23(4), 203-211.

Dunbar, S. B., Koretz, D. M., & Hoover, H. D. (1991). Quality control in the
development and use of performance assessments. Applied Measurement in
Education, 4(4), 289-303.

Florida Department of Education [FDOE]. (2001). FCAT briefing book. Tallahassee,
FL: Author.

Florida Department of Education [FDOE]. (2002). Technical report: For operational test
administrations of the 2000 Florida Comprehensive Assessment Test. Tallahassee,
FL: Author.

Florida Department of Education [FDOE]. (2003a). Understanding FCAT reports.
Tallahassee, FL: Author.

Florida Department of Education [FDOE]. (2003b). 2003 guide to calculating school
grades: Technical assistance paper. Tallahassee, FL: Author.

Florida Department of Education [FDOE]. (2003c). Consolidated state application
accountability workbook for State grants under Title IX, Part C, Sec. 9302 of the
Elementary and Secondary Education Act (Pub. L. No. 107-110). March 26.

Florida Department of Education [FDOE]. (2003d). Growth of minority student
populations in Florida's public schools. Tallahassee, FL: Author.

Hamilton, L. S., McCaffrey, D. F., Stecher, B. M., Klein, S. P., Robyn, A., & Bugliari, D.
(2003). Studying large-scale reforms of instructional practice: An example from
mathematics and science. Educational Evaluation and Policy Analysis, 20(2), 95-
113.

Huebert, J. P., & Hauser, R. M. (Eds.). (1998). High-stakes testing for tracking,
promotion, and graduation. Washington, DC: National Academy Press.










Klein, S. P., Hamilton, L. S., McCaffrey, D. F., Stecher, B. M., Robyn, A., & Burroughs,
D. (2000). Teaching practices and student achievement: Report of first-year results
from the Mosaic Study of Systemic Initiatives in Mathematics and Science (MR-
1233-EDU). Santa Monica, CA: RAND.

Linn, R. L. (1994). Performance Assessment: Policy promises and technical measurement
standards. Educational Researcher, 23(9), 4-14.

Linn, R. L. (2000). Assessments and Accountability. Educational Researcher, 29(2), 4-
16.

Linn, R. L., Baker, E. L., & Herman, J. L. (2002, Fall). Minimum group size for
measuring adequate yearly progress. The CRESST Line, 1, 4-5. (Newsletter of the
National Center for Research on Evaluation, Standards, and Student Testing
[CRESST], University of California, Los Angeles)

Lord, F. M. (1963). Elementary models for measuring change. In C. W. Harris (Ed.),
Problems in measuring change (pp. 21-38). Madison, WI: University of Wisconsin
Press.

Lukhele, R., Thissen, D., & Wainer, H. (1994). On the relative value of multiple-choice,
constructed-response, and examinee-selected items on two achievement tests.
Journal of Educational Measurement, 31(3), 234-250.

McCaffrey, D. F., Hamilton, L. S., Stecher, B. M., Klein, S. P., Bugliari, D., & Robyn, A.
(2001). Interactions among instructional practices, curriculum, and student
achievement: The case of standards-based high school mathematics. Journal for
Research in Mathematics Education, 32(5), 493-517.

Mehrens, W. A. (1992). Using performance assessment for accountability purposes.
Educational Measurement: Issues and Practice, 11(1), 3-9, 20.

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp.
13-103). New York: Macmillan.

Messick, S. (1993). Trait equivalence as construct validity of score interpretation across
multiple methods of measurement. In R. E. Bennett & W. C. Ward (Eds.),
Construction versus choice in cognitive measurement (pp. 61-74). Mahwah, NJ:
Lawrence Erlbaum.

Messick, S. (1995). Standards of validity and the validity of standards in performance
assessment. Educational M~easurement: Issues and Practice, 14(4), 5-8.

Miller, M. D. (2002). Generalizability of performance-based assessments. Washington,
DC: Council of Chief State School Officers.










National Council of Teachers of English. (1996). Standards for the English language arts:
A joint project of the National Council of Teachers of English and the International
Reading Association. Urbana, IL: Author.

National Council of Teachers of Mathematics. (1995). Assessment standards for school
mathematics. Reston, VA: Author.

No Child Left Behind Act of 2001, Pub. L. No. 107-110, 115 Stat. 1425 (2002).

Paige, R. (2003, June 27). Key policy letters signed by the Education Secretary or Deputy
Secretary. Retrieved February 10, 2004, from
http://www.ed.gov/policy/speced/guid/secletter/030627.html

Pellegrino, J. W., Jones, L. R., & Mitchell, K. J. (1999). Grading the nation's report
card: Evaluating NAEP and transforming the assessment of educational progress.
Washington, DC: National Academy Press.

Rodriguez, M. C. (2003). Construct equivalence of multiple-choice and constructed-
response items: A random effects synthesis of correlations. Journal of Educational
Measurement, 40(2), 163-184.

Ryan, J. M., & DeMark, S. (2002). Variation in achievement scores related to gender,
item format, and content area tested. In J. Tindal & T. Haladyna (Eds.), Large-scale
assessment programs for all students: Validity, technical quality, and
implementation (pp. 67-88). Mahwah, NJ: Lawrence Erlbaum Associates.

Seltzer, M. H., Frank, K. A., & Bryk, A. S. (1994). The metric matters: The sensitivity of
conclusions about growth in student achievement to choice of metric. Educational
Evaluation and Policy Analysis, 16(1), 41-49.

Slinde, J. A., & Linn, R. L. (1977). Vertically equated tests: Fact or phantom? Journal of
Educational Measurement, 14(1), 23-32.

Snow, R. E. (1993). Construct validity and constructed-response tests. In R. E. Bennett &
W. C. Ward (Eds.), Construction versus choice in cognitive measurement: Issues in
constructed response, performance testing, and portfolio assessment (pp. 45-60).
Hillsdale, NJ: Lawrence Erlbaum.

Thissen, D. (1991). Multilog user's guide. Lincolnwood, IL: Scientific Software.

Thissen, D., & Wainer, H. (2001). Test scoring. Hillsdale, NJ: Lawrence Erlbaum.

Thissen, D., Wainer, H., & Wang, X. (1994). Are tests comprising both multiple-choice
and free-response items necessarily less unidimensional than multiple-choice tests?
An analysis of two tests. Journal of Educational Measurement, 31(2), 113-123.









Traub, R. E. (1993). On the equivalence of the traits assessed by multiple-choice and
constructed-response tests. In R. E. Bennett & W. C. Ward (Eds.), Construction
versus choice in cognitive measurement: Issues in constructed response,
performance testing, and portfolio assessment (pp. 45-60). Hillsdale, NJ: Lawrence
Erlbaum.

Wainer, H., & Thissen, D. (1993). Combining multiple-choice and constructed-response
test scores: Towards a Marxist theory of test construction. Applied Measurement in
Education, 6(2), 103-118.

Yen, W. M., & Ferrara, S. (1997). The Maryland school assessment program:
Performance assessment with psychometric quality suitable for high stakes usage.
Educational and Psychological Measurement, 57(1), 60-84.
















BIOGRAPHICAL SKETCH

Stewart Francis Elizondo was born in San Jose, Costa Rica. At six years of age he

moved to Old Bridge, New Jersey. At eighteen years of age, he moved to Ocala, Florida,

where he attended Lake Weir High School. Before graduating, he had the honor of

receiving a nomination to the United States Military Academy from Kenneth Hood "Buddy" MacKay Jr., former Governor of Florida. After pursuing the honor to its end, he

worked his way through Central Florida Community College in Ocala with assistance

from a scholarship from the Silver Springs Shores Lion's Club. After two years, he was

fortunate enough to be awarded a Critical Teacher Scholarship in mathematics from the

Florida Department of Education (FDOE). He transferred to the University of South

Florida in Tampa and graduated with a Bachelor of Arts in mathematics and education.

Returning to Ocala, Stewart then taught for ten years in the public school system.

He is proud to have taught in a state recognized, high-performing "A" school. His

service includes mentoring beginning teachers in the county's Peer Orientation Program,

sitting on county textbook adoption committees, and facilitating school and district-wide

workshops. His students also twice nominated him for Teacher of the Year.

Concurrent with his teaching, he served on the Florida Mathematics Curriculum

Frameworks and Sunshine State Standards (SSS) Writing Team. The FDOE approved

the SSS as Florida's academic standards and the Florida Comprehensive Assessment Test

(FCAT) as its assessment tool. These changes were also incorporated into Governor Jeb










Bush's "A+ Education Plan," which was approved by the state legislature by amending

Section 229.57, F. S.

While still teaching, he worked with the FDOE and its Test Development Center

(TDC) on several statewide assessment endeavors. They include collaborations with

Harcourt Educational Measurement in development of the FCAT, specifically grade nine

and ten mathematics item reviews. He later worked with the TDC and NCS Pearson in

the scoring of FCAT items, specifically in developing and adjusting training guidelines

for the scoring of grade ten mathematics performance tasks using student responses from

field-tested and operational items. Letters expressing appreciation from Florida

Commissioner of Education James Wallace Horne, as well as former Commissioners

Charles J. Crist Jr., Frank T. Brogan, and the late Douglas L. Jamerson, mark the timeline

of service to these state projects.

These activities rekindled a long-standing passion for the assessment field and led him to pursue various roles in the field, and ultimately an advanced degree. Of note, he

has been an item writer for Harcourt Educational Measurement and has written

mathematics items for the Massachusetts Comprehensive Assessment System and the

Connecticut Academic Performance Test. He has also served as both proctor and

administrator for the Florida Teacher Certification Examination and the College-Level

Academic Skills Test.

Stewart currently is an Alumni Fellow and graduate teaching assistant at the

University of Florida. He is grateful to still have the opportunity to be in the classroom

as he has taught undergraduate courses in elementary mathematics methods and








graduate courses in measurement and assessment while continuing to pursue a doctorate

in research and evaluation methodology.

His greatest joy always has been and still remains spending time with his sons,

Spencer and Seth, and his wife, Stephanie.