Citation
An Investigation of the Relationship between Performance on Curriculum-Based Measures of Oral Reading Fluency and High-Stakes Tests of Reading Achievement

Material Information

Title:
An Investigation of the Relationship between Performance on Curriculum-Based Measures of Oral Reading Fluency and High-Stakes Tests of Reading Achievement
Creator:
Grapin, Sally L
Place of Publication:
[Gainesville, Fla.]
Florida
Publisher:
University of Florida
Publication Date:
2014
Language:
English
Physical Description:
1 online resource (105 p.)

Thesis/Dissertation Information

Degree:
Doctorate (Ph.D.)
Degree Grantor:
University of Florida
Degree Disciplines:
School Psychology
Special Education, School Psychology and Early Childhood Studies
Committee Chair:
Kranzler, John H.
Committee Co-Chair:
Waldron, Nancy L.
Committee Members:
Beaulieu, Diana Joyce
Algina, James J.
Graduation Date:
8/9/2014

Subjects

Subjects / Keywords:
Achievement tests (jstor)
At risk students (jstor)
Emergent literacy (jstor)
Gold standard (jstor)
Open reading frames (jstor)
Reading fluency (jstor)
Schools (jstor)
Spring (jstor)
Standardized tests (jstor)
Students (jstor)
Special Education, School Psychology and Early Childhood Studies -- Dissertations, Academic -- UF
at-risk -- dibels -- mtss -- orf -- reading -- rti
Genre:
bibliography (marcgt)
theses (marcgt)
government publication (state, provincial, territorial, dependent) (marcgt)
born-digital (sobekcm)
Electronic Thesis or Dissertation
School Psychology thesis, Ph.D.

Notes

Abstract:
The purpose of this study was to compare the diagnostic utility of locally developed and publisher-recommended cut scores for the Dynamic Indicators of Basic Early Literacy Skills Oral Reading Fluency measure (DORF) in identifying students who were either at risk or not at risk for poor performance on two high-stakes tests of reading achievement. This study also investigated whether locally developed and publisher-recommended cut scores retained their classification accuracy when applied to subsequent student cohorts over time. Participants were 266 students at a university-affiliated research school. All participants completed the DORF, 6th Edition during the fall, winter, and spring of second grade as well as the Florida Comprehensive Assessment Test (FCAT) and the Stanford Achievement Test, 10th Edition (SAT-10) during the spring of third grade. Participants were then divided into three subsamples: those who completed second grade between the years 2004 and 2008 (S1; n = 170); those who completed second grade during the 2008-2009 school year (S2; n = 46); and those who completed second grade during the 2009-2010 school year (S3; n = 50). Using data from S1, local DORF cut scores for predicting FCAT and SAT-10 performance were developed for each of the fall, winter, and spring assessment periods via three statistical methods: discriminant analysis (DA), logistic regression (LR), and receiver operating characteristic (ROC) curve analysis. Both the local and publisher-recommended cut scores were subsequently applied to all subsamples, and estimates of sensitivity, specificity, positive predictive power (PPP), negative predictive power (NPP), and overall correct classification (OCC) were computed. Of the three statistical methods, only two (DA and ROC curve analysis) produced cut scores that maintained adequate levels of sensitivity (at least .70) across cohorts in predicting FCAT and SAT-10 performance. These cut scores consistently had higher levels of sensitivity than the publisher-recommended and LR cut scores; however, they had lower levels of specificity and OCC. Across all sets of cut scores, levels of PPP were relatively low, while levels of NPP were higher. Overall, these results suggest that locally developed cut scores may be a promising alternative to publisher-recommended cut scores in identifying students who are at risk for long-term reading problems. (en)
General Note:
In the series University of Florida Digital Collections.
General Note:
Includes vita.
Bibliography:
Includes bibliographical references.
Source of Description:
Description based on online resource; title from PDF title page.
Source of Description:
This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Thesis:
Thesis (Ph.D.)--University of Florida, 2014.
Local:
Adviser: Kranzler, John H.
Local:
Co-adviser: Waldron, Nancy L.
Electronic Access:
RESTRICTED TO UF STUDENTS, STAFF, FACULTY, AND ON-CAMPUS USE UNTIL 2016-08-31
Statement of Responsibility:
by Sally L Grapin.

Record Information

Source Institution:
UFRGP
Rights Management:
Applicable rights reserved.
Embargo Date:
8/31/2016
Resource Identifier:
969976915 ( OCLC )
Classification:
LD1780 2014 ( lcc )

Full Text

AN INVESTIGATION OF THE RELATIONSHIP BETWEEN PERFORMANCE ON CURRICULUM-BASED MEASURES OF ORAL READING FLUENCY AND HIGH-STAKES TESTS OF READING ACHIEVEMENT

By

SALLY LAUREN GRAPIN

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2014

© 2014 Sally Lauren Grapin

To Scott, Larry, and Julie Grapin, the three people I admire most and whom I could never do without.

ACKNOWLEDGMENTS

I would like to acknowledge several important people for their continued support throughout my time in graduate school. First and foremost, I would like to thank my graduate advisor, Dr. John Kranzler, for spending countless hours discussing current research with me, helping me refine my academic writing skills, and teaching me about the research and publication process. His guidance and encouragement have been essential to the development of my professional skills, interests, and goals, and I am grateful for his extraordinary mentorship.

I would also like to express my gratitude to my wonderful doctoral committee as well as several other faculty members who made this research possible. In particular, I am grateful for the support of Dr. Nancy Waldron, who has played a pivotal role in helping me to identify and pursue my research interests. Her moral support and guidance on this project and others have been invaluable to my professional development. I would also like to thank Dr. Diana Joyce for being such a dedicated and encouraging practicum supervisor and instructor. Over the past four years, my skills in both research and practice have grown immensely under her supervision. I would like to thank Dr. James Algina for always being so generous with his time and for helping me to develop my research methods, execute data analysis, and interpret my findings. I have been fortunate to have him as both a doctoral committee member and course instructor during the past two years, and my understanding of research methodology has expanded significantly under his tutelage. Moreover, I am grateful for the assistance of those faculty members in the research setting who helped me to obtain the necessary data.

Additionally, I would like to thank my loving friends and family. I would like to thank my mother, Julie Grapin, for being the steadfast source of moral support I needed to complete this project and for modeling the type of professional I hope to become someday. I would also like to thank my father, Larry Grapin, for his boundless love and encouragement and for his unwavering enthusiasm for my professional endeavors.

I would like to thank my brother, Scott Grapin, for his endless love and support and for reminding me what it means to be passionate about a career in education. Furthermore, I am grateful to my incredible graduate school cohort (Angell Callahan, Alyson Celauro, Maggie Clark, Akiko Goen, Nicole Jean Paul, Michelle Judkins, Gillian Lipari, April Ponder, Jeanette Rodriguez, and Yulia Tamayo) for all of the wonderful times and laughs we shared. In particular, I would like to thank Jill Pineda and Jacqueline Maye for their invaluable companionship during the most challenging and exciting times of graduate school. Finally, I would like to thank the Post family for all of their encouragement. Specifically, I would like to thank Dr. Shawn Post for inspiring me to pursue this area of study and for always being available to discuss professional issues with me. I would also like to thank Kate and Andrew Post, Jennifer Araneda Post, and Harry Klauber for all of their love and support.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF ABBREVIATIONS
ABSTRACT

CHAPTER

1 INTRODUCTION
    Universal Screening in Multi-tiered Systems of Support
    Using Screening Measures to Predict Student Outcomes
    Screening Instruments: Using Curriculum-Based Measures of Oral Reading Fluency
    Gold Standard Measures: Examining Reading Comprehension Outcomes
    Relationship between Measures of Oral Reading Fluency and Reading Comprehension
    Classification Agreement Analyses: Using Cut Scores to Identify At-Risk Students
    Measuring Decision-Making Accuracy: Classification Agreement Analyses
    Using Research-Based or Publisher-Recommended Cut Scores to Identify At-Risk Students
    Developing Alternative Cut Scores for Oral Reading Fluency Measures
    Comparing Methods of Developing ORF Cut Scores
    Limitations to Prior Research on Developing ORF Cut Scores
    Aims of the Present Study

2 METHODS
    Research Setting and Participants
    Data Collection
    Measures
        Dynamic Indicators of Basic Early Literacy Skills (DIBELS), 6th Edition
        Florida Comprehensive Assessment Test (FCAT)
        Stanford Achievement Test, 10th Edition (SAT-10)
    Procedure
        Assessment Administration
        Data Analysis
        Development of DORF Cut Scores
            Discriminant analysis (DA)
            Logistic regression (LR)
            Receiver operating characteristic (ROC) curve analysis
            Publisher-recommended cut scores
        Evaluating the Diagnostic Accuracy of Cut Scores
            Cross-validation
        Comparing the Diagnostic Accuracy of Cut Scores

3 RESULTS
    Evaluating the Long-term Diagnostic Accuracy of DORF Cut Scores: Results of Classification Agreement Analyses
        Research Question #1
        Research Question #2
        Research Question #3

4 DISCUSSION
    Limitations
    Implications for Schools
    Directions for Future Research

REFERENCES

BIOGRAPHICAL SKETCH

LIST OF TABLES

1-1  Summary of Classification Decisions Using Screening and Gold Standard Measures
3-1  Percentages of Students in Various Demographic Categories by Sample
3-2  Publisher-recommended and Locally Developed Cut Scores for the Dynamic Indicators of Basic Early Literacy Skills (DIBELS) Oral Reading Fluency Measure
3-3  Descriptive Statistics for DORF, FCAT, and SAT-10 Performance Across Samples
3-4  Pass and Fail Rates for the FCAT and SAT-10 Across Subsamples
3-5  Pearson Correlations among DORF, FCAT, and SAT-10 Measures
3-6  Fall Publisher-recommended and Local DORF Cut Scores for Predicting FCAT Performance: Results of Classification Agreement Analyses
3-7  Winter Publisher-recommended and Local DORF Cut Scores for Predicting FCAT Performance: Results of Classification Agreement Analyses
3-8  Spring Publisher-recommended and Local DORF Cut Scores for Predicting FCAT Performance: Results of Classification Agreement Analyses
3-9  Fall Publisher-recommended and Local DORF Cut Scores for Predicting SAT-10 Performance: Results of Classification Agreement Analyses
3-10 Winter Publisher-recommended and Local DORF Cut Scores for Predicting SAT-10 Performance: Results of Classification Agreement Analyses
3-11 Spring Publisher-recommended and Local DORF Cut Scores for Predicting SAT-10 Performance: Results of Classification Agreement Analyses

LIST OF ABBREVIATIONS

CAI     Classification agreement indicator
CBM     Curriculum-based measure
DA      Discriminant analysis
DIBELS  Dynamic Indicators of Basic Early Literacy Skills
DORF    Dynamic Indicators of Basic Early Literacy Skills Oral Reading Fluency measure
FCAT    Florida Comprehensive Assessment Test
FLDOE   Florida Department of Education
LR      Logistic regression
MTSS    Multi-tiered Systems of Support
NPP     Negative predictive power
OCC     Overall correct classification
ORF     Oral reading fluency
PPP     Positive predictive power
R-CBM   Curriculum-based measure of oral reading fluency
ROC curve analysis   Receiver operating characteristic curve analysis
SAT-10  Stanford Achievement Test, 10th Edition
WCPM    Words correct per minute

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

AN INVESTIGATION OF THE RELATIONSHIP BETWEEN PERFORMANCE ON CURRICULUM-BASED MEASURES OF ORAL READING FLUENCY AND HIGH-STAKES TESTS OF READING ACHIEVEMENT

By

Sally Lauren Grapin

August 2014

Chair: John Kranzler
Major: School Psychology

The purpose of this study was to compare the diagnostic utility of locally developed and publisher-recommended cut scores for the Dynamic Indicators of Basic Early Literacy Skills Oral Reading Fluency measure (DORF) in identifying students who were either at risk or not at risk for poor performance on two high-stakes tests of reading achievement. This study also investigated whether locally developed and publisher-recommended cut scores retained their classification accuracy when applied to subsequent student cohorts over time. Participants were 266 students at a university-affiliated research school. All participants completed the DORF, 6th Edition during the fall, winter, and spring of second grade as well as the Florida Comprehensive Assessment Test (FCAT) and the Stanford Achievement Test, 10th Edition (SAT-10) during the spring of third grade. Participants were then divided into three subsamples: those who completed second grade between the years 2004 and 2008 (S1; n = 170); those who completed second grade during the 2008-2009 school year (S2; n = 46); and those who completed second grade during the 2009-2010 school year (S3; n = 50). Using data from S1, local DORF cut scores for predicting FCAT and SAT-10 performance were developed for each of the fall, winter, and spring assessment periods via three statistical methods: discriminant analysis (DA), logistic regression (LR), and receiver operating characteristic (ROC) curve analysis.

Both the local and publisher-recommended cut scores were subsequently applied to all subsamples, and estimates of sensitivity, specificity, positive predictive power (PPP), negative predictive power (NPP), and overall correct classification (OCC) were computed. Of the three statistical methods, only two (DA and ROC curve analysis) produced cut scores that maintained adequate levels of sensitivity (at least .70) across cohorts in predicting FCAT and SAT-10 performance. These cut scores consistently had higher levels of sensitivity than the publisher-recommended and LR cut scores; however, they had lower levels of specificity and OCC. Across all sets of cut scores, levels of PPP were relatively low, while levels of NPP were higher. Overall, these results suggest that locally developed cut scores may be a promising alternative to publisher-recommended cut scores in identifying students who are at risk for long-term reading problems.

CHAPTER 1
INTRODUCTION

Universal Screening in Multi-tiered Systems of Support

Early intervention is critical to supporting students who are at risk for long-term reading problems. A considerable body of research has indicated that early interventions can lead to marked improvements in foundational skills (e.g., phonemic awareness and decoding) for struggling readers (e.g., Cavanaugh, Kim, Wanzek, & Vaughn, 2004). Moreover, early intervention may prevent the onset of long-term reading difficulties as well as lead to reductions in the number of students who are identified as having a learning disability (Vellutino, Scanlon, Small, & Fanuele, 2006). Conversely, failure to provide interventions early on to struggling readers may result in profound skill deficits that are especially difficult to remediate. For example, delays in the development of early reading skills may affect vocabulary growth as well (Palincsar & Purcell, 1986; Cunningham & Stanovich, 1997). Consequently, students who remain poor readers throughout the first few years of elementary school often find it difficult to attain proficiency in higher-level skill areas such as comprehension (Simmons, Kuykendall, & Kame'enui, 2000; Torgesen, Rashotte, & Alexander, 2001). Taken together, these findings indicate the importance of providing high-quality reading instruction and intervention to students who experience difficulties in reading, as early as possible.

The ability to provide effective early intervention services to those who demonstrate need is contingent on the use of efficient and accurate procedures for identifying students who are at risk for reading problems (Compton et al., 2010). To ensure swift and early identification of these students, a number of schools nationwide have implemented Multi-tiered Systems of Support (MTSS; also referred to as Response to Intervention models; Zirkel & Thomas, 2010).

Broadly defined, MTSS are models of service delivery that are designed to provide high-quality, research-based instruction to all students through a multi-tiered model of prevention and intervention. While numerous definitions have been proposed to describe the structure and implementation of MTSS, several integral features invariably constitute their foundation. These features include a preventive orientation, a focus on outcomes for all students, a multi-tiered instructional framework, evidence-based assessment and intervention practices, an emphasis on both formative and summative assessments, and a strong, functional link between assessment and intervention (Fuchs & Fuchs, 2006; Gresham, 2007).

In MTSS, universal screenings are a primary means of identifying students who are at risk for reading difficulties (Compton et al., 2010). Universal screenings consist of brief measures that focus on target skills and are designed to predict later academic outcomes. In most MTSS, screening measures are usually administered to all students three or four times per year. Students who fail to meet performance criteria on these measures are subsequently identified as potential candidates for further assessment and intervention.

Currently, two primary approaches to universal screenings are represented in the MTSS literature: the direct route and the progress monitoring route. In the direct route approach, students who are identified as at risk through the universal screening process are provided immediate access to secondary interventions. In contrast, in the progress monitoring approach, students who are identified as at risk via universal screening procedures first undergo several weeks of progress monitoring. Decisions regarding whether these students will receive secondary interventions are based on both their rate of growth and overall level of performance during the progress monitoring period. Several studies have indicated that progress monitoring routes may increase classification accuracy in identifying at-risk students (Compton, Fuchs, Fuchs, & Bryant, 2006; Compton et al., 2010); however, they require significantly more time and resources than do direct route approaches.

Efficiency is an important property of screening instruments, which should facilitate quick and accurate identification of students who are experiencing academic difficulties. As a result, ongoing research on universal screening procedures continues to focus on improving the accuracy and feasibility of direct route screening processes (e.g., Johnson, Jenkins, & Petscher, 2010).

Using Screening Measures to Predict Student Outcomes

Effective screening measures are highly predictive of future academic outcomes for students. The term "gold standard" is used in the classification analysis and medical literature to describe a criterion or outcome measure that "represents the best possible way to know if a condition is truly present" (VanDerHeyden, 2010a, p. 282). In schools, the "condition" of interest is often an undesirable academic outcome, such as failure to attain reading proficiency (VanDerHeyden, 2011). Gold standard measures typically are longer, time-intensive assessments that indicate whether a child has acquired basic and more advanced reading skills. In the field of education, an essential feature of gold standard measures is that they are linked to meaningful outcomes for students (VanDerHeyden, 2010a). Often, schools select state-developed tests of achievement (e.g., Florida Comprehensive Assessment Test) as their gold standard measures because these tests are used to make high-stakes decisions about individual students and school systems.

Often, gold standard measures, especially state-developed tests of reading achievement, are not administered until the end of the school year. However, in order to ensure that all students are on track to meet end-of-year reading goals, school personnel need additional, ongoing information about students' progress throughout the school year. Such information can be obtained through periodic universal screenings.

As noted above, these screeners should be predictive of future academic outcomes (e.g., performance on high-stakes achievement tests) and should allow personnel to identify and provide early intervention services to students who are at risk for failing to meet end-of-year reading goals.

Screening Instruments: Using Curriculum-Based Measures of Oral Reading Fluency

School personnel can use a variety of measures to screen students for academic problems. One type of measure that is administered widely by schools is the curriculum-based measure (CBM). CBMs are brief, standardized assessments of basic academic skills that are used to monitor students' academic progress. Information about students' progress is obtained through multiple performance samplings that are conducted periodically throughout the school year. In addition to screening students for academic difficulties and predicting performance on high-stakes tests, CBMs may also be used to enhance instructional decision making and to inform intervention planning for individual students (Deno, 2003). These measures are especially appropriate for universal screenings because they are time efficient and can be administered easily by a wide array of school personnel.

In the area of reading specifically, researchers have given considerable attention to CBMs that assess oral reading fluency skills (R-CBMs). Oral reading fluency (ORF) refers to the oral translation of text with speed and accuracy and involves the integration of multiple foundational skills, such as phonological awareness, decoding, and rapid word recognition (Fuchs, Fuchs, Hosp, & Jenkins, 2001). R-CBMs have been found to be a strong indicator of overall reading competence in elementary school, especially for students in the early grades (Shinn, Good, Knutson, Tilly, & Collins, 1992). Moreover, because students typically exhibit the greatest growth in this skill area during the primary grades, periodic administration of R-CBMs can be used to detect incremental changes in reading skills for younger students (Fuchs, Fuchs, Hamlett, Walz, & Germann, 1993).

Traditionally, R-CBMs assess rate and accuracy of reading by prompting students to read aloud as many words as possible from grade-level passages within one minute (Deno, 1985). Consistent with the defining characteristics of CBMs, these measures are brief, standardized, and can be administered at multiple points during the school year to assess reading. One of the most widely used R-CBMs is included in the Dynamic Indicators of Basic Early Literacy Skills (DIBELS; Good & Kaminski, 2002a). This measure (often referred to as the DORF) is widely used in elementary schools throughout the country and has generally been found to have strong psychometric properties (e.g., high alternate-form reliability and test-retest reliability; e.g., Goffreda & DiPerna, 2010).

Gold Standard Measures: Examining Reading Comprehension Outcomes

As described above, gold standard measures in reading should be linked to meaningful academic outcomes. More specifically, they should indicate whether students are able to demonstrate critical literacy skills in multiple, relevant contexts and to comprehend, apply, and draw conclusions when engaging with text. For this reason, many end-of-year statewide reading tests assess higher-order skills such as reading comprehension. Reading comprehension refers to the process of constructing meaning from written text. The development of comprehension skills is critical to students' learning because it allows them to use reading as a vehicle for further learning; in other words, students are able to make the transition from learning to read to reading to learn.

Since the passage of the No Child Left Behind Act (NCLB; 2002), all states are required to administer annual statewide assessments of reading achievement.

Typically, these tests are state-developed and based on state standards for K-12 curricula (Shapiro, Keller, Lutz, Santoro, & Hintze, 2006). In the primary grades, tests typically consist of short reading passages and related multiple-choice questions that assess skills in vocabulary and comprehension (Guthrie, 2002). Additionally, school personnel may also administer nationally normed tests of reading achievement, such as the Stanford Achievement Test (Harcourt Brace, 2003a) and the TerraNova (CTB/McGraw-Hill, 2008), to measure student outcomes. Usually, they are used to make important decisions about individual students and school systems (Guthrie, 2002). Failure to meet grade-level standards for proficiency on these assessments can have profound consequences for students, including retention. Thus, it is critical that schools identify and provide appropriate intervention services early on to students who are at risk for poor performance on these tests.

Relationship between Measures of Oral Reading Fluency and Reading Comprehension

A considerable body of literature has provided support for a strong empirical and theoretical relationship between ORF and reading comprehension skills in elementary school-aged children. Ultimately, the successful execution of higher-order reading processes, such as comprehension, is contingent on the development of lower-level skills, such as decoding, word recognition, and fluency (Fuchs et al., 2001; LaBerge & Samuels, 1974). As the automaticity and accuracy of word recognition processes increase, readers are able to devote more of their attentional and cognitive resources to comprehending text. Empirical investigations of the relationship between reading fluency and comprehension have indicated that the former may be an especially robust index of the latter (e.g., Shinn et al., 1992), although the relationship between these skills may change as students progress through elementary school. Specifically, the relationship between ORF and comprehension may be strongest during the primary grades but become weaker over time as students are expected to engage in increasingly complex literary activities (Fuchs et al., 2001; Silberglitt, Burns, Madyun, & Lail, 2006).

Because there is a robust relationship between reading fluency and comprehension in the early grades, moderate to strong correlations have been observed between R-CBMs and end-of-year statewide assessments (which, as noted above, often are primarily tests of reading comprehension; e.g., Wood, 2006). Recent meta-analytic research investigating the relationship between R-CBMs and various statewide tests indicates that there is a moderately strong positive correlation (r = .68) between performance on these two types of measures (Yeo, 2010). Over the past decade, observed correlations between R-CBMs and state-developed reading tests typically have ranged from .60 to .90, with higher correlations observed between measures administered more closely in time (Wood, 2006). Research also indicates that, of the various types of reading CBMs typically used for universal screenings in the primary grades (e.g., nonsense word fluency and letter naming fluency measures), R-CBMs consistently have the strongest predictive and concurrent validity with statewide reading tests (Goffreda & DiPerna, 2010). Given these findings, periodic administration of R-CBMs is a promising approach to identifying students who are at risk for poor performance on high-stakes, end-of-year achievement measures.

Classification Agreement Analyses: Using Cut Scores to Identify At-Risk Students

One of the primary purposes of conducting universal screenings is to differentiate those students who need additional interventions from those who do not. As such, screenings can be used to classify students into one of two groups: test positives, or those who are at risk for poor performance on the gold standard measure, and test negatives, or those who are not at risk. This can be accomplished by specifying a cut score on screening instruments, whereby students who fall below this designated score are deemed to be at risk and students who perform at or above this score are deemed not at risk.

For measures of ORF, the cut score is defined as the critical number of words students must read correctly in one minute to demonstrate that they are on track to meeting end-of-year grade-level standards in reading (Silberglitt & Hintze, 2005). Similarly, on state-developed achievement tests (i.e., gold standard measures), cut scores also are used to distinguish proficient from non-proficient readers. A number of concerns have been raised about the use of cut scores to classify student performance on CBMs and statewide achievement tests, such as their unnatural dichotomization of continuous score distributions (e.g., VanDerHeyden, 2010b). Nonetheless, one advantage of using cut scores is that they allow for the establishment of concrete, standardized decision rules that promote consistency in determining which students should receive intervention services.

Collectively, all possible permutations of outcomes on the screening and gold standard measures yield four types of classification decisions. These classification decisions (displayed in Table 1-1) are as follows: (1) true positives (i.e., students who are identified as at risk on the screening measure and who, as predicted, do not meet the minimum performance criterion on the gold standard test); (2) true negatives (i.e., students who are identified as not at risk on the screener and who subsequently meet the performance criterion on the gold standard test); (3) false positives (i.e., students who are identified as at risk on the screener but who ultimately do meet performance standards on the gold standard test); and (4) false negatives (i.e., students who are identified as not at risk but who subsequently do not meet the performance criterion on the gold standard test). The first two decision categories represent correct classification decisions, while the latter two categories represent incorrect classification decisions.

When using screening measures to predict performance on the gold standard test, schools must give careful consideration to the number of correct and incorrect classification decisions that result from this process, as incorrect classification decisions can be costly for both systems and individual students.

For example, a high incidence of false positives may result in the provision of intervention services to a large number of students who do not need these supports, which may impose a considerable strain on school resources. Moreover, if too many students are identified as needing secondary supports, student-teacher ratios in small-group interventions may become too large, thereby affecting the quality and outcomes of these programs (Johnson et al., 2010; Scammacca, Vaughn, Roberts, Wanzek, & Torgesen, 2007). Conversely, a high incidence of false negatives may result in failure to provide intervention services to a significant number of students who truly demonstrate need. Ultimately, high numbers of false negative decisions may be even more problematic than high numbers of false positive decisions, given that failure to provide early interventions to struggling students can have a profound negative impact on their immediate and long-term academic outcomes (Johnson et al., 2010; Torgesen et al., 2001).

Measuring Decision-Making Accuracy: Classification Agreement Analyses

Schools can use several metrics, or classification agreement indicators (CAIs), to evaluate whether screening procedures result in acceptable numbers of correct classification decisions. Classification agreement analyses quantify the degree to which a decision based on one measure (e.g., classification of a student as at risk based on the results of a screener) corresponds to an outcome on a second measure (e.g., failure to meet performance standards on the gold standard measure; VanDerHeyden, 2010a). Common CAIs include overall correct classification, sensitivity, and specificity. Overall correct classification (OCC) refers to the percentage of students who are correctly classified as "at risk" and "not at risk" based on the results of the screener. While this metric reflects the general classification accuracy of a screening measure, it does not indicate the value of a positive or negative screening result for decision making (VanDerHeyden, 2010a). Two additional CAIs, however, can provide this information.

Sensitivity describes the power of a screener to detect true positives and is calculated by applying the following formula:

Sensitivity = TP / (TP + FN)  (1-1)

where TP is the number of true positives and FN is the number of false negatives. In this equation, the number of cases that test positive on both the screener and gold standard measures is divided by the total number of cases that test positive on the gold standard (VanDerHeyden, 2010a). Sensitivity may be an especially important metric for evaluating the decision-making utility of a screener because it indicates the percentage of truly at-risk students that are actually identified in practice.

Conversely, specificity describes the power of a screener to detect true negatives and is calculated by using the formula below:

Specificity = TN / (TN + FP)  (1-2)

where TN is the number of true negatives and FP is the number of false positives. In this formula, the number of cases that test negative on both the screener and gold standard is divided by the total number of cases that test negative on the gold standard (VanDerHeyden, 2010a). Sensitivity and specificity are considered to be static properties of a screening measure and do not vary as base rates of the target condition change (VanDerHeyden, 2011). Thus, values of sensitivity and specificity can be compared across screening instruments as well as across settings, as long as designated cut scores and the gold standard measure remain the same.
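To make these definitions concrete, the short Python sketch below applies a single hypothetical cut score to simulated screening data, tallies the four decision categories summarized in Table 1-1, and computes sensitivity (Equation 1-1) and specificity (Equation 1-2). The scores, outcomes, and the 80-WCPM threshold are illustrative assumptions only; they are not data or benchmarks from this study.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: words correct per minute (WCPM) on a winter DORF probe and
# pass/fail outcomes on an end-of-year state reading test (the gold standard).
wcpm = rng.normal(loc=90, scale=25, size=200).round()
passed_state_test = (wcpm + rng.normal(0, 20, size=200)) > 70

CUT_SCORE = 80  # illustrative at-risk threshold, not a published benchmark

screen_positive = wcpm < CUT_SCORE   # flagged as at risk by the screener
truly_at_risk = ~passed_state_test   # failed the gold standard measure

# The four decision categories from Table 1-1.
tp = np.sum(screen_positive & truly_at_risk)     # true positives
fp = np.sum(screen_positive & ~truly_at_risk)    # false positives
tn = np.sum(~screen_positive & ~truly_at_risk)   # true negatives
fn = np.sum(~screen_positive & truly_at_risk)    # false negatives

sensitivity = tp / (tp + fn)   # Equation 1-1
specificity = tn / (tn + fp)   # Equation 1-2
print(f"sensitivity = {sensitivity:.2f}, specificity = {specificity:.2f}")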

Sensitivity and specificity offer valuable information for selecting screening instruments and decision rules for a particular population or setting; however, they do not inform decision making at the individual student level. For example, sensitivity indicates the percentage of true positives that are correctly identified by the screener but does not indicate the likelihood that an individual who receives a positive screening result is truly at risk. To address this issue, estimates of positive and negative predictive power can provide information about a screener's utility for making decisions about individual test results. Positive predictive power (PPP) refers to the probability that a positive test finding is truly positive and is calculated by applying the following formula:

PPP = TP / (TP + FP)  (1-3)

where PPP is positive predictive power, TP is the number of true positives, and FP is the number of false positives. As shown above, the number of cases that test positive on both the screener and gold standard is divided by the total number of cases that test positive on the screener. Conversely, negative predictive power (NPP) refers to the probability that a negative test finding is truly negative. Its calculation is displayed below:

NPP = TN / (TN + FN)  (1-4)

where NPP is negative predictive power, TN is the number of true negatives, and FN is the number of false negatives. In this formula, the number of cases that test negative on both the screener and outcome measure is divided by the total number of individuals who test negative on the screener.
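The remaining indicators follow directly from the same four counts. The sketch below wraps all five CAIs in one helper function and applies it to two sets of invented counts in which sensitivity and specificity are identical but the base rate of reading problems differs; as discussed next, PPP and NPP shift with prevalence even though the screener itself has not changed. All counts here are assumptions chosen for illustration.

def classification_agreement(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute the classification agreement indicators described in the text."""
    return {
        "sensitivity": tp / (tp + fn),           # Equation 1-1
        "specificity": tn / (tn + fp),           # Equation 1-2
        "ppp": tp / (tp + fp),                   # Equation 1-3
        "npp": tn / (tn + fn),                   # Equation 1-4
        "occ": (tp + tn) / (tp + fp + tn + fn),  # overall correct classification
    }

# Two hypothetical schools: a low-prevalence setting (10% of students truly at
# risk) and a high-prevalence setting (50%). Sensitivity and specificity are .80
# in both, but PPP and NPP differ markedly.
print(classification_agreement(tp=16, fp=36, tn=144, fn=4))
print(classification_agreement(tp=80, fp=20, tn=80, fn=20))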

It should be noted that, unlike sensitivity and specificity, estimates of PPP and NPP are not stable across settings in which the prevalence of the condition of interest varies. Therefore, these metrics can be useful to clinicians when base rates are known for a particular population; however, they are not comparable across studies (VanDerHeyden, 2010a). Collectively, estimates of OCC, sensitivity, specificity, PPP, and NPP can provide valuable information for selecting screening instruments and decision rules that accurately identify students who are at risk for reading problems.

Using Research-Based or Publisher-Recommended Cut Scores to Identify At-Risk Students

A number of studies have investigated the decision-making utility of universal ORF cut scores for identifying students who are at risk for poor reading outcomes (e.g., Crawford, Tindal, & Steiber, 2001; McGlinchey & Hixson, 2004; Schilling, Carlisle, Scott, & Zeng, 2007). Generally, these cut scores are developed from large national samples, are intended for universal use by a diverse range of schools, and are designed to serve as benchmarks against which educators can make judgments about students' reading proficiency (Hintze & Silberglitt, 2005). More specifically, they are intended to indicate where students' skills should fall at regular intervals during the school year (typically the beginning, middle, and end of the year). Students who do not meet these benchmarks throughout the year may be at risk for performing below grade-level standards on end-of-year high-stakes achievement measures and, ultimately, for long-term reading problems.

Several researchers have attempted to develop universal ORF benchmarks that can be applied by a range of schools nationwide. For example, Hasbrouck and Tindal (1992) used standardized CBM procedures to collect ORF data for approximately 8,000 students who attended school in one of eight geographically and culturally diverse districts. They established norms for the number of words read correctly per minute (WCPM) during three assessment periods (i.e., fall, winter, and spring) for students in grades two through five.

For each grade level and assessment period, WCPM scores were specified for students scoring at the 25th, 50th, and 75th percentiles. These norms have been cited widely in the empirical literature investigating R-CBMs (e.g., Compton, Appleton, & Hosp, 2004; Crawford et al., 2001; Wayman, Wallace, Wiley, Ticha, & Espin, 2007).

Several studies have examined the utility of Hasbrouck and Tindal's (1992) ORF norms for identifying students who are at risk for poor performance on state-developed measures of reading achievement. For example, Crawford and colleagues (2001) administered modified ORF passages from the Houghton Mifflin Basal Reading Series to 51 students during the winter of their second- and third-grade years. For both years, they classified students into four groups corresponding to each of the four ORF performance quartiles established by Hasbrouck and Tindal. Crawford and colleagues found that 78% of students in the three highest-performing groups (i.e., students who performed above the 25th percentile) on the ORF measure in second grade subsequently passed the Oregon statewide reading test at the end of third grade. Moreover, 81% of students reading in the two highest-scoring groups (i.e., who performed above the 50th percentile) in third grade also passed this test. Perhaps most interestingly, they also found that 100% of students who read at least 72 WCPM in second grade passed the Oregon state test in third grade. Based on these findings, Crawford and colleagues concluded that Hasbrouck and Tindal's norms could be used to monitor students' progress in reading and to make valid predictions about which individuals may be at risk for poor performance on high-stakes achievement tests.

McGlinchey and Hixson (2004) also used Hasbrouck and Tindal's (1992) norms to predict student performance on Michigan's state-developed reading achievement test.

In this study, three passages from a basal fourth-grade reading text (from the MacMillan Connections Reading Program) were administered to fourth graders in a district serving a high percentage of students from economically disadvantaged backgrounds. Two weeks following the CBM administration, students were administered the Michigan Educational Assessment Program (MEAP). Because Hasbrouck and Tindal (1992) found that fourth-grade students who scored at the 50th percentile on their ORF measure read 99 WCPM, McGlinchey and Hixson used a cut score of 100 WCPM to distinguish students who were at risk for substandard performance on the MEAP from students who were not at risk. In applying this cut score, they found that 75% of fourth graders who did not achieve satisfactory performance on the MEAP were correctly identified as at risk, while 74% of students who did achieve satisfactory performance were correctly identified as not at risk. Based on their findings, McGlinchey and Hixson concluded that ORF data could be used to improve the prediction of student performance on state-developed reading tests.

A number of studies have also used benchmark recommendations published in the DIBELS, 6th Edition to identify students who are at risk for poor reading outcomes (Good, Simmons, Kame'enui, Kaminski, & Wallin, 2002). These benchmarks are intended for use with the DORF. Like Hasbrouck and Tindal's (1992) norms, 6th Edition DIBELS-recommended cut scores were developed based on a representative national sample of elementary school-aged students. However, unlike Hasbrouck and Tindal's norms, they are criterion-referenced benchmarks for performance. Good, Simmons, and colleagues argued that criterion-referenced standards were more appropriate for gauging the development of reading skills over time, because they indicate how well a student has performed in relation to a predetermined goal or criterion rather than in relation to peers.

Schilling and colleagues (2007) examined the utility of DIBELS-recommended cut scores for predicting end-of-year performance on the Iowa Tests of Basic Skills (ITBS) in a sample of first-, second-, and third-grade students. For second and third graders, they found that fall cut scores for distinguishing "some risk" and "at risk" students from "low risk" students could be used to accurately identify those individuals who performed below the 50th percentile on the ITBS in the same year. More specifically, 86% of second graders and 88% of third graders who did not meet grade-level standards on the ITBS were correctly identified using the fall DIBELS-recommended cut scores. However, these cut scores also resulted in the identification of large numbers of false positives, which indicated that they had somewhat lower specificity (.35 and .45 for the second- and third-grade samples, respectively).

Additionally, a series of technical reports examining the decision-making utility of DIBELS-recommended benchmarks for identifying at-risk students has been published as well (Barger, 2003; Buck & Torgesen, 2003; Shaw & Shaw, 2002; Vander Meer et al., 2005; Wilson, 2005). These studies have investigated the utility of DIBELS-recommended cut scores for identifying students who are at risk for performing below grade-level standards on measures such as the Arizona Instrument to Measure Standards, the Colorado Student Assessment Test, the Ohio Proficiency Test in Reading, and the North Carolina End-of-Grade Test. For example, one of these reports examined the predictive validity of DORF scores with respect to outcomes on the Florida Comprehensive Assessment Test (FCAT; Buck & Torgesen, 2003). In this study, 6th Edition DIBELS-recommended cut scores were used to discriminate three groups of students (i.e., "at risk," "some risk," and "low risk" students) in the spring of third grade.

Results of classification agreement analyses indicated that the cut score distinguishing "at risk" from "some risk" students resulted in moderately high hit rates for identifying students who truly were at risk for failing the FCAT (sensitivity = .83). Moreover, the cut score distinguishing "some risk" students from "at risk" students had high specificity (i.e., 91% of students who were on track to meeting grade-level proficiency standards on the FCAT were correctly identified as not at risk). Overall, these reports further indicate the promise of using DIBELS-recommended cut scores to identify subgroups of students who are at risk for poor reading outcomes.

In summary, research examining universal and publisher-recommended benchmarks (e.g., DIBELS-recommended cut scores) has suggested that they can be useful for discriminating students who are at risk from students who are not at risk for poor performance on statewide achievement measures. However, it should be noted that several of these studies examined the relationship between ORF measures and statewide achievement tests that were administered only months or even weeks apart (e.g., Barger, 2003; Buck & Torgesen, 2003; McGlinchey & Hixson, 2004). Ideally, screeners would be administered months or years in advance of high-stakes achievement tests, such that interventions could be provided to at-risk students as early as possible (Johnson, Jenkins, Petscher, & Catts, 2009). Ultimately, further research is needed to evaluate the decision-making utility of cut scores for R-CBMs that are administered well in advance of high-stakes achievement tests.

Developing Alternative Cut Scores for Oral Reading Fluency Measures

An emerging body of research has investigated the utility of developing alternative ORF cut scores that are tailored to specific outcome measures and student populations, rather than relying on universal or publisher-recommended cut scores. There are a number of reasons that using locally developed cut scores may be a beneficial practice for schools. First, most end-of-year high-stakes achievement tests are state-developed and therefore vary across different parts of the country (Shapiro et al., 2006).

As indicated by VanDerHeyden (2010a), estimates of sensitivity and specificity for a screening measure vary as the gold standard measure changes. Therefore, using CBM cut scores that are designed to predict performance on specific state achievement measures may improve classification accuracy. Moreover, population characteristics can influence classification agreement between screening and gold standard measures. For example, VanDerHeyden (2010a) indicated that the prevalence of the condition of interest (i.e., poor reading) can influence estimates of positive and negative predictive power associated with particular screening measures and decision rules. As a result, cut scores found to exhibit appropriate levels of PPP and NPP in one sample may no longer do so when applied to others. As noted above, failure to maintain sufficient levels of PPP and NPP may adversely impact decision making for individual students.

A second concern is that CBMs may predict performance on reading comprehension measures differently for diverse groups of students (i.e., for males and females and for individuals from different racial and ethnic backgrounds), especially in the later elementary grades (Kranzler, Miller, & Jordan, 1999). Given that student demographics can vary significantly within and among districts, it may be beneficial for schools to use cut scores that are designed specifically for predicting reading comprehension outcomes for their respective student populations.

Finally, individual schools may wish to establish cut scores and decision rules that are compatible with their available level of resources. Since values of sensitivity and specificity change as cut scores for a screener are adjusted, schools can modify these scores to yield varying numbers of true positive, false positive, true negative, and false negative classification decisions. As noted previously, the identification of too many false positives can be costly for schools in terms of both time and resources; however, these classification errors may be preferable to false negative identifications, which often result in failure to provide appropriate services to students who truly need them.

Thus, schools with more instructional resources may wish to use screening cut scores with higher levels of sensitivity, especially if they have the capacity to provide secondary supports to greater numbers of students.

Researchers have investigated a number of methods for developing local ORF cut scores that are tailored for use in specific schools or districts. For example, Stage and Jacobsen (2001) examined the utility of fall, winter, and spring DORF scores for predicting end-of-year performance on the Washington Assessment of Student Learning (WASL) in a sample of fourth-grade students. Rather than using DIBELS-recommended cut scores to identify students at risk for failing the WASL, they developed new cut scores using participant test data. More specifically, for each CBM assessment period, they calculated the mean DORF score for students who performed in the Level 3 range (i.e., the minimum achievement level for meeting proficiency standards) at the end of the year on the WASL. The researchers then constructed 95% confidence intervals around these mean scores and identified the lowest score in each interval as the "at risk" cut score for the fall, winter, and spring assessment periods, respectively. Finally, Stage and Jacobsen calculated the overall classification accuracy of the fall DORF cut score for predicting performance on the WASL. Overall, they found a moderate level of agreement between classifications of poor readers based on fall DORF scores and performance on the WASL (Gamma = .72).
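As a rough illustration of Stage and Jacobsen's procedure (with hypothetical data; this is an interpretation of the published description, not the authors' code), one can take the DORF scores of students who later scored in the Level 3 range on the outcome test, construct a 95% confidence interval around their mean, and treat the lowest value in that interval as the at-risk cut score:

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical fall DORF scores (WCPM) for students who later performed in the
# Level 3 range (minimally proficient) on the end-of-year state test.
level3_wcpm = rng.normal(loc=95, scale=18, size=40)

mean = level3_wcpm.mean()
sem = stats.sem(level3_wcpm)  # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, df=len(level3_wcpm) - 1, loc=mean, scale=sem)

# The lowest score in the 95% confidence interval serves as the fall "at-risk" cut score.
fall_cut_score = ci_low
print(f"fall at-risk cut score: {fall_cut_score:.0f} WCPM")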

Wood (2006) also applied Stage and Jacobsen's procedure for developing ORF cut scores to predict performance on a state-developed achievement test. Specifically, Wood examined the relationship between fall DORF scores and end-of-year performance on the Colorado Student Assessment Program (CSAP) for third-, fourth-, and fifth-grade students. Across the three grade levels, estimates of sensitivity for these cut scores ranged from .85 to .95, and estimates of specificity ranged from .58 to .67. Moreover, the cut scores had moderately high overall correct classification (i.e., Gamma values between .83 and .93). Based on these findings, Wood concluded that these alternative ORF cut scores provided valuable information for identifying students who were at risk for failing the CSAP.

In addition to Stage and Jacobsen's (2001) method, researchers have explored several other procedures for developing ORF cut scores. For example, LeBlanc, Dufore, and McDougal (2012) used ordinary least squares regression to develop DORF cut scores that accurately identified students at risk for failing the New York State English Language Arts (NYSELA) test. Ordinary least squares (OLS) regression involves using a continuous independent variable (e.g., DORF scores) to predict values of a continuous dependent variable (e.g., scores on the NYSELA). In LeBlanc and colleagues' study, third-, fourth-, and fifth-grade students completed DORF measures in the fall and winter and then completed the NYSELA at the end of the school year. Subsequently, LeBlanc and colleagues constructed a series of regression equations to model the relationship between DORF and NYSELA scores at each grade level. To establish baseline fall and winter cut scores for the three grade levels, they substituted the minimum passing score on the NYSELA as the dependent variable into each equation and identified the corresponding fall and winter DORF values associated with that score. Finally, in order to ensure that these cut scores were sufficiently liberal in identifying at-risk students, the researchers added seven-tenths of the standard error of estimate (i.e., the average error of prediction in the regression model) to each baseline cut score. LeBlanc and colleagues computed sensitivity (which ranged from .79 to .91), specificity (.70 to .79), and OCC (73% to 89% correctly classified) values for both the fall and winter DORF administrations at each of the three grades.
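A minimal sketch of this OLS approach follows, with invented data and an invented minimum passing score rather than LeBlanc and colleagues' actual values; the seven-tenths adjustment is applied to the baseline cut score exactly as the prose above describes it.

import numpy as np

rng = np.random.default_rng(2)

# Hypothetical winter DORF scores and end-of-year state test scale scores.
dorf = rng.normal(loc=100, scale=25, size=150)
state_test = 550 + 1.2 * dorf + rng.normal(0, 30, size=150)

MIN_PASSING_SCORE = 650  # illustrative proficiency threshold on the state test

# Fit state_test = b0 + b1 * dorf by ordinary least squares.
b1, b0 = np.polyfit(dorf, state_test, deg=1)

# Standard error of estimate: the average prediction error of the regression.
residuals = state_test - (b0 + b1 * dorf)
see = np.sqrt(np.sum(residuals ** 2) / (len(dorf) - 2))

# Baseline cut score: the DORF value whose predicted state test score equals the
# minimum passing score. Adding 0.7 * SEE makes identification more liberal.
baseline_cut = (MIN_PASSING_SCORE - b0) / b1
cut_score = baseline_cut + 0.7 * see
print(f"baseline = {baseline_cut:.0f} WCPM, adjusted cut score = {cut_score:.0f} WCPM")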

Based on their results, they concluded that cut scores developed via the OLS regression method demonstrated adequate diagnostic efficiency. However, the researchers also noted that their dataset violated several statistical assumptions underlying OLS regression (e.g., equal variances). Given the potential difficulties associated with meeting these assumptions (especially for smaller samples), LeBlanc and colleagues concluded that regression methods may not be ideal for developing local ORF cut scores in schools.

Finally, several studies have also explored the use of receiver operating characteristic (ROC) curve analysis for developing ORF cut scores (e.g., Hintze & Silberglitt, 2005; Shapiro et al., 2006; Silberglitt & Hintze, 2005). In ROC curve analysis, sensitivity and specificity values of a predictor variable (e.g., DORF scores) are plotted for all possible values of the cut score, thereby allowing practitioners to identify the cut score that yields an optimal balance between sensitivity and specificity (Silberglitt & Hintze, 2005). One such study, conducted by Shapiro and colleagues (2006), examined the relationship between students' fall, winter, and spring performance on an ORF measure (from the web-based assessment system AIMSweb) and their scores on the Pennsylvania System of School Assessment (PSSA) in the same year. The researchers used ROC curve analysis to identify ORF cut scores for each assessment period that resulted in an optimal balance between sensitivity and specificity for identifying students at risk for failing the PSSA. For the fall, winter, and spring cut scores, sensitivity values ranged from .69 to .86, and specificity values ranged from .67 to .83. Across the three administration periods, cut scores accurately classified between 68% and 86% of all students. Based on these findings, Shapiro and colleagues concluded that ORF cut scores developed via ROC curve analysis provided valuable information for identifying students at risk for failing statewide achievement measures.
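The logic of the ROC-based approach can be sketched without specialized software: compute sensitivity and specificity for every candidate cut score and keep the one that balances the two. The Python sketch below assumes binary at-risk indicators and treats "optimal balance" as the smallest sensitivity-specificity gap with sensitivity at least as large as specificity; it is an illustration, not the procedure used in the studies cited above.

    import numpy as np

    def sensitivity_specificity(dorf, at_risk_truth, cut):
        """Students scoring below the cut are flagged as at risk."""
        dorf = np.asarray(dorf, dtype=float)
        truth = np.asarray(at_risk_truth, dtype=bool)
        flagged = dorf < cut
        tp = np.sum(flagged & truth)
        fn = np.sum(~flagged & truth)
        tn = np.sum(~flagged & ~truth)
        fp = np.sum(flagged & ~truth)
        return tp / (tp + fn), tn / (tn + fp)

    def roc_optimal_cut(dorf, at_risk_truth):
        """Cut score with the smallest gap between sensitivity and specificity,
        subject to sensitivity >= specificity."""
        best_cut, best_gap = None, np.inf
        for cut in np.unique(dorf):
            sens, spec = sensitivity_specificity(dorf, at_risk_truth, cut)
            if sens >= spec and (sens - spec) < best_gap:
                best_cut, best_gap = cut, sens - spec
        return best_cut

Scanning the observed score values in this way is equivalent to reading points off the empirical ROC curve.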

In a similar study, Keller-Margulis, Shapiro, and Hintze (2008) examined the relationship between students' performance on the AIMSweb ORF measure and the PSSA. The ORF measures were administered to students at various times during first through fourth grade. Students also completed the TerraNova (TN) achievement test at the end of fourth grade and the PSSA at the end of third and fifth grades. Keller-Margulis and colleagues used ROC curve analysis to identify ORF cut scores with optimal levels of sensitivity and specificity in predicting long-term performance on the TN and PSSA (i.e., performance on these tests between one and two years later). Generally, cut scores developed for the spring CBM administrations in first through fourth grades met the researchers' minimum standards for levels of sensitivity and specificity in predicting later performance on the TN and PSSA. Keller-Margulis and colleagues concluded that ROC curve analysis could be used to develop CBM cut scores that accurately predicted long-term reading achievement. This study represents one of the first investigations of the long-term diagnostic accuracy of locally developed ORF cut scores in predicting outcomes on high-stakes tests of reading achievement.

Both Shapiro and colleagues (2006) and Keller-Margulis and colleagues (2008) used ROC curve analysis to establish cut scores that offered an optimal balance between sensitivity and specificity; however, several researchers have argued that practitioners should favor higher levels of sensitivity in developing CBM cut scores (Johnson et al., 2009; Johnson et al., 2010; Riedel, 2007). They argued that false negative classification errors are more costly and egregious than false positive errors; therefore, levels of sensitivity should be higher, despite inevitable sacrifices to specificity. For example, Riedel (2007) used ROC curve analysis to develop ORF cut scores with slightly higher levels of sensitivity than specificity in predicting performance on two high-stakes reading achievement tests.

In this study, participants were administered the DORF in the winter and spring of first grade, the Group Reading Assessment and Diagnostic Evaluation (GRADE) at the end of first grade, and the TN at the end of second grade. For both the winter and spring CBM administrations, ROC curve analysis was used to identify ORF cut scores for which there was the smallest difference between sensitivity and specificity, whereby sensitivity was greater than or equal to specificity. The resulting cut scores exhibited sensitivity levels between .69 and .80 as well as specificity levels between .65 and .80 in predicting risk status for failing the TN and the GRADE. Overall, Riedel concluded that these scores accurately distinguished students who were at risk from students who were not at risk for poor performance on these end-of-year assessments.

Johnson and colleagues (2010) also used ROC curve analysis to develop cut scores that had higher levels of sensitivity than specificity. However, unlike Riedel (2007), they argued that cut scores should correctly identify at least 90% of true positives (i.e., should have sensitivity levels of at least .90). In this study, Johnson and colleagues administered the DORF to students at various times during second grade and then again in the fall and winter of third grade. Students also completed the FCAT at the end of third grade. Subsequently, they used ROC curve analysis to develop ORF cut scores that accurately identified 90% of students who failed the FCAT in third grade (i.e., sensitivity = .90). When sensitivity was held at .90, values of specificity were .43 and .45 for the second grade spring and third grade fall DORF administrations, respectively. In other words, nearly 60% of all students who met grade-level standards on the FCAT were incorrectly classified as at risk on the screeners. Based on these results, the researchers concluded that this approach resulted in the identification of too many false positives and that, ultimately, neither cut score efficiently discriminated students who were at risk for failing the FCAT from students who were not at risk.

Johnson and colleagues (2009) used ROC curve analysis to establish ORF cut scores that correctly identified at least 90% of students who were at risk for poor performance on an end-of-year test of reading achievement. In this study, participants completed the DORF in the fall of first grade and the SAT-10 at the end of the school year. Using ROC curve analysis to ensure 90% sensitivity, Johnson and colleagues (2009) established two fall DORF cut scores to identify students at risk for performing below the 20th percentile (severely impaired) and the 40th percentile (moderately impaired) on the SAT-10, respectively. They concluded that both DORF cut scores resulted in the identification of too many false positives, although the specificity level of the cut score for identifying students with severe reading impairments (.65) was somewhat higher than the specificity level of the cut score for identifying students with moderate reading impairments (.59). Overall, Johnson and colleagues (2009) concluded that the DORF cut scores developed in their study did not accurately distinguish students who were at risk for poor performance on the SAT-10 from students who were not at risk.

In summary, results of a number of studies indicate the potential utility of regression, ROC curve analysis, and other statistical methods for developing CBM cut scores that accurately predict students' risk for poor performance on high-stakes achievement tests. Nevertheless, it remains unclear which methods will yield cut scores with the greatest diagnostic accuracy. Ultimately, further research is needed to compare the classification accuracy of CBM cut scores developed via different statistical procedures and to explore the utility of these methods for local districts and schools.

Comparing Methods of Developing ORF Cut Scores

As described above, a number of studies have investigated the utility of various statistical procedures for developing alternative ORF cut scores; however, few studies have compared the classification accuracy of cut scores developed via these methods to that of universal or publisher-recommended cut scores.

Research comparing the utility of alternative and universal cut scores is necessary because it informs researchers and practitioners as to whether using locally developed cut scores is a worthwhile practice for individual districts and schools. Ultimately, if locally developed ORF cut scores exhibit superior diagnostic accuracy in comparison to universal or publisher-recommended scores, it may be worthwhile for schools to invest time and resources in developing their own cut scores.

To date, only two studies have compared the predictive utility of locally developed ORF cut scores with that of publisher-recommended cut scores in predicting performance on high-stakes tests of reading achievement. In one study, Goffreda, DiPerna, and Pedersen (2009) compared the decision-making utility of DIBELS-recommended and locally developed ORF cut scores for identifying students at risk for failing the TN in second grade and the PSSA in third grade. Participants were administered the DORF in the winter of first grade, the TN in the spring of second grade, and the PSSA in the spring of third grade. Goffreda and colleagues then used ROC curve analysis to establish ORF cut scores with optimal levels of sensitivity and specificity for predicting performance on the TN and the PSSA, respectively. Generally, the observed levels of sensitivity and specificity for the DIBELS-recommended and adjusted cut scores were comparable. However, Goffreda and colleagues noted that the sensitivity level of the adjusted winter ORF cut score (.88) for predicting third grade PSSA performance was somewhat higher than the sensitivity level of the DIBELS-recommended cut score (.77). Nevertheless, it remains unclear whether the difference between these sensitivity values is substantial, as there is no known statistical test for detecting significant differences between sensitivity values.

Similar to Goffreda and colleagues (2009), Roehrig, Petscher, Nettles, Hudson, and Torgesen (2008) compared the decision-making utility of universal and recalibrated DORF cut scores for predicting performance on two end-of-year, standardized reading achievement measures.

However, rather than comparing their recalibrated cut scores to those recommended for the DIBELS, 6th Edition, Roehrig and colleagues compared them to ORF cut scores developed for universal use by schools participating in the Reading First initiative in the state of Florida. In this study, students were administered the DORF in the fall, early winter, late winter, and spring of third grade. These students were also administered the SAT-10 and the FCAT in the spring of the same year. Roehrig and colleagues used ROC curve analysis to recalibrate the Reading First DORF cut scores designated for the fall, early winter, and late winter CBM administrations. They found that, across these three assessment periods, the recalibration of cut scores resulted in an average improvement of +5.8% in sensitivity, with the greatest gains observed in the sensitivity values for the fall assessment period (+10%). Notably, values of sensitivity remained high for the recalibrated cut scores when they were applied to an equally sized and demographically comparable cross-validation sample. In contrast, specificity improved only slightly when the universal Reading First cut scores were recalibrated; however, Roehrig and colleagues concluded that use of ROC curve analysis to adjust universal cut scores may improve the identification of true positives.

Findings from Goffreda and colleagues (2009) and Roehrig and colleagues (2008) suggest that locally developed or recalibrated cut scores may offer modest gains in accuracy in identifying at-risk students. Nevertheless, it remains unclear which method of developing local cut scores (e.g., ROC curve analysis and regression methods) will allow schools to attain the greatest level of decision-making accuracy. To date, only two studies have compared multiple statistical procedures for developing alternative CBM cut scores. In the first study, Silberglitt and Hintze (2005) investigated four methods of developing alternative ORF cut scores: namely, predictive discriminant analysis, the equipercentile method, logistic regression, and ROC curve analysis.

Discriminant analysis (DA) involves using a predictor variable (i.e., ORF scores) to determine the probability of membership in a group (i.e., passing or failing the statewide achievement test). In the equipercentile method (EQ), the percentage of students below a designated cut score on one measure (i.e., the ORF measure) is equated with that percentage on a second measure (i.e., the statewide achievement test), thereby producing two equivalent scores for the two measures (Silberglitt & Hintze, 2005). Finally, logistic regression (LR) involves using a continuous or categorical independent variable (i.e., ORF scores) to determine the probability of membership in each category of the dependent variable (i.e., passing or failing the statewide achievement test). In this way, LR is similar to DA; however, it is less restrictive and complex than DA with respect to its underlying assumptions.

In Silberglitt and Hintze's (2005) study, participants completed ORF measures in the winter and spring of first grade as well as in the fall, winter, and spring of second and third grades. Participants also completed the Minnesota Comprehensive Assessment (MCA) in the spring of third grade. Subsequently, the researchers used the DA, EQ, LR, and ROC curve analysis methods to develop ORF cut scores for each of the eight assessment periods. Generally, all four methods produced cut scores that met minimum criteria for levels of sensitivity and specificity. Cut scores derived via LR consistently demonstrated the highest or near-highest values of PPP and OCC across assessment periods. Moreover, cut scores developed via the ROC curve analysis and DA procedures consistently had the highest values of NPP. Given that the ROC curve analysis and LR methods were most robust to violations of statistical assumptions (e.g., normality and equal variance assumptions) and produced cut scores with the greatest diagnostic accuracy, Silberglitt and Hintze concluded that these two methods may be most appropriate for use by school-based practitioners.
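Of these four procedures, only the equipercentile method is not revisited later in this document, so a brief sketch may be useful here: the failure rate on the state test fixes a percentile, and the ORF cut score is the ORF value at that same percentile. The function below is an illustration with hypothetical inputs, not the implementation used by Silberglitt and Hintze.

    import numpy as np

    def equipercentile_cut(dorf, state_scores, state_cut):
        """Equate the percentage of students below the state-test cut with the
        same percentage of the DORF distribution."""
        pct_below = np.mean(np.asarray(state_scores) < state_cut) * 100
        return np.percentile(np.asarray(dorf, dtype=float), pct_below)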

In a second study, Hintze and Silberglitt (2005) compared three of the four aforementioned procedures for developing CBM cut scores: DA, LR, and ROC curve analysis. Similar to Silberglitt and Hintze (2005), students completed ORF measures in the winter and spring of first grade as well as in the fall, winter, and spring of second and third grades. In the spring of third grade, participants also completed the MCA. For each statistical procedure (i.e., DA, LR, and ROC curve analysis), ORF cut scores were established for all eight assessment periods to predict performance on the MCA. Cut scores developed using DA procedures consistently exhibited inadequate levels of sensitivity; this was not the case for cut scores developed via ROC curve analysis and LR, however. For all three methods (i.e., DA, LR, and ROC curve analysis), values of sensitivity and specificity were highest for ORF measures administered closer in time to the MCA (i.e., in second and third grades). In summary, these results indicate that LR and ROC curve analysis methods may offer a promising approach to developing local CBM cut scores.

Limitations to Prior Research on Developing ORF Cut Scores

Collectively, the research described above provides support for the use of alternative or recalibrated ORF cut scores to predict performance on state-mandated tests of reading achievement. However, there are a number of limitations to this body of research. First, at the current time, only two studies have compared multiple statistical procedures for developing CBM cut scores (i.e., Hintze & Silberglitt, 2005; Silberglitt & Hintze, 2005). While these studies provide valuable information for understanding differences in the characteristics and functionality of these procedures, further research is needed to determine which methods are most appropriate for use at the local school level.

Moreover, neither study compared the decision-making utility of locally developed cut scores with that of universal or publisher-recommended cut scores. Additional research comparing the diagnostic accuracy of universal cut scores with that of local cut scores developed via multiple statistical procedures is needed.

A second limitation to the aforementioned research concerns the use of cross-validation to determine whether cut scores will continue to demonstrate adequate diagnostic accuracy when applied to independent samples. Because local cut scores typically are developed in a post hoc fashion (based on known student outcomes on the gold standard measure), their respective levels of sensitivity and specificity may be overestimated (VanDerHeyden, 2011). A number of researchers have contended that research on locally developed cut scores should be followed up with systematic replications that investigate their classification accuracy in independent samples (e.g., Jenkins, Hudson, & Johnson, 2007; VanDerHeyden, 2010b; VanDerHeyden, 2011). Nevertheless, few studies have included cross-validation samples in their designs, and of the 22 studies reviewed above, only Roehrig and colleagues (2008) and Johnson and colleagues (2009) used cross-validation to evaluate the diagnostic utility of locally developed ORF cut scores in an independent sample.

Furthermore, although two of the studies described above cross-validated cut scores on separate samples, neither study attempted to cross-validate cut scores across multiple cohorts or years. For example, both the calibration and cross-validation samples in Roehrig and colleagues' (2008) study consisted of participants who completed third grade in the same academic year. This is potentially problematic, given that school systems, assessments, and student populations may change over time. For example, schools that are in the process of implementing MTSS may experience shifts in instructional practices and other systems-level variables over time. These shifts can impact both the quality and focus of core instruction and tiered interventions.

Ultimately, given that schools and assessments may undergo significant changes from year to year, it is unclear whether cut scores developed to predict performance on a state achievement test during one school year will continue to demonstrate adequate diagnostic accuracy in subsequent years. Further research is therefore needed to determine whether cut scores developed for end-of-year, state-mandated assessments retain their decision-making utility from year to year.

Moreover, few studies have compared the diagnostic accuracy of ORF cut scores derived to predict performance on state-developed and nationally normed achievement measures. For example, Roehrig and colleagues (2008) examined the relationship between students' ORF scores and their end-of-year performance on one nationally normed and one state-developed achievement measure (i.e., the SAT-10 and the FCAT); however, these researchers developed cut scores to predict performance on the state achievement test only. Goffreda and colleagues (2009) also developed ORF cut scores to predict performance on one nationally normed test of reading achievement (i.e., the TN) and one state-developed achievement test (i.e., the PSSA); nevertheless, because these two tests were administered to students at different grade levels, the researchers were unable to directly compare the diagnostic accuracy of cut scores for these measures. Ultimately, comparisons between CBM cut scores for predicting performance on these two types of tests may be necessary, as recent meta-analytic research has indicated that R-CBMs may correlate more strongly with nationally normed tests than with state-developed tests (Reschly, Busch, Betts, Deno, & Long, 2009). Thus, further research is needed to compare the long-term diagnostic accuracy of cut scores developed for these two types of tests.

Finally, it remains unclear whether various methods for developing cut scores can be applied at the local school level, where available sample sizes may be smaller. A number of researchers have recommended that local schools use ROC curve analysis and LR methods to develop ORF cut scores that are tailored to their specific student populations and outcome measures (e.g., LeBlanc et al., 2012). Because there may be substantial variation within districts with respect to student demographics, instructional practices, gold standard measures, and the prevalence of reading problems, it may be beneficial for individual schools to establish their own screening cut scores for identifying students who are at risk for long-term reading difficulties. Nevertheless, most research examining the decision-making utility of alternative cut scores has used large statewide samples. Specifically, a recent meta-analysis of research investigating the relationship between reading CBMs and state-developed achievement tests indicated that the mean number of participants in these studies was 1,453.4 (SD = 918.3; Yeo, 2010). Given that some methods for developing local cut scores have rigid underlying statistical assumptions, further research is needed to determine whether these procedures can be applied in schools with smaller student populations.

Aims of the Present Study

The purpose of the present research is to examine the diagnostic utility of locally developed ORF cut scores for identifying students who are at risk for poor performance on high-stakes, end-of-year tests of reading achievement. As noted above, only two studies have compared multiple statistical procedures for developing local cut scores, and neither study compared the diagnostic accuracy of these cut scores to that of universal or publisher-recommended cut scores. Specifically, this study will investigate three statistical methods (viz., ROC curve analysis, DA, and LR) for developing DORF cut scores and will compare their classification accuracy to that of the publisher-recommended cut scores for the DIBELS, 6th Edition. This study will also investigate the diagnostic accuracy of locally developed and publisher-recommended cut scores over multiple years.

Moreover, no study has directly compared the diagnostic accuracy of cut scores developed to predict performance on nationally normed and state-developed tests of reading achievement. It is important to note that these two types of tests may differ in several ways (e.g., nationally normed tests may be more likely to measure general reading achievement, whereas state-developed tests are often designed to measure statewide learning standards; Reschly et al., 2009). Therefore, the present study will compare the long-term diagnostic utility of cut scores developed for these two types of achievement tests. Additionally, this study will investigate how various statistical methods for developing alternative ORF cut scores function at the local school level, where accessible sample sizes will most likely be smaller. Ultimately, the purpose of this study is to determine which statistical procedures can be used by individual schools to develop ORF cut scores that accurately identify students who are at risk for poor performance on high-stakes reading tests. The results of this research may assist local schools and districts in improving screening decision rules for identifying students who are at risk for long-term reading problems. The following summarizes the research questions posed in the present study.

1. Which of the three statistical procedures (i.e., DA, LR, and ROC curve analysis) yield cut scores with adequate diagnostic accuracy? Which of these methods produce cut scores that retain their diagnostic accuracy over multiple years (i.e., across subsamples)?

2. How does the long-term classification accuracy of publisher-recommended cut scores compare with that of locally developed cut scores?

3. Do cut scores developed to predict performance on state-developed achievement tests and nationally normed achievement tests differ with respect to their long-term diagnostic accuracy?

Table 1-1. Summary of Classification Decisions Using Screening and Gold Standard Measures

                                               Meets the performance criterion      Does not meet the performance criterion
                                               on the gold standard measure         on the gold standard measure
Test positive on screener (at risk)            False positive                       True positive
Test negative on screener (not at risk)        True negative                        False negative
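For reference, the classification agreement indices (CAIs) reported throughout this study follow directly from the four cells of Table 1-1. With TP, FP, TN, and FN denoting the counts of true positives, false positives, true negatives, and false negatives, the standard definitions are:

    \text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP},

    \text{PPP} = \frac{TP}{TP + FP}, \qquad \text{NPP} = \frac{TN}{TN + FN}, \qquad \text{OCC} = \frac{TP + TN}{TP + FP + TN + FN}.

As discussed later in this document, PPP and NPP (unlike sensitivity and specificity) also depend on the base rate of poor performance in a given sample.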

CHAPTER 2
METHODS

Research Setting and Participants

Participants were 266 students who were enrolled in second and third grade between 2004 and 2011 at a university-affiliated, developmental research school in North Central Florida. This study included students who completed second grade during one of six school years; specifically, it included 21 students who were enrolled in second grade during 2004-2005; 50 students during 2005-2006; 51 students during 2006-2007; 48 students during 2007-2008; 46 students during 2008-2009; and 50 students during 2009-2010. These participants also completed third grade at this school the following year, and only students who were enrolled for the entirety of second and third grades were included in the study. Of the total sample, two students had been retained previously in second grade. For these students, only data from their second year in second grade were included in the study to ensure that all participants completed CBMs and end-of-year achievement tests within similar timeframes.

Across the entire sample, students represented a diverse range of racial and ethnic backgrounds; approximately 45.8% of students were Caucasian, 21.8% were African American, 17.3% were Hispanic, 3.1% were Asian, 0.9% were Native American, and 11.1% were multiracial. Because this school generally does not enroll or provide services for students with limited English proficiency, data regarding this variable were not available. Regarding the representation of students with disabilities (as identified through the Individuals with Disabilities Education Act, 2004), the sample included students with specific learning disabilities (SLDs; 7.5%), language impairments (LI; 2.2%), emotional/behavioral disabilities (E/BD; 0.4%), and other health impairments (OHI; 0.9%). Additionally, a small percentage of participants received services through a 504 Plan (6.2%), and approximately 17.7% of students were identified as gifted.

Students with multiple exceptionalities (e.g., gifted and SLD) comprised approximately 3.1% of individuals in the entire sample. Finally, approximately 26.7% of students qualified for free or reduced-price lunch.

Data Collection

For all participants in the sample, both demographic and assessment data were collected by the author from archived databases provided by the elementary school. Assessment data were drawn from the school's routine benchmark and end-of-year testing procedures; specifically, data from three periodic administrations of the DIBELS, 6th Edition, one end-of-year administration of the FCAT, and one end-of-year administration of the SAT-10 were collected.

Measures

Dynamic Indicators of Basic Early Literacy Skills (DIBELS), 6th Edition

The DIBELS Oral Reading Fluency measure (DORF; Good & Kaminski, 2002a) is a standardized, individually administered test of reading rate and accuracy. For this measure, students were presented with three brief grade-level reading passages. For each passage, they were asked to read aloud as many words as possible in one minute. Omitted and substituted words, as well as hesitations of more than three seconds, were recorded as errors. Words self-corrected within three seconds were scored as accurate. For each of the three passages, the number of words read correctly in one minute (WCPM) was calculated, and the median value for the three passages was recorded as the final score for each administration.
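The scoring rule reduces to taking the median of the three WCPM values. The short Python sketch below, with hypothetical counts, is included only to make the computation explicit.

    def dorf_score(wcpm_passage_1, wcpm_passage_2, wcpm_passage_3):
        """Final DORF score for one administration: the median WCPM of the three passages."""
        return sorted([wcpm_passage_1, wcpm_passage_2, wcpm_passage_3])[1]

    print(dorf_score(48, 61, 55))   # prints 55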

Several studies have indicated that the DORF, 6th Edition has high alternate-form reliability (r = .89-.94) and test-retest reliability (r = .92-.97), as well as moderate to high criterion-related evidence of validity (i.e., concurrent and predictive) with other standardized tests of reading comprehension (r = .65-.80; Barger, 2003; Buck & Torgesen, 2003; Good & Kaminski, 2002b; Good, Kaminski, Smith, & Bratten, 2001; Shaw & Shaw, 2002; Vander Meer, Lentz, & Stollar, 2005; Wilson, 2005).

Florida Comprehensive Assessment Test (FCAT)

The FCAT is administered to students in grades 3-12 across the state of Florida to assess achievement in reading, mathematics, and other academic areas. This assessment is a group-administered, criterion-referenced test that is designed to evaluate students' mastery of the objectives specified by the Sunshine State Standards, Florida's state standards for curricula and learning. Specifically, this study examined students' performance on the Reading section of the FCAT. For third grade students, this section is predominantly a measure of comprehension and typically consists of between 50 and 55 multiple-choice questions that are posed in response to brief reading passages (average length of 500 words each; Florida Department of Education, 2009). Student performance on the FCAT Reading section is categorized into one of five descriptive achievement levels. Students who score at or above Level 3 are identified as meeting grade-level standards in reading, whereas students who perform at Levels 1 and 2 are identified as non-proficient in reading. Therefore, in the present study, the cutoff score for Level 3 performance was used to distinguish proficient from non-proficient readers. Regarding the technical adequacy of the FCAT, the internal consistency reliability coefficient for the third grade Reading section is .92 (Harcourt, 2007). The criterion-related evidence of validity of this test with several other measures of language, basic reading, and reading comprehension skills has also been established (Schatschneider et al., 2004). For example, correlations of scores on this test and the SAT-10 have been found to range from .70 to .81 (Crist, 2001).

Stanford Achievement Test, 10th Edition (SAT-10)

The SAT-10 is a standardized test of reading and math achievement (Harcourt Brace, 2003a). Specifically, this study examined students' performance on the Reading Comprehension section of the SAT-10. For this assessment, students answered a total of 54 multiple-choice questions in response to reading brief informational, functional, and literary passages. In a representative national sample, the SAT-10 was found to demonstrate relatively high internal consistency, with split-half reliability coefficients in the .80s and .90s. Alternate-form reliabilities for this test generally fell between .80 and .90 (Carney, 2005; Harcourt Brace, 2003b). Regarding the criterion-related evidence of validity of this test with other standardized achievement measures, correlations of SAT-10 subtest scores with scores on the Otis-Lennon School Ability Test, Version 8 (OLSAT-8) range from approximately .40 to .60. Moreover, correlations of SAT-10 and Stanford Achievement Test, 9th Edition (SAT-9) scores generally fall within the range of .70-.80 (Carney, 2005; Crist, 2001; Harcourt Brace, 2003b; Morse, 2005; Roehrig et al., 2008).

Procedure

Assessment Administration

Various personnel at the elementary school (e.g., teachers and instructional support staff) administered the DORF to all participants in September (fall administration), January (winter administration), and May (spring administration) of second grade. DORF probes were administered to students individually and in accordance with the DIBELS 6th Edition standardized administration procedures (Good & Kaminski, 2002a). Participants were also administered the FCAT and SAT-10 in May of third grade. Both of these tests were administered to students in a group setting.

Data Analysis

Following data collection, the author divided all participants into three subsamples (one calibration subsample and two cross-validation subsamples) based on the year in which they completed second grade. The calibration sample (S1) consisted of 170 participants who completed second grade during one of the following school years: 2004-2005, 2005-2006, 2006-2007, and 2007-2008. The first cross-validation sample (S2) consisted of 46 students who completed second grade during the 2008-2009 school year, and the second cross-validation sample (S3) consisted of 50 students who completed second grade during the 2009-2010 school year. Descriptive statistics for DORF, FCAT, and SAT-10 scores were computed for each of the calibration and cross-validation subsamples as well as for the entire sample. Moreover, Pearson correlations between scores on each of these measures were calculated.

Development of DORF Cut Scores

DORF cut scores for identifying students who were at risk for poor performance on the FCAT or SAT-10 were developed based on test data from the calibration sample. Poor performance on the FCAT Reading section was defined as scoring below Level 3 (i.e., scaled score < 284). In order to ensure that the classification accuracy of DORF cut scores could be compared for the FCAT and the SAT-10, respectively, the performance criterion for the SAT-10 was adjusted such that equal numbers of students passed both tests. This adjustment resulted in the selection of a scaled score of 615 as the performance criterion for the SAT-10. Thus, students who scored below 615 (i.e., the 63rd percentile) on the Reading Comprehension section of the SAT-10 were said to have performed below grade-level standards. In particular, three methods for developing cut scores were compared: DA, LR, and ROC curve analysis. Moreover, the diagnostic accuracy of cut scores developed via these three methods was compared to that of cut scores recommended for the 6th Edition of the DIBELS.
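The adjustment of the SAT-10 performance criterion can be illustrated with a short sketch. The function below assumes the criterion is chosen so that the number of students meeting it equals the number of students meeting the FCAT Level 3 criterion in the same sample; the score arrays and the no-ties simplification are hypothetical, and the value of 615 reported above comes from the actual calibration data, not from this code.

    import numpy as np

    def matched_sat10_criterion(fcat_scores, sat10_scores, fcat_cut=284):
        """SAT-10 score at which the pass count equals the FCAT pass count
        (assumes no ties at the boundary)."""
        fcat = np.asarray(fcat_scores, dtype=float)
        sat10_sorted = np.sort(np.asarray(sat10_scores, dtype=float))[::-1]  # highest scores first
        n_pass = int(np.sum(fcat >= fcat_cut))                               # students meeting the FCAT criterion
        return sat10_sorted[n_pass - 1]                                      # lowest score among the top n_pass students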

The following summarizes the various methods for developing DORF cut scores that were applied in this study.

Discriminant analysis (DA)

DA involves using a predictor variable to determine the probability of membership in discrete groups of the dependent variable (Silberglitt & Hintze, 2005). In the present study, the predictor variable was DORF scores, and the two groups were those who met and those who did not meet grade-level proficiency standards on the outcome measures (i.e., FCAT and SAT-10), respectively. To develop cut scores for predicting risk status on the FCAT, the following formula was applied:

X_c = \frac{\bar{X}_p + \bar{X}_f}{2} + \frac{s_w^2}{\bar{X}_p - \bar{X}_f}\,\ln\left(\frac{\pi_f}{\pi_p}\right)    (2-1)

where X_c is the resulting DORF cut score; \bar{X}_p is the mean DORF score for students who ultimately met the performance criterion on the FCAT; \bar{X}_f is the mean DORF score for students who ultimately did not meet the performance criterion on the FCAT; s_w^2 is the pooled within-group variance of DORF scores; \pi_p is the prior probability of meeting the performance criterion on the outcome measure; and \pi_f is the prior probability of failing to meet the performance criterion on the outcome measure. The same formula was used for the SAT-10, with the group means and prior probabilities defined in terms of the performance criterion on the SAT-10.

Adjusting the value of the prior probability of failing to meet the performance criterion on the outcome test (and, accordingly, the prior probability of meeting the performance criterion) can lead to the selection of cut scores that are either more liberal or more conservative in their identification of at-risk students.

When prior probabilities of meeting and failing to meet the performance criterion are assumed to be equal (i.e., both are equal to .50), the DORF cut score is simply the average of the mean scores for students who meet and do not meet the criterion, respectively. However, when smaller prior probabilities of failing to meet the performance criterion (e.g., .20 or .30) are substituted in the equation, the resulting DORF cut score is lower, meaning that fewer students are identified as at risk. In the present study, DA was used to develop DORF cut scores in two ways. In the first approach, it was assumed that the prior probabilities of meeting and failing to meet the performance criterion on the outcome test were equal. Cut scores developed via this approach are heretofore referred to as the DA(.50) cut scores. In the second approach, adjusted prior probabilities of meeting and failing to meet the performance criterion were estimated based on assessment data from the calibration sample. Cut scores developed using this approach are heretofore referred to as the DA(CS) cut scores.

Logistic regression (LR)

LR is a regression analysis procedure in which a continuous or categorical independent variable is used to determine probabilities of membership in each of the categories of the dependent variable (Silberglitt & Hintze, 2005). In this respect, it is similar to DA; however, LR does not require strict adherence to normality and equal variance assumptions. For these analyses, the independent variable was DORF scores, and the dependent variable was performance on the two outcome measures, wherein participants fell in one of two categories: students who met grade-level proficiency standards on the outcome measure and students who failed to meet grade-level standards. For this method, the DORF scores that corresponded to a .50 probability of failing to meet performance standards on the FCAT and SAT-10, respectively, were identified as the cut scores for determining risk status.
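The two DA variants and the LR cut score can be illustrated with a brief sketch. The discriminant cutoff below follows Equation 2-1 (the standard two-group cutoff with pooled within-group variance), and the logistic cut score is the DORF value at which the fitted model assigns a .50 probability of failing; scikit-learn is used for the logistic fit, and the arrays are synthetic placeholders rather than the study's records.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def da_cut_score(dorf, passed, prior_fail=None):
        """Equation 2-1. With prior_fail=None, equal priors are assumed (DA[.50]);
        otherwise the supplied prior is used (e.g., the calibration-sample rate for DA[CS])."""
        dorf = np.asarray(dorf, dtype=float)
        passed = np.asarray(passed, dtype=bool)
        x_p, x_f = dorf[passed], dorf[~passed]
        prior_fail = 0.5 if prior_fail is None else prior_fail
        pooled_var = (((len(x_p) - 1) * x_p.var(ddof=1) + (len(x_f) - 1) * x_f.var(ddof=1))
                      / (len(dorf) - 2))
        midpoint = (x_p.mean() + x_f.mean()) / 2
        shift = pooled_var / (x_p.mean() - x_f.mean()) * np.log(prior_fail / (1 - prior_fail))
        return midpoint + shift

    def lr_cut_score(dorf, passed):
        """DORF value at which the logistic model gives a .50 probability of failing."""
        X = np.asarray(dorf, dtype=float).reshape(-1, 1)
        y = (~np.asarray(passed, dtype=bool)).astype(int)        # 1 = failed to meet the criterion
        model = LogisticRegression(C=1e6).fit(X, y)              # large C approximates an unregularized fit
        return -model.intercept_[0] / model.coef_[0][0]          # solve intercept + slope * DORF = 0

    rng = np.random.default_rng(2)
    dorf = rng.normal(92, 30, size=170).clip(5, 200)             # synthetic calibration sample
    passed = dorf + rng.normal(0, 20, size=170) > 70
    print(da_cut_score(dorf, passed))                            # DA(.50)
    print(da_cut_score(dorf, passed, prior_fail=1 - passed.mean()))  # DA(CS)
    print(lr_cut_score(dorf, passed))

As reported later in this document, the logistic cut score can fall outside the range of observed DORF scores when few students have a predicted failure probability of .50 or greater, which is why some LR cut scores were not viable.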

Receiver operating characteristic (ROC) curve analysis

ROC curve analysis involves plotting the true positive rate (sensitivity) on the y-axis and the false positive rate (1 - specificity) on the x-axis for each possible cut score on a screening measure. The resulting curve indicates the sensitivity and specificity of all possible values of the cut score. Using this information, practitioners can select cut scores with appropriate levels of classification accuracy. In the present study, ROC curve analysis was used to derive cut scores in two ways. First, cut scores with the optimal balance between sensitivity and specificity were identified for each of the fall, winter, and spring DORF administrations. These cut scores are heretofore referred to as the ROC(Optimal) cut scores. In the present analyses, the ROC(Optimal) cut scores were defined as the cut scores with the smallest difference between sensitivity and specificity such that sensitivity was either greater than or equal to specificity. Second, given that different schools may wish to select cut scores with different levels of sensitivity, cut scores with sensitivity values as similar as possible to the following values were identified for each administration: .70, .75, .80, .85, and .90. These cut scores are henceforth referred to as the ROC(.70), ROC(.75), ROC(.80), ROC(.85), and ROC(.90) cut scores, respectively. When cut scores with these precise values of sensitivity could not be identified, the cut scores with the closest available sensitivity values were selected. Thus, ROC curve analysis yielded six cut scores for each DORF administration (i.e., one cut score with optimal levels of sensitivity and specificity and five additional cut scores with pre-selected sensitivity values).
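The targeted-sensitivity cut scores can be obtained with a simple scan over candidate cut points, keeping, for each target, the candidate whose observed sensitivity is closest to that value. The sketch below assumes binary at-risk indicators and is illustrative only; it is not the software used in the present analyses.

    import numpy as np

    def sensitivity(dorf, at_risk_truth, cut):
        flagged = np.asarray(dorf, dtype=float) < cut
        truth = np.asarray(at_risk_truth, dtype=bool)
        return np.sum(flagged & truth) / np.sum(truth)

    def targeted_sensitivity_cuts(dorf, at_risk_truth, targets=(0.70, 0.75, 0.80, 0.85, 0.90)):
        """For each target, return the candidate cut score whose sensitivity is closest
        to the target, mirroring the ROC(.70) through ROC(.90) cut scores."""
        candidates = np.unique(dorf)
        sens = np.array([sensitivity(dorf, at_risk_truth, c) for c in candidates])
        return {t: candidates[int(np.argmin(np.abs(sens - t)))] for t in targets}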

Publisher-recommended cut scores

The 6th Edition of the DIBELS provides fall, winter, and spring ORF cut scores for classifying students' level of risk for reading difficulty. These cut scores are research-based, criterion-referenced scores developed from a representative national sample. In second grade, students who score below 26 WCPM in the fall, below 52 WCPM in the winter, or below the corresponding spring cut point are identified as being at risk for reading difficulty. Finally, students who score at or above the benchmark goal for a given administration are identified as being at low risk, while students who score between the at-risk cut point and the benchmark goal are identified as being at some risk. In the present study, two DIBELS-recommended cut scores were examined for each administration: one separating at-risk from some-risk students and one separating some-risk from low-risk students.

In summary, the aforementioned procedures for selecting DORF cut scores yielded eleven scores for each of the three CBM administrations (i.e., two using DA, one using LR, six using ROC curve analysis, and two using published DIBELS recommendations).

Evaluating the Diagnostic Accuracy of Cut Scores

Using data from S1, the diagnostic accuracy of cut scores for the fall, winter, and spring ORF administrations was evaluated by computing the following CAIs: sensitivity, specificity, PPP, NPP, and OCC. These indices were calculated for all eleven cut scores developed for each of the three DIBELS administrations.

Cross-validation

Cut scores developed for the calibration sample (S1) were cross-validated on two independent samples (S2 and S3). As noted above, these two cross-validation samples consisted of students who completed third grade in the spring of 2010 and 2011, respectively. The DORF cut scores developed for the calibration sample were applied to both independent samples in order to evaluate their long-term diagnostic accuracy in predicting FCAT and SAT-10 performance. Again, CAIs were computed for the fall, winter, and spring cut scores for both samples. These metrics included sensitivity, specificity, PPP, NPP, and OCC.
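The classification agreement analyses amount to applying a fixed cut score to each subsample and tabulating the resulting 2 x 2 table. The helper below is a sketch with hypothetical inputs (arrays of DORF scores and indicators of failing the end-of-year test); in the present design, S1 supplies the cut scores, which are then applied unchanged to S2 and S3.

    import numpy as np

    def classification_indices(dorf, failed_outcome, cut):
        """Apply one DORF cut score to one subsample and return the five CAIs."""
        flagged = np.asarray(dorf, dtype=float) < cut             # below the cut -> flagged as at risk
        truth = np.asarray(failed_outcome, dtype=bool)            # True -> failed the end-of-year test
        tp = np.sum(flagged & truth)
        fp = np.sum(flagged & ~truth)
        tn = np.sum(~flagged & ~truth)
        fn = np.sum(~flagged & truth)
        return {
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
            "PPP": tp / (tp + fp) if (tp + fp) else float("nan"),
            "NPP": tn / (tn + fn) if (tn + fn) else float("nan"),
            "OCC": (tp + tn) / len(truth),
        }

    # Usage pattern (subsample arrays are hypothetical and not defined here):
    # for name, (dorf, failed) in {"S1": s1, "S2": s2, "S3": s3}.items():
    #     print(name, classification_indices(dorf, failed, cut=72))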

Comparing the Diagnostic Accuracy of Cut Scores

In order to address the primary research questions of this study, results of the classification agreement analyses (i.e., values of sensitivity, specificity, PPP, NPP, and OCC) were compared across the three statistical procedures (i.e., DA, LR, and ROC curve analysis) as well as across participant subsamples (i.e., the calibration and two cross-validation samples). First, CAIs computed for all three assessment periods were compared across the calibration and two cross-validation samples (i.e., S1, S2, and S3) to determine whether these scores retained their diagnostic accuracy over consecutive school years. Classification agreement metrics were also compared for cut scores developed via DA, LR, and ROC curve analysis to determine which methods were most appropriate for use at the local school level. Additionally, these metrics were compared for the locally developed and DIBELS-recommended cut scores. Finally, in order to determine whether cut scores derived for nationally normed and state-developed achievement tests differed in regard to their diagnostic accuracy, values of sensitivity, specificity, and other indicators were compared for CBM cut scores developed to predict performance on the FCAT and SAT-10, respectively.

CHAPTER 3
RESULTS

Table 3-1 displays demographic information for the whole sample as well as the three subsamples: the calibration sample (S1), the first cross-validation sample (S2), and the second cross-validation sample (S3). For each of the subsamples, demographic information was available for the majority of participants (approximately 85%), and percentages for each category were calculated based on the number of students for whom data were available. As seen in the table, approximately equal numbers of males and females were observed in S1, S2, and S3. Moreover, the distribution of students from various racial and ethnic backgrounds was generally comparable across subsamples, as was the representation of students who qualified for free or reduced-price lunch. While similar representation of students with various exceptionalities was generally observed across subsamples, the percentages of students who were identified as having an SLD and students who were identified as gifted, respectively, were somewhat higher in S1 than in S2 and S3.

Table 3-2 displays both the DIBELS-recommended and locally developed cut scores for the fall, winter, and spring administrations of the DORF (developed using data from S1). The first three columns present cut scores developed to predict outcomes on the FCAT, while the latter three display cut scores developed for the SAT-10. As seen in this table, the DIBELS-recommended cut scores were among the lowest cut scores examined in the present study, and with the exception of the LR cut scores, the locally developed cut scores were generally higher. Notably, the LR procedure did not result in a viable fall DORF cut score (i.e., a cut score within the range of observed DORF scores for S1) for either the FCAT or the SAT-10. As shown in the table, cut scores developed via the two DA methods were similar, with a difference of no more than two points observed between cut points developed for the same assessment period and test.

Moreover, across statistical procedures (i.e., LR, DA, and ROC curve analysis), cut scores developed for the two achievement tests were fairly similar to one another. Generally, cut scores that were developed for these two tests via the same method were within 3-6 points of each other (with the exception of the ROC[.90] fall cut scores, for which there was a 15-point difference between the two DORF cut points developed for the SAT-10 and FCAT, respectively).

Table 3-3 displays descriptive statistics (i.e., means, standard deviations, and ranges) for the DORF, SAT-10, and FCAT for the entire sample as well as for each of the three subsamples. This table also displays descriptive statistics for statewide and national samples of second and third grade students who completed these measures during various years. As seen in the table, means for the three DORF administrations were fairly similar for S1 and S2; however, the three DORF means were somewhat higher for S3. While the reason for these apparent mean differences remains unclear, possible explanations include differences in the composition of the subsamples (e.g., representation of students with various exceptionalities), shifts in instructional practices that may have impacted the performance of students in S3 specifically, and differences in the DORF probes used across subsamples. Notably, the fall, winter, and spring DORF means for all three subsamples were higher than those reported for the DIBELS 6th Edition normative sample, although standard deviations for these samples were comparable to the normative sample (Cummings et al., 2011).

Regarding the FCAT and SAT-10, mean scores for these two tests appeared to be comparable for S1 and S3, although means for S2 were somewhat lower. Despite the apparent differences in these means across S1, S2, and S3, standard deviations appeared comparable for both tests, ranging from 54.5 to 55.9 for the FCAT and from 39.0 to 39.5 for the SAT-10 across subsamples. In a series of statewide samples assembled by the Human Resources Research Organization (2002), the mean FCAT scaled score was 301.4 (SD = 59.2), and between the years 2006 and 2011, mean scaled scores for the FCAT ranged from 309 to 314 (Florida Department of Education; FLDOE, 2011a). Thus, the mean FCAT scores observed for all subsamples in the present study were higher than those typically seen across the state of Florida. Similarly, in the standardization sample for the SAT-10, the mean and standard deviation of scaled scores for the Reading Comprehension subtest were 621.2 and 41.8, respectively. Again, mean scores for this subtest were somewhat higher in the present subsamples than in the national standardization sample.

Although national and statewide data available for some measures were limited, the score ranges for the DORF, FCAT, and SAT-10 observed in some of the subsamples may have been smaller than those typically seen in larger, more diverse samples. For example, as seen in the range of FCAT scores for S3, the lowest scaled score was 222, meaning that all students in this subsample achieved no lower than at or near a Level 2 on this test administration. In statewide samples, however, typically 14-19% of students achieve a Level 1 on this test (FLDOE, 2011a). Similarly, as seen in Table 3-3, the maximum DORF scores for S1, S2, and S3 were often more than 50 points lower than the maximum scores seen in the normative samples (Cummings et al., 2011). Nevertheless, given that the data in the present study were collected from one school only, it is unsurprising that the ranges of test scores in these subsamples were smaller than those typically seen in statewide and national samples.

Table 3-4 displays pass and fail rates for the FCAT and SAT-10 for the whole sample as well as for each of the three subsamples. As noted above, earning a passing score on the FCAT was defined as earning a score greater than or equal to 284 (i.e., Level 3 or above), while a passing score on the SAT-10 was defined as earning a score greater than or equal to 615. Across subsamples, between 76.1% and 84.0% of students passed the FCAT, while between 67.4% and 88.0% of students passed the SAT-10. Although pass rates for both tests were generally comparable across the three subsamples, percentages of passing scores were somewhat lower for S2, whose participants had notably lower mean FCAT and SAT-10 scores (as displayed in Table 3-3). Generally, the pass rates observed across these subsamples were higher than those seen in statewide and national samples. Between 2006 and 2011, approximately 69-75% of third grade students earned passing scores on the FCAT each year (FLDOE, 2011a). Moreover, in the standardization sample for the SAT-10, 63% of students in grade three earned scores greater than or equal to 615 (Harcourt Brace, 2003b).

Table 3-5 presents Pearson correlations among DORF, SAT-10, and FCAT scores for the whole sample as well as for S1, S2, and S3. As indicated in the table, correlations between all measures were statistically significant (p < .01). Regarding the magnitude of these relationships, Cohen (1992) classified correlation coefficients of .5 or greater as large, correlations of .3 as moderate, and correlations of .1 as small. By these standards, high correlation coefficients (ranging from .88 to .96) were observed between all DORF administrations for the entire sample as well as for all three subsamples. Notably, for all subsamples, the highest correlations were observed between consecutive DORF administrations (i.e., correlations between the fall and winter administrations and the winter and spring administrations, respectively, were the highest).

Moreover, high correlations were observed between scores on the SAT-10 and FCAT for all subsamples (approximately .78). These findings are generally consistent with previous research, which found correlations between SAT-10 and FCAT scores to range from .70 to .81 (Crist, 2001).

Correlations between the DORF measures and the two end-of-year achievement tests, respectively, were generally comparable to those observed in previous research. In a meta-analysis of studies examining the relationship between R-CBMs and statewide reading tests, Yeo (2010) estimated the predictive validity coefficient between these two types of measures to be .689. Moreover, a number of studies have indicated that correlations between R-CBMs and statewide reading tests generally ranged from .60 to .90, with higher correlations observed between CBMs and achievement tests that were administered closer in time (Wood, 2006). Thus, while correlations between the DORF and FCAT measures (which ranged from .49 to .77) and the DORF and SAT-10 measures (which ranged from .46 to .57) were generally high by Cohen's (1992) standards, they were somewhat lower than those reported in previous research. Given these observations, it should be reiterated that the ranges of SAT-10 and FCAT scores for some subsamples may have been relatively smaller than those typically seen in statewide and national samples, and as a result, correlation coefficients between the measures may have been attenuated. Additionally, it should be noted that correlations between the DORF and FCAT scores were comparable to or modestly higher than correlations between the DORF and SAT-10 scores.

Finally, as seen in previous research, correlations between the DORF measures and the two achievement tests generally increased as the length of time between their administrations decreased. For example, correlations between the fall DORF and FCAT measures were typically smaller than correlations between the winter DORF and FCAT measures, which were in turn smaller than the correlations observed between the spring DORF and FCAT scores. However, while this pattern applied to S1, S3, and the whole sample (for both the FCAT and the SAT-10), it did not apply to S2.

Evaluating the Long-term Diagnostic Accuracy of DORF Cut Scores: Results of Classification Agreement Analyses

The following sections of this chapter describe the results of the remaining analyses, which compare the long-term diagnostic accuracy of publisher-recommended and locally developed cut scores in predicting outcomes on the FCAT and the SAT-10. As discussed in previous chapters, schools differ in regard to their capacity to accommodate false positive and false negative decisions when identifying students who are at risk for poor reading outcomes. Therefore, they may also differ in their criteria for acceptable levels of sensitivity (i.e., the power of a screener to detect true positives) in using screening tools to predict state test performance. For example, in exploring the relationship between CBM and state test scores, Silberglitt and Hintze (2005) and LeBlanc and colleagues (2012) contended that ORF cut scores should exhibit sensitivity levels of .70 or greater, whereas Johnson and colleagues (2009) and Johnson and colleagues (2010) argued that a minimum criterion of .90 was more appropriate.

In addition to specifying minimum criteria for sensitivity, many researchers also stipulate criteria for specificity (i.e., the power of a screener to detect true negatives). A number of researchers have recommended a minimum specificity criterion of .70 or greater (Goffreda & DiPerna, 2009; Keller-Margulis et al., 2008; Roehrig et al., 2008; Silberglitt & Hintze, 2005). However, given that sensitivity and specificity are inversely related, it is often difficult to maintain high values of both simultaneously.

Because false negative errors may have more egregious consequences for both individual students and schools than do false positive errors (e.g., failure to provide critical intervention services to students who demonstrate need), it is arguably more important to ensure that CBM cut scores have adequate levels of sensitivity rather than specificity. Therefore, in the present analyses, strong consideration was given to each cut score's level of sensitivity. Given that the majority of the studies reviewed above specified a minimum sensitivity criterion of .70 or greater, this study used a minimum standard of .70 to evaluate the diagnostic accuracy of CBM cut scores. In other words, only ORF cut scores that maintained sensitivity levels of .70 and above across subsamples were considered to have retained their diagnostic accuracy over multiple years. For cut scores deemed to have acceptable levels of sensitivity across subsamples, additional CAIs, including specificity, PPP, NPP, and OCC, were also considered.

Research Question #1

Question. Which of the three statistical procedures (i.e., DA, LR, and ROC curve analysis) yielded cut scores with adequate diagnostic accuracy? Which of these methods produced cut scores that retained their diagnostic accuracy over multiple years (i.e., across subsamples)?

Tables 3-6, 3-7, and 3-8 display five CAIs (i.e., sensitivity, specificity, PPP, NPP, and OCC) for each set of 11 cut scores used to predict performance on the FCAT. Each of these three tables displays cut scores and CAIs for one of the three benchmark assessment periods (i.e., fall, winter, and spring). For each cut score, values of the CAIs are provided for S1, S2, and S3, respectively. Similarly, Tables 3-9, 3-10, and 3-11 display CAIs for cut scores developed to predict performance on the SAT-10. Again, each of these tables displays cut scores and CAIs for the fall, winter, and spring assessment periods, respectively.

Of the three methods used to develop the local cut scores (i.e., LR, DA, and ROC curve analysis), only two (DA and ROC curve analysis) produced cut scores that consistently met the minimum sensitivity criterion of .70 across subsamples. Often, the LR method did not yield viable cut scores for one or more of the three subsamples (i.e., these cut scores fell outside of the range of DORF scores observed for these subsamples). As indicated in Tables 3-6 and 3-9, the LR procedure did not produce viable fall DORF cut scores for either the FCAT or the SAT-10 and, for some subsamples, this method did not result in viable cut scores for the winter and spring DORF administrations either. For example, as shown in Tables 3-6, 3-7, and 3-8 (which display CAIs for the FCAT), the LR procedure did not produce viable cut scores for the winter and spring DORF administrations for S3. Similarly, as noted in Tables 3-9, 3-10, and 3-11 (which display CAIs for the SAT-10), LR cut scores were not viable for the winter DORF administrations in S2 and S3, nor were they viable for the spring administrations in S1 and S3.

When LR cut scores were viable, their corresponding sensitivity levels were exceptionally low across subsamples, ranging from .00 to .09. As indicated by these results, the LR cut scores failed to correctly identify many of the students who truly were at risk for poor performance on the FCAT and SAT-10. Conversely, however, values of specificity for these cut scores were very high (ranging from .99 to 1.00) across subsamples and assessment periods. In other words, the majority of students who truly were not at risk for poor end-of-year outcomes were correctly identified as not at risk by the cut scores produced via LR. When LR cut scores were viable, they correctly classified between 78% and 82% of students in predicting outcomes on the FCAT and between 70% and 82% of students in predicting outcomes on the SAT-10.


As mentioned previously, both the DA and ROC curve analysis methods resulted in cut scores that consistently predicted FCAT and SAT-10 performance with acceptable levels of sensitivity. Generally, values of the CAIs were fairly similar for cut scores generated by the two DA methods (one of which assumed equal prior probabilities of passing and failing the end-of-year tests, or DA[.50], and one of which incorporated prior probabilities from the calibration sample, or DA[CS]). As shown in Tables 3-6 through 3-11, the DA(.50) cut scores consistently exhibited sensitivity values that were greater than or equal to .70 across subsamples. Specifically, sensitivity values for these cut scores ranged from .70–1.00 across the three assessment periods for both end-of-year tests, meaning that between 70 and 100% of true positives were correctly identified as at risk. Similarly, for the DA(CS) cut scores, adequate levels of sensitivity were typically observed across assessment periods for the two achievement tests, with only two exceptions. As seen in Table 3-9, the fall DA(CS) cut scores developed to predict SAT-10 performance exhibited sensitivity levels of .67 when applied to S2 and S3, falling just below the criterion of .70. Given the relatively high levels of sensitivity observed for the two sets of DA cut scores, it is unsurprising that their levels of specificity were somewhat lower. For cut scores developed via both DA methods, values of specificity ranged from .51 to .82 across assessment periods and subsamples for the two end-of-year tests, with most of these values falling below .70. Thus, over multiple years, these cut scores typically correctly classified fewer than 70% of true negatives as "not at risk" for poor performance on the FCAT and SAT-10. Given that the two sets of DA cut scores exhibited promising levels of sensitivity, it is important to explore their diagnostic accuracy further by examining additional CAIs, including their NPP, PPP, and OCC. As mentioned previously, OCC refers to the overall percentage of


students who were correctly classified when the DORF cut scores were applied. For the DA(.50) cut scores, values of OCC ranged from .55–.84 across assessment periods and subsamples for the FCAT and SAT-10, and for the DA(CS) cut scores, they ranged from .55–.82. These results indicate that, generally, the two sets of DA cut scores resulted in the correct classification of between 55 and 84% of students across assessment periods and years for the two achievement tests. Finally, it is important also to examine values of PPP and NPP. Values of PPP indicate the probability that a student who was identified as at risk truly was at risk, whereas values of NPP indicate the probability that a student who was identified as not at risk truly was not at risk. Unlike the sensitivity, specificity, and OCC metrics, however, estimates of PPP and NPP are affected by base rates of the target condition (i.e., the percentage of cases who failed to meet performance criteria on the third-grade achievement tests) in each of the subsamples. When base rates of the target condition are relatively low (as in the present study), estimates of PPP tend to be lower, while estimates of NPP tend to be higher. Thus, caution should be exhibited in comparing values of PPP and NPP across subsamples with different base rates. As shown in Table 3-4, base rates of failing to meet the performance criterion were approximately 19%, 24%, and 16% for the FCAT and 18%, 33%, and 12% for the SAT-10 for S1, S2, and S3, respectively. For the DA(.50) cut scores, estimates of PPP ranged from .24 to .56 across assessment periods and subsamples for both end-of-year tests. These results indicate that the probability that an individual student who was identified as at risk on the DORF ultimately did not meet the performance criterion on either the FCAT or SAT-10 was often below chance (i.e., below 50%). Similarly, for the DA(CS) cut scores, these values ranged from .24 to .65. For both the DA(.50) and DA(CS) cut scores, values of PPP were highest for S2, which notably had the highest base


rates of failing to meet performance criteria on both the FCAT and SAT-10. Estimates of NPP were much higher than estimates of PPP for both sets of DA cut scores. Across assessment periods and subsamples for both end-of-year tests, values of NPP ranged from .83–1.00 for the DA(.50) cut scores and from .81–1.00 for the DA(CS) cut scores. Thus, the probability that a student who was identified as "not at risk" on the DORF ultimately met the performance criterion on the FCAT and SAT-10 was typically between 81 and 100%. Regarding the ROC curve analysis cut scores, nearly all of the cut scores developed via this method met the minimum sensitivity criterion across subsamples (i.e., over multiple years). Specifically, the ROC(.75), ROC(.80), ROC(.85), and ROC(.90) cut scores consistently maintained sensitivity levels of .70 or greater across subsamples. The ROC(.70) and ROC(Optimal) cut scores also generally maintained levels of sensitivity close to .70; however, for some assessment periods and subsamples, they did not meet this criterion. For example, the sensitivity level of the fall ROC(.70) cut score for predicting SAT-10 outcomes was .67 when applied to S2 and S3. Similarly, the ROC(Optimal) cut scores developed to predict SAT-10 performance had sensitivity levels below .70 for multiple assessment periods and subsamples. Generally, the ROC(.75), ROC(.80), ROC(.85), and ROC(.90) cut scores maintained their targeted levels of sensitivity over time. For example, the ROC(.80) cut scores consistently maintained levels of sensitivity greater than or equal to .80 across subsamples and assessment periods in predicting outcomes on the FCAT and SAT-10. As expected, however, cut scores with higher targeted sensitivity levels (e.g., the ROC[.85] and ROC[.90] cut scores) also had lower levels of specificity, as compared with cut scores with lower targeted levels of sensitivity. For example, values of specificity for the ROC(.70) cut scores ranged from .56–.86 across


assessment periods and subsamples, whereas they ranged from .32–.60 for the ROC(.90) cut scores. Regarding the remaining three CAIs, values of OCC varied for the set of cut scores developed via ROC curve analysis. It should be noted that the highest OCC values generally were observed for cut scores with lower targeted levels of sensitivity. For example, OCC values for the ROC(.70) cut scores ranged from .59–.86, whereas they ranged from .43–.66 for the ROC(.90) cut scores. For all cut scores developed via this method, values of PPP were relatively similar across assessment periods for the two tests, ranging from .22–.34 for S1, from .32–.71 for S2, and from .19–.50 for S3. Conversely, values of NPP were much higher and ranged from .88–.97 for S1, from .81–1.00 for S2, and from .95–1.00 for S3. In summary, only two of the three statistical procedures, DA and ROC curve analysis, produced cut scores that consistently met the minimum sensitivity criterion across multiple years. Regarding additional CAIs, the DA methods classified between 55 and 84% of students correctly across years. Values of OCC were similar for the set of cut scores developed via ROC curve analysis, with cut scores with lower targeted levels of sensitivity generally having higher rates of correct classification. Values of PPP and NPP varied for both the DA and ROC curve analysis cut scores; however, for both techniques, values of NPP were much higher than values of PPP. Collectively, these results suggest that, of the three sets of locally-developed cut scores (i.e., the LR, DA, and ROC curve analysis cut scores), the DA and ROC curve analysis cut scores best retained their diagnostic accuracy over multiple years.

Research Question #2

Question. How does the long-term classification accuracy of the DIBELS-recommended cut scores compare with that of locally-developed cut scores?


As noted above, Tables 3-6 through 3-11 display CAIs for the DIBELS-recommended "at-risk" and "some-risk" cut scores for all assessment periods, subsamples, and achievement tests. As indicated in Tables 3-6 and 3-9, the fall DIBELS "at-risk" cut score fell below the range of observed DORF scores for S3 and therefore was not a viable cut point for this subsample. Results of the CAI analyses revealed that sensitivity levels for the DIBELS-recommended "at-risk" cut scores ranged from .03–.63 across assessment periods and subsamples, indicating that none of the fall, winter, and spring "at-risk" cut scores met the minimum sensitivity criterion for predicting SAT-10 and FCAT performance. These cut scores did, however, have consistently high levels of specificity across assessment periods and subsamples, ranging from .87–1.00. The "some-risk" cut scores had somewhat higher levels of sensitivity, ranging from .23 to 1.00 across assessment periods and subsamples. Nevertheless, the sensitivity levels for these cut scores often fell below .70, especially when applied to the calibration sample (S1). Specificity levels for the "some-risk" cut scores were typically above .70 and ranged from .66–.91. Nevertheless, as discussed previously, sensitivity levels were much higher for the local cut scores, specifically the ones developed via DA and ROC curve analysis. For both the DA and ROC curve analysis cut scores, values of sensitivity ranged from .63–1.00 across subsamples, assessment periods, and achievement tests, meaning that between 63 and 100% of students who ultimately did not pass the FCAT or SAT-10 in third grade were correctly identified as at risk in second grade. Conversely, values of specificity were generally lower for the locally-developed cut scores than for the DIBELS-recommended cut scores. Specifically, for the two sets of DA cut


scores, specificity levels ranged from .51–.82 across assessment periods and subsamples, whereas for the ROC curve analysis cut scores, they ranged from .32–.86. Although the two types of DIBELS-recommended cut scores had lower levels of sensitivity, as compared with the DA and ROC curve analysis cut scores, they had among the highest levels of OCC. For the "at-risk" cut scores, values of OCC ranged from .74–.90 across assessment periods and subsamples, meaning that between 74 and 90% of students were correctly classified over multiple years. Notably, most of these OCC values fell toward the upper end of this range. For the "some-risk" cut scores, values of OCC ranged from .68–.88, indicating that approximately 68–88% of students were correctly classified when these cut scores were applied. In comparison, OCC values were somewhat lower for the DA and ROC curve analysis cut scores. As noted previously, values of OCC ranged from .55–.84 for the DA(.50) cut scores and from .55–.82 for the DA(CS) cut scores. For the ROC curve analysis cut scores, values of OCC were more variable, with higher values observed for cut scores that had lower targeted levels of sensitivity (e.g., the ROC[.70] and ROC[.75] cut scores). For both the DA and ROC curve analysis cut scores, most OCC values fell below .80. Thus, generally, it appeared that the DIBELS-recommended cut scores (i.e., both the "at-risk" and "some-risk" cut scores) correctly classified larger percentages of students across years, as compared with the DA and ROC curve analysis cut scores. Overall, these results suggest that, while the DIBELS-recommended cut scores typically had lower levels of sensitivity than the ROC curve analysis and DA cut scores, they generally had higher levels of specificity and OCC. This observation may be attributable to the fact that the DIBELS-recommended cut scores were lower than the DA and ROC curve analysis cut scores and therefore identified fewer students as at risk (as compared with the local


cut scores). Since relatively few students truly were at risk for poor performance on the FCAT and SAT-10 (i.e., approximately 16–24% for the FCAT and 12–33% for the SAT-10), the DIBELS cut scores had higher OCC rates. At the same time, however, because these cut scores were also more conservative in identifying students as at risk, they had lower levels of sensitivity and failed to detect many true positives. Thus, while the DIBELS-recommended cut scores correctly classified greater numbers of students overall, they failed to identify many of the students who truly demonstrated need of additional supports in second grade.

Research Question #3

Question. Do cut scores developed to predict performance on state-developed achievement tests and nationally normed achievement tests differ with respect to their long-term diagnostic accuracy?

Overall, the results suggested that the classification accuracy of local DORF cut scores developed to predict performance on a statewide, criterion-referenced test was similar to that of cut scores developed to predict performance on a nationally developed, norm-referenced test. As noted previously, the LR procedure consistently produced DORF cut scores that were nonviable or that had sensitivity levels below .70. Thus, in predicting outcomes for both the FCAT and SAT-10, these cut scores consistently did not meet the minimum sensitivity criterion across subsamples. The DA and ROC curve analysis cut scores, however, did predict outcomes adequately for both tests over multiple years. Both the DA(.50) and DA(CS) cut scores generally maintained sensitivity levels of .70 or greater in predicting outcomes on the SAT-10 and FCAT, with only two exceptions. Specifically, these exceptions concerned the fall DA(CS) cut scores developed to predict SAT-10 performance, which had sensitivity levels of .67 when applied to S2 and S3. For both sets of DA cut scores, sensitivity values ranged from .67–1.00 in predicting SAT-10 performance and from .72–1.00 in predicting FCAT performance, across assessment periods and subsamples. Similarly, specificity levels for the DA cut scores were also


comparable for these two tests and ranged from .51–.82 in predicting both FCAT and SAT-10 performance. Values of OCC for the DA cut scores were also similar for the FCAT and SAT-10. In predicting FCAT performance, values of OCC ranged from .55–.84, whereas in predicting SAT-10 outcomes, they ranged from .55–.80. Cut scores developed via ROC curve analysis also appeared to have comparable classification accuracy in predicting outcomes on the FCAT and the SAT-10. In predicting performance for both of these end-of-year tests, each of the ROC(.75), ROC(.80), ROC(.85), and ROC(.90) cut scores consistently met the minimum sensitivity criterion of .70 across assessment periods and subsamples. Moreover, cut scores developed to predict performance on both tests typically maintained their targeted sensitivity levels over time (e.g., the ROC[.80] cut scores maintained sensitivity levels greater than or equal to .80 in predicting FCAT and SAT-10 outcomes over multiple years). Additionally, the ROC(.70) fall, winter, and spring cut scores consistently met the minimum sensitivity criterion in predicting FCAT outcomes; however, in predicting SAT-10 performance, only the winter and spring ROC(.70) cut scores consistently met this criterion. Overall, values of sensitivity for the ROC curve analysis cut scores ranged from .66–1.00 in predicting FCAT outcomes and from .63–1.00 in predicting SAT-10 outcomes. Specificity levels were also comparable for the ROC curve analysis cut scores developed for these two tests, ranging from .33–.84 for the SAT-10 and from .32–.86 for the FCAT, respectively. Finally, values of OCC were similar between the ROC curve analysis cut scores developed for these tests and ranged from .43–.86 for the FCAT and from .43–.82 for the SAT-10. In summary, while the LR cut scores did not appear to have adequate classification accuracy for either test, both the DA and ROC curve analysis cut scores consistently met (or


nearly met) the minimum sensitivity criterion in predicting outcomes on both tests over multiple years. Moreover, ranges of sensitivity, specificity, and OCC values for the DA and ROC curve analysis cut scores were similar for the FCAT and SAT-10, across assessment periods and subsamples. Collectively, these results indicate that DORF cut scores developed for the FCAT and the SAT-10 had similar levels of classification accuracy over multiple years.
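As summarized above, the ROC(.75) through ROC(.90) cut scores were selected so that sensitivity in the calibration sample met a targeted level. A minimal sketch of that selection logic appears below, using hypothetical data; it scans candidate cut points and keeps the lowest one whose calibration-sample sensitivity reaches the target, which is the general idea behind such targeted cut scores rather than the study's exact procedure.

```python
# Illustrative sketch (hypothetical data): choosing the lowest DORF cut score
# whose calibration-sample sensitivity meets a targeted level (e.g., .80),
# in the spirit of the ROC(.80) cut scores described above.

def sensitivity(scores, failed, cut):
    tp = sum(1 for s, f in zip(scores, failed) if f and s < cut)
    fn = sum(1 for s, f in zip(scores, failed) if f and s >= cut)
    return tp / (tp + fn) if (tp + fn) else 0.0

def cut_for_target_sensitivity(scores, failed, target=0.80):
    # Scan candidate cut points from lowest to highest observed score and
    # return the lowest one that reaches the target, i.e., the cut that
    # flags as few students as possible while meeting the sensitivity goal.
    for cut in sorted(set(scores)):
        if sensitivity(scores, failed, cut) >= target:
            return cut
    return None   # no viable cut point within the observed range

scores = [45, 60, 72, 80, 88, 95, 110, 130]       # hypothetical WCPM values
failed = [True, True, False, True, False, False, False, False]
print(cut_for_target_sensitivity(scores, failed, target=0.80))
```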


71 Table 3 1: Percentages of Students in Various Demographic Categories by Sample Demographic Whole Sample (n = 266) Calibration Sample (S1; n = 170) First Cross Validation Sample (S2; n = 46) Second Cross Validation Sample (S3; n = 50) Gender Male 50.0 51.6 44.4 50.0 Female 50.0 48.4 55.5 50.0 Race/Ethnicity Caucasian 45.8 47.1 40.0 46.8 African American 21.8 21.7 27.5 17.0 Hispanic 17.3 17.4 12.5 21.3 Asian American 3.1 2.2 5.0 4.3 Native American 0.9 0.7 2.5 0.0 Multiracial 11.1 10.9 12.5 10.6 Exceptionality Specific Learning Disability 7.5 11.6 2.4 0.0 Language Impairment 2.2 1.4 2.4 4.3 Emotional/Behavioral Disability 0.4 0.7 0.0 0.0 Other Health Impaired 0.9 0.0 4.9 0.0 504 Plan 6.2 5.1 9.8 6.4 Gifted 17.7 25.4 7.3 4.3 Multiple Exceptionalities 3.1 4.3 0.0 2.1 Lunch Status Paid Status 73.3 74.2 68.1 75.0 Free or Reduced Price Lunch 26.7 25.8 31.2 25.0


Table 3-2: Publisher-recommended and Locally-Developed Cut Scores for the Dynamic Indicators of Basic Early Literacy Skills (DIBELS) Oral Reading Fluency Measure

Cut Score                    FCAT: Fall / Winter / Spring    SAT-10: Fall / Winter / Spring
DIBELS-recommended
  At risk                          26 / 52 / 70                    26 / 52 / 70
  Some risk                        44 / 68 / 90                    44 / 68 / 90
Discriminant Analysis (DA)
  DA(.50)                          67 / 81 / 100                   67 / 81 / 101
  DA(CS)                           66 / 79 / 99                    65 / 79 / 99
Logistic Regression               N/A / 28 / 47                   N/A / 22 / 38
Receiver Operating Characteristic (ROC) Curve Analysis
  ROC(.70)                         62 / 75 / 98                    62 / 77 / 98
  ROC(.75)                         69 / 76 / 102                   69 / 78 / 103
  ROC(.80)                         74 / 78 / 103                   69 / 84 / 103
  ROC(.85)                         80 / 99 / 104                   79 / 99 / 104
  ROC(.90)                         95 / 105 / 115                  80 / 105 / 115
  ROC(Optimal)                     62 / 75 / 94                    62 / 76 / 97

Note. FCAT = Florida Comprehensive Assessment Test; SAT-10 = Stanford Achievement Test, 10th Edition. Values are expressed in terms of words read correctly per minute (WCPM).


Table 3-3: Descriptive Statistics for DORF, FCAT, and SAT-10 Performance Across Samples

Sample/Measure: Mean, Standard Deviation, Range
Whole Sample (n = 266)
  Fall DORF: 76.6, 37.7, 12-183
  Winter DORF: 90.4, 37.3, 15-201
  Spring DORF: 109.0, 37.1, 33-218
  FCAT: 332.0, 55.0, 147-500
  SAT-10: 644.0, 39.9, 532-762
S1 (n = 170)
  Fall DORF: 73.5, 36.4, 21-181
  Winter DORF: 88.8, 35.3, 15-189
  Spring DORF: 108.7, 36.6, 42-218
  FCAT: 333.4, 54.6, 147-500
  SAT-10: 647.0, 39.5, 545-762
S2 (n = 46)
  Fall DORF: 72.7, 38.2, 12-166
  Winter DORF: 86.7, 39.2, 26-184
  Spring DORF: 101.6, 35.9, 33-187
  FCAT: 318.2, 54.5, 201-437
  SAT-10: 627.6, 39.0, 549-699
S3 (n = 50)
  Fall DORF: 90.6, 39.3, 27-183
  Winter DORF: 99.4, 41.4, 32-201
  Spring DORF: 116.7, 38.9, 48-218
  FCAT: 340.0, 55.9, 222-500
  SAT-10: 648.6, 39.3, 532-762
DIBELS, 6th Edition (Normative Sample)*
  Fall DORF (n = 637,017): 56.37, 33.43, 0-256
  Winter DORF (n = 615,480): 84.94, 37.81, 0-275
  Spring DORF (n = 608,782): 98.13, 37.77, 0-247
FCAT Statewide Sample (n = 4,645)**: 301.4, 59.2, N/A
SAT-10 Standardization Sample (n = 4,615)***: 621.2, 41.8, N/A

Note. DORF = Dynamic Indicators of Basic Early Literacy Skills Oral Reading Fluency Measure; DIBELS = Dynamic Indicators of Basic Early Literacy Skills; FCAT = Florida Comprehensive Assessment Test; SAT-10 = Stanford Achievement Test, 10th Edition; S1 = calibration sample; S2 = first cross-validation sample; S3 = second cross-validation sample. Means, standard deviations, and ranges for the DORF are expressed in terms of words read correctly per minute (WCPM). Means, standard deviations, and ranges for the FCAT and SAT-10 represent scaled scores. Score ranges for the FCAT statewide sample and the SAT-10 standardization sample were not available to the author. *Data as reported in Cummings et al. (2011); **Data as reported by the Human Resources Research Organization (2002); ***Data as reported by Harcourt Assessment (2004).


Table 3-4: Pass and Fail Rates for the FCAT and SAT-10 Across Subsamples

Sample/Test: n Pass, Percentage (Pass), n Fail, Percentage (Fail)
Whole Sample (n = 266)
  FCAT: 215, 80.8%, 51, 19.2%
  SAT-10: 215, 80.8%, 51, 19.2%
S1 (n = 170)
  FCAT: 138, 81.2%, 32, 18.8%
  SAT-10: 140, 82.4%, 30, 17.6%
S2 (n = 46)
  FCAT: 35, 76.1%, 11, 23.9%
  SAT-10: 31, 67.4%, 15, 32.6%
S3 (n = 50)
  FCAT: 42, 84.0%, 8, 16.0%
  SAT-10: 44, 88.0%, 6, 12.0%

Note. FCAT = Florida Comprehensive Assessment Test; SAT-10 = Stanford Achievement Test, 10th Edition; S1 = calibration sample; S2 = first cross-validation sample; S3 = second cross-validation sample. n Pass indicates the number of participants who passed the standardized test; n Fail indicates the number of participants who failed the standardized test.
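Because the base rates in Table 3-4 range from roughly 12% to 33%, the dependence of PPP and NPP on the base rate can be illustrated directly. The short sketch below holds sensitivity and specificity constant at hypothetical values (chosen only to be in the general neighborhood of the DA cut scores reported in this chapter) and recomputes PPP and NPP at several base rates.

```python
# Illustrative sketch: how the base rate of failing the outcome test changes
# PPP and NPP when sensitivity and specificity are held fixed. The .80
# sensitivity and .60 specificity values are hypothetical.

def ppp_npp(sens, spec, base_rate):
    tp = sens * base_rate              # expected proportion of true positives
    fn = (1 - sens) * base_rate        # false negatives
    fp = (1 - spec) * (1 - base_rate)  # false positives
    tn = spec * (1 - base_rate)        # true negatives
    return tp / (tp + fp), tn / (tn + fn)

for base_rate in (0.12, 0.19, 0.33):   # roughly the span shown in Table 3-4
    ppp, npp = ppp_npp(sens=0.80, spec=0.60, base_rate=base_rate)
    print(f"base rate {base_rate:.2f}: PPP = {ppp:.2f}, NPP = {npp:.2f}")
```

As the output shows, lower base rates pull PPP down and push NPP up even when the screener itself is unchanged, which is consistent with the pattern of low PPP and high NPP reported across cut scores in this study.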


75 Table 3 5: Pearson Correlations among DORF, FCAT, and SAT 10 Measures Fall ORF Winter ORF Spring ORF FCAT SAT 10 Whole Sample Fall DORF Winter DORF .91* Spring DORF .88* .94* FCAT .55* .59* .62* SAT 10 .46* .51* .55* .78* S1 Fall DORF Winter DORF .89* Spring DORF .88* .92* FCAT .49* .51* .55* SAT 10 .46* .49* .53* .78* S2 Fall DORF Winter DORF .93* Spring DORF .89* .96* FCAT .72* .77* .74* SAT 10 .51* .57* .56* .78* S3 Fall DORF Winter DORF .95* Spring DORF .90* .96* FCAT .61* .67* .68* SAT 10 .46* .53* .57* .78* Note. DORF = Dynamic Indicators of Basic Early Literacy Skills Oral Reading Fluency Measure; FCAT = Florida Comprehensive Assessment Test; SAT 10 = Stanford Achievement Test, 10 th Edition; S1 = Calibration sample; S2 = First cross validation sampl e; S3 = Second cross validation sample. *Denotes significance at p < .01


76 Table 3 6: Fall Publisher recommended and Local DORF Cut Scores for Predicting FCAT Performance: Results of Classification Agreement Analyses Cut Score Sensitivity Specificity Positive Predictive Power Negative Predictive Power Overall Correct Classification DIBELS At Risk .03/.27/NA .99/1.00/NA .33/1.00/NA .81/.81/NA .81/.83/NA DIBELS Some Risk .25/.64/.50 .80/.80/.91 .23/.50/.50 .82/.88/.91 .70/.76/.86 DA(.50) .72/1.00/.88 .51/.69/.81 .25/.50/.47 .89/1.00/.97 .55/.76/.84 DA(CS) .72/1.00/.75 .51/.71/.81 .25/.52/.43 .89/1.00/.96 .55/.78/.82 LR N/A N/A N/A N/A N/A ROC(.70) .72/.91/.75 .57/.74/.86 .28/.53/.50 .90/.96/.95 .60/.78/.86 ROC(.75) .78/1.00/.88 .48/.57/.79 .28/.42/.44 .90/1.00/.97 .54/.67/.82 ROC(.80) .81/1.00/.88 .45/.51/.72 .25/.39/.37 .91/1.00/.97 .52/.63/.76 ROC(.85) .88/1.00/.88 .41/.40/.58 .25/.34/.28 .93/1.00/.96 .49/.54/.64 ROC(.90) .91/1.00/1.00 .32/.34/.47 .24/.32/.26 .94/1.00/1.00 .43/.50/.56 ROC(Optimal) .72/.91/.75 .57/.74/.86 .28/.53/.50 .90/.96/.95 .60/.78/.86 N ote. DA = Discriminant analysis; DIBELS = Dynamic Indicators of Basic Early Literacy Skills; DORF = Dynamic Indicators of Basic Early Literacy Skills Oral Reading Fluency Measure; FCAT = Florida Comprehensive Assessment Test; ROC = Receiver operating character istic curve analysis. In each cell, values of classification agreement indicators are displayed for all three subsamples and are recorded in the fo llowing format: S1/S2/S3. S1, S2, and S3 represent the calibration, first cross validation, and second cross validation subsamples, respectively.


77 Table 3 7: Winter Publisher recommended and Local DORF Cut Scores for Predicting FCAT Performance: Results of Classification Agreement Analyses Cut Score Sensitivity Specificity Positive Predictive Power Negative Predictive Power Overall Correct Classification DIBELS At Risk .16/.63/.50 .91/.89/.98 .28/.64/.80 .82/.89/.91 .76/.83/.90 DIBELS Some Risk .53/.91/.88 .76/.77/.88 .34/.56/.58 .88/.96/.97 .72/.80/.88 DA(.50) .84/1.00/1.00 .60/.63/.71 .33/.46/.40 .94/1.00/1.00 .65/.72/.76 DA(CS) .84/1.00/1.00 .62/.63/.76 .34/.46/.44 .94/1.00/1.00 .66/.72/.80 LR .00/.09/NA .99/1.00/NA .0 0/1.00/NA .81/.78/NA .81/.78/NA ROC(.70) .72/1.00/.88 .67/.69/.81 .34/.50/.47 .91/1.00/.97 .68/.76/.82 ROC(.75) .75/1.00/.88 .66/.69/.81 .34/.50/.47 .92/1.00/.97 .68/.76/.82 ROC(.80) .81/1.00/1.00 .62/.66/.79 .33/.48/.47 .93/1.00/1.00 .66/.74/.82 ROC(.85) .88/1.00/1.00 .37/.49/.48 .24/.38/.27 .93/1.00/1.00 .46/.61/.56 ROC(.90) .91/1.00/1.00 .33/.43/.43 .24/.35/.25 .94/1.00/1.00 .44/.57/.52 ROC(Optimal) .72/1.00/.88 .67/.69/.81 .34/.50/.47 .91/1.00/.97 .68/.76/.82 Not e. DA = Discriminant analysis; DIBELS = Dynamic Indicators of Basic Early Literacy Skills; DORF = Dynamic Indicators of Basic Early Literacy Skills Oral Reading Fluency Measure; FCAT = Florida Comprehensive Assessment Test; ROC = Receiver operating characteristic curve analysis. In each cell, values of classification agreement indicators are displayed for all three subsamples and are r ecorded in the following format: S1/S2/S3. S1, S2, and S3 represent the calibration, first cross validation, and second cross validation subsamples, respectively.


78 Table 3 8: Spring Publisher recommended and Local DORF Cut Scores for Predicting FCAT Perfor mance: Results of Classification Agreement Analyses Cut Score Sensitivity Specificity Positive Predictive Power Negative Predictive Power Overall Correct Classification DIBELS At Risk .19/.45/.25 .91/.94/.93 .32/.71/.40 .83/.85/.87 .77/.83/.82 DIBELS Some Risk .53/.91/1.00 .73/.66/.81 .31/.45/.50 .87/.96/1.00 .69/.72/.84 DA(.50) .72/1.00/1.00 .56/.54/.74 .27/.50/.42 .90/.83/1.00 .59/.67/.78 DA(CS) .72/1.00/1.00 .58/.57/.74 .28/.50/.42 .90/.81/1.00 .61/.67/.78 LR .03/.09/NA 1.00/1.00/NA 1.00/1.00/NA .82/.78/NA .82/.78/NA ROC(.70) .72/1.00/1.00 .58/.57/.76 .28/.53/.44 .90/.81/1.00 .61/.70/.80 ROC(.75) .75/1.00/1.00 .53/.54/.71 .27/.58/.40 .90/.90/1.00 .57/.67/.76 ROC(.80) .81/1.00/1.00 .52/.51/.71 .28/.58/.40 .92/.90/1.00 .58/.67/.76 ROC(.85) .88/1.00/1.00 .51/.51/.71 .29/.45/.40 .95/1.00/1.00 .58/.63/.76 ROC(.90) .94/1.00/1.00 .44/.43/.60 .28/.45/.32 .97/1.00/1.00 .53/.63/.66 ROC(Optimal) .66/.91/1.00 .65/.60/.76 .30/.42/.44 .89/.95/1.00 .65/.67/.80 Note . DA = Discriminant analysis; DIBELS = Dynamic Indicators of Basic Early Literacy Skills; DORF = Dynamic Indicators of Basic Early Literacy Skills Oral Reading Fluency Measure; FCAT = Florida Comprehensive Assessment Test; ROC = Receiver operating character istic curve analysis. In each cell, values of classification agreement indicators are displayed for all three subsamples and are recorded in the fo llowing format: S1/S2/S3. S1, S2, and S3 represent the calibration, first cross validation, and second cross validation subsamples, respectively.


79 Table 3 9: Fall Publisher recommended and Local DORF Cut Scores for Predicting SAT 10 Performance: Results of Classification Agreement Analyses Cut Score Sensitivity Specificity Positive Predictive Power Negative Predi ctive Power Overall Correct Classification DIBELS At Risk .03/.20/NA .99/1.00/NA .33/1.00/NA .83/.72/NA .82/.74/NA DIBELS Some Risk .23/.53/.33 .80/.81/.89 .20/.57/.29 .83/.78/.91 .70/.72/.82 DA(.50) .73/.73/.83 .51/.65/.80 .24/.50/.36 .90/.83/.97 .55/.67/.80 DA(CS) .73/.67/.67 .53/.68/.82 .25/.50/.33 .90/.81/.97 .56/.67/.80 LR N/A N/A N/A N/A N/A ROC(.70) .70/.67/.67 .56/.71/.84 .26/./53/.36 .90/.81/.95 .59/.70/.82 ROC(.75) .80/.87/.83 .48/.58/.77 .25/.50/.33 .92/.90/.97 .53/.67/.78 ROC(.80) .80/.87/.83 .48/.58/.77 .25/.50/.33 .92/.90/.97 .53/.67/.78 ROC(.85) .87/1.00/.83 .41/.45/.61 .24/.47/.23 .94/1.00/.96 .49/.63/.64 ROC(.90) .90/1.00/.83 .41/.45/.57 .25/.47/.21 .95/1.00/.96 .49/.63/.60 ROC(Optimal) .70/.67/.67 .56/.71/.84 .26/.53/.36 .90/.81/.95 .49/.70/.82 Note. DA = Discriminant analysis; DIBELS = Dynamic Indicators of Basic Early Literacy Skills; DORF = Dynamic Indicators of Basic Early Literacy Skills Oral Reading Fluency Measure; ROC = Receiver operating characteristic curve analysis; SAT 10 = Stanford Achiev ement Test, 10 th Edition. In each cell, values of classification agreement indicators are displayed for all three subsamples and are recorded in the fo llowing format: S1/S2/S3. S1, S2, and S3 represent the calibration, first cross validation, and second cr oss validation subsamples, respectively.


80 Table 3 10: Winter Publisher recommended and Local DORF Cut Scores for Predicting SAT 10 Performance: Results of Classification Agreement Analyses Cut Score Sensitivity Specificity Positive Predictive Power Negativ e Predictive Power Overall Correct Classification DIBELS At Risk .20/.47/.33 .91/.87/.93 .33/.64/.40 .84/.77/.40 .79/.74/.86 DIBELS Some Risk .50/.73/.83 .75/.77/.84 .30/.61/.42 .88/.86/.42 .71/.76/.84 DA(.50) .77/.87/.83 .58/.65/.66 .28/.54/.25 .92/.91/.97 .61/.71/.68 DA(CS) .77/.87/.83 .59/.65/.70 .24/.54/.28 .93/.91/.97 .61/.71/.72 LR .00/NA/NA .99/NA/NA .00/NA/NA .82/NA/NA .82/NA/NA ROC(.70) .73/.87/.83 .62/.68/.77 .29/.68/.33 .92/.91/.97 .64/.74/.78 ROC(.75) .77/.87/.83 .61/.68/.73 .29/.57/.29 .92/.91/.97 .64/.74/.74 ROC(.80) .80/.87/.83 .54/.65/.61 .27/.54/.23 .93/.91/.96 .58/.72/.64 ROC(.85) .87/1.00/1.00 .36/.55/.45 .23/.52/.20 .93/1.00/1.00 .45/.70/.52 ROC(.90) .90/1.00/1.00 .33/.48/.41 .22/.48/.19 .94/1.00/1.00 .43/.65/.48 ROC(Optimal) .67/.87/.83 .64/.71/.77 .25/.71/.33 .93/.92/.97 .64/.76/.78 Note . DA = Discriminant analysis; DIBELS = Dynamic Indicators of Basic Early Literacy Skills; DORF = Dynamic Indicators of Basic Early Literacy Skills Oral Reading Fluency Measure; ROC = Receiver operating characteristic curve analysis; SAT 10 = Stanford Achiev ement Test, 10 th Edition. In each cell, values of classification agreement indicators are displayed for all three subsamples and are recorded in the fo llowing format: S1/S2/S3. S1, S2, and S3 represent the calibration, first cross validation, and second cr oss validation subsamples, respectively.


81 Table 3 11: Spring Publisher recommended and Local DORF Cut Scores for Predicting SAT 10 Performance: Results of Classification Agreement Analyses Cut Score Sensitivity Specificity Positive Predictive Power Negative Predictive Power Overall Correct Classification DIBELS At Risk .17/.33/.17 .90/.94/.91 .26/.71/.20 .83/.74/.89 .77/.74/.82 DIBELS Some Risk .50/.80/.83 .72/.68/.75 .28/.54/.31 .87/.88/.97 .68/.72/.76 DA(.50) .70/1.00/.83 .54/.61/.66 .24/.56/.25 .90/1.00/.97 .56/.74/.68 DA(CS) .70/1.00/.83 .57/.64/.66 .26/.65/.26 .90/1.00/.97 .59/.76/.70 LR NA/.07/NA NA/1.00/NA NA/1.00/NA NA/.69/NA NA/.70/NA ROC(.70) .70/1.00/.83 .57/.65/.70 .26/.58/.28 .90/1.00/.97 .59/.76/.72 ROC(.75) .80/1.00/.83 .51/.58/.66 .26/.54/.25 .92/1.00/.97 .56/.72/.68 ROC(.80) .80/1.00/.83 .51/.58/.66 .26/.54/.25 .92/1.00/.97 .56/.72/.68 ROC(.85) .87/1.00/.83 .50/.58/.66 .27/.54/.25 .95/1.00/.97 .56/.72/.68 ROC(.90) .93/1.00/1.00 .44/.48/.57 .26/.48/.24 .97/1.00/1.00 .52/.65/.62 ROC(Optimal) .63/1.00/.83 .59/.65/.70 .25/.58/.28 .88/1.00/.97 .60/.76/.72 Note . DA = Discriminant analysis; DIBELS = Dynamic Indicators of Basic Early Literacy Skills; DORF = Dynamic Indicators of Basic Early Literacy Skills Oral Reading Fluency Measure; ROC = Receiver operating characteristic curve analysis; SAT 10 = Stanford Achiev ement Test, 10 th Edition. In each cell, values of classification agreement indicators are displayed for all three subsamples and are recorded in the fo llowing format: S1/S2/S3. S1, S2, and S3 represent the calibration, first cross validation, and second cr oss validation subsamples, respectively.


CHAPTER 4
DISCUSSION

The primary purpose of this study was to investigate the decision-making utility of locally-developed CBM cut scores for identifying students in the primary grades who may be at risk for later reading difficulties. More specifically, this study compared the long-term classification accuracy of local DORF cut scores generated via three statistical methods: LR, DA, and ROC curve analysis. This study also compared the long-term diagnostic accuracy of each of these sets of local cut scores with that of the "at-risk" and "some-risk" cut scores recommended by the DIBELS, 6th Edition. Finally, the classification accuracy of local DORF cut scores developed to predict performance on two high-stakes reading tests (i.e., the FCAT and SAT-10) was compared.

Participants were 266 second-grade students from a range of racial, ethnic, socioeconomic, and disability backgrounds, although this sample was notably more homogeneous than student populations typically represented statewide and nationwide. Moreover, their performance on the three DORF administrations and two end-of-year achievement tests was higher than is usually observed in larger, statewide and national samples. Given the non-representativeness of this sample, it was hypothesized that the local cut scores would have greater decision-making utility in this particular setting than would the DIBELS-recommended cut scores (which were developed from a more diverse, nationally representative sample).

In comparing each of the sets of cut scores, the LR and DIBELS-recommended cut scores generally had the lowest values. This observation is consistent with findings from previous research, which also found that cut scores developed via ROC curve analysis and DA had higher values than LR and publisher-recommended cut scores (Goffreda et al., 2009; Silberglitt & Hintze, 2005). In fact, several of the DIBELS-recommended and LR cut score values were too


low to serve as viable benchmarks for some subsamples. Generally, values of the cut scores developed to predict performance on the FCAT and SAT-10, respectively, were similar to one another. This similarity may have been partially attributable to the fact that the SAT-10 performance criterion was adjusted such that equal percentages of students in S1 passed the two achievement tests.

With regard to classification accuracy, only the DA and ROC curve analysis procedures produced cut scores that consistently maintained the minimum acceptable level of sensitivity specified for this study (i.e., .70 or greater). Notably, the two sets of DA cut scores (i.e., the DA[.50] and DA[CS] cut scores) were similar both in value and in their degree of classification accuracy, despite differences in the prior probability values that were used to develop them. Although no conspicuous differences between these methods emerged, further research is needed to determine if and when altering prior probability values for the DA method significantly impacts the diagnostic accuracy of cut scores developed via this procedure. Regarding the use of ROC curve analysis, cut scores developed via this method consistently maintained the minimum levels of sensitivity specified for their development, across multiple years. For example, the ROC(.80) cut scores maintained sensitivity levels of .80 or higher when applied to each of the three subsamples. Conversely, when the LR and DIBELS-recommended cut scores were viable, they had exceptionally low sensitivity levels, indicating that they identified very few of the students who were truly at risk for low reading achievement. Nevertheless, while these cut scores had lower levels of sensitivity (as compared with the DA and ROC curve analysis cut scores), they consistently had the highest levels of specificity and OCC. Thus, while the DA and ROC curve analysis cut scores correctly identified more of the


truly at-risk students, the LR and DIBELS-recommended cut scores correctly classified greater numbers of students overall.

Across all sets of cut scores, low levels of PPP and high levels of NPP were observed. As mentioned previously, base rates of the target condition (i.e., the percentages of students who failed to meet grade-level standards on the FCAT and SAT-10, respectively) can influence estimates of PPP and NPP. In the present study, relatively few participants failed to meet performance criteria on the FCAT or SAT-10, meaning that the prevalence of the target condition was fairly low (occurring for approximately 19% of participants in the entire sample). Most likely, these relatively low base rates contributed to the low levels of PPP and high levels of NPP observed in this study. As compared with the PPP and NPP metrics, it should be noted that sensitivity and specificity values may be more pertinent for informing the selection of CBM screening cut scores. While PPP and NPP indicate the accuracy of a prediction (i.e., the likelihood that an individual screening result is accurate), sensitivity and specificity indicate the percentages of truly at-risk and truly not-at-risk students who are correctly identified in a given sample. As a result, estimates of sensitivity and specificity may be more relevant for selecting screening decision rules for use with a specific student population, whereas PPP and NPP metrics may be more useful for interpreting individual test findings (VanDerHeyden, 2010a).

Finally, the results of this study indicated that CBM cut scores developed for the two third-grade achievement tests had similar levels of classification accuracy over multiple years, which suggests that they functioned comparably in predicting risk status for both. Moreover, in general, correlations between the DORF and FCAT (i.e., the state-developed test) were comparable to or modestly higher than correlations between the DORF and the SAT-10 (i.e., the


nationally developed test). These results are somewhat surprising, given that previous meta-analytic research has indicated that R-CBMs tend to be more highly correlated with nationally developed achievement tests than with state-specific achievement tests (Reschly et al., 2009). (It should be noted that Reschly and colleagues' meta-analysis examined a range of R-CBMs, only one of which was the DORF.) Reschly and colleagues attributed this observation to the fact that nationally developed tests may have superior construction, as compared with state-specific tests. They also noted that nationally developed tests are more likely to measure general reading achievement, whereas state-developed tests are often designed to assess state-specific grade-level standards. In this study specifically, there may have been a number of potential differences between the two outcome tests examined, including discrepancies in the readability of their passages and the complexity of their test items.

Generally, the results of this study corroborated findings from previous research. As noted above, a number of studies found that ROC curve analysis cut scores had considerable decision-making utility in identifying students who were at risk for poor performance on high-stakes achievement tests (e.g., Hintze & Silberglitt, 2005; Keller-Margulis et al., 2008; Riedel, 2007; Shapiro et al., 2006; Silberglitt & Hintze, 2005). Moreover, as in the present study, research comparing the diagnostic accuracy of local and publisher-recommended cut scores has found that local cut scores tend to have higher levels of sensitivity in identifying at-risk students (Goffreda et al., 2009; Roehrig et al., 2008). Nevertheless, the results of this study corroborated only some of the findings from Silberglitt and Hintze's (2005) study, which compared the DA, ROC curve analysis, and LR procedures for developing local ORF cut scores. Similar to Silberglitt and Hintze, this study found that, as compared with the DA and ROC curve analysis cut scores, the LR cut scores had


the lowest values and the highest levels of overall classification accuracy (as indicated by the OCC metric in the present study and by Kappa and the Phi coefficient in Silberglitt & Hintze). They also found that cut scores developed via all three techniques exhibited adequate levels of diagnostic accuracy. Based on these findings, Silberglitt and Hintze recommended that practitioners use LR to develop initial cut scores and to identify minimum acceptable criteria for sensitivity and specificity values. Next, they recommended that practitioners adjust these cut scores via ROC curve analysis, such that overall classification accuracy is maximized. Conversely, Silberglitt and Hintze cautioned against the use of DA to develop cut scores, due to its rigid underlying distributional assumptions (which may be especially difficult to meet in smaller, school-wide samples). However, in the present study, only the DA and ROC curve analysis methods consistently yielded cut scores with adequate levels of sensitivity, whereas the LR cut scores did not (nor were they viable decision points for all subsamples).

Although Silberglitt and Hintze (2005) only displayed local cut score values for the spring CBM administration of second grade, it should be noted that their LR cut score was more than 30 WCPM higher than both of the LR cut scores developed in the present study for the same assessment period. Specifically, in the present study, the spring LR cut scores for predicting FCAT and SAT-10 performance were 47 and 38 WCPM, respectively. Although a definitive comparison between these cut score values cannot be made (due to differences in the samples and measures used), the discrepancy between them is surprisingly large. While the reason for these discrepancies remains unclear, one possible explanation is that the LR method was executed differently in the two studies. For example, in the present study, a value of .50 was inputted as the probability of


failing the outcome measure in the LR formula, whereas in Silberglitt and Hintze's study, this value was not specified. There may also be a number of other potential explanations for the observed differences. (Again, comparisons between the findings of these two studies are made cautiously, due to differences in their participant samples and achievement measures.) First, unlike in the present study, Silberglitt and Hintze's samples were considerably larger (ranging from 1,864–1,965 participants) and included students from a range of elementary schools in a larger district (thus rendering the demographic and educational backgrounds of their participants potentially more heterogeneous). Moreover, although the two studies used different R-CBMs (meaning that differences in the readability and other characteristics of test probes were unknown), a comparison of mean ORF scores between these studies suggests that students in Silberglitt and Hintze's sample read, on average, fewer WCPM in the fall, winter, and spring of second grade than students in the present study. Thus, the sample in this study may have comprised higher-achieving readers than the sample examined by Silberglitt and Hintze (as indicated by their ORF scores). At this time, the impact of various sample characteristics (e.g., sample size, demographic heterogeneity, and overall achievement level) on the use of the LR, DA, and ROC curve analysis procedures to develop CBM cut scores remains unclear. Ultimately, further research is needed to determine how these factors affect local cut score development and the results of classification agreement analyses.

Furthermore, it should be noted that the present study attempted to cross-validate cut scores on multiple, independent participant samples, whereas Silberglitt and Hintze (2005) did


not. Specifically, Silberglitt and Hintze evaluated their cut scores using only one sample of students, which was also the same sample on which the cut scores were initially developed. Although these researchers took measures to ensure that their methods did not capitalize on the specific characteristics of this particular sample, cross-validation is generally necessary to ensure that cut scores retain their decision-making utility in independent samples. It remains unclear whether Silberglitt and Hintze's cut scores would have maintained adequate levels of sensitivity, had they been cross-validated over multiple years and participant samples. Nevertheless, despite these differences between the two studies, both suggest the potential value of using locally-developed cut scores to improve the identification of students who are at risk for later reading difficulties.

Limitations

While the present study has a number of implications for improving universal screening procedures, it is also important to give consideration to its limitations. First, it should be noted that some students in each of the subsamples received secondary and tertiary interventions in reading throughout the year and that decisions to initiate and terminate these interventions were based partially on their DORF performance. This interference could have attenuated the relationship between scores on the DORF and the end-of-year achievement tests, thereby affecting the predictive power of the screeners (Silberglitt & Hintze, 2005). Additionally, the participant sample in this study differed somewhat from student populations typically seen statewide and nationwide. For example, students with limited English proficiency were conspicuously absent from this sample (since this particular school did not provide services to English language learners). Notably, previous research has indicated that student characteristics, such as English language proficiency, can significantly influence the relationship between oral reading fluency skills and reading comprehension. For example, although there is a considerable body of research supporting the use of fluency CBMs with


English language learners (ELLs; Baker & Good, 1995; Betts, Muyskens, & Marston, 2006; Wiley & Deno, 2005), Klein and Jimerson (2005) found that ORF scores overpredicted performance on the Stanford Achievement Test, 9th Edition (SAT-9) for students whose home language was Spanish, whereas they underpredicted performance on this test for students whose home language was English. Moreover, additional research has suggested that factors such as previous instructional experiences and level of English proficiency may influence the trajectory of reading growth for ELLs (Quirk & Beem, 2012; Wayman, McMaster, Saenz, & Watson, 2010). Given the potential impact of student variables on the relationship between ORF skills and later reading proficiency, it remains unclear whether the results of this study are generalizable to different settings and populations.

In addition to its unique participant demographics, the present sample also consisted of relatively high-achieving readers. As noted above, mean DORF, SAT-10, and FCAT scores of these students were higher than those typically seen in larger, statewide and national samples (although, for all three measures, differences between mean scores in the present subsamples and mean scores in larger, more representative samples were no larger than one standard deviation). Additionally, ranges of scores for these measures were smaller than are usually seen in larger, more representative samples, which may have attenuated the relationship between scores on the DORF and the two third-grade outcome measures. Again, it remains unclear whether the results of this study can be generalized to schools whose students have lower achievement and wider ranges of test scores. Despite these apparent limitations, it is important to acknowledge that this study was designed to investigate the development of local CBM benchmarks in one specific school context. Generally, it is expected that local decision rules developed using established research


methods (e.g., ROC curve analysis) will have the greatest utility in the school settings for which they were originally created. Thus, to some extent, it is logical that the results of this study may have limited generalizability to other settings with more dissimilar student populations. Ultimately, further research is needed to determine whether the findings generated in this study are applicable across a range of school settings.

Other potential limitations of this research concern the relatively small sample sizes used to develop and evaluate the local cut scores. It should be acknowledged that the sizes of the two cross-validation samples were fairly small (containing 46 and 50 participants, respectively), as was the size of the calibration sample (which contained only 170 participants). Because random sampling error may have a greater impact in studies with fewer participants, estimates of sensitivity and specificity may be biased for smaller samples (Leeflang, Moons, Reitsma, & Zwinderman, 2008). Although the number of participants that was needed to accurately estimate sensitivity and specificity in this study remains unclear, VanDerHeyden (2011) recommended that practitioners obtain a sample size of at least 200 participants when attempting to establish screening cut scores (in relation to a second outcome measure). It is important to note that this recommendation was based on a study by Leeflang and colleagues (2008), which examined the classification accuracy of cut scores developed for a hypothetical screening instrument. Leeflang and colleagues conducted a series of data simulations in which they estimated sensitivity and specificity values for a given cut score on this screener using various sample sizes (ranging from 20–1,000 participants). They found that, when sample size increased from 40 to 200 participants, the accuracy of sensitivity and specificity estimates for the chosen cut score improved significantly. Nevertheless, because Leeflang and colleagues did not calculate sensitivity and specificity values for all possible


sample sizes between 40 and 200, it remains unclear whether a somewhat smaller sample (e.g., a sample consisting of 170 or 180 participants) would have yielded sufficiently precise estimates of these metrics. Finally, it should be reiterated that the purpose of this study was to explore the utility of the DA, LR, and ROC procedures in developing CBM cut scores with smaller samples (which are likely to be found in individual schools). Most prior studies on cut score development have used datasets aggregated across a number of schools and districts (Hintze & Silberglitt, 2005; Johnson et al., 2009; Johnson et al., 2010; LeBlanc et al., 2012; Riedel, 2007; Roehrig et al., 2008; Shapiro et al., 2006; Silberglitt & Hintze, 2005). In many of these studies, participant samples were significantly larger and more heterogeneous (i.e., with respect to race, ethnicity, socioeconomic status, educational placement, and linguistic background) and had wider ranges of test scores than samples that are typically available in individual schools. This study, however, sought to determine whether the LR, DA, and ROC curve analysis procedures would function appropriately when applied to smaller, more homogeneous samples.

A third limitation of this study concerns the small number of independent samples across which the cut scores were cross-validated. As noted previously, one of the primary purposes of this study was to explore the long-term classification accuracy of locally-developed DORF cut scores across multiple cohorts of students. However, beyond the data aggregated for the calibration sample, DORF, FCAT, and SAT-10 scores were available for only two additional cohorts of second-grade students (i.e., students who completed second grade during the 2008-2009 and 2009-2010 school years, respectively). Thus, while the results of this study indicated that the ROC curve analysis and DA cut scores retained their classification accuracy over the two


consecutive years following their development, further research is needed to explore the classification accuracy of these cut scores over longer periods of time (e.g., five years).

Finally, an important characteristic of this study is that students in the second cross-validation subsample (S3) were administered a different version of the FCAT than students in the calibration (S1) and first cross-validation (S2) subsamples. It should be noted that the FLDOE transitioned from the FCAT to the FCAT 2.0 in the spring of 2011, meaning that the participants in S3 were the first cohort to complete this new edition of the test. Unfortunately, limited information comparing the FCAT and FCAT 2.0 (e.g., Rasch item analyses) was available to the author, although brief test design summaries released by the FLDOE (2009, 2011b) indicated that the two tests were comparable in their intended structure (i.e., number of reading passages and questions), content (i.e., types of passages and questions), duration, and alleged cognitive complexity. Nonetheless, despite any significant changes that may have been made to the FCAT during this time, the local DORF cut scores maintained adequate classification accuracy in predicting FCAT performance for this cohort. Ultimately, although modifications to statewide testing practices can be disruptive to a school's screening procedures, these changes are inevitable and occur continually over time (due to shifts in political interests and emerging advancements in testing and measurement). When this occurs, it is important for practitioners to consider how these changes might affect the utility of their academic screening procedures. Specifically, they may need to recalibrate screening cut scores in order to maintain adequate diagnostic accuracy in identifying at-risk students.

Implications for Schools

Overall, the results of this study underscore the value of using locally-developed CBM cut scores for identifying students who are at risk for poor performance on high-stakes tests of achievement. As


indicated previously, cut scores recommended by the DIBELS, 6th Edition were not viable cut points for all subsamples in this particular setting. Moreover, when the DIBELS-recommended cut scores were viable, they had exceptionally low levels of sensitivity, meaning that too few of the students who ultimately failed the FCAT and SAT-10 in third grade were identified as at risk in second grade. These results suggest that the DIBELS-recommended cut scores were less effective in identifying students who demonstrated need of additional reading interventions, as compared with the local cut scores. Moreover, they suggest that developing local cut scores for identifying students who are at risk for long-term reading difficulties may be a worthwhile practice in some settings.

The results of this study also bring to light several practical considerations for developing local CBM benchmarks. First, school personnel must establish standards and priorities for classification accuracy. Because some classification metrics are inversely related, practitioners cannot expect to maintain high levels of all indicators simultaneously. For example, as sensitivity increases, specificity typically decreases, and as the balance between these values becomes increasingly uneven, OCC may eventually be compromised (as observed in the present study). In some situations, practitioners may need to choose between optimizing the identification of truly at-risk students (i.e., by maximizing levels of sensitivity) and correctly identifying the greatest number of students overall (i.e., by maximizing OCC). A number of researchers have suggested that practitioners favor higher levels of sensitivity when developing screening cut scores, as this approach will maximize the number of truly at-risk students that are identified (Johnson et al., 2009; Johnson et al., 2010; Riedel, 2007; VanDerHeyden, 2010a). However, increasing levels of sensitivity by selecting more liberal cut scores may significantly increase the number of false positive identifications that are made, which can be costly for schools and needlessly alarming


for personnel, parents, and students. Thus, practitioners must give careful consideration to the potential consequences of their decision procedures.

Second, practitioners must select an appropriate method of cut score development. Overall, the results of this study suggest the promise of both the DA and ROC curve analysis methods for developing local benchmarks. Ultimately, the DA procedure may be more accessible to practitioners than the ROC curve analysis method, as it involves the application of a standard formula that can be computed either by hand or with basic analytic software (e.g., Microsoft Excel). One drawback to this procedure, however, is that it requires greater attention to the distributional properties of the data set being examined. Conversely, although the ROC curve analysis method may require expertise with more advanced statistical software, it affords the greatest flexibility in cut score selection (Silberglitt & Hintze, 2005). Ultimately, given an appropriate sample size, the ROC curve analysis procedure may be most ideal because it allows practitioners to choose cut scores with specified values of sensitivity and other metrics. (It should be noted, however, that cross-validation of these cut scores is still necessary to ensure that estimates of CAIs remain consistent when applied to different student samples.) Moreover, selecting a statistical procedure for local cut score development may require practitioners to give careful consideration to the characteristics of their specific data samples. As noted above, the LR and ROC curve analysis methods are robust to a number of common statistical assumption violations (e.g., non-normal distributions and unequal variances) and therefore may be more appropriate for use with smaller, non-normally distributed data sets (Silberglitt & Hintze, 2005). However, again, practitioners are cautioned against conducting these procedures with excessively small sample sizes, as this may bias the estimation of some classification agreement metrics (Leeflang et al., 2008).
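As one concrete illustration of the kind of "standard formula" referenced above, the sketch below computes a two-group discriminant cut score under textbook assumptions (normally distributed, equal-variance groups). The group means, pooled standard deviation, and priors are hypothetical, and the study's own DA computations may have been parameterized differently; note also that the size of the prior-probability adjustment depends on the group separation and pooled variance, so values produced by this formula will not necessarily track the DA(.50) and DA(CS) values reported in Table 3-2.

```python
# Illustrative sketch: a textbook two-group discriminant cut score under
# normality and equal-variance assumptions. All numbers are hypothetical.
import math

mean_fail = 55.0    # hypothetical mean WCPM of students who failed the test
mean_pass = 95.0    # hypothetical mean WCPM of students who passed
pooled_sd = 20.0    # hypothetical pooled standard deviation

def da_cut(prior_fail, prior_pass):
    midpoint = (mean_fail + mean_pass) / 2.0
    # Prior-probability term: zero when the priors are equal (.50/.50), so the
    # cut then falls at the midpoint of the two group means; unequal priors
    # shift the cut toward the less prevalent group.
    shift = (pooled_sd ** 2) * math.log(prior_fail / prior_pass) / (mean_fail - mean_pass)
    return midpoint - shift

print(round(da_cut(0.50, 0.50), 1))   # equal priors: the midpoint of the group means
print(round(da_cut(0.19, 0.81), 1))   # unequal, calibration-style priors: a lower cut
```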


In addition to examining the size and distributional properties of their data sets, practitioners must also consider the composition of their participant samples. For example, recent research has indicated that several types of student variables, including gender and special education status, may affect ORF growth rates at certain grade levels (Yeo, Fearrington, & Christ, 2011). Moreover, Kranzler and colleagues (1999) found that ORF performance may be a biased estimator of reading comprehension skills for students from different ethnic backgrounds, especially in the latter elementary grades (i.e., fourth and fifth grades). A follow-up study, however, found no bias of the ORF metric in relation to ethnicity (Hintze, Callahan, Matthews, Williams, & Tobin, 2002). Regardless of which demographic variables have the greatest impact on CBM growth rates and the predictive power of ORF measures, local benchmarks are most likely to effectively reduce decision-making errors when they are tailored to the group of individuals for which the decisions are being made (Stewart & Silberglitt, 2008). Thus, to the greatest extent possible, practitioners should assemble participant samples that adequately represent the larger student population to which the cut scores will be applied.

Finally, if the locally-developed cut scores are intended for use over multiple years, personnel must ensure that the classification accuracy of these cut scores is comparable from year to year. As described above, this can be challenging given that instructional practices, student populations, and state testing practices are likely to change over time. Nevertheless, the results of this study indicated that the local cut scores retained their classification accuracy in predicting high-stakes test performance over multiple years, despite changes to assessment practices during that time.

Directions for Future Research

Overall, the present study has a number of implications for informing future research on the development of local CBM benchmarks. As noted previously, the extent to which the results

As noted previously, the extent to which the results of this study are applicable across various school settings remains unclear. Thus, future research should explore the utility of the DA, ROC curve analysis, and LR methods of cut score development in schools that differ with respect to their student populations, instructional practices, and chosen screening and outcome measures. Future studies may also wish to compare the utility of local cut scores in predicting high-stakes test performance for diverse groups of students (e.g., students from various linguistic, racial, ethnic, socioeconomic, and educational backgrounds). When possible, researchers may also wish to use larger student samples than the one examined in the present study.

Additionally, because local cut scores are ultimately intended for use within larger, multi-tier models of service delivery, further research is needed to explore how these scores may affect decision-making processes. Notably, CBM data can be used to make decisions about individual student supports (e.g., regarding the focus, duration, intensity, initiation, and termination of interventions) as well as school-wide intervention and instructional practices (e.g., addressing grade-level needs and coordinating secondary and tertiary interventions). Future studies may wish to examine the application of local CBM cut scores in the context of MTSS and whether use of these cut scores can be linked to improvements in service delivery and decision-making processes.

Finally, future research may wish to explore the classification accuracy of locally developed cut scores over longer periods of time. In the present study, local cut scores were cross-validated across only two additional cohorts of students. However, research on the nature of the school change process has indicated that substantial and sustainable reforms are typically implemented over a number of years (i.e., between five and seven years; Fullan, 2007). In addition, drift in student populations may occur gradually over a number of years.

Thus, in order to investigate the impact of long-term school change on the diagnostic utility of CBM cut scores, researchers may need to cross-validate these cut scores over longer periods of time (perhaps five years or more).

In conclusion, the present study has a number of implications for assisting practitioners in developing local CBM cut scores; nonetheless, additional research is needed to continue refining these practices. Further research in this area is critical because it may allow practitioners to develop screening procedures that are optimally effective in specific school settings. More broadly, improvements in screening practices may increase the overall efficiency of instructional services by ensuring that struggling readers are provided necessary interventions as early as possible. Hopefully, future research in this area will improve the capacity of schools to provide critical interventions to at-risk readers and, ultimately, to meet the needs of all learners in an appropriate and timely manner.

REFERENCES

Baker, S. K., & Good, R. (1995). Curriculum-based measurement of English reading with bilingual Hispanic students: A validation study with second-grade students. School Psychology Review, 24, 561–578.
Barger, J. (2003). Comparing the DIBELS oral reading fluency indicator and the North Carolina end-of-grade reading assessment. Asheville, NC: North Carolina Teacher Academy.
Betts, J., Muyskens, P., & Marston, D. (2006). Tracking the progress of students whose first language is not English towards English proficiency: Using CBM with English language learners. MinneTESOL/WITESOL Journal, 23, 15–37.
Neisser (Ed.), The school achievement of minority children: New perspectives (pp. 105–143). Hillsdale, NJ: Erlbaum.
Buck, J., & Torgesen, J. (2003). The relationship between performance on a measure of oral reading fluency and performance on the Florida Comprehensive Assessment Test (FCRR Report No. 1). Tallahassee, FL: Florida Center for Reading Research.
Carney, R. N. (2005). Stanford Achievement Test, Tenth Edition. In R. A. Spies & B. S. Plake (Eds.), The 16th mental measurements yearbook (pp. 968–972). Lincoln, NE: The Buros Institute of Mental Measurements.
Cavanaugh, C. L., Kim, A., Wanzek, J., & Vaughn, S. (2004). Kindergarten reading interventions for at-risk students: Twenty years of research. Learning Disabilities, 2, 9–21.
Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155–159.
Compton, D. L., Appleton, A. C., & Hosp, M. K. (2004). Exploring the relationship between text leveling systems and reading accuracy and fluency in second-grade students who are average and poor decoders. Learning Disabilities Research & Practice, 19, 176–184.
Compton, D. L., Fuchs, D., Fuchs, L. S., Bouton, B., Gilbert, J. K., Barquero, L. A., Cho, E., & Crouch, R. B. (2010). Selecting at-risk first-grade readers for early intervention: Eliminating false positives and exploring the promise of a two-stage gated screening process. Journal of Educational Psychology, 102, 327–340. doi:10.1037/a0018448
Compton, D. L., Fuchs, D., Fuchs, L. S., & Bryant, J. D. (2006). Selecting at-risk readers in first grade for early intervention: A two-year longitudinal study of decision rules and procedures. Journal of Educational Psychology, 98, 394–409.
Crist, C. (2001). FCAT briefing book. Tallahassee: Florida Department of Education.
Crawford, L., Tindal, G., & Stieber, S. (2001). Using oral reading rate to predict student performance on statewide achievement tests. Educational Assessment, 7, 303–323.

CTB/McGraw-Hill. (2008). TerraNova Third Edition Complete Battery. Monterey, CA: Author.
Cummings, K. D., Otterstedt, J., Kennedy, P. C., & Baker, S. DIBELS Data System: 2009–2010 percentile ranks for DIBELS 6th Edition benchmark assessments (Technical Report 1102). Eugene, OR: University of Oregon Center on Teaching and Learning.
Cunningham, A. E., & Stanovich, K. E. (1997). Early reading acquisition and its relation to reading experience and ability 10 years later. Developmental Psychology, 33, 934–945.
Deno, S. L. (1985). Curriculum-based measurement: The emerging alternative. Exceptional Children, 52, 219–232.
Deno, S. L. (2003). Developments in curriculum-based measurement. The Journal of Special Education, 37, 184–192.
Florida Department of Education. (2009). Florida Comprehensive Assessment Test: Test design summary. Retrieved from http://fcat.fldoe.org/pdf/fc05designsummary.pdf
Florida Department of Education. (2011a). State level report: Florida Comprehensive Assessment Test. Retrieved from http://fcat.fldoe.org/results/default.asp
Florida Department of Education. (2011b). Test design summary: 2011 operational assessments FCAT/FCAT 2.0/End-of-Course. Retrieved from http://fcat.fldoe.org/pdf/1011designsummary.pdf
Fuchs, D., & Fuchs, L. (2006). Introduction to response to intervention: What, why, and how valid is it? Reading Research Quarterly, 41, 93–99.
Fuchs, L. S., Fuchs, D., Hamlett, C. L., Walz, L., & Germann, G. (1993). Formative evaluation of academic progress: How much growth can we expect? School Psychology Review, 22(1), 27–48.
Fuchs, L. S., Fuchs, D., Hosp, M., & Jenkins, J. (2001). Oral reading fluency as an indicator of reading competence: A theoretical, empirical, and historical analysis. Scientific Studies of Reading, 5, 239–256. doi:10.1207/S1532799XSSR0503_3
Fullan, M. (2007). The new meaning of educational change (4th ed.). New York: Teachers College Press.
Goffreda, C. T., & DiPerna, J. C. (2010). An empirical review of psychometric evidence for the Dynamic Indicators of Basic Early Literacy Skills. School Psychology Review, 39, 463–483.
Goffreda, C. T., DiPerna, J. C., & Pedersen, J. A. (2009). Preventive screening for early readers: Predictive validity of the Dynamic Indicators of Basic Early Literacy Skills (DIBELS). Psychology in the Schools, 46, 539–552. doi:10.1002/pits.20396

Good, R. H., & Kaminski, R. A. (2002a). Dynamic Indicators of Basic Early Literacy Skills (6th ed.). Eugene, OR: Institute for the Development of Educational Achievement. Retrieved from http://dibels.uoregon.edu/
Good, R., & Kaminski, R. A. (2002b). DIBELS oral reading fluency passages for first through third grades (Report No. 10). Eugene, OR: University of Oregon.
Good, R. H., Kaminski, R. A., Smith, S., & Bratten, J. (2001). Technical adequacy of second grade DIBELS Oral Reading Fluency passages (Report No. 8). Eugene, OR: University of Oregon.
Good, R. H., Simmons, D. C., & Kame'enui, E. J. (2001). The importance and decision-making utility of a continuum of fluency-based indicators of foundational reading skills for third-grade outcomes. Scientific Studies of Reading, 5, 257–289.
Good, R. H., Simmons, D., Kame'enui, E., Kaminski, R. A., & Wallin, J. (2002). Summary of decision rules for intensive, strategic, and benchmark instructional recommendations in kindergarten through third grade (Report No. 11). Eugene, OR: University of Oregon.
Gresham, F. (2007). Evolution of the response to intervention concept: Empirical foundations and recent developments. In S. Jimerson, M. Burns, & A. VanDerHeyden (Eds.), Handbook of response to intervention: The science and practice of assessment and intervention (pp. 10–24). New York: Springer.
Guthrie, J. T. (2002). Preparing students for high-stakes test taking in reading. In A. E. Farstrup & S. Samuels (Eds.), What research has to say about reading instruction (pp. 370–391). Newark, DE: International Reading Association.
Harcourt Assessment. (2004). Stanford Achievement Test Series, Tenth Edition: Technical data report. San Antonio, TX: Harcourt Assessment, Inc.
Harcourt Brace. (2003a). Stanford Achievement Test (10th ed.). San Antonio, TX: Author.
Harcourt Brace. (2003b). Stanford Achievement Test (10th ed.): Technical data report. San Antonio, TX: Author.
Harcourt. (2007). Reading and mathematics: Technical report for 2006 FCAT test administrations. San Antonio, TX: Harcourt Assessment.
Hasbrouck, J. E., & Tindal, G. (1992). Curriculum-based oral reading fluency norms for students in grades 2–5. Teaching Exceptional Children, 24(3), 41–44.
Hintze, J., Callahan, J., Matthews, W., Williams, S., & Tobin, K. (2002). Oral reading fluency and prediction of reading comprehension in African American and Caucasian elementary school children. School Psychology Review, 31, 540–554.

Hintze, J. M., & Silberglitt, B. (2005). A longitudinal examination of the diagnostic accuracy and predictive validity of R-CBM and high-stakes testing. School Psychology Review, 34, 372–386.
Human Resources Research Organization. (2002). Florida Comprehensive Assessment Test (FCAT) for Reading and Mathematics: Technical report for test administrations of FCAT 2002. San Antonio, TX: Author/Harcourt Educational Measurement.
Individuals with Disabilities Education Improvement Act of 2004, 20 U.S.C. § 1400 et seq.
Jenkins, J. R., Hudson, R. F., & Johnson, E. S. (2007). Screening for service delivery in a response to intervention (RTI) framework. School Psychology Review, 36, 582–600.
Johnson, E. S., Jenkins, J. R., & Petscher, Y. (2010). Improving the accuracy of a direct route screening process. Assessment for Effective Intervention, 35, 131–140.
Johnson, E., Jenkins, J., Petscher, Y., & Catts, H. (2009). How can we improve the accuracy of screening instruments? Learning Disabilities Research & Practice, 24, 174–185.
Keller-Margulis, M. A., Shapiro, E. S., & Hintze, J. M. (2008). Long-term diagnostic accuracy of curriculum-based measures in reading and mathematics. School Psychology Review, 37, 374–390.
Klein, J., & Jimerson, S. (2005). Examining ethnic, gender, language, and socioeconomic bias in oral reading fluency scores among Caucasian and Hispanic students. School Psychology Quarterly, 20, 23–50.
Kranzler, J. H., Miller, D. M., & Jordan, L. (1999). An examination of racial/ethnic and gender bias on curriculum-based measurement of reading. School Psychology Quarterly, 14, 327–342.
LaBerge, D., & Samuels, S. J. (1974). Toward a theory of automatic information processing in reading. Cognitive Psychology, 6, 293–323.
LeBlanc, M., Dufore, E., & McDougal, J. (2012). Using general outcome measures to predict student performance on state-mandated assessments: An applied approach for establishing predictive cut scores. Journal of Applied School Psychology, 28, 1–13. doi:10.1080/15377903.2012.643753
Leeflang, M. M. G., Moons, K. G. M., Reitsma, J. B., & Zwinderman, A. H. (2008). Bias in sensitivity and specificity caused by data-driven selection of optimal cutoff values: Mechanisms, magnitude, and solutions. Clinical Chemistry, 54, 729–737.
No Child Left Behind Act of 2001, Pub. L. No. 107-110, 115 Stat. 1425 (2002).
McGlinchey, M. T., & Hixson, M. D. (2004). Using curriculum-based measures to predict performance on state assessments in reading. School Psychology Review, 33, 193–203.

Morse, D. T. (2005). Stanford Achievement Test, Tenth Edition. In R. A. Spies & B. S. Plake (Eds.), The 16th mental measurements yearbook (pp. 968–972). Lincoln, NE: The Buros Institute of Mental Measurements.
Quirk, M., & Beem, S. (2012). Examining the relations between reading fluency and reading comprehension for English language learners. Psychology in the Schools, 49, 539–553.
RAND Reading Study Group. (2002). Reading for understanding. Washington, DC: RAND Education.
Reschly, A. L., Busch, T. W., Betts, J., Deno, S. L., & Long, J. D. (2009). Curriculum-based measurement of oral reading as an indicator of reading achievement: A meta-analysis of the correlational evidence. Journal of School Psychology, 47, 427–469.
Riedel, B. W. (2007). The relation between DIBELS, reading comprehension, and vocabulary in urban first-grade students. Reading Research Quarterly, 42, 546–562.
Roehrig, A. D., Petscher, Y., Nettles, S. M., Hudson, R. F., & Torgesen, J. K. (2008). Accuracy of the DIBELS oral reading fluency measure for predicting third-grade reading comprehension outcomes. Journal of School Psychology, 46, 343–366.
Scammacca, N., Vaughn, S., Roberts, G., Wanzek, J., & Torgesen, J. K. (2007). Extensive reading interventions in grades K–3: From research to practice. Portsmouth, NH: Center on Instruction, RMC Research Corporation.
Schatschneider, C., Buck, J., Torgesen, J., Wagner, R., Hassler, L., Hecht, S., & Powell-Smith, K. (2004). A multivariate study of factors that contribute to individual differences in performance on the Florida Comprehensive Reading Assessment Test (FCRR Report No. 5). Tallahassee, FL: Florida Center for Reading Research.
Schilling, S. G., Carlisle, J. F., Scott, S. E., & Zeng, J. (2007). Are fluency measures accurate predictors of reading achievement? Elementary School Journal, 107, 429–448.
Shinn, M. R., Good, R. H., Knutson, N., Tilly, W. D., & Collins, V. L. (1992). Curriculum-based measurement reading fluency: A confirmatory analysis of its relation to reading. School Psychology Review, 21, 459–479.
Shapiro, E. S., Keller, M. A., Lutz, J. G., Santoro, L. E., & Hintze, J. M. (2006). Curriculum-based measures and performance on state assessment and standardized tests: Reading and math performance in Pennsylvania. Journal of Psychoeducational Assessment, 24, 19–35.
Shaw, R., & Shaw, D. (2002). DIBELS oral reading fluency-based indicators of third grade reading skills for Colorado State Assessment Program (CSAP). Eugene, OR: University of Oregon.
Silberglitt, B., Burns, M. K., Madyun, N., & Lail, K. (2006). Relationship of reading fluency assessment data with state accountability test scores: A longitudinal comparison of grade levels. Psychology in the Schools, 43, 527–535.

Silberglitt, B., & Hintze, J. (2005). Formative assessment using CBM-R cut scores to track progress toward success on state-mandated achievement tests: A comparison of methods. Journal of Psychoeducational Assessment, 23, 304–325.
Simmons, D. C. (2000). Implementation of a schoolwide reading improvement model: "No one ever told us it would be this hard!" Learning Disabilities Research & Practice, 15, 92–100.
Stage, S. A., & Jacobsen, M. D. (2001). Predicting student success on a state-mandated performance-based assessment using oral reading fluency. School Psychology Review, 30(3), 407–419.
Stewart, L., & Silberglitt, B. (2008). Best practices in developing academic local norms. In A. Thomas & J. Grimes (Eds.), Best practices in school psychology (5th ed., Vol. 2, pp. 225–242). Washington, DC: National Association of School Psychologists.
Torgesen, J. K., Rashotte, C. A., & Alexander, A. (2001). Principles of fluency instruction in reading: Relationships with established empirical outcomes. In M. Wolf (Ed.), Dyslexia, fluency, and the brain (pp. 333–355). Parkton, MD: York Press.
VanDerHeyden, A. (2010a). Use of classification agreement analyses to evaluate RtI implementation. Theory into Practice, 49, 281–288.
VanDerHeyden, A. (2010b). Determining early mathematical risk: Ideas for extending the research. School Psychology Review, 39, 196–202.
VanDerHeyden, A. (2011). Technical adequacy of response to intervention decisions. Exceptional Children, 77, 335–350.
Vander Meer, C. D., Lentz, F. E., & Stollar, S. (2005). The relationship between oral reading fluency and Ohio proficiency testing in reading. Eugene, OR: University of Oregon.
Vellutino, F. R., Scanlon, D. M., Small, S., & Fanuele, D. P. (2006). Response to intervention as a vehicle for distinguishing children with and without reading disabilities: Evidence for the role of kindergarten and first grade interventions. Journal of Learning Disabilities, 39, 157–169.
Wayman, M., McMaster, K., Saenz, L., & Watson, J. (2010). Using curriculum-based measurement to monitor secondary English language learners' responsiveness to peer-mediated reading instruction. Reading and Writing Quarterly, 26, 308–332.
Wayman, M. M., Wallace, T., Wiley, H. I., Ticha, R., & Espin, C. A. (2007). Literature synthesis on curriculum-based measurement in reading. Journal of Special Education, 41(2), 85–120.
Wiley, H. I., & Deno, S. L. (2005). Oral reading and maze measures as predictors of success for English Learners on a state standards assessment. Remedial & Special Education, 26, 207–214.

Wilson, J. (2005). The relationship of Dynamic Indicators of Basic Early Literacy Skills (DIBELS) oral reading fluency to performance on Arizona Instrument to Measure Standards (AIMS). Tempe, AZ: Assessment and Evaluation Department, Tempe School District No. 3.
Wood, D. E. (2006). Modeling the relationship between oral reading fluency and performance on a statewide reading test. Educational Assessment, 11(2), 85–104.
Yeo, S. (2010). Predicting performance on statewide achievement tests using curriculum-based measurement in reading: A multilevel meta-analysis. Remedial and Special Education, 31, 412–422.
Yeo, S., Fearrington, J., & Christ, T. J. (2011). An investigation of gender, income, and special education status bias on curriculum-based measurement slope in reading. School Psychology Quarterly, 26, 119–130.
Zirkel, P. A., & Thomas, L. B. (2010). State laws for RTI: An updated snapshot. Teaching Exceptional Children, 42, 56–63.

BIOGRAPHICAL SKETCH

Sally Grapin was born in Ridgewood, New Jersey, in 1987. She earned her Bachelor of Arts in psychology from Brown University in 2009 and her Master of Education in school psychology from the University of Florida in 2012. During her graduate training, she completed a number of practicum experiences in the public schools of North Central Florida as well as a doctoral internship in the Glen Rock Public Schools in New Jersey. Through these experiences, she has acquired training in the areas of assessment, intervention, consultation, crisis prevention and intervention, and systems-level service delivery. Currently, her research interests center on assessment and intervention practices for students with learning disabilities as well as the implementation of Response to Intervention models in schools. Outside of her professional interests, Sally enjoys spending time with her family and friends, playing tennis, and attending musical theater performances.