School Psychology Forum: Research in Practice
Volume 1 • Issue 2 • Pages 75–86 • Spring 2007

Interpretations of Curriculum-Based Measurement Outcomes: Standard Error and Confidence Intervals

Theodore J. Christ
Melissa Coolong-Chaffin
University of Minnesota

Dr. Christ: I inserted some comments and highlights throughout. I hope they help. Keep in mind, this was written 8 years ago. Much of it still applies, but we now know more about RTI and the interpretation of data. Just skim. The big ideas matter more than the details.

Abstract: Curriculum-based measurement (CBM) is a set of procedures uniquely suited to inform problem solving and response-to-intervention (RTI) evaluations. The prominence, use, and emphasis on CBM are likely to increase in the coming years because of the recent changes in federal law (No Child Left Behind and the 2004 reauthorization of the Individuals with Disabilities Education Act [IDEA]) along with the advent and growing popularity of RTI models (i.e., multitiered and dual discrepancy). As a result of these changes in legislation, it is likely that CBM data will be used more often to guide high-stakes decisions, such as those related to diagnosis and eligibility determination. School psychologists must remain ever vigilant leaders in schools to guide assessment and evaluation practices. That includes implementation and advocacy for valid uses and interpretations of measurement outcomes. The authors promote a perspective and methodology for school psychologists to reference the consistency, reliability, and sensitivity of CBM outcomes. Specifically, school psychologists should reference standard error and report confidence intervals as they apply to estimates of both level and trend. A conceptual foundation and guidelines are presented to interpret CBM outcomes.
Overview

In a previous article published in School Psychology Forum, Burns and Coolong-Chaffin (2006) described the three-tiered model for service delivery and assessment within a response-to-intervention (RTI) framework. The article explained that the vast majority of students (>80%) should be served by a generalized set of Tier 1 services. A small subset of the student population (approximately 20%) should be served by the prevention and remediation services that comprise Tier 2. Finally, a very small subset of the student population (approximately 5%) should be served by the intensive and individualized remediation and support services that comprise Tier 3. According to the article, (a) the primary responsibilities of the school psychologist within an RTI model include the design and selection of both intervention-related and evaluation-related procedures, (b) triannual benchmarking and screening should occur for all students as part of Tier 1 services, (c) monthly assessment and evaluation should occur for students who are targeted for prevention and remediation as part of Tier 2, and (d) in the most critical cases (approximately 5%), weekly or daily assessments and evaluations should occur for students with intensive and individualized needs.

An obvious question for school-based practitioners is what tools should be used to do this assessment. An examination of existing RTI models in the field reveals that curriculum-based measurement (CBM) procedures are commonly used to monitor student progress and make decisions about services (Fuchs, Fuchs, McMaster, & Al Otaiba, 2003; Fuchs, Mock, Morgan, & Young, 2003). CBM fits well within problem-solving and RTI models of service delivery. CBM procedures were established to assess the development of academic skills in the content areas of reading, mathematics, written expression, and spelling (Deno, 1985, 2003). These procedures might be used to assess either the level or trend of academic achievement.
Level is typically estimated by the median value across multiple CBM administrations. Best practices suggest that three CBM probes should be administered on each of 3 or more days (Shinn, 2002). The median level of performance within each day is derived. The median of those values is used to establish the typical level of performance. The rate of academic development is often estimated with the line of best fit for a set of progress monitoring data. Progress monitoring data are typically collected once or twice per week for a period of 6 or more weeks. The slope (or steepness) of the line of best fit is used to establish the general rate of academic growth. These are the two most likely measurement schedules and interpretations when CBM is used to assess and evaluate the level and rate of academic achievement.

A review of the RTI literature suggests that RTI evaluations will depend substantially on CBM data. Clearly, CBM procedures are useful and uniquely suited for use within RTI. However, there are some issues and potential cautions that should be considered before CBM data are used to guide decisions, especially within the upper tiers of RTI where the stakes increase greatly (e.g., CBM data are used to inform eligibility decisions). The issue of measurement precision should be considered before assessment outcomes are used to guide educational decisions. Although it is often appropriate to use brief and informal assessments to guide low-stakes decisions, there are more rigorous standards for assessments when they are used to guide high-stakes decisions. Low-stakes decisions include routine classroom and screening decisions. Many Tier 1 decisions are relatively low stakes. High-stakes decisions include diagnosis and eligibility decisions. Many Tier 2 decisions and most Tier 3 decisions are relatively high stakes.
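The level estimate described earlier in this section (the median of the within-day medians) can be sketched in a few lines. This is only an illustration: the function name `level_estimate` and the WRCM scores are hypothetical and do not come from the article.

```python
from statistics import median


def level_estimate(daily_scores):
    """Return the median of the within-day medians.

    daily_scores is a list of days; each day is a list of CBM probe scores.
    """
    return median(median(day) for day in daily_scores)


# Hypothetical data: three probes administered on each of three days (WRCM).
scores = [
    [18, 22, 20],  # day 1, within-day median 20
    [25, 19, 21],  # day 2, within-day median 21
    [17, 23, 24],  # day 3, within-day median 23
]
level = level_estimate(scores)
print(level)
```

Using the median at both steps keeps a single unusually high or low probe from pulling the estimate of typical performance.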
The fundamental distinction is that the impact of a low-stakes decision is relatively short term and easily reversible, while the impact of a high-stakes decision is more long term and potentially permanent. When data are used to guide high-stakes decisions, it is necessary to use caution and establish the requisite level of confidence before arriving at a decision. In part, this requires that the consumers of data be aware of the technical qualities of measurements. Those qualities include the reliability of measurement and the validity of their use. Although it is relatively common to hear that CBM is technically adequate or that CBM is reliable and valid, it is necessary to translate the outcomes of psychometric research so these outcomes can be applied at the point of interpretation. The content of this article is limited to the issues of reliability and measurement error and will provide the rationale and methods to reference standard error as part of CBM interpretations. This information is critical for any school-based professional who functions within a problem-solving or RTI framework.

Dr. Christ: INFERENCE is central to data use. Consider what you actually observe and what you infer; we almost always go beyond the data with our interpretation, which is inference.

Sampling and Reporting Achievement Using CBM

Consumers of assessment data are less interested in how a particular student performs at a particular time when presented with a particular set of stimuli. Rather, consumers are interested in performance on multiple administrations within and across days with a variety of stimulus sets. Consumers use assessment outcomes to infer generalized performance across conditions. Curriculum-based approaches to assessment are designed to infer generalized performance within the context of a curriculum or instruction (Hintze, Christ, & Methe, 2006).
It is inferred that student performance on CBM tasks provides useful estimates of generalized performance in the annual curriculum. Multiple measurements are usually necessary to ensure an adequate and representative sample of the stimulus materials and student performances within the annual curriculum. Those outcomes might be presented in either a graphic or numeric format. It is common to plot data graphically when multiple measurements are collected. That format is useful because it facilitates visual analysis of level, trend, and variability (Barlow & Hersen, 1984; Tawney & Gast, 1984). Once graphed, patterns are often self-explanatory and readily apparent without the use of statistics. In contrast, statistical presentations of assessment data are typically limited to only a single numeric estimate of level (i.e., mean or median) or trend (i.e., slope). It seems that there might be a tendency to overlook estimates of variability when data are reported as a single numerical estimate. One purpose of this article, therefore, is to establish that it is limiting and potentially misleading to the consumer of data when estimates of stability, dispersion, or variability are left out. To that end, this article will address issues of standard error as they apply to estimates of both CBM level and growth.

As discussed, the level and trend of student achievement are often derived from data collected across multiple administrations using either graphic or statistical analysis. There is no expectation that student performances would, or should, be identical across administrations. Variability and dispersion of performance are practically inevitable. Therefore, it follows that any single estimate of performance without reference to precision, variability, and dispersion is incomplete. Student performances across administration conditions will vary.
The variability is, in part, a function of the inconsistencies related to the setting, stimulus materials, administrators, occasions, and student disposition. When making educational decisions using CBM data, it is critical, therefore, to accurately communicate the variability in those data. To fail to do so may lead to improper decisions that have high stakes for the students involved.

Sensitivity, Reliability, and Consistency

The consistency of measurement is often analyzed and discussed as reliability, which is, in many ways, the reciprocal of sensitivity. In other words, if a measure is highly sensitive then it is less likely to yield consistent or reliable results across repeated administrations. The research literature has established that CBM is a highly sensitive measurement procedure. CBM is sensitive to the characteristics of the administrator (Derr-Minneci, 1990; Derr-Minneci & Shapiro, 1992), setting/locations of the assessment (Derr-Minneci, 1990), specification of the directions (Colon & Kranzler, 2006), task novelty (Christ & Schanding, 2007), probe type (Fuchs & Deno, 1992; Hintze & Christ, 2004; Hintze, Christ, & Keller, 2002; Shinn, Gleason, & Tindal, 1989a), assessment duration (Christ, Johnson-Gros, & Hintze, 2005; Hintze et al., 2002), stimulus content (Fuchs & Deno, 1991, 1992; Hintze et al., 2002; Shinn, Gleason, & Tindal, 1989), and stimulus arrangement (Christ, 2006a; Christ & Vining, 2006). Student performance on CBM tasks is likely to be influenced by those and other characteristics of the measurement conditions. That sensitivity limits the utility of any narrow or specific point estimate. Instead, it is necessary to interpret any CBM outcome as one sample of behavior. It is necessary to collect multiple samples before an estimate of typical performance is derived. Moreover, any estimate of typical performance should reference the likely range of performances across repeated measurements.
Variability is practically inevitable and, as such, it should be referenced when CBM outcomes are reported or interpreted to guide educational decisions.

Empirical and Theoretical Dispersion

Suppose CBM-Reading (CBM-R) were used to assess a second-grade student 100 times with no retest effects, as if the student’s memory of each test were immediately erased after each administration. There would be some variability in performance across measurement occasions. Variability is almost inevitable even if the conditions of measurement were well standardized and tightly controlled. That is especially true in the case of CBM, and CBM-like measures, that are highly sensitive to the variability in student performances within and across administrations (as discussed above). The designs of CBM procedures establish their sensitivity. Both the speeded metric, which is often highly sensitive to fluctuations in student performance, and the discrete, relatively small, unit of observation establish that sensitivity. For example, the metric for CBM-R is words read correct per minute (WRCM). That metric isolates a relatively small unit of behavior (i.e., word) that is easily observable and occurs frequently within the measurement duration. Although it might seem somewhat burdensome and unnecessary, oral reading fluency could be reported in lines of text read, sentences read, phrases, or clusters of five word units read. Each of those measurement units might yield more consistent, but less sensitive, measurement outcomes. However, the convention is to use WRCM and, therefore, consumers of CBM-R outcomes should be familiar with the likely variability of student performances across repeated administrations. Figure 1 illustrates the possible distribution of performances for a single student across repeated measurements.
The frequency of observed scores is greatest near the central point in the distribution, and the frequency of observed scores is least near the outer edges. The most likely performance approximates 20 WRCM, which is the arithmetic mean of the frequency distribution. The mean provides both the best estimate of previous performances and the best predictor of future performances. If the data in Figure 1 were used to predict future performance, then the best estimate is probably 20 WRCM and not either 10 WRCM or 30 WRCM.

Figure 1. A frequency distribution of probable outcomes if CBM-R were to be repeatedly administered approximately 100 times. Test scores are reported in words read correct per minute.
[Figure 1: histogram. Horizontal axis: Test Score (10 to 30 WRCM). Vertical axis: Frequency (0 to 16).]

A brief reference to test theory will provide the context for the interpretation of variable performances.

Dispersion Within Test Theory

Classical test theory (CTT) provides a simple model to conceptualize and evaluate the inconsistency of measurements across occasions. The observed score x is equal to the true score t plus the error e so that x = t + e. That CTT model has been the basis for most test development for more than 100 years (Crocker & Algina, 1986; Spearman, 1904, 1907, 1913). The model provides the conceptual foundation to interpret each observed score x as the sum of two theoretical components t + e. The true score should not be interpreted as a physical or metaphysical truth that is inherent to the individual. The preferred interpretation of the true score is as the theoretical mean of a large, or infinite, number of administrations. In the case of the data presented in Figure 1, the best approximation of the true score value might be 20 WRCM.
The true score value is stable and does not fluctuate or change, although our estimates of true score might, so that if 100 more scores were added to the Figure 1 data set the mean and true score estimate might shift up or down. In the proper sense, it is not really possible to observe or establish the true score value because, by definition, observations are contaminated by some degree of error. In that sense, error is simply the fluctuation and variability of observations around the true score. The best estimate of a student’s true score value is the mean of many observed scores.

Recall that the consumer of assessment data is most likely interested in a generalized interpretation of the outcome. That is, the best estimate of likely performance across either actual or theoretical measurement occasions. The consumer is not interested in the particular performance at a particular time and within a particular set of circumstances. Instead, observed score values are most useful, and typically interpreted, as estimates of true score values.

Dr. Christ: A "generalized interpretation" requires INFERENCE. In the case of CBM, the typical educational professional wants to know in general how the child is reading rather than how the child is reading in a particular time, place, using a particular probe, scoring method, and so on.

The previous discussion established that variability in performances across actual and theoretical administrations should be expected. Actual administrations yield observed scores. Theoretical administrations do not yield observed scores, but provide a context to infer the range of likely performances across repeated measurements using psychometric analysis and CTT. In summary, no single measurement datum should be used to describe typical performance or estimate the true score value.
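The CTT model x = t + e can be made concrete with a small simulation. The sketch below is illustrative only: it assumes a true score of 20 WRCM (the value suggested by Figure 1), an error standard deviation of 4 WRCM chosen arbitrarily, and normally distributed error. None of these are estimates from the article, and the function name is hypothetical.

```python
import random
from statistics import mean


def simulate_observed_scores(true_score, error_sd, n, seed=0):
    """Draw n observed scores x = t + e with normally distributed error e."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [true_score + rng.gauss(0, error_sd) for _ in range(n)]


# Assumed values: t = 20 WRCM, error SD = 4 WRCM, 100 administrations.
observed = simulate_observed_scores(true_score=20, error_sd=4, n=100)

single_score = observed[0]         # one administration: t plus one error draw
mean_of_10 = mean(observed[:10])   # estimate of t from 10 administrations
mean_of_100 = mean(observed)       # estimate of t from all 100 administrations

print(single_score, mean_of_10, mean_of_100)
```

The true score stays fixed at 20 throughout. Only the estimates move, and the mean of many administrations is expected to sit closer to the true score than any single observed score does.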
Instead, best practices dictate that there should be reference to both an observed score and the likely variability. Estimates of variability can be derived from the results of many repeated measurements or derived psychometrically (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999). There are two approaches used to estimate variability. The first is to administer a large number of assessments and derive the standard deviation or range of performances. The second is to administer fewer assessments and use derivations of standard error to estimate the likely magnitude of standard deviation and/or range of performances. Because of the time and resources involved in administering a large number of assessments in order to derive the standard deviation, the use of standard error is perhaps more appealing. If the purpose of measurement is to estimate the likely range of performance and true score values, then it seems more efficient and practical to collect fewer observations and rely on standard error.

Standard Error

Dr. Christ: Focus on the concepts and not the math. The explanation for how standard errors are derived is not very important. The standard error is the amount of variation or dispersion of error in measurements.

There are at least two types of standard error that are relevant to the interpretation and use of CBM outcomes within an RTI model. The dual discrepancy model of RTI evaluation establishes that students are eligible for services when their response to low-resourced/typical services is substantially different from that of their peers (Fuchs, Fuchs, et al., 2003; Fuchs, Mock, et al., 2003). That is, students are eligible for specialized services when the level and trend of their response are (dual) discrepant from those of most other peers.
If CBM outcomes are used to estimate the level and trend of student achievement, then there should be some reference to the standard errors that are relevant to each of these interpretations. The standard error of measurement (SEM) is combined with observed level to estimate the true level of student achievement. The standard error of the slope (SEb) is combined with observed slope to estimate the true slope of student achievement. Each of these types of standard error can then be used to derive a confidence interval. These confidence intervals can then be used in practice so that the consumer of the assessment data understands the variability of the measurement.

Calculate Standard Error of Measurement

SEM is relatively easy to calculate. Only two statistics are required to calculate SEM from local normative data: (a) the standard deviation (SD) for performance across the students in the sample or population and (b) an estimate for the reliability of measurement (r). SEM can then be calculated using the formula $SEM = SD\sqrt{1 - r}$. It is relatively easy to construct spreadsheets that will automate this calculation (see Christ, Davie, & Berman, 2006). Researchers and practitioners can derive SEM locally or they might depend on generic estimates from the published literature (Christ, 2006b; Christ et al., 2006; Christ & Silberglitt, 2007).

Generic estimates of CBM-R reliability coefficients have typically been within the range of .90 to .97 (Howe & Shinn, 2002; Marston, 1989). A review of the research literature might suggest that higher levels of reliability can be assumed when the conditions of administration and instrumentation are carefully controlled (e.g., .95 to .97). Estimates in the lower range might be used when the conditions of administration and instrumentation are not carefully controlled (i.e., .90 to .94). These are just guidelines based on the opinions of the authors.
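The SEM calculation can be sketched as a short function. It assumes the conventional classical test theory formula, SEM = SD × √(1 − r), which reproduces the rounded values reported in Table 1. The function name is hypothetical.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Return SEM = SD * sqrt(1 - r) for a reliability r between 0 and 1."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


# Two (SD, r) pairs taken from Table 1, in WRCM.
sem_first_grade_low_r = standard_error_of_measurement(30, 0.90)
sem_fifth_grade_high_r = standard_error_of_measurement(41, 0.97)

print(round(sem_first_grade_low_r), round(sem_fifth_grade_high_r))
```

Because SEM shrinks with the square root of (1 − r), moving from r = .90 to r = .97 roughly halves the SEM for a given SD.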
Estimates of reliability over a variety of contexts will become available in the literature in the upcoming years (based on the research of the first author), and these values are already reported in some technical manuals that accompany published CBM probe sets (e.g., AIMSweb).

Previous research suggests that the standard deviation for within-grade performance on CBM-R is likely to be within the range of 35–45 WRCM through the primary grades (Ardoin & Christ, in press; Christ & Silberglitt, 2007). Values in the lower (35–38), middle (39–41), and upper range (42–45) might be applied to the lower (first and second), middle (third), and upper grades (fourth, fifth, sixth), respectively. There will be some variation across populations. That is, districts, schools, and classrooms composed of more diverse populations in terms of academic performance might expect slightly larger estimates. Populations with less diversity might expect slightly smaller estimates. Standard deviations of local student performances within and across grades are easy to compute with standard spreadsheet software.

The generic estimates for reliability and standard deviation can be combined to yield generic estimates of SEM. Recently published research suggests that the likely range of CBM-R SEM values is 6–13 WRCM (Table 1). The lower end of the range coincides with tightly controlled measurement conditions and the upper end of the range coincides with loosely controlled measurement conditions. The most generic estimates for CBM-R administrations under well-controlled conditions might approximate 5–9 WRCM, with the lowest estimates for lower primary grades and upper estimates for upper primary grades (Christ & Silberglitt, 2007). The resulting estimates of SEM can be used to guide interpretation and construct confidence intervals.

Table 1.
Standard Error of Measurement for Grades by Reliability: CBM-R

                   Estimates of reliability (a), from low r (left) to higher r (right)
Grade     SD (b)   r = .90   r = .94   r = .95   r = .97
First     30       9         7         7         5
Second    34       11        8         8         6
Third     39       12        10        9         7
Fourth    39       12        10        9         7
Fifth     41       13        10        9         7

Note. Standard errors of measurement are reported in words read correct per minute.
(a) Test–retest reliability estimates reported in the professional literature.
(b) SD = estimates of the typical magnitude of standard deviations for CBM-R within grades.

Dr. Christ (on Table 1): Useful as rules of thumb.

Dr. Christ: The big idea in the next section is that conceptually, SEM is for level as standard error of slope (SEb) is for slope/trend. We use them in the same way to estimate confidence intervals.

Standard Error of Slope

It is likely that most school psychologists are familiar with the concept of SEM, and, therefore, it should be fairly simple to calculate and apply SEM to interpret CBM estimates of level. The standard error of the slope SEb might be less familiar and slightly more difficult to calculate. For these reasons, a more thorough explanation of SEb will be provided than was provided for SEM. In short, however, SEb is analogous to SEM and can be used in a similar manner to guide the interpretation of CBM estimates of slope or growth (some details are glossed over to allow for a straightforward presentation).

A linear, or trend, analysis is often used to estimate growth from CBM-R progress monitoring data. Of the available techniques (i.e., visual analysis, split-middle, regression), research has established that ordinary least squares regression is likely to yield the most accurate and precise estimates of growth (Good & Shinn, 1990; Shinn, Good, & Stein, 1989). The ordinary least squares value for slope (b) describes the rate of change in student performance. It is most common to report growth in weekly units (7 days).
In the early primary grades, students are expected to improve within the range of 1–2 WRCM per week, with slightly less growth expected in the later primary grades (Deno, Fuchs, Marston, & Shin, 2001; Fuchs, Fuchs, Hamlett, Walz, & Germann, 1993). Once derived, the observed slope can be compared to the expected slope to evaluate the response to instruction and determine whether students are on pace to meet benchmark goals.

Most spreadsheet software can be used to calculate slope using an ordinary least squares regression function. For example, the slope function can be used within Microsoft Excel. The days or dates comprise the x axis data and are listed either across a row or down a column. The corresponding CBM observations comprise the y axis data and are listed in the cells alongside the corresponding dates when each datum was collected. The slope or linear regression function can then be used to estimate the rate of daily growth and corresponding statistics, which include SEb. The estimate of daily growth and SEb are each multiplied by 7 (for the 7 days per week) to yield weekly rates. For example, if the estimates for daily growth and SEb were .20 and .10 WRCM per day, then the weekly growth and SEb would equal 1.40 WRCM per week and .70 WRCM per week, respectively.

The alternate approach is to estimate growth and apply published estimates of SEb. There are some generic estimates of SEb available in the research literature (Christ, 2006c). It seems that SEb is primarily influenced by the duration of progress monitoring and the standard error of the estimate (SEE). The SEE is the standard deviation of errors around the line of best fit for a particular set of progress-monitoring data. The line of best fit is defined as the straight line that describes progress monitoring data. The line is essentially a continuous stream of point predictions, or likely estimates of true score values across time.

Dr.
Christ: We will talk about SEE in the training, but it is just the variability around the line of best fit.

It was earlier established that observed scores are simply estimates of true score values. If multiple measurements were collected on each occasion then there would be some dispersion of observed score values. The mean, or central estimate, of those observed scores provides the best estimate of the true score. However, there are relatively few measurements collected at each point in time throughout the progress monitoring duration. Therefore, the best estimate of true score values across time is provided by the straight line that minimizes the discrepancy between predicted and observed values. The standard deviation of those discrepancies is the SEE.

It is striking to note that the magnitudes of SEM and SEE converge. That is, research has established that SEM is likely to approximate 6–12 WRCM and SEE is likely to approximate 8–14 WRCM. Published estimates do not correspond exactly, but they are substantially similar. That is, the dispersion around observed and true score estimates corresponds across estimates of both level and trend. That observation lends credence to both sets of estimates and the recommended applications of standard error herein.

It should be noted that estimates of SEb are smaller when progress monitoring durations are longer and when more data are available for evaluation. For example, Christ (2006b) observed that the median SEb after 2 weeks of progress monitoring was likely to approximate 7.96 WRCM (range = 1.59–14.33) and after 5 weeks it was reduced to 2.10 WRCM (range = .42–3.78). The median was further reduced to .41 after 15 weeks (range = .08–.75). The implication is that more data and longer durations improve estimates of growth and improve confidence in the corresponding estimates.
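The slope, SEE, and SEb described above can be computed directly from a student's progress monitoring data. The sketch below uses the textbook ordinary least squares formulas (SEE computed with n − 2 degrees of freedom; SEb = SEE divided by the square root of the sum of squared deviations of the day values). The function name and the six weekly scores are hypothetical, not data from the article.

```python
import math


def slope_with_standard_error(days, scores):
    """Ordinary least squares fit of scores on days.

    Returns (slope per day, SEE, SEb per day).
    """
    n = len(days)
    if n < 3 or n != len(scores):
        raise ValueError("need at least 3 paired observations")
    mean_x = sum(days) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, scores))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residuals are the discrepancies between observed and predicted scores.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(days, scores))
    see = math.sqrt(sse / (n - 2))  # standard error of the estimate
    se_b = see / math.sqrt(sxx)     # standard error of the slope
    return slope, see, se_b


# Hypothetical data: one CBM-R probe per week for six weeks (WRCM).
days = [0, 7, 14, 21, 28, 35]
scores = [20, 24, 21, 27, 26, 30]

daily_slope, see, daily_se_b = slope_with_standard_error(days, scores)
weekly_slope = 7 * daily_slope  # multiply daily values by 7 for weekly rates
weekly_se_b = 7 * daily_se_b

print(weekly_slope, see, weekly_se_b)
```

With the same residual scatter, adding weeks of data increases the spread of the day values, which is one reason SEb falls as the monitoring duration grows.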
Christ also posited that well-controlled measurement conditions and well-developed instrumentation are likely to substantially reduce the magnitude of SEb.

Confidence Interval

Previous research provides evidence to support the validity and reliability of CBM measures (Good & Jefferson, 1998; Marston, 1989). However, it is difficult to translate the meaning of psychometric reliability to the interpretive process. That is, reliability coefficients can be reviewed to determine whether measurement procedures and instruments should be used to guide low-stakes screening-type decisions (r > .70) or high-stakes diagnostic/eligibility decisions (r > .90) (Kelly, 1927; Sattler, 2001). While it is important to check the technical properties of a measure to make sure it has adequate reliability for a particular use, school psychologists should not interpret reliability estimates to establish that assessment outcomes themselves are either reliable or unreliable.

Dr. Christ: This next statement is the BIG IDEA of this article.

The reliability and precision of test scores are distributed across a continuum. No test is either reliable or unreliable. The most efficient way to translate the concept of reliability for interpretation is to calculate and use the SEM or SEb to construct a confidence interval. This allows consumers of the data to see how much variability is present and to determine how that variability has an impact on the decision-making process.

Dr. Christ: There is a typo below: "x +/- 6" should be "x +/- 10". Don't let that confuse you.

As previously discussed, estimates of SEM can be derived from either local normative data or from published estimates. Estimates of SEb can be gleaned from the research literature or be calculated directly from student-specific datasets. In either case, the confidence interval is a multiple of the standard error.
A 68%, 90%, or 95% confidence interval is constructed by multiplying the SEM by 1.00, 1.645, or 1.96, respectively. Therefore, assuming a SEM of 10, the confidence interval (68%) is equal to x +/- 6; the confidence interval (90%) is equal to x +/- 10; and the confidence interval (95%) is equal to x +/- 12. If the observed CBM-R performance were 85 WRCM, then there is a 68% chance that the individual’s true level of performance is within the range of 79–91 WRCM, a 90% chance that their true level of performance is within 75–95 WRCM, and a 95% chance that it is within 73–97 WRCM. These confidence intervals provide both a likely range of performance across repeated administrations and an interval to estimate the true level of typical performance (i.e., true score). Confidence intervals can also be calculated around estimates of growth using SEb in multiples of 1.00, 1.645, and 1.96 for 68%, 90%, and 95% confidence intervals, respectively.

Application and Use

CBM-R frequently demonstrates test–retest reliability above .90. However, it is unclear how such estimates should be applied during interpretation when CBM-R data are used to guide educational decisions. For example, suppose the following set of hypothetical facts: (a) you are part of a problem-solving team that must determine which students should receive early intervention services, (b) the cutoff for early intervention services is established at the 15th percentile, and (c) in the fall of second grade the 15th percentile of CBM-R performance is 20 WRCM. It is difficult to determine the appropriate level of confidence that should be placed in scores falling in the range of 15–25 WRCM. How much confidence should the problem-solving team invest in those CBM-R outcomes? Are additional assessments necessary? As previously discussed, these CBM-R outcomes are estimates of typical performance that would likely change if students were assessed on an alternate day or with an alternate passage.
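The confidence interval arithmetic for an observed level can be sketched as follows. The half-widths of 6, 10, and 12 WRCM and the ranges 79–91, 75–95, and 73–97 in the worked example correspond to an SEM of about 6 WRCM, so this sketch assumes SEM = 6. It also assumes the conventional normal-distribution multipliers of 1.00, 1.645, and 1.96. The function and constant names are hypothetical.

```python
# Conventional normal multipliers for two-sided intervals (assumed here).
Z_MULTIPLIERS = {68: 1.00, 90: 1.645, 95: 1.96}


def confidence_interval(estimate, standard_error, level=68):
    """Return (lower, upper) bounds: estimate +/- multiplier * standard error."""
    half_width = Z_MULTIPLIERS[level] * standard_error
    return estimate - half_width, estimate + half_width


observed_wrcm = 85
sem = 6  # assumed SEM in WRCM, inferred from the example's half-widths

for level in (68, 90, 95):
    lower, upper = confidence_interval(observed_wrcm, sem, level)
    print(f"{level}% interval: {lower:.1f} to {upper:.1f} WRCM")
```

The same function applies to growth estimates: pass the observed weekly slope as `estimate` and the weekly SEb as `standard_error`.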
A best practices approach to CBM-R interpretation requires that the problem-solving team consider the likely magnitude of variation across actual or theoretical administrations. Actual variation could be observed by administering multiple CBM-R probes on multiple occasions. In that case the observed range across a moderate number of administrations could be reported to estimate the magnitude of variability in performance. Those data might be reported with a statement such as:

When administered three second-grade CBM-R probes on each of 3 days, Jason performed an average of 20 WRCM with a range between 13 and 27 WRCM.

That statement communicates both an estimate of central tendency and the variability in student performance across multiple assessments. Similarly, to communicate the student’s rate of growth over time, the actual slope (calculated using software such as Excel) can be reported with a statement such as:

When monitored weekly across 15 weeks using second-grade CBM-R probes, Jason gained an average of two words per week.

However, the observed slope should also be recognized as an estimate of the true slope, and this imprecision should be communicated as well. Using SEb values calculated from the observed data or published estimates of SEb corresponding to the number of weeks of data collection, measurement conditions, and SEE (Table 2), an appropriate interpretation might be:

When monitored weekly across 15 weeks using second-grade CBM-R probes, Jason gained between 1.59 and 2.41 words per week.

That range is the observed slope of 2.00 plus or minus one SEb of .41 (i.e., a 68% confidence interval). Wider intervals can be constructed around estimates of growth by multiplying SEb by 1.645 or 1.96 for 90% and 95% confidence intervals, respectively.
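The slope calculation described above can be sketched as follows. This is a hedged illustration that assumes an ordinary least squares line of best fit and the conventional n - 2 denominator for SEE; the source does not spell out either choice. The function names and the weekly scores are hypothetical. Dividing SEE by the square root of the summed squared deviations of the measurement times has the same algebraic form as SEE/(SD × √n) when the SD of the measurement times is computed with n in the denominator.

```python
import math


def slope_and_standard_error(weeks, scores):
    """Return the OLS slope, SEE, and SEb for one student's progress data."""
    n = len(weeks)
    if n < 3 or n != len(scores):
        raise ValueError("need at least three paired observations")
    mean_x = sum(weeks) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in weeks)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(weeks, scores))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residuals are deviations of observed scores from the line of best fit.
    ss_residual = sum(
        (y - (intercept + slope * x)) ** 2 for x, y in zip(weeks, scores)
    )
    see = math.sqrt(ss_residual / (n - 2))  # assumed n - 2 denominator
    seb = see / math.sqrt(sxx)
    return slope, see, seb


def slope_interval(slope, seb, multiplier=1.00):
    """Return (lower, upper) for slope +/- multiplier * SEb."""
    return slope - multiplier * seb, slope + multiplier * seb


# Hypothetical weekly WRCM scores, invented for illustration only.
weeks = [0, 1, 2, 3, 4]
scores = [20, 23, 24, 25, 28]
slope, see, seb = slope_and_standard_error(weeks, scores)
print(f"slope = {slope:.2f}, SEE = {see:.2f}, SEb = {seb:.2f}")

# Slope of 2.00 with the published SEb of .41 used in the example statement.
low, high = slope_interval(2.00, 0.41)
print(f"{low:.2f} to {high:.2f} words per week")  # 1.59 to 2.41
```

Passing a multiplier of 1.645 or 1.96 to `slope_interval` gives the corresponding 90% or 95% interval.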
When data from multiple administrations are not available to observe the range of performance across multiple assessment occasions, school psychologists can rely on estimates of SEM, which provide the range of likely performance across multiple theoretical administrations. Those data might be reported with a statement such as:

When administered three second-grade CBM-R probes on a single day, Jason’s typical level of performance was estimated to be 20 +/- 6 WRCM.

That statement communicates both the median level of performance (i.e., 20 WRCM) and the SEM (i.e., 6 WRCM), which translates to a 68% confidence interval (i.e., 14–26 WRCM). Other more technical and precise language could also be used to communicate that there is a 68% chance that Jason’s true score falls within +/- 6 WRCM of his observed performance. Regardless of the language used, it is important to communicate that no individual observation or test score represents the stable and absolute level of an individual’s true score/performance. The procedures to estimate SEM and construct confidence intervals are provided above.

Summary

CBM comprises a set of measurement procedures that are uniquely suited to inform problem solving and RTI. The prominence, use, and emphasis on CBM are likely to increase in coming years due, in part, to the 2004 reauthorization of IDEA and the advent of RTI models (i.e., multitiered and dual discrepancy). While CBM has historically been used to guide low-stakes classroom and screening-type decisions, it is likely that CBM will be used to guide high-stakes diagnosis and eligibility-type decisions in the future. As this higher-stakes use of CBM data comes into daily practice, school psychologists must remain ever vigilant that they promote best practices in test score interpretation. In part, that requires some reference to the consistency, reliability, and sensitivity of measurement outcomes.
School psychologists should reference and report confidence intervals to support the valid use and interpretation of measurement outcomes. Failing to present data in such a manner would likely result in misinterpretation of results, thereby leading to inappropriate decisions whose impact is harder to correct as the stakes get higher. School psychologists should consider the likely magnitude of dispersion and variability when the typical level of CBM performance is reported. Three potential solutions were offered here: (a) repeatedly assess students across days with multiple probes and report performance in a graphic format that supports visual analysis of level and variability; (b) repeatedly assess students and report the range of performance across days and probes in a numeric format, which might include an average score and standard deviation; and (c) administer fewer assessments and report the SEM. A similar set of options is available when growth, trend, or slope is estimated. More measurement and a more robust dataset are often preferred. However, resources are limited, so a balance must be struck between the available resources, how those resources are allocated, and the frequency of measurement. Reference to standard error and confidence intervals might help guide decisions regarding the frequency of measurement and issues of interpretation.

Table 2. Standard Error of the Slope Estimate by Progress Monitoring Duration in Weeks: CBM-R

Dr. Christ: Don't worry about this table. It is more technical than you need.

Weeks of       SEE^c
progress       Optimal^d            Moderate^d           Poor^d
monitoring^a,b 2      4      6      8      10     12     14     16     18
2              1.59   3.18   4.78   6.37   7.96   9.55   11.14  12.74  14.33
3              .88    1.77   2.65   3.53   4.41   5.30   6.18   7.06   7.95
4              .58    1.16   1.74   2.32   2.91   3.49   4.07   4.65   5.23
5              .42    .84    1.26   1.68   2.10   2.52   2.94   3.36   3.78
6              .32    .64    .96    1.28   1.61   1.93   2.25   2.57   2.89
7              .26    .51    .77    1.02   1.28   1.54   1.79   2.05   2.30
8              .21    .42    .63    .84    1.05   1.26   1.47   1.68   1.89
9              .18    .35    .53    .71    .88    1.06   1.24   1.41   1.59
10             .15    .30    .45    .61    .76    .91    1.06   1.21   1.36
11             .13    .26    .39    .53    .66    .79    .92    1.05   1.18
12             .12    .23    .35    .46    .58    .69    .81    .92    1.04
13             .10    .21    .31    .41    .51    .62    .72    .82    .92
14             .09    .18    .28    .37    .46    .55    .64    .74    .83
15             .08    .17    .25    .33    .41    .50    .58    .66    .75

Note. Standard error of the slope reported in words read correct per minute per week.
^a Estimates based on the assumption that two data points are collected weekly.
^b Reported for weekly standard errors of the slope (SEb = SEE / [SD_days × √n]).
^c SEE is the average standard deviation in words read correctly per minute from predicted CBM-R performances along the line of best fit.
^d Qualitative descriptor for measurement conditions based on the authors’ subjective evaluation.

In the end, CBM is only one source and method of data collection. A multisource and multimethod approach is necessary to guide high-stakes decisions. The convergence of evidence across sources and methods is the critical factor. That does not, however, relieve the burden to promote the most valid interpretations and applications of measurement outcomes, including those collected with CBM procedures.

Dr. Christ: Now, wasn't that easy? We will discuss all this in the training.

References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational and Psychological Research Association.
Ardoin, S. P., & Christ, T. J. (in press). Evaluating curriculum-based measurement slope estimate using data from triannual universal screenings. School Psychology Review.
Barlow, D. H., & Hersen, M. (1984). Single-case experimental designs: Strategies for studying behavior change (2nd ed.). New York: Pergamon Press.
Burns, M., & Coolong-Chaffin, M. (2006). Response-to-intervention: The role of and effect on school psychology. School Psychology Forum: Research in Practice, 1, 1–13.
Christ, T. J. (2006a). Curriculum-based measurement math: Toward improving instrumentation and administration procedures. Paper presented at the annual meeting of the National Association of School Psychologists, Anaheim, CA.
Christ, T. J. (2006b). Does CBM have error? Standard error and confidence intervals. Paper presented at the annual meeting of the National Association of School Psychologists, Anaheim, CA.
Christ, T. J. (2006c). Short-term estimates of growth using curriculum-based measurement of oral reading fluency: Estimates of standard error of the slope to construct confidence intervals. School Psychology Review, 35, 128–133.
Christ, T. J., Davie, J., & Berman, S. (2006). CBM data and decision making in RTI contexts: Addressing performance variability. Communiqué, 35, 29–31.
Christ, T. J., Johnson-Gros, K., & Hintze, J. M. (2005). An examination of computational fluency: The reliability of curriculum-based outcomes within the context of educational decisions. Psychology in the Schools, 42, 615–622.
Christ, T. J., & Schanding, T. (2007). Practice effects on curriculum based measures of computational skills: Influences on skill versus performance analysis. School Psychology Review, 36, 147–158.
Christ, T. J., & Silberglitt, B. (2007). Curriculum-based measurement of oral reading fluency: The standard error of measurement. School Psychology Review, 36, 130–146.
Christ, T. J., & Vining, O. (2006). Curriculum-based measurement procedures to develop multiple-skill mathematics computation probes: Evaluation of random and stratified stimulus-set arrangements. School Psychology Review, 35, 387–400.
Colon, E. P., & Kranzler, J. H. (2006). Effect of instructions on curriculum-based measurement of reading. Journal of Psychoeducational Assessment, 24, 318–328.
Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. Orlando, FL: Harcourt Brace.
Deno, S. L. (1985). Curriculum-based measurement: The emerging alternative. Exceptional Children, 52, 219–232.
Deno, S. L. (2003). Developments in curriculum-based measurement. Journal of Special Education, 37, 184–192.
Deno, S. L., Fuchs, L. S., Marston, D., & Shin, J. (2001). Using curriculum-based measurement to establish growth standards for students with learning disabilities. School Psychology Review, 30, 507–524.
Derr-Minneci, T. F. (1990). A behavioral evaluation of curriculum-based assessment for reading: Tester, setting, and task demand effects on high- vs. average- vs. low-level readers. Dissertation Abstracts International, 51, 2669.
Derr-Minneci, T. F., & Shapiro, E. S. (1992). Validating curriculum-based measurement in reading from a behavioral perspective. School Psychology Quarterly, 7, 2–16.
Fuchs, D., Fuchs, L. S., McMaster, K. N., & Al Otaiba, S. (2003). Identifying children at risk for reading failure: Curriculum-based measurement and the dual-discrepancy approach. In H. L. Swanson, K. R. Harris, & S. Graham (Eds.), Handbook of learning disabilities (pp. 431–449). New York: Guilford Press.
Fuchs, D., Mock, D., Morgan, P. L., & Young, C. L. (2003). Responsiveness-to-intervention: Definitions, evidence, and implications for the learning disabilities construct. Learning Disabilities Research & Practice, 18, 157–171.
Fuchs, L. S., & Deno, S. L. (1991). Paradigmatic distinctions between instructionally relevant measurement models. Exceptional Children, 57, 488–500.
Fuchs, L. S., & Deno, S. L. (1992). Effects of curriculum within curriculum-based measurement. Exceptional Children, 58, 232–242.
Fuchs, L. S., Fuchs, D., Hamlett, C. L., Walz, L., & Germann, G. (1993). Formative evaluation of academic progress: How much growth can we expect? School Psychology Review, 22, 27–48.
Good, R. H., & Jefferson, G. (1998). Contemporary perspectives on curriculum-based measurement validity. In M. R. Shinn (Ed.), Advanced applications of curriculum-based measurement (pp. 61–88). New York: Guilford Press.
Good, R. H., & Shinn, M. R. (1990). Forecasting accuracy of slope estimates for reading curriculum-based measurement: Empirical evidence. Behavioral Assessment, 12, 179–193.
Hintze, J. M., & Christ, T. J. (2004). An examination of variability as a function of passage variance in CBM progress monitoring. School Psychology Review, 33, 204–217.
Hintze, J. M., Christ, T. J., & Keller, L. A. (2002). The generalizability of CBM survey-level mathematics assessments: Just how many samples do we need? School Psychology Review, 31, 514–528.
Hintze, J. M., Christ, T. J., & Methe, S. A. (2006). Curriculum-based assessment. Psychology in the Schools, 43, 45–56.
Howe, K. B., & Shinn, M. M. (2002). Standard Reading Assessment Passages (RAPs) for use in general outcome measurement: A manual describing development and technical features. Retrieved April 11, 2007, from www.aimsweb.com
Kelly, T. L. (1927). Interpretations of educational measures. Yonkers, NY: World Book.
Marston, D. B. (1989). A curriculum-based measurement approach to assessing academic performance: What it is and why do it. In M. R. Shinn (Ed.), Curriculum-based measurement: Assessing special children (pp. 18–78). New York: Guilford Press.
Sattler, J. M. (2001). Assessment of children: Cognitive applications (4th ed.). San Diego, CA: Author.
Shinn, M. R. (2002). Best practices in using curriculum-based measurement in a problem-solving model. In A. Thomas & J. Grimes (Eds.), Best practices in school psychology IV (pp. 671–698). Bethesda, MD: National Association of School Psychologists.
Shinn, M. R., Gleason, M. M., & Tindal, G. (1989). Varying the difficulty of testing materials: Implications for curriculum-based measurement. Journal of Special Education, 23, 223–233.
Shinn, M. R., Good, R. H., & Stein, S. (1989). Summarizing trend in student achievement: A comparison of methods. School Psychology Review, 18, 356–370.
Spearman, C. (1904). The proof and measurement of associations between two things. American Journal of Psychology, 15, 72–101.
Spearman, C. (1907). Demonstration of formula for true measurement of correlation. American Journal of Psychology, 18, 161–169.
Spearman, C. (1913). Correlations of sums and differences. British Journal of Psychology, 5, 417–426.
Tawney, J. W., & Gast, D. L. (1984). Single subject research in special education. New York: Merrill.