EXPLORING THE ADDED VALUE OF A NUMBER LINE ASSESSMENT FOR KINDERGARTEN MATHEMATICS SCREENING

by

DAVID J. FURJANIC

A DISSERTATION

Presented to the Department of Special Education and Clinical Sciences and the Graduate School of the University of Oregon in partial fulfillment of the requirements for the degree of Doctor of Philosophy

June 2021

DISSERTATION APPROVAL PAGE

Student: David J. Furjanic

Title: Exploring the Added Value of a Number Line Assessment for Kindergarten Mathematics Screening

This dissertation has been accepted and approved in partial fulfillment of the requirements for the Doctor of Philosophy degree in the Department of Special Education and Clinical Sciences by:

Ben Clarke, Chairperson and Advisor
Hank Fien, Core Member
Lillian Duran, Core Member
Joseph Nese, Core Member
Gerald Tindal, Institutional Representative

and

Kate Mondloch, Interim Dean of the Graduate School

Original approval signatures are on file with the University of Oregon Graduate School. Degree awarded June 2021.

© 2020 David J. Furjanic

DISSERTATION ABSTRACT

David J. Furjanic
Doctor of Philosophy
Department of Special Education and Clinical Sciences
June 2021

Title: Exploring the Added Value of a Number Line Assessment for Kindergarten Mathematics Screening

Despite the importance of mathematical understanding for academic and occupational success, students in the United States are not meeting necessary levels of mathematics achievement. Multi-tiered systems of support (MTSS) provide a framework for schools to allocate resources to best support students. Universal screening, a key element of MTSS, employs brief assessments of critical academic skills to identify at-risk students. Despite advances in the screening for reading risk, research in mathematics screening is lacking. Current early numeracy screeners target number sense with mixed results. The mental number line is a potential construct for developing more advanced screening measures.
The mental number line is a key developmental construct around which students organize their thinking and upon which they draw when working with elementary mathematics topics. The current study will explore the promise of using a number line assessment as part of a mathematics screening battery to identify students at risk.

CURRICULUM VITAE

NAME OF AUTHOR: David J. Furjanic

GRADUATE AND UNDERGRADUATE SCHOOLS ATTENDED:
University of Oregon, Eugene
Millersville University of Pennsylvania, Millersville
The Pennsylvania State University, State College

DEGREES AWARDED:
Master of Science, Psychology, 2017, Millersville University of Pennsylvania
Bachelor of Science, Psychology, 2013, The Pennsylvania State University

AREAS OF SPECIAL INTEREST:
Multi-Tiered Systems of Supports in K-12 Education
Data-Based Decision-Making
Equity in School Practices

PROFESSIONAL EXPERIENCE:
Advanced Practicum Student, Center on Teaching and Learning, University of Oregon, Eugene, 2018 to 2020
Diversity and Retention Graduate Employee, Graduate School, University of Oregon, Eugene, 2019 to 2020
Teaching Assistant Graduate Employee, College of Education, University of Oregon, Eugene, 2018 to 2019
School Psychology Intern, Derry Township School District, Hershey, Pennsylvania, 2016 to 2017
Graduate Assistant, Psychology Department, Millersville University, Millersville, 2015 to 2016
Therapeutic Support Staff, Pennsylvania Counseling Services, Harrisburg, Pennsylvania, 2013 to 2014

PUBLICATIONS:
Clarke, B., Nelson, N., Kosty, D., Ketterlin-Geller, L., Smolkowski, K., Lesner, T., Furjanic, D., & Fien, H. (Under review). Investigating the promise of a tier two sixth grade fractions intervention.
Sutherland, M., Clarke, B., Nese, J. F. T., Strand Cary, M., Shanley, L., Furjanic, D., & Durán, L. (Submitted for review). Investigating the utility of a kindergarten number line assessment compared to an early numeracy screening battery.

ACKNOWLEDGEMENTS

I thank Dr.
Ben Clarke for his assistance in the preparation of this manuscript. I would also like to thank Drs. Joseph Nese, Lillian Durán, Gerald Tindal, and Hank Fien for contributing their expertise as I refined this study. I am sincerely grateful for the Special Education and Clinical Sciences faculty, staff, and students as well as my family and loved ones for their boundless intellectual, emotional, and social support throughout my scholarship at the University of Oregon. The research reported here was supported in part by the Institute of Education Sciences, U.S. Department of Education, through Grants R305K040081 and R305A080699 to the Center on Teaching and Learning at the University of Oregon and Pacific Institutes for Research.

TABLE OF CONTENTS

Chapter

I. INTRODUCTION
    The State of Mathematics Achievement in the United States
    Addressing Mathematics Needs Via Screening for Risk
    Early Mathematics Screening
        The Components of Number Sense
            Counting
            Number Knowledge
            Number Operations
            Symbolic Number Understanding
        Screening for Number Sense
        The Field's Approach to Screening
    The Promise of the Number Line
    State of the Problem: The Current Study
    Research Questions

II. METHOD
    Participants
    Procedures
    Measures
        Number Line Estimation Task
        Assessing Student Proficiency of Number Sense
        Number Sense Brief
        The Stanford Early School Achievement Test – Tenth Edition
    Analyses

III. RESULTS
    Research Question 1
        Research Question 1A: Association Among Early Numeracy Measures
        Research Question 1B: Explained Variance
        Research Question 1C: Classification Accuracy
    Research Question 2: Item-Level Analyses

IV. DISCUSSION
    Summary and Interpretation of Results
        Research Question 1A: Association Among Early Numeracy Measures
        Research Question 1B: Explained Variance
        Research Question 1C: Classification Accuracy
        Research Question 2: Item-Level Analyses
    Implications
    Limitations and Future Research
    Conclusion

REFERENCES CITED

LIST OF FIGURES

Figure

1. Distributions of the Early Numeracy Measures
2. Normal Quantile Plot for Spring NSB Scores
3. Normal Quantile Plot for Spring SESAT Scores
4. Simple Linear Regressions
5. Residual Plots of the Spring NSB Scores
6. Residual Plots of the Spring SESAT Scores
7. Correlations Among Study Measures
8. ROC Curve Comparing Fall NLT and Fall ASPENS to the Spring NSB
9. ROC Curve Comparing Fall NLT and Fall ASPENS to the Spring SESAT
10. Distribution of the Individual NLT Items
11. Residuals of the Spring NSB Regressed on the Fall NLT Item Scores
12. Residuals of the Spring SESAT Regressed on the Fall NLT Item Scores

LIST OF TABLES

Table

1. Linear Regression Models Conducted for Research Question 1
2. Descriptive Statistics of Early Numeracy Measures
3. Correlations Among All Measures
4. Regression Results Predicting Spring NSB Performance
5. Regression Results Predicting Spring SESAT Performance
6. Unique Variance Explained in the Outcome Measures
7. AUC for the Fall Screening Measures to the Spring Outcome Measures
8. Classification Accuracy and Cut Scores for Fall Screeners
9. Descriptive Statistics of the NLT Items and Outcome Measures
10. Correlations Among the NLT Items and Early Numeracy Measures
11. Regression Results Predicting Spring NSB Performance
12. Regression Results Predicting Spring SESAT Performance
13. Unique Variance Explained in the Outcome Measures
14. AUC for the Individual NLT Items to the Spring Outcome Measures

I.
INTRODUCTION

The State of Mathematics Achievement in the United States

Calls to improve mathematics achievement for our nation’s students have been spurred by consistent and long-standing patterns of low achievement by students in the United States (National Mathematics Advisory Panel, 2008; National Research Council, 2001). The 2019 National Assessment of Educational Progress (NAEP) found that fewer than half of fourth grade students were proficient in mathematics. NAEP proficiency levels have remained relatively stable over the past decade and a half, demonstrating a protracted concern. Of even greater concern, average scores for historically disadvantaged students (such as students eligible for free/reduced lunch, attending urban schools, or identified with disabilities) had statistically significant drops from previous years (National Center for Education Statistics, 2015; National Center for Education Statistics, 2005-2019; OECD, 2012).

Comparisons to international peers further illustrate the depth of the problem. Students in the United States are ranked in the lower half of students worldwide, with performance gaps increasing as students advance across grades (Olson, Martin, & Mullis, 2008). Without secure mathematics skills, United States students will be hindered in competing in both international and domestic job markets, as high-demand professions, including those in science, technology, engineering, and mathematics (STEM), increasingly rely upon a strong mathematical foundation (National Science Board, 2015). Mathematics difficulties reverberate beyond schooling and affect basic functional tasks for adults. Over half of adults cannot calculate a 10% tip for a meal and even more cannot calculate miles per gallon on a trip (Phillips, 2007).
While a secure foundation in mathematics can afford opportunities, an insecure foundation in mathematics has educational, occupational, and functional life implications for our students (National Mathematics Advisory Panel, 2008).

American adults who struggle with mathematics are, by and large, products of the American school systems (McClure et al., 2017; Watts, Duncan, Siegler, & Davis-Kean, 2014). Numerical knowledge at age 7, or typically first grade, predicts socioeconomic status at age 42 even when controlling for IQ, reading achievement, and familial SES (Ritchie & Bates, 2013). Students enter school exhibiting individual differences, and these differences compound, resulting in achievement gaps that expand over time (Bodovski & Farkas, 2007; Jordan, Kaplan, & Hanich, 2002; Judge & Watson, 2011; Schulte & Stevens, 2015; Wei, Lenz, & Blackorby, 2013). Utilizing a large national dataset, the Early Childhood Longitudinal Study – Kindergarten Cohort (ECLS-K), Morgan and colleagues (2009) found that kindergarten students who both entered and exited kindergarten below the 10th percentile in mathematics had a 70% chance of still being below the 10th percentile in fifth grade. Given the cumulative nature of mathematics understanding, early foundational gaps in knowledge limit the acquisition of more advanced content (Duncan et al., 2007; Hiebert & Wearne, 1996; Judge & Watson, 2011).

The importance of serving at-risk students early is underscored by the fact that preventing academic difficulties saves significant time and resources compared to remediation approaches (Fletcher & Vaughn, 2009; Torgesen, 2000, 2002; Torgesen et al., 2001; Vaughn & Wanzek, 2014; Vaughn et al., 2011; Walker et al., 1996). Morgan and colleagues’ (2009) study found that of those students who entered kindergarten below the 10th percentile but exited kindergarten above the 10th percentile, only 30% were below the 10th percentile in fifth grade.
This finding demonstrates the potential of early intervention during this period to shift students onto more favorable learning trajectories. The promise of early intervention has been codified within the reauthorization of IDEA (Individuals with Disabilities Education Act, 2004). Aligned with calls by the field (Gersten et al., 2009), IDEA (2004) emphasizes the prevention of protracted learning difficulties through early identification and intervention.

Addressing Mathematics Needs Via Screening for Risk

Educators and schools can promote favorable trajectories for students. One avenue for promoting students’ success is universally screening all students to identify who is most at risk for academic difficulties. Universal screening involves administering a brief assessment to all students in a school, typically at three timepoints throughout the year. Screening data are used to guide decisions about which individual students are at risk and also to systematically gauge the health of the system as a whole (Albers & Kettler, 2014; Shinn, 2006; Simmons et al., 2000). Screeners have the potential to supply schools with critical information for serving their students.

Screening is a key feature in Multi-Tiered Systems of Support (MTSS) or Response to Intervention (RTI) frameworks. However, schools across the country utilize screening to varying degrees. Despite over 70% of schools reporting using an MTSS/RTI framework to support reading development, only 35% reported using this framework for mathematics (Balu et al., 2015). Successful MTSS implementation requires substantial resources, tools, training, and commitment on the part of a school or district (Fletcher & Vaughn, 2009; D. Fuchs & Fuchs, 2017). Critical to a successful MTSS framework are the measures around which universal and targeted decisions are made (Balu et al., 2015; D. Fuchs & Fuchs, 2017).
Early Mathematics Screening

The value of a screener hinges upon its ability to assess important constructs in a given content area. Early numeracy screeners most commonly assess aspects of number sense (Gersten et al., 2012). Number sense, while its definition varies across the field, is best described as a series of interrelated early mathematical competencies that serve as a foundation for the acquisition of more advanced concepts (Feigenson, Libertus, & Halberda, 2013; Griffin, Case, & Siegler, 1994; Jordan, Glutting, & Ramineni, 2010; Jordan, Kaplan, Olah, & Locuniak, 2006; Jordan, Kaplan, Ramineni, & Locuniak, 2009; Siegler & Lortie-Forgues, 2014; Siegler, Thompson, & Schneider, 2011; Starr, Libertus, & Brannon, 2013). Number sense screeners are created with the developmental progression of number sense in mind.

Young children exhibit the precursors to number sense prior to formal schooling. The development of number sense begins perceptually before children can visualize or cognitively represent and manipulate numbers. Initially, children need to engage with tangible quantities (such as balls, dots, or patterns). As they develop, they can visualize quantities and patterns with imagined objects. Once school-aged, children extend their ability to reason with numbers to greater quantities and transition from informal to formal number sense. This transition is, in part, aided by the introduction of and continued engagement with symbolic numbers. Students with well-developed number sense can fluently and accurately reason with, manipulate, and problem-solve with numbers and quantities in a base-ten system (Berch, 2005; Case et al., 1996; Gersten, Jordan, & Flojo, 2005).

Number sense screeners are intentionally created to capture students’ proficiency at various points in mathematical development. Furthermore, number sense is an amalgamation of interrelated skills and the developmental progressions of these skills look different.
Among the interrelated skills invoked in number sense are counting, magnitude comparison, number operations, and symbolic numerical understanding (Case, 1998; Clements, Sarama, & DiBiase, 2003; Cross, Woods, & Schweingruber, 2009).

The Components of Number Sense

Counting. Counting is an essential foundation to developing number sense (Hudson & Miller, 2005). As young as infancy, children exhibit the first signs of number sense through numerosity, or the beginning stages of understanding quantity (Gallistel & Gelman, 1992). Infants can perceptually subitize, or recognize small quantities without systematically counting (Clements, 1999; Starkey & Cooper, 1980; Wynn, Bloom, & Chiang, 2002), which includes the ability to discriminate that an array of 4 items is different from an array of 2 (Starkey, Spelke, & Gelman, 1990). By eighteen months, they exhibit greater understanding in being able to identify that the array of 4 is greater than the array of 2 (Cooper Jr, 1984).

Between the ages of two and three, children begin to learn the number words from one to ten. As children start to grapple with these number words and counting in parallel, they learn to associate each word with an object (Baroody, 2002; Wagner & Walters, 1982). In counting to a number, the child associates meaning to each item in the sequence as they touch each object once with an accompanying word (Cross et al., 2009; Mix, Huttenlocher, & Levine, 2002). This skill of associating each object once and only once with a number while counting is called one-to-one correspondence. One-to-one correspondence sets the foundation for children to then learn cardinality, or the understanding that the last word said in the sequence holds meaning for the collection. In counting a group of five objects, “five” as the final word represents the set as a whole (Clements et al., 2003).
Counting, as an aspect of number sense, manifests first as an ability to recognize items in a set and then gradually develops into an ability to attach specific meaning to quantities. As their list of known number words and numerals grows, children extend to greater quantities their ability to count using one-to-one correspondence in a fixed order and with cardinality (Clements et al., 2003; Cross et al., 2009).

Number Knowledge. Number knowledge, another skill invoked in number sense, first manifests when children begin to compare sets of objects. Before they understand numerical quantities, children often rely upon perceptual cues to decide which set is greater, such as which set is spaced farther apart (Cross et al., 2009). The well-known example of this behavior is children’s lack of understanding of conservation. Imagine showing a child two sets of teddy bear counters each containing four teddy bears. The child may agree the two sets are equivalent if they look similar. Now imagine if one of the sets were adjusted so that its four teddy bears are spread out in a long line. A child who doesn’t understand conservation would likely assert that the long line of teddy bears is now greater than its twin because of the child’s reliance upon perceptual cues.

Once children become familiar with number words, they tend to rely upon these to make their judgments. A four-year-old, for example, would count each set of items and decide which is greater based on which number word was “farther” down the list or number line (Clements et al., 2003; Cross et al., 2009). Children in formal schooling increasingly extend their ability to reference their number word list to make judgments about magnitudes.
Children in kindergarten leverage their budding knowledge of magnitude to conceptually subitize, or understand how larger quantities are composed of groups of smaller, more familiar quantities (Cross et al., 2009; Griffin, 2004; Jordan, Glutting, & Ramineni, 2010; Jordan et al., 2009). By age 6, children integrate their knowledge of counting and magnitudes into a mental number line (Siegler & Booth, 2004). Referred to as a “central conceptual structure,” the mental number line is theorized to enable children to access the quantitative world in a way they could not previously (Griffin, 2002).

Number Operations. The beginnings of children’s ability to engage in number operations are also evident before the school-age years. Children as young as two exhibit preverbal mathematical numeration. They can recognize basic number operations of adding or taking away one object. For example, a toddler observing two balls being placed into a box and one being removed would expect one to remain (Wynn, 1992). Similarly, young children can recognize which set is greater when one item is added to only one of two equivalent groups. It is not until the age of five, however, that children can judge magnitudes for collections that did not begin equivalently (Clements et al., 2003; Cooper Jr, 1984).

As children simultaneously develop their ability to count, they rely upon this skill to manipulate numbers. To add three to a set of four, children may initially count each set and then count the two sets together. Children advance into being able to count to four and directly continue counting three more times to reach seven. As children understand that the two addends are subsumed within the total, they can start counting at one of the quantities, such as four, and count on to the total, seven, from there (Clements et al., 2003).

Symbolic Number Understanding.
Symbolic number understanding, such as recognizing that the numeral “2” represents a set of two items, is also crucial to number sense. To secure this skill, students must be able to recognize the form of the numeral, produce the form accurately, and attach the correct meaning to the form. For example, a young child with secure number sense would be able to recognize a printed 6, reproduce the numeral, and understand that it represents a set of six items (Clements et al., 2003).

Screening for Number Sense

Number sense comprises key early numeracy skills that enable children to engage with more advanced mathematics (Gersten et al., 2005; Jordan et al., 2009). Due to the importance of early trajectories, early numeracy researchers have focused measure development on assessing components of this foundational construct. In a review of screeners, Gersten et al. (2012) make clear the prominence of number sense measures as a means to predict risk in mathematics. Researchers have leveraged observable tasks such as magnitude comparison or strategic counting in order to measure the construct of number sense (Chard et al., 2005; Conoyer, Foegen, & Lembke, 2016; Gersten et al., 2012; Lembke & Foegen, 2009; Mazzocco, 2005; Seethaler & Fuchs, 2010).

For example, VanDerHeyden et al. (2001) administered a set of three one-minute group-administered measures tapping into early components of number sense. Kindergarten students (n = 107) counted circles and wrote the numeral of the total, counted circles and selected the number of circles from a set, and drew a number of circles corresponding to a numeral. Using a smaller sample of students, the researchers explored validity of the measures in relation to retention at the end of the year. They found that the scores correctly predicted retention in 71.4% (or 5/7) of cases and promotion in 94.4% (or 17/18) of cases. Concurrent validity correlations ranged from .44 to .61.
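As a quick arithmetic check on the retention and promotion figures reported above for VanDerHeyden et al. (2001), the minimal sketch below recomputes the two rates from the stated counts and derives the overall proportion of correct predictions they imply. The variable and function names are illustrative assumptions, and the pooled figure is calculated here from the reported counts; it is not a statistic the text attributes to the original study.

```python
# Counts as stated in the text: correct predictions / cases in each group.
retained_correct, retained_total = 5, 7
promoted_correct, promoted_total = 17, 18


def percent(correct: int, total: int) -> float:
    """Return correct/total as a percentage rounded to one decimal place."""
    return round(100 * correct / total, 1)


retention_rate = percent(retained_correct, retained_total)
promotion_rate = percent(promoted_correct, promoted_total)

# Pooled rate (an illustrative derivation, not a reported result):
# all correct predictions over all cases in the validity subsample.
overall_rate = percent(
    retained_correct + promoted_correct,
    retained_total + promoted_total,
)

print(retention_rate, promotion_rate, overall_rate)
```

Because the subsample contains only 25 students, each rate rests on very few cases: a single additional misclassified student in the retained group would move that rate by roughly 14 percentage points.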
The Number Knowledge Test (NKT; Okamoto & Case, 1996) was explored by Baker et al. (2002) and Gersten, Jordan, and Flojo (2005) with a sample of more than 200 kindergarten students. The NKT is a 10- to 15-minute individually administered assessment of a student’s procedural and conceptual knowledge of whole numbers. The NKT assesses components of number sense through increasingly complex counting, magnitude comparison, and number operation tasks. The researchers found that the NKT exhibited strong predictive validity to end-of-first grade outcomes on the SAT-9 Total Mathematics (r = .73).

Clarke and Shinn (2004) and Clarke, Baker, Smolkowski, and Chard (2008) assessed three components of number sense (number identification, quantity discrimination, and missing number) with kindergarten (n = 52) and first grade students (n = 111; with 1-10 and 1-20 target numbers, respectively). In the first task, students identified given numerals. For quantity discrimination, students were given similar stimuli and chose the greater quantity in a pair of numerals. For missing number, students identified the number missing from a sequence of three consecutive units with the missing number in the first, middle, or last position (e.g., __, 4, 5 or 6, __, 8). Predictive validities were strong across both studies, ranging from .62 to .64 with standard achievement tests (Woodcock-Johnson Applied Problems subtest and the Stanford Early School Achievement Test, respectively).

In a 4-year longitudinal study, Mazzocco and Thompson (2005) followed 226 students from kindergarten to third grade to determine the best assessment or battery to predict mathematics difficulty. Measures included mathematics achievement, formal and informal mathematics ability, visual-spatial reasoning, and rapid automatized naming assessments. Four items within the battery best predicted mathematics difficulty (here defined as performance below the 10th percentile on a third-grade comprehensive mathematics measure).
The items that best predicted math difficulty (reading numerals, number constancy, magnitude judgments, and mental addition of one-digit numbers) were all mathematics items and associated with components of number sense. Most importantly, these four items correctly classified 84% of third-grade students at risk based upon their performance in kindergarten.

Seethaler and Fuchs (2010) also explored screeners that assessed different components of number sense. The researchers administered a magnitude comparison measure and a multiple proficiency measure, Number Sense, to 196 kindergarten students in the fall and spring. At the end of first grade, they administered The Early Math Diagnostic Assessment and the KeyMath-Revised. Predictive validity of the fall screeners to the spring outcome measures ranged from .52 to .72. Classification accuracy was relatively high across both methods, ranging from .67 to .86.

Hampton et al. (2012) administered six measures tapping into number sense (counting, number identification, missing number, quantity discrimination, next number, and number facts) to kindergarten (n = 71) and first grade (n = 75) students weekly. The researchers found predictive validities from fall to spring, ranging from .26 to .52, with the Broad Math Score of the Woodcock-Johnson Battery of Achievement-III; these values are small to large by conventional benchmarks (J. Cohen, 1992).

Across these seminal screening studies and others (L. S. Fuchs et al., 1994; Lee & Lembke, 2016; Lembke & Foegen, 2009), researchers have leveraged the construct of number sense to develop early numeracy screeners and to examine the relationship between current and future achievement in mathematics. Despite earnest efforts, research on early numeracy screeners has not produced optimal screening measures.
Gersten et al.’s (2012) review found the median predictive validities for kindergarten students on magnitude comparison and strategic counting measures were “moderate” at .50 and .48, respectively.

The Field’s Approach to Screening

In seeking to refine screening practices, researchers have explored various structural approaches. Most mathematics screeners adopt a curriculum-sampling approach. Sampling from the curriculum tends to provide more information about specific domains within a grade rather than general mathematics proficiency (Foegen, Jiban, & Deno, 2007). Pulling from curricular objectives is useful for instructional decision-making and progress monitoring throughout a year but less so for screening in the fall (VanDerHeyden, Codding, & Martin, 2017). Drawing upon skills which students have not yet been taught often produces a floor effect, where a tool is unable to discriminate students along a spectrum because too many students scored within a narrow band close to the bottom. A floor effect hinders the tool’s ability to detect which students are at risk and/or whether a systemic problem exists (VanDerHeyden et al., 2017).

An alternative to a single curriculum-sampled tool is a net of screeners that span multiple domains. This approach, in part necessitated by the nonlinear development of mathematics skills, tends to be more predictive of math performance than single-skill screeners (Gersten et al., 2012; Seethaler & Fuchs, 2010). For example, VanDerHeyden, Codding, and Martin (2017) found that a combined screening net of multi-skill computation, single-skill computation, and concepts/applications tasks had high diagnostic accuracy for fourth and fifth grade students. An elaborate net of multiple screening measures may more accurately determine students at risk, but each additional measure included in a screening battery costs significant amounts of instructional and personnel time.
Educators must weigh the relative costs of each measure in their screening battery against the value of the information it provides. Rather than comprehensively sampling across every mathematics domain for a given grade, established batteries of selected measures could be supplemented or replaced with an assessment of a central concept that integrates multiple skills.

The Promise of the Number Line

The mental number line is theorized to be a central concept around which students organize their mathematical thinking (Case et al., 1996; Laski & Siegler, 2007; Schneider et al., 2018; Siegler, 2016; Siegler et al., 2011). As children first grapple with numbers, they begin to place these quantities along a mental number line. The number words they learn take shape in a linear fashion as they understand each successive number is one greater than that before it (Clements et al., 2003). As they encounter larger quantities, their mental number line expands outwards to accommodate. At this stage, the number line may be more aptly called a number path, as students understand numbers only as integers, or whole numbers to jump to with each successive increment (Cross et al., 2009). As they work with fractions and decimals, their mental number line grows interstitially, becoming more detailed between quantities (Siegler, 2016). As their understanding of rational numbers continues to expand, the mental number line morphs from a series of connected, discrete integers to a continuous spectrum of potentially infinite quantities.

Students draw upon this mental number line for various mathematical tasks (Schneider et al., 2018; Siegler & Lortie-Forgues, 2014; Siegler et al., 2011). When comparing magnitudes, for example, locating 11 and 14 on one’s mental number line can enable a student to understand which has a greater magnitude (Siegler, 2016).
Similarly, a student may tap into their mental number line to solve an addition problem such as “4 + 2” by referencing “4” on their mental number line and counting up two integers. Students may also draw upon this mental number line when estimating values, ordering numbers, judging proportions, or performing calculations (Dehaene, 2001; Schneider, Grabner, & Paetsch, 2009).

The body of evidence supporting the mental number line has been primarily provided by cognitive and developmental researchers. When comparing magnitudes, participants are quicker to discriminate between numbers that are farther apart, dubbed the distance effect (Schneider et al., 2009). Latency in comparing numbers of similar magnitude supports the holistic view of processing numerals, that numbers are judged as whole magnitudes (Dehaene, Dupoux, & Mehler, 1990; Moyer & Landauer, 1967). This contrasts with the symbolic view which supposes that numbers are processed by each digit place. To elaborate, the symbolic view posits that the ones-digit should have no effect on reaction time when comparing numerals with different tens-digits. In an example of comparing “12” and “23,” the symbolic view asserts only the tens-digit is necessary and processed to judge magnitudes. Evidence, however, supports that responses are not uniform across decades and that numerals are considered as a holistic unit. In other words, the symbolic view supposes that response times when comparing “12” versus “23” and “19” versus “23” should be similar because the tens-digits are the same in both sets. Instead, evidence shows that respondents would be faster in comparing the first set due to the greater distance between the numbers. Respondents are thought to reference the two numbers against their mental number line and come to a judgment more quickly when the numbers are farther apart.
In young children, the mental number line more closely resembles a logarithmic, rather than linear, relationship (Berteletti, Lucangeli, Piazza, Dehaene, & Zorzi, 2010; Siegler & Opfer, 2003; Siegler, Thompson, & Opfer, 2009). Young children’s responses to placing numerals on a number line demonstrate a linear relationship for small quantities with which children are highly familiar, such as numbers 0-10. Placing numbers outside of this familiar range results in a logarithmic pattern with greater numbers (such as 29, 42, and 56, for example) being placed relatively close together on the right end of the number line. It is not until middle elementary school that students transition to a more accurate wholly linear mental number line (Siegler et al., 2009). This phenomenon would affect children’s behaviors in responding to a number line task.

Another behavioral indicator of the mental number line is the spatial-numerical association of response codes (SNARC) effect. The SNARC effect is observed through decreased latency in physical responses that are aligned to the orientation of the mental number line (Schneider et al., 2009). The mental number line is oriented with smaller quantities on the left and growing to larger quantities on the right. In alignment with this orientation, participants are quicker to respond when presented with lower quantities on the left and higher quantities on the right (Dehaene, Bossini, & Giraux, 1993; Dehaene et al., 1990; Wood, Willmes, Nuerk, & Fischer, 2008).

The mental number line also exhibits a strong relationship with overall mathematics competence (Barth & Paladino, 2011; Boyer, Levine, & Huttenlocher, 2008; Friso-van den Bos et al., 2015; Siegler, 2016). Siegler and Booth (2004), for example, found individual differences in accuracy of number line estimations correlated strongly (r = -.60 to -.76) with math achievement test scores on the Stanford Achievement Test (SAT-9) for first and second graders.
Booth and Siegler (2006) extended this work with kindergarten through fourth grade students, again finding a strong correlation between number line estimation and comprehensive math achievement test scores (ranging from r = .54 to .84). Performance on the NLT is associated with performance on magnitude comparison tasks, understanding of fractions, and overall mathematics achievement, even after controlling for confounding variables like working memory or fact fluency (Booth & Siegler, 2006; Geary, 2011; Hansen, 2015; Hansen, Jordan, & Rodrigues, 2017; Jordan et al., 2013; Schneider et al., 2018; Siegler et al., 2011). Students improve on the NLT across broad age and mathematical proficiency ranges, suggesting its utility across time (Siegler & Booth, 2004; Siegler & Opfer, 2003).

Performance on the NLT being associated with general mathematics performance suggests the potential utility of the NLT as a screening measure. Schneider et al.’s (2018) meta-analysis examined the NLT’s ability to predict general mathematical competence across 41 studies with 263 effect sizes and 10,576 participants. Schneider et al. (2018) found that the average correlation (across the 263 effect sizes) between the NLT and general mathematical competence measures was r = .443. In the same meta-analysis, magnitude comparison tasks, a common task in current screening batteries, had an average correlation of r = .274 with general mathematical competence. Similar results were found (r = .438 and .278, respectively) when examining only early elementary students (aged 6-9), as well as across other age ranges, task stimuli, and methodological variations. Most importantly, a correlation of r = .443 suggests that 19.6% (r²) of the variance in students’ general mathematical performance is explained by performance on the NLT. Including the NLT within an established screening battery may aid the decisions schools make in predicting risk and serving students.
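The shared-variance figure above follows from squaring the correlation coefficient. As a minimal sketch of that arithmetic, the snippet below applies it to the two average correlations attributed to Schneider et al. (2018); the function name `shared_variance` is a hypothetical helper introduced only for illustration.

```python
def shared_variance(r: float) -> float:
    """Proportion of variance two measures share, given their correlation r."""
    return r * r


# Average correlations with general mathematical competence, as cited above.
correlations = {
    "Number line task": 0.443,
    "Magnitude comparison": 0.274,
}

for task, r in correlations.items():
    # Squaring r gives the proportion of variance explained (r squared).
    print(f"{task}: r = {r:.3f}, variance explained = {shared_variance(r):.1%}")
```

Under this calculation the magnitude comparison correlation corresponds to roughly 7.5% of variance explained, compared with 19.6% for the NLT.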
Despite the potential of the NLT as an educational tool, the number line has been primarily assessed by cognitive and developmental researchers. Prior studies on the number line have typically assessed this construct via the number line estimation task (NLT; Berteletti, Lucangeli, Piazza, Dehaene, & Zorzi, 2010; Geary, Hoard, Nugent, & Byrd-Craven, 2008; Laski & Siegler, 2007; Siegler & Booth, 2004; Siegler & Opfer, 2003). During typical procedures for the NLT, students are presented with a blank number line and asked to place target numerals along the line. The value of the endpoints (e.g., 0 and 20 or 0 and 100), the presence of one or both endpoints, and anchor numbers (e.g., 10, 25, 50) vary across manifestations of the task (Schneider et al., 2018). Student performance is often calculated one of three ways: (1) by summing the absolute distance of the student’s responses from the correct placements, (2) by percentage of correct trials where a correct response is within a range around the correct placement, or (3) by calculating the correlation of student responses to correct placements (Schneider et al., 2018).

How students approach the NLT may also provide useful information. Descriptive analyses of the NLT suggest successful performance requires the integration of various mathematics domains. Participants must be able to, at the very least, identify the target numeral, understand the scope of the number line, and accurately estimate the numeral’s place along the line. Respondents also attack stimuli differently, relying upon anchors, rounding, fractions, counting, proportional reasoning, or other strategies to produce accurate responses (Ashcraft & Moore, 2012; Peeters, Degrande, Ebersbach, Verschaffel, & Luwel, 2016; Siegler, 2016; Siegler & Opfer, 2003). Whereas one student may partition the line into salient anchors (25, 50, and 75), another may round the target stimuli to a more familiar number (12 to 10).
A third may transform the stimuli’s placement into a more familiar proportion (71/100 to 3/4), while a fourth may find a familiar unit and iterate along the line to estimate the target. These examples underscore how the NLT requires the simultaneous application of various mathematical skills (Siegler, 2016).

In light of its promise for educational purposes, emerging research has explored the NLT as a screening tool (Clarke, Strand Cary, Shanley, & Sutherland, 2018). Clarke et al. (2018) administered an early numeracy screener (Assessing Student Proficiency of Early Number Sense; ASPENS) and two versions of the NLT (0-20 and 0-100) to exiting students in kindergarten (n = 46) and first grade (n = 60) as part of a five-week summer school program. They found that the NLT explained 13% additional variance above and beyond the typical screening battery for first grade students. The NLT explained 7% additional variance for kindergarten students, although this was not statistically significant. Due to the sample being drawn from a summer school program serving lower-performing students, the general population of kindergarten and first grade students was not represented. Additionally, the limited time between pre- and post-assessment limits the ability of these results to generalize to fall screening processes in schools.

Sutherland and colleagues (2020) expanded on the prior study in drawing upon a broader sample (n = 117) of kindergarten students in control classrooms from a larger study, representing a general population. Additionally, measures were administered in the fall and spring, approximating typical screening and outcome processes. Due to time constraints, administration of the number line measure ended after five minutes. The number line assessment (0-100) performed similarly to the typical mathematics screener (ASPENS; r = .60 and r = .62, respectively).
Independently, the ASPENS explained 49% of students’ spring mathematics performance while the NLT explained 35%. Additionally, the ASPENS exhibited an Area Under the Curve (AUC) value of .94 compared to the .80 of the NLT. When considered in combination, however, the NLT uniquely explained 7% of the variance in spring mathematics performance above and beyond the ASPENS. Across studies, the NLT results indicate potential value as a supplement to, but not necessarily a wholesale replacement of, established mathematics screening batteries.

In both of the preceding studies, the NLT was adapted from Laski et al.’s (2013) version of 26 items. No screening study to date has investigated whether a form with fewer items, allowing for greater efficiency, could provide comparable predictive value. The goal for practice is to increase or preserve the diagnostic accuracy of a screener while optimizing its efficiency. Some evidence suggests that reducing the number of items on selected measures may be an alternative. For example, Purpura and colleagues (2015) found comparable information was garnered between their original 143-item screening battery and the shortened form of 24 items. Similarly, Rodrigues and colleagues (2019) reduced their screening net by 38-39 items and up to an estimated 29 minutes of administration time while increasing predictive power. By removing items that do not contribute to the overall predictive power of the measures, comparable information can be captured in less time.

The mental number line is a central conceptual structure that children form as they grow acquainted with quantities and that develops as children do to eventually accommodate more advanced numbers (Siegler et al., 2011). As a measure that assesses a central construct across years of mathematics, the NLT provides numerous conceptual arguments for exploration as a screener.
State of the Problem: The Current Study

Despite the importance of mathematics for academic and occupational success, students in the United States are not mastering critical content. Universal screening presents an opportunity for schools to wisely leverage resources and identify students most in need of additional support. Early mathematics research is lacking consensus on best practices in screening. The mental number line offers promise as a screener to accurately and efficiently identify students at risk. As part of a larger study, 226 students were administered the NLT in the fall along with an established early numeracy screener, a short outcome measure, and a comprehensive outcome measure. Analyses explored the predictive properties of a short-form, four-item NLT. The NLT was compared to an established screener for the extent to which it added value in predicting performance on the spring mathematics outcome measures. Lastly, the items within the NLT were explored for differences in utility.

Research Questions

1. To what extent does the Number Line Task (NLT) predict math performance in an educational context?
   A. What are the associations among the NLT and other measures of early numeracy?
   B. To what extent does the NLT add value above and beyond a typical mathematics screener (the ASPENS)?
   C. What are the classification accuracy statistics of the NLT compared to an established math screener (ASPENS)?
2. Within the NLT, which items explain the most variance? Do items differ in their utility for decision-making?

II. METHOD

This study analyzed data from a larger randomized controlled trial examining the efficacy of the federally funded ROOTS kindergarten mathematics intervention program (Clarke, Doabler, Fien, Baker, & Smolkowski, 2012). ROOTS is a 50-lesson intervention program that focuses on improving student understanding of whole number concepts and associated skills. Math achievement data were collected at the individual level for students.
Random assignment and instructional delivery took place at the classroom level. Blocking on school and teacher experience with the core curriculum (one year or none), classrooms were randomly assigned to treatment and control conditions. Assessments were administered in the fall and spring of kindergarten.

Participants

Participants were drawn from the first cohort of the parent study (Clarke et al., 2012). Participants were 226 kindergarten students from 14 classrooms during the 2012-2013 school year. The classrooms were nested within 7 schools within 3 districts. Of the 785 students in the parent study, cases with partial or complete missing data (n = 559) across the study measures (fall NLT, fall ASPENS, fall NSB, spring ASPENS, spring NSB, and spring SESAT) were removed, leaving a final sample for analysis of 226 students. Welch independent two-sample t-tests were conducted to determine if student fall mathematics scores differed for included students as compared to excluded students with available fall data. There were no significant differences on any of the fall mathematics measures: NLT, t(42.91) = 1.61, p = .21; ASPENS, t(45.38) = 0.22, p = .64; and NSB, t(45.36) = 0.01, p = .91.

Similarly, Welch independent two-sample t-tests were conducted to determine if students’ fall mathematics scores differed for students assigned to intervention as compared to students assigned to the control condition. There were no significant differences on any of the fall mathematics measures: NLT, t(61.95) = 0.55, p = .46; ASPENS, t(64.39) = 0.05, p = .82; and NSB, t(84.59) = 0.25, p = .62.

Participating school districts were all in suburban and rural areas of western Oregon. Schools targeted for recruitment across the three districts were primarily those that received Title I funding.
Of the 226 students in the sample, 129 (57.1%) were female; 150 (66.4%) were 5 years old and 76 (33.6%) were 6 years old; 196 students (86.7%) were White, 7 students (3.1%) were American Indian or Alaskan Native, 7 students (3.1%) were Black or African American, 30 students (13.3%) identified as Hispanic and/or Latino, and five or fewer students identified as Asian, Native Hawaiian/Pacific Islander, or more than one race; 15 students (6.6%) were English learners; and 15 students (6.6%) received special education services.

Procedures

Students were individually administered all measures by trained staff with extensive experience in collecting data for educational research. Interrater reliability of all administrators was at least .90 before collecting data with students. Administrators attended follow-up trainings prior to data collection sessions to prevent drift from standardization. Student assessment protocols were processed using Teleform, a form processing application. Tests of Teleform scoring procedures of assessment protocols from previous research projects reveal high reliability values (i.e., .99) relative to assessor-scored protocols (.95).

Measures

Four of the parent study’s mathematics measures were chosen to investigate the research questions in this sub-study: the 0-100 NLT, ASPENS, NSB, and SESAT.

Number Line Estimation Task (NLT)

Administrators folded a paper in half lengthwise and handed the paper and a red pencil to the student. Administrators said, “This is a number line. If this is 0 and this is 100 (administrator points to each endpoint while talking), where would 34 be? Use the pencil, and mark on the number line where 34 would be.” Each page displayed two number lines but was folded so that the student would only see one number line at a time. The administrator then displayed a new number line and prompted the student for the next item, asking, “Where would [x] be?” Items were 34, 12, 89, and 57.
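The task’s error score, the summed absolute distance between each mark and its target on the 0-100 line, can be sketched as follows. This is an illustrative sketch only: the function name `nlt_error_score` and the example marks are hypothetical, and the marks are assumed to have already been read off the scoring overlay as positions on the 0-100 scale.

```python
# The four target numerals used as items in this study.
TARGETS = (34, 12, 89, 57)


def nlt_error_score(marks, targets=TARGETS):
    """Sum of absolute distances between a student's marks and the targets.

    Lower totals indicate more accurate placement; 0 is a perfect score.
    """
    if len(marks) != len(targets):
        raise ValueError("one mark is required for each target item")
    return sum(abs(mark - target) for mark, target in zip(marks, targets))


# Hypothetical student: places 34 at 42, 12 at 20, 89 at 70, and 57 at 50.
example_marks = (42, 20, 70, 50)
print(nlt_error_score(example_marks))  # 8 + 8 + 19 + 7 = 42
```

Because the score is an error total, larger values indicate poorer performance, so negative correlations with achievement measures are the expected direction.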
Responses were scored as the absolute distance of the students’ responses from the correct responses. The first mark a student placed on the number line was used for scoring. A transparency was laid over the student’s response form and the administrator counted the spaces between the student’s response and the target number. For example, a student who was told to locate 34 and marked the number line where 42 resides received a score of 8 for that item. Item scores were summed, with summed scores closest to zero indicating better performance. The four stimuli were chosen semi-randomly, sampling across the range of 0 to 100.

Assessing Student Proficiency of Number Sense (ASPENS; Clarke, Gersten, Dimino, & Rolfhus, 2011)

The ASPENS is a series of three one-minute curriculum-based measures of numeral identification, comparing quantities, and strategic counting. The ASPENS is utilized in the present study as a proxy for a typical mathematics screening battery because it includes magnitude comparison and strategic counting tasks. Test-retest reliabilities of kindergarten ASPENS measures are in the moderate to high range (.74 to .85). Predictive validity from fall to spring scores on the TerraNova 3 is reported as ranging from .45 to .52.

Number Sense Brief (NSB; Jordan, Glutting, & Ramineni, 2008)

The NSB is an individually administered measure with 33 items drawing upon varied early numeracy skills, such as counting knowledge and principles, number recognition, number comparisons, nonverbal calculation, story problems, and number combinations. The NSB has a coefficient alpha of .84. The NSB serves as a short outcome measure for determining general student mathematics performance.

The Stanford Early School Achievement Test – Tenth Edition (SESAT; Harcourt Educational Measurement, 2002)

The Stanford Early School Achievement Test – Tenth Edition (SESAT) is a group-administered, standardized, norm-referenced achievement test with two multiple-choice mathematics subtests, Problem Solving and Procedures.
The SESAT has adequate validity (r = .67) and reliability (r = .93). The SESAT serves as a longer, comprehensive outcome measure for determining general student mathematics performance.

Analyses

Prior to analyses for the study’s research questions, univariate descriptive statistics for each measure at each timepoint were calculated and assumptions of fitness for linear regression (linearity, independence of errors, multivariate normality, and homoscedasticity) were tested (Pedhazur & Kerlinger, 1982).

To address research question 1A, Pearson’s r bivariate correlations were estimated among the NLT, established early numeracy screener (ASPENS), and outcome measures (NSB and SESAT).

To address research question 1B, six linear regression models were conducted. Table 1 displays the conducted models. In the first model, the spring NSB scores were regressed on the fall NLT scores. In the second model, spring NSB scores were regressed on the fall ASPENS scores. In the third, the spring NSB scores were regressed on both predictors, fall NLT and ASPENS scores. This procedure was repeated for the remaining three models by regressing the second outcome measure, the spring SESAT scores, on the same set of predictors. Including the intervention condition in the combined models only slightly increased predictiveness (by R² = .01 and .02 for the spring NSB and spring SESAT, respectively). For the sake of parsimony, intervention condition was excluded from the models. The R² value given by each model estimates the variance explained by the predictors in the outcome measures. Semi-partial correlations were estimated for the final model for each outcome measure. Semi-partial correlations parse out shared variance to better understand what each independent variable uniquely contributes to the model.
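To make the idea of unique variance concrete, the sketch below uses the standard identity for a two-predictor regression, in which R² can be written in terms of the three pairwise correlations, and takes each predictor’s unique contribution as the drop in R² when that predictor is removed. It is an illustration rather than the study’s analysis code (which was run in R on raw scores): the function names are hypothetical, and the inputs are the two-decimal correlations reported in Table 3, so the results are only approximate.

```python
def r_squared_two_predictors(r_y1, r_y2, r_12):
    """R^2 for an outcome regressed on two predictors, from pairwise correlations.

    r_y1, r_y2: correlation of the outcome with predictors 1 and 2.
    r_12: correlation between the two predictors.
    """
    return (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)


def unique_variance(r_y1, r_y2, r_12):
    """Return (full R^2, unique variance of predictor 1, unique variance of predictor 2).

    Each unique share is the squared semi-partial correlation: the full-model
    R^2 minus the R^2 of the model that omits that predictor.
    """
    full = r_squared_two_predictors(r_y1, r_y2, r_12)
    return full, full - r_y2**2, full - r_y1**2


# Rounded correlations taken from Table 3 (predictor 1 = fall NLT, 2 = fall ASPENS).
R_NLT_ASPENS = -0.26
outcomes = {
    "Spring NSB": (-0.19, 0.59),
    "Spring SESAT": (-0.17, 0.59),
}

for name, (r_nlt, r_aspens) in outcomes.items():
    full, nlt_unique, aspens_unique = unique_variance(r_nlt, r_aspens, R_NLT_ASPENS)
    print(f"{name}: R2 = {full:.2f}, NLT unique = {nlt_unique:.3f}, "
          f"ASPENS unique = {aspens_unique:.2f}")
```

With these rounded inputs, the combined R² comes out near .35 for both outcomes and the NLT’s unique share falls below .01, in line with the values reported in the Results.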
Table 1
Linear Regression Models Conducted for Research Question 1

Model Number   Predictor(s)             Outcome
1A             Fall NLT                 Spring NSB
1B             Fall ASPENS              Spring NSB
1C             Fall NLT + Fall ASPENS   Spring NSB
2A             Fall NLT                 Spring SESAT
2B             Fall ASPENS              Spring SESAT
2C             Fall NLT + Fall ASPENS   Spring SESAT

To address research question 1C, receiver operating characteristic (ROC) analyses assessed the diagnostic accuracy of the NLT and the ASPENS. ROC analyses evaluate a measure’s classification performance against a dichotomous outcome variable of “risk.” A cut score of 20 on the NSB was used to define “risk,” with those scoring above 20 deemed “not at risk.” A cut score of 20 aligns with previous research supporting its utility for diagnostic accuracy in the spring of kindergarten (Jordan, Glutting, Ramineni, & Watkins, 2010). Additionally, a score of 20 corresponds to the 23rd percentile in this sample, which holds clinical significance and approximates a threshold schools may use to assign intervention. For this reason as well, performance below the 25th percentile on the SESAT in this sample was deemed “at risk.” When evaluating a ROC curve, the area under the curve (AUC) estimates how well a measure accurately classifies subjects. Values close to 1 suggest a measure is highly sensitive and specific (or accurately parses out individuals who are truly “at risk” or truly “not at risk”). Values close to .5, in contrast, denote the measure performs little better than chance. Confidence intervals and statistically significant differences for the AUC values were computed using 2,000 stratified bootstrap replicates.

To address the second research question, analyses mirrored the procedure of the first research question. Pearson’s r bivariate correlations were estimated among the individual NLT items and the ASPENS, NSB, and SESAT at all available timepoints.
Next, each outcome measure (the spring NSB or spring SESAT) was regressed on the individual fall NLT item scores and the fall ASPENS scores. As with research question 1B, the R² value and semi-partial correlations were collected. Lastly, AUC values were computed for the NLT items to examine classification accuracy.

It was hypothesized that Item 4 (with a stimulus of 57) would be associated with greater overall mathematical competence in kindergarten. This hypothesis drew on Rodrigues, Jordan, and Hansen (2019), who found that “simpler” items such as the midpoint (1/2) on a 0-1 fraction number line were the most predictive items. Similarly, it was hypothesized that a stimulus of 57 would require students to demonstrate foundational skills in a) identifying the two-digit numeral correctly and b) dissecting the line approximately in half. Exhibiting these developing mathematical competencies may be associated with greater overall mathematical competence in kindergarten.

Type I error rate for all analyses was set at 5% (.05) as is standard in educational sciences. All analyses were conducted in R (R Development Core Team, 2011), with the following packages: cowplot (Wilke, 2019); ggplot2 (Wickham, 2016); ggResidpanel (Goode & Rey, 2019); ggROC (Wu, 2013); haven (Wickham & Miller, 2019); here (Müller, 2017); Hmisc (Harrell Jr, 2020); lmSupport (Curtin, 2018); pROC (Robin et al., 2011); rio (Chan, Chan, Leeper, & Becker, 2018); and tidyverse (Wickham et al., 2019).

III. RESULTS

Univariate descriptive statistics are displayed in Table 2. Students gained, on average, about 44 points (77.5%) from fall to spring of kindergarten on the ASPENS measure. Students also gained, on average, on the NSB measure by about 5.6 points (31.9%).
Table 2
Descriptive Statistics of Early Numeracy Measures

Measure         Mean     SD      Median   Skewness  Kurtosis
Fall NLT        112.27   39.40   112.50    0.22     -0.45
Fall ASPENS      56.73   39.66    48.85    0.64     -0.21
Fall NSB         17.58    5.29    17.00   -0.02     -0.46
Spring ASPENS   100.65   41.24    98.80   -0.07     -0.28
Spring NSB       23.17    4.81    24.00   -0.61     -0.17
Spring SESAT     28.11    6.72    29.00   -0.77      0.00

Assumptions of fitness for linear regression were tested. First, the variables were examined for normality. Distributions of the study measures are displayed in Figure 1. The fall NLT scores approximate a normal distribution (Shapiro-Wilk normality test p = .07). The fall NLT scores have a slight positive skew (0.22) and less kurtosis than expected (-0.45). However, graphical representation suggests that the distribution of fall NLT scores may be bimodal. The fall ASPENS scores fail the Shapiro-Wilk normality test (p < .001). The fall ASPENS scores have moderate positive skew with approximately normal kurtosis (-0.21). The spring ASPENS scores, however, do approximate a normal distribution (Shapiro-Wilk normality test p = .63), with minimal skew (-0.07) and expected kurtosis (-0.17).

The fall NSB scores approximate a normal distribution (Shapiro-Wilk normality test p = .11), with minimal skew (-0.02) and less kurtosis than expected (-0.46). In contrast, the spring NSB scores fail the Shapiro-Wilk normality test (p < .001). The spring NSB scores demonstrate moderate negative skew (-0.61) and approximately normal kurtosis (-0.17). The spring SESAT scores fail the Shapiro-Wilk normality test (p < .001). The spring SESAT scores demonstrate moderate negative skew (-0.77) and expected kurtosis (0.00).

Figure 1
Distributions of the Early Numeracy Measures
Note. Axes’ scales vary by measure.

Normal quantile plots were examined to determine multivariate normality. Quantile plots are displayed in Figures 2 and 3.
The linear trend displayed by the theoretical quantiles plotted against the sample quantiles predicting to the NSB suggests multivariate normality. The tails of the model predicting to the SESAT (Figure 3) deviate from a linear trend. These deviations suggest the multivariate distribution predicting to the SESAT is negatively skewed.

Figure 2
Normal Quantile Plot of Spring NSB Scores Regressed on Fall NLT and Fall ASPENS Scores

Figure 3
Normal Quantile Plot of Spring SESAT Scores Regressed on Fall NLT and Fall ASPENS Scores

Next, linearity of the predictor models was examined. The spring outcome measures (NSB and SESAT) were each regressed on the fall predictor measures (NLT and ASPENS). Linear regressions are displayed in Figure 4. In all models, a linear trend appears to best explain the relationship between the predictors and the outcomes. The assumption of linearity is tenable.

Figure 4
Simple Linear Regressions of the Outcomes Regressed on the Predictor Measures

The assumption of independence of errors was examined next. Residuals of the outcomes regressed on the predictors are plotted in Figures 5 and 6. Residuals for the NLT appear randomly distributed, suggesting the absence of a relationship between the errors and the outcome variables. Residuals for the ASPENS predicting to each outcome appear to be somewhat overestimated at the extreme values and underestimated at the central values.

Figure 5
Residual Plots of the Spring NSB Scores Regressed on the Fall NLT Scores and Fall ASPENS Scores

Figure 6
Residual Plots of the Spring SESAT Scores Regressed on the Fall NLT Scores and Fall ASPENS Scores

Lastly, the assumption of homoscedasticity was examined. Further examination of the residual plots shows that, for the fall NLT as a predictor, errors appear homogeneously distributed across the values of the x-axes. The assumption of homoscedasticity is tenable for the spring NSB and spring SESAT regressed on the fall NLT.
The residuals of the fall ASPENS as a predictor do not appear homogeneously distributed across the values of the x-axes. The variance of the residuals tends to decrease going across the x-axes. The assumption of homoscedasticity for the fall ASPENS is not tenable.

Research Question 1

Research Question 1A: Association Among Early Numeracy Measures

Correlations among all measures at all available timepoints were computed. Descriptors of the strength of correlations are based on Cohen (1992), who defines small, medium, and large correlations as |r| = .20, .30, and .50, respectively. Correlations for all measures are reported in Table 3. Correlations for the study measures only (Fall NLT, Fall ASPENS, Spring NSB, and Spring SESAT) are displayed graphically in Figure 7. Associations with the NLT are expected to be negative because larger scores indicate greater error (response distance from the target numbers).

Except for the relations of the fall NLT with the spring ASPENS and fall NLT with the spring SESAT, all correlations are significant (p < .01). The relationship between the NLT and the ASPENS is small in the fall (r = -.26) and weak in the spring (r = -.13, p = .058). The relationship between the NLT and the NSB is small in the fall (r = -.24). In the spring, the NLT’s relationships with the NSB (r = -.19) and the SESAT (r = -.17, p < .05) are weak. The ASPENS in the fall is strongly correlated with the NSB in the fall (r = .68). The ASPENS remains strongly associated with the NSB in the spring (r = .67) and with the other outcome measure administered in the spring, the SESAT (r = .64). The outcome measures demonstrate strong relationships with each other at all timepoints (r = .66 to .73), echoing prior evidence of validity (Harcourt Educational Measurement, 2002; Jordan et al., 2008).

Table 3 Correlations Among All Measures Administered Fall 2012 and Spring 2013

                     1      2       3       4       5       6
1. Fall NLT          -    -.26**  -.24**  -.13    -.19**  -.17*
2. Fall ASPENS              -      .68**   .70**   .59**   .59**
3. Fall NSB                         -      .57**   .72**   .66**
4. Spring ASPENS                            -      .67**   .64**
5. Spring NSB                                       -      .73**
6. Spring SESAT                                             -

*p < .05. **p < .01.

Figure 7 Correlations Among Study Measures Administered Fall 2012 and Spring 2013

Research Question 1B: Explained Variance

Results of the outcome measures regressed on the predictor measures are reported and summarized. In the first model, the spring NSB scores were regressed on the fall NLT scores. In the second model, spring NSB scores were regressed on the fall ASPENS scores. In the third, the spring NSB scores were regressed on both predictors, fall NLT and ASPENS scores. This procedure was repeated for the remaining three models by regressing the second outcome measure, the spring SESAT scores, on the same set of predictors.

Regression results are reported in Tables 4 and 5. Fall performance on the NLT explained 3% of the variance in spring NSB scores and 3% of the variance in spring SESAT scores. Students’ performance on the fall ASPENS explained 35% of the variance in spring NSB scores and 35% of the variance in spring SESAT scores. Models that included both predictors (fall NLT and fall ASPENS scores) did not show an increase in explained variance in the outcomes over the models that included only the ASPENS. In addition, the fall NLT scores were no longer a statistically significant predictor in the combined models (p = .48 predicting to the spring NSB, p = .83 predicting to the spring SESAT).

In the combined model for predicting the spring NSB (Model 3), for every 1-point increase in the NLT, there is no expected change in spring NSB score (p = .48). For every 1-point increase on the fall ASPENS, there is an expected .07 increase in score on the spring NSB (p < .001). This model accounts for approximately 35% of the variance in scores on the spring NSB, F(2, 223) = 59.52, p < .001.
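The comparison of the single-predictor and combined models can be illustrated with a short calculation. The sketch below is not the analysis code used in the study. It assumes only the rounded correlations from Table 3 and uses the standard identity for the R2 of a two-predictor regression expressed through the three pairwise correlations. The function and variable names are hypothetical, and because the inputs are rounded, the results are approximate.

```python
def r2_two_predictors(r_y1, r_y2, r_12):
    """R-squared for an outcome regressed on two predictors,
    computed from the three pairwise Pearson correlations."""
    return (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)

# Rounded correlations taken from Table 3 (illustrative inputs only).
r_nlt_nsb = -0.19      # fall NLT with spring NSB
r_aspens_nsb = 0.59    # fall ASPENS with spring NSB
r_nlt_aspens = -0.26   # fall NLT with fall ASPENS

# Variance explained by both predictors together.
r2_full = r2_two_predictors(r_nlt_nsb, r_aspens_nsb, r_nlt_aspens)

# Unique contribution of each predictor: full-model R-squared minus the
# R-squared of the model that omits that predictor.
unique_nlt = r2_full - r_aspens_nsb**2
unique_aspens = r2_full - r_nlt_nsb**2

print(f"R2 (both predictors): {r2_full:.2f}")
print(f"Unique to NLT:        {unique_nlt:.3f}")
print(f"Unique to ASPENS:     {unique_aspens:.2f}")
```

Under these assumptions the combined R2 is approximately .35 and the NLT's unique contribution is below .01, which agrees with the regression results reported above.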
In the combined model for predicting the spring SESAT (Model 6), for every 1-point increase in the NLT, there is no expected change in spring SESAT score (p = .83). For every 1-point increase on the fall ASPENS, there is an expected .10 increase in score on the spring SESAT (p < .001). This model accounts for approximately 35% of the variance in scores on the spring SESAT, F(2, 223) = 59.54, p < .001.

Semi-partial correlations explain the extent to which each predictor adds unique variance in explaining the outcome, net of the shared variance among predictors. Semi-partial correlations are reported in Table 6. In the combined models (Models 3 and 6), the NLT adds negligible explained variance (R2 < .01) in the outcome measures beyond the ASPENS.

Table 6 Unique Variance Explained in the Outcome Measures (Semi-partial Correlations)

                 Spring NSB   Spring SESAT
Fall NLT           <.01          <.01
Fall ASPENS         .31*          .32*

*p < .01.

Research Question 1C: Classification Accuracy

ROC analyses explored the predictors’ abilities to correctly classify students at risk. The ROC curves predicting to the spring NSB and spring SESAT are displayed in Figures 8 and 9. AUC values are reported in Table 7. The AUC of the fall NLT to the spring NSB was .59 (95% CI from .49 to .68) and to the spring SESAT was .58 (95% CI from .49 to .66). The AUC of the fall ASPENS to the spring NSB was .86 (95% CI from .80 to .92) and to the spring SESAT was .83 (95% CI from .76 to .89). For both outcome measures, the fall ASPENS greatly outperformed the NLT in accurately classifying students. These differences are statistically significant when predicting to both outcome measures (p < .001).
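For readers less familiar with the AUC statistic, the following minimal sketch shows one common way to compute it: as the proportion of (at-risk, not-at-risk) student pairs in which the at-risk student receives the more extreme screener score, with ties counted as one half. This is an illustration only and not the software used in the study. The function name and the scores are hypothetical. Scores are treated as NLT-style error scores, where higher values indicate greater risk.

```python
def auc(scores_at_risk, scores_not_at_risk):
    """Area under the ROC curve via pairwise comparison.

    Returns the share of (at-risk, not-at-risk) pairs in which the
    at-risk student has the higher score; ties count as one half.
    """
    wins = 0.0
    for a in scores_at_risk:
        for b in scores_not_at_risk:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(scores_at_risk) * len(scores_not_at_risk))

# Hypothetical summed-error scores (not data from this study).
at_risk_errors = [120, 95, 140, 80]
not_at_risk_errors = [100, 60, 130, 70, 90]

example_auc = auc(at_risk_errors, not_at_risk_errors)
print(f"AUC = {example_auc:.2f}")  # 14 of 20 pairs are ordered correctly: 0.70
```

For a screener on which lower scores indicate risk, as with the ASPENS, the scores can be negated before calling the function. A value of .50 corresponds to chance-level classification.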
Table 4 Regression Results Predicting Spring NSB Performance (N = 226)

                         Model 1                      Model 2                      Model 3
Parameter           b     SE     t      p        b     SE     t      p        b     SE     t      p
Intercept, b1     25.80  0.95  27.08  <.001    19.12  0.45  42.18  <.001    19.73  0.98  20.11  <.001
Fall NLT, b2      -0.02  0.01  -2.92  <.01                                   0.00  0.01  -0.70   .48
Fall ASPENS, b3                                 0.07  0.01  10.90  <.001     0.07  0.01  10.32  <.001

Note. Model 1 R2 = .03, F = 7.95, p = .01. Model 2 R2 = .35, F = 111.80, p < .001. Model 3 R2 = .35, F = 59.52, p < .001.

Table 5 Regression Results Predicting Spring SESAT Performance (N = 226)

                         Model 4                      Model 5                      Model 6
Parameter           b     SE     t      p        b     SE     t      p        b     SE     t      p
Intercept, b1     31.29  1.34  23.42  <.001    22.44  0.63  35.52  <.001    22.70  1.37  16.58  <.001
Fall NLT, b2      -0.03  0.01  -2.52   .01                                   0.00  0.01  -0.21   .83
Fall ASPENS, b3                                 0.10  0.01  10.93  <.001     0.10  0.01  10.47  <.001

Note. Model 4 R2 = .03, F = 6.36, p = .01. Model 5 R2 = .35, F = 119.60, p < .001. Model 6 R2 = .35, F = 59.54, p < .001.

Figure 8 ROC Curve Comparing Fall NLT and Fall ASPENS to the Spring NSB

Figure 9 ROC Curve Comparing Fall NLT and Fall ASPENS to the Spring SESAT

Table 7 AUC for the Fall Screening Measures to the Spring Outcome Measures

                 Spring NSB (23rd Percentile)   Spring SESAT (25th Percentile)
                 AUC      CI                    AUC      CI
Fall NLT         .59    .49-.68                 .59    .51-.67
Fall ASPENS      .86    .80-.92                 .83    .78-.89

Following the ROC analyses, sensitivity and specificity were examined. Classification accuracy statistics and cut scores are reported in Table 8. Two approaches were used. The first approach maximized both sensitivity and specificity. The cut score with the sum of sensitivity and specificity closest to 2.0 was selected for each predictor to each outcome. When risk was classified as below the 23rd percentile in this sample on the NSB, the fall NLT had a sensitivity of .68 and a specificity of .49 (cut score = 105.50) whereas the fall ASPENS had a sensitivity of .72 and a specificity of .89 (cut score = 23.35).
While the measures correctly identified students “at risk” to a similar extent (4% difference in favor of the ASPENS), the ASPENS correctly identified 40% more “not at risk” students. When risk was classified as below the 25th percentile in this sample on the SESAT, the fall NLT had a sensitivity of .53 and a specificity of .66 (cut score = 122.50) whereas the fall ASPENS had a sensitivity of .58 and a specificity of .95 (cut score = 24.25). Again, the ASPENS correctly identified slightly more “at risk” students than the NLT (5%), and substantially more “not at risk” students (29%).

Because the consequences of false negatives for students are greater than those of false positives, schools often prioritize sensitivity over specificity. Thus, the next approach examined cut scores and specificities where sensitivity was closest to .90 (L. S. Fuchs et al., 2007; Seethaler & Fuchs, 2010). When risk was classified as below the 23rd percentile on the NSB, the fall NLT had a specificity of .16 (sensitivity = .89, cut score = 71.00) whereas the fall ASPENS had a specificity of .60 (sensitivity = .89, cut score = 49.25). When the measures correctly identified 89% of truly “at risk” students, the ASPENS correctly identified 44% more “not at risk” students. When risk was classified as below the 25th percentile on the SESAT, the fall NLT had a specificity of .12 (sensitivity = .91, cut score = 67.00) whereas the fall ASPENS had a specificity of .48 (sensitivity = .90, cut score = 69.55). When the measures correctly identified close to 90% of truly “at risk” students, the ASPENS outperformed the NLT in correctly identifying “not at risk” students by 36%.
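The two cut-score approaches described above can be sketched as follows. This is a minimal illustration, not the procedure or software used in the study. The function names and the data are hypothetical, candidate cut scores are restricted to observed scores, and students scoring at or above the cut are flagged as at risk, as would apply to an error-based score such as the NLT. When several cut scores are equally close to the target sensitivity, the sketch assumes the one with the highest specificity is preferred.

```python
def sens_spec(scores, at_risk, cut):
    """Sensitivity and specificity when scores >= cut are flagged as at risk."""
    pairs = list(zip(scores, at_risk))
    tp = sum(1 for s, r in pairs if r and s >= cut)       # at risk, flagged
    fn = sum(1 for s, r in pairs if r and s < cut)        # at risk, missed
    tn = sum(1 for s, r in pairs if not r and s < cut)    # not at risk, not flagged
    fp = sum(1 for s, r in pairs if not r and s >= cut)   # not at risk, flagged
    return tp / (tp + fn), tn / (tn + fp)

def candidate_cuts(scores, at_risk):
    """(cut, sensitivity, specificity) for every observed score."""
    return [(c, *sens_spec(scores, at_risk, c)) for c in sorted(set(scores))]

def cut_maximizing_sum(scores, at_risk):
    """Approach 1: the cut whose sensitivity + specificity is largest."""
    return max(candidate_cuts(scores, at_risk), key=lambda t: t[1] + t[2])

def cut_nearest_sensitivity(scores, at_risk, target=0.90):
    """Approach 2: the cut whose sensitivity is closest to the target."""
    return min(candidate_cuts(scores, at_risk),
               key=lambda t: (abs(t[1] - target), -t[2]))

# Hypothetical screener scores and spring risk status (not study data).
scores = [150, 130, 110, 60, 140, 100, 80, 70, 50, 40]
at_risk = [True, True, True, True, False, False, False, False, False, False]

best = cut_maximizing_sum(scores, at_risk)            # cut 110: sens .75, spec about .83
high_sens = cut_nearest_sensitivity(scores, at_risk)  # cut 60: sens 1.00, spec about .33
print(best, high_sens)
```

In this toy data set, raising sensitivity from .75 to 1.00 lowers specificity from about .83 to about .33, the same kind of trade-off observed for the study measures. For a screener on which lower scores indicate risk, the scores can be negated before applying these functions.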
Table 8 Classification Accuracy and Cut Scores for Fall Screeners Maximizing Sensitivity and Specificity and with Sensitivity Closest to .90

                      Fall NLT                    Fall ASPENS
                Cut Score  Sens  Spec       Cut Score  Sens  Spec
Spring NSB       105.50    .68   .49          23.35    .72   .89
                  71.00    .89   .16          49.25    .89   .60
Spring SESAT     122.50    .53   .66          24.25    .58   .95
                  67.00    .91   .12          69.55    .90   .48

Note. Sens = Sensitivity, Spec = Specificity

Research Question 2: Item-Level Analyses

Analyses were repeated with the individual NLT items to explore which stimuli best predict future achievement. Descriptive statistics of the NLT Items and the outcome measures are displayed in Table 9.

Means for the NLT Items were compared using a one-way, repeated measures analysis of variance. The independent variable was the NLT Item with four levels (Items 1 to 4) and the dependent variable was the received score. The Mauchly Sphericity test was not significant, indicating that the assumption of sphericity is tenable, χ2(5) = .29. The main effect of NLT Item on score was significant, F(3, 900) = 6.58, p < .001. Post-hoc tests using a Bonferroni correction reveal that the mean score of Item 2 is significantly greater than those of Items 1 and 4 (p < .01). Other pairwise comparisons are not significant.

Table 9 Descriptive Statistics of the NLT Items and Outcome Measures

Measure                  Mean      SD     Median   Skewness   Kurtosis
Fall NLT Item 1: 34      25.09a   18.87   22.00      0.66      -0.75
Fall NLT Item 2: 12      31.69b   19.80   29.00      0.42      -0.81
Fall NLT Item 3: 89      30.01ab  20.58   32.50      0.23      -1.04
Fall NLT Item 4: 57      25.47a   17.81   23.00      0.26      -1.36
Spring NSB               23.17     4.81   24.00     -0.61      -0.17
Spring SESAT             28.11     6.72   29.00     -0.77       0.00

Note. Superscripts denote significantly different group means, p < .05.

Next, assumptions of fitness for statistical analyses were tested. The distributions of the NLT item scores are displayed in Figure 10. The distributions of the NLT Items do not approximate normality. The distributions of NLT Items 1 and 2 are moderately negatively skewed.
The distribution of Item 3 appears to be bimodal. All four item distributions have significantly more kurtosis than expected.

Figure 10 Distribution of the Individual NLT Items

Next, linearity of the predictor models was examined. The spring outcome measures (NSB and SESAT) were each regressed on the NLT item scores and fall ASPENS scores. Residuals of the regressions are displayed in Figures 11 and 12. A linear trend is apparent in both models and thus the assumption of linearity is tenable in both models. The assumption of independence of errors was tested next. Residuals for the NLT items appear randomly distributed, suggesting the absence of a relationship between the errors and the outcome variables. The assumption of homoscedasticity was considered next. In both models, the variance appears homogeneously distributed across values of x. The assumptions of independence of errors and homoscedasticity are tenable.

Figure 11 Residuals of the Spring NSB Regressed on the Fall NLT Item Scores and Fall ASPENS Scores

Figure 12 Residuals of the Spring SESAT Regressed on the Fall NLT Item Scores and Fall ASPENS Scores

Next, correlations among the items of the NLT and other early numeracy measures were computed. Correlations are displayed in Table 10. Items 1, 2, and 4 of the NLT are unrelated to all other items and measures (|r| < .20).

Table 10 Correlations Among the NLT Items and Early Numeracy Measures

                          1     2      3      4      5       6       7       8       9
1. Fall NLT Item 1: 34    -    .05   -.01    .01   -.05    -.14*   -.04    -.08    -.02
2. Fall NLT Item 2: 12          -    -.05    .03   -.14*   -.12    -.10    -.09    -.04
3. Fall NLT Item 3: 89                 -     .06   -.28**  -.21**  -.16*   -.23**  -.22**
4. Fall NLT Item 4: 57                        -    -.05    -.03     .06     .03    -.03
5. Fall ASPENS                                       -      .68**   .70**   .59**   .59**
6. Fall NSB                                                  -      .57**   .72**   .66**
7. Spring ASPENS                                                     -      .67**   .64**
8. Spring NSB                                                                -      .73**
9. Spring SESAT                                                                      -

*p < .05. **p < .01.

In the fall, Item 3 of the NLT has a small relationship with the ASPENS (r = -.28) and the NSB (r = -.21).
This relationship continues into the spring with the NSB (r = -.23) but is no longer present with the ASPENS (r = -.16). Item 3 also has a small relationship with the spring SESAT (r = -.22). Relations among measures besides the individual NLT Items are discussed in Research Question 1A.

The scores of the spring outcome measures were regressed on the fall NLT Item scores and fall ASPENS scores. Results are reported in Tables 11 and 12. Models including the fall ASPENS as the sole predictor are repeated from earlier for the sake of comparison. In the first model (Model 7), the spring NSB scores were regressed on the four individual fall NLT Item scores. In the third (Model 8), the spring NSB scores were regressed on the fall NLT Items and the fall ASPENS scores. This procedure was repeated for the remaining two models by regressing the second outcome measure, the spring SESAT scores, on the same predictors.

Fall performance on the NLT Items explained 7% of the variance in spring NSB scores, with Item 3 being the only statistically significant predictor. Model 8, including NLT Item scores and ASPENS scores, explained 1% more variance (R2 = .36) in the spring NSB over the ASPENS alone, F(5, 220) = 24.62, p < .001. In addition, the NLT Item 3 scores were no longer a statistically significant predictor in this combined model (p = .16).

Fall performance on the NLT Items explained 5% of the variance in spring SESAT scores, with Item 3 being the only statistically significant predictor. Model 10, including NLT Item scores and ASPENS scores, did not explain more variance (R2 = .35) in the spring SESAT over the ASPENS alone, F(5, 220) = 24.01, p < .001.
In addition, the NLT Item 3 scores were no longer a statistically significant predictor in this combined model (p = .30).

Table 11 Regression Results Predicting Spring NSB Performance (N = 226)

                              Model 7                      Model 2                      Model 8
Parameter                b     SE     t      p        b     SE     t      p        b     SE     t      p
Intercept, b1          25.81  0.94  27.40  <.001    19.12  0.45  42.18  <.001    19.86  0.99  20.06  <.001
Fall NLT Item 1, b2    -0.02  0.02  -1.16   .25                                  -0.01  0.01  -0.96   .34
Fall NLT Item 2, b2    -0.03  0.02  -1.61   .11                                   0.00  0.01  -0.29   .77
Fall NLT Item 3, b2    -0.06  0.02  -3.76  <.001                                 -0.02  0.01  -1.42   .16
Fall NLT Item 4, b2     0.01  0.02   0.80   .42                                   0.02  0.01   1.21   .23
Fall ASPENS, b3                                      0.07  0.01  10.90  <.001     0.07  0.01   9.88  <.001

Note. Model 7 R2 = .07, F = 4.43, p < .001. Model 2 R2 = .35, F = 111.80, p < .001. Model 8 R2 = .36, F = 24.62, p < .001.

Table 12 Regression Results Predicting Spring SESAT Performance (N = 226)

                              Model 9                      Model 5                      Model 10
Parameter                b     SE     t      p        b     SE     t      p        b     SE     t      p
Intercept, b1          31.29  1.33  23.55  <.001    22.44  0.63  35.52  <.001    22.78  1.37  16.58  <.001
Fall NLT Item 1, b2    -0.01  0.02  -0.35   .73                                   0.00  0.02   0.01   .99
Fall NLT Item 2, b2    -0.02  0.02  -0.82   .41                                   0.01  0.02   0.67   .50
Fall NLT Item 3, b2    -0.07  0.02  -3.45  <.001                                 -0.02  0.02  -1.05   .30
Fall NLT Item 4, b2    -0.01  0.02  -0.28   .78                                   0.00  0.02  -0.09   .93
Fall ASPENS, b3                                      0.10  0.01  10.93  <.001     0.10  0.01  10.08  <.001

Note. Model 9 R2 = .05, F = 3.18, p = .01. Model 5 R2 = .35, F = 119.60, p < .001. Model 10 R2 = .35, F = 24.01, p < .001.

Semi-partial correlations explain the extent to which each predictor adds unique variance in explaining the outcome. Semi-partial correlations of the NLT Items and Fall ASPENS are reported in Table 13. In the combined models, the NLT Item scores add negligible explained variance (.01 or less in all cases) in the outcome measures beyond the fall ASPENS scores.

Finally, ROC analyses explored the NLT Items’ abilities to correctly classify students at risk. AUC values are reported in Table 14.
The NLT Items performed as well as or marginally better than chance in predicting student risk on the spring NSB (AUC = .50 to .58). Similarly, Items 1, 2, and 4 performed as well as or marginally better than chance in predicting student risk on the spring SESAT (AUC = .49 to .54). Item 3 of the NLT identified student risk on the SESAT better than chance (AUC = .64).

Table 13 Unique Variance Explained in the Outcome Measures (Semi-partial Correlations)

                        Spring NSB   Spring SESAT
Fall NLT Item 1: 34       <.01          <.01
Fall NLT Item 2: 12       <.01          <.01
Fall NLT Item 3: 89        .01          <.01
Fall NLT Item 4: 57       <.01          <.01
Fall ASPENS                .28*          .30*

*p < .01.

Table 14 AUC for the Individual NLT Items to the Spring Outcome Measures

                    Spring NSB (23rd Percentile)   Spring SESAT (25th Percentile)
                    AUC      CI                    AUC      CI
NLT Item 1: 34      .55    .46-.64                 .49    .41-.57
NLT Item 2: 12      .58    .49-.67                 .54    .46-.62
NLT Item 3: 89      .57    .47-.67                 .64    .57-.72
NLT Item 4: 57      .50    .41-.59                 .52    .44-.59

IV. DISCUSSION

In the discussion, I frame the current study, summarize and interpret the results, and note limitations. Based on the results and limitations of the current study, I suggest next steps and directions for future research.

Mathematics instruction in the United States consistently underserves our nation’s students, as evidenced by stagnant and unsatisfactory achievement (Center for Education Statistics, 2019). Universal screening is one mechanism for delivering critical content to the students most in need (Albers & Kettler, 2014; Shinn, 2006). However, the evidence base for early mathematics screeners is limited, which makes it harder for schools to reach accurate, useful screening decisions (Gersten et al., 2012). Because it integrates several key mathematical concepts, the mental number line offers promise for serving as a standalone screener or supplementing existing screening batteries (Schneider et al., 2018).
The research base of the number line and its relation to mathematical development and competence derives primarily from a developmental and cognitive lens. Only two prior studies have leveraged the number line task as a screener for identifying educational risk (Clarke et al., 2018; Sutherland et al., 2020). This work extended the exploration of the number line assessment as a screener. Due to the breadth and cumulative depth of mathematical curricula, numeracy screening benefits from assessing a range of skills (Gersten et al., 2012; Seethaler & Fuchs, 2010; Vanderheyden et al., 2017). This prompted my investigation of a number line assessment in conjunction with an established, but not maximal, multi-skill math screener.

As part of a larger study, 226 kindergarten students were administered the NLT in the fall. In the fall and spring, students were also administered an established screening battery (ASPENS) and a short outcome measure (NSB). A comprehensive outcome measure (SESAT) was administered in the spring only. Analyses explored the predictive properties of a short-form, four-item NLT as compared to an established screening measure. The items within the NLT were also explored for variations in decision-making utility. Particular attention was paid to practical considerations: added value and efficiency.

Summary and Interpretation of Results

The mean performance of the students in this sample does not appear to differ meaningfully from that reported in prior research. Number line research often reports student performance as mean absolute error or percent absolute error (Schneider et al., 2018). This study used summed absolute error (M = 112.27), which, averaged over four trials, gives a mean absolute error of 28.07.
This mean is relatively similar to that found in the cognitive research (24% and 24%; Booth & Siegler, 2006; Siegler & Booth, 2004) and in the number line screening research (29.30 and 33.30; Clarke et al., 2018; Sutherland et al., 2020) with similar-aged students. The rest of the results should be interpreted in this context.

Research Question 1A: Association Among Early Numeracy Measures

First, relations among the early numeracy measures at various timepoints were examined. Across the ASPENS, NSB, and SESAT at both timepoints, the NLT had small concurrent relations with the ASPENS and NSB in the fall (r = -.26 and -.24, respectively). Other concurrent or predictive relations were nonsignificant and/or negligible (r = -.13 to -.19). No evidence exists that the NLT possesses predictive validity to spring NSB or SESAT performance. In comparison, the ASPENS has a strong and significant concurrent association with the NSB in the fall (r = .68), as well as strong and significant predictive associations with the spring NSB and SESAT (r = .59 for both). The large correlations among the fall ASPENS and other numeracy measures support their tapping into similar mathematics constructs.

The small or weak associations of the NLT with other measures should not be wholly unexpected. Cognitive researchers have found that the NLT demonstrates a moderate association with general mathematical competence in elementary-aged children. Schneider et al. (2018) found that age moderated the NLT’s association with general mathematical competence. For early elementary students (aged 6-9), they found an average correlation of .442 between whole-number number line estimation and mathematical competence. Prior to age 6, which likely applies to many children in the fall of kindergarten, number line estimation demonstrates an average correlation of .296 with general mathematical competencies.
In the current study, with 66.4% of the sample under six years of age, the NLT demonstrated a comparable concurrent association with the ASPENS (r = -.26). Notably, the reviewed studies in Schneider et al.’s (2018) meta-analysis with participants under age 6 primarily completed number lines ranging from 0-10 and 0-20. Furthermore, results from a study by Muldoon et al. (2011) suggest that the association of number line estimation and math competence is dependent on the scale used in the task. Muldoon et al. compared 0-10, 0-20 and 0-100 number lines with Scottish and Chinese 5-year-olds and found that the Scottish children performed best on the 0-10 and 0-20 number lines. In addition, the 0-20 performance had the highest associations with other mathematical measures. Thus, low associations found in the current study may be due to the range (0-100) of the number line task used. This range was originally selected due to the mixed ranges used in prior research and due to its alignment with the numbers that kindergarten students encounter throughout the year.

Research Question 1B: Explained Variance

When considered as part of a battery with an established mathematics screener, the NLT did not add meaningful value. When either outcome measure (spring NSB or spring SESAT scores) was regressed on the two predictors (fall NLT and fall ASPENS scores), fall ASPENS performance was the only significant predictor of future performance (R2 = .31 to .34). The NLT explained negligible and nonsignificant variance in the outcomes (R2 < .01). In a prior study, Clarke and colleagues (2018) also did not find statistically significant incremental validity with either a 0-20 or 0-100 number line above the ASPENS. In a conceptual replication, Sutherland and colleagues (2020) found a 0-100 number line contributed 7% incremental validity above the ASPENS.
While 7% added value holds marginal clinical significance, the literature base establishing the association between number line estimation performance and general mathematical competence, together with the mixed results of these screening studies, suggests that task design should be explored further.

Research Question 1C: Classification Accuracy

In terms of their abilities to distinguish between students truly at risk and not at risk, the ASPENS (AUC = .83 to .86) again outperformed the NLT by a large margin (AUC = .59 for both measures). The ASPENS meets the minimum acceptable value (.75) to be effective for determining risk status (Cummings & Smolkowski, 2015). Most importantly, the 95% confidence interval for the NLT’s true AUC value includes or is very close to .50. In other words, it is very likely that this form of the NLT performs no better than chance in identifying students.

Examining the specificity and sensitivity of various cut scores on these measures enriches these conclusions. When maximizing overall classification power, the NLT and ASPENS are similarly able to identify truly “at risk” students. This is true for both the NSB (at a cut score at the 23rd percentile) and the SESAT (at a cut score at the 25th percentile). However, the ASPENS is substantially more specific than the NLT, correctly identifying 29-40% more “not at risk” students.

Schools, however, do not regard false positives and negatives equally. False positives are students who show up as “at risk” on a screener but who would be sufficiently served by their existing classroom supports. These students may be removed from the general education setting for a short, intensified intervention session multiple days a week.
Resources may be spread too thin if too many students are identified as “at risk.” Conversely, false negatives are students who do not show up as “at risk” on a screener but will be underserved by their environment and at risk for academic failure without the introduction of intensified supports. We also know that prevention and early intervention are cost- and time-effective. False negatives risk needing remediation, the more costly alternative to prevention.

Thus, the NLT and ASPENS were examined for their abilities to classify students while substantially reducing false negatives. Both measures are capable of catching approximately 90% of the students truly “at risk.” In doing so, however, the NLT correctly identified only 12-16% of the students “not at risk.” The ASPENS, meanwhile, correctly identified 48-60% of the students “not at risk.” Schools that would utilize the NLT for their classification decisions would either miss many students who truly need supports or would find themselves providing intensified intervention to a large swath of students who wouldn’t have been at risk receiving their existing classroom instruction.

Research Question 2: Item-Level Analyses

Looking at individual item performance on the NLT provides a lens to understanding the measure’s overall performance and potential item-level contributions. Performance on each NLT item is unexpectedly unrelated to performance on any other item. In addition, only Item 3 of the NLT is associated with the criterion math measures. This lack of internal consistency and concurrent or predictive validity suggests this number line task is not measuring related mathematical skills as expected.

In addition, the NLT items considered individually provide no evidence for incremental validity above the ASPENS in explaining the outcome measures. Interestingly, models that only include the NLT Items explain more variance in the outcomes than the full-scale NLT scores.
For example, regressing the spring NSB scores on the individual NLT item scores explains more variance than regressing the spring NSB scores on the full-scale (or sum of the items) NLT scores (R2 = .07 and .03, respectively). A similar result is seen with the spring SESAT scores (R2 = .05 and .03, respectively). Additionally, the unique variance explained by the ASPENS is lower in models with the NLT Items (R2 = .28 and .30) than with the full-scale NLT (R2 = .31 and .32). These findings suggest that certain items of the NLT, compared to their peers, are more related to other math measures.

A hypothesis for the NLT Items, that Item 4 (target numeral 57) may be more informative than the others, was rejected. Rodrigues, Jordan, and Hansen (2019) found that the “simpler” items, such as 1/2 on a 0-1 fraction number line, were the most predictive of general math performance. Similarly, it was hypothesized that a stimulus of 57 would be aided by a student’s abilities a) to identify the two-digit numeral correctly and b) to leverage nascent proportional reasoning to divide the line approximately in half. It was hypothesized that exhibiting these developing mathematical competencies would be associated with greater overall mathematical competence in kindergarten.

However, Item 3 (target numeral 89) was found to be most correlated with other math measures. Item 3 has small correlations with the outcome measures (r = -.23 and -.22). Interestingly, the relationships between Item 3 and the outcomes are stronger than those of the full-scale NLT scores with the outcomes (r = -.19 and -.17). Comparatively, Items 1, 2, and 4 exhibit minimal relationships with the other math measures and, in the models, may simply be noise. Item 3 alone may be responsible for the full-scale’s relation to the other math measures. Thus, summing all four scores together adds noise to Item 3’s contributions.
It is speculated that the uniqueness of Item 3 is due to a large portion of the sample using an undiscerned strategy. It’s important to consider that, if students were randomly responding for all items, scores would exhibit an equal frequency of errors. Students would be just as likely to respond with 1 to a prompt to locate the number 61 on the line as they would with 100. However, the cognitive evidence base shows that, when presented with an array of similar options, people attend to and choose options near the middle most often (Atalay, Bodur, & Rasolofoarison, 2012; Christenfeld, 1995; Lo & Tsang, 2018; Rodway, Schepman, & Lambert, 2012). For children presented with unfamiliar numbers, all possible locations on the number line may seem equal. Thus, a tendency towards the center is plausible.

In examining the specific items, Items 1 (target numeral 34) and 4 (target numeral 57) are closest to the center of the number line. If students were employing the strategy hypothesized above, such that all placements on the number line are treated relatively equally, one would expect smaller average errors for items nearest the middle. In fact, post-hoc comparisons of the means revealed that students are, on average, more accurate for Items 1 and 4 than for Item 2. Thus, it is possible that a number of students were employing this pattern of responding on the NLT.

While the data are presented as absolute error, without directionality, examination of the item distributions also contributes to this theory. More so than the other items, the distribution of Item 3 is somewhat bimodal (Figure 10). These two peaks appear to be centered around 0-10 and around 40. Using these absolute errors, we can infer response patterns to some extent. With the first error peak around 0-10, we may infer one “group” of students responded in the range of 79-99. For the other peak, we can conclude directionality.
Any error above 11 (due to the endpoint of 100) means the student had a negatively-oriented error, or they responded with a number less than 89. Thus, the other “group” of students responded around the midpoint of the line. Similarly, the peak of Item 2’s (target numeral 12) error distribution is around 30. A cluster of errors around 40 would add credence to this theory.

This pattern of behavior is also supported by Muldoon et al.’s (2011) findings. Their data showed that children responded in a relatively linear pattern for numbers under 15 (on the 0-100 number line). This pattern was not observed above 15, and they concluded children were likely guessing for these numbers. It’s possible students in the current study approached the task similarly and responded semi-randomly for unfamiliar numbers. However, this study had only one item below 15, limiting the ability to infer trends above or below 15.

More importantly, this theory may interact with why Item 3 is most predictive. Attention should be paid again to the bimodal nature of Item 3’s distribution. This item may distinguish, more than the other items, between guessers and non-guessers. The correct placements for Items 1 (target 34) and 4 (target 57) are near the midpoint of the line, so these items may fail to distinguish students who understand these numbers from students who are using a semi-random strategy. Item 3’s target numeral is nearest the rightmost endpoint and is also a number that kindergarteners are not expected to be familiar with at school entry (Muldoon et al., 2011). In contrast, knowing (or not knowing) where to place 89 on the number line at kindergarten may be indicative of future performance (to a small degree). Future research is needed to explore item-level utility.

Implications

In an applied context, the findings from this study do not present value for schools. This study found no evidence that a four-item NLT promotes more informed screening decisions.
By itself and while supplementing an established screener, this NLT does not uniquely contribute to predicting how students will perform at the end of the year. This holds true whether students are judged by a short-form outcome measure (the NSB) or a longer, comprehensive outcome measure (the SESAT). Evidence supports that, at best, educators will get a small sense of their students’ mathematical competency from the NLT. However, the NLT does not provide any information that would not be better provided by the ASPENS. Schools do not have excess time to assess each student with an additional math measure, no matter how short, if it does not provide actionable information.

However, prior studies of the NLT in an educational context (Clarke et al., 2018; Sutherland et al., 2020) have found unique and meaningful contributions of the NLT in screening decisions. While it does not appear the students in this sample performed worse, on average, than those in other studies, a clear difference between this study and its peers is the format of the number line task employed. In the prior cited number line screening studies, the number line assessment had 26 items compared to the four-item form in the current study. Additional items sample more behavior and, potentially, allow for a greater approximation of the construct of interest. In this case, that is general mathematical competence. The prior studies utilized random or semi-random ordering of the task items. The current study did not, limiting the ability to counteract order effects. Order effects can be substantial in number line estimation tasks as prior items can serve as a mental “anchor” that affects future placement of numbers (Siegler, 2016; White & Szucs, 2012). These studies also utilized a descriptive task explanation and practice items with corrective feedback prior to the measure.
It is possible these additions would have increased the validity of students’ responses, or the likelihood that students would be able to respond in ways that reflect their true mathematical knowledge. Lastly, the current study utilized a paper-and-pencil format. For students in the fall of kindergarten, it is unclear if motor skills would be a barrier to participation. The cited prior studies used an iPad-administered form, which could also be influenced by motor skills. Future number line research should carefully consider these task differences when designing or utilizing number line estimation tasks with young children.

Rodrigues, Jordan, and Hansen (2019) found that a small number of items on their number line measure held a disproportionate amount of the predictive power. The items most predictive of future performance were the “simpler” items such as 1/2 or 5/6 on a fraction number line. One would assume the simplest item on this shortened NLT would be Item 2 (target numeral 12), but this was not the case. Items 1 (target number 34) and 4 (target number 57) appeared to be the easiest. One theory is that this range of numbers may have been inappropriate for this age group. However, as mentioned earlier, prior studies have utilized the 0-100 number line with some success. These prior studies demonstrated similar amounts of error by participants and yet derived greater value from their number line measures. The task may have been misunderstood by participants due to limited directions and practice items.

A goal of this study was to explore whether a measure that optimizes efficiency (by reducing the number of items and thus the time needed to administer) could maintain the predictive value demonstrated in prior studies. The results do not indicate that this four-item form of number line estimation achieves that goal.
However, the pursuit of a screener that efficiently leverages a smaller selection of highly predictive items still holds potential value for school practice.

Limitations and Future Research

These findings should be interpreted in the context of the study’s limitations. While linear regression is relatively robust to violations of assumptions, the data in this study violated numerous assumptions. The most striking of these violations is the bimodal nature of the NLT summed scores, which is also seen in the distribution of Item 3. The ASPENS in the fall also violates the assumption of normality, though the distribution more closely resembles a normal distribution in the spring. It is possible that the scores of kindergarteners who are new to formal education more closely resemble a normal distribution after a year of instruction. Regardless, these violations reduce confidence in the results.

Another limitation is the relatively small sample size, which limits the extent to which this sample could generalize to a typical school. Furthermore, 25 students who did not provide an appropriate response to all four NLT items were excluded. In other words, over 10% of the potential sample was removed. This group may include students who misunderstood the task as well as students who were not attending. Representing these students in the sample may have improved the predictive value of the NLT.

The construction of the number line task presents a number of limitations and future directions. Firstly, the NLT had a response range for each item of 0-100. Unlike the other measures used, which dichotomize responses as right or wrong, the NLT allows for a continuous spectrum of responses. This can create situations where, given a stimulus of 12, one student responds with position 65 on the number line and another responds with position 86. Is the former student demonstrating greater understanding of the numeral 12?
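To make the scoring question concrete, the two hypothetical responses above can be scored by absolute error, one common way to quantify number line estimates. The sketch below is illustrative only: the function name `score_response` and the choice of absolute error in number-line units are assumptions made for this example and are not necessarily the scoring procedure used in this study.

```python
def score_response(target: float, response: float, line_max: float = 100.0) -> dict:
    """Score one number line placement (illustrative helper, not the study's procedure)."""
    if not 0 <= response <= line_max:
        raise ValueError("response must fall on the number line")
    # Negative values mean the placement was to the left of the target.
    signed_error = response - target
    return {
        "signed_error": signed_error,
        "absolute_error": abs(signed_error),
        # Largest error possible to the right of the target, bounded by the endpoint.
        "max_overestimate": line_max - target,
    }


if __name__ == "__main__":
    # The two hypothetical responses to a stimulus of 12 described in the text.
    for response in (65, 86):
        print(response, score_response(target=12, response=response))
```

Under this scoring, the response at 65 carries an absolute error of 53 and the response at 86 an error of 74, yet both placements are far from the target, so the smaller error need not reflect a meaningfully better understanding of 12. The same helper also shows why, for a target of 89, any error larger than 11 can only come from a response below the target.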
The breadth of the 0-100 response range is particularly salient considering that prior studies tend to administer a 0-10 or 0-20 number line to young children (Schneider et al., 2018) or found greater associations with a 0-20 number line than a 0-100 number line task (Muldoon et al., 2011). Alternatively, number lines with smaller ranges may be more closely linked to students’ developmental level at kindergarten entry. Due to the mixed evidence regarding range for the number line estimation task, future research should explore varying number line ranges for predicting future performance and different methods for quantifying student responses.

The background information and directions provided to participating students prior to the NLT were limited, especially in comparison to other studies. Given the novel nature of the task compounded with kindergarten students’ newness to formal education, more detailed directions including teaching items may be critical to ensure that students fully understand task demands. Future studies are urged to explain the task to participants and provide practice items to increase the possibility of measuring a participant’s true mathematical competence.

Studies demonstrating promise of the 0-100 NLT with young children have utilized a 26-item form, unlike the short four-item form used here. Additional items, or additional behaviors sampled, appear to increase the NLT’s association with future math achievement in young children. However, the balance between efficiency and the minimum necessary number of items remains an open question. This point is especially salient knowing that the administration time of the 26-item form utilized by Sutherland and colleagues (2020) commonly met the self-imposed five-minute limit.

In the pursuit of consolidated measures, the specific items selected for stimuli should be considered.
Because of the current finding that certain items may have added noise to the model and detracted from the value of the full scale, identifying high-value items may be critical for increasing the value of the NLT as a screener. As opposed to randomly sampling across the chosen number range, items could be strategically selected based on prior hypotheses or data. Because of the evidence base supporting the logarithmic-to-linear shift in young children, oversampling 0-20 on a 0-100 number line may be more developmentally appropriate. Additionally, the Common Core calls for kindergarteners to be able to count by both ones and tens to 100 (Practices, 2010). Oversampling the decades may align with and be sensitive to students’ response to classroom instruction. Older students are observed to rely upon familiar anchor points (such as 25, 50, or 75; D. Cohen & Sarnecka, 2016). While kindergarteners may not use these numbers as benchmarks, they may instead use numbers within their familiar range, such as 5, 10, and 15. Research including such items can illuminate which are most predictive. However, the consolidation of items should be performed post hoc, once the effectiveness of particular items is established. Then, screeners can maximize efficiency by including only these high-utility items. Finally, counterbalancing these items would reduce the influence of potential order effects.

Future number line research is urged to explore various task forms to design more effective and efficient screeners. Certain factors should be consistent, such as task directions and practice items, while other factors would benefit from variation across forms, times, and skill levels.

Conclusion

Schools need accurate and efficient screeners that empower them to make better decisions around mathematics instruction and intervention.
While this study does not provide evidence of this form of the NLT filling that gap, other studies have demonstrated unique contributions of assessing the mental number line. Evidence supports the number line’s association with mathematics concepts spanning diverse 63 competencies and ages, highlighting the potential for a versatile mathematics screener across school grades. However, practical implementation is crucial. Though schools exist in a context that demands efficiency and accounting for every minute, this study cautions that maximizing efficiency may sacrifice clinical utility. Knowing the long-reaching implications of successful early prevention and intervention, future research should strive to increase schools’ abilities to make informed decisions around who to serve. 64 REFERENCES CITED Albers, C. A., & Kettler, R. J. (2014). Best practices in universal screening. Best Practices in School Psychology: Data-Based and Collaborative Decision Making, 121–131. Ashcraft, M. H., & Moore, A. M. (2012). Cognitive processes of numerical estimation in children. Journal of Experimental Child Psychology, 111(2), 246–267. https://doi.org/10.1016/j.jecp.2011.08.005 Atalay, A. S., Bodur, H. O., & Rasolofoarison, D. (2012). Shining in the center: Central gaze cascade effect on product choice. Journal of Consumer Research, 39(4), 848– 866. https://doi.org/10.1086/665984 Baker, S., Gersten, R., Flojo, J., Katz, R., Chard, D., & Clarke, B. (2002). Preventing mathematics difficulties in young children: Focus on effective screening of early number sense delays. Technical Report. Balu, R., Zhu, P., Doolittle, F., Schiller, E., Jenkins, J., & Gersten, R. (2015). Evaluation of response to intervention practices for elementary school reading. NCEE 2016- 4000. National Center for Education Evaluation and Regional Assistance. Baroody, A. J. (2002). The developmental foundations of number and operation sense. 
In Poster presented at the EHR/REC (NSF) Principal Investigators’ Meeting (“Learning and Education: Building Knowledge, Understanding Its Implications”), Arlington, VA. Barth, H. C., & Paladino, A. M. (2011). The development of numerical estimation: Evidence against a representational shift. Developmental Science, 14(1), 125–135. https://doi.org/10.1111/j.1467-7687.2010.00962.x Berch, D. B. (2005). Making sense of number sense: Implications for children with mathematical disabilities. Journal of Learning Disabilities, 38(4), 333–339. https://doi.org/10.1177/00222194050380040901 Berteletti, I., Lucangeli, D., Piazza, M., Dehaene, S., & Zorzi, M. (2010). Numerical estimation in preschoolers. Developmental Psychology, 46(2), 545–551. https://doi.org/10.1037/a0017887 Bodovski, K., & Farkas, G. (2007). Do instructional practices contribute to inequality in achievement? The case of mathematics instruction in kindergarten. Journal of Early Childhood Research, 5(3), 301–322. 65 Booth, J. L., & Siegler, R. (2006). Developmental and individual differences in pure numerical estimation. Developmental Psychology, 42(1), 189–201. https://doi.org/10.1037/0012-1649.41.6.189 Boyer, T. W., Levine, S. C., & Huttenlocher, J. (2008). Development of proportional reasoning: Where young children go wrong. Developmental Psychology, 44(5), 1478–1490. https://doi.org/10.1037/a0013110 Case, R. (1998). A psychological model of number sense and its development. In annual meeting of the American Educational Research Association, San Diego. Case, R., Okamoto, Y., Griffin, S., McKeough, A., Bleiker, C., Henderson, B., … Keating, D. P. (1996). The role of central conceptual structures in the development of children’s thought. Monographs of the Society for Research in Child Development, i–295. Center for Education Statistics, N. (2015). 2015 NAEP mathematics grades 4 and 8 assessment report cards: Summary data tables for national and state average scores and achievement level results. 
Retrieved from https://www.nationsreportcard.gov/reading_math_2015/files/2015_Results_Appendi x_Math.pdf Chan, C., Chan, G. C. H., Leeper, T. J., & Becker, J. (2018). rio: A Swiss-army knife for data file I/O. Chard, D. J., Clarke, B., Baker, S., Otterstedt, J., Braun, D., & Katz, R. (2005). Using measures of number sense to screen for difficulties in mathematics: Preliminary findings. Assessment for Effective Intervention, 30(2), 3–14. https://doi.org/10.1177/073724770503000202 Christenfeld, N. (1995). Choices from identical options. Psychological Science, 6(1), 50– 55. https://doi.org/10.1111/j.1467-9280.1995.tb00304.x Clarke, B., Baker, S., Smolkowski, K., & Chard, D. J. (2008). An analysis of early numeracy curriculum-based measurement: Examining the role of growth in student outcomes. Remedial and Special Education, 29(1), 46–57. https://doi.org/10.1177/0741932507309694 Clarke, B., Doabler, C. T., Fien, H., Baker, S. K., & Smolkowski, K. (2012). A randomized control trial of a Tier 2 kindergarten mathematics intervention (Project ROOTS). US Department of Education, Institute of Education Sciences. Special Education Research, CFDA, (84.324), 2012–2016. 66 Clarke, B., Gersten, R. M., Dimino, J., & Rolfhus, E. (2011). Assessing student proficiency of number sense (ASPENS). Longmont, CO: Cambium Learning Group, Sopris Learning. Clarke, B., & Shinn, M. R. (2004). A preliminary investigation into the identification and development of early mathematics curriculum-based measurement. School Psychology Review, 33(2), 234–248. Clarke, B., Strand Cary, M. G., Shanley, L., & Sutherland, M. (2018). Exploring the promise of a number line assessment to help identify students at-risk in mathematics. Assessment for Effective Intervention, 153450841879173. https://doi.org/10.1177/1534508418791738 Clements, D. H. (1999). Subitizing: What is it? Why teach it? Teaching Children Mathematics, 5, 400–405. Clements, D. H., Sarama, J., & DiBiase, A.-M. (2003). 
Engaging young children in mathematics: Standards for early childhood mathematics education. Routledge. Cohen, D., & Sarnecka, B. W. (2016). Children’s number-line estimation shows development of measurement skills (not number representations). Developmental Psychology, 93(4), 292–297. https://doi.org/10.1016/j.contraception.2015.12.017.Women Cohen, J. (1992). A power primer. Psychological Bulletin, 112(1), 155. Conoyer, S. J., Foegen, A., & Lembke, E. S. (2016). Early numeracy indicators: Examining predictive utility across years and states. https://doi.org/10.1177/0741932515619758 Cooper Jr, R. G. (1984). Early number development. In Origins of cognitive skills: The eighteenth annual Carnegie symposium on cognition (pp. 157–192). Erlbaum. Cross, C. T., Woods, T. A., & Schweingruber, H. E. (2009). Mathematics learning in early childhood: Paths toward excellence and equity. National Academies Press. https://doi.org/10.17226/12519 Cummings, K. D., & Smolkowski, K. (2015). Selecting students at risk of academic difficulties. Assessment for Effective Intervention, 41(1), 55–61. https://doi.org/10.1177/1534508415590396 Curtin, J. (2018). lmSupport: Support for linear models. Retrieved from https://cran.r- project.org/package=lmSupport 67 Dehaene, S. (2001). Précis of the number sense. Mind and Language, 16(1), 16–36. https://doi.org/10.1111/1468-0017.00154 Dehaene, S., Bossini, S., & Giraux, P. (1993). The mental representation of parity and number magnitude access to parity and magnitude knowledge during number processing. Journal of Experimental Psychology: General, 122(3), 371–396. Dehaene, S., Dupoux, E., & Mehler, J. (1990). Is numerical comparison digital? Analogical and symbolic effects in two-digit number comparison. Journal of Experimental Psychology: Human Perception and Performance, 16(3), 626. Duncan, G. J., Dowsett, C. J., Claessens, A., Magnuson, K., Huston, A. C., Klebanov, P., … Japel, C. (2007). School readiness and later achievement. 
Developmental Psychology, 43(6), 1428–1446. https://doi.org/10.1037/0012-1649.43.6.1428 Feigenson, L., Libertus, M. E., & Halberda, J. (2013). Links between the intuitive sense of number and formal mathematics ability. Child Development Perspectives, 7(2), 74–79. https://doi.org/10.1111/cdep.12019 Fletcher, J. M., & Vaughn, S. (2009). Response to intervention: Preventing and remediating academic difficulties. Child Development Perspectives, 3(1), 30–37. https://doi.org/10.1111/j.1750-8606.2008.00072.x Foegen, A., Jiban, C., & Deno, S. (2007). Progress monitoring measures in mathematics: A review of the literature. The Journal of Special Education, 41(2), 121–139. Friso-van den Bos, I., Kroesbergen, E. H., Van Luit, J. E. H., Xenidou-Dervou, I., Jonkman, L. M., Van der Schoot, M., & Van Lieshout, E. C. D. M. (2015). Longitudinal development of number line estimation and mathematics performance in primary school children. Journal of Experimental Child Psychology, 134, 12–29. https://doi.org/10.1016/j.jecp.2015.02.002 Fuchs, D., & Fuchs, L. S. (2017). Critique of the national evaluation of response to intervention: A case for simpler frameworks. Exceptional Children, 83(3), 255–268. https://doi.org/10.1177/0014402917693580 Fuchs, L. S., Fuchs, D., Compton, D. L., Bryant, J. D., Hamlett, C. L., & Seethaler, P. M. (2007). Mathematics screening and progress monitoring at first grade: Implications for responsiveness to intervention. Exceptional Children, 73(3), 311–330. https://doi.org/10.1177/001440290707300303 68 Fuchs, L. S., Fuchs, D., Hamlett, C. L., Thompson, A., Roberts, P. H., Kubek, P., & Stecker, P. M. (1994). Technical features of a mathematics concepts and applications curriculum-based measurement system. Diagnostique, 19(4), 23–49. Gallistel, C. R., & Gelman, R. (1992). Preverbal and verbal counting and computation. Cognition, 44(1–2), 43–74. Geary, D. C. (2011). 
Consequences, characteristics, and causes of mathematical learning disabilties and persistent low achievement in mathematics. Journal of Devleopmental Behaviour Pediatrics, 32(3), 250–263. https://doi.org/10.1097/DBP.0b013e318209edef.Consequences Geary, D. C., Hoard, M. K., Nugent, L., & Byrd-Craven, J. (2008). Development of number line representations in children with mathematical learning disability. Developmental Neuropsychology, 33(3), 277–299. https://doi.org/10.1080/87565640801982361 Gersten, R., Beckmann, S., Clarke, B., Foegen, A., Marsh, L., Star, J. R., & Witzel, B. (2009). Assisting students struggling with mathematics: Response to Intervention (RtI) for elementary and middle schools. What Works Clearinghouse. https://doi.org/10.1016/j.jhazmat.2011.04.026 Gersten, R., Clarke, B., Jordan, N. C., Newman-Gonchar, R., Haymond, K., & Wilkins, C. (2012). Universal screening in mathematics for the primary grades: Beginnings of a research base. Exceptional Children, 78(4), 423–445. https://doi.org/10.1177/001440291207800403 Gersten, R., Jordan, N. C., & Flojo, J. R. (2005). Early identification and interventions for students with mathematics difficulties. Journal of Learning Disabilities, 38(4), 293–304. https://doi.org/10.1177/00222194050380040301 Goode, K., & Rey, K. (2019). ggResidpanel: Panels and interactive versions of diagnostic plots using “ggplot2.” Retrieved from https://cran.r- project.org/package=ggResidpanel Griffin, S. (2002). The development of math competence in the preschool and early school years: Cognitive foundations and instructional strategies. Mathematical Cognition, 1–32. Griffin, S. (2004). Building number sense with Number Worlds: A mathematics program for young children. Early Childhood Research Quarterly, 19(1), 173–180. 69 Griffin, S., Case, R., & Siegler, R. S. (1994). Rightstart: Providing the central conceptual prerequisites for first formal learning of arithmetic to students at risk for school failure. The MIT Press. Hampton, D. 
D., Lembke, E. S., Lee, Y. S., Pappas, S., Chiong, C., & Ginsburg, H. P. (2012). Technical adequacy of early numeracy curriculum-based progress monitoring measures for kindergarten and first-grade students. Assessment for Effective Intervention, 37(2), 118–126. https://doi.org/10.1177/1534508411414151 Hansen, N. (2015). Development of fraction knowledge: A longitudinal study from third through sixth grade (Doctoral dissertation). University of Delaware. Hansen, N., Jordan, N. C., & Rodrigues, J. (2017). Identifying learning difficulties with fractions: A longitudinal study of student growth from third through sixth grade. Contemporary Educational Psychology, 50, 45–59. https://doi.org/10.1016/j.cedpsych.2015.11.002 Harcourt Educational Measurement. (2002). Stanford Achievement Test-Tenth edition. San Antonio, Texas: Author. Harrell Jr, F. E. (2020). Hmisc: Harrell Miscellaneous. Retrieved from https://cran.r- project.org/package=Hmisc Hiebert, J., & Wearne, D. (1996). Instruction, understanding, and skill in multidigit addition and subtraction. Cognition and Instruction, 14(3), 251–283. https://doi.org/10.1207/s1532690xci1403_1 Hudson, P., & Miller, S. P. (2005). Designing and implementing mathematics instruction for students with diverse learning needs. Allyn & Bacon. Individuals with Disabilities Education Act (2004). 20 U.S.C. § 1400. Jordan, N. C., Glutting, J., & Ramineni, C. (2008). A number sense assessment tool for identifying children at risk for mathematical difficulties. Elsevier Inc. https://doi.org/10.1016/B978-012373629-1.50005-8 Jordan, N. C., Glutting, J., & Ramineni, C. (2010). The importance of number sense to mathematics achievement in first and third grades. Learning and Individual Differences, 20(2), 82–88. https://doi.org/10.1016/j.lindif.2009.07.004 Jordan, N. C., Glutting, J., Ramineni, C., & Watkins, M. W. (2010). 
Validating a number sense screening tool for use in kindergarten and first grade: Prediction of mathematics proficiency in third grade. School Psychology Review, 39(2), 181–195. 70 Jordan, N. C., Hansen, N., Fuchs, L. S., Siegler, R. S., Gersten, R., & Micklos, D. (2013). Developmental predictors of fraction concepts and procedures. Journal of Experimental Child Psychology, 116(1), 45–58. https://doi.org/10.1016/j.jecp.2013.02.001 Jordan, N. C., Kaplan, D., & Hanich, L. B. (2002). Achievement growth in children with learning difficulties in mathematics: Findings of a two-year longitudinal study. Journal of Educational Psychology, 94(3), 586–597. https://doi.org/10.1037//0022- 0663.94.3.586 Jordan, N. C., Kaplan, D., Olah, L. N., & Locuniak, M. N. (2006). Number sense growth in kindergarten: A longitudinal investigation of children at-risk for mathematics difficulities. Child Development, 77(1), 153–177. https://doi.org/10.1111/j.1467- 8624.2006.00862.x Jordan, N. C., Kaplan, D., Ramineni, C., & Locuniak, M. N. (2009). Early math matters: Kindergarten number competence and later mathematics outcomes. Developmental Psychology, 45(3), 850–867. https://doi.org/10.1037/a0014939 Judge, S., & Watson, S. M. R. (2011). Longitudinal outcomes for mathematics achievement for students with learning disabilities. Journal of Educational Research, 104(3), 147–157. https://doi.org/10.1080/00220671003636729 Laski, E. V., Casey, B. M., Yu, Q., Dulaney, A., Heyman, M., & Dearing, E. (2013). Spatial skills as a predictor of first grade girls’ use of higher level arithmetic strategies. Learning and Individual Differences, 23, 123–130. https://doi.org/10.1016/J.LINDIF.2012.08.001 Laski, E. V., & Siegler, R. (2007). Is 27 a big number? correlational and causal connections among numerical categorization, number line estimation, and numerical magnitude comparison. Child Development, 78(6), 1723–1743. https://doi.org/10.1111/j.1467-8624.2007.01087.x Lee, Y. S., & Lembke, E. (2016). 
Developing and evaluating a kindergarten to third grade CBM mathematics assessment. ZDM - Mathematics Education, 48(7), 1019– 1030. https://doi.org/10.1007/s11858-016-0788-6 Lembke, E., & Foegen, A. (2009). Identifying early numeracy indicators for kindergarten and first-grade students. Learning Disabilities Research & Practice, 24(1), 12–20. https://doi.org/10.1111/j.1540-5826.2008.01273.x Lo, L. Y., & Tsang, C. Y. (2018). Best thing is always in the middle? An investigation of centrality preference by eye-tracking technique and memory recall. Journal of Pacific Rim Psychology, 12. https://doi.org/10.1017/prp.2018.5 71 Mazzocco, M. M. M. (2005). Challenges in identifying target skills for math disability screening and intervention. Journal of Learning Disabilities, 38(4), 318–323. https://doi.org/10.1177/00222194050380040701 Mazzocco, M. M. M., & Thompson, R. E. (2005). Kindergarten predictors of math learning disability. Learning Disabilities Research and Practice, 20(3), 142–155. https://doi.org/10.1111/j.1540-5826.2005.00129.x Mcclure, E. R., Guernsey, L., Clements, D. H., Bales, S. N., Nichols, J., Kendall-Taylor, N., & Levine, M. H. (2017). STEM starts early: Grounding science, technology, engineering, and math education in early childhood. Retrieved from http://joanganzcooneycenter.org/publication/stem-starts-early/ Mix, K. S., Huttenlocher, J., & Levine, S. C. (2002). Quantitative development in infancy and early childhood. Oxford University Press. Morgan, P. L., Farkas, G., & Wu, Q. (2009). Five-year growth trajectories of kindergarten children with learning difficulties in mathematics. Journal of Learning Disabilities, 42(4), 306–321. https://doi.org/10.1177/0022219408331037 Moyer, R. S., & Landauer, T. K. (1967). Time required for judgements of numerical inequality. Nature, 215(5109), 1519–1520. Muldoon, K., Simms, V., Towse, J., Menzies, V., & Yue, G. (2011). Cross-cultural comparisons of 5-year-olds’ estimating and mathematical ability. 
Journal of Cross- Cultural Psychology, 42(4), 669–681. https://doi.org/10.1177/0022022111406035 Müller, K. (2017). here: A simpler way to find your files. Retrieved from https://cran.r- project.org/package=here National Center for Education Statistics. (2015). Postsecondary attainment: Differences by socioeconomic status. The Condition of Education, 1–7. Retrieved from https://nces.ed.gov/programs/coe/pdf/coe_tva.pdf National Mathematics Advisory Panel. (2008). Foundation for success: The final report of the national mathematics advisory panel. U.S. Department of Education (Vol. 37). Washington, DC. https://doi.org/10.3102/0013189X08329195 National Research Council. (2001). Adding it up: Helping children learn mathematics. Washington, D.C. 72 National Science Board. (2015). Revisiting the STEM workforce: A companion to science and engineering indicators 2014. Arlington. National Science Foundation VA. OECD, O. (2012). Equity and quality in education: Supporting disadvantaged students and schools. Computer-Supported Collaborative Learning Conference, CSCL. OECD Publishing Paris. Okamoto, Y., Case, R., & Maes. (1996). Exploring the microstructure of children’s central conceptual structures in the domain of number. Monographs of the Society for Research in Child Development, 61(1‐2), 27–58. Olson, J. F., Martin, M. O., & Mullis, I. V. S. (2008). TIMSS 2007 technical report. TIMSS & PIRLS International Study Center. Pedhazur, E. J., & Kerlinger, F. N. (1982). Multiple regression in behavioral research. Holt, Rinehart, and Winston. Peeters, D., Degrande, T., Ebersbach, M., Verschaffel, L., & Luwel, K. (2016). Children’s use of number line estimation strategies. European Journal of Psychology of Education, 31(2), 117–134. https://doi.org/10.1007/s10212-015-0251- z Phillips, G. W. (2007). Chance favors the prepared mind: Mathematics and science indicators for comparing states and nations. American Institutes for Research. Washington, DC. 
https://doi.org/10.1097/CCM.0b013e31820e6be4 Practices, N. G. A. C. for B. (2010). Common Core State Standards for Mathematics. Common Core State Standards Initiative. Purpura, D. J., Reid, E. E., Eiland, M. D., & Baroody, A. J. (2015). Using a brief preschool early numeracy skills screener to identify young children with mathematics difficulties. School Psychology Review, 44(1), 41–59. https://doi.org/10.17105/SPR44-1.41-59 R Development Core Team, R. (2011). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://doi.org/10.1007/978-3- 540-74686-7 Ritchie, S. J., & Bates, T. C. (2013). Enduring links from childhood mathematics and reading achievement to adult socioeconomic status. Psychological Science, 24(7), 1301–1308. https://doi.org/10.1177/0956797612466268 73 Robin, X., Turck, N., Hainard, A., Tiberti, N., Lisacek, F., Sanchez, J.-C., & Müller, M. (2011). pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics, 12, 77. Rodrigues, J., Jordan, N. C., & Hansen, N. (2019). Identifying fraction measures as screeners of mathematics risk status. Journal of Learning Disabilities, 52(6), 480– 497. https://doi.org/10.1177/0022219419879684 Rodway, P., Schepman, A., & Lambert, J. (2012). Evidence for the centre‐stage effect, 222(July 2011), 215–222. Schneider, M., Grabner, R. H., & Paetsch, J. (2009). Mental number line, number line estimation, and mathematical achievement: Their interrelations in grades 5 and 6. Journal of Educational Psychology, 101(2), 359–372. https://doi.org/10.1037/a0013840 Schneider, M., Merz, S., Stricker, J., De Smedt, B., Torbeyns, J., Verschaffel, L., & Luwel, K. (2018). Associations of number line estimation with mathematical competence: A meta-analysis. Child Development, 89(5), 1467–1484. https://doi.org/10.1111/cdev.13068 Schulte, A. C., & Stevens, J. J. (2015). Once, sometimes, or always in special education: Mathematics growth and achievement gaps. 
Exceptional Children, 81(3), 370–387. https://doi.org/10.1177/0014402914563695 Seethaler, P. M., & Fuchs, L. S. (2010). The predictive utility of kindergarten screening for math difficulty. Exceptional Children, 77(1), 37–59. https://doi.org/10.1177/001440291007700102 Shinn, M. R. (2006). Best practices in using curriculum-based measurement in a problem-solving model. Best Practices in School Psychology V, I(c), 243–261. Retrieved from http://www.nasponline.org/publications/booksproducts/bp5.aspx Siegler, R. (2016). Magnitude knowledge: The common core of numerical development. Developmental Science, 19(3), 341–361. https://doi.org/10.1111/desc.12395 Siegler, R., & Booth, J. (2004). Development of numerical estimation in young children. Child Development, 75(2), 428–444. https://doi.org/10.1111/j.1467- 8624.2004.00684.x Siegler, R., & Lortie-Forgues, H. (2014). An integrative theory of numerical development. Child Development Perspectives, 8(3), 144–150. https://doi.org/10.1111/cdep.12077 74 Siegler, R., & Opfer, J. E. (2003). The development of numerical estimation: Evidence for multiple representations of numerical quantity. Psychological Science, 14(3), 237–243. https://doi.org/10.1111/1467-9280.02438 Siegler, R., Thompson, C. A., & Opfer, J. E. (2009). The logarithmic-to-linear shift: One learning sequence, many tasks, many time scales. Mind, Brain, and Education, 3(3), 143–150. https://doi.org/10.1111/j.1751-228X.2009.01064.x Siegler, R., Thompson, C. A., & Schneider, M. (2011). An integrated theory of whole number and fractions development. Cognitive Psychology, 62(4), 273–296. https://doi.org/10.1016/j.cogpsych.2011.03.001 Simmons, D. C., Kame’enui, E. J., Good, R. H., Harn, B., Cole, C., & Braun, D. (2000). Building, implementing, and sustaining a beginning reading model: School by school and lessons learned. OSSC Bulletin, 43(3), 3–30. Starkey, P., & Cooper, R. G. (1980). Perception of numbers by human infants. Science, 210(4473), 1033–1035. 
Starkey, P., Spelke, E. S., & Gelman, R. (1990). Numerical abstraction by human infants. Cognition, 36(2), 97–127. https://doi.org/10.1016/0010-0277(90)90001-Z Starr, A., Libertus, M. E., & Brannon, E. M. (2013). Number sense in infancy predicts mathematical abilities in childhood. Proceedings of the National Academy of Sciences of the United States of America, 110(45), 18116–18120. https://doi.org/10.1073/pnas.1302751110 Sutherland, M., Clarke, B., Nese, J. F. T., Strand Cary, M., Shanley, L., Furjanic, D., & Durán, L. (2020). Investigating the utility of a kindergarten number line assessment compared to an early numeracy screening battery. Torgesen, J. K. (2000). Individual differences in response to early interventions in reading: The lingering problem of treatment resisters. Learning Disabilities Research & Practice, 15(1), 55–64. Torgesen, J. K. (2002). The orevention of reading difficulties. Journal of School Psychology, 40(1), 7–26. Torgesen, J. K., Alexander, A. W., Wagner, R. K., Rashotte, C. A., Voeller, K. K. S., & Conway, T. (2001). Intensive remedial instruction for children with severe reading disabilities: Immediate and long-term outcomes from two instructional approaches. Journal of Learning Disabilities, 34(1), 33–58. Retrieved from http://hdl.handle.net/10829/5385 75 Vanderheyden, A. M., Codding, R., & Martin, R. (2017). Relative value of common screening measures in mathematics. School Psychology Review, 46(1), 65–87. https://doi.org/10.17105/SPR46-1.65-87 VanDerHeyden, A. M., Witt, J. C., Naquin, G., & Noell, G. (2001). The reliability and validity of curriculum-based measurement readiness probes for kindergarten students. School Psychology Review, 30(3), 363–382. Vaughn, S., & Wanzek, J. (2014). Intensive interventions in reading for students with reading disabilities: Meaningful impacts. Learning Disabilities Research and Practice, 29(2), 46–53. https://doi.org/10.1111/ldrp.12031 Vaughn, S., Wexler, J., Roberts, G., Barth, A. A., Cirino, P. 
T., Romain, M. A., … Denton, C. A. (2011). Effects of individualized and standardized interventions on middle school students with reading disabilities. Exceptional Children, 77(4), 391–407. https://doi.org/10.1177/001440291107700401 Wagner, S. H., & Walters, J. (1982). A longitudinal analysis of early number concepts: From numbers to number. Action and Thought: From Sensorimotor Schemes to Symbolic Operations, 137–161. Walker, H. M., Horner, R. H., Sugai, G., Bullis, M., Sprague, J. R., Bricker, D., & Kaufman, M. J. (1996). Integrated approaches to preventing antisocial behavior patterns among school-age children and youth. Journal of Emotional and Behavioral Disorders, 4(4), 194–209. Watts, T. W., Duncan, G. J., Siegler, R. S., & Davis-Kean, P. E. (2014). What’s past is prologue: Relations between early mathematics knowledge and high school achievement. Educational Researcher, 43(7), 352–360. https://doi.org/10.3102/0013189X14553660 Wei, X., Lenz, K. B., & Blackorby, J. (2013). Math growth trajectories of students with disabilities: Disability category, gender, racial, and socioeconomic status differences from ages 7 to 17. Remedial and Special Education, 34(3), 154–165. https://doi.org/10.1177/0741932512448253 White, S. L. J. J., & Szucs, D. (2012). Representational change and strategy use in children’s number line estimation during the first years of primary school. Behavioral and Brain Functions, 8, 1–12. https://doi.org/10.1186/1744-9081-8-1 Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer-Verlag New York. Retrieved from https://ggplot2.tidyverse.org Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L. D., François, R., … Yutani, H. (2019). Welcome to the {tidyverse}. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686 Wickham, H., & Miller, E. (2019). haven: Import and export “SPSS”, “Stata” and “SAS” files. Retrieved from https://cran.r-project.org/package=haven Wilke, C. O. (2019).
cowplot: Streamlined plot theme and plot annotations for “ggplot2.” Retrieved from https://cran.r-project.org/package=cowplot Wood, G., Willmes, K., Nuerk, H.-C., & Fischer, M. H. (2008). On the cognitive link between space and number: A meta-analysis of the SNARC effect. Psychology Science. Wu, H. (2013). ggROC: package for roc curve plot with ggplot2. Retrieved from https://cran.r-project.org/package=ggROC Wynn, K. (1992). Addition and subtraction by human infants. Nature, 358(6389), 749–750. Wynn, K., Bloom, P., & Chiang, W.-C. (2002). Enumeration of collective entities by 5-month-old infants. Cognition, 83(3), B55–B62.