Investigating Content Multidimensionality in a Large-scale Science Assessment: A Mixed Methods Approach

by
Cassandra N. Malcom

A Dissertation accepted and approved in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Quantitative Research Methods in Education

Dissertation Committee:
Dr. Kathleen Scalise, Chair and Advisor
Dr. Dianna Carrizales-Engelmann, Core Member
Dr. George Harrison, Core Member
Dr. Joanna Goode, Core Member
Dr. Beth Harn, Institutional Representative

University of Oregon
Spring 2024

© 2024 Cassandra N. Malcom
This work is licensed under a Creative Commons CC BY-NC 4.0 license.

DISSERTATION ABSTRACT

Cassandra N. Malcom
Doctor of Philosophy in Quantitative Research Methods in Education
Title: Investigating Content Multidimensionality in a Large-scale Science Assessment: A Mixed Methods Approach

Science, Technology, Engineering, and Math (STEM) skills are increasingly required of students to be successful in higher education and the workforce. Therefore, modeling assessment outcomes accurately, often using more types of student data to get a complete picture of student learning, is increasingly relevant. The Programme for International Student Assessment (PISA) is promoted as a summative assessment opportunity that includes a science framework. As with many science assessments, the framework includes Life, Physical, and Earth science, which alone seems to imply multidimensionality; other sources of dimensionality also appear to be described conceptually in the framework. Using data from the 2015 PISA science assessment, a multidimensional item response theory (MIRT) model was fit to examine how a multidimensional model operates with the data. Before the MIRT model was developed, the framework was qualitatively reviewed for multidimensionality, and exploratory analyses of the quantitative data were conducted, including a data science technique for exploring multidimensionality and several factor analysis techniques. After fitting the MIRT model, it was compared to several unidimensional IRT (UIRT) models to determine the model that explains the most variation. The qualitative analyses generated evidence of multidimensional science content domains in the 2015 PISA science framework, which should require a MIRT model, but the quantitative analyses indicated that a unidimensional model was more practically significant. Once the quantitative results were triangulated with the qualitative review of the framework for multidimensionality, the implications for equity and the history of harm with regard to science assessments were discussed. Findings from the qualitative and quantitative aspects of the study were used to generate recommendations for different stakeholders.

Keywords: multidimensionality, item response theory, STEM education, summative assessment, large-scale assessment, qualitative framework review

CURRICULUM VITAE

NAME OF AUTHOR: Cassandra N.
Malcom GRADUATE AND UNDERGRADUATE SCHOOLS ATTENDED: University of Oregon, Eugene Southwest Texas State University, San Marcos DEGREES AWARDED: Doctor of Philosophy, Quantitative Research Methods in Education, (all but dissertation expected to be completed in 2024), University of Oregon Master of Science, Biology, 2003, Southwest Texas State University Bachelor of Science, Marine Biology, 2001, Southwest Texas State University AREAS OF SPECIAL INTEREST: Science Education Measurement and Assessment Collaboration and Inquiry in Science PROFESSIONAL EXPERIENCE: Graduate Instructional and Research Assistant, University of Oregon College of Education, 2020-Present Science Content Writer/Reviewer Independent Contractor, Hurix Digital, 2022 Science Coordinator and Assessment Specialist III, Educational Testing Service (ETS), 2008-2021 Science Department Chair and Teacher, Robert G. Cole High School, 2004-2008 Science Teacher Intern, Nancy Ney Charter School, 2003-2004 Science Adventure Club Teacher, Witte Museum, 2002 Graduate Instructional and Research Assistant, Southwest Texas State (SWT) University 6 Biology Dept., 2001-2003 Undergraduate Instructional Assistant, SWT University Biology Dept., 2001 GRANTS, AWARDS, AND HONORS: Spot Award, ETS, 2011-2013, 2017, and 2019 President’s Award, ETS, 2012 Academic Excellence Award, SWT University Biology Dept., 2003 Ruth Strandman Field Biology Scholarship, 2000 Houston Livestock and Rodeo Scholarship, 1996-1998 Dean’s List, SWT University, 1996-1998 Ford Scholarship, 1996 National Dean’s List, 1996 Girl Scout Gold Award, 1995 PUBLICATIONS: Scalise, K., Malcom, C., & Kaylor, E. (2023). Chapter 8: Analysing and integrating new sources of data reliably in innovative assessments. In N. Foster & M. Piacentini (Eds.), Innovating assessments to measure and support complex skills (pp. 138-150). OECD Publishing. https://doi.org/10.1787/e5f3e341-en Scalise, K., Malcom, C., & Kaylor, E. (2023). Chapter 13: A tale of two worlds: Machine learning approaches at the intersection with educational measurement. In N. Foster & M. Piacentini (Eds.), Innovating assessments to measure and support complex skills (pp. 216-224). OECD Publishing. https://doi.org/10.1787/d01eb8a4-en Malcom, C. & ETS Data, Analysis, and Reporting (DAR) Group (2020, December). National assessment of educational progress (NAEP) science 2019 operational assessment data [Conference presentation]. National Center for Education Statistics (NCES) and the National Assessment Governing Board (NAGB) NAEP IDQC, Princeton, NJ, United States. 7 Lavalli, K. L., Malcom, C. N., & Goldstein, J. S. (2018). Description of pereiopod setae of scyllarid lobsters, Scyllarides aequinoctialis, Scyllarides latus, and Scyllarides nodifer, with observations on the feeding during consumption of bivalves and gastropods. Bulletin of Marine Science, 94(3), 571-601. https://doi.org/10.5343/bms.2017.1125 California Department of Education (CDE) & Malcom, C. (2016, December). California science tests (CAST) and the California alternate assessment (CAA) for science [Conference presentation]. California Educational Research Association (CERA), Sacramento, CA, United States. Malcom, C. N. (2007, September). Description of the setae on the pereiopods of the Mediterranean slipper lobster, Scyllarides latus [Poster presentation]. 8th International Conference and Workshop on Lobster Biology and Management, Charlottetown, Canada. Malcom, C. N. (2003). 
Setae on slipper lobster pereiopods [Scanning electron microscope photographs and drawings]. In Lavalli, K. L., Spanier, E., & Grasso, F., Behavior and Sensory Biology of Slipper Lobsters (pp. 144, 165-167). CRC Press, 2007.

Malcom, C. N. (2003). Description of the setae on the pereiopods of the Mediterranean slipper lobster Scyllarides latus, the ridged slipper lobster, S. nodifer, and the Spanish slipper lobster S. aequinoctialis [Master's thesis, Southwest Texas State University]. https://digital.library.txstate.edu/handle/10877/11901

Malcom, C. N. (2002). Description of the setae on the pereiopods of the Mediterranean slipper lobster, Scyllarides latus, the ridged slipper lobster, S. nodifer, and the Spanish slipper lobster, S. aequinoctialis [Poster presentation]. 8th Colloquium Crustacea Decapoda Mediterranea, Ionian University, Corfu Island, Greece.

Malcom, C. N. (2002). Description of the setae on the pereiopods of the Mediterranean slipper lobster, Scyllarides latus, the ridged slipper lobster, S. nodifer, and the Spanish slipper lobster, S. aequinoctialis [Poster presentation]. Marine Benthic Ecology Meeting, United States.

ACKNOWLEDGMENTS

I wish to express sincere gratitude to my chair and advisor, Dr. Kathleen Scalise, for her invitation to start this educational journey. Her mentorship has guided me throughout and helped me grow as a researcher. Best wishes to her as she starts her next chapter! To my dissertation committee members, Dianna Carrizales-Engelmann, Joanna Goode, Beth Harn, and George Harrison: thank you all for being willing to serve, especially since this process was completed on a tight timeline and for a student based in another state. Your feedback has been invaluable and has strengthened my writing. For the opportunity to continue my learning remotely, I thank the University of Oregon's College of Education. To my gracious copy editors, Dr. Linda A. Malcom and Dr. Angelina Galvez-Kiser, many thanks for poring over the fine details.

Last, I'd like to acknowledge my positionality in writing this dissertation. I do so in counterpoint to the voices that say this type of statement "does no work" in a quantitative research study. For me personally, a positionality statement provides a lens into how a researcher views their world and all its data – it may even uncover biases of which the researcher is unaware. The positionality statement gives voice to underrepresented researchers and allows them to claim how they want to be recognized when too often labels are forced upon them. My positionality is such: This researcher identifies as a liberal, white, cisgender female whose formative education occurred primarily in Texas. As the daughter of working-class parents, she was raised to place a high value on STEAM education and reading in order to better oneself and achieve dreams. Her love of science led her to studying and teaching science and biases her views on data in that she believes science can help answer any question.

DEDICATION

This dissertation is dedicated to my mother, who walked a long, hard road to make sure I got here. Without her love, support, friendship, and her own dissertation journey, I might not have seen what all is possible. And to my cousin, whose faith in me convinced me that there was never any doubt about this journey's conclusion.

TABLE OF CONTENTS

Section Page

DISSERTATION ABSTRACT ...............................................................................................................
3 CURRICULUM VITAE ....................................................................................................................... 5 ACKNOWLEDGMENTS .................................................................................................................... 8 DEDICATION ................................................................................................................................. 10 LIST OF FIGURES ........................................................................................................................... 15 LIST OF TABLES ............................................................................................................................. 17 LIST OF EQUATIONS ...................................................................................................................... 18 LIST OF ABBREVIATIONS ............................................................................................................... 19 CHAPTER 1. INTRODUCTION AND LITERATURE SYNTHESIS .......................................................... 21 Problem Statement ................................................................................................................... 21 STEM Education and U.S. Economy .......................................................................................... 22 Integrating Science ................................................................................................................... 25 History of Harm to Learning Equity by Science Assessments ................................................... 30 Overview of PISA ....................................................................................................................... 35 Historical Background ........................................................................................................... 36 Assessment Cycle .................................................................................................................. 36 2015 Science Framework ...................................................................................................... 37 2015 Assessment Design ....................................................................................................... 42 12 2015 Science Scoring ............................................................................................................. 46 Three MIRT Case Studies .......................................................................................................... 47 Yen and Leah (2007) - MIRT Model for Composite Scores .................................................... 47 Scalise and Clarke-Midura (2018) - The Many Faces of Scientific Inquiry ............................. 49 Li et al. (2012) - Applying MIRT Models in Validating Test Dimensionality ........................... 51 Research Questions .................................................................................................................. 53 Research Question 1 (RQ1) ................................................................................................... 53 Research Question 2 (RQ2) ................................................................................................... 54 Research Question 3 (RQ3) ................................................................................................... 54 CHAPTER 2. METHODS ................................................................................................................. 
56 Developing the Literature Synthesis ......................................................................................... 56 Setting ....................................................................................................................................... 57 Student Demographics ............................................................................................................. 58 Data Collection .......................................................................................................................... 60 Study Sample ............................................................................................................................ 61 Data Analysis – A Mixed Methods Approach ............................................................................ 62 Epistemology ......................................................................................................................... 63 Purpose and Guidelines ........................................................................................................ 65 Step 1: Qualitative Analysis ................................................................................................... 69 Step 2: Quantitative Analysis ................................................................................................ 75 13 Data Triangulation ................................................................................................................. 89 Step 3: Equity Investigation ................................................................................................... 91 CHAPTER 3: RESULTS .................................................................................................................... 92 Results Relating to RQ1 ............................................................................................................. 92 Results Relating to RQ2 ............................................................................................................. 96 Descriptive Statistics ............................................................................................................. 97 RQ2A: Cluster Analyses Results ........................................................................................... 102 RQ2B: PCA Results ............................................................................................................... 103 RQ2C: IRT Results ................................................................................................................ 111 Triangulation ....................................................................................................................... 121 Results Relating to RQ3 ........................................................................................................... 123 CHAPTER 4: DISCUSSION ............................................................................................................ 125 Study Overview ....................................................................................................................... 125 Key Takeaways ........................................................................................................................ 126 A Lack of Synergy Between Results ........................................................................................ 126 Overview of Released Item Set ........................................................................................... 
129 Alternate Sources of Multidimensionality .......................................................................... 135 Impact On Equity ................................................................................................................. 136 Limitations .............................................................................................................................. 137 Threats to Validity and Reliability ........................................................................................... 138 14 Future Research ...................................................................................................................... 139 Policy Recommendations ........................................................................................................ 141 Conclusions ............................................................................................................................. 142 REFERENCES ............................................................................................................................... 144 APPENDIX A: STUDENT ENROLLMENT IN SCIENCE COURSES BY ETHNICITY .............................. 161 APPENDIX B: 2015 PISA AVERAGE SCORES FOR SCIENCE ........................................................... 163 APPENDIX C: 2015 PISA AVERAGE SCORES BY SCIENCE SUBDOMAIN ........................................ 165 APPENDIX D: LITERATURE CONNECTIONS .................................................................................. 169 APPENDIX E: LITERATURE REVIEW MATRIX ................................................................................ 170 APPENDIX F: PISA 2015 SCIENCE FRAMEWORK ......................................................................... 187 APPENDIX G: DISSERTATION TIMELINE ...................................................................................... 217 15 LIST OF FIGURES Figure Page 1. STEM Job Predictions for 2031 ............................................................................................... 23 2. 2019 Science Course Enrollment ............................................................................................ 24 3. Relationships among the Four Aspects .................................................................................. 39 4. Released 2015 PISA Science Item ........................................................................................... 41 5. Comparison of PBA and CBA Assessment Designs ................................................................. 45 6. Comparison of Models ........................................................................................................... 52 7. PISA Science Performance by Country ................................................................................... 58 8. From Science Framework Review to MIRT Model Development ........................................... 73 9. ICCs Based on a Three-parameter Logistic (3PL) Model ......................................................... 76 10. Triangulation for Mixed Methods Research ........................................................................... 90 11. Possible Connections between 2015 PISA Science Content Knowledge ................................ 95 12. Proposed Continuum .............................................................................................................. 96 13. Histogram of Student Average Scores for Full U.S. Science Sample ....................................... 98 14. 
Histogram of Student Average Scores for Item Cluster S10 Full Subsample .......................... 99 15. Histograms of Student Score Point Frequency for Item Cluster S10 Full Subsample ............. 99 16. Distance Heatmap for Item Cluster S10 Full Subsample ...................................................... 100 17. Scree Plot for Item Cluster S10 with Full Subsample ............................................................ 102 18. Scree Plot for Item Cluster S10 with Random Half of Subsample ........................................ 103 19. Scree Plot for Item Cluster S11 ............................................................................................. 103 20. Loadings Bar Plots for Item Cluster S10 with Full Subsample ............................................... 105 16 21. PCA Plot for Item Cluster S10 with Full Subsample .............................................................. 106 22. Loadings Bar Plots for Item Cluster S10 with Random Half of Subsample ........................... 107 23. Confirmation PCA Plot for Item Cluster S10 with Random Half of Subsample ..................... 108 24. Loadings Bar Plots for Item Cluster S11 ................................................................................ 109 25. PCA Plot for Item Cluster S11 ............................................................................................... 110 26. Infit Statistics for 1 PL UIRT Model of Item Cluster S10 with Full Subsample ....................... 113 27. ICC Plots for Item Cluster S10 with Full Subsample .............................................................. 117 28. Wright Map for Item Cluster S10 with Full Subsample ........................................................ 121 29. Triangulation of Results ........................................................................................................ 122 30. Histograms of Student Ability Levels for Item Cluster S10 ................................................... 123 31. Bird Migration Item 1 from Item Cluster S11 ....................................................................... 130 32. Bird Migration Item 2 from Item Cluster S11 ....................................................................... 132 33. Bird Migration Item 3 from Item Cluster S11 ....................................................................... 134 34. Differences in Between-item and Within-item MIRT Models .............................................. 141 35. U.S. High School Physics Enrollment .................................................................................... 161 36. U.S. High School Biology Enrollment .................................................................................... 162 37. U.S. High School Chemistry Enrollment ................................................................................ 162 38. U.S. Mean Scores for Science Stable Over Time ................................................................... 164 39. Connections to the 2012 Li Article ....................................................................................... 169 17 LIST OF TABLES Table Page 1. Three Science Subdomains in 2015 PISA ................................................................................ 40 2. Country Demographic Comparisons ....................................................................................... 59 3. Defining Purpose of Mixed Method Approach ....................................................................... 66 4. 
Guidelines for Mixed Methods Research ............................................................................... 68 5. Trade-offs Between Calibration Methods for a Unidimensional Score .................................. 84 6. Evidence Supporting Dimensionality Themes ........................................................................ 93 7. Descriptive Statistics for Item Cluster S10 Full Subsample ..................................................... 97 8. Means (M), Standard Deviations (SD), and Correlations with Confidence Intervals (CI) for Item Cluster S10’s Full Subsample ........................................................................................ 101 9. Item Groupings for MIRT Models ......................................................................................... 111 10. Model Fit Indices for Comparison of Relative Model Fit – Item Cluster S10 Subsample ..... 113 11. Comparison of Model Fit – Item Cluster S10 Subsample ..................................................... 114 12. Model Fit Indices for Comparison of Relative Model Fit – Item Cluster S11 Subsample ..... 116 13. 2015 PISA Country Rankings by Average Score in Science ................................................... 163 14. 2015 PISA Country Rankings by Average Score in Science Subdomain ................................ 165 15. Results of Literature Review ................................................................................................. 170 18 LIST OF EQUATIONS Equation Page 1. 1PL UIRT .................................................................................................................................. 87 2. 2PL UIRT .................................................................................................................................. 87 3. 1PL MIRT ................................................................................................................................. 88 4. 2PL MIRT ................................................................................................................................. 
88

LIST OF ABBREVIATIONS

Term (Abbreviation): Definition1

Akaike Information Criterion (AIC): An estimate of model fit based on the number of model parameters and log-likelihood that favors more complex models and smaller sample size; the smaller AIC indicates better fitting model when two models are compared.

Bayesian Information Criterion (BIC): An estimate of model fit that favors simpler models by adding a penalty for more parameters; the smaller BIC indicates better fitting model when two models are compared.

Civil Rights Data Collection (CRDC): NA

Classical Test Theory (CTT): States an observed test score is the sum of a student's true score and random error.

Computer Adaptive Testing (CAT): An online test designed to increase or decrease test difficulty based on a student's ability as shown during the test by providing an easier or harder item dependent on the score of the last item given.

Computer-based Assessment (CBA): An assessment designed to be delivered and administered via a computer or tablet.

Degrees of Freedom (df): Number of independent variables that are free to vary in a data sample.

Diversity, Equity, and Inclusion (DEI): NA

Expected A Posteriori (EAP): An estimate of the predicted value for the latent trait posterior probability distribution.

Exploratory Factor Analysis (EFA): Identifies the latent traits then builds a linear model of the variables.

Gross Domestic Product (GDP): NA

Item Characteristic Curve (ICC): A graph of a probability of a correct response versus a student's ability.

Item Information-weighted Fit (Infit): "Information" refers to the variance of the observations.

Item Response Theory (IRT): Explains the relationship between a latent trait and an observable outcome.

Local Item Dependence (LID): The items in an assessment may be related in that an answer on one item indicates a higher chance of answering other items in a similar manner, even when conditioned on proficiency estimate.

Markov Chain Monte Carlo (MCMC): Generates a random sample of a target distribution with a large number of dimensions where each MCMC sample is dependent on the prior MCMC sample; can estimate the sum as either the mean or variance of drawn samples.

1 Statistical terms are provided with definitions that are summarized by the researcher. Terms that have no applicable statistical definition are marked NA.
Maximum Likelihood Estimation (MLE): Finding the parameter values that give a curve best fitting the data.

Maximum Marginal Likelihood Estimation (MMLE): Estimates the parameters that are most likely of the expected probability distribution based on observed data.

Multidimensional IRT (MIRT): Can model an assessment measuring multiple traits.

National Assessment of Educational Progress (NAEP): NA

National Center for Education Statistics (NCES): NA

Next Generation Science Standards (NGSS): NA

Office for Civil Rights (OCR): NA

One-parameter Model (1PL): Simplest model that describes the latent trait (i.e., ability) based only on the difficulty parameter.

Organization for Economic Co-operation and Development (OECD): NA

Paper-based Assessment (PBA): An assessment designed to be delivered and administered via a paper form.

Principal Component Analysis (PCA): The number of observed variables is reduced to a decreased number of principal components, which account for the most variance of the observed variables.

Programme for International Student Assessment (PISA): NA

Root Mean Square Error (RMSE): Is the root of the mean of the squared errors between observed and predicted values; measures the error of a model when predicting quantitative data.

Root Mean Square of the Residuals (RMSR): Is the square root of the mean of squared residuals and measures badness-of-fit for a model (0 indicating perfect model fit).

Science, Technology, Engineering, and Math (STEM): Refers to education containing these subjects, or a required set of skills needed for the workforce. Sometimes the Arts are included, in which case the acronym becomes STEAM, which is out of scope in this dissertation.

Standard Deviation (SD): A measurement of the spread of data in relation to the mean of the population that indicates variability.

Standard Error (SE): An inferential statistic that indicates the reliability of a sample population mean compared to the actual population mean.

Three-parameter Model (3PL): Model that describes three parameters (difficulty, discrimination, and guessing) in relation to the latent trait.

Two-parameter Model (2PL): Model that describes two parameters (difficulty and discrimination) in relation to the latent trait.

Unidimensional IRT (UIRT): Models an assessment measuring a single latent trait.

United States (U.S.): NA

CHAPTER 1. INTRODUCTION AND LITERATURE SYNTHESIS

Problem Statement

Three science subdomains (Life, Physical, and Earth and Space) that are commonly assessed in large-scale assessments are qualitatively describable as multiple dimensions based on teaching pedagogy and student ability. For a construct, in this instance science, to be considered multidimensional, its dimensions2 (e.g., the science subdomains) should be different from each other yet connected to the theorized construct – see section Defining Dimensionality (Ch. 2) for further information (Polites et al., 2012). However, data from national and global large-scale science assessments are typically quantitatively modeled with unidimensional item response theory (or UIRT) models rather than multidimensional IRT (or MIRT) models. IRT is a method of examining the relationship between something intangible, such as science ability, and how that latent trait manifests in, for example, scores on a set of science items – see section Primer on Item Response Theory (Ch. 2) for a more detailed description. Some researchers advocate that we should more accurately model student data from large-scale science assessments by using MIRT models.
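To preview this distinction before the Primer on Item Response Theory in Chapter 2, the expressions below show one standard way the unidimensional and multidimensional two-parameter logistic (2PL) models listed in the List of Equations are often written; they are illustrative standard forms, and the exact parameterization adopted in Chapter 2 may differ.

$$P(X_{ij}=1 \mid \theta_j) = \frac{\exp\left[a_i(\theta_j - b_i)\right]}{1 + \exp\left[a_i(\theta_j - b_i)\right]} \quad \text{(2PL UIRT)}$$

$$P(X_{ij}=1 \mid \boldsymbol{\theta}_j) = \frac{\exp\left(\mathbf{a}_i^{\top}\boldsymbol{\theta}_j + d_i\right)}{1 + \exp\left(\mathbf{a}_i^{\top}\boldsymbol{\theta}_j + d_i\right)} \quad \text{(compensatory 2PL MIRT)}$$

Here $\theta_j$ is student $j$'s single latent science ability, $a_i$ and $b_i$ are item $i$'s discrimination and difficulty, $\boldsymbol{\theta}_j$ is a vector of abilities (for example, one per science subdomain), $\mathbf{a}_i$ is a vector of discriminations indicating which dimensions item $i$ draws on, and $d_i$ is an item intercept. The 1PL versions constrain the discriminations to a common value.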
The disconnect between how large- scale assessments report data, such as on science subscales3, while the data is modeled unidimensionally, could impact policy decisions, instruction, and other aspects of the education experience. IRT models can allow teachers and students to consider and discuss how student ability is impacted by item parameters, such as difficulty (Uesaka et al., 2022), but only if the appropriate IRT models are used. This problem will be explored here using a mixed methods 2 At least two are needed to be multidimensional. 3 For 2015 PISA subscales were based on the science subdomains of life, physical, and Earth and Space systems – see Appendix C. 22 approach, where qualitative (document exploration) techniques are used to better understand the conceptual framework claims in a framework for which the United States (U.S.) sample of the quantitative data set is next explored quantitatively for some aspects of dimensionality. STEM Education and U.S. Economy Understanding the current status of science education with regards to the U.S. economy will help illustrate why there is a desire for Science, Technology, Engineering, and Math (STEM)4 assessment research, such as the more accurate modeling noted above. Citizen science and the need in life for understanding and interpreting personal STEM contexts is very important for decision making in modern society. For instance being able to have enough STEM knowledge to understand and make decisions regarding implications for vaccination, masking, and other precautions in the recent COVID-19 pandemic would have helped citizens, especially in early days of the pandemic given the absence of a fuller understanding. This is one example, of which there are many others, of how decision making by individuals in society may interact with their base STEM understanding. Students’ cognitive skills and general knowledge can impact a nation’s economy (Hanushek et al., 2008). A highly skilled workforce helps nations be competitive in the global marketplace (Hanushek et al., 2008). With regards to the economies of many nations, there is a long-standing link between more STEM education and greater productivity and creativity from the workforce in generating solutions, whether they be environmental, technological, or industrial (Hanushek et al., 2008). Over the years, this link has led to increased government interest in STEM education and in the status of the STEM workforce (Kelley & Knowles, 2016). 4 Sometimes including Arts for STEAM 23 For example, in 2021, the U.S. government spent approximately 3.9 billion on STEM education (Lips & Moritz, 2023). By 2031, the U.S. Department of Labor predicts the STEM workforce will grow by almost 11%, which is more than two times faster than the predicted growth rate of other occupations, see Figure 1 (Krutsch & Roderick, 2022). Figure 1 STEM Job Predictions for 2031 Note. These predictions do not include careers related to just the Arts, as in STEAM where Arts are added to STEM. From “STEM day: Explore growing careers,” by E. Krutsch and V. Roderick, 2022, U.S. Department of Labor Blog (https://blog.dol.gov/2022/11/04/stem-day-explore-growing-careers). Copyright 2022, U.S. Department of Labor. Generation of these new STEM jobs will lead to increased need for workers with STEM skills. This is especially true since the anticipated STEM jobs are expected to be enticing for students as they are also predicted to pay more on average (Krutsch & Roderick, 2022). 
In order to have enough STEM skilled workers, STEM education frameworks, associated teaching pedagogy, and 24 related assessments are being evaluated in the hopes that improvements in these areas will lead to more students taking STEM classes then entering a STEM career field. Instead of more students entering STEM tracks in higher education or careers, students in the U.S. are currently falling behind other countries in the mastery of science skills (Hanushek et al., 2008). Under 40% of high school students in public and private schools took a biology, chemistry, and physics course in 2019 (National Center for Education Statistics [NCES], 2022). The majority of these students took at least one biology course, with chemistry as the second course students took most frequently, and physics plus Earth sciences lagged far behind in student enrollment – see Figure 2 (NCES, 2022). Figure 2 2019 Science Course Enrollment Note. Adapted from “High school mathematics and science course completion,” 2022, National Center for Education Statistics: Condition of Education (https://nces.ed.gov/programs/coe/indicator/sod/high-school- courses). Copyright 2022 by the U.S. Department of Education, Institute of Education Sciences. 25 Evidence from the National Assessment of Educational Progress (NAEP) science scores shows students in grades 8 and 12 well below 50% proficiency (Stehle & Peters-Burton, 2019). The Programme for International Student Assessment (PISA) science scores rank the U.S. 25th with a mean score of 496, compared to the Organization for Economic Co-operation and Development (OECD) overall mean of 493, out of 72 participating economies and countries (Organisation for Economic Co-operation and Development [OECD], 2018). Due to this lag, both STEM educators and educational researchers are rethinking how STEM skills are taught and evaluated. Integrating Science Science educational pedagogy used to be focused on the retention of facts (Kaldaras et al., 2021; Pierson et al., 2019; Enger & Yager, 2009), such as memorizing all the parts of a cell. Newer pedagogy focuses on integrating science content and scientific inquiry (Kaldaras et al., 2021; Pellegrino & Hilton, 2013; Enger & Yager, 2009), along with soft skills like creative thinking (Csapó & Funke, 2017). This new pedagogy focus has led to the development of the Next Generation Science Standards (NGSS) in 20135, which advocate for crosscutting concepts, practices, and disciplinary core ideas to be taught, in a type of “sensemaking” effort. Crosscutting concepts in particular focus on applying knowledge across different science subdomains6 (NGSS Lead States, 2013), and on drawing on these concepts to explore new STEM ideas as they arise in a student’s life. This new focus could appear to call for science classes with integrated content, but the realization of that type of curriculum has been hard to achieve. 5 The short timeframe between the NGSS being published, adopted by states, and then incorporated into classroom curriculum and the 2015 administration of the PISA left little room for any impacts on student learning to transfer to the 2015 science portion of the PISA. 6 Scientific inquiry skills are found in the NGSS science and engineering practices and are also considered a separate dimension of learning that can apply across each science subdomain. 26 Policy and/or teacher preparation often leaves the coursework in separate “silos”. 
Assessments also need to change from measurement of constructs only requiring recall to this integrated content that requires students to apply knowledge and make sense of it (Kaldaras et al., 2021). While there was a movement to combine the science subdomains into an integrated curriculum in the 1970s (Welch, 1977) and calls for this approach continue, most U.S. schools do not employ this method and the subdomains remain taught separately with little crossover in science content (Winarno et al., 2020). Winarno et al. (2020) provide several reasons why integration of the subdomains has remained difficult, including: • Educators are often trained in only one of the science subdomains, • Less professional development and college-level training is available for educators, • High schools and state policy are designed around the idea that integrated curriculum in STEM does not yet strongly support higher education goals, since higher education courses often do not draw on integrated frameworks, • Learning still tends to be lecture orientated when labs provide more connections for students, and • Limited availability of integrated science textbooks. In addition, the following factors, some from personal experience, also impact integrating science subdomain instruction. The cost of implementation of integrated science courses is too high for most school districts to develop educator training on how to integrate, plus allow the district to purchase new lab equipment, supplies, and textbooks. A small suburban high school where I taught rejected a new integrated science course as the funds to restructure the science lab with the needed 27 equipment were quite high compared to the current cost of offering three subdomains and a few elective courses. Developing a new curriculum that functionally integrates the 3 science domains can be time restrictive. Integrated science curriculum should not simply be, for example, a restructuring of a physics course to contain life science examples, but rather a deeper dive into learning how both physics and biology play a role together in a shared construct, like muscle movement. This would require a new way of approaching curriculum that provides opportunities for students to dive deep into how physics can help the understanding of the biology of muscles working with bones to create movement in a living organism. Considerable pre-planning, teacher coordination, and community feedback would need to occur for such a course to be successfully delivered. Teacher resistance to the curriculum change, which can be due to the untested nature of how beneficial the new curriculum may be to student learning within the new learning environment, or a tendency to cling to what already works well. All educators have felt this way at some point or another – a dissatisfaction with being asked to change what works well for our students to the newest educational fad without enough teacher buy-in earlier in the process of curriculum redesign. While working on a NGSS-focused assessment redesign for a state, I was able to hear some of the frustrations teachers felt regarding how the three-dimensional learning required by NGSS was to be successfully implemented, especially for students with special needs. Parent and student concern that integrated courses do not offer enough depth of subdomain content to prepare students for college. 
For example, a student may be more interested in 28 becoming a cell biologist and feel that less time will be devoted to cells in a course that has to cover major concepts from life, physics, chemistry, and Earth sciences. The more intricate details of cell biology may be missed unless the student takes additional courses, such as an advanced biology course. See Johnson’s (2019) EdSource article7 for insight into the parent and student concerns over integrated science courses at a California high school. In a similar vein, policymaker and public resistance due to the belief that some science subdomains, when integrated, will lose teaching and learning minutes (opportunity to learn issues) since available time will likely be divided up between several science subdomains. Instead of a student taking three different science courses spread out over three years, they may end up taking two integrated science courses spread out over two years. Students can often choose to take an elective science course, but counseling, the degree of informed choice, and access to such courses often varies by socioeconomics and can include substantial racial, ethnic, and gender disparities (Gao et al., 2019). Assessments may not accurately show trends in data from the non-integrated year to the integrated year, so student growth can be harder to map. Many teachers develop their own assessments and unless a school requires a standardized formative assessment at the end of each science course the reasons for changes in student performance may be unknowable for many individuals. Finally, there is often misuse of integrated science courses as low-level science classes. In the past, I have heard school counselors refer to integrated science courses as a “dumping ground” for students who are not doing well in math and so cannot manage physics 7 Located here: https://edsource.org/2019/how-one-high-schools-dispute-reflects-the-struggle-to-teach- californias-science-standards/618752 29 or chemistry, have less potential/desire for pursuing a science career, or who have special education needs. This filtering of students is obviously not the intended use of integrated science education and advocates for integrated science instruction will tell you that a better vision of an integrated science course is to prepare all students for dealing with science in their daily lives (Otarigho & Oruese, 2013). All of the above restraints have led to integrated science curriculum not being fully applied in high schools throughout the U.S. Since high schools still offer distinct classes for each science subdomain, one might then expect large-scale science assessment data that is generated from an assessment of these subdomains to be modeled using MIRT, whereas UIRT is vastly employed as the standard.8 By using an UIRT model the assessment designers may be impacting the assessment’s consequential validity by assuming all students have the similar access to similar course content taught in a similar way. As educators struggle to mirror the three-dimensional aspects of NGSS, so too do large- scale assessments like PISA wrestle with the challenges of modelling quantitative data. MIRT models will only be appropriate if what is anticipated by frameworks to be extensively multidimensional data actually shows such patterns. Alternatively, explaining why such data sets do not show such patterns if frameworks anticipate them is theoretically also a challenge. 
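Because the appropriateness of a MIRT model ultimately rests on whether it fits the observed response patterns better than a UIRT model, relative fit indices such as AIC and BIC (defined in the List of Abbreviations) are commonly compared across the competing models. The sketch below is a minimal illustration of that comparison; the log-likelihoods, parameter counts, and sample size are hypothetical placeholders rather than results from this study, and the dissertation's actual model comparisons follow the procedures described in Chapter 2.

```python
import math

def aic(log_likelihood: float, n_params: int) -> float:
    # Akaike Information Criterion: smaller values indicate better relative fit.
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    # Bayesian Information Criterion: adds a sample-size-dependent penalty,
    # so it favors simpler models more strongly than AIC does.
    return n_params * math.log(n_obs) - 2 * log_likelihood

# Hypothetical values for a UIRT and a MIRT model fit to the same item cluster.
n_students = 1500                                        # placeholder sample size
fits = {
    "UIRT (2PL)": {"loglik": -10450.0, "n_params": 36},  # e.g., 18 items x (a, b)
    "MIRT (2PL)": {"loglik": -10410.0, "n_params": 57},  # extra loadings and covariances
}

for name, f in fits.items():
    print(f"{name}: AIC = {aic(f['loglik'], f['n_params']):.1f}, "
          f"BIC = {bic(f['loglik'], f['n_params'], n_students):.1f}")
```

With these placeholder numbers the two indices disagree: AIC, which penalizes complexity less, slightly favors the MIRT model, while BIC favors the UIRT model. Disagreements of this kind are one reason model comparisons typically report more than one fit index.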
Educators and governments are often calling for the need to be able to assess these skills with the need to use more advanced modeling if necessary to evaluate the student data from those assessments, such as MIRT or hybrid IRT models. This study aims to first explore a framework 8 While the goal of this study is not to determine which method of course delivery, e.g., integrated or nonintegrated, is most appropriate, the main method of course delivery in U.S high schools as nonintegrated subdomains of science may indicate multidimensionality. 30 for multidimensional claims, then use some data analytic and MIRT models to explore some 2015 PISA science scores from U.S. students to determine whether models can help showcase some of the implied multidimensionality. This might include how the three subdomains of the 2015 PISA science framework, i.e., physical, life, and Earth and Space systems9, are distinct dimensions of student learning with supporting evidence from a qualitative analysis of the science subdomains in the 2015 PISA framework, or other aspects of multidimensionality. If a MIRT model substantially improves fit for some portions of the 2015 PISA science data better than unidimensional IRT models this could indicate that some of the latent trait skills are quite different from each other with regards to student ability. Not using a MIRT model when multidimensionality exists could potentially cause harm to students individually or in aggregated groups through invalid inferences being made about their ability in each subdomain (Spencer, 2004). In addition, harm could be caused to educational institutions trying to make policy decisions based on an incorrect understanding of student outcomes. These policies can be especially impactful if such decisions marginalize students of color. History of Harm to Learning Equity by Science Assessments Potential social impacts from how assessments are used is at the core of consequential validity (Iliescu & Greiff, 2021). Messick (1993, p. 5) defines consequential validity as an aspect of construct validity, which “appraises the value implications of score interpretation as a basis for action as well as the actual and potential consequences of test use, especially in regard to 9 For those familiar with NGSS, these three subdomains fall into the NGSS dimension of core ideas. The NGSS dimensions of crosscutting concepts and science and engineering practices, such as inquiry, are outside the scope of this study. My use of the term “dimension” throughout this dissertation is meant to refer to aspects of multidimensionality in the PISA international framework such as science subdomains or other components in the framework, and not specifically to NGSS concepts, which tend to be U.S. embedded. 31 sources of test invalidity related to issues of bias, fairness, and distributive justice.” When analyzing assessments for validity in general, Iliescu and Greiff (2021) advocate for validity to be applied to inferences, or claims made, rather than the instrument itself in order to focus on consequential validity. They further argue that more research is needed into the “social consequences of testing” as the effects of assessments on specific subpopulations and society in general directly draw a line to educational diversity, equity, and inclusion (DEI). 
If the opportunity to learn hinges on the learning tools10 available to a student, then any learning tool inequity can impact the validity of inferences made from an assessment (American Educational Research Association [AERA], 2014). Furthermore, the scores from that assessment may create additional ripples of inequity if student placement and future opportunities to learn hinge on those scores. We need to be able to redesign assessments to act as tools of equity rather than furthering existing inequities in education by assessing students on material they have not had the opportunity to learn. The Center for Professional Education of Teachers (CPET) (n.d.) suggests assessment practices can become more equitable if we: • “Ensure our assessments align with what we actually teach • Formatively assess students on a regular basis • Differentiate assessment products whenever possible • Offer a variety of ways to demonstrate mastery • Be flexible (but not too flexible), and offer time to make up assessments • Create relevant, engaging assessment methods 10 Learning tools encompasses (but is not limited to) curriculum, technology, books, instructional aids, and lab equipment. 32 • Make assessments rigorous, not rote • Develop and maintain a growth mindset • Emphasize effort and progress, not grades • Acknowledge and cultivate students' strengths and talents” Not all of these suggestions apply to PISA and the intended use of its scores – “to evaluate education systems worldwide” in order to determine “how well students, at the end of compulsory education, can apply their knowledge to real-life situations and can therefore fully participate in society” (OECD, n.d.-a). For example, formative assessments align more with school responsibilities. However, the outcomes from some suggestions could be impacted by PISA. Due to the global nature of PISA, OECD cannot guarantee that it is aligned with what is taught in the schools of every country (OECD, n.d.-a), yet could OECD make its assessments more culturally responsive? With regards to differentiating assessments, PISA did not provide an adaptive11 test that differentiates based on student ability for science in 2015 (OECD, n.d.-b), but is making progress on including more adaptivity (multi-stage) in later cycles in various content areas (OECD, n.d.-a). Taken together, substantial deficits in assessment equity could impact interpretation of assessments results even if the assessment’s intended use is at a more aggregated level. OECD (2016a) defines equity in education as calling for “opportunities to acquire these [science] skills12 should be independent of students’ backgrounds.” PISA is used to analyze education equity through several lenses by: 1) examining “variation in the distribution of 11 OECD does currently have an adaptive test for reading and math (OECD, n.d.-b). 12 Students “have a basic understanding of science that will help them become informed citizens in a world shaped by science and technological progress. (OECD, 2016a).” 33 student outcomes, especially whether students acquire a baseline level of skills, as a way to assess the inclusiveness of school systems”, 2) determining “impact of students’ backgrounds on their outcomes at school”, and 3) exploring if “access to educational resources and the incidence of sorting practices varies between students of different backgrounds as a way to identify some of the factors that mediate their association with performance. (OECD, 2016a).” For the U.S. 
11.4% of the variation in science performance by students can be linked to their socio-economic status (OECD, 2016a). Language considerations mostly are outside the scope of this investigation, but there is substantial infrastructure in PISA within and across countries for multiple language support and translation, mostly addressed by national considerations regarding their country’s students (OECD, 2017b). In OECD’s (2016a) report on Country Note: Key Findings from PISA 2015 for the United States, the performance and equity of the U.S. educational system is compared with those same aspects of other countries13 that OECD has identified as showing high or improving levels. The following educational and equity conclusions (OECD, 2016a) were drawn: • “The United States is a wealthy country.” • “The United States spends a large amount on education.” • “There is more variation in socio-economic status14 in the United States than in the other four [comparison] countries.” 13 The four comparison countries identified by OECD are Canada, Estonia, Germany, and Hong Kong (China), which are shown in Table 2. 14 OECD defined socio-economic status (SES) for the 2015 PISA by an index they developed from economic, social, and cultural status variables related to the family background of students (OECD, 2016a, p. 27). The variables included: parental education level, parental occupation, amount of wealthy possessions, and amount of books owned (OECD, 2016a, p. 27). The score is a composite determined from principal component analysis (PCA) and can be compared to other nations (each is weighted equally) since it is standardized to a mean of zero with a standard deviation of 1 (OECD, 2016a, p. 27). 34 • “The United States occupies an intermediate position in terms of the percentage of socio-economically disadvantaged students.” • “A large but not extreme15 percentage of students in the United States have an immigrant background.” • “The United States is a large and complex country.” Another area where the U.S. experiences educational complexity is in the type of and access to science course offerings. Student enrollment in courses covering different science subdomains can vary by student ethnicity. Appendix A provides an overview of student enrollment percentages in U.S. high school science courses (biology, chemistry, and physics) by student ethnicity. Regardless of the science subdomain being taught the majority of students enrolled are White – see Figures 35-37. Therefore, it is important to consider if the inferences made based on PISA science scores can be accurately applied to other subpopulations of students. This effect could be further enlarged if the model used to determine how student ability relates to item parameters is inaccurately modeling the subdomains of science. If the U.S. bases educational policy changes or educational system reform on aspects of PISA ranking (rather than on needs of subpopulations) resulting from a model mismatch there could be unintended consequences for how PISA science data is used to inform economy, policy, and science education. An assessment can be evaluated both in terms of its use and its interpretive inferences, which can occur with probing the model for evidence of its accuracy (Messick, 1989). 15 While OECD does not define what an extreme percentage would be, OECD does clarify that 23% of U.S. students are immigrants and that there are only 5 other OECD-member countries with a greater percent of student immigrants (OECD, 2016a). 
35 For example, if policy makers determine that U.S. students are lagging behind other countries that are global competitors with regards to physical systems and Earth and space systems scores (see Appendix C) and earmark more funding for science education in these areas only, then the inference they are making on how to use PISA scores could be inherently flawed. Singer and Braun (2018, p. 39) describe this “mix of nationalism, fears about global competitiveness, and human nature” as leading to “unitary ‘silver bullet’ solutions based on highly aggregated data.” This leads to an ecological fallacy, which is making inferences at an individual level, whether it be on a student’s learning, a school/district’s performance, or a state’s educational requirements, based on population level data and can generate inaccurate conclusions. While OECD (2018) does clarify that PISA results should not be used to make inferences about individual students, it often makes inferences about school policy. As a country, the U.S. needs to clarify use of these inferences to educators, researchers, and most importantly to the media, which often discusses PISA scores incorrectly. OECD could also further clarify how survey data on school policy should be used per the AERA (2014, p. 23) standard for validity 1.3: “If validity for some common or likely interpretation for a given use has not been evaluated…that fact should be made clear and potential users should be strongly cautioned about making unsupported interpretations.” Overview of PISA The OECD develops, administers, scores, and provides data for the PISA (OECD, n.d.-a). Results from PISA are used to rank countries by their students’ mean domain score (OECD, n.d.- a). As mentioned earlier, these rankings can provide a measure for a nation’s economy and are important to educational policy (Pokropek et al., 2022). 36 Historical Background PISA was first administered by the OECD to students in 2000 (OECD, n.d.-a). The international program continues today, and countries can elect16 to participate and receive information on 15-year-old students about their learning in several content areas (OECD, n.d.- a). The science assessment has been computer-based since 2015 (OECD, n.d.-b) for most countries (Jerrim et al., 2018). Students do not all receive the same items due to the assessment design (OECD, n.d.-b). In the first cycle of PISA in 2000, 43 countries participated, including the U.S. (OECD, n.d.-a). Original participants included 29 countries that are members of OECD and 14 non- member countries (OECD, n.d.-a). PISA has since grown to include more than 90 countries and economies worldwide while serving around 3 million students (OECD, n.d.-b). The assessment is influential for policy development in some countries due to the country education ranking that OECD provides (OECD, n.d.-a). and has been linked to other national assessments in several countries, such as in the U.S. (OECD, n.d.-a). Assessment Cycle The “major” assessment domains found in PISA are traditionally math, reading, and science applied to “everyday activities,” which usually rotate as the primary content in three- year cycles17 (OECD, n.d.-a). For instance, science, as currently defined by OECD (see section 2015 Science Framework below), last served as the major domain in 2015, followed by reading 16 Student participation is also voluntary in the U.S. with an offer of a certificate of 4 hours service from the U.S. Department of Education (PISA USA, 2015). 
Students are chosen randomly by OECD from a list of all eligible students that is provided by each U.S. school that elects to participate with a goal of 42 students per participating school (PISA USA, 2015). 17 Science therefore rotates as a major domain to be the focus of the assessment every nine years (OECD, 2017b). 37 in 2018 and math in 2021 (OECD, n.d.-a). COVID-19 has thrown off some of the administration dates such that science will next be administered as a major domain in PISA in 2025. Some content from the major domains usually reappears as a “minor” domain in PISA in each cycle to help extend trends in non-major years, along with a new innovative domain like collaborative problem solving chosen each cycle since 2012, and sometimes optional domains are delivered in some cycles, such as financial or digital literacy (OECD, n.d.-a). The degree of information available in non-major years, and the rotation of the innovative domains, have become somewhat problematic for some countries since they need information more often and for other reasons (OECD, n.d.-a). 2015 Science Framework18 A content framework like the 2015 PISA science framework drives the discussion around what educators worldwide might think should be taught for students to become proficient in science. There is a committee with science education experts from around the world that reviews the framework, as well as feedback from each participating country for each version of the framework. For the 2015 framework, the focus was on scientific literacy. Scientific literacy is defined as “the ability to engage with science-related issues, and with the ideas of science, as a reflective citizen” (OECD, 2017a). OECD (2017a) claims that a comprehensive list of “all the ideas and theories that might be considered fundamental for a scientifically literate individual” has not yet been made. However, three competencies identified by OECD include: “explain phenomena scientifically”, “evaluate and design scientific inquiry”, and “interpret data and evidence scientifically” to provide evidence for scientific literacy (OECD, 2017a). 18 See Appendix F for the full 2015 PISA science framework. 38 The framework also tries to describe what knowledge is assessable. The 2015 PISA science framework states that knowledge will be assessed if it meets the following criteria: • “has relevance to real-life situations, • represents an important scientific concept or major explanatory theory that has enduring utility, and • is appropriate to the developmental level of 15-year-olds (OECD, 2017a).” Content knowledge will comprise more than half of the assessment according to the developers (OECD, 2017a). The subdomains of science fall squarely into the framework’s knowledge aspect that includes the following elements: “content”, “procedural”, and “epistemic” (OECD, 2017a). Content knowledge includes “facts, concepts, ideas, and theories about the natural world that science has established” (OECD, 2017a) that are historically learned in the classroom by teacher explanation and student exploration of a construct. Putting this altogether is sometimes called “sense making” in STEM. Procedural knowledge is what scientists follow to develop evidence supporting scientific knowledge via “practices and concepts on which empirical enquiry is based such as repeating measurements to minimize error and reduce uncertainty, the control of variables, and standard procedures for representing and communicating data” (OECD, 2017a). 
Sometimes this is considered to be “scientific practice,” although defining specific practices of interest can widen and narrow this framing. Epistemic knowledge derives from “understanding science as a practice, which refers to an understanding of the role of specific constructs and defining features essential to the process of knowledge building in science” and “includes an understanding of the function that questions, observations, theories, hypotheses, models, and arguments play in science; a recognition of the variety of forms of scientific inquiry; and the role peer review plays in establishing knowledge that can be trusted” (OECD, 2017a). Sometimes this is considered to include cross-cutting concepts, although defining specific concepts of interest can widen and narrow this framing.

In addition to knowledge, the PISA framework identifies three other aspects on which it bases its view of scientific literacy: contexts, competencies, and attitudes. OECD (2017a) clarifies that the PISA science assessment “is not an assessment of contexts.” Instead, the knowledge and competencies are assessed “in specific contexts” (OECD, 2017a) so that students must make sense of their STEM thinking in the applied context. Aspects and elements described above are provided in Figure 3. Attitudes are thought of in PISA as possibly interacting with dispositions to learn or to use competencies and knowledge and may be addressed in the extensive PISA questionnaires, depending on the cycle and the selection of questionnaire material; they are mentioned here for clarity but are outside the scope of this research study.

Figure 3
Relationships among the Four Aspects

Note. Adapted from “PISA 2015 Assessment and analytical framework: science, reading, mathematic, financial literacy and collaborative problem solving, revised edition,” 2017, OECD Publishing (http://dx.doi.org/10.1787/9789264281820-en). Copyright 2022 by OECD.

As shown in Table 1, the science content knowledge is classified into three subdomains for this framework: life, physical19, and Earth and space systems (OECD, 2017a).

Table 1
Three Science Subdomains in 2015 PISA
(Subdomain; Code; Knowledge of the Content of Science)

Physical Systems (PS)
PS1 Structure of matter (e.g. particle model, bonds)
PS2 Properties of matter (e.g. changes of state, thermal and electrical conductivity)
PS3 Chemical changes of matter (e.g. chemical reactions, energy transfer, acids/bases)
PS4 Motion and forces (e.g. velocity, friction) and action at a distance (e.g. magnetic, gravitational and electrostatic forces)
PS5 Energy and its transformation (e.g. conservation, dissipation, chemical reactions)
PS6 Interactions between energy and matter (e.g. light and radio waves, sound and seismic waves)

Living Systems (LS)
LS1 Cells (e.g. structures and function, DNA, plant and animal)
LS2 The concept of an organism (e.g. unicellular and multicellular)
LS3 Humans (e.g. health, nutrition, subsystems such as digestion, respiration, circulation, excretion, reproduction and their relationship)
LS4 Populations (e.g. species, evolution, biodiversity, genetic variation)
LS5 Ecosystems (e.g. food chains, matter and energy flow)
LS6 Biosphere (e.g. ecosystem services, sustainability)

Earth and Space Systems (ESS)
ESS1 Structures of the Earth systems (e.g. lithosphere, atmosphere, hydrosphere)
ESS2 Energy in the Earth systems (e.g. sources, global climate)
ESS3 Change in Earth systems (e.g. plate tectonics, geochemical cycles, constructive and destructive forces)
ESS4 Earth’s history (e.g. fossils, origin and evolution)
ESS5 Earth in space (e.g. gravity, solar systems, galaxies)
ESS6 The history and scale of the universe (e.g. light year, Big Bang theory)

Note. Adapted from “PISA 2015 assessment and analytical framework: science, reading, mathematic, financial literacy and collaborative problem solving, revised edition,” 2017, OECD Publishing (http://dx.doi.org/10.1787/9789264281820-en). Copyright 2022 by OECD.
19 The physical systems subdomain seems to encompass both physics and chemistry constructs.

The 2015 PISA science framework further clarifies that in the assessment approximately 36% of the items will be physical, 36% living, and 28% Earth and space systems (OECD, 2017a). An example of a released science item with a focus on life science is provided in Figure 4.

Figure 4
Released 2015 PISA Science Item20
20 The correct response needed to reference “a flower cannot produce seed without pollination” (OECD, n.d.-c).

Note. From “PISA 2015 Assessment and analytical framework: science, reading, mathematic, financial literacy and collaborative problem solving, revised edition,” 2017, OECD Publishing (http://dx.doi.org/10.1787/9789264281820-en). Copyright 2022 by OECD.

Items were delivered in the following contexts: “health, natural resources, the environment, hazards, and the frontiers of science and technology” (OECD, n.d.-c). The contexts were set in “personal, local/national, and global settings” (OECD, n.d.-c). Items were developed to meet one of the following three cognitive demands21:
1. “Low - Carry out a one-step procedure, for example recall of a fact, term, principle or concept, or locate a single point of information from a graph or table.
2. Medium – Use and apply conceptual knowledge to describe or explain phenomena, select appropriate procedures involving two or more steps, organize/display data, interpret or use simple data sets or graphs.
3. High - Analyze complex information or data, synthesize or evaluate evidence, justify, reason given various sources, develop a plan or sequence of steps to approach a problem” (OECD, n.d.-c).
OECD (n.d.-c) defined theoretical difficulty for an item as “a combination both of the degree of complexity and range of knowledge it requires and the cognitive operations that are required to process the item.”
21 These definitions were newly added to the 2015 PISA framework for the scientific literacy domain (OECD, n.d.-c).

2015 Assessment Design22
22 The complex design of test forms, item types, and sampling techniques used to calculate a country’s rank are outside the scope of this study. Information provided in this section is intended to orient the reader to the higher-level details of the design of the science assessment. Complete details regarding assessment design are available in the PISA 2015 technical report (OECD, 2017b).

In terms of the educational theory behind PISA, there is one unifying goal. PISA is intended to be:
A collaborative effort among OECD member countries to measure how well 15-year-old students approaching the end of compulsory schooling are prepared to meet the challenges of today’s knowledge societies. The assessment is forward-looking: rather than focusing on the extent to which these students have mastered a specific school curriculum, it looks at their ability to use their knowledge and skills to meet real-life challenges.
This orientation reflects a change in curricular goals and objectives, focusing more on what students can do with what they learn at school. (OECD, 2017b, p. 22)

A “real-life challenge” faced by an individual might be one of the COVID-19 examples given previously. Such challenges would most likely include integrating content knowledge, since we rarely find successful scientific solutions in a vacuum. For example, for those in STEM careers, biologists rarely describe cellular processes without integrating organic chemistry knowledge. Physics is often used to describe the motion of animals and how planets align. However, as noted earlier, the integrated course model is not how science education is designed – rare is the course that incorporates both life and physical science in a K-12 setting. Neither are PISA or many other large-scale science assessments designed to be integrated, with items assessing multiple science subdomains at once. Instead, the science assessment design includes items that are targeted to specific subdomains; for example, the item in Figure 4 (Bee Colony Collapse Disorder, Question23 1) is coded to knowledge – system: content – living science (OECD, 2017a). A lack of integrated content in items may negate some of the “forward-looking” aspects of PISA, although a fuller theoretical investigation of this topic is outside the scope of this dissertation.
23 Question # refers to a specific item in the order it appeared in a cluster.

The cognitive assessment (see Figure 5) included an additional domain of collaborative problem solving in 2015 (OECD, 2017b). Some of the countries took a paper-based assessment (PBA) version while others took a computer-based assessment (CBA) (OECD, 2017b). The CBA version of PISA was considered a field trial and included both trend and new items, while the PBA version only had trend items (OECD, 2017b). Items had several response formats: click on a choice, numeric entry, text entry, select from a drop-down menu, and drag and drop; items were dispersed among multiple test forms (OECD, 2017b). There was no common form with a common set of items that all students received (OECD, 2017b). Forms were randomly assigned (OECD, 2017b), with no linking information released. Within forms, the science items were organized into clusters (OECD, 2017a). There were 36 random science cluster combinations possible across the 66 CBA forms (OECD, 2017b). There was also a unique form, referred to as the Une Heure (UH) form, for students with special needs. This form included “easier items in each domain” with “a more limited reading load” and 50% of the items assessing science (OECD, 2017b, p. 42). Approximately 61 physical, 74 life, and 49 Earth and space items were developed and chosen for the assessment, a total of 184 items equal to about 6 hours of test questions; only 85 were trend items while 99 were new (Mostafa et al., 2018). However, the actual test was about 2 hours per student. As shown in Figure 5, students have 2 hours to take the cognitive portion of the assessment, which includes the major domain of science and the minor domains of reading, mathematics, and collaborative problem solving, and then they are offered a questionnaire24 with a shorter innovative assessment on financial literacy at the end (OECD, 2017b). The major domain, science for 2015, takes an hour of the time allotted to both the PBA and CBA cognitive assessment to complete (OECD, 2016b).
Teachers had an optional short questionnaire that could be taken after the student questionnaire (OECD, 2017b). All elements of the assessment were offered in different languages according to PISA specifications to accommodate some language aspects of the participant’s setting. Figure 5 Comparison of PBA25 and CBA26 Assessment Designs Note. From “PISA 2015 technical report,” 2017, OECD Publishing (http://dx.doi.org/10.1787/9789264281820-en), p. 36. Copyright 2017 by OECD. 24 OECD often refers to the cognitive assessment as a survey while referring to the student and teacher qualitative surveys as questionnaires in both the framework and technical report. 25 While PBA countries were offered the optional financial literacy assessment none took it (OECD, 2017b). 26 ICT refers to Information and Computer Technology Literacy Familiarity (ICT) questionnaires. 46 2015 Science Scoring While items are described in the framework, these do not generate individual scores per student that are visible to the public (OECD, n.d.-b). Per OECD (2018), nearly 540,000 students globally finished the science assessment. There is not a theoretical minimum or maximum score for each of these students (OECD, n.d.-b). In this study’s data, however, item level responses for students within the U.S. sample were used. In the reported data27 at the country level, scaling occurs with IRT and then is transformed for scores around normal distributions (OECD, n.d.-b). Means for OECD country participants are approximately 500 score points with a standard deviation (SD) of 100 score points (OECD, n.d.-b). Countries are then ranked according to the mean score (OECD, n.d.-b). Most students score between 400 and 600 points (OECD, n.d.-b). Having mentioned earlier that PISA scores can be indicators of a country’s education status and economy, it’s important to consider that not all educators and researchers believe that PISA scores are good indicators of these factors (Strauss, 2019). Some educators feel that there is not a one-size fits all assessment and that assessment should be more inquiry-based where students actually show what they can do, which is only partially assessed on the PISA assessment from 2015. In addition, the scores reflect students’ performance in different countries. OECD attempts to take into account cultures and policies in those countries but has not been able to fully resolve that these differences promote different knowledge, skills, and abilities (Strauss, 2019). This is partly what the OECD assessments are intended to consider, but in a large-scale assessment it is difficult to separate issues like cultural relevance from policy 27 See Appendix B for science mean scores by country and Appendix C for science mean subscale scores by country. 47 variations that the countries feel might be important to detect. To better understand scores within the context of the science subdomains, a qualitative analysis of student level data, which is not possible given the nature of the PISA data and lack of access to all science items, might complement a quantitative analysis of 2015 PISA science scores. Instead, I am trying to identify conceptual multidimensionality via conceptual analysis of the framework with document analysis, followed by quantitative analysis of the student-level data set from the instrument representing that framework, to see if empirical evidence of patterns of quantitative multidimensionality exists. 
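Before turning to prior MIRT studies, a brief illustration of the reporting metric described above may be useful. The sketch below uses hypothetical values only; PISA's operational scaling relies on plausible values, conditioning, and survey weights rather than a simple rescaling of point estimates. It shows how ability estimates on a standard-normal IRT scale map onto a scale with a mean near 500 and a standard deviation near 100.

```r
# Hypothetical illustration only: PISA's operational scaling uses plausible
# values and survey weights, not a simple linear rescaling of point estimates.
set.seed(2015)
theta <- rnorm(1000)              # simulated ability estimates, mean 0 and SD 1
reported <- 500 + 100 * theta     # linear transformation to a PISA-like reporting metric
round(c(mean = mean(reported), sd = sd(reported)))
```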
Three MIRT Case Studies The following studies exemplify the current state of MIRT model usage in assessment research and are similar to what was implemented in the methodology chapter of this study. Two of the studies provide original research addressing the use of MIRT in science assessments with only one being a large-scale case, while the third case study focuses on a large-scale English language proficiency assessment. All three are united in their use of a methodology to validate the use of MIRT models by comparing fit to other IRT models and by a theory that their educational constructs are expected to be multidimensional. Yen and Leah (2007) - MIRT Model for Composite Scores Similar to the PISA 2015 reading framework, the knowledge content of English language proficiency assessments is often multidimensional. The subdomains for this English language proficiency assessment were identified as speaking, listening, reading, and writing, but the researchers were interested in mainly the speaking and listening constructs. The speaking construct was further separated into four subsequent proficiencies while the listening construct 48 had three subsequent proficiencies. These proficiencies were considered as sub-subdomains in the analysis. Twenty items each were administered for the speaking and listening subdomains. Half of the speaking items were scored polytomously, but the rest were dichotomous, and all of the listening items were dichotomous. Students took the speaking portion of the test in about 10 minutes and the listening portion in about 15 minutes. The assessment itself was considered large-scale because it was given to all eligible K-12 students in an unidentified location that was state, or country sized. However, the researchers pulled a sample of 12,008 student responses from only elementary school students. Only full sets of responses were analyzed as some students did not complete all of the items. The sample included slightly less female students than male students while one unidentified ethnic group dominated the sample. The unidimensional IRT models that were analyzed included the 3PL and the two- parameter partial credit model. Calibrated together, both multiple choice and constructed response items were placed on the related subdomain scale. Marginal maximum likelihood was used to estimate the parameters at the same time for both item types. Importantly, the speaking and listening items underwent distinct calibrations, but an additional calibration was performed on these subtests together to build an “oral” composite scale. Next an exploratory factor analysis was conducted to determine if the sub-subdomain proficiencies were actually multidimensional and to identify the number of dimensions that were present. Using the BMIRT computer program, four different MIRT models were then applied. One of the models [Model Type 1] assumed that speaking and listening subdomains are combined to measure one latent trait with several sub-subdomains while the other three 49 models [Model Type 2] assume that the speaking and listening subdomains measure distinct latent traits, each with their own set of sub-subdomains. The researchers pointed out that MIRT models can have parameter estimation issues due to the number of parameters, but the Markov Chain Monte Carlo (MCMC) method used by BMIRT helped alleviate this. 
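To make this kind of unidimensional-versus-multidimensional comparison concrete, the sketch below simulates dichotomous responses, fits a one-trait and a two-trait model, and compares them, anticipating the fit comparison described next. This is a minimal illustration using the R mirt package with hypothetical item parameters and simulated data; it is not the BMIRT program, the 3PL/partial-credit calibration, or the data used by Yen and Leah (2007).

```r
# Minimal sketch with hypothetical parameters, assuming the R 'mirt' package;
# not the authors' BMIRT analysis or data.
library(mirt)
set.seed(1)
# 20 dichotomous items: the first 10 load on trait 1, the last 10 on trait 2.
a <- cbind(c(rep(1.2, 10), rep(0, 10)), c(rep(0, 10), rep(1.2, 10)))
d <- matrix(rnorm(20))                                   # item intercepts
resp <- simdata(a = a, d = d, N = 1000, itemtype = "dich")
uni <- mirt(resp, model = 1, itemtype = "2PL", verbose = FALSE)  # unidimensional 2PL
two <- mirt(resp, model = 2, itemtype = "2PL", verbose = FALSE)  # exploratory two-dimensional 2PL
anova(uni, two)   # reports AIC/BIC and a chi-square (likelihood-ratio) difference test
```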
Population parameters were fixed to a normal distribution with a mean of 0 and standard deviation (SD) of 1 for Model Type 1 and to a multivariate normal distribution with means of (0,0) for Model Type 2. Each model went through 10,000 iterations. Each model’s fit was compared using the Akaike Information Criterion (AIC) and a chi-square difference test. The Type 2 MIRT models had the better fit, and the researchers concluded that multidimensionality existed in the assessment. The MIRT models proved successful even though assessments of fewer than 100 items can have trouble with the discrimination parameter. Scalise and Clarke-Midura (2018) also had success in applying a MIRT model to assessment scores based on a multidimensional framework.

Scalise and Clarke-Midura (2018) - The Many Faces of Scientific Inquiry

The NGSS is a multidimensional framework used, with some adaptations, by over 30 states in the U.S. The researchers examined whether an online/virtual performance task, aligned with the Framework for K-12 Science Education (National Research Council [NRC], 2012), the College Board Standards for College Success (College Board, 2009), and the NGSS inquiry practices and science content knowledge, delivered evidence on more than one latent trait using a MIRT-Bayes model. The nature of scientific inquiry done by students is described as complex and often has to be less constrained in order to measure the full reasoning of students, which suggests that when performing inquiry students may use multiple abilities.

Items, both polytomous and dichotomous (multiple-choice), were developed by Harvard University learning scientists working in STEM innovations for two dimensions, inquiry and explaining. A sample of 1,986 student responses was collected for 23 items. Less than 1% of students had missing data. Information on the gender and ethnicity of the student sample was not provided. The entire assessment took students nearly 40 minutes to complete, and process data on student actions were also collected during this time. To model the data, the researchers used an exploratory study design that included both unidimensional and multidimensional IRT models. The MIRT models included a 2-dimensional partial credit model and a hybrid MIRT-Bayes model. For the 2-dimensional MIRT model, the difficulty parameter was estimated freely once the means of the latent variables were set to zero. In the case of the hybrid MIRT model, a Bayes net was constructed first and then the MIRT model was applied. Bayes nets can help structure semi-amorphous data onto the constructs being investigated. Upon analysis, the unidimensional model did not fit the data as well, based on a significant deviance difference, while item fit was acceptable for the 2-dimensional MIRT model. In addition, the two latent traits of inquiry and explaining were only moderately correlated, indicating student performance might be varied enough to justify the use of MIRT based on the conceptual framework, although an even larger sample could help show this more definitively. Both dimensions had expected a posteriori (EAP) reliability coefficients greater than 0.85. The researchers elaborated that theoretical support from this framework was validated by content experts, who designed the task to have items that would assess each ability/skill independently. In counterpoint to the Yen and Leah (2007) study, no items were found to load negatively, which would indicate negative discrimination. The hybrid MIRT model appeared to fit even better, with higher reliability estimates.
Wright maps and standard error of measurement plots helped to visualize the data. Similarly, Li et al. (2012) explored how MIRT models can accurately depict data from items that cover several science domains.

Li et al. (2012) - Applying MIRT Models in Validating Test Dimensionality

Assessment dimensionality should be explored at the beginning of development to provide validity for the inferences made with regard to student ability/latent traits. Development of items is done with alignment to different anticipated dimensions so that each item targets one or more constructs. The researchers define the dimensionality of an assessment “as the number of traits that must be considered to achieve weak local independence between the items” (Li et al., 2012, p. 3). The covariance between items should “approach zero as test length increases” to indicate weak local independence (Li et al., 2012, p. 3). The validation of dimensionality with regard to the learning domain or subdomain often occurs during the field test.

A random sample of 5,677 grade 5 students who had taken the 2008 Michigan state science assessment was selected in order to validate the dimensionality of four science subdomains: science processes, life, Earth, and physical sciences. A total of 45 science items were taken by students, with no timing information given. Item type was not provided by the researchers, and it is also unclear if the items were field tested or operational. The gender and ethnicity of the students also were not reported. To evaluate the dimensionality of the student data, the researchers selected three different IRT models: a unidimensional model, a “simple-structure” MIRT model, and a testlet model. The testlet model treats the different subdomains as testlet residual dimensions in addition to the dominant dimension of general science ability. Figure 6 provides a general structure for each model that was evaluated.

Figure 6
Comparison of Models

Note. From “Applying multidimensional item response theory models in validating test dimensionality: An example of K–12 large-scale science assessment,” by Y. Li, H. Jiao, and R. W. Lissitz, 2012, Journal of Applied Testing Technology, 13(2), p. 9. Copyright 2012 by Journal of Applied Testing Technology.

Before applying the three models, a principal component analysis (PCA) and an exploratory linear factor analysis (EFA) were conducted. In the PCA, the researchers retained components with eigenvalues greater than 1, which indicates that those components each account for more variance than the average item. A scree plot helped the researchers determine that at least two components did exist. Next, the EFA confirmed that two factors existed, and based on the reduction in root mean square error (RMSE) values, either a four- or five-factor solution might be possible to adopt. A four-factor solution was used, as the researchers felt that the statistical analysis should be compared against the theoretical dimensionality used in the assessment design.

The MIRT model was found to fit the data significantly better than the unidimensional model. AIC and the Bayesian information criterion (BIC) were then used to compare fit between the MIRT model and the testlet model. The MIRT model was found to be the best fitting since it produced smaller information criteria, indicating better model fit to the data. The conclusion reached was that if several abilities are measured, then MIRT models should be used. Li et al.
(2012) pointed out that at the time of their study no prior study on validating multidimensionality was located comparing the fit of the three models for a large-scale K-12 science assessment. The study outlined in Chapter 2 is similar in methodology to the above 3 studies with a focus on multidimensionality of science assessments. There continues to be a lack of research on large-scale science assessments and their use of MIRT models, and few that also include other data analytic techniques from emerging data sciences such as shown here. This study will include analysis of data from the 2015 PISA large-scale science assessment. By modeling data from PISA’s 2015 science framework28 and assessment, the following research questions are expected to be answered. Research Questions Research Question 1 (RQ1) What evidence can be drawn from the 2015 PISA science framework to qualitatively support whether multiple dimensions are being described regarding student knowledge in science in the framework? 28 See Appendix F. 54 Research Question 2 (RQ2) Do quantitative indications of multidimensionality exist in the 2015 PISA science data? • R2A: Does a data science cluster analysis29 applied to the PISA 2015 data (U.S. sample at the item level) suggest multidimensionality in the student response data set? Regardless of dimensionality, how do the items cluster in the analysis? • R2B: Does principal components analysis30 (PCA) applied to the same data indicate multidimensionality? Does residual analysis employing standard tolerances used in the industry indicate multidimensionality that needs to be modeled in the U.S. sample? How do items cluster in the analysis, and how does this compare to R2A? • R2C: By how much (practical significance) does a multidimensional IRT model applied to the same data indicate improved fit over less dimensionally complex models? Is fit statistically improved based on chi-square comparisons of nested models fitted for dimensionality? How does item clustering compare in R2A/R2B? Research Question 3 (RQ3) Depending on covariates available in the U. S. sample data set or linked data sets or overall population reports, what can be said about subgroup analysis in this data set and about aspects of classroom instruction relative to clustering patterns found in R2A-C? For instance, do multidimensional models yield results that showcase students have different levels of science 29 A cluster analysis is a method for sorting items into groups based on their statistical relationship to one another. For example, if items are closely related because a student must have physic knowledge to answer them then all physics items may cluster together. 30 PCA was chosen since it is a method of reducing the dimensionality of a large dataset into principal components (or dimensions) that retain as much variation as possible (Mailman School of Public Health, 2023). While EFA was considered, its goal is to show the correlation between variables is partially due to common latent variables (Mailman School of Public Health, 2023), which is not the dimensionality predicted for the science subdomains. 55 proficiency on different dimensions identified? Relative to demographic data, if available, what can be said about history of harm and employing or ignoring dimensionality in science data such as these? Whether contrasting or similar, what can be said about the findings viewed from both lenses? 56 CHAPTER 2. 
METHODS Declaration of Interest: The author worked at one time but not now for Educational Testing Service (ETS), one of the vendors supporting some PISA efforts, and has also been the lead vendor of 5 companies developing, delivering, and analyzing NAEP in the U.S., for all domains including for NAEP Science. Developing the Literature Synthesis A literature review helped determine the state of research on multidimensionality in large-scale science assessments. The majority of searches were in Google Scholar and the University of Oregon library databases but were not limited to only peer-reviewed journals. The main search used the following key phrases and words: “multidimensional” + “IRT” + large-scale + “science” + “assessment”31,32. For the purposes of this literature review, the following definitions were used: • Multidimensional IRT – a model estimating student ability containing more than one latent trait • Large-scale Science Assessment – assessments measuring the science ability of a large proportion of the student population occurring at either the U.S. state or national level, at the country level for areas outside of the U.S., or at a global level Nearly 7,560 results were generated in Google Scholar using the combined search phrase. In each iteration of searching, the Li et al. (2012) article was the top hit with very few 31 Quotation marks were included to indicate for which terms the Google Scholar search platform should return articles with exact matches. 32 Grade level and science subdomain were not used as eliminators. 57 other results matching the specific target. See Appendix D for Figure 39, which provides an overview of literature connected to the Li et al. (2012) study on multidimensional IRT model fit for student data from a large-scale state science assessment. Table 15 in Appendix E provides annotations for resources found during the literature review and used in the literature synthesis. Additionally, the review focused not just on science education, but also on reading assessments that used MIRT to model dimensionality. This addition was made in part due to the lack of studies dealing specifically with multidimensionality of large-scale science assessments, but also because PISA declares reading to be multidimensional in the 2015 framework (OECD, 2017a). Therefore, reading offers somewhat of a mirror into multidimensionality that may be useful through which to view the science framework. Setting PISA is an international CBA and PBA that is available to all OECD countries or affiliates called partners (OECD, 2017b), and countries around the world are invited to partner if they are not already in the OECD. For 2015, the non-gray countries in Figure 7 chose to participate (GEOstata, 2016). The color scale illustrates how these countries ranked by their students’ performance on the science assessment (GEOstata, 2016). The setting focus for this study will be the U.S. Note that in Figure 7 the U.S. is shown ranking in the 470 to 500 mean score range for science (GEOstata, 2016), which is near to but slightly above center. 58 Figure 7 PISA Science Performance by Country U.S. Note. From “PISA 2015 Results – Performance in Science,” 2016, GEOstata. Copyright 2015 by OECD. Countries began testing in April 2015 in the following types of educational settings: educational institutions, vocational training or related educational programs, and foreign schools within a country (OECD, 2017b). 
Due to the range of countries participating in PISA, the student population is very diverse.

Student Demographics

Around 540,000 global students completed the PISA in 2015 (OECD, 2018). Both full-time and part-time students were eligible to participate (OECD, 2017b). Students who were home-schooled or taught in the workplace were not eligible to take the PISA (OECD, 2017b). Table 2 provides a breakdown of students in the U.S. compared to the highest performing and lowest performing countries. This table also includes demographics for the four countries to which the U.S. is compared by OECD (2016a). OECD member countries are highlighted in blue in the source table, and the remaining countries are OECD partners (OECD, 2017b).

Table 2
Country Demographic Comparisons

Country33 | 2015 Student Sample Size (OECD, 2017b) | 2015 Population Size34 (in millions) | Duration of Compulsory Education35 (in years) | 2013 Ethnic Fractionalization36 | Dominant Language/s37,38 | PISA 2015 Mean Science Score39
Singapore | 6,115 | 5.45 | 6 | 38.57% | *Mandarin (35%), *English (23%), *Malay (14.1%), Hokkien (11.4%) (2000 census) | 556
Estonia | 5,587 | 1.3 | 9 | 50.62% | *Estonian 67.3%, Russian 29.7% (2000 census) | 534
Canada | 20,058 | 35.7 | 10 | 71.24% | *English 58.8%, *French 21.6%, Other 19.6% (2006 Census) | 528
Hong Kong (China) | 5,359 | 7.3 | 9 | 6.2% | *Cantonese 90.8%, *English 2.8% (2006 census) | 523
Germany | 6,522 | 81.7 | 13 | 16.82% | German40 | 509
United States | 5,712 (notes 41, 42) | 320.7 | 12 | 49.01% | English (82.1%)43, Spanish (10.7%) (2000 census) | 496 (note 44)
Dominican Republic | 4,740 | 10.4 | 15 † | 42.94% | *Spanish45 | 332

Note. A Harvard study defined fractionalization as the probability that two people randomly selected from a country would be from different ethnic groups (Alesina et al., 2003). † This country’s education requirement changed from 9 to 15 years in 2010; the other six countries have remained steady in their education requirements since the late 1990s through 2022. * Indicates official language/s of that country.
33 Countries, and in the U.S. the schools, elect to participate in PISA (NCES, n.d.-a).
34 From https://datatopics.worldbank.org/world-development-indicators/
35 From https://data.worldbank.org/indicator/SE.COM.DURS
36 From https://worldpopulationreview.com/country-rankings/most-racially-diverse-countries
37 From https://www.languagerc.net/languages-by-countries/
38 Showing only those unofficial languages that are found at 10% or greater within the country’s population.
39 From https://www.oecd.org/pisa/pisa-2015-results-in-focus.pdf
40 No census data provided.
41 NCES (n.d.-a) reported 177 schools participated nationally in the U.S. with a student response rate of 90%.
42 This is the national sample size and does not include the two states (Massachusetts/North Carolina) and one territory (Puerto Rico) sampled.
43 While English is broadly used in the U.S., the U.S. has not designated an official language (USAGov, 2023).
44 The U.S. mean score is not significantly different from the overall OECD mean score of 493 (OECD, 2018).
45 No census data provided.

Even though slightly over half a million global students participated in the 2015 PISA, OECD did not select all of those students in its sample (OECD, 2017b). The U.S. alone had 4,220,325 15-year-olds in 2015, with 3,992,053 actually enrolled in school (OECD, 2017b). A few U.S. schools46 (12,001) chose not to participate in 2015, at an exclusion rate of 0.30% (OECD, 2017b).
46 As of February 2024, there were 20,318 high schools in the U.S. per https://mdreducation.com/how-many-schools-are-in-the-u-s/. Per NCES (n.d.-a), there were 27,144 public and private secondary and high schools in the 2015-2016 school year, but the national U.S. sample consisted of only 240 schools and not all of those participated.

Data Collection

Data were collected during the 2015 PISA administration47, then organized, analyzed, and reported by OECD in 2016-18. Student responses to each PISA form were collected online when students accessed the task through their computer or tablet and in a physical format when students took the assessment via a paper form. In addition to student responses on science content, formal and optional questionnaires collected information on student attitudes toward science, along with student and teacher educational backgrounds. Student in-person interviews, cognitive lab data, focus group data, pilot and field trials were in many cases collected by OECD (2017b) but not released due to policy requirements with the countries.
47 For the U.S., this occurred during October to November (NCES, n.d.-a).

Study Sample

For this study, the full PISA 2015 extant quantitative dataset was narrowed by selecting only the U.S. student population at the national, not state, level that responded to a CBA form48. This study used a science subset from that sample. The science subsample consisted of 5,712 students and 166 dichotomous items. Removal of students who did not respond to any item (i.e., they had NAs49 for all 166 items) dropped the sample to 5,699 students. These 13 dropped students were treated as missing data, a rate of 0.2% for the U.S. CBA science subsample. OECD also performed casewise deletion for missing data (Mostafa et al., 2018). There was no common form taken by all students in 2015 PISA science. Items were spread across various forms as clusters of around 15 items each, with no equating sample or linking information available. Hence, the item cluster S1050 with the highest response rate was chosen for the first analysis, reducing the sample to 1,306 students and 15 items, all of which were new in 2015. As a confirmation of the model fit findings from the first analysis, a second analysis was conducted as a validity study, examining the second largest cluster, S1151. This was identified as consisting of a subsample of 1,274 students and 15 items, all of which were new in 2015. The criterion used here for examining clusters was to select two of the 15-item clusters, one for an initial study and one for a validity study.
48 CBA was chosen as this is the mode of delivery most assessments are moving towards, including PISA. Also, eliminating the PBA forms allowed the subsample to be viewed through one mode lens rather than having possible effects on the data from different delivery modes.
49 NA is the abbreviation for “not applicable”; however, when used in a data file it can mean many things. Note that the R program automatically codes blank/missing cells as NA. Per OECD (2017b), a cell coded NA by R in the PISA data file can mean missing data, an item not reached by a student, a data error, or a skipped item.
50 Cluster S10 as noted in the OECD (2017b) technical report’s annex A had 17 items, but the 2 polytomous items were dropped from this study.
51 Cluster S11 as noted in the OECD (2017b) technical report’s annex A had 16 items, but the 1 polytomous item was dropped from this study.
Given the scope of this work as an unfunded dissertation, the study started with the largest sample size to help facilitate separately fitting the more complex models, and then selected the next largest. Students in these cluster subsamples with some NAs indicating no response to an item during their attempt at answering the cluster had those NAs converted to 0 since they had the opportunity to attempt the item. There were no missing data in either of the science cluster subsamples using these criteria after the elimination of the 0.2% of U.S. responses indicated earlier. Hence, missing data rates were quite low, but they were not entirely eliminated. No imputation technique was undertaken for the 0.2% missing data due to insufficient additional information released. Also, numerous problems of imputation in assessment calibration made this a questionable statistical adjustment even if information had been available. Data Analysis – A Mixed Methods Approach Data analysis was divided into three steps. The first step directly supported RQ1 and involved qualitative data being analyzed, including the main document analysis of the 2015 PISA science framework. Step 2 supported RQ2 and consisted of the quantitative analysis52 of 2015 PISA science student scores along with data triangulation to the Step 1 analysis. The last step supported RQ3 and involved analyzing with an equity lens how the best fitting IRT model might impact equity in regard to student outcomes for subgroups. All three steps are outlined in detail below within the context of a mixed methods approach to data analysis. 52 Focus of this study is on the item level rather than the student level for each quantitative analysis. 63 When qualitative data exists to support, or contradict, quantitative data, a mixed methods design for research is essential. This type of design can meld findings from both methodologies into an effective solution (Johnson & Onwuegbuzie, 2004). For example, large- scale assessments like PISA often make available content frameworks, student and teacher survey responses, and other documentation like item specifications that can be analyzed qualitatively to determine narrative themes such as multidimensionality of science content. This is in addition to the quantitative scoring data, process data, and student demographic data from the assessment itself. Such narrative themes might add to a quantitative model of student science ability. For example, Claesgens et al. (2008, p. 66) had a mixed methods approach of “using IRT, the scores for a set of student responses and the questions are calibrated relative to one another on the same scale and their fit, validity, and reliability are estimated, and matched against the framework”. In order to support such a mixture of data types and analyses, a researcher often needs a methodological design that incorporates different philosophical underpinnings. Epistemology An equivalent status design refers to a theory of research where both qualitative and quantitative epistemologies are valued for understanding constructs (Venkatesh et al., 2016; Johnson & Onwuegbuzie, 2004). This study used two epistemological foundations to pragmatically agree with Baskarada and Koronios (2018) that researchers should select philosophies that work best for the research questions to be addressed and the data to be analyzed. This choice was pragmatic in nature based on the requirements of this study. 
The chosen epistemological foundations were also aligned with Greene and Caracelli (1997), who 64 clarify that if methodological pragmatism requires different philosophies then each can be viewed as “logically independent and therefore can be mixed and matched” in order to achieve methodology that will work well in each stage of analysis. In the end, though, RQ1 and RQ2 were designed to consider triangulation, or in other words whether results tend to support each other or not across the techniques. My epistemological approach to science education is mainly social constructivism. Constructivism is a viewpoint that students build their own learning with their reality built on experiences as a learner and the social aspect implies that learning is collaborative. Social constructivism describes students as dynamically constructing and reconfiguring knowledge while interacting socially (OECD, 2023). This educational theory is grounded in constructivism psychological learning theory, which views knowledge as something a learner must “actively construct” for themselves (McLeod, 2019). During social interactions, such as learning within a classroom, students often construct their knowledge in collaboration with others (Atkisson, 2010). The concept that students learn how to learn by interacting with others is a key foundation of social constructivism (Greenwood, 2020). Western Governors University (2020) describes several principles of constructivism: • “Knowledge is constructed, • people learn to learn as they learn, • learning is an active process, • learning is a social activity, • learning is contextual, • knowledge is personal, and 65 • motivation is key to learning.” Boon et al. (2022) also point out that a constructivist epistemology can suit “practice- orientated” research. I have also chosen an approach that acknowledges a philosophy of scientific realism, or an organized reality (Moroi, 2020), which means that at some level I am acknowledging that there is a reality in STEM and it can be known (although in STEM, most if not all models can be disconfirmed to some degree at a different grain size, so the idea of scientific realism is that the philosophy acknowledges utility of the model at the grain size applied, such as in Atomic Theory and Quantum Physics). Similarly, I believe that we can try to measure concepts both independently of ourselves and in a rational manner, at a level that may provide some utility for decision making. This viewpoint pairs well with my social constructivist stance and as Maxwell and Mittapalli (2010) note these two philosophical paradigms can work well in a mixed methods approach. If multidimensionality is present in the 2015 PISA science framework and student data then it should be discoverable by a mixed methods analysis and can then be modeled based on the science content and student data. Of course, in a broader view, we know that all models fail at some level and do not incorporate all aspects of reality but such models can be useful if they provide gains in our understanding or utility in our context. Purpose and Guidelines Even though most researchers have an idea of the philosophy driving their research they still need to define its purpose. Venkatesh et al. (2013) advise identifying the purpose of mixed methods research early. Identifying the purpose/s will help establish the research goals and 66 later serve to inform reviewers of how to center any findings. They identify the seven purposes shown in Table 3. 
Table 3 Defining Purpose of Mixed Method Approach Purposes Description Illustration Current Study Meets this purpose by triangulating the evidence from Mixed methods are used in order to gain A qualitative study was both methods and statistically complementary views used to gain additional the quantitative model Complementary insights on the findings complements the qualitative about the same from a quantitative review of the science phenomena or study. framework’s multidimensional relationships. design. However, UIRT model was more practical with regards to improvement of model fit. The qualitative data and Mixed methods designs are results provided rich Completeness used to make sure a explanations of the complete picture of a findings from the NA phenomenon is obtained. quantitative data and analysis. Questions for one strand A qualitative study was emerge from the used to develop inferences of a previous constructs and Developmental one (sequential mixed hypotheses and a NA methods), or one strand quantitative study was provides hypotheses to be conducted to test the tested in the next one. hypotheses. The findings from one Meets this purpose by expanding Mixed methods are used in study (e.g., quantitative) prior studies at the state and order to explain or expand were expanded or international levels that focused Expansion upon the understanding elaborated by examining on quantitative model fit; obtained in a previous the findings from a typically, these studies did not strand of a study. different study (e.g., have a qualitative review of the qualitative). dimensionality of the science content within the framework. Meets this purpose by Mixed methods are used in A qualitative study was corroborating other studies’ Corroboration/ order to assess the credibility of inferences conducted to confirm findings of quantitative Confirmation obtained from one the findings from a multidimensionality in science quantitative study. through qualitatively defining approach (strand). that science subdomains are multidimensional. Mixed methods enable The qualitative analysis Compensation compensating for the compensated for the weaknesses of one small sample size in the NA quantitative study. 67 approach by using the other. Qualitative and Did not meet this purpose Mixed methods are used quantitative studies with the hope of obtaining were conducted to because sufficient student Diversity compare perceptions of demographic data was divergent views of the a phenomenon of unavailable for this data set. See same phenomenon. interest by two different research question reclarification types of participants. in last chapter. Note. Adapted from “Bridging the qualitative-quantitative divide: Guidelines for conducting mixed methods research in information systems,” by V. Venkatesh, S. Brown, and H. Bala, 2013, MIS Quarterly, 37(1), p. 26. Copyright 2013 by JSTOR. For this research study 4 purposes were highlighted in green in Table 3 as scaffolding the work being done here: complementary since the PISA science assessment is built off the PISA science framework; expansion since the hope is that the qualitative data from the framework will illuminate the finding (or lack thereof) of multidimensionality in the student scores; corroboration since the qualitative data may or may not concur with any quantitative findings; and diversity because a lack of multidimensionality in the quantitative scores may be influenced by diversity issues. 
In Table 3’s final column are the descriptions of how this study met, or did not meet, each purpose. For example, while outside the scope of this study, a lack of latent diversity, or “diverse student mindsets” that can result from student diversity among other traits (Godwin, 2017, p. 13) could be masking the dimensionality of science subdomains if the student sample is not diverse and the students approach science problems in a similar manner. It is on diversity issues that I differ in opinion from Venkatesh et al. (2013, p. 22) as they state a mixed methods approach should be taken without regard to “cultural incommensurability” as long as it helps 68 the researcher answer their question. If the purpose of mixed methods research hinges in part on diversity, then cultural considerations should be taken into account when developing methodology for the research (Broesch et al., 2020). Once purpose/s are identified the design of and data analysis for mixed methods research needs to be grounded in accepted qualitative and quantitative procedures (Venkatesh et al., 2013). Whether each type of research occurs concurrently or as in this case sequentially, Table 4 provides both general and validation guidelines that can be applied to any mixed methods research (Venkatesh et al., 2013). Table 4 Guidelines for Mixed Methods Research Guideline Researcher Considerations Carefully think about the research questions, objectives, and contexts to decide on the appropriateness of a mixed methods approach for the research. Explication of the broad and specific research objective/s is important to establish the utility of mixed methods research. Carefully select a mixed methods design strategy that is appropriate for the research questions, objectives, and contexts. Develop a strategy for rigorously analyzing mixed methods data. A cursory analysis of qualitative data followed by a rigorous analysis of quantitative data or vice versa is not desirable. Apply the same standard of rigor as typically used in analyzing quantitative and qualitative studies. Integrate inferences from the qualitative and quantitative studies in order to draw meta- inferences from mixed method results. Discuss validation for both quantitative and qualitative studies. When discussing mixed methods validation, use mixed methods research nomenclature consistently. Mixed methods research validation should be assessed on the overall findings and/or meta- inferences from mixed methods research, not from the individual studies. Discuss validation from the point of view of the overall mixed methods design chosen for a study or research inquiry. Discuss potential threats to validity that may arise during data collection and analysis, along with any remedies. Note. Adapted from “Bridging the qualitative-quantitative divide: Guidelines for conducting mixed methods research in information systems,” by V. Venkatesh, S. Brown, and H. Bala, 2013, MIS Quarterly, 37(1), p. 41. Copyright 2013 by JSTOR. Validation General 69 Following these guidelines helped support validity in this mixed methods study and might help transferability to other contexts for inferences made based on the data (Venkatesh et al., 2013), although the limitations described earlier with regard to both linking and the documentation of the data set indicate additional work is needed for greater generalization. 
Step 1: Qualitative Analysis “Comparing and contrasting data is vital to qualitative analysis (Gale et al., 2013).” This is especially true when analyzing a framework, which could be considered a policy53 at the state, national, or global level, and may incorporate multiple sources of information with various degrees of clarification. Since frameworks are documents, this can be done via a qualitative document analysis (Wach et al., 2013; Bowen, 2009). Overview of Document Analysis. For digital and hard copies of documents, an organized procedure is needed for the analysis (Armstrong, 2021). Armstrong (2021) recommends beginning by identifying the objective of the document analysis and describes six common objectives: “defining concepts, mapping range and nature of phenomena, creating typologies54, finding associations, providing explanations, and developing strategies.” Once documents have been selected, it is important to not just “lift” text to be used in the report (Armstrong, 2021) or the analysis may be considered superficial. Rather, analysis should strive for deep understanding to develop meaning with regard to the construct. Wach et al. (2013) outline the process of document analysis in several steps and notes: 1. Defining document inclusion criteria, which may be practical or strategic in nature, 53 Wach, Ward, and Jacimovic (2013) defined policy as documents “that express official organizational aims and strategies.” 54 An analysis based on categories. 70 2. Gathering the document/s, 3. Outline analysis area/s, 4. Analyze the document/s using coding if applicable, and 5. Verify the analysis through an independent source (2nd reviewer) to increase reliability, impartiality, and dependability of the findings. An important note about step 5 – an analysis is considered dependable if the second reviewer would have made the same conclusions while analyzing the document/s in the same manner. Finally, Wach et al. (2013) recommend that thought needs to be given if the organization owning the document actually delivered the proposed policy. In this case the policy would be the science framework claim of three subdomains in science. There are several ways to present a document analysis. Mazzei and Jackson (2024) discuss “re-animating” documents in a visual format to uncover new “intensities”, which can be interpreted as new ways of seeing content contained within the document. This could take the form of a visual aid, such as a logic model or flow chart, that details the structure of the data and points out claims. Advantages and Disadvantages of Document Analysis. Many states and nations have conducted document analysis on content frameworks as an efficient means of uncovering content relatedness and connections to educational theory. Bowen (2009) describes advantages and disadvantages to this type of analysis in the list below. My counterarguments or agreements are in italics. Document analysis provides: • efficiency in data selection as collection from participants is not always required, 71 o Agree, with caveat that sometimes documents can be difficult to get in entirety from institutions. • most documents as they are available in the public domain, o Disagree, some documents may be public, but institutions can also keep many documents internal. • a decreased cost compared to other analyses, o Agree, document analysis is not as high in cost as a recruitment of participants for a quantitative study. 
• less to no reactivity or obtrusiveness from documents, since participants are not being observed,
o Agree, but Bowen (2009) also mentions that a researcher is less likely to influence the research due to the lack of social interaction, which I disagree with, as document analyzers can bring their own viewpoints into describing the meaning of the document.
• stability over time, along with documents being exact in nature and researchers not altering what is researched, and
o Disagree; documents often go through several versions that are not always reported to researchers, and sometimes changes are left out of a document too.
• broad coverage of material and historical events.
o Agree, documents allow researchers to peer into history.

Document analysis is disadvantageous in that documents can:
• sometimes lack detail,
• be hard to retrieve, and
• become biased through the selection process.

Despite the disadvantages, document analysis can provide needed illustration. A document analysis of the framework’s theoretical claims of multidimensionality will help to elucidate whether multidimensionality should be expected in the empirical results.

Conducting Document Analysis. The number of documents tied to the 2015 PISA is quite extensive and includes frameworks, reports on results, released items, technical reports, country level reports, webpage FAQs, brochures, and videos. To narrow this field, documents were selected following the process outlined by Voogt and Roblin (2012), who recommend screening for the goal of identifying the main theme, which in this study is science dimensionality. This is done by determining inclusion criteria a priori as per Wach et al. (2013). The inclusion criteria required that a document mention the PISA 2015 science framework and any form of the words domain, subdomain, or dimension, and that the document have been developed by OECD. This second screening requirement excluded any secondary sources that were not directly involved in science content development, since content developers, including teachers and assessment designers, are closest to the intent of the framework. This led to two documents being identified for possible analysis – OECD’s 2015 PISA Science Framework and PISA 2015 Technical Report. Next, a saturation evaluation of the selected documents was performed. Saturation evaluation, based in grounded theory55, can be used in qualitative analysis to stop the data collection from a document if no additional data are found to code to the theme being analyzed that go beyond what has already been found, or in other words some degree of key saturation has been reached in the evaluation (Saunders et al., 2018). The saturation evaluation eliminated the PISA 2015 Technical Report as it offered no substantial theoretical claims of multidimensionality in science. Hence, the 2015 PISA Science Framework document found during document collection was color-coded56 by the main theme [dimensionality] and given relevant coder annotations in order to draw out any supporting evidence or subthemes [multi vs. unidimensionality] (Voogt & Roblin, 2012).

55 Grounded theory is an inductive qualitative methodology that allows new theory to be formed from the observed data (Ho & Limpaecher, 2021). That said, I acknowledge my prior experience teaching science may lead to deductive reasoning with regards to PISA science content standards and their fit into a dimensionality theory.
If no evidence was found supporting science multidimensionality, that was also documented, along with any barriers to, or reasons why, multidimensionality is not present among the science subdomains. The subdomains were graphically connected to show any crossover between science content knowledge in the content standards (see Chapter 3, Figure 11, which was verified via a committee review as recommended by Wach et al. [2013]). After qualitative analysis, the researcher may find among the coded themes evidence of a theory that supports a model. Figure 8 illustrates how this theory building can come about (Carpiano & Daley, 2006).

Figure 8
From Science Framework Review to MIRT Model Development

Note. From “A guide and glossary on postpositivist theory building for population health,” by R. M. Carpiano and D. M. Daley, 2006, Journal of Epidemiology and Community Health, 60, p. 566.

56 This was done by hand rather than a computer program.

Carpiano and Daley’s (2006) definitions of a framework, theory, and model associated with Figure 8 are adapted below to describe aspects of this study.
1. Conceptual Framework: A set of standards about content knowledge in science that students should be able to show mastery towards.
2. Theory: Grounded in educational pedagogy and learning philosophy, it indicates a relationship between the variables (science content knowledge) while diagnosing the science learning phenomena to predict an outcome (e.g., the science subdomains will present as multidimensional when scored since learning may be unique to each subdomain). This theory will hopefully be fully developed after qualitative framework analysis.
3. Model: Makes a specific assumption about the learning theory for science content knowledge that allows parameters (science content knowledge variables) to be tested quantitatively.

Carpiano and Daley (2006) further clarify that after articulating a theory the researcher could draw the model by detailing the constructs taken from the theory (such as PISA science scores), diagramming their flow from left to right with relationships shown using arrows57, and indicating positive or negative relationships with a +/-.

57 Double-headed arrows indicate correlations, while single-headed arrows indicate causal relationships.

In order to predict the IRT model needed, evidence from Figure 11 comparing science content standards and from the framework document analysis that supported an educational theory on science dimensionality was then used to develop Figure 29 (similar to Figure 8). After the IRT model with the best fit was identified during quantitative data analysis, it was compared to Figure 29’s predicted model.

Step 2: Quantitative Analysis

“Models make assumptions around measurement explicit and testable (Lang & Tay, 2021).” IRT provides a group of statistical models that determine the probability of a student selecting a specific response (Immekus, 2019). Measurement using an IRT model is an attempt to explain this specific response as a continuous variable (Ayala, 2022) or set of variables (MIRT), such as student ability in the subdomain life science.

Primer on Item Response Theory. IRT is best described as a “formalized” statistical set of models for measuring skill/ability in an assessment (Lang & Tay, 2021; Wilson, 2013). This skill/ability is referred to as a latent variable since it is not directly observable, and the scores derived from the assessment are manifest variables of this ability (Ayala, 2022; Lang & Tay, 2021).
These manifest variables are empirically calibrated to be on an interval scale using the data since the difference between the values is meaningful (Ayala, 2022). Wilson (2013) further clarifies that IRT separates the scale from depending on the random number of items selected to be in the assessment. An item that has utility can differentiate well between student performance on different points of a trait continuum [the interval scale] (Ayala, 2022; Immekus, 2019; Mailman School of Public Health, 2023). 76 Item Parameters. Student performance hinges on several item parameters, which “define a blueprint for the model (Brooks-Bartlett, 2018).” These parameters, and their relationships to student ability, can be visualized with an item characteristic curve (ICC) graph. Figure 9 shows ICCs for three simulated items based on an IRT model estimating three parameters (Park et al., 2020). Each item in Figure 9 has a dichotomous score (0 for incorrect, 1 for correct). Figure 9 ICCs Based on a Three-parameter Logistic (3PL) Model Note. From “Technically speaking: Determining test effectiveness with item response theory,” by S. Park, A. Reeger, and A. M. Aloe, 2020, Iowa Reading Research Center. Copyright 2023 by The University of Iowa. In Figure 9, P(u) represents the probability (P) of a student responding correctly to the item (u). Theta is a representation of student ability (this is the student’s location and is also called 𝜃). The item discrimination parameter or 𝛼 is labeled (a) in the right bottom corner of 77 Figure 9 and provides the maximum slope steepness of each ICC. Steeper slopes indicated by higher 𝛼 values point towards items with greater discrimination between student abilities (Harris, n.d.). In addition to item discrimination, item parameters can include item difficulty and guessing (Ayala, 2022; Mailman School of Public Health, 2023); the 4PL, not shown here, also adds an additional parameter to describe what tends to happen at the top end of the curve, and there are numerous other innovative IRT models; operationally however, usually no more than the three parameters per item shown here are used. In the same corner of the figure, (b) represents the item difficulty parameter as it relates to student ability and (c) represents the “lowest possible probability” of a student responding correctly to the item as an indicator of the student guessing parameter (Park et al., 2020). Therefore, Item 1 discriminates well between students of different abilities, while Item 3 was the least discriminating. Item 2 was moderate in both discrimination, difficulty, and guessing. Finally, Item 3 was the most difficult and had the least probability of students guessing, while Item 1 was the least difficult and had the most probability of students guessing. Higher performing students tend to do better on more difficult items while lower performing students often answer easier items correctly, with discrimination and guessing empirically adding to what is sometimes known about how the item tends to perform. While a difficulty parameter estimate is assigned to each item, an IRT model does not explain why an item is difficult for each student (Lang & Tay, 2021); only that empirically it was calibrated as difficult. 
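To make the roles of these parameters concrete, the following is a small R sketch (illustrative only; the parameter values are invented and are not drawn from PISA items) that computes and plots 3PL ICCs like those in Figure 9:

```r
# Illustrative 3PL item characteristic curves; all parameter values are invented.
p3pl <- function(theta, a, b, c) c + (1 - c) / (1 + exp(-a * (theta - b)))

theta <- seq(-4, 4, by = 0.1)                # ability (theta) scale
items <- data.frame(a = c(1.8, 1.0, 0.5),    # discrimination
                    b = c(-1.0, 0.0, 1.5),   # difficulty
                    c = c(0.20, 0.15, 0.05)) # guessing (lower asymptote)

plot(NULL, xlim = range(theta), ylim = c(0, 1),
     xlab = "Theta (ability)", ylab = "P(u = 1)")
for (j in seq_len(nrow(items))) {
  lines(theta, p3pl(theta, items$a[j], items$b[j], items$c[j]), lty = j)
}
legend("bottomright", legend = paste("Item", 1:3), lty = 1:3)
```

Here the a value controls the maximum steepness of each curve, b shifts the curve along the ability scale, and c sets the lower asymptote, mirroring the roles of (a), (b), and (c) in Figure 9.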
An individual’s cognitive process to reach an item response is also not described by an IRT model (Ayala, 2022), so this is done theoretically in large-scale assessment with expert panels in frameworks, and sometimes with practitioners in standard settings, or can 78 be done more empirically with qualitative analysis in cognitive laboratories or other small-scale settings using verbal protocol analysis (“think alouds”) or other techniques. For PISA, only the frameworks are released, but they are extensive documents. Assumptions. For an IRT model to estimate all three parameters successfully, the model needs to conform to several assumptions. Assumption 1 refers to monotonicity, which assumes that as the probability for a correct response increases a student’s ability also increases (Mailman School of Public Health, 2023). Assumption 2 is guided by conditional/local independence, which states item responses on an assessment are independent of each other based on a student’s location/ability (Ayala, 2022). Assumption 3 is based on unidimensionality, which assumes that only one continuous latent trait is measured (Ayala, 2022). Assumption 4 describes the functional form, which maintains that the data match the mathematical function described by a model (Ayala, 2022). Violating these assumptions can affect which model should be chosen as a best fit to the data. For example, if an assessment is measuring several proficiencies/latent traits, then a unidimensional latent variable may not be appropriate (Socha, n.d.). Defining Dimensionality. Sometimes education attributes being assessed lack a pre- defined dimensional structure (Irribarra & Arneson, 2023; Reckase, 1990). The definitions for multidimensionality and unidimensionality need to be clearly defined, which Irribarra and Arneson (2023) argue has still not occurred in many domains even though dimensionality is often discussed in educational research. One way to begin a definition is to compare the 79 statistical versus psychological aspects of dimensionality. Psychological/theoretical58 dimensionality is the hypothetical latent constructs, such as science ability, that in theory as identified by experts are required for performing well on an assessment (Irribarra & Arneson, 2023; Reckase, 1990). Statistical dimensionality is “the minimum59 number of mathematical variables needed to summarize a matrix of item response data” (Reckase, 1990). Reckase (1990) further points out that statistical dimensionality is based on observable data, such as scores, and rests on the data matrix so is not a function of the assessment or the student population being assessed. Actionable dimensionality is described as a third aspect by Irribarra and Arneson (2023), which refers to the “number of values considered when making a decision based on an assessment.” If the “art of assessing dimensionality” is finding the least number of latent abilities to preserve statistical prowess and construct meaning (Briggs & Wilson, 2003) then dimensionality depends on both the construct, the statistical model, and its intended usage. Multidimensionality over the content areas can be defined as three distinct science subdomains related to one theoretical construct – science ability; however other types of multidimensionality may also exist, such as in cognitive processing or item format as discussed earlier. IRT models have evolved as more research has shed light on dimensionality. Development of IRT Models. 
Preceding multidimensional IRT models were classical test theory (CTT) and unidimensional IRT. MIRT models are often compared to unidimensional IRT models in order to assess model fit. Therefore, an overview of the types of IRT models and their development is needed.

58 Irribarra and Arneson (2023) prefer “theoretical” to “psychological”. They redefine psychological dimensionality to theoretical dimensionality as “the number of relevant psychological attributes that can be reasonably conceived as quantities and are believed to be involved in generating responses to items to some extent (Irribarra & Arneson, 2023).”
59 Irribarra and Arneson (2023) recommend removing the term “minimum” from the definition.

Classical Test Theory (CTT). CTT was developed first and used historically with achievement tests (Brandt, 2015). Similar to IRT, the latent variable is assumed to be continuous (Ayala, 2022). Opposite of IRT, CTT focuses on a student’s whole score for an entire assessment (Ayala, 2022). In CTT, the item parameters depend on the sample from which they are taken (Immekus et al., 2019). In addition, CTT does not provide a reason for item difficulties, making linking between assessments with different item sets more challenging (Brandt, 2015). The CTT approach was updated to IRT in order to focus on the item and student relationship with a statistical model rather than the assessment in its entirety (Wilson, 2013). Now large-scale assessments, such as NAEP and PISA, use IRT models to avoid these disadvantages (Brandt, 2015).

Unidimensional IRT Models. Each UIRT model described in this section uses a logistic function to describe student ability with regard to item parameters to get the probability of answering an item correctly (Harris, n.d.). The simplest of the UIRT models is the Rasch model (Wilson, 2013; Ayala, 2022; Lang & Tay, 2021). The Rasch model assumes items are on a single continuum showing student ability via standardized z-scores. This allows student responses to an item to be compared based on proficiency level. The item discrimination parameter (𝛼) in the Rasch model is set to a constant value of 1.0. In contrast, the parameter 𝛼 in the one-parameter logistic (1PL) UIRT model is allowed to vary from 1.0 and can be some other constant value across the items. Ayala (2022) describes this as a “philosophical perspective” where the Rasch model focuses on constructing the variable and the 1PL UIRT model focuses on fitting the data. Both the Rasch model and the 1PL UIRT model estimate the item difficulty parameter (item location or 𝛿), which can vary. As the 𝛿 parameter increases, the probability of a correct response decreases (Reckase, 2009; Lang & Tay, 2021; Ayala, 2022). The 1PL UIRT model60 also has an advantage over CTT in that, by fitting a logistic function, the model no longer assumes measurement error is the same for each student as CTT does (Lang & Tay, 2021). The two-parameter logistic (2PL) UIRT model allows the discrimination parameter to freely vary across items. Adding to the 2PL UIRT model, the three-parameter logistic (3PL) UIRT model estimates the guessing parameter (𝜒). The proportion of students in the lowest proficiency level choosing the right answer is the estimate used for the 𝜒 parameter (Reckase, 2009). After the 3PL UIRT model, the next development in IRT was the 1PL MIRT model.

60 Other IRT models also have the same advantage over CTT.

Functionality of MIRT. The complexity of educational constructs directly led to the development of MIRT models (Reckase, 2009).
A MIRT model is able to relax the assumption of unidimensionality so that multiple correlated latent traits can be measured (Wang, 2021). An assessment, set of items, or even a single item may require students to use multiple abilities/latent traits, “especially in the compound areas such as the natural sciences” (Issayeva, 2022). A limitation of unidimensional models is that they are not well fit for an instrument developed to be multidimensional (Immekus, 2019). Parameters are interpreted similarly to unidimensional IRT models, but they take the form of vectors and their direction in theta (𝜃) space will influence the interpretation (Socha, n.d.). This theta space is multidimensional and can be summarized as $\underline{\theta}_s = [\theta_{s1} \ldots \theta_{sM}]'$, with M being the number of unobserved latent dimensions needed to model a student’s predicted response to an item (Immekus, 2019). The 𝜒 parameter is the exception because it is not a vector and retains its 3PL definition. As with unidimensional models, maximum marginal likelihood estimation (MMLE) can calibrate the item parameters (Ayala, 2022; Socha, n.d.). MIRT models can be either compensatory or non-compensatory (Spencer, 2004; Issayeva, 2022; Socha, n.d.). Compensatory models are additive in nature and allow a student’s high score in one dimension to make up for a low score in another (Socha, n.d.; Reckase, 1997). However, assessment designers may find it difficult to explain to test takers why scores on different dimensions are dependent on each other (Baghaei, 2012). In a non-compensatory (or partially compensatory) model, a student’s ability (𝜃) in one dimension does not compensate for their ability (𝜃) in a different dimension (Socha, n.d.; Reckase, 1997). Results from a non-compensatory model are the nonlinear sum of thetas (Duran, 2014). Non-compensatory models also tend to be simpler (Socha, n.d.) and may capture cognitive ability more accurately (DeMars, 2016). A drawback is that non-compensatory models have not been used as frequently due to issues with parameter estimation, especially the lack of efficient algorithms, although this is somewhat changing with more computing power becoming available (Ayala, 2022; DeMars, 2016; Spencer, 2004; Wang & Nydick, 2015).

Limitations and Benefits of MIRT Models. While an assessment’s items may be multidimensional in the lens of a content framework, each dimension’s strength may not be enough to warrant a change from a unidimensional model (Reckase, 1985; Socha, n.d.). Sometimes we say the dimensions may exist, but they are not sufficiently “separable” to matter. Another limitation to using MIRT models is that, even though collecting data via online assessments is less expensive than in person, the increasing demand for data and its analysis is testing the limits of models like MIRT and other algorithms (Wang, 2021). There are also several sources of indeterminacy with a MIRT model. Metric indeterminacy results from the metric being relative (Ayala, 2022): the item locations and respondent locations are relative to each other and are not fixed until either the item mean or the respondent mean for the calibration is set to zero, which affects the calibration results for the fixed items or persons. Rotational indeterminacy indicates the direction of each axis is not unique with regard to the item vectors; however, fixing the axes can help alleviate this problem (Ayala, 2022).
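To illustrate the compensatory idea in code, the following small R sketch (all parameter values are invented for illustration and are not from this study) shows how high ability on one dimension can offset low ability on another for a compensatory two-dimensional item:

```r
# Compensatory 2PL MIRT probability for a single item measuring two dimensions.
# The probability depends on a weighted sum of abilities, so strength on one
# dimension can offset weakness on the other. All values are invented.
p_comp <- function(theta, a, d) 1 / (1 + exp(-(sum(a * theta) + d)))

a <- c(1.2, 0.8)   # item discriminations (slopes) on dimensions 1 and 2
d <- -0.5          # item intercept (related to easiness)

p_comp(c( 1.5, -1.5), a, d)  # high on dimension 1, low on dimension 2
p_comp(c(-1.5,  1.5), a, d)  # low on dimension 1, high on dimension 2
p_comp(c( 0.0,  0.0), a, d)  # average on both dimensions
```

Under a non-compensatory form, by contrast, the overall probability is roughly a product of dimension-specific terms, so a low ability on either dimension keeps the probability of success low.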
A benefit of a MIRT model is the ability to link calibrations in order to create a large pool of calibrated items, which then may be used to develop interesting adaptations, such as computer adaptive testing (CAT) with parallel multidimensional test forms61 (Issayeva, 2022). Within-item multidimensionality can be used with a MIRT model to reduce test length because one item provides data on several dimensions, which is useful in CAT (Duran, 2014). Kose and Demirtasli (2012) found, however, that longer tests (i.e., more items) and greater sample sizes are needed to reduce error and increase MIRT model sensitivity. Perhaps a greater benefit is more data about student ability on each dimension, which in turn can confirm or add to theories about educational constructs.

61 Test forms often refer to versions of the same assessment. For example, test form A may have items arranged differently from, or contain only some of the same items as, test form B.

PISA IRT Models. The PISA Results in Focus 2015 report states that PISA’s goal is not to determine the cause and effect of educational “practices and student outcomes” (OECD, 2018). However, the assessment does arguably make decisions about how student latent traits relate to educational constructs, which policymakers often use to lay some of the evidentiary groundwork for educational practice and theory. PISA divided each domain – reading, math, and science – into several subdomains for the 2015 assessment, indicating the constructs of each domain provided evidence of multidimensionality (Brandt, 2015). Currently, and in 2015, PISA uses a unidimensional composite score for science developed via a 2PL Rasch model (Jerrim, 2016) while also reporting scores in subdomains. Student scores are developed by first choosing the IRT model to estimate item parameters and then using maximum likelihood estimation (MLE) to determine a latent trait ability level for each student (Jerrim, 2016). This is done in the main study after an extensive field trial to adapt instruments, usually by a keep-and-drop method, for which data are not released. The difference in modeling for a multidimensional construct versus unidimensional scale scoring indicates that the assessment is being interpreted in both directions. There are both advantages and disadvantages to this approach per Brandt (2015) – see Table 5.
Table 5
Trade-offs Between Calibration Methods for a Unidimensional Score

Calibration method for multidimensional data: Scale Score on Unidimensional Calibration
Advantages:
• Reliably allows calculation of individual scores using MLEs
Disadvantages:
• Overestimates reliability, plus biases difficulty and variance estimates, by neglecting local item dependence (LID)
• Validity of the multidimensional constructs is reduced since the assessment is designed to be unidimensional
• Framework may specify one set of weightings for each dimension, but the actual weight may change due to a need to drop items

Calibration method for multidimensional data: Composite Score on Multidimensional Calibration
Advantages:
• Assuming items within a subdomain are separate dimensions and the items within a dimension are more closely related than items between dimensions, this approach allows LID to be considered, which more accurately estimates reliability
• Explicit and clear weighting of subdomains allows the unidimensional composite score to be developed
Disadvantages:
• Calculation of reliable individual scores via MLEs is not possible
• Reliability of the composite score is reduced since items are not on a common scale
• Not an appropriate calibration method if the Rasch model is used since dimensions cannot be set to equal variances (this leads to the need for standardization after calibration, increasing measurement error)

Note. Adapted from “Unidimensional interpretation of multidimensional tests,” by S. Brandt, 2015, Dissertation, p. 28.

Construct validity is also a crucial issue here. Per Messick (1995) and Spencer (2004), ignoring any evidence, such as multidimensionality found in a framework, may negatively impact the construct validity of the evaluation of meaning behind an assessment’s results. Reading is the only domain that OECD (2017a) describes as multidimensional in the 2015 PISA Framework. Using a MIRT model could help validate the 2015 PISA science framework design if the subdomains are found to be indicators of multidimensionality. This finding would mean the current unidimensional scoring model could contain a construct misspecification, which leads to incorrect interpretation of PISA scores that could in turn impact education policy decisions for diverse student groups.

Conducting Quantitative Analyses. The analyses in this section were done for both the S10 and the S11 clusters of items and were all conducted using R in RStudio (RStudio Team, 2021). The PISA 2015 science data were first cleaned by removing student cases that contained only NAs. The data were then run through descriptive analyses using the psych package (Revelle, 2024). Next, histograms of average scores for the U.S. science sample and the cluster S10 subsample were developed for the student population using the ggplot2 package (Wickham, 2016). Item histograms showing the number of each type of response (0 or 1) were plotted for the cluster S10 subsample. Due to the ordinal nature of the dichotomous data, a polychoric matrix was developed using the psych package (Revelle, 2024) and saved for use in later analyses. The correlation coefficient rho was examined and developed into a table. Using the factoextra package (Kassambara & Mundt, 2020), a distance matrix heatmap was also developed from the polychoric matrix for the cluster S10 subsample.
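To make these data-preparation steps concrete, the following is a minimal R sketch of the kind of workflow just described; the object name s10 is a hypothetical placeholder for the cluster S10 response data, and the 1 − rho distance conversion is one common choice rather than the study’s documented procedure:

```r
# Sketch of the descriptive and correlation steps described above.
# `s10` is assumed to be a data frame of dichotomously scored (0/1) item
# responses for the cluster S10 subsample, one row per student.
library(psych)       # describe(), polychoric()
library(ggplot2)     # histograms
library(factoextra)  # fviz_dist() for the distance heatmap

s10 <- s10[rowSums(!is.na(s10)) > 0, ]   # drop cases that are entirely NA

psych::describe(s10)                     # item-level descriptive statistics

avg <- rowMeans(s10, na.rm = TRUE)       # average score per student
ggplot(data.frame(avg = avg), aes(x = avg)) +
  geom_histogram(bins = 20) +
  labs(x = "Average score", y = "Count")

poly <- psych::polychoric(s10)           # polychoric correlation matrix (rho)
round(poly$rho, 2)

# One common choice for a distance heatmap: convert correlations to distances.
d <- as.dist(1 - poly$rho)
factoextra::fviz_dist(d)
```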
In order to evaluate the sensitivity of the cluster analysis, the cluster S10 subsample was randomly split in half using the dplyr package (Wickham et al., 2023), and the random half subset was run through the same processes outlined in the Cluster Analyses and PCA subsections below, but not through IRT analysis, as the subset’s size was very small at 653 students.

Cluster Analyses. Each subsample was then run through a cluster analysis using the kmeans function of the stats package (R Core Team, 2023) to estimate 3 clusters. Since each subdomain should represent a unique dimension, the logits for each dimension can be obtained via cluster analysis and then weighted to show each item’s relationship to a dimension (Ayala, 2022). The results of the cluster analysis were graphed in a scree plot using the factoextra package (Kassambara & Mundt, 2020).

PCA. The two subsamples and the random half subset were then also run through a PCA using the prcomp function in the stats package of R (R Core Team, 2023). The loadings, scores, and variances were analyzed to help develop several plots. Three-dimensional (3D) principal component plots were developed based on the PCA scores with the plot_ly function of the plotly package (Sievert, 2020) in R, while loadings were visualized with the barplot function of the graphics package (R Core Team, 2023) in R. Finally, principal component plots were loaded publicly into Plotly Chart Studio (Plotly Technologies Inc., 2015) for later presentation due to their interactive nature.

IRT Analyses. A critical step in determining which model provides the most information about student learning is examining the fit of various models (Yamamoto, 1995). Therefore, student data were analyzed in an exploratory quantitative research design. The explorations consisted of 1PL UIRT, 2PL UIRT, 1PL MIRT, and 2PL MIRT models – each model is described below.

Model 1 was a 1PL UIRT model and is described by Equation 1.

Equation 1
1PL UIRT

$p(x_j = 1 \mid \theta_s, \alpha, \delta_j) = \dfrac{1}{1 + e^{-\alpha(\theta_s - \delta_j)}}$

Where p is the probability of a value of 1 (a correct response) when the predictor is x, e is a constant of 2.7183 (i.e., the base of the natural logarithm), 𝜃 is a person’s location (i.e., ability), 𝛿 is the item’s location (i.e., estimated difficulty) of item j for student s, and alpha (𝛼) is allowed to vary from 1 while kept constant across items (Ayala, 2022).

62 Equations 1, 2, and 4 are from Ayala (2022). Equation 3 is derived using DeMars (2016).

Model 2 was a 2PL UIRT model and is described by Equation 2.

Equation 2
2PL UIRT

$p(x_j = 1 \mid \theta_s, \alpha_j, \gamma_j) = \dfrac{e^{\alpha_j\theta_s + \gamma_j}}{1 + e^{\alpha_j\theta_s + \gamma_j}}$

The discrimination parameter, 𝛼, is now allowed to vary across items (Ayala, 2022). Note that the intercept is represented by gamma (γ), which is constant for this model, and is a representation of the interaction of the item’s location and discrimination parameters (Ayala, 2022). “In a proficiency assessment situation γj would be interpreted as related to an item’s difficulty/easiness (Ayala, 2022, p. 393).”

Model 3 was a 1PL MIRT model and is described by Equation 3.

Equation 3
1PL MIRT

$p(x_{ij} = 1 \mid \underline{\theta}_s, \underline{\alpha}, \delta_j) = \dfrac{e^{\underline{\alpha}'\underline{\theta}_s + \delta_j}}{1 + e^{\underline{\alpha}'\underline{\theta}_s + \delta_j}}$

In order to determine an item-to-dimension relationship that can change along 2 or more dimensions, the logits are weighted (Ayala, 2022). The item slopes (𝛼) are fixed to a constant number across dimensions (DeMars, 2016). Note that an underscore indicates a vector and a prime symbol (‘) indicates a row vector.

Model 4 was a 2PL compensatory MIRT model and is described by Equation 4.

Equation 4
2PL MIRT

$p(x_{ij} = 1 \mid \underline{\theta}_s, \underline{\alpha}_j, \delta_j) = \dfrac{e^{\underline{\alpha}_j'\underline{\theta}_s + \delta_j}}{1 + e^{\underline{\alpha}_j'\underline{\theta}_s + \delta_j}}$

Model 4 differs in that 𝛼 can now vary.
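As an illustration of how these four models can be fit and compared in R with the TAM package described next, here is a minimal sketch; the objects resp (a data frame of 0/1 scored responses) and Q (an items-by-dimensions matrix assigning each item to one of the three subdomains) are hypothetical placeholders, not the study’s actual scripts:

```r
# Sketch of fitting and comparing UIRT and compensatory MIRT models with TAM.
# `resp` is a placeholder data frame of dichotomous (0/1) item responses;
# `Q` is a placeholder items-by-dimensions matrix of 0/1 loadings for the
# three science subdomains.
library(TAM)

mod1 <- TAM::tam.mml(resp = resp)             # Model 1: 1PL UIRT
mod2 <- TAM::tam.mml.2pl(resp = resp)         # Model 2: 2PL UIRT
mod3 <- TAM::tam.mml(resp = resp, Q = Q)      # Model 3: 1PL MIRT (between-item)
mod4 <- TAM::tam.mml.2pl(resp = resp, Q = Q)  # Model 4: 2PL compensatory MIRT

# Information criteria (deviance, AIC, BIC) for each fitted model.
mod1$ic; mod2$ic; mod3$ic; mod4$ic

# Likelihood-ratio comparison of nested models, e.g., Model 1 vs. Model 3.
anova(mod1, mod3)

# Item infit/outfit statistics for a fitted model.
fit1 <- TAM::tam.fit(mod1)
head(fit1$itemfit)
```

In this kind of setup, supplying a Q-matrix turns the unidimensional calibration into a between-item multidimensional one, which matches the compensatory MIRT structure noted below.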
After model selection, the dichotomous data (coded as 0 or 1) were analyzed using the TAM package (Robitzsch et al., 2022). This package was chosen primarily because of its use by OECD (Kiefer et al., 2015) and by other researchers comparing IRT models, along with its ease of use. The TAM package allows for UIRT and MIRT models, but only compensatory MIRT. A table was developed that shows the statistics for each model, and a second table comparing the models’ fit was also generated. Based on the best fitting model, a Wright map and ICCs were also developed using the wrightMap function of the WrightMap package (Irribarra & Freund, 2014) and the plot function of the graphics package (R Core Team, 2023) in R, respectively.

Data Triangulation

After data analysis, the researcher should begin to build inferences from both types of data. Meta-inferences are defined by Venkatesh et al. (2013) as “theoretical statements…from an integration of quantitative and qualitative strands of mixed methods research.” The pathway for this study flowed as follows: comparing (merging) qualitative + quantitative findings → meta-inference/s. The other pathways consist of either the qualitative or the quantitative findings leading individually to the next set of findings, which would be the opposite of the first step, and then to a meta-inference. After an analysis path is chosen, Venkatesh et al. (2013) state that researchers should take either a bridging or a bracketing research path to develop the meta-inference/s. Bridging is described as a consensus between the two types of findings (qualitative + quantitative), while bracketing uses alternate views of the phenomenon to report differences between the two types of findings (qualitative vs. quantitative). Bridging was used for this study. An analysis path can be taken further by developing the qualitative and quantitative data into a triangulation, or mapping them to each other, so that the data from one method support or contrast with conclusions drawn in the other. Östlund et al. (2011) adapted the diagram shown in Figure 10 from Erzberger and Kelle (2003) to illustrate triangulation.

Figure 10
Triangulation for Mixed Methods Research

Note. From “Combining qualitative and quantitative research within mixed method research designs: A methodological review,” by U. Östlund, L. Kidd, Y. Wengström, and N. Rowa-Dewar, 2011, International Journal of Nursing Studies, 48(2011), p. 371. Copyright 2010 by Elsevier.

Figure 10 showcases that the empirical findings for both the quantitative (QUAN) and qualitative (QUAL) strands can be used to support theory, which may be newly developed. The sides of the triangle represent connections between the theory and each set of findings, along with the findings to one another. Depending on the outcome of the research, the triangle sides can differ in appearance. First, the triangle sides may remain convergent (as shown in Figure 10), which is when findings from both methods support theory and lead to the same conclusion. Second, the triangle sides may become parallel to each other when the findings from both methods complement or support one another. Third, the triangle sides may be divergent, indicating the findings are different for each method or may even contradict one another, in which case the contrasts should be explored to understand why different lenses indicate different directions. As noted in their research, Östlund et al.
(2011) stated, that while a mixed method approach is gaining ground with other researchers, the triangulation process described above was only beginning to be used at that time. It has since expanded. Triangulation can help clarify results in mixed methods research by clearly identifying the interactions between different types of data. Bowen (2009) and Armstrong (2021) agree that document analysis is a way to triangulate with other methods of research. They also describe how triangulation can provide a solid foundation for the design of and theory behind the mixed methods approach (Östlund et al., 2011). A data triangulation figure based on Östlund et al.’s (2011) procedure shown in Figure 10 was built for this study. Step 3: Equity Investigation OECD did not make publicly available information on ethnicity/race in 2015 and only collected minimal information on school location and student economic status in the survey questions. Therefore, survey questions were not used to investigate equity issues. Instead, student ability levels between models were compared to determine which model type had a greater range of ability levels. Using item cluster S10 with Models 1b and 3b thetas were mapped in bar plots. Ability level (theta) range was then analyzed to determine where historically marginalized student groups might be located to determine if a less complex model sacrifices information about these groups for a more pragmatic design. 92 CHAPTER 3: RESULTS The following results are reported in three sections by research question (RQ). Triangulation results from the qualitative and quantitative analyses is then reported after the RQ2 results. See section Research Questions for the details of each RQ. Results Relating to RQ1 The saturation evaluation of the two documents identified for analysis, OECD’s 2015 PISA Science Framework and PISA 2015 Technical Report, yielded mixed results. Neither a broader view of science educational theory nor multidimensionality for science knowledge is described in the 2015 PISA science framework. This theory was teased out during the qualitative analysis and is proposed at the end of this section. Coding on dimensionality was accomplished for the first document (i.e., the science framework) and theoretical saturation was achieved at the end of that document analysis. Saturation occurred because the second document, while mentioning science and multidimensionality upon initial analysis, did not actually refer to science content dimensionality, but rather the possible dimensionality between new and trend science items. Any evidence resulting from color codes highlighting the two subthemes are reported in Table 6 below for the only document analyzed. Note that the evidence column for subtheme unidimensionality is deliberately left blank as no portions of the science framework were found to code to unidimensionality. Missing evidence occurs in the subtheme multidimensionality (see yellow-filled cells). 93 Table 6 Evidence Supporting Dimensionality Themes Document Section/Feature Subtheme Subtheme Annotation (Pg. #) Unidimensionality Multidimensionality While OECD does appear to Box 2.1 be referring to Scientific multidimensionality with this knowledge: PISA “…three distinguishable but phrase it is not about science 2015 related elements.” content knowledge, but terminology rather the types of (Pg. 21) knowledge: content63, procedural, and epistemic. 
Section: Organizing the Science literacy is referred to domain of as a domain of interrelated science and aspects, but its science Figures 2.1-2.2 content dimensionality is not (Pg. 25) clarified. “Given that only a sample of the content domain of science can be assessed in the PISA OECD refers to “fields” of 2015 scientific literacy science indicating that these Section: assessment, content areas are not one Scientific clear criteria are used to guide single subdomain. knowledge and the selection of the 64 Figure 2.5 knowledge that is assessed. Figure 2.5 clearly separates (Pg. 28) The criteria are applied to the content knowledge knowledge from required for each “field” of the major fields of physics, science visually by them out chemistry, biology, earth and with a blue banner. space science…” OECD defined the required percentage of items by science content (physical, living, Earth and space). This Table 2.2 “Desired distribution of items, is similar to other state and (Pg. 29) by content” national assessments that intend to report on subdomains of science to showcase student learning in each content area. 63 This is the only type of knowledge relating to the science subdomains of life, physical, and Earth and space systems. 64 Knowledge of physical systems seems to combine both chemistry and physics science content. 94 Document Section/Feature Subtheme Subtheme (Pg. #) Unidimensionality Multidimensionality Annotation OECD lists content knowledge as a dimension separate from procedural Table 2.3 “Desired distribution of items, and epistemic knowledge. (Pg. 30) by type of knowledge” This does not clarify if content knowledge by itself is built of separate subdomains. OECD provides required item counts needed for the three Table 2.4 “Desired distribution of items types of knowledge by (Pg. 31) for knowledge” science content subdomain indicating each subdomain is its own dimension. Knowledge type is referred to as content indicating that science content is its own domain, but subdomains indicating multiple Figure 2.17 “Framework categories” dimensions is not reported. (Pg. 36) However, as shown in the released item in this study’s Figure 4, the item developers do document to which subdomain an item should be coded. Note. Page number refers to the page numbers shown at the bottom of each page of the PISA 2015 Science Framework provided in Appendix F. The 2015 PISA Science Framework document’s Figure 2.5 seems to provide a strong piece of evidence for multidimensionality of the science subdomains in the form of the differentiated and unconnected pieces of content knowledge that PISA assesses for science. Based on this knowledge, Figure 11 was developed to visualize any crossover of knowledge that was shown in Error! Reference source not found., which replicates and codes OECD’s Figure 2.5. Any crossover noted below is not a concrete occurrence on the PISA science items. Item developers often carefully consider aspects like content crossover and develop the required 95 items in a manner that prevents cluing of other items. My subject matter expertise was used to provide examples in Figure 11 where content knowledge in one subdomain may benefit from content knowledge in another subdomain, thus potentially affecting multidimensionality by making the subdomains less distinct. The orange curved arrows indicate this possible content knowledge crossover in the content standards. 
For example, PS1 is the knowledge of the structure of matter and having this knowledge might increase a student’s understanding of LS1, which is the knowledge of cells. Figure 11 Possible Connections between 2015 PISA Science Content Knowledge Since the qualitative results above seem to indicate that science is multidimensional with the three subdomains being assessed in 2015 PISA having little crossover, it seems a multidimensional IRT model might be warranted. However, there is insufficient information in the framework on the expected separability (or “difference”) between the theoretical elements seen – is there expected to be enough difference to perceive in a large-scale assessment? Also, there is little or no discussion of confounds. For instance, are high performing respondents in one area expected to be high performing in the others? While this is not necessarily considered theoretically true for U.S. NGSS disciplinary core ideas specifically in the three subject matter 96 areas, it is expected theoretically to be more applicable by their nature for scientific and engineering practices (SEPs)65 and cross-cutting concepts (CCCs)66. The continuum between the framework, the proposed educational theory, and its indicated model is provided in Figure 12. The proposed educational theory is that distinct science subdomains require students to have differentiated knowledge to demonstrate mastery of each subdomain, which may be more accurate for disciplinary core ideas than for practices and cross-cutting concepts. Figure 12 Proposed Continuum Note. Adapted from “A guide and glossary on postpositivist theory building for population health,” 2006, by R. M. Carpiano, and D. M. Daley, 2006, Journal of Epidemiology and Community Health, 60, p. 566. Results Relating to RQ2 The descriptive statistics, histograms, and correlations are provided first then in subsection RQ2A: Cluster Analyses Results the cluster analysis and its sensitivity test are reported. In subsection RQ2B: PCA Results PCA results are provided. Subsection RQ2C: IRT 65 OECD refers to content similar to this as procedural knowledge in the 2015 framework (OECD, 2017a). 66 OECD refers to content similar to this as epistemic knowledge in the 2015 framework (OECD, 2017a). 97 Results contains IRT model fit for two different item clusters. Lastly, triangulation results are reported. Unless otherwise noted all results are for the whole student subsamples taking each item cluster (S10 and S11) analyzed rather than for the full U.S. sample. If results pertain to the half subset of the cluster S10 subsample that is also noted. Descriptive Statistics Table 7 below provides several statistics for each item and the three added variables of number of items attempted, student raw score, and average score. Curran et al. (1996) recommend for multivariate normality that skew not range outside of +/-2 and kurtosis outside of +/-7, which the data in Table 7 do not violate. The mean for the item variables remains stable, in other words the actual mean does not stray greatly from 0.5, and indicates all the items are on the same scale (a mean of say 15 might indicate an item on a different scale when all other item means hover between 0.1 and 0.8). 
Table 7 Descriptive Statistics for Item Cluster S10 Full Subsample Variable n Mean SD Median Min Max Range Skew Kurtosis SE DS625Q01C 1306 0.52 0.50 1.00 0 1 1 -0.09 -1.99 0.01 CS625Q02S 1306 0.64 0.48 1.00 0 1 1 -0.58 -1.67 0.01 CS625Q03S 1306 0.58 0.49 1.00 0 1 1 -0.31 -1.90 0.01 CS615Q07S 1306 0.29 0.45 0.00 0 1 1 0.94 -1.11 0.01 CS615Q01S 1306 0.82 0.38 1.00 0 1 1 -1.71 0.91 0.01 CS615Q02S 1306 0.48 0.50 0.00 0 1 1 0.09 -1.99 0.01 CS615Q05S 1306 0.18 0.39 0.00 0 1 1 1.65 0.73 0.01 CS604Q02S 1306 0.49 0.50 0.00 0 1 1 0.06 -2.00 0.01 DS604Q04C 1306 0.29 0.45 0.00 0 1 1 0.93 -1.14 0.01 CS645Q03S 1306 0.51 0.50 1.00 0 1 1 -0.02 -2.00 0.01 DS645Q04C 1306 0.57 0.50 1.00 0 1 1 -0.27 -1.93 0.01 DS645Q05C 1306 0.14 0.35 0.00 0 1 1 2.04 2.18 0.01 CS657Q01S 1306 0.71 0.46 1.00 0 1 1 -0.91 -1.18 0.01 CS657Q02S 1306 0.42 0.49 0.00 0 1 1 0.32 -1.90 0.01 CS657Q03S 1306 0.47 0.50 0.00 0 1 1 0.14 -1.98 0.01 Number Attempted 1306 15.00 0.00 15.00 15 15 0 NaN NaN 0.00 Raw Score 1306 7.09 3.37 7.00 0 15 15 0.12 -0.91 0.09 Average 1306 0.47 0.22 0.47 0 1 1 0.12 -0.91 0.01 98 Figure 13 is a histogram of average scores for the full U.S. science sample while Figure 14 provides the same information for the cluster S10 subsample. Both populations seem fairly normal in distribution, but the subsample population does show less variability in average scores. Figure 15 is a set of histograms showing frequency of 0 and 1 scores for each item. A distance heatmap using the correlations from the polychoric matrix is shown in Figure 16 with the blue color indicating high similarity (close distance together) between items while the orange color indicates low similarity (far apart from each other) between items. Table 8 provides the polychoric correlations with the majority of items being weakly and negatively correlated – a negative correlation indicates that as a student does well on one item they score less on the other item. Figure 13 Histogram of Student Average Scores for Full U.S. Science Sample 99 Figure 14 Histogram of Student Average Scores for Item Cluster S10 Full Subsample Figure 15 Histograms of Student Score Point Frequency for Item Cluster S10 Full Subsample 100 Figure 16 Distance Heatmap for Item Cluster S10 Full Subsample 101 Table 8 Means (M), Standard Deviations (SD), and Correlations with Confidence Intervals (CI) for Item Cluster S10’s Full Subsample Item M SD 1 2 3 4 5 6 7 8 9 10 11 12 13 14 1. DS625Q01C 0.35 0.21 2. CS625Q02S 0.32 0.20 -.02 [-.53, .50] 3. CS625Q03S 0.37 0.19 .22 -.00 [-.33, .66] [-.51, .51] 4. CS615Q07S 0.37 0.20 .40 -.08 .13 [-.15, .75] [-.57, .45] [-.41, .60] 5. CS615Q01S 0.33 0.23 .13 -.07 .32 .38 [-.41, .60] [-.56, .46] [-.23, .71] [-.17, .75] 6. CS615Q02S 0.37 0.21 .21 -.07 .17 .40 .65** [-.34, .65] [-.56, .46] [-.38, .63] [-.14, .76] [.21, .87] 7. CS615Q05S 0.20 0.24 -.08 -.24 -.15 .09 -.42 -.26 [-.57, .45] [-.67, .31] [-.61, .40] [-.44, .58] [-.77, .12] [-.68, .29] 8. CS604Q02S 0.32 0.20 .09 -.07 .19 -.02 .11 .12 -.20 [-.44, .58] [-.56, .46] [-.36, .64] [-.53, .50] [-.43, .59] [-.42, .59] [-.65, .35] 9. DS604Q04C 0.34 0.20 .26 .18 .09 .01 .11 .20 -.11 -.12 [-.29, .68] [-.37, .63] [-.44, .58] [-.51, .52] [-.42, .59] [-.34, .65] [-.59, .43] [-.59, .42] 10. CS645Q03S 0.34 0.21 .41 .11 .24 .15 -.01 .15 -.19 .17 .14 [-.12, .76] [-.42, .59] [-.31, .67] [-.39, .62] [-.52, .50] [-.39, .62] [-.64, .36] [-.37, .63] [-.40, .61] 11. 
DS645Q04C 0.41 0.19 .29 .12 .33 .24 .34 .41 -.40 .16 .22 .31 [-.26, .70] [-.42, .60] [-.22, .72] [-.31, .67] [-.21, .73] [-.13, .76] [-.76, .14] [-.39, .62] [-.33, .66] [-.25, .71] 12. DS645Q05C 0.29 0.22 -.06 .21 -.02 -.26 -.15 -.10 -.24 .01 -.08 .27 .39 [-.55, .47] [-.34, .65] [-.53, .50] [-.68, .29] [-.62, .39] [-.58, .44] [-.67, .31] [-.50, .52] [-.57, .45] [-.28, .69] [-.15, .75] 13. CS657Q01S 0.16 0.24 -.41 -.17 -.23 -.30 -.11 -.27 -.34 -.17 -.34 -.47 -.14 -.17 [-.76, .13] [-.63, .37] [-.67, .32] [-.71, .25] [-.59, .43] [-.69, .28] [-.73, .20] [-.63, .37] [-.73, .21] [-.79, .06] [-.61, .40] [-.63, .38] 14. CS657Q02S 0.32 0.21 .03 .28 .17 .08 -.10 .07 -.27 .12 .22 .04 .12 -.02 -.24 [-.49, .53] [-.27, .69] [-.38, .63] [-.45, .57] [-.58, .43] [-.46, .56] [-.69, .28] [-.42, .60] [-.33, .66] [-.48, .54] [-.42, .60] [-.53, .50] [-.67, .31] 15. CS657Q03S 0.37 0.20 .17 .11 .27 .20 .09 .26 .12 -.02 .35 .15 .25 .09 -.58* .42 [-.37, .63] [-.43, .59] [-.28, .69] [-.34, .65] [-.44, .58] [-.29, .68] [-.42, .59] [-.53, .50] [-.19, .73] [-.39, .61] [-.30, .67] [-.45, .57] [-.84, -.10] [-.11, .77] Note. Values in square brackets indicate the 95% CI for each correlation. The CI provides a range of population correlations that describe where the sample correlation may truly lay (Cumming, 2014). * Indicates p < .05 and ** indicate p < .01. 102 RQ2A: Cluster Analyses Results The following scree plot in Figure 17 does not show a distinct elbow bend at any number of clusters because the slope is constantly decreasing without leveling off. There is no indication of multidimensionality from this analysis. Figure 17 Scree Plot for Item Cluster S10 with Full Subsample A second scree plot, shown in Figure 18, was obtained, as described in Chapter 2. Methods: Conducting Quantitative Analyses – Cluster Analyses, from the randomly chosen half of the cluster S10 student subsample. No elbow bend indicating optimal number of clusters is visible so there is no clear multidimensionality based on this analysis – no place where establishing clear dimensions would seem most appropriate. 103 Figure 18 Scree Plot for Item Cluster S10 with Random Half of Subsample Figure 19 provides a scree plot that lacks a clear elbow bend, indicating no clusters for item cluster S11, thus revealing no clear multidimensionality by the above definition. Figure 19 Scree Plot for Item Cluster S11 RQ2B: PCA Results Figure 20 provides bar plots of the three PCA loadings (eigenvectors) explaining the most variation in the set of items in cluster S10. Principal components 4-15 were dropped as 104 they all explained 2.5% or less of the variation and were each far below eigenvalues of 1. The reader can note below that the only for PC 1 was the eigenvalue clearly 1 or greater, but since my research questions are about multidimensionality in three dimensions as compared to the usual unidimensionality with which PISA data are modeled, PC1-PC3 were kept. Per prcomp function documentation, “The signs of the columns of the rotation matrix (the loadings) are arbitrary, and so may differ between different programs for PCA, and even between different builds of R” (R Core Team, 2023). Only for PC 1 (2.01) was the eigenvalue greater than 1, PC 2 (0.1) and PC 3 (0.08) were much smaller. Eigenvalues greater than 1 indicate those components “account for more than the mean of the total variance in the items,” which follows the Kaiser- Guttman rule (Li, 2012). 
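As a brief aside on how such eigenvalues and the Kaiser-Guttman screen can be computed, the sketch below (using a hypothetical numeric matrix X of item scores, a simplification rather than the study’s polychoric-based procedure) shows the basic prcomp-based calculation:

```r
# Eigenvalues are the squared standard deviations of the principal components;
# the Kaiser-Guttman rule retains components with eigenvalues greater than 1.
# `X` is a placeholder numeric matrix of item scores (this simplified sketch
# works from raw scores rather than a polychoric matrix).
pca  <- prcomp(X, center = TRUE, scale. = TRUE)
eig  <- pca$sdev^2                     # eigenvalues
prop <- eig / sum(eig)                 # proportion of variance explained

round(cbind(eigenvalue = eig, proportion = prop), 3)
which(eig > 1)                         # components retained under the rule
```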
The vast majority of items load most strongly on PC 1, with only 1 item each loading most strongly for PC 2 and PC 3, as shown in the figures. 105 Figure 20 Loadings Bar Plots for Item Cluster S10 with Full Subsample Figure 21 visualizes the three principal components in a 3D space based on the PCA scores. In general, PCA scores are developed by taking the original measurement and multiplying it by its eigenvector coefficient, which indicates the contribution of each measurement, and then additively summarizing all those values to get the score on any axis (Carr, 2001). However, Carr (2001) indicates that when a coefficient matrix, like this study uses, is used in place of a covariance matrix to run the PCA the calculated results will not match the results computed in a R program. Note, the low redundancy - data are spread apart and weakly 106 correlated. The two items isolated in PC 2 and PC 3 will be dropped as outliers in the second round of IRT analyses for item cluster S10, as no single item ever serves as a full measure in IRT. Multidimensionality does not seem to be indicated in this item cluster as the majority of variance (78.8%) is explained by PC 1. An interactive graph with rotation and data point hovering is available at https://chart-studio.plotly.com/~cnmalcom/3. Figure 21 PCA Plot for Item Cluster S10 with Full Subsample Note. Blue dots are items loading mainly on PC 1, while red dots indicate items loading mainly on PC 2, and green dots indicate items mainly loading on PC 3. 107 A second analysis for the random half subset of the set of items in cluster S10 was used to help confirm PCA results. Figure 22 provides bar plots of the three PCA loadings (eigenvectors) explaining the most variation. Principal components 4-15 were dropped as explained above. Only for PC 1 (2.18) was the eigenvalue greater than 1, PC 2 (0.11) and PC 3 (0.1) were much smaller. The majority of items load most strongly on PC 1. Figure 22 Loadings Bar Plots for Item Cluster S10 with Random Half of Subsample 108 Figure 23 visualizes the three principal components in a 3D space based on the PCA scores. Again, the plot shows low redundancy - data are spread apart and weakly correlated. The same two items from the earlier PCA of the full subsample from item cluster S10 are still showing up independently on PC 2 and 3. Multidimensionality does not seem to be indicated as the majority of variance (80.1%) is explained by PC 1. This plot, as expected, is very similar in substance to Figure 21. An interactive graph with rotation and data point hovering is available at https://chart-studio.plotly.com/~cnmalcom/5. Figure 23 Confirmation PCA Plot for Item Cluster S10 with Random Half of Subsample 109 A third PCA for the set of items in cluster S11 was run. Figure 24 provides bar plots of the three PCA loadings (eigenvectors) explaining the most variation. Principal components 4-15 were dropped as they all explained 1% or less of the variation. Only for PC 1 (2.86) was the eigenvalue greater than 1, PC 2 (0.1) and PC 3 (0.05) were much smaller. The majority of items load on PC 1. Figure 24 Loadings Bar Plots for Item Cluster S11 110 Figure 25 visualizes the three principal components in a 3D space based on the PCA scores. Again, the plot shows low redundancy - data are spread apart and weakly correlated. Multidimensionality does not seem to be indicated as the majority of variance (86.7%) is explained by PC 1. 
While more items are loading on PC 2 and PC 3 when only partial data are used, they do not appear to be in separate dimensions and do not align with science subdomains or with types of knowledge identified by OECD in the 2015 science framework. An interactive graph with rotation and data point hovering is available at https://chart- studio.plotly.com/~cnmalcom/7. Figure 25 PCA Plot for Item Cluster S11 111 RQ2C: IRT Results Both model fit and information from IRT analyses are provided in this section. Table 9 provides the subdomain groupings of items for each item cluster that were used in the multidimensional models. Physical System items were coded as Dimension 1, Earth and Space Systems as Dimension 2, and Living Systems as Dimension 3. Table 9 Item Groupings for MIRT Models Cluster S10 Cluster S11 Item Type of Type of Knowledge Subdomain DOK Item Knowledge Subdomain DOK DS625Q01C Content Physical L CS643Q01S Procedural Physical M CS625Q02S Content Physical L CS643Q02S Procedural Physical M CS625Q03S Content Physical M DS643Q03C Content Physical L CS604Q02S Content Physical M CS643Q04S Procedural Physical M DS604Q04C Epistemic Physical M DS643Q05C Epistemic Physical M CS629Q02S Content Physical M CS615Q07S Procedural Earth and Space M CS629Q04S Epistemic Physical M CS615Q01S Procedural Earth and Space M CS615Q02S Procedural Earth and Space M DS648Q01C Procedural Earth and Space M CS615Q05S* Epistemic Earth and Space M CS648Q02S Procedural Earth and Space M CS645Q03S Content Earth and M CS648Q03S Procedural Earth and Space Space M DS645Q04C Content Earth and Space M DS648Q05C Epistemic Earth and Space M DS645Q05C Content Earth and Space M DS629Q03C Procedural Earth and Space M CS657Q01S* Content Living L CS656Q01Sê Content Living M CS657Q02S Content Living M DS656Q02Cê Procedural Living H CS657Q03S Procedural Living M CS656Q04Sê Procedural Living M Note. *Items that were dropped from Models 3B and 4b. êReleased item set. DOK = Depth of Knowledge: Low (L), Medium (M), High (H), developed by Webb in 1997 per OECD (2017a). Cluster S10. Importantly, since the two items CS657Q01S and CS615Q05S each loaded in a separate dimension by themselves, the model fit analyses were conducted both with and 112 without these items in the item cluster S10 as shown in Table 10. Based on item difficulties for Model 1, these unusual items also have unusual characteristics: item CS615Q05S (xsi = 1.79) seems to be the second hardest item, item DS645Q05C (xsi = 2.08) being the hardest, and item CS615Q01S (xsi = -2.01) ranked the easiest while item CS657Q01S (xsi = -1.34) was the second easiest. This holds true for Models 2, 3, and 4 while the difficulty levels vary slightly. Item infit (information-weighted fit) statistics for Model 1 showed that some items were slightly underfit and some slightly overfit when compared to the desired value of 1, but all were well within usual tolerances67 of 0.70 to 1.30 (blue dashed lines in Figure 26). Figure 26 shows the infit statistic for each item with the desired expected value of 1.0 (green line). The two outlier items are marked in red. Note that items below 1.0 may have too predictable of responses and those greater than 1.0 may have responses that are too noisy, i.e., the excess variation may be masking what is a good model of the response pattern (Wind & Hua, 2021). 67 These tolerances were adopted based off usage in a similar study by Pensavalle and Solinas (2013). 
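To show how infit values of this kind can be computed and screened against such tolerances, here is a minimal R sketch (assuming a hypothetical fitted TAM model object mod1, not the study’s actual object):

```r
# Item infit from a fitted TAM model, flagged against the 0.70-1.30 tolerance
# band noted above; values below 1 suggest overly predictable responses and
# values above 1 suggest noisy responses. `mod1` is a placeholder object.
fit <- TAM::tam.fit(mod1)
infit <- fit$itemfit$Infit

data.frame(item = fit$itemfit$parameter,
           infit = round(infit, 3),
           flagged = infit < 0.70 | infit > 1.30)
```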
Figure 26
Infit Statistics for 1PL UIRT Model of Item Cluster S10 with Full Subsample

There were 6 items whose infit statistic was significant at p < 0.05 for Models 1 and 3. Infit statistics remained close to the desired value of 1 for Models 2 and 4, and none were statistically significantly different. For the models with items dropped, 3 items (Model 1b) and 2 items (Model 3b) had a significant infit statistic, respectively.

Table 10
Model Fit Indices for Comparison of Relative Model Fit – Item Cluster S10 Subsample

Full set of 15 items
                                   Model 1         Model 2         Model 3         Model 4
                                   (1PL UIRT)      (2PL UIRT)      (1PL MIRT)      (2PL MIRT)
Deviance (-2 log-likelihood)       21,630.16       21,456.08       21,561.67       21,426.74
Number of estimated parameters     16              30              21              33
AIC (constraint on students)       21,662          21,516          21,604          21,493
BIC                                21,745          21,671          21,712          21,664
Iterations                         15              21              428             151
EAP reliability                    0.74            0.76            NA              NA
  Dim 1                            NA              NA              0.725           0.735
  Dim 2                            NA              NA              0.729           0.743
  Dim 3                            NA              NA              0.715           0.657
Infit range                        0.915 to 1.159  0.989 to 1.015  0.905 to 1.146  0.985 to 1.011

Without items CS657Q01S and CS615Q05S
                                   Model 1b        Model 2b        Model 3b        Model 4b
                                   (1PL UIRT)      (2PL UIRT)      (1PL MIRT)      (2PL MIRT)
Deviance (-2 log-likelihood)       18,948.83       Not run         18,925.41       Not run
Number of estimated parameters     14                              19
AIC (constraint on students)       18,977                          18,963
BIC                                19,049                          19,062
Iterations                         17                              389
EAP reliability                    0.75                            NA
  Dim 1                            NA                              0.736
  Dim 2                            NA                              0.728
  Dim 3                            NA                              0.635
Infit range                        0.929 to 1.001                  0.933 to 1.081

Note. The green highlighted model is the most parsimonious when considering model statistics and guidelines.

Improvement of fit was determined for each model pair and is shown in Table 11, which can be interpreted for each pair of models as "from Model X to Model Y there is a Z% improvement in model fit." Improvement of fit was determined with the following formula: ((Model X Deviance – Model Y Deviance) / Model Y Deviance) * 100. For example, from Model 1 to Model 3 this is ((21,630.16 – 21,561.67) / 21,561.67) * 100 ≈ 0.3%. Overall, the improvement in fit for Model 1 (1PL UIRT) compared to Model 3 (1PL MIRT), the three-dimensional model by content area, is only 0.3%, which may not be meaningful enough for country-level results. In other words, at least for reporting at the group level for which representative samples were drawn (country in this case), the difference made by carrying forward all the extra parameters of the more complex model would not make a practical difference. Note that models for the full set of items were not compared with the models lacking 2 items; the difference in items means those models are not nested.

Table 11
Comparison of Model Fit – Item Cluster S10 Subsample

Model pair            1 to 2   1 to 3   1 to 4   3 to 2   2 to 4   3 to 4   1b to 3b
Improvement of fit    0.8%     0.3%     0.9%     0.5%     0.1%     0.6%     0.1%

Goodness of fit was also determined using a chi-square test to compare sets of models, with the significance level 𝛼 set to 0.05. The models are compared below.

• Model 1 and Model 2: Χ2 (14, N = 1,306) = 174.09, p = 0, so the null hypothesis that both models are the same can be rejected and Model 2 is a significantly better fit based on lower deviance, AIC, and BIC statistics.
• Model 1 and Model 3: Χ2 (5, N = 1,306) = 68.50, p = 0, so the null hypothesis that both models are the same can be rejected and Model 3 is a significantly better fit based on lower deviance, AIC, and BIC statistics.
• Model 1 and Model 4: Χ2 (17, N = 1,306) = 203.43, p = 0, so the null hypothesis that both models are the same can be rejected and Model 4 is a significantly better fit based on lower deviance, AIC, and BIC statistics.
• Model 2 and Model 3: Χ2 (9, N = 1,306) = 105.59, p = 0, so the null hypothesis that both models are the same can be rejected and Model 2 is a significantly better fit based on lower deviance, AIC, and BIC statistics.
• Model 2 and Model 4: Χ2 (3, N = 1,306) = 29.34, p < 0.0001, so the null hypothesis that both models are the same can be rejected and Model 4 is a significantly better fit based on lower deviance, AIC, and BIC statistics.
• Model 3 and Model 4: Χ2 (12, N = 1,306) = 134.93, p = 0, so the null hypothesis that both models are the same can be rejected and Model 4 is a significantly better fit based on lower deviance, AIC, and BIC statistics.
• Model 1b and Model 3b: Χ2 (5, N = 1,306) = 23.42, p < 0.0001, so the null hypothesis that both models are the same can be rejected and Model 3b is a significantly better fit based on lower deviance and AIC statistics (BIC was actually higher for Model 3b).

Cluster S11. Importantly, since the 1PL UIRT model was found to be the better fit for item cluster S10, only it and the 1PL MIRT model were compared for this item cluster, as confirmation of unidimensional model fit. The IRT model fit analysis is shown in Table 12. While Model 6 appears to be the better fitting model based on the lower AIC and BIC statistics, it requires more iterations to converge, and substantially more parameters are employed without making much practical difference in results at the group reporting level for this assessment. Model 5 has 9 items and Model 6 has 8 items whose infit statistic was significant at p < 0.05.

Table 12
Model Fit Indices for Comparison of Relative Model Fit – Item Cluster S11 Subsample

Full set of 15 items
                                   Model 5          Model 6
                                   (1PL UIRT)       (1PL MIRT)
Deviance (-2 log-likelihood)       21,530.46        21,467.69
Number of estimated parameters     16               21
AIC (constraint on students)       21,562           21,510
BIC                                21,645           21,618
Iterations                         25               474
EAP reliability                    0.8              NA
  Dim 1                            NA               0.727
  Dim 2                            NA               0.752
  Dim 3                            NA               0.794
Infit range                        0.888 to 1.152   0.912 to 1.132

Note. The green highlighted model is the most parsimonious when considering model statistics and guidelines.

Improvement of fit was also determined for this model pair: Model 6 shows a 0.3% improvement over the unidimensional Model 5. For the goodness-of-fit chi-square test, Model 5 and Model 6: Χ2 (5, N = 1,306) = 62.78, p = 0, so the null hypothesis that both models are the same can be rejected and Model 6 is a significantly better fit based on lower deviance, AIC, and BIC statistics.

Item Fit Analyses. Using the best fitting model (Model 1b – 1PL UIRT) for item cluster S10 (full subsample), several analyses of item fit were completed. Figure 27 shows the ICCs for the 13 items of item cluster S10 – the blue line is the model's expected score curve and the black line is the actual score curve. These plots showcase the relationship between the latent trait (student ability) and the probability of an expected correct response (Wind & Hua, 2021).

Figure 27
ICC Plots for Item Cluster S10 with Full Subsample
Note. Based on Model 1b; items that were dropped are not shown.

In Figure 28, the left side of the map provides a histogram of student ability (latent trait). The right side of the map provides item difficulty, with harder items near the top of the map and easier items near the bottom. The distribution of difficulty shows a reasonably good spread given the location of the person respondents, although more very easy items might be helpful given the large number of respondents performing below the level of the bulk of the items. A brief sketch of how these displays can be generated follows.
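As a companion to the item fit displays just described, the following is a minimal sketch, in R, of one way expected score curves and a Wright map can be drawn from a fitted TAM model using the WrightMap package (Irribarra & Freund, 2014). The simulated data, object names, and plotting arguments are illustrative assumptions; the figures in this study may have been produced with different settings.

# Illustrative sketch: expected score curves and a Wright map for a 1PL UIRT fit (simulated data)
library(TAM)
library(WrightMap)

set.seed(1)
resp <- matrix(rbinom(1306 * 13, 1, 0.6), ncol = 13,
               dimnames = list(NULL, paste0("item", 1:13)))  # stand-in for the 13 retained items

mod1b <- tam.mml(resp)             # stand-in for Model 1b (1PL UIRT, reduced item set)

# Expected score curves: model-implied curve overlaid with the observed curve per item,
# analogous to the ICC plots in Figure 27
plot(mod1b, items = 1:13, export = FALSE)

# Person ability estimates (theta) and item difficulties (xsi) for the Wright map
persons <- tam.wle(mod1b)          # weighted likelihood estimates of ability
difficulties <- mod1b$xsi$xsi

# Left panel: histogram of person ability; right panel: item difficulty locations
wrightMap(persons$theta, difficulties, label.items = rownames(mod1b$xsi))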
121 Figure 28 Wright Map for Item Cluster S10 with Full Subsample Note. Based off Model 1b and items that were dropped are not shown. Triangulation Extending on the earlier proposed qualitative education theory: distinct science subdomains require students to have differentiated knowledge to demonstrate mastery of each subdomain, a multidimensional model should best fit student data to accurately portray student abilities. Since this did not occur, a divergent triangulation is shown in Figure 29 as 122 qualitative analysis did reveal multiple content dimensions. Broken orange arrows indicate propositions that did not get proven by empirical evidence in the theoretical level. The curvy green arrow indicates that Proposition 1 should have directly led to Proposition 2 but was not proven by findings from the quantitative analysis. Blue arrows indicate empirical findings that did support each proposition. Note that the two broken arrow claims both rely on little separation by the more complex models, relative to the reporting claims. Figure 29 Triangulation of Results Note. Adapted from “Combining qualitative and quantitative research within mixed method research designs: A methodological review,” by U. Östlund, L. Kidd, Y. Wengström, and N. Rowa-Dewar, 2011, International Journal of Nursing Studies, 48(2011), p. 371. Copyright 2010 by Elsevier. The IRT analyses did statistically support Proposition 2 for both S10 and S11 subsamples but not with practical significance. The initial exploratory analyses offer some clues, such as many 123 weakly correlated dimensions but not rising to the level of three-dimensional modeling or aligning with the theoretical level being assessed in this dissertation, for any of the subsamples. Results Relating to RQ3 Figure 30 shows histograms for student ability level for both the 1PL UIRT and 1PL MIRT, Models 1b and 3b respectively, for item cluster S10. The number of students is greatest in the middle range of abilities for all models. For Models 1b and 3b dimension 1 the ability levels ranged from -2 to 2 while for Model 3b dimensions 2 and 3 they ranged from -3 to 3. Please note that the software does not allow dimensions to be directly compared as they are not aligned in this software, so do not make this comparison although the charts are located near each other. Figure 30 Histograms of Student Ability Levels for Item Cluster S10 124 Note. Items CS657Q01S and CS615Q05S were not included in the models shown above. 125 CHAPTER 4: DISCUSSION Analysis of educational data cannot exist in a silo apart from the education content being measured. What the construct is and how students are asked to learn and master it matter. Assessment data from students need to be modeled by researchers in a meaningful manner that incorporates these aspects. Often, assessment developers and educators start with an education standard and work towards developing an assessment/task to measure student performance rather than beginning with an understanding of the content framework and how students learn the subject (Claesgens et al., 2008; NRC, 2001). This study aimed to marry both a qualitative review of the 2015 science framework’s content and how that content was learnt by students to a quantitative analysis of which model would be most appropriate for the assessment data. 
Major themes of the study’s results are presented in this chapter, along with providing the study’s limitations, threats to the study’s validity and reliability, avenues for future research, and policy recommendations for stakeholders. Study Overview The literature synthesis revealed that accurately assessing science learning is still a relevant need in the U.S. In addition, the U.S. education system is still failing to improve gaps in science performance associated with a student’s economic status and their race (Corcoran et al., 2009). Science education in U.S. schools continues to be centered around separate subject subdomains: life, physical, Earth and space, etc., which also drives how science assessments are formed. These science subdomains have been successfully modeled as multidimensional for other assessments, but some assessments like PISA still model science unidimensionally while reporting on student achievement in a multidimensional space – see Appendix C. UIRT models 126 are often more familiar and interpretable to large-scale assessment developers (Lang & Tay, 2021). This study was undertaken to determine if a mixed methods approach could reveal triangulated evidence of multidimensionality in the 2015 PISA science and how inaccurate modeling might impact equity issues relevant to students. Key Takeaways 1. Qualitative analysis yielded a multidimensional view of science content in the 2015 PISA science framework by determining the science subdomains of living, physical, and Earth and space systems were developed as separate content dimensions. 2. Quantitative analysis yielded a 1PL UIRT model as the most practical fit for the 2015 PISA science assessment student data. This outcome was confirmed by PCA and cluster analysis results. Even though the 1PL MIRT model was statistically significant, it offered little improvement of fit for the inferences being made in PISA. 3. The equity investigation was limited by available data yet yielded one result of how using a unidimensional model instead of a MIRT model might negatively impact marginalized student groups by combining students into fewer ability levels, which leads to a loss of information about how students are performing in specific science subdomains. A Lack of Synergy Between Results Triangulation of results illustrated that the MIRT model advocated for by a qualitative analysis of the framework was not supported by quantitative analysis, which indicated the 1PL UIRT model was the most practical model in terms of using fewer parameters and viewed in the light of the MIRT model not improving fit substantially. Following is a discussion of what factors 127 might cause this disconnect between what is described in the framework and how the data end up being modeled. These factors are outside the scope of this study’s research questions. Reckase (1989, p. 9) stressed that “test items may require more than one cognitive skill for successful solution but still generate a statistically unidimensional data set through the interaction with a population that varies on many dimensions.” This might be seen here in the weak correlations shown in the descriptive statistics. One factor impacting multidimensionality of science could be that “most decisions about instruction and curriculum sequences in science have not been guided by a long-term understanding of learning progressions that are grounded in the findings of contemporary cognitive, developmental, education, and science studies research” (NRC, 2001; NRC, 2007, p. 214). 
Middle and high schools in the U.S. continue to offer science courses as distinct units, and science subdomains are infrequently integrated (Enger & Yager, 2009). Until NGSS was released in 2013 and advocated for crosscutting concepts and science practices that apply across science subdomains (NGSS Lead States, 2013), science learning focused mainly on recall of discrete facts. In 2015, when PISA last administered science as a major domain, the impact of NGSS on curriculum would not yet have been widespread across the different states. This means students may still have been responding to items on the science assessment via recall of facts rather than actually demonstrating mastery of science concepts. Use of memory to recall facts might be a separate dimension (Leigh et al., 2006) from the science subdomains, but this is outside the scope of the investigation here.

Another factor is hinted at by findings from a research study set in Brazil, which noted for that country that PISA scores could be affected by higher participation rates of students who had completed more school years (Gomes et al., 2020). If students of a similar age/grade level all have a similar ability level, as evidenced by their scores, that might also affect whether a construct shows up as multidimensional. One study showed that even though IRT analysis should be sample independent for the estimation of item characteristics, "the stability of these estimations is enhanced when the sample is heterogeneous with regard to the latent trait" (Osteen, 2010, p. 79), which is less likely in a student sample whose members are homogeneous in coursework taken and age.

A third factor that could impact or mask multidimensionality is general ability. Pokropek et al. (2022) found that only 17% of the variance in science items is attributable to a specific science ability latent trait, while the rest can be attributed to general ability (sometimes considered as working memory or another cognitive trait), based on a study of 33 OECD countries taking the 2018 PISA. A strong correlation of 0.88 exists between standard achievement tests that measure intelligence and PISA (Pokropek et al., 2022). This could indicate that PISA is measuring a student's general intelligence or even their test taking skills rather than the subdomains outlined in the 2015 science framework. While these are outside of the scope of the research questions here, it is known that aspects such as test taking skills can present as an "unwanted dimension of performance" (Scalise & Gifford, 2006). Even the socioeconomic factors measured by OECD, such as the number of books owned by the family, may correlate more strongly with general ability (Pokropek et al., 2022).

Overview of Released Item Set

Since the qualitative analysis of the framework indicates the science subdomains should be separate dimensions, a deeper look into what a released item set (see Figures 31-33)68 reveals about the science content versus general ability requirements of the items being assessed in 2015 PISA science is warranted. If items only ask students to recall science facts, eliminate options based on test taking skills, or use their general intelligence to respond, then this could eliminate the content-related multidimensionality being explored here that seems required by the framework.
Note that this qualitative review of item content is based on one set of items as the other science items analyzed in this study were not released, leading to the caveat that other items may have more rigorous ties to science subdomains. The first item of the Bird Migration set shown in Figure 31 (OECD scoring information provided directly below) seems to be a simple recall of the definition of natural selection, a mechanism of evolution. The item stimulus provides the key term evolution to solicit recall. The item also requires a level of detail about bird migration that is not normally taught in high school since option D (a second correct response) expects students to know if a species of bird can have a better chance at finding nesting sites, which Robins69 do by having more experienced flock members lead less experienced to good nesting sites. OECD (n.d.-c) has tied this item to content knowledge of living systems, most likely to the standard “Populations (e.g. species, evolution, biodiversity, genetic variation)” – more detail on what aspect/s of evolution can be included in the 2015 PISA science are not provided by OECD. This item is also coded as a 68 Images from OECD (n.d.-c). 69 See Robin Migration website https://journeynorth.org/tm/robin/facts_migration.html with contributions by an ornithologist (a scientist who studies birds). 130 DOK of medium by OECD (n.d.-c), which typically requires more from a student than recall of facts. From an assessment developer perspective, I do not see knowledge of living systems being applied in this item. Figure 31 Bird Migration Item 1 from Item Cluster S11 131 The second item of the Bird Migration set shown in Figure 32 (OECD scoring information provided directly below) is of the constructed response item type. The item expects students to draw on procedural knowledge of the correct process for scientific experiments that could apply to any science subdomain. OECD (n.d.-c) has tied this item to procedural knowledge of living systems, most likely to the same standard – a specific aspect of the “Populations” standard was not able to be teased out for this item. This item is also coded as a DOK of high by OECD (n.d.-c), which could be because the student has to make an analytical connection to factors that would invalidate a scientific investigation. From an assessment developer perspective, this item seems tied to the living systems subdomain merely because it mentions living animals, i.e., birds, but does seem to be a higher DOK, although the format effect between selected and constructed response might require a higher literacy DOK, which could be another source of dimensionality although not theoretically intended based on the PISA science framework analyzed. 132 Figure 32 Bird Migration Item 2 from Item Cluster S11 The third item of the Bird Migration set shown in Figure 33 (OECD scoring information provided directly below) is another multiple-choice item like Item 1 but with a multi-select option technology enhancement. Notice that the item’s main stimulus has changed to a narrative about a specific bird species and the online version of this computer-based item set 133 does not seem to allow for students to return to the original stimulus. This could confuse some students and lead to a disconnect with the item. The item again expects students to draw on procedural knowledge, but this time tied to what seems to be actual research data from a living system. 
OECD (n.d.-c) has tied this item to procedural knowledge of living systems, most likely to the same standard – migrations of populations would fit into the “Populations” standard’s example list. This item is also coded as a DOK of medium by OECD (n.d.-c), which this item seems to meet as the data must be interpreted by the student. From an assessment developer perspective, this item out of the set is most closely tied to the living systems subdomain yet focuses on a skill that could be used in any science subdomain – the interpretation of graphical data. Hence once again, subdomain multidimensionality could be masked in this example. Based on just this item set, the application of general intelligence ability and perhaps a dimension of science procedural knowledge does seem to outweigh application of concepts required by each science subdomain in the 2015 PISA science framework. PISA’s science procedural knowledge might be similar to “science practices” in the U.S. NGSS and PISA science epistemic knowledge similar to cross-cutting concepts, bringing these more unifying ideas into play and masking some subdomain multidimensionality. If other items are also disconnected from the subdomains this could help explain the practical unidimensionality of item clusters S10 and S11. 134 Figure 33 Bird Migration Item 3 from Item Cluster S11 135 Alternate Sources of Multidimensionality • OECD (2017b) notes there is possible dimensionality between new and trend science items and claims that a UIRT model provided a better fit, yet the fit statistics (AIC and BIC) provided in the technical report show the MIRT model as statistically better with slightly more improvement of fit. This would be for the linked set of items across clusters, not possible to explore here. This is outside the themes being explored and was not codable but is consistent with the results found here for the clusters examined. • In the 2015 science framework OECD (2017a) provides several types of knowledge, each of which could be a separate dimension. Only content knowledge was examined in this study, but procedural knowledge and/or epistemic knowledge may each be a dimension similar to the three-dimensional aspect of NGSS, as discussed earlier. • Item format, specifically the item’s stem (introductory sentence) has been found to impact multidimensionality of a math assessment (Kan et al., 2018). This could also be a source of multidimensionality in science assessments, as discussed earlier. Also, science content often uses math expressions in items, so the math issue like literacy discussed earlier could be involved in the weak correlations seen in the data exploration. • Scientific inquiry requires a diverse skill set from students, such as creativity and ability to question. This is speculative based on data and research questions here, and is not explored in the literature review, but such broader issues could contribute if present so that many, but not separable, dimensions might be in play as discussed earlier. These do not seem aligned with the sources of dimensionality explored directly in the analysis. 136 Impact On Equity Referring back to the findings and inferences drawn from the available data RQ3 could be better articulated as: • What inferences can be made about equity in education based on model fit and range of student ability per different subdomains? This rephrasing of the research question was necessitated by the lack of publicly available demographic data. 
Therefore, only inferences about how model fit might impact equity could be made. The NRC (2012, p. 277) advocates that a “crucial role of a framework and its subject matter standards is to help ensure and evaluate educational equity” and “that all students should have adequate opportunities to learn”. We should extend that to modeling data accurately with an eye towards the best model fit so all students have their performance modeled equitably. While PISA aims to impact policy at a broader national scale than the student level (Froese-Germain, 2010) the outcome of mismodelling student data can still be that stakeholders redirect resources to away from student groups and subdomains of science education that really need them. As mentioned earlier, OECD did not make publicly available information on race/ethnicity of students for the 2015 PISA. This lack of data impacts stakeholder understanding of the diversity of the student population and disables researchers attempts to search for equity issues in student performance. We know that the U.S. sample was drawn to be representative, within the limitations for missing data described earlier, so subgroups must be present, and at representative rates, but they are not disaggregated in the PISA data shared. 137 Therefore, results from this investigation were limited to student ability level as evidenced by theta and how the range of ability levels changed between the UIRT and MIRT models, for the full sample. Future work should explore educational technology data sets or others internal to the U.S. where individual inferences are made and disaggregated samples might be available in science, to see for whom the impact of practical significance might be most meaningful if statistically significant multidimensionality that is separable is present but neglected in the modeling. One issue can be explored here. When looking at Figure 30 and comparing the UIRT model to the MIRT model, one can see how the students are now compacted into fewer bins for the MIRT model. This may be doing a disservice to those students who do not have access to the higher-level science classes or are challenged by economic constraints of where they live (e.g., lack of internet or science labs). These underrepresented groups tend to be minorities – see Figures 32-34 of Appendix A. This lack of access or differential access to educational content continues to be a concern regarding a lack of advancement in the U.S. on science achievement by students (Vasquez, 2006). Limitations There were numerous limitations to this study. No linking information was released to examine all the items together thus limiting the size of sample. OECD did not document why data were missing for U.S. cases in the released sample, or provide additional information about them, so there was no additional information to report. Only one set of items from a subsample was released so a thorough qualitative review of the items was not possible. A lack of publicly available data on the diversity of the sample inhibited the equity investigation 138 required to answer RQ3. This absence of this information was a policy decision by some participating countries – in particular, the NCES controls access to ethnicity data provided by OECD for the U.S. and only makes it available unlinked and through a restricted use license, which were not the focus of this dissertation. Response rates were low for trend items so only new items were analyzed. 
Because the 2015 assessment cycle was the first fully digital science assessment, the low response rates could be due to trend items being mainly paper based while new items were developed natively as computer based, and to more schools expecting electronic delivery of assessments.

Threats to Validity and Reliability

The model's generalizability may be limited due to sample size, which may be restricted by the elimination of students with missing data. Therefore, external validity may be impacted, and the results may not be generalizable to various student populations with differing demographics in the U.S., especially minority groups that tend to have smaller population sizes. This calls into question whether the U.S. sample design was adequate. Next, as with most assessments, there is always the question of whether what is being measured, such as the subdomain of life science, remains the same across clusters of variable content. Without any linking information available and no common form among all students, instrumentation also threatens the internal validity of this study. "A test is valid for measuring an attribute if (a) the attribute exists and (b) variations in the attribute causally produce variation in the measurement outcomes" (Lang & Tay, 2021, p. 328). Relatedly, students had access to prior versions of science items in the form of released items, which introduces the threat of testing, where exposure to a pre-test similar to the final test might influence their outcomes as reported in the PISA results. Finally, opportunity-to-learn threats to internal validity could exist since students who were assessed varied in their learning stages by having taken different science courses. An important note is that because there was not a control group and students were not divided into groups (only by country) for the analysis, threats to validity associated with potential differences in group membership/selection bias are not relevant to PISA's sampling. There are different global windows of testing for PISA, which could indicate a maturation threat, but the window of testing for the U.S. was fairly stable. Changes to the environment in which the assessment was taken and changes in student behavior were not shared by OECD.

With regard to reliability, a key threat is researcher error regarding the number and construct type of the dimensions identified in the qualitative analysis. This error was mitigated by having a reviewer familiar with the construct analyze the results. No agreement indices were collected for the qualitative analysis.

Future Research

Since the qualitative and quantitative analyses led to divergent results, the question remains of how to accurately model the science content domain when its subdomains appear to be separate dimensions. Determining how the dimensionality of science content affects assessments and their items is crucial to developing assessments that yield information equitably for diverse student subgroups. The following are recommendations for future research in this area:

• Since polytomous items were dropped from the study and only dichotomous items were analyzed, a future analysis needs to include a more complex model that covers both types of items.
• A structural equation model might be implemented if more connections are found between items based on their subdomain content.
• The MIRT model in this study was developed for a between-item representation of multidimensionality – see Figure 6 (from Li et al., 2012).
Baghaei (2012) provides an overview of a MIRT model for within-item dimensionality – see Figure 34 below. A within-item MIRT model should be tested to see if it provides better fit since some of the standards in the 2015 PISA science framework were found to load on each other during the qualitative analysis – see Figure 11. This relates to the expectation that students should be able to apply science content to either interdisciplinary or independent science items/tasks (Mostafa et al., 2018; OECD, 2017a). 141 Figure 34 Differences in Between-item and Within-item MIRT Models Note. From “The application of multidimensional Rasch models in large-scale assessment and validation: An empirical example,” by Purya Baghaei, 2012, Electronic Journal of Research in Educational Psychology, 10(1), p. 239. Copyright 2012 by Electronic Journal of Research in Educational Psychology. Policy Recommendations The recommendations below are targeted towards several groups of science education stakeholders: researchers, learning scientists, policy makers, assessment and curriculum developers, and educators themselves. With the goal of clearer understanding of how science assessments should be modeled to increase equity for all students here are some crucial areas still needing attention: • Prior researchers have advocated using an UIRT model simply because PISA has used this model in the past - see the Davier et al. (2019) study on model fit using PISA data. While following a similar methodology might increase reliability of results it does not 142 address equity concerns and new models should be analyzed to find the fit that provides accurate information on science performance for every student. • At what point is an increase in fit too small if it can be justified by the help it provides the group existing in that range? Quantitative education researchers should review if model fit indices can be tailored to specific subgroups of the population. • Developers of science standards for the U.S. should compare NGSS dimensionality and that found in PISA framework to identify areas where there is a content match or disconnect. These comparisons could guide how items are coded to a dimension in MIRT models • For future PISA cycles, that more information such as linking be released, or that a similar study be conducted internally and a report released. • OECD should consider making ethnicity data available publicly to increase transparency of results. This data release can be done on a nation-by-nation basis if requested. • OECD assessment designers should release item specifications if available or build them to guide interpretation of future science framework standards. • Learning scientists should identify any unique constructs for each science subdomain with the goal of describing how students develop mastery in each area. Conclusions In order to ensure we measure what is intended, both educators and researchers will benefit from digging deeper into what constitutes each science subdomain. If the science subdomains are truly individual constructs, then our scoring models should follow the framework’s scaffolding, or we should identify what dimensionality aspect is more 143 predominantly being assessed. While large-scale assessments provide countries with a wealth of data, they may be doing harm to educational equity based on use inferences that routinely affect national educational objectives and school policies. 
In other words, we should not make claims about what is measured when we did not accurately differentiate the constructs or chose a model based on its usability only. As the U.S. moves forward with more three- dimensional science learning via NGSS how to model those constructs in a multidimensional space will become more critical. To begin evaluating these aspects and fulfill the proposed policy recommendations above, a good start, to coin a phrase from Castillo & Gillborn (2023), is to “democratize evidence.” Without complete sets of data, researchers are limited in verifying or building upon previous research and new discoveries cannot be made. Throughout this study it was clear that more complete student demographic data was needed to validate the equity inferences being drawn from the conclusion that there was model misfit based on the qualitative and quantitative findings. 144 REFERENCES Alesina, A., Devleeschauwer, A., Easterly, W., Kurlat, S., & Wacziarg, R. (2003). Fractionalization. Journal of Economic Growth, 8(2), 155-194. https://www.nber.org/papers/w9411 American Educational Research Association. (2014). Standards for Educational and Psychological Testing. Armstrong, C. (2021). Key methods used in qualitative document analysis. SSRN eLibrary. https://ssrn.com/abstract=3996213 Atkisson, M. (2010, 2010-10-15). Social Negotiation as a Central Principle of Constructivism. Ways of Knowing. https://woknowing.wordpress.com/2010/10/14/social-negotiation- as-a-central-principle-of-constructivism/ Ayala, R. J. (2022). The theory and practice of item response theory (2nd edition). The Guilford Press. Baghaei, P. (2012). The application of multidimensional rasch models in large scale assessment and validation: An empirical example. Electronic Journal of Research in Educational Psychology, 10(1), 233-252. Baškarada, S. & Koronios, A., (2018). A philosophical discussion of qualitative, quantitative, and mixed methods research in social science. Qualitative Research Journal, https://doi.org/10.1108/QRJ-D-17-00042 Boon, M., Orozco, M., & Sivakumar, K. (2022). Epistemological and educational issues in teaching practice-oriented scientific research: Roles for philosophers of science. European Journal for Philosophy of Science, 12(16), 1-23. Bowen, G. A. (2009). Document analysis as a qualitative research method. Qualitative 145 Research Journal, 9(2), 27-40. Brandt, S. (2015). Unidimensional interpretation of multidimensional tests. [Doctoral dissertation, Christian-Albrechts-Universität zu Kiel]. Briggs, D. C. & Wilson, M. (2003). An introduction to multidimensional measurement using Rasch models. Journal of Applied Measurement, 4(1), 87-100. Broesch, T., Crittenden, A. N., Beheim, B. A., Blackwell, A. D., Bunce, J. A., Colleran, H., Hagel, K., Kline, M., McElreath, R., Nelson, R. G., Pisor, A. C., Prall, S., Pretelli, I., Purzycki, B., Quinn, E. A., Ross, C., Scelza, B., Starkweather, K., & Stieglitz, J. (2020). Navigating cross- cultural research: methodological and ethical considerations. Proceedings B The Royal Society, 287(20201245), 1-7. Brooks-Bartlett, J. (2018). Probability concepts explained:Maximum likelihood estimation. Towards Data Science. Carpiano, R. M., & Daley, D. M. (2006). A guide and glossary on postpositivist theory building for population health. Journal of Epidemiology and Community Health, 60, 564-570. doi: 10.1136/jech.2004.031534 Carr, S. M. (2001). Interpreting a principal components analysis - Theory & practice. Memorial University: Biology – Faculty of Science. 
https://www.mun.ca/biology/scarr/2900_PCA_Analysis.htm Castillo, W. & Gillborn, D. (2023, September). How to “QuantCrit:” Practices and Questions for Education Data Researchers and Users. (EdWorkingPaper No. 22-546). https://doi.org/10.26300/v5kh-dd65 Center for Professional Education of Teachers. (n.d.). Equity and Assessment. 146 https://cpet.tc.columbia.edu/news-press/equity-and-assessment Civil Rights Data Collection. (2023, November 20). Data on Equal Access to Education. Office for Civil Rights, U.S. Department of Education. https://ocrdata.ed.gov/ Claesgens, J., Scalise, K., Wilson, M., & Stacy, A. (2008). Mapping student understanding in chemistry: The perspectives of chemists. Science Education, 93, 56-85. College Board. (2009). Science: College board standards for college success. The College Board. Connected Papers | Find and explore academic papers. (2021). https://www.connectedpapers.com/ Corcoran, T., Mosher, F. A., & Rogat, A. (2009). Learning progressions in science: An evidence- based approach to reform (Report No. RR-63). Consortium for Policy Research in Education (CPRE). www.cpre.org Csapó, B., & Funke, J. (Eds.). (2017). The Nature of problem solving: Using research to inspire 21st century learning. OECD Publishing. https://read.oecd-ilibrary.org/education/the- nature-of-problem-solving_9789264273955-en#page5 Cummings, G. (2014). The new statistics: Why and how. Psychological Science, 25(1), 7-29. Curran, P. J., West, S. G., & Finch, J. F. (1996). The robustness of test statistics to nonnormality and specification error in confirmatory factor analysis. Psychological Methods, 1(1), 16- 29. Davier, M. V., Yamamoto, K., Shin, H. J., Chen, H., Khorramdel, L., Weeks, J., Davis, S., Kong, N., & Kandathil, M. (2019). Evaluating item response theory linking and model fit for data from PISA 2000–2012. Assessment in Education: Principles, Policy & Practice, 26(4), 466- 488. 147 DeMars, C. E. (2016). Partially compensatory multidimensional item response theory models: Two alternate model forms. Educational and Psychological Measurement, 76(2), 231- 257. Duran, V. (2014, March 17). Multidimensional item response theory: What have we learned thus far [PowerPoint slides]. The Psychometrics Centre, University of Cambridge. https://www.psychometrics.cam.ac.uk/system/files/documents/multidimensional-item- response-theory.pdf Enger, S. K., & Yager, R. E. (2009). Chapter 1: A framework for assessing student understanding in science: A standards-based K-12 handbook. In Assessing student understanding in science (2nd edition, pp.1-11). Sage. Erzberger, C., & Kelle, U. (2003). Making inferences in mixed methods: The rules of integration. In A. Tashakkori & C. Teddlie (Eds.), Handbook of Mixed Methods in Social & Behavioral Research 1st edition (pp. 457-488). Sage. Froese-Germain, B. (2010). The OECD, PISA and the impacts on educational policy (Report). Canadian Teachers’ Federation. Gale, N. K., Heath, G., Cameron, E., Rashid, S., & Redwood, S. (2013). Using the framework method for the analysis of qualitative data in multi-disciplinary health research. BMC Medical Research Methodology, 13(117), 1-8. Gao, N., Johnson, H., Lafortune, J., & Dalton, A. (2019). New eligibility rules for the University of California? The effects of new science requirements. Public Policy Institute of California. https://www.ppic.org/wp-content/uploads/new-eligibility-rules-for-university-of- california-the-effects-of-new-science-requirements.pdf 148 GEOstata. (2016). PISA 2015 Results – Performance in Science. 
OECD. Godwin, A. (2017). Unpacking latent diversity. In American Society for Engineering Education Annual Conference & Exposition. Gomes, M., Hirata, G., & Oliveira, J. B. A. E. (2020). Student composition in the PISA assessments: Evidence from Brazil. International Journal of EducationalDevelopment, 79, 1-7. Greene, J. C. & Caracelli, V. J. (1997). Defining and describing the paradigm issue in mixed- method evaluation. New Directions for Evaluation, 1997(74), 5-17. Greenwood, B. (2020). Understanding Pedagogy - What is Social Constructivism? Satchel. https://blog.teamsatchel.com/understanding-pedagogy-what-is-social-constructivism Hahsler, M., Piekenbrock, M., & Doran, D. (2019). dbscan: Fast Density-Based Clustering with R. Journal of Statistical Software, 91(1), 1-30. doi:10.18637/jss Hanushek, E.A., Jamison, D.T., Jamison, E.A., & Woessmann, L. (2008). Education and economic growth: It’s not just going to school, but learning something while there that matters. Education Next, 8(2), 62-70. Harris, D. (n.d.). Comparison of 1-, 2-, and 3-parameter IRT models. Instructional Topics in Educational Measurement, 157-163. Ho, L., & Limpaecher, A. (2021, September 17). The practical guide to grounded theory. practical guide to grounded theory research. Delve. https://delvetool.com/groundedtheory Iliescu, D. & Greiff, S. (2021). On consequential validity. European Journal of Psychological Assessment, 37(3), 163–166. 149 Immekus, J. C., Snyder, K. E., & Ralston, P. A. (2019). Multidimensional item response theory for factor structure assessment in educational psychology research. Frontiers in Education, 4(45), 2-15. Irribarra, D. T. & Arneson, A. E. (2023). The challenge of defining and interpreting dimensionality ineducational and psychological assessments. Measurement, 221, 1-8. Irribarra, D.T. & Freund, R. (2014). Wright Map: IRT item-person map with ConQuest integration. https://github.com/david-ti/wrightmap Issayeva, L. (2022, December 18). Multidimensional item response theory. Assessment Systems Corporation (ASC). https://assess.com/multidimensional-item-response-theory/ Jerrim, J. (2016, November 1). The design and use of test scores: Lecture 4 [PowerPoint slides]. Social Research Institute, University College London. Jerrim, J., Micklewright, J., Heine, J., Salzer, C., & McKeown, C. (2018). PISA 2015: how big is the ‘mode effect’ and what has been done about it? Oxford Review of Education, 44(4), 476- 493. https://doi.org/10.1080/03054985.2018.1430025 Johnson, R. B. & Onwuegbuzie, A. J. (2004). Mixed methods research: A research paradigm whose time has come. Educational Researcher. 33(7), 14-26. Johnson, S. (2019, October 18). How one high school’s dispute reflects the struggle to teach California’s science standards. EdSource. https://edsource.org/2019/how-one-high- schools-dispute-reflects-the-struggle-to-teach-californias-science-standards/618752 Jolliffe, I. T. & Cadima, J. (2016). Principal component analysis: A review and recent Developments. The Royal Publishing Society: Philosophical Transactions A. 374, 1-16. Kaldaras, L., Akaeze, H., & Krajcik, J. (2021). A Methodology for Determining and Validating 150 Latent Factor Dimensionality of Complex Multi-Factor Science Constructs Measuring Knowledge-In-Use. Educational Assessment, 26(4), 241-263. Kan, A., Bulut, O., & Cormier, D. C. (2018). The impact of item stem format on the dimensional structure of mathematics assessments. Educational Assessment, 24(1), 13-32. Kassambara, A. & Mundt, F. 
(2020) Factoextra: Extract and visualize the results of multivariate data analyses. (Version 1.0.7) [R program package]. https://CRAN.R- project.org/package=factoextra Kelley, T. R. & Knowles, J. G. (2016). A conceptual framework for integrated STEM education. International Journal of STEM Education, 3(11), 1-11. Kiefer, T., Robitzsch, A., & Wu, M. (2015, July 2). TAM: An R package for item response modelling [PowerPoint slides]. https://user2015.math.aau.dk/presentations/205.pdf Kose, I. A. & Demirtasli, N. C. (2012). Comparison of unidimensional and multidimensional models based on item response theory in terms of both variables of test length and sample size. Procedia - Social and Behavioral Sciences, 46, 135 – 140. Krutsch, E. & Roderick, V. (2022, November 4). STEM Day: Explore Growing Careers. U.S. Department of Labor Blog. https://blog.dol.gov/2022/11/04/stem-day-explore-growing- careers Lang, J. W. B., & Tay, L. (2021). The Science and Practice of Item Response Theory in Organizations. Annual Review of Organizational Psychology and Organizational Behavior, 8, 311-338. Language Resource Center (LRC). (2022). Languages by countries. https://www.languagerc.net/languages-by-countries/ 151 Learn more about PILA (n.d.). The Platform for Innovative Learning Assessments (PILA). https://pilaproject.org/. Leigh, J. H., Zinkhan, G. M., & Swaminathan, V. (2006). Dimensional relationships of recall and recognition measures with selected cognitive and affective aspects of print ads. Journal of Advertising, 35(1), 105-122. Li, Y., Jiao, H., & Lissitz, R. W. (2012). Applying multidimensional item response theory models in validating test dimensionality: An example of K–12 large-scale science assessment. Journal of Applied Testing Technology, 13(2), 1-27. Lips, D. & Moritz, M. (2023). STEM and Computer Science Education: Reforming Federal K-12 Education R&D Activities to Strengthen American Competitiveness: Prepared by Federation of American Scientists. Lincoln Network, Foundation for American Innovation. Mailman School of Public Health. (2023, November). Item Response Theory. Columbia University. https://www.publichealth.columbia.edu/research/population-health- methods/item-response-theory Market Data Retrieval (MDR). (2024, March 26). How many schools are in the U.S.? MDR Education. https://mdreducation.com/how-many-schools-are-in-the-u-s/ Maxwell, J. A. & Mittapalli, K. (2010). Realism as a stance for mixed methods research. In A. Tashakkori & C. Teddlie (Eds.), Handbook of Mixed Methods in Social & Behavioral Research 2nd edition (pp. 145-167). Sage. Mazzei, L. A., & Jackson, A. Y. (Eds.). (2024). Postfoundational approaches to qualitative inquiry. Routledge. DOI: 10.4324/9781003298519 152 McLeod, S. (2019). Constructivism as a Theory for Teaching and Learning | Simply Psychology. https://www.simplypsychology.org/constructivism.html Messick, S. (1989). Meaning and values in test validation: The science and ethics of assessment. Educational Researcher, 18(2), 5–11. Messick, S. (1993). Foundations of validity: Meaning and consequences in psychological assessment (Report No. RR-93-51). Educational Testing Service. https://onlinelibrary.wiley.com/doi/pdfdirect/10.1002/j.2333-8504.1993.tb01562.x Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons' responses and performances as scientific inquiry into score meaning. American Psychologist, 50(9), 741–749. Mostafa, T., Echazarra, A., & Guillou, H. (2018). 
The science of teaching science: An exploration of science teaching practices in PISA 2015 (OECD Education Working Papers No. 188). www.oecd.org/edu/workingpapers National Center for Education Statistics. (n.d.-a). Frequently asked questions. PISA resources. https://nces.ed.gov/surveys/pisa/faq.asp National Center for Education Statistics. (n.d.-b). Science literacy: Average scores. Program for International Student Assessment (PISA). https://nces.ed.gov/surveys/pisa/pisa2015/pisa2015highlights_3.asp National Center for Education Statistics. (2022). High school mathematics and science course completion: Condition of education. U.S. Department of Education, Institute of Education Sciences. https://nces.ed.gov/programs/coe/indicator/sod National Research Council. (2001). Knowing what students know: The science and 153 design of educational assessment. Washington, DC: The National Academies Press. https://doi.org/10.17226/10019 National Research Council. (2007). Taking science to school: Learning and teaching science in grades K-8. Washington, DC: The National Academies Press. https://doi.org/10.17226/11625 National Research Council. (2012). A framework for K-12 science education: Practices, crosscutting concepts, and core ideas. Washington, DC: The National Academies Press. NGSS Lead States. (2013). Next generation science standards: For states, by states: Three dimensional learning. https://www.nextgenscience.org/three-dimensional-learning Organisation for Economic Co-operation and Development. (n.d.-a). About PISA. The Programme for International Student Assessment (PISA). https://www.oecd.org/pisa/aboutpisa/ Organisation for Economic Co-operation and Development. (n.d.-b). FAQ. The Programme for International Student Assessment (PISA). https://www.oecd.org/pisa/pisafaq/ Organisation for Economic Co-operation and Development. (n.d.-c). PISA 2015 released field trial cognitive items. OECD Publishing. Organisation for Economic Co-operation and Development. (2016a). Country note: Key findings from PISA 2015 for the United States. OECD Publishing. https://www.oecd.org/pisa/PISA-2015-United-States.pdf Organisation for Economic Co-operation and Development. (2016b). PISA 2015 results (volume I): Excellence and equity in education. OECD Publishing. http://dx.doi.org/10.1787/9789264266490-en 154 Organisation for Economic Co-operation and Development. (2017a). PISA 2015 Assessment and analytical framework: science, reading, mathematic, financial literacy and collaborative problem solving, revised edition. OECD Publishing. http://dx.doi.org/10.1787/9789264281820-en Organisation for Economic Co-operation and Development. (2017b). PISA 2015 technical report. OECD Publishing. https://www.oecd.org/pisa/data/2015-technical-report/ Organisation for Economic Co-operation and Development. (2018). PISA 2015: Results in Focus. OECD Publishing. Organisation for Economic Co-operation and Development. (2023). Working draft: PISA learning in the digital world assessment framework. Osteen, P. (2010). An introduction to using multidimensional item response theory to assess latent factor structures. Journal of the Society for Social Work and Research, 1(2), 66-82. Östlund, U., Kidd, L., Wengström, Y., and Rowa-Dewar, N. (2011). Combining qualitative and quantitative research within mixed method research designs: A methodological review. International Journal of Nursing Studies. 48(2011), 369-383. Otarigho, M. D. & Oruese, D. D. (2013). 
Problems and prospects of teaching integrated science in secondary schools in Warri, Delta State, Nigeria. Techno LEARN: An International Journal of Educational Technology, 3(1), 19-26. Park, S., Reeger, A., & Aloe, A. M. (2020). Technically speaking: Determining test effectiveness with item response theory. Iowa Reading Research Center, University of Iowa. https://irrc.education.uiowa.edu/blog/2020/09/technically-speaking-determining- test-effectiveness-item-response-theory 155 Pellegrino, J. W., & Hilton, M. L. (Eds.). (2012). Education for life and work developing transferable knowledge and skills in the 21st Century. National Academies Press. DOI 10.17226/13398 Pensavalle, C. A. & Solinas, G. (2013). The Rasch model analysis for understanding mathematics proficiency—A case study: Senior high school Sardinian students. Creative Education, 4(12), 767-773. Pierson, A. E., Clark, D. B., & Kelly, G. J. (2019). Learning Progressions and Science Practices Tensions in Prioritizing Content, Epistemic Practices, and Social Dimensions of Learning. Science & Education, 28, 833-841. PISA USA. (2015). Program for international student assessment: Frequently asked questions – Information for Students [Brochure]. https://www.fldoe.org/core/fileparse.php/5389/urlt/PISA2015_FAQ_Student_Informati on.pdf Pokropek, A., Marks, G. N., Borgonovi, F., Koc, P., & Greiff, S. (2022). General or specific abilities? Evidence from 33 countries participating in the PISA assessments. Intelligence, 92. Polites, G. L., Roberts, N., and Thatcher, J. (2012). Conceptualizing models using multidimensional constructs: A review and guidelines for their use. European Journal of Information Systems, 21, 22-48. Plotly Technologies Inc. (2015). Collaborative data science. Montréal, QC. https://plot.ly R Core Team. (2023). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/ 156 Reckase, M. D. (1985). The difficulty of test items that measure more than one ability. Applied Psychological Measurement, 9(4), 401-412. Reckase, M. D. (1989). The interpretation and application of multidimensional item response theory models; and computerized testing in the instructional environment: Final Report (Report No. AD-A214109). The American College Testing (ACT) Program. Reckase, M. D. (1990). Unidimensional data from multidimensional tests and multidimensional data from unidimensional tests [Paper presentation]. Annual Meeting of American Educational Research Association, Boston. Reckase, M. D. (1997). The past and future of multidimensional item response theory. Applied Psychological Measurement, 21(1), 25-36. Revelle, W. (2024). psych: Procedures for psychological, psychometric, and personality research. R package version 2.4.1. https://CRAN.R-project.org/package=psych Robitzsch, A., Kiefer, T., & Wu, M. (2022). TAM: Test analysis modules. R package version 4.1-4, https://CRAN.R-project.org/package=TAM RStudio Team. (2021). RStudio: Integrated development environment for R. RStudio, PBC, Boston, MA. http://www.rstudio.com/ Saunders, B., Sim, J., Kingstone, T., Baker, S., Waterfield, J., Bartlam, B., Burroughs, H., & Jinks, C. (2018). Saturation in qualitative research: Exploring its conceptualization and operationalization. Quality & Quantity, 52, 1893-1907. Scalise, K. & Clarke-Midura, J. (2018). The many faces of scientific inquiry: Effectively measuring what students do and not only what they say? Journal of Research in Science Teaching, 55, 1469-1496. 
157 Scalise, K. & Gifford, B. (2006). Computer-based assessment in e-learning: A framework for constructing “intermediate constraint” questions and tasks for technology platforms. The Journal of Technology, Learning, and Assessment, 4(6), 1-45. Sievert, C. (2020). Interactive Web-Based Data Visualization with R, plotly, and shiny. Chapman and Hall/CRC. https://plotly-r.com Singer, J. D. & Braun, H. I. (2018). Testing international education assessments: Rankings get headlines, but often mislead. Science, 360(6384), 38-40. Socha, A. (n.d.) Multidimensional item response theory. [Unpublished article – James Madison University]. https://educ.jmu.edu/~sochaab/index_files/Showcase/MIRT.pdf Spencer, S. G. (2004). The strength of multidimensional item response theory in exploring construct space that is multidimensional and correlated. [Doctoral dissertation, Brigham Young University]. BYU ScholarsArchive. Stehle, S. M., & Peters-Burton, E. E. (2019). Developing student 21st century skills in selected exemplary inclusive STEM high schools. International Journal of STEM Education, 6(1), 1- 15. https://doi.org/10.1186/s40594-019-0192-1 Strauss, V. (2019, December 3). Expert: How PISA created an illusion of education quality and marketed it to the world. The Washington Post. The World Bank Group. (2023). Compulsory education, duration (years). Data. https://data.worldbank.org/indicator/SE.COM.DURS The World Bank Group. (2023). World development indicators. https://datatopics.worldbank.org/world-development-indicators/ Uesaka, Y., Suzuki, M., & Ichikawa, S. (2022). Analyzing students’ learning strategies using item 158 response theory: Toward assessment and instruction for self-regulated learning. Frontiers in Education, 7(921844), 1-16. USAGov. (2023, December 27). Official language of the United States. About the U.S. and its government. https://www.usa.gov/official-language-of-us Vasquez, J. (2006). High school biology today: What the committee of ten did not anticipate. CBE—Life Sciences Education: High School Biology Today, 5, 29-33. Venkatesh, V., Brown, S. A., & Bala, H. (2013). Bridging the qualitative-quantitative divide: Guidelines for conducting mixed methods research in information systems. MIS Quarterly, 37(1), 21-54. Venkatesh, V., Brown, S. A., & Sullivan, Y. W. (2016). Guidelines for conducting mixed methods research: An extension and illustration. Journal of AIS, 17(7), 435-495. Voogt, J. & Roblin, N. P. (2012). A comparative analysis of international frameworks for 21st century competences: Implications for national curriculum policies. Journal of Curriculum Studies, 44(3), 299-321. Wach, E., Ward, R., & Jacimovic, R. (2013). Learning about Qualitative Document Analysis. IDS Practice Papers in Brief, 1-10. Wang, C. (2021). A brief history and next stage of multidimensional item response theory. Quantitative and Qualitative Methods. Wang, C. & Nydick, S. W. (2015). Comparing two algorithms for calibrating the restricted non-compensatory multidimensional IRT model. Applied Psychological Measurement, 39(2), 119-134. Welch, W. W. (1977). Chapter 3: Evaluation and decision-making in integrated science. In D. 159 Cohen (Ed.), Volume IV: New trends in integrated science teaching: evaluation of integrated science education (pp. 26-36). United Nations Educational, Scientific, and Cultural Organization. Western Governors University. (2020). What is constructivism? https://www.wgu.edu/blog/what-constructivism2005.html#close Wickham, H. (2016). 
Wickham, H. (2016). ggplot2: Elegant graphics for data analysis (Version 3.4.4) [R program package]. Springer-Verlag, New York. https://ggplot2.tidyverse.org
Wickham, H., François, R., Henry, L., Müller, K., & Vaughan, D. (2023). dplyr: A grammar of data manipulation (Version 1.1.4) [R program package]. https://github.com/tidyverse/dplyr, https://dplyr.tidyverse.org
Wilson, M. (2013). Using the concept of a measurement system to characterize measurement models used in psychometrics. Measurement, 46(9), 3766-3774.
Winarno, N., Rusdiana, D., Riandi, R., Susilowati, E., & Afifah, R. M. A. (2020). Implementation of integrated science curriculum: A critical review of the literature. Journal for the Education of Gifted Young Scientists, 8(2), 795-817. http://dx.doi.org/10.17478/jegys.675722
Wind, S. & Hua, C. (2021). Rasch measurement theory analysis in R: Illustrations and practical guidance for researchers and practitioners. Bookdown. https://bookdown.org/chua/new_rasch_demo2/
World Population Review. (2023). Most racially diverse countries 2023. https://datatopics.worldbank.org/world-development-indicators/
Yamamoto, K. (1995). TOEFL technical report: Estimating the effects of test length and test time on parameter estimation using the hybrid model (Report No. ETS-RR-95-2; TOEFL-TR-10). Educational Testing Service. https://files.eric.ed.gov/fulltext/ED395035.pdf
Yen, S. J. & Leah, W. (2007). Multidimensional IRT models for composite scores [Paper presentation]. 2007 Annual Meeting of the National Council on Measurement in Education, Chicago, IL, United States.

APPENDIX A: STUDENT ENROLLMENT IN SCIENCE COURSES BY ETHNICITY

The following bubble graphs showcase student enrollment in U.S. high school science courses based on data from the Civil Rights Data Collection (CRDC, 2023) collected by the Office for Civil Rights (OCR). The data are from the 2020-21 school year, which was when OCR was able to restart data collection after a delay due to COVID-19. Student data are from all school districts and public schools, "as well as long-term secure juvenile justice facilities, charter schools, alternative schools, and special education schools that focus primarily on serving the educational needs of students with disabilities under IDEA or section 504 of the Rehabilitation Act (CRDC, 2023)." Administration of the CRDC occurs every two years in the 50 states, Washington, D.C., and the Commonwealth of Puerto Rico (CRDC, 2023). Enrollment was less than 1% for the American Indian or Alaska Native and Native Hawaiian or Other Pacific Islander student populations, so those student populations are not depicted in the three bubble charts. The majority of students enrolled in high school physics (Figure 35), biology (Figure 36), and chemistry (Figure 37) courses are of White and Hispanic or Latino (of any race) ethnicities.

Figure 35
U.S. High School Physics Enrollment

Figure 36
U.S. High School Biology Enrollment

Figure 37
U.S. High School Chemistry Enrollment

Note. For Figures 35-37, from "Data on Equal Access to Education," 2023, Civil Rights Data Collection. Copyright 2015 by Office for Civil Rights, U.S. Department of Education. https://ocrdata.ed.gov/
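To illustrate how bubble charts of this kind can be generated, the brief R sketch below draws one enrollment bubble chart with ggplot2, one of the R packages cited in the references. It is a minimal sketch only: the course label, the ethnicity categories shown, and the enrollment counts are hypothetical placeholders rather than CRDC values.

# Illustrative sketch only: bubble chart of course enrollment by ethnicity.
# The counts below are hypothetical placeholders, not CRDC data.
library(ggplot2)

enrollment <- data.frame(
  ethnicity = c("White", "Hispanic or Latino", "Black or African American",
                "Asian", "Two or More Races"),
  students  = c(520000, 310000, 150000, 120000, 60000)  # hypothetical counts
)

ggplot(enrollment, aes(x = ethnicity, y = "Physics", size = students)) +
  geom_point(alpha = 0.6, colour = "steelblue") +
  scale_size_area(max_size = 25) +   # bubble area proportional to the count
  labs(title = "U.S. High School Physics Enrollment (illustrative data)",
       x = NULL, y = NULL, size = "Students") +
  theme_minimal()

A comparable interactive version could be produced with the plotly package (Sievert, 2020), which is also cited in the references.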
APPENDIX B: 2015 PISA AVERAGE SCORES FOR SCIENCE

Table 13 provides the 2015 PISA mean scores on the science literacy scale for all participating education systems, with the U.S. data highlighted in blue. Also included are the standard errors (SE). With regard to the U.S., it is important to consider that only two states and one territory were sampled individually to obtain a separate mean score for that state or territory (OECD, 2016a). Figure 38 shows the relative stability of U.S. science mean scores over time from 2006 to 2018 (OECD, 2016a).

Table 13
2015 PISA Country Rankings by Average Score in Science

Education System: Average Score (SE)
OECD average: 493 (0.4)
Singapore: 556 (1.2)
Japan: 538 (3.0)
Estonia: 534 (2.1)
Chinese Taipei: 532 (2.7)
Finland: 531 (2.4)
Macau (China): 529 (1.1)
Canada: 528 (2.1)
Vietnam: 525 (3.9)
Hong Kong (China): 523 (2.5)
B-S-J-G (China): 518 (4.6)
Korea, Republic of: 516 (3.1)
New Zealand: 513 (2.4)
Slovenia: 513 (1.3)
Australia: 510 (1.5)
United Kingdom: 509 (2.6)
Germany: 509 (2.7)
Netherlands: 509 (2.3)
Switzerland: 506 (2.9)
Ireland: 503 (2.4)
Belgium: 502 (2.3)
Denmark: 502 (2.4)
Poland: 501 (2.5)
Portugal: 501 (2.4)
Norway: 498 (2.3)
United States: 496 (3.2)
Austria: 495 (2.4)
France: 495 (2.1)
Sweden: 493 (3.6)
Czech Republic: 493 (2.3)
Spain: 493 (2.1)
Latvia: 490 (1.6)
Russian Federation: 487 (2.9)
Luxembourg: 483 (1.1)
Italy: 481 (2.5)
Hungary: 477 (2.4)
Lithuania: 475 (2.7)
Croatia: 475 (2.5)
Buenos Aires (Argentina): 475 (6.3)
Iceland: 473 (1.7)
Israel: 467 (3.4)
Malta: 465 (1.6)
Slovak Republic: 461 (2.6)
Greece: 455 (3.9)
Chile: 447 (2.4)
Bulgaria: 446 (4.4)
United Arab Emirates: 437 (2.4)
Uruguay: 435 (2.2)
Romania: 435 (3.2)
Cyprus: 433 (1.4)
Moldova, Republic of: 428 (2.0)
Albania: 427 (3.3)
Turkey: 425 (3.9)
Trinidad and Tobago: 425 (1.4)
Thailand: 421 (2.8)
Costa Rica: 420 (2.1)
Qatar: 418 (1.0)
Colombia: 416 (2.4)
Mexico: 416 (2.1)
Montenegro, Republic of: 411 (1.0)
Georgia: 411 (2.4)
Jordan: 409 (2.7)
Indonesia: 403 (2.6)
Brazil: 401 (2.3)
Peru: 397 (2.4)
Lebanon: 386 (3.4)
Tunisia: 386 (2.1)
Macedonia, Republic of: 384 (1.2)
Kosovo: 378 (1.7)
Algeria: 376 (2.6)
Dominican Republic: 332 (2.6)

U.S. States and Territories
Massachusetts: 529 (6.6)
North Carolina: 502 (4.9)
Puerto Rico: 403 (6.1)

Note. Adapted from "Science literacy: Average scores," n.d., National Center for Education Statistics. Copyright 2015 by OECD. https://nces.ed.gov/surveys/pisa/pisa2015/pisa2015highlights_3.asp "Average score is higher than U.S. average score at the .05 level of statistical significance (NCES, n.d.-b)." "Average score is lower than U.S. average score at the .05 level of statistical significance (NCES, n.d.-b)." "Education systems are ordered by 2015 average score. The OECD average is the average of the national averages of the OECD member countries, with each country weighted equally. Scores are reported on a scale from 0 to 1,000. All average scores reported as higher or lower than the U.S. average score are different at the .05 level of statistical significance. Italics indicate non-OECD countries and education systems. B-S-J-G (China) refers to the four PISA participating China provinces: Beijing, Shanghai, Jiangsu, and Guangdong. Results for Massachusetts and North Carolina are for public school students only (NCES, n.d.-b)."

While Argentina, Malaysia, and Kazakhstan participated in PISA 2015, Argentina only provided a reliable sample from Buenos Aires, Malaysia was unable to meet response rate standards, and Kazakhstan administered only multiple-choice items, which limited comparison by rank in Argentina's case and prevented comparison by rank in the case of Malaysia and Kazakhstan (OECD, 2018).

Figure 38
U.S. Mean Scores for Science Stable Over Time

Year: 2006, 2009, 2012, 2015, 2018
Mean science score: 489, 502, 497, 496, 502
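To show how the trend in Figure 38 can be redrawn directly from the values reported above, the short R sketch below plots the five U.S. mean science scores by assessment year with ggplot2; no data beyond the year/score pairs listed above are assumed.

# Sketch of a line plot for the Figure 38 trend, using only the scores reported above.
library(ggplot2)

us_science <- data.frame(
  year  = c(2006, 2009, 2012, 2015, 2018),
  score = c(489, 502, 497, 496, 502)
)

ggplot(us_science, aes(x = year, y = score)) +
  geom_line(colour = "steelblue") +
  geom_point(size = 2) +
  scale_x_continuous(breaks = us_science$year) +  # label each assessment year
  labs(title = "U.S. Mean Scores for Science Stable Over Time",
       x = "PISA assessment year", y = "Mean science score") +
  theme_minimal()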
APPENDIX C: 2015 PISA AVERAGE SCORES BY SCIENCE SUBDOMAIN

Table 14 provides the 2015 PISA mean scores for the three science subscales (physical, living, and Earth and space systems) for all participating education systems, with the U.S. data highlighted in blue. Also included are the SE. With regard to the U.S., it is important to consider that only two states and one territory were sampled individually to obtain separate mean subscale scores for that state or territory (OECD, 2016a). Note that the science competency70 subscales (i.e., explain phenomena, evaluate and design inquiry, and interpret data and evidence) are outside the scope of this study, so they are not provided here; they are available at the NCES website below.
https://nces.ed.gov/surveys/pisa/pisa2015/pisa2015highlights_3_2.asp

70 NCES refers to OECD's framework competency subscales as "process subscales" on their website.

Table 14
2015 PISA Country Rankings by Average Score in Science Subdomain

Each subscale below lists Education System: Average Score (SE), ordered by that subscale's average score.

Physical Systems
OECD average: 493 (0.5)
Singapore: 555 (1.6)
Japan: 538 (3.2)
Estonia: 535 (2.3)
Finland: 534 (2.6)
Macau (China): 533 (1.4)
Chinese Taipei: 531 (3.0)
Canada: 527 (2.4)
Hong Kong (China): 523 (2.9)
B-S-J-G (China): 520 (5.3)
Korea, Republic of: 517 (3.6)
New Zealand: 515 (2.7)
Slovenia: 514 (1.6)
Netherlands: 511 (2.6)
Australia: 511 (1.8)
United Kingdom: 509 (2.9)
Denmark: 508 (2.7)
Ireland: 507 (2.8)
Germany: 505 (2.8)
Switzerland: 503 (3.1)
Poland: 503 (2.7)
Norway: 503 (2.5)
Sweden: 500 (3.8)
Belgium: 499 (2.4)
Portugal: 499 (2.7)
Austria: 497 (2.7)
United States: 494 (3.5)
France: 492 (2.4)
Czech Republic: 492 (2.5)
Latvia: 490 (1.7)
Russian Federation: 488 (3.4)
Spain: 487 (2.3)
Hungary: 481 (2.9)
Italy: 479 (2.8)
Luxembourg: 478 (1.4)
Lithuania: 478 (2.8)
Iceland: 472 (1.9)
Croatia: 472 (2.6)
Israel: 469 (3.8)
Slovak Republic: 466 (2.9)
Greece: 452 (4.0)
Bulgaria: 445 (4.4)
Chile: 439 (3.0)
United Arab Emirates: 434 (2.8)
Cyprus: 433 (1.6)
Uruguay: 432 (2.6)
Turkey: 429 (4.3)
Thailand: 423 (3.2)
Costa Rica: 417 (2.4)
Qatar: 415 (1.5)
Colombia: 414 (2.7)
Mexico: 411 (2.2)
Montenegro, Republic of: 407 (1.6)
Brazil: 396 (2.6)
Peru: 389 (2.7)
Tunisia: 379 (2.4)
Dominican Republic: 332 (3.0)

Living Systems
OECD average: 492 (0.5)
Singapore: 558 (1.4)
Japan: 538 (3.2)
Chinese Taipei: 532 (2.7)
Estonia: 532 (2.1)
Canada: 528 (2.4)
Finland: 527 (2.5)
Macau (China): 524 (1.4)
Hong Kong (China): 523 (2.7)
B-S-J-G (China): 517 (4.5)
New Zealand: 512 (2.8)
Slovenia: 512 (1.6)
Korea, Republic of: 511 (3.2)
Australia: 510 (1.8)
Germany: 509 (2.9)
United Kingdom: 509 (2.6)
Switzerland: 506 (3.2)
Netherlands: 503 (2.4)
Belgium: 503 (2.4)
Portugal: 503 (2.5)
Poland: 501 (2.8)
Ireland: 500 (2.5)
United States: 498 (3.4)
Denmark: 496 (2.6)
France: 496 (2.3)
Norway: 494 (2.5)
Spain: 493 (2.3)
Czech Republic: 493 (2.4)
Austria: 492 (2.6)
Latvia: 489 (1.7)
Sweden: 488 (3.7)
Luxembourg: 485 (1.2)
Russian Federation: 483 (2.8)
Italy: 479 (2.7)
Croatia: 476 (2.6)
Iceland: 476 (2.0)
Lithuania: 476 (2.7)
Hungary: 473 (2.6)
Israel: 469 (3.5)
Slovak Republic: 458 (2.8)
Greece: 456 (4.0)
Chile: 452 (2.7)
Bulgaria: 443 (4.5)
Uruguay: 438 (2.5)
United Arab Emirates: 438 (2.6)
Cyprus: 433 (1.5)
Turkey: 424 (3.9)
Qatar: 423 (1.1)
Thailand: 422 (3.2)
Costa Rica: 420 (2.4)
Colombia: 419 (2.5)
Mexico: 415 (2.4)
Montenegro, Republic of: 413 (1.3)
Brazil: 404 (2.6)
Peru: 402 (2.7)
Tunisia: 390 (2.4)
Dominican Republic: 332 (2.8)

Earth and Space Systems
OECD average: 494 (0.5)
Singapore: 554 (1.6)
Japan: 541 (3.3)
Estonia: 539 (2.3)
Finland: 534 (3.0)
Chinese Taipei: 534 (3.1)
Macau (China): 533 (1.2)
Canada: 529 (2.5)
Hong Kong (China): 523 (2.5)
Korea, Republic of: 521 (3.3)
B-S-J-G (China): 516 (4.9)
Slovenia: 514 (1.8)
New Zealand: 513 (2.7)
Netherlands: 513 (2.8)
Germany: 512 (2.9)
United Kingdom: 510 (2.8)
Australia: 509 (2.1)
Switzerland: 508 (3.1)
Denmark: 505 (2.7)
Belgium: 503 (2.6)
Ireland: 502 (2.6)
Poland: 501 (2.8)
Portugal: 500 (2.9)
Norway: 499 (2.6)
Austria: 497 (2.9)
Spain: 496 (2.3)
United States: 496 (3.4)
France: 496 (2.5)
Sweden: 495 (4.1)
Czech Republic: 493 (2.6)
Latvia: 493 (1.9)
Russian Federation: 489 (3.3)
Italy: 485 (2.7)
Luxembourg: 483 (1.6)
Croatia: 477 (2.7)
Hungary: 477 (2.8)
Lithuania: 471 (3.0)
Iceland: 469 (1.9)
Slovak Republic: 458 (2.8)
Israel: 457 (3.8)
Greece: 453 (4.3)
Bulgaria: 448 (4.8)
Chile: 446 (2.5)
United Arab Emirates: 435 (2.8)
Uruguay: 434 (2.6)
Cyprus: 430 (1.6)
Turkey: 421 (4.3)
Mexico: 419 (2.4)
Costa Rica: 418 (2.4)
Thailand: 416 (3.2)
Colombia: 411 (2.7)
Montenegro, Republic of: 410 (2.0)
Qatar: 409 (1.2)
Brazil: 395 (3.1)
Peru: 393 (3.1)
Tunisia: 387 (3.4)
Dominican Republic: 324 (3.4)

The following education systems administered paper-based trend items and have no scores on any of the three science subscales (―/†): Buenos Aires (Argentina), Romania, Jordan, Vietnam, Georgia, Albania, Trinidad and Tobago, Republic of Macedonia, Algeria, Indonesia, Malta, Lebanon, Kosovo, and Republic of Moldova.

U.S. States and Territories
Massachusetts: Physical Systems 526 (6.7); Living Systems 533 (6.9); Earth and Space Systems 528 (6.6)
North Carolina: Physical Systems 501 (5.2); Living Systems 503 (5.4); Earth and Space Systems 502 (5.0)
Puerto Rico: ― †
Note. Adapted from "Science literacy: Average scores," n.d., National Center for Education Statistics. Copyright 2015 by OECD. https://nces.ed.gov/surveys/pisa/pisa2015/pisa2015highlights_3.asp "Average score is higher than U.S. average score at the .05 level of statistical significance (NCES, n.d.-b)." "Average score is lower than U.S. average score at the .05 level of statistical significance (NCES, n.d.-b)." ― "Not available (NCES, n.d.-b)." † "Not applicable (NCES, n.d.-b)." "Education systems are ordered by 2015 average subscale score. The OECD average is the average of the national averages of the OECD member countries, with each country weighted equally. Scores are reported on a scale from 0 to 1,000. Albania, Algeria, Buenos Aires (Argentina), Georgia, Indonesia, Jordan, Kosovo, Lebanon, Malta, Republic of Macedonia, Republic of Moldova, Puerto Rico, Romania, Trinidad and Tobago, and Vietnam administered paper-based trend items and have no scores for science subscales. Italics indicate non-OECD countries and education systems. B-S-J-G (China) refers to the four PISA participating China provinces: Beijing, Shanghai, Jiangsu, and Guangdong. Results for Massachusetts and North Carolina are for public school students only (NCES, n.d.-b)."

While Argentina, Malaysia, and Kazakhstan participated in PISA 2015, Argentina only provided a reliable sample from Buenos Aires, Malaysia was unable to meet response rate standards, and Kazakhstan administered only multiple-choice items, which limited comparison by rank in Argentina's case and prevented comparison by rank in the case of Malaysia and Kazakhstan (OECD, 2018).

APPENDIX D: LITERATURE CONNECTIONS

Figure 39 details literature connections to the 2012 Li et al. article, which has a methodology similar to that of this study. Note that Figure 39 is hyperlinked to an interactive, larger version of the graphic, which was generated with https://www.connectedpapers.com/.

Figure 39
Connections to the 2012 Li Article

APPENDIX E: LITERATURE REVIEW MATRIX

Table 15 provides a detailed account of the literature review. The table is organized alphabetically, with notes on various aspects of each piece of literature. The measurement type is color coded for ease of use (see the key below). Also included are any barriers to using MIRT models to describe student performance in the science content subdomains. The big ideas driving the inclusion of the literature in this dissertation are summarized in the final column.

Color Key: Mixed Methods; Qualitative; Quantitative; Included in References

Table 15
Results of Literature Review

Columns: Author/s or Editor/s or Abbreviated Title (Date); Student Demographics; Content Domain; Reference Type; Measurement Type; Barriers to Using MIRT Model; Big Idea/s

AERA (2014); NA; NA; Book; Quantitative; NA; Education testing standards
Aktürk et al. (2017); Turkey, Early Childhood; STEM; Article; Qualitative; NA; Example of curriculum document analysis
Armstrong (2021); NA; NA; Article; Qualitative; NA; Document analysis; grounded theory; epistemology
Athalonz (2023); NA; Life and Physical Sciences; Blog Post; NA; NA; Little curriculum overlap between life and physical science
Atkinsson (2010); NA; NA; Online Article; Learning Theory; NA; Collaboration as part of constructivism
Ayala (2022) NA NA Book Quantitative Sources of indeterminacy (pg.
408) Complete overview of IRT Difficult to justify to test MIRT in large-scale assessment, Baghaei (2012) Iran, HS English Article Quantitative takers why scores on compares between- and within-different dimensions item multidimensionality (see depend on each other Figure 1) – hold for discussion Baskarada & Epistemology and philosophical Koronios (2018) NA NA Article Mixed Methods NA considerations for mixed methods research Berenzer & Global, 15-year- Math, Reading, PISA, PIRLS, Use of scaling and IRT in large- Adams (2017) olds, 4th graders Science Book Quantitative, NA scale assessments; IRT model large-scale choice Beribisky & Hancock (2023) NA NA Article Quantitative NA RMSEA comparison Binkley & Ma U.S., HS Advanced (AP) Newspaper Inequity in advanced placement (2023) classes Articles NA NA courses by student ethnicity (Black and Latino especially) Boon et al. (2022) NA Science Article Epistemology NA Overview of constructivist approach to science education Document analysis as a Bowen (2009) NA NA Article Qualitative NA research method – use in methods section Brandt (2015) Global, 15-year- PISA, NAEP, KEY! Large-scale assessment’s olds, U.S. NA Dissertation Quantitative, Reliability of comprehensive scores unidimensional approach when large-scale, really multidimensional; Briggs & Wilson Intro to multidimensional Rasch (2003) NA NA Article Quantitative NA models; art and science of measurement Broesch et al. (2020) NA NA Article Mixed Methods NA Research design that takes culture into consideration Brooks-Bartlett NA NA Online (2017) Article Quantitative NA Introduction to probability 172 Author/s or Editor/s or Student Content Reference Measurement Barriers to Using MIRT Abbreviated Title Demographics Domain Type Type Model Big Idea/s (Date) Brooks-Bartlett Online (2018) NA NA Article Quantitative NA Maximum likelihood estimation Camilleri (2023) NA NA News Article Quantitative NA Historical timeline of IRT Carnoy et al. U.S. Math, Reading, PISA, NAEP, State comparisons more useful (2015) Science Briefing Quantitative NA than U.S. to international comparisons Caro & Biecek PISA, TIMSS, (2017) International NA Article PIRLS, NA An R package for analyzing Quantitative large-scale assessment data Figure 1 is a framework/theory/model view Carpiano & Daley for philosophy that will work for (2006) NA NA Article Epistemology NA qualitative review of content framework too – use in methods section; glossary for epistemology CFPB (2019) Global Financial Report PISA, Mixed NA Overview of PISA financial Literacy Methods literacy results; Claesgens et al. U.S., HS and Uses IRT to match scores to a (2008) University Chemistry Article Mixed Methods NA framework (pg. 8 for discussion chapter) College Board (2009) U.S. Science Book NA NA Science standards for college success Columbia (2023) NA NA Webpage Quantitative NA Item parameters and IRT Status of U.S. science Corcoran et al. education; role of science (2009) U.S., K-12 Science Report NA NA learning progressions (for possible use in my discussion chapter) CPET (n.d.) NA NA Webpage NA NA Equity in assessment CRCD (2023) U.S., K-12 Math, Science, AP Webpage Quantitative NA Civil rights data for education in U.S. K-12 173 Author/s or Editor/s or Student Content Reference Measurement Barriers to Using MIRT Abbreviated Title Demographics Domain Type Type Model Big Idea/s (Date) Creswell (2015) NA NA Article Mixed Methods NA Approaches to and handbook for mixed methods Crisan et al. 
NA NA Article Quantitative, NA Consequences of (2017) Simulation unidimensional IRT model misfit Csapó (2017) NA NA Online Book NA NA Overview of 21st Century skills Curran et al. (1996) NA NA Article Quantitative NA Skew and kurtosis acceptable ranges MIRT was not used in Using multiple-group Rasch Davier et al. Global, 15-year- Science, Math, PISA, original analysis so cannot model rather than Article be used in newer research unidimensional IRT for linking in (2019) olds and Reading Quantitative in order to preserve trend PISA to generate cross-country and prior conclusions comparisons (for discussion chapter) DeMars (2016) NA NA Article Quantitative, Item difficulties can vary Key! Non-compensatory MIRT Simulation by dimension equation (for methods and results chapters) Dorans & Kingston (1985) NA NA Article Quantitative NA Violating unidimensionality Duran (2014) NA NA PowerPoint Separate multi into KEY! IRT and MIRT advantages Presentation Quantitative unidimensional subtests overview Duschl et al. Science learning and learning (2007) U.S., K-8 Science Book NA NA progressions (for possible use in my discussion chapter) El Masri & UK, France, PISA, Model fit analysis in relation to Andrich (2020) Jordan, 15-year- Science Article Quantitative, NA invariance and validity with olds IRT regards to DIF Discussion on how concept Enger & Yager subdomains of science should (2009) U.S., K-12 Science Book NA NA be taught; change to inquiry learning; background on other learning domains EPI (2015) Global, 15-year- Science, PISA, Mixed 2012 PISA, NAEP, and TIMSS olds Reading, Math Report Methods NA results comparison between 174 Author/s or Editor/s or Student Content Reference Measurement Barriers to Using MIRT Abbreviated Title Demographics Domain Type Type Model Big Idea/s (Date) U.S. states rather than international Assessment design uses standards to measure desired Ercikan & Oliveri 21stNA Century Article NA NA traits; ECD; cognitive evidence (2016) Skills needed for complex constructs rather than just expert reviews of items Erzberger & Kelle (2003) NA NA Book Mixed Methods NA Mixed methods handbook Fisher (2023) Unspecified Adverse Children Childhood Dissertation Mixed Methods NA Example of methods section Experiences split into two plans Fu (2016) U.S., Grade 8 Algebra Article Quantitative Practical significance vs. MIRT models with covariates statistical significance applied to longitudinal test data Framework method to Gale et al. (2013) NA NA Article Qualitative NA compare/contrast qualitative data PPIC report on HS science course requirements for CA Gao et al. (2019) U.S., HS to Science Report Mixed Methods NA university admission – save for College discussion as mentions racial disparity in meeting new requirements Garnier-Villarreal et al. (2021) NA NA Article Quantitative The number of factors to Estimation limits of between- be evaluated item MIRT models GEOstata (2016) Global Science Webpage PISA NA Map of PISA 2015 country/economy participants Latent diversity’s impact on Godwin (2017) College Engineering Conference Article Mixed Methods NA creative solutions to engineering problems 175 Author/s or Editor/s or Student Content Reference Measurement Barriers to Using MIRT Abbreviated Title Demographics Domain Type Type Model Big Idea/s (Date) Gomes et al. 
Brazil, Student Math Article PISA, Why scores may be impacted by (2020) age varied Quantitative NA student age due to taking more classes (hold for discussion) Greene & Pragmatic paradigm in mixed Caracelli (1997) NA NA Article Mixed Methods NA methods design using different epistemologies Greenwood NA NA Blog Post Learning (2020) Theory NA Social constructivism overview Griffin & McGaw (2012) NA NA Book Quantitative NA Assessment of 21 st century skills Multidimensionality does Multidimension model Haksing (2010) NA NA Article Quantitative not mean MIRT has to be indistinguishable from used unidimensional model Hanushek et al. Global, 15-year- How cognitive growth as (2008) olds Math Article PISA, Quantitative NA measured by PISA impacts the U.S. economy Harris (n.d.) NA NA Article Quantitative NA Compares equations for UIRT models Harrison et al. U.S., HS, MS, Nature of NGSS, Mixed Multidimensional nature of (2015) Hawaii Science Article Methods NA nature of science learning; MRCML model Hartig & Hohler (2009) NA NA Article Quantitative NA KEY! SEM models for MIRT: between and within items; Hebel et al. (2017) Global, 15-year- PISA, Mixed olds, France Science Article Methods NA Difficulty of PISA science items Hoover et al. Secondary Earth and Space Report NA NA Statistics on school (2018) Sciences (ESS) requirements for ESS education How item framing may impact Hsu et al. (2023) Undergraduate Biology Article Quasi-random, NA student performance (for Mixed Methods possible use in my discussion chapter) IES (n.d.) U.S., Grades 4 and 8 Science, Math Webpage TIMSS, Overview of TIMSS science Quantitative NA assessment 176 Author/s or Editor/s or Student Content Reference Measurement Barriers to Using MIRT Abbreviated Title Demographics Domain Type Type Model Big Idea/s (Date) Iliescu (2021) NA NA Article Validity NA Social consequence of testing; Immekus (2019) Undergraduate Engineering Article Quantitative, NA KEY! Overview of IRT (2PL) and MIRT, CFA MIRT Multidimensional MRCMLM Intasoi et al. 2020 Thailand, Grade 7 Science Article Quantitative NA model works better to measure scientific competency framework Irribarra & Social Sciences, Arneson (2023) NA Education Article Quantitative NA Defining dimensional structure Compensatory vs. non- Issayeva (2022) NA NA Webpage Quantitative NA compensatory MIRT models; MLE; guessing parameter Jerrim (2016) NA NA PowerPoint PISA, NA Types of weighting; handling Presentation Quantitative missing data PISA, TIMSS, Jerrim (2023) Global Science, Reading, Math Article PIRLS, NA Interest in large-scale Quantitative assessments by country Sweden, Jerrim et al. Germany, Science, Article PISA, NA Change of mode to computer-(2018) Ireland, 15-year- Reading, Math Quantitative based assessment olds China and Ji (2023) Canada, Self-regulation Thesis Qualitative NA Document analysis Kindergarten methodology Johnson (2019) U.S., HS Integrated Web article NA NA Parent and student concerns Science over integrated science Johnson & Onwuegbuzie NA NA Article Mixed Methods NA Pragmatic mixed methods – (2004) equal status design Jolliffe & Cadima (2016) NA NA Article Quantitative NA A review of PCA Kaldaras et al. 
NA Science Article NGSS, Validating latent multi-factor (2021) Quantitative, NA science constructs 177 Author/s or Editor/s or Student Content Reference Measurement Barriers to Using MIRT Abbreviated Title Demographics Domain Type Type Model Big Idea/s (Date) EFA, CFA, Invariance Analysis Kandanaarachchi IRT used to evaluate machine & Smith-Miles NA NA Article Quantitative NA learning algorithm (future (2023) research section) Bayesian probabilistic Kaplan & Huang (2021) U.S. Math, Reading Article NAEP, NA forecasting view; combining Quantitative NSLP variable as a proxy for socio-economic status Student perceptions of different Kapucu (2021) Turkey, 9th grade Science Article Mixed Methods NA science subdomains differentiate from one another Kelley & Knowles (2016) International STEM Article NA NA Integrated STEM education Kim & Wilson Polytomous item explanatory (2020) NA NA Article Quantitative NA item response theory models via MGLMM Overview of information criteria Kim et al. (2019) NA NA Article Quantitative NA usage when comparing model fit Longer tests and larger Kose & Demirtasli sample sizes are needed Comparing sample size and test th (2012) Turkey, 8 grade Language Article Quantitative to increase model length for UIRT and MIRT sensitivity and decrease models error Krutsch & U.S. Department of Labor chart Roderick (2022) NA STEM Blog Post Quantitative NA on STEM job growth Kuo & Sheng NA NA Article Quantitative, Estimation methods for multi- (2016) Simulation NA unidimensional graded response 178 Author/s or Editor/s or Student Content Reference Measurement Barriers to Using MIRT Abbreviated Title Demographics Domain Type Type Model Big Idea/s (Date) Can be similar to other models such as CTT or Lang & Tay (2021) NA NA Article IRT, MIRT, CFA; unidimensional KEY! Overview of IRT models; Quantitative models are more familiar includes R code; history of MIRT and easily interpreted (for development discussion chapter) Lau (2009) Global Scientific Article PISA, NA Review of 2006 framework finds Literacy Qualitative construct validity issues Learn PILA (n.d.) Global Innovative Webpage Quantitative NA Platform for Innovative Learning Assessments Investigation across different Lee & Tsai (2012) College Biology, Physics Article Qualitative NA domains of science is rare for student epistemological beliefs Michigan Gr. 5, Unidimensionality and KEY! MIRT model used Li et al. (2012) U.S., K-12 Science Article Quantitative, local item dependence successfully for large-scale state EFA, CFA assumptions assessment in science Lin (1998) NA NA Article Qualitative NA Positivist vs. Interpretivist approaches Lips & Moritz (2022) U.S. STEM Report NA NA Government spending on STEM Liu et al. (2022) U.S., Grade 8, Math NA Article NAEP, MIRT, Scalability to big data 2PL IRT most commonly used; Quantitative report RMSE MacLeod & U.S., Memory recall as a Nelson (1984) Undergraduates NA Article Quantitative NA unidimensional construct Mari et al. (2017) NA NA Article Quantitative NA Nature of measurement vs. 
measure Marlowe (1986) NA NA Article Quantitative NA Multidimensionality in social intelligence Masur (2022) NA NA Webpage Quantitative NA IRT models in MIRT R package Maul (2019) NA NA Article NA NA Intersubjectivity of measurement Maxwell & Mittapalli (2010) NA NA Handbook Mixed Methods NA Scientific realism 179 Author/s or Editor/s or Student Content Reference Measurement Barriers to Using MIRT Abbreviated Title Demographics Domain Type Type Model Big Idea/s (Date) Mazzei & Jackson (2024) NA NA Book Qualitative NA Re-animating documents in another form McDonald (1999) NA NA Book CFA, IRT, MIRT, Quantitative NA Overview of multiple quantitative statistics Mcleod (2023) NA NA Online Learning Article Theory NA Constructivism Use vs. interpretation Messick (1989) NA NA Article NA NA inferences in assessment validity Messick (1993) NA NA Article NA NA Consequential validity as an aspect of construct validity Messick (1995) NA NA Article NA NA Construct validity Monseur et al. Global Reading, (2011) Science, Math Article PISA, NA Violation of independence Quantitative assumption by items in a set Moroi (2020) NA NA Article Mixed Methods NA Philosophies of research Student enjoyment of science Mostafa et al. Global, 15-year- Science Report PISA, Mixed NA linked to inquiry teaching; 2015 (2018) olds Methods PISA science test design/scoring/scale Infit item statistic acceptable Müller (2020) NA NA Article Quantitative NA bounds – save for results chapter A framework for science National Research education; chapter 11 discusses Council (2012) U.S., K-12 Science Book NA NA DEI in science education (for possible use in my discussion chapter) NCES (2019) U.S. Math, Science Report PISA NA K-12 course completion statistics NCES (2022) U.S. Science, Math Report NAEP, Quantitative NA Science course completion data NCES (n.d.-a) Global, 15-year-olds, U.S. Science Webpage PISA NA Number of U.S. schools participating in 2015; implies 180 Author/s or Editor/s or Student Content Reference Measurement Barriers to Using MIRT Abbreviated Title Demographics Domain Type Type Model Big Idea/s (Date) students do not have to participate 2015 PISA scores for all participating countries, also NCES (n.d.-b) Global, 15-year- PISA, olds, U.S. Science Webpage Quantitative NA broken down by subdomain; U.S. sampling and data collection methods NGSS (2013) U.S. STEM Webpage NA NA Next Generation Science Standards Niiniluoto ed. et al. (2004) NA NA Book Qualitative NA Overview of different epistemologies and their origins NWEA (2015) NA NA Blog Post Quantitative NA KEY! List of factors impacting use of MIRT Math, Science, OCR (2023a) U.S., K-12 Computer Report Mixed Methods NA Student access to education Science OCR (2023b) U.S., K-12 NA Report Quantitative NA Student enrollment OECD (2016a) Global, 15-year- Science, PISA, Mixed U.S.-specific report on 2015 olds Reading, Math Report Methods NA data Global, 15-year- Science, PISA, Mixed Map of participating countries; OECD (2016b) olds Reading, Math Report Methods NA 2015 results focusing on equity in education OECD (2017a) Global Science Framework PISA NA 2015 combined framework including science Key! 
Technical report for PISA OECD (2017b) Global, 15-year- Science, Report PISA, Mixed NA 2015 [Annex F – technical olds Reading, Math Methods standards; Annex A – item codes and counts] OECD (2018) Global, 15-year- Science, PISA, Mixed olds Reading, Math Report Methods NA Results in focus, data overview OECD (2019) Global Science Framework PISA NA 2018 science framework OECD (2020) Global Science Report PISA NA Strategic vision for 2024 science assessment 181 Author/s or Editor/s or Student Content Reference Measurement Barriers to Using MIRT Abbreviated Title Demographics Domain Type Type Model Big Idea/s (Date) OECD (2023) Global Learning in Digital World Report NA NA Digital assessment framework OECD (n.d.-a) Global, 15-year-olds NA Webpage NA NA Overview of PISA OECD (n.d.-b) Global NA Webpage NA NA Frequently asked questions about PISA OECD (n.d.-c) Global, 15-year- Science, olds Collaborative Report PISA, Mixed Methods NA 2015 released field trial items Problem Solving Osteen (2010) MSW students NA Article Quantitative NA Integrating CFA and MIRT Ostlund et al. NA NA Article Mixed Methods NA Triangulation of findings from (2011) mixed methods research Nigeria, Using integrated science Otarigho & Secondary Integrated Article Qualitative NA curriculum to provide students Oruese (2013) Schools Science with an understanding of how science affects everyday lives Higher error for ability in unidimensional than Park et al. (2019) Belgium, 6- to 8- MIRT, year-olds Number Sense Article Quantitative multidimensional if item MIRT in adaptive learning set is truly systems multidimensional Park et al. (2020) NA NA Blog Post Quantitative NA ICC overview with item parameters Developing transferable Pellegrino & ELA, Math, and knowledge and skills in 21 st Hilton (2012) U.S., K-12 Science Book NA NA century; emphasis on science inquiry (for possible use in my discussion chapter) NRC - Pellegrino Intersection of student learning et al. (2001) U.S., K-12 All Book NA NA and assessment (for possible use in my discussion chapter) Pelz (n.d.) NA NA Webpage Social Science Research NA Takes at least two dimensions to be multidimensional 182 Author/s or Editor/s or Student Content Reference Measurement Barriers to Using MIRT Abbreviated Title Demographics Domain Type Type Model Big Idea/s (Date) Pierson et al. Learning progressions; tension (2019) U.S., K-12 Science Article NGSS NA over teaching fact memorization vs. inquiry Reading, Math, Science, Collaborative PISA USA (2015) U.S., 15-year-olds Problem Brochure NA NA Students volunteer to take PISA Solving, if randomly selected by OECD. Financial Literacy KEY! Only 17% for science is independent of the general Pokropek (2022) Global, 15-year- Science, Math, Article PISA, ability factor and can be olds Reading Quantitative NA attributed to specific science ability factor (for possible use in my discussion chapter) Polites et al. 
Conceptualizing (2012) NA NA Article Mixed Methods dimension relationships Defining multidimensionality using theory Reckase (1985) NA NA Article Quantitative NA Multidimensional item difficulty An item requiring two Reckase (1989) NA NA Article Quantitative cognitive skills to solve Application of MIRT – hold for may still be discussion unidimensional Reckase (1990) NA NA Paper Presentation Quantitative NA Defining dimensionality Multiple items can be Reckase (1997) NA NA Article Quantitative selected based on MIRT, Future directions for MIRT; but modeled early MIRT development unidimensionally Reckase (2009) NA NA Book Quantitative Small number of items on Psychological and educational test context for MIRT 183 Author/s or Editor/s or Student Content Reference Measurement Barriers to Using MIRT Abbreviated Title Demographics Domain Type Type Model Big Idea/s (Date) Reed & Wolfson Learning progressions assume (2021) U.S., HS, College Chemistry Article Qualitative NA linear learning path; LPs not used by all Massachusetts scores at a high Reis (2016) U.S., Reading, Newspaper Massachusetts Science, Math Article PISA NA level similar to country leaders of PISA 2015 Newspaper Using test scores to limit bias Richman (2023) K-12, Texas Math Article NA NA and determine which math class a student can take Richman & Crain Newspaper Teacher shortages lead to (2022) K-12, U.S. NA Article NA NA accepting teachers with less training Ruiz-Primo & Li Global, 15-year- PISA 2006 and (2015) olds Science 2009, NA How item context affects Quantitative student performance Saunders et al. Saturation evaluation and (2018) NA NA Article Qualitative NA grounded theory Scalise (2017a) U.S., MS Science Article Quantitative NA MIRT model for tech-enhanced items Scalise (2017b) NA Neuroscience Book NA NA Describes how students learn Scalise & Clarke- U.S., MS Science Inquiry Article Quantitative, KEY! Compares the fit of Midura (2018) MIRT NA different IRT models to the data; Bayes net for process data Scalise & Gifford Overview of item types and (2006) NA NA Article NA Can item type affect multidimensionality? constraints (for possible use in my discussion chapter) Scalise & Wilson 21stNA Century Multidimensionality in (2011) Learning Article Quantitative NA constructs Scalise et al. Literature NA STEM Article Review and NA Digital accommodations for (2018) Analysis students Scalise et al. (2021) NA NA Article Quantitative NA Learning analytics; figure 1 184 Author/s or Editor/s or Student Content Reference Measurement Barriers to Using MIRT Abbreviated Title Demographics Domain Type Type Model Big Idea/s (Date) Siegel (2006) NA NA Article Qualitative NA Epistemological diversity in education research Socha (n.d.) NA NA Unpublished article Quantitative Large sample size MIRT assumptions; ICS analog to ICC Comparing MIRT to UIRT when Spencer (2004) NA NA Dissertation Quantitative NA retrieving item difficulties and differentiation Stehle & Peters- Burton (2019) U.S., HS STEM Article Quantitative NA U.S. students underperforming in science Strauss (2019) Global, 15-year-olds NA Newspaper Article PISA NA Negative aspects of PISA scores Taut & Palacios Global, 15-year- Intended and unintended (2016) olds NA Article PISA NA interpretations and uses of PISA results Thomson et al. Australia Scientific Report PISA NA Overview of scientific literacy as (2013) Literacy assessed by PISA No meaningful change in science scores; U.S. 2nd generation immigrants worst Tucker (2016) U.S. 
Reading, Math, Newspaper Science article PISA NA educated; math teachers lack appropriate training; poor recruitment strategy of teachers Tulodziecki (2012) NA NA Article Qualitative NA Epistemic equivalence Newspaper Detracking students to increase Turcotte (2023) HS Math article NA NA equity led to increased math performance Tykoski (2017) U.S., HS ESS Blog Post NA NA Lack of ESS in IB and AP courses Japan, higher, Uesaka et al. middle, and Self-regulated Usefulness of IRT to classroom (2022) lower-ranked learning Article Quantitative NA instruction universities 185 Author/s or Editor/s or Student Content Reference Measurement Barriers to Using MIRT Abbreviated Title Demographics Domain Type Type Model Big Idea/s (Date) Ulitzsch & Nestler (2022) NA NA Article Quantitative NA Bayesian IRT model Description of flow of course delivery in the sciences, i.e., biology to chemistry to physics, Vazquez (2006) U.S., HS Biology, Physics, Chemistry Article Qualitative NA and if it should change to a physics first approach – save for discussion on inequity in course access Purposes for and guidelines of Venkatesh et al. Information mixed methods research; (2013) NA Systems Article Mixed Methods NA developing meta-inferences; validity of mixed methods research KEY! Extension of 2013 article Venkatesh et al. NA NA Article Mixed Methods NA with variations of mixed (2016) methods research; epistemology Voogt & Roblin International 21 st Century Document selection; screening (2012) Competencies Article Qualitative NA framework for sub-themes (see Table 3) Document analysis; set Wach (2013) NA NA Article Qualitative NA inclusion criteria; coding; validity Reading, Math, Science, U.S., Global, 15- Collaborative Walker (2016) Problem News Article PISA, Mixed NA U.S. scores remain in middle of year-olds Solving, (web) Methods other country averages for 2015 Financial Literacy Wang (2021) NA NA Article Quantitative Costs of large data sets Unidimensionality assumption; history of MIRT 186 Author/s or Editor/s or Student Content Reference Measurement Barriers to Using MIRT Abbreviated Title Demographics Domain Type Type Model Big Idea/s (Date) Items on multiple dimensions leads to Wang & Nydick NA NA Article Quantitative, increased variability with regards to difficulty Algorithms for non-(2015) Simulation parameters on each compensatory MIRT models dimension, which results in decreased information Chapter 3 focuses on history of Welch (1977) U.S., K-12 Science Book NA NA integration for teaching science subdomains Wess et al. (2021) NA NA Book Quantitative NA Ch. 4: Test quality with regards to types of validity Western Definitions and key Governors NA NA Blog Post Epistemology NA characteristics/principles of University (2020) social constructivism Wilson (2013) NA NA Article Quantitative NA IRT overview; changes from CCT Winarno et al. International Science Article Literature NA Problems with teaching (2020) Review integrated science classes Yen & Leah (2007) K-12 EL Presentation Quantitative, Number of parameters to KEY! Exploratory approach to Paper EFA be estimated MIRT model emphasizes finding the best fitting model You et al. 
(2020); Global, 15-year-olds; Science; Article; PISA, Quantitative; NA; School characteristics impact scientific literacy in students
Zakharov (2016); NA; NA; Article; Quantitative; NA; Cluster analyses should be evaluated for reliability and validity – save for discussion
Zhao & Hambleton (2017); NA; NA; Article; Quantitative, Simulation; NA; Consequences of IRT model misfit

APPENDIX F: PISA 2015 SCIENCE FRAMEWORK71

71 From chapter 2 in PISA 2015 Assessment and Analytical Framework: Science, Reading, Mathematic, Financial Literacy and Collaborative Problem Solving, revised edition (OECD, 2017).

[The framework chapter is reproduced in full on the pages of this appendix in the original document.]

APPENDIX G: DISSERTATION TIMELINE