Investigating Content Multidimensionality in a Large-scale Science Assessment: A Mixed Methods Approach

by
Cassandra N. Malcom

A Dissertation accepted and approved in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Quantitative Research Methods in Education

Dissertation Committee:
Dr. Kathleen Scalise, Chair and Advisor
Dr. Dianna Carrizales-Engelmann, Core Member
Dr. George Harrison, Core Member
Dr. Joanna Goode, Core Member
Dr. Beth Harn, Institutional Representative

University of Oregon
Spring 2024

© 2024 Cassandra N. Malcom
This work is licensed under a Creative Commons CC BY-NC 4.0 license.

DISSERTATION ABSTRACT

Cassandra N. Malcom
Doctor of Philosophy in Quantitative Research Methods in Education
Title: Investigating Content Multidimensionality in a Large-scale Science Assessment: A Mixed Methods Approach

Science, Technology, Engineering, and Math (STEM) skills are increasingly required of students to be successful in higher education and the workforce. Therefore, modeling assessment outcomes accurately, often using more types of student data to get a complete picture of student learning, is increasingly relevant. The Programme for International Student Assessment (PISA) is promoted as a summative assessment opportunity that includes a science framework. As with many science assessments, the framework includes Life, Physical, and Earth science, which alone seems to imply multidimensionality; other sources of dimensionality also appear to be described conceptually in the framework. Using data from the 2015 PISA science assessment, a multidimensional item response theory (MIRT) model was fit to examine how a multidimensional model operates with the data. Before the MIRT model was developed, the framework was qualitatively reviewed for multidimensionality, and exploratory analyses of the quantitative data were conducted, including a data science technique for exploring multidimensionality and several factor analysis techniques. After fitting the MIRT model, it was compared to several unidimensional IRT (UIRT) models to determine the model that explains the most variation. The qualitative analyses generated evidence of multidimensional science content domains in the 2015 PISA science framework, which should require a MIRT model, but the quantitative analyses indicated that a unidimensional model was more practically significant. Once the quantitative results were triangulated with the qualitative review of the framework for multidimensionality, the implications for equity and the history of harm with regard to science assessments were discussed. Findings from the qualitative and quantitative aspects of the study were used to generate recommendations for different stakeholders.

Keywords: multidimensionality, item response theory, STEM education, summative assessment, large-scale assessment, qualitative framework review

CURRICULUM VITAE

NAME OF AUTHOR: Cassandra N.
Malcom GRADUATE AND UNDERGRADUATE SCHOOLS ATTENDED: University of Oregon, Eugene Southwest Texas State University, San Marcos DEGREES AWARDED: Doctor of Philosophy, Quantitative Research Methods in Education, (all but dissertation expected to be completed in 2024), University of Oregon Master of Science, Biology, 2003, Southwest Texas State University Bachelor of Science, Marine Biology, 2001, Southwest Texas State University AREAS OF SPECIAL INTEREST: Science Education Measurement and Assessment Collaboration and Inquiry in Science PROFESSIONAL EXPERIENCE: Graduate Instructional and Research Assistant, University of Oregon College of Education, 2020-Present Science Content Writer/Reviewer Independent Contractor, Hurix Digital, 2022 Science Coordinator and Assessment Specialist III, Educational Testing Service (ETS), 2008-2021 Science Department Chair and Teacher, Robert G. Cole High School, 2004-2008 Science Teacher Intern, Nancy Ney Charter School, 2003-2004 Science Adventure Club Teacher, Witte Museum, 2002 Graduate Instructional and Research Assistant, Southwest Texas State (SWT) University 6 Biology Dept., 2001-2003 Undergraduate Instructional Assistant, SWT University Biology Dept., 2001 GRANTS, AWARDS, AND HONORS: Spot Award, ETS, 2011-2013, 2017, and 2019 President’s Award, ETS, 2012 Academic Excellence Award, SWT University Biology Dept., 2003 Ruth Strandman Field Biology Scholarship, 2000 Houston Livestock and Rodeo Scholarship, 1996-1998 Dean’s List, SWT University, 1996-1998 Ford Scholarship, 1996 National Dean’s List, 1996 Girl Scout Gold Award, 1995 PUBLICATIONS: Scalise, K., Malcom, C., & Kaylor, E. (2023). Chapter 8: Analysing and integrating new sources of data reliably in innovative assessments. In N. Foster & M. Piacentini (Eds.), Innovating assessments to measure and support complex skills (pp. 138-150). OECD Publishing. https://doi.org/10.1787/e5f3e341-en Scalise, K., Malcom, C., & Kaylor, E. (2023). Chapter 13: A tale of two worlds: Machine learning approaches at the intersection with educational measurement. In N. Foster & M. Piacentini (Eds.), Innovating assessments to measure and support complex skills (pp. 216-224). OECD Publishing. https://doi.org/10.1787/d01eb8a4-en Malcom, C. & ETS Data, Analysis, and Reporting (DAR) Group (2020, December). National assessment of educational progress (NAEP) science 2019 operational assessment data [Conference presentation]. National Center for Education Statistics (NCES) and the National Assessment Governing Board (NAGB) NAEP IDQC, Princeton, NJ, United States. 7 Lavalli, K. L., Malcom, C. N., & Goldstein, J. S. (2018). Description of pereiopod setae of scyllarid lobsters, Scyllarides aequinoctialis, Scyllarides latus, and Scyllarides nodifer, with observations on the feeding during consumption of bivalves and gastropods. Bulletin of Marine Science, 94(3), 571-601. https://doi.org/10.5343/bms.2017.1125 California Department of Education (CDE) & Malcom, C. (2016, December). California science tests (CAST) and the California alternate assessment (CAA) for science [Conference presentation]. California Educational Research Association (CERA), Sacramento, CA, United States. Malcom, C. N. (2007, September). Description of the setae on the pereiopods of the Mediterranean slipper lobster, Scyllarides latus [Poster presentation]. 8th International Conference and Workshop on Lobster Biology and Management, Charlottetown, Canada. Malcom, C. N. (2003). 
Setae on slipper lobster pereiopods [Scanning electron microscope photographs and drawings]. In Lavalli, K. L., Spanier, E., & Grasso, F., Behavior and Sensory Biology of Slipper Lobsters (pp. 144, 165-167). CRC Press, 2007.

Malcom, C. N. (2003). Description of the setae on the pereiopods of the Mediterranean slipper lobster Scyllarides latus, the ridged slipper lobster, S. nodifer, and the Spanish slipper lobster S. aequinoctialis [Master's thesis, Southwest Texas State University]. https://digital.library.txstate.edu/handle/10877/11901

Malcom, C. N. (2002). Description of the setae on the pereiopods of the Mediterranean slipper lobster, Scyllarides latus, the ridged slipper lobster, S. nodifer, and the Spanish slipper lobster, S. aequinoctialis [Poster presentation]. 8th Colloquium Crustacea Decapoda Mediterranea, Ionian University, Corfu Island, Greece.

Malcom, C. N. (2002). Description of the setae on the pereiopods of the Mediterranean slipper lobster, Scyllarides latus, the ridged slipper lobster, S. nodifer, and the Spanish slipper lobster, S. aequinoctialis [Poster presentation]. Marine Benthic Ecology Meeting, United States.

ACKNOWLEDGMENTS

I wish to express sincere gratitude to my chair and advisor, Dr. Kathleen Scalise, for her invitation to start this educational journey. Her mentorship has guided me throughout and helped me grow as a researcher. Best wishes to her as she starts her next chapter! To my dissertation committee members, Dianna Carrizales-Engelmann, Joanna Goode, Beth Harn, and George Harrison: thank you all for being willing to serve, especially since this process was completed on a tight timeline and for a student based in another state. Your feedback has been invaluable and has strengthened my writing. For the opportunity to continue my learning remotely, I thank the University of Oregon's College of Education. To my gracious copy editors, Dr. Linda A. Malcom and Dr. Angelina Galvez-Kiser, many thanks for poring over the fine details.

Last, I'd like to acknowledge my positionality in writing this dissertation. I do so in counterpoint to the voices that say this type of statement "does no work" in a quantitative research study. For me personally, a positionality statement provides a lens into how a researcher views their world and all its data – it may even uncover biases of which the researcher is unaware. The positionality statement gives voice to underrepresented researchers and allows them to claim how they want to be recognized when too often labels are forced upon them. My positionality is such: This researcher identifies as a liberal, white, cisgender female whose formative education occurred primarily in Texas. As the daughter of working-class parents, she was raised to place a high value on STEAM education and reading in order to better oneself and achieve dreams. Her love of science led her to studying and teaching science and biases her views on data in that she believes science can help answer any question.

DEDICATION

This dissertation is dedicated to my mother, who walked a long, hard road to make sure I got here. Without her love, support, friendship, and her own dissertation journey, I might not have seen what all is possible. And to my cousin, whose faith in me convinced me that there was never any doubt about this journey's conclusion.

TABLE OF CONTENTS

Section Page

DISSERTATION ABSTRACT ...............................................................................................................
3 CURRICULUM VITAE ....................................................................................................................... 5 ACKNOWLEDGMENTS .................................................................................................................... 8 DEDICATION ................................................................................................................................. 10 LIST OF FIGURES ........................................................................................................................... 15 LIST OF TABLES ............................................................................................................................. 17 LIST OF EQUATIONS ...................................................................................................................... 18 LIST OF ABBREVIATIONS ............................................................................................................... 19 CHAPTER 1. INTRODUCTION AND LITERATURE SYNTHESIS .......................................................... 21 Problem Statement ................................................................................................................... 21 STEM Education and U.S. Economy .......................................................................................... 22 Integrating Science ................................................................................................................... 25 History of Harm to Learning Equity by Science Assessments ................................................... 30 Overview of PISA ....................................................................................................................... 35 Historical Background ........................................................................................................... 36 Assessment Cycle .................................................................................................................. 36 2015 Science Framework ...................................................................................................... 37 2015 Assessment Design ....................................................................................................... 42 12 2015 Science Scoring ............................................................................................................. 46 Three MIRT Case Studies .......................................................................................................... 47 Yen and Leah (2007) - MIRT Model for Composite Scores .................................................... 47 Scalise and Clarke-Midura (2018) - The Many Faces of Scientific Inquiry ............................. 49 Li et al. (2012) - Applying MIRT Models in Validating Test Dimensionality ........................... 51 Research Questions .................................................................................................................. 53 Research Question 1 (RQ1) ................................................................................................... 53 Research Question 2 (RQ2) ................................................................................................... 54 Research Question 3 (RQ3) ................................................................................................... 54 CHAPTER 2. METHODS ................................................................................................................. 
56 Developing the Literature Synthesis ......................................................................................... 56 Setting ....................................................................................................................................... 57 Student Demographics ............................................................................................................. 58 Data Collection .......................................................................................................................... 60 Study Sample ............................................................................................................................ 61 Data Analysis – A Mixed Methods Approach ............................................................................ 62 Epistemology ......................................................................................................................... 63 Purpose and Guidelines ........................................................................................................ 65 Step 1: Qualitative Analysis ................................................................................................... 69 Step 2: Quantitative Analysis ................................................................................................ 75 13 Data Triangulation ................................................................................................................. 89 Step 3: Equity Investigation ................................................................................................... 91 CHAPTER 3: RESULTS .................................................................................................................... 92 Results Relating to RQ1 ............................................................................................................. 92 Results Relating to RQ2 ............................................................................................................. 96 Descriptive Statistics ............................................................................................................. 97 RQ2A: Cluster Analyses Results ........................................................................................... 102 RQ2B: PCA Results ............................................................................................................... 103 RQ2C: IRT Results ................................................................................................................ 111 Triangulation ....................................................................................................................... 121 Results Relating to RQ3 ........................................................................................................... 123 CHAPTER 4: DISCUSSION ............................................................................................................ 125 Study Overview ....................................................................................................................... 125 Key Takeaways ........................................................................................................................ 126 A Lack of Synergy Between Results ........................................................................................ 126 Overview of Released Item Set ........................................................................................... 
129 Alternate Sources of Multidimensionality .......................................................................... 135 Impact On Equity ................................................................................................................. 136 Limitations .............................................................................................................................. 137 Threats to Validity and Reliability ........................................................................................... 138 14 Future Research ...................................................................................................................... 139 Policy Recommendations ........................................................................................................ 141 Conclusions ............................................................................................................................. 142 REFERENCES ............................................................................................................................... 144 APPENDIX A: STUDENT ENROLLMENT IN SCIENCE COURSES BY ETHNICITY .............................. 161 APPENDIX B: 2015 PISA AVERAGE SCORES FOR SCIENCE ........................................................... 163 APPENDIX C: 2015 PISA AVERAGE SCORES BY SCIENCE SUBDOMAIN ........................................ 165 APPENDIX D: LITERATURE CONNECTIONS .................................................................................. 169 APPENDIX E: LITERATURE REVIEW MATRIX ................................................................................ 170 APPENDIX F: PISA 2015 SCIENCE FRAMEWORK ......................................................................... 187 APPENDIX G: DISSERTATION TIMELINE ...................................................................................... 217 15 LIST OF FIGURES Figure Page 1. STEM Job Predictions for 2031 ............................................................................................... 23 2. 2019 Science Course Enrollment ............................................................................................ 24 3. Relationships among the Four Aspects .................................................................................. 39 4. Released 2015 PISA Science Item ........................................................................................... 41 5. Comparison of PBA and CBA Assessment Designs ................................................................. 45 6. Comparison of Models ........................................................................................................... 52 7. PISA Science Performance by Country ................................................................................... 58 8. From Science Framework Review to MIRT Model Development ........................................... 73 9. ICCs Based on a Three-parameter Logistic (3PL) Model ......................................................... 76 10. Triangulation for Mixed Methods Research ........................................................................... 90 11. Possible Connections between 2015 PISA Science Content Knowledge ................................ 95 12. Proposed Continuum .............................................................................................................. 96 13. Histogram of Student Average Scores for Full U.S. Science Sample ....................................... 98 14. 
Histogram of Student Average Scores for Item Cluster S10 Full Subsample .......................... 99 15. Histograms of Student Score Point Frequency for Item Cluster S10 Full Subsample ............. 99 16. Distance Heatmap for Item Cluster S10 Full Subsample ...................................................... 100 17. Scree Plot for Item Cluster S10 with Full Subsample ............................................................ 102 18. Scree Plot for Item Cluster S10 with Random Half of Subsample ........................................ 103 19. Scree Plot for Item Cluster S11 ............................................................................................. 103 20. Loadings Bar Plots for Item Cluster S10 with Full Subsample ............................................... 105 16 21. PCA Plot for Item Cluster S10 with Full Subsample .............................................................. 106 22. Loadings Bar Plots for Item Cluster S10 with Random Half of Subsample ........................... 107 23. Confirmation PCA Plot for Item Cluster S10 with Random Half of Subsample ..................... 108 24. Loadings Bar Plots for Item Cluster S11 ................................................................................ 109 25. PCA Plot for Item Cluster S11 ............................................................................................... 110 26. Infit Statistics for 1 PL UIRT Model of Item Cluster S10 with Full Subsample ....................... 113 27. ICC Plots for Item Cluster S10 with Full Subsample .............................................................. 117 28. Wright Map for Item Cluster S10 with Full Subsample ........................................................ 121 29. Triangulation of Results ........................................................................................................ 122 30. Histograms of Student Ability Levels for Item Cluster S10 ................................................... 123 31. Bird Migration Item 1 from Item Cluster S11 ....................................................................... 130 32. Bird Migration Item 2 from Item Cluster S11 ....................................................................... 132 33. Bird Migration Item 3 from Item Cluster S11 ....................................................................... 134 34. Differences in Between-item and Within-item MIRT Models .............................................. 141 35. U.S. High School Physics Enrollment .................................................................................... 161 36. U.S. High School Biology Enrollment .................................................................................... 162 37. U.S. High School Chemistry Enrollment ................................................................................ 162 38. U.S. Mean Scores for Science Stable Over Time ................................................................... 164 39. Connections to the 2012 Li Article ....................................................................................... 169 17 LIST OF TABLES Table Page 1. Three Science Subdomains in 2015 PISA ................................................................................ 40 2. Country Demographic Comparisons ....................................................................................... 59 3. Defining Purpose of Mixed Method Approach ....................................................................... 66 4. 
Guidelines for Mixed Methods Research ............................................................................... 68 5. Trade-offs Between Calibration Methods for a Unidimensional Score .................................. 84 6. Evidence Supporting Dimensionality Themes ........................................................................ 93 7. Descriptive Statistics for Item Cluster S10 Full Subsample ..................................................... 97 8. Means (M), Standard Deviations (SD), and Correlations with Confidence Intervals (CI) for Item Cluster S10’s Full Subsample ........................................................................................ 101 9. Item Groupings for MIRT Models ......................................................................................... 111 10. Model Fit Indices for Comparison of Relative Model Fit – Item Cluster S10 Subsample ..... 113 11. Comparison of Model Fit – Item Cluster S10 Subsample ..................................................... 114 12. Model Fit Indices for Comparison of Relative Model Fit – Item Cluster S11 Subsample ..... 116 13. 2015 PISA Country Rankings by Average Score in Science ................................................... 163 14. 2015 PISA Country Rankings by Average Score in Science Subdomain ................................ 165 15. Results of Literature Review ................................................................................................. 170 18 LIST OF EQUATIONS Equation Page 1. 1PL UIRT .................................................................................................................................. 87 2. 2PL UIRT .................................................................................................................................. 87 3. 1PL MIRT ................................................................................................................................. 88 4. 2PL MIRT ................................................................................................................................. 
88

LIST OF ABBREVIATIONS

Term (Abbreviation): Definition1

Akaike Information Criterion (AIC): An estimate of model fit based on the number of model parameters and log-likelihood that favors more complex models and smaller sample size; the smaller AIC indicates better fitting model when two models are compared.

Bayesian Information Criterion (BIC): An estimate of model fit that favors simpler models by adding a penalty for more parameters; the smaller BIC indicates better fitting model when two models are compared.

Civil Rights Data Collection (CRDC): NA

Classical Test Theory (CTT): States an observed test score is the sum of a student's true score and random error.

Computer Adaptive Testing (CAT): An online test designed to increase or decrease test difficulty based on a student's ability as shown during the test by providing an easier or harder item dependent on the score of the last item given.

Computer-based Assessment (CBA): An assessment designed to be delivered and administered via a computer or tablet.

Degrees of Freedom (df): Number of independent variables that are free to vary in a data sample.

Diversity, Equity, and Inclusion (DEI): NA

Expected A Posteriori (EAP): An estimate of the predicted value for the latent trait posterior probability distribution.

Exploratory Factor Analysis (EFA): Identifies the latent traits then builds a linear model of the variables.

Gross Domestic Product (GDP): NA

Item Characteristic Curve (ICC): A graph of a probability of a correct response versus a student's ability.

Item Information-weighted Fit (Infit): "Information" refers to the variance of the observations.

Item Response Theory (IRT): Explains the relationship between a latent trait and an observable outcome.

Local Item Dependence (LID): The items in an assessment may be related in that an answer on one item indicates a higher chance of answering other items in a similar manner, even when conditioned on proficiency estimate.

Markov Chain Monte Carlo (MCMC): Generates a random sample of a target distribution with a large number of dimensions where each MCMC sample is dependent on the prior MCMC sample; can estimate the sum as either the mean or variance of drawn samples.

1 Statistical terms are provided with definitions that are summarized by the researcher. Terms that have no applicable statistical definition are marked NA.
Maximum Likelihood Estimation (MLE): Finding the parameter values that give a curve best fitting the data.

Maximum Marginal Likelihood Estimation (MMLE): Estimates the parameters that are most likely of the expected probability distribution based on observed data.

Multidimensional IRT (MIRT): Can model an assessment measuring multiple traits.

National Assessment of Educational Progress (NAEP): NA

National Center for Education Statistics (NCES): NA

Next Generation Science Standards (NGSS): NA

Office for Civil Rights (OCR): NA

One-parameter Model (1PL): Simplest model that describes the latent trait (i.e., ability) based only on the difficulty parameter.

Organization for Economic Co-operation and Development (OECD): NA

Paper-based Assessment (PBA): An assessment designed to be delivered and administered via a paper form.

Principal Component Analysis (PCA): The number of observed variables is reduced to a decreased number of principal components, which account for the most variance of the observed variables.

Programme for International Student Assessment (PISA): NA

Root Mean Square Error (RMSE): Is the root of the mean of the squared errors between observed and predicted values; measures the error of a model when predicting quantitative data.

Root Mean Square of the Residuals (RMSR): Is the square root of the mean of squared residuals and measures badness-of-fit for a model (0 indicating perfect model fit).

Science, Technology, Engineering, and Math (STEM): Refers to education containing these subjects, or a required set of skills needed for the workforce. Sometimes the Arts are included, in which case the acronym becomes STEAM, which is out of scope in this dissertation.

Standard Deviation (SD): A measurement of the spread of data in relation to the mean of the population that indicates variability.

Standard Error (SE): An inferential statistic that indicates the reliability of a sample population mean compared to the actual population mean.

Three-parameter Model (3PL): Model that describes three parameters (difficulty, discrimination, and guessing) in relation to the latent trait.

Two-parameter Model (2PL): Model that describes two parameters (difficulty and discrimination) in relation to the latent trait.

Unidimensional IRT (UIRT): Models an assessment measuring a single latent trait.

United States (U.S.): NA

CHAPTER 1. INTRODUCTION AND LITERATURE SYNTHESIS

Problem Statement

Three science subdomains (Life, Physical, and Earth and Space) that are commonly assessed in large-scale assessments are qualitatively describable as multiple dimensions based on teaching pedagogy and student ability. For a construct, in this instance science, to be considered multidimensional, its dimensions2 (e.g., the science subdomains) should be different from each other yet connected to the theorized construct – see section Defining Dimensionality (Ch. 2) for further information (Polites et al., 2012). However, data from national and global large-scale science assessments are typically quantitatively modeled with unidimensional item response theory (or UIRT) models rather than multidimensional IRT (or MIRT) models. IRT is a method of examining the relationship between something intangible, such as science ability, and how that latent trait manifests in, for example, scores on a set of science items – see section Primer on Item Response Theory (Ch. 2) for a more detailed description. Some researchers advocate that we should more accurately model student data from large-scale science assessments by using MIRT models.
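To preview this distinction before the Primer on Item Response Theory in Chapter 2, the expressions below show one standard way the unidimensional and multidimensional two-parameter logistic (2PL) models listed in the List of Equations are often written; they are illustrative standard forms, and the exact parameterization adopted in Chapter 2 may differ.

$$P(X_{ij}=1 \mid \theta_j) = \frac{\exp\left[a_i(\theta_j - b_i)\right]}{1 + \exp\left[a_i(\theta_j - b_i)\right]} \quad \text{(2PL UIRT)}$$

$$P(X_{ij}=1 \mid \boldsymbol{\theta}_j) = \frac{\exp\left(\mathbf{a}_i^{\top}\boldsymbol{\theta}_j + d_i\right)}{1 + \exp\left(\mathbf{a}_i^{\top}\boldsymbol{\theta}_j + d_i\right)} \quad \text{(compensatory 2PL MIRT)}$$

Here $\theta_j$ is student $j$'s single latent science ability, $a_i$ and $b_i$ are item $i$'s discrimination and difficulty, $\boldsymbol{\theta}_j$ is a vector of abilities (for example, one per science subdomain), $\mathbf{a}_i$ is a vector of discriminations indicating which dimensions item $i$ draws on, and $d_i$ is an item intercept. The 1PL versions constrain the discriminations to a common value.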
The disconnect between how large- scale assessments report data, such as on science subscales3, while the data is modeled unidimensionally, could impact policy decisions, instruction, and other aspects of the education experience. IRT models can allow teachers and students to consider and discuss how student ability is impacted by item parameters, such as difficulty (Uesaka et al., 2022), but only if the appropriate IRT models are used. This problem will be explored here using a mixed methods 2 At least two are needed to be multidimensional. 3 For 2015 PISA subscales were based on the science subdomains of life, physical, and Earth and Space systems – see Appendix C. 22 approach, where qualitative (document exploration) techniques are used to better understand the conceptual framework claims in a framework for which the United States (U.S.) sample of the quantitative data set is next explored quantitatively for some aspects of dimensionality. STEM Education and U.S. Economy Understanding the current status of science education with regards to the U.S. economy will help illustrate why there is a desire for Science, Technology, Engineering, and Math (STEM)4 assessment research, such as the more accurate modeling noted above. Citizen science and the need in life for understanding and interpreting personal STEM contexts is very important for decision making in modern society. For instance being able to have enough STEM knowledge to understand and make decisions regarding implications for vaccination, masking, and other precautions in the recent COVID-19 pandemic would have helped citizens, especially in early days of the pandemic given the absence of a fuller understanding. This is one example, of which there are many others, of how decision making by individuals in society may interact with their base STEM understanding. Students’ cognitive skills and general knowledge can impact a nation’s economy (Hanushek et al., 2008). A highly skilled workforce helps nations be competitive in the global marketplace (Hanushek et al., 2008). With regards to the economies of many nations, there is a long-standing link between more STEM education and greater productivity and creativity from the workforce in generating solutions, whether they be environmental, technological, or industrial (Hanushek et al., 2008). Over the years, this link has led to increased government interest in STEM education and in the status of the STEM workforce (Kelley & Knowles, 2016). 4 Sometimes including Arts for STEAM 23 For example, in 2021, the U.S. government spent approximately 3.9 billion on STEM education (Lips & Moritz, 2023). By 2031, the U.S. Department of Labor predicts the STEM workforce will grow by almost 11%, which is more than two times faster than the predicted growth rate of other occupations, see Figure 1 (Krutsch & Roderick, 2022). Figure 1 STEM Job Predictions for 2031 Note. These predictions do not include careers related to just the Arts, as in STEAM where Arts are added to STEM. From “STEM day: Explore growing careers,” by E. Krutsch and V. Roderick, 2022, U.S. Department of Labor Blog (https://blog.dol.gov/2022/11/04/stem-day-explore-growing-careers). Copyright 2022, U.S. Department of Labor. Generation of these new STEM jobs will lead to increased need for workers with STEM skills. This is especially true since the anticipated STEM jobs are expected to be enticing for students as they are also predicted to pay more on average (Krutsch & Roderick, 2022). 
In order to have enough STEM skilled workers, STEM education frameworks, associated teaching pedagogy, and 24 related assessments are being evaluated in the hopes that improvements in these areas will lead to more students taking STEM classes then entering a STEM career field. Instead of more students entering STEM tracks in higher education or careers, students in the U.S. are currently falling behind other countries in the mastery of science skills (Hanushek et al., 2008). Under 40% of high school students in public and private schools took a biology, chemistry, and physics course in 2019 (National Center for Education Statistics [NCES], 2022). The majority of these students took at least one biology course, with chemistry as the second course students took most frequently, and physics plus Earth sciences lagged far behind in student enrollment – see Figure 2 (NCES, 2022). Figure 2 2019 Science Course Enrollment Note. Adapted from “High school mathematics and science course completion,” 2022, National Center for Education Statistics: Condition of Education (https://nces.ed.gov/programs/coe/indicator/sod/high-school- courses). Copyright 2022 by the U.S. Department of Education, Institute of Education Sciences. 25 Evidence from the National Assessment of Educational Progress (NAEP) science scores shows students in grades 8 and 12 well below 50% proficiency (Stehle & Peters-Burton, 2019). The Programme for International Student Assessment (PISA) science scores rank the U.S. 25th with a mean score of 496, compared to the Organization for Economic Co-operation and Development (OECD) overall mean of 493, out of 72 participating economies and countries (Organisation for Economic Co-operation and Development [OECD], 2018). Due to this lag, both STEM educators and educational researchers are rethinking how STEM skills are taught and evaluated. Integrating Science Science educational pedagogy used to be focused on the retention of facts (Kaldaras et al., 2021; Pierson et al., 2019; Enger & Yager, 2009), such as memorizing all the parts of a cell. Newer pedagogy focuses on integrating science content and scientific inquiry (Kaldaras et al., 2021; Pellegrino & Hilton, 2013; Enger & Yager, 2009), along with soft skills like creative thinking (Csapó & Funke, 2017). This new pedagogy focus has led to the development of the Next Generation Science Standards (NGSS) in 20135, which advocate for crosscutting concepts, practices, and disciplinary core ideas to be taught, in a type of “sensemaking” effort. Crosscutting concepts in particular focus on applying knowledge across different science subdomains6 (NGSS Lead States, 2013), and on drawing on these concepts to explore new STEM ideas as they arise in a student’s life. This new focus could appear to call for science classes with integrated content, but the realization of that type of curriculum has been hard to achieve. 5 The short timeframe between the NGSS being published, adopted by states, and then incorporated into classroom curriculum and the 2015 administration of the PISA left little room for any impacts on student learning to transfer to the 2015 science portion of the PISA. 6 Scientific inquiry skills are found in the NGSS science and engineering practices and are also considered a separate dimension of learning that can apply across each science subdomain. 26 Policy and/or teacher preparation often leaves the coursework in separate “silos”. 
Assessments also need to change from measurement of constructs only requiring recall to this integrated content that requires students to apply knowledge and make sense of it (Kaldaras et al., 2021). While there was a movement to combine the science subdomains into an integrated curriculum in the 1970s (Welch, 1977) and calls for this approach continue, most U.S. schools do not employ this method and the subdomains remain taught separately with little crossover in science content (Winarno et al., 2020). Winarno et al. (2020) provide several reasons why integration of the subdomains has remained difficult, including: • Educators are often trained in only one of the science subdomains, • Less professional development and college-level training is available for educators, • High schools and state policy are designed around the idea that integrated curriculum in STEM does not yet strongly support higher education goals, since higher education courses often do not draw on integrated frameworks, • Learning still tends to be lecture orientated when labs provide more connections for students, and • Limited availability of integrated science textbooks. In addition, the following factors, some from personal experience, also impact integrating science subdomain instruction. The cost of implementation of integrated science courses is too high for most school districts to develop educator training on how to integrate, plus allow the district to purchase new lab equipment, supplies, and textbooks. A small suburban high school where I taught rejected a new integrated science course as the funds to restructure the science lab with the needed 27 equipment were quite high compared to the current cost of offering three subdomains and a few elective courses. Developing a new curriculum that functionally integrates the 3 science domains can be time restrictive. Integrated science curriculum should not simply be, for example, a restructuring of a physics course to contain life science examples, but rather a deeper dive into learning how both physics and biology play a role together in a shared construct, like muscle movement. This would require a new way of approaching curriculum that provides opportunities for students to dive deep into how physics can help the understanding of the biology of muscles working with bones to create movement in a living organism. Considerable pre-planning, teacher coordination, and community feedback would need to occur for such a course to be successfully delivered. Teacher resistance to the curriculum change, which can be due to the untested nature of how beneficial the new curriculum may be to student learning within the new learning environment, or a tendency to cling to what already works well. All educators have felt this way at some point or another – a dissatisfaction with being asked to change what works well for our students to the newest educational fad without enough teacher buy-in earlier in the process of curriculum redesign. While working on a NGSS-focused assessment redesign for a state, I was able to hear some of the frustrations teachers felt regarding how the three-dimensional learning required by NGSS was to be successfully implemented, especially for students with special needs. Parent and student concern that integrated courses do not offer enough depth of subdomain content to prepare students for college. 
For example, a student may be more interested in 28 becoming a cell biologist and feel that less time will be devoted to cells in a course that has to cover major concepts from life, physics, chemistry, and Earth sciences. The more intricate details of cell biology may be missed unless the student takes additional courses, such as an advanced biology course. See Johnson’s (2019) EdSource article7 for insight into the parent and student concerns over integrated science courses at a California high school. In a similar vein, policymaker and public resistance due to the belief that some science subdomains, when integrated, will lose teaching and learning minutes (opportunity to learn issues) since available time will likely be divided up between several science subdomains. Instead of a student taking three different science courses spread out over three years, they may end up taking two integrated science courses spread out over two years. Students can often choose to take an elective science course, but counseling, the degree of informed choice, and access to such courses often varies by socioeconomics and can include substantial racial, ethnic, and gender disparities (Gao et al., 2019). Assessments may not accurately show trends in data from the non-integrated year to the integrated year, so student growth can be harder to map. Many teachers develop their own assessments and unless a school requires a standardized formative assessment at the end of each science course the reasons for changes in student performance may be unknowable for many individuals. Finally, there is often misuse of integrated science courses as low-level science classes. In the past, I have heard school counselors refer to integrated science courses as a “dumping ground” for students who are not doing well in math and so cannot manage physics 7 Located here: https://edsource.org/2019/how-one-high-schools-dispute-reflects-the-struggle-to-teach- californias-science-standards/618752 29 or chemistry, have less potential/desire for pursuing a science career, or who have special education needs. This filtering of students is obviously not the intended use of integrated science education and advocates for integrated science instruction will tell you that a better vision of an integrated science course is to prepare all students for dealing with science in their daily lives (Otarigho & Oruese, 2013). All of the above restraints have led to integrated science curriculum not being fully applied in high schools throughout the U.S. Since high schools still offer distinct classes for each science subdomain, one might then expect large-scale science assessment data that is generated from an assessment of these subdomains to be modeled using MIRT, whereas UIRT is vastly employed as the standard.8 By using an UIRT model the assessment designers may be impacting the assessment’s consequential validity by assuming all students have the similar access to similar course content taught in a similar way. As educators struggle to mirror the three-dimensional aspects of NGSS, so too do large- scale assessments like PISA wrestle with the challenges of modelling quantitative data. MIRT models will only be appropriate if what is anticipated by frameworks to be extensively multidimensional data actually shows such patterns. Alternatively, explaining why such data sets do not show such patterns if frameworks anticipate them is theoretically also a challenge. 
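Because the appropriateness of a MIRT model ultimately rests on whether it fits the observed response patterns better than a UIRT model, relative fit indices such as AIC and BIC (defined in the List of Abbreviations) are commonly compared across the competing models. The sketch below is a minimal illustration of that comparison; the log-likelihoods, parameter counts, and sample size are hypothetical placeholders rather than results from this study, and the dissertation's actual model comparisons follow the procedures described in Chapter 2.

```python
import math

def aic(log_likelihood: float, n_params: int) -> float:
    # Akaike Information Criterion: smaller values indicate better relative fit.
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    # Bayesian Information Criterion: adds a sample-size-dependent penalty,
    # so it favors simpler models more strongly than AIC does.
    return n_params * math.log(n_obs) - 2 * log_likelihood

# Hypothetical values for a UIRT and a MIRT model fit to the same item cluster.
n_students = 1500                                        # placeholder sample size
fits = {
    "UIRT (2PL)": {"loglik": -10450.0, "n_params": 36},  # e.g., 18 items x (a, b)
    "MIRT (2PL)": {"loglik": -10410.0, "n_params": 57},  # extra loadings and covariances
}

for name, f in fits.items():
    print(f"{name}: AIC = {aic(f['loglik'], f['n_params']):.1f}, "
          f"BIC = {bic(f['loglik'], f['n_params'], n_students):.1f}")
```

With these placeholder numbers the two indices disagree: AIC, which penalizes complexity less, slightly favors the MIRT model, while BIC favors the UIRT model. Disagreements of this kind are one reason model comparisons typically report more than one fit index.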
Educators and governments are often calling for the need to be able to assess these skills with the need to use more advanced modeling if necessary to evaluate the student data from those assessments, such as MIRT or hybrid IRT models. This study aims to first explore a framework 8 While the goal of this study is not to determine which method of course delivery, e.g., integrated or nonintegrated, is most appropriate, the main method of course delivery in U.S high schools as nonintegrated subdomains of science may indicate multidimensionality. 30 for multidimensional claims, then use some data analytic and MIRT models to explore some 2015 PISA science scores from U.S. students to determine whether models can help showcase some of the implied multidimensionality. This might include how the three subdomains of the 2015 PISA science framework, i.e., physical, life, and Earth and Space systems9, are distinct dimensions of student learning with supporting evidence from a qualitative analysis of the science subdomains in the 2015 PISA framework, or other aspects of multidimensionality. If a MIRT model substantially improves fit for some portions of the 2015 PISA science data better than unidimensional IRT models this could indicate that some of the latent trait skills are quite different from each other with regards to student ability. Not using a MIRT model when multidimensionality exists could potentially cause harm to students individually or in aggregated groups through invalid inferences being made about their ability in each subdomain (Spencer, 2004). In addition, harm could be caused to educational institutions trying to make policy decisions based on an incorrect understanding of student outcomes. These policies can be especially impactful if such decisions marginalize students of color. History of Harm to Learning Equity by Science Assessments Potential social impacts from how assessments are used is at the core of consequential validity (Iliescu & Greiff, 2021). Messick (1993, p. 5) defines consequential validity as an aspect of construct validity, which “appraises the value implications of score interpretation as a basis for action as well as the actual and potential consequences of test use, especially in regard to 9 For those familiar with NGSS, these three subdomains fall into the NGSS dimension of core ideas. The NGSS dimensions of crosscutting concepts and science and engineering practices, such as inquiry, are outside the scope of this study. My use of the term “dimension” throughout this dissertation is meant to refer to aspects of multidimensionality in the PISA international framework such as science subdomains or other components in the framework, and not specifically to NGSS concepts, which tend to be U.S. embedded. 31 sources of test invalidity related to issues of bias, fairness, and distributive justice.” When analyzing assessments for validity in general, Iliescu and Greiff (2021) advocate for validity to be applied to inferences, or claims made, rather than the instrument itself in order to focus on consequential validity. They further argue that more research is needed into the “social consequences of testing” as the effects of assessments on specific subpopulations and society in general directly draw a line to educational diversity, equity, and inclusion (DEI). 
If the opportunity to learn hinges on the learning tools10 available to a student, then any learning tool inequity can impact the validity of inferences made from an assessment (American Educational Research Association [AERA], 2014). Furthermore, the scores from that assessment may create additional ripples of inequity if student placement and future opportunities to learn hinge on those scores. We need to be able to redesign assessments to act as tools of equity rather than furthering existing inequities in education by assessing students on material they have not had the opportunity to learn. The Center for Professional Education of Teachers (CPET) (n.d.) suggests assessment practices can become more equitable if we: • “Ensure our assessments align with what we actually teach • Formatively assess students on a regular basis • Differentiate assessment products whenever possible • Offer a variety of ways to demonstrate mastery • Be flexible (but not too flexible), and offer time to make up assessments • Create relevant, engaging assessment methods 10 Learning tools encompasses (but is not limited to) curriculum, technology, books, instructional aids, and lab equipment. 32 • Make assessments rigorous, not rote • Develop and maintain a growth mindset • Emphasize effort and progress, not grades • Acknowledge and cultivate students' strengths and talents” Not all of these suggestions apply to PISA and the intended use of its scores – “to evaluate education systems worldwide” in order to determine “how well students, at the end of compulsory education, can apply their knowledge to real-life situations and can therefore fully participate in society” (OECD, n.d.-a). For example, formative assessments align more with school responsibilities. However, the outcomes from some suggestions could be impacted by PISA. Due to the global nature of PISA, OECD cannot guarantee that it is aligned with what is taught in the schools of every country (OECD, n.d.-a), yet could OECD make its assessments more culturally responsive? With regards to differentiating assessments, PISA did not provide an adaptive11 test that differentiates based on student ability for science in 2015 (OECD, n.d.-b), but is making progress on including more adaptivity (multi-stage) in later cycles in various content areas (OECD, n.d.-a). Taken together, substantial deficits in assessment equity could impact interpretation of assessments results even if the assessment’s intended use is at a more aggregated level. OECD (2016a) defines equity in education as calling for “opportunities to acquire these [science] skills12 should be independent of students’ backgrounds.” PISA is used to analyze education equity through several lenses by: 1) examining “variation in the distribution of 11 OECD does currently have an adaptive test for reading and math (OECD, n.d.-b). 12 Students “have a basic understanding of science that will help them become informed citizens in a world shaped by science and technological progress. (OECD, 2016a).” 33 student outcomes, especially whether students acquire a baseline level of skills, as a way to assess the inclusiveness of school systems”, 2) determining “impact of students’ backgrounds on their outcomes at school”, and 3) exploring if “access to educational resources and the incidence of sorting practices varies between students of different backgrounds as a way to identify some of the factors that mediate their association with performance. (OECD, 2016a).” For the U.S. 
11.4% of the variation in science performance by students can be linked to their socio-economic status (OECD, 2016a). Language considerations mostly are outside the scope of this investigation, but there is substantial infrastructure in PISA within and across countries for multiple language support and translation, mostly addressed by national considerations regarding their country’s students (OECD, 2017b). In OECD’s (2016a) report on Country Note: Key Findings from PISA 2015 for the United States, the performance and equity of the U.S. educational system is compared with those same aspects of other countries13 that OECD has identified as showing high or improving levels. The following educational and equity conclusions (OECD, 2016a) were drawn: • “The United States is a wealthy country.” • “The United States spends a large amount on education.” • “There is more variation in socio-economic status14 in the United States than in the other four [comparison] countries.” 13 The four comparison countries identified by OECD are Canada, Estonia, Germany, and Hong Kong (China), which are shown in Table 2. 14 OECD defined socio-economic status (SES) for the 2015 PISA by an index they developed from economic, social, and cultural status variables related to the family background of students (OECD, 2016a, p. 27). The variables included: parental education level, parental occupation, amount of wealthy possessions, and amount of books owned (OECD, 2016a, p. 27). The score is a composite determined from principal component analysis (PCA) and can be compared to other nations (each is weighted equally) since it is standardized to a mean of zero with a standard deviation of 1 (OECD, 2016a, p. 27). 34 • “The United States occupies an intermediate position in terms of the percentage of socio-economically disadvantaged students.” • “A large but not extreme15 percentage of students in the United States have an immigrant background.” • “The United States is a large and complex country.” Another area where the U.S. experiences educational complexity is in the type of and access to science course offerings. Student enrollment in courses covering different science subdomains can vary by student ethnicity. Appendix A provides an overview of student enrollment percentages in U.S. high school science courses (biology, chemistry, and physics) by student ethnicity. Regardless of the science subdomain being taught the majority of students enrolled are White – see Figures 35-37. Therefore, it is important to consider if the inferences made based on PISA science scores can be accurately applied to other subpopulations of students. This effect could be further enlarged if the model used to determine how student ability relates to item parameters is inaccurately modeling the subdomains of science. If the U.S. bases educational policy changes or educational system reform on aspects of PISA ranking (rather than on needs of subpopulations) resulting from a model mismatch there could be unintended consequences for how PISA science data is used to inform economy, policy, and science education. An assessment can be evaluated both in terms of its use and its interpretive inferences, which can occur with probing the model for evidence of its accuracy (Messick, 1989). 15 While OECD does not define what an extreme percentage would be, OECD does clarify that 23% of U.S. students are immigrants and that there are only 5 other OECD-member countries with a greater percent of student immigrants (OECD, 2016a). 
35 For example, if policy makers determine that U.S. students are lagging behind other countries that are global competitors with regards to physical systems and Earth and space systems scores (see Appendix C) and earmark more funding for science education in these areas only, then the inference they are making on how to use PISA scores could be inherently flawed. Singer and Braun (2018, p. 39) describe this “mix of nationalism, fears about global competitiveness, and human nature” as leading to “unitary ‘silver bullet’ solutions based on highly aggregated data.” This leads to an ecological fallacy, which is making inferences at an individual level, whether it be on a student’s learning, a school/district’s performance, or a state’s educational requirements, based on population level data and can generate inaccurate conclusions. While OECD (2018) does clarify that PISA results should not be used to make inferences about individual students, it often makes inferences about school policy. As a country, the U.S. needs to clarify use of these inferences to educators, researchers, and most importantly to the media, which often discusses PISA scores incorrectly. OECD could also further clarify how survey data on school policy should be used per the AERA (2014, p. 23) standard for validity 1.3: “If validity for some common or likely interpretation for a given use has not been evaluated…that fact should be made clear and potential users should be strongly cautioned about making unsupported interpretations.” Overview of PISA The OECD develops, administers, scores, and provides data for the PISA (OECD, n.d.-a). Results from PISA are used to rank countries by their students’ mean domain score (OECD, n.d.- a). As mentioned earlier, these rankings can provide a measure for a nation’s economy and are important to educational policy (Pokropek et al., 2022). 36 Historical Background PISA was first administered by the OECD to students in 2000 (OECD, n.d.-a). The international program continues today, and countries can elect16 to participate and receive information on 15-year-old students about their learning in several content areas (OECD, n.d.- a). The science assessment has been computer-based since 2015 (OECD, n.d.-b) for most countries (Jerrim et al., 2018). Students do not all receive the same items due to the assessment design (OECD, n.d.-b). In the first cycle of PISA in 2000, 43 countries participated, including the U.S. (OECD, n.d.-a). Original participants included 29 countries that are members of OECD and 14 non- member countries (OECD, n.d.-a). PISA has since grown to include more than 90 countries and economies worldwide while serving around 3 million students (OECD, n.d.-b). The assessment is influential for policy development in some countries due to the country education ranking that OECD provides (OECD, n.d.-a). and has been linked to other national assessments in several countries, such as in the U.S. (OECD, n.d.-a). Assessment Cycle The “major” assessment domains found in PISA are traditionally math, reading, and science applied to “everyday activities,” which usually rotate as the primary content in three- year cycles17 (OECD, n.d.-a). For instance, science, as currently defined by OECD (see section 2015 Science Framework below), last served as the major domain in 2015, followed by reading 16 Student participation is also voluntary in the U.S. with an offer of a certificate of 4 hours service from the U.S. Department of Education (PISA USA, 2015). 
Students are chosen randomly by OECD from a list of all eligible students that is provided by each U.S. school that elects to participate with a goal of 42 students per participating school (PISA USA, 2015). 17 Science therefore rotates as a major domain to be the focus of the assessment every nine years (OECD, 2017b). 37 in 2018 and math in 2021 (OECD, n.d.-a). COVID-19 has thrown off some of the administration dates such that science will next be administered as a major domain in PISA in 2025. Some content from the major domains usually reappears as a “minor” domain in PISA in each cycle to help extend trends in non-major years, along with a new innovative domain like collaborative problem solving chosen each cycle since 2012, and sometimes optional domains are delivered in some cycles, such as financial or digital literacy (OECD, n.d.-a). The degree of information available in non-major years, and the rotation of the innovative domains, have become somewhat problematic for some countries since they need information more often and for other reasons (OECD, n.d.-a). 2015 Science Framework18 A content framework like the 2015 PISA science framework drives the discussion around what educators worldwide might think should be taught for students to become proficient in science. There is a committee with science education experts from around the world that reviews the framework, as well as feedback from each participating country for each version of the framework. For the 2015 framework, the focus was on scientific literacy. Scientific literacy is defined as “the ability to engage with science-related issues, and with the ideas of science, as a reflective citizen” (OECD, 2017a). OECD (2017a) claims that a comprehensive list of “all the ideas and theories that might be considered fundamental for a scientifically literate individual” has not yet been made. However, three competencies identified by OECD include: “explain phenomena scientifically”, “evaluate and design scientific inquiry”, and “interpret data and evidence scientifically” to provide evidence for scientific literacy (OECD, 2017a). 18 See Appendix F for the full 2015 PISA science framework. 38 The framework also tries to describe what knowledge is assessable. The 2015 PISA science framework states that knowledge will be assessed if it meets the following criteria: • “has relevance to real-life situations, • represents an important scientific concept or major explanatory theory that has enduring utility, and • is appropriate to the developmental level of 15-year-olds (OECD, 2017a).” Content knowledge will comprise more than half of the assessment according to the developers (OECD, 2017a). The subdomains of science fall squarely into the framework’s knowledge aspect that includes the following elements: “content”, “procedural”, and “epistemic” (OECD, 2017a). Content knowledge includes “facts, concepts, ideas, and theories about the natural world that science has established” (OECD, 2017a) that are historically learned in the classroom by teacher explanation and student exploration of a construct. Putting this altogether is sometimes called “sense making” in STEM. Procedural knowledge is what scientists follow to develop evidence supporting scientific knowledge via “practices and concepts on which empirical enquiry is based such as repeating measurements to minimize error and reduce uncertainty, the control of variables, and standard procedures for representing and communicating data” (OECD, 2017a). 
Sometimes this is considered to be “scientific practice,” although defining specific practices of interest can widen and narrow this framing. Epistemic knowledge derives from “understanding science as a practice, which refers to an understanding of the role of specific constructs and defining features essential to the process of knowledge building in science” and “includes an understanding of the function that questions, observations, theories, hypotheses, models, and arguments play in science; a recognition of the variety of forms of scientific inquiry; and the role peer review plays in establishing knowledge that can be trusted” (OECD, 2017a). Sometimes this is considered to include cross-cutting concepts, although defining specific concepts of interest can widen and narrow this framing.

In addition to knowledge, the PISA framework identifies three other aspects on which it bases its view of scientific literacy: contexts, competencies, and attitudes. OECD (2017a) clarifies that the PISA science assessment “is not an assessment of contexts.” Instead, the knowledge and competencies are assessed “in specific contexts” (OECD, 2017a) so that students must make sense of their STEM thinking in the applied context. Aspects and elements described above are provided in Figure 3. Attitudes are thought of in PISA as possibly interacting with dispositions to learn or to use competencies and knowledge and may be addressed in the extensive PISA questionnaires, depending on the cycle and the selection of questionnaire material; they are mentioned here for clarity but are outside the scope of this research study.

Figure 3
Relationships among the Four Aspects

Note. Adapted from “PISA 2015 Assessment and analytical framework: science, reading, mathematic, financial literacy and collaborative problem solving, revised edition,” 2017, OECD Publishing (http://dx.doi.org/10.1787/9789264281820-en). Copyright 2022 by OECD.

As shown in Table 1, the science content knowledge is classified into three subdomains for this framework: life, physical19, and Earth and space systems (OECD, 2017a).

Table 1
Three Science Subdomains in 2015 PISA
(Subdomain; Code; Knowledge of the Content of Science)

Physical Systems (PS)
PS1 Structure of matter (e.g. particle model, bonds)
PS2 Properties of matter (e.g. changes of state, thermal and electrical conductivity)
PS3 Chemical changes of matter (e.g. chemical reactions, energy transfer, acids/bases)
PS4 Motion and forces (e.g. velocity, friction) and action at a distance (e.g. magnetic, gravitational and electrostatic forces)
PS5 Energy and its transformation (e.g. conservation, dissipation, chemical reactions)
PS6 Interactions between energy and matter (e.g. light and radio waves, sound and seismic waves)

Living Systems (LS)
LS1 Cells (e.g. structures and function, DNA, plant and animal)
LS2 The concept of an organism (e.g. unicellular and multicellular)
LS3 Humans (e.g. health, nutrition, subsystems such as digestion, respiration, circulation, excretion, reproduction and their relationship)
LS4 Populations (e.g. species, evolution, biodiversity, genetic variation)
LS5 Ecosystems (e.g. food chains, matter and energy flow)
LS6 Biosphere (e.g. ecosystem services, sustainability)

Earth and Space Systems (ESS)
ESS1 Structures of the Earth systems (e.g. lithosphere, atmosphere, hydrosphere)
ESS2 Energy in the Earth systems (e.g. sources, global climate)
ESS3 Change in Earth systems (e.g. plate tectonics, geochemical cycles, constructive and destructive forces)
ESS4 Earth’s history (e.g. fossils, origin and evolution)
ESS5 Earth in space (e.g. gravity, solar systems, galaxies)
ESS6 The history and scale of the universe (e.g. light year, Big Bang theory)

Note. Adapted from “PISA 2015 assessment and analytical framework: science, reading, mathematic, financial literacy and collaborative problem solving, revised edition,” 2017, OECD Publishing (http://dx.doi.org/10.1787/9789264281820-en). Copyright 2022 by OECD.
19 The physical systems subdomain seems to encompass both physics and chemistry constructs.

The 2015 PISA science framework further clarifies that in the assessment approximately 36% of the items will be physical, 36% living, and 28% Earth and space systems (OECD, 2017a). An example of a released science item with a focus on life science is provided in Figure 4.

Figure 4
Released 2015 PISA Science Item20
20 The correct response needed to reference “a flower cannot produce seed without pollination” (OECD, n.d.-c).

Note. From “PISA 2015 Assessment and analytical framework: science, reading, mathematic, financial literacy and collaborative problem solving, revised edition,” 2017, OECD Publishing (http://dx.doi.org/10.1787/9789264281820-en). Copyright 2022 by OECD.

Items were delivered in the following contexts: “health, natural resources, the environment, hazards, and the frontiers of science and technology” (OECD, n.d.-c). The contexts were set in “personal, local/national, and global settings” (OECD, n.d.-c). Items were developed to meet one of the following three cognitive demands21:
1. “Low - Carry out a one-step procedure, for example recall of a fact, term, principle or concept, or locate a single point of information from a graph or table.
2. Medium – Use and apply conceptual knowledge to describe or explain phenomena, select appropriate procedures involving two or more steps, organize/display data, interpret or use simple data sets or graphs.
3. High - Analyze complex information or data, synthesize or evaluate evidence, justify, reason given various sources, develop a plan or sequence of steps to approach a problem” (OECD, n.d.-c).
OECD (n.d.-c) defined theoretical difficulty for an item as “a combination both of the degree of complexity and range of knowledge it requires and the cognitive operations that are required to process the item.”
21 These definitions were newly added to the 2015 PISA framework for the scientific literacy domain (OECD, n.d.-c).

2015 Assessment Design22
22 The complex design of test forms, item types, and sampling techniques used to calculate a country’s rank are outside the scope of this study. Information provided in this section is intended to orient the reader to the higher-level details of the design of the science assessment. Complete details regarding assessment design are available in the PISA 2015 technical report (OECD, 2017b).

In terms of the educational theory behind PISA, there is one unifying goal. PISA is intended to be:
A collaborative effort among OECD member countries to measure how well 15-year-old students approaching the end of compulsory schooling are prepared to meet the challenges of today’s knowledge societies. The assessment is forward-looking: rather than focusing on the extent to which these students have mastered a specific school curriculum, it looks at their ability to use their knowledge and skills to meet real-life challenges.
This orientation reflects a change in curricular goals and objectives, focusing more on what students can do with what they learn at school. (OECD, 2017b, p. 22)

A “real-life challenge” faced by an individual might be one of the COVID-19 examples given previously. Such challenges would most likely include integrating content knowledge, since we rarely find successful scientific solutions in a vacuum. For example, for those in STEM careers, biologists rarely describe cellular processes without integrating organic chemistry knowledge. Physics is often used to describe the motion of animals and how planets align. However, as noted earlier, the integrated course model is not how science education is designed – rare is the course that incorporates both life and physical science in a K-12 setting. Neither are PISA or many other large-scale science assessments designed to be integrated, with items assessing multiple science subdomains at once. Instead, the science assessment design includes items that are targeted to specific subdomains; for example, the item in Figure 4 (Bee Colony Collapse Disorder, Question23 1) is coded to knowledge – system: content – living science (OECD, 2017a). A lack of integrated content in items may negate some of the “forward-looking” aspects of PISA, although a fuller theoretical investigation of this topic is outside the scope of this dissertation.
23 Question # refers to a specific item in the order it appeared in a cluster.

The cognitive assessment (see Figure 5) included an additional domain of collaborative problem solving in 2015 (OECD, 2017b). Some of the countries took a paper-based assessment (PBA) version while others took a computer-based assessment (CBA) (OECD, 2017b). The CBA version of PISA was considered a field trial and included both trend and new items, while the PBA version only had trend items (OECD, 2017b). Items had several response formats: click on a choice, numeric entry, text entry, select from a drop-down menu, and drag and drop; items were dispersed among multiple test forms (OECD, 2017b). There was no common form with a common set of items that all students received (OECD, 2017b). Forms were randomly assigned (OECD, 2017b), with no linking information released. Within forms, the science items were organized into clusters (OECD, 2017a). There were 36 random science cluster combinations possible across the 66 CBA forms (OECD, 2017b). There was also a unique form, referred to as the Une Heure (UH) form, for students with special needs. This form included “easier items in each domain” with “a more limited reading load” and 50% of the items assessing science (OECD, 2017b, p. 42). Approximately 61 physical, 74 life, and 49 Earth and space items were developed and chosen for the assessment, a total of 184 items equal to about 6 hours of test questions; only 85 were trend items while 99 were new (Mostafa et al., 2018). However, the actual test was about 2 hours per student. As shown in Figure 5, students have 2 hours to take the cognitive portion of the assessment, which includes the major domain of science and the minor domains of reading, mathematics, and collaborative problem solving, and then they are offered a questionnaire24 with a shorter innovative assessment on financial literacy at the end (OECD, 2017b). The major domain, science for 2015, takes an hour of the time allotted to both the PBA and CBA cognitive assessment to complete (OECD, 2016b).
Teachers had an optional short questionnaire that could be taken after the student questionnaire (OECD, 2017b). All elements of the assessment were offered in different languages according to PISA specifications to accommodate some language aspects of the participant’s setting. Figure 5 Comparison of PBA25 and CBA26 Assessment Designs Note. From “PISA 2015 technical report,” 2017, OECD Publishing (http://dx.doi.org/10.1787/9789264281820-en), p. 36. Copyright 2017 by OECD. 24 OECD often refers to the cognitive assessment as a survey while referring to the student and teacher qualitative surveys as questionnaires in both the framework and technical report. 25 While PBA countries were offered the optional financial literacy assessment none took it (OECD, 2017b). 26 ICT refers to Information and Computer Technology Literacy Familiarity (ICT) questionnaires. 46 2015 Science Scoring While items are described in the framework, these do not generate individual scores per student that are visible to the public (OECD, n.d.-b). Per OECD (2018), nearly 540,000 students globally finished the science assessment. There is not a theoretical minimum or maximum score for each of these students (OECD, n.d.-b). In this study’s data, however, item level responses for students within the U.S. sample were used. In the reported data27 at the country level, scaling occurs with IRT and then is transformed for scores around normal distributions (OECD, n.d.-b). Means for OECD country participants are approximately 500 score points with a standard deviation (SD) of 100 score points (OECD, n.d.-b). Countries are then ranked according to the mean score (OECD, n.d.-b). Most students score between 400 and 600 points (OECD, n.d.-b). Having mentioned earlier that PISA scores can be indicators of a country’s education status and economy, it’s important to consider that not all educators and researchers believe that PISA scores are good indicators of these factors (Strauss, 2019). Some educators feel that there is not a one-size fits all assessment and that assessment should be more inquiry-based where students actually show what they can do, which is only partially assessed on the PISA assessment from 2015. In addition, the scores reflect students’ performance in different countries. OECD attempts to take into account cultures and policies in those countries but has not been able to fully resolve that these differences promote different knowledge, skills, and abilities (Strauss, 2019). This is partly what the OECD assessments are intended to consider, but in a large-scale assessment it is difficult to separate issues like cultural relevance from policy 27 See Appendix B for science mean scores by country and Appendix C for science mean subscale scores by country. 47 variations that the countries feel might be important to detect. To better understand scores within the context of the science subdomains, a qualitative analysis of student level data, which is not possible given the nature of the PISA data and lack of access to all science items, might complement a quantitative analysis of 2015 PISA science scores. Instead, I am trying to identify conceptual multidimensionality via conceptual analysis of the framework with document analysis, followed by quantitative analysis of the student-level data set from the instrument representing that framework, to see if empirical evidence of patterns of quantitative multidimensionality exists. 
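Before turning to prior MIRT studies, a brief illustration of the reporting metric described above may be useful. The sketch below uses hypothetical values only; PISA's operational scaling relies on plausible values, conditioning, and survey weights rather than a simple rescaling of point estimates. It shows how ability estimates on a standard-normal IRT scale map onto a scale with a mean near 500 and a standard deviation near 100.

```r
# Hypothetical illustration only: PISA's operational scaling uses plausible
# values and survey weights, not a simple linear rescaling of point estimates.
set.seed(2015)
theta <- rnorm(1000)              # simulated ability estimates, mean 0 and SD 1
reported <- 500 + 100 * theta     # linear transformation to a PISA-like reporting metric
round(c(mean = mean(reported), sd = sd(reported)))
```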
Three MIRT Case Studies The following studies exemplify the current state of MIRT model usage in assessment research and are similar to what was implemented in the methodology chapter of this study. Two of the studies provide original research addressing the use of MIRT in science assessments with only one being a large-scale case, while the third case study focuses on a large-scale English language proficiency assessment. All three are united in their use of a methodology to validate the use of MIRT models by comparing fit to other IRT models and by a theory that their educational constructs are expected to be multidimensional. Yen and Leah (2007) - MIRT Model for Composite Scores Similar to the PISA 2015 reading framework, the knowledge content of English language proficiency assessments is often multidimensional. The subdomains for this English language proficiency assessment were identified as speaking, listening, reading, and writing, but the researchers were interested in mainly the speaking and listening constructs. The speaking construct was further separated into four subsequent proficiencies while the listening construct 48 had three subsequent proficiencies. These proficiencies were considered as sub-subdomains in the analysis. Twenty items each were administered for the speaking and listening subdomains. Half of the speaking items were scored polytomously, but the rest were dichotomous, and all of the listening items were dichotomous. Students took the speaking portion of the test in about 10 minutes and the listening portion in about 15 minutes. The assessment itself was considered large-scale because it was given to all eligible K-12 students in an unidentified location that was state, or country sized. However, the researchers pulled a sample of 12,008 student responses from only elementary school students. Only full sets of responses were analyzed as some students did not complete all of the items. The sample included slightly less female students than male students while one unidentified ethnic group dominated the sample. The unidimensional IRT models that were analyzed included the 3PL and the two- parameter partial credit model. Calibrated together, both multiple choice and constructed response items were placed on the related subdomain scale. Marginal maximum likelihood was used to estimate the parameters at the same time for both item types. Importantly, the speaking and listening items underwent distinct calibrations, but an additional calibration was performed on these subtests together to build an “oral” composite scale. Next an exploratory factor analysis was conducted to determine if the sub-subdomain proficiencies were actually multidimensional and to identify the number of dimensions that were present. Using the BMIRT computer program, four different MIRT models were then applied. One of the models [Model Type 1] assumed that speaking and listening subdomains are combined to measure one latent trait with several sub-subdomains while the other three 49 models [Model Type 2] assume that the speaking and listening subdomains measure distinct latent traits, each with their own set of sub-subdomains. The researchers pointed out that MIRT models can have parameter estimation issues due to the number of parameters, but the Markov Chain Monte Carlo (MCMC) method used by BMIRT helped alleviate this. 
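To make this kind of unidimensional-versus-multidimensional comparison concrete, the sketch below simulates dichotomous responses, fits a one-trait and a two-trait model, and compares them, anticipating the fit comparison described next. This is a minimal illustration using the R mirt package with hypothetical item parameters and simulated data; it is not the BMIRT program, the 3PL/partial-credit calibration, or the data used by Yen and Leah (2007).

```r
# Minimal sketch with hypothetical parameters, assuming the R 'mirt' package;
# not the authors' BMIRT analysis or data.
library(mirt)
set.seed(1)
# 20 dichotomous items: the first 10 load on trait 1, the last 10 on trait 2.
a <- cbind(c(rep(1.2, 10), rep(0, 10)), c(rep(0, 10), rep(1.2, 10)))
d <- matrix(rnorm(20))                                   # item intercepts
resp <- simdata(a = a, d = d, N = 1000, itemtype = "dich")
uni <- mirt(resp, model = 1, itemtype = "2PL", verbose = FALSE)  # unidimensional 2PL
two <- mirt(resp, model = 2, itemtype = "2PL", verbose = FALSE)  # exploratory two-dimensional 2PL
anova(uni, two)   # reports AIC/BIC and a chi-square (likelihood-ratio) difference test
```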
Population parameters were fixed to a normal distribution with a mean of 0 and standard deviation (SD) of 1 for Model Type 1 and to a multivariate normal distribution with means of (0,0) for Model Type 2. Each model went through 10,000 iterations. Each model’s fit was compared using the Akaike Information Criterion (AIC) and a chi-square difference test. The Type 2 MIRT models had the better fit, and the researchers concluded that multidimensionality existed in the assessment. The MIRT models proved successful even though assessments of fewer than 100 items can have trouble with the discrimination parameter. Scalise and Clarke-Midura (2018) also had success in applying a MIRT model to assessment scores based on a multidimensional framework.

Scalise and Clarke-Midura (2018) - The Many Faces of Scientific Inquiry

The NGSS is a multidimensional framework used, with some adaptations, by over 30 states in the U.S. The researchers examined whether an online/virtual performance task, aligned with the Framework for K-12 Science Education (National Research Council [NRC], 2012), the College Board Standards for College Success (College Board, 2009), and the NGSS inquiry practices and science content knowledge, delivered evidence on more than one latent trait using a MIRT-Bayes model. The nature of scientific inquiry done by students is described as complex and often has to be less constrained in order to measure the full reasoning of students, which suggests that when performing inquiry students may use multiple abilities.

Items, both polytomous and dichotomous (multiple-choice), were developed by Harvard University learning scientists working in STEM innovations for two dimensions, inquiry and explaining. A sample of 1,986 student responses was collected for 23 items. Less than 1% of students had missing data. Information on the gender and ethnicity of the student sample was not provided. The entire assessment took students nearly 40 minutes to complete, and process data on student actions were also collected during this time. To model the data, the researchers used an exploratory study design that included both unidimensional and multidimensional IRT models. The MIRT models included a 2-dimensional partial credit model and a hybrid MIRT-Bayes model. For the 2-dimensional MIRT model, the difficulty parameter was estimated freely once the means of the latent variables were set to zero. In the case of the hybrid MIRT model, a Bayes net was constructed first and then the MIRT model was applied. Bayes nets can help structure semi-amorphous data onto the constructs being investigated. Upon analysis, the unidimensional model did not fit the data as well, based on a significant deviance difference, while item fit was acceptable for the 2-dimensional MIRT model. In addition, the two latent traits of inquiry and explaining were only moderately correlated, indicating student performance might be varied enough to justify the use of MIRT based on the conceptual framework, although an even larger sample could help show this more definitively. Both dimensions had expected a posteriori (EAP) reliability coefficients greater than 0.85. The researchers elaborated that theoretical support from this framework was validated by content experts, who designed the task to have items that would assess each ability/skill independently. In counterpoint to the Yen and Leah (2007) study, no items were found to load negatively, which would indicate negative discrimination. The hybrid MIRT model appeared to fit even better, with higher reliability estimates.
Wright maps and standard error of measurement plots helped to visualize the data. Similarly, Li et al. (2012) explored how MIRT models can accurately depict data from items that cover several science domains.

Li et al. (2012) - Applying MIRT Models in Validating Test Dimensionality

Assessment dimensionality should be explored at the beginning of development to provide validity for the inferences made with regard to student ability/latent traits. Development of items is done with alignment to different anticipated dimensions so that each item targets one or more constructs. The researchers define the dimensionality of an assessment “as the number of traits that must be considered to achieve weak local independence between the items” (Li et al., 2012, p. 3). The covariance between items should “approach zero as test length increases” to indicate weak local independence (Li et al., 2012, p. 3). The validation of dimensionality with regard to the learning domain or subdomain often occurs during the field test.

A random sample of 5,677 grade 5 students who had taken the 2008 Michigan state science assessment was selected in order to validate the dimensionality of four science subdomains: science processes, life, Earth, and physical sciences. A total of 45 science items were taken by students, with no timing information given. Item type was not provided by the researchers, and it is also unclear if the items were field tested or operational. The gender and ethnicity of the students also were not reported. To evaluate the dimensionality of the student data, the researchers selected three different IRT models: a unidimensional model, a “simple-structure” MIRT model, and a testlet model. The testlet model treats the different subdomains as testlet residual dimensions in addition to the dominant dimension of general science ability. Figure 6 provides a general structure for each model that was evaluated.

Figure 6
Comparison of Models

Note. From “Applying multidimensional item response theory models in validating test dimensionality: An example of K–12 large-scale science assessment,” by Y. Li, H. Jiao, and R. W. Lissitz, 2012, Journal of Applied Testing Technology, 13(2), p. 9. Copyright 2012 by Journal of Applied Testing Technology.

Before applying the three models, a principal component analysis (PCA) and an exploratory linear factor analysis (EFA) were conducted. In the PCA, the researchers retained components with eigenvalues greater than 1, which indicates that those components each account for more variance than the average item. A scree plot helped the researchers determine that at least two components did exist. Next, the EFA confirmed that two factors existed, and based on the reduction in root mean square error (RMSE) values, either a four- or five-factor solution might be possible to adopt. A four-factor solution was used, as the researchers felt that the statistical analysis should be compared against the theoretical dimensionality used in the assessment design.

The MIRT model was found to fit the data significantly better than the unidimensional model. AIC and the Bayesian information criterion (BIC) were then used to compare fit between the MIRT model and the testlet model. The MIRT model was found to be the best fitting since it produced smaller information criteria, indicating better model fit to the data. The conclusion reached was that if several abilities are measured, then MIRT models should be used. Li et al.
(2012) pointed out that at the time of their study no prior study on validating multidimensionality was located comparing the fit of the three models for a large-scale K-12 science assessment. The study outlined in Chapter 2 is similar in methodology to the above 3 studies with a focus on multidimensionality of science assessments. There continues to be a lack of research on large-scale science assessments and their use of MIRT models, and few that also include other data analytic techniques from emerging data sciences such as shown here. This study will include analysis of data from the 2015 PISA large-scale science assessment. By modeling data from PISA’s 2015 science framework28 and assessment, the following research questions are expected to be answered. Research Questions Research Question 1 (RQ1) What evidence can be drawn from the 2015 PISA science framework to qualitatively support whether multiple dimensions are being described regarding student knowledge in science in the framework? 28 See Appendix F. 54 Research Question 2 (RQ2) Do quantitative indications of multidimensionality exist in the 2015 PISA science data? • R2A: Does a data science cluster analysis29 applied to the PISA 2015 data (U.S. sample at the item level) suggest multidimensionality in the student response data set? Regardless of dimensionality, how do the items cluster in the analysis? • R2B: Does principal components analysis30 (PCA) applied to the same data indicate multidimensionality? Does residual analysis employing standard tolerances used in the industry indicate multidimensionality that needs to be modeled in the U.S. sample? How do items cluster in the analysis, and how does this compare to R2A? • R2C: By how much (practical significance) does a multidimensional IRT model applied to the same data indicate improved fit over less dimensionally complex models? Is fit statistically improved based on chi-square comparisons of nested models fitted for dimensionality? How does item clustering compare in R2A/R2B? Research Question 3 (RQ3) Depending on covariates available in the U. S. sample data set or linked data sets or overall population reports, what can be said about subgroup analysis in this data set and about aspects of classroom instruction relative to clustering patterns found in R2A-C? For instance, do multidimensional models yield results that showcase students have different levels of science 29 A cluster analysis is a method for sorting items into groups based on their statistical relationship to one another. For example, if items are closely related because a student must have physic knowledge to answer them then all physics items may cluster together. 30 PCA was chosen since it is a method of reducing the dimensionality of a large dataset into principal components (or dimensions) that retain as much variation as possible (Mailman School of Public Health, 2023). While EFA was considered, its goal is to show the correlation between variables is partially due to common latent variables (Mailman School of Public Health, 2023), which is not the dimensionality predicted for the science subdomains. 55 proficiency on different dimensions identified? Relative to demographic data, if available, what can be said about history of harm and employing or ignoring dimensionality in science data such as these? Whether contrasting or similar, what can be said about the findings viewed from both lenses? 56 CHAPTER 2. 
METHODS Declaration of Interest: The author worked at one time but not now for Educational Testing Service (ETS), one of the vendors supporting some PISA efforts, and has also been the lead vendor of 5 companies developing, delivering, and analyzing NAEP in the U.S., for all domains including for NAEP Science. Developing the Literature Synthesis A literature review helped determine the state of research on multidimensionality in large-scale science assessments. The majority of searches were in Google Scholar and the University of Oregon library databases but were not limited to only peer-reviewed journals. The main search used the following key phrases and words: “multidimensional” + “IRT” + large-scale + “science” + “assessment”31,32. For the purposes of this literature review, the following definitions were used: • Multidimensional IRT – a model estimating student ability containing more than one latent trait • Large-scale Science Assessment – assessments measuring the science ability of a large proportion of the student population occurring at either the U.S. state or national level, at the country level for areas outside of the U.S., or at a global level Nearly 7,560 results were generated in Google Scholar using the combined search phrase. In each iteration of searching, the Li et al. (2012) article was the top hit with very few 31 Quotation marks were included to indicate for which terms the Google Scholar search platform should return articles with exact matches. 32 Grade level and science subdomain were not used as eliminators. 57 other results matching the specific target. See Appendix D for Figure 39, which provides an overview of literature connected to the Li et al. (2012) study on multidimensional IRT model fit for student data from a large-scale state science assessment. Table 15 in Appendix E provides annotations for resources found during the literature review and used in the literature synthesis. Additionally, the review focused not just on science education, but also on reading assessments that used MIRT to model dimensionality. This addition was made in part due to the lack of studies dealing specifically with multidimensionality of large-scale science assessments, but also because PISA declares reading to be multidimensional in the 2015 framework (OECD, 2017a). Therefore, reading offers somewhat of a mirror into multidimensionality that may be useful through which to view the science framework. Setting PISA is an international CBA and PBA that is available to all OECD countries or affiliates called partners (OECD, 2017b), and countries around the world are invited to partner if they are not already in the OECD. For 2015, the non-gray countries in Figure 7 chose to participate (GEOstata, 2016). The color scale illustrates how these countries ranked by their students’ performance on the science assessment (GEOstata, 2016). The setting focus for this study will be the U.S. Note that in Figure 7 the U.S. is shown ranking in the 470 to 500 mean score range for science (GEOstata, 2016), which is near to but slightly above center. 58 Figure 7 PISA Science Performance by Country U.S. Note. From “PISA 2015 Results – Performance in Science,” 2016, GEOstata. Copyright 2015 by OECD. Countries began testing in April 2015 in the following types of educational settings: educational institutions, vocational training or related educational programs, and foreign schools within a country (OECD, 2017b). 
Due to the range of countries participating in PISA, the student population is very diverse.

Student Demographics

Around 540,000 global students completed the PISA in 2015 (OECD, 2018). Both full-time and part-time students were eligible to participate (OECD, 2017b). Students who were home-schooled or taught in the workplace were not eligible to take the PISA (OECD, 2017b). Table 2 provides a breakdown of students in the U.S. compared to the highest performing and lowest performing countries. This table also includes demographics for the four countries to which the U.S. is compared by OECD (2016a). OECD member countries are highlighted in blue in the source table, and the remaining countries are OECD partners (OECD, 2017b).

Table 2
Country Demographic Comparisons

Country33 | 2015 Student Sample Size (OECD, 2017b) | 2015 Population Size34 (in millions) | Duration of Compulsory Education35 (in years) | 2013 Ethnic Fractionalization36 | Dominant Language/s37,38 | PISA 2015 Mean Science Score39
Singapore | 6,115 | 5.45 | 6 | 38.57% | *Mandarin (35%), *English (23%), *Malay (14.1%), Hokkien (11.4%) (2000 census) | 556
Estonia | 5,587 | 1.3 | 9 | 50.62% | *Estonian 67.3%, Russian 29.7% (2000 census) | 534
Canada | 20,058 | 35.7 | 10 | 71.24% | *English 58.8%, *French 21.6%, Other 19.6% (2006 Census) | 528
Hong Kong (China) | 5,359 | 7.3 | 9 | 6.2% | *Cantonese 90.8%, *English 2.8% (2006 census) | 523
Germany | 6,522 | 81.7 | 13 | 16.82% | German40 | 509
United States | 5,712 (notes 41, 42) | 320.7 | 12 | 49.01% | English (82.1%)43, Spanish (10.7%) (2000 census) | 496 (note 44)
Dominican Republic | 4,740 | 10.4 | 15 † | 42.94% | *Spanish45 | 332

Note. A Harvard study defined fractionalization as the probability that two people randomly selected from a country would be from different ethnic groups (Alesina et al., 2003). † This country’s education requirement changed from 9 to 15 years in 2010; the other six countries have remained steady in their education requirements since the late 1990s through 2022. * Indicates official language/s of that country.
33 Countries, and in the U.S. the schools, elect to participate in PISA (NCES, n.d.-a).
34 From https://datatopics.worldbank.org/world-development-indicators/
35 From https://data.worldbank.org/indicator/SE.COM.DURS
36 From https://worldpopulationreview.com/country-rankings/most-racially-diverse-countries
37 From https://www.languagerc.net/languages-by-countries/
38 Showing only those unofficial languages that are found at 10% or greater within the country’s population.
39 From https://www.oecd.org/pisa/pisa-2015-results-in-focus.pdf
40 No census data provided.
41 NCES (n.d.-a) reported 177 schools participated nationally in the U.S. with a student response rate of 90%.
42 This is the national sample size and does not include the two states (Massachusetts/North Carolina) and one territory (Puerto Rico) sampled.
43 While English is broadly used in the U.S., the U.S. has not designated an official language (USAGov, 2023).
44 The U.S. mean score is not significantly different from the overall OECD mean score of 493 (OECD, 2018).
45 No census data provided.

Even though slightly over half a million global students participated in the 2015 PISA, OECD did not select all of those students in its sample (OECD, 2017b). The U.S. alone had 4,220,325 15-year-olds in 2015, with 3,992,053 actually enrolled in school (OECD, 2017b). A few U.S. schools46 (12,001) chose not to participate in 2015, at an exclusion rate of 0.30% (OECD, 2017b).
46 As of February 2024, there were 20,318 high schools in the U.S. per https://mdreducation.com/how-many-schools-are-in-the-u-s/. Per NCES (n.d.-a), there were 27,144 public and private secondary and high schools in the 2015-2016 school year, but the national U.S. sample consisted of only 240 schools and not all of those participated.

Data Collection

Data were collected during the 2015 PISA administration47, then organized, analyzed, and reported by OECD in 2016-18. Student responses to each PISA form were collected online when students accessed the task through their computer or tablet and in a physical format when students took the assessment via a paper form. In addition to student responses on science content, formal and optional questionnaires collected information on student attitudes toward science, along with student and teacher educational backgrounds. Student in-person interviews, cognitive lab data, focus group data, pilot and field trials were in many cases collected by OECD (2017b) but not released due to policy requirements with the countries.
47 For the U.S., this occurred during October to November (NCES, n.d.-a).

Study Sample

For this study, the full PISA 2015 extant quantitative dataset was narrowed by selecting only the U.S. student population at the national, not state, level that responded to a CBA form48. This study used a science subset from that sample. The science subsample consisted of 5,712 students and 166 dichotomous items. Removal of students who did not respond to any item (i.e., they had NAs49 for all 166 items) dropped the sample to 5,699 students. These 13 dropped students were treated as missing data, a rate of 0.2% for the U.S. CBA science subsample. OECD also performed casewise deletion for missing data (Mostafa et al., 2018). There was no common form taken by all students in 2015 PISA science. Items were spread across various forms as clusters of around 15 items each, with no equating sample or linking information available. Hence, the item cluster S1050 with the highest response rate was chosen for the first analysis, reducing the sample to 1,306 students and 15 items, all of which were new in 2015. As a confirmation of the model fit findings from the first analysis, a second analysis was conducted as a validity study, examining the second largest cluster, S1151. This was identified as consisting of a subsample of 1,274 students and 15 items, all of which were new in 2015. The criterion used here for examining clusters was to select two of the 15-item clusters, one for an initial study and one for a validity study.
48 CBA was chosen as this is the mode of delivery most assessments are moving towards, including PISA. Also, eliminating the PBA forms allowed the subsample to be viewed through one mode lens rather than having possible effects on the data from different delivery modes.
49 NA is the abbreviation for “not applicable”; however, when used in a data file it can mean many things. Note that the R program automatically codes blank/missing cells as NA. Per OECD (2017b), a cell coded NA by R in the PISA data file can mean missing data, an item not reached by a student, a data error, or a skipped item.
50 Cluster S10 as noted in the OECD (2017b) technical report’s annex A had 17 items, but the 2 polytomous items were dropped from this study.
51 Cluster S11 as noted in the OECD (2017b) technical report’s annex A had 16 items, but the 1 polytomous item was dropped from this study.
Given the scope of this work as an unfunded dissertation, the study started with the largest sample size to help facilitate separately fitting the more complex models, and then selected the next largest. Students in these cluster subsamples with some NAs indicating no response to an item during their attempt at answering the cluster had those NAs converted to 0 since they had the opportunity to attempt the item. There were no missing data in either of the science cluster subsamples using these criteria after the elimination of the 0.2% of U.S. responses indicated earlier. Hence, missing data rates were quite low, but they were not entirely eliminated. No imputation technique was undertaken for the 0.2% missing data due to insufficient additional information released. Also, numerous problems of imputation in assessment calibration made this a questionable statistical adjustment even if information had been available. Data Analysis – A Mixed Methods Approach Data analysis was divided into three steps. The first step directly supported RQ1 and involved qualitative data being analyzed, including the main document analysis of the 2015 PISA science framework. Step 2 supported RQ2 and consisted of the quantitative analysis52 of 2015 PISA science student scores along with data triangulation to the Step 1 analysis. The last step supported RQ3 and involved analyzing with an equity lens how the best fitting IRT model might impact equity in regard to student outcomes for subgroups. All three steps are outlined in detail below within the context of a mixed methods approach to data analysis. 52 Focus of this study is on the item level rather than the student level for each quantitative analysis. 63 When qualitative data exists to support, or contradict, quantitative data, a mixed methods design for research is essential. This type of design can meld findings from both methodologies into an effective solution (Johnson & Onwuegbuzie, 2004). For example, large- scale assessments like PISA often make available content frameworks, student and teacher survey responses, and other documentation like item specifications that can be analyzed qualitatively to determine narrative themes such as multidimensionality of science content. This is in addition to the quantitative scoring data, process data, and student demographic data from the assessment itself. Such narrative themes might add to a quantitative model of student science ability. For example, Claesgens et al. (2008, p. 66) had a mixed methods approach of “using IRT, the scores for a set of student responses and the questions are calibrated relative to one another on the same scale and their fit, validity, and reliability are estimated, and matched against the framework”. In order to support such a mixture of data types and analyses, a researcher often needs a methodological design that incorporates different philosophical underpinnings. Epistemology An equivalent status design refers to a theory of research where both qualitative and quantitative epistemologies are valued for understanding constructs (Venkatesh et al., 2016; Johnson & Onwuegbuzie, 2004). This study used two epistemological foundations to pragmatically agree with Baskarada and Koronios (2018) that researchers should select philosophies that work best for the research questions to be addressed and the data to be analyzed. This choice was pragmatic in nature based on the requirements of this study. 
The chosen epistemological foundations were also aligned with Greene and Caracelli (1997), who 64 clarify that if methodological pragmatism requires different philosophies then each can be viewed as “logically independent and therefore can be mixed and matched” in order to achieve methodology that will work well in each stage of analysis. In the end, though, RQ1 and RQ2 were designed to consider triangulation, or in other words whether results tend to support each other or not across the techniques. My epistemological approach to science education is mainly social constructivism. Constructivism is a viewpoint that students build their own learning with their reality built on experiences as a learner and the social aspect implies that learning is collaborative. Social constructivism describes students as dynamically constructing and reconfiguring knowledge while interacting socially (OECD, 2023). This educational theory is grounded in constructivism psychological learning theory, which views knowledge as something a learner must “actively construct” for themselves (McLeod, 2019). During social interactions, such as learning within a classroom, students often construct their knowledge in collaboration with others (Atkisson, 2010). The concept that students learn how to learn by interacting with others is a key foundation of social constructivism (Greenwood, 2020). Western Governors University (2020) describes several principles of constructivism: • “Knowledge is constructed, • people learn to learn as they learn, • learning is an active process, • learning is a social activity, • learning is contextual, • knowledge is personal, and 65 • motivation is key to learning.” Boon et al. (2022) also point out that a constructivist epistemology can suit “practice- orientated” research. I have also chosen an approach that acknowledges a philosophy of scientific realism, or an organized reality (Moroi, 2020), which means that at some level I am acknowledging that there is a reality in STEM and it can be known (although in STEM, most if not all models can be disconfirmed to some degree at a different grain size, so the idea of scientific realism is that the philosophy acknowledges utility of the model at the grain size applied, such as in Atomic Theory and Quantum Physics). Similarly, I believe that we can try to measure concepts both independently of ourselves and in a rational manner, at a level that may provide some utility for decision making. This viewpoint pairs well with my social constructivist stance and as Maxwell and Mittapalli (2010) note these two philosophical paradigms can work well in a mixed methods approach. If multidimensionality is present in the 2015 PISA science framework and student data then it should be discoverable by a mixed methods analysis and can then be modeled based on the science content and student data. Of course, in a broader view, we know that all models fail at some level and do not incorporate all aspects of reality but such models can be useful if they provide gains in our understanding or utility in our context. Purpose and Guidelines Even though most researchers have an idea of the philosophy driving their research they still need to define its purpose. Venkatesh et al. (2013) advise identifying the purpose of mixed methods research early. Identifying the purpose/s will help establish the research goals and 66 later serve to inform reviewers of how to center any findings. They identify the seven purposes shown in Table 3. 
Table 3 Defining Purpose of Mixed Method Approach Purposes Description Illustration Current Study Meets this purpose by triangulating the evidence from Mixed methods are used in order to gain A qualitative study was both methods and statistically complementary views used to gain additional the quantitative model Complementary insights on the findings complements the qualitative about the same from a quantitative review of the science phenomena or study. framework’s multidimensional relationships. design. However, UIRT model was more practical with regards to improvement of model fit. The qualitative data and Mixed methods designs are results provided rich Completeness used to make sure a explanations of the complete picture of a findings from the NA phenomenon is obtained. quantitative data and analysis. Questions for one strand A qualitative study was emerge from the used to develop inferences of a previous constructs and Developmental one (sequential mixed hypotheses and a NA methods), or one strand quantitative study was provides hypotheses to be conducted to test the tested in the next one. hypotheses. The findings from one Meets this purpose by expanding Mixed methods are used in study (e.g., quantitative) prior studies at the state and order to explain or expand were expanded or international levels that focused Expansion upon the understanding elaborated by examining on quantitative model fit; obtained in a previous the findings from a typically, these studies did not strand of a study. different study (e.g., have a qualitative review of the qualitative). dimensionality of the science content within the framework. Meets this purpose by Mixed methods are used in A qualitative study was corroborating other studies’ Corroboration/ order to assess the credibility of inferences conducted to confirm findings of quantitative Confirmation obtained from one the findings from a multidimensionality in science quantitative study. through qualitatively defining approach (strand). that science subdomains are multidimensional. Mixed methods enable The qualitative analysis Compensation compensating for the compensated for the weaknesses of one small sample size in the NA quantitative study. 67 approach by using the other. Qualitative and Did not meet this purpose Mixed methods are used quantitative studies with the hope of obtaining were conducted to because sufficient student Diversity compare perceptions of demographic data was divergent views of the a phenomenon of unavailable for this data set. See same phenomenon. interest by two different research question reclarification types of participants. in last chapter. Note. Adapted from “Bridging the qualitative-quantitative divide: Guidelines for conducting mixed methods research in information systems,” by V. Venkatesh, S. Brown, and H. Bala, 2013, MIS Quarterly, 37(1), p. 26. Copyright 2013 by JSTOR. For this research study 4 purposes were highlighted in green in Table 3 as scaffolding the work being done here: complementary since the PISA science assessment is built off the PISA science framework; expansion since the hope is that the qualitative data from the framework will illuminate the finding (or lack thereof) of multidimensionality in the student scores; corroboration since the qualitative data may or may not concur with any quantitative findings; and diversity because a lack of multidimensionality in the quantitative scores may be influenced by diversity issues. 
In Table 3’s final column are the descriptions of how this study met, or did not meet, each purpose. For example, while outside the scope of this study, a lack of latent diversity, or “diverse student mindsets” that can result from student diversity among other traits (Godwin, 2017, p. 13) could be masking the dimensionality of science subdomains if the student sample is not diverse and the students approach science problems in a similar manner. It is on diversity issues that I differ in opinion from Venkatesh et al. (2013, p. 22) as they state a mixed methods approach should be taken without regard to “cultural incommensurability” as long as it helps 68 the researcher answer their question. If the purpose of mixed methods research hinges in part on diversity, then cultural considerations should be taken into account when developing methodology for the research (Broesch et al., 2020). Once purpose/s are identified the design of and data analysis for mixed methods research needs to be grounded in accepted qualitative and quantitative procedures (Venkatesh et al., 2013). Whether each type of research occurs concurrently or as in this case sequentially, Table 4 provides both general and validation guidelines that can be applied to any mixed methods research (Venkatesh et al., 2013). Table 4 Guidelines for Mixed Methods Research Guideline Researcher Considerations Carefully think about the research questions, objectives, and contexts to decide on the appropriateness of a mixed methods approach for the research. Explication of the broad and specific research objective/s is important to establish the utility of mixed methods research. Carefully select a mixed methods design strategy that is appropriate for the research questions, objectives, and contexts. Develop a strategy for rigorously analyzing mixed methods data. A cursory analysis of qualitative data followed by a rigorous analysis of quantitative data or vice versa is not desirable. Apply the same standard of rigor as typically used in analyzing quantitative and qualitative studies. Integrate inferences from the qualitative and quantitative studies in order to draw meta- inferences from mixed method results. Discuss validation for both quantitative and qualitative studies. When discussing mixed methods validation, use mixed methods research nomenclature consistently. Mixed methods research validation should be assessed on the overall findings and/or meta- inferences from mixed methods research, not from the individual studies. Discuss validation from the point of view of the overall mixed methods design chosen for a study or research inquiry. Discuss potential threats to validity that may arise during data collection and analysis, along with any remedies. Note. Adapted from “Bridging the qualitative-quantitative divide: Guidelines for conducting mixed methods research in information systems,” by V. Venkatesh, S. Brown, and H. Bala, 2013, MIS Quarterly, 37(1), p. 41. Copyright 2013 by JSTOR. Validation General 69 Following these guidelines helped support validity in this mixed methods study and might help transferability to other contexts for inferences made based on the data (Venkatesh et al., 2013), although the limitations described earlier with regard to both linking and the documentation of the data set indicate additional work is needed for greater generalization. 
Step 1: Qualitative Analysis “Comparing and contrasting data is vital to qualitative analysis (Gale et al., 2013).” This is especially true when analyzing a framework, which could be considered a policy53 at the state, national, or global level, and may incorporate multiple sources of information with various degrees of clarification. Since frameworks are documents, this can be done via a qualitative document analysis (Wach et al., 2013; Bowen, 2009). Overview of Document Analysis. For digital and hard copies of documents, an organized procedure is needed for the analysis (Armstrong, 2021). Armstrong (2021) recommends beginning by identifying the objective of the document analysis and describes six common objectives: “defining concepts, mapping range and nature of phenomena, creating typologies54, finding associations, providing explanations, and developing strategies.” Once documents have been selected, it is important to not just “lift” text to be used in the report (Armstrong, 2021) or the analysis may be considered superficial. Rather, analysis should strive for deep understanding to develop meaning with regard to the construct. Wach et al. (2013) outline the process of document analysis in several steps and notes: 1. Defining document inclusion criteria, which may be practical or strategic in nature, 53 Wach, Ward, and Jacimovic (2013) defined policy as documents “that express official organizational aims and strategies.” 54 An analysis based on categories. 70 2. Gathering the document/s, 3. Outline analysis area/s, 4. Analyze the document/s using coding if applicable, and 5. Verify the analysis through an independent source (2nd reviewer) to increase reliability, impartiality, and dependability of the findings. An important note about step 5 – an analysis is considered dependable if the second reviewer would have made the same conclusions while analyzing the document/s in the same manner. Finally, Wach et al. (2013) recommend that thought needs to be given if the organization owning the document actually delivered the proposed policy. In this case the policy would be the science framework claim of three subdomains in science. There are several ways to present a document analysis. Mazzei and Jackson (2024) discuss “re-animating” documents in a visual format to uncover new “intensities”, which can be interpreted as new ways of seeing content contained within the document. This could take the form of a visual aid, such as a logic model or flow chart, that details the structure of the data and points out claims. Advantages and Disadvantages of Document Analysis. Many states and nations have conducted document analysis on content frameworks as an efficient means of uncovering content relatedness and connections to educational theory. Bowen (2009) describes advantages and disadvantages to this type of analysis in the list below. My counterarguments or agreements are in italics. Document analysis provides: • efficiency in data selection as collection from participants is not always required, 71 o Agree, with caveat that sometimes documents can be difficult to get in entirety from institutions. • most documents as they are available in the public domain, o Disagree, some documents may be public, but institutions can also keep many documents internal. • a decreased cost compared to other analyses, o Agree, document analysis is not as high in cost as a recruitment of participants for a quantitative study. 
• less to no reactivity or obtrusiveness from documents, since participants are not being observed,
o Agree, but Bowen (2009) also mentions that a researcher is less likely to influence the research due to the lack of social interaction, which I disagree with, as document analyzers can bring their own viewpoints into describing the meaning of the document.
• stability over time, along with documents being exact in nature and researchers not altering what is researched, and
o Disagree; documents often go through several versions that are not always reported to researchers, and sometimes changes are left out of a document too.
• broad coverage of material and historical events.
o Agree, documents allow researchers to peer into history.

Document analysis is disadvantageous in that documents can:
• sometimes lack detail,
• be hard to retrieve, and
• become biased through the selection process.

Despite the disadvantages, document analysis can provide needed illustration. A document analysis of the framework’s theoretical claims of multidimensionality will help to elucidate whether multidimensionality should be expected in the empirical results.

Conducting Document Analysis. The number of documents tied to the 2015 PISA is quite extensive and includes frameworks, reports on results, released items, technical reports, country level reports, webpage FAQs, brochures, and videos. To narrow this field, documents were selected following the process outlined by Voogt and Roblin (2012), who recommend screening for the goal of identifying the main theme, which in this study is science dimensionality. This is done by determining inclusion criteria a priori as per Wach et al. (2013). The inclusion criteria required that a document mention the PISA 2015 science framework and any form of the words domain, subdomain, or dimension, and that the document have been developed by OECD. This second screening requirement excluded any secondary sources that were not directly involved in science content development, since content developers, including teachers and assessment designers, are closest to the intent of the framework. This led to two documents being identified for possible analysis – OECD’s 2015 PISA Science Framework and PISA 2015 Technical Report. Next, a saturation evaluation of the selected documents was performed. Saturation evaluation, based in grounded theory55, can be used in qualitative analysis to stop the data collection from a document if no additional data are found to code to the theme being analyzed that go beyond what has already been found, or in other words some degree of key saturation has been reached in the evaluation (Saunders et al., 2018). The saturation evaluation eliminated the PISA 2015 Technical Report as it offered no substantial theoretical claims of multidimensionality in science. Hence, the 2015 PISA Science Framework document found during document collection was color-coded56 by the main theme [dimensionality] and given relevant coder annotations in order to draw out any supporting evidence or subthemes [multi vs. unidimensionality] (Voogt & Roblin, 2012).

55 Grounded theory is an inductive qualitative methodology that allows new theory to be formed from the observed data (Ho & Limpaecher, 2021). That said, I acknowledge my prior experience teaching science may lead to deductive reasoning with regards to PISA science content standards and their fit into a dimensionality theory.
If no evidence was found supporting science multidimensionality, that was also documented, along with any barriers to, or reasons why, multidimensionality is not present among the science subdomains. The subdomains were graphically connected to show any crossover between science content knowledge in the content standards (see Chapter 3, Figure 11, which was verified via a committee review as recommended by Wach et al. [2013]). After qualitative analysis, the researcher may find among the coded themes evidence of a theory that supports a model. Figure 8 illustrates how this theory building can come about (Carpiano & Daley, 2006).

Figure 8
From Science Framework Review to MIRT Model Development

Note. From “A guide and glossary on postpositivist theory building for population health,” by R. M. Carpiano and D. M. Daley, 2006, Journal of Epidemiology and Community Health, 60, p. 566.

56 This was done by hand rather than a computer program.

Carpiano and Daley’s (2006) definitions of a framework, theory, and model associated with Figure 8 are adapted below to describe aspects of this study.
1. Conceptual Framework: A set of standards about content knowledge in science that students should be able to show mastery towards.
2. Theory: Grounded in educational pedagogy and learning philosophy, it indicates a relationship between the variables (science content knowledge) while diagnosing the science learning phenomena to predict an outcome (e.g., the science subdomains will present as multidimensional when scored since learning may be unique to each subdomain). This theory will hopefully be fully developed after qualitative framework analysis.
3. Model: Makes a specific assumption about the learning theory for science content knowledge that allows parameters (science content knowledge variables) to be tested quantitatively.

Carpiano and Daley (2006) further clarify that after articulating a theory the researcher could draw the model by detailing the constructs taken from the theory (such as PISA science scores), diagramming their flow from left to right with relationships shown using arrows57, and indicating positive or negative relationships with a +/-.

57 Double-headed arrows indicate correlations, while single-headed arrows indicate causal relationships.

In order to predict the IRT model needed, evidence from Figure 11 comparing science content standards and from the framework document analysis that supported an educational theory on science dimensionality was then used to develop Figure 29 (similar to Figure 8). After the IRT model with the best fit was identified during quantitative data analysis, it was compared to Figure 29’s predicted model.

Step 2: Quantitative Analysis

“Models make assumptions around measurement explicit and testable (Lang & Tay, 2021).” IRT provides a group of statistical models that determine the probability of a student selecting a specific response (Immekus, 2019). Measurement using an IRT model is an attempt to explain this specific response as a continuous variable (Ayala, 2022) or set of variables (MIRT), such as student ability in the subdomain life science.

Primer on Item Response Theory. IRT is best described as a “formalized” statistical set of models for measuring skill/ability in an assessment (Lang & Tay, 2021; Wilson, 2013). This skill/ability is referred to as a latent variable since it is not directly observable, and the scores derived from the assessment are manifest variables of this ability (Ayala, 2022; Lang & Tay, 2021).
These manifest variables are empirically calibrated to be on an interval scale using the data since the difference between the values is meaningful (Ayala, 2022). Wilson (2013) further clarifies that IRT separates the scale from depending on the random number of items selected to be in the assessment. An item that has utility can differentiate well between student performance on different points of a trait continuum [the interval scale] (Ayala, 2022; Immekus, 2019; Mailman School of Public Health, 2023). 76 Item Parameters. Student performance hinges on several item parameters, which “define a blueprint for the model (Brooks-Bartlett, 2018).” These parameters, and their relationships to student ability, can be visualized with an item characteristic curve (ICC) graph. Figure 9 shows ICCs for three simulated items based on an IRT model estimating three parameters (Park et al., 2020). Each item in Figure 9 has a dichotomous score (0 for incorrect, 1 for correct). Figure 9 ICCs Based on a Three-parameter Logistic (3PL) Model Note. From “Technically speaking: Determining test effectiveness with item response theory,” by S. Park, A. Reeger, and A. M. Aloe, 2020, Iowa Reading Research Center. Copyright 2023 by The University of Iowa. In Figure 9, P(u) represents the probability (P) of a student responding correctly to the item (u). Theta is a representation of student ability (this is the student’s location and is also called 𝜃). The item discrimination parameter or 𝛼 is labeled (a) in the right bottom corner of 77 Figure 9 and provides the maximum slope steepness of each ICC. Steeper slopes indicated by higher 𝛼 values point towards items with greater discrimination between student abilities (Harris, n.d.). In addition to item discrimination, item parameters can include item difficulty and guessing (Ayala, 2022; Mailman School of Public Health, 2023); the 4PL, not shown here, also adds an additional parameter to describe what tends to happen at the top end of the curve, and there are numerous other innovative IRT models; operationally however, usually no more than the three parameters per item shown here are used. In the same corner of the figure, (b) represents the item difficulty parameter as it relates to student ability and (c) represents the “lowest possible probability” of a student responding correctly to the item as an indicator of the student guessing parameter (Park et al., 2020). Therefore, Item 1 discriminates well between students of different abilities, while Item 3 was the least discriminating. Item 2 was moderate in both discrimination, difficulty, and guessing. Finally, Item 3 was the most difficult and had the least probability of students guessing, while Item 1 was the least difficult and had the most probability of students guessing. Higher performing students tend to do better on more difficult items while lower performing students often answer easier items correctly, with discrimination and guessing empirically adding to what is sometimes known about how the item tends to perform. While a difficulty parameter estimate is assigned to each item, an IRT model does not explain why an item is difficult for each student (Lang & Tay, 2021); only that empirically it was calibrated as difficult. 
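To make the roles of these parameters concrete, the following is a small R sketch (illustrative only; the parameter values are invented and are not drawn from PISA items) that computes and plots 3PL ICCs like those in Figure 9:

```r
# Illustrative 3PL item characteristic curves; all parameter values are invented.
p3pl <- function(theta, a, b, c) c + (1 - c) / (1 + exp(-a * (theta - b)))

theta <- seq(-4, 4, by = 0.1)                # ability (theta) scale
items <- data.frame(a = c(1.8, 1.0, 0.5),    # discrimination
                    b = c(-1.0, 0.0, 1.5),   # difficulty
                    c = c(0.20, 0.15, 0.05)) # guessing (lower asymptote)

plot(NULL, xlim = range(theta), ylim = c(0, 1),
     xlab = "Theta (ability)", ylab = "P(u = 1)")
for (j in seq_len(nrow(items))) {
  lines(theta, p3pl(theta, items$a[j], items$b[j], items$c[j]), lty = j)
}
legend("bottomright", legend = paste("Item", 1:3), lty = 1:3)
```

Here the a value controls the maximum steepness of each curve, b shifts the curve along the ability scale, and c sets the lower asymptote, mirroring the roles of (a), (b), and (c) in Figure 9.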
An individual’s cognitive process to reach an item response is also not described by an IRT model (Ayala, 2022), so this is done theoretically in large-scale assessment with expert panels in frameworks, and sometimes with practitioners in standard settings, or can 78 be done more empirically with qualitative analysis in cognitive laboratories or other small-scale settings using verbal protocol analysis (“think alouds”) or other techniques. For PISA, only the frameworks are released, but they are extensive documents. Assumptions. For an IRT model to estimate all three parameters successfully, the model needs to conform to several assumptions. Assumption 1 refers to monotonicity, which assumes that as the probability for a correct response increases a student’s ability also increases (Mailman School of Public Health, 2023). Assumption 2 is guided by conditional/local independence, which states item responses on an assessment are independent of each other based on a student’s location/ability (Ayala, 2022). Assumption 3 is based on unidimensionality, which assumes that only one continuous latent trait is measured (Ayala, 2022). Assumption 4 describes the functional form, which maintains that the data match the mathematical function described by a model (Ayala, 2022). Violating these assumptions can affect which model should be chosen as a best fit to the data. For example, if an assessment is measuring several proficiencies/latent traits, then a unidimensional latent variable may not be appropriate (Socha, n.d.). Defining Dimensionality. Sometimes education attributes being assessed lack a pre- defined dimensional structure (Irribarra & Arneson, 2023; Reckase, 1990). The definitions for multidimensionality and unidimensionality need to be clearly defined, which Irribarra and Arneson (2023) argue has still not occurred in many domains even though dimensionality is often discussed in educational research. One way to begin a definition is to compare the 79 statistical versus psychological aspects of dimensionality. Psychological/theoretical58 dimensionality is the hypothetical latent constructs, such as science ability, that in theory as identified by experts are required for performing well on an assessment (Irribarra & Arneson, 2023; Reckase, 1990). Statistical dimensionality is “the minimum59 number of mathematical variables needed to summarize a matrix of item response data” (Reckase, 1990). Reckase (1990) further points out that statistical dimensionality is based on observable data, such as scores, and rests on the data matrix so is not a function of the assessment or the student population being assessed. Actionable dimensionality is described as a third aspect by Irribarra and Arneson (2023), which refers to the “number of values considered when making a decision based on an assessment.” If the “art of assessing dimensionality” is finding the least number of latent abilities to preserve statistical prowess and construct meaning (Briggs & Wilson, 2003) then dimensionality depends on both the construct, the statistical model, and its intended usage. Multidimensionality over the content areas can be defined as three distinct science subdomains related to one theoretical construct – science ability; however other types of multidimensionality may also exist, such as in cognitive processing or item format as discussed earlier. IRT models have evolved as more research has shed light on dimensionality. Development of IRT Models. 
Preceding multidimensional IRT models were classical test theory (CTT) and unidimensional IRT. MIRT models are often compared to unidimensional IRT models in order to assess model fit. Therefore, an overview of the types of IRT models and their development is needed.

58 Irribarra and Arneson (2023) prefer “theoretical” to “psychological”. They redefine psychological dimensionality to theoretical dimensionality as “the number of relevant psychological attributes that can be reasonably conceived as quantities and are believed to be involved in generating responses to items to some extent (Irribarra & Arneson, 2023).”
59 Irribarra and Arneson (2023) recommend removing the term “minimum” from the definition.

Classical Test Theory (CTT). CTT was developed first and used historically with achievement tests (Brandt, 2015). Similar to IRT, the latent variable is assumed to be continuous (Ayala, 2022). Opposite of IRT, CTT focuses on a student’s whole score for an entire assessment (Ayala, 2022). In CTT, the item parameters depend on the sample from which they are taken (Immekus et al., 2019). In addition, CTT does not provide a reason for item difficulties, making linking between assessments with different item sets more challenging (Brandt, 2015). The CTT approach was updated to IRT in order to focus on the item and student relationship with a statistical model rather than the assessment in its entirety (Wilson, 2013). Now large-scale assessments, such as NAEP and PISA, use IRT models to avoid these disadvantages (Brandt, 2015).

Unidimensional IRT Models. Each UIRT model described in this section uses a logistic function to describe student ability with regard to item parameters to get the probability of answering an item correctly (Harris, n.d.). The simplest of the UIRT models is the Rasch model (Wilson, 2013; Ayala, 2022; Lang & Tay, 2021). The Rasch model assumes items are on a single continuum showing student ability via standardized z-scores. This allows student responses to an item to be compared based on proficiency level. The item discrimination parameter (𝛼) in the Rasch model is set to a constant value of 1.0. In contrast, the parameter 𝛼 in the one-parameter logistic (1PL) UIRT model is allowed to vary from 1.0 and can be some other constant value across the items. Ayala (2022) describes this as a “philosophical perspective” where the Rasch model focuses on constructing the variable and the 1PL UIRT model focuses on fitting the data. Both the Rasch model and the 1PL UIRT model estimate the item difficulty parameter (item location or 𝛿), which can vary. As the 𝛿 parameter increases, the probability of a correct response decreases (Reckase, 2009; Lang & Tay, 2021; Ayala, 2022). The 1PL UIRT model60 also has an advantage over CTT in that, by fitting a logistic function, the model no longer assumes measurement error is the same for each student as CTT does (Lang & Tay, 2021). The two-parameter logistic (2PL) UIRT model allows the discrimination parameter to freely vary across items. Adding to the 2PL UIRT model, the three-parameter logistic (3PL) UIRT model estimates the guessing parameter (𝜒). The proportion of students in the lowest proficiency level choosing the right answer is the estimate used for the 𝜒 parameter (Reckase, 2009). After the 3PL UIRT model, the next development in IRT was the 1PL MIRT model.

60 Other IRT models also have the same advantage over CTT.

Functionality of MIRT. The complexity of educational constructs directly led to the development of MIRT models (Reckase, 2009).
A MIRT model is able to relax the assumption of unidimensionality so that multiple correlated latent traits can be measured (Wang, 2021). An assessment, set of items, or even a single item may require students to use multiple abilities/latent traits, “especially in the compound areas such as the natural sciences” (Issayeva, 2022). A limitation of unidimensional models is that they are not well fit for an instrument developed to be multidimensional (Immekus, 2019). Parameters are interpreted similarly to unidimensional IRT models, but they take the form of vectors and their direction in theta (𝜃) space will influence the interpretation (Socha, n.d.). This theta space is multidimensional and can be summarized as $\underline{\theta}_s = [\theta_{s1} \ldots \theta_{sM}]'$, with M being the number of unobserved latent dimensions needed to model a student’s predicted response to an item (Immekus, 2019). The 𝜒 parameter is the exception because it is not a vector and retains its 3PL definition. As with unidimensional models, maximum marginal likelihood estimation (MMLE) can calibrate the item parameters (Ayala, 2022; Socha, n.d.). MIRT models can be either compensatory or non-compensatory (Spencer, 2004; Issayeva, 2022; Socha, n.d.). Compensatory models are additive in nature and allow a student’s high score in one dimension to make up for a low score in another (Socha, n.d.; Reckase, 1997). However, assessment designers may find it difficult to explain to test takers why scores on different dimensions are dependent on each other (Baghaei, 2012). In a non-compensatory (or partially compensatory) model, a student’s ability (𝜃) in one dimension does not compensate for their ability (𝜃) in a different dimension (Socha, n.d.; Reckase, 1997). Results from a non-compensatory model are the nonlinear sum of thetas (Duran, 2014). Non-compensatory models also tend to be simpler (Socha, n.d.) and may capture cognitive ability more accurately (DeMars, 2016). A drawback is that non-compensatory models have not been used as frequently due to issues with parameter estimation, especially the lack of efficient algorithms, although this is somewhat changing with more computing power becoming available (Ayala, 2022; DeMars, 2016; Spencer, 2004; Wang & Nydick, 2015).

Limitations and Benefits of MIRT Models. While an assessment’s items may be multidimensional in the lens of a content framework, each dimension’s strength may not be enough to warrant a change from a unidimensional model (Reckase, 1985; Socha, n.d.). Sometimes we say the dimensions may exist, but they are not sufficiently “separable” to matter. Another limitation to using MIRT models is that, even though collecting data via online assessments is less expensive than in person, the increasing demand for data and its analysis is testing the limits of models like MIRT and other algorithms (Wang, 2021). There are also several sources of indeterminacy with a MIRT model. Metric indeterminacy results from the metric being relative (Ayala, 2022): the item locations and respondent locations are relative to each other and are not fixed until either the item mean or the respondent mean for the calibration is set to zero, which affects the calibration results for the fixed items or persons. Rotational indeterminacy indicates the direction of each axis is not unique with regard to the item vectors; however, fixing the axes can help alleviate this problem (Ayala, 2022).
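To illustrate the compensatory idea in code, the following small R sketch (all parameter values are invented for illustration and are not from this study) shows how high ability on one dimension can offset low ability on another for a compensatory two-dimensional item:

```r
# Compensatory 2PL MIRT probability for a single item measuring two dimensions.
# The probability depends on a weighted sum of abilities, so strength on one
# dimension can offset weakness on the other. All values are invented.
p_comp <- function(theta, a, d) 1 / (1 + exp(-(sum(a * theta) + d)))

a <- c(1.2, 0.8)   # item discriminations (slopes) on dimensions 1 and 2
d <- -0.5          # item intercept (related to easiness)

p_comp(c( 1.5, -1.5), a, d)  # high on dimension 1, low on dimension 2
p_comp(c(-1.5,  1.5), a, d)  # low on dimension 1, high on dimension 2
p_comp(c( 0.0,  0.0), a, d)  # average on both dimensions
```

Under a non-compensatory form, by contrast, the overall probability is roughly a product of dimension-specific terms, so a low ability on either dimension keeps the probability of success low.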
A benefit of a MIRT model is the ability to link calibrations in order to create a large pool of calibrated items, which then may be used to develop interesting adaptations, such as computer adaptive testing (CAT) with parallel multidimensional test forms61 (Issayeva, 2022). Within-item multidimensionality can be used with a MIRT model to reduce test length because one item provides data on several dimensions, which is useful in CAT (Duran, 2014). Kose and Demirtasli (2012) found, however, that longer tests (i.e., more items) and greater sample sizes are needed to reduce error and increase MIRT model sensitivity. Perhaps a greater benefit is more data about student ability on each dimension, which in turn can confirm or add to theories about educational constructs.

61 Test forms often refer to versions of the same assessment. For example, test form A may have items arranged differently from, or contain only some of the same items as, test form B.

PISA IRT Models. The PISA Results in Focus 2015 report states that PISA’s goal is not to determine the cause and effect of educational “practices and student outcomes” (OECD, 2018). However, the assessment does arguably make decisions about how student latent traits relate to educational constructs, which policymakers often use to lay some of the evidentiary groundwork for educational practice and theory. PISA divided each domain – reading, math, and science – into several subdomains for the 2015 assessment, indicating the constructs of each domain provided evidence of multidimensionality (Brandt, 2015). Currently, and in 2015, PISA uses a unidimensional composite score for science developed via a 2PL Rasch model (Jerrim, 2016) while also reporting scores in subdomains. Student scores are developed by first choosing the IRT model to estimate item parameters and then using maximum likelihood estimation (MLE) to determine a latent trait ability level for each student (Jerrim, 2016). This is done in the main study after an extensive field trial to adapt instruments, usually by a keep-and-drop method, for which data are not released. The difference in modeling for a multidimensional construct versus unidimensional scale scoring indicates that the assessment is being interpreted in both directions. There are both advantages and disadvantages to this approach per Brandt (2015) – see Table 5.
Table 5
Trade-offs Between Calibration Methods for a Unidimensional Score

Calibration method for multidimensional data: Scale Score on Unidimensional Calibration
Advantages:
• Reliably allows calculation of individual scores using MLEs
Disadvantages:
• Overestimates reliability, plus biases difficulty and variance estimates, by neglecting local item dependence (LID)
• Validity of the multidimensional constructs is reduced since the assessment is designed to be unidimensional
• Framework may specify one set of weightings for each dimension, but the actual weight may change due to a need to drop items

Calibration method for multidimensional data: Composite Score on Multidimensional Calibration
Advantages:
• Assuming items within a subdomain are separate dimensions and the items within a dimension are more closely related than items between dimensions, this approach allows LID to be considered, which more accurately estimates reliability
• Explicit and clear weighting of subdomains allows the unidimensional composite score to be developed
Disadvantages:
• Calculation of reliable individual scores via MLEs is not possible
• Reliability of the composite score is reduced since items are not on a common scale
• Not an appropriate calibration method if the Rasch model is used since dimensions cannot be set to equal variances (this leads to the need for standardization after calibration, increasing measurement error)

Note. Adapted from “Unidimensional interpretation of multidimensional tests,” by S. Brandt, 2015, Dissertation, p. 28.

Construct validity is also a crucial issue here. Per Messick (1995) and Spencer (2004), ignoring any evidence, such as multidimensionality found in a framework, may negatively impact the construct validity of the evaluation of meaning behind an assessment’s results. Reading is the only domain that OECD (2017a) describes as multidimensional in the 2015 PISA Framework. Using a MIRT model could help validate the 2015 PISA science framework design if the subdomains are found to be indicators of multidimensionality. This finding would mean the current unidimensional scoring model could contain a construct misspecification, which leads to incorrect interpretation of PISA scores that could in turn impact education policy decisions for diverse student groups.

Conducting Quantitative Analyses. The analyses in this section were done for both the S10 and the S11 clusters of items and were all conducted using R in RStudio (RStudio Team, 2021). The PISA 2015 science data were first cleaned by removing student cases that contained only NAs. The data were then run through descriptive analyses using the psych package (Revelle, 2024). Next, histograms of average scores for the U.S. science sample and the cluster S10 subsample were developed for the student population using the ggplot2 package (Wickham, 2016). Item histograms showing the number of each type of response (0 or 1) were plotted for the cluster S10 subsample. Due to the ordinal nature of the dichotomous data, a polychoric matrix was developed using the psych package (Revelle, 2024) and saved for use in later analyses. The correlation coefficient rho was examined and developed into a table. Using the factoextra package (Kassambara & Mundt, 2020), a distance matrix heatmap was also developed from the polychoric matrix for the cluster S10 subsample.
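To make these data-preparation steps concrete, the following is a minimal R sketch of the kind of workflow just described; the object name s10 is a hypothetical placeholder for the cluster S10 response data, and the 1 − rho distance conversion is one common choice rather than the study’s documented procedure:

```r
# Sketch of the descriptive and correlation steps described above.
# `s10` is assumed to be a data frame of dichotomously scored (0/1) item
# responses for the cluster S10 subsample, one row per student.
library(psych)       # describe(), polychoric()
library(ggplot2)     # histograms
library(factoextra)  # fviz_dist() for the distance heatmap

s10 <- s10[rowSums(!is.na(s10)) > 0, ]   # drop cases that are entirely NA

psych::describe(s10)                     # item-level descriptive statistics

avg <- rowMeans(s10, na.rm = TRUE)       # average score per student
ggplot(data.frame(avg = avg), aes(x = avg)) +
  geom_histogram(bins = 20) +
  labs(x = "Average score", y = "Count")

poly <- psych::polychoric(s10)           # polychoric correlation matrix (rho)
round(poly$rho, 2)

# One common choice for a distance heatmap: convert correlations to distances.
d <- as.dist(1 - poly$rho)
factoextra::fviz_dist(d)
```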
In order to evaluate the sensitivity of the cluster analysis, the cluster S10 subsample was randomly split in half using the dplyr package (Wickham et al., 2023), and the random half subset was run through the same processes outlined in the Cluster Analyses and PCA subsections below, but not through IRT analysis, as the subset’s size was very small at 653 students.

Cluster Analyses. Each subsample was then run through a cluster analysis using the kmeans function of the stats package (R Core Team, 2023) to estimate 3 clusters. Since each subdomain should represent a unique dimension, the logits for each dimension can be obtained via cluster analysis and then weighted to show each item’s relationship to a dimension (Ayala, 2022). The results of the cluster analysis were graphed in a scree plot using the factoextra package (Kassambara & Mundt, 2020).

PCA. The two subsamples and the random half subset were then also run through a PCA using the prcomp function in the stats package of R (R Core Team, 2023). The loadings, scores, and variances were analyzed to help develop several plots. Three-dimensional (3D) principal component plots were developed based on the PCA scores with the plot_ly function of the plotly package (Sievert, 2020) in R, while loadings were visualized with the barplot function of the graphics package (R Core Team, 2023) in R. Finally, principal component plots were loaded publicly into Plotly Chart Studio (Plotly Technologies Inc., 2015) for later presentation due to their interactive nature.

IRT Analyses. A critical step in determining which model provides the most information about student learning is examining the fit of various models (Yamamoto, 1995). Therefore, student data were analyzed in an exploratory quantitative research design. The explorations consisted of 1PL UIRT, 2PL UIRT, 1PL MIRT, and 2PL MIRT models – each model is described below.

Model 1 was a 1PL UIRT model and is described by Equation 1.

Equation 1
1PL UIRT

$p(x_j = 1 \mid \theta_s, \alpha, \delta_j) = \dfrac{1}{1 + e^{-\alpha(\theta_s - \delta_j)}}$

Where p is the probability of a value of 1 (a correct response) when the predictor is x, e is a constant of 2.7183 (i.e., the base of the natural logarithm), 𝜃 is a person’s location (i.e., ability), 𝛿 is the item’s location (i.e., estimated difficulty) of item j for student s, and alpha (𝛼) is allowed to vary from 1 while kept constant across items (Ayala, 2022).

62 Equations 1, 2, and 4 are from Ayala (2022). Equation 3 is derived using DeMars (2016).

Model 2 was a 2PL UIRT model and is described by Equation 2.

Equation 2
2PL UIRT

$p(x_j = 1 \mid \theta_s, \alpha_j, \gamma_j) = \dfrac{e^{\alpha_j\theta_s + \gamma_j}}{1 + e^{\alpha_j\theta_s + \gamma_j}}$

The discrimination parameter, 𝛼, is now allowed to vary across items (Ayala, 2022). Note that the intercept is represented by gamma (γ), which is constant for this model, and is a representation of the interaction of the item’s location and discrimination parameters (Ayala, 2022). “In a proficiency assessment situation γj would be interpreted as related to an item’s difficulty/easiness (Ayala, 2022, p. 393).”

Model 3 was a 1PL MIRT model and is described by Equation 3.

Equation 3
1PL MIRT

$p(x_{ij} = 1 \mid \underline{\theta}_s, \underline{\alpha}, \delta_j) = \dfrac{e^{\underline{\alpha}'\underline{\theta}_s + \delta_j}}{1 + e^{\underline{\alpha}'\underline{\theta}_s + \delta_j}}$

In order to determine an item-to-dimension relationship that can change along 2 or more dimensions, the logits are weighted (Ayala, 2022). The item slopes (𝛼) are fixed to a constant number across dimensions (DeMars, 2016). Note that an underscore indicates a vector and a prime symbol (‘) indicates a row vector.

Model 4 was a 2PL compensatory MIRT model and is described by Equation 4.

Equation 4
2PL MIRT

$p(x_{ij} = 1 \mid \underline{\theta}_s, \underline{\alpha}_j, \delta_j) = \dfrac{e^{\underline{\alpha}_j'\underline{\theta}_s + \delta_j}}{1 + e^{\underline{\alpha}_j'\underline{\theta}_s + \delta_j}}$

Model 4 differs in that 𝛼 can now vary.
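As an illustration of how these four models can be fit and compared in R with the TAM package described next, here is a minimal sketch; the objects resp (a data frame of 0/1 scored responses) and Q (an items-by-dimensions matrix assigning each item to one of the three subdomains) are hypothetical placeholders, not the study’s actual scripts:

```r
# Sketch of fitting and comparing UIRT and compensatory MIRT models with TAM.
# `resp` is a placeholder data frame of dichotomous (0/1) item responses;
# `Q` is a placeholder items-by-dimensions matrix of 0/1 loadings for the
# three science subdomains.
library(TAM)

mod1 <- TAM::tam.mml(resp = resp)             # Model 1: 1PL UIRT
mod2 <- TAM::tam.mml.2pl(resp = resp)         # Model 2: 2PL UIRT
mod3 <- TAM::tam.mml(resp = resp, Q = Q)      # Model 3: 1PL MIRT (between-item)
mod4 <- TAM::tam.mml.2pl(resp = resp, Q = Q)  # Model 4: 2PL compensatory MIRT

# Information criteria (deviance, AIC, BIC) for each fitted model.
mod1$ic; mod2$ic; mod3$ic; mod4$ic

# Likelihood-ratio comparison of nested models, e.g., Model 1 vs. Model 3.
anova(mod1, mod3)

# Item infit/outfit statistics for a fitted model.
fit1 <- TAM::tam.fit(mod1)
head(fit1$itemfit)
```

In this kind of setup, supplying a Q-matrix turns the unidimensional calibration into a between-item multidimensional one, which matches the compensatory MIRT structure noted below.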
After model selection, the dichotomous data (coded as 0 or 1) were analyzed using the TAM package (Robitzsch et al., 2022). This package was chosen primarily because of its use by OECD (Kiefer et al., 2015) and by other researchers comparing IRT models, along with its ease of use. The TAM package allows for UIRT and MIRT models, but only compensatory MIRT. A table was developed that shows the statistics for each model, and a second table comparing the models’ fit was also generated. Based on the best fitting model, a Wright map and ICCs were also developed using the wrightMap function of the WrightMap package (Irribarra & Freund, 2014) and the plot function of the graphics package (R Core Team, 2023) in R, respectively.

Data Triangulation

After data analysis, the researcher should begin to build inferences from both types of data. Meta-inferences are defined by Venkatesh et al. (2013) as “theoretical statements…from an integration of quantitative and qualitative strands of mixed methods research.” The pathway for this study flowed as follows: comparing (merging) qualitative + quantitative findings → meta-inference/s. The other pathways consist of either the qualitative or the quantitative findings leading individually to the next set of findings, which would be the opposite of the first step, and then to a meta-inference. After an analysis path is chosen, Venkatesh et al. (2013) state that researchers should take either a bridging or a bracketing research path to develop the meta-inference/s. Bridging is described as a consensus between the two types of findings (qualitative + quantitative), while bracketing uses alternate views of the phenomenon to report differences between the two types of findings (qualitative vs. quantitative). Bridging was used for this study. An analysis path can be taken further by developing the qualitative and quantitative data into a triangulation, or mapping them to each other, so that the data from one method support or contrast with conclusions drawn in the other. Östlund et al. (2011) adapted the diagram shown in Figure 10 from Erzberger and Kelle (2003) to illustrate triangulation.

Figure 10
Triangulation for Mixed Methods Research

Note. From “Combining qualitative and quantitative research within mixed method research designs: A methodological review,” by U. Östlund, L. Kidd, Y. Wengström, and N. Rowa-Dewar, 2011, International Journal of Nursing Studies, 48(2011), p. 371. Copyright 2010 by Elsevier.

Figure 10 showcases that the empirical findings for both the quantitative (QUAN) and qualitative (QUAL) strands can be used to support theory, which may be newly developed. The sides of the triangle represent connections between the theory and each set of findings, along with the findings to one another. Depending on the outcome of the research, the triangle sides can differ in appearance. First, the triangle sides may remain convergent (as shown in Figure 10), which is when findings from both methods support theory and lead to the same conclusion. Second, the triangle sides may become parallel to each other when the findings from both methods complement or support one another. Third, the triangle sides may be divergent, indicating the findings are different for each method or may even contradict one another, in which case the contrasts should be explored to understand why different lenses indicate different directions. As noted in their research, Östlund et al.
(2011) stated, that while a mixed method approach is gaining ground with other researchers, the triangulation process described above was only beginning to be used at that time. It has since expanded. Triangulation can help clarify results in mixed methods research by clearly identifying the interactions between different types of data. Bowen (2009) and Armstrong (2021) agree that document analysis is a way to triangulate with other methods of research. They also describe how triangulation can provide a solid foundation for the design of and theory behind the mixed methods approach (Östlund et al., 2011). A data triangulation figure based on Östlund et al.’s (2011) procedure shown in Figure 10 was built for this study. Step 3: Equity Investigation OECD did not make publicly available information on ethnicity/race in 2015 and only collected minimal information on school location and student economic status in the survey questions. Therefore, survey questions were not used to investigate equity issues. Instead, student ability levels between models were compared to determine which model type had a greater range of ability levels. Using item cluster S10 with Models 1b and 3b thetas were mapped in bar plots. Ability level (theta) range was then analyzed to determine where historically marginalized student groups might be located to determine if a less complex model sacrifices information about these groups for a more pragmatic design. 92 CHAPTER 3: RESULTS The following results are reported in three sections by research question (RQ). Triangulation results from the qualitative and quantitative analyses is then reported after the RQ2 results. See section Research Questions for the details of each RQ. Results Relating to RQ1 The saturation evaluation of the two documents identified for analysis, OECD’s 2015 PISA Science Framework and PISA 2015 Technical Report, yielded mixed results. Neither a broader view of science educational theory nor multidimensionality for science knowledge is described in the 2015 PISA science framework. This theory was teased out during the qualitative analysis and is proposed at the end of this section. Coding on dimensionality was accomplished for the first document (i.e., the science framework) and theoretical saturation was achieved at the end of that document analysis. Saturation occurred because the second document, while mentioning science and multidimensionality upon initial analysis, did not actually refer to science content dimensionality, but rather the possible dimensionality between new and trend science items. Any evidence resulting from color codes highlighting the two subthemes are reported in Table 6 below for the only document analyzed. Note that the evidence column for subtheme unidimensionality is deliberately left blank as no portions of the science framework were found to code to unidimensionality. Missing evidence occurs in the subtheme multidimensionality (see yellow-filled cells). 93 Table 6 Evidence Supporting Dimensionality Themes Document Section/Feature Subtheme Subtheme Annotation (Pg. #) Unidimensionality Multidimensionality While OECD does appear to Box 2.1 be referring to Scientific multidimensionality with this knowledge: PISA “…three distinguishable but phrase it is not about science 2015 related elements.” content knowledge, but terminology rather the types of (Pg. 21) knowledge: content63, procedural, and epistemic. 
Section: Organizing the Science literacy is referred to domain of as a domain of interrelated science and aspects, but its science Figures 2.1-2.2 content dimensionality is not (Pg. 25) clarified. “Given that only a sample of the content domain of science can be assessed in the PISA OECD refers to “fields” of 2015 scientific literacy science indicating that these Section: assessment, content areas are not one Scientific clear criteria are used to guide single subdomain. knowledge and the selection of the 64 Figure 2.5 knowledge that is assessed. Figure 2.5 clearly separates (Pg. 28) The criteria are applied to the content knowledge knowledge from required for each “field” of the major fields of physics, science visually by them out chemistry, biology, earth and with a blue banner. space science…” OECD defined the required percentage of items by science content (physical, living, Earth and space). This Table 2.2 “Desired distribution of items, is similar to other state and (Pg. 29) by content” national assessments that intend to report on subdomains of science to showcase student learning in each content area. 63 This is the only type of knowledge relating to the science subdomains of life, physical, and Earth and space systems. 64 Knowledge of physical systems seems to combine both chemistry and physics science content. 94 Document Section/Feature Subtheme Subtheme (Pg. #) Unidimensionality Multidimensionality Annotation OECD lists content knowledge as a dimension separate from procedural Table 2.3 “Desired distribution of items, and epistemic knowledge. (Pg. 30) by type of knowledge” This does not clarify if content knowledge by itself is built of separate subdomains. OECD provides required item counts needed for the three Table 2.4 “Desired distribution of items types of knowledge by (Pg. 31) for knowledge” science content subdomain indicating each subdomain is its own dimension. Knowledge type is referred to as content indicating that science content is its own domain, but subdomains indicating multiple Figure 2.17 “Framework categories” dimensions is not reported. (Pg. 36) However, as shown in the released item in this study’s Figure 4, the item developers do document to which subdomain an item should be coded. Note. Page number refers to the page numbers shown at the bottom of each page of the PISA 2015 Science Framework provided in Appendix F. The 2015 PISA Science Framework document’s Figure 2.5 seems to provide a strong piece of evidence for multidimensionality of the science subdomains in the form of the differentiated and unconnected pieces of content knowledge that PISA assesses for science. Based on this knowledge, Figure 11 was developed to visualize any crossover of knowledge that was shown in Error! Reference source not found., which replicates and codes OECD’s Figure 2.5. Any crossover noted below is not a concrete occurrence on the PISA science items. Item developers often carefully consider aspects like content crossover and develop the required 95 items in a manner that prevents cluing of other items. My subject matter expertise was used to provide examples in Figure 11 where content knowledge in one subdomain may benefit from content knowledge in another subdomain, thus potentially affecting multidimensionality by making the subdomains less distinct. The orange curved arrows indicate this possible content knowledge crossover in the content standards. 
For example, PS1 is the knowledge of the structure of matter and having this knowledge might increase a student’s understanding of LS1, which is the knowledge of cells. Figure 11 Possible Connections between 2015 PISA Science Content Knowledge Since the qualitative results above seem to indicate that science is multidimensional with the three subdomains being assessed in 2015 PISA having little crossover, it seems a multidimensional IRT model might be warranted. However, there is insufficient information in the framework on the expected separability (or “difference”) between the theoretical elements seen – is there expected to be enough difference to perceive in a large-scale assessment? Also, there is little or no discussion of confounds. For instance, are high performing respondents in one area expected to be high performing in the others? While this is not necessarily considered theoretically true for U.S. NGSS disciplinary core ideas specifically in the three subject matter 96 areas, it is expected theoretically to be more applicable by their nature for scientific and engineering practices (SEPs)65 and cross-cutting concepts (CCCs)66. The continuum between the framework, the proposed educational theory, and its indicated model is provided in Figure 12. The proposed educational theory is that distinct science subdomains require students to have differentiated knowledge to demonstrate mastery of each subdomain, which may be more accurate for disciplinary core ideas than for practices and cross-cutting concepts. Figure 12 Proposed Continuum Note. Adapted from “A guide and glossary on postpositivist theory building for population health,” 2006, by R. M. Carpiano, and D. M. Daley, 2006, Journal of Epidemiology and Community Health, 60, p. 566. Results Relating to RQ2 The descriptive statistics, histograms, and correlations are provided first then in subsection RQ2A: Cluster Analyses Results the cluster analysis and its sensitivity test are reported. In subsection RQ2B: PCA Results PCA results are provided. Subsection RQ2C: IRT 65 OECD refers to content similar to this as procedural knowledge in the 2015 framework (OECD, 2017a). 66 OECD refers to content similar to this as epistemic knowledge in the 2015 framework (OECD, 2017a). 97 Results contains IRT model fit for two different item clusters. Lastly, triangulation results are reported. Unless otherwise noted all results are for the whole student subsamples taking each item cluster (S10 and S11) analyzed rather than for the full U.S. sample. If results pertain to the half subset of the cluster S10 subsample that is also noted. Descriptive Statistics Table 7 below provides several statistics for each item and the three added variables of number of items attempted, student raw score, and average score. Curran et al. (1996) recommend for multivariate normality that skew not range outside of +/-2 and kurtosis outside of +/-7, which the data in Table 7 do not violate. The mean for the item variables remains stable, in other words the actual mean does not stray greatly from 0.5, and indicates all the items are on the same scale (a mean of say 15 might indicate an item on a different scale when all other item means hover between 0.1 and 0.8). 
Table 7 Descriptive Statistics for Item Cluster S10 Full Subsample Variable n Mean SD Median Min Max Range Skew Kurtosis SE DS625Q01C 1306 0.52 0.50 1.00 0 1 1 -0.09 -1.99 0.01 CS625Q02S 1306 0.64 0.48 1.00 0 1 1 -0.58 -1.67 0.01 CS625Q03S 1306 0.58 0.49 1.00 0 1 1 -0.31 -1.90 0.01 CS615Q07S 1306 0.29 0.45 0.00 0 1 1 0.94 -1.11 0.01 CS615Q01S 1306 0.82 0.38 1.00 0 1 1 -1.71 0.91 0.01 CS615Q02S 1306 0.48 0.50 0.00 0 1 1 0.09 -1.99 0.01 CS615Q05S 1306 0.18 0.39 0.00 0 1 1 1.65 0.73 0.01 CS604Q02S 1306 0.49 0.50 0.00 0 1 1 0.06 -2.00 0.01 DS604Q04C 1306 0.29 0.45 0.00 0 1 1 0.93 -1.14 0.01 CS645Q03S 1306 0.51 0.50 1.00 0 1 1 -0.02 -2.00 0.01 DS645Q04C 1306 0.57 0.50 1.00 0 1 1 -0.27 -1.93 0.01 DS645Q05C 1306 0.14 0.35 0.00 0 1 1 2.04 2.18 0.01 CS657Q01S 1306 0.71 0.46 1.00 0 1 1 -0.91 -1.18 0.01 CS657Q02S 1306 0.42 0.49 0.00 0 1 1 0.32 -1.90 0.01 CS657Q03S 1306 0.47 0.50 0.00 0 1 1 0.14 -1.98 0.01 Number Attempted 1306 15.00 0.00 15.00 15 15 0 NaN NaN 0.00 Raw Score 1306 7.09 3.37 7.00 0 15 15 0.12 -0.91 0.09 Average 1306 0.47 0.22 0.47 0 1 1 0.12 -0.91 0.01 98 Figure 13 is a histogram of average scores for the full U.S. science sample while Figure 14 provides the same information for the cluster S10 subsample. Both populations seem fairly normal in distribution, but the subsample population does show less variability in average scores. Figure 15 is a set of histograms showing frequency of 0 and 1 scores for each item. A distance heatmap using the correlations from the polychoric matrix is shown in Figure 16 with the blue color indicating high similarity (close distance together) between items while the orange color indicates low similarity (far apart from each other) between items. Table 8 provides the polychoric correlations with the majority of items being weakly and negatively correlated – a negative correlation indicates that as a student does well on one item they score less on the other item. Figure 13 Histogram of Student Average Scores for Full U.S. Science Sample 99 Figure 14 Histogram of Student Average Scores for Item Cluster S10 Full Subsample Figure 15 Histograms of Student Score Point Frequency for Item Cluster S10 Full Subsample 100 Figure 16 Distance Heatmap for Item Cluster S10 Full Subsample 101 Table 8 Means (M), Standard Deviations (SD), and Correlations with Confidence Intervals (CI) for Item Cluster S10’s Full Subsample Item M SD 1 2 3 4 5 6 7 8 9 10 11 12 13 14 1. DS625Q01C 0.35 0.21 2. CS625Q02S 0.32 0.20 -.02 [-.53, .50] 3. CS625Q03S 0.37 0.19 .22 -.00 [-.33, .66] [-.51, .51] 4. CS615Q07S 0.37 0.20 .40 -.08 .13 [-.15, .75] [-.57, .45] [-.41, .60] 5. CS615Q01S 0.33 0.23 .13 -.07 .32 .38 [-.41, .60] [-.56, .46] [-.23, .71] [-.17, .75] 6. CS615Q02S 0.37 0.21 .21 -.07 .17 .40 .65** [-.34, .65] [-.56, .46] [-.38, .63] [-.14, .76] [.21, .87] 7. CS615Q05S 0.20 0.24 -.08 -.24 -.15 .09 -.42 -.26 [-.57, .45] [-.67, .31] [-.61, .40] [-.44, .58] [-.77, .12] [-.68, .29] 8. CS604Q02S 0.32 0.20 .09 -.07 .19 -.02 .11 .12 -.20 [-.44, .58] [-.56, .46] [-.36, .64] [-.53, .50] [-.43, .59] [-.42, .59] [-.65, .35] 9. DS604Q04C 0.34 0.20 .26 .18 .09 .01 .11 .20 -.11 -.12 [-.29, .68] [-.37, .63] [-.44, .58] [-.51, .52] [-.42, .59] [-.34, .65] [-.59, .43] [-.59, .42] 10. CS645Q03S 0.34 0.21 .41 .11 .24 .15 -.01 .15 -.19 .17 .14 [-.12, .76] [-.42, .59] [-.31, .67] [-.39, .62] [-.52, .50] [-.39, .62] [-.64, .36] [-.37, .63] [-.40, .61] 11. 
DS645Q04C 0.41 0.19 .29 .12 .33 .24 .34 .41 -.40 .16 .22 .31 [-.26, .70] [-.42, .60] [-.22, .72] [-.31, .67] [-.21, .73] [-.13, .76] [-.76, .14] [-.39, .62] [-.33, .66] [-.25, .71] 12. DS645Q05C 0.29 0.22 -.06 .21 -.02 -.26 -.15 -.10 -.24 .01 -.08 .27 .39 [-.55, .47] [-.34, .65] [-.53, .50] [-.68, .29] [-.62, .39] [-.58, .44] [-.67, .31] [-.50, .52] [-.57, .45] [-.28, .69] [-.15, .75] 13. CS657Q01S 0.16 0.24 -.41 -.17 -.23 -.30 -.11 -.27 -.34 -.17 -.34 -.47 -.14 -.17 [-.76, .13] [-.63, .37] [-.67, .32] [-.71, .25] [-.59, .43] [-.69, .28] [-.73, .20] [-.63, .37] [-.73, .21] [-.79, .06] [-.61, .40] [-.63, .38] 14. CS657Q02S 0.32 0.21 .03 .28 .17 .08 -.10 .07 -.27 .12 .22 .04 .12 -.02 -.24 [-.49, .53] [-.27, .69] [-.38, .63] [-.45, .57] [-.58, .43] [-.46, .56] [-.69, .28] [-.42, .60] [-.33, .66] [-.48, .54] [-.42, .60] [-.53, .50] [-.67, .31] 15. CS657Q03S 0.37 0.20 .17 .11 .27 .20 .09 .26 .12 -.02 .35 .15 .25 .09 -.58* .42 [-.37, .63] [-.43, .59] [-.28, .69] [-.34, .65] [-.44, .58] [-.29, .68] [-.42, .59] [-.53, .50] [-.19, .73] [-.39, .61] [-.30, .67] [-.45, .57] [-.84, -.10] [-.11, .77] Note. Values in square brackets indicate the 95% CI for each correlation. The CI provides a range of population correlations that describe where the sample correlation may truly lay (Cumming, 2014). * Indicates p < .05 and ** indicate p < .01. 102 RQ2A: Cluster Analyses Results The following scree plot in Figure 17 does not show a distinct elbow bend at any number of clusters because the slope is constantly decreasing without leveling off. There is no indication of multidimensionality from this analysis. Figure 17 Scree Plot for Item Cluster S10 with Full Subsample A second scree plot, shown in Figure 18, was obtained, as described in Chapter 2. Methods: Conducting Quantitative Analyses – Cluster Analyses, from the randomly chosen half of the cluster S10 student subsample. No elbow bend indicating optimal number of clusters is visible so there is no clear multidimensionality based on this analysis – no place where establishing clear dimensions would seem most appropriate. 103 Figure 18 Scree Plot for Item Cluster S10 with Random Half of Subsample Figure 19 provides a scree plot that lacks a clear elbow bend, indicating no clusters for item cluster S11, thus revealing no clear multidimensionality by the above definition. Figure 19 Scree Plot for Item Cluster S11 RQ2B: PCA Results Figure 20 provides bar plots of the three PCA loadings (eigenvectors) explaining the most variation in the set of items in cluster S10. Principal components 4-15 were dropped as 104 they all explained 2.5% or less of the variation and were each far below eigenvalues of 1. The reader can note below that the only for PC 1 was the eigenvalue clearly 1 or greater, but since my research questions are about multidimensionality in three dimensions as compared to the usual unidimensionality with which PISA data are modeled, PC1-PC3 were kept. Per prcomp function documentation, “The signs of the columns of the rotation matrix (the loadings) are arbitrary, and so may differ between different programs for PCA, and even between different builds of R” (R Core Team, 2023). Only for PC 1 (2.01) was the eigenvalue greater than 1, PC 2 (0.1) and PC 3 (0.08) were much smaller. Eigenvalues greater than 1 indicate those components “account for more than the mean of the total variance in the items,” which follows the Kaiser- Guttman rule (Li, 2012). 
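As a brief aside on how such eigenvalues and the Kaiser-Guttman screen can be computed, the sketch below (using a hypothetical numeric matrix X of item scores, a simplification rather than the study’s polychoric-based procedure) shows the basic prcomp-based calculation:

```r
# Eigenvalues are the squared standard deviations of the principal components;
# the Kaiser-Guttman rule retains components with eigenvalues greater than 1.
# `X` is a placeholder numeric matrix of item scores (this simplified sketch
# works from raw scores rather than a polychoric matrix).
pca  <- prcomp(X, center = TRUE, scale. = TRUE)
eig  <- pca$sdev^2                     # eigenvalues
prop <- eig / sum(eig)                 # proportion of variance explained

round(cbind(eigenvalue = eig, proportion = prop), 3)
which(eig > 1)                         # components retained under the rule
```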
The vast majority of items load most strongly on PC 1, with only 1 item each loading most strongly for PC 2 and PC 3, as shown in the figures. 105 Figure 20 Loadings Bar Plots for Item Cluster S10 with Full Subsample Figure 21 visualizes the three principal components in a 3D space based on the PCA scores. In general, PCA scores are developed by taking the original measurement and multiplying it by its eigenvector coefficient, which indicates the contribution of each measurement, and then additively summarizing all those values to get the score on any axis (Carr, 2001). However, Carr (2001) indicates that when a coefficient matrix, like this study uses, is used in place of a covariance matrix to run the PCA the calculated results will not match the results computed in a R program. Note, the low redundancy - data are spread apart and weakly 106 correlated. The two items isolated in PC 2 and PC 3 will be dropped as outliers in the second round of IRT analyses for item cluster S10, as no single item ever serves as a full measure in IRT. Multidimensionality does not seem to be indicated in this item cluster as the majority of variance (78.8%) is explained by PC 1. An interactive graph with rotation and data point hovering is available at https://chart-studio.plotly.com/~cnmalcom/3. Figure 21 PCA Plot for Item Cluster S10 with Full Subsample Note. Blue dots are items loading mainly on PC 1, while red dots indicate items loading mainly on PC 2, and green dots indicate items mainly loading on PC 3. 107 A second analysis for the random half subset of the set of items in cluster S10 was used to help confirm PCA results. Figure 22 provides bar plots of the three PCA loadings (eigenvectors) explaining the most variation. Principal components 4-15 were dropped as explained above. Only for PC 1 (2.18) was the eigenvalue greater than 1, PC 2 (0.11) and PC 3 (0.1) were much smaller. The majority of items load most strongly on PC 1. Figure 22 Loadings Bar Plots for Item Cluster S10 with Random Half of Subsample 108 Figure 23 visualizes the three principal components in a 3D space based on the PCA scores. Again, the plot shows low redundancy - data are spread apart and weakly correlated. The same two items from the earlier PCA of the full subsample from item cluster S10 are still showing up independently on PC 2 and 3. Multidimensionality does not seem to be indicated as the majority of variance (80.1%) is explained by PC 1. This plot, as expected, is very similar in substance to Figure 21. An interactive graph with rotation and data point hovering is available at https://chart-studio.plotly.com/~cnmalcom/5. Figure 23 Confirmation PCA Plot for Item Cluster S10 with Random Half of Subsample 109 A third PCA for the set of items in cluster S11 was run. Figure 24 provides bar plots of the three PCA loadings (eigenvectors) explaining the most variation. Principal components 4-15 were dropped as they all explained 1% or less of the variation. Only for PC 1 (2.86) was the eigenvalue greater than 1, PC 2 (0.1) and PC 3 (0.05) were much smaller. The majority of items load on PC 1. Figure 24 Loadings Bar Plots for Item Cluster S11 110 Figure 25 visualizes the three principal components in a 3D space based on the PCA scores. Again, the plot shows low redundancy - data are spread apart and weakly correlated. Multidimensionality does not seem to be indicated as the majority of variance (86.7%) is explained by PC 1. 
While more items are loading on PC 2 and PC 3 when only partial data are used, they do not appear to be in separate dimensions and do not align with science subdomains or with types of knowledge identified by OECD in the 2015 science framework. An interactive graph with rotation and data point hovering is available at https://chart- studio.plotly.com/~cnmalcom/7. Figure 25 PCA Plot for Item Cluster S11 111 RQ2C: IRT Results Both model fit and information from IRT analyses are provided in this section. Table 9 provides the subdomain groupings of items for each item cluster that were used in the multidimensional models. Physical System items were coded as Dimension 1, Earth and Space Systems as Dimension 2, and Living Systems as Dimension 3. Table 9 Item Groupings for MIRT Models Cluster S10 Cluster S11 Item Type of Type of Knowledge Subdomain DOK Item Knowledge Subdomain DOK DS625Q01C Content Physical L CS643Q01S Procedural Physical M CS625Q02S Content Physical L CS643Q02S Procedural Physical M CS625Q03S Content Physical M DS643Q03C Content Physical L CS604Q02S Content Physical M CS643Q04S Procedural Physical M DS604Q04C Epistemic Physical M DS643Q05C Epistemic Physical M CS629Q02S Content Physical M CS615Q07S Procedural Earth and Space M CS629Q04S Epistemic Physical M CS615Q01S Procedural Earth and Space M CS615Q02S Procedural Earth and Space M DS648Q01C Procedural Earth and Space M CS615Q05S* Epistemic Earth and Space M CS648Q02S Procedural Earth and Space M CS645Q03S Content Earth and M CS648Q03S Procedural Earth and Space Space M DS645Q04C Content Earth and Space M DS648Q05C Epistemic Earth and Space M DS645Q05C Content Earth and Space M DS629Q03C Procedural Earth and Space M CS657Q01S* Content Living L CS656Q01Sê Content Living M CS657Q02S Content Living M DS656Q02Cê Procedural Living H CS657Q03S Procedural Living M CS656Q04Sê Procedural Living M Note. *Items that were dropped from Models 3B and 4b. êReleased item set. DOK = Depth of Knowledge: Low (L), Medium (M), High (H), developed by Webb in 1997 per OECD (2017a). Cluster S10. Importantly, since the two items CS657Q01S and CS615Q05S each loaded in a separate dimension by themselves, the model fit analyses were conducted both with and 112 without these items in the item cluster S10 as shown in Table 10. Based on item difficulties for Model 1, these unusual items also have unusual characteristics: item CS615Q05S (xsi = 1.79) seems to be the second hardest item, item DS645Q05C (xsi = 2.08) being the hardest, and item CS615Q01S (xsi = -2.01) ranked the easiest while item CS657Q01S (xsi = -1.34) was the second easiest. This holds true for Models 2, 3, and 4 while the difficulty levels vary slightly. Item infit (information-weighted fit) statistics for Model 1 showed that some items were slightly underfit and some slightly overfit when compared to the desired value of 1, but all were well within usual tolerances67 of 0.70 to 1.30 (blue dashed lines in Figure 26). Figure 26 shows the infit statistic for each item with the desired expected value of 1.0 (green line). The two outlier items are marked in red. Note that items below 1.0 may have too predictable of responses and those greater than 1.0 may have responses that are too noisy, i.e., the excess variation may be masking what is a good model of the response pattern (Wind & Hua, 2021). 67 These tolerances were adopted based off usage in a similar study by Pensavalle and Solinas (2013). 
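To show how infit values of this kind can be computed and screened against such tolerances, here is a minimal R sketch (assuming a hypothetical fitted TAM model object mod1, not the study’s actual object):

```r
# Item infit from a fitted TAM model, flagged against the 0.70-1.30 tolerance
# band noted above; values below 1 suggest overly predictable responses and
# values above 1 suggest noisy responses. `mod1` is a placeholder object.
fit <- TAM::tam.fit(mod1)
infit <- fit$itemfit$Infit

data.frame(item = fit$itemfit$parameter,
           infit = round(infit, 3),
           flagged = infit < 0.70 | infit > 1.30)
```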
Figure 26
Infit Statistics for 1PL UIRT Model of Item Cluster S10 with Full Subsample

There were 6 items whose infit statistic was significant at p < 0.05 for Models 1 and 3. Infit statistics remained close to the desired value of 1 for Models 2 and 4, and none were statistically significantly different. For the models with items dropped, 3 items (Model 1b) and 2 items (Model 3b) had a significant infit statistic, respectively.

Table 10
Model Fit Indices for Comparison of Relative Model Fit – Item Cluster S10 Subsample

Full set of 15 items
                                   Model 1         Model 2         Model 3         Model 4
                                   (1PL UIRT)      (2PL UIRT)      (1PL MIRT)      (2PL MIRT)
Deviance (-2 log-likelihood)       21,630.16       21,456.08       21,561.67       21,426.74
Number of estimated parameters     16              30              21              33
AIC (constraint on students)       21,662          21,516          21,604          21,493
BIC                                21,745          21,671          21,712          21,664
Iterations                         15              21              428             151
EAP reliability                    0.74            0.76            NA              NA
  Dim 1                            NA              NA              0.725           0.735
  Dim 2                            NA              NA              0.729           0.743
  Dim 3                            NA              NA              0.715           0.657
Infit range                        0.915 to 1.159  0.989 to 1.015  0.905 to 1.146  0.985 to 1.011

Without items CS657Q01S and CS615Q05S
                                   Model 1b        Model 2b        Model 3b        Model 4b
                                   (1PL UIRT)      (2PL UIRT)      (1PL MIRT)      (2PL MIRT)
Deviance (-2 log-likelihood)       18,948.83       Not run         18,925.41       Not run
Number of estimated parameters     14                              19
AIC (constraint on students)       18,977                          18,963
BIC                                19,049                          19,062
Iterations                         17                              389
EAP reliability                    0.75                            NA
  Dim 1                            NA                              0.736
  Dim 2                            NA                              0.728
  Dim 3                            NA                              0.635
Infit range                        0.929 to 1.001                  0.933 to 1.081

Note. The green highlighted model is the most parsimonious when considering model statistics and guidelines.

Improvement of fit was determined for each model pair and is shown in Table 11, which can be interpreted for each pair of models as "from Model X to Model Y there is a Z% improvement in model fit." Improvement of fit was determined with the following formula: ((Model X Deviance – Model Y Deviance) / Model Y Deviance) * 100. For example, from Model 1 to Model 3 this is ((21,630.16 – 21,561.67) / 21,561.67) * 100 ≈ 0.3%. Overall, the improvement in fit for Model 1 (1PL UIRT) compared to Model 3 (1PL MIRT), the three-dimensional model by content area, is only 0.3%, which may not be meaningful enough for country-level results. In other words, at least for reporting at the group level for which representative samples were drawn (country in this case), the difference made by carrying forward all the extra parameters of the more complex model would not make a practical difference. Note that models for the full set of items were not compared with the models lacking 2 items; the difference in items means those models are not nested.

Table 11
Comparison of Model Fit – Item Cluster S10 Subsample

Model pair            1 to 2   1 to 3   1 to 4   3 to 2   2 to 4   3 to 4   1b to 3b
Improvement of fit    0.8%     0.3%     0.9%     0.5%     0.1%     0.6%     0.1%

Goodness of fit was also determined using a chi-square test to compare sets of models, with the significance level 𝛼 set to 0.05. The models are compared below.

• Model 1 and Model 2: Χ2 (14, N = 1,306) = 174.09, p = 0, so the null hypothesis that both models are the same can be rejected and Model 2 is a significantly better fit based on lower deviance, AIC, and BIC statistics.
• Model 1 and Model 3: Χ2 (5, N = 1,306) = 68.50, p = 0, so the null hypothesis that both models are the same can be rejected and Model 3 is a significantly better fit based on lower deviance, AIC, and BIC statistics.
• Model 1 and Model 4: Χ2 (17, N = 1,306) = 203.43, p = 0, so the null hypothesis that both models are the same can be rejected and Model 4 is a significantly better fit based on lower deviance, AIC, and BIC statistics.
• Model 2 and Model 3: Χ2 (9, N = 1,306) = 105.59, p = 0, so the null hypothesis that both models are the same can be rejected and Model 2 is a significantly better fit based on lower deviance, AIC, and BIC statistics.
• Model 2 and Model 4: Χ2 (3, N = 1,306) = 29.34, p < 0.0001, so the null hypothesis that both models are the same can be rejected and Model 4 is a significantly better fit based on lower deviance, AIC, and BIC statistics.
• Model 3 and Model 4: Χ2 (12, N = 1,306) = 134.93, p = 0, so the null hypothesis that both models are the same can be rejected and Model 4 is a significantly better fit based on lower deviance, AIC, and BIC statistics.
• Model 1b and Model 3b: Χ2 (5, N = 1,306) = 23.42, p < 0.0001, so the null hypothesis that both models are the same can be rejected and Model 3b is a significantly better fit based on lower deviance and AIC statistics (BIC was actually higher for Model 3b).

Cluster S11. Importantly, since the 1PL UIRT model was found to be the better fit for item cluster S10, only it and the 1PL MIRT model were compared for this item cluster, as confirmation of unidimensional model fit. The IRT model fit analysis is shown in Table 12. While Model 6 appears to be the better fitting model based on the lower AIC and BIC statistics, it requires more iterations to converge, and substantially more parameters are employed without making much practical difference in results at the group reporting level for this assessment. Model 5 has 9 items and Model 6 has 8 items whose infit statistic was significant at p < 0.05.

Table 12
Model Fit Indices for Comparison of Relative Model Fit – Item Cluster S11 Subsample

Full set of 15 items
                                   Model 5          Model 6
                                   (1PL UIRT)       (1PL MIRT)
Deviance (-2 log-likelihood)       21,530.46        21,467.69
Number of estimated parameters     16               21
AIC (constraint on students)       21,562           21,510
BIC                                21,645           21,618
Iterations                         25               474
EAP reliability                    0.8              NA
  Dim 1                            NA               0.727
  Dim 2                            NA               0.752
  Dim 3                            NA               0.794
Infit range                        0.888 to 1.152   0.912 to 1.132

Note. The green highlighted model is the most parsimonious when considering model statistics and guidelines.

Improvement of fit was also determined for this model pair: Model 6 shows a 0.3% improvement over the unidimensional Model 5. For the goodness-of-fit chi-square test, Model 5 and Model 6: Χ2 (5, N = 1,306) = 62.78, p = 0, so the null hypothesis that both models are the same can be rejected and Model 6 is a significantly better fit based on lower deviance, AIC, and BIC statistics.

Item Fit Analyses. Using the best fitting model (Model 1b – 1PL UIRT) for item cluster S10 (full subsample), several analyses of item fit were completed. Figure 27 shows the ICCs for the 13 items of item cluster S10 – the blue line is the model's expected score curve and the black line is the actual score curve. These plots showcase the relationship between the latent trait (student ability) and the probability of an expected correct response (Wind & Hua, 2021).

Figure 27
ICC Plots for Item Cluster S10 with Full Subsample
Note. Based on Model 1b; items that were dropped are not shown.

In Figure 28, the left side of the map provides a histogram of student ability (latent trait). The right side of the map provides item difficulty, with harder items near the top of the map and easier items near the bottom. The distribution of difficulty shows a reasonably good spread given the location of the person respondents, although more very easy items might be helpful given the large number of respondents performing below the level of the bulk of the items. A brief sketch of how these displays can be generated follows.
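As a companion to the item fit displays just described, the following is a minimal sketch, in R, of one way expected score curves and a Wright map can be drawn from a fitted TAM model using the WrightMap package (Irribarra & Freund, 2014). The simulated data, object names, and plotting arguments are illustrative assumptions; the figures in this study may have been produced with different settings.

# Illustrative sketch: expected score curves and a Wright map for a 1PL UIRT fit (simulated data)
library(TAM)
library(WrightMap)

set.seed(1)
resp <- matrix(rbinom(1306 * 13, 1, 0.6), ncol = 13,
               dimnames = list(NULL, paste0("item", 1:13)))  # stand-in for the 13 retained items

mod1b <- tam.mml(resp)             # stand-in for Model 1b (1PL UIRT, reduced item set)

# Expected score curves: model-implied curve overlaid with the observed curve per item,
# analogous to the ICC plots in Figure 27
plot(mod1b, items = 1:13, export = FALSE)

# Person ability estimates (theta) and item difficulties (xsi) for the Wright map
persons <- tam.wle(mod1b)          # weighted likelihood estimates of ability
difficulties <- mod1b$xsi$xsi

# Left panel: histogram of person ability; right panel: item difficulty locations
wrightMap(persons$theta, difficulties, label.items = rownames(mod1b$xsi))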
121 Figure 28 Wright Map for Item Cluster S10 with Full Subsample Note. Based off Model 1b and items that were dropped are not shown. Triangulation Extending on the earlier proposed qualitative education theory: distinct science subdomains require students to have differentiated knowledge to demonstrate mastery of each subdomain, a multidimensional model should best fit student data to accurately portray student abilities. Since this did not occur, a divergent triangulation is shown in Figure 29 as 122 qualitative analysis did reveal multiple content dimensions. Broken orange arrows indicate propositions that did not get proven by empirical evidence in the theoretical level. The curvy green arrow indicates that Proposition 1 should have directly led to Proposition 2 but was not proven by findings from the quantitative analysis. Blue arrows indicate empirical findings that did support each proposition. Note that the two broken arrow claims both rely on little separation by the more complex models, relative to the reporting claims. Figure 29 Triangulation of Results Note. Adapted from “Combining qualitative and quantitative research within mixed method research designs: A methodological review,” by U. Östlund, L. Kidd, Y. Wengström, and N. Rowa-Dewar, 2011, International Journal of Nursing Studies, 48(2011), p. 371. Copyright 2010 by Elsevier. The IRT analyses did statistically support Proposition 2 for both S10 and S11 subsamples but not with practical significance. The initial exploratory analyses offer some clues, such as many 123 weakly correlated dimensions but not rising to the level of three-dimensional modeling or aligning with the theoretical level being assessed in this dissertation, for any of the subsamples. Results Relating to RQ3 Figure 30 shows histograms for student ability level for both the 1PL UIRT and 1PL MIRT, Models 1b and 3b respectively, for item cluster S10. The number of students is greatest in the middle range of abilities for all models. For Models 1b and 3b dimension 1 the ability levels ranged from -2 to 2 while for Model 3b dimensions 2 and 3 they ranged from -3 to 3. Please note that the software does not allow dimensions to be directly compared as they are not aligned in this software, so do not make this comparison although the charts are located near each other. Figure 30 Histograms of Student Ability Levels for Item Cluster S10 124 Note. Items CS657Q01S and CS615Q05S were not included in the models shown above. 125 CHAPTER 4: DISCUSSION Analysis of educational data cannot exist in a silo apart from the education content being measured. What the construct is and how students are asked to learn and master it matter. Assessment data from students need to be modeled by researchers in a meaningful manner that incorporates these aspects. Often, assessment developers and educators start with an education standard and work towards developing an assessment/task to measure student performance rather than beginning with an understanding of the content framework and how students learn the subject (Claesgens et al., 2008; NRC, 2001). This study aimed to marry both a qualitative review of the 2015 science framework’s content and how that content was learnt by students to a quantitative analysis of which model would be most appropriate for the assessment data. 
Major themes of the study’s results are presented in this chapter, along with providing the study’s limitations, threats to the study’s validity and reliability, avenues for future research, and policy recommendations for stakeholders. Study Overview The literature synthesis revealed that accurately assessing science learning is still a relevant need in the U.S. In addition, the U.S. education system is still failing to improve gaps in science performance associated with a student’s economic status and their race (Corcoran et al., 2009). Science education in U.S. schools continues to be centered around separate subject subdomains: life, physical, Earth and space, etc., which also drives how science assessments are formed. These science subdomains have been successfully modeled as multidimensional for other assessments, but some assessments like PISA still model science unidimensionally while reporting on student achievement in a multidimensional space – see Appendix C. UIRT models 126 are often more familiar and interpretable to large-scale assessment developers (Lang & Tay, 2021). This study was undertaken to determine if a mixed methods approach could reveal triangulated evidence of multidimensionality in the 2015 PISA science and how inaccurate modeling might impact equity issues relevant to students. Key Takeaways 1. Qualitative analysis yielded a multidimensional view of science content in the 2015 PISA science framework by determining the science subdomains of living, physical, and Earth and space systems were developed as separate content dimensions. 2. Quantitative analysis yielded a 1PL UIRT model as the most practical fit for the 2015 PISA science assessment student data. This outcome was confirmed by PCA and cluster analysis results. Even though the 1PL MIRT model was statistically significant, it offered little improvement of fit for the inferences being made in PISA. 3. The equity investigation was limited by available data yet yielded one result of how using a unidimensional model instead of a MIRT model might negatively impact marginalized student groups by combining students into fewer ability levels, which leads to a loss of information about how students are performing in specific science subdomains. A Lack of Synergy Between Results Triangulation of results illustrated that the MIRT model advocated for by a qualitative analysis of the framework was not supported by quantitative analysis, which indicated the 1PL UIRT model was the most practical model in terms of using fewer parameters and viewed in the light of the MIRT model not improving fit substantially. Following is a discussion of what factors 127 might cause this disconnect between what is described in the framework and how the data end up being modeled. These factors are outside the scope of this study’s research questions. Reckase (1989, p. 9) stressed that “test items may require more than one cognitive skill for successful solution but still generate a statistically unidimensional data set through the interaction with a population that varies on many dimensions.” This might be seen here in the weak correlations shown in the descriptive statistics. One factor impacting multidimensionality of science could be that “most decisions about instruction and curriculum sequences in science have not been guided by a long-term understanding of learning progressions that are grounded in the findings of contemporary cognitive, developmental, education, and science studies research” (NRC, 2001; NRC, 2007, p. 214). 
Middle and high schools in the U.S. continue to offer science courses as distinct units, and science subdomains are infrequently integrated (Enger & Yager, 2009). Until NGSS was released in 2013 and advocated for crosscutting concepts and science practices that apply across science subdomains (NGSS Lead States, 2013), science learning focused mainly on recall of discrete facts. In 2015, when PISA last administered science as a major domain, the impact of NGSS on curriculum would not yet have been widespread across the different states. This means students may still have been responding to items on the science assessment via recall of facts rather than actually demonstrating mastery of science concepts. Use of memory to recall facts might be a separate dimension (Leigh et al., 2006) from the science subdomains, but this is outside the scope of the investigation here.

Another factor is hinted at by findings from a research study set in Brazil, which noted for that country that PISA scores could be affected by higher participation rates of students who had completed more school years (Gomes et al., 2020). If students of a similar age/grade level all have a similar ability level, as evidenced by their scores, that might also affect whether a construct shows up as multidimensional. One study showed that even though IRT analysis should be sample independent for the estimation of item characteristics, "the stability of these estimations is enhanced when the sample is heterogeneous with regard to the latent trait" (Osteen, 2010, p. 79), which is less likely in a student sample whose members are homogeneous in coursework taken and age.

A third factor that could impact or mask multidimensionality is general ability. Pokropek et al. (2022) found that only 17% of the variance in science items is attributable to a specific science ability latent trait, while the rest can be attributed to general ability (sometimes considered as working memory or another cognitive trait), based on a study of 33 OECD countries taking the 2018 PISA. A strong correlation of 0.88 exists between standard achievement tests that measure intelligence and PISA (Pokropek et al., 2022). This could indicate that PISA is measuring a student's general intelligence or even their test taking skills rather than the subdomains outlined in the 2015 science framework. While these are outside of the scope of the research questions here, it is known that aspects such as test taking skills can present as an "unwanted dimension of performance" (Scalise & Gifford, 2006). Even the socioeconomic factors measured by OECD, such as the number of books owned by the family, may correlate more strongly with general ability (Pokropek et al., 2022).

Overview of Released Item Set

Since the qualitative analysis of the framework indicates the science subdomains should be separate dimensions, a deeper look into what a released item set (see Figures 31-33)68 reveals about the science content versus general ability requirements of the items being assessed in 2015 PISA science is warranted. If items only ask students to recall science facts, eliminate options based on test taking skills, or use their general intelligence to respond, then this could eliminate the content-related multidimensionality being explored here that seems required by the framework.
Note that this qualitative review of item content is based on one set of items as the other science items analyzed in this study were not released, leading to the caveat that other items may have more rigorous ties to science subdomains. The first item of the Bird Migration set shown in Figure 31 (OECD scoring information provided directly below) seems to be a simple recall of the definition of natural selection, a mechanism of evolution. The item stimulus provides the key term evolution to solicit recall. The item also requires a level of detail about bird migration that is not normally taught in high school since option D (a second correct response) expects students to know if a species of bird can have a better chance at finding nesting sites, which Robins69 do by having more experienced flock members lead less experienced to good nesting sites. OECD (n.d.-c) has tied this item to content knowledge of living systems, most likely to the standard “Populations (e.g. species, evolution, biodiversity, genetic variation)” – more detail on what aspect/s of evolution can be included in the 2015 PISA science are not provided by OECD. This item is also coded as a 68 Images from OECD (n.d.-c). 69 See Robin Migration website https://journeynorth.org/tm/robin/facts_migration.html with contributions by an ornithologist (a scientist who studies birds). 130 DOK of medium by OECD (n.d.-c), which typically requires more from a student than recall of facts. From an assessment developer perspective, I do not see knowledge of living systems being applied in this item. Figure 31 Bird Migration Item 1 from Item Cluster S11 131 The second item of the Bird Migration set shown in Figure 32 (OECD scoring information provided directly below) is of the constructed response item type. The item expects students to draw on procedural knowledge of the correct process for scientific experiments that could apply to any science subdomain. OECD (n.d.-c) has tied this item to procedural knowledge of living systems, most likely to the same standard – a specific aspect of the “Populations” standard was not able to be teased out for this item. This item is also coded as a DOK of high by OECD (n.d.-c), which could be because the student has to make an analytical connection to factors that would invalidate a scientific investigation. From an assessment developer perspective, this item seems tied to the living systems subdomain merely because it mentions living animals, i.e., birds, but does seem to be a higher DOK, although the format effect between selected and constructed response might require a higher literacy DOK, which could be another source of dimensionality although not theoretically intended based on the PISA science framework analyzed. 132 Figure 32 Bird Migration Item 2 from Item Cluster S11 The third item of the Bird Migration set shown in Figure 33 (OECD scoring information provided directly below) is another multiple-choice item like Item 1 but with a multi-select option technology enhancement. Notice that the item’s main stimulus has changed to a narrative about a specific bird species and the online version of this computer-based item set 133 does not seem to allow for students to return to the original stimulus. This could confuse some students and lead to a disconnect with the item. The item again expects students to draw on procedural knowledge, but this time tied to what seems to be actual research data from a living system. 
OECD (n.d.-c) has tied this item to procedural knowledge of living systems, most likely to the same standard – migrations of populations would fit into the “Populations” standard’s example list. This item is also coded as a DOK of medium by OECD (n.d.-c), which this item seems to meet as the data must be interpreted by the student. From an assessment developer perspective, this item out of the set is most closely tied to the living systems subdomain yet focuses on a skill that could be used in any science subdomain – the interpretation of graphical data. Hence once again, subdomain multidimensionality could be masked in this example. Based on just this item set, the application of general intelligence ability and perhaps a dimension of science procedural knowledge does seem to outweigh application of concepts required by each science subdomain in the 2015 PISA science framework. PISA’s science procedural knowledge might be similar to “science practices” in the U.S. NGSS and PISA science epistemic knowledge similar to cross-cutting concepts, bringing these more unifying ideas into play and masking some subdomain multidimensionality. If other items are also disconnected from the subdomains this could help explain the practical unidimensionality of item clusters S10 and S11. 134 Figure 33 Bird Migration Item 3 from Item Cluster S11 135 Alternate Sources of Multidimensionality • OECD (2017b) notes there is possible dimensionality between new and trend science items and claims that a UIRT model provided a better fit, yet the fit statistics (AIC and BIC) provided in the technical report show the MIRT model as statistically better with slightly more improvement of fit. This would be for the linked set of items across clusters, not possible to explore here. This is outside the themes being explored and was not codable but is consistent with the results found here for the clusters examined. • In the 2015 science framework OECD (2017a) provides several types of knowledge, each of which could be a separate dimension. Only content knowledge was examined in this study, but procedural knowledge and/or epistemic knowledge may each be a dimension similar to the three-dimensional aspect of NGSS, as discussed earlier. • Item format, specifically the item’s stem (introductory sentence) has been found to impact multidimensionality of a math assessment (Kan et al., 2018). This could also be a source of multidimensionality in science assessments, as discussed earlier. Also, science content often uses math expressions in items, so the math issue like literacy discussed earlier could be involved in the weak correlations seen in the data exploration. • Scientific inquiry requires a diverse skill set from students, such as creativity and ability to question. This is speculative based on data and research questions here, and is not explored in the literature review, but such broader issues could contribute if present so that many, but not separable, dimensions might be in play as discussed earlier. These do not seem aligned with the sources of dimensionality explored directly in the analysis. 136 Impact On Equity Referring back to the findings and inferences drawn from the available data RQ3 could be better articulated as: • What inferences can be made about equity in education based on model fit and range of student ability per different subdomains? This rephrasing of the research question was necessitated by the lack of publicly available demographic data. 
Therefore, only inferences about how model fit might impact equity could be made. The NRC (2012, p. 277) advocates that a “crucial role of a framework and its subject matter standards is to help ensure and evaluate educational equity” and “that all students should have adequate opportunities to learn”. We should extend that to modeling data accurately with an eye towards the best model fit so all students have their performance modeled equitably. While PISA aims to impact policy at a broader national scale than the student level (Froese-Germain, 2010) the outcome of mismodelling student data can still be that stakeholders redirect resources to away from student groups and subdomains of science education that really need them. As mentioned earlier, OECD did not make publicly available information on race/ethnicity of students for the 2015 PISA. This lack of data impacts stakeholder understanding of the diversity of the student population and disables researchers attempts to search for equity issues in student performance. We know that the U.S. sample was drawn to be representative, within the limitations for missing data described earlier, so subgroups must be present, and at representative rates, but they are not disaggregated in the PISA data shared. 137 Therefore, results from this investigation were limited to student ability level as evidenced by theta and how the range of ability levels changed between the UIRT and MIRT models, for the full sample. Future work should explore educational technology data sets or others internal to the U.S. where individual inferences are made and disaggregated samples might be available in science, to see for whom the impact of practical significance might be most meaningful if statistically significant multidimensionality that is separable is present but neglected in the modeling. One issue can be explored here. When looking at Figure 30 and comparing the UIRT model to the MIRT model, one can see how the students are now compacted into fewer bins for the MIRT model. This may be doing a disservice to those students who do not have access to the higher-level science classes or are challenged by economic constraints of where they live (e.g., lack of internet or science labs). These underrepresented groups tend to be minorities – see Figures 32-34 of Appendix A. This lack of access or differential access to educational content continues to be a concern regarding a lack of advancement in the U.S. on science achievement by students (Vasquez, 2006). Limitations There were numerous limitations to this study. No linking information was released to examine all the items together thus limiting the size of sample. OECD did not document why data were missing for U.S. cases in the released sample, or provide additional information about them, so there was no additional information to report. Only one set of items from a subsample was released so a thorough qualitative review of the items was not possible. A lack of publicly available data on the diversity of the sample inhibited the equity investigation 138 required to answer RQ3. This absence of this information was a policy decision by some participating countries – in particular, the NCES controls access to ethnicity data provided by OECD for the U.S. and only makes it available unlinked and through a restricted use license, which were not the focus of this dissertation. Response rates were low for trend items so only new items were analyzed. 
Because the 2015 assessment cycle was the first fully digital science assessment, the low response rates could be due to trend items being mainly paper based while new items were developed natively as computer based, and to more schools expecting electronic delivery of assessments.

Threats to Validity and Reliability

The model's generalizability may be limited due to sample size, which may be restricted by the elimination of students with missing data. Therefore, external validity may be impacted, and the results may not be generalizable to various student populations with differing demographics in the U.S., especially minority groups that tend to have smaller population sizes. This calls into question whether the U.S. sample design was adequate. Next, as with most assessments, there is always the question of whether what is being measured, such as the subdomain of life science, remains the same across clusters of variable content. Without any linking information available and no common form among all students, instrumentation also threatens the internal validity of this study. "A test is valid for measuring an attribute if (a) the attribute exists and (b) variations in the attribute causally produce variation in the measurement outcomes" (Lang & Tay, 2021, p. 328). Relatedly, students had access to prior versions of science items in the form of released items, which introduces the threat of testing, where exposure to a pre-test similar to the final test might influence their outcomes as reported in the PISA results. Finally, opportunity-to-learn threats to internal validity could exist since students who were assessed varied in their learning stages by having taken different science courses. An important note is that because there was not a control group and students were not divided into groups (only by country) for the analysis, threats to validity associated with potential differences in group membership/selection bias are not relevant to PISA's sampling. There are different global windows of testing for PISA, which could indicate a maturation threat, but the window of testing for the U.S. was fairly stable. Changes to the environment in which the assessment was taken and changes in student behavior were not shared by OECD.

With regard to reliability, a key threat is researcher error regarding the number and construct type of the dimensions identified in the qualitative analysis. This error was mitigated by having a reviewer familiar with the construct analyze the results. No agreement indices were collected for the qualitative analysis.

Future Research

Since the qualitative and quantitative analyses led to divergent results, the question remains of how to accurately model the science content domain when its subdomains appear to be separate dimensions. Determining how the dimensionality of science content affects assessments and their items is crucial to developing assessments that yield information equitably for diverse student subgroups. The following are recommendations for future research in this area:

• Since polytomous items were dropped from the study and only dichotomous items were analyzed, a future analysis needs to include a more complex model that covers both types of items.
• A structural equation model might be implemented if more connections are found between items based on their subdomain content.
• The MIRT model in this study was developed for a between-item representation of multidimensionality – see Figure 6 (from Li et al., 2012).
Baghaei (2012) provides an overview of a MIRT model for within-item dimensionality – see Figure 34 below. A within-item MIRT model should be tested to see if it provides better fit since some of the standards in the 2015 PISA science framework were found to load on each other during the qualitative analysis – see Figure 11. This relates to the expectation that students should be able to apply science content to either interdisciplinary or independent science items/tasks (Mostafa et al., 2018; OECD, 2017a). 141 Figure 34 Differences in Between-item and Within-item MIRT Models Note. From “The application of multidimensional Rasch models in large-scale assessment and validation: An empirical example,” by Purya Baghaei, 2012, Electronic Journal of Research in Educational Psychology, 10(1), p. 239. Copyright 2012 by Electronic Journal of Research in Educational Psychology. Policy Recommendations The recommendations below are targeted towards several groups of science education stakeholders: researchers, learning scientists, policy makers, assessment and curriculum developers, and educators themselves. With the goal of clearer understanding of how science assessments should be modeled to increase equity for all students here are some crucial areas still needing attention: • Prior researchers have advocated using an UIRT model simply because PISA has used this model in the past - see the Davier et al. (2019) study on model fit using PISA data. While following a similar methodology might increase reliability of results it does not 142 address equity concerns and new models should be analyzed to find the fit that provides accurate information on science performance for every student. • At what point is an increase in fit too small if it can be justified by the help it provides the group existing in that range? Quantitative education researchers should review if model fit indices can be tailored to specific subgroups of the population. • Developers of science standards for the U.S. should compare NGSS dimensionality and that found in PISA framework to identify areas where there is a content match or disconnect. These comparisons could guide how items are coded to a dimension in MIRT models • For future PISA cycles, that more information such as linking be released, or that a similar study be conducted internally and a report released. • OECD should consider making ethnicity data available publicly to increase transparency of results. This data release can be done on a nation-by-nation basis if requested. • OECD assessment designers should release item specifications if available or build them to guide interpretation of future science framework standards. • Learning scientists should identify any unique constructs for each science subdomain with the goal of describing how students develop mastery in each area. Conclusions In order to ensure we measure what is intended, both educators and researchers will benefit from digging deeper into what constitutes each science subdomain. If the science subdomains are truly individual constructs, then our scoring models should follow the framework’s scaffolding, or we should identify what dimensionality aspect is more 143 predominantly being assessed. While large-scale assessments provide countries with a wealth of data, they may be doing harm to educational equity based on use inferences that routinely affect national educational objectives and school policies. 
In other words, we should not make claims about what is measured when we did not accurately differentiate the constructs or chose a model based on its usability only. As the U.S. moves forward with more three- dimensional science learning via NGSS how to model those constructs in a multidimensional space will become more critical. To begin evaluating these aspects and fulfill the proposed policy recommendations above, a good start, to coin a phrase from Castillo & Gillborn (2023), is to “democratize evidence.” Without complete sets of data, researchers are limited in verifying or building upon previous research and new discoveries cannot be made. Throughout this study it was clear that more complete student demographic data was needed to validate the equity inferences being drawn from the conclusion that there was model misfit based on the qualitative and quantitative findings. 144 REFERENCES Alesina, A., Devleeschauwer, A., Easterly, W., Kurlat, S., & Wacziarg, R. (2003). Fractionalization. Journal of Economic Growth, 8(2), 155-194. https://www.nber.org/papers/w9411 American Educational Research Association. (2014). Standards for Educational and Psychological Testing. Armstrong, C. (2021). Key methods used in qualitative document analysis. SSRN eLibrary. https://ssrn.com/abstract=3996213 Atkisson, M. (2010, 2010-10-15). Social Negotiation as a Central Principle of Constructivism. Ways of Knowing. https://woknowing.wordpress.com/2010/10/14/social-negotiation- as-a-central-principle-of-constructivism/ Ayala, R. J. (2022). The theory and practice of item response theory (2nd edition). The Guilford Press. Baghaei, P. (2012). The application of multidimensional rasch models in large scale assessment and validation: An empirical example. Electronic Journal of Research in Educational Psychology, 10(1), 233-252. Baškarada, S. & Koronios, A., (2018). A philosophical discussion of qualitative, quantitative, and mixed methods research in social science. Qualitative Research Journal, https://doi.org/10.1108/QRJ-D-17-00042 Boon, M., Orozco, M., & Sivakumar, K. (2022). Epistemological and educational issues in teaching practice-oriented scientific research: Roles for philosophers of science. European Journal for Philosophy of Science, 12(16), 1-23. Bowen, G. A. (2009). Document analysis as a qualitative research method. Qualitative 145 Research Journal, 9(2), 27-40. Brandt, S. (2015). Unidimensional interpretation of multidimensional tests. [Doctoral dissertation, Christian-Albrechts-Universität zu Kiel]. Briggs, D. C. & Wilson, M. (2003). An introduction to multidimensional measurement using Rasch models. Journal of Applied Measurement, 4(1), 87-100. Broesch, T., Crittenden, A. N., Beheim, B. A., Blackwell, A. D., Bunce, J. A., Colleran, H., Hagel, K., Kline, M., McElreath, R., Nelson, R. G., Pisor, A. C., Prall, S., Pretelli, I., Purzycki, B., Quinn, E. A., Ross, C., Scelza, B., Starkweather, K., & Stieglitz, J. (2020). Navigating cross- cultural research: methodological and ethical considerations. Proceedings B The Royal Society, 287(20201245), 1-7. Brooks-Bartlett, J. (2018). Probability concepts explained:Maximum likelihood estimation. Towards Data Science. Carpiano, R. M., & Daley, D. M. (2006). A guide and glossary on postpositivist theory building for population health. Journal of Epidemiology and Community Health, 60, 564-570. doi: 10.1136/jech.2004.031534 Carr, S. M. (2001). Interpreting a principal components analysis - Theory & practice. Memorial University: Biology – Faculty of Science. 
https://www.mun.ca/biology/scarr/2900_PCA_Analysis.htm Castillo, W. & Gillborn, D. (2023, September). How to “QuantCrit:” Practices and Questions for Education Data Researchers and Users. (EdWorkingPaper No. 22-546). https://doi.org/10.26300/v5kh-dd65 Center for Professional Education of Teachers. (n.d.). Equity and Assessment. 146 https://cpet.tc.columbia.edu/news-press/equity-and-assessment Civil Rights Data Collection. (2023, November 20). Data on Equal Access to Education. Office for Civil Rights, U.S. Department of Education. https://ocrdata.ed.gov/ Claesgens, J., Scalise, K., Wilson, M., & Stacy, A. (2008). Mapping student understanding in chemistry: The perspectives of chemists. Science Education, 93, 56-85. College Board. (2009). Science: College board standards for college success. The College Board. Connected Papers | Find and explore academic papers. (2021). https://www.connectedpapers.com/ Corcoran, T., Mosher, F. A., & Rogat, A. (2009). Learning progressions in science: An evidence- based approach to reform (Report No. RR-63). Consortium for Policy Research in Education (CPRE). www.cpre.org Csapó, B., & Funke, J. (Eds.). (2017). The Nature of problem solving: Using research to inspire 21st century learning. OECD Publishing. https://read.oecd-ilibrary.org/education/the- nature-of-problem-solving_9789264273955-en#page5 Cummings, G. (2014). The new statistics: Why and how. Psychological Science, 25(1), 7-29. Curran, P. J., West, S. G., & Finch, J. F. (1996). The robustness of test statistics to nonnormality and specification error in confirmatory factor analysis. Psychological Methods, 1(1), 16- 29. Davier, M. V., Yamamoto, K., Shin, H. J., Chen, H., Khorramdel, L., Weeks, J., Davis, S., Kong, N., & Kandathil, M. (2019). Evaluating item response theory linking and model fit for data from PISA 2000–2012. Assessment in Education: Principles, Policy & Practice, 26(4), 466- 488. 147 DeMars, C. E. (2016). Partially compensatory multidimensional item response theory models: Two alternate model forms. Educational and Psychological Measurement, 76(2), 231- 257. Duran, V. (2014, March 17). Multidimensional item response theory: What have we learned thus far [PowerPoint slides]. The Psychometrics Centre, University of Cambridge. https://www.psychometrics.cam.ac.uk/system/files/documents/multidimensional-item- response-theory.pdf Enger, S. K., & Yager, R. E. (2009). Chapter 1: A framework for assessing student understanding in science: A standards-based K-12 handbook. In Assessing student understanding in science (2nd edition, pp.1-11). Sage. Erzberger, C., & Kelle, U. (2003). Making inferences in mixed methods: The rules of integration. In A. Tashakkori & C. Teddlie (Eds.), Handbook of Mixed Methods in Social & Behavioral Research 1st edition (pp. 457-488). Sage. Froese-Germain, B. (2010). The OECD, PISA and the impacts on educational policy (Report). Canadian Teachers’ Federation. Gale, N. K., Heath, G., Cameron, E., Rashid, S., & Redwood, S. (2013). Using the framework method for the analysis of qualitative data in multi-disciplinary health research. BMC Medical Research Methodology, 13(117), 1-8. Gao, N., Johnson, H., Lafortune, J., & Dalton, A. (2019). New eligibility rules for the University of California? The effects of new science requirements. Public Policy Institute of California. https://www.ppic.org/wp-content/uploads/new-eligibility-rules-for-university-of- california-the-effects-of-new-science-requirements.pdf 148 GEOstata. (2016). PISA 2015 Results – Performance in Science. 
OECD. Godwin, A. (2017). Unpacking latent diversity. In American Society for Engineering Education Annual Conference & Exposition. Gomes, M., Hirata, G., & Oliveira, J. B. A. E. (2020). Student composition in the PISA assessments: Evidence from Brazil. International Journal of EducationalDevelopment, 79, 1-7. Greene, J. C. & Caracelli, V. J. (1997). Defining and describing the paradigm issue in mixed- method evaluation. New Directions for Evaluation, 1997(74), 5-17. Greenwood, B. (2020). Understanding Pedagogy - What is Social Constructivism? Satchel. https://blog.teamsatchel.com/understanding-pedagogy-what-is-social-constructivism Hahsler, M., Piekenbrock, M., & Doran, D. (2019). dbscan: Fast Density-Based Clustering with R. Journal of Statistical Software, 91(1), 1-30. doi:10.18637/jss Hanushek, E.A., Jamison, D.T., Jamison, E.A., & Woessmann, L. (2008). Education and economic growth: It’s not just going to school, but learning something while there that matters. Education Next, 8(2), 62-70. Harris, D. (n.d.). Comparison of 1-, 2-, and 3-parameter IRT models. Instructional Topics in Educational Measurement, 157-163. Ho, L., & Limpaecher, A. (2021, September 17). The practical guide to grounded theory. practical guide to grounded theory research. Delve. https://delvetool.com/groundedtheory Iliescu, D. & Greiff, S. (2021). On consequential validity. European Journal of Psychological Assessment, 37(3), 163–166. 149 Immekus, J. C., Snyder, K. E., & Ralston, P. A. (2019). Multidimensional item response theory for factor structure assessment in educational psychology research. Frontiers in Education, 4(45), 2-15. Irribarra, D. T. & Arneson, A. E. (2023). The challenge of defining and interpreting dimensionality ineducational and psychological assessments. Measurement, 221, 1-8. Irribarra, D.T. & Freund, R. (2014). Wright Map: IRT item-person map with ConQuest integration. https://github.com/david-ti/wrightmap Issayeva, L. (2022, December 18). Multidimensional item response theory. Assessment Systems Corporation (ASC). https://assess.com/multidimensional-item-response-theory/ Jerrim, J. (2016, November 1). The design and use of test scores: Lecture 4 [PowerPoint slides]. Social Research Institute, University College London. Jerrim, J., Micklewright, J., Heine, J., Salzer, C., & McKeown, C. (2018). PISA 2015: how big is the ‘mode effect’ and what has been done about it? Oxford Review of Education, 44(4), 476- 493. https://doi.org/10.1080/03054985.2018.1430025 Johnson, R. B. & Onwuegbuzie, A. J. (2004). Mixed methods research: A research paradigm whose time has come. Educational Researcher. 33(7), 14-26. Johnson, S. (2019, October 18). How one high school’s dispute reflects the struggle to teach California’s science standards. EdSource. https://edsource.org/2019/how-one-high- schools-dispute-reflects-the-struggle-to-teach-californias-science-standards/618752 Jolliffe, I. T. & Cadima, J. (2016). Principal component analysis: A review and recent Developments. The Royal Publishing Society: Philosophical Transactions A. 374, 1-16. Kaldaras, L., Akaeze, H., & Krajcik, J. (2021). A Methodology for Determining and Validating 150 Latent Factor Dimensionality of Complex Multi-Factor Science Constructs Measuring Knowledge-In-Use. Educational Assessment, 26(4), 241-263. Kan, A., Bulut, O., & Cormier, D. C. (2018). The impact of item stem format on the dimensional structure of mathematics assessments. Educational Assessment, 24(1), 13-32. Kassambara, A. & Mundt, F. 
(2020) Factoextra: Extract and visualize the results of multivariate data analyses. (Version 1.0.7) [R program package]. https://CRAN.R- project.org/package=factoextra Kelley, T. R. & Knowles, J. G. (2016). A conceptual framework for integrated STEM education. International Journal of STEM Education, 3(11), 1-11. Kiefer, T., Robitzsch, A., & Wu, M. (2015, July 2). TAM: An R package for item response modelling [PowerPoint slides]. https://user2015.math.aau.dk/presentations/205.pdf Kose, I. A. & Demirtasli, N. C. (2012). Comparison of unidimensional and multidimensional models based on item response theory in terms of both variables of test length and sample size. Procedia - Social and Behavioral Sciences, 46, 135 – 140. Krutsch, E. & Roderick, V. (2022, November 4). STEM Day: Explore Growing Careers. U.S. Department of Labor Blog. https://blog.dol.gov/2022/11/04/stem-day-explore-growing- careers Lang, J. W. B., & Tay, L. (2021). The Science and Practice of Item Response Theory in Organizations. Annual Review of Organizational Psychology and Organizational Behavior, 8, 311-338. Language Resource Center (LRC). (2022). Languages by countries. https://www.languagerc.net/languages-by-countries/ 151 Learn more about PILA (n.d.). The Platform for Innovative Learning Assessments (PILA). https://pilaproject.org/. Leigh, J. H., Zinkhan, G. M., & Swaminathan, V. (2006). Dimensional relationships of recall and recognition measures with selected cognitive and affective aspects of print ads. Journal of Advertising, 35(1), 105-122. Li, Y., Jiao, H., & Lissitz, R. W. (2012). Applying multidimensional item response theory models in validating test dimensionality: An example of K–12 large-scale science assessment. Journal of Applied Testing Technology, 13(2), 1-27. Lips, D. & Moritz, M. (2023). STEM and Computer Science Education: Reforming Federal K-12 Education R&D Activities to Strengthen American Competitiveness: Prepared by Federation of American Scientists. Lincoln Network, Foundation for American Innovation. Mailman School of Public Health. (2023, November). Item Response Theory. Columbia University. https://www.publichealth.columbia.edu/research/population-health- methods/item-response-theory Market Data Retrieval (MDR). (2024, March 26). How many schools are in the U.S.? MDR Education. https://mdreducation.com/how-many-schools-are-in-the-u-s/ Maxwell, J. A. & Mittapalli, K. (2010). Realism as a stance for mixed methods research. In A. Tashakkori & C. Teddlie (Eds.), Handbook of Mixed Methods in Social & Behavioral Research 2nd edition (pp. 145-167). Sage. Mazzei, L. A., & Jackson, A. Y. (Eds.). (2024). Postfoundational approaches to qualitative inquiry. Routledge. DOI: 10.4324/9781003298519 152 McLeod, S. (2019). Constructivism as a Theory for Teaching and Learning | Simply Psychology. https://www.simplypsychology.org/constructivism.html Messick, S. (1989). Meaning and values in test validation: The science and ethics of assessment. Educational Researcher, 18(2), 5–11. Messick, S. (1993). Foundations of validity: Meaning and consequences in psychological assessment (Report No. RR-93-51). Educational Testing Service. https://onlinelibrary.wiley.com/doi/pdfdirect/10.1002/j.2333-8504.1993.tb01562.x Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons' responses and performances as scientific inquiry into score meaning. American Psychologist, 50(9), 741–749. Mostafa, T., Echazarra, A., & Guillou, H. (2018). 
The science of teaching science: An exploration of science teaching practices in PISA 2015 (OECD Education Working Papers No. 188). www.oecd.org/edu/workingpapers National Center for Education Statistics. (n.d.-a). Frequently asked questions. PISA resources. https://nces.ed.gov/surveys/pisa/faq.asp National Center for Education Statistics. (n.d.-b). Science literacy: Average scores. Program for International Student Assessment (PISA). https://nces.ed.gov/surveys/pisa/pisa2015/pisa2015highlights_3.asp National Center for Education Statistics. (2022). High school mathematics and science course completion: Condition of education. U.S. Department of Education, Institute of Education Sciences. https://nces.ed.gov/programs/coe/indicator/sod National Research Council. (2001). Knowing what students know: The science and 153 design of educational assessment. Washington, DC: The National Academies Press. https://doi.org/10.17226/10019 National Research Council. (2007). Taking science to school: Learning and teaching science in grades K-8. Washington, DC: The National Academies Press. https://doi.org/10.17226/11625 National Research Council. (2012). A framework for K-12 science education: Practices, crosscutting concepts, and core ideas. Washington, DC: The National Academies Press. NGSS Lead States. (2013). Next generation science standards: For states, by states: Three dimensional learning. https://www.nextgenscience.org/three-dimensional-learning Organisation for Economic Co-operation and Development. (n.d.-a). About PISA. The Programme for International Student Assessment (PISA). https://www.oecd.org/pisa/aboutpisa/ Organisation for Economic Co-operation and Development. (n.d.-b). FAQ. The Programme for International Student Assessment (PISA). https://www.oecd.org/pisa/pisafaq/ Organisation for Economic Co-operation and Development. (n.d.-c). PISA 2015 released field trial cognitive items. OECD Publishing. Organisation for Economic Co-operation and Development. (2016a). Country note: Key findings from PISA 2015 for the United States. OECD Publishing. https://www.oecd.org/pisa/PISA-2015-United-States.pdf Organisation for Economic Co-operation and Development. (2016b). PISA 2015 results (volume I): Excellence and equity in education. OECD Publishing. http://dx.doi.org/10.1787/9789264266490-en 154 Organisation for Economic Co-operation and Development. (2017a). PISA 2015 Assessment and analytical framework: science, reading, mathematic, financial literacy and collaborative problem solving, revised edition. OECD Publishing. http://dx.doi.org/10.1787/9789264281820-en Organisation for Economic Co-operation and Development. (2017b). PISA 2015 technical report. OECD Publishing. https://www.oecd.org/pisa/data/2015-technical-report/ Organisation for Economic Co-operation and Development. (2018). PISA 2015: Results in Focus. OECD Publishing. Organisation for Economic Co-operation and Development. (2023). Working draft: PISA learning in the digital world assessment framework. Osteen, P. (2010). An introduction to using multidimensional item response theory to assess latent factor structures. Journal of the Society for Social Work and Research, 1(2), 66-82. Östlund, U., Kidd, L., Wengström, Y., and Rowa-Dewar, N. (2011). Combining qualitative and quantitative research within mixed method research designs: A methodological review. International Journal of Nursing Studies. 48(2011), 369-383. Otarigho, M. D. & Oruese, D. D. (2013). 
Problems and prospects of teaching integrated science in secondary schools in Warri, Delta State, Nigeria. Techno LEARN: An International Journal of Educational Technology, 3(1), 19-26. Park, S., Reeger, A., & Aloe, A. M. (2020). Technically speaking: Determining test effectiveness with item response theory. Iowa Reading Research Center, University of Iowa. https://irrc.education.uiowa.edu/blog/2020/09/technically-speaking-determining- test-effectiveness-item-response-theory 155 Pellegrino, J. W., & Hilton, M. L. (Eds.). (2012). Education for life and work developing transferable knowledge and skills in the 21st Century. National Academies Press. DOI 10.17226/13398 Pensavalle, C. A. & Solinas, G. (2013). The Rasch model analysis for understanding mathematics proficiency—A case study: Senior high school Sardinian students. Creative Education, 4(12), 767-773. Pierson, A. E., Clark, D. B., & Kelly, G. J. (2019). Learning Progressions and Science Practices Tensions in Prioritizing Content, Epistemic Practices, and Social Dimensions of Learning. Science & Education, 28, 833-841. PISA USA. (2015). Program for international student assessment: Frequently asked questions – Information for Students [Brochure]. https://www.fldoe.org/core/fileparse.php/5389/urlt/PISA2015_FAQ_Student_Informati on.pdf Pokropek, A., Marks, G. N., Borgonovi, F., Koc, P., & Greiff, S. (2022). General or specific abilities? Evidence from 33 countries participating in the PISA assessments. Intelligence, 92. Polites, G. L., Roberts, N., and Thatcher, J. (2012). Conceptualizing models using multidimensional constructs: A review and guidelines for their use. European Journal of Information Systems, 21, 22-48. Plotly Technologies Inc. (2015). Collaborative data science. Montréal, QC. https://plot.ly R Core Team. (2023). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/ 156 Reckase, M. D. (1985). The difficulty of test items that measure more than one ability. Applied Psychological Measurement, 9(4), 401-412. Reckase, M. D. (1989). The interpretation and application of multidimensional item response theory models; and computerized testing in the instructional environment: Final Report (Report No. AD-A214109). The American College Testing (ACT) Program. Reckase, M. D. (1990). Unidimensional data from multidimensional tests and multidimensional data from unidimensional tests [Paper presentation]. Annual Meeting of American Educational Research Association, Boston. Reckase, M. D. (1997). The past and future of multidimensional item response theory. Applied Psychological Measurement, 21(1), 25-36. Revelle, W. (2024). psych: Procedures for psychological, psychometric, and personality research. R package version 2.4.1. https://CRAN.R-project.org/package=psych Robitzsch, A., Kiefer, T., & Wu, M. (2022). TAM: Test analysis modules. R package version 4.1-4, https://CRAN.R-project.org/package=TAM RStudio Team. (2021). RStudio: Integrated development environment for R. RStudio, PBC, Boston, MA. http://www.rstudio.com/ Saunders, B., Sim, J., Kingstone, T., Baker, S., Waterfield, J., Bartlam, B., Burroughs, H., & Jinks, C. (2018). Saturation in qualitative research: Exploring its conceptualization and operationalization. Quality & Quantity, 52, 1893-1907. Scalise, K. & Clarke-Midura, J. (2018). The many faces of scientific inquiry: Effectively measuring what students do and not only what they say? Journal of Research in Science Teaching, 55, 1469-1496. 
157 Scalise, K. & Gifford, B. (2006). Computer-based assessment in e-learning: A framework for constructing “intermediate constraint” questions and tasks for technology platforms. The Journal of Technology, Learning, and Assessment, 4(6), 1-45. Sievert, C. (2020). Interactive Web-Based Data Visualization with R, plotly, and shiny. Chapman and Hall/CRC. https://plotly-r.com Singer, J. D. & Braun, H. I. (2018). Testing international education assessments: Rankings get headlines, but often mislead. Science, 360(6384), 38-40. Socha, A. (n.d.) Multidimensional item response theory. [Unpublished article – James Madison University]. https://educ.jmu.edu/~sochaab/index_files/Showcase/MIRT.pdf Spencer, S. G. (2004). The strength of multidimensional item response theory in exploring construct space that is multidimensional and correlated. [Doctoral dissertation, Brigham Young University]. BYU ScholarsArchive. Stehle, S. M., & Peters-Burton, E. E. (2019). Developing student 21st century skills in selected exemplary inclusive STEM high schools. International Journal of STEM Education, 6(1), 1- 15. https://doi.org/10.1186/s40594-019-0192-1 Strauss, V. (2019, December 3). Expert: How PISA created an illusion of education quality and marketed it to the world. The Washington Post. The World Bank Group. (2023). Compulsory education, duration (years). Data. https://data.worldbank.org/indicator/SE.COM.DURS The World Bank Group. (2023). World development indicators. https://datatopics.worldbank.org/world-development-indicators/ Uesaka, Y., Suzuki, M., & Ichikawa, S. (2022). Analyzing students’ learning strategies using item 158 response theory: Toward assessment and instruction for self-regulated learning. Frontiers in Education, 7(921844), 1-16. USAGov. (2023, December 27). Official language of the United States. About the U.S. and its government. https://www.usa.gov/official-language-of-us Vasquez, J. (2006). High school biology today: What the committee of ten did not anticipate. CBE—Life Sciences Education: High School Biology Today, 5, 29-33. Venkatesh, V., Brown, S. A., & Bala, H. (2013). Bridging the qualitative-quantitative divide: Guidelines for conducting mixed methods research in information systems. MIS Quarterly, 37(1), 21-54. Venkatesh, V., Brown, S. A., & Sullivan, Y. W. (2016). Guidelines for conducting mixed methods research: An extension and illustration. Journal of AIS, 17(7), 435-495. Voogt, J. & Roblin, N. P. (2012). A comparative analysis of international frameworks for 21st century competences: Implications for national curriculum policies. Journal of Curriculum Studies, 44(3), 299-321. Wach, E., Ward, R., & Jacimovic, R. (2013). Learning about Qualitative Document Analysis. IDS Practice Papers in Brief, 1-10. Wang, C. (2021). A brief history and next stage of multidimensional item response theory. Quantitative and Qualitative Methods. Wang, C. & Nydick, S. W. (2015). Comparing two algorithms for calibrating the restricted non-compensatory multidimensional IRT model. Applied Psychological Measurement, 39(2), 119-134. Welch, W. W. (1977). Chapter 3: Evaluation and decision-making in integrated science. In D. 159 Cohen (Ed.), Volume IV: New trends in integrated science teaching: evaluation of integrated science education (pp. 26-36). United Nations Educational, Scientific, and Cultural Organization. Western Governors University. (2020). What is constructivism? https://www.wgu.edu/blog/what-constructivism2005.html#close Wickham, H. (2016). 
Wickham, H. (2016). ggplot2: Elegant graphics for data analysis (Version 3.4.4) [R program package]. Springer-Verlag, New York. https://ggplot2.tidyverse.org
Wickham, H., François, R., Henry, L., Müller, K., & Vaughan, D. (2023). dplyr: A grammar of data manipulation (Version 1.1.4) [R program package]. https://github.com/tidyverse/dplyr, https://dplyr.tidyverse.org
Wilson, M. (2013). Using the concept of a measurement system to characterize measurement models used in psychometrics. Measurement, 46(9), 3766-3774.
Winarno, N., Rusdiana, D., Riandi, R., Susilowati, E., & Afifah, R. M. A. (2020). Implementation of integrated science curriculum: A critical review of the literature. Journal for the Education of Gifted Young Scientists, 8(2), 795-817. http://dx.doi.org/10.17478/jegys.675722
Wind, S. & Hua, C. (2021). Rasch measurement theory analysis in R: Illustrations and practical guidance for researchers and practitioners. Bookdown. https://bookdown.org/chua/new_rasch_demo2/
World Population Review. (2023). Most racially diverse countries 2023. https://datatopics.worldbank.org/world-development-indicators/
Yamamoto, K. (1995). TOEFL technical report: Estimating the effects of test length and test time on parameter estimation using the hybrid model (Report No. ETS-RR-95-2; TOEFL-TR-10). Educational Testing Service. https://files.eric.ed.gov/fulltext/ED395035.pdf
Yen, S. J. & Leah, W. (2007). Multidimensional IRT models for composite scores [Paper presentation]. 2007 Annual Meeting of the National Council on Measurement in Education, Chicago, IL, United States.

APPENDIX A: STUDENT ENROLLMENT IN SCIENCE COURSES BY ETHNICITY

The following bubble graphs showcase student enrollment in U.S. high school science courses based on data from the Civil Rights Data Collection (CRDC, 2023) collected by the Office for Civil Rights (OCR). The data are from the 2020-21 school year, which was when OCR was able to restart data collection after a delay due to COVID-19. Student data are from all school districts and public schools, "as well as long-term secure juvenile justice facilities, charter schools, alternative schools, and special education schools that focus primarily on serving the educational needs of students with disabilities under IDEA or section 504 of the Rehabilitation Act (CRDC, 2023)." Administration of the CRDC occurs every two years in the 50 states, Washington, D.C., and the Commonwealth of Puerto Rico (CRDC, 2023). Enrollment was less than 1% for the American Indian or Alaska Native and Native Hawaiian or Other Pacific Islander student populations, so those student populations are not depicted in the three bubble charts. The majority of students enrolled in high school physics (Figure 35), biology (Figure 36), and chemistry (Figure 37) courses are of White and Hispanic or Latino (of any race) ethnicities.

Figure 35
U.S. High School Physics Enrollment

Figure 36
U.S. High School Biology Enrollment

Figure 37
U.S. High School Chemistry Enrollment

Note. For Figures 35-37, from "Data on Equal Access to Education," 2023, Civil Rights Data Collection. Copyright 2015 by Office for Civil Rights, U.S. Department of Education. https://ocrdata.ed.gov/
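To illustrate how bubble charts of this kind can be generated, the brief R sketch below draws one enrollment bubble chart with ggplot2, one of the R packages cited in the references. It is a minimal sketch only: the course label, the ethnicity categories shown, and the enrollment counts are hypothetical placeholders rather than CRDC values.

# Illustrative sketch only: bubble chart of course enrollment by ethnicity.
# The counts below are hypothetical placeholders, not CRDC data.
library(ggplot2)

enrollment <- data.frame(
  ethnicity = c("White", "Hispanic or Latino", "Black or African American",
                "Asian", "Two or More Races"),
  students  = c(520000, 310000, 150000, 120000, 60000)  # hypothetical counts
)

ggplot(enrollment, aes(x = ethnicity, y = "Physics", size = students)) +
  geom_point(alpha = 0.6, colour = "steelblue") +
  scale_size_area(max_size = 25) +   # bubble area proportional to the count
  labs(title = "U.S. High School Physics Enrollment (illustrative data)",
       x = NULL, y = NULL, size = "Students") +
  theme_minimal()

A comparable interactive version could be produced with the plotly package (Sievert, 2020), which is also cited in the references.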
APPENDIX B: 2015 PISA AVERAGE SCORES FOR SCIENCE

Table 13 provides the 2015 PISA mean scores on the science literacy scale for all participating education systems, with the U.S. data highlighted in blue. Also included are the standard errors (SE). With regard to the U.S., it is important to consider that only two states and one territory were sampled individually to obtain a separate mean score for that state or territory (OECD, 2016a). Figure 38 shows the relative stability of U.S. science mean scores over time from 2006 to 2018 (OECD, 2016a).

Table 13
2015 PISA Country Rankings by Average Score in Science

Education System: Average Score (SE)
OECD average: 493 (0.4)
Singapore: 556 (1.2)
Japan: 538 (3.0)
Estonia: 534 (2.1)
Chinese Taipei: 532 (2.7)
Finland: 531 (2.4)
Macau (China): 529 (1.1)
Canada: 528 (2.1)
Vietnam: 525 (3.9)
Hong Kong (China): 523 (2.5)
B-S-J-G (China): 518 (4.6)
Korea, Republic of: 516 (3.1)
New Zealand: 513 (2.4)
Slovenia: 513 (1.3)
Australia: 510 (1.5)
United Kingdom: 509 (2.6)
Germany: 509 (2.7)
Netherlands: 509 (2.3)
Switzerland: 506 (2.9)
Ireland: 503 (2.4)
Belgium: 502 (2.3)
Denmark: 502 (2.4)
Poland: 501 (2.5)
Portugal: 501 (2.4)
Norway: 498 (2.3)
United States: 496 (3.2)
Austria: 495 (2.4)
France: 495 (2.1)
Sweden: 493 (3.6)
Czech Republic: 493 (2.3)
Spain: 493 (2.1)
Latvia: 490 (1.6)
Russian Federation: 487 (2.9)
Luxembourg: 483 (1.1)
Italy: 481 (2.5)
Hungary: 477 (2.4)
Lithuania: 475 (2.7)
Croatia: 475 (2.5)
Buenos Aires (Argentina): 475 (6.3)
Iceland: 473 (1.7)
Israel: 467 (3.4)
Malta: 465 (1.6)
Slovak Republic: 461 (2.6)
Greece: 455 (3.9)
Chile: 447 (2.4)
Bulgaria: 446 (4.4)
United Arab Emirates: 437 (2.4)
Uruguay: 435 (2.2)
Romania: 435 (3.2)
Cyprus: 433 (1.4)
Moldova, Republic of: 428 (2.0)
Albania: 427 (3.3)
Turkey: 425 (3.9)
Trinidad and Tobago: 425 (1.4)
Thailand: 421 (2.8)
Costa Rica: 420 (2.1)
Qatar: 418 (1.0)
Colombia: 416 (2.4)
Mexico: 416 (2.1)
Montenegro, Republic of: 411 (1.0)
Georgia: 411 (2.4)
Jordan: 409 (2.7)
Indonesia: 403 (2.6)
Brazil: 401 (2.3)
Peru: 397 (2.4)
Lebanon: 386 (3.4)
Tunisia: 386 (2.1)
Macedonia, Republic of: 384 (1.2)
Kosovo: 378 (1.7)
Algeria: 376 (2.6)
Dominican Republic: 332 (2.6)

U.S. States and Territories
Massachusetts: 529 (6.6)
North Carolina: 502 (4.9)
Puerto Rico: 403 (6.1)

Note. Adapted from "Science literacy: Average scores," n.d., National Center for Education Statistics. Copyright 2015 by OECD. https://nces.ed.gov/surveys/pisa/pisa2015/pisa2015highlights_3.asp "Average score is higher than U.S. average score at the .05 level of statistical significance (NCES, n.d.-b)." "Average score is lower than U.S. average score at the .05 level of statistical significance (NCES, n.d.-b)." "Education systems are ordered by 2015 average score. The OECD average is the average of the national averages of the OECD member countries, with each country weighted equally. Scores are reported on a scale from 0 to 1,000. All average scores reported as higher or lower than the U.S. average score are different at the .05 level of statistical significance. Italics indicate non-OECD countries and education systems. B-S-J-G (China) refers to the four PISA participating China provinces: Beijing, Shanghai, Jiangsu, and Guangdong. Results for Massachusetts and North Carolina are for public school students only (NCES, n.d.-b)."

While Argentina, Malaysia, and Kazakhstan participated in PISA 2015, Argentina only provided a reliable sample from Buenos Aires, Malaysia was unable to meet response rate standards, and Kazakhstan administered only multiple-choice items, which limited comparison by rank in Argentina's case and prevented comparison by rank in the case of Malaysia and Kazakhstan (OECD, 2018).

Figure 38
U.S. Mean Scores for Science Stable Over Time

Year: 2006, 2009, 2012, 2015, 2018
Mean science score: 489, 502, 497, 496, 502
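To show how the trend in Figure 38 can be redrawn directly from the values reported above, the short R sketch below plots the five U.S. mean science scores by assessment year with ggplot2; no data beyond the year/score pairs listed above are assumed.

# Sketch of a line plot for the Figure 38 trend, using only the scores reported above.
library(ggplot2)

us_science <- data.frame(
  year  = c(2006, 2009, 2012, 2015, 2018),
  score = c(489, 502, 497, 496, 502)
)

ggplot(us_science, aes(x = year, y = score)) +
  geom_line(colour = "steelblue") +
  geom_point(size = 2) +
  scale_x_continuous(breaks = us_science$year) +  # label each assessment year
  labs(title = "U.S. Mean Scores for Science Stable Over Time",
       x = "PISA assessment year", y = "Mean science score") +
  theme_minimal()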
APPENDIX C: 2015 PISA AVERAGE SCORES BY SCIENCE SUBDOMAIN

Table 14 provides the 2015 PISA mean scores for the three science subscales (physical, living, and Earth and space systems) for all participating education systems, with the U.S. data highlighted in blue. Also included are the SE. With regard to the U.S., it is important to consider that only two states and one territory were sampled individually to obtain separate mean subscale scores for that state or territory (OECD, 2016a). Note that the science competency70 subscales (i.e., explain phenomena, evaluate and design inquiry, and interpret data and evidence) are outside the scope of this study, so they are not provided here; they are available at the NCES website below.
https://nces.ed.gov/surveys/pisa/pisa2015/pisa2015highlights_3_2.asp

70 NCES refers to OECD's framework competency subscales as "process subscales" on their website.

Table 14
2015 PISA Country Rankings by Average Score in Science Subdomain

Each subscale below lists Education System: Average Score (SE), ordered by that subscale's average score.

Physical Systems
OECD average: 493 (0.5)
Singapore: 555 (1.6)
Japan: 538 (3.2)
Estonia: 535 (2.3)
Finland: 534 (2.6)
Macau (China): 533 (1.4)
Chinese Taipei: 531 (3.0)
Canada: 527 (2.4)
Hong Kong (China): 523 (2.9)
B-S-J-G (China): 520 (5.3)
Korea, Republic of: 517 (3.6)
New Zealand: 515 (2.7)
Slovenia: 514 (1.6)
Netherlands: 511 (2.6)
Australia: 511 (1.8)
United Kingdom: 509 (2.9)
Denmark: 508 (2.7)
Ireland: 507 (2.8)
Germany: 505 (2.8)
Switzerland: 503 (3.1)
Poland: 503 (2.7)
Norway: 503 (2.5)
Sweden: 500 (3.8)
Belgium: 499 (2.4)
Portugal: 499 (2.7)
Austria: 497 (2.7)
United States: 494 (3.5)
France: 492 (2.4)
Czech Republic: 492 (2.5)
Latvia: 490 (1.7)
Russian Federation: 488 (3.4)
Spain: 487 (2.3)
Hungary: 481 (2.9)
Italy: 479 (2.8)
Luxembourg: 478 (1.4)
Lithuania: 478 (2.8)
Iceland: 472 (1.9)
Croatia: 472 (2.6)
Israel: 469 (3.8)
Slovak Republic: 466 (2.9)
Greece: 452 (4.0)
Bulgaria: 445 (4.4)
Chile: 439 (3.0)
United Arab Emirates: 434 (2.8)
Cyprus: 433 (1.6)
Uruguay: 432 (2.6)
Turkey: 429 (4.3)
Thailand: 423 (3.2)
Costa Rica: 417 (2.4)
Qatar: 415 (1.5)
Colombia: 414 (2.7)
Mexico: 411 (2.2)
Montenegro, Republic of: 407 (1.6)
Brazil: 396 (2.6)
Peru: 389 (2.7)
Tunisia: 379 (2.4)
Dominican Republic: 332 (3.0)

Living Systems
OECD average: 492 (0.5)
Singapore: 558 (1.4)
Japan: 538 (3.2)
Chinese Taipei: 532 (2.7)
Estonia: 532 (2.1)
Canada: 528 (2.4)
Finland: 527 (2.5)
Macau (China): 524 (1.4)
Hong Kong (China): 523 (2.7)
B-S-J-G (China): 517 (4.5)
New Zealand: 512 (2.8)
Slovenia: 512 (1.6)
Korea, Republic of: 511 (3.2)
Australia: 510 (1.8)
Germany: 509 (2.9)
United Kingdom: 509 (2.6)
Switzerland: 506 (3.2)
Netherlands: 503 (2.4)
Belgium: 503 (2.4)
Portugal: 503 (2.5)
Poland: 501 (2.8)
Ireland: 500 (2.5)
United States: 498 (3.4)
Denmark: 496 (2.6)
France: 496 (2.3)
Norway: 494 (2.5)
Spain: 493 (2.3)
Czech Republic: 493 (2.4)
Austria: 492 (2.6)
Latvia: 489 (1.7)
Sweden: 488 (3.7)
Luxembourg: 485 (1.2)
Russian Federation: 483 (2.8)
Italy: 479 (2.7)
Croatia: 476 (2.6)
Iceland: 476 (2.0)
Lithuania: 476 (2.7)
Hungary: 473 (2.6)
Israel: 469 (3.5)
Slovak Republic: 458 (2.8)
Greece: 456 (4.0)
Chile: 452 (2.7)
Bulgaria: 443 (4.5)
Uruguay: 438 (2.5)
United Arab Emirates: 438 (2.6)
Cyprus: 433 (1.5)
Turkey: 424 (3.9)
Qatar: 423 (1.1)
Thailand: 422 (3.2)
Costa Rica: 420 (2.4)
Colombia: 419 (2.5)
Mexico: 415 (2.4)
Montenegro, Republic of: 413 (1.3)
Brazil: 404 (2.6)
Peru: 402 (2.7)
Tunisia: 390 (2.4)
Dominican Republic: 332 (2.8)

Earth and Space Systems
OECD average: 494 (0.5)
Singapore: 554 (1.6)
Japan: 541 (3.3)
Estonia: 539 (2.3)
Finland: 534 (3.0)
Chinese Taipei: 534 (3.1)
Macau (China): 533 (1.2)
Canada: 529 (2.5)
Hong Kong (China): 523 (2.5)
Korea, Republic of: 521 (3.3)
B-S-J-G (China): 516 (4.9)
Slovenia: 514 (1.8)
New Zealand: 513 (2.7)
Netherlands: 513 (2.8)
Germany: 512 (2.9)
United Kingdom: 510 (2.8)
Australia: 509 (2.1)
Switzerland: 508 (3.1)
Denmark: 505 (2.7)
Belgium: 503 (2.6)
Ireland: 502 (2.6)
Poland: 501 (2.8)
Portugal: 500 (2.9)
Norway: 499 (2.6)
Austria: 497 (2.9)
Spain: 496 (2.3)
United States: 496 (3.4)
France: 496 (2.5)
Sweden: 495 (4.1)
Czech Republic: 493 (2.6)
Latvia: 493 (1.9)
Russian Federation: 489 (3.3)
Italy: 485 (2.7)
Luxembourg: 483 (1.6)
Croatia: 477 (2.7)
Hungary: 477 (2.8)
Lithuania: 471 (3.0)
Iceland: 469 (1.9)
Slovak Republic: 458 (2.8)
Israel: 457 (3.8)
Greece: 453 (4.3)
Bulgaria: 448 (4.8)
Chile: 446 (2.5)
United Arab Emirates: 435 (2.8)
Uruguay: 434 (2.6)
Cyprus: 430 (1.6)
Turkey: 421 (4.3)
Mexico: 419 (2.4)
Costa Rica: 418 (2.4)
Thailand: 416 (3.2)
Colombia: 411 (2.7)
Montenegro, Republic of: 410 (2.0)
Qatar: 409 (1.2)
Brazil: 395 (3.1)
Peru: 393 (3.1)
Tunisia: 387 (3.4)
Dominican Republic: 324 (3.4)

The following education systems administered paper-based trend items and have no scores on any of the three science subscales (―/†): Buenos Aires (Argentina), Romania, Jordan, Vietnam, Georgia, Albania, Trinidad and Tobago, Republic of Macedonia, Algeria, Indonesia, Malta, Lebanon, Kosovo, and Republic of Moldova.

U.S. States and Territories
Massachusetts: Physical Systems 526 (6.7); Living Systems 533 (6.9); Earth and Space Systems 528 (6.6)
North Carolina: Physical Systems 501 (5.2); Living Systems 503 (5.4); Earth and Space Systems 502 (5.0)
Puerto Rico: ― †
Note. Adapted from "Science literacy: Average scores," n.d., National Center for Education Statistics. Copyright 2015 by OECD. https://nces.ed.gov/surveys/pisa/pisa2015/pisa2015highlights_3.asp "Average score is higher than U.S. average score at the .05 level of statistical significance (NCES, n.d.-b)." "Average score is lower than U.S. average score at the .05 level of statistical significance (NCES, n.d.-b)." ― "Not available (NCES, n.d.-b)." † "Not applicable (NCES, n.d.-b)." "Education systems are ordered by 2015 average subscale score. The OECD average is the average of the national averages of the OECD member countries, with each country weighted equally. Scores are reported on a scale from 0 to 1,000. Albania, Algeria, Buenos Aires (Argentina), Georgia, Indonesia, Jordan, Kosovo, Lebanon, Malta, Republic of Macedonia, Republic of Moldova, Puerto Rico, Romania, Trinidad and Tobago, and Vietnam administered paper-based trend items and have no scores for science subscales. Italics indicate non-OECD countries and education systems. B-S-J-G (China) refers to the four PISA participating China provinces: Beijing, Shanghai, Jiangsu, and Guangdong. Results for Massachusetts and North Carolina are for public school students only (NCES, n.d.-b)."

While Argentina, Malaysia, and Kazakhstan participated in PISA 2015, Argentina only provided a reliable sample from Buenos Aires, Malaysia was unable to meet response rate standards, and Kazakhstan administered only multiple-choice items, which limited comparison by rank in Argentina's case and prevented comparison by rank in the case of Malaysia and Kazakhstan (OECD, 2018).

APPENDIX D: LITERATURE CONNECTIONS

Figure 39 details literature connections to the 2012 Li et al. article, which has a methodology similar to that of this study. Note that Figure 39 is hyperlinked to an interactive, larger version of the graphic, which was generated with https://www.connectedpapers.com/.

Figure 39
Connections to the 2012 Li Article

APPENDIX E: LITERATURE REVIEW MATRIX

Table 15 provides a detailed account of the literature review. The table is organized alphabetically, with notes on various aspects of each piece of literature. The measurement type is color coded for ease of use (see the key below). Also included are any barriers to using MIRT models to describe student performance in the science content subdomains. The big ideas driving the inclusion of the literature in this dissertation are summarized in the final column.

Color Key: Mixed Methods; Qualitative; Quantitative; Included in References

Table 15
Results of Literature Review

Columns: Author/s or Editor/s or Abbreviated Title (Date); Student Demographics; Content Domain; Reference Type; Measurement Type; Barriers to Using MIRT Model; Big Idea/s

AERA (2014); NA; NA; Book; Quantitative; NA; Education testing standards
Aktürk et al. (2017); Turkey, Early Childhood; STEM; Article; Qualitative; NA; Example of curriculum document analysis
Armstrong (2021); NA; NA; Article; Qualitative; NA; Document analysis; grounded theory; epistemology
Athalonz (2023); NA; Life and Physical Sciences; Blog Post; NA; NA; Little curriculum overlap between life and physical science
Atkinsson (2010); NA; NA; Online Article; Learning Theory; NA; Collaboration as part of constructivism
Ayala (2022) NA NA Book Quantitative Sources of indeterminacy (pg.
408) Complete overview of IRT Difficult to justify to test MIRT in large-scale assessment, Baghaei (2012) Iran, HS English Article Quantitative takers why scores on compares between- and within-different dimensions item multidimensionality (see depend on each other Figure 1) – hold for discussion Baskarada & Epistemology and philosophical Koronios (2018) NA NA Article Mixed Methods NA considerations for mixed methods research Berenzer & Global, 15-year- Math, Reading, PISA, PIRLS, Use of scaling and IRT in large- Adams (2017) olds, 4th graders Science Book Quantitative, NA scale assessments; IRT model large-scale choice Beribisky & Hancock (2023) NA NA Article Quantitative NA RMSEA comparison Binkley & Ma U.S., HS Advanced (AP) Newspaper Inequity in advanced placement (2023) classes Articles NA NA courses by student ethnicity (Black and Latino especially) Boon et al. (2022) NA Science Article Epistemology NA Overview of constructivist approach to science education Document analysis as a Bowen (2009) NA NA Article Qualitative NA research method – use in methods section Brandt (2015) Global, 15-year- PISA, NAEP, KEY! Large-scale assessment’s olds, U.S. NA Dissertation Quantitative, Reliability of comprehensive scores unidimensional approach when large-scale, really multidimensional; Briggs & Wilson Intro to multidimensional Rasch (2003) NA NA Article Quantitative NA models; art and science of measurement Broesch et al. (2020) NA NA Article Mixed Methods NA Research design that takes culture into consideration Brooks-Bartlett NA NA Online (2017) Article Quantitative NA Introduction to probability 172 Author/s or Editor/s or Student Content Reference Measurement Barriers to Using MIRT Abbreviated Title Demographics Domain Type Type Model Big Idea/s (Date) Brooks-Bartlett Online (2018) NA NA Article Quantitative NA Maximum likelihood estimation Camilleri (2023) NA NA News Article Quantitative NA Historical timeline of IRT Carnoy et al. U.S. Math, Reading, PISA, NAEP, State comparisons more useful (2015) Science Briefing Quantitative NA than U.S. to international comparisons Caro & Biecek PISA, TIMSS, (2017) International NA Article PIRLS, NA An R package for analyzing Quantitative large-scale assessment data Figure 1 is a framework/theory/model view Carpiano & Daley for philosophy that will work for (2006) NA NA Article Epistemology NA qualitative review of content framework too – use in methods section; glossary for epistemology CFPB (2019) Global Financial Report PISA, Mixed NA Overview of PISA financial Literacy Methods literacy results; Claesgens et al. U.S., HS and Uses IRT to match scores to a (2008) University Chemistry Article Mixed Methods NA framework (pg. 8 for discussion chapter) College Board (2009) U.S. Science Book NA NA Science standards for college success Columbia (2023) NA NA Webpage Quantitative NA Item parameters and IRT Status of U.S. science Corcoran et al. education; role of science (2009) U.S., K-12 Science Report NA NA learning progressions (for possible use in my discussion chapter) CPET (n.d.) NA NA Webpage NA NA Equity in assessment CRCD (2023) U.S., K-12 Math, Science, AP Webpage Quantitative NA Civil rights data for education in U.S. K-12 173 Author/s or Editor/s or Student Content Reference Measurement Barriers to Using MIRT Abbreviated Title Demographics Domain Type Type Model Big Idea/s (Date) Creswell (2015) NA NA Article Mixed Methods NA Approaches to and handbook for mixed methods Crisan et al. 
NA NA Article Quantitative, NA Consequences of (2017) Simulation unidimensional IRT model misfit Csapó (2017) NA NA Online Book NA NA Overview of 21st Century skills Curran et al. (1996) NA NA Article Quantitative NA Skew and kurtosis acceptable ranges MIRT was not used in Using multiple-group Rasch Davier et al. Global, 15-year- Science, Math, PISA, original analysis so cannot model rather than Article be used in newer research unidimensional IRT for linking in (2019) olds and Reading Quantitative in order to preserve trend PISA to generate cross-country and prior conclusions comparisons (for discussion chapter) DeMars (2016) NA NA Article Quantitative, Item difficulties can vary Key! Non-compensatory MIRT Simulation by dimension equation (for methods and results chapters) Dorans & Kingston (1985) NA NA Article Quantitative NA Violating unidimensionality Duran (2014) NA NA PowerPoint Separate multi into KEY! IRT and MIRT advantages Presentation Quantitative unidimensional subtests overview Duschl et al. Science learning and learning (2007) U.S., K-8 Science Book NA NA progressions (for possible use in my discussion chapter) El Masri & UK, France, PISA, Model fit analysis in relation to Andrich (2020) Jordan, 15-year- Science Article Quantitative, NA invariance and validity with olds IRT regards to DIF Discussion on how concept Enger & Yager subdomains of science should (2009) U.S., K-12 Science Book NA NA be taught; change to inquiry learning; background on other learning domains EPI (2015) Global, 15-year- Science, PISA, Mixed 2012 PISA, NAEP, and TIMSS olds Reading, Math Report Methods NA results comparison between 174 Author/s or Editor/s or Student Content Reference Measurement Barriers to Using MIRT Abbreviated Title Demographics Domain Type Type Model Big Idea/s (Date) U.S. states rather than international Assessment design uses standards to measure desired Ercikan & Oliveri 21stNA Century Article NA NA traits; ECD; cognitive evidence (2016) Skills needed for complex constructs rather than just expert reviews of items Erzberger & Kelle (2003) NA NA Book Mixed Methods NA Mixed methods handbook Fisher (2023) Unspecified Adverse Children Childhood Dissertation Mixed Methods NA Example of methods section Experiences split into two plans Fu (2016) U.S., Grade 8 Algebra Article Quantitative Practical significance vs. MIRT models with covariates statistical significance applied to longitudinal test data Framework method to Gale et al. (2013) NA NA Article Qualitative NA compare/contrast qualitative data PPIC report on HS science course requirements for CA Gao et al. (2019) U.S., HS to Science Report Mixed Methods NA university admission – save for College discussion as mentions racial disparity in meeting new requirements Garnier-Villarreal et al. (2021) NA NA Article Quantitative The number of factors to Estimation limits of between- be evaluated item MIRT models GEOstata (2016) Global Science Webpage PISA NA Map of PISA 2015 country/economy participants Latent diversity’s impact on Godwin (2017) College Engineering Conference Article Mixed Methods NA creative solutions to engineering problems 175 Author/s or Editor/s or Student Content Reference Measurement Barriers to Using MIRT Abbreviated Title Demographics Domain Type Type Model Big Idea/s (Date) Gomes et al. 
Brazil, Student Math Article PISA, Why scores may be impacted by (2020) age varied Quantitative NA student age due to taking more classes (hold for discussion) Greene & Pragmatic paradigm in mixed Caracelli (1997) NA NA Article Mixed Methods NA methods design using different epistemologies Greenwood NA NA Blog Post Learning (2020) Theory NA Social constructivism overview Griffin & McGaw (2012) NA NA Book Quantitative NA Assessment of 21 st century skills Multidimensionality does Multidimension model Haksing (2010) NA NA Article Quantitative not mean MIRT has to be indistinguishable from used unidimensional model Hanushek et al. Global, 15-year- How cognitive growth as (2008) olds Math Article PISA, Quantitative NA measured by PISA impacts the U.S. economy Harris (n.d.) NA NA Article Quantitative NA Compares equations for UIRT models Harrison et al. U.S., HS, MS, Nature of NGSS, Mixed Multidimensional nature of (2015) Hawaii Science Article Methods NA nature of science learning; MRCML model Hartig & Hohler (2009) NA NA Article Quantitative NA KEY! SEM models for MIRT: between and within items; Hebel et al. (2017) Global, 15-year- PISA, Mixed olds, France Science Article Methods NA Difficulty of PISA science items Hoover et al. Secondary Earth and Space Report NA NA Statistics on school (2018) Sciences (ESS) requirements for ESS education How item framing may impact Hsu et al. (2023) Undergraduate Biology Article Quasi-random, NA student performance (for Mixed Methods possible use in my discussion chapter) IES (n.d.) U.S., Grades 4 and 8 Science, Math Webpage TIMSS, Overview of TIMSS science Quantitative NA assessment 176 Author/s or Editor/s or Student Content Reference Measurement Barriers to Using MIRT Abbreviated Title Demographics Domain Type Type Model Big Idea/s (Date) Iliescu (2021) NA NA Article Validity NA Social consequence of testing; Immekus (2019) Undergraduate Engineering Article Quantitative, NA KEY! Overview of IRT (2PL) and MIRT, CFA MIRT Multidimensional MRCMLM Intasoi et al. 2020 Thailand, Grade 7 Science Article Quantitative NA model works better to measure scientific competency framework Irribarra & Social Sciences, Arneson (2023) NA Education Article Quantitative NA Defining dimensional structure Compensatory vs. non- Issayeva (2022) NA NA Webpage Quantitative NA compensatory MIRT models; MLE; guessing parameter Jerrim (2016) NA NA PowerPoint PISA, NA Types of weighting; handling Presentation Quantitative missing data PISA, TIMSS, Jerrim (2023) Global Science, Reading, Math Article PIRLS, NA Interest in large-scale Quantitative assessments by country Sweden, Jerrim et al. Germany, Science, Article PISA, NA Change of mode to computer-(2018) Ireland, 15-year- Reading, Math Quantitative based assessment olds China and Ji (2023) Canada, Self-regulation Thesis Qualitative NA Document analysis Kindergarten methodology Johnson (2019) U.S., HS Integrated Web article NA NA Parent and student concerns Science over integrated science Johnson & Onwuegbuzie NA NA Article Mixed Methods NA Pragmatic mixed methods – (2004) equal status design Jolliffe & Cadima (2016) NA NA Article Quantitative NA A review of PCA Kaldaras et al. 
NA Science Article NGSS, Validating latent multi-factor (2021) Quantitative, NA science constructs 177 Author/s or Editor/s or Student Content Reference Measurement Barriers to Using MIRT Abbreviated Title Demographics Domain Type Type Model Big Idea/s (Date) EFA, CFA, Invariance Analysis Kandanaarachchi IRT used to evaluate machine & Smith-Miles NA NA Article Quantitative NA learning algorithm (future (2023) research section) Bayesian probabilistic Kaplan & Huang (2021) U.S. Math, Reading Article NAEP, NA forecasting view; combining Quantitative NSLP variable as a proxy for socio-economic status Student perceptions of different Kapucu (2021) Turkey, 9th grade Science Article Mixed Methods NA science subdomains differentiate from one another Kelley & Knowles (2016) International STEM Article NA NA Integrated STEM education Kim & Wilson Polytomous item explanatory (2020) NA NA Article Quantitative NA item response theory models via MGLMM Overview of information criteria Kim et al. (2019) NA NA Article Quantitative NA usage when comparing model fit Longer tests and larger Kose & Demirtasli sample sizes are needed Comparing sample size and test th (2012) Turkey, 8 grade Language Article Quantitative to increase model length for UIRT and MIRT sensitivity and decrease models error Krutsch & U.S. Department of Labor chart Roderick (2022) NA STEM Blog Post Quantitative NA on STEM job growth Kuo & Sheng NA NA Article Quantitative, Estimation methods for multi- (2016) Simulation NA unidimensional graded response 178 Author/s or Editor/s or Student Content Reference Measurement Barriers to Using MIRT Abbreviated Title Demographics Domain Type Type Model Big Idea/s (Date) Can be similar to other models such as CTT or Lang & Tay (2021) NA NA Article IRT, MIRT, CFA; unidimensional KEY! Overview of IRT models; Quantitative models are more familiar includes R code; history of MIRT and easily interpreted (for development discussion chapter) Lau (2009) Global Scientific Article PISA, NA Review of 2006 framework finds Literacy Qualitative construct validity issues Learn PILA (n.d.) Global Innovative Webpage Quantitative NA Platform for Innovative Learning Assessments Investigation across different Lee & Tsai (2012) College Biology, Physics Article Qualitative NA domains of science is rare for student epistemological beliefs Michigan Gr. 5, Unidimensionality and KEY! MIRT model used Li et al. (2012) U.S., K-12 Science Article Quantitative, local item dependence successfully for large-scale state EFA, CFA assumptions assessment in science Lin (1998) NA NA Article Qualitative NA Positivist vs. Interpretivist approaches Lips & Moritz (2022) U.S. STEM Report NA NA Government spending on STEM Liu et al. (2022) U.S., Grade 8, Math NA Article NAEP, MIRT, Scalability to big data 2PL IRT most commonly used; Quantitative report RMSE MacLeod & U.S., Memory recall as a Nelson (1984) Undergraduates NA Article Quantitative NA unidimensional construct Mari et al. (2017) NA NA Article Quantitative NA Nature of measurement vs. 
measure Marlowe (1986) NA NA Article Quantitative NA Multidimensionality in social intelligence Masur (2022) NA NA Webpage Quantitative NA IRT models in MIRT R package Maul (2019) NA NA Article NA NA Intersubjectivity of measurement Maxwell & Mittapalli (2010) NA NA Handbook Mixed Methods NA Scientific realism 179 Author/s or Editor/s or Student Content Reference Measurement Barriers to Using MIRT Abbreviated Title Demographics Domain Type Type Model Big Idea/s (Date) Mazzei & Jackson (2024) NA NA Book Qualitative NA Re-animating documents in another form McDonald (1999) NA NA Book CFA, IRT, MIRT, Quantitative NA Overview of multiple quantitative statistics Mcleod (2023) NA NA Online Learning Article Theory NA Constructivism Use vs. interpretation Messick (1989) NA NA Article NA NA inferences in assessment validity Messick (1993) NA NA Article NA NA Consequential validity as an aspect of construct validity Messick (1995) NA NA Article NA NA Construct validity Monseur et al. Global Reading, (2011) Science, Math Article PISA, NA Violation of independence Quantitative assumption by items in a set Moroi (2020) NA NA Article Mixed Methods NA Philosophies of research Student enjoyment of science Mostafa et al. Global, 15-year- Science Report PISA, Mixed NA linked to inquiry teaching; 2015 (2018) olds Methods PISA science test design/scoring/scale Infit item statistic acceptable Müller (2020) NA NA Article Quantitative NA bounds – save for results chapter A framework for science National Research education; chapter 11 discusses Council (2012) U.S., K-12 Science Book NA NA DEI in science education (for possible use in my discussion chapter) NCES (2019) U.S. Math, Science Report PISA NA K-12 course completion statistics NCES (2022) U.S. Science, Math Report NAEP, Quantitative NA Science course completion data NCES (n.d.-a) Global, 15-year-olds, U.S. Science Webpage PISA NA Number of U.S. schools participating in 2015; implies 180 Author/s or Editor/s or Student Content Reference Measurement Barriers to Using MIRT Abbreviated Title Demographics Domain Type Type Model Big Idea/s (Date) students do not have to participate 2015 PISA scores for all participating countries, also NCES (n.d.-b) Global, 15-year- PISA, olds, U.S. Science Webpage Quantitative NA broken down by subdomain; U.S. sampling and data collection methods NGSS (2013) U.S. STEM Webpage NA NA Next Generation Science Standards Niiniluoto ed. et al. (2004) NA NA Book Qualitative NA Overview of different epistemologies and their origins NWEA (2015) NA NA Blog Post Quantitative NA KEY! List of factors impacting use of MIRT Math, Science, OCR (2023a) U.S., K-12 Computer Report Mixed Methods NA Student access to education Science OCR (2023b) U.S., K-12 NA Report Quantitative NA Student enrollment OECD (2016a) Global, 15-year- Science, PISA, Mixed U.S.-specific report on 2015 olds Reading, Math Report Methods NA data Global, 15-year- Science, PISA, Mixed Map of participating countries; OECD (2016b) olds Reading, Math Report Methods NA 2015 results focusing on equity in education OECD (2017a) Global Science Framework PISA NA 2015 combined framework including science Key! 
Technical report for PISA OECD (2017b) Global, 15-year- Science, Report PISA, Mixed NA 2015 [Annex F – technical olds Reading, Math Methods standards; Annex A – item codes and counts] OECD (2018) Global, 15-year- Science, PISA, Mixed olds Reading, Math Report Methods NA Results in focus, data overview OECD (2019) Global Science Framework PISA NA 2018 science framework OECD (2020) Global Science Report PISA NA Strategic vision for 2024 science assessment 181 Author/s or Editor/s or Student Content Reference Measurement Barriers to Using MIRT Abbreviated Title Demographics Domain Type Type Model Big Idea/s (Date) OECD (2023) Global Learning in Digital World Report NA NA Digital assessment framework OECD (n.d.-a) Global, 15-year-olds NA Webpage NA NA Overview of PISA OECD (n.d.-b) Global NA Webpage NA NA Frequently asked questions about PISA OECD (n.d.-c) Global, 15-year- Science, olds Collaborative Report PISA, Mixed Methods NA 2015 released field trial items Problem Solving Osteen (2010) MSW students NA Article Quantitative NA Integrating CFA and MIRT Ostlund et al. NA NA Article Mixed Methods NA Triangulation of findings from (2011) mixed methods research Nigeria, Using integrated science Otarigho & Secondary Integrated Article Qualitative NA curriculum to provide students Oruese (2013) Schools Science with an understanding of how science affects everyday lives Higher error for ability in unidimensional than Park et al. (2019) Belgium, 6- to 8- MIRT, year-olds Number Sense Article Quantitative multidimensional if item MIRT in adaptive learning set is truly systems multidimensional Park et al. (2020) NA NA Blog Post Quantitative NA ICC overview with item parameters Developing transferable Pellegrino & ELA, Math, and knowledge and skills in 21 st Hilton (2012) U.S., K-12 Science Book NA NA century; emphasis on science inquiry (for possible use in my discussion chapter) NRC - Pellegrino Intersection of student learning et al. (2001) U.S., K-12 All Book NA NA and assessment (for possible use in my discussion chapter) Pelz (n.d.) NA NA Webpage Social Science Research NA Takes at least two dimensions to be multidimensional 182 Author/s or Editor/s or Student Content Reference Measurement Barriers to Using MIRT Abbreviated Title Demographics Domain Type Type Model Big Idea/s (Date) Pierson et al. Learning progressions; tension (2019) U.S., K-12 Science Article NGSS NA over teaching fact memorization vs. inquiry Reading, Math, Science, Collaborative PISA USA (2015) U.S., 15-year-olds Problem Brochure NA NA Students volunteer to take PISA Solving, if randomly selected by OECD. Financial Literacy KEY! Only 17% for science is independent of the general Pokropek (2022) Global, 15-year- Science, Math, Article PISA, ability factor and can be olds Reading Quantitative NA attributed to specific science ability factor (for possible use in my discussion chapter) Polites et al. 
Conceptualizing (2012) NA NA Article Mixed Methods dimension relationships Defining multidimensionality using theory Reckase (1985) NA NA Article Quantitative NA Multidimensional item difficulty An item requiring two Reckase (1989) NA NA Article Quantitative cognitive skills to solve Application of MIRT – hold for may still be discussion unidimensional Reckase (1990) NA NA Paper Presentation Quantitative NA Defining dimensionality Multiple items can be Reckase (1997) NA NA Article Quantitative selected based on MIRT, Future directions for MIRT; but modeled early MIRT development unidimensionally Reckase (2009) NA NA Book Quantitative Small number of items on Psychological and educational test context for MIRT 183 Author/s or Editor/s or Student Content Reference Measurement Barriers to Using MIRT Abbreviated Title Demographics Domain Type Type Model Big Idea/s (Date) Reed & Wolfson Learning progressions assume (2021) U.S., HS, College Chemistry Article Qualitative NA linear learning path; LPs not used by all Massachusetts scores at a high Reis (2016) U.S., Reading, Newspaper Massachusetts Science, Math Article PISA NA level similar to country leaders of PISA 2015 Newspaper Using test scores to limit bias Richman (2023) K-12, Texas Math Article NA NA and determine which math class a student can take Richman & Crain Newspaper Teacher shortages lead to (2022) K-12, U.S. NA Article NA NA accepting teachers with less training Ruiz-Primo & Li Global, 15-year- PISA 2006 and (2015) olds Science 2009, NA How item context affects Quantitative student performance Saunders et al. Saturation evaluation and (2018) NA NA Article Qualitative NA grounded theory Scalise (2017a) U.S., MS Science Article Quantitative NA MIRT model for tech-enhanced items Scalise (2017b) NA Neuroscience Book NA NA Describes how students learn Scalise & Clarke- U.S., MS Science Inquiry Article Quantitative, KEY! Compares the fit of Midura (2018) MIRT NA different IRT models to the data; Bayes net for process data Scalise & Gifford Overview of item types and (2006) NA NA Article NA Can item type affect multidimensionality? constraints (for possible use in my discussion chapter) Scalise & Wilson 21stNA Century Multidimensionality in (2011) Learning Article Quantitative NA constructs Scalise et al. Literature NA STEM Article Review and NA Digital accommodations for (2018) Analysis students Scalise et al. (2021) NA NA Article Quantitative NA Learning analytics; figure 1 184 Author/s or Editor/s or Student Content Reference Measurement Barriers to Using MIRT Abbreviated Title Demographics Domain Type Type Model Big Idea/s (Date) Siegel (2006) NA NA Article Qualitative NA Epistemological diversity in education research Socha (n.d.) NA NA Unpublished article Quantitative Large sample size MIRT assumptions; ICS analog to ICC Comparing MIRT to UIRT when Spencer (2004) NA NA Dissertation Quantitative NA retrieving item difficulties and differentiation Stehle & Peters- Burton (2019) U.S., HS STEM Article Quantitative NA U.S. students underperforming in science Strauss (2019) Global, 15-year-olds NA Newspaper Article PISA NA Negative aspects of PISA scores Taut & Palacios Global, 15-year- Intended and unintended (2016) olds NA Article PISA NA interpretations and uses of PISA results Thomson et al. Australia Scientific Report PISA NA Overview of scientific literacy as (2013) Literacy assessed by PISA No meaningful change in science scores; U.S. 2nd generation immigrants worst Tucker (2016) U.S. 
Reading, Math, Newspaper Science article PISA NA educated; math teachers lack appropriate training; poor recruitment strategy of teachers Tulodziecki (2012) NA NA Article Qualitative NA Epistemic equivalence Newspaper Detracking students to increase Turcotte (2023) HS Math article NA NA equity led to increased math performance Tykoski (2017) U.S., HS ESS Blog Post NA NA Lack of ESS in IB and AP courses Japan, higher, Uesaka et al. middle, and Self-regulated Usefulness of IRT to classroom (2022) lower-ranked learning Article Quantitative NA instruction universities 185 Author/s or Editor/s or Student Content Reference Measurement Barriers to Using MIRT Abbreviated Title Demographics Domain Type Type Model Big Idea/s (Date) Ulitzsch & Nestler (2022) NA NA Article Quantitative NA Bayesian IRT model Description of flow of course delivery in the sciences, i.e., biology to chemistry to physics, Vazquez (2006) U.S., HS Biology, Physics, Chemistry Article Qualitative NA and if it should change to a physics first approach – save for discussion on inequity in course access Purposes for and guidelines of Venkatesh et al. Information mixed methods research; (2013) NA Systems Article Mixed Methods NA developing meta-inferences; validity of mixed methods research KEY! Extension of 2013 article Venkatesh et al. NA NA Article Mixed Methods NA with variations of mixed (2016) methods research; epistemology Voogt & Roblin International 21 st Century Document selection; screening (2012) Competencies Article Qualitative NA framework for sub-themes (see Table 3) Document analysis; set Wach (2013) NA NA Article Qualitative NA inclusion criteria; coding; validity Reading, Math, Science, U.S., Global, 15- Collaborative Walker (2016) Problem News Article PISA, Mixed NA U.S. scores remain in middle of year-olds Solving, (web) Methods other country averages for 2015 Financial Literacy Wang (2021) NA NA Article Quantitative Costs of large data sets Unidimensionality assumption; history of MIRT 186 Author/s or Editor/s or Student Content Reference Measurement Barriers to Using MIRT Abbreviated Title Demographics Domain Type Type Model Big Idea/s (Date) Items on multiple dimensions leads to Wang & Nydick NA NA Article Quantitative, increased variability with regards to difficulty Algorithms for non-(2015) Simulation parameters on each compensatory MIRT models dimension, which results in decreased information Chapter 3 focuses on history of Welch (1977) U.S., K-12 Science Book NA NA integration for teaching science subdomains Wess et al. (2021) NA NA Book Quantitative NA Ch. 4: Test quality with regards to types of validity Western Definitions and key Governors NA NA Blog Post Epistemology NA characteristics/principles of University (2020) social constructivism Wilson (2013) NA NA Article Quantitative NA IRT overview; changes from CCT Winarno et al. International Science Article Literature NA Problems with teaching (2020) Review integrated science classes Yen & Leah (2007) K-12 EL Presentation Quantitative, Number of parameters to KEY! Exploratory approach to Paper EFA be estimated MIRT model emphasizes finding the best fitting model You et al. 
(2020); Global, 15-year-olds; Science; Article; PISA, Quantitative; NA; School characteristics impact scientific literacy in students
Zakharov (2016); NA; NA; Article; Quantitative; NA; Cluster analyses should be evaluated for reliability and validity – save for discussion
Zhao & Hambleton (2017); NA; NA; Article; Quantitative, Simulation; NA; Consequences of IRT model misfit

APPENDIX F: PISA 2015 SCIENCE FRAMEWORK71

71 From chapter 2 in PISA 2015 Assessment and Analytical Framework: Science, Reading, Mathematic, Financial Literacy and Collaborative Problem Solving, revised edition (OECD, 2017).

[The framework chapter is reproduced in full on the pages of this appendix in the original document.]

APPENDIX G: DISSERTATION TIMELINE