A CASE STUDY EVALUATING THE FIDELITY OF IMPLEMENTATION OF CONSTRUCTING MEANING TRAINING AT A LOCAL MIDDLE SCHOOL by BRIAN F. SICA A DISSERTATION Presented to the Department of Educational Methodology, Policy, and Leadership and the Graduate School of the University of Oregon in partial fulfillment of the requirements for the degree of Doctor of Education March 2016 ii DISSERTATION APPROVAL PAGE Student: Brian F. Sica Title: A Case Study Evaluating the Fidelity of Implementation of Constructing Meaning Training at a Local Middle School This dissertation has been accepted and approved in partial fulfillment of the requirements for the Doctor of Education degree in the Department of Educational Methodology, Policy, and Leadership by: Keith Zvoch Chairperson Joanna Smith Core Member Yvonne Curtis Core Member Audrey Lucero Institutional Representative and Scott L. Pratt Dean of the Graduate School Original approval signatures are on file with the University of Oregon Graduate School. Degree awarded March 2016 iii © 2016 Brian F. Sica iv DISSERTATION ABSTRACT Brian F. Sica Doctor of Education Department of Educational Methodology, Policy, and Leadership March 2016 Title: A Case Study Evaluating the Fidelity of Implementation of Constructing Meaning Training at a Local Middle School The purpose of this study was to understand the implementation of practices derived from Constructing Meaning (CM) training by teachers (n = 30) at a local middle school. The study took place in two phases. Phase one was primarily quantitative. Implementation fidelity was measured for each critical component of CM training, and component and aggregate indices were constructed and analyzed. The second phase, primarily qualitative, investigated teachers’ perceptions of the conditions that favored or hindered implementation. Results indicated that certain components were implemented to a greater degree than others and that the overall implementation fidelity was approximately 50%. Key conditions for implementation were identified as collaboration (both with peers and CM trainers), sufficient time, and clear connections to other programs. v CURRICULUM VITAE NAME OF AUTHOR: Brian F. Sica GRADUATE AND UNDERGRADUATE SCHOOLS ATTENDED University of Oregon, Eugene Montana State University, Bozeman University of Idaho, Moscow DEGREES AWARDED Doctor of Education, Educational Methodology, Policy, and Leadership, 2016, University of Oregon Master of Science, Science Education, 2006, Montana State University Bachelor of Science, Chemistry Teaching, 2001, University of Idaho AREAS OF SPECIAL INTEREST Minority and underrepresented student education Implementation of educational initiatives, particularly at the classroom level Adult leadership PROFESSIONAL EXPERIENCE Principal, Beaverton School District’s Health and Science School and School of Science and Technology, 2014–present Assistant Principal, Hillsboro School District’s Century and Hillsboro High Schools, 2010–2014 Chemistry and Physics Instructor, Hillsboro School District’s Hillsboro High School, 2006–2010 Chemistry and Physics Instructor, Shelley School District’s Shelley High School, 2001, 2006 vi GRANTS, AWARDS, AND HONORS CTE Revitalization Grant ($362,000), Project Director, Health and Science School, 2014–2015 United States Department of Education Smaller Learning Communities Grant (1.25M), Project Manager, Hillsboro High School, 2008–2011 PUBLICATIONS Sica, B. (2009, October 31). 
Modifying discussion and assessment techniques to increase student understanding and teacher reflective practices. Retrieved from Action Research Expeditions website: www.arexpeditions.montana.edu. Available March 2010. vii ACKNOWLEDGMENTS I would like to offer my sincere appreciation to Drs. Zvoch, Smith, Lucero, and Curtis for their guidance throughout this project. In addition, a special thanks to the administration, teachers, and students at the middle school site of this project. Also, I would like to thank my cohort members at the University of Oregon; I learned more from you than I ever thought possible. Finally, I would like to thank my wife and son for their unwavering support and sacrifice. viii This project is dedicated to my students, past and present. I think of you often. ix TABLE OF CONTENTS Chapter Page I. II. III. INTRODUCTION ...................................................................................... 1 LITERATURE REVIEW ........................................................................... 6 Defining Fidelity of Implementation ...................................................... 6 Why Study Fidelity of Implementation? ................................................ 8 Fidelity as a Summative Evaluation ................................................... 9 Fidelity as a Formative Evaluation ................................................... 10 How Is Fidelity of Implementation Measured by Evaluators?............. 12 Identifying the Critical Components of the Intervention ................. 14 Organizing the Components ............................................................. 14 Measuring the Critical Components of the Intervention .................. 15 Determining the Reliability and Validity of the Measures ............... 18 Determining Validity of Measures ................................................... 20 The Investigation of a Specific Intervention ........................................ 22 Critical Components of Constructing Meaning ................................ 24 Opportunity to Increase Student Achievement ................................ 30 Using a Qualitative Approach to Fidelity Studies ................................ 31 A Mixed-Methods Approach to the Fidelity of Implementation of Constructing Meaning Training ........................................................... 33 METHODS ............................................................................................... 35 Setting and Participants ........................................................................ 35 Research Design ................................................................................... 38 Phases of Research ........................................................................... 38 Success of Implementation ............................................................... 39 Critical Components of CM Training .............................................. 40 Variation in Implementation by Predictor Variable ......................... 41 Level of Implementation Compared to the Literature ...................... 41 Phase Two, the Investigation of RQ2 ............................................... 42 x Chapter Page IV. Time Element ................................................................................... 44 Data Collection Instruments ................................................................. 45 Observations and Reflections ........................................................... 
45 Teacher Surveys ............................................................................... 46 Interviews ......................................................................................... 47 Focus Groups .................................................................................... 48 Procedures ............................................................................................ 49 Observations ..................................................................................... 49 Teacher Reflections (Self-Evaluations) ............................................ 50 Surveys ............................................................................................. 50 Interviews ......................................................................................... 51 Student Focus Groups ...................................................................... 52 Methods of Analysis ............................................................................. 52 Research Question 1 ......................................................................... 52 Research Question 2 ......................................................................... 53 Threats to Reliability and Validity ....................................................... 54 Reliability ......................................................................................... 54 Construct Validity ............................................................................ 55 Qualitative Validity .......................................................................... 56 Triangulation .................................................................................... 56 Discrepant Information and Negative Cases .................................... 56 Member Checking ............................................................................ 57 RESULTS ................................................................................................. 58 RQ1: Success of Implementation of CM Practices .............................. 58 Results of the Calibration Observations ........................................... 58 Quantitative Analysis of the Calibration Data ................................. 59 Internal Consistency ............................................................................. 59 Cronbach’s Alpha for Each Component .......................................... 59 xi Chapter Page V. Inter-Item Correlation Analysis ........................................................ 60 Cronbach’s Alpha With Items Deleted ............................................ 61 Revisions Due to Reliability Analysis .............................................. 61 Fidelity of Implementation Index by Component ................................ 63 Overall Index of Fidelity .................................................................. 64 Index of Fidelity by Predictor Variable ................................................ 64 Years of Teaching Experience ......................................................... 64 Subject Area Taught ......................................................................... 65 Analysis of Variance by Subject Area Taught ................................. 66 Latency Since Training .................................................................... 68 Teacher Reflection and Survey ............................................................ 
69 Teachers’ Self-Scoring on the Refining Our Practices Rubric, by Component ....................................................................................... 69 RQ2: Conditions That Favor or Hinder Successful Implementation ... 70 Teachers’ Perceptions of CM Training ................................................ 71 Teachers’ Perceptions of Each Critical Component of CM Practices .................................................................................... 73 Open-Ended Survey Questions ........................................................ 73 Teacher Interviews ............................................................................... 75 Student Focus Groups .......................................................................... 75 DISCUSSION .......................................................................................... 79 Discussion of RQ1: Success of Implementation .................................. 79 Success of the Implementation of the Critical Components ............ 79 Contextual Relevance of the Components ....................................... 86 Success of the Overall Implementation ............................................ 87 Success of Implementation by Predictor Variable ............................... 89 Relationship Between Fidelity and Years of Teaching Experience ........................................................................................ 89 Relationship Between Fidelity and Latency Since Training ............ 92 xii Chapter Page Relationship Between Fidelity and Teachers’ Primary Subject Area ..................................................................................... 92 RQ2: Conditions That Influence Implementation ................................ 93 Teachers’ Perceptions of CM Training ............................................ 93 Collaboration .................................................................................... 94 Time as a Resource ........................................................................... 97 Alignment With Other Priorities ........................................................ 100 Alignment With Other Priorities as a Hindrance to Implementation ............................................................................... 100 Limitations of the Study ..................................................................... 101 Psychometric Properties of the Rubric ........................................... 101 Sample Size of the Study ................................................................ 103 Participant Bias ............................................................................... 103 Over-Reporting by Teachers .......................................................... 104 Target of Implementation ............................................................... 106 Implications for Practice .................................................................... 106 Target for Successful Implementation ............................................... 107 Developing Reliable Systems of Observation .................................... 108 Increased Reliability Analysis of the Refining Our Practices Rubric ............................................................................................. 108 Considerations of Time as a Resource ............................................... 109 Connections to Other Priorities .......................................................... 109 Distinguishing Between Training and Intervention ........................... 
110 Implications for Future Research ....................................................... 112 APPENDIX: THE REFINING OUR PRACTICES RUBRIC ..................................113 REFERENCES CITED ..............................................................................................118 xiii LIST OF FIGURES Figure Page 1. A framework for studying program effects that includes measurement of implementation fidelity ........................................................................................11 2. Explanatory sequential design .............................................................................39 3. Implementation of each critical component by subject area ................................67 4. Comparison of scores from observations and reflections ....................................72 xiv LIST OF TABLES Table Page 1. Summary of inter-item correlations within each critical component ...................62 2. Revised Cronbach’s alpha values with specific items deleted .............................63 3. Index of fidelity by critical component ................................................................65 4. Pearson’s correlation between year of teaching experience and fidelity of implementation ....................................................................................................66 5. Index of fidelity of the overall intervention by teachers’ primary subject area ..........................................................................................................67 6. Tukey BSD post hoc test ......................................................................................68 7. Index of fidelity organized by training cohort .....................................................69 8. Index of fidelity based on teacher reflections ......................................................71 9. Paired sample t-test of the observation and reflection indices .............................72 10. Summary of closed-ended survey questions ......................................................74 11. Example responses to open-ended survey questions .........................................76 12. Frequency of reference to emerging themes in teacher interviews ....................77 1 CHAPTER I INTRODUCTION Federal and local pressure to produce measurable increases in student achievement remains a constant focus for schools across the country (Polikoff, McEachin, Wrabel, & Duque, 2013). In the Race to the Top competitive grant program, states and local educational agencies (LEAs) competed for $4.35 billion in federal grants to be used to improve their schools (Race to the Top Act of 2011, 2014). Virtually every aspect of the detailed application criteria was in some way tied to measureable student achievement. Similarly, beginning in 2011, states were able to apply for flexibility waivers to the Elementary and Secondary Education Act, specifically in regard to the student achievement requirements of the 2001 reauthorization known as No Child Left Behind (NCLB; No Child Left Behind, 2001). As of January of 2016, 43 states have approved requests for waivers. Each of these requests were required to include a detailed plan for improving instruction and closing the achievement gap, as measured by standardized test scores (U.S. Department of Education, 2012). 
In efforts designed to meet accountability requirements, many districts have focused on improving curriculum, instruction, and assessment through high quality professional development (Blank & de las Alas, 2009; Darling-Hammond & Wei, 2009). Professional development is usually targeted at an initiative or intervention aimed at a particular curricular area such as literacy or math, or a specific group of students, such as English Language Learners (ELLs). Common professional development initiatives include Professional Learning Communities (PLCs), Positive Behavior Interventions and Supports (PBIS), and programs aimed at increasing the academic English language 2 development of students (Echevarria, Richards-Tutor, Chinn, & Ratleff, 2011). Professional development (PD) can be delivered in various formats. Generally, the formats can be classified as workshop-style, visits to other sites, coaching, research, and peer-to-peer observations (Darling-Hammond & Wei, 2009). Typical professional development includes some combination of the formats, such as initial training, release time for teachers to create and modify curricula, and instruction on the use of program materials (Odden, Archibald, Fermanich, & Gallagher, 2012). It has been estimated that approximately 90% of teachers experience some sort of PD in a given school year, and that 90% of the PD teachers participate in is primarily organized on a workshop model where they attend a one- to three-day conference with little to no systematic follow-up (Darling-Hammond & Wei, 2009). However, the workshop model has not been shown to be the most effective form of PD (Gulumhussein, 2013). In a comprehensive meta- analysis of more than 1,300 studies, Garet et al. (2009) determined the highest effect sizes were from PD formats that were “sustained and intensive” (p. 938). The researchers went on to suggest that models with less than 14 hours of direct instruction had no effect on student achievement (Garet, Porter, Desimone, Birman, & Yoon, 2009). The difficulty in designing PD to effect change may come less from the specific method used to teach the teachers and more from the level of implementation planning provided (Gulumhussein, 2013). It has been suggested that the challenges of changing practice do not come with practitioners learning the new practice, but rather in their attempts to integrate it into their regular routines (Guskey, 2002). It may take a teacher more than 20 attempts at implementing a practice to master it (Joyce & Showers, 2002). The challenge of implementation is compounded by the desire of school leaders, who 3 likely feel pressure to maximize their resource allocations and see immediate results (Gulumhussein, 2013). Calculating the exact cost of PD is difficult due to the variety of resources used for implementation. For example, in addition to the cost of training, many initiatives require the development of materials and additional planning time for the teaching staff. Some researchers have estimated that a school district spends between 2% and 5% of its operating budget on PD (Miller, Lord, & Dorney, 1994; Odden et al., 2012). This estimate of the financial impact may be low, however, because it is difficult to assess accurately the amount of time—both compensated and uncompensated—that teachers allocate to implementation of PD skills (Odden et al., 2012). These costs are coming at a time when school and district leaders are forced to balance developmental costs with shrinking budgets. 
In the 2013–2014 school year, approximately 35 states had lower per- pupil spending than pre-recession levels (Leachman & Mai, 2014). The combination of budgetary constraints and political pressures to increase achievement outcomes means that school leaders are required to constantly evaluate their programs in order to show a rapid return on investment (ROI). The program evaluations can serve as evidence of ROI if they reveal an improvement in instruction, an increase in student achievement, or both. In order to make inferential claims of improvement, school leaders must design program evaluations using experimental or tightly controlled quasi- experimental designs that include both control and program-receiving (experimental) groups (Weiss, Bloom, & Brock, 2013). However, evaluations are often completed by measuring change in school-wide or district levels of student achievement, typically from standardized tests without the benefit of a solid research design (Shymansky, Wang, 4 Annetta, Yore, & Everett, 2010). Even if an intervention has been previously shown to be effective, evaluating a program by only looking at student achievement data is flawed because it assumes the program has been implemented in a way that would lead to certain expected changes. As a result, in addition to a strong inferential research design, evaluations should also include a measurement of implementation fidelity (Century, Cassata, Rudnick, & Freeman, 2012; Weiss et al., 2013; Zvoch, 2012). The concept of implementation (or treatment) fidelity, which considers the degree to which a program is delivered as intended (Yeaton & Sechrest, 1981), served as the basis for this case study. The inclusion of treatment fidelity strengthens a program evaluation by giving providers formative data as well as more accurate summative claims (Gulumhussein, 2013; Weiss et al., 2013). In formative evaluations, providers can allocate additional resources or make adjustments to their implementation plans. In summative evaluations, evidence of high implementation fidelity can strengthen inferential claims by demonstrating that the treatment group received the intervention as intended, and was thus distinct from the control group—in other words, the change measured was the result of the intervention (Weiss et al., 2013). Evaluators making inferential claims without a measurement of fidelity risk attributing a change in outcomes to a change in practice that was not verified to have actually occurred (Dusenbury, Brannigan, Falco, & Hansen, 2003). The study presented here investigated the manner in which school leaders and teachers evaluated and understood the degree to which a program had been implemented as intended. The following chapters (1) review and synthesize the relevant literature regarding the concept of fidelity of implementation, (2) describe methods for evaluating 5 implementation using both quantitative and qualitative methods, (3) present findings from an evaluation of Constructing Meaning practices at a local middle school, (4) offer conclusions drawn from the data, and (5) discuss recommendations for application and future research. 6 CHAPTER II LITERATURE REVIEW The literature relevant to this study is reviewed and synthesized in this chapter. 
The primary themes of the literature review are as follows: a conceptual framework of the construct of fidelity of implementation, a synthesis of the approaches in measuring implementation fidelity in prior research, and the review of the specific PD model being used as an intervention. Defining Fidelity of Implementation The concept of fidelity of implementation can be defined as the degree to which a treatment is delivered as intended by its developers (Moncher & Prinz, 1991; Orwin, 2000; Yeaton & Sechrest, 1981). In a research context, evaluating fidelity can provide confirmation that the manipulation of the independent variable occurred as planned (Moncher & Prinz, 1991). In a program monitoring context, fidelity evaluation can provide information to policymakers that services are being implemented as prescribed to reach the intended targets (Orwin, 2000). Although these descriptions seem simple, in practice fidelity of implementation is challenging to define and measure (Zvoch, 2009, 2012). In their frequently cited study, Dane and Schneider (1998) suggested that fidelity investigations should address five aspects of implementation. Adherence is the extent to which the intervention is delivered by a provider as designed by its developer, possibly measured by observations and/or checklists (Drake et al., 2001). In educational settings, the provider is likely a teacher, counselor, or other specialist. Exposure (also referred to as dose) is a multifaceted construct. Dose is, generally, the completeness of the delivery 7 of the program (Dusenbury et al., 2003). The completeness of delivery can mean the amount of intervention actually received by intended recipients, and is influenced by the methods of delivery and the engagement of recipients. Quality of delivery looks at aspects of the intervention beyond basic implementation. Quality of delivery evaluation points can include provider (teacher) enthusiasm, depth of providers’ understanding of the program model, and appropriateness of specific applications. Participant responsiveness is the level to which the participants (in the case of educational interventions, students) respond to or interact with the intervention. For example, investigators can observe whether a student actually uses the vocabulary list a teacher has posted on the wall. Program differentiation documents the degree to which the treatment intervention differs from current practice or a control condition. Dane and Schneider (1998) suggest that all fidelity studies should measure each of these aspects, though few studies have been able to thoroughly address all five in their evaluations (Dusenbury et al., 2003). Challenges are present in obtaining and utilizing reliable and valid measures of adherence and quality (Dusenbury et al., 2003). For example, dose requires recording every instance of program use, which is only practical through the self-reporting of providers and recipients and introduces the potential for bias and over-reporting (Kruger & Dunning, 1999). Moreover, participant responsiveness can measure a range of recipient actions. Broadly, it may also be measured as simply the number of recipients being presented with the intervention. In a school setting, the number of students in a class that is observed following protocol may all count toward participant responsiveness. A more complete measurement, however, would be a calculation of the number of students actually engaging with the tools of the intervention. 
8 In addition, measuring engagement on a continuum can be very challenging, as it requires observers to interpret varying levels of engagement in different students who are displaying similar actions (Tan, Sun, & Khoo, 2014). For example, a student who appears to be writing may be authentically engaged in a prescribed exercise (high participant responsiveness), while another student who is also writing may be simply writing a message to a friend (low participant responsiveness). Dose can be estimated through the self-reporting of providers and recipients, although the level of bias and over-reporting may be difficult to assess. Measuring program differentiation can be challenging, in that it is common to find similar elements in varying interventions (Hansen, Graham, Wolkenstein, & Rohrbach, 1991). Although the aspects described above can be challenging to accurately measure, they cannot be ignored. Each one represents an important component of the analysis of fidelity of implementation of an intervention. The terms described by Dane and Schneider are found throughout the literature to introduce and describe fidelity measurement (Carroll et al., 2007; Century, Rudnick, & Freeman, 2010; Dusenbury et al., 2003; Zvoch, 2012). However, they have not been accepted as the standard by all (Weiss et al., 2013). Although frequently referenced in the literature, Dane and Schneider’s terminology has been shown to be too broad to use as a framework of study for implementation fidelity. As described above, the difficulties in measurement have prevented the terminology from being used as a universal framework for fidelity studies. Why Study Fidelity of Implementation? The use of interventions to improve outcomes is not unique to the field of education; virtually all service providers implement interventions to change outcomes 9 (Durlak & DuPre, 2008; Dusenbury et al., 2003). However, early research of implementation fidelity suggested that without studying fidelity of implementation, intervention research does not yield meaningful claims (Yeaton & Sechrest, 1981). In other words, if implementation fidelity is not clearly measured, it is impossible to distinguish between a flawed program and poor implementation. Evaluators must identify whether the intended aspects of the intervention are being fully implemented and delivered to their recipients. Too often, interventions are evaluated based only on the intended outcomes, with little to no measurement of the actual implementation (Dobson & Cook, 1980; Durlak & DuPre, 2008; Harn, Parisi, & Stoolmiller, 2013). Without proper attention to fidelity of implementation, claims made from such evaluations may not accurately reflect the intervention’s actual efficacy. Fidelity as a summative evaluation. Generally, most practitioners assume that demonstrating high fidelity to evidence-based best practices will result in higher gains in student achievement than those with low fidelity (Harn et al., 2013). However, causal claims regarding effects of an intervention should not be made without including a confirmation of the level of implementation fidelity to complement a well-designed experimental study (Dusenbury et al., 2003; Weiss et al., 2013; Yeaton & Sechrest, 1981). Weiss et al. (2013) proposed a framework for program evaluations that includes the investigation of implementation fidelity with a strong research design. 
As illustrated in Figure 1, Weiss and colleagues describe phase one of their framework as an investigation of fidelity within an experimental design, in order to limit possible errors in interpreting final outcomes. For example, if fidelity is not measured and student achievement goals are not observed, evaluators may conclude prematurely that the intervention itself was not effective in producing the desired outcomes. Alternatively, when fidelity is measured, researchers can strengthen arguments that the treatment had a causal relationship with reaching desired outcomes by ensuring that the treatment group received the intervention as intended (Echevarria et al., 2011; Wolery, 2011). Weiss and colleagues describe a comprehensive approach to program evaluation that goes beyond the implementation phase. Their framework includes investigations of the characteristics of the providing organization, characteristics of the recipients, and description of an appropriate experimental or quasi-experimental design. The experimental design phase includes a measurement of treatment contrasts that define and describe the differences between treatments received with and without access to the intervention. The model also includes “mediators” as intermediaries between the treatment being received and the outcomes being measured. Mediators are part of the complex process that ultimately produces the program effects. For example, in teacher PD intended to ultimately raise student achievement, a mediator may be the changes to classroom instruction. An inclusive study of program effects would include all of the elements of Weiss’s framework. However, the study described in this manuscript focuses on the initial phase of the model, treatment fidelity.

Figure 1. A framework for studying program effects that includes measurement of implementation fidelity. Taken from Weiss et al. (2013). A conceptual framework for studying the sources of variation in program effects. MDRC Working Papers on Research Methodology.

Fidelity as a formative evaluation. Investigating fidelity can also provide insight into the characteristics of implementation of an intervention in organizational settings (Weiss et al., 2013). When implementation is closely monitored, evaluators can gain insight into why a particular intervention succeeds or fails to become fully implemented (Harachi, Abbott, Catalano, Haggerty, & Fleming, 1999). For example, school leaders may find that the time required for daily teacher collaboration within the school day is impossible to provide. However, the evaluation may suggest that teachers provided with extended paid time are more likely to implement a program with fidelity than teachers who are not compensated for additional time commitments. Leaders can use this information to make decisions regarding resource allocation. Similarly, through early and regular measurements of implementation fidelity, leaders can provide rapid feedback to practitioners who are learning new techniques (Harn et al., 2013; Webster-Stratton, Reinke, Herman, & Newcomer, 2011). Formative feedback developed from investigations of fidelity may increase the likelihood that the intervention will be delivered as intended (Codding, Feinberg, Dunn, & Pace, 2005; Mortenson & Witt, 1998). In addition to formative information, studying implementation also reveals how likely a program is to be implemented with high fidelity beyond initial or pilot trials.
If a program is extremely difficult to implement as intended, it may not be practical or sustainable, regardless of whether the desired outcomes have been achieved (Dusenbury et al., 2003). There are often subtle components of the implementation that were influential to the success of the program that may or may not be possible to replicate (Wolery, 2011). For example, an evaluation may reveal that teachers’ enthusiasm for the intervention predicted higher fidelity. However, increasing the enthusiasm of teachers with lower fidelity may prove to be a challenge. Practitioners can also see how the implementation changes a wide range of organizational systems and behaviors, perhaps some of which were not originally targeted (Dusenbury et al., 2003). Information regarding unanticipated system changes is not only valuable to the actual implementers, but to those charged with allocating resources (Century et al., 2012). For example, school leaders looking to increase collaboration regarding student behavior may implement cross-curricular teaming structures among staff. In analyzing the intended practice, evaluators may find that curriculum-based collaboration has also increased. Accordingly, school leaders may look to support such unpredicted changes in practice through increased resource allocation. How Is Fidelity of Implementation Measured by Evaluators? Historically, schools have not been given consistent direction on measuring program implementation (Dusenbury et al., 2003; Harn et al., 2013). Recently, however, 13 an increased focus on including implementation measurement in evaluation studies has forced researchers to abandon the concept of black-box approaches to program evaluation (Harachi et al., 1999; Mowbray, Holter, Teague, & Bybee, 2003; Zvoch, 2012). Numerous guidelines for approaching fidelity investigations through measurement of critical components have been developed (Bond et al., 2000; Mowbray, Bybee, Holter, & Lewandowski, 2006; Mowbray et al., 2003). Hall and Hord (1987) describe critical components as the “building blocks” (p. 117) of the intervention. The building blocks are the components of the intervention that are deemed most crucial to program success. The identification of critical components underlies the process of measuring fidelity of implementation in that they allow evaluators to specify active program ingredients and uncover deviations from the intended model (Mowbray et al., 2003). Additionally, by defining and basing evaluations on critical components, evaluators can investigate whether the treatment group is actually receiving a different experience than control group participants, or if a program differs significantly across multiple sites—such as different high schools in a given district (Mowbray et al., 2003) Although other researchers use slightly different nomenclature, there is consistency in the notion of programs having specific features that must be considered when studying fidelity of implementation (Century et al., 2012). The steps to using critical components to frame a fidelity study were summarized by Teague, Bond, and Drake (1998): (1) identify the indicators or critical components of the intervention, describing both the operational definition of the components and the methods used for measurement; (2) collect the data to measure each indicator or component; and (3) examine the data in terms of reliability and validity. 14 Identifying the critical components of the intervention. Mowbray et al. 
(2003) describe three approaches to developing fidelity criteria: (1) consult the program model of the intervention, (2) obtain expert opinion, and (3) consult the participants involved. Using the program model is the most straightforward approach, especially if the program includes key components in its manuals or other training devices (Bond et al., 2000; Christie & Alkin, 2003; Mowbray et al., 2006). Determining the critical components from the program model, however, may limit the ability to assess the intervention accurately if it has been adapted from its original design (Harn et al., 2013). For example, if a component is modified to meet the needs of a particular school culture or program, a fidelity evaluation based solely on the program model would likely indicate a lower fidelity score (Webster-Stratton et al., 2011). Flexibility within implementation, as described by Cohen (2008), suggests that adapting the original design—as in approaches (2) or (3) noted above—can have positive impacts on the intervention, and that evaluators finding lower fidelity results due to adaptations should further investigate the changes before allocating resources to increase fidelity (Harn et al., 2013; Webster-Stratton et al., 2011). Organizing the components. The critical components of the intervention can be further described as either structural or procedural (Knoche, Sheridan, Edwards, & Osborn, 2010; Mowbray et al., 2003; O’Donnell, 2008). The structural components are those that provide the framework of the intervention, and the processes that define the way the framework is delivered (Mowbray et al., 2003). For example, structural components may include the use of required materials or the amount of time spent on a particular topic, or the contextual conditions such as student-to-teacher ratios or length of 15 class periods (Durlak & DuPre, 2008; Harn et al., 2013). Process components tend to focus more on behaviors and interactions of teachers and students (in educational settings), or possibly doctors and nurses (in health care settings) (Century et al., 2012; O’Donnell, 2008). The organization of components by structure or process requires the researcher to document the interactions with the intervention (process) as well as the core activities themselves (structure). The distinction of components into structure and process also aids in the application of evaluations by allowing leaders to apply resources (increased training, guidance and feedback, or modifications to contextual conditions) to the components (structure or process) that are in greatest need (Durlak & DuPre, 2008; Dusenbury et al., 2003; Kaderavek & Justice, 2010). Measuring the critical components of the intervention. Tools to assess fidelity to the critical components typically come in the form of checklists or measures that have been scaled, along with associated rubrics (Bond et al., 2000; Century et al., 2010; Mowbray et al., 2003). Ideally, these checklists or rubrics have been developed as a part of the program design, field-tested, and improved by previous users. Monitoring the application of the components can be achieved through direct observation, self- assessments by the practitioners, or a combination of both (McKenna, Flower, & Ciullo, 2014). For example, Positive Behavior Intervention and Supports (PBIS; Sugai & Horner, 2002) is a common school-wide intervention program used to improve the overall climate and culture of schools (Bradshaw, Koth, Thornton, & Leaf, 2009). 
Researchers at PBIS Maryland have designed a tool called the Implementation Phases Inventory (IPI; Bradshaw, Barrett, & Bloom, 2004). PBIS coaches use this tool to observe school practices to characterize the school as being at a particular level of implementation 16 (Bradshaw, Debnam, Koth, & Leaf, 2008). Coaches using the IPI assign a PBIS Level Rating for the school that can be used to track progress and plan further professional development. The IPI measures the critical components of each PBIS level with respect to adherence, quality, and dosage. PBIS coaches use a checklist aligned to the design as they observe teacher practices (adherence). The coaches have been trained, through PBIS, to make judgments on the quality delivery, and indicate their findings on the checklist as well. Finally, school records are used to measure how many students receive the particular components of the intervention (dose). Challenges in measuring implementation fidelity were described by researchers at the Oregon Social Learning Center, who measured the implementation fidelity of the Oregon Model of Parent Management Training using the critical components described in the program manual (Forgatch, Patterson, & DeGarmo, 2005). Researchers found that the components could be organized into adherence and quality exclusively. The Fidelity of Implementation Checklist (FIMP; Knutson, Forgatch, & Rains, 2003) was used to measure the components. The need for flexibility by practitioners and the varying degree of client engagement became a challenge when applying the binary checklist. The FIMP consisted of direct observations and video recordings of sessions (Forgatch et al., 2005). The primary goals of the evaluation were to identify the psychometric properties of the FIMP and to measure the efficacy of the training. A Cronbach’s alpha reliability analysis of the raters revealed a range of 0.87–0.95, depending on the component. The correlation between the items ranged from 0.71 to 0.90. The evaluation revealed that fidelity of implementation could be shown to account for 30% of the change in the parental behavior. In addition, the researchers found that practitioners used their professional 17 experience to adapt the components to meet the needs of the individual recipients. In doing so, the level of fidelity was lowered, although the change may have been warranted. These researchers recommended that observers record and review videotapes of sessions in order to code all activities (Forgatch et al., 2005). The studies presented above give insight to the opportunities and challenges of measuring the fidelity of implementation within a program evaluation. The studies also present methods to limit the impact of challenges when designing a program evaluation. For example, the use of simple checklists causes judgments to be made too narrowly. Preferably, comprehensive descriptions of components with progressive rubrics should be used when available. The specifics of conducting observations present additional challenges. For instance, the timing of the observations may affect the results (Bond et al., 2000; Yeaton & Sechrest, 1981). Studies have shown that fidelity to program adherence can vary over time (Dusenbury et al., 2003; Zvoch, 2009). Therefore, repeated measures of fidelity over time are preferable to a single-point data collection (Zvoch, 2009). Multiple measures yield a better understanding of the average adherence when implementation is likely to vary over time. 
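As a purely hypothetical sketch (the teacher labels, occasions, and scores below are illustrative and do not come from this study), repeated observation scores can be combined into an average adherence value for each provider:

```python
# Hypothetical sketch: averaging repeated fidelity observations per teacher
# so that a single-occasion score does not over- or under-state adherence.
import pandas as pd

# Each row is one observation occasion; "fidelity" is the proportion of
# possible rubric points earned on that occasion.
observations = pd.DataFrame({
    "teacher": ["A", "A", "A", "B", "B", "B"],
    "occasion": ["fall", "winter", "spring"] * 2,
    "fidelity": [0.42, 0.55, 0.61, 0.70, 0.66, 0.74],
})

# Mean adherence across occasions for each teacher.
average_adherence = observations.groupby("teacher")["fidelity"].mean()
print(average_adherence)
```

A summary of this kind addresses only the overall level of adherence.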
Additionally, the rate of change of implementation may be determined. With multiple measurements over time, it is possible to examine whether implementation increases, decreases, or remains unchanged over the course of a school year. The general feasibility of fidelity measurement may also impact the evaluation of program implementation (Mowbray et al., 2003). At times, fidelity measures are aligned to components that can be practically measured, but these measures do not accurately reflect the full scope of the intervention (McGrew, Bond, Dietzen, & Salyers, 1994). Thus, the results of studies that do not address every component of the intervention are limited to the components that are measured. For example, a component of an intervention may be focused on student academic talk. A practical measure of student talk is to record the ratio of teacher talk to student talk, or, even more simply, the number of minutes per class that a student is talking. Although measuring the quantity of student talk is straightforward, the measurement would not describe whether the talk was academic or not. By not measuring the academic nature of the talk, the scope of the component would not be fully assessed. In order to measure student talk more completely, observers would need to measure the quantity of talk as well as its substance. Observers need to be in much closer proximity to students to do this, which may cause students to change their behaviors or, at least, increase the difficulty of observing a wide range of students. Alternatively, audio recordings could be obtained, transcribed, and coded into varying degrees of academic talk. In classrooms where student talk is directed to the instructor and comes from one student at a time, the use of recordings may be practical. However, in classrooms where student talk is directed to peers in dyads or small groups, a practice that is considered beneficial (Bickmore & Parker, 2014), numerous recording stations would need to be set up at multiple points in the classroom. The equipment demands of setup and the personnel demands of transcribing and coding across multiple classrooms may render the approach impractical.

Determining the reliability and validity of the measures. Data collected must first be analyzed for reliability and validity before meaningful conclusions can be drawn (Mowbray et al., 2003). Reliability generally refers to the ability of a test or other technique to yield consistent results (Babbie, 2007). There are two forms of reliability particularly relevant to this case study. The first, reliability between observers (inter-rater reliability or consistency), is important if more than one person will be making observations. The second, internal consistency, requires that the scores obtained from the different items in the instrument agree with one another. For example, it is relevant to verify that scores on items that represent a particular construct positively correlate with one another. Reliability indices should first account for the level of agreement between judgments of the same event. The simplest measurement is in the form of a percent agreement. Percent agreement, however, is not considered to be adequate, as it does not take into account the agreement that would be expected due to chance (Hoehler, 2000). Cohen’s kappa is a simple extension of the rate of agreement that corrects for the agreement expected by chance.
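For reference, the standard formula for kappa (included here only as an illustration, not as part of any instrument used in this study) compares the observed proportion of agreement, p_o, with the proportion of agreement expected by chance, p_e:

```latex
\kappa = \frac{p_o - p_e}{1 - p_e}
```

For instance, if two observers agree on 80% of the items they score (p_o = .80) and chance agreement is 50% (p_e = .50), then kappa = (.80 - .50)/(1 - .50) = .60, a noticeably more conservative figure than the raw 80% agreement.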
The kappa statistic is designed for use with nominal or ordinal data, preferably when only binary judgments are made (Morgan, Leech, Gloeckner, & Barrett, 2013). Although the kappa statistic was designed for binary scales, it is often applied to graded measurements due to its relative ease of calculation and interpretation (Morgan et al., 2013). A second approach is to account for the internal consistency of the item responses by using Cronbach’s alpha (Bond et al., 2000). Internal consistency refers to the agreement among items on a particular measure that evaluate a specific construct. In an evaluation of teacher practices, observations may be made using a particular rubric that evaluates a standard or domain. The rubric for a particular standard may include multiple indicators, and the group of indicators for a particular standard should yield a similar result, regardless of the observer. The measurement of internal consistency using Cronbach’s alpha typically involves three steps. The first is the determination of the alpha itself, providing evaluators with an indication of the agreement of scores on the items. Next, an analysis of the inter-item correlations is made, allowing evaluators to determine which scores agree with or contradict one another. Finally, the alpha is recalculated repeatedly by removing single items one at a time. The alpha with an item removed can be compared to the alpha with all items included. An alpha that increases when a particular item is deleted suggests that the item is lowering the internal consistency and should be considered for removal from the analysis (Morgan et al., 2013).

Determining validity of measures. Validity refers to the degree to which the data support the adequacy and appropriateness of the interpretations and actions derived from them (Messick, 1994). In quantitative studies, three forms of validity should be considered: content validity, predictive (concurrent) validity, and construct validity (Creswell, 2014). Content validity is the ability of a test to measure the content it was intended to measure; it ensures that the measure adequately captures the breadth of the target. Content validity can be assessed by having field experts review items, review descriptions of the content, and make judgments as to the completeness of the measure (Polit & Beck, 2006). Predictive, or concurrent, validity is the degree to which scores predict or correlate with other measures of the same content or construct (Creswell, 2014). For example, both the College Board’s ACT and the National Assessment of Educational Progress (NAEP) include sections designed to measure students’ “reading ability.” High concurrent validity between the ACT and NAEP would indicate that students scoring high on the reading section of the ACT would also score high on the reading section of the NAEP. Predictive validity indicates whether or not a measure adequately predicts a criterion, such as whether the College Board’s ACT exam accurately predicts future college success (Babbie, 2007). Predictive validity can be measured using regression analysis or similar inferential statistics (Morisky, Green, & Levine, 1986). Related to content validity is the concept of construct validity. The term construct refers to abstract or difficult-to-observe properties, such as motivation or personality, as opposed to easy-to-define observables like pH and age (Thorndike & Thorndike-Christ, 2011).
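Before turning to construct validity in more detail, the three-step internal consistency procedure described above can be made concrete with a brief sketch. The example is hypothetical; the item names and scores are illustrative and are not data from this study.

```python
# Minimal illustration of the three-step internal consistency check:
# (1) Cronbach's alpha, (2) inter-item correlations, (3) alpha with items deleted.
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha for a set of items (rows = observations, columns = items)."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)      # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)  # variance of the summed scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical rubric scores (0-3) on three items within one critical component.
scores = pd.DataFrame({
    "item_1": [2, 3, 1, 2, 3, 2],
    "item_2": [2, 3, 1, 1, 3, 2],
    "item_3": [1, 0, 2, 3, 1, 0],
})

print("Alpha, all items:", round(cronbach_alpha(scores), 3))   # step 1
print(scores.corr())                                           # step 2
for item in scores.columns:                                    # step 3
    reduced = scores.drop(columns=item)
    print(f"Alpha without {item}:", round(cronbach_alpha(reduced), 3))
```

Chapter IV reports these same three steps for each critical component of the Refining Our Practices rubric.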
Construct validity refers to the degree to which a study accurately measures the intended construct (Tindal & Marsden, 1996). Messick (1994) describes two general threats to construct validity: “construct underrepresentation” occurs when a measure is too narrow to fully describe the construct, whereas “construct-irrelevant variance” arises when the measure is too broad and includes indicators aligned to other constructs. These threats to validity should be addressed when designing a program evaluation. In order to limit the threats to measurement validity, evaluators first need to thoroughly understand the components of the intervention. Understanding can be derived from qualitative data on the specific pieces of the implementation process through the involvement of the people closely involved with the intervention (Brunette et al., 2008; Singh & Fletcher, 2014). A complete understanding of the components should be developed through a review of the program model, but also through the involvement of key stakeholders and experts (Brandon, 1998; Mowbray et al., 2006). Brandon (1994) synthesized the findings of four studies to develop guidelines for including stakeholder and expert input, in addition to a review of the program model, for the purposes of limiting threats to measurement validity. He concluded that researchers should ensure that the groups included have the appropriate experience and are able to participate. They should also take care in developing thorough methods for gathering stakeholder feedback. Finally, stakeholder groups should have equitable participation in the feedback processes, meaning simply that “no stakeholder group’s expertise is ignored in the evaluation and decisions making process” (Brandon, 1998, p. 8). The use of an instrument, ideally a graded rubric, is the central element of fidelity evaluations. Therefore, the ability of the instrument to generate reliable and valid data is paramount to the confidence that underlies the analysis. The difficulties in obtaining reliable and valid observational data are highlighted throughout the current study.

The Investigation of a Specific Intervention

Constructing Meaning (CM) training is a product of E. L. Achieve, an educational consulting company. The basic premise of the program is that English Language Development needs to be integrated throughout all curricula, not just in an English class (Dutro, 2009). Requiring all teachers to use strategies in language and literacy development is a shift in pedagogy, especially at the secondary level. Secondary schools are typically segmented into distinct subjects, where the science teacher is responsible for the science content and the language arts teacher is considered solely responsible for literacy development (O’Brien, Stewart, & Moje, 1995). The students, as well as the teachers, may recognize this segmentation. Measor (1984) found that students’ actions and behaviors varied significantly throughout the day, depending on their perceptions of the current course. For example, students were more likely to make language convention errors in a science class than in a language arts class, where they perceived the practices to be more relevant. In order to shift the perception that language conventions are irrelevant in non–language arts classes, CM training provides strategies for teachers to utilize within their content areas to improve the overall academic language proficiency of their students.
CM training is designed to enable teachers to lead students to develop their English language proficiency while still meeting the rigorous demands of content area courses. The foundational basis of CM includes procedures for the following:

• Ensuring both a content and a language objective for every lesson.
• Using a functional language approach to instruction. Typical language functions include comparing two ideas, persuading an audience, or defending a claim.
• Dividing introductory lessons into discrete chunks to scaffold students toward longer, more complex activities.
• Explicitly teaching language, with opportunities for written and oral practice in every course of study.

CM teacher training is provided in a three-day seminar where teachers learn background research, teaching strategies, and methods for adapting existing lessons. Following the training, teachers are provided with institutional handbooks as well as access to instructional coaches for support. The training begins with a background of relevant concepts in language development. Teachers then transition to learning specific strategies to be implemented in their classrooms. Strategies include the use of language targets, the use of sentence frames, and tools to scaffold the “bricks and mortar” of their lessons (Dutro & Moran, 2003). For the purposes of CM, bricks refer to the vocabulary that is specific to the course of study. As an example, the terms stoichiometry, amphoteric, and monoprotic would all be considered “bricks” of a high-school chemistry course. The “mortar” consists of academic words that are used consistently regardless of content, such as therefore, analyze, or however. The strategies taught to the teachers extend through each of the foundational principles described above. Following the strategies portion of the training, teachers are given time to adapt their curricula (casually referred to as “CMing”). Teachers are taught to adapt their curricula by applying the strategies they learned to meet the overall goals described by the critical components. Finally, teachers present mock lessons and self-evaluate their work based on a rubric developed in alignment with the critical components.

Critical components of Constructing Meaning. E. L. Achieve, the developer of the CM training, has designated five areas as critical components:

(1) Understanding Backward Design. This includes designing instruction that addresses the cognitive and linguistic demands required to meet stated student learning goals.
(2) Language as a Part of Content Teaching. This component requires creating opportunities to learn both content “bricks” and functional “mortar” throughout instruction.
(3) Oral Language Practice. This refers to instructional designs that allow for structured peer interaction in which students use the target language (English) of the learning goal, including students who may have limited English language proficiency.
(4) Interactive Reading and Note-Taking. This describes the use of comprehensive strategies and note-taking tools to facilitate the navigation of complex text and increase student independence.
(5) Academic Writing Support. This final component prompts teachers to provide tools and facilitate processes that support students in producing complex academic writing.
As described above, one fundamental concept within CM practices is the explicit teaching of academic language through the strategies delivered in the Language as a Part of Content Teaching critical component. Explicitly teaching language involves direct and unambiguous strategies to teach academic language acquisition (Rosenshine, 1987). Criteria for qualifying a specific strategy as explicit were summarized by Archer and Hughes (2011) and are in line with the strategies presented in CM training. In a meta-analysis of 49 experimental and quasi-experimental studies, researchers investigated the effect size of various approaches to second-language instruction on student achievement (Norris & Ortega, 2000). The approaches were classified into four categories: implicit and explicit instruction using the Focus on Form (ForM) approach, and implicit and explicit instruction using the Focus on Forms (ForMS) approach. The ForM approach teaches the forms of language as they come up incidentally in a student’s academic conversation. In contrast, the ForMS approach teaches linguistic elements in discrete lessons (Sheen, 2002). The assessments used varied across the experiments in the meta-analysis, but were grouped into four categories: metalinguistic judgments, selected response, constrained constructed response, and free constructed response. Overall, the researchers found that on various student performance outcomes, explicit language instruction had a mean effect size over one half of a standard deviation greater than that associated with implicit instruction. These results suggest that utilizing explicit strategies, such as those presented in CM training, may lead to relatively stronger English Language Arts (ELA) achievement.
As described above, an additional premise for consideration is the explicit teaching of academic English language throughout the curriculum, not just in literacy courses (E. L. Achieve, 2014). Teaching language within other content courses has been shown to increase the contextualization experienced by students and, in turn, increase levels of achievement (Tompkins, Campbell, Green, & Smith, 2014). Additionally, providing professional development in academic language development to all teachers serves as a more pragmatic approach in light of the current standards. In both the Common Core State Standards for Mathematics (CCSS; National Governors Association, 2010) and the Next Generation Science Standards (NGSS; Lead States, 2013), standards include requirements for communication, collaboration, and text complexity. The added requirements strengthen the case for language acquisition to be taught in all classes, resulting in a need for professional development opportunities involving all teachers (Archer & Hughes, 2011), such as CM.
The design of lessons through the use of strategies delivered in component (1), Understanding Backward Design, builds on the concept that language instruction should occur in all content classes. Teachers are instructed to begin the design of the lesson with both content and language objectives, giving students language goals in addition to content objectives (Ferretti, MacArthur, & Dowdy, 2000). The use of clear objectives allows teachers and students to navigate the different standards that are directing the class. For example, a typical high school biology course can be aligned to Common Core Literacy Standards, NGSS, and locally adopted standards for English Language Learners (ELL) curriculum (Valdés, Kibler, & Walqui, 2014).
Clear short-term objectives allow students to understand the outcomes they are expected to achieve by completing their daily assignments. For example, in the current study, the district has aligned every course to “learning targets.” Learning targets serve as the classroom-level guide for the implementation of broad standards, and allow teachers to appropriately design their instruction to ensure alignment. Specifically, teachers are able to explicitly express their high expectations for students, ELLs in particular, who may have experienced lower expectations in other school settings (Echevarria, Frey, & Fisher, 2015). Standards, and associated learning targets, set the benchmarks for students as they progress through the school system. Additionally, including language objectives supports the concept of language instruction across the curriculum, as described above (Vacca & Vacca, 1989). Including a language objective in a content class is a method to teach language explicitly. Norris and Ortega (2000) completed a comprehensive meta-analysis comparing instruction to student outcomes in writing. The dependent variable was described as students’ demonstration of language. The nature of the meta-analysis did not allow for a single common measure to be used; however, the measures across the studies were coded into four groups: metalinguistic judgments, selected response, constrained constructed response, and free constructed response. The parameters of the meta-analysis defined the independent variable as the nature of the literacy instruction, implicit or explicit. The experimental group contained only classes where explicit language instruction was used. Explicit instruction was defined by DeKeyser (1995) as instruction that requires students to attend to specific linguistic rules or forms. For example, two science teachers may be using the same article related to the mechanisms of photosynthesis and respiration. One teacher may ask students to identify the literary moves that the author makes in comparing photosynthesis to cellular respiration. The other teacher may restrict the students’ tasks to simply content-specific comprehension, such as understanding the different roles of energy in the two processes. Researchers found an average effect size of 0.75 across the studies, with a pre-test and post-test measuring the impact of direct language instruction as described above. These outcomes suggest that explicit language instruction may lead to higher outcomes across subject areas. It should be noted that there are multiple strategies to instruct language directly, and that developing and displaying a language target cannot be considered complete language instruction. However, the backward design of lessons and units from a defined language objective is critical to the CM approach (Dutro, 2009). The actual design of the lesson will ideally include strategies from all of the remaining components of Interactive Reading and Note-Taking, Academic Writing Support, and Oral Language Practice. Interactive Reading and Note-Taking summarizes strategies intended for the production of work using academic language derived from content area texts, lectures, and other learning opportunities. Teachers can provide discrete scaffolding to more complex objectives through interactive reading and note-taking. Providing such scaffolding allows students to participate in more conceptually abstract activities than they would otherwise be able to (Lucero, 2013).
Specific strategies for interacting with notes, as opposed to allowing students to take notes passively, were shown to have a modest effect size of 0.22 on students’ post-test performance in a meta-analysis of 57 studies comparing note-taking to non-note-taking strategies (Kobayashi, 2005). Additionally, a key piece of Interactive Reading and Note-Taking is the summarization of key learning. According to a meta-analysis presented to the Carnegie Foundation, there is an effect size of 0.82 on assessment of “quality writing” when students are explicitly taught to summarize texts (Graham & Perin, 2007). The CM participant manual offers more than 15 distinct strategies for teachers to use in order to increase students’ interaction with their reading or note-taking assignments.
Academic Writing Support provides strategies to shelter the challenges of language acquisition away from content knowledge. The approach stems from the theory that knowledge is transferable between languages (that is, if you understand something in one language, you understand it in the other) (Bangert-Drowns, Hurley, & Wilkinson, 2004). Often, students struggle with representing their knowledge in a second language (a language challenge), and this is misrepresented as a content challenge. By using strategies such as sentence frames and instruction specifically targeted to vocabulary, teachers can help students communicate their learning clearly even as language development is still occurring (Bangert-Drowns et al., 2004; Graham & Perin, 2007). In a meta-analysis of 123 studies, researchers were able to identify specific areas of writing instruction and summarize their effect sizes on post-test performance. Of the areas identified, utilizing explicit instruction to teach students the components of writing, such as pre-writing, drafting, and revising, yielded an average effect size of 0.82 relative to “writing quality” across the studies that were analyzed (Graham & Perin, 2007).
Opportunity to increase student achievement. The middle school that is the site of this study has consistently underperformed on standardized tests, particularly in ELA. In the most recent state report card, the school earned a Level 3, placing it between the 15th and 44th percentiles of all middle schools in the state. Student achievement data that is disaggregated by subgroup indicates a predictable achievement gap. Approximately 20% fewer Hispanic and ELL students in the school meet state benchmarks in ELA. Recently, school and district leaders have committed to supporting teachers in improving students’ outcomes through the use of high-quality PD. The focus of the professional development has been primarily around academic language instruction. Recently, the Sheltered Instruction Observation Protocol (SIOP)—which is used for observing teachers who are using specific instructional strategies that target the development of academic English language—was used to increase teachers’ understanding of best practices in language development. The protocol is arranged around eight areas, each of which can be observed during classroom instruction. Teachers observed using a high degree of fidelity to these eight areas receive a high SIOP score. Use of the SIOP as a tool to measure implementation fidelity was studied in a large urban school district (Echevarria et al., 2011).
Overall, researchers found that the greater the SIOP score, the greater the student achievement, with the SIOP score explaining approximately 21% of the variance in student achievement. Despite the claims that SIOP practices can raise student achievement, the middle school of this case study has shifted its focus. According to the school principal, the teachers were supportive of SIOP, but they were looking for more specific strategies than those provided. As a replacement, Constructing Meaning training was chosen because district leaders felt the approach of CM, including the fundamental basis and critical components described above, would serve as a follow-up to SIOP and provide continued support to teachers as they explicitly teach language acquisition in their classrooms.
Using a Qualitative Approach to Fidelity Studies
Although much of the fidelity research has been quantitative, studies that are designed to understand processes and events, such as program implementation, may benefit from including a qualitative approach (Maxwell, 2013). For example, researchers at Dartmouth Medical School conducted a follow-up qualitative study to further examine quantitative implementation data (based on observational checklists) derived from the provider’s use of a mental health intervention protocol (Brunette et al., 2008). The researchers used field observations and semi-structured interviews to understand the “facilitators and hindrances” of the specific implementation. The results yielded meaningful claims around both a priori and unpredicted characteristics of implementation. The evaluators were able to organize the hindrances that they uncovered into specific themes of leadership, supervision, staff turnover, consulting with experts, and finances. The evaluators were also able to provide recommendations to hospital management based on each of the themes. For example, one recommendation regarding the theme of leadership was to ensure that the staff understood the level of prioritization the intervention had compared to other hospital objectives. Sites that demonstrated high levels of fidelity were able to clearly prioritize the intervention through policy, financial, and human resource decisions. Participants in sites with low fidelity felt that their leader or leaders had failed to clearly establish the interventions as a priority. Evaluators presented leadership with a recommendation to take steps to clearly show the intervention as a priority. The leaders’ actions on the recommendations included changes in personnel, the development of policy, and increases in communication to the staff from the management. Similarly, researchers at the United Kingdom’s National Institute for Health Research investigated changes to the behaviors of both practitioners and recipients using an implementation fidelity framework that included qualitative methods (Dyas, Togher, & Siriwardena, 2014). The researchers designed interview questions to investigate both adherence to the model and participant responsiveness. The interview responses allowed researchers to gain a better understanding of the pilot data and to better explain the quantitative data. Specifically, researchers were able to ask participants questions directly related to the quantitative data and report on their responses. The combination of quantitative and qualitative data allowed the evaluators to make more specific recommendations to leadership—in particular, in areas in need of improvement.
The combination of quantitative and qualitative data in an evaluation of fidelity may be particularly useful when the purpose for the study is of a formative nature. Quantitative studies, with strong experimental designs, are well suited to describe cause and effect relationships; they are not as well suited to questions of a “how” or “why” nature (Collins, Onwuegbuzie, & Sutton, 2006). Qualitative methods, such as interviews, surveys, and focus groups, gather information about the human experiences of the program being evaluated. The descriptions of the experiences can yield information on the context-specific beliefs and biases that contribute to the level of implementation of the program being evaluated (Sankar, Golin, Simoni, Luborsky, & Pearson, 2006). In addition, the benefits include the opportunities to hear the perspectives of the providers as to what components of the intervention are presenting challenges for implementation. These perspectives would not be apparent in the quantitative data alone.
Mixed-methods research attempts to combine the strengths of quantitative and qualitative methods into a single design (Babbie, 2007; Johnson & Onwuegbuzie, 2004). Mixed-methods research allows researchers to obtain a more complete understanding of the phenomena they are studying than a single method would provide (Hesse-Biber & Johnson, 2013; Johnson & Onwuegbuzie, 2004). Quantitative analysis tends to be very objective and maintains a value-neutral stance in the discussion. Conversely, studies that are solely qualitative utilize subjective analysis and can include a value-specific approach in the discussion (Tashakkori & Teddlie, 1998). Including both quantitative and qualitative analysis makes it possible to explain or interpret initial findings, explore an observed phenomenon, or address a question from multiple levels.
A Mixed-Methods Approach to the Fidelity of Implementation of Constructing Meaning Training
In line with other program evaluation studies of implementation fidelity, a formative evaluation of CM implementation was conducted at a middle school in a Northwest Oregon School District (“Northwest Oregon School District” is a pseudonym used to ensure confidentiality). The concept of treatment fidelity was used to design a study that measured the implementation of the critical components of Constructing Meaning training. The measurement of the implementation of the components resulted primarily from classroom observations that utilized the Refining Our Practices Rubric. The observational data was analyzed using primarily descriptive statistics. The results of the quantitative data analysis were presented to teachers during the qualitative phase of the study, along with survey and interview questions, in order to understand their perspectives on the quantitative findings. In line with applied research, this study addressed a specific school and district need by conducting a comprehensive evaluation of the implementation of CM practices. Neither the middle school nor the larger district had a systematic evaluation plan in place. The case study described here was used to determine the level of fidelity of CM training, understand the perceptions of the providers (teachers), and make recommendations regarding implementation of CM practices. My conclusions and recommendations result from investigation of the following research questions:
• RQ1. How successfully has the faculty of a local middle school implemented the critical components of Constructing Meaning training?
o To what degree have the critical components been implemented?
o Is the variation in implementation predictable?
o How does the degree of implementation compare to a determined threshold?
• RQ2. What are the conditions that favor or hinder a high degree of implementation fidelity in Constructing Meaning practices?
CHAPTER III
METHODS
A mixed-methods design was used to investigate the research questions presented in Chapter II. The following chapter describes the research setting, participants, measures, and analysis procedures.
Setting and Participants
The Northwest Oregon School District, where the middle school of this case study is located, has adopted Constructing Meaning (CM) as a major source of professional development, specifically at the secondary level. The district has communicated a commitment to having every middle school teacher trained in CM within the next three years. The majority of the middle school teachers have yet to be trained, and the district must make a significant resource allocation in order to meet the goal. With school budgets still below pre-recession levels, allocation of resources is closely scrutinized and school leaders must continually monitor the return on investment (ROI) in programs and practices. As a result, the district agreed to participate in this study as a formative implementation evaluation in a pilot school that has been involved with CM training for the past three years.
This study took place exclusively within one middle school in the Northwest Oregon School District. The district is one of the largest in the state, serving approximately 40,000 students. The school’s demographic composition is approximately 42% white, 36% Hispanic, 9% Asian, 5% black, 1% Pacific Islander, 1% Native American, and 6% multiracial students. Approximately 37% of the students are English Language Learners (ELLs), 16% receive special education services, and 64% participate in the Federal Free and Reduced Meals program. The middle school includes grades six through eight and is considered a comprehensive middle school without a specialty program (such as the International Baccalaureate’s Middle Years Program).
The study participants included middle school teachers and students. The school employs 52 certified teachers, 34 of whom have been trained in CM practices. Four of the trained teachers opted out of the study. The participating teachers (n = 30) include 10 Math, 8 Science, 4 Humanities (combined Language Arts and Social Studies), 3 Special Education, 2 Art, 2 ESL, and 1 Physical Education teacher. The teachers varied in teaching experience from 1 to 25 years, with a mean experience of 9.85 years (SD = 5.02). Forty-five students, 15 at each grade level, had a direct role in the study by participating in focus groups. The students were selected at random from grade level lists and were given the option to participate. All of the students agreed to participate and their guardians granted permission. However, for each group, some students were absent on the day of their assigned focus group, presumably due to illness. The resulting groups consisted of 13 sixth graders, 14 seventh graders, and 11 eighth graders. Twenty were male and eighteen were female. Forty-one percent were designated as English Language Learners (either active or monitored). The students were organized into eight focus groups of four to five students each.
The groups were arranged to minimize the disruption to the students’ school day. Students were pulled from elective or teaching assistant periods, when possible. The grade level remained constant within each group and the male/female ratio was as even as possible.
The district does not have a uniform model for ELL inclusion throughout its schools. Some schools opt for a “pull out” model, where specific language instruction is delivered in a class that is distinct and not connected to the grade level language arts class. In contrast, other schools, including the study site, opt for a more inclusive model where all ELLs continue with grade-level Language Arts and Social Studies classes. As a result, individual classes are a heterogeneous mix of students, closely reflecting the overall demographic of the school. All of the teachers in the study had classroom ELL populations between 29% and 45% of the total student makeup.
Constructing Meaning training has been a significant source of professional development at this middle school during the past three years (Brock, personal communication, September 15, 2014). As described in Chapter II, the training was selected as a follow-up or continuation of previous work to implement Sheltered Instruction Observation Protocol (SIOP) techniques from trainings that occurred from approximately 2006–2009. The school employs two “instructional guides” that have been certified by E. L. Achieve as CM trainers available for additional training and support. The instructional guides earned this certification through a “train the trainers” process facilitated by E. L. Achieve. The process to become a certified E. L. Achieve trainer requires participation in two additional workshops beyond the initial training. The first additional workshop is called a +2, referring to the two days spent reviewing videotaped examples of implemented CM practices, training on observations, and discussions of quality feedback. The second additional workshop is called “District Leadership” and includes shadowing other trainers, review of local achievement data, training in E. L. Achieve’s approach and practices, and implementation planning. The guides were essentially “on-call” to observe teachers and provide feedback, help develop curricula, or assist in the delivery of lessons. The use of instructional guides by the teachers was voluntary, and the frequency of use was not recorded. However, casual conversation with the guides revealed that they felt they were frequently utilized by some teachers and rarely accessed by others.
Research Design
The purpose of this mixed-methods study was to monitor and describe the implementation of CM practices at the study site. Therefore, this study investigated the experiences of teachers and students within the school that has piloted the training in order to gain insight into the nuances of implementation that the district could use for future planning. The insight provided would be framed around the success of implementation and the conditions that favored or hindered implementation.
Phases of research. There are a variety of design approaches within the field of mixed-methods research. Ivankova (2006) identified more than 40 different mixed-methods research designs referenced in the literature. However, Creswell et al. (2003) describe the six most commonly used designs.
Within the six designs, three are concurrent—where quantitative data collection and analysis occur simultaneously with the qualitative—and three take place sequentially in two distinct phases. Researchers use concurrent designs when the goals of their studies include comparing or consolidating the quantitative and qualitative findings. Alternatively, researchers use sequential designs when the goal of the second phase is to explain or elaborate on the findings of the first phase. In the explanatory sequential design illustrated in Figure 2, researchers apply quantitative methods first, followed by qualitative methods, in order to understand more fully the initial quantitative findings (Creswell, 2014). The current study utilized this design to investigate the two research questions (RQ1 and RQ2) presented in Chapter II. The use of explanatory sequential design was appropriate in the current study because the components of the quantitative data had been pre-established, eliminating the need to explore the components through qualitative measures first, as would be the case in other mixed-methods designs. RQ1 was written to be investigated using primarily quantitative techniques while RQ2 was written to be investigated using primarily qualitative techniques. Beginning with quantitative data was advantageous as it provided a foundation for the qualitative measures, particularly the semi-structured interviews.
Figure 2. Explanatory sequential design: quantitative data collection and initial analysis, followed by qualitative data collection and initial analysis, followed by combined analysis and RQ implications.
Success of implementation. Following the explanatory sequential model, phase one of this study included quantitative methods focused on addressing RQ1. RQ1 investigated the success of the faculty at the local middle school in implementing the critical components of CM training. Prior to this study, neither the school nor its parent district had set clear expectations for the level of implementation expected. Due to the lack of a predetermined standard, this study utilized personal communication with district leaders and CM trainers, a review of the CM program manual, and a review of relevant literature to develop a standard of success. The resulting standard included determining the degree of implementation of the components of CM training, the level of implementation variability between teachers, and the comparison of actual implementation to a predetermined standard. The standard is described later in this chapter.
Critical components of CM training. As discussed earlier, researchers have suggested framing implementation studies around the components of the interventions that are most critical for an acceptable implementation. The investigation in this study was accomplished through the identification, measurement, and interpretation of the implementation of the critical components of CM training. By utilizing a critical components approach, the operational definition of implementation fidelity for this study was the degree to which the critical components of CM practices were implemented by teachers at the middle school. The critical components of CM training, as defined by E. L. Achieve, are: (1) Understanding Backward Design, (2) Language as a Part of Content Teaching, (3) Oral Language Practice, (4) Interactive Reading and Note-Taking, and (5) Academic Writing Support (see the discussion of each of these components in Chapter II).
These components were identified by the developers of CM through the review of relevant literature, the opinions of experts in the field, and follow-up dialog with participating schools across the country (E. L. Achieve, 2014). As described in Chapter II, the critical components are simply an organization of the instructional practices that most closely align with the key research-based principles of CM. According to personal communication with representatives from E. L. Achieve, past and future revisions to the critical components focus on the descriptive terminology and the specific groupings of strategies. For example, in an upcoming version of the rubric (as of October 2015), components (4) and (5) have been reworded to Language for Reading Comprehension and Language for Writing Comprehension. The goal of rewriting the rubrics is to further define the specific parameters of the critical components. E. L. Achieve also publishes the Refining Our Practices Rubric, a tool that describes the adherence, quality, and to some extent the dose of each component outlined above. The application of the rubric, by trained observers, to this case study provided the basis for collection of quantitative data on implementation fidelity. The rubric is further discussed in the Instruments section of this chapter.
Variation in implementation by predictor variable. Years of teaching experience, teachers’ primary subject area, and time since receiving CM training were used as predictor variables in data analysis. Although other factors are likely related to CM implementation, such as teachers’ initial buy-in and previous quality of teaching, they are difficult to measure and extend beyond the scope of the current study.
Level of implementation compared to the literature. The school, the district, and the E. L. Achieve organization expect that the use of CM practices will have a positive impact on student achievement on both classroom-based assessments and state and national standardized exams. Generally, high fidelity of implementation has been shown to increase intended outcomes (Benner, Nelson, Stage, & Ralston, 2011). However, the current study focused exclusively on the level of implementation by teachers and did not analyze student achievement data. Because of the omission of student achievement data, success was not measured by an exam achievement standard, but rather by an implementation standard. As previously described, neither the school nor district had established expectations for the level of implementation. A review of the CM program manual provided little insight as to the level of implementation that could be expected (E. L. Achieve, 2014). The references to the expected timeline for achieving full implementation are vague, indicating that teachers need to “practice in the classroom to improve” and “use the Refining Our Practices” rubric as a formative tool to progress (p. 63). The literature also does not provide a universal standard level of implementation needed to achieve anticipated results. The tolerance of limited implementation varies by intervention (Kaderavek & Justice, 2010). The use of typical standard-setting models, such as the Angoff or Ebel, requires substantial training and time resources not available in this study (Cizek & Bunch, 2007). Therefore, this study relied on face validity to develop a success threshold. School leaders, CM instructional coaches, and the primary investigator met to discuss the intended levels of implementation as part of this study.
In addition, a review of the terminology of the rubric was used to formulate a success standard. Level 2 scores for rubric items included terms that had a negative connotation, such as “rarely,” “occasionally,” “to individual students,” and “not addressed.” In contrast, the level 3 descriptions included more positive terminology, such as “frequently,” “used by most students,” and “including both bricks and mortar vocabulary words.” The determination was made that all rubric items evaluating critical components should be scored at level 3 or higher when observing a “successful CM Teacher.” The expectation of “all [level] 4s, all the time” was not practical, in the opinion of the group. Accepting the “all [level] 3s” consensus meant that an implementation rate of 75% would equal “successful implementation.” Additionally, as this evaluation was intended to be formative in nature, school leaders cautioned the principal investigator to frame the 75% threshold solely as a marker for this study and not an administrative directive.
Phase two, the investigation of RQ2. Phase two utilized qualitative methods to investigate the conditions that favored or hindered the implementation of CM practices. Qualitative research methods provide tools for achieving goals related to interpretation and understanding of social phenomena (Merriam, 2008; Maxwell, 2013; Creswell, 2014). Qualitative research has certain characteristics (Creswell, 2014). The characteristic of natural setting occurs when the research is conducted where participants experience the topic being investigated. In the current study, all data collection occurred exclusively within the middle school where CM practices were being implemented. The characteristic of inductive data analysis is found within the concept of explanatory sequential design, where emerging data patterns and themes are directly investigated by specific questions or other qualitative methods. In the current study, both the initial analysis of quantitative data and the emerging themes from initial qualitative analysis were explicitly discussed in open-ended interviews. Another characteristic present in the current study is referred to as participants’ meanings. Participants’ meanings direct researchers to keep the focus of analysis on the perceptions that the participants hold in regard to the issue, not what the researcher expects or desires. In the current study, techniques such as the display of negative information and member checking were used to ensure that participant meaning was included.
The qualitative methods included surveys (both closed and open-ended questions), semi-structured interviews with teachers, and student focus groups. Surveys are used in qualitative studies to describe, compare, and explain individual and organizational knowledge (Fink, 2013). The surveys in this study asked questions to solicit the teachers’ perceptions of the overall training as well as the individual components. Interviews, particularly less structured interviews, allow the researcher to do more active inquiry by asking questions specific to the emerging themes of the study (Babbie, 2007; Warren, 2002). The interviews in the current study included the presentation of initial data from classroom observations as well as the themes that emerged from the structured surveys. Similarly, focus groups allow participants to provide additional details relevant to each other’s comments (Sankar et al., 2006).
Focus group participants, particularly minors, may also be more comfortable in a group setting as opposed to individual interviews (Ouimet, Bunnage, Carini, Kuh, & Kennedy, 2004).
Time element. This case study was completed during the 2014–2015 school year by conducting teacher observations, teacher reflections, teacher surveys, teacher interviews, and student focus groups. Teacher observations, reflections, and surveys occurred during a six-week period (February to mid-March 2015) at the beginning of the second semester of the school year. Conducting the research at the beginning of the second semester was advantageous for two reasons. Teachers who had been trained in the beginning of the school year needed sufficient time to apply what they had learned in the training. Also, by completing the observations in the beginning of the semester, teachers would have yet to “gradually release” the students from the supports of CM, meaning that the use of CM strategies would be more apparent to the observers than they may be later in the year. The results of the teacher observations were summarized and descriptive statistics generated prior to the teacher interviews. Following the model of explanatory sequential design, the teacher interviews included peer examination of the quantitative data. Teachers were provided with summary data describing the implementation by component and in aggregate, the implementation organized by predictor variable, and a comparison of the observed scores with the self-reported scores from the teacher reflections.
Data Collection Instruments
Observations and reflections. The Refining Our Practices Rubric (reproduced in the appendix) developed by E. L. Achieve was used to facilitate the collection of observational data on the use of CM critical components. The rubric has four indicators for each of the five critical components. Each indicator is evaluated on a four-point scale, with point descriptors for each indicator. According to the CM program manual (2008), the rubric has been designed and modified by E. L. Achieve and used in multiple sites across the country. The feedback from users, including teachers and coaches, has been collected to make modifications to the rubric over time. For example, a past version of the rubric included descriptors for individual items that did not explicitly build on each other: it was possible to receive a rating of 4 without first meeting the requirements of level 3. As a result, the current version of the rubric includes descriptors for level 4 items that include the phrase “in addition to level three criteria” (plus added level four criteria). The rubric items have also been reorganized to complement the rewording of the components. For example, in an upcoming version of the rubric, the items that reference writing in component (4), Interactive Reading and Note-Taking, are moved to the Language for Writing Comprehension section. According to representatives from E. L. Achieve, although field-testing has occurred, the results have not been presented for publication in peer-reviewed journals, outlined in the program manual, or published in any sort of technical manual. As described above, the publishers opted to refine the rubric over time based on feedback from users rather than reporting on the reliability and validity of the tool. The lack of reliability and validity data for the rubric is a concern, as virtually all of the quantitative data for the evaluation of CM was derived from the rubric.
Without psychometric data available, the data collected is simply assumed to be reliable and valid, which can lead to misinterpretations. Therefore, a reliability analysis was conducted using the observational data collected in this study. Limitations associated with the use of an untested instrument will be thoroughly considered in the discussion.
Teacher surveys. A survey was developed to be completed by all teachers in this study. The intent of the survey was to gain insight into the use of CM instructional practices and the reasons behind varying levels of implementation fidelity. The survey items were developed based on the work of the Learning Forward organization (formerly the National Staff Development Council). Learning Forward developed their current standards for professional development to outline the characteristics of professional learning that lead to effective teaching practices, supportive leadership, and improved student results (Learning Forward, 2014). The Standards Assessment Inventory (SAI) was developed to assess the quality of professional development in schools, based on standards defined by Learning Forward (Vaden-Kiernan, Jones, & McCann, 2009). The SAI has been used in case studies that documented the use of Learning Forward standards in professional development planning and evaluation (Slabine, 2011). For example, between 2008 and 2010, 285 schools in Arkansas used the SAI to evaluate their implementation of the Arkansas Comprehensive School Improvement Plans. Their results indicated that by aligning to Learning Forward’s standards, school leaders were able to understand “what areas were having an impact and what areas needed improvement.” From the case study, the Arkansas Department of Education identified the evaluation of professional development as an official point of emphasis for local school leaders (Slabine, 2011). The SAI itself is too broad and cost-prohibitive to utilize directly in this study. Instead, individual survey questions for this study were aligned with the Learning Forward standards of leadership, resources, and implementation (Learning Forward, 2014). The survey contained two sets of questions. The first set was targeted at the overall implementation process, with four questions primarily addressing leadership and five questions primarily addressing resources. The second set was four basic questions, all primarily targeted at implementation, repeated for each of the five critical components (a total of 20 implementation questions in the second part of the survey). Additionally, the survey included open-ended questions that allowed teachers to provide as much detail as desired in their responses. The questions asked teachers to explain how well aligned the trainings were to their past instructional practices, the extent to which the training required teachers to modify their curricular materials, which elements of the training and follow-up made implementation easy, and which elements of the training and follow-up were difficult.
Interviews. Semi-structured interviews were conducted with teachers (n = 9) in order to gain a more in-depth understanding of their perspectives regarding CM implementation (Merriam, 2014). Interviews provided rich and meaningful data used to understand the different levels of fidelity. Maxwell (2013) suggests that interviews should be used to gain a description of the contextual details that are difficult to uncover by observation alone.
Weiss (1994) and Maxwell (2013) both direct researchers to ask questions specific to the observations. The guidance of Weiss and Maxwell was followed in this study by providing teachers preliminary results based on observations in order to hear their insights as to why certain trends emerged. The quantitative data was used to drive certain aspects of the interviews, such as asking teachers to explain trends. Specific pieces of data were selected that aligned to sub-questions (a) through (c) of RQ1 regarding the degree of implementation and which variables seemed to predict implementation success. Interviewees were presented with three relevant outcomes of the quantitative measures. First, the interviewees were asked to comment on the distribution of the fidelity index by component and overall implementation. Next, the fidelity index, disaggregated by primary subject area, was shown for comment. Finally, interviewees discussed a side-by-side comparison of the indices showing the observations with the scores that were self-reported during the reflections. The indices, as described in the methods of analysis sections below, were displayed as a percentage of points earned for each of the components on the rubric (e.g., all level 3 scores would be displayed as 75%).
Focus groups. Eight focus group sessions were conducted, with approximately five students each and lasting between 45 and 60 minutes. The questions were structured around the students’ opportunities for, and use of, the classroom techniques of CM practices. The students were presented with an age-appropriate definition of the goal, along with a few sample tasks for each of the five critical components. For example, the goal statement for component (5), Academic Writing Support, was presented as “Teachers are trying to help you write like professionals in each of your subjects.” The sample task was the use of sentence frames to provide evidence for an argument. Students were asked questions like “How have you used sentence frames in your different classes?” Students were also given the opportunity to follow up on these answers with more open-ended questions, such as “Did you find sentence frames helpful in completing your assignments?” Similarly, when investigating component (3), Oral Language Practice, students were asked about their opportunities to talk to each other during class. For example, they were asked if they were able to choose their own groups, if they had used the “appointment clock,” and whether they were taught different techniques in (active) listening. Similar prompts and examples were provided for each of the five critical components.
Procedures
The procedures described below took place during a six-week interval beginning in early February 2015. The observations, reflections, and surveys were completed in the initial weeks, followed by the teacher interviews and student focus groups.
Observations. Each teacher was observed by one of the two district instructional coaches for one 20-minute interval of a lesson. As described above, E. L. Achieve certified the observers through additional training to support implementation. The observers were also classified as certified teachers and not as administrators, ensuring that their presence would not be used for job performance evaluations. The observers maintained confidentiality by using codes instead of teachers’ names in all documentation. The observations were preannounced but not necessarily scheduled.
Teachers were able to choose from certain blocks of dates and times, but did not know the exact time of the observation. The goal of this scheduling system was to eliminate activities that would prevent observation of instructional practices, such as tests or guest speakers, while also attempting to see “regular” practice. The observers also attempted to ensure that an approximately equal number of observations occurred during the beginning, middle, and end of the class period. However, due to logistical limitations, approximately 20% of all observations occurred during the beginning third of the class, 40% in the middle, and 20% during the closing third.
Teacher reflections (self-evaluations). The teachers were asked to self-assess their typical practices by using the Refining Our Practices Rubric, which was distributed to every teacher via Google Forms. Each teacher was assigned a code number, allowing their evaluation data to be linked with their classroom observation without requiring individual names to be used. All information was kept confidential in order to increase confidence in the formative rather than potentially evaluative nature of this study. Teachers are more likely to participate authentically if they have confidence that their results will not be connected to their names without prior permission (Fink, 2013). The instructional coaches were the only people with a master list of names and numbers, and did not disseminate any identifiable information, as required by the University’s Institutional Review Board.
Surveys. The teacher surveys were distributed with the self-evaluation form discussed above. The principal of the school allocated an hour of staff development time to complete the teacher reflection and survey; however, participation was kept optional. By allotting time for survey completion, participants may have been more likely to complete the survey than if they were asked to complete it on their own time. The intent was that, through careful communication, teachers would recognize this survey as an opportunity to have their voice drive PD planning and they would take the time to respond thoughtfully. Initial participation was 30%, so a reminder email was sent asking teachers to complete the survey within a week of the allotted completion time. Following the month that the survey was open, a final open-ended survey question was sent to all teachers asking them if they would explain why they did not complete the survey, if applicable. The purpose of the open-ended question was to gain an understanding of why individuals chose to not participate in order to better understand a possible source of sampling bias (Fink, 2013). Only two teachers replied, and they simply stated that they lacked time to complete the survey. The process described above resulted in a total 43% completion rate for the survey.
Interviews. The instructional coaches provided a list of nine teachers, three from each third of the implementation distribution that was derived from the observations. The self-evaluations and surveys were not used to generate the interview list because of the low response rate. The distribution was constructed by ranking individual teachers in order by their overall implementation index, and then dividing the list into three groups. The resulting list included three groups of 10 teachers each. A random number generator was then used to select three teachers from each group for interviews.
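For illustration only, the sketch below shows one way the tertile-based selection just described could be reproduced; the study did not necessarily use code for this step, and the teacher codes and index values in the example are hypothetical.

```python
import random

# A minimal sketch (not the study's actual procedure) of the interviewee selection
# described above: teachers are ranked by their overall implementation index,
# split into thirds, and three teachers are drawn at random from each third.
# `fidelity_by_teacher` is a hypothetical mapping of teacher code -> overall index (%).

def select_interviewees(fidelity_by_teacher, per_group=3, seed=None):
    rng = random.Random(seed)
    ranked = sorted(fidelity_by_teacher, key=fidelity_by_teacher.get, reverse=True)
    third = len(ranked) // 3
    groups = [ranked[:third], ranked[third:2 * third], ranked[2 * third:]]
    return [rng.sample(group, per_group) for group in groups]

# Example with 30 coded teachers and simulated index values.
example = {f"T{i:02d}": random.uniform(20, 90) for i in range(1, 31)}
print(select_interviewees(example, seed=1))
```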
The intent was to obtain a range of responses; however, the identities of the teachers remained confidential. One hour was allocated for each interview. Interviews were conducted in person at the school. The interviews began with a simple introduction of the purpose and overall design of the study. Teachers were reminded that the survey information would remain confidential and not be used for any job performance evaluations. The interviews followed a semi-structured script, including open-ended questions and a discussion of the quantitative data results as described above. The audio of the interviews was recorded using Audacity software and transcribed by the Casting Words Transcription Service. The transcripts were then loaded into ATLAS.ti software for analysis.
Student focus groups. Fifteen students from each grade level (6–8) were selected at random to be invited to participate in focus groups. The grade-level groups were then organized into three groups each, consisting of approximately 5 students. Students were first notified in their homeroom classes and given a written description of the study, as well as a consent form for their parents to sign. Phone and email communication was encouraged between the students’ families and the principal investigator. All parents of the 45 selected students agreed to the participation, and focus groups were scheduled during homeroom to limit the disruption to instruction. The students were called to the office by the school secretary just before their scheduled focus groups. Each group met in a central conference room and sat around a circular table. The researcher facilitated the focus groups in a casual and welcoming tone, first asking each question to the group and then ensuring that each student had the explicit opportunity to respond to each question. The audio of the focus groups was digitally recorded, transcribed, and coded in the same manner as the teacher interviews.
Methods of Analysis
Research Question 1. How successfully has the faculty of a local middle school implemented the critical components of Constructing Meaning training? The analysis of RQ1 was based on data gathered from the Refining Our Practices Rubric. The rubric provided data from two data collection techniques, the observations by coaches and the reflections by teachers. These tools provided two sets of scores for each of the critical components of CM training. Each of the five critical components was evaluated using four indicators, and each indicator was measured using a four-point Likert scale (1–4). For each critical component, then, there were 16 possible points (four for each of the four indicators). The sum of the component scores yielded the overall index of fidelity. The scores from the observations and the scores for the reflections were calculated using identical techniques, and kept separate for comparison. In Chapter IV, the results are reported for both the component scores and the overall indices using descriptive and inferential statistics. The purpose of the quantitative analysis was to identify the degree to which the components had been implemented, the variation in implementation across the site, and to compare the implementation to the predetermined standard of 75%. In addition, the data gathered were used during interviews to prompt teachers to describe the nuances of the implementation, judge the overall implementation (RQ1), and reflect on the conditions that support or hinder implementation (RQ2).
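To make the index arithmetic above concrete, the following sketch computes a component index and the aggregate fidelity index as the percentage of possible rubric points earned. It is illustrative only (not the study's analysis code), the rating values are hypothetical, and it simply shows that uniform level 3 ratings correspond to the 75% threshold.

```python
# Illustrative sketch of the index arithmetic described above (not the study's code).
# `ratings` is a hypothetical mapping of component name -> list of four indicator
# scores on the 1-4 scale from a single observation or reflection.

def component_index(scores, max_per_item=4):
    """Percent of possible points earned for one component (4 items, 16 points)."""
    return 100 * sum(scores) / (max_per_item * len(scores))

def overall_index(ratings, max_per_item=4):
    """Aggregate fidelity index: percent of all possible rubric points earned."""
    total = sum(sum(scores) for scores in ratings.values())
    possible = max_per_item * sum(len(scores) for scores in ratings.values())
    return 100 * total / possible

ratings = {
    "Understanding Backward Design": [3, 3, 3, 3],
    "Language as a Part of Content Teaching": [3, 3, 3, 3],
    "Oral Language Practice": [3, 3, 3, 3],
    "Interactive Reading and Note-Taking": [3, 3, 3, 3],
    "Academic Writing Support": [3, 3, 3, 3],
}
print(overall_index(ratings))  # 75.0 -- "all level 3s" corresponds to the 75% threshold
```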
Descriptive statistics were used to compare the scores assigned by the instructional coaches with the scores given by the teachers. Comparing the scores on identical sections of the rubric made it possible to determine whether the perceptions of teachers differed from the practices observed by coaches.
Research Question 2. What are the conditions that favor or hinder a high degree of implementation fidelity of Constructing Meaning practices? Surveys and teacher interview data were collected to answer this question. As suggested by Creswell (2014), qualitative data were analyzed as follows. Open-ended survey question responses, interview transcripts, and student focus group transcripts were organized in ATLAS.ti, a qualitative coding program. The survey and interview data were organized by the teacher characteristics of subject area taught, years of experience, and overall level of fidelity as indicated by the observations. The data were coded using both preset and emerging codes. The preset codes were based on the professional development standards identified by Learning Forward and included leadership, resources, content knowledge, student ability, and quality of training. In order to account for bias in selecting these codes, themes were added that emerged from the data analysis. The codes were further organized into a small number of themes (3–4). Themes were then interpreted and verified using the member checking technique described below.
Threats to Reliability and Validity
The methods described above introduce several threats to the reliability and validity of the data obtained. The lack of psychometric data available for the Refining Our Practices Rubric and the absence of inter-rater agreement and training prior to this study present a significant threat to the quantitative data. The qualitative instruments were developed solely for the case study presented here and introduce threats to validity and reliability. In the following sections, the threats are further described and attempts to limit their impacts are identified.
Reliability. There are two significant areas of concern regarding reliability in this study. The first reliability concern stems from two different observers completing the observations. The concern is that the raters may not have consistently applied the rubric to their observations. In an attempt to increase inter-rater reliability, the observers practiced using the rubric with prerecorded video examples of lessons employing CM strategies obtained from E. L. Achieve. The observers then performed five live pilot observations together, with each classroom visit lasting approximately 20 minutes. The teachers being observed knew that the coaches would be visiting their classes during a specific week, but did not know the exact period of the day. The coaches met prior to the observations to review the rubric, watch example video clips provided by E. L. Achieve, and discuss possible “look-fors” for each indicator. The observers followed a routine for completing and rating the teachers during the pilot observations. They would both observe the teacher, take any necessary notes, and determine ratings independently. Following the observation and independent analysis, the coaches would discuss what they observed and compare their scores. The coaches would then agree on a final rating for the teacher to be used in the reliability analysis. Results were analyzed for inter-rater reliability using percent agreement and Cohen’s kappa (Morgan et al., 2013).
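As an illustration of the two calibration statistics named above, the sketch below computes percent agreement and Cohen's kappa from a pair of hypothetical rater score lists; it is not the analysis code used in the study, and the ratings shown are invented for the example.

```python
from collections import Counter

# Illustrative sketch of percent agreement and Cohen's kappa for two raters'
# paired rubric ratings (hypothetical data, not the study's calibration results).

def percent_agreement(r1, r2):
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)

def cohens_kappa(r1, r2):
    n = len(r1)
    p_o = percent_agreement(r1, r2)                            # observed agreement
    c1, c2 = Counter(r1), Counter(r2)
    categories = set(r1) | set(r2)
    p_e = sum((c1[k] / n) * (c2[k] / n) for k in categories)   # chance agreement
    return (p_o - p_e) / (1 - p_e)

rater1 = [3, 3, 2, 4, 3, 2, 3, 4, 2, 3]
rater2 = [3, 2, 2, 4, 3, 2, 3, 3, 2, 3]
print(percent_agreement(rater1, rater2), round(cohens_kappa(rater1, rater2), 3))
```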
The second significant reliability threat follows from the instrument. The rubric is designed so that the four items within each component are weighted equally to provide an overall component score. The reliability threat stems from the possibility that the items do not measure the same construct. As previously stated, the developers of the rubric do not publish psychometric data for the tool. Therefore, the reliability of the instrument was analyzed using Cronbach’s alpha (Cronbach, 1951; Tavakol & Dennick, 2011). The results of the reliability analysis are presented in Chapter IV.
Construct validity. Mowbray (2003), Bond (2000), and Century (2010) each present methodologies for designing and validating fidelity studies. Both Mowbray and Bond suggest using experts to determine the components to be measured. E. L. Achieve developed the components and evaluation rubric used in this study. The question then becomes, does the rubric provide a valid assessment of each implementation component? As stated earlier, the rubric used in this study has not been formally evaluated. The only indication of validity is the face validity claimed by the developers and supported by the instructional coaches involved in this study. Face validity refers to whether or not a measure seems or appears to be valid as determined by the individuals using it (Babbie, 2007). However, as the name implies, face validity is only a superficial indicator of validity and is not robust enough to provide significant confidence in the measures (Thorndike & Thorndike-Christ, 2011). The lack of confidence in the validity of the rubric is a study limitation that will be addressed in the discussion.
Qualitative validity. Qualitative validity is related to the accuracy, integrity, and credibility of the study (Creswell, 2013; Maxwell, 2005). The accuracy of the current study has been strengthened through triangulation of data, the integrity through negative information, and the credibility through member checking.
Triangulation. Data for the qualitative portion of the study were obtained from the following sources: open-ended survey questions, interviews with teachers, and student focus groups. Each of the data sources was targeted to explore the results of the quantitative portion of this study. The variety of data sources strengthens the validity of the study by reducing chance associations and biases of a single measure (Maxwell, 2013). For example, open-ended survey responses can indicate emerging themes that are relevant to the study. The themes can be confirmed (or refuted) through interviews and focus groups.
Discrepant information and negative cases. Qualitative studies often uncover both positive and negative information. Investigating and specifically reporting findings that may not align with researchers’ desired outcomes (discrepant information and negative cases) is a key strategy in strengthening qualitative validity (Maxwell, 2013). The inclusion of such data indicates that the researcher has presented a comprehensive analysis of the data. Generally speaking, based on personal communication with teachers, coaches, and school and district administrators, there was a desire to see a high level of implementation across all components of CM training. However, as presented and discussed in Chapters IV and V, data that suggested areas for improvement were also included in this study.
Member checking.
A substantial amount of evidence derived from the qualitative surveys and interviews with teachers needed to be interpreted for analysis. The strategy of member checking was used in an attempt to prevent misinterpretation of teachers’ statements. Member checking allows participants to comment on the findings and report whether they agree with the theories that have developed (Creswell & Plano Clark, 2011). Preliminary summaries of each interview, along with developing theories, were shared with teachers. Teachers were asked to clarify and comment on the findings. Maxwell (2013) considers member checking to be the most important method for eliminating misconceptions and uncovering bias in qualitative analysis. The research design, data collection methods, and analytic procedures described above were designed to investigate the research questions presented in Chapter II. The quantitative measures utilized in phase one provided the data to determine the degree of implementation and the variation in implementation across the site. In addition, the initial analysis of the quantitative data was used during the qualitative interviews. Finally, qualitative data were used to describe the conditions that favored or hindered implementation. The results of the study are presented in Chapter IV.
CHAPTER IV
RESULTS
Chapter IV is organized by the two research questions addressed in this study. RQ1 investigated the success of implementation through the evaluation of the implementation of critical components, the variability of implementation, and a comparison with a developed standard. RQ1 was addressed using primarily quantitative data. RQ2 investigated the conditions that favored or hindered successful implementation and was addressed using primarily qualitative data.
RQ1: Success of Implementation of CM Practices
RQ1 investigated the success of implementation by constructing fidelity indices from the observation and reflection data collected with the Refining Our Practices Rubric. The investigation included a reliability analysis of the rubric and the computation of fidelity indices for the components and the aggregate. Results of the calibration observations. Informal phone interviews with the observers revealed that they felt they were “generally on the same page” in regard to the ratings. They stated that they felt discrepancies likely occurred due to observing different parts of the class, such as one rater watching the teacher and the other focusing on student actions, rather than a different interpretation of the same observation. For example, one rater mentioned that in the first observation, she found herself focusing on a particular group of students and missed the teacher providing direct instruction related to interactive note-taking. During the debriefing, the second rater brought the missing evidence to her attention, and she agreed a higher rating would have been more appropriate. Another example was described by rater two, who began each observation by scanning the room for evidence of anchor charts displaying sentence frames, word walls, or other student aids. As a result, she did not record any of the verbal instructions or student responses that occurred during those first few minutes. The routine was therefore modified to have observers scan the room at a time when there was a lower chance of missing verbal evidence, such as during silent reading. Quantitative analysis of the calibration data. Inter-rater reliability was estimated using Cohen’s kappa.
Each rater made 100 judgments during the calibration process, which resulted in a kappa of .607, indicating a moderate level of agreement (Cohen, 1960; McHugh, 2012).
Internal Consistency
Internal consistency was estimated using Cronbach’s alpha (α), inter-item correlations, and recalculation of Cronbach’s alpha with each item removed. Following the complete analysis of the reliability data, one item was excluded from all analyses. Cronbach’s alpha for each component. The Refining Our Practices Rubric includes a separate set of items for each component; in effect, the rubric is a collection of five rubrics, one for each critical component. Therefore, the rubric was treated as five unique tests, one for each component, with four items each (n = 4) when calculating Cronbach’s alpha. Alpha values ranged from .290 to .837. A general rule of thumb (Gliem & Gliem, 2003) suggests that an alpha value > 0.9 is excellent, > 0.8 is good, > 0.7 is acceptable, > 0.6 is questionable, > 0.5 is poor, and < 0.5 is unacceptable. As can be seen in Table 1, scores on component (1), Understanding Backward Design (UBD), and component (2), Language as a Part of Content Teaching (LPCT), had alpha values corresponding to “good” reliability (α = .803 and α = .837, respectively), while component (3), Oral Language Practice (OLP), and component (4), Interactive Reading and Note-Taking (IRNT), were on the border of “questionable/poor,” with α = .606 and α = .597, respectively. Component (5), Academic Writing Support (AWS), demonstrated the lowest reliability at α = .290. Inter-item correlation analysis. An inter-item correlation analysis was also conducted for each item within each component. It was expected that all items for a particular component would show acceptable agreement with all other items. The highest possible inter-item correlation, however, is not necessarily the most desirable situation: correlations that are too high may indicate repetition between items and a narrow illustration of the desired construct (Tavakol & Dennick, 2011). Various but similar “rules of thumb” appear in the literature; generally, inter-item correlations are acceptable above 0.25 and below 0.70 (Briggs & Cheek, 1986; Clark & Watson, 1995). As can be seen in Table 1, all items for component (1), Understanding Backward Design, and component (2), Language as a Part of Content Teaching, showed positive and acceptable correlations between items. The items measuring component (3), Oral Language Practice, all displayed positive correlations, although items three and four correlated with each other at a low level (.123), and their mean inter-item correlations fell below the .25 threshold, at .215 and .222, respectively. For component (4), Interactive Reading and Note-Taking, items one and four showed a negative correlation, and item four’s mean inter-item correlation also fell below the threshold at .175. The indicators of component (5), Academic Writing Support, showed the lowest correlations between items. Each Academic Writing Support indicator showed at least one negative relationship with the other items. Academic Writing Support item three performed particularly poorly, having a negative mean overall inter-item correlation (-.021). Cronbach’s alpha with items deleted. Following the correlation analysis, Cronbach’s alpha was recalculated for each component while removing each item and including only the three remaining items.
The purpose of recalculating was to determine whether the overall alpha for the component increased as a result of removing one of the four items. An increase in the alpha value suggests that removing the item would be beneficial for component reliability. Table 2 displays all alphas with items deleted compared to the original alpha that included all items. Removing a particular item led to a decrease in the alphas in 13 of the 16 possible cases within components (1) through (4) (UBD, LPCT, OLP, and IRNT), suggesting that those 13 items remain in the analysis. The three alphas that did rise as a result of the removal did so minimally: removing LPCT item four caused an increase of .023; OLP item three, an increase of .008; and IRNT item four, an increase of .028 (see Table 2). Therefore, those three items also remained in the analysis. In component (5), AWS, removing either item two or item three would raise the alpha for the component, to .318 and .619, respectively. Revisions due to reliability analysis. Because AWS item three substantially lowered the level of reliability, it was removed from all further analysis. As can be seen in Table 2, the change resulted in a revised Cronbach’s alpha value of .619, compared to the original value of .290.

Table 1
Summary of inter-item correlations within each critical component

                     Minimum   Maximum    Mean
UBD Item One           .331      .505     .442
UBD Item Two           .402      .685     .526
UBD Item Three         .505      .685     .605
UBD Item Four          .331      .625     .452
LPCT Item One          .458      .864     .633
LPCT Item Two          .375      .864     .600
LPCT Item Three        .520      .578     .553
LPCT Item Four         .375      .458     .451
OLP Item One           .212      .512     .320
OLP Item Two           .286      .512     .377
OLP Item Three         .123      .286     .215
OLP Item Four          .123      .333     .222
IRNT Item One         -.019      .615     .273
IRNT Item Two          .223      .431     .305
IRNT Item Three        .112      .615     .329
IRNT Item Four        -.019      .431     .175
AWS Item One          -.058      .831     .340
AWS Item Two          -.025      .247     .068
AWS Item Three        -.058      .012    -.021
AWS Item Four         -.025      .831     .270

Table 2
Revised Cronbach’s alpha values with specific items deleted

Understanding Backward Design (original Cronbach’s alpha = .803)
                     Alpha with item deleted
UBD Item One           .799
UBD Item Two           .736
UBD Item Three         .672
UBD Item Four          .787

Language as a Part of Content Teaching (original Cronbach’s alpha = .837)
LPCT Item One          .721
LPCT Item Two          .747
LPCT Item Three        .804
LPCT Item Four         .860

Oral Language Practice (original Cronbach’s alpha = .606)
OLP Item One           .499
OLP Item Two           .399
OLP Item Three         .614
OLP Item Four          .599

Interactive Reading and Note-Taking (original Cronbach’s alpha = .597)
IRNT Item One          .524
IRNT Item Two          .504
IRNT Item Three        .420
IRNT Item Four         .625

Academic Writing Support (original Cronbach’s alpha = .290)
AWS Item One          -.100
AWS Item Two           .318
AWS Item Three         .619
AWS Item Four          .019

Fidelity of Implementation Index by Component
As described in Chapter II, the degree to which implementation varied across the components was used to evaluate the implementation of CM practices. The index of fidelity was constructed for each component by calculating the percentage of points earned from the four items. All of the components except Academic Writing Support (AWS) have 16 points possible (four points from each of the four items). AWS had 12 total points possible, since item three was removed following the reliability analysis. Table 3 displays the percentage of implementation listed by component. Percentages were used instead of raw scores due to the varying total possible score across components.

Table 3
Index of fidelity by critical component (n = 30)

                                          Minimum (%)   Maximum (%)   Mean (%)     SD
Understanding Backward Design                 37.50        100.00       66.67    17.70
Language as a Part of Content Teaching        31.25        100.00       55.00    21.17
Oral Language Practice                        25.00         68.75       40.83    13.10
Interactive Reading and Note-Taking           25.00         75.00       50.00    13.23
Academic Writing Support                      25.00         83.33       42.22    18.17
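The reliability statistics and the component fidelity indices reported above can all be derived from an items-by-teachers score matrix. The following sketch is illustrative only: it assumes a 30 × 4 matrix of item scores on a 1–4 scale (inferred from the 16-point component totals), generates hypothetical data, and does not represent the software or code actually used in the study.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for an (n_respondents, n_items) score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

def alpha_if_item_deleted(scores):
    """Recompute alpha with each item removed in turn."""
    return [cronbach_alpha(np.delete(scores, i, axis=1)) for i in range(scores.shape[1])]

def fidelity_index(scores, max_per_item=4):
    """Percent of possible rubric points earned, per teacher."""
    possible = max_per_item * scores.shape[1]
    return 100.0 * scores.sum(axis=1) / possible

# Hypothetical component: 30 teachers x 4 items, each item scored 1-4
rng = np.random.default_rng(1)
base = rng.integers(1, 5, size=(30, 1))                       # teacher-level tendency
component = np.clip(base + rng.integers(-1, 2, size=(30, 4)), 1, 4)

print(f"alpha = {cronbach_alpha(component):.3f}")
print("alpha with item deleted:", [round(a, 3) for a in alpha_if_item_deleted(component)])
idx = fidelity_index(component)
print(f"component index: mean = {idx.mean():.2f}%, SD = {idx.std(ddof=1):.2f}")
```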
Overall, Understanding Backward Design was implemented with the highest level of fidelity (66.67%). In contrast, Oral Language Practice was implemented at the lowest level (40.83%). The level of implementation also varied within the components. As can be seen in Table 3, Oral Language Practice was implemented with the least variance between teachers (SD = 13.10), while Language as a Part of Content Teaching was implemented with the most variance between teachers (SD = 21.17). Overall index of fidelity. The overall index of fidelity is simply the percentage of all points awarded on the rubric, excluding AWS item three. The minimum fidelity index, as a percent, was 34.21%, and the maximum was 78.55%. The mean and median scores were 51.40% and 48.69%, respectively (SD = 11.51).
Index of Fidelity by Predictor Variable
Fidelity rates were examined as a function of the predictor variables described in Chapter III: years of teaching experience, primary subject area, and latency since training. Years of teaching experience. The participants included teachers ranging from 1 to 25 years of experience, with a mean of 9.85 years. The indices were analyzed for a relationship to years of experience. Table 4 displays positive correlations between years of teaching experience and the overall index as well as four of the components, although only one was statistically significant at the .05 level (IRNT, r = .455, p = .012). The only component to show a negative correlation was Oral Language Practice (r = -.068). These results suggest that for all of the components except Oral Language Practice, as years of teaching experience increased, the level of implementation also increased.

Table 4
Pearson’s correlation between years of teaching experience and fidelity of implementation

                   UBD      AWS      IRNT     LPCT      OLP     Overall
r                 .320     .318     .455*    .161     -.068      .339
Sig. (2-tailed)   .084     .087     .012     .396      .719      .067

Note: * = significant at the .05 level.
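Each entry in Table 4 is a bivariate Pearson correlation between teachers’ years of experience and a fidelity index, with a two-tailed significance test. A minimal sketch follows, using hypothetical values for the 30 participants rather than the study data.

```python
import numpy as np
from scipy import stats

# Hypothetical paired values for 30 participants
rng = np.random.default_rng(2)
years_experience = rng.uniform(1, 25, size=30)
overall_index = 40 + 0.5 * years_experience + rng.normal(0, 10, size=30)

# Pearson's r and its two-tailed p-value
r, p = stats.pearsonr(years_experience, overall_index)
print(f"r = {r:.3f}, p (2-tailed) = {p:.3f}")
```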
Subject area taught. Fidelity indices were compared between teachers’ primary subject areas. Table 5 displays the number of teachers in each subject area; the small group sizes are further addressed in Chapter V. As can be seen in Figure 3, humanities teachers implemented AWS (62.5%) and IRNT (54.7%) more than the other subject areas. Special education teachers demonstrated the highest implementation in OLP (52.1%) and LPCT (68.8%). In UBD, physical education (PE) was the highest (81.25%), although there was only one teacher in the group. Conversely, art teachers showed the lowest level of implementation in all components except IRNT, in which the PE teacher showed the lowest level of implementation. Table 5 also displays the overall index of fidelity by primary subject area. As can be seen, the humanities and special education teachers showed the highest degree of implementation at close to 60% each, while science and art showed the lowest at 44.57% and 39.47%, respectively. The statistical significance of the differences was tested using a one-way analysis of variance.

Figure 3. Implementation of each critical component by subject area. [Bar chart of the five component indices (UBD, AWS, IRNT, LPCT, OLP) for each subject area listed in Table 5; figure not reproduced.]

Table 5
Index of fidelity of the overall intervention by teachers’ primary subject area

                       n     Index (%)     SD
Math                  10       52.76      11.33
Humanities             4       60.20      13.13
Science                8       44.57       3.62
Physical Education     1       57.90        **
Special Education      3       59.21      19.74
Art                    2       39.47       7.44
ESL                    2       51.32       5.58

Analysis of variance by subject area taught. The descriptive statistics revealed variability in implementation by teachers in different subject areas. A one-way ANOVA was conducted to compare teachers’ primary subject area with the implementation index means for each of the critical components, as well as the overall index, in order to determine whether there were statistically significant group differences. Due to the small sample size, teachers were combined into three groups for the ANOVA: group one was humanities teachers (n = 6), group two was science or math teachers (n = 18), and group three was electives teachers (n = 6). Only one statistically significant result was obtained: there was a significant relationship between subject area and the implementation of Academic Writing Support [F(2, 27) = 3.523, MSR = 28.873, p = 0.044]. Table 6 shows that post hoc comparisons using the Tukey HSD test indicated that the mean score for ELA/humanities teachers was significantly different from that of science/math teachers. There was no statistically significant difference between ELA/humanities and elective teachers or between science/math and elective teachers.

Table 6
Tukey HSD post hoc test

                                    Mean Difference   Std. Error   Significance
ELA/humanities vs. Math/science         20.83333*       7.90509        .036
ELA/humanities vs. Electives            18.05556        9.68172        .168
Electives vs. Math/science               2.77778        7.90509        .934

Note: * = significant at the 0.05 level.

Latency since training. The participants each completed the training within a two-year window. In order to investigate differences in implementation due to the amount of time elapsed since the training, the teachers were organized into three cohorts. Cohort one completed the training during the 2013–2014 school year (n = 9), cohort two completed it during the following summer (n = 8), and cohort three (n = 13) completed it during the 2014–2015 school year. Fidelity indices were examined by critical component and overall implementation. As can be seen in Table 7, each cohort showed its greatest level of fidelity in Understanding Backward Design. In the overall index, however, there was a minimal (< 4%) difference in the implementation index among cohorts. Additionally, a one-way analysis of variance did not show a statistically significant relationship between the different training cohorts and the implementation of any specific component or the overall index.

Table 7
Index of fidelity organized by training cohort

            n    UBD (%)   SD      AWS (%)   SD      IRNT (%)   SD      LPCT (%)   SD      OLP (%)   SD      Overall (%)   SD
Cohort 1    9     74.31   17.52     38.89   22.05     47.22    15.02     53.47    24.03     40.28   11.31      51.46      13.23
Cohort 2    8     67.19   22.60     47.92   16.52     43.75    12.05     65.63    16.02     41.41   17.01      53.45      10.93
Cohort 3   13     61.06   13.30     41.03   16.83     55.77    10.96     49.52    20.96     40.87   12.66      50.10      11.35
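The group comparisons reported above, whether by collapsed subject-area group or by training cohort, follow the same pattern: a one-way ANOVA on the fidelity indices followed by Tukey HSD post hoc contrasts. The sketch below illustrates that pattern with invented group data; statsmodels is used here simply as one convenient option for the Tukey step and is not necessarily the software used in the study.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical AWS fidelity indices (%) for three collapsed subject-area groups
humanities = np.array([58, 67, 71, 55, 62, 60], dtype=float)
math_science = np.array([40, 35, 42, 48, 37, 41, 44, 39, 36, 45,
                         43, 38, 47, 34, 40, 42, 39, 46], dtype=float)
electives = np.array([44, 52, 38, 49, 41, 47], dtype=float)

# Omnibus test: do the three group means differ?
f_stat, p_val = stats.f_oneway(humanities, math_science, electives)
print(f"F = {f_stat:.3f}, p = {p_val:.3f}")

# Post hoc pairwise comparisons with Tukey's HSD
scores = np.concatenate([humanities, math_science, electives])
groups = (["ELA/humanities"] * len(humanities)
          + ["Math/science"] * len(math_science)
          + ["Electives"] * len(electives))
print(pairwise_tukeyhsd(scores, groups, alpha=0.05))
```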
Teacher Reflection and Survey
The electronic document that contained both the teacher reflection and the survey was distributed to all participants. Thirteen teachers (43%) responded during the month that the tool was open. Five of the respondents had been teaching 16 or more years, four between 11 and 15 years, and four between 6 and 10 years. Five primarily teach math, five language arts, and three science. Eight teachers participated in CM training during the current school year, four the prior year, and one took the training two years prior. One teacher maintains a state teaching license in English for Speakers of Other Languages (ESOL). Teachers’ self-scoring on the Refining Our Practices Rubric, by component. The fidelity indices displayed previously in this chapter were calculated from the observations conducted by the instructional coaches. The observers could only score the rubric based on what they directly saw or heard, and could have missed evidence of implementation that occurred prior to or following their visits. The teachers’ self-reflections on the rubric were therefore analyzed in order to include evidence of implementation that may not have been otherwise observable. An index of fidelity was calculated from the reflection scores only, using the same technique as for the observational data, including dropping AWS item three. Table 8 displays the indices by component as well as overall. As outlined in the table, teachers scored themselves the highest in UBD and LPCT, 63.94% and 65.86%, respectively. Although the numerical values for the indices are different, those were also the components scored highest during the observations.

Table 8
Index of fidelity based on teacher reflections

                                          Minimum (%)   Maximum (%)   Mean (%)   Median (%)     SD
Understanding Backward Design                 50.00         75.00       63.94       62.50       9.59
Language as a Part of Content Teaching        43.75         81.25       65.86       62.50      15.43
Oral Language Practice                        43.75         81.25       58.65       56.25      18.31
Interactive Reading and Note-Taking           56.25         93.75       10.92       62.50       9.70
Academic Writing Support                      33.33         91.66       53.20       50.00      15.79
Overall Implementation                        47.36         88.15       62.45       61.84      11.49

A comparison of the observed and self-reported scores, along with a test of statistical significance, is presented below. Although the sample size (n = 13) was too small to achieve sufficient statistical power, the observations and participant reflections were analyzed to determine whether any trends emerged that could be recommended for a future, more comprehensive study. A paired sample t-test was used to compare the means for each critical component and the overall index. As illustrated in Figure 4 and detailed in Table 9, the means of the observed and reflection scores were statistically different (p < .05) for AWS (MD = -14.74, p = .04), IRNT (MD = -14.01, p = .03), and OLP (MD = -13.08, p = .01).

Figure 4. Comparison of scores from observations and reflections. [Bar chart of the index of fidelity (%) for each critical component and the overall index, observations versus reflections; figure not reproduced.]

Table 9
Paired sample t-test of the observation and reflection indices

                          Mean Difference     SD     95% CI Lower   95% CI Upper   Correlation      t     df   Sig. (2-tailed)
UBDobs – UBDref                 7.69        18.07        -3.22          18.61          0.12        1.53   12        0.15
AWSobs – AWSref               -14.74*       23.11       -28.71          -0.78          0.06       -2.30   12        0.04
IRNTobs – IRNTref             -14.04*       15.60       -23.85          -5.00          0.18       -3.33   12        0.01
LPCTobs – LPCTref              -4.08        31.16       -23.63          14.02         -0.28       -0.56   12        0.59
OLPobs – OLPref               -13.08*       19.41       -25.19          -1.73          0.26       -2.50   12        0.03
Overallobs – Overallref        -8.14        16.91       -18.36           2.07         -0.03       -1.74   12        0.11

Note: * = significant at the 0.05 level.
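Because each teacher contributes both an observed index and a self-reported index, the comparison in Table 9 is a paired-samples t-test on the within-teacher differences. A minimal sketch follows; the 13 paired values are hypothetical stand-ins, not the study data, and the study’s analysis was not necessarily produced with this code.

```python
import numpy as np
from scipy import stats

# Hypothetical paired indices (%) for 13 survey respondents
observed  = np.array([45, 52, 38, 60, 41, 48, 55, 36, 50, 43, 58, 47, 40], dtype=float)
reflected = np.array([58, 63, 55, 66, 50, 62, 60, 49, 64, 57, 70, 59, 52], dtype=float)

# Paired-samples t-test on the within-teacher differences
t_stat, p_val = stats.ttest_rel(observed, reflected)
mean_diff = (observed - reflected).mean()
print(f"mean difference = {mean_diff:.2f}, t({len(observed) - 1}) = {t_stat:.2f}, p = {p_val:.3f}")
```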
RQ2: Conditions That Favor or Hinder Successful Implementation
The goal of RQ2 was to determine the conditions that supported or hindered the implementation of CM practices. The investigation of RQ2 included exploring the teachers’ perceptions of the overall training and their perceptions of implementing the practices at the classroom level, through both open- and closed-ended survey questions. The surveys yielded traditional qualitative data as well as numerical summaries of responses. Additionally, teachers’ perceptions of the initial analysis of quantitative data and of CM concepts in general were discussed during semi-structured interviews. Finally, focus groups with students were conducted to understand their experiences with CM practices.
Teachers’ Perceptions of CM Training
Nine of the survey questions were closed-ended, asking teachers to evaluate their overall experience with CM trainings. Thirteen (43%) of the teachers participating in the study chose to complete the survey and reflection. As shown in Table 10, 100% of respondents felt that the use of CM practices would have a positive impact on student outcomes. Seventy-seven percent felt that the leadership at the school was able to demonstrate that CM was a priority. Accordingly, 92% of respondents disagreed that the school should be setting different priorities for professional development. Ninety-two percent of respondents also felt that they were able to make connections between the collaborative “learning team” model and the practices associated with CM. However, only 76% agreed that they were able to collaborate with the learning team on CM practices. Ninety-two percent of respondents disagreed that CM practices are too time-consuming to implement; however, only 39% felt that there was time during the school day to work on implementation.

Table 10
Summary of closed-ended survey questions

                                                                          Strongly Agree (%)   Agree (%)   Disagree (%)   Strongly Disagree (%)
I was able to make direct connections between CM training
and my learning team.                                                              46              46             1                 0
I was able to collaborate with other members of my learning
team regarding CM practices.                                                       46              30            23                 0
My school leaders demonstrate that CM practices are a priority.                    15              62            23                 0
My school leaders were able to adequately allocate resources
needed for CM implementation.                                                      15              46            30                 8
In my school, there is time available to me, during the school
day, to plan for CM implementation.                                                 0              39            46                15
My input was solicited on the allocation of resources (time,
consultation, learning materials) for CM implementation.                            0              31            62                 8
I anticipate that my use of CM practices will have a positive
impact on student outcomes.                                                        31              69             0                 0
I feel CM practices are too time-consuming for me to implement.                     0               8            77                15
I feel the school should have different priorities for
professional development than CM.                                                   0              15            77                 8
Teachers’ perceptions of each critical component of CM practices. Each respondent was asked six Likert scale questions for each of the five critical components. The six questions were identical for each component. In all components, the majority of respondents (> 75%) felt that the practices associated with the critical component were aligned with the subject area they teach, relevant to the students’ needs, and able to be fully implemented within the next two years. There was disagreement regarding whether full implementation would cause a significant change to teaching practices. Sixty-two percent of respondents felt that Language as a Part of Content Teaching would require significant change to current teaching, and only 31% responded similarly to Interactive Reading and Note-Taking. Open-ended survey questions. The survey included four open-ended questions regarding the teachers’ perceptions of the impact of CM training. Teachers were asked to provide evidence of connections to past practice, changes that were required as a result of the training, elements that made the practices relatively easy to implement, and elements that made the practices relatively difficult to implement. Of the thirteen teachers completing the survey, twelve answered the questions (although answers were required for submission, one teacher simply put an “x” in each of the response spaces to move on). Example responses are shown in Table 11. Specific quotations are also provided and interpreted in Chapter V.

Table 11
Example responses to open-ended survey questions

Q1. In what ways were the practices of CM aligned to your practices prior to the training?
In the more broad sense, CM does many things that many new "initiatives" claim to do but don't: it truly does take best practice in terms of instruction (along with the pedagogical framework behind instruction) and streamlines specific ways to more explicitly implement these best practices.

Q2. In what ways did you need to modify your curricular materials to implement CM practices?
CM gave me really good supports or structures to explicitly teach the language I was wanting students to use. I now (occasionally) add sentence frames to my lessons and worksheets especially if I am focusing on EXPLAINING, or justifying a solution. I write frames as a part of my objectives daily.

Q3. Please describe elements of the training and follow-up that made implementation relatively difficult.
Time to collaborate and create lessons that daily incorporate the strategies of CM. Time to digest and recognize more quickly how to change my lessons and instruction to more intentionally teach using CM strategies. Time necessary in limited class periods to instruct students in the use of all strategies, finding supportive math text readings (and the time to implement in class).

Q4. Please describe elements of the training and follow-up that made implementation relatively easy.
Access to others that have completed the training for help. Collaboration with my learning team, which chose increased student talk and writing using academic language as our goals.

Teacher Interviews
Nine open-ended interviews were conducted with teachers participating in the study. To select the interviewees, teachers were listed by their level of implementation, and three teachers from each third of the distribution were selected at random. Of the original nine selected, four of the teachers declined to participate, and alternates were selected at random. A protocol of open-ended questions was followed for the interviews. The protocol was developed to uncover perceptions that would be useful in the analysis of the two overarching research questions. The interviews were scheduled to take place over an hour, in the teacher’s classroom. The protocol began by asking teachers about their perceptions of the training itself, the systems and structures in place at their school relevant to implementation, and how the training has affected their practice. The second portion of the interview asked teachers to comment on the initial analysis of the quantitative data, from both the observations and the reflections. The teacher interview data resulted in themes related to four general categories of comments: collaboration, resources other than time, time as a resource, and general perceptions. Each category of responses yielded groups of emerging themes for analysis. Table 12 displays the frequency of at least one mention of each theme per interview.

Table 12
Frequency of reference to emerging themes in teacher interviews (n = 9)

Category                            Theme                                          Rate of Occurrence
Collaboration                       With job-alike teachers                                 9
                                    In interdisciplinary groups                             7
                                    With CM trainers                                        6
General perceptions of training     Perceived effect on student achievement                 9
                                    Alignment to other priorities (positive)                5
                                    Alignment to other priorities (negative)                4
Resources other than time           Materials                                               6
                                    Follow-up opportunities                                 6
Time as a resource                  Time with students                                      6
                                    Time for development                                    9

Student Focus Groups
Seven of the 45 students selected, as described in Chapter III, did not attend their assigned focus group, presumably due to school absence on the scheduled day of the group. The resulting group consisted of 13 sixth graders, 14 seventh graders, and 11 eighth graders. Twenty were male and eighteen were female. Thirty-eight percent were designated as English Language Learners (either active or monitored).
Each group of students was asked if they knew about Constructing Meaning. CM was described as a training opportunity for their teachers that helped teachers think of different ways to get students to read, write, and talk to each other. Students then looked at five sets of examples of types of assignments or tasks that teachers may have asked them to do, one set for each of the CM critical components. None of the students had heard the terms “CM” or “Constructing Meaning.” Similarly, none of the students reported hearing of a specific training or a new or different way to have students read, write, or talk to each other. During the demonstration of examples of types of assignments for each of the critical components, the only ones students recognized in all of the focus groups were sentence frames (LPCT) and A/B partners (OLP). Students in each group were able to identify clearly when and where they used sentence frames. Humanities classes were mentioned most frequently; however, all subjects were mentioned at least once. A/B partners was also clearly used throughout the subject areas; however, students only described it as a way to organize students into groups, rather than a way to assign different tasks to different group members. Likewise, students were not able to explain why a particular student was assigned as “A” or “B,” other than random selection. Interestingly, three of the six focus groups referenced Cornell notes as a structure for Interactive Reading and Note-Taking. Cornell notes are a specific style of note-taking, requiring students to engage with their notes frequently over the days and weeks following. Although Cornell notes are an example of Interactive Reading and Note-Taking, they are also a significant component of AVID practices, which are also used at the school. Therefore, it was impossible to determine whether students’ exposure to Cornell notes was a result of CM or AVID training.
CHAPTER V
DISCUSSION
In this chapter, the quantitative and qualitative findings are discussed. Following the discussion of results, the conditions that favor or hinder successful implementation of CM training are outlined, and the results are compared to prior related research. At the conclusion of the chapter, study limitations and implications for future research are presented.
Discussion of RQ1: Success of Implementation The success of implementation is a multifaceted concept. In the case of the implementation of CM practices, success was evaluated by the level of implementation fidelity, the variation of implementation between the components, and a comparison of the level of implementation to expected thresholds. Success of the implementation of the critical components. As described in Chapter II, an accepted approach to measuring implementation fidelity involves defining an intervention by its critical components and measuring the implementation of each. The results of the teacher observations indicate that UBD (67%), LPCT (55%), and IRNT (50%) were implemented to the greatest degree, while AWS (42%) and OLP (41%) were implemented with the lowest degree of fidelity. The teachers’ reflections on the rubric revealed similar results, although with different indices, resulting in a different order. IRNT and LPCT were reported at 66% and 64%, respectively. OLP and AWS were also reported lowest by the teachers at 58% and 53.20%, respectively. Considering this data, it is reasonable to conclude that UBD, LPCT, and IRNT were implemented more successfully (with greater fidelity) than OLP and AWS. 80 There is a lack of comparative research in the literature regarding the implementation of the critical components of CM training. However, in published evaluations of other interventions, the variation in the implementation of specific components is to be expected (Dusenbury et al., 2003; McKenna et al., 2014; Mowbray et al., 2003). In a larger but similarly designed study, McHugo et al. (2007) used a mixed- methods design to investigate the implementation fidelity of the five critical components of the National Implementing Evidence-Based Practices Project. The researchers evaluated fidelity of implementation across 53 different sites in eight different states. Similar to the current study, researchers used observations from multiple raters as the source of their quantitative data. Implementation data was collected four times (every six months) over a two-year period. At the first six-month point following the initial implementation, the component implementation fidelity rates ranged from a low score of 20% to a maximum score of 80%. The size of the variation between components did lessen over time, to a low of 30 percentage points between the high and the low scores (55% on the low end and 85% on the high end). These qualitative findings indicated that the components requiring “simple structural” changes were implemented with greater fidelity than those requiring changes to the “expertise” of the practitioners (p. 1283). The researchers in the McHugo study used qualitative interviews to explain their quantitative findings. Four conditions were identified as influencing the varying levels of implementation across the research sites. The researchers identified leadership, prioritization of implementation, complexity of implementation, and the role of the trainer as areas of focus for implementation. Leaders were directed to be actively involved in the implementation process and to seek out direction from the providers to 81 identify areas where support was needed. Researchers suggested that prioritization of implementation stems from an analysis of why the intervention is needed. They proposed that without a clear understanding of the purpose of the intervention, providers would be less likely to implement with fidelity. 
The actions of the providers are one element that makes implementation a complex process, as do the actions of the recipients, and challenges associated with resource allocation. The individual philosophies of the providers, and the range of talents that they have, were found to influence implementation. Likewise, the recipients of the interventions will have varying degrees of enthusiasm for or acceptance of the intervention. Finally, the availability of resources is likely less than the overall need, requiring careful allocation plans that include underfunding certain areas. The role of the trainer was also identified as critical to implementation. The trainer role is complex because they must have a strong knowledge of the intervention itself, while also understanding the local context enough to apply the knowledge most effectively (Torrey, Lynde, Gorman, 2005). Similarly, the current study was designed to find out why certain components were implemented with greater fidelity than others. The findings of the current study were similar to the McHugo study. The implementation data by component were shared with teachers during the semi-structured interviews. The comments were analyzed, looking for explanations for the variation between components. As the interviews progressed, some themes were consistently mentioned. One common theme was a lack of knowledge of the components themselves. Although all of the teachers were able to speak to specific strategies or assignments that had resulted from CM training, none were able to articulate the components by name or 82 even by description. During each of the interviews, following this realization, teachers were provided with a card from CM that describes the goal of and gives examples of each of the components. Using the card as a refresher, teachers were able to speak to the various components and describe possible explanations for the variation in implementation fidelity. Although the teachers were able to discuss the components using the card as a prompt, their lack of memory of the components indicated to the researcher that they did not use the components as a way to implement CM practices. Teachers consistently attributed the higher implementation of Understanding Backward Design to the emphasis that the school’s district places on “standards based learning” and the required use of “learning targets.” The school district has organized virtually every course taught in the district by learning targets, which are specific, student-friendly statements aligned to a particular state standard. These targets serve as the outcome measures for the class and are reposted on all forms of progress communication. As one teacher described, the process of working backward from the outcome is already built into her practice: I told students what they're going to do. What do you want the kids to learn by the end of the class? That's why backward design is so high. We're so target-focused. It's like this is what I need to get the kids [to do]. What is it that I need to do? How can I set up that, and how can I use Constructing Meaning to help me get where I'm going? The higher implementation of Language as a Part of Content Teaching and Interactive Reading and Note-Taking likely resulted from the belief that these components supplement the content of the class. In Language as a Part of Content Teaching, teachers have students interact with the vocabulary of the course of study. Interacting with the vocabulary of the course was not reported as a new practice. 
As one teacher stated, “I already had some basis for academic vocabulary instructions, so CM just gave you more tools around the same conceptual understanding.” Other teachers made reference in the surveys and interviews to the quality and availability of materials. The references to the materials indicate that teachers perceived the CM strategies within these components as tools to help them teach their content. The perception of the materials as helpful tools contrasts with other components that were seen as additional requirements not directly relevant to the content learning goals of the teachers’ classes. There is consensus in the literature that teachers who find a relevant connection between professional development and their teaching assignments are more likely to implement the new practices in their classrooms (Darling-Hammond & Wei, 2009; Garet et al., 2009). However, less agreement exists regarding the difference between subject-specific relevance, such as presenting math methods to math teachers, and school-wide goals, such as literacy across the curriculum (Echevarria et al., 2011). The implementation of CM practices clearly followed the latter by providing specific language instruction to all teachers. In contrast, Academic Writing Support was generally perceived as a method to teach writing. Although teaching writing may not be new to language arts teachers, it can present challenges to those in other disciplines. Teachers from other content areas commented that they were “not writing teachers,” and that the role of writing instruction was “one more thing that [they] need to cram in.” The perception by certain teachers that writing instruction is not a core part of their job likely contributed to the lower implementation of Academic Writing Support. The challenge of integrating writing into non–language arts classes is not new (O’Brien et al., 1995; Vacca & Vacca, 1989). O’Brien and colleagues (1995) presented findings indicating that reluctance toward integration stems from teachers failing to see the benefits to their primary instructional objectives. Teachers are less likely to implement changes that they do not perceive as having a direct impact on their classroom objectives. Recently, however, the Common Core State Standards (CCSS) have identified content area literacy as one of the “key instructional shifts” (Bennett & Hart, 2015). The premise of the shift is the inclusion of subject-specific literacy standards throughout the curriculum. If the “shift” does occur with teachers in all subject areas, then writing strategies will be, by default, relevant to every subject area. The defined relevance would increase the likelihood that writing strategies, such as those in Academic Writing Support, would be implemented by teachers across the different subject areas. However, teachers’ perceptions of the CCSS shift vary widely, and it is not yet known whether teachers will become more accepting of “literacy across the curriculum.” In August of 2014, Gallup conducted a poll of 854 randomly selected teachers from 43 states and the District of Columbia (Saad, 2014). The teachers were asked to respond to questions about their perceptions of the CCSS. The overall perception of teachers was split, with 41% responding positively and 44% negatively. The poll did find variation in perception based on level of implementation. Teachers who reported that they worked in schools that had implemented all of the standards were more likely to indicate positive perceptions (61%).
A similar poll conducted by the Bill and Melinda Gates Foundation (2014) asked teachers to respond to questions regarding their overall enthusiasm for the standards and their perceptions of the impact the standards were having on their students. Of the 1,676 pre-K to 12th grade teachers who responded, 84% of those with at least one year of implementation reported being enthusiastic about the CCSS. Similarly, 53% reported seeing positive impacts on students attributable to CCSS implementation. The positive trend between level of CCSS implementation and teachers’ perceptions gives reason to believe that the “literacy across the curriculum” shift will be seen in teachers’ practices. In the current study, teachers reported that writing is still thought of as an extra or supplementary piece of the curriculum and not a direct learning target. If teachers at the school in this study do adopt the shift, similar to the teachers responding to the Gallup Poll, they may be more likely to implement literacy strategies, such as those presented in CM training, in all subject areas. The variation in implementation was also found in Oral Language Practice. This component was specifically described by 7 of the 12 teachers who responded to the survey as a shift in practice because they are essentially asking students to talk more, which is in contrast to the idea of students sitting quietly and waiting to speak until called upon. As described by one teacher, “A lot of times, teachers are trying to get the kids to be quiet, and this strategy would be asking them to talk more.” Another teacher echoed this sentiment: For me, I'm trying to get them to be quiet. If I open the floodgates of letting them talk again, oh boy. I wouldn't get it back. This is a really chatty group. Also, it's contrary to what it feels like we want them to be quiet to impart whatever we're doing. If I hear noise, it's very hard to distinguish what the noise is. Is it on-task or off-task? Another teacher also described the challenge of determining whether student talk was on- or off-task behavior. It became apparent that teachers felt more comfortable teaching, and holding students accountable, in silent working conditions. The teachers did not seem to feel as comfortable designing lessons that taught the students how to talk. As one teacher said: It is still really hard to get students to get out of their colloquial talk and use academic language, and it's also hard as teachers to model that. I for sure think oral language is the hardest one to teach. . . . That has probably been the thing that's been the hardest. I think it didn't easily fit into the way I teach, so I have to sit down and say, "How am I going to work this in?" because I want to. The lower implementation of Oral Language Practice illustrates the challenge of increasing student talk that has been described in the literature. Mitchell (2008) described 12 distinct classroom conditions that teachers should have in place to increase the amount of student talk in the classroom. The conditions included abstract concepts, such as creating environments conducive to risk-taking and fostering independent decision-making. DeWitt and Hohenstein (2010) go on to describe increasing student talk as particularly challenging for secondary teachers. The researchers report strong relationships with students as a criterion for increasing student talk. However, teachers’ daily student loads of 150 to 200 made relationship-building a difficult task.
The teachers in the current study experienced similar students loads, with a mean of 165 students (SD = 8.3). The takeaway is that teachers understand the elements of Oral Language Practice and see its value for students. However, they are either unable or unwilling to create the classroom environments where student talk is abundant. Contextual relevance of the components. The varying implementation of the components seems to have resulted from the contextual relevance the specific practices had with individual teachers. The closer the professional development components directly apply to the specific practices of individual teachers, the more likely the teachers 87 are to implement them as intended (Darling-Hammond & Wei, 2009; Opfer, 2011). The necessity of connecting the professional development learning objectives to teachers’ daily practices must be considered by training providers and school leaders. If the PD will be presented school-wide, as in the case of CM, common agreements, such as a school-wide focus on literacy, ELL achievement, or student behavior should exist prior to implementation (Guskey, 2002; O’Brien et al., 1995; Sugai & Horner, 2002b). The middle school and its parent district include achievement of all students, specifically historically underrepresented students (including ELLs), as organizational goals. However, there is a lack of specificity as to the methods and techniques that should be used to achieve the goals. School leaders should be more explicit about the expectations of CM implementation. Success of the overall implementation. Following the results of the component analysis, the overall analysis yielded similar results. Although the index from the observations (51%) and the reflections (62%) were different, they both indicate that the program has not been fully implemented. Although full (100%) implementation would be the ideal goal, the comparative target for this study was set at 75%, based on the descriptors in the rubric. Clearly, the implementation data collected indicates that the school has not yet met the minimal threshold of success defined by this study. However, the threshold of 75% implementation fidelity was determined for the formative purpose of this case study and was not presented to staff as an administrative expectation or a publicized goal. Because the target was not publicized, teachers were not able to use it as a target. In fact, teachers were not provided with any desired target of implementation fidelity. 88 The level of implementation was addressed during the teacher interviews. Every participant was asked to provide an informed opinion of acceptable level of implementation, within the current context of the school. Of the nine interviews, seven provided a number, while two did not feel they could make an informed response. Of the seven responses, there was a low of 25% and a high of 80%, with the average being 60%. In each of the responses, there was a tone of assumption that the teachers would continue progressing toward higher implementation fidelity. An indirect indication of implementation fidelity arose from the teacher surveys. The survey asked whether teachers felt that the critical components could be implemented fully within the next two years. Although there was variation in how significant the changes required would be for individual teachers’ practices, 100% of respondents agreed that each component could be implemented within two years. 
Combining the survey results with the indices of implementation could lead to a conclusion that implementation is progressing and will continue to grow. However, there is also counterevidence. The analysis of observational data did not reveal any consistent variation in implementation among the different training cohorts—meaning that latency or recency of teacher training had no observable effect. Because of this, implementation may not be on an upward trajectory, and an implementation plateau may have occurred. Continued analysis through a longitudinal study is needed to determine whether implementation will continue to rise. The recommendations for support are presented at the end of this chapter, and resources for continued growth would be required from school leaders. 89 The possible implementation plateau in the current study is similar to the National Implementation of Evidence Based Practices evaluation where researchers found time to be a statistically significant predictor of implementation from the baseline through 12 months, but not in the final 12 months of the study (McHugo et al., 2007). Furthermore, four of the five components showed no significant growth in fidelity score following the initial 12-month measurement. Researchers suggested specific actions by leaders to support the increase in implementation. Those resources included follow-up training, increased amounts of feedback, and changes in personnel. Further research is needed to determine whether the middle school will continue to see implementation gains. However, the data collected in this case study clearly indicates that while implementation has occurred to some degree across the site, it has not approached an ideal of 100% or the rubric-based threshold of 75%. Success of Implementation by Predictor Variable As described in the previous chapters, the fidelity indices were analyzed as a function of years of teaching experience, teachers’ primary subject areas, and latency since training. Relationship between fidelity and years of teaching experience. As can be seen in table 4, as years of experience increased, the overall level of fidelity increased, as well as the fidelity of four of the five critical components (the exception being OLP). The only statistically significant relationship was in the implementation of IRNT. The statistical significance of the relationship between IRNT and years of experience is likely due to chance rather than any noteworthy difference in the component. However, the overall trends warrant further investigation. The trend, though not statistically significant, is 90 supported by data from the teachers’ surveys, where all respondents had more than five years of teaching experience. In 100% of the surveys, the respondents indicated that they felt they would be able to fully implement each component within two years. If the respondents are correct, they would have a minimum of seven years of experience when critical components are fully implemented. Unfortunately, due to the low survey response rate, survey data regarding perceptions of teachers with less experience are not available. However, one teacher was interviewed who had less than three years of teaching experience. She explained her level of implementation fidelity: “It’s a better way of teaching, but that doesn’t mean it’s easy. 
I just don’t have the experience to see how it all fits together.” The literature suggests that teachers with limited experience report that emotional exhaustion and pressures from work-related tasks limit their ability to implement changes in practice (Kwakman, 2003; Skaalvik & Skaalvik, 2010). Kwakman (2003) described the inverse relationship between teacher stress and participation in professional development and associated feedback. She noted that perceived stress was a predictor of participation more often than other factors, such as relevance and quality of professional development. CM survey data is not available from teachers with low levels of experience, due to the overall low response rate. However, interview data did provide some insight as to why teachers with limited experience seemed to implement at a lower level. Teachers frequently described a theme that teachers with less experience are overwhelmed by the demands of the profession. Among other challenges, newer teachers feel pressured to use a number of different strategies. This seems to result in less experienced teachers dedicating finite blocks of time to specific tasks or initiatives 91 separately. In contrast, as described by one of the teachers quoted above, more experienced teachers have the ability to see how pieces of different strategies fit together and can therefore “work on them” simultaneously. Thus, the addition of CM strategies did not appear to add to the overall stress of the experienced teachers interviewed. There was not a universal agreement that years of teaching experience result in increased implementation fidelity. In particular, during interview segments that focused on negative perceptions, teachers were able to describe examples where more teaching experience may hinder CM implementation. For example, when discussing the overall implementation of writing in math classes, a teacher commented: Someone will say something like "Well, you're a math teacher. You don't need to be teaching language." And I think, because mostly they're older than me . . . I'm a pretty young teacher, so I think that I don't really ever state my opinion, and I'll listen to what they say. . . . They've had way more years of experience. I definitely respect their experiences, but I would say now in year two, having things calm down a little bit more, math is a language in and of itself. By teaching Constructing Meaning or teaching these sentence frames, you're still, in some way, teaching logical reasoning, which is what you're supposed to be doing in math. For me, in every content area, we need to be teaching language, but that's just me. This teacher described alignment to the overall principles of CM, but was not engaged with her team to collaborate on implementation. Communication with peers was shown to be a statistically significant challenge for novice teachers in a mixed-methods study of 86 novice (less than two years of experience) teachers (Fantilli & McDougall, 2009). Respondents described the challenge as isolation from peers and a perceived lack of respect from more experienced teachers. Data from this case study indicate similar perceptions by less experienced teachers and may have contributed to the variation in implementation of the components. 92 Relationship between fidelity and latency since training. As can be seen in table 6, the analysis of data by training cohort did not reveal any noticeable trends or statistically significant differences. 
There was little variation within components or in the overall fidelity index. The lack of a difference in cohort implementation does seem to contradict the qualitative findings. For example, 100% of the teachers responding to the survey felt that they should be able to implement each component within two years. If one were to assume, based on that sentiment, that implementation follows some sort of progression, then it would follow that the earliest-trained teachers would show the greatest level of implementation. That assumption was not borne out in the observations. Relationship between fidelity and teachers’ primary subject area. In Chapter II, CM training was described as an intervention to increase academic language development instruction across the school, in every classroom. The explicit teaching of language in all classes is generally thought to be a shift in practice, particularly in non-writing courses. Based on that shift, there was an expectation that courses not traditionally considered to include writing would show the lowest level of implementation fidelity. The quantitative results are somewhat in line with this expectation. Humanities teachers, who are described as teaching English language arts and/or social studies, generally displayed the highest rates of implementation. One unexpected example, however, was the math department, whose teachers showed implementation near the median level. Math may be considered to traditionally include the least amount of language instruction, above only physical education. However, the sample size of this case study prohibits strong conclusions regarding the relationship between subject area and level of implementation.
RQ2: Conditions That Influence Implementation
As described in Chapter I, the main goal of research question 2 was to understand the conditions that support or hinder the implementation of CM at a local middle school. The conditions described in this section were derived from the analysis of the qualitative data collected in this case study. Teachers were asked to provide their perceptions regarding what influenced the implementation of CM practices. The section below describes the teachers’ perceptions of the training itself, the conditions either in place or desired that can support implementation, and lastly, the conditions that were in place but were perceived to hinder implementation. Teachers’ perceptions of CM training. Teachers’ “buy-in” to educational reform initiatives has been considered critical to the implementation of new programs (Datnow & Castellano, 2000; Fullan, 1991; Gulumhussein, 2013; Opfer, 2011). Fullan (1991) suggests teacher buy-in is influenced by whether their beliefs align with the priorities of the initiative. The teachers’ overall perception of CM training is therefore a condition that influenced implementation. Teachers’ comments regarding the training itself were all generally positive. Although the principal investigator had developed an initial code to organize any negative comments, the analysis of the interviews did not uncover a single quotation referring to anything negative about the training itself. Teachers used these terms and phrases to describe their perceptions of the training: “practical,” “best practices,” “what we know we should be doing,” and “it helps all students, not just ELLs.” All teachers interviewed felt that, when fully implemented, the training would lead to gains in student achievement.
Survey data yielded similar results, with 100% of respondents feeling that CM practices will have a positive effect on student achievement. Likewise, only 15% of respondents felt that the school should have different priorities for its professional development. It is logical to conclude that the content of the trainings was perceived as positive by the participants in the study.

The materials themselves began to emerge as a theme regarding the perception of the training. During the teacher interviews, discussion turned to tangible resources in order to understand whether any needed resources could simply be purchased. Eight of the nine teachers referred to some sort of preprinted resource available directly from E. L. Achieve. One example is the “CM flipbook” and desktop guide, which offer teachers examples of immediately usable strategies aligned to each of the critical components. As one teacher mentioned, “Many of the materials provided by CM are easily modified to use almost immediately in class.”

The presenters themselves also contributed to the positive perception of the CM training. One teacher described the trainers as energetic, reporting that the trainers “immediately earned [the teachers’] respect through their knowledge and expertise.” Furthermore, the trainers seemed to stay connected to their trainees beyond the three official days of training. One teacher in an interview named a trainer who “made herself very available for support and help” above and beyond the approachability of the other instructors. None of the teachers I interviewed or surveyed indicated any reluctance or hesitation to contact the trainers for guidance.

Collaboration. Teacher collaboration has been described as a systematic process to analyze and improve instructional practices (Dufour, 2004). As can be seen in Table 12, every teacher interviewed spoke of the need to work with others. They used words and phrases such as “collaboration,” “planning,” “working together,” and “share the work” when talking about completing their tasks. The tasks they referred to were all related in some way to the development of CM practices. However, they expressed a need to collaborate with distinctly different groups of people: peers with similar jobs, or other teachers who taught the same course or subject area; interdisciplinary groups, or teachers who taught different subjects to the same group of students, such as grade-level teams; and the CM trainers themselves.

Based on the structure of professional collaboration in place at the middle school, collaboration within subject groups focused on the development or modification of curriculum and materials to align with CM practices. Likewise, teachers often spoke of dividing tasks and sharing resources. For example, one teacher might write the sentence frames for a lesson while another develops discussion prompts. Similarly, another teacher remarked,

We work really well together in planning activities that are going to use the vocabulary and get kids talking about what we're doing. That's a huge part of what makes it successful is that we have the opportunity. I'm not trying to do it all by myself.
During the interviews, one teacher described this process as one that would be beneficial but was not currently in practice: “I'd like some time to collaborate with my science partners and work on lesson plans that have CM implemented in them.” In contrast, collaboration in interdisciplinary teams centered on shared groups of students. Teachers spoke of discussing the rate of gradual release for specific groups of students or strategies that may have worked particularly well for an individual teacher. One teacher commented on the desire to “collaborate even with my other teammates, so my humanities teacher, my science teacher. It would be nice to know what they're seeing, if they're seeing the same patterns with language with our kids.”

Collaboration with peers is one of the key components of effective professional development described by Joyce and Showers (2002). Peer collaboration, as described in their study, included both the development of curricular materials and the logistical planning for implementation. Gulumhussein (2013) describes peer collaboration as the condition under which teachers can apply educational research and develop innovative changes to their practice. In the case at hand, school leaders should further develop structures that support peer collaboration between teachers in the middle school.

In addition to peer collaboration, teachers also spoke of the need to collaborate with CM trainers. In particular, teachers were looking for specific feedback and coaching on their curricula and instruction. Teachers described collaboration with CM trainers as a way to validate their own perceptions of their implementation. As described earlier in this chapter, teachers consistently over-reported compared to the observations. The discrepancy is meaningful, particularly when considering the implementation by individual teachers. If an individual teacher were to rely only on his or her self-evaluation, the perceived need to adapt and change practice may be less than if he or she received direct feedback from an external observer. When asked about the discrepancy between the reflections and observations, responses included the need for an impartial observer to give direct feedback. For example, one teacher described the difference between being observed by a colleague and a trainer in this way:

There's the past, the history, the knowing of each other, and that might not be as fruitful as it could be if it was just someone from the outside who's not in charge of evaluating me, they're not associated with, they're not a friend of mine, or someone that I used to teach with, or a colleague. It's just someone from CM, whom I don't know, coming in, and watching me use the strategies and giving me some really effective feedback and coaching.

Feedback by trained coaches, as opposed to peers, may be advantageous (Scheeler, Ruhl, & McAfee, 2004). Trained coaches simply have more experience with specific interventions and are able to provide more specific feedback on the associated practices than peer coaches, who may still be learning the programs themselves (Mallette, Maheady, & Harper, 1999). In contrast, however, trained coaches may not have informal access to teachers, and their observations are more likely to be scheduled and less frequent than visits from internal coaches (Nishimura, 2014). Implementation plans should include opportunities for regular feedback for the teachers.
Practically, the feedback plan will need to include peers, due to resource limitations. However, feedback from experts should not be omitted.

Time as a resource. The concept of time as a resource was a complex theme during this case study. In 100% of interviews, the need for additional time to implement the CM practices was mentioned. However, in some cases, the mention of time was a proxy for a different resource, indicating that it was the latter that was actually needed. For example, one teacher stated: “I am a science teacher, not a language arts teacher. I need time to learn how [to] teach this way.” Although the term time is used, the comment actually reveals a need for additional training and support, as described in the preceding section, rather than simply more time.

In contrast, other respondents made statements indicating that time itself was the resource needed. Seven of the nine interviews mentioned the need for more time with the students. A characteristic response was, “These strategies are better, but they certainly take more time than just lecturing and moving on.” Interestingly, during the case study, teachers were transitioning from 90-minute instructional blocks to 75-minute blocks. Therefore, it is possible that the perception of a need for more time with students was due to the change in the schedule and not the addition of CM practices. In addition, five of the nine teachers interviewed discussed simply needing time to modify their existing curricula to meet the requirements of CM practices. One teacher commented that when time was provided to plan with the team, implementation “was done with ease.”

Although a specific guideline for the amount of time needed for workshops, coaching, and follow-up is not agreed upon in the literature, the concept of implementation timelines has been addressed in various studies. In their report to the National Staff Development Council, Darling-Hammond and colleagues (2009) analyzed nine experimental design studies measuring the relationship between time and implementation of PD. The researchers found that in every study, the duration of the training (including coaching and follow-up) was positively associated with implementation fidelity. The training time needed may be much higher than what is provided in traditional workshops that last one to five days. Various reports on professional development suggest that between 50 and 80 hours of direct engagement are needed to significantly influence teacher practices (Corcoran, McVay, & Riordan, 2003; Supovitz & Turner, 2000; Wagner & French, 2010). The teachers in this study each completed two full days of direct training (approximately 16 hours). Additional time (up to eight hours per teacher) was offered to teachers who opted to work with trainers on adapting their curricular materials. Beyond the formal time provided, teachers described using planning and personal time to implement the strategies. Although teachers were not asked to quantify the total number of hours they have spent working on implementation, it does not seem likely that many, if any, have approached the minimum 50 hours suggested above.

The survey data presented an interesting discrepancy in regard to time as a condition of implementation. Only 8% of respondents stated that CM practices are too time-consuming to implement. However, it is less clear whether the school providing more time would aid implementation.
During the surveys, 76% of teachers reported that they were able to collaborate with their “learning team” regarding CM practices. Learning teams are a district mandate, similar in structure to the professional learning communities described by Dufour and Dufour (2002). According to the collective bargaining agreement, learning teams are to be allocated a minimum of 90 minutes a month, during contracted hours, to meet and collaborate. However, on a subsequent question, only 39% of respondents agreed that there is time available in the school day to plan for CM implementation. The survey data suggest that there may be conflicting understandings as to how the time that is available to teachers is to be used, and whether implementation of CM practices should take priority during that time. It is clear that teachers will need to invest substantial amounts of time in order to approach full implementation. Under the current schedule utilized by the school, the collaboration will need to occur during preparatory periods, outside of contracted time after school, or during one of the three staff development days that occur throughout the year.

Alignment with Other Priorities

As described earlier in this chapter, the evidence collected in this case study suggests that there are areas of overlap between CM practices and other district priorities. CM training was selected explicitly after an emphasis on SIOP training. Teachers and administrators felt that the ideals and philosophies described in SIOP training were operationalized by CM practices. Additionally, the school and its associated district are heavily invested in the implementation of AVID, a structure and curriculum aimed at increasing college success among traditionally underrepresented groups. One of the key components of AVID is the teachers’ use of Writing, Inquiry, Collaboration, Organization, and Reading (WICOR) strategies. The teachers who were interviewed were able to see the similarity between CM practices and WICOR strategies. These connections were described by teachers as positive and contributed to the feeling that CM is not “just another thing.” For example, one veteran teacher commented, “what really connected was the strategies with AVID, and then partnering that with Constructing Meaning. It was like a really good additional underpinning for the strategies in AVID.” Another teacher said that CM “provided more structure and specifics to techniques, handouts, etc., that I had been working to implement as a result of my work with SIOP and more recently AVID trainings.”

Alignment with other priorities as a hindrance to implementation. The connections to other initiatives did include some negative remarks. The remarks did not suggest that any one initiative is negative, but rather that each places a drain on resources; teachers said they felt there were some competing priorities. As one respondent stated, “AVID is great, but it is taking away from the emphasis of CM, like we aren’t really doing CM anymore.” Another teacher simply said, “My focus has also been on AVID much more than CM.” In a similar explanatory study that investigated the implementation of Positive Behavioral Interventions and Supports (PBIS), 53% of teachers surveyed indicated that explicit prioritization of the program aided in implementation (Andreou, McIntosh, Ross, & Kahn, 2015).
Teachers stated that the continued prioritization, over multiple years, helped to “validate the program” and increased the likelihood that they would “alter their practice” (p. 164).

Limitations of the Study

The discussion of the research question analyses presented above is intended to be helpful to school stakeholders as they evaluate and plan for their ongoing implementation of CM practices. Guidance has been provided to school district leaders as they adopt and plan professional development activities for their staff. However, the methods described in this case study were not without limitations. The limitations and their potential impacts on the findings of this study are discussed below.

Psychometric properties of the rubric. The Refining Our Practices Rubric was at the center of all quantitative analysis in this case study. The lack of psychometric data for the rubric calls into question the accuracy of the findings. Although this study did include a reliability analysis of both the raters and the items within each component, the data used came from a relatively small number of observations, n = 5 and n = 30, respectively. The reliability analysis did provide useful information, as described in Chapter IV. For example, in the case of the item analysis, item AWS(3) showed limited agreement with other AWS items and was removed from analysis. However, a larger-scale reliability analysis is needed prior to removing the item from future uses of the rubric.

Additionally, the agreement between the two observers only met minimally acceptable levels of reliability (kappa = .607). Therefore, the variance in the data presented may be more a function of rater disagreement than of differences in actual implementation fidelity. However, it should be noted that the rubric allowed for a range of four possible scores. The application of the kappa statistic did not likewise account for a range of agreement. Rather, the calculation treated the disagreement between scores of one and four on the same item as equivalent to the difference between a three and a four. A simple difference analysis indicated that approximately 70% of the measurements that were in disagreement between the two raters were within one level on the rubric. The small range of disagreement provides confidence in the reliability of the observational data that is not completely reflected in the calculation of the kappa statistic. The evaluation was formative in nature, meaning it was intended to provide insight to the school as to how to improve the implementation process of CM training. There was no intention to use any of the data to make summative judgments on the continuation of the program, and certainly not to make any job performance claims about any individual or the school in general. However, the lack of substantial agreement was a point of concern and is further addressed below.

The validity of the rubric was assumed, which presented a significant limitation of the study. The items in the rubric called for a very narrow range of observations for each of the components. However, the observations required for each item may not capture a complete representation of the construct presented by the component. Therefore, a teacher may be implementing practices that are in line with a particular component but that are not specifically included in the rubric. If that were the case, the teacher would receive an inappropriately low score.
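The distinction drawn above between overall agreement and near-agreement can be illustrated with a brief sketch. The example below is hypothetical (simulated scores rather than the study's observation data), and a weighted kappa was not reported in this evaluation; it simply shows how a linearly weighted coefficient and a within-one-level tally could supplement the unweighted statistic in future uses of the rubric.

# Illustration only: hypothetical rater scores on the 1-4 rubric scale, not the study's data.
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(seed=7)

# Two observers scoring the same 30 lessons on a 1-4 rubric; rater B usually agrees
# with rater A and otherwise drifts by one level, occasionally two.
rater_a = rng.integers(1, 5, size=30)
drift = rng.choice([-2, -1, 0, 0, 0, 0, 1, 1, 2], size=30)
rater_b = np.clip(rater_a + drift, 1, 4)

# Unweighted kappa treats a 1-versus-4 disagreement the same as a 3-versus-4 disagreement.
unweighted = cohen_kappa_score(rater_a, rater_b)

# Linearly weighted kappa gives partial credit for near-agreement.
weighted = cohen_kappa_score(rater_a, rater_b, weights="linear")

# Share of disagreements that fall within one rubric level (the "simple difference analysis").
diffs = np.abs(rater_a - rater_b)
within_one = np.mean(diffs[diffs > 0] <= 1) if np.any(diffs > 0) else 1.0

print(f"unweighted kappa = {unweighted:.3f}")
print(f"linear-weighted kappa = {weighted:.3f}")
print(f"disagreements within one level = {within_one:.0%}")

Reporting the weighted and unweighted coefficients side by side, along with the share of disagreements within one rubric level, would make the degree of rater disagreement easier to interpret in subsequent observation cycles.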
Further analysis on the rubric is needed to determine its appropriateness and robustness as an evaluation tool, discussed below. Sample size of the study. All statistical analyses were completed without sufficient statistical power, due to the small sample size. As a result, the findings presented here should be further investigated through studies with greater participation prior to making summative claims. The findings in this study presented school leaders with some potentially useful information for planning, but more investigation is needed before making high-stakes changes to the program. Furthermore, the study was designed and conducted completely within the context of the specific school, and findings are not generalizable to other CM implementation contexts or other teacher PD programs. Participant bias. Participation in this study was voluntary. During the observation phase, only four teachers “opted out.” However, during the survey and reflection phase, only 13 of the 30 teachers participated. An attempt to uncover the reasons behind the lack of participation yielded little conclusive information. The only feedback provided was that time was too valuable to spend on nonessential tasks. While nonresponsive teachers cited only time constraints, there may be specific groups of teachers whose perceptions were not represented in surveys and reflections. The limited range of teacher perceptions should be noted when considering possible bias in this case study. For example, of the teachers who responded to reflections and surveys, none of them were within their first five years of teaching. As described above, teachers in the first few years of teaching describe feeling exhaustion and burnout from pressures at work (Fives, Hamman, & Olivarez, 2007; Kwakman, 2003). If the teachers in this 104 study also feel exhausted during their first years of teaching, they may have had more negative perceptions of CM practices than the more experienced teachers who responded (generally positively) to the survey. There is no way to know the motivations of the teachers who did or did not respond to surveys and reflections. Those with a positive impression of the program might be more likely to respond, and less likely to voice challenges or hindrances to implementation. Over-reporting by teachers. Within the participant bias of the study was the over-reporting of implementation by teachers. Over-reporting emerged as a theme throughout the quantitative analysis of this study. In the case of every component, except UBD and the overall index, teachers’ self-reported scores were higher than those reported from direct observation by the coaches. The differences were found to be statistically significant in AWS, IRNT, and OLP. The higher reporting was a topic discussed in every interview. No teacher made any statements of surprise upon learning that teachers had reported implementation higher than the observers, and many were able to offer opinions as to the reasons. Two common opinions regarding the over-reporting emerged. The first opinion was that teachers may be reporting on what they planned to do or could describe doing, without actually taking into account whether the action had actually taken place. The second opinion is that the over-reporting indicates a need for direct observation by and feedback from trained CM coaches. Several teachers addressed these ideas during the interviews. One respondent said, “I found myself being like, ‘Yeah, I do that,’ and then I overestimate how much I do that. 
In reality, [the students are] not having as many conversations as I think they're having.” Another teacher commented, “I think that we want to look like we know what we're doing, for one thing. Put a rosy spin on it. Also, we might have an idea that we're thinking of . . . in our head.” Further remarks on this issue include the following quotes from two teachers:

We have it in our head. . . . One of my favorite lines is, "I have the best lesson plans in the world. It's going to be awesome." Then the kids show up. They don't do what I'm thinking. "Oh, they'll do this, they'll do that." That's some of it.

In my head, as I develop those lessons or want to use these [strategies], I'm thinking about these. I'm not really making it happen in the classroom as effectively as I am thinking about it. I think I know how to use this. I think that I'm using it, but it's not really coming across.

The concept that teachers, or any practitioners, would over-report is not surprising and has been well documented in the literature (Eva & Regehr, 2008; Kruger & Dunning, 1999). As Kruger and Dunning (1999) described, individuals are often “unskilled and unaware,” and there is general consensus that individuals have a poor ability to self-assess accurately. However, self-assessments have been shown to make positive contributions to the implementation and evaluation of professional development (Eva & Regehr, 2005; Langendyk, 2006). Self-assessments provide opportunities for individuals to describe the contextual conditions affecting PD and to reflect on their own progress toward implementation. Although the progress described is likely inflated, the opportunity to reflect helps teachers better understand their practices. In the case study presented here, the over-reporting by teachers suggests a need for more observation by trained observers in order to judge the actual level of implementation and give support where necessary. This quote summarizes the need:

I imagine that when you feel like you're working really hard to do something new—doing new things and incorporating new things in your practice is difficult, and if you feel like you're working really hard at it, even if you're not doing a good job at it, you are maybe rating yourself in terms of how hard you feel like you're working at it versus how well you're doing, which is why I think having someone observe and give feedback, and do coaching is a really vital part of any professional development. And it's the most absent part of professional development.

Target of implementation. RQ1 investigated the success of implementation. Compounding the measurement challenges presented by the rubric and the small sample sizes was the lack of a predetermined standard. There were no specific success criteria described during the training or set by the administrators of the school. The targets used as a standard in this study were developed in consultation with district leaders and CM trainers and included a review of the relevant literature. However, it must be noted that the standard was set for the purposes of this study and evaluation only. The approach of addressing success by investigating the implementation of the components, the variance in implementation, and a threshold based on the rubric was useful in making recommendations to the school. However, the lack of a predetermined standard during the initial design and training is presented below as a consideration for future research.
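The over-reporting pattern described above, in which self-reported scores consistently exceeded observed scores and the differences reached statistical significance for several components, could be monitored in future evaluation cycles with a simple paired comparison. The sketch below is hypothetical: the scores are invented, and the Wilcoxon signed-rank test is chosen only as one plausible small-sample option, not as a reproduction of the analysis conducted in this study.

# Illustration only: hypothetical paired scores, not the study's survey or observation data.
import numpy as np
from scipy.stats import wilcoxon

# Self-reported and observed scores for one CM component (1-4 rubric), one pair per teacher.
self_report = np.array([3, 4, 3, 3, 4, 2, 3, 4, 3, 3, 2, 4, 3])
observed = np.array([2, 3, 3, 2, 3, 2, 2, 3, 3, 2, 2, 3, 2])

# Average gap gives the size of the over-report in rubric levels.
mean_gap = np.mean(self_report - observed)

# Wilcoxon signed-rank test on the paired differences; a small p-value would indicate
# a systematic gap between self-reported and observed implementation.
stat, p_value = wilcoxon(self_report, observed)

print(f"mean over-report = {mean_gap:.2f} rubric levels")
print(f"Wilcoxon statistic = {stat:.1f}, p = {p_value:.4f}")

Run on real reflection and observation scores for each component, a check of this kind would let school leaders see whether the gap between perception and practice narrows as coaching and feedback cycles are added.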
Implications for Practice

In the following sections, the interpretation of the results of this case study and the associated implications for the middle school and its parent district are presented. The recommendations are related to determining a target for successful implementation, developing reliable systems of observation, factoring in time as a resource, integrating with other priorities, and distinguishing between training and intervention. Recommendations for the general research community are also presented.

Target for Successful Implementation

As discussed above, success criteria needed to be established for this study. The lack of a clearly defined implementation plan has been cited as a hindrance in programs that have failed to be adequately implemented (McGrew et al., 1994). Schools lacking a clear plan have noted that interventions seem to simply fade away from their practice over time (Andreou et al., 2015). The school in this study had not determined expected levels of implementation over time. In fact, management provided little to no specificity regarding which instructional practices should be utilized. Rather, the general statement that “we should see these practices in every room” was the only basis for successful implementation prior to this study. Likewise, E. L. Achieve does not provide an implementation timeline, and neither the district nor the school chose to develop one prior to the training. During the case study, it became apparent that the approach of school and district leadership was to celebrate the areas that were implemented rather than focusing on the areas that were not.

It is not too late for the school to develop such a plan. One application of this case study could be to use the indices of the components as a baseline and to set goals for expected future gains. School leaders could develop a systematic plan that addresses the conditions needed for implementation as well as the challenges of evaluation that have been described in this study. Teachers would be able to receive the specific feedback they need, based on the goals of the implementation plan. As discussed above, teachers felt that observations and feedback on specific CM components, made by CM experts, would better support successful implementation. With clear, component-specific plans in place, feedback could be highly targeted.

Developing Reliable Systems of Observation

The continued implementation of CM practices by the school should include the continuous evaluation of classroom practices. As described above, the teacher comments collected in this case study indicated that observations and feedback supported implementation. Both the data collected and the relevant literature indicate benefits of both peer and outside (CM trainer) observation and feedback cycles. The school should design a program in which teachers observe and provide regular feedback to each other and in which trained CM coaches are brought in periodically. The practice of observation and feedback will support teachers in their implementation and provide leaders with evaluative data that can be used to make program adjustments. The approach of evaluating CM practices with both quantitative and qualitative methods, as used in this case study, could be repeated if the institutional limitations discussed above are addressed.

Increased reliability analysis of the Refining Our Practices Rubric.
The reliability analysis presented in this study was derived from data collected from a very small sample. The lack of confirmed psychometric data on the tool needs to be addressed to support large-scale evaluations within the school and the greater district. Ideally, the data would come from E. L. Achieve, which has supported CM schools across the country. However, the district has trained enough teachers to produce a sample size adequate for such analysis. As explained in Chapter III, the goal of the district was to have every middle school teacher trained in CM practices. If the district were to achieve this goal, a comprehensive study would contain a sample size of approximately 400 teachers. A systematic analysis of CM implementation could be used to better understand the psychometric properties of the rubric. If the rubric is shown to be reliable and valid, it could be used immediately in studies similar to the one presented here.

Considerations of Time as a Resource

In addition to continued evaluation, school leaders should foster the conditions that have been described as supporting implementation. Teachers’ consistency in describing time as a necessary condition for successful implementation did not come as a surprise. However, the discrepancy between some teachers feeling that adequate time was provided and other teachers feeling that it was not was interesting. In particular, the majority of the teachers in this case study were on the same bell schedule with the same number of preparatory minutes. Therefore, the discrepancy could not have resulted from simple differences in schedules. Rather, it may have resulted from the teachers’ prioritization of CM practices and their use of the time that they did have. Once a clear expectation of an implementation timeline is put in place, teachers could be given clear objectives to accomplish during available preparatory time.

Connections to Other Priorities

During this case study, the connection to other school and district initiatives was perceived by some teachers as a support to implementation. In particular, teachers spoke of CM practices as a continuation of Sheltered Instruction Observation Protocol (SIOP) trainings. Generally, they described that connection as positive. Teachers also spoke of CM training as being connected to Advancement Via Individual Determination (AVID) training. The shared strategies, such as interactive note-taking, were seen as complementary and positive. However, further probing into the connection between AVID and CM indicated that some teachers perceived the connection as a hindrance to CM implementation.

The district has been financially supported by a Fortune 500 corporation to provide training and implementation of the AVID program at all of its schools. While the foci and practices of AVID and CM align, they are two distinct programs with different critical components and measures of evaluation. Multiple teachers felt that they did not have the capacity to implement both programs simultaneously. What resulted was the feeling that CM was more of a one-time event, from which teachers would use what they felt was beneficial and adapt it to their practices. CM practices were considered not as a systematic intervention but rather as a workshop that would supplement ongoing instructional techniques. As described in Chapter II, professional development that is delivered in a workshop format is less likely to be fully implemented.
Distinguishing Between Training and Intervention Interview comments indicated uncertainty as to whether CM practices will serve as a standard for teaching practices across subject areas. In one regard, CM training and its associated practices can be thought of as an intervention, which is the approach used by the school and district at the time of this study. Considering the practices as an intervention implies that the practices will be implemented as intended and with an adequate level of fidelity. However, in light of the implementation of AVID and other school priorities, it is also possible to simply consider CM training as a workshop. Workshops tend to be one-time events that may help to improve the practices of certain teachers but do not result in widespread change or improvements to overall outcomes. School leaders should clearly communicate that full implementation of CM practices is 111 consistently expected of all teachers. School leaders should also allocate the appropriate resources, including time, that are necessary to expect full implementation. Chapter II outlined the need for fidelity measurement to be included in the implementation plans for interventions. The current study was an applied research project with a primary goal of presenting school-specific findings and recommendations. This study also identifies several areas in need of further study by the broader research community. This study was able to design and utilize a model based on prior studies and recommendations. In addition to describing a practical application, this study also uncovered challenges that should be addressed by the professional development community. Specifically, challenges to schools completing research-based internal evaluations were uncovered. The substantial challenges uncovered were in the area of valid and reliable measures of implementation, and an established standard for implementation. The critical components of CM were readily identified by the developers of the program. However, as described above, the rubric used in the evaluation lacked accompanying psychometric data. School personnel should not be charged with assessing the evaluation tools of interventions they choose to implement. Developers should provide valid instruments as well as clear guidance for measuring and achieving reliability when offering professional development packages to schools or districts. Additionally, PD providers should define standards of implementation, using accepted methods. 112 Implications for Future Research This study was conducted to not only measure implementation, but to also understand it. Researchers utilizing fidelity studies, either as formative evaluations or within broad studies of program effect, should consider the inclusion of qualitative methods in their analyses, in order to better understand the contextual conditions that influence the implementation of the intervention they are studying. In the current case, the qualitative methods allowed the providers of the intervention to describe perceptions and nuances for the intervention that would be difficult to ascertain through solely quantitative methods. The results of studies that combine both quantitative and qualitative results are thus more likely to be recognizably useful and more likely to be applied by local school leaders. The use of well-designed evaluations of implementation is crucial for school leaders who are attempting to raise student achievement outcomes through quality professional development. 
The use of implementation data throughout the PD process is crucial. Implementation data will provide school leaders with the information that is necessary for timeline adjustments and resource allocations. Implementation data is also a vital component to showing the efficacy of the programs when included as part of a well-designed experimental study. 113 APPENDIX THE REFINING OUR PRACTICES RUBRIC 114 115 116 117 118 REFERENCES CITED Andreou, T. E., McIntosh, K., Ross, S. W., & Kahn, J. D. (2015). Critical incidents in sustaining School-Wide Positive Behavioral Interventions and Supports. The Journal of Special Education, 49, 157–167. doi:10.1177/0022466914554298 Archer, A., & Hughes, C. (2011). Explicit instruction: Effective and efficient teaching. Explicit instruction: Effective and efficient teaching (1st ed.). New York, NY: The Guilford Press. Babbie, E. (2007). The Practice of Social Research (11th ed.). Belmont, CA: Thomsen Wadsworth. Bangert-Drowns, R. L., Hurley, M. M., & Wilkinson, B. (2004). The effects of school- based writing-to-learn interventions on academic achievement: A meta-analysis. Review of Educational Research, 74, 29–58. doi:10.3102/00346543074001029 Benner, G. J., Nelson, J. R., Stage, S. A., & Ralston, N. C. (2011). The influence of fidelity of implementation on the reading outcomes of middle school students experiencing reading difficulties. Remedial and Special Education, 32, 79–88. doi:10.1177/0741932510361265 Bennett, S. M., & Hart, S. M. (2015). Reading horizons addressing the “shift”: Preparing preservice secondary teachers for the common core. Reading Horizons, 53(4), 43– 65. Bickmore, K., & Parker, C. (2014). Constructive conflict talk in classrooms: Divergent approaches to addressing divergent perspectives. Theory & Research in Social Education, 42, 291–335. doi:10.1080/00933104.2014.901199 Blank, R. K., & de las Alas, N. (2009). Effects of teacher professional development on gains in student achievement. Washington D.C. Retrieved from www.ccsso.org Bond, G., Williams, J., Evans, L., Salyers, M., Kim, H.-W., Sharpe, H., & Leff, S. H. (2000). Psychiatric Rehabilitation Fidelity Toolkit. Cambridge, MA: Human Services Research Institute. Bradshaw, C. P., Barrett, S., & Bloom, J. (2004). The implementation phases inventory (IPI). Baltimore, MD. Retrieved from http://www.pbismaryland.org/forms.htm Bradshaw, C. P., Debnam, K., Koth, C. W., & Leaf, P. (2008). Preliminary validation of the Implementation Phases Inventory for assessing fidelity of schoolwide positive behavior supports. Journal of Positive Behavior Interventions, 11, 145–160. doi:10.1177/1098300708319126 119 Bradshaw, C. P., Koth, C. W., Thornton, L. A., & Leaf, P. J. (2009). Altering school climate through school-wide positive behavioral interventions and supports: Findings from a group-randomized effectiveness trial. Prevention Science, 10, 100– 115. doi:10.1007/s11121-008-0114-9 Brandon, P. R. (1998). Stakeholder participation for the purpose of helping ensure evaluation validity: Bridging the gap between collaborative and non-collaborative evaluations. American Journal of Evaluation, 19, 325–337. doi:10.1177/109821409801900305 Briggs, S. R., & Cheek, J. M. (1986). The role of factor analysis in the development and evaluation of personality scales. Journal of Personality, 54(1), 106–148. doi:10.1111/j.1467-6494.1986.tb00391.x Brunette, M. F., Asher, D., Whitley, R., Lutz, W. J., Wieder, B. L., Jones, A. M., & McHugo, G. J. (2008). 
Implementation of integrated dual disorders treatment: A qualitative analysis of facilitators and barriers. Psychiatric Services (Washington, D.C.), 59, 989–95. doi:10.1176/appi.ps.59.9.989 Carroll, C., Patterson, M., Wood, S., Booth, A., Rick, J., & Balain, S. (2007). A conceptual framework for implementation fidelity. Implementation Science, 9, 1–9. doi:10.1186/1748-5908-2-40 Century, J., Cassata, A., Rudnick, M., & Freeman, C. (2012). Measuring enactment of innovations and the factors that affect implementation and sustainability: Moving toward common language and shared conceptual understanding. The Journal of Behavioral Health Services & Research, 39, 343–61. doi:10.1007/s11414-012-9287- x Century, J., Rudnick, M., & Freeman, C. (2010). A framework for measuring fidelity of implementation: A foundation for shared language and accumulation of knowledge. American Journal of Evaluation, 31, 199–218. doi:10.1177/1098214010366173 Christie, C. A., & Alkin, M. C. (2003). The user-oriented evaluator’s role in formulating a program theory: Using a theory driven approach. American Journal of Evaluation, 24, 373–385. doi:10.1177/109821400302400306 Cizek, G. K., & Bunch, M. B. (2007). Standard setting: A guide establishing and evaluating performance standards on tests. Thousand Oaks, CA: Sage Publicaitons. Clark, L. A., & Watson, D. (1995). Constructing validity: Basic issues in objective scale development. Psychological Assessment, 7, 309–319. doi:10.1037/1040- 3590.7.3.309 120 Codding, R. S., Feinberg, A. B., Dunn, E. K., & Pace, G. M. (2005). Effects of immediate performance feedback on implementation of behavior support plans. Journal of Applied Behavior Analysis, 38, 205–219. doi:10.1901/jaba.2005.98-04 Collins, K. M. T., Onwuegbuzie, A. J., & Sutton, I. L. (2006). A model incorporating the rationale and purpose for conducting mixed-methods research in special education and beyond. Learning Disabilities, 4(1), 67–100. Corcoran, T., McVay, S., & Riordan, K. (2003). Getting it right: The MISE approach to professional development. Philidelphia, PA. Retrieved from https://cpre.org/images/stories/cpre_pdfs/rr55.pdf Creswell, J. W. (2014). Research design: Qualitative, quantitative, and mixed methods approaches (4th ed.). Los Angeles, CA: Sage Publications. Creswell, J. W., & Plano Clark, V. L. (2011). Designing and conducting mixed methods research. Los Angeles, CA: Sage Publications. Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297–334. doi:10.1007/BF02310555 Dane, A. V, & Schneider, B. H. (1998). Program integrity in primary and early secondary prevention: Are implementation effects out of control? Clinical Psychology Review, 18, 23–45. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/9455622 Darling-Hammond, L., & Wei, R. C. (2009). Professional learning in the learning profession : A status report on teacher development in the united states and abroad. Washington, DC: National Staff Development Council. Datnow, A., & Castellano, M. (2000). Teachers’ responses to Success for All: How beliefs, experiences, and adaptations shape implementation. American Educational Research Journal, 37, 775–799. doi:10.3102/00028312037003775 DeWitt, J., & Hohenstein, J. (2010). School trips and classroom lessons: An investigation into teacher-student talk in two settings. Journal of Research in Science Teaching, 47, 454–473. doi:10.1002/tea.20346 Dobson, D., & Cook, T. J. (1980). Avoiding type III error in program evaluation: Results from a field experiment. 
Evaluation and Program Planning, 3, 269–276. doi:http://dx.doi.org/10.1016/0149-7189(80)90042-7 Drake, R., Goldman, H., Leff, S., Lehman, A., Dixon, L., Mueser, K., & Torrey, W. (2001). Implementing evidenced-based practices in routine mental health service settings. Psychiatric Services, 52, 179–182. 121 Dufour, R. (2004). What is a “Professional Learning Community?” Educational Leadership, 61(6), 6–11. Durlak, J. A., & DuPre, E. P. (2008). Implementation matters: a review of research on the influence of implementation on program outcomes and the factors affecting implementation. American Journal of Community Psychology, 41, 327–50. doi:10.1007/s10464-008-9165-0 Dusenbury, L., Brannigan, R., Falco, M., & Hansen, W. B. (2003). A review of research on fidelity of implementation: Implications for drug abuse prevention in school settings. Health Education Research, 18, 237–256. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/12729182 Dutro, S. (2009). Explicit language instruction. In L. Helma (Ed.), Literacy Development with English Learners: Research-Based Instruction in Grades K–6 (pp. 40–55). New York, NY: The Guilford Press. Dutro, S., & Moran, C. (2003). Rethinking English language instruction: An architectural approach. In G. G. Garcia (Ed.), English learners: Reaching the highest level of English literacy (pp. 227–265). Newark, DE: International Reading Association. Dyas, J. V., Togher, F., & Siriwardena, A. N. (2014). Intervention fidelity in primary care complex intervention trials: Qualitative study using telephone interviews of patients and practitioners. Quality in Primary Care, 22, 25–34. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/24589148 Echevarria, J., Frey, N., & Fisher, D. (2015). What it takes for English Learners to succeed. Educational Leadership, 72(6), 22–26. Echevarria, J., Richards-Tutor, C., Chinn, V. P., & Ratleff, P. A. (2011). Did they get it? The role of fidelity in teaching English learners. Journal of Adolescent and Adult Literacy, 54, 425–434. doi:10.1598/JA E. L. Achieve (2014). Constructing Meaning Home. Retrieved from http://cm.elachieve.org/ Eva, K. W., & Regehr, G. (2005). Self-assessment in the health professions: A reformulation and research agenda. Academic Medicine, 80, S46–S54. doi:10.1097/00001888-200510001-00015 Eva, K. W., & Regehr, G. (2008). “I’ll never play professional football” and other fallacies of self-assessment. The Journal of Continuing Education in the Health Professions, 28(1), 14–19. doi:10.1002/chp 122 Fantilli, R. D., & McDougall, D. E. (2009). A study of novice teachers: Challenges and supports in the first years. Teaching and Teacher Education, 25, 814–825. doi:10.1016/j.tate.2009.02.021 Ferretti, R. P., MacArthur, C. A., & Dowdy, N. S. (2000). The effects of an elaborated goal on the persuasive writing of students with learning disabilities and their normally achieving peers. Journal of Educational Psychology, 92, 694–702. doi:10.1037/0022-0663.92.4.694 Fink, A. (2013). How to conduct surveys: A step by step guide (5th ed.). Thousand Oaks, CA: Sage Publications. Fives, H., Hamman, D., & Olivarez, A. (2007). Does burnout begin with student- teaching? Analyzing efficacy, burnout, and support during the student-teaching semester. Teaching and Teacher Education, 23, 916–934. doi:10.1016/j.tate.2006.03.013 Forgatch, M. S., Patterson, G. R., & DeGarmo, D. S. (2005). Evaluating fidelity: Predictive validity for a measure of competent adherence to the Oregon model of parent management training. Behavior Therapy, 36, 3–13. 
Retrieved from http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1464400&tool=pmcentr ez&rendertype=abstract Fullan, M. (1991). The new meaning of educational change (1st ed.). London, England: Cassell. Garet, M. S., Porter, A. C., Desimone, L., Birman, B. F., & Yoon, K. S. (2009). What makes professional development effective? Results from a national sample of teachers. American Education Research Journal, 38, 915–945. doi:10.3102/00028312038004915 Gliem, J. A., & Gliem, R. R. (2003). Calculating, Interpreting, and Reporting Cronbach’s Alpha Reliability Coefficient for Likert-Type Scales,. 2003 Midwest Research to Practice Conference in Adult, Continuing, and Community Education. doi:10.1109/PROC.1975.9792 Graham, S., & Perin, D. (2007). Writing Next: Effective strategies to improve writing of adolescents in middle and high schools. A Report to Carnegie Corporation of New York. New York, NY. Retrieved from https://www.paytixx.com/education/nclb/ispd/topic1/writing_next.rtf 123 Gulumhussein, A. (2013). Teaching the teachers: Effective professional development in an era of high stakes accountability. Alexandria, VA. Retrieved from http://www.centerforpubliceducation.org/Main-Menu/Staffingstudents/Teaching- the-Teachers-Effective-Professional-Development-in-an-Era-of-High-Stakes- Accountability/Teaching-the-Teachers-Full-Report.pdf Guskey, T. (2002). Professional development and teacher change. Teachers and Teaching: Theory and Practice, 8, 381–391. doi:10.1080/135406002100000512 Hall, G. E., & Hord, S. M. (1987). Change in schools: Facilitating the process. New York, NY: State University of New York Press. Hansen, W. B., Graham, J. W., Wolkenstein, B. H., & Rohrbach, L. A. (1991). Program integrity as a moderator of prevention program effectiveness: Results for fifth-grade students in the adolescent alcohol prevention trial. Journal of Studies on Alcohol, 52, 568–579. Harachi, T. W., Abbott, R. D., Catalano, R. F., Haggerty, K. P., & Fleming, C. B. (1999). Opening the black box : Using process evaluation measures to assess implementation and theory building. American Journal of Community Psychology, 27, 711–731. Harn, B., Parisi, D., & Stoolmiller, M. (2013). Balancing fidelity with flexibility and fit: What do we really know about fidelity of implementation in schools. Exceptional Children, 79, 181–193. Hesse-Biber, S., & Johnson, R. B. (2013). Coming at things differently: Future directions of possible engagement with mixed methods research. Journal of Mixed Methods Research, 7, 103–109. doi:10.1177/1558689813483987 Hoehler, F. K. (2000). Bias and prevalence effects on kappa viewed in terms of sensitivity and specificity. Journal of Clinical Epidemiology, 53, 499–503. doi:10.1016/S0895-4356(99)00174-2 Johnson, R. B., & Onwuegbuzie, A. J. (2004). Mixed methods research: A research paradigm whose time has come. Educational Researcher, 33(7), 14–26. doi:10.3102/0013189X033007014 Joyce, B., & Showers, B. (2002). Student achievement through staff development. In B. Joyce & Beverly Showers (Eds.), Designing training and peer coaching: Our needs for learning (pp. 1–5). Alexandria, VA: ASCD. Kaderavek, J. N., & Justice, L. M. (2010). Fidelity: An essential component of evidence- based practice in speech-language pathology. American Journal of Speech Language Pathology, 19, 369–379. 124 Knoche, L. L., Sheridan, S. M., Edwards, C. P., & Osborn, A. Q. (2010). 
Implementation of a relationship-based school readiness intervention: A multidimensional approach to fidelity measurement for early childhood. Early Childhood Research Quarterly, 25, 299–313. doi:10.1016/j.ecresq.2009.05.003 Kobayashi, K. (2005). What limits the encoding effect of note-taking? A meta-analytic examination. Contemporary Educational Psychology, 30, 242–262. doi:10.1016/j.cedpsych.2004.10.001 Kruger, J., & Dunning, D. (1999). Unskilled and unaware of it: How difficulties in recognizing one’s own incompetence lead to inflated self-assessments. Journal of Personality and Social Psychology, 77, 1121–1134. Kwakman, K. (2003). Factors affecting teachers’ participation in professional learning activities. Teaching and Teacher Education, 19, 149–170. doi:10.1016/S0742- 051X(02)00101-4 Langendyk, V. (2006). Not knowing that they do not know: Self-assessment accuracy of third-year medical students. Medical Education, 40, 173–179. doi:10.1111/j.1365- 2929.2005.02372.x Leachman, M., & Mai, C. (2014). Most states funding schools less than before the recession. Washington, DC: Center on Budget and Policy Priorities. Retrieved from www.cbpp.org Learning Forward. (2014). The professional learning associations’s website. Retrieved from www.learningforward.org Learning Forward. (2014). Standards home page. Retrieved October 10, 2015, from http://learningforward.org/standards#.VhqsUmRViko Lucero, A. (2013). Teachers’ use of linguistic scaffolding to support the academic language development of first-grade emergent bilingual students. Journal of Early Childhood Literacy, 14, 534–561. doi:10.1177/1468798413512848 Mallette, B., Maheady, L., & Harper, G. F. (1999). The effects of reciprocal peer coaching on preservice general educators’ instruction of students with special learning needs. Teacher Education and Special Education, 22, 201–216. doi:10.1177/088840649902200402 Maxwell, J. A. (2013). Qualitative research design: An interactive approach (3rd ed.). Thousand Oaks, CA: Sage Publications. 125 McGrew, J. H., Bond, G., Dietzen, L., & Salyers, M. (1994). Measuring the fidelity of implementation of a mental health program model. Journal of Consulting and Clinical Psychology, 62, 670–678. McHugo, G. J., Drake, R. E., Whitley, R., Bond, G. R., Campbell, K., Rapp, C. A., Finnerty, M. T. (2007). Fidelity outcomes in the National Implementing Evidence- Based Practices Project. Psychiatric Services, 58, 1279–1284. doi:10.1176/appi.ps.58.10.1279 McKenna, J. W., Flower, A., & Ciullo, S. (2014). Measuring fidelity to improve intervention effectiveness. Intervention in School and Clinic, 5, 1–7. doi:10.1177/1053451214532348 Messick, S. (1994). Validity of psychological assessment: Validation of inferences from persons’ responses and performancesas scientific inquiry into score meaning. Princeton, NJ: Educational Testing Service. Miller, B., Lord, B., & Dorney, J. (1994). Staff development for teachers: A study of configurations and costs in four districts. Newton, MA: Education Development Center. Mitchell, I. (2008). The relationship between teacher behaviours and student talk in promoting quality learning in science classrooms. Research in Science Education, 40, 171–186. doi:10.1007/s11165-008-9106-9 Moncher, F. J., & Prinz, R. J. (1991). Treatment fidelity in outcome studies. Clinical Psychology Review, 11, 247–266. doi:10.1016/0272-7358(91)90103-2 Morgan, G. A., Leech, N. L., Gloeckner, G. W., & Barrett, K. C. (2013). IBM SPSS for introductory statitics: Uses and interpretations (5th ed.). 
New York, NY: Routledge. Morisky, D. E., Green, L. W., & Levine, D. M. (1986). Concurrent and predictive validity of a self-reported measure of medication adherence. Medical Care, 24, 67– 74. doi:10.1097/00005650-198601000-00007 Mortneson, B. P., & Witt, J. C. (1998). The use of weekly performance feedback to increase teacher implemntation of a prereferral academic intervention. School Psychology Review, 27, 613–627. Mowbray, C. T., Bybee, D., Holter, M. C., & Lewandowski, L. (2006). Validation of a fidelity rating instrument for consumer-operated services. American Journal of Evaluation, 27, 9–27. doi:10.1177/1098214005284971 126 Mowbray, C. T., Holter, M. C., Teague, G. B., & Bybee, D. (2003). Fidelity criteria: Development, measurement, and validation. American Journal of Evaluation, 24, 315–340. doi:10.1177/109821400302400303 Nishimura, T. (2014). Effective professional development of teachers: A guide to actualizing inclusive schooling. International Journal of Whole Schooling, 10(1), 19–42. Norris, J. M., & Ortega, L. (2000). Effectiveness of L2 instruction: A research synthesis and quantitative meta-analysis. Language Learning, 50, 417–528. doi:10.1111/0023- 8333.00136 O’Brien, D. G., Stewart, R. A., & Moje, E. B. (1995). Why content literacy is difficult to infuse into the secondary school: Complexities of curriculum, pedagogy, and school culture. Reading Research Quarterly, 30, 442–463. O’Donnell, C. L. (2008). Defining, conceptualizing, and measuring fidelity of implementation and its relationship to outcomes in K–12 curriculum intervention research. Review of Educational Research, 78, 33–84. doi:10.3102/0034654307313793 Odden, A., Archibald, S., Fermanich, M., & Gallagher, H. A. (2012). A cost framework for professional development. Journal of Education Finance, 28, 51–74. Opfer, V. D. (2011). Conceptualizing teacher professional learning. Review of Educational Research, 81, 376–407. doi:10.3102/0034654311413609 Orwin, R. G. (2000). Assessing program fidelity in substance abuse health services research. Addiction, 95, S309–S327. doi:10.1080/09652140020004250 Ouimet, J. A., Bunnage, J. C., Carini, R. M., Kuh, G. D., & Kennedy, J. (2004). Using focus groups, expert advice, and cognitive interviews to establish the validity of a college student survey. Research in Higher Education, 45, 233–250. doi:10.1023/B:RIHE.0000019588.05470.78 Polikoff, M. S., McEachin, A. J., Wrabel, S. L., & Duque, M. (2013). The waive of the future? School accountability in the waiver era. Educational Researcher, 43, 45–54. doi:10.3102/0013189X13517137 Polit, D. F., & Beck, C. T. (2006). The content validity index: Are you sure you know what’s being reported? Critique and redommentations. Research in Nursing & Health, 29, 489–497. doi:10.1002/nur Race to the Top Act of 2011, Pub. L. No. H.R. 1532 (2014). 127 Rosenshine, B. (1987). Explicit teaching and teacher training. Journal of Teacher Education, 38(3), 34–36. doi:10.1177/002248718703800308 Saad, L. (2014). U.S. teachers offer split decision on Common Core. Washington, DC:Gallup Retrieved from http://www.gallup.com/poll/178892/teachers-offer-split- decision-common-core.aspx?version=print Sankar, A., Golin, C., Simoni, J. M., Luborsky, M., & Pearson, C. (2006). How qualitative methods contribute to understanding combination antiretroviral therapy adherence. Journal of Acquired Immune Deficiency Syndromes (1999), 43 Suppl 1, S54–S68. doi:10.1097/01.qai.0000248341.28309.79 Scheeler, M. C., Ruhl, K. L., & McAfee, J. K. (2004). 
Providing performance feedback to teachers: A review. Teacher Education and Special Education, 27, 396–407. doi:10.1177/088840640402700407 Sheen, R. (2002). “Focus on form” and “focus on forms.” English Language Teaching, 56, 303–305. Shymansky, J. A., Wang, T.-L., Annetta, L. A., Yore, L. D., & Everett, S. A. (2010). How much professional development is needed to effect positive gains in K–6 student achievement on high stakes science tests? International Journal of Science and Mathematics Education, 10, 1–19. doi:10.1007/s10763-010-9265-9 Singh, S., & Fletcher, K. E. (2014). A qualitative evaluation of geographical localization of hospitalists: How unintended consequences may impact quality. Journal of General Internal Medicine, 29, 1009–1016. doi:10.1007/s11606-014-2780-6 Skaalvik, E. M., & Skaalvik, S. (2010). Teacher self-efficacy and teacher burnout: A study of relations. Teaching and Teacher Education, 26, 1059–1069. doi:10.1016/j.tate.2009.11.001 Slabine, N. A. (2011). Evidence of effectivness. Oxford, OH: Learning Forward. Sugai, G., & Horner, R. H. (2002). Introduction to the special series on Positive Behavior Support in schools. Journal of Emotional and Behavioral Disorders, 10, 130–135. doi:10.1177/10634266020100030101 Supovitz, J. A., & Turner, H. M. (2000). The effects of professional development on science teaching practices and classroom culture. Journal of Research in Science Teaching, 37, 963–980. doi:10.1002/1098-2736(200011)37:9<963::AID- TEA6>3.0.CO;2-0 128 Tan, L., Sun, X., & Khoo, S. T. (2014). Can engagement be compared? Measuring academic engagement for comparison. In International Conference on Educational Data Mining (pp. 213–216). Retrieved from http://educationaldatamining.org/EDM2014/uploads/procs2014/short papers/213_EDM-2014-Short.pdf Tavakol, M., & Dennick, R. (2011). Making sense of Cronbach’s alpha. International Journal of Medical Education, 2, 53–55. doi:10.5116/ijme.4dfb.8dfd Thorndike, R. M., & Throndike-Christ, T. (2011). Measurement and Evaluation in Psychology and Education (8th ed.). Upper Saddle River, NJ: Pearson Education. Tompkins, G., Campbell, R., Green, D., & Smith, C. (2014). Literacy for the 21st century (2nd ed.). Melborne: Pearson Australia. Torrey, W. C., Lynde, D. W., & Gorman, P. (2005). Promoting the implementation of practices that are supported by research: The national implementing evidence-based practice project. Child and Adolescent Psychiatric Clinics of North America, 14, 297–306. U.S. Department of Education. (2012). ESEA Flexibility. Washington, DC. Vacca, R. T., & Vacca, J. A. L. (1989). Content area reading. Glenview, IL: Scott, Foresman. Vaden-Kiernan, M., Jones, D. H., & McCann, E. (2009). Latest eveidence on the National Staff Development Council’s Standards Assessment Inventory. Austin, TX. Valdés, G., Kibler, A., & Walqui, A. (2014). Changes in the expertise of ESL professionals: Knowledge and action in an era of new standards. Alexandria, VA. Teachers of English to Speakers of Other Languages (TESOL). Wagner, B. D., & French, L. (2010). Motivation, work satisfaction, and teacher change among early childhood teachers. Journal of Research in Childhood Education, 24, 152–171. doi:10.1080/02568541003635268 Warren, C. A. B. (2002). Qualitative interviewing. In J. Gubrium, J. Holstein, Handbook of interview research: Context and method (pp. 230–258). Thousand Oaks, CA: Sage Publications. Webster-Stratton, C., Reinke, W. M., Herman, K. C., & Newcomer, L. L. (2011). 
The incredible years teacher classroom management training: The methods and principles that support fidelity of training delivery. School Psychology Review, 40, 509–529. 129 Weiss, M. J., Bloom, H. S., & Brock, T. (2013, June). A conceptual framework for studying the sources of variation in program effects. MDRC Working Papers on Research Methodology. Retrieved from http://mdrc.org/sites/default/files/a- conceptual_framework_for_studying_the_sources.pdf Wolery, M. (2011). Intervention research: The importance of fidelity measurement. Topics in Early Childhood Special Education, 31, 155–157. doi:10.1177/0271121411408621 Yeaton, W. H., & Sechrest, L. (1981). Critical dimensions in the choice and maintenance of successful treatments: Strength, integrity, and effectiveness. Journal of Consulting and Clinical Psychology, 49, 156–67. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/7217482 Zvoch, K. (2009). Treatment fidelity in multisite evaluation: A multilevel longitudinal examination of provider adherence status and change. American Journal of Evaluation, 30, 44–61. doi:10.1177/1098214008329523 Zvoch, K. (2012). How does fidelity of implementation matter? Using multilevel models to detect relationships between participant outcomes and the delivery and receipt of treatment. American Journal of Evaluation, 33, 547–565. doi:10.1177/1098214012452715