
Impact of school inspections on teaching and learning in primary and secondary education in the Netherlands
Technical report ISI-TL project, year 1-3 data

M. Ehren and N. Shackleton
Institute of Education, University of London
June 2014
Grant number: 511490-2010-LLP-NL-KA1-KA1SCR

Contents

1. Introduction and theoretical framework
2. Research design
   2.1 Sampling in primary education
   2.2 Sampling in secondary education
   2.3 Data collection
   2.4 Additional data on secondary schools
   2.5 Data analysis
   2.6 Handling missing data
3. Cross-sectional results
   3.1 Descriptive statistics
   3.2 Differences in responses year 1-3
   3.3 Differences in responses between teachers and principals
   3.4 Differences in responses of primary and secondary schools
   3.5 Differences between inspected and non-inspected schools
4. Testing changes over time: principal data
   4.1 Change in the scale scores over time for principals
   4.2 Changes in the scale scores over time for principals by inspection category
   4.3 Longitudinal path models: principals
   4.4 Autoregressive modelling
5. Testing changes over time: teacher data
   5.1 Change in the scale scores over time for teachers
   5.2 Changes in the scale scores over time for teachers by inspection category
   5.3 Longitudinal path models: teachers
6. Principal and teacher data set comparison
   6.1 Longitudinal path model comparing teacher and principal responses
   6.2 Testing differences in changes in the scales over time by job type
7. Administrative data from secondary schools
   7.1 Summary statistics of the additional information on secondary schools
   7.2 Impact of inspection category on changes over time in student achievement
   7.3 Impact of inspection category on changes over time in other indicators
Summary
References
Appendix A. Sampling and response rates
Appendix B. Full list of coefficients for path model 1 on principal data set
Appendix C. Full list of coefficients for path model 2 (including inspection arrangement) on principal data set
Appendix D. Full list of coefficients for path model 3 on teacher data set
Appendix E. Full list of coefficients for path model 4 on teacher data set

1. Introduction and theoretical framework

School inspection is used by most European education systems as a major instrument for controlling and promoting the quality of schools. Growing evidence indicates that school inspections can be a key lever in the improvement of schools. The standards inspectorates use to assess educational quality, teaching and learning in schools during inspection visits, the sanctions for failing schools and the rewards for well-functioning schools stimulate and pressure schools to meet nationally defined targets and objectives. This has been shown to lead both to improvement of educational quality in schools and to improved student achievement. School inspections may, however, also lead to unintended negative consequences for teaching and learning, for example when schools implement procedures and protocols that have no effect on primary processes in the school but are only implemented to receive a positive inspection evaluation.

As school inspections are widely used to improve schools, it is of great importance to gain more knowledge about the in-school processes which take place between the inspection and the ultimate goal of improved student performance. Current research provides limited insight into how school inspections drive improvement of schools and which types of approaches are most effective and cause the fewest unintended consequences. The study presented in this report intends to expand this knowledge base by comparatively studying responses of principals in primary and secondary education (and, in the Netherlands, also teachers in primary and secondary education and school boards in primary education) to school inspections in six European countries (the Netherlands, England, Ireland, Sweden, Austria/Styria and the Czech Republic) during three consecutive years. These countries all have different inspection models, enabling us to compare specific types of school inspections and analyse which of these inspections are most effective.

This report includes the results of the three years of data collection and analyses in the Netherlands. The following research questions will be addressed:

1. What are the effects and negative consequences of school inspections in terms of changes in effective school conditions and innovation capacity of primary and secondary schools in the Netherlands?
2. What aspects of school inspections (standards and thresholds, sanctions and rewards, and frequency of visits) contribute to these changes?

The theoretical framework used to study these questions and guide our data collection builds on assumptions about how school inspections are expected to lead to improvement of schools in each of the participating countries. To reconstruct these assumptions, policy and inspection documents were analysed in each country and interviews were held with inspection officials to describe the mechanisms through which each Inspectorate aims to affect school improvement. The assumptions describe the causal mechanisms through which school inspections are supposed to lead to the improvement of schools, linking school inspections to their intended outcomes of improved teaching and learning in each country. From these assumptions we derived the intermediate processes, mechanisms and intended outcomes that are common to all six countries. These common processes, mechanisms and intended outcomes were included in our theoretical framework, which is presented in figure 1.
The first part of the framework, on the left, includes a number of variables describing how school inspections, their criteria and procedures in general, and the feedback given during inspection visits are expected to enable schools and their stakeholders to align their views/beliefs and expectations of good education and good schools with the standards in the inspection framework, particularly with respect to those standards the school failed to meet during the latest inspection visit. Schools are expected to act on these views and expectations and to use the inspection feedback when conducting self-evaluations and when taking improvement actions. Stakeholders should use the inspection standards, or rather the inspection assessment of the school's functioning against these standards (as publicly reported), to take actions that will motivate the school to adapt its expectations and to improve.

Self-evaluations by schools are expected to build their capacity to improve, which will lead to more effective teaching and learning conditions. Likewise, improvement actions will (when successfully implemented) lead to more effective school and teaching conditions. These conditions are expected to result in high student achievement. Figure 1 summarizes these mechanisms and presents the theoretical framework of our data collection. A more detailed description of the conceptual model was published in:

Ehren, M.C.M., Altrichter, H., McNamara, G. and O'Hara, J. (2013). Impact of school inspections on teaching and learning: describing assumptions on causal mechanisms in seven European countries. Educational Assessment, Evaluation and Accountability, 25(1), 3-43. http://dx.doi.org/10.1007/s11092-012-9156-4

Figure 1. Intended effects of school inspections. (Path diagram linking inspection methods, standards, thresholds and feedback, consequences, and public reporting, via setting expectations, accepting feedback, promoting/improving self-evaluations, taking improvement actions, and actions of stakeholders, to high improvement capacity, highly effective school and teaching conditions, and good education/high student achievement.)

The following section outlines the research design used to study our theoretical framework.

2. Research design

The research design used to study the theoretical framework is a (three-year) longitudinal design in which the variables in the model are measured in six European countries, using a survey of principals. In the Netherlands an additional survey was administered to teachers in primary and secondary schools. Three years of data collection enables us to observe change patterns in schools during a full inspection cycle. The survey was administered in September-November 2011 (year 1), 2012 (year 2) and 2013 (year 3). We will explore whether schools have different patterns of change before or after school inspections, whether these patterns are linear or non-linear, whether there is a consistent pattern of change, whether different types of schools in different inspection categories experience different patterns of change, and which outcomes change at which moment in time.

2.1 Sampling in primary education

A two-stage sampling design was used to select primary schools and teachers for our study. Our sampling design builds on the categories the Inspectorate of Education uses in its early warning analysis (basic, Zwak/weak, Zeer Zwak/very weak). Schools in these categories are confronted with different inspection treatments (basic: no visit; weak and very weak: visits and increased monitoring) and we expect them to respond differently to the variables in our survey. We used the results from the early warning analysis in 2010 to select schools from different inspection categories. We included 408 primary schools and, in each school, three teachers from grades 3, 5 and 8. These teachers face different stakes in implementing changes in response to school inspections, as particularly students' test scores in grade 8 are part of the inspection measures. Schools in the weak and very weak inspection treatment categories were oversampled to ensure sufficient response rates. Schools that had not been assigned to an inspection treatment, or that were not included in the early warning analysis due to failures in the information provided to the Inspectorate (595 schools in total), were excluded from the sample.

Table 1 below and the tables in appendix A provide an overview of the target population, the target sample and the response rates for each year of data collection of schools and teachers. Response rates are relatively low (particularly in years 1 and 3), but non-response of both principals and teachers is similar across the different inspection treatment categories.

Table 1. Target population and target sample of primary schools

| Inspection category | Schools: target population (total number meeting inclusion criteria) | Schools: target sample (% of target population) | Teachers: target population (all teachers in grades 3, 5 and 8)¹ | Teachers: target sample (1 teacher each in grades 3, 5 and 8) |
| No risk | 4773 | 165 (3.46%) | 23865 | 495 (2.07%) |
| Unknown risk | 795 | 83 (10.44%) | 3975 | 249 (6.26%) |
| Risk | 608 | 83 (13.65%) | 3040 | 249 (8.19%) |
| Serious risk | 557 | 80 (14.36%) | 2785 | 240 (8.62%) |
| Schools assigned to basic inspection treatment | 6703 | 208 (3.10%) | 33515 | 624 (1.86%) |
| Schools assigned to weak schools inspection treatment | 366 | 152 (41.53%) | 1830 | 456 (24.92%) |
| Schools assigned to very weak schools inspection treatment | 61 | 51 (83.61%) | 305 | 153 (50.16%) |
| Total | 6638 | 411 (6.19%) | 33356 | 1233 (3.70%) |

¹ Estimate of the number of teachers, based on 2010 data from www.stamos.nl. The estimate assumes an average of 13.4 teachers per primary school and an average of 5 teachers for grades 3, 5 and 8; the average is based on a total of 96,937 fulltime teachers in a total population of 7,233 schools.

2.2 Sampling in secondary education

A two-stage sampling design was also used to sample schools and teachers in secondary education, using the results from the early warning analysis of the Inspectorate of Education in 2010. Only the 548 HAVO and VWO departments of secondary schools were included in our study. HAVO and VWO departments that were not included in the early warning analysis of the Inspectorate, or that had not been assigned to an inspection arrangement, were considered out of scope². The target population of secondary schools was therefore set at 454 schools (each including both a HAVO and a VWO department). The target sample included almost all HAVO and VWO departments in the three inspection treatments, to reach sufficient response rates. Due to the limited number of schools in the very weak inspection treatment, all schools in this category were included in the sample. Within these schools, teachers from the lower grades and from the final grade who teach Dutch language or Geography were included in the sample. These teachers face different stakes, as students' test scores in the final grade are part of the inspection measures.

Tables 2 and 3 provide an overview of the target population and the target sample. Tables 4 and 5 show the response rates of schools, principals and teachers. More information on the sample and the response rates can be found in appendix A. Response rates in year 1 are very low (approximately 5% for both principals and teachers) and even lacking for schools and teachers in the very weak inspection treatment. The results for secondary education should therefore be interpreted with great caution.

² Selection date: May 2011.

Table 2. Target population and target sample of secondary schools
(Target population: total number of schools in the country; target sample: percentage of target population.)

| Inspection category | Target population HAVO | Target population VWO | Target sample HAVO | Target sample VWO |
| No risk | 261 | 262 | 183 (70.11%) | 184 (70.22%) |
| Risk | 151 | 105 | 135 (89.40%) | 88 (83.81%) |
| Weak | 42 | 87 | 41 (97.62%) | 87 (100%) |
| Schools assigned to basic inspection treatment | 416 | 357 | 321 (77.16%) | 262 (73.39%) |
| Schools assigned to weak schools inspection treatment | 33 | 91 | 33 (100%) | 91 (100%) |
| Schools assigned to very weak schools inspection treatment | 5 | 6 | 5 (100%) | 6 (100%) |
| Total | 454 | 454 | 359 (79.10%) | 359 (79.10%) |

Table 3. Target population and target sample of secondary teachers³
(Target population: total number of teachers in the country; target sample: 4 teachers in each department, as a percentage of the target population.)

| Inspection category | Target population HAVO | Target population VWO | Target sample HAVO | Target sample VWO |
| No risk | 5220 | 5240 | 732 (13.26%) | 736 (14.04%) |
| Risk | 3020 | 2100 | 540 (17.88%) | 352 (16.76%) |
| Weak | 840 | 1740 | 164 (19.52%) | 348 (20%) |
| Schools assigned to basic inspection treatment | 8320 | 7140 | 1284 (15.43%) | 1048 (14.68%) |
| Schools assigned to weak schools inspection treatment | 660 | 1820 | 132 (20%) | 364 (20%) |
| Schools assigned to very weak schools inspection treatment | 100 | 120 | 20 (20%) | 24 (20%) |
| Total | 9080 | 9080 | 1436 (15.81%) | 1436 (15.81%) |

³ Using the TALIS 2008 estimate of 20 teachers per department (see technical report, sampling frame), http://www.oecd.org/dataoecd/16/14/44978960.pdf

Response rates

Table 4. Number of responses within each year for primary education

| | Schools | Principals | Teachers |
| Target sample | 411 | 411 | 1233 (grades 3, 5 and 8) |
| Year 1 | 96 | 73 | 140 (grade 3: 48; grade 5: 41; grade 8: 51) |
| Year 2 | 166 | 136 | 203 (grade 3: 70; grade 5: 66; grade 8: 67) |
| Year 3 | 117 | 76 | 123 (grade 3: 51; grade 5: 39; grade 8: 33) |
| Overlap | 1 response: 148; 2 responses: 79; 3 responses: 24 | | |
| Total number of schools | 251 | | |

Table 5. Number of responses within each year for secondary education

| | Schools | Principals | Teachers |
| Target sample | 359 | 359 | 1436 (lower grades, final examination year) |
| Year 1 | 40 | 15 | 85 (lower grades: 32; final examination year: 49) |
| Year 2 | 100 | 62 | 189 (lower grades: 88; final examination year: 101) |
| Year 3 | 95 | 55 | 126 (lower grades: 55; final examination year: 71) |
| Overlap | 1 response: 113; 2 responses: 40; 3 responses: 16 | | |
| Total number of schools | 161 | | |

2.3 Data collection

Data collection consisted of a survey of principals and teachers in both primary and secondary education. The framework for the three questionnaires was (to a large extent) similar and included four sets of variables, as described in our theoretical framework: background characteristics of schools (only administered in year 1), outcome variables, intermediate processes, and inspection measures (only administered in years 2 and 3).

Background characteristics of schools

The questionnaire starts with a number of questions on background characteristics of schools and principals that are expected to be relevant for the changes schools make in response to school inspections, and for the results of these changes in student achievement, such as the location of the school in a rural or urban area, the composition of the student population, the experience of the principal and his/her tasks, and the resources in the school. Schools in more rural areas may have different stakes in acting on inspection findings compared to schools in a more urban (and perhaps more competitive) environment. The composition of the student population will affect the achievement level of students in the school and perhaps also the type and amount of improvements that need to be implemented to perform well on the indicators in the inspection rubric. More experienced principals are expected to be better prepared for the inspection visit (as they know what to expect) and perhaps also better able to build capacity in, and improve, the school. We also ask questions about the extent to which principals are responsible for pedagogical tasks in the school, as this may affect the type of improvements they implement in response to school inspections. Questions about the (financial and human) resources in the school are linked to specific inspection outcomes and are placed under the heading of inspection measures. Items on the background characteristics of schools were inspired by items from the TALIS and PIRLS surveys, adapted to fit the context of this study.

Outcome variables

The second part of the questionnaire includes items on the outcome variables in our conceptual framework: capacity to improve and effective school and teaching conditions. Questions about these variables are framed in terms of the time principals have spent during the previous academic year changing the school's functioning in these areas (using a 5-point scale ranging from 'much less time' to 'much more time'), as well as the school's status and functioning on these variables (a 5-point scale ranging from 'strongly disagree' to 'strongly agree'). Our choice to include change as one of our outcome variables is motivated by the expectation that principals will not be able to estimate status reliably and consistently over time, which would make the status data too noisy to allow identification of effects of school inspections. The growth modelling techniques we will be using to analyse the results are, however, designed to capture change over time and favour measuring status in each year. We therefore decided to also include questions measuring status in improvement capacity/capacity-building of schools and in effective school and teaching conditions. Additionally, we included questions about unintended consequences of school inspections. These variables are described in more detail below.
Effective school and teaching conditions include conditions related to the school organization and management, such as educational leadership, a productive climate and culture, and an achievement-oriented school policy. These conditions are expected to contribute to and facilitate effective teaching and instruction and, as a result, lead to high student achievement. Teaching/instruction conditions include what a teacher does to create effective learning environments and to boost learning (Scheerens et al., 2009). The sub-variables for effective school and teaching conditions included in the questionnaire are opportunity to learn and learning time, achievement orientation, clear and structured teaching, and a safe and stimulating learning climate. The indicators described in Scheerens et al. (2010, p. 51) were used to formulate items on opportunity to learn, achievement orientation and an orderly learning environment. The ICALT questionnaire, which was developed by Inspectorates of Education in several European countries to comparatively measure the quality of teaching and

learning, was used to develop items on clear and structured teaching, challenging teaching approaches, and a safe and stimulating learning climate.

Capacity-building refers to the school's capacity to improve. A school with a high innovation capacity is one which is capable of implementing change. This type of school is, according to Reezigt (2001), experienced at reflecting on its functioning and at changing and improving. Participation in decision-making, cooperation between teachers and transformational leadership are important factors in the school's capacity to improve and are therefore included as sub-variables in the questionnaire. Items to measure these sub-variables were inspired by the Dutch School Improvement Questionnaire (see Geijsel et al., 2009).

Unintended consequences are potential side effects of school inspections. Potential unintended consequences at the school level include the extent to which school inspections lead to a narrowing of curricula and instructional processes in the school, the extent to which principals experience inspections as an administrative burden, and whether they manipulate the documents and data they send to the Inspectorate. Potential unintended consequences at the teaching level are referred to as teaching to the test (relevant when the Inspectorate uses standardized tests of student achievement in its measures of schools) and teaching to inspection. Teaching to the test includes items measuring the extent to which teachers (narrowly) align their teaching to tested topics and item formats; teaching to inspection refers to alignment of teaching to the indicators in the inspection framework used by school inspectors during lesson observations. Items measuring unintended consequences at the school level were inspired by the NFER survey evaluation of the impact of Section 5 inspections (2007). Items in the teacher survey on teaching to the test were inspired by the RAND (2007) study 'Standards-Based Accountability Under No Child Left Behind: Experiences of Teachers and Administrators in Three States'⁴ and Koretz and Hamilton's (2003) CSE study 'Teachers' Responses to High-Stakes Testing and the Validity of Gains: A Pilot Study'⁵. Items on teaching to inspection were adapted from the NFER survey evaluation of the impact of Section 5 inspections (2007).

In year 1 these questions on unintended consequences were only administered to schools that had received an inspection visit in the previous year, while in years 2 and 3 they were administered to all schools. They are placed at the end of the questionnaire to make sure that they do not bias responses of principals and teachers on the other two outcome variables (effective school and teaching conditions and capacity to improve). The items on teaching to the test and teaching to inspection were administered to all teachers in primary education, while items on teaching to the test were only administered to teachers in the final examination grade in secondary education (as teachers in the lower grades do not face standardized tests used by the Inspectorate).

Intermediate processes

The third part of the questionnaire includes questions about the intermediate processes that precede our outcome variables: setting of expectations, acceptance of feedback, promoting self-evaluations, and stakeholder sensitivity.
Setting of expectations refers to the extent to which schools use the inspection standards to guide their work and to define the school's goals and directions in working towards the inspection standards of a good school. Acceptance of feedback relates to the assessment of the school and the suggestions for improvement provided during an inspection visit. Promoting self-evaluations refers to the effective implementation of internal systems of evaluation and self-review, and the use of inspection standards and expectations of adequate self-evaluation to conduct self-evaluations and to implement self-evaluation systems. Stakeholder sensitivity (particularly of parents and school boards) includes parents using inspection findings to choose a school and stakeholders voicing necessary improvements to the school. Questions on intermediate processes were inspired by the NFER survey evaluation of the impact of Section 5 inspections (2007).

⁴ http://www.rand.org/pubs/monographs/2007/rand_mg589.pdf
⁵ http://ipea.hmdc.harvard.edu/files/ipea/koretz_and_hamilton_2003_r610.pdf

Inspection measures

The fourth and final part of the data collection includes information about inspection measures, such as the type of inspection visit to the school, the methods of data collection used during inspection visits, the standards used to assess schools, the feedback provided to the school, the assessment of the school, sanctions and rewards to schools, and how inspection findings are reported to stakeholders. The first year of data collection only included one questionnaire item on the occurrence of an inspection visit in the previous year, plus data from the Inspectorate on the inspection arrangement to which schools were assigned. Our aim was to collect additional information on inspection measures by analysing inspection databases. Such additional data collection, however, proved to be impossible in some countries and also made it difficult to compare schools across countries. The principal survey in 2012 and 2013 therefore includes a small set of additional questions to measure these inspection measures.

2.4 Additional data on secondary schools

The national non-profit organization Vensters voor Verantwoording provided us with additional data about relevant output indicators for all the secondary schools in our target sample. The sample for this analysis includes 301 secondary schools. Table 6 outlines the data available.

Table 6. Secondary data available on secondary schools (all secondary schools in target sample)

| Indicator | 2009-2010 | 2010-2011 | 2011-2012 | 2012-2013 |
| Number of students | ✓ | ✓ | ✓ | ✓ |
| Number of students in exams per profile | ✓ | ✓ | ✓ | |
| Average grade school exam per subject | ✓ | ✓ | ✓ | |
| Average grade central exam per subject | ✓ | ✓ | ✓ | |
| Throughput lower grades | ✓ | ✓ | ✓ | |
| Throughput upper grades | ✓ | ✓ | ✓ | |
| Number of students in / not in APC (poverty area) | ✓ | ✓ | ✓ | ✓ |
| Percentage of students in / not in APC (poverty area) | ✓ | ✓ | ✓ | ✓ |
| Average student satisfaction | ✓ | ✓ | ✓ | ✓ |
| Average parent satisfaction | ✓ | ✓ | ✓ | ✓ |
| Scheduled teaching hours per department (HAVO/VWO) per year | ✓ | ✓ | ✓ | |
| Taught hours per department (HAVO/VWO) per year | ✓ | ✓ | ✓ | |
| Percentage sick leave | ✓ (2009) | ✓ (2010) | ✓ (2011) | |
| Number of external evaluations | | | ✓ | ✓ |

We have secondary data on the majority of these schools (266; 88%) at all four time points (the year prior to the survey, year 1, year 2 and year 3). Table 7 shows the patterns of missingness within the secondary school data. Within the column labelled 'pattern', a 1 represents a response at that time point and a '.' represents a missing value; the response pattern '.1..', for example, refers to schools that only have a response at one time point, namely the second.

Table 7. The pattern of missingness for additional data of secondary schools

| Freq. | Percent | Cum. | Pattern |
| 266 | 88.37 | 88.37 | 1111 |
| 13 | 4.32 | 92.69 | .1.. |
| 3 | 1.00 | 93.69 | ..11 |
| 3 | 1.00 | 94.68 | .11. |
| 3 | 1.00 | 95.68 | 1... |
| 2 | 0.66 | 96.35 | ...1 |
| 2 | 0.66 | 97.01 | ..1. |
| 2 | 0.66 | 97.67 | .111 |
| 2 | 0.66 | 98.34 | 11.. |
| 5 | 1.66 | 100.00 | (other patterns) |
| 301 | 100.00 | | XXXX |

A low proportion of schools is categorised in inspection category Zwak (weak) or Zeer Zwak (very weak). In year 1, 17 schools (6%) were classified as Zwak or Zeer Zwak; in year 2, 12 schools (4%); and in year 3, 7 schools (3%). In total, 24 schools were categorised as Zwak or Zeer Zwak. Some of these schools were categorised as Zwak or Zeer Zwak on more than one occasion, i.e. they were in the same category for either two or three years of the survey.
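A pattern summary like table 7 is straightforward to reproduce from a wide-format data set. The sketch below is a minimal pandas illustration; the column names and the simulated data are placeholders rather than the project's actual variables.

```python
import numpy as np
import pandas as pd

# Toy wide-format data: one row per school, one column per time point;
# NaN marks a time point with no data. All names/values are illustrative.
waves = ["y2009_10", "y2010_11", "y2011_12", "y2012_13"]
rng = np.random.default_rng(0)
values = rng.choice([1.0, np.nan], size=(301, len(waves)), p=[0.9, 0.1])
df = pd.DataFrame(values, columns=waves)

# Encode each school's response pattern as in table 7:
# '1' = observed, '.' = missing, one character per time point.
pattern = df[waves].notna().apply(
    lambda row: "".join("1" if ok else "." for ok in row), axis=1
)

# Tabulate frequencies, percentages and cumulative percentages.
summary = pattern.value_counts().to_frame("freq")
summary["percent"] = 100 * summary["freq"] / summary["freq"].sum()
summary["cum"] = summary["percent"].cumsum()
print(summary)
```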

2.5 Data analysis

We calculated descriptive statistics and used t-tests and ANOVA on the cross-sectional data to compare inspected and non-inspected schools, to compare responses of teachers and principals, and to compare responses of primary and secondary schools. Additionally, we used HLM to test the relations in our model and changes across the three years. Information about the construction of the scales can be found on the project website: www.schoolinspections.eu

2.6 Handling Missing Data

The analysis presented here is restricted to those schools which responded to at least two time points of data collection. Maximum likelihood missing data techniques are used on this subsample to account for missingness. Listwise deletion is not an option with these data, as it would result in a sample size of 16-18 schools; many more schools provided information on more than one occasion, and this information can be included in the analysis.

Enders and Bandalos (2001) describe the maximum likelihood approach as follows. A casewise likelihood function is computed using only those variables that are observed for case i; the casewise likelihood of the observed data is obtained by maximising this function, shown in equation 1:

\log L_i = K_i - \tfrac{1}{2}\log\lvert\Sigma_i\rvert - \tfrac{1}{2}(x_i - \mu_i)'\,\Sigma_i^{-1}(x_i - \mu_i)   (Equation 1)

where K_i is a constant that depends on the number of complete data points for case i, x_i is the observed data for case i, and \mu_i and \Sigma_i contain the parameter estimates of the mean vector and covariance matrix, respectively, for the variables that are complete for case i. The casewise likelihood functions are accumulated across the entire sample and maximised, as shown in equation 2:

\log L(\mu, \Sigma) = \sum_{i=1}^{N} \log L_i   (Equation 2)

All available data are utilised during parameter estimation. The algorithm does not impute the missing values, but the borrowing of information from the observed portion of the data is analogous to replacing missing data points with the conditional expectation of the missing value, given the values of the other variables in the model. The results are unbiased as long as the missing data can be assumed to be missing completely at random (MCAR) or missing at random (MAR) (Enders and Bandalos, 2001; Newman, 2003).

Data can be considered MCAR if, as the name suggests, the missingness is entirely random, but this is rarely the case. If data were MCAR there would be no patterning in the missingness, so that everybody, or in this case every school, had an equal chance of non-response. Data can be assumed MAR if the missingness is correlated with other variables included in the analysis (Howell, 2012); that is, conditional on the responses to other variables, the missingness is random. The other variables in the model provide information about the missingness, specifically about the marginal distributions of the incomplete variables. Where the assumptions of MAR are met, the estimates will be unbiased.

A much more difficult scenario occurs when data are missing not at random (MNAR). Data are MNAR when, even after accounting for all the available observed information, the probability of an observation being missing still depends on the unseen observations themselves. The maximum likelihood technique will yield biased estimates when data are MNAR. However, it has been suggested that when data are MNAR, maximum likelihood techniques will result in less bias than listwise deletion or other methods for dealing with missingness (Enders, 2010; Graham, 2009).
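As a concrete illustration of equations 1 and 2, the sketch below evaluates the accumulated casewise log-likelihood for a data matrix containing missing values. It is a minimal numpy rendering of the estimation principle, not the software used for the analyses reported here; in practice the maximisation over \mu and \Sigma is left to SEM or mixed-model software, or to a generic optimiser.

```python
import numpy as np

def fiml_loglik(data, mu, sigma):
    """Accumulated casewise log-likelihood of equations 1 and 2.

    data  : (n, p) array, with np.nan marking missing entries
    mu    : (p,) candidate mean vector
    sigma : (p, p) candidate covariance matrix
    """
    total = 0.0
    for row in data:
        obs = ~np.isnan(row)                # variables observed for case i
        if not obs.any():
            continue                        # case contributes no information
        x_i, mu_i = row[obs], mu[obs]
        sigma_i = sigma[np.ix_(obs, obs)]   # submatrix for observed variables
        k_i = -0.5 * obs.sum() * np.log(2 * np.pi)  # the constant K_i
        _, logdet = np.linalg.slogdet(sigma_i)
        resid = x_i - mu_i
        total += k_i - 0.5 * logdet \
                 - 0.5 * resid @ np.linalg.solve(sigma_i, resid)
    return total

# Tiny usage example with two variables and partially observed cases.
data = np.array([[1.0, 2.0], [0.5, np.nan], [np.nan, 1.5]])
mu = np.array([0.8, 1.8])
sigma = np.array([[0.5, 0.1], [0.1, 0.4]])
print(fiml_loglik(data, mu, sigma))
```

Note how the second and third cases still contribute to the likelihood through their observed variable alone, which is exactly why listwise deletion can be avoided.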

3. Cross-sectional results

3.1 Descriptive statistics

The descriptive statistics provide some evidence of changes in the scales over time. The bar charts in figures 2 and 3 summarise changes in the mean score on the scales over time for the principals (figure 2) and the teachers (figure 3). The descriptive statistics give some indication that responses to the scales have not been the same in each year, and that changes over time have been different for principals and teachers. For example, the first bar chart of figure 2, referring to setting expectations, suggests that responses to the scale have become lower over time, i.e. respondents are less inclined to agree that inspections lead to the setting of expectations. The same pattern is found in the principal and teacher data sets, although it is more pronounced in the teacher data set. Without formal testing, these descriptive statistics can only be indicative; changes over time are therefore tested formally in the next section using ANOVA and hierarchical linear modelling.

Figure 2. Summary of descriptive statistics for changes in the scales over time in the principals data set. (Bar charts by year for: setting expectations, accepting feedback, stakeholder sensitivity, change in self-evaluations, feedback on teaching conditions, change in school effectiveness, capacity building, and school effectiveness.)
Note: the scales 'feedback on capacity building' and 'feedback on effective school and teaching conditions' were only administered to principals in years 2 and 3 of the survey.

Figure 3. Summary of descriptive statistics for changes in the scales over time in the teachers data set. (Bar charts by year 1-3 for: school effectiveness, satisfaction with inspection, change in capacity building, change in school effectiveness, accepting feedback, and stakeholder sensitivity.)

Table 8. Means and standard deviations for principals and teachers, years 1-3
(Columns: Year 1, Year 2 and Year 3; within each year: principals PS, SS and teachers PS, SS; cells are mean (SD). PS = primary schools, SS = secondary schools.)

Inspection measures: NA NA NA NA 2,32 (0,62) 2,22 NA NA 2,36 (0,59) 2,25 (0,55) NA NA

Setting expectations / Accepting feedback / Stakeholders sensitive to reports: 3,77 (0,51) 3,88 (0,49) 3,74 (0,46) 3,73 (0,44) 3,88 (0,21) 3,83 (0,28) 3,37 (0,67) 3,94 (0,67) 4,01 (0,69) 3,28 (0,65) 3,78 (0,58) 3,87 (0,71) (0,64) 3,74 (0,53) 3,80 (0,61) 4,03 (0,54) 3,72 (0,43) 3,68 (0,48) 3,58 (0,47) 4,14 (0,66) 4,09 (0,73) 4,12 (0,74) 3,63 (0,44) 4,38 (1,07) 4,40 (0,99) (n=63) 3,69 (0,53) (n=68) 3,93 (0,41) (n=68) 3,63 (0,49) (n=68) (n=41) 3,63 (0,58) (n=48) 3,62 (0,49) (n=47) 3,65 (0,46) (n=47) 4,14 (0,68) (n=116) 4,07 (0,84) (n=116) 4,12 (0,84) (n=116) 3,56 (0,49) (n=120) 4,26 (1,05) (n=120) 4,29 (1,03) (n=120)

Changes in capacity-building / Changes in teacher participation in decision making / Changes in cooperation between teachers / Changes in transformational leadership / Promoting self-evaluations / Changes in school effectiveness / Changes in opportunity to learn: 3,69 (0,54) 3,60 (0,69) 3,93 (0,64) 3,55 (0,63) 4,07 (0,70) 3,74 (0,57) 3,78 (0,71) 3,66 (0,50) 3,46 (0,45) 3,87 (0,61) 3,58 (0,76) 4,17 (0,46) 3,43 (0,15) 3,50 (0,33) 3,73 (0,54) 3,65 (0,73) 3,85 (0,65) 3,78 (0,77) 3,81 (0,75) 3,75 (0,45) 3,78 (0,58) 3,55 (0,53) 3,63 (0,78) 3,75 (0,69) 3,45 (0,69) 3,62 (0,78) 3,44 (0,38) 3,42 (0,52) 3,63 (0,58) 3,77 (0,40) 3,51 (0,73) 3,50 (0,52) 3,89 (0,71) 4,10 (0,54) 3,58 (0,71) 3,83 (0,61) 3,84 (0,64) 3,79 (0,60) 3,70 (0,49) 3,73 (0,44) 3,76 (0,57) 3,78 (0,47) 3,70 (0,51) 3,55 (0,62) 3,72 (0,61) 3,87 (0,88) 3,95 (0,89) 3,80 (0,47) 3,79 (0,55) 3,56 (0,48) 3,56 (0,64) 3,57 (0,65) 3,61 (0,86) 3,73 (0,91) 3,65 (0,43) 3,53 (0,49) 3,60 (0,47) (n=68) 3,49 (0,53) (n=68) 3,68 (0,60) (n=68) 3,62 (0,67) (n=67) 3,69 (0,56) (n=68) 3,61 (0,52) (n=68) 3,66 (0,63) (n=68) 3,70 (0,46) (n=52) 3,58 (0,56) (n=52) 4,03 (0,72) (n=52) 3,59 (0,66) (n=52) 3,65 (0,52) (n=52) 3,65 (0,39) (n=52) 3,64 (0,43) (n=52) 3,60 (0,50) (n=117) 3,48 (0,69) (n=117) 3,59 (0,62) (n=117) 3,77 (0,93) (n=117) 3,82 (0,85) (n=117) 3,74 (0,43) (n=116) 3,72 (0,53) (n=116) 3,56 (0,48) (n=124) 3,52 (0,69) (n=124) 3,56 (0,58) (n=124) 3,68 (0,95) (n=124) 3,79 (1,00) (n=124) 3,62 (0,37) (n=123) 3,44 (0,45) (n=123)

Table 8 (continued)
(Columns: Year 1, Year 2 and Year 3; within each year: principals PS, SS and teachers PS, SS; cells are mean (SD).)

Changes in assessments of students / Changes in assessment of staff and school / Changes in clear and structured teaching: 3,86 (0,72) 3,56 (0,56) 3,75 (0,74) 3,38 (0,43) 3,54 (0,40) 3,31 (0,44) 3,82 (0,63) 3,93 (0,95) 3,56 (0,50) 3,22 (0,35) 3,71 (0,81) 3,45 (0,53) 3,81 (0,72) 3,70 (0,68) 3,42 (0,58) 3,70 (0,61) 3,75 (0,65) 3,71 (0,60) 3,76 (0,68) 4,02 (1,06) 3,70 (0,59) 3,51 (0,56) 3,95 (1,01) 3,66 (0,57) 3,60 (0,63) (n=68) 3,48 (0,61) (n=68) 3,67 (0,61) (n=68) 3,81 (0,67) (n=52) 3,62 (0,60) (n=52) 3,58 (0,53) (n=51) 3,62 (0,58) (n=116) 4,04 (1,07) (n=116) 3,64 (0,52) (n=116) 3,47 (0,53) (n=123) 4,17 (1,11) (n=123) 3,54 (0,49) (n=123)

Capacity-building: 4,07 (0,37) 4,20 (0,48) 3,80 (0,54) 4,22 (0,38) 4,02 (0,36) 4,18 (0,50) 3,69 (0,61) 4,24 (0,38) (n=72) 3,99 (0,34) (n=52) 4,14 (0,51) (n=119) 3,78 (0,57) (n=125)

School effectiveness: 4,01 (0,49) 3,45 (0,48) 4,11 (0,50) 3,64 (0,49) 4,05 (0,41) 3,56 (0,40) 4,21 (0,39) 3,71 (0,41) 4,04 (0,38) (n=68) 3,56 (0,38) (n=52) 4,18 (0,39) (n=119) 3,77 (0,39) (n=124)

Note: PS = primary schools; SS = secondary schools.

3.2 Differences in responses year 1-3

One-way ANOVA was used to analyse changes in the responses of schools across the three years of data collection, and to test whether these changes increase over time as schools implement more changes. The results, shown in table 9, indicate differences in how schools report about school inspections setting expectations (positive trend), stakeholder sensitivity (negative trend), changes in capacity-building and in cooperation between teachers (negative trend), and improvements in self-evaluation (negative trend). Schools also report differences in accepting feedback and in improving school effectiveness, but these differences are not linear.

Table 9. Differences in responses of schools between the three years of data collection

| Scale | Significant differences between three years | Linear trend |
| Inspection Measures (Year 2/3) | F = 0.13 (1, 257) | F = 0.13 (1, 257) |
| Setting Expectations | F = 37.05** (2, 1149) | F = 29.22** (1, 1149) |
| Accepting Feedback | F = 4.11* (2, 971) | F = 3.73 (1, 971) |
| Stakeholders sensitive to reports | F = 6.41** (2, 1083) | F = 12.55** (1, 1083) |
| Change in Capacity Building | F = 3.73* (2, 1219) | F = 7.32** (1, 1219) |
| Change in Teacher Participation in Decision Making | F = 2.48 (2, 1218) | F = 4.52* (1, 1218) |
| Change in Cooperation between Teachers | F = 6.13** (2, 1213) | F = 12.25** (1, 1215) |
| Change in Transformational Leadership | F = 2.07 (2, 1177) | F = 1.50 (1, 1177) |
| Promoting Self-Evaluations | F = 7.83** (2, 1169) | F = 15.52** (1, 1169) |
| Changes in School Effectiveness | F = 4.13* (2, 1200) | F = 0.93 (1, 1200) |
| Changes in Opportunity to Learn | F = 3.62* (2, 1197) | F = 2.75 (1, 1197) |
| Changes in Assessment of Students | F = 2.39 (2, 1197) | F = 1.31 (1, 1197) |
| Changes in Clear and Structured Teaching | F = 6.36** (2, 1198) | F = 0.35 (1, 1198) |
| Capacity Building | F = 2.41 (2, 1239) | F = 3.13 (1, 1239) |
| School Effectiveness | F = 0.37 (2, 1211) | F = 0.00 (1, 1211) |

Note: reported as F(dfM, dfR); * p < .05, ** p < .01.
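The pairing of an omnibus one-way ANOVA with a planned linear-trend contrast, as in table 9, can be sketched as follows. The group means and sizes below are simulated stand-ins, not the survey data.

```python
import numpy as np
from scipy import stats

# Simulated stand-ins for one scale's scores in each survey year.
rng = np.random.default_rng(1)
groups = [rng.normal(m, 0.6, n)
          for m, n in ((3.8, 400), (3.7, 450), (3.6, 300))]

# Omnibus one-way ANOVA, df = (k - 1, N - k), as in the first column.
f_omni, p_omni = stats.f_oneway(*groups)

# Linear-trend contrast with coefficients (-1, 0, 1), tested against the
# pooled within-group mean square, df = (1, N - k), as in the second column.
n = np.array([len(g) for g in groups])
means = np.array([g.mean() for g in groups])
c = np.array([-1.0, 0.0, 1.0])
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
ms_within = ss_within / (n.sum() - len(groups))
ss_trend = (c @ means) ** 2 / (c ** 2 / n).sum()
f_trend = ss_trend / ms_within
p_trend = stats.f.sf(f_trend, 1, n.sum() - len(groups))
print(f"omnibus F = {f_omni:.2f} (p = {p_omni:.3f}); "
      f"linear trend F = {f_trend:.2f} (p = {p_trend:.3f})")
```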

3.3 Differences in responses between teachers and principals

We used t-tests to compare differences between teachers and principals, combining all three years of data. The results of these t-tests are shown in table 10. Teachers report significantly higher scores for accepting feedback and for stakeholder sensitivity to reports. Principals, on the other hand, report significantly higher scores for the school's capacity and for some unintended consequences of school inspections.

Table 10. Differences in responses between teachers and principals (three years of data combined)

| Scale | All principals: mean (SD) (n) | All teachers: mean (SD) (n) | t (df) |
| Inspection measures (year 2/3) | 2,34 (0,58) (n=209) | 2,14 (0,65) (n=49) | |
| Setting expectations | 3,76 (0,57) (n=348) | 3,73 (0,68) (n=802) | |
| Accepting feedback | 3,89 (0,57) (n=333) | 4,12 (0,87) (n=720) | -5,05** (926) |
| Stakeholders sensitive to reports | 3,71 (0,51) (n=345) | 3,93 (0,75) (n=764) | -5,67** (947) |
| Improving self-evaluations | 3,79 (0,65) (n=374) | 3,75 (0,80) (n=815) | |
| Improving capacity-building | 3,66 (0,52) (n=381) | 3,60 (0,51) (n=839) | |
| Improving school effectiveness | 3,67 (0,46) (n=378) | 3,65 (0,44) (n=823) | |
| Capacity to improve | 4,14 (0,43) (n=391) | 3,99 (0,57) (n=849) | 5,00** (976) |
| Effective school and teaching conditions | 3,90 (0,45) (n=377) | 3,96 (0,49) (n=835) | |
| Q46. Narrowing of teaching methods | 2,35 (0,87) (n=350) | 2,19 (0,84) (n=798) | 2,88** (1146) |
| Q47. Narrowing curriculum and instructional strategies | 2,90 (1,02) (n=349) | 2,70 (1,00) (n=797) | 3,04** (1144) |
| Q48. Refocusing curriculum and teaching and learning strategies | 3,32 (0,92) (n=284) | 3,34 (0,99) (n=251) | |
| Q49. Documents/facts and figures present a more positive picture of the quality of our school | 1,89 (0,81) (n=313) | 1,98 (0,80) (n=631) | |
| Q50. Putting protocols and procedures in writing and gathering documents and data | 2,87 (1,06) (n=314) | 2,65 (0,94) (n=630) | 3,03** (566) |

Note: t-value (df) reported only for significant differences; * p < .05, ** p < .01; a positive value indicates higher responses of principals.
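A two-group comparison like those in table 10 takes a single call in scipy. The groups below are simulated with sizes loosely matching the accepting-feedback row; whether the report used pooled or unequal-variance tests is not stated, so the equal_var setting here is an assumption.

```python
import numpy as np
from scipy import stats

# Simulated stand-ins sized like the accepting-feedback row of table 10.
rng = np.random.default_rng(2)
principals = rng.normal(3.89, 0.57, 333)
teachers = rng.normal(4.12, 0.87, 720)

# Welch's unequal-variance t-test; equal_var=True would instead give the
# classic pooled test with df = n1 + n2 - 2.
t, p = stats.ttest_ind(principals, teachers, equal_var=False)
print(f"t = {t:.2f}, p = {p:.4f}")
```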

3.4 Differences in responses of primary and secondary schools

We also tested for differences between primary and secondary schools, combining all three years of data (including principals and teachers). The results from this analysis are presented in table 11. Principals and teachers in primary schools report significantly higher scores for setting expectations, improvement of self-evaluations, improvement of capacity-building, and improvement of school effectiveness. Primary schools also report higher scores for the innovation capacity of the school and the effectiveness of the school, as well as for unintended consequences of school inspections. These differences may, however, result from higher response rates in primary education, although the smaller scale of primary schools (compared to secondary schools) would suggest that school inspections potentially have a higher impact there.

Table 11. Differences between primary and secondary schools (three years of data combined)

| Scale | Primary schools (principals and teachers): mean (SD) (n) | Secondary schools (principals and teachers): mean (SD) (n) | t (df) |
| Inspection measures | 2,33 (0,61) (n=173) | 2,23 (0,59) (n=85) | |
| Setting expectations | 3,86 (0,69) (n=667) | 3,57 (0,54) (n=483) | -7,88** (1140) |
| Accepting feedback | 4,00 (0,65) (n=617) | 4,10 (0,96) (n=436) | |
| Stakeholders sensitive to reports | 3,83 (0,62) (n=658) | 3,90 (0,78) (n=451) | |
| Improving self-evaluations | 3,83 (0,73) (n=698) | 3,68 (0,78) (n=491) | -2,30** (1187) |
| Improving capacity-building | 3,65 (0,53) (n=711) | 3,58 (0,48) (n=509) | -2,21* (1149) |
| Improving school effectiveness | 3,71 (0,47) (n=699) | 3,59 (0,40) (n=502) | -4,59** (1167) |
| Capacity to improve | 4,20 (0,46) (n=723) | 3,81 (0,55) (n=517) | -13,11** (990) |
| Effective school and teaching conditions | 4,13 (0,43) (n=704) | 3,68 (0,42) (n=508) | -18,09** (1210) |
| Q46. Narrowing of teaching methods | 2,30 (0,89) (n=667) | 2,15 (0,78) (n=481) | -3,04** (1105) |
| Q47. Narrowing curriculum and instructional strategies | 2,89 (1,03) (n=666) | 2,58 (0,94) (n=480) | -5,20** (1081) |
| Q48. Refocusing curriculum and teaching and learning strategies | 3,53 (0,90) (n=347) | 2,95 (0,95) (n=188) | -7,01** (533) |
| Q49. Documents/facts and figures present a more positive picture of the quality of our school | 1,83 (0,81) (n=543) | 2,12 (0,78) (n=401) | 5,50** (942) |
| Q50. Putting protocols and procedures in writing and gathering documents and data | 2,80 (1,04) (n=543) | 2,62 (0,91) (n=401) | -2,74** (915) |

Note: t-value (df) reported only for significant differences; * p < .05, ** p < .01; a positive value indicates higher responses of secondary schools.

3.5 Differences between inspected and non-inspected schools

Additionally, we compared schools that were inspected in the year prior to the survey to schools that were not inspected. Inspected schools are primarily schools in the weak and very weak inspection categories, but may also include schools that were inspected for other reasons (e.g. as part of the random sample for the annual report on the state of education, or because they had not had a visit for more than four years). As there are so few schools in the weak and very weak inspection categories, we did not test for differences between schools in these different inspection categories.

The results, shown in table 12, indicate that inspected schools report significantly higher scores for setting expectations. In year 2 (due to higher response rates), inspected schools also report higher scores for accepting feedback, stakeholder sensitivity, improvement of self-evaluations, improvement of capacity-building and improvement of the school's effectiveness. Inspected schools in years 1 and 2 also report more unintended consequences, particularly in discouraging teachers from experimenting with new teaching methods and in the narrowing and refocusing of the curriculum.

Table 12. Comparing inspected and non-inspected schools

| Scale | Year 1: inspected | Year 1: non-inspected | Year 2: inspected | Year 2: non-inspected | Year 3: inspected | Year 3: non-inspected |
| Setting expectations | 3,57 (0,58) (n=163) | 3,09 (0,66) (n=84); t = -5,56** (245) | 3,83 (0,54) (n=286) | 3,70 (0,53) (n=248); t = -2,82** (532) | 3,79 (0,51) (n=132) | 3,59 (0,52) (n=101); t = -2,94** (231) |
| Accepting feedback | 3,85 (0,53) (n=153) | NA | 3,99 (0,61) (n=268) | 3,78 (0,53) (n=217); t = -3,93** (483) | 3,82 (0,55) (n=123) | 3,69 (0,67) (n=78) |
| Stakeholders sensitive to reports | 3,86 (0,58) (n=160) | 3,78 (0,62) (n=83) | 3,83 (0,57) (n=276) | 3,70 (0,59) (n=221); t = -2,64** (495) | 3,72 (0,68) (n=125) | 3,59 (0,71) (n=83) |
| Improving self-evaluations | 3,90 (0,77) (n=159) | 3,77 (0,71) (n=113) | 3,84 (0,71) (n=278) | 3,62 (0,71) (n=240); t = -3,44** (516) | 3,61 (0,68) (n=130) | 3,52 (0,70) (n=90) |
| Improving capacity-building | 3,71 (0,58) (n=164) | 3,59 (0,48) (n=116) | 3,68 (0,53) (n=288) | 3,54 (0,48) (n=253); t = -3,19** (539) | 3,51 (0,47) (n=135) | 3,51 (0,47) (n=104) |
| Improving school effectiveness | 3,69 (0,49) (n=164) | 3,53 (0,42) (n=116); t = -2,79** (278) | 3,75 (0,49) (n=286) | 3,61 (0,39) (n=250); t = -3,65** (534) | 3,60 (0,39) (n=133) | 3,54 (0,38) (n=104) |
| Capacity-building | 4,10 (0,51) (n=165) | 4,08 (0,52) (n=116) | 4,01 (0,60) (n=291) | 4,02 (0,50) (n=258) | 3,90 (0,57) (n=136) | 4,02 (0,56) (n=105) |
| School effectiveness | 3,96 (0,58) (n=163) | 3,88 (0,48) (n=114) | 3,96 (0,47) (n=290) | 3,95 (0,46) (n=255) | 3,95 (0,46) (n=135) | 4,00 (0,43) (n=105) |
| Q46. Narrowing of teaching methods | 2,32 (0,86) (n=164) | 1,99 (0,81) (n=84); t = -2,96** (246) | 2,32 (0,87) (n=286) | 2,25 (0,86) (n=245) | 2,22 (0,84) (n=131) | 2,04 (0,82) (n=101) |
| Q47. Narrowing curriculum and instructional strategies | 2,92 (1,03) (n=163) | 2,63 (0,95) (n=84); t = -2,14* (245) | 2,72 (0,96) (n=286) | 2,74 (0,99) (n=244) | 2,69 (1,00) (n=131) | 2,82 (1,14) (n=101) |

| Q48. Refocusing curriculum and teaching and learning strategies | 3,49 (1,03) (n=164) | 3,23 (0,88) (n=84); t = -2,13* (192) | 3,34 (0,98) (n=93) | 3,23 (0,86) (n=62) | NA | NA |
| Q49. Documents/facts and figures present a more positive picture of the quality of our school | 2,02 (1,13) (n=46) | NA | 1,95 (0,83) (n=286) | 1,95 (0,76) (n=244) | 2,71 (0,94) (n=131) | 2,79 (0,95) (n=100) |
| Q50. Putting protocols and procedures in writing and gathering documents and data | 3,35 (1,18) (n=46) | NA | 2,52 (0,96) (n=285) | 2,77 (0,95) (n=245) | 2,71 (0,94) (n=131) | 2,79 (0,95) (n=100) |

Note: t-value (df) reported only for significant differences; * p < .05, ** p < .01; a negative value indicates higher responses of inspected schools.

4. Testing changes over time: principal data There are different options for analysing changes over time. The most popular methods are repeated measures ANOVA and variants of hierarchical linear modelling (HLM, also known as multilevel modelling, random effects modelling or mixed modelling). Given the flexibility of the HLM approach, the amount of data available and the small number of time points, the decision was made to use HLM. HLM provides a much more flexible and powerful tool than repeated measures ANOVA and can handle missing data with maximum likelihood methods (Quene and van den Bergh, 2004). Hierarchical linear modelling takes into account that the observations are not independent: observations from the same school are repeated over time. Another way to think about this is that time points are nested within schools. Due to the small sample sizes and the complexity of the models, the analysis is conducted on the predicted scale scores. Principal and teacher responses (combined for primary and secondary education) are initially analysed separately, because the t-tests suggested that teachers and principals gave systematically different answers to the questionnaire. Analysing the data from principals and teachers separately was also beneficial because in some cases only teachers, or only principals, from a school responded to the survey. In some cases several teachers from a school responded, in other cases only one, and it was not possible to accurately identify whether the same teachers responded to the survey on more than one occasion. The decision was therefore made to analyse the average teacher response by collapsing teachers' responses into a single score per school. In the final section of the report the collapsed teacher responses and the principal responses are compared, to identify the extent to which changes over time differed by the role of the responder. The aim of this analysis is to consider longitudinal changes in the schools over time. If a school only responded at one time point of the survey, it is not possible to estimate changes over time for that school. There are sophisticated techniques available for dealing with missing data, which will be discussed later, but these would not be recommended in this case due to the large amount of missing data. There were 285 schools where a principal responded at least once to the survey and 317 schools where at least one teacher responded. However, there were only 16 (principal) to 18 (teacher) schools that responded in all three years. For the principal data set there were nevertheless responses from 93 schools where the principal responded in at least two years of the survey; this pattern of response is highlighted in table 13 below. For the teacher data set there were 105 schools where teachers responded in at least two years of the survey; this response pattern is shown in table 14. The most frequent patterns of missingness are presented in the first rows of tables 13 and 14 below. The column labelled pattern refers to the years of data collection for which the school has a response: 1 means the school responded within that year of data collection, and . means the school is missing data for that year. By restricting the sample to the schools that responded in at least two years of the survey, we are able to make better estimates of what the schools' responses might have been in the one year in which they did not respond.
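As a concrete illustration of the chosen approach, the random-intercept specification (time points nested within schools) can be written in a few lines. A minimal sketch in Python with statsmodels, assuming a long-format DataFrame df with one row per school-year and hypothetical columns school, year (1-3) and the scale score; the report does not state which software its own analyses used:

    import pandas as pd
    import statsmodels.formula.api as smf

    def fit_change_over_time(df: pd.DataFrame, scale: str):
        """Random intercept per school; year enters as a categorical fixed
        effect with year 1 as the reference category, mirroring tables 16-23."""
        model = smf.mixedlm(f"{scale} ~ C(year)", data=df, groups="school")
        # Maximum likelihood (rather than REML) handles the unbalanced panel
        # and allows likelihood-based comparison of nested models.
        return model.fit(reml=False)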
It is important to recognise that schools that responded to more than one sweep of the survey probably differ in their characteristics from schools that responded to only one sweep. A comparison of the samples on different observable characteristics is provided in table 15. Despite the reduction in sample size, the two samples do not differ greatly on most of the observable characteristics. However, there are more schools from small towns in the longitudinal sample, and the schools in the longitudinal sample have fewer children living in poverty.

Table 13. The pattern of missingness within the longitudinal principal sample
Frequency  Percent  Pattern
36         38%      1 1 .
33         36%      . 1 1
16         17%      1 1 1
8          9%       1 . 1
93         100%

Table 14. The pattern of missingness within the longitudinal teacher sample
Frequency  Percent  Pattern
39         37%      . 1 1
35         33%      1 1 .
18         17%      1 1 1
13         12%      1 . 1
105        100%
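The response patterns in tables 13 and 14 can be tabulated directly from the long-format data. A sketch with hypothetical column names (school, year):

    import pandas as pd

    def missingness_patterns(df: pd.DataFrame) -> pd.DataFrame:
        """Count response patterns: '1' = responded in that year, '.' = missing."""
        responded = pd.crosstab(df["school"], df["year"]) > 0
        pattern = responded.apply(
            lambda row: " ".join("1" if v else "." for v in row), axis=1)
        out = pattern.value_counts().rename("frequency").to_frame()
        out["percent"] = (100 * out["frequency"] / out["frequency"].sum()).round()
        return out

Restricting the result to patterns with at least two responses reproduces the denominators of 93 (principals) and 105 (teachers) used in the longitudinal analyses.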

Table 15. Comparing the observable characteristics of the sample used for longitudinal analysis with the full sample obtained across the three years.

Characteristic                                      Full sample      Longitudinal sample
n                                                   285 principals   93 principals
Inspection arrangement year 1 (%)
  basis                                             74.82            76.29
  Zwak                                              20.74            19.59
  Zeer Zwak                                         4.43             4.12
Inspection arrangement year 2 (%)
  basis                                             77.72            75.77
  Zwak                                              17.72            17.53
  Zeer Zwak                                         4.56             6.70
Inspection arrangement year 3 (%)
  basis                                             91.94            88.21
  Zwak                                              7.36             10.77
  Zeer Zwak                                         0.70             1.03
Experience working as principal (year 1) (%)
  0-1 year                                          6.25             8.93
  1-2 years                                         2.50             3.57
  3-4 years                                         15.00            14.29
  5-6 years                                         7.50             5.36
  7+ years                                          68.75            67.86
Size of town in which school located (year 1) (%)
  Fewer than 3,000                                  27.50            35.71
  3,001 to 15,000                                   33.75            30.36
  15,001 to 50,000                                  18.75            17.86
  50,001 to 100,000                                 8.75             8.93
  100,001 to 500,000                                10.00            7.14
  Over 500,000                                      1.25             0.00
Area in which school located (year 1) (%)
  Urban                                             21.25            21.43
  Suburban                                          17.50            16.07
  Suburban in a metropolitan area                   5.00             5.36
  Rural                                             56.25            57.14
Percentage of students from economically disadvantaged backgrounds (year 1) (%)
  0%                                                11.63            10.17
  0-10%                                             55.81            62.71
  11-25%                                            15.12            13.56
  26-50%                                            8.14             8.47
  Over 50%                                          9.30             5.08
Secondary school characteristics
  Mean number of students in school, year 1         819.50           812.95
  Mean parental satisfaction, year 1                7.05             7.18
  Percentage of students living in poverty, year 1  10.33            2.98
  Mean student satisfaction, year 1                 6.80             6.99

4.1 Change in the scale scores over time for principals
This section analyses the change in the scale scores over time for principals (teachers follow in section 5). The results for each scale are presented in tables 16-23. For each scale a table of the results is provided, followed by a brief description and interpretation. Following the formal test of changes in the scales over time, section 4.2 will explore differences in change for the schools in different inspection categories.

Setting expectations
Table 16. Testing changes in setting expectations over time (reference category: year 1)
Setting Expectations  Estimate  S.E.  z      p     95% CI
year 2                -0.59     0.11  -5.36  0.00  [-0.80, -0.37]
year 3                -0.69     0.12  -5.71  0.00  [-0.92, -0.45]
Intercept              4.78     0.10  45.64  0.00  [ 4.58,  4.99]
sigma_u = 0.42; sigma_e = 0.45; rho = 0.47
Test of the difference between year 2 and year 3: chi2(1) = 1.27, p = 0.26

There was a significant decrease in setting expectations scores between year 1 and year 2 (Est=-0.59, z=-5.36, p<0.001) and between year 1 and year 3 (Est=-0.69, z=-5.71, p<0.001). However, there was no difference in the scores between year 2 and year 3 (Est=-0.10, chi2(1)=1.27, p=0.26). In year 1 the average predicted response across schools was 4.78, in year 2 it was 4.19, and in year 3 it was 4.09. Rho represents the similarity in responses within schools over time: it is the between-school variance divided by the total variance. High levels of dependency in the data would result in high rho values. Here the rho value of 0.47 suggests that the variance in setting expectations is approximately equally explained by changes within schools and differences between schools.
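The rho reported under each table is an intraclass correlation and can be recovered from the two standard deviations shown. A one-line check, using the values from table 16:

    def icc(sigma_u: float, sigma_e: float) -> float:
        """Between-school variance as a share of total variance."""
        return sigma_u**2 / (sigma_u**2 + sigma_e**2)

    print(round(icc(0.42, 0.45), 2))  # 0.47, matching rho in table 16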

Accepting Feedback
Table 17. Testing changes in accepting feedback over time (reference category: year 1)
Accepting Feedback  Estimate  S.E.  z      p     95% CI
year 2              0.09      0.10  0.89   0.37  [-0.11, 0.29]
year 3              0.22      0.11  2.02   0.04  [ 0.01, 0.44]
Intercept           3.94      0.09  43.03  0.00  [ 3.76, 4.12]
sigma_u = 0.25; sigma_e = 0.43; rho = 0.26
Test of the difference between year 2 and year 3: chi2(1) = 2.61, p = 0.11

There was a significant increase in scores for accepting feedback between year 1 and year 3 (est=0.22, z=2.02, p=0.04), but the differences in scores between year 1 and year 2, and between year 2 and year 3, were not statistically significant. Nevertheless, the direction of the change across the three years is the same. In year 1 the average predicted score across schools was 3.94, in year 2 it was 4.03, and in year 3 it was 4.16. The rho value here suggests that there is a high degree of variability within schools over time; indeed, about 75% of the variance in accepting feedback is within schools.

Stakeholder Sensitivity
Table 18. Testing changes in stakeholder sensitivity over time (reference category: year 1)
Stakeholder Sensitivity  Estimate  S.E.  z      p     95% CI
year 2                   -0.43     0.09  -4.83  0.00  [-0.60, -0.25]
year 3                   -0.47     0.10  -4.84  0.00  [-0.65, -0.28]
Intercept                 4.28     0.08  52.88  0.00  [ 4.12,  4.43]
sigma_u = 0.27; sigma_e = 0.38; rho = 0.33
Test of the difference between year 2 and year 3: chi2(1) = 0.30, p = 0.59

There was a significant decrease in scores for stakeholder sensitivity between year 1 and year 2 (est=-0.43, z=-4.83, p<0.001) and between year 1 and year 3 (est=-0.47, z=-4.84, p<0.001); however, the difference between year 2 and year 3 was not statistically significant. In year 1 the average predicted score across schools was 4.28, in year 2 it was 3.85, and in year 3 it was 3.81. The value of rho indicates that one third of the variation in the stakeholder sensitivity score is between schools and two thirds is within schools.
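The "test of the difference between year 2 and year 3" lines are single-degree-of-freedom chi-squared tests that the two year effects are equal. With a fitted result from the earlier sketch, such a contrast could be tested along these lines (the parameter names follow statsmodels' default treatment coding for a C(year) term and are an assumption):

    def year2_vs_year3(result):
        """Wald chi2(1) test that the year-2 and year-3 effects are equal."""
        return result.wald_test("C(year)[T.2] = C(year)[T.3]")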

Improving self-evaluations
Table 19. Testing changes in improving self-evaluations over time (reference category: year 1)
Self-Evaluations  Estimate  S.E.  z      p     95% CI
year 2            -0.22     0.09  -2.37  0.02  [-0.40, -0.04]
year 3            -0.37     0.11  -3.48  0.00  [-0.58, -0.16]
Intercept          4.10     0.08  49.89  0.00  [ 3.94,  4.26]
sigma_u = 0.38; sigma_e = 0.49; rho = 0.37
Test of the difference between year 2 and year 3: chi2(1) = 2.67, p = 0.10

There is a significant decrease in scores on improving self-evaluations between year 1 and year 2 (est=-0.22, z=-2.37, p=0.02) and between year 1 and year 3 (est=-0.37, z=-3.48, p<0.001). The difference in scores between year 2 and year 3 was significant only at the 10 percent level (est=-0.15). In year 1 the predicted mean score on the improving self-evaluations scale was 4.10, in year 2 it was 3.88, and in year 3 it was 3.73. Again, rho indicates that approximately one third of the variation in the scores for improving self-evaluations is between schools and two thirds is within schools.

Change in capacity building
Table 20. Testing changes in improving capacity building over time (reference category: year 1)
Change in Capacity Building  Estimate  S.E.  z      p     95% CI
year 2                       0.02      0.08  0.27   0.79  [-0.14, 0.18]
year 3                       0.02      0.10  0.19   0.85  [-0.17, 0.21]
Intercept                    3.81      0.08  50.03  0.00  [ 3.66, 3.96]
sigma_u = 0.38; sigma_e = 0.44; rho = 0.44
Test of the difference between year 2 and year 3: chi2(1) = 0.00, p = 0.96

There was no evidence for changes in scores on improvements in capacity building over time.

Change in school effectiveness
Table 21. Testing changes in improving school effectiveness over time (reference category: year 1)
Change in School Effectiveness  Estimate  S.E.  z      p     95% CI
year 2                           0.05     0.08   0.66  0.51  [-0.10, 0.21]
year 3                          -0.08     0.09  -0.84  0.40  [-0.25, 0.10]
Intercept                        3.92     0.07  54.68  0.00  [ 3.78, 4.06]
sigma_u = 0.35; sigma_e = 0.41; rho = 0.42

Test of the difference between year 2 and year 3: chi2(1) = 2.71, p = 0.10

The difference between scores on improvements in school effectiveness in years 2 and 3 was borderline significant (est=-0.13). In year 1 the predicted mean score for improvements in school effectiveness across schools was 3.92, in year 2 it was 3.97, and in year 3 it was 3.84.

School Effectiveness
Table 22. Testing changes in school effectiveness over time (reference category: year 1)
School Effectiveness  Estimate  S.E.  z      p     95% CI
year 2                 0.13     0.09   1.38  0.17  [-0.05, 0.31]
year 3                -0.01     0.10  -0.09  0.93  [-0.21, 0.20]
Intercept              4.27     0.08  51.43  0.00  [ 4.10, 4.43]
sigma_u = 0.39; sigma_e = 0.47; rho = 0.40
Test of the difference between year 2 and year 3: chi2(1) = 2.24, p = 0.13

There was no evidence for changes in scores on school effectiveness over time.

Capacity Building
Table 23. Testing changes in capacity building over time (reference category: year 1)
Capacity Building  Estimate  S.E.  z      p     95% CI
year 2             0.38      0.08  4.93   0.00  [0.23, 0.54]
year 3             0.18      0.09  2.03   0.04  [0.01, 0.35]
Intercept          4.83      0.07  71.91  0.00  [4.70, 4.96]
sigma_u = 0.27; sigma_e = 0.42; rho = 0.29
Test of the difference between year 2 and year 3: chi2(1) = 7.03, p = 0.008

Scores on capacity building differed significantly between all three years. They were lowest in year 1 (4.83) and highest in year 2 (5.20), dropping to 5.01 in year 3.

4.2 Changes in the scale scores over time for principals by inspection category
The section above assessed whether there were changes in the scales over time for principals in all schools. Here the impact of the inspection category upon the responses of the principals to the scales within the survey over time is considered. In the Dutch system there are three inspection categories: basis, Zwak and Zeer Zwak. Due to the very small number of schools categorised as Zeer Zwak (2 schools in year 1, 7 schools in year 2, and 1 school in year 3), this category has been combined with the Zwak category to create a binary variable. A main effect of inspection category is included to test whether the initial values of the scales, the intercepts, are influenced by the inspection category of the school. The interaction between time and inspection category is also included, to test whether scores on the scales changed differentially for principals in schools in different inspection categories.

Setting Expectations
Table 24. Testing the changes in setting expectations over time by inspection category (reference categories: year 1, basis)
Setting Expectations  Estimate  S.E.  z      p     95% CI
year 2                -0.78     0.15  -5.17  0.00  [-1.08, -0.49]
year 3                -0.83     0.15  -5.38  0.00  [-1.13, -0.53]
Zwak                  -0.24     0.20  -1.19  0.23  [-0.64,  0.16]
year 2 * Zwak          0.51     0.25   2.05  0.04  [ 0.02,  1.00]
year 3 * Zwak          0.43     0.30   1.45  0.15  [-0.15,  1.02]
Intercept              4.90     0.14  35.28  0.00  [ 4.63,  5.17]
sigma_u = 0.41; sigma_e = 0.45; rho = 0.45
Test of the interaction across 3 time points: chi2(2) = 4.25, p = 0.12

Those in the Zwak category are predicted to score 0.24 points lower on the setting expectations scale at year 1, but this difference is not statistically significant. However, the change between year 1 and year 2 differed by category: the year 2 * Zwak interaction of 0.51 points was statistically significant, meaning that those in inspection group basis experienced significantly larger decreases in the setting expectations scale between years 1 and 2 than those in the Zwak inspection category. For clarification, the interaction is depicted in figure 4. As shown in figure 4, there is a differential gradient in changes in the setting expectations scale between year 1 and year 2, whereas between year 2 and year 3 the slopes follow a similar trajectory.
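The models of section 4.2 add a Zwak main effect and time-by-Zwak interaction terms to the earlier specification. A sketch, including a likelihood-ratio alternative to the report's joint Wald chi2(2) test of the interaction (column names, including the 0/1 indicator zwak combining Zwak and Zeer Zwak, are assumptions):

    import statsmodels.formula.api as smf
    from scipy import stats

    def interaction_test(df, scale: str):
        """Joint test of the two time-by-category interaction terms."""
        full = smf.mixedlm(f"{scale} ~ C(year) * zwak", data=df,
                           groups="school").fit(reml=False)
        reduced = smf.mixedlm(f"{scale} ~ C(year) + zwak", data=df,
                              groups="school").fit(reml=False)
        lr = 2 * (full.llf - reduced.llf)      # likelihood-ratio statistic
        return lr, stats.chi2.sf(lr, df=2)     # 2 df: year2*zwak, year3*zwak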

Figure 4. Visualising the relationship for setting expectations. [Line chart of predicted scores (y-axis: score) across years 1-3 for basis and Zwak/Zeer Zwak schools.] As with all graphs presented in this report, error bars represent 95% confidence intervals.

Accepting Feedback
Table 25. Testing the changes in accepting feedback over time by inspection category (reference categories: year 1, basis)
Accepting Feedback  Estimate  S.E.  z      p     95% CI
year 2              -0.09     0.13  -0.70  0.49  [-0.36,  0.17]
year 3               0.04     0.14   0.30  0.76  [-0.23,  0.31]
Zwak                -0.39     0.18  -2.17  0.03  [-0.75, -0.04]
year 2 * Zwak        0.41     0.22   1.88  0.06  [-0.02,  0.85]
year 3 * Zwak        0.42     0.27   1.58  0.11  [-0.10,  0.94]
Intercept            4.12     0.12  33.71  0.00  [ 3.88,  4.36]
sigma_u = 0.25; sigma_e = 0.41; rho = 0.26
Test of the interaction across 3 time points: chi2(2) = 3.87, p = 0.14

Those in the Zwak inspection category scored significantly lower (0.39 points) at year 1 than those in the basis category: principals in the Zwak category reported lower scores on average for the accepting feedback scale than principals in schools in the basis inspection category. There was also some evidence of differential changes in accepting feedback scores by inspection category between year 1 and year 2. Whilst those in the basis category tended to report slightly lower scores at year 2 compared to year 1 (-0.09), in the Zwak category the average score increased (0.32). The difference in changes between the inspection categories at year 2 was approaching significance (p=0.06). The differences between the inspection categories are plotted in figure 5.

In figure 5 it is possible to see the differential intercepts, or starting values, for the schools in the different inspection categories, and the differential changes over time between year 1 and year 2. Between year 2 and year 3 there are no differences between the groups, shown by the overlapping lines.

Figure 5. Visualising the relationship for accepting feedback. [Line chart of predicted scores (y-axis: score) across years 1-3 for basis and Zwak/Zeer Zwak schools.]

Stakeholder Sensitivity
Table 26. Testing the changes in stakeholder sensitivity over time by inspection category (reference categories: year 1, basis)
Stakeholder Sensitivity  Estimate  S.E.  z      p     95% CI
year 2                   -0.65     0.12  -5.49  0.00  [-0.88, -0.42]
year 3                   -0.67     0.12  -5.54  0.00  [-0.90, -0.43]
Zwak                     -0.40     0.16  -2.54  0.01  [-0.72, -0.09]
year 2 * Zwak             0.54     0.20   2.78  0.01  [ 0.16,  0.93]
year 3 * Zwak             0.56     0.24   2.37  0.02  [ 0.10,  1.02]
Intercept                 4.45     0.11  42.16  0.00  [ 4.25,  4.66]
sigma_u = 0.23; sigma_e = 0.38; rho = 0.27
Test of the interaction across 3 time points: chi2(2) = 8.52, p = 0.01

There were significant differences across the three years for the different inspection categories (chi2(2)=8.52, p=0.01). Principals of schools in the Zwak category tended to report very similar scores on the stakeholder sensitivity scale over time, whereas, on average, principals of schools in the basis category reported a reduction in scores on this scale. In year 1 the basis group had higher scores on average than the Zwak group, but in years 2 and 3 they tended to report lower scores on average than the Zwak group. This interaction is shown in figure 6; the distance between the lines at year 1 highlights the significant difference in the intercept.

Figure 6. Visualising the relationship for stakeholder sensitivity. [Line chart of predicted scores (y-axis: score) across years 1-3 for basis and Zwak/Zeer Zwak schools.]

Promoting Self-Evaluations
Table 27. Testing the changes in self-evaluations over time by inspection category (reference categories: year 1, basis)
Self-Evaluations  Estimate  S.E.  z      p     95% CI
year 2            -0.41     0.11  -3.87  0.00  [-0.62, -0.20]
year 3            -0.47     0.12  -4.01  0.00  [-0.70, -0.24]
Zwak              -0.41     0.18  -2.26  0.02  [-0.76, -0.05]
year 2 * Zwak      0.79     0.23   3.49  0.00  [ 0.35,  1.23]
year 3 * Zwak      0.44     0.29   1.52  0.13  [-0.13,  1.00]
Intercept          4.19     0.09  45.80  0.00  [ 4.01,  4.37]
sigma_u = 0.35; sigma_e = 0.49; rho = 0.34
Test of the interaction across 3 time points: chi2(2) = 12.35, p = 0.002

Joint significance tests reveal a significant interaction between year of survey and inspection category (chi2(2)=12.35, p=0.002): principals responded differently to the promoting self-evaluations scale over time depending upon the inspection category of the school. The interaction is depicted in figure 7. As shown in figure 7, for the inspection category basis, scores tend to decrease between year 1 and year 2 and increase between year 2 and year 3, whereas the exact opposite pattern occurs for schools in inspection category Zwak. By year 3 of the survey the difference between the categories is not statistically significant (year 3 * Zwak: est=0.44, z=1.52, p=0.13), shown in figure 7 by the estimated means of the groups being very similar.

Figure 7. Visualising the relationship for promoting self-evaluations. [Line chart of predicted scores (y-axis: score) across years 1-3 for basis and Zwak/Zeer Zwak schools.]

Improvements in Capacity Building
Table 28. Testing the changes in improvements in capacity building over time by inspection category (reference categories: year 1, basis)
Improvements in Capacity Building  Estimate  S.E.  z      p     95% CI
year 2                             -0.09     0.10  -0.96  0.34  [-0.29, 0.10]
year 3                             -0.04     0.11  -0.33  0.74  [-0.25, 0.18]
Zwak                               -0.19     0.17  -1.12  0.26  [-0.51, 0.14]
year 2 * Zwak                       0.47     0.21   2.23  0.03  [ 0.06, 0.88]
year 3 * Zwak                       0.26     0.28   0.93  0.35  [-0.29, 0.81]
Intercept                           3.85     0.09  44.74  0.00  [ 3.68, 4.02]
sigma_u = 0.35; sigma_e = 0.44; rho = 0.39
Test of the interaction across 3 time points: chi2(2) = 5.02, p = 0.081

Joint significance tests suggest a borderline significant association between inspection category and changes in principals' scores on the improvements in capacity building scale over time (chi2(2)=5.02, p=0.081). At year 2 there is a significant difference in the scores, with principals in schools in inspection category Zwak reporting higher scores on average than those in the basis inspection category (year 2 * Zwak: Est=0.47, z=2.23, p=0.03). There are no statistically significant differences at year 1 or year 3 of the survey. The predicted mean scale scores for the different inspection categories are depicted in figure 8.

Figure 8. Visualising the relationship for improvements in capacity building. [Line chart of predicted scores (y-axis: score) across years 1-3 for basis and Zwak/Zeer Zwak schools.]

Improvements in School Effectiveness
Table 29. Testing the changes in improvements in school effectiveness over time by inspection category (reference categories: year 1, basis)
Improvements in School Effectiveness  Estimate  S.E.  z      p     95% CI
year 2                                 0.02     0.09   0.25  0.80  [-0.15, 0.20]
year 3                                -0.05     0.10  -0.48  0.63  [-0.24, 0.14]
Zwak                                  -0.03     0.16  -0.22  0.83  [-0.35, 0.28]
year 2 * Zwak                          0.07     0.20   0.34  0.73  [-0.33, 0.47]
year 3 * Zwak                         -0.37     0.26  -1.42  0.16  [-0.87, 0.14]
Intercept                              3.94     0.08  49.13  0.00  [ 3.78, 4.10]
sigma_u = 0.36; sigma_e = 0.37; rho = 0.49
Test of the interaction across 3 time points: chi2(2) = 4.21, p = 0.122

There is no evidence for changes in the improvements in school effectiveness scale over time, or that principals in schools of different inspection categories responded differently over time.

School Effectiveness
Table 30. Testing the changes in school effectiveness over time by inspection category (reference categories: year 1, basis)
School Effectiveness  Estimate  S.E.  z      p     95% CI
year 2                 0.14     0.11   1.27  0.20  [-0.07, 0.35]
year 3                 0.03     0.12   0.28  0.78  [-0.20, 0.26]
Zwak                   0.24     0.19   1.29  0.20  [-0.13, 0.61]
year 2 * Zwak         -0.06     0.24  -0.25  0.81  [-0.52, 0.40]
year 3 * Zwak         -0.12     0.30  -0.40  0.69  [-0.70, 0.46]
Intercept              4.20     0.09  44.83  0.00  [ 4.02, 4.39]
sigma_u = 0.38; sigma_e = 0.48; rho = 0.38
Test of the interaction across 3 time points: chi2(2) = 0.16, p = 0.922

There is no evidence for changes in the school effectiveness scale over time, or that principals in schools of different inspection categories responded differently over time.

Capacity Building
Table 31. Testing the changes in capacity building over time by inspection category (reference categories: year 1, basis)
Capacity Building  Estimate  S.E.  z      p     95% CI
year 2              0.40     0.09   4.34  0.00  [ 0.22, 0.58]
year 3              0.18     0.10   1.84  0.07  [-0.01, 0.38]
Zwak                0.03     0.15   0.21  0.83  [-0.27, 0.33]
year 2 * Zwak      -0.07     0.19  -0.36  0.72  [-0.45, 0.31]
year 3 * Zwak      -0.01     0.25  -0.03  0.98  [-0.49, 0.48]
Intercept           4.82     0.08  62.17  0.00  [ 4.67, 4.98]
sigma_u = 0.26; sigma_e = 0.42; rho = 0.28
Test of the interaction across 3 time points: chi2(2) = 0.16, p = 0.922

There is no evidence that principals in schools of different inspection categories responded differently to the capacity building scale over time.

4.3 Longitudinal Path Models: Principals
Longitudinal path models were fit according to the conceptual model, whereby it was expected that the scales accepting feedback, setting expectations and stakeholder sensitivity in year 1 of the survey would go on to influence improvement actions of the school, including promoting self-evaluations, improvements in school effectiveness and improvements in capacity building, in year 2. These improvement actions would then influence the scales of capacity building and school effectiveness in year 3. Two versions of this path model were tested: in the first specification only information on the scales is included in the model, and in the second specification information on the inspection arrangement of the schools is added to the analysis. The full lists of coefficients are presented in appendices B and C respectively; diagrams showing the significant pathways and their directions are provided in figures 9 and 10. Figure 9 shows the results for specification 1. In this figure there are four significant direct pathways, one significant covariance and one significant indirect pathway. There is a positive association between accepting feedback and setting expectations: principals who report a higher score on accepting feedback are also more likely to report a higher score on setting expectations. Promoting self-evaluations in year 2 was positively associated with improvements in school effectiveness and improvements in capacity building: principals who reported higher scores on promoting self-evaluations reported higher scores on average for taking improvement actions in capacity building and school effectiveness. Higher scores on improvements in capacity building in year 2 were associated with lower scores on school effectiveness in year 3. Capacity building is positively associated with school effectiveness at year 3, with higher scores on reported capacity building associated with higher scores on school effectiveness. There is a significant indirect pathway from promoting self-evaluations in year 2 to capacity building in year 3: higher scores on promoting self-evaluations are associated with higher scores on capacity building, because they are associated with higher scores on improvements in capacity building and improvements in school effectiveness. Figure 10 shows the results for specification 2. There are seven significant direct pathways, one significant covariance and two significant indirect effects. The influence of the school's inspection arrangement is controlled for in this specification. Being categorised as Zwak in the first year of the survey was associated with significantly higher scores for improvements in capacity building in year 2. Being categorised as Zwak in the second year of the survey was associated with higher scores on the improvements in capacity building and capacity building scales, and was also indirectly positively associated with school effectiveness in year 3 through capacity building. As in specification 1, there is a positive association between accepting feedback and setting expectations. Higher scores on promoting self-evaluations are associated with higher scores on improvements in school effectiveness and improvements in capacity building. Higher scores on improvements in capacity building in year 2 were associated with lower scores on school effectiveness in year 3.
Capacity building is positively associated with school effectiveness at year 3, with higher scores on reported capacity building associated with higher scores on school effectiveness. However, contrary to specification 1, there is an indirect negative association between promoting self-evaluations in year 2 and school effectiveness in year 3.
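Because the conceptual model is recursive (year-1 reception scales feed year-2 improvement actions, which feed year-3 outcomes), the path model can be approximated equation by equation; a dedicated SEM package would estimate all pathways and indirect effects jointly. A sketch of two of the pathways, with assumed (hypothetical) column names:

    import statsmodels.formula.api as smf

    def fit_path_equations(df):
        """Single-equation estimates of two pathways; an indirect effect is the
        product of the coefficients along a chain of pathways."""
        improvement = smf.ols(
            "improv_capacity_y2 ~ self_evaluations_y2 + accepting_feedback_y1"
            " + setting_expectations_y1 + stakeholder_sensitivity_y1",
            data=df).fit()
        outcome = smf.ols(
            "capacity_building_y3 ~ improv_capacity_y2"
            " + improv_effectiveness_y2 + self_evaluations_y2",
            data=df).fit()
        return improvement, outcome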

Figure 9. Longitudinal path model 1, showing only significant pathways. Direct pathways are shown in black and indirect pathways in blue. The direction of each association is shown with either a + or - sign. See appendix B for the full list of coefficients.

Figure 10. Longitudinal path model 2: taking into account school inspection category, and changes in school inspection category between year 1 and year 2. See appendix C for the full list of coefficients.

4.4 Autoregressive Modelling
Autoregressive modelling allows us to consider what influences changes in the scales over time. It works by including a lag of the dependent variable as an independent variable in the model, so that the variance in the dependent variable that is not explained by the lagged variable can be modelled. To give a simplified example: if I weigh 60kg at time point 1 and 63kg at time point 2, a large amount of the variation in my weight at time point 2 is explained by my previous weight, and by all the things that led to my previous weight; what is left to explain is the 3kg change in my weight between the two time points. Here the same principle is applied to the scales within the analysis, which allows us to consider changes in the scales between years 1 and 2 and between years 2 and 3. Note, however, that this technique relies on the idea that the previous measurement predicts the current measurement, which is not the case for the accepting feedback scale or the improvements in capacity building scale between year 1 and year 2.

What influences changes in the scales between year 1 and year 2
Table 32 shows the results of the autoregressive modelling looking at changes in the scales between year 1 and year 2. The conceptual model was used as an outline for these analyses so that, for example, school effectiveness in year 1 was not hypothesised to have an influence on accepting feedback: the conceptual model has a temporal ordering of the way changes should unfold within schools. Controlling for accepting feedback scores at year 1, setting expectations scores at year 1, and the inspection category of the school in years 1 and 2, stakeholder sensitivity in year 1 has a significant positive association with changes in accepting feedback between year 1 and year 2. However, scores on the accepting feedback scale in year 1 are not related to scores on the accepting feedback scale in year 2. There are no associations for changes in improving self-evaluations or changes in improvements in capacity building between year 1 and year 2. However, improvements in capacity building in year 1 and promoting self-evaluations in year 1 are both associated with changes in improvements in school effectiveness between year 1 and year 2. Higher scores on improvements in capacity building in year 1 are associated with larger increases in improvements in school effectiveness between year 1 and year 2, whereas higher scores on promoting self-evaluations are associated with decreases in improvements in school effectiveness between year 1 and year 2, controlling for the influence of the other scales in the model. This may be because schools that report high levels of promoting self-evaluation do not need to take as many improvement actions as schools that report lower scores on promoting self-evaluations in year 1. For school effectiveness, promoting self-evaluations is also associated with a decrease in school effectiveness over time. This likely reflects a kind of ceiling effect: schools that already score highly on school effectiveness, which are most likely the schools that promote self-evaluations, cannot score much higher on school effectiveness, so there is less change in the measure over time. Stakeholder sensitivity was also associated with larger changes in school effectiveness between year 1 and year 2.
Table 32. Autoregressive modelling looking at changes in the scales from year 1 to year 2

Accepting feedback y2      Estimate  S.E.  z      p     95% CI
accepting feedback y1      0.31      0.23   1.36  0.17  [-0.14, 0.76]
stakeholder sensitivity    0.62      0.28   2.17  0.03  [ 0.06, 1.17]
setting expectations      -0.26      0.19  -1.38  0.17  [-0.62, 0.11]
inspection y1              0.32      0.20   1.60  0.11  [-0.07, 0.72]
inspection y2             -0.18      0.21  -0.86  0.39  [-0.60, 0.23]
Intercept                  1.28      1.39   0.92  0.36  [-1.45, 4.01]

Table 32 (continued)

Self-Evaluation y2         Estimate  S.E.  z      p     95% CI
Self-Evaluation y1         0.40      0.18   2.22  0.03  [ 0.05, 0.75]
accepting feedback y1     -0.03      0.29  -0.10  0.92  [-0.59, 0.53]
stakeholder sensitivity   -0.20      0.39  -0.51  0.61  [-0.96, 0.56]
setting expectations       0.01      0.24   0.03  0.98  [-0.45, 0.47]
inspection y1             -0.12      0.29  -0.43  0.67  [-0.69, 0.44]
inspection y2              0.36      0.24   1.51  0.13  [-0.11, 0.83]
Intercept                  3.10      1.72   1.80  0.07  [-0.27, 6.48]

Improvements in capacity building y2     Estimate  S.E.  z     p     95% CI
Improvements in capacity building y1     0.23      0.18  1.29  0.20  [-0.12, 0.59]
Self-Evaluation y1                       0.08      0.16  0.46  0.65  [-0.24, 0.40]
accepting feedback y1                    0.11      0.30  0.36  0.72  [-0.48, 0.70]
stakeholder sensitivity                  0.15      0.37  0.41  0.68  [-0.57, 0.88]
setting expectations                     0.06      0.22  0.29  0.77  [-0.37, 0.50]
inspection y1                            0.20      0.27  0.76  0.45  [-0.32, 0.73]
inspection y2                            0.12      0.22  0.53  0.59  [-0.32, 0.56]
Intercept                                1.14      1.61  0.71  0.48  [-2.02, 4.30]

Change in School Effectiveness y2        Estimate  S.E.  z      p     95% CI
Change in School Effectiveness y1        0.47      0.22   2.14  0.03  [ 0.04,  0.90]
Improvements in capacity building y1     0.43      0.15   2.93  0.00  [ 0.14,  0.71]
Self-Evaluation y1                      -0.31      0.16  -2.02  0.04  [-0.62, -0.01]
accepting feedback y1                   -0.21      0.27  -0.77  0.44  [-0.73,  0.32]
stakeholder sensitivity                 -0.23      0.37  -0.61  0.54  [-0.96,  0.50]
setting expectations                    -0.06      0.21  -0.31  0.76  [-0.47,  0.34]
inspection y1                           -0.32      0.29  -1.11  0.27  [-0.89,  0.25]
inspection y2                           -0.02      0.21  -0.10  0.92  [-0.44,  0.39]
Intercept                                3.94      1.57   2.51  0.01  [ 0.86,  7.02]

Capacity building y2                     Estimate  S.E.  z      p     95% CI
Capacity Building y1                     0.21      0.11   1.96  0.05  [ 0.00, 0.42]
Change in School Effectiveness y1       -0.16      0.21  -0.77  0.44  [-0.58, 0.25]
Improvements in capacity building y1    -0.04      0.16  -0.27  0.79  [-0.35, 0.27]
Self-Evaluation y1                      -0.03      0.13  -0.19  0.85  [-0.29, 0.24]
accepting feedback y1                   -0.20      0.21  -0.93  0.35  [-0.62, 0.22]

(Table 32, capacity building y2, continued)
stakeholder sensitivity                  0.32      0.29   1.13  0.26  [-0.24, 0.89]
setting expectations                     0.25      0.17   1.50  0.13  [-0.08, 0.57]
inspection y1                           -0.06      0.21  -0.28  0.78  [-0.46, 0.35]
inspection y2                            0.02      0.18   0.12  0.91  [-0.33, 0.38]
Intercept                                3.36      1.18   2.86  0.00  [ 1.05, 5.66]

School Effectiveness y2                  Estimate  S.E.  z      p     95% CI
School Effectiveness y1                  0.10      0.15   0.66  0.51  [-0.20,  0.40]
Capacity Building y1                     0.25      0.15   1.64  0.10  [-0.05,  0.56]
Change in School Effectiveness y1        0.26      0.29   0.91  0.36  [-0.30,  0.82]
Improvements in capacity building y1     0.09      0.22   0.39  0.70  [-0.35,  0.53]
Self-Evaluation y1                      -0.39      0.18  -2.17  0.03  [-0.74, -0.04]
accepting feedback y1                    0.05      0.33   0.14  0.89  [-0.60,  0.69]
stakeholder sensitivity                  0.84      0.40   2.11  0.04  [ 0.06,  1.61]
setting expectations                    -0.43      0.23  -1.88  0.06  [-0.88,  0.02]
inspection y1                            0.03      0.31   0.09  0.93  [-0.58,  0.63]
inspection y2                            0.14      0.26   0.52  0.60  [-0.38,  0.65]
Intercept                                1.22      1.73   0.71  0.48  [-2.17,  4.62]

What influences changes in the scales between year 2 and year 3
The results for this analysis are shown in table 33. The inspection category of the school in years 2 and 3 influences the change in accepting feedback between year 2 and year 3. Principals of schools in the Zwak/Zeer Zwak inspection category in year 2 increase their accepting feedback scores by 0.59 points on average. However, principals of schools that change from inspection category basis in year 2 to inspection category Zwak/Zeer Zwak in year 3 show a significant reduction in their accepting feedback scores between year 2 and year 3 (-0.66). Principals in schools with higher levels of stakeholder sensitivity tend to show reductions on the improvements in capacity building scale. Principals in schools that change from the basis category of inspection in year 2 to the Zwak category in year 3 report decreases on the improvements in school effectiveness scale between year 2 and year 3. Principals in schools in the Zwak inspection category in year 2 report larger increases on the capacity building and school effectiveness scales between year 2 and year 3. The changes are likely larger in schools in the Zwak inspection category because there were more changes that could be made to improve capacity building and school effectiveness within the year.

Table 33. Autoregressive modelling looking at changes in the scales from year 2 to year 3

Accepting feedback y3      Estimate  S.E.  z      p     95% CI
accepting feedback y2      0.37      0.19   1.95  0.05  [ 0.00,  0.74]
stakeholder sensitivity y2 0.30      0.18   1.64  0.10  [-0.06,  0.65]
setting expectations y2   -0.06      0.16  -0.37  0.71  [-0.38,  0.26]
inspection y2              0.59      0.21   2.83  0.01  [ 0.18,  1.01]
inspection y3             -0.66      0.27  -2.40  0.02  [-1.20, -0.12]
Intercept                  1.73      0.73   2.38  0.02  [ 0.31,  3.15]

Self-Evaluations y3        Estimate  S.E.  z      p     95% CI
Self-Evaluation y2         0.34      0.12   2.87  0.00  [ 0.11, 0.58]
accepting feedback y2     -0.21      0.21  -1.04  0.30  [-0.62, 0.19]
stakeholder sensitivity y2 -0.25     0.19  -1.29  0.20  [-0.63, 0.13]
setting expectations y2   -0.03      0.17  -0.16  0.87  [-0.37, 0.31]
inspection y2             -0.19      0.22  -0.86  0.39  [-0.62, 0.24]
inspection y3              0.32      0.29   1.10  0.27  [-0.25, 0.89]
Intercept                  4.35      0.81   5.35  0.00  [ 2.76, 5.94]

Improvements in capacity building y3     Estimate  S.E.  z      p     95% CI
Improvements in capacity building y2     0.47      0.19   2.46  0.01  [ 0.10,  0.85]
Self-Evaluation y2                      -0.09      0.15  -0.58  0.56  [-0.39,  0.21]
accepting feedback y2                    0.28      0.23   1.22  0.22  [-0.17,  0.72]
stakeholder sensitivity y2              -0.55      0.22  -2.52  0.01  [-0.98, -0.12]
setting expectations y2                  0.11      0.19   0.58  0.56  [-0.26,  0.49]
inspection y2                            0.22      0.24   0.91  0.36  [-0.25,  0.69]
inspection y3                           -0.07      0.34  -0.19  0.85  [-0.74,  0.60]
Intercept                                2.81      0.94   2.98  0.00  [ 0.96,  4.66]

Change in School Effectiveness y3        Estimate  S.E.  z      p     95% CI
Change in School Effectiveness y2        0.73      0.17   4.17  0.00  [ 0.39,  1.07]
Improvements in capacity building y2     0.27      0.14   1.87  0.06  [-0.01,  0.54]
Self-Evaluation y2                      -0.21      0.13  -1.55  0.12  [-0.47,  0.05]
accepting feedback y2                   -0.29      0.17  -1.74  0.08  [-0.61,  0.04]
stakeholder sensitivity y2              -0.05      0.16  -0.34  0.73  [-0.37,  0.26]
setting expectations y2                  0.15      0.15   1.00  0.32  [-0.15,  0.45]
inspection y2                            0.18      0.17   1.09  0.28  [-0.15,  0.51]
inspection y3                           -0.84      0.24  -3.58  0.00  [-1.31, -0.38]
Intercept                                1.54      0.67   2.30  0.02  [ 0.23,  2.85]

Capacity Building y3                     Estimate  S.E.  z      p     95% CI
Capacity Building y2                     0.34      0.16   2.15  0.03  [ 0.03, 0.64]
Change in School Effectiveness y2        0.32      0.21   1.53  0.13  [-0.09, 0.74]
Improvements in capacity building y2    -0.24      0.17  -1.41  0.16  [-0.58, 0.09]
Self-Evaluation y2                      -0.23      0.16  -1.47  0.14  [-0.54, 0.08]

(Table 33, capacity building y3, continued)
accepting feedback y2                    0.00      0.19  -0.02  0.99  [-0.38, 0.38]
stakeholder sensitivity y2               0.32      0.19   1.67  0.09  [-0.05, 0.69]
setting expectations y2                 -0.22      0.17  -1.32  0.19  [-0.55, 0.11]
inspection y2                            0.43      0.20   2.12  0.03  [ 0.03, 0.83]
inspection y3                           -0.25      0.29  -0.88  0.38  [-0.81, 0.31]
Intercept                                3.45      1.19   2.89  0.00  [ 1.11, 5.79]

School effectiveness y3                  Estimate  S.E.  z      p     95% CI
school effectiveness y2                  0.38      0.10   3.84  0.00  [ 0.19,  0.58]
Capacity Building y2                     0.24      0.15   1.56  0.12  [-0.06,  0.54]
Change in School Effectiveness y2        0.41      0.19   2.12  0.03  [ 0.03,  0.79]
Improvements in capacity building y2    -0.40      0.16  -2.47  0.01  [-0.71, -0.08]
Self-Evaluation y2                      -0.25      0.15  -1.67  0.10  [-0.55,  0.04]
accepting feedback y2                   -0.01      0.18  -0.04  0.97  [-0.37,  0.35]
stakeholder sensitivity y2              -0.09      0.18  -0.48  0.63  [-0.44,  0.27]
setting expectations y2                  0.02      0.15   0.16  0.87  [-0.28,  0.33]
inspection y2                            0.47      0.21   2.31  0.02  [ 0.07,  0.88]
inspection y3                           -0.09      0.28  -0.34  0.73  [-0.64,  0.45]
Intercept                                2.31      1.10   2.10  0.04  [ 0.16,  4.47]
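The equations in tables 32 and 33 are ordinary regressions with the year-earlier score of the outcome on the right-hand side. A sketch of one year-2-to-year-3 equation, with assumed (hypothetical) column names:

    import statsmodels.formula.api as smf

    def fit_autoregressive(df):
        """The lagged outcome absorbs everything that produced the year-2
        score, so the remaining coefficients describe influences on the
        year-2-to-year-3 change."""
        formula = ("accepting_feedback_y3 ~ accepting_feedback_y2"
                   " + stakeholder_sensitivity_y2 + setting_expectations_y2"
                   " + inspection_y2 + inspection_y3")
        return smf.ols(formula, data=df).fit()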

5. Testing changes over time: teacher data
5.1 Change in the scale scores over time for teachers

Setting expectations
Table 34. Changes in the setting expectations scale over time in the teacher data set (reference category: year 1)
Setting Expectations  Estimate  S.E.  z      p     95% CI
year 2                -0.97     0.11  -8.72  0.00  [-1.19, -0.75]
year 3                -1.01     0.12  -8.67  0.00  [-1.24, -0.78]
Intercept              4.21     0.11  39.91  0.00  [ 4.00,  4.42]
sigma_u = 0.35; sigma_e = 0.43; rho = 0.40
Test of the difference between year 2 and year 3: chi2(1) = 0.26, p = 0.61

There was a significant decrease in the average teacher response on the setting expectations scale between year 1 and year 2 (Est=-0.97, z=-8.72, p<0.001) and between year 1 and year 3 (Est=-1.01, z=-8.67, p<0.001). However, there was no difference in the scores between year 2 and year 3 (Est=-0.04, chi2(1)=0.26, p=0.61). In year 1 the average predicted response was 4.21, in year 2 it was 3.24, and in year 3 it was 3.20.

Accepting Feedback
Table 35. Changes in the accepting feedback scale over time in the teacher data set (reference category: year 1)
Accepting Feedback  Estimate  S.E.  z      p     95% CI
year 2              -0.02     0.13  -0.14  0.89  [-0.28, 0.24]
year 3               0.32     0.14   2.35  0.02  [ 0.05, 0.58]
Intercept            3.79     0.12  32.77  0.00  [ 3.56, 4.02]
sigma_u = 0.24; sigma_e = 0.61; rho = 0.14
Test of the difference between year 2 and year 3: chi2(1) = 11.81, p = 0.006

There was a significant increase in the average teacher score on the accepting feedback scale between year 1 and year 3 (est=0.32, z=2.35, p=0.02) and between year 2 and year 3 (est=0.34, chi2(1)=11.81, p=0.006). In year 1 the average response was 3.79, in year 2 it was 3.77, and in year 3 it was 4.11.

Stakeholder Sensitivity
Table 36. Changes in the stakeholder sensitivity scale over time in the teacher data set (reference category: year 1)
Stakeholder Sensitivity  Estimate  S.E.  z      p     95% CI
year 2                   -0.04     0.08  -0.55  0.58  [-0.20, 0.11]
year 3                    0.38     0.09   4.48  0.00  [ 0.22, 0.55]
Intercept                 3.68     0.07  55.37  0.00  [ 3.55, 3.81]
sigma_u = 0.18; sigma_e = 0.44; rho = 0.14
Test of the difference between year 2 and year 3: chi2(1) = 33.00, p = 0.000

There was a significant increase in the average teacher report of stakeholder sensitivity between year 1 and year 3 (est=0.38, z=4.48, p<0.001) and between year 2 and year 3 (est=0.42, chi2(1)=33.00, p<0.001). In year 1 the average teacher response to the stakeholder sensitivity scale was 3.68, in year 2 it was 3.64, and in year 3 it was 4.06.

Improving Self-Evaluations
Table 37. Changes in the improving self-evaluations scale over time in the teacher data set (reference category: year 1)
Self-Evaluations  Estimate  S.E.  z      p     95% CI
year 2            -0.11     0.11  -0.97  0.33  [-0.33, 0.11]
year 3            -0.02     0.12  -0.19  0.85  [-0.26, 0.21]
Intercept          3.81     0.09  43.50  0.00  [ 3.64, 3.98]
sigma_u = 0.09; sigma_e = 0.67; rho = 0.02 (see note)
Test of the difference between year 2 and year 3: chi2(1) = 0.65, p = 0.42

There were no significant changes in the average teacher response to the improving self-evaluations scale over time.

Note: The rho values for the improving self-evaluations, improvements in capacity building and improvements in school effectiveness scales are low, suggesting little dependence in the scale scores over time. This may be because it is difficult for teachers to assess improvement actions, as opposed to harder outcomes such as actual improvements. The low rho values may also be a product of using averaged teacher responses: some responding teachers may have made improvements while others did not, and the set of teachers who respond may differ between sweeps of the survey.

Improvements in Capacity Building
Table 38. Changes in the improvements in capacity building scale over time in the teacher data set (reference category: year 1)
Changes in Capacity Building  Estimate  S.E.  z      p     95% CI
year 2                        -0.27     0.07  -3.81  0.00  [-0.41, -0.13]
year 3                        -0.38     0.08  -5.05  0.00  [-0.53, -0.23]
Intercept                      3.60     0.06  64.03  0.00  [ 3.49,  3.71]
sigma_u = 0.11; sigma_e = 0.41; rho = 0.07
Test of the difference between year 2 and year 3: chi2(1) = 2.69, p = 0.101

There were significant decreases in the average teacher response to the improvements in capacity building scale between year 1 and year 2 and between year 1 and year 3, but no significant difference between years 2 and 3. In year 1 the average response was 3.60, in year 2 it was 3.33, and in year 3 it was 3.22.

Improvements in School Effectiveness
Table 39. Changes in the improvements in school effectiveness scale over time in the teacher data set (reference category: year 1)
Change School Effectiveness  Estimate  S.E.  z      p     95% CI
year 2                       -0.09     0.07  -1.38  0.17  [-0.22,  0.04]
year 3                       -0.33     0.07  -4.72  0.00  [-0.46, -0.19]
Intercept                     3.79     0.05  72.16  0.00  [ 3.69,  3.90]
sigma_u = 0.09; sigma_e = 0.37; rho = 0.05
Test of the difference between year 2 and year 3: chi2(1) = 14.89, p = 0.000

There was no significant difference in the average teacher response to improving school effectiveness between year 1 and year 2, but there was a significant decrease between year 1 and year 3 (est=-0.33, z=-4.72, p<0.001) and between year 2 and year 3 (est=-0.24, chi2(1)=14.89, p<0.001).

School Effectiveness
Table 40. Changes in the school effectiveness scale over time in the teacher data set (reference category: year 1)
School Effectiveness  Estimate  S.E.  z      p     95% CI
year 2                0.02      0.08  0.33   0.74  [-0.12, 0.17]
year 3                0.04      0.08  0.48   0.63  [-0.12, 0.20]
Intercept             4.57      0.07  67.74  0.00  [ 4.44, 4.70]
sigma_u = 0.32; sigma_e = 0.43; rho = 0.35
Test of the difference between year 2 and year 3: chi2(1) = 0.04, p = 0.83

There was no evidence for changes in the average teacher response to the school effectiveness scale over time.

Capacity Building
Table 41. Changes in the capacity building scale over time in the teacher data set (reference category: year 1)
Capacity Building  Estimate  S.E.  z      p     95% CI
year 2             -0.22     0.09  -2.45  0.01  [-0.39, -0.04]
year 3             -0.17     0.10  -1.78  0.08  [-0.36,  0.02]
Intercept           4.92     0.08  65.01  0.00  [ 4.77,  5.07]
sigma_u = 0.31; sigma_e = 0.51; rho = 0.27
Test of the difference between year 2 and year 3: chi2(1) = 0.28, p = 0.59

The average teacher response to the capacity building scale reduced significantly between year 1 and year 2 (est=-0.22, z=-2.45, p=0.01). The difference between year 1 and year 3 was borderline significant, and there was no difference between year 2 and year 3.
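All teacher results in this section are based on one collapsed (averaged) response per school per year, as described in section 4. A sketch of that averaging step, with hypothetical column names:

    import pandas as pd

    def collapse_teachers(df: pd.DataFrame, scales: list[str]) -> pd.DataFrame:
        """Average all teacher responses within a school-year into one record."""
        return df.groupby(["school", "year"], as_index=False)[scales].mean()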

5.2 Changes in the scale scores over time for teachers by inspection category

Setting Expectations
Table 42. Changes in setting expectations by inspection category (reference categories: year 1, basis)
Setting Expectations  Estimate  S.E.  z      p     95% CI
year 2                -0.91     0.14  -6.35  0.00  [-1.19, -0.63]
year 3                -0.85     0.15  -5.85  0.00  [-1.14, -0.57]
Zwak                   0.40     0.21   1.89  0.06  [-0.01,  0.81]
year 2 * Zwak         -0.09     0.25  -0.36  0.72  [-0.59,  0.40]
year 3 * Zwak         -0.48     0.30  -1.59  0.11  [-1.07,  0.11]
Intercept              4.07     0.13  30.31  0.00  [ 3.81,  4.34]
sigma_u = 0.33; sigma_e = 0.44; rho = 0.36
Test of the interaction across 3 time points: chi2(2) = 3.27, p = 0.195

Teachers in schools in the Zwak inspection category tended to report higher scores on the setting expectations scale on average in year 1. The difference in the initial reported scores was borderline significant: teachers in schools in the Zwak inspection category responded to the scale with average scores of 4.47, compared to 4.07 for teachers in schools in the basis inspection category.

Accepting Feedback
Table 43. Changes in accepting feedback by inspection category (reference categories: year 1, basis)
Accepting Feedback  Estimate  S.E.  z      p     95% CI
year 2              -0.19     0.17  -1.16  0.25  [-0.52, 0.13]
year 3               0.18     0.17   1.10  0.27  [-0.14, 0.51]
Zwak                -0.32     0.24  -1.34  0.18  [-0.78, 0.15]
year 2 * Zwak        0.48     0.28   1.68  0.09  [-0.08, 1.04]
year 3 * Zwak        0.32     0.35   0.93  0.35  [-0.36, 1.00]
Intercept            3.92     0.15  26.26  0.00  [ 3.63, 4.21]
sigma_u = 0.28; sigma_e = 0.60; rho = 0.18

Test of the interaction across 3 time points: chi2(2) = 2.82, p = 0.240

There is no evidence that the average teacher response to the accepting feedback scale varied by the inspection category of the school.

Stakeholder Sensitivity
Table 44. Changes in stakeholder sensitivity by inspection category (reference categories: year 1, basis)
Stakeholder Sensitivity  Estimate  S.E.  z      p     95% CI
year 2                   -0.05     0.09  -0.59  0.56  [-0.24, 0.13]
year 3                    0.41     0.09   4.35  0.00  [ 0.23, 0.60]
Zwak                      0.11     0.16   0.67  0.51  [-0.21, 0.42]
year 2 * Zwak             0.05     0.20   0.24  0.81  [-0.35, 0.44]
year 3 * Zwak            -0.15     0.25  -0.60  0.55  [-0.64, 0.34]
Intercept                 3.66     0.08  48.57  0.00  [ 3.51, 3.81]
sigma_u = 0.19; sigma_e = 0.43; rho = 0.16
Test of the interaction across 3 time points: chi2(2) = 0.81, p = 0.668

There is no evidence that the average teacher response to the stakeholder sensitivity scale varied by the inspection category of the school.

Improving Self-Evaluations
Table 45. Changes in improving self-evaluations by inspection category (reference categories: year 1, basis)
Self-Evaluations  Estimate  S.E.  z      p     95% CI
year 2            -0.15     0.13  -1.20  0.23  [-0.41, 0.10]
year 3            -0.06     0.13  -0.49  0.63  [-0.32, 0.19]
Zwak              -0.08     0.21  -0.37  0.71  [-0.49, 0.34]
year 2 * Zwak      0.07     0.27   0.24  0.81  [-0.47, 0.61]
year 3 * Zwak      0.17     0.34   0.50  0.62  [-0.50, 0.84]
Intercept          3.84     0.10  38.49  0.00  [ 3.64, 4.03]
sigma_u = 0.11; sigma_e = 0.66; rho = 0.03

Test of the interaction across 3 time points: chi2(2) = 0.25, p = 0.882

There is no evidence that the average teacher response to the improving self-evaluations scale varied by the inspection category of the school.

Improvements in Capacity Building
Table 46. Changes in improvements in capacity building by inspection category (reference categories: year 1, basis)
Change Capacity Building  Estimate  S.E.  z      p     95% CI
year 2                    -0.34     0.08  -4.22  0.00  [-0.49, -0.18]
year 3                    -0.38     0.08  -4.57  0.00  [-0.54, -0.21]
Zwak                       0.04     0.13   0.29  0.77  [-0.22,  0.30]
year 2 * Zwak              0.29     0.17   1.67  0.10  [-0.05,  0.62]
year 3 * Zwak              0.00     0.21   0.02  0.99  [-0.42,  0.42]
Intercept                  3.59     0.06  57.01  0.00  [ 3.47,  3.71]
sigma_u = 0.11; sigma_e = 0.41; rho = 0.07
Test of the interaction across 3 time points: chi2(2) = 3.62, p = 0.164

There is no evidence that the average teacher response to the improvements in capacity building scale varied by the inspection category of the school.

Improvements in school effectiveness
Table 47. Changes in improvements in school effectiveness by inspection category (reference categories: year 1, basis)
Change School Effectiveness  Estimate  S.E.  z      p     95% CI
year 2                       -0.05     0.07  -0.74  0.46  [-0.20,  0.09]
year 3                       -0.29     0.07  -3.82  0.00  [-0.43, -0.14]
Zwak                          0.27     0.14   1.93  0.05  [ 0.00,  0.54]
year 2 * Zwak                -0.23     0.17  -1.30  0.19  [-0.57,  0.12]
year 3 * Zwak                -0.23     0.21  -1.09  0.28  [-0.63,  0.18]
Intercept                     3.75     0.06  65.12  0.00  [ 3.64,  3.86]
sigma_u = 0.09; rho = 0.36 (sigma_e missing in the source)

Test of the interaction across 3 time points: chi2(2) = 1.89, p = 0.388

Teachers in schools in inspection category Zwak reported, on average, higher scores on the improvements in school effectiveness scale in year 1: scores of 4.02, compared with 3.75 for teachers in schools in the basis category.

School Effectiveness
Table 48. Changes in school effectiveness by inspection category (reference categories: year 1, basis)
School Effectiveness  Estimate  S.E.  z      p     95% CI
year 2                 0.05     0.09   0.59  0.55  [-0.12, 0.22]
year 3                 0.07     0.09   0.78  0.44  [-0.11, 0.26]
Zwak                   0.20     0.16   1.24  0.22  [-0.12, 0.52]
year 2 * Zwak         -0.09     0.20  -0.44  0.66  [-0.49, 0.31]
year 3 * Zwak         -0.01     0.26  -0.05  0.96  [-0.52, 0.49]
Intercept              4.52     0.08  59.36  0.00  [ 4.37, 4.67]
sigma_u = 0.33; sigma_e = 0.43; rho = 0.37
Test of the interaction across 3 time points: chi2(2) = 0.25, p = 0.884

There is no evidence that the average teacher response to the school effectiveness scale varied by the inspection category of the school.

Capacity Building
Table 49. Changes in capacity building by inspection category (reference categories: year 1, basis)
Capacity Building  Estimate  S.E.  z      p     95% CI
year 2             -0.15     0.10  -1.49  0.14  [-0.35, 0.05]
year 3             -0.08     0.11  -0.74  0.46  [-0.29, 0.13]
Zwak                0.33     0.18   1.85  0.06  [-0.02, 0.68]
year 2 * Zwak      -0.38     0.23  -1.67  0.09  [-0.82, 0.06]
year 3 * Zwak      -0.55     0.29  -1.89  0.06  [-1.12, 0.02]
Intercept           4.85     0.08  57.27  0.00  [ 4.69, 5.02]
sigma_u = 0.31; sigma_e = 0.51; rho = 0.27

Test of the interaction across 3 time points: chi2(2) = 4.45, p = 0.11

Teachers' responses to the capacity building scale varied by inspection category at year 1 and year 3 of the survey (both differences borderline significant). Teachers in schools in inspection category Zwak tended to report decreases in capacity building over time, whereas teachers in schools in inspection category basis reported very similar scores on the capacity building scale over time.

Figure 11. The relationship between changes in capacity building over time and the inspection category of the school. [Line chart of predicted scores (y-axis: score) across years 1-3 for basis and Zwak/Zeer Zwak schools.]

5.3 Longitudinal Path Models: Teachers
Two versions of the path model were tested: in the first specification only information on the scales is included in the model, and in the second specification information on the inspection arrangement of the schools is added to the analysis. The full lists of coefficients are presented in appendices D and E respectively; diagrams showing the significant pathways and their directions are provided in figures 12 and 13. Figure 12 shows the results for specification 1 of the path model, fit on the average teacher response data set. In this figure there are seven significant direct pathways and one significant indirect pathway. There is a positive association between setting expectations in year 1 and improvements in school effectiveness in year 2: schools in which the teachers on average report higher scores on the setting expectations scale also report higher scores on the improvements in school effectiveness scale. The same relationship exists between the setting expectations scale and the improvements in capacity building scale. The average teacher response on the improving self-evaluations scale also has a positive association with the improving school effectiveness and improving capacity building scales. Teachers' average responses to the improvements in capacity building scale are positively associated with school effectiveness and capacity building: schools whose teachers report higher scores on the improvements in capacity building scale also report higher scores on the school effectiveness scale and the capacity building scale. The average teacher response to capacity building was positively associated with school effectiveness, suggesting that where teachers report schools to have high capacity building they also report high school effectiveness. Figure 13 shows the path model for specification 2, with the inspection arrangements of year 1 and year 2 included in the model.

The most notable difference between specifications 1 and 2 is that the association between improving self-evaluations and improvements in capacity building is no longer significant; neither is the association between improving capacity building and school effectiveness. There is an association between inspection category in year 1 and school effectiveness in year 3: teachers in schools in the Zwak category in year 1 report higher levels of school effectiveness at year 3 than schools in the basis category. Also, teachers in schools that change from the basis to the Zwak category between year 1 and year 2 report lower levels of capacity building in year 3. There is a positive association between setting expectations and the inspection arrangements in year 1 and year 2, suggesting that teachers in schools in inspection category Zwak report higher scores on the setting expectations scale. The correlation between inspection category at year 1 and year 2 suggests that schools that are in the Zwak inspection category at year 1 are more likely to be in the Zwak inspection category in year 2.

Figure 12. Longitudinal path model 1, showing only significant pathways. Direct pathways are shown in black and indirect pathways in blue. The direction of each association is shown with either a + or - sign. See appendix D for the full tables of coefficients.

Figure 13. Longitudinal path model 2: taking into account school inspection category, and changes in school inspection category between year 1 and year 2. See appendix E for the full tables of coefficients.

6. Principal and teacher data set comparison
6.1 Longitudinal path model comparing teacher and principal responses
The path model depicted in figure 14 shows specification 1 of the path model, whereby information on the scales is included and pathways are fit according to the conceptual model. In each regression pathway the influence of job type was included: principals were given a value of 1 and the average teacher response a value of 0. From the path model it is therefore possible to determine whether teachers and principals respond differently. The coefficients from this path model are shown in table 50. Principals on average report statistically significantly higher scores than teachers on improvements in capacity building in year 2, improvements in school effectiveness in year 2, and capacity building in year 3. The positive correlation between the principal indicator and stakeholder sensitivity in year 1 suggests that being a principal is also associated with higher scores on the stakeholder sensitivity scale in year 1. Interestingly, being a principal is associated with a significantly lower score on the school effectiveness scale in year 3. The influence of the inspection arrangement of the school is included in this path model by adding inspection arrangement at year 1 and year 2 to the relevant regression pathways; the coefficients from this second specification are shown in table 51.
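The pooled comparison simply enters the respondent's role as a 0/1 dummy in each regression pathway. A sketch of one such pathway (column names, including the principal indicator, are assumptions):

    import statsmodels.formula.api as smf

    def fit_pooled_pathway(df):
        """A significant `principal` coefficient indicates that principals and
        averaged teacher responses differ on this outcome, holding the other
        predictors constant."""
        return smf.ols(
            "improv_capacity_y2 ~ self_evaluations_y2 + principal",
            data=df).fit()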

Figure 14. Path model specification 1, showing only significant pathways. Direct pathways are shown in black and indirect/total pathways in blue. The direction of each association is shown with either a + or - sign.