School Performance Evaluation in Portugal: A Data Warehouse Implementation to Automate Information Analysis

Rui Alberto Castro

Faculdade de Engenharia da Universidade do Porto, Rua Dr. Roberto Frias, s/n, 4200-465, Porto, Portugal
pro10022@fe.up.pt

Abstract. School performance evaluation is currently a prominent subject in both political and pedagogical terms. The ability to compare schools across regions using national exam results is becoming important for school management boards. Although this inter-school comparison (usually called school rankings) is very important, we believe that intra-school comparison matters just as much, because the most significant decisions affecting student performance are made by teachers and school directors inside a particular school. This paper presents the implementation of a Data Warehouse system, based on the Dimensional Model, suited to inter-school and intra-school performance analysis. The data used is real: it comes from the Ministério da Educação and from a specific school. We run a set of analyses that show both the usability of the system and the influence of internal factors (teachers, courses) on final results. We also compare SQL queries written against a traditional relational model with queries written against our Data Warehouse system.

Keywords: Data Warehouse, Dimensional Model, SQL, School Ranking.

1 Introduction

We live in a time characterized by the importance of accountability in every aspect of life. The need to measure, compare and rank has spread to knowledge areas that were not used to it just a few years ago. In Portugal, pre-university school exam results have been published by the Ministério da Educação (ME) [1] since 2002, and since then the so-called school rankings have been built and published every year by different sources, mainly the media [2].
Although this inter-school comparison, usually called school rankings, is very important, we believe that intra-school comparison is also very significant. This intra-school analysis of student results, using internal factors such as the student's teacher, the student's course, the student's social and economic background and the student's history in terms of past schools, is of major importance, because the most significant decisions that could affect student performance are made by the students' own teachers and by the school directors inside a particular school. It is also at school level that the most important improvement measures can be taken in order to achieve better results.

Performing the inter-school and intra-school analysis is difficult because the data sources are very different and the data systems are incompatible. To overcome this problem, and to provide a simple query system that could in future work be turned into an automatic query tool, we present the implementation of a Data Warehouse system using the Dimensional Model [3], suited to inter-school and intra-school performance analysis. To test and evaluate the proposed model, we loaded it with real exam data provided by the Ministério da Educação [1] and with data from a specific school (Colégio Paulo VI de Gondomar [4]). We present the major steps of the ETL process (Extraction, Transformation and Loading) and address some important issues of data validation and consolidation.

The main goal of this paper is to present a system suited to inter-school and intra-school performance analysis and to demonstrate its feasibility with a particular example. Some analyses run on the system are provided in order to assess its usability and performance.

This document is structured as follows. Section 2 presents other school performance studies and techniques. Section 3 gives a general view of the Data Warehouse system and of the Dimensional Model applied to our particular case. Section 4 describes how the system was implemented and how the real data was loaded. Section 5 presents some results and query comparisons, and Section 6 presents some conclusions.
2 Related Work

School performance analysis is a recent topic in Portugal, as the Ministério da Educação de Portugal has only disclosed the final exam results of Portuguese students since 2002. Since then, the so-called school rankings have been built every year based on the final marks of the 12th year and on the university access exams. School rankings are also very common in other countries but, as in Portugal, they are inter-school analyses based on exams or standard tests. In the UK, for example, it is usual to publish school rankings based on the GCSE (General Certificate of Secondary Education) [6].

Many institutions, mainly the press, have produced these rankings in past years, but we know of no published analysis of the intra-school performance of a Portuguese school and its relation to inter-school rankings. Some reference newspapers publish a very deep inter-school analysis every year [5] and compare the results over the years, breaking them down by key factors such as regional distribution, school size and student course. We extend this work to the intra-school factors that we believe are decisive for student performance, mainly the teacher.
3 Data Warehouse System Description

The main objective of implementing a Data Warehouse system is to make information available and to increase its quality, reliability and usability. The final goal of the system is to provide information that can lead school managers to better and better-founded decisions.

Our system is built around two stars (fact tables), with three and five dimensions respectively. Star One, shown in Fig. 1, stores all the exam results of all students of all schools since 2002.

Exames_Nacionais_ME (fact table)
    Column Name     Data Type     Nullable  Description
    id_escola       int           No
    id_disciplina   int           No
    id_anolectivo   int           No
    Tipo            char(1)       No        I for internal or E for external
    Fase            char(1)       No        Exam phase (1 or 2)
    Sexo            char(1)       No        M or F
    N_Provas        int           No        Number of exams for this school, discipline, school year and phase, taken by students of this sex and type
    Total_CIF       int           No        Sum of the CIF grades of these students (each normalised to [0, 20]); meaningful only for internal students
    Total_Classif   int           No        Sum of the exam grades of these students (each normalised to [0, 200])
    Total_CFD       int           No        Sum of the final (CFD) grades of these students (each normalised to [0, 20])

escola_dw
    Column Name     Data Type     Nullable  Description
    cod_escola      int           No        School code
    Nome            varchar(150)  No        School name
    Nome_Ab...      varchar(20)   No        Abbreviated name
    id_concelho     int           No
    concelho        varchar(50)   No
    id_distrito     int           No
    distrito        varchar(50)   No
    Publica         bit           No        1 if public, 0 if private

disciplina
    Column Name     Data Type     Nullable  Description
    Nome            varchar(64)   No        Full name of the discipline
    abr             varchar(16)   No        Abbreviated name of the discipline
    cod_enes        varchar(10)   No        ENES export code
    cod_exame       smallint      No        Exam code
    AnoTerm         tinyint       No        Final year of the discipline (the year in which the exam is taken)

anolectivo
    Column Name     Data Type     Nullable  Description
    Ano             int           No        The year (e.g. 2010 for 2010/2011)
    Ano_Desc        varchar(9)    No        In full, e.g. 2010/2011

Fig. 1. Star One. Results from national exams.
Star Two, shown in Fig. 2, stores all the grades and exam results of all students of a specific school since 2002.

Notas_Exames_Escola (fact table)
    Column Name     Data Type     Nullable  Description
    id_escola       int           No
    id_disciplina   int           No
    id_anolect...   int           No
    id_aluno        int           No
    id_professor    int           No
    Tipo            char(1)       No        I for internal or E for external
    Fase            char(1)       No        Exam phase (1 or 2)
    Per3            int           No        Third-term grade of this student with this teacher at this school, in this discipline and school year
    CIF             int           No        Internal grade
    Exame           int           No        Exam grade
    CFD             int           No        Final grade

professor
    Column Name     Data Type     Nullable
    id_prof         int           No
    nome            varchar...    No
    idade           tinyint       No
    grau            varchar...    No
    curso           varchar...    No
    univeridade     varchar...    No
    class_final     tinyint       No

aluno
    Column Name     Data Type     Nullable  Description
    id_student      int           No
    nome            varchar...    No        Full name
    idade           tinyint       No
    Sexo            char(1)       No        M or F

escola_dw
    Column Name     Data Type     Nullable  Description
    cod_escola      int           No        School code
    Nome            varchar...    No        School name
    Nome_Ab...      varchar...    No        Abbreviated name
    id_concelho     int           No
    concelho        varchar...    No
    id_distrito     int           No
    distrito        varchar...    No
    Publica         bit           No        1 if public, 0 if private

disciplina
    Column Name     Data Type     Nullable  Description
    Nome            varchar...    No        Full name of the discipline
    abr             varchar...    No        Abbreviated name of the discipline
    cod_enes        varchar...    No        ENES export code
    cod_exame       smallint      No        Exam code
    AnoTerm         tinyint       No        Final year of the discipline (the year in which the exam is taken)

anolectivo
    Column Name     Data Type     Nullable  Description
    Ano             int           No        The year (e.g. 2010 for 2010/2011)
    Ano_Desc        varchar(9)    No        In full, e.g. 2010/2011

Fig. 2. Star Two: Results of a specific school
The fact tables store all grades and exams:
    Star One stores the results of the national exams.
    Star Two stores the grades and exams of the specific school.

Both stars share three common dimensions:
    anolectivo: stores the school year to which the grades refer.
    escola_dw: stores data about a specific school, namely its name, code, geographical location and type (public or private).
    disciplina: stores all information about each discipline, namely its name, cod_enes (a code administered by the Ministério da Educação (ME) and used for electronic export), cod_exame (a code administered by ME that identifies the exam) and the year in which the discipline ends.

Star Two has two more dimensions:
    aluno: stores information about students.
    professor: stores information about the students' teachers.

Star One (national exams) holds only aggregate results, because ME does not disclose information about individual students; that is, there is no way to relate one student's results from one year to the next. With this in mind, we aggregate all the information across the specific dimensions (school, discipline and year) and also across sex, phase (phase 1 or phase 2) and type (internal or external student), as this is the only information available. Star Two (specific school) stores all results with no aggregation, as we have enough information to carry out studies across the years. Each fact is the set of grades and the exam result of a particular student.

With the proposed system, national exam results can be queried across years, schools, regional location, type of school, discipline, phase, type of student and student sex. In addition, the specific school data can be analysed by student, by teacher and by teacher attributes such as age, degree and university.
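As a rough illustration of the aggregation described above, the following Python sketch uses an in-memory SQLite database (an assumption made only to keep the example self-contained; the system described in this paper runs on SQL Server). The fact table name exames_nacionais_me and the N_Provas and Total_Classif measures follow Fig. 1. The raw_exams staging table, the sample rows and the use of natural keys in place of surrogate dimension keys are hypothetical simplifications.

```python
import sqlite3

# Hypothetical sample rows: (school, discipline, year, type, phase, sex, exam grade 0-200)
SAMPLE = [
    (1, "mat-a", 2009, "I", "1", "F", 120),
    (1, "mat-a", 2009, "I", "1", "F", 140),
    (1, "mat-a", 2009, "I", "1", "M", 100),
    (2, "mat-a", 2009, "E", "2", "M", 90),
]


def build_star_one(raw_exams):
    """Aggregate individual exam rows into Star One style facts.

    Natural keys stand in for the surrogate dimension keys (an assumption).
    Returns the open in-memory connection.
    """
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE raw_exams (escola INT, disciplina TEXT, ano INT,
                                tipo TEXT, fase TEXT, sexo TEXT, class_exam INT);
        CREATE TABLE exames_nacionais_me (
            id_escola INT, id_disciplina TEXT, id_anolectivo INT,
            tipo TEXT, fase TEXT, sexo TEXT,
            n_provas INT, total_classif INT);
    """)
    con.executemany("INSERT INTO raw_exams VALUES (?,?,?,?,?,?,?)", raw_exams)
    # One fact row per (school, discipline, year, type, phase, sex) group.
    con.execute("""
        INSERT INTO exames_nacionais_me
        SELECT escola, disciplina, ano, tipo, fase, sexo,
               COUNT(*), SUM(class_exam)
        FROM raw_exams
        GROUP BY escola, disciplina, ano, tipo, fase, sexo
    """)
    return con


def mean_exam_by_school(con, ano):
    """Mean exam grade per school, recomputed from stored totals and counts."""
    rows = con.execute("""
        SELECT id_escola, 1.0 * SUM(total_classif) / SUM(n_provas)
        FROM exames_nacionais_me
        WHERE id_anolectivo = ?
        GROUP BY id_escola
        ORDER BY id_escola
    """, (ano,)).fetchall()
    return dict(rows)


if __name__ == "__main__":
    print(mean_exam_by_school(build_star_one(SAMPLE), 2009))
```

Because each mean is recomputed as the sum of totals divided by the sum of counts, it stays correct at any level of roll-up, which is one reason to keep totals and counts in the fact table rather than precomputed averages.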
4 Data Warehouse Implementation

Implementing the proposed system requires two main phases:
    The ETL process: Data Extraction, Transformation and Loading.
    Data Verification and Validation.

4.1 The ETL Process - Data Extraction

Data extraction is, in this case, a straightforward process. Both the national exams data and the specific school data are already in electronic format and can be imported directly into SQL Server. The ME data is available online [1] as Access tables, which can be imported directly into SQL Server. There are nine tables: tblcodsconcelho, tblcodsdistrito, tblcodspubpriv, tblcursos, tblcursossubtipos, tblcursostipos, tblescolas, tblexames and tblhomologa_2010 (for the 2009/2010 school year). The Colégio Paulo VI (CPVI) school already has a relational database in place, with a set of 47 different tables from which the data must be extracted. As an example, Fig. 3 shows the design of the table extracted directly from the ME exams data.

tblhomologa_2010
    Columns: ID (not nullable), Escola, Fase, Exame, ParaAprov, Interno, ParaMelhoria, ParaIngresso, TemInterno, Sexo, Idade, Curso, CIF, Class_Exam, CFD

Fig. 3. Design of the ME exams data table, as directly extracted

4.2 The ETL Process - Data Transformation

The Data Transformation phase was more complex because of ambiguity in the data, especially in course definitions, and because of missing information in the ME tables. ME tables define a course by a name and an exam code, whereas the CPVI school defines a course by a name, which is not always equivalent to the ME one, and an ENES code, which cannot be converted automatically into an exam code. Because of these ambiguities, a semi-automated process with some manual aid was used. Here is an example of SQL code for an automated transformation of data and its storage in the transformation area, using a temporary table named _temp_tab_course:
    select distinct course.name, abr, cod_enes, termina, class.year as AnoD,
           cast('' as varchar(200)) as nome_tbl,
           cast(0 as integer) as CodExame,
           cast('' as varchar(25)) as AnoTerm,
           0 as OkFlag
    into _temp_tab_course
    from course, year_area, year, class_course, class
    where course.id_year_area=year_area.id
      and id_year=year.id
      and year.year>=2001
      and exame=1
      and id_course=course.id
      and id_class=class.id
      and (termina=class.year or termina=0)

Here is the SQL code for the automatic match between the data from the school and the data from ME:

    update _temp_tab_course
    set nome_tbl=descr, codexame=exame, anoterm=[anoterminal], okflag=1
    from tblexames t
    where name=descr

With this code, for example, 28% of the rows were resolved automatically; the remaining 72% needed manual intervention and validation.

4.3 The ETL Process - Data Loading

Once the complex data transformation process has been completed for all the ME exams and CPVI data, loading the final tables is a simpler task. Here are some examples of SQL code to load some of the dimensions:

    -- Insert the anolectivo Dimension
    insert anolectivo
    select year, cast(year as varchar(4)) + '/' + cast(year+1 as varchar(4))
    from year
    where year>=2002
    order by year

    -- Insert the disciplina Dimension
    insert disciplina
    select name, abr, cod_enes, codexame, termina
    from _temp_tab_course
    where okflag=1

As an example, here is the code to load the Star One fact table (national exams results). In this example we load the data for the year 2009/2010. A similar SQL expression is used for other years:
    insert exames_nacionais_me
    -- internal students
    select escola_dw.id, disciplina.id, anolectivo.id, 'I', fase, sexo,
           count(*), sum(cif), sum(class_exam), sum(cfd)
    from tblhomologa_2010, escola_dw, disciplina, anolectivo
    where ano=2009 and escola=cod_escola and cod_exame=exame and interno='s'
    group by escola_dw.id, disciplina.id, anolectivo.id, fase, sexo
    union
    -- external students
    select escola_dw.id, disciplina.id, anolectivo.id, 'E', fase, sexo,
           count(*), sum(cif), sum(class_exam), sum(cfd)
    from tblhomologa_2010, escola_dw, disciplina, anolectivo
    where ano=2009 and escola=cod_escola and cod_exame=exame and interno='n'
    group by escola_dw.id, disciplina.id, anolectivo.id, fase, sexo

4.4 Data Verification and Validation

After the whole data loading process is completed, a new phase of verification and validation takes place. This is an automatic process, carried out by SQL scripts, that verifies that all valid data has been loaded into the Data Warehouse and that the loaded data is consistent with the data sources. Here is a sample of the SQL code that performs this task:

    select escola, nome, count(*) as Ex_Number,
           (select count(*) from exames_nacionais_me
            where id_escola=escola_dw.id)
    from tblhomologa_2010, escola_dw
    where escola=cod_escola
    group by escola, nome, escola_dw.id
    having count(*) <> (select count(*) from exames_nacionais_me
                        where id_escola=escola_dw.id)
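Since Fig. 1 describes Star One rows as aggregates that carry an N_Provas count, one possible variant of this reconciliation compares the number of source rows per school with the sum of N_Provas loaded for that school. The Python sketch below illustrates that idea on an in-memory SQLite database. It is an assumption-laden example rather than the procedure used in this paper: the table tblhomologa is a cut-down, hypothetical stand-in for the ME source table, the school code is used directly as the join key, and make_demo_db is an illustrative helper.

```python
import sqlite3


def make_demo_db():
    """Create empty, simplified source and fact tables (hypothetical layout)."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE tblhomologa (escola INT, exame INT, fase TEXT, sexo TEXT);
        CREATE TABLE exames_nacionais_me (
            id_escola INT, id_disciplina INT, fase TEXT, sexo TEXT, n_provas INT);
    """)
    return con


def schools_with_mismatch(con):
    """Return (school, source_rows, loaded_exams) for schools that do not reconcile.

    Source rows are counted one per exam; loaded exams are the sum of n_provas,
    because each fact row represents a whole group of exams.
    """
    return con.execute("""
        SELECT s.escola, s.n_source, COALESCE(f.n_loaded, 0)
        FROM (SELECT escola, COUNT(*) AS n_source
              FROM tblhomologa GROUP BY escola) AS s
        LEFT JOIN (SELECT id_escola, SUM(n_provas) AS n_loaded
                   FROM exames_nacionais_me GROUP BY id_escola) AS f
          ON f.id_escola = s.escola
        WHERE s.n_source <> COALESCE(f.n_loaded, 0)
        ORDER BY s.escola
    """).fetchall()
```

The LEFT JOIN with COALESCE means that a school present in the source but entirely missing from the fact table is still reported, with a loaded count of zero.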
5 Results and Query Comparisons

In this section we present some example queries against the implemented system, their results, and a comparison between each query and its equivalent in the CPVI system (a traditional relational database). Here are some examples:

Grades and exams of all CPVI students in MAT-A in 2009/2010 (Data Warehouse version):

    select ano_desc, abr, anoterm, professor.nome, per3, cif, exame, cfd
    from notas_exames_escola, professor, disciplina, escola_dw, anolectivo
    where id_disciplina=disciplina.id
      and id_anolectivo=anolectivo.id
      and id_professor=professor.id
      and id_escola=escola_dw.id
      and abr='mat-a'
      and ano=2009

Grades and exams of all CPVI students in MAT-A in 2009/2010 (Relational Database version):

    select year.year, abr, class.year, class.name, code, nf, cif,
           exame_f1, exame_f2, cfd
    from student_bio, class_course, course, year_area, class,
         year, prof_bio, prof
    where year.year=2009
      and abr='mat-a'
      and class.year=12
      and cif>=10
      and course.id_year_area=year_area.id
      and id_course=course.id
      and id_class=class.id
      and student_bio.id_class_course=class_course.id
      and year_area.id_year=year.id
      and prof_bio.id_class_course=class_course.id
      and id_prof=prof.id
    order by class.year, class.name

Table 1. Grades and exams of all CPVI students in MAT-A in 2009/2010 (sample).

    year  abr    code  nf  cif  exame_f1  exame_f2  cfd
    2009  MAT-A  JC    10  11   121                 11
    2009  MAT-A  JC    15  15   171                 16
    2009  MAT-A  JC    17  16   147       135       16
    2009  MAT-A  JC    14  14   147       156       15
    2009  MAT-A  JC    20  20   196                 20
    2009  MAT-A  JC    9   11   96                  11
    2009  MAT-A  JC    19  19   178                 19
    2009  MAT-A  JC    18  19   185       177       19
    2009  MAT-A  JC    15  15   160                 15
    2009  MAT-A  JC    17  18   196                 19
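The Data Warehouse version of the query above can also be written with explicit JOIN clauses, so that every dimension in the query is visibly tied to the fact table by its key. The Python sketch below runs such a query on an in-memory SQLite database. The table and column names follow Fig. 2 and the queries in this section; SQLite itself, the reduced column set and the helper grades_for are assumptions made to keep the example small and runnable.

```python
import sqlite3

# Reduced Star Two schema: only the columns the query needs (an assumption).
SCHEMA = """
CREATE TABLE anolectivo (id INTEGER PRIMARY KEY, ano INT, ano_desc TEXT);
CREATE TABLE disciplina (id INTEGER PRIMARY KEY, nome TEXT, abr TEXT);
CREATE TABLE professor  (id INTEGER PRIMARY KEY, nome TEXT);
CREATE TABLE escola_dw  (id INTEGER PRIMARY KEY, nome TEXT);
CREATE TABLE notas_exames_escola (
    id_escola INT, id_disciplina INT, id_anolectivo INT, id_professor INT,
    per3 INT, cif INT, exame INT, cfd INT);
"""

# Each dimension is joined to the fact table on its key, so adding a
# dimension can never multiply the fact rows.
QUERY = """
SELECT a.ano_desc, d.abr, p.nome, f.per3, f.cif, f.exame, f.cfd
FROM notas_exames_escola AS f
JOIN anolectivo AS a ON f.id_anolectivo = a.id
JOIN disciplina AS d ON f.id_disciplina = d.id
JOIN professor  AS p ON f.id_professor  = p.id
JOIN escola_dw  AS e ON f.id_escola     = e.id
WHERE d.abr = ? AND a.ano = ?
ORDER BY f.exame DESC
"""


def grades_for(con, abr, ano):
    """Return one row per matching fact for a discipline and starting year."""
    return con.execute(QUERY, (abr, ano)).fetchall()
```

A call such as grades_for(con, 'mat-a', 2009) returns exactly one row per matching fact, however many schools the escola_dw dimension holds.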
Top 12 schools by number of exams in 2009/2010:

    select top 12 nome, count(*) Total_Exames_Nacionais
    from exames_nacionais_me, anolectivo, escola_dw
    where id_anolectivo=anolectivo.id
      and id_escola=escola_dw.id
      and ano_desc='2009/2010'
    group by nome
    order by count(*) desc

Table 2. Top 12 schools by number of exams in 2009/2010.

    School                                           Number of exams in 2009/2010
    Escola Secundária Camões                         591
    Escola Secundária Alberto Sampaio                522
    Escola Secundária Jaime Moniz                    482
    Escola Secundária Santa Maria de Sintra          482
    Escola Secundária da Amadora                     466
    Escola Secundária Alexandre Herculano            459
    Escola Secundária de Odivelas                    454
    Escola Secundária Maria Amália Vaz de Carvalho   443
    Escola Secundária Leal da Câmara                 429
    Escola Secundária Avelar Brotero                 428
    Escola Secundária de Cascais                     428
    Escola Secundária Alves Martins                  419

Year-by-year analysis of the performance of CPVI Math teachers:

    select ano_desc, disciplina.nome, id_prof, avg(exame) Média_Exame_Prof
    from notas_exames_escola, anolectivo, disciplina, professor
    where id_anolectivo=anolectivo.id
      and id_disciplina=disciplina.id
      and id_professor=professor.id
      and disciplina.abr='mat-a'
    group by ano_desc, disciplina.nome, id_prof
    order by ano_desc, avg(exame) desc
Table 3. Year-by-year analysis of CPVI Math teachers' performance (2006/2007 to 2009/2010).

    School year  Discipline    Teacher (id)  Exam average
    2006/2007    Matemática A  6             124
    2006/2007    Matemática A  21            109
    2006/2007    Matemática A  157           82
    2007/2008    Matemática A  6             163
    2007/2008    Matemática A  21            115
    2007/2008    Matemática A  157           102
    2008/2009    Matemática A  41            158
    2008/2009    Matemática A  21            151
    2008/2009    Matemática A  157           122
    2009/2010    Matemática A  6             143
    2009/2010    Matemática A  41            140
    2009/2010    Matemática A  21            115

6 Conclusions

Nowadays all human activities are expected to be evaluated and compared. In this paper we presented an implementation of a Data Warehouse system suited to school performance evaluation, together with some results obtained from it. With this system we can produce a large number of different figures and tables suited to managerial and pedagogical analysis.

Building a system like the one proposed has three main advantages:
    It increases the reliability and completeness of the information, and confidence in it.
    It protects the information.
    It makes a large amount of information available in an easier way.

With the proposed system, a range of different results can be obtained in order to analyse school performance from different points of view. In Section 5 we present some examples that show how simply many results can be obtained with the proposed model. As future work, and thanks to the low complexity of the Dimensional Model, we intend to provide an implementation of automatic analyses and queries suited to presenting results on a website.
References

1. Ministério da Educação de Portugal, Direcção Geral de Inovação e Desenvolvimento Curricular (DGIDC): Exams Data. http://www.dgidc.min-edu.pt/jne/paginas/estatistica.aspx
2. Sociedade Independente de Comunicação (SIC): Ranking das Escolas. http://sic.sapo.pt/online/noticias/pais/especiais/ranking-escolas-2010/
3. Kimball, R., Ross, M.: The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling, 2nd edition, pp. 37-89. John Wiley & Sons (2002). ISBN 0471200247
4. Colégio Paulo VI de Gondomar: http://www.colegiopaulovi.com
5. Jornal Público: Dossier with School Ranking. http://static.publico.clix.pt/docs/educacao/especiaranking2010.pdf
6. The Telegraph: GCSE league tables 2010 school-by-school. http://www.telegraph.co.uk/education/leaguetables/8254332/gcse-league-tables-2010-school-by-school.html