The influence of teacher support on national standardized student assessment.

Similar documents

Dimensionality Reduction: Principal Components Analysis

Data Exploration Data Visualization

STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and

Center: Finding the Median. Median. Spread: Home on the Range. Center: Finding the Median (cont.)

1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number

DATA MINING AND CUSTOMER RELATIONSHIP MANAGEMENT FOR CLIENTS SEGMENTATION

DESCRIPTIVE STATISTICS. The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses.

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

Week 1. Exploratory Data Analysis

Segmentation of stock trading customers according to potential value

Descriptive Statistics

BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS

Descriptive statistics Statistical inference statistical inference, statistical induction and inferential statistics

GeoGebra. 10 lessons. Gerrit Stols

Clustering UE 141 Spring 2013

Simple linear regression

Local outlier detection in data forensics: data mining approach to flag unusual schools

STATISTICA Formula Guide: Logistic Regression. Table of Contents

Data Preparation Part 1: Exploratory Data Analysis & Data Cleaning, Missing Data

Lecture 2: Descriptive Statistics and Exploratory Data Analysis

MEASURES OF CENTER AND SPREAD MEASURES OF CENTER 11/20/2014. What is a measure of center? a value at the center or middle of a data set

Data Mining: Algorithms and Applications Matrix Math Review

Steven M. Ho!and. Department of Geology, University of Georgia, Athens, GA

Algebra I Vocabulary Cards

Means, standard deviations and. and standard errors

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca

Principal Component Analysis

Factor Analysis. Chapter 420. Introduction

Cafcam: Crisp And Fuzzy Classification Accuracy Measurement Software

COM CO P 5318 Da t Da a t Explora Explor t a ion and Analysis y Chapte Chapt r e 3

Descriptive statistics parameters: Measures of centrality

Face detection is a process of localizing and extracting the face region from the

Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP

STATS8: Introduction to Biostatistics. Data Exploration. Babak Shahbaba Department of Statistics, UCI

A Complete Gradient Clustering Algorithm for Features Analysis of X-ray Images

Geostatistics Exploratory Analysis

DISCRIMINANT FUNCTION ANALYSIS (DA)

CHAPTER THREE COMMON DESCRIPTIVE STATISTICS COMMON DESCRIPTIVE STATISTICS / 13

Component Ordering in Independent Component Analysis Based on Data Power

SEARCH ENGINE WITH PARALLEL PROCESSING AND INCREMENTAL K-MEANS FOR FAST SEARCH AND RETRIEVAL

430 Statistics and Financial Mathematics for Business

The Annual Inflation Rate Analysis Using Data Mining Techniques

Factor Analysis. Principal components factor analysis. Use of extracted factors in multivariate dependency models

Northumberland Knowledge

Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing. C. Olivia Rud, VP, Fleet Bank

Iris Sample Data Set. Basic Visualization Techniques: Charts, Graphs and Maps. Summary Statistics. Frequency and Mode

Java Modules for Time Series Analysis

A New Method for Dimensionality Reduction using K- Means Clustering Algorithm for High Dimensional Data Set

Introduction to Principal Components and FactorAnalysis

Data Exploration and Preprocessing. Data Mining and Text Mining (UIC Politecnico di Milano)

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler

DEA implementation and clustering analysis using the K-Means algorithm

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

Advanced Ensemble Strategies for Polynomial Models

Analyzing Quantitative Data Ellen Taylor-Powell

The right edge of the box is the third quartile, Q 3, which is the median of the data values above the median. Maximum Median

CLUSTERING LARGE DATA SETS WITH MIXED NUMERIC AND CATEGORICAL VALUES *

Robust Outlier Detection Technique in Data Mining: A Univariate Approach

Module 4: Data Exploration

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

BNG 202 Biomechanics Lab. Descriptive statistics and probability distributions I

Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning

Mean = (sum of the values / the number of the value) if probabilities are equal

Prototype-based classification by fuzzification of cases

Steven M. Ho!and. Department of Geology, University of Georgia, Athens, GA

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

4.1 Exploratory Analysis: Once the data is collected and entered, the first question is: "What do the data look like?"

Data Preparation and Statistical Displays

Social Media Mining. Data Mining Essentials

Exploratory data analysis (Chapter 2) Fall 2011

K-Means Clustering Tutorial

Environmental Remote Sensing GEOG 2021

Exploratory data analysis for microarray data

A Correlation of. to the. South Carolina Data Analysis and Probability Standards

FUZZY IMAGE PROCESSING APPLIED TO TIME SERIES ANALYSIS

FACTOR ANALYSIS. Factor Analysis is similar to PCA in that it is a technique for studying the interrelationships among variables.

3.2 Measures of Spread

How to Get More Value from Your Survey Data

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining

KEANSBURG SCHOOL DISTRICT KEANSBURG HIGH SCHOOL Mathematics Department. HSPA 10 Curriculum. September 2007

Engineering Problem Solving and Excel. EGN 1006 Introduction to Engineering

Modeling Lifetime Value in the Insurance Industry

Performance Metrics for Graph Mining Tasks

2. Simple Linear Regression

KNIME TUTORIAL. Anna Monreale KDD-Lab, University of Pisa

Alignment and Preprocessing for Data Analysis

SPSS Tutorial. AEB 37 / AE 802 Marketing Research Methods Week 7

How To Write A Data Analysis

Survey Data Analysis. Qatar University. Dr. Kenneth M.Coleman - University of Michigan

A fast, powerful data mining workbench designed for small to midsize organizations

Subspace Analysis and Optimization for AAM Based Face Alignment

How To Identify A Churner

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Standardization and Its Effects on K-Means Clustering Algorithm

Transcription:

The influence of teacher support on national standardized student assessment. A fuzzy clustering approach to improve the accuracy of Italian students data Claudio Quintano Rosalia Castellano Sergio Longobardi sergio.longobardi@uniparthenope.it Department of Statistics and Mathematics for Economic Research UNIVERSITY OF NAPLES PARTHENOPE

To err is human, to forgive divine. but to include errors in your design is statistical [Leslie Kish, 1978] 2

TOTAL SURVEY ERROR SAMPLING ERROR NON SAMPLING ERROR SPECIFICATION error PROCESSING error FRAME error NONRESPONSE error MEASURAMENT error (Biemer and Lyberg, 2003) The excessive teacher support might be considered similar to an interviewer effect (Biemer, Groves, Lyberg, Mathiowetz and Sudman, 1991) and we might suppose that the student s score is affected by an error component which inflates the measurement errors (Braverman, 1996) 3

This work considers data on students performance assessments collected by the Italian National Evaluation Institute of the Ministry of Education (INVALSI) in the s.y. 2004/05. A multistage method, combining the factorial analysis with a fuzzy clustering approach, is implemented for evaluating and correcting the cheating in primary classes 4

Teacher cheating, especially in extreme cases, is likely to leave tell-tale signs (Jacob B.A., Levitt S.D.; 2003) MATHEMATICS CLASS MEAN SCORE - S.Y 2004/05 IV CLASS PRIMARY SCHOOL II CLASS PRIMARY SCHOOL 5

Teacher cheating, especially in extreme cases, is likely to leave tell-tale signs (Jacob B.A., Levitt S.D.; 2003) MATHEMATICS CLASS MEAN SCORE - S.Y 2004/05 III CLASS UPPER SECONDARY SCHOOL I CLASS UPPER SECONDARY SCHOOL I CLASS LOWER SECONDARY SCHOOL 6

Histogram II CLASS - PRIMARY SCHOOL Histogram Histogram 3.000 2.500 1.250 2.000 1.000 Frequency 2.000 1.000 Frequency 750 500 Frequency 1.500 1.000 250 500 0 0 20 40 60 avergita 80 100 Mean =87,81 Std. Dev. =8,14 N =30.031 0 0 20 40 avergita 60 80 100 Mean =74,71 Std. Dev. =14,133 N =30.097 0 0 Reading s.y. 2004/05 Mathematics Histogram s.y. 2004/05 Science s.y. 2004/05 Histogram 20 40 avergita 60 Histogram 80 100 2.000 2.000 5.000 4.000 1.500 1.500 Frequency 1.000 Frequency 1.000 Frequency 3.000 2.000 500 500 1.000 0 0 20 40 60 avergita 80 100 Mean =77,72 Std. Dev. =13,029 N =29.802 0 0 20 40 60 avergita 80 100 Mean =81,5 Std. Dev. =11,439 N =29.816 Reading s.y. 2005/06 Mathematics s.y. 2005/06 Science s.y. 2005/06 0 0 20 40 avergita 60 80 100 7

The procedure is aimed at managing the presence of cheating at class level. A class is considered suspicious if : the within (class) variability of the final score is close to zero the answer behaviour is homogeneous the percentage of missing data is very low 8

For each class, four indexes are computed: Class mean score: p Class standard deviation score: Class index of answer homogeneity: E Class non-response rate: MC 9

p N k 1 N p k N k1 ( p k N p 2 ) Class mean score Class standard deviation score p k denotes the score of k th student of th class N denotes number of respondent students of th class 10

E Q s 1 Q E s Index of answer homogeneity Mean of h Gini indexes E s h 1 t1 n N t Gini s index of heterogeneity 2 n t /N denotes the relative frequency of students of th class that has given the t th answer to s th question. The index of answer homogeneity is equal to zero when all students of th class have given the same answers to all test questions, while it reaches the value (h-1)/h when in the th class the answer heterogeneity is maximum 11

MC N k 1 N M Q k Class non-response rate 0 (there are no missing or invalid responses for the th class) 1 (all students of th class have given only missing or invalid answers) M k denotes the number both of item non-responses and of invalid responses for the k th student of the th class. Q denotes the number of administered item to th class. It is a constant for each assessment area (reading, mathematics and science) and for each school level. 12

At the second step, the size of the data matrix, composed of the four indexes at class level, is reduced to two components by using Factor Analysis with a principal component extraction (Jolliffe, 2002). Eigenvalues of correlation matrix R, simple and cumulative percentage of explained variability by the principal component analysis COMPONENT Initial Eigenvalues TOTAL % of Variance Cumulative % 1 2.956 73.911 73.911 2 0.723 18.086 91.997 3 0.288 7.211 99.208 4 0.032 0.792 100.000 13

The PCA allows to describe the answer behaviour of each student class through two variables SECOND Component Class non response rate FIRST Component INDEX OF CLASS COLLABORATION TO SURVEY Cheating IDENTIFICATION AXIS 14

Proection on the first two factorial axes plane of second class primary students SUSPICIOUS CLASSES High score High homogeneity Low variability Missing responses close to zero 15

Fuzzy k-means (FKM) Overcoming the hard clustering The fuzzy k-means Developed by Bezdek (1981) and Dunn (1974), it is a fuzzy version of the non-overlapping partition model - hard k-means- 16

Steps of the correction procedure: clustering the classes by fuzzy k-means algorithm proection on the factorial axes of the cluster centroids detection of suspicious cluster centroid computation of a cheating index (on the basis of degree of belonging to suspicious cluster) 17

J FKM c n i1 1 u m i x v i 2 The results of FKM are affected by two parameters: number of the clusters (c) fuzziness index (m) 18

Calibration strategy A two-step approach: Computation of some validity measures (number of cluster c) Sensitivity of FKM results to the level of the fuzziness (fuzziness parameter m) 19

The validity measures to assess the goodness of the fuzzy partitions and to obtain the optimal number of c are: Fuzziness Performance Index (FPI) Normalized Classification Entropy (NCE) Separation index (S) These indices are calculated for several thresholds of the fuzziness parameters m (1.5, 2.0, 2.5). 20

Fuzzy Performance Index (Roubens, 1982) FPI 1 cpc c 1 Partition coefficient (Bedzek,1974) PC 1 n c n i1 1 2 u i NCE PE log c Normalized Classification Entropy (Roubens, 1982) PE 1 n c n i1 1 Partition entropy (Bedzek, 1981) u i log u i The FPI and NCE are used for evaluating the fuzziness of the solutions. The lower the FPI and MPE values are, the more suitable is the corresponding solution (McBratney & Moore, 1985). 21

Separation index S (Xie and Beni, 1991) S 1 c n i1 1 nmin u 2 i i, x v i v v i 2 2 Where the numerator denotes the compactness by the sum of square distances within clusters, while the denominator denotes separation by the minimal distance between clusters. The smaller the value of S, the better the compactness and separation between the clusters v 22

Fuzzy performance Index (FPI), Normal Partition Entropy (NPE) and Separation index (S) in correspondence of m=1.5 and for c[2,20] 0.50 fuzziness (m=1.5) 0.40 0.30 0.20 FPI NCE S 0.10 0.00 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 clusters (c) 23

Fuzzy performance Index (FPI), Normal Partition Entropy (NPE) and Separation index (S) in correspondence of m=2 and for c[2,20] 0.60 fuzziness (m=2.0) 0.50 0.40 0.30 0.20 FPI NCE S 0.10 0.00 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 clusters (c) 24

Fuzzy performance Index (FPI), Normal Partition Entropy (NPE) and Separation index (S) in correspondence of m=2.5 and for c[2,20] 0.90 0.80 fuzziness (m=2.5) 0.70 0.60 0.50 0.40 0.30 FPI NCE S 0.20 0.10 0.00 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 clusters (c) 25

Number of classes with a weight close to zero (lower than 0.001) in correspondence of increasing values of fuzziness level: 1.1 m 2.5 and for c=3 50% 40% 30% 20% 10% 0% 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0 2.1 2.2 2.3 Units with weight lower than 0.001 <0000 2.4 2.5 North West North East Centre South South and Islands fuzziness 26

Some weighted central tendency measures of class mean score in correspondence of increasing values of fuzziness level: 1.1 m 2.5 and for c=3 110 Class score xx 100 90 80 Area: South and South and islands Mean Median Mode The correction procedure tends to equalize the central tendency measures for m in the range 1.7-1.9 70 60 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0 2.1 2.2 2.3 2.4 2.5 fuzziness 27

Cluster 2 gathers the classes that present a cheating profile: high class average scores minimum within variability low presence of missing items Proection on factorial plane of centroids computed by fuzzy k-means algorithm, c=3, m=1.7 28

Indicating with a the cheating cluster, the degree of belonging to this cluster is: µ a This measure is considered as the cheating probability of th class Otherwise it can be interpreted as the cheating level of each class 29

On the basis of µ a, a weighting factor is developed: W =1 - µ a The students class with high probability to belong to outlier cluster will have a low weight while the class very far from this cluster will have a weight close to 1 30

Some Results Graphics by box plot of the index u a distributions (-th unit degree of belonging to cheating cluster) 31

Some results ORIGINAL DISTRIBUTION ADJUSTED DISTRIBUTION 32

Some results ORIGINAL DISTRIBUTION ORIGINAL DISTRIBUTION ORIGINAL DISTRIBUTION ORIGINAL DISTRIBUTION ORIGINAL DISTRIBUTION North-West North-East Centre South South and Islands ADJUSTED DISTRIBUTION ADJUSTED DISTRIBUTION ADJUSTED DISTRIBUTION ADJUSTED DISTRIBUTION ADJUSTED DISTRIBUTION 33

References Bezdek J.C., (1974). Cluster validity with fuzzy sets. Journal of Cybernetics, 3, 58-73. Bezdek J.C. (1981). Pattern Recognition with Fuzzy Obective Function Algorithms. Plenum Press Biemer P.P., Groves R.M., Lyberg L.E., Mathiowetz N.A., Sudman S. (1991). Measurement Errors in Surveys. Wiley, New York. Braverman M. (1996). Sources of Survey Error: Implications for Evaluation Studies. New Directions for Evaluation: Advances in Survey Research, 70, 17-28. Dunn J.C. (1974). A Fuzzy Relative of the ISODATA Process and its Use in Detecting Compact Well Separated Clusters. Journal of Cybernetics, 3, 32-57. Jacob B.A., Levitt S.D. (2003) Rotten Apples: An Investigation of the Prevalence and Predictors of Teacher Cheating. The Quarterly Journal of Economics, 118(3), 843-877 Jolliffe I.T. (2002). Principal Component Analysis. Springer, New York. McBratney A.B., Moore A.W. (1985). Application of fuzzy sets to climatic classification. Agricultural and Forest Meteorology, 35, 165-185. Quintano, C., Castellano R., Longobardi S. (2009). A Fuzzy Clustering Approach to Improve the Accuracy of Italian Student Data. An Experimental Procedure to Correct the Impact of Outliers on Assessment Test Scores, Statistica & Applicazioni 7,149-171. Roubens M. (1982). Fuzzy clustering algorithms and their cluster validity. European Journal of Operational Research, 10 (3), 294 301. 34

Supplementary Slides Comparison between unweighted average score per class and the weighted score per class according to the factor w =1- u a Original distribution Adusted distribution MEAN 74.71 67.39 MODE 100.00 68.75 I QUARTILE 64.42 60.93 MEDIAN 73.61 67.78 III QUARTILE 85.94 74.25 35

Workshop INVALSI Supplementary Slides The cluster centroids and the respective membership functions that solve the minimization problem of the J FKM function are: c i u x u v n m i n m i i 1, 1 1 n c i v x v x u c k m k i i 1, 1, 1 1 1) 1 /( 2 2 36

Supplementary Slides Invalsi macro areas (SNV 2004/05-2005/06) North West=Valle d Aosta, Piemonte, Liguria, Lombardia North East=Trentino Alto Adige,Veneto, Friuli V.G.,Emilia R. Centre=Toscana,Umbria,Marche,Lazio South=Abruzzo,Molise,Campania,Puglia South and Islands= Basilicata,Calabria,Sicilia,Sardegna 37