The influence of teacher support on national standardized student assessment.

Size: px
Start display at page:

Download "The influence of teacher support on national standardized student assessment."

Transcription

1 The influence of teacher support on national standardized student assessment. A fuzzy clustering approach to improve the accuracy of Italian students data Claudio Quintano Rosalia Castellano Sergio Longobardi sergio.longobardi@uniparthenope.it Department of Statistics and Mathematics for Economic Research UNIVERSITY OF NAPLES PARTHENOPE

2 To err is human, to forgive divine. but to include errors in your design is statistical [Leslie Kish, 1978] 2

3 TOTAL SURVEY ERROR SAMPLING ERROR NON SAMPLING ERROR SPECIFICATION error PROCESSING error FRAME error NONRESPONSE error MEASURAMENT error (Biemer and Lyberg, 2003) The excessive teacher support might be considered similar to an interviewer effect (Biemer, Groves, Lyberg, Mathiowetz and Sudman, 1991) and we might suppose that the student s score is affected by an error component which inflates the measurement errors (Braverman, 1996) 3

4 This work considers data on students performance assessments collected by the Italian National Evaluation Institute of the Ministry of Education (INVALSI) in the s.y. 2004/05. A multistage method, combining the factorial analysis with a fuzzy clustering approach, is implemented for evaluating and correcting the cheating in primary classes 4

5 Teacher cheating, especially in extreme cases, is likely to leave tell-tale signs (Jacob B.A., Levitt S.D.; 2003) MATHEMATICS CLASS MEAN SCORE - S.Y 2004/05 IV CLASS PRIMARY SCHOOL II CLASS PRIMARY SCHOOL 5

6 Teacher cheating, especially in extreme cases, is likely to leave tell-tale signs (Jacob B.A., Levitt S.D.; 2003) MATHEMATICS CLASS MEAN SCORE - S.Y 2004/05 III CLASS UPPER SECONDARY SCHOOL I CLASS UPPER SECONDARY SCHOOL I CLASS LOWER SECONDARY SCHOOL 6

7 Histogram II CLASS - PRIMARY SCHOOL Histogram Histogram Frequency Frequency Frequency avergita Mean =87,81 Std. Dev. =8,14 N = avergita Mean =74,71 Std. Dev. =14,133 N = Reading s.y. 2004/05 Mathematics Histogram s.y. 2004/05 Science s.y. 2004/05 Histogram avergita 60 Histogram Frequency Frequency Frequency avergita Mean =77,72 Std. Dev. =13,029 N = avergita Mean =81,5 Std. Dev. =11,439 N = Reading s.y. 2005/06 Mathematics s.y. 2005/06 Science s.y. 2005/ avergita

8 The procedure is aimed at managing the presence of cheating at class level. A class is considered suspicious if : the within (class) variability of the final score is close to zero the answer behaviour is homogeneous the percentage of missing data is very low 8

9 For each class, four indexes are computed: Class mean score: p Class standard deviation score: Class index of answer homogeneity: E Class non-response rate: MC 9

10 p N k 1 N p k N k1 ( p k N p 2 ) Class mean score Class standard deviation score p k denotes the score of k th student of th class N denotes number of respondent students of th class 10

11 E Q s 1 Q E s Index of answer homogeneity Mean of h Gini indexes E s h 1 t1 n N t Gini s index of heterogeneity 2 n t /N denotes the relative frequency of students of th class that has given the t th answer to s th question. The index of answer homogeneity is equal to zero when all students of th class have given the same answers to all test questions, while it reaches the value (h-1)/h when in the th class the answer heterogeneity is maximum 11

12 MC N k 1 N M Q k Class non-response rate 0 (there are no missing or invalid responses for the th class) 1 (all students of th class have given only missing or invalid answers) M k denotes the number both of item non-responses and of invalid responses for the k th student of the th class. Q denotes the number of administered item to th class. It is a constant for each assessment area (reading, mathematics and science) and for each school level. 12

13 At the second step, the size of the data matrix, composed of the four indexes at class level, is reduced to two components by using Factor Analysis with a principal component extraction (Jolliffe, 2002). Eigenvalues of correlation matrix R, simple and cumulative percentage of explained variability by the principal component analysis COMPONENT Initial Eigenvalues TOTAL % of Variance Cumulative %

14 The PCA allows to describe the answer behaviour of each student class through two variables SECOND Component Class non response rate FIRST Component INDEX OF CLASS COLLABORATION TO SURVEY Cheating IDENTIFICATION AXIS 14

15 Proection on the first two factorial axes plane of second class primary students SUSPICIOUS CLASSES High score High homogeneity Low variability Missing responses close to zero 15

16 Fuzzy k-means (FKM) Overcoming the hard clustering The fuzzy k-means Developed by Bezdek (1981) and Dunn (1974), it is a fuzzy version of the non-overlapping partition model - hard k-means- 16

17 Steps of the correction procedure: clustering the classes by fuzzy k-means algorithm proection on the factorial axes of the cluster centroids detection of suspicious cluster centroid computation of a cheating index (on the basis of degree of belonging to suspicious cluster) 17

18 J FKM c n i1 1 u m i x v i 2 The results of FKM are affected by two parameters: number of the clusters (c) fuzziness index (m) 18

19 Calibration strategy A two-step approach: Computation of some validity measures (number of cluster c) Sensitivity of FKM results to the level of the fuzziness (fuzziness parameter m) 19

20 The validity measures to assess the goodness of the fuzzy partitions and to obtain the optimal number of c are: Fuzziness Performance Index (FPI) Normalized Classification Entropy (NCE) Separation index (S) These indices are calculated for several thresholds of the fuzziness parameters m (1.5, 2.0, 2.5). 20

21 Fuzzy Performance Index (Roubens, 1982) FPI 1 cpc c 1 Partition coefficient (Bedzek,1974) PC 1 n c n i1 1 2 u i NCE PE log c Normalized Classification Entropy (Roubens, 1982) PE 1 n c n i1 1 Partition entropy (Bedzek, 1981) u i log u i The FPI and NCE are used for evaluating the fuzziness of the solutions. The lower the FPI and MPE values are, the more suitable is the corresponding solution (McBratney & Moore, 1985). 21

22 Separation index S (Xie and Beni, 1991) S 1 c n i1 1 nmin u 2 i i, x v i v v i 2 2 Where the numerator denotes the compactness by the sum of square distances within clusters, while the denominator denotes separation by the minimal distance between clusters. The smaller the value of S, the better the compactness and separation between the clusters v 22

23 Fuzzy performance Index (FPI), Normal Partition Entropy (NPE) and Separation index (S) in correspondence of m=1.5 and for c[2,20] 0.50 fuzziness (m=1.5) FPI NCE S clusters (c) 23

24 Fuzzy performance Index (FPI), Normal Partition Entropy (NPE) and Separation index (S) in correspondence of m=2 and for c[2,20] 0.60 fuzziness (m=2.0) FPI NCE S clusters (c) 24

25 Fuzzy performance Index (FPI), Normal Partition Entropy (NPE) and Separation index (S) in correspondence of m=2.5 and for c[2,20] fuzziness (m=2.5) FPI NCE S clusters (c) 25

26 Number of classes with a weight close to zero (lower than 0.001) in correspondence of increasing values of fuzziness level: 1.1 m 2.5 and for c=3 50% 40% 30% 20% 10% 0% Units with weight lower than < North West North East Centre South South and Islands fuzziness 26

27 Some weighted central tendency measures of class mean score in correspondence of increasing values of fuzziness level: 1.1 m 2.5 and for c=3 110 Class score xx Area: South and South and islands Mean Median Mode The correction procedure tends to equalize the central tendency measures for m in the range fuzziness 27

28 Cluster 2 gathers the classes that present a cheating profile: high class average scores minimum within variability low presence of missing items Proection on factorial plane of centroids computed by fuzzy k-means algorithm, c=3, m=1.7 28

29 Indicating with a the cheating cluster, the degree of belonging to this cluster is: µ a This measure is considered as the cheating probability of th class Otherwise it can be interpreted as the cheating level of each class 29

30 On the basis of µ a, a weighting factor is developed: W =1 - µ a The students class with high probability to belong to outlier cluster will have a low weight while the class very far from this cluster will have a weight close to 1 30

31 Some Results Graphics by box plot of the index u a distributions (-th unit degree of belonging to cheating cluster) 31

32 Some results ORIGINAL DISTRIBUTION ADJUSTED DISTRIBUTION 32

33 Some results ORIGINAL DISTRIBUTION ORIGINAL DISTRIBUTION ORIGINAL DISTRIBUTION ORIGINAL DISTRIBUTION ORIGINAL DISTRIBUTION North-West North-East Centre South South and Islands ADJUSTED DISTRIBUTION ADJUSTED DISTRIBUTION ADJUSTED DISTRIBUTION ADJUSTED DISTRIBUTION ADJUSTED DISTRIBUTION 33

34 References Bezdek J.C., (1974). Cluster validity with fuzzy sets. Journal of Cybernetics, 3, Bezdek J.C. (1981). Pattern Recognition with Fuzzy Obective Function Algorithms. Plenum Press Biemer P.P., Groves R.M., Lyberg L.E., Mathiowetz N.A., Sudman S. (1991). Measurement Errors in Surveys. Wiley, New York. Braverman M. (1996). Sources of Survey Error: Implications for Evaluation Studies. New Directions for Evaluation: Advances in Survey Research, 70, Dunn J.C. (1974). A Fuzzy Relative of the ISODATA Process and its Use in Detecting Compact Well Separated Clusters. Journal of Cybernetics, 3, Jacob B.A., Levitt S.D. (2003) Rotten Apples: An Investigation of the Prevalence and Predictors of Teacher Cheating. The Quarterly Journal of Economics, 118(3), Jolliffe I.T. (2002). Principal Component Analysis. Springer, New York. McBratney A.B., Moore A.W. (1985). Application of fuzzy sets to climatic classification. Agricultural and Forest Meteorology, 35, Quintano, C., Castellano R., Longobardi S. (2009). A Fuzzy Clustering Approach to Improve the Accuracy of Italian Student Data. An Experimental Procedure to Correct the Impact of Outliers on Assessment Test Scores, Statistica & Applicazioni 7, Roubens M. (1982). Fuzzy clustering algorithms and their cluster validity. European Journal of Operational Research, 10 (3),

35 Supplementary Slides Comparison between unweighted average score per class and the weighted score per class according to the factor w =1- u a Original distribution Adusted distribution MEAN MODE I QUARTILE MEDIAN III QUARTILE

36 Workshop INVALSI Supplementary Slides The cluster centroids and the respective membership functions that solve the minimization problem of the J FKM function are: c i u x u v n m i n m i i 1, 1 1 n c i v x v x u c k m k i i 1, 1, 1 1 1) 1 /(

37 Supplementary Slides Invalsi macro areas (SNV 2004/ /06) North West=Valle d Aosta, Piemonte, Liguria, Lombardia North East=Trentino Alto Adige,Veneto, Friuli V.G.,Emilia R. Centre=Toscana,Umbria,Marche,Lazio South=Abruzzo,Molise,Campania,Puglia South and Islands= Basilicata,Calabria,Sicilia,Sardegna 37

Dimensionality Reduction: Principal Components Analysis

Dimensionality Reduction: Principal Components Analysis Dimensionality Reduction: Principal Components Analysis In data mining one often encounters situations where there are a large number of variables in the database. In such situations it is very likely

More information

Data Exploration Data Visualization

Data Exploration Data Visualization Data Exploration Data Visualization What is data exploration? A preliminary exploration of the data to better understand its characteristics. Key motivations of data exploration include Helping to select

More information

STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and

STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and Clustering Techniques and STATISTICA Case Study: Defining Clusters of Shopping Center Patrons STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table

More information

Center: Finding the Median. Median. Spread: Home on the Range. Center: Finding the Median (cont.)

Center: Finding the Median. Median. Spread: Home on the Range. Center: Finding the Median (cont.) Center: Finding the Median When we think of a typical value, we usually look for the center of the distribution. For a unimodal, symmetric distribution, it s easy to find the center it s just the center

More information

1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number

1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number 1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number A. 3(x - x) B. x 3 x C. 3x - x D. x - 3x 2) Write the following as an algebraic expression

More information

DATA MINING AND CUSTOMER RELATIONSHIP MANAGEMENT FOR CLIENTS SEGMENTATION

DATA MINING AND CUSTOMER RELATIONSHIP MANAGEMENT FOR CLIENTS SEGMENTATION DATA MINING AND CUSTOMER RELATIONSHIP MANAGEMENT FOR CLIENTS SEGMENTATION Ionela-Catalina Tudorache (Zamfir) 1, Radu-Ioan Vija 2 1), 2) The Bucharest University of Economic Studies, Economic Cybernetics

More information

DESCRIPTIVE STATISTICS. The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses.

DESCRIPTIVE STATISTICS. The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses. DESCRIPTIVE STATISTICS The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses. DESCRIPTIVE VS. INFERENTIAL STATISTICS Descriptive To organize,

More information

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu Medical Information Management & Mining You Chen Jan,15, 2013 You.chen@vanderbilt.edu 1 Trees Building Materials Trees cannot be used to build a house directly. How can we transform trees to building materials?

More information

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( ) Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates

More information

Week 1. Exploratory Data Analysis

Week 1. Exploratory Data Analysis Week 1 Exploratory Data Analysis Practicalities This course ST903 has students from both the MSc in Financial Mathematics and the MSc in Statistics. Two lectures and one seminar/tutorial per week. Exam

More information

Segmentation of stock trading customers according to potential value

Segmentation of stock trading customers according to potential value Expert Systems with Applications 27 (2004) 27 33 www.elsevier.com/locate/eswa Segmentation of stock trading customers according to potential value H.W. Shin a, *, S.Y. Sohn b a Samsung Economy Research

More information

Descriptive Statistics

Descriptive Statistics Y520 Robert S Michael Goal: Learn to calculate indicators and construct graphs that summarize and describe a large quantity of values. Using the textbook readings and other resources listed on the web

More information

BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS

BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS SEEMA JAGGI Indian Agricultural Statistics Research Institute Library Avenue, New Delhi-110 012 seema@iasri.res.in Genomics A genome is an organism s

More information

Descriptive statistics Statistical inference statistical inference, statistical induction and inferential statistics

Descriptive statistics Statistical inference statistical inference, statistical induction and inferential statistics Descriptive statistics is the discipline of quantitatively describing the main features of a collection of data. Descriptive statistics are distinguished from inferential statistics (or inductive statistics),

More information

GeoGebra. 10 lessons. Gerrit Stols

GeoGebra. 10 lessons. Gerrit Stols GeoGebra in 10 lessons Gerrit Stols Acknowledgements GeoGebra is dynamic mathematics open source (free) software for learning and teaching mathematics in schools. It was developed by Markus Hohenwarter

More information

Clustering UE 141 Spring 2013

Clustering UE 141 Spring 2013 Clustering UE 141 Spring 013 Jing Gao SUNY Buffalo 1 Definition of Clustering Finding groups of obects such that the obects in a group will be similar (or related) to one another and different from (or

More information

Simple linear regression

Simple linear regression Simple linear regression Introduction Simple linear regression is a statistical method for obtaining a formula to predict values of one variable from another where there is a causal relationship between

More information

Local outlier detection in data forensics: data mining approach to flag unusual schools

Local outlier detection in data forensics: data mining approach to flag unusual schools Local outlier detection in data forensics: data mining approach to flag unusual schools Mayuko Simon Data Recognition Corporation Paper presented at the 2012 Conference on Statistical Detection of Potential

More information

How To Cluster

How To Cluster Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main

More information

STATISTICA Formula Guide: Logistic Regression. Table of Contents

STATISTICA Formula Guide: Logistic Regression. Table of Contents : Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

More information

Data Preparation Part 1: Exploratory Data Analysis & Data Cleaning, Missing Data

Data Preparation Part 1: Exploratory Data Analysis & Data Cleaning, Missing Data Data Preparation Part 1: Exploratory Data Analysis & Data Cleaning, Missing Data CAS Predictive Modeling Seminar Louise Francis Francis Analytics and Actuarial Data Mining, Inc. www.data-mines.com Louise.francis@data-mines.cm

More information

Lecture 2: Descriptive Statistics and Exploratory Data Analysis

Lecture 2: Descriptive Statistics and Exploratory Data Analysis Lecture 2: Descriptive Statistics and Exploratory Data Analysis Further Thoughts on Experimental Design 16 Individuals (8 each from two populations) with replicates Pop 1 Pop 2 Randomly sample 4 individuals

More information

MEASURES OF CENTER AND SPREAD MEASURES OF CENTER 11/20/2014. What is a measure of center? a value at the center or middle of a data set

MEASURES OF CENTER AND SPREAD MEASURES OF CENTER 11/20/2014. What is a measure of center? a value at the center or middle of a data set MEASURES OF CENTER AND SPREAD Mean and Median MEASURES OF CENTER What is a measure of center? a value at the center or middle of a data set Several different ways to determine the center: Mode Median Mean

More information

15.062 Data Mining: Algorithms and Applications Matrix Math Review

15.062 Data Mining: Algorithms and Applications Matrix Math Review .6 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop

More information

Steven M. Ho!and. Department of Geology, University of Georgia, Athens, GA 30602-2501

Steven M. Ho!and. Department of Geology, University of Georgia, Athens, GA 30602-2501 PRINCIPAL COMPONENTS ANALYSIS (PCA) Steven M. Ho!and Department of Geology, University of Georgia, Athens, GA 30602-2501 May 2008 Introduction Suppose we had measured two variables, length and width, and

More information

Algebra I Vocabulary Cards

Algebra I Vocabulary Cards Algebra I Vocabulary Cards Table of Contents Expressions and Operations Natural Numbers Whole Numbers Integers Rational Numbers Irrational Numbers Real Numbers Absolute Value Order of Operations Expression

More information

Means, standard deviations and. and standard errors

Means, standard deviations and. and standard errors CHAPTER 4 Means, standard deviations and standard errors 4.1 Introduction Change of units 4.2 Mean, median and mode Coefficient of variation 4.3 Measures of variation 4.4 Calculating the mean and standard

More information

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM International Journal of Innovative Computing, Information and Control ICIC International c 0 ISSN 34-48 Volume 8, Number 8, August 0 pp. 4 FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT

More information

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca Clustering Adrian Groza Department of Computer Science Technical University of Cluj-Napoca Outline 1 Cluster Analysis What is Datamining? Cluster Analysis 2 K-means 3 Hierarchical Clustering What is Datamining?

More information

Principal Component Analysis

Principal Component Analysis Principal Component Analysis ERS70D George Fernandez INTRODUCTION Analysis of multivariate data plays a key role in data analysis. Multivariate data consists of many different attributes or variables recorded

More information

Factor Analysis. Chapter 420. Introduction

Factor Analysis. Chapter 420. Introduction Chapter 420 Introduction (FA) is an exploratory technique applied to a set of observed variables that seeks to find underlying factors (subsets of variables) from which the observed variables were generated.

More information

Cafcam: Crisp And Fuzzy Classification Accuracy Measurement Software

Cafcam: Crisp And Fuzzy Classification Accuracy Measurement Software Cafcam: Crisp And Fuzzy Classification Accuracy Measurement Software Mohamed A. Shalan 1, Manoj K. Arora 2 and John Elgy 1 1 School of Engineering and Applied Sciences, Aston University, Birmingham, UK

More information

COM CO P 5318 Da t Da a t Explora Explor t a ion and Analysis y Chapte Chapt r e 3

COM CO P 5318 Da t Da a t Explora Explor t a ion and Analysis y Chapte Chapt r e 3 COMP 5318 Data Exploration and Analysis Chapter 3 What is data exploration? A preliminary exploration of the data to better understand its characteristics. Key motivations of data exploration include Helping

More information

Descriptive statistics parameters: Measures of centrality

Descriptive statistics parameters: Measures of centrality Descriptive statistics parameters: Measures of centrality Contents Definitions... 3 Classification of descriptive statistics parameters... 4 More about central tendency estimators... 5 Relationship between

More information

Face detection is a process of localizing and extracting the face region from the

Face detection is a process of localizing and extracting the face region from the Chapter 4 FACE NORMALIZATION 4.1 INTRODUCTION Face detection is a process of localizing and extracting the face region from the background. The detected face varies in rotation, brightness, size, etc.

More information

Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP

Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP ABSTRACT In data mining modelling, data preparation

More information

STATS8: Introduction to Biostatistics. Data Exploration. Babak Shahbaba Department of Statistics, UCI

STATS8: Introduction to Biostatistics. Data Exploration. Babak Shahbaba Department of Statistics, UCI STATS8: Introduction to Biostatistics Data Exploration Babak Shahbaba Department of Statistics, UCI Introduction After clearly defining the scientific problem, selecting a set of representative members

More information

A Complete Gradient Clustering Algorithm for Features Analysis of X-ray Images

A Complete Gradient Clustering Algorithm for Features Analysis of X-ray Images A Complete Gradient Clustering Algorithm for Features Analysis of X-ray Images Małgorzata Charytanowicz, Jerzy Niewczas, Piotr A. Kowalski, Piotr Kulczycki, Szymon Łukasik, and Sławomir Żak Abstract Methods

More information

Geostatistics Exploratory Analysis

Geostatistics Exploratory Analysis Instituto Superior de Estatística e Gestão de Informação Universidade Nova de Lisboa Master of Science in Geospatial Technologies Geostatistics Exploratory Analysis Carlos Alberto Felgueiras cfelgueiras@isegi.unl.pt

More information

DISCRIMINANT FUNCTION ANALYSIS (DA)

DISCRIMINANT FUNCTION ANALYSIS (DA) DISCRIMINANT FUNCTION ANALYSIS (DA) John Poulsen and Aaron French Key words: assumptions, further reading, computations, standardized coefficents, structure matrix, tests of signficance Introduction Discriminant

More information

CHAPTER THREE COMMON DESCRIPTIVE STATISTICS COMMON DESCRIPTIVE STATISTICS / 13

CHAPTER THREE COMMON DESCRIPTIVE STATISTICS COMMON DESCRIPTIVE STATISTICS / 13 COMMON DESCRIPTIVE STATISTICS / 13 CHAPTER THREE COMMON DESCRIPTIVE STATISTICS The analysis of data begins with descriptive statistics such as the mean, median, mode, range, standard deviation, variance,

More information

Component Ordering in Independent Component Analysis Based on Data Power

Component Ordering in Independent Component Analysis Based on Data Power Component Ordering in Independent Component Analysis Based on Data Power Anne Hendrikse Raymond Veldhuis University of Twente University of Twente Fac. EEMCS, Signals and Systems Group Fac. EEMCS, Signals

More information

SEARCH ENGINE WITH PARALLEL PROCESSING AND INCREMENTAL K-MEANS FOR FAST SEARCH AND RETRIEVAL

SEARCH ENGINE WITH PARALLEL PROCESSING AND INCREMENTAL K-MEANS FOR FAST SEARCH AND RETRIEVAL SEARCH ENGINE WITH PARALLEL PROCESSING AND INCREMENTAL K-MEANS FOR FAST SEARCH AND RETRIEVAL Krishna Kiran Kattamuri 1 and Rupa Chiramdasu 2 Department of Computer Science Engineering, VVIT, Guntur, India

More information

430 Statistics and Financial Mathematics for Business

430 Statistics and Financial Mathematics for Business Prescription: 430 Statistics and Financial Mathematics for Business Elective prescription Level 4 Credit 20 Version 2 Aim Students will be able to summarise, analyse, interpret and present data, make predictions

More information

The Annual Inflation Rate Analysis Using Data Mining Techniques

The Annual Inflation Rate Analysis Using Data Mining Techniques Economic Insights Trends and Challenges Vol. I (LXIV) No. 4/2012 121-128 The Annual Inflation Rate Analysis Using Data Mining Techniques Mădălina Cărbureanu Petroleum-Gas University of Ploiesti, Bd. Bucuresti

More information

Factor Analysis. Principal components factor analysis. Use of extracted factors in multivariate dependency models

Factor Analysis. Principal components factor analysis. Use of extracted factors in multivariate dependency models Factor Analysis Principal components factor analysis Use of extracted factors in multivariate dependency models 2 KEY CONCEPTS ***** Factor Analysis Interdependency technique Assumptions of factor analysis

More information

Northumberland Knowledge

Northumberland Knowledge Northumberland Knowledge Know Guide How to Analyse Data - November 2012 - This page has been left blank 2 About this guide The Know Guides are a suite of documents that provide useful information about

More information

Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing. C. Olivia Rud, VP, Fleet Bank

Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing. C. Olivia Rud, VP, Fleet Bank Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing C. Olivia Rud, VP, Fleet Bank ABSTRACT Data Mining is a new term for the common practice of searching through

More information

Iris Sample Data Set. Basic Visualization Techniques: Charts, Graphs and Maps. Summary Statistics. Frequency and Mode

Iris Sample Data Set. Basic Visualization Techniques: Charts, Graphs and Maps. Summary Statistics. Frequency and Mode Iris Sample Data Set Basic Visualization Techniques: Charts, Graphs and Maps CS598 Information Visualization Spring 2010 Many of the exploratory data techniques are illustrated with the Iris Plant data

More information

Java Modules for Time Series Analysis

Java Modules for Time Series Analysis Java Modules for Time Series Analysis Agenda Clustering Non-normal distributions Multifactor modeling Implied ratings Time series prediction 1. Clustering + Cluster 1 Synthetic Clustering + Time series

More information

A New Method for Dimensionality Reduction using K- Means Clustering Algorithm for High Dimensional Data Set

A New Method for Dimensionality Reduction using K- Means Clustering Algorithm for High Dimensional Data Set A New Method for Dimensionality Reduction using K- Means Clustering Algorithm for High Dimensional Data Set D.Napoleon Assistant Professor Department of Computer Science Bharathiar University Coimbatore

More information

Introduction to Principal Components and FactorAnalysis

Introduction to Principal Components and FactorAnalysis Introduction to Principal Components and FactorAnalysis Multivariate Analysis often starts out with data involving a substantial number of correlated variables. Principal Component Analysis (PCA) is a

More information

Data Exploration and Preprocessing. Data Mining and Text Mining (UIC 583 @ Politecnico di Milano)

Data Exploration and Preprocessing. Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) Data Exploration and Preprocessing Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) References Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", The Morgan Kaufmann

More information

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler Data Mining: Exploring Data Lecture Notes for Chapter 3 Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler Topics Exploratory Data Analysis Summary Statistics Visualization What is data exploration?

More information

DEA implementation and clustering analysis using the K-Means algorithm

DEA implementation and clustering analysis using the K-Means algorithm Data Mining VI 321 DEA implementation and clustering analysis using the K-Means algorithm C. A. A. Lemos, M. P. E. Lins & N. F. F. Ebecken COPPE/Universidade Federal do Rio de Janeiro, Brazil Abstract

More information

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS 1 AND ALGORITHMS Chiara Renso KDD-LAB ISTI- CNR, Pisa, Italy WHAT IS CLUSTER ANALYSIS? Finding groups of objects such that the objects in a group will be similar

More information

Advanced Ensemble Strategies for Polynomial Models

Advanced Ensemble Strategies for Polynomial Models Advanced Ensemble Strategies for Polynomial Models Pavel Kordík 1, Jan Černý 2 1 Dept. of Computer Science, Faculty of Information Technology, Czech Technical University in Prague, 2 Dept. of Computer

More information

Analyzing Quantitative Data Ellen Taylor-Powell

Analyzing Quantitative Data Ellen Taylor-Powell G3658-6 Program Development and Evaluation Analyzing Quantitative Data Ellen Taylor-Powell Statistical analysis can be quite involved. However, there are some common mathematical techniques that can make

More information

The right edge of the box is the third quartile, Q 3, which is the median of the data values above the median. Maximum Median

The right edge of the box is the third quartile, Q 3, which is the median of the data values above the median. Maximum Median CONDENSED LESSON 2.1 Box Plots In this lesson you will create and interpret box plots for sets of data use the interquartile range (IQR) to identify potential outliers and graph them on a modified box

More information

CLUSTERING LARGE DATA SETS WITH MIXED NUMERIC AND CATEGORICAL VALUES *

CLUSTERING LARGE DATA SETS WITH MIXED NUMERIC AND CATEGORICAL VALUES * CLUSTERING LARGE DATA SETS WITH MIED NUMERIC AND CATEGORICAL VALUES * ZHEUE HUANG CSIRO Mathematical and Information Sciences GPO Box Canberra ACT, AUSTRALIA huang@cmis.csiro.au Efficient partitioning

More information

Robust Outlier Detection Technique in Data Mining: A Univariate Approach

Robust Outlier Detection Technique in Data Mining: A Univariate Approach Robust Outlier Detection Technique in Data Mining: A Univariate Approach Singh Vijendra and Pathak Shivani Faculty of Engineering and Technology Mody Institute of Technology and Science Lakshmangarh, Sikar,

More information

Module 4: Data Exploration

Module 4: Data Exploration Module 4: Data Exploration Now that you have your data downloaded from the Streams Project database, the detective work can begin! Before computing any advanced statistics, we will first use descriptive

More information

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 by Tan, Steinbach, Kumar 1 What is Cluster Analysis? Finding groups of objects such that the objects in a group will

More information

BNG 202 Biomechanics Lab. Descriptive statistics and probability distributions I

BNG 202 Biomechanics Lab. Descriptive statistics and probability distributions I BNG 202 Biomechanics Lab Descriptive statistics and probability distributions I Overview The overall goal of this short course in statistics is to provide an introduction to descriptive and inferential

More information

Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning

Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning SAMSI 10 May 2013 Outline Introduction to NMF Applications Motivations NMF as a middle step

More information

Mean = (sum of the values / the number of the value) if probabilities are equal

Mean = (sum of the values / the number of the value) if probabilities are equal Population Mean Mean = (sum of the values / the number of the value) if probabilities are equal Compute the population mean Population/Sample mean: 1. Collect the data 2. sum all the values in the population/sample.

More information

Prototype-based classification by fuzzification of cases

Prototype-based classification by fuzzification of cases Prototype-based classification by fuzzification of cases Parisa KordJamshidi Dep.Telecommunications and Information Processing Ghent university pkord@telin.ugent.be Bernard De Baets Dep. Applied Mathematics

More information

Steven M. Ho!and. Department of Geology, University of Georgia, Athens, GA 30602-2501

Steven M. Ho!and. Department of Geology, University of Georgia, Athens, GA 30602-2501 CLUSTER ANALYSIS Steven M. Ho!and Department of Geology, University of Georgia, Athens, GA 30602-2501 January 2006 Introduction Cluster analysis includes a broad suite of techniques designed to find groups

More information

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set Amhmed A. Bhih School of Electrical and Electronic Engineering Princy Johnson School of Electrical and Electronic Engineering Martin

More information

4.1 Exploratory Analysis: Once the data is collected and entered, the first question is: "What do the data look like?"

4.1 Exploratory Analysis: Once the data is collected and entered, the first question is: What do the data look like? Data Analysis Plan The appropriate methods of data analysis are determined by your data types and variables of interest, the actual distribution of the variables, and the number of cases. Different analyses

More information

Data Preparation and Statistical Displays

Data Preparation and Statistical Displays Reservoir Modeling with GSLIB Data Preparation and Statistical Displays Data Cleaning / Quality Control Statistics as Parameters for Random Function Models Univariate Statistics Histograms and Probability

More information

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information

Exploratory data analysis (Chapter 2) Fall 2011

Exploratory data analysis (Chapter 2) Fall 2011 Exploratory data analysis (Chapter 2) Fall 2011 Data Examples Example 1: Survey Data 1 Data collected from a Stat 371 class in Fall 2005 2 They answered questions about their: gender, major, year in school,

More information

K-Means Clustering Tutorial

K-Means Clustering Tutorial K-Means Clustering Tutorial By Kardi Teknomo,PhD Preferable reference for this tutorial is Teknomo, Kardi. K-Means Clustering Tutorials. http:\\people.revoledu.com\kardi\ tutorial\kmean\ Last Update: July

More information

Environmental Remote Sensing GEOG 2021

Environmental Remote Sensing GEOG 2021 Environmental Remote Sensing GEOG 2021 Lecture 4 Image classification 2 Purpose categorising data data abstraction / simplification data interpretation mapping for land cover mapping use land cover class

More information

Exploratory data analysis for microarray data

Exploratory data analysis for microarray data Eploratory data analysis for microarray data Anja von Heydebreck Ma Planck Institute for Molecular Genetics, Dept. Computational Molecular Biology, Berlin, Germany heydebre@molgen.mpg.de Visualization

More information

A Correlation of. to the. South Carolina Data Analysis and Probability Standards

A Correlation of. to the. South Carolina Data Analysis and Probability Standards A Correlation of to the South Carolina Data Analysis and Probability Standards INTRODUCTION This document demonstrates how Stats in Your World 2012 meets the indicators of the South Carolina Academic Standards

More information

FUZZY IMAGE PROCESSING APPLIED TO TIME SERIES ANALYSIS

FUZZY IMAGE PROCESSING APPLIED TO TIME SERIES ANALYSIS 4.3 FUZZY IMAGE PROCESSING APPLIED TO TIME SERIES ANALYSIS R. Andrew Weekley 1 *, Robert K. Goodrich 1,2, Larry B. Cornman 1 1 National Center for Atmospheric Research **,Boulder,CO 2 University of Colorado,

More information

FACTOR ANALYSIS. Factor Analysis is similar to PCA in that it is a technique for studying the interrelationships among variables.

FACTOR ANALYSIS. Factor Analysis is similar to PCA in that it is a technique for studying the interrelationships among variables. FACTOR ANALYSIS Introduction Factor Analysis is similar to PCA in that it is a technique for studying the interrelationships among variables Both methods differ from regression in that they don t have

More information

3.2 Measures of Spread

3.2 Measures of Spread 3.2 Measures of Spread In some data sets the observations are close together, while in others they are more spread out. In addition to measures of the center, it's often important to measure the spread

More information

How to Get More Value from Your Survey Data

How to Get More Value from Your Survey Data Technical report How to Get More Value from Your Survey Data Discover four advanced analysis techniques that make survey research more effective Table of contents Introduction..............................................................2

More information

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining Data Mining: Exploring Data Lecture Notes for Chapter 3 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 8/05/2005 1 What is data exploration? A preliminary

More information

Data Mining for Model Creation. Presentation by Paul Below, EDS 2500 NE Plunkett Lane Poulsbo, WA USA 98370 paul.below@eds.

Data Mining for Model Creation. Presentation by Paul Below, EDS 2500 NE Plunkett Lane Poulsbo, WA USA 98370 paul.below@eds. Sept 03-23-05 22 2005 Data Mining for Model Creation Presentation by Paul Below, EDS 2500 NE Plunkett Lane Poulsbo, WA USA 98370 paul.below@eds.com page 1 Agenda Data Mining and Estimating Model Creation

More information

KEANSBURG SCHOOL DISTRICT KEANSBURG HIGH SCHOOL Mathematics Department. HSPA 10 Curriculum. September 2007

KEANSBURG SCHOOL DISTRICT KEANSBURG HIGH SCHOOL Mathematics Department. HSPA 10 Curriculum. September 2007 KEANSBURG HIGH SCHOOL Mathematics Department HSPA 10 Curriculum September 2007 Written by: Karen Egan Mathematics Supervisor: Ann Gagliardi 7 days Sample and Display Data (Chapter 1 pp. 4-47) Surveys and

More information

Engineering Problem Solving and Excel. EGN 1006 Introduction to Engineering

Engineering Problem Solving and Excel. EGN 1006 Introduction to Engineering Engineering Problem Solving and Excel EGN 1006 Introduction to Engineering Mathematical Solution Procedures Commonly Used in Engineering Analysis Data Analysis Techniques (Statistics) Curve Fitting techniques

More information

Modeling Lifetime Value in the Insurance Industry

Modeling Lifetime Value in the Insurance Industry Modeling Lifetime Value in the Insurance Industry C. Olivia Parr Rud, Executive Vice President, Data Square, LLC ABSTRACT Acquisition modeling for direct mail insurance has the unique challenge of targeting

More information

Performance Metrics for Graph Mining Tasks

Performance Metrics for Graph Mining Tasks Performance Metrics for Graph Mining Tasks 1 Outline Introduction to Performance Metrics Supervised Learning Performance Metrics Unsupervised Learning Performance Metrics Optimizing Metrics Statistical

More information

2. Simple Linear Regression

2. Simple Linear Regression Research methods - II 3 2. Simple Linear Regression Simple linear regression is a technique in parametric statistics that is commonly used for analyzing mean response of a variable Y which changes according

More information

Three had equal scores for the Progressive and Humanistic. schools. The other four had the following combination of

Three had equal scores for the Progressive and Humanistic. schools. The other four had the following combination of Three had equal scores for the Progressive and Humanistic schools. The other four had the following combination of schools: Behaviorist-Humanistic, Liberal Education- Progressive, Behaviorist-Progressive,

More information

KNIME TUTORIAL. Anna Monreale KDD-Lab, University of Pisa Email: annam@di.unipi.it

KNIME TUTORIAL. Anna Monreale KDD-Lab, University of Pisa Email: annam@di.unipi.it KNIME TUTORIAL Anna Monreale KDD-Lab, University of Pisa Email: annam@di.unipi.it Outline Introduction on KNIME KNIME components Exercise: Market Basket Analysis Exercise: Customer Segmentation Exercise:

More information

Alignment and Preprocessing for Data Analysis

Alignment and Preprocessing for Data Analysis Alignment and Preprocessing for Data Analysis Preprocessing tools for chromatography Basics of alignment GC FID (D) data and issues PCA F Ratios GC MS (D) data and issues PCA F Ratios PARAFAC Piecewise

More information

SPSS Tutorial. AEB 37 / AE 802 Marketing Research Methods Week 7

SPSS Tutorial. AEB 37 / AE 802 Marketing Research Methods Week 7 SPSS Tutorial AEB 37 / AE 802 Marketing Research Methods Week 7 Cluster analysis Lecture / Tutorial outline Cluster analysis Example of cluster analysis Work on the assignment Cluster Analysis It is a

More information

How To Write A Data Analysis

How To Write A Data Analysis Mathematics Probability and Statistics Curriculum Guide Revised 2010 This page is intentionally left blank. Introduction The Mathematics Curriculum Guide serves as a guide for teachers when planning instruction

More information

Assessment of the National Water Quality Monitoring Program of Egypt

Assessment of the National Water Quality Monitoring Program of Egypt Assessment of the National Water Quality Monitoring Program of Egypt Rasha M.S. El Kholy 1, Bahaa M. Khalil & Shaden T. Abdel Gawad 3 1 Researcher, Assistant researcher, 3 Vice-chairperson, National Water

More information

Survey Data Analysis. Qatar University. Dr. Kenneth M.Coleman (Ken.Coleman@marketstrategies.com) - University of Michigan

Survey Data Analysis. Qatar University. Dr. Kenneth M.Coleman (Ken.Coleman@marketstrategies.com) - University of Michigan The following slides are the property of their authors and are provided on this website as a public service. Please do not copy or redistribute these slides without the written permission of all of the

More information

A fast, powerful data mining workbench designed for small to midsize organizations

A fast, powerful data mining workbench designed for small to midsize organizations FACT SHEET SAS Desktop Data Mining for Midsize Business A fast, powerful data mining workbench designed for small to midsize organizations What does SAS Desktop Data Mining for Midsize Business do? Business

More information

Subspace Analysis and Optimization for AAM Based Face Alignment

Subspace Analysis and Optimization for AAM Based Face Alignment Subspace Analysis and Optimization for AAM Based Face Alignment Ming Zhao Chun Chen College of Computer Science Zhejiang University Hangzhou, 310027, P.R.China zhaoming1999@zju.edu.cn Stan Z. Li Microsoft

More information

How To Identify A Churner

How To Identify A Churner 2012 45th Hawaii International Conference on System Sciences A New Ensemble Model for Efficient Churn Prediction in Mobile Telecommunication Namhyoung Kim, Jaewook Lee Department of Industrial and Management

More information

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/8/2004 Hierarchical

More information

Standardization and Its Effects on K-Means Clustering Algorithm

Standardization and Its Effects on K-Means Clustering Algorithm Research Journal of Applied Sciences, Engineering and Technology 6(7): 399-3303, 03 ISSN: 040-7459; e-issn: 040-7467 Maxwell Scientific Organization, 03 Submitted: January 3, 03 Accepted: February 5, 03

More information