Data Preparation Part 1: Exploratory Data Analysis & Data Cleaning, Missing Data



Similar documents
Predictive Modeling in Workers Compensation 2008 CAS Ratemaking Seminar

Lecture 1: Review and Exploratory Data Analysis (EDA)

Center: Finding the Median. Median. Spread: Home on the Range. Center: Finding the Median (cont.)

Descriptive Statistics

Data Exploration Data Visualization

Exploratory Data Analysis

WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat

BNG 202 Biomechanics Lab. Descriptive statistics and probability distributions I

2. Filling Data Gaps, Data validation & Descriptive Statistics

Exploratory data analysis (Chapter 2) Fall 2011

Variables. Exploratory Data Analysis

MBA 611 STATISTICS AND QUANTITATIVE METHODS

Lecture 2: Descriptive Statistics and Exploratory Data Analysis

Exploratory Data Analysis. Psychology 3256

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining

Exercise 1.12 (Pg )

Geostatistics Exploratory Analysis

The right edge of the box is the third quartile, Q 3, which is the median of the data values above the median. Maximum Median

Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP

STATS8: Introduction to Biostatistics. Data Exploration. Babak Shahbaba Department of Statistics, UCI

Chapter 3. The Normal Distribution

COM CO P 5318 Da t Da a t Explora Explor t a ion and Analysis y Chapte Chapt r e 3

4.1 Exploratory Analysis: Once the data is collected and entered, the first question is: "What do the data look like?"

Chapter 1: Looking at Data Section 1.1: Displaying Distributions with Graphs

Java Modules for Time Series Analysis

Descriptive statistics Statistical inference statistical inference, statistical induction and inferential statistics

Business Statistics. Successful completion of Introductory and/or Intermediate Algebra courses is recommended before taking Business Statistics.

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

Data Analysis Tools. Tools for Summarizing Data

3: Summary Statistics

STATISTICAL ANALYSIS WITH EXCEL COURSE OUTLINE

Dongfeng Li. Autumn 2010

Advanced Excel for Institutional Researchers

1.3 Measuring Center & Spread, The Five Number Summary & Boxplots. Describing Quantitative Data with Numbers

Data Exploration and Preprocessing. Data Mining and Text Mining (UIC Politecnico di Milano)

Module 4: Data Exploration

Week 1. Exploratory Data Analysis

Course Text. Required Computing Software. Course Description. Course Objectives. StraighterLine. Business Statistics

Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing. C. Olivia Rud, VP, Fleet Bank

business statistics using Excel OXFORD UNIVERSITY PRESS Glyn Davis & Branko Pecar

Implications of Big Data for Statistics Instruction 17 Nov 2013

Probability and Statistics Vocabulary List (Definitions for Middle School Teachers)

Introduction to Statistics for Psychology. Quantitative Methods for Human Sciences

IBM SPSS Direct Marketing 19

BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS

Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics

Statistics I for QBIC. Contents and Objectives. Chapters 1 7. Revised: August 2013

Measures of Central Tendency and Variability: Summarizing your Data for Others

4 Other useful features on the course web page. 5 Accessing SAS

Statistics Review PSY379

When to use Excel. When NOT to use Excel 9/24/2014

Data exploration with Microsoft Excel: univariate analysis

1-3 id id no. of respondents respon 1 responsible for maintenance? 1 = no, 2 = yes, 9 = blank

Northumberland Knowledge

ADD-INS: ENHANCING EXCEL

Iris Sample Data Set. Basic Visualization Techniques: Charts, Graphs and Maps. Summary Statistics. Frequency and Mode

determining relationships among the explanatory variables, and

Modeling Lifetime Value in the Insurance Industry

DESCRIPTIVE STATISTICS AND EXPLORATORY DATA ANALYSIS

How To Write A Data Analysis

Tutorial 3: Graphics and Exploratory Data Analysis in R Jason Pienaar and Tom Miller

Introduction to Environmental Statistics. The Big Picture. Populations and Samples. Sample Data. Examples of sample data

International College of Economics and Finance Syllabus Probability Theory and Introductory Statistics

THE BINOMIAL DISTRIBUTION & PROBABILITY

How Far is too Far? Statistical Outlier Detection

Descriptive Statistics

Survey Analysis: Data Mining versus Standard Statistical Analysis for Better Analysis of Survey Responses

CS Introduction to Data Mining Instructor: Abdullah Mueen

Data Mining Techniques Chapter 5: The Lure of Statistics: Data Mining Using Familiar Tools

Insurance Fraud Detection: MARS versus Neural Networks?

Summary of Formulas and Concepts. Descriptive Statistics (Ch. 1-4)

430 Statistics and Financial Mathematics for Business

Difference of Means and ANOVA Problems

Mathematics of Risk. Introduction. Case Study #1 Personal Auto Insurance Pricing. Mathematical Concepts Illustrated. Background

Foundation of Quantitative Data Analysis

Data Mining: Data Preprocessing. I211: Information infrastructure II

List of Examples. Examples 319

Exploratory Data Analysis

Descriptive Statistics

GeoGebra Statistics and Probability

Intro to Statistics 8 Curriculum

Evaluating the results of a car crash study using Statistical Analysis System. Kennesaw State University

Exploratory Data Analysis

Basics of Statistics

3.2 Measures of Spread

Means, standard deviations and. and standard errors

Data Analysis. Using Excel. Jeffrey L. Rummel. BBA Seminar. Data in Excel. Excel Calculations of Descriptive Statistics. Single Variable Graphs

Predictor Coef StDev T P Constant X S = R-Sq = 0.0% R-Sq(adj) = 0.

Data Preprocessing. Week 2

Mean = (sum of the values / the number of the value) if probabilities are equal

Data Mining: Algorithms and Applications Matrix Math Review

What is Data Analysis. Kerala School of MathematicsCourse in Statistics for Scientis. Introduction to Data Analysis. Steps in a Statistical Study

Lesson 4 Measures of Central Tendency

Transcription:

Data Preparation Part 1: Exploratory Data Analysis & Data Cleaning, Missing Data CAS Predictive Modeling Seminar Louise Francis Francis Analytics and Actuarial Data Mining, Inc. www.data-mines.com Louise.francis@data-mines.cm 1

LF1 Objectives Introduce data preparation and where it fits in in modeling process Discuss Data Quality Focus on a key part of data preparation Exploratory data analysis Identify data glitches and errors Understanding the data Identify possible transformations What to do about missing data Provide resources on data preparation 2

Slide 2 LF1 Louise Francis, 9/29/2006

CRISP-DM Guidelines for data mining projects Gives overview of life cycle of data mining project Defines different phases and activities that take place in phase 3

Modelling Process Business Understanding Deployment 0 Complex Model Artifact Data Understanding Evaluation Data Preparation Modeling 4

Data Preprocessing 5

Data Quality Problem 6

Data Quality: A Problem Actuary reviewing a database 7

May s Law 8

It s Not Just Us In just about any organization, the state of information quality is at the same low level Olson, Data Quality 9

Some Consequences of poor data quality Affects quality (precision) of result Can t do modeling project because of data problems If errors not found modeling blunder 10

Data Exploration in Predictive Modeling 11

Exploratory Data Analysis Typically the first step in analyzing data Makes heavy use of graphical techniques Also makes use of simple descriptive statistics Purpose Find outliers (and errors) Explore structure of the data 12

Definition of EDA Exploratory data analysis (EDA) is that part of statistical practice concerned with reviewing, communicating and using data where there is a low level of knowledge about its cause system.. Many EDA techniques have been adopted into data mining and are being taught to young students as a way to introduce them to statistical thinking. - www.wikipedia.org 13

Example Data Private passenger auto Some variables are: Age Gender Marital status Zip code Earned premium Number of claims Incurred losses Paid losses 14

Some Methods for Numeric Data Visual Histograms Box and Whisker Plots Stem and Leaf Plots Statistical Descriptive statistics Data spheres 15

Histograms Can do them in Microsoft Excel 16

Histograms Frequencies for Age Variable Bin Frequency 20 2853 25 3709 30 4372 35 4366 40 4097 45 3588 50 2707 55 1831 60 1140 65 615 70 397 75 271 80 148 85 83 90 32 95 12 More 5 17

Histograms of Age Variable Varying Window Size 80 60 40 20 0 16 33 51 68 85 Age 20 15 10 5 0 16 23 30 37 44 51 57 64 71 78 85 Age 40 30 20 10 0 16 24 31 39 47 54 62 70 77 85 Age 18

Formula for Window Width h = 3.5 1 3 σ N σ = standard deviation N=sample size h =window width 19

Example of Suspicious Value 25,000 20,000 Frequency 15,000 10,000 5,000 0 600 900 1200 1500 1800 License Year 20

Discrete-Numeric Data 40,000 30,000 Frequency 20,000 10,000 0 0 50,000 100,000 Paid Losses 21

Filtered Data Filter out Unwanted Records 1,200 1,000 800 Frequency 600 400 200 0 0 20,000 40,000 60,000 80,000 100,000 Paid Losses 22

Box Plot Basics: Five Point Summary Minimum 1 st quartile Median 2 nd quartile Maximum 23

Functions for five point summary =min(data range) =quartile(data range1) =median(data range) =quartile(data range,3) =max(data range) 24

Box and Whisker Plot 80 60 Normally Distributed Age 40 20 +1.5*midspread Outliers 0 median 75th Percentile -20 25th Percentile 25

Plot of Heavy Tailed Data Paid Losses 110000 90000 70000 Paid Loss 50000 30000 10000-10000 26

Heavy Tailed Data Log Scale 10 7 10 6 10 5 Paid Loss 10 4 10 3 10 2 10 1 27

28 Box and Whisker Example 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80 82 84 86 88 90 92 96 Age 0 200 400 600 800 1,000 Frequency

Descriptive Statistics Analysis ToolPak Statistic Policyholder Age Mean 36.9 Standard Error 0.1 Median 35.0 Mode 32.0 Standard Deviation 13.2 Sample Variance 174.4 Kurtosis 0.5 Skewness 0.7 Range 84 Minimum 16 Maximum 100 Sum 1114357 Count 30226 Largest(2) 100 Smallest(2) 16 29

Descriptive Statistics Claimant age has minimum and maximums that are impossible Std. N Minimum Maximum Mean Deviation License Year 30,250 490 2,049 1,990 16.3 Valid N 30,250 30

Data Spheres: The Mahalanobis Distance Statistic 1 MD = ( x μ)' Σ ( x μ) x μ Σ is a vector of variables is a vector of means is a variance-covariance matrix 31

1 MD = ( x μ)' Σ ( x μ) Screening Many Variables at Once Plot of Longitude and Latitude of zip codes in data Examination of outliers indicated drivers in Ca and PR even though policies only in one mid-atlantic state 32

Records With Unusual Values Flagged Policy ID Mahalanobis Depth Percentile of Mahalanobis Age License Year Number of Cars Number of Drivers Model Year Incurred Loss 22244 59 100 27 1997 3 6 1994 4,456 6159 60 100 22 2001 2 6 1993 0 22997 65 100 NA NA 2 1 1954 0 5412 61 100 17 2003 3 6 1994 0 30577 72 100 43 1979 3 1 1952 0 28319 8,490 100 30 490 1 1 1987 0 27815 55 100 44 1976-1 0 1959 0 16158 24 100 82 1938 1 1 1989 61,187 4908 25 100 56 1997 4 4 2003 35,697 28790 24 100 82 2039 1 1 1985 27,769 33

Categorical Data: Data Cubes 34

Categorical Data Data Cubes Usually frequency tables Search for missing values coded as blanks Gender Frequency Percent 5,054 14.3 F 13,032 36.9 M 17,198 48.7 Total 35,284 100 35

Categorical Data Table highlights inconsistent coding of marital status Marital Status Frequency Percent 5,053 14.3 1 2,043 5.8 2 9,657 27.4 4 2 0 D 4 0 M 2,971 8.4 S 15,554 44.1 Total 35,284 100 36

Missing Data 37

Screening for Missing Data N Percentiles BUSINESS TYPE Gender Age License Year Valid 35,284 35,284 30,242 30,250 Missing 0 0 5,042 5,034 25 27.00 1,986.00 50 35.00 1,996.00 75 45.00 2,000.00 38

Blanks as Missing Valid Frequency Percent Valid Percent Cumulative Percent 5,054 14.3 14.3 14.3 F 13,032 36.9 36.9 51.3 M 17,198 48.7 48.7 100.0 Total 35,284 100.0 100.0 39

Types of Missing Values Missing completely at random Missing at random Informative missing 40

Methods for Missing Values Drop record if any variable used in model is missing Drop variable Data Imputation Other CART, MARS use surrogate variables Expectation Maximization 41

Imputation A method to fill in missing value Use other variables (which have values) to predict value on missing variable Involves building a model for variable with missing value Y = f(x 1,x 2, x n ) 42

Example: Age Variable About 14% of records missing values Imputation will be illustrated with simple regression model Age = a+b 1 X 1 +b 2 X 2 b n X n 43

Model for Age Tests of Between-Subjects Effects Dependent Variable: Age Type III Sum of Squares df Mean Square F Sig. Source Corrected Model 3,218,216 24 134,092 1,971.2 0.000 Intercept 9,255 1 9,255 136.0 0.000 ClassCode 3,198,903 18 177,717 2,612.4 0.000 CoverageType 876 3 292 4.3 0.005 ModelYear 7,245 1 7,245 106.5 0.000 No of Vehicles 2,365 1 2,365 34.8 0.000 No of drivers 3,261 1 3,261 47.9 0.000 Error 2,055,243 30,212 68 Total 46,377,824 30,237 Corrected Total 5,273,459 30,236 44

Missing Values A problem for many traditional statistical models Elimination of records missing on anything from analysis Many data mining procedures have techniques built in for handling missing values If too many records missing on a given variable, probably need to discard variable 45

Metadata 46

Metadata Data about data A reference that can be used in future modeling projects Detailed description of the variables in the file, their meaning and permissible values Marital Status Value Description 1 Married, data from source 1 2 Single, data from source 1 4 Divorced, data from source 1 D Divorced, data from source 2 M Married, data from source 2 S Single, data from source 2 Blank Marital status is missing 47

Library for Getting Started Dasu and Johnson, Exploratory Data Mining and Data Cleaning, Wiley, 2003 Francis, L.A., Dancing with Dirty Data: Methods for Exploring and Claeaning Data, CAS Winter Forum, March 2005, www.casact.org Find a comprehensive book for doing analysis in Excel such as: John Walkebach, Excel 2003 Formulas or Jospeh Schmuller, Statistical Analysis With Excel for Dummies If you use R, get a book like: Fox, John, An R and S-PLUS Companion to Applied Regression, Sage Publications, 2002 48