Some Essential Statistics: The Lure of Statistics




Some Essential Statistics: The Lure of Statistics. Based on Data Mining Techniques, by M.J.A. Berry and G.S. Linoff, 2004

Statistics vs. Data Mining "Lies, damned lies, and statistics": mining data to support preconceived notions. Statistics and the scientific method: a discipline to help scientists make sense of observations and experiments. Statisticians historically had too little data; data miners have too much. Many of the techniques and algorithms used are shared by both statisticians and data miners.

Some Definitions Population: the collection (universe) of things under consideration Sample: a portion of the population selected for analysis Statistic: a summary measure computed to describe a characteristic of the sample

Inferences from a Sample Use parameters to summarize features of the population; use statistics to summarize features of the sample. Inference draws conclusions about the population from the sample, and is valid for the population.

Occam's Razor William of Occam, Franciscan monk, 1280-1349. Influential philosopher, theologian, and professor with a very simple idea. Latin: Entia non sunt multiplicanda sine necessitate ("entities must not be multiplied beyond necessity"). The simpler explanation is the preferable one ("Keep it simple, stupid!").

The Null Hypothesis The NH assumes that differences among observations are due simply to chance (a statement of no effect). Bush vs. Kerry poll, margin of error ~3-4%: Bush 46%, Kerry 47%, Other 2%, Not sure 4%. Layperson: are these percentages different? Statistician: what is the probability that these two values are really the same? Skepticism is good for both statisticians and data miners.
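The poll question above can be asked formally with a two-proportion z-test. A minimal sketch, assuming a hypothetical sample size of 1,000 respondents (the slides do not give one):

```python
import math

# Hypothetical poll: 1000 respondents (sample size is an assumption, not from the slides)
n = 1000
p_bush, p_kerry = 0.46, 0.47

# Two-proportion z-test under the null hypothesis that the true shares are equal
p_pool = (p_bush * n + p_kerry * n) / (2 * n)      # pooled proportion
se = math.sqrt(p_pool * (1 - p_pool) * (2 / n))    # standard error of the difference
z = (p_kerry - p_bush) / se

# Two-sided p-value from the standard normal CDF
p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
print(f"z = {z:.2f}, p-value = {p_value:.2f}")
```

With these numbers the difference is well inside the margin of error: the statistician has no grounds to call the two percentages different.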

P-Values and Q-Values If the null hypothesis is true, nothing is really happening; differences are due to chance. The p-value is the probability of drawing a sample at least as extreme as the one observed if the NH is true; it measures the strength of evidence the sample data provide against the NH, not the probability that the NH is true. p near 0: evidence against the NH; real differences are likely. p near 1: no differences detectable, given the sample size. p = 0.05 indicates a 5% chance of drawing such a sample if the NH is true. NOTE: we cannot prove that a hypothesis is true; we can only weigh the evidence for or against it. Confidence (q-value) is the reverse of a p-value: q = 1 - p.

Type I and Type II errors

                              TRUTH (unknown)
  DECISION                    H0 true           H0 false
  Do not reject H0            Correct           Type II error
  Reject H0                   Type I error      Correct

Significance level (alpha) = probability of rejecting the null hypothesis when it is true. Beta = probability of accepting the null hypothesis when it is false. Power = probability of rejecting the null hypothesis when it is false = 1 - beta.
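The meaning of alpha can be checked by simulation: draw both samples from the same population, so every rejection of H0 is a Type I error, and count how often that happens. A sketch with invented data (both groups standard normal, two-sample z-test at alpha = 0.05):

```python
import math
import random
import statistics

random.seed(0)
n, trials = 30, 2000

# Both samples come from the SAME normal population, so H0 is true by construction
# and every rejection is a false positive (Type I error).
false_positives = 0
for _ in range(trials):
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    se = math.sqrt(statistics.variance(a) / n + statistics.variance(b) / n)
    z = (statistics.mean(a) - statistics.mean(b)) / se
    if abs(z) > 1.96:          # reject H0 at the 5% significance level
        false_positives += 1

print(f"empirical Type I error rate: {false_positives / trials:.3f}")
```

The observed rejection rate lands near the nominal 5%, which is exactly what "significance level" promises.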

Looking at Data: Discrete Values Discrete data (products, channels, regions, descriptions) are common in data mining. Histogram bars show the number of times each value occurs.

Looking at Data: Time Series Histograms describe a single moment in time, but data mining is often concerned with what is happening over time. Time series analysis requires choosing an appropriate time frame in which to consider the data.

Standardized Values Time series charts have limitations: are the changes over time expected? Consider the data as a partition of all the data, with a little bit per day. Is it possible that the differences seen on each day are strictly due to chance? (null hypothesis) Analyze the sample variation. Central Limit Theorem: when many samples are taken from a population, the distribution of the sample averages follows the normal distribution, and the average of the samples comes arbitrarily close to the population average.

Standardized values Normal distribution: described by the mean (average count) and the standard deviation (clustering around the mean). Standardized values: z-value = (value - mean) / sd, giving mean = 0, sd = 1. If the null hypothesis is true, the z-values should follow the standard normal distribution. Standardizing is also useful for transforming variables to a similar range.
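Standardizing a daily series makes unusual days stand out. A minimal sketch with hypothetical daily counts (the data are invented for illustration):

```python
import statistics

daily_counts = [102, 98, 120, 95, 101, 99, 180, 97]  # hypothetical daily order counts

mu = statistics.mean(daily_counts)
sd = statistics.stdev(daily_counts)

# z-value = (value - mean) / sd; if nothing is happening, most |z| stay below ~2
z_values = [(x - mu) / sd for x in daily_counts]
for day, z in enumerate(z_values):
    flag = "  <-- unusual" if abs(z) > 2 else ""
    print(f"day {day}: z = {z:+.2f}{flag}")
```

The spike of 180 orders standardizes to a z-value above 2, so it is unlikely to be pure chance under the null hypothesis.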

Standardized values [Example charts: one series is roughly normal, with a large peak in December and a strong weekly trend; another is not normal, with many more -ve values than +ve.]

Looking at Data: Continuous Variables Mean (average): the sum of the values divided by the number of values. Median: the midpoint of the values (50% above, 50% below) after they have been ordered (ascending or descending). Mode: the most frequent value among all the values observed. Range: the difference between the largest and smallest observation in the sample.
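All four summaries above are one-liners in Python's standard library; a sketch on a small invented sample:

```python
import statistics

values = [3, 7, 7, 2, 9, 4, 7, 5]  # hypothetical sample

mean_v = statistics.mean(values)       # sum / count
median_v = statistics.median(values)   # midpoint of the sorted values
mode_v = statistics.mode(values)       # most frequent value
range_v = max(values) - min(values)    # largest minus smallest

print(mean_v, median_v, mode_v, range_v)
```

Note that with an even number of observations the median is the average of the two middle values, which is why it can fall between observed values.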

Different Shapes of Distributions

Data mining vs. Statistics Statisticians and data miners use similar techniques, but: data mining tends to ignore measurement error in raw data; data mining assumes a lot of data and processing power; data mining assumes time dependency is everywhere; it can be difficult to experiment in the business world; and data can be truncated and censored.

Censored data: examples Customer tenure: the value for active customers must be greater than their current tenure (we do not know when a customer will stop). Claim amount: not known for those who have not filed a claim. Sales and inventory: potential sales are greater than actual sales when out of inventory.

Regression Basics

Linear relationship [Scatter plot: revenue (0-1200) vs. tenure (0-300), showing a roughly linear relationship.]

Best fit model [Scatter plot with fitted line: y = 3.4032x - 19.221, R^2 = 0.8856.] Estimating the weights w: minimize the errors (least squares method), or ask for which model (w) the data are most likely (maximum likelihood estimation).

Regression model amount = 0.56 * tenure + 10.34, i.e., y = βx + c, output = f(inputs). The model gives the expected value when applied; β is the slope. How good is the fit? R^2: how much of the relation in the data is captured by the model. Is the model stable? With a different sample, would the same model be obtained? Residuals should be normal with mean 0 and sd σ.

Obtaining the linear regression model Consider y_i = w x_i + noise_i, with independent noise, normally distributed with mean 0 and std. dev. σ, so P(y | w, x) = Normal(mean w x, std. dev. σ). The maximum likelihood estimate of w is the w that maximizes p(y_1, y_2, ..., y_n | x_1, x_2, ..., x_n, w):

  maximize  prod_{i=1..n} p(y_i | w, x_i)
= maximize  prod_{i=1..n} (1 / (σ sqrt(2π))) exp( -(y_i - w x_i)^2 / (2σ^2) )
= maximize  -sum_{i=1..n} (y_i - w x_i)^2 / (2σ^2)
= minimize  sum_{i=1..n} (y_i - w x_i)^2

The maximum likelihood estimate minimizes the squared errors.

Minimize squared errors

  E = sum_i (y_i - w x_i)^2 = sum_i y_i^2 - 2 (sum_i x_i y_i) w + (sum_i x_i^2) w^2

The minimum E is obtained at w = (sum_i x_i y_i) / (sum_i x_i^2).
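The closed form above is easy to verify numerically. A sketch with invented data that roughly follow y = 2x plus noise:

```python
# Least-squares slope for the no-intercept model y = w*x,
# using the closed form w = sum(x*y) / sum(x^2) derived on the slide.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]   # hypothetical data, roughly y = 2x with noise

w = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
sse = sum((y - w * x) ** 2 for x, y in zip(xs, ys))
print(f"w = {w:.3f}, squared error = {sse:.3f}")
```

Nudging w in either direction increases the squared error, confirming that the closed form sits at the minimum of E.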

Residuals [Residual plot: residuals (-200 to 400) vs. x (0-300).] Residuals should distribute evenly around 0, should show no pattern with the x values, and should be normally distributed around 0.

Heterogeneous data? [The same scatter plot with a single fitted line: y = 3.4032x - 19.221, R^2 = 0.8856.]

Heterogeneous data [Scatter plot showing two distinct groups of points: Type A and Type B.]

Heterogeneous data [Separate fits per group: y = 3.515x + 10.859, R^2 = 0.9515; y = 1.322x + 32.507, R^2 = 0.8909.]

Using an indicator variable Indicator variable: Product = {0, 1}. Combined model: y = 2.89x + 136.8 * Product - 40.97, R^2 = 0.93. Individual models: y = 3.515x + 10.859 (Product = 0), R^2 = 0.9515; y = 1.322x + 32.507 (Product = 1), R^2 = 0.8909.
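An indicator column simply shifts the intercept for one group. A sketch with simulated data (the true coefficients 3.0, 120, -40 are invented, chosen to resemble the slide's example):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: two product types share a slope but have different intercepts
tenure = rng.uniform(0, 300, size=200)
product = (rng.random(200) > 0.5).astype(float)   # indicator: 0 or 1
revenue = 3.0 * tenure + 120.0 * product - 40.0 + rng.normal(0, 20, 200)

# Design matrix [tenure, product, 1]; the indicator coefficient is the intercept shift
X = np.column_stack([tenure, product, np.ones(200)])
coef, *_ = np.linalg.lstsq(X, revenue, rcond=None)
print("slope, product effect, intercept:", np.round(coef, 2))
```

Least squares recovers the three coefficients, so one combined model captures both groups without fitting two separate regressions.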

Multiple regression y = β_0 + β_1 x_1 + β_2 x_2 + ... + β_m x_m. The variables should be linearly independent of each other, and fewer variables work better. Variable selection: forward selection, stepwise refinement; evaluate the resulting family of models on a validation set.
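Forward selection with a validation set can be sketched in a few lines: greedily add whichever variable most reduces validation error, and stop when no candidate helps. The data here are simulated (two real predictors plus one pure-noise column):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
# Hypothetical predictors: x0 and x1 matter, x2 is pure noise
X = rng.normal(size=(n, 3))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(0, 0.5, n)

train, valid = slice(0, 200), slice(200, n)

def valid_sse(cols):
    """Fit on the training rows, score squared error on the validation rows."""
    A = np.column_stack([X[train][:, cols], np.ones(200)])
    coef, *_ = np.linalg.lstsq(A, y[train], rcond=None)
    Av = np.column_stack([X[valid][:, cols], np.ones(n - 200)])
    return np.sum((y[valid] - Av @ coef) ** 2)

# Forward selection: add the variable that most reduces validation error
selected, remaining = [], [0, 1, 2]
best = float("inf")
while remaining:
    scores = {c: valid_sse(selected + [c]) for c in remaining}
    c, s = min(scores.items(), key=lambda kv: kv[1])
    if s >= best:
        break          # no candidate improves the validation error; stop
    selected.append(c)
    remaining.remove(c)
    best = s
print("selected variables:", selected)
```

The two genuine predictors are picked up; the noise column only survives if it happens to improve the validation error, which is exactly the check the slide recommends.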

General linear models Interaction terms: y = β_0 + β_1 x_1 x_2 + β_2 x_2 + ... Polynomial form: y = β_0 + β_1 x_1 + β_2 x_1^2 + ... + β_k x_1^k.

Polynomial example [Scatter plot: y (0-1.8) vs. x (0-2.5), showing a curved relationship.]

Polynomial example [Linear fit: y = 0.2027x + 0.694, R^2 = 0.1515. Quadratic fit: y = 0.8974x^2 - 1.9486x + 1.4195, R^2 = 0.8971.]

Polynomial example: overfit? [Fits of increasing degree on the same data: quadratic y = 0.8974x^2 - 1.9486x + 1.4195, R^2 = 0.8971; cubic y = 0.1396x^3 + 0.3943x^2 - 1.467x + 1.3289, R^2 = 0.9053; quartic y = -0.3959x^4 + 2.0664x^3 - 2.5936x^2 + 0.0833x + 1.15, R^2 = 0.9201; degree 6 y = -1.78x^6 + 12.173x^5 - 31.392x^4 + 38.042x^3 - 21.237x^2 + 3.7417x + 0.9616, R^2 = 0.9644.]
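The pattern in the slide's R^2 values, always creeping upward as the degree grows, is a general property of nested models fit to the same training data. A sketch with simulated quadratic data (the coefficients are invented to resemble the slide's example):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 2.5, 20)
# Hypothetical quadratic truth plus noise, loosely mimicking the slide's data
y = 0.9 * x**2 - 1.9 * x + 1.4 + rng.normal(0, 0.1, 20)

def r_squared(deg):
    """Training-set R^2 for a polynomial fit of the given degree."""
    coefs = np.polyfit(x, y, deg)
    resid = y - np.polyval(coefs, x)
    return 1 - resid.var() / y.var()

for deg in (1, 2, 3, 6):
    print(f"degree {deg}: R^2 = {r_squared(deg):.4f}")
```

Training R^2 never decreases as terms are added, so by itself it cannot detect overfitting; that is why a validation set (or a penalty on model complexity) is needed.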

Binary dependent variable? [Scatter plot of a 0/1 outcome vs. x (0-300) with a linear fit: y = 0.0022x + 0.2212, R^2 = 0.1172. A straight line is a poor model for a binary outcome.]

Odds Ratio p: probability of an event occurring; 1 - p: probability of the event not occurring. Odds = p / (1 - p). Odds of winning of 1:3 mean 1 win per 3 losses, i.e., p = 0.25 and odds = 0.25 / (1 - 0.25) = 1/3. The log of the odds is symmetric around 0: -ve values for low probabilities, +ve values for high probabilities. [Plot: log-odds (-2.5 to 2.5) vs. p (0 to 1).]
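The symmetry of the log-odds is easy to see by tabulating a few probabilities:

```python
import math

# Odds and log-odds for a range of probabilities;
# complementary probabilities (0.1 vs 0.9, 0.25 vs 0.75) mirror around 0
for p in (0.1, 0.25, 0.5, 0.75, 0.9):
    odds = p / (1 - p)
    print(f"p = {p:.2f}  odds = {odds:.3f}  log-odds = {math.log(odds):+.3f}")
```

p = 0.5 gives log-odds of exactly 0, and swapping p for 1 - p flips only the sign, which is what makes the log-odds a natural symmetric scale for a binary outcome.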

Logistic regression Use the log of the odds ratio as the dependent variable:

  ln(y / (1 - y)) = β_0 + β_1 x
  y = e^(β_0 + β_1 x) / (1 + e^(β_0 + β_1 x)) = 1 / (1 + e^-(β_0 + β_1 x))

A non-linear model form; the coefficients are found by maximum likelihood estimation. Consider ln(y / (1 - y)) = -20.5 + 1.2x: the log-odds change by 1.2 for a unit change in x, so the odds of occurrence are multiplied by exp(1.2) for a unit change in x.
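The slide's fitted model can be used to check this interpretation directly: a unit step in x multiplies the odds, not the probability, by exp(β_1). A sketch (the input value x = 17 is arbitrary, chosen for illustration):

```python
import math

# The slide's fitted model: ln(y / (1 - y)) = -20.5 + 1.2 * x
b0, b1 = -20.5, 1.2

def prob(x):
    """Logistic response: p = 1 / (1 + exp(-(b0 + b1*x)))."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

x = 17.0  # hypothetical input value
odds_before = prob(x) / (1 - prob(x))
odds_after = prob(x + 1) / (1 - prob(x + 1))

# A unit increase in x multiplies the ODDS by exp(b1); the probability change depends on x
print(f"odds ratio = {odds_after / odds_before:.3f}, exp(b1) = {math.exp(b1):.3f}")
```

The ratio matches exp(1.2) ≈ 3.32 exactly, at any x, while the change in the probability itself varies with where on the curve x falls.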