Data Mining Techniques Chapter 5: The Lure of Statistics: Data Mining Using Familiar Tools



Similar documents
Some Essential Statistics The Lure of Statistics

Business Statistics. Successful completion of Introductory and/or Intermediate Algebra courses is recommended before taking Business Statistics.

STA-201-TE. 5. Measures of relationship: correlation (5%) Correlation coefficient; Pearson r; correlation and causation; proportion of common variance

Course Text. Required Computing Software. Course Description. Course Objectives. StraighterLine. Business Statistics

MTH 140 Statistics Videos

Normality Testing in Excel

Curriculum Map Statistics and Probability Honors (348) Saugus High School Saugus Public Schools

Statistics I for QBIC. Contents and Objectives. Chapters 1 7. Revised: August 2013

KSTAT MINI-MANUAL. Decision Sciences 434 Kellogg Graduate School of Management

Data Mining Techniques Chapter 6: Decision Trees

Data Analysis Tools. Tools for Summarizing Data

Bowerman, O'Connell, Aitken Schermer, & Adcock, Business Statistics in Practice, Canadian edition

Class 19: Two Way Tables, Conditional Distributions, Chi-Square (Text: Sections 2.5; 9.1)

t Tests in Excel The Excel Statistical Master By Mark Harmon Copyright 2011 Mark Harmon

Fairfield Public Schools

Descriptive Statistics

STATISTICAL ANALYSIS WITH EXCEL COURSE OUTLINE

Exercise 1.12 (Pg )

MBA 611 STATISTICS AND QUANTITATIVE METHODS

Statistics Review PSY379

Adverse Impact Ratio for Females (0/ 1) = 0 (5/ 17) = Adverse impact as defined by the 4/5ths rule was not found in the above data.

Directions for using SPSS

Projects Involving Statistics (& SPSS)

CHAPTER 7 INTRODUCTION TO SAMPLING DISTRIBUTIONS

1. What is the critical value for this 95% confidence interval? CV = z.025 = invnorm(0.025) = 1.96

How To Check For Differences In The One Way Anova

How To Run Statistical Tests in Excel

Regression Analysis: A Complete Example

An Introduction to Statistics using Microsoft Excel. Dan Remenyi George Onofrei Joe English

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

Bill Burton Albert Einstein College of Medicine April 28, 2014 EERS: Managing the Tension Between Rigor and Resources 1

CONTENTS OF DAY 2. II. Why Random Sampling is Important 9 A myth, an urban legend, and the real reason NOTES FOR SUMMER STATISTICS INSTITUTE COURSE

Chapter 7 Section 1 Homework Set A

Course Syllabus MATH 110 Introduction to Statistics 3 credits

Scatter Plots with Error Bars

An SPSS companion book. Basic Practice of Statistics

When to use Excel. When NOT to use Excel 9/24/2014

Simple linear regression

An analysis method for a quantitative outcome and two categorical explanatory variables.

Section Format Day Begin End Building Rm# Instructor. 001 Lecture Tue 6:45 PM 8:40 PM Silver 401 Ballerini

Factors affecting online sales

DATA INTERPRETATION AND STATISTICS

Northumberland Knowledge

Information Technology Services will be updating the mark sense test scoring hardware and software on Monday, May 18, We will continue to score

Foundation of Quantitative Data Analysis

Case Study in Data Analysis Does a drug prevent cardiomegaly in heart failure?

SPSS Tests for Versions 9 to 13

Quantitative Methods for Finance

One-Way Analysis of Variance (ANOVA) Example Problem

Introduction to Regression and Data Analysis

Data Analysis. Using Excel. Jeffrey L. Rummel. BBA Seminar. Data in Excel. Excel Calculations of Descriptive Statistics. Single Variable Graphs

PELLISSIPPI STATE COMMUNITY COLLEGE MASTER SYLLABUS INTRODUCTION TO STATISTICS MATH 2050

Introduction to. Hypothesis Testing CHAPTER LEARNING OBJECTIVES. 1 Identify the four steps of hypothesis testing.

SPSS Explore procedure

Analysis of categorical data: Course quiz instructions for SPSS

Chapter 23. Inferences for Regression

Recall this chart that showed how most of our course would be organized:

Statistical Functions in Excel

Tutorial 3: Graphics and Exploratory Data Analysis in R Jason Pienaar and Tom Miller

business statistics using Excel OXFORD UNIVERSITY PRESS Glyn Davis & Branko Pecar

1.5 Oneway Analysis of Variance

7. Comparing Means Using t-tests.

Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics

Chapter 5 Analysis of variance SPSS Analysis of variance

Simple Linear Regression Inference

Chapter 9. Two-Sample Tests. Effect Sizes and Power Paired t Test Calculation

430 Statistics and Financial Mathematics for Business

Understanding Confidence Intervals and Hypothesis Testing Using Excel Data Table Simulation

CHAPTER IV FINDINGS AND CONCURRENT DISCUSSIONS

How To Write A Data Analysis

ABSORBENCY OF PAPER TOWELS

Chapter 7: Simple linear regression Learning Objectives

Introduction to Quantitative Methods

Introduction to Statistics with GraphPad Prism (5.01) Version 1.1

SPSS TUTORIAL & EXERCISE BOOK

Module 4 (Effect of Alcohol on Worms): Data Analysis

4.1 Exploratory Analysis: Once the data is collected and entered, the first question is: "What do the data look like?"

The Chi-Square Test. STAT E-50 Introduction to Statistics

Chapter 13 Introduction to Linear Regression and Correlation Analysis

Drawing a histogram using Excel

2013 MBA Jump Start Program. Statistics Module Part 3

Using Excel for inferential statistics

Hypothesis testing - Steps

Problems With Using Microsoft Excel for Statistics

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

Independent t- Test (Comparing Two Means)

Introduction to Statistics with SPSS (15.0) Version 2.3 (public)

The Dummy s Guide to Data Analysis Using SPSS

Additional sources Compilation of sources:

1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number

Testing Group Differences using T-tests, ANOVA, and Nonparametric Measures

LAGUARDIA COMMUNITY COLLEGE CITY UNIVERSITY OF NEW YORK DEPARTMENT OF MATHEMATICS, ENGINEERING, AND COMPUTER SCIENCE

Chapter 23. Two Categorical Variables: The Chi-Square Test

Analyzing and interpreting data Evaluation resources from Wilder Research

Chapter 7. One-way ANOVA

UNIT 1: COLLECTING DATA

Descriptive statistics Statistical inference statistical inference, statistical induction and inferential statistics

Simple Regression Theory II 2010 Samuel L. Baker

Version 5.0. Statistics Guide. Harvey Motulsky President, GraphPad Software Inc GraphPad Software, inc. All rights reserved.

Transcription:

Data Mining Techniques Chapter 5: The Lure of Statistics: Data Mining Using Familiar Tools Occam s razor.......................................................... 2 A look at data I......................................................... 3 A look at data II........................................................ 4 Measuring response I..................................................... 5 Measuring response II..................................................... 6 Measuring response III.................................................... 7 Measuring response IV.................................................... 8 Multiple comparisons..................................................... 9 Chi-square test......................................................... 10 Data mining and statistics (differences)......................................... 11 1

Occam s razor The simpler explanation is (usually) the preferable one. Statistical context: null hypothesis assumes that differences among observations are due simply to chance. p-value: probability the null hypothesis is true or probability the observed variation could be explained by chance alone; technically incorrect definition, but OK for intuition. q-value: confidence (flip-side to p-value); not a widely used term in my experience. c Iain Pardoe, 2006 2 / 11 A look at data I Bar charts (qualitative or categorical data) and histograms (quantitative data); note that the histogram on p127 is usually just called a bar chart; histograms generally graph quantitative data into intervals (bins). Time series: graph using line charts with time on horizontal axis. Standardized values: Z = (Y mean) / std. dev. e.g., if mean height is 5 9 and std. dev. is 3 how unusual is someone who is 6? What about someone who is 6 6? example in book on daily service stops; central limit theorem: sample mean is approximately normal; example in book on stops if use weekly averages. c Iain Pardoe, 2006 3 / 11 A look at data II Distributions: smoothed population histograms (e.g., normal bell-curve). From standardized values to probabilities: normal (or t) tables; one-tail or two-tail. Cross-tabulations for categorical data: counts, proportions (e.g., Table 5.1, p136). Sample statistics: range, mean, median, (mode), variance, standard deviation. Correlation and regression. c Iain Pardoe, 2006 4 / 11 2

Measuring response I Is current (champion) or challenger offer better? Challenger offer sent to 100k prospects: 5% response. 95% confidence interval for population response rate: sample mean ±1.96 std. error: sample mean = 0.05, std. error = (0.05 0.95)/100000 = 0.000689; confidence interval is (0.0486, 0.0514). If 5.1% response for champion offer sent to 900k: sample mean = 0.051, std. error = (0.051 0.949)/900000 = 0.000232; 95% confidence interval is (0.0505, 0.0515); overlaps, so challenger offer no different. If 4.8% response for champion offer sent to 900k: sample mean = 0.048, std. error = (0.048 0.952)/900000 = 0.000225; 95% confidence interval is (0.0476, 0.0484); no overlap, so challenger offer better. c Iain Pardoe, 2006 5 / 11 Measuring response II Alternative: rather than seeing if two intervals overlap, calculate interval for difference of proportions and see if it includes zero. Std. error formula on p143 incorrect; should be: p 1 (1 p 1 ) + p 2(1 p 2 ). N 1 N 2 If 5.1% response for champion offer sent to 900k: difference = 0.001, std. error = 0.000727 = (0.051(0.949))/900000 + (0.05(0.95))/100000; 95% confidence interval is ( 0.0024, 0.0004); includes zero, so challenger offer no different. If 4.8% response for champion offer sent to 900k: difference = 0.002, std. error = 0.000725 = (0.048(0.952))/900000 + (0.05(0.95))/100000; 95% confidence interval is (0.0006, 0.0034); excludes zero, so challenger offer better. c Iain Pardoe, 2006 6 / 11 3

Measuring response III Use larger samples to have more confidence in results. If 4.8% response for champion offer sent to 50k: difference = 0.002, std. error = 0.00118 = (0.048(0.952))/50000 + (0.05(0.95))/100000; 95% confidence interval is ( 0.0003, 0.0043); includes zero, so challenger offer no different; significant difference not detected because sample size too small. What the confidence interval really means: ensure random samples, beware bias. c Iain Pardoe, 2006 7 / 11 Measuring response IV Size of test and control for an experiment (power calculations). Example: how large should N be in a difference of proportions test to detect a 0.002 response rate difference? assume equal sample size in control and test; Std. error formula on p147 incorrect; should be: 0.002 set p(1 p) = + 1.96 N Final answer is incorrect too; it should be: (p + d)(1 p d). N N = 0.096796 0.00102 2 = 92963. c Iain Pardoe, 2006 8 / 11 Multiple comparisons Formulas up to now are based on only one comparison. Bonferroni correction divides sig. level by number of comparisons being made: e.g., 2 tests each with sig. level 5% has an overall sig. level of approximately 10%; practical implication: if doing k comparisons, do each test with significance level 5/k to ensure 5% overall. c Iain Pardoe, 2006 9 / 11 4

Chi-square test Test for significant differences between row and column categories in a cross-tabulation. actual response expected response difference yes no total yes no yes no Example: champion 43,200 856,800 900,000 43,380 856,620 180 180 challenger 5,000 95,000 100,000 4,820 95,180 180 180 total 48,200 951,800 1,000,000 48,200 951,800 Test statistic (note no square root as on p151): χ 2 = 1802 43380 + 1802 856620 + 1802 4820 + 1802 95180 = 7.85. Significant difference since p-value = 0.005: Excel: =CHIDIST(7.85,1); degrees of freedom = (r 1)(c 1). Another example: regions/starts from table 5.1. c Iain Pardoe, 2006 10 / 11 Data mining and statistics (differences) No measurement error in basic (DM) data (arguable). There is a lot of data: more is better than less (in general); computer power; oversampling. Time dependency pops up everywhere. Experimentation is hard. Data is (sometimes) censored and truncated. c Iain Pardoe, 2006 11 / 11 5