Some Essential Statistics The Lure of Statistics

1 Some Essential Statistics: The Lure of Statistics
Data Mining Techniques, by M.J.A. Berry and G.S. Linoff, 2004

2 Statistics vs. Data Mining
- "Lies, damned lies, and statistics": mining data to support preconceived notions
- Statistics and the scientific method: a discipline to help scientists make sense of observations and experiments
- Statisticians often have too little data; data miners often have too much
- Many of the techniques and algorithms are shared by statisticians and data miners

3 Some Definitions
- Population: the collection (universe) of things under consideration
- Sample: a portion of the population selected for analysis
- Statistic: a summary measure computed to describe a characteristic of the sample

4 Inferences from a Sample
- Population: use parameters to summarize features
- Sample: use statistics to summarize features
- Inference: draw conclusions about the population from the sample, in a way that is valid for the population
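
The distinction between a parameter and a statistic can be sketched in a few lines of Python. The population below (the integers 1 to 10000), the sample size, and the seed are all assumptions made for illustration; none of them come from the slides.

```python
import random
import statistics

# Hypothetical population: the integers 1..10000 (an assumption for illustration).
population = list(range(1, 10001))
population_mean = statistics.mean(population)   # a parameter of the population

rng = random.Random(0)                          # fixed seed so the run is repeatable
sample = rng.sample(population, 1000)           # a random sample, drawn without replacement
sample_mean = statistics.mean(sample)           # a statistic computed from the sample

print(f"population mean (parameter): {population_mean}")
print(f"sample mean (statistic):     {sample_mean}")
```

The sample mean is used as an estimate of the population mean; how far apart the two can plausibly be is what the later slides on the null hypothesis and p-values address.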

5 Occam's Razor
- William of Occam: Franciscan monk, influential philosopher, theologian, and professor with a very simple idea
- Latin: "Entia non sunt multiplicanda sine necessitate" (entities should not be multiplied without necessity)
- The simpler explanation is the preferable one ("Keep it simple, stupid!")

6 The Null Hypothesis
- The null hypothesis (NH) assumes that differences among observations are due simply to chance (a statement of no effect)
- Example: a Bush vs. Kerry poll with a margin of error of about 3% to 4%
- Poll results: Bush 46%, Kerry 47%, Other 2%, Not sure 4%
- Layperson: are these percentages different?
- Statistician: what is the probability that these two values are really the same?
- Skepticism is good for both statisticians and data miners
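
A rough sketch of the statistician's question for the poll above. The slide gives no sample size, so `n = 1000` is an assumption (it is roughly what a 3% margin of error implies); the normal approximation is also an assumption of this sketch, not something the slides prescribe.

```python
import math
from statistics import NormalDist

n = 1000                 # ASSUMED number of respondents; the slide gives no sample size
bush, kerry = 0.46, 0.47

# 95% margin of error for a single share near 50%.
margin = 1.96 * math.sqrt(0.5 * 0.5 / n)

# Standard error of the gap between two shares taken from the same poll.
# The two shares are negatively correlated, hence the (bush + kerry) term.
diff = kerry - bush
se_diff = math.sqrt((bush + kerry - diff ** 2) / n)
z = diff / se_diff

# Two-sided p-value: chance of a gap at least this large if the true shares are equal.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"margin of error: {margin:.3f}")
print(f"z = {z:.2f}, p-value = {p_value:.2f}")
```

Under these assumptions the one-point gap is well inside the margin of error, so the poll gives little reason to reject the null hypothesis that the two candidates are tied.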

7 P-Values and Q-Values
- "The null hypothesis is true" means nothing is really happening; differences are due to chance
- The p-value is the probability of seeing a difference at least as large as the one observed if the null hypothesis is true. It measures how compatible the sample data are with NH; it is not the probability that NH is true
- p close to 0: strong evidence against NH; the differences are likely real
- p close to 1: no differences detectable, given the sample size
- p = 0.05 indicates a 5% chance of drawing a sample at least this extreme if NH is true
- NOTE: we cannot prove that a hypothesis is true; we can only weigh the evidence for or against it
- Confidence (q-value): the reverse of the p-value, 1 - p
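
To make the definition concrete, here is a minimal sketch that computes an exact two-sided p-value for a coin-flip experiment. The coin example and the function name `two_sided_p_value` are illustrative assumptions, not part of the slides.

```python
from math import comb

def two_sided_p_value(heads: int, flips: int) -> float:
    """Exact p-value under the null hypothesis that the coin is fair.

    Adds up the probability of every outcome that is at least as far
    from flips / 2 as the observed number of heads.
    """
    observed_gap = abs(heads - flips / 2)
    favourable = 0
    for k in range(flips + 1):
        if abs(k - flips / 2) >= observed_gap:
            favourable += comb(flips, k)
    return favourable / 2 ** flips

p = two_sided_p_value(60, 100)   # 60 heads in 100 flips
confidence = 1 - p               # the "q-value" in the slide's terminology

print(f"p-value = {p:.4f}, confidence = {confidence:.4f}")
```

For 60 heads in 100 flips the p-value comes out a little above 0.05, so a result that looks lopsided to a layperson is only borderline evidence against a fair coin.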

8 Type I and Type II errors

| Decision | H0 true (truth, unknown) | H0 false (truth, unknown) |
|---|---|---|
| Do not reject H0 | Correct | Type II error |
| Reject H0 | Type I error | Correct |

- Significance level (alpha): probability of rejecting the null hypothesis when it is true
- Beta: probability of failing to reject the null hypothesis when it is false
- Power: probability of rejecting the null hypothesis when it is false; power = 1 - beta
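
The relation between alpha, beta, and power can be illustrated with a one-sided z-test. This is a sketch under stated assumptions: a known standard deviation, a null hypothesis of mean 0, and a hypothetical helper `power_one_sided_z`; the slides do not specify any particular test.

```python
from statistics import NormalDist

def power_one_sided_z(effect: float, sd: float, n: int, alpha: float = 0.05) -> float:
    """Power of a one-sided z-test of H0: mean = 0 when the true mean is `effect`."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)   # reject H0 when the z statistic exceeds this
    shift = effect * n ** 0.5 / sd           # mean of the z statistic when H0 is false
    return 1 - std_normal.cdf(z_crit - shift)

power = power_one_sided_z(effect=0.5, sd=1.0, n=25)
beta = 1 - power                             # probability of a Type II error

print(f"power = {power:.3f}, beta = {beta:.3f}")
```

When `effect` is 0 the null hypothesis is true and the function returns alpha itself, which is the Type I error rate.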

9 Looking at Data: Discrete Values
- Discrete data (products, channels, regions, descriptions) are common in data mining
- Histogram bars show the number of times each value occurs

10 Looking at Data: Time series
- Histograms describe a single moment in time
- Data mining is often concerned with what is happening over time
- Time series analysis: choose an appropriate time frame in which to consider the data

11 Standardized Values
- Time series chart limitations: are the changes over time expected?
- Consider each day's data as one small part of a partition of all the data
- Is it possible that the differences seen from day to day are strictly due to chance? (null hypothesis)
- Analyze sample variation
- Central Limit Theorem: when many samples are taken from a population, the distribution of the sample averages approaches the normal distribution as the sample size grows
- The average of the sample averages comes arbitrarily close to the population average
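
The Central Limit Theorem can be checked with a small simulation. The exponential population, the sample size of 50, and the 2000 repetitions are assumptions chosen for illustration.

```python
import random
import statistics

rng = random.Random(42)      # fixed seed so the run is repeatable
sample_size = 50
num_samples = 2000

# Skewed population: exponential with mean 1 and standard deviation 1.
sample_means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(sample_size)])
    for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
expected_sd = 1.0 / sample_size ** 0.5   # population sd / sqrt(sample size)

print(f"mean of sample means: {mean_of_means:.3f} (population mean is 1)")
print(f"sd of sample means:   {sd_of_means:.3f} (theory: {expected_sd:.3f})")
```

Even though the population is far from normal, the sample averages cluster around the population mean with a spread close to the population standard deviation divided by the square root of the sample size.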

12 Standardized values
- Normal distribution: described by the mean (average count) and the standard deviation (how tightly values cluster around the mean)
- Standardized value: z-value = (value - mean) / sd, so that z-values have mean = 0 and sd = 1
- If the null hypothesis is true, z-values should follow the standard normal distribution
- Also useful for transforming variables to a similar range
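
A minimal sketch of the z-value transformation. The `daily_counts` data and the `standardize` helper are hypothetical, and the population standard deviation (`pstdev`) is an assumed choice; the slides do not say which form of the standard deviation is meant.

```python
import statistics

def standardize(values):
    """Return z-values: (value - mean) / sd for each value."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)   # population standard deviation (assumed choice)
    return [(v - mean) / sd for v in values]

daily_counts = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data
z_values = standardize(daily_counts)

print(z_values)
```

After the transformation the values have mean 0 and standard deviation 1, which is what makes variables measured on different scales comparable.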

13 Standardized values
[Figure annotations: Roughly normal. Large peak in Dec. Strong weekly trend. Not normal (many more -ve values than +ve).]

14 Looking at data: continuous variables
- Mean (average): the sum of the values divided by the number of values
- Median: the midpoint of the values (50% above; 50% below) after they have been ordered (ascending or descending order)
- Mode: the most frequent value among all the values observed
- Range: the difference between the smallest and largest observation in the sample
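
These four summaries map directly onto Python's standard library. The `values` list is hypothetical data for illustration.

```python
import statistics

values = [3, 1, 4, 1, 5, 9, 2, 6]            # hypothetical observations

mean = statistics.mean(values)               # sum of values / number of values
median = statistics.median(values)           # midpoint of the ordered values
mode = statistics.mode(values)               # most frequent value
value_range = max(values) - min(values)      # largest minus smallest

print(mean, median, mode, value_range)
```

With an even number of observations, `statistics.median` averages the two middle values, which is why the median here is not one of the observed values.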

15 Different Shapes of Distributions

16 Data mining vs. Statistics
Statisticians and data miners use similar techniques, but:
- Data mining tends to ignore measurement error in raw data
- Data mining assumes a lot of data and processing power
- Data mining assumes dependency on time everywhere
- It can be difficult to experiment in the business world
- Data can be truncated and censored

17 Censored data: examples
- Customer tenure: the final tenure of an active customer must be greater than the current tenure (we do not know when the customer will stop)
- Claim amount: not known for those who have not filed a claim
- Sales and inventory: potential sales are greater than actual sales when an item is out of inventory
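
A small sketch of why censoring matters for the customer tenure example. The records below are hypothetical; the point is only that neither naive average is the true average tenure.

```python
# (tenure in months, still_active): hypothetical customer records
customers = [(3, False), (5, False), (8, False), (12, True), (20, True), (30, True)]

stopped = [tenure for tenure, active in customers if not active]
all_recorded = [tenure for tenure, _ in customers]

# Ignores active customers entirely, so long-lived customers are left out.
mean_stopped_only = sum(stopped) / len(stopped)

# Treats current tenure as final tenure, so it is only a lower bound.
lower_bound_all = sum(all_recorded) / len(all_recorded)

print(f"stopped customers only: {mean_stopped_only:.2f} months")
print(f"lower bound using all:  {lower_bound_all:.2f} months")
```

Both numbers understate the true average tenure, because the active customers will keep accumulating months after the data was extracted.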

18 Regression Basics

19 Linear relationship
[Figure: scatter plot of revenue against tenure]

20 Best fit model
[Figure: best-fit line through the scatter plot; the fitted equation and $R^2$ value are not recoverable]
Estimating the weights w:
- Minimize errors: least squares method
- For which model (w) is the data most likely? Maximum likelihood estimation

21 Regression model
- Example: amount = 0.56 * tenure
- General form: $y = \beta x + c$, that is, output = f(inputs)
- The model gives the expected value when applied
- Slope: $\beta$
- How good is the fit? $R^2$: how much of the relation in the data is captured by the model
- Is the model stable? With a different sample, would the same model be obtained?
- Residuals should be normal with mean 0 and sd $\sigma$
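
A minimal sketch of fitting $y = \beta x + c$ by least squares and computing $R^2$. The `tenure` and `amount` values are hypothetical and are not the data behind the slide's fitted model; `fit_line` is an illustrative helper.

```python
def fit_line(xs, ys):
    """Least-squares slope, intercept, R^2 and residuals for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    sse = sum(r ** 2 for r in residuals)           # variation left unexplained
    sst = sum((y - mean_y) ** 2 for y in ys)       # total variation in y
    return slope, intercept, 1 - sse / sst, residuals

tenure = [1, 2, 3, 4, 5]      # hypothetical inputs
amount = [2, 4, 5, 4, 5]      # hypothetical outputs
slope, intercept, r_squared, residuals = fit_line(tenure, amount)

print(f"amount = {slope:.2f} * tenure + {intercept:.2f}, R^2 = {r_squared:.2f}")
```

For a least-squares line with an intercept, the residuals always average to 0; whether they also look normal with a constant spread has to be checked separately.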

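22 Obtaining the linear regression model
- Consider $y_i = w x_i + \text{noise}_i$
- Independent noise, normally distributed with mean 0 and std. dev. $\sigma$
- $P(y \mid w, x) = \text{Normal}(\text{mean } wx, \text{std dev } \sigma)$
- Maximum likelihood estimate of $w$: find the $w$ that maximizes $p(y_1, y_2, \dots, y_n \mid x_1, x_2, \dots, x_n, w)$

$$\text{maximize } \prod_{i=1}^{n} p(y_i \mid w, x_i)$$

$$\text{maximize } \prod_{i=1}^{n} \exp\left(-\frac{1}{2}\left(\frac{y_i - w x_i}{\sigma}\right)^2\right)$$

$$\text{maximize } \sum_{i=1}^{n} -\frac{1}{2}\left(\frac{y_i - w x_i}{\sigma}\right)^2$$

$$\text{minimize } \sum_{i=1}^{n} (y_i - w x_i)^2$$

- The maximum likelihood estimate minimizes the squared errors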

23 Minimize squared errors

$$E = \sum_i (y_i - w x_i)^2 = \sum_i y_i^2 - \left(2 \sum_i x_i y_i\right) w + \left(\sum_i x_i^2\right) w^2$$

Minimum E is obtained with

$$w = \frac{\sum_i x_i y_i}{\sum_i x_i^2}$$
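
The closed-form slope can be checked numerically against both the squared error and the normal log-likelihood. The data and the choice of `sigma = 1.0` are assumptions for illustration; the function names are hypothetical.

```python
import math

def fit_slope(xs, ys):
    """Closed-form least-squares slope for the through-origin model y = w * x."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def squared_error(w, xs, ys):
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys))

def log_likelihood(w, xs, ys, sigma=1.0):
    """Log-likelihood of the data under y_i = w * x_i + Normal(0, sigma) noise."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma ** 2) - 0.5 * ((y - w * x) / sigma) ** 2
        for x, y in zip(xs, ys)
    )

xs = [1, 2, 3, 4]                # hypothetical inputs
ys = [2.1, 3.9, 6.2, 7.8]        # hypothetical noisy outputs
w = fit_slope(xs, ys)

print(f"w = {w:.3f}")
for candidate in (w - 0.05, w, w + 0.05):
    print(f"w = {candidate:.3f}: E = {squared_error(candidate, xs, ys):.4f}, "
          f"log-likelihood = {log_likelihood(candidate, xs, ys):.4f}")
```

Moving `w` in either direction raises the squared error and lowers the log-likelihood, which matches the claim that the maximum likelihood estimate and the least-squares estimate coincide under normal noise.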

24 Residuals
Residuals should:
- distribute evenly around 0
- show no pattern with the x values
- be normally distributed around 0

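25 Heterogeneous data?
[Figure: a single regression line fitted to all the data; the fitted equation and $R^2$ value are not recoverable]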

26 Heterogeneous data
[Figure: data points labeled Type A and Type B]

27 Heterogeneous data
[Figure: separate fitted lines, $y = 3.515x$ and $y = 1.322x$; the $R^2$ values are not recoverable]

28 Using an Indicator variable
- Indicator variable: Product = {0, 1}
- Combined model: y = 2.89 x * Product, $R^2$ = 0.93
- Individual models: y = 3.515x (product 0) and y = 1.322x (product 1); the $R^2$ values are not recoverable
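
The idea of an indicator variable can be sketched with a deliberately simplified through-origin model. This is not the slide's fitted model: the data are hypothetical and noise-free, with slopes 3.5 and 1.3 chosen only to resemble the two product lines, and the combined form `y = w0 * x + (w1 - w0) * x * product` is an assumed parameterization.

```python
def fit_slope(xs, ys):
    """Least-squares slope for the through-origin model y = w * x."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def sum_squared_error(predict, rows):
    return sum((y - predict(x, product)) ** 2 for x, product, y in rows)

# (x, product, y): hypothetical noise-free data with slope 3.5 for product 0, 1.3 for product 1
rows = [(1, 0, 3.5), (2, 0, 7.0), (3, 0, 10.5),
        (1, 1, 1.3), (2, 1, 2.6), (3, 1, 3.9)]

# One slope for everything: ignores the product type.
pooled = fit_slope([x for x, _, _ in rows], [y for _, _, y in rows])

# One slope per product, combined into a single model through the indicator.
w0 = fit_slope([x for x, p, _ in rows if p == 0], [y for _, p, y in rows if p == 0])
w1 = fit_slope([x for x, p, _ in rows if p == 1], [y for _, p, y in rows if p == 1])

def pooled_model(x, product):
    return pooled * x

def indicator_model(x, product):
    return w0 * x + (w1 - w0) * x * product   # product = 0 or 1 switches the slope

print(f"pooled slope {pooled:.2f}, SSE = {sum_squared_error(pooled_model, rows):.2f}")
print(f"indicator model slopes {w0:.2f} / {w1:.2f}, SSE = {sum_squared_error(indicator_model, rows):.2f}")
```

The pooled line sits between the two groups and fits neither, while the indicator lets one equation reproduce both individual models.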

29 Multiple regression
- $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_m x_m$
- Variables should be linearly independent of each other
- Fewer variables work better
- Forward selection
- Stepwise refinement
- Using a validation set: evaluate a family of models on a validation set

30 General linear models
- Interaction terms: $y = \beta_0 + \beta_1 x_1 x_2 + \beta_2 x_2 + \dots$
- Polynomial form: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \dots + \beta_k x_1^k$

31 Polynomial example

32 Polynomial example
[Figure: linear and quadratic fits to the same data; the coefficients and $R^2$ values are not recoverable]

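33 Polynomial example: overfit?
[Figure: polynomial fits of degree 2, 3, 4 and 6; the coefficients and $R^2$ values are not recoverable, apart from a leading coefficient of -1.78 in the highest-degree fit]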

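34 Binary dependent variable?
[Figure: straight line fitted to a binary dependent variable; the fitted equation and $R^2$ value are not recoverable]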

35 Odds Ratio
- p: probability of an event occurring
- 1 - p: probability of the event not occurring
- Odds ratio = p / (1 - p)
- Odds of winning 1:3 means 1 win for every 3 losses, so p = 0.25 and odds ratio = 0.25 / (1 - 0.25)
- Log of odds is symmetric around 0: negative values for low probabilities, positive values for high probabilities
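
A short sketch of the quantities above. The slide calls p / (1 - p) the odds ratio; it is more commonly called simply the odds, and the function names below follow that usage as an assumption.

```python
import math

def odds(p: float) -> float:
    """Odds of an event with probability p: p / (1 - p)."""
    return p / (1 - p)

def log_odds(p: float) -> float:
    return math.log(odds(p))

for p in (0.1, 0.25, 0.5, 0.75, 0.9):
    print(f"p = {p:.2f}  odds = {odds(p):.3f}  log-odds = {log_odds(p):+.3f}")
```

The printed log-odds for p and 1 - p differ only in sign, and p = 0.5 gives exactly 0, which is the symmetry the slide refers to.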

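36 Logistic regression
- Log of the odds ratio as the dependent variable:

$$\ln\left(\frac{y}{1-y}\right) = \beta_0 + \beta_1 x$$

$$y = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$$

- Non-linear model form
- Maximum likelihood estimates for the coefficients
- Consider a model $\ln(y/(1-y)) = \beta_0 + 1.2x$: the log-odds change by 1.2 for a unit change in x
- The odds $y/(1-y)$ of occurrence, not the probability y itself, are multiplied by exp(1.2) for a unit change in x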
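
The effect of a unit change in x can be verified numerically. The slope 1.2 comes from the slide; the intercept `b0 = -3.0` is a hypothetical value chosen for illustration, since the slide's intercept is not available.

```python
import math

def logistic(x: float, b0: float, b1: float) -> float:
    """Probability y given by ln(y / (1 - y)) = b0 + b1 * x."""
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

b0, b1 = -3.0, 1.2        # b0 is a hypothetical intercept; 1.2 is the slide's slope

p1 = logistic(1.0, b0, b1)
p2 = logistic(2.0, b0, b1)                      # one unit further along x
odds_multiplier = (p2 / (1 - p2)) / (p1 / (1 - p1))

print(f"p(x=1) = {p1:.3f}, p(x=2) = {p2:.3f}")
print(f"odds multiplier = {odds_multiplier:.3f}, exp(1.2) = {math.exp(b1):.3f}")
```

The odds are multiplied by exp(1.2) regardless of where on the x axis the step is taken, while the change in the probability itself depends on the starting point.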
