Some Essential Statistics: The Lure of Statistics
Data Mining Techniques, by M.J.A. Berry and G.S. Linoff, 2004
Statistics vs. Data Mining
"Lies, damned lies, and statistics": mining data to support preconceived notions.
Statistics and the scientific method: a discipline to help scientists make sense of observations and experiments.
Statisticians historically had too little data; data miners have too much.
Many of the techniques and algorithms used are shared by both statisticians and data miners.
Some Definitions
Population: the collection (universe) of things under consideration.
Sample: a portion of the population selected for analysis.
Statistic: a summary measure computed to describe a characteristic of the sample.
Inferences from a Sample
Use statistics to summarize features of the sample; use parameters to summarize features of the population.
Inference draws conclusions about the population from the sample, so the sample must be valid for (representative of) the population.
Occam's Razor
William of Occam, Franciscan monk, 1280-1349: influential philosopher, theologian, and professor with a very simple idea.
Latin: Entia non sunt multiplicanda sine necessitate.
The simpler explanation is the preferable one ("Keep it simple, stupid!").
The Null Hypothesis
The null hypothesis (NH) assumes that differences among observations are due simply to chance (a statement of no effect).
Bush vs. Kerry poll, margin of error ~3-4%:
Bush 46%, Kerry 47%, Other 2%, Not sure 4%
Layperson: are these percentages different?
Statistician: what is the probability that these two values are really the same?
Skepticism is good for both statisticians and data miners.
P-Values and Q-Values
If the null hypothesis is true, nothing is really happening; differences are due to chance.
The p-value is the probability of drawing a sample as extreme as the one observed if the null hypothesis is true; it measures the strength of the evidence the sample data provide against the NH.
p near 0: the NH is likely false, and the differences are real.
p near 1: no differences are detectable, given the sample size.
p = 0.05 indicates a 5% chance of drawing such a sample if the NH is true.
NOTE: we cannot prove that a hypothesis is true; we can only weigh the evidence for or against it.
Confidence (the q-value) is the reverse of a p-value (1 minus the p-value).
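As a minimal sketch of the poll example above, the p-value for "Bush 46% vs. Kerry 47% are really the same" can be approximated with a two-proportion z-test. The sample size n = 1000 is an assumption for illustration; it is not stated on the slide.

```python
import math

# Hypothetical poll of n = 1000 respondents (n is an assumption, not from the slide):
# is the 46% vs. 47% gap more than chance variation?
n = 1000
p1, p2 = 0.46, 0.47

# Pooled proportion under the null hypothesis "the two shares are the same".
p_pool = (p1 + p2) / 2
se = math.sqrt(2 * p_pool * (1 - p_pool) / n)  # standard error of the difference
z = (p2 - p1) / se

# Two-sided p-value from the standard normal CDF (via math.erf).
p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
print(round(z, 2), round(p_value, 2))
```

The p-value comes out far above 0.05, matching the slide's point that a 1% gap is well inside a 3-4% margin of error.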
Type I and Type II Errors
                      TRUTH (unknown)
DECISION              H0 true         H0 false
Do not reject H0      Correct         Type II error
Reject H0             Type I error    Correct
Significance level (alpha) = probability of rejecting the null hypothesis when it is true.
Beta = probability of accepting the null hypothesis when it is false.
Power = probability of rejecting the null hypothesis when it is false = 1 - beta.
Looking at Data: Discrete Values
Discrete data (products, channels, regions, descriptions) are common in data mining.
Histogram bars show the number of times different values occur.
Looking at Data: Time Series
Histograms describe a single moment in time, but data mining is often concerned with what is happening over time.
Time series analysis requires choosing an appropriate time frame in which to consider the data.
Standardized Values
Time series charts have limitations: are the changes over time expected?
Consider the data as a partition of all the data, with a little bit per day.
Is it possible that the differences seen on each day are strictly due to chance? (the null hypothesis)
Analyze the sample variation.
Central Limit Theorem: when many samples are taken from a population, the distribution of the sample averages follows the normal distribution, and the average of the samples comes arbitrarily close to the population average.
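The Central Limit Theorem can be sketched with a short simulation. The population here (uniform on [0, 1]) and the sample sizes are illustrative choices, not from the slides; the point is that averages of repeated samples from a non-normal population still cluster tightly around the population mean.

```python
import random
import statistics

# Minimal CLT sketch: the population is Uniform(0, 1), which is not normal,
# yet the averages of many samples of size 50 behave like a normal variable
# centered on the population mean of 0.5.
random.seed(0)
population_mean = 0.5

sample_means = [
    statistics.mean(random.random() for _ in range(50))  # one sample of 50
    for _ in range(2000)                                 # 2000 such samples
]

grand_mean = statistics.mean(sample_means)
spread = statistics.stdev(sample_means)
# The spread of the sample averages shrinks like sd / sqrt(sample size).
print(round(grand_mean, 3), round(spread, 3))
```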
Standardized Values
Normal distribution: described by the mean (average count) and the standard deviation (clustering around the mean).
Standardized values: z-value = (value - mean) / sd, giving mean 0 and sd 1.
If the null hypothesis is true, the z-values should follow the standard normal distribution.
Standardization is also useful for transforming variables to a similar range.
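The z-value formula above can be applied directly; the daily counts below are hypothetical values for illustration.

```python
import statistics

# Standardizing daily counts (hypothetical data): z = (value - mean) / sd.
counts = [102, 98, 115, 90, 105, 88, 120]
mean = statistics.mean(counts)
sd = statistics.stdev(counts)

z_values = [(c - mean) / sd for c in counts]

# After standardization the z-values have mean 0 and sd 1, so variables
# measured on different scales become directly comparable.
print([round(z, 2) for z in z_values])
```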
[Charts: the daily z-values are roughly normal, with a large peak in December and a strong weekly trend; the resulting distribution is not normal (many more -ve values than +ve).]
Looking at Data: Continuous Variables
Mean (average): the sum of the values divided by the number of values.
Median: the midpoint of the values (50% above, 50% below) after they have been ordered (ascending or descending).
Mode: the most frequent value among all the values observed.
Range: the difference between the smallest and largest observation in the sample.
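The four summary measures above can be computed with the standard library; the sample values are hypothetical.

```python
import statistics

# Worked example of the four summary measures on a small hypothetical sample.
values = [3, 7, 7, 2, 9, 7, 4]

mean = statistics.mean(values)           # sum / count
median = statistics.median(values)       # midpoint of the sorted values
mode = statistics.mode(values)           # most frequent value
value_range = max(values) - min(values)  # largest minus smallest

print(mean, median, mode, value_range)
```

Sorted, the sample is 2, 3, 4, 7, 7, 7, 9, so the median and mode are both 7 and the range is 9 - 2 = 7.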
Different Shapes of Distributions
Data Mining vs. Statistics
Statisticians and data miners use similar techniques, but:
Data mining tends to ignore measurement error in raw data.
Data mining assumes a lot of data and processing power.
Data mining assumes dependency on time everywhere.
It can be difficult to experiment in the business world.
Data can be truncated and censored.
Censored Data: Examples
Customer tenure: the value for active customers must be greater than their current tenure (we do not know when a customer will stop).
Claim amount: not known for those who have not filed a claim.
Sales and inventory: potential sales are greater than actual sales when out of inventory.
Regression Basics
Linear relationship
[Scatter plot: revenue (0-1200) vs. tenure (0-300), showing a roughly linear relationship.]
Best-fit model
[Scatter plot with fitted line: y = 3.4032x - 19.221, R² = 0.8856.]
Estimating the weights w:
Minimize the errors -- the least squares method.
For which model (w) is the data most likely? -- maximum likelihood estimation.
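A least-squares fit like the one on the slide can be computed by hand from the closed-form formulas for slope and intercept. The (tenure, revenue) pairs below are hypothetical; they merely mimic the shape of the slide's data.

```python
# Least-squares fit of y = w*x + c on hypothetical (tenure, revenue) pairs.
xs = [10, 50, 100, 150, 200, 250]
ys = [20, 160, 320, 500, 660, 830]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope = covariance(x, y) / variance(x); the intercept makes the line
# pass through the point of means (mean_x, mean_y).
w = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
c = mean_y - w * mean_x

# R^2 = 1 - (residual sum of squares) / (total sum of squares).
r2 = 1 - sum((y - (w * x + c)) ** 2 for x, y in zip(xs, ys)) / \
        sum((y - mean_y) ** 2 for y in ys)
print(round(w, 3), round(c, 2), round(r2, 4))
```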
Regression model
amount = 0.56 * tenure + 10.34, i.e. y = βx + c, output = f(inputs).
The model gives the expected value when applied.
Slope: β.
How good is the fit? R² measures how much of the relation in the data is captured by the model.
Is the model stable? With a different sample, would the same model be obtained?
Residuals should be normal, with mean 0 and standard deviation σ.
Obtaining the linear regression model
Consider y_i = w x_i + noise_i, with independent noise, normally distributed with mean 0 and standard deviation σ.
P(y | w, x) = Normal(mean w x, std dev σ).
Maximum likelihood estimate of w: find the w that maximizes p(y_1, ..., y_n | x_1, ..., x_n, w):
maximize  ∏_{i=1}^{n} p(y_i | w, x_i)
maximize  ∏_{i=1}^{n} exp( -(y_i - w x_i)² / (2σ²) )
maximize  Σ_{i=1}^{n} -(y_i - w x_i)² / (2σ²)
minimize  Σ_{i=1}^{n} (y_i - w x_i)²
The maximum likelihood estimate minimizes the squared errors.
Minimize squared errors
E = Σ_i (y_i - w x_i)² = Σ y_i² - 2 (Σ x_i y_i) w + (Σ x_i²) w²
The minimum E is obtained with w = Σ x_i y_i / Σ x_i².
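The closed-form result above (for the no-intercept model y = w x) is easy to verify numerically; the data points are hypothetical, chosen to lie near y = 2x.

```python
# The slide's closed-form result for the no-intercept model y = w*x:
# w = sum(x_i * y_i) / sum(x_i ** 2) minimizes the squared errors.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]  # hypothetical data, roughly y = 2x

w = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def sse(slope):
    """Sum of squared errors for a candidate slope."""
    return sum((y - slope * x) ** 2 for x, y in zip(xs, ys))

# Sanity check: nudging w in either direction only increases the error,
# confirming this w is the least-squares minimum.
assert sse(w) <= sse(w + 0.01) and sse(w) <= sse(w - 0.01)
print(round(w, 3))
```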
Residuals
[Residual plot: residuals (-200 to 400) vs. tenure (0-300).]
Residuals should distribute evenly around 0, should show no pattern with the x values, and should be normally distributed around 0.
Heterogeneous data?
[Scatter plot: the same data with the single fit y = 3.4032x - 19.221, R² = 0.8856.]
Heterogeneous data
[Scatter plot: the points separate into two groups, Type A and Type B.]
Heterogeneous data
[Scatter plot with separate fits: Type A: y = 3.515x + 10.859, R² = 0.9515; Type B: y = 1.322x + 32.507, R² = 0.8909.]
Using an indicator variable
Indicator variable: Product ∈ {0, 1}.
Combined model: y = 2.89x + 136.8 * Product - 40.97, R² = 0.93.
Individual models:
y = 3.515x + 10.859 (Product 0), R² = 0.9515
y = 1.322x + 32.507 (Product 1), R² = 0.8909
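A sketch of what the slide's indicator variable does: the single combined model encodes two parallel lines, one per product, that share a slope and differ only in intercept (capturing a slope difference as well would need an interaction term like x * Product).

```python
# The slide's combined model with a 0/1 indicator variable Product.
def combined_model(x, product):
    return 2.89 * x + 136.8 * product - 40.97

# Setting Product to 0 or 1 yields two parallel lines: the slope is the
# same, and the intercepts differ by exactly the indicator's coefficient.
for product in (0, 1):
    intercept = combined_model(0, product)
    slope = combined_model(1, product) - combined_model(0, product)
    print(f"Product {product}: y = {slope:.2f}x {intercept:+.2f}")
```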
Multiple regression
y = β0 + β1 x1 + β2 x2 + ... + βm xm
The variables should be linearly independent of each other; fewer variables work better.
Variable selection: forward selection, stepwise refinement.
Use a validation set: evaluate the family of models on a validation set.
General linear models
Interaction terms: y = β0 + β1 x1 x2 + β2 x2 + ...
Polynomial form: y = β0 + β1 x1 + β2 x1² + ... + βk x1^k
Polynomial example
[Scatter plot: y (0-1.8) vs. x (0-2.5), showing a curved, U-shaped pattern.]
Polynomial example
[Linear fit: y = 0.2027x + 0.694, R² = 0.1515. Quadratic fit: y = 0.8974x² - 1.9486x + 1.4195, R² = 0.8971.]
Polynomial example: overfit?
[Fits of increasing degree to the same data:]
y = 0.8974x² - 1.9486x + 1.4195, R² = 0.8971
y = 0.1396x³ + 0.3943x² - 1.467x + 1.3289, R² = 0.9053
y = -0.3959x⁴ + 2.0664x³ - 2.5936x² + 0.0833x + 1.15, R² = 0.9201
y = -1.78x⁶ + 12.173x⁵ - 31.392x⁴ + 38.042x³ - 21.237x² + 3.7417x + 0.9616, R² = 0.9644
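The overfitting pattern on the slide, where R² keeps creeping up with polynomial degree even past the true degree of the curve, can be reproduced with a small sketch. The data are hypothetical: a noisy quadratic resembling the slide's example, fitted via the normal equations.

```python
import random

# Hypothetical noisy quadratic, similar in shape to the slide's data.
random.seed(1)
xs = [0.1 * i for i in range(25)]
ys = [0.9 * x * x - 1.9 * x + 1.4 + random.gauss(0, 0.15) for x in xs]

def polyfit(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations
    (Gaussian elimination with partial pivoting)."""
    m = degree + 1
    a = [[sum(x ** (i + j) for x in xs) for j in range(m)] for i in range(m)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(m)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, m):
            f = a[r][col] / a[col][col]
            a[r] = [v - f * u for v, u in zip(a[r], a[col])]
            b[r] -= f * b[col]
    coefs = [0.0] * m
    for i in reversed(range(m)):
        coefs[i] = (b[i] - sum(a[i][j] * coefs[j]
                               for j in range(i + 1, m))) / a[i][i]
    return coefs  # coefs[i] multiplies x**i

def r_squared(xs, ys, coefs):
    mean_y = sum(ys) / len(ys)
    preds = [sum(c * x ** i for i, c in enumerate(coefs)) for x in xs]
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# R^2 never decreases as the degree grows, even though the true curve
# is only quadratic -- the extra flexibility just chases the noise.
r2s = [r_squared(xs, ys, polyfit(xs, ys, d)) for d in (1, 2, 4, 6)]
print([round(r, 3) for r in r2s])
```

The rising R² values for degrees past 2 reflect fitting noise, not signal; a validation set, as suggested on the multiple-regression slide, is the usual guard against this.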
Binary dependent variable?
[Scatter plot: a 0/1 outcome vs. x (0-300); the linear fit y = 0.0022x + 0.2212, R² = 0.1172, fits a binary response poorly.]
Odds Ratio
p: probability of an event occurring; 1-p: probability of the event not occurring.
Odds ratio = p / (1-p).
Odds of winning of 1:3 mean 1 win per 3 losses, i.e. p = 0.25 and the odds ratio = 0.25 / (1 - 0.25) = 1/3.
The log of the odds is symmetric around 0: -ve values for low probabilities, +ve values for high probabilities.
[Chart: log-odds vs. p, rising from large negative values near p = 0 through 0 at p = 0.5 to large positive values near p = 1.]
Logistic regression
Use the log of the odds ratio as the dependent variable:
ln(y / (1-y)) = β0 + β1 x
y = e^(β0 + β1 x) / (1 + e^(β0 + β1 x)) = 1 / (1 + e^-(β0 + β1 x))
This is a non-linear model form; the coefficients are found by maximum likelihood estimation.
Consider ln(y / (1-y)) = -20.5 + 1.2x.
The log-odds change by 1.2 for a unit change in x, so the odds of occurrence (not the probability itself) are multiplied by exp(1.2) for a unit change in x.
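The slide's illustrative model can be checked directly: rearranging the log-odds equation gives the logistic curve, and a unit step in x multiplies the odds by exactly exp(β1). The evaluation point x = 17 is an arbitrary choice for illustration.

```python
import math

# The slide's illustrative model: ln(y / (1 - y)) = -20.5 + 1.2 * x,
# rearranged into the logistic form y = 1 / (1 + exp(-(b0 + b1 * x))).
b0, b1 = -20.5, 1.2

def prob(x):
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

def odds(p):
    return p / (1 - p)

# A unit change in x adds b1 to the log-odds, i.e. it multiplies the
# odds (not the probability) by exp(b1) = exp(1.2).
x = 17.0
ratio = odds(prob(x + 1)) / odds(prob(x))
print(round(prob(x), 4), round(ratio, 4))  # ratio equals exp(1.2)
```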