Data Mining CS 341, Spring 2007
Lecture 4: Data Mining Techniques (I)

Review:
- Information Retrieval
  - Similarity measures
  - Evaluation metrics: precision and recall
- Question Answering
- Web Search Engines
  - An application of IR
  - Related to web mining

Data Mining Techniques Outline
Goal: Provide an overview of basic data mining techniques.
- Statistical
  - Point estimation
  - Models based on summarization
  - Bayes theorem
  - Hypothesis testing
  - Regression and correlation
- Similarity measures

Point Estimation
- Point estimate: an estimate of a population parameter.
- May be made by calculating the parameter for a sample.
- May be used to predict a value for missing data.
- Ex: relation R contains 100 employees; 99 have salary information, and the mean salary of these is $50,000. Use $50,000 as the value of the remaining employee's salary. Is this a good idea?

Estimation Error
- Bias: the difference between the expected value of the estimator and the actual parameter value: Bias = E[\hat{\theta}] - \theta.
- Mean Squared Error (MSE): the expected value of the squared difference between the estimate and the actual value: MSE = E[(\hat{\theta} - \theta)^2].
- Root Mean Squared Error (RMSE): RMSE = \sqrt{MSE}.

Jackknife Estimate
- Jackknife estimate: an estimate of a parameter obtained by omitting one value (or group of values) at a time from the set of observed values.
- Named after the jackknife, a handy and versatile tool.
- Used to reduce bias.
- Property: the jackknife estimator lowers the bias from order 1/n to order 1/n^2.
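As a concrete illustration of the error measures above (before turning to the jackknife details), here is a minimal simulation sketch. The salary population parameters (true mean $50,000, standard deviation $8,000) are made-up assumptions chosen to echo the employee example; the sample mean over n = 99 values plays the role of the point estimator.

```python
import math
import random

def simulate_estimator(true_mu, sigma, n, trials=10_000, seed=0):
    """Approximate the bias, MSE, and RMSE of the sample mean by simulation."""
    rng = random.Random(seed)
    estimates = [
        sum(rng.gauss(true_mu, sigma) for _ in range(n)) / n
        for _ in range(trials)
    ]
    bias = sum(estimates) / trials - true_mu          # E[estimate] - theta
    mse = sum((e - true_mu) ** 2 for e in estimates) / trials
    return bias, mse, math.sqrt(mse)

# Made-up population echoing the slide's example: samples of 99 salaries.
b, m, r = simulate_estimator(true_mu=50_000, sigma=8_000, n=99)
print(f"bias = {b:.1f}, MSE = {m:.0f}, RMSE = {r:.1f}")
# bias is near 0 (the sample mean is unbiased); RMSE is near
# sigma / sqrt(n) = 8000 / sqrt(99), roughly 804.
```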
Jackknife Estimate
- Definition: divide the sample of size n into g groups of size m each, so n = mg (often m = 1 and g = n). Estimate \hat{\theta}_{(j)} by ignoring the jth group, and let \bar{\theta} be the average of the \hat{\theta}_{(j)}.
- The jackknife estimator is \hat{\theta}_Q = g\hat{\theta} - (g - 1)\bar{\theta}, where \hat{\theta} is the usual estimator for the parameter \theta computed from the full sample.

Jackknife Estimator: Example 1
- Estimate of the mean for X = {x_1, x_2, x_3}; n = 3, g = 3, m = 1, \hat{\theta} = \hat{\mu} = (x_1 + x_2 + x_3)/3.
- \hat{\theta}_{(1)} = (x_2 + x_3)/2, \hat{\theta}_{(2)} = (x_1 + x_3)/2, \hat{\theta}_{(3)} = (x_1 + x_2)/2.
- \bar{\theta} = (\hat{\theta}_{(1)} + \hat{\theta}_{(2)} + \hat{\theta}_{(3)})/3 = (x_1 + x_2 + x_3)/3.
- \hat{\theta}_Q = g\hat{\theta} - (g - 1)\bar{\theta} = 3\hat{\theta} - 2\bar{\theta} = (x_1 + x_2 + x_3)/3.
- In this case, the jackknife estimator is the same as the usual estimator.

Jackknife Estimator: Example 2
- Estimate of the variance for X = {1, 4, 4}; n = 3, g = 3, m = 1, \hat{\theta} = \hat{\sigma}^2.
- \hat{\sigma}^2 = ((1-3)^2 + (4-3)^2 + (4-3)^2)/3 = 2.
- \hat{\theta}_{(1)} = ((4-4)^2 + (4-4)^2)/2 = 0; \hat{\theta}_{(2)} = 2.25; \hat{\theta}_{(3)} = 2.25.
- \bar{\theta} = (0 + 2.25 + 2.25)/3 = 1.5.
- \hat{\theta}_Q = g\hat{\theta} - (g - 1)\bar{\theta} = 3(2) - 2(1.5) = 3.
- In this case, the jackknife estimator is different from the usual estimator.

Jackknife Estimator: Example 2 (cont'd)
- In general, applying the jackknife technique to the biased estimator \hat{\sigma}^2 = \sum (x_i - \bar{x})^2 / n yields the estimator s^2 = \sum (x_i - \bar{x})^2 / (n - 1), which is known to be unbiased for \sigma^2.

Maximum Likelihood Estimate (MLE)
- Obtain the parameter estimates that maximize the probability that the sample data occurs under the specific model.
- The likelihood function is the joint probability of observing the sample data, formed by multiplying the individual probabilities: L(\Theta | x_1, ..., x_n) = \prod_{i=1}^{n} f(x_i | \Theta).

MLE Example
- Coin tossed five times: {H, H, H, H, T}.
- Assuming a fair coin with H and T equally likely, the likelihood of this sequence is L = (1/2)^5 = 0.03125.
- However, if the probability of an H is 0.8, then L = (0.8)^4 (0.2) = 0.08192.
- Maximize L.
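A minimal sketch of this maximization, assuming the tosses are independent Bernoulli trials. The grid search is purely illustrative; the next slide derives the closed-form answer \hat{p} = 4/5 directly.

```python
def likelihood(p, tosses):
    """Joint probability of an i.i.d. sequence of coin tosses ('H' or 'T')."""
    L = 1.0
    for t in tosses:
        L *= p if t == "H" else (1.0 - p)
    return L

tosses = ["H", "H", "H", "H", "T"]
print(likelihood(0.5, tosses))  # 0.03125 (fair coin)
print(likelihood(0.8, tosses))  # 0.08192 (biased coin)

# Grid search over candidate values of p; the maximum is at p = 0.8,
# matching the closed-form MLE of 4/5 derived on the next slide.
best_p = max((i / 100 for i in range(1, 100)),
             key=lambda p: likelihood(p, tosses))
print(best_p)  # 0.8
```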
MLE Example (cont'd)
- General likelihood formula for n independent Bernoulli trials with outcomes x_i \in {0, 1} (1 = heads): L(p | x_1, ..., x_n) = \prod_{i=1}^{n} p^{x_i} (1-p)^{1-x_i} = p^{\sum x_i} (1-p)^{n - \sum x_i}.
- Maximizing L (or log L) gives \hat{p} = \sum x_i / n; the estimate for p is then 4/5 = 0.8.

Expectation-Maximization (EM)
- Solves estimation problems with incomplete data.
- Obtain initial estimates for the parameters.
- Iteratively use the estimates to fill in the missing data, re-estimate, and continue until convergence.

EM Example
(worked example shown as a figure on the original slide; a code sketch of the iteration is given below)

EM Algorithm
(algorithm steps shown as a figure on the original slide)

Models Based on Summarization
- Basic concepts that provide an abstraction and summarization of the data as a whole. Statistical concepts: mean, variance, median, mode, etc.
- Visualization: display the structure of the data graphically. Line graphs, pie charts, histograms, scatter plots, hierarchical graphs.

Scatter Diagram
(example scatter plot shown as a figure on the original slide)
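Since the EM Example and EM Algorithm slides were figures, here is a stand-in: a minimal sketch of the EM idea for the incomplete-data setting described above, estimating a mean when some values are missing. The E-step imputes each missing value with the current mean estimate; the M-step recomputes the mean. The data values and initial guess are made up for illustration.

```python
def em_mean(observed, n_missing, mu0, tol=1e-6, max_iter=100):
    """EM-style estimate of the mean when n_missing values are unobserved."""
    mu = mu0
    n = len(observed) + n_missing
    for _ in range(max_iter):
        total = sum(observed) + n_missing * mu  # E-step: impute missing with mu
        mu_new = total / n                      # M-step: re-estimate the mean
        if abs(mu_new - mu) < tol:              # stop at convergence
            break
        mu = mu_new
    return mu

# Made-up data: four observed values, two missing, deliberately poor guess.
print(em_mean([1, 5, 10, 4], n_missing=2, mu0=0.0))  # converges to 5.0
```

The fixed point satisfies mu = (sum(observed) + 2·mu)/6, i.e. the mean of the observed values, which the iteration approaches geometrically regardless of the initial guess.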
Bayes Theorem
- Posterior probability: P(h_1 | x_i).
- Prior probability: P(h_1).
- Bayes theorem: P(h_1 | x_i) = P(x_i | h_1) P(h_1) / P(x_i).
- Assigns probabilities to hypotheses given a data value.

Bayes Theorem Example
- Credit authorization (hypotheses): h_1 = authorize purchase, h_2 = authorize after further identification, h_3 = do not authorize, h_4 = do not authorize but contact police.
- Assign twelve data values for all combinations of credit rating and income level (income levels 1-4):

  Credit \ Income    1      2      3      4
  Excellent          x_1    x_2    x_3    x_4
  Good               x_5    x_6    x_7    x_8
  Bad                x_9    x_10   x_11   x_12

- From training data: P(h_1) = 60%, P(h_2) = 20%, P(h_3) = 10%, P(h_4) = 10%.

Bayes Example (cont'd)
- Training data:

  ID   Income   Credit      Class   x_i
  1    4        Excellent   h_1     x_4
  2    3        Good        h_1     x_7
  3    2        Excellent   h_1     x_2
  4    3        Good        h_1     x_7
  5    4        Good        h_1     x_8
  6    2        Excellent   h_1     x_2
  7    3        Bad         h_2     x_11
  8    2        Bad         h_2     x_10
  9    3        Bad         h_3     x_11
  10   1        Bad         h_4     x_9

Bayes Example (cont'd)
- Calculate P(x_i | h_j) and P(x_i).
- Ex: P(x_7 | h_1) = 2/6; P(x_4 | h_1) = 1/6; P(x_2 | h_1) = 2/6; P(x_8 | h_1) = 1/6; P(x_i | h_1) = 0 for all other x_i.
- Predict the class for x_4: calculate P(h_j | x_4) for all h_j and place x_4 in the class with the largest value.
  - Ex: P(h_1 | x_4) = P(x_4 | h_1) P(h_1) / P(x_4) = (1/6)(0.6)/0.1 = 1, so x_4 belongs in class h_1.

Hypothesis Testing
- Find a model to explain behavior by creating and then testing a hypothesis about the data.
- The exact opposite of the usual DM approach: the model is proposed first, then tested against the data.
- H_0: null hypothesis, the hypothesis to be tested.
- H_1: alternative hypothesis.

Chi-Square Test
- One technique for performing hypothesis testing.
- Used to test the association between two observed variables and to determine whether a set of observed values is statistically different from the expected values.
- The chi-squared statistic is defined as \chi^2 = \sum (O - E)^2 / E, where
  - O = observed value
  - E = expected value based on the hypothesis.
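A minimal sketch of the statistic, run on the school-scores data worked on the next slide. The critical value 9.488 (4 degrees of freedom, 5% significance level) is a standard table entry.

```python
def chi_square(observed, expected):
    """Chi-squared statistic: sum of (O - E)^2 / E over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# School-scores example from the next slide: five observed averages,
# each with expected value 75.
O = [50, 93, 67, 78, 87]
E = [75] * len(O)
stat = chi_square(O, E)
print(round(stat, 2))  # 15.55

# Compare against the critical value for 4 degrees of freedom at the
# 5% significance level.
print(stat > 9.488)  # True: the difference is statistically significant
```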
Chi-Square Test (cont'd)
- Given the average scores of five schools, determine whether the differences are statistically significant.
- Ex: O = {50, 93, 67, 78, 87}, E = 75 for each school, giving \chi^2 = 15.55, which is significant.
- Examine a chi-squared significance table: with 4 degrees of freedom at the 5% significance level (95% confidence), the critical value is 9.488. Since 15.55 > 9.488, the variation between the schools' scores and the expected value cannot be attributed to pure chance.

Regression
- Predict future values based on past values.
- Fit a set of points to a curve.
- Linear regression assumes a linear relationship exists: y = c_0 + c_1 x_1 + ... + c_n x_n, with
  - n input variables (called regressors or predictors),
  - one output variable (called the response),
  - n + 1 constants, chosen during the modeling process to match the input examples.

Linear Regression with One Input Value
(fitted-line figure shown on the original slide)

Correlation
- Examine the degree to which the values of two variables behave similarly.
- Correlation coefficient r:
  - 1 = perfect correlation
  - -1 = perfect but opposite correlation
  - 0 = no correlation

Correlation (cont'd)
- r = \sum (x_i - \bar{X})(y_i - \bar{Y}) / \sqrt{\sum (x_i - \bar{X})^2 \sum (y_i - \bar{Y})^2}, where \bar{X}, \bar{Y} are the means of X and Y respectively.
- Suppose X = (1, 3, 5, 7, 9) and Y = (9, 7, 5, 3, 1). r = ?
- Suppose X = (1, 3, 5, 7, 9) and Y = (2, 4, 6, 8, 10). r = ?

Similarity Measures
- Determine the similarity between two objects.
- Similarity characteristics: (the formal properties were listed in a figure on the original slide).
- Alternatively, a distance measure measures how unlike or dissimilar two objects are.
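A minimal sketch that computes r for the two example pairs above (answering the r = ? questions) and, for the similarity and distance slides that follow, shows cosine similarity and Euclidean distance as two common concrete measures. These two measures are illustrative choices; the formulas on the original slides were figures and may have listed others.

```python
import math

def correlation(X, Y):
    """Pearson correlation coefficient r for paired samples X and Y."""
    n = len(X)
    mx, my = sum(X) / n, sum(Y) / n
    num = sum((x - mx) * (y - my) for x, y in zip(X, Y))
    den = math.sqrt(sum((x - mx) ** 2 for x in X) *
                    sum((y - my) ** 2 for y in Y))
    return num / den

print(correlation([1, 3, 5, 7, 9], [9, 7, 5, 3, 1]))   # -1.0 perfect opposite
print(correlation([1, 3, 5, 7, 9], [2, 4, 6, 8, 10]))  #  1.0 perfect correlation

def cosine_similarity(X, Y):
    """Cosine of the angle between vectors X and Y: a similarity measure."""
    dot = sum(x * y for x, y in zip(X, Y))
    return dot / (math.sqrt(sum(x * x for x in X)) *
                  math.sqrt(sum(y * y for y in Y)))

def euclidean_distance(X, Y):
    """Euclidean distance between X and Y: a dissimilarity measure."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(X, Y)))
```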
Similarity Measures (cont'd)
(measure formulas shown as a figure on the original slide; cosine similarity in the sketch above is one example)

Distance Measures
- Measure the dissimilarity between objects.
(distance formulas shown as a figure on the original slide; Euclidean distance in the sketch above is one example)

Next Lecture:
- Data mining techniques (II): decision trees, neural networks, and genetic algorithms.
- Reading assignment: Chapter 3.