Data Mining. CS 341, Spring 2007. Lecture 4: Data Mining Techniques (I)

Data Mining
CS 341, Spring 2007
Lecture 4: Data Mining Techniques (I)
Prentice Hall

Review
- Information Retrieval
  - Similarity measures
  - Evaluation metrics: precision and recall
- Question Answering
- Web Search Engine
  - An application of IR
  - Related to web mining

Data Mining Techniques Outline
Goal: Provide an overview of basic data mining techniques.
- Statistical
  - Point Estimation
  - Models Based on Summarization
  - Bayes Theorem
  - Hypothesis Testing
  - Regression and Correlation
- Similarity Measures

Point Estimation
- Point estimate: an estimate of a population parameter.
- May be made by calculating the parameter for a sample.
- May be used to predict a value for missing data.
- Ex: R contains 100 employees.
  - 99 have salary information.
  - The mean salary of these is $50,000.
  - Use $50,000 as the value of the remaining employee's salary. Is this a good idea?

Estimation Error
- Bias: difference between the expected value of the estimator and the actual value: $\mathrm{Bias}(\hat\theta) = E[\hat\theta] - \theta$
- Mean Squared Error (MSE): expected value of the squared difference between the estimate and the actual value: $\mathrm{MSE}(\hat\theta) = E[(\hat\theta - \theta)^2]$
- Root Mean Square Error (RMSE): $\mathrm{RMSE}(\hat\theta) = \sqrt{\mathrm{MSE}(\hat\theta)}$

Jackknife Estimate
- Jackknife estimate: an estimate of a parameter obtained by omitting one value from the set of observed values.
- Named to describe a handy and useful tool.
- Used to reduce bias.
- Property: the jackknife estimator lowers the bias from the order of $1/n$ to $1/n^2$.
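The leave-one-out idea behind the jackknife can be sketched in a few lines of Python. This is only an illustration, assuming the common form of the estimator with one observation per group ($g = n$), namely $\hat\theta_Q = n\hat\theta - (n-1)\bar\theta$, where $\bar\theta$ is the average of the leave-one-out estimates. The function names `biased_variance` and `jackknife` and the sample values are hypothetical choices for this sketch, not names from the lecture.

```python
from statistics import mean


def biased_variance(xs):
    """Variance with divisor n (the biased estimator)."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def jackknife(xs, estimator):
    """Leave-one-out jackknife: n * theta_hat - (n - 1) * theta_bar.

    Assumes xs is a list with at least two values and that each
    observation forms its own group (m = 1, g = n).
    """
    n = len(xs)
    theta_hat = estimator(xs)
    # Re-estimate n times, each time omitting one observation.
    leave_one_out = [estimator(xs[:i] + xs[i + 1:]) for i in range(n)]
    theta_bar = mean(leave_one_out)
    return n * theta_hat - (n - 1) * theta_bar


sample = [1.0, 4.0, 4.0]
jk_var = jackknife(sample, biased_variance)  # bias-corrected variance
jk_mean = jackknife(sample, mean)            # the mean is left unchanged
print(jk_var, jk_mean)
```

For the sample mean the jackknife changes nothing, while for the divisor-n variance it yields the familiar divisor-(n-1) value.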

Jackknife Estimate
- Definition: divide the sample of size n into g groups of size m each, so n = mg (often m = 1 and g = n).
- Estimate $\hat\theta_j$ by ignoring the j-th group.
- $\bar\theta$ is the average of the $\hat\theta_j$.
- The jackknife estimator is $\hat\theta_Q = g\hat\theta - (g-1)\bar\theta$, where $\hat\theta$ is an estimator for the parameter $\theta$.

Jackknife Estimator: Example 1
- Estimate of the mean for $X = \{x_1, x_2, x_3\}$, n = 3, g = 3, m = 1: $\hat\theta = \hat\mu = (x_1 + x_2 + x_3)/3$
- $\hat\theta_1 = (x_2 + x_3)/2$, $\hat\theta_2 = (x_1 + x_3)/2$, $\hat\theta_3 = (x_1 + x_2)/2$
- $\bar\theta = (\hat\theta_1 + \hat\theta_2 + \hat\theta_3)/3 = (x_1 + x_2 + x_3)/3$
- $\hat\theta_Q = g\hat\theta - (g-1)\bar\theta = 3\hat\theta - 2\bar\theta = (x_1 + x_2 + x_3)/3$
- In this case, the jackknife estimator is the same as the usual estimator.

Jackknife Estimator: Example 2
- Estimate of the variance for X = {1, 4, 4}, n = 3, g = 3, m = 1: $\hat\theta = \hat\sigma^2$
- $\hat\sigma^2 = ((1-3)^2 + (4-3)^2 + (4-3)^2)/3 = 2$
- $\hat\theta_1 = ((4-4)^2 + (4-4)^2)/2 = 0$
- $\hat\theta_2 = 2.25$, $\hat\theta_3 = 2.25$
- $\bar\theta = (\hat\theta_1 + \hat\theta_2 + \hat\theta_3)/3 = 1.5$
- $\hat\theta_Q = g\hat\theta - (g-1)\bar\theta = 3(2) - 2(1.5) = 3$
- In this case, the jackknife estimator is different from the usual estimator.

Jackknife Estimator: Example 2 (cont'd)
- In general, apply the jackknife technique to the biased estimator $\hat\sigma^2 = \sum_i (x_i - \bar{x})^2 / n$.
- The jackknife estimator is then $s^2 = \sum_i (x_i - \bar{x})^2 / (n-1)$, which is known to be unbiased for $\sigma^2$.

Maximum Likelihood Estimate (MLE)
- Obtain parameter estimates that maximize the probability that the sample data occurs for the specific model.
- The joint probability of observing the sample data is obtained by multiplying the individual probabilities. Likelihood function: $L(\Theta \mid x_1, \ldots, x_n) = \prod_{i=1}^{n} f(x_i \mid \Theta)$

MLE Example
- Coin toss five times: {H, H, H, H, T}
- Assuming a perfect coin with H and T equally likely, the likelihood of this sequence is: $L = (0.5)^5 \approx 0.03$
- However, if the probability of an H is 0.8, then: $L = (0.8)^4 (0.2) \approx 0.08$
- Maximize L.
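The coin-toss likelihoods can be checked numerically. The sketch below multiplies the individual toss probabilities for a candidate value of p and then searches a coarse grid for the maximizer. The grid search is an assumption made purely for illustration (for this model the maximizer also has a closed form, the fraction of heads), and the names `likelihood`, `grid`, and `p_best` are hypothetical.

```python
def likelihood(p, tosses):
    """Joint probability of an independent toss sequence when P(H) = p."""
    result = 1.0
    for toss in tosses:
        result *= p if toss == "H" else (1.0 - p)
    return result


tosses = ["H", "H", "H", "H", "T"]

L_fair = likelihood(0.5, tosses)    # perfect coin
L_biased = likelihood(0.8, tosses)  # coin with P(H) = 0.8

# Illustrative grid search over p = 0.00, 0.01, ..., 1.00.
grid = [i / 100 for i in range(101)]
p_best = max(grid, key=lambda p: likelihood(p, tosses))

print(L_fair, L_biased, p_best)
```

The likelihood under p = 0.8 is larger than under the fair coin, and on this grid no other value of p does better.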

MLE Example (cont'd)
- General likelihood formula, with $x_i = 1$ for heads and $x_i = 0$ for tails: $L(p \mid x_1, \ldots, x_5) = \prod_{i=1}^{5} p^{x_i} (1-p)^{1-x_i} = p^{\sum x_i} (1-p)^{5 - \sum x_i}$
- The estimate for p is then 4/5 = 0.8.

Expectation-Maximization (EM)
- Solves estimation with incomplete data.
- Obtain initial estimates for the parameters.
- Iteratively use the estimates for the missing data and continue until convergence.

EM Example
[Slide content not preserved in the transcription.]

EM Algorithm
[Slide content not preserved in the transcription.]

Models Based on Summarization
- Basic concepts to provide an abstraction and summarization of the data as a whole.
  - Statistical concepts: mean, variance, median, mode, etc.
- Visualization: display the structure of the data graphically.
  - Line graphs, pie charts, histograms, scatter plots, hierarchical graphs

Scatter Diagram
[Figure not preserved in the transcription.]
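Returning to the EM steps listed in this part of the lecture (initial estimate, fill in the missing data with the current estimate, re-estimate, repeat until convergence), a deliberately small sketch is estimating a mean when some values are missing. The data values, the initial guess, the tolerance, and the function name `em_mean` are all hypothetical; the slides' own EM example did not survive transcription, so this is not a reconstruction of it.

```python
def em_mean(observed, n_missing, initial_guess, tol=1e-6, max_iter=1000):
    """Toy EM loop for a mean with n_missing unobserved values.

    E-step: each missing value is replaced by the current estimate.
    M-step: the mean is recomputed over observed plus filled-in values.
    Returns the estimate and the number of iterations used.
    """
    n = len(observed) + n_missing
    total_observed = sum(observed)
    mu = initial_guess
    for iteration in range(1, max_iter + 1):
        expected_total = total_observed + n_missing * mu  # E-step
        new_mu = expected_total / n                       # M-step
        if abs(new_mu - mu) < tol:                        # converged
            return new_mu, iteration
        mu = new_mu
    return mu, max_iter


observed = [1.0, 5.0, 10.0, 4.0]  # hypothetical observed values
mu_hat, iterations = em_mean(observed, n_missing=2, initial_guess=3.0)
print(mu_hat, iterations)
```

In this toy case the iteration settles at the mean of the observed values, which is what one would expect when the missing entries carry no extra information.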

Bayes Theorem
- Posterior probability: $P(h_1 \mid x_i)$
- Prior probability: $P(h_1)$
- Bayes Theorem: $P(h_1 \mid x_i) = \dfrac{P(x_i \mid h_1)\, P(h_1)}{P(x_i)}$
- Assign probabilities of hypotheses given a data value.

Bayes Theorem Example
- Credit authorizations (hypotheses): h1 = authorize purchase, h2 = authorize after further identification, h3 = do not authorize, h4 = do not authorize but contact police.
- Assign twelve data values for all combinations of credit and income:

  Credit \ Income    1     2     3     4
  Excellent          x1    x2    x3    x4
  Good               x5    x6    x7    x8
  Bad                x9    x10   x11   x12

- From training data: P(h1) = 60%; P(h2) = 20%; P(h3) = 10%; P(h4) = 10%.

Bayes Example (cont'd)
- Training data:

  ID   Income   Credit      Class   x_i
  1    4        Excellent   h1      x4
  2    3        Good        h1      x7
  3    2        Excellent   h1      x2
  4    3        Good        h1      x7
  5    4        Good        h1      x8
  6    2        Excellent   h1      x2
  7    3        Bad         h2      x11
  8    2        Bad         h2      x10
  9    3        Bad         h3      x11
  10   1        Bad         h4      x9

Bayes Example (cont'd)
- Calculate $P(x_i \mid h_j)$ and $P(x_i)$.
- Ex: $P(x_7 \mid h_1) = 2/6$; $P(x_4 \mid h_1) = 1/6$; $P(x_2 \mid h_1) = 2/6$; $P(x_8 \mid h_1) = 1/6$; $P(x_i \mid h_1) = 0$ for all other $x_i$.
- Predict the class for $x_4$:
  - Calculate $P(h_j \mid x_4)$ for all $h_j$.
  - Place $x_4$ in the class with the largest value.
  - Ex: $P(h_1 \mid x_4) = P(x_4 \mid h_1)\, P(h_1) / P(x_4) = (1/6)(0.6)/0.1 = 1$, so $x_4$ is placed in class h1.

Hypothesis Testing
- Find a model to explain behavior by creating and then testing a hypothesis about the data.
- Exact opposite of the usual DM approach.
- H0: null hypothesis; the hypothesis to be tested.
- H1: alternative hypothesis.

Chi-Square Test
- One technique to perform hypothesis testing.
- Used to test the association between two observed variable values and to determine if a set of observed values is statistically different.
- The chi-squared statistic is defined as: $\chi^2 = \sum \dfrac{(O - E)^2}{E}$
- O: observed value.
- E: expected value based on the hypothesis.
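A minimal sketch of the chi-squared statistic, assuming its usual form as the sum of (O - E)^2 / E over all observations. The observed scores, the expected value of 75, and the critical value 9.488 are the figures used in this lecture's five-school example; the function name `chi_squared` and the variable names are hypothetical.

```python
def chi_squared(observed, expected):
    """Sum of (O - E)^2 / E over paired observed and expected values."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = [50, 93, 67, 78, 87]       # average scores of five schools
expected = [75] * len(observed)       # same expected score for each school

stat = chi_squared(observed, expected)

# Critical value quoted in the lecture for 4 degrees of freedom.
CRITICAL_VALUE_DF4 = 9.488
significant = stat > CRITICAL_VALUE_DF4

print(round(stat, 2), significant)
```

Because the statistic exceeds the tabulated critical value, the differences between the schools would be judged statistically significant under this test.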

Chi-Square Test
- Given the average scores of five schools, determine whether the difference is statistically significant.
- Ex: O = {50, 93, 67, 78, 87}, E = 75, $\chi^2 = 15.55$ and therefore significant.
- Examine a chi-squared significance table: with 4 degrees of freedom and a significance level of 0.05 (95% confidence), the critical value is 9.488. Thus the variance between the schools' scores and the expected value cannot be associated with pure chance.

Regression
- Predict future values based on past values.
- Fitting a set of points to a curve.
- Linear regression assumes a linear relationship exists: $y = c_0 + c_1 x_1 + \cdots + c_n x_n$
  - n input variables (called regressors or predictors)
  - One output variable, called the response
  - n+1 constants, chosen during the modeling process to match the input examples

Linear Regression with One Input Value
[Figure not preserved in the transcription.]

Correlation
- Examine the degree to which the values for two variables behave similarly.
- Correlation coefficient r:
  - 1 = perfect correlation
  - -1 = perfect but opposite correlation
  - 0 = no correlation

Correlation
- $r = \dfrac{\sum_i (x_i - \bar{X})(y_i - \bar{Y})}{\sqrt{\sum_i (x_i - \bar{X})^2 \, \sum_i (y_i - \bar{Y})^2}}$, where $\bar{X}$, $\bar{Y}$ are the means of X and Y respectively.
- Suppose X = (1, 3, 5, 7, 9) and Y = (9, 7, 5, 3, 1): r = ?
- Suppose X = (1, 3, 5, 7, 9) and Y = (2, 4, 6, 8, 10): r = ?

Similarity Measures
- Determine similarity between two objects.
- Similarity characteristics: [formulas not preserved in the transcription.]
- Alternatively, distance measures measure how unlike or dissimilar objects are.
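The two correlation exercises above can be worked through with a short sketch, assuming the standard Pearson form of r (sum of products of deviations from the means, divided by the square root of the product of the summed squared deviations). The function name `pearson_r` is a hypothetical choice.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable.
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(ss_x * ss_y)


X = [1, 3, 5, 7, 9]
r_opposite = pearson_r(X, [9, 7, 5, 3, 1])   # Y falls as X rises
r_same = pearson_r(X, [2, 4, 6, 8, 10])      # Y rises with X

print(r_opposite, r_same)
```

The first pair is perfectly but oppositely correlated (r = -1), and the second is perfectly correlated (r = 1), since in both cases Y is an exact linear function of X.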

Similarity Measures
[Formulas not preserved in the transcription.]

Distance Measures
- Measure dissimilarity between objects.
[Formulas not preserved in the transcription.]

Next Lecture
- Data mining techniques (II): decision trees, neural networks, and genetic algorithms
- Reading assignments: Chapter 3