Data Informatics. Seon Ho Kim, Ph.D.

1 Data Informatics Seon Ho Kim, Ph.D.

2 Classification and Prediction

3 Outline Classification vs. prediction Issues regarding classification and prediction Classification by decision tree induction Bayesian Classification Prediction Summary Reference

4 Classification vs. Prediction Classification predicts categorical class labels (discrete or nominal) classifies data (constructs a model) based on the training dataset and uses it in classifying new data Prediction models continuous-valued functions, i.e., predicts unknown or missing values Typical applications Credit approval, Targeted marketing, Medical diagnosis, Fraud detection, etc.

5 Classification A Two-Step Process Model construction: describing a set of predetermined classes Each sample is assumed to belong to a predefined class, as determined by the class label attribute The set of tuples used for model construction is the training set The model is represented as classification rules, decision trees, or mathematical formulae

6 Classification A Two-Step Process Model usage: for classifying future or unknown objects Estimate accuracy of the model The known label of each test sample is compared with the classified result from the model Accuracy rate is the percentage of test set samples that are correctly classified by the model The test set must be independent of the training set; otherwise the accuracy estimate is optimistically biased and overfitting goes undetected If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known

7 Classification Process (1): Model Construction
Training Data -> Classification Algorithms -> Classifier (Model)
NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no
Classifier (Model): IF rank = professor OR years > 6 THEN tenured = yes

8 Classification Process (2): Use the Model in Prediction
Testing Data -> Classifier
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
Unseen Data: (Jeff, Professor, 4) Tenured?
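The two-step process on slides 7 and 8 can be traced in a few lines. This is a minimal sketch, assuming the rule from slide 7 is applied literally; the function name `predict_tenured` and the variable names are illustrative and do not come from the slides.

```python
# Rule induced from the training data on slide 7:
# IF rank = professor OR years > 6 THEN tenured = yes
def predict_tenured(rank: str, years: int) -> str:
    return "yes" if rank == "Professor" or years > 6 else "no"

# Testing data from slide 8: (name, rank, years, known label).
test_data = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

# Accuracy rate: share of test samples whose predicted label matches the known label.
correct = sum(
    predict_tenured(rank, years) == label for _, rank, years, label in test_data
)
accuracy = correct / len(test_data)
print(f"accuracy = {accuracy:.2f}")

# Classify the unseen tuple (Jeff, Professor, 4).
jeff = predict_tenured("Professor", 4)
print(f"Jeff tenured? {jeff}")
```

Merlisa (7 years, not tenured) is the one test sample the rule gets wrong, which is why the accuracy stays below 100%.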

9 Supervised vs. Unsupervised Learning Supervised learning (classification) Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations New data is classified based on the training set Unsupervised learning (clustering) The class labels of the training data are unknown Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

10 Issues Regarding Classification and Prediction (1): Data Preparation Data cleaning Preprocess data in order to reduce noise and handle missing values Relevance analysis (feature selection) Remove the irrelevant or redundant attributes Data transformation Generalize and/or normalize data

11 Issues regarding classification and prediction (2): Evaluating classification methods Accuracy: classifier and predictor accuracy Speed time to construct the model (training time) time to use the model (classification/prediction time) Robustness handling noise and missing values Scalability: efficiency as data size grows Interpretability understanding and insight provided by the model Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules

12 Training Dataset (Example of Buying Computer)
age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31..40  high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31..40  low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31..40  medium  no       excellent      yes
31..40  high    yes      fair           yes
>40     medium  no       excellent      no

13 Decision Trees A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences. Sort instances (data) according to feature values (i.e., age, income, etc.): a hierarchy of tests data are classified/sorted according to specific feature values, which become increasingly specific. Nodes: features Root node: feature that best divides data Algorithms exist for determining the best root node Branches: values the node can assume

14 Output: A Decision Tree for buying_computer
age?
  <=30: student?
    no: no
    yes: yes
  31..40: yes
  >40: credit rating?
    excellent: no
    fair: yes

15 Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm) Tree is constructed in a top-down recursive divide-and-conquer manner At start, all the training examples are at the root Attributes are categorical (if continuous, discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)

16 Algorithm for Decision Tree Induction Conditions for stopping partitioning All samples for a given node belong to the same class There are no remaining attributes for further partitioning majority voting is employed for classifying the leaf There are no samples left
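The greedy induction procedure and its stopping conditions can be sketched as follows. This is an illustrative ID3-style sketch under stated assumptions, not the deck's own implementation: the names `ROWS`, `ATTRS`, `entropy`, `gain`, and `build` are hypothetical, the data is the 14-tuple buys_computer set from slide 12, and the middle age group is labelled `31..40` (an assumption, since that label is missing from the transcript).

```python
from collections import Counter
from math import log2

ATTRS = ["age", "income", "student", "credit_rating"]
# Each row: (age, income, student, credit_rating, buys_computer).
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]

def entropy(rows):
    """Information needed to classify a tuple drawn from rows."""
    n = len(rows)
    return -sum(c / n * log2(c / n) for c in Counter(r[-1] for r in rows).values())

def partition(rows, idx):
    """Group rows by the value of attribute idx."""
    parts = {}
    for r in rows:
        parts.setdefault(r[idx], []).append(r)
    return parts

def gain(rows, idx):
    """Information gain from splitting rows on attribute idx."""
    n = len(rows)
    parts = partition(rows, idx)
    return entropy(rows) - sum(len(p) / n * entropy(p) for p in parts.values())

def build(rows, attr_idxs):
    labels = Counter(r[-1] for r in rows)
    # Stop: all samples share one class, or no attributes remain (majority vote).
    if len(labels) == 1 or not attr_idxs:
        return labels.most_common(1)[0][0]
    best = max(attr_idxs, key=lambda i: gain(rows, i))
    rest = [i for i in attr_idxs if i != best]
    # Only values that occur in rows get a branch, so no partition is ever empty.
    return {ATTRS[best]: {v: build(p, rest) for v, p in partition(rows, best).items()}}

tree = build(ROWS, [0, 1, 2, 3])
print(tree)
```

Because branches are created only for attribute values that actually occur, the "no samples left" condition never triggers in this sketch; an implementation that branches on every possible value would need to handle it explicitly.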

17 Attribute Selection Measure: Information Gain
Select the attribute with the highest information gain.
S contains $s_i$ tuples of class $C_i$ for $i = 1, \ldots, m$, with $s = \sum_{i=1}^{m} s_i$.
Information required to classify an arbitrary tuple:
$$I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} \frac{s_i}{s} \log_2 \frac{s_i}{s}$$
Entropy of attribute A with values $\{a_1, a_2, \ldots, a_v\}$, where $s_{ij}$ is the number of tuples of class $C_i$ with $A = a_j$:
$$E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \cdots + s_{mj}}{s} \, I(s_{1j}, \ldots, s_{mj})$$
Information gained by branching on attribute A:
$$\mathrm{Gain}(A) = I(s_1, s_2, \ldots, s_m) - E(A)$$

18 Attribute Selection by Information Gain Computation
Class P: buys_computer = yes
Class N: buys_computer = no
I(p, n) = I(9, 5) = 0.940
Compute the entropy for age (training data as on slide 12):
age     p_i  n_i  I(p_i, n_i)
<=30    2    3    0.971
31..40  4    0    0
>40     3    2    0.971
$$E(\text{age}) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694$$
Here (5/14) I(2,3) means that age <=30 covers 5 of the 14 samples, with 2 yeses and 3 nos.
Hence, Gain(age) = I(p, n) - E(age) = 0.940 - 0.694 = 0.246
Similarly, Gain(income) ≈ 0.029, Gain(student) ≈ 0.152, Gain(credit_rating) ≈ 0.048
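The gain computation can be checked numerically. The sketch below assumes the (yes, no) counts per attribute value are tallied by hand from the training data on slide 12; the helper names `info` and `expected_info` are illustrative, not from the slides.

```python
from math import log2

def info(*counts):
    """I(s_1, ..., s_m): information needed to classify a tuple, in bits."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def expected_info(partitions):
    """E(A): weighted information over the partitions induced by attribute A."""
    total = sum(p + n for p, n in partitions)
    return sum((p + n) / total * info(p, n) for p, n in partitions)

# (yes, no) counts for each value of each attribute, tallied from slide 12.
partitions = {
    "age": [(2, 3), (4, 0), (3, 2)],        # <=30, middle group, >40
    "income": [(2, 2), (4, 2), (3, 1)],     # high, medium, low
    "student": [(6, 1), (3, 4)],            # yes, no
    "credit_rating": [(6, 2), (3, 3)],      # fair, excellent
}

base = info(9, 5)  # 9 yes and 5 no in the full training set
gains = {attr: base - expected_info(parts) for attr, parts in partitions.items()}

for attr, g in gains.items():
    print(f"Gain({attr}) = {g:.3f}")
best_attribute = max(gains, key=gains.get)
print("best split attribute:", best_attribute)
```

Computed at full precision, Gain(age) comes out near 0.247; subtracting the rounded figures 0.940 and 0.694 gives 0.246. Either way, age has the highest gain and becomes the root.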

19 Computing Information-Gain for Continuous-Value Attributes Let attribute A be a continuous-valued attribute Must determine the best split point for A Sort the values of A in increasing order Typically, the midpoint between each pair of adjacent values is considered as a possible split point (a_i + a_{i+1})/2 is the midpoint between the values of a_i and a_{i+1} The point with the minimum expected information requirement for A is selected as the split-point for A Split: D_1 is the set of tuples in D satisfying A <= split-point, and D_2 is the set of tuples in D satisfying A > split-point
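The split-point search can be sketched as follows. The ages and labels are hypothetical values made up for illustration (they are not from the slides), and `best_split` is an illustrative name.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Return (split_point, expected_info) minimising the expected information."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(n - 1):
        a, b = pairs[i][0], pairs[i + 1][0]
        if a == b:
            continue  # identical adjacent values give no new candidate
        mid = (a + b) / 2
        d1 = [lab for v, lab in pairs if v <= mid]  # D_1: A <= split-point
        d2 = [lab for v, lab in pairs if v > mid]   # D_2: A > split-point
        expected = len(d1) / n * entropy(d1) + len(d2) / n * entropy(d2)
        if best is None or expected < best[1]:
            best = (mid, expected)
    return best

# Hypothetical continuous attribute (age in years) with class labels.
ages = [22, 25, 28, 35, 38, 45, 50]
buys = ["no", "no", "no", "yes", "yes", "yes", "yes"]

split_point, expected = best_split(ages, buys)
print(split_point, expected)
```

With this toy data the midpoint between 28 and 35 separates the classes perfectly, so its expected information requirement is zero.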

20 Extracting Classification Rules from Trees Represent the knowledge in the form of IF-THEN rules One rule is created for each path from the root to a leaf Each attribute-value pair along a path forms a conjunction The leaf node holds the class prediction Rules are easier for humans to understand
Example
IF age = "<=30" AND student = "no" THEN buys_computer = "no"
IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
IF age = "31..40" THEN buys_computer = "yes"
IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"
IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "yes"
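Rule extraction amounts to walking every root-to-leaf path. The sketch below assumes the tree is stored as nested dictionaries (an illustrative representation, not one the slides prescribe) and uses `31..40` as the label of the middle age group; `extract_rules` is a hypothetical helper name.

```python
# Decision tree for buys_computer as nested dicts: {attribute: {value: subtree or leaf}}.
tree = {
    "age": {
        "<=30": {"student": {"no": "no", "yes": "yes"}},
        "31..40": "yes",
        ">40": {"credit_rating": {"excellent": "no", "fair": "yes"}},
    }
}

def extract_rules(node, conditions=()):
    """Create one IF-THEN rule per path from the root to a leaf."""
    if not isinstance(node, dict):
        # Leaf: the attribute-value pairs collected so far form the conjunction.
        lhs = " AND ".join(f"{attr} = {value}" for attr, value in conditions)
        return [f"IF {lhs} THEN buys_computer = {node}"]
    (attr, branches), = node.items()  # each internal node tests exactly one attribute
    rules = []
    for value, subtree in branches.items():
        rules.extend(extract_rules(subtree, conditions + ((attr, value),)))
    return rules

rules = extract_rules(tree)
for rule in rules:
    print(rule)
```

The tree has five leaves, so five rules come out, one per path.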

21 Avoid Overfitting in Classification Overfitting: An induced tree may overfit the training data Too many branches, some may reflect anomalies (noise or outliers) Poor accuracy for unseen samples Two approaches to avoid overfitting Prepruning: Halt tree construction early do not split a node if this would result in the goodness measure falling below a threshold Difficult to choose an appropriate threshold Postpruning: Remove branches from a fully grown tree get a sequence of progressively pruned trees Use a set of data different from the training data to decide which is the best pruned tree

22 Approaches to Determine the Final Tree Size Separate training (2/3) and testing (1/3) sets Use cross validation Use all the data for training but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node may improve the entire distribution
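The first two approaches can be illustrated with index bookkeeping alone. This is a minimal sketch, assuming a shuffled 2/3 to 1/3 holdout split and a simple interleaved k-fold scheme; `holdout_split` and `k_fold` are illustrative names, and the fold assignment is one of several reasonable choices.

```python
import random

def holdout_split(n, train_fraction=2 / 3, seed=0):
    """Shuffle indices 0..n-1 and split them into training and testing sets."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = round(n * train_fraction)
    return idx[:cut], idx[cut:]

def k_fold(n, k):
    """Yield (train, test) index lists; each index is tested exactly once."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, folds[i]

train_idx, test_idx = holdout_split(15)
print(len(train_idx), len(test_idx))

folds = list(k_fold(10, 5))
for train, test in folds:
    print("test fold:", test)
```

In either scheme, each candidate tree size would be scored on the held-out indices only, never on the indices used to grow the tree.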

23 Decision Trees: Assessment Advantages: Classification of data based on limiting features is intuitive Handles discrete/categorical features best Limitations: Danger of overfitting the data Not the best choice for accuracy

24 Bayesian Classification: Why? Probabilistic learning: Calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data. Probabilistic prediction: Predict multiple hypotheses, weighted by their probabilities Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured

25 Bayesian Theorem: Basics Let X be a data sample whose class label is unknown Let H be a hypothesis that X belongs to class C For classification problems, determine P(H|X): the probability that the hypothesis holds given the observed data sample X P(H): prior probability of hypothesis H (i.e., the initial probability before we observe any data; reflects the background knowledge) P(X): probability that the sample data is observed P(X|H): probability of observing the sample X, given that the hypothesis holds

26 Bayesian Theorem
Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:
$$P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)}$$
Informally, this can be written as posterior = likelihood x prior / evidence
MAP (maximum a posteriori) hypothesis, for training data D and candidate hypotheses h in H:
$$h_{MAP} = \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} P(D \mid h)\, P(h)$$
Practical difficulty: requires initial knowledge of many probabilities and incurs significant computational cost.

27 Naive Bayes Classifier
A simplifying assumption: attributes are conditionally independent given the class:
$$P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)$$
The probability that, say, two elements y_1 and y_2 occur together, given that the current class is C, is the product of the probabilities of each element taken separately, given the same class: P([y_1, y_2] | C) = P(y_1 | C) * P(y_2 | C)
No dependence relation between attributes
Greatly reduces the computation cost: only the class distribution needs to be counted.
Once the probability P(X|C_i) is known, assign X to the class with maximum P(X|C_i) * P(C_i)

28 Steps 1. Convert the data set into a frequency table 2. Create Likelihood table by finding the probabilities 3. use Naive Bayesian equation to calculate the posterior probability for each class. The class with the highest posterior probability is the outcome of prediction.

29 Question: Players will play if the weather is sunny. Is this statement correct? We can solve it using the posterior-probability method discussed above. P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny) Here we have P(Sunny|Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, P(Yes) = 9/14 = 0.64 Now, P(Yes|Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which is higher than P(No|Sunny) = 0.40, so the statement is supported.
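The posterior for the weather question can be reproduced exactly with rational arithmetic. This is a small sketch using only the three probabilities quoted on the slide; the variable names are illustrative.

```python
from fractions import Fraction

# Probabilities quoted on the slide.
p_sunny_given_yes = Fraction(3, 9)   # P(Sunny | Yes)
p_yes = Fraction(9, 14)              # P(Yes)
p_sunny = Fraction(5, 14)            # P(Sunny)

# Bayes' theorem: P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
p_no_given_sunny = 1 - p_yes_given_sunny

print(p_yes_given_sunny, float(p_yes_given_sunny))
print(p_no_given_sunny, float(p_no_given_sunny))
```

Working with fractions avoids the small rounding drift that comes from multiplying 0.33, 0.64, and 0.36.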

30 Training dataset
Class: C1: buys_computer = yes; C2: buys_computer = no
Data sample X = (age <= 30, income = medium, student = yes, credit_rating = fair)
Training data: the 14 tuples shown on slide 12.

31 Naive Bayesian Classifier: An Example
Compute P(X|Ci) for each class:
P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4
X = (age <= 30, income = medium, student = yes, credit_rating = fair)
P(X|Ci):
P(X | buys_computer = "yes") = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X | buys_computer = "no") = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci) * P(Ci):
P(X | buys_computer = "yes") * P(buys_computer = "yes") = 0.044 x 9/14 = 0.028
P(X | buys_computer = "no") * P(buys_computer = "no") = 0.019 x 5/14 = 0.007
Therefore, X belongs to class buys_computer = "yes"
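The same numbers can be obtained by counting directly over the training tuples. This is a minimal sketch with no smoothing of zero counts (an assumption that happens to be harmless for this sample); `ROWS` and `naive_bayes_scores` are illustrative names, and the middle age group label `31..40` is assumed because it is missing from the transcript.

```python
from collections import Counter

# Each row: (age, income, student, credit_rating, buys_computer), from slide 12.
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]

def naive_bayes_scores(rows, x):
    """Return P(X | C_i) * P(C_i) for every class C_i, estimated by counting."""
    class_counts = Counter(r[-1] for r in rows)
    scores = {}
    for label, count in class_counts.items():
        score = count / len(rows)  # prior P(C_i)
        for k, value in enumerate(x):
            matches = sum(1 for r in rows if r[-1] == label and r[k] == value)
            score *= matches / count  # P(x_k | C_i), conditional independence assumed
        scores[label] = score
    return scores

x = ("<=30", "medium", "yes", "fair")
scores = naive_bayes_scores(ROWS, x)
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

If any attribute value never co-occurred with a class, its factor would be zero and wipe out the whole product; a Laplace correction is a common remedy in that case.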

32 What Is Prediction? (Numerical) prediction is similar to classification: construct a model, then use the model to predict a continuous or ordered value for a given input Prediction is different from classification Classification refers to predicting a categorical class label Prediction models continuous-valued functions Major method for prediction: regression, which models the relationship between one or more independent or predictor variables and a dependent or response variable Regression analysis Linear and multiple regression Non-linear regression Other regression methods: generalized linear model, Poisson regression, log-linear models, regression trees

33 Linear Regression In statistics, linear regression is an approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variable) denoted x. The case of one explanatory variable is called simple linear regression.

34 [Figure: points plotted on X and Y axes] How would you draw a line through the points? How do you determine which line fits best?

35 Which Is More Logical? [Figure: four panels, each with Sales and Advertising axes]

36 Types of Regression Models
Regression Models
  1 Explanatory Variable: Simple (Linear, Non-Linear)
  2+ Explanatory Variables: Multiple (Linear, Non-Linear)

37 Least Squares
Best fit means the differences between actual Y values and predicted Y values are a minimum. But positive differences offset negative ones, so square the errors:
$$\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} \hat{\varepsilon}_i^2$$
LS (Least Squares) minimizes the sum of the squared differences (errors) (SSE).

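38 Least Squares Graphically
LS minimizes $\sum_{i=1}^{n} \hat{\varepsilon}_i^2 = \hat{\varepsilon}_1^2 + \hat{\varepsilon}_2^2 + \hat{\varepsilon}_3^2 + \hat{\varepsilon}_4^2$
Observed value: $Y_i = \hat{\beta}_0 + \hat{\beta}_1 X_i + \hat{\varepsilon}_i$
Fitted line: $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$
[Figure: fitted line on X and Y axes with residuals $\hat{\varepsilon}_1$ to $\hat{\varepsilon}_4$]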

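39 Coefficient Equations
Prediction equation: $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$
Sample slope: $\hat{\beta}_1 = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$
Sample Y-intercept: $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$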

40 Linear Regression Example Last year, five randomly selected students took a math aptitude test before they began their statistics course. The Statistics Department has three questions. What linear regression equation best predicts statistics performance, based on math aptitude scores? If a student made an 80 on the aptitude test, what grade would we expect her to make in statistics? How well does the regression equation fit the data?

41 Linear Regression Example (Cont.) In the table below, the x_i column shows scores on the aptitude test. Similarly, the y_i column shows statistics grades. The last two rows show sums and mean scores that we will use to conduct the regression analysis.

42 Linear Regression Example (Cont.) The regression equation is a linear equation of the form: ŷ = b_0 + b_1 x. To conduct a regression analysis, we need to solve for b_0 and b_1. Computations are shown below (giving the minimum sum of squared residuals).
b_1 = Σ [ (x_i - x̄)(y_i - ȳ) ] / Σ [ (x_i - x̄)^2 ] = 470/730 = 0.644
b_0 = ȳ - b_1 * x̄ = 77 - (0.644)(78) = 26.768
Therefore, the regression equation is: ŷ = 26.768 + 0.644x

43 Linear Regression Example (Cont.) Using the regression equation: Choose a value for the independent variable (x), perform the computation, and you have an estimated value (ŷ) for the dependent variable. In our example, the independent variable is the student's score on the aptitude test. The dependent variable is the student's statistics grade. If a student made an 80 on the aptitude test, the estimated statistics grade would be: ŷ = 26.768 + 0.644x = 26.768 + 0.644 * 80 = 26.768 + 51.52 = 78.288
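The least-squares computation can be sketched in plain Python. The data table did not survive in this transcript, so the five (x, y) pairs below are an assumption: they were chosen to be consistent with the figures quoted on the slides (means 78 and 77, sums 470 and 730, aptitude range 60 to 95) and should not be read as the original table. `fit_line` is an illustrative name.

```python
def fit_line(xs, ys):
    """Return (b0, b1) minimising the sum of squared residuals for y = b0 + b1 * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b1 = sxy / sxx               # sample slope
    b0 = y_mean - b1 * x_mean    # sample Y-intercept
    return b0, b1

# ASSUMED data, consistent with the sums and means quoted on the slides.
aptitude = [95, 85, 80, 70, 60]   # x_i: math aptitude scores
grades = [85, 95, 70, 65, 70]     # y_i: statistics grades

b0, b1 = fit_line(aptitude, grades)
predicted = b0 + b1 * 80          # estimated grade for an aptitude score of 80

print(f"b1 = {b1:.3f}, b0 = {b0:.3f}, prediction at x = 80: {predicted:.3f}")
```

At full precision the intercept is about 26.781 rather than 26.768, because the slide rounds the slope to 0.644 before computing the intercept; the prediction at x = 80 still agrees to two decimals.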

44 Linear Regression Example (Cont.) Warning: When you use a regression equation, do not use values for the independent variable that are outside the range of values used to create the equation. That is called extrapolation, and it can produce unreasonable estimates. In this example, the aptitude test scores used to create the regression equation ranged from 60 to 95. Therefore, only use values inside that range to estimate statistics grades. Using values outside that range (less than 60 or greater than 95) is problematic.

45 Linear Regression Example (Cont.) Whenever you use a regression equation, you should ask how well the equation fits the data. One way to assess fit is to check the coefficient of determination, which can be computed from the following formula.
$$R^2 = \left( \frac{\frac{1}{N} \sum (x_i - \bar{x})(y_i - \bar{y})}{\sigma_x \sigma_y} \right)^2$$
where N is the number of observations used to fit the model, Σ is the summation symbol, x_i is the x value for observation i, x̄ is the mean x value, y_i is the y value for observation i, ȳ is the mean y value, σ_x is the standard deviation of x, and σ_y is the standard deviation of y.

46 Linear Regression Example (Cont.) Computations for the sample problem of this lesson are shown below. A coefficient of determination equal to 0.48 indicates that about 48% of the variation in statistics grades (the dependent variable) can be explained by the relationship to math aptitude scores (the independent variable). R^2 = 1 indicates that the regression line perfectly fits the data.
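The coefficient of determination can be checked with the same formula. The computations table is not present in this transcript, so the data below is an assumption chosen to match the quoted means (78 and 77) and sums (470 and 730); `r_squared` is an illustrative name, and the standard deviations are population values (divide by N), as the formula's 1/N factor implies.

```python
from math import sqrt

def r_squared(xs, ys):
    """R^2 = ((1/N) * sum((x - x_mean) * (y - y_mean)) / (sigma_x * sigma_y)) ** 2."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sigma_x = sqrt(sum((x - x_mean) ** 2 for x in xs) / n)  # population std. dev.
    sigma_y = sqrt(sum((y - y_mean) ** 2 for y in ys) / n)
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / n
    return (covariance / (sigma_x * sigma_y)) ** 2

# ASSUMED data, consistent with the figures quoted on the slides.
aptitude = [95, 85, 80, 70, 60]
grades = [85, 95, 70, 65, 70]

r2 = r_squared(aptitude, grades)
print(f"R^2 = {r2:.2f}")
```

A perfectly linear data set, such as y = 2x + 1, would give a value of 1 with the same function.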

47 Nonlinear Regression
Some nonlinear models can be modeled by a polynomial function
A polynomial regression model can be transformed into a linear regression model. For example, $y = w_0 + w_1 x + w_2 x^2 + w_3 x^3$ is convertible to linear form with the new variables $x_2 = x^2$, $x_3 = x^3$: $y = w_0 + w_1 x + w_2 x_2 + w_3 x_3$
Other functions, such as the power function, can also be transformed to a linear model
Some models are intractably nonlinear (e.g., a sum of exponential terms); it may still be possible to obtain least-squares estimates through extensive calculation on more complex formulae
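As an illustration of transforming a nonlinear model to a linear one, the power function y = a * x^b becomes linear after taking logarithms: log y = log a + b * log x. The sketch below fits that line by simple least squares; the data is synthetic (generated from a = 2, b = 1.5 purely for illustration), and `fit_line` and `fit_power` are illustrative names.

```python
from math import exp, log

def fit_line(xs, ys):
    """Simple least-squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    b1 = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum(
        (x - x_mean) ** 2 for x in xs
    )
    return y_mean - b1 * x_mean, b1

def fit_power(xs, ys):
    """Fit y = a * x**b by linear regression on (log x, log y); needs x, y > 0."""
    intercept, slope = fit_line([log(x) for x in xs], [log(y) for y in ys])
    return exp(intercept), slope  # a = exp(log a), b = slope

# Synthetic, noise-free data from y = 2 * x**1.5 (hypothetical values).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0 * x ** 1.5 for x in xs]

a, b = fit_power(xs, ys)
print(f"a = {a:.3f}, b = {b:.3f}")
```

With noisy data the fit minimises squared error in log space, which is not the same as minimising squared error on the original scale.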

48 Other Regression-Based Models Generalized linear model: Foundation on which linear regression can be applied to modeling categorical response variables Variance of y is a function of the mean value of y, not a constant Logistic regression: models the probability of some event occurring, with the log-odds of the event expressed as a linear function of a set of predictor variables Poisson regression: models data that exhibit a Poisson distribution Log-linear models (for categorical data): approximate discrete multidimensional probability distributions; also useful for data compression and smoothing Regression trees and model trees: trees that predict continuous values rather than class labels

49 Summary Classification and prediction can be used to extract models describing important data classes or to predict future data trends. Effective and scalable methods have been developed for decision tree induction, Naive Bayesian classification, Bayesian belief networks, rule-based classifiers, Support Vector Machines (SVM), associative classification, nearest neighbor classifiers, and case-based reasoning, as well as other classification methods such as genetic algorithms and rough set and fuzzy set approaches. Linear, nonlinear, and generalized linear models of regression can be used for prediction. Many nonlinear problems can be converted to linear problems by performing transformations on the predictor variables. Regression trees and model trees are also used for prediction.

50 Enhancements to Basic Decision Tree Induction Allow for continuous-valued attributes Dynamically define new discrete-valued attributes that partition the continuous attribute value into a discrete set of intervals Handle missing attribute values Assign the most common value of the attribute Assign probability to each of the possible values Attribute construction Create new attributes based on existing ones that are sparsely represented This reduces fragmentation, repetition, and replication

51 Classification in Large Databases Classification: a classical problem extensively studied by statisticians and machine learning researchers Scalability: Classifying data sets with millions of examples and hundreds of attributes with reasonable speed Why decision tree induction in data mining? relatively fast learning speed (compared with other classification methods) convertible to simple and easy-to-understand classification rules can use SQL queries for accessing databases classification accuracy comparable with other methods