Increasing Classification Accuracy. Data Mining: Bagging and Boosting. Bagging 1. Bagging 2. Bagging. Boosting Meta-learning (stacking)



Data Mining: Bagging and Boosting
Increasing Classification Accuracy

Andrew Kusiak
2139 Seamans Center
Iowa City, Iowa
edu/~ankusiak
Tel: 319-335 5934

Topics: Bagging, Boosting, Meta-learning (stacking)

Bagging 1
Corporate decision-making analogy
A new manager seeks the advice of experts in areas in which s/he does not have expertise.
The skills of the advisers should complement each other rather than duplicate each other.
The analogy applies also to boosting.

Bagging 2
Training data -> bootstrap scheme -> Sample 1, Sample 2, Sample 3
Sample 1 -> Classifier 1, Sample 2 -> Classifier 2, Sample 3 -> Classifier 3
Classifiers 1, 2, 3 -> combined classifier (voting scheme) -> decision
$(1 - 1/n)^n \approx e^{-1} = 0.368$, where e is the base of the natural logarithm
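The value 0.368 on the bootstrap slide can be checked numerically. The sketch below assumes the usual reading of the expression: when n objects are drawn with replacement from a set of n objects, $(1 - 1/n)^n$ is the probability that one particular object is never drawn. The function name `prob_never_drawn` is hypothetical and used only for illustration.

```python
import math

def prob_never_drawn(n: int) -> float:
    """Probability that one given object is missed by n draws with replacement
    from a set of n objects (assumed interpretation of the slide's formula)."""
    return (1.0 - 1.0 / n) ** n

# The value approaches e^-1 as the set size n grows.
for n in (10, 100, 1000):
    print(n, round(prob_never_drawn(n), 4))
print("e^-1 =", round(math.exp(-1), 4))
```

Under this reading, each bootstrap sample leaves out roughly 37% of the objects, which is one reason the classifiers trained on different samples differ from one another.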

Bagging 3
Bagging Procedure
Classifier generation
Step 1. Create t data sets from a database by applying the sampling-with-replacement scheme.
Step 2. Apply a learning algorithm to each sample training set.
Classification
Step 3. For an object with an unknown decision, make a prediction with each of the t classifiers.
Step 4. Select the most frequently predicted decision.

Bagging 4
Classification: voting scheme
Prediction: averaging scheme
Also used: bagging with costs, and randomization schemes within learning algorithms (e.g., features with equal gain value)

Bagging 5
The effect of combining different classifiers (hypotheses) can be explained with the theory of bias-variance decomposition.
Bias: an error due to the learning algorithm
Variance: an error due to the learned model (data set related)
The total expected error of a classifier = Bias + Variance

Boosting 1
Bagging: individual models are built separately.
Boosting: combines models of the same type (e.g., decision trees) and is iterative, i.e., a new model is influenced by the performance of the previously built model.
Boosting uses voting or averaging (similar to bagging).
Different boosting algorithms exist.
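The four steps of the bagging procedure can be sketched as follows. This is a minimal illustration, not the slides' own implementation: the base learner `learn_1nn` (a one-feature nearest-neighbour rule), the function names, and the toy data are all assumptions chosen to keep the example self-contained.

```python
import random
from collections import Counter

def learn_1nn(xs, ys):
    """Hypothetical base learner: 1-nearest-neighbour on one numeric feature."""
    pairs = list(zip(xs, ys))
    def classify(x):
        return min(pairs, key=lambda p: abs(p[0] - x))[1]
    return classify

def bagging_fit(xs, ys, learn, t, seed=0):
    """Steps 1 and 2: build t classifiers, one per bootstrap sample."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(t):
        # Step 1: draw n indices with replacement (one bootstrap sample).
        idx = [rng.randrange(n) for _ in range(n)]
        # Step 2: apply the learning algorithm to that sample.
        models.append(learn([xs[i] for i in idx], [ys[i] for i in idx]))
    return models

def bagging_predict(models, x):
    """Steps 3 and 4: collect one prediction per classifier, return the most frequent."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Toy data (assumed): low feature values are "Low", high values are "High".
xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = ["Low", "Low", "Low", "High", "High", "High"]
models = bagging_fit(xs, ys, learn_1nn, t=25)
print(bagging_predict(models, 1.5), bagging_predict(models, 11.5))
```

For numeric prediction, the same structure applies with the vote in `bagging_predict` replaced by an average of the individual predictions.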

Boosting 2
Method: AdaBoost.M1, which is widely used.
Assumption: the learning algorithm can handle weighted instances (usually handled by randomization schemes for selection of training data subsets).
By weighting instances, the learning algorithm can concentrate on instances with high weights (called hard instances), i.e., incorrectly classified instances.

Boosting 3
AdaBoost.M1 Algorithm (Outline)
All instances are equally weighted.
A learning algorithm is applied.
The weight of incorrectly classified examples is increased (hard instances); the weight of correctly classified examples is decreased (easy instances).
The algorithm concentrates on incorrectly classified hard instances.
Some hard instances become harder, some softer.
A series of diverse experts (classifiers) is generated based on the reweighted data.

Boosting 4
AdaBoost.M1 Algorithm (Steps)
Classifier generation
Step 0. Set the weight value, w = 1, and assign it to each object in the training data set.
For each of t iterations, perform:
Step 1. Apply a learning algorithm to the weighted training data set.
Step 2. Compute the classification error e for the weighted training data set. If e = 0 or e >= 0.5, then terminate the classifier generation process and go to Step 4; otherwise multiply the weight w of each correctly classified object by $e/(1 - e)$ and normalize the weights of all objects.
Classification
Step 4. Assign weight q = 0 to each decision (class) to be predicted.
Step 5. For each of the t (or fewer) classifiers, add $-\log\big(e/(1 - e)\big)$ to the weight of the decision predicted by the classifier, and output the decision with the highest weight.

Boosting 4 (continued)
For e = 0, all training examples (objects) are correctly classified (a perfect classifier) and therefore there is no reason to modify the object weights, i.e., for $e/(1 - e) = 0$ all new weights w become 0.
For e = 0.5, the expression $\log\big(e/(1 - e)\big) = 0$, and therefore the weights q = 0 are not modified and no decision is generated, due to the high classification error e.
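The AdaBoost.M1 steps can be sketched as below. The sketch follows the reading in which only correctly classified objects have their weights multiplied by $e/(1 - e)$ and each classifier votes with weight $-\log(e/(1 - e))$, consistent with the outline's statement that easy instances lose weight. The weak learner `learn_stump` (a single-threshold rule on one numeric feature), the function names, and the toy data are assumptions for illustration only.

```python
import math

def learn_stump(xs, ys, w):
    """Hypothetical weak learner: threshold rule with the lowest weighted error."""
    classes = sorted(set(ys))
    best = None
    for thr in sorted(set(xs)):
        for left in classes:
            for right in classes:
                err = sum(wi for x, y, wi in zip(xs, ys, w)
                          if (left if x <= thr else right) != y)
                if best is None or err < best[0]:
                    best = (err, thr, left, right)
    _, thr, left, right = best
    return lambda x: left if x <= thr else right

def adaboost_m1_fit(xs, ys, learn, t):
    w = [1.0] * len(xs)                        # Step 0: equal weights
    ensemble = []                              # pairs (classifier, error e)
    for _ in range(t):
        model = learn(xs, ys, w)               # Step 1
        wrong = [model(x) != y for x, y in zip(xs, ys)]
        e = sum(wi for wi, bad in zip(w, wrong) if bad) / sum(w)   # Step 2
        if e == 0 or e >= 0.5:
            break                              # stop generating classifiers
        ensemble.append((model, e))
        # Shrink the weights of correctly classified objects, then normalize.
        w = [wi if bad else wi * e / (1 - e) for wi, bad in zip(w, wrong)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def adaboost_m1_predict(ensemble, x):
    q = {}                                     # Step 4: decision weights start at 0
    for model, e in ensemble:                  # Step 5
        d = model(x)
        q[d] = q.get(d, 0.0) - math.log(e / (1 - e))
    return max(q, key=q.get) if q else None    # None if no classifier was kept

# Toy data (assumed): no single threshold separates the two classes.
xs = [1, 2, 3, 4, 5, 6]
ys = ["A", "A", "B", "B", "A", "A"]
ensemble = adaboost_m1_fit(xs, ys, learn_stump, t=3)
print([adaboost_m1_predict(ensemble, x) for x in xs])
```

Note that if the very first classifier already has e = 0, this sketch keeps no classifier and `adaboost_m1_predict` returns `None`; a practical implementation would need to handle that case separately.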

Meta-learning 1
[Diagram: training data 1 and training data 2, classifier 1 and classifier 2, their decisions, meta-training data, meta-learning, meta-classifier, test data]

Creating Meta-training Data
Voting: each classifier gets one vote and the majority wins.
Weighted voting: provides preferential treatment to some voting classifiers.
Arbitration: an arbitrator makes a selection if the classifiers cannot reach a consensus.
Combining: decisions produced by different classifiers are combined as one decision.

Example (1)
Object No.   Feature vector   Decision
1            Vector 1         High
2            Vector 2         Low
3            Vector 3         High

Example (2)
Predictions of classifiers 1 and 2 for the training data set
Object No.   Classifier 1 prediction   Classifier 2 prediction
1            High                      High
2            High                      Low
3            Low                       Low
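The first three ways of combining decisions can be sketched in a few lines. This is an illustrative sketch under stated assumptions: "cannot reach a consensus" is interpreted here as "no single decision has the most votes", and the function names and the weights in the usage lines are hypothetical.

```python
from collections import Counter

def vote(decisions):
    """Each classifier gets one vote; return the majority decision, or None on a tie."""
    ranked = Counter(decisions).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]

def weighted_vote(decisions, weights):
    """Sum the weight of each classifier behind its decision; the largest total wins."""
    totals = {}
    for decision, weight in zip(decisions, weights):
        totals[decision] = totals.get(decision, 0.0) + weight
    return max(totals, key=totals.get)

def arbitrate(decisions, arbiter_decision):
    """Fall back on the arbiter's decision only when plain voting yields no winner."""
    winner = vote(decisions)
    return winner if winner is not None else arbiter_decision

print(vote(["High", "High", "Low"]))
print(weighted_vote(["High", "Low"], [0.3, 0.7]))
print(arbitrate(["High", "Low"], "Low"))
```

The fourth option, combining, differs in kind: the base decisions become input features of a new training set for a meta-classifier rather than being merged by a fixed rule.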

Example (3)
Training set generated by the class-combiner scheme
Object No.   Classifier 1 prediction, Classifier 2 prediction   Decision
1            High, High                                         High
2            High, Low                                          Low
3            Low, Low                                           High

Example (4)
Training set generated by the class-attribute-combiner scheme
Object No.   Classifier 1 prediction, Classifier 2 prediction, feature vector   Decision
1            High, High, Vector 1                                               High
2            High, Low, Vector 2                                                Low
3            Low, Low, Vector 3                                                 High

Example (5), Example (6)
Binary form of the predictions produced by classifier 1
Object No.   Classifier 1 prediction   Feature = High   Feature = Low   Decision
1            High                      Yes              No              High
2            High                      Yes              No              Low
3            Low                       No               Yes             High

Training set generated by the binary class-attribute-combiner scheme
Object No.   Feature vector       Decision
1            Yes, No, Yes, No     High
2            Yes, No, No, Yes     Low
3            No, Yes, No, Yes     High
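The meta-training sets in these examples can be generated mechanically from the base predictions. The sketch below uses the three objects from the examples; the function names are hypothetical, and the binary scheme is implemented as the examples show it (one Yes/No feature per classifier and class).

```python
vectors = ["Vector 1", "Vector 2", "Vector 3"]   # original feature vectors
decisions = ["High", "Low", "High"]              # true decisions
pred1 = ["High", "High", "Low"]                  # classifier 1 predictions
pred2 = ["High", "Low", "Low"]                   # classifier 2 predictions

def class_combiner(p1, p2, y):
    """Meta-features are the base predictions only."""
    return [((a, b), d) for a, b, d in zip(p1, p2, y)]

def class_attribute_combiner(p1, p2, x, y):
    """Meta-features are the base predictions plus the original feature vector."""
    return [((a, b, v), d) for a, b, v, d in zip(p1, p2, x, y)]

def to_binary(prediction, classes=("High", "Low")):
    """One Yes/No feature per class: does the prediction equal that class?"""
    return tuple("Yes" if prediction == c else "No" for c in classes)

def binary_combiner(p1, p2, y):
    """Meta-features are the binary forms of both predictions, concatenated."""
    return [(to_binary(a) + to_binary(b), d) for a, b, d in zip(p1, p2, y)]

for row in binary_combiner(pred1, pred2, decisions):
    print(row)
```

A meta-classifier would then be trained on one of these sets, treating the first element of each pair as the feature vector and the second as the decision.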

Meta-learners
Integration of knowledge learned from different and distributed databases.
Elimination of inductive bias.
Extraction of high-level models.
Scalability to hierarchical meta-learning.

Distributed Data
Distributed by partitioning
Distributed by nature
Homogeneous ($\Theta_i = \Theta_j$, $i \ne j$: all learners share the same distribution)
Heterogeneous ($\Theta_i \ne \Theta_j$, $i \ne j$)

Populations from homogeneously distributed data sets
$\Theta_i = \Theta_j = \Theta$
[Diagram: data distributed as $P(D \mid \Theta)$; L-learner 1 ($\Theta_1$), L-learner 2 ($\Theta_2$), ..., L-learner n ($\Theta_n$); M-learner ($\Theta$)]

Populations from heterogeneously distributed data sets
$\Theta_i \ne \Theta_j$
[Diagram: data distributed as $P(D_1 \mid \Theta_1)$, $P(D_2 \mid \Theta_2)$, ..., $P(D_n \mid \Theta_n)$; L-learner 1 ($\Theta_1^t$), L-learner 2 ($\Theta_2^t$), ..., L-learner n ($\Theta_n^t$); M-learner $(\Theta, \mu)^t$]
t = step number
μ models interrelationships between distributions of the local data

Gini Index 1
S = data set with n objects
c = number of classes in S
$p_j$ = relative frequency of class j in S
$gini(S) = 1 - \sum_{j=1}^{c} p_j^2$

Gini Index 2
$S_1$ = partition 1 of S
$n_1$ = number of objects in $S_1$
$S_2$ = partition 2 of S
$n_2$ = number of objects in $S_2$, where $n_2 = n - n_1$
a = splitting criterion
$gini(S, a) = \frac{n_1}{n}\, gini(S_1) + \frac{n_2}{n}\, gini(S_2)$
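Both Gini formulas translate directly into code. The sketch below assumes a data set is represented simply as a list of class labels; the function names `gini` and `gini_split` and the toy labels are hypothetical.

```python
from collections import Counter

def gini(labels):
    """gini(S) = 1 - sum over classes j of p_j^2, with p_j the relative frequency of class j."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_split(s1, s2):
    """gini(S, a) = n1/n * gini(S1) + n2/n * gini(S2) for a split of S into S1 and S2."""
    n1, n2 = len(s1), len(s2)
    n = n1 + n2
    return n1 / n * gini(s1) + n2 / n * gini(s2)

s = ["High", "High", "Low", "Low"]
print(gini(s))                       # impurity of the unsplit set
print(gini_split(s[:2], s[2:]))      # split that separates the classes
print(gini_split(s[::2], s[1::2]))   # split that leaves both parts mixed
```

A lower value of `gini_split` indicates purer partitions, so among candidate splitting criteria the one with the smallest value would be preferred.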
