Machine learning for pervasive systems: Decision trees

Department of Communications and Networking, Aalto University, School of Electrical Engineering
stephan.sigg@aalto.fi
Version 1.5

Lecture overview
02.11.2016  Introduction
09.11.2016  Decision Trees
16.11.2016  Regression
23.11.2016  Training Principles and Performance Evaluation
30.11.2016  Artificial Neural Networks
07.12.2016  High dimensional data
14.12.2016  Instance-based learning
21.12.2016  No lecture (ELEC christmas party)
11.01.2017  Probabilistic Graphical Models
18.01.2017  Unsupervised learning
25.01.2017  Anomaly detection, online-learning, recommender systems
01.02.2017  Trend mining and Topic Models
08.02.2017  Feature selection
15.02.2017  Presentations

Outline
- Decision Tree
- Optimizing Decision Trees
- Confidence on a prediction
- Improving classification results of decision trees

Decision tree
A decision tree is a tree that divides the examples of a dataset according to the features and the classes observed for them.

Decision tree
How to generate such a decision tree? First select a feature to split on and place it at the root node; then repeat this procedure for all child nodes.
How to determine the feature to split on?

Decision tree
Counts of the class "At work" (yes / no) for each feature value:

  WiFi:           <3 APs   3 / 7    [3, 5]    5 / 5    >5 APs   8 / 2
  Accelerometer:  walking  4 / 8    standing  1 / 4    sitting  11 / 2
  Audio:          quiet    8 / 5    medium    6 / 3    loud     2 / 6
  Light:          outdoor  4 / 7    indoor    12 / 7
  At work total:  16 yes / 14 no

Which one is the best choice?

Decision tree
We are interested in the gain in information when a particular choice is taken. The decision tree should then decide for the split that promises the maximum information gain.

Decision tree
This can be estimated by the entropy of a value:
$E(p_1, p_2, \ldots, p_n) = -p_1 \log_2 p_1 - p_2 \log_2 p_2 - \cdots - p_n \log_2 p_n$

Decision tree
WiFi information value:
$E\left(\tfrac{3}{10}, \tfrac{7}{10}\right) \cdot \tfrac{10}{30} + E\left(\tfrac{5}{10}, \tfrac{5}{10}\right) \cdot \tfrac{10}{30} + E\left(\tfrac{8}{10}, \tfrac{2}{10}\right) \cdot \tfrac{10}{30}$
$= \left(-\tfrac{3}{10}\log_2\tfrac{3}{10} - \tfrac{7}{10}\log_2\tfrac{7}{10}\right) \tfrac{10}{30} + \left(-\tfrac{5}{10}\log_2\tfrac{5}{10} - \tfrac{5}{10}\log_2\tfrac{5}{10}\right) \tfrac{10}{30} + \left(-\tfrac{8}{10}\log_2\tfrac{8}{10} - \tfrac{2}{10}\log_2\tfrac{2}{10}\right) \tfrac{10}{30}$
$\approx 0.868$

Decision tree
Information value: WiFi: 0.868, Acc: 0.756, Audio: 0.884, Light: 0.948
Initial information value (working [yes/no]): 0.997
Information gain (initial value minus information value): WiFi: 0.129, Acc: 0.241, Audio: 0.113, Light: 0.049
The accelerometer offers the highest information gain and is therefore chosen as the first split.
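
For concreteness, the following minimal Python sketch (not part of the original slides) recomputes these information values and gains from the class counts in the table above.

```python
import math

def entropy(counts):
    """Entropy (in bits) of a class-count distribution."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# "At work" (yes, no) counts per feature value, taken from the example table
features = {
    "WiFi":          {"<3 APs": (3, 7), "[3, 5]": (5, 5), ">5 APs": (8, 2)},
    "Accelerometer": {"walking": (4, 8), "standing": (1, 4), "sitting": (11, 2)},
    "Audio":         {"quiet": (8, 5), "medium": (6, 3), "loud": (2, 6)},
    "Light":         {"outdoor": (4, 7), "indoor": (12, 7)},
}

total = 30                       # 16 "yes" + 14 "no" instances
initial = entropy((16, 14))      # ~0.997 bits

for name, splits in features.items():
    # information value = weighted average entropy of the subsets after the split
    info = sum(sum(c) / total * entropy(c) for c in splits.values())
    print(f"{name}: information value {info:.3f}, gain {initial - info:.3f}")
```

Running it prints, for example, "Accelerometer: information value 0.756, gain 0.241", matching the figures on the slide.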

Decision tree (figure)

Graphical interpretation: Decision tree (figure)

Remark: an alternative to information gain is the Gini impurity.
The Gini impurity describes how often a randomly chosen sample would be incorrectly labelled if it were labelled randomly according to the distribution of labels in the subset. Let $p_i$ be the fraction of samples with label $i$ in the subset. The Gini impurity is then computed as
$I_G = \sum_{i=1}^{n} p_i (1 - p_i)$
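
A correspondingly small sketch (again not from the slides) that evaluates the Gini impurity for some class counts of the running example:

```python
def gini(counts):
    """Gini impurity of a class-count distribution."""
    total = sum(counts)
    return sum(c / total * (1 - c / total) for c in counts)

print(gini((16, 14)))   # whole dataset (16 at work, 14 not): ~0.498
print(gini((11, 2)))    # "sitting" subset (11 at work, 2 not): ~0.260
```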

Outline: Optimizing Decision Trees

Optimization
Improved decision tree implementation:
- Dealing with numeric values
- Missing values
- Noisy data

Optimization: numeric values
Nominal feature values: for nominal features, the decision tree splits on every possible value. Therefore, the information content of this feature is 0 after such a branch has been conducted.
Numeric feature values: for numeric feature values, splitting on each possible value would lead to a very wide tree of small depth.

Optimization: numeric values
For numeric features, the value range is therefore split into several intervals, e.g. by testing against one or more threshold values.
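
As an illustration, here is a minimal sketch of one common strategy for a binary numeric split: sort the values, consider midpoints between consecutive distinct values as candidate thresholds, and keep the threshold with the highest information gain. The candidate-midpoint strategy and the toy data are assumptions for illustration, not necessarily the exact C4.5 procedure.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return the binary split threshold t (test: value <= t) with maximum
    information gain, together with that gain."""
    base = entropy(labels)
    pairs = sorted(zip(values, labels))
    best_t, best_gain = None, 0.0
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                          # no boundary between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        info = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if base - info > best_gain:
            best_t, best_gain = t, base - info
    return best_t, best_gain

# Hypothetical numeric feature: number of visible WiFi access points
aps     = [1, 2, 2, 4, 5, 6, 7, 8]
at_work = ["no", "no", "no", "no", "yes", "yes", "yes", "no"]
print(best_threshold(aps, at_work))           # (4.5, ...) for this toy data
```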

Optimization: missing values
Missing values are a common/prominent event in real-world data sets:
- participants in a survey refuse to answer
- malfunctioning sensor nodes
- biology: plants or animals might die before all variables have been measured
- ...
Most machine learning schemes make the implicit assumption that there is no significance in the fact that a certain value is missing.
The absence of data might already hold valuable information!

Optimization: missing values
The absence of data might already hold valuable information!
Example: people analyzing medical databases have noticed that cases may, in some circumstances, be diagnosable simply from the tests that a doctor decides to make, regardless of the outcome of the tests (Witten et al., Data Mining, Morgan Kaufmann, 2011).

Optimization: missing values
Possible solutions:
- Add another feature describing whether the value is missing or not
- Split the instance at the missing feature (a sketch of this option follows below):
  1. propagate all instance fragments (weighted with the respective frequency observed from the training samples) down to the leaves
  2. combine the results at the leaf nodes according to the weighting of the instance fragments
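
A minimal sketch of the second option, using hypothetical branch weights and leaf class distributions loosely derived from the accelerometer counts of the example table:

```python
# Fractional propagation for a missing split value: the instance is sent down
# every branch, weighted by how often that branch was taken during training,
# and the class distributions at the leaves are combined with those weights.

branch_weights = {"walking": 12 / 30, "standing": 5 / 30, "sitting": 13 / 30}
leaf_distributions = {                       # P(at work | branch) from training counts
    "walking":  {"yes": 4 / 12,  "no": 8 / 12},
    "standing": {"yes": 1 / 5,   "no": 4 / 5},
    "sitting":  {"yes": 11 / 13, "no": 2 / 13},
}

def classify_with_missing_value(branch_weights, leaf_distributions):
    """Combine leaf class distributions for an instance whose split value is missing."""
    combined = {}
    for branch, weight in branch_weights.items():
        for label, prob in leaf_distributions[branch].items():
            combined[label] = combined.get(label, 0.0) + weight * prob
    return combined

print(classify_with_missing_value(branch_weights, leaf_distributions))
# -> {'yes': ~0.53, 'no': ~0.47}: predict "yes"
```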

Optimization: noisy data
Fully expanded decision trees often contain unnecessary structure that should be simplified before deployment.
Pruning:
- Prepruning: trying to decide during the tree-building process when to stop developing subtrees. This might speed up the tree creation phase, but it is difficult to spot dependencies between features at this stage (features might be meaningful together but not on their own).
- Postpruning: simplification of the decision tree after the tree has been created.

Optimization: noisy data. Postpruning by subtree replacement:
- Select some subtrees and replace them with single leaves
- This will reduce accuracy on the training set
- It may increase accuracy on an independently chosen test set (reduction of noise)

Optimization: noisy data. Postpruning by subtree raising: a complete subtree is raised one level, and the samples at the nodes of the subtree have to be recalculated.

Optimization: estimating error rates
When should we raise or replace subtrees?
Estimating error rates: estimate the error rates at internal nodes and leaf nodes.
Assumption: the label of a node is chosen as the majority vote over all its leaves. This will lead to a certain number of errors E out of the total number of instances N.
Assume further that
1. the true probability of error at that node is q, and
2. the N instances are generated by a Bernoulli process with parameter q, of which E turn out to be errors.

Optimization: estimating error rates (success probability)
Given a confidence c (C4.5 uses 25%), we find a confidence limit z (for c = 25%, z = 0.69) such that
$P\left[\frac{\hat{q} - q}{\sqrt{q(1-q)/N}} > z\right] = c$
with the observed error rate $\hat{q} = E/N$.
This leads to a pessimistic error rate e as an upper confidence limit for q (solving the equation for q):
$e = \frac{\hat{q} + \frac{z^2}{2N} + z\sqrt{\frac{\hat{q}}{N} - \frac{\hat{q}^2}{N} + \frac{z^2}{4N^2}}}{1 + \frac{z^2}{N}}$

Optimization: estimating error rates. Example:
- Lower left leaf (E = 2, N = 6): utilising the formula for e, we obtain $\hat{q} = 0.33$ and e = 0.47
- Center leaf (E = 1, N = 2): e = 0.72
- Right leaf (E = 2, N = 6): e = 0.47
- Combined error estimate: utilising the ratio 6:2:6, this leads to a combined estimate of $0.47 \cdot \tfrac{6}{14} + 0.72 \cdot \tfrac{2}{14} + 0.47 \cdot \tfrac{6}{14} \approx 0.51$
- Error estimate for the parent node ($\hat{q} = \tfrac{5}{14}$): e = 0.46
- Since 0.46 < 0.51, the children are pruned away.
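
The following sketch (not from the slides) evaluates the pessimistic error estimate for these nodes with z = 0.69; minor rounding differences relative to the figures above are possible.

```python
import math

def pessimistic_error(E, N, z=0.69):
    """Upper confidence limit e for the error rate, given E errors out of N
    instances; z = 0.69 corresponds to the C4.5 default confidence c = 25%."""
    q = E / N                                 # observed error rate
    num = q + z**2 / (2 * N) + z * math.sqrt(q / N - q**2 / N + z**2 / (4 * N**2))
    return num / (1 + z**2 / N)

leaves = [(2, 6), (1, 2), (2, 6)]             # (E, N) of the three leaves
estimates = [pessimistic_error(E, N) for E, N in leaves]
combined = sum(e * N for e, (_, N) in zip(estimates, leaves)) / sum(N for _, N in leaves)

print([round(e, 2) for e in estimates])       # ~[0.47, 0.72, 0.47]
print(round(combined, 2))                     # ~0.51
print(round(pessimistic_error(5, 14), 2))     # parent node estimate
# Prune the subtree whenever the parent's estimate is below the combined estimate.
```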

C4.5: further heuristics employed
- Postpruning: confidence value c = 25%
- Postpruning split threshold: candidate splits on a numeric feature are only considered when at least min(10% of all training samples, 25) are cut off by the split
- Prepruning with information gain: given S candidate splits on a certain numeric attribute, $\log_2(S)/N$ is subtracted from the information gain in order to prevent overfitting; when the information gain is negative, tree construction stops

C4.5: remarks on postpruning
- Postpruning in C4.5 is very fast and therefore popular.
- However, the statistical assumptions are shaky: the use of an upper confidence limit, the assumption of a normal distribution for the error rate calculation, and the use of statistics from the training set.
- Often the algorithm does not prune enough, and better performance can be achieved with a more compact decision tree.

Outline: Confidence on a prediction

Confidence on a prediction
Assume we measure the error of a classifier on a test set and obtain an error rate of q, i.e. an observed success rate of $\hat{p} = 1 - q$. What can we say about the true success rate? It will be close to $\hat{p}$, but how close (within 5% or 10%)? This depends on the size of the test set: naturally, we are more confident in the estimated success probability when it is based on a large number of values.

Confidence on a prediction
In statistics, a succession of independent events that either succeed or fail is called a Bernoulli process (repeated coin flipping, possibly with an unfair coin).
Assume that out of N events, S are successful. Then we have an observed success rate of $\hat{p} = S/N$.
Confidence interval: the true success rate p lies within an interval around $\hat{p}$ with a specified confidence.

Confidence on a prediction
The probability that a random variable $\chi$ with zero mean lies within a certain confidence range of width 2z is $P[-z \leq \chi \leq z] = c$.
Confidence limits for the normal distribution are, e.g.:

  P[χ ≥ z]   0.001   0.005   0.01   0.05   0.1    0.2    0.4
  z          3.09    2.58    2.33   1.65   1.28   0.84   0.25

The standard assumption in such tables is that the random variable has mean 0 and variance 1.

Confidence on a prediction
The z figures are measured in standard deviations from the mean.
Interpretation: the figure for $P[\chi \geq z] = 0.05$, for example, implies that there is a 5% chance that $\chi$ lies more than 1.65 standard deviations above the mean. Since the distribution is symmetric, the chance that $\chi$ lies more than 1.65 standard deviations from the mean in either direction is 10%: $P[-1.65 \leq \chi \leq 1.65] = 0.9$.

Confidence on a prediction
In order to apply this to the random variable $\hat{p}$, we have to transform it to have zero mean and unit variance: subtract the mean p and divide by the standard deviation $\sqrt{p(1-p)/N}$.
This leads to
$P\left[-z < \frac{\hat{p} - p}{\sqrt{p(1-p)/N}} < z\right] = c$

Confidence on a prediction
To find confidence limits, given a particular confidence figure c:
- Consult a table of confidence limits for the normal distribution for the corresponding value z. Note: since the table lists one-sided tail probabilities, we have to subtract our value c from 1 and divide by two, i.e. look up $\frac{1-c}{2}$.
- Then write the inequality above as an equality and invert it to find an expression for p.
- Finally, solving a quadratic equation produces the respective values for p.

Confidence on a prediction
Solving the quadratic equation produces the respective values for p:
$p = \frac{\hat{p} + \frac{z^2}{2N} \pm z\sqrt{\frac{\hat{p}}{N} - \frac{\hat{p}^2}{N} + \frac{z^2}{4N^2}}}{1 + \frac{z^2}{N}}$
The resulting two values are the upper and lower confidence boundaries.

Confidence on a prediction. Examples:
- $\hat{p} = 0.75$, N = 1000, c = 0.8 (z = 1.28): p ∈ [0.732, 0.767]
- $\hat{p} = 0.75$, N = 100, c = 0.8 (z = 1.28): p ∈ [0.691, 0.801]
Note that the assumptions made are only valid for large N.
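
A small sketch (not from the slides) that evaluates these bounds with the formula above:

```python
import math

def confidence_bounds(p_hat, N, z):
    """Lower and upper confidence bounds for the true success rate, given the
    observed success rate p_hat on N test instances and the confidence limit z
    (e.g. z = 1.28 for c = 0.8)."""
    center = p_hat + z**2 / (2 * N)
    spread = z * math.sqrt(p_hat / N - p_hat**2 / N + z**2 / (4 * N**2))
    denom = 1 + z**2 / N
    return (center - spread) / denom, (center + spread) / denom

print(confidence_bounds(0.75, 1000, 1.28))   # ~(0.732, 0.767)
print(confidence_bounds(0.75, 100, 1.28))    # ~(0.691, 0.801)
```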

Outline: Improving classification results of decision trees

Bottom-line: decision trees
Strengths:
- Simple, intuitive approach
- Robust to the inclusion of irrelevant features
- Invariant under (monotone) transformations of individual features, e.g. scaling
Weaknesses:
- Tendency to overfit
- Often complex, deep trees even for simple, linearly separable classes

Improving classification results: tree bagging
Bootstrap aggregating (bagging) for decision trees builds multiple trees (typically several hundreds or thousands) from random subsets of the training set (random samples drawn with replacement). Predictions are made by majority vote or by combining and averaging the predicted probabilities. Bagging reduces variance without affecting bias.

Improving classification results: random forests
Random forests exploit tree bagging and, in addition, use a random subset of the features at each candidate split in order to reduce the impact of strong features. (Strong features may lead to dependent trees and thus impair the benefits of tree bagging.)

Improving classification results: Extra Trees
A way to generate extremely randomized trees (Extra Trees) is to build a random forest but, in addition, to choose the cut-point of each feature split at random and select among these random splits (based on information gain or Gini impurity) instead of searching deterministically for the best split.
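
To illustrate these ensemble variants, here is a sketch that compares a single decision tree with bagged trees, a random forest and extremely randomized trees by cross-validation. It assumes scikit-learn is available (the lecture does not prescribe a library), and the synthetic dataset and parameter choices are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic toy data standing in for the sensor features of the lecture example
X, y = make_classification(n_samples=1000, n_features=20, n_informative=8, random_state=0)

models = {
    "single tree": DecisionTreeClassifier(random_state=0),
    "bagged trees": BaggingClassifier(n_estimators=100, random_state=0),   # bags decision trees by default
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "extra trees": ExtraTreesClassifier(n_estimators=100, random_state=0),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```

On data of this kind, the ensemble variants typically reduce the variance of the single tree and achieve higher cross-validated accuracy.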

Questions?
stephan.sigg@aalto.fi
Le Nguyen, le.ngu.nguyen@aalto.fi

Literature
- C.M. Bishop: Pattern Recognition and Machine Learning, Springer, 2007.
- R.O. Duda, P.E. Hart, D.G. Stork: Pattern Classification, Wiley, 2001.