# Fine Particulate Matter Concentration Level Prediction by using Tree-based Ensemble Classification Algorithms

Save this PDF as:

Size: px
Start display at page:

Download "Fine Particulate Matter Concentration Level Prediction by using Tree-based Ensemble Classification Algorithms"

## Transcription

1 Fine Particulate Matter Concentration Level Prediction by using Tree-based Ensemble Classification Algorithms Yin Zhao School of Mathematical Sciences Universiti Sains Malaysia (USM) Penang, Malaysia Yahya Abu Hasan School of Mathematical Sciences Universiti Sains Malaysia (USM) Penang, Malaysia Abstract Pollutant forecasting is an important problem in the environmental sciences. Data mining is an approach to discover knowledge from large data. This paper tries to use data mining methods to forecast concentration level, which is an important air pollutant. There are several tree-based classification algorithms available in data mining, such as CART, C4.5, Random Forest (RF) and C5.0. RF and C5.0 are popular ensemble methods, which are, RF builds on CART with Bagging and C5.0 builds on C4.5 with Boosting, respectively. This paper builds concentration level predictive models based on RF and C5.0 by using R packages. The data set includes period data in a new town of Hong Kong. The concentration is divided into 2 levels, the critical points is 25μg/ (24 hours mean). According to 100 times 10-fold cross validation, the best testing accuracy is from RF model, which is around 0.845~ Keywords Random Forest; C5.0; PM2.5 prediction; data mining. I. INTRODUCTION Air pollution is a major problem for some time. Various organic and inorganic pollutants from all aspects of human activities are added daily to the air. One of the most important pollutants is particulate matter. Particulate matter (PM) can be defined as a mixture of fine particles and droplets in the air and this can be characterized by their sizes. refers to particulate matter whose size is 2.5 micrometers or smaller. Due to its effect on health, it is crucial to prevent the pollution getting worse in a long run. According to WHO's report, the mortality in cities with high levels of pollution exceeds that observed in relatively cleaner cities by 15 20% [1]. Forecasting of air quality is much needed in a short term so that necessary preventive action can be taken during episodes of air pollution. WHO s Air Quality Guideline (AQG) [2] says the mean of concentration in 24-hour level should be less than 25μg/, although Hong Kong s proposed Air Quality Objectives (AQOs) [3] is 75μg/ right now. Because the target data is from a new town in Hong Kong, which means there are lots of people living in this area, so it is need to be a stricter standard of air pollution in such area. As a result, we use 25μg/ based on 24 hours mean as our standard points. The number of particulate at a particular time is dependent on many environmental factors, especially the meteorological data and time serious factors. Predictive models for can vary from the simple to the complex; hence we have CART, C4.5, Artificial Neural Networks, Support Vector Machine among others. In this paper, we try to build models for predicting next day's concentration level by using two popular tree-based classification algorithms, which are, Random Forest (RF) [4-5] and C5.0 [6-7]. CART and C4.5 are simple decision tree models because there is only one decision tree in each model. While RF and C5.0 are ensemble methods based on CART and C4.5, and each of them has a bunch of basic decision trees in the model. Some of the differences among these two algorithms are shown in Table 1. Algorithms RandomForest TABLE I. BRIEF DIFFERENCE BETWEEN RF&C5.0 Number of Trees Multiple C5.0 Multiple Methods Bagging and Voting Boosting and Voting Basic Classifier CART C4.5 R [8] is an open source programming language and software environment for statistical computing and graphics. It is widely used for data analysis and statistical computing projects. In this paper, we will use some R packages as our analysis tools, namely randomforest package [9] and C50 package [10]. Moreover, we also use some packages for plotting figures, such as reshape2 package [11] and ggplot2 package [12]. The structure of the paper is: Section 2 reviews some basic concept of tree-based classification methods, while Section 3 and 4 will describe the data and the experiments. The conclusion will be given in Section 5. II. METHODOLOGY A. Random Forest (RF) RF is an effective prediction tool in data mining, which is based on CART. It employs the Bagging [13] method to produce a randomly sampled set of training data for each of the trees. 21 P a g e

2 CART uses Gini index which is an impurity-based criterion that measures the divergences among the probability distributions of target attribute's values. Definition 1 (Gini Index): Given a training set S and the target attribute takes on k different values, then the Gini index of S is defined as Gini( S) 1 k Where is the probability of S belonging to class i. Definition 2 (Gini Gain): Gini Gain is the evaluation criterion for selecting the attribute A which is defined as p 2 i GiniGain( A, S) Gini( S) Gini( A, S) n Gini( S) Gini( ) S Where is the partition of S induced by the value of attribute A. CART algorithm can deal with the case of features with nominal variables as well as continuous ranges. Pruning a tree is the action to replace a whole sub-tree by a leaf. CART uses a pruning technique called minimal costcomplexity pruning which assuming that the bias in the resubstitution error of a tree increases linearly with the number of leaves. Formally, given a tree T and a real number α>0 which is called the complexity parameter, then the costcomplexity risk of T with respect to α is: R ( T) R( T) T where is the number of terminal nodes (i.e. leaves) and R(T) is the re-substitution risk estimate of T. Bagging, which stands for bootstrap aggregating, is an ensemble classification method. It repeatedly samples from a data set with replacement according to a uniform probability distribution. Each sample has a probability 1 of being selected, where N is the number of observations in the training set. If N is sufficiently large, this probability converges to 1 1 e 0.632, that is, a bootstrap sample contains approximately 63% of the original training data, while other data is a natural good testing dataset which is known as OOB (Out of Bag [14]) in RF. nce every sample has an equal probability of being selected (i.e. 1 N), bagging does not focus on any particular instance of the training data. Therefore, it is less sensitive to model overfitting when applied to noisy data. RF constructs a series of tree-based learners. At each tree node, a random sample of m features is drawn, and only those m features are considered for splitting. Typically m= (as default in R randomforest package), where p is the number of features. The essential difference between Bagging and RF is the latter not only selecting samples randomly but also the features being selected randomly. RF will not prune the trees during the whole growing procedure. B. C5.0 C5.0 is an advanced decision tree algorithm developed based on C4.5. It includes all functionalities of C4.5 and applies some new technologies, the most important application among them is Boosting (i.e. AdaBoost [15]), which is a technique for generating and combining multiple classifiers to improve predictive accuracy. C4.5 uses information gain ratio which is an impuritybased criterion that employs the entropy measure as an impurity measure. Definition 3 (Information Entropy): Given a training set S, the target attribute takes on k different values, and then the entropy of S is defined as k Entropy( S) pilog 2 p Where is the probability of S belonging to class i. Definition 4 (Information Gain): The information gain of an attribute A, relative to the collection of examples S, is defined as InfoGain( A, S) Entropy( S) Entropy( A, S) n Entropy( S) Entropy( ) S Where is the partition of S induced by the value of attribute A. Definition 5 (Gain Ratio): The gain ratio normalizes the information gain as follows: InfoGain( A, S) GainRatio( A, S) SplitEntropy( A, S) InfoGain( A, S) n log2 S S milar to CART, C4.5 can also deal with both nominal and continuous variables. Error Based Pruning (EBP) is the pruning method which is implemented in C4.5 algorithm. The idea behind EBP is to minimize the estimated number of errors that a tree would make on unseen data. The estimated number of errors of a tree is computed as the sum of the estimated number of errors of all its leaves. AdaBoost stands for adaptive boosting, it increases the weights of incorrectly classified examples and decreases the ones of those classified correctly. C5.0 is much efficient than C4.5 also on the aspect of unordered rule sets. That is, when a case is classified, all applicable rules are found and voted. This improves both the interpretability of rule sets and their predictive accuracy. i 22 P a g e

3 PM2.5 Days in One Year III. DATA PREPARATION All of data for the period were obtained from Hong Kong Environmental Protection Department (HKEPD) and Hong Kong Met-online. The air monitoring station is Tung Chung Air Monitoring Station (Latitude 22 17'19"N, Longitude '35"E) which is in a new town of Hong Kong, and the meteorological monitoring station is Hong PM2.5 Concentration Level 200 Level High Low Years Fig.1. concentration levels in Kong International Airport Weather Station (Latitude 22 18'34"N, Longitude '19"E) which is the nearest station from Tung Chung. As mentioned in Section 1, accurately predicting high concentration is of most value from a public health standpoint, thus, the response variable has two classes, which are, Low indicating the daily mean concentration of is below 25μg/ and High representing above it. Figure 1 shows that the days of two levels in We learn that the air quality is the best in 2010 (i.e. it has the most Low days), while the worst is in 2004 among these 12 years. In summary, the percentage of Low and High level is around 40.0% and 59% during 12 years in this area, respectively (around 1% missing values). Thus, if a predictive model obtains the accuracy is less than 60%, which means it approximately equals the randomly guess, that would be failure. The purpose to use data mining method is to raise the accuracy, say, at least more than 60%. We convert all hourly data to daily mean values and the meteorological data is the original daily data. In addition, all of air data and meteorological data are numeric. We certainly cannot ignore the effects of seasonal changes and human activities; hence we add two time variables, namely the month (Figure 2) and the day of week (Figure 3). Figure 2 clearly shows that concentration reaches a low level from May to August, during which is the rainy season in Hong Kong. But the pollutant is serious from October to next January, especially in December and January. We should know that the rainfall may not be an important factor in the experiment as the response variable is the next day s, and it is easy to understand that rainy season includes variant meteorological factors. Figure 3 presents the trends of people s activities in some senses. We learn that the air pollution waves slightly during the week. The concentration levels are similar from Tuesday to Thursday, while the lowest level appears on Sunday. This situation can be related to Tung Chung is a living area in Hong Kong, which means the air is less influenced by factories or other pollution source (i.e. different from business area or industrial area ). At last, there are 4326 observations by deleting all NAs and 14 predictor variables (Table 2) and 1 response variable which is the next day s concentration level Monthly PM2.5 Median Concentration in Month Fig.2. Monthly Concentration in P a g e

4 Testing Accuracy PM Day of Week PM2.5 Median Concentration in WeekDay Fig.3. Daily Concentration in TABLE II. VARIABLE LIST Notation Description Class MP Mean Pressure AT1 Max Air AT2 Mean Air AT3 Min Air MDPT Mean Dew Point RH1 Max Relative Humidity RH2 Mean Relative Humidity RH3 Min Relative Humidity TR Total Rainfall PWD Prevailing Wind Direction MWS Mean Wind Speed PM2.5 concentration MONTH Month Nominal WEEK Day of week Nominal IV. EXPERIMENTS The experiments include three sections: the first and second experiment will test RF and C5.0, respectively. We try to train the model and select the proper parameters in each one. The third one will compare them by using 100 times 10- fold cross validation (10-fold CV) in order to understand which one is more accurate as well as stable. A. Random Forest (RF) We use R package randomforest to train and test the performance of RF model. There are two important parameters in RF model, that is, the number of splitting feathers (i.e. mtry ) and the number of trees (i.e. ntree ). We try to select a proper number in order to obtain the best testing accuracy by 10-fold CV. Firstly, we set ntree from 1 to 500 and mtry from 2 to 5. The result is shown in Figure 4. We learn that when the number of splitting feathers is 2 or 3 is somewhat better than other two values as the accuracy of both are tightness. The best accuracy of each splitting number is shown in Table 3. According to this result we choose mty = 3 and ntree = 98 in the following experiments. Note that the default value in R package is mtry = (as we mentioned in Section 2) and ntree = 500, generally speaking, we can use mtry as the default value and select ntree by using 10-fold CV. Alternatively, one can choose the function tunerf for selecting parameters in RF model and the details can be checked in the help file of randomforest package. Figure 5 shows that the importance of variables in RF model, we can see the most important predictor is the previous, and then MDPT, MP, and MONTH. The criterion of this list is according to the mean decreasing Gini gain of each predictor. Why the variable WEEK is not important in RF model? A reasonable explanation is that WEEK waves slightly on each day, moreover, all the medians of concentration are higher than 25μg/ (see Figure 3) which is the boundary between response variable Fig.4. mtry2 mtry3 mtry4 mtry5 Different Value of mtry Accuracy on Different Number of Splitting Feathers TABLE III. Violin Plot of RF by Changing mtry RESULT OF RF mtry ntree Testing Accuracy B. C5.0 We will use C50 package for building C5.0 model in this paper. milar as RF, we try to obtain the best number of trees at first. Note that the maximum number of trees in C50 package is 100, which is much less than RF (i.e. ntree = 500). Some of the results are shown in Table 4. We find that the highest accuracy is at the 32 nd tree, whose testing accuracy is Figure 6 indicates the trends of accuracy by changing number of trees in RF and C5.0. The training accuracy of both models increases steadily and stays at a stable level at last. The testing accuracy waves little serious at the beginning, variable mtry2 mtry3 mtry4 mtry5 24 P a g e

5 Accuracy RFImportance especially, C5.0 is much higher than RF when there are only a few trees. But both of them float in a moderate level later, RF is higher than C5.0 at this phase. Figure 7 shows the variables importance of C5.0 model. According to this result, we learn that the most important predictor is the previous, too. And MONTH, PWD are also important variables, but it is different from the result of RF. C5.0 calculates the percentage of splits associated with each predictor. We think this result is more accurate than RF algorithm, because Gini gain will be in favor of those variables having more values and thus offering more splits [16]. But C5.0 uses gain radio which avoids this problem. WEEK is still not important in this model. TABLE IV. RESULT OF C5.0 Trees Training Testing Importance of RF AT1 AT2 AT3 MDPT MONTH MP MWS PM2.5 PWD RH1 RH2 RH3 TR WEEK 0 AT1 AT2 AT3 MDPT MONTH MP MWS PM2.5 PWD RH1 RH2 RH3 TR WEEK Fig.5. importance of RF Trends of Accuray by Changing Trees factor(variable) RFtrain RFtest C5.0train C5.0test Number of Trees Fig.6. Trends of Testing Accuracy by Changing Number of Trees in RF & C P a g e

6 Testing Accuracy C5.0Importance Importance of C AT1 AT2 AT3 MDPT MONTH MP MWS PM2.5 PWD RH1 RH2 RH3 TR WEEK 0 Fig.7. importance of C5.0 AT1 AT2 AT3 MDPT MONTH MP MWS PM2.5 PWD RH1 RH2 RH3 TR WEEK Violin Plot of 100 times 10-fold CV variable RF C RF C5.0 Algorithm Fig.8. Violin Plot of 100 Times 10-fold CV C. Comparison We compare two algorithms by using 100 times 10-fold CV with the result shown in Table 5. We learn that RF obtains the best result, and its accuracy is around 0.845~ While C5.0 also gets a moderate accuracy, or say, only a little bit worse than RF. Another issue is the stability of these two algorithms during repeated times. Figure 8 shows the violin plot of 100 times 10-fold CV. We can see that RF is more stable than C5.0 during 100 times and its accuracy is also better than C5.0. TABLE V. COMPARISON BETWEEN RF&C5.0 Maximum Minimum Median RF C V. CONCLUSION In this paper, we build concentration levels predictive models by using two popular data mining algorithms, which are RF and C5.0. The dataset, which is from a new town in Hong Kong, includes 4326 rows and 15 columns by deleting all missing values. Based on all experiments, we have our conclusions as below. 26 P a g e

7 1) Selecting the best parameters in each model based on the testing accuracy by using 10-fold CV. For RF model, the number of trees is 98 and the number of splitting feathers is 3. For C5.0 model, the best number of trees is 32. We prefer to use default value of mtry in RF model and select ntree by 10- fold CV. 2) According to 100 times 10-fold CV, the best result is from RF which is around 0.845~ It not only obtains the highest accuracy but also performs more stable than C5.0. 3) Another issue between them is the importance of variables, and we prefer the result of C5.0 as it is unbiased. 4) The advice of using RF or C5.0 in practice is to select the number of iterations at first. 10-fold CV is the selecting method in this paper, while researchers can repeat this process many times, for instance, 10 times 10-fold CV should be better than once. In summary, the selecting process has to maximum limit reducing the random error. ACKNOWLEDGMENT The authors wish to thank Hong Kong Environmental Protection Department (HKEPD) and Hong Kong Met-online allowing us to use all the data in this paper. REFERENCES [1] Air quality and health, [2] WHO Air quality guidelines for particulate matter, ozone, nitrogen dioxide and sulfur dioxide (Global update 2005), World Health Organization, whqlibdoc.who.int/hq/2006/who_sde_phe_oeh_06.02_eng.pdf [3] Proposed New AQOs for Hong Kong, ectives/files/proposed_newaqos_eng.pdf [4] Leo Breiman, Random Forests, Machine Learning, Volume 45, Issue 1, pp. 5-32, [5] Leo Breiman, Statistical Modeling: The Two Cultures, Statistical Science, Volume 16, No. 3, pp , [6] J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, Los Altos, [7] C5.0: An Informal Tutorial, [8] R-project: [9] Andy Liaw, Matthew Wiener, randomforest package, Version 4.6.7, [10] Max Kuhn, Steve Weston, Nathan Coulter, C50 package, Version , [11] Hadley Wickham, reshape2 package, Version 1.2.2, [12] Hadley Wickham, ggplot2 package, Version , [13] Leo Breiman, Bagging Predictors, Machine Learning, volume 24 issue 2, pp , [14] Leo Breiman, Out-of-bag estimation, [15] Yoav Freund, Robert Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, European Conference on Computational Learning Theory, pp , [16] Leo Breiman, Jerome Friedman, Richard Olshen, Charles Stone, Classification and Regression Trees, Wadsworth Int. Group, pp.42, P a g e

### Journal of Asian Scientific Research COMPARISON OF THREE CLASSIFICATION ALGORITHMS FOR PREDICTING PM2.5 IN HONG KONG RURAL AREA.

Journal of Asian Scientific Research journal homepage: http://aesswebcom/journal-detailphp?id=5003 COMPARISON OF THREE CLASSIFICATION ALGORITHMS FOR PREDICTING PM25 IN HONG KONG RURAL AREA Yin Zhao School

### Machine learning algorithms for predicting roadside fine particulate matter concentration level in Hong Kong Central

Article Machine learning algorithms for predicting roadside fine particulate matter concentration level in Hong Kong Central Yin Zhao, Yahya Abu Hasan School of Mathematical Sciences, Universiti Sains

### Trees and Random Forests

Trees and Random Forests Adele Cutler Professor, Mathematics and Statistics Utah State University This research is partially supported by NIH 1R15AG037392-01 Cache Valley, Utah Utah State University Leo

### Data Mining. Nonlinear Classification

Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15

### Data Mining Methods: Applications for Institutional Research

Data Mining Methods: Applications for Institutional Research Nora Galambos, PhD Office of Institutional Research, Planning & Effectiveness Stony Brook University NEAIR Annual Conference Philadelphia 2014

### Model Combination. 24 Novembre 2009

Model Combination 24 Novembre 2009 Datamining 1 2009-2010 Plan 1 Principles of model combination 2 Resampling methods Bagging Random Forests Boosting 3 Hybrid methods Stacking Generic algorithm for mulistrategy

### Decision Trees from large Databases: SLIQ

Decision Trees from large Databases: SLIQ C4.5 often iterates over the training set How often? If the training set does not fit into main memory, swapping makes C4.5 unpractical! SLIQ: Sort the values

### CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.

CI6227: Data Mining Lesson 11b: Ensemble Learning Sinno Jialin PAN Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore Acknowledgements: slides are adapted from the lecture notes

### Using multiple models: Bagging, Boosting, Ensembles, Forests

Using multiple models: Bagging, Boosting, Ensembles, Forests Bagging Combining predictions from multiple models Different models obtained from bootstrap samples of training data Average predictions or

### Chapter 6. The stacking ensemble approach

82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described

### Predicting borrowers chance of defaulting on credit loans

Predicting borrowers chance of defaulting on credit loans Junjie Liang (junjie87@stanford.edu) Abstract Credit score prediction is of great interests to banks as the outcome of the prediction algorithm

### Classification and Regression by randomforest

Vol. 2/3, December 02 18 Classification and Regression by randomforest Andy Liaw and Matthew Wiener Introduction Recently there has been a lot of interest in ensemble learning methods that generate many

### Comparison of Data Mining Techniques used for Financial Data Analysis

Comparison of Data Mining Techniques used for Financial Data Analysis Abhijit A. Sawant 1, P. M. Chawan 2 1 Student, 2 Associate Professor, Department of Computer Technology, VJTI, Mumbai, INDIA Abstract

### Chapter 12 Bagging and Random Forests

Chapter 12 Bagging and Random Forests Xiaogang Su Department of Statistics and Actuarial Science University of Central Florida - 1 - Outline A brief introduction to the bootstrap Bagging: basic concepts

### Data Mining Practical Machine Learning Tools and Techniques

Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea

### Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk

Introduction to Machine Learning and Data Mining Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk Ensembles 2 Learning Ensembles Learn multiple alternative definitions of a concept using different training

### 6 Classification and Regression Trees, 7 Bagging, and Boosting

hs24 v.2004/01/03 Prn:23/02/2005; 14:41 F:hs24011.tex; VTEX/ES p. 1 1 Handbook of Statistics, Vol. 24 ISSN: 0169-7161 2005 Elsevier B.V. All rights reserved. DOI 10.1016/S0169-7161(04)24011-1 1 6 Classification

### Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right

### Mining Direct Marketing Data by Ensembles of Weak Learners and Rough Set Methods

Mining Direct Marketing Data by Ensembles of Weak Learners and Rough Set Methods Jerzy B laszczyński 1, Krzysztof Dembczyński 1, Wojciech Kot lowski 1, and Mariusz Paw lowski 2 1 Institute of Computing

### M1 in Economics and Economics and Statistics Applied multivariate Analysis - Big data analytics Worksheet 3 - Random Forest

Nathalie Villa-Vialaneix Année 2014/2015 M1 in Economics and Economics and Statistics Applied multivariate Analysis - Big data analytics Worksheet 3 - Random Forest This worksheet s aim is to learn how

### Gerry Hobbs, Department of Statistics, West Virginia University

Decision Trees as a Predictive Modeling Method Gerry Hobbs, Department of Statistics, West Virginia University Abstract Predictive modeling has become an important area of interest in tasks such as credit

### Benchmarking Open-Source Tree Learners in R/RWeka

Benchmarking Open-Source Tree Learners in R/RWeka Michael Schauerhuber 1, Achim Zeileis 1, David Meyer 2, Kurt Hornik 1 Department of Statistics and Mathematics 1 Institute for Management Information Systems

### Data Mining Classification: Decision Trees

Data Mining Classification: Decision Trees Classification Decision Trees: what they are and how they work Hunt s (TDIDT) algorithm How to select the best split How to handle Inconsistent data Continuous

### Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

Class #6: Non-linear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Non-linear classification Linear Support Vector Machines

### TRANSACTIONAL DATA MINING AT LLOYDS BANKING GROUP

TRANSACTIONAL DATA MINING AT LLOYDS BANKING GROUP Csaba Főző csaba.fozo@lloydsbanking.com 15 October 2015 CONTENTS Introduction 04 Random Forest Methodology 06 Transactional Data Mining Project 17 Conclusions

### DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES

DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES Vijayalakshmi Mahanra Rao 1, Yashwant Prasad Singh 2 Multimedia University, Cyberjaya, MALAYSIA 1 lakshmi.mahanra@gmail.com

### On the effect of data set size on bias and variance in classification learning

On the effect of data set size on bias and variance in classification learning Abstract Damien Brain Geoffrey I Webb School of Computing and Mathematics Deakin University Geelong Vic 3217 With the advent

### A Learning Algorithm For Neural Network Ensembles

A Learning Algorithm For Neural Network Ensembles H. D. Navone, P. M. Granitto, P. F. Verdes and H. A. Ceccatto Instituto de Física Rosario (CONICET-UNR) Blvd. 27 de Febrero 210 Bis, 2000 Rosario. República

### Data mining techniques: decision trees

Data mining techniques: decision trees 1/39 Agenda Rule systems Building rule systems vs rule systems Quick reference 2/39 1 Agenda Rule systems Building rule systems vs rule systems Quick reference 3/39

### Classification/Decision Trees (II)

Classification/Decision Trees (II) Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Right Sized Trees Let the expected misclassification rate of a tree T be R (T ).

### Ensemble Data Mining Methods

Ensemble Data Mining Methods Nikunj C. Oza, Ph.D., NASA Ames Research Center, USA INTRODUCTION Ensemble Data Mining Methods, also known as Committee Methods or Model Combiners, are machine learning methods

### Knowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19 - Bagging. Tom Kelsey. Notes

Knowledge Discovery and Data Mining Lecture 19 - Bagging Tom Kelsey School of Computer Science University of St Andrews http://tom.host.cs.st-andrews.ac.uk twk@st-andrews.ac.uk Tom Kelsey ID5059-19-B &

### Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification

Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification Henrik Boström School of Humanities and Informatics University of Skövde P.O. Box 408, SE-541 28 Skövde

### Package trimtrees. February 20, 2015

Package trimtrees February 20, 2015 Type Package Title Trimmed opinion pools of trees in a random forest Version 1.2 Date 2014-08-1 Depends R (>= 2.5.0),stats,randomForest,mlbench Author Yael Grushka-Cockayne,

### Classification: Basic Concepts, Decision Trees, and Model Evaluation. General Approach for Building Classification Model

10 10 Classification: Basic Concepts, Decision Trees, and Model Evaluation Dr. Hui Xiong Rutgers University Introduction to Data Mining 1//009 1 General Approach for Building Classification Model Tid Attrib1

### Decision Tree Learning on Very Large Data Sets

Decision Tree Learning on Very Large Data Sets Lawrence O. Hall Nitesh Chawla and Kevin W. Bowyer Department of Computer Science and Engineering ENB 8 University of South Florida 4202 E. Fowler Ave. Tampa

### Leveraging Ensemble Models in SAS Enterprise Miner

ABSTRACT Paper SAS133-2014 Leveraging Ensemble Models in SAS Enterprise Miner Miguel Maldonado, Jared Dean, Wendy Czika, and Susan Haller SAS Institute Inc. Ensemble models combine two or more models to

### Using Random Forest to Learn Imbalanced Data

Using Random Forest to Learn Imbalanced Data Chao Chen, chenchao@stat.berkeley.edu Department of Statistics,UC Berkeley Andy Liaw, andy liaw@merck.com Biometrics Research,Merck Research Labs Leo Breiman,

### Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05

Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification

### COMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction

COMP3420: Advanced Databases and Data Mining Classification and prediction: Introduction and Decision Tree Induction Lecture outline Classification versus prediction Classification A two step process Supervised

### L25: Ensemble learning

L25: Ensemble learning Introduction Methods for constructing ensembles Combination strategies Stacked generalization Mixtures of experts Bagging Boosting CSCE 666 Pattern Analysis Ricardo Gutierrez-Osuna

### A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier

A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier G.T. Prasanna Kumari Associate Professor, Dept of Computer Science and Engineering, Gokula Krishna College of Engg, Sullurpet-524121,

### How Boosting the Margin Can Also Boost Classifier Complexity

Lev Reyzin lev.reyzin@yale.edu Yale University, Department of Computer Science, 51 Prospect Street, New Haven, CT 652, USA Robert E. Schapire schapire@cs.princeton.edu Princeton University, Department

### Data Mining for Knowledge Management. Classification

1 Data Mining for Knowledge Management Classification Themis Palpanas University of Trento http://disi.unitn.eu/~themis Data Mining for Knowledge Management 1 Thanks for slides to: Jiawei Han Eamonn Keogh

### An Overview of Data Mining: Predictive Modeling for IR in the 21 st Century

An Overview of Data Mining: Predictive Modeling for IR in the 21 st Century Nora Galambos, PhD Senior Data Scientist Office of Institutional Research, Planning & Effectiveness Stony Brook University AIRPO

### Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Evaluating the Accuracy of a Classifier Holdout, random subsampling, crossvalidation, and the bootstrap are common techniques for

### REPORT DOCUMENTATION PAGE

REPORT DOCUMENTATION PAGE Form Approved OMB NO. 0704-0188 Public Reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instructions,

### ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA

ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA D.Lavanya 1 and Dr.K.Usha Rani 2 1 Research Scholar, Department of Computer Science, Sree Padmavathi Mahila Visvavidyalayam, Tirupati, Andhra Pradesh,

### The More Trees, the Better! Scaling Up Performance Using Random Forest in SAS Enterprise Miner

Paper 3361-2015 The More Trees, the Better! Scaling Up Performance Using Random Forest in SAS Enterprise Miner Narmada Deve Panneerselvam, Spears School of Business, Oklahoma State University, Stillwater,

### Better credit models benefit us all

Better credit models benefit us all Agenda Credit Scoring - Overview Random Forest - Overview Random Forest outperform logistic regression for credit scoring out of the box Interaction term hypothesis

### Social Media Mining. Data Mining Essentials

Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

### Data Analytics and Business Intelligence (8696/8697)

http: // togaware. com Copyright 2014, Graham.Williams@togaware.com 1/36 Data Analytics and Business Intelligence (8696/8697) Ensemble Decision Trees Graham.Williams@togaware.com Data Scientist Australian

### II. RELATED WORK. Sentiment Mining

Sentiment Mining Using Ensemble Classification Models Matthew Whitehead and Larry Yaeger Indiana University School of Informatics 901 E. 10th St. Bloomington, IN 47408 {mewhiteh, larryy}@indiana.edu Abstract

### Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

### THE RISE OF THE BIG DATA: WHY SHOULD STATISTICIANS EMBRACE COLLABORATIONS WITH COMPUTER SCIENTISTS XIAO CHENG. (Under the Direction of Jeongyoun Ahn)

THE RISE OF THE BIG DATA: WHY SHOULD STATISTICIANS EMBRACE COLLABORATIONS WITH COMPUTER SCIENTISTS by XIAO CHENG (Under the Direction of Jeongyoun Ahn) ABSTRACT Big Data has been the new trend in businesses.

### Decompose Error Rate into components, some of which can be measured on unlabeled data

Bias-Variance Theory Decompose Error Rate into components, some of which can be measured on unlabeled data Bias-Variance Decomposition for Regression Bias-Variance Decomposition for Classification Bias-Variance

### Bike sharing model reuse framework for tree-based ensembles

Bike sharing model reuse framework for tree-based ensembles Gergo Barta 1 Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics, Magyar tudosok krt. 2.

### IMPLEMENTING CLASSIFICATION FOR INDIAN STOCK MARKET USING CART ALGORITHM WITH B+ TREE

P 0Tis International Journal of Scientific Engineering and Applied Science (IJSEAS) Volume-2, Issue-, January 206 IMPLEMENTING CLASSIFICATION FOR INDIAN STOCK MARKET USING CART ALGORITHM WITH B+ TREE Kalpna

### Predictive Data modeling for health care: Comparative performance study of different prediction models

Predictive Data modeling for health care: Comparative performance study of different prediction models Shivanand Hiremath hiremat.nitie@gmail.com National Institute of Industrial Engineering (NITIE) Vihar

### Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets http://info.salford-systems.com/jsm-2015-ctw August 2015 Salford Systems Course Outline Demonstration of two classification

### Weather forecast prediction: a Data Mining application

Weather forecast prediction: a Data Mining application Ms. Ashwini Mandale, Mrs. Jadhawar B.A. Assistant professor, Dr.Daulatrao Aher College of engg,karad,ashwini.mandale@gmail.com,8407974457 Abstract

### Performance Analysis of Decision Trees

Performance Analysis of Decision Trees Manpreet Singh Department of Information Technology, Guru Nanak Dev Engineering College, Ludhiana, Punjab, India Sonam Sharma CBS Group of Institutions, New Delhi,India

### Event driven trading new studies on innovative way. of trading in Forex market. Michał Osmoła INIME live 23 February 2016

Event driven trading new studies on innovative way of trading in Forex market Michał Osmoła INIME live 23 February 2016 Forex market From Wikipedia: The foreign exchange market (Forex, FX, or currency

### Decision Trees. Andrew W. Moore Professor School of Computer Science Carnegie Mellon University. www.cs.cmu.edu/~awm awm@cs.cmu.

Decision Trees Andrew W. Moore Professor School of Computer Science Carnegie Mellon University www.cs.cmu.edu/~awm awm@cs.cmu.edu 42-268-7599 Copyright Andrew W. Moore Slide Decision Trees Decision trees

### Why Ensembles Win Data Mining Competitions

Why Ensembles Win Data Mining Competitions A Predictive Analytics Center of Excellence (PACE) Tech Talk November 14, 2012 Dean Abbott Abbott Analytics, Inc. Blog: http://abbottanalytics.blogspot.com URL:

### On Adaboost and Optimal Betting Strategies

On Adaboost and Optimal Betting Strategies Pasquale Malacaria School of Electronic Engineering and Computer Science Queen Mary, University of London Email: pm@dcs.qmul.ac.uk Fabrizio Smeraldi School of

### BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL

The Fifth International Conference on e-learning (elearning-2014), 22-23 September 2014, Belgrade, Serbia BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL SNJEŽANA MILINKOVIĆ University

### arxiv:1301.4944v1 [stat.ml] 21 Jan 2013

Evaluation of a Supervised Learning Approach for Stock Market Operations Marcelo S. Lauretto 1, Bárbara B. C. Silva 1 and Pablo M. Andrade 2 1 EACH USP, 2 IME USP. 1 Introduction arxiv:1301.4944v1 [stat.ml]

### Classification tree analysis using TARGET

Computational Statistics & Data Analysis 52 (2008) 1362 1372 www.elsevier.com/locate/csda Classification tree analysis using TARGET J. Brian Gray a,, Guangzhe Fan b a Department of Information Systems,

### CS570 Data Mining Classification: Ensemble Methods

CS570 Data Mining Classification: Ensemble Methods Cengiz Günay Dept. Math & CS, Emory University Fall 2013 Some slides courtesy of Han-Kamber-Pei, Tan et al., and Li Xiong Günay (Emory) Classification:

### DATA MINING METHODS WITH TREES

DATA MINING METHODS WITH TREES Marta Žambochová 1. Introduction The contemporary world is characterized by the explosion of an enormous volume of data deposited into databases. Sharp competition contributes

### Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Unit # 10 Sajjad Haider Fall 2012 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right

### HYBRID PROBABILITY BASED ENSEMBLES FOR BANKRUPTCY PREDICTION

HYBRID PROBABILITY BASED ENSEMBLES FOR BANKRUPTCY PREDICTION Chihli Hung 1, Jing Hong Chen 2, Stefan Wermter 3, 1,2 Department of Management Information Systems, Chung Yuan Christian University, Taiwan

### Chapter 11 Boosting. Xiaogang Su Department of Statistics University of Central Florida - 1 -

Chapter 11 Boosting Xiaogang Su Department of Statistics University of Central Florida - 1 - Perturb and Combine (P&C) Methods have been devised to take advantage of the instability of trees to create

### Mining Wiki Usage Data for Predicting Final Grades of Students

Mining Wiki Usage Data for Predicting Final Grades of Students Gökhan Akçapınar, Erdal Coşgun, Arif Altun Hacettepe University gokhana@hacettepe.edu.tr, erdal.cosgun@hacettepe.edu.tr, altunar@hacettepe.edu.tr

### Generalizing Random Forests Principles to other Methods: Random MultiNomial Logit, Random Naive Bayes, Anita Prinzie & Dirk Van den Poel

Generalizing Random Forests Principles to other Methods: Random MultiNomial Logit, Random Naive Bayes, Anita Prinzie & Dirk Van den Poel Copyright 2008 All rights reserved. Random Forests Forest of decision

### Ensemble Approach for the Classification of Imbalanced Data

Ensemble Approach for the Classification of Imbalanced Data Vladimir Nikulin 1, Geoffrey J. McLachlan 1, and Shu Kay Ng 2 1 Department of Mathematics, University of Queensland v.nikulin@uq.edu.au, gjm@maths.uq.edu.au

### Logistic Model Trees

Logistic Model Trees Niels Landwehr 1,2, Mark Hall 2, and Eibe Frank 2 1 Department of Computer Science University of Freiburg Freiburg, Germany landwehr@informatik.uni-freiburg.de 2 Department of Computer

### Data Mining with R. Decision Trees and Random Forests. Hugh Murrell

Data Mining with R Decision Trees and Random Forests Hugh Murrell reference books These slides are based on a book by Graham Williams: Data Mining with Rattle and R, The Art of Excavating Data for Knowledge

### D-optimal plans in observational studies

D-optimal plans in observational studies Constanze Pumplün Stefan Rüping Katharina Morik Claus Weihs October 11, 2005 Abstract This paper investigates the use of Design of Experiments in observational

### Data Science with R Ensemble of Decision Trees

Data Science with R Ensemble of Decision Trees Graham.Williams@togaware.com 3rd August 2014 Visit http://handsondatascience.com/ for more Chapters. The concept of building multiple decision trees to produce

### Studying Auto Insurance Data

Studying Auto Insurance Data Ashutosh Nandeshwar February 23, 2010 1 Introduction To study auto insurance data using traditional and non-traditional tools, I downloaded a well-studied data from http://www.statsci.org/data/general/motorins.

### Ensembles and PMML in KNIME

Ensembles and PMML in KNIME Alexander Fillbrunn 1, Iris Adä 1, Thomas R. Gabriel 2 and Michael R. Berthold 1,2 1 Department of Computer and Information Science Universität Konstanz Konstanz, Germany First.Last@Uni-Konstanz.De

### Model Trees for Classification of Hybrid Data Types

Model Trees for Classification of Hybrid Data Types Hsing-Kuo Pao, Shou-Chih Chang, and Yuh-Jye Lee Dept. of Computer Science & Information Engineering, National Taiwan University of Science & Technology,

### Package missforest. February 20, 2015

Type Package Package missforest February 20, 2015 Title Nonparametric Missing Value Imputation using Random Forest Version 1.4 Date 2013-12-31 Author Daniel J. Stekhoven Maintainer

### On Cross-Validation and Stacking: Building seemingly predictive models on random data

On Cross-Validation and Stacking: Building seemingly predictive models on random data ABSTRACT Claudia Perlich Media6 New York, NY 10012 claudia@media6degrees.com A number of times when using cross-validation

### Data Mining with R. December 13, 2008

Data Mining with R John Maindonald (Centre for Mathematics and Its Applications, Australian National University) and Yihui Xie (School of Statistics, Renmin University of China) December 13, 2008 Data

### New Ensemble Combination Scheme

New Ensemble Combination Scheme Namhyoung Kim, Youngdoo Son, and Jaewook Lee, Member, IEEE Abstract Recently many statistical learning techniques are successfully developed and used in several areas However,

### Inductive Learning in Less Than One Sequential Data Scan

Inductive Learning in Less Than One Sequential Data Scan Wei Fan, Haixun Wang, and Philip S. Yu IBM T.J.Watson Research Hawthorne, NY 10532 {weifan,haixun,psyu}@us.ibm.com Shaw-Hwa Lo Statistics Department,

### Homework Assignment 7

Homework Assignment 7 36-350, Data Mining Solutions 1. Base rates (10 points) (a) What fraction of the e-mails are actually spam? Answer: 39%. > sum(spam\$spam=="spam") [1] 1813 > 1813/nrow(spam) [1] 0.3940448

### REVIEW OF ENSEMBLE CLASSIFICATION

Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IJCSMC, Vol. 2, Issue.

### A Methodology for Predictive Failure Detection in Semiconductor Fabrication

A Methodology for Predictive Failure Detection in Semiconductor Fabrication Peter Scheibelhofer (TU Graz) Dietmar Gleispach, Günter Hayderer (austriamicrosystems AG) 09-09-2011 Peter Scheibelhofer (TU

### TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM Thanh-Nghi Do College of Information Technology, Cantho University 1 Ly Tu Trong Street, Ninh Kieu District Cantho City, Vietnam

### Classification and Regression Trees (CART) Theory and Applications

Classification and Regression Trees (CART) Theory and Applications A Master Thesis Presented by Roman Timofeev (188778) to Prof. Dr. Wolfgang Härdle CASE - Center of Applied Statistics and Economics Humboldt

### Package bigrf. February 19, 2015

Version 0.1-11 Date 2014-05-16 Package bigrf February 19, 2015 Title Big Random Forests: Classification and Regression Forests for Large Data Sets Maintainer Aloysius Lim OS_type

### Building Ensembles of Neural Networks with Class-switching

Building Ensembles of Neural Networks with Class-switching Gonzalo Martínez-Muñoz, Aitor Sánchez-Martínez, Daniel Hernández-Lobato and Alberto Suárez Universidad Autónoma de Madrid, Avenida Francisco Tomás

### Monday Morning Data Mining

Monday Morning Data Mining Tim Ruhe Statistische Methoden der Datenanalyse Outline: - data mining - IceCube - Data mining in IceCube Computer Scientists are different... Fakultät Physik Fakultät Physik

### An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset

P P P Health An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset Peng Liu 1, Elia El-Darzi 2, Lei Lei 1, Christos Vasilakis 2, Panagiotis Chountas 2, and Wei Huang