New Ensemble Combination Scheme




Namhyoung Kim, Youngdoo Son, and Jaewook Lee, Member, IEEE

Namhyoung Kim is with the Department of Statistics, Seoul National University, Seoul, Korea (e-mail: namhyoung@snu.ac.kr). Youngdoo Son and Jaewook Lee are with the Department of Industrial Engineering, Seoul National University, Seoul, Korea (e-mail: hand02, jaewook@snu.ac.kr).

Abstract — Many statistical learning techniques have recently been developed and used successfully in several areas. However, these algorithms are sometimes not robust and do not always perform well. Ensemble methods can address these problems: ensemble learning often improves the generalization performance of machine learning tasks and makes them more robust. However, the combining weights of an ensemble model are usually either pre-determined or determined under the view that the ensemble model is a superposition of the individual models. We therefore propose a new ensemble combination scheme that instead treats the ensemble model as a factor affecting the individual predictors. In experiments, the proposed method performs better than existing methods on regression problems and is competitive on classification problems.

I. INTRODUCTION

Several kinds of learning models, such as clustering, classification, and regression, have been developed and are used in many areas [1], [2], [3], [4]. Sometimes, however, these algorithms perform poorly or are not robust. Ensembles, which combine multiple models in specific ways, can overcome these problems. Ensemble learning is known to improve the generalization performance of a single predictor and has had considerable success. For example, in the famous one-million-dollar Netflix Prize competition, the top-ranked teams, including one named simply "The Ensemble", combined the outputs of several algorithms, and in other data mining competitions most winning teams have also employed ensembles of learning models. Ensemble methods are likewise widely used in practical areas [5], [6], [7].

Along with these practical successes, many different ensemble algorithms have been proposed and developed [3], [8], [9]. Some of them use pre-determined weights; most of the rest find weights under the view that the ensemble model is a superposition of the individual predictors. Unlike existing ensemble models, we propose a new ensemble combination scheme that considers the individual predictors as realizations of the ensemble model. In other words, from the viewpoint of factor analysis, the ensemble model becomes a factor that affects the individual models.

The paper is organized as follows. In the next section we summarize related literature and algorithms. Section III describes the proposed method, Section IV presents an empirical comparison with other ensemble methods, and Section V concludes.

II. LITERATURE REVIEW

Most contemporary ensemble methods build the ensemble model from the equation

$f = \sum_{i=1}^{m} w_i C_i, \quad (1)$

where $f$ is the predicted real function, $C_i$ is an individual base model, and $w_i$ is its weight. As illustrated in Fig. 1, the predictions of the individual models $C_i$, $i = 1, 2, \ldots, m$, are combined using the weights $w_i$, which vary with the combination scheme. According to the combination scheme, ensembles are divided into two main classes: combining by learning and combining by consensus. Boosting combines multiple base predictors by learning: it uses information about the performance of previous predictors, so this kind of ensemble is also known as a supervised scheme. In bagging, the ensemble is formed by majority voting; bagging does not consider the performance of each predictor but averages the predictions of the individual models in an unsupervised scheme.

A. Bagging

Bagging is the simplest algorithm for constructing an ensemble [10]. It creates individual predictors by training them on randomly selected training sets, each generated by sampling $n$ examples randomly with replacement. In bagging the weights are uniform, i.e., the ensemble prediction is

$f = \frac{1}{m} \sum_{i=1}^{m} C_i. \quad (2)$

In the case of classification, majority voting is used to predict the class of the data.

B. Boosting

Boosting algorithms iteratively learn weak classifiers on training sets selected according to the previous classification results and then combine them with different weights to create an ensemble [12]. The weights are determined by the weak learners' performance. Our method uses both supervised and unsupervised schemes; we describe it in the following section.

III. PROPOSED METHOD

In this paper, we propose a new ensemble combination scheme. We assume that the prediction of each individual predictor $C_i$ reflects the real function value $f$ through a loading $\lambda_i$ together with a specific error term $\delta_i$.
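As a numerical illustration of this assumption, each base predictor can be simulated as a noisy, scaled copy of the common factor $f$. The loadings, noise levels, and sample size below are illustrative choices of ours, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 200, 3                       # cases and base models (illustrative)
f = rng.standard_normal(n)          # common factor: the real function values
lam = np.array([0.9, 0.8, 0.6])     # loadings lambda_i
theta = np.array([0.2, 0.4, 0.8])   # specific standard deviations

# Column i is a realization of the ensemble model:
# C[:, i] = lambda_i * f + delta_i  (cf. Eq. (3)).
Delta = rng.standard_normal((n, m)) * theta
C = np.outer(f, lam) + Delta

# Predictors with larger loadings and smaller specific variance
# correlate more strongly with the common factor.
corr = [np.corrcoef(C[:, i], f)[0, 1] for i in range(m)]
```

Under this view the $C_i$ are generated *from* the ensemble model, rather than the ensemble being assembled from the $C_i$ — exactly the reversal the proposed scheme exploits.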

Fig. 1. Ensemble of classifiers.

In equation form,

$C_i = \lambda_i f + \delta_i, \quad i = 1, 2, \ldots, m. \quad (3)$

We can rewrite the above equation in matrix form as

$C = f\Lambda + \Delta, \quad (4)$

where

$C = \begin{pmatrix} C_1^{(1)} & C_2^{(1)} & \cdots & C_m^{(1)} \\ C_1^{(2)} & C_2^{(2)} & \cdots & C_m^{(2)} \\ \vdots & \vdots & & \vdots \\ C_1^{(n)} & C_2^{(n)} & \cdots & C_m^{(n)} \end{pmatrix}, \quad f = \begin{pmatrix} f^{(1)} \\ f^{(2)} \\ \vdots \\ f^{(n)} \end{pmatrix}, \quad \Lambda = [\lambda_1 \; \lambda_2 \; \cdots \; \lambda_m], \quad \Delta = \begin{pmatrix} \delta_1^{(1)} & \delta_2^{(1)} & \cdots & \delta_m^{(1)} \\ \vdots & \vdots & & \vdots \\ \delta_1^{(n)} & \delta_2^{(n)} & \cdots & \delta_m^{(n)} \end{pmatrix},$

and $n$ is the number of cases in the data set. In addition, we make the following three assumptions about the proposed method.

Assumption 1:
1) The variance of $f$ is 1:
$\frac{1}{n-1} f^{\top} f = 1. \quad (5)$
2) The specific factors $\delta$ are mutually uncorrelated, with diagonal covariance matrix
$\Theta = \frac{1}{n} \Delta^{\top} \Delta = \mathrm{diag}(\theta_1^2, \theta_2^2, \ldots, \theta_m^2). \quad (6)$
3) The common factor $f$ and the specific factors $\delta$ are uncorrelated.

The approximation of the correlation matrix $S$ is then

$S = \frac{1}{n-1} C^{\top} C = \frac{1}{n-1}(f\Lambda + \Delta)^{\top}(f\Lambda + \Delta) = \frac{1}{n-1}\left(\Lambda^{\top} f^{\top} f \Lambda + \Lambda^{\top} f^{\top} \Delta + \Delta^{\top} f \Lambda + \Delta^{\top} \Delta\right),$

which the assumptions simplify to

$S = \Lambda^{\top}\Lambda + \Theta. \quad (7)$

The maximum likelihood solution satisfies $S(\Lambda^{\top}\Lambda + \Theta)^{-1}\Lambda^{\top} = \Lambda^{\top}$, and the factor scores are

$f^{(i)} = \Lambda(\Lambda^{\top}\Lambda + \Theta)^{-1} C^{(i)} \quad \text{or} \quad f^{(i)} = (1 + \Lambda\Theta^{-1}\Lambda^{\top})^{-1}\Lambda\Theta^{-1} C^{(i)}.$

Let $f^{(i)} = \Lambda(\Lambda^{\top}\Lambda + \Theta)^{-1} C^{(i)} + \epsilon^{(i)}$ and $C^{(i)} = \Lambda^{\top} f^{(i)}$. Then

$\Lambda^{\top} f^{(i)} = C^{(i)} = \Lambda^{\top}\Lambda(\Lambda^{\top}\Lambda + \Theta)^{-1} C^{(i)} + \Lambda^{\top}\epsilon^{(i)},$

so

$\Lambda\Lambda^{\top}\epsilon^{(i)} = \Lambda\left[I_m - \Lambda^{\top}\Lambda(\Lambda^{\top}\Lambda + \Theta)^{-1}\right] C^{(i)}, \qquad \epsilon^{(i)} = \frac{\Lambda}{\Lambda\Lambda^{\top}}\left[I_m - \Lambda^{\top}\Lambda(\Lambda^{\top}\Lambda + \Theta)^{-1}\right] C^{(i)}.$

1) Bayesian Latent Ensemble:

$f^{(i)} = \frac{\Lambda\Theta^{-1}}{1 + \Lambda\Theta^{-1}\Lambda^{\top}}\, C^{(i)} + \epsilon^{(i)}, \qquad f = \sum_{k=1}^{m} \frac{\lambda_k \theta_k^{-2}}{1 + \sum_{l=1}^{m} \lambda_l^2 \theta_l^{-2}}\, C_k + \epsilon.$

In the supervised scheme, we take the given outputs of the training set as the real function values, and the process becomes much simpler.

IV. EXPERIMENTAL RESULTS

A. Data Sets

To evaluate the performance of the proposed method, we use both regression and classification data sets from the UCI Machine Learning Repository [11]. Table I summarizes the data sets; they are widely used real-world problems in the machine learning community and vary in both the number of cases and the number of features.

Fig. 2. Proposed method.

TABLE I
SUMMARY OF THE DATA SETS

                                      Features         Neural network
Data set          Cases  Class      cont.  disc.   inputs  outputs  hiddens  epochs
breast-cancer-w     699  2              9      -        9        1        5      20
credit-a            690  2              6      9       47        1       10      35
ionosphere          351  2             34      9       34        1       10      40
statlog german     1000  2              7     13       45        1       10      20
statlog aust        690  2              6      8       40        1       10      35
ailerons          13750  regression    40      -       40        1       10      20
housing boston      506  regression    13      -       13        1        5      35
pole telecom      15000  regression    48      -       48        1       10      20
abalone            4177  regression     7      1        9        1        5      20

B. Experiment Settings

In this study, neural networks are used as base models. A neural network consists of input, hidden, and output layers, and the size of the hidden layer must be set along with a few other parameters. We trained the networks with backpropagation, a learning rate of 0.5, a momentum constant of 0.9, and initial weights selected randomly between -0.5 and 0.5. The size of the hidden layer was determined from the numbers of input and output units, with five hidden units as a minimum. The number of epochs was based on the number of cases; we used more epochs as the size of the data set increased. The training parameters used in the experiments are shown in Table I.

For comparison, the popular ensemble algorithms bagging and boosting, as well as a single model, were applied to the same data sets. We averaged the test error of each method over five runs of standard 10-fold cross-validation: the data set is divided into 10 equal-sized subsets, each subset is used as the test set in turn, and the other nine subsets are used for training. In all ensemble methods we used an ensemble size of 25 for each fold.

C. Results

Tables II and III show the classification error for the classification problems and the root mean squared error for the regression problems, respectively. The columns correspond to the different methods: a single model (Single), bagging (Bagging), arcing (Arc), Ada-boosting (Ada), and the proposed method (FA). Each row shows the results for one data set. The tables show that the proposed method usually performs well across data sets. In classification problems (Table II), the proposed method

does not show the worst performance, and its results are often the best or close to the best. In regression problems (Table III), the proposed method performs best except on the housing boston data set; there bagging performs best, but the proposed method is competitive with it. We can therefore say that the proposed method generally performs better than existing ensemble methods.

V. CONCLUSION

In this study, we presented a novel ensemble combination scheme built on a new perspective on the relationship between the real function and each classifier: the predicted value of each classifier is assumed to be composed of the real function value and a specific error term. Under this assumption, we solved for the ensemble combination weights in both supervised and unsupervised schemes. The experimental results show that the proposed methods compare favorably with existing methods. They also suggest a self-correction ability: the ensemble can adjust its weights when it detects corrupted classifiers during learning, so the proposed method appears robust to corrupted predictors, unlike other ensemble schemes whose performance may degrade in their presence. Future work should include a proof and further experiments supporting this intuition, as well as tests of the statistical stability of the proposed model using tests such as the Friedman test, t-test, or sign test.

REFERENCES

[1] D. Lee and J. Lee, "Domain described support vector classifier for multi-classification problems," Pattern Recognition, vol. 40, pp. 41-51, 2007.
[2] Y. Son, D.-j. Noh, and J. Lee, "Forecasting trends of high-frequency KOSPI200 index data using learning classifiers," Expert Systems with Applications, vol. 39, no. 14, pp. 11607-11615, 2012.
[3] N. Kim, K.-H. Jung, Y. S. Kim, and J. Lee, "Uniformly subsampled ensemble (USE) for churn management: Theory and implementation," Expert Systems with Applications, vol. 39, no. 15, pp. 11839-11845, 2012.
[4] H. Park and J. Lee, "Forecasting nonnegative option price distributions using Bayesian kernel methods," Expert Systems with Applications, vol. 39, no. 18, pp. 13243-13252, 2012.
[5] T. Gneiting and A. E. Raftery, "Weather forecasting with ensemble methods," Science, vol. 310, no. 5746, pp. 248-249, 2005.
[6] D. Morrison, R. Wang, and L. C. De Silva, "Ensemble methods for spoken emotion recognition in call-centres," Speech Communication, vol. 49, no. 2, pp. 98-112, 2007.
[7] L. K. Hansen, C. Liisberg, and P. Salamon, "Ensemble methods for handwritten digit recognition," in Neural Networks for Signal Processing II: Proceedings of the 1992 IEEE-SP Workshop, IEEE, 1992.
[8] T. G. Dietterich, "Ensemble methods in machine learning," in Multiple Classifier Systems, pp. 1-15, Springer Berlin Heidelberg, 2000.
[9] H. Drucker et al., "Boosting and other ensemble methods," Neural Computation, vol. 6, no. 6, pp. 1289-1301, 1994.
[10] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, no. 2, pp. 123-140, 1996.
[11] A. Frank and A. Asuncion, UCI Machine Learning Repository [http://archive.ics.uci.edu/ml], Irvine, CA: University of California, School of Information and Computer Science, 2010.
[12] Y. Freund and R. Schapire, "Experiments with a new boosting algorithm," in Proceedings of the Thirteenth International Conference on Machine Learning, pp. 148-156, 1996.
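As a supplement to the protocol of Section IV, the evaluation loop (five repetitions of standard 10-fold cross-validation with test errors averaged) can be sketched as below. `train_and_score` is a hypothetical callable standing in for training an ensemble of 25 networks on the nine training folds and returning its error on the test fold; the function names are ours, not the paper's:

```python
import numpy as np

def ten_folds(n, seed):
    """Shuffle the n case indices and split them into 10 near-equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), 10)

def repeated_cv_error(X, y, train_and_score, repeats=5):
    """Average test error over `repeats` runs of 10-fold cross-validation.

    Each fold serves as the test set in turn while the remaining nine
    folds form the training set.
    """
    errors = []
    for seed in range(repeats):
        folds = ten_folds(len(y), seed)
        for k, test_idx in enumerate(folds):
            train_idx = np.concatenate([fold for j, fold in enumerate(folds) if j != k])
            errors.append(train_and_score(X[train_idx], y[train_idx],
                                          X[test_idx], y[test_idx]))
    return float(np.mean(errors))
```

Reshuffling with a different seed on each repetition gives five distinct fold partitions, so the reported figure averages 50 test-fold errors per method and data set.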

TABLE II
CLASSIFICATION RESULTS (CLASSIFICATION ERROR, %)

                                     Boosting             FA
Data set          Single  Bagging    Arc      Ada    Unsuperv.  Superv.
breast-cancer-w     3.59     3.5     3.35     2.9        3.50     3.56
credit-a            14.7    13.94   15.04    14.52      14.03    13.88
ionosphere          12.9    11.7    10.7     10.86       8.46     8.5
statlog aust       15.48    13.83   15.22    14.75      13.57    14.26
statlog german     26.66    24.68   27.48    24.96      23.78    23.96
hepatitis          36.80    34.3    36.40    36.00      33.87    34.00
sonar              17.40    16.60   14.40    15.70      16.60    16.70
diabetes            24.6    23.2    24.82    23.6        24.3     24.6

TABLE III
REGRESSION RESULTS (RMSE)

                                                     FA
Data set          Single   Bagging  Boosting  Unsuperv.  Superv.
abalone           2.0989   2.0733     7.6227     2.0702     2.07
housing boston    3.8296   3.236      7.3399     3.2743   3.2705
ailerons          0.0005   0.0003     0.002      0.0002   0.0002
pole telecom      3.6437   0.5535     3.5003     0.337     0.006
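The unsupervised combination rule derived in Section III (the FA columns in the tables above) reduces to fixed weights over the base predictions once the loadings λ_k and specific variances θ_k² have been estimated. A minimal sketch, with illustrative λ and θ² values rather than maximum likelihood estimates:

```python
import numpy as np

def latent_ensemble_weights(lam, theta2):
    """Weights of the proposed scheme:
    w_k = (lambda_k / theta_k^2) / (1 + sum_l lambda_l^2 / theta_l^2)."""
    lam = np.asarray(lam, dtype=float)
    theta2 = np.asarray(theta2, dtype=float)
    return (lam / theta2) / (1.0 + np.sum(lam ** 2 / theta2))

def latent_ensemble_predict(C, lam, theta2):
    """f = sum_k w_k C_k for base-model predictions C (n cases x m models)."""
    return np.asarray(C, dtype=float) @ latent_ensemble_weights(lam, theta2)

# A base model with a small specific variance (one that tracks the common
# factor closely) receives a larger weight.
w = latent_ensemble_weights([1.0, 1.0, 1.0], [0.1, 0.5, 1.0])
```

With equal loadings and equal specific variances the weights become uniform, recovering a bagging-like average up to the shrinkage factor 1/(1 + Σ λ_l²/θ_l²).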