# Studying Auto Insurance Data


Ashutosh Nandeshwar

February 23

## 1 Introduction

To study auto insurance data using traditional and non-traditional tools, I downloaded a well-studied dataset, originally analyzed in Hallin and Ingenbleek (1983) and Andrews and Herzberg (1985). These data consist of third-party motor insurance claims in Sweden for the year 1977. In Sweden, all motor insurance companies apply identical risk arguments to classify customers, so their portfolios and claims statistics can be combined. The data were compiled by a Swedish Committee on the Analysis of Risk Premium in Motor Insurance, which was asked to analyze the real influence of the risk arguments on claims and to compare this structure with the actual tariff. According to the data description, the number of claims in each category can be treated as Poisson to a good approximation, and the amount of each claim can be treated as gamma; the total payout is therefore compound Poisson. In my cursory literature search, I found that these data have been studied using decision trees and combinations of regression techniques (Andrews and Herzberg 1985; Christmann 2006; Loh 2008; Marin-Galiano and Christmann 2004). For this project, I studied the total payment as the dependent variable. All the variables in these data are listed in Table 1, the descriptive statistics of total payment are given in Table 2, and the histogram is given in Figure 1.

## 2 Data Exploration

Figure 1 shows the average payments by kilometers driven and by the zones in which the cars were driven; the size of each circle depicts the size of the payments. Kilometers-traveled categories 1 (<1000 km) and 2 (1000–15000 km) stand out for both the average payments and the size of the payments. Cars driven in Zone 4 also had high payments regardless of the kilometers driven. As seen in Figure 2, it is clear that make 1 had very high total payments across all bonus classes. Also, make 6 had high total payments. Payments were high in the seventh bonus class and the first bonus class.
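The compound Poisson structure noted in the introduction (Poisson claim counts, gamma claim amounts) can be sketched with standard-library Python. This is a minimal illustration only: the rate and gamma parameters below are made-up assumptions, not estimates from these data.

```python
import math
import random

def poisson_sample(lam, rng):
    """Draw a Poisson(lam) variate using Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def total_payout(lam, shape, scale, rng):
    """Compound Poisson total: a Poisson number of claims, each gamma-sized."""
    claims = poisson_sample(lam, rng)
    return sum(rng.gammavariate(shape, scale) for _ in range(claims))

# Illustrative parameters (assumed, not fitted): 5 expected claims,
# gamma(shape=2, scale=100) claim amounts, so E[total] = 5 * 2 * 100 = 1000.
rng = random.Random(1)
sims = [total_payout(5.0, 2.0, 100.0, rng) for _ in range(20000)]
mean_total = sum(sims) / len(sims)   # should land near 1000
```

The simulated mean converges to lam × shape × scale, which is why the Poisson/gamma decomposition is convenient for modeling total payouts.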

Table 1: Data Description

| Variable | Values |
|---|---|
| Kilometres | Kilometres travelled per year. 1: <1000; 2: 1000–15000; 3: 15000–20000; 4: 20000–25000; 5: >25000 |
| Zone | Geographical zone. 1: Stockholm, Göteborg, Malmö with surroundings; 2: other large cities with surroundings; 3: smaller cities with surroundings in southern Sweden; 4: rural areas in southern Sweden; 5: smaller cities with surroundings in northern Sweden; 6: rural areas in northern Sweden; 7: Gotland |
| Bonus | No-claims bonus; equal to the number of years, plus one, since the last claim |
| Make | 1–8 represent eight different common car models; all other models are combined in class 9 |
| Insured | Number of insured in policy-years |
| Claims | Number of claims |
| Payment | Total value of payments in Skr |

Table 2: Descriptive Statistics of the Payment Variable

| Statistic | Value |
|---|---|
| min | 0 |
| max | |
| sd | |
| avg | |
| count | 2182 |

Figure: Histogram of Payment.

## 3 Experiment

After experimenting with some numeric-class learners, I ran the following learners in Weka (Hall et al. 2009):

1. Linear regression.
2. Neural network.
3. Additive regression with DecisionStump: a boosting technique that improves the robustness and accuracy of the base learner by selecting a random sub-sample of the training data at each iteration and updating the model using the values from this sample (Friedman 2002).
4. M5P: a decision-tree learning technique specially designed for continuous class variables that applies multivariate linear regression at each node of the tree (Quinlan 1992; Wang and Witten 1997).
5. IBk: an instance-based learner that selects and stores similar instances of the data using a nearest-neighbor technique (Aha et al. 1991).
6. Bagging with REPTree: bagging repeats the learning algorithm n times and averages the performance measures over the n versions; bootstrap sampling is used to generate the n datasets. It has been shown that bagging can improve the accuracy of the base learner (Breiman 1996). REPTree is a fast decision-tree learner.
7. Bagging with linear regression.

Each learner was evaluated with 10-fold cross-validation: the data were split into 10 sets, and the learner was trained on 9 of them and tested on the 10th. This procedure produced 10 result sets for each algorithm, and the complete experiment was re-randomized and repeated 10 times, resulting in 700 result sets across the seven learners.
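The re-randomized, repeated 10-fold cross-validation protocol above can be sketched as follows. This is a stand-alone illustration: the one-variable least-squares learner and the noise-free synthetic data are assumptions standing in for the Weka learners and the insurance data.

```python
import random

def kfold_indices(n, k, rng):
    """Shuffle indices 0..n-1 and deal them into k disjoint test folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_linreg(xs, ys):
    """One-variable least squares (a simple stand-in learner)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return my - b1 * mx, b1

def repeated_cv(xs, ys, repeats=10, k=10, seed=0):
    """Repeated k-fold CV; returns one held-out MAE per fold per repeat."""
    maes = []
    for r in range(repeats):
        rng = random.Random(seed + r)              # re-randomize each repeat
        for test in kfold_indices(len(xs), k, rng):
            hold = set(test)
            b0, b1 = fit_linreg(
                [xs[i] for i in range(len(xs)) if i not in hold],
                [ys[i] for i in range(len(ys)) if i not in hold])
            maes.append(sum(abs(b0 + b1 * xs[i] - ys[i]) for i in test) / len(test))
    return maes

xs = [float(i) for i in range(50)]
ys = [2.0 * x + 1.0 for x in xs]   # noise-free target, so MAE is near zero
maes = repeated_cv(xs, ys)         # 10 repeats x 10 folds = 100 error values
```

With 7 learners this yields 7 × 10 × 10 = 700 fold-level results, matching the count reported above.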

Figure 1: Kilometers Driven, Zone, and Payments Occurred (the size of the circle denotes the payments occurred for that combination). It is clear that Zone 4 and Kilometres ranges 1 and 2 had the highest payouts.

## 4 Results

The averages of the mean absolute error over the 10×10 runs are given in Table 3 and Figure 3, and the averages of the root mean squared error over these runs are shown in Figure 4. After examining the mean absolute error and root mean squared error results, it became clear that M5P and the neural network performed much better than the other learners. Since neural network models are not easily interpretable, I include the rules generated by M5P in Appendix A. In addition, Weka currently does not offer a way to apply generalized linear models to the data; I therefore ran a GLM learner with a Poisson distribution and a log link function in SPSS's Clementine software and obtained a mean absolute error of 75,173 on the test dataset, with a correlation coefficient exceeding those of the other learners. In other words, the GLM performed much better than the state-of-the-art techniques on this specific dataset.
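The Poisson GLM with log link that I ran in Clementine can be illustrated with a minimal pure-Python Newton-Raphson fit. This is a sketch under assumptions, not SPSS's implementation: the single synthetic predictor and the true parameters (1.0, 0.5) are made up for the demonstration.

```python
import math
import random

def poisson_sample(lam, rng):
    """Draw a Poisson(lam) variate using Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def fit_poisson_glm(x, y, iters=30):
    """Newton-Raphson MLE for a Poisson GLM with log link: log mu = b0 + b1*x."""
    b0 = math.log(sum(y) / len(y) + 1e-12)   # start from the intercept-only fit
    b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            mu = math.exp(b0 + b1 * xi)
            g0 += yi - mu                    # score w.r.t. b0
            g1 += (yi - mu) * xi             # score w.r.t. b1
            h00 += mu                        # Fisher information entries
            h01 += mu * xi
            h11 += mu * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det    # Newton step: H^-1 * gradient
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

rng = random.Random(7)
x = [i / 100.0 for i in range(400)]
y = [poisson_sample(math.exp(1.0 + 0.5 * xi), rng) for xi in x]
b0, b1 = fit_poisson_glm(x, y)   # estimates should land near (1.0, 0.5)
```

Because the Poisson log-likelihood is concave in the coefficients, Newton's method converges quickly from the intercept-only starting point.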

Figure 2: Make vs. Average Payment. The pink line, which has the highest payouts across all makes, denotes the average payouts by the make of the car.

Table 3: Mean Absolute Errors Obtained by Learners

| Learner | Mean Absolute Error |
|---|---|
| M5P | |
| Neural net | |
| IBk | |
| Bagging REPTree | |
| AdditiveRegression | |
| Bagging Linear Regression | |
| Linear Regression | |

Figure 3: Mean Absolute Errors and Correlation Coefficients (the correlation coefficients are the text on the right).

Table 4: Correlation Coefficients Obtained by Learners

| Learner | Correlation coefficient |
|---|---|
| AdditiveRegression | 0.62 |
| Linear Regression | 0.62 |
| Bagging Linear Regression | 0.62 |
| Bagging REPTree | 0.83 |
| IBk | 0.88 |
| M5P | 0.95 |
| Neural net | |
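For reference, the evaluation measures reported in Tables 3 and 4 (mean absolute error, root mean squared error, and correlation coefficient) can be computed as below. The four-point example values are made up purely for illustration.

```python
import math

def mae(actual, pred):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, pred)) / len(actual)

def rmse(actual, pred):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, pred)) / len(actual))

def pearson_r(actual, pred):
    """Pearson correlation coefficient between predictions and actuals."""
    n = len(actual)
    ma, mp = sum(actual) / n, sum(pred) / n
    cov = sum((a - ma) * (p - mp) for a, p in zip(actual, pred))
    sa = math.sqrt(sum((a - ma) ** 2 for a in actual))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    return cov / (sa * sp)

actual = [100.0, 200.0, 300.0, 400.0]   # illustrative payments
pred = [110.0, 190.0, 320.0, 380.0]     # illustrative model outputs
print(mae(actual, pred))                # 15.0
print(rmse(actual, pred))               # about 15.81
print(pearson_r(actual, pred))          # about 0.99
```

RMSE penalizes large individual errors more than MAE, which is why the two tables can rank the learners slightly differently.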

Figure 4: Average of Root Mean Squared Error Over 10 Runs.

## 5 Conclusion

The Swedish third-party motor insurance data (Hallin and Ingenbleek 1983; Andrews and Herzberg 1985) were analyzed using data mining techniques, linear regression, and generalized linear models. Among the state-of-the-art techniques, M5P, a model-tree generator, and the neural network performed the best; however, the GLM with a Poisson distribution and log link function was the top performer, with a mean absolute error of 75,173 and the highest correlation coefficient. Although this study covered some well-known data mining techniques, it was not an all-inclusive exercise. In addition, I ran some learners after discretizing the output (Payment) variable; however, their performance was very poor. More detailed investigation is needed to find machine learning methods that work well with this kind of data after the class variable is discretized. One novel approach, specifically for optimizing premium pricing using mathematical programming and data mining, is given by Yeo et al. (2002).

## References

Aha, D., Kibler, D., and Albert, M. (1991). Instance-based learning algorithms. Machine Learning, 6(1).

Andrews, D. and Herzberg, A. (1985). Data: A Collection of Problems from Many Fields for the Student and Research Worker. Springer-Verlag.

Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2).

Christmann, A. (2006). Empirical risk minimization for car insurance data.

Friedman, J. (2002). Stochastic gradient boosting. Computational Statistics and Data Analysis, 38(4).

Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., and Witten, I. (2009). The WEKA data mining software: An update. SIGKDD Explorations, 11.

Hallin, M. and Ingenbleek, J. (1983). The Swedish automobile portfolio in 1977: A statistical study. Scandinavian Actuarial Journal, 83.

Loh, W. (2008). Regression by parts: Fitting visually interpretable models with GUIDE. Handbook of Data Visualization, page 447.

Marin-Galiano, M. and Christmann, A. (2004). Insurance: An R-program to model insurance data.

Quinlan, J. (1992). Learning with continuous classes. In Proceedings of the 5th Australian Joint Conference on Artificial Intelligence. Citeseer.

Wang, Y. and Witten, I. (1997). Induction of model trees for predicting continuous classes. In Proceedings of the Poster Papers of the European Conference on Machine Learning.

Yeo, A., Smith, K., Willis, R., and Brooks, M. (2002). A mathematical programming approach to optimise insurance premium pricing within a data mining framework. The Journal of the Operational Research Society, 53(11).

## A Appendix

```
=== Run information ===

Scheme:       weka.classifiers.rules.m5rules -M 4.0
Relation:     motorins-weka.filters.unsupervised.attribute.numerictonominal-r1-4-weka.filter
Instances:    2182
Attributes:   5
              Kilometres
              Zone
              Bonus
              Make
              Payment
Test mode:    10-fold cross-validation

=== Classifier model (full training set) ===

M5 pruned model rules (using smoothed linear models):

Number of Rules : 20

Rule: 1
Make=1,9 <= 0.5
Bonus=1,7 <= 0.5
    * Kilometres=3,1,
    * Kilometres=1,
    * Kilometres=
    * Zone=5,6,3,2,1,
    * Zone=6,3,2,1,
    * Zone=3,2,1,
    * Zone=
    * Bonus=2,6,1,
    * Bonus=1,
    * Bonus=
    * Make=7,3,2,5,6,1,
    * Make=2,5,6,1,
    * Make=6,1,
    * Make=1,
    * Make=
[1210/4.383%]

Rule: 2
Make=9 <= 0.5
Bonus=7 <= 0.5
Kilometres=3,1,2 <= 0.5
    * Kilometres=3,1,
    * Kilometres=1,
    * Kilometres=
    * Zone=5,6,3,2,1,
    * Zone=6,3,2,1,
    * Zone=3,2,1,
    * Zone=
    * Bonus=2,6,1,
    * Bonus=6,1,
    * Bonus=1,
    * Bonus=
    * Make=4,7,3,2,5,6,1,
    * Make=7,3,2,5,6,1,
    * Make=2,5,6,1,
    * Make=5,6,1,
    * Make=6,1,
    * Make=1,
    * Make=
[175/3.012%]

Rule: 3
Make=9 <= 0.5
Make=6,1,9 > 0.5
Bonus=7 <= 0.5
Zone=6,3,2,1,4 > 0.5
    * Kilometres=3,1,
    * Kilometres=1,
    * Kilometres=
    * Zone=5,6,3,2,1,
    * Zone=6,3,2,1,
    * Zone=3,2,1,
    * Zone=
    * Bonus=2,6,1,
    * Bonus=6,1,
    * Bonus=1,7
    * Bonus=
    * Make=2,5,6,1,
    * Make=6,1,
    * Make=1,
    * Make=
[105/6.361%]

Rule: 4
Make=1,9 <= 0.5
Zone=6,3,2,1,4 <= 0.5
    * Kilometres=3,1,
    * Kilometres=1,
    * Kilometres=
    * Zone=5,6,3,2,1,
    * Zone=6,3,2,1,
    * Zone=3,2,1,
    * Zone=
    * Bonus=2,6,1,
    * Bonus=1,
    * Bonus=
    * Make=7,3,2,5,6,1,
    * Make=2,5,6,1,
    * Make=6,1,
    * Make=1,
    * Make=
[111/2.829%]

Rule: 5
Make=1,9 <= 0.5
Bonus=7 > 0.5
Kilometres=3,1,2 > 0.5
Make=7,3,2,5,6,1,9 > 0.5
    * Kilometres=3,1,
    * Kilometres=1,
    * Kilometres=
    * Zone=5,6,3,2,1,
    * Zone=6,3,2,1,
    * Zone=3,2,1,4
    * Zone=1,
    * Zone=
    * Bonus=2,6,1,
    * Bonus=1,
    * Bonus=
    * Make=4,7,3,2,5,6,1,
    * Make=7,3,2,5,6,1,
    * Make=2,5,6,1,
    * Make=6,1,
    * Make=1,
    * Make=
[75/6.122%]

Rule: 6
Make=1,9 <= 0.5
    * Kilometres=3,1,
    * Kilometres=1,
    * Zone=5,6,3,2,1,
    * Zone=6,3,2,1,
    * Zone=3,2,1,
    * Zone=
    * Bonus=2,6,1,
    * Bonus=1,
    * Bonus=
    * Make=7,3,2,5,6,1,
    * Make=3,2,5,6,1,
    * Make=1,
    * Make=
[190/4.75%]

Rule: 7
Bonus=1,7 <= 0.5
Zone=6,3,2,1,4 > 0.5
Kilometres=3,1,2 > 0.5
    * Kilometres=3,1,
    * Kilometres=1,
    * Kilometres=
    * Zone=5,6,3,2,1,4
    * Zone=6,3,2,1,
    * Zone=3,2,1,
    * Zone=1,
    * Zone=
    * Bonus=3,2,6,1,
    * Bonus=2,6,1,
    * Bonus=6,1,
    * Bonus=1,
    * Bonus=
    * Make=
[75/15.283%]

Rule: 8
Bonus=1,7 <= 0.5
Zone=5,6,3,2,1,4 <= 0.5
    * Kilometres=3,1,
    * Kilometres=1,
    * Zone=5,6,3,2,1,
    * Zone=6,3,2,1,
    * Zone=3,2,1,
    * Zone=2,1,
    * Zone=
    * Bonus=6,1,
    * Bonus=
    * Make=
[40/1.135%]

Rule: 9
Bonus=1,7 <= 0.5
Zone=2,1,4 <= 0.5
Make=9 > 0.5
Kilometres=1,2 <= 0.5
Zone=3,2,1,4 <= 0.5
    * Kilometres=3,1,
    * Kilometres=1,
    * Zone=5,6,3,2,1,
    * Zone=6,3,2,1,
    * Zone=3,2,1,4
    * Zone=2,1,
    * Zone=
    * Bonus=2,6,1,
    * Bonus=6,1,
    * Bonus=
    * Make=
[25/3.033%]

Rule: 10
Bonus=1,7 <= 0.5
Make=9 > 0.5
Bonus=2,6,1,7 <= 0.5
    * Kilometres=3,1,
    * Kilometres=1,
    * Zone=5,6,3,2,1,
    * Zone=3,2,1,
    * Zone=2,1,
    * Zone=1,
    * Zone=
    * Bonus=6,1,
    * Bonus=1,
    * Bonus=
    * Make=
[30/4.176%]

Rule: 11
Zone=6,3,2,1,4 <= 0.5
Make=9 <= 0.5
    * Kilometres=3,1,
    * Kilometres=1,
    * Zone=5,6,3,2,1,
    * Zone=3,2,1,
    * Zone=
    * Bonus=1,
    * Bonus=
    * Make=
[31/4.212%]

Rule: 12
Kilometres=3,1,2 > 0.5
Zone=3,2,1,4 <= 0.5
Zone=5,6,3,2,1,4 > 0.5
    * Kilometres=3,1,
    * Kilometres=1,
    * Zone=5,6,3,2,1,
    * Zone=6,3,2,1,
    * Zone=3,2,1,
    * Zone=
    * Bonus=1,
    * Bonus=
    * Make=
[19/13.531%]

Rule: 13
Kilometres=3,1,2 <= 0.5
Bonus=7 <= 0.5
Bonus=1,7 <= 0.5
    * Kilometres=3,1,
    * Kilometres=1,
    * Zone=5,6,3,2,1,
    * Zone=6,3,2,1,
    * Zone=3,2,1,
    * Zone=
    * Bonus=6,1,
    * Bonus=1,
    * Bonus=
    * Make=
[16/3.576%]

Rule: 14
Kilometres=3,1,2 > 0.5
Make=9 > 0.5
Bonus=7 > 0.5
    * Kilometres=3,1,
    * Kilometres=1,
    * Kilometres=
    * Zone=5,6,3,2,1,
    * Zone=6,3,2,1,
    * Zone=
    * Bonus=
    * Make=
[15/18.703%]

Rule: 15
Zone=3,2,1,4 > 0.5
Make=9 <= 0.5
Kilometres=3,1,2 > 0.5
    * Kilometres=4,3,1,
    * Kilometres=3,1,
    * Kilometres=1,
    * Kilometres=
    * Zone=6,3,2,1,
    * Zone=
    * Bonus=
    * Make=
[12/12.851%]

Rule: 16
Zone=3,2,1,4 > 0.5
Kilometres=1,2 <= 0.5
Make=9 > 0.5
Bonus=7 <= 0.5
    * Kilometres=3,1,
    * Kilometres=1,
    * Zone=5,6,3,2,1,
    * Zone=6,3,2,1,
    * Zone=1,
    * Zone=
    * Bonus=
    * Make=9
[12/7.632%]

Rule: 17
Zone=3,2,1,4 <= 0.5
Bonus=7 <= 0.5
    * Kilometres=3,1,
    * Zone=5,6,3,2,1,
    * Zone=6,3,2,1,
    * Zone=3,2,1,
    * Zone=
    * Bonus=
    * Make=
[9/2.06%]

Rule: 18
Make=9 > 0.5
Kilometres=3,1,2 <= 0.5
    * Kilometres=4,3,1,
    * Kilometres=3,1,
    * Zone=6,3,2,1,
    * Zone=3,2,1,
    * Zone=2,1,
    * Zone=
    * Make=
[14/15.598%]

Rule: 19
Kilometres=3,1,2 <= 0.5
    * Kilometres=4,3,1,
    * Kilometres=3,1,
    * Zone=1,
[10/8.352%]

Rule: 20
    * Zone=1,
[8/52.373%]

Time taken to build model: seconds

=== Cross-validation ===
=== Summary ===

Correlation coefficient
Mean absolute error
Root mean squared error
Relative absolute error               %
Root relative squared error           %
Total Number of Instances             2182
```


Proposal of Credit Card Fraudulent Use Detection by Online-type Decision Tree Construction and Verification of Generality Tatsuya Minegishi 1, Ayahiko Niimi 2 Graduate chool of ystems Information cience,

### THE HYBRID CART-LOGIT MODEL IN CLASSIFICATION AND DATA MINING. Dan Steinberg and N. Scott Cardell

THE HYBID CAT-LOGIT MODEL IN CLASSIFICATION AND DATA MINING Introduction Dan Steinberg and N. Scott Cardell Most data-mining projects involve classification problems assigning objects to classes whether

### Flexible Neural Trees Ensemble for Stock Index Modeling

Flexible Neural Trees Ensemble for Stock Index Modeling Yuehui Chen 1, Ju Yang 1, Bo Yang 1 and Ajith Abraham 2 1 School of Information Science and Engineering Jinan University, Jinan 250022, P.R.China

### Extension of Decision Tree Algorithm for Stream Data Mining Using Real Data

Fifth International Workshop on Computational Intelligence & Applications IEEE SMC Hiroshima Chapter, Hiroshima University, Japan, November 10, 11 & 12, 2009 Extension of Decision Tree Algorithm for Stream

### Predictive Modeling in Workers Compensation 2008 CAS Ratemaking Seminar

Predictive Modeling in Workers Compensation 2008 CAS Ratemaking Seminar Prepared by Louise Francis, FCAS, MAAA Francis Analytics and Actuarial Data Mining, Inc. www.data-mines.com Louise.francis@data-mines.cm

### Trees and Random Forests

Trees and Random Forests Adele Cutler Professor, Mathematics and Statistics Utah State University This research is partially supported by NIH 1R15AG037392-01 Cache Valley, Utah Utah State University Leo

### Car Insurance. Prvák, Tomi, Havri

Car Insurance Prvák, Tomi, Havri Sumo report - expectations Sumo report - reality Bc. Jan Tomášek Deeper look into data set Column approach Reminder What the hell is this competition about??? Attributes

### Package acrm. R topics documented: February 19, 2015

Package acrm February 19, 2015 Type Package Title Convenience functions for analytical Customer Relationship Management Version 0.1.1 Date 2014-03-28 Imports dummies, randomforest, kernelfactory, ada Author

### Social Media Mining. Data Mining Essentials

Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

### Welcome. Data Mining: Updates in Technologies. Xindong Wu. Colorado School of Mines Golden, Colorado 80401, USA

Welcome Xindong Wu Data Mining: Updates in Technologies Dept of Math and Computer Science Colorado School of Mines Golden, Colorado 80401, USA Email: xwu@ mines.edu Home Page: http://kais.mines.edu/~xwu/

### University of Glasgow Terrier Team / Project Abacá at RepLab 2014: Reputation Dimensions Task

University of Glasgow Terrier Team / Project Abacá at RepLab 2014: Reputation Dimensions Task Graham McDonald, Romain Deveaud, Richard McCreadie, Timothy Gollins, Craig Macdonald and Iadh Ounis School

### WEKA. Machine Learning Algorithms in Java

WEKA Machine Learning Algorithms in Java Ian H. Witten Department of Computer Science University of Waikato Hamilton, New Zealand E-mail: ihw@cs.waikato.ac.nz Eibe Frank Department of Computer Science

### Data Analytics and Business Intelligence (8696/8697)

http: // togaware. com Copyright 2014, Graham.Williams@togaware.com 1/36 Data Analytics and Business Intelligence (8696/8697) Ensemble Decision Trees Graham.Williams@togaware.com Data Scientist Australian

### Decision Tree Learning on Very Large Data Sets

Decision Tree Learning on Very Large Data Sets Lawrence O. Hall Nitesh Chawla and Kevin W. Bowyer Department of Computer Science and Engineering ENB 8 University of South Florida 4202 E. Fowler Ave. Tampa

### A Regression Approach for Forecasting Vendor Revenue in Telecommunication Industries

A Regression Approach for Forecasting Vendor Revenue in Telecommunication Industries Aida Mustapha *1, Farhana M. Fadzil #2 * Faculty of Computer Science and Information Technology, Universiti Tun Hussein

### Gerry Hobbs, Department of Statistics, West Virginia University

Decision Trees as a Predictive Modeling Method Gerry Hobbs, Department of Statistics, West Virginia University Abstract Predictive modeling has become an important area of interest in tasks such as credit

### Data Mining Applications in Higher Education

Executive report Data Mining Applications in Higher Education Jing Luan, PhD Chief Planning and Research Officer, Cabrillo College Founder, Knowledge Discovery Laboratories Table of contents Introduction..............................................................2

### New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Introduction

Introduction New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Predictive analytics encompasses the body of statistical knowledge supporting the analysis of massive data sets.

### An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

An Introduction to Data Mining for Wind Power Management Spring 2015 Big Data World Every minute: Google receives over 4 million search queries Facebook users share almost 2.5 million pieces of content

### Benchmarking Open-Source Tree Learners in R/RWeka

Benchmarking Open-Source Tree Learners in R/RWeka Michael Schauerhuber 1, Achim Zeileis 1, David Meyer 2, Kurt Hornik 1 Department of Statistics and Mathematics 1 Institute for Management Information Systems

### WEKA KnowledgeFlow Tutorial for Version 3-5-8

WEKA KnowledgeFlow Tutorial for Version 3-5-8 Mark Hall Peter Reutemann July 14, 2008 c 2008 University of Waikato Contents 1 Introduction 2 2 Features 3 3 Components 4 3.1 DataSources..............................

### Data Mining Approach for Predictive Modeling of Agricultural Yield Data

Data Mining Approach for Predictive Modeling of Agricultural Yield Data Branko Marinković, Jovan Crnobarac, Sanja Brdar, Borislav Antić, Goran Jaćimović, Vladimir Crnojević Faculty of Agriculture, University

### Experiments in Web Page Classification for Semantic Web

Experiments in Web Page Classification for Semantic Web Asad Satti, Nick Cercone, Vlado Kešelj Faculty of Computer Science, Dalhousie University E-mail: {rashid,nick,vlado}@cs.dal.ca Abstract We address