Lowering social cost of car accidents by predicting high-risk drivers



Similar documents
Course Syllabus. Purposes of Course:

The number of fatalities fell even further last year to below 6,000 for the first time in 54 years since 1953.

BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts

Determining optimum insurance product portfolio through predictive analytics BADM Final Project Report

DECISION TREE ANALYSIS: PREDICTION OF SERIOUS TRAFFIC OFFENDING

Business Analytics using Data Mining

CHAPTER 1 Land Transport

Data Mining: Overview. What is Data Mining?

Data Mining for Business Intelligence. Concepts, Techniques, and Applications in Microsoft Office Excel with XLMiner. 2nd Edition

MOTORBIKE RIDERS AND CYCLISTS

Knowledge Discovery and Data Mining

CONTENTS PREFACE 1 INTRODUCTION 1 2 DATA VISUALIZATION 19

Appendix 1: ICD 10 AM (6th Edition) Cause of Injury Code and Description

Data Mining - Evaluation of Classifiers

Data Mining Techniques Chapter 6: Decision Trees

Data Mining Algorithms Part 1. Dejan Sarka

ITARDA INFORMATION. No.102. Special feature

In this presentation, you will be introduced to data mining and the relationship with meaningful use.

E-commerce Transaction Anomaly Classification

Improving performance of Memory Based Reasoning model using Weight of Evidence coded categorical variables

Forecast the monthly demand on automobiles to increase sales for automotive company

Predicting earning potential on Adult Dataset

Accident configurations and injuries for bicyclists based on the German In-Depth Accident Study. Chiara Orsi

SHOULD MOTORCYCLISTS USE GO PRO CAMERAS ON THEIR HELMETS TO HELP PRESERVE EVIDENCE IN THE EVENT OF AN ACCIDENT?

Social Media Mining. Data Mining Essentials

Prediction of Car Prices of Federal Auctions

Reevaluating Policy and Claims Analytics: a Case of Non-Fleet Customers In Automobile Insurance Industry

APPLICATION PROGRAMMING: DATA MINING AND DATA WAREHOUSING

Internet Gambling Behavioral Markers: Using the Power of SAS Enterprise Miner 12.1 to Predict High-Risk Internet Gamblers

Data Mining In Excel: Lecture Notes and Cases

Data Mining with SQL Server Data Tools

Motorcycle Safety & Laws. Stewart Milner Chief Judge, City of Arlington

Data Mining for Fun and Profit

An exploratory neural network model for predicting disability severity from road traffic accidents in Thailand

Role of Customer Response Models in Customer Solicitation Center s Direct Marketing Campaign

ASSIGNMENT 4 PREDICTIVE MODELING AND GAINS CHARTS

2. Road Traffic Accident Conditions During 2009

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

Classification algorithm in Data mining: An Overview

Data Mining and Predictive Modeling with Excel 2007

Knowledge Discovery and Data Mining

Bicycle Traffic Accidents in Japan

Business Analytics using Data Mining Project Report. Optimizing Operation Room Utilization by Predicting Surgery Duration

Predictive Modeling in Workers Compensation 2008 CAS Ratemaking Seminar

Predictive Data modeling for health care: Comparative performance study of different prediction models

Role of Social Networking in Marketing using Data Mining

Data Mining Application in Direct Marketing: Identifying Hot Prospects for Banking Product

How To Study The Effects Of Road Traf C On A Person'S Health

CS 6220: Data Mining Techniques Course Project Description

Didacticiel Études de cas

Evaluating the results of a car crash study using Statistical Analysis System. Kennesaw State University

Content-Based Recommendation

ECLT5810 E-Commerce Data Mining Technique SAS Enterprise Miner -- Regression Model I. Regression Node

Understanding Characteristics of Caravan Insurance Policy Buyer

A CONSUMER'S GUIDE TO MOTORCYCLE INSURANCE. from YOUR North Carolina Department of Insurance CONSUMER'SGUIDE

Data quality in Accounting Information Systems

IBM SPSS Direct Marketing 23

How To Use A Polyanalyst

Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification

Young Drivers The High-Risk Years

JetBlue Airways Stock Price Analysis and Prediction

IBM SPSS Direct Marketing 22

Predicting Defaults of Loans using Lending Club s Loan Data

Tutorial Segmentation and Classification

How To Know If You Can Ride A Motorcycle

ITARDAInstitute for Traffic Accident

Data Mining. Nonlinear Classification

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree

Impact of Boolean factorization as preprocessing methods for classification of Boolean data

Customer and Business Analytic

Reference Books. Data Mining. Supervised vs. Unsupervised Learning. Classification: Definition. Classification k-nearest neighbors

Policy Document Road safety

Predict Influencers in the Social Network

PharmaSUG2011 Paper HS03

Road safety a work-environment issue

The total number of traffic collisions in Saskatchewan is up 5% from 51,733 in 2008 to 54,229 in 2009.

Predictive Modeling Techniques in Insurance

Mining Road Traffic Accident Data to Improve Safety: Role of Road-related Factors on Accident Severity in Ethiopia

Motorcycle and Scooter crashes Recorded by NSW Police from January to December 2011

PAKDD 2006 Data Mining Competition

Motorcycle Related Crash Victims (What the Statistics Say) Mehdi Nassirpour Illinois Department of Transportation Division of Transportation Safety

INDIAN STATISTICAL INSTITUTE announces Training Program on Statistical Techniques for Data Mining & Business Analytics

Data Mining: STATISTICA

Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví. Pavel Kříž. Seminář z aktuárských věd MFF 4.

Advanced analytics at your hands

WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat

STATISTICS OF FATAL AND INJURY ROAD ACCIDENTS IN LITHUANIA,

Simple Predictive Analytics Curtis Seare

EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER. Copyr i g ht 2013, SAS Ins titut e Inc. All rights res er ve d.

Classification Techniques (1)

Road Transport and Road Traffic Accident Statistics (Island of Mauritius)

New Zealand all-age mandatory bicycle helmet law

PEDESTRIAN AND BICYCLE ACCIDENT DATA. Irene Isaksson-Hellman If Insurance Company P&C Ltd.

Customer Relationship Management

A Property & Casualty Insurance Predictive Modeling Process in SAS

Transcription:

Lowering social cost of car accidents by predicting high-risk drivers Vannessa Peng Davin Tsai Shu-Min Yeh

Why we do this? Traffic accident happened every day. In order to decrease the number of traffic accident and the losses caused by the traffic accident. Our government have a lot of policy, but these policy are focus on all the citizen. But now we think the policy have to focus on the certain group, the group who have high rate in traffic accident, to let the policy have more Significant effect. Where is our data from? We get the data from the Ministry of Transportation and Communications R.O.C, the data is official data which is credibility. How to do? 1. Buildng the model After cleaning and dummying the data, since the number of high risk is lesser compare with low risk, it will make the high risk be determined as outlier, we do the oversampling to overcome this problem. And we use a few method, compare them with cost matrix to find the best one. 2. we have to collaborate with government to get the information of the rider. (gender, violation, the purpose of using bicycle, job). 3. put the data of step2 into the model. Recommendation we suggest future practitioners to do: Improve the misclassification of whom is not victim but be predicted as. Connected these data to the actual death in the future. Track the responder of survey and combined them to the results in the future (slightly/badly car accident/ death in the accident).

1. Problem Business goal Traffic accident happens almost every day, and it is more risky for a motor-cycle rider for a car driver. Besides the death or mutilation, the social cost (ex: medical expenses, traffic jam) is the main cost caused by the car accident. In order to lessen the cost, we design this model. And we view the government as our customer because it has the most powerful resource and pursue the social benefit maximization. Dataming goal It is a supervised predictive task. We develop classifications for high / low risk groups. And because we want to predict who will be the high risk group, so our goal is predict. 2. Data description After exploring and cleaning the dataset, we only keep 7208 records including 11 predicators and 1 outcome variable in our model. The data sample is displayed below. A record represents a responder of daily motorcycle usage survey, provided by ministry of traffic. Input variable: Capacity: Ordered categorical variable Number of passengers per ride on average. Purpose: Categorical variable Main purpose of using the motorcycle. Use day per week: Ordered categorical variable Number of rides in a week on average Average duration: Ordered categorical variable Duration per ride on average Violation: Ordered categorical variable Number of violation times in the last year (self-reported) Gender: Categorical variable Male/Female (4228/2980) Age: Categorical variable Age of the responder Education: Ordered categorical Education level of the responder Job: Categorical

Job type of the responder Income: Ordered categorical Income interval of the responder City of motorcycle: Categorical variable Where is the motorcycle of the responder located (We will end up turn it into area of motorcycle variable) Output variable: accident (high/low risky). Variable we want to predict. We assume that high risky accident wrecker will engage in more dangerous injury and cost more social cost. They are objectives whom government should take some actions to. capacity purpose use day per average week duration violation gender age education job income city of motorcycle area of motocycle accident(low&high) 1 5 1 2 0 0 3 6 13 6 1 1 0 1 4 7 3 0 0 5 1 22 1 3 2 0 1 5 3 2 0 1 5 4 13 6 8 1 0 1 1 1 3 0 1 6 4 22 3 8 1 0 1 1 7 3 0 0 3 3 12 3 8 1 0 1 5 7 2 0 0 6 5 15 5 22 5 0 Figure sample of original data after cleaning 3. Data Preparation In data preparation, we start with cleaning records that do not have our output variable (accident). We deleted 3273 records (10481-7208) and have 7208 records for our analysis. After exploring the data, we found the responded average accident per year is skewed. The data are mainly concentrated on option zero, one or two times. On the other side, the answered options of more than two times per year are relatively scarce (695/ 7000). Therefore, we separate them (who answered more than two) as high risky group, denoted by 1, who are the targets we want to capture, and those less than two are grouped as low risky, denoted by 0. Under this condition, using lift chart will be more relatively complicated. So we choose to use the classification, trying to find the high class and low class, which will have enough data to know the characteristic of each group. We also transfer one of our input variable, city of motorcycle, into area of motorcycle. We successfully reduced 22 dimensions into 5 dimensions. Next, we create dummy variables for our nominal variables (purpose, job and area of motorcycle). Finally, we accomplished all the work in our data preparation and have 7208 rows and 31 columns of data to do the analysis. 4. Data mining solution First of all, we partition our data into training, validation and test three parts. In our data, we

have very few records of high risky group (1.86%). Therefore, we did a data partition with over-sampling and let the training data set has fifty percent of high risky group records. We used XLminer to run a classification tree to select important variables. We choose first three layers of the full tree, and found out that violation, gender, usage per week, purpose, job and income is the most important variables. Next step, we use these variables to run Naïve Bayes, K-Nearest Neighbors, Neural Network and Logistic Regression and set the cutoff probability to 0.5. We use confusion matrix as our performance evaluation. From the result table, we can see that Logistic Regression capture high risky group more accurate than other methods and KNN provide the smallest overall percent of error. To interpret our result more specifically, we create a cost matrix to compute the final cost of each method. According to government report, death rate in traffic accident is 15%and per death increase cost 15,720,000 NTD, while per injured increase 1,190,000 NTD cost. We assume that the cost of our strategy on high risky group is 1,000 NTD per case. After we consider the cost matrix me made, Logistic regression giving the best result, which means that it can save more money than other models. So we choose Logistic Regression model as our final data mining solution. 4. Conclusions Some Rising Issues of Business goal By using our model, we provide a convenient way to identify high risky group for government to take some actions to prevent them from dangerous accident and expend heavy social cost. However, there re some serious emerging issues. For example, who have authority to browse and use such output, and what would people feel and think about when they are defined as high risky group. We think there s still a call for figuring out better utilization of the data output. Limitation and Recommendation In our model and evaluation, we assume the cost of transforming a high risky rider to low risky rider is much less than the social cost expenditure. However, in reality, things would not go that perfect. For example, we may have larger cost to change riding behavior of one high risky rider. As a result, our performance evaluation may have to change the cutoff value to find the most optimized models that does not misclassify that much, and not merely try to capture more high risky riders as we ve done in this study. Based on the conclusions above, we suggest future practitioners to do: Improve the misclassification of whom is not victim but be predicted as. Connected these data to the actual death in the future. Track the responder of survey and combined them to the results in the future (slightly/badly car accident/ death in the accident).

Appendix Fig1. Classification Tree Fig2. Result Table