Improving performance of Memory Based Reasoning model using Weight of Evidence coded categorical variables
|
|
|
- Ronald York
- 9 years ago
- Views:
Transcription
1 Paper Improving performance of Memory Based Reasoning model using Weight of Evidence coded categorical variables Vinoth Kumar Raja, Vignesh Dhanabal and Dr. Goutam Chakraborty, Oklahoma State University ABSTRACT Memory based Reasoning (MBR) is an empirical classification method which works by comparing cases in hand with similar examples from the past and then applying that information to the new case. MBR modeling is based on the assumptions that the input variables are numeric, orthogonal to each other, and standardized. The latter two assumptions are taken care by Principal Components transformation of raw variables and using the components instead of the raw variables as inputs to MBR. To satisfy the first assumption, the categorical variables are often dummy coded. This raises issues such as increasing dimensionality and overfitting in the training data by introducing discontinuity in the response surface relating inputs and target variables. The Weight of Evidence (WOE) method overcomes this challenge. This method measures the relative response of the target for each group level of a categorical variable. Then the levels are replaced by the pattern of response of the target variable within that category. SAS Enterprise Miner s Interactive Grouping Node is used to achieve this. By this way the categorical variables are converted into numeric. This paper demonstrates the improvement in performance of an MBR model when categorical variables are WOE coded. A credit screening dataset obtained from SAS Education that comprises of 25 attributes for 3,000 applicants is used for this study. Three different types of MBR models were built using SAS Enterprise Miner s MBR node to check the improvement in performance. The results showed the MBR model with WOE coded categorical variables performed best based on misclassification rate. Using this data, when WOE coding was adopted the model misclassification rate decreased from to while the sensitivity of the model increased from to INTRODUCTION The objective of this study is to demonstrate the improvement in performance of Memory Based Reasoning (MBR) model when categorical variables are Weight of Evidence (WOE) coded instead of dummy coding. LITERATURE REVIEW MEMORY BASED REASONING Memory based Reasoning (MBR) is based on reasoning from memories of past experience. MBR modeling applies this information in classifying (or) predicting new record by finding neighbors similar to it. Hence this approach is case based instead of explanation based. MBR consists of two operations. First, to find the distance between case in hand and the all other observations in order to find its neighbors. Second, combining the results from the neighbors to arrive at the response. In SAS Enterprise Miner, the MBR node uses K-nearest neighbor algorithm where neighbors are determined by shortest Euclidean distance. To combine the results of the neighbors, democracy method is used. Hence, the response of the target for these neighbors are taken as votes that act as posterior probability for the response of case in hand. 1
2 Figure 1. An overview of Memory Based Reasoning Figure 1 shows the observations that are close to case in hand, their target values and their distance from case in hand. If we consider K as 3, then the nearest neighbors are observations with IDs 108, 12 and 7. And the target values are Y, N and Y respectively. Thus, the posterior probability for the response of case in hand to be Y is 2/3 or 67%. When the target variable is interval, the average of the K-nearest neighbors is calculated as the prediction for case in hand. MBR ASSUMPTIONS MBR modeling is based on the assumptions that the input variables are numeric, orthogonal to each other and standardized. The latter two assumptions are taken care by Principal Components transformation of raw variables and using the components instead of the raw variables as inputs to MBR. But to satisfy the first assumption, the categorical variables are dummy coded. This raises issues such as increase in dimensionality and overfitting in the training data by introducing discontinuity in the response surface relating inputs and target variables. WEIGHT OF EVIDENCE CODING Weight of Evidence (WOE) is the method of combining evidence in support of hypothesis i.e., it measures the relative response of target variable for each group level of a categorical variable. Thus the levels are replaced by the pattern of response of target variable within that category. By this way the categorical variables are converted into numeric variables. To compute the WOE, SAS credit scoring s Interactive grouping node is used. Figure 2. Calculation of Weight of Evidence 2
3 DATA PREPARATION The data used in this study is CS_ACCEPTS, a credit screening dataset obtained from SAS Education. The dataset contains 3,000 applicant instances and has 25 variables that were considered important to distinguish credit-worthy customers from non-credit-worthy. The 25 attributes comprises of sixteen categorical and nine continuous variables, where the levels of nominal variables ranges from 2 to 9. The basic requirement for MBR is it requires exactly one target variable, but that can be binary, nominal or interval variable. In our data, GB is the only target variable and it is binary. The variables Profession, Product, Residence ID had missing values and were imputed using the tree surrogate method. The important requirement for MBR is to satisfy the three assumptions of numeric, orthogonal and standardized. Principal Component s node in SAS Enterprise Miner is used to generate numeric, orthogonal and standardized variables that would be used as input for MBR modeling. MODELING Data is partitioned in the ratio 70:30 prior modeling. Three models are built - MBR with numeric variables only, MBR with numeric and dummy coded categorical variables and MBR with numeric and WOE coded categorical variables. In the MBR node of SAS Enterprise Miner, we have used the default settings: RD- Tree method to store and retrieve the nearest neighbors and k as 16. As this is a comparative study, same settings are used for all the models. Figure 3. Types of MBR models used in this study 3
4 For Model-1 only continuous variables are used. Model-2 uses both continuous are categorical. The Principal Components node converts the categorical variables to continuous using dummy coding method. In Model-3, there are two stages. Stage 1 is where categorical variables are WOE coded using Interactive Grouping node and thus converted into continuous variables. Stage 2 is where these converted continuous variables along with other continuous variables are fed into Principal Component node prior to MBR modeling. In Model-3, five binary variables: Finloan, Div, Title, Imp_resid and Location and seven nominal variable: Tel, Imp_prof, Car, Regn, Imp_prod, Bureau and Nat are rejected by Interactive Grouping Node because of their low Gini value and Information Value. Figure 4. Interactive Grouping Node Statistics Figure 4 shows that only three categorical variables Status, Cards and EC_Card were selected by Interactive Grouping Node by WOE coding for modeling based on Gini Statistic. MODEL COMPARISON To compare model performance of these three models, Model Comparison node is used. The model selection is based on least misclassification rate in the validation dataset. Figure 5 shows the models built for this study using SAS Enterprise Miner. Figure 5. Models built for this study using SAS Enterprise Miner 4
5 In reference to Figure 6, MBR model with WOE coded categorical variables outperforms MBR model with dummy coded categorical variables and MBR model with numeric variables only. The misclassification rate of Model-3 is less than that of Model-2. This difference indicates the significant improvement in the model performance when categorical variables are WOE coded. Figure 6. Model comparison statistics RESULTS DISCUSSION The misclassification rate of MBR model with WOE coded categorical variables is and that of MBR model with dummy coded categorical variables is In other words we can say that there is an increase of 9.94% in model performance when categorical variables are WOE coded instead of dummy coding. It is also to note that KS-Statistic, which indicates the separation of two classes of target variable, is greater for MBR model with WOE coded categorical variables than other models. Sensitivity of MBR model with WOE coded categorical variables is greater than that of MBR model with dummy coded categorical variables. Based on the ROC chart, the area under the curve (AUC) is large for Model-3 when compared to Model-2 which clearly indicates the outperformance of MBR model with WOE coded categorical variables. Figure 7. ROC charts for model built using SAS Enterprise Miner 5
6 CONCLUSION Two different models, MBR with numeric variables only and MBR with dummy coded categorical variables were used to compare the results of MBR with WOE coded categorical variables. Weight of Evidence coding of categorical variables improves the performance of MBR model significantly. The reasons being elimination of risk in dimensionality increase and the risk of over fitted training data in WOE coding. REFERENCES [1] C. Stanfill, D.L. Waltz (1986). Toward memory-based reasoning, Communications of the ACM 29:12, Dec, ; D. Aha, D. Kibler, M. Albert (1991). [2] DL Olson, D Delen (2008). Advanced Data Mining Techniques. [3] M.J.A. Berry and G.S Linoff (2004). Data Mining Techniques. CONTACT INFORMATION Your comments and questions are valued and encouraged. Contact the authors at: Vinoth Kumar Raja Phone: [email protected] Vignesh Dhanabal Oklahoma State University Phone: [email protected] Dr. Goutam Chakraborty Oklahoma State University [email protected] SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. indicates USA registration. Other brand and product names are trademarks of their respective companies. 6
Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP
Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP ABSTRACT In data mining modelling, data preparation
Predictive Modeling of Titanic Survivors: a Learning Competition
SAS Analytics Day Predictive Modeling of Titanic Survivors: a Learning Competition Linda Schumacher Problem Introduction On April 15, 1912, the RMS Titanic sank resulting in the loss of 1502 out of 2224
Internet Gambling Behavioral Markers: Using the Power of SAS Enterprise Miner 12.1 to Predict High-Risk Internet Gamblers
Paper 1863-2014 Internet Gambling Behavioral Markers: Using the Power of SAS Enterprise Miner 12.1 to Predict High-Risk Internet Gamblers Sai Vijay Kishore Movva, Vandana Reddy and Dr. Goutam Chakraborty;
EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER. Copyr i g ht 2013, SAS Ins titut e Inc. All rights res er ve d.
EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER ANALYTICS LIFECYCLE Evaluate & Monitor Model Formulate Problem Data Preparation Deploy Model Data Exploration Validate Models
Reference Books. Data Mining. Supervised vs. Unsupervised Learning. Classification: Definition. Classification k-nearest neighbors
Classification k-nearest neighbors Data Mining Dr. Engin YILDIZTEPE Reference Books Han, J., Kamber, M., Pei, J., (2011). Data Mining: Concepts and Techniques. Third edition. San Francisco: Morgan Kaufmann
APPLICATION PROGRAMMING: DATA MINING AND DATA WAREHOUSING
Wrocław University of Technology Internet Engineering Henryk Maciejewski APPLICATION PROGRAMMING: DATA MINING AND DATA WAREHOUSING PRACTICAL GUIDE Wrocław (2011) 1 Copyright by Wrocław University of Technology
Data mining and statistical models in marketing campaigns of BT Retail
Data mining and statistical models in marketing campaigns of BT Retail Francesco Vivarelli and Martyn Johnson Database Exploitation, Segmentation and Targeting group BT Retail Pp501 Holborn centre 120
Survival Analysis of the Patients Diagnosed with Non-Small Cell Lung Cancer Using SAS Enterprise Miner 13.1
Paper 11682-2016 Survival Analysis of the Patients Diagnosed with Non-Small Cell Lung Cancer Using SAS Enterprise Miner 13.1 Raja Rajeswari Veggalam, Akansha Gupta; SAS and OSU Data Mining Certificate
A Comparison of Decision Tree and Logistic Regression Model Xianzhe Chen, North Dakota State University, Fargo, ND
Paper D02-2009 A Comparison of Decision Tree and Logistic Regression Model Xianzhe Chen, North Dakota State University, Fargo, ND ABSTRACT This paper applies a decision tree model and logistic regression
Smart Sell Re-quote project for an Insurance company.
SAS Analytics Day Smart Sell Re-quote project for an Insurance company. A project by Ajay Guyyala Naga Sudhir Lanka Narendra Babu Merla Kiran Reddy Samiullah Bramhanapalli Shaik Business Situation XYZ
Improving SAS Global Forum Papers
Paper 3343-2015 Improving SAS Global Forum Papers Vijay Singh, Pankush Kalgotra, Goutam Chakraborty, Oklahoma State University, OK, US ABSTRACT Just as research is built on existing research, the references
Role of Customer Response Models in Customer Solicitation Center s Direct Marketing Campaign
Role of Customer Response Models in Customer Solicitation Center s Direct Marketing Campaign Arun K Mandapaka, Amit Singh Kushwah, Dr.Goutam Chakraborty Oklahoma State University, OK, USA ABSTRACT Direct
Reevaluating Policy and Claims Analytics: a Case of Non-Fleet Customers In Automobile Insurance Industry
Paper 1808-2014 Reevaluating Policy and Claims Analytics: a Case of Non-Fleet Customers In Automobile Insurance Industry Kittipong Trongsawad and Jongsawas Chongwatpol NIDA Business School, National Institute
Data Mining - Evaluation of Classifiers
Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010
Better credit models benefit us all
Better credit models benefit us all Agenda Credit Scoring - Overview Random Forest - Overview Random Forest outperform logistic regression for credit scoring out of the box Interaction term hypothesis
Paper 3508-2015. Downtime of a truck = Truck repair end date - Truck repair start date
Paper 3508-2015 Using Text from Repair Tickets of a Truck Manufacturing Company to Predict Factors that Contribute to Truck Downtime Ayush Priyadarshi and Dr. Goutam Chakraborty, Oklahoma State University
Determining optimum insurance product portfolio through predictive analytics BADM Final Project Report
2012 Determining optimum insurance product portfolio through predictive analytics BADM Final Project Report Dinesh Ganti(61310071), Gauri Singh(61310560), Ravi Shankar(61310210), Shouri Kamtala(61310215),
Lowering social cost of car accidents by predicting high-risk drivers
Lowering social cost of car accidents by predicting high-risk drivers Vannessa Peng Davin Tsai Shu-Min Yeh Why we do this? Traffic accident happened every day. In order to decrease the number of traffic
Email Spam Detection A Machine Learning Approach
Email Spam Detection A Machine Learning Approach Ge Song, Lauren Steimle ABSTRACT Machine learning is a branch of artificial intelligence concerned with the creation and study of systems that can learn
Developing Credit Scorecards Using Credit Scoring for SAS Enterprise Miner TM 12.1
Developing Credit Scorecards Using Credit Scoring for SAS Enterprise Miner TM 12.1 SAS Documentation The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2012. Developing
Data Mining: Overview. What is Data Mining?
Data Mining: Overview What is Data Mining? Recently * coined term for confluence of ideas from statistics and computer science (machine learning and database methods) applied to large databases in science,
Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data
CMPE 59H Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Non-linear
Social Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
Knowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Unit # 10 Sajjad Haider Fall 2012 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right
Methods for Interaction Detection in Predictive Modeling Using SAS Doug Thompson, PhD, Blue Cross Blue Shield of IL, NM, OK & TX, Chicago, IL
Paper SA01-2012 Methods for Interaction Detection in Predictive Modeling Using SAS Doug Thompson, PhD, Blue Cross Blue Shield of IL, NM, OK & TX, Chicago, IL ABSTRACT Analysts typically consider combinations
Knowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right
COMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction
COMP3420: Advanced Databases and Data Mining Classification and prediction: Introduction and Decision Tree Induction Lecture outline Classification versus prediction Classification A two step process Supervised
Decision Trees from large Databases: SLIQ
Decision Trees from large Databases: SLIQ C4.5 often iterates over the training set How often? If the training set does not fit into main memory, swapping makes C4.5 unpractical! SLIQ: Sort the values
Data Mining: A Magic Technology for College Recruitment. Tongshan Chang, Ed.D.
Data Mining: A Magic Technology for College Recruitment Tongshan Chang, Ed.D. Principal Administrative Analyst Admissions Research and Evaluation The University of California Office of the President [email protected]
Predicting borrowers chance of defaulting on credit loans
Predicting borrowers chance of defaulting on credit loans Junjie Liang ([email protected]) Abstract Credit score prediction is of great interests to banks as the outcome of the prediction algorithm
BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts
BIDM Project Predicting the contract type for IT/ITES outsourcing contracts N a n d i n i G o v i n d a r a j a n ( 6 1 2 1 0 5 5 6 ) The authors believe that data modelling can be used to predict if an
A Property & Casualty Insurance Predictive Modeling Process in SAS
Paper AA-02-2015 A Property & Casualty Insurance Predictive Modeling Process in SAS 1.0 ABSTRACT Mei Najim, Sedgwick Claim Management Services, Chicago, Illinois Predictive analytics has been developing
Classification of Bad Accounts in Credit Card Industry
Classification of Bad Accounts in Credit Card Industry Chengwei Yuan December 12, 2014 Introduction Risk management is critical for a credit card company to survive in such competing industry. In addition
Business Analytics and Credit Scoring
Study Unit 5 Business Analytics and Credit Scoring ANL 309 Business Analytics Applications Introduction Process of credit scoring The role of business analytics in credit scoring Methods of logistic regression
Data Mining Methods: Applications for Institutional Research
Data Mining Methods: Applications for Institutional Research Nora Galambos, PhD Office of Institutional Research, Planning & Effectiveness Stony Brook University NEAIR Annual Conference Philadelphia 2014
Classification algorithm in Data mining: An Overview
Classification algorithm in Data mining: An Overview S.Neelamegam #1, Dr.E.Ramaraj *2 #1 M.phil Scholar, Department of Computer Science and Engineering, Alagappa University, Karaikudi. *2 Professor, Department
What is Data Mining? MS4424 Data Mining & Modelling. MS4424 Data Mining & Modelling. MS4424 Data Mining & Modelling. MS4424 Data Mining & Modelling
MS4424 Data Mining & Modelling MS4424 Data Mining & Modelling Lecturer : Dr Iris Yeung Room No : P7509 Tel No : 2788 8566 Email : [email protected] 1 Aims To introduce the basic concepts of data mining
Some Essential Statistics The Lure of Statistics
Some Essential Statistics The Lure of Statistics Data Mining Techniques, by M.J.A. Berry and G.S Linoff, 2004 Statistics vs. Data Mining..lie, damn lie, and statistics mining data to support preconceived
2015 Workshops for Professors
SAS Education Grow with us Offered by the SAS Global Academic Program Supporting teaching, learning and research in higher education 2015 Workshops for Professors 1 Workshops for Professors As the market
Benchmarking of different classes of models used for credit scoring
Benchmarking of different classes of models used for credit scoring We use this competition as an opportunity to compare the performance of different classes of predictive models. In particular we want
Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms
Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail
Data mining techniques: decision trees
Data mining techniques: decision trees 1/39 Agenda Rule systems Building rule systems vs rule systems Quick reference 2/39 1 Agenda Rule systems Building rule systems vs rule systems Quick reference 3/39
Data Mining for Knowledge Management. Classification
1 Data Mining for Knowledge Management Classification Themis Palpanas University of Trento http://disi.unitn.eu/~themis Data Mining for Knowledge Management 1 Thanks for slides to: Jiawei Han Eamonn Keogh
Chapter 32 Histograms and Bar Charts. Chapter Table of Contents VARIABLES...470 METHOD...471 OUTPUT...472 REFERENCES...474
Chapter 32 Histograms and Bar Charts Chapter Table of Contents VARIABLES...470 METHOD...471 OUTPUT...472 REFERENCES...474 467 Part 3. Introduction 468 Chapter 32 Histograms and Bar Charts Bar charts are
Data Mining Fundamentals
Part I Data Mining Fundamentals Data Mining: A First View Chapter 1 1.11 Data Mining: A Definition Data Mining The process of employing one or more computer learning techniques to automatically analyze
Text Analytics using High Performance SAS Text Miner
Text Analytics using High Performance SAS Text Miner Edward R. Jones, Ph.D. Exec. Vice Pres.; Texas A&M Statistical Services Abstract: The latest release of SAS Enterprise Miner, version 13.1, contains
Chapter 12 Discovering New Knowledge Data Mining
Chapter 12 Discovering New Knowledge Data Mining Becerra-Fernandez, et al. -- Knowledge Management 1/e -- 2004 Prentice Hall Additional material 2007 Dekai Wu Chapter Objectives Introduce the student to
Data Mining Classification: Decision Trees
Data Mining Classification: Decision Trees Classification Decision Trees: what they are and how they work Hunt s (TDIDT) algorithm How to select the best split How to handle Inconsistent data Continuous
Course Description This course will change the way you think about data and its role in business.
INFO-GB.3336 Data Mining for Business Analytics Section 32 (Tentative version) Spring 2014 Faculty Class Time Class Location Yilu Zhou, Ph.D. Associate Professor, School of Business, Fordham University
ASSIGNMENT 4 PREDICTIVE MODELING AND GAINS CHARTS
DATABASE MARKETING Fall 2015, max 24 credits Dead line 15.10. ASSIGNMENT 4 PREDICTIVE MODELING AND GAINS CHARTS PART A Gains chart with excel Prepare a gains chart from the data in \\work\courses\e\27\e20100\ass4b.xls.
A Study to Predict No Show Probability for a Scheduled Appointment at Free Health Clinic
A Study to Predict No Show Probability for a Scheduled Appointment at Free Health Clinic Report prepared for Brandon Slama Department of Health Management and Informatics University of Missouri, Columbia
International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015
RESEARCH ARTICLE OPEN ACCESS Data Mining Technology for Efficient Network Security Management Ankit Naik [1], S.W. Ahmad [2] Student [1], Assistant Professor [2] Department of Computer Science and Engineering
The Science and Art of Market Segmentation Using PROC FASTCLUS Mark E. Thompson, Forefront Economics Inc, Beaverton, Oregon
The Science and Art of Market Segmentation Using PROC FASTCLUS Mark E. Thompson, Forefront Economics Inc, Beaverton, Oregon ABSTRACT Effective business development strategies often begin with market segmentation,
Knowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19 - Bagging. Tom Kelsey. Notes
Knowledge Discovery and Data Mining Lecture 19 - Bagging Tom Kelsey School of Computer Science University of St Andrews http://tom.host.cs.st-andrews.ac.uk [email protected] Tom Kelsey ID5059-19-B &
Comparison of Data Mining Techniques used for Financial Data Analysis
Comparison of Data Mining Techniques used for Financial Data Analysis Abhijit A. Sawant 1, P. M. Chawan 2 1 Student, 2 Associate Professor, Department of Computer Technology, VJTI, Mumbai, INDIA Abstract
Predictive Analytics in the Public Sector: Using Data Mining to Assist Better Target Selection for Audit
Predictive Analytics in the Public Sector: Using Data Mining to Assist Better Target Selection for Audit Duncan Cleary Revenue Irish Tax and Customs, Ireland [email protected] Abstract: Revenue, the Irish
Agenda. Mathias Lanner Sas Institute. Predictive Modeling Applications. Predictive Modeling Training Data. Beslutsträd och andra prediktiva modeller
Agenda Introduktion till Prediktiva modeller Beslutsträd Beslutsträd och andra prediktiva modeller Mathias Lanner Sas Institute Pruning Regressioner Neurala Nätverk Utvärdering av modeller 2 Predictive
MACHINE LEARNING IN HIGH ENERGY PHYSICS
MACHINE LEARNING IN HIGH ENERGY PHYSICS LECTURE #1 Alex Rogozhnikov, 2015 INTRO NOTES 4 days two lectures, two practice seminars every day this is introductory track to machine learning kaggle competition!
Nine Common Types of Data Mining Techniques Used in Predictive Analytics
1 Nine Common Types of Data Mining Techniques Used in Predictive Analytics By Laura Patterson, President, VisionEdge Marketing Predictive analytics enable you to develop mathematical models to help better
Analyzing Marine Piracy from Structured & Unstructured data using SAS Text Miner
Paper 3472-2015 Analyzing Marine Piracy from Structured & Unstructured data using SAS Text Miner Raghavender Reddy Byreddy, Globe Life and Accident Insurance Company; Anvesh Reddy Minukuri, Comcast Corporation;
Full and Complete Binary Trees
Full and Complete Binary Trees Binary Tree Theorems 1 Here are two important types of binary trees. Note that the definitions, while similar, are logically independent. Definition: a binary tree T is full
Knowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Evaluating the Accuracy of a Classifier Holdout, random subsampling, crossvalidation, and the bootstrap are common techniques for
Weight of Evidence Module
Formula Guide The purpose of the Weight of Evidence (WoE) module is to provide flexible tools to recode the values in continuous and categorical predictor variables into discrete categories automatically,
Modeling to improve the customer unit target selection for inspections of Commercial Losses in Brazilian Electric Sector - The case CEMIG
Paper 3406-2015 Modeling to improve the customer unit target selection for inspections of Commercial Losses in Brazilian Electric Sector - The case CEMIG Sérgio Henrique Rodrigues Ribeiro, CEMIG; Iguatinan
Classification and Prediction
Classification and Prediction Slides for Data Mining: Concepts and Techniques Chapter 7 Jiawei Han and Micheline Kamber Intelligent Database Systems Research Lab School of Computing Science Simon Fraser
Implementation of Data Mining Techniques to Perform Market Analysis
Implementation of Data Mining Techniques to Perform Market Analysis B.Sabitha 1, N.G.Bhuvaneswari Amma 2, G.Annapoorani 3, P.Balasubramanian 4 PG Scholar, Indian Institute of Information Technology, Srirangam,
CS 6220: Data Mining Techniques Course Project Description
CS 6220: Data Mining Techniques Course Project Description College of Computer and Information Science Northeastern University Spring 2013 General Goal In this project, you will have an opportunity to
BIG DATA IN HEALTHCARE THE NEXT FRONTIER
BIG DATA IN HEALTHCARE THE NEXT FRONTIER Divyaa Krishna Sonnad 1, Dr. Jharna Majumdar 2 2 Dean R&D, Prof. and Head, 1,2 Dept of CSE (PG), Nitte Meenakshi Institute of Technology Abstract: The world of
Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification
Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification Henrik Boström School of Humanities and Informatics University of Skövde P.O. Box 408, SE-541 28 Skövde
Data Mining Algorithms Part 1. Dejan Sarka
Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka ([email protected]) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses
Interactive Data Mining and Design of Experiments: the JMP Partition and Custom Design Platforms
: the JMP Partition and Custom Design Platforms Marie Gaudard, Ph. D., Philip Ramsey, Ph. D., Mia Stephens, MS North Haven Group March 2006 Table of Contents Abstract... 1 1. Data Mining... 1 1.1. What
Clustering. Data Mining. Abraham Otero. Data Mining. Agenda
Clustering 1/46 Agenda Introduction Distance K-nearest neighbors Hierarchical clustering Quick reference 2/46 1 Introduction It seems logical that in a new situation we should act in a similar way as in
Using multiple models: Bagging, Boosting, Ensembles, Forests
Using multiple models: Bagging, Boosting, Ensembles, Forests Bagging Combining predictions from multiple models Different models obtained from bootstrap samples of training data Average predictions or
Content-Based Recommendation
Content-Based Recommendation Content-based? Item descriptions to identify items that are of particular interest to the user Example Example Comparing with Noncontent based Items User-based CF Searches
SURVEY OF TEXT CLASSIFICATION ALGORITHMS FOR SPAM FILTERING
I J I T E ISSN: 2229-7367 3(1-2), 2012, pp. 233-237 SURVEY OF TEXT CLASSIFICATION ALGORITHMS FOR SPAM FILTERING K. SARULADHA 1 AND L. SASIREKA 2 1 Assistant Professor, Department of Computer Science and
Random Forest Based Imbalanced Data Cleaning and Classification
Random Forest Based Imbalanced Data Cleaning and Classification Jie Gu Software School of Tsinghua University, China Abstract. The given task of PAKDD 2007 data mining competition is a typical problem
A Property and Casualty Insurance Predictive Modeling Process in SAS
Paper 11422-2016 A Property and Casualty Insurance Predictive Modeling Process in SAS Mei Najim, Sedgwick Claim Management Services ABSTRACT Predictive analytics is an area that has been developing rapidly
Crawling and Detecting Community Structure in Online Social Networks using Local Information
Crawling and Detecting Community Structure in Online Social Networks using Local Information TU Delft - Network Architectures and Services (NAS) 1/12 Outline In order to find communities in a graph one
LCs for Binary Classification
Linear Classifiers A linear classifier is a classifier such that classification is performed by a dot product beteen the to vectors representing the document and the category, respectively. Therefore it
Predicting Readmission of Diabetic Patients using the high performance Support Vector Machine algorithm of SAS Enterprise Miner
Paper 3254-2015 Predicting Readmission of Diabetic Patients using the high performance Support Vector Machine algorithm of SAS Enterprise Miner Hephzibah Munnangi, MS, Dr. Goutam Chakraborty Oklahoma State
!"!!"#$$%&'()*+$(,%!"#$%$&'()*""%(+,'-*&./#-$&'(-&(0*".$#-$1"(2&."3$'45"
!"!!"#$$%&'()*+$(,%!"#$%$&'()*""%(+,'-*&./#-$&'(-&(0*".$#-$1"(2&."3$'45"!"#"$%&#'()*+',$$-.&#',/"-0%.12'32./4'5,5'6/%&)$).2&'7./&)8'5,5'9/2%.%3%&8':")08';:
Piecewise Logistic Regression: an Application in Credit Scoring
Piecewise Logistic Regression: an Application in Credit Scoring by Raymond Anderson Standard Bank of South Africa Presented at Credit Scoring and Control Conference XIV Edinburgh, 26-28 August 2015 Abstract
Sentiment analysis using emoticons
Sentiment analysis using emoticons Royden Kayhan Lewis Moharreri Steven Royden Ware Lewis Kayhan Steven Moharreri Ware Department of Computer Science, Ohio State University Problem definition Our aim was
Data Mining Yelp Data - Predicting rating stars from review text
Data Mining Yelp Data - Predicting rating stars from review text Rakesh Chada Stony Brook University [email protected] Chetan Naik Stony Brook University [email protected] ABSTRACT The majority
Data Mining III: Numeric Estimation
Data Mining III: Numeric Estimation Computer Science 105 Boston University David G. Sullivan, Ph.D. Review: Numeric Estimation Numeric estimation is like classification learning. it involves learning a
Gerry Hobbs, Department of Statistics, West Virginia University
Decision Trees as a Predictive Modeling Method Gerry Hobbs, Department of Statistics, West Virginia University Abstract Predictive modeling has become an important area of interest in tasks such as credit
An Overview of Data Mining: Predictive Modeling for IR in the 21 st Century
An Overview of Data Mining: Predictive Modeling for IR in the 21 st Century Nora Galambos, PhD Senior Data Scientist Office of Institutional Research, Planning & Effectiveness Stony Brook University AIRPO
Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal
Learning Example Chapter 18: Learning from Examples 22c:145 An emergency room in a hospital measures 17 variables (e.g., blood pressure, age, etc) of newly admitted patients. A decision is needed: whether
An Approach to Detect Spam Emails by Using Majority Voting
An Approach to Detect Spam Emails by Using Majority Voting Roohi Hussain Department of Computer Engineering, National University of Science and Technology, H-12 Islamabad, Pakistan Usman Qamar Faculty,
BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES
BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES 123 CHAPTER 7 BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES 7.1 Introduction Even though using SVM presents
An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset
P P P Health An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset Peng Liu 1, Elia El-Darzi 2, Lei Lei 1, Christos Vasilakis 2, Panagiotis Chountas 2, and Wei Huang
SCHOOL OF HEALTH AND HUMAN SCIENCES DON T FORGET TO RECODE YOUR MISSING VALUES
SCHOOL OF HEALTH AND HUMAN SCIENCES Using SPSS Topics addressed today: 1. Differences between groups 2. Graphing Use the s4data.sav file for the first part of this session. DON T FORGET TO RECODE YOUR
Predicting the Risk of Heart Attacks using Neural Network and Decision Tree
Predicting the Risk of Heart Attacks using Neural Network and Decision Tree S.Florence 1, N.G.Bhuvaneswari Amma 2, G.Annapoorani 3, K.Malathi 4 PG Scholar, Indian Institute of Information Technology, Srirangam,
