Variable Reduction for Predictive Modeling with Clustering
Robert Sanche and Kevin Lonergan, FCAS
Abstract

Motivation. Thousands of variables are contained in insurance data warehouses. In addition, external sources of information can be attached to the data contained in data warehouses. When actuaries build a predictive model, they are confronted with redundant variables, which reduce model efficiency (time to develop the model, interpretation of the results) and inflate the variance of the estimates. For these reasons, there is a need for a method to reduce the number of variables input to the predictive model.

Method. We have used proc varclus (SAS/STAT) to find clusters of variables defined at a geographical level and attached to a database of automobile policies. The procedure finds clusters of variables which are correlated among themselves and not correlated with variables in other clusters. Using business knowledge and the 1-R² ratio, cluster representatives can be selected, thus reducing the number of variables. Then, the cluster representatives are input to the predictive model.

Conclusions. The procedure used in the paper for variable clustering quickly reduces a set of numeric variables to a manageable reduced set of variable clusters.

Availability. proc varclus from SAS/STAT has been used for this study. We found an implementation of variable clustering in R, function varclus, though we did not experiment with it.

Keywords. variable reduction, clustering, statistical method, data mining, predictive modeling.

1. INTRODUCTION

Over the last decade, insurance companies have gathered a vast amount of data in their data warehouses. Some of this information is well known to actuaries because it is used for other purposes, e.g. pricing of the policy. Also, there are many sources of external data (demographics, financial, meteorological...) available from vendors.
The external sources are typically not as familiar to the actuary as the data from the data warehouses. This vast amount of information is available to create a predictive model. The objective of the predictive model could be to improve the pricing or reserving process, but also to analyze profitability, fraud, catastrophes, and any insurance operation. This amount of information from multiple sources provides numerous variables for the modeling project contemplated. When a modeling project involves numerous variables, the actuary is confronted with the need to reduce the number of variables in order to create the model. The variables sometimes have an unknown relationship with the objective of the modeling project. In addition, when there is a multitude of variables, it becomes difficult to find out the relationships between variables. Too many variables reduce model efficiency. With many variables there is a potential
of overfitting the data. The parameter estimates of the model are destabilized when variables are highly correlated with each other. Also, it is much more difficult to have an explainable model when there are many variables. Finally, creating models with all possible combinations of variables is exhaustive, but this approach would take an indefinite amount of time when there are thousands of variables. An intermediate approach to the exhaustive search would also take a lot of time, and some combinations of variables could be overlooked. If you want to reduce the number of variables to a smaller set of variable clusters for efficiency, you can use variable clustering. Variable clustering provides groups of variables where variables in a group are similar to other variables in the same group and as dissimilar as possible to variables in another group.

1.1 Research Context

This paper addresses the initial stage of every predictive modeling project performed by an actuary, i.e. variable selection. The variables selected would then become inputs to predictive modeling techniques such as linear regression, generalized linear models, and neural networks, to name a few. A technical description of the variable clustering algorithm, proc varclus, is included in the SAS/STAT User's Guide [4]. The method is not found in many textbooks on multivariate techniques; it mostly started as an implementation in statistical software. This paper is focused on variable clustering, but the example could be used, for example, in the context of a complement to territorial relativities for automobile insurance. This complement would be obtained from a predictive model based on variables defined at some geographical level. The variables were selected using variable clustering on multiple sources of information, usually not used in pricing, attached to an automobile policy database.
If the objective of the predictive model is to predict cost by territory, it makes sense to use fact (demographics, consumer expenditure, weather...) variables selected from the variable clustering on the multiple sources, defined at some geography (e.g. county), to complement territorial relativities. The example provided in the paper is a simplification of a variable reduction problem. Many more variables would be clustered in a real-life study. Note that the variables used in the example have some intuitive relation to automobile
insurance cost, although generally the variables presented to the variable clustering procedure are not previously filtered based on some educated guess. All the demographics, consumer expenditure, and weather variables are used in the clustering analysis. Filtering of variables is typically done after the variable clusters have been created. When there is a multitude of variables, it is more difficult to recognize irrelevant variables than to recognize redundant variables. A variable is considered irrelevant if it is not predictive for the specific predictive model. When the actuary deals with unknown data, a large number of the variables turn out to be irrelevant. A variable is redundant when it is highly correlated with another potential variable.

1.2 Objective

More and more actuaries use advanced statistical methods to create insurance models. This paper provides a tool, variable clustering, that can be added to the arsenal of the actuarial data miner. Traditionally, PCA has been used for variable reduction by creating a set of components (weighted linear combinations of the original variables) which are difficult to interpret. Typically, in the clustering literature, there is a rule for selecting the cluster representative, the 1-R² ratio. Business knowledge from subject matter experts should also complement this rule to guide the selection of variables. For this reason, someone could decide to use more than one variable per cluster. Even though the clustering procedure provides diagnostic measures, there are reasons for using more than one variable per cluster. One of them is that the maximum number of clusters is a parameter provided by the user of the procedure. Also, for communication to users of the predictive model, an alternate variable may provide a better intuitive interpretation of the model than the cluster representative.
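To illustrate the interpretability point, the following is a minimal Python sketch (ours, not from the paper, which uses SAS) on hypothetical data with two correlated groups of variables. Each principal component is a dense weighted combination of the original variables, whereas a cluster representative is simply one of the original columns.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: five variables forming two correlated groups
f1 = rng.normal(size=(500, 1))                        # latent factor for group 1
f2 = rng.normal(size=(500, 1))                        # latent factor for group 2
X = np.hstack([f1 + 0.2 * rng.normal(size=(500, 3)),  # vars 0-2 follow f1
               f2 + 0.2 * rng.normal(size=(500, 2))]) # vars 3-4 follow f2

Z = (X - X.mean(axis=0)) / X.std(axis=0)              # standardize
_, _, Vt = np.linalg.svd(Z, full_matrices=False)

# The first principal component spreads its weights over several of the
# original variables at once, rather than selecting a single one.
loadings = Vt[0]
print(np.round(loadings, 2))
```

A territory-level model built on such components is hard to explain to users, while a model built on, say, snowd alone keeps its original meaning.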
We should point out that variable clustering works only with numeric variables. However, there are ways to convert categorical variables into numeric variables. For example, the Hamming distance converts categorical variables into a numeric variable. Conversion of categorical variables is not covered in this paper. We suggest options (centroid without cov) to the variable clustering procedure, which turn out to produce a scale-invariant method. Otherwise it would probably be necessary to rescale the ranges of the variables (with proc standard).

1.3 Outline

The remainder of the paper proceeds as follows. Section 2 will provide an overview of
clustering and, more precisely, variable clustering. We will briefly describe the variable clustering algorithm used in this paper. Section 3 will provide an example of variable reduction in the context of automobile insurance. We will use variable clustering and will explain how variables can be selected to reduce their number. In Section 4, we conclude the study. In Appendix A, we include an example of the SAS code, and in Appendix B we include the procedure's output.

2. CLUSTERING

2.1 Clustering

"Cluster Analysis is a set of methods for constructing a sensible and informative classification of an initially unclassified set of data, using the variable values observed on each individual" [1]. In general, the goal of a cluster analysis is to divide a data set into groups of similar characteristics, such that observations in a group are as similar as possible and as dissimilar as possible to observations in another group. Variable clustering, however, does not divide a set of data; instead it splits a set of variables with similar characteristics using a set of subject data. Clustering is an unsupervised learning technique, as it describes how the data is organized without using an outcome [3]. As a comparison, regression is a supervised learning technique, as there is an outcome used to derive the model. Most data mining techniques are supervised learning techniques. Unsupervised techniques are only useful when there is redundancy in the data (variables). At the basis of clustering is the notion of similarity. Without supervision, there is no response to say that occurrence a is similar to occurrence b. If there were a response associated with each occurrence, it could be used to compare the responses of a and b to induce similarity between the two.

Similarity: Two occurrences are similar if they have common properties. For example, one occurrence is a car, another occurrence is a motorcycle, and the last occurrence is a bicycle. First, let's say we have only the number of wheels as a property.
Then we would cluster the motorcycle and the bicycle, since they have the same number of wheels. However, if we add the number of cylinders and fuel consumption, then the motorcycle is
more similar to the car. Similarity can be measured by a distance measure (Euclidean distance, Manhattan or city-block distance...) or correlation-type metrics.

There are two classes of clustering methods:

Hierarchical: This class of clustering produces clusters that are hierarchically nested within clusters from previous iterations. This is the most commonly used clustering technique.

Partitive: This class of clustering divides data into clusters by minimizing an error function of the distance between the observation vectors and the reference vectors (centroids - initial guesses). This clustering technique requires elaborate selection of parameters, and evaluation of the error function for all possible partitions is impractical.

There are two approaches to hierarchical clustering:

Agglomerative
1. Start with each observation as its own cluster
2. Compute the similarity between clusters
3. Merge the clusters that are most similar
4. Repeat from step 2 until one cluster is left

Divisive
1. Start with all observations assigned to one cluster
2. Compute the similarity between clusters
3. Split the cluster whose members are least similar
4. Repeat from step 2 until each observation is a cluster

2.2 Variable Clustering

The procedure used in this paper for variable clustering is both a divisive and an iterative algorithm. The procedure starts with a single cluster and recursively divides existing clusters into two sub-clusters until it reaches the stopping criteria, producing a hierarchy of disjoint clusters. As mentioned previously, the procedure starts with all variables in one cluster. Based on the smallest percentage of variation explained by its cluster component, a cluster is
chosen for splitting. The chosen cluster is split into two clusters by finding the first two principal components and assigning each variable to the component with which it has the higher correlation. The assignment follows a hierarchical structure with the approach presented in this paper. The clustering stops when the maximum number of clusters is attained or a given percentage of variation explained is reached.

3. VARIABLE CLUSTERING EXAMPLE

After the multiple sources of data (demographics, consumer expenditures, meteorological...) are attached to the auto policy database, variable clustering can be performed to reduce the number of variables. The SAS code is included in Appendix A. The rule dictates selecting the variable with the minimum 1-R² ratio as the cluster representative. The 1-R² ratio is defined below.

1-R² ratio = (1 - R²_own cluster) / (1 - R²_next closest cluster)    (3.1)

Intuitively, we want the cluster representative to be as closely correlated to its own cluster (R²_own → 1) and as uncorrelated to the next closest cluster (R²_next closest → 0). Therefore, the optimal representative of a cluster is a variable whose 1-R² ratio tends to zero. Below, we include an extract of the output from proc varclus (see Appendix B for additional output from the procedure) with three clusters. Based on the 1-R² ratio, we should select the variables snowd, cdensity, and lexp as cluster representatives.

3 Clusters

Cluster     Variable    R-squared with Own Cluster   R-squared with Next Closest   1-R**2 Ratio
Cluster 1   Raind
            Snowd       (Choose)
            Asnow
Cluster 2   Pdensity
            Cdensity    (Choose)
Cluster 3   Growth
            Lexp        (Choose)
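The split-and-select logic just described can be sketched outside of SAS. The following Python/NumPy example is an illustration under our own simplified assumptions, not the proc varclus implementation (function names and the synthetic data are ours): it splits a set of standardized variables by correlating each one with the first two principal components, then computes the 1-R² ratio of formula (3.1), taking each cluster's component to be its first principal component.

```python
import numpy as np

def split_variables(Z):
    """One divisive step: compute the first two principal components of the
    standardized data Z and assign each variable (column) to the component
    with which it is more highly correlated."""
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = Z @ Vt[:2].T                           # first two principal components
    corr = np.abs(np.array([[np.corrcoef(Z[:, j], scores[:, k])[0, 1]
                             for k in range(2)] for j in range(Z.shape[1])]))
    assign = corr.argmax(axis=1)
    return [np.flatnonzero(assign == g) for g in (0, 1)]

def one_minus_r2_ratios(Z, clusters):
    """Formula (3.1): for each variable, (1 - R^2 own) / (1 - R^2 next closest),
    with R^2 measured against each cluster's first principal component."""
    comps = []
    for idx in clusters:
        _, _, Vt = np.linalg.svd(Z[:, idx], full_matrices=False)
        comps.append(Z[:, idx] @ Vt[0])             # cluster component scores
    ratios = {}
    for ci, idx in enumerate(clusters):
        for j in idx:
            r2 = [np.corrcoef(Z[:, j], c)[0, 1] ** 2 for c in comps]
            own = r2[ci]
            nearest = max(r2[k] for k in range(len(comps)) if k != ci)
            ratios[j] = (1 - own) / (1 - nearest)
    return ratios

# Hypothetical data: columns 0-2 share one latent factor, columns 3-4 another
rng = np.random.default_rng(0)
f1, f2 = rng.normal(size=(500, 1)), rng.normal(size=(500, 1))
X = np.hstack([f1 + 0.2 * rng.normal(size=(500, 3)),
               f2 + 0.2 * rng.normal(size=(500, 2))])
Z = (X - X.mean(axis=0)) / X.std(axis=0)

clusters = split_variables(Z)
ratios = one_minus_r2_ratios(Z, clusters)
# Within each cluster, the representative is the variable with the lowest ratio
reps = [min(idx, key=lambda j: ratios[j]) for idx in clusters]
```

With the centroid option used in the paper's Appendix A, the cluster component would instead be the unweighted average of the cluster's standardized variables.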
After proc varclus, we have created a tree using proc tree, which shows how the variable clusters are created. The variables are displayed vertically. The proportion of variance explained at each clustering level is displayed on the horizontal axis.

[Tree diagram: the variables growth, lexp, snowd, asnow, and raind are listed on the vertical axis ("Name of Variable or Cluster"); the horizontal axis shows the proportion of variance explained, up to 1.0.]

In that example, variables with similar factual attributes were clustered together; the weather variables are in the same cluster and the density variables are in the same cluster. Even with more variables, similar grouping patterns are observed. If we consider three clusters, snowd, asnow, and raind would all be in one cluster, as they are on the same branch of the tree. The variable snowd would be the cluster representative since it has the lowest 1-R² ratio. The number of variables has been reduced and, now, we can efficiently create a predictive model to solve the problem at hand using linear regression, GLMs [5], or a neural network [6].

4. CONCLUSIONS

Given hundreds of variables, in order to create a predictive model, the variable clustering
procedure runs quickly and produces satisfying results. We were able to reduce the number of variables using this procedure in order to efficiently create a predictive model. An efficient model was defined as follows:

- Interpretable
- Stable
- Timely

With this procedure, the modeling process is sped up significantly. The hierarchies produced by this procedure are easily interpretable with the tree output. Subject-matter experts usually do not have the expertise to analyze statistical output in table form, but given the cluster hierarchy in tree output, they can easily uncover alternate cluster representatives or eliminate irrelevant inputs. Other variable reduction techniques (e.g. PCA) do not create interpretable and disjoint clusters.
Appendix A: Code

* Example of variable clustering ;
%let varlist=
   pdensity cdensity growth   /* demographics */
   lexp                       /* expenditures */
   raind snowd asnow          /* weather */
;

proc varclus data='c:\example.sas7bdat' outtree=tree centroid maxc=6;
   var &varlist;
   weight exp;
run;

axis1 label=(angle=0 rotate=0) minor=none;
axis2 minor=none order=(0 to 1 by 0.10);
proc tree data=tree horizontal vaxis=axis1 haxis=axis2;
   height _propor_;
run;

Appendix B: Output

Cluster summary: The cluster summary gives the number of variables in each cluster. The variation explained by the cluster is displayed. The proportion of variance explained is the variance explained divided by the total variance of the variables in the cluster. Also displayed in the summary are the R² of each variable with its own cluster, with its closest cluster, and the 1-R² ratio.

Cluster Summary for 3 Clusters

Cluster   Members   Cluster Variation   Variation Explained   Proportion Explained

Total variation explained =     Proportion =
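As a companion to the cluster summary, here is a small Python sketch (ours, not SAS output; the function name is an assumption) of how the variation explained by a cluster component and its proportion can be computed when the component is taken to be the cluster's first principal component. With the centroid option used in Appendix A, the component would instead be the average of the standardized variables.

```python
import numpy as np

def cluster_summary(Z_cluster):
    """Variation explained by the cluster's first principal component,
    and the proportion of the cluster's total variation it accounts for.
    Z_cluster: standardized observations for the variables in one cluster."""
    R = np.corrcoef(Z_cluster, rowvar=False)   # correlation matrix of the cluster
    eigvals = np.linalg.eigvalsh(R)            # eigenvalues in ascending order
    explained = eigvals[-1]                    # variance along the first component
    total = R.shape[0]                         # each standardized variable contributes 1
    return explained, explained / total

# Two perfectly correlated variables: the component explains everything
x = np.random.default_rng(2).normal(size=500)
explained, proportion = cluster_summary(np.column_stack([x, x]))
```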
3 Clusters

Cluster     Variable    Variable Label     R-squared with Own Cluster   R-squared with Next Closest   1-R**2 Ratio
Cluster 1   Raind       Rain
            Snowd       Snow
            Asnow       Snow
Cluster 2   Pdensity    Pop density
            Cdensity    Car density
Cluster 3   Growth      Pop growth
            Lexp        Leg expenditures

Standardized scoring coefficients: The standardized scoring coefficients predict clusters from the variables. If a variable is not in a cluster, then the coefficient is zero. SAS does not provide unstandardized scoring coefficients.

Standardized Scoring Coefficients

Variable (Label)            Cluster 1   Cluster 2   Cluster 3
Pdensity (Pop density)
Cdensity (Car density)
Growth (Pop growth)
Lexp (Leg expenditures)
Raind (Rain)
Snowd (Snow)
Asnow (Snow)

Cluster Structure: The cluster structure gives the correlation between the variables and the clusters.
Cluster Structure

Variable (Label)            Cluster 1   Cluster 2   Cluster 3
Pdensity (Pop density)
Cdensity (Car density)
Growth (Pop growth)
Lexp (Leg expenditures)
Raind (Rain)
Snowd (Snow)
Asnow (Snow)

Inter-Cluster Correlations: This table provides the correlations between the clusters.

Cluster 3 will be split because it has the smallest proportion of variation explained, which is less than the PROPORTION=1 value.

Final summary: The cluster summary and the other tables are listed for each number of clusters up to the maximum number of clusters (option maxc). This table is listed at the end of the output and summarizes, for each number of clusters, the total variation and proportion explained by the clusters, the minimum proportion explained by a cluster, the minimum R² for a variable, and the maximum 1-R² ratio for a variable.
Number of   Total Variation         Proportion of Variation   Minimum Proportion       Minimum R-squared   Maximum 1-R**2 Ratio
Clusters    Explained by Clusters   Explained by Clusters     Explained by a Cluster   for a Variable      for a Variable

REFERENCES

[1] B.S. Everitt, The Cambridge Dictionary of Statistics, Cambridge University Press, 1998.
[2] David J. Pasta and Diana Suhr, "Creating Scales from Questionnaires: PROC VARCLUS vs. Factor Analysis," SUGI 29 Proceedings, 2004.
[3] Trevor Hastie, Robert Tibshirani, and Jerome Friedman, The Elements of Statistical Learning, Springer, 2001.
[4] SAS Institute Inc., SAS/STAT 9.1 User's Guide, Cary, NC: SAS Institute, 2004.
[5] Holler, "Something Old, Something New in Classification Ratemaking with a Novel Use of GLMs for Credit Insurance," Casualty Actuarial Society Forum, Winter.
[6] Francis, "Neural Networks Demystified," Casualty Actuarial Society Forum, Winter.

Abbreviations and notations
PCA, principal component analysis
proc, procedure in SAS
GLM, generalized linear model

Biographies of the Authors

Robert Sanche is a consultant with Tillinghast, a business of Towers Perrin. He is responsible for predictive modeling projects. Prior to joining Tillinghast, he developed class plans for personal lines automobile using multivariate techniques with Travelers and The Hartford. He has also worked for GMAC Insurance and another carrier in personal lines, doing data mining and ratemaking respectively. He has degrees in Mathematics (Actuarial Science) and Computer Science (Operations Research) from Université de Montréal.

Kevin Lonergan, FCAS, graduated from Southern Connecticut State University with a BS in 1969 and an MS in 1972. He taught mathematics in high school beginning in 1969. He developed a new automobile product at the turn of the century.

Casualty Actuarial Society Forum, Winter 2006
Clustering Connectionist and Statistical Language Processing
Clustering Connectionist and Statistical Language Processing Frank Keller [email protected] Computerlinguistik Universität des Saarlandes Clustering p.1/21 Overview clustering vs. classification supervised
Chapter 7. Cluster Analysis
Chapter 7. Cluster Analysis. What is Cluster Analysis?. A Categorization of Major Clustering Methods. Partitioning Methods. Hierarchical Methods 5. Density-Based Methods 6. Grid-Based Methods 7. Model-Based
STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and
Clustering Techniques and STATISTICA Case Study: Defining Clusters of Shopping Center Patrons STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table
CLUSTERING FOR FORENSIC ANALYSIS
IMPACT: International Journal of Research in Engineering & Technology (IMPACT: IJRET) ISSN(E): 2321-8843; ISSN(P): 2347-4599 Vol. 2, Issue 4, Apr 2014, 129-136 Impact Journals CLUSTERING FOR FORENSIC ANALYSIS
Data mining and statistical models in marketing campaigns of BT Retail
Data mining and statistical models in marketing campaigns of BT Retail Francesco Vivarelli and Martyn Johnson Database Exploitation, Segmentation and Targeting group BT Retail Pp501 Holborn centre 120
Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.
Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C
A Comparison of Decision Tree and Logistic Regression Model Xianzhe Chen, North Dakota State University, Fargo, ND
Paper D02-2009 A Comparison of Decision Tree and Logistic Regression Model Xianzhe Chen, North Dakota State University, Fargo, ND ABSTRACT This paper applies a decision tree model and logistic regression
Gerry Hobbs, Department of Statistics, West Virginia University
Decision Trees as a Predictive Modeling Method Gerry Hobbs, Department of Statistics, West Virginia University Abstract Predictive modeling has become an important area of interest in tasks such as credit
Statistical Databases and Registers with some datamining
Unsupervised learning - Statistical Databases and Registers with some datamining a course in Survey Methodology and O cial Statistics Pages in the book: 501-528 Department of Statistics Stockholm University
Data Mining Techniques Chapter 6: Decision Trees
Data Mining Techniques Chapter 6: Decision Trees What is a classification decision tree?.......................................... 2 Visualizing decision trees...................................................
Leveraging Ensemble Models in SAS Enterprise Miner
ABSTRACT Paper SAS133-2014 Leveraging Ensemble Models in SAS Enterprise Miner Miguel Maldonado, Jared Dean, Wendy Czika, and Susan Haller SAS Institute Inc. Ensemble models combine two or more models to
DEA implementation and clustering analysis using the K-Means algorithm
Data Mining VI 321 DEA implementation and clustering analysis using the K-Means algorithm C. A. A. Lemos, M. P. E. Lins & N. F. F. Ebecken COPPE/Universidade Federal do Rio de Janeiro, Brazil Abstract
Lecture 3: Linear methods for classification
Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,
Supervised Learning (Big Data Analytics)
Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used
Data Mining: Overview. What is Data Mining?
Data Mining: Overview What is Data Mining? Recently * coined term for confluence of ideas from statistics and computer science (machine learning and database methods) applied to large databases in science,
Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining
Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 by Tan, Steinbach, Kumar 1 What is Cluster Analysis? Finding groups of objects such that the objects in a group will
Machine Learning Big Data using Map Reduce
Machine Learning Big Data using Map Reduce By Michael Bowles, PhD Where Does Big Data Come From? -Web data (web logs, click histories) -e-commerce applications (purchase histories) -Retail purchase histories
Unsupervised Learning and Data Mining. Unsupervised Learning and Data Mining. Clustering. Supervised Learning. Supervised Learning
Unsupervised Learning and Data Mining Unsupervised Learning and Data Mining Clustering Decision trees Artificial neural nets K-nearest neighbor Support vectors Linear regression Logistic regression...
not possible or was possible at a high cost for collecting the data.
Data Mining and Knowledge Discovery Generating knowledge from data Knowledge Discovery Data Mining White Paper Organizations collect a vast amount of data in the process of carrying out their day-to-day
Nagarjuna College Of
Nagarjuna College Of Information Technology (Bachelor in Information Management) TRIBHUVAN UNIVERSITY Project Report on World s successful data mining and data warehousing projects Submitted By: Submitted
Combining Linear and Non-Linear Modeling Techniques: EMB America. Getting the Best of Two Worlds
Combining Linear and Non-Linear Modeling Techniques: Getting the Best of Two Worlds Outline Who is EMB? Insurance industry predictive modeling applications EMBLEM our GLM tool How we have used CART with
Rachel J. Goldberg, Guideline Research/Atlanta, Inc., Duluth, GA
PROC FACTOR: How to Interpret the Output of a Real-World Example Rachel J. Goldberg, Guideline Research/Atlanta, Inc., Duluth, GA ABSTRACT THE METHOD This paper summarizes a real-world example of a factor
Hierarchical Cluster Analysis Some Basics and Algorithms
Hierarchical Cluster Analysis Some Basics and Algorithms Nethra Sambamoorthi CRMportals Inc., 11 Bartram Road, Englishtown, NJ 07726 (NOTE: Please use always the latest copy of the document. Click on this
Data Mining Practical Machine Learning Tools and Techniques
Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea
Example: Document Clustering. Clustering: Definition. Notion of a Cluster can be Ambiguous. Types of Clusterings. Hierarchical Clustering
Overview Prognostic Models and Data Mining in Medicine, part I Cluster Analsis What is Cluster Analsis? K-Means Clustering Hierarchical Clustering Cluster Validit Eample: Microarra data analsis 6 Summar
Data Mining for Knowledge Management. Classification
1 Data Mining for Knowledge Management Classification Themis Palpanas University of Trento http://disi.unitn.eu/~themis Data Mining for Knowledge Management 1 Thanks for slides to: Jiawei Han Eamonn Keogh
Framing Business Problems as Data Mining Problems
Framing Business Problems as Data Mining Problems Asoka Diggs Data Scientist, Intel IT January 21, 2016 Legal Notices This presentation is for informational purposes only. INTEL MAKES NO WARRANTIES, EXPRESS
Index Contents Page No. Introduction . Data Mining & Knowledge Discovery
Index Contents Page No. 1. Introduction 1 1.1 Related Research 2 1.2 Objective of Research Work 3 1.3 Why Data Mining is Important 3 1.4 Research Methodology 4 1.5 Research Hypothesis 4 1.6 Scope 5 2.
COMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction
COMP3420: Advanced Databases and Data Mining Classification and prediction: Introduction and Decision Tree Induction Lecture outline Classification versus prediction Classification A two step process Supervised
Practical Data Science with Azure Machine Learning, SQL Data Mining, and R
Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be
Summary Data Mining & Process Mining (1BM46) Content. Made by S.P.T. Ariesen
Summary Data Mining & Process Mining (1BM46) Made by S.P.T. Ariesen Content Data Mining part... 2 Lecture 1... 2 Lecture 2:... 4 Lecture 3... 7 Lecture 4... 9 Process mining part... 13 Lecture 5... 13
Standardization and Its Effects on K-Means Clustering Algorithm
Research Journal of Applied Sciences, Engineering and Technology 6(7): 399-3303, 03 ISSN: 040-7459; e-issn: 040-7467 Maxwell Scientific Organization, 03 Submitted: January 3, 03 Accepted: February 5, 03
An Overview of Knowledge Discovery Database and Data mining Techniques
An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,
SAS Software to Fit the Generalized Linear Model
SAS Software to Fit the Generalized Linear Model Gordon Johnston, SAS Institute Inc., Cary, NC Abstract In recent years, the class of generalized linear models has gained popularity as a statistical modeling
Clustering. 15-381 Artificial Intelligence Henry Lin. Organizing data into clusters such that there is
Clustering 15-381 Artificial Intelligence Henry Lin Modified from excellent slides of Eamonn Keogh, Ziv Bar-Joseph, and Andrew Moore What is Clustering? Organizing data into clusters such that there is
Reference Books. Data Mining. Supervised vs. Unsupervised Learning. Classification: Definition. Classification k-nearest neighbors
Classification k-nearest neighbors Data Mining Dr. Engin YILDIZTEPE Reference Books Han, J., Kamber, M., Pei, J., (2011). Data Mining: Concepts and Techniques. Third edition. San Francisco: Morgan Kaufmann
Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus
Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus 1. Introduction Facebook is a social networking website with an open platform that enables developers to extract and utilize user information
There are a number of different methods that can be used to carry out a cluster analysis; these methods can be classified as follows:
Statistics: Rosie Cornish. 2007. 3.1 Cluster Analysis 1 Introduction This handout is designed to provide only a brief introduction to cluster analysis and how it is done. Books giving further details are
Banking Analytics Training Program
Training (BAT) is a set of courses and workshops developed by Cognitro Analytics team designed to assist banks in making smarter lending, marketing and credit decisions. Analyze Data, Discover Information,
EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set
EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set Amhmed A. Bhih School of Electrical and Electronic Engineering Princy Johnson School of Electrical and Electronic Engineering Martin
Foundations of Artificial Intelligence. Introduction to Data Mining
Foundations of Artificial Intelligence Introduction to Data Mining Objectives Data Mining Introduce a range of data mining techniques used in AI systems including : Neural networks Decision trees Present
Least Squares Estimation
Least Squares Estimation SARA A VAN DE GEER Volume 2, pp 1041 1045 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S Everitt & David
Exercise 1.12 (Pg. 22-23)
Individuals: The objects that are described by a set of data. They may be people, animals, things, etc. (Also referred to as Cases or Records) Variables: The characteristics recorded about each individual.
LVQ Plug-In Algorithm for SQL Server
LVQ Plug-In Algorithm for SQL Server Licínia Pedro Monteiro Instituto Superior Técnico [email protected] I. Executive Summary In this Resume we describe a new functionality implemented
PRINCIPAL COMPONENT ANALYSIS
1 Chapter 1 PRINCIPAL COMPONENT ANALYSIS Introduction: The Basics of Principal Component Analysis........................... 2 A Variable Reduction Procedure.......................................... 2
Data Preprocessing. Week 2
Data Preprocessing Week 2 Topics Data Types Data Repositories Data Preprocessing Present homework assignment #1 Team Homework Assignment #2 Read pp. 227 240, pp. 250 250, and pp. 259 263 the text book.
