# Cluster this! June 2011

Save this PDF as:

Size: px
Start display at page:

## Transcription

1 Cluster this! June 2011

2 Agenda On the agenda today: SAS Enterprise Miner (some of the pros and cons of using) How multivariate statistics can be applied to a business problem using clustering Some cool variable reduction methods Type of modelling techniques possible and scenarios where each is applicable How to evaluate the cluster models once built 2

3 Power of applied MV statistics Applied statistical methods can be a very powerful tool in answering some difficult business problems. Often it has the stigma of being difficult to understand since some methods are very complex such as multivariate analysis (MV)! But using tools such as SAS Enterprise Miner and Enterprise Guide can assist you in helping explain some of the more complex methods through graphs, visualizations and other diagnostics. In banking where on occasion the business problems can be complex multivariate statistics can come in handy since it helps uncover patterns not easily revealed by simple statistics. Today we will discuss in more detail the specific MV method of clustering.. 3

4 Statistical Analysis What is clustering? By examining more than one characteristic, similar cases can be group together into a cluster. Clusters are distinguished from each other based on the differences. Although you may need a PhD develop the statistics to create new clustering techniques you certainly do not need a PhD to understand when and how to apply it! Can solve difficult questions or business problems with clustering As an example think about the classification of cars.. Different types of vehicles on the surface you can say all have 4 doors and engines, but by digging a little deeper and looking at it on a number of variables such as engine type, fuel efficiency, luxury options, you can start to create a very distinct taxonomy: sports cars, SUVs, etc. 4

5 Using Enterprise Guide to cluster We all know that Enterprise Miner can be used to do modelling! It uses the SEMMA -> Sample; Explore; Modify; Model; and Assess framework to build and deploy models quickly!! 5

6 Nodal structure Sample: Functions related to data sampling such as appending, partitions, file import, merging etc. Explore: Functions related to finding relationships in your data, such as Multi-plots, associations, clustering, self organizing maps etc. Modify: Transform your data, but imputing, creating new variables, consolidating (PCA) etc. Model: Run various modeling frameworks, neural networks, decision trees, regression etc. Assess: Evaluate and measure model performance Utility: Run custom SAS code, edit metadata, control points etc. 5

7 SEMMA outside of Enterprise Miner SEMMA is a way to organize your models and has nothing to do with the modelling itself. You can implement SEMMA without using EM Limited customizable visualization techniques with EM Limited options to customize (not all modeling options are listed, for example in clustering limited to average, centriod, and Wards), great for black box environments!! Not so good for audit when you have to explain yourself Already do the Sampling, Exploring, Modifying, and Assessing outside of Enterprise Miner, only bring it in to do some Modelling! 5

8 Business Problem: UTR reporting from Branches From a UTR reporting perspective many of the best leads of suspicious activity occur at the Branch level as these cases have already been looked at by a human being. From a regulatory perspective, these cases must be reviewed upon receipt. Accurate reporting is important, but more importantly want to ensure each branch is reporting what they should be. How to compare whether a branch is reporting accurately? This can be difficult since branches can vary widely from area to area, differences in size, type of business they conduct, area, and general client demographic and greatly effect the amount of UTR reporting that a branch does. 6

9 How clustering can assist in your analysis Defined our business problem: identify branches that reporting zero UTRs, or under reporting the number of UTRs Using a cluster model will assist in determining similar branches and group them together. Once this task is complete, the analysis can be continued by examining branches within a cluster with each other to determine who appears to be conducting normal vs. non-normal activity. A very powerful tool to profile and group data together. Very good method to find similarities between branches by digging deeper and finding connections that are not apparent. 8

10 How do you build a cluster model Several steps that you need to do when building out a cluster model: Data Gathering } 85% of your time will be spent on these steps, as they are the most Data Cleansing time intensive, and likely to skew results if not done properly, GIGO! Perform the Clustering Evaluate the Clusters 9

11 Overall Process Methodology 10

12 Variable Selection Variable selection can be done using a variety of analysis techniques such as principal components, factor analysis, SAS process such as varclus, or straight correlations. In our exercise this portion of the analysis also included: Ensuring the data files are accurate Looking for Outliers in the data Checking if there are restricted ranges in the continuous variables Checking if there are unequal cell sizes in categorical variables Distributions of the variables Co-linearity between variables Covariance matrices are homogeneous Extent and nature of missing data Discuss Statistical Analysis Method, and approval to move ahead Execute Part I Review results from Part I Milestone I: Short-list of variables 11

13 Factor Analysis Factor analysis and PCAs are very similar in outcome but the roads and reasons for using either are very different, as the assumptions when using either also differ greatly. 13

14 Variable Selection and Exploratory Data Analysis Proc factor was run to reduce the number of redundant variables down to 35: 15

15 PCA Primary use of either Principal components or factor analysis is both data reduction and summarization. Getting the most bang for your buck, that is, less is more! With PCA you are accounting for the maximum variance in a minimal number of variables ( super-variables ). Its operation can be thought of as revealing the internal structure of the data in a way which best explains the variance in the data. The results of a PCA are usually discussed in terms of component scores (the transformed variable values corresponding to a particular case in the data) and loadings (the weight by which each standardized original variable should be multiplied to get the component score) 1. 1 Shaw PJA, Multivariate statistics for the Environmental Sciences, (2003) Hodder-Arnold 12

16 PCA in Enterprise Guide 12

17 Variance Explained by the PCA Factors 19

18 PCA Driving Factors 18

19 PCA Plots show evidence of Clusters 20

20 What method to use? Depends on what you want to do! 14

21 What Variables are most Important Key Drivers Population related Geographic Branch Related Internal RBC Variable Description Average Income, Population age, Population type by gender, and family unit (single, families), university graduates, homeowners, renters, children per family etc. Urban or Rural area, Population density etc. Number of days branch open, number of competitors near by, closest branches, branch size etc. Data warehouse related variables such as transit postal code, city, number of business and personal clients etc. Eventually using a combination of techniques 300+ variables were reduced to about 35. Then PCA analysis was performing to get the most out of the 35 remaining variables. 16

22 Cluster Methodology A short list of variables have already been identified through previous analysis, the following methodology was used to find meaningful groups of homogeneous branches: 21

23 Wards Clustering in Enterprise Guide 24

24 Semi-Partial R 2, R 2, and CCC What statistics you can look at and why? Like many SAS outputs, cluster output gives you a number of different statistics to look at to help evaluate, first if the clustering worked, secondly how many clusters are optimal for the solution. Referring to the output of Wards clustering, the following selected statistics are helpful: 1. SpRSq (semipartial R-squared) is a measure of the homogeneity of merged clusters, in other words how similar the cluster elements are to each other. Thus, the SPRSQ value should be small to imply that we are merging two homogeneous groups. 2. RSq (R-squared) measures the extent to which groups or clusters are different from each other (so, when you have just one cluster RSQ value is, intuitively, zero). RSQ value should high, or as close to 1 as possible as it explains the proportion of variance accounted for by the clusters. 3. CCC (Cubic Clustering Criterion), rule of thumb using this statistic is that a value of greater than 2 indicates good clusters, values between 0 and 2 indicate potential clusters [and used with caution]; large negative values could indicate outliers in the data. 25

25 Cluster Performance Before pre-processing After pre-processing: 26

26 How to evaluate the cluster models We already looked at a series of statistics that can help us decide but what else?? Graph PCAs to check for evidence of clusters Box plots Testing!! Go out and validate and see if the groupings make sense There is no one right answer!! It s a model, and will never be 100% correct all the time. 27

27 Example: Cluster 7, Urban Competition Cluster 7 contains 61 transits. Area Branch Size Main drivers in this cluster: High Amount of Investors High Number of Renters Lots of Competition nearby Higher Crime areas High number of anonymous sessions per branch Median Income Median Income 28

28 Cluster 7, Urban Competition 8 Branches in Cluster 12 are on average 68 km away from any other branches 29

29 Cluster 7, Urban Competition 30

30 Box Plots Number of STRs (in last 5 years) per transit * Note that Box plots are based on 10 th -90 th percentiles for each cluster. 31

31 Plot of PCA Clusters 6, 8, 11 32

32 Branch related How everything fits together! Population High Income Urban Competition Low Income Suburban/ Rural Elders Large Super Branches Urban Low Income Renters Extreme Rural Geographical Suburban Homeowners Internal RBC data Young Immigrant Families/ Suburban Family Investors 33

33 Findings Key groupings include: Small branches / Rural areas / well established Mid size branches in Urban / young / educated / immigrants Large branches in Lower income / higher crime areas Suburban mid-size branches / families / new branches Solution effectively identifies groups based on size and potential to better measure AML risk. 34

34 In Summary.. Solution effectively identifies groups based on size and potential to better measure AML risk. EG played a key role Apply statistics without deriving them (just know how and when to use them)! 35

35 Questions? Contact: 36

### Application of SAS! Enterprise Miner in Credit Risk Analytics. Presented by Minakshi Srivastava, VP, Bank of America

Application of SAS! Enterprise Miner in Credit Risk Analytics Presented by Minakshi Srivastava, VP, Bank of America 1 Table of Contents Credit Risk Analytics Overview Journey from DATA to DECISIONS Exploratory

### White Paper. Thirsting for Insight? Quench It With 5 Data Management for Analytics Best Practices.

White Paper Thirsting for Insight? Quench It With 5 Data Management for Analytics Best Practices. Contents Data Management: Why It s So Essential... 1 The Basics of Data Preparation... 1 1: Simplify Access

### Customer Profiling for Marketing Strategies in a Healthcare Environment MaryAnne DePesquo, Phoenix, Arizona

Paper 1285-2014 Customer Profiling for Marketing Strategies in a Healthcare Environment MaryAnne DePesquo, Phoenix, Arizona ABSTRACT In this new era of healthcare reform, health insurance companies have

Data Mining with SAS Mathias Lanner mathias.lanner@swe.sas.com Copyright 2010 SAS Institute Inc. All rights reserved. Agenda Data mining Introduction Data mining applications Data mining techniques SEMMA

### Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP

Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP ABSTRACT In data mining modelling, data preparation

### CLUSTER ANALYSIS. Kingdom Phylum Subphylum Class Order Family Genus Species. In economics, cluster analysis can be used for data mining.

CLUSTER ANALYSIS Introduction Cluster analysis is a technique for grouping individuals or objects hierarchically into unknown groups suggested by the data. Cluster analysis can be considered an alternative

### Variable Reduction for Predictive Modeling with Clustering Robert Sanche, and Kevin Lonergan, FCAS

Variable Reduction for Predictive Modeling with Clustering Robert Sanche, and Kevin Lonergan, FCAS Abstract Motivation. Thousands of variables are contained in insurance data warehouses. In addition, external

### Use Data Mining Techniques to Assist Institutions in Achieving Enrollment Goals: A Case Study

Use Data Mining Techniques to Assist Institutions in Achieving Enrollment Goals: A Case Study Tongshan Chang The University of California Office of the President CAIR Conference in Pasadena 11/13/2008

### WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat

Information Builders enables agile information solutions with business intelligence (BI) and integration technologies. WebFOCUS the most widely utilized business intelligence platform connects to any enterprise

### Text Analytics using High Performance SAS Text Miner

Text Analytics using High Performance SAS Text Miner Edward R. Jones, Ph.D. Exec. Vice Pres.; Texas A&M Statistical Services Abstract: The latest release of SAS Enterprise Miner, version 13.1, contains

### How Organisations Are Using Data Mining Techniques To Gain a Competitive Advantage John Spooner SAS UK

How Organisations Are Using Data Mining Techniques To Gain a Competitive Advantage John Spooner SAS UK Agenda Analytics why now? The process around data and text mining Case Studies The Value of Information

### Lecture 2: Descriptive Statistics and Exploratory Data Analysis

Lecture 2: Descriptive Statistics and Exploratory Data Analysis Further Thoughts on Experimental Design 16 Individuals (8 each from two populations) with replicates Pop 1 Pop 2 Randomly sample 4 individuals

### Applying Customer Attitudinal Segmentation to Improve Marketing Campaigns Wenhong Wang, Deluxe Corporation Mark Antiel, Deluxe Corporation

Applying Customer Attitudinal Segmentation to Improve Marketing Campaigns Wenhong Wang, Deluxe Corporation Mark Antiel, Deluxe Corporation ABSTRACT Customer segmentation is fundamental for successful marketing

### Data Mining Applications in Higher Education

Executive report Data Mining Applications in Higher Education Jing Luan, PhD Chief Planning and Research Officer, Cabrillo College Founder, Knowledge Discovery Laboratories Table of contents Introduction..............................................................2

### Interpreting Data in Normal Distributions

Interpreting Data in Normal Distributions This curve is kind of a big deal. It shows the distribution of a set of test scores, the results of rolling a die a million times, the heights of people on Earth,

### Addressing Analytics Challenges in the Insurance Industry. Noe Tuason California State Automobile Association

Addressing Analytics Challenges in the Insurance Industry Noe Tuason California State Automobile Association Overview Two Challenges: 1. Identifying High/Medium Profit who are High/Low Risk of Flight Prospects

### An Overview and Evaluation of Decision Tree Methodology

An Overview and Evaluation of Decision Tree Methodology ASA Quality and Productivity Conference Terri Moore Motorola Austin, TX terri.moore@motorola.com Carole Jesse Cargill, Inc. Wayzata, MN carole_jesse@cargill.com

### ASSIGNMENT 4 PREDICTIVE MODELING AND GAINS CHARTS

DATABASE MARKETING Fall 2015, max 24 credits Dead line 15.10. ASSIGNMENT 4 PREDICTIVE MODELING AND GAINS CHARTS PART A Gains chart with excel Prepare a gains chart from the data in \\work\courses\e\27\e20100\ass4b.xls.

### IBM SPSS Direct Marketing 23

IBM SPSS Direct Marketing 23 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 23, release

### Demographics of Atlanta, Georgia:

Demographics of Atlanta, Georgia: A Visual Analysis of the 2000 and 2010 Census Data 36-315 Final Project Rachel Cohen, Kathryn McKeough, Minnar Xie & David Zimmerman Ethnicities of Atlanta Figure 1: From

### Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization GENOME 560, Spring 2012 Data are interesting because they help us understand the world Genomics: Massive Amounts

### Data Exploration Data Visualization

Data Exploration Data Visualization What is data exploration? A preliminary exploration of the data to better understand its characteristics. Key motivations of data exploration include Helping to select

### Data Mining Using SAS Enterprise Miner Randall Matignon, Piedmont, CA

Data Mining Using SAS Enterprise Miner Randall Matignon, Piedmont, CA An Overview of SAS Enterprise Miner The following article is in regards to Enterprise Miner v.4.3 that is available in SAS v9.1.3.

### Organizing Your Approach to a Data Analysis

Biost/Stat 578 B: Data Analysis Emerson, September 29, 2003 Handout #1 Organizing Your Approach to a Data Analysis The general theme should be to maximize thinking about the data analysis and to minimize

### A fast, powerful data mining workbench designed for small to midsize organizations

FACT SHEET SAS Desktop Data Mining for Midsize Business A fast, powerful data mining workbench designed for small to midsize organizations What does SAS Desktop Data Mining for Midsize Business do? Business

### Data Mining from A to Z: Better Insights, New Opportunities WHITE PAPER

Data Mining from A to Z: Better Insights, New Opportunities WHITE PAPER SAS White Paper Table of Contents Introduction.... 1 How Do Predictive Analytics and Data Mining Work?.... 2 The Data Mining Process....

### Nagarjuna College Of

Nagarjuna College Of Information Technology (Bachelor in Information Management) TRIBHUVAN UNIVERSITY Project Report on World s successful data mining and data warehousing projects Submitted By: Submitted

### Introduction to Longitudinal Data Analysis

Introduction to Longitudinal Data Analysis Longitudinal Data Analysis Workshop Section 1 University of Georgia: Institute for Interdisciplinary Research in Education and Human Development Section 1: Introduction

### EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER. Copyr i g ht 2013, SAS Ins titut e Inc. All rights res er ve d.

EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER ANALYTICS LIFECYCLE Evaluate & Monitor Model Formulate Problem Data Preparation Deploy Model Data Exploration Validate Models

### DISCRIMINANT FUNCTION ANALYSIS (DA)

DISCRIMINANT FUNCTION ANALYSIS (DA) John Poulsen and Aaron French Key words: assumptions, further reading, computations, standardized coefficents, structure matrix, tests of signficance Introduction Discriminant

### Banking Analytics Training Program

Training (BAT) is a set of courses and workshops developed by Cognitro Analytics team designed to assist banks in making smarter lending, marketing and credit decisions. Analyze Data, Discover Information,

### Moderation. Moderation

Stats - Moderation Moderation A moderator is a variable that specifies conditions under which a given predictor is related to an outcome. The moderator explains when a DV and IV are related. Moderation

### Understanding Characteristics of Caravan Insurance Policy Buyer

Understanding Characteristics of Caravan Insurance Policy Buyer May 10, 2007 Group 5 Chih Hau Huang Masami Mabuchi Muthita Songchitruksa Nopakoon Visitrattakul Executive Summary This report is intended

### A Property & Casualty Insurance Predictive Modeling Process in SAS

Paper AA-02-2015 A Property & Casualty Insurance Predictive Modeling Process in SAS 1.0 ABSTRACT Mei Najim, Sedgwick Claim Management Services, Chicago, Illinois Predictive analytics has been developing

### BNG 202 Biomechanics Lab. Descriptive statistics and probability distributions I

BNG 202 Biomechanics Lab Descriptive statistics and probability distributions I Overview The overall goal of this short course in statistics is to provide an introduction to descriptive and inferential

### What is Data Mining? MS4424 Data Mining & Modelling. MS4424 Data Mining & Modelling. MS4424 Data Mining & Modelling. MS4424 Data Mining & Modelling

MS4424 Data Mining & Modelling MS4424 Data Mining & Modelling Lecturer : Dr Iris Yeung Room No : P7509 Tel No : 2788 8566 Email : msiris@cityu.edu.hk 1 Aims To introduce the basic concepts of data mining

### Innovations and Value Creation in Predictive Modeling. David Cummings Vice President - Research

Innovations and Value Creation in Predictive Modeling David Cummings Vice President - Research ISO Innovative Analytics 1 Innovations and Value Creation in Predictive Modeling A look back at the past decade

### Homework 11. Part 1. Name: Score: / null

Name: Score: / Homework 11 Part 1 null 1 For which of the following correlations would the data points be clustered most closely around a straight line? A. r = 0.50 B. r = -0.80 C. r = 0.10 D. There is

### Data Mining for Fun and Profit

Data Mining for Fun and Profit Data mining is the extraction of implicit, previously unknown, and potentially useful information from data. - Ian H. Witten, Data Mining: Practical Machine Learning Tools

### IBM SPSS Direct Marketing 22

IBM SPSS Direct Marketing 22 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 22, release

### DATA MINING AND CUSTOMER RELATIONSHIP MANAGEMENT FOR CLIENTS SEGMENTATION

DATA MINING AND CUSTOMER RELATIONSHIP MANAGEMENT FOR CLIENTS SEGMENTATION Ionela-Catalina Tudorache (Zamfir) 1, Radu-Ioan Vija 2 1), 2) The Bucharest University of Economic Studies, Economic Cybernetics

### Using Predictive Analytics to Detect Fraudulent Claims

Using Predictive Analytics to Detect Fraudulent Claims May 17, 211 Roosevelt C. Mosley, Jr., FCAS, MAAA CAS Spring Meeting Palm Beach, FL Experience the Pinnacle Difference! Predictive Analysis for Fraud

### Role of Customer Response Models in Customer Solicitation Center s Direct Marketing Campaign

Role of Customer Response Models in Customer Solicitation Center s Direct Marketing Campaign Arun K Mandapaka, Amit Singh Kushwah, Dr.Goutam Chakraborty Oklahoma State University, OK, USA ABSTRACT Direct

### Certificate Program in Applied Big Data Analytics in Dubai. A Collaborative Program offered by INSOFE and Synergy-BI

Certificate Program in Applied Big Data Analytics in Dubai A Collaborative Program offered by INSOFE and Synergy-BI Program Overview Today s manager needs to be extremely data savvy. They need to work

### Module 3: Correlation and Covariance

Using Statistical Data to Make Decisions Module 3: Correlation and Covariance Tom Ilvento Dr. Mugdim Pašiƒ University of Delaware Sarajevo Graduate School of Business O ften our interest in data analysis

### Chapter 5 Estimating Demand Functions

Chapter 5 Estimating Demand Functions 1 Why do you need statistics and regression analysis? Ability to read market research papers Analyze your own data in a simple way Assist you in pricing and marketing

### STATISTICA. Financial Institutions. Case Study: Credit Scoring. and

Financial Institutions and STATISTICA Case Study: Credit Scoring STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table of Contents INTRODUCTION: WHAT

### Multivariate Analysis. Overview

Multivariate Analysis Overview Introduction Multivariate thinking Body of thought processes that illuminate the interrelatedness between and within sets of variables. The essence of multivariate thinking

### DHL Data Mining Project. Customer Segmentation with Clustering

DHL Data Mining Project Customer Segmentation with Clustering Timothy TAN Chee Yong Aditya Hridaya MISRA Jeffery JI Jun Yao 3/30/2010 DHL Data Mining Project Table of Contents Introduction to DHL and the

### 12/31/2016. PSY 512: Advanced Statistics for Psychological and Behavioral Research 2

PSY 512: Advanced Statistics for Psychological and Behavioral Research 2 Understand when to use multiple Understand the multiple equation and what the coefficients represent Understand different methods

### Aspects of Risk Adjustment in Healthcare

Aspects of Risk Adjustment in Healthcare 1. Exploring the relationship between demographic and clinical risk adjustment Matthew Myers, Barry Childs 2. Application of various statistical measures to hospital

### Business Analytics Using SAS Enterprise Guide and SAS Enterprise Miner A Beginner s Guide

Business Analytics Using SAS Enterprise Guide and SAS Enterprise Miner A Beginner s Guide Olivia Parr-Rud From Business Analytics Using SAS Enterprise Guide and SAS Enterprise Miner. Full book available

### Easily Identify the Right Customers

PASW Direct Marketing 18 Specifications Easily Identify the Right Customers You want your marketing programs to be as profitable as possible, and gaining insight into the information contained in your

### Data Mining and Data Warehousing. Henryk Maciejewski. Data Mining Predictive modelling: regression

Data Mining and Data Warehousing Henryk Maciejewski Data Mining Predictive modelling: regression Algorithms for Predictive Modelling Contents Regression Classification Auxiliary topics: Estimation of prediction

### 1. The parameters to be estimated in the simple linear regression model Y=α+βx+ε ε~n(0,σ) are: a) α, β, σ b) α, β, ε c) a, b, s d) ε, 0, σ

STA 3024 Practice Problems Exam 2 NOTE: These are just Practice Problems. This is NOT meant to look just like the test, and it is NOT the only thing that you should study. Make sure you know all the material

### Joseph Twagilimana, University of Louisville, Louisville, KY

ST14 Comparing Time series, Generalized Linear Models and Artificial Neural Network Models for Transactional Data analysis Joseph Twagilimana, University of Louisville, Louisville, KY ABSTRACT The aim

### Predictive Modeling and By-Peril Analysis for Homeowners Insurance

Predictive Modeling and By-Peril Analysis for Homeowners Insurance Contents Case for by-peril modeling By-peril model building Data Peril grouping Variables Interactions By-peril territories Model validation

### INTRODUCTION TO DATA MINING SAS ENTERPRISE MINER

INTRODUCTION TO DATA MINING SAS ENTERPRISE MINER Mary-Elizabeth ( M-E ) Eddlestone Principal Systems Engineer, Analytics SAS Customer Loyalty, SAS Institute, Inc. AGENDA Overview/Introduction to Data Mining

### Exercise 1.12 (Pg. 22-23)

Individuals: The objects that are described by a set of data. They may be people, animals, things, etc. (Also referred to as Cases or Records) Variables: The characteristics recorded about each individual.

### Predictive Modeling of Titanic Survivors: a Learning Competition

SAS Analytics Day Predictive Modeling of Titanic Survivors: a Learning Competition Linda Schumacher Problem Introduction On April 15, 1912, the RMS Titanic sank resulting in the loss of 1502 out of 2224

### STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and

Clustering Techniques and STATISTICA Case Study: Defining Clusters of Shopping Center Patrons STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table

### MHI3000 Big Data Analytics for Health Care Final Project Report

MHI3000 Big Data Analytics for Health Care Final Project Report Zhongtian Fred Qiu (1002274530) http://gallery.azureml.net/details/81ddb2ab137046d4925584b5095ec7aa 1. Data pre-processing The data given

### Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing. C. Olivia Rud, VP, Fleet Bank

Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing C. Olivia Rud, VP, Fleet Bank ABSTRACT Data Mining is a new term for the common practice of searching through

### Data Mining: Overview. What is Data Mining?

Data Mining: Overview What is Data Mining? Recently * coined term for confluence of ideas from statistics and computer science (machine learning and database methods) applied to large databases in science,

### Digging for Gold: Business Usage for Data Mining Kim Foster, CoreTech Consulting Group, Inc., King of Prussia, PA

Digging for Gold: Business Usage for Data Mining Kim Foster, CoreTech Consulting Group, Inc., King of Prussia, PA ABSTRACT Current trends in data mining allow the business community to take advantage of

### Principal Component Analysis

Principal Component Analysis ERS70D George Fernandez INTRODUCTION Analysis of multivariate data plays a key role in data analysis. Multivariate data consists of many different attributes or variables recorded

### Predicting Customer Churn in the Telecommunications Industry An Application of Survival Analysis Modeling Using SAS

Paper 114-27 Predicting Customer in the Telecommunications Industry An Application of Survival Analysis Modeling Using SAS Junxiang Lu, Ph.D. Sprint Communications Company Overland Park, Kansas ABSTRACT

### Predictive Analytics in the Public Sector: Using Data Mining to Assist Better Target Selection for Audit

Predictive Analytics in the Public Sector: Using Data Mining to Assist Better Target Selection for Audit Duncan Cleary Revenue Irish Tax and Customs, Ireland dcleary@revenue.ie Abstract: Revenue, the Irish

### Silvermine House Steenberg Office Park, Tokai 7945 Cape Town, South Africa Telephone: +27 21 702 4666 www.spss-sa.com

SPSS-SA Silvermine House Steenberg Office Park, Tokai 7945 Cape Town, South Africa Telephone: +27 21 702 4666 www.spss-sa.com SPSS-SA Training Brochure 2009 TABLE OF CONTENTS 1 SPSS TRAINING COURSES FOCUSING

### Semester 2 Statistics Short courses

Semester 2 Statistics Short courses Course: STAA0001 - Basic Statistics Blackboard Site: STAA0001 Dates: Sat 10 th Sept and 22 Oct 2016 (9 am 5 pm) Room EN409 Assumed Knowledge: None Day 1: Exploratory

### The importance of graphing the data: Anscombe s regression examples

The importance of graphing the data: Anscombe s regression examples Bruce Weaver Northern Health Research Conference Nipissing University, North Bay May 30-31, 2008 B. Weaver, NHRC 2008 1 The Objective

### Modeling Lifetime Value in the Insurance Industry

Modeling Lifetime Value in the Insurance Industry C. Olivia Parr Rud, Executive Vice President, Data Square, LLC ABSTRACT Acquisition modeling for direct mail insurance has the unique challenge of targeting

### ECONOMETRICS PROJECT ECON-620 SALES FORECAST FOR SUBMITED TO: LAURENCE O CONNELL

ECONOMETRICS PROJECT ECON-620 SALES FORECAST FOR SUBMITED TO: LAURENCE O CONNELL SUBMITTED BY: POOJA OZA: 0839378 NISHTHA PARIKH: 0852970 SUNNY SINGH: 0368016 YULIYA INOZEMYSEVA: 0817808 TABLE OF CONTENTS

### Assumptions. Assumptions of linear models. Boxplot. Data exploration. Apply to response variable. Apply to error terms from linear model

Assumptions Assumptions of linear models Apply to response variable within each group if predictor categorical Apply to error terms from linear model check by analysing residuals Normality Homogeneity

### EXPLORING SPATIAL PATTERNS IN YOUR DATA

EXPLORING SPATIAL PATTERNS IN YOUR DATA OBJECTIVES Learn how to examine your data using the Geostatistical Analysis tools in ArcMap. Learn how to use descriptive statistics in ArcMap and Geoda to analyze

### Multivariate Normal Distribution

Multivariate Normal Distribution Lecture 4 July 21, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2 Lecture #4-7/21/2011 Slide 1 of 41 Last Time Matrices and vectors Eigenvalues

### OFFLINE NETWORK INTRUSION DETECTION: MINING TCPDUMP DATA TO IDENTIFY SUSPICIOUS ACTIVITY KRISTIN R. NAUTA AND FRANK LIEBLE.

OFFLINE NETWORK INTRUSION DETECTION: MINING TCPDUMP DATA TO IDENTIFY SUSPICIOUS ACTIVITY KRISTIN R. NAUTA AND FRANK LIEBLE Abstract With the boom in electronic commerce and the increasing global interconnectedness

### MISSING DATA TECHNIQUES WITH SAS. IDRE Statistical Consulting Group

MISSING DATA TECHNIQUES WITH SAS IDRE Statistical Consulting Group ROAD MAP FOR TODAY To discuss: 1. Commonly used techniques for handling missing data, focusing on multiple imputation 2. Issues that could

### Data mining and statistical models in marketing campaigns of BT Retail

Data mining and statistical models in marketing campaigns of BT Retail Francesco Vivarelli and Martyn Johnson Database Exploitation, Segmentation and Targeting group BT Retail Pp501 Holborn centre 120

### Data analysis process

Data analysis process Data collection and preparation Collect data Prepare codebook Set up structure of data Enter data Screen data for errors Exploration of data Descriptive Statistics Graphs Analysis

### Predictive Modeling in Workers Compensation 2008 CAS Ratemaking Seminar

Predictive Modeling in Workers Compensation 2008 CAS Ratemaking Seminar Prepared by Louise Francis, FCAS, MAAA Francis Analytics and Actuarial Data Mining, Inc. www.data-mines.com Louise.francis@data-mines.cm

### Word Count: Body Text = 5,500 + 2,000 (4 Figures, 4 Tables) = 7,500 words

PRIORITIZING ACCESS MANAGEMENT IMPLEMENTATION By: Grant G. Schultz, Ph.D., P.E., PTOE Assistant Professor Department of Civil & Environmental Engineering Brigham Young University 368 Clyde Building Provo,

### Review Jeopardy. Blue vs. Orange. Review Jeopardy

Review Jeopardy Blue vs. Orange Review Jeopardy Jeopardy Round Lectures 0-3 Jeopardy Round \$200 How could I measure how far apart (i.e. how different) two observations, y 1 and y 2, are from each other?

### Achieve Better Insight and Prediction with Data Mining

Clementine 11.1 Specifications Achieve Better Insight and Prediction with Data Mining Data mining provides organizations with a clearer view of current conditions and deeper insight into future events.

### White Paper Combining Attitudinal Data and Behavioral Data for Meaningful Analysis

MAASSMEDIA, LLC WEB ANALYTICS SERVICES White Paper Combining Attitudinal Data and Behavioral Data for Meaningful Analysis By Abigail Lefkowitz, MaassMedia Executive Summary: In the fast-growing digital

### Data are everywhere. IBM projects that every day we generate 2.5 quintillion bytes of data. In relative terms, this means 90

FREE echapter C H A P T E R1 Big Data and Analytics Data are everywhere. IBM projects that every day we generate 2.5 quintillion bytes of data. In relative terms, this means 90 percent of the data in the

### FACTOR ANALYSIS. Factor Analysis is similar to PCA in that it is a technique for studying the interrelationships among variables.

FACTOR ANALYSIS Introduction Factor Analysis is similar to PCA in that it is a technique for studying the interrelationships among variables Both methods differ from regression in that they don t have

### Index Contents Page No. Introduction . Data Mining & Knowledge Discovery

Index Contents Page No. 1. Introduction 1 1.1 Related Research 2 1.2 Objective of Research Work 3 1.3 Why Data Mining is Important 3 1.4 Research Methodology 4 1.5 Research Hypothesis 4 1.6 Scope 5 2.

### Data Warehouse Overview. Srini Rengarajan

Data Warehouse Overview Srini Rengarajan Please mute Your cell! Agenda Data Warehouse Architecture Approaches to build a Data Warehouse Top Down Approach Bottom Up Approach Best Practices Case Example

### @RISK Model Window. o Shows the inputs, outputs modelling features Can be used to edit properties such as variable names

Presentation Agenda o Welcome to @RISK for Project o RPJ Models o Distribution Choice o Simulation Results o Fitting Distributions to Data o Creating Variables Easily o Ranking Risks o Advanced Modelling

### Rachel J. Goldberg, Guideline Research/Atlanta, Inc., Duluth, GA

PROC FACTOR: How to Interpret the Output of a Real-World Example Rachel J. Goldberg, Guideline Research/Atlanta, Inc., Duluth, GA ABSTRACT THE METHOD This paper summarizes a real-world example of a factor

### Variable Selection and Transformation of Variables in SAS Enterprise Miner

Variable Selection and Transformation of Variables in SAS Enterprise Miner Kattamuri S. Sarma, Ph.D Ecostat Research Corp., White Plains NY kssarma@worldnet.att.net kssarma@ecostat-research.com 2 Issues

### Unit 21 Student s t Distribution in Hypotheses Testing

Unit 21 Student s t Distribution in Hypotheses Testing Objectives: To understand the difference between the standard normal distribution and the Student's t distributions To understand the difference between

### How to Quickly Leverage Big Data Analytics May 23, 2013

How to Quickly Leverage Big Data Analytics May 23, 2013 Agenda 3V s of Big Data Value of External leading indicators Buy vs. Build Dilemma Case Study - Dow Chemical Company Case Study - Advanced Drainage

### An Overview of Database management System, Data warehousing and Data Mining

An Overview of Database management System, Data warehousing and Data Mining Ramandeep Kaur 1, Amanpreet Kaur 2, Sarabjeet Kaur 3, Amandeep Kaur 4, Ranbir Kaur 5 Assistant Prof., Deptt. Of Computer Science,

### APPLICATION PROGRAMMING: DATA MINING AND DATA WAREHOUSING

Wrocław University of Technology Internet Engineering Henryk Maciejewski APPLICATION PROGRAMMING: DATA MINING AND DATA WAREHOUSING PRACTICAL GUIDE Wrocław (2011) 1 Copyright by Wrocław University of Technology

### Running head: ASSUMPTIONS IN MULTIPLE REGRESSION 1. Assumptions in Multiple Regression: A Tutorial. Dianne L. Ballance ID#

Running head: ASSUMPTIONS IN MULTIPLE REGRESSION 1 Assumptions in Multiple Regression: A Tutorial Dianne L. Ballance ID#00939966 University of Calgary APSY 607 ASSUMPTIONS IN MULTIPLE REGRESSION 2 Assumptions