Data Mining: Overview. What is Data Mining?




What is Data Mining?
- Recently* coined term for confluence of ideas from statistics and computer science (machine learning and database methods) applied to large databases in science, engineering and business.
- In a state of flux: many definitions, lot of debate about what it is and what it is not.
- Terminology not standard, e.g. bias, classification, prediction; feature = independent variable; target = dependent variable; case = exemplar = row.
* First International workshop on Knowledge Discovery and Data Mining was in 1995.

Broad and Narrow Definitions
- Broad definition includes traditional statistical methods; narrow definition emphasizes automated and heuristic methods
- Data mining, data dredging, fishing expeditions
- Knowledge Discovery in Databases (KDD)
- My favorite: "Statistics at scale and speed" (Daryl Pregibon)
- My extension: "... and simplicity"

Gartner Group
"Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques."

Drivers
- Market: from focus on product/service to focus on customer
- IT: from focus on up-to-date balances to focus on patterns in transactions
  - Data Warehouses
  - OLAP
- Dramatic drop in storage costs: huge databases, e.g. Walmart: 20 million transactions/day, 10 terabyte database; Blockbuster: 36 million households
- Automatic data capture of transactions, e.g. bar codes, POS devices, mouse clicks, location data (GPS, cell phones)
- Internet: personalized interactions, longitudinal data

Core Disciplines
- Statistics (adapted for 21st century data sizes and speed requirements). Examples:
  - Descriptive: Visualization
  - Models (DMD): Regression, Cluster Analysis
- Machine Learning: e.g. Neural Nets
- Data Base Retrieval: e.g. Association Rules
- Parallel developments: e.g. Tree methods, k Nearest Neighbors, OLAP-EDA

Process
1. Develop understanding of application, goals
2. Create dataset for study (often from Data Warehouse)
3. Data cleaning and preprocessing
4. Data reduction and projection
5. Choose Data Mining task
6. Choose Data Mining algorithms
7. Use algorithms to perform task
8. Interpret and iterate through 1-7 if necessary
9. Deploy: integrate into operational systems

Data Mining SEMMA Methodology (SAS)
- Sample from data sets; partition into Training, Validation and Test datasets
- Explore data set statistically and graphically
- Modify: transform variables, impute missing values
- Model: fit models, e.g. regression, classification tree, neural net
- Assess: compare models using Partition, Test datasets
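The Sample step of SEMMA, partitioning into Training, Validation and Test datasets, can be sketched as a random split. This is a minimal illustration only: the function name `partition`, the 60/20/20 proportions and the fixed seed are assumptions, not part of the SAS methodology.

```python
import random

def partition(rows, train=0.6, valid=0.2, seed=0):
    """Shuffle rows and split them into training, validation and test lists.

    The fractions are illustrative assumptions; whatever is left after the
    training and validation shares goes to the test set.
    """
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_valid = round(len(shuffled) * valid)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])

train_set, valid_set, test_set = partition(range(100))
print(len(train_set), len(valid_set), len(test_set))  # 60 20 20
```

Models would be fitted on the training set, compared on the validation set, and the test set kept aside for a final estimate of performance.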

Illustrative Applications
- Customer Relationship Management
- Finance
- E-commerce and Internet

Customer Relationship Management
- Target Marketing
- Attrition Prediction/Churn Analysis
- Fraud Detection
- Credit Scoring

Target Marketing
- Business problem: Use list of prospects for direct mailing campaign
- Solution: Use Data Mining to identify most promising respondents, combining demographic and geographic data with data on past purchase behavior
- Benefit: Better response rate, savings in campaign cost
- Example: Fleet Financial Group
  - Redesign of customer service infrastructure, including $38 million investment in data warehouse and marketing automation
  - Used logistic regression to predict response probabilities to home-equity product for sample of 20,000 customer profiles from 15 million customer base
  - Used CART to predict profitable customers and customers who would be unprofitable even if they respond
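To make "logistic regression to predict response probabilities" concrete, here is a small sketch that fits a logistic model by batch gradient descent. Everything in it is an assumption for illustration: the two features, the six toy customer rows, the learning rate and the function names. It is not the model Fleet Financial Group used.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights (bias first) by batch gradient descent on the log loss."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            p = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
            err = p - yi                       # gradient of log loss w.r.t. z
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w

def response_probability(w, x):
    """Predicted probability that a prospect with features x responds."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))

# Hypothetical toy data: [past purchases, years as customer] -> responded (1/0)
X = [[0, 1], [1, 0], [2, 1], [3, 2], [4, 3], [5, 3]]
y = [0, 0, 0, 1, 1, 1]
w = fit_logistic(X, y)

# Rank prospects by predicted probability and mail the most promising first
prospects = [[0, 1], [5, 3], [2, 2]]
ranked = sorted(prospects, key=lambda x: response_probability(w, x), reverse=True)
print(ranked[0])  # [5, 3]
```

In a mailing campaign the ranked list would be cut off at the point where the expected response no longer covers the cost of the mailing.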

Churn Analysis: Telcos
- Business problem: Prevent loss of customers, avoid adding churn-prone customers
- Solution: Use neural nets, time series analysis to identify typical patterns of telephone usage of likely-to-defect and likely-to-churn customers
- Benefit: Retention of customers, more effective promotions
- Example: France Telecom
  - CHURN/Customer Profiling System (CPS) implemented as part of major custom data warehouse solution
  - Preventive CPS based on customer characteristics and known cases of churning and non-churning customers: identify significant characteristics for churn
  - Early detection CPS based on usage pattern matching with known cases of churn customers

Fraud Detection
- Business problem: Fraud increases costs or reduces revenue
- Solution: Use logistic regression, neural nets to identify characteristics of fraudulent cases to prevent in future or prosecute more vigorously
- Benefit: Increased profits by reducing undesirable customers
- Example: Automobile Insurance Bureau of Massachusetts
  - Past reports on claims adjustors scrutinized by experts to identify cases of fraud
  - Several characteristics (over 60) of claimant, type of accident, type of injury/treatment coded into database
  - Dimension reduction methods used to obtain weighted variables
  - Multiple regression with step-wise subset selection methods used to identify characteristics strongly correlated with fraud

Risk Analysis
- Business problem: Reduce risk of loans to delinquent customers
- Solution: Use credit scoring models using discriminant analysis to create score functions that separate out risky customers
- Benefit: Decrease in cost of bad debts

Finance
- Business problem: Pricing of corporate bonds depends on several factors: risk profile of company, seniority of debt, dividends, prior history, etc.
- Solution approach: Through DM, develop more accurate models for predicting prices

E-commerce and Internet
- Collaborative Filtering
- From Clicks to Customers

Recommendation Systems
- Business opportunity: Users rate items (Amazon.com, CDNOW.com, MovieFinder.com) on the web. How to use information from other users to infer ratings for a particular user?
- Solution: Use of a technique known as collaborative filtering
- Benefit: Increase revenues by cross selling, up selling
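One common form of collaborative filtering is user-based: infer a user's rating for an item as a similarity-weighted average of other users' ratings. The sketch below assumes cosine similarity over co-rated items and positive ratings on a 1 to 5 scale; the user names, items and ratings are made up for illustration, and real recommenders differ in many details.

```python
import math

def cosine(a, b):
    """Cosine similarity over the items both users have rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    na = math.sqrt(sum(a[i] ** 2 for i in common))
    nb = math.sqrt(sum(b[i] ** 2 for i in common))
    return dot / (na * nb)

def predict_rating(ratings, user, item):
    """Similarity-weighted average of other users' ratings for `item`.

    Returns None when no other user with any overlap has rated the item.
    """
    num = den = 0.0
    for other, their in ratings.items():
        if other == user or item not in their:
            continue
        sim = cosine(ratings[user], their)
        num += sim * their[item]
        den += sim
    return num / den if den else None

# Hypothetical ratings: user -> {item: rating on a 1-5 scale}
ratings = {
    "ann": {"A": 5, "B": 4},
    "bob": {"A": 5, "B": 4, "C": 5},
    "cat": {"A": 1, "B": 5, "C": 2},
}
# ann's tastes match bob's more closely than cat's, so the estimate
# for item C is pulled toward bob's rating of 5.
print(predict_rating(ratings, "ann", "C"))
```

Items with the highest inferred ratings that the user has not yet bought are the natural candidates for cross selling.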

Clicks to Customers
- Business problem: 50% of Dell's clients order their computer through the web. However, the retention rate is 0.5%, i.e. 0.5% of visitors to Dell's web page become customers.
- Solution approach: Through the sequence of their clicks, cluster customers and design website, interventions to maximize the number of customers who eventually buy
- Benefit: Increase revenues

Emerging Major Data Mining Applications
- Spam
- Bioinformatics/Genomics
- Medical History Data
- Insurance Claims
- Personalization of services in e-commerce
- RF Tags: Gillette
- Security: Container Shipments, Network Intrusion Detection

Core Concepts
- Types of data:
  - Numeric
    - Continuous: ratio and interval
    - Discrete
    - Need for binning
  - Categorical: ordered and unordered
  - Binary
- Overfitting and generalization
- Regularization: penalty for model complexity
- Distance
- Curse of dimensionality
- Random and stratified sampling, resampling
- Loss functions
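Two of these concepts, binning of numeric data and stratified sampling, are easy to show in a few lines. The sketch below assumes equal-width bins and proportional allocation across strata; both are only one of several possible choices, and the function names and example rows are hypothetical.

```python
import random
from collections import defaultdict

def equal_width_bins(values, k):
    """Map each numeric value to a bin index 0..k-1 of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0       # avoid zero width if all values are equal
    return [min(int((v - lo) / width), k - 1) for v in values]

def stratified_sample(rows, key, fraction, seed=0):
    """Draw the same fraction of rows from each stratum defined by key(row)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for members in strata.values():
        n = max(1, round(len(members) * fraction))   # keep rare strata represented
        sample.extend(rng.sample(members, n))
    return sample

print(equal_width_bins([0, 5, 10, 20], 4))  # [0, 1, 2, 3]

# Hypothetical rows: 10 responders and 90 non-responders
rows = [("responder", i) for i in range(10)] + [("non-responder", i) for i in range(90)]
sample = stratified_sample(rows, key=lambda r: r[0], fraction=0.2)
print(sum(1 for r in sample if r[0] == "responder"), len(sample))  # 2 20
```

A simple random sample of 20 rows could by chance contain no responders at all; stratifying on the target keeps the rare class in the sample at its original proportion.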

Typical Characteristics of Mining Data
- Standard format is spreadsheet: row = observation unit, column = variable
- Many rows, many columns
  - Many rows, moderate number of columns (e.g. tel. calls)
  - Many columns, moderate number of rows (e.g. genomics)
- Opportunistic (often by-product of transactions)
  - Not from designed experiments
  - Often has outliers, missing data

Course Topics
Supervised Techniques
- Classification: k-Nearest Neighbors, Naïve Bayes, Classification Trees, Discriminant Analysis, Logistic Regression, Neural Nets
- Prediction (Estimation): Regression, Regression Trees, k-Nearest Neighbors
Unsupervised Techniques
- Cluster Analysis, Principal Components
- Association Rules, Collaborative Filtering
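As a taste of the supervised techniques, k-nearest neighbors classification fits in a few lines: classify a new case by majority vote among the k closest training cases. The sketch assumes Euclidean distance and k = 3; the two-feature churn data are invented for illustration. It uses `math.dist`, which requires Python 3.8 or later.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Majority vote among the k training points nearest to query (Euclidean)."""
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training cases: (features, class label)
train = [
    ((1, 1), "stay"), ((1, 2), "stay"), ((2, 1), "stay"),
    ((8, 8), "churn"), ((8, 9), "churn"), ((9, 8), "churn"),
]
print(knn_classify(train, (2, 2)))  # stay
print(knn_classify(train, (7, 9)))  # churn
```

For prediction (estimation) the same neighbors are used, but the vote is replaced by the average of their numeric target values.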