Class 10. Data Mining and Artificial Intelligence. Data Mining. We are in the 21st century. So where are the robots?

Transcription:

Class 10: Data Mining

Data Mining and Artificial Intelligence
We are in the 21st century. So where are the robots?
Data mining is the one really successful application of artificial intelligence technology. It's out there, it's used every day, it makes lots of money and it saves lots of lives.

Definition
Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data.

Definition (Cont.)
  Valid: The patterns hold in general.
  Novel: We did not know the pattern beforehand.
    Married people buy baby food.
  Useful: We can devise actions from the patterns.
    More medicine sales after earthquakes.
  Understandable: We can interpret and comprehend the patterns.
  Counterexample pattern (Census Bureau data):
    If (relationship = husband), then (gender = male). 99.6%

Insight and a predictive capability
Data mining produces two kinds of results:
  Insight: new knowledge, better understanding
  Predictive capability: spot risks, potentials, anomalies, links

Example: data mining and customer processes
  Insight: Who are my customers and why do they behave the way they do?
  Prediction: Who is a good prospect, for what product? Who is at risk? What is the next thing to offer?
  Uses: targeted marketing, mail-shots, call-centres, adaptive web-sites

Example: data mining and fraud detection
  Insight: How can a (specific method of) fraud be recognised? What constitutes normal, abnormal and suspicious events?
  Prediction: Recognise similarity to previous frauds: how similar? Spot abnormal events: how suspicious?
  Used by: banks, telcos, retail, government

Why Use Data Mining Today?
  Business intelligence as the battlefield
  Human analysis skills are inadequate:
    Volume and dimensionality of the data
    High data growth rate
    End-user as the analyst
  Availability of:
    Data
    Storage
    Computational power
    Off-the-shelf software
    Expertise

The Evolution of Data Analysis

Data Collection (1960s)
  Business question: "What was my total revenue in the last five years?"
  Enabling technologies: computers, tapes, disks
  Product providers: IBM, CDC
  Characteristics: retrospective, static data delivery

Data Access (1980s)
  Business question: "What were unit sales in New England last March?"
  Enabling technologies: relational databases (RDBMS), Structured Query Language (SQL), ODBC
  Product providers: Oracle, Sybase, Informix, IBM, Microsoft
  Characteristics: retrospective, dynamic data delivery at record level

Data Warehousing & Decision Support (1990s)
  Business question: "What were unit sales in New England last March? Drill down to Boston."
  Enabling technologies: on-line analytic processing (OLAP), multidimensional databases, data warehouses
  Product providers: SPSS, Comshare, Arbor, Cognos, Microstrategy, NCR
  Characteristics: retrospective, dynamic data delivery at multiple levels

Data Mining (Emerging Today)
  Business question: "What's likely to happen to Boston unit sales next month? Why?"
  Enabling technologies: advanced algorithms, multiprocessor computers, massive databases
  Product providers: SPSS/Clementine, Lockheed, IBM, SGI, SAS, NCR, Oracle, numerous startups
  Characteristics: prospective, proactive information delivery

Preprocessing and Mining
  Figure: Original Data -> (Data Integration and Selection) -> Target Data -> (Preprocessing) -> Preprocessed Data -> (Model Construction) -> Patterns -> (Interpretation) -> Knowledge

CRISP-DM Overview
  An industry-standard process model for data mining
  Not sector-specific
  Non-proprietary
  www.crisp-dm.org
  CRISP-DM phases:
    Business Understanding
    Data Understanding
    Data Preparation
    Modeling
    Evaluation
    Deployment
  Not strictly ordered: respects the iterative aspect of data mining

Data Mining versus OLAP
  OLAP: On-line Analytical Processing
  Provides you with a very good view of what is happening, but cannot predict what will happen in the future or why it is happening

Data mining in BI

Related fields

Statistics, Machine Learning and Data Mining
  Statistics
    more theory-based
    more focused on testing hypotheses
  Machine learning
    more heuristic
    focused on improving performance of a learning agent
    also looks at real-time learning and robotics, areas not part of data mining
  Data Mining and Knowledge Discovery
    integrates theory and heuristics
    focuses on the entire process of knowledge discovery, including data cleaning, learning, and integration and visualization of results
  Distinctions are fuzzy

Data Mining Techniques
  Supervised learning
    Classification and regression
  Unsupervised learning
    Clustering
  Dependency modeling
    Associations, summarization, causality
  Outlier and deviation detection
  Trend analysis and change detection

Supervised Learning
  F(x): true function (usually not known)
  D: training sample drawn from F(x)
    57,M,195,,125,95,39,25,,1,,,,1,,,,,,,1,1,,,,,,,,
    78,M,16,1,13,1,37,4,1,,,,1,,1,1,1,,,,,,,,,,,,,
    69,F,18,,115,85,4,22,,,,,,1,,,,,1,,,,,,,,,,,,
    18,M,165,,11,8,41,3,,,,,1,,,,,,,,,,,,,,,,,
    54,F,135,,115,95,39,35,1,1,,,,1,,,,1,,,,,1,,,,1,,,,
    84,F,21,1,135,15,39,24,,,,,,,,,1,,,,,,,,,,,,,
    89,F,135,,12,95,36,28,,,,,,,,,,,,,1,1,,,,,,,1,,
    49,M,195,,115,85,39,32,,,,1,1,,,,,,,1,,,,,,1,,,,
    4,M,25,,115,9,37,18,,,,,,,,,,,,,,,,,,,,,,
    74,M,25,1,13,1,38,26,1,1,,,,1,1,,,,,,,,,,,,,,
    77,F,14,,125,1,4,3,1,1,,,,,,,,,1,,,,,,,,,,1,1
    1 1 1 1 1

Supervised Learning
  F(x): true function (usually not known)
  D: training sample (x, F(x))
    57,M,195,,125,95,39,25,,1,,,,1,,,,,,,1,1,,,,,,,,
    78,M,16,1,13,1,37,4,1,,,,1,,1,1,1,,,,,,,,,,,,,   1
    69,F,18,,115,85,4,22,,,,,,1,,,,,1,,,,,,,,,,,,
    18,M,165,,11,8,41,3,,,,,1,,,,,,,,,,,,,,,,,
    54,F,135,,115,95,39,35,1,1,,,,1,,,,1,,,,,1,,,,1,,,,   1
  G(x): model learned from D
    71,M,16,1,13,15,38,2,1,,,,,,,,,,1,,,,,,,,,,   ?
  Goal: E[(F(x) - G(x))^2] is small (near zero) for future samples
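The goal stated on the slide, keeping E[(F(x) - G(x))^2] small on future samples, can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made up for illustration and is not taken from the slides: the "true" function F, the noise level, the one-variable least-squares model G, and the sample sizes.

```python
import random

random.seed(0)

def F(x):
    # Hypothetical "true" function; in a real problem it is unknown.
    return 3.0 * x + 2.0

def sample(n):
    # Draw n labelled examples (x, F(x) + noise).
    xs = [random.uniform(0, 10) for _ in range(n)]
    return [(x, F(x) + random.gauss(0, 0.1)) for x in xs]

def fit_line(data):
    # Ordinary least squares for a model of the form G(x) = a*x + b.
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    sxx = sum((x - mx) ** 2 for x, _ in data)
    sxy = sum((x - mx) * (y - my) for x, y in data)
    a = sxy / sxx
    b = my - a * mx
    return lambda x: a * x + b

def mean_squared_error(model, data):
    # Empirical estimate of E[(F(x) - G(x))^2] over the given x values.
    return sum((F(x) - model(x)) ** 2 for x, _ in data) / len(data)

train = sample(50)      # D: training sample
future = sample(1000)   # stand-in for "future samples"
G = fit_line(train)     # G(x): model learned from D
error = mean_squared_error(G, future)
print(f"estimated E[(F(x) - G(x))^2] = {error:.5f}")
```

Because F is known in this toy setting, the error can be measured against F directly; with real data only the noisy labels are available, so the error is estimated against held-out labels instead.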

Classification

Decision trees

Neural networks

Unsupervised Learning
  Training dataset:
    57,M,195,,125,95,39,25,,1,,,,1,,,,,,,1,1,,,,,,,,
    78,M,16,1,13,1,37,4,1,,,,1,,1,1,1,,,,,,,,,,,,,
    69,F,18,,115,85,4,22,,,,,,1,,,,,1,,,,,,,,,,,,
    18,M,165,,11,8,41,3,,,,,1,,,,,,,,,,,,,,,,,
    54,F,135,,115,95,39,35,1,1,,,,1,,,,1,,,,,1,,,,1,,,,
    84,F,21,1,135,15,39,24,,,,,,,,,1,,,,,,,,,,,,,
    89,F,135,,12,95,36,28,,,,,,,,,,,,,1,1,,,,,,,1,,
    49,M,195,,115,85,39,32,,,,1,1,,,,,,,1,,,,,,1,,,,
    4,M,25,,115,9,37,18,,,,,,,,,,,,,,,,,,,,,,
    74,M,25,1,13,1,38,26,1,1,,,,1,1,,,,,,,,,,,,,,
    77,F,14,,125,1,4,3,1,1,,,,,,,,,1,,,,,,,,,,1,1
  Test dataset:
    71,M,16,1,13,15,38,2,1,,,,,,,,,,1,,,,,,,,,,   ?
    1 1 1 1 1

Supervised vs. Unsupervised Learning
  Supervised
    y = F(x): true function
    D: labeled training set, D = {(x_i, F(x_i))}
    Learn: G(x), a model trained to predict the labels of D
    Goal: E[(F(x) - G(x))^2] is small
    Well-defined criteria: accuracy, RMSE, ...
  Unsupervised
    Generator: true model
    D: unlabeled data sample, D = {x_i}
    Learn: ??????????
    Goal: ??????????
    Well-defined criteria: ??????????

Clustering: Unsupervised Learning
  Given:
    Data set D (training set)
    Similarity/distance metric/information
  Find:
    Partitioning of data
    Groups of similar/close items
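The "given a data set and a distance metric, find a partitioning" formulation can be sketched in a few lines of Python. The slides do not name an algorithm; k-means is used here only as one common choice, and the Euclidean distance, the six two-dimensional points and the starting centers are all made-up assumptions.

```python
import math

def dist(p, q):
    # Euclidean distance, standing in for the "similarity/distance metric".
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mean(points):
    # Component-wise mean of a non-empty list of points.
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def kmeans(data, centers, iterations=10):
    # Alternate between assigning points to the nearest center
    # and moving each center to the mean of its assigned points.
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in data:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster happens to be empty.
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical data: two visually obvious groups.
data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centers, clusters = kmeans(data, centers=[data[0], data[3]])
for center, members in zip(centers, clusters):
    print(center, members)
```

The number of clusters and the starting centers are inputs in this sketch; choosing them, like choosing the distance metric, is part of what makes unsupervised learning harder to evaluate than supervised learning.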

Clustering

Similarity?
  Groups of similar customers
    Similar demographics
    Similar buying behavior
    Similar health
  Similar products
    Similar cost
    Similar function
    Similar store
  Similarity usually is domain/problem specific

Clustering: Informal Problem Definition
  Input: A data set of N records, each given as a d-dimensional data feature vector.
  Output: Determine a natural, useful partitioning of the data set into a number of (k) clusters and noise such that we have:
    High similarity of records within each cluster (intra-cluster similarity)
    Low similarity of records between clusters (inter-cluster similarity)

Market Basket Analysis
  Consider a shopping cart filled with several items.
  Market basket analysis tries to answer the following questions:
    Who makes purchases?
    What do customers buy together?
    In what order do customers purchase items?

Market Basket Analysis
  Given:
    A database of customer transactions
    Each transaction is a set of items
  Example: Transaction with TID 111 contains items {Pen, Ink, Milk, Juice}

  TID  CID  Date    Item   Qty
  111  21   5/1/99  Pen    2
  111  21   5/1/99  Ink    1
  111  21   5/1/99  Milk   3
  111  21   5/1/99  Juice  6
  112  15   6/3/99  Pen    1
  112  15   6/3/99  Ink    1
  112  15   6/3/99  Milk   1
  113  16   6/5/99  Pen    1
  113  16   6/5/99  Milk   1
  114  21   7/1/99  Pen    2
  114  21   7/1/99  Ink    2
  114  21   7/1/99  Juice  4

Market Basket Analysis (Contd.)
  Co-occurrences
    8% of all customers purchase items X, Y and Z together.
  Association rules
    6% of all customers who purchase X and Y also buy Z.
  Sequential patterns
    6% of customers who first buy X also purchase Y within three weeks.

Example
  Rules over the four transactions (TID 111 to 114) in the table above:
    {Pen} => {Milk}   Support: 75%   Confidence: 75%
    {Ink} => {Pen}    Support: 75%   Confidence: 100%
  Pen and Milk occur together in three of the four transactions, and Pen occurs in all four, so both values are 75%. Ink occurs in three transactions, each of which also contains Pen, so the support is 75% and the confidence is 100%.

Market Basket Analysis: Applications
  Sample applications:
    Direct marketing
    Fraud detection for medical insurance
    Floor/shelf planning
    Web site layout
    Cross-selling
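The support and confidence figures for the pen/ink/milk/juice example can be checked with a short Python sketch. It uses the standard definitions, which the slides do not spell out: the support of a rule X => Y is the fraction of transactions containing every item of X and Y, and the confidence is that support divided by the support of X alone. The function names are illustrative, and quantities, customer IDs and dates are ignored.

```python
# Item sets per transaction ID, taken from the example table.
transactions = {
    111: {"Pen", "Ink", "Milk", "Juice"},
    112: {"Pen", "Ink", "Milk"},
    113: {"Pen", "Milk"},
    114: {"Pen", "Ink", "Juice"},
}

def support(itemset):
    # Fraction of transactions that contain every item in itemset.
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

def confidence(lhs, rhs):
    # Of the transactions containing lhs, the fraction that also contain rhs.
    return support(lhs | rhs) / support(lhs)

for lhs, rhs in [({"Pen"}, {"Milk"}), ({"Ink"}, {"Pen"})]:
    print(
        sorted(lhs), "=>", sorted(rhs),
        f"support={support(lhs | rhs):.0%}",
        f"confidence={confidence(lhs, rhs):.0%}",
    )
```

Counting by transaction rather than by table row matters here: the table has one row per item, so the rows are first grouped into one item set per TID.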

BI Applications: Customer Profiling
  Who are the most profitable customers and what distinguishes them?
  Who are the happiest customers and how are they different?
  Which customers are changing categories and what distinguishes them?
  How do our customers compare to our competitors' customers?
  How do our prospects differ from our customers?

BI Applications: Direct Marketing and CRM
  Most major direct marketing companies are using modeling and data mining
  Most financial companies are using customer modeling
  Modeling is easier than changing customer behaviour
  Example: Verizon Wireless reduced its customer attrition rate (churn rate) from 2% to 1.5%, saving many millions of dollars

BI Applications: e-commerce

BI Applications: Banking
  Corporate Credit Risk Assessment
    Should only low-risk customers be served?
    Predicting a risk profile implies predicting default, an outlier
    Mistakes can be expensive (both + and -)
      Argentina, B of NE, Kmart, real estate in the 80s
    Data sources are not always reliable
      Rating agencies tend to postpone downgrades
      Low-frequency data

BI Applications: Banking
  Consumer Lending
    Traditionally very crude
    NN models are common today
    Based on high-frequency data
    Updated with every event
    Each decision can be used to tune the model
      Decisions made by managers AND defaults
  Credit Card
    Very large databases: 3*1 MM /month
    Acquisition, retention, cross-selling, fraud detection, customer service

BI Applications: Security and Fraud Detection

Data Mining with Privacy
  Exploitation of the customer
    How is data about me being used?
  Data mining looks for patterns, not people!
  Technical solutions can limit privacy invasion
    Replacing sensitive personal data with an anonymous ID
    Giving randomized outputs
    Multi-party computation over distributed data

The hype-curve
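Two of the technical measures listed on the privacy slide, replacing personal data with an anonymous ID and giving randomized outputs, can be sketched in Python. The slides give no implementation, so the choices here are assumptions: a keyed hash (HMAC-SHA256) for the anonymous ID, the classic randomized-response scheme as one reading of "randomized outputs", a placeholder secret key, and made-up customer data.

```python
import hashlib
import hmac
import random

# Assumption: a secret key stored separately from the mined data.
SECRET_KEY = b"hypothetical-secret-key"

def pseudonym(customer_id: str) -> str:
    # The same customer always maps to the same opaque ID, so patterns
    # can still be mined, but the ID cannot be reversed without the key.
    digest = hmac.new(SECRET_KEY, customer_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]

def randomized_response(truth: bool, rng: random.Random) -> bool:
    # With probability 1/2 report the truth, otherwise report a fair coin flip,
    # so no single answer reveals the individual's true value.
    if rng.random() < 0.5:
        return truth
    return rng.random() < 0.5

def estimate_true_rate(reports) -> float:
    # P(report is True) = 0.5 * p + 0.25, hence p = 2 * observed - 0.5.
    observed = sum(reports) / len(reports)
    return 2 * observed - 0.5

rng = random.Random(42)
truths = [i < 3000 for i in range(10000)]  # hypothetical population, 30% "yes"
reports = [randomized_response(t, rng) for t in truths]
estimate = estimate_true_rate(reports)
print(pseudonym("customer-21"), f"estimated rate = {estimate:.3f}")
```

Both measures fit the slide's point that data mining looks for patterns, not people: the aggregate rate is still recoverable from the randomized reports, while individual records are obscured.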