Data Mining Analytics for Business Intelligence and Decision Support




Chid Apte, PhD
Manager, Data Abstraction Research Group
IBM TJ Watson Research Center
apte@us.ibm.com
http://www.research.ibm.com/dar

Overview
Knowledge discovery and data mining (KDD) techniques are used for analyzing and discovering actionable insights from data. The talk will:
- Provide technical descriptions of the core algorithms that comprise data mining analytics
- Describe some business application scenarios for KDD
- Discuss issues in business intelligence systems
- Map trends in this area

Background
- Widespread and explosive growth in use and size of databases
- Traditional use: query based report generation
- Size and volumes raise new issues:
  - Will data help business to achieve an advantage?
  - Can data be used to model underlying processes and predict their behavior?
  - Can we understand the data?
- Providing capabilities to support exploration, summarization, and modeling of large databases is the goal of Business Intelligence systems

From Transactions to Warehouses
- Transactional databases: reliable and accurate data capture; logging, book-keeping
- Data warehousing: turning transactional data into a history repository
  - Can be queried for summaries and aggregate reports
  - First step in transforming transactional data (primary purpose: reliable storage) to one whose primary use is business intelligence
  - May require integration of multiple sources of data
  - Dealing with multiple formats; multiple database systems; integrating distributed databases; data cleaning; creating a unified logical view of underlying non-homogeneous data

On-Line Analytical Processing (OLAP)
- Supports query driven exploration of the data warehouse
- Utilizes pre-computed aggregates along data dimensions
- Requires deciding which aggregates to pre-compute and how to derive or reliably estimate other aggregates from the pre-computed projections
- Extends the Structured Query Language (SQL) framework to accommodate queries that would otherwise have been computationally impossible on a relational database management system

Beyond OLAP
- Supporting queries at a much more abstract level than SQL and OLAP
- Computer-driven exploration of data as opposed to human analyst-driven
- Facilitating data exploration of high dimensional data
- Providing solutions when the user cannot describe the goal in terms of a specific query, e.g. discovering fraudulent cases in credit card or telephone use
- Visualizing and understanding massive volumes of high-dimensional data
- Rates of growth of data sets exceed by far any rates with which traditional human analyst techniques can cope
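The OLAP point about deriving answers from pre-computed aggregates can be illustrated with a minimal Python sketch. The fact rows, dimension names, and the helper functions `precompute` and `roll_up` are hypothetical and chosen only for illustration; a real OLAP engine would decide which aggregates to materialize based on query workload and storage cost.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, product, month, sales). Illustrative data only.
FACTS = [
    ("east", "tents", "jan", 120.0),
    ("east", "tents", "feb", 80.0),
    ("east", "boots", "jan", 50.0),
    ("west", "tents", "jan", 200.0),
    ("west", "boots", "feb", 70.0),
]

def precompute(facts, dims):
    """Aggregate sales over the chosen dimension indices in one pass over the facts."""
    cube = defaultdict(float)
    for row in facts:
        key = tuple(row[d] for d in dims)
        cube[key] += row[3]
    return dict(cube)

def roll_up(cube, keep):
    """Derive a coarser aggregate from a finer pre-computed one, without rereading the facts."""
    coarse = defaultdict(float)
    for key, total in cube.items():
        coarse[tuple(key[i] for i in keep)] += total
    return dict(coarse)

# Pre-compute sales by (region, product), then answer a by-region query from it.
by_region_product = precompute(FACTS, dims=(0, 1))
by_region = roll_up(by_region_product, keep=(0,))
print(by_region)
```

The by-region totals are obtained from the smaller (region, product) aggregate rather than from the transaction rows, which is the trade-off the slide alludes to: more pre-computation buys cheaper queries.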

A Definition for Data Mining
- Automated search procedures for discovering credible and actionable insights from large volumes of high dimensional data
  - emphasis upon symbolic learning and modeling methods (i.e. techniques that produce interpretable results)
  - data management methods
  - use of techniques from statistics, pattern recognition, and machine learning
- Machine learning and statistical modeling are also heavily used in vision, speech recognition, image processing, handwriting recognition, natural language understanding, etc.
- Issues of scalability and automated business intelligence solutions drive much of data mining and differentiate it from the other applications of machine learning and statistical modeling

Machine Learning and Statistical Modeling serve as an important core
[Diagram labels: Machine Learning; Statistical Modeling; Feature Creation and Analysis; Markov Modeling; Temporal Modeling; Complex Pattern Detection; Systems Performance Management; Data Management; Business Decision Support Systems; Knowledge Discovery and Data Mining; Speech Understanding; Handwriting Recognition; Computational Linguistics; Statistical Text Processing; Vision / Image; Knowledge Management; NLP/NLU; Others (Agents, Education, etc.)]

Typical Business Intelligence Applications
- Risk Analysis: given a set of current customers and their finance/insurance history data, build a predictive model that can be used to classify a new customer into a risk category
- Targeted Marketing: given a set of current customers and history on their purchases and their responses to promotions, target new promotions to those most likely to respond
- Customer Retention: given a set of past customers and their behavior prior to leaving, predict who is most likely to leave and take proactive action
- Fraud Detection: detect fraudulent activities either proactively or on-line in real time
- Many other new applications keep surfacing

There's More to It Than Just Data Mining
- The process of identifying valid, novel, potentially useful, and understandable patterns in data requires one or more of:
  - Selecting or sampling data from a data warehouse
  - Cleaning or pre-processing it
  - Transforming or reducing it
  - Applying a data mining component to extract models or patterns
  - Evaluating the derived structure
- The process is also known as KDD (Knowledge Discovery from Data Mining)
- Data mining is a key component, concerned with the algorithmic means by which structures are extracted from data while meeting computational efficiency constraints

The KDD Process
[Diagram labels: Identify Business Opportunity; Select; Transform; Mine; Assimilate; Data Warehouse; Selected data; Visualization]
data mining n. The process of extracting valid, previously unknown, and ultimately comprehensible and actionable information from large databases and using it to make crucial business decisions.

Data Mining Techniques
- Predictive Modeling: predict a specific attribute (database field) based upon the other attributes (fields) in the data
- Clustering (Segmentation): group data records into subsets where items in a subset are more similar to each other than to items in other subsets
- Frequent Patterns: find interesting similarities between a few attributes in subsets of the data
- Change & Deviation: detect and account for interesting sequences of information in data records
- Dependencies: generate the joint probability density function that might have generated the data

Predictive Modeling
- Estimate a function f that maps points from an input space X to an output space Y, given only a finite sampling of the mapping
  - Predict the value of a field (Y) in a database based on the other fields (X)
- Accurately construct an estimator f̂ of f from a finite sample known as the training set
  - The sample may be corrupted (i.e. noisy)
- If the predicted quantity is numeric (i.e. Y = R, the real line) then the prediction problem is that of regression modeling
- If the predicted quantity is discrete (i.e. Y is a finite set of class labels) then the prediction problem is that of classification modeling

Issues in Predictive Modeling
- Transformations on input space X to improve estimation capability
  - Feature extraction / construction / selection
- Evaluating the estimate f̂ in terms of how well it performs on data not present in the training set
- Maximizing prediction accuracy by avoiding under-fitting or over-fitting
- Trading off model complexity versus model accuracy
  - Bias-variance tradeoff, penalized likelihood, minimum message length (MML) or minimum description length (MDL)

Classification
- Predicting the most likely state of a categorical variable (the class) given the values of other variables
- Density estimation problem: deriving the value of Y given x in X from the joint density on Y and X
  - Kernel density estimators
  - Metric-space based methods (k-nearest neighbor)
- Projection into decision regions: divide attribute space into decision regions and associate a prediction with each region
  - Linear classifiers, neural networks, decision trees, disjunctive normal form (DNF) rule-based classifiers
- Projection methods are by far the most practical for data mining
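As a small illustration of two points above (a metric-space classifier, and evaluating the estimate on data not present in the training set), here is a sketch of k-nearest-neighbor classification scored on a hold-out sample. The data points, the class names, and the helper names `knn_predict` and `holdout_accuracy` are assumptions made up for this example; it is not the method used in the talk.

```python
import math
from collections import Counter

def knn_predict(train, x, k=3):
    """Predict the class of x by majority vote among its k nearest training points."""
    neighbors = sorted(train, key=lambda pair: math.dist(pair[0], x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

def holdout_accuracy(train, test, k=3):
    """Fraction of held-out points whose predicted class matches the actual class."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)

# Hypothetical two-attribute records labeled with a risk class (illustrative only).
TRAIN = [((1.0, 1.0), "low"), ((1.5, 2.0), "low"), ((2.0, 1.5), "low"),
         ((6.0, 6.5), "high"), ((7.0, 6.0), "high"), ((6.5, 7.5), "high")]
TEST = [((1.2, 1.4), "low"), ((6.8, 6.9), "high")]

accuracy = holdout_accuracy(TRAIN, TEST, k=3)
print(accuracy)
```

Keeping `TEST` separate from `TRAIN` is what makes the accuracy an estimate of generalization rather than of memorization. Choosing k is itself a complexity trade-off: a very small k tends to over-fit and a very large k tends to under-fit.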

Regression
- Predicting the most likely value of a numerical variable (the target column) given the values of other variables
- Numerical function approximation problem: deriving the value of Y given x in X from the joint probability distribution on Y and X
  - Statistical probability models (e.g. linear regression)
- Projection into decision regions: divide attribute space into decision regions and associate a constant value with each region
  - Neural networks, decision trees, disjunctive normal form (DNF) rule-based classifiers
- Hybrid: coupling projection methods with statistical models
- Projection and hybrid methods are by far the most practical for data mining

The Predictive Modeling Process
- Mine historical data to train patterns/models that can predict future behaviors
  - Behaviors: response to direct mail; product quality (defects); declining activity; credit risk; delinquency; likelihood to buy specific products; profitability; etc.
- Score with models to reflect likelihood to exhibit the modeled behavior
- Act to optimize business objectives based on these scores

Decision Trees for Predictive Modeling
- Tree generation algorithm
  - Beginning with the training set at the root node, recursively split until a stopping criterion is met
  - Split using the best test among all possible tests on all attributes
  - Prune the tree (MDL, cross-validation, etc.)
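The tree generation steps above can be sketched as follows. This is a minimal illustration, assuming numerical attributes only, the Gini index as the split score, and a depth limit plus node purity as the stopping criterion; pruning is left out. The attribute meanings, the data, and the function names are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Try every (attribute, cut-point) test; return the one with lowest weighted Gini."""
    best = None
    for a in range(len(rows[0])):
        for cut in sorted({r[a] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[a] < cut]
            right = [y for r, y in zip(rows, labels) if r[a] >= cut]
            if not left or not right:
                continue  # a test must send records to both branches
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, a, cut)
    return best

def build(rows, labels, depth=0, max_depth=3):
    """Recursively split until the node is pure, unsplittable, or max_depth is reached."""
    can_split = len(set(labels)) > 1 and depth < max_depth
    split = best_split(rows, labels) if can_split else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    _, a, cut = split
    lo = [(r, y) for r, y in zip(rows, labels) if r[a] < cut]
    hi = [(r, y) for r, y in zip(rows, labels) if r[a] >= cut]
    return (a, cut,
            build([r for r, _ in lo], [y for _, y in lo], depth + 1, max_depth),
            build([r for r, _ in hi], [y for _, y in hi], depth + 1, max_depth))

def predict(tree, row):
    """Follow inequality tests from the root until a leaf label is reached."""
    while isinstance(tree, tuple):
        a, cut, left, right = tree
        tree = left if row[a] < cut else right
    return tree

# Hypothetical records: (credit_limit, risk_score) with a made-up class label.
ROWS = [(1000, 600), (1500, 700), (2500, 650), (3000, 720), (2200, 690), (1800, 640)]
LABELS = ["low", "low", "high", "high", "high", "low"]

tree = build(ROWS, LABELS)
print(tree, predict(tree, (2000, 700)))
```

The greedy search picks the locally best test at each node and never revisits it. In practice a tree grown this way would usually be pruned, since the stopping rule alone does not protect against over-fitting.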

Issues in Decision Tree Building
- Splitting at nodes
  - greedy search: GINI (entropy minimizing), class probability profile difference, log-loss likelihood, etc.
  - exhaustive search: ReliefF, Contextual Merit, etc.
- Testing of attributes
  - Numerical attributes: inequality tests on cut-points
  - Categorical attributes: subset tests
- Leaf models
  - Piecewise constant
  - Linear
  - Probability functions
- Scalability

A Typical Decision Tree
[Figure: decision tree for Expected Sales Revenue. Tests and leaf labels as extracted: Historical Sales per Mlg less than 7 / 7-15 / greater than 15 (Segment 8); Hist Avg Sales per Order less than 113 (Segment 1) / greater or equal to 113; Credit Limit less than 2200 (Segment 6) / greater or equal to 2200 (Segment 7); Climate Indicator 0 (Segment 2) / Climate Indicator 1; Risk Score less than 687 (Segment 3) / greater or equal to 687; Outdoor Purchases less than 3 (Segment 4) / greater or equal to 3 (Segment 5)]

Training a Predictive Model
[Diagram labels: Observations; Predictive Model; Predicted Outcomes; Actual Outcomes; Prediction Errors]

[Figure: training versus generalization. Panel labels: Training; Validation; too many segments (over fit); too few segments (under fit); about right. Plot of prediction error against number of rules (degrees of freedom), with curves for training error and validation error and the optimum rule set marked.]

Clustering
- Given a finite sampling of points, group them into sets of similar points
- Representing clusters of points with common characteristics
- In predictive modeling, class (or value) membership is known in the training data
- In clustering, this knowledge is not known a priori, and is perhaps being discovered by clustering or segmentation

Techniques for Clustering/Segmentation
- Two-stage approach
  - outer loop to determine the cluster number k
  - inner loop to fit points to clusters
- Metric distance-based methods: find the best k-way partition so that points in a partition are closer to each other than to points in other partitions
- Model based methods: a best fit (very typically probabilistic) model is hypothesized for each cluster
- Partition based methods: iteratively enumerating and scoring various partition scenarios using heuristic scoring functions

k-means Clustering Algorithm
- Widely used in data mining
- Given k cluster centers $c_{1,j}, c_{2,j}, \ldots, c_{k,j}$ at iteration j, compute $c_{1,j+1}, c_{2,j+1}, \ldots, c_{k,j+1}$
  - Cluster assignment: for each $i = 1, \ldots, m$, assign $x_i$ to cluster $l(i)$ such that $c_{l(i),j}$ is nearest to $x_i$
  - Cluster center update: for $l = 1, \ldots, k$, set $c_{l,j+1}$ to be the mean of all $x_i$ assigned to $c_{l,j}$
  - Stop when $c_{l,j} = c_{l,j+1}$ for $l = 1, \ldots, k$
- Extensions include support for scalability, efficient placement of the initial k means, and (a harder problem) determining the number of clusters k

Frequent Patterns
- Extracting compact patterns that describe subsets of data
  - Row-wise patterns
  - Column-wise patterns
- Association rules: detecting combinations of attribute values that occur with a minimum level of frequency (support) and certainty (confidence)
- Scalable algorithms can find all such rules in linear time under certain conditions of data sparseness
- Rules are not statements about causal effects amongst attributes, but can still provide useful insights
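The k-means iteration described above (assignment, center update, stop when the centers no longer change) can be sketched in a few lines. The sample points and initial centers are made up for illustration, and two choices here are assumptions not stated on the slide: an empty cluster keeps its previous center, and `max_iter` caps the number of iterations as a safety net.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Iterate assignment and center update until the centers stop changing."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        # Cluster assignment: each x_i goes to the cluster whose center is nearest.
        clusters = [[] for _ in centers]
        for x in points:
            nearest = min(range(len(centers)), key=lambda l: math.dist(x, centers[l]))
            clusters[nearest].append(x)
        # Cluster center update: the mean of the points assigned to each center.
        # Assumption: an empty cluster keeps its previous center.
        new_centers = [
            tuple(sum(coord) / len(members) for coord in zip(*members)) if members else c
            for members, c in zip(clusters, centers)
        ]
        if new_centers == centers:  # stop when c_{l,j} = c_{l,j+1} for every l
            break
        centers = new_centers
    return centers, clusters

# Hypothetical two-dimensional points and initial centers (k = 2).
POINTS = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centers, clusters = kmeans(POINTS, centers=[(1.0, 1.0), (8.0, 8.0)])
print(centers)
```

The result depends on the initial centers, which is why the slide lists efficient placement of the initial k means as an extension. The number of clusters k is an input here and is not determined by the procedure.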

Change and Deviation
- Detecting sequence information, temporal or otherwise
- Ordering information of transactions (rows) is utilized
- Under certain conditions of data sparseness, sequences with desired levels of frequency and certainty can be computed in linear time

Dependency Modeling
- Detecting causal structure within data
  - Causal models
  - Discovering probabilistic distributions governing the data
  - Discovering functional dependencies between attributes in the data
- Techniques
  - Density estimation methods
  - Expectation maximization
  - Explicit causal modeling
  - Bayesian networks

Applying Data Mining in Business
- Goals: Profit; Customer Satisfaction; Efficiency
- Who are the best customers to sell my products to?
- What are the most effective market segments for my business?
- How do I increase market share of my products?
- How do I reduce my costs and not impact production?
- How do I optimize my inventory?

Mapping Operations into Applications
- Predictive Modeling: assigning risk levels to new insurance and financial contracts
- Clustering / Segmentation: identifying distinct market groups in a customer population
- Frequent Patterns: market basket analysis (what gets shopped together in a supermarket)
- Change and Deviation: fraud discovery in health claim data; discovering shopping patterns over time

Business Application Opportunities
- Retail/Distribution: category management; merchandise planning; product management; production planning/tracking
- Insurance/Healthcare: claims analysis; provider analysis; managed care; outcomes analysis
- Manufacturing: product costing; manufacturing quality and efficiency; parts analysis
- Utilities: industrial customer profiles; financial analysis; "Bulk Power" analysis
- Government: budgeting; financial reporting; demographics
- Telecommunications: customer profiles/segmentation; product profitability; demand forecasting; usage analysis
- Cross Industry: market basket analysis; target marketing; customer segmentation; customer service; fraud and abuse; financial performance
- Transportation: yield management; pricing/rate analysis; logistics
- Financial: customer profitability and segmentation; products/portfolio profitability; risk management; cross-sale analysis; branch performance

The Challenge
- Questions: Where is it? What does it mean? How can I get it? What format is it in?
- No single view of data
- Many interfaces
- Difficult to access
- Multiple sources
- Different formats, platforms
- Inconsistencies, redundancies
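Market basket analysis, listed above under Frequent Patterns, rests on the two measures that define an association rule: support and confidence. The sketch below computes both by brute force over item pairs. The baskets, thresholds, and function names are hypothetical, and this enumeration is only an illustration, not one of the scalable algorithms the talk refers to.

```python
from itertools import combinations

# Hypothetical supermarket baskets (illustrative data only).
BASKETS = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def pair_rules(baskets, min_support, min_confidence):
    """Return rules lhs -> rhs over item pairs that meet both thresholds."""
    items = sorted(set().union(*baskets))
    rules = []
    for a, b in combinations(items, 2):
        s = support({a, b}, baskets)
        if s < min_support:
            continue  # the pair is not frequent enough
        for lhs, rhs in ((a, b), (b, a)):
            # Confidence: how often rhs appears among baskets containing lhs.
            conf = s / support({lhs}, baskets)
            if conf >= min_confidence:
                rules.append((lhs, rhs, s, conf))
    return rules

rules = pair_rules(BASKETS, min_support=0.5, min_confidence=0.7)
for lhs, rhs, s, conf in rules:
    print(f"{lhs} -> {rhs}: support={s:.2f}, confidence={conf:.2f}")
```

A rule found this way says only that the items co-occur often; as the Frequent Patterns slide notes, it is not a statement about causal effects.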

An Efficient Environment for Data Mining
[Diagram labels: Enterprise Model; Operational; Enterprise Data Warehouse; Global Data Mart; Specialized Analysis Data Marts; External; Data Mining; Analysis; Transaction]
- Transformation (legacy system context removal)
- Meta-data definition (e.g. consistent business terms)
- Analytical dimension definition (e.g. time, policyholder)
- Summarization
- Aggregation
- Identification / Collection
- Review
- Conversion / Reduction / Normalization
- Representation

The Business Intelligence Process
[Diagram labels: Access; Transform; Distribute; Store; Find & Understand; Operational & External; Enhancing; Staging; Relational; Summarizing; Aggregating; Data Flow & Process Flow; Joining From Multiple Sources; Populating On-Demand; Automate & Manage; Multiple Platforms & Hardware; Information Catalog; Business Views; Models; Discover, Analyze, Visualize; Query; Interpretation; Multi-Dimensional Analysis; Data Mining; Multi-Vendor Support; Open Interfaces]

Data Mining Marketplace Status
- Enabler for business intelligence systems
- Data mining algorithm suites
  - Loosely coupled with database technology
- Emphasis on data warehousing followed by exploratory data mining
  - Typically conducted by consultants or in-house analytic teams
- Issues
  - Data warehouse requirements
  - Sophisticated analytics requirements

Key Challenges and Trends
- Infrastructure: enabling transparent and pervasive usage
- Algorithms: optimized and robust mining
- Solutions: vertically integrated for critical problems
- Enhanced emphasis on the Internet

Infrastructure: Making Data Mining Transparent
- Database extenders
  - e.g. DB2/UDB User Defined Functions for model training/scoring
  - Sufficient statistics (e.g. histograms, counts, samples, etc.)
- Parallel and distributed data mining
  - Scalability (sampling and parallelization)
- XML based APIs for database coupling and application embedding
- Interoperability
  - Training and scoring in different environments
- Intelligent or semi-automated data warehousing for mining
  - Industry specific templates
  - Meta-data mining

Algorithms: Robust and Automated
- Evaluation metrics
- Automated feature extraction / transformation / selection
- Discovering relational and hierarchical structures amongst attributes
- Incorporating prior knowledge to account for costs / benefits / uncertainty / missing values
- Incremental and on-line mining
- Privacy preserving data mining
- Heterogeneous data mining

Solutions
- Business
  - Risk management
  - Targeted marketing
  - Portfolio management
- Systems
  - Performance management
- Internet
  - Site profiling and performance tuning
  - User personalization

Summary
- Data mining is being embedded in vertical solutions for business intelligence and decision support
  - Data management
  - ecommerce
  - Critical large-scale solutions (CRM, etc.)
- Using data mining should eventually become as easy and pervasive as working with databases and spreadsheets today

References
- Mathematical Programming for Data Mining: Formulations and Challenges, by Bradley et al., INFORMS Journal on Computing, Volume 11, No. 3, Summer 1999
- KDD Nuggets: http://www.kdnuggets.com
- IBM: http://www.research.ibm.com/compsci/kdd, http://www.research.ibm.com/dar, http://www.ibm.com/bi