Automation through Structured Risk Minimization. Robert Cooley, Ph.D. VP Technical Operations Knowledge Extraction Engines (KXEN), Inc.

Similar documents
EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER. Copyr i g ht 2013, SAS Ins titut e Inc. All rights res er ve d.

INTRODUCTION TO DATA MINING SAS ENTERPRISE MINER

Energy Savings from Business Energy Feedback

Text Analytics using High Performance SAS Text Miner

Knowledge Discovery from patents using KMX Text Analytics

Knowledge Discovery and Data Mining. Structured vs. Non-Structured Data

KXEN 08_09_17_PKDD'08_KXEN 1

Predictive Analytics Techniques: What to Use For Your Big Data. March 26, 2014 Fern Halper, PhD

White Paper. Thirsting for Insight? Quench It With 5 Data Management for Analytics Best Practices.

In this presentation, you will be introduced to data mining and the relationship with meaningful use.

Analysis One Code Desc. Transaction Amount. Fiscal Period

Data Mining: Overview. What is Data Mining?

TDWI Best Practice BI & DW Predictive Analytics & Data Mining

Acknowledgments. Data Mining with Regression. Data Mining Context. Overview. Colleagues

Data Mining Methods: Applications for Institutional Research

Microsoft Azure Machine learning Algorithms

The Power of Predictive Analytics

Decision Trees from large Databases: SLIQ

Why do statisticians "hate" us?

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

The Scientific Data Mining Process

Data Mining for Manufacturing: Preventive Maintenance, Failure Prediction, Quality Control

Change-Point Analysis: A Powerful New Tool For Detecting Changes

IBM SPSS Direct Marketing 19

DATA MINING TECHNIQUES AND APPLICATIONS

CS 6220: Data Mining Techniques Course Project Description

Introduction to Data Mining

A STUDY ON DATA MINING INVESTIGATING ITS METHODS, APPROACHES AND APPLICATIONS

CSCI-599 DATA MINING AND STATISTICAL INFERENCE

Data Cleansing for Remote Battery System Monitoring

Understanding Big Data Analytics Applications in Earth Science Morris Riedel, Rahul Ramachandran/Kuo Kwo-Sen, Peter Baumann Big Data Analytics

IBM SPSS Direct Marketing 23

MHI3000 Big Data Analytics for Health Care Final Project Report

An Overview of Data Mining: Predictive Modeling for IR in the 21 st Century

IBM SPSS Direct Marketing 22

Lecture/Recitation Topic SMA 5303 L1 Sampling and statistical distributions

Section 3 Part 1. Relationships between two numerical variables

The Data Mining Process

Azure Machine Learning, SQL Data Mining and R

DHL Data Mining Project. Customer Segmentation with Clustering

C o p yr i g ht 2015, S A S I nstitute Inc. A l l r i g hts r eser v ed. INTRODUCTION TO SAS TEXT MINER

Silvermine House Steenberg Office Park, Tokai 7945 Cape Town, South Africa Telephone:

Customer and Business Analytic

Performing a data mining tool evaluation

Data Quality Mining: Employing Classifiers for Assuring consistent Datasets

Anomaly Detection and Predictive Maintenance

An Introduction to Advanced Analytics and Data Mining

Feature Selection using Integer and Binary coded Genetic Algorithm to improve the performance of SVM Classifier

Distances, Clustering, and Classification. Heatmaps

From Raw Data to. Actionable Insights with. MATLAB Analytics. Learn more. Develop predictive models. 1Access and explore data

Comparison of machine learning methods for intelligent tutoring systems

Traffic Prediction and Analysis using a Big Data and Visualisation Approach

Data Mining mit der JMSL Numerical Library for Java Applications

(b) How data mining is different from knowledge discovery in databases (KDD)? Explain.

Knowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19 - Bagging. Tom Kelsey. Notes

Framing Business Problems as Data Mining Problems

IBM SPSS Direct Marketing

Predictive Analytics

Make Better Decisions Through Predictive Intelligence

The Artificial Prediction Market

Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal

Easily Identify Your Best Customers

AdTheorent s. The Intelligent Solution for Real-time Predictive Technology in Mobile Advertising. The Intelligent Impression TM

Extend Table Lens for High-Dimensional Data Visualization and Classification Mining

The Predictive Data Mining Revolution in Scorecards:

Analytical Test Method Validation Report Template

Predictive analytics for the business analyst: your first steps with SAP InfiniteInsight

Using Data Mining for Mobile Communication Clustering and Characterization

IBM SPSS Direct Marketing 20

Service courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics.

Correlation Coefficient The correlation coefficient is a summary statistic that describes the linear relationship between two numerical variables 2

Support Vector Machine. Tutorial. (and Statistical Learning Theory)

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Statistics for BIG data

What is Data Mining, and How is it Useful for Power Plant Optimization? (and How is it Different from DOE, CFD, Statistical Modeling)

Social Media Mining. Data Mining Essentials

ORACLE BUSINESS INTELLIGENCE, ORACLE DATABASE, AND EXADATA INTEGRATION

CS 2750 Machine Learning. Lecture 1. Machine Learning. CS 2750 Machine Learning.

SAP Predictive Analysis Real Life Use Case Predicting Who Will Buy Additional Insurance

The right edge of the box is the third quartile, Q 3, which is the median of the data values above the median. Maximum Median

Data Mining Applications in Higher Education

Pentaho Data Mining Last Modified on January 22, 2007

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms

Data Mining: An Overview. David Madigan

CS Introduction to Data Mining Instructor: Abdullah Mueen

Classifying Manipulation Primitives from Visual Data

COPYRIGHTED MATERIAL. Contents. List of Figures. Acknowledgments

Easily Identify the Right Customers

Data Mining and Exploration. Data Mining and Exploration: Introduction. Relationships between courses. Overview. Course Introduction

CRISP - DM. Data Mining Process. Process Standardization. Why Should There be a Standard Process? Cross-Industry Standard Process for Data Mining

The Operational Value of Social Media Information. Social Media and Customer Interaction

Advanced Analytics for Call Center Operations

Data Mining Techniques - SPSS Clementine

Transcription:

Automation through Structured Risk Minimization Robert Cooley, Ph.D. VP Technical Operations Knowledge Extraction Engines (KXEN), Inc.

Personal Motivation & Background When the solution is simple, God is answering Albert Einstein 99 Navy Nuclear Reactor Design Emphasis on simple but highly functional design The key is to automate data mining as much as possible, but not more so Gregory Piatetsky-Shapiro 996 - Data mining research Emphasis on automation of knowledge discovery

Personal Motivation & Background Solving a problem of interest, do not solve a more general problem as an intermediate step Vladimir Vapnik 997 Structured Risk Minimization (SRM) Emphasis on simplicity to manage generalization 998 Support Vector Machines (SVM) for Text Classification Hoping for better generalization, ended up with increased automation When all you have is a hammer, every problem looks like a nail Abraham Maslow 2 Joined KXEN Automation through SRM

Learning / Generalization Under Fit Model/High Robustness (Training Error = Test Error) Over Fit Model/Low Robustness (No Training Error, High Test Error) Robust Model (Low Training Error Low Test Error)

What is Robustness? Statistical Ability to generalize from a set of training examples Are the available examples sufficient? Training Ability to handle a wide range of situations Any number of variables, cardinality, missing values, etc. Deployment Ability to resist degradation under changing conditions Values not seen in training Engineering Ability to avoid catastrophic failure Return an answer without crashing under challenging conditions

From W. of Ockham to Vapnik If two models explain the data equally well then the simpler model is to be preferred st discovery: VC (Vapnik-Chervonenkis) dimension measures the complexity of a set of mappings.

From W. of Ockham to Vapnik If two models explain the data equally well then the model with the smallest VC dimension is to be preferred st discovery: VC (Vapnik-Chervonenkis) dimension measures the complexity of a set of mappings. 2nd discovery: The VC dimension can be linked to generalization results (results on new data).

From W. of Ockham to Vapnik If two models explain the data equally well then the model with the smallest VC dimension has better generalization performance st discovery: VC (Vapnik-Chervonenkis) dimension measures the complexity of a set of mappings. 2nd discovery: The VC dimension can be linked to generalization results (results on new data). 3rd discovery: Don t just observe differences between models, control them.

From W. of Ockham to Vapnik To build the best model try to optimize the performance on the training data set, AND minimize the VC dimension st discovery: VC (Vapnik-Chervonenkis) dimension measures the complexity of a set of mappings. 2nd discovery: The VC dimension can be linked to generalization results (results on new data). 3rd discovery 3rd discovery: Don t just observe differences between models, control them.

Structured Risk Minimization (SRM) Quality: How well does a model describe your existing data? Achieved by minimizing Error. Reliability: How well will a model predict future data? Achieved by minimizing Confidence Interval. Risk Best Model Total Risk Error Conf. Interval Model Complexity

SRM is a Principle, NOT an Algorithm Consistent Coder Robust Regression Smart Segmenter Support Vector Machine

Features of SRM Adding Variables is Low Cost, Potentially High Benefit Does not cause over-fitting More variables can only add to the model quality Random variables do not harm quality or reliability Highly correlated variables do not harm the modeling process Efficient scaling with additional variables Free of Distribution Assumptions Normal distributions aren t t necessary Skewed distributions do not harm quality Resistant to outliers Independence of inputs isn t t necessary Indicates Generalization Capacity of any Model Insufficient training examples to get a robust model Choose between several models with similar training errors

SRM enables Automation without sacrificing Quality & Reliability Create Analytic Dataset Select Variables Prepare Data Train Model Validate & Understand Analytic Dataset Creation (Lack of) Variable Selection Data Preparation Model Selection Model Testing

Analytic Data Set Creation Adding Variables is Low Cost, Potentially High Benefit Kitchen Sink philosophy for variable creation Don t shoe-horn several pieces of information into a single variable (e.g. RFM) Break events out into several different time periods A single analytic data set with hundreds or thousands of columns can serve as input for hundreds of different models

Event Aggregation Example Price 5 4 3 2 5 4 3 2 A A C C B B Customer BB Jan Feb Mar Apr May A A Customer 2 Customer 2 32 C C Jun Time Customer Customer 2 RFM Aggregation Customer RFM 32 Simple Aggregation A B C Total purchase 5 5 Relative Aggregation Transitions Month $45 Month 2 $35 Month 3 Month 4 Month 5 $25 Month 6 A out C B B A C out B C A B $25 $35 $45

Data Preparation Free of Distribution Assumptions Different encodings of the same information generally lead to the same predictive quality No need to check for co-linearities No need to check for random variables SRM can be used within the data preparation process to automate binning of values

Binning Nominal Group values TV, Chair, VCR, Lamp, Sofa [TV,VCR], [Chair, Sofa], [Lamp] Ordinal Group values, preserving order, 2, 3, 4, 5 [, 2, 3], [4, 5] Continuous Group values into ranges.5, 2.6, 5.2, 2., 3. [.5 5.2], [2. 3.]

Why Bin Variables? Performance Captures non-linear behavior of continuous variables Minimizes impact of outliers Removes noise from large numbers of distinct values Explainability Speed Grouped values are easier to display and understand Predictive algorithms get faster as the number of distinct values decreases

Binning Strategies None Leave each distinct value as a separate input Standardized Group values according to preset ranges Age: [ 9], [2 29], [3 39], Zip Code: [94 9499], [942 94299], Target Based Use advanced techniques to find the right bins for a given question. DEPENDS ON THE QUESTION!

No Binning 9 8 7 6 5 4 3 2 Customer Value by Age 8 9 2 2 22 23 24 25 26 27 28 29 3 3 32 33 34 35 36 37 38 39 4 4 42 43 44 45

Standardized Binning 8 7 6 5 4 3 2 Customer Value by Age 8-25 26-35 36-45

Target Based Binning 9 8 7 6 5 4 3 2 Customer Value by Age 8-28 29-39 4-45

Predictive Benefit of Target Based Binning Sine Wave Training Dataset KXEN Approximation - 2 Bins 2 2. 5.5. 5.5 Y -2 2 4 6 8 2 -. 5-2 2 4 6 8 2 -.5 - -. 5-2 - -.5 X -2

Model Testing Indicates Generalization Capacity of any Model It is not always possible to get a robust model from a limited set of data Determine if a given number of training examples is sufficient Determine amount of model degradation over time

Traditional Modeling Methodology CRISP-DM Business Understanding Data Understanding Data Preparation Modeling Evaluation Deployment SEMMA Sample Explore Modify Model Assess

SRM Modeling Methodology. Business Understanding 2. Create Analytic Dataset 3. Modeling 4. Data Understanding 5. Deployment 6. Maintenance

Conclusions SRM is a general principle that can be applied to all phases of data mining The main benefit of applying SRM is increased automation, not increased predictive quality In order to take full advantage of the benefits, the overall modeling methodology must be modified

Thank You Questions? Contact Information: rob.cooley@kxen.com 65-338-388 Additional KXEN Information: www.kxen.com sales-us@kxen.com 5/25 NYC Discovery Workshop, 9am to pm