Framing Business Problems as Data Mining Problems


Asoka Diggs, Data Scientist, Intel IT
January 21, 2016

Legal Notices
This presentation is for informational purposes only. INTEL MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. * Other names and brands may be claimed as the property of others. Copyright 2016, Intel Corporation. All rights reserved.

Outline
- Introduction, assumptions, and value
- Iris data mining demo
- Framing business problems as data mining problems
- Break
- ereader data mining demo
- More on framing business problems as data mining problems
- Questions throughout
- Resources and links for further study

Acknowledgements
Textbook: Data Mining for the Masses, by Dr. Matthew North

My Background
- DBA starting in the late '90s
- Various data management roles since then
  - Entity-Relationship modeling
  - ETL development
  - Data integration
  - Data architecture
- MS in Predictive Analytics, Fall 2011 to June 2015
- Designed and teach an internal Intro to Data Mining class (2.5 days)

Assumptions
- The audience is composed of experienced Data Management Professionals
- You're interested in Predictive Analytics / Data Mining but haven't studied it
- You're already convinced of the value of Predictive Analytics / Data Mining
- We won't particularly be covering the value of Advanced Analytics in a business context here

What you will learn
- Class is designed to teach a functional mental model of data mining
- Definition: Mental Model
  - A mental model is an explanation of someone's thought process about how something works in the real world
  - It acts as an information filter
  - It causes selective perception, or perception of only selected parts of information
  - Source: http://en.wikipedia.org/wiki/mental_model
- With the mental model, you'll be making your own connections between data mining and skills / experience you already have
- Goal: Begin learning how to frame business problems as data mining problems
- Real goal: realize that you can learn to frame business problems as data mining problems with minimal effort
  - But more effort than a 3 hour demo version

Formulate the data mining problem

Descriptive Analytics: What happened?
  Examples: What is the attrition in the last 6 months? Which customers have I lost?
Diagnostic Analytics: Why did it happen?
  Example: Why has the attrition rate increased in the last 6 months?
Predictive Analytics: What will happen?
  Example: Which customers are most likely to leave?
Prescriptive Analytics: What should I do?
  Examples: Which customers are most likely to leave if I don't contact them? Which customers should I target for retention? What if?
Pre-emptive Analytics: What more can I do?
  Examples: What can I offer before the customer even realizes they have the need? Value-add?

Progression: Information, Insights, Decision, Action

Grain: Another way to think about data
- Level of detail: what is an observation (row) an observation of?
  - Think of the grain in a piece of wood
  - Example: Cumulative GPA vs. per-course grade point (transcript)
- Has proven useful in reporting, data structure, formulating ML problems, etc.
  - Can you identify the grain of an answer to the business question?
  - What is the grain of the source data you are analyzing?
  - Is the grain of the (data mining) analysis of the source data compatible with the grain of the business question?
  - Example: Cumulative GPA is not a good grain to analyze fairness across faculty
- Don't mix grain in a data set
  - Or at least only mix grain consciously and intentionally
- If you ever have an itchy feeling that something about the data is wrong, start with the grain
  - Can you even identify what the grain is?
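The GPA example can be made concrete with a small sketch. The rows, column names, and helper functions below are hypothetical illustration data, not material from the class. The sketch shows that rolling a per-course transcript up to one row per student discards the instructor column, which is why cumulative GPA cannot answer a question about fairness across faculty.

```python
from collections import defaultdict

# Hypothetical transcript rows. Grain: one row per student per course.
transcript = [
    {"student": "s1", "course": "MATH101", "instructor": "A", "credits": 4, "grade_points": 3.0},
    {"student": "s1", "course": "HIST201", "instructor": "B", "credits": 2, "grade_points": 4.0},
    {"student": "s2", "course": "MATH101", "instructor": "A", "credits": 4, "grade_points": 2.0},
    {"student": "s2", "course": "HIST201", "instructor": "B", "credits": 2, "grade_points": 4.0},
]


def cumulative_gpa(rows):
    """Roll the data up to a coarser grain: one value per student."""
    points = defaultdict(float)
    credits = defaultdict(int)
    for row in rows:
        points[row["student"]] += row["grade_points"] * row["credits"]
        credits[row["student"]] += row["credits"]
    # The instructor is no longer recoverable from this result.
    return {student: points[student] / credits[student] for student in points}


def mean_grade_by_instructor(rows):
    """A per-instructor question needs the per-course grain."""
    grades = defaultdict(list)
    for row in rows:
        grades[row["instructor"]].append(row["grade_points"])
    return {name: sum(values) / len(values) for name, values in grades.items()}


gpa = cumulative_gpa(transcript)
by_instructor = mean_grade_by_instructor(transcript)
```

Only the second function can compare instructors, because only the per-course grain still carries that attribute.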

The 1-Slide Intro to Data Mining class
[Diagram: Training Data, Build Analytic Model, Model Validation, Scoring Data, Scored Data]
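The stages named on this slide can be sketched end to end in a few lines. This is a minimal illustration under stated assumptions: the data values are invented toy measurements, and the "model" is a single threshold (the midpoint between two class means), chosen only because it is short. It is not the decision tree used in the demos.

```python
# Toy (measurement, label) pairs; the values are hypothetical illustration data.
training_data = [(1.4, "small"), (1.5, "small"), (1.3, "small"),
                 (4.7, "large"), (4.5, "large"), (5.1, "large")]
validation_data = [(1.6, "small"), (4.9, "large")]  # response known, held out
scoring_data = [1.2, 4.0]                           # response unknown


def build_model(rows):
    """Learn one threshold: the midpoint between the two class means."""
    small = [x for x, label in rows if label == "small"]
    large = [x for x, label in rows if label == "large"]
    return (sum(small) / len(small) + sum(large) / len(large)) / 2


def predict(threshold, x):
    return "small" if x < threshold else "large"


def validate(threshold, rows):
    """Fraction of held-out rows whose predicted label matches the known one."""
    hits = sum(predict(threshold, x) == label for x, label in rows)
    return hits / len(rows)


def score(threshold, values):
    """Apply the model to observations whose response is unknown."""
    return [predict(threshold, x) for x in values]


model = build_model(training_data)              # Build Analytic Model
accuracy = validate(model, validation_data)     # Model Validation
scored_data = list(zip(scoring_data, score(model, scoring_data)))  # Scored Data
```

Whatever technique replaces the threshold, the three calls at the bottom keep the same shape: build from training data, check against held-out data, then score new data.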

Demo: Decision Tree and Iris data set

A variation on CRISP-DM
[Diagram: Business Problem -> Frame as Data Mining Problem -> Data Mining Problem -> Analytical Modeling -> Analytic Model -> Use Model / Deliver Model to Production]

Data Science / Scientist
Anatomy of a Data Scientist
Plus:
- Data Management
  - ETL development
  - ER modeling
- Communicator
  - Technical details
  - Strategic ramifications
  - System impact
- Presentation developer
- Project Management
- Curious / inquisitive
- Creative

Effort to achieve skill level (most to least)
- 2000 hr / Graduate degree
- 200 hr / ~1 Graduate class
- 20 hr / 2.5-day Intro to Data Mining class
- 2 hr / Data Mining demonstration

Deliverable Expectations (most to least advanced)
- Create Data Products
- Create Analytic Models
- Frame business problems as data mining problems
- Recognize the need to learn more about data mining

Break

Demo: Decision Tree and ereader data set

Data Types (NOIR)
Discrete / Categorical / Qualitative (alphanumeric/text)
- Nominal: no meaning in order
  - Distinctness ( =, != ); Example: flavors of ice cream
  - Binary/Binomial variables: represent presence or absence of something
    - Not necessarily Black/White; Example: gender
- Ordinal: order makes sense
  - Order ( <, > ); Example: your preferred flavors of ice cream
Continuous / Quantitative (numbers only)
- Interval: sum/difference meaningful, but product/ratio not
  - Addition ( +, - ); Example: temperature in Fahrenheit, time of day AM/PM
- Ratio: ratios make sense in addition to sum/difference
  - Multiplication ( *, / ); Example: absolute time, absolute temp, height, weight, wavelength
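One way to read the NOIR list is as a ladder, where each level keeps the operations of the levels before it and adds its own. The sketch below encodes that reading; the names LEVELS, NEW_OPERATIONS, and the two helper functions are hypothetical and exist only for this illustration.

```python
# Levels in NOIR order; each one adds operations to those before it.
LEVELS = ["nominal", "ordinal", "interval", "ratio"]

NEW_OPERATIONS = {
    "nominal": {"==", "!="},   # distinctness
    "ordinal": {"<", ">"},     # order
    "interval": {"+", "-"},    # sum / difference
    "ratio": {"*", "/"},       # product / ratio
}


def meaningful_operations(level):
    """Collect the operations introduced at this level and every lower one."""
    operations = set()
    for name in LEVELS[: LEVELS.index(level) + 1]:
        operations |= NEW_OPERATIONS[name]
    return operations


def is_meaningful(level, operation):
    return operation in meaningful_operations(level)


# Temperature in Fahrenheit is interval data: differences make sense,
# ratios do not ("twice as hot" has no meaning on that scale).
fahrenheit_difference_ok = is_meaningful("interval", "-")
fahrenheit_ratio_ok = is_meaningful("interval", "/")
```

A check like this can serve as a data prep reminder: before averaging or dividing a column, ask which level it sits on.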

Supervised and Unsupervised learning
Machine Learning and predictive modeling fall into two buckets
- Supervised or Directed (Classification)
  - Presence of a response or outcome variable
  - Finds relationships in the data
  - ~75% of the techniques
  - Generally, more interesting and useful business questions can be addressed
- Unsupervised or Undirected (Clustering)
  - Absence of a response or outcome variable
  - Finds patterns in the data
  - Not as many techniques
  - Easier to find patterns that aren't useful
- Think of the response variable as human input into the model training and validation process
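The difference between the two buckets shows up in the function signatures. In the sketch below, the supervised function takes labels (the response variable) and the unsupervised one does not. Both techniques are stand-ins picked for brevity (a nearest-centroid fit and a one-dimensional 2-means clustering), and the data values are invented; the clustering function assumes at least two distinct values.

```python
points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]    # hypothetical measurements
labels = ["x", "x", "x", "y", "y", "y"]    # response variable (human input)


def fit_centroids(values, responses):
    """Supervised: the known labels decide which values are averaged together."""
    groups = {}
    for value, response in zip(values, responses):
        groups.setdefault(response, []).append(value)
    return {name: sum(group) / len(group) for name, group in groups.items()}


def two_means(values, iterations=10):
    """Unsupervised: 1-D k-means with k=2; no labels are used at all."""
    low, high = min(values), max(values)
    for _ in range(iterations):
        near_low = [v for v in values if abs(v - low) <= abs(v - high)]
        near_high = [v for v in values if abs(v - low) > abs(v - high)]
        low = sum(near_low) / len(near_low)
        high = sum(near_high) / len(near_high)
    return low, high


centroids = fit_centroids(points, labels)  # named groups: the names come from labels
clusters = two_means(points)               # unnamed groups: a person must interpret them
```

On this toy data both approaches find the same two centers, but only the supervised result says which group is "x" and which is "y".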

Machine Learning (ML) input and output
- Any ML algorithm will have certain kinds of data it can take as input
  - And will produce certain kinds of data as output
  - Example: Logistic Regression requires numeric input (including binary variables), and generates a binary (1/0) output
- Data prep activity: converting data from an unusable form into a usable form
- The nature of the data mining problem will shape the kind of answer needed (binary, category, number, ...)
  - Which will start filtering the available techniques
  - Further filter the techniques based on the form of the data that is available
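The logistic regression example can be sketched as two steps: data prep that turns a categorical value into binary columns, and a prediction that turns numeric input into a 1/0 answer. The weights below are hypothetical and hand-picked, not trained, and the function names are invented for this sketch. The 1/0 output comes from applying a threshold to a probability computed with the logistic function.

```python
import math


def one_hot(value, categories):
    """Data prep: turn one categorical value into binary (1/0) columns."""
    return [1.0 if value == category else 0.0 for category in categories]


def predict_binary(features, weights, bias, threshold=0.5):
    """Logistic function on a weighted sum, then a cut-off to get 1 or 0."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    probability = 1.0 / (1.0 + math.exp(-z))
    return (1 if probability >= threshold else 0), probability


flavors = ["chocolate", "vanilla", "mint"]   # a nominal variable
weights = [-1.0, 2.0, 0.5]                   # hypothetical, not fitted to any data

vanilla_features = one_hot("vanilla", flavors)
vanilla_label, vanilla_probability = predict_binary(vanilla_features, weights, bias=0.0)
chocolate_label, _ = predict_binary(one_hot("chocolate", flavors), weights, bias=0.0)
```

The text column could not be fed to the model directly; after one-hot encoding it can, which is the kind of conversion the data prep bullet describes.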

The 1-Slide Intro to Data Mining class
[Diagram: Training Data, Build Analytic Model, Model Validation, Scoring Data, Scored Data]

Resources

Self-Study Resources
- Textbook: Data Mining for the Masses, by Matthew North
- Method:
  - Download the textbook
  - Follow the step-by-step instructions from the beginning to the end of the book
- By the time you finish the last chapter, you will have:
  - Analyzed 4 data sets using different supervised learning methods
  - Analyzed 2 more data sets using unsupervised learning methods
  - Used rudimentary text analytics on a data set
  - Incorporated model validation into at least 1 model
- Our demo today has made use of 1 of these data sets and is drawn from this book

Additional Resources
- Broadening Access to Advanced Analytics in the Enterprise, by Asoka Diggs & Christy Foulger
- Short video series on Text Analytics, by Neil McGuigan
- Data Science Specialization through Coursera
  - For those looking for an Intermediate level of knowledge of data science
- The Signal and the Noise: Why So Many Predictions Fail but Some Don't, by Nate Silver
- Contact me at asoka.diggs@intel.com

Questions?

Thank You

Sharing Intel IT Best Practices With the World
Learn more about Intel IT's Initiatives at www.intel.com/it

Additional related material

Data Terminology
- Data Model
  - Usually NOT an ERD (rather, a data mining model)
  - I am trying to use the terms "ER Model" and "Analytic Model"
- Observation
  - Row, example, case, tuple, sample, member
- Variable
  - Column, attribute, field, feature, parameter, dimension, metric, measure
- Roles played by variables
  - Label (dependent), attribute (independent), id
- Emphasis in data mining is on denormalized data

Modeling Terminology
Data Set
- Training. Used to train the data (mining) model
  - The response is a known value
- Test. An out-of-sample set (observations not part of the training set) used to validate the results of the model
  - The response is a known value
- Scoring. A set of observations where the response or outcome is unknown. We use the model created by the data mining method and the training set to score, or create values for, the response variable
  - After scoring, the response is predicted, NOT determined

Decision Tree
- Output / response / label: categorical (or continuous)
- Input / observation / attribute: categorical or continuous
- Model: (example below)
  - Very friendly for human interpretation
- http://en.wikipedia.org/wiki/decision_tree_learning
- http://en.wikipedia.org/wiki/c4.5_algorithm
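A tree is built by repeatedly choosing the split that best separates the labels. The sketch below shows one such choice on a single continuous attribute, using information gain as the criterion. Information gain is the measure used by ID3; C4.5 uses a normalized variant (gain ratio), so this is a simplified stand-in and not the algorithm behind the demos. The data values and function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, information_gain) for the best binary split."""
    total = len(labels)
    base = entropy(labels)
    pairs = sorted(zip(values, labels))
    best_threshold, best_gain = None, 0.0
    for i in range(1, total):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for value, label in pairs if value <= threshold]
        right = [label for value, label in pairs if value > threshold]
        gain = (base
                - (len(left) / total) * entropy(left)
                - (len(right) / total) * entropy(right))
        if gain > best_gain:
            best_threshold, best_gain = threshold, gain
    return best_threshold, best_gain


# Hypothetical attribute values with a clean boundary between the two labels.
threshold, gain = best_split([1, 2, 3, 10, 11, 12], ["a", "a", "a", "b", "b", "b"])
```

The resulting rule ("value <= threshold means a, otherwise b") is a one-node tree, which hints at why the full model stays readable for people.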

Model Validation and Testing Citation

Confusion Matrix
- We'll look at the model validation results together
- More reading on Precision and Recall: http://en.wikipedia.org/wiki/precision_and_recall
- In simple terms, high precision means that an algorithm returned substantially more relevant results than irrelevant, while high recall means that an algorithm returned most of the relevant results.
- Precision: quality. Recall: quantity.
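Both measures fall straight out of the four cells of a two-class confusion matrix. The sketch below counts the cells and computes precision and recall; the actual and predicted labels are invented for illustration and are not results from the demos.

```python
def confusion_counts(actual, predicted, positive):
    """Count true/false positives and negatives for one positive class."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall(tp, fp, fn):
    """Precision: share of predicted positives that are right.
    Recall: share of actual positives that were found."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical validation results.
actual = ["yes", "yes", "yes", "no", "no", "no", "no", "yes"]
predicted = ["yes", "yes", "no", "no", "yes", "no", "no", "yes"]

tp, fp, fn, tn = confusion_counts(actual, predicted, positive="yes")
precision, recall = precision_recall(tp, fp, fn)
```

Which measure matters more depends on the business question: a retention campaign with a small budget cares about precision, while one that cannot afford to miss a departing customer cares about recall.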
