Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015
Instructor Bio Dejan Sarka (dsarka@solidq.com) 30 years of experience; SQL Server MVP, MCT; 13 books; 7+ courses. Focus: data modeling, data mining, data quality
Agenda Introduction Naïve Bayes Decision Trees Neural Network Logistic Regression Predictive Models Evaluation
Data Mining Algorithms Data mining, as the most advanced data analysis technique, is gaining popularity. With modern data mining engines, products, and packages, like SQL Server Analysis Services (SSAS), Excel, and R, data mining has become a black box: it is possible to use data mining without knowing how it works. But not knowing how the algorithms work can lead to many problems, including using the wrong algorithm for a task, misinterpreting the results, and more. Learn how the most popular data mining algorithms work, when to use which algorithm, and the advantages and drawbacks of each algorithm. Use the algorithms in SSAS, Excel, and R
Assumptions This is not an introduction to SQL Server, Excel, or R, nor to tools like Visual Studio, SQL Server Management Studio, Excel, or RStudio. Basic familiarity with at least one of the tools is assumed. The focus is on the algorithms
What Is Data Mining? Michael J. A. Berry and Gordon S. Linoff: Data mining is the process of exploration and analysis, by automatic or semiautomatic means, of large quantities of data in order to discover patterns and rules Ralph Kimball: Data mining is a collection of powerful analysis techniques for making sense out of very large datasets Bill Inmon: Data Mining / Data Exploration is the usage of historical data to discover and exploit important business relationships
What Is Data Mining? Deduce knowledge by examining data, and then make predictions based on the knowledge extracted. Examining data means scanning samples of known facts about cases using their attributes, which are called variables. Knowledge takes the form of patterns: clusters, decision trees, neural networks, association rules. On-Line Analytical Processing (OLAP) is model driven, whereas data mining is data driven. Alternative names include knowledge discovery in databases (KDD) and predictive analytics
The Two Types of Data Mining Directed (supervised) data mining (top-down approach): classification, estimation, forecasting. Undirected (unsupervised) data mining (bottom-up approach): affinity grouping, clustering, description
Typical Business Questions What's the credit risk of this customer? Are there any groups of my customers? What products do customers tend to buy together? How much of a specific product can I sell in the next time period? What is the potential number of customers shopping in this store? What are the major groups of my web-click customers? Is this a spam email?
Data Mining Tasks Cross-selling (market basket analysis); the order of items in a purchase might also be of some interest. Fraud detection. Churn detection. Customer segmentation. How a website is used. Forecasting
Data Mining Virtuous Cycle (diagram): Identify → Transform → Measure → Act
The CRISP Model (diagram) CRISP = Cross Industry Standard Process for Data Mining (http://en.wikipedia.org/wiki/cross_industry_standard_process_for_data_mining)
Data Mining Data Flow (diagram): LOB apps feed a historical dataset through ETL; the historical dataset feeds mining models and a cube; the mining models support model browsing, reports, and prediction against a new dataset
Different Types of Analyses Structured reports: some interaction, but not dynamic restructuring; can enable ad-hoc reports with a semantic model. Structured groupings in OLAP: predefined grouping buckets; report structure is dynamic. Structured attributes with data mining: predefined attributes; the mining model calculates grouping and structure
SQL Server Tools SQL Server Analysis Services (SSAS) installed in Multidimensional and Data Mining mode SQL Server Integration Services (SSIS) Full-text search and semantic search
Excel Tools Microsoft Office Data Mining Add-ins Excel does not become a data mining engine Needs connection to SSAS in multidimensional mode Excel cell range or Excel table can be the data source Three add-ins: Data Mining Client for Excel Table Analysis Tools for Excel Data Mining Templates for Visio
Introducing R R is a free programming language and software environment for statistical computing and graphics Free under the GNU General Public License Pre-compiled binary versions are provided for various operating systems R uses a command line interface; however, several graphical user interfaces are available for use with R RStudio is a free and open source integrated development environment (IDE) for R
Naïve Bayes Naive Bayes quickly builds mining models that can be used for classification and prediction. This makes it a good option for exploring the data. It calculates probabilities for each possible state of the input attribute, given each state of the predictable attribute. The probabilities can later be used to predict an outcome of the predictable attribute based on the known input attributes. Input attributes are treated as mutually independent
Naïve Bayes Example tree (diagram): 70% of parts are actually OK and 30% are actually faulty. Of the faulty parts, 90% are judged faulty (.30 × .90 = .27 of all parts) and 10% are judged OK (.03); of the OK parts, 20% are judged faulty (.14) and 80% are judged OK (.56). Overall, .41 of parts are judged faulty and .59 are judged OK
Naïve Bayes Reverse tree (diagram): of the parts judged faulty (.41), .67 are actually faulty and .33 are actually OK; of the parts judged OK (.59), .95 are actually OK and .05 are actually faulty. After classification, the posterior probabilities are much more accurate than the prior probabilities: declared = prior, actual = posterior probabilities
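The reverse tree is just Bayes' theorem applied to the forward tree. A minimal sketch in R that reproduces the slide's numbers (the variable names are illustrative):

```r
# Prior (declared) probabilities of the actual state
p_faulty <- 0.30
p_ok     <- 0.70
# Conditional probabilities of being judged faulty
p_jf_given_faulty <- 0.90
p_jf_given_ok     <- 0.20
# Total probability of being judged faulty: .27 + .14 = .41
p_jf <- p_faulty * p_jf_given_faulty + p_ok * p_jf_given_ok
# Posterior: P(actually faulty | judged faulty) = .27 / .41
p_faulty_given_jf <- p_faulty * p_jf_given_faulty / p_jf
# Posterior: P(actually faulty | judged OK) = .03 / .59
p_faulty_given_jok <- p_faulty * (1 - p_jf_given_faulty) / (1 - p_jf)
```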
Example A table of products with Color, Class, and Weight columns. Where Weight is missing, Color is also missing in 80% of cases and Class in 60%; where Weight is present, Color is missing in 20% of cases and Class in 40%. Likelihood that Weight is missing: 0.8 (Color missing given Weight missing) × 0.6 (Class missing given Weight missing) = 0.48. Likelihood that Weight is not missing: 0.2 (Color missing given Weight not missing) × 0.4 (Class missing given Weight not missing) = 0.08. When Color and Class are unknown, the likelihood that Weight is missing is therefore much higher than the likelihood that it is not missing
Example You can convert the likelihoods to probabilities by normalizing their sum to 1: P (Weight missing if Color and Class are missing) = 0.48 / (0.48 + 0.08) = 0.857 P (Weight not missing if Color and Class are missing) = 0.08 / (0.48 + 0.08) = 0.143
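The same arithmetic as a minimal R sketch of the naive (independence-based) computation from the slide:

```r
# Likelihoods under the naive independence assumption
l_missing     <- 0.8 * 0.6   # Weight missing:     0.48
l_not_missing <- 0.2 * 0.4   # Weight not missing: 0.08
# Normalize the likelihoods so they sum to 1
p_missing     <- l_missing / (l_missing + l_not_missing)       # 0.857
p_not_missing <- l_not_missing / (l_missing + l_not_missing)   # 0.143
```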
Naïve Bayes Usage Naive Bayes is used for classification Assign new cases to predefined classes Typical usage scenarios include: Categorizing bank loan applications Assigning customers to predefined segments Quickly obtaining a basic comprehension of the data by checking the correlation between input variables
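In R, a Naive Bayes classifier can be built in a few lines. A minimal sketch, assuming the e1071 package is installed; the built-in iris data stands in for real business data:

```r
library(e1071)

# Build the model: per-class probabilities for each input attribute
model <- naiveBayes(Species ~ ., data = iris)

# Classify a new case and inspect the per-class probabilities
new_case <- iris[1, -5]
predict(model, new_case)                 # predicted class
predict(model, new_case, type = "raw")   # probability of each class
```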
Decision Trees Decision Trees assign (classify) each case to one of a few (discrete) broad categories of a selected attribute (variable) and explain the classification with a few selected input variables. Once built, they are easy to understand. They are used to predict values of the explained variable. Recursive partitioning is used to build the tree: data is split into partitions, and the partitions are then split up further. Initially, all cases are in one big box
Decision Trees The algorithm tries all possible breaks in classes using all possible values of each input attribute; it then selects the split that partitions the data into the purest classes of the target variable. It uses several measures of purity, such as frequency distribution, entropy, and Bayesian scoring of prior/posterior probabilities. It then repeats the splitting process for each new class, again testing all possible breaks. The problem is where to stop
Decision Trees A common problem is over-fitting. Branches of the tree that are not useful can be pre-pruned or post-pruned. Pre-pruning methods try to stunt the growth of the tree before it grows too deep; they test each node to see whether a further split would be useful, and the tests can be simple (number of cases) or complicated (complexity penalty). Post-pruning methods allow the tree to grow and then prune off branches; again, the test can be simple (number of cases) or more complex
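A minimal sketch of both approaches in R with the rpart package (the control parameters and the iris data are illustrative):

```r
library(rpart)

# Pre-pruning: stunt growth with a minimum node size (a simple test)
# and a complexity penalty (a complicated test)
fit <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(minsplit = 20, cp = 0.01))

# Post-pruning: let the tree grow deep, then cut branches back
# using the complexity value with the lowest cross-validated error
big   <- rpart(Species ~ ., data = iris, method = "class",
               control = rpart.control(minsplit = 2, cp = 0.0))
best  <- big$cptable[which.min(big$cptable[, "xerror"]), "CP"]
small <- prune(big, cp = best)
```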
Example An interview of people who watched the famous Woodstock movie. You have a population aged between 20 and 60 years. You gathered data about their education and grouped it into 7 classes (1 = lowest, 7 = highest). 55% of them liked the movie. Can you discover the factors that influence whether they liked the movie?
Example (figures): the population is plotted by age in years (20-60) and education level (1-7). Initially, the whole population is a single group: 55% liked the movie, 45% did not. The first split, on AGE at 35, yields two purer groups: age 35+ (73% liked, 27% did not) and age under 35 (33% liked, 67% did not). A second split, on EDUCATION, refines the groups further: in the 35+ group, education level 2+ gives 87% liked vs. 33% for level below 2; in the under-35 group, education level 5+ gives 67% liked vs. 17% for level below 5
Decision Trees Usage Decision Trees are used for classification and prediction. Typical usage scenarios include: predicting which customers will leave; targeting the audience for mailings and promotional campaigns; explaining the reasons for a decision; answering questions such as "What movies do young female customers buy?"
Neural Network A neural network is a data modeling tool that can capture and represent complex input/output relationships. Neural networks resemble the human brain in the following two ways: they acquire knowledge through learning, and their knowledge is stored within inter-neuron connection strengths known as synaptic weights. The Neural Network algorithm explores more possible data relationships than the other algorithms
Neural Network (diagram): input units feed a hidden layer, which feeds the output. Each unit computes a weighted sum of its inputs and passes it through a non-linear function: a hyperbolic tangent function in the hidden layer and a sigmoid function in the output layer
Backpropagation Training a neural network is the process of setting the best weights on the inputs of each of the units. The backpropagation process: gets a training example and calculates the outputs; calculates the error, the difference between the calculated and the expected (known) result; adjusts the weights to minimize the error
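A minimal sketch of one backpropagation step in R for a tiny network matching the diagram (tanh hidden layer, sigmoid output); the layer sizes, the training example, and the squared-error loss are illustrative assumptions:

```r
set.seed(42)
x <- c(0.5, -1.2)                     # one training example, two inputs
y <- 1                                # expected (known) result
W1 <- matrix(rnorm(2 * 3), nrow = 3)  # hidden weights: 3 units x 2 inputs
W2 <- rnorm(3)                        # output weights
lr <- 0.1                             # learning rate

sigmoid <- function(z) 1 / (1 + exp(-z))

# Forward pass: weighted sums through the non-linear functions
h <- tanh(W1 %*% x)                   # hidden activations (tanh units)
o <- sigmoid(sum(W2 * h))             # output prediction (sigmoid unit)

# Backward pass: error and gradients
err   <- o - y                        # difference from the expected result
d_out <- err * o * (1 - o)            # through the sigmoid derivative
d_hid <- (W2 * d_out) * (1 - h^2)     # through the tanh derivative

# Adjust the weights to reduce the error
W2 <- W2 - lr * d_out * as.vector(h)
W1 <- W1 - lr * d_hid %*% t(x)
```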
Logistic Regression The sigmoid function is also called the logistic function. If a neural network has only input neurons that are directly connected to the output neurons (no hidden layer), it is logistic regression. f(x) = tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x)); g(x) = σ(x) = 1 / (1 + e^(−x))
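In R, logistic regression is built in via glm(). A minimal sketch on an illustrative data set (the column names and values are made up):

```r
# Hypothetical loan data: predict risk from age
loans <- data.frame(
  age  = c(23, 29, 35, 38, 41, 47, 52, 60),
  risk = factor(c("low", "low", "high", "low", "high", "high", "low", "high"),
                levels = c("low", "high"))   # "high" is the modeled outcome
)
fit <- glm(risk ~ age, data = loans, family = binomial)

# Predicted probability of high risk for a 45-year-old applicant
predict(fit, newdata = data.frame(age = 45), type = "response")
```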
Logistic Regression and Neural Network Usage Like the Decision Trees algorithm, you can use the Neural Network and Logistic Regression algorithms for classification and prediction, e.g., risk analysis. Interpretation is more complex, especially for neural networks with many hidden layers; the Decision Trees algorithm is therefore more popular
Evaluating Predictive Models Lift chart Profit chart Classification matrix Cross validation
Training and Test Sets For predictive models, you need to split the data into training and test sets in order to evaluate the models A training set is required to build the model (70% of the data) A test set is used for predictions (30% of the data) When you know the value of the predicted variable, you can measure the quality of the predictions As with every sampling, it is important to randomly select the data for each set
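A minimal sketch of the random 70/30 split in R (iris again stands in for real data):

```r
set.seed(123)                                  # reproducible random sampling
idx   <- sample(nrow(iris), size = 0.7 * nrow(iris))
train <- iris[idx, ]                           # 70%: build the model
test  <- iris[-idx, ]                          # 30%: test the predictions
```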
Lift Chart With no target value, the lift chart shows the overall performance of the models; with a target value, it shows the percentage of the target audience reached within a specified percentage of the complete audience
Profit Chart Y = profit; X = percentage of the population contacted. Settings: population, fixed cost, individual cost, revenue per individual
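A minimal R sketch of both charts; the scores, outcomes, and cost/revenue settings are illustrative:

```r
set.seed(1)
score  <- runif(1000)                    # model's predicted probabilities
actual <- rbinom(1000, 1, score)         # known outcomes (1 = target)

ord  <- order(score, decreasing = TRUE)  # contact the best prospects first
hits <- cumsum(actual[ord])              # targets reached so far
pop  <- seq_along(hits) / length(hits)   # % of population contacted
gain <- hits / sum(actual)               # % of target audience reached

# Lift chart: model vs. the random-guess diagonal
plot(pop, gain, type = "l", xlab = "% contacted", ylab = "% of target")
abline(0, 1, lty = 2)

# Profit chart: revenue per hit minus fixed and per-individual costs
fixed_cost <- 500; ind_cost <- 2; revenue <- 15
profit <- hits * revenue - fixed_cost - seq_along(hits) * ind_cost
plot(pop, profit, type = "l", xlab = "% contacted", ylab = "profit")
```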
Classification Matrix The matrix has actual values in rows (negative, positive) and predicted values in columns (negative, positive): A = actual negative predicted negative, B = actual negative predicted positive, C = actual positive predicted negative, D = actual positive predicted positive.
The accuracy (AC) is the proportion of the total number of predictions that were correct: AC = (A + D) / (A + B + C + D).
The recall or true positive rate (TP) is the proportion of positive cases that were correctly identified: TP = D / (C + D).
The false positive rate (FP) is the proportion of negative cases that were incorrectly classified as positive: FP = B / (A + B).
The true negative rate (TN) is the proportion of negative cases that were classified correctly: TN = A / (A + B).
The false negative rate (FN) is the proportion of positive cases that were incorrectly classified as negative: FN = C / (C + D).
The precision (P) is the proportion of the predicted positive cases that were correct: P = D / (B + D)
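The same measures computed in R from illustrative counts, using the slide's A-D labels:

```r
A <- 50; B <- 10   # actual negative: predicted negative / positive
C <- 5;  D <- 35   # actual positive: predicted negative / positive

accuracy  <- (A + D) / (A + B + C + D)   # correct predictions overall
recall    <- D / (C + D)                 # true positive rate
fp_rate   <- B / (A + B)                 # false positive rate
tn_rate   <- A / (A + B)                 # true negative rate
fn_rate   <- C / (C + D)                 # false negative rate
precision <- D / (B + D)                 # correct among predicted positives
```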
Cross Validation Cross validation shows the robustness of the models. It splits the training set into folds, uses one fold for testing and the others for training, and repeats this for each fold. You can see how the models perform over different subsets of the data
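A minimal k-fold cross-validation sketch in R with the rpart package; iris and k = 5 are illustrative:

```r
library(rpart)
set.seed(7)
k     <- 5
folds <- sample(rep(1:k, length.out = nrow(iris)))     # assign cases to folds

acc <- sapply(1:k, function(i) {
  fit  <- rpart(Species ~ ., data = iris[folds != i, ], method = "class")
  pred <- predict(fit, iris[folds == i, ], type = "class")
  mean(pred == iris$Species[folds == i])               # accuracy on held-out fold
})
acc        # performance over the different subsets
mean(acc)  # overall estimate of robustness
```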
Questions? Thank you! Join the conversation on Twitter: @DevWeek #DW2015