Microsoft Azure Machine learning Algorithms

Microsoft Azure Machine learning Algorithms Tomaž KAŠTRUN @tomaz_tsql Tomaz.kastrun@gmail.com http://tomaztsql.wordpress.com

Speaker info https://tomaztsql.wordpress.com

Agenda
Focus: explaining the algorithms available for predictive analytics in the Azure Machine Learning service.
Algorithms:
1) Regression algorithms
2) Two-class classification
3) Multi-class classification
4) Clustering
- Explore each algorithm
- Which algorithm is used, and useful, for what kind of empirical problem
- Which algorithm is suitable for a particular dataset

Before we start... Why this session?

First things first

First things something: data by type

Categorical
- Nominal: gender (male / female), eye colour (blue, green, black, ...), marital status (S, M, D, V, ...), buyer / non-buyer (0 / 1). No inherent order; all values are equal in representation.
- Ordinal: education (primary, secondary, high school, ...), salary band (<101, 101-200, 201-300, ...). Ordering can be applied; values can be compared with =, >, < and recoded as numbers or classes.

Numerical
- Interval and ratio: height (173 cm, 196 cm), invoice value (24.4, 42.6, ...). Ordering can be applied; values can be compared with =, >, < and expressed as multiples (2 is twice as big as 1).
- Temperature: Celsius vs. Fahrenheit.
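A minimal sketch of how these measurement levels might be handled in code. The level lists and helper names (encode_ordinal, encode_nominal, celsius_to_fahrenheit) are hypothetical and only illustrate the slide; they are not part of Azure ML. The temperature lines show why Celsius vs. Fahrenheit is the usual interval-scale example: the ratio of two temperatures changes with the unit, whereas the ratio of two heights does not.

```python
# Hypothetical level lists, taken from the examples on the slide.
EDUCATION_ORDER = ["primary", "secondary", "high school"]  # ordinal: order matters
EYE_COLOURS = ["blue", "green", "black"]                   # nominal: no order


def encode_ordinal(value, ordered_levels):
    """Recode an ordinal value as its rank (0, 1, 2, ...), preserving order."""
    return ordered_levels.index(value)


def encode_nominal(value, levels):
    """One-hot encode a nominal value so that no order is implied."""
    return [1 if value == level else 0 for level in levels]


def celsius_to_fahrenheit(celsius):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


print(encode_ordinal("secondary", EDUCATION_ORDER))  # rank of "secondary"
print(encode_nominal("green", EYE_COLOURS))          # one-hot vector

# Ratio scale (true zero): "twice as tall" holds in any unit of length.
print(196 / 98)

# Interval scale (arbitrary zero): the ratio depends on the unit.
print(20 / 10)
print(celsius_to_fahrenheit(20) / celsius_to_fahrenheit(10))
```

The rank encoding keeps comparisons such as < and > meaningful, while the one-hot encoding deliberately gives every category the same weight.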

For warmup...regression

First things second

Sample Data

1 General (inference) statistics
- Apply Math Operation -> Applies a mathematical operation to column values
- Compute Elementary Statistics -> Calculates specified summary statistics for selected dataset columns
- Compute Linear Correlation -> Calculates the linear correlation between column values in a dataset
- Descriptive Statistics -> Generates a basic descriptive statistics report for the columns in a dataset
- Evaluate Probability Function -> Fits a specified probability distribution function to a dataset
- Replace Discrete Values -> Replaces discrete values from one column with numeric values based on another column
- Test Hypothesis Using t-test -> Compares means from two datasets using a t-test
Source: https://msdn.microsoft.com/en-us/library/azure/dn905867.aspx
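To make two of these modules more concrete, here is a rough sketch of what a linear correlation and a two-sample t statistic compute. It uses only the Python standard library. The function names and sample values are hypothetical, and the t statistic shown is Welch's variant, which is an assumption: the slide does not say which variant the Azure ML module uses.

```python
import math
from statistics import mean, variance


def pearson_r(xs, ys):
    """Pearson linear correlation coefficient between two equal-length columns."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def welch_t(a, b):
    """Welch's t statistic for the difference between two sample means."""
    va, vb = variance(a), variance(b)  # sample variances (n - 1 denominator)
    return (mean(a) - mean(b)) / math.sqrt(va / len(a) + vb / len(b))


# Hypothetical sample columns, for illustration only.
heights = [160, 170, 180, 190]
weights = [55, 65, 72, 85]
group_a = [24.4, 42.6, 31.0, 28.5]
group_b = [35.1, 44.9, 39.2, 41.7]

print(pearson_r(heights, weights))  # close to 1 for a strong linear relationship
print(welch_t(group_a, group_b))    # negative when group_a has the smaller mean
```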

1 Splitting data
Splitting mode:
- Split rows
- Recommender split
- Regular / relative expression
Other options: random seed, stratified split.
How to split depends on the size of the dataset:
- If it is large enough, a 66% split is a good choice (66% for training, the rest for testing).
- If it is moderately sized, 10-fold cross-validation (or leave-one-out) can be a good choice.
- If it is small, bootstrapping is a good choice.
Related resampling techniques: jackknife, cross-validation, n-fold cross-validation, ...
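The three strategies above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not how the Azure ML Split module is implemented: the function names, the default seed of 42 and the use of Python's random module are all choices made for this example.

```python
import random


def split_rows(rows, train_fraction=0.66, seed=42):
    """Shuffle the rows and split them into (train, test) by the given fraction."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def k_fold_indices(n_rows, k=10, seed=42):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def bootstrap_sample(rows, seed=42):
    """Draw len(rows) rows with replacement (one bootstrap resample)."""
    rng = random.Random(seed)
    rows = list(rows)
    return [rng.choice(rows) for _ in rows]


data = list(range(100))  # stand-in for 100 dataset rows
train, test = split_rows(data)
print(len(train), len(test))

for train_idx, test_idx in k_fold_indices(len(data), k=10):
    pass  # train on train_idx, evaluate on test_idx

print(len(bootstrap_sample(data)))
```

Fixing the seed makes the split reproducible, which is the same purpose the random seed option serves in the module. Setting k equal to the number of rows turns k_fold_indices into leave-one-out.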

1 Initializing Model

1 Initializing Model: making the list of algorithms more transparent

Algorithms in Theory: Regression (linear and logistic)
Azure ML:
- Linear regression
- Two-class classification: logistic regression
- Multiclass classification: logistic regression

Algorithms in Theory: Decision Tree, Decision Forests, Decision Jungles
Decision tree, Azure ML:
- Regression: boosted decision tree
- Two-class classification: boosted decision tree
Decision forest, Azure ML:
- Regression: decision forest
- Two-class classification: decision forest
- Multi-class classification: decision forest
Decision jungle, Azure ML:
- Multi-class classification: decision jungle
- Two-class classification: decision jungle

Algorithms in Theory: Naive Bayes
Azure ML:
- Regression: Bayesian linear
- Two-class classification: Bayes point machine

Algorithms in Theory: Neural networks and perceptrons
Azure ML:
- Regression: neural network
- Two-class classification: neural network
- Multi-class classification: neural network
- Two-class classification: averaged perceptron

Algorithms in Theory: SVM
Azure ML:
- Two-class classification: SVM
- Two-class classification: locally deep SVM
- Anomaly detection: SVM

1 Regression Algorithms

1 Regression Algorithms Parameters

Evaluating Regression Algorithms
- The metrics measure how close predictions are to the eventual outcomes, that is, the differences between predicted and actual values.
- R^2 summarizes how well the regression model fits the data; R^2 = 1 means the model fits perfectly.
Source: http://mund-consulting.com/blog/understanding-evaluate-model-in-microsoft-azure-machine-learning/
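As a hedged illustration of such metrics, the sketch below computes mean absolute error, root mean squared error and R^2 from lists of actual and predicted values. The function name and the salary figures are hypothetical, and the slide does not list the exact set of metrics, so treat the selection as an assumption.

```python
import math


def regression_metrics(actual, predicted):
    """Return MAE, RMSE and R^2 for paired actual and predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)                    # unexplained variation
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)   # total variation
    r2 = 1 - ss_res / ss_tot
    return {"mae": mae, "rmse": rmse, "r2": r2}


# Hypothetical salaries (in thousands), for illustration only.
actual_salary = [30, 42, 55, 61, 78]
predicted_salary = [32, 40, 57, 60, 75]

print(regression_metrics(actual_salary, predicted_salary))
print(regression_metrics(actual_salary, actual_salary)["r2"])  # perfect fit
```

MAE and RMSE are in the same unit as the predicted variable, while R^2 is unitless, so R^2 is the easier one to compare across different targets.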

Evaluating Regression Algorithms Predicting Salary http://mund-consulting.com/blog/understanding-evaluate-model-in-microsoft-azure-machine-learning/

Comparison of Regression Algorithms

Regression algorithm | Accuracy | Training time | Linearity | Customization | Predicted variable | Type of independent variable(s) | Data quantity
linear | Good | Fast | Excellent | Good | Interval | Any | small to big
Bayesian linear | Good | Fast | Excellent | Moderate | Interval | Any | big
decision forest | Excellent | Moderate | Good | Good | Interval | Any |
boosted decision tree | Excellent | Fast | Good | Good | Interval | Any | big
fast forest quantile | Excellent | Moderate | Moderate | Excellent | Distribution (Interval) | Any |
neural network | Excellent | Slow | Moderate | Excellent | Interval | Any | smaller
Poisson | Good | Moderate | Excellent* (log linear) | Good | Interval (counts) | Any | small to big
ordinal | Good | Moderate | Excellent | None | Ordinal (order) | Any | small to big

Scale: Excellent, Good, Moderate; Fast, Moderate, Slow

2 Two-class Classification

2 Two-class Classification parameters

Regularization weight

2 Evaluating two-class Classification
AUC/ROC bands:
- <= 0.5
- 0.5 - 0.6
- 0.6 - 0.7
- 0.7 - 0.8
- 0.8 - 0.9
- 0.9 - 1: WTF?
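For reference, AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, so 0.5 corresponds to random ranking and 1.0 to perfect separation. The sketch below computes it directly from that definition. The function name and sample scores are hypothetical, and the pairwise loop is chosen for clarity, not speed.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Hypothetical labels (1 = buyer, 0 = non-buyer) and model scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))
```

A value in the top band on held-out data is worth a second look, for example for a feature that leaks the label, before it is trusted.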

Comparison of Two-class Classification Algorithms

Two-class classification | Accuracy | Training time | Linearity | Customization | Predicted variable | Type of independent variable(s) | Data quantity
logistic regression | Good | Fast | Excellent | Good | dichotomous / binary | Any | small-big
decision forest | Excellent | Moderate | Good | Good | dichotomous / binary | Any | small-big
decision jungle | Excellent | Moderate | Good | Good | dichotomous / binary | Any | big
boosted decision tree | Excellent | Moderate | Good | Good | dichotomous / binary | Any | big
neural network | Excellent | Slow | Moderate | Excellent | dichotomous / binary | Any |
averaged perceptron | Good | Moderate | Excellent | Moderate | dichotomous / binary | Any |
support vector machine | Excellent | Moderate | Excellent | Good | dichotomous / binary | Any | big
locally deep support vector machine | Good | Slow | Good | Excellent | dichotomous / binary | Any | big
Bayes point machine | Moderate | Moderate | Excellent | Moderate | dichotomous / binary | Any |

Scale: Excellent, Good, Moderate; Fast, Moderate, Slow

3 Multi-class Classification

3 Multi-class Classification parameters

3 Evaluating multi-class Classification

3 Comparison of Multi-class Classification

Multi-class classification | Accuracy | Training time | Linearity | Customization | Predicted variable | Type of independent variable(s) | Data quantity
logistic regression | Good | Fast | Excellent | Good | Nominal / ordinal (with 2+ classes) | any | small-big
decision forest | Excellent | Moderate | Good | Good | Nominal / ordinal (with 2+ classes) | any | big
decision jungle | Excellent | Moderate | Good | Good | Nominal / ordinal (with 2+ classes) | any | big
neural network | Excellent | Slow | Moderate | Excellent | Nominal / ordinal (with 2+ classes) | any | small

Scale: Excellent, Good, Moderate; Fast, Moderate, Slow

4 Using Sweeping and SMOTE

5 Clustering

5 Evaluating Clustering

Key takeaways
- https://azure.microsoft.com/en-us/documentation/articles/machine-learning-algorithm-cheat-sheet/
- https://msdn.microsoft.com/en-us/library/azure/dn906033.aspx

How did you like it? Please give feedback:
- to the event: http://www.sqlsaturday.com/494/eventeval.aspx
- to me as a speaker: http://www.sqlsaturday.com/494/sessions/sessionevaluation.aspx

THANKS!