Azure Machine Learning, SQL Data Mining and R




Day-by-day Agenda

Prerequisites
No formal prerequisites. Basic knowledge of SQL Server Data Tools and Excel, and any analytical experience, helps. Best of all: prepare questions that you would like to answer using predictive analytics and machine learning.

Target Audience
Analysts, power users, predictive and BI developers, database and other professionals who wish to embrace machine learning, budding data scientists, consultants.

Format
60% lectures, 20% demos, plus 20% time to help you follow the demos and tasks on your own equipment, if you bring a laptop. You will be challenged to find answers to 4 problems during the course, and you will have a chance to build your own models in SSAS, Azure ML and R. Doing that will help you learn, but it is not a requirement: you are quite welcome to sit back, observe the demos, and ask questions. If you bring your own data, you can analyse it.

You will get a list of free or evaluation-edition software to preinstall before attending. You will need your own Azure account: a free one is OK, but a paid one is better, and it can be inexpensive, or even free during a trial. You can copy course experiments and data into your ML workspace for learning and future reference.

Detailed Agenda

Module 1: Overview of Practical Data Science for Business
We begin the course with a quick, high-level introduction to all of the key concepts, terminology, components, and tools.
Topics covered include:
- Introduction to data science and its components
- Machine learning vs data mining
- Statistics
- Big data
- Data wrangling
- Team, process, and tools

Module 2: Tools & Getting Started
Configuring Azure ML in the cloud is effortless. You need to pay a little more attention to the on-premises R and SQL Server environments, to make sure that you can easily access your modelling data.
Topics covered include:
- Overview of Cortana Analytics Suite
- Getting started with and using Azure ML, SSAS DM, and R
- Structures, models, data flows
- Configuration concerns and pricing
- Azure requirements and dependencies
- Using Rattle with R and RStudio
- Getting a feel for the data: interpreting notched boxplots in R

Module 3: Data
Data science requires you to prepare your data into a rather unique, flat, and completely denormalised format. While inputs are always necessary, and you may need to engineer hundreds of them, we do not need predictive outputs in all cases.
Topics covered include:
- Inputs and outputs, features and labels
- Data formats, discretization vs continuous
- Cases, observations, signatures
- Feature engineering
- Azure ML data preparation and manipulation modules
- Preparing unstructured text for text analysis
- Feature hashing
- Moving data around and its storage
- Briefly: other Cortana Analytics Suite tools for data management and storage

Module 4: Process
The analytical process consists of problem formulation, data preparation, modelling, validation, and deployment, all in an iterative fashion. You will learn about the CRISP-DM industry-standard approach, as well as the application of the scientific method of reasoning to experimentation, when solving real-world business problems.
Topics covered include:
- Stating business questions in data science terms
- CRISP-DM
- Scientific method of reasoning
- Hypothesis testing and experiments
- Iterative hypothesis refinement

Module 5: Algorithm Overview
There are hundreds of machine learning algorithms, yet they belong to just a dozen groups, of which 4-5 are in very common use. We will introduce those algorithm classes, and we will discuss some of the most often used examples in each class, while explaining which technology tools (Azure ML, SQL, or R) provide their most convenient implementation.
Topics covered include:
- What does data mining do?
- Algorithm classes in Azure ML, R, and SSAS
- Supervised vs Unsupervised learning
- Classifiers
- Clustering
- Regression
- Similarity Matching
- Recommenders

Module 6: Segmentation
Segmentation is the main application of unsupervised learning using clustering algorithms. While the algorithm itself is usually quick to run and easy to configure, interpreting the results can take a lot of time and intuition. We will spend plenty of time practicing segmentation, interpreting the results and subsequently parameterising the algorithm to provide us with additional insight, and to help you apply it back to your own data. You will even learn how to apply this technique for anomaly (outlier) detection and text analytics!
Topics covered include:
- Introduction to segmentation
- Clustering algorithms (k-means, EM, and others)
- Interpreting clusters
- Cluster characteristics
- Discrimination
- Tornado charts
- Using clustering for text analysis
- Anomaly detection with clustering, PCA and SVMs

Module 7: Classification
Without doubt, classifiers are the most important and the most often used category of machine learning algorithms, and the foundation of algorithmic data science. We will focus on several variants of the most important classifier algorithm, the decision tree, while progressively interpreting the results and improving its performance. After introducing neural networks and logistic regression, we will also compare the performance of all of these classifiers on our test dataset.
Topics covered include:
- Introduction to classifiers
- Two-class (binary) vs multi-class
- Decision trees, forests, and boosting
- Decision jungles *
- Neural networks and logistic regression
- Overfitting (overtraining) concerns
- Using classifiers for text analysis
- Associative decision trees *

Module 8: Basic Statistics
Basic concepts of statistics, notably means, medians, modes, and variance or standard deviation, are essential to validating data and model quality. Probability and the concept of p-values help you decide which of your inputs (features) are more important than others. R makes all of these powerful ideas accessible and visual, while Azure ML enables you to deploy them easily into production.
Topics covered include:
- Basic concepts of statistics: population vs sample, measure types, means and deviations, distributions, confidence intervals, p-values
- Correlation
- Descriptive statistics with R
- Basic concepts of probability
- Finding important features using p-values, linear regression and ANOVA *

Module 9: Model Validation
The most important aspect of any data science project is the iterative validation and improvement of the models. Without validation, your models cannot be used. There are several tests of model validity, and we will focus on accuracy and reliability, showing you different ways to measure them.
Topics covered include:
- Testing accuracy
- Lift charts
- Testing reliability
- Testing usefulness

Module 10: Classifier Precision
Validation of classifiers is likely to be your main occupation as a data scientist, because classifiers are used so often, and because their precision is not always easy to balance with business requirements, such as restricted resources or required business performance. We will introduce the fundamentals of finding the balance between the acceptable number of false positives and false negatives by using classification (confusion) matrices, and plotting the options using ROC (Receiver Operating Characteristic) charts.
Topics covered include:
- Testing classifiers
- False positives vs. false negatives
- Classification (confusion) matrix
- Precision
- Recall
- Balancing precision with recall vs business goals and constraints
- Charting precision-recall (sensitivity-specificity)
- ROC curves
- Other measures of accuracy *
- Cross-validation
- Optimising binary classifier thresholds for a known business goal of prediction quality
- Refining models to improve accuracy and reliability
- Using parameter sweeps to fine-tune algorithm performance
- Class imbalance problem (fraud analytics and rare event prediction) *

Module 11: Regressions
Considered by some as the numerical equivalent of classifiers, regression is a large subject of its own. We will introduce its simple but very popular form, linear regression, and the more precise, but also prone-to-overfitting, decision tree variant.
Topics covered include:
- Introduction to simple regressions
- Linear regression (classic)
- Regression decision trees and other ensemble regression algorithms
- Relationship to ANOVA *
- Measuring linear regression quality (R-squared, predictor p-values, RMSE, MAE, RAE, RSE, and additional testing using R) *

Module 12: Similarity Matching & Recommenders
From basic concepts of similarity matching, through model-based associative analysis and collaborative filtering, to hybrid systems like the Matchbox algorithm, there are several techniques for building recommenders. You will get a good overview of this subject, as well as an understanding of how to use these techniques for advanced data exploration, such as Market Basket Analysis.
Topics covered include:
- Introduction to recommender concepts
- Model-based, similarity-based, and hybrid recommenders
- Association rules
- Understanding itemsets and rules
- Rule importance vs. rule probability
- Data structures for association rules
- Market Basket Analysis
- Collaborative filtering
- Matchbox recommenders
- Validating recommenders
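As a rough illustration of the Module 12 topics on itemsets, rules, and Market Basket Analysis, the sketch below computes three common measures for a single association rule. The baskets are invented for illustration, and the function name rule_metrics is hypothetical. Confidence plays the role of "rule probability"; lift is used here only as a generic stand-in for an importance-style score, and the exact importance formula used by SSAS or any other tool may differ.

```python
# Hypothetical toy baskets, invented purely for illustration.
BASKETS = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def rule_metrics(baskets, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    a = frozenset(antecedent)
    c = frozenset(consequent)
    n = len(baskets)

    # Count baskets containing the left side, the right side, and both.
    count_a = sum(1 for b in baskets if a <= b)
    count_c = sum(1 for b in baskets if c <= b)
    count_both = sum(1 for b in baskets if (a | c) <= b)

    support = count_both / n                              # P(A and C)
    confidence = count_both / count_a if count_a else 0.0  # P(C | A)
    lift = confidence / (count_c / n) if count_c else 0.0  # P(C | A) / P(C)
    return support, confidence, lift


support, confidence, lift = rule_metrics(BASKETS, {"bread"}, {"milk"})
print(f"support={support:.3f} confidence={confidence:.3f} lift={lift:.3f}")
```

A lift below 1, as in this toy data, suggests that buying bread makes milk slightly less likely than its baseline rate, which is why a rule can have a fairly high probability yet low importance.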

Module 13: Other Algorithms (Brief Overview)
As the course is coming to its end, we will briefly overview some of the remaining and interesting algorithms, without going into much detail, but letting you have an understanding of the existing general approaches.
Topics covered include:
- Sequence clustering and Markov chains
- SVM (Support Vector Machines)
- Time series *
- Image recognition *
- Text analysis

Module 14: Production & Model Maintenance
If you plan on using your models for prediction, rather than just for the exploration of data, you need to deploy your models to production and maintain them on an on-going basis. You will learn about the easiest way to do so using Azure ML web services and its REST synchronous and asynchronous APIs, as well as how to deploy and invoke SSAS models by using DMX queries.
Topics covered include:
- Deploying models to production
- SSAS models and DMX queries
- Azure ML web services: preparation and publishing
- REST APIs: request/response vs batch
- PMML *
- On-going maintenance and model updates
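To give a feel for the request/response style of calling a published model mentioned in Module 14, here is a minimal sketch that only builds an HTTP POST request and does not send it. Everything specific in it is an assumption: the URL and key are placeholders, build_scoring_request is a hypothetical helper, and the JSON layout ("Inputs", "input1", "ColumnNames", "Values", "GlobalParameters") is an assumed payload shape. The API help page of your own published service defines the actual schema.

```python
import json
import urllib.request


def build_scoring_request(url, api_key, column_names, rows):
    """Build (but do not send) a POST request for a request/response scoring call.

    The payload layout below is an assumption for illustration; check the
    published service's own API documentation for the real schema.
    """
    payload = {
        "Inputs": {
            "input1": {
                "ColumnNames": list(column_names),
                # Values are sent as strings in this sketch.
                "Values": [[str(v) for v in row] for row in rows],
            }
        },
        "GlobalParameters": {},
    }
    body = json.dumps(payload).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer " + api_key,
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Placeholder endpoint and key: replace with the values of a real service.
req = build_scoring_request(
    "https://example.invalid/score",
    "HYPOTHETICAL-KEY",
    ["age", "income"],
    [[42, 55000]],
)
print(req.get_method(), req.full_url)
# Sending would look like: urllib.request.urlopen(req).read()
```

A batch (asynchronous) call typically differs in that you submit a job that points at stored input data and then poll for its completion, rather than receiving scores in the same response.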