Big Data Analytics Clustering and Classification


E6893 Big Data Analytics Lecture 4: Big Data Analytics Clustering and Classification Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Distinguished Researcher and Chief Scientist, Graph Computing September 29th, 2016 1

Review Key Components of Mahout 2

Machine Learning example: using SVM to recognize a Toyota Camry. Non-ML: Rule 1. Symbol has something like a bull's head. Rule 2. Big black portion in front of car. Rule 3. ...???? ML: Support Vector Machine. Feature space: positive SVs, negative SVs. 3 2015 CY Lin, Columbia University

Machine Learning example: using SVM to recognize a Toyota Camry. ML: Support Vector Machine. Feature space: positive SVs, negative SVs; P_Camry > 0.95. 4

Clustering 5

Clustering on feature plane 6

Clustering example 7

Steps on clustering 8

K-means clustering 9
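The slide body did not survive in this transcript, so here is a minimal sketch of the k-means loop (assign each point to its nearest center, then recompute each center as the mean of its members). The function names, the toy points, and the choice to pass the initial centers explicitly are assumptions for illustration. This is plain Python, not Mahout's implementation.

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, initial_centers, iterations=10):
    """Run a fixed number of assign/update rounds (hypothetical helper).

    points          -- list of tuples
    initial_centers -- list of k tuples chosen by the caller
    Returns the final centers and the clusters from the last assignment step.
    """
    centers = list(initial_centers)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: squared_euclidean(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its members.
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its old center
                centers[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, clusters


if __name__ == "__main__":
    toy = [(1, 1), (2, 1), (1, 2), (2, 2), (8, 8), (9, 8), (8, 9), (9, 9)]
    print(kmeans(toy, [(1, 1), (9, 9)]))
```

The result depends on the initial centers, which is why the choice of starting centers gets its own slide.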

Making initial cluster centers 10

Parameters of the Mahout k-means clustering algorithm 11

HelloWorld clustering scenario 12

HelloWorld Clustering scenario - II 13

HelloWorld Clustering scenario - III 14

HelloWorld clustering scenario result 15

Testing different distance measures 16

Manhattan and Cosine distances 17

Tanimoto distance and weighted distance 18
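The formulas on these slides were lost in extraction. As a hedged sketch, the three named distances can be written in plain Python as follows, using their usual textbook definitions. The function names are assumptions and this is not Mahout's code.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def manhattan_distance(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 minus the cosine of the angle between a and b (nonzero vectors)."""
    return 1.0 - dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def tanimoto_distance(a, b):
    """1 minus the Tanimoto (extended Jaccard) similarity (nonzero vectors)."""
    ab = dot(a, b)
    return 1.0 - ab / (dot(a, a) + dot(b, b) - ab)


if __name__ == "__main__":
    u, v = (1.0, 2.0), (4.0, 6.0)
    print(manhattan_distance(u, v), cosine_distance(u, v), tanimoto_distance(u, v))
```

Cosine distance ignores vector length, while Tanimoto distance takes both the angle and the relative lengths into account.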

Results comparison 19

Data preparation in Mahout vectors 20

Vectorization example. Dimensions: 0: weight, 1: color, 2: size 21

Mahout code to create vectors for the apple example 22

Mahout code to create vectors for the apple example II 23

Vectorization of text. Vector Space Model: Term Frequency (TF); Stop Words; Stemming 24

Most Popular Stemming algorithms 25

Term Frequency-Inverse Document Frequency (TF-IDF): The weight of a word is reduced more if it is used frequently across all the documents in the dataset. 26
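The slide's formulas are missing from this transcript. A minimal sketch of the idea, assuming the common textbook weighting tf * log(N / df), is given below. Mahout's actual weighting may differ in its details, and the function name and toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    documents -- list of token lists
    weight    -- term count in the document * log(N / document frequency)
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))
    weights = []
    for tokens in documents:
        tf = Counter(tokens)
        weights.append({term: count * math.log(n_docs / df[term])
                        for term, count in tf.items()})
    return weights


if __name__ == "__main__":
    docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "bird"]]
    for row in tf_idf(docs):
        print(row)
```

A term that occurs in every document (here "the") gets weight 0, which is the reduction described on the slide.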

n-gram: "It was the best of times. It was the worst of times." ==> bigrams. Mahout provides a log-likelihood test to reduce the dimensions of n-grams. 27
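To make the bigram step concrete, here is a small sketch that tokenizes a sentence and emits its n-grams. The helper names and the simple lowercase, letters-only tokenizer are assumptions. The log-likelihood filtering mentioned on the slide is not shown.

```python
import re


def tokenize(text):
    """Lowercase the text and keep runs of letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def ngrams(tokens, n=2):
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


if __name__ == "__main__":
    print(ngrams(tokenize("It was the best of times."), 2))
```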

Examples using a news corpus. Reuters-21578 dataset: 22 files, each with 1000 documents except the last one. http://www.daviddlewis.com/resources/testcollections/reuters21578/ Extraction code: 28

Mahout dictionary-based vectorizer 29

Mahout dictionary-based vectorizer II 30

Mahout dictionary-based vectorizer III 31

Outputs & Steps: 1. Tokenization using Lucene StandardAnalyzer. 2. n-gram generation. 3. Conversion of the tokenized documents into vectors using TF. 4. Counting DF and then creating TF-IDF. 32

A practical setting of flags 33

Normalization: Some documents may appear similar to all the other documents simply because they are large. ==> Normalization can help. 34
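A minimal sketch of what normalization does to a document vector, assuming p-norm scaling (p = 2 gives unit Euclidean length). The function name is hypothetical and this is not Mahout's code.

```python
def normalize(vector, p=2):
    """Divide every component by the vector's p-norm.

    A zero vector is returned unchanged to avoid dividing by zero.
    """
    norm = sum(abs(x) ** p for x in vector) ** (1.0 / p)
    if norm == 0:
        return list(vector)
    return [x / norm for x in vector]


if __name__ == "__main__":
    short_doc = [3, 4]
    long_doc = [30, 40]   # same word proportions, ten times as long
    print(normalize(short_doc), normalize(long_doc))
```

After normalization the short and the long document map to the same direction, so raw document size no longer dominates the distance.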

Clustering methods provided by Mahout 35

K-means clustering 36

Hadoop k-means clustering jobs 37

K-means clustering running as MapReduce job 38

Hadoop k-means clustering code 39

The output 40

Canopy clustering to estimate the number of clusters: You tell the algorithm what size of clusters to look for, and it finds the number of clusters that have approximately that size. The algorithm uses two distance thresholds. This prevents all points close to an already existing canopy from being the center of a new canopy. 41
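The two-threshold idea can be sketched as follows. The names `t1` (loose threshold) and `t2` (tight threshold) follow the common convention with t1 > t2. The function name, the toy points, and the single-machine loop are assumptions, so this is not Mahout's MapReduce implementation.

```python
import math


def canopy_clustering(points, t1, t2, distance=math.dist):
    """Return a list of (center, members) canopies.

    Points closer than t1 to a center join its canopy.
    Points closer than t2 are also removed from the candidate list,
    so they can never become the center of a new canopy.
    """
    if t1 <= t2:
        raise ValueError("t1 must be larger than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)      # next candidate becomes a center
        members = [center]
        still_candidates = []
        for p in remaining:
            d = distance(center, p)
            if d < t1:
                members.append(p)
            if d >= t2:                # far enough to stay a candidate
                still_candidates.append(p)
        remaining = still_candidates
        canopies.append((center, members))
    return canopies


if __name__ == "__main__":
    toy = [(0, 0), (0.5, 0), (1, 0), (10, 0), (10.5, 0)]
    print(len(canopy_clustering(toy, t1=3, t2=2)))
```

The number of canopies found can then serve as an estimate of k, and the canopy centers as starting centers, for a later k-means run.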

Running canopy clustering: created fewer than 50 centroids. 42

News clustering code 43

News clustering example > finding related articles 44

News clustering code II 45

News clustering code III 46

Other clustering algorithms: Hierarchical clustering 47

Different clustering approaches 48

When to use Mahout for classification? 49

The advantage of using Mahout for classification 50

Classification definition 51

How does a classification system work? 52

Key terminology for classification 53

Input and Output of a classification model 54

Four types of values for predictor variables 55

Sample data that illustrates all four value types 56

Supervised vs. Unsupervised Learning 57

Work flow in a typical classification project 58

Classification Example 1: Color-Fill. Position looks promising, especially the x-axis ==> predictor variable. Shape seems to be irrelevant. The target variable is the color-fill label. 59

Target leak: A target leak is a bug that involves unintentionally providing data about the target variable among the predictor variables. Do not confuse this with intentionally including the target variable in the record of a training example. Target leaks can seriously affect the accuracy of the classification system. 60

Classification Example 2 Color-Fill (another feature) 61

Mahout classification algorithms include: Naive Bayes, Complementary Naive Bayes, Stochastic Gradient Descent (SGD), and Random Forest. 62

Comparing two types of Mahout Scalable algorithms 63

Step-by-step simple classification example: 1. The data and the challenge. 2. Training a model to find color-fill: preliminary thinking. 3. Choosing a learning algorithm to train the model. 4. Improving performance of the classifier. 64

Classification Example 3 65

What may be a good predictor? 66

Choose algorithm via Mahout 67

Stochastic Gradient Descent (SGD) 68

Characteristics of SGD 69

Support Vector Machine (SVM): maximizes boundary distances; remembers support vectors; nonlinear kernels. 70

Naive Bayes. Training set: Classifier using Gaussian distribution assumptions: Test set: ==> female 71
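The slide's training and test tables did not survive extraction. The sketch below shows a Gaussian naive Bayes classifier on toy numbers that are an assumption (height in feet, weight in pounds, foot size in inches, in the style of the widely used textbook example). They are not taken from the slide. All function names are hypothetical.

```python
import math
from statistics import mean, variance


def fit(samples):
    """samples: list of (feature_tuple, label). Returns per-class priors
    and a (mean, sample variance) pair for every feature."""
    by_label = {}
    for features, label in samples:
        by_label.setdefault(label, []).append(features)
    model = {}
    for label, rows in by_label.items():
        columns = list(zip(*rows))
        model[label] = {
            "prior": len(rows) / len(samples),
            "stats": [(mean(c), variance(c)) for c in columns],
        }
    return model


def log_gaussian(x, mu, var):
    """Log density of a normal distribution with mean mu and variance var."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)


def predict(model, features):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    def score(label):
        params = model[label]
        return math.log(params["prior"]) + sum(
            log_gaussian(x, mu, var)
            for x, (mu, var) in zip(features, params["stats"]))
    return max(model, key=score)


# Assumed toy data, NOT the slide's table.
TRAINING = [
    ((6.00, 180, 12), "male"), ((5.92, 190, 11), "male"),
    ((5.58, 170, 12), "male"), ((5.92, 165, 10), "male"),
    ((5.00, 100, 6), "female"), ((5.50, 150, 8), "female"),
    ((5.42, 130, 7), "female"), ((5.75, 150, 9), "female"),
]

if __name__ == "__main__":
    print(predict(fit(TRAINING), (6.00, 130, 8)))
```

Working in log space avoids multiplying many small probabilities together.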

Random Forest Random forest uses a modified tree learning algorithm that selects, at each candidate split in the learning process, a random subset of the features. 72

Choosing a learning algorithm to train the model: One low-overhead classification method is the stochastic gradient descent (SGD) algorithm for logistic regression. This algorithm is sequential, but it's fast. 73
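A minimal sketch of SGD for logistic regression, to show why the method is sequential but cheap: the model is updated after every single example. The function names, learning rate, epoch count, and toy data are assumptions. This is plain Python, not Mahout's trainer.

```python
import math


def sigmoid(z):
    """Logistic function mapping a score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def train_sgd(examples, n_features, learning_rate=0.1, epochs=100):
    """examples: list of (feature_tuple, label) with label 0 or 1.

    Each example triggers one small update of the weights and bias,
    which is what makes the algorithm sequential.
    """
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        for x, y in examples:
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            error = y - p                       # gradient of the log-likelihood
            bias += learning_rate * error
            weights = [w + learning_rate * error * xi
                       for w, xi in zip(weights, x)]
    return weights, bias


def predict(weights, bias, x):
    """Return 1 if the predicted probability is at least 0.5, else 0."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if sigmoid(z) >= 0.5 else 0


if __name__ == "__main__":
    toy = [((-2.0,), 0), ((-1.0,), 0), ((-0.5,), 0),
           ((0.5,), 1), ((1.0,), 1), ((2.0,), 1)]
    w, b = train_sgd(toy, n_features=1)
    print(w, b, [predict(w, b, x) for x, _ in toy])
```

Memory use is constant in the number of training examples, since only the weight vector is kept between updates.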

The donut.csv data file in Example 3 74

Build a model using Mahout 75

Trainlogistic program 76

Evaluate the model. AUC (0 ~ 1): 1 = perfect, 0 = perfectly wrong, 0.5 = random. Confusion matrix. 77
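Both evaluation tools can be sketched in a few lines. Here AUC is computed as the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counting half. The function names, the 0.5 threshold, and the toy scores are assumptions.

```python
def auc(labels, scores):
    """Probability that a random positive outscores a random negative."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def confusion_matrix(labels, scores, threshold=0.5):
    """Count true/false positives/negatives at the given score threshold."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for label, score in zip(labels, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1:
            counts["tp" if label == 1 else "fp"] += 1
        else:
            counts["tn" if label == 0 else "fn"] += 1
    return counts


if __name__ == "__main__":
    y = [0, 0, 1, 1]
    s = [0.1, 0.4, 0.35, 0.8]
    print(auc(y, s), confusion_matrix(y, s))
```

AUC does not depend on a threshold, while the confusion matrix shows which kinds of errors occur at one particular threshold.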

Questions? 78