E6893 Big Data Analytics Lecture 5: Big Data Analytics Algorithms -- II




E6893 Big Data Analytics Lecture 5: Big Data Analytics Algorithms -- II. Ching-Yung Lin, Ph.D., Adjunct Professor, Dept. of Electrical Engineering and Computer Science; Mgr., Dept. of Network Science and Big Data Analytics, IBM Watson Research Center. October 2nd, 2014 1

Course Structure
Class Date           Number  Topics Covered
09/04/14             1       Introduction to Big Data Analytics
09/11/14             2       Big Data Analytics Platforms
09/18/14             3       Big Data Storage and Processing
09/25/14             4       Big Data Analytics Algorithms -- I
10/02/14             5       Big Data Analytics Algorithms -- II (recommendation)
10/09/14             6       Big Data Analytics Algorithms -- III (clustering)
10/16/14             7       Big Data Analytics Algorithms -- IV (classification)
10/23/14             8       Linked Big Data Graph Computing
10/30/14             9       Big Data Visualization
11/06/14             10      Mobile Data Collection, Analysis, and Interface
11/13/14             11      Hardware, Processors, and Cluster Platforms
11/20/14             12      Big Data Next Challenges: IoT, Cognition, and Beyond
11/27/14                     Thanksgiving Holiday
12/04/14             13      Final Projects Discussion (Optional)
12/11/14 & 12/12/14  14-15   Two-Day Big Data Analytics Workshop: Final Project Presentations
2

Review Key Components of Mahout 3

Mahout reference book 4

Setting Up Mahout: Step 1: Java JVM and IDEs (e.g., Eclipse); Step 2: Maven; Step 3: Mahout. Eclipse Luna (June 2014) 5

Recommender Inputs: Input data: User, Item, Rating. Solid lines: positively related. Dashed lines: negatively related. 6

User-based Recommendation Scenario I gettofail.com 7

User-based Recommendation Scenario II 8

User-based Recommendation Scenario III 9

User-based Recommendation Algorithms 10

Example Recommender Code via Mahout 11

Process and output of the example: Recommendation for Person 1: Item 104 > Item 106; Item 107 is not favored. 12

Refresh (Reload) Data 13

Update data 14

User Similarity Measurements: Pearson Correlation Similarity; Euclidean Distance Similarity; Cosine Measure Similarity; Spearman Correlation Similarity; Tanimoto Coefficient Similarity (Jaccard coefficient); Log-Likelihood Similarity. 15

Pearson Correlation Similarity Data: missing data 16

On Pearson Similarity
Three problems with the Pearson similarity:
1. It does not take into account the number of items on which two users' preferences overlap. (E.g., an overlap of only 2 items can give a correlation of 1, while an overlap of more items may give a lower value.)
2. If two users overlap on only one item, no correlation can be computed.
3. The correlation is undefined if either series of preference values is constant (all values identical).
Adding Weighting.WEIGHTED as the second parameter of the constructor can cause the resulting correlation to be pushed towards 1.0 or -1.0, depending on how many points are used. 17
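The three problems listed above can be seen in a small sketch. This is plain Python written for illustration, not Mahout's implementation; the function name `pearson_similarity`, the dict-based input format, and the sample ratings are all assumptions.

```python
from math import sqrt

def pearson_similarity(prefs_a, prefs_b):
    """Pearson correlation over the items both users rated.

    prefs_a, prefs_b: dicts mapping item id -> rating (assumed format).
    Returns None when the correlation is undefined.
    """
    common = sorted(set(prefs_a) & set(prefs_b))
    n = len(common)
    if n < 2:
        return None  # problem 2: a single overlapping item gives no correlation
    xs = [prefs_a[i] for i in common]
    ys = [prefs_b[i] for i in common]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    var_x = sum(d * d for d in dx)
    var_y = sum(d * d for d in dy)
    if var_x == 0 or var_y == 0:
        return None  # problem 3: a constant series has zero variance
    return sum(a * b for a, b in zip(dx, dy)) / sqrt(var_x * var_y)

# Problem 1: two users who share only two items and move in the same
# direction get a perfect score, however weak the evidence is.
user_a = {101: 5.0, 102: 3.0}
user_b = {101: 4.0, 102: 3.5}
print(pearson_similarity(user_a, user_b))
```

Returning `None` for the undefined cases is a design choice of this sketch; a caller has to decide whether to skip such pairs or treat them as dissimilar.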

Euclidean Distance Similarity Similarity = 1 / ( 1 + d ) 18
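The formula Similarity = 1 / (1 + d) can be sketched as follows, assuming d is the Euclidean distance over the items both users rated. The function name and input format are illustrative assumptions, and a library implementation may scale the distance differently.

```python
from math import sqrt

def euclidean_similarity(prefs_a, prefs_b):
    """Map the Euclidean distance d over co-rated items to 1 / (1 + d).

    prefs_a, prefs_b: dicts mapping item id -> rating (assumed format).
    Returns None when the users share no items.
    """
    common = set(prefs_a) & set(prefs_b)
    if not common:
        return None
    d = sqrt(sum((prefs_a[i] - prefs_b[i]) ** 2 for i in common))
    return 1.0 / (1.0 + d)

# Identical ratings give d = 0 and therefore similarity 1.0;
# larger distances push the value towards 0.
print(euclidean_similarity({1: 5.0, 2: 1.0}, {1: 5.0, 2: 1.0}))
```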

Cosine Similarity Cosine similarity and Pearson similarity get the same results if data are normalized (mean == 0). 19
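A quick numerical check of this statement, as a sketch: Pearson is computed from its covariance form, cosine from the dot product, and the two agree once each series has its mean subtracted. The rating values are made up for illustration.

```python
from math import sqrt

def cosine(xs, ys):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(xs, ys))
    return dot / (sqrt(sum(x * x for x in xs)) * sqrt(sum(y * y for y in ys)))

def pearson(xs, ys):
    """Pearson correlation from covariance and variances."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def center(xs):
    """Shift a series so that its mean is 0."""
    m = sum(xs) / len(xs)
    return [x - m for x in xs]

xs = [5.0, 3.0, 2.5, 4.0]  # hypothetical ratings of one user
ys = [2.0, 2.5, 5.0, 3.0]  # hypothetical ratings of another user

print(cosine(xs, ys))                  # raw cosine, positive for positive ratings
print(pearson(xs, ys))                 # correlation, negative for this data
print(cosine(center(xs), center(ys)))  # matches the Pearson value
```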

Spearman Correlation Similarity Example for ties Pearson value on the relative ranks 20
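The idea of taking the Pearson value on relative ranks can be sketched as below. Giving tied values the average of the ranks they span is one common convention and an assumption here; the slide's own tie example is not recoverable from the transcript.

```python
from math import sqrt

def average_ranks(values):
    """1-based ranks; tied values share the average of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Pearson correlation of two equal-length series (None if undefined)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)

def spearman(xs, ys):
    """Spearman correlation: Pearson applied to the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

print(average_ranks([3.0, 1.0, 3.0, 5.0]))  # [2.5, 1.0, 2.5, 4.0]
```

Because only the ordering matters, any monotonically increasing relationship between the two series scores 1, even when it is far from linear.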

Caching User Similarity: Spearman Correlation Similarity is time-consuming to compute, so caching is needed: the cache remembers a user-user similarity that was previously computed. 21
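A minimal sketch of such a cache, assuming the similarity is symmetric and user ids are comparable. The class name `CachingSimilarity` and its interface are hypothetical; Mahout ships its own wrapper, `CachingUserSimilarity`, for the same purpose.

```python
class CachingSimilarity:
    """Wrap a similarity function and remember every pair already computed."""

    def __init__(self, similarity_fn):
        self._fn = similarity_fn
        self._cache = {}

    def __call__(self, user_a, user_b):
        # Assumes similarity(a, b) == similarity(b, a), so one key serves both orders.
        key = (user_a, user_b) if user_a <= user_b else (user_b, user_a)
        if key not in self._cache:
            self._cache[key] = self._fn(*key)
        return self._cache[key]

    def clear(self):
        """Drop all cached values, e.g. after the underlying data is reloaded."""
        self._cache.clear()
```

A cached value goes stale when the rating data changes, which is why the sketch exposes `clear()`.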

Tanimoto (Jaccard) Coefficient Similarity: Discards preference values. Tanimoto similarity is the same as Jaccard similarity, but Tanimoto distance is not the same as Jaccard distance. 22
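Since preference values are discarded, only the sets of items each user has touched matter. A short sketch, with a hypothetical function name and made-up item ids:

```python
def tanimoto_similarity(items_a, items_b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(items_a), set(items_b)
    union = a | b
    if not union:
        return None  # undefined when neither user has any items
    return len(a & b) / len(union)

# Two shared items out of five distinct items overall.
print(tanimoto_similarity({101, 102, 103}, {101, 103, 104, 105}))  # 0.4
```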

Log-Likelihood Similarity: Assesses how unlikely it is that the overlap between the two users is just due to chance. 23
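One way to make "unlikely to be due to chance" concrete is the log-likelihood ratio (G-squared statistic) of the 2x2 table counting items both users have, only one has, and neither has. The sketch below computes it with the entropy form of the statistic; mapping it to a similarity with 1 - 1 / (1 + LLR) is an assumption of this example, as are the function names.

```python
from math import log

def x_log_x(x):
    return 0.0 if x == 0 else x * log(x)

def entropy(*counts):
    """Unnormalized entropy: N * H for the given counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def log_likelihood_ratio(k11, k12, k21, k22):
    """k11: items both users have; k12, k21: items only one has; k22: neither."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))

def log_likelihood_similarity(items_a, items_b, n_items):
    """n_items is the total number of items in the data set (assumed known)."""
    a, b = set(items_a), set(items_b)
    llr = log_likelihood_ratio(len(a & b), len(a - b), len(b - a),
                               n_items - len(a | b))
    return 1.0 - 1.0 / (1.0 + llr)
```

When the overlap is exactly what independence would predict, the ratio is 0 and so is the similarity; a perfectly matching pair of item sets gives a large ratio and a similarity close to 1.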

Performance measurements: Using GroupLens data (http://grouplens.org): the 10-million-rating MovieLens dataset. Spearman: 0.8; Tanimoto: 0.82; Log-Likelihood: 0.73; Euclidean: 0.75; Pearson (weighted): 0.77; Pearson: 0.89. 24

Performance measurements: 10 nearest neighbors: 0.98; 100 nearest neighbors: 0.89; 500 nearest neighbors: 0.75. 95% of the data is used for training; 5% for testing. 25
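A hold-out evaluation of this kind can be sketched as follows. The transcript does not say which error metric produced the scores, so the mean absolute difference between predicted and actual ratings used here is an assumption, as are the function names and the simple global shuffle (an evaluator may instead split each user's ratings separately).

```python
import random

def split_ratings(ratings, training_fraction=0.95, seed=42):
    """Shuffle (user, item, rating) triples and split them into train/test."""
    rng = random.Random(seed)
    shuffled = list(ratings)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * training_fraction)
    return shuffled[:cut], shuffled[cut:]

def mean_absolute_error(predict, test):
    """Average |predicted - actual| over the held-out triples.

    predict: any callable (user, item) -> estimated rating.
    """
    errors = [abs(predict(u, i) - r) for u, i, r in test]
    return sum(errors) / len(errors)
```

With an error metric of this kind, lower scores are better.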

Selecting the number of neighbors: Based on a fixed number of neighbors, or based on a fixed similarity threshold, e.g., 0.7 or 0.5. 26
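Both selection strategies reduce to a few lines once the similarities to the target user are known. The function names and the sample similarity values below are hypothetical; in Mahout the corresponding classes are `NearestNUserNeighborhood` and `ThresholdUserNeighborhood`.

```python
def nearest_n(similarities, n):
    """Keep the n users with the highest similarity to the target user."""
    ranked = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    return [user for user, _ in ranked[:n]]

def above_threshold(similarities, threshold):
    """Keep every user whose similarity reaches the threshold."""
    return [user for user, s in similarities.items() if s >= threshold]

# Similarities of other users to the target user (made-up values).
sims = {2: 0.91, 3: 0.15, 4: 0.72, 5: 0.55}
print(nearest_n(sims, 2))          # [2, 4]
print(above_threshold(sims, 0.5))  # [2, 4, 5]
```

A fixed count always returns neighbors, even poor ones; a threshold guarantees quality but may return none for unusual users.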

Item-based recommendation 27

Item-based recommendation algorithm 28

Code and Performance of Item-Based Recommendation 29

Slope-One Recommender 30

Slope-One Algorithm: Difference values from the example. Slope-One got a result of near 0.65 on the GroupLens data. 31
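The difference values are the average rating gaps between pairs of items, taken over users who rated both. A sketch of the weighted variant follows; the function names and the three-user data set are illustrative assumptions, not the example from the slide.

```python
def average_differences(ratings):
    """ratings: user -> {item: rating}.

    Returns (diffs, counts): diffs[(i, j)] is the average of r_i - r_j over
    users who rated both items, counts[(i, j)] the number of such users.
    """
    sums, counts = {}, {}
    for prefs in ratings.values():
        for i, ri in prefs.items():
            for j, rj in prefs.items():
                if i != j:
                    sums[(i, j)] = sums.get((i, j), 0.0) + (ri - rj)
                    counts[(i, j)] = counts.get((i, j), 0) + 1
    diffs = {pair: sums[pair] / counts[pair] for pair in sums}
    return diffs, counts

def predict(ratings, diffs, counts, user, target):
    """Estimate the user's rating for target from their other ratings."""
    numerator = denominator = 0.0
    for j, rj in ratings[user].items():
        if (target, j) in diffs:
            weight = counts[(target, j)]
            numerator += (rj + diffs[(target, j)]) * weight
            denominator += weight
    return numerator / denominator if denominator else None

ratings = {
    "john": {"A": 5.0, "B": 3.0, "C": 2.0},
    "mark": {"A": 3.0, "B": 4.0},
    "lucy": {"B": 2.0, "C": 5.0},
}
diffs, counts = average_differences(ratings)
print(predict(ratings, diffs, counts, "lucy", "A"))  # about 4.33
```

Weighting each difference by the number of users behind it lets well-supported item pairs dominate the estimate.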

Other recommenders: SVD recommender. Parameters: number of features; number of training steps; lambda (factor for regularization). The SVD method got 0.69 on the GroupLens data. 32
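To show what the three parameters control, here is a generic regularized matrix-factorization sketch trained with stochastic gradient descent. It is not the factorizer used in Mahout; the learning rate, initialization, function names, and sample ratings are all assumptions made for this example.

```python
import random
from math import sqrt

def factorize(ratings, n_features=2, n_steps=300, lam=0.02, lr=0.01, seed=0):
    """ratings: list of (user, item, rating) triples.

    n_features: length of each user and item factor vector.
    n_steps:    number of passes over the ratings.
    lam:        regularization factor that shrinks the factors.
    """
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    P = {u: [rng.uniform(0.0, 0.1) for _ in range(n_features)] for u in users}
    Q = {i: [rng.uniform(0.0, 0.1) for _ in range(n_features)] for i in items}
    for _ in range(n_steps):
        for u, i, r in ratings:
            err = r - sum(p * q for p, q in zip(P[u], Q[i]))
            for f in range(n_features):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (err * qi - lam * pu)
                Q[i][f] += lr * (err * pu - lam * qi)
    return P, Q

def rmse(ratings, P, Q):
    """Root-mean-square error of the factor model on the given ratings."""
    total = 0.0
    for u, i, r in ratings:
        total += (r - sum(p * q for p, q in zip(P[u], Q[i]))) ** 2
    return sqrt(total / len(ratings))
```

More features and more steps fit the training ratings more closely; a larger lambda counteracts that by penalizing large factor values.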

Linear Interpolation Item-based recommender SVD method got 0.76 on the GroupLens data 33

Cluster-based Recommendation 34

Other Recommenders not in Mahout: Groups (SDM 06). A 3rd-party knowledge repository: 30K users and 20K documents. The study covers the most active 697 users, who have at least 20 downloads in a year. Results beyond Collaborative Filtering: (1) Collaborative + Content Filtering (53% improvement); (2) CBDR: Collaborative + Content Filtering + Graph Community Analytics (259% accuracy improvement over collaborative filtering). 35

Other Recommenders not in Mahout: Info Flow (SIGIR 06). IF: Graphical Information Flow Model. TIF: Joint Topic Detection + Information Flow Model. Tests: 1 month, 586 new docs, 1,170 users. Compared with Collaborative Filtering (CF) + Similar People (SP): Precision: IF is 91% better, TIF is 108% better; Recall: IF is 87% better, TIF is 113% better. [Figure: CF + SP, IF, and TIF versus number of recommended users; network information flow across innovators, early adopters, early majority, late majority, and laggards.] 36

Distributed Item-based Recommender 37

Distributed recommender: getting the co-occurrence matrix. Data: 38
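Building the co-occurrence matrix amounts to counting, for every pair of items, how many users have both. A single-machine sketch with hypothetical names and data:

```python
def cooccurrence(user_items):
    """user_items: user -> set of items the user has a preference for.

    Returns counts[(a, b)] = number of users who have both a and b.
    Diagonal entries (a, a) count the users who have item a; they only
    influence the scores of items a user already has.
    """
    counts = {}
    for items in user_items.values():
        for a in items:
            for b in items:
                counts[(a, b)] = counts.get((a, b), 0) + 1
    return counts

user_items = {1: {101, 102}, 2: {101, 102, 103}, 3: {101, 103}}
matrix = cooccurrence(user_items)
print(matrix[(101, 102)])  # 2: users 1 and 2 have both items
```

The matrix is symmetric, so `matrix[(a, b)]` always equals `matrix[(b, a)]`.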

Multiply the co-occurrence matrix by the user's preference vector. Among the items the user does not already have, the highest score is for item 103 (101, 104, 105, 107 have been purchased by user 3). 39
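The multiplication and the final filtering step can be sketched as below. The four-item matrix and the preference values are made up for illustration and are not the data from the slide; `recommend` is a hypothetical helper.

```python
def recommend(cooc, prefs, top_n=1):
    """Score each item as (its co-occurrence row) dot (user preference vector).

    cooc:  item -> {other item: co-occurrence count}
    prefs: item -> preference value for one user
    Items the user already has are skipped.
    """
    scores = {}
    for item, row in cooc.items():
        if item in prefs:
            continue
        scores[item] = sum(count * prefs.get(other, 0.0)
                           for other, count in row.items())
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n]

cooc = {
    101: {101: 3, 102: 2, 103: 2, 104: 1},
    102: {101: 2, 102: 2, 103: 1, 104: 0},
    103: {101: 2, 102: 1, 103: 2, 104: 1},
    104: {101: 1, 102: 0, 103: 1, 104: 1},
}
prefs = {101: 4.0, 104: 2.0}
# 102 scores 2*4.0 + 0*2.0 = 8.0 and 103 scores 2*4.0 + 1*2.0 = 10.0.
print(recommend(cooc, prefs))  # [(103, 10.0)]
```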

Translating to MapReduce: generating user vectors 40
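The first job can be pictured as a map step that parses each input record into a (user, (item, preference)) pair and a reduce step that gathers all pairs of one user into a vector. The sketch below simulates both steps in plain Python; the comma-separated `user,item,preference` line format and the function names are assumptions.

```python
from collections import defaultdict

def map_line(line):
    """Map: one input line -> (user, (item, preference))."""
    user, item, pref = line.strip().split(",")
    return int(user), (int(item), float(pref))

def reduce_user_vectors(pairs):
    """Reduce: collect each user's (item, preference) pairs into one vector."""
    vectors = defaultdict(dict)
    for user, (item, pref) in pairs:
        vectors[user][item] = pref
    return dict(vectors)

lines = ["1,101,5.0", "1,102,3.0", "2,101,2.0"]
print(reduce_user_vectors(map_line(line) for line in lines))
```

In a real MapReduce job the framework groups the map output by key before the reducer runs; the `defaultdict` stands in for that shuffle here.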

Translating to MapReduce: calculating co-occurrence 41

Translating to MapReduce: matrix multiplication 42

Translating to MapReduce: partial products 43

Translating to MapReduce: partial product II 44
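The partial-product formulation computes the matrix-vector product column by column: for each item, its co-occurrence column is scaled by every user's preference for that item, and the scaled columns are then summed per user. A plain-Python simulation under assumed names and made-up data:

```python
from collections import defaultdict

def map_partial_products(cooc_columns, prefs_by_item):
    """Map: for each item, emit (user, preference * co-occurrence column).

    cooc_columns:  item -> {other item: co-occurrence count}
    prefs_by_item: item -> list of (user, preference for that item)
    """
    for item, column in cooc_columns.items():
        for user, pref in prefs_by_item.get(item, []):
            yield user, {other: pref * count for other, count in column.items()}

def reduce_sum(pairs):
    """Reduce: add up all partial product vectors of the same user."""
    totals = defaultdict(lambda: defaultdict(float))
    for user, partial in pairs:
        for item, value in partial.items():
            totals[user][item] += value
    return {user: dict(vec) for user, vec in totals.items()}

cooc_columns = {
    101: {101: 3, 102: 2, 103: 2, 104: 1},
    102: {101: 2, 102: 2, 103: 1, 104: 0},
    103: {101: 2, 102: 1, 103: 2, 104: 1},
    104: {101: 1, 102: 0, 103: 1, 104: 1},
}
prefs_by_item = {101: [(3, 4.0)], 104: [(3, 2.0)]}
print(reduce_sum(map_partial_products(cooc_columns, prefs_by_item)))
```

Working by columns means no step ever needs the whole matrix and a whole user vector at once, and items the user has no preference for contribute nothing, so they are never emitted.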

Running Recommender on MapReduce and HDFS 45

Questions? 46