Big Data Analytics. Lucas Rego Drumond



Similar documents
Big Data Analytics. Lucas Rego Drumond

Collaborative Filtering. Radek Pelánek

Big Data Analytics. Lucas Rego Drumond

Bayesian Factorization Machines

Parallel & Distributed Optimization. Based on Mark Schmidt s slides

The Need for Training in Big Data: Experiences and Case Studies

Factorization Machines

Big Data Analytics Verizon Lab, Palo Alto

Factorization Machines

Hybrid model rating prediction with Linked Open Data for Recommender Systems

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

Big Data Techniques Applied to Very Short-term Wind Power Forecasting

Big Data Analytics. Lucas Rego Drumond

Analyze It use cases in telecom & healthcare

IPTV Recommender Systems. Paolo Cremonesi

Collaborative Filtering Scalable Data Analysis Algorithms Claudia Lehmann, Andrina Mascher

Addressing Cold Start in Recommender Systems: A Semi-supervised Co-training Algorithm

Fundamental Analysis Challenge

Rating Prediction with Informative Ensemble of Multi-Resolution Dynamic Models

Introduction to Online Learning Theory

Big Data Analytics CSCI 4030

Big Data Analytics: Optimization and Randomization

New Ensemble Combination Scheme

Distributed Machine Learning and Big Data

Big Data Analytics. Prof. Dr. Lars Schmidt-Thieme

Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh

! E6893 Big Data Analytics Lecture 5:! Big Data Analytics Algorithms -- II

STATISTICA Formula Guide: Logistic Regression. Table of Contents

Linear Threshold Units

Lecture 13: Validation

Chapter 13 Introduction to Nonlinear Regression( 非 線 性 迴 歸 )

Weekly Sales Forecasting

Local classification and local likelihoods

Scalable Machine Learning - or what to do with all that Big Data infrastructure

4F7 Adaptive Filters (and Spectrum Estimation) Least Mean Square (LMS) Algorithm Sumeetpal Singh Engineering Department sss40@eng.cam.ac.

Bag of Pursuits and Neural Gas for Improved Sparse Coding

L3: Statistical Modeling with Hadoop

Social Media Mining. Data Mining Essentials

Advanced Ensemble Strategies for Polynomial Models

BIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics

The Operational Value of Social Media Information. Social Media and Customer Interaction

BIG DATA PROBLEMS AND LARGE-SCALE OPTIMIZATION: A DISTRIBUTED ALGORITHM FOR MATRIX FACTORIZATION

MONTE-CARLO SIMULATION OF AMERICAN OPTIONS WITH GPUS. Julien Demouth, NVIDIA

CSCI567 Machine Learning (Fall 2014)

Cross Validation. Dr. Thomas Jensen Expedia.com

CCNY. BME I5100: Biomedical Signal Processing. Linear Discrimination. Lucas C. Parra Biomedical Engineering Department City College of New York

Journée Thématique Big Data 13/03/2015

Data Mining Practical Machine Learning Tools and Techniques

CS 688 Pattern Recognition Lecture 4. Linear Models for Classification

Computer programming course in the Department of Physics, University of Calcutta

Logistic Regression for Spam Filtering

Ensemble Learning Better Predictions Through Diversity. Todd Holloway ETech 2008

Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University

Mammoth Scale Machine Learning!

Big Data Analytics. Lucas Rego Drumond

Making Sense of the Mayhem: Machine Learning and March Madness

CS Data Science and Visualization Spring 2016

Machine Learning using MapReduce

Linear smoother. ŷ = S y. where s ij = s ij (x) e.g. s ij = diag(l i (x)) To go the other way, you need to diagonalize S

Big Data Analytics. Lucas Rego Drumond

Lecture 8 February 4

Scalable Collaborative Filtering with Jointly Derived Neighborhood Interpolation Weights

Simple and efficient online algorithms for real world applications

Cross-validation for detecting and preventing overfitting

Big Data at Spotify. Anders Arpteg, Ph D Analytics Machine Learning, Spotify

ANALYSIS, THEORY AND DESIGN OF LOGISTIC REGRESSION CLASSIFIERS USED FOR VERY LARGE SCALE DATA MINING

How I won the Chess Ratings: Elo vs the rest of the world Competition

Statistical Machine Learning

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

Studying Auto Insurance Data

Introduction to Logistic Regression

Programming Exercise 3: Multi-class Classification and Neural Networks

Big Data & Scripting Part II Streaming Algorithms

Predicting borrowers chance of defaulting on credit loans

MLlib: Scalable Machine Learning on Spark

Utility of Distrust in Online Recommender Systems

Machine Learning. Term 2012/2013 LSI - FIB. Javier Béjar cbea (LSI - FIB) Machine Learning Term 2012/ / 34

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca

Server Load Prediction

Big Data Analytics Using Neural networks

Machine Learning Big Data using Map Reduce

Machine Learning over Big Data

Recommender Systems: Content-based, Knowledge-based, Hybrid. Radek Pelánek

Machine Learning Capacity and Performance Analysis and R

Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j

Probabilistic Matrix Factorization

BUSINESS ANALYTICS. Data Pre-processing. Lecture 3. Information Systems and Machine Learning Lab. University of Hildesheim.

Advances in Collaborative Filtering

Predict Influencers in the Social Network

Asymmetric LSH (ALSH) for Sublinear Time Maximum Inner Product Search (MIPS)

BUSINESS ANALYTICS. Overview. Lecture 0. Information Systems and Machine Learning Lab. University of Hildesheim. Germany

Statistical machine learning, high dimension and big data

Chapter 6. The stacking ensemble approach


Big Data Analytics
Lucas Rego Drumond
Information Systems and Machine Learning Lab (ISMLL)
Institute of Computer Science, University of Hildesheim, Germany

Going For Large Scale
Application Scenario: Recommender Systems

Outline
1. ADMM Continued
2. Recommender Systems
3. Traditional Recommendation Approaches
   3.1. Nearest Neighbor Approaches
   3.2. Factorization Models

1. ADMM Continued

Applying ADMM to predictive problems

\min_{\beta^{(1)},\dots,\beta^{(N)},\alpha} \sum_{i=1}^{N} \sum_{(x,y) \in D_i^{train}} \ell(y, \hat{y}(x; \beta^{(i)})) + R(\beta^{(i)})
subject to \beta^{(i)} - \alpha = 0

The ADMM algorithm iteratively performs the following steps:

\beta^{(i)t+1} \leftarrow \arg\min_\beta \sum_{(x,y) \in D_i^{train}} \ell(y, \hat{y}(x; \beta)) + R(\beta) + \nu^{(i)t\top}\beta + \frac{s}{2}\|\beta - \alpha^t\|_2^2

\alpha^{t+1} \leftarrow \arg\min_\alpha \sum_{i=1}^{N} \nu^{(i)t\top}(\beta^{(i)t+1} - \alpha) + \frac{s}{2}\|\beta^{(i)t+1} - \alpha\|_2^2

\nu^{(i)t+1} \leftarrow \nu^{(i)t} + s(\beta^{(i)t+1} - \alpha^{t+1})

Solving Linear Regression with ADMM

Loss function:

\|y - X\beta\|_2^2 + \lambda\|\beta\|_2^2 = \sum_{i=1}^{N} \|y^{(i)} - X^{(i)}\beta^{(i)}\|_2^2 + \lambda\|\beta^{(i)}\|_2^2

The ADMM algorithm iteratively performs the following steps:

\beta^{(i)t+1} \leftarrow \arg\min_\beta \|y^{(i)} - X^{(i)}\beta\|_2^2 + \lambda\|\beta\|_2^2 + \nu^{(i)t\top}\beta + \frac{s}{2}\|\beta - \alpha^t\|_2^2

\alpha^{t+1} \leftarrow \arg\min_\alpha \sum_{i=1}^{N} \nu^{(i)t\top}(\beta^{(i)t+1} - \alpha) + \frac{s}{2}\|\beta^{(i)t+1} - \alpha\|_2^2

\nu^{(i)t+1} \leftarrow \nu^{(i)t} + s(\beta^{(i)t+1} - \alpha^{t+1})

Solving Linear Regression with ADMM: 1st Step

\beta^{t+1} \leftarrow \arg\min_\beta \|y - X\beta\|_2^2 + \lambda\|\beta\|_2^2 + \nu^{t\top}\beta + \frac{s}{2}\|\beta - \alpha^t\|_2^2

Setting the gradient with respect to \beta to zero:

\nabla_\beta \left( \|y - X\beta\|_2^2 + \lambda\|\beta\|_2^2 + \nu^{t\top}\beta + \frac{s}{2}\|\beta - \alpha^t\|_2^2 \right) = 0
-2X^\top(y - X\hat\beta) + 2\lambda\hat\beta + \nu^t + s(\hat\beta - \alpha^t) = 0
2X^\top X\hat\beta - 2X^\top y + 2\lambda\hat\beta + \nu^t + s\hat\beta - s\alpha^t = 0
2X^\top X\hat\beta + 2\lambda\hat\beta + s\hat\beta = 2X^\top y - \nu^t + s\alpha^t
(2X^\top X + (2\lambda + s)I)\hat\beta = 2X^\top y - \nu^t + s\alpha^t
\hat\beta = (2X^\top X + (2\lambda + s)I)^{-1}(2X^\top y - \nu^t + s\alpha^t)

Solving Linear Regression with ADMM: 2nd Step

\alpha^{t+1} \leftarrow \arg\min_\alpha \sum_{i=1}^{N} \nu^{(i)t\top}(\beta^{(i)t+1} - \alpha) + \frac{s}{2}\|\beta^{(i)t+1} - \alpha\|_2^2

\nabla_\alpha \left( \sum_{i=1}^{N} \nu^{(i)t\top}(\beta^{(i)t+1} - \alpha) + \frac{s}{2}\|\beta^{(i)t+1} - \alpha\|_2^2 \right) = 0
-\sum_{i=1}^{N} \nu^{(i)t} - \sum_{i=1}^{N} s(\beta^{(i)t+1} - \alpha) = 0
Ns\alpha = \sum_{i=1}^{N} \nu^{(i)t} + s\beta^{(i)t+1}
\alpha = \frac{\sum_{i=1}^{N} \nu^{(i)t} + s\beta^{(i)t+1}}{Ns}

Solving Linear Regression with ADMM

Loss function:

\|y - X\beta\|_2^2 + \lambda\|\beta\|_2^2 = \sum_{i=1}^{N} \|y^{(i)} - X^{(i)}\beta^{(i)}\|_2^2 + \lambda\|\beta^{(i)}\|_2^2

The ADMM algorithm iteratively performs the following steps:

\beta^{(i)t+1} \leftarrow (2X^{(i)\top}X^{(i)} + (2\lambda + s)I)^{-1}(2X^{(i)\top}y^{(i)} - \nu^{(i)t} + s\alpha^t)
\alpha^{t+1} \leftarrow \frac{\sum_{i=1}^{N} \nu^{(i)t} + s\beta^{(i)t+1}}{Ns}
\nu^{(i)t+1} \leftarrow \nu^{(i)t} + s(\beta^{(i)t+1} - \alpha^{t+1})
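These closed-form updates fit in a few lines of numpy. The sketch below is illustrative: the function name `admm_ridge`, the per-worker data split, and all hyperparameter values are assumptions, not taken from the slides.

```python
import numpy as np

def admm_ridge(X_parts, y_parts, lam, s, n_iter):
    """Consensus ADMM for the ridge loss above.

    X_parts, y_parts: the per-worker splits X^(i), y^(i);
    lam: regularization constant; s: ADMM penalty parameter.
    """
    N, d = len(X_parts), X_parts[0].shape[1]
    alpha = np.zeros(d)                    # consensus variable
    nus = [np.zeros(d) for _ in range(N)]  # dual variables nu^(i), init 0
    betas = [np.zeros(d) for _ in range(N)]
    for _ in range(n_iter):
        # beta step: (2 X^T X + (2 lam + s) I)^-1 (2 X^T y - nu + s alpha)
        for i in range(N):
            A = 2 * X_parts[i].T @ X_parts[i] + (2 * lam + s) * np.eye(d)
            b = 2 * X_parts[i].T @ y_parts[i] - nus[i] + s * alpha
            betas[i] = np.linalg.solve(A, b)
        # alpha step: (sum of nu^(i) + s beta^(i)) / (N s)
        alpha = sum(nu + s * b for nu, b in zip(nus, betas)) / (N * s)
        # dual step
        nus = [nu + s * (b - alpha) for nu, b in zip(nus, betas)]
    return alpha
```

Since the constraint forces every \beta^{(i)} to equal \alpha, the global problem is \|y - X\alpha\|_2^2 + N\lambda\|\alpha\|_2^2, so the consensus solution can be checked against the direct ridge estimate (X^\top X + N\lambda I)^{-1} X^\top y.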

Solving Linear Regression with ADMM

Now assume we initialize \nu^{(i)0} = 0. Our first \alpha update step will be:

\alpha^1 \leftarrow \frac{\sum_{i=1}^{N} \nu^{(i)0} + s\beta^{(i)1}}{Ns} = \frac{\sum_{i=1}^{N} 0 + s\beta^{(i)1}}{Ns} = \frac{\sum_{i=1}^{N} s\beta^{(i)1}}{Ns} = \frac{\sum_{i=1}^{N} \beta^{(i)1}}{N}

Solving Linear Regression with ADMM

If we have \alpha^1 = \frac{\sum_{i=1}^{N} \beta^{(i)1}}{N}, then the \nu update will be

\nu^{(i)1} \leftarrow \nu^{(i)0} + s(\beta^{(i)1} - \alpha^1) = s\left(\beta^{(i)1} - \frac{\sum_{j=1}^{N} \beta^{(j)1}}{N}\right)

Solving Linear Regression with ADMM

The next \alpha update step will be:

\alpha^2 \leftarrow \frac{\sum_{i=1}^{N} \nu^{(i)1}}{Ns} + \frac{\sum_{i=1}^{N} s\beta^{(i)2}}{Ns}
= \frac{1}{N} \sum_{i=1}^{N} \left( \beta^{(i)1} - \frac{\sum_{j=1}^{N} \beta^{(j)1}}{N} \right) + \frac{\sum_{i=1}^{N} \beta^{(i)2}}{N}
= \frac{\sum_{i=1}^{N} \beta^{(i)1}}{N} - \frac{\sum_{j=1}^{N} \beta^{(j)1}}{N} + \frac{\sum_{i=1}^{N} \beta^{(i)2}}{N}
= \frac{\sum_{i=1}^{N} \beta^{(i)2}}{N}

Solving Linear Regression with ADMM

From this it follows that, given \nu^{(i)0} = 0, the algorithm can be further simplified to iteratively perform the following steps:

\beta^{(i)t+1} \leftarrow (2X^{(i)\top}X^{(i)} + (2\lambda + s)I)^{-1}(2X^{(i)\top}y^{(i)} - \nu^{(i)t} + s\alpha^t)
\alpha^{t+1} \leftarrow \frac{\sum_{i=1}^{N} \beta^{(i)t+1}}{N}
\nu^{(i)t+1} \leftarrow \nu^{(i)t} + s(\beta^{(i)t+1} - \alpha^{t+1})
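The simplification rests on the duals always summing to zero once they start at zero. A quick numerical check (with random stand-ins for the \beta^{(i)t+1} produced by the argmin, and arbitrary illustrative values for N, d, and s):

```python
import numpy as np

# With nu^(i)0 = 0, the duals sum to zero after every update,
# so the alpha step reduces to a plain average of the betas.
rng = np.random.default_rng(1)
N, d, s = 3, 4, 2.0
nus = [np.zeros(d) for _ in range(N)]
for t in range(5):
    betas = [rng.normal(size=d) for _ in range(N)]   # stand-in beta^(i)t+1
    alpha = sum(nu + s * b for nu, b in zip(nus, betas)) / (N * s)
    assert np.allclose(alpha, sum(betas) / N)        # alpha = mean of betas
    nus = [nu + s * (b - alpha) for nu, b in zip(nus, betas)]
    assert np.allclose(sum(nus), np.zeros(d))        # duals still sum to zero
```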

Solving Linear Regression with ADMM

Finally, notice that the \alpha and \nu update steps do not depend directly on the loss function. This means the ADMM algorithm can be generalized with the simplified updates:

\beta^{(i)t+1} \leftarrow \arg\min_\beta \sum_{(x,y) \in D_i^{train}} \ell(y, \hat{y}(x; \beta)) + R(\beta) + \nu^{(i)t\top}\beta + \frac{s}{2}\|\beta - \alpha^t\|_2^2
\alpha^{t+1} \leftarrow \frac{\sum_{i=1}^{N} \beta^{(i)t+1}}{N}
\nu^{(i)t+1} \leftarrow \nu^{(i)t} + s(\beta^{(i)t+1} - \alpha^{t+1})

Year Prediction Data Set

A least squares problem: predicting the release year of a song from audio features.
- 90 features
- Experiments done on a subset of 1000 instances of the data

ADMM on the Year Prediction Dataset

[Figure: runtime in seconds (0 to 30) versus number of workers (2 to 10), comparing ADMM against the closed-form solution.]

2. Recommender Systems


Why Recommender Systems?

A powerful method for enabling users to filter large amounts of information.

Personalized recommendations can boost the revenue of an e-commerce system:
- Amazon recommender systems
- Netflix challenge: 1 million dollars for improving their system's predictions by 10%

Different applications:
- Human computer interaction
- E-commerce
- Education
- ...

Why Personalization? The Long Tail

Source: http://www.ma.hw.ac.uk/esgi08/unilever.html

Rating Prediction

Given the items a user has already rated, how will the user rate other items?

Item Prediction

Which items will a user consume next?

Formalization

U: set of users
I: set of items
Ratings data D \subseteq U \times I \times \mathbb{R}

Rating data D are typically represented as a sparse matrix R \in \mathbb{R}^{|U| \times |I|} (rows: users, columns: items).

Example

            Titanic (t)  Matrix (m)  The Godfather (g)  Once (o)
Alice (a)        4                          2               5
Bob (b)                      4              3
John (j)                     4                              3

Users U := {Alice, Bob, John}
Items I := {Titanic, Matrix, The Godfather, Once}
Ratings data D := {(Alice, Titanic, 4), (Bob, Matrix, 4), ...}
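The example data can be written down directly; a minimal sketch, where the triple list `D` spells out all seven ratings from the table above and a dense array stands in for the sparse matrix R:

```python
import numpy as np

users = ["Alice", "Bob", "John"]
items = ["Titanic", "Matrix", "The Godfather", "Once"]

# Ratings data D as (user, item, rating) triples, as in the table above
D = [("Alice", "Titanic", 4), ("Alice", "The Godfather", 2), ("Alice", "Once", 5),
     ("Bob", "Matrix", 4), ("Bob", "The Godfather", 3),
     ("John", "Matrix", 4), ("John", "Once", 3)]

# Dense view of the sparse matrix R, with 0 marking "not rated"
R = np.zeros((len(users), len(items)))
for u, i, r in D:
    R[users.index(u), items.index(i)] = r
```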

Recommender Systems: Some Definitions

Some useful definitions:

N(u) is the set of all items rated by user u
  N(Alice) := {Titanic, The Godfather, Once}
N(i) is the set of all users that rated item i
  N(Once) := {Alice, John}

Recommender Systems: Task

Given a set of users U, items I, and training data D^{train} \subseteq U \times I \times \mathbb{R}, find a function

\hat{r}: U \times I \to \mathbb{R}

such that some error

error(\hat{r}, D^{test}) := \sum_{(u,i,r_{ui}) \in D^{test}} \ell(r_{ui}, \hat{r}(u,i))

is minimal.

3. Traditional Recommendation Approaches

Recommender Systems Approaches

Most recommender system approaches can be classified into:
- Content Based Filtering: recommends items similar to the items liked by a user, using textual similarity in metadata
- Collaborative Filtering: recommends items liked by users with similar behavior

We will focus on collaborative filtering!

3.1. Nearest Neighbor Approaches

Nearest neighbor approaches build on the concept of similarity between users and/or items:
- The neighborhood N_u of a user u is the set of the k most similar users to u
- Analogously, the neighborhood N_i of an item i is the set of the k most similar items to i

There are two main neighborhood based approaches:
- User Based: the rating of an item by a user is computed based on how similar users have rated the same item
- Item Based: the rating of an item by a user is computed based on how similar items have been rated by the same user

User Based Recommender

A user u \in U is represented as a vector u \in \mathbb{R}^{|I|} containing the user's ratings.

            Titanic (t)  Matrix (m)  The Godfather (g)  Once (o)
Alice (a)        4                          2               5
Bob (b)                      4              3
John (j)                     4                              3

Examples:
a := [4, 0, 2, 5]
b := [0, 4, 3, 0]
j := [0, 4, 0, 3]

User Based Recommender: Prediction Function

\hat{r}(u,i) := \bar{r}_u + \frac{\sum_{v \in N_u} sim(u,v)(r_{vi} - \bar{r}_v)}{\sum_{v \in N_u} sim(u,v)}

where:
- \bar{r}_u is the average rating of user u
- sim is a similarity function, also used to compute the neighborhood N_u
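The prediction function above can be sketched as follows. This is an illustrative implementation, not code from the slides: it uses cosine similarity on raw rating vectors, restricts the neighborhood N_u to the k most similar users who actually rated item i (a common practical choice the slides leave implicit), and assumes every user has rated at least one item.

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def predict_user_based(u, i, R, sim=cosine, k=2):
    """User-based rating prediction r̂(u, i).

    R: dense ratings matrix with 0 = unrated; sim: similarity of two
    rating vectors; k: neighborhood size.
    """
    rated = R > 0
    # r̄_v: each user's mean over the items they rated
    means = np.array([row[m].mean() for row, m in zip(R, rated)])
    # candidates: users other than u who rated item i
    cands = [v for v in range(R.shape[0]) if v != u and rated[v, i]]
    # neighborhood N_u: the k most similar candidates
    neigh = sorted(cands, key=lambda v: sim(R[u], R[v]), reverse=True)[:k]
    den = sum(sim(R[u], R[v]) for v in neigh)
    if den == 0:
        return means[u]  # fall back to the user mean
    num = sum(sim(R[u], R[v]) * (R[v, i] - means[v]) for v in neigh)
    return means[u] + num / den
```

On the running example, predicting Alice's rating for Matrix uses Bob and John as neighbors; both rated Matrix half a point above their own means, so the prediction is Alice's mean plus 0.5.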

Item Based Recommender

An item i \in I is represented as a vector i \in \mathbb{R}^{|U|} containing information on how the item is rated by users.

            Titanic (t)  Matrix (m)  The Godfather (g)  Once (o)
Alice (a)        4                          2               5
Bob (b)                      4              3
John (j)                     4                              3

Examples:
t := [4, 0, 0]
m := [0, 4, 4]
g := [2, 3, 0]
o := [5, 0, 3]

Item Based Recommender: Prediction Function

\hat{r}(u,i) := \bar{r}_i + \frac{\sum_{j \in N_i} sim(i,j)(r_{uj} - \bar{r}_j)}{\sum_{j \in N_i} sim(i,j)}

where:
- \bar{r}_i is the average rating of item i
- sim is a similarity function, also used to compute the neighborhood N_i

Similarity Measures

In both user and item based recommenders the similarity measure plays an important role:
- It is used to compute the neighborhoods of users and items (neighbors are the most similar ones)
- It is used during the prediction of the ratings

Which similarity measure should we use?

Similarity Measures

Commonly used similarity measures:

Cosine:

sim(u,v) = \cos(u,v) = \frac{\langle u, v \rangle}{\|u\|_2 \|v\|_2}

Pearson correlation:

sim(u,v) = \frac{\sum_{i \in N(u) \cap N(v)} (r_{ui} - \bar{r}_u)(r_{vi} - \bar{r}_v)}{\sqrt{\sum_{i \in N(u) \cap N(v)} (r_{ui} - \bar{r}_u)^2} \sqrt{\sum_{i \in N(u) \cap N(v)} (r_{vi} - \bar{r}_v)^2}}
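Both measures are a few lines of numpy. A sketch under one assumption not stated in the formulas: \bar{r}_u and \bar{r}_v are taken over each user's rated items, while the sums run only over the co-rated items N(u) \cap N(v).

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine similarity of two rating vectors (0 = unrated)."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def pearson_sim(ru, rv):
    """Pearson correlation over the co-rated items N(u) ∩ N(v)."""
    both = (ru > 0) & (rv > 0)               # co-rated items
    if both.sum() == 0:
        return 0.0
    du = ru[both] - ru[ru > 0].mean()        # r_ui - r̄_u
    dv = rv[both] - rv[rv > 0].mean()        # r_vi - r̄_v
    denom = np.sqrt((du ** 2).sum()) * np.sqrt((dv ** 2).sum())
    return 0.0 if denom == 0 else float((du * dv).sum() / denom)
```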

3.2. Factorization Models

Why Factorization Models?

Neighborhood based approaches have been shown to be effective, but:
- Computing and maintaining the neighborhoods is expensive
- In recent years, a number of models have been shown to outperform them
- One of the results of the Netflix Challenge was the power of factorization models when applied to recommender systems


Partially Observed Matrices

The ratings matrix R is usually only partially observed:
- No user is able to rate all items
- Most items are not rated by all users

Can we estimate the factorization of a matrix from some observations in order to predict its unobserved part?

Factorization Models

Each item i \in I is associated with a latent feature vector q_i \in \mathbb{R}^k.
Each user u \in U is associated with a latent feature vector p_u \in \mathbb{R}^k.

Each entry in the original matrix can be estimated by

\hat{r}(u,i) = \langle p_u, q_i \rangle = \sum_{f=1}^{k} p_{u,f} q_{i,f}

Example

            T  M  G  O
Alice       4     2  5
Bob            4  3
John           4     3

The ratings matrix R is approximated by the product of a user factor matrix P and an item factor matrix Q: R \approx P Q^\top.

Latent Factors

Source: Yehuda Koren, Robert Bell, Chris Volinsky: Matrix Factorization Techniques for Recommender Systems. Computer, vol. 42, no. 8, pp. 30-37, August 2009.

Learning a Factorization Model: Objective Function

Task:

\arg\min_{P,Q} \sum_{(u,i,r_{ui}) \in D^{train}} (r_{ui} - \hat{r}(u,i))^2 + \lambda(\|P\|^2 + \|Q\|^2)

where:
- \hat{r}(u,i) := \langle p_u, q_i \rangle
- D^{train} is the training data
- \lambda is a regularization constant

Optimization Method

L := \sum_{(u,i,r_{ui}) \in D^{train}} (r_{ui} - \hat{r}(u,i))^2 + \lambda(\|P\|^2 + \|Q\|^2)

Stochastic Gradient Descent:

Conditions:
- The loss function should be decomposable into a sum of components
- The loss function should be differentiable

Procedure:
- Randomly draw one component of the sum
- Update the parameters in the opposite direction of the gradient

SGD: Gradients

L := \sum_{(u,i,r_{ui}) \in D^{train}} (r_{ui} - \hat{r}(u,i))^2 + \lambda(\|P\|^2 + \|Q\|^2)

Gradients of the component for one observation (u, i, r_{ui}):

\frac{\partial L}{\partial p_u} = -2(r_{ui} - \hat{r}(u,i)) q_i + 2\lambda p_u
\frac{\partial L}{\partial q_i} = -2(r_{ui} - \hat{r}(u,i)) p_u + 2\lambda q_i

Stochastic Gradient Descent Algorithm

procedure LearnLatentFactors(D^{train}, \lambda, \alpha)
  (p_u)_{u \in U} \sim N(0, \sigma I)
  (q_i)_{i \in I} \sim N(0, \sigma I)
  repeat
    for (u, i, r_{ui}) \in D^{train}, in a random order, do
      p_u \leftarrow p_u - \alpha(-2(r_{ui} - \hat{r}(u,i)) q_i + 2\lambda p_u)
      q_i \leftarrow q_i - \alpha(-2(r_{ui} - \hat{r}(u,i)) p_u + 2\lambda q_i)
    end for
  until convergence
  return P, Q
end procedure
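The procedure above can be sketched in numpy. Assumptions not on the slides: a fixed number of epochs stands in for "until convergence", the q_i step uses the pre-update p_u (a simultaneous gradient step), and all hyperparameter values are illustrative.

```python
import random
import numpy as np

def learn_latent_factors(D_train, n_users, n_items, k, lam, alpha,
                         n_epochs, sigma=0.1, seed=0):
    """SGD for the regularized matrix factorization objective above."""
    rng = np.random.default_rng(seed)
    random.seed(seed)
    P = rng.normal(0.0, sigma, size=(n_users, k))  # p_u ~ N(0, sigma I)
    Q = rng.normal(0.0, sigma, size=(n_items, k))  # q_i ~ N(0, sigma I)
    D = list(D_train)
    for _ in range(n_epochs):                      # stand-in for convergence
        random.shuffle(D)                          # random order
        for u, i, r in D:
            e = r - P[u] @ Q[i]                    # r_ui - r̂(u, i)
            pu_old = P[u].copy()                   # simultaneous update
            P[u] -= alpha * (-2 * e * Q[i] + 2 * lam * P[u])
            Q[i] -= alpha * (-2 * e * pu_old + 2 * lam * Q[i])
    return P, Q
```

On a small synthetic rank-k ratings matrix, the training error should drop close to zero when the model capacity matches the true rank.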

Factorization Models in Practice

Dataset: MovieLens 1M (ML1M)
- Users: 6040
- Movies: 3703
- Ratings: from 1 (worst) to 5 (best)
- 1,000,000 observed ratings (approx. 4.5% of all possible ratings)

Evaluation

Evaluation protocol:
- 10-fold cross-validation
- Leave-one-out

Measure: RMSE (Root Mean Squared Error)

RMSE = \sqrt{\frac{\sum_{(u,i,r_{ui}) \in D^{test}} (r_{ui} - \hat{r}(u,i))^2}{|D^{test}|}}
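The RMSE formula translates directly; a minimal sketch, where `r_hat` is any prediction function taking a user and an item:

```python
import numpy as np

def rmse(D_test, r_hat):
    """RMSE of a prediction function r_hat(u, i) on held-out ratings."""
    sq_errors = [(r_ui - r_hat(u, i)) ** 2 for u, i, r_ui in D_test]
    return float(np.sqrt(np.mean(sq_errors)))
```

For example, a constant predictor of 4.0 on the two ratings 3.0 and 5.0 gives an RMSE of exactly 1.0.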

SGD for Factorization Models: Performance over Epochs

[Figure: RMSE on train and RMSE on test (0.4 to 1.6) over 100 epochs.]

Factorization Models: Impact of the Number of Latent Features

[Figure: RMSE on MovieLens 1M (0.80 to 1.00) as a function of the number of latent features k (0 to 120).]

Factorization Models: Effect of Regularization

[Figure: RMSE (0.4 to 1.6) over 100 epochs for a regularized model versus an unregularized model.]