Collaborative Filtering. Radek Pelánek



Similar documents
Big Data Analytics. Lucas Rego Drumond

Recommender Systems: Content-based, Knowledge-based, Hybrid. Radek Pelánek

Hybrid model rating prediction with Linked Open Data for Recommender Systems

The Need for Training in Big Data: Experiences and Case Studies

Machine Learning using MapReduce

IPTV Recommender Systems. Paolo Cremonesi

Lecture #2. Algorithms for Big Data

Advances in Collaborative Filtering

Content-Based Recommendation

E6893 Big Data Analytics Lecture 5: Big Data Analytics Algorithms -- II

Scalable Collaborative Filtering with Jointly Derived Neighborhood Interpolation Weights

Overview. Introduction. Recommender Systems & Slope One Recommender. Distributed Slope One on Mahout and Hadoop. Experimental Setup and Analyses

Social Media Mining. Data Mining Essentials

Factorization Machines

CS Master Level Courses and Areas COURSE DESCRIPTIONS. CSCI 521 Real-Time Systems. CSCI 522 High Performance Computing

Principles of Data Mining by Hand&Mannila&Smyth

Collaborative Filtering Recommender Systems. Contents

Big & Personal: data and models behind Netflix recommendations

Recommendation Tool Using Collaborative Filtering

Rating Prediction with Informative Ensemble of Multi-Resolution Dynamic Models

Ensemble Learning Better Predictions Through Diversity. Todd Holloway ETech 2008

APPM4720/5720: Fast algorithms for big data. Gunnar Martinsson The University of Colorado at Boulder

A Social Network-Based Recommender System (SNRS)

Statistical Machine Learning

Several Views of Support Vector Machines

Lecture 5: Singular Value Decomposition SVD (1)

BUILDING A PREDICTIVE MODEL AN EXAMPLE OF A PRODUCT RECOMMENDATION ENGINE

Information Management course

How I won the Chess Ratings: Elo vs the rest of the world Competition

Factorization Machines

Contextual-Bandit Approach to Recommendation Konstantin Knauf

Azure Machine Learning, SQL Data Mining and R

Analyze It use cases in telecom & healthcare

Matrix Factorization Techniques for Recommender Systems (Cover Feature). Recommender system strategies

Fast Analytics on Big Data with H20

CS 2750 Machine Learning. Lecture 1. Machine Learning. CS 2750 Machine Learning.

Predictive Analytics Techniques: What to Use For Your Big Data. March 26, 2014 Fern Halper, PhD

2. EXPLICIT AND IMPLICIT FEEDBACK

Fundamental Analysis Challenge

Supervised Learning (Big Data Analytics)

A Collaborative Filtering Recommendation Algorithm Based On User Clustering And Item Clustering

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal

Is a Data Scientist the New Quant? Stuart Kozola MathWorks

Invited Applications Paper

Principal components analysis

Lecture 3: Linear methods for classification

The Data Mining Process

CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.

Bayesian Factorization Machines

Machine Learning. Chapter 18, 21. Some material adopted from notes by Chuck Dyer

Machine Learning Introduction

MS1b Statistical Data Mining

An Introduction to Data Mining

Defending Networks with Incomplete Information: A Machine Learning Approach. Alexandre

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Chapter 12 Discovering New Knowledge Data Mining

Big Data Analytics CSCI 4030

Linear smoother. ŷ = S y. where s ij = s ij (x) e.g. s ij = diag(l i (x)) To go the other way, you need to diagonalize S

Scalable Machine Learning - or what to do with all that Big Data infrastructure

Machine Learning Capacity and Performance Analysis and R

Mammoth Scale Machine Learning!

System Identification for Acoustic Comms.:

Marketing Mix Modelling and Big Data P. M Cain

Big Data - Lecture 1 Optimization reminders

Java Modules for Time Series Analysis

PREA: Personalized Recommendation Algorithms Toolkit

Machine Learning Big Data using Map Reduce

The Method of Least Squares

New Work Item for ISO Predictive Analytics (Initial Notes and Thoughts) Introduction

Advanced Ensemble Strategies for Polynomial Models

Recommender Systems. User-Facing Decision Support Systems. Michael Hahsler

Data Mining for Web Personalization

Review Jeopardy. Blue vs. Orange. Review Jeopardy

Role of Social Networking in Marketing using Data Mining

Introduction to Learning & Decision Trees

Identifying SPAM with Predictive Models

BIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics

Machine Learning. Term 2012/2013 LSI - FIB. Javier Béjar cbea (LSI - FIB) Machine Learning Term 2012/ / 34

Machine Learning for Data Science (CS4786) Lecture 1

Supporting Online Material for

A Simple Introduction to Support Vector Machines

Probabilistic Models for Big Data. Alex Davies and Roger Frigola University of Cambridge 13th February 2014

MATHEMATICAL METHODS OF STATISTICS

Statistical machine learning, high dimension and big data

A Workbench for Comparing Collaborative- and Content-Based Algorithms for Recommendations

Text Analytics (Text Mining)

Data Mining Practical Machine Learning Tools and Techniques

Introduction to Data Mining

Data Mining and Exploration. Data Mining and Exploration: Introduction. Relationships between courses. Overview. Course Introduction

arxiv: v1 [cs.ir] 12 Jun 2015

Syllabus for MATH 191 MATH 191 Topics in Data Science: Algorithms and Mathematical Foundations Department of Mathematics, UCLA Fall Quarter 2015


Collaborative Filtering Radek Pelánek 2015

Collaborative Filtering: assumption: users with similar taste in the past will have similar taste in the future; requires only a matrix of ratings; applicable in many domains; widely used in practice.

Basic CF Approach: input: a matrix of user-item ratings (with missing values, often very sparse); output: predictions for the missing values.

Netflix Prize: Netflix, a video rental company; contest: 10% improvement in the quality of recommendations; prize: 1 million dollars; data: user ID, movie ID, time, rating.

Main CF Techniques: memory based: nearest neighbors (user-based, item-based); model based: latent factors, matrix factorization.

Neighborhood Methods: Illustration Matrix factorization techniques for recommender systems

Latent Factors: Illustration Matrix factorization techniques for recommender systems

Latent Factors: Netflix Data Matrix factorization techniques for recommender systems

Ratings: explicit: e.g., stars (a 1 to 5 Likert scale); to consider: granularity, multidimensionality; issues: users may not be willing to rate, data sparsity. Implicit: proxy data for quality rating, e.g., clicks, page views, time on page. The following applies directly to explicit ratings; modifications may be needed for implicit ratings (or a combination of the two).

Non-personalized Predictions: averages; issues: bias and normalization (some users give systematically higher ratings; more details for CF later) and the number of ratings, i.e., uncertainty: compare an average of 5 from 3 ratings with an average of 4.9 from 100 ratings.
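
The 5-from-3-ratings vs. 4.9-from-100-ratings issue is commonly handled by shrinking item averages toward the global mean; a minimal sketch (the damping constant and the global mean below are illustrative assumptions, not values from the lecture):

```python
def damped_mean(ratings, global_mean, damping=10):
    """Shrink an item's average toward the global mean; the fewer the
    ratings, the stronger the pull (a common 'Bayesian average' trick).
    The damping constant 10 is an arbitrary illustrative choice."""
    return (sum(ratings) + damping * global_mean) / (len(ratings) + damping)

mu = 3.5                              # assumed global mean over all ratings
a = damped_mean([5, 5, 5], mu)        # raw average 5.0, but only 3 ratings
b = damped_mean([4.9] * 100, mu)      # raw average 4.9 from 100 ratings
# item b now ranks above item a, despite its lower raw average
```

The damped mean converges to the raw average as the number of ratings grows, so well-rated items are barely affected.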

Exploitation vs Exploration: pure exploitation means always recommending the top items; but what if some other item is actually better and its rating is poorer just due to noise? Exploration means trying items to get more data. How do we balance exploration and exploitation?

Multi-armed Bandit: the standard model for the exploitation vs exploration trade-off; each arm yields an (unknown) probabilistic reward; how do we choose arms to maximize the total reward? Well studied, with many algorithms (e.g., upper confidence bounds); a typical application: advertisements.
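
The upper-confidence-bound idea mentioned above (UCB1) can be sketched as follows; the two Bernoulli reward probabilities are made-up toy values:

```python
import math
import random

def ucb1_choose(counts, sums, t):
    """UCB1: play each arm once, then pick the arm maximizing
    empirical mean + sqrt(2 ln t / n) exploration bonus."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    return max(range(len(counts)),
               key=lambda a: sums[a] / counts[a]
               + math.sqrt(2 * math.log(t) / counts[a]))

random.seed(0)
probs = [0.3, 0.6]            # hypothetical click probabilities of two items
counts, sums = [0, 0], [0.0, 0.0]
for t in range(1, 2001):
    arm = ucb1_choose(counts, sums, t)
    reward = 1.0 if random.random() < probs[arm] else 0.0
    counts[arm] += 1
    sums[arm] += reward
# the better arm (index 1) ends up pulled far more often,
# yet the worse arm still gets explored occasionally
```

The exploration bonus shrinks as an arm is pulled more, so the algorithm automatically shifts from exploration toward exploitation.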

Core Idea: do not use just averages; quantify uncertainty (e.g., standard deviation) and combine average and uncertainty for decisions. Example: TrueSkill, ranking of players (leaderboards).

Note on Improving Performance: simple predictors often provide reasonable performance; further improvements are often small, but can have a significant impact on behavior (not easy to evaluate; see the evaluation lecture). Introduction to Recommender Systems, Xavier Amatriain

User-based Nearest Neighbor CF: for user Alice and an item i not rated by Alice: find users similar to Alice who have rated i, compute an average of their ratings to predict Alice's rating, and recommend the items with the highest predicted ratings.

User-based Nearest Neighbor CF Recommender Systems: An Introduction (slides)

User Similarity: Pearson correlation coefficient (alternatives: e.g., Spearman correlation coefficient, cosine similarity). Recommender Systems: An Introduction (slides)

Pearson Correlation Coefficient: Reminder
r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2} \sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}}

Making Predictions b N pred(a, i) = r a + sim(a, b) (r bi r b ) sim(a, b) r ai rating of user a, item i r a, r b user averages b N

Improvements: number of co-rated items; agreement on more exotic items is more important; case amplification (more weight to very similar neighbors); neighbor selection.

Item-based Collaborative Filtering: compute similarity between items and use this similarity to predict ratings; more computationally efficient, since often the number of items << the number of users.

Item-based Nearest Neighbor CF Recommender Systems: An Introduction (slides)

Cosine Similarity: ratings by Alice and Bob as vectors A, B;
\cos(\alpha) = \frac{A \cdot B}{\|A\| \, \|B\|}

Similarity, Predictions: (adjusted) cosine similarity is similar to Pearson's r and works slightly better;
pred(u, p) = \frac{\sum_{i \in R} \mathrm{sim}(i, p) \, r_{ui}}{\sum_{i \in R} \mathrm{sim}(i, p)}
with the neighborhood size limited (20 to 50).
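
A sketch of the adjusted cosine similarity (ratings are centred by each user's mean before the cosine is taken; the data and item names are toy values):

```python
import math

def adjusted_cosine(ratings, i, j):
    """Cosine similarity between items i and j on user-mean-centred ratings."""
    means = {u: sum(r.values()) / len(r) for u, r in ratings.items()}
    users = [u for u, r in ratings.items() if i in r and j in r]
    num = sum((ratings[u][i] - means[u]) * (ratings[u][j] - means[u])
              for u in users)
    den = (math.sqrt(sum((ratings[u][i] - means[u]) ** 2 for u in users))
           * math.sqrt(sum((ratings[u][j] - means[u]) ** 2 for u in users)))
    return num / den if den else 0.0

ratings = {
    "alice": {"m1": 5, "m2": 3, "m3": 4},
    "bob":   {"m1": 4, "m2": 2, "m3": 4},
    "carol": {"m1": 1, "m2": 5, "m3": 2},
}
s12 = adjusted_cosine(ratings, "m1", "m2")  # negative: liked by opposite users
s13 = adjusted_cosine(ratings, "m1", "m3")  # positive: liked by the same users
```

Centring by user means removes the per-user rating bias that plain cosine similarity ignores, which is why the adjusted variant behaves similarly to Pearson's r.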

Preprocessing: O(N^2) calculations are still large; item-item recommendations by Amazon (2003): calculate similarities in advance (with periodical updates); item relations are supposed to be stable and not expected to change quickly; reductions (minimum number of co-ratings etc.).

Matrix Factorization CF: main idea: latent factors of users/items; use these to predict ratings; related to singular value decomposition.

Notes: singular value decomposition (SVD) is a theorem in linear algebra; in the CF context, the name SVD is usually used for an approach only slightly related to the SVD theorem; related to latent semantic analysis; introduced during the Netflix Prize in a blog post (Simon Funk): http://sifter.org/~simon/journal/20061211.html

Singular Value Decomposition (Linear Algebra): X = USV^T, where U, V are orthogonal matrices and S is a diagonal matrix whose diagonal entries are the singular values; low-rank matrix approximation (use only the top k singular values). http://www.cs.carleton.edu/cs_comps/0607/recommend/recommender/svd.html
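
The decomposition and its rank-k truncation, sketched with NumPy on a small dense toy matrix (no missing values here; real rating matrices are sparse, which is addressed below):

```python
import numpy as np

# toy dense rating matrix: rows = users, columns = items
X = np.array([[5., 4., 1., 1.],
              [4., 5., 1., 2.],
              [1., 1., 5., 4.],
              [2., 1., 4., 5.]])

U, s, Vt = np.linalg.svd(X, full_matrices=False)  # X = U @ diag(s) @ Vt
k = 2                                             # keep the top-k singular values
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]       # best rank-k approximation
```

By the Eckart-Young theorem, X_k is the closest rank-k matrix to X in the Frobenius norm, which is what makes the truncation useful as a low-dimensional latent-factor model.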

SVD CF Interpretation: X = USV^T, where X is the matrix of ratings, U the user-factor strengths, V the item-factor strengths, and S the importance of the factors.

Latent Factors Matrix factorization techniques for recommender systems

Latent Factors Matrix factorization techniques for recommender systems

Missing Values: matrix factorization techniques (SVD) work with a full matrix, but ratings form a sparse matrix. Solutions: value imputation (expensive, imprecise) or alternative algorithms (greedy, heuristic): gradient descent, alternating least squares.

Notation: u user, i item; r_{ui} rating; \hat{r}_{ui} predicted rating; b, b_u, b_i biases; q_i, p_u latent factor vectors (length k).

Simple Baseline Predictors [note: always use baseline methods in your experiments]: naive: \hat{r}_{ui} = \mu, where \mu is the global mean; with biases: \hat{r}_{ui} = \mu + b_u + b_i, where the biases b_u, b_i are average deviations (some users/items receive systematically higher/lower ratings).
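
A sketch of the biased baseline \hat{r}_{ui} = \mu + b_u + b_i, with the biases estimated by simple successive averaging (one common non-regularized way to fit them; the data is a toy example):

```python
def fit_baseline(ratings):
    """Fit mu, b_u, b_i by successive averaging (no regularization):
    b_u = average of (r - mu) over the user's ratings,
    b_i = average of (r - mu - b_u) over the item's ratings."""
    mu = sum(ratings.values()) / len(ratings)
    by_user, by_item = {}, {}
    for (u, i), r in ratings.items():
        by_user.setdefault(u, []).append(r - mu)
    b_u = {u: sum(d) / len(d) for u, d in by_user.items()}
    for (u, i), r in ratings.items():
        by_item.setdefault(i, []).append(r - mu - b_u[u])
    b_i = {i: sum(d) / len(d) for i, d in by_item.items()}
    return mu, b_u, b_i

ratings = {("alice", "m1"): 5, ("alice", "m2"): 3,
           ("bob", "m1"): 4, ("bob", "m2"): 2}
mu, b_u, b_i = fit_baseline(ratings)
pred = mu + b_u["alice"] + b_i["m1"]  # 5.0: a generous user, a popular item
```

Despite its simplicity, this baseline already captures much of the structure in rating data, which is why it should always accompany more complex models in experiments.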

Latent Factors (for a while, assume centered data without bias): \hat{r}_{ui} = q_i^T p_u, a vector multiplication; user-item interaction via latent factors. Illustration (3 factors): user (p_u): (0.5, 0.8, 0.3); item (q_i): (0.4, 0.1, 0.8).

Latent Factors: vector multiplication \hat{r}_{ui} = q_i^T p_u; user-item interaction via latent factors; we need to find q_i, p_u from the data (cf. content-based techniques); note: q_i and p_u are found at the same time.

Learning Factor Vectors: we want to minimize the squared errors (related to RMSE, more details later):
\min_{q,p} \sum_{(u,i) \in T} (r_{ui} - q_i^T p_u)^2
with regularization to avoid overfitting (a standard machine learning approach):
\min_{q,p} \sum_{(u,i) \in T} (r_{ui} - q_i^T p_u)^2 + \lambda (\|q_i\|^2 + \|p_u\|^2)
How do we find the minimum?

Stochastic Gradient Descent: a standard technique in machine learning; greedy, may find a local minimum.

Gradient Descent for CF: prediction error e_{ui} = r_{ui} - q_i^T p_u; update (in parallel):
q_i := q_i + \gamma (e_{ui} p_u - \lambda q_i)
p_u := p_u + \gamma (e_{ui} q_i - \lambda p_u)
The math behind the equations: the gradient, i.e., partial derivatives. \gamma, \lambda are constants, set pragmatically: learning rate \gamma (0.005 for Netflix), regularization \lambda (0.02 for Netflix).
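
The update rules can be sketched directly on toy data (the learning rate, epoch count, and ratings below are arbitrary small-data choices for illustration, not the Netflix settings):

```python
import random

def train_mf(ratings, n_factors=3, gamma=0.01, lam=0.02, epochs=3000):
    """SGD for r_ui ~ q_i . p_u: for each known rating, compute the error
    e_ui and move q_i, p_u along the regularized negative gradient."""
    random.seed(1)
    p = {u: [random.gauss(0, 0.1) for _ in range(n_factors)]
         for u, _ in ratings}
    q = {i: [random.gauss(0, 0.1) for _ in range(n_factors)]
         for _, i in ratings}
    for _ in range(epochs):
        for (u, i), r in ratings.items():
            e = r - sum(qf * pf for qf, pf in zip(q[i], p[u]))
            for f in range(n_factors):
                qf, pf = q[i][f], p[u][f]          # update in parallel
                q[i][f] += gamma * (e * pf - lam * qf)
                p[u][f] += gamma * (e * qf - lam * pf)
    return p, q

ratings = {("u1", "m1"): 5.0, ("u1", "m2"): 1.0,
           ("u2", "m1"): 4.0, ("u2", "m2"): 2.0}
p, q = train_mf(ratings)
pred = sum(a * b for a, b in zip(q["m1"], p["u1"]))  # approaches 5
```

Note the temporaries qf, pf: both vectors are updated from the same error, matching the "in parallel" remark on the slide.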

Advice: if you want to learn/understand gradient descent (and many other machine learning notions), experiment with linear regression: it can be (simply) approached in many ways (analytic solution, gradient descent, brute force search), it is easy to visualize, good for intuitive understanding, and it is relatively easy to derive the equations (one of the examples in IV122 Math & Programming).

Advice II recommended sources: Koren, Yehuda, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer 42.8 (2009): 30-37. Koren, Yehuda, and Robert Bell. Advances in collaborative filtering. Recommender Systems Handbook. Springer US, 2011. 145-186.

Adding Bias: predictions: \hat{r}_{ui} = \mu + b_u + b_i + q_i^T p_u; function to minimize:
\min_{q,p,b} \sum_{(u,i) \in T} (r_{ui} - \mu - b_u - b_i - q_i^T p_u)^2 + \lambda (\|q_i\|^2 + \|p_u\|^2 + b_u^2 + b_i^2)

Improvements: additional data sources (implicit ratings); varying confidence levels; temporal dynamics.

Temporal Dynamics Netflix data Y. Koren, Collaborative Filtering with Temporal Dynamics

Temporal Dynamics Netflix data, jump early in 2004 Y. Koren, Collaborative Filtering with Temporal Dynamics

Temporal Dynamics: baseline = behaviour influenced by exterior considerations; interaction = behaviour explained by the match between users and items. Y. Koren, Collaborative Filtering with Temporal Dynamics

Results for Netflix Data Matrix factorization techniques for recommender systems

Slope One (Slope One Predictors for Online Rating-Based Collaborative Filtering): predict a user's rating of an item from another of their ratings plus the average difference between the two items; average over such simple predictions.

Slope One: accurate within reason, easy to implement, updateable on the fly, efficient at query time, expects little from first visitors.
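
A sketch of the (unweighted) Slope One scheme on toy data: for the target item, add the average observed rating difference to each of the user's known ratings, then average the results:

```python
def slope_one(ratings, u, j):
    """Predict r_uj as the average of r_ui + dev(j, i) over the items i
    rated by user u, where dev(j, i) is the mean of (r_j - r_i) among
    users who rated both items."""
    def dev(j, i):
        diffs = [r[j] - r[i] for r in ratings.values() if j in r and i in r]
        return sum(diffs) / len(diffs) if diffs else None

    preds = []
    for i, r_ui in ratings[u].items():
        if i == j:
            continue
        d = dev(j, i)
        if d is not None:
            preds.append(r_ui + d)
    return sum(preds) / len(preds) if preds else None

ratings = {  # toy data
    "alice": {"m1": 5, "m2": 3},
    "bob":   {"m1": 3, "m2": 4, "m3": 3},
    "carol": {"m1": 4, "m3": 5},
}
p = slope_one(ratings, "alice", "m3")  # 3.75
```

The production variant usually weights each deviation by the number of co-ratings (weighted Slope One); the unweighted average above keeps the sketch minimal.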

Other CF Techniques clustering association rules classifiers

Clustering: main idea: cluster similar users and make non-personalized predictions ("popularity") within each cluster; clustering is unsupervised machine learning, with many algorithms: k-means, the EM algorithm, ...
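
A minimal k-means sketch on user rating vectors (the data is a toy example; centroids are initialized deterministically from the first k points for reproducibility, unlike the usual random initialization):

```python
def kmeans(points, k, iters=10):
    """Plain k-means: assign each point to its nearest centroid, then
    recompute each centroid as the mean of its cluster."""
    centroids = [list(p) for p in points[:k]]  # naive deterministic init
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            best = min(range(k),
                       key=lambda c: sum((a - b) ** 2
                                         for a, b in zip(p, centroids[c])))
            clusters[best].append(p)
        centroids = [[sum(col) / len(cl) for col in zip(*cl)] if cl
                     else centroids[c] for c, cl in enumerate(clusters)]
    return centroids, clusters

# toy user rating vectors: two "tastes", two users each
users = [[5, 4, 1], [4, 5, 2], [1, 2, 5], [2, 1, 4]]
centroids, clusters = kmeans(users, 2)
# each cluster's centroid then serves as the non-personalized
# prediction for every user assigned to that cluster
```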

Clustering Introduction to Recommender Systems, Xavier Amatriain

Clustering: K-means

Association Rules: relationships among items, e.g., common purchases; famous example (google it for more details): beer and diapers; "Customers Who Bought This Item Also Bought..."; advantage: provides explanations, useful for building trust.
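
The support and confidence of a rule, sketched on toy market baskets (the beer-and-diapers example above; all basket contents are made up):

```python
def rule_stats(transactions, lhs, rhs):
    """Support and confidence for the rule lhs -> rhs:
    support = P(lhs and rhs), confidence = P(rhs | lhs)."""
    n = len(transactions)
    both = sum(1 for t in transactions if lhs <= t and rhs <= t)
    lhs_count = sum(1 for t in transactions if lhs <= t)
    support = both / n
    confidence = both / lhs_count if lhs_count else 0.0
    return support, confidence

baskets = [{"beer", "diapers"}, {"beer", "diapers", "chips"},
           {"beer"}, {"diapers", "milk"}, {"milk", "chips"}]
sup, conf = rule_stats(baskets, {"beer"}, {"diapers"})  # 0.4 and 2/3
```

The confidence value is exactly the kind of statistic behind "Customers Who Bought This Item Also Bought..." displays, and it doubles as an explanation for the recommendation.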

Classifiers: general machine learning techniques; positive/negative classification; train and test sets; logistic regression, support vector machines, decision trees, Bayesian techniques, ...

Limitations of CF: the cold start problem; popularity bias (difficult to recommend items from the long tail); impact of noise (e.g., one account used by different people); possibility of attacks.

Cold Start Problem How to recommend new items? What to recommend to new users?

Cold Start Problem: use another method (non-personalized, content-based, ...) in the initial phase; ask/force users to rate items; use defaults (means); better algorithms, e.g., recursive CF.

Recursive Collaborative Filtering Recommender Systems: An Introduction (slides)

Collaborative Filtering: Summary: requires only ratings and is therefore widely applicable; neighborhood methods and latent factor models; use of machine learning techniques.