


Similar documents
Development of a distributed recommender system using the Hadoop Framework

Machine Learning using MapReduce

! E6893 Big Data Analytics Lecture 5:! Big Data Analytics Algorithms -- II

Hadoop MapReduce Tutorial - Reduce Comp variability in Data Stamps

Map-Reduce for Machine Learning on Multicore

Distributed Framework for Data Mining As a Service on Private Cloud

Mammoth Scale Machine Learning!

Advanced In-Database Analytics

! E6893 Big Data Analytics:! Demo Session II: Mahout working with Eclipse and Maven for Collaborative Filtering

A Performance Analysis of Distributed Indexing using Terrier

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

BUDT 758B-0501: Big Data Analytics (Fall 2015) Decisions, Operations & Information Technologies Robert H. Smith School of Business

Collaborative Filtering. Radek Pelánek

Introduction to Hadoop on the SDSC Gordon Data Intensive Cluster"

Data Analytics. CloudSuite1.0 Benchmark Suite Copyright (c) 2011, Parallel Systems Architecture Lab, EPFL. All rights reserved.

Architecting for the next generation of Big Data Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7

Data Mining with Hadoop at TACC

Hadoop Parallel Data Processing

Large-Scale Test Mining

Dragon Medical Enterprise Network Edition Technical Note: Requirements for DMENE Networks with virtual servers

Chapter 7. Using Hadoop Cluster and MapReduce

MapReduce and Hadoop Distributed File System

Introduction to DISC and Hadoop

Scalable Cloud Computing Solutions for Next Generation Sequencing Data

What s Cooking in KNIME

Getting Started with Hadoop with Amazon s Elastic MapReduce

Introduction Predictive Analytics Tools: Weka

Large-scale Data Mining: MapReduce and Beyond Part 2: Algorithms. Spiros Papadimitriou, IBM Research Jimeng Sun, IBM Research Rong Yan, Facebook

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

Big-data Analytics: Challenges and Opportunities

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

An Experimental Approach Towards Big Data for Analyzing Memory Utilization on a Hadoop cluster using HDFS and MapReduce.

HiBench Installation. Sunil Raiyani, Jayam Modi

Performance Evaluation for BlobSeer and Hadoop using Machine Learning Algorithms

Use of Hadoop File System for Nuclear Physics Analyses in STAR

KNIME TUTORIAL. Anna Monreale KDD-Lab, University of Pisa

Internals of Hadoop Application Framework and Distributed File System

MapReduce and Hadoop Distributed File System V I J A Y R A O

Evaluating partitioning of big graphs

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Radoop: Analyzing Big Data with RapidMiner and Hadoop

On the Performance of High Dimensional Data Clustering and Classification Algorithms

HiBench Introduction. Carson Wang Software & Services Group

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Integrating Apache Spark with an Enterprise Data Warehouse

CS380 Final Project Evaluating the Scalability of Hadoop in a Real and Virtual Environment

Cloud Computing. Chapter Hadoop

Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems

Understanding Hadoop Performance on Lustre

Data processing goes big

Does Hadoop only support Java? Surely we can't redesign everything from scratch; how do we stay compatible with legacy systems? Can Hadoop work with existing software?

Big Data Analytics Verizon Lab, Palo Alto

Performance Comparison of Intel Enterprise Edition for Lustre* software and HDFS for MapReduce Applications

How To Create A Graphlab Framework For Parallel Machine Learning

An analysis of suitable parameters for efficiently applying K-means clustering to large TCPdump data set using Hadoop framework

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Big Data and Apache Hadoop s MapReduce

PREA: Personalized Recommendation Algorithms Toolkit

COMP 598 Applied Machine Learning Lecture 21: Parallelization methods for large-scale machine learning! Big Data by the numbers

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

From GWS to MapReduce: Google s Cloud Technology in the Early Days

SQL Server 2005 Features Comparison

Big Data Analytics(Hadoop) Prepared By : Manoj Kumar Joshi & Vikas Sawhney

A bit about Hadoop. Luca Pireddu. March 9, CRS4Distributed Computing Group. (CRS4) Luca Pireddu March 9, / 18

RECOMMENDATION SYSTEM USING BLOOM FILTER IN MAPREDUCE

Lavastorm Analytic Library Predictive and Statistical Analytics Node Pack FAQs

ISSN: CONTEXTUAL ADVERTISEMENT MINING BASED ON BIG DATA ANALYTICS

Is a Data Scientist the New Quant? Stuart Kozola MathWorks

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

Mining Large Datasets: Case of Mining Graph Data in the Cloud

Policy-based Pre-Processing in Hadoop

Analysing Large Web Log Files in a Hadoop Distributed Cluster Environment

CSE-E5430 Scalable Cloud Computing Lecture 2

How To Scale Out Of A Nosql Database

Performance Analysis of Mixed Distributed Filesystem Workloads

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing

HadoopRDF : A Scalable RDF Data Analysis System

HADOOP PERFORMANCE TUNING

The Impact of Big Data on Classic Machine Learning Algorithms. Thomas Jensen, Senior Business Expedia

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

Benchmarking Hadoop & HBase on Violin

How To Use Hadoop

A Novel Cloud Based Elastic Framework for Big Data Preprocessing

Project 5 Twitter Analyzer Due: Fri :59:59 pm

MapReduce Evaluator: User Guide

Final Project Proposal. CSCI.6500 Distributed Computing over the Internet

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Benchmark Study on Distributed XML Filtering Using Hadoop Distribution Environment. Sanjay Kulhari, Jian Wen UC Riverside

Optimization and analysis of large scale data sorting algorithm based on Hadoop

OPINION MINING IN PRODUCT REVIEW SYSTEM USING BIG DATA TECHNOLOGY HADOOP

Bringing Big Data Modelling into the Hands of Domain Experts

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

IT services for analyses of various data samples

Image Search by MapReduce

LabStats 5 System Requirements

Transcription:

Slope One Recommender on Hadoop YONG ZHENG Center for Web Intelligence DePaul University Nov 15, 2012

Overview Introduction Recommender Systems & Slope One Recommender Distributed Slope One on Mahout and Hadoop Experimental Setup and Analyses Drive Mahout on Hadoop Interesting Communities Center for Web Intelligence, DePaul University, USA

Introduction About me: a recommendation guy. My research: data mining and recommender systems. A typical experimental research cycle: 1) design or improve an algorithm; 2) run it and the baseline algorithms on datasets; 3) compare the experimental results; 4) try different parameters, look for explanations, and perhaps redesign and improve the algorithm itself; 5) run the algorithms and baselines on the datasets again; 6) compare the results again; 7) tune parameters, dig for reasons, and refine the algorithm once more; 8) and so on, until the results meet expectations.

Introduction Sometimes the data is large-scale: a single algorithm run may take days to complete, and what if the experimental results are not as expected? Then we improve the algorithm and run it for days again, and again. What could we do previously (for tasks that were not that complicated)? 1) Parallelize, but with complicated synchronization and limited resources such as CPU, memory, etc.; 2) take advantage of the PC labs: let's do it with 10 PCs. Nearly all research will ultimately face large-scale problems, especially in the domain of data mining. But we have Map-Reduce NOW!

Introduction We do not need to distribute data and tasks manually; instead, we simply provide configurations. We do not need to care about the low-level details, e.g. how the data is distributed, which machine a specific task runs on, or how the tasks are carried out one after another; instead, we can pre-define the workflow. We can take advantage of the functional decomposition into mappers and reducers. More benefits: replication, load balancing, robustness, etc.

Recommender Systems Collaborative Filtering Slope One and Simple Weighted Slope One Slope One in Mahout Distributed Slope One in Mahout Mappers and Reducers Center for Web Intelligence, DePaul University, USA

Recommender Systems

Collaborative Filtering (CF) One of the most popular families of recommendation algorithms. User-based: User-CF. Item-based: Item-CF, Slope One. (Figure: a toy rating matrix illustrating user-based collaborative filtering, where a user's missing rating is predicted from the ratings of similar users.)

Slope One Recommender Reference: Daniel Lemire, Anna Maclachlan, Slope One Predictors for Online Rating-Based Collaborative Filtering, SIAM Data Mining (SDM'05), April 21-23, 2005. http://lemire.me/fr/abstracts/sdm2005.html

User  Batman  Spiderman
U1    3       4
U2    2       4
U3    2       ?

1) How differently were the two movies rated? U1 rated Spiderman higher by (4-3) = 1; U2 rated Spiderman higher by (4-2) = 2; on average, Spiderman is rated (1+2)/2 = 1.5 higher. 2) The rating difference drives the prediction: if we know U3 gave Batman 2 stars, he will probably rate Spiderman around (2+1.5) = 3.5 stars.
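
For reference, the deviation and the (unweighted) Slope One prediction sketched above are usually written as follows, where S_{j,i} is the set of users who rated both items j and i, and R_j is the set of items the target user u has rated that co-occur with j (notation loosely follows Lemire & Maclachlan; this is a paraphrase, not a quote from the paper):

    \mathrm{dev}_{j,i} = \frac{1}{|S_{j,i}|} \sum_{u \in S_{j,i}} (u_j - u_i), \qquad P(u)_j = \frac{1}{|R_j|} \sum_{i \in R_j} \big( \mathrm{dev}_{j,i} + u_i \big)

In the toy example, dev_{Spiderman,Batman} = 1.5 and P(U3)_{Spiderman} = 2 + 1.5 = 3.5.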

Simple Weighted Slope One Usually a user has rated multiple items.

User  HarryPotter  Batman  Spiderman
U1    5            3       4
U2    ?            2       4
U3    4            2       ?

1) How differently were the two movies rated? Diff(Batman, Spiderman) = [(4-3)+(4-2)]/2 = 1.5; Diff(HarryPotter, Spiderman) = (4-5)/1 = -1. The 2 and the 1 here are what we call the counts. 2) The weighted rating differences drive the prediction; we use a simple weighted approach. Referring to Batman only, the rating = 2+1.5 = 3.5; referring to HarryPotter only, the rating = 4-1 = 3. Considering both, the predicted rating = (3.5*2 + 3*1)/(2+1) = 3.33.
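
The weighted computation above fits in a few lines of plain Java. The following is a minimal sketch on the toy matrix; it is not Mahout code, and the class and method names are made up for illustration:

    import java.util.*;

    public class WeightedSlopeOneToy {
        // toy ratings: user -> (item -> rating); an absent entry means "not rated"
        static Map<String, Map<String, Double>> ratings = new HashMap<>();

        public static void main(String[] args) {
            put("U1", "HarryPotter", 5); put("U1", "Batman", 3); put("U1", "Spiderman", 4);
            put("U2", "Batman", 2);      put("U2", "Spiderman", 4);
            put("U3", "HarryPotter", 4); put("U3", "Batman", 2);
            System.out.println(predict("U3", "Spiderman"));   // prints ~3.33
        }

        static void put(String user, String item, double rating) {
            ratings.computeIfAbsent(user, k -> new HashMap<>()).put(item, rating);
        }

        // weighted Slope One: sum over rated items j of (avgDiff(target, j) + r_uj) * count,
        // divided by the summed counts
        static double predict(String user, String target) {
            double num = 0, den = 0;
            for (Map.Entry<String, Double> rated : ratings.get(user).entrySet()) {
                String j = rated.getKey();
                if (j.equals(target)) continue;
                double diffSum = 0;
                int count = 0;
                for (Map<String, Double> r : ratings.values()) {      // every user
                    if (r.containsKey(target) && r.containsKey(j)) {   // who co-rated both items
                        diffSum += r.get(target) - r.get(j);
                        count++;
                    }
                }
                if (count == 0) continue;
                num += (rated.getValue() + diffSum / count) * count;   // weight by the co-rating count
                den += count;
            }
            return den == 0 ? Double.NaN : num / den;
        }
    }

Running it reproduces the hand computation: (3.5*2 + 3*1)/(2+1) = 3.33.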

Simple Weighted Slope One

User  HarryPotter  Batman  Spiderman
U1    5            3       4
U2    ?            2       4
U3    4            2       ?

To calculate the predicted ratings, we need two matrices. Question: online or offline?

1) Difference Matrix (lower triangle, illustrative values):

          Movie1  Movie2  Movie3  Movie4
Movie1
Movie2    -1.5
Movie3     2       1
Movie4    -1       0.5    -2

2) Count Matrix: just the number of users who co-rated each pair of items.

Slope One in Mahout Mahout is an open-source machine learning library. 1) Recommendation algorithms: User-based CF, Item-based CF, Slope One, etc. 2) Clustering: KMeans, Fuzzy KMeans, etc. 3) Classification: Decision Trees, Naive Bayes, SVM, etc. 4) Latent factor models: LDA, SVD, matrix factorization, etc.

Slope One in Mahout org.apache.mahout.cf.taste.impl.recommender.slopeone.SlopeOneRecommender

Pre-processing stage (class MemoryDiffStorage, backed by a Map):
for every item i
  for every other item j
    for every user u expressing a preference for both i and j
      add the difference in u's preferences for i and j to an average

Recommendation stage:
for every item i the user u expresses no preference for
  for every item j that user u expresses a preference for
    find the average preference difference between j and i
    add this diff to u's preference value for j
    add this to a running average
return the top items, ranked by these averages

Simple weighting: as introduced previously. StdDev weighting: item-item rating diffs with a lower standard deviation should be weighted more highly.
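
For comparison with the distributed version discussed next, a minimal single-machine usage sketch against Mahout 0.6's Taste API might look like this (the file name, user ID, and item ID are placeholders; the input format is userID,itemID,rating per line):

    import java.io.File;
    import java.util.List;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.recommender.slopeone.SlopeOneRecommender;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;

    public class SlopeOneDemo {
        public static void main(String[] args) throws Exception {
            DataModel model = new FileDataModel(new File("ratings.csv"));      // placeholder path
            SlopeOneRecommender recommender = new SlopeOneRecommender(model);  // in-memory diff storage

            System.out.println(recommender.estimatePreference(3L, 101L));      // predict one rating
            List<RecommendedItem> top = recommender.recommend(3L, 5);          // top-5 items for user 3
            for (RecommendedItem item : top) {
                System.out.println(item.getItemID() + " : " + item.getValue());
            }
        }
    }

This is the "live" (online) recommender: the diff storage is built in memory when the recommender is constructed.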

Distributed Slope One in Mahout Similar to our earlier practice (e.g. the matrix factorization process), what we need is the difference matrix. Suppose M users rated N items; the matrix requires N(N-1)/2 cells. Density is another factor, i.e. how many items each user rated: if there are many items and the rating matrix is dense, the computational cost increases accordingly. The question again: online or offline? It depends on the task and the data. For large-scale data, let's do it offline!
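
As a rough illustration with the MovieLens-1M catalogue used later (N = 3,900 movies), the difference matrix alone needs

    \frac{N(N-1)}{2} = \frac{3900 \times 3899}{2} = 7{,}603{,}050

item-item cells, and the counts (and standard deviations) need the same again, which is why computing and storing them offline on a cluster becomes attractive.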

Distributed Slope One in Mahout package org.apache.mahout.cf.taste.hadoop.slopeone: class SlopeOneAverageDiffsJob, class SlopeOnePrefsToDiffsReducer, class SlopeOneDiffsToAveragesReducer. package org.apache.mahout.cf.taste.hadoop: class ToItemPrefsMapper (an org.apache.hadoop.mapreduce.Mapper). Two Mapper-Reducer stages: 1) create the diff-matrix entries for each user; 2) collect the average diffs, counts, and StdDevs. Let's see how it works.
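
A minimal sketch of launching stage 1 programmatically is shown below. It assumes the job follows Mahout's usual AbstractJob/Tool conventions; the HDFS paths are placeholders and the option names should be verified against your Mahout version:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.util.ToolRunner;
    import org.apache.mahout.cf.taste.hadoop.slopeone.SlopeOneAverageDiffsJob;

    public class RunDiffsJob {
        public static void main(String[] args) throws Exception {
            // placeholder paths; --input/--output follow Mahout's AbstractJob conventions
            String[] jobArgs = {
                "--input",  "hdfs:///user/demo/ratings/train.csv",
                "--output", "hdfs:///user/demo/slopeone/diffs"
            };
            int exitCode = ToolRunner.run(new Configuration(), new SlopeOneAverageDiffsJob(), jobArgs);
            System.exit(exitCode);
        }
    }

The same job can also be submitted from the command line with hadoop jar, as shown on the "Mahout + Hadoop" slides later.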

Mapper and Reducer - 1

User  HarryPotter  Batman  Spiderman
U1    5            3       4
U2    ?            2       4
U3    4            2       ?

Mapper1 (ToItemPrefsMapper) emits <UserID, Pair<ItemID, Rating>>. Reducer1 (PrefsToDiffsReducer) emits <Pair<Item1, Item2>, Diff> for each of the three users; the per-user diff matrices for U1 and U2 look like this:

<U1>     Potter  Bat  Spider
Potter
Bat      -2
Spider   -1      1

<U2>     Potter  Bat  Spider
Potter
Bat      NULL
Spider   NULL    2
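
A simplified re-implementation of this stage is sketched below. It is not the actual Mahout code: item pairs are encoded as comma-separated Text keys to avoid a custom Writable, the input is assumed to be plain userID,itemID,rating lines, and the class names only mirror Mahout's:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import org.apache.hadoop.io.FloatWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Stage 1, map side: one "userID,itemID,rating" line in -> (userID, "itemID:rating") out
    class PrefsMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] f = line.toString().split(",");
            context.write(new LongWritable(Long.parseLong(f[0])), new Text(f[1] + ":" + f[2]));
        }
    }

    // Stage 1, reduce side: all ratings of one user in -> one ("item1,item2", diff) record per co-rated pair out
    class PrefsToDiffsReducer extends Reducer<LongWritable, Text, Text, FloatWritable> {
        @Override
        protected void reduce(LongWritable userID, Iterable<Text> prefs, Context context)
                throws IOException, InterruptedException {
            List<String[]> items = new ArrayList<>();
            for (Text p : prefs) {
                items.add(p.toString().split(":"));
            }
            // sort by item ID so every pair always gets the same canonical key
            items.sort(Comparator.comparing((String[] x) -> x[0]));
            for (int i = 0; i < items.size(); i++) {
                for (int j = i + 1; j < items.size(); j++) {
                    float diff = Float.parseFloat(items.get(j)[1]) - Float.parseFloat(items.get(i)[1]);
                    context.write(new Text(items.get(i)[0] + "," + items.get(j)[0]), new FloatWritable(diff));
                }
            }
        }
    }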

Mapper and Reducer - 2

<U1>     Potter  Bat  Spider
Potter
Bat      -2
Spider   -1      1

<U2>     Potter  Bat  Spider
Potter
Bat      NULL
Spider   NULL    2

Mapper2 (the plain org.apache.hadoop.mapreduce.Mapper, i.e. the identity mapper) passes these per-user diffs through, and Reducer2 (DiffsToAveragesReducer) aggregates them into average diffs, counts, and StdDevs:

<Aggregate>  Potter   Bat      Spider
Potter
Bat          -2, 1
Spider       -1, 1    1.5, 2

Here an <a, b> pair denotes a = average diff, b = count. Notice: we should use three matrices in practice; here I used two.
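
Continuing the same simplified sketch, stage 2 only needs an identity mapper plus a reducer that averages the per-user differences of each item pair; Mahout's real reducer also keeps the standard deviation, while here only the average and the count are carried in the output value:

    import java.io.IOException;
    import org.apache.hadoop.io.FloatWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Stage 2, reduce side: all per-user diffs of one item pair in -> ("item1,item2", "avgDiff,count") out
    class DiffsToAveragesReducer extends Reducer<Text, FloatWritable, Text, Text> {
        @Override
        protected void reduce(Text itemPair, Iterable<FloatWritable> diffs, Context context)
                throws IOException, InterruptedException {
            double sum = 0;
            long count = 0;
            for (FloatWritable d : diffs) {
                sum += d.get();
                count++;
            }
            context.write(itemPair, new Text((sum / count) + "," + count));
        }
    }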

Predictions

User  HarryPotter  Batman  Spiderman
U1    5            3       4
U2    ?            2       4
U3    4            2       ?

<Aggregate>  Potter   Bat      Spider
Potter
Bat          -2, 1
Spider       -1, 1    1.5, 2

Here an <a, b> pair denotes a = average diff, b = count (in practice we should use three matrices; here I used two).

Prediction(U3, Spiderman) = [(4-1)*1 + (2+1.5)*2] / (1+2) = 3.33

Experiments Data Hadoop Setup Running Performances Center for Web Intelligence, DePaul University, USA

Experiment Setup Data: MovieLens-1M ratings. # of users: 6,040; # of movies: 3,900; # of ratings: 1,000,209. Density of the ratings: each user has at least 20 ratings; obviously, some users have many more. Rating format: UserID, ItemID, Rating (scale 1-5). Data split: 80% training, 20% testing.

Experiment Setup Hadoop Cluster Setup IBM SmartCloud: 1 master node, 7 slave nodes. Each node runs SUSE Linux Enterprise Server v11 SP1. Server configuration: 64-bit (vCPU: 2, RAM: 4 GiB, disk: 60 GiB). Hadoop v0.20.205.0, Mahout distribution 0.6. The environment setup follows the typical workflow described at: http://irecsys.blogspot.com/2012/11/configurate-map-reduceenvironment-on.html Thanks, Scott Young, for the neat write-up!

Experimental Analyses Stage 1: SlopeOneAverageDiffsJob via Map-Reduce. Goal: build the DiffStorage. Output: DiffStorage txt file, 1.45 GB. Running time: real 13m 34.228s, user 0m 5.136s, sys 0m 1.028s. A sample DiffStorage row:

Item1  Item2  Diff   Count  StdDev
221    223    -1.02  197    0.5

Stage 2: a Java evaluator measures MAE on the testing set. Running time: load testing set (21K records), 299 ms; load training set (79K records), 1,771 ms; load DiffStorage, 176,352 ms = 2.9 min; prediction (21K records), 18,182 ms = 0.3 min. MAE = 0.71330756.
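
For reference, the MAE reported here is simply the mean absolute difference between predicted and actual ratings over the test set T:

    \mathrm{MAE} = \frac{1}{|T|} \sum_{(u,i) \in T} \left| \hat{r}_{u,i} - r_{u,i} \right|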

Experimental Experiences 1. Why not the MovieLens-10M data? Map-Reduce on the 10M data may cost several hours; the running time depends on the cluster and its configuration; also, the DiffStorage file would be too large. 2. Java evaluator: loading the full DiffStorage file is time-consuming and incurs Java heap space and GC overhead limit errors; those errors cannot be fixed by -Xmx or other simple tweaks. Two solutions: 1) just use simple weighting and discard the StdDev weighting; 2) write a simple Mapper and Reducer and run the evaluation on the cluster as well. For MovieLens-1M, this is not that efficient compared with the live Slope One recommendation; the 10M data may fare better, and I will try MovieLens-10M later. Slope One is simple but memory-expensive.

More Drive Mahout on Hadoop Interesting Communities Center for Web Intelligence, DePaul University, USA

Mahout + Hadoop How do we put more Mahout algorithms onto Hadoop? 1. Pre-set commands in Mahout. Look at bin/mahout's help output: it lists the available programs, such as svd, fkmeans, etc. Some are basic utilities, such as splitDataset; some can be executed as Hadoop jobs, e.g. running and evaluating matrix factorization on a rating dataset:

bin/mahout parallelALS --input inputSource --output outputSource --tempDir tmpFolder --numFeatures 20 --numIterations 10

bin/mahout evaluateFactorization --input inputSource --output outputSource --userFeatures als/out/U/ --itemFeatures als/out/M/ --tempDir tmpFolder

Mahout + Hadoop 2. More algorithms on Hadoop. Mahout provides a way to run more Mahout algorithms on Hadoop. Simply:

$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/core/target/mahout-core-<version>.jar <Job Class> --recommenderClassName Class <OPTIONS>

Which kinds of jobs does it support? Mahout has implemented several. Some popular ones:
1) org.apache.mahout.cf.taste.hadoop.pseudo.RecommenderJob --recommenderClassName ClassName
2) org.apache.mahout.cf.taste.hadoop.item.RecommenderJob
3) org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob
4) org.apache.mahout.cf.taste.hadoop.slopeone.SlopeOneAverageDiffsJob

Interesting Communities Beyond the official Hadoop and Mahout sites: 1. Data Mining: KDnuggets, http://www.kdnuggets.com, a popular community for data mining & analytics with lots of useful information such as news, materials, datasets, jobs, etc. 2. Big Data: SmartData Collective, http://smartdatacollective.com/; Smarter Computing, http://www.smartercomputingblog.com/; Big Data Meetup, http://big-data.meetup.com/ 3. Recommender Systems: ACM RecSys official site, http://recsys.acm.org/; RecSys Wiki, http://recsyswiki.com/

Thank You! Center for Web Intelligence, DePaul University, USA