Collaborative Filtering Scalable Data Analysis Algorithms Claudia Lehmann, Andrina Mascher



Outline
1. Retrospection
2. Stratosphere Plans
3. Comparison with Hadoop
4. Evaluation
5. Outlook

Retrospection
Matrix factorization: find factor matrices W and H whose product W x H approximates the sparse rating matrix V (missing ratings marked "?"). The factors are optimized with Stochastic Gradient Descent.
[Figure: example matrices V, W, H and the reconstructed product W x H]
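The factorization objective above can be sketched as a plain single-machine Python program (a hypothetical illustration of the math only; the deck's actual implementation runs as Hadoop jobs and Stratosphere plans):

```python
import random

def sgd_factorize(cells, n_rows, n_cols, rank=2, lr=0.01, epochs=2000, seed=0):
    """Factorize a sparse rating matrix V, given as (row, col, value) cells,
    into W (n_rows x rank) and H (rank x n_cols) via stochastic gradient descent."""
    rng = random.Random(seed)
    W = [[rng.random() for _ in range(rank)] for _ in range(n_rows)]
    H = [[rng.random() for _ in range(n_cols)] for _ in range(rank)]
    for _ in range(epochs):
        for i, j, v in cells:
            pred = sum(W[i][k] * H[k][j] for k in range(rank))
            err = v - pred
            for k in range(rank):           # update both factors from the error
                w_ik = W[i][k]
                W[i][k] += lr * err * H[k][j]
                H[k][j] += lr * err * w_ik
    return W, H

def rmse(cells, W, H):
    """Root mean squared error of W x H against the observed cells."""
    rank = len(H)
    se = sum((v - sum(W[i][k] * H[k][j] for k in range(rank))) ** 2
             for i, j, v in cells)
    return (se / len(cells)) ** 0.5
```

The distributed versions in the following slides split exactly this loop into per-triple SGD updates, factor averaging, and a loss aggregation.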

Retrospection
[Figure: result of the SGD step as Hadoop jobs, with judging loss]

Stratosphere Plans: Multiple Iterations

Hadoop jobs:
for each starting point:
    while loss too high:
        MapReduce: optimize factors
        MapReduce: join triples
        MapReduce: calculate loss (training)
        calculate loss (judging)

Stratosphere plans:
for each starting point:
    while stop file not empty:
        Map-Map-Reduce: optimize factors
        Cross-Match: join triples
        Map-Reduce-Cross: calculate loss (training) and decide
        Map-Match-Reduce: calculate loss (judging)
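Both loops share the same driver structure; a minimal Python sketch (the step functions and the improvement threshold are hypothetical stand-ins for the plans and the stop-file decision):

```python
def run_sgd(optimize_factors, join_triples, calc_training_loss, calc_judging_loss,
            max_iterations=30, min_improvement=1e-3):
    """Driver sketch for one starting point: chain the four per-iteration
    steps and stop once the training loss no longer improves enough
    (the role the stop file plays in the Stratosphere plans)."""
    history = []
    for _ in range(max_iterations):
        factors = optimize_factors()
        triples = join_triples(factors)
        loss = calc_training_loss(triples)
        history.append(loss)
        if len(history) >= 2 and history[-2] - loss < min_improvement:
            break  # decided: loss is no longer "too high"
    return calc_judging_loss(), history
```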

Stratosphere Plans: Subplans in One Iteration
[Dataflow diagram: the training data (3,7,b,-,-) and the triples (3,7,b,w,h; keys: row, column) feed "Optimize factors" and "Join triples"; "Calc loss (training)" appends to the loss history (0.9; 1.0; 2.4) and writes the stop file ("file not empty: stop?"); "Calc loss (judging)" consumes the judging data (Saw,Tom,4) and yields the judging loss (0.96).]

Stratosphere Plans: Optimize Factors
[Plan diagram: a Map (SGD) processes the triples (3,7,b,w,h; 5,7,b,w,h) and emits updated factor fields; two filter Maps split the output into W records (3,-,-,w,-; 5,-,-,w,-) and H records (-,7,-,-,h); one Reduce (average) per factor produces factor W and factor H. The PACTs are configured to read only the relevant fields; intermediate factors are not materialized.]
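The logic of this subplan can be sketched in Python (hypothetical record layout: a triple carries its row, column, value, and the current factor vectors w and h):

```python
from collections import defaultdict

def sgd_map(triple, lr=0.01):
    """Map (SGD): one gradient step on a triple (row, col, value, w, h);
    emits the updated w under the row key and the updated h under the column key."""
    i, j, v, w, h = triple
    pred = sum(wk * hk for wk, hk in zip(w, h))
    err = v - pred
    new_w = [wk + lr * err * hk for wk, hk in zip(w, h)]
    new_h = [hk + lr * err * wk for wk, hk in zip(w, h)]
    return (i, new_w), (j, new_h)

def average_reduce(records):
    """Reduce (average): average all updated vectors sharing a key,
    as done per factor (W or H) in the plan."""
    groups = defaultdict(list)
    for key, vec in records:
        groups[key].append(vec)
    return {key: [sum(component) / len(vecs) for component in zip(*vecs)]
            for key, vecs in groups.items()}
```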

Stratosphere Plans: Join Triples (1/2)
[Plan diagram: in Hadoop, a Map replicates the factors and keeps the training data (it needs the matrix dimensions), and a Reduce groups them into triples. In Stratosphere, a Cross combines factor W (3,-,-,w,-) and factor H (-,7,-,-,h) into (3,7,-,w,h), and a Match on (row, col) joins this with the training data (3,7,b,-,-) into the triples (3,7,b,w,h).]

Stratosphere Plans: Join Triples (2/2)
[Plan diagram, two alternatives: (a) Cross of factor W (3,-,-,w,-) with factor H (-,7,-,-,h) into (3,7,-,w,h), then Match on (row, col) with the training data (3,7,b,-,-); (b) Match on (row) of factor W with the training data into (3,7,b,w,-), then Match on (col) with factor H. Both yield the triples (3,7,b,w,h). Are compiler hints helpful here?]
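The second alternative (Match on row, then Match on col) can be sketched as two hash joins in Python (hypothetical record layout as on the slides):

```python
def match_row(factor_w, training):
    """Match on (row): attach the w vector to each training cell (i, j, v)."""
    w_by_row = dict(factor_w)                  # row -> w vector
    return [(i, j, v, w_by_row[i]) for i, j, v in training]

def match_col(partial, factor_h):
    """Match on (col): attach the h vector, completing the triples."""
    h_by_col = dict(factor_h)                  # col -> h vector
    return [(i, j, v, w, h_by_col[j]) for i, j, v, w in partial]
```

Unlike the Cross variant, neither Match needs the full matrix dimensions; each join only touches the keys that actually occur in the training data.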

Stratosphere Plans: Calculate Loss (Training)
In Hadoop, a Map computes the local loss per triple and a Reduce aggregates the RMSE (e.g. 0.9); the driver class knows the loss history and decides on stopping.
In Stratosphere, the Map (local loss) emits (dummy key, loss, #points); the Reduce aggregates the RMSE; a Cross with the loss history lets an OutputFormat decide whether to stop (writing the stop file). The losses of epoch e (1.0; 2.4) become the history of epoch e+1 (0.9; 1.0; 2.4).
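The Map/Reduce pair for the training loss can be sketched in Python (hypothetical record layout; the dummy key routes every local loss to one global aggregation, as on the slide):

```python
def loss_map(triple):
    """Map (local loss): squared error for one triple (row, col, value, w, h),
    emitted under a dummy key as (squared error, #points)."""
    _, _, v, w, h = triple
    pred = sum(wk * hk for wk, hk in zip(w, h))
    return ("dummy", ((v - pred) ** 2, 1))

def loss_reduce(pairs):
    """Reduce (RMSE): aggregate all (squared error, count) pairs into one RMSE."""
    se = sum(p[0] for _, p in pairs)
    n = sum(p[1] for _, p in pairs)
    return (se / n) ** 0.5
```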

Stratosphere Plans: Calculate Loss (Judging)
In Hadoop there is no MapReduce job; the driver class receives the loss (0.96) directly from the factors and the Netflix judging files.
In Stratosphere, a Map emits predicted cells (Saw,Tom,4.8) from the triples (3,7,b,w,h); a Match on (mid, uid) with the judging data (Saw,Tom,4) computes the local loss (dummy, 0.8², 1); a Reduce aggregates the RMSE (0.96). This is similar to the training loss, with the Map folded into the Match.

Comparison: Jobs vs. Plans
- Equal results (once randomness is removed)
- Files: starting points, training sequence
- Factors are not materialized; separate factor files are possible
- A new file is created for each iteration
- No efficient serialization between plans (yet): either parse text files, or use sequence files (single-threaded)

Comparison: Data Schema
Hadoop: key "3#7" with TripleStorage (the storage class is a tagged union); dummy key with LossStorage; getters, setters, toString().
Stratosphere: record (3,7,b,w,h); (dummy key, loss, #points); key positions must be remembered; PACTs can be reused with configurations for different keys; composite keys are possible.

Comparison: Preprocessing
Required: line format according to parameters; copy to HDFS.
Hadoop: serialize factors and blocks; use a MapFile to write serialized values to HDFS.
Stratosphere: a Java process defines the lines; a shell script moves them to HDFS; extra PACTs parse the lines.

Stratosphere Preprocessing: Define Line Format
[Pipeline diagram: (0) the Netflix files yield factorw.txt and factorh.txt; (1) a Reduce groups cells into blocks (blocks.txt); (2) triples are created; the SGD Step Plan and "calc loss: training" consume them, and the Netflix judging files feed the judging loss.]

Suggestions
- Global aggregation of the loss with a dummy key: a reducer with no key, or a reducer with compiler hint nrofkeys = 1 (no sorting needed)
- Provide toString() in PactRecord
- Provide getters and setters in PactRecord to encapsulate field numbers and class types
- Keep configuration options (e.g. Reduce: average W or H)
- Sequence files for sinks and sources
- Move log files to the master node

Evaluation: Parameters for Netflix Data
- starting points: 1; max. iterations: 1
- degree of parallelism: 1, 2, 5, 10
- step size:
- factor size: 5
- data size: 1/8 (63K users, ~2K movies), 1/4 (125K, ~4K), 1/2 (250K, 9K), 1 (500K users, 18K movies)
- block size: 1000 movies x 1000 users
- per iteration: Map-Map-Reduce (optimize factors), Cross-Match (join triples), Map-Reduce-Cross (calculate loss (training) and decide), Map-Match-Reduce (calculate loss (judging))

Evaluation: Run Time for Variable Data Size
[Charts: run time (0 h 00 min to 2 h 52 min) over data sizes 1/8, 1/4, 1/2, 1 for block sizes 50 x 50 and 1000 x 1000, split into Init Blocks (reduce cells), Init Join (cross, match), and SGD Step; run time grows by a factor of 4.]
Notes: 10 nodes vs. DoP = 8; usually 10-30 iterations; no sequence file between plans.

Evaluation: Run Time for Variable Degree of Parallelism
[Chart: data size 1/8; run time (0-20 min) per subplan for DoP 1, 2, 5, 10: optimize factors (Map, Map, Reduce), join triples (Cross, Match, write triples), calc loss (training) (Map, Reduce, Cross), calc loss (judging) (Map, Match, Reduce). Higher DoP is faster; DoP = 5 performs best, but the data is small. Reading and writing always run with DoP = 1.]

Outlook
- Join triples with Cross-Match vs. Match-Match
- Degree of parallelism: 10
- Inspect the judging outcome: the RMSE should be equal
- Evaluate DoP with bigger data

Summary

References
- Anand Rajaraman and Jeff Ullman. Mining of Massive Datasets. Cambridge University Press, 2010.
- Rainer Gemulla, Peter J. Haas, Erik Nijkamp, and Yannis Sismanis. Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent. IBM Research Report RJ10481, March 2011.
- [1] http://www.mathworks.de/matlabcentral/fx_files/27631/1/fff.png, November 2011