
Last.fm: A Case Study

Last.fm is an Internet radio and community-driven music discovery service founded in 2002. Users transmit information to Last.fm servers indicating which songs they are listening to. The received data is processed and stored so that users can access it in the form of charts, and so that Last.fm can make intelligent taste and compatibility decisions for generating recommendations and radio stations.

Track listening data is obtained from one of two sources:
A listen is a scrobble when a user plays a track of his or her own and a client application sends the information to Last.fm.
A listen is a radio listen when the user tunes into a Last.fm radio station and streams a song.

Last.fm applications also allow users to love, skip, or ban each track they listen to. This track listening data is likewise transmitted to the server.

Big Data at Last.fm

Over 40M unique visitors and 500M pageviews each month.

Scrobble stats:
Up to 800 scrobbles per second
More than 40 million scrobbles per day
Over 75 billion scrobbles so far

Radio stats:
Over 10 million streaming hours per month
Over 400 thousand unique stations per day

Each scrobble and radio listen generates at least one log line. In other words... lots of data!

Last.fm's Herd of Elephants

100 nodes
8 cores per node (dual quad-core)
24 GB memory per node
8 TB storage per node (4 disks of 2 TB each)
Hive integration to run optimized SQL queries for analysis

Last.fm started using Hadoop in 2006 as its user base grew from thousands to millions. It runs hundreds of daily, weekly, and monthly jobs, including:
Site stats and metrics
Chart generation (track statistics)
Metadata corrections (e.g., misspellings of artists)
Indexing for search and combining/formatting data for recommendations
Data insights, evaluations, and reporting

This case study focuses on the chart generation (Track Statistics) job, which was the first Hadoop implementation at Last.fm.

Features of the case study

Shows how to handle k-tuples, where k > 2, using the Writable and WritableComparable interfaces from the Hadoop library.
Shows how to chain multiple MapReduce phases using the ToolRunner, Tool, and Configuration classes/interfaces.
Shows how to merge data from two different streams using MapReduce.

Last.fm Chart Generation (Track Statistics)

The goal of the Track Statistics program is to take incoming listening data and summarize it into a format that can be displayed on the website or used as input to other Hadoop programs.

Track Statistics Program

Two jobs calculate values from the data, and a third job merges their results.

Track Statistics Jobs

Input: gigabytes of space-delimited text files of the form (userId trackId scrobble radioPlay skip), where the last three fields are 0 or 1.

UserId   TrackId  Scrobble  Radio  Skip
111115   222      0         1      0
111113   225      1         0      0
111117   223      0         1      1
111115   225      1         0      0

Output: the charts require the following statistics per track:
Number of unique listeners
Number of scrobbles
Number of radio listens
Total number of listens
Number of radio skips

TrackId      numListeners  numPlays     numScrobbles  numRadioPlays  numSkips
IntWritable  IntWritable   IntWritable  IntWritable   IntWritable    IntWritable
222          1             1            0             1              0
223          1             1            0             1              1
225          2             2            2             0              0
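
The mapper code on the following slides indexes into each split line using constants such as TrackStatistics.COL_USERID, which the slides never show. A minimal sketch of what that helper might look like, assuming the five-column layout above (this class body is our reconstruction, not the original Last.fm code):

public class TrackStatistics {
    /* Column indices for the space-delimited log format shown above
       (hypothetical reconstruction; the original class is not shown) */
    public static final int COL_USERID   = 0;
    public static final int COL_TRACKID  = 1;
    public static final int COL_SCROBBLE = 2;
    public static final int COL_RADIO    = 3;
    public static final int COL_SKIP     = 4;
}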

UniqueListeners Job

Calculates the number of unique listeners per track.

UniqueListenersMapper input: the key is the position of the current log entry in the file; the value is the space-delimited log entry.

LineOfFile    UserId       TrackId      Scrobble  Radio    Skip
LongWritable  IntWritable  IntWritable  Boolean   Boolean  Boolean
0             111115       222          0         1        0
1             111113       225          1         0        0
2             111117       223          0         1        1
3             111115       225          1         0        0

UniqueListenersMapper function:

if (scrobbles <= 0 && radioListens <= 0)
    output nothing;
else
    output(trackId, userId);

Example: map(0, "111115 222 0 1 0") → <222, 111115>

UniqueListenersMapper output:

TrackId      UserId
IntWritable  IntWritable
222          111115
225          111113
223          111117
225          111115

UniqueListenersMapper Class

/**
 * Processes space-delimited raw listening data and emits the
 * user ID associated with each track ID.
 */
public static class UniqueListenersMapper
        extends Mapper<LongWritable, Text, IntWritable, IntWritable> {

    public void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] parts = line.toString().split(" ");
        int scrobbles = Integer.parseInt(parts[TrackStatistics.COL_SCROBBLE]);
        int radioListens = Integer.parseInt(parts[TrackStatistics.COL_RADIO]);
        /* Ignore track if marked with zero plays */
        if (scrobbles <= 0 && radioListens <= 0) {
            return;
        }
        /* Output user ID against track ID */
        IntWritable trackId = new IntWritable(
                Integer.parseInt(parts[TrackStatistics.COL_TRACKID]));
        IntWritable userId = new IntWritable(
                Integer.parseInt(parts[TrackStatistics.COL_USERID]));
        context.write(trackId, userId);
    }
}

UniqueListenersReducer

UniqueListenersReducer input: the key is the TrackId output by UniqueListenersMapper; the value is an iterator over the list of all UserIds who listened to the track.

TrackId      Iterable<UserId>
IntWritable  Iterable<IntWritable>
222          [111115]
225          [111113, 111115]
223          [111117]

UniqueListenersReducer function: add the UserIds to a HashSet while iterating over the list. Since a HashSet doesn't store duplicates, the size of the set is the number of unique listeners.

Example: reduce(225, [111115, 111113]) → <225, 2>

UniqueListenersReducer output:

TrackId      numListeners
IntWritable  IntWritable
222          1
223          1
225          2

UniqueListenersReducer Class

/**
 * Receives a list of user IDs per track ID and puts them into a
 * set to remove duplicates. The size of this set (the number of
 * unique listeners) is emitted for each track.
 */
public static class UniqueListenersReducer
        extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {

    public void reduce(IntWritable trackId, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
        Set<Integer> userIds = new HashSet<Integer>();
        /* Add all users to the set; duplicates are automatically removed */
        for (IntWritable userId : values) {
            userIds.add(Integer.valueOf(userId.get()));
        }
        /* Output trackId -> number of unique listeners per track */
        context.write(trackId, new IntWritable(userIds.size()));
    }
}

SumTrackStats Job

Adds up the scrobble, radio, and skip values for each track.

SumTrackStatsMapper input: the same as UniqueListenersMapper.

LineOfFile    UserId       TrackId      Scrobble  Radio    Skip
LongWritable  IntWritable  IntWritable  Boolean   Boolean  Boolean
0             111115       222          0         1        0
1             111113       225          1         0        0
2             111117       223          0         1        1
3             111115       225          1         0        0

SumTrackStatsMapper function: simply parse the input and emit the values as a new TrackStats object (see next slide).

Example: map(0, "111115 222 0 1 0") → <222, new TrackStats(0, 1, 0, 1, 0)>

SumTrackStatsMapper output:

TrackId      numListeners  numPlays     numScrobbles  numRadioPlays  numSkips
IntWritable  IntWritable   IntWritable  IntWritable   IntWritable    IntWritable
222          0             1            0             1              0
225          0             1            1             0              0
223          0             1            0             1              1
225          0             1            1             0              0

TrackStats object

public class TrackStats implements WritableComparable<TrackStats> {

    private IntWritable listeners;
    private IntWritable plays;
    private IntWritable scrobbles;
    private IntWritable radioPlays;
    private IntWritable skips;

    public TrackStats(int numListeners, int numPlays, int numScrobbles,
            int numRadio, int numSkips) {
        this.listeners = new IntWritable(numListeners);
        this.plays = new IntWritable(numPlays);
        this.scrobbles = new IntWritable(numScrobbles);
        this.radioPlays = new IntWritable(numRadio);
        this.skips = new IntWritable(numSkips);
    }

    public TrackStats() {
        this(0, 0, 0, 0, 0);
    }

    ...
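
The slide elides the rest of the class. As a WritableComparable, TrackStats must also implement write, readFields, and compareTo, and the reducers rely on int-valued getters and setters. A minimal sketch of those omitted methods under those assumptions (our reconstruction, not the original Last.fm code):

    public void write(DataOutput out) throws IOException {
        /* Serialize the fields in a fixed order */
        listeners.write(out);
        plays.write(out);
        scrobbles.write(out);
        radioPlays.write(out);
        skips.write(out);
    }

    public void readFields(DataInput in) throws IOException {
        /* Deserialize in the same order as write() */
        listeners.readFields(in);
        plays.readFields(in);
        scrobbles.readFields(in);
        radioPlays.readFields(in);
        skips.readFields(in);
    }

    public int compareTo(TrackStats other) {
        /* Any consistent ordering works; here we order by total plays */
        int a = this.plays.get();
        int b = other.plays.get();
        return (a < b) ? -1 : ((a == b) ? 0 : 1);
    }

    /* One getter/setter pair shown; the other fields follow the same pattern */
    public int getListeners() { return listeners.get(); }
    public void setListeners(int listeners) { this.listeners.set(listeners); }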

SumTrackStatsMapper Class

public static class SumTrackStatsMapper
        extends Mapper<LongWritable, Text, IntWritable, TrackStats> {

    public void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] parts = line.toString().split(" ");
        int trackId = Integer.parseInt(parts[TrackStatistics.COL_TRACKID]);
        int scrobbles = Integer.parseInt(parts[TrackStatistics.COL_SCROBBLE]);
        int radio = Integer.parseInt(parts[TrackStatistics.COL_RADIO]);
        int skip = Integer.parseInt(parts[TrackStatistics.COL_SKIP]);
        TrackStats trackStats =
                new TrackStats(0, scrobbles + radio, scrobbles, radio, skip);
        context.write(new IntWritable(trackId), trackStats);
    }
}

SumTrackStatsReducer

SumTrackStatsReducer input: the key is the TrackId output by the mapper; the value is an iterator over the list of TrackStats associated with the track.

TrackId      Iterable<TrackStats>
IntWritable  Iterable<TrackStats>
222          [(0,1,0,1,0)]
225          [(0,1,1,0,0), (0,1,1,0,0)]
223          [(0,1,0,1,1)]

SumTrackStatsReducer function: create a new TrackStats object to hold the totals for the current track, then iterate through the values and add the stats of each one to the totals.

Example: reduce(225, [(0,1,1,0,0), (0,1,1,0,0)]) → <225, (0,2,2,0,0)>

SumTrackStatsReducer output:

TrackId      numListeners  numPlays     numScrobbles  numRadioPlays  numSkips
IntWritable  IntWritable   IntWritable  IntWritable   IntWritable    IntWritable
222          0             1            0             1              0
223          0             1            0             1              1
225          0             2            2             0              0

SumTrackStatsReducer

public static class SumTrackStatsReducer
        extends Reducer<IntWritable, TrackStats, IntWritable, TrackStats> {

    public void reduce(IntWritable trackId, Iterable<TrackStats> values,
            Context context) throws IOException, InterruptedException {
        /* Holds the totals for this track */
        TrackStats sum = new TrackStats();
        for (TrackStats trackStats : values) {
            sum.setListeners(sum.getListeners() + trackStats.getListeners());
            sum.setPlays(sum.getPlays() + trackStats.getPlays());
            sum.setSkips(sum.getSkips() + trackStats.getSkips());
            sum.setScrobbles(sum.getScrobbles() + trackStats.getScrobbles());
            sum.setRadioPlays(sum.getRadioPlays() + trackStats.getRadioPlays());
        }
        context.write(trackId, sum);
    }
}

Merging the Results: MergeResults Job

Merges the output from the UniqueListeners and SumTrackStats jobs. Specify a mapper for each input type:

MultipleInputs.addInputPath(job, sumInputDir,
        SequenceFileInputFormat.class, Mapper.class);
MultipleInputs.addInputPath(job, listenersInputDir,
        SequenceFileInputFormat.class, MergeListenersMapper.class);

The base Mapper class acts as an identity mapper in the new API: it simply emits the TrackId and TrackStats object output by the SumTrackStats job unchanged.

Identity mapper input and output:

TrackId      numListeners  numPlays     numScrobbles  numRadioPlays  numSkips
IntWritable  IntWritable   IntWritable  IntWritable   IntWritable    IntWritable
222          0             1            0             1              0
223          0             1            0             1              1
225          0             2            2             0              0

MergeListenersMapper

Prepares the data for the final reducer by mapping each TrackId to a TrackStats object with the number of unique listeners set.

MergeListenersMapper input: the UniqueListeners job output.

TrackId      numListeners
IntWritable  IntWritable
222          1
223          1
225          2

MergeListenersMapper function: create a new TrackStats object per track and set the numListeners attribute.

Example: map(225, 2) → <225, new TrackStats(2, 0, 0, 0, 0)>

MergeListenersMapper output:

TrackId  numListeners  numPlays  numScrobbles  numRadioPlays  numSkips
222      1             0         0             0              0
223      1             0         0             0              0
225      2             0         0             0              0

MergeListenersMapper

public static class MergeListenersMapper
        extends Mapper<IntWritable, IntWritable, IntWritable, TrackStats> {

    public void map(IntWritable trackId, IntWritable uniqueListenerCount,
            Context context) throws IOException, InterruptedException {
        TrackStats trackStats = new TrackStats();
        trackStats.setListeners(uniqueListenerCount.get());
        context.write(trackId, trackStats);
    }
}

Final Reduce Stage: SumTrackStatsReducer

At this point we have two partially defined TrackStats objects for each track, so we can reuse the SumTrackStatsReducer to combine them and emit the final result.

SumTrackStatsReducer input: the key is the TrackId output by the two mappers; the value is an iterator over the TrackStats associated with the track (in this case, one holds the unique listener count and the other holds the play, scrobble, radio listen, and skip counts).

TrackId      Iterable<TrackStats>
IntWritable  Iterable<TrackStats>
222          [(1,0,0,0,0), (0,1,0,1,0)]
223          [(1,0,0,0,0), (0,1,0,1,1)]
225          [(2,0,0,0,0), (0,2,2,0,0)]

SumTrackStatsReducer function:

Example: reduce(225, [(2,0,0,0,0), (0,2,2,0,0)]) → <225, (2,2,2,0,0)>

SumTrackStatsReducer output:

TrackId      numListeners  numPlays     numScrobbles  numRadioPlays  numSkips
IntWritable  IntWritable   IntWritable  IntWritable   IntWritable    IntWritable
222          1             1            0             1              0
223          1             1            0             1              1
225          2             2            2             0              0

Final SumTrackStatsReducer

Reused from the SumTrackStats job.

public static class SumTrackStatsReducer
        extends Reducer<IntWritable, TrackStats, IntWritable, TrackStats> {

    public void reduce(IntWritable trackId, Iterable<TrackStats> values,
            Context context) throws IOException, InterruptedException {
        /* Holds the totals for this track */
        TrackStats sum = new TrackStats();
        for (TrackStats trackStats : values) {
            sum.setListeners(sum.getListeners() + trackStats.getListeners());
            sum.setPlays(sum.getPlays() + trackStats.getPlays());
            sum.setSkips(sum.getSkips() + trackStats.getSkips());
            sum.setScrobbles(sum.getScrobbles() + trackStats.getScrobbles());
            sum.setRadioPlays(sum.getRadioPlays() + trackStats.getRadioPlays());
        }
        context.write(trackId, sum);
    }
}

Putting it all together

All job classes should extend Configured and implement Tool:

public class MergeResults extends Configured implements Tool

Then override the run method with job-specific configuration:

public class MergeResults extends Configured implements Tool {

    /* mapper and reducer classes */

    public int run(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path sumInputDir = new Path(args[1] + Path.SEPARATOR_CHAR + "suminputdir");
        Path listenersInputDir = new Path(args[1] + Path.SEPARATOR_CHAR + "listenersinputdir");
        Path output = new Path(args[1] + Path.SEPARATOR_CHAR + "finalsumoutput");

        Job job = new Job(conf, "merge-results");
        job.setJarByClass(UniqueListeners.class);
        /* The base Mapper class acts as an identity mapper */
        MultipleInputs.addInputPath(job, sumInputDir,
                SequenceFileInputFormat.class, Mapper.class);
        MultipleInputs.addInputPath(job, listenersInputDir,
                SequenceFileInputFormat.class, MergeListenersMapper.class);
        job.setReducerClass(SumTrackStats.SumTrackStatsReducer.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(TrackStats.class);
        FileOutputFormat.setOutputPath(job, output);

        return job.waitForCompletion(true) ? 0 : 1;
    }
}
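
The run methods for the first two jobs are not shown in the slides. A plausible sketch of UniqueListeners.run(), assuming the raw text input comes from args[0] and the output is written as a SequenceFile into the "listenersinputdir" directory that MergeResults reads above (the wiring details are our assumptions, not the original code):

public class UniqueListeners extends Configured implements Tool {

    /* UniqueListenersMapper and UniqueListenersReducer as shown earlier */

    public int run(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "unique-listeners");
        job.setJarByClass(UniqueListeners.class);
        job.setMapperClass(UniqueListenersMapper.class);
        job.setReducerClass(UniqueListenersReducer.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);
        /* Emit a SequenceFile so MergeResults can consume it directly */
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job,
                new Path(args[1] + Path.SEPARATOR_CHAR + "listenersinputdir"));
        return job.waitForCompletion(true) ? 0 : 1;
    }
}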

Putting it all together

Hadoop provides a ToolRunner to simplify job chaining. The driver class simply runs each job in the correct order using ToolRunner.

public class GenerateTrackStats {

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.out.println("Usage: GenerateTrackStats <input path> <output path>");
            System.exit(0);
        }
        int exitCode = ToolRunner.run(new UniqueListeners(), args);
        if (exitCode == 0) {
            exitCode = ToolRunner.run(new SumTrackStats(), args);
        } else {
            System.err.println("GenerateTrackStats: UniqueListeners map-reduce phase failed!");
            System.exit(1);
        }
        if (exitCode == 0) {
            exitCode = ToolRunner.run(new MergeResults(), args);
        } else {
            System.err.println("GenerateTrackStats: SumTrackStats map-reduce phase failed!");
            System.exit(1);
        }
        System.exit(exitCode);
    }
}
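
With the driver in place, the pipeline would typically be launched with the hadoop jar command; the jar name and paths below are illustrative, not from the original slides:

$ hadoop jar trackstats.jar GenerateTrackStats /user/lastfm/listens /user/lastfm/trackstats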

References

Thanks to Marissa Hollingsworth for the full Last.fm code example and slides. Converted to the new API by Amit Jain.

Last.fm, the main website: http://www.last.fm/
Tom White. Hadoop: The Definitive Guide, 3rd ed. O'Reilly, 2012.