Big Data at Spotify. Anders Arpteg, PhD, Analytics Machine Learning, Spotify

Transcription:

Big Data at Spotify
Anders Arpteg, PhD
Analytics Machine Learning, Spotify

  Quickly about me
  Quickly about Spotify
  What is all the data used for?
  Quickly about Spark
  Hadoop MR vs Spark
  Need for (distributed) speed
  Logistic regression in Scikit vs Spark
  SGD optimizer in Spark
  General thoughts so far
  Demo?

Quickly about me
  1995  University of Kalmar
  1997  The Buyer's Guide
  2000  PhD student, Kalmar + Linköping
  2005  Assistant Professor, Kalmar
  2007  Venture capital, research project
  2007  TestFreaks, Pricerunner
          15,000+ sites worldwide
  2011  Campanja, AI team
          Optimized Netflix worldwide
  2013  Spotify, Graph data lead
  2014  Spotify, Analytics ML manager

Quickly about Spotify
  75+ million monthly active users
  Launched in 58 different countries
  20+ million paying subscribers
  30+ million licensed songs
  20,000 new songs every day
  1.5+ billion playlists created
  14 TB of user/service-related log data per day
    Expands to 170 TB per day
  1200+ node Hadoop cluster
    50 PB of storage capacity, 48 TB of memory capacity

What is all the data used for?
  Reporting to labels and rights holders
  Product features
    Browse, search, radio, related artists, A/B testing
  Catalog quality
    Artist disambiguation, track deduplication
  Business analytics
    KPI, DAU, MAU, SUBS, conversion, retention, NPS analysis, understand the users
    User funnel: awareness, activation, conversion, retention
    Marketing, growth, consumer insights
  Operational analysis

A/B Testing

Spotify data architecture

The discovery data pipeline

Collaborative filtering
  Approximate 60M users x 4M songs with 40 latent factors, using ALS
  In short, minimize the cost function:
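The formula on this slide did not survive the transcription. As a hedged reconstruction, a common form of the matrix-factorization objective that ALS (alternating least squares) minimizes is shown below. It assumes explicit ratings r_{ui} observed on a set K of (user, song) pairs, latent vectors x_u and y_i of dimension 40, and an L2 penalty with weight λ. All of these symbols are assumptions, and the original slide may well have used an implicit-feedback variant with per-entry confidence weights instead.

```latex
\min_{X,\,Y} \; \sum_{(u,i) \in \mathcal{K}} \left( r_{ui} - x_u^{\top} y_i \right)^2
  \; + \; \lambda \left( \sum_{u} \lVert x_u \rVert^2 + \sum_{i} \lVert y_i \rVert^2 \right)
```

ALS alternates between fixing the song vectors Y and solving a regularized least-squares problem for each user vector x_u, then fixing X and solving for each y_i. Each half-step has a closed-form solution and is independent per user (or per song), which is what makes it easy to distribute.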

Next-generation Data Analytics
  Analytics 1.0 - Traditional statistical analysis
    Statistical significance with ~1000 users
    Centralized relational databases
  Analytics 2.0 - Big Data
    Moving algorithms to data
    Make it possible to handle big data
    Volume, Variety, and Velocity
  Analytics 3.0 - Machine Learning & Real-time
    Simplify distributed data processing
    Decrease latency between incoming data and decision
    Intelligent distributed machine learning algorithms

Next-generation Data Analytics (2)
  Hadoop 2+, YARN application
  Killing classical Map/Reduce
    Iterative algorithms in Spark, Tez, and Flink
  Streaming data (not just music)
    Kafka, Storm, and Spark Streaming
    Lambda architecture
  Improved storage formats
    Columnar data storage
    Parquet, ORC
  Simplified machine learning toolkits
    Scikit-learn, Spark MLlib, IPython notebooks, R
    Ubiquitous machine learning, ML for everyone
  Better tools for data warehousing and dashboarding

Quickly about classical map/reduce

Quickly about classical map/reduce (2)

Quickly about Spark

Quickly about Spark (2)

Spark Example with the RDD API
  Array of data distributed over workers
  Same API as normal arrays
  Transforming: map, filter, reduceByKey, groupByKey, ...
  Joining: joinByKey, leftOuterJoin, cogroup, zip, ...
  Actions: count, saveAsAvro, saveAsText, ...
  Failure recovery, reruns failed tasks
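The code from this slide is not in the transcription. As a rough illustration of what the listed transformations compute, here is a plain-Python sketch that mimics filter, map and reduceByKey on an ordinary list, with no Spark involved. The play-log data and the helper reduce_by_key are hypothetical; on a real RDD the same pipeline would be chained as rdd.filter(...).map(...).reduceByKey(...) and evaluated in parallel across the workers.

```python
from collections import defaultdict
from functools import reduce


def reduce_by_key(pairs, func):
    """Combine all values that share a key with func (local stand-in for reduceByKey)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}


# Hypothetical play log: (user, track, seconds played)
plays = [
    ("alice", "track_a", 210),
    ("bob", "track_a", 12),
    ("alice", "track_b", 185),
    ("carol", "track_a", 200),
    ("bob", "track_b", 240),
]

# filter: keep only plays of at least 30 seconds
long_plays = filter(lambda p: p[2] >= 30, plays)
# map: emit one (track, 1) pair per remaining play
pairs = map(lambda p: (p[1], 1), long_plays)
# reduceByKey: sum the ones per track
play_counts = reduce_by_key(pairs, lambda a, b: a + b)

print(play_counts)
```

As in Spark, the filter and map steps here are lazy: nothing is computed until the final reduction consumes them.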

Spark Example with the DataFrame API
  Higher level of abstraction than RDD
  Make use of schema-free data sources
  Dynamic schema-awareness
  Additional optimizations performed automatically
  Same performance in Python as in Scala
  API similar to Pandas and R

Quickly about Spark (5)

Problem Definition + Hypothesis
  Improve user targeting for house ads
    P(C | A): identify users that are likely to convert given that they've seen house ads
    Target fewer people with house ads, and retain as many conversions as possible
  Hypothesis
    By making use of information about users' behaviour, demographics, and ad data, it will be possible to estimate the likelihood of conversion with a logistic regression model.
  Alternative algorithms
    Naive Bayes, Decision Trees, Boosted Trees, Random Forest, SVM, ...
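The trade-off stated above (show house ads to fewer users while keeping most conversions) can be made concrete with a small sketch. It assumes each user has a predicted conversion probability and a known outcome; the function name conversions_retained and the toy numbers are hypothetical, not taken from the talk.

```python
def conversions_retained(scores, converted, target_fraction):
    """Share of all conversions captured when only the top-scoring
    target_fraction of users is shown the ad."""
    ranked = sorted(zip(scores, converted), key=lambda sc: sc[0], reverse=True)
    n_target = int(round(target_fraction * len(ranked)))
    total = sum(converted)
    if total == 0:
        return 0.0
    captured = sum(c for _, c in ranked[:n_target])
    return captured / total


# Toy data: predicted P(C | A) per user and whether the user actually converted
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.05, 0.02, 0.01]
converted = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

for fraction in (0.2, 0.5, 1.0):
    share = conversions_retained(scores, converted, fraction)
    print(f"target top {fraction:.0%} of users, retain {share:.0%} of conversions")
```

Sweeping the target fraction like this gives a curve from which a cut-off can be chosen; a better model pushes the curve up, so fewer users need to be targeted for the same number of conversions.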

Evaluation of the model

Need for (distributed) speed
  Steps to build the model
    Extract data for training
    Transform data into features
    Train the model using the features
    Evaluate the performance of the model
    Tune the parameters
    Extract data for prediction
    Transform prediction data into features
    Predict probability of conversion for all the users
  Main tools used
    IPython notebook
    Scikit-learn library
    Spark + MLlib

Running data extraction in Spark

More often like this

Quickly about Logistic Regression

Logistic Regression in Scikit-learn
  L2-regularized optimization problem in liblinear
  Newton-Raphson solver

Logistic Regression in Spark
  Stochastic gradient descent
  Params: step size, use intercept, regularization, batch size
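To show how these four parameters interact, here is a minimal single-machine sketch of mini-batch SGD for L2-regularized logistic regression in plain Python. It is not Spark's implementation: the function and parameter names are hypothetical, the step size decaying as step_size / sqrt(t) is an assumption, and leaving the intercept unregularized is a design choice of this sketch.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logreg_sgd(X, y, step_size=1.0, reg_param=0.01, batch_fraction=0.5,
                     num_iterations=200, use_intercept=True, seed=0):
    """Mini-batch SGD; returns the weights (intercept last, if used)."""
    rng = random.Random(seed)
    # The intercept is modelled as an extra constant feature.
    rows = [list(x) + [1.0] for x in X] if use_intercept else [list(x) for x in X]
    n, d = len(rows), len(rows[0])
    w = [0.0] * d
    batch_size = max(1, int(batch_fraction * n))
    for t in range(1, num_iterations + 1):
        batch = rng.sample(range(n), batch_size)
        grad = [0.0] * d
        for i in batch:
            margin = sum(wj * xj for wj, xj in zip(w, rows[i]))
            err = sigmoid(margin) - y[i]          # derivative of the log loss
            for j in range(d):
                grad[j] += err * rows[i][j] / batch_size
        step = step_size / math.sqrt(t)           # decaying step (assumption)
        for j in range(d):
            is_bias = use_intercept and j == d - 1
            penalty = 0.0 if is_bias else reg_param * w[j]
            w[j] -= step * (grad[j] + penalty)
    return w


def predict_proba(w, x, use_intercept=True):
    features = list(x) + [1.0] if use_intercept else list(x)
    return sigmoid(sum(wj * xj for wj, xj in zip(w, features)))


# Toy one-feature data set: negative x means no conversion, positive x means conversion
X = [[-2.0], [-1.5], [-1.0], [-0.5], [0.5], [1.0], [1.5], [2.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
w = train_logreg_sgd(X, y)
print(predict_proba(w, [2.0]), predict_proba(w, [-2.0]))
```

A larger step size speeds up early progress but makes later iterations noisier, a smaller batch fraction makes each iteration cheaper but the gradient estimate rougher, and a larger regularization parameter shrinks the weights toward zero.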

SGD implementation in Spark

Calculation of the gradient

Updating of the weights
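The contents of the two slides above (gradient calculation and weight update) are not in the transcription. For reference, the standard expressions for logistic regression with an L2 penalty are sketched below. The notation is assumed: B_t is the mini-batch at iteration t, σ is the logistic function, η is the initial step size, and λ is the regularization parameter. The 1/sqrt(t) decay of the step size is also an assumption rather than something stated in the talk.

```latex
g_t = \frac{1}{\lvert B_t \rvert} \sum_{i \in B_t} \left( \sigma\!\left( w_t^{\top} x_i \right) - y_i \right) x_i ,
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}

w_{t+1} = w_t - \frac{\eta}{\sqrt{t}} \left( g_t + \lambda\, w_t \right)
```

In a distributed setting the sum in g_t is the expensive part: each worker computes the partial sum over its own sampled rows, the partial sums are aggregated, and the comparatively cheap weight update is applied in one place.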

SGD Convergence

Learning rate (step size) tuning

Regularization tuning

Thoughts about Spark
  Advantages with Spark
    General purpose engine (batch, streaming, SQL, graph)
    Faster YARN engine, DAG optimization and less IO
    High-level machine learning library
    RDD, failure recovery, data locality
    Generic caching and accumulators
    Nice development environment, local debugging, ...
    Huge community and activity
  Disadvantages and things to consider
    Still rather immature, unexpected error messages
    Beware of the number of executors
    Avoid references to outer classes
    Be careful about partition tuning

Thanks!

Deep learning for identifying similar songs