Big Data Infrastructure at Spotify




Big Data Infrastructure at Spotify
Wouter de Bie, Team Lead Data Infrastructure
June 12, 2013

Agenda
Let's talk about Data Infrastructure: how we did it, what we learned and how we've failed.
- Some context
- Why data?
- Use cases
- Our infrastructure
- Lessons learned

Spotify? Spotify!

Some context
- Spotify started in 2006
- Now 850+ employees, 250+ engineers
- 26 million monthly active users
- 20+ million tracks available
- 4 data centers across the globe
- 12 data engineers building a platform for easy access to data

Why data? We play music, right?

Why data? We were the first company to do free music streaming. But now everybody can do it.

Use cases
We're a data-driven company, so data is used almost everywhere:
- Reporting
- Business analytics
- Operational analytics
- Product features

Reporting
- Reporting to labels, licensors, partners and advertisers
- We support our partners

Business analytics
- Analyzing growth, user behavior, sign-up funnels, etc.
- Company KPIs
- A/B testing
- NPS analysis
- Segmentation analysis

Operational metrics
- Root cause analysis
- Latency analysis
- Better capacity planning (servers, people, bandwidth)

Product features
- Radio
- Top lists
- Recommendations (better than external parties, because of the amount of data)

Everybody should be able to use data!

So why Data Infrastructure?
[Diagram layers: View, Backend, Data]

Our data infrastructure

Some geeky numbers
- 600 GB of compressed data from users per day
- 150 GB of data from services per day
- 4 TB of data generated in Hadoop each day
- 190 node Hadoop cluster (soon 690 nodes)
- 4 PB of storage capacity (soon 28 PB)
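A rough back-of-the-envelope calculation helps put the numbers above in proportion. The sketch below assumes decimal units (1 TB = 1000 GB), ignores HDFS replication, and adds compressed and uncompressed figures together, so treat the results as orders of magnitude only. The function names are illustrative and do not come from the talk.

```python
GB_PER_TB = 1000  # assumption: decimal units
TB_PER_PB = 1000  # assumption: decimal units


def daily_volume_tb(user_gb, service_gb, hadoop_tb):
    """Total data arriving or produced per day, in TB."""
    return (user_gb + service_gb) / GB_PER_TB + hadoop_tb


def storage_per_node_tb(capacity_pb, nodes):
    """Average storage capacity per cluster node, in TB."""
    return capacity_pb * TB_PER_PB / nodes


# Figures taken from the slide above.
daily_tb = daily_volume_tb(user_gb=600, service_gb=150, hadoop_tb=4)
per_node_now = storage_per_node_tb(capacity_pb=4, nodes=190)
per_node_planned = storage_per_node_tb(capacity_pb=28, nodes=690)

print(f"daily volume: {daily_tb:.2f} TB")
print(f"storage per node today: {per_node_now:.1f} TB")
print(f"storage per node planned: {per_node_planned:.1f} TB")
```

Under these assumptions the planned expansion roughly doubles the average storage per node in addition to adding nodes.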

Spotify's data infrastructure
[Architecture diagram: Backend services, Product features, HDFS, Map/Reduce, Reporting, Operational databases, Luigi scheduler, Hive, Analytical databases, Map/Reduce jobs, Dashboards]

The three pillars of our Data Infrastructure
- Kafka: collection
- Hadoop: processing
- Databases: analytics/visualization

Data collection

Data collection
- Kafka: high-volume pub-sub system
- Started with a store-and-forward system
- Evaluated Apache Flume
- Currently from Backend-to-HDFS, but in the future Backend-to-Backend
- It was almost a good fit, but...

Guaranteed delivery
- Kafka doesn't provide message acknowledgements...
- ...at least, not in 0.7 (stable)
- 0.8 has support, but no end-to-end acknowledgements
- A track streamed == monetary transaction

Hacking Kafka acknowledgements
[Diagram components: HDFS, Backend service, Kafka broker, Kafka client, S3, Tape]
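The slide only shows a diagram, so the actual mechanism is not documented here. One plausible reading of "end-to-end acknowledgements" is that the sender keeps every message until the final store confirms it was written, and resends anything unconfirmed. The sketch below simulates that idea in plain Python. It does not use Kafka's API, and every name in it (`Producer`, `drain_to_store`, `drop_ids`) is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Producer:
    """Keeps each message until the final store acknowledges it (assumed design)."""
    pending: dict = field(default_factory=dict)
    next_id: int = 0

    def send(self, payload, channel):
        msg_id = self.next_id
        self.next_id += 1
        self.pending[msg_id] = payload  # retained until acknowledged
        channel.append((msg_id, payload))
        return msg_id

    def ack(self, msg_id):
        self.pending.pop(msg_id, None)

    def resend_unacked(self, channel):
        for msg_id, payload in sorted(self.pending.items()):
            channel.append((msg_id, payload))


def drain_to_store(channel, store, producer, drop_ids=frozenset()):
    """Consumer side: persist each message, then acknowledge it to the producer."""
    while channel:
        msg_id, payload = channel.pop(0)
        if msg_id in drop_ids:
            continue  # simulate a message lost in transit
        store[msg_id] = payload  # keyed by id, so a resend cannot create a duplicate
        producer.ack(msg_id)


producer = Producer()
channel, store = [], {}
for track in ["track-a", "track-b", "track-c"]:
    producer.send(track, channel)

drain_to_store(channel, store, producer, drop_ids={1})
lost_after_first_pass = sorted(producer.pending)  # message 1 was never acknowledged

producer.resend_unacked(channel)
drain_to_store(channel, store, producer)
```

Keying the store by message id makes the write idempotent, which matters when a streamed track counts as a monetary transaction: a resend must never be counted twice.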

Dumping databases
- Not only do we have log files, but also production databases
- Using Sqoop for dumping PostgreSQL
- Map/Reduce job for dumping Cassandra
- For large DBs we parse application logs and only dump deltas
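The delta approach in the last bullet can be illustrated with a small sketch: rather than re-dumping a whole table, replay the changes recorded in the application log on top of the previous snapshot. The tab-separated log format and the `UPSERT`/`DELETE` operation names are assumptions made for this example, not Spotify's actual format.

```python
def apply_deltas(snapshot, log_lines):
    """Return a new snapshot with the logged changes applied in order.

    Assumed (hypothetical) log format, one change per line:
        UPSERT<TAB>key<TAB>value
        DELETE<TAB>key
    """
    result = dict(snapshot)  # leave the previous snapshot untouched
    for line in log_lines:
        op, key, *rest = line.rstrip("\n").split("\t")
        if op == "UPSERT":
            result[key] = rest[0]
        elif op == "DELETE":
            result.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return result


yesterday = {"user1": "free", "user2": "premium"}
log = [
    "UPSERT\tuser1\tpremium",
    "DELETE\tuser2",
    "UPSERT\tuser3\tfree",
]
today = apply_deltas(yesterday, log)
print(today)
```

The cost of this pass scales with the number of changes rather than with the size of the database, which is the point of dumping only deltas.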

Failures (or the hard stuff)
- We started with a store-and-forward system that didn't scale
- Kafka has multiple components (client, broker, ZooKeeper)
- Internet weather
- As with many large Java systems: garbage collection

Hadoop: our trusted elephant

Scheduling
- We wrote and open-sourced our own scheduler: Luigi
- Nothing suitable out there... (unless you really, really like the XML hell of Oozie)
- https://github.com/spotify/luigi
- Written in Python
- Generic scheduler and dependency system that supports Python M/R, Pig and Hive
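To show what a "generic scheduler and dependency system" does at its core, here is a minimal stdlib-only sketch. It borrows the `requires`/`complete`/`run` method names that Luigi tasks use, but it is a toy stand-in and not Luigi's real implementation. The task classes and the in-memory `OUTPUTS` dict are invented for the example.

```python
OUTPUTS = {}  # in-memory stand-in for files on HDFS


class Task:
    """Toy task: declares its dependencies and knows whether it is done."""
    name = None

    def requires(self):
        return []

    def complete(self):
        return self.name in OUTPUTS

    def run(self):
        raise NotImplementedError


def build(task):
    """Run a task after its dependencies, skipping work that is already complete."""
    if task.complete():
        return
    for dependency in task.requires():
        build(dependency)
    task.run()


class CollectLogs(Task):
    name = "logs"

    def run(self):
        OUTPUTS[self.name] = ["track-a", "track-b", "track-a"]


class CountPlays(Task):
    name = "plays"

    def requires(self):
        return [CollectLogs()]

    def run(self):
        counts = {}
        for track in OUTPUTS["logs"]:
            counts[track] = counts.get(track, 0) + 1
        OUTPUTS[self.name] = counts


build(CountPlays())
print(OUTPUTS["plays"])
```

Because each task reports its own completion, rerunning `build` after a failure only repeats the steps whose output is missing.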

Map/Reduce languages
- Python with Hadoop Streaming
  Pros: fast development, many Spotify libraries available
  Cons: slower than Java, no access to Hadoop API
- Java
  Pros: fast, access to Hadoop API
  Cons: verbose language, not many Spotify libraries available
- Pig
  Pros: very small scripts, faster than streaming
  Cons: yet another language to learn, not many Spotify libraries available
- Hive
  Pros: SQL-like syntax (easy for non-programmers) and relational data model
  Cons: more moving parts (not well suited for a whole pipeline)
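For readers unfamiliar with the Hadoop Streaming option, a streaming job is just two programs that read lines and write tab-separated key/value lines, with the framework sorting by key between them. The sketch below counts plays per track. The log layout (user id, then track id, tab-separated) is an assumption for the example and not Spotify's real format.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'track_id<TAB>1' per log line; the track id is assumed to be the second field."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            yield f"{fields[1]}\t1"


def reducer(lines):
    """Sum the counts per key; the input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{key}\t{total}"


# Local simulation: sorted() plays the role of Hadoop's shuffle-and-sort step.
log = ["user1\ttrack-a", "user2\ttrack-b", "user3\ttrack-a"]
counts = list(reducer(sorted(mapper(log))))
print(counts)
```

In a real streaming job each function would sit in its own script, reading `sys.stdin` and printing its output. The two functions are kept as generators here so the pipeline can be tested locally without a cluster.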

Scaling Hadoop at Spotify: our journey
- Started with a small (scrap metal) cluster of 37 servers
- Moved to Amazon Elastic Map/Reduce (EMR) and S3 to quickly scale
- Built an in-house cluster of 60 nodes because of EMR costs
- Capacity planning every 6 months, grown to 190 nodes today
- Just ordered 500 more nodes
- Put in place data-retention policy and data archive

Hadoop failures
- "We just need good developers." No, we need Hadoop experts
- We underestimated the complexity of Hadoop
- You can throw money at the problem of scaling, but at our scale it pays off to optimize
- Give people easy tools early on

Lessons learned: 4+ years of Hadoop taught us
- Hadoop has brought us very far. We would never be able to handle the current volume with a cheap RDBMS
- Commodity hardware doesn't mean cheap hardware
- Hadoop isn't a silver bullet
- Hadoop is a complex system that needs love and care
- You will have to extend Hadoop (and ecosystem components) to tailor it to your needs

Databases and visualization

Databases: used for aggregates
- Aggregates from Hadoop are put into PostgreSQL or Cassandra
- PostgreSQL powering dashboards and empowering analysts
- Cassandra for columnar sets (Spotify Analytics for labels)
- Databases are used for low-latency access by systems or analysts
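The pattern described above, where small aggregates computed in Hadoop are loaded into a database that dashboards can query with low latency, can be sketched as follows. The example uses Python's built-in `sqlite3` with an in-memory database purely as a stand-in for PostgreSQL. The `daily_plays` table and its columns are hypothetical.

```python
import sqlite3


def load_aggregates(conn, rows):
    """Load (day, track, plays) rows; reloading a day overwrites its earlier values."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_plays ("
        "day TEXT, track TEXT, plays INTEGER, PRIMARY KEY (day, track))"
    )
    conn.executemany("INSERT OR REPLACE INTO daily_plays VALUES (?, ?, ?)", rows)
    conn.commit()


def top_tracks(conn, day, limit=10):
    """The kind of cheap query a dashboard would run against the aggregates."""
    cursor = conn.execute(
        "SELECT track, plays FROM daily_plays "
        "WHERE day = ? ORDER BY plays DESC, track LIMIT ?",
        (day, limit),
    )
    return cursor.fetchall()


conn = sqlite3.connect(":memory:")
load_aggregates(conn, [("2013-06-12", "track-a", 2), ("2013-06-12", "track-b", 1)])
top = top_tracks(conn, "2013-06-12")
print(top)
```

The primary key on (day, track) makes a rerun of the same day's job replace rows instead of duplicating them, which keeps reloads safe.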

Spotify Data Warehouse

Spotify Data Warehouse

Spotify Analytics Dashboards

Spotify Analytics

Spotify Analytics: Daft Punk

Spotify Analytics: Kendrick Lamar - Gábor was right :)

Spotify Analytics: Whitney Houston

Lessons learned
- Data is crucial for our business
- Hadoop and other cutting-edge technology work pretty well for us
- Hiring technical analysts was a good idea!
- There is no one-size-fits-all data product
- Spotify has the "build it ourselves" mindset. Sometimes it's better to buy than build

Want to join the band?
Check out http://www.spotify.com/jobs or @Spotifyjobs for more information.
Or mail: wouter@spotify.com
Or twitter: @xinit