Building Scalable Big Data Infrastructure Using Open Source Software. Sam William sampd@stumbleupon.



Similar documents
The Big Data Ecosystem at LinkedIn Roshan Sumbaly, Jay Kreps, and Sam Shah LinkedIn

Implement Hadoop jobs to extract business value from large and varied data sets

The Big Data Ecosystem at LinkedIn. Presented by Zhongfang Zhuang

Hadoop Ecosystem B Y R A H I M A.

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Using Kafka to Optimize Data Movement and System Integration. Alex

Big Data Analytics - Accelerated. stream-horizon.com

Complete Java Classes Hadoop Syllabus Contact No:

Apache Hadoop: Past, Present, and Future

Building a real-time, self-service data analytics ecosystem Greg Arnold, Sr. Director Engineering

Workshop on Hadoop with Big Data

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

Big Data Pipeline and Analytics Platform

Realtime Apache Hadoop at Facebook. Jonathan Gray & Dhruba Borthakur June 14, 2011 at SIGMOD, Athens

Using distributed technologies to analyze Big Data

Openbus Documentation

Cloudera Certified Developer for Apache Hadoop

Moving From Hadoop to Spark

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Zynga Analytics Leveraging Big Data to Make Games More Fun and Social

Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing

Getting Started with Hadoop. Raanan Dagan Paul Tibaldi

Hadoop: The Definitive Guide

Upcoming Announcements

Wisdom from Crowds of Machines

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

COURSE CONTENT Big Data and Hadoop Training

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

STREAM PROCESSING AT LINKEDIN: APACHE KAFKA & APACHE SAMZA. Processing billions of events every day

Big Data Infrastructure at Spotify

Real-time Big Data Analytics with Storm

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

Pulsar Realtime Analytics At Scale. Tony Ng April 14, 2015

Putting Apache Kafka to Use!

Case Study : 3 different hadoop cluster deployments

Best Practices for Hadoop Data Analysis with Tableau

the missing log collector Treasure Data, Inc. Muga Nishizawa

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

The Internet of Things and Big Data: Intro

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Bringing Big Data to People

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

Delivering Intelligence to Publishers Through Big Data

Lecture 10: HBase! Claudia Hauff (Web Information Systems)!

Big Data Course Highlights

SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP. Eva Andreasson Cloudera

Oracle Big Data SQL Technical Update

The Game of Big Data! Analytics Infrastructure at KIXEYE

QLIKVIEW DEPLOYMENT FOR BIG DATA ANALYTICS AT KING.COM

Hadoop IST 734 SS CHUNG

THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES

How To Handle Big Data With A Data Scientist

HADOOP. Revised 10/19/2015

Certified Big Data and Apache Hadoop Developer VS-1221

Beyond Lambda - how to get from logical to physical. Artur Borycki, Director International Technology & Innovations

How To Scale Out Of A Nosql Database

Big Data Buzzwords From A to Z. By Rick Whiting, CRN 4:00 PM ET Wed. Nov. 28, 2012

Big data blue print for cloud architecture

Big Data Analytics - Accelerated. stream-horizon.com

project collects data from national events, both natural and manmade, to be stored and evaluated by

EMC Federation Big Data Solutions. Copyright 2015 EMC Corporation. All rights reserved.

Dell In-Memory Appliance for Cloudera Enterprise

A Brief Outline on Bigdata Hadoop

BIG DATA TECHNOLOGY. Hadoop Ecosystem

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February ISSN

Big Data Analytics in LinkedIn. Danielle Aring & William Merritt

Big Data? Definition # 1: Big Data Definition Forrester Research

Extending the Enterprise Data Warehouse with Hadoop Robert Lancaster. Nov 7, 2012

Comprehensive Analytics on the Hortonworks Data Platform

Oracle Big Data Spatial & Graph Social Network Analysis - Case Study

A very short talk about Apache Kylin Business Intelligence meets Big Data. Fabian Wilckens EMEA Solutions Architect

Application Development. A Paradigm Shift

BIG DATA TRENDS AND TECHNOLOGIES

SQL Server 2012 PDW. Ryan Simpson Technical Solution Professional PDW Microsoft. Microsoft SQL Server 2012 Parallel Data Warehouse

Hadoop & Spark Using Amazon EMR

In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet

Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software?

Near Real Time Indexing Kafka Message to Apache Blur using Spark Streaming. by Dibyendu Bhattacharya

Databricks. A Primer

Fast Data in the Era of Big Data: Tiwtter s Real-Time Related Query Suggestion Architecture

Sentimental Analysis using Hadoop Phase 2: Week 2

HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM

Kafka & Redis for Big Data Solutions

How Companies are! Using Spark

How To Use Big Data For Telco (For A Telco)

HDP Hadoop From concept to deployment.

White Paper: What You Need To Know About Hadoop

Predictive Analytics with Storm, Hadoop, R on AWS

Creating Big Data Applications with Spring XD

Katta & Hadoop. Katta - Distributed Lucene Index in Production. Stefan Groschupf Scale Unlimited, 101tec. sg{at}101tec.com

Hadoop Job Oriented Training Agenda

07/11/2014 Julien! Poorna! Andreas

How To Use A Data Center With A Data Farm On A Microsoft Server On A Linux Server On An Ipad Or Ipad (Ortero) On A Cheap Computer (Orropera) On An Uniden (Orran)

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>

Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84

Apache Flink Next-gen data analysis. Kostas

Transcription:

Building Scalable Big Data Infrastructure Using Open Source Software Sam William sampd@stumbleupon.

What is StumbleUpon? Help users find content they did not expect to find The best way to discover new and interesting things from across the Web.

How StumbleUpon works 1. Register 2. Tell us your interests 3. Start Stumbling and rating web pages We use your interests and behavior to recommend new content for you!

StumbleUpon

The Data Challenge 1 Data collection 2 Real time metrics 3 Batch processing / ETL 4 Data warehousing & ad-hoc analysis 5 Business intelligence & Reporting

Challenges in data collection Different services deployed of different tech stacks Site Rec /Ad Server Apache / PHP Scala / Finagle Other internal services Java / Scala / PHP Add minimal latency to the production services Application DBs for Analytics / batch processing From HBase & MySQL

Data Processing and Warehousing Raw Data Challenges/Requirements: E T L Scale over 100 TBs of data End product works with easy querying tools/languages Warehouse (HDFS) Tables Reliable and Scalable ᾶ powers analytics and internal reporting. Massively Denormalize d Tables

Real-time analytics and metrics Atomic counters Tracking product launches Monitoring the health of the site Latency live metrics makes sense A/B tests

Open Source at SU

Data Collection at SU

Activity Streams and Logs All messages are Protocol Buffers Fast and Efficient Multiple Language Bindings ( Java/ C++ / PHP ) Compact Very well documented Extensible

Apache Kafka Distributed pub-sub system Developed @ LinkedIn Offers message persistence Very high throughput ~300K messages/sec Horizontally scalable Multiple subscribers for topics. Easy to rewind

Kafka Near real time process can be taken offline and done at the consumer level Semantic partitioning through topics Partitions for parallel consumption High-level consumer API using ZK Simple to deploy- only requires Zookeeper

Kafka At SU 4 Broker nodes with RAID10 disks 25 topics Peak of 3500 msg/s 350 bytes avg. message size 30 days of data retention

Sutro Scala/Finagle Generic Kafka message producer Deployed on all prod servers Local http daemon Publishes to Kafka asynchronously Snowflake to generate unique Ids

Sutro - Kafka Site Apache/PHP Ad Server Scala/Finagle Rec Server Scala/Finagle Other Services Sutro Sutro Sutro Sutro Kafka Broker Broker Broker

Application Data for Analytics & Batch Processing HBase HBase inter-cluster replication (from production to batch cluster) Near real-time sync on batch cluster Readily available in Hive for analysis MySQL MySQL replication to Batch DB Servers Sqoop incremental data transfer to HDFS HDFS flat files mapped to Hive tables & made available for analysis

Real-time metrics 1. HBase Atomic Counters 2. Asynchbase - Coalesced counter inc++ 3. OpenTSDB (developed at SU) A distributed time-series DB on HBase Collects over 2 Billion data points a day Plotting time series graphs Tagged data points

Real-time counters Real time metrics from OpenTSDB

Kafka Consumer framework aka Postie Distributed system for consuming messages Scala/Akka -on top of Kafka s consumer API Generic consumer - understands protobuf Predefined sinks HBase / HDFS (Text/Binary) / Redis Consumers configured with configuration files Distributed / uses ZK to co-ordinate Extensible

Postie

Akka Building concurrent applications made easy!! The distributed nodes are behind Remote Actors Load balancing through custom Routers The predefined sink and services are accessed through local actors Fault-tolerance through actor monitoring

Postie

Batch processing / ETL GOAL: Create simplified data-sets from complex data Batch processing / ETL Create highly denormalized data sets for faster querying Output structured data for specific analysis e.g. Registration Flow analysis Power the reporting DB with daily stats

Our favourite ETL tools: Pig Optional Schema Work on byte arrays Many simple operations can be done without UDFs Developing UDFs is simple (understand Bags/Tuples) Concise scripts compared to the M/R equivalents Scalding Functional programming in Scala over Hadoop Built on top of Cascading Operating over tuples is like operating over collections in Scala No UDFs.. Your entire program is in a full-fledged general purpose language

Warehouse - Hive Uses SQL-like querying language All Analysts and Data Scientists versed in SQL Supports Hadoop Streaming (Python/R) UDFs and Serdes make it highly extensible Supports partitioning with many table properties configurable at the partition level

Hive at StumbleUpon HBaseSerde Reads binary data from HBase Parses composite binary values into multiple columns in Hive (mainly on key) ProtobufSerde For creating Hive tables on top of binary protobuf files stored in HDFS Serde uses Java reflection to parse and project columns

Data Infrastructure at SU

Data Consumption

End Users of Data Data Pipeline Who uses this data? Data Scientists/Analysts Offline Rec pipeline Ads Team Warehouse Ὴ all this work allows them to focus on querying and analysis which is critical to the business.

Business Analytics / Data Scientists Feature-rich set of data to work on Enriched/Denormalized tables reduce JOINs, simplifies and speeds queries shortening path to analysis. R: our favorite tool for analysis post Hadoop/Hive.

Recommendation platform URL score pipeline M/R and Hive on Oozie Filter / Classify into buckets Score / Loop Load ES/HBase index Keyword generation pipeline Parse URL data Generate Tag mappings

URL score pipeline

Advertisement Platform Billing Service RT Kafka consumer Calculates skips Bills customers Audience Estimation tool Pre-crunched data into multiple dimensions A UI tool for Advertisers to estimate target audience Sales team tools Built with PHP leveraging Hive or pre-crunched ETL data in HBase

More stuff on the pipeline Storm from Twitter Scope for lot more real time analytics. Very high throughput and extensible Applications in JVM language BI tools Our current BI tools / dashboards are minimal Google charts powered by our reporting DB (HBase primarily).

Open Source FTW!! Actively developed and maintained Community support Built with web-scale in mind Distributed systems Easy with Akka/ZK/Finagle Inexpensive Only one major catch!! Hire and retain good engineers!!

Thank You! Questions?