Analytics on Spark & Shark @Yahoo

Similar documents

How Companies are! Using Spark

Moving From Hadoop to Spark

Native Connectivity to Big Data Sources in MSTR 10

How To Create A Data Visualization With Apache Spark And Zeppelin

Beyond Hadoop with Apache Spark and BDAS

A Brief Introduction to Apache Tez

HDP Hadoop From concept to deployment.

Unified Big Data Analytics Pipeline. 连城

Hadoop Ecosystem B Y R A H I M A.

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

Next-Gen Big Data Analytics using the Spark stack

Big Data Open Source Stack vs. Traditional Stack for BI and Analytics

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect

Building a real-time, self-service data analytics ecosystem Greg Arnold, Sr. Director Engineering

Cost-Effective Business Intelligence with Red Hat and Open Source

Upcoming Announcements

A very short talk about Apache Kylin Business Intelligence meets Big Data. Fabian Wilckens EMEA Solutions Architect

Using distributed technologies to analyze Big Data

End to End Solution to Accelerate Data Warehouse Optimization. Franco Flore Alliance Sales Director - APJ

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

The Internet of Things and Big Data: Intro

Big Data Approaches. Making Sense of Big Data. Ian Crosland. Jan 2016

The Future of Data Management

YARN Apache Hadoop Next Generation Compute Platform

How To Handle Big Data With A Data Scientist

Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Oracle Big Data SQL Technical Update

Real-Time Analytical Processing (RTAP) Using the Spark Stack. Jason Dai Intel Software and Services Group

A Scalable Data Transformation Framework using the Hadoop Ecosystem

Conquering Big Data with BDAS (Berkeley Data Analytics)

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

Building Scalable Big Data Infrastructure Using Open Source Software. Sam William

Unified Big Data Processing with Apache Spark. Matei

The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi,

HDP Enabling the Modern Data Architecture

BIG DATA ANALYTICS REFERENCE ARCHITECTURES AND CASE STUDIES

Hadoop & Spark Using Amazon EMR

Luncheon Webinar Series May 13, 2013

The Top 10 7 Hadoop Patterns and Anti-patterns. Alex

Apache Flink Next-gen data analysis. Kostas

Beyond Lambda - how to get from logical to physical. Artur Borycki, Director International Technology & Innovations

Big Data Analytics - Accelerated. stream-horizon.com

Oracle Database 12c Plug In. Switch On. Get SMART.

Large scale processing using Hadoop. Ján Vaňo

CAPTURING & PROCESSING REAL-TIME DATA ON AWS

HADOOP. Revised 10/19/2015

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

Big Data Analytics Nokia

Apache Ignite TM (Incubating) - In- Memory Data Fabric Fast Data Meets Open Source

Testing Big data is one of the biggest

BIG DATA What it is and how to use?

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

The Big Data Ecosystem at LinkedIn Roshan Sumbaly, Jay Kreps, and Sam Shah LinkedIn

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. project.org. University of California, Berkeley UC BERKELEY

Apache Kylin Introduction Dec 8,

Cisco IT Hadoop Journey

the missing log collector Treasure Data, Inc. Muga Nishizawa

Big Data With Hadoop

Traditional BI vs. Business Data Lake A comparison

Real Time Big Data Processing

BIG DATA CAN DRIVE THE BUSINESS AND IT TO EVOLVE AND ADAPT RALPH KIMBALL BUSSUM 2014

Big Data on Microsoft Platform

BIG DATA. Using the Lambda Architecture on a Big Data Platform to Improve Mobile Campaign Management. Author: Sandesh Deshmane

Pulsar Realtime Analytics At Scale. Tony Ng April 14, 2015

Oracle Big Data Spatial & Graph Social Network Analysis - Case Study

TE's Analytics on Hadoop and SAP HANA Using SAP Vora

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

Hadoop implementation of MapReduce computational model. Ján Vaňo

Dominik Wagenknecht Accenture

Big Data Training - Hackveda

Roadmap Talend : découvrez les futures fonctionnalités de Talend

Native Connectivity to Big Data Sources in MicroStrategy 10. Presented by: Raja Ganapathy

Spark: Making Big Data Interactive & Real-Time

SAP and Hortonworks Reference Architecture

Aligning Your Strategic Initiatives with a Realistic Big Data Analytics Roadmap

Apache Hadoop: The Big Data Refinery

Azure Data Lake Analytics

PEPPERDATA IN MULTI-TENANT ENVIRONMENTS

Self-service BI for big data applications using Apache Drill

Big Data Can Drive the Business and IT to Evolve and Adapt

Databricks. A Primer

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP. Eva Andreasson Cloudera

An Industrial Perspective on the Hadoop Ecosystem. Eldar Khalilov Pavel Valov

YARN, the Apache Hadoop Platform for Streaming, Realtime and Batch Processing

Crack Open Your Operational Database. Jamie Martin September 24th, 2013

A Next-Generation Analytics Ecosystem for Big Data. Colin White, BI Research September 2012 Sponsored by ParAccel

Big Data Primer. 1 Why Big Data? Alex Sverdlov alex@theparticle.com

Hadoop2, Spark Big Data, real time, machine learning & use cases. Cédric Carbone Twitter

What s next for the Berkeley Data Analytics Stack?

BIG DATA AND THE ENTERPRISE DATA WAREHOUSE WORKSHOP

Dell In-Memory Appliance for Cloudera Enterprise

Simplifying Big Data Analytics: Unifying Batch and Stream Processing. John Fanelli,! VP Product! In-Memory Compute Summit! June 30, 2015!!

<Insert Picture Here> Oracle and/or Hadoop And what you need to know

Information Builders Mission & Value Proposition

In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet

Business Intelligence for Big Data

Transcription:

Analytics on Spark & Shark @Yahoo PRESENTED BY Tim Tully December 3, 2013

Overview Legacy / Current Hadoop Architecture Reflection / Pain Points Why the movement towards Spark / Shark New Hybrid Environment Future Spark/Shark/Hadoop Stack Conclusion 2

Some Fun: Old-School Data Processing (1999-2007) Perl Launcher MetaData SSH C++ Worker C++ Worker C++ Worker C++ Worker NFS NFS NFS NFS NFS /d4/data/20030901/partition/b16.gz /d1/data/20030901/partition/b1.gz 3

Current Analytics Architecture Custom log collection infrastructure depositing onto NFS-based storage Logs moved onto Hadoop HDFS Ø Multiple Hadoop instances Pig/MR ETL processing, massive joins, load into warehouse Aggregations / Report Generation in Pig, MapReduce, Hive Reports loaded into RDBMS UI / web services on top Realtime Stream Processing: Storm on YARN Persistence: Hbase, HDFS/Hcat, RDBMS s 4

Current High-Level Analytics Dataflow Batch Processing / Data Pipelines Mobile Apps Web Pages Pixel Servers Ad Servers Colos Data Movement & Collection ETL / HDFS Staging/ Distribution Stream Processing / Queues Pig / Hive Native MR YARN RDBMS/ NoSQL BI/OLAP Adhoc ML Realtime Apps Real-time Stream Processing 5

Legacy Architecture Pain Points Massive data volumes per day (many, many TB) Pure Hadoop stack throughout Data Wrangling Report arrival latency quite high Ø Hours to perform joins, aggregate data Culprit - Raw data processing through MapReduce just too slow Many stages in pipeline chained together Massive joins throughout ETL layer Lack of interactive SQL Expressibility of business logic in Hadoop MR is challenging New reports and dimensions requires engineering throughout stack 6

Aggregate Pre-computation Problems Problem: Pre-computation of reports Ø How is timespent per user distributed across desktop and mobile for Y! Mail? Ø Extremely high cardinality dimensions, ie, search query term Ø Count distincts Problem: Sheer number of reports along various dimensions Ø Report changes required in aggregate, persistence and UI layer Ø Potentially takes weeks to months Ø Business cannot wait 7

Problem Summary Overwhelming need to make data more interactive Shorten time to data access and report publication Ad-hoc queries need to be much faster than Hive or pure Hadoop MR. Ø Concept of Data Workbench : business specific views into data Expressibility of complicated business logic in Hadoop becoming a problem Ø Various verticals within Yahoo want to interpret metrics differently Need interactive SQL querying No way to perform data discovery (adhoc analysis/exploration) Ø Must always tweak MR Java code or SQL query and rerun big MR job Cultural shift to BI tools on desktop with low latency query performance 8

Where do we go from here? How do we solve this problem within the Hadoop ecosystem? Pig on Tez? Hive on Tez? No clear path yet to making native MR/Pig significantly faster Balance pre-aggregated reporting with high demand for interactive SQL access against fact data via desktop BI tools How do we provide data-savvy users direct SQL-query access to fact data? 9

Modern Architecture: Hadoop + Spark Bet on YARN: Hadoop and Spark can coexist Still using Hadoop MapReduce for ETL Loading data onto HDFS / HCat / Hive warehouse Serving MR queries on large Hadoop cluster Spark-on-YARN side-by-side with Hadoop on same HDFS Optimization: copy data to remote Shark/Spark clusters for predictable SLAs Ø While waiting for Shark on Spark on YARN (Hopefully early 2014) 10

Analytics Stack of the Future Batch Processing / Data Pipelines Mobile Apps Web Pages Pixel Servers Ad Servers Colos Data Movement & Collection ETL / HDFS Staging/ Distribution Stream Processing / Queues Spark View 1 View 2 View n RDBMS/ NoSQL Spark/MR Shark YARN BI/OLAP Adhoc Realtime Apps / Querying Hive Real-time Stream Processing 11

Why Spark? Cultural shift towards data savvy developers in Yahoo Recently, the barrier to entry for big data has been lowered Solves the need for interactive data processing at REPL and SQL levels In-memory data persistence obvious next step due to continual decreasing cost of RAM and SSD s Collections API with high familiarity for Scala devs Developers not restricted by rigid Hadoop MapReduce paradigm Community support accelerating, reaching steady state More than 90 developers, 25 companies Awesome storage solution in HDFS yet processing layer / data manipulation still sub-optimal Hadoop not really built for joins Many problems not Pig / Hive Expressible Slow Seemless integration into existing Hadoop architecture 12

Why Spark? (Continued) Up to 100x faster than Hadoop MapReduce Typically less code (2-5x) Seemless Hadoop/HDFS integration RDDs, Iterative processing, REPL, Data Lineage Accessible Source in terms of LOC and modularity BDAS ecosystem: Spark, Spark Streaming, Shark, BlinkDB, MLlib Deep integration into Hadoop ecosystem Read/write Hadoop formats Interop with other ecosystem components Runs on Mesos & YARN EC2, EMR HDFS, S3 13

Spark BI/Analytics Use Cases Obvious and logical next-generation ETL platform Unwind chained MapReduce job architecture ETL typically a series of MapReduce jobs with HDFS output between stages Move to more fluid data pipeline Java ecosystem means common ETL libraries between realtime and batch ETL Faster execution Lower data publication latency Faster reprocessing times when anomalies discovered Spark Streaming may be next generation realtime ETL Data Discovery / Interactive Analysis 14

Spark Hardware 9.2TB addressable cluster 96GB and 192GB RAM machines 112 Machines SATA 1x500GB 7.2k Dual hexa core Sandy Bridge Looking at SSD exclusive clusters 400GB SSD 1x400GB SATA 300MB/s 15

Why Shark? First identified Shark at Hadoop Summit 2012 After seeing Spark at Hadoop Summit 2011 Common HiveQL provides seemless federation between Hive and Shark Sits on top of existing Hive warehouse data Multiple access vectors pointing at single warehouse Direct query access against fact data from UI Direct (O/J)DBC from desktop BI tools Built on shared common processing platform 16

Yahoo! Shark Deployments / Use Cases Advertising / Analytics Data Warehouse Campaign Reporting Pivots, time series, multi-timezone reporting Segment Reporting Unique users across targeted segments Ad impression availability for given segment Overlap analysis fact to fact overlap Other Time Series Analysis OLAP Tableau on top of Shark Custom in-house cubing and reporting systems Dashboards Adhoc analysis and data discovery 17

Yahoo! Contributions Began work in 2012 on making Shark more usable for interactive analytics/ warehouse scenarios Shark Server for JDBC/ODBC access against Tableau Multi-tenant connectivity Threadsafe access Map Split Pruning Use statistics to prune partitions so jobs don t launch for splits w/o data Bloom filter-based pruner for high cardinality columns Column pruning faster OLAP query performance Map-side joins Cached-table Columnar Compression (3-20x) Query cancellation 18

Physical Architecture Spark / Hadoop MR side-by-side on YARN Satellite Clusters running Shark Predictable SLAs Greedy pinning of RDDs to RAM Addresses scheduling challenges Long-term Shark on Spark-on-YARN Goal: early 2014 Large Hadoop Cluster Hadoop MR (Pig, Hive, MR) Satellite Shark Cluster YARN Spark Historical DW (HDFS) Satellite Shark Cluster 19

Future Architecture Prototype migration of ETL infrastructure to pure Spark jobs Breakup chained MapReduce pattern into single discrete Spark job Port legacy Pig/MR ETL jobs to Spark (TB s / day) Faster processing times (goal of 10x) Less code, better maintainability, all in Scala/Spark Leverage RDDs for more efficient joins Prototype Shark on Spark on YARN on Hadoop cluster Direct data access over JDBC/ODBC via desktop Execute both Shark and Spark queries on YARN Still employ satellite cluster model for predictable SLAs in low-latency situations Use YARN as the foundation for cluster resource management 20

Conclusions Barrier to entry for big data analytics reduced, Spark at the forefront Yahoo! now using Spark/Shark for analytics on top of Hadoop ecosystem Looking to move ETL jobs to Spark Satellite cluster pattern quite beneficial for large datasets in RAM and predictable SLAs Clear and obvious speedup compared to Hadoop More flexible processing platform provides powerful base for analytics for the future 21