Beyond Hadoop with Apache Spark and BDAS


Khanderao Kand, Principal Technologist, Guavus
GITPRO World 2014, Palo Alto, CA, 12 April
Credit: Some statistics and content came from publicly shared spark.apache.org and AMPLab presentations.

Apache Spark
Apache Spark is a fast and general engine for large-scale data processing.
- Fast
- Easy to use
- Supports a mix of interaction models

Big Data Systems Today
- Systems: MapReduce, Tez, Pregel, Giraph, GraphLab, Dremel, Impala, Drill, Storm, S4, Samza
- Different models: batch, iterative, interactive, and streaming

Apache Spark: One Framework
Three programming models in one unified framework: batch, streaming, and interactive.

What is Spark?
- Open source, from the AMPLab at UC Berkeley
- Fast, interactive big data processing engine
- Not a modified version of Hadoop
- Compatible with Hadoop's storage APIs: can read/write to any Hadoop-supported system, including HDFS, HBase, SequenceFiles, etc.
- Separate, fast, MapReduce-like engine
  - In-memory data storage for very fast iterative queries
  - General execution graphs and powerful optimizations
  - Up to 40x faster than Hadoop for iterative cases
- Faster data sharing across parallel jobs, as multi-stage and interactive apps require
- Efficient fault-tolerance model

Spark History
- From the AMPLab at UC Berkeley
- Started as a research project in 2009
- Open sourced in 2010, with a growing community since
- Entered the Apache Incubator in June 2013; version 1.0 is now coming out
- Commercial backing: Databricks
- Enhancements by current stacks: Cloudera, MapR
- Additional contributions: Yahoo, Conviva, Ooyala, ...

Contributions
- AMPLab, Databricks
- YARN support (Yahoo!)
- Columnar compression in Shark (Yahoo!)
- Fair scheduling (Intel)
- Metrics reporting (Intel, Quantifind)
- New RDD operators (Bizo, ClearStory)
- Scala 2.10 support (Imaginea)

How does data sharing work in Hadoop MapReduce jobs?
[Figure: Input -> HDFS read -> iter. 1 -> HDFS write -> HDFS read -> iter. 2 -> HDFS write -> ...]
Result: Hadoop MapReduce is slow because of:
1. Replication
2. Serialization
3. Disk I/O

Spark is faster due to in-memory data sharing across jobs
[Figure: Input -> iter. 1 -> iter. 2 -> ..., with data kept in memory between iterations]
[Figure: Input -> one-time processing -> distributed memory -> query 1, query 2, query 3, ...]
In some cases, 10-100x faster than network and disk.
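
The difference between the two data-sharing patterns can be sketched with a toy model. This is an illustration only, not Spark or Hadoop code: `CountingStore` is a hypothetical stand-in for HDFS that simply counts how often the input is read.

```python
class CountingStore:
    """Hypothetical stand-in for HDFS: counts how many times the data is read."""

    def __init__(self, records):
        self._records = list(records)
        self.reads = 0

    def read(self):
        self.reads += 1
        return list(self._records)


def iterate_from_storage(store, iterations):
    """MapReduce-style: every iteration reloads its input from storage."""
    total = 0
    for _ in range(iterations):
        total += sum(store.read())
    return total


def iterate_from_memory(store, iterations):
    """Spark-style: load once, keep the working set in memory, reuse it."""
    cached = store.read()
    total = 0
    for _ in range(iterations):
        total += sum(cached)
    return total


disk_store = CountingStore(range(5))
mem_store = CountingStore(range(5))
print(iterate_from_storage(disk_store, 3), disk_store.reads)  # same result, 3 reads
print(iterate_from_memory(mem_store, 3), mem_store.reads)     # same result, 1 read
```

Both functions compute the same answer; only the number of trips to storage differs, which is where replication, serialization, and disk I/O costs would be paid.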

Spark is Faster and Easier
- Extremely fast scheduling: milliseconds to seconds in Spark vs. seconds, minutes, or hours in Hadoop MR
- Many more useful primitives
  - Higher-level APIs for more than map, reduce, and filter
  - Broadcast and accumulator variables
- Combining programming models

Result: Spark vs Hadoop Performance
PageRank performance, iteration time (s):
- Hadoop: 171
- Basic Spark: 72
- Spark + controlled partitioning: 23
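
As a quick worked calculation, the iteration times on this slide translate into speedup factors relative to Hadoop. The dictionary below just restates the slide's three numbers; the variable names are chosen here for illustration.

```python
# Iteration times in seconds, as reported on the slide.
iteration_seconds = {
    "Hadoop": 171,
    "Basic Spark": 72,
    "Spark + controlled partitioning": 23,
}

baseline = iteration_seconds["Hadoop"]
speedups = {name: baseline / secs for name, secs in iteration_seconds.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x")
```

On these figures, basic Spark is roughly 2.4x faster per iteration than Hadoop, and controlled partitioning brings that to roughly 7.4x.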

Apache Spark
- Unifies batch, streaming, and interactive computation
- Easy to build sophisticated applications
- Supports iterative, graph-parallel algorithms
- Powerful APIs in Scala, Python, Java
[Figure: BDAS stack. Spark Streaming (streaming), BlinkDB and Shark SQL (batch, interactive), GraphX (graph), and MLlib (machine learning) run on Spark Core (RDD, MR), over Mesos / YARN and HDFS.]

RDD
- Read-only, parallelized, fault-tolerant data set
- Created from: files, parallelizing a collection, or a transformation of another RDD
- Spark tracks lineage, which is used to recreate data upon failure
- Late realization: data is only materialized when needed

RDDs track Lineage
RDD fault tolerance: RDDs track the transformations used to build them (their lineage) so that lost data can be recomputed.
E.g.:
    messages = sc.textFile(...).filter(lambda s: "Spark" in s).map(lambda s: s.split("\t")[2])
Lineage chain:
    HadoopRDD (path = hdfs://...) -> FilteredRDD (func = contains(...)) -> MappedRDD (func = split(...))
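
The lineage idea can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not Spark's implementation: `ToyRDD` and its methods are hypothetical names, and the sample log lines are made up for illustration.

```python
class ToyRDD:
    """Toy dataset that records how it was built instead of storing results."""

    def __init__(self, source=None, parent=None, op=None, func=None):
        self.source = source  # callable returning the base records (root only)
        self.parent = parent  # the dataset this one was derived from
        self.op = op          # "filter" or "map"
        self.func = func

    def filter(self, func):
        return ToyRDD(parent=self, op="filter", func=func)

    def map(self, func):
        return ToyRDD(parent=self, op="map", func=func)

    def lineage(self):
        """Names of the steps from the root to this dataset."""
        if self.parent is None:
            return ["source"]
        return self.parent.lineage() + [self.op]

    def compute(self):
        """Replay the lineage; nothing runs until this is called."""
        if self.parent is None:
            return list(self.source())
        rows = self.parent.compute()
        if self.op == "filter":
            return [r for r in rows if self.func(r)]
        return [self.func(r) for r in rows]


# Made-up, tab-separated log lines standing in for a file.
lines = ["INFO\tboot\tSpark started", "WARN\tdisk\tlow space", "INFO\tjob\tSpark done"]

messages = (ToyRDD(source=lambda: lines)
            .filter(lambda s: "Spark" in s)
            .map(lambda s: s.split("\t")[2]))

print(messages.lineage())  # ['source', 'filter', 'map']
print(messages.compute())  # ['Spark started', 'Spark done']
```

Because `messages` holds only the recipe, a lost result can be rebuilt by calling `compute()` again, which is the essence of lineage-based recovery.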

Richer Programming Constructs
- Not just map, reduce, filter
- But map, flatMap, filter, count, reduce, groupByKey, reduceByKey, sortByKey, join

Shared Variables
Two types: broadcast variables and accumulators.
- Broadcast: lookup-type shared dataset; can be large (10s to 100s of GB)
- Accumulators: additive; similar to counters in MapReduce
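
A minimal single-process sketch of the two kinds of shared variable follows. It only models the usage pattern (tasks read the broadcast value and add to the accumulator); `ToyAccumulator`, `run_task`, and the sample data are hypothetical and do not use Spark's API.

```python
from types import MappingProxyType


class ToyAccumulator:
    """Add-only counter: tasks call add(), the driver reads value at the end."""

    def __init__(self):
        self.value = 0

    def add(self, amount):
        self.value += amount


def run_task(partition, lookup, unknown):
    """Translate codes using the broadcast lookup table; count the misses."""
    out = []
    for code in partition:
        if code in lookup:
            out.append(lookup[code])
        else:
            unknown.add(1)
    return out


# Broadcast-style value: a read-only view shared by every task.
country_names = MappingProxyType({"us": "United States", "in": "India"})
unknown_codes = ToyAccumulator()

partitions = [["us", "xx"], ["in", "us", "yy"]]
results = [run_task(p, country_names, unknown_codes) for p in partitions]

print(results)              # [['United States'], ['India', 'United States']]
print(unknown_codes.value)  # 2
```

The lookup table is never modified by tasks, and the accumulator is only ever added to, which is what lets a real cluster ship one copy of the former to each node and merge the latter safely.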

Scala Shell: Interactive Big Data Programming
    scala> val file = sc.textFile("mytext.data")
    scala> val words = file.flatMap(line => line.split(" "))
    scala> words.map(word => (word, 1)).reduceByKey(_ + _)
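
To make the word-count steps concrete, here is a plain-Python sketch of what a flatMap followed by a reduceByKey computes. The helpers `flat_map` and `reduce_by_key` are written here for illustration and run on ordinary lists; they are not Spark functions, and the two input lines are made up.

```python
from itertools import chain


def flat_map(func, records):
    """Apply func to each record and flatten the resulting lists into one."""
    return list(chain.from_iterable(func(r) for r in records))


def reduce_by_key(func, pairs):
    """Merge the values of each key with func, one pair at a time."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


lines = ["to be or", "not to be"]
words = flat_map(lambda line: line.split(" "), lines)
counts = reduce_by_key(lambda a, b: a + b, [(word, 1) for word in words])

print(words)   # ['to', 'be', 'or', 'not', 'to', 'be']
print(counts)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

Using a plain map instead of a flat map in the first step would produce one list per line rather than one word per record, so the later per-word counting would not work.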

Spark Streaming
- Micro-batches: a series of very small, deterministic batch jobs
- A micro-batch is a small (seconds-long) batch of the live stream
- Micro-batches are treated as RDDs
- Same RDD programming model and operations
- Results are returned as batches
- Recovery based on lineage and checkpoints
[Figure: live data stream -> Spark Streaming -> batches of X seconds -> Spark -> processed results]
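
The micro-batch idea can be illustrated by chopping a timestamped event list into fixed-length batches and running the same batch function on each. This is a toy sketch, assuming non-negative timestamps in seconds; `micro_batches` and the sample events are hypothetical, not part of Spark Streaming.

```python
def micro_batches(events, batch_seconds):
    """Group (timestamp, value) events into consecutive fixed-length batches."""
    batches = {}
    for timestamp, value in events:
        index = int(timestamp // batch_seconds)
        batches.setdefault(index, []).append(value)
    if not batches:
        return []
    # Include empty batches so every interval yields a (possibly empty) result.
    return [batches.get(i, []) for i in range(max(batches) + 1)]


events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (3.5, "d")]
batches = micro_batches(events, batch_seconds=1)

# The same batch-style operation is applied to every micro-batch.
counts_per_batch = [len(batch) for batch in batches]

print(batches)           # [['a', 'b'], ['c'], [], ['d']]
print(counts_per_batch)  # [2, 1, 0, 1]
```

Each inner list plays the role of one small dataset, which is why the batch programming model carries over to streams unchanged.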

Comparison with Storm and S4
Higher throughput than Storm:
- Spark Streaming: 670k records/second/node
- Storm: 115k records/second/node
- Apache S4: 7.5k records/second/node
[Figure: throughput per node (MB/s) vs. record size (100 and 1000 bytes) for Grep and WordCount, Spark vs. Storm]
Ref: AMPLab, Tathagata Das

DStream Data Sources
- Many sources out of the box: HDFS, Kafka, Flume, Twitter, TCP sockets, Akka actors, ZeroMQ
- Extensible

Transformations
Build new streams from existing streams.
- RDD-like operations: map, flatMap, filter, count, reduce, groupByKey, reduceByKey, sortByKey, join, etc.
- New window and stateful operations: window, countByWindow, reduceByWindow, countByValueAndWindow, reduceByKeyAndWindow, updateStateByKey, etc.

Sample Program: Find the top 10 trending hashtags in the last 10 min
    // Create the stream of tweets
    val tweets = ssc.twitterStream(<username>, <password>)
    // Count the tags over a 10-minute window, sliding every second
    val tagCounts = tweets.flatMap(status => getTags(status)).countByValueAndWindow(Minutes(10), Seconds(1))
    // Sort the tags by counts, highest first
    val sortedTags = tagCounts.map { case (tag, count) => (count, tag) }.transform(_.sortByKey(false))
    // Show the top 10 tags
    sortedTags.foreach(showTopTags(10) _)
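
What a count-by-value-over-a-window step computes can be sketched on plain lists. This is a toy model, not Spark Streaming code: the window is measured in number of micro-batches rather than minutes, and `count_by_value_and_window`, `top_tags`, and the sample hashtags are all hypothetical.

```python
from collections import Counter


def count_by_value_and_window(batches, window_batches):
    """For each batch, count values over the most recent window_batches batches."""
    results = []
    for end in range(1, len(batches) + 1):
        start = max(0, end - window_batches)
        window = Counter()
        for batch in batches[start:end]:
            window.update(batch)
        results.append(window)
    return results


def top_tags(counts, k):
    """Return the k most frequent (tag, count) pairs, ties broken by tag name."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


tag_batches = [["#spark", "#bdas"], ["#spark"], ["#hadoop", "#spark"], ["#hadoop"]]
windows = count_by_value_and_window(tag_batches, window_batches=2)

print(top_tags(windows[1], 2))  # [('#spark', 2), ('#bdas', 1)]
print(top_tags(windows[3], 1))  # [('#hadoop', 2)]
```

The sketch recounts each window from scratch for clarity; a production system would typically update counts incrementally as batches enter and leave the window.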

Storm + Fusion Convergence Twitter Model

Shark
High-speed in-memory analytics over Hadoop (HDFS) and Hive data
- Common HiveQL provides seamless federation between Hive and Shark
- Sits on top of existing Hive warehouse data
- Direct query access from UIs like Tableau
- Direct ODBC/JDBC from desktop BI tools

Shark
- Analytic query engine compatible with Hive: HiveQL, UDFs, SerDes, scripts, types
- Makes Hive queries run much faster
  - Builds on top of Spark
  - Caching data
  - Optimizations

Shark Internals
- A faster execution engine than Hadoop MapReduce
- Optimized storage format: columnar memory store
- Various other optimizations
  - Fully distributed sort
  - Data co-partitioning
  - Partition pruning
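
Two of the ideas listed here, a columnar in-memory layout and partition pruning, can be sketched together in plain Python. This is an illustrative toy under stated assumptions, not Shark's implementation: the function names, the min/max statistics scheme, and the sample rows are all hypothetical.

```python
def build_partitions(rows, size):
    """Split rows into partitions; store each column separately with min/max stats."""
    partitions = []
    for start in range(0, len(rows), size):
        chunk = rows[start:start + size]
        columns = {name: [row[name] for row in chunk] for name in chunk[0]}
        stats = {name: (min(vals), max(vals)) for name, vals in columns.items()}
        partitions.append({"columns": columns, "stats": stats})
    return partitions


def select_where_between(partitions, column, low, high, project):
    """Return `project` values where low <= column <= high, plus partitions scanned."""
    scanned = 0
    out = []
    for part in partitions:
        col_min, col_max = part["stats"][column]
        if col_max < low or col_min > high:
            continue  # pruned: no row in this partition can match
        scanned += 1
        # Only the two columns involved are touched, not whole rows.
        for key, value in zip(part["columns"][column], part["columns"][project]):
            if low <= key <= high:
                out.append(value)
    return out, scanned


rows = [{"day": d, "clicks": d * 10} for d in range(1, 9)]
parts = build_partitions(rows, size=4)  # days 1-4 and days 5-8
values, scanned = select_where_between(parts, "day", 6, 7, "clicks")

print(values, scanned)  # [60, 70] 1
```

The query reads one of the two partitions and only two columns within it, which is the kind of saving these optimizations aim for.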

Shark Performance Comparison

Performance
[Figure: three panels. Streaming: throughput (MB/s/node) for Storm and Spark. SQL: response time (s) for Hive, Impala (disk), Impala (mem), Shark (disk), Shark (mem). Graph: response time (min) for Hadoop, Giraph, GraphX.]

MLBase Inventory
- Classification: NaiveBayes, SVM, SVMWithSGD
- Clustering: KMeans
- Regression: Linear, Lasso, RidgeRegressionWithSGD
- Optimization: GradientDescent, LogisticGradient
- Recommendation: CollaborativeFiltering / ALS
- Others: MatrixFactorization