Moving From Hadoop to Spark




+ Moving From Hadoop to Spark Sujee Maniyam Founder / Principal @ www.elephantscale.com sujee@elephantscale.com Bay Area ACM meetup (2015-02-23)

+ Hi! Featured in Hadoop Weekly #109

+ About Me: Sujee Maniyam
- 15+ years of software development experience
- Consulting & training in Big Data
- Author: Hadoop Illuminated (open-source book); HBase Design Patterns (coming soon)
- Open source contributor (including HBase): http://github.com/sujee
- Founder / organizer of the Big Data Guru meetup: http://www.meetup.com/bigdatagurus/
- http://sujee.net/
- Contact: sujee@elephantscale.com

+ Hadoop in 20 Seconds
- The big data platform
- Very well field-tested
- Scales to petabytes of data
- MapReduce: batch-oriented compute

+ Hadoop Ecosystem (diagram: real-time and batch components) ElephantScale.com, 2014

+ Hadoop Ecosystem
- HDFS: provides distributed storage
- MapReduce: provides distributed computing
- Pig: high-level MapReduce
- Hive: SQL layer over Hadoop
- HBase: NoSQL storage for real-time queries
(ElephantScale.com, 2014)

+ Spark in 20 Seconds
- Fast & expressive cluster computing engine
- Compatible with Hadoop
- Came out of the Berkeley AMPLab
- Now an Apache project
- Version 1.2 just released (Dec 2014)
"The first big data platform to integrate batch, streaming and interactive computations in a unified framework" (stratio.com)

+ Spark Eco-System
- Spark SQL (schema / SQL)
- Spark Streaming (real time)
- MLlib (machine learning)
- GraphX (graph processing)
- All built on Spark Core, which runs on cluster managers: standalone, YARN, Mesos

+ Hype-o-meter :)

+ Spark Job Trends

+ Spark Benchmarks Source : stratio.com

+ Spark Code / Activity Source : stratio.com

+ Timeline : Hadoop & Spark

+ Hadoop vs. Spark (image source: http://www.kwigger.com/mit-skifte-til-mac/)

+ Comparison With Hadoop

Hadoop:
- Distributed storage + distributed compute
- MapReduce framework
- Data usually on disk (HDFS)
- Not ideal for iterative work
- Batch processing

Spark:
- Distributed compute only; generalized computation
- Data on disk or in memory
- Great at iterative workloads (machine learning, etc.)
  - Up to 2x-10x faster for data on disk
  - Up to 100x faster for data in memory
- Compact code; Java, Python, Scala supported
- Shell for ad-hoc exploration

+ Hadoop + YARN: Universal OS for Distributed Compute
- Applications: batch (MapReduce), streaming (Storm, S4), in-memory (Spark)
- YARN: cluster management
- HDFS: storage

+ Spark Is Better Fit for Iterative Workloads

+ Spark Programming Model
- More general than MapReduce
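For instance, the classic word count is a few lines in the RDD API (shown spark-shell style, where `sc` is the shell's pre-built SparkContext; the input path is a placeholder), versus a mapper class, a reducer class and a driver in MapReduce:

```scala
// Word count with the RDD API (spark-shell; sc is provided by the shell).
val lines = sc.textFile("hdfs:///data/input.txt")   // placeholder path
val counts = lines
  .flatMap(line => line.split(" "))   // split lines into words
  .map(word => (word, 1))             // pair each word with a count of 1
  .reduceByKey(_ + _)                 // sum the counts per word
counts.take(10).foreach(println)
```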

+ Is Spark Replacing Hadoop?
- Spark runs on Hadoop / YARN
- They are complementary
- Spark's programming model is more flexible than MapReduce
- Spark is really great if data fits in memory (a few hundred gigs)
- Spark is storage agnostic (see next slide)

+ Spark & Pluggable Storage (diagram: Spark as the compute engine over HDFS, Amazon S3, Cassandra, ...)

+ Spark & Hadoop Use Cases
- Batch processing: Hadoop MapReduce (Java, Pig, Hive) → Spark RDDs (Java / Scala / Python)
- SQL querying: Hive → Spark SQL
- Stream processing / real-time processing: Storm, Kafka → Spark Streaming
- Machine learning: Mahout → Spark MLlib
- Real-time lookups: NoSQL (HBase, Cassandra, etc.) → no Spark component, but Spark can query data in NoSQL stores

+ Hadoop & Spark Future???

+ Why Move From Hadoop to Spark?
- Spark is easier than Hadoop; friendlier for data scientists / analysts
- Interactive shell: fast development cycles, ad-hoc exploration
- API supports multiple languages: Java, Scala, Python
- Great for small (gigs) to medium (100s of gigs) data

+ Spark: Unified Stack
- Spark supports multiple programming models: MapReduce-style batch processing, streaming / real-time processing, querying via SQL, machine learning
- All modules are tightly integrated; facilitates rich applications
- Spark can be the only stack you need! No need to run multiple clusters (a Hadoop cluster, a Storm cluster, etc.)
(Image: buymeposters.com)
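As a sketch of how integrated the modules are, a Spark Streaming job reuses the same batch-style operators (spark-shell style; the host/port is a placeholder for any line-oriented socket source):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Wrap the existing SparkContext in a streaming context with 10s batches.
val ssc = new StreamingContext(sc, Seconds(10))

// Word count over a live stream, with the same operators as batch RDDs.
val lines = ssc.socketTextStream("localhost", 9999)   // placeholder source
lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

ssc.start()
ssc.awaitTermination()
```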

+ Migrating From Hadoop → Spark
- Distributed storage: HDFS → cloud storage such as Amazon S3, or NFS mounts
- SQL querying: Hive → Spark SQL
- ETL workflow: Pig → Spork (Pig on Spark), or a mix of Spark SQL, etc.
- Machine learning: Mahout → MLlib
- NoSQL DB: HBase → ???

+ Moving From Hadoop → Spark
1. Data size
2. File system
3. SQL
4. ETL
5. Machine learning

+ Hadoop to Spark (diagram: Spark can help with both real-time and batch workloads) ElephantScale.com, 2014

+ Big Data

+ Data Size: You Don't Have Big Data

+ 1) Data Size (T-shirt sizing) (diagram: a scale from < a few GB, 10 GB+, 100 GB+, 1 TB+, 100 TB+ to PB+, with Spark at the small end and Hadoop at the large end) Image credit: blog.trumpi.co.za

+ 1) Data Size
- Lots of Spark adoption at SMALL / MEDIUM scale
- Good fit: data might fit in memory! Hadoop may be overkill
- Applications: iterative workloads (machine learning, etc.), streaming
- Hadoop is still the preferred platform for TB+ data

+ Next : 2) File System ElephantScale.com, 2014

+ 2) File System
- Hadoop = storage + compute; Spark = compute only, so Spark needs a distributed file system
- File system choices for Spark:
  - HDFS (Hadoop Distributed File System): reliable, good performance (data locality), field-tested for PBs of data
  - S3 (Amazon): reliable cloud storage, huge scale
  - NFS (Network File System): a shared FS across machines

+ Spark File Systems

+ File Systems For Spark

                HDFS                    NFS            Amazon S3
Data locality   High (best)             Local enough   None (ok)
Throughput      High (best)             Medium (good)  Low (ok)
Latency         Low (best)              Low            High
Reliability     Very high (replicated)  Low            Very high
Cost            Varies                  Varies         $30 / TB / month

+ File System Throughput Comparison (HDFS vs. S3)
- Data: 10+ GB (11.3 GB); each file ~1+ GB (x 10); 400 million records total
- Partition size: 128 MB
- Same data on HDFS & S3
- Cluster: 8 nodes on Amazon m3.xlarge (4 CPU, 15 GB memory, 40 GB SSD)
- Hadoop cluster: latest Hortonworks HDP v2.2
- Spark: on the same 8 nodes, standalone, v1.2

+ File System Throughput Comparison (HDFS vs. S3)

val hdfs = sc.textFile("hdfs:/// /10G/")
val s3 = sc.textFile("s3n:// /10G/")

// count # records
hdfs.count()
s3.count()

+ HDFS Vs. S3

+ HDFS Vs. S3 (lower is better)

+ HDFS vs. S3: Conclusions

HDFS:
- Data locality → much higher throughput
- Need to maintain a Hadoop cluster
- Good for large data sets (TB+)

S3:
- Data is streamed → lower throughput
- No Hadoop cluster to maintain → convenient
- Good use case: smallish data sets (a few gigs); load once, cache, and re-use

+ Next : 3) SQL ElephantScale.com, 2014

+ 3) SQL in Hadoop / Spark

                  Hadoop               Spark
Engine            Hive                 Spark SQL
Language          HiveQL               HiveQL; RDD programming in Java / Python / Scala
Scale             Petabytes            Terabytes?
Interoperability                       Can read Hive tables or standalone data
Formats           CSV, JSON, Parquet   CSV, JSON, Parquet

+ SQL in Hadoop / Spark

Input: billing records / CDRs

Timestamp       Customer_id  Resource_id  Qty   Cost
(milliseconds)  (string)     (int)        (int) (int)
1000            1            Phone        10    10c
1003            2            SMS          1     4c
1005            1            Data         3M    5c

- Query: find the top-10 customers
- Data set: 10+ GB, 400 million records, CSV format

+ SQL in Hadoop / Spark

Hive table:

CREATE EXTERNAL TABLE billing (
  ts BIGINT, customer_id INT, resource_id INT, qty INT, cost INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'hdfs location';

Hive query (simple aggregate):

SELECT customer_id, SUM(cost) AS total
FROM billing
GROUP BY customer_id
ORDER BY total DESC
LIMIT 10;

+ Hive Query Results

+ Spark + Hive Table

Spark code to access the Hive table:

import org.apache.spark.sql.hive.HiveContext
val hiveCtx = new HiveContext(sc)
val top10 = hiveCtx.sql(
  "SELECT customer_id, SUM(cost) AS total FROM billing " +
  "GROUP BY customer_id ORDER BY total DESC LIMIT 10")
top10.collect()

+ Spark SQL Vs. Hive Fast on same HDFS data!

+ SQL in Hadoop / Spark: Conclusions
- Spark can readily query Hive tables
- Speed! Great for exploring / trying out; fast iterative development
- Spark can also load data natively:
  - CSV
  - JSON (schema automatically inferred)
  - Parquet (schema automatically inferred)
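A hedged sketch of the native-loading path in the Spark 1.x API this deck targets (spark-shell style; the file path is a placeholder, and the schema is whatever Spark infers from the JSON records):

```scala
import org.apache.spark.sql.SQLContext

val sqlCtx = new SQLContext(sc)

// Load JSON; the schema is inferred automatically from the records.
val billing = sqlCtx.jsonFile("hdfs:///data/billing.json")   // placeholder path
billing.printSchema()

// Register it as a table and query it with SQL.
billing.registerTempTable("billing")
sqlCtx.sql("SELECT customer_id, SUM(cost) AS total FROM billing " +
           "GROUP BY customer_id ORDER BY total DESC LIMIT 10").collect()
```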

+ Next: 4) ETL in Hadoop / Spark ElephantScale.com, 2014

+ ETL? (diagram: multiple raw data sets transformed into clean data)

+ 4) ETL on Hadoop / Spark

           Hadoop                   Spark
ETL tools  Pig, Cascading, Oozie    Native RDD programming (Scala, Java, Python)
Pig        High-level ETL workflow  Spork: Pig on Spark
Cascading  High-level               Spark-scalding

+ ETL on Hadoop / Spark
- Pig
  - High-level, expressive data flow language (Pig Latin)
  - Easier to program than Java MapReduce
  - Used for ETL (data cleanup / data prep)
- Spork: run Pig on Spark (as simple as $ pig -x spark ...)
  - https://github.com/sigmoidanalytics/spork
- Cascading
  - High-level data flow declarations
  - Many sources (Cassandra / Accumulo / Solr)
- Spark-Scalding
  - https://github.com/tresata/spark-scalding

+ ETL on Hadoop / Spark: Conclusions
- Try Spork or spark-scalding
  - Code re-use; no re-writing from scratch
- Or program RDDs directly
  - More flexible
  - Multiple language support: Scala / Java / Python
  - Simpler / faster in some cases

+ 5) Machine Learning: Hadoop / Spark

                      Hadoop (Mahout)   Spark (MLlib)
API                   Java              Java / Scala / Python
Iterative algorithms  Slower            Very fast (in memory)
In-memory processing  No*               Yes (lots of momentum!)

* though there are efforts to port Mahout onto Spark
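To make the MLlib side concrete, here is a minimal k-means sketch against the MLlib RDD API of that era (spark-shell style; the input path and the choice of k are placeholders):

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse a CSV of numeric features into vectors; cache because k-means
// is iterative and re-reads the data on every pass.
val points = sc.textFile("hdfs:///data/features.csv")   // placeholder path
  .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
  .cache()

// Train k-means: 10 clusters, 20 iterations.
val model = KMeans.train(points, 10, 20)
println("Cost: " + model.computeCost(points))
```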

+ Spark Is Better Fit for Iterative Workloads

+ Spark Caching!
- Reading data from a remote FS (e.g. S3) can be slow
- For small / medium data (10s to 100s of GB), use caching
- Pay the read penalty once, cache, then get very high-speed (in-memory) computes
- Recommended for iterative workloads
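A minimal caching sketch (spark-shell style; the bucket name is a placeholder):

```scala
// Load once from S3 (slow) and mark the RDD for in-memory caching.
val events = sc.textFile("s3n://my-bucket/events/").cache()

// The first action pays the S3 read cost and populates the cache.
events.count()

// Subsequent actions run against the cached, in-memory data.
events.filter(_.contains("ERROR")).count()
```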

+ Caching Demo!

+ Caching Results Cached!

+ Spark Caching
- Caching is pretty effective for small / medium data sets
- Cached data cannot be shared across applications (each application executes in its own sandbox)

+ Sharing Cached Data
- 1) Spark Job Server
  - A multiplexer: all requests are executed through the same context
  - Provides a web-service interface
- 2) Tachyon
  - Distributed in-memory file system; "memory is the new disk!"
  - Out of the AMPLab, Berkeley
  - Early stages (very promising)

+ Spark Job Server

+ Spark Job Server
- Open sourced from Ooyala
- "Spark as a Service": a simple REST interface to launch jobs
- Sub-second latency!
- Pre-load jars for even faster spin-up
- Share cached RDDs across requests (NamedRDD), e.g.:
  App1: ctx.saveRDD("my cached rdd", rdd1)
  App2: val rdd2 = ctx.loadRDD("my cached rdd")
- https://github.com/spark-jobserver/spark-jobserver

+ Tachyon + Spark
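In code, the pairing is simple because Tachyon exposes a Hadoop-compatible URI scheme; a hedged sketch (master host, port and paths are placeholders):

```scala
// Spark reads and writes Tachyon like any Hadoop-compatible file system.
// Data stored here lives outside the Spark application, so it can be
// shared across applications.
val events = sc.textFile("tachyon://tachyon-master:19998/data/events")
events.count()
events.saveAsTextFile("tachyon://tachyon-master:19998/data/events-out")
```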

+ Next : New Big Data Applications With Spark

+ Big Data Applications: Now
- Analysis is done in batch mode (minutes / hours)
- Final results are stored in a real-time data store like Cassandra / HBase
- These results are displayed in a dashboard / web UI
- Doing interactive analysis? Needs special BI tools

+ With Spark
- Load a data set (gigabytes) from S3 and cache it (one time)
- Super-fast (sub-second) queries against the data
- Response time: seconds (just like a web app!)

+ Lessons Learned
- Build sophisticated apps
  - Web response times (a few seconds)!
  - In-depth analytics
- Leverage existing libraries in Java / Scala / Python
- "Data analytics as a service"

+ Final Thoughts
- Already on Hadoop?
  - Try Spark side-by-side: process some data in HDFS, try Spark SQL on Hive tables
- Contemplating Hadoop?
  - Try Spark (standalone); choose an NFS or S3 file system
- Take advantage of caching for iterative loads
  - Spark Job Server, Tachyon
- Build a new class of big / medium data apps

+ Thanks! Sujee Maniyam sujee@elephantscale.com http://elephantscale.com Expert consulting & training in Big Data (Now offering Spark training)