Unified Big Data Processing with Apache Spark. Matei Zaharia @matei_zaharia




What is Apache Spark? Fast & general engine for big data processing Generalizes MapReduce model to support more types of processing Most active open source project in big data

About Databricks Founded by the creators of Spark in 2013 Continues to drive open source Spark development, and offers a cloud service (Databricks Cloud) Partners to support Spark with Cloudera, MapR, Hortonworks, DataStax

Spark Community [Charts: commits and lines of code changed over the past 6 months — Spark leads MapReduce, YARN, HDFS, and Storm on both measures]

Community Growth [Chart: contributors per month to Spark, 2010-2014, growing from a handful to about 100] 2-3x more activity than Hadoop, Storm, MongoDB, NumPy, D3, Julia, ...

Overview Why a unified engine? Spark execution model Why was Spark so general? What's next

History: Cluster Programming Models [Timeline starting in 2004]

MapReduce A general engine for batch processing

Beyond MapReduce MapReduce was great for batch processing, but users quickly needed to do more: > More complex, multi-pass algorithms > More interactive ad-hoc queries > More real-time stream processing Result: many specialized systems for these workloads

Big Data Systems Today [Diagram: MapReduce (general batch processing) alongside Pregel, Giraph, Impala, Storm, Dremel, Drill, Presto, S4, ... (specialized systems for new workloads)]

Problems with Specialized Systems More systems to manage, tune, deploy Can't combine processing types in one application > Even though many pipelines need to do this! > E.g. load data with SQL, then run machine learning In many pipelines, data exchange between engines is the dominant cost!

Big Data Systems Today [Diagram: MapReduce (general batch processing) alongside Pregel, Giraph, Impala, Storm, Dremel, Drill, Presto, S4, ... (specialized systems for new workloads), with a question mark where a unified engine would sit]

Overview Why a unified engine? Spark execution model Why was Spark so general? What's next

Background Recall 3 workloads were issues for MapReduce: > More complex, multi-pass algorithms > More interactive ad-hoc queries > More real-time stream processing While these look different, all 3 need one thing that MapReduce lacks: efficient data sharing

Data Sharing in MapReduce [Diagram: each iteration reads its input from HDFS and writes results back (HDFS read → iter. 1 → HDFS write → HDFS read → iter. 2 → ...); each ad-hoc query re-reads the same input from HDFS] Slow due to data replication and disk I/O

What We'd Like [Diagram: one-time processing loads the input into distributed memory; iterations and ad-hoc queries then share data in memory] 10-100× faster than network and disk

Spark Model Resilient Distributed Datasets (RDDs) > Collections of objects that can be stored in memory or disk across a cluster > Built via parallel transformations (map, filter, ...) > Fault-tolerant without replication
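To make the model concrete, here is a minimal RDD sketch in Scala (a sketch only: the data is made up, and an existing SparkContext named sc is assumed):

// Minimal RDD sketch (assumes an existing SparkContext `sc`; data is illustrative)
val nums = sc.parallelize(1 to 1000000)       // distribute a local collection
val evens = nums.filter(_ % 2 == 0)           // transformations are lazy
  .map(_ * 2)
evens.cache()                                 // keep the result in memory
println(evens.count())                        // first action computes and caches
println(evens.take(5).mkString(", "))         // later actions reuse the cached data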

Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns:
lines = spark.textFile("hdfs://...")                      # base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))    # transformed RDD
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()
messages.filter(lambda s: "foo" in s).count()             # action
messages.filter(lambda s: "bar" in s).count()
...
Full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data) [Diagram: the driver ships tasks to workers; each worker caches its block of messages and returns results]

Fault Tolerance RDDs track lineage info to rebuild lost data
file.map(lambda rec: (rec.type, 1)) \
    .reduceByKey(lambda x, y: x + y) \
    .filter(lambda (type, count): count > 10)
[Diagram: input file → map → reduce → filter; a lost partition is recomputed by replaying this lineage]

Example: Logistic Regression [Chart: running time (s) vs. number of iterations (1 to 30), Hadoop vs. Spark] Hadoop: 110 s / iteration Spark: 80 s first iteration, 1 s later iterations
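The gap comes from loading the training data once and iterating over it in memory. A rough Scala sketch of that pattern (the parsing and gradient formula here are illustrative, not the benchmark's exact code):

// Iterative logistic regression sketch (assumes SparkContext `sc`; math is illustrative)
case class Point(x: Array[Double], y: Double)
def parsePoint(line: String): Point = {
  val parts = line.split(' ').map(_.toDouble)
  Point(parts.tail, parts.head)               // label first, features after
}
val points = sc.textFile("hdfs://.../data.txt").map(parsePoint).cache()
// ^ read from HDFS once; every later pass scans the in-memory copy
var w = Array.fill(10)(0.0)                   // weight vector (10 features here)
for (i <- 1 to 20) {
  val gradient = points.map { p =>
    val dot = (w, p.x).zipped.map(_ * _).sum
    val scale = (1.0 / (1.0 + math.exp(-p.y * dot)) - 1.0) * p.y
    p.x.map(_ * scale)
  }.reduce((a, b) => (a, b).zipped.map(_ + _))
  w = (w, gradient).zipped.map(_ - _)         // take a gradient step on the driver
}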

Spark in Scala and Java
// Scala:
val lines = sc.textFile(...)
lines.filter(s => s.contains("ERROR")).count()
// Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(s -> s.contains("ERROR")).count();

How General Is It?

Libraries Built on Spark [Stack diagram: Spark SQL (relational), Spark Streaming (real-time), MLlib (machine learning), and GraphX (graph), all on top of Spark Core]

Spark SQL Represents tables as RDDs Tables = Schema + Data

Spark SQL Represents tables as RDDs Tables = Schema + Data = SchemaRDD
From Hive:
c = HiveContext(sc)
rows = c.sql("select text, year from hiveTable")
rows.filter(lambda r: r.year > 2013).collect()
From JSON:
c.jsonFile("tweets.json").registerTempTable("tweets")
c.sql("select text, user.name from tweets")
tweets.json:
{"text": "hi", "user": {"name": "matei", "id": 123}}

Spark Streaming [Diagram: an input stream arriving over time]

Spark Streaming Represents streams as a series of RDDs over time [Diagram: the stream is chopped into RDDs as time advances]
val spammers = sc.sequenceFile("hdfs://spammers.seq")
sc.twitterStream(...)
  .filter(t => t.text.contains("QCon"))
  .transform(tweets => tweets.map(t => (t.user, t)).join(spammers))
  .print()

MLlib Vectors, Matrices

MLlib Vectors, Matrices = RDD[Vector] Iterative computation
points = sc.textFile("data.txt").map(parsePoint)
model = KMeans.train(points, 10)
model.predict(newPoint)

GraphX Represents graphs as RDDs of edges and vertices
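To illustrate graphs built from RDDs, a small sketch using GraphX's Scala API (the vertex and edge data are made up):

// GraphX sketch: a graph is built from two RDDs (assumes SparkContext `sc`; toy data)
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
val vertices: RDD[(VertexId, String)] =
  sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges: RDD[Edge[String]] =
  sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))
val graph = Graph(vertices, edges)            // the graph is just the two RDDs
println(graph.inDegrees.collect().mkString(", "))
val ranks = graph.pageRank(0.001).vertices    // PageRank results come back as a vertex RDD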

Combining Processing Types
// Load data using SQL
val points = ctx.sql("select latitude, longitude from historic_tweets")
// Train a machine learning model
val model = KMeans.train(points, 10)
// Apply it to a stream
sc.twitterStream(...)
  .map(t => (model.closestCenter(t.location), 1))
  .reduceByWindow("5s", _ + _)

Composing Workloads Separate systems: [HDFS read → ETL → HDFS write → HDFS read → train → HDFS write → HDFS read → query → HDFS write → ...] Spark: [HDFS read → ETL → train → query → HDFS write]

Performance vs Specialized Systems [Charts: SQL response time (sec) for Hive, Impala (disk), Impala (mem), Spark (disk), Spark (mem); streaming throughput (MB/s/node) for Storm vs. Spark Streaming; ML response time (min) for Mahout, GraphLab, Spark]

On-Disk Performance: Petabyte Sort Spark beat last year's Sort Benchmark winner, Hadoop, by 3× using 10× fewer machines

             2013 Record (Hadoop)   Spark 100 TB   Spark 1 PB
Data Size    102.5 TB               100 TB         1000 TB
Time         72 min                 23 min         234 min
Nodes        2100                   206            190
Cores        50400                  6592           6080
Rate/Node    0.67 GB/min            20.7 GB/min    22.5 GB/min

tinyurl.com/spark-sort

Overview Why a unified engine? Spark execution model Why was Spark so general? What's next

Why was Spark so General? In a world of growing data complexity, understanding this can help us design new tools / pipelines Two perspectives: > Expressiveness perspective > Systems perspective

1. Expressiveness Perspective Spark ≈ MapReduce + fast data sharing

1. Expressiveness Perspective MapReduce can emulate any distributed system! [Diagram: one MR step = local computation + all-to-all communication; chaining steps emulates arbitrary computations] The catch: how do you share data quickly across steps? Spark: RDDs. And how low is the per-step latency? Spark: ~100 ms

2. Systems Perspective Main bottlenecks in clusters are network and I/O Any system that lets apps control these resources can match the speed of specialized ones In Spark: > Users control data partitioning & caching > We implement the data structures and algorithms of specialized systems within Spark
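A sketch of what controlling partitioning and caching looks like in practice (Scala; the paths and keys are illustrative):

// User-controlled partitioning and caching (assumes SparkContext `sc`)
import org.apache.spark.HashPartitioner
import org.apache.spark.storage.StorageLevel
val pairs = sc.textFile("hdfs://.../events")
  .map(line => (line.split('\t')(0), line))   // key each record by its first field
val partitioned = pairs.partitionBy(new HashPartitioner(100))  // choose the data layout
partitioned.persist(StorageLevel.MEMORY_ONLY)                  // choose the storage level
// Joining against data with the same partitioner avoids a network shuffle
val other = sc.parallelize(Seq(("user1", 1))).partitionBy(new HashPartitioner(100))
partitioned.join(other).count()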

Examples Spark SQL > A SchemaRDD holds records for each chunk of data (multiple rows), with columnar compression GraphX > GraphX represents graphs as an RDD of HashMaps so that it can join quickly against each partition

Result Spark can leverage most of the latest innovations in databases, graph processing, machine learning, ... Users get a single API that composes very efficiently More info: tinyurl.com/matei-thesis

Overview Why a unified engine? Spark execution model Why was Spark so general? What's next

What's Next for Spark While Spark has been around since 2009, many pieces are just beginning 300 contributors, 2 whole libraries new this year Big features in the works

Spark 1.2 (Coming in Dec) New machine learning pipelines API > Featurization & parameter search, similar to scikit-learn Python API for Spark Streaming Spark SQL pluggable data sources > Hive, JSON, Parquet, Cassandra, ORC, ... Scala 2.11 support

Beyond Hadoop [Diagram: batch, interactive, and streaming workloads, plus your application, on a unified API running over Hadoop, Cassandra, Mesos, and public clouds] Unified API across workloads, storage systems and environments

Learn More Downloads and tutorials: spark.apache.org Training: databricks.com/training (free videos) Databricks Cloud: databricks.com/cloud

www.spark-summit.org