
Fast and Expressive Big Data Analytics with Python
Matei Zaharia, UC Berkeley / MIT
spark-project.org

What is Spark?
Fast and expressive cluster computing system interoperable with Apache Hadoop
Improves efficiency through:
» In-memory computing primitives
» General computation graphs
Improves usability through:
» Rich APIs in Scala, Java, Python
» Interactive shell
Up to 100× faster (2-10× on disk)
Often 5× less code

Project History
Started in 2009, open sourced 2010
17 companies now contributing code
» Yahoo!, Intel, Adobe, Quantifind, Conviva, Bizo, ...
Entered Apache incubator in June
Python API added in February

An Expanding Stack
Spark is the basis for a wide set of projects in the Berkeley Data Analytics Stack (BDAS):
» Shark (SQL)
» Spark Streaming (real-time)
» GraphX (graph)
» MLbase (machine learning)
More details: amplab.berkeley.edu

This Talk
» Spark programming model
» Examples
» Demo
» Implementation
» Trying it out

Why a New Programming Model?
MapReduce simplified big data processing, but users quickly found two problems:
» Programmability: tangle of map/reduce functions
» Speed: MapReduce is inefficient for apps that share data across multiple steps (iterative algorithms, interactive queries)

Data Sharing in MapReduce
[Diagram: each iteration reads from and writes to HDFS (HDFS read -> iter. 1 -> HDFS write -> HDFS read -> iter. 2 -> ...), and each interactive query re-reads the input from HDFS]
Slow due to data replication and disk I/O

What We'd Like
[Diagram: one-time processing loads the input into distributed memory; iterations and queries then share it in memory]
10-100× faster than network and disk

Spark Model
Write programs in terms of transformations on distributed datasets
Resilient Distributed Datasets (RDDs):
» Collections of objects that can be stored in memory or on disk across a cluster
» Built via parallel transformations (map, filter, ...)
» Automatically rebuilt on failure

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns:

    lines = spark.textFile("hdfs://...")                      # base RDD
    errors = lines.filter(lambda s: s.startswith("ERROR"))    # transformed RDD
    messages = errors.map(lambda s: s.split("\t")[2])
    messages.cache()

    messages.filter(lambda s: "foo" in s).count()             # action
    messages.filter(lambda s: "bar" in s).count()
    ...

[Diagram: the driver ships tasks to workers; each worker reads one block of the file and caches its partition of messages]
Result: full-text search of Wikipedia in 2 sec (vs 30 s for on-disk data); scaled to 1 TB of data in 7 sec (vs 180 sec for on-disk data)

Fault Tolerance
RDDs track the transformations used to build them (their lineage) to recompute lost data:

    messages = textFile(...).filter(lambda s: "ERROR" in s)
                            .map(lambda s: s.split("\t")[2])

[Diagram: lineage chain HadoopRDD (path = hdfs://...) -> FilteredRDD (func = lambda s: ...) -> MappedRDD (func = lambda s: ...)]
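You can ask Spark to print this lineage from PySpark itself. A minimal sketch, assuming a running SparkContext named sc and an HDFS path of your own; toDebugString() is the RDD method that renders the lineage graph:

    # Build the same chain as above, then inspect its lineage.
    messages = sc.textFile("hdfs://...") \
                 .filter(lambda s: "ERROR" in s) \
                 .map(lambda s: s.split("\t")[2])

    # Prints the chain MappedRDD <- FilteredRDD <- HadoopRDD that Spark
    # would replay to rebuild any lost partitions.
    print(messages.toDebugString())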

Example: Logistic Regression
Goal: find a line separating two sets of points
[Diagram: scatter of + and - points; a random initial line is iteratively adjusted toward the target separator]

Example: Logistic Regression

    data = spark.textFile(...).map(readPoint).cache()

    w = numpy.random.rand(D)

    for i in range(ITERATIONS):
        gradient = data.map(lambda p:
            (1 / (1 + exp(-p.y * w.dot(p.x)))) * p.y * p.x
        ).reduce(lambda x, y: x + y)
        w -= gradient

    print "Final w: %s" % w
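The slide leaves readPoint, D, ITERATIONS, and exp undefined. A hypothetical completion, not from the original talk, assuming each input line holds a ±1 label followed by whitespace-separated feature values:

    from collections import namedtuple
    from math import exp   # used by the gradient lambda above
    import numpy

    D = 10                 # number of features (assumption)
    ITERATIONS = 10        # number of gradient steps (assumption)
    Point = namedtuple("Point", ["x", "y"])

    def readPoint(line):
        # Hypothetical parser: "label f1 f2 ..." per line, label in {-1, +1}.
        nums = [float(t) for t in line.split()]
        return Point(x=numpy.array(nums[1:]), y=nums[0])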

Logistic Regression Performance
[Chart: running time (s) vs number of iterations (1 to 30) for Hadoop and PySpark]
Hadoop: 110 s / iteration
PySpark: first iteration 80 s, further iterations 5 s

Demo

Supported Operators
map, filter, groupBy, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, flatMap, take, first, partitionBy, pipe, distinct, save, ...
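A quick sketch exercising a few of these operators on toy data, assuming a SparkContext named sc (the ordering of collect() results across partitions is not guaranteed):

    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
    other = sc.parallelize([("a", "x"), ("c", "y")])

    pairs.reduceByKey(lambda x, y: x + y).collect()  # [('a', 4), ('b', 2)]
    pairs.join(other).collect()            # [('a', (1, 'x')), ('a', (3, 'x'))]
    pairs.leftOuterJoin(other).collect()   # keeps ('b', (2, None))
    sc.parallelize([1, 1, 2]).distinct().collect()   # [1, 2]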

Other Engine Features
General operator graphs (not just map-reduce)
Hash-based reduces (faster than Hadoop's sort)
Controlled data partitioning to save communication
[Chart: PageRank iteration time: Hadoop 171 s, basic Spark 72 s, Spark + controlled partitioning 23 s]
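What controlled partitioning looks like in practice: a simplified PageRank sketch with toy link data and no damping factor, assuming a SparkContext named sc. Hash-partitioning links once and caching it lets every join with ranks reuse that partitioning instead of reshuffling the links on each iteration:

    # (url, [out-links]) pairs, hash-partitioned once and kept in memory
    links = sc.parallelize([("a", ["b"]), ("b", ["a", "c"]), ("c", ["a"])]) \
              .partitionBy(8) \
              .cache()
    ranks = links.mapValues(lambda _: 1.0)

    for _ in range(10):
        # kv = (url, (out_links, rank)): spread rank over the out-links
        contribs = links.join(ranks).flatMap(
            lambda kv: [(d, kv[1][1] / len(kv[1][0])) for d in kv[1][0]])
        ranks = contribs.reduceByKey(lambda x, y: x + y)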

Spark Community
1000+ meetup members
60+ contributors
17 companies contributing

This Talk
» Spark programming model
» Examples
» Demo
» Implementation
» Trying it out

Overview
Spark core is written in Scala
PySpark calls the existing scheduler, cache and networking layer (a 2K-line wrapper)
No changes to Python
[Diagram: your app -> PySpark -> Spark client -> Spark workers, each of which forks Python child processes]

Main PySpark author: Josh Rosen (cs.berkeley.edu/~joshrosen)

Object Marshaling
Uses the pickle library for both communication and cached data
» Much cheaper than Python objects in RAM
Lambda marshaling library by PiCloud
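The distinction matters because the standard pickle module refuses to serialize lambdas, while PiCloud's library handled them; its modern descendant is the cloudpickle package (my attribution of lineage, though the behavior below is easy to verify). A minimal sketch:

    import pickle
    import cloudpickle   # pip install cloudpickle

    data_bytes = pickle.dumps([1, 2, 3])      # plain pickle handles data fine
    # pickle.dumps(lambda x: x + 1)           # would raise a PicklingError
    func_bytes = cloudpickle.dumps(lambda x: x + 1)

    f = pickle.loads(func_bytes)              # plain pickle can load it back
    print(f(1))                               # 2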

Job Scheduler
Supports general operator graphs
Automatically pipelines functions
Aware of data locality and partitioning
[Diagram: a DAG of RDDs A-G split into three stages; narrow dependencies (map, union) are pipelined within a stage, while groupBy and join mark stage boundaries; shaded boxes = cached data partitions]
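In code terms, a sketch assuming a SparkContext named sc: narrow transformations like map and filter run back-to-back in a single stage over each partition, while a shuffle operator such as groupByKey closes the stage:

    rdd = sc.parallelize(range(100)) \
            .map(lambda x: (x % 10, x)) \
            .filter(lambda kv: kv[1] > 5)   # pipelined with the map, same stage

    grouped = rdd.groupByKey()              # shuffle here: a new stage begins
    grouped.mapValues(list).collect()       # e.g. (0, [10, 20, ...]), ...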

Interoperability
Runs in standard CPython, on Linux / Mac
» Works fine with extensions, e.g. NumPy
Input from local file system, NFS, HDFS, S3
» Only text files for now
Works in IPython, including the notebook
Works in doctests (see our tests!)
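For example, NumPy arrays flow straight through RDD operations, since the workers run ordinary CPython (a sketch, assuming sc and NumPy installed on the workers):

    import numpy as np

    vecs = sc.parallelize([np.array([1.0, 2.0]), np.array([3.0, 4.0])])
    total = vecs.reduce(lambda a, b: a + b)   # element-wise NumPy addition
    print(total)                              # [ 4.  6.]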

Getting Started
Visit spark-project.org for video tutorials, online exercises, docs
Easy to run in local mode (multicore), on standalone clusters, or on EC2
Training camp at Berkeley in August (free video): ampcamp.berkeley.edu

Getting Started
Easiest way to learn is the shell:

    $ ./pyspark
    >>> nums = sc.parallelize([1, 2, 3])  # make RDD from array
    >>> nums.count()
    3
    >>> nums.map(lambda x: 2 * x).collect()
    [2, 4, 6]

Writing Standalone Jobs

    from pyspark import SparkContext

    if __name__ == "__main__":
        sc = SparkContext("local", "WordCount")
        lines = sc.textFile("in.txt")

        counts = lines.flatMap(lambda s: s.split()) \
                      .map(lambda word: (word, 1)) \
                      .reduceByKey(lambda x, y: x + y)

        counts.saveAsTextFile("out.txt")

Conclusion
PySpark provides a fast and simple way to analyze big datasets from Python
Learn more or contribute at spark-project.org
Look for our training camp on August 29-30!
My email: matei@berkeley.edu

Behavior with Not Enough RAM
[Chart: iteration time (s) vs % of working set in memory: cache disabled 68.8, 25% cached 58.1, 50% 40.7, 75% 29.7, fully cached 11.5]

The Rest of the Stack
Spark is the foundation for a wide set of projects in the Berkeley Data Analytics Stack (BDAS):
» Shark (SQL)
» Spark Streaming (real-time)
» GraphX (graph)
» MLbase (machine learning)
More details: amplab.berkeley.edu

Performance Comparison
[Charts: SQL response time (s) for Impala (disk), Impala (mem), Redshift, Shark (disk), Shark (mem); streaming throughput (MB/s/node) for Storm vs Spark Streaming; graph response time (min) for Hadoop, Giraph, GraphLab, GraphX]