Introduction to Big Data with Apache Spark UC BERKELEY

Similar documents
Introduction to Spark

Spark ΕΡΓΑΣΤΗΡΙΟ 10. Prepared by George Nikolaides 4/19/2015 1

Big Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic

Big Data Frameworks: Scala and Spark Tutorial

Apache Spark and Distributed Programming

Big Data Analytics. Lucas Rego Drumond

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. project.org. University of California, Berkeley UC BERKELEY

Big Data and Scripting Systems beyond Hadoop

Data Science in the Wild

Fast and Expressive Big Data Analytics with Python. Matei Zaharia UC BERKELEY

Scaling Out With Apache Spark. DTL Meeting Slides based on

Using Apache Spark Pat McDonough - Databricks

E6895 Advanced Big Data Analytics Lecture 3:! Spark and Data Analytics

Apache Spark : Fast and Easy Data Processing Sujee Maniyam Elephant Scale LLC sujee@elephantscale.com

Beyond Hadoop with Apache Spark and BDAS

Architectures for massive data management

Spark. Fast, Interactive, Language- Integrated Cluster Computing

Big Data and Scripting Systems build on top of Hadoop

Apache Flink Next-gen data analysis. Kostas

Writing Standalone Spark Programs

Unified Big Data Analytics Pipeline. 连 城

The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi,

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

LAB 2 SPARK / D-STREAM PROGRAMMING SCIENTIFIC APPLICATIONS FOR IOT WORKSHOP

Certification Study Guide. MapR Certified Spark Developer Study Guide

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

CS555: Distributed Systems [Fall 2015] Dept. Of Computer Science, Colorado State University

Streaming items through a cluster with Spark Streaming

Machine- Learning Summer School

Brave New World: Hadoop vs. Spark

Real Time Data Processing using Spark Streaming

Spark and Shark. High- Speed In- Memory Analytics over Hadoop and Hive Data

Unified Big Data Processing with Apache Spark. Matei

COURSE CONTENT Big Data and Hadoop Training

Kafka & Redis for Big Data Solutions

Hadoop: The Definitive Guide

Spark: Making Big Data Interactive & Real-Time

MapReduce: Simplified Data Processing on Large Clusters. Jeff Dean, Sanjay Ghemawat Google, Inc.

Hadoop, Hive & Spark Tutorial

Spark: Cluster Computing with Working Sets

Spark and the Big Data Library

GraySort on Apache Spark by Databricks

How To Create A Data Visualization With Apache Spark And Zeppelin

Moving From Hadoop to Spark

Beyond Hadoop MapReduce Apache Tez and Apache Spark

Big Data With Hadoop

Big Data Primer. 1 Why Big Data? Alex Sverdlov alex@theparticle.com

CSE-E5430 Scalable Cloud Computing Lecture 11

Apache Spark 11/10/15. Context. Reminder. Context. What is Spark? A GrowingStack

HiBench Introduction. Carson Wang Software & Services Group

Lecture 10 - Functional programming: Hadoop and MapReduce

Hadoop Ecosystem B Y R A H I M A.

Developing MapReduce Programs

Map Reduce & Hadoop Recommended Text:

Extreme Computing. Hadoop MapReduce in more detail.

Implementations of iterative algorithms in Hadoop and Spark

Hadoop2, Spark Big Data, real time, machine learning & use cases. Cédric Carbone Twitter

An approach to Machine Learning with Big Data

BIG DATA APPLICATIONS

From GWS to MapReduce: Google s Cloud Technology in the Early Days

Big Data and Scripting map/reduce in Hadoop

The Berkeley Data Analytics Stack: Present and Future

Data Intensive Computing Handout 6 Hadoop

Hadoop Streaming. Table of contents

Big Data Analytics with Cassandra, Spark & MLLib

Implementation of Spark Cluster Technique with SCALA

How to Run Spark Application

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

Conquering Big Data with BDAS (Berkeley Data Analytics)

Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

CSE-E5430 Scalable Cloud Computing Lecture 2

Hadoop & Spark Using Amazon EMR

Other Map-Reduce (ish) Frameworks. William Cohen

Pro Apache Hadoop. Second Edition. Sameer Wadkar. Madhu Siddalingaiah

Hadoop. History and Introduction. Explained By Vaibhav Agarwal

The Hadoop Framework

Distributed Apriori in Hadoop MapReduce Framework

Embracing Spark as the Scalable Data Analytics Platform for the Enterprise

An Open Source Memory-Centric Distributed Storage System

Data Intensive Computing Handout 5 Hadoop

Integrating Apache Spark with an Enterprise Data Warehouse

Apache Storm vs. Spark Streaming Two Stream Processing Platforms compared

Tachyon: memory-speed data sharing

Big Data Analytics Hadoop and Spark

Pre-Stack Kirchhoff Time Migration on Hadoop and Spark

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

Distributed Image Processing using Hadoop MapReduce framework. Binoy A Fernandez ( ) Sameer Kumar ( )

Common Patterns and Pitfalls for Implementing Algorithms in Spark. Hossein

Big Data at Spotify. Anders Arpteg, Ph D Analytics Machine Learning, Spotify

The Stratosphere Big Data Analytics Platform

Learning. Spark LIGHTNING-FAST DATA ANALYTICS. Holden Karau, Andy Konwinski, Patrick Wendell & Matei Zaharia

MAPREDUCE Programming Model

Big Fast Data Hadoop acceleration with Flash. June 2013

Designing Agile Data Pipelines. Ashish Singh Software Engineer, Cloudera

University of Maryland. Tuesday, February 2, 2010

Cloudera Certified Developer for Apache Hadoop

COSC 6397 Big Data Analytics. 2 nd homework assignment Pig and Hive. Edgar Gabriel Spring 2015

Hadoop: The Definitive Guide

Transcription:

Introduction to Big Data with Apache Spark UC BERKELEY

This Lecture Programming Spark Resilient Distributed Datasets (RDDs) Creating an RDD Spark Transformations and Actions Spark Programming Model

Python Spark (pyspark) We are using the Python programming interface to Spark (pyspark) pyspark provides an easy-to-use programming abstraction and parallel runtime:» Here s an operation, run it on all of the data RDDs are the key concept

Spark Driver and Workers Worker Spark executor Your application (driver program) Cluster manager SparkContext Worker Spark executor Local threads A Spark program is two programs:» A driver program and a workers program Worker programs run on cluster nodes or in local threads RDDs are distributed across workers Amazon S3, HDFS, or other storage

Spark Context A Spark program first creates a SparkContext object» Tells Spark how and where to access a cluster» pyspark shell and Databricks Cloud automatically create the sc variable» ipython and programs must use a constructor to create a new SparkContext Use SparkContext to create RDDs In the labs, we create the SparkContext for you

Spark Essentials: Master The master parameter for a SparkContext determines which type and size of cluster to use Master Parameter local local[k] spark://host:port mesos://host:port Description run Spark locally with one worker thread (no parallelism) run Spark locally with K worker threads (ideally set to number of cores) connect to a Spark standalone cluster; PORT depends on config (7077 by default) connect to a Mesos cluster; PORT depends on config (5050 by default) In the labs, we set the master parameter for you

Resilient Distributed Datasets The primary abstraction in Spark» Immutable once constructed» Track lineage information to efficiently recompute lost data» Enable operations on collection of elements in parallel You construct RDDs» by parallelizing existing Python collections (lists)» by transforming an existing RDDs» from files in HDFS or any other storage system

Programmer specifies number of partitions for an RDD item-1 item-2 item-3 item-4 item-5 RDD split into 5 partitions item-6 item-7 item-8 item-9 item-10 item-11 item-12 item-13 item-14 item-15 item-16 item-17 item-18 item-19 item-20 RDDs item-21 item-22 item-23 item-24 item-25 (Default value used if unspecified) more partitions = more parallelism Worker Spark executor Worker Spark executor Worker Spark executor

RDDs Two types of operations: transformations and actions Transformations are lazy (not computed immediately) Transformed RDD is executed when action runs on it Persist (cache) RDDs in memory or disk

Working with RDDs Create an RDD from a data source: <list> Apply transformations to an RDD: map filter Apply actions to an RDD: collect count <list> parallelize RDD RDD RDD filter RDD RDD filtered RDD map RDD RDD mapped RDD collect collect action causes parallelize, filter, and map transforms to be executed Result

Spark References http://spark.apache.org/docs/latest/programming-guide.html http://spark.apache.org/docs/latest/api/python/index.html

Creating an RDD Create RDDs from Python collections (lists) >>> data = [1, 2, 3, 4, 5] >>> data No computation occurs with sc.parallelize() Spark only records how to create the RDD with four partitions [1, 2, 3, 4, 5] >>> rdd = sc.parallelize(data, 4) >>> rdd ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:229

Creating RDDs From HDFS, text files, Hypertable, Amazon S3, Apache Hbase, SequenceFiles, any other Hadoop InputFormat, and directory or glob wildcard: /data/201404* >>> distfile = sc.textfile("readme.md", 4)! >>> distfile! MappedRDD[2] at textfile at NativeMethodAccessorImpl.java:-2!

Creating an RDD from a File distfile = sc.textfile("...", 4) RDD distributed in 4 partitions Elements are lines of input Lazy evaluation means no execution happens now

Spark Transformations Create new datasets from an existing one Use lazy evaluation: results not computed right away instead Spark remembers set of transformations applied to base dataset» Spark optimizes the required calculations» Spark recovers from failures and slow workers Think of this as a recipe for creating result

Some Transformations Transformation map(func) filter(func) distinct([numtasks])) flatmap(func) Description return a new distributed dataset formed by passing each element of the source through a function func return a new dataset formed by selecting those elements of the source on which func returns true return a new dataset that contains the distinct elements of the source dataset similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item)

Review: Python lambda Functions Small anonymous functions (not bound to a name) lambda a, b: a + b» returns the sum of its two arguments Can use lambda functions wherever function objects are required Restricted to a single expression

Transformations >>> rdd = sc.parallelize([1, 2, 3, 4]) >>> rdd.map(lambda x: x * 2) RDD: [1, 2, 3, 4] [2, 4, 6, 8] Function literals (green) are closures automatically passed to workers >>> rdd.filter(lambda x: x % 2 == 0) RDD: [1, 2, 3, 4] [2, 4] >>> rdd2 = sc.parallelize([1, 4, 2, 2, 3]) >>> rdd2.distinct() RDD: [1, 4, 2, 2, 3] [1, 4, 2, 3]

Transformations >>> rdd = sc.parallelize([1, 2, 3]) >>> rdd.map(lambda x: [x, x+5]) RDD: [1, 2, 3] [[1, 6], [2, 7], [3, 8]] >>> rdd.flatmap(lambda x: [x, x+5]) RDD: [1, 2, 3] [1, 6, 2, 7, 3, 8] Function literals (green) are closures automatically passed to workers

Transforming an RDD lines = sc.textfile("...", 4) comments = lines.filter(iscomment) lines comments Lazy evaluation means nothing executes Spark saves recipe for transforming source

Spark Actions Cause Spark to execute recipe to transform source Mechanism for getting results out of Spark

Some Actions Action reduce(func) take(n) collect() takeordered(n, key=func) Description aggregate dataset s elements using function func. func takes two arguments and returns one, and is commutative and associative so that it can be computed correctly in parallel return an array with the first n elements return all the elements as an array WARNING: make sure will fit in driver program return n elements ordered in ascending order or as specified by the optional key function

Getting Data Out of RDDs >>> rdd = sc.parallelize([1, 2, 3]) >>> rdd.reduce(lambda a, b: a * b) Value: 6 >>> rdd.take(2) Value: [1,2] # as list >>> rdd.collect() Value: [1,2,3] # as list

Getting Data Out of RDDs >>> rdd = sc.parallelize([5,3,1,2]) >>> rdd.takeordered(3, lambda s: - 1 * s) Value: [5,3,2] # as list

Spark Programming Model lines = sc.textfile("...", 4) print lines.count() lines # # # # count() causes Spark to: read data sum within partitions combine sums in driver

Spark Programming Model lines = sc.textfile("...", 4) comments = lines.filter(iscomment) print lines.count(), comments.count() lines # # # # comments # # # # Spark recomputes lines: read data (again) sum within partitions combine sums in driver

Caching RDDs lines = sc.textfile("...", 4) lines.cache() # save, don't recompute! comments = lines.filter(iscomment) print lines.count(),comments.count() lines RAM # comments # RAM # # RAM # # RAM # #

Spark Program Lifecycle 1. Create RDDs from external data or parallelize a collection in your driver program 2. Lazily transform them into new RDDs 3. cache() some RDDs for reuse 4. Perform actions to execute parallel computation and produce results

Spark Key-Value RDDs Similar to Map Reduce, Spark supports Key-Value pairs Each element of a Pair RDD is a pair tuple >>> rdd = sc.parallelize([(1, 2), (3, 4)]) RDD: [(1, 2), (3, 4)]

Some Key-Value Transformations Key-Value Transformation reducebykey(func) sortbykey() groupbykey() Description return a new distributed dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) è V return a new dataset (K, V) pairs sorted by keys in ascending order return a new dataset of (K, Iterable<V>) pairs

Key-Value Transformations >>> rdd = sc.parallelize([(1,2), (3,4), (3,6)]) >>> rdd.reducebykey(lambda a, b: a + b) RDD: [(1,2), (3,4), (3,6)] [(1,2), (3,10)] >>> rdd2 = sc.parallelize([(1,'a'), (2,'c'), (1,'b')]) >>> rdd2.sortbykey() RDD: [(1,'a'), (2,'c'), (1,'b')] [(1,'a'), (1,'b'), (2,'c')]

Key-Value Transformations >>> rdd2 = sc.parallelize([(1,'a'), (2,'c'), (1,'b')]) >>> rdd2.groupbykey() RDD: [(1,'a'), (1,'b'), (2,'c')] [(1,['a','b']), (2,['c'])] Be careful using groupbykey() as it can cause a lot of data movement across the network and create large Iterables at workers

pyspark Closures Spark automatically creates closures for: Driver» Functions that run on RDDs at workers» Any global variables used by those workers functions functions functions functions globals globals globals globals Worker Worker Worker Worker One closure per worker» Sent for every task» No communication between workers» Changes to global variables at workers are not sent to driver

Consider These Use Cases Iterative or single jobs with large global variables» Sending large read-only lookup table to workers» Sending large feature vector in a ML algorithm to workers Counting events that occur during job execution» How many input lines were blank?» How many input records were corrupt?

Consider These Use Cases Iterative or single jobs with large global variables» Sending large read-only lookup table to workers» Sending large feature vector in a ML algorithm to workers Counting events that occur during job execution» How many input lines were blank?» How many input records were corrupt? Problems: Closures are (re-)sent with every job Inefficient to send large data to each worker Closures are one way: driver è worker

Σ + + + pyspark Shared Variables Broadcast Variables» Efficiently send large, read-only value to all workers» Saved at workers for use in one or more Spark operations» Like sending a large, read-only lookup table to all the nodes Accumulators» Aggregate values from workers back to driver» Only driver can access value of accumulator» For tasks, accumulators are write-only» Use to count errors seen in RDD across workers

Broadcast Variables Keep read-only variable cached on workers» Ship to each worker only once instead of with each task Example: efficiently give every worker a large dataset Usually distributed using efficient broadcast algorithms At the driver: >>> broadcastvar = sc.broadcast([1, 2, 3]) At a worker (in code passed via a closure) >>> broadcastvar.value [1, 2, 3]

Broadcast Variables Example Country code lookup for HAM radio call signs # Lookup the locations of the call signs on the # RDD contactcounts. We load a list of call sign # prefixes to country code to support this lookup signprefixes = loadcallsigntable() def processsigncount(sign_count, signprefixes): country = lookupcountry(sign_count[0], signprefixes) count = sign_count[1] return (country, count) countrycontactcounts = (contactcounts.map(processsigncount).reducebykey((lambda x, y: x+ y))) From: http://shop.oreilly.com/product/0636920028512.do Expensive to send large table (Re-)sent for every processed file

Broadcast Variables Example Country code lookup for HAM radio call signs # Lookup the locations of the call signs on the # RDD contactcounts. We load a list of call sign # prefixes to country code to support this lookup signprefixes = sc.broadcast(loadcallsigntable()) def processsigncount(sign_count, signprefixes): country = lookupcountry(sign_count[0], signprefixes.value) count = sign_count[1] return (country, count) countrycontactcounts = (contactcounts.map(processsigncount).reducebykey((lambda x, y: x+ y))) From: http://shop.oreilly.com/product/0636920028512.do Efficiently sent once to workers

Σ + + + Accumulators Variables that can only be added to by associative op Used to efficiently implement parallel counters and sums Only driver can read an accumulator s value, not tasks >>> accum = sc.accumulator(0) >>> rdd = sc.parallelize([1, 2, 3, 4]) >>> def f(x): >>> global accum >>> accum += x >>> rdd.foreach(f) >>> accum.value Value: 10

Σ + + + Accumulators Example Counting empty lines file = sc.textfile(inputfile) # Create Accumulator[Int] initialized to 0 blanklines = sc.accumulator(0) def extractcallsigns(line): global blanklines # Make the global variable accessible if (line == ""): blanklines += 1 return line.split(" ") callsigns = file.flatmap(extractcallsigns) print "Blank lines: %d" % blanklines.value

Σ + + + Accumulators Tasks at workers cannot access accumulator s values Tasks see accumulators as write-only variables Accumulators can be used in actions or transformations:» Actions: each task s update to accumulator is applied only once» Transformations: no guarantees (use only for debugging) Types: integers, double, long, float» See lab for example of custom type

Summary Programmer specifies number of partitions Driver program R D D Spark automatically pushes closures to workers Worker code RDD Worker code RDD Worker code RDD Master parameter specifies number of workers