Data Science in the Wild




Data Science in the Wild
Lecture 4: Apache Spark (Cornell Tech, 2/18/2015)

What is Spark?
- Not a modified version of Hadoop
- A separate, fast, MapReduce-like engine
- In-memory data storage for very fast iterative queries
- General execution graphs and powerful optimizations
- Up to 40x faster than Hadoop
- Compatible with Hadoop's storage APIs: can read/write to any Hadoop-supported system, including HDFS, HBase, SequenceFiles, etc.

What is Shark?
- Port of Apache Hive to run on Spark
- Compatible with existing Hive data, metastores, and queries (HiveQL, UDFs, etc.)
- Similar speedups of up to 40x

Why a New Programming Model?
- MapReduce greatly simplified big data analysis
- But as soon as it got popular, users wanted more:
  - More complex, multi-stage applications (e.g. iterative graph algorithms and machine learning)
  - More interactive ad-hoc queries
- Both multi-stage and interactive apps require faster data sharing across parallel jobs

Data Sharing in MapReduce
[Diagram: each iteration or query reads its input from HDFS and writes its result back to HDFS before the next one starts]
- Slow due to replication, serialization, and disk I/O

Data Sharing in Spark
[Diagram: the input is processed once into distributed memory; subsequent iterations and queries share the in-memory data]
- 10-100x faster than network and disk

Spark Programming Model
Key idea: resilient distributed datasets (RDDs)
- Distributed collections of objects that can be cached in memory across cluster nodes
- Manipulated through various parallel operators
- Automatically rebuilt on failure

Interface
- Clean language-integrated API in Scala
- Can be used interactively from the Scala console

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns:

lines = spark.textFile("hdfs://...")          // base RDD
errors = lines.filter(_.startsWith("ERROR"))  // transformed RDD
messages = errors.map(_.split("\t")(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count
cachedMsgs.filter(_.contains("bar")).count
...

[Diagram: the driver ships tasks to workers; each worker caches its block of the RDD and returns results]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data)

Scaling Down
[Bar chart: execution time vs. % of working set in cache: roughly 69 s with cache disabled, 58 s at 25%, 41 s at 50%, 30 s at 75%, and 12 s fully cached]

Fault Recovery
RDDs track lineage information that can be used to efficiently recompute lost data:

msgs = textFile.filter(lambda s: s.startswith("ERROR")) \
               .map(lambda s: s.split("\t")[2])

[Lineage diagram: HDFS File -> filter(func = startswith("ERROR")) -> Filtered RDD -> map(func = split("\t")) -> Mapped RDD]

SparkContext
- Main entry point to Spark functionality
- Available in the shell as the variable sc
- In standalone programs, you'd make your own
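The lineage idea can be sketched in plain Python, with no Spark needed: a dataset remembers the transformations used to build it, so lost data can be rebuilt by replaying them against the source. The `LineageRDD` class below is a hypothetical illustration of the concept, not Spark's API.

```python
class LineageRDD:
    """Toy dataset that records its lineage (source + transformations)."""

    def __init__(self, source, lineage=()):
        self.source = source      # the original input records
        self.lineage = lineage    # tuple of (op, func) pairs

    def map(self, func):
        return LineageRDD(self.source, self.lineage + (("map", func),))

    def filter(self, func):
        return LineageRDD(self.source, self.lineage + (("filter", func),))

    def compute(self):
        """Rebuild the data from scratch by replaying the lineage."""
        data = list(self.source)
        for op, func in self.lineage:
            if op == "map":
                data = [func(x) for x in data]
            else:
                data = [x for x in data if func(x)]
        return data

log = ["ERROR\tdisk\tfull", "INFO\tok\tfine", "ERROR\tnet\tdown"]
msgs = LineageRDD(log).filter(lambda s: s.startswith("ERROR")) \
                      .map(lambda s: s.split("\t")[2])
print(msgs.compute())  # replaying the lineage yields ['full', 'down']
```

Because only the source and the (small) lineage need to survive a failure, no data replication is required for fault tolerance.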

Creating RDDs

# Turn a Python collection into an RDD
> sc.parallelize([1, 2, 3])

# Load a text file from local FS, HDFS, or S3
> sc.textFile("file.txt")
> sc.textFile("directory/*.txt")
> sc.textFile("hdfs://namenode:9000/path/file")

# Use an existing Hadoop InputFormat (Java/Scala only)
> sc.hadoopFile(keyClass, valClass, inputFmt, conf)

Basic Transformations

> nums = sc.parallelize([1, 2, 3])

# Pass each element through a function
> squares = nums.map(lambda x: x*x)  # => {1, 4, 9}

# Keep elements passing a predicate
> even = squares.filter(lambda x: x % 2 == 0)  # => {4}

# Map each element to zero or more others
> nums.flatMap(lambda x: range(x))  # => {0, 0, 1, 0, 1, 2}

(range(x) is the sequence of numbers 0, 1, ..., x - 1)
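Outside Spark, these three transformations have direct counterparts in ordinary Python list comprehensions, which is a handy way to check what a pipeline computes before running it on a cluster. A local sketch, not Spark code:

```python
nums = [1, 2, 3]

# map: pass each element through a function
squares = [x * x for x in nums]            # [1, 4, 9]

# filter: keep elements passing a predicate
even = [x for x in squares if x % 2 == 0]  # [4]

# flatMap: map each element to zero or more others, then flatten
flat = [y for x in nums for y in range(x)] # [0, 0, 1, 0, 1, 2]

print(squares, even, flat)
```

The Spark versions differ only in being lazy and distributed: nothing is computed until an action forces evaluation.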

Basic Actions

> nums = sc.parallelize([1, 2, 3])

# Retrieve RDD contents as a local collection
> nums.collect()  # => [1, 2, 3]

# Return the first K elements
> nums.take(2)  # => [1, 2]

# Count the number of elements
> nums.count()  # => 3

# Merge elements with an associative function
> nums.reduce(lambda x, y: x + y)  # => 6

# Write elements to a text file
> nums.saveAsTextFile("hdfs://file.txt")

Working with Key-Value Pairs
Spark's distributed reduce transformations operate on RDDs of key-value pairs.

Python:  pair = (a, b)
         pair[0]  # => a
         pair[1]  # => b

Scala:   val pair = (a, b)
         pair._1  // => a
         pair._2  // => b

Java:    Tuple2 pair = new Tuple2(a, b);
         pair._1  // => a
         pair._2  // => b
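Why must the function passed to reduce be associative? Spark reduces each partition locally and then merges the partial results, so the grouping of operations is not fixed. A plain-Python sketch of that two-phase reduce (the two-partition split is illustrative, not how Spark actually partitions):

```python
from functools import reduce

nums = [1, 2, 3, 4, 5, 6]

# Simulate two partitions reduced independently, then merged.
part1, part2 = nums[:3], nums[3:]
partial1 = reduce(lambda x, y: x + y, part1)  # 6
partial2 = reduce(lambda x, y: x + y, part2)  # 15
total = partial1 + partial2                   # 21

# An associative function gives the same answer for any grouping.
assert total == reduce(lambda x, y: x + y, nums)
print(total)  # 21
```

A non-associative function (e.g. subtraction) would give partition-dependent answers, which is why Spark does not guarantee an evaluation order.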

Some Key-Value Operations

> pets = sc.parallelize([("cat", 1), ("dog", 1), ("cat", 2)])

> pets.reduceByKey(lambda x, y: x + y)
# => {(cat, 3), (dog, 1)}

> pets.groupByKey()
# => {(cat, [1, 2]), (dog, [1])}

> pets.sortByKey()
# => {(cat, 1), (cat, 2), (dog, 1)}

reduceByKey also automatically implements combiners on the map side.

Example: Word Count

> lines = sc.textFile("hamlet.txt")
> counts = lines.flatMap(lambda line: line.split(" ")) \
                .map(lambda word: (word, 1)) \
                .reduceByKey(lambda x, y: x + y)

[Diagram: "to be or" / "not to be" -> (to, 1) (be, 1) (or, 1) (not, 1) (to, 1) (be, 1) -> (be, 2) (not, 1) (or, 1) (to, 2)]
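What reduceByKey computes can be reproduced on a single machine with a dictionary: group values by key, merging each new value into the running result with the supplied function. A local sketch of the word-count pipeline (the `reduce_by_key` helper is illustrative, not Spark's implementation):

```python
def reduce_by_key(pairs, func):
    """Local sketch of reduceByKey: merge values that share a key."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return sorted(merged.items())

lines = ["to be or", "not to be"]

# flatMap -> map -> reduceByKey, as in the slide
pairs = [(word, 1) for line in lines for word in line.split(" ")]
counts = reduce_by_key(pairs, lambda x, y: x + y)
print(counts)  # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

In Spark the same merge runs per partition first (the map-side combiner) so that only one pre-merged value per key crosses the network.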

Other Key-Value Operations

> visits = sc.parallelize([("index.html", "1.2.3.4"),
                           ("about.html", "3.4.5.6"),
                           ("index.html", "1.3.3.1")])

> pageNames = sc.parallelize([("index.html", "Home"),
                              ("about.html", "About")])

> visits.join(pageNames)
# ("index.html", ("1.2.3.4", "Home"))
# ("index.html", ("1.3.3.1", "Home"))
# ("about.html", ("3.4.5.6", "About"))

> visits.cogroup(pageNames)
# ("index.html", (["1.2.3.4", "1.3.3.1"], ["Home"]))
# ("about.html", (["3.4.5.6"], ["About"]))

Setting the Level of Parallelism
All the pair RDD operations take an optional second parameter for the number of tasks:

> words.reduceByKey(lambda x, y: x + y, 5)
> words.groupByKey(5)
> visits.join(pageViews, 5)
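The inner join above can also be checked locally: index one side by key, then pair every left value with every right value that shares its key. The `join` helper below is a single-machine sketch of the semantics, not Spark's shuffle-based implementation:

```python
from collections import defaultdict

def join(left, right):
    """Local sketch of an inner join on lists of (key, value) pairs."""
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)
    # Emit one pair per matching (left value, right value) combination.
    return [(k, (lv, rv)) for k, lv in left for rv in right_by_key[k]]

visits = [("index.html", "1.2.3.4"),
          ("about.html", "3.4.5.6"),
          ("index.html", "1.3.3.1")]
page_names = [("index.html", "Home"), ("about.html", "About")]

result = join(visits, page_names)
print(result)
```

cogroup differs in that it emits each key once, with the full list of values from each side, which is useful when you want the groups themselves rather than the cross-product.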

Using Local Variables
Any external variables you use in a closure will automatically be shipped to the cluster:

> query = sys.stdin.readline()
> pages.filter(lambda x: query in x).count()

Some caveats:
- Each task gets a new copy (updates are not sent back)
- The variable must be serializable

Under The Hood: DAG Scheduler
- General task graphs
- Automatically pipelines functions
- Data locality aware
- Partitioning aware, to avoid shuffles

[Diagram: a task graph over RDDs A-F cut into three stages at shuffle boundaries (a groupBy and a join); map and filter are pipelined within a stage; cached partitions are skipped]
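The "shipped to the cluster" behavior can be imitated locally with pickle: serializing the task function is roughly what happens when a closure is sent to workers, which is why captured variables must be serializable, and why each worker sees its own copy. A hedged sketch (`matches` and the use of `functools.partial` are illustrative choices, not Spark internals):

```python
import pickle
from functools import partial

def matches(query, line):
    return query in line

query = "ERROR"
task = partial(matches, query)  # captures the local variable

# Simulate shipping the task to a worker: serialize, then deserialize.
shipped = pickle.loads(pickle.dumps(task))

pages = ["ERROR: disk full", "INFO: all good"]
hits = sum(1 for page in pages if shipped(page))
print(hits)  # 1

# The worker received a copy: rebinding query here does not affect it.
query = "INFO"
assert shipped("ERROR: again")
```

This also explains the first caveat: a worker incrementing a shipped counter only changes its private copy, so the update never reaches the driver.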

More RDD Operators
map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ...

Language Support

Python:
lines = sc.textFile(...)
lines.filter(lambda s: "ERROR" in s).count()

Scala:
val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()

Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) { return s.contains("error"); }
}).count();

- Standalone programs: Python, Scala, & Java
- Interactive shells: Python & Scala
- Performance: Java & Scala are faster due to static typing, but Python is often fine

Interactive Shell
- Available in Python and Scala
- Runs as an application on an existing Spark cluster, or can run locally

... or a Standalone Application

import sys
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext("local", "WordCount", sys.argv[0], None)
    lines = sc.textFile(sys.argv[1])
    counts = lines.flatMap(lambda s: s.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda x, y: x + y)
    counts.saveAsTextFile(sys.argv[2])

Create a SparkContext

Scala:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

val sc = new SparkContext("url", "name", "sparkHome", Seq("app.jar"))

Java:
import org.apache.spark.api.java.JavaSparkContext;

JavaSparkContext sc = new JavaSparkContext("masterUrl", "name", "sparkHome", new String[] {"app.jar"});

Python:
from pyspark import SparkContext

sc = SparkContext("masterUrl", "name", "sparkHome", ["library.py"])

(Arguments: the cluster URL, or local / local[N]; the app name; the Spark install path on the cluster; a list of JARs or files with app code to ship)

Add Spark to Your Project
Scala / Java: add a Maven dependency on
  groupId:    org.spark-project
  artifactId: spark-core_2.10
  version:    0.9.0

Python: run your program with our pyspark script

Administrative GUIs
http://<standalone Master>:8080 (by default)

Software Components
- Spark runs as a library in your program (1 instance per app)
- Runs tasks locally or on a cluster: Mesos, YARN, or standalone mode
- Accesses storage systems via the Hadoop InputFormat API: can use HBase, HDFS, S3, ...

[Diagram: your application holds a SparkContext and local threads; a cluster manager assigns Workers, each running a Spark executor; executors read from HDFS or other storage]

Fault Tolerance
RDDs track the series of transformations used to build them (their lineage) to recompute lost data. E.g.:

messages = textFile(...).filter(_.contains("error"))
                        .map(_.split("\t")(2))

[Lineage: HadoopRDD (path = hdfs://...) -> FilteredRDD (func = _.contains(...)) -> MappedRDD (func = _.split("\t"))]

Example: Logistic Regression

val data = spark.textFile(...).map(readPoint).cache()  // load data in memory once

var w = Vector.random(D)                               // initial parameter vector

for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient                                        // repeated MapReduce steps do gradient descent
}

println("Final w: " + w)
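The Scala loop above can be transcribed into plain Python to see exactly what each iteration computes: one pass over the (cached) data producing a summed gradient, then a weight update. This is a single-machine sketch with the slide's update rule and an implicit learning rate of 1; the data layout (`(y, x)` pairs with y in {-1, +1}) is an assumption for illustration.

```python
import math
import random

def logistic_regression(points, dims, iterations, seed=0):
    """Batch gradient descent mirroring the slide's Spark loop.

    points: list of (y, x) with y in {-1, +1} and x a list of floats.
    """
    rng = random.Random(seed)
    w = [rng.uniform(-1, 1) for _ in range(dims)]  # initial parameter vector
    for _ in range(iterations):
        gradient = [0.0] * dims
        for y, x in points:  # the map step, one term per data point
            dot = sum(wi * xi for wi, xi in zip(w, x))
            coeff = (1.0 / (1.0 + math.exp(-y * dot)) - 1.0) * y
            for j in range(dims):  # the reduce step: sum the per-point terms
                gradient[j] += coeff * x[j]
        w = [wi - gi for wi, gi in zip(w, gradient)]
    return w

# Linearly separable toy data: label +1 when the single feature is positive.
points = [(1, [1.0]), (1, [2.0]), (-1, [-1.0]), (-1, [-2.0])]
w = logistic_regression(points, dims=1, iterations=50)
assert w[0] > 0  # the learned weight separates the two classes
```

The point of the Spark version is that only the map/reduce over `data` is distributed; `w` stays small and lives on the driver, and caching `data` means iterations after the first skip the HDFS read entirely, which is where the 174 s vs 6 s gap on the next slide comes from.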

Logistic Regression Performance
[Chart: running time (s) vs. number of iterations (1 to 30). Hadoop: 127 s per iteration. Spark: 174 s for the first iteration, 6 s for each further iteration]