Fast and Expressive Big Data Analytics with Python. Matei Zaharia UC BERKELEY


1 Fast and Expressive Big Data Analytics with Python
Matei Zaharia, UC Berkeley / MIT
spark-project.org

2 What is Spark?
Fast and expressive cluster computing system interoperable with Apache Hadoop
Improves efficiency through:
» In-memory computing primitives
» General computation graphs
Improves usability through:
» Rich APIs in Scala, Java, Python
» Interactive shell
Up to 100× faster (2-10× on disk)
Often 5× less code

3 Project History
Started in 2009, since open sourced
Companies now contributing code:
» Yahoo!, Intel, Adobe, Quantifind, Conviva, Bizo, ...
Entered Apache incubator in June
Python API added in February

4 An Expanding Stack
Spark is the basis for a wide set of projects in the Berkeley Data Analytics Stack (BDAS)
[Diagram: Shark (SQL), Spark Streaming (real-time), GraphX (graph), and MLbase (machine learning), all built on Spark]
More details: amplab.berkeley.edu

5 This Talk
Spark programming model
Examples
Demo
Implementation
Trying it out

6 Why a New Programming Model?
MapReduce simplified big data processing, but users quickly found two problems:
Programmability: tangle of map/reduce functions
Speed: MapReduce is inefficient for apps that share data across multiple steps
» Iterative algorithms, interactive queries

7 Data Sharing in MapReduce
[Diagram: an iterative job reads from and writes to HDFS between iterations; interactive queries (query 1, query 2, query 3, ...) each re-read the input from HDFS to produce their results]
Slow due to data replication and disk I/O

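8 What We'd Like
[Diagram: iterations share data through distributed memory; after one-time processing of the input, queries (query 1, query 2, ...) run against distributed memory]
Distributed memory is faster than network and disk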

9 Spark Model
Write programs in terms of transformations on distributed datasets
Resilient Distributed Datasets (RDDs)
» Collections of objects that can be stored in memory or on disk across a cluster
» Built via parallel transformations (map, filter, ...)
» Automatically rebuilt on failure

10 Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns
    lines = spark.textFile("hdfs://...")                      # base RDD
    errors = lines.filter(lambda s: s.startswith("ERROR"))    # transformed RDD
    messages = errors.map(lambda s: s.split("\t")[2])
    messages.cache()
    messages.filter(lambda s: "foo" in s).count()             # action
    messages.filter(lambda s: "bar" in s).count()
    ...
[Diagram: the driver sends tasks to three workers and receives results; each worker reads one block of the file (Block 1-3) and keeps its part of messages in memory (Cache 1-3)]
Result: full-text search of Wikipedia in 2 sec (vs 30 s for on-disk data)
Result: scaled to 1 TB data in 7 sec (vs 180 sec for on-disk data)

11 Fault Tolerance
RDDs track the transformations used to build them (their lineage) to recompute lost data
    messages = textFile(...).filter(lambda s: "ERROR" in s) \
                            .map(lambda s: s.split("\t")[2])
[Diagram: lineage chain HadoopRDD (path = hdfs://...) -> FilteredRDD (func = lambda s: ...) -> MappedRDD (func = lambda s: ...)]
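The lineage idea can be illustrated without Spark. The following is a minimal pure-Python sketch, not PySpark's implementation: `ToyRDD` and its methods are hypothetical names, and the log lines are made up. Each dataset records its parent and the function that derives it, so a lost in-memory result can be rebuilt by replaying the chain.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: it stores lineage, not just data."""

    def __init__(self, source=None, parent=None, func=None):
        self.source = source   # base data (only set on the root dataset)
        self.parent = parent   # dataset this one was derived from
        self.func = func       # function applied to the parent's items
        self._cache = None     # materialized result; may be lost

    def map(self, f):
        return ToyRDD(parent=self, func=lambda items: [f(x) for x in items])

    def filter(self, pred):
        return ToyRDD(parent=self,
                      func=lambda items: [x for x in items if pred(x)])

    def collect(self):
        if self._cache is None:  # nothing cached: recompute from lineage
            if self.parent is None:
                self._cache = list(self.source)
            else:
                self._cache = self.func(self.parent.collect())
        return self._cache

    def lose_cache(self):
        """Simulate losing the in-memory copy, e.g. after a worker failure."""
        self._cache = None


log = ["ERROR\tdb\ttimeout", "INFO\tweb\tok", "ERROR\tweb\tcrash"]
messages = (ToyRDD(source=log)
            .filter(lambda s: s.startswith("ERROR"))
            .map(lambda s: s.split("\t")[2]))

first = messages.collect()
messages.lose_cache()
rebuilt = messages.collect()  # recomputed from the recorded transformations
```

Because only the transformations are recorded, recovery needs no replicated copy of the data, just the base input and the functions.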

12 Example: Logistic Regression
Goal: find line separating two sets of points
[Figure: two sets of points, with a random initial line and the target line]

13 Example: Logistic Regression
    data = spark.textFile(...).map(readPoint).cache()
    w = numpy.random.rand(d)
    for i in range(iterations):
        # per-point gradient of the logistic loss, summed over all points
        gradient = data.map(lambda p:
            (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x
        ).reduce(lambda x, y: x + y)
        w -= gradient
    print("Final w: %s" % w)
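To see what each iteration computes, here is a local sketch of the same map/reduce structure using only the standard library. The toy points, the iteration count, and the helper names (`dot`, `point_gradient`, `add`) are assumptions for illustration. The per-point term is the standard logistic-loss gradient, (1 / (1 + exp(-y * w.x)) - 1) * y * x.

```python
from functools import reduce
from math import exp

# Toy, linearly separable data: (features, label) with labels in {-1, +1}.
points = [([1.0, 2.0], 1.0), ([2.0, 1.5], 1.0),
          ([-1.0, -1.5], -1.0), ([-2.0, -1.0], -1.0)]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def point_gradient(w, x, y):
    """Gradient of log(1 + exp(-y * w.x)) with respect to w for one point."""
    scale = (1.0 / (1.0 + exp(-y * dot(w, x))) - 1.0) * y
    return [scale * xi for xi in x]

def add(u, v):
    return [ui + vi for ui, vi in zip(u, v)]

w = [0.0, 0.0]
for _ in range(20):
    # "map" each point to its gradient, then "reduce" by vector addition.
    gradient = reduce(add, map(lambda p: point_gradient(w, p[0], p[1]), points))
    w = [wi - gi for wi, gi in zip(w, gradient)]

predictions = [1.0 if dot(w, x) > 0 else -1.0 for x, _ in points]
```

In the distributed version the map runs on the workers over cached points and only the summed gradient travels back to the driver, which is why caching the dataset pays off after the first iteration.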

14 Logistic Regression Performance
[Chart: running time (s) vs. number of iterations, Hadoop and PySpark]
Hadoop: 110 s / iteration
PySpark: first iteration 80 s, further iterations 5 s

15 Demo

16 Supported Operators
map, filter, groupBy, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, flatMap, take, first, partitionBy, pipe, distinct, save, ...

17 Other Engine Features
General operator graphs (not just map-reduce)
Hash-based reduces (faster than Hadoop's sort)
Controlled data partitioning to save communication
[Chart: PageRank performance, iteration time (s) for Hadoop, Basic Spark, and Spark + Controlled Partitioning]
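A hash-based reduce can be sketched locally with a dictionary: values are combined per key as they arrive, with no sort step. The sketch below is pure Python, not Spark's implementation. `hash_partition` and `reduce_by_key` are hypothetical helpers, and `zlib.crc32` is a stand-in partition function chosen only because it is deterministic across runs.

```python
import zlib

def hash_partition(pairs, num_partitions):
    """Route each (key, value) pair to a partition chosen by a hash of the key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        # crc32 is deterministic, unlike Python's salted str hash.
        index = zlib.crc32(key.encode("utf-8")) % num_partitions
        partitions[index].append((key, value))
    return partitions

def reduce_by_key(pairs, combine):
    """Combine values per key with a dict (hash table); no sorting needed."""
    totals = {}
    for key, value in pairs:
        totals[key] = combine(totals[key], value) if key in totals else value
    return totals

words = "to be or not to be".split()
pairs = [(word, 1) for word in words]

partitions = hash_partition(pairs, num_partitions=2)
# All pairs sharing a key land in the same partition, so each partition can
# be reduced independently and the results merged without conflicts.
counts = {}
for part in partitions:
    counts.update(reduce_by_key(part, lambda x, y: x + y))
```

If two datasets are partitioned by the same function, records with equal keys already sit in the same partition, which is the intuition behind saving communication with controlled partitioning.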

18 Spark Community
meetup members
60+ contributors
17 companies contributing

19 This Talk
Spark programming model
Examples
Demo
Implementation
Trying it out

20 Overview
Spark core is written in Scala
PySpark calls existing scheduler, cache and networking layer (2K-line wrapper)
No changes to Python
[Diagram: your app -> PySpark -> Spark client -> Spark workers, each worker running Python child processes]

21 Overview
Spark core is written in Scala
PySpark calls existing scheduler, cache and networking layer (2K-line wrapper)
No changes to Python
[Diagram: your app -> PySpark -> Spark client -> Spark workers, each worker running Python child processes]
Main PySpark author: Josh Rosen (cs.berkeley.edu/~joshrosen)

22 Object Marshaling
Uses pickle library for both communication and cached data
» Much cheaper than Python objects in RAM
Lambda marshaling library by PiCloud
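Both points on this slide can be illustrated with the standard library. The sketch below compares the pickled size of a list of integers with a rough estimate of its in-memory footprint, then shows that plain `pickle` refuses to serialize a lambda, which is the gap a separate lambda-marshaling library fills. The sample data is made up, and the `sys.getsizeof` estimate is approximate.

```python
import pickle
import sys

records = list(range(1000, 2000))

# Serialized form: one compact byte string.
pickled = pickle.dumps(records, protocol=pickle.HIGHEST_PROTOCOL)

# Rough in-memory footprint: the list object plus each int object.
in_memory = sys.getsizeof(records) + sum(sys.getsizeof(n) for n in records)

restored = pickle.loads(pickled)

# Plain pickle serializes functions by name, so an anonymous function fails.
try:
    pickle.dumps(lambda s: s.upper())
    lambda_pickled = True
except (pickle.PicklingError, AttributeError):
    lambda_pickled = False
```

The exact byte counts depend on the Python version and platform, so the comparison should be read as an order-of-magnitude illustration.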

23 Job Scheduler
Supports general operator graphs
Automatically pipelines functions
Aware of data locality and partitioning
[Diagram: operator graph over RDDs A-G, connected by map, groupBy, union, and join and split into Stages 1-3; shaded boxes mark cached data partitions]
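Pipelining means that consecutive element-wise operations such as map and filter run in a single pass over each partition, without materializing intermediate collections. Python generators give a rough local analogy. This is an illustrative sketch, not how Spark's scheduler is implemented, and the `trace` list exists only to show the order of execution.

```python
trace = []

def parse(lines):
    for line in lines:
        trace.append("parse " + line)
        yield int(line)

def keep_even(numbers):
    for n in numbers:
        trace.append("filter %d" % n)
        if n % 2 == 0:
            yield n

def double(numbers):
    for n in numbers:
        trace.append("double %d" % n)
        yield 2 * n

partition = ["1", "2", "3", "4"]
# Chaining generators builds the pipeline; nothing runs until it is consumed.
pipeline = double(keep_even(parse(partition)))
result = list(pipeline)
```

The trace shows each element flowing through all three steps before the next element is read, rather than each step finishing over the whole partition first.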

24 Interoperability
Runs in standard CPython, on Linux / Mac
» Works fine with extensions, e.g. NumPy
Input from local file system, NFS, HDFS, S3
» Only text files for now
Works in IPython, including notebook
Works in doctests: see our tests!
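A doctest embeds an interactive session in a docstring and checks that the real output matches. The sketch below uses plain Python with a hypothetical `word_lengths` function; a PySpark doctest would presumably also need a SparkContext available in the test's globals.

```python
import doctest

def word_lengths(line):
    """Return the length of each whitespace-separated word.

    >>> word_lengths("fast and expressive")
    [4, 3, 10]
    >>> word_lengths("")
    []
    """
    return [len(word) for word in line.split()]

# Parse the docstring's examples and run them with an explicit namespace.
parser = doctest.DocTestParser()
test = parser.get_doctest(word_lengths.__doc__,
                          {"word_lengths": word_lengths},
                          "word_lengths", None, 0)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)
```

In everyday use, calling `doctest.testmod()` at the bottom of a module runs every docstring example in that module; the explicit parser and runner here only make the outcome easy to inspect.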

25 Getting Started
Visit spark-project.org for video tutorials, online exercises, docs
Easy to run in local mode (multicore), standalone clusters, or EC2
Training camp at Berkeley in August (free video): ampcamp.berkeley.edu

26 Getting Started
Easiest way to learn is the shell:
    $ ./pyspark
    >>> nums = sc.parallelize([1, 2, 3])  # make RDD from array
    >>> nums.count()
    3
    >>> nums.map(lambda x: 2 * x).collect()
    [2, 4, 6]

27 Writing Standalone Jobs
    from pyspark import SparkContext
    if __name__ == "__main__":
        sc = SparkContext("local", "WordCount")
        lines = sc.textFile("in.txt")
        counts = lines.flatMap(lambda s: s.split()) \
                      .map(lambda word: (word, 1)) \
                      .reduceByKey(lambda x, y: x + y)
        counts.saveAsTextFile("out.txt")

28 Conclusion
PySpark provides a fast and simple way to analyze big datasets from Python
Learn more or contribute at spark-project.org
Look for our training camp on August 29-30!

29 Behavior with Not Enough RAM
[Chart: iteration time (s) vs. % of working set in memory: cache disabled, 25%, 50%, 75%, fully cached]

30 The Rest of the Stack
Spark is the foundation for a wide set of projects in the Berkeley Data Analytics Stack (BDAS)
[Diagram: Shark (SQL), Spark Streaming (real-time), GraphX (graph), and MLbase (machine learning), all built on Spark]
More details: amplab.berkeley.edu

31 Performance Comparison
[Charts: SQL response time (s) for Impala (disk), Impala (mem), Redshift, Shark (disk), Shark (mem); streaming throughput (MB/s/node) for Storm and Spark Streaming; graph response time (min) for Hadoop, Giraph, GraphLab, GraphX]
