Unified Big Data Analytics Pipeline. 连城

Size: px

Start display at page:

Download "Unified Big Data Analytics Pipeline. 连城 lian@databricks.com"

Muriel Randall
8 years ago
Views:

1 Unified Big Data Analytics Pipeline 连城

2 What is A fast and general engine for large-scale data processing An open source implementation of Resilient Distributed Datasets (RDD) Has an advanced DAG execution engine that supports cyclic data flow and in-memory computing

Distributed Datasets (RDD) Has an advanced DAG execution

3 Why Fast Run machine learning like iterative programs up to 100x faster than Hadoop MapReduce in memory or 10x faster on disk Run HiveQL compatible queries 100x faster than Hive (with Shark / Spark SQL)

4 Why Compatibility Compatible with most popular storage systems on top of New users needn t suffer ETL to deploy Spark

5 Why Easy to use Fluent Scala / Java / Python API Interactive shell 2-5x less code (than Hadoop MapReduce)

6 Why Easy to use Fluent Scala / Java / Python API Interactive shell 2-5x less code (than Hadoop MapReduce) sc.textfile("hdfs://...").flatmap(_ split " ").map(_ -> 1).reduceByKey(_ + _).collectasmap()

7 Why Easy to use Fluent Scala / Java / Python API Interactive shell 2-5x less code (than Hadoop MapReduce) sc.textfile("hdfs://...").flatmap(_ split " ").map(_ -> 1).reduceByKey(_ + _).collectasmap() Try to down WordCount in 30 seconds with Hadoop MapReduce :-)

8 Why Unified big data analytics pipeline for Batch / interactive (Spark Core vs MR / Tez) SQL (Shark / Spark SQL vs Hive) Streaming (Spark Streaming vs Storm) Machine learning (MLlib vs Mahout) Graph (GraphX vs Giraph) Shark (SQL) Spark Streaming MLlib (machine learning) GraphX (graph) Spark Core

Streaming vs Storm) Machine learning (MLlib vs Mahout) Graph (GraphX vs

9 The Hadoop Family

10 The Hadoop Family General Batching Specialized systems Streaming Iterative Ad-hoc / SQL Graph MapReduce Storm Mahout Pig Giraph S4 Samza Hive Drill Impala

11 Big Data Analytics Pipeline Today Using separate frameworks

12 Big Data Analytics Pipeline Today Using separate frameworks ETL Train Query

13 Big Data Analytics Pipeline Today Using separate frameworks ETL Train Query ETL Train Query

14 Big Data Analytics Pipeline Today Using separate frameworks ETL Train Query ETL Train Query

15 Big Data Analytics Pipeline Today Using separate frameworks ETL Train Query How? ETL Train Query

16 D A T A H A S Ahhh!!!

17 Resilient Distributed Datasets RDDs are -only, partitioned collections of records can be cached in-memory and are locality aware can only be created through deterministic operations on either (1) data in stable storage, or (2) other RDDs

locality aware can only be created through deterministic

18 Resilient Distributed Datasets Core data sharing abstraction in Spark Computation can be represented by lazily evaluated RDD lineage DAG(s)

19 Resilient Distributed Datasets An RDD is a -only, partitioned, locality aware distributed collection of records either points to a stable data source, or is created through deterministic transformations on some other RDD(s)

records either points to a stable data source, or is

20 Resilient Distributed Datasets Coarse grained Applies the same operation on many records Fault tolerant Failed tasks can be recovered in parallel with the help of the lineage DAG

21 Resilient Distributed Datasets

22 Data Sharing in Spark Frequently accessed RDDs can be materialized and cached in memory Cached RDD can also be replicated for fault tolerance Spark scheduler takes cached data locality into account

23 Data Sharing in Spark Driver SparkContext Executors report caching information back to driver to improve data locality. Worker node Executor Cache Tasks Tasks Tasks CacheManager Cluster Manager Worker node Scheduler tries to schedule tasks to nodes where stable data source or cached input are available. Executor Cache Tasks Tasks Tasks

24 Data Sharing in Spark As a consequence, intermediate results are either cached in memory, or written to local file system as shuffle map output and waiting to be fetched by reducer tasks (similar to MapReduce) Enables in-memory data sharing, avoids unnecessary I/O

25 Generality of RDDs From an expressiveness point of view The coarse grained nature is not a big limitation since many parallel programs naturally apply the same operation to many records, making them easy to express In fact, RDDs can emulate any distributed system, and will do so efficiently in many cases unless the system is sensitive to network latency

26 Generality of RDDs From system perspective RDDs gave applications control over the most common bottleneck resources in cluster applications (i.e., network and storage I/O) Makes it possible to express the same optimizations that specialized systems have and achieve similar performance

27 Limitations of RDDs Not suitable for Latency (millisecond scale) Communication patterns beyond all-to-all Asynchrony Fine-grained updates

28 Analytics Models Built over RDDs MR / Dataflow spark.textfile("hdfs://...").flatmap(_ split " ").map(_ -> 1).reduceByKey(_ + _).collectasmap() Streaming val streaming = new StreamingContext( spark, Seconds(1)) streaming.sockettextstream(host, port).flatmap(_ split " ").map(_ -> 1).reduceByKey(_ + _).print() streaming.start()

29 Analytics Models Built over RDDs SQL val hivecontext = new HiveContext(spark) import hivecontext._ val children = hiveql( """SELECT name, age FROM people WHERE age < 13 """.strimargin).map { case Row(name: String, age: Int) => name -> age }.foreach(println) Machine Learning val points = spark.textfile("hdfs://...").map(_.split.map(_.todouble)).splitat(1).map { case (Array(label), features) => LabeledPoint(label, features) } val model = KMeans.train(points)

30 Mixing Different Analytics Models MapReduce & SQL // ETL with Spark Core case class Person(name: String, age: Int) val people = spark.textfile("hdfs://people.txt").map { line => val Array(name, age) = line.split(",") Person(name, age.trim.toint) }.registerastable("people") // Query with Spark SQL val teenagers = sql("select name FROM people WHERE age >= 13 AND age <= 19") teenagers.collect().foreach(println)

31 Mixing Different Analytics Models SQL & iterative computation (machine learning) // ETL with Spark SQL & Spark Core val trainingdata = sql("""select e.action, u.age, u.latitude, u.longitude FROM Users u JOIN Events e ON u.userid = e.userid """.stripmargin).map { case Row(_, age: Double, latitude: Double, longitude: Double) => val features = Array(age, latitude, longitude) LabeledPoint(age, features) } // Training with MLlib val model = new LogisticRegressionWithSGD().run(trainingData)

32 Unified Big Data Analytics Pipeline RDD RDD Core MLlib SQL

33 Unified Big Data Analytics Pipeline RDD RDD ETL Train Query Core MLlib SQL

34 Unified Big Data Analytics Pipeline ETL Train Query Core MLlib SQL

35 Unified Big Data Analytics Pipeline ETL Train Query Interactive analysis

36 Unified Big Data Analytics Pipeline Unified in-memory data sharing abstraction Efficient runtime brings significant performance boost and enables interactive analysis One stack to rule them all ETL Train Query

37 Thanks! Q&A

Unified Big Data Processing with Apache Spark. Matei Zaharia @matei_zaharia

Unified Big Data Processing with Apache Spark Matei Zaharia @matei_zaharia What is Apache Spark? Fast & general engine for big data processing Generalizes MapReduce model to support more types of processing

Unified Big Data Analytics Pipeline. 连 城

Unified Big Data Analytics Pipeline. 连城