Unified Big Data Analytics Pipeline
Cheng Lian (连城) <lian@databricks.com>
What is Spark?
- A fast and general engine for large-scale data processing
- An open source implementation of Resilient Distributed Datasets (RDDs)
- Has an advanced DAG execution engine that supports cyclic data flow and in-memory computing
Why: Fast
- Runs machine-learning-like iterative programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
- Runs HiveQL-compatible queries up to 100x faster than Hive (with Shark / Spark SQL)
Why: Compatibility
- Compatible with most popular storage systems on top of HDFS
- New users needn't suffer ETL just to deploy Spark
Why: Easy to use
- Fluent Scala / Java / Python API
- Interactive shell
- 2-5x less code (than Hadoop MapReduce)

  sc.textFile("hdfs://...")
    .flatMap(_ split " ")
    .map(_ -> 1)
    .reduceByKey(_ + _)
    .collectAsMap()

Try to write down WordCount in 30 seconds with Hadoop MapReduce :-)
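For reference, a minimal self-contained version of the word count above (the app name and local[*] master are our placeholder settings, not from the slides; the HDFS path stays elided):

  import org.apache.spark.{SparkConf, SparkContext}

  object WordCount {
    def main(args: Array[String]): Unit = {
      // Local mode for experimentation; point the master at a real cluster in production.
      val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
      val sc = new SparkContext(conf)

      val counts = sc.textFile("hdfs://...")  // replace with a real input path
        .flatMap(_ split " ")                 // split lines into words
        .map(_ -> 1)                          // emit (word, 1) pairs
        .reduceByKey(_ + _)                   // sum counts per word
        .collectAsMap()                       // bring the result back to the driver

      counts.foreach(println)
      sc.stop()
    }
  }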
Why: Unified
A unified big data analytics pipeline for:
- Batch / interactive (Spark Core vs. MapReduce / Tez)
- SQL (Shark / Spark SQL vs. Hive)
- Streaming (Spark Streaming vs. Storm)
- Machine learning (MLlib vs. Mahout)
- Graph (GraphX vs. Giraph)

[Stack diagram: Shark (SQL), Spark Streaming, MLlib (machine learning), and GraphX (graph), all built on Spark Core]
The Hadoop Family
- General batch processing: MapReduce
- Specialized systems:
  - Streaming: Storm, S4, Samza
  - Iterative: Mahout
  - Ad-hoc / SQL: Pig, Hive, Drill, Impala
  - Graph: Giraph
Big Data Analytics Pipeline Today
- Built by stitching together separate frameworks: ETL, then Train, then Query
- Each stage runs in a different system. How does data get from one to the next?
Resilient Distributed Datasets
RDDs are read-only, partitioned collections of records that:
- can be cached in memory and are locality aware
- can only be created through deterministic operations on either (1) data in stable storage, or (2) other RDDs
Resilient Distributed Datasets
- The core data sharing abstraction in Spark
- Computation is represented by lazily evaluated RDD lineage DAG(s)
Resilient Distributed Datasets
An RDD is a read-only, partitioned, locality-aware distributed collection of records that either:
- points to a stable data source, or
- is created through deterministic transformations on some other RDD(s)
Resilient Distributed Datasets
- Coarse-grained: applies the same operation to many records
- Fault tolerant: failed tasks can be recovered in parallel with the help of the lineage DAG
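To make laziness and lineage concrete, here is a small sketch (the path and filter pattern are hypothetical) showing that transformations only extend the lineage DAG, which toDebugString can print:

  // Transformations are lazy: each one just records a new node in the lineage DAG.
  val lines = sc.textFile("hdfs://...")          // stable data source
  val errors = lines.filter(_ contains "ERROR")  // derived RDD
  val counts = errors.map(_ -> 1).reduceByKey(_ + _)

  // Nothing has executed yet; inspect the lineage Spark tracks for recovery.
  println(counts.toDebugString)

  // The first action triggers actual execution of the whole DAG.
  counts.count()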
Data Sharing in Spark
- Frequently accessed RDDs can be materialized and cached in memory
- Cached RDDs can also be replicated for fault tolerance
- The Spark scheduler takes cached data locality into account
Data Sharing in Spark
[Diagram: the driver (SparkContext, CacheManager, scheduler) coordinates, via the cluster manager, worker nodes that each run an executor with a local cache and tasks. Executors report caching information back to the driver to improve data locality; the scheduler tries to place tasks on nodes where the stable data source or cached input is available.]
Data Sharing in Spark
As a consequence, intermediate results are either:
- cached in memory, or
- written to the local file system as shuffle map output, waiting to be fetched by reducer tasks (similar to MapReduce)
This enables in-memory data sharing and avoids unnecessary I/O.
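A small sketch of the caching API (the storage level is a real Spark constant; the RDD itself is hypothetical):

  import org.apache.spark.storage.StorageLevel

  val errors = sc.textFile("hdfs://...").filter(_ contains "ERROR")

  // errors.cache() would materialize in memory with the default MEMORY_ONLY level.
  // Persisting with 2x in-memory replication adds fault tolerance:
  errors.persist(StorageLevel.MEMORY_ONLY_2)

  errors.count()  // the first action materializes (and replicates) the cache
  errors.count()  // later actions are scheduled close to the cached partitions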
Generality of RDDs
From an expressiveness point of view:
- The coarse-grained nature is not a big limitation, since many parallel programs naturally apply the same operation to many records, making them easy to express
- In fact, RDDs can emulate any distributed system, and will do so efficiently in many cases, unless the system is sensitive to network latency
Generality of RDDs
From a systems point of view:
- RDDs give applications control over the most common bottleneck resources in cluster applications (i.e., network and storage I/O)
- This makes it possible to express the same optimizations that specialized systems have, and to achieve similar performance (see the sketch below)
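To make the emulation claim concrete, here is a minimal sketch (the helper and its signature are ours, not from the slides) of the MapReduce programming model expressed with RDD operations:

  import scala.reflect.ClassTag
  import org.apache.spark.SparkContext._
  import org.apache.spark.rdd.RDD

  // MapReduce as a composition of RDD transformations.
  def mapReduce[K, V, K2: ClassTag, V2: ClassTag](
      input: RDD[(K, V)],
      mapper: ((K, V)) => TraversableOnce[(K2, V2)],
      reducer: (K2, Iterable[V2]) => TraversableOnce[V2]): RDD[(K2, V2)] =
    input
      .flatMap(mapper)  // map phase: emit intermediate key/value pairs
      .groupByKey()     // shuffle phase: group values by key
      .flatMap { case (k, vs) => reducer(k, vs).map(k -> _) }  // reduce phase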
Limitations of RDDs
Not suitable for:
- Millisecond-scale latency
- Communication patterns beyond all-to-all
- Asynchrony
- Fine-grained updates
Analytics Models Built over RDDs

MapReduce / dataflow:

  spark.textFile("hdfs://...")
    .flatMap(_ split " ")
    .map(_ -> 1)
    .reduceByKey(_ + _)
    .collectAsMap()

Streaming:

  val streaming = new StreamingContext(spark, Seconds(1))
  streaming.socketTextStream(host, port)
    .flatMap(_ split " ")
    .map(_ -> 1)
    .reduceByKey(_ + _)
    .print()
  streaming.start()
  streaming.awaitTermination()
Analytics Models Built over RDDs

SQL:

  val hiveContext = new HiveContext(spark)
  import hiveContext._

  val children = hiveql(
    """SELECT name, age
      |FROM people
      |WHERE age < 13
    """.stripMargin).map { case Row(name: String, age: Int) => name -> age }
  children.foreach(println)

Machine learning:

  val points = spark.textFile("hdfs://...")
    .map(_.split(" ").map(_.toDouble).splitAt(1))
    .map { case (Array(label), features) =>
      LabeledPoint(label, Vectors.dense(features)) }

  // KMeans is unsupervised: cluster on the feature vectors
  // (k = 2 clusters, 20 iterations; illustrative values).
  val model = KMeans.train(points.map(_.features), 2, 20)
Mixing Different Analytics Models

MapReduce & SQL:

  // ETL with Spark Core
  case class Person(name: String, age: Int)

  val people = spark.textFile("hdfs://people.txt").map { line =>
    val Array(name, age) = line.split(",")
    Person(name, age.trim.toInt)
  }
  people.registerAsTable("people")

  // Query with Spark SQL
  val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
  teenagers.collect().foreach(println)
Mixing Different Analytics Models

SQL & iterative computation (machine learning):

  // ETL with Spark SQL & Spark Core
  val trainingData = sql(
    """SELECT e.action, u.age, u.latitude, u.longitude
      |FROM Users u
      |JOIN Events e ON u.userId = e.userId
    """.stripMargin).map {
      case Row(action: Int, age: Double, latitude: Double, longitude: Double) =>
        // The label is the action being predicted; the rest are features.
        LabeledPoint(action.toDouble, Vectors.dense(age, latitude, longitude))
    }

  // Training with MLlib
  val model = new LogisticRegressionWithSGD().run(trainingData)
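A possible follow-up, not from the slides: scoring a new record with the trained model (the feature values are made up).

  // Hypothetical new user: age, latitude, longitude.
  val prediction = model.predict(Vectors.dense(25.0, 37.77, -122.42))
  println(s"Predicted action: $prediction")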
Unified Big Data Analytics Pipeline
[Diagram: one pipeline, ETL → Train → Query, implemented by Spark Core, MLlib, and Spark SQL respectively; the stages exchange RDDs directly, and the same stack also serves interactive analysis.]
Unified Big Data Analytics Pipeline
- A unified in-memory data sharing abstraction across the whole ETL / Train / Query pipeline
- An efficient runtime brings a significant performance boost and enables interactive analysis
- One stack to rule them all
Thanks! Q&A