Architectures for massive data management: Apache Spark

Albert Bifet
albert.bifet@telecom-paristech.fr

October 20, 2015

Spark Motivation

Apache Spark

Figure: IBM and Apache Spark

What is Apache Spark?

Apache Spark is a fast and general engine for large-scale data processing.

Speed: Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
Ease of Use: Write applications quickly in Java, Scala, Python, R.
Generality: Combine SQL, streaming, and complex analytics.
Runs Everywhere: Spark runs on Hadoop, Mesos, standalone, or in the cloud.

http://spark.apache.org/

Spark Ecosystem

Spark API

Word count in Spark's Python API:

text_file = spark.textFile("hdfs://...")
text_file.flatMap(lambda line: line.split()) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)

Word count in Spark's Scala API:

val f = sc.textFile("hdfs://...")
val wc = f.flatMap(l => l.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

Apache Spark

Apache Spark Project

Spark started as a research project at UC Berkeley. Matei Zaharia created Spark during his PhD, and Ion Stoica was his advisor. Databricks is the Spark start-up, which has raised $46 million.

Resilient Distributed Datasets (RDDs)

An RDD is a fault-tolerant collection of elements that can be operated on in parallel. RDDs can be created in two ways: by parallelizing an existing collection in your driver program, or by referencing a dataset in an external storage system.

Spark API: Parallel Collections

Spark's Python API:

data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)

Spark's Scala API:

val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)

Spark's Java API:

List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
JavaRDD<Integer> distData = sc.parallelize(data);

Spark API: External Datasets

Spark's Python API:

>>> distFile = sc.textFile("data.txt")

Spark's Scala API:

scala> val distFile = sc.textFile("data.txt")
distFile: RDD[String] = MappedRDD@1d4cee08

Spark's Java API:

JavaRDD<String> distFile = sc.textFile("data.txt");

Spark API: RDD Operations

Spark's Python API:

lines = sc.textFile("data.txt")
lineLengths = lines.map(lambda s: len(s))
totalLength = lineLengths.reduce(lambda a, b: a + b)

Spark's Scala API:

val lines = sc.textFile("data.txt")
val lineLengths = lines.map(s => s.length)
val totalLength = lineLengths.reduce((a, b) => a + b)

Spark's Java API:

JavaRDD<String> lines = sc.textFile("data.txt");
JavaRDD<Integer> lineLengths = lines.map(s -> s.length());
int totalLength = lineLengths.reduce((a, b) -> a + b);

Spark API: Working with Key-Value Pairs

Spark's Python API:

lines = sc.textFile("data.txt")
pairs = lines.map(lambda s: (s, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)

Spark's Scala API:

val lines = sc.textFile("data.txt")
val pairs = lines.map(s => (s, 1))
val counts = pairs.reduceByKey((a, b) => a + b)

Spark's Java API:

JavaRDD<String> lines = sc.textFile("data.txt");
JavaPairRDD<String, Integer> pairs = lines.mapToPair(s -> new Tuple2<>(s, 1));
JavaPairRDD<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b);

Spark API: Shared Variables

Spark's Python API:

>>> broadcastVar = sc.broadcast([1, 2, 3])
>>> broadcastVar.value
[1, 2, 3]

Spark's Scala API:

scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
scala> broadcastVar.value
res0: Array[Int] = Array(1, 2, 3)

Spark's Java API:

Broadcast<int[]> broadcastVar = sc.broadcast(new int[] {1, 2, 3});
broadcastVar.value(); // returns [1, 2, 3]

Spark Cluster

Figure: Cluster Components

Spark Cluster

Spark is agnostic to the underlying cluster manager. The Spark driver is the program that declares the transformations and actions on RDDs of data and submits such requests to the master. Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads. Each driver schedules its own tasks, and it must listen for and accept incoming connections from its executors throughout its lifetime. Because the driver schedules tasks on the cluster, it should run close to the worker nodes, preferably on the same local area network.
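The following is a minimal sketch of a driver program under these rules; the standalone master URL spark://host:7077 is a hypothetical address (for local testing, local[2] works as well):

import org.apache.spark.{SparkConf, SparkContext}

object DriverSketch {
  def main(args: Array[String]): Unit = {
    // The driver declares the configuration and connects to the cluster manager.
    val conf = new SparkConf()
      .setMaster("spark://host:7077") // hypothetical standalone master
      .setAppName("DriverSketch")
    val sc = new SparkContext(conf)

    // Transformations are declared lazily by the driver...
    val nums = sc.parallelize(1 to 1000)
    val squares = nums.map(n => n * n)

    // ...and an action triggers tasks that are scheduled onto the executors.
    println(squares.reduce(_ + _))

    sc.stop()
  }
}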

Apache Spark Streaming

Spark Streaming is an extension of Spark that allows processing data streams using micro-batches of data.

Discretized Streams (DStreams)

A Discretized Stream (DStream) represents a continuous stream of data, either the input data stream received from a source, or the processed data stream generated by transforming the input stream. Internally, a DStream is represented by a continuous series of RDDs.

Discretized Streams (DStreams)

Any operation applied on a DStream translates to operations on the underlying RDDs.

Discretized Streams (DStreams)

Spark Streaming provides windowed computations, which allow transformations over a sliding window of data.
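As a minimal sketch, assuming a DStream pairs of (word, 1) tuples like the one built in the word-count example below, the following counts words over the last 30 seconds of data, recomputed every 10 seconds:

val windowedCounts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b, // associative reduce function
  Seconds(30),               // window length
  Seconds(10))               // sliding interval; both must be multiples of the batch interval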

Spark Streaming

val conf = new SparkConf().setMaster("local[2]").setAppName("WCount")
val ssc = new StreamingContext(conf, Seconds(1))

// Create a DStream that will connect to hostname:port, like localhost:9999
val lines = ssc.socketTextStream("localhost", 9999)

// Split each line into words
val words = lines.flatMap(_.split(" "))

// Count each word in each batch
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)

// Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.print()

ssc.start()            // Start the computation
ssc.awaitTermination() // Wait for the computation to terminate

Spark SQL and DataFrames

Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine. A DataFrame is a distributed collection of data organized into named columns; it is conceptually equivalent to a table in a relational database.
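A minimal sketch, assuming a Spark 1.x SQLContext built on an existing SparkContext sc; people.json follows the layout of the sample file shipped with the Spark distribution:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Create a DataFrame from a JSON file and inspect its named columns.
val df = sqlContext.read.json("examples/src/main/resources/people.json")
df.printSchema()

// Query it through the DataFrame API...
df.filter(df("age") > 21).show()

// ...or register it as a table and use plain SQL.
df.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 21").show()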

Spark Machine Learning Libraries

MLlib contains the original API built on top of RDDs. spark.ml provides a higher-level API built on top of DataFrames for constructing ML pipelines.

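A minimal sketch of a spark.ml pipeline, assuming Spark 1.x and a hypothetical DataFrame training with columns "text" and "label": a tokenizer and a hashing feature stage feed a logistic regression, and fitting the pipeline learns a reusable model.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Three pipeline stages: tokenize text, hash tokens into feature vectors, classify.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

// Chain the stages; fit() runs them in order and returns a PipelineModel.
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training) // `training` is an assumed DataFrame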

Spark GraphX

GraphX optimizes the representation of vertex and edge types when they are primitive data types. The property graph is a directed multigraph with user-defined objects attached to each vertex and edge.

Spark GraphX

// Assume the SparkContext has already been constructed
val sc: SparkContext

// Create an RDD for the vertices
val users: RDD[(VertexId, (String, String))] =
  sc.parallelize(Array((3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
                       (5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))

// Create an RDD for edges
val relationships: RDD[Edge[String]] =
  sc.parallelize(Array(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"),
                       Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))

// Define a default user in case there are relationships with a missing user
val defaultUser = ("John Doe", "Missing")

// Build the initial Graph
val graph = Graph(users, relationships, defaultUser)
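A short usage sketch on the graph built above, counting users who are postdocs and edges whose source id is greater than their destination id:

// Count all users who are postdocs
val postdocs = graph.vertices.filter { case (id, (name, pos)) => pos == "postdoc" }.count()

// Count all the edges where srcId > dstId
val reversed = graph.edges.filter(e => e.srcId > e.dstId).count()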

Apache Spark Summary

Apache Spark is a fast and general engine for large-scale data processing.

Speed: Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
Ease of Use: Write applications quickly in Java, Scala, Python, R.
Generality: Combine SQL, streaming, and complex analytics.
Runs Everywhere: Spark runs on Hadoop, Mesos, standalone, or in the cloud.

http://spark.apache.org/