Big Data and Scripting Systems beyond Hadoop




1. Big Data and Scripting Systems beyond Hadoop

2. ZooKeeper: distributed coordination service
- many problems are shared among distributed systems; ZooKeeper provides an implementation that solves them, avoiding repeated implementation of the same services
- provides primitives for synchronization, configuration maintenance, and naming
- optimized for failure tolerance, reliability, and performance
- used in other projects as a sub-service
- another Apache top-level project

3. ZooKeeper
- provides tree-like information storage
- update guarantees: sequential consistency (update order is kept), atomicity, single system image (all clients see one state), reliability (applied updates persist), timeliness (time bounds for updates)
- extremely simple interface: create/delete a node, test node existence, get/set data, get children, sync (wait for updates to propagate)
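To make the small interface concrete, here is a minimal in-process sketch of such a tree-like store in plain Python. It is an illustration only, not the real ZooKeeper client API: the class name `ZNodeTree` is made up, there is no replication, and `sync` is omitted because it only matters in a replicated setting.

```python
# Minimal in-process sketch of ZooKeeper's tree-like store and its
# small interface: create/delete a node, test existence, get/set
# data, list children. No coordination or replication.

def _parent(path):
    return path.rsplit("/", 1)[0] or "/"

class ZNodeTree:
    def __init__(self):
        self.nodes = {"/": b""}            # path -> data

    def create(self, path, data=b""):
        if _parent(path) not in self.nodes:
            raise KeyError("no parent: " + path)
        if path in self.nodes:
            raise KeyError("exists: " + path)
        self.nodes[path] = data

    def delete(self, path):
        if any(p != "/" and _parent(p) == path for p in self.nodes):
            raise ValueError("has children: " + path)
        del self.nodes[path]

    def exists(self, path):
        return path in self.nodes

    def get_data(self, path):
        return self.nodes[path]

    def set_data(self, path, data):
        if path not in self.nodes:
            raise KeyError(path)
        self.nodes[path] = data

    def get_children(self, path):
        return sorted(p.rsplit("/", 1)[1] for p in self.nodes
                      if p != "/" and _parent(p) == path)
```

A typical use would be `t.create("/app")`, then `t.create("/app/config", b"v1")`, then `t.get_children("/app")` returning `["config"]`.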

4. ZooKeeper environment
- runs standalone or replicated/distributed
- interfaces for Java and C
- independent of (but often used together with) Hadoop/HDFS
- memory based: stored data is kept in memory and is therefore limited in size
- prefers heavy reading over heavy writing

5. Mesos [1]
- Hadoop: uses one (physical) cluster of machines exclusively
- Mesos: enables sharing of a cluster between multiple distributed computing systems
- implements an intermediate layer between distributed frameworks (e.g. Hadoop) and the hardware
- administrates physical resources and distributes them to the participating frameworks
- improves cluster utilization
- implements prioritization and failure tolerance

[1] Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center, Hindman et al., 2011

6. Example scenarios
- multiple Hadoop systems on the same (physical) set of machines: a production system that takes priority; testing implementations or running analyses that are of general interest but should not disturb the production system; testing new versions of Hadoop; all involved Hadoop instances use the same data as input
- different distributed frameworks on the same cluster: different tasks benefit from different optimization approaches; the map-reduce approach is not optimal in every situation; still, the different frameworks might work on the same base data

7. Dividing tasks
- scheduling: distribute tasks to available resources; consider data locality (send tasks to nodes that already store the involved data); depends on the framework (optimization strategy), the job (algorithm), and the task order; should be implemented by the framework
- resource distribution: distribute the available resources to the frameworks; keep track of system usage; ensure priorities between different frameworks; should be implemented by the intermediate layer (Mesos)

8. Architecture
- centralized master-slave system; frameworks run tasks on slave nodes
- the master implements sharing using resource offers: lists of free resources on various slaves
- the master decides which (and how many) resources are offered to which framework, implementing the organizational policy
- frameworks have two parts: a scheduler that accepts or declines offers from Mesos, and an executor process started on the computing nodes that executes tasks
- the framework decides which task is solved on a particular resource; tasks are executed by sending a task description to the master
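The offer mechanism can be sketched as a toy simulation in plain Python. This is not the real Mesos API; the names `Framework`, `offer`, and `run_offers` are invented here, and the "resources" are simplified to a CPU count per slave.

```python
# Toy simulation of Mesos-style resource offers: the master offers
# each slave's free CPUs to the framework schedulers in turn, and
# each scheduler accepts as much as it still needs or declines.

class Framework:
    def __init__(self, name, demand_cpus):
        self.name = name
        self.demand = demand_cpus   # CPUs this framework still wants
        self.tasks = []             # (slave, cpus) of launched tasks

    def offer(self, slave, cpus):
        """Scheduler callback: accept up to the remaining demand."""
        take = min(cpus, self.demand)
        if take > 0:
            self.demand -= take
            self.tasks.append((slave, take))
        return take                 # 0 means the offer was declined

def run_offers(free_cpus, frameworks):
    """Master loop: offer every slave's free CPUs to the frameworks."""
    for slave, cpus in free_cpus.items():
        for fw in frameworks:
            cpus -= fw.offer(slave, cpus)
            if cpus == 0:
                break
        free_cpus[slave] = cpus
    return free_cpus
```

With two frameworks demanding 3 and 2 CPUs and slaves offering 4 and 2 CPUs, the first framework fills its demand from the first slave and the second picks up the leftovers.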

9. Design: resource allocation
- each framework gets a guaranteed amount of resources corresponding to its share
- when resources are available, a framework receives additional offers
- this can cause conflicts: when another framework starts using its share, the over-usage of the first framework is resolved by revoking (i.e. killing) tasks
- frameworks can indicate demand; otherwise free resources are offered freely
- while a framework's usage is below its share, none of its tasks are revoked; when its usage is above its share, any of its tasks can be revoked
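The revocation rule can be stated in a few lines of plain Python. This is a sketch of the policy as described above, not Mesos code; the function name `revocable_units` and the unit-based accounting are assumptions for illustration.

```python
# Sketch of share-based revocation: a framework at or below its
# guaranteed share keeps all its tasks; above its share, only the
# surplus above the share may be revoked to free resources.

def revocable_units(usage, share, needed):
    """How many resource units to revoke from a framework using
    `usage` units with guaranteed share `share`, when `needed`
    units must be freed for another framework."""
    if usage <= share:
        return 0                        # below share: nothing revoked
    return min(needed, usage - share)   # never dip below the share
```

For example, a framework using 10 units with a share of 4 can lose at most 6 units, and one using only 3 of its 4 loses nothing.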

10. Design: robustness and fault tolerance
- frameworks are treated as unreliable: resources offered to a framework count towards its usage, and unanswered offers are interpreted as rejections; the corresponding resources are offered to another framework
- a single master could pose a single point of failure, but the master has only soft state: a replacement master can reconstruct the state from information held by the slaves and the framework schedulers (active slaves, active frameworks, running tasks)
- inactive standby replacement masters elect a new leader using ZooKeeper
- frameworks can also register replacement schedulers

11. Summary/overview
- Mesos is a framework/library/set of servers that allows several distributed frameworks to run on top of a single cluster of machines
- administrates and distributes resources with respect to configurable priorities
- is an actually implemented and used system: mesos.apache.org
- made it into the Apache Incubator; started at the UC Berkeley AMP Lab

12. Spark: an alternative distributed framework [2]
- map reduce is not the optimal framework for all algorithms
- problem: many algorithms iterate a number of times over their source data
- example: gradient descent, where each iteration uses the source data to compute a new gradient
- in map reduce/Hadoop, every iteration reads all source data completely from disk and computes only a single step
- approach in Spark: create resilient distributed datasets (RDDs), if possible cached in the memory of the involved machines

[2] Spark: Cluster Computing with Working Sets, Zaharia, Chowdhury, Franklin, Shenker, Stoica, 2010
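The gradient-descent example makes the access pattern visible: every iteration scans the exact same dataset, so re-reading it from disk each time (as Hadoop does) wastes one full pass per step, while caching it in memory (as Spark does) pays the read cost once. A plain-Python sketch of that pattern, with no Spark involved:

```python
# Plain-Python sketch of the access pattern that motivates RDDs:
# gradient descent re-reads the *same* dataset in every iteration,
# so keeping it in memory saves one full disk scan per step.

def gradient_descent(data, steps=100, lr=0.1):
    """Fit y ~ w*x by least squares; `data` is a list of (x, y)."""
    w = 0.0
    for _ in range(steps):
        # one full pass over the (cached) dataset per iteration
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w
```

On data generated with y = 2x, the loop converges to a weight of 2 after a few dozen passes; each pass touches every data point, which is exactly the repeated scan an in-memory dataset makes cheap.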

13. Spark: overview
- cluster computing system, comparable to Hadoop
- provides primitives for in-memory cluster computing: data types that are distributed in parts across the individual machines and kept in memory
- in applications that benefit from this approach, Spark tends to be much faster than Hadoop
- provides APIs for Scala, Java, and Python
- built on top of Mesos
- originally developed for iterative algorithms (iterations using the same source data) and interactive data mining
- spark-project.org; not (yet) an Apache project, but open source (BSD license)

14. Intermission: Scala
- integrates object-oriented and functional language features
- Java-based: (in part) similar syntax; compiles to Java bytecode (runs on a standard JRE)
- type safe
- allows functional programming
- interactive usage with a console
- fully object-oriented, compilable programming (including GUIs)
- free: www.scala-lang.org

15. Spark programming model
- Spark applications consist of a driver program that implements the global, high-level control flow and launches operations that are executed in parallel
- distribution and parallelization are achieved with: resilient distributed datasets (RDDs), parallel operations working on RDDs, and shared variables

16. Resilient distributed datasets
- RDDs are read-only and partitioned across the compute nodes
- not necessarily on physical storage: an RDD is described by a handle; the handle contains the information needed to infer the RDD from reliable storage, so parts can be rebuilt in case of data loss
- constructed: from files (e.g. in HDFS), from a local collection (e.g. distributing an array across the cluster), by transformation from an existing RDD, or by changing the persistence of an existing RDD

17. Resilient distributed datasets
- lazy evaluation: creating a handle only describes the derivation; the derivation is executed only when necessary
- ephemeral: not guaranteed to stay in memory, recreated on demand
- this state can be changed using cache and save
- cache: still lazily evaluated, but after the first evaluation the result is kept in memory, if possible
- save: triggers evaluation and writes the result to distributed storage; the handle of a saved RDD points to the persistently stored object
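Lazy evaluation plus caching can be demonstrated with a tiny plain-Python stand-in. The class `LazyDataset` and its methods are hypothetical names, not the Spark API; the point is only that transformations record a derivation, nothing runs until the data is needed, and `cache()` keeps the result after the first evaluation.

```python
# Minimal sketch of lazy, cacheable datasets: map/filter only record
# how to derive the data; evaluate() runs the derivation on demand;
# cache() keeps the result in memory after the first evaluation.

class LazyDataset:
    def __init__(self, compute):
        self._compute = compute    # zero-arg function producing a list
        self._cached = None
        self._cache_on = False

    def map(self, f):
        return LazyDataset(lambda: [f(x) for x in self.evaluate()])

    def filter(self, pred):
        return LazyDataset(lambda: [x for x in self.evaluate() if pred(x)])

    def cache(self):
        self._cache_on = True
        return self

    def evaluate(self):
        if self._cached is not None:
            return self._cached
        result = self._compute()
        if self._cache_on:
            self._cached = result
        return result
```

Building `base = LazyDataset(load).cache()` and then `base.map(f)` triggers no computation at all; only calling `evaluate()` does, and because of `cache()` the expensive `load` runs just once even if the derived dataset is evaluated repeatedly.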

18. Parallel transformations
- emulating map reduce is simple: flatMap(function) applies function to each element of the RDD and produces a new RDD from the results (multiple results per call); reduceByKey(function), called on collections of (K, V) key/value pairs, groups by key and aggregates with function
- other transformations include: map(function) (apply function to all elements), filter(), sample(), union() (of two RDDs), distinct() (distinct elements), sort(), groupByKey(), join() (equi-join on key), cartesian(), and cogroup(), which maps (K, V), (K, W) to (K, Seq(V), Seq(W))
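The classic word count shows how these two transformations emulate map reduce. The sketch below runs on plain Python lists; `flat_map` and `reduce_by_key` are local stand-ins for Spark's flatMap and reduceByKey, not the real API.

```python
# Word count via the two map-reduce-style transformations, emulated
# on plain Python lists: flat_map splits lines into words, and
# reduce_by_key sums the per-word counts.

def flat_map(f, xs):
    return [y for x in xs for y in f(x)]

def reduce_by_key(f, pairs):
    out = {}
    for k, v in pairs:
        out[k] = f(out[k], v) if k in out else v
    return sorted(out.items())

lines = ["to be or not", "to be"]
words = flat_map(lambda line: line.split(), lines)
pairs = [(w, 1) for w in words]
counts = reduce_by_key(lambda a, b: a + b, pairs)
# counts == [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

In real Spark the same pipeline would be distributed: the line and pair collections would be RDD partitions, and the grouping step would shuffle pairs between machines.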

19. Actions to access distributed data
- actions extract data from the RDD and transport it back into the context of the driver program:
- collect(): retrieve all elements of an RDD
- first(), take(n): retrieve the first / the first n elements
- reduce(func): use a commutative, associative func for parallel reduction and retrieve the final result
- foreach(func): run a function over all elements (e.g. for statistics)
- count(): get the number of elements

20. Shared variables
- parallel functions transport variables from their original context to the node they are executed on; these have to be transported every time a function is sent over the network
- Spark supports two additional forms for special use cases:
- broadcast variables: transported only once to all involved nodes; read-only in parallel functions
- accumulators: parallel functions can add to an accumulator (where adding is some associative operation); the value can be read only by the driver program
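The two forms can be mimicked in a few lines of plain Python. The classes `Broadcast` and `Accumulator` below are hypothetical single-process stand-ins, not the Spark API; in real Spark the broadcast value is shipped to the worker nodes once, and accumulator additions happen on the workers while only the driver reads the total.

```python
# Single-process sketch of Spark's two shared-variable forms:
# a broadcast value that workers only read, and an accumulator
# that workers only add to and the driver reads back.

class Broadcast:
    def __init__(self, value):
        self._value = value

    @property
    def value(self):              # read-only access on the workers
        return self._value

class Accumulator:
    def __init__(self, zero, op):
        self._total, self._op = zero, op

    def add(self, x):             # workers may only add
        self._total = self._op(self._total, x)

    @property
    def value(self):              # read back in the driver
        return self._total

stopwords = Broadcast({"the", "a"})
hits = Accumulator(0, lambda a, b: a + b)
for word in ["the", "error", "a", "error"]:
    if word not in stopwords.value:
        hits.add(1)
```

Here the stopword set plays the role of a lookup table shipped once, and the counter collects a statistic from the parallel pass without shipping partial results around by hand.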

21. Spark: summary
- system for distributed computation
- master-slave architecture
- main difference to map reduce: in-memory computations with RDDs

22. Pregel [3]
- many large-scale problems involve graphs (a graph: nodes connected by edges); example: PageRank
- a distributed system designed for graph computations
- assumption: many graph algorithms traverse nodes via edges and access data very locally, e.g. the computations for a node involve the values of its neighbors
- Pregel implements such a system, but is not public/open source
- Giraph is an open-source framework implementing the same idea: giraph.apache.org

[3] Pregel: A System for Large-Scale Graph Processing, Malewicz, Austern, Bik, Dehnert, Horn, Leiser, Czajkowski, 2010

23. Pregel: idea
- basic unit of computation: a node with a unique id and its incident edges
- a node can perform computations; nodes communicate with each other via messages
- a superstep is one round in which each node computes: the node receives the messages from the last round, updates/computes, and sends messages to be received in the next round
- each node can vote to stop, which turns the node inactive; it is reactivated by a received message
- the computation stops when all nodes have voted to stop

24. Example: connected components
- assume the node ids are totally ordered
- each node initializes minid with its own id and sends its id to all neighbors
- in each following round: collect all received ids and update the minimum; if minid changed, send the new minid to the neighbors, else vote to stop
- result: all nodes in a component have the same minid; nodes in different components have different minids
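The rounds above can be simulated directly in plain Python: a superstep delivers each node's inbox, a node that sees a smaller id updates and sends, and a node with no improvement stays quiet (votes to stop). This is a single-process sketch of the algorithm, not Pregel or Giraph code.

```python
# Superstep simulation of Pregel-style connected components: every
# node keeps the minimum id it has seen (min_id) and forwards any
# improvement to its neighbors; the run ends when no messages are
# in flight, i.e. all nodes have voted to stop.

def connected_components(nodes, edges):
    neighbors = {v: set() for v in nodes}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    min_id = {v: v for v in nodes}
    # superstep 0: every node sends its own id to its neighbors
    inbox = {v: [u for u in neighbors[v]] for v in nodes}
    while any(inbox.values()):
        outbox = {v: [] for v in nodes}
        for v, msgs in inbox.items():
            if msgs and min(msgs) < min_id[v]:
                min_id[v] = min(msgs)          # improvement: re-send
                for u in neighbors[v]:
                    outbox[u].append(min_id[v])
            # else: node votes to stop (sends nothing this round)
        inbox = outbox                          # next superstep
    return min_id
```

On a graph with components {0, 1, 2} and {3, 4}, every node ends up labelled with its component's smallest id (0 and 3 respectively), matching the result stated on the slide.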

25. Master-slave system
- master node: determines the end of the algorithm, takes care of node failures, synchronizes node communication
- the basic idea can be extended: nodes can mutate the graph (create/delete nodes/edges)