CSE-E5430 Scalable Cloud Computing Lecture 11 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 30.11.2015 1/24
Distributed Coordination Systems Protocols for consensus under failures are extremely subtle, and should not be implemented in normal application code! In cloud-based systems the protocols needed to deal with consensus are usually hidden inside a distributed coordination system 2/24
Google Chubby One of the most well known systems implementing Paxos is Google Chubby: Michael Burrows: The Chubby Lock Service for Loosely-Coupled Distributed Systems. OSDI 2006: 335-350 Chubby is used to store the configuration parameters of all other Google services, to elect leaders, to do locking, to maintain pools of servers that are up, to map names to IP addresses, etc. Because in Paxos writes are sent to a single leader, write throughput can be a bottleneck 3/24
Apache Zookeeper An open source distributed coordination service Not based on Paxos; the protocol is outlined in: Flavio Paiva Junqueira, Benjamin C. Reed: Brief Announcement: Zab: A Practical Totally Ordered Broadcast Protocol. DISC 2009: 362-363 Shares many similarities with Paxos but is not the same protocol! We use figures from: http://zookeeper.apache.org/doc/r3.4.0/zookeeperover.html 4/24
Apache Zookeeper Zookeeper gives the clients a view of a small fault-tolerant filesystem, with access control rights management Each znode (a combined file/directory) in the filesystem can have children, as well as up to 1 MB of data attached to it All znode data reads and writes are atomic, reading or writing all of the max 1 MB of data atomically: No need to worry about partial file reads and/or writes 5/24
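A minimal sketch of reading and writing znode data from Python with the kazoo client library (the ensemble address, the znode path, and the stored value are assumptions made for illustration):

    from kazoo.client import KazooClient

    # Connect to a (here: local, single-server) Zookeeper ensemble
    zk = KazooClient(hosts="127.0.0.1:2181")
    zk.start()

    # Create the znode (and any missing parents) if it does not exist yet
    zk.ensure_path("/app/config")

    # The whole value (at most about 1 MB) is written atomically
    zk.set("/app/config", b"replicas=3")

    # Reads return the complete value plus metadata such as the version number
    data, stat = zk.get("/app/config")
    print(data.decode(), stat.version)

    zk.stop()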
Zookeeper Namespace 6/24
Ephemeral Znodes In addition to standard znodes, there are also ephemeral znodes, which exist in the system only as long as the Zookeeper client that created them can reply to Zookeeper keepalive messages in a timely manner Ephemeral znodes are usually used to detect failed nodes: A Zookeeper client that is not able to reply to Zookeeper keepalive messages in time is assumed to have failed Zookeeper clients can also create watches, i.e., be notified when a znode is modified. This is very useful for noticing, e.g., failures/additions of servers without continuous polling of Zookeeper 7/24
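As an illustration, a hedged sketch of this failure-detection pattern with the kazoo Python client: a server registers itself under an assumed /servers path with an ephemeral znode, and any client can watch the group for membership changes instead of polling:

    from kazoo.client import KazooClient

    zk = KazooClient(hosts="127.0.0.1:2181")
    zk.start()
    zk.ensure_path("/servers")

    # Ephemeral (and here also sequential) znode: it disappears automatically
    # if this client's session expires, i.e., if the client stops answering
    # Zookeeper keepalive messages in time
    zk.create("/servers/worker-", b"10.0.0.5:8080", ephemeral=True, sequence=True)

    # Watch: the callback fires whenever the set of live servers changes,
    # so failures/additions are noticed without continuous polling
    @zk.ChildrenWatch("/servers")
    def on_membership_change(children):
        print("live servers:", children)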
Zookeeper Service Each Zookeeper client is connected to a single server. All write requests are forwarded through a Leader node. The Leader orders all writes into a total order, and forwards them to the Follower nodes. Writes are acked once a majority of servers have persisted the write 8/24
Zookeeper Characteristics Reads are served from the cache of the Server the client is connected to. The reads can return slightly out-of-date (max tens of seconds) data. There is a sync() call that forces a Server to catch up with the Leader before returning data to the client. The Leader will acknowledge a write if a majority of the servers are able to persist it in their logs. Thus if a minority of Servers fail, there is at least one server with the most up-to-date commits. If the Leader fails, Zookeeper selects a new Leader 9/24
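A small sketch (again with the kazoo Python client; the path is an assumption) of using sync() to avoid reading slightly stale data from the connected Server's cache:

    from kazoo.client import KazooClient

    zk = KazooClient(hosts="127.0.0.1:2181")
    zk.start()

    # sync() forces the connected Server to catch up with the Leader,
    # so the following read does not return stale cached data
    zk.sync("/app/config")
    data, stat = zk.get("/app/config")

    zk.stop()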
Zookeeper Components 10/24
Zookeeper Performance 11/24
Zookeeper Performance under Failures 12/24
Raft A widely used alternative to the Paxos and Zab algorithms is the Raft algorithm: Diego Ongaro, John K. Ousterhout: In Search of an Understandable Consensus Algorithm. USENIX Annual Technical Conference 2014: 305-319 Raft is considered slightly easier to understand and implement compared to Paxos It is implemented in many different systems and programming languages, see: https://raft.github.io/ 13/24
Distributed Coordination Service Summary Maintaining a global database image under node failures is a very subtle problem One should use centralized infrastructure such as Chubby, Zookeeper, or a Raft implementation, so that these subtle algorithms are implemented only once From an application's point of view the distributed coordination services should be used to store global shared state: Global configuration data, global locks, live master server locations, live slave server locations, etc. By modularizing the tricky fault tolerance problems into a distributed coordination service, the applications become much easier to program 14/24
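To make this concrete, a hedged sketch of an application using the coordination service for a global lock and for leader election, via kazoo's built-in recipes (the ensemble addresses, znode paths, and identifiers are assumptions):

    from kazoo.client import KazooClient

    zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
    zk.start()

    # A global lock shared by all application instances
    lock = zk.Lock("/app/locks/reindex", identifier="worker-42")
    with lock:  # blocks until this instance holds the lock
        print("holding the global lock")

    # Leader election: the callback runs only on the instance that wins
    election = zk.Election("/app/election", identifier="worker-42")
    election.run(lambda: print("this instance is now the leader"))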
Apache Spark Distributed programming framework for Big Data processing Based on functional programming Provides distributed Scala-collection-like interfaces for Scala, Java, Python, and R Implemented on top of the Akka actor framework Original paper: Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. NSDI 2012: 15-28 15/24
Resilient Distributed Datasets Resilient Distributed Datasets (RDDs) are Spark's core data structure: a collection of elements distributed over several computers The framework stores the RDDs in partitions, where a separate thread can process each partition To implement fault tolerance, the RDD partitions record lineage: A recipe to recompute the RDD partition based on the parent RDDs and the operation used to generate the partition If a server goes down, the lineage information can be used to recompute its partitions on another server 16/24
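A minimal PySpark sketch (not from the original paper; the data and partition count are made up) showing an RDD split into partitions and the lineage Spark records for it:

    from pyspark import SparkContext

    sc = SparkContext("local[4]", "rdd-lineage-demo")

    # An RDD split into 4 partitions, each processed by a separate task
    numbers = sc.parallelize(range(100000), numSlices=4)

    # Each transformation only adds a step to the lineage; nothing runs yet
    squares = numbers.map(lambda x: x * x)
    evens = squares.filter(lambda x: x % 2 == 0)

    print(evens.getNumPartitions())  # 4
    print(evens.toDebugString())     # the lineage used to recompute lost partitions
    print(evens.count())             # the action that triggers the computation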
Spark Tutorials Quick start: http://spark.apache.org/docs/latest/quick-start.html Spark Programming Guide: http://spark.apache.org/docs/latest/programming-guide.html Dataframes and Spark SQL: http://spark.apache.org/docs/latest/sql-programming-guide.html 17/24
Spark Motivation The MapReduce framework is one of the most robust Big Data processing frameworks MapReduce has several problems that Spark is able to address: Reading and storing data in main memory instead of on hard disks - Main memory is much faster than hard drives Running iterative algorithms much faster - Especially important for many machine learning algorithms 18/24
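A hedged PySpark sketch (the input path and the toy update rule are assumptions) of why keeping data in main memory helps iterative algorithms: the input is read and parsed once, then reused from RAM on every iteration:

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "iterative-demo")

    # Parse the input once and keep it cached in memory across iterations
    values = sc.textFile("hdfs:///data/values.txt") \
               .map(lambda line: float(line)) \
               .cache()

    estimate = 0.0
    for i in range(10):
        # Each iteration reads the cached RDD from RAM, not from disk
        error = values.map(lambda x: x - estimate).mean()
        estimate += 0.5 * error

    print("converged estimate:", estimate)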
Spark Benchmarking The original papers claim Spark to be up to 100x faster than MapReduce when data fits into RAM, or up to 10x faster when data is on disks Large speedups can be observed in the RAM vs. HD case However, independent benchmarks show that when both use HDs, Spark is up to 2.5-5x faster for CPU-bound workloads, while MapReduce can still sort faster: Juwei Shi, Yunjie Qiu, Umar Farooq Minhas, Limei Jiao, Chen Wang, Berthold Reinwald, Fatma Özcan: Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics. PVLDB 8(13): 2110-2121 (2015) http://www.vldb.org/pvldb/vol8/p2110-shi.pdf http://www.slideshare.net/ssuser6bb12d/a-comparative-performance-evaluation-of-apache-flink 19/24
Spark Extensions MLlib - Distributed Machine learning Library Spark SQL - Distributed Analytics SQL Database - Can be tightly integrated with Apache Hive Data Warehouse, uses HQL (Hive SQL variant) Spark Streaming - A Streaming Data Processing Framework GraphX - A Graph processing system 20/24
Data Frames The Spark project has introduced a new data storage and processing framework - Data Frames - to eventually replace RDDs for many applications Problem with RDDs: As RDDs are operated on by arbitrary user functions in Scala/Java/Python, optimizing them is very hard DataFrames allow only a limited set of DataFrame-native operations, which lets the Spark SQL optimizer rewrite user DataFrame code into equivalent but faster code Instead of executing exactly what the user wrote (as with RDDs), Spark executes an optimized sequence of operations This also allows DataFrame operations to be implemented at the C/C++ (LLVM) level 21/24
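For comparison with the RDD examples above, a minimal DataFrame sketch (using a newer PySpark API; the file path and column names are assumptions). Only declarative DataFrame-native operations are used, so the Spark SQL optimizer is free to rewrite the plan, e.g., push the filter down and prune unused columns:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

    df = spark.read.json("hdfs:///data/events.json")

    # Declarative operations: the optimizer decides how to execute them
    result = df.filter(df["status"] == "error") \
               .groupBy("service") \
               .count()

    result.explain()  # print the optimized plan, not the literal user code
    result.show(10)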
Lambda Architecture Batch computing systems like Hadoop have a large latency, on the order of minutes to hours, but have extremely good fault tolerance - each data item is processed exactly once (CP) Sometimes data needs to be processed in real time Stream processing systems such as Apache Storm or Spark Streaming combined with Apache Kafka allow for large scale real time processing - each data item is processed at least once (AP) These stream processing frameworks can provide latencies on the order of tens of milliseconds Lambda architecture is the combination of consistent / exact but high latency batch processing and available / approximate low latency stream processing 22/24
Lambda Architecture (cont.) For more details, see, e.g.: http://lambda-architecture.net/ 23/24
Lambda Architecture (cont.) Batch layer - exact computation with high latency, maintaining the master dataset, CP system Speed layer - approximate computation with low latency, only maintaining approximations over recent data that has not yet been processed by the batch layer, AP system Serving layer - serves the output of the batch layer and the speed layer Lambda architecture allows for both low latency and eventual accuracy of the approximate results, as the approximations are eventually overwritten by exact results computed by the batch layer - a combination of AP and CP systems 24/24
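A toy Python sketch (all names and data are made up) of the serving-layer idea: a query is answered by combining the exact batch view with the speed layer's approximate view of the data not yet processed by the batch layer:

    def query_page_views(url, batch_view, speed_view):
        # batch_view: exact counts up to the last completed batch run (CP)
        # speed_view: approximate counts for events after that run (AP)
        return batch_view.get(url, 0) + speed_view.get(url, 0)

    # Example: the batch layer has processed everything up to midnight,
    # the speed layer covers the events seen since then
    batch_view = {"/index.html": 10000}
    speed_view = {"/index.html": 42}
    print(query_page_views("/index.html", batch_view, speed_view))  # 10042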