Can't We All Just Get Along? Spark and Resource Management on Hadoop
Introductions
- Software engineer at Cloudera
- MapReduce, YARN, resource management
- Hadoop committer
Introduction
- Spark as a first-class data processing framework alongside MapReduce and Impala
- Resource management: what we have already, and what we need for the future
Bringing Computation to the Data
- Users want to ETL a dataset with Pig and MapReduce
- Fit a model to it with Spark
- Have BI tools query it with Impala
- The same set of machines that hold the data must also host these frameworks
Cluster Resource Management
- Hadoop brings generalized computation to big data
- More processing frameworks: MapReduce, Impala, Spark
- Some workloads are more important than others
- A cluster has finite resources: limited CPU, memory, disk, and network bandwidth
- How do we make sure each workload gets the resources it deserves?
How We See It
[diagram: Impala, MapReduce, and Spark running side by side on HDFS]
How They Want to See It
[diagram: the cluster carved up by organization: Engineering 50%, Finance 30%, Marketing 20%, each running its own mix of Spark, MapReduce, and Impala on HDFS]
Central Resource Management
[diagram: Impala, MapReduce, and Spark all scheduled through YARN, on top of HDFS]
YARN
- Resource manager and scheduler for Hadoop
- A container is a process scheduled on the cluster with a resource allocation (amount of memory in MB, number of cores)
- Each container belongs to an application
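To make the container abstraction concrete, here is a minimal sketch of granting a process a fixed memory/core allocation on a node. The names (`ContainerRequest`, `Node`) are invented for illustration and are not YARN's API:

```python
# Sketch of YARN's container idea: a container is a resource allocation
# (memory in MB, number of cores) that a node can grant to an application.
# Class and field names here are illustrative, not YARN's actual API.
from dataclasses import dataclass

@dataclass
class ContainerRequest:
    memory_mb: int
    vcores: int

@dataclass
class Node:
    free_memory_mb: int
    free_vcores: int

    def can_fit(self, req: ContainerRequest) -> bool:
        return (req.memory_mb <= self.free_memory_mb
                and req.vcores <= self.free_vcores)

    def allocate(self, req: ContainerRequest) -> bool:
        # Grant the container only if the node has room for it.
        if not self.can_fit(req):
            return False
        self.free_memory_mb -= req.memory_mb
        self.free_vcores -= req.vcores
        return True

node = Node(free_memory_mb=8192, free_vcores=4)
assert node.allocate(ContainerRequest(memory_mb=2048, vcores=1))      # granted
assert not node.allocate(ContainerRequest(memory_mb=16384, vcores=1))  # too big
```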
YARN Application Masters
- Each YARN app has an Application Master (AM) process running on the cluster
- The AM is responsible for requesting containers from YARN
- AM creation latency is much higher than resource-acquisition latency
YARN
[diagram: a Client submits to the ResourceManager (with a JobHistory Server alongside); NodeManagers host containers running an Application Master, a Map Task, and a Reduce Task]
YARN Queues
- Cluster resources are allocated to queues
- Each application belongs to a queue
- Queues may contain subqueues
[diagram: Root queue with 12 GB memory capacity and 24 CPU cores; Marketing, R&D, and Sales queues each with a fair share of 4 GB and 8 cores; Jim's Team and Bob's Team subqueues each with a fair share of 2 GB and 4 cores]
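The fair shares on the slide fall out of splitting each queue's capacity evenly among its children. A toy sketch of that recursion, assuming the two team subqueues hang off a single parent (the slide does not say which), and noting that YARN's real Fair Scheduler also honors weights, min shares, and actual demand:

```python
# Toy even-split fair share over a queue hierarchy. Illustrative only:
# YARN's Fair Scheduler also supports weights, min shares, and demand.
def fair_shares(queue, mem, cores, shares=None):
    if shares is None:
        shares = {}
    shares[queue["name"]] = (mem, cores)
    children = queue.get("children", [])
    for child in children:
        # Each child gets an equal slice of its parent's share.
        fair_shares(child, mem // len(children), cores // len(children), shares)
    return shares

root = {"name": "root", "children": [
    {"name": "marketing"},
    {"name": "r_and_d"},
    {"name": "sales", "children": [   # assumption: the two team subqueues
        {"name": "jims_team"},        # hang off one parent queue; the slide
        {"name": "bobs_team"},        # does not say which
    ]},
]}

shares = fair_shares(root, mem=12 * 1024, cores=24)  # 12 GB, 24 cores
assert shares["marketing"] == (4096, 8)   # 4 GB, 8 cores, as on the slide
assert shares["jims_team"] == (2048, 4)   # 2 GB, 4 cores, as on the slide
```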
YARN App Models
- Application master (AM) per job: simplest model for batch; used by MapReduce
- Application master per session: runs multiple jobs on behalf of the same user; recently added in Tez
- AM as permanent service: always on, waits around for jobs to come in; used for Impala
Spark Usage Modes

Mode        | Long-Lived/Multiple Jobs | Multiple Users
------------|--------------------------|---------------
Batch       | No                       | No
Interactive | Yes                      | No
Server      | Yes                      | Yes
Spark on YARN
- Developed at Yahoo
- One Application Master per SparkContext
- One container per Spark executor
- Currently useful for Spark batch jobs: requests all resources up front
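A batch submission in this model might look roughly like the following spark-submit invocation. The flag names come from spark-submit itself; the queue name, class, and jar are placeholders. Every executor is requested at startup, which is why this mode fits batch jobs but not long-lived sessions:

```shell
# Sketch of a Spark batch job on YARN. Queue, class, and jar names are
# placeholders; all executor resources are requested up front.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --queue marketing \
  --num-executors 8 \
  --executor-memory 4g \
  --executor-cores 2 \
  --class com.example.MyETLJob \
  my-etl-job.jar
```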
Enhancing Spark on YARN
- Long-lived sessions
- Multiple jobs
- Multiple users
Long-Lived Goals
- Hang on to few resources when we're not running work
- Use lots of the cluster (over our fair share) when it's not being used by others
- Give back resources gracefully when preempted
- Get resources quickly when we need them
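One way to read these goals is as a target-size policy for a long-lived app's executor pool. A toy sketch, with invented names and numbers (not Spark's or YARN's actual logic): shrink to a small floor when idle, and expand past fair share only into capacity nobody else is using.

```python
# Toy target-size policy for a long-lived app's executor pool. Illustrative
# only: the function name, parameters, and defaults are invented here.
def target_executors(pending_tasks, cluster_free, fair_share, min_keep=1):
    if pending_tasks == 0:
        return min_keep                    # idle: hang on to few resources
    # Busy: take what the work needs, possibly over fair share, but only
    # into capacity that is currently free. Preemption of the over-share
    # portion would shrink cluster_free, lowering the next target.
    return min(pending_tasks, fair_share + cluster_free)

assert target_executors(pending_tasks=0, cluster_free=50, fair_share=10) == 1
assert target_executors(pending_tasks=40, cluster_free=50, fair_share=10) == 40
assert target_executors(pending_tasks=100, cluster_free=20, fair_share=10) == 30
```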
Mesos Fine-Grained Mode
- Allocate static chunks of memory at Spark app start time
- Schedule CPU dynamically when running tasks
Long-Lived Approach
- A YARN application master per Spark application (SparkContext), which is to say an application master per session
- One executor per application per node
- One YARN container per executor
- Executors can acquire and give back resources
Long-Lived: YARN Work
- YARN-896: long-lived YARN; YARN was not built for apps that stick around indefinitely
- Miscellaneous work like renewable container tokens
- YARN-1197: resizable containers
Long-Lived: Spark Work
- A YARN fine-grained mode
- Changes to support adjusting resources in the Spark AM
- Memory?
The Memory Problem
- We want to be able to have memory allocations preempted and keep running
- RDDs are stored in JVM memory
- JVMs don't give back memory
The Memory Solutions
- Rewrite Spark in C++
- Off-heap cache: hold RDDs in executor processes in off-heap byte buffers; these can be freed and returned to the OS
- Tachyon: executor processes don't hold RDDs; data is stored in Tachyon, which punts the off-heap problem to Tachyon and has other advantages, like not losing data when an executor crashes
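The off-heap idea, that cached bytes live outside the garbage-collected heap and can be handed back to the OS on demand, can be illustrated in miniature with an anonymous memory mapping. This is only an analogy: Spark's executors are JVMs and would use direct byte buffers, not Python:

```python
# Analogy for an off-heap cache: an anonymous mmap region lives outside the
# interpreter's managed heap, and close() returns its pages to the OS
# immediately; a garbage-collected heap offers no such guarantee.
import mmap

block = mmap.mmap(-1, 4096)          # "off-heap" buffer: 4 KB of raw pages
block[:11] = b"cached rdd!"          # stash serialized partition bytes
assert bytes(block[:11]) == b"cached rdd!"
block.close()                        # preempted: hand the memory back to the OS
assert block.closed
```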
Multiple User Challenges
- A single Spark application wants to run work on behalf of multiple parties
- Applications are typically billed to a single queue
- We'd want to bill jobs to different queues
[diagram: Rajat from Marketing and Sylvia from Finance both submitting work to one Spark app on the cluster]
Multiple Users with Spark Fair Scheduler
- A full-featured Fair Scheduler within a Spark application
- Two-level scheduling
- Difficult to share dynamically between Spark and other frameworks
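For reference, Spark's in-application Fair Scheduler is driven by pool definitions along these lines, enabled by setting spark.scheduler.mode to FAIR and pointing spark.scheduler.allocation.file at an allocation file. The pool names below are invented for illustration:

```xml
<?xml version="1.0"?>
<!-- Sketch of a Spark fair scheduler allocation file; pool names are
     illustrative placeholders. -->
<allocations>
  <pool name="marketing">
    <schedulingMode>FAIR</schedulingMode>
    <weight>2</weight>
    <minShare>4</minShare>
  </pool>
  <pool name="finance">
    <schedulingMode>FIFO</schedulingMode>
    <weight>1</weight>
    <minShare>2</minShare>
  </pool>
</allocations>
```

Jobs are then tagged to a pool per thread, e.g. sc.setLocalProperty("spark.scheduler.pool", "marketing"). Note this schedules fairly only within the one Spark application, which is the two-level-scheduling limitation the slide points at.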
Multiple Users with Impala
- Impala has the exact same problem
- Solution: Llama (Low Latency Application MAster), an adapter between YARN and Impala
- Runs multiple AMs in a single process
- Submits resource requests on behalf of the relevant AM
- Jobs are billed to the YARN queues they belong in
[diagram: one Spark app with an AM for the Marketing queue and an AM for the Finance queue, serving Rajat from Marketing and Sylvia from Finance]
Spark and other Hadoop processing frameworks