Secrets of YARN Application Development



Similar documents
Apache Hadoop YARN: The Nextgeneration Distributed Operating. System. Zhijie Shen & Jian Hortonworks

6. How MapReduce Works. Jari-Pekka Voutilainen

Istanbul Şehir University Big Data Camp 14. Hadoop Map Reduce. Aslan Bakirov Kevser Nur Çoğalmış

Apache Hadoop. Alexandru Costan

YARN Apache Hadoop Next Generation Compute Platform

Advanced Java Client API

Apache Sentry. Prasad Mujumdar

Hortonworks Data Platform Reference Architecture

Integrating Kerberos into Apache Hadoop

HDFS Users Guide. Table of contents

Clustering with Tomcat. Introduction. O'Reilly Network: Clustering with Tomcat. by Shyam Kumar Doddavula 07/17/2002

BlackBerry Enterprise Service 10. Secure Work Space for ios and Android Version: Security Note

How Bigtop Leveraged Docker for Build Automation and One-Click Hadoop Provisioning

Architecture of Next Generation Apache Hadoop MapReduce Framework

Architecting for the next generation of Big Data Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7

Administering Jive for Outlook

Elastic Map Reduce. Shadi Khalifa Database Systems Laboratory (DSL)

How to Install and Configure EBF15328 for MapR or with MapReduce v1

Capturx for SharePoint 2.0: Notification Workflows

Automating Big Data Benchmarking for Different Architectures with ALOJA

Creating.NET-based Mappers and Reducers for Hadoop with JNBridgePro

Mocean Android SDK Developer Guide

Hadoop & Spark Using Amazon EMR

YARN, the Apache Hadoop Platform for Streaming, Realtime and Batch Processing

Citrix Worx App SDK Overview

Spectrum Scale HDFS Transparency Guide

Cisco UCS CPA Workflows

Beyond Unit Testing. Steve Loughran Julio Guijarro HP Laboratories, Bristol, UK. steve.loughran at hpl.hp.com julio.guijarro at hpl.hp.

Assignment # 1 (Cloud Computing Security)

WINDOWS AZURE DATA MANAGEMENT AND BUSINESS ANALYTICS

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. Big Data Management and Analytics

Data Pipeline with Kafka

Salesforce Mobile Push Notifications Implementation Guide

ActiveVOS Clustering with JBoss

Apache HBase. Crazy dances on the elephant back

Background on Elastic Compute Cloud (EC2) AMI s to choose from including servers hosted on different Linux distros

A Constraint Programming Based Hadoop Scheduler for Handling MapReduce Jobs with Deadlines on Clouds

Insights to Hadoop Security Threats

How to Run Spark Application

GraySort and MinuteSort at Yahoo on Hadoop 0.23

Pro Apache Hadoop. Second Edition. Sameer Wadkar. Madhu Siddalingaiah

HDFS Architecture Guide

With a single download, the ADT Bundle includes everything you need to begin developing apps:

Last time. Today. IaaS Providers. Amazon Web Services, overview

Hadoop Security Design

Hadoop. History and Introduction. Explained By Vaibhav Agarwal

Connection Broker Managing User Connections to Workstations, Blades, VDI, and More. Quick Start with Microsoft Hyper-V

A Sample OFBiz application implementing remote access via RMI and SOAP Table of contents

WHITEPAPER. SECUREAUTH 2-FACTOR AS A SERVICE 2FaaS

H2O on Hadoop. September 30,

Hadoop Distributed File System. Jordan Prosch, Matt Kipps

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

Using the VMRC Plug-In: Startup, Invoking Methods, and Shutdown on page 4

Enterprise-level EE: Uptime, Speed, and Scale

Salesforce Mobile Push Notifications Implementation Guide

Enterprise Access Control Patterns For REST and Web APIs

Deploying Load balancing for Novell Border Manager Proxy using Session Failover feature of NBM and L4 Switch

Big Data Operations Guide for Cloudera Manager v5.x Hadoop

Peers Techno log ies Pv t. L td. HADOOP

Still Aren't Doing. Frank Kim

CDH installation & Application Test Report

Using the Push Notifications Extension Part 1: Certificates and Setup

Big Data Too Big To Ignore

Introduction to Mobile Application Management (MAM)

Data Security in Hadoop

How To Write A Map In Java (Java) On A Microsoft Powerbook 2.5 (Ahem) On An Ipa (Aeso) Or Ipa 2.4 (Aseo) On Your Computer Or Your Computer

Apache Ignite TM (Incubating) - In- Memory Data Fabric Fast Data Meets Open Source

Xiaoming Gao Hui Li Thilina Gunarathne

PROXY SETUP WITH IIS USING URL REWRITE, APPLICATION REQUEST ROUTING AND WEB FARM FRAMEWORK OR APACHE HTTP SERVER FOR EMC DOCUMENTUM EROOM

The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi,

Important Release Information and Technical and Deployment Support Notes

Sample HP OO Web Application

Hadoop Security Analysis NOTE: This is a working draft. Notes are being collected and will be edited for readability.

AxCMS.net on Network Load Balancing (NLB) Environment

docs.hortonworks.com

Oracle Solaris Remote Lab User Guide for Release 1.01

Copyright 2012, Oracle and/or its affiliates. All rights reserved.

How to Hadoop Without the Worry: Protecting Big Data at Scale

CS54100: Database Systems

QuickStart Guide for Managing Mobile Devices. Version 9.2

Department of Veterans Affairs VistA Integration Adapter Release Enhancement Manual

Microsoft Business Intelligence 2012 Single Server Install Guide

The Hadoop Distributed File System

Word Count Code using MR2 Classes and API

The Monitis Monitoring Agent ver. 1.2

Upcoming Announcements

Hadoop Distributed File System. Dhruba Borthakur June, 2007

Deploying Hadoop with Manager

Norton Family Product Manual

Big Data Management and NoSQL Databases

Setup Guide for AD FS 3.0 on the Apprenda Platform

Avoiding Performance Bottlenecks in Hyper-V

HDP Hadoop From concept to deployment.

Transcription:

Secrets of YARN Application Development Steve Loughran stevel at hortonworks.com @steveloughran Berlin, May 2014 2014

If all you have is a JobTracker Page 2

Everything pretends to be an MR Job Page 3

Page 4 YARN: more tools for the engineers

SQL, Stats Codd Your code Knuth YARN Lamport Swinehart et al. Network Cerf Host OS Kernighan Page 5

YARN runs code across the cluster Servers run YARN Node Managers NM's heartbeat to Resource Manager RM schedules work over cluster RM allocates containers to apps NMs start containers NMs report container health YARN Node Manager YARN Resource Manager The RM YARN Node Manager YARN Node Manager 2014 Page 6

Client creates App Master Client YARN Node Manager Application Master YARN Resource Manager The RM YARN Node Manager YARN Node Manager 2014 Page 7

ContainerLaunchContext // env vars, with env- variables // ApplicationConstants.Environment.LOG_DIRS.$() setenvironment(map<string,string> env) // Local resources are files in, S3 to copy/[+unzip/untar] setlocalresources(map<string, LocalResource> localresources) //bash - c commands setcommands(list<string> commands); Page 8

Local Resources pulled from Map<String, LocalResource> localresources = new HashMap<>(); LocalResource res = Records.newRecord(LocalResource.class); res.settype(localresourcetype.archive); res.setvisibility(localresourcevisibility.application); URL yarnurl URL.newInstance("s3n","packages","80","hbase.tar.gz"); res.setresource(yarnurl); resources.put("lib/hbase",res); ctx.setlocalresources(resources); Page 9

Classpath setup public static String buildcp(configuration conf) { StringBuilder cp = new StringBuilder( ApplicationConstants.Environment.CLASSPATH.$()); } String[] ycp = conf.getstrings( "yarn.application.classpath", DEFAULT_YARN_CROSS_PLATFORM_APPLICATION_CLASSPATH); List<String> ycpl = Arrays.asList(ycp); ycpl.stream().foreach( (e) - > cp.append(e).append(':') ); return cp.tostring(); Page 10

Classpath Grief 1. Client uploads dependencies to, adds as resources then onto classpath 2. "yarn.application.classpath" 3. YarnConfiguration. DEFAULT_YARN_CROSS_PLATFORM_ APPLICATION_CLASSPATH 4. Your code had better use the same JARs as Hadoop 5. HADOOP-9991 "roll up JARs to latest versions" Better: OSGi, leaner Hadoop client libs Page 11

Tasks of Application Master Request containers from YARN Resource Manager Process allocation responses Submit launch requests to run code in containers Handle IPC calls from clients and container-side code Handle notifications from YARN Resource Manager Implement placement policy Implement failure handling policy Page 12

Container Requests public static class ContainerRequest { final Resource capability;... } final List<String> nodes; final List<String> racks; final Priority priority; final boolean relaxlocality; In Slider best-effort, persistent placement history some failure tracking TODO: moving average, greylisting Page 13

AM asks for containers YARN Resource Manager YARN Node Manager Application Master containers YARN Node Manager YARN Node Manager Container container 2014 Page 14

YARN notifies AM of failures YARN Resource Manager YARN Node Manager Application Master YARN Node Manager Container YARN Node Manager Container Container 2014

Remember Model-View-Controller REST Client RPC Client ZK RM NM Web/ REST Application Master Engine Cluster State Cluster Spec Node Map Role History requested starting releasing 2014 Page 16

AM Restart leading edge Persisted in Rebuilt Transient Specification resources.json &c NodeMap model of YARN cluster Event History application history ComponentHistory persistent history of component placements Component Map container ID -> component instance Container Queues requested, starting, releasing ctx.setkeepcontainersacrossapplicationattempts(true) Page 17

Testing is fun 1. Unit tests 2.?? Unmanaged AM?? 3. MiniYARNCluster 4. 1-node VMs 5. Cloud-hosted Hadoop clusters 6. Physical clusters Me: Unit, Mini, Physical, VMs: RHEL 6 + Hadoop 2.4, Ubuntu 12, Java 8, Hadoop branch-2, Kerberos Windows 7, Hadoop 2.4 Page 18

Avoid the Lamport Layer: delegate Spring XD: YARN apps in Spring Apache Tez: pipeline of operations, "sessions" Apache Slider : existing apps in a YARN cluster Apache Twill Microsoft Reef? Page 19

Twill: Runnables -> "elsewhere" TwillController controller; TwillRunnerService twillrunner = new YarnTwillRunnerService( conf, zkstr, createlocationfactory()); twillrunner.startandwait(); TwillController controller = twillrunner.prepare( new RenderRunnable()).addLogHandler(new Slf4JLogHandler(log)).start(); Services.getCompletionFuture(controller).get(); Page 20

Example: Frame Renderer render args + source image è frame Generates GB of data Each frame render independent, repeatable Output-driven placement: consecutive frames close github.com/steveloughran/renderize Page 21

actions from Twill context; IO to public void run() { String[] aa = getcontext().getapplicationarguments(); } RenderArgs args = new RenderArgs(aa); HadoopImageIO imageio = new HadoopImageIO(conf); BufferedImage jpeg = imageio.readjpeg(args.image); Renderer renderer = new Renderer(jpeg); int width = jpeg.getwidth(); int height = jpeg.getheight(); int x = args.getrenderx(width); int y = args.getrendery(height); renderer.render(x, y, args.message); imageio.writejpeg(renderer.image, args.dest); Page 22

Page 23

Summary : YARN Lets you run whatever code you want in a Hadoop Cluster Hides a lot of the tasks of deploying and running distributed apps Does require someone to handle the remainder Life is simpler if you can find someone else to do this: Twill, Spring XD, Tez, etc. if you try, you can do lots of interesting things Focus on the Algorithms Page 24

Questions? hortonworks.com Hortonworks Inc 2014 Page 25

Security Embrace Kerberos and test early 2014 Page 26

Set Security at launch time if (insecurecluster) { env.put("hadoop_user_name", UserGroupInformation.getCurrentUser().getUserName()); } else { } Credentials creds = new Credentials(); String renewer = conf.get("yarn.resourcemanager.principal"); hdfs.adddelegationtokens(renewer, creds); DataOutputBuffer dob = new DataOutputBuffer(); creds.writetokenstoragetostream(dob); ByteBuffer bytes = ByteBuffer.wrap(dob.getData(), 0, dob.getlength()); containerlaunchcontext.settokens(bytes); Page 27

And retrieve in AM if (UserGroupInformation.isSecurityEnabled()) { secretmanager.setmasterkey( response.getclienttoamtokenmasterkey().array()); applicationacls = response.getapplicationacls(); } Tokens expire after ~72 hours Page 28

IPC/RPC: Avoid Hard to code, security painful, Better: REST APIs If you must do it: avoid protobuf 2014 Page 29

If you really, really must META- INF/services/org.apache.hadoop.security.SecurityInfo: org.apache.slider.rpc.sliderrpcsecurityinfo class SliderRPCSecurityInfo extends SecurityInfo {... } class SliderAMPolicyProvider extends PolicyProvider {... } @KerberosInfo(serverPrincipal = "slider.kerberos.principal") public interface SliderClusterProtocol extends VersionedProtocol {... } //and in AM rpcservice.refreshserviceacl(conf, new SliderAMPolicyProvider()) Page 30

REST APIs SliderAmIpFilter.doFilter(request, response, filterchain) String dest = httpreq.getrequesturi(); if (!dest.startswith("ws/") } &&!(proxyaddrs.contains(httpreq.getremoteaddr())) { String redirect = httpresp.encoderedirecturl(proxy + dest); httpresp.sendredirect(redirect); return; Even if clients handled 307 response Proxy doesn't forward PUT/POST/DELETE Page 31