Hadoop Applications on High Performance Computing. Devaraj Kavali devaraj@apache.org



Similar documents
Next-Gen Big Data Analytics using the Spark stack

Apache Hadoop YARN: The Nextgeneration Distributed Operating. System. Zhijie Shen & Jian Hortonworks

Extended Attributes and Transparent Encryption in Apache Hadoop

Hetero Streams Library 1.0

Hadoop* on Lustre* Liu Ying High Performance Data Division, Intel Corporation

Intel Media SDK Library Distribution and Dispatching Process

Intel Unite. User Guide

The Case for Rack Scale Architecture

Page Modification Logging for Virtual Machine Monitor White Paper

Intel Desktop public roadmap

Intel HPC Distribution for Apache Hadoop* Software including Intel Enterprise Edition for Lustre* Software. SC13, November, 2013

Big Data for Big Science. Bernard Doering Business Development, EMEA Big Data Software

MapReduce and Lustre * : Running Hadoop * in a High Performance Computing Environment

Extending PCIe NVMe Storage to Client. John Carroll Intel Corporation. Flash Memory Summit 2015 Santa Clara, CA 1

Intel HTML5 Development Environment. Article - Native Application Facebook* Integration

Intel Service Assurance Administrator. Product Overview

Intel Platform and Big Data: Making big data work for you.

YARN and how MapReduce works in Hadoop By Alex Holmes

Cray XC30 Hadoop Platform Jonathan (Bill) Sparks Howard Pritchard Martha Dumler

Jun Liu, Senior Software Engineer Bianny Bian, Engineering Manager SSG/STO/PAC

Intel IoT Gateways: Publishing Data to an MQTT Broker Using Python

Intel and Qihoo 360 Internet Portal Datacenter - Big Data Storage Optimization Case Study

YARN Apache Hadoop Next Generation Compute Platform

Intel Retail Client Manager Audience Analytics

Cloud based Holdfast Electronic Sports Game Platform

Intel HTML5 Development Environment. Tutorial Building an Apple ios* Application Binary

Intel Core i5 processor 520E CPU Embedded Application Power Guideline Addendum January 2011

Intelligent Business Operations

EHCI Removal from 6 th Generation Intel Core Processor Family Platform Controller Hub (PCH)

Intel Solid-State Drive Pro 2500 Series Opal* Compatibility Guide

Intel Identity Protection Technology (IPT)

Intel SSD 520 Series Specification Update

Version Rev. 1.0

BSP for Windows* Embedded Compact* 7 and Windows* Embedded Compact 2013 for Mobile Intel 4th Generation Core TM Processors and Intel 8 Series Chipset

with PKI Use Case Guide

Fast, Low-Overhead Encryption for Apache Hadoop*

Vendor Update Intel 49 th IDC HPC User Forum. Mike Lafferty HPC Marketing Intel Americas Corp.

Intel HTML5 Development Environment. Tutorial Test & Submit a Microsoft Windows Phone 8* App (BETA)

iscsi Quick-Connect Guide for Red Hat Linux

CLOUD SECURITY: Secure Your Infrastructure

Intel Retail Client Manager

* * * Intel RealSense SDK Architecture

Intel Data Center Manager. Data center IT agility and control

Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms

Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems

High Performance Computing and Big Data: The coming wave.

Experiences with Lustre* and Hadoop*

Developing High-Performance, Scalable, cost effective storage solutions with Intel Cloud Edition Lustre* and Amazon Web Services

Three Paths to Faster Simulations Using ANSYS Mechanical 16.0 and Intel Architecture

Intel Identity Protection Technology Enabling improved user-friendly strong authentication in VASCO's latest generation solutions

Intel Desktop Board DG41TY

Real-Time Analytical Processing (RTAP) Using the Spark Stack. Jason Dai Intel Software and Services Group

Intel Solid-State Drives Increase Productivity of Product Design and Simulation

Big Data, SAP HANA. SUSE Linux Enterprise Server for SAP Applications. Kim Aaltonen

Scaling up to Production

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments

Intel Cloud Builder Guide: Cloud Design and Deployment on Intel Platforms

Intel HTML5 Development Environment Article Using the App Dev Center

Benchmarking Cloud Storage through a Standard Approach Wang, Yaguang Intel Corporation

GraySort and MinuteSort at Yahoo on Hadoop 0.23

Resetting USB drive using Windows Diskpart command

Big Data Analytics(Hadoop) Prepared By : Manoj Kumar Joshi & Vikas Sawhney

Architecture of Next Generation Apache Hadoop MapReduce Framework

COSBench: A benchmark Tool for Cloud Object Storage Services. Jiangang.Duan@intel.com

Accelerating Business Intelligence with Large-Scale System Memory

Intel Cyber Security Briefing: Trends, Solutions, and Opportunities. Matthew Rosenquist, Cyber Security Strategist, Intel Corp

新 一 代 軟 體 定 義 的 網 路 架 構 Software Defined Networking (SDN) and Network Function Virtualization (NFV)

New Dimensions in Configurable Computing at runtime simultaneously allows Big Data and fine Grain HPC

Intel Desktop Board D945GCPE

Intel Simple Network Management Protocol (SNMP) Subagent v6.0

Benchmarking Sahara-based Big-Data-as-a-Service Solutions. Zhidong Yu, Weiting Chen (Intel) Matthew Farrellee (Red Hat) May 2015

HPC & Big Data THE TIME HAS COME FOR A SCALABLE FRAMEWORK

Intel Desktop Board DG41BI

Lustre* Testing: The Basics. Justin Miller, Cray Inc. James Nunez, Intel Corporation LAD 15 Paris, France

Intel Remote Configuration Certificate Utility Frequently Asked Questions

Performance Comparison of Intel Enterprise Edition for Lustre* software and HDFS for MapReduce Applications

Intel Desktop Board DG43RK

Integrating Genetic Data into Clinical Workflow with Clinical Decision Support Apps

Cloud Service Brokerage Case Study. Health Insurance Association Launches a Security and Integration Cloud Service Brokerage

Dell* In-Memory Appliance for Cloudera* Enterprise

Intel Internet of Things (IoT) Developer Kit

Hadoop Security Analysis NOTE: This is a working draft. Notes are being collected and will be edited for readability.

Deploying Hadoop with Manager

Real-Time Big Data Analytics SAP HANA with the Intel Distribution for Apache Hadoop software

Software Evaluation Guide for Autodesk 3ds Max 2009* and Enemy Territory: Quake Wars* Render a 3D character while playing a game

Intel Desktop Board D945GCPE Specification Update

Lustre * Filesystem for Cloud and Hadoop *

Intel Desktop Board D101GGC Specification Update

Intel Desktop Board DP55WB

Big Data Analytics on Object Storage -- Hadoop over Ceph* Object Storage with SSD Cache

Intel Desktop Board DG31PR

Intel Cloud Builder Guide to Cloud Design and Deployment on Intel Xeon Processor-based Platforms

CDH 5 Quick Start Guide

How to Configure Intel X520 Ethernet Server Adapter Based Virtual Functions on Citrix* XenServer 6.0*

The Improved Job Scheduling Algorithm of Hadoop Platform

Intel vpro Technology Module for Microsoft* Windows PowerShell*

Transcription:

Hadoop Applications on High Performance Computing Devaraj Kavali devaraj@apache.org

About Me Apache Hadoop Committer Yarn/MapReduce Contributor Senior Software Engineer @Intel Corporation 2

Agenda Objectives HDFS Applications with HPC File Systems Yarn Application Mapreduce Job HPC Schedulers Yarn Protocols Log Aggregation Shuffle Implementation Q&A 3

Objectives Use existing HPC Cluster for running Hadoop Applications Use any of the HPC File Systems like Lustre, PVFS, IBRIX Fusion, etc. Use any of the HPC schedulers like Slurm, Moab, PBS Pro, etc. Combine Hadoop workloads with HPC workloads No code changes to existing Hadoop(HDFS/YARN/MR) applications Minimal Hadoop configuration changes 4

HDFS Applications Using HPC File Systems public class <HPC>Adapter extends AbstractFileSystem{.. HDFS Application.. HPC FS } 5

Hadoop Configurations for File System <property> <name>fs.defaultfs</name> <value>${hpc-uri}:///</value> </property> <property> <name>fs.abstractfilesystem.${hpc-uri}.impl</name> <value>hpcfilesystemadapter</value> </property> 6

YARN Application Node Manager Container Container Client Resource Manager Node Manager App Master Container 7

YARN Application with HPC Scheduler HPC Scheduler Slave Container Container Client HPC Scheduler Master HPC Scheduler Slave App Master Container 8

Yarn Application Submission with HPC Scheduler Yarn Application 1. submitapplication() Yarn Client HPC 2. submit() Application 3. run 7. Launch Yarn Child HPC Scheduler Client Protocol Impl Container Yarn Child 4. Launch 5. Allocate App Master 6. Run Child Task HPC Application Master Protocol Impl Application Master HPC Container Management Protocol Impl 8. Report Progress AMRM Client NM Client 9

Mapreduce Job with HPC Scheduler 1. submit() LocalClientProtocol Provider Job Client Local Job Runner Yarn Client ApplicationC lientprotocol Proxy Resource Manager 2. submitjobinternal() YarnClientProtocol Provider 5. submit() 6. submit() Job Submitter 3. submitjob() YARN Runner 4. submit() ResourceMg rdelegate HPCApplicat ionclientprot ocolimpl 7. run() HPC Scheduler Job History Server 10

Yarn Protocols Configurations RPC class Configuration <property> <description>rpc class implementation</description> <name>yarn.ipc.rpc.class</name> <value>hadoopyarnhpcrpc</value> </property> 11

Yarn Protocols Configurations public class HadoopYarnHPCRPC extends HadoopYarnProtoRPC { @Override public Object getproxy(class protocol, InetSocketAddress address, Configuration conf) { Object proxy; if (protocol == ApplicationClientProtocol.class) { proxy = new HPCApplicationClientProtocolImpl(conf); } else if (protocol == ApplicationMasterProtocol.class) { proxy = new HPCApplicationMasterProtocolImpl(conf); } else if (protocol == ContainerManagementProtocol.class) { proxy = new HPCContainerManagementProtocolImpl(conf); } else { proxy = super.getproxy(protocol, address, conf); } return proxy; } 12

Application Client Protocol Yarn Application Client Protocol Flow Yarn Application Client RMClient RMApplicati onclientpro tocolproxy 1. getnewapplication () 2. submitapplication () 3. forcekillapplication () 4. getclustermetrics () 5. getclusternodes () 6. getqueueinfo ().. Resource Manager 13

Application Client Protocol Yarn Application Client HPC Scheduler Flow Yarn Application Client RMClient HPCApplicati onclientprot ocolimpl 1. Allocate Resources cmd 2. Submit Job(Batch) cmd 3. Cancel Job cmd 4. Cluster Info cmd 5. Get Jobs Report cmd.. HPC Scheduler 14

Application Client Protocol API s for interaction 1. getnewapplication() The interface used by clients to obtain a new ApplicationId for submitting new applications. 2. submitapplication() The interface used by clients to submit a new application to the ResourceManager. 3. forcekillapplication() The interface used by clients to request the ResourceManager to abort submitted application. 4. getclustermetrics() 5. getclusternodes() 6. getqueueinfo() 15

Application Master Protocol Yarn Application Master Flow Diagram RMClient Application Master RMApplicati onmasters erviceproto colproxy 1. registerapplicationmaster() 2. allocate() 3. finishapplicationmaster() Resource Manager 16

Application Master Protocol HPC Scheduler Application Master Flow Diagram NMClient Application Master HPCApplicati onmasterprot ocolimpl 1. Cluster Info Cmd 2. Resource Allocation Cmd 3. Finish Tasks Cmds HPC Scheduler 17

Application Master Protocol API s for interaction 1. registerapplicationmaster() The interface used by a new ApplicationMaster to register with the ResourceManager. 2. allocate() The main interface between an ApplicationMaster and the ResourceManager. 3. finishapplicationmaster() The interface used by an ApplicationMaster to notify the ResourceManager about its completion (success or failed). 18

Container Management Protocol Yarn Container Management Flow NMClient Application Master NMContain ermanagem entprotocol Proxy 1. startcontainers() 2. stopcontainers() 3. getcontainerstatuses() Node Manager 19

Container Management Protocol HPC Scheduler Task Management Flow NMClient Application Master HPCContai nermanage mentprotoc olimpl 1. Start Containers Cmd 2. Stop Containers Cmd 3. Get Container Statuses Cmd HPC Scheduler 20

Container Management Protocol API s for interaction 1. startcontainers() The ApplicationMaster provides a list of StartContainerRequest's to a NodeManager to start Container's allocated to it using this interface. 2. stopcontainers() The ApplicationMaster requests a NodeManager to stop a list of Container's allocated to it using this interface. 3. getcontainerstatuses() The API used by the ApplicationMaster to request for current statuses of Container's from the NodeManager. 21

Yarn Log Aggregation Log Aggregation by Node Manager <property> <name>yarn.log-aggregation-enable</name> <value>true</value> </property> Log Aggregation with HPC Scheduler Issue an HPC scheduler command to execute in all nodes(where application tasks executed) as part of ApplicationMasterProtocol.finishApplicationMaster() for aggregating the application logs. 22

Shuffle Handling Hadoop Node 1 Map Task 1. Assign Task 3. Task Completed MRAppMaster 4. Get Map Completion Events 5. Map Completed Reduce Task EventFetcher Node Manager 2. Write Final Map O/P Local Dirs Shuffle Consumer 6. Read Map O/P 23

Shuffle Handling HPC File Systems Map Task 1. Assign Task 3. Task Completed MRAppMaster 4. Get Map Completion Events 5. Map Completed Reduce Task EventFetcher Shuffle Consumer 2. Write Final Map O/P 6. Read Map O/P Parallel File System 24

Shuffle Handling Shuffle Handler <property> <name>mapreduce.job.map.output.collector.class</name> <value>org.apache.hadoop.mapred.maptask$mapoutputbuffer</value> <description> The MapOutputCollector implementation(s) to use. This may be a comma-separated list of class names, in which case the map task will try to initialize each of the collectors in turn. The first to successfully initialize will be used. </description> </property> 25

Shuffle Handling Shuffle Consumer <property> <name>mapreduce.job.reduce.shuffle.consumer.plugin.class</name> <value>org.apache.hadoop.mapreduce.task.reduce.shuffle</value> <description> Name of the class whose instance will be used to send shuffle requests by reduce tasks of this job. The class must be an instance of org.apache.hadoop.mapred.shuffleconsumerplugin. </description> </property> 26

Summary HDFS configuration for new File System HPC Schedulers YARN Protocols M/R Shuffle Implementation Yarn Log Aggregation 27

Q & A 28

Thank You devaraj@apache.org 29

Notices and Disclaimers Copyright 2014 Intel Corporation. Intel, the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. See Trademarks on intel.com for full list of Intel trademarks. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks. Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate. Results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and provided to you for informational purposes. Any differences in your system hardware, software or configuration may affect your actual performance. Intel technologies may require enabled hardware, specific software, or services activation. Check with your system manufacturer or retailer. No computer system can be absolutely secure. Intel does not assume any liability for lost or stolen data or systems or any damages resulting from such losses. You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter drafted which includes subject matter disclosed herein. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document. The products described may contain design defects or errors known as errata which may cause the product to deviate from publish. 30

31