Cray XC30 Hadoop Platform Jonathan (Bill) Sparks Howard Pritchard Martha Dumler



Similar documents
Cray s Storage History and Outlook Lustre+ Jason Goodman, Cray LUG Denver

Intel HPC Distribution for Apache Hadoop* Software including Intel Enterprise Edition for Lustre* Software. SC13, November, 2013

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

The Fusion of Supercomputing and Big Data: The Role of Global Memory Architectures in Future Large Scale Data Analytics

Lustre* Testing: The Basics. Justin Miller, Cray Inc. James Nunez, Intel Corporation LAD 15 Paris, France

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments

Hadoop Ecosystem B Y R A H I M A.

HiBench Introduction. Carson Wang Software & Services Group

Big Data for Big Science. Bernard Doering Business Development, EMEA Big Data Software

Workshop on Hadoop with Big Data

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Constructing a Data Lake: Hadoop and Oracle Database United!

Hadoop: The Definitive Guide

ITG Software Engineering

Hadoop & Spark Using Amazon EMR

Upcoming Announcements

Dell In-Memory Appliance for Cloudera Enterprise

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments

Comprehensive Analytics on the Hortonworks Data Platform

A Brief Introduction to Apache Tez

Qsoft Inc

Apache Hadoop: Past, Present, and Future

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. Big Data Management and Analytics

SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP. Eva Andreasson Cloudera

Implement Hadoop jobs to extract business value from large and varied data sets

Dominik Wagenknecht Accenture

Deploying Hadoop with Manager

Hadoop* on Lustre* Liu Ying High Performance Data Division, Intel Corporation

Extended Attributes and Transparent Encryption in Apache Hadoop

Chase Wu New Jersey Ins0tute of Technology

Hortonworks and ODP: Realizing the Future of Big Data, Now Manila, May 13, 2015

Moving From Hadoop to Spark

Peers Techno log ies Pv t. L td. HADOOP

Bringing Big Data to People

How Companies are! Using Spark

Why Spark on Hadoop Matters

HADOOP. Revised 10/19/2015

Pilot-Streaming: Design Considerations for a Stream Processing Framework for High- Performance Computing

Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.

What We Can Do in the Cloud (2) -Tutorial for Cloud Computing Course- Mikael Fernandus Simalango WISE Research Lab Ajou University, South Korea

HPCHadoop: MapReduce on Cray X-series

brief contents PART 1 BACKGROUND AND FUNDAMENTALS...1 PART 2 PART 3 BIG DATA PATTERNS PART 4 BEYOND MAPREDUCE...385

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM

Session 0202: Big Data in action with SAP HANA and Hadoop Platforms Prasad Illapani Product Management & Strategy (SAP HANA & Big Data) SAP Labs LLC,

Unified Big Data Analytics Pipeline. 连 城

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Enabling High performance Big Data platform with RDMA

Case Study : 3 different hadoop cluster deployments

Unified Big Data Processing with Apache Spark. Matei

Hadoop Applications on High Performance Computing. Devaraj Kavali

TRAINING PROGRAM ON BIGDATA/HADOOP

Extending the Enterprise Data Warehouse with Hadoop Robert Lancaster. Nov 7, 2012

Next-Gen Big Data Analytics using the Spark stack

and Hadoop Technology

Ali Ghodsi Head of PM and Engineering Databricks

How Intel IT Successfully Migrated to Cloudera Apache Hadoop*

Performance Analysis of Mixed Distributed Filesystem Workloads

HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM

Data Security in Hadoop

Apache Hadoop Ecosystem

HPCHadoop: A framework to run Hadoop on Cray X-series supercomputers

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

Savanna Hadoop on. OpenStack. Savanna Technical Lead

The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi,

Collaborative Big Data Analytics. Copyright 2012 EMC Corporation. All rights reserved.

COURSE CONTENT Big Data and Hadoop Training

BIG DATA TRENDS AND TECHNOLOGIES

BIG DATA & HADOOP DEVELOPER TRAINING & CERTIFICATION

ENABLING GLOBAL HADOOP WITH EMC ELASTIC CLOUD STORAGE

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

BIG DATA What it is and how to use?

<Insert Picture Here> Big Data

Hadoop Introduction. Olivier Renault Solution Engineer - Hortonworks

Fast, Low-Overhead Encryption for Apache Hadoop*

Survey of the Benchmark Systems and Testing Frameworks For Tachyon-Perf

Hadoop 101. Lars George. NoSQL- Ma4ers, Cologne April 26, 2013

THE HADOOP DISTRIBUTED FILE SYSTEM

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

CSE-E5430 Scalable Cloud Computing Lecture 2

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

MapReduce and Lustre * : Running Hadoop * in a High Performance Computing Environment

Apache Hadoop: The Big Data Refinery

Bright Cluster Manager

#TalendSandbox for Big Data

BIG DATA HADOOP TRAINING

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

Bright Cluster Manager

Red Hat Enterprise Linux is open, scalable, and flexible

Big Data Approaches. Making Sense of Big Data. Ian Crosland. Jan 2016

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing

Understanding Hadoop Performance on Lustre

Benchmarking Sahara-based Big-Data-as-a-Service Solutions. Zhidong Yu, Weiting Chen (Intel) Matthew Farrellee (Red Hat) May 2015

YARN Apache Hadoop Next Generation Compute Platform

Transcription:

Cray XC30 Hadoop Platform Jonathan (Bill) Sparks Howard Pritchard Martha Dumler

Safe Harbor Statement This presentation may contain forward-looking statements that are based on our current expectations. Forward looking statements may include statements about our financial guidance and expected operating results, our opportunities and future potential, our product development and new product introduction plans, our ability to expand and penetrate our addressable markets and other statements that are not historical facts. These statements are only predictions and actual results may materially vary from those projected. Please refer to Cray's documents filed with the SEC from time to time concerning factors that could affect the Company and these forward-looking statements. 2

Hadoop MapReduce: new workload for XC30 Apache Hadoop is an open-source implementation of a big data application framework pioneered by Google. Typical commercial applications are things like Social networking, web 2.0 applications. Business Intelligence targeted marketing, sales analysis Inventory and sales pattern analysis HPC applications in areas like genomics and seismic (search) and generally in post-processing large data sets. Hadoop is a new/additional workload for your XC30 3

XC30 Hadoop Releases Aimed at existing XC30 customers that want to run hadoop on a subset of their compute nodes Provided to assist and optimize with installation and getting familiar with the hadoop environment and develop applications that take advantage of hadoop Download and go No cost No formal support contract 4

HPCS Challenges Workload Management Some Hadoop components are long-running, shared services HPCS is generally a batch-oriented platform Hadoop components that are batch-oriented (map-reduce/yarn) don t integrate well with traditional HPCS WLMs Storage and I/O Hadoop traditionally uses local, persistent storage Hadoop Distributed File System (HDFS) tightly coupled with many hadoop components HPC Systems utilize global parallel filesystems 5

Cray XC30: Cray Framework for Hadoop Typical Hadoop Components Sqoop Data exchange Flume Log collection Zookeeper coordination Oozie Workflow Pig Scripting Mahout Machine Learning R Connectors Statistics Apache MapReduce 2.x Distributed Processing Framework Hive SQL-like queries Hadoop Distributed File System (HDFS) HBase Columnar Storage Batch MyHadoop Components Cray Components XC30 Hadoop Optimization Best Practices Validation, and Documentation Cray XC-30 Integrated Aries Network, Cascade Compute, CLE, HSS Lustre Storage Hadoop Performance Enhancements Cray Framework for Hadoop < > < 6

XC30 Hadoop Initial Release December 2013 7

XC30 Hadoop - Initial Release System Components: Batch template files for Moab/Torque, PBSPro XC30-specific configuration settings Batch Components: MapReduce, YARN, HDFS Plus environment components (Hive, Flume, HBase, Zookeeper, ) Sqoop Data exchange Flume Log collection Zookeeper coordination Oozie Workflow Pig Scripting Mahout Machine Learning R Connectors Statistics Apache MapReduce 2.x Distributed Processing Framework Hive SQL-like queries Hadoop Distributed File System (HDFS) HBase Columnar Storage XC30 Hadoop Optimization Best Practices Validation, and Documentation Cray XC-30 Integrated Aries Network, Cascade Compute, CLE, HSS Lustre Storage Hadoop Performance Enhancements 8

Initial Release Contents Site chooses Hadoop distribution Cray provides XC/XE instructions Apache, Bigtop, Cloudera, HortonWorks, IDH all used internally Hadoop environment packages optionally installed on re-purposed compute nodes Suggested Site Customizations Adjust YARN parameters to reflect number of cpus/node and mem/node Adjust default memory limits for mapper and reducer tasks Batch script and template files provided for quickstart examples PBS Pro MOAB/Torque Best when all nodes configured for same cpus/node and mem/node 9

MapReduce on XC30 Utilizes Global Filesystem single copy of input data XC30 High-speed Aries Network fast data movement Large compute farm easy to install/configure additional MR nodes Typical MapReduce XC30 MapReduce Mapper Node Mapper Node Mapper Node Mapper Node Mapper Node Mapper Node Mapper Node Mapper Node HDFS HDFS HDFS HDFS HDFS HDFS HDFS HDFS Copy/Shuffle Copy/Shuffle Reducer Node Reducer Node Result Lustre Global File System HDFS HDFS Reducer Node Reducer Node Result HDFS HDFS 10

PUMA Execution Details 11

Performance Improvement XC30 Tuning Achieved an average of 20% performance improvement of the PUMA benchmarks with XC30-specific tuning as compared to hadoop out of the box 450 Average Combined Execution Time (20% Overall Improvement) 400 350 Execution Time (secs) 300 250 200 150 100 PUMA Benchmarks 50 0 "Out of the Box" Config Accelerations Plus Compression 12

XC30 Hadoop Release Update April 2014 13

Update Release April 2014 Focus on batch myhadoop Tested at scale on XC30 and XE Support eslogin/jdk1.7 compatible Update to myhadoop (Apache 2.0.6) Simplified batch interaction Batch script reduced from 130 lines to 34 lines of code Dynamic parameter selection based on job size Hadoop rawfs support for native Lustre/Filesystem Reducer Shuffle update Additional Hadoop instructions (pig, mahout, crunch) Updated Documentation Avoiding Lustre problems. 14

Cray myhadoop Uses system batch system to create on-demand Hadoop instance. Works with PBS, Moab/Torque Includes Hadoop, Pig (scripting), Mahout (Machine Learning), Crunch (pipelining) Easy setup and install Pre-configured Dynamic configurations matching node count HDFS or nohdfs configuration switch 5/4/2014 15

myhadoop login qsub batch script Cray Compute WLM Allocation MapReduce Lustre Hadoop ResourceManager YARN Allocator Hadoop NodeManager YARN Task executor 5/4/2014 16

HDFS and Lustre A typical Hadoop deployment, makes extensive use of HDFS: Application jar files job logfiles to HDFS Input and output data,. HDFS on top of Lustre performance penalties Larger jobs cause problems with Lustre software stack Solution: Default config parameters set to run without HDFS Off-loading of files to tmpfs rather than writing to lustre FS 17

Puma Benchmarks HDFS/NoHDFS 1200 Puma Bechmarks: NoHDFS vs HDFS 1000 Exection time (secs) 800 600 400 HDFS/Lustre NoHDFS/Luster 200 0 Benchmark 18

Hadoop DFSIO test times Single node writing to HDFS/Lustre vs Lustre Throughput rate with/without HDFS XC30 - DFSIO Node bandwith 2,500 2304 Total Bandwidth MB/s 2,000 1,500 1,000 500 1944 516 1836 1,008 1,428 1596 1,248 1476 1,110 HDFS/Lustre Lustre 0 10 100 1,000 10,000 100,000 File Sizes (MB) 19

XC30 Hadoop Next Release Date is TBD 20

Prototypes Aries RDMA - protocol over RDMA that supports a socketlevel API for applications Transparent bandwidth advantage without modifying a TCP sockets based application Bandwidth comparison: TCP sockets: ~25Gbps Gsockets: ~60 Gbps Lustre Filesystem class native Lustre Access Use of intermediate storage Integration with System Management allows provisioning/finer-grain management of compute nodes to grow/shrink map/reduce nodes 21

Additional Capabilities Scala GraphX (graph) Apache Spark (in-memory) Apache Giraph (graph) Apache Hadoop Core (batch) Apache Applications (crunch, mahout,.) Chapel Further capabilities Graph analytics In-memory map reduce Chapel with XC30 runtime Tachyon (in-memory filesystem) Tachyon (in-memory fs) Aries G-Sockets HDFS file system Class Plug-in XC30 Opt Runtime Lustre File System 22

XC30 Hadoop Roadmap LA Release Dec 2013 LA Update April 2014 Proposed XC30 config optimizations for 20% performance improvement Lustre-aware shuffle plug-in Batch myhadoop native hadoop for full env capabilities Lustre fixes and workarounds Easier install process Focus on batch myhadoop Lustre filesystem class plug-in (lustre shim ) Additional capabilities: In-memory mapreduce (spark) Aries RDMA GraphX Chapel HDFS I/O 23

Feedback What applications are you running or considering running with hadoop? What XC30 Hadoop roadmap components are being utilized? XE/XC Beta Sites providing feedback and sharing use cases via hadoop-dev@cray.com 24

Legal Disclaimer Information in this document is provided in connection with Cray Inc. products. No license, express or implied, to any intellectual property rights is granted by this document. Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates and figures specified are preliminary based on current expectations, and are subject to change without notice. Cray hardware and software products may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request. Cray uses codenames internally to identify products that are in development and not yet publically announced for release. Customers and other third parties are not authorized by Cray Inc. to use codenames in advertising, promotion or marketing and any use of Cray Inc. internal codenames is at the sole risk of the user. Performance tests and ratings are measured using specific systems and/or components and reflect the approximate performance of Cray Inc. products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. The following are trademarks of Cray Inc. and are registered in the United States and other countries: CRAY and design, SONEXION, URIKA, and YARCDATA. The following are trademarks of Cray Inc.: ACE, APPRENTICE2, CHAPEL, CLUSTER CONNECT, CRAYPAT, CRAYPORT, ECOPHLEX, LIBSCI, NODEKARE, THREADSTORM. The following system family marks, and associated model number marks, are trademarks of Cray Inc.: CS, CX, XC, XE, XK, XMT, and XT. The registered trademark LINUX is used pursuant to a sublicense from LMI, the exclusive licensee of Linus Torvalds, owner of the mark on a worldwide basis. Other trademarks used in this document are the property of their respective owners. 25

26