HPC ABDS: The Case for an Integrating Apache Big Data Stack

HPC ABDS: The Case for an Integrating Apache Big Data Stack with HPC
1st JTC 1 SGBD Meeting, SDSC, San Diego, March 19 2014
Judy Qiu, Shantenu Jha (Rutgers), Geoffrey Fox
gcf@indiana.edu
http://www.infomall.org
School of Informatics and Computing, Digital Science Center, Indiana University Bloomington

Enhanced Apache Big Data Stack (ABDS)
- ~120 Capabilities, >40 Apache
- Green layers have strong HPC Integration opportunities
- Goal: Functionality of ABDS, Performance of HPC

Broad Layers in HPC ABDS
- Workflow Orchestration
- Application and Analytics
- High level Programming
- Basic Programming model and runtime: SPMD, Streaming, MapReduce, MPI
- Inter process communication: Collectives, point to point, publish subscribe
- In memory databases/caches
- Object relational mapping
- SQL and NoSQL, File management
- Data Transport
- Cluster Resource Management (Yarn, Slurm, SGE)
- File systems (HDFS, Lustre ...)
- DevOps (Puppet, Chef ...)
- IaaS Management from HPC to hypervisors (OpenStack)
- Cross Cutting: Message Protocols, Distributed Coordination, Security & Privacy, Monitoring

Getting High Performance on Data Analytics (e.g. Mahout, R ...)
On the systems side, we have two principles:
- The Apache Big Data Stack with ~120 projects has important broad functionality with a vital large support organization
- HPC including MPI has striking success in delivering high performance, however with a fragile sustainability model
There are key systems abstractions, which are levels in the HPC ABDS software stack, where the Apache approach needs careful integration with HPC:
- Resource management
- Storage
- Programming model: horizontal scaling parallelism
- Collective and Point to Point communication
- Support of iteration
- Data interface (not just key value)
In application areas, we define application abstractions to support:
- Graphs/network
- Geospatial
- Images
- etc.

4 Forms of MapReduce
[Figure: four dataflow diagrams, (a) to (d), with example applications under each]
(a) Map Only (Input, map, Output): BLAST Analysis, Parametric sweep, Pleasingly Parallel
(b) Classic MapReduce (Input, map, reduce): High Energy Physics (HEP) Histograms, Distributed search
(c) Iterative MapReduce (Input, map, reduce, Iterations): Expectation maximization, Clustering e.g. Kmeans, Linear Algebra, Page Rank
(d) Loosely Synchronous (P ij): Classic MPI, PDE Solvers and particle dynamics
Bottom labels: Domain of MapReduce and Iterative Extensions, Science Clouds; MPI; Giraph
MPI is Map followed by Point to Point or Collective Communication as in style c) plus d)
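
To make form (c) concrete, here is a minimal single-process sketch of Kmeans written as iterative MapReduce. It is an illustration only, not code from Twister, Harp or Hadoop: the function names (`kmeans_map`, `kmeans_reduce`, `iterative_kmeans`) are hypothetical, and the "partitions" are plain Python lists standing in for data cached on workers between iterations.

```python
def nearest(point, centers):
    """Index of the center closest to point (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centers[k])))


def kmeans_map(partition, centers):
    """Map: per-center partial sums and counts for one data partition."""
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)
    for point in partition:
        k = nearest(point, centers)
        counts[k] += 1
        for d in range(dim):
            sums[k][d] += point[d]
    return sums, counts


def kmeans_reduce(partials, old_centers):
    """Reduce: merge the partial sums of all map tasks into new centers."""
    new_centers = []
    for k, old in enumerate(old_centers):
        total = sum(counts[k] for _, counts in partials)
        if total == 0:
            new_centers.append(list(old))  # empty cluster: keep the old center
            continue
        new_centers.append([sum(sums[k][d] for sums, _ in partials) / total
                            for d in range(len(old))])
    return new_centers


def iterative_kmeans(partitions, centers, iterations=10):
    """Driver: the partitions are reused; only the centers change per iteration."""
    for _ in range(iterations):
        partials = [kmeans_map(part, centers) for part in partitions]
        centers = kmeans_reduce(partials, centers)
    return centers
```

For example, `iterative_kmeans([[(0, 0), (0, 1)], [(10, 10), (10, 11)]], [[0, 0], [10, 10]])` moves each center to the mean of its two points. Forms (a) and (b) correspond to running `kmeans_map` alone, or one map and one reduce without the loop.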

HPC ABDS Hourglass
- HPC ABDS System (Middleware): 120 Software Projects
- System Abstractions/standards: Data format; Storage; HPC Yarn for Resource management; Horizontally scalable parallel programming model; Collective and Point to Point communication; Support of iteration
- Application Abstractions/standards: Graphs, Networks, Images, Geospatial ...
- High performance Applications: SPIDAL (Scalable Parallel Interoperable Data Analytics Library) or High performance Mahout, R, Matlab ...

Integrating Yarn with HPC

We are sort of working on Use Cases with HPC ABDS
- Use Case 10 Internet of Things: Yarn, Storm, ActiveMQ
- Use Case 19, 20 Genomics: Hadoop, Iterative MapReduce, MPI. Much better analytics than Mahout
- Use Case 26 Deep Learning: High performance distributed GPU (optimized collectives) with Python front end (planned)
- Variant of Use Case 26, 27 Image classification using Kmeans: Iterative MapReduce
- Use Case 28 Twitter: optimized index for HBase, Hadoop and Iterative MapReduce
- Use Case 30 Network Science: MPI and Giraph for network structure and dynamics (planned)
- Use Case 39 Particle Physics: Iterative MapReduce (wrote proposal)
- Use Case 43 Radar Image Analysis: Hadoop for multiple individual images, moving to Iterative MapReduce for global integration over all images
- Use Case 44 Radar Images: Running on Amazon

Features of Harp Hadoop Plug in
- Hadoop Plugin (on Hadoop 1.2.1 and Hadoop 2.2.0)
- Hierarchical data abstraction on arrays, key values and graphs for easy programming expressiveness
- Collective communication model to support various communication operations on the data abstractions
- Caching with buffer management for memory allocation required from computation and communication
- BSP style parallelism
- Fault tolerance with check pointing
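
The combination of a collective communication model with BSP style parallelism can be pictured with the following toy sketch. It simulates workers inside one Python process and does not use or reproduce Harp's actual API; `allreduce_sum` and `bsp_superstep` are hypothetical names chosen for illustration.

```python
def allreduce_sum(local_vectors):
    """Simulated allreduce: every worker receives the element-wise sum."""
    total = [sum(values) for values in zip(*local_vectors)]
    return [list(total) for _ in local_vectors]


def bsp_superstep(worker_data, model, compute):
    """One BSP superstep: local compute, then a collective, then a barrier."""
    # Phase 1: each worker computes a partial result from its own data only.
    partials = [compute(data, model) for data in worker_data]
    # Phase 2: a collective takes the place of the shuffle-and-reduce stage.
    replicas = allreduce_sum(partials)
    # Phase 3: implicit barrier; every replica now holds the same values.
    return replicas[0]


def sum_and_count(data, model):
    """Example local computation: partial sum and count toward a global mean."""
    return [sum(data), len(data)]


total, count = bsp_superstep([[1, 2], [3], [4, 5, 6]], None, sum_and_count)
print(total / count)  # global mean over all workers' data
```

In a real deployment the workers would be separate processes and the collective would move data over the network; the point of the sketch is only the shape of a superstep, where all workers finish with an identical copy of the combined result.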

Architecture
- Application: MapReduce Applications; Map Collective Applications
- Framework: MapReduce V2; Harp
- Resource Manager: YARN

Performance on Madrid Cluster (8 nodes)
[Figure: K Means Clustering, Harp v.s. Hadoop on Madrid. Y axis: Execution Time (s), 0 to 1600. X axis: Problem Size (100m 500, 10m 5k, 1m 50k), labeled Identical Computation, Increasing Communication. Series: Hadoop and Harp, each at 24, 48 and 96 cores.]
Note compute same in each case as product of centers times points identical
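
The note about identical computation can be checked with a few lines of arithmetic. The sketch below reads the three problem sizes as (points, centers) pairs, which is an interpretation of the axis labels "100m 500", "10m 5k" and "1m 50k", and counts one distance evaluation per point-center pair per iteration.

```python
# (points, centers) pairs, as read from the problem-size axis labels.
PROBLEM_SIZES = [(100_000_000, 500), (10_000_000, 5_000), (1_000_000, 50_000)]


def distance_evaluations(points, centers):
    """Point-to-center distance evaluations in one Kmeans iteration."""
    return points * centers


for points, centers in PROBLEM_SIZES:
    work = distance_evaluations(points, centers)
    print(f"{points:>11,} points x {centers:>6,} centers = {work:,} evaluations")
```

All three cases come to 5 x 10^10 evaluations per iteration, while the number of centers grows 100-fold from the first case to the last. If each task exchanges one partial sum per center (an assumption about the implementation, not something the slide states), that growth is what drives the increasing communication.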

- Mahout and Hadoop MR: Slow due to MapReduce
- Python: slow as Scripting
- Spark: Iterative MapReduce, non optimal communication
- Harp: Hadoop plug in with ~MPI collectives
- MPI: fastest as C not Java
Increasing Communication, Identical Computation

Performance of MPI Kernel Operations
[Figure: four panels, each plotting Average time (us) against Message size (bytes)]
- Performance of MPI send and receive operations. Series: MPI.NET C# in Tempest; FastMPJ Java in FG; OMPI nightly Java FG; OMPI trunk Java FG; OMPI trunk C FG
- Performance of MPI allreduce operation. Series: same as above
- Performance of MPI send and receive on Infiniband and Ethernet. Series: OMPI trunk C Madrid; OMPI trunk Java Madrid; OMPI trunk C FG; OMPI trunk Java FG
- Performance of MPI allreduce on Infiniband and Ethernet. Series: same as the previous panel
Pure Java as in FastMPJ slower than Java interfacing to C version of MPI

Use case 28: Truthy: Information diffusion research from Twitter Data
Building blocks:
- Yarn
- Parallel query evaluation using Hadoop MapReduce
- Related hashtag mining algorithm using Hadoop MapReduce
- Meme daily frequency generation using MapReduce over index tables
- Parallel force directed graph layout algorithm using Twister (Harp) iterative MapReduce
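
As a rough illustration of the daily frequency building block, the sketch below counts hashtags per day with a map step and a reduce step. It is not the Truthy or IndexedHBase code: it scans raw tweet records instead of index tables, and the record layout (`created_at` as an ISO date string, `text` as the tweet body) and function names are assumptions made for this example.

```python
from collections import defaultdict


def map_tweet(tweet):
    """Map: emit ((hashtag, day), 1) for each hashtag in one tweet."""
    day = tweet["created_at"][:10]  # assumes ISO timestamps, e.g. 2012-06-01T10:00:00
    for word in tweet["text"].split():
        if word.startswith("#") and len(word) > 1:
            yield (word.lower(), day), 1


def reduce_counts(pairs):
    """Reduce: sum the emitted values for each (hashtag, day) key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def daily_frequency(tweets):
    """Run the map step over all tweets, then the reduce step over all pairs."""
    pairs = (pair for tweet in tweets for pair in map_tweet(tweet))
    return reduce_counts(pairs)
```

With an index keyed by hashtag, the map step would read only the index entries for the memes of interest rather than every tweet, which is presumably the benefit of running it over index tables.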

Use case 28: Truthy: Information diffusion research from Twitter Data
[Figures: Two months data loading for varied cluster size; Scalability of iterative graph layout algorithm on Twister]
Hadoop FS not indexed

Pig Performance
[Figure: Different Kmeans Implementation, Total execution time vs. mapper number. Y axis: Total execution time (s), 0 to 2000. X axis: number of mappers (24, 48, 96). Series: Hadoop, Harp, Pig HD1 and Pig Yarn, each at 100m,500; 10m,5000; 1m,50000.]

Lines of Code

               Pig Kmeans   Hadoop Kmeans   Pig IndexedHBase     IndexedHBase
                                            meme cooccurcount    meme cooccurcount
Java           ~345         780             152                  ~434
Pig            10           0               10                   0
Python / Bash  ~40          0               0                    28
Total Lines    395          780             162                  462

DACIDR for Gene Analysis (Use Case 19, 20)
- Deterministic Annealing Clustering and Interpolative Dimension Reduction Method (DACIDR)
- Use Hadoop for pleasingly parallel applications, and Twister (replacing by Yarn) for iterative MapReduce applications
[Figure: Simplified Flow Chart of DACIDR. Boxes: Sequences; All Pair Sequence Alignment; Pairwise Clustering; Multidimensional Scaling; Cluster Centers; Streaming; Visualization; Add Existing data and find Phylogenetic Tree]

Summarize a million Fungi Sequences: Spherical Phylogram Visualization
[Figures: RAxML result visualized in FigTree; Spherical Phylogram from new MDS method visualized in PlotViz]

Lessons / Insights
- Integrate (don't compete) HPC with Commodity Big data (Google to Amazon to Enterprise data Analytics), i.e. improve Mahout; don't compete with it
- Use Hadoop plug ins rather than replacing Hadoop
- Enhanced Apache Big Data Stack HPC ABDS has 120 members; please improve!
- HPC ABDS+ Integration areas include file systems, cluster resource management, file and object data management, inter process and thread communication, analytics libraries, Workflow monitoring