Pilot-Streaming: Design Considerations for a Stream Processing Framework for High- Performance Computing



Similar documents
Data-Intensive Applications on HPC Using Hadoop, Spark and RADICAL-Cybertools

HPC ABDS: The Case for an Integrating Apache Big Data Stack

Hadoop on HPC: Integrating Hadoop and Pilot-based Dynamic Resource Management

How Companies are! Using Spark

HDP Hadoop From concept to deployment.

SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP. Eva Andreasson Cloudera

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Hortonworks and ODP: Realizing the Future of Big Data, Now Manila, May 13, 2015

Case Study : 3 different hadoop cluster deployments

Intel HPC Distribution for Apache Hadoop* Software including Intel Enterprise Edition for Lustre* Software. SC13, November, 2013

Hadoop & Spark Using Amazon EMR

Moving From Hadoop to Spark

Apache Flink Next-gen data analysis. Kostas

Hadoop in the Enterprise

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

Upcoming Announcements

Hadoop Ecosystem B Y R A H I M A.

Large scale processing using Hadoop. Ján Vaňo

Cray XC30 Hadoop Platform Jonathan (Bill) Sparks Howard Pritchard Martha Dumler

Ali Ghodsi Head of PM and Engineering Databricks

Big Data Approaches. Making Sense of Big Data. Ian Crosland. Jan 2016

Comprehensive Analytics on the Hortonworks Data Platform

Data Security in Hadoop

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

YARN Apache Hadoop Next Generation Compute Platform

Hadoop IST 734 SS CHUNG

Accelerating Enterprise Big Data Success. Tim Stevens, VP of Business and Corporate Development Cloudera

Information Builders Mission & Value Proposition

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

Scaling Out With Apache Spark. DTL Meeting Slides based on

Systems Engineering II. Pramod Bhatotia TU Dresden dresden.de

Unified Big Data Analytics Pipeline. 连 城

The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi,

INDUS / AXIOMINE. Adopting Hadoop In the Enterprise Typical Enterprise Use Cases

Welcome to the first Workshop on Big data Open Source Systems (BOSS)

Dell In-Memory Appliance for Cloudera Enterprise

Cloud-based Analytics and Map Reduce

Data Semantics Aware Cloud for High Performance Analytics

Lustre * Filesystem for Cloud and Hadoop *

Modernizing Your Data Warehouse for Hadoop

Big Data Research in the AMPLab: BDAS and Beyond

Roadmap Talend : découvrez les futures fonctionnalités de Talend

Big Data and Industrial Internet

Big Data Research in Berlin BBDC and Apache Flink

Why Spark on Hadoop Matters

Workshop on Hadoop with Big Data

Enabling High performance Big Data platform with RDMA

Hadoop and Map-Reduce. Swati Gore

A Brief Introduction to Apache Tez

Chase Wu New Jersey Ins0tute of Technology

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Big Data for Big Science. Bernard Doering Business Development, EMEA Big Data Software

CPET 581 Cloud Computing: Technologies and Enterprise IT Strategies

HDP Enabling the Modern Data Architecture

Native Connectivity to Big Data Sources in MicroStrategy 10. Presented by: Raja Ganapathy

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. project.org. University of California, Berkeley UC BERKELEY

Hadoop Job Oriented Training Agenda

Unlocking the True Value of Hadoop with Open Data Science

Managing large clusters resources

Hadoop. Apache Hadoop is an open-source software framework for storage and large scale processing of data-sets on clusters of commodity hardware.

Real Time Big Data Processing

The Internet of Things and Big Data: Intro

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Analytics on Spark &

Open source Google-style large scale data analysis with Hadoop

How To Write A Trusted Analytics Platform (Tap)

Implement Hadoop jobs to extract business value from large and varied data sets

Hadoop. Sunday, November 25, 12

SAS BIG DATA SOLUTIONS ON AWS SAS FORUM ESPAÑA, OCTOBER 16 TH, 2014 IAN MEYERS SOLUTIONS ARCHITECT / AMAZON WEB SERVICES

Processing NGS Data with Hadoop-BAM and SeqPig

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments

HADOOP. Revised 10/19/2015

Apache Hadoop: Past, Present, and Future

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

Virtualizing Apache Hadoop. June, 2012

Data-intensive HPC: opportunities and challenges. Patrick Valduriez

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing

Native Connectivity to Big Data Sources in MSTR 10

Azure Data Lake Analytics

Open Cirrus: Towards an Open Source Cloud Stack

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Big Data Course Highlights

Unified Big Data Processing with Apache Spark. Matei

Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84

Session 0202: Big Data in action with SAP HANA and Hadoop Platforms Prasad Illapani Product Management & Strategy (SAP HANA & Big Data) SAP Labs LLC,

HADOOP VENDOR DISTRIBUTIONS THE WHY, THE WHO AND THE HOW? Guruprasad K.N. Enterprise Architect Wipro BOTWORKS

Hadoop-BAM and SeqPig

Storage Architectures for Big Data in the Cloud

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

CSE-E5430 Scalable Cloud Computing Lecture 2

Transcription:

Pilot-Streaming: Design Considerations for a Stream Processing Framework for High- Performance Computing Andre Luckow, Peter M. Kasson, Shantenu Jha STREAMING 2016, 03/23/2016 RADICAL, Rutgers, http://radical.rutgers.edu

Motivation There is a need to couple data sources, HPC, analytics! 20+ applications identified at STREAM16 Challenges: Data applications and pipelines are complex Scalability and Elasticity: dynamic changes in resource demands Scheduling and provisioning of resources: right amount of resources at right time Programming models: HPC (MPI, OpenMP, GPU) vs. Big Data (Java, Python, R) Interoperability: Data sources sinks often in different environments (IoT, cloud, HPC, HPDC) than compute Current State: Streaming (in sciences) often implemented on application-level (w/ limited re-use) Manifold landscape of streaming tools (Apache Open Source Tools, Cloud Tools)

Workload Characteristics HPC Resource HPC Resource 1 HPC Resource 2 Simulation Analysis Simulation Analysis

Workload Characteristics HPC Resource 1 Simulation Message Broker HPC Resource 2 HPC Resource 3 Analysis 1 Analysis 2

Introduction Pilot Abstraction User Space User Application Pilot-Job System Pilot-Job Pilot-Job Policies Resource Manager System Space Resource A Resource B Resource C Resource D http://arxiv.org/abs/1207.6644

The Convergence of HPC and Data Intensive Computing MPI Frameworks for Advanced Analytics & Machine Learning (Blas, ScaLAPACK, CompLearn, PetSc, Blast) Applications Orchestration (Pegasus, Taverna, Dryad, Swift) Advanced Analytics & Machine Learning (Pilot-KMeans, Replica Exchange) MapReduce Frameworks (Pilot-MapReduce) Declarative Languages (Swift) Workload Management (Pilots, Condor) Higher-Level Workload Management (TEZ, LLama) Applications Advanced Analytics & Machine Learning (Mahout, R, MLBase) SQL-Engines (Impala, Hive, Shark, Phoenix) Data Store & Processing (HBase) Scheduler Orchestration (Oozie, Pig) In-Memory (Spark) Spark Scheduler MapReduce Map Reduce Scheduler Twister MapReduce Twister Scheduler Data Processing, Analytics, Orchestration Higher-Level Runtime Environment MPI, RDMA Hadoop Shuffle/Reduction, HARP Collectives Communication Cluster Resource Manager (Slurm, Torque, SGE) Compute Resources (Nodes, Cores, VMs) Data Access (Virtual Filesystem, GridFTP, SSH) Storage Management (irods, SRM, GFFS) Storage Resources (Lustre, GPFS) Cluster Resource Manager (YARN, Mesos) Compute and Data Resources (Nodes, Cores, HDFS) Resource Management Resource Fabric High-Performance Computing Apache Hadoop Big Data A Tale of Two Data-Intensive Paradigms: Data Intensive Applications, Abstractions and Architectures In collaboration with Geoffrey Fox (Indiana), http://arxiv.org/abs/1403.1528

http://arxiv.org/abs/1602.00345 Pilot-Abstraction for HPC and Hadoop Interoperability Map Reduce Spark- App Other YARN App Hadoop/Spark App HPC App (e.g. MPI) Application YARN Pilot-Job Spark Hadoop Application Scheduler (e.g. Spark, Tez, LLama) Pilot-Job Application-level Scheduling HPC Scheduler (Slurm, Torque, SGE) Mode I: Hadoop on HPC YARN/HDFS Mode II: HPC on Hadoop System-level Scheduling

Streaming and Batch Computing Data Broker Broker Broker Streaming Framework Compute (e.g. YARN, SLURM, Torque, PBS) ETL Hadoop SQL Storage and Format (e.g. Lustre, HDFS, ) Raw Text HDF5 Columnar Mutable/ Random Access Machine Learning Other Questions: - How to manage batch and streaming frameworks side-byside? - How to enable interoperability between different programming system/models/middleware/schedu lers? - How to enable elasticity? Message Broker Storage Stream Processing http://dx.doi.org/10.5281/zenodo.47946

Pilot-Streaming Distributed Application SAGA Local/ Parallel FS (SSH/GO) HPC Node Node n n Pilot Agent SSH SSH Pilot Compute GFFS HTC (OSG/EGI) Pilot API Cloud Pilot Data Cloud YARN SSH irods Globus Online Cloud HDFS Kafka Local (irods) SRM (irods) Node Node n n Pilot Agent SSH SSH Local / EBS (SSH) S3 (HTTP) EC2 Node VM Node n n Pilot Agent SSH SSH Hadoop HDFS (WebHDFS) YARN Node Node n n Pilot Agent SSH SSH Infrastructure User-Space

Conclusion 1. Pilot-Jobs enable the co-location of HPC/Simulations and Big Data Tools (Hadoop, Spark, higher-level tools) 2. Pilot-Streaming will support message-broker as data source/sink that enables the de-coupling of applications 3. Dynamic resource management provided by the Pilot- Abstraction is critical for stream environments

Thank you!