Pilot-Streaming: Design Considerations for a Stream Processing Framework for High- Performance Computing

Pilot-Streaming: Design Considerations for a Stream Processing Framework for High- Performance Computing Andre Luckow, Peter M. Kasson, Shantenu Jha STREAMING 2016, 03/23/2016 RADICAL, Rutgers, http://radical.rutgers.edu

Motivation There is a need to couple data sources, HPC, analytics! 20+ applications identified at STREAM16 Challenges: Data applications and pipelines are complex Scalability and Elasticity: dynamic changes in resource demands Scheduling and provisioning of resources: right amount of resources at right time Programming models: HPC (MPI, OpenMP, GPU) vs. Big Data (Java, Python, R) Interoperability: Data sources sinks often in different environments (IoT, cloud, HPC, HPDC) than compute Current State: Streaming (in sciences) often implemented on application-level (w/ limited re-use) Manifold landscape of streaming tools (Apache Open Source Tools, Cloud Tools)

Workload Characteristics HPC Resource HPC Resource 1 HPC Resource 2 Simulation Analysis Simulation Analysis

Workload Characteristics HPC Resource 1 Simulation Message Broker HPC Resource 2 HPC Resource 3 Analysis 1 Analysis 2

Introduction Pilot Abstraction User Space User Application Pilot-Job System Pilot-Job Pilot-Job Policies Resource Manager System Space Resource A Resource B Resource C Resource D http://arxiv.org/abs/1207.6644

The Convergence of HPC and Data Intensive Computing MPI Frameworks for Advanced Analytics & Machine Learning (Blas, ScaLAPACK, CompLearn, PetSc, Blast) Applications Orchestration (Pegasus, Taverna, Dryad, Swift) Advanced Analytics & Machine Learning (Pilot-KMeans, Replica Exchange) MapReduce Frameworks (Pilot-MapReduce) Declarative Languages (Swift) Workload Management (Pilots, Condor) Higher-Level Workload Management (TEZ, LLama) Applications Advanced Analytics & Machine Learning (Mahout, R, MLBase) SQL-Engines (Impala, Hive, Shark, Phoenix) Data Store & Processing (HBase) Scheduler Orchestration (Oozie, Pig) In-Memory (Spark) Spark Scheduler MapReduce Map Reduce Scheduler Twister MapReduce Twister Scheduler Data Processing, Analytics, Orchestration Higher-Level Runtime Environment MPI, RDMA Hadoop Shuffle/Reduction, HARP Collectives Communication Cluster Resource Manager (Slurm, Torque, SGE) Compute Resources (Nodes, Cores, VMs) Data Access (Virtual Filesystem, GridFTP, SSH) Storage Management (irods, SRM, GFFS) Storage Resources (Lustre, GPFS) Cluster Resource Manager (YARN, Mesos) Compute and Data Resources (Nodes, Cores, HDFS) Resource Management Resource Fabric High-Performance Computing Apache Hadoop Big Data A Tale of Two Data-Intensive Paradigms: Data Intensive Applications, Abstractions and Architectures In collaboration with Geoffrey Fox (Indiana), http://arxiv.org/abs/1403.1528

http://arxiv.org/abs/1602.00345 Pilot-Abstraction for HPC and Hadoop Interoperability Map Reduce Spark- App Other YARN App Hadoop/Spark App HPC App (e.g. MPI) Application YARN Pilot-Job Spark Hadoop Application Scheduler (e.g. Spark, Tez, LLama) Pilot-Job Application-level Scheduling HPC Scheduler (Slurm, Torque, SGE) Mode I: Hadoop on HPC YARN/HDFS Mode II: HPC on Hadoop System-level Scheduling

Streaming and Batch Computing Data Broker Broker Broker Streaming Framework Compute (e.g. YARN, SLURM, Torque, PBS) ETL Hadoop SQL Storage and Format (e.g. Lustre, HDFS, ) Raw Text HDF5 Columnar Mutable/ Random Access Machine Learning Other Questions: - How to manage batch and streaming frameworks side-byside? - How to enable interoperability between different programming system/models/middleware/schedu lers? - How to enable elasticity? Message Broker Storage Stream Processing http://dx.doi.org/10.5281/zenodo.47946

Pilot-Streaming Distributed Application SAGA Local/ Parallel FS (SSH/GO) HPC Node Node n n Pilot Agent SSH SSH Pilot Compute GFFS HTC (OSG/EGI) Pilot API Cloud Pilot Data Cloud YARN SSH irods Globus Online Cloud HDFS Kafka Local (irods) SRM (irods) Node Node n n Pilot Agent SSH SSH Local / EBS (SSH) S3 (HTTP) EC2 Node VM Node n n Pilot Agent SSH SSH Hadoop HDFS (WebHDFS) YARN Node Node n n Pilot Agent SSH SSH Infrastructure User-Space

Conclusion 1. Pilot-Jobs enable the co-location of HPC/Simulations and Big Data Tools (Hadoop, Spark, higher-level tools) 2. Pilot-Streaming will support message-broker as data source/sink that enables the de-coupling of applications 3. Dynamic resource management provided by the Pilot- Abstraction is critical for stream environments

Thank you!