Workflow Tools at NERSC. Debbie Bard djbard@lbl.gov NERSC Data and Analytics Services




Workflow Tools at NERSC. Debbie Bard, djbard@lbl.gov, NERSC Data and Analytics Services. NERSC User Meeting, August 13th, 2015.

What Does Workflow Software Do?
Automate the connection of applications, chaining together the steps in a job pipeline.
Automate provenance tracking, enabling results to be reproduced.
Assist with data movement.
Monitor running processes and handle errors.
Process streaming experimental data, including near-realtime processing.
Help you work with (or around) the batch scheduler and queue policies.

Workflows are Personal. Many tools exist in the workflow space (Google "scientific workflow software"). Each domain seems to have its own workflow solution to handle domain-specific quirks, and no single tool solves every problem. Examples: FireWorks, qdo, Tigres, Galaxy, Swift, BigPanda, Pegasus, Taverna, Airavata.

Visualising a workflow: Swift

Visualising a workflow: Galaxy

Workflows as GUI: Galaxy

Workflows as code: Swift/Tigres. Swift is a workflow language (http://swift-lang.org). Tigres is a Python/C library for capturing workflow constructs (parallel computing, split/merge, sequences) within your code (http://tigres.lbl.gov).
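The constructs named above can be sketched in plain Python. This is illustrative only, not the Tigres or Swift API; the task functions (`prepare`, `simulate`, `merge_results`) are hypothetical stand-ins:

```python
# Illustrative only: a plain-Python sketch of the sequence and
# split/merge constructs that workflow libraries such as Tigres capture.
# The task functions here are hypothetical stand-ins for real steps.
from concurrent.futures import ProcessPoolExecutor

def prepare(raw):
    # Sequence step: normalize the input
    return [x * 2 for x in raw]

def simulate(value):
    # Split step: runs independently for each input element
    return value + 1

def merge_results(parts):
    # Merge step: combine the parallel outputs
    return sum(parts)

if __name__ == "__main__":
    data = prepare([1, 2, 3])                   # sequence
    with ProcessPoolExecutor() as pool:
        parts = list(pool.map(simulate, data))  # split (parallel)
    total = merge_results(parts)                # merge
    print(total)  # (2*1+1) + (2*2+1) + (2*3+1) = 15
```

A workflow library adds what this sketch lacks: persistence, provenance, and execution across batch jobs rather than local processes.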

Workflows Working Group. Earlier this year the Workflows working group investigated the breadth of technologies. We support two tools at NERSC: FireWorks and Swift. This doesn't mean other tools won't be used or supported at NERSC, only that DAS has specific expertise in these. We are also creating an ecosystem to enable self-supported workflow tools: databases, user-defined software modules, AMQP services, etc.

Existing Workflow Ecosystem @ NERSC
Science Gateways
Databases: MongoDB, Postgres, MySQL, SQLite, SciDB
Workflow tools (self-supported): FireWorks, Swift, Tigres, qdo, Galaxy
High-throughput batch queues
NEWT REST API
Globus / Data Transfer Nodes
Many-task frameworks: MySGE, TaskFarmer
Other web-based tools for interactive use cases: IPython, RStudio, NX
MapReduce frameworks: Spark, Hadoop
Workflow tools exist in and interact with a rich environment of NERSC capabilities and services.

Workflows and Data-Intensive Science. Data-intensive scientific computing may not always fit the traditional HPC paradigm: large numbers of tasks with a low degree of parallelism, job dependencies and chaining, and the need to communicate with external data sources and databases. Workflow and work orchestration in this context can be thought of as sequences of compute- and data-centric operations.

High-Throughput Bag of Tasks. You often need to process large numbers of smallish tasks repeatedly, but typical queue policies work against you: a lot of time is lost waiting, and the batch system is not set up for lots of little tasks. Instead, use a workflow system to queue up tasks and launch long-running workers to consume them. Examples: qdo and FireWorks.
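The bag-of-tasks pattern can be sketched with the standard library: a shared queue of many small tasks drained by a few long-running workers. Real tools such as qdo and FireWorks persist the queue in a database so workers on any node can pull from it; the squaring step here is a stand-in for the real computation:

```python
# Minimal sketch of the "bag of tasks" pattern: many small tasks in a
# shared queue, consumed by a few long-running workers, instead of one
# batch job per task. Pure stdlib; real tools persist the queue in a
# database so workers on any compute node can pull from it.
import queue
import threading

tasks = queue.Queue()
for i in range(100):            # many smallish tasks
    tasks.put(i)

results = []
lock = threading.Lock()

def worker():
    # Long-running worker: pull tasks until the queue is drained
    while True:
        try:
            task = tasks.get_nowait()
        except queue.Empty:
            return
        out = task * task       # stand-in for the real computation
        with lock:
            results.append(out)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(results))  # 100
```

The scheduler sees four workers, not a hundred jobs, which is exactly the trade the queue policies reward.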

Use Case: qdo (cosmology). qdo is specifically designed to package multiple small tasks into one batch job.

qdo examples
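As a plain-Python illustration (not the qdo API), packing many small task command lines into a few batch jobs might look like the following; the `./process_exposure` command is a hypothetical example:

```python
# Sketch of task packing (the idea behind qdo, not its API): group many
# small task command lines into a handful of batch-job-sized chunks so
# the scheduler sees a few submissions instead of a thousand.
def pack_tasks(commands, tasks_per_job):
    """Group task command lines into batch-job-sized chunks."""
    return [commands[i:i + tasks_per_job]
            for i in range(0, len(commands), tasks_per_job)]

def make_batch_script(chunk):
    # One batch job runs its chunk of tasks back to back; on a real
    # system a launcher (e.g. srun) could run them in parallel.
    return "\n".join(["#!/bin/bash"] + chunk)

# Hypothetical task list: process 1000 exposures
commands = [f"./process_exposure {n}" for n in range(1000)]
jobs = pack_tasks(commands, 250)
print(len(jobs))  # 4 batch jobs instead of 1000
```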

Use Case: FireWorks (materials science). A MongoDB instance contains the task info and metadata.

FireWorks: Error Handling and Dynamic Workflows. You can specify the action to take on soft failures, hard failures, and human errors, e.g.:
lpad rerun_fws -s FIZZLED
lpad detect_unreserved --rerun
lpad detect_lostruns --rerun
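The rerun behaviour can be sketched as a small state machine in plain Python. This simulates in a dict what FireWorks stores in MongoDB and drives via the lpad CLI; the task names and retry limit are illustrative:

```python
# Sketch of the rerun state machine: failed ("FIZZLED") tasks are reset
# to READY and retried up to a limit. This simulates in a dict what a
# workflow engine persists in a database; names here are illustrative.
MAX_RETRIES = 2

def run_workflow(tasks, runner):
    state = {name: {"status": "READY", "tries": 0} for name in tasks}
    progress = True
    while progress:
        progress = False
        # Run every READY task once
        for name, info in state.items():
            if info["status"] != "READY":
                continue
            info["tries"] += 1
            ok = runner(name, info["tries"])
            info["status"] = "COMPLETED" if ok else "FIZZLED"
            progress = True
        # Rerun: reset FIZZLED tasks that still have retries left
        for info in state.values():
            if info["status"] == "FIZZLED" and info["tries"] <= MAX_RETRIES:
                info["status"] = "READY"
                progress = True
    return state

# A flaky task that succeeds on its second attempt:
flaky = lambda name, attempt: attempt >= 2
final = run_workflow(["relax", "static"], flaky)
print(final["relax"]["status"])  # COMPLETED
```

In FireWorks itself the states live in MongoDB, so the retry decision survives crashes and can be applied across thousands of jobs at once.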

Many-Task / MapReduce Frameworks. Repeatedly perform tasks on a large dataset. Map: perform an operation across a large set, i.e. map a task across the dataset. Reduce: collect and reduce the results from the map operation. The data is split across nodes and the task runs on each node; this typically does not require much cross-node communication. Frameworks at NERSC: Spark, Hadoop, MySGE, TaskFarmer.
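The map/reduce shape described above can be shown in pure Python with a word-count example; the same structure runs at scale on Spark or Hadoop with the chunks split across nodes:

```python
# Word-count style map/reduce in pure Python: map an operation across
# data chunks independently, then reduce the per-chunk results. At
# scale, a framework distributes the chunks across nodes.
from collections import Counter
from functools import reduce

chunks = [
    "the quick brown fox",
    "the lazy dog",
    "the fox again",
]

def map_chunk(chunk):
    # Map: independent per-chunk work (no cross-node communication)
    return Counter(chunk.split())

def reduce_counts(a, b):
    # Reduce: combine partial results from the map phase
    return a + b

totals = reduce(reduce_counts, map(map_chunk, chunks))
print(totals["the"])  # 3
```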

Batch Queues. NERSC supports serial and high-throughput queues well suited to jobs that need many-task computing. The Cori serial queue is designed specifically for these use cases, and reservations are available for special needs. Consider using the job-packing options in the various workflow tools to optimize for the HPC queue infrastructure, e.g. for packing single-core jobs onto a multi-core node.

Use Case: Materials Project. Simulate the properties of all possible materials.

Use Case: Materials Project. Tasks are submitted to the FireWorks MongoDB via the API, a Python script, etc., and MongoDB keeps the list of tasks to be run. FireWorks submits workers to the NERSC queues, and the workers pull jobs from MongoDB. FireWorks manages job orchestration: retry on failure, file transfer, job dependencies, flow control for subsequent jobs, and duplicate management.

Materials Project Workflow [diagram]: input (a cool material!) → input generation → workflow mapping (parameter choice) → supercomputer submission / monitoring → error handling → file transfer → file parsing / DB insertion → output (lots of information about the cool material!).

Materials Project Gateway

Use Case: SPOT Suite
Collect data from the beamline.
Move data to NERSC with SPADE/Globus.
Trigger analysis at NERSC via AMQP.
View jobs and results on a Science Gateway.
Track provenance and metadata via MongoDB.
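The trigger-and-track pattern behind this pipeline can be sketched with the standard library. A stdlib queue stands in for the AMQP broker and a dict for the MongoDB metadata store; dataset names and the "analysis" are illustrative:

```python
# Sketch of the SPOT-style pattern: data-arrival events are published
# to a message queue, and a consumer launches analysis while recording
# provenance. A stdlib queue stands in for AMQP and a dict for MongoDB.
import queue
import time

events = queue.Queue()   # stands in for the AMQP broker
provenance = {}          # stands in for the MongoDB metadata store

def on_data_arrival(dataset_id):
    # Publisher side: the beamline transfer announces a new dataset
    events.put({"dataset": dataset_id, "arrived": time.time()})

def analysis_consumer():
    # Consumer side: trigger analysis and record provenance per dataset
    while not events.empty():
        evt = events.get()
        result = f"reduced-{evt['dataset']}"   # stand-in for analysis
        provenance[evt["dataset"]] = {
            "arrived": evt["arrived"],
            "result": result,
            "steps": ["transfer", "trigger", "analyze"],
        }

on_data_arrival("scan_001")
on_data_arrival("scan_002")
analysis_consumer()
print(len(provenance))  # 2
```

Because every step writes its metadata as it runs, the gateway can later show exactly how each result was produced.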

Use Case: SPOT Suite Workflow

SPOT Suite Gateway

Tying It All Together [diagram]: the workflow software connects the compute system and its batch queue, databases and message queues, science gateways, and the global file system.

Finding the Right Hammer. Workflow tools have lots of features, but there is no one-size-fits-all. NERSC is building expertise in classes of workflow tools and will help guide you towards the right tool for your job. Consider stitching together a couple of different tools to make it all work.

Thank you.

Engagement. Enabling science in a scalable manner: build re-usable workflow components that can be used across domains, support 2 to 4 classes of workflow tools, create an ecosystem of services to enable new tools, and engage with domain-specific science to address specific needs. Each project will have its own requirements; bring those requirements to the table and we can evolve our ecosystem to meet your needs.