Big Data Challenges in Simulation-based Science




Big Data Challenges in Simulation-based Science
Manish Parashar, Rutgers Discovery Informatics Institute (RDI2), NSF Center for Cloud and Autonomic Computing (CAC), http://parashar.rutgers.edu
*With Hoang Bui, Tong Jin, Qian Sun, Fan Zhang and other ex-students and collaborators (ORNL, GT, UU, NCSU, SNL, LBNL, LLNL, ...)

Outline
- Data grand challenges in simulation-based science
- Rethinking the simulations -> insights pipeline
- The DataSpaces project
- Conclusion

Modern Science & Society Transformed by Compute & Data
New paradigms and practices in science and engineering: data-driven, data- and compute-intensive, inherently multi-disciplinary, collaborative (university, national, global).

Many Challenges: Computing, Data, Software, People
- Computing: Multicore; large and increasing core counts, deep memory hierarchies (TH-2: 54.9 PF, 3.12 M cores, 1.4 PB, 25 MW). New programming models and concerns (fault tolerance, energy, etc.). New models & technologies: clouds, grids, hybrid manycore, accelerators, deep storage hierarchies.
- Data: Generating more data than in all of human history: how do we preserve, mine, and share it? How do we create data scientists/engineers?
- Software: Complex applications on coupled compute-data-networked environments; tools needed. Modern apps: 10^6+ lines of code, many contributing groups, decades to develop, very long lifetimes.
- People: Multidisciplinary expertise essential! Appropriate academic programs and career tracks needed.

The Era of eScience and Big Data
Modern scientific networks, instruments, and experiments are clearly producing Big Data:
- The SKA (Square Kilometer Array) project will generate 1 EB of data per day in 2020.
- Climate data: 36 TB in 2004 -> 2 PB in 2012, and still increasing.
- Genome sequencing output is doubling every 9 months.
- The LHC will produce roughly 15 PB of data per year, plus many smaller datasets.
- LSST will have produced 100 PB of data by the end of this decade.
Figure. Bytes per day (gigabytes through exabytes) for genomics, climate/environment, LHC, LSST, and TeraGrid/XSEDE/Blue Waters, 2012 through 2020. Credit: R. Pennington/A. Blatecky.
But what about HPC?

Scientific Discovery through Simulations (I)
- ACI is providing increasing computational scales, capabilities, and capacities.
- Scientific simulations running on high-end computing systems generate huge amounts of data: if a single core produces 2 MB/minute on average, one of these machines could generate simulation data at ~170 TB per hour -> ~4 PB per day -> ~1.4 EB per year (see the back-of-envelope estimate below).
- Successful scientific discovery depends on a comprehensive understanding of this enormous simulation data.
- How do we enable computational scientists to efficiently manage and explore extreme-scale data, i.e., find the needles in the haystack?
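A rough back-of-envelope check of these rates, assuming a machine with on the order of 1.4 million cores (the slide does not say which machine; TH-2, cited earlier with 3.12 M cores, would roughly double these figures):

1.4 x 10^6 cores x 2 MB/min ≈ 2.8 TB/min ≈ 170 TB/hour; x 24 ≈ 4 PB/day; x 365 ≈ 1.5 EB/year.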

Scientific Discovery through Simulations (II)
- Complex workflows integrating coupled models, data management/processing, and analytics: tight/loose coupling, data-driven, ensembles.
- Advanced numerical methods (e.g., adaptive mesh refinement).
- Integrated (online) uncertainty quantification and analytics.
- Complex, heterogeneous components; large data volumes and data rates.
- Complex couplings, data redistribution, data transformations; dynamic data exchange patterns.
- Strict performance and energy constraints.

Traditional Simulation -> Insight Pipelines Break Down
Figure. Traditional data analysis pipeline.
The traditional simulation -> insight pipeline:
- Run large-scale simulation workflows on large supercomputers.
- Dump data on parallel disk systems.
- Export data to archives.
- Move data, usually selected subsets, to users' sites.
- Perform data manipulations and analysis on mid-size clusters.
- Collect experimental/observational data, move it to analysis sites, and compare it against simulation data for validation.

Challenges Faced by Traditional HPC Data Pipelines
Figure. Traditional data analysis pipeline.
- Scalable data analytics challenge: can current data mining, manipulation, and visualization algorithms still work effectively on extreme-scale machines?
- I/O and storage challenge: an increasing performance gap, as disks are outpaced by computing speed.
- Data movement challenge: lots of data movement between simulation and analysis, and between coupled multi-physics simulation components => longer latencies. Improving data locality is critical: do work where the data resides!
- Energy challenge: future extreme-scale systems are designed around low-power chips; however, much of the power consumption will be due to memory and data movement. The costs of data movement are increasing and dominating!

The Cost of Data Movement
- Moving data between node memory and persistent storage is slow (a growing performance gap).
- The energy cost of moving data is a significant concern (illustrated below):

    Energy_move_data = bitrate * length^2 / cross_section_area_of_wire

(From K. Yelick, "Software and Algorithms for Exascale: Ten Ways to Waste an Exascale Computer.")
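To make the scaling concrete, here is a minimal sketch that compares the relative energy of on-chip vs. off-chip data movement according to the relation above; the wire geometries are made-up values chosen only for illustration, not measurements.

#include <stdio.h>

/* Relative energy of moving data over a wire, per the relation above:
 * E ~ bitrate * length^2 / cross_section_area.
 * Absolute constants are omitted; only ratios are meaningful here. */
static double rel_energy(double bitrate, double length, double area)
{
    return bitrate * length * length / area;
}

int main(void)
{
    /* Hypothetical geometries (arbitrary units, illustration only). */
    double on_chip  = rel_energy(1.0, 0.001, 1.0);  /* ~1 mm on-chip wire  */
    double off_chip = rel_energy(1.0, 0.05,  1.0);  /* ~5 cm to DRAM/board */

    printf("off-chip / on-chip energy ratio: %.0fx\n", off_chip / on_chip);
    /* Length grows 50x, so energy grows 50^2 = 2500x at equal bitrate/area. */
    return 0;
}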

Challenges Faced by Traditional HPC Data Pipelines
Figure. Traditional data analysis pipeline.
The costs of data movement (power and performance) are increasing and dominating! We need to rethink the data management pipeline:
- Reduce data movement.
- Move computation/analytics closer to the data (a minimal in-situ hook is sketched after this list).
- Add value to simulation data along the I/O path.

Rethinking the Data Management Pipeline: Hybrid Staging + In-Situ & In-Transit Execution
Issues/Challenges:
- Programming abstractions/systems
- Mapping and scheduling
- Control and data flow
- Autonomic runtime
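As a minimal sketch of the in-situ idea (hypothetical function names; not the DataSpaces or ADIOS API), the simulation loop invokes analysis on in-memory data every few steps instead of writing every step to disk and post-processing later:

#include <stdio.h>
#include <stdlib.h>

#define N 1000000          /* grid points per process (illustrative)         */
#define ANALYSIS_EVERY 10  /* run in-situ analytics every 10th time step     */
#define OUTPUT_EVERY  400  /* full dump to persistent storage is rare/costly */

/* Hypothetical in-situ analytic: reduce the field to a scalar in place,
 * so only a few bytes (not the full field) ever leave the node. */
static double in_situ_min(const double *field, size_t n)
{
    double m = field[0];
    for (size_t i = 1; i < n; i++)
        if (field[i] < m) m = field[i];
    return m;
}

static void simulate_step(double *field, size_t n, int step)
{
    for (size_t i = 0; i < n; i++)           /* stand-in for the real solver */
        field[i] = (double)((i + step) % 97);
}

int main(void)
{
    double *field = malloc(N * sizeof *field);
    if (!field) return 1;

    for (int step = 0; step < 100; step++) {
        simulate_step(field, N, step);

        if (step % ANALYSIS_EVERY == 0)       /* analyze where the data lives */
            printf("step %d: min = %g\n", step, in_situ_min(field, N));

        if (step % OUTPUT_EVERY == 0) {
            /* write_checkpoint(field, N, step);  -- infrequent full dump */
        }
    }
    free(field);
    return 0;
}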

Design Space of Possible Workflow Architectures
- Location of the compute resources: same cores as the simulation (in situ); some (dedicated) cores on the same nodes; some dedicated nodes on the same machine; dedicated nodes on an external resource.
- Data access, placement, and persistence: direct access to simulation data structures; shared memory access via hand-off/copy; shared memory access via non-volatile near-node storage (NVRAM); data transfer to dedicated nodes or external resources.
- Synchronization and scheduling: execute synchronously with the simulation every nth simulation time step, or execute asynchronously.
(A small configuration sketch enumerating these options follows below.)
Figure. Placement options for analysis and visualization tasks: sharing cores with the simulation, using distinct cores on the same node (DRAM/NVRAM/SSD/hard disk), or processing data on remote staging nodes over the network (staging options 1-3).

DataSpaces: In-situ/In-transit Data Management & Analytics (ADIOS/DataSpaces, dataspaces.org)
- Virtual shared-space programming abstraction; simple API for coordination, interaction, and messaging.
- Distributed, associative, in-memory object store; online data indexing, flexible querying.
- Adaptive cross-layer runtime management; hybrid in-situ/in-transit execution.
- Efficient, high-throughput/low-latency asynchronous data transport.
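The three axes above can be captured as a small configuration structure; a minimal sketch with hypothetical type and field names (not part of DataSpaces or ADIOS):

#include <stdio.h>

/* One point in the workflow-architecture design space described above. */
enum placement   { IN_SITU_SAME_CORES, DEDICATED_CORES_SAME_NODE,
                   DEDICATED_NODES_SAME_MACHINE, EXTERNAL_RESOURCE };
enum data_access { DIRECT_ACCESS, SHARED_MEM_COPY, NEAR_NODE_NVRAM,
                   TRANSFER_TO_STAGING };
enum scheduling  { SYNC_EVERY_NTH_STEP, ASYNCHRONOUS };

struct analysis_config {
    enum placement   where;    /* location of the analysis compute resources */
    enum data_access access;   /* how analysis reaches the simulation data   */
    enum scheduling  when;     /* coupling to the simulation time loop       */
    int              nth_step; /* used only for SYNC_EVERY_NTH_STEP          */
};

int main(void)
{
    /* Example: in-transit analysis on dedicated staging nodes, fed
     * asynchronously, roughly the DataSpaces-style configuration. */
    struct analysis_config cfg = {
        .where = DEDICATED_NODES_SAME_MACHINE,
        .access = TRANSFER_TO_STAGING,
        .when = ASYNCHRONOUS,
        .nth_step = 0
    };
    printf("placement=%d access=%d scheduling=%d\n",
           cfg.where, cfg.access, cfg.when);
    return 0;
}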

DataSpaces: A Scalable Shared Space Abstraction for Hybrid Data Staging [HPDC10, JCC12]
- Virtual shared-space abstraction; simple API for coordination, interaction, and messaging; provides a global-view programming abstraction consistent with PGAS.
- Distributed, associative, in-memory object store; online data indexing, flexible querying.
- Adaptive cross-layer runtime management; hybrid in-situ/in-transit execution; data-centric mappings.
- High-throughput/low-latency memory-to-memory asynchronous data transport.
- Supports dynamic coordination and interaction patterns between the coupled applications, transparent data redistribution, complex geometry-based queries, and in-space (online) data transformation and manipulation.

DataSpaces Abstraction: Indexing + DHT
- Dynamically constructed structured overlay using hybrid staging cores.
- Index constructed online using SFC mappings and application attributes, e.g., application domain, data, field values of interest, etc.
- DHT used to maintain metadata information, e.g., geometric descriptors for the shared data, FastBit indices, etc.
- Data objects load-balanced separately across staging cores: the SFC maps the global domain to a set of intervals; these intervals are non-contiguous and can lead to load imbalance, so a second mapping compacts them into a contiguous virtual interval, which is then split equally across the DataSpaces nodes (see the sketch below).
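A minimal sketch of this two-level mapping, using a Morton (Z-order) curve as a stand-in for the space-filling curve; the actual SFC, key layout, and data structures used by DataSpaces may differ.

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

/* Interleave the bits of (x, y) into a Morton (Z-order) index, a simple
 * stand-in for the space-filling curve used for indexing. */
static uint32_t morton2d(uint16_t x, uint16_t y)
{
    uint32_t z = 0;
    for (int b = 0; b < 16; b++) {
        z |= (uint32_t)((x >> b) & 1u) << (2 * b);
        z |= (uint32_t)((y >> b) & 1u) << (2 * b + 1);
    }
    return z;
}

static int cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    enum { NX = 8, NY = 8, NSERVERS = 4 };
    uint32_t keys[NX * NY];
    int n = 0;

    /* Step 1: SFC mapping -- cells of an 8x8 subdomain map to Morton keys.
     * In general the occupied keys form non-contiguous intervals of the curve. */
    for (int i = 0; i < NX; i++)
        for (int j = 0; j < NY; j++)
            keys[n++] = morton2d((uint16_t)i, (uint16_t)j);

    /* Step 2: compact -- sort the occupied keys; their rank in this sorted
     * order is their position on a contiguous "virtual" interval. */
    qsort(keys, (size_t)n, sizeof keys[0], cmp_u32);

    /* Step 3: split the contiguous virtual interval equally across servers,
     * so each staging server holds (almost) the same number of objects. */
    for (int r = 0; r < n; r++) {
        int server = (int)((long)r * NSERVERS / n);
        if (r % 16 == 0)
            printf("key 0x%08x (rank %2d) -> server %d\n",
                   (unsigned)keys[r], r, server);
    }
    return 0;
}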

DataSpaces: Scalability on ORNL Titan
Figure. Data redistribution time (seconds) and aggregate data redistribution throughput (GB/s) for total data sizes from 2 GB to 256 GB and writer/reader core counts from 512/32 up to 64K/4K.
- Weak scaling with increasing numbers of processors; the applications redistribute data using DataSpaces.
- Application 1 runs on M processors and inserts data into the space; Application 2 runs on N processors and retrieves data from the space.
- Result: a 128-fold increase in application size, from 512 to 64K writers, while the total data size exchanged per step increases from 2 GB to 256 GB.

DataSpaces: Enabling Coupled Scientific Workflows at Extreme Scales
- Multiphysics code coupling at extreme scales [CCGrid10]
- Data-centric mappings for in-situ workflows [IPDPS12]
- PGAS extensions for code coupling [CCPE13]
- Dynamic code deployment in-staging [IPDPS11]
Figure. Dynamic code deployment in staging: a data kernel such as

    kernel_min {
      for i = 1, n
        for j = 1, m
          for k = 1, p
            if (min > A(i, j, k)) min = A(i, j, k)
    }

is compiled (gcc -c data_kernels.c) and linked into the application executable, which is launched with aprun on the compute nodes; at runtime the kernel is shipped via rexec() to the runtime execution system (Rexec) on the staging nodes and executed there, e.g., as the Lua kernel data_kernels.lua:

    for i = 0, ni-1 do
      for j = 0, nj-1 do
        for k = 0, nk-1 do
          val = input:get_val(i,j,k)
          if min > val then min = val end
        end
      end
    end

Applications Enabled
- Combustion: S3D uses a chemical model to simulate the combustion process in order to understand the ignition characteristics of bio-fuels for automotive use [SC 2012].
- Fusion: the EPSi simulation workflow provides insight into edge plasma physics in magnetic fusion devices by simulating the movement of millions of particles [CCGrid 2010].
- Subsurface modeling: IPARS uses multi-physics, multi-model formulations and coupled multi-block grids to support oil reservoir simulation of realistic, high-resolution reservoir studies with a million or more grids [CCGrid 2011].
- Computational mechanics: FEM coupled with AMR is often used to solve heat transfer or fluid dynamics problems; AMR only performs refinement on important regions of the mesh [SC 2014 (in progress)].
- Materials science: the SNS workflow enables integration of data, analysis, and modeling tools for materials science experiments to accelerate scientific discoveries, which requires beam-line data query capability.
- Materials science: QMCPack applies quantum Monte Carlo (QMC) to studies of heterogeneous catalysis on transition metal nanoparticles, phase transitions, properties of materials under pressure, and strongly correlated materials.

Workflows beyond Simulations
- Content-based image retrieval (CBIR) on digitized peripheral blood smear specimens using low-level morphological features (MICAII 2012, CAC 2013, MICAII 2013).
- Asynchronous replica exchange: monitor the progress of replicas using secondary structure prediction methods (J. Comput. Chem. 2008).
- Control of fluid flow in micro-devices by placing a sequence of pillar obstacles (MTAGS 2013, CISE 2013).
- Monte Carlo numerical strategies to unravel how nucleosomes shape DNA into chromatin (Biophysical J. 2014).

In-Situ Feature Extraction and Tracking using Decentralized Online Clustering (DISC 12, ICAC 10)
- DOC workers are executed in situ on the simulation machine.
Figure. DOC overlay across the simulation compute nodes; on each compute node, some processor cores run the simulation and one core runs a DOC worker.
- Benefits of runtime feature extraction and tracking: (1) scientists can follow the events (or data) of interest; (2) scientists can do real-time monitoring of the running simulations.

In-Situ Visualization and Monitoring with Staging
Figure. Pixie3D (1024 cores) and Pixplot (8 cores) exchange pixie3d.bp/record.bp data through DataSpaces with a ParaView server (4 cores) and Pixmon (1 core, on a login node).

AMR-as-a-Service using DataSpaces
Figure. The FEM code (TalyFEM) and the AMR service each exchange Grid/GridField (pgrid/pgridfield) objects through DataSpaces.
- Finite Element Model (FEM): models using uniform 3D meshes for near-realistic engineering problems such as heat transfer, fluid flow, and phase transformation.
- AMR service: refines localized areas of interest, e.g., regions with high local error.
- DataSpaces-as-a-Service: persistent staging servers run as a service on the HEC platform; applications can connect, exchange data objects, and disconnect independently (see the lifecycle sketch below).

Overview of Recent DataSpaces Research
Programming abstractions / systems:
- DataSpaces: interaction, coordination, and messaging abstractions for coupled scientific workflows [HPDC10, CCGrid10, HiPC12, JCC12]
- XpressSpaces: PGAS extensions for coupling using DataSpaces [CCGrid11, CCPE13]
- ActiveSpaces: dynamic code deployment for in-staging data processing [IPDPS11]
Runtime mechanisms:
- Data-centric task mapping: reduce data movement and increase intra-node data sharing [IPDPS12, DISC12]
- In-situ & in-transit data analytics: simulation-time analysis of large-volume data by combining in-situ and in-transit execution [SC12, DISC12]
- Cross-layer adaptation: adaptive cross-layer approach for dynamic data management in large-scale simulation-analysis workflows [SC13]
- Value-based data indexing and querying: use FastBit to build in-situ, in-memory value-based indexing and query support in the staging area [in review]
- Power/performance tradeoffs: characterizing power/performance tradeoffs for data-intensive simulation workflows [SC13]
- Data staging over deep memory hierarchies: build a distributed associative object store over hierarchical memory/storage, e.g., DRAM/NVRAM/SSD [HiPC13]
High-throughput/low-latency asynchronous data transport:
- DART: network-independent transport library for high-speed asynchronous data extraction and transfer [HPDC08]
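A minimal sketch of the connect/exchange/disconnect lifecycle implied by the service model; the function names and types below are hypothetical placeholders (local stubs so the sketch is self-contained), not the actual DataSpaces client API.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical staging-service client (placeholder names). A real client
 * would talk to remote, persistent staging servers instead of these stubs. */
struct staging_client { char addr[64]; };

static struct staging_client *staging_connect(const char *service_addr)
{
    struct staging_client *c = malloc(sizeof *c);
    if (c) { strncpy(c->addr, service_addr, sizeof c->addr - 1); c->addr[63] = 0; }
    return c;
}

static int staging_put(struct staging_client *c, const char *var, int version,
                       const uint64_t lb[3], const uint64_t ub[3], const void *data)
{
    (void)data;  /* a real client would ship the bytes to the staging servers */
    printf("put %s v%d box [%llu..%llu]^3 -> %s\n", var, version,
           (unsigned long long)lb[0], (unsigned long long)ub[0], c->addr);
    return 0;
}

static void staging_disconnect(struct staging_client *c) { free(c); }

int main(void)
{
    /* Each application connects to the persistent staging service on its own,
     * exchanges named, versioned objects described by a bounding box, and
     * disconnects -- the staging servers outlive any single client. */
    struct staging_client *c = staging_connect("staging-service:port");
    if (!c) return 1;

    double temperature[16 * 16 * 16] = {0};
    uint64_t lb[3] = {0, 0, 0}, ub[3] = {15, 15, 15};

    staging_put(c, "temperature", /*version=*/0, lb, ub, temperature);
    staging_disconnect(c);
    return 0;
}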

Integrating In-Situ and In-Transit Analytics (SC 12)
- S3D: first-principles direct numerical simulation.
- The simulation resolves features on the order of 10 simulation time steps, but currently only on the order of every 400th time step can be written to disk, so temporal fidelity is compromised when analysis is done as a post-process.
Figure. Recent data sets generated by S3D, developed at the Combustion Research Facility, Sandia National Laboratories.

In-Situ Topological Analysis as Part of S3D*
- Statistics, volume rendering, topology: identify features of interest.
*J. C. Bennett et al., "Combining In-Situ and In-Transit Processing to Enable Extreme-Scale Scientific Analysis," SC 12, Salt Lake City, Utah, November 2012.

Integrating In-Situ and In-Transit Analytics (SC 12)
- Primary resources execute the main simulation and the in-situ computations (see the partitioning sketch below).
- Secondary resources provide a staging area whose cores act as containers for in-transit computations.
- 4896 cores total (4480 simulation/in-situ; 256 in-transit; 160 task scheduling/data movement); simulation size: 1600x1372x430; all measurements are per simulation time step.

Simulation case study with S3D: timing results for 4896 cores and analysis every 10th simulation time step
Figure. Timing breakdown (bars split into in-situ, data movement, and in-transit components): simulation 168.5 s and hybrid topology 119.81 s; values of 1.0%, 1.0%, 0.43%, 0.004%, and 1.61% are annotated for in-situ statistics, hybrid statistics, in-situ visualization, hybrid visualization, and hybrid topology; x-axis 0-160 seconds.
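A minimal sketch of how one allocation might be partitioned into primary (simulation + in-situ) and secondary (in-transit staging) resources using MPI; this is illustrative only, and the actual S3D/DataSpaces infrastructure is more involved (e.g., it also reserves cores for scheduling and data movement).

#include <mpi.h>
#include <stdio.h>

/* Split one job allocation into primary resources (simulation + in-situ
 * analysis) and secondary resources (staging area for in-transit analysis),
 * in the spirit of the 4480/256/160 partitioning described above. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Illustrative split: the last 1/16 of the ranks become staging cores. */
    int n_staging = size / 16 > 0 ? size / 16 : 1;
    int is_staging = rank >= size - n_staging;

    MPI_Comm role_comm;  /* primary ranks in one comm, staging in another */
    MPI_Comm_split(MPI_COMM_WORLD, is_staging, rank, &role_comm);

    int role_rank, role_size;
    MPI_Comm_rank(role_comm, &role_rank);
    MPI_Comm_size(role_comm, &role_size);

    if (is_staging) {
        /* Secondary resources: receive staged data, run in-transit analysis. */
        printf("rank %d: staging core %d of %d\n", rank, role_rank, role_size);
    } else {
        /* Primary resources: run the simulation and in-situ analysis. */
        printf("rank %d: simulation core %d of %d\n", rank, role_rank, role_size);
    }

    MPI_Comm_free(&role_comm);
    MPI_Finalize();
    return 0;
}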

Simulation case study with S3D: timing results for 4896 cores and analysis every 100th simulation time step
Figure. Timing breakdown (bars split into in-situ, data movement, and in-transit components): simulation 1685 s; values of 0.1%, 0.1%, 0.04%, 0.004%, and 0.16% are annotated for in-situ statistics, hybrid statistics, in-situ visualization, hybrid visualization, and hybrid topology; x-axis 0-1600 seconds.

Challenges & Opportunities Ahead: The Vision of the Exascale Computing Initiative
- Achieve on the order of 10^18 operations per second and 10^18 bytes of storage.
- Address the next generation of scientific, engineering, and large-data problems.
- Enable extreme-scale computing: 1,000x the capabilities of today's computers with a similar size and power footprint; 20 pJ per average operation => roughly a 40x improvement over today's efficiency (see the power arithmetic below); billion-way concurrency.
- Productive system based on marketable technologies; ecosystem for application development and execution, and for system management.
- Deployed in the early 2020s.
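As a back-of-envelope sanity check on the power target (my estimate, not a figure from the slide): at 20 pJ per operation, an exaop machine draws

10^18 ops/s x 20 x 10^-12 J/op = 2 x 10^7 W = 20 MW,

which is comparable to the ~25 MW footprint of today's largest systems (e.g., TH-2 cited earlier), consistent with the goal of ~1,000x the capability at a similar power footprint.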

Challenges & Opportunities Ahead: Validation Workflows
Figure. A validation workflow coupling the first-principles fusion code XGC (core-edge turbulence with UQ), an ensemble of models with UQ, a knowledge database, synthetic diagnostics, visual analytics, and KSTAR tokamak diagnostics, with analysis, comparative analytics, visualization, and experiment mock-up.
- Use first-principles calculations on HPC to generate more accurate models.
- Automate these workflows, move work/data to satisfy metrics specified by users, and track provenance.
- Generate a database of knowledge from the codes, to provide predictive capability during/after each experiment.

Challenges & Opportunities Ahead: Pervasive Computational Ecosystems

Summary & Conclusions
- Complex applications running on high-end systems generate extreme amounts of data that must be managed and analyzed to get insights.
- Data costs (performance, latency, energy) are quickly becoming dominant; traditional data management/analytics pipelines are breaking down.
- Hybrid data staging, in-situ workflow execution, etc. can address these challenges, letting users efficiently intertwine applications, libraries, and middleware for complex analytics.
- Many challenges remain: programming, mapping and scheduling, control and data flow, autonomic runtime management. The DataSpaces project explores solutions at various levels.
- However, many challenges and opportunities lie ahead!

Thank You!
EPSi: Edge Physics Simulation
Manish Parashar, Ph.D.
Rutgers Discovery Informatics Institute (RDI2), Cloud & Autonomic Computing Center (CAC), Rutgers, The State University of New Jersey
Email: parashar@rutgers.edu
WWW: rdi2.rutgers.edu
WWW: dataspaces.org