Big Data Challenges in Simulation-based Science



Big Data Challenges in Simulation-based Science
Manish Parashar, Rutgers Discovery Informatics Institute (RDI2), Department of Computer Science, Rutgers University
With Hoang Bui, Tong Jin, Qian Sun, Fan Zhang, the ADIOS team, and other former students and collaborators.

Outline
- Data grand challenges
- Data challenges of simulation-based science
- Rethinking the simulations-to-insights pipeline
- The DataSpaces project
- Conclusion

Data-driven Discovery in Science
Nearly every field of discovery is transitioning from data-poor to data-rich:
- Oceanography: OOI
- Astronomy: LSST
- Physics: LHC
- Biology: sequencing
- Sociology: the Web
- Neuroscience: EEG, fMRI
- Economics: POS terminals
(Credit: Ed Lazowska, University of Washington)

Scientific Discovery through Simulations
Scientific simulations running on high-end computing systems generate huge amounts of data. If a single core produces 2 MB/minute on average, one of these machines could generate simulation data at ~170 TB per hour -> ~4 PB per day -> ~1.4 EB per year. Successful scientific discovery depends on a comprehensive understanding of this enormous simulation data. How do we enable computational scientists to efficiently manage and explore extreme-scale data, i.e., find the needles in the haystack?
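The slide's rates can be checked with back-of-envelope arithmetic. The machine size below (~1.5 million cores, roughly a Sequoia-class system) is an assumption chosen to match the stated hourly rate:

```python
# Back-of-envelope check of the slide's data rates.
# Assumed machine size: ~1.5 million cores (not stated on the slide).
MB = 1e6
cores = 1.5e6
rate_per_core = 2 * MB / 60           # 2 MB/minute, in bytes/second
machine_rate = cores * rate_per_core  # aggregate bytes/second

per_hour = machine_rate * 3600 / 1e12         # terabytes
per_day = machine_rate * 86400 / 1e15         # petabytes
per_year = machine_rate * 86400 * 365 / 1e18  # exabytes
print(f"{per_hour:.0f} TB/hour, {per_day:.1f} PB/day, {per_year:.2f} EB/year")
# -> 180 TB/hour, 4.3 PB/day, 1.58 EB/year
```

Under that assumption the hourly, daily, and yearly figures are mutually consistent.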

Scientific Discovery through Simulations
Complex workflows integrating coupled models, data management/processing, and analytics:
- Tight/loose coupling, data-driven execution, ensembles
- Advanced numerical methods (e.g., adaptive mesh refinement)
- Integrated (online) analytics, uncertainty quantification, etc.
- Complex, heterogeneous components
- Large data volumes and data rates
- Data redistribution (MxNxP), data transformations
- Dynamic data exchange patterns
- Strict performance/overhead constraints

Traditional Simulation-to-Insight Pipelines Break Down
[Figure: traditional data analysis pipeline — simulation machines -> raw data on storage servers -> analysis/visualization cluster]
The traditional simulation-to-insight pipeline:
- Run large-scale simulation workflows on large supercomputers
- Dump data to parallel disk systems and export it to archives
- Move data (usually selected subsets) to users' sites
- Perform data manipulation and analysis on mid-size clusters
- Collect experimental/observational data and move it to analysis sites
- Compare experimental/observational data against simulation data for validation

Challenges Faced by Traditional HPC Data Pipelines
- Data analysis challenge: can current data mining, manipulation, and visualization algorithms still work effectively on extreme-scale machines?
- I/O and storage challenge: an increasing performance gap, as disks are outpaced by computing speed
- Data movement challenge: lots of data movement between simulation and analysis machines, and between coupled multi-physics simulation components, leads to longer latencies. Improving data locality is critical: do the work where the data resides!
- Energy challenge: future extreme-scale systems are designed to have low-power chips; however, much greater power consumption will come from memory and data movement. The costs of data movement are increasing and dominating!

The Cost of Data Movement
Moving data between node memory and persistent storage is slow (the performance gap), and the energy cost of moving data is a significant concern:
Energy_move_data = bitrate * length^2 / cross_section_area_of_wire
From K. Yelick, "Software and Algorithms for Exascale: Ten Ways to Waste an Exascale Computer."

Rethinking the Data Management Pipeline
Hybrid staging + in-situ and in-transit execution. Issues/challenges:
- Programming abstractions/systems
- Mapping and scheduling
- Control and data flow
- Autonomic runtime
- Linking with external workflow systems (GLEAN, Darshan, FlexPath, etc.)

Design Space of Possible Workflow Architectures
Location of the compute resources:
- Same cores as the simulation (in-situ)
- Some (dedicated) cores on the same nodes
- Some dedicated nodes on the same machine
- Dedicated nodes on an external resource
Data access, placement, and persistence:
- Direct access to simulation data structures
- Shared-memory access via hand-off/copy
- Shared-memory access via non-volatile near-node storage (NVRAM)
- Data transfer to dedicated nodes or external resources
Synchronization and scheduling:
- Execute synchronously with the simulation every n-th simulation time step
- Execute asynchronously
[Figure: staging options — analysis tasks sharing cores with the simulation, using distinct cores on the same node (DRAM/NVRAM/SSD), or processing data on remote staging nodes across the network]

DataSpaces: A Scalable Shared Space Abstraction for Hybrid Data Staging [JCC12]
A shared-space programming abstraction over hybrid staging:
- Simple API for coordination, interaction, and messaging
- Provides a global-view programming abstraction
- Distributed, associative object store in (deep) memory
- Online data indexing, flexible querying
- Exposed as a persistent service
- Autonomic cross-layer runtime management
- High-throughput/low-latency asynchronous data transport
[Figure: shared-space model — applications interact through put()/pub() and get()/sub() on data (d) and metadata (m) objects; aggregate data-redistribution throughput results show DataSpaces scalability on ORNL Titan]
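The put()/get() interaction on the slide can be illustrated with a toy in-process sketch. This is illustrative only: the real DataSpaces API is a distributed C library, and the class, method, and argument names below are assumptions, not its actual interface.

```python
# Toy sketch of the shared-space put/get abstraction (illustrative only;
# the real DataSpaces C API and its argument names differ).
class SharedSpace:
    def __init__(self):
        self._store = {}  # (name, version, lb, ub) -> data

    def put(self, name, version, lb, ub, data):
        # A writer inserts the region [lb, ub] of a named, versioned array.
        self._store[(name, version, lb, ub)] = data

    def get(self, name, version, lb, ub):
        # A reader queries by name/version/region. Real DataSpaces resolves
        # overlapping regions through its distributed index; this toy only
        # matches exact keys.
        return self._store.get((name, version, lb, ub))

space = SharedSpace()
space.put("temperature", 0, (0, 0), (3, 3), [[1.0] * 4 for _ in range(4)])
region = space.get("temperature", 0, (0, 0), (3, 3))
print(region[0][0])  # -> 1.0
```

The key idea the sketch preserves is decoupling: writer and reader never exchange messages directly, only named, versioned regions through the space.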

DataSpaces Abstraction: Indexing + DHT [HPDC10]
- Dynamically constructed structured overlay using hybrid staging cores
- Index constructed online using space-filling curve (SFC) mappings and application attributes (e.g., application domain, data, field values of interest)
- DHT used to maintain metadata information (e.g., geometric descriptors for the shared data, FastBit indices)
- Data objects load-balanced separately across staging cores
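A minimal example of the SFC mapping idea: a Z-order (Morton) curve linearizes multi-dimensional coordinates into one-dimensional keys while roughly preserving locality. This sketch is a generic illustration of the technique, not the specific curve or index construction DataSpaces uses.

```python
# Minimal Z-order (Morton) space-filling-curve mapping for 2-D
# coordinates, the kind of linearization an SFC-based index builds on.
def morton2d(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of x and y into a single 1-D key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # even bit positions <- x
        key |= ((y >> i) & 1) << (2 * i + 1)  # odd bit positions  <- y
    return key

# Nearby points in 2-D tend to get nearby 1-D keys, so range-partitioning
# the keys across staging cores keeps spatial neighborhoods together.
print([morton2d(0, 0), morton2d(1, 0), morton2d(0, 1), morton2d(1, 1)])
# -> [0, 1, 2, 3]
```

Hashing these keys into a DHT then gives any client a way to locate the staging core responsible for a spatial region without central coordination.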

DataSpaces: Enabling Coupled Scientific Workflows at Extreme Scales
- Multiphysics code coupling at extreme scales [CCGrid10]
- PGAS extensions for code coupling [CCPE13]
- Data-centric mappings for in-situ workflows [IPDPS12]
- Dynamic code deployment in staging [IPDPS11]
[Figure: compute nodes run the simulations and in-situ analytics over DART; a pool of computation "buckets" in the staging area runs in-transit analytics. A resource scheduler assigns tasks and data descriptors on "bucket-ready" and "data-ready" events, with DataSpaces mediating data interaction and coordination. For dynamic code deployment, a data kernel (e.g., a kernel_min loop scanning A(i, j, k) for its minimum) is compiled with gcc -c, linked into the application executable, shipped to staging nodes, and loaded by the runtime execution system (Rexec); the same kernel can also be expressed as an equivalent Lua script, data_kernels.lua]

Integrating In-situ and In-transit Analytics [SC12]
S3D: a first-principles direct numerical simulation, developed at the Combustion Research Facility, Sandia National Laboratories.
- The simulation resolves features on the order of 10 simulation time steps
- Currently, only on the order of every 400th time step can be written to disk
- Temporal fidelity is compromised when analysis is done as a post-process

Integrating In-situ and In-transit Analytics: Design
Couple concurrent data analysis with the simulation: statistics, volume rendering, and topology; identify features of interest.
*J. C. Bennett et al., "Combining In-Situ and In-Transit Processing to Enable Extreme-Scale Scientific Analysis," SC'12, Salt Lake City, Utah, November 2012.

In-situ/In-transit Data Analytics
Online data analytics for large-scale simulation workflows:
- Combine in-situ and in-transit placement to reduce the impact on the simulation, network data movement, and total execution time
- Adaptive runtime management: optimize the performance of the simulation-analysis workflow through adaptive data and analysis placement and data-centric task mapping
[Figure: overview of the in-situ/in-transit data analysis framework — the simulation performs in-situ processing on primary resources and asynchronously transfers shared data structures to in-transit task executors on secondary resources, which pull task and data descriptors from a workflow manager on "data ready" events; also, system architecture for cross-layer runtime adaptation]
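The adaptive-placement idea can be sketched as a cost comparison: run a task in-situ when stalling the simulation for it is cheaper than shipping its input to staging resources. The cost model and parameter names below are illustrative assumptions, not the actual policy of the framework.

```python
# Hedged sketch of an adaptive placement decision (illustrative cost
# model, not the paper's actual policy).
def choose_placement(task_seconds: float, data_bytes: float,
                     net_bytes_per_sec: float) -> str:
    """Pick where to run one analysis task.

    In-situ blocks the simulation for the task's full runtime; in-transit
    only costs the simulation the time to push the input off-node.
    """
    transfer_seconds = data_bytes / net_bytes_per_sec
    return "in-situ" if task_seconds < transfer_seconds else "in-transit"

print(choose_placement(0.5, 8e9, 5e9))   # cheap task, large input
print(choose_placement(30.0, 1e8, 5e9))  # heavy task, small input
```

Even this crude rule captures the slide's point: the right placement depends on both the task and the data, so it must be decided at runtime rather than fixed in the workflow.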

Simulation case study with S3D: timing results for 4,896 cores (legend: in-situ time / data movement / in-transit time; percentages are overhead relative to the simulation).

Analysis every simulation time step (simulation: 16.85 s):
- in-situ statistics: 1.69 s (10.0%)
- hybrid statistics: 1.7 s (10.1%)
- in-situ visualization: 0.73 s (4.3%)
- hybrid visualization: 5.07 s (0.04%)
- hybrid topology: 2.72 s in-situ + 2.06 s data movement, 119.81 s in-transit (16.1%)

Analysis every 10th simulation time step (simulation: 168.5 s): overheads drop to 1.0%, 1.0%, 0.43%, 0.004%, and 1.61% respectively (hybrid topology in-transit: 119.81 s).

Analysis every 100th simulation time step (simulation: 1,685 s): overheads drop further to 0.1%, 0.1%, 0.04%, 0.004%, and 0.16%.

In-Situ Feature Extraction and Tracking using Decentralized Online Clustering (DISC12)
DOC workers are executed in-situ on the simulation machines: on each compute node, most processor cores run the simulation while one core runs a DOC worker, and the workers form a DOC overlay.
Benefits of runtime feature extraction and tracking: (1) scientists can follow the events (or data) of interest; (2) scientists can monitor the running simulations in real time.

In-situ Visualization and Monitoring with Staging (SC12)
[Figure: Pixie3D (1,024 cores) streams pixie3d.bp through DataSpaces to Pixplot (8 cores) and a ParaView server (4 cores), which produce record.bp; Pixmon runs on 1 core on the login node]

Scalable Online Data Indexing and Querying (BDAC14, CCPE15)
Enable query-driven data analytics to extract insights from the large volumes of scientific simulation data.
[Figure: conceptual view of programmable online indexing and querying using multiple indices — coordinate-based queries served by Hilbert and Z-order SFC indices, value-based queries served by bitmap (FastBit) and tree indices, hashed across staging nodes N1..N4; value-based online index and query framework using FastBit]

Multi-tiered Data Staging with Autonomic Data Placement Adaptation [IPDPS15]
- A multi-tiered data staging approach that uses both DRAM and SSD to support data-intensive simulation workflows with large data coupling requirements
- An efficient application-aware data placement mechanism that leverages user-provided data access pattern information
[Figure: conceptual framework for realizing multi-tiered staging for coupled data-intensive simulation workflows]

Wide-area Staging (demo at SC14)
- A wide-area shared staging abstraction achieved by interconnecting multiple staging instances
- Leverages SDN for provisioning, plus workflow/grid technologies
- Data placement optimized based on demand, availability, locality, etc.
[Figure: applications put()/pub() and get()/sub() through ADIOS/DataSpaces instances coordinated by a wide-area control node over TCP/IP, with bulk data moved between Tx/Rx nodes via RDMA and GridFTP]

Applications:
- Combustion: S3D uses chemical models to simulate the combustion process in order to understand the ignition characteristics of biofuels for automotive usage. Publication: SC 2012.
- Fusion: the EPSi simulation workflow provides insight into edge plasma physics in magnetic fusion devices by simulating the movement of millions of particles. Publications: CCGrid 2010, Cluster 2014.
- Subsurface modeling: IPARS uses multi-physics, multi-model formulations and coupled multi-block grids to support oil reservoir simulation for realistic, high-resolution reservoir studies with a million or more grids. Publication: CCGrid 2011.
- Computational mechanics: FEM coupled with AMR is often used to solve heat transfer or fluid dynamics problems; AMR only performs refinements on important regions of the mesh. (Work in progress.)
- Materials science: the SNS workflow enables integration of data, analysis, and modeling tools for materials science experiments to accelerate scientific discoveries, which requires beam-line data query capabilities. (Work in progress.)
- Materials science: QMCPack applies quantum Monte Carlo (QMC) studies to heterogeneous catalysis of transition metal nanoparticles, phase transitions, properties of materials under pressure, and strongly correlated materials. (Work in progress.)

Summary & Conclusions
- Complex applications running on high-end systems generate extreme amounts of data that must be managed and analyzed to gain insights
- Data costs (performance, latency, energy) are quickly becoming dominant
- Traditional data management/analytics pipelines are breaking down
- Hybrid data staging, in-situ workflow execution, etc. can address these challenges, enabling users to efficiently intertwine applications, libraries, and middleware for complex analytics
- Many challenges remain: programming, mapping and scheduling, control and data flow, autonomic runtime management
- The DataSpaces project explores solutions at these various levels

Thank You!
Manish Parashar
Email: parashar@rutgers.edu
WWW: parashar.rutgers.edu
WWW: dataspaces.org
[Image: EPSi Edge Physics Simulation]

DataSpaces Team @ RU