
Large Vector-Field Visualization, Theory and Practice: Large Data and Parallel Visualization Hank Childs + D. Pugmire, D. Camp, C. Garth, G. Weber, S. Ahern, & K. Joy Lawrence Berkeley National Laboratory / University of California at Davis October 25, 2010

Outline: Motivation; Parallelization strategies; Master-slave parallelization; Hybrid parallelism

Supercomputers are generating large data sets that often require parallelized postprocessing. Example: a 217-pin reactor cooling simulation on a 1-billion-element unstructured mesh (Nek5000 simulation on ¼ of Argonne's BG/P). Image credit: Paul Fischer, using VisIt.

Communication between channels is a key factor in effective cooling.

Particle advection can be used to study communication properties.
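To make the operation concrete, here is a minimal sketch of advecting one particle through a steady vector field. The talk does not prescribe an integrator, so the fourth-order Runge-Kutta scheme and the toy rotational field below are assumptions for illustration only.

    import numpy as np

    def advect_rk4(p, velocity, dt, n_steps):
        # Trace one particle through a steady vector field using RK4.
        # `velocity` is any callable mapping a 3D position to a 3D velocity.
        path = [p.copy()]
        for _ in range(n_steps):
            k1 = velocity(p)
            k2 = velocity(p + 0.5 * dt * k1)
            k3 = velocity(p + 0.5 * dt * k2)
            k4 = velocity(p + dt * k3)
            p = p + (dt / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
            path.append(p.copy())
        return np.array(path)

    # Toy rotational field standing in for real simulation data.
    field = lambda q: np.array([-q[1], q[0], 0.1])
    streamline = advect_rk4(np.array([1.0, 0.0, 0.0]), field, dt=0.01, n_steps=1000)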

This sort of analysis requires many particles to be statistically significant. How can we parallelize this process?
- Place thousands of particles in one channel.
- Observe which channels the particles pass through.
- Observe where particles come out (compare with experimental data).
- Repeat for other channels.

Outline: Motivation; Parallelization strategies; Master-slave parallelization; Hybrid parallelism

For embarrassingly parallel algorithms, the most common processing technique is pure parallelism: each processor runs the same parallelized visualization data flow network (Read, Process, Render) on its own pieces of data. [Figure: a parallel simulation code writes data pieces P0-P9 to disk; processors 0-2 each run a Read-Process-Render pipeline.] Particle advection is not embarrassingly parallel. So how do we parallelize it? A: it depends.

Particle advection has four dimensions of complexity: data set size vs. seed set size vs. seed set distribution vs. vector field complexity.

Do we need parallel processing? When? How complex?
- Data set size? Not enough!
- Large #'s of particles?

Parallelization for small data and a large number of particles: this scheme is referred to as parallelizing-over-particles. Each processor reads the data and runs its own Read-Advect-Render pipeline on its share of the particles; the key is that the data is small enough that it can fit in memory. GPU-accelerated approaches follow a variant of this model. [Figure: a simulation code writes one file; processors 0-2 each read it and run Read-Advect-Render.]
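A sketch of how parallelize-over-particles might look in practice, using mpi4py for illustration. The loader and seed generator names are hypothetical placeholders (not VisIt APIs), and advect_rk4 is the integrator sketched earlier.

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    # Every rank reads the entire (small) data set; it fits in memory.
    field = load_vector_field("channel.nek5000")   # hypothetical loader

    # Statically split the global seed set across ranks; after this,
    # no communication is needed until results are gathered.
    seeds = make_seeds(100_000)                    # hypothetical seed generator
    my_seeds = np.array_split(seeds, size)[rank]

    my_lines = [advect_rk4(s, field, dt=0.01, n_steps=1000) for s in my_seeds]

    # Collect completed streamlines on rank 0 for rendering.
    all_lines = comm.gather(my_lines, root=0)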

Do we need advanced parallelization techniques? When?
- Data set size? Not enough!
- Large #'s of particles? Need to parallelize, but embarrassingly parallel is OK.
- Large #'s of particles + large data set sizes?

Parallelization for large data with good distribution: this scheme is referred to as parallelizing-over-data. Each processor reads its own pieces of the data and advects the particles that pass through them. [Figure: a parallel simulation code writes data pieces P0-P9 to disk; processors 0-2 each run a Read-Advect-Render pipeline on their own pieces.]
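A sketch of the parallelize-over-data loop follows, again with mpi4py for illustration. The block loaders, particle ownership test, and per-block integrator are hypothetical placeholders; only the communication structure is the point.

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    # Each rank reads only the blocks it owns.
    my_blocks = load_blocks_for_rank(rank, size)       # hypothetical loader
    particles = [p for p in initial_seeds() if owner_of(p) == rank]

    def globally_done(active):
        # Terminate once no rank holds active particles.
        return comm.allreduce(len(active), op=MPI.SUM) == 0

    while not globally_done(particles):
        outgoing = []
        for p in particles:
            advect_within_blocks(p, my_blocks)          # integrate while p stays local
            if not p.terminated:
                outgoing.append(p)                      # p exited our blocks
        # Hand each escaped particle to the rank owning its new block.
        buckets = [[p for p in outgoing if owner_of(p) == r] for r in range(size)]
        particles = [p for bucket in comm.alltoall(buckets) for p in bucket]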

Do we need advanced parallelization techniques? When?
- Data set size? Not enough!
- Large #'s of particles? Need to parallelize, but embarrassingly parallel is OK.
- Large #'s of particles + large data set sizes? Need to parallelize; simple schemes may be OK.
- Large #'s of particles + large data set sizes + (bad distribution OR complex vector field)? Need a smart algorithm for parallelization.

Parallelization with big data & lots of seed points & bad distribution. Two extremes:
- Partition data over processors and pass particles amongst processors: parallelize inefficiency!
- Partition seed points over processors and process whatever data advection requires: redundant I/O!
Hybrid algorithms lie between the two extremes.

Parallelizing over    I/O     Efficiency
  Data                Good    Bad
  Particles           Bad     Good

[Figure: notional streamline example, with blocks colored by owning processor P0-P4 under parallelize-over-data.]

Outline: Motivation; Parallelization strategies; Master-slave parallelization; Hybrid parallelism

The master-slave algorithm is an example of a hybrid technique. It was introduced in "Scalable Computation of Streamlines on Very Large Datasets," Pugmire, Childs, Garth, Ahern, Weber, SC09. (Many of the following slides courtesy of Dave Pugmire.) The algorithm adapts during runtime to avoid the pitfalls of parallelize-over-data and parallelize-over-particles, a nice property for production visualization tools. It is implemented inside the VisIt visualization and analysis package.

Master-Slave Hybrid Algorithm: divide processors into groups of N and uniformly distribute seed points to each group.
Master:
- Monitor workload.
- Make decisions to optimize resource utilization.
Slaves:
- Respond to commands from the master.
- Report status when work is complete.
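The grouping itself can be expressed as a communicator split; a minimal mpi4py sketch follows, where the group size N and the choice of group-rank 0 as master are assumptions for illustration.

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    N = 4                                   # processors per group (assumed)
    group_id = comm.Get_rank() // N
    group = comm.Split(color=group_id, key=comm.Get_rank())
    # Within each group, rank 0 acts as the master, the rest as slaves.
    is_master = (group.Get_rank() == 0)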

Master Process Pseudocode:

Master() {
    while ( !done ) {
        if ( NewStatusFromAnySlave() ) {
            commands = DetermineMostEfficientCommand()
            for cmd in commands:
                SendCommandToSlaves( cmd )
        }
    }
}

What are the possible commands?

Commands that can be issued by the master (OOB = out of bounds):
1. Assign / Loaded Block: the slave is given a streamline that is contained in a block it has already loaded.
2. Assign / Unloaded Block: the slave is given a streamline and loads the block.
3. Handle OOB / Load: the slave is instructed to load a block; the streamline in that block can then be computed.
4. Handle OOB / Send: the slave is instructed to send a streamline to another slave (slave J) that has loaded the block.

Returning to the master pseudocode: DetermineMostEfficientCommand() chooses among these four commands. (See the SC09 paper for details.)
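For orientation, a slave loop to mirror the master pseudocode might look like the sketch below. The command objects and helper functions are hypothetical; the actual protocol is defined in the SC09 paper.

    def slave(comm, master_rank=0):
        loaded = {}                                   # block id -> block data
        while True:
            cmd = comm.recv(source=master_rank)
            if cmd.kind == "assign":                  # commands 1 and 2
                if cmd.block not in loaded:
                    loaded[cmd.block] = read_block(cmd.block)   # hypothetical I/O
                advect(cmd.streamline, loaded[cmd.block])
            elif cmd.kind == "load":                  # command 3
                loaded[cmd.block] = read_block(cmd.block)
            elif cmd.kind == "send":                  # command 4
                comm.send(cmd.streamline, dest=cmd.target)
            elif cmd.kind == "terminate":
                break
            comm.send(make_status_report(), dest=master_rank)   # report status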

Master-slave in action (notional streamline example):
- Iteration 0: P0 reads B0, P3 reads B1.
- Iteration 1: P1 passes points to P0, P4 passes points to P3, P2 reads B0.
The questions the master must answer: when to pass and when to read? How to coordinate communication and status reporting efficiently?

Algorithm test cases:
- Core-collapse supernova simulation
- Magnetic confinement fusion simulation
- Hydraulic flow simulation

[Figure: workload distribution in the supernova simulation under parallelize-over-particles, parallelize-over-data, and the hybrid algorithm, colored by the processor doing the integration.]

Workload distribution in parallelize-over-particles: too much I/O.

Workload distribution in parallelize-over-data: starvation.

Workload distribution in the hybrid algorithm: just right.

Comparison of workload distribution

[Figure: astrophysics test case, total time in seconds to compute 20,000 streamlines vs. number of procs, under uniform and non-uniform seeding, for particles, data, and hybrid parallelization.]

[Figure: astrophysics test case, number of blocks loaded vs. number of procs, under uniform and non-uniform seeding, for particles, data, and hybrid parallelization.]

Outline: Motivation; Parallelization strategies; Master-slave parallelization; Hybrid parallelism

Are today's algorithms going to fit well on tomorrow's machines? The traditional approach for parallel visualization, one core per MPI task, may not work well on future supercomputers, which will have 100-1000 cores per node. Exascale machines will likely have O(1M) nodes, and we anticipate in situ particle advection. Hybrid parallelism blends distributed- and shared-memory parallelism concepts.

The word "hybrid" is being used in two contexts. The master-slave algorithm is a hybrid algorithm, sharing concepts from both parallelize-over-data and parallelize-over-seeds. Hybrid parallelism involves using a mix of shared- and distributed-memory techniques, e.g. MPI + pthreads or MPI + CUDA. One could think about implementing a hybrid particle advection algorithm in a hybrid parallel setting.
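As a structural illustration of hybrid parallelism (not the paper's implementation): one MPI task per node with a pool of worker threads per task, instead of one MPI task per core. The helpers are hypothetical, and in CPython the threads only pay off when the advection kernel releases the GIL (e.g., heavy numpy work), so treat this purely as a sketch of the shape.

    from concurrent.futures import ThreadPoolExecutor
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    CORES_PER_NODE = 4        # e.g., 32 MPI tasks x 4 cores instead of 128 tasks

    # Blocks are read once per node, then shared by all threads on the node.
    blocks = load_blocks_for_rank(rank, comm.Get_size())   # hypothetical loader
    my_seeds = seeds_for_rank(rank)                        # hypothetical seeder

    with ThreadPoolExecutor(max_workers=CORES_PER_NODE) as pool:
        lines = list(pool.map(lambda s: advect(s, blocks), my_seeds))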

What can we learn from a hybrid parallel study? How do we implement parallel particle advection algorithms in a hybrid parallel setting? How do they perform? Which algorithms perform better? How much better? Why? Reference: "Streamline Integration Using MPI-Hybrid Parallelism on a Large Multi-Core Architecture" by David Camp, Christoph Garth, Hank Childs, Dave Pugmire, and Ken Joy; accepted to TVCG.

Streamline integration using MPI-hybrid parallelism on a large multi-core architecture:
- Implemented parallelize-over-data and parallelize-over-particles in a hybrid parallel setting (did not study the master-slave algorithm).
- Ran a series of tests on the NERSC Franklin machine (a Cray).
- Compared 128 MPI tasks (non-hybrid) vs. 32 MPI tasks with 4 cores per task (hybrid).
- 12 test cases: large vs. small # of seeds, uniform vs. non-uniform seed locations, 3 data sets.

Hybrid parallelism for parallelize-over-data. Expected benefits:
- Less communication and fewer communicators.
- Should be able to avoid starvation by sharing data within a group.
[Figure: starvation in the non-hybrid case.]

Measuring the benefits of hybrid parallelism for parallelize-over-data

Gantt chart for parallelize-over-data

Hybrid parallelism for parallelize-over-particles. Expected benefits:
- Only need to read blocks once per node, instead of once per core.
- Larger cache allows for reduced reads.
- Long paths automatically shared among cores on a node.

Measuring the benefits of hybrid parallelism for parallelize-over-particles

Gantt chart for parallelize-over-particles

Summary of the hybrid parallelism study:
- Hybrid parallelism appears to be extremely beneficial to particle advection.
- We believe the parallelize-over-data results are highly relevant to the in situ use case.
- Although we didn't implement the master-slave algorithm, the benefits shown at the spectrum extremes provide good evidence that hybrid algorithms will also benefit.
- Implemented on a VisIt branch, with the goal of integrating it into VisIt proper in the coming months.

Summary for large data and parallelization:
- The type of parallelization required will vary based on data set size, number of seeds, seed locations, and vector field complexity.
- Parallelization may occur via parallelize-over-data, parallelize-over-particles, or somewhere in between (master-slave).
- Hybrid algorithms have the opportunity to de-emphasize the pitfalls of the traditional techniques.
- Hybrid parallelism appears to be very beneficial.
- Note that I said nothing about time-varying data!

Thank you for your attention! Acknowledgments: DoE SciDAC program, VACET. Participants: Hank Childs (hchilds@lbl.gov), Dave Pugmire (ORNL), Dave Camp (LBNL/UCD), Christoph Garth (UCD), Sean Ahern (ORNL), Gunther Weber (LBNL), and Ken Joy (UCD). Questions?