In-situ Visualization


In-situ Visualization. Dr. Jean M. Favre, Scientific Computing Research Group, 13-01-2011.

Outline: motivations; how parallel visualization is done today (visualization pipelines, execution paradigms, many grids or subsets on few processors); a 2D solver with parallel partitioning and patterns; in-situ visualization (enabling visualization in a running simulation, source code instrumentation, specifying ghost cells).

Context: in-situ visualization is one action item in our HP2C efforts. The HP2C platform aims at developing applications that run at scale and make efficient use of the next generation of supercomputers, presently the generation of computing technologies available in the 2013 timeframe.

Motivations: parallel simulations are now ubiquitous, and the mesh sizes and numbers of time steps are of unprecedented size. The traditional post-processing model (compute, store, analyze) does not scale, because I/O to disks is the slowest component. Consequences: datasets are often under-sampled on disk, many time steps are never archived, and it takes a supercomputer to re-load and visualize supercomputer data.

Visualization at Supercomputing Centers (IEEE CG&A, Jan/Feb 2011): as we move from petascale to exascale, favor the use of the supercomputer itself instead of a separate visualization cluster. Memory must be proportional to problem size and is very expensive; visualization then uses aggregate flops instead of being a parasitic expense, and benefits from close coupling with I/O. But: there is little memory per core.

Solving a PDE and visualizing the execution. The full source code of the solution is given here: http://portal.nersc.gov/svn/visit/trunk/src/tools/DataManualExamples/Simulations/contrib/pjacobi/

A PDE with fixed boundary conditions: solve the Laplace equation Δu = 0 on a 2D grid, updating the interior with the solver while keeping the boundary values fixed (the example uses u = sin(πx) on one edge and u = 0 on the others).
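
The solver referred to above is the classic Jacobi iteration for the Laplace equation. A minimal sketch of one sweep over the local block, in C, assuming an (m+2) x (mp+2) array with one layer of boundary/ghost points (the function and array names are illustrative, not taken from the pjacobi source):

    /* One Jacobi sweep: each interior point becomes the average of its four
       neighbours; rows/columns 0 and m+1 / mp+1 hold boundary or ghost values. */
    void jacobi_sweep(int m, int mp,
                      double u[m + 2][mp + 2], double unew[m + 2][mp + 2])
    {
        for (int i = 1; i <= m; i++)
            for (int j = 1; j <= mp; j++)
                unew[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] +
                                     u[i][j - 1] + u[i][j + 1]);
    }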

2D grid partitioning and I/O pattern, for BC initialization or restart: each process reads (mp+2) grid lines (its mp interior lines plus two ghost lines).

Grid partitioning and I/O pattern, for check-pointing and solution dumps: the first and last processes write (mp+1) grid lines each (their interior lines plus the physical boundary line), while the interior processes write (mp) grid lines.

Grid partitioning and communication pattern: overlap the Send and Receive operations; process 0 does not receive from below, and process (N-1) does not send above.
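
A minimal sketch of that exchange in C with MPI_Sendrecv, assuming a 1-D split into horizontal strips of mp interior lines of (m+2) values each; passing MPI_PROC_NULL as the missing neighbour on the first and last ranks reproduces the behaviour described above (the names below are illustrative):

    #include <mpi.h>

    /* Exchange one ghost line with each neighbour. u holds (mp+2) lines of
       (m+2) values; line 0 and line mp+1 are the ghost lines. below/above
       are the neighbour ranks, or MPI_PROC_NULL at the edges of the domain,
       in which case MPI_Sendrecv simply skips that direction. */
    void exchange_ghost_lines(double *u, int m, int mp,
                              int below, int above, MPI_Comm comm)
    {
        const int n = m + 2;
        /* send my top interior line up, receive my bottom ghost line */
        MPI_Sendrecv(&u[mp * n], n, MPI_DOUBLE, above, 0,
                     &u[0],      n, MPI_DOUBLE, below, 0,
                     comm, MPI_STATUS_IGNORE);
        /* send my bottom interior line down, receive my top ghost line */
        MPI_Sendrecv(&u[1 * n],        n, MPI_DOUBLE, below, 1,
                     &u[(mp + 1) * n], n, MPI_DOUBLE, above, 1,
                     comm, MPI_STATUS_IGNORE);
    }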

Grid partitioning and in-situ graphics: use ghost nodes to prevent overlaps between the pieces handed to the visualization; each process sends its block including ghost lines ((mp+2) or (mp+1) lines, depending on its position) and declares the real (non-ghost) index range with visitrectmeshsetrealindices(h, minrealindex, maxrealindex).

Strategies for static load balancing: a checkerboard pattern seems like it would give a good compromise, but its communication pattern is more complex to program. Grid-splitting strategies will also affect the boundary conditions, the I/O patterns, and the in-situ ghost regions.

Parallel visualization pipelines: data parallelism everywhere (data I/O, processing, rendering). The data source distributes data partitions 1..N to multiple execution engines, each running its own pipeline of source, filters, mappers, and rendering.

Arbitrary (or adaptive) 3-D data partitioning: is the final image order-independent? Sort-last compositing (valid for opaque geometry) enables complete freedom in data partitioning.

Example of the use of ghost cells: ghost (or halo) cells are owned by processor P but reside in the region covered by processor P+k (k is an offset in the MPI grid). Their purpose is to guarantee data continuity.

Example of the use of ghost cells: ghost (or halo) cells are usually not saved in solution files because the overhead can be quite high. A restart usually involves reading the normal grid and communicating (initializing) the ghost cells before proceeding.

In-situ (parallel) visualization: could I instrument parallel simulations to communicate with a subsidiary visualization application/driver? The goals: eliminate I/O to and from disks, use all my grid data with ghost cells, have access to all time steps and all variables, use my parallel compute nodes, and avoid investing in building a GUI, a new visualization package, or a parallel I/O module.

Two complementary approaches. (1) Parallel data transfer to Distributed Shared Memory: computation and post-processing are physically separated; developed at CSCS and publicly available on HPCforge (or use ADIOS, not described here; search for ADIOS NCCS). (2) Co-processing, or in-situ: computation and visualization on the same nodes; a ParaView project and a VisIt project. See the short article at EPFL, the tutorial at the PRACE Winter School, and the tutorial proposed at ISC 11.

First approach: parallel data transfer to Distributed Shared Memory. Live post-processing and steering: get analysis results while the simulation is still running, and re-use the generated data for further analysis/compute operations. Main objective: minimize the modification of existing simulation codes.

HDF5 Virtual File Drivers. HDF5 (Hierarchical Data Format) is a widely known data format, based on a Virtual File Driver (VFD) architecture: a modular low-level structure of drivers (H5FDxxx) that map the HDF5 format address space onto storage. HDF5 already provides users with different file drivers (e.g. the MPI-IO driver).

HDF5 Virtual File Drivers. The CSCS driver is H5FDdsm: data sent into HDF5 is automatically redirected to this driver with H5Pset_fapl_dsm(fapl_id, MPI_COMM_WORLD, NULL). It is based on the Distributed Shared Memory (DSM) concept: the DSM address space is divided among N processes, each holding a segment of the local length, with used and free space managed across the aggregate buffer.
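
As a minimal sketch (in C) of how a simulation that already dumps its field through HDF5 could be switched between disk and the DSM: only the file-access property changes, the rest is plain parallel HDF5. H5Pset_fapl_dsm() is the call quoted above; the function name dump_field, the header name H5FDdsm.h and the one-dimensional dataset layout are illustrative assumptions.

    #include <mpi.h>
    #include <hdf5.h>
    #include "H5FDdsm.h"   /* assumed header providing H5Pset_fapl_dsm() */

    void dump_field(const double *u, hsize_t nlocal, hsize_t nglobal,
                    hsize_t offset, int use_dsm)
    {
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        if (use_dsm)
            H5Pset_fapl_dsm(fapl, MPI_COMM_WORLD, NULL);           /* to the DSM */
        else
            H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL); /* to disk */

        hid_t file   = H5Fcreate("solution.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
        hid_t fspace = H5Screate_simple(1, &nglobal, NULL);
        hid_t dset   = H5Dcreate(file, "/u", H5T_NATIVE_DOUBLE, fspace,
                                 H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

        /* each rank writes its own hyperslab of the global dataset */
        hid_t mspace = H5Screate_simple(1, &nlocal, NULL);
        H5Sselect_hyperslab(fspace, H5S_SELECT_SET, &offset, NULL, &nlocal, NULL);
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, H5P_DEFAULT, u);

        H5Sclose(mspace); H5Sclose(fspace); H5Dclose(dset);
        H5Fclose(file);   H5Pclose(fapl);
    }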

H5FDdsm has modular communication: the inter-communicator can be MPI, sockets, or something else, while the intra-communicator is always MPI. Main operating mode: the DSM is allocated on the post-processing nodes, and the computation writes into it using remote puts (N computation PEs connected through the inter-communicator to M post-processing PEs).

Write test: data write rate (GB/s) as a function of the number of PEs used for writing (16 to 512), for a 20 GB DSM distributed among 8 post-processing nodes of the XT5. H5FDdsm (sockets over the SeaStar2+ interconnect) saturates the network and clearly outperforms the parallel file system (Lustre).

The DSM interface within ParaView: Pause, Play and Switch-to-disk are implemented within the driver, so no additional calls are necessary; datasets to be sent can be selected on the fly; the simulation can be restarted from an HDF file stored in the DSM; parameters/arrays/etc. can be modified and datasets shared back to the simulation. The simulation processes on cluster 0 talk to the ParaView servers on cluster 1 over MPI or sockets through a fast switch (Infiniband), and the ParaView client connects to the servers.

Advantages: parallel real-time post-processing of the simulation; easy to use if the application already uses HDF5; no need for a large amount of memory on the compute nodes, since everything is sent to a remote DSM. Not all post-processing applications scale as well as the simulation, so ParaView can run on fat nodes while the simulation runs on a very large number of nodes.

Second method: VisIt. A desktop machine runs the VisIt GUI and Viewer, exchanging commands and images with the parallel supercomputer, where each node (node220 to node223) runs the simulation code linked against the VisIt library. The idea: link the simulation with the visualization library and drive it from a GUI.

Use VisIt (https://wci.llnl.gov/codes/visit). The Simulations window shows meta-data about the running code, and the control commands exposed by the code are available there. All of VisIt's existing functionality is accessible: users select simulations to open as if they were files.

VisIt's plotting is called from the simulation: the same desktop/supercomputer layout as above, with the VisIt GUI and Viewer exchanging commands and images with the simulation code and VisIt library on each node. No pre-defined visualization scenario needs to be defined.

Step 1: from a Linux desktop machine, launch the simulation on the Big Parallel Machine (BPM). The job is submitted (PSUB/SRUN) from the BPM login node; the simulation ranks SimCode0 to SimCode3 run on nodes bpm20 to bpm23, each holding its own data.

Step 2: the remote VisIt task connects to the simulation. Each running simulation registers itself in ~/.visit/simulations in the BPM home directory (e.g. jet00 bpm33 6666 May1, jet01 bpm20 2345 May1); the VisIt GUI and Viewer on the desktop use the VisIt launcher to reach the listening SimCode ranks.

Step 3: the simulation becomes an engine and connects to the Viewer. Each SimCode rank (bpm20 to bpm23) now hosts a VisIt engine next to its data.

Step 4: VisIt requests pull data from the simulation. Each rank's VisIt engine accesses the simulation data through a thin data interface.

Some details on the APIs. The C and Fortran interfaces for using SimV2 are identical, apart from the function names being different. The VisIt Simulation API has just a few functions: set up the environment, open a socket and start listening, process a VisIt command, and set the control callback routines. The VisIt Data API has just a few callbacks: GetMetaData(), GetMesh(), GetScalar(), etc.

Main program:

    int main(int argc, char **argv)
    {
        SimulationArguments(argc, argv);
        read_input_deck();
        mainloop();
        return 0;
    }

The main program is augmented:

    int main(int argc, char **argv)
    {
        SimulationArguments(argc, argv);
        VisItSetupEnvironment();
        VisItInitializeSocketAndDumpSimFile();
        read_input_deck();
        mainloop();
        return 0;
    }

Example with our Shallow Water Simulation code: the additions to the driver program amount to 15 lines of F95 code.
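
The slide leaves the arguments of the second call out; a hedged sketch of how it is typically parameterised in the SimV2 C interface (the simulation name, comment and path strings are placeholders), so that VisIt can find the run in ~/.visit/simulations:

    /* Read the VisIt installation information, then write a .sim2 file
       describing this run so a remote VisIt client can connect to it. */
    VisItSetupEnvironment();
    VisItInitializeSocketAndDumpSimFile(
        "pjacobi",                                  /* simulation name      */
        "Jacobi solver instrumented with libsim",   /* comment              */
        "/path/to/rundir",                          /* working directory    */
        NULL,                                       /* input deck           */
        NULL,                                       /* custom UI file       */
        NULL);                                      /* default .sim2 name   */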

mainloop() will be augmented. The original loop:

    void mainloop(simulation_data *sim)
    {
        int blocking, visitstate, err = 0;
        do
        {
            simulate_one_timestep(sim);
        } while(!sim->done && err == 0);
    }

Flow diagram: at each pass, check whether input was detected; if the visualization tries to connect or sends a command, execute the visualization command, otherwise execute one iteration of the simulation.

The simulation main loop is augmented:

    void mainloop(simulation_data *sim)
    {
        int blocking, visitstate, err = 0;
        do
        {
            blocking = (sim->runmode == SIM_RUNNING) ? 0 : 1;
            visitstate = VisItDetectInput(blocking, -1);
            if(visitstate == 0)
            {
                simulate_one_timestep(sim);
            }
            ...
        } while(!sim->done && err == 0);
    }

Example with our Shallow Water Simulation code: the additions to the simulation loop amount to 53 lines of F95 code.

Two entry points to the execution: the input check either falls through to one simulation iteration, or executes a visualization command, triggered on demand from the GUI or when the visualization tries to connect.
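
Putting the two entry points together, a minimal sketch of the completed loop in C, assuming the standard SimV2 control calls (VisItAttemptToCompleteConnection, VisItSetCommandCallback, VisItProcessEngineCommand, VisItDisconnect); ControlCommandCallback is the button handler sketched further below, and error handling is reduced to the bare minimum:

    void mainloop(simulation_data *sim)
    {
        int blocking, visitstate, err = 0;
        do
        {
            /* block for input when stopped, poll when running */
            blocking = (sim->runmode == SIM_RUNNING) ? 0 : 1;
            visitstate = VisItDetectInput(blocking, -1);

            if(visitstate == 0)            /* timed out: advance the solver */
            {
                simulate_one_timestep(sim);
            }
            else if(visitstate == 1)       /* VisIt is trying to connect */
            {
                if(VisItAttemptToCompleteConnection())
                {
                    VisItSetCommandCallback(ControlCommandCallback, (void*)sim);
                    /* the data callbacks (metadata, mesh, variables) are
                       registered here as well */
                }
            }
            else if(visitstate == 2)       /* VisIt sent an engine command */
            {
                if(!VisItProcessEngineCommand())
                {
                    VisItDisconnect();     /* viewer went away: keep running */
                    sim->runmode = SIM_RUNNING;
                }
            }
            else if(visitstate < 0)
            {
                err = 1;                   /* error while listening for input */
            }
        } while(!sim->done && err == 0);
    }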

Callbacks are added to advertise the data. visitcommandcallback() declares the commands that will be understood (next(), halt(), run(), ...). visitgetmetadata() advertises the mesh name, mesh type, topological and spatial dimensions, units and labels, the variable names and location (cell-based, node-based), and the variable size (scalar, vector, tensor). visitgetmesh() returns the mesh itself. Example with our Shallow Water Simulation code: one new file to link with the rest of the simulation code, 432 lines of F77 code.
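
For reference, a hedged sketch of the equivalent metadata callback in the SimV2 C interface (the mesh and variable names are placeholders; the Fortran visitgetmetadata routine advertises the same information):

    #include <VisItDataInterface_V2.h>

    visit_handle SimGetMetaData(void *cbdata)
    {
        visit_handle md = VISIT_INVALID_HANDLE, mmd, vmd, cmd;
        if(VisIt_SimulationMetaData_alloc(&md) == VISIT_OKAY)
        {
            /* one 2-D rectilinear mesh */
            VisIt_MeshMetaData_alloc(&mmd);
            VisIt_MeshMetaData_setName(mmd, "mesh2d");
            VisIt_MeshMetaData_setMeshType(mmd, VISIT_MESHTYPE_RECTILINEAR);
            VisIt_MeshMetaData_setTopologicalDimension(mmd, 2);
            VisIt_MeshMetaData_setSpatialDimension(mmd, 2);
            VisIt_SimulationMetaData_addMesh(md, mmd);

            /* one node-centered scalar defined on that mesh */
            VisIt_VariableMetaData_alloc(&vmd);
            VisIt_VariableMetaData_setName(vmd, "temperature");
            VisIt_VariableMetaData_setMeshName(vmd, "mesh2d");
            VisIt_VariableMetaData_setType(vmd, VISIT_VARTYPE_SCALAR);
            VisIt_VariableMetaData_setCentering(vmd, VISIT_VARCENTERING_NODE);
            VisIt_SimulationMetaData_addVariable(md, vmd);

            /* the control commands the GUI may send back */
            const char *commands[] = { "halt", "step", "run" };
            for(int i = 0; i < 3; ++i)
            {
                VisIt_CommandMetaData_alloc(&cmd);
                VisIt_CommandMetaData_setName(cmd, commands[i]);
                VisIt_SimulationMetaData_addGenericCommand(md, cmd);
            }
        }
        return md;
    }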

GetMesh() and GetVariable() callbacks must be created. F90 example:

    integer function visitgetmesh(domain, name, lname)
      parameter (NX = 512)
      parameter (NY = 512)
      real rmx(NX), rmy(NY)
      ! allocate data handles for the coordinates and advertise them,
      ! then attach them to the mesh handle:
      ...
      visitrectmeshsetcoords(h, x, y)
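
The same callback expressed as a hedged sketch in the SimV2 C interface: coordinate arrays are wrapped in variable-data handles and attached to a rectilinear-mesh handle, and the real-index call from the ghost-node slide slots in here as well (NX, NY, rmx and rmy are assumed to be simulation-owned globals, mirroring the Fortran snippet):

    #include <string.h>
    #include <VisItDataInterface_V2.h>

    visit_handle SimGetMesh(int domain, const char *name, void *cbdata)
    {
        visit_handle h = VISIT_INVALID_HANDLE;
        if(strcmp(name, "mesh2d") == 0 &&
           VisIt_RectilinearMesh_alloc(&h) == VISIT_OKAY)
        {
            visit_handle hx, hy;
            VisIt_VariableData_alloc(&hx);
            VisIt_VariableData_alloc(&hy);
            /* hand the simulation-owned coordinate arrays to VisIt, no copy */
            VisIt_VariableData_setDataF(hx, VISIT_OWNER_SIM, 1, NX, rmx);
            VisIt_VariableData_setDataF(hy, VISIT_OWNER_SIM, 1, NY, rmy);
            VisIt_RectilinearMesh_setCoordsXY(h, hx, hy);

            /* declare which indices are real (non-ghost) nodes, cf. the
               visitrectmeshsetrealindices call in the Fortran example */
            int minreal[3] = { 1, 1, 0 }, maxreal[3] = { NX - 2, NY - 2, 0 };
            VisIt_RectilinearMesh_setRealIndices(h, minreal, maxreal);
        }
        return h;
    }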

How much impact on the source code? The best-suited simulations are those allocating large (contiguous) memory arrays to store mesh connectivity and variables. Memory pointers are passed, and either the simulation or the visualization can be assigned the responsibility of deallocating the memory when done. F90 example:

    allocate ( v(0:m+1,0:mp+1) )
    visitvardatasetd(h, VISIT_OWNER_SIM, 1, (m+2)*(mp+2), v)

How much impact on the source code? The least-suited codes are those pushing the object-oriented philosophy to a maximum. Example: a finite-element code handling a triangular mesh:

    TYPE Element
      REAL(r8) :: x(3)
      REAL(r8) :: y(3)
      REAL(r8) :: h
      REAL(r8) :: u
      REAL(r8) :: zb(3)
    END TYPE Element

How much impact on the source code? When the data points are spread across many objects, a new memory allocation and a gather are needed before passing the data to the Vis engine:

    REAL, DIMENSION(:), ALLOCATABLE :: cx
    ALLOCATE( cx(numnodes), stat=ierr )
    DO iElem = 1, numelems+numhalos
      DO i = 1, 3
        cx(ElementList(iElem)%lclnodeids(i)) = ElementList(iElem)%x(i)
      END DO
    END DO
    err = visitvardatasetf(x, VISIT_OWNER_COPY, 1, numnodes, cx)

VisIt can control the running simulation: connect and disconnect at any time while the simulation is running. We program some buttons to react to the user's input: Halt, Step, Run, Update, and others.
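
A minimal sketch of the control-command callback that those buttons end up invoking, assuming the SimV2 registration shown in the main-loop sketch above (SIM_STOPPED is an assumed counterpart of the SIM_RUNNING state used earlier):

    #include <string.h>

    /* Called by libsim whenever the user presses one of the simulation's
       control buttons in the VisIt GUI. */
    void ControlCommandCallback(const char *cmd, const char *args, void *cbdata)
    {
        simulation_data *sim = (simulation_data *)cbdata;
        if(strcmp(cmd, "halt") == 0)
            sim->runmode = SIM_STOPPED;
        else if(strcmp(cmd, "step") == 0)
            simulate_one_timestep(sim);
        else if(strcmp(cmd, "run") == 0)
            sim->runmode = SIM_RUNNING;
    }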

Example: a domain distributed over MPI tasks. Dimensions = 1080x720x360, 10 variable arrays (double), MPI partition = 12x6, data per MPI task: 360 MB. The application linked with the libsim library is 977 KB (850 KB without); at runtime, one extra library of 306 KB is loaded.

Domain distributed over multiple MPI tasks: if the graphics output is very heavy, rendering is done remotely (by the simulation) and sent to the client as a pixmap; if it is light, the geometry is sent to the client for local rendering.

The in-situ library provides many features: access to scalar, vector, tensor, and label arrays; CSG meshes; AMR meshes; polyhedra; material species; the ability to save images directly from the simulation; interleaved XY and XYZ coordinate arrays.

Advantages compared to saving files: the greatest bottleneck (disk I/O) is eliminated; you are not restricted by the limitations of any file format; there is no need to reconstruct ghost cells from archived data; all time steps are potentially accessible; all problem variables can be visualized; internal data arrays can be exposed or used; step-by-step execution helps you debug your code and your communication patterns; and the simulation can watch for a particular event and trigger an update of the VisIt plots (see the sketch below).
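
The last point, watching for an event and triggering a plot update, maps onto two small libsim calls; a minimal sketch, to be placed at the end of simulate_one_timestep() (event_of_interest is a hypothetical trigger flag):

    /* Tell a connected viewer that the data advanced and, when something
       interesting happened, force its plots to re-execute. */
    if(VisItIsConnected())
    {
        VisItTimeStepChanged();
        if(event_of_interest)
            VisItUpdatePlots();
    }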

In the past, we focused on raw data => images: for example, 4000 time steps of a fluid-structure interaction run stored as VTK XML unstructured-grid files, with <DataArray> elements whose contents are base64-encoded appended binary data.

We are now adding a new interaction paradigm: mega-, giga-, peta-, and exa-scale simulations can now be coupled with visualization. The solver loop becomes: solve the next step, check for convergence or end-of-loop, and serve a visualization request when one is pending.

We now focus on source code => live images: the same Fortran gather loop shown earlier now hands its array directly to the VisIt engine via visitvardatasetf(), producing live images instead of raw files.

We need a new data-analysis infrastructure. The domain decomposition optimized for the simulation is often unsuitable for parallel visualization. To optimize memory usage, we must share the same data structures between the simulation code and the visualization code to avoid data replication. We need to create a new visualization infrastructure and develop in-situ data encoding algorithms, indexing methods, and incremental 4D feature extraction and tracking. Petascale visualization tools may soon need to exploit new parallel paradigms in hardware, such as multiple cores, multiple GPUs, and cell processors.

Conclusion: parallel visualization is a mature technology, but it was optimized as a stand-alone process; it can run like a supercomputer simulation, but it is also limited by I/O. In-situ visualization is an attractive strategy to mitigate this problem, but it will require an even stronger collaboration between application scientists and visualization scientists, and the development of a new family of visualization algorithms. Demonstrations follow.