Algorithmic Challenges and Opportunities for Data Analysis and Visualization in the Co-design Process


1 Algorithmic Challenges and Opportunities for Data Analysis and Visualization in the Co-design Process Hasan Abbasi, Janine Bennett, Peer-Timo Bremer, Varis Carey, Greg Eisenhauer, Attila Gyulassy, Scott Klasky, Robert Moser, Todd Oliver, Manish Parashar, Valerio Pascucci, Karsten Schwan, Hongfeng Yu, and Matthew Wolf

2 SDMA challenges in extreme-scale computing

3 Combustion Workflow
RHS of S3D solver at each stage of an explicit time step
Asynchronous movement of data, or sharing of data in memory (different levels)
In situ, in transit data analysis/viz workflow via hybrid staging

4 We are Building Proxy and Skeletal Apps that Enable Empirical Evaluation of Codesign Choices
Proxy app for topology-driven feature extraction
Proxy app for topology-driven feature tracking
Proxy app for statistical analysis
Proxy app for visualization
Proxy app for uncertainty quantification
Skeletal apps for staging and data movement

5 Codesign Questions for Data Analysis and Visualization Algorithms
How much memory will be available in situ, and with what characteristics?
Will we have hardware and runtime support for asynchronous computation in situ?
Will performance for small messages be reduced?
What is the ratio of network bandwidth for in situ vs. in transit communication?
How well will modern processors support code that is branch-heavy and FLOP-free?

6 A Wide Range of Analysis and Visualization Algorithms Are Needed for Combustion Applications
In situ multi-variate volume and particle rendering
Lagrangian particle querying and analysis
Topological segmentation: contour trees, Morse-Smale complex, time tracking
Scalar field comparison
Distance field (level set)
Filtering and averaging (spatial and temporal)
Shape analysis
Statistical moments (conditional)
Statistical dimensionality reduction (joint PDFs)
Spectra (scalar, velocity, coherency)
Flame-centric control volume analysis

7 We Build Reduced Topological Models for Characterization and Tracking of Combustion Features
[Diagram: domain hierarchy; feature-tracking events between t and t + Δt: birth, death, split, continuation]

8 Merge Trees Represent Feature Extraction at Different Scales with Thresholds for Noise Removal

9 Visualization and Analysis Bottlenecks May Differ from Simulation Bottlenecks
Typically I/O bound: limited by the rate at which data can be accessed
Memory layout may significantly impact efficiency, beyond the traditional cache effects seen in solvers
FLOP vs. branch ratios can vary dramatically
Algorithms with many branches are highly data-dependent:
- Feature density: how much of the data is relevant
- Feature distribution: how well the workload can be balanced
It is difficult to reliably predict generally expected behavior

10 Execution Models and Data Movements Depend on Different Flavors of Hybrid Staging
In situ:
- Co-located: complete resource sharing/contention with the simulation
- Partial sharing: different processor/core, less resource contention
- Out of band: on node with minimal resource usage (e.g. use of out-of-core techniques, low priority)
In transit:
- Local: sending data to different nodes of the same machine
- Remote: sending data to a separate machine

11 The Codesign Process Provides a Unique Opportunity to Study the Algorithmic Design Space

12 Multi-scale Design Patterns and Execution Models Lead to Local and Global Algorithm Design Parameters
Global design parameters include:
- Number of execution units
- Data aggregation patterns
- N-to-1 communication patterns
- Use of pair-wise data exchange schemes
Local design parameters are algorithm-specific:
- Sort first, filter last
- Filter first, sort last
- Filter first, traverse last

13 We Can Project Behaviors for Different Design Parameters and Execution Models onto Prospective Hardware Configurations
SST and spreadsheet models provide a projection of communication and wall time
Prospective configurations:
- PIM (processing in memory, cache-less architecture)
- exanode1 (commodity NIC and memory)
- exanode2 (custom on-board NIC + faster memory)
ExaCT tools: performance analysis spreadsheet

14 We Are Exploring 3 Use Cases that Cover a Range of Characteristic Behaviors

                                   Statistics               Visualization            Topology
Local compute behavior             Two phases: all FLOPs    Two phases: some FLOPs   Three phases: 2 FLOP-free, 1 FLOP-heavy
Complexity is data-dependent?      No                       Yes                      Yes
Data transferred for agg./gather   Constant (small)         Can be data-dependent    Data-dependent
Scatter required?                  Sometimes (small data)   No                       Yes (small data)

15 Topology Algorithm
Computation of local features:
- Process vertices in sorted order, detecting the joining of iso-surface components
- Uses a Union-Find data structure
Communication to resolve features spanning multiple blocks:
- Local merge trees are communicated in N-to-1 merges
- Corrections to local trees are re-broadcast to local compute nodes
Feature-based statistics computation:
- The segmentation stored with the corrected local merge trees is used to compute per-feature statistics (averages, size, shape)
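The local pass above can be sketched in a few lines: vertices are visited in sorted (descending) order and a union-find structure detects when superlevel-set components join. The 1D field and neighbor stencil below are illustrative assumptions, not the proxy app's actual code; zero-persistence pairs produced by regular vertices are exactly the noise that the threshold-based simplification on the merge-tree slide would remove.

```python
def find(parent, i):
    # Path-halving find.
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def merge_tree_1d(values):
    """Return (birth_value, join_value) pairs for superlevel-set components."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    parent, birth, joins = {}, {}, []
    for i in order:
        parent[i] = i
        birth[i] = values[i]          # a new component is born at this vertex
        for j in (i - 1, i + 1):      # 1D neighbor stencil (assumption)
            if j in parent:
                ri, rj = find(parent, i), find(parent, j)
                if ri != rj:
                    # Two components meet: the younger one merges into the older.
                    younger = ri if birth[ri] < birth[rj] else rj
                    survivor = rj if younger is ri else ri
                    joins.append((birth[younger], values[i]))
                    parent[younger] = survivor
    return joins

print(merge_tree_1d([0, 3, 1, 4, 0]))  # peaks born at 3 and 4 join at value 1
```

The same union-find pass generalizes to 3D blocks by replacing the neighbor stencil; the resulting local trees are then merged N-to-1 as described above.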

16 Basic Topology Measurements Obtained with Byfl
Basic analysis for on-node computation (data size 560x560x560)
[Table: num cores, num points, points per core, total loads (MB), total stores (MB), total FLOPs, loads/core (MB)]

17 Topology Design Space Exploration
Computation of local features, alternatives (advantage / disadvantage):
- Sort-first: efficient (possible GPU) / O(n) memory for sorted indices
- Progressive-sort: smaller memory footprint / extra pass over data
- Union-Find: efficient / O(n) Union-Find data structure
- Streaming computation: re-use of memory for tree / significantly slower
Communication alternatives:
- N-to-1 gather then 1-to-N scatter: simple comm. model / higher latency
- N-to-1 gather interleaved with 1-to-N scatter: less idle time on compute node / requires interleaved compute/comm, streaming compute, asynchronous comm
- N-to-1 gather interleaved with 1-to-leaves scatter: less idle time on compute node
Feature-based statistics alternatives:
- Wait for corrections, then compute: simple / higher latency
- Compute-and-correct: lower latency / re-do some work, more complex communicators

18 Example Execution Model 1: In Situ Data Transfers on Node or on Network
Topology computation directly integrated with the solver; K rounds of merges on in situ processes
Factors analyzed:
- Number of nodes
- Number of cores per node
- Number of merge operations per stage
- Initial data access through a shared pointer
Free parameters:
- Communicate on network first
- Communicate on network last
- Size of messages
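The "network first vs. network last" free parameter can be illustrated with a toy schedule model (not ExaCT's SST or spreadsheet model): assuming ranks are packed consecutively onto nodes and that binary-merge stage s pairs survivors at stride 2^s, each stage's merges are either on-node (shared memory) or off-node (network), and reordering stages moves network traffic earlier or later.

```python
def merge_schedule(n_nodes, cores_per_node):
    """For each binary-merge stage, count on-node vs. network merges."""
    total = n_nodes * cores_per_node
    rows, stage, stride = [], 0, 1
    while stride < total:
        on_node = network = 0
        # Survivors at this stage sit at multiples of 2*stride.
        for r in range(0, total, 2 * stride):
            partner = r + stride
            if r // cores_per_node == partner // cores_per_node:
                on_node += 1      # same node: shared-memory merge
            else:
                network += 1      # different node: network message
        rows.append((stage, on_node, network))
        stage += 1
        stride *= 2
    return rows

# 8 nodes x 4 cores: early stages stay on-node, later stages hit the network.
for row in merge_schedule(8, 4):
    print(row)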

19 Communication Loads for Different Data Transfer Patterns
[Charts: message size and message count vs. merge stages for 2-, 4-, and 8-cores-per-node configurations; data size 2025x1600x400, Kay=0.31, binary merges]

20 Communication Loads for Different Data Transfer Patterns
[Charts: message size and message count vs. merge stages for 2-, 4-, and 8-cores-per-node configurations; data size 2025x1600x400, Kay=0.31, 8-way merges]

21 Example Execution Model 2: Part In Situ Plus Part In Transit
Initial local compute directly integrated with the solver; K rounds of merges on in situ processes; (N-K) rounds of merging in the staging area
Predicted cost factors:
- Initial data access through a shared pointer
- K rounds of merges in blocking mode as part of the solver code, including shared-memory (on-node) and MPI-message (off-node) communication
- Data transfer to the staging area
- Asynchronous computation in the staging area
Free parameters: merging strategy and staging-area break
Things to watch out for:
- Initial data transfer might pollute the cache
- The in situ merge becomes sparse quickly

22 Communication Loads for Different Data Transfer Patterns
[Charts: total communication vs. merge stages for 2-, 4-, and 8-cores-per-node configurations, binary merge vs. 8-way merge; data size 2025x1600x400, Kay=]

23 Statistics Algorithm
1st-4th order moments (variance, skewness, kurtosis) along with minimum and maximum values are commonly computed by physics codes
Pair-wise update formulas for the 1st-4th order moments allow a single-pass distributed implementation: given moment(A) and moment(B), compute moment(A U B)
The global model can optionally be scattered to all processes to allow assessment of observations (e.g. to determine outliers)
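The pair-wise update idea can be sketched with the well-known one-pass merge formulas (Chan et al. / Pébay). Shown only up to the second moment plus min/max to stay short; the 3rd- and 4th-order updates follow the same pair-wise pattern. Function and field names are illustrative, not the production API.

```python
def local_moments(xs):
    """Moments of one local block: count, mean, sum of squared deviations, extrema."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs)
    return {"n": n, "mean": mean, "m2": m2, "min": min(xs), "max": max(xs)}

def merge(a, b):
    """moment(A) and moment(B) -> moment(A u B), for disjoint sets A and B."""
    n = a["n"] + b["n"]
    delta = b["mean"] - a["mean"]
    mean = a["mean"] + delta * b["n"] / n
    m2 = a["m2"] + b["m2"] + delta ** 2 * a["n"] * b["n"] / n
    return {"n": n, "mean": mean, "m2": m2,
            "min": min(a["min"], b["min"]), "max": max(a["max"], b["max"])}

# Merging two blocks reproduces the whole-data statistics exactly.
data = list(range(10))
merged = merge(local_moments(data[:4]), local_moments(data[4:]))
assert abs(merged["mean"] - 4.5) < 1e-12
assert abs(merged["m2"] / merged["n"] - 8.25) < 1e-12  # population variance of 0..9
```

Because `merge` is associative, any gather-tree depth or width works, which is exactly the communication-pattern flexibility the next slide relies on.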

24 Statistics Design Space Exploration
The algorithm is mostly FLOPs, is not data-dependent, and requires only small amounts of data to be communicated
Small-scale, algorithm-specific design parameters:
- None: local computations are a straightforward implementation of the update formulas
Large-scale design parameters:
- Update formulas provide complete flexibility in communication patterns
- Support arbitrary depth/width of the compute tree
Execution model:
- The initial local compute level is a good candidate for in situ (all data must be transferred otherwise)
- Later local compute levels could be placed anywhere (only very small data sizes are transferred: moments, minima, and maxima)

25 Measurements Obtained with Byfl Confirm the Data-Parallel, Scalable Nature of Statistics Algorithms
[Plot: ALU and FLOP operations per core vs. points per core for the LEJ and HCCI data sets]
[Table: data set, dimensions, num cores, points/core, loads/core (MB), stores/core (B), FLOPs/core, ALU ops/core, and mem ops/core for LEJ (2,025x1,600x400) and HCCI runs at increasing core counts]
Data movement options: all-gather, or gather of 3 KB per processor

26 Visualization Algorithm
Volume rendering of local data to generate partial images:
- Cast a ray from the eye through each pixel of an image
- For each ray, sample the local volume, map data into color values via a transfer function, and accumulate color values
Parallel image compositing to combine the partial images into a final global image:
- Build a communication schedule according to the distribution of pixel data
- Exchange pixel data via communication
- Blend the exchanged pixel data
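The per-ray loop above can be sketched as front-to-back accumulation with the standard "over" operator. The scalar transfer function and the sample values here are stand-ins for the real multi-variate data; real renderers use RGBA colors and opacity correction, but the accumulation pattern is the same.

```python
def transfer_function(v):
    # Hypothetical map: scalar value -> (color intensity, opacity).
    return (v, min(1.0, 0.5 * v))

def cast_ray(samples):
    """Accumulate color/opacity along one ray, front to back."""
    color, alpha = 0.0, 0.0
    for v in samples:                     # samples along the ray, front to back
        c, a = transfer_function(v)
        color += (1.0 - alpha) * a * c    # "over" compositing
        alpha += (1.0 - alpha) * a
        if alpha >= 0.99:                 # early ray termination
            break
    return color, alpha

# A nearly opaque sample near the front dominates the accumulated color.
print(cast_ray([0.2, 0.8, 0.1]))
```

Compositing the partial images across processes applies the same "over" blend per pixel, in depth order, which is why the communication schedule on this slide must respect the data distribution.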

27 Visualization Design Space Exploration
The algorithm is only marginally FLOP-heavy, can be data-dependent, and requires a potentially large number of messages to be exchanged
Small-scale, algorithm-specific design parameters:
- Adaptive workload of local rendering: features specified by users in transfer-function space, features identified by analysis algorithms, data resolution for different exploration purposes
- Optimization and acceleration on CPUs and/or GPUs
Large-scale design parameters:
- Optimize the communication schedule according to the pixel data distribution
- Minimize link contention, pixel data exchanged, and blending cost
- Exploration using MPI and/or OpenMP
Execution model:
- Local rendering of simulation data could be performed in situ to minimize data movement
- Local rendering of feature data could be placed anywhere

28 Visualization Analysis Results
Measurements obtained with Byfl confirm the data-parallel, nearly scalable nature of local rendering. The number of operations varies marginally across cores depending on data features and rendering parameters.
[Table, middle image quality: data set, dimensions, num cores, points/core, loads/core (MB), stores/core (B), FLOPs/core, ALU ops/core, and mem ops/core for the LEJ and HCCI data sets, with per-core variations of roughly 1-6%]
For un-optimized image compositing, the size of messages exchanged across cores depends on the image resolution. For the same image resolution, the message size is nearly the same for different numbers of cores. The messages can be reduced via optimization (measured at high image quality).

29 Successful Execution Hinges on Tight Integration with All the Co-design Components
Separate proxy apps for data analysis and visualization:
- Integration to understand combined behavior
- Further reduction to build skeletal apps
Coordination with data management:
- Efficient data transfer strategies
- Where are the biggest reserves in performance, energy, wall time?
Coordination with DSL:
- Improve local compute patterns
- Fast index computation
- Elimination of unnecessary branches (boundary cases)
UQ analysis:
- How much persistent memory is required?
Modeling capabilities:
- Compilers (Byfl, ROSE)
- Simulation (SST, spreadsheet model)
Solvers:
- Study tradeoffs and cache pollution effects (possible sharing of data structures)

30 Uncertainty Quantification within the SDMA workflow

31 UQ and Data Management
The problem: evaluate sensitivities of quantities of interest (QoIs) with respect to chemistry model parameters, or modeled fields (e.g. reaction rate fields)
The classical approach to solving this problem requires solving P+1 forward simulations, where P is the number of sensitivity evaluations
P can be very large (>> 1000), which makes the classical approach infeasible
Instead, solve one auxiliary problem, the adjoint problem, which is linearized about the primal solution
The primal (forward) solution is needed to solve the adjoint problem

32 Solving the Adjoint Problem
The challenge: solving the adjoint problem requires the primal state
The adjoint problem must be solved backwards in time
The primal state must be stored at every time substep!

33 More Sophisticated Adjoint Solution
To reduce storage requirements:
- Store a limited number of primal states (checkpoints)
- Use the checkpoints to recompute the primal state when needed
Example with two checkpoints:
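The checkpoint/recompute scheme can be sketched as follows: the forward sweep keeps only checkpointed states, and the backward adjoint sweep recomputes each segment forward from the nearest checkpoint. The `step`/`adjoint_step` kernels are placeholder dynamics standing in for the real solver; the data-flow pattern is the point.

```python
def step(u):
    # Placeholder primal time step (stand-in for the solver kernel).
    return u * 0.9 + 1.0

def adjoint_step(lam, u):
    # Placeholder adjoint step, linearized about the primal state u.
    return lam * 0.9 + u

def solve_with_checkpoints(u0, n_steps, checkpoint_every):
    # Forward sweep: keep only checkpointed primal states.
    checkpoints = {0: u0}
    u = u0
    for t in range(n_steps):
        u = step(u)
        if (t + 1) % checkpoint_every == 0:
            checkpoints[t + 1] = u

    # Backward sweep: recompute each segment from its checkpoint as needed.
    lam = 0.0
    for t in reversed(range(n_steps)):
        base = (t // checkpoint_every) * checkpoint_every
        u = checkpoints[base]
        for _ in range(t - base):      # recompute primal state at time t
            u = step(u)
        lam = adjoint_step(lam, u)
    return lam
```

Storage drops from n_steps states to n_steps/checkpoint_every, at the cost of re-running forward segments, which is exactly the storage-vs-recompute tradeoff quantified on the combustion slide.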

34 Storing the full primal solution state is prohibitive (e.g. 1 PB/state)
We are only interested in sensitivities in a limited region in space & time (RoI, e.g. an extinction event)
RoIs are not known a priori:
- Solve the primal problem & identify RoIs using in situ analysis
- Re-solve the primal problem, checkpointing state only from the RoIs
- Only solve the adjoint problem in the RoIs
Example with one region of interest

35 Example from Combustion Use Case
Naïve adjoint solution (store full state in space & time): storage 5 ZB; compute 2x primal problem
One-level checkpointing: storage 4 EB; compute 3x primal problem
Six-level checkpointing: storage 50 PB; compute 8x primal problem
One-level checkpointing on N RoIs: storage (1+1.3N) PB; compute ( N) x primal problem
Important trade-off between storage and recomputing
A proxy app capturing the adjoint solution data flow is being developed

36 Data Management

37 Goals for SDMA in ExaCT
1. Explore data staging techniques to deal with exascale data
2. Design questions for data staging:
   - Where should data for A&V be stored?
   - Where should the A&V operations be executed?
   - How should SDMA integrate with the solver?
   - What architectural features should be leveraged for SDMA?
3. Evaluate the design space

38 The Meta-Skeleton

39 Design Space
[Diagram: solver proxy feeding descriptive stats, visualization, and topological analysis]
Data questions:
1. Where do we move the data to?
2. How do we extract data from the solver?
3. What hardware features can be exploited?
Execution questions:
1. What processing resources are allocated?
2. How do we schedule the execution of these tasks?

40 Storage Scalability and the Power Wall
Disk sizes of TB
Single-disk bandwidth of MB/s
Power consumption of ~45 W/disk
System memory of 32 PB [3]
A full checkpoint every hour => TB/s I/O bandwidth => 277,633 disks => 13 MW of power [1,2]
Without even considering RAID!
1. Power use of disk subsystems in supercomputers. M. L. Curry, H. L. Ward, G. Grider, J. Gemmill, J. Harris, and D. Martinez. Proceedings of the Sixth Workshop on Parallel Data Storage, Nov.
2. G. Grider. Exa-scale FSIO. HEC-FSIO workshop presentation, August 2010.
3. R. Stevens and A. White. A DOE laboratory plan for providing exascale applications and technologies for critical DOE mission needs. SciDAC Workshop, July 2010.
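The disk-count and power figures on this slide can be re-derived. The per-disk sustained bandwidth below is an assumption chosen to approximately reproduce the slide's 277,633-disk figure; the 32 PB system memory and 45 W/disk numbers are the slide's own.

```python
memory_bytes   = 32e15    # 32 PB system memory (slide's figure)
checkpoint_s   = 3600     # one full checkpoint every hour
disk_bw        = 32e6     # assumed ~32 MB/s sustained bandwidth per disk
watts_per_disk = 45       # slide's figure

io_bandwidth = memory_bytes / checkpoint_s      # bytes/s required
disks = io_bandwidth / disk_bw                  # disks needed to sustain it
power_mw = disks * watts_per_disk / 1e6         # aggregate disk power

print(f"{io_bandwidth/1e12:.1f} TB/s, {disks:,.0f} disks, {power_mw:.1f} MW")
```

This yields roughly 8.9 TB/s, ~278k disks, and ~12.5 MW, consistent with the slide's "277,633 disks => 13 MW" conclusion.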

41 Synchronous I/O is Not the Solution
S3D simulation: O(400 PB)/run, O(1M) cores, 1 PB/dump every 30 minutes
Storage space requirements:
- 35 disks for each dump (no RAID)
- 1.5 kW per live dump
Performance requirements:
- 5% overhead: ~31k disks, >1.4 MW
- 10% overhead: ~15k disks, >0.65 MW
- 50% overhead: ~3k disks, >0.14 MW
[Diagram: synchronous I/O feeding downstream analysis: MS-Complex, visualization (volume, surface, particle rendering), Isomap]

42 What EXACTly is Staging?
Extra stage(s) in the data pipeline:
- Use available memory resources to serve as a buffer
- Use available compute resources to serve as an execution target
Traditional staging used discrete nodes, used for:
- High-performance I/O, managing storage variability
- Application coupling
- In transit workflows

43 Keep Data in a Shared Data Space
Maintain data in a shared space in staging
The shared space can share the same memory as the simulation
Multiple analysis and visualization services can access the data

44 Managing Data Movement
Data movement is expected to be a bottleneck at exascale
Use flexible resource allocation to optimize data movement

45 Hybrid Staging
[Diagram: multiple S3D-Box processes, each with in situ analysis and visualization, connected through ADIOS and asynchronous data transfer to a parallel data staging area (coupling/analytics/viz) and to in transit analysis and visualization; compute cores run statistics and topology, in transit resources run visualization]
Use compute and deep-memory hierarchies to optimize the overall workflow for power vs. performance tradeoffs
Utilize hybrid staging for analytics and visualization
Abstract complex/deep memory hierarchy access

46 Hybrid Staging
Hybrid staging is a combination of the available solutions
Classify data processing actions into:
- In line / in situ
- Asynchronous / in situ
- Asynchronous / in staging
- Asynchronous / on disk
What about tasks that span multiple classes?
- Partition the algorithms
- Place tasks to take advantage of data management

47 Resources in Hybrid Staging
Placement of analysis and visualization tasks in a complex system:
- Leverage fast/slow DRAM and local NVRAM/SSD
- Impact of network data movement compared to memory movement
- Network topology impact on performance and power

48 Tradeoffs for Hybrid Staging
Going to disk is slow even for small application sizes
The in line approach adds more overhead to application runtime
The in transit approach gives better overall performance, at the additional cost of data movement
[Chart: normalized total CPU seconds per approach]
- Offline: process data after writing to disk
- In line: process data in place, synchronously with the application
- Staging: move data to staging resources for processing

49 Impact of Task Mapping
Data-centric task mapping: significant savings in the amount of data transferred by co-locating data producers and consumers
[Diagram: timelines for concurrent coupling (CAP1: 512, CAP2: 64 cores) and sequential coupling (SAP1: 512, SAP2: 128, SAP3: 384 cores)]

50 Complex Memory Hierarchy
The workflow integrates knowledge of complex memory hierarchies
Placement decisions are important factors for evaluation
Local NVRAM must be leveraged
Fast-small memory vs. slow-big memory

51

52 Impact of NVRAM
Study how a deep memory hierarchy can be used for an end-to-end I/O analytic pipeline:
- How can NVRAM be used as a staging area?
- How much of each level of the memory hierarchy should be used for the staging area?
- Where to move data (RAM, NVRAM, SSD, disk, network)?
- When (and how frequently) to move the data over the hierarchy?

53 Tradeoff between Frequency and Costs
The frequency of analysis impacts the energy and performance of analysis
NVRAM/disk gap
Not a linear function
Sweet spot is case-dependent
Experiments in collaboration with Steve Poole, ORNL

54 Asynchronous Workflow Impact
The frequency of analysis impacts the energy and performance of analysis
Not a linear function
Bitter spot is case-dependent

55 NVRAM for C/R
Optimizing by hiding data movement to NVM
[Chart: execution time vs. NVM bandwidth per core (MB): run time (sec) and optimized run time (sec)]
NVM per-core data copy bandwidth was assumed to be 450 MB/sec (compared to 2 GB device B/W). The experiment was conducted on a 12-core GHz Intel Xeon node with 48 GB DDR3 memory; to emulate NVM, 24 GB of memory was used.
Experiments in collaboration with HP

56 Deep Memory Tradeoffs
Driving synthetic application benchmark:
- Generates data: allocates two matrices in DRAM and fills them with random data
- Operates on the generated data: runs a kernel over the generated matrices, either multiplication (MUL), addition (ADD), or read access (NOP)
- Manages the generated data: keeps it in RAM or in a staging area (local Fusion-io, remote Fusion-io, HDD, SSD, etc.)
- Asynchronously runs data analysis (reads data from the staging area)
Quality of solutions:
- Frequency of data analysis
- Resources for analysis (accuracy)
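The benchmark's structure can be sketched with stdlib pieces: a producer generates data, runs a kernel, and pushes selected results into a bounded "staging area" (a queue standing in for NVRAM/SSD staging), while an analysis thread drains it asynchronously. Names, sizes, and the toy statistic are illustrative assumptions, not the actual benchmark code.

```python
import queue
import random
import threading

def kernel(a, b, op):
    # Kernel choices mirror the slide's MUL/ADD/NOP options.
    if op == "MUL":
        return [x * y for x, y in zip(a, b)]
    if op == "ADD":
        return [x + y for x, y in zip(a, b)]
    return a                              # NOP: read access only

def run_benchmark(n_steps=8, size=64, op="ADD", analyze_every=2):
    staging = queue.Queue(maxsize=4)      # bounded buffer = staging capacity
    results = []

    def analysis():                       # asynchronous consumer
        while True:
            data = staging.get()
            if data is None:              # sentinel: producer is done
                return
            results.append(sum(data) / len(data))   # a toy statistic

    t = threading.Thread(target=analysis)
    t.start()
    for step in range(n_steps):
        a = [random.random() for _ in range(size)]
        b = [random.random() for _ in range(size)]
        out = kernel(a, b, op)
        if step % analyze_every == 0:     # frequency of analysis is the knob
            staging.put(out)              # blocks when staging is full
    staging.put(None)
    t.join()
    return results

print(len(run_benchmark()))  # 4 analysis outputs: every 2nd of 8 steps staged
```

The `analyze_every` and `maxsize` parameters correspond directly to the frequency-of-analysis and staging-capacity tradeoffs discussed on the surrounding slides.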

57 Co-Design Decisions for Complex Memory
Impact of using slow memory for SDMA processing:
- Power vs. performance tradeoffs
- Size of memory vs. speed of memory
Using a combination of fast memory and slow memory:
- Fast memory size and speed compared to slow memory size
Managing performance at the application level:
- Trade off frequency of analysis against memory usage
- Use a combination of asynchronous and synchronous computation
- Use knowledge of the workflow to study tradeoffs (write vs. read ratios)

58 Dealing with UQ
Our next big target for data management:
- Use analysis to select only the RoI
- Use the feature detection algorithms and their optimization
An ideal candidate for deep memory:
- Data is not used for a long time after output
- Data access is regular and predictable
- Move data to fat/slow memory

59 The Proxy App
[Diagram: original application -> tracing tools -> tracing records -> pattern analyzer -> SKEL code generator (Skel xml files for I/O and the communication phase, plus user's changes) -> desired benchmark]
Skel-2.0: a child of Skel that creates proxies for synchronous I/O
Magically generates a workflow through traced patterns
Under development

60 X-Stack Interactions
Increase engagement with X-Stack projects:
- Dynamic task scheduling: resource allocation, memory and task dependencies
Bring Fast Forwards into the conversation:
- Storage
- Complex memory hierarchies: DRAM, NVRAM, scratchpads, SSDs
- New processing elements: GPUs, MICs

61 Outstanding Questions
GPU/accelerators for SDMA tasks?
- Staging nodes can be customized with additional resources
- Analysis tasks can be split similarly to in situ/in transit
Integration with the solver:
- Data extraction impact on solver performance
- Inline processing can impact the cache; data copies pollute the cache
Exploring asynchronicity in algorithms

62

63 What about UQ?
[Diagram: solver -> store solution -> compute adjoint]
Feed the adjoint computation in reverse order
Need to store the ENTIRE data set

64 UQ v2.0
[Diagram: solver -> store solution -> compute adjoint]
Store a smaller subset of the solution: u(t), from t_i to t_(i-1)
Recompute u(t)

65 UQ v3.0
[Diagram: solver -> filter domains -> store solution -> compute adjoint]
Identify the interesting events
Store an even smaller subset of the solution: u(t), from t_i to t_(i-1)
Recompute u(t)
Compute the adjoint

66


More information

Introduction to Dataflow Computing

Introduction to Dataflow Computing Introduction to Dataflow Computing Maxeler Dataflow Computing Workshop STFC Hartree Centre, June 2013 Programmable Spectrum Control-flow processors Dataflow processor GK110 Single-Core CPU Multi-Core Several-Cores

More information

Accelerating Server Storage Performance on Lenovo ThinkServer

Accelerating Server Storage Performance on Lenovo ThinkServer Accelerating Server Storage Performance on Lenovo ThinkServer Lenovo Enterprise Product Group April 214 Copyright Lenovo 214 LENOVO PROVIDES THIS PUBLICATION AS IS WITHOUT WARRANTY OF ANY KIND, EITHER

More information

The IntelliMagic White Paper: Storage Performance Analysis for an IBM Storwize V7000

The IntelliMagic White Paper: Storage Performance Analysis for an IBM Storwize V7000 The IntelliMagic White Paper: Storage Performance Analysis for an IBM Storwize V7000 Summary: This document describes how to analyze performance on an IBM Storwize V7000. IntelliMagic 2012 Page 1 This

More information

Everything you need to know about flash storage performance

Everything you need to know about flash storage performance Everything you need to know about flash storage performance The unique characteristics of flash make performance validation testing immensely challenging and critically important; follow these best practices

More information

The Design and Implement of Ultra-scale Data Parallel. In-situ Visualization System

The Design and Implement of Ultra-scale Data Parallel. In-situ Visualization System The Design and Implement of Ultra-scale Data Parallel In-situ Visualization System Liu Ning liuning01@ict.ac.cn Gao Guoxian gaoguoxian@ict.ac.cn Zhang Yingping zhangyingping@ict.ac.cn Zhu Dengming mdzhu@ict.ac.cn

More information

In-Situ Bitmaps Generation and Efficient Data Analysis based on Bitmaps. Yu Su, Yi Wang, Gagan Agrawal The Ohio State University

In-Situ Bitmaps Generation and Efficient Data Analysis based on Bitmaps. Yu Su, Yi Wang, Gagan Agrawal The Ohio State University In-Situ Bitmaps Generation and Efficient Data Analysis based on Bitmaps Yu Su, Yi Wang, Gagan Agrawal The Ohio State University Motivation HPC Trends Huge performance gap CPU: extremely fast for generating

More information

New Dimensions in Configurable Computing at runtime simultaneously allows Big Data and fine Grain HPC

New Dimensions in Configurable Computing at runtime simultaneously allows Big Data and fine Grain HPC New Dimensions in Configurable Computing at runtime simultaneously allows Big Data and fine Grain HPC Alan Gara Intel Fellow Exascale Chief Architect Legal Disclaimer Today s presentations contain forward-looking

More information

QLIKVIEW ARCHITECTURE AND SYSTEM RESOURCE USAGE

QLIKVIEW ARCHITECTURE AND SYSTEM RESOURCE USAGE QLIKVIEW ARCHITECTURE AND SYSTEM RESOURCE USAGE QlikView Technical Brief April 2011 www.qlikview.com Introduction This technical brief covers an overview of the QlikView product components and architecture

More information

Overlapping Data Transfer With Application Execution on Clusters

Overlapping Data Transfer With Application Execution on Clusters Overlapping Data Transfer With Application Execution on Clusters Karen L. Reid and Michael Stumm reid@cs.toronto.edu stumm@eecg.toronto.edu Department of Computer Science Department of Electrical and Computer

More information

Removing Performance Bottlenecks in Databases with Red Hat Enterprise Linux and Violin Memory Flash Storage Arrays. Red Hat Performance Engineering

Removing Performance Bottlenecks in Databases with Red Hat Enterprise Linux and Violin Memory Flash Storage Arrays. Red Hat Performance Engineering Removing Performance Bottlenecks in Databases with Red Hat Enterprise Linux and Violin Memory Flash Storage Arrays Red Hat Performance Engineering Version 1.0 August 2013 1801 Varsity Drive Raleigh NC

More information

Power-Aware High-Performance Scientific Computing

Power-Aware High-Performance Scientific Computing Power-Aware High-Performance Scientific Computing Padma Raghavan Scalable Computing Laboratory Department of Computer Science Engineering The Pennsylvania State University http://www.cse.psu.edu/~raghavan

More information

Benchmarking Hadoop & HBase on Violin

Benchmarking Hadoop & HBase on Violin Technical White Paper Report Technical Report Benchmarking Hadoop & HBase on Violin Harnessing Big Data Analytics at the Speed of Memory Version 1.0 Abstract The purpose of benchmarking is to show advantages

More information

Virtuoso and Database Scalability

Virtuoso and Database Scalability Virtuoso and Database Scalability By Orri Erling Table of Contents Abstract Metrics Results Transaction Throughput Initializing 40 warehouses Serial Read Test Conditions Analysis Working Set Effect of

More information

NV-DIMM: Fastest Tier in Your Storage Strategy

NV-DIMM: Fastest Tier in Your Storage Strategy NV-DIMM: Fastest Tier in Your Storage Strategy Introducing ArxCis-NV, a Non-Volatile DIMM Author: Adrian Proctor, Viking Technology [email: adrian.proctor@vikingtechnology.com] This paper reviews how Non-Volatile

More information

LSI MegaRAID CacheCade Performance Evaluation in a Web Server Environment

LSI MegaRAID CacheCade Performance Evaluation in a Web Server Environment LSI MegaRAID CacheCade Performance Evaluation in a Web Server Environment Evaluation report prepared under contract with LSI Corporation Introduction Interest in solid-state storage (SSS) is high, and

More information

Challenges to Obtaining Good Parallel Processing Performance

Challenges to Obtaining Good Parallel Processing Performance Outline: Challenges to Obtaining Good Parallel Processing Performance Coverage: The Parallel Processing Challenge of Finding Enough Parallelism Amdahl s Law: o The parallel speedup of any program is limited

More information

MS SQL Performance (Tuning) Best Practices:

MS SQL Performance (Tuning) Best Practices: MS SQL Performance (Tuning) Best Practices: 1. Don t share the SQL server hardware with other services If other workloads are running on the same server where SQL Server is running, memory and other hardware

More information

Trends in High-Performance Computing for Power Grid Applications

Trends in High-Performance Computing for Power Grid Applications Trends in High-Performance Computing for Power Grid Applications Franz Franchetti ECE, Carnegie Mellon University www.spiral.net Co-Founder, SpiralGen www.spiralgen.com This talk presents my personal views

More information

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging Outline High Performance Computing (HPC) Towards exascale computing: a brief history Challenges in the exascale era Big Data meets HPC Some facts about Big Data Technologies HPC and Big Data converging

More information

PARALLELS CLOUD STORAGE

PARALLELS CLOUD STORAGE PARALLELS CLOUD STORAGE Performance Benchmark Results 1 Table of Contents Executive Summary... Error! Bookmark not defined. Architecture Overview... 3 Key Features... 5 No Special Hardware Requirements...

More information

Performance Monitoring of Parallel Scientific Applications

Performance Monitoring of Parallel Scientific Applications Performance Monitoring of Parallel Scientific Applications Abstract. David Skinner National Energy Research Scientific Computing Center Lawrence Berkeley National Laboratory This paper introduces an infrastructure

More information

Big Fast Data Hadoop acceleration with Flash. June 2013

Big Fast Data Hadoop acceleration with Flash. June 2013 Big Fast Data Hadoop acceleration with Flash June 2013 Agenda The Big Data Problem What is Hadoop Hadoop and Flash The Nytro Solution Test Results The Big Data Problem Big Data Output Facebook Traditional

More information

Maximizing SQL Server Virtualization Performance

Maximizing SQL Server Virtualization Performance Maximizing SQL Server Virtualization Performance Michael Otey Senior Technical Director Windows IT Pro SQL Server Pro 1 What this presentation covers Host configuration guidelines CPU, RAM, networking

More information

Disks and RAID. Profs. Bracy and Van Renesse. based on slides by Prof. Sirer

Disks and RAID. Profs. Bracy and Van Renesse. based on slides by Prof. Sirer Disks and RAID Profs. Bracy and Van Renesse based on slides by Prof. Sirer 50 Years Old! 13th September 1956 The IBM RAMAC 350 Stored less than 5 MByte Reading from a Disk Must specify: cylinder # (distance

More information

Beyond Embarrassingly Parallel Big Data. William Gropp www.cs.illinois.edu/~wgropp

Beyond Embarrassingly Parallel Big Data. William Gropp www.cs.illinois.edu/~wgropp Beyond Embarrassingly Parallel Big Data William Gropp www.cs.illinois.edu/~wgropp Messages Big is big Data driven is an important area, but not all data driven problems are big data (despite current hype).

More information

Boost Database Performance with the Cisco UCS Storage Accelerator

Boost Database Performance with the Cisco UCS Storage Accelerator Boost Database Performance with the Cisco UCS Storage Accelerator Performance Brief February 213 Highlights Industry-leading Performance and Scalability Offloading full or partial database structures to

More information

Speeding Up Cloud/Server Applications Using Flash Memory

Speeding Up Cloud/Server Applications Using Flash Memory Speeding Up Cloud/Server Applications Using Flash Memory Sudipta Sengupta Microsoft Research, Redmond, WA, USA Contains work that is joint with B. Debnath (Univ. of Minnesota) and J. Li (Microsoft Research,

More information

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB Planet Size Data!? Gartner s 10 key IT trends for 2012 unstructured data will grow some 80% over the course of the next

More information

Oracle Database Scalability in VMware ESX VMware ESX 3.5

Oracle Database Scalability in VMware ESX VMware ESX 3.5 Performance Study Oracle Database Scalability in VMware ESX VMware ESX 3.5 Database applications running on individual physical servers represent a large consolidation opportunity. However enterprises

More information

Full and Para Virtualization

Full and Para Virtualization Full and Para Virtualization Dr. Sanjay P. Ahuja, Ph.D. 2010-14 FIS Distinguished Professor of Computer Science School of Computing, UNF x86 Hardware Virtualization The x86 architecture offers four levels

More information

The Classical Architecture. Storage 1 / 36

The Classical Architecture. Storage 1 / 36 1 / 36 The Problem Application Data? Filesystem Logical Drive Physical Drive 2 / 36 Requirements There are different classes of requirements: Data Independence application is shielded from physical storage

More information

Enterprise Applications

Enterprise Applications Enterprise Applications Chi Ho Yue Sorav Bansal Shivnath Babu Amin Firoozshahian EE392C Emerging Applications Study Spring 2003 Functionality Online Transaction Processing (OLTP) Users/apps interacting

More information

Bringing Big Data Modelling into the Hands of Domain Experts

Bringing Big Data Modelling into the Hands of Domain Experts Bringing Big Data Modelling into the Hands of Domain Experts David Willingham Senior Application Engineer MathWorks david.willingham@mathworks.com.au 2015 The MathWorks, Inc. 1 Data is the sword of the

More information

RAMCloud and the Low- Latency Datacenter. John Ousterhout Stanford University

RAMCloud and the Low- Latency Datacenter. John Ousterhout Stanford University RAMCloud and the Low- Latency Datacenter John Ousterhout Stanford University Most important driver for innovation in computer systems: Rise of the datacenter Phase 1: large scale Phase 2: low latency Introduction

More information

GPU File System Encryption Kartik Kulkarni and Eugene Linkov

GPU File System Encryption Kartik Kulkarni and Eugene Linkov GPU File System Encryption Kartik Kulkarni and Eugene Linkov 5/10/2012 SUMMARY. We implemented a file system that encrypts and decrypts files. The implementation uses the AES algorithm computed through

More information

Windows Server Performance Monitoring

Windows Server Performance Monitoring Spot server problems before they are noticed The system s really slow today! How often have you heard that? Finding the solution isn t so easy. The obvious questions to ask are why is it running slowly

More information

Optimizing the Performance of Your Longview Application

Optimizing the Performance of Your Longview Application Optimizing the Performance of Your Longview Application François Lalonde, Director Application Support May 15, 2013 Disclaimer This presentation is provided to you solely for information purposes, is not

More information

Big Data Processing with Google s MapReduce. Alexandru Costan

Big Data Processing with Google s MapReduce. Alexandru Costan 1 Big Data Processing with Google s MapReduce Alexandru Costan Outline Motivation MapReduce programming model Examples MapReduce system architecture Limitations Extensions 2 Motivation Big Data @Google:

More information

Jun Liu, Senior Software Engineer Bianny Bian, Engineering Manager SSG/STO/PAC

Jun Liu, Senior Software Engineer Bianny Bian, Engineering Manager SSG/STO/PAC Jun Liu, Senior Software Engineer Bianny Bian, Engineering Manager SSG/STO/PAC Agenda Quick Overview of Impala Design Challenges of an Impala Deployment Case Study: Use Simulation-Based Approach to Design

More information

VMware Virtual SAN Backup Using VMware vsphere Data Protection Advanced SEPTEMBER 2014

VMware Virtual SAN Backup Using VMware vsphere Data Protection Advanced SEPTEMBER 2014 VMware SAN Backup Using VMware vsphere Data Protection Advanced SEPTEMBER 2014 VMware SAN Backup Using VMware vsphere Table of Contents Introduction.... 3 vsphere Architectural Overview... 4 SAN Backup

More information

Using Synology SSD Technology to Enhance System Performance Synology Inc.

Using Synology SSD Technology to Enhance System Performance Synology Inc. Using Synology SSD Technology to Enhance System Performance Synology Inc. Synology_SSD_Cache_WP_ 20140512 Table of Contents Chapter 1: Enterprise Challenges and SSD Cache as Solution Enterprise Challenges...

More information

A Close Look at PCI Express SSDs. Shirish Jamthe Director of System Engineering Virident Systems, Inc. August 2011

A Close Look at PCI Express SSDs. Shirish Jamthe Director of System Engineering Virident Systems, Inc. August 2011 A Close Look at PCI Express SSDs Shirish Jamthe Director of System Engineering Virident Systems, Inc. August 2011 Macro Datacenter Trends Key driver: Information Processing Data Footprint (PB) CAGR: 100%

More information

itransformer: Using SSD to Improve Disk

itransformer: Using SSD to Improve Disk itransformer: Using SSD to Improve Disk Scheduling for High performance I/O Xuechen Zhang Song Jiang Wayne State t University it Kei Davis Los Alamos National Laboratory Challenges of data management using

More information

The Fusion of Supercomputing and Big Data. Peter Ungaro President & CEO

The Fusion of Supercomputing and Big Data. Peter Ungaro President & CEO The Fusion of Supercomputing and Big Data Peter Ungaro President & CEO The Supercomputing Company Supercomputing Big Data Because some great things never change One other thing that hasn t changed. Cray

More information

OpenMP Programming on ScaleMP

OpenMP Programming on ScaleMP OpenMP Programming on ScaleMP Dirk Schmidl schmidl@rz.rwth-aachen.de Rechen- und Kommunikationszentrum (RZ) MPI vs. OpenMP MPI distributed address space explicit message passing typically code redesign

More information

A Deduplication File System & Course Review

A Deduplication File System & Course Review A Deduplication File System & Course Review Kai Li 12/13/12 Topics A Deduplication File System Review 12/13/12 2 Traditional Data Center Storage Hierarchy Clients Network Server SAN Storage Remote mirror

More information

Intel Data Direct I/O Technology (Intel DDIO): A Primer >

Intel Data Direct I/O Technology (Intel DDIO): A Primer > Intel Data Direct I/O Technology (Intel DDIO): A Primer > Technical Brief February 2012 Revision 1.0 Legal Statements INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE,

More information

Using Synology SSD Technology to Enhance System Performance. Based on DSM 5.2

Using Synology SSD Technology to Enhance System Performance. Based on DSM 5.2 Using Synology SSD Technology to Enhance System Performance Based on DSM 5.2 Table of Contents Chapter 1: Enterprise Challenges and SSD Cache as Solution Enterprise Challenges... 3 SSD Cache as Solution...

More information

Low-Power Amdahl-Balanced Blades for Data-Intensive Computing

Low-Power Amdahl-Balanced Blades for Data-Intensive Computing Thanks to NVIDIA, Microsoft External Research, NSF, Moore Foundation, OCZ Technology Low-Power Amdahl-Balanced Blades for Data-Intensive Computing Alex Szalay, Andreas Terzis, Alainna White, Howie Huang,

More information

HP ProLiant BL660c Gen9 and Microsoft SQL Server 2014 technical brief

HP ProLiant BL660c Gen9 and Microsoft SQL Server 2014 technical brief Technical white paper HP ProLiant BL660c Gen9 and Microsoft SQL Server 2014 technical brief Scale-up your Microsoft SQL Server environment to new heights Table of contents Executive summary... 2 Introduction...

More information

Big Graph Processing: Some Background

Big Graph Processing: Some Background Big Graph Processing: Some Background Bo Wu Colorado School of Mines Part of slides from: Paul Burkhardt (National Security Agency) and Carlos Guestrin (Washington University) Mines CSCI-580, Bo Wu Graphs

More information

Interactive Level-Set Deformation On the GPU

Interactive Level-Set Deformation On the GPU Interactive Level-Set Deformation On the GPU Institute for Data Analysis and Visualization University of California, Davis Problem Statement Goal Interactive system for deformable surface manipulation

More information

The IntelliMagic White Paper on: Storage Performance Analysis for an IBM San Volume Controller (SVC) (IBM V7000)

The IntelliMagic White Paper on: Storage Performance Analysis for an IBM San Volume Controller (SVC) (IBM V7000) The IntelliMagic White Paper on: Storage Performance Analysis for an IBM San Volume Controller (SVC) (IBM V7000) IntelliMagic, Inc. 558 Silicon Drive Ste 101 Southlake, Texas 76092 USA Tel: 214-432-7920

More information

Scaling from Datacenter to Client

Scaling from Datacenter to Client Scaling from Datacenter to Client KeunSoo Jo Sr. Manager Memory Product Planning Samsung Semiconductor Audio-Visual Sponsor Outline SSD Market Overview & Trends - Enterprise What brought us to NVMe Technology

More information

Hybrid parallelism for Weather Research and Forecasting Model on Intel platforms (performance evaluation)

Hybrid parallelism for Weather Research and Forecasting Model on Intel platforms (performance evaluation) Hybrid parallelism for Weather Research and Forecasting Model on Intel platforms (performance evaluation) Roman Dubtsov*, Mark Lubin, Alexander Semenov {roman.s.dubtsov,mark.lubin,alexander.l.semenov}@intel.com

More information

HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK

HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK Steve Oberlin CTO, Accelerated Computing US to Build Two Flagship Supercomputers SUMMIT SIERRA Partnership for Science 100-300 PFLOPS Peak Performance

More information

EMC XTREMIO EXECUTIVE OVERVIEW

EMC XTREMIO EXECUTIVE OVERVIEW EMC XTREMIO EXECUTIVE OVERVIEW COMPANY BACKGROUND XtremIO develops enterprise data storage systems based completely on random access media such as flash solid-state drives (SSDs). By leveraging the underlying

More information

GPU System Architecture. Alan Gray EPCC The University of Edinburgh

GPU System Architecture. Alan Gray EPCC The University of Edinburgh GPU System Architecture EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? GPU-CPU comparison Architectural reasons for GPU performance advantages GPU accelerated systems

More information

HADOOP ON ORACLE ZFS STORAGE A TECHNICAL OVERVIEW

HADOOP ON ORACLE ZFS STORAGE A TECHNICAL OVERVIEW HADOOP ON ORACLE ZFS STORAGE A TECHNICAL OVERVIEW 757 Maleta Lane, Suite 201 Castle Rock, CO 80108 Brett Weninger, Managing Director brett.weninger@adurant.com Dave Smelker, Managing Principal dave.smelker@adurant.com

More information

FPGA-based Multithreading for In-Memory Hash Joins

FPGA-based Multithreading for In-Memory Hash Joins FPGA-based Multithreading for In-Memory Hash Joins Robert J. Halstead, Ildar Absalyamov, Walid A. Najjar, Vassilis J. Tsotras University of California, Riverside Outline Background What are FPGAs Multithreaded

More information

Performance and scalability of a large OLTP workload

Performance and scalability of a large OLTP workload Performance and scalability of a large OLTP workload ii Performance and scalability of a large OLTP workload Contents Performance and scalability of a large OLTP workload with DB2 9 for System z on Linux..............

More information

PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions. Outline. Performance oriented design

PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions. Outline. Performance oriented design PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions Slide 1 Outline Principles for performance oriented design Performance testing Performance tuning General

More information

EFFICIENT EXTERNAL SORTING ON FLASH MEMORY EMBEDDED DEVICES

EFFICIENT EXTERNAL SORTING ON FLASH MEMORY EMBEDDED DEVICES ABSTRACT EFFICIENT EXTERNAL SORTING ON FLASH MEMORY EMBEDDED DEVICES Tyler Cossentine and Ramon Lawrence Department of Computer Science, University of British Columbia Okanagan Kelowna, BC, Canada tcossentine@gmail.com

More information

Why Computers Are Getting Slower (and what we can do about it) Rik van Riel Sr. Software Engineer, Red Hat

Why Computers Are Getting Slower (and what we can do about it) Rik van Riel Sr. Software Engineer, Red Hat Why Computers Are Getting Slower (and what we can do about it) Rik van Riel Sr. Software Engineer, Red Hat Why Computers Are Getting Slower The traditional approach better performance Why computers are

More information

ioscale: The Holy Grail for Hyperscale

ioscale: The Holy Grail for Hyperscale ioscale: The Holy Grail for Hyperscale The New World of Hyperscale Hyperscale describes new cloud computing deployments where hundreds or thousands of distributed servers support millions of remote, often

More information

MAGENTO HOSTING Progressive Server Performance Improvements

MAGENTO HOSTING Progressive Server Performance Improvements MAGENTO HOSTING Progressive Server Performance Improvements Simple Helix, LLC 4092 Memorial Parkway Ste 202 Huntsville, AL 35802 sales@simplehelix.com 1.866.963.0424 www.simplehelix.com 2 Table of Contents

More information

Parallel Processing and Software Performance. Lukáš Marek

Parallel Processing and Software Performance. Lukáš Marek Parallel Processing and Software Performance Lukáš Marek DISTRIBUTED SYSTEMS RESEARCH GROUP http://dsrg.mff.cuni.cz CHARLES UNIVERSITY PRAGUE Faculty of Mathematics and Physics Benchmarking in parallel

More information

Parallel Programming at the Exascale Era: A Case Study on Parallelizing Matrix Assembly For Unstructured Meshes

Parallel Programming at the Exascale Era: A Case Study on Parallelizing Matrix Assembly For Unstructured Meshes Parallel Programming at the Exascale Era: A Case Study on Parallelizing Matrix Assembly For Unstructured Meshes Eric Petit, Loïc Thebault, Quang V. Dinh May 2014 EXA2CT Consortium 2 WPs Organization Proto-Applications

More information

ICRI-CI Retreat Architecture track

ICRI-CI Retreat Architecture track ICRI-CI Retreat Architecture track Uri Weiser June 5 th 2015 - Funnel: Memory Traffic Reduction for Big Data & Machine Learning (Uri) - Accelerators for Big Data & Machine Learning (Ran) - Machine Learning

More information

Flash Memory Arrays Enabling the Virtualized Data Center. July 2010

Flash Memory Arrays Enabling the Virtualized Data Center. July 2010 Flash Memory Arrays Enabling the Virtualized Data Center July 2010 2 Flash Memory Arrays Enabling the Virtualized Data Center This White Paper describes a new product category, the flash Memory Array,

More information

PowerVault MD3 SSD Cache Overview. White Paper

PowerVault MD3 SSD Cache Overview. White Paper PowerVault MD3 SSD Cache Overview White Paper 2012 Dell Inc. All Rights Reserved. PowerVault is a trademark of Dell Inc. 2 Dell PowerVault MD3 SSD Cache Overview Table of contents 1 Overview... 4 2 Architecture...

More information

VALAR: A BENCHMARK SUITE TO STUDY THE DYNAMIC BEHAVIOR OF HETEROGENEOUS SYSTEMS

VALAR: A BENCHMARK SUITE TO STUDY THE DYNAMIC BEHAVIOR OF HETEROGENEOUS SYSTEMS VALAR: A BENCHMARK SUITE TO STUDY THE DYNAMIC BEHAVIOR OF HETEROGENEOUS SYSTEMS Perhaad Mistry, Yash Ukidave, Dana Schaa, David Kaeli Department of Electrical and Computer Engineering Northeastern University,

More information