Big Data Visualization on the MIC

1 Big Data Visualization on the MIC
Tim Dykes, School of Creative Technologies, University of Portsmouth
Many-Core Seminar Series, 26/02/14

2 Splotch Team
- Tim Dykes, University of Portsmouth
- Claudio Gheller, Swiss National Supercomputing Centre
- Marzia Rivi, University of Oxford
- Mel Krokos, University of Portsmouth
- Klaus Dolag, University Observatory Munich
- Martin Reinecke, Max Planck Institute for Astrophysics

3 Contents
- Splotch Overview
- MIC-Splotch Implementation & Optimization
- Performance Measurements
- Further Work

4 Splotch
- Ray-casting algorithm for large datasets
- Primarily for astrophysical N-body simulations
- Applicable to any data representable as point-like elements with attributes [1]
- Particle contribution to image determined using the radiative transfer equation and a Gaussian distribution function [2]
Images:
[1] 3D Modelling and Visualization of Galaxies, by B. Koribalski, C. Gheller and K. Dolag (ATNF, Australia; ETH-CSCS, Switzerland; Univ. Observatory Munich, Germany)
[2] Filaments Connecting Galaxy Clusters, by K. Dolag and M. Reinecke (Univ. Observatory Munich; Max Planck Institute, Germany)
[3] Modified Gravity Models, by Gong-Bo Zhao (Univ. of Portsmouth, UK)

5 Splotch Workflow

6 Notable Challenges for Parallel Implementation
- Potential race conditions due to single pixels affected by many elements
- Load balancing for data spread unevenly throughout the image
  - High concentration of particles in a small area
  - Low concentration of particles spread across a large area

7 Motivations for MIC-Splotch
- Accelerator/coprocessor usage in HPC
- Exploitation of all available hardware
- New architecture
- Comparison to CUDA-Splotch

8 MIC Architecture
- PCIe
- SMP-on-a-chip
- 512-bit wide SIMD
- Up to 61 cores
- Up to 8GB GDDR5
- Up to 244 HW threads
- Lightweight Linux OS

9 MIC Architecture cont.
Parallel Programming Models
- OpenMP
- Intel Cilk Plus
- MPI
- Pthreads
Processing Models Available
- Native: cross compile source to run directly on the device
- Offload: LEO (Language Extensions for Offload)
- Symmetric: use each coprocessor as a node in an MPI cluster, or subdivide the device to contain a series of MPI nodes
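The offload model can be sketched with a single LEO pragma. This is a minimal illustration, not MIC-Splotch code: the function name `scale_on_device` and the scaling loop are hypothetical. The `#pragma offload target(mic)` directive with an `inout(... : length(n))` clause is the Intel compiler's offload syntax; compilers without LEO support typically ignore the unknown pragma, so the block then simply runs on the host.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical example: scale every element of `values`.
// With an offload-capable Intel compiler the block runs on the coprocessor.
void scale_on_device(std::vector<float>& values, float factor) {
    assert(values.size() <= static_cast<std::size_t>(std::numeric_limits<int>::max()));
    float* data = values.data();
    const int n = static_cast<int>(values.size());

    // LEO: copy `data` to the device, execute the block there, copy it back.
    // Scalars such as `n` and `factor` are transferred automatically.
    #pragma offload target(mic) inout(data : length(n))
    {
        for (int i = 0; i < n; ++i) {
            data[i] *= factor;
        }
    }
}
```

Native mode needs no source changes of this kind; the same loop would just be cross compiled for the device.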

10 MIC Execution Model

11 Optimization Methods
- Address memory allocation and transfer issues
- Thread management
- Automatic vectorization
- Manual vectorization

12 Memory Allocation
Problem: dynamic allocation is slow
- Allocate memory at the start of the program and reuse it throughout, rather than deleting and reallocating
Mitigation advice:
- Pre-allocate large buffers
- Allocate with large pages
- Avoid dynamic allocations
This can also reduce page faults and translation look-aside buffer (TLB) misses.
MIC_USE_2MB_BUFFERS=64K enables use of 2MB page sizes for any allocations over 64K
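A minimal sketch of the allocate-once-and-reuse advice, in portable C++. The `Particle` layout and the `ParticleScratch` class are hypothetical names for illustration; the slides do not show how Splotch organises its buffers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical particle record; the real Splotch layout is not shown in the slides.
struct Particle { float x, y, z, r, g, b; };

// Scratch buffer allocated once at start-up and reused for every chunk or frame.
class ParticleScratch {
public:
    explicit ParticleScratch(std::size_t capacity) : storage_(capacity), used_(0) {}

    // Hand out the first n slots without touching the allocator again.
    Particle* acquire(std::size_t n) {
        assert(n <= storage_.size() && "chunk larger than the pre-allocated buffer");
        used_ = n;
        return storage_.data();
    }

    std::size_t used() const { return used_; }
    std::size_t capacity() const { return storage_.size(); }

private:
    std::vector<Particle> storage_;  // the only allocation for the whole run
    std::size_t used_;
};
```

Every call to `acquire` returns the same storage, so per-frame work never pays for a new allocation or for faulting in fresh pages.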

13 Double buffered computation
- Particles processed in chunks
- Computation overlapped with transfer
- Reduced transfer times
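The overlap of transfer and computation can be sketched on the host with two buffers and `std::async`. This is an analogy under stated assumptions, not the MIC-Splotch implementation: `load_chunk` stands in for the host-to-device copy, `process_chunk` stands in for rendering, and both names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Stand-in for the transfer: copies chunk `c` of `src` into `dst`.
static std::size_t load_chunk(const std::vector<float>& src, std::size_t c,
                              std::size_t chunk, std::vector<float>& dst) {
    const std::size_t begin = std::min(c * chunk, src.size());
    const std::size_t end = std::min(begin + chunk, src.size());
    std::copy(src.begin() + begin, src.begin() + end, dst.begin());
    return end - begin;  // number of valid elements now in dst
}

// Stand-in for rendering: reduces the chunk to a single number.
static double process_chunk(const std::vector<float>& buf, std::size_t n) {
    return std::accumulate(buf.begin(), buf.begin() + n, 0.0);
}

double process_double_buffered(const std::vector<float>& src, std::size_t chunk) {
    if (src.empty()) return 0.0;
    assert(chunk > 0);
    const std::size_t chunks = (src.size() + chunk - 1) / chunk;
    std::vector<float> buffers[2] = {std::vector<float>(chunk), std::vector<float>(chunk)};

    double total = 0.0;
    std::size_t valid = load_chunk(src, 0, chunk, buffers[0]);  // prime the first buffer
    for (std::size_t c = 0; c < chunks; ++c) {
        std::vector<float>& current = buffers[c % 2];
        std::vector<float>& next = buffers[(c + 1) % 2];

        // Start loading chunk c+1 into the idle buffer...
        std::future<std::size_t> pending;
        if (c + 1 < chunks) {
            pending = std::async(std::launch::async, [&src, &next, c, chunk] {
                return load_chunk(src, c + 1, chunk, next);
            });
        }
        // ...while chunk c is processed from the other buffer.
        total += process_chunk(current, valid);
        if (pending.valid()) valid = pending.get();
    }
    return total;
}
```

The two buffers swap roles every iteration, so the processing step never waits for data unless loading is the slower of the two stages.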

14 Multithreaded Rendering
Step 1 - Allocate
- Split threads into groups
- Create a full image buffer for each group
- Split images into tiles (T0 ... TN)
Step 2 - Prerender
- Each thread generates a list of particle indices per tile for its allocated subset of the particle data
Step 3 - Render
- Each thread renders all particles from all lists for one particular tile
Step 4
- Image buffers accumulated and transferred back to host
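The prerender and render steps can be sketched as a two-pass tile binning scheme. This is a simplified illustration with assumptions that are not in the slides: each particle touches exactly one pixel (no Gaussian spread across tiles), there is a single thread group and image buffer, and the names `Particle` and `render_tiled` are hypothetical. The OpenMP pragmas take effect only when OpenMP is enabled; otherwise the loops run serially with the same result.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical point particle: pixel position plus a single intensity value.
struct Particle { int px, py; float value; };

std::vector<float> render_tiled(const std::vector<Particle>& particles,
                                int width, int height, int tile, int workers) {
    assert(width > 0 && height > 0 && tile > 0 && workers > 0);
    const int tiles_x = (width + tile - 1) / tile;
    const int tiles_y = (height + tile - 1) / tile;
    const int n_tiles = tiles_x * tiles_y;
    const std::size_t stride = static_cast<std::size_t>(workers);

    // Prerender: worker w bins its particle subset into its own per-tile lists,
    // so no two workers ever write to the same list.
    std::vector<std::vector<std::vector<std::size_t>>> lists(
        stride, std::vector<std::vector<std::size_t>>(static_cast<std::size_t>(n_tiles)));
    #pragma omp parallel for
    for (int w = 0; w < workers; ++w) {
        for (std::size_t i = static_cast<std::size_t>(w); i < particles.size(); i += stride) {
            const Particle& p = particles[i];
            if (p.px < 0 || p.px >= width || p.py < 0 || p.py >= height) continue;
            lists[w][(p.py / tile) * tiles_x + p.px / tile].push_back(i);
        }
    }

    // Render: each tile is owned by one thread, which drains every worker's
    // list for that tile, so no two threads ever write to the same pixel.
    std::vector<float> image(static_cast<std::size_t>(width) * height, 0.0f);
    #pragma omp parallel for
    for (int t = 0; t < n_tiles; ++t) {
        for (int w = 0; w < workers; ++w) {
            for (std::size_t i : lists[w][t]) {
                const Particle& p = particles[i];
                image[static_cast<std::size_t>(p.py) * width + p.px] += p.value;
            }
        }
    }
    return image;
}
```

Giving each tile a single owner is what removes the race on pixels touched by many particles; the cost is the extra pass that builds the index lists.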

15 Vectorization
Aiding Automatic Vectorization
- Data structure organisation
- Data alignment
- Compiler directives
Notes:
- The compiler option -vec-reportX (where X = 0-6) provides detailed information on what has and has not been vectorized, along with suggestions as to why
- The guide to auto-vectorization with Intel C++ compilers is useful at this stage
- Converting from array of structures to structure of arrays provided a 10% performance boost to the rasterization phase
- Data should be aligned to 64-byte boundaries on the host using _mm_malloc() to ensure offload allocation & transfers are also aligned
- __assume_aligned(ptr, 64) informs the compiler that the array being worked on is aligned correctly
- #pragma ivdep tells the compiler to ignore assumed dependencies between the arrays accessed in a loop, i.e. that they do not overlap each other
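The structure-of-arrays conversion and 64-byte alignment can be sketched as follows. The field names and the helpers `to_soa`, `free_soa` and `scale_radius` are hypothetical, and `posix_memalign` is used here as a portable stand-in for the `_mm_malloc()` call named on the slide.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>  // posix_memalign, free
#include <vector>

// Array of structures: the fields of one particle are adjacent in memory.
struct ParticleAoS { float x, y, z, radius; };

// Structure of arrays: each field is contiguous, so a loop over one field is unit-stride.
struct ParticlesSoA {
    float* x = nullptr;
    float* y = nullptr;
    float* z = nullptr;
    float* radius = nullptr;
    std::size_t count = 0;
};

// Allocate n floats on a 64-byte boundary (stand-in for _mm_malloc(bytes, 64)).
float* alloc_aligned_floats(std::size_t n) {
    void* p = nullptr;
    if (posix_memalign(&p, 64, (n ? n : 1) * sizeof(float)) != 0) return nullptr;
    return static_cast<float*>(p);
}

ParticlesSoA to_soa(const std::vector<ParticleAoS>& in) {
    ParticlesSoA out;
    out.count = in.size();
    out.x = alloc_aligned_floats(in.size());
    out.y = alloc_aligned_floats(in.size());
    out.z = alloc_aligned_floats(in.size());
    out.radius = alloc_aligned_floats(in.size());
    assert(out.x && out.y && out.z && out.radius);
    for (std::size_t i = 0; i < in.size(); ++i) {
        out.x[i] = in[i].x;
        out.y[i] = in[i].y;
        out.z[i] = in[i].z;
        out.radius[i] = in[i].radius;
    }
    return out;
}

void free_soa(ParticlesSoA& p) {
    free(p.x); free(p.y); free(p.z); free(p.radius);
    p = ParticlesSoA{};
}

// Unit-stride loop over one aligned array: a good auto-vectorization candidate.
void scale_radius(ParticlesSoA& p, float factor) {
    float* radius = p.radius;
    for (std::size_t i = 0; i < p.count; ++i) radius[i] *= factor;
}
```

With the Intel compiler, `scale_radius` is the kind of loop where `__assume_aligned(radius, 64)` and `#pragma ivdep` would be added; they are left out here so the sketch builds with any compiler.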

16 Vectorization Cont.
Manual Vectorization
- It is difficult to automatically vectorize complex areas of code.
- Intrinsics, mapping directly to the Intel Initial Many Core Instructions (IMCI) set, can be used to manually vectorize code.
- The rendering phase of Splotch is not amenable to automatic vectorization, due to fairly unpredictable unaligned memory access patterns.
- For each particle the spread of affected pixels is calculated, then each column of pixels is rendered.
- A pixel color is calculated by multiplying the color of the particle by a contributive factor. This value is then additively combined with the previous color of the affected pixel.
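The per-pixel update described above can be written as a scalar loop, which is the form the manual vectorization replaces. This is a sketch with hypothetical names (`Color`, `render_column`); how Splotch computes the contributive factor is not shown in the slides, so it is passed in as a precomputed array.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Color { float r, g, b; };

// Add one particle's contribution to a run of consecutive pixels of an image column.
// contribution[i] is the precomputed contributive factor for pixel first + i.
void render_column(std::vector<Color>& image, std::size_t first,
                   const std::vector<float>& contribution, Color particle) {
    assert(first <= image.size());
    for (std::size_t i = 0; i < contribution.size() && first + i < image.size(); ++i) {
        Color& px = image[first + i];
        px.r += particle.r * contribution[i];  // multiply by the factor, then
        px.g += particle.g * contribution[i];  // combine additively with the
        px.b += particle.b * contribution[i];  // previous pixel color
    }
}
```

Each update is a multiply followed by an add on three floats, which is why a fused multiply-add over packed RGB values is a natural fit.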

17 Manual Vectorization Method
Step 1: Pack particle color x5 into __m512 vector container V1
  R G B R G B R G B R G B R G B
  _mm512_setr_ps(r,g,b,r,g,b...)
Step 2: Pack contribution value x3 per pixel into __m512 vector container V2
  C1 C1 C1 C2 C2 C2 C3 C3 C3 C4 C4 C4 C5 C5 C5
  _mm512_setr_ps(c1,c1,c1,c2,c2...)
Step 3: Pack affected pixel colors (up to 5) into __m512 vector container V3
  R G B R G B R G B R G B R G B
  _mm512_setr_ps(p1.r, p1.g, p1.b, p2.r...)
Step 4: Fused multiply-add the vectors, where V3 = (V1*V2) + V3
  _mm512_fmadd_ps(v1,v2,v3)
Step 5: Masked unaligned store of V3 back to memory
  _mm512_mask_packstorelo_ps((void*)&dest[idx], _mask, V3)
  _mm512_mask_packstorehi_ps(((void*)&dest[idx])+64, _mask, V3)
The _mask ensures the unused 16th float in the vector containers is not written to the image
  int mask = 0b ;
  __mmask16 _mask = _mm512_int2mask(mask);
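The five steps can be emulated in portable C++ to make the lane layout explicit. This is not the IMCI intrinsic code: plain 16-element arrays stand in for the 512-bit vectors, and the function name `accumulate_pixels` is hypothetical. The mask is built from the number of pixels, which for five pixels gives 0x7FFF (low 15 bits set); that value is inferred from the description of the unused 16th float, not copied from the slide.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLanes = 16;           // floats in one 512-bit vector
constexpr std::size_t kPixelsPerVector = 5;  // 5 RGB pixels occupy 15 lanes

using Vec16 = std::array<float, kLanes>;

// Accumulate one particle into up to 5 consecutive RGB pixels starting at dest.
// contrib holds one contribution value per pixel.
void accumulate_pixels(float* dest, float r, float g, float b,
                       const float* contrib, std::size_t n_pixels) {
    assert(n_pixels <= kPixelsPerVector);
    const std::size_t active = 3 * n_pixels;
    // One mask bit per lane; lanes beyond the last pixel (always lane 15) stay clear.
    const std::uint16_t mask = static_cast<std::uint16_t>((1u << active) - 1u);

    const float rgb[3] = {r, g, b};
    Vec16 v1{}, v2{}, v3{};
    for (std::size_t lane = 0; lane < active; ++lane) {
        v1[lane] = rgb[lane % 3];      // Step 1: particle color, repeated per pixel
        v2[lane] = contrib[lane / 3];  // Step 2: each contribution repeated x3
        v3[lane] = dest[lane];         // Step 3: current pixel colors
    }
    for (std::size_t lane = 0; lane < kLanes; ++lane) {
        v3[lane] = v1[lane] * v2[lane] + v3[lane];  // Step 4: multiply-add
    }
    for (std::size_t lane = 0; lane < kLanes; ++lane) {
        if (mask & (1u << lane)) dest[lane] = v3[lane];  // Step 5: masked store
    }
}
```

Because the masked-off lanes are never written, the float that follows the last pixel in the image keeps its value even though the vector is 16 lanes wide.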

18 Offloading with MPI
- Data-heavy algorithms can benefit from MPI-based offloading.
- Multiple MPI processes run on the host, sharing a single device or multiple devices.
- This allows chunks of data to be allocated, transferred and processed in parallel, providing a significant performance boost.
- The script to subdivide multiple devices amongst 8 tasks can be unwieldy.

19 Performance Testing
Test System Specification
- 'Dommic' facility at the Swiss National Supercomputing Centre
- 7 nodes
- Each node based on a dual-socket eight-core Intel Xeon E5-2670 running at 2.6GHz, with 32 GB main system memory and two Intel Xeon Phi 5110P coprocessors available
Test Scenario
- ~21 million particle N-body simulation produced using the Gadget code
- 100-frame animation orbiting the dataset
- 8 host MPI processes per device, 2 thread groups of 15 threads each
- 4 OpenMP threads per available core (~236)

20 Results
[Figure] Per-frame time for all phases: host OpenMP 1-16 cores vs single and dual Xeon Phi devices.
[Figure] Per-frame time for rasterization: host OpenMP 1-16 cores vs single and dual Xeon Phi devices.

21 Results Cont.
[Figure] Per-frame processing time comparing MPI, OpenMP and MPI offloading to single and dual Xeon Phi devices.

22 Results Notes
- The best performance boost is seen in the rasterization phase, with a single device outperforming 16 OpenMP threads by ~2.5x.
- Use of a second device provides, as expected, a 2x performance boost in comparison to a single device.
- Per-frame processing time for the current implementation on a single device is comparable to 4 host OpenMP threads, while two devices outperform 16 OpenMP threads due to non-linear scaling of the host OpenMP implementation.
- In comparison with the linearly scaling MPI implementation, each device performs similarly to 4 cores of the host.

23 Further work
- Further optimisation and tuning through use of Intel VTune
- Dynamic thread grouping system
- Comparison against GPU model
- Exploration of MPI running on both host and device (symmetric)

24 References
Splotch Publications
1. Dolag, K., Reinecke, M., Gheller, C., Imboden, S.: Splotch: Visualizing Cosmological Simulations. New Journal of Physics, 10(12) (2008)
2. Jin, Z., Krokos, M., Rivi, M., Gheller, C., Dolag, K., Reinecke, M.: High-Performance Astrophysical Visualization using Splotch. Procedia Computer Science, 1(1) (2010)
3. Rivi, M., Gheller, C., Dykes, T., Krokos, M., Dolag, K.: GPU Accelerated Particle Visualisation with Splotch. To appear in Astronomy and Computing (2014)
4. Dykes, T., Gheller, C., Rivi, M., Krokos, M.: Big Data Visualization on the Xeon Phi. Submitted to International Supercomputing Conference (2014)
The Splotch Code