Big Data Visualization on the MIC




Big Data Visualization on the MIC. Tim Dykes, School of Creative Technologies, University of Portsmouth, timothy.dykes@port.ac.uk. Many-Core Seminar Series, 26/02/14.

Splotch Team: Tim Dykes, University of Portsmouth; Claudio Gheller, Swiss National Supercomputing Centre; Marzia Rivi, University of Oxford; Mel Krokos, University of Portsmouth; Klaus Dolag, University Observatory Munich; Martin Reinecke, Max Planck Institute for Astrophysics.

Contents: Splotch Overview; MIC-Splotch Implementation & Optimization; Performance Measurements; Further Work.

Splotch: a ray-casting algorithm for large datasets, primarily for astrophysical N-body simulations, but applicable to any data representable as point-like elements with attributes [1]. A particle's contribution to the image is determined using the radiative transfer equation and a Gaussian distribution function [2]. Image credits: [1] 3D Modelling and Visualization of Galaxies, by B. Koribalski, C. Gheller and K. Dolag (ATNF, Australia; ETH-CSCS, Switzerland; Univ. Observatory Munich, Germany); [2] Filaments Connecting Galaxy Clusters, by K. Dolag and M. Reinecke (Univ. Observatory Munich; Max Planck Institute, Germany); [3] Modified Gravity Models, by Gong-Bo Zhao (Univ. of Portsmouth, UK).
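For orientation, a schematic form of the radiative transfer equation and the Gaussian kernel mentioned above is sketched below; this is only the standard textbook form with illustrative symbols, and the exact expressions used by Splotch are given in Dolag et al. (2008), reference 1 at the end.

```latex
% Schematic only: I(r) is the intensity along a ray, E and A are emission and
% absorption coefficients, and the density field rho is built from particles
% whose contribution falls off as a Gaussian of radius sigma_p about x_p.
\frac{\mathrm{d}I(r)}{\mathrm{d}r} = \rho(r)\,\bigl(E - A\,I(r)\bigr),
\qquad
\rho_p(\mathbf{x}) \;\propto\; \exp\!\left(-\frac{\lVert \mathbf{x}-\mathbf{x}_p \rVert^{2}}{\sigma_p^{2}}\right)
```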

Splotch Workflow

Notable Challenges for Parallel Implementation: potential race conditions, because a single pixel may be affected by many elements; and load balancing, because data is spread unevenly throughout the image (a high concentration of particles in a small area, or a low concentration of particles spread across a large area).

Motivations for MIC-Splotch: accelerator/coprocessor usage in HPC; exploitation of all available hardware; a new architecture; comparison to CUDA-Splotch.

MIC Architecture: PCIe SMP-on-a-chip; 512-bit wide SIMD; up to 61 cores; up to 8 GB GDDR5; up to 244 hardware threads; lightweight Linux OS.

MIC Architecture cont. Parallel programming models: OpenMP, Intel Cilk Plus, MPI, Pthreads. Processing models available: Native (cross-compile the source to run directly on the device); Offload (LEO, Language Extensions for Offload); Symmetric (use each coprocessor as a node in an MPI cluster, or subdivide the device into a series of MPI nodes). A minimal offload example is sketched below.
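As a minimal illustration of the offload (LEO) processing model listed above, the following sketch (not taken from the Splotch source; all names are illustrative) marks a function for device compilation and offloads a simple loop to the coprocessor:

```cpp
// Minimal LEO offload sketch for the Intel compiler: the pragma copies 'a' to
// the device, runs 'square' there using the coprocessor's threads, and copies
// the result back to the host.
#include <cstdio>

__attribute__((target(mic)))        // also compile this function for the device
void square(float* a, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        a[i] *= a[i];
}

int main()
{
    const int n = 1024;
    float a[n];
    for (int i = 0; i < n; ++i) a[i] = (float)i;

    #pragma offload target(mic) inout(a : length(n))
    square(a, n);

    std::printf("a[10] = %f\n", a[10]);
    return 0;
}
```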

MIC Execution Model

Optimization Methods: address memory allocation and transfer issues; thread management; automatic vectorization; manual vectorization.

Memory Allocation. Problem: dynamic allocation is slow. Allocate memory at the start of the program and reuse it throughout, rather than deleting and reallocating. Mitigation advice: pre-allocate large buffers; allocate with large pages; avoid dynamic allocations. This can also reduce page faults and translation look-aside buffer (TLB) misses! Setting MIC_USE_2MB_BUFFERS=64K enables 2 MB page sizes for any allocation over 64K. The sketch below shows how a device buffer can be kept alive and reused.
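A sketch of the pre-allocate-and-reuse advice (illustrative names, not the Splotch code): LEO's alloc_if/free_if modifiers keep a device buffer alive across offloads, so it is allocated once and reused for every frame.

```cpp
// Allocate a persistent device buffer once, reuse it for every frame's
// transfer and rasterization, and free it only at the end of the run.
__attribute__((target(mic))) void rasterize_chunk(float* p, int n);

void render_frames(float* particles, int n, int nframes)
{
    // One-off device allocation: no data is moved, memory is just reserved.
    #pragma offload_transfer target(mic:0) \
            nocopy(particles[0:n] : alloc_if(1) free_if(0))

    for (int f = 0; f < nframes; ++f) {
        // Reuse the existing device buffer: alloc_if(0) free_if(0).
        #pragma offload target(mic:0) \
                in(particles[0:n] : alloc_if(0) free_if(0))
        rasterize_chunk(particles, n);
    }

    // Release the device buffer once all frames have been rendered.
    #pragma offload_transfer target(mic:0) \
            nocopy(particles[0:n] : alloc_if(0) free_if(1))
}
```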

Double-buffered computation: particles are processed in chunks, and computation is overlapped with transfer, reducing transfer overheads (see the sketch below).
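A hedged sketch of the double-buffering scheme (illustrative names, not the Splotch implementation): while one chunk is being rasterized on the device, the transfer of the next chunk proceeds asynchronously via LEO's signal/wait clauses.

```cpp
#include <algorithm>   // std::swap

const int CHUNK = 1 << 20;                          // illustrative chunk size
__attribute__((target(mic))) void rasterize_chunk(float* p, int n);
void fill_chunk(float* dst, int chunk_id);          // hypothetical host-side loader

void render_all(float* bufA, float* bufB, int nchunks)
{
    float *cur = bufA, *nxt = bufB;

    // Reserve a persistent device buffer for each host pointer.
    #pragma offload_transfer target(mic:0) \
            nocopy(bufA[0:CHUNK] : alloc_if(1) free_if(0)) \
            nocopy(bufB[0:CHUNK] : alloc_if(1) free_if(0))

    fill_chunk(cur, 0);
    #pragma offload_transfer target(mic:0) \
            in(cur[0:CHUNK] : alloc_if(0) free_if(0)) signal(cur)

    for (int i = 0; i < nchunks; ++i) {
        if (i + 1 < nchunks) {
            fill_chunk(nxt, i + 1);                 // prepare the next chunk on the host
            #pragma offload_transfer target(mic:0) \
                    in(nxt[0:CHUNK] : alloc_if(0) free_if(0)) signal(nxt)
        }
        // Wait for the current chunk's transfer, then process it on the device.
        #pragma offload target(mic:0) wait(cur) \
                nocopy(cur[0:CHUNK] : alloc_if(0) free_if(0))
        rasterize_chunk(cur, CHUNK);

        std::swap(cur, nxt);                        // flip buffers for the next pass
    }
}
```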

Multithreaded Rendering. Step 1 - Allocate: split threads into groups; create a full image buffer for each group; split the images into tiles. Step 2 - Prerender: each thread generates a list of particle indices per tile for the subset of particle data it is allocated. Step 3 - Render: each thread renders all particles from all lists for one particular tile. Step 4: the image buffers are accumulated and transferred back to the host. A sketch of this scheme follows.
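A hedged OpenMP sketch of the tiled scheme above (illustrative types and helpers, not the Splotch source). For simplicity it uses a single thread group and image buffer, and assumes each particle falls within one tile, whereas the real renderer must also handle particles that straddle tile boundaries.

```cpp
#include <vector>
#include <omp.h>

struct Particle { float x, y, r, g, b; };            // illustrative
struct Color    { float r, g, b; };

int  tile_of(const Particle& p, int ntiles);          // hypothetical: tile index of p
void splat(const Particle& p, Color* image);          // hypothetical: blend p into image

void render_tiles(const Particle* p, int np, Color* image, int ntiles)
{
    const int nthreads = omp_get_max_threads();
    // lists[t][tile] = indices of particles from thread t's subset hitting 'tile'.
    std::vector<std::vector<std::vector<int>>> lists(
        nthreads, std::vector<std::vector<int>>(ntiles));

    #pragma omp parallel
    {
        const int t = omp_get_thread_num();

        // Step 2 - Prerender: bin this thread's share of particles by tile.
        // The implicit barrier of the loop ensures binning finishes before rendering.
        #pragma omp for schedule(static)
        for (int i = 0; i < np; ++i)
            lists[t][tile_of(p[i], ntiles)].push_back(i);

        // Step 3 - Render: one tile per thread, gathering that tile's indices from
        // every thread's list, so each pixel is only ever written by a single thread.
        #pragma omp for schedule(dynamic)
        for (int tile = 0; tile < ntiles; ++tile)
            for (int src = 0; src < nthreads; ++src)
                for (int idx : lists[src][tile])
                    splat(p[idx], image);
    }
}
```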

Vectorization. Aiding automatic vectorization: data structure organisation, data alignment, compiler directives. The compiler option -vec-reportX (where X = 0-6) provides detailed information on what has and has not been vectorized, along with suggestions as to why; the Intel C++ auto-vectorization guide is useful at this stage. Converting from an array of structures to a structure of arrays provided a 10% performance boost to the rasterization phase. Data should be aligned to 64-byte boundaries on the host using _mm_malloc(), to ensure offload allocations and transfers are also aligned. The __assume_aligned(ptr, 64) directive informs the compiler that the array being worked on is correctly aligned, and #pragma ivdep tells it to ignore assumed (unproven) dependencies, e.g. when pointers are known not to overlap. These hints are sketched below.
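A sketch of these hints with illustrative field names (not the Splotch data layout): structure-of-arrays storage, 64-byte aligned allocation via _mm_malloc(), and per-loop alignment/dependence hints for the Intel compiler.

```cpp
#include <mm_malloc.h>

struct ParticlesSoA {                // structure of arrays rather than array of structs
    float *x, *y, *z, *intensity;
};

ParticlesSoA alloc_particles(int n)
{
    ParticlesSoA p;
    p.x         = (float*)_mm_malloc(n * sizeof(float), 64);   // 64-byte aligned
    p.y         = (float*)_mm_malloc(n * sizeof(float), 64);
    p.z         = (float*)_mm_malloc(n * sizeof(float), 64);
    p.intensity = (float*)_mm_malloc(n * sizeof(float), 64);
    return p;
}

void scale_intensity(ParticlesSoA p, int n, float factor)
{
    float* in = p.intensity;
    __assume_aligned(in, 64);        // promise 64-byte alignment (Intel compiler)
    #pragma ivdep                    // no loop-carried dependencies in this loop
    for (int i = 0; i < n; ++i)
        in[i] *= factor;
}
```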

Vectorization cont. Manual vectorization: it is difficult to automatically vectorize complex areas of code. Intrinsics, mapping directly to the Intel Initial Many Core Instructions (IMCI) set, can be used to vectorize code by hand. The rendering phase of Splotch is not amenable to automatic vectorization due to fairly unpredictable, unaligned memory access patterns. For each particle, the spread of affected pixels is calculated and then each column of pixels is rendered: a pixel colour is calculated by multiplying the colour of the particle by a contributive factor, and this value is then additively combined with the previous colour of the affected pixel.

Manual Vectorization Method. Step 1: pack the particle colour five times into an __m512 vector V1 = [R G B R G B R G B R G B R G B], e.g. _mm512_setr_ps(r,g,b,r,g,b,...). Step 2: pack the contribution value, three copies per pixel, into an __m512 vector V2 = [C1 C1 C1 C2 C2 C2 C3 C3 C3 C4 C4 C4 C5 C5 C5], e.g. _mm512_setr_ps(c1,c1,c1,c2,c2,...). Step 3: pack the colours of the (up to 5) affected pixels into an __m512 vector V3 = [R G B R G B R G B R G B R G B], e.g. _mm512_setr_ps(p1.r, p1.g, p1.b, p2.r, ...). Step 4: fused multiply-add the vectors, V3 = (V1 * V2) + V3, using _mm512_fmadd_ps(V1, V2, V3). Step 5: masked unaligned store of V3 back to memory: _mm512_mask_packstorelo_ps((void*)&dest[idx], _mask, V3); _mm512_mask_packstorehi_ps(((void*)&dest[idx]) + 64, _mask, V3). The mask ensures the unused 16th float in the vector containers is not written to the image: int mask = 0b0111111111111111; __mmask16 _mask = _mm512_int2mask(mask);
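Putting the five steps together, a hedged sketch using the KNC intrinsics named above (buffer layout and names are illustrative, not lifted from the Splotch source): dest is the flat RGB image buffer, and c[5] holds the per-pixel contribution factors for one column of five pixels.

```cpp
#include <immintrin.h>

void blend_column(float* dest, int idx,             // first float of 5 RGB pixels
                  float r, float g, float b,        // particle colour
                  const float c[5])                 // contribution per pixel
{
    // 15 active lanes (5 pixels x RGB); the 16th lane is masked out.
    const __mmask16 mask = _mm512_int2mask(0x7FFF);

    // Step 1: particle colour replicated five times.
    __m512 v1 = _mm512_setr_ps(r, g, b, r, g, b, r, g, b,
                               r, g, b, r, g, b, 0.0f);

    // Step 2: contribution factor, three copies per pixel.
    __m512 v2 = _mm512_setr_ps(c[0], c[0], c[0], c[1], c[1], c[1],
                               c[2], c[2], c[2], c[3], c[3], c[3],
                               c[4], c[4], c[4], 0.0f);

    // Step 3: current colours of the five affected pixels.
    __m512 v3 = _mm512_setr_ps(dest[idx+0],  dest[idx+1],  dest[idx+2],
                               dest[idx+3],  dest[idx+4],  dest[idx+5],
                               dest[idx+6],  dest[idx+7],  dest[idx+8],
                               dest[idx+9],  dest[idx+10], dest[idx+11],
                               dest[idx+12], dest[idx+13], dest[idx+14], 0.0f);

    // Step 4: fused multiply-add, v3 = v1 * v2 + v3.
    v3 = _mm512_fmadd_ps(v1, v2, v3);

    // Step 5: masked unaligned store back to the image buffer.
    _mm512_mask_packstorelo_ps((void*)&dest[idx], mask, v3);
    _mm512_mask_packstorehi_ps((char*)&dest[idx] + 64, mask, v3);
}
```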

Offloading with MPI. Data-heavy algorithms can benefit from MPI-based offloading: multiple MPI processes run on the host, sharing a single device or multiple devices. This allows chunks of data to be allocated, transferred and processed in parallel, providing a significant performance boost. The scripting needed to subdivide multiple devices amongst 8 tasks can be unwieldy. A sketch of per-rank device selection follows.
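A hedged sketch of per-rank device selection (illustrative names, not the Splotch driver): each host MPI rank picks a coprocessor by rank, so e.g. 8 ranks over 2 devices gives 4 ranks per device, and their transfers and offloads proceed in parallel.

```cpp
#include <mpi.h>
#include <offload.h>                 // _Offload_number_of_devices()

__attribute__((target(mic))) void rasterize_chunk(float* p, int n);
float* load_my_chunk(int rank, int nranks, int* n);   // hypothetical host loader

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const int ndev = _Offload_number_of_devices();
    const int dev  = rank % ndev;    // share the available coprocessors by rank

    int n = 0;
    float* chunk = load_my_chunk(rank, nranks, &n);

    // Each rank offloads its own chunk of particles to its assigned device.
    #pragma offload target(mic:dev) in(chunk[0:n])
    rasterize_chunk(chunk, n);

    MPI_Finalize();
    return 0;
}
```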

Performance Testing. Test system specification: the 'Dommic' facility at the Swiss National Supercomputing Centre, 7 nodes, each based on a dual-socket eight-core Intel Xeon E5-2670 running at 2.6 GHz, with 32 GB of main system memory and two Intel Xeon Phi 5110 coprocessors. Test scenario: a ~21 million particle N-body simulation produced using the Gadget code; a 100-frame animation orbiting the dataset; 8 host MPI processes per device; 2 thread groups of 15 threads each; 4 OpenMP threads per available core (~236 threads).

Results. Figures: per-frame time for all phases, host OpenMP 1-16 cores vs single and dual Xeon Phi devices; per-frame time for rasterization, host OpenMP 1-16 cores vs single and dual Xeon Phi devices.

Results Cont. Per-Frame processing time comparing MPI, OpenMP and MPI offloading to single and dual Xeon Phi devices.

Results Notes. The best performance boost is seen in the rasterization phase, with a single device outperforming 16 OpenMP threads by ~2.5x. Use of a second device provides, as expected, a 2x performance boost in comparison to a single device. Per-frame processing times for the current implementation compare to 4 host OpenMP threads for a single device, while two devices outperform 16 OpenMP threads due to the non-linear scaling of the host OpenMP implementation. In comparison with the linearly scaling MPI implementation, each device performs similarly to 4 cores of the host.

Further Work: further optimisation and tuning through use of Intel VTune; a dynamic thread grouping system; comparison against the GPU model; exploration of MPI running on both host and device (symmetric mode).

References. Splotch publications:
1. Dolag, K., Reinecke, M., Gheller, C., Imboden, S.: Splotch: Visualizing Cosmological Simulations. New Journal of Physics, 10(12), id. 125006 (2008)
2. Jin, Z., Krokos, M., Rivi, M., Gheller, C., Dolag, K., Reinecke, M.: High-Performance Astrophysical Visualization using Splotch. Procedia Computer Science, 1(1), 1775-1784 (2010)
3. Rivi, M., Gheller, C., Dykes, T., Krokos, M., Dolag, K.: GPU Accelerated Particle Visualisation with Splotch. To appear in Astronomy and Computing (2014)
4. Dykes, T., Gheller, C., Rivi, M., Krokos, M.: Big Data Visualization on the Xeon Phi. Submitted to International Supercomputing Conference (2014)
The Splotch code: https://github.com/splotchviz/splotch