# Algorithms of Scientific Computing II


Technische Universität München, WS 2010/2011
Institut für Informatik
Prof. Dr. Hans-Joachim Bungartz
Alexander Heinecke, M.Sc., M.Sc.w.H.

Algorithms of Scientific Computing II
Exercise 4 - Hardware-aware Implementations and the Barnes-Hut Method

1) Today's Systems - Compute Power

Today's supercomputers are highly parallel systems. Please read LRZ's web page and identify the different levels of parallelism in SuperMUC!

Vector register; hardware thread; core; processor; node; island; whole system.

Which levels of parallelism are not documented there, but can be found in the Intel Architecture Manual?

Instruction-level parallelism: a core is able to decode several instructions at a time (e.g. four-way decode). Since it features several execution units, the so-called out-of-order unit is able to run different instructions on different units out of order and in parallel. Please refer to Fig. 1 and Fig. 2 for examples.

In total we are now able to calculate the peak performance (double precision) of a node in SuperMUC:

2 (out-of-order: mul and add) x 4 (vectorization) x 8 (cores per package) x 2.7 (clock speed in GHz) x 2 (two sockets per node) = 345.6 GFLOPS per node, i.e. about 3.19 PFLOPS for the whole system.
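The peak-performance factors above multiply out as in the following sketch (the node count of 9216 thin nodes is an assumption taken from LRZ's published SuperMUC configuration, not from the exercise text):

```cpp
#include <cassert>
#include <cmath>

// Double-precision peak of one SuperMUC thin node (Sandy Bridge-EP):
// 2 (mul + add per cycle) * 4 (AVX doubles) * 8 (cores) * 2.7 (GHz) * 2 (sockets)
double node_peak_gflops() {
    return 2.0 * 4.0 * 8.0 * 2.7 * 2.0; // 345.6 GFLOPS
}

// Whole-system peak, assuming 9216 thin nodes (LRZ configuration).
double system_peak_pflops() {
    return node_peak_gflops() * 9216.0 / 1.0e6; // GFLOPS -> PFLOPS
}
```

Multiplying the factors gives 345.6 GFLOPS per node, and with 9216 nodes roughly 3.19 PFLOPS for the whole system.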

[Abbildung 1: Intel Sandy Bridge architecture: branch prediction, 32 KB instruction cache, instruction fetch and pre-decode, decoded-uop cache (1.5k uops), 4-way decode, rename/allocate/retirement (reorder buffer: 168 entries), scheduler with physical register file (256-bit VPU registers, 160 integer registers), six issue ports feeding ALUs, SSE/AVX FP multiply, add, shuffle, blend and divide units as well as load/store units, 256 KB L2 cache (MLC) and 32 KB data cache. Changes in comparison to the Intel Nehalem architecture are highlighted in orange; based on descriptions from Intel.]

[Abbildung 2: AMD Bulldozer architecture: shared instruction fetch with 64 KB instruction cache and branch prediction, 4-way decode, two integer schedulers (each with two ALUs and two AGUs) with their own retirement and 16 KB data caches, a shared 60-entry floating-point scheduler with two 128-bit FMAC units for SSE/x87, load/store units with data prefetching, and a 2 MB shared cache; based on descriptions from AMD.]

2) Today's Systems - Memory Subsystem

Please sketch the memory subsystem / network structure of SuperMUC. Is there a different network topology available on another supercomputer?

A single node features a state-of-the-art NUMA architecture as displayed in Fig. 3, which implements a shared memory but with different access times: local vs. remote. A remote access takes roughly 10-20% longer than a local access. Operating systems like Linux employ a so-called first-touch policy: the NUMA node which touches a page first is the owner of this page. On each socket we are able to achieve a streaming bandwidth of more than 40 GByte/s.

[Abbildung 3: Intel Sandy Bridge NUMA architecture: two sockets, each with 8 cores / 16 hardware threads, a 20 MB shared L3 cache and 4 channels of DDR3-1600, coupled via QPI at 8.0 GT/s.]

For inter-node communication, SuperMUC features an Infiniband network. Infiniband offers low latencies (1.2 us) at more than 6 GByte/s bandwidth. The overall network is organized as a pruned fat tree: 512 nodes form a so-called island with one high-end switch; 18 islands are connected via a 1-level tree. Other systems, like IBM's BlueGene or Cray's XK6, feature custom (routerless) interconnects in order to form a 2D/3D mesh. If a scientific application (e.g. based on regular grids or matrices) matches this network, only low overhead for communication is measured.

What is the peak performance of the system if a memory-bound application is executed?

SuperMUC features four channels of DDR3-1600 RAM per socket. One channel is able to deliver 12.8 GByte/s, which results in a total bandwidth of 51.2 GByte/s per socket, i.e. 102.4 GByte/s per node. Let's execute a matrix-vector multiplication: for an N x N matrix we have to calculate N scalar products. For each scalar product we have to read 2N values, which amounts to 16N bytes, and we execute N FLOP. Therefore we are able to achieve 102.4/16 GFLOPS = 6.4 GFLOPS (1.8% of peak performance!!!).

Which hardware feature may help to improve this performance?

Caches!

[Abbildung 4: cache hierarchy]

Let's assume we can store the vectors x and b of our matrix-vector multiplication in the cache; then we just have to stream A. This means for a scalar product we have to read only 8N bytes, which results in a performance of 12.8 GFLOPS. In order to unleash the performance of today's chips, cache awareness is a must.
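The bandwidth argument can be condensed into a tiny roofline-style estimator (a sketch using the numbers from the text: 102.4 GByte/s per node, one FLOP per 16 bytes without caching, one FLOP per 8 bytes with the vectors cached):

```cpp
#include <cassert>

// Attainable performance of a purely memory-bound kernel:
// streaming bandwidth (GByte/s) times arithmetic intensity (FLOP/byte).
double memory_bound_gflops(double bandwidth_gbs, double flops_per_byte) {
    return bandwidth_gbs * flops_per_byte;
}
```

For the matrix-vector product this gives 102.4 * 1/16 = 6.4 GFLOPS when everything is streamed, and 102.4 * 1/8 = 12.8 GFLOPS once only A has to be streamed.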

3) Vectorization of Linked-Cells Particle Simulations

Recently C++ has established itself as one of the standard languages for new development of HPC codes, apart from FORTRAN and C. Please sketch the software layout of a linked-cells C++ implementation.

Classes for: Particle, ParticleContainer, Cell, ForceCalculation.

Which problems do you face when it comes to low-level kernels and data layout?

Interactions take place on a particle basis. The ForceCalculation class gets two references and computes and updates velocities and forces for each particle. This leads to a so-called Array of Structures (AoS), which is performance-killing. In order to speed up calculations it's useful to flip this layout, which leads to a cell-based Structure of Arrays (SoA), see Fig. 5.

[Abbildung 5: AoS to SoA conversion: interleaved per-particle records (x_1, y_1, z_1, v_x_1, ..., f_z_1) are rearranged into contiguous per-component arrays (x_1, x_2, x_3, ...; y_1, y_2, y_3, ...) matching 64-byte (512-bit) lines.]

Can you use vectorization in order to speed up calculations?

Given a SoA storage, we are able to vectorize the LJ-12-6 kernel across particle pairs. Please note that this approach scales with the number of particles per cell and neither requires zero-padding nor is limited to a vector length of four, as other methods are. In our case the only limitations are the number of particles per cell, the size of the cutoff radius r_cutoff and the cell length r_c, as these factors influence the vector load.
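The AoS-to-SoA flip can be sketched in C++ as follows (class and member names are illustrative, not the ones of the actual codebase):

```cpp
#include <cassert>
#include <vector>

// Array of Structures: one record per particle. The x-coordinates of
// neighbouring particles are sizeof(ParticleAoS) = 72 bytes apart, so a
// SIMD register cannot be filled with a single contiguous load.
struct ParticleAoS {
    double x, y, z;     // position
    double vx, vy, vz;  // velocity
    double fx, fy, fz;  // force accumulator
};

// Structure of Arrays, per cell: every component is stored contiguously,
// so vector loads/stores across the particles of a cell become trivial.
struct CellSoA {
    std::vector<double> x, y, z;
    std::vector<double> vx, vy, vz;
    std::vector<double> fx, fy, fz;

    void add(const ParticleAoS& p) {
        x.push_back(p.x);   y.push_back(p.y);   z.push_back(p.z);
        vx.push_back(p.vx); vy.push_back(p.vy); vz.push_back(p.vz);
        fx.push_back(p.fx); fy.push_back(p.fy); fz.push_back(p.fz);
    }
};
```

After the flip, the force kernel iterates over the component arrays of a cell instead of over particle objects, which is exactly the access pattern SSE/AVX loads want.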

[Abbildung 6: Vectorization of the LJ-12-6 force calculation kernel: distances d_a_1, d_a_2 between particle a and particles 1 and 2 are calculated from the coordinate registers, compared against r_cutoff, and the computed forces f_x, f_y, f_z are masked to zero where the cutoff is exceeded.]

Figure 6 sketches the applied vectorization of the LJ-12-6 calculations. As we need to deal with double-precision floating-point numbers, we can store a maximum of two elements per SSE register. The calculation is performed on particle pairs; therefore we broadcast-load (duplicating by vector length) the required data of one particle into the first register (a), and the second register is filled with data of two different particles (1 and 2). With this approach we can theoretically reduce the number of operations by a factor of two. Since we are calculating two interactions within one step, we need to apply some pre- and post-processing, which can be performed by regular logical operations directly in hardware: the SSE/AVX instruction set offers compare, logical and sign-bit operations that can be used to decide if the force calculation should be initiated at all (at least one particle-pair distance is smaller than r_cutoff: pre-processing) and if calculated results need to be zeroed by a mask (a particle-pair distance is greater than r_cutoff: post-processing). In case of SSE this should already gain a factor of two for small particle numbers. The lower performance bound is the scalar performance, as these algorithms will not execute more instructions than necessary in the scalar case.

Additional exercise:

4) Complexity of the Barnes-Hut Method

In the lecture the costs for the d-dimensional Barnes-Hut method were given as O(θ^-d N log N). First derive the costs for the two-dimensional Barnes-Hut method. Then explain descriptively (without proof) why the formula is also correct for d = 3.

The outer loop of the Barnes-Hut algorithm iterates over all particles and calculates the force effective on every particle. As there are N particles in total, for each of those particles the costs have to be O(θ^-d log N). For the given complexity of O(θ^-d N log N) we assume that the tree is not degenerated, thus the tree

has O(log N) levels. If we can show that at each level the costs are O(θ^-d), the formula is proven valid.

We start out from the two-dimensional case. First we consider an arbitrary level and determine the number of nodes (i.e. particles or pseudo-particles) we have to consider. Those are exactly the nodes whose parent node didn't fulfill the θ-criterion. For the parent node it has to hold:

diam / distance > θ.     (1)

Thus it follows for the distance:

distance < diam / θ.

We will now try to determine an upper bound for the number of parent cells which fulfill that condition. Therefore we consider only one coordinate axis first. Not considering the absolute value, (1) can be rewritten as

-diam/θ < distance < diam/θ.

Thus the range of the distance is 2 diam/θ. As one node covers an area with diameter diam, there can be only (2 diam/θ) / diam = 2/θ nodes next to each other for which the θ-criterion does not hold. Along the second axis there are just as many. Thus there are (2/θ)^2 = 4/θ^2 nodes in total which don't fulfill the θ-criterion. Those are the parent nodes we have to consider at the current level. That number still has to be multiplied by 4 (the number of children per parent), which doesn't change the order. Thus the maximum computational costs on each level are O(θ^-2). As there are log N levels in total and N particles, the total computational costs sum up to O(θ^-2 N log N).

For three dimensions it is easy to see that there are no longer c/θ^2 nodes whose parents fail the θ-criterion, but c/θ^3, as we are now dealing with a cubic volume.
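The counting argument above can be written down as a tiny helper (a sketch; θ and d are the symbols from the derivation, and the function returns the per-level bound only, without the constant factor for the children):

```cpp
#include <cassert>
#include <cmath>

// Upper bound on the number of parent nodes per tree level that fail the
// theta-criterion: along each axis at most (2*diam/theta)/diam = 2/theta
// nodes fit into the critical range, hence (2/theta)^d in d dimensions.
double nodes_per_level_bound(double theta, int d) {
    return std::pow(2.0 / theta, d);
}
```

For θ = 1 this gives 4 nodes per level in 2D and 8 in 3D; halving θ quadruples the 2D bound and multiplies the 3D bound by eight, matching the O(θ^-d) behaviour.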

[Abbildung 7: Runtimes per timestep in seconds for a scenario with 100k particles on both clusters (SGI ICE with 1-64 MPI ranks, MiniMUC with 1-256 MPI ranks), one MPI rank per core; AoS, SoA and SoA+SSE variants for cell-length 1.5 / cutoff 3.0, cell-length 3.0 / cutoff 3.0, and cell-length 5.0 / cutoff 5.0.]


### Designing and Building Applications for Extreme Scale Systems CS598 William Gropp www.cs.illinois.edu/~wgropp

Designing and Building Applications for Extreme Scale Systems CS598 William Gropp www.cs.illinois.edu/~wgropp Welcome! Who am I? William (Bill) Gropp Professor of Computer Science One of the Creators of

### White Paper COMPUTE CORES

White Paper COMPUTE CORES TABLE OF CONTENTS A NEW ERA OF COMPUTING 3 3 HISTORY OF PROCESSORS 3 3 THE COMPUTE CORE NOMENCLATURE 5 3 AMD S HETEROGENEOUS PLATFORM 5 3 SUMMARY 6 4 WHITE PAPER: COMPUTE CORES

### HIGH PERFORMANCE COMPUTING COMPETENCE CENTER BADEN-WÜRTTEMBERG

HIGH PERFORMANCE COMPUTING COMPETENCE CENTER BADEN-WÜRTTEMBERG Contents High Performance Computing Competence Center Baden-Württemberg (hkz-bw)... 4 Vector Parallel Supercomputer NEC SX-6X... 8 Massively

### GPU Accelerated Pathfinding

GPU Accelerated Pathfinding By: Avi Bleiweiss NVIDIA Corporation Graphics Hardware (2008) Editors: David Luebke and John D. Owens NTNU, TDT24 Presentation by Lars Espen Nordhus http://delivery.acm.org/10.1145/1420000/1413968/p65-bleiweiss.pdf?ip=129.241.138.231&acc=active

### Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms

Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Family-Based Platforms Executive Summary Complex simulations of structural and systems performance, such as car crash simulations,

### AMD GPU Architecture. OpenCL Tutorial, PPAM 2009. Dominik Behr September 13th, 2009

AMD GPU Architecture OpenCL Tutorial, PPAM 2009 Dominik Behr September 13th, 2009 Overview AMD GPU architecture How OpenCL maps on GPU and CPU How to optimize for AMD GPUs and CPUs in OpenCL 2 AMD GPU

### Pedraforca: ARM + GPU prototype

www.bsc.es Pedraforca: ARM + GPU prototype Filippo Mantovani Workshop on exascale and PRACE prototypes Barcelona, 20 May 2014 Overview Goals: Test the performance, scalability, and energy efficiency of

### 64-Bit versus 32-Bit CPUs in Scientific Computing

64-Bit versus 32-Bit CPUs in Scientific Computing Axel Kohlmeyer Lehrstuhl für Theoretische Chemie Ruhr-Universität Bochum March 2004 1/25 Outline 64-Bit and 32-Bit CPU Examples

### ECLIPSE Performance Benchmarks and Profiling. January 2009

ECLIPSE Performance Benchmarks and Profiling January 2009 Note The following research was performed under the HPC Advisory Council activities AMD, Dell, Mellanox, Schlumberger HPC Advisory Council Cluster

### Power-Aware High-Performance Scientific Computing

Power-Aware High-Performance Scientific Computing Padma Raghavan Scalable Computing Laboratory Department of Computer Science Engineering The Pennsylvania State University http://www.cse.psu.edu/~raghavan

### Trends in High-Performance Computing for Power Grid Applications

Trends in High-Performance Computing for Power Grid Applications Franz Franchetti ECE, Carnegie Mellon University www.spiral.net Co-Founder, SpiralGen www.spiralgen.com This talk presents my personal views

### Multi-Threading Performance on Commodity Multi-Core Processors

Multi-Threading Performance on Commodity Multi-Core Processors Jie Chen and William Watson III Scientific Computing Group Jefferson Lab 12000 Jefferson Ave. Newport News, VA 23606 Organization Introduction

### Collaborative and Interactive CFD Simulation using High Performance Computers

Collaborative and Interactive CFD Simulation using High Performance Computers Petra Wenisch, Andre Borrmann, Ernst Rank, Christoph van Treeck Technische Universität München {wenisch, borrmann, rank, treeck}@bv.tum.de