Algorithms of Scientific Computing II
|
|
- Christian Douglas
- 8 years ago
- Views:
Transcription
1 Technische Universität München WS 2010/2011 Institut für Informatik Prof. Dr. Hans-Joachim Bungartz Alexander Heinecke, M.Sc., M.Sc.w.H. Algorithms of Scientific Computing II Exercise 4 - Hardware-aware implementations and Barnes-Hut-Method 1) Today s Systems - Compute Power Today s supercomputer are highly parallel systems. Please read LRZ s web page and identify the different levels of parallelism in SuperMUC! Vector-register; Hardware-thread; core; processor; node; island; whole system. Which levels of parallelism are not documented there, but can be found in Intel Architecture Manual? Instruction level parallelism: A core is able to decode several instruction a time (e.g. four-way decode). Since it features several execution units the so-called out-oforder unit is able to run different instructions employing different units out-of-order and in parallel. Please refer to Fig. 1 and Fig. 2 for examples. In total we are now able to calculated the peak-performance (double percision) of a node in SuperMUC. 2: out-of-order: mul and add 4: vectorization 8: cores per package 2.7: clock-speed 2: two sockets per node 45.6 GFLOPS per node! =.19 PFLOPS for whole system.
2 Branch Prediction 2 KB Instruction Cache Instruction Fetch and Pre-Decode uop decoded Cache (1.5k uops) 4-way Decode Rename/Allocate/Retirement (Reorder-Buffer: 168 Entries) Scheduler (physical registerfile: bit VPU registers, 160 integer registers) Port 0 Port 1 Port 5 Port 2 Port Port 4 ALU SSE MUL SSE Shuffle DIV AVX FP MUL Imm Blend ALU SSE ADD SSE Shuffle AVX FP ADD ALU JMP AVX FP Shuffle AVX FP Boolean Imm Blend Load Store Address Store Address Load Store Data 256 KB Cache (MLC) Memory Control 2 KB Data Cache Abbildung 1: Intel Sandy Bridge Architecture, changes in comparison to Intel Nehalem Architecture are highlighted in orange, based on descriptions from Intel. Instruction Fetch 64 KB Instruction Cache Instruction Decode Branch Prediction 4-way Decode Integer Scheduler Port 0 Port 1 Port 2 Port ALU ALU AGU AGU Instruction Retirement Floating Point Scheduler, 60 entries Port 0 Port 1 Port 2 Port 128bit FMAC 128bit FMAC SSE x87 SSE x87 Instruction Retirement Integer Scheduler Port 0 Port 1 Port 2 Port ALU ALU AGU AGU 2 MB shared Cache Load/Store Unit 16 KB Data Cache Floating Point Load / Store Buffer 16 KB Data Cache Load/Store Unit Data Prefetches Abbildung 2: AMD Bulldozer Architecture, based on descriptions from AMD. 2) Today s Systems - Memory Subsystem Please sketch the memory subsystem/ network structure of SuperMUC. Is there a different network topology available on another supercomputer? A single node features a state of the art NUMA-architecture as displayed in Fig., which implements a shared memory but with different access time: local vs. remo-
3 te. A remote access takes roughly 10-20% longer than a local access. Operating systems like Linux employ a so called first touch policy: the NUMA-node which touches a page first is the owner of the this page. On each socket we are able to achieve a streaming bandwidth of more than 40 GByte/s. Core 0 Core 1 Core 2 Core Core 4 Core 5 Core 6 Core 7 Thr. 0 Thr. 1 Thr. 2 Thr. Thr. 4 Thr. 5 Thr. 6 Thr. 7 Thr. 8 Thr. 9 Thr. 10 Thr. 11 Thr. 12 Thr. 1 Thr. 14 Thr CH DDR 1600 Shared L Cache 20MB 4 CH DDR 1600 Shared L Cache 20MB Thr. 0 Thr. 1 Thr. 2 Thr. Thr. 4 Thr. 5 Thr. 6 Thr. 7 Thr. 8 Thr. 9 Thr. 10 Thr. 11 Thr. 12 Thr. 1 Thr. 14 Thr. 15 QPI 8.0 GT QPI 8.0 GT Core 0 Core 1 Core 2 Core Core 4 Core 5 Core 6 Core 7 Abbildung : Intel Sandy Bridge NUMA-Architecture For inter node communication, SuperMUC features an Infiniband network. Infiniband offers low latencies (1.2 us) at more then 6 GByte/s per second bandwidth. The overall network is organizes as a pruned fat-tree: 512 nodes form a so-called island with one high-end switch. 18 island are connected via a 1-level tree. Other systems, like IBM s BlueGene or Cray s XK6 feature custom interconnects (which are routerless) in order to form a 2D/D mesh. If a scientific application (e.g. based on regular grids or matrices) matches to this network only llowöverhead for communication is measured. What is the peak performance of the system if a memory bound application is executed?
4 SuperMUC features four channels per socket of DDR-1600 RAM. On channel is able to deliver 12.8 GByte/s what results in a total bandwidth of GByte/s. Let s execute a matrix vector multiplication: For a N N matrix we have to calculate N scalar products. For each scalar product we have to read 2N values which is have an amount of of 16N bytes and we are executing N FLOP. Therefore we are able to achieve 102.4/16 GFLOPS = 6.4 GFLOPS (1.8 % peak performance!!!) Which hardware feature may help to improve this performance? Caches! Abbildung 4: cache hierarchy Let s assume we can store b of our matrix vector multiplication in the cache, we just have to stream A. This means for a scalar product we have to read 8N bytes which results in a performance of 12.8 GFLOPS. In order to unleash the performance of today s chips cache awareness is a must.
5 ) Vectorization of Linked-Cells particle simulations Recently C++ has established itself as one of the standard language for new development of HPC codes apart from FORTRAN and C. Please sketch the software layout of a linked-cells C++ implementation. Classes for: Particle, ParticleContainer, Cell, ForceCalculation. Which problems do you face when it comes to low level kernels and data layout? Interactions are taking place on a particle basis. ForceCalculation class gets two references and computes and updates velocities and forces for each particle. This leads to a so-called Array of Structures (AoS) which is performance-killing. In order to speed-up calculations it s useful to flip this layout which leads to a cell-based only method, Fig v_y_1 v_x_1 f_z_1 64 byte (512 bit) f_y_1 f_x_1 z_1 y_1 x_ y_2 x_2 b_1 a_1 v_z_1... x_ x_2 x_ y_ b_ y_2 b_2 y_1 b_1 Abbildung 5: AoS to SoA conversion Can you use vectorization in order to speed-up calculations? Given a SoA-storage, we are able to vectorize the LJ-12-6 kernel across particle pairs. Please note that this approach scales with the number of particles per cell and does not require zero-padding or is limited to a vector-length of four as methods. In our case the only limitations are the number of particles per cell, the size of the cutoff-radius r cutoff and cell-length r c as these factors influence the vector load.
6 x_a x_a calculate distances x_2 x_1 y_a y_a d_a_2 d_a_1 y_2 y_1 z_a 0 0 z_a f_x_a f_y_a d_a_2 > rcutoff d_a_1 > rcutoff z_2 f_x_2 f_y_2 z_1 f_x_1 f_y_1 0 f_z_a calculate forces (LJ-12-6) f_z_2 f_z_1 Abbildung 6: Vectorization of the LJ-12-6 force calculation kernel. Figure 6 sketches the applied vectorization of the LJ-12-6 calculations. As we need to deal with double precision floating point numbers we can store a maximum of two elements per SSE register. The calculation is performed on particle pairs, therefore we broadcast-load (duplicating by vector-length) the required data of one particle in the first register (a) and the second register is filled by data of two different particles (1 and 2). With this approach we can theoretically reduce the number of operations by a factor of two. Since we are calculating two interactions within one step, we need to apply some pre- and post-processing, which can be performed by regular logical operations directly in hardware: the SSE/AVX instruction set offers compare, logical and sign-bit operations that can be used to decide if the force calculation should be initiated (at least one particle pair distance is smaller than r cutoff, pre-processing) and if calculated results need to be zeroed by a mask (a particle pair distance is greater than r cutoff ). In case of SSE this should already gain a factor of two for small particle numbers. The lower performance bound is the scalar performance as these algorithms will not execute more instructions than necessary in the scalar case. additional: 4) Complexity of the Barnes-Hut-Method In the lecture the costs for the d-dimensional Barnes-Hut-Method were given as O( d NlogN). First derive the costs for the twodimensional Barnes-Hut-Method. Then explain descriptively (without proof), why the formula is also correct for d. The outer loop of the Barnes-Hut-algorithm iterates over all particles and calculates the force effective on every particle. As there are N particles in total, for each of those particles the costs have to be O( d log N). For the given complexity of O( d N log N) we assume that the tree is not degenerated, thus the tree
7 has O(log N) levels. If we can show, that at each level the costs are O( d ), the formula is proven valid. We start out from the twodimensional case. First we consider an arbitrary level and determine the number of nodes (i.e. particles of pseude-particles) we have to consider.those are exactly the nodes for which the parent node didn t fulfill the -criterion. For the parent node it has to hold: diam distance >. Thus it follows for the distance: distance < diam We will now try to determine an upper bound for the number of parent cells which fulfill that condition. Therefor we consider only one coordinate axis first. Not considering the absolute value, (1) can be rewritten as diam < distance < diam. Thus the area of the distance is 2 diam. As one node coveres an area with the diameter diam there can be only 2 diam diam nodes next to each other, for which the -criterion does not hold. In the second axis there are just as many. Thus there are = 2 ( ) 2 2 = 4 2 nodes in total, which don t fulfill the -criterion. Those are the parent nodes we have to consider at the current level. That number still has to be multiplied by 4, what doesn t change the order. Thus the maximum computational costs on each level are O( 2 ). As there area log N levels in total and N particles, the total computational costs sum up to O( 2 N log N). For three dimensions it is easy to see that there are not c 2 any more for which the -criterion is fulfilled, but only c, as we are dealing now with a cubic area. (1)
8 5,51 2,85 1,5 0,82 0,44 0,2 0,12 2,86 1,42 0,82 0,42 0,22 0,12 0,062 9,1 4, 2,15 1,08 0,62 9,48 0,4 4,87 0,19 2,57 0,118 1,7 0,081 0,77 0,4 0,22 runtime per timestep [s] ,1 0,01 SGI ICE, 100K particles 1 rank 2 ranks 4 ranks 8 ranks 16 ranks 2 ranks 64 ranks AoS SoA SoA,SSE AoS SoA SoA,SSE AoS SoA SoA,SSE cell-length 1.5, cutoff.0 cell-length.0, cutoff.0 cell-length 5.0, cutoff 5.0 nks 4,7 4 ranks 8 ranks 16 ranks 2 ranks 64 ranks 2, 0,1, 1,7 0,89 0,46 0,24 0,1 1,16,1 1,69 0,88 0,46 0,24 0,1 0,61 2,85 1,5 0,82 0,44 0,2 0,12 0,5 0,01,1 1,72 0,88 0,46 0,24 0,1 0,19 AoS SoA SoA,SSE AoS SoA SoA,SSE AoS SoA SoA,SSE 2,6 1,2 0,65 0,5 0,18 0,099 0,11 cell-length 1.5, cutoff.0 cell-length.0, cutoff.0 cell-length 5.0, cutoff 5.0 1,42 0,82 0,42 0,22 0,12 0,062 0,065 15,5 8,1 4,19 2,5 1,22 0,69 0,047 9,99 Abbildung 5,7 7: 2,78 Runtimes in 1,54 seconds for 0,81 a scenario with 0,46 100k particles on both 4,87 clusters, 2,57 one MPI 1,7 rank per0,77 core. 0,4 0,22 14,6 7,7,95 2,09 1,21 0,65 0,6 0,28 0,1 runtime per timestep [s] MiniMUC, 100K particles 1 rank 2 ranks 4 ranks 8 ranks 16 ranks 2 ranks 64 ranks 128 ranks 256 ranks x x x x
CPU Session 1. Praktikum Parallele Rechnerarchtitekturen. Praktikum Parallele Rechnerarchitekturen / Johannes Hofmann April 14, 2015 1
CPU Session 1 Praktikum Parallele Rechnerarchtitekturen Praktikum Parallele Rechnerarchitekturen / Johannes Hofmann April 14, 2015 1 Overview Types of Parallelism in Modern Multi-Core CPUs o Multicore
More informationOpenMP Programming on ScaleMP
OpenMP Programming on ScaleMP Dirk Schmidl schmidl@rz.rwth-aachen.de Rechen- und Kommunikationszentrum (RZ) MPI vs. OpenMP MPI distributed address space explicit message passing typically code redesign
More informationHardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui
Hardware-Aware Analysis and Optimization of Stable Fluids Presentation Date: Sep 15 th 2009 Chrissie C. Cui Outline Introduction Highlights Flop and Bandwidth Analysis Mehrstellen Schemes Advection Caching
More informationScalability and Classifications
Scalability and Classifications 1 Types of Parallel Computers MIMD and SIMD classifications shared and distributed memory multicomputers distributed shared memory computers 2 Network Topologies static
More informationLecture 2 Parallel Programming Platforms
Lecture 2 Parallel Programming Platforms Flynn s Taxonomy In 1966, Michael Flynn classified systems according to numbers of instruction streams and the number of data stream. Data stream Single Multiple
More informationBinary search tree with SIMD bandwidth optimization using SSE
Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous
More informationPerformance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi
Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi ICPP 6 th International Workshop on Parallel Programming Models and Systems Software for High-End Computing October 1, 2013 Lyon, France
More informationParallel Computing. Benson Muite. benson.muite@ut.ee http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage
Parallel Computing Benson Muite benson.muite@ut.ee http://math.ut.ee/ benson https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage 3 November 2014 Hadoop, Review Hadoop Hadoop History Hadoop Framework
More informationCSE 6040 Computing for Data Analytics: Methods and Tools
CSE 6040 Computing for Data Analytics: Methods and Tools Lecture 12 Computer Architecture Overview and Why it Matters DA KUANG, POLO CHAU GEORGIA TECH FALL 2014 Fall 2014 CSE 6040 COMPUTING FOR DATA ANALYSIS
More informationGPU System Architecture. Alan Gray EPCC The University of Edinburgh
GPU System Architecture EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? GPU-CPU comparison Architectural reasons for GPU performance advantages GPU accelerated systems
More informationIMAGE SIGNAL PROCESSING PERFORMANCE ON 2 ND GENERATION INTEL CORE MICROARCHITECTURE PRESENTATION PETER CARLSTON, EMBEDDED & COMMUNICATIONS GROUP
IMAGE SIGNAL PROCESSING PERFORMANCE ON 2 ND GENERATION INTEL CORE MICROARCHITECTURE PRESENTATION PETER CARLSTON, EMBEDDED & COMMUNICATIONS GROUP Q3 2011 325877-001 1 Legal Notices and Disclaimers INFORMATION
More informationEvaluation of CUDA Fortran for the CFD code Strukti
Evaluation of CUDA Fortran for the CFD code Strukti Practical term report from Stephan Soller High performance computing center Stuttgart 1 Stuttgart Media University 2 High performance computing center
More informationIntroduction to GPGPU. Tiziano Diamanti t.diamanti@cineca.it
t.diamanti@cineca.it Agenda From GPUs to GPGPUs GPGPU architecture CUDA programming model Perspective projection Vectors that connect the vanishing point to every point of the 3D model will intersecate
More informationHETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK
HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK Steve Oberlin CTO, Accelerated Computing US to Build Two Flagship Supercomputers SUMMIT SIERRA Partnership for Science 100-300 PFLOPS Peak Performance
More informationPerformance Evaluation of Amazon EC2 for NASA HPC Applications!
National Aeronautics and Space Administration Performance Evaluation of Amazon EC2 for NASA HPC Applications! Piyush Mehrotra!! J. Djomehri, S. Heistand, R. Hood, H. Jin, A. Lazanoff,! S. Saini, R. Biswas!
More informationPerformance Characteristics of Large SMP Machines
Performance Characteristics of Large SMP Machines Dirk Schmidl, Dieter an Mey, Matthias S. Müller schmidl@rz.rwth-aachen.de Rechen- und Kommunikationszentrum (RZ) Agenda Investigated Hardware Kernel Benchmark
More informationAppro Supercomputer Solutions Best Practices Appro 2012 Deployment Successes. Anthony Kenisky, VP of North America Sales
Appro Supercomputer Solutions Best Practices Appro 2012 Deployment Successes Anthony Kenisky, VP of North America Sales About Appro Over 20 Years of Experience 1991 2000 OEM Server Manufacturer 2001-2007
More informationParallel Programming Survey
Christian Terboven 02.09.2014 / Aachen, Germany Stand: 26.08.2014 Version 2.3 IT Center der RWTH Aachen University Agenda Overview: Processor Microarchitecture Shared-Memory
More informationInterconnection Networks. Interconnection Networks. Interconnection networks are used everywhere!
Interconnection Networks Interconnection Networks Interconnection networks are used everywhere! Supercomputers connecting the processors Routers connecting the ports can consider a router as a parallel
More informationFLOATING-POINT ARITHMETIC IN AMD PROCESSORS MICHAEL SCHULTE AMD RESEARCH JUNE 2015
FLOATING-POINT ARITHMETIC IN AMD PROCESSORS MICHAEL SCHULTE AMD RESEARCH JUNE 2015 AGENDA The Kaveri Accelerated Processing Unit (APU) The Graphics Core Next Architecture and its Floating-Point Arithmetic
More information1 Bull, 2011 Bull Extreme Computing
1 Bull, 2011 Bull Extreme Computing Table of Contents HPC Overview. Cluster Overview. FLOPS. 2 Bull, 2011 Bull Extreme Computing HPC Overview Ares, Gerardo, HPC Team HPC concepts HPC: High Performance
More informationand RISC Optimization Techniques for the Hitachi SR8000 Architecture
1 KONWIHR Project: Centre of Excellence for High Performance Computing Pseudo-Vectorization and RISC Optimization Techniques for the Hitachi SR8000 Architecture F. Deserno, G. Hager, F. Brechtefeld, G.
More informationOC By Arsene Fansi T. POLIMI 2008 1
IBM POWER 6 MICROPROCESSOR OC By Arsene Fansi T. POLIMI 2008 1 WHAT S IBM POWER 6 MICROPOCESSOR The IBM POWER6 microprocessor powers the new IBM i-series* and p-series* systems. It s based on IBM POWER5
More informationPerformance Counter. Non-Uniform Memory Access Seminar Karsten Tausche 2014-12-10
Performance Counter Non-Uniform Memory Access Seminar Karsten Tausche 2014-12-10 Performance Counter Hardware Unit for event measurements Performance Monitoring Unit (PMU) Originally for CPU-Debugging
More informationPutting it all together: Intel Nehalem. http://www.realworldtech.com/page.cfm?articleid=rwt040208182719
Putting it all together: Intel Nehalem http://www.realworldtech.com/page.cfm?articleid=rwt040208182719 Intel Nehalem Review entire term by looking at most recent microprocessor from Intel Nehalem is code
More informationMore on Pipelining and Pipelines in Real Machines CS 333 Fall 2006 Main Ideas Data Hazards RAW WAR WAW More pipeline stall reduction techniques Branch prediction» static» dynamic bimodal branch prediction
More informationLecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.
Lecture 11: Multi-Core and GPU Multi-core computers Multithreading GPUs General Purpose GPUs Zebo Peng, IDA, LiTH 1 Multi-Core System Integration of multiple processor cores on a single chip. To provide
More informationYALES2 porting on the Xeon- Phi Early results
YALES2 porting on the Xeon- Phi Early results Othman Bouizi Ghislain Lartigue Innovation and Pathfinding Architecture Group in Europe, Exascale Lab. Paris CRIHAN - Demi-journée calcul intensif, 16 juin
More informationHigh Performance Computing in CST STUDIO SUITE
High Performance Computing in CST STUDIO SUITE Felix Wolfheimer GPU Computing Performance Speedup 18 16 14 12 10 8 6 4 2 0 Promo offer for EUC participants: 25% discount for K40 cards Speedup of Solver
More informationCluster Scalability of ANSYS FLUENT 12 for a Large Aerodynamics Case on the Darwin Supercomputer
Cluster Scalability of ANSYS FLUENT 12 for a Large Aerodynamics Case on the Darwin Supercomputer Stan Posey, MSc and Bill Loewe, PhD Panasas Inc., Fremont, CA, USA Paul Calleja, PhD University of Cambridge,
More informationIntroduction to GPU Architecture
Introduction to GPU Architecture Ofer Rosenberg, PMTS SW, OpenCL Dev. Team AMD Based on From Shader Code to a Teraflop: How GPU Shader Cores Work, By Kayvon Fatahalian, Stanford University Content 1. Three
More informationCOMP/CS 605: Intro to Parallel Computing Lecture 01: Parallel Computing Overview (Part 1)
COMP/CS 605: Intro to Parallel Computing Lecture 01: Parallel Computing Overview (Part 1) Mary Thomas Department of Computer Science Computational Science Research Center (CSRC) San Diego State University
More informationSoftware implementation of Post-Quantum Cryptography
Software implementation of Post-Quantum Cryptography Peter Schwabe Radboud University Nijmegen, The Netherlands October 20, 2013 ASCrypto 2013, Florianópolis, Brazil Part I Optimizing cryptographic software
More informationBSC - Barcelona Supercomputer Center
Objectives Research in Supercomputing and Computer Architecture Collaborate in R&D e-science projects with prestigious scientific teams Manage BSC supercomputers to accelerate relevant contributions to
More informationAssessing the Performance of OpenMP Programs on the Intel Xeon Phi
Assessing the Performance of OpenMP Programs on the Intel Xeon Phi Dirk Schmidl, Tim Cramer, Sandra Wienke, Christian Terboven, and Matthias S. Müller schmidl@rz.rwth-aachen.de Rechen- und Kommunikationszentrum
More informationHow To Build A Supermicro Computer With A 32 Core Power Core (Powerpc) And A 32-Core (Powerpc) (Powerpowerpter) (I386) (Amd) (Microcore) (Supermicro) (
TECHNICAL GUIDELINES FOR APPLICANTS TO PRACE 7 th CALL (Tier-0) Contributing sites and the corresponding computer systems for this call are: GCS@Jülich, Germany IBM Blue Gene/Q GENCI@CEA, France Bull Bullx
More informationTHE NAS KERNEL BENCHMARK PROGRAM
THE NAS KERNEL BENCHMARK PROGRAM David H. Bailey and John T. Barton Numerical Aerodynamic Simulations Systems Division NASA Ames Research Center June 13, 1986 SUMMARY A benchmark test program that measures
More informationIntroduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1
Introduction to GP-GPUs Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1 GPU Architectures: How do we reach here? NVIDIA Fermi, 512 Processing Elements (PEs) 2 What Can It Do?
More informationComputer Architecture TDTS10
why parallelism? Performance gain from increasing clock frequency is no longer an option. Outline Computer Architecture TDTS10 Superscalar Processors Very Long Instruction Word Processors Parallel computers
More informationLecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com
CSCI-GA.3033-012 Graphics Processing Units (GPUs): Architecture and Programming Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Modern GPU
More informationAchieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging
Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging In some markets and scenarios where competitive advantage is all about speed, speed is measured in micro- and even nano-seconds.
More informationInterconnection Network
Interconnection Network Recap: Generic Parallel Architecture A generic modern multiprocessor Network Mem Communication assist (CA) $ P Node: processor(s), memory system, plus communication assist Network
More informationBEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA
BEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA AGENDA INTRO TO BEAGLEBONE BLACK HARDWARE & SPECS CORTEX-A8 ARMV7 PROCESSOR PROS & CONS VS RASPBERRY PI WHEN TO USE BEAGLEBONE BLACK Single
More informationLecture 1: the anatomy of a supercomputer
Where a calculator on the ENIAC is equipped with 18,000 vacuum tubes and weighs 30 tons, computers of the future may have only 1,000 vacuum tubes and perhaps weigh 1½ tons. Popular Mechanics, March 1949
More informationA GPU COMPUTING PLATFORM (SAGA) AND A CFD CODE ON GPU FOR AEROSPACE APPLICATIONS
A GPU COMPUTING PLATFORM (SAGA) AND A CFD CODE ON GPU FOR AEROSPACE APPLICATIONS SUDHAKARAN.G APCF, AERO, VSSC, ISRO 914712564742 g_suhakaran@vssc.gov.in THOMAS.C.BABU APCF, AERO, VSSC, ISRO 914712565833
More informationNext Generation GPU Architecture Code-named Fermi
Next Generation GPU Architecture Code-named Fermi The Soul of a Supercomputer in the Body of a GPU Why is NVIDIA at Super Computing? Graphics is a throughput problem paint every pixel within frame time
More informationHow To Compare Amazon Ec2 To A Supercomputer For Scientific Applications
Amazon Cloud Performance Compared David Adams Amazon EC2 performance comparison How does EC2 compare to traditional supercomputer for scientific applications? "Performance Analysis of High Performance
More informationHigh Performance Computing in the Multi-core Area
High Performance Computing in the Multi-core Area Arndt Bode Technische Universität München Technology Trends for Petascale Computing Architectures: Multicore Accelerators Special Purpose Reconfigurable
More informationOpenMP and Performance
Dirk Schmidl IT Center, RWTH Aachen University Member of the HPC Group schmidl@itc.rwth-aachen.de IT Center der RWTH Aachen University Tuning Cycle Performance Tuning aims to improve the runtime of an
More informationBindel, Spring 2010 Applications of Parallel Computers (CS 5220) Week 1: Wednesday, Jan 27
Logistics Week 1: Wednesday, Jan 27 Because of overcrowding, we will be changing to a new room on Monday (Snee 1120). Accounts on the class cluster (crocus.csuglab.cornell.edu) will be available next week.
More informationBuilding a Top500-class Supercomputing Cluster at LNS-BUAP
Building a Top500-class Supercomputing Cluster at LNS-BUAP Dr. José Luis Ricardo Chávez Dr. Humberto Salazar Ibargüen Dr. Enrique Varela Carlos Laboratorio Nacional de Supercómputo Benemérita Universidad
More informationTopological Properties
Advanced Computer Architecture Topological Properties Routing Distance: Number of links on route Node degree: Number of channels per node Network diameter: Longest minimum routing distance between any
More informationBuilding an Inexpensive Parallel Computer
Res. Lett. Inf. Math. Sci., (2000) 1, 113-118 Available online at http://www.massey.ac.nz/~wwiims/rlims/ Building an Inexpensive Parallel Computer Lutz Grosz and Andre Barczak I.I.M.S., Massey University
More informationGPUs for Scientific Computing
GPUs for Scientific Computing p. 1/16 GPUs for Scientific Computing Mike Giles mike.giles@maths.ox.ac.uk Oxford-Man Institute of Quantitative Finance Oxford University Mathematical Institute Oxford e-research
More informationCOMP 422, Lecture 3: Physical Organization & Communication Costs in Parallel Machines (Sections 2.4 & 2.5 of textbook)
COMP 422, Lecture 3: Physical Organization & Communication Costs in Parallel Machines (Sections 2.4 & 2.5 of textbook) Vivek Sarkar Department of Computer Science Rice University vsarkar@rice.edu COMP
More informationChapter 2 Parallel Computer Architecture
Chapter 2 Parallel Computer Architecture The possibility for a parallel execution of computations strongly depends on the architecture of the execution platform. This chapter gives an overview of the general
More informationPerformance Characteristics of a Cost-Effective Medium-Sized Beowulf Cluster Supercomputer
Res. Lett. Inf. Math. Sci., 2003, Vol.5, pp 1-10 Available online at http://iims.massey.ac.nz/research/letters/ 1 Performance Characteristics of a Cost-Effective Medium-Sized Beowulf Cluster Supercomputer
More informationCurrent Trend of Supercomputer Architecture
Current Trend of Supercomputer Architecture Haibei Zhang Department of Computer Science and Engineering haibei.zhang@huskymail.uconn.edu Abstract As computer technology evolves at an amazingly fast pace,
More informationChapter 2 Parallel Architecture, Software And Performance
Chapter 2 Parallel Architecture, Software And Performance UCSB CS140, T. Yang, 2014 Modified from texbook slides Roadmap Parallel hardware Parallel software Input and output Performance Parallel program
More informationKeys to node-level performance analysis and threading in HPC applications
Keys to node-level performance analysis and threading in HPC applications Thomas GUILLET (Intel; Exascale Computing Research) IFERC seminar, 18 March 2015 Legal Disclaimer & Optimization Notice INFORMATION
More informationArchitecture of Hitachi SR-8000
Architecture of Hitachi SR-8000 University of Stuttgart High-Performance Computing-Center Stuttgart (HLRS) www.hlrs.de Slide 1 Most of the slides from Hitachi Slide 2 the problem modern computer are data
More information"JAGUAR AMD s Next Generation Low Power x86 Core. Jeff Rupley, AMD Fellow Chief Architect / Jaguar Core August 28, 2012
"JAGUAR AMD s Next Generation Low Power x86 Core Jeff Rupley, AMD Fellow Chief Architect / Jaguar Core August 28, 2012 TWO X86 CORES TUNED FOR TARGET MARKETS Mainstream Client and Server Markets Bulldozer
More informationAMD Opteron Quad-Core
AMD Opteron Quad-Core a brief overview Daniele Magliozzi Politecnico di Milano Opteron Memory Architecture native quad-core design (four cores on a single die for more efficient data sharing) enhanced
More informationOverview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming
Overview Lecture 1: an introduction to CUDA Mike Giles mike.giles@maths.ox.ac.uk hardware view software view Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Lecture 1 p.
More informationSPARC64 VIIIfx: CPU for the K computer
SPARC64 VIIIfx: CPU for the K computer Toshio Yoshida Mikio Hondo Ryuji Kan Go Sugizaki SPARC64 VIIIfx, which was developed as a processor for the K computer, uses Fujitsu Semiconductor Ltd. s 45-nm CMOS
More informationANALYSIS OF SUPERCOMPUTER DESIGN
ANALYSIS OF SUPERCOMPUTER DESIGN CS/ECE 566 Parallel Processing Fall 2011 1 Anh Huy Bui Nilesh Malpekar Vishnu Gajendran AGENDA Brief introduction of supercomputer Supercomputer design concerns and analysis
More informationSequential Performance Analysis with Callgrind and KCachegrind
Sequential Performance Analysis with Callgrind and KCachegrind 2 nd Parallel Tools Workshop, HLRS, Stuttgart, July 7/8, 2008 Josef Weidendorfer Lehrstuhl für Rechnertechnik und Rechnerorganisation Institut
More informationSockets vs. RDMA Interface over 10-Gigabit Networks: An In-depth Analysis of the Memory Traffic Bottleneck
Sockets vs. RDMA Interface over 1-Gigabit Networks: An In-depth Analysis of the Memory Traffic Bottleneck Pavan Balaji Hemal V. Shah D. K. Panda Network Based Computing Lab Computer Science and Engineering
More informationOverview. CISC Developments. RISC Designs. CISC Designs. VAX: Addressing Modes. Digital VAX
Overview CISC Developments Over Twenty Years Classic CISC design: Digital VAX VAXÕs RISC successor: PRISM/Alpha IntelÕs ubiquitous 80x86 architecture Ð 8086 through the Pentium Pro (P6) RJS 2/3/97 Philosophy
More informationHigh Performance Computing. Course Notes 2007-2008. HPC Fundamentals
High Performance Computing Course Notes 2007-2008 2008 HPC Fundamentals Introduction What is High Performance Computing (HPC)? Difficult to define - it s a moving target. Later 1980s, a supercomputer performs
More informationCell-SWat: Modeling and Scheduling Wavefront Computations on the Cell Broadband Engine
Cell-SWat: Modeling and Scheduling Wavefront Computations on the Cell Broadband Engine Ashwin Aji, Wu Feng, Filip Blagojevic and Dimitris Nikolopoulos Forecast Efficient mapping of wavefront algorithms
More informationInterconnection Networks Programmierung Paralleler und Verteilter Systeme (PPV)
Interconnection Networks Programmierung Paralleler und Verteilter Systeme (PPV) Sommer 2015 Frank Feinbube, M.Sc., Felix Eberhardt, M.Sc., Prof. Dr. Andreas Polze Interconnection Networks 2 SIMD systems
More informationUnleashing the Performance Potential of GPUs for Atmospheric Dynamic Solvers
Unleashing the Performance Potential of GPUs for Atmospheric Dynamic Solvers Haohuan Fu haohuan@tsinghua.edu.cn High Performance Geo-Computing (HPGC) Group Center for Earth System Science Tsinghua University
More information~ Greetings from WSU CAPPLab ~
~ Greetings from WSU CAPPLab ~ Multicore with SMT/GPGPU provides the ultimate performance; at WSU CAPPLab, we can help! Dr. Abu Asaduzzaman, Assistant Professor and Director Wichita State University (WSU)
More informationCollaborative and Interactive CFD Simulation using High Performance Computers
Collaborative and Interactive CFD Simulation using High Performance Computers Petra Wenisch, Andre Borrmann, Ernst Rank, Christoph van Treeck Technische Universität München {wenisch, borrmann, rank, treeck}@bv.tum.de
More informationPerformance Monitoring of Parallel Scientific Applications
Performance Monitoring of Parallel Scientific Applications Abstract. David Skinner National Energy Research Scientific Computing Center Lawrence Berkeley National Laboratory This paper introduces an infrastructure
More informationLBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR
LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR Frédéric Kuznik, frederic.kuznik@insa lyon.fr 1 Framework Introduction Hardware architecture CUDA overview Implementation details A simple case:
More informationDesigning and Building Applications for Extreme Scale Systems CS598 William Gropp www.cs.illinois.edu/~wgropp
Designing and Building Applications for Extreme Scale Systems CS598 William Gropp www.cs.illinois.edu/~wgropp Welcome! Who am I? William (Bill) Gropp Professor of Computer Science One of the Creators of
More informationEvoluzione dell Infrastruttura di Calcolo e Data Analytics per la ricerca
Evoluzione dell Infrastruttura di Calcolo e Data Analytics per la ricerca Carlo Cavazzoni CINECA Supercomputing Application & Innovation www.cineca.it 21 Aprile 2015 FERMI Name: Fermi Architecture: BlueGene/Q
More informationMaximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms
Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Family-Based Platforms Executive Summary Complex simulations of structural and systems performance, such as car crash simulations,
More informationECLIPSE Performance Benchmarks and Profiling. January 2009
ECLIPSE Performance Benchmarks and Profiling January 2009 Note The following research was performed under the HPC Advisory Council activities AMD, Dell, Mellanox, Schlumberger HPC Advisory Council Cluster
More informationAMD GPU Architecture. OpenCL Tutorial, PPAM 2009. Dominik Behr September 13th, 2009
AMD GPU Architecture OpenCL Tutorial, PPAM 2009 Dominik Behr September 13th, 2009 Overview AMD GPU architecture How OpenCL maps on GPU and CPU How to optimize for AMD GPUs and CPUs in OpenCL 2 AMD GPU
More information64-Bit versus 32-Bit CPUs in Scientific Computing
64-Bit versus 32-Bit CPUs in Scientific Computing Axel Kohlmeyer Lehrstuhl für Theoretische Chemie Ruhr-Universität Bochum March 2004 1/25 Outline 64-Bit and 32-Bit CPU Examples
More informationHigh Performance Computing, an Introduction to
High Performance ing, an Introduction to Nicolas Renon, Ph. D, Research Engineer in Scientific ations CALMIP - DTSI Université Paul Sabatier University of Toulouse (nicolas.renon@univ-tlse3.fr) Michel
More information22S:295 Seminar in Applied Statistics High Performance Computing in Statistics
22S:295 Seminar in Applied Statistics High Performance Computing in Statistics Luke Tierney Department of Statistics & Actuarial Science University of Iowa August 30, 2007 Luke Tierney (U. of Iowa) HPC
More informationPower-Aware High-Performance Scientific Computing
Power-Aware High-Performance Scientific Computing Padma Raghavan Scalable Computing Laboratory Department of Computer Science Engineering The Pennsylvania State University http://www.cse.psu.edu/~raghavan
More informationMulti-Threading Performance on Commodity Multi-Core Processors
Multi-Threading Performance on Commodity Multi-Core Processors Jie Chen and William Watson III Scientific Computing Group Jefferson Lab 12000 Jefferson Ave. Newport News, VA 23606 Organization Introduction
More informationTrends in High-Performance Computing for Power Grid Applications
Trends in High-Performance Computing for Power Grid Applications Franz Franchetti ECE, Carnegie Mellon University www.spiral.net Co-Founder, SpiralGen www.spiralgen.com This talk presents my personal views
More informationPedraforca: ARM + GPU prototype
www.bsc.es Pedraforca: ARM + GPU prototype Filippo Mantovani Workshop on exascale and PRACE prototypes Barcelona, 20 May 2014 Overview Goals: Test the performance, scalability, and energy efficiency of
More informationBuild an Energy Efficient Supercomputer from Items You can Find in Your Home (Sort of)!
Build an Energy Efficient Supercomputer from Items You can Find in Your Home (Sort of)! Marty Deneroff Chief Technology Officer Green Wave Systems, Inc. deneroff@grnwv.com 1 Using COTS Intellectual Property,
More informationGPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics
GPU Architectures A CPU Perspective Derek Hower AMD Research 5/21/2013 Goals Data Parallelism: What is it, and how to exploit it? Workload characteristics Execution Models / GPU Architectures MIMD (SPMD),
More informationPorting the Plasma Simulation PIConGPU to Heterogeneous Architectures with Alpaka
Porting the Plasma Simulation PIConGPU to Heterogeneous Architectures with Alpaka René Widera1, Erik Zenker1,2, Guido Juckeland1, Benjamin Worpitz1,2, Axel Huebl1,2, Andreas Knüpfer2, Wolfgang E. Nagel2,
More informationIntel Itanium Quad-Core Architecture for the Enterprise. Lambert Schaelicke Eric DeLano
Intel Itanium Quad-Core Architecture for the Enterprise Lambert Schaelicke Eric DeLano Agenda Introduction Intel Itanium Roadmap Intel Itanium Processor 9300 Series Overview Key Features Pipeline Overview
More informationTurbomachinery CFD on many-core platforms experiences and strategies
Turbomachinery CFD on many-core platforms experiences and strategies Graham Pullan Whittle Laboratory, Department of Engineering, University of Cambridge MUSAF Colloquium, CERFACS, Toulouse September 27-29
More informationPARALLEL & CLUSTER COMPUTING CS 6260 PROFESSOR: ELISE DE DONCKER BY: LINA HUSSEIN
1 PARALLEL & CLUSTER COMPUTING CS 6260 PROFESSOR: ELISE DE DONCKER BY: LINA HUSSEIN Introduction What is cluster computing? Classification of Cluster Computing Technologies: Beowulf cluster Construction
More informationWhite Paper The Numascale Solution: Extreme BIG DATA Computing
White Paper The Numascale Solution: Extreme BIG DATA Computing By: Einar Rustad ABOUT THE AUTHOR Einar Rustad is CTO of Numascale and has a background as CPU, Computer Systems and HPC Systems De-signer
More informationOverlapping Data Transfer With Application Execution on Clusters
Overlapping Data Transfer With Application Execution on Clusters Karen L. Reid and Michael Stumm reid@cs.toronto.edu stumm@eecg.toronto.edu Department of Computer Science Department of Electrical and Computer
More informationCFD Implementation with In-Socket FPGA Accelerators
CFD Implementation with In-Socket FPGA Accelerators Ivan Gonzalez UAM Team at DOVRES FuSim-E Programme Symposium: CFD on Future Architectures C 2 A 2 S 2 E DLR Braunschweig 14 th -15 th October 2009 Outline
More informationFPGA-based MapReduce Framework for Machine Learning
FPGA-based MapReduce Framework for Machine Learning Bo WANG 1, Yi SHAN 1, Jing YAN 2, Yu WANG 1, Ningyi XU 2, Huangzhong YANG 1 1 Department of Electronic Engineering Tsinghua University, Beijing, China
More informationIntroduction to Cloud Computing
Introduction to Cloud Computing Parallel Processing I 15 319, spring 2010 7 th Lecture, Feb 2 nd Majd F. Sakr Lecture Motivation Concurrency and why? Different flavors of parallel computing Get the basic
More information