Optimizing Code for Accelerators: The Long Road to High Performance
1 Optimizing Code for Accelerators: The Long Road to High Performance. Hans Vandierendonck, Mons GPU Day, November 9th, 2010
2 The Age of Accelerators
3 Accelerators in Real Life
4 Why Accelerators? Exploit the available transistor budget: increased performance (30:1, 1,000:1) and energy-efficiency (30,000:1 in performance/watt), at the cost of sacrificing generality (application-specific). (Dally et al., The Classical Computer, ISAT study, 2001.) [Figure: latency (ps/inst) trend.]
5 GPP vs Accelerators

                            AMD Opteron 2380   NVIDIA Tesla c1060   ATI FirePro V8700   STI Cell QS21 (8 SPEs)
    Core frequency          2.7 GHz            1.3 GHz              750 MHz             3.2 GHz
    Processor cores         ...                ...                  ...                 ...
    Data-level parallelism  4-way SIMD         8-way SIMT           5-way VLIW          4-way SIMD
    Memory BW               24 GB/s            ... GB/s             ... GB/s            25.6 GB/s (in), 35 GB/s (out)
    Peak perf SP (GFLOPS)   ...                ...                  ...                 ...
    Peak perf DP (GFLOPS)   ...                ...                  ...                 ...
    TDP                     75 W               187 W                114 W               45 W
    Feature size            45 nm              55 nm                55 nm               90 nm
6 Massively Parallel Architectures, e.g. the NVIDIA GPU architecture. [Figure: GPU block diagram.]
7 Position of This Talk. Let's say we want to do this: what performance can we obtain? How do we optimize the code? How much effort does it take? What happens to the code? Will we regret it? Method: implement algorithms, study performance portability, and optimize an application for the Cell processor.
8 Overview: Introduction, Performance Portability, ClustalW and Cell BE, Single-SPU Optimizations, Parallelization, Analysis
9 Overview: Introduction, Performance Portability, ClustalW and Cell BE, Single-SPU Optimizations, Parallelization, Analysis
10 Method: Compare Code Optimizations Across Processors. Two optimizations: loop unrolling (replicate the loop body) and vectorization (replace scalar instructions, e.g. ai R1, R2, 1, by a single vector instruction operating on lanes A B C D). Processors: CPU (Core i7), Tesla c1060, FirePro V8700, Cell QS21. Evaluate the code optimizations on several accelerators; OpenCL provides functional portability.
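To make the two transformations concrete, here is a minimal sketch on a trivial array loop (not the benchmark code itself); the vector variant uses the GCC vector extension, and 16-byte alignment and divisibility of n by 4 are assumed:

    #include <stddef.h>

    /* Baseline scalar loop. */
    void scale_scalar(float *a, const float *b, size_t n) {
        for (size_t i = 0; i < n; ++i)
            a[i] = 2.0f * b[i];
    }

    /* Unrolled by 4: fewer branch tests, more instruction-level
       parallelism (assumes n is divisible by 4). */
    void scale_unrolled(float *a, const float *b, size_t n) {
        for (size_t i = 0; i < n; i += 4) {
            a[i]     = 2.0f * b[i];
            a[i + 1] = 2.0f * b[i + 1];
            a[i + 2] = 2.0f * b[i + 2];
            a[i + 3] = 2.0f * b[i + 3];
        }
    }

    /* Vectorized: one 4-lane operation per iteration (GCC vector
       extension; assumes 16-byte-aligned arrays). */
    typedef float v4sf __attribute__((vector_size(16)));

    void scale_vector(float *a, const float *b, size_t n) {
        const v4sf two = { 2.0f, 2.0f, 2.0f, 2.0f };
        for (size_t i = 0; i < n; i += 4)
            *(v4sf *)&a[i] = two * *(const v4sf *)&b[i];
    }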
11 Performance Impact of Loop Unrolling and Vectorization. [Chart: relative execution time vs. unroll factor, scalar and vector variants, for CPU, Tesla, FirePro, and Cell. Benchmark: cp; block size 1 on Cell, 128 on the others.]
12 Performance Impact of Loop Unrolling and Vectorization. The CPU benefits from vectorization and is indifferent to loop unrolling. [Same chart as the previous slide.]
13 Performance Impact of Loop Unrolling and Vectorization. Vectorization is critical to Cell, and Cell benefits from loop unrolling. [Same chart.]
14 Performance Impact of Loop Unrolling and Vectorization. Optimizations interact with each other: the FirePro is more sensitive, and too much unrolling degrades performance. [Same chart.]
15 Performance Impact of Loop Unrolling and Vectorization. The same holds on a second benchmark: vectorization is critical to Cell, and Cell benefits from loop unrolling. [Charts for benchmarks cp and mri-fhd side by side.]
16 Performance Impact of Thread Block Size (Parallelism). Thread block size is the most important parameter for Tesla. [Chart: execution time (secs) vs. block size for CPU, Tesla, FirePro, and Cell. Benchmark: mri-fhd; loop unrolling: 2; vectorization: yes.]
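In OpenCL, which this study uses for functional portability, the thread-block size corresponds to the local work-group size passed at kernel launch. A hypothetical launch sketch (queue and kernel are assumed to be created elsewhere; in OpenCL 1.x the global size must be a multiple of the local size):

    #include <CL/cl.h>

    /* Hypothetical launch helper: the local size is the tuning knob
       studied on this slide (128 for CPU/Tesla/FirePro, 1 for Cell). */
    void launch(cl_command_queue queue, cl_kernel kernel, size_t n_items) {
        size_t global[1] = { n_items };  /* total work-items          */
        size_t local[1]  = { 128 };      /* the "thread block size"   */
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, global, local,
                               0, NULL, NULL);
    }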
17 What have we learned from this? Performance portability is no free lunch: optimizations are architecture-specific, sensitivity is program-specific, optimizations interact with each other, and the potential speedups are large.
18 Overview: Introduction, Performance Portability, ClustalW and Cell BE, Single-SPU Optimizations, Parallelization, Analysis
19 Cell processor: a high-level overview. A heterogeneous multi-core:
- Power Processing Element (PPE): 64-bit PowerPC RISC; 32 KB IL1, 32 KB DL1, 512 KB L2; in-order instruction issue; 2-way superscalar
- Synergistic Processing Element (SPE): 128-bit in-order RISC-like vector processor; 256 KB Local Store with explicit memory management; SIMD; no hardware branch predictor
- System memory interface (MFC): 16 B/cycle, 25.6 GB/s (at 1.6 GHz)
- Element Interconnect Bus (EIB): 4 data rings of 16 bytes; ... GB/s peak BW
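Explicit memory management means SPE code must itself stage data between main memory and the Local Store by DMA. A minimal sketch using the standard spu_mfcio.h interface (the buffer size and tag number are arbitrary choices for illustration):

    #include <spu_mfcio.h>

    #define TAG 3  /* arbitrary DMA tag, 0..31 */

    /* Local Store buffer; DMA addresses and sizes must be suitably
       aligned (128-byte alignment gives best transfer performance). */
    static char buf[4096] __attribute__((aligned(128)));

    void fetch_block(unsigned long long ea /* main-memory address */) {
        /* Start the DMA transfer from main memory into the Local Store. */
        mfc_get(buf, ea, sizeof(buf), TAG, 0, 0);
        /* Block until all transfers with this tag have completed. */
        mfc_write_tag_mask(1 << TAG);
        mfc_read_tag_status_all();
    }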
20 ClustalW: a program for alignment of nucleotide or amino acid sequences (N = number of sequences, L = typical sequence length). Three phases:
- Pairwise alignment: space O(N² + L²), time O(N²L²)
- Guide tree: space O(N²), time O(N⁴)
- Progressive alignment: space O(N² + L²), time O(N⁴ + L²)
[Figure: example sequences atgagttcttaa, gattgttgcc, gccttcttgtta, cgttaacttc.]
21 Analysis of ClustalW: determine what is important. [Chart: percentage of execution time spent in pairwise alignment (PW), guide tree (GT), and progressive alignment (PA) for N = 10, 50, 100, 500, 1000 sequences.]
22 Overview: Introduction, Performance Portability, ClustalW and Cell BE, Single-SPU Optimizations, Parallelization, Analysis
23 A simple port from PPU to SPU: minimal code changes to allow execution on the SPU are not enough. There is an important slowdown, and the overhead of DMAs and mailboxes is not the cause. [Charts: execution time (secs) of pairwise alignment and progressive alignment, PPU vs. SPU-base; 0.73x on progressive alignment.] Setup: QS21 dual Cell BE-based blade, each Cell at 3.2 GHz, gcc, Fedora Core 7.
24 Optimizing for the SPU: Loop Structure. In both phases the majority of the work is performed by 3 consecutive loop nests:
- forward: increasing indices for the sequence arrays
- backward: decreasing indices for the sequence arrays
- 3rd: uses intermediate values of forward & backward
The forward loop is the most important for pairwise alignment; the 3rd loop performs the least work in both phases.
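For orientation, a schematic of such a forward loop nest (an illustrative stand-in, not ClustalW's actual code): a dynamic-programming pass that updates a row array HH[] in place, the same array the following slides vectorize; score() and the gap penalty g are placeholders:

    /* Schematic forward pass of a dynamic-programming alignment:
       HH[0..N] holds one row of the score matrix, pre-initialized with
       the boundary row, and is updated in place as j advances. */
    extern int score(int i, int j);  /* illustrative substitution score */

    void forward_pass(int *HH, int M, int N, int g) {
        for (int i = 1; i <= M; ++i) {
            int diag = HH[0];            /* HH[j-1] as it was in row i-1 */
            for (int j = 1; j <= N; ++j) {
                int best = diag + score(i, j);               /* match    */
                if (HH[j]   - g > best) best = HH[j]   - g;  /* gap in x */
                if (HH[j-1] - g > best) best = HH[j-1] - g;  /* gap in y */
                diag  = HH[j];           /* save old value for next j    */
                HH[j] = best;
            }
        }
    }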
25 Optimizing for the SPU: Control Flow Optimization. Branch misprediction is expensive (18 cycles), so convert control flow to data flow using compare & select. Determining a maximum, if (x > max) max = x;, becomes max = spu_sel(max, x, spu_cmpgt(x, max)); [Chart: pairwise alignment, speedup 1.35x.]
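A self-contained sketch of the same compare & select idea applied to a whole loop (assuming the spu_intrinsics.h types; illustrative, not the ClustalW kernel):

    #include <spu_intrinsics.h>

    /* Branch-free maximum over n4 quadwords of ints. spu_cmpgt yields
       an all-ones mask per lane where the first operand is greater;
       spu_sel picks lanes from its second operand where the mask is set,
       so the loop body contains no branch. */
    int max_branchless(const vec_int4 *v, int n4) {
        vec_int4 m = v[0];
        for (int i = 1; i < n4; ++i)
            m = spu_sel(m, v[i], spu_cmpgt(v[i], m));
        /* Final reduction of the four lanes. */
        int best = spu_extract(m, 0);
        for (int lane = 1; lane < 4; ++lane) {
            int x = spu_extract(m, lane);
            if (x > best) best = x;
        }
        return best;
    }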
26 Optimizing for the SPU: j-loop Vectorization. Vectorize across 4 i-loop iterations: the variables f, e, and s become 4 x 32-bit vectors (vector length = 128 bits). [Figure: HH[j] accesses across the vectorized i-loop iterations.]
27 Optimizing for the SPU: Vectorization (cont.). Difficulty 1: construction of loop pre- and post-ambles. Difficulty 2: pairwise alignment computes the position of the maximum, and vectorization changes the execution order. Solution: for each of the vector lanes keep the position of its maximum; afterwards select the maximum that occurred first in original program order. [Chart: pairwise alignment speedup.]
28 Unaligned Vector Loads and Stores in Hot Loops of Pairwise Alignment. The code structure results from vectorization:

    vector int h;
    for (j = 1; j < N; ++j) {
        h[0..3] = HH[j..j+3];
        // operations on h
        HH[j..j+3] = h[0..3];
    }

[Figure: the memory accesses to HH[j] shift by one element per iteration (iterations 0 through 4), so most vector accesses are unaligned.]
29 Unaligned Vector Loads on the Cell. An unaligned vector load takes 7 instructions:

    vector int v, qw0, qw1, part0, part1;
    unsigned int shift;

    shift = (unsigned int)(&HH[j]) & 15;
    qw0   = *((vector int *)&HH[j]);
    qw1   = *(((vector int *)&HH[j]) + 1);
    part0 = spu_slqwbyte(qw0, shift);
    part1 = spu_rlmaskqwbyte(qw1, (signed int)(shift - 16));
    v     = spu_or(part0, part1);

[Figure: the unaligned vector spans two adjacent aligned quadwords.]
30 Unaligned Vector Stores on the Cell. An unaligned vector store takes 10 instructions:

    vector int v, qw0, qw1, merge0, merge1;
    vector unsigned int mask;
    unsigned int shift;

    shift  = (unsigned int)(&HH[j]) & 15;
    qw0    = *((vector int *)&HH[j]);
    qw1    = *(((vector int *)&HH[j]) + 1);
    mask   = (vector unsigned int)spu_rlmaskqwbyte(
                 spu_promote((unsigned char)0xff, 0), -shift);
    v      = spu_rlqwbyte(v, -shift);
    merge0 = spu_sel(qw0, v, mask);
    merge1 = spu_sel(v, qw1, mask);
    *((vector int *)(&HH[j-3]))       = merge0;
    *(((vector int *)(&HH[j-3])) + 1) = merge1;

[Figure: read-modify-write of the two aligned quadwords surrounding the store.]
31 Unaligned Vector Loads and Stores in Hot Loops of Pairwise Alignment: Optimizations. The address of HH[1] is vector-aligned, and each iteration accesses 2 aligned vectors. Optimizations:
- Unroll the loop 4 times
- Ensure the first iteration accesses aligned elements
- Cache memory values in registers
- Remove redundant computations
Result: one aligned vector load and store per cycle in the regime.
[Figure: memory accesses to HH[j] in iterations 0 through 4.]
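A schematic of the resulting regime loop (illustrative only): after 4x unrolling from an aligned starting element, each unrolled iteration touches exactly one aligned quadword of HH, which can be kept in a register; process() stands in for the real loop body:

    #include <spu_intrinsics.h>

    /* Illustrative regime loop after 4x unrolling: HH is assumed
       16-byte aligned and the loop starts at an aligned element, so
       every HH access is one aligned quadword. */
    extern vec_int4 process(vec_int4 h, int j);  /* stand-in for the body */

    void regime_loop(int *HH, int N) {
        vec_int4 *vHH = (vec_int4 *)HH;    /* aligned vector view of HH */
        for (int j = 0; j + 4 <= N; j += 4) {
            vec_int4 h = vHH[j / 4];       /* one aligned load          */
            h = process(h, j);             /* value stays in a register */
            vHH[j / 4] = h;                /* one aligned store         */
        }
        /* The remaining N % 4 elements go to a scalar post-amble (omitted). */
    }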
32 Optimizing for the SPU: Loop Unrolling to Avoid Unaligned DMAs. An unaligned vector access takes 4 x 17 = 68 instructions if unoptimized; this is reduced to 24 instructions in the regime. The optimization applies to multiple arrays, and is applied only to the regime loop. [Chart: pairwise alignment, execution time (secs).]
33 Optimizing for the SPU: Pairwise Alignment. [Charts: execution time (secs) and code size (KB) per optimization step, for the forward, backward, and 3rd loops.]
34 Optimizing for the SPU: Progressive Alignment. [Charts: execution time (secs) and code size (KB) per optimization step, for the forward and backward loops; speedups of 1.35x and 1.62x.]
35 Overview: Introduction, Performance Portability, ClustalW and Cell BE, Single-SPU Optimizations, Parallelization, Analysis
36 Parallelization of Pairwise Alignment. The scores for every pair of sequences (i, j) can be calculated independently; each pair forms one work package. Maximize load balance by processing packages in order of decreasing size (see the sketch below). [Chart: pairwise alignment speedup vs. number of SPUs.]
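A minimal sketch of this decreasing-size scheduling policy (illustrative names; a package's cost is roughly the product of the two sequence lengths): sort packages by decreasing cost, then greedily assign each to the least-loaded worker:

    #include <stdlib.h>

    typedef struct { int i, j; long cost; } package_t;  /* cost ~ Li * Lj */

    static int by_decreasing_cost(const void *a, const void *b) {
        long ca = ((const package_t *)a)->cost;
        long cb = ((const package_t *)b)->cost;
        return (cb > ca) - (cb < ca);    /* descending order */
    }

    /* Greedy longest-first assignment to n_workers workers; 'load'
       (length n_workers, zero-initialized by the caller) tracks each
       worker's accumulated cost, 'owner[p]' receives the assignment. */
    void schedule(package_t *pkgs, int n_pkgs,
                  long *load, int *owner, int n_workers) {
        qsort(pkgs, n_pkgs, sizeof *pkgs, by_decreasing_cost);
        for (int p = 0; p < n_pkgs; ++p) {
            int w = 0;                   /* pick the least-loaded worker */
            for (int k = 1; k < n_workers; ++k)
                if (load[k] < load[w]) w = k;
            owner[p] = w;
            load[w] += pkgs[p].cost;
        }
    }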
37 Parallelization of Progressive Alignment: Inter-loop Parallelism. Exploit parallelism between loop nests: overlap the forward, backward, and 3rd loops of successive alignments. The 3rd loop performs less work. [Figure: work scheduling of the forward, backward, and 3rd loops.]
38 Parallelization of Progressive Alignment: Intra-loop Parallelism (1). Single-threaded:

    for (i = 0; i < N; ++i) {
        s = prfscore(i, j);
        // other computations with
        // cross-iteration dependences
        D(s);
    }

Restructure as a parallel-stage pipeline: the independent prfscore() calls execute in parallel, while the dependent D(s) calls execute in program order. [Figure: pipeline of prfscore() stages feeding D(s) over time.]
39 Parallelization of Progressive Alignment: Intra-loop Parallelism (2). Parallel stage: NThreads threads execute prfscore(). Sequential stage: 1 thread executes D(s).

    // For thread T of NThreads
    for (i = 0; i < N; ++i) {
        if (i % NThreads == T) {
            s = prfscore(i, j);
            put(s, Queue[T]);
        }
    }

    // Sequential stage (1 thread)
    for (i = 0; i < N; ++i) {
        s = get(Queue[i % NThreads]);
        D(s);
    }

Demonstrated here for a round-robin distribution; in practice, take vectorization into account and distribute larger blocks to minimize branches (see the sketch below).
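The blocked distribution hinted at on the slide might look as follows, assuming a block size B chosen as a multiple of the vector width (put(), Queue[] and prfscore() are the slide's illustrative primitives, declared here to keep the sketch self-contained):

    /* Block-cyclic parallel stage: thread T processes whole blocks of B
       consecutive iterations, so the vectorized prfscore() sees
       contiguous data and the ownership test runs once per block
       instead of once per iteration. */
    #define B 16                         /* multiple of the 4-lane vector width */

    typedef struct queue queue_t;        /* opaque queue; illustrative */
    extern queue_t *Queue[];
    extern void put(int value, queue_t *q);
    extern int  prfscore(int i, int j);

    void parallel_stage(int T, int NThreads, int N, int j) {
        for (int base = 0; base < N; base += B) {
            if ((base / B) % NThreads != T)
                continue;                /* block owned by another thread */
            int end = (base + B < N) ? base + B : N;
            for (int i = base; i < end; ++i)
                put(prfscore(i, j), Queue[T]);
        }
    }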
40 Parallelization of Progressive Alignment. Exploit both inter-loop and intra-loop parallelism; inter-SPU communication is optimized by using non-blocking SPU-to-SPU local store copies. [Figure: SPU mapping, e.g. SPU 0: fwd; SPU 1: bwd and 3rd; SPUs 2 and 4: A-set prfscore; SPUs 3 and 5: B-set prfscore. Chart: progressive alignment speedup vs. number of SPUs.]
41 Overall Speedup. [Chart: speedup vs. number of SPUs for pairwise alignment, progressive alignment, calcprf1, and the total.]
42 Overview: Introduction, Performance Portability, ClustalW and Cell BE, Single-SPU Optimizations, Parallelization, Analysis
43 Total Speedup is Limited by the Least Parallel Part. [Chart: execution time (seconds) of progressive alignment and pairwise alignment vs. number of SPUs.]
44 Comparison to a Homogeneous Multicore. Dual Cell BE blade, strongly optimized, vs. AMD Opteron with 4 cores, parallelized but without optimizations. [Charts: execution time (seconds) of progressive and pairwise alignment vs. number of SPUs / number of cores.]
45 Comparison to a Homogeneous Multicore. The time spent in progressive alignment is higher on the Cell BE, both in the baseline and after optimization. [Charts as on the previous slide.]
46 1-SPU Optimizations Are Compensated by Smart Hardware. The strongly optimized 1-SPU implementation is about as fast as the single-threaded version on a general-purpose fat core. [Charts as before.]
47 What if GPP Code is Vectorized? Given the AMD/Cell ratio on pairwise alignment, it is plausible that this gap can be bridged by vectorizing the GPP code. [Charts as before.]
48 Overall, Cell BE Loses the Comparison. AMD is faster by 28%, due to the lack of performance of the Cell BE on progressive alignment. [Charts as before.]
49 Hybrid System Is Fastest. Executing pairwise alignment on the Cell BE and progressive alignment on the GPP is about 40% faster than AMD. [Charts as before.]
50 Looking back. Let's say we have done this. What performance can we obtain? Significant speedups. How much effort? Significant: several person-months. What happens to the code? Huge code transformations and duplications; we can't really maintain it any more. Will we regret this? In this case, the gains compared to homogeneous multicores are negative, due to the lack of parallelism in part of the program.
51 Conclusion. Accelerators contain many simple cores: significant optimization is required for single-core performance, each accelerator architecture benefits from different optimizations, and there is no compiler to do the job for you; fat general-purpose cores do much of this work for you. Performance improvements are worthwhile, but optimization is time-consuming, error-prone, and architecture-specific, and optimized code is nearly intractable to maintain. Beware of Amdahl's Law: scalable parallelism is required in all parts of the application.
52 Acknowledgements. Collaborators: Sean Rul, Joris D'Haene, Michiel Questier, Koen De Bosschere.