Optimizing Code for Accelerators: The Long Road to High Performance


1 Optimizing Code for Accelerators: The Long Road to High Performance Hans Vandierendonck Mons GPU Day November 9th, 2010

2 The Age of Accelerators

3 Accelerators in Real Life

4 Why Accelerators? Exploit the available transistor budget for increased performance (30:1 up to 1,000:1) and energy efficiency (30,000:1 performance/watt). [Dally et al., The Classical Computer, ISAT study, 2001] The price: sacrifice generality (application-specific designs). [Figure: latency in ps/inst]

5 GPP vs Accelerators

                     AMD Opteron 2380  NVIDIA Tesla c1060  ATI FirePro V8700  STI Cell QS21 (8 SPEs)
Core frequency       2.7 GHz           1.3 GHz             750 MHz            3.2 GHz
Processor cores      ?                 ?                   ?                  ?
Data-level par.      4-way SIMD        8-way SIMT          5-way VLIW         4-way SIMD
Memory BW            24 GB/s           ? GB/s              ? GB/s             25.6 GB/s (in), 35 GB/s (out)
Peak SP (GFLOPS)     ?                 ?                   ?                  ?
Peak DP (GFLOPS)     ?                 ?                   ?                  ?
TDP                  75 W              187 W               114 W              45 W
Feature size         45 nm             55 nm               55 nm              90 nm

(? marks values lost in transcription.)

6 Massively Parallel Architectures, e.g. the NVIDIA GPU architecture

7 Position of This Talk Let's say we want to do this. What performance can we obtain? How do we optimize the code? How much effort does it take? What happens to the code? Will we regret this? Method: implement algorithms, study performance portability, and optimize an application for the Cell processor.

8 Overview Introduction Performance Portability ClustalW and Cell BE Single-SPU Optimizations Parallelization Analysis

9 Overview Introduction Performance Portability ClustalW and Cell BE Single-SPU Optimizations Parallelization Analysis

10 Method: Compare Code Optimizations Across Processors Optimizations: loop unrolling (replicate the loop body) and vectorization (e.g. a single vector instruction ai R1, R2, 1 operating on lanes A B C D). Processors: CPU (Core i7), Tesla c1060, FirePro V8700, Cell QS21. Evaluate the code optimizations on several accelerators; OpenCL provides functional portability.

11 Performance Impact of Loop Unrolling and Vectorization [Chart: relative execution time, scalar vs vector, for CPU, Tesla, FirePro and Cell] Benchmark: cp. Block size: 1 on Cell, 128 on the others.

12 Performance Impact of Loop Unrolling and Vectorization (cont.) The CPU benefits from vectorization but is indifferent to loop unrolling. [Same chart; benchmark: cp]

13 Performance Impact of Loop Unrolling and Vectorization (cont.) Vectorization is critical to Cell, and Cell benefits from loop unrolling. [Same chart; benchmark: cp]

14 Performance Impact of Loop Unrolling and Vectorization (cont.) Optimizations interact with each other: the FirePro is the most sensitive, and too much unrolling degrades performance. [Same chart; benchmark: cp]

15 Performance Impact of Loop Unrolling and Vectorization (cont.) The same conclusions hold for a second benchmark: vectorization is critical to Cell, and Cell benefits from loop unrolling. [Charts: benchmarks cp and mri-fhd]

16 Performance Impact of Thread Block Size (Parallelism) Thread block size is the most important parameter for Tesla. [Chart: execution time vs block size for CPU, Tesla, FirePro and Cell] Benchmark: mri-fhd; loop unrolling: 2; vectorization: yes.

17 What have we learned from this? Performance portability: no free lunch. Optimizations are architecture-specific, sensitivity to them is program-specific, optimizations interact with each other, and the speedups are potentially large.

18 Overview Introduction Performance Portability ClustalW and Cell BE Single-SPU Optimizations Parallelization Analysis

19 Cell processor: a high-level overview
Power Processing Element (PPE): 64-bit PowerPC RISC; 32 KB IL1, 32 KB DL1, 512 KB L2; in-order instruction issue; 2-way superscalar.
Synergistic Processing Element (SPE): 128-bit in-order RISC-like vector processor; 256 KB Local Store with explicit memory management; SIMD; no hardware branch predictor.
System memory interface: 16 B/cycle, 25.6 GB/s (at 1.6 GHz).
Element Interconnect Bus (EIB): ? GB/s peak bandwidth; 4 data rings of 16 bytes.
A heterogeneous multi-core.

20 ClustalW: a program for alignment of nucleotide or amino acid sequences (e.g. atgagttcttaa, gattgttgcc, gccttcttgtta, cgttaacttc). N = number of sequences, L = typical sequence length. Pairwise alignment: space O(N^2 + L^2), time O(N^2 L^2). Guide tree: space O(N^2), time O(N^4). Progressive alignment: space O(N^2 + L^2), time O(N^4 + L^2).

21 Analysis of ClustalW: determine what is important. [Chart: percentage of execution time spent in PW (pairwise alignment), GT (guide tree) and PA (progressive alignment) for N = 10, 50, 100, 500, 1000 and increasing sequence lengths]

22 Overview Introduction Performance Portability ClustalW and Cell BE Single-SPU Optimizations Parallelization Analysis

23 A Simple Port from PPU to SPU Minimal code changes to allow execution on the SPU cause an important slowdown (down to 0.73x); the overhead of DMAs and mailboxes is not the cause. A simple port is not enough. [Charts: execution time of pairwise and progressive alignment, PPU vs SPU-base] Setup: QS21 dual Cell BE-based blade, each Cell at 3.2 GHz, gcc, Fedora Core 7.

24 Optimizing for the SPU: Loop Structure In both phases the majority of the work is performed by 3 consecutive loop nests: a forward loop (increasing indices into the sequence arrays), a backward loop (decreasing indices), and a 3rd loop that uses intermediate values of the forward and backward loops. The forward loop is the most important one for pairwise alignment; the 3rd loop performs the least work in both phases.

25 Optimizing for the SPU: Control Flow Optimization Branch misprediction is expensive on the SPU (18 cycles), so convert control flow to data flow using compare & select. Determining a maximum

    if (x > max) max = x;

becomes

    max = spu_sel(max, x, spu_cmpgt(x, max));

[Chart: pairwise alignment execution time; speedup x 1.35]

26 Optimizing for the SPU: j-loop Vectorization The values f, e and s are 4 x 32 bits wide. Vectorize by taking 4 i-loop iterations together (vector length = 128 bits) while updating HH[j]. [Diagram: 4 i-loop lanes advancing over HH[j]]

27 Optimizing for the SPU: Vectorization (cont.) Difficulty 1: construction of loop pre- and post-ambles. Difficulty 2: pairwise alignment computes the position of the maximum, but vectorization changes the execution order. Solution: for each of the vector lanes keep the position of its maximum; afterwards select the maximum that occurred first in original program order. [Chart: pairwise alignment execution time]

28 Unaligned Vector Loads and Stores in Hot Loops of Pairwise Alignment Loop structure (pseudo-code resulting from vectorization):

    vector int h;
    for (j = 1; j < N; ++j) {
        h[0..3] = HH[j..j+3];
        // operations on h
        HH[j..j+3] = h[0..3];
    }

Successive iterations access HH[j] at addresses that are not vector-aligned. [Diagram: memory accesses of iterations 0..4]

29 Unaligned Vector Loads on the Cell An unaligned vector load takes 7 instructions:

    vector int v, qw0, qw1, part0, part1;
    unsigned int shift;

    shift = (unsigned int)(&HH[j]) & 15;
    qw0 = *((vector int *)&HH[j]);
    qw1 = *(((vector int *)&HH[j]) + 1);
    part0 = spu_slqwbyte(qw0, shift);
    part1 = spu_rlmaskqwbyte(qw1, (signed int)(shift - 16));
    v = spu_or(part0, part1);

[Diagram: the two aligned quadwords straddling HH[j] and HH[j+1]]

30 Unaligned Vector Stores on the Cell An unaligned vector store takes 10 instructions:

    vector int v, qw0, qw1, merge0, merge1;
    vector unsigned int mask;
    unsigned int shift;

    shift = (unsigned int)(&HH[j]) & 15;
    qw0 = *((vector int *)&HH[j]);
    qw1 = *(((vector int *)&HH[j]) + 1);
    mask = (vector unsigned int)
        spu_rlmaskqwbyte(spu_promote((unsigned char)0xff, 0), -shift);
    v = spu_rlqwbyte(v, -shift);
    merge0 = spu_sel(qw0, v, mask);
    merge1 = spu_sel(v, qw1, mask);
    *((vector int *)&HH[j-3]) = merge0;
    *(((vector int *)&HH[j-3]) + 1) = merge1;

[Diagram: merging v into the two aligned quadwords around HH[j] and HH[j+1]]

31 Unaligned Vector Loads and Stores in Hot Loops of Pairwise Alignment Observation: the address of HH[1] is vector-aligned, and the iterations access 2 aligned vectors. Optimizations: unroll the loop 4 times; ensure the first iteration accesses aligned elements; cache memory values in registers; remove redundant computations. Result: one aligned vector load and store per cycle in regime. [Diagram: memory accesses of iterations 0..4]

32 Optimizing for the SPU: Loop Unrolling to Avoid Unaligned DMAs Unaligned vector accesses take 4*17 = 68 instructions if unoptimized, reduced to 24 instructions in regime. The optimization applies to multiple arrays and is applied only to the regime loop. [Chart: pairwise alignment execution time]

33 Optimizing for the SPU: Pairwise Alignment [Charts: execution time and code size of the forward, backward and 3rd loops across the optimization steps]

34 Optimizing for the SPU: Progressive Alignment [Charts: execution time and code size of the forward and backward loops; speedups x 1.35, x 1.62, x ?]

35 Overview Introduction Performance Portability ClustalW and Cell BE Single-SPU Optimizations Parallelization Analysis

36 Parallelization of Pairwise Alignment Scores for every pair of sequences (i, j) can be calculated independently; one pair forms 1 work package. Maximize load balance by processing packages in order of decreasing size. [Chart: pairwise alignment speedup vs number of SPUs]

37 Parallelization of Progressive Alignment: Inter-loop Parallelism Exploit parallelism between loop nests: overlap the forward and backward loops in the work schedule. The 3rd loop performs less work. [Diagram: work scheduling of the forward, backward and 3rd loops]

38 Parallelization of Progressive Alignment: Intra-loop Parallelism (1) Single-threaded code:

    for (i = 0; i < N; ++i) {
        s = prfscore(i, j);
        // other computations with
        // cross-iteration dependences
        D(s);
    }

Parallel-stage pipeline: the prfscore() computations run concurrently, while the D(s) computations execute one after another. [Diagram: prfscore and D(s) stages over time]

39 Parallelization of Progressive Alignment: Intra-loop Parallelism (2) Parallel stage: prfscore(); NThreads threads execute code in this stage. Sequential stage: D(s); 1 thread executes code in this stage.

    // For thread T of NThreads
    for (i = 0; i < N; ++i) {
        if (i % NThreads == T) {
            s = prfscore(i, j);
            put(s, Queue[T]);
        }
    }

    // Sequential stage
    for (i = 0; i < N; ++i) {
        s = get(Queue[i % NThreads]);
        D(s);
    }

Demonstrated for a round-robin distribution; in practice, take vectorization into account and distribute larger blocks to minimize branches.

40 Parallelization of Progressive Alignment (cont.) Exploit both inter-loop and intra-loop parallelism; optimize inter-SPU communication by using non-blocking SPU-to-SPU local store copies. [Diagram: SPU 0 runs fwd, SPU 1 runs bwd and 3rd; SPUs 2 and 4 run prfscore on the A-set, SPUs 3 and 5 on the B-set] [Chart: progressive alignment speedup vs number of SPUs]

41 Overall Speedup [Chart: speedup of pairwise alignment, progressive alignment, calcprf1 and the total vs number of SPUs]

42 Overview Introduction Performance Portability ClustalW and Cell BE Single-SPU Optimizations Parallelization Analysis

43 Total Speedup is Limited by the Least Parallel Part [Chart: execution time of progressive and pairwise alignment vs number of SPUs]

44 Comparison to a Homogeneous Multicore Dual Cell BE blade, strongly optimized, vs AMD Opteron (x4 cores), parallelized but without single-core optimizations. [Charts: execution time of progressive and pairwise alignment vs number of SPUs / number of cores]

45 Comparison to a Homogeneous Multicore (cont.) Time spent in progressive alignment is higher on the Cell BE, both in the baseline and after optimization. [Same charts]

46 1-SPU Optimizations Are Compensated by Smart Hardware The strongly optimized 1-SPU implementation is only as fast as the single-threaded version on a general-purpose fat core. [Same charts]

47 What if GPP Code is Vectorized? The ratio of AMD to Cell execution time for pairwise alignment suggests that it is plausible this gap can be bridged by vectorizing the GPP code. [Same charts]

48 Overall, the Cell BE Loses the Comparison AMD is faster by 28%, due to the lack of performance of the Cell BE on progressive alignment. [Same charts]

49 A Hybrid System Is Fastest Execute pairwise alignment on the Cell BE and progressive alignment on the GPP: about 40% faster than AMD alone. [Same charts]

50 Looking Back Let's say we have done this. What performance can we obtain? Significant speedups. How much effort? Significant: several person-months. What happens to the code? Huge code transformations and duplications; we can't really maintain it any more. Will we regret this? In this case, yes: the gains compared to homogeneous multicores are negative, due to a lack of parallelism in part of the program.

51 Conclusion Accelerators contain many simple cores, so significant optimization is required for single-core performance. Each accelerator architecture benefits from different optimizations, and there is no compiler to do the job for you; fat general-purpose cores do much of this work for you. The performance improvements are worthwhile, but optimization is time-consuming, error-prone and architecture-specific, and optimized code is nearly intractable to maintain. Beware of Amdahl's Law: scalable parallelism is required in all parts of the application.

52 Acknowledgements Collaborators: Sean Rul, Joris D'Haene, Michiel Questier, Koen De Bosschere


More information

Lecture 3: Evaluating Computer Architectures. Software & Hardware: The Virtuous Cycle?

Lecture 3: Evaluating Computer Architectures. Software & Hardware: The Virtuous Cycle? Lecture 3: Evaluating Computer Architectures Announcements - Reminder: Homework 1 due Thursday 2/2 Last Time technology back ground Computer elements Circuits and timing Virtuous cycle of the past and

More information

BEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA

BEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA BEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA AGENDA INTRO TO BEAGLEBONE BLACK HARDWARE & SPECS CORTEX-A8 ARMV7 PROCESSOR PROS & CONS VS RASPBERRY PI WHEN TO USE BEAGLEBONE BLACK Single

More information

ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU

ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU Computer Science 14 (2) 2013 http://dx.doi.org/10.7494/csci.2013.14.2.243 Marcin Pietroń Pawe l Russek Kazimierz Wiatr ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU Abstract This paper presents

More information

OpenACC Programming and Best Practices Guide

OpenACC Programming and Best Practices Guide OpenACC Programming and Best Practices Guide June 2015 2015 openacc-standard.org. All Rights Reserved. Contents 1 Introduction 3 Writing Portable Code........................................... 3 What

More information

Evaluation of CUDA Fortran for the CFD code Strukti

Evaluation of CUDA Fortran for the CFD code Strukti Evaluation of CUDA Fortran for the CFD code Strukti Practical term report from Stephan Soller High performance computing center Stuttgart 1 Stuttgart Media University 2 High performance computing center

More information

Accelerating Simulation & Analysis with Hybrid GPU Parallelization and Cloud Computing

Accelerating Simulation & Analysis with Hybrid GPU Parallelization and Cloud Computing Accelerating Simulation & Analysis with Hybrid GPU Parallelization and Cloud Computing Innovation Intelligence Devin Jensen August 2012 Altair Knows HPC Altair is the only company that: makes HPC tools

More information

HPC with Multicore and GPUs

HPC with Multicore and GPUs HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville CS 594 Lecture Notes March 4, 2015 1/18 Outline! Introduction - Hardware

More information

GPU Parallel Computing Architecture and CUDA Programming Model

GPU Parallel Computing Architecture and CUDA Programming Model GPU Parallel Computing Architecture and CUDA Programming Model John Nickolls Outline Why GPU Computing? GPU Computing Architecture Multithreading and Arrays Data Parallel Problem Decomposition Parallel

More information

This Unit: Multithreading (MT) CIS 501 Computer Architecture. Performance And Utilization. Readings

This Unit: Multithreading (MT) CIS 501 Computer Architecture. Performance And Utilization. Readings This Unit: Multithreading (MT) CIS 501 Computer Architecture Unit 10: Hardware Multithreading Application OS Compiler Firmware CU I/O Memory Digital Circuits Gates & Transistors Why multithreading (MT)?

More information

Scalability and Classifications

Scalability and Classifications Scalability and Classifications 1 Types of Parallel Computers MIMD and SIMD classifications shared and distributed memory multicomputers distributed shared memory computers 2 Network Topologies static

More information

Next-Generation PRIMEHPC. Copyright 2014 FUJITSU LIMITED

Next-Generation PRIMEHPC. Copyright 2014 FUJITSU LIMITED Next-Generation PRIMEHPC The K computer and the evolution of PRIMEHPC K computer PRIMEHPC FX10 Post-FX10 CPU SPARC64 VIIIfx SPARC64 IXfx SPARC64 XIfx Peak perf. 128 GFLOPS 236.5 GFLOPS 1TFLOPS ~ # of cores

More information

Computer Architecture TDTS10

Computer Architecture TDTS10 why parallelism? Performance gain from increasing clock frequency is no longer an option. Outline Computer Architecture TDTS10 Superscalar Processors Very Long Instruction Word Processors Parallel computers

More information

MAGMA: Matrix Algebra on GPU and Multicore Architectures

MAGMA: Matrix Algebra on GPU and Multicore Architectures MAGMA: Matrix Algebra on GPU and Multicore Architectures Presented by Scott Wells Assistant Director Innovative Computing Laboratory (ICL) College of Engineering University of Tennessee, Knoxville Overview

More information

Parallel Programming Survey

Parallel Programming Survey Christian Terboven 02.09.2014 / Aachen, Germany Stand: 26.08.2014 Version 2.3 IT Center der RWTH Aachen University Agenda Overview: Processor Microarchitecture Shared-Memory

More information

AMD Opteron Quad-Core

AMD Opteron Quad-Core AMD Opteron Quad-Core a brief overview Daniele Magliozzi Politecnico di Milano Opteron Memory Architecture native quad-core design (four cores on a single die for more efficient data sharing) enhanced

More information

Exploring the Design of the Cortex-A15 Processor ARM s next generation mobile applications processor. Travis Lanier Senior Product Manager

Exploring the Design of the Cortex-A15 Processor ARM s next generation mobile applications processor. Travis Lanier Senior Product Manager Exploring the Design of the Cortex-A15 Processor ARM s next generation mobile applications processor Travis Lanier Senior Product Manager 1 Cortex-A15: Next Generation Leadership Cortex-A class multi-processor

More information

Experiences on using GPU accelerators for data analysis in ROOT/RooFit

Experiences on using GPU accelerators for data analysis in ROOT/RooFit Experiences on using GPU accelerators for data analysis in ROOT/RooFit Sverre Jarp, Alfio Lazzaro, Julien Leduc, Yngve Sneen Lindal, Andrzej Nowak European Organization for Nuclear Research (CERN), Geneva,

More information

Oracle Database Reliability, Performance and scalability on Intel Xeon platforms Mitch Shults, Intel Corporation October 2011

Oracle Database Reliability, Performance and scalability on Intel Xeon platforms Mitch Shults, Intel Corporation October 2011 Oracle Database Reliability, Performance and scalability on Intel platforms Mitch Shults, Intel Corporation October 2011 1 Intel Processor E7-8800/4800/2800 Product Families Up to 10 s and 20 Threads 30MB

More information

A Scalable VISC Processor Platform for Modern Client and Cloud Workloads

A Scalable VISC Processor Platform for Modern Client and Cloud Workloads A Scalable VISC Processor Platform for Modern Client and Cloud Workloads Mohammad Abdallah Founder, President and CTO Soft Machines Linley Processor Conference October 7, 2015 Agenda Soft Machines Background

More information

Parallel programming: Introduction to GPU architecture. Sylvain Collange Inria Rennes Bretagne Atlantique

Parallel programming: Introduction to GPU architecture. Sylvain Collange Inria Rennes Bretagne Atlantique Parallel programming: Introduction to GPU architecture Sylvain Collange Inria Rennes Bretagne Atlantique sylvain.collange@inria.fr Outline of the course March 2: Introduction to GPU architecture Let's

More information

Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms

Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms Amani AlOnazi, David E. Keyes, Alexey Lastovetsky, Vladimir Rychkov Extreme Computing Research Center,

More information

Introduction to GPU Architecture

Introduction to GPU Architecture Introduction to GPU Architecture Ofer Rosenberg, PMTS SW, OpenCL Dev. Team AMD Based on From Shader Code to a Teraflop: How GPU Shader Cores Work, By Kayvon Fatahalian, Stanford University Content 1. Three

More information

David Rioja Redondo Telecommunication Engineer Englobe Technologies and Systems

David Rioja Redondo Telecommunication Engineer Englobe Technologies and Systems David Rioja Redondo Telecommunication Engineer Englobe Technologies and Systems About me David Rioja Redondo Telecommunication Engineer - Universidad de Alcalá >2 years building and managing clusters UPM

More information

CUDA Optimization with NVIDIA Tools. Julien Demouth, NVIDIA

CUDA Optimization with NVIDIA Tools. Julien Demouth, NVIDIA CUDA Optimization with NVIDIA Tools Julien Demouth, NVIDIA What Will You Learn? An iterative method to optimize your GPU code A way to conduct that method with Nvidia Tools 2 What Does the Application

More information

ECLIPSE Performance Benchmarks and Profiling. January 2009

ECLIPSE Performance Benchmarks and Profiling. January 2009 ECLIPSE Performance Benchmarks and Profiling January 2009 Note The following research was performed under the HPC Advisory Council activities AMD, Dell, Mellanox, Schlumberger HPC Advisory Council Cluster

More information

INF5063: Programming heterogeneous multi-core processors. September 13, 2010

INF5063: Programming heterogeneous multi-core processors. September 13, 2010 INF5063: Programming heterogeneous multi-core processors September 13, 2010 Overview Course topic and scope Background for the use and parallel processing using heterogeneous multi-core processors Examples

More information

Communicating with devices

Communicating with devices Introduction to I/O Where does the data for our CPU and memory come from or go to? Computers communicate with the outside world via I/O devices. Input devices supply computers with data to operate on.

More information

ARM Cortex A9. Alyssa Colyette Xiao Ling Zhuang

ARM Cortex A9. Alyssa Colyette Xiao Ling Zhuang ARM Cortex A9 Alyssa Colyette Xiao Ling Zhuang Outline Introduction ARMv7-A ISA Cortex-A9 Microarchitecture o Single and Multicore Processor Advanced Multicore Technologies Integrating System on Chips

More information

ARM Cortex-A8 Processor

ARM Cortex-A8 Processor ARM Cortex-A8 Processor High Performances And Low Power for Portable Applications Architectures for Multimedia Systems Prof. Cristina Silvano Gianfranco Longi Matr. 712351 ARM Partners 1 ARM Powered Products

More information

Intel Xeon Processor E5-2600

Intel Xeon Processor E5-2600 Intel Xeon Processor E5-2600 Best combination of performance, power efficiency, and cost. Platform Microarchitecture Processor Socket Chipset Intel Xeon E5 Series Processors and the Intel C600 Chipset

More information

VLIW Processors. VLIW Processors

VLIW Processors. VLIW Processors 1 VLIW Processors VLIW ( very long instruction word ) processors instructions are scheduled by the compiler a fixed number of operations are formatted as one big instruction (called a bundle) usually LIW

More information

COMP 422, Lecture 3: Physical Organization & Communication Costs in Parallel Machines (Sections 2.4 & 2.5 of textbook)

COMP 422, Lecture 3: Physical Organization & Communication Costs in Parallel Machines (Sections 2.4 & 2.5 of textbook) COMP 422, Lecture 3: Physical Organization & Communication Costs in Parallel Machines (Sections 2.4 & 2.5 of textbook) Vivek Sarkar Department of Computer Science Rice University vsarkar@rice.edu COMP

More information

RAID. RAID 0 No redundancy ( AID?) Just stripe data over multiple disks But it does improve performance. Chapter 6 Storage and Other I/O Topics 29

RAID. RAID 0 No redundancy ( AID?) Just stripe data over multiple disks But it does improve performance. Chapter 6 Storage and Other I/O Topics 29 RAID Redundant Array of Inexpensive (Independent) Disks Use multiple smaller disks (c.f. one large disk) Parallelism improves performance Plus extra disk(s) for redundant data storage Provides fault tolerant

More information

A GPU COMPUTING PLATFORM (SAGA) AND A CFD CODE ON GPU FOR AEROSPACE APPLICATIONS

A GPU COMPUTING PLATFORM (SAGA) AND A CFD CODE ON GPU FOR AEROSPACE APPLICATIONS A GPU COMPUTING PLATFORM (SAGA) AND A CFD CODE ON GPU FOR AEROSPACE APPLICATIONS SUDHAKARAN.G APCF, AERO, VSSC, ISRO 914712564742 g_suhakaran@vssc.gov.in THOMAS.C.BABU APCF, AERO, VSSC, ISRO 914712565833

More information

Making Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association

Making Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association Making Multicore Work and Measuring its Benefits Markus Levy, president EEMBC and Multicore Association Agenda Why Multicore? Standards and issues in the multicore community What is Multicore Association?

More information

Program Optimization Study on a 128-Core GPU

Program Optimization Study on a 128-Core GPU Program Optimization Study on a 128-Core GPU Shane Ryoo, Christopher I. Rodrigues, Sam S. Stone, Sara S. Baghsorkhi, Sain-Zee Ueng, and Wen-mei W. Hwu Yu, Xuan Dept of Computer & Information Sciences University

More information

Clusters: Mainstream Technology for CAE

Clusters: Mainstream Technology for CAE Clusters: Mainstream Technology for CAE Alanna Dwyer HPC Division, HP Linux and Clusters Sparked a Revolution in High Performance Computing! Supercomputing performance now affordable and accessible Linux

More information

Outline Overview The CUDA architecture Memory optimization Execution configuration optimization Instruction optimization Summary

Outline Overview The CUDA architecture Memory optimization Execution configuration optimization Instruction optimization Summary OpenCL Optimization Outline Overview The CUDA architecture Memory optimization Execution configuration optimization Instruction optimization Summary 2 Overall Optimization Strategies Maximize parallel

More information

Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga

Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga Programming models for heterogeneous computing Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga Talk outline [30 slides] 1. Introduction [5 slides] 2.

More information

Introduction to GPU Programming Languages

Introduction to GPU Programming Languages CSC 391/691: GPU Programming Fall 2011 Introduction to GPU Programming Languages Copyright 2011 Samuel S. Cho http://www.umiacs.umd.edu/ research/gpu/facilities.html Maryland CPU/GPU Cluster Infrastructure

More information

White Paper COMPUTE CORES

White Paper COMPUTE CORES White Paper COMPUTE CORES TABLE OF CONTENTS A NEW ERA OF COMPUTING 3 3 HISTORY OF PROCESSORS 3 3 THE COMPUTE CORE NOMENCLATURE 5 3 AMD S HETEROGENEOUS PLATFORM 5 3 SUMMARY 6 4 WHITE PAPER: COMPUTE CORES

More information

Chapter 12: Multiprocessor Architectures. Lesson 01: Performance characteristics of Multiprocessor Architectures and Speedup

Chapter 12: Multiprocessor Architectures. Lesson 01: Performance characteristics of Multiprocessor Architectures and Speedup Chapter 12: Multiprocessor Architectures Lesson 01: Performance characteristics of Multiprocessor Architectures and Speedup Objective Be familiar with basic multiprocessor architectures and be able to

More information

Implementation of Canny Edge Detector of color images on CELL/B.E. Architecture.

Implementation of Canny Edge Detector of color images on CELL/B.E. Architecture. Implementation of Canny Edge Detector of color images on CELL/B.E. Architecture. Chirag Gupta,Sumod Mohan K cgupta@clemson.edu, sumodm@clemson.edu Abstract In this project we propose a method to improve

More information

An Introduction to Parallel Computing/ Programming

An Introduction to Parallel Computing/ Programming An Introduction to Parallel Computing/ Programming Vicky Papadopoulou Lesta Astrophysics and High Performance Computing Research Group (http://ahpc.euc.ac.cy) Dep. of Computer Science and Engineering European

More information