High Performance Computing: The processor

1 High Performance Computing: The processor Erik Saule ITCS 6010/8010: Special Topics: High Performance Computing. 01/16/2014 Erik Saule (6010/8010 HPC) The Processor 01/16/ / 27

2 Outline
1 General architecture
2 Vector processing
3 Hardware Threading
4 Resources

3 Schematic (Nehalem: Intel Core i7)
[Figure: schematic of the Nehalem processor. Source: intel.com]

7 A classic RISC pipeline
Typical architecture
Instructions are pushed into a pipeline one after the other.
[Figure: classic RISC pipeline. Source: a Georgia Tech assignment by Hyesoon Kim]
ADD R0, R1
LI R0, +42
LOAD R0, R1
BNZ R0, +3
Instruction pipelining
[Figure: instruction pipelining. Source: Wikipedia]
One instruction per cycle (thr).
Pipeline Stall (Bubbles)
[Figure: pipeline stall. Source: Wikipedia]

12 How to avoid bubbles?
Control Hazard
Happens when the processor has not yet decided which instruction to execute next. Typically because of a jump, which might be either:
  Jump +12
  JumpR R0
  BNZ R0, +2
Branch Prediction
Keep a small table of what was chosen before, and guess which choice will be taken. If wrong, flush the pipeline.
Data Hazard
An instruction uses the result of the previous instruction.
  R0 = R1+R2
  R3 = R0+R4
Loop in the ALU
Add an output to the ALU that feeds the next Instruction Decode in case it is needed. (But complicated for more than 1 cycle of difference.)
Out-of-order Execution
Keep a pool of upcoming instructions and execute a safe one.
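
The "small table of what was chosen before" can be sketched as a table of 2-bit saturating counters indexed by the branch address. The 2-bit scheme, the table size, and every name below (`branch_predictor`, `bp_predict`, `bp_update`, `bp_count_flushes`) are assumptions made for illustration; the slides do not say which scheme a given processor uses.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical predictor: one 2-bit saturating counter per table slot.
   Counter values 0-1 predict "not taken", 2-3 predict "taken". */
#define BP_ENTRIES 16u

typedef struct {
    uint8_t counter[BP_ENTRIES];
} branch_predictor;

/* Start every slot at 1 ("weakly not taken"). */
void bp_init(branch_predictor *bp) {
    for (unsigned i = 0; i < BP_ENTRIES; ++i) bp->counter[i] = 1;
}

/* Guess the outcome of the branch located at address pc. */
bool bp_predict(const branch_predictor *bp, uint32_t pc) {
    return bp->counter[pc % BP_ENTRIES] >= 2;
}

/* Record the real outcome once the branch has been resolved. */
void bp_update(branch_predictor *bp, uint32_t pc, bool taken) {
    uint8_t *c = &bp->counter[pc % BP_ENTRIES];
    if (taken) {
        if (*c < 3) ++*c;
    } else {
        if (*c > 0) --*c;
    }
}

/* Count mispredictions (each one would flush the pipeline) for a loop
   branch that is taken (iterations - 1) times and then falls through. */
unsigned bp_count_flushes(branch_predictor *bp, uint32_t pc, unsigned iterations) {
    unsigned flushes = 0;
    for (unsigned i = 0; i < iterations; ++i) {
        bool taken = (i + 1 < iterations);
        if (bp_predict(bp, pc) != taken) ++flushes;
        bp_update(bp, pc, taken);
    }
    return flushes;
}
```

With this scheme a loop branch costs a flush on its first iteration (the table has not learned it yet) and on its exit; running the same loop again only pays for the exit.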

13 Core architecture (Nehalem)
[Figure: Nehalem core architecture. Source: TAMU supercomputing facility.]

16 Summary
Modern multicore processors are complicated.
There are shared resources between cores:
  I/O port
  L3 cache
  Memory Controller(s)
It is VERY difficult to predict their behavior.
One core can retire more than one instruction per cycle.
The philosophy of RISC optimization still applies.
Writing efficient code manually is going to be tedious.

19 Single Instruction Multiple Data
Flynn's Taxonomy
Single Instruction Single Data: Simple processor model.
Single Instruction Multiple Data: Vector processor.
Multiple Instruction Single Data: Odd and uncommon.
Multiple Instruction Multiple Data: Typically shared memory.
Principle
Perform the same operation on multiple data in a single instruction.
In a typical processor, it uses special registers which contain multiple values.
Often called packed registers (Packed Integer, Packed Double, ...)

20 History
Intel processors
MMX (1996): 64-bit integer packed registers (overlap with the FPU).
SSE (1999): new 128-bit registers for single precision.
SSE2 (2001): extends SSE to integers and double precision.
SSE3 (2004): horizontal operations, misaligned reads, conversions.
SSSE3 (2006): new minor (mostly horizontal) operations.
SSE4 (2006): minor, string, bit counts.
AES (2008): to help AES encryption.
AVX (2008): introduces 256-bit registers and 3-operand instructions. Overlaps with the SSE registers.
AVX2 (2013): expands AVX to most instructions. FMA. Gather.
Some nuances with AMD, but mostly the same.
Other classical ones: AltiVec for Power and Neon for ARM.

25 How to use vector instructions
Code and pray
The compiler can do some of it. Typically does some vectorization with -O2. Might be necessary to use some -march -mtune options.
Guide the compiler
  #pragma simd
  #pragma vector aligned
  Intel Cilk Plus Elemental extension: __declspec(vector)
Write assembly
Long and tedious. The previous section showed how difficult understanding architectures can be.
Use high level types
Use C++ types and operator overloading to achieve some abstraction.
Use intrinsics/builtins
Most compilers expose vector types on which built-in/intrinsic functions can be applied.
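
A minimal sketch of the "code and pray" route: a loop written so that a compiler has a fair chance of vectorizing it on its own. The function name `saxpy` and the use of C99 `restrict` are assumptions, not taken from the slides. Whether vectorization actually happens depends on the compiler, its version, and its flags (for example `-O3 -march=native` with GCC), so it is worth checking the compiler's vectorization report.

```c
#include <stddef.h>

/* y[i] = alpha * x[i] + y[i]
   `restrict` promises the compiler that x and y do not overlap, which
   removes one common obstacle to automatic vectorization. */
void saxpy(size_t n, float alpha, const float *restrict x, float *restrict y) {
    for (size_t i = 0; i < n; ++i) {
        y[i] = alpha * x[i] + y[i];
    }
}
```

The loop body has no branch and no dependency between iterations, which is the shape auto-vectorizers handle best.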

26 AVX (SandyBridge) and AVX2 (Haswell)
Registers
There are 16 registers of 256 bits called YMM[0-15]. They overlap with the XMM[0-15] registers for SSE.
Declare them in C (with proper include):
  __m256: single precision
  __m256d: double precision
  __m256i: integers
Notice the lack of signedness and size of the integers.
All operations are suffixed with the type of data they operate on:
  ps: packed single precision
  pd: packed double precision
  epi32: signed 32-bit integers (mostly AVX2)
  epu8: unsigned 8-bit integers (mostly AVX2)
Note: read the documentation carefully, many are confusing.

28 Different kinds of instructions
Unary vertical operations
  __m256d _mm256_sqrt_pd(__m256d a);
  __m256 _mm256_rsqrt_ps(__m256 a);
  __m256 _mm256_rcp_ps(__m256 a);
For instance:
  __m256d a = _mm256_set_pd(4.0, 9.0, 16.0, 25.0);
  __m256d b = _mm256_sqrt_pd(a);

30 Different kinds of instructions
Binary vertical operations
  __m256i _mm256_add_epi16(__m256i m1, __m256i m2); (AVX2)
  __m256 _mm256_mul_ps(__m256 m1, __m256 m2);
  __m256d _mm256_div_pd(__m256d m1, __m256d m2);
  __m256 _mm256_max_ps(__m256 m1, __m256 m2);
  __m256d _mm256_xor_pd(__m256d m1, __m256d m2);
  __m256 _mm256_fmadd_ps(__m256 a, __m256 b, __m256 c); (FMA, available on Haswell alongside AVX2)
For instance:
  __m256d a = _mm256_set_pd(1., 2., 3., 4.);
  __m256d b = _mm256_set_pd(10., 20., 30., 40.);
  __m256d c = _mm256_add_pd(a, b);

32 Different kinds of instructions
Horizontal operations
  __m256d _mm256_hadd_pd(__m256d m1, __m256d m2);
  __m256 _mm256_dp_ps(__m256 m1, __m256 m2, const int mask);
For instance:
  __m256d a = _mm256_set_pd(1., 2., 3., 4.);
  __m256d b = _mm256_set_pd(10., 20., 30., 40.);
  __m256d c = _mm256_hadd_pd(a, b);
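
Horizontal operations are easy to misread, so here is a hedged sketch that uses `_mm256_hadd_pd` to sum the four doubles of one register. Note that `_mm256_hadd_pd` adds neighbouring pairs inside each 128-bit half, and that `_mm256_set_pd` takes its arguments from the highest element down; in the slide's example, `c` therefore holds 7, 70, 3, 30 when read from element 0 upward. The helper name `hsum4` and the scalar fallback used when the file is not compiled with AVX enabled (for example with `-mavx`) are assumptions, not part of the slides.

```c
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* Sum of p[0] + p[1] + p[2] + p[3]. */
double hsum4(const double *p) {
#if defined(__AVX__)
    __m256d v  = _mm256_loadu_pd(p);            /* [p0, p1, p2, p3]              */
    __m256d h  = _mm256_hadd_pd(v, v);          /* [p0+p1, p0+p1, p2+p3, p2+p3]  */
    __m128d lo = _mm256_castpd256_pd128(h);     /* lower half: p0+p1             */
    __m128d hi = _mm256_extractf128_pd(h, 1);   /* upper half: p2+p3             */
    return _mm_cvtsd_f64(_mm_add_sd(lo, hi));   /* (p0+p1) + (p2+p3)             */
#else
    /* Scalar fallback with the same association order. */
    return (p[0] + p[1]) + (p[2] + p[3]);
#endif
}
```

The extra extract step is needed because no single AVX instruction adds across the two 128-bit halves.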

34 Different kinds of instructions
Getting data from regular registers
  __m256i _mm256_setzero_si256(void);
  __m256 _mm256_set1_ps(float);
  __m256d _mm256_set_pd(double, double, double, double);
Getting data from vector register
  __m128i _mm256_extractf128_si256(__m256i m1, const int offset);
  __m256d _mm256_shuffle_pd(__m256d m1, __m256d m2, const int select);
  __m256 _mm256_blend_ps(__m256 m1, __m256 m2, const int mask);
  __m256 _mm256_permute_ps(__m256 m1, int control);

35 Different kinds of instructions
Getting data from memory
  __m256 _mm256_load_ps(float const *a); (requires 32-byte alignment)
  __m256d _mm256_loadu_pd(double const *a); (unaligned)
  __m256 _mm256_broadcast_ps(__m128 const *a);
  __m256d _mm256_broadcast_sd(double const *a);
  __m128d _mm_i64gather_pd(double const *base, __m128i vindex, const int scale); (AVX2)
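
The loads above are usually combined with a vertical operation and a store. The sketch below adds two arrays four doubles at a time with `_mm256_loadu_pd`, `_mm256_add_pd`, and `_mm256_storeu_pd` (the store intrinsic does not appear on the slides). The function name `add_arrays`, the scalar tail loop, and the fallback when AVX is not enabled at compile time are assumptions made for illustration.

```c
#include <stddef.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* c[i] = a[i] + b[i] for i in [0, n). No alignment is assumed. */
void add_arrays(const double *a, const double *b, double *c, size_t n) {
    size_t i = 0;
#if defined(__AVX__)
    /* Main loop: one 256-bit register holds 4 doubles. */
    for (; i + 4 <= n; i += 4) {
        __m256d va = _mm256_loadu_pd(a + i);
        __m256d vb = _mm256_loadu_pd(b + i);
        _mm256_storeu_pd(c + i, _mm256_add_pd(va, vb));
    }
#endif
    /* Tail (and the whole array when AVX is not enabled). */
    for (; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}
```

If the arrays are known to be 32-byte aligned, the aligned variants `_mm256_load_pd` and `_mm256_store_pd` can be used instead.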

36 Data Layout
Array of Structures vs Structure of Arrays (AoS/SoA).
A gaming example.
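
The slide's gaming example is not reproduced in the transcript, so the sketch below uses a hypothetical set of particles to contrast the two layouts. The type and function names (`particle_aos`, `particles_soa`, `move_x_aos`, `move_x_soa`) and the particle count are assumptions.

```c
#include <stddef.h>

#define N_PARTICLES 8

/* Array of Structures: the fields of one particle are adjacent in memory. */
typedef struct {
    float x, y, z;
} particle_aos;

/* Structure of Arrays: all x values are adjacent, then all y, then all z. */
typedef struct {
    float x[N_PARTICLES];
    float y[N_PARTICLES];
    float z[N_PARTICLES];
} particles_soa;

/* AoS: consecutive x values are sizeof(particle_aos) bytes apart,
   so a vector load of x values needs a strided access or a gather. */
void move_x_aos(particle_aos *p, size_t n, float dx) {
    for (size_t i = 0; i < n; ++i) p[i].x += dx;
}

/* SoA: consecutive x values are contiguous, so one vector load
   fetches several of them at once. */
void move_x_soa(particles_soa *p, size_t n, float dx) {
    for (size_t i = 0; i < n; ++i) p->x[i] += dx;
}
```

Both functions compute the same result; only the memory access pattern differs, and that pattern is what decides how well the loop maps onto packed registers.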

38 Capabilities
Being careful
Of course, not all processors can use all the instructions.
Using a nonexistent instruction leads to an illegal instruction exception (SIGILL on Linux) most of the time.
You can find out what your processor supports with the CPUID assembly instruction.
Linux exposes the available instruction sets in /proc/cpuinfo.
On the software side
Many libraries and programs probe the capabilities of the CPU and choose which code path to take.
  FFmpeg.
  Intel MKL. It used to be way too restrictive.
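
As a rough sketch of such a probe, the code below looks for a feature flag on the `flags` line of /proc/cpuinfo. It is Linux-only and assumes the flag names are space-separated words, as on x86 kernels; the function names `line_has_flag` and `cpu_has_flag` are made up for this example. Production code would more likely query CPUID directly.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Return true if `flag` appears as a whole whitespace-separated word in `line`. */
bool line_has_flag(const char *line, const char *flag) {
    size_t len = strlen(flag);
    if (len == 0) return false;
    for (const char *p = strstr(line, flag); p != NULL; p = strstr(p + 1, flag)) {
        bool starts = (p == line) || p[-1] == ' ' || p[-1] == '\t';
        char end = p[len];
        bool ends = (end == '\0') || end == ' ' || end == '\n';
        if (starts && ends) return true;   /* "avx" must not match "avx2" */
    }
    return false;
}

/* Scan /proc/cpuinfo for a feature flag such as "avx" or "avx2".
   Returns false if the file cannot be opened (e.g. not on Linux). */
bool cpu_has_flag(const char *flag) {
    FILE *f = fopen("/proc/cpuinfo", "r");
    if (f == NULL) return false;
    char line[8192];
    bool found = false;
    while (!found && fgets(line, sizeof line, f) != NULL) {
        if (strncmp(line, "flags", 5) == 0) found = line_has_flag(line, flag);
    }
    fclose(f);
    return found;
}
```

A program could call `cpu_has_flag("avx2")` once at start-up and pick between a vectorized and a scalar code path accordingly.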

39 Summary
Lots of performance is available by using vector operations.
AVX can process in a single instruction:
  4 double precision floating point values
  8 single precision floating point values
  8 32-bit integers (AVX2)
  16 16-bit integers (AVX2)
  32 8-bit integers (AVX2)
Fused Multiply Add can potentially double that number again (AVX2)
Not only vertical operations:
  horizontal operations
  data reordering
  gather (AVX2)
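
The element counts above are simply the register width divided by the element width, and a fused multiply-add counts as two operations per element. A small sketch of that arithmetic, with a hypothetical helper name `ops_per_instruction`:

```c
#include <stdbool.h>

/* Number of arithmetic operations one vector instruction performs:
   one per element, or two per element for a fused multiply-add. */
unsigned ops_per_instruction(unsigned register_bits, unsigned element_bits, bool fused_multiply_add) {
    unsigned lanes = register_bits / element_bits;
    return fused_multiply_add ? 2u * lanes : lanes;
}
```

For a 256-bit register this gives 4 operations on doubles and 8 on single precision values, or 16 single precision operations with FMA.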

41 Why Hardware Threading?
Context switching
Sometimes the CPU idles because it waits for data.
  there is a cache/page fault.
  it waits for the system to wake it up (waiting on a condition to be met by another process).
  it is waiting for I/O.
  its time quantum has elapsed.
Low level reason
Sometimes some resources of the processor are left idle.
  waiting for data in L2 or L3 cache to reach L1.
  contention on the L1 cache.
  pipeline bubbles.

42 What is it?
Multiple hardware contexts
The principle is to have multiple hardware contexts within a core:
  Replicate the registers.
  Do not replicate execution components.
Switching from one to the other is now virtually free.
Keeps more potential work to run on the execution components.
Efficiency increases.
Multiple kinds
  One thread runs until it blocks, then the core switches.
  Round robin on the threads.
  Decode everything at once and let them execute as possible. (SMT)
But it is a very old idea used to overlap disk I/O and computation.
Now, this (or similar ideas) is used by most computing hardware.

43 Interesting facts about threading
Often used to overlap data transfers and operations.
Can cause more contention on shared resources, leading to a performance drop.
On compute kernels, 30% can be gained.
On I/O-bound applications, a large factor can be gained.
On x86 CPUs, rarely more than 2-way SMT. (Xeon Phi is 4-way.)
SPARC had 16-way threading on some processors.

44 Conclusion (before talking about OpenMP)
Generalities
There is parallelism at many levels:
  at the processor level (multiple cores)
  at the core level with SMT
  within a core (multiple instruction ports)
  within a port (SIMD)
To understand performance, you need to understand how things are going to be executed.
Very difficult to predict actual performance.
Things that matter
  Frequency
  Number of cores
  Instruction set
  Architecture of the core

46 Resources
Classical RISC:
Detailed explanation of Nehalem:
Composer XE documentation for SIMD instructions: doclib/stdxe/2013/composerxe/compiler/cpp-mac/guid-64e5bfbb-fe9e-47cb-82a7-76c2ad57ed9c.htm
On pragma simd:
Intel compiler ways to vectorize automatically:
SIMD in gcc: (a little old)
Builtin in gcc:
Intel developer manual:
SMT on Wikipedia:
CPUID:
On Stream Programming (vectorization):
Sandy Bridge/Haswell architectural comparison:


More information
