High Performance Computing: The processor
Erik Saule
ITCS 6010/8010: Special Topics: High Performance Computing
01/16/2014

Outline
1 General architecture
2 Vector processing
3 Hardware Threading
4 Resources

1 General architecture

Schematic (Nehalem: Intel Core i7)
[Figure: schematic of the Nehalem (Intel Core i7) processor. Source: intel.com]

A classic RISC pipeline

Typical architecture
Instructions are pushed into a pipeline one after the other.
[Figure: classic RISC pipeline. Source: a Georgia Tech assignment by Hyesoon Kim]

ADD R0, R1
LI R0, +42
LOAD R0, R1
BNZ R0, +3

Instruction pipelining
Throughput of one instruction per cycle.
[Figure: instruction pipelining. Source: Wikipedia]

Pipeline stall (bubbles)
[Figure: pipeline stall. Source: Wikipedia]
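
The "one instruction per cycle" throughput can be checked with a little arithmetic. The sketch below assumes the classic five-stage pipeline (the slide does not give the number of stages) and charges one extra cycle per bubble. The function name `pipeline_cycles` is made up for this illustration.

```c
/* Cycles to run n instructions through a pipeline with `stages` stages.
 * Ideal case: the first instruction needs `stages` cycles, and every later
 * instruction completes one cycle after its predecessor.
 * Assumption: each bubble (stall) simply adds one cycle. */
unsigned long pipeline_cycles(unsigned long n, unsigned long stages,
                              unsigned long bubbles)
{
    if (n == 0)
        return 0;
    return stages + (n - 1) + bubbles;
}
```

With 5 stages, the 4 instructions of the slide take 8 cycles without stalls, while 1000 instructions take 1004 cycles: the cost of filling the pipeline is paid once, and the throughput approaches one instruction per cycle.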

How to avoid bubbles?

Control hazard
Happens when the processor has not yet decided which instruction to execute next. Typically caused by a jump, which can be any of:
- Jump +12 (relative jump)
- JumpR R0 (jump to the address held in a register)
- BNZ R0, +2 (conditional branch)

Branch prediction
Keep a small table of what was chosen before, and guess which choice will be taken. If the guess is wrong, flush the pipeline.

Data hazard
An instruction uses the result of the previous instruction:
R0 = R1+R2
R3 = R0+R4

Loop in the ALU
Add an output to the ALU that feeds the next Instruction Decode in case the value is needed. (But this gets complicated when the two instructions are more than 1 cycle apart.)

Out-of-order execution
Keep a pool of upcoming instructions and execute one that is safe to run.
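
The "small table of what was chosen before" can be sketched in a few lines. The code below is one common textbook scheme (a table of 2-bit saturating counters indexed by the branch address), not a description of what Nehalem actually does. The table size and all names (`predictor_t`, `count_mispredictions`, ...) are hypothetical.

```c
#include <stddef.h>

#define TABLE_SIZE 16 /* hypothetical number of table entries */

/* One 2-bit saturating counter per entry:
 * 0 and 1 predict "not taken", 2 and 3 predict "taken". */
typedef struct {
    unsigned char counter[TABLE_SIZE];
} predictor_t;

void predictor_init(predictor_t *p)
{
    for (size_t i = 0; i < TABLE_SIZE; i++)
        p->counter[i] = 1; /* start at "weakly not taken" */
}

/* Guess the outcome of the branch located at address pc. */
int predictor_guess(const predictor_t *p, unsigned long pc)
{
    return p->counter[pc % TABLE_SIZE] >= 2;
}

/* Record what the branch actually did. */
void predictor_update(predictor_t *p, unsigned long pc, int taken)
{
    unsigned char *c = &p->counter[pc % TABLE_SIZE];
    if (taken) {
        if (*c < 3)
            (*c)++;
    } else {
        if (*c > 0)
            (*c)--;
    }
}

/* Replay the outcome history of one branch and count the wrong guesses.
 * Each wrong guess stands for one pipeline flush. */
size_t count_mispredictions(unsigned long pc, const int *outcomes, size_t n)
{
    predictor_t p;
    size_t misses = 0;

    predictor_init(&p);
    for (size_t i = 0; i < n; i++) {
        if (predictor_guess(&p, pc) != (outcomes[i] != 0))
            misses++;
        predictor_update(&p, pc, outcomes[i]);
    }
    return misses;
}
```

For a loop branch that is taken 9 times and then falls through, this predictor is wrong only twice: on the first iteration (nothing learned yet) and on the loop exit.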

Core architecture (Nehalem)
[Figure: core architecture of Nehalem. Source: TAMU supercomputing facility]

Summary
- Modern multicore processors are complicated.
- There are shared resources between cores: I/O port, L3 cache, memory controller(s).
- It is VERY difficult to predict their behavior.
- One core can retire more than one instruction per cycle.
- The philosophy of RISC optimization still applies.
- Writing efficient code manually is going to be tedious.

2 Vector processing

Single Instruction Multiple Data

Flynn's Taxonomy
- Single Instruction Single Data: simple processor model.
- Single Instruction Multiple Data: vector processor.
- Multiple Instruction Single Data: odd and uncommon.
- Multiple Instruction Multiple Data: typically shared memory.

Principle
Perform the same operation on multiple data in a single instruction. In a typical processor, this uses special registers that contain multiple values, often called packed registers (Packed Integer, Packed Double, ...).
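
To make the "packed register" idea concrete, here is a minimal sketch using the vector extension of GCC and Clang (`vector_size`), which is an assumption of this example and not part of standard C. The names `packed4d` and `add4` are made up. On hardware with 256-bit registers and the matching compiler flags, the single `a + b` can become one vector instruction; otherwise the compiler splits it into narrower operations.

```c
#include <string.h>

/* GCC/Clang extension: a "packed double" made of 4 values (32 bytes). */
typedef double packed4d __attribute__((vector_size(32)));

/* Add four pairs of doubles with a single source-level operation. */
void add4(const double x[4], const double y[4], double out[4])
{
    packed4d a, b, c;

    memcpy(&a, x, sizeof a); /* pack the inputs */
    memcpy(&b, y, sizeof b);
    c = a + b;               /* same operation applied to every lane */
    memcpy(out, &c, sizeof c);
}
```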

History

Intel processors
- MMX (1996): 64-bit packed integer registers (overlapping with the FPU registers).
- SSE (1999): new 128-bit registers for single precision.
- SSE2 (2001): extends SSE to integers.
- SSE3 (2004): horizontal operations, misaligned reads, conversions.
- SSSE3 (2006): new minor (mostly horizontal) operations.
- SSE4 (2006): minor additions, string operations, bit counts.
- AES (2008): instructions to help AES encryption.
- AVX (2008): introduces 256-bit registers and 3-operand instructions. The registers overlap with the SSE registers.
- AVX2 (2013): expands AVX to most instructions. FMA. Gather.

Some nuances with AMD, but mostly the same. Other classical ones: AltiVec for Power and Neon for ARM.

How to use vector instructions

Code and pray
The compiler can do some of it. It typically does some vectorization at -O2 (GCC before version 12 needs -O3 or -ftree-vectorize). It might be necessary to use some -march and -mtune options.

Guide the compiler
- #pragma simd
- #pragma vector aligned
- Intel Cilk Plus elemental extension: __declspec(vector)

Write assembly
Long and tedious. The previous section showed how difficult understanding architectures can be.

Use high-level types
Use C++ types and operator overloading to achieve some abstraction.

Use intrinsics/built-ins
Most compilers expose vector types on which built-in/intrinsic functions can be applied.
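
As an illustration of the "code and pray" route, the loop below is the kind of code a compiler can usually vectorize by itself. This is a sketch: whether it actually gets vectorized depends on the compiler, its version and its flags. The function name `saxpy` is chosen here. With GCC, something like `gcc -O3 -march=native -fopt-info-vec` reports the loops that were vectorized.

```c
#include <stddef.h>

/* y[i] += alpha * x[i].
 * `restrict` promises the compiler that x and y do not overlap, which
 * removes an aliasing doubt that can otherwise block vectorization. */
void saxpy(size_t n, float alpha, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];
}
```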

AVX (Sandy Bridge) and AVX2 (Haswell)

Registers
There are 16 registers of 256 bits called YMM[0-15]. They overlap with the XMM[0-15] registers used by SSE.

Declare them in C (with the proper include, <immintrin.h>):
- __m256: single precision
- __m256d: double precision
- __m256i: integers
Notice that the integer type carries neither the signedness nor the size of the integers.

All operations are suffixed with the type of data they operate on:
- ps: packed single precision
- pd: packed double precision
- epi32: signed 32-bit integers (mostly AVX2)
- epu8: unsigned 8-bit integers (mostly AVX2)

Note: read the documentation carefully, many are confusing.

Different kinds of instructions

Unary vertical operations
__m256d _mm256_sqrt_pd(__m256d a);
__m256 _mm256_rsqrt_ps(__m256 a);
__m256 _mm256_rcp_ps(__m256 a);

For instance:
__m256d a = _mm256_set_pd(4.0, 9.0, 16.0, 25.0);
__m256d b = _mm256_sqrt_pd(a); /* b holds 2.0, 3.0, 4.0, 5.0 in the same order */

Different kinds of instructions

Binary vertical operations
__m256i _mm256_add_epi16(__m256i m1, __m256i m2); (AVX2)
__m256 _mm256_mul_ps(__m256 m1, __m256 m2);
__m256d _mm256_div_pd(__m256d m1, __m256d m2);
__m256 _mm256_max_ps(__m256 m1, __m256 m2);
__m256d _mm256_xor_pd(__m256d m1, __m256d m2);
__m256 _mm256_fmadd_ps(__m256 a, __m256 b, __m256 c); (FMA, available alongside AVX2 on Haswell)

For instance:
__m256d a = _mm256_set_pd(1., 2., 3., 4.);
__m256d b = _mm256_set_pd(10., 20., 30., 40.);
__m256d c = _mm256_add_pd(a, b); /* c holds 11., 22., 33., 44. in the same order */

Different kinds of instructions

Horizontal operations
__m256d _mm256_hadd_pd(__m256d m1, __m256d m2);
__m256 _mm256_dp_ps(__m256 m1, __m256 m2, const int mask);

For instance:
__m256d a = _mm256_set_pd(1., 2., 3., 4.);
__m256d b = _mm256_set_pd(10., 20., 30., 40.);
__m256d c = _mm256_hadd_pd(a, b);
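
The result of the horizontal add is easy to get wrong, so here is the same example wrapped in a function that stores the result to memory. Two things are assumptions of this sketch: the function name `hadd_example`, and the `target("avx")` attribute, a GCC/Clang feature used so that the function compiles even without `-mavx` (it still needs an AVX-capable x86 CPU to run).

```c
#include <immintrin.h>

/* GCC/Clang: allow AVX inside this function only. */
__attribute__((target("avx")))
void hadd_example(double out[4])
{
    /* _mm256_set_pd takes its arguments from the highest element down,
     * so in memory order a is {4, 3, 2, 1} and b is {40, 30, 20, 10}. */
    __m256d a = _mm256_set_pd(1., 2., 3., 4.);
    __m256d b = _mm256_set_pd(10., 20., 30., 40.);

    /* Adjacent pairs are summed inside each 128-bit half:
     * {a0+a1, b0+b1, a2+a3, b2+b3}. */
    __m256d c = _mm256_hadd_pd(a, b);

    _mm256_storeu_pd(out, c); /* out = {7, 70, 3, 30} */
}
```

Note that the instruction does not sum the four values of a register: it works on pairs, separately in the lower and upper 128 bits, and interleaves the results from the two inputs.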

Different kinds of instructions

Getting data from regular registers
__m256i _mm256_setzero_si256(void);
__m256 _mm256_set1_ps(float);
__m256d _mm256_set_pd(double, double, double, double);

Getting data from vector registers
__m128i _mm256_extractf128_si256(__m256i m1, const int offset);
__m256d _mm256_shuffle_pd(__m256d m1, __m256d m2, const int select);
__m256 _mm256_blend_ps(__m256 m1, __m256 m2, const int mask);
__m256 _mm256_permute_ps(__m256 m1, int control);

Different kinds of instructions

Getting data from memory
__m256 _mm256_load_ps(float const *a); (address aligned on 32 bytes)
__m256d _mm256_loadu_pd(double const *a); (unaligned)
__m256 _mm256_broadcast_ps(__m128 const *a);
__m256d _mm256_broadcast_sd(double const *a);
__m128d _mm_i64gather_pd(double const *base, __m128i vindex, const int scale); (AVX2)
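
A typical use of the memory instructions is a loop that loads a few values, computes, and stores them back. The sketch below uses the unaligned load and store so that it works on any array. The name `scale_array` and the scalar tail loop are choices made for this example, and the `target("avx")` attribute (GCC/Clang) is only there so that it compiles without `-mavx`.

```c
#include <immintrin.h>
#include <stddef.h>

/* Multiply every element of data[0..n-1] by factor, 4 doubles at a time. */
__attribute__((target("avx")))
void scale_array(double *data, size_t n, double factor)
{
    __m256d f = _mm256_set1_pd(factor); /* factor copied into the 4 lanes */
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        __m256d v = _mm256_loadu_pd(data + i); /* unaligned load */
        v = _mm256_mul_pd(v, f);
        _mm256_storeu_pd(data + i, v);         /* unaligned store */
    }
    for (; i < n; i++) /* scalar tail when n is not a multiple of 4 */
        data[i] *= factor;
}
```

If the array is known to be aligned on 32 bytes, `_mm256_load_pd` and `_mm256_store_pd` can be used instead.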

Data Layout
Array of Structures vs. Structure of Arrays (AoS/SoA). A gaming example.
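
The gaming example itself is not in the transcript, so the sketch below uses a hypothetical set of game particles to show the two layouts. All names (`particle_aos`, `particles_soa`, `N_PARTICLES`, ...) are made up. In the SoA version, the x coordinates are contiguous in memory, so a vector load picks up several of them at once. In the AoS version, they are interleaved with y and z.

```c
#include <stddef.h>

#define N_PARTICLES 8 /* hypothetical number of particles */

/* Array of Structures: one struct per particle, stored as x y z x y z ... */
typedef struct {
    float x, y, z;
} particle_aos;

/* Structure of Arrays: one array per field, stored as x x x ... y y y ... */
typedef struct {
    float x[N_PARTICLES];
    float y[N_PARTICLES];
    float z[N_PARTICLES];
} particles_soa;

/* Move every particle along x: strided accesses (every third float). */
void move_x_aos(particle_aos *p, size_t n, float dx)
{
    for (size_t i = 0; i < n; i++)
        p[i].x += dx;
}

/* Same update: contiguous accesses, easier to vectorize. */
void move_x_soa(particles_soa *p, size_t n, float dx)
{
    for (size_t i = 0; i < n; i++)
        p->x[i] += dx;
}
```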

Capabilities

Being careful
Of course, not all processors can use all the instructions. Using a nonexistent instruction leads to an illegal instruction exception (most of the time). You can find out which processor you are running on with the CPUID assembly instruction. Linux exposes the available instruction sets in /proc/cpuinfo.

On the software side
Many libraries and programs probe the capabilities of the CPU and choose which code path to take:
- FFmpeg.
- Intel MKL. It used to be way too restrictive.
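
Probing the CPU does not require writing CPUID by hand. The sketch below relies on the `__builtin_cpu_supports` built-in of GCC and Clang on x86, and falls back to "not supported" everywhere else. The function names `cpu_has_avx2` and `choose_code_path` are made up, and real libraries have more elaborate dispatch logic.

```c
/* Returns 1 if the running CPU reports AVX2 support, 0 otherwise. */
int cpu_has_avx2(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init(); /* harmless if the CPU data is already set up */
    return __builtin_cpu_supports("avx2") ? 1 : 0;
#else
    return 0; /* unknown compiler or non-x86 CPU: take the safe path */
#endif
}

/* Pick the name of the code path to run, as a dispatching library would. */
const char *choose_code_path(void)
{
    return cpu_has_avx2() ? "avx2" : "scalar";
}
```

A program would call `choose_code_path()` once at startup and then route its hot loops to the matching implementation, so that the AVX2 instructions are only ever executed on a CPU that has them.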

Summary
- Lots of performance is available by using vector operations.
- AVX can process in a single instruction:
  - 4 double-precision floating-point values
  - 8 single-precision floating-point values
  - 8 32-bit integers (AVX2)
  - 16 16-bit integers (AVX2)
  - 32 8-bit integers (AVX2)
- Fused Multiply Add can potentially double that number again (AVX2).
- Not only vertical operations: horizontal operations, data reordering, gather (AVX2).

3 Hardware Threading

Why Hardware Threading?

Context switching
Sometimes the CPU idles because it waits for data.
- there is a cache/page fault.
- it waits for the system to wake it up (waiting on a condition to be met by another process).
- it is waiting for I/O.
- its time quantum has elapsed.

Low-level reason
Sometimes some resources of the processor are left idle.
- waiting for data in L2 or L3 cache to reach L1.
- contention on the L1 cache.
- pipeline bubbles.

What is it?

Multiple hardware contexts
The principle is to have multiple hardware contexts within a core:
- Replicate the registers.
- Do not replicate the execution components.
- Switching from one context to the other is now virtually free.
- Keeps more potential work to run on the execution components.
- Efficiency increases.

Multiple kinds
- One thread runs until it blocks, then the core switches to another.
- Round robin on the threads.
- Decode everything at once and let the instructions execute as possible (SMT).

But it is a very old idea, used to overlap disk I/O and computation. Now this idea (or similar ones) is used by most computing hardware.

Interesting facts about threading
- Often used to overlap data transfers and operations.
- Can cause more contention on shared resources, leading to a performance drop.
- On compute kernels, 30% can be gained.
- On I/O-bound applications, a large factor can be gained.
- On x86 CPUs, rarely more than 2-way SMT. (Xeon Phi is 4-way.)
- Sparc had 16-way threading on some processors.

Conclusion (before talking about OpenMP)

Generalities
There is parallelism at many levels:
- at the processor level (multiple cores)
- at the core level with SMT
- within a core (multiple instruction ports)
- within a port (SIMD)
To understand performance, you need to understand how things are going to be executed. It is very difficult to predict actual performance.

Things that matter
- Frequency
- Number of cores
- Instruction set
- Architecture of the core
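
The "things that matter" can be combined into a back-of-the-envelope upper bound on floating-point throughput. The model below is a common simplification and not a formula from the slides: it multiplies the levels of parallelism together and ignores memory, hazards and everything else that makes actual performance hard to predict. The function name and the example machine are hypothetical.

```c
/* Theoretical peak in GFLOP/s under a simple multiplicative model.
 *   ghz                  clock frequency in GHz
 *   cores                number of cores
 *   simd_lanes           values per vector register (4 doubles with AVX)
 *   flops_per_lane       1 for an add or a multiply, 2 for a fused multiply-add
 *   vector_ops_per_cycle vector instructions one core can start per cycle */
double peak_gflops(double ghz, int cores, int simd_lanes,
                   int flops_per_lane, int vector_ops_per_cycle)
{
    return ghz * cores * simd_lanes * flops_per_lane * vector_ops_per_cycle;
}
```

For a hypothetical 4-core machine at 3.0 GHz that can start two 4-wide double-precision FMAs per cycle on each core, `peak_gflops(3.0, 4, 4, 2, 2)` gives 192 GFLOP/s. Real codes usually reach only a fraction of such a bound.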

4 Resources

Resources
- Classical RISC
- Detailed explanation of Nehalem
- Composer XE documentation for SIMD instructions: doclib/stdxe/2013/composerxe/compiler/cpp-mac/guid-64e5bfbb-fe9e-47cb-82a7-76c2ad57ed9c.htm
- On pragma simd
- Intel compiler ways to vectorize automatically
- SIMD in gcc (a little old)
- Builtin in gcc
- Intel developer manual
- SMT on Wikipedia
- CPUID
- On Stream Programming (vectorization)
- Sandy Bridge/Haswell architectural comparison
More informationPipelining Review and Its Limitations
Pipelining Review and Its Limitations Yuri Baida yuri.baida@gmail.com yuriy.v.baida@intel.com October 16, 2010 Moscow Institute of Physics and Technology Agenda Review Instruction set architecture Basic
More informationExploiting SIMD Instructions
Instructions Felix von Leitner CCC Berlin felix-simd@fefe.de August 2003 Abstract General purpose CPUs have become powerful enough to decode and even encode MPEG audio and video in real time. These tasks
More informationWeek 1 out-of-class notes, discussions and sample problems
Week 1 out-of-class notes, discussions and sample problems Although we will primarily concentrate on RISC processors as found in some desktop/laptop computers, here we take a look at the varying types
More informationIA-32 Intel Architecture Software Developer s Manual
IA-32 Intel Architecture Software Developer s Manual Volume 1: Basic Architecture NOTE: The IA-32 Intel Architecture Software Developer s Manual consists of three volumes: Basic Architecture, Order Number
More informationInstruction Set Architecture (ISA) Design. Classification Categories
Instruction Set Architecture (ISA) Design Overview» Classify Instruction set architectures» Look at how applications use ISAs» Examine a modern RISC ISA (DLX)» Measurement of ISA usage in real computers
More informationScheduling Task Parallelism" on Multi-Socket Multicore Systems"
Scheduling Task Parallelism" on Multi-Socket Multicore Systems" Stephen Olivier, UNC Chapel Hill Allan Porterfield, RENCI Kyle Wheeler, Sandia National Labs Jan Prins, UNC Chapel Hill Outline" Introduction
More informationComputer Architecture. Secure communication and encryption.
Computer Architecture. Secure communication and encryption. Eugeniy E. Mikhailov The College of William & Mary Lecture 28 Eugeniy Mikhailov (W&M) Practical Computing Lecture 28 1 / 13 Computer architecture
More informationComputer Organization and Architecture
Computer Organization and Architecture Chapter 11 Instruction Sets: Addressing Modes and Formats Instruction Set Design One goal of instruction set design is to minimize instruction length Another goal
More informationPerformance Application Programming Interface
/************************************************************************************ ** Notes on Performance Application Programming Interface ** ** Intended audience: Those who would like to learn more
More informationPSE Molekulardynamik
OpenMP, bigger Applications 12.12.2014 Outline Schedule Presentations: Worksheet 4 OpenMP Multicore Architectures Membrane, Crystallization Preparation: Worksheet 5 2 Schedule 10.10.2014 Intro 1 WS 24.10.2014
More informationBEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA
BEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA AGENDA INTRO TO BEAGLEBONE BLACK HARDWARE & SPECS CORTEX-A8 ARMV7 PROCESSOR PROS & CONS VS RASPBERRY PI WHEN TO USE BEAGLEBONE BLACK Single
More informationOverview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming
Overview Lecture 1: an introduction to CUDA Mike Giles mike.giles@maths.ox.ac.uk hardware view software view Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Lecture 1 p.
More informationParallel Algorithm Engineering
Parallel Algorithm Engineering Kenneth S. Bøgh PhD Fellow Based on slides by Darius Sidlauskas Outline Background Current multicore architectures UMA vs NUMA The openmp framework Examples Software crisis
More informationCUDA programming on NVIDIA GPUs
p. 1/21 on NVIDIA GPUs Mike Giles mike.giles@maths.ox.ac.uk Oxford University Mathematical Institute Oxford-Man Institute for Quantitative Finance Oxford eresearch Centre p. 2/21 Overview hardware view
More informationVLIW Processors. VLIW Processors
1 VLIW Processors VLIW ( very long instruction word ) processors instructions are scheduled by the compiler a fixed number of operations are formatted as one big instruction (called a bundle) usually LIW
More informationPerformance Counter. Non-Uniform Memory Access Seminar Karsten Tausche 2014-12-10
Performance Counter Non-Uniform Memory Access Seminar Karsten Tausche 2014-12-10 Performance Counter Hardware Unit for event measurements Performance Monitoring Unit (PMU) Originally for CPU-Debugging
More informationGPI Global Address Space Programming Interface
GPI Global Address Space Programming Interface SEPARS Meeting Stuttgart, December 2nd 2010 Dr. Mirko Rahn Fraunhofer ITWM Competence Center for HPC and Visualization 1 GPI Global address space programming
More informationHigh-speed image processing algorithms using MMX hardware
High-speed image processing algorithms using MMX hardware J. W. V. Miller and J. Wood The University of Michigan-Dearborn ABSTRACT Low-cost PC-based machine vision systems have become more common due to
More informationMulti-Threading Performance on Commodity Multi-Core Processors
Multi-Threading Performance on Commodity Multi-Core Processors Jie Chen and William Watson III Scientific Computing Group Jefferson Lab 12000 Jefferson Ave. Newport News, VA 23606 Organization Introduction
More informationPutting it all together: Intel Nehalem. http://www.realworldtech.com/page.cfm?articleid=rwt040208182719
Putting it all together: Intel Nehalem http://www.realworldtech.com/page.cfm?articleid=rwt040208182719 Intel Nehalem Review entire term by looking at most recent microprocessor from Intel Nehalem is code
More informationCHAPTER 7: The CPU and Memory
CHAPTER 7: The CPU and Memory The Architecture of Computer Hardware, Systems Software & Networking: An Information Technology Approach 4th Edition, Irv Englander John Wiley and Sons 2010 PowerPoint slides
More informationBenchmarking Large Scale Cloud Computing in Asia Pacific
2013 19th IEEE International Conference on Parallel and Distributed Systems ing Large Scale Cloud Computing in Asia Pacific Amalina Mohamad Sabri 1, Suresh Reuben Balakrishnan 1, Sun Veer Moolye 1, Chung
More informationIntroduction to GPU Architecture
Introduction to GPU Architecture Ofer Rosenberg, PMTS SW, OpenCL Dev. Team AMD Based on From Shader Code to a Teraflop: How GPU Shader Cores Work, By Kayvon Fatahalian, Stanford University Content 1. Three
More informationInstruction Set Design
Instruction Set Design Instruction Set Architecture: to what purpose? ISA provides the level of abstraction between the software and the hardware One of the most important abstraction in CS It s narrow,
More informationChapter 2 Logic Gates and Introduction to Computer Architecture
Chapter 2 Logic Gates and Introduction to Computer Architecture 2.1 Introduction The basic components of an Integrated Circuit (IC) is logic gates which made of transistors, in digital system there are
More informationCPU performance monitoring using the Time-Stamp Counter register
CPU performance monitoring using the Time-Stamp Counter register This laboratory work introduces basic information on the Time-Stamp Counter CPU register, which is used for performance monitoring. The
More informationGPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics
GPU Architectures A CPU Perspective Derek Hower AMD Research 5/21/2013 Goals Data Parallelism: What is it, and how to exploit it? Workload characteristics Execution Models / GPU Architectures MIMD (SPMD),
More informationIA-64 Application Developer s Architecture Guide
IA-64 Application Developer s Architecture Guide The IA-64 architecture was designed to overcome the performance limitations of today s architectures and provide maximum headroom for the future. To achieve
More informationCS:APP Chapter 4 Computer Architecture. Wrap-Up. William J. Taffe Plymouth State University. using the slides of
CS:APP Chapter 4 Computer Architecture Wrap-Up William J. Taffe Plymouth State University using the slides of Randal E. Bryant Carnegie Mellon University Overview Wrap-Up of PIPE Design Performance analysis
More informationAdministrative Issues
CSC 3210 Computer Organization and Programming Introduction and Overview Dr. Anu Bourgeois (modified by Yuan Long) Administrative Issues Required Prerequisites CSc 2010 Intro to CSc CSc 2310 Java Programming
More informationThe programming language C. sws1 1
The programming language C sws1 1 The programming language C invented by Dennis Ritchie in early 1970s who used it to write the first Hello World program C was used to write UNIX Standardised as K&C (Kernighan
More informationSOC architecture and design
SOC architecture and design system-on-chip (SOC) processors: become components in a system SOC covers many topics processor: pipelined, superscalar, VLIW, array, vector storage: cache, embedded and external
More informationWAR: Write After Read
WAR: Write After Read write-after-read (WAR) = artificial (name) dependence add R1, R2, R3 sub R2, R4, R1 or R1, R6, R3 problem: add could use wrong value for R2 can t happen in vanilla pipeline (reads
More informationPerformance monitoring of the software frameworks for LHC experiments
Performance monitoring of the software frameworks for LHC experiments William A. Romero R. wil-rome@uniandes.edu.co J.M. Dana Jose.Dana@cern.ch First EELA-2 Conference Bogotá, COL OUTLINE Introduction
More informationOverview of HPC Resources at Vanderbilt
Overview of HPC Resources at Vanderbilt Will French Senior Application Developer and Research Computing Liaison Advanced Computing Center for Research and Education June 10, 2015 2 Computing Resources
More informationLecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com
CSCI-GA.3033-012 Graphics Processing Units (GPUs): Architecture and Programming Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Modern GPU
More informationA quick tutorial on Intel's Xeon Phi Coprocessor
A quick tutorial on Intel's Xeon Phi Coprocessor www.cism.ucl.ac.be damien.francois@uclouvain.be Architecture Setup Programming The beginning of wisdom is the definition of terms. * Name Is a... As opposed
More informationUsing the Game Boy Advance to Teach Computer Systems and Architecture
Using the Game Boy Advance to Teach Computer Systems and Architecture ABSTRACT This paper presents an approach to teaching computer systems and architecture using Nintendo s Game Boy Advance handheld game
More informationPerformance Analysis of Dual Core, Core 2 Duo and Core i3 Intel Processor
Performance Analysis of Dual Core, Core 2 Duo and Core i3 Intel Processor Taiwo O. Ojeyinka Department of Computer Science, Adekunle Ajasin University, Akungba-Akoko Ondo State, Nigeria. Olusola Olajide
More information