Exploiting SIMD Instructions
|
|
|
- Caren Oliver
- 9 years ago
- Views:
Transcription
1 Instructions Felix von Leitner CCC Berlin August 2003 Abstract General purpose CPUs have become powerful enough to decode and even encode MPEG audio and video in real time. These tasks have previously been the domain of special purpose hardware, DSPs and FPGAs. These feats would in general not be possible without SIMD instructions. SIMD Hacking
2 Agenda 1. How do CPUs work 2. Making a CPUs that is good at number crunching 3. Typical vector instruction sets: MMX, SSE 4. Vectorizing code SIMD Hacking 1
3 How do CPUs work (VAX, 386) 1. read one instruction from memory (FETCH) 2. decode instruction (DECODE) 3. load prerequisite data (READ) 4. execute instruction (EXEC) 5. write back results (WRITE) 6. repeat from step 1 SIMD Hacking 2
4 How do CPUs work (VAX, 386) +===+===+===+===+===+ F D R E W +===+===+===+===+===+ +===+===+===+===+===+ F D R E W +===+===+===+===+===+ Problem: slow. SIMD Hacking 3
5 How do CPUs work (486, VIA C3, early RISC) +===+===+===+===+===+ F D R E W +===+===+===+===+===+ +===+===+===+===+===+ F D R E W +===+===+===+===+===+ +===+===+===+===+===+ F D R E W +===+===+===+===+===+ Problem: needs good compiler, bad for architectures with few registers. SIMD Hacking 4
6 How do CPUs work (Pentium, later RISC) Two pipelines. One general purpose pipeline One special purpose, only executes some instructions Problem: needs good compiler, bad for architectures with few registers. SIMD Hacking 5
7 How do CPUs work (current x86 or RISC) Fetch many instructions Calculate dependency graph Send instructions to functional units Each functional unit has a wait queue If you need more floating point performance, add another FP unit Has shadow registers, renames internally Problem: expensive, complex. State of the art. SIMD Hacking 6
8 Register renaming Register renaming makes it possible for a CPU to execute several iterations of the same loop in parallel. However, the iterations must be independent. If the architecture has few registers, it may not be possible to formulate the loop so that the iterations are independent. Adding registers is usually not an option (breaks assembler, compiler). Simulations have shown that adding more than 32 registers yields very little improvement, so most RISC architectures have 32 integer and 32 floating point registers. But x86 has only 7 general purpose registers! SIMD Hacking 7
9 How to make a CPU good for number crunching There are several ways to improve floating point performance: Use low latency FP units Has lower bound. Not useful for rare insns like sqrt. Reduce work of FP units Alpha defers normalization, no exception handling. Add more FP units Useless on stack arch (387), only helps for parallel algorithms, expensive. Out of order execution has excellent FP performance even without handoptimized assembly core routines. SIMD Hacking 8
10 How to make a CPU good for number crunching The number 1 performance killer in modern CPUs are branches. Because CPUs fetch instructions and then distribute them to functional units, it needs to know which instructions to fetch and distribute. Branches make that impossible. So CPUs either have elaborate guesswork to guess the outcome of a branch, or they execute both outcomes and then discard the wrong one (IA64). It turns out that floating point and DSP code usually has comparatively few branches, so it is actually easier to build a CPU that is a good number cruncher than to make one that is good at interpreting perl or Java. SIMD Hacking 9
11 Why add a vector extension at all? Many operations can be done in constant time, no matter how wide the machine word is. Examples are: XOR, AND, OR, shifting, adding. Partitioning a large word (say, 128 bits) into 4 32-bit words is a no-op for XOR, AND, OR. Shifting needs very few additional hardware, adding and negating simply need new hardware to not carry over the carry bit at the partitions. Subtracting can be done by adding the negated value, and negating is a XOR and an ADD. These instructions can all easily be done in 1 machine cycle. SIMD Hacking 10
12 Historical perspective The first general purpose CPU to add word partitioning so that one could add 4 bytes in a 32-bit register was... does anyone know? SIMD Hacking 11
13 Historical perspective No, it s not the Pentium MMX, that was announced in 1996 but shipped in Sun UltraSPARC vector extension is called VIS (Visual Instruction Set) and shipped in The HP PA-RISC vector extension is called MAX (Multimedia Acceleration extensions) and also shipped in HP is widely credited as being the first desktop CPU vendor to ship vector extensions. However, the first CPU with partitioning came from Intel: the i860 (1989), a little known but very innovative CPU. The i860 was the first superscalar CPU, first multiply-add, manual pipelining... cool chip! Unfortunately, it failed in the market, because Intel s compiler sucked. SIMD Hacking 12
14 Why add a vector extension at all? The application that benefits obviously and trivially from integer vector instructions is image operations like alpha channel calculations. These operations often also need saturation, i.e. add 5, but don t wrap around (adding 5 to 253 yields 255). Some vector units offer normal add and add-with-saturation instructions. SIMD Hacking 13
15 Number crunching with x86 When MMX was added, the worst bottleneck was graphics speed, in particular drawing shaded textures in VR apps and games. 3dfx didn t exist yet, quake was not on the market yet. So MMX is an integer vector unit, it is useless for 3d calculations and number crunching. Before modern RISC, floating point math was generally much slower than integer math. So an established optimization technique was to express and algorithm like mp3 decoding in integer math. On some embedded architectures this is still useful (ARM for example). SIMD Hacking 14
16 Number crunching with x86 It turns out that MPEG2 video decoding can be done completely in integer math. And like most imaging algorithms it is highly parallelizable. These days non-vector floating point is faster than non-vector integer math on most architectures. Still, SSE2 has 128-bit registers for floating point and integer math. floats have 32 bits. If you can write the algorithm with 16-bit integer math, you can put twice the data in each vector. SIMD Hacking 15
17 The different x86 vector extensions MMX does integer vector operations on 8 64-bit vector registers that are mapped over the 387 floating point registers (using 8-, 16- and 32-bit integers as elements). 3dnow! is MMX plus instructions to use the MMX registers as vectors of 2 32-bit floats. SSE adds 8 separate 128-bit registers that can be used as vectors of 4 32-bit floats or 2 64-bit doubles. SSE also adds a few more integer instructions on the MMX registers. SSE2 adds integer vector instructions on the 128-bit registers from SSE. SIMD Hacking 16
18 The different RISC vector extensions Altivec operates on bit vectors of floats, 32-, 16- or 8-bit integers. Altivec is very versatile and supports most every operation. Alpha MAX supports only integers in its normal 64-bit registers, and can only compare, logical, absolute difference, min/max and pack/unpack. HP MAX2 operates on 16-bit values in 64-bit registers, can add, shift, saturating shift-and-add (to roll your own multiply-by-constant), logical, pack and shuffle. SPARC VIS can add (no saturation), compare, logical, absolute difference, pack and unpack on some subsets of 64-bit registers. SIMD Hacking 17
19 Which SIMD instructions are important? About the only ones actually used are Altivec and the various x86 extensions. I will focus on these from now on. If you want to see how these are used in the real world, the best place to find code using them is ffmpeg, in particular the platform dependent dsputils.c implementation. ffmpeg has platform specific routines for alpha, arm, i386 (MMX, SSE), SPARC VIS (using mlib, Sun s abstraction library), Altivec, the Playstation 2 (I m not kidding!) and SH-4. Arm and SH-4 do not actually have a vector extension, but this is a treasure trove for SIMD hackers. SIMD Hacking 18
20 What can you typically do with SIMD? Add and substract vectors of bytes, shorts or ints Same, but with saturation ( don t go over 255 or below 0 ) logical bitwise operations (obviously): AND, OR, XOR, NOT compare (element in result vector set to all-1 or all-0) shift (bits inside each element are shifted, not the elements themselves) multiply-and-add SIMD Hacking 19
21 What can you typically do with SIMD? absolute difference (i.e. abs(a-b)) (Altivec has this) min, max (more useful that you d think!) packing (e.g. 2 words 2 bytes) unpacking (e.g. 2 bytes 2 words) shuffle (permute the elements in a vector) SIMD Hacking 20
22 What can you typically NOT do with SIMD? look up value in table using vector elements as index multiply the second word with the third word in same vector get sum of all 8 bytes in vector The first one is the most serious, because often you vectorize code that has already been optimized, and people often put statically computable values in tables and later look them up. With SIMD it s often faster to compute the value than look it up. SIMD Hacking 21
23 What does SIMD code look like? This is actually used in MPEG encoding: int sum(uint8_t* pix,int width) { int s,i,j; for (s=i=0, i<16; ++i) { for (j=0; j<16; ++j) s+=pix[j]; pix+=width; } return s; } SIMD Hacking 22
24 Poor man s SIMD int sum(uint8_t* pix,int width) { int i,j; uint32_t v=0; for (s=i=0, i<16; ++i) { uint32_t x for (j=0; j<16; j+=4) { x=*(uint32_t*)pix; v+=((x & 0xff00ff00)>>8) + (x & 0x00ff00ff); } pix+=width; } return (v & 0xffff0000) + (v >> 16); } SIMD Hacking 23
25 What does SIMD code look like? Let s vectorize this! #include <mmintrin.h> int sum(uint8* pix,int width) { int i; m64 x=_mm_add_pi8(*( m64*)pix,*( m64*)(pix+8)); for (i=1; i<16; ++i) { pix+=width; x=_mm_add_pi8(x,*( m64*)pix); x=_mm_add_pi8(x,*( m64*)(pix+8)); } /* there is no instruction to sum the elements */ } /* problem: overflow! */ SIMD Hacking 24
26 Uh... What now? First we need to make sure there are no overflows. We read 8-bit values. We need to add them as 16-bit values. So first, we have to promote them. 1. copy vector 2. shift right 16-bit vector elements in copy by 8 3. AND original with 0x00ff00ff00ff00ff bit vector add both Yuck! SIMD Hacking 25
27 Run that by me again 0. read (1,2,3,4,5,6,7,8) 1. copy (1,2,3,4,5,6,7,8) (1,2,3,4,5,6,7,8) bit shr 8 (0,1,0,3,0,5,0,7) (1,2,3,4,5,6,7,8) 3. and 00ff00ff (0,1,0,3,0,5,0,7) (0,2,0,4,0,6,0,8) 4. sum (1+3,3+4,5+6,7+8) Now we have 4 16-bit values. SIMD Hacking 26
28 More problems The next floating point operation after using MMX will generate an error (usually NaN). That is because MMX shares the floating point registers and sets a bit that this register was used for MMX. You are required to run the instruction EMMS after the MMX code. This instructions takes 6 cycles on Pentium 3, 12 cycles on Pentium 4 (2 on Athlon). This alone can eat up all your MMX savings! 3dnow! defines a FEMMS ( fast EMMS ) instruction that will not zero the register but just clear the MMX bit. Still, this sucks, and it is a major source of ridicule in Apple s Altivec PR. SIMD Hacking 27
29 Shuffling Permutating the elements of a vector is impossible in MMX. You would have to write the vector to memory, permutate manually, and then read the vector back. This is prohibitively slow, don t even think about it. SSE adds a few integer instructions as MMX extensions as well, most importantly PSHUFW. pshufw can only shuffle 16-bit elements. It takes three arguments: an immediate byte constant, a destination register, and a source register or memory location. Every two bits in the immediate byte value are taken as index in the source vector. So, to turn (1,2,3,4) into (4,3,2,1), the immediate would be or 0x1b. Use 0x00 to turn (1,2,3,4) into (1,1,1,1). SIMD Hacking 28
30 Integer multiplication Multiplying two n bit integers yields a 2n bit result. MMX handles this by having two multiply instructions one for the upper n bits of the result, one for the lower n bits of the result. This is handy because typically integer multiplication happens in ported floating point code. DSP style floating point code works on values between 0 and 1, so you work on the most significant bits of the fractional part of the number. This means that after a multiplication you would normally shift right the result to get the most significant bits. The SSE multiplication means you don t have to shift. SIMD Hacking 29
31 What now? As you can see, there is major bit fiddling involved in porting even this trivial function to MMX. Even worse, many typical operations just are not there for vectors. I find it intellectually stimulating to find SIMD ways to do things. For example, to set an MMX register to 0, you XOR it with itself. This is a well-known trick from non-mmx. To set an MMX register to all-1, you... well, there is no decrement for SIMD. So what you do is to compare a vector with itself for equality. The result will always be true, so the result vector is all-1. SIMD Hacking 30
32 Other interesting ways to do stuff To exchange two vectors: XOR a,b; XOR b,a; XOR a,b. To calculate abs() on a float vector: AND each element with 0x7fffffff. To make a vector with 0x7fffffff: make an all-1 vector and shr by 1. To make a vector with 0x : make an all-1 vector and shl by 31. Imaging people often need the absolute sum of differences of two images. There is no abs() for MMX. The solution is to calculate a-b and b-a... with saturation! So one is 0 and the other is positive. OR them and you are done. SIMD Hacking 31
33 What else? SSE has 128-bit vectors, i.e. 4 floats. There are two load instructions, a fast one that only works if the memory location is 16-byte aligned, and a slow one for the general case. If you use gcc to write your SSE code, and you declare an m128 variable on the stack, gcc will not make sure it is 16-byte aligned, but will generate the aligned load and store instructions. The result is a core dump (I reported this bug a few months ago). The problem is: gcc without -O generates tons of temporaries on the stack. So you will notice that suddenly, when you start compiling with -g to find that bug, your SSE code will start segfaulting all around you. Don t be surprised, it s a known bug. SIMD Hacking 32
34 What about floating point vectors? Partitioning floating point words is not as easy as partitioning an integer word. CPUs with floating point vector math usually add several math units. So, normal math and vector math go to different units (on x86, the normal math unit can do stuff like sin() and log(), the vector units can only do basic arithmetic. If you interleave vector floating point code with scalar code, you can be even faster than just the vector code! SIMD Hacking 33
35 Funky floating point math 3dnow! and SSE offer instructions to calculate an approximation of 1/x. The background is that multiplying is always faster than dividing, so multiplying by 1/x is often faster than straight division, in particular if you do several divisions with the same divisor. People often need square roots. 3dnow! and SSE do not offer square roots, but they have instructions to calculate 1/sqrt(x). 3dnow! is even funkier than SSE in that they offer two functions each, one quick one and one that you can call additionally if you need more precision. SIMD Hacking 34
36 Problem cases In many real-life functions you end up shuffling your vectors around all the time. For example, many imaging algorithms do a zig-zag traversal of the image, i.e. first scan line from left to right, next scan line from right to left. One particularly nasty function is a part of MPEG 4 where they compare neighbouring pixels, but if your vector leaves the 16x16 tile to the left or right, you don t take the pixels from the neightbouring tile, but you reflect off the border. This is a major switch statement with a lot of evil shuffling to get it done with vector instructions at all. Every if, switch or shuffle hurts performance. SIMD Hacking 35
37 Case study I recently spent some time with libvorbis and an SSE manual. I used gcov to find lines of code that are executed particularly often. I found about 10 hot spots in the encoder code (psy.c). Three of them looked like using vectors might be beneficial. I converted those three. The speed-up was 25%. It took me about a week to learn SSE enough to do this. SIMD Hacking 36
38 Go forth and multiply! If I can do it, so can you. Send questions to [email protected]! BTW: If you want to learn more about computer architecture, go to a good book store and order the Hennessy-Patterson. If they are any good, they ll know. SIMD Hacking 37
İSTANBUL AYDIN UNIVERSITY
İSTANBUL AYDIN UNIVERSITY FACULTY OF ENGİNEERİNG SOFTWARE ENGINEERING THE PROJECT OF THE INSTRUCTION SET COMPUTER ORGANIZATION GÖZDE ARAS B1205.090015 Instructor: Prof. Dr. HASAN HÜSEYİN BALIK DECEMBER
Instruction Set Architecture. or How to talk to computers if you aren t in Star Trek
Instruction Set Architecture or How to talk to computers if you aren t in Star Trek The Instruction Set Architecture Application Compiler Instr. Set Proc. Operating System I/O system Instruction Set Architecture
Bindel, Spring 2010 Applications of Parallel Computers (CS 5220) Week 1: Wednesday, Jan 27
Logistics Week 1: Wednesday, Jan 27 Because of overcrowding, we will be changing to a new room on Monday (Snee 1120). Accounts on the class cluster (crocus.csuglab.cornell.edu) will be available next week.
Overview. CISC Developments. RISC Designs. CISC Designs. VAX: Addressing Modes. Digital VAX
Overview CISC Developments Over Twenty Years Classic CISC design: Digital VAX VAXÕs RISC successor: PRISM/Alpha IntelÕs ubiquitous 80x86 architecture Ð 8086 through the Pentium Pro (P6) RJS 2/3/97 Philosophy
High-speed image processing algorithms using MMX hardware
High-speed image processing algorithms using MMX hardware J. W. V. Miller and J. Wood The University of Michigan-Dearborn ABSTRACT Low-cost PC-based machine vision systems have become more common due to
DNA Data and Program Representation. Alexandre David 1.2.05 [email protected]
DNA Data and Program Representation Alexandre David 1.2.05 [email protected] Introduction Very important to understand how data is represented. operations limits precision Digital logic built on 2-valued
what operations can it perform? how does it perform them? on what kind of data? where are instructions and data stored?
Inside the CPU how does the CPU work? what operations can it perform? how does it perform them? on what kind of data? where are instructions and data stored? some short, boring programs to illustrate the
CISC, RISC, and DSP Microprocessors
CISC, RISC, and DSP Microprocessors Douglas L. Jones ECE 497 Spring 2000 4/6/00 CISC, RISC, and DSP D.L. Jones 1 Outline Microprocessors circa 1984 RISC vs. CISC Microprocessors circa 1999 Perspective:
This Unit: Floating Point Arithmetic. CIS 371 Computer Organization and Design. Readings. Floating Point (FP) Numbers
This Unit: Floating Point Arithmetic CIS 371 Computer Organization and Design Unit 7: Floating Point App App App System software Mem CPU I/O Formats Precision and range IEEE 754 standard Operations Addition
Divide: Paper & Pencil. Computer Architecture ALU Design : Division and Floating Point. Divide algorithm. DIVIDE HARDWARE Version 1
Divide: Paper & Pencil Computer Architecture ALU Design : Division and Floating Point 1001 Quotient Divisor 1000 1001010 Dividend 1000 10 101 1010 1000 10 (or Modulo result) See how big a number can be
Instruction Set Architecture (ISA)
Instruction Set Architecture (ISA) * Instruction set architecture of a machine fills the semantic gap between the user and the machine. * ISA serves as the starting point for the design of a new machine
Generations of the computer. processors.
. Piotr Gwizdała 1 Contents 1 st Generation 2 nd Generation 3 rd Generation 4 th Generation 5 th Generation 6 th Generation 7 th Generation 8 th Generation Dual Core generation Improves and actualizations
Using Power to Improve C Programming Education
Using Power to Improve C Programming Education Jonas Skeppstedt Department of Computer Science Lund University Lund, Sweden [email protected] jonasskeppstedt.net jonasskeppstedt.net [email protected]
Adaptive Stable Additive Methods for Linear Algebraic Calculations
Adaptive Stable Additive Methods for Linear Algebraic Calculations József Smidla, Péter Tar, István Maros University of Pannonia Veszprém, Hungary 4 th of July 204. / 2 József Smidla, Péter Tar, István
Intel Pentium 4 Processor on 90nm Technology
Intel Pentium 4 Processor on 90nm Technology Ronak Singhal August 24, 2004 Hot Chips 16 1 1 Agenda Netburst Microarchitecture Review Microarchitecture Features Hyper-Threading Technology SSE3 Intel Extended
Software Pipelining. for (i=1, i<100, i++) { x := A[i]; x := x+1; A[i] := x
Software Pipelining for (i=1, i
ARM Microprocessor and ARM-Based Microcontrollers
ARM Microprocessor and ARM-Based Microcontrollers Nguatem William 24th May 2006 A Microcontroller-Based Embedded System Roadmap 1 Introduction ARM ARM Basics 2 ARM Extensions Thumb Jazelle NEON & DSP Enhancement
ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM
ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM 1 The ARM architecture processors popular in Mobile phone systems 2 ARM Features ARM has 32-bit architecture but supports 16 bit
Instruction Set Architecture
Instruction Set Architecture Consider x := y+z. (x, y, z are memory variables) 1-address instructions 2-address instructions LOAD y (r :=y) ADD y,z (y := y+z) ADD z (r:=r+z) MOVE x,y (x := y) STORE x (x:=r)
Instruction Set Design
Instruction Set Design Instruction Set Architecture: to what purpose? ISA provides the level of abstraction between the software and the hardware One of the most important abstraction in CS It s narrow,
BEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA
BEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA AGENDA INTRO TO BEAGLEBONE BLACK HARDWARE & SPECS CORTEX-A8 ARMV7 PROCESSOR PROS & CONS VS RASPBERRY PI WHEN TO USE BEAGLEBONE BLACK Single
Binary Division. Decimal Division. Hardware for Binary Division. Simple 16-bit Divider Circuit
Decimal Division Remember 4th grade long division? 43 // quotient 12 521 // divisor dividend -480 41-36 5 // remainder Shift divisor left (multiply by 10) until MSB lines up with dividend s Repeat until
Chapter 5 Instructor's Manual
The Essentials of Computer Organization and Architecture Linda Null and Julia Lobur Jones and Bartlett Publishers, 2003 Chapter 5 Instructor's Manual Chapter Objectives Chapter 5, A Closer Look at Instruction
GPU(Graphics Processing Unit) with a Focus on Nvidia GeForce 6 Series. By: Binesh Tuladhar Clay Smith
GPU(Graphics Processing Unit) with a Focus on Nvidia GeForce 6 Series By: Binesh Tuladhar Clay Smith Overview History of GPU s GPU Definition Classical Graphics Pipeline Geforce 6 Series Architecture Vertex
Intel 8086 architecture
Intel 8086 architecture Today we ll take a look at Intel s 8086, which is one of the oldest and yet most prevalent processor architectures around. We ll make many comparisons between the MIPS and 8086
Design Cycle for Microprocessors
Cycle for Microprocessors Raúl Martínez Intel Barcelona Research Center Cursos de Verano 2010 UCLM Intel Corporation, 2010 Agenda Introduction plan Architecture Microarchitecture Logic Silicon ramp Types
Software implementation of Post-Quantum Cryptography
Software implementation of Post-Quantum Cryptography Peter Schwabe Radboud University Nijmegen, The Netherlands October 20, 2013 ASCrypto 2013, Florianópolis, Brazil Part I Optimizing cryptographic software
Quiz for Chapter 1 Computer Abstractions and Technology 3.10
Date: 3.10 Not all questions are of equal difficulty. Please review the entire quiz first and then budget your time carefully. Name: Course: Solutions in Red 1. [15 points] Consider two different implementations,
Chapter 7D The Java Virtual Machine
This sub chapter discusses another architecture, that of the JVM (Java Virtual Machine). In general, a VM (Virtual Machine) is a hypothetical machine (implemented in either hardware or software) that directly
Computer Architectures
Computer Architectures 2. Instruction Set Architectures 2015. február 12. Budapest Gábor Horváth associate professor BUTE Dept. of Networked Systems and Services [email protected] 2 Instruction set architectures
LSN 2 Computer Processors
LSN 2 Computer Processors Department of Engineering Technology LSN 2 Computer Processors Microprocessors Design Instruction set Processor organization Processor performance Bandwidth Clock speed LSN 2
VLIW Processors. VLIW Processors
1 VLIW Processors VLIW ( very long instruction word ) processors instructions are scheduled by the compiler a fixed number of operations are formatted as one big instruction (called a bundle) usually LIW
FLOATING-POINT ARITHMETIC IN AMD PROCESSORS MICHAEL SCHULTE AMD RESEARCH JUNE 2015
FLOATING-POINT ARITHMETIC IN AMD PROCESSORS MICHAEL SCHULTE AMD RESEARCH JUNE 2015 AGENDA The Kaveri Accelerated Processing Unit (APU) The Graphics Core Next Architecture and its Floating-Point Arithmetic
IA-64 Application Developer s Architecture Guide
IA-64 Application Developer s Architecture Guide The IA-64 architecture was designed to overcome the performance limitations of today s architectures and provide maximum headroom for the future. To achieve
The mathematics of RAID-6
The mathematics of RAID-6 H. Peter Anvin 1 December 2004 RAID-6 supports losing any two drives. The way this is done is by computing two syndromes, generally referred P and Q. 1 A quick
CPU Organization and Assembly Language
COS 140 Foundations of Computer Science School of Computing and Information Science University of Maine October 2, 2015 Outline 1 2 3 4 5 6 7 8 Homework and announcements Reading: Chapter 12 Homework:
Pexip Speeds Videoconferencing with Intel Parallel Studio XE
1 Pexip Speeds Videoconferencing with Intel Parallel Studio XE by Stephen Blair-Chappell, Technical Consulting Engineer, Intel Over the last 18 months, Pexip s software engineers have been optimizing Pexip
CHAPTER 7: The CPU and Memory
CHAPTER 7: The CPU and Memory The Architecture of Computer Hardware, Systems Software & Networking: An Information Technology Approach 4th Edition, Irv Englander John Wiley and Sons 2010 PowerPoint slides
Fast Arithmetic Coding (FastAC) Implementations
Fast Arithmetic Coding (FastAC) Implementations Amir Said 1 Introduction This document describes our fast implementations of arithmetic coding, which achieve optimal compression and higher throughput by
GPU Hardware Performance. Fall 2015
Fall 2015 Atomic operations performs read-modify-write operations on shared or global memory no interference with other threads for 32-bit and 64-bit integers (c. c. 1.2), float addition (c. c. 2.0) using
Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.
Lecture 11: Multi-Core and GPU Multi-core computers Multithreading GPUs General Purpose GPUs Zebo Peng, IDA, LiTH 1 Multi-Core System Integration of multiple processor cores on a single chip. To provide
More on Pipelining and Pipelines in Real Machines CS 333 Fall 2006 Main Ideas Data Hazards RAW WAR WAW More pipeline stall reduction techniques Branch prediction» static» dynamic bimodal branch prediction
Next Generation GPU Architecture Code-named Fermi
Next Generation GPU Architecture Code-named Fermi The Soul of a Supercomputer in the Body of a GPU Why is NVIDIA at Super Computing? Graphics is a throughput problem paint every pixel within frame time
INSTRUCTION LEVEL PARALLELISM PART VII: REORDER BUFFER
Course on: Advanced Computer Architectures INSTRUCTION LEVEL PARALLELISM PART VII: REORDER BUFFER Prof. Cristina Silvano Politecnico di Milano [email protected] Prof. Silvano, Politecnico di Milano
Advanced Computer Architecture-CS501. Computer Systems Design and Architecture 2.1, 2.2, 3.2
Lecture Handout Computer Architecture Lecture No. 2 Reading Material Vincent P. Heuring&Harry F. Jordan Chapter 2,Chapter3 Computer Systems Design and Architecture 2.1, 2.2, 3.2 Summary 1) A taxonomy of
Interpreters and virtual machines. Interpreters. Interpreters. Why interpreters? Tree-based interpreters. Text-based interpreters
Interpreters and virtual machines Michel Schinz 2007 03 23 Interpreters Interpreters Why interpreters? An interpreter is a program that executes another program, represented as some kind of data-structure.
Keil C51 Cross Compiler
Keil C51 Cross Compiler ANSI C Compiler Generates fast compact code for the 8051 and it s derivatives Advantages of C over Assembler Do not need to know the microcontroller instruction set Register allocation
Instruction Set Architecture (ISA) Design. Classification Categories
Instruction Set Architecture (ISA) Design Overview» Classify Instruction set architectures» Look at how applications use ISAs» Examine a modern RISC ISA (DLX)» Measurement of ISA usage in real computers
ELE 356 Computer Engineering II. Section 1 Foundations Class 6 Architecture
ELE 356 Computer Engineering II Section 1 Foundations Class 6 Architecture History ENIAC Video 2 tj History Mechanical Devices Abacus 3 tj History Mechanical Devices The Antikythera Mechanism Oldest known
Multi-core architectures. Jernej Barbic 15-213, Spring 2007 May 3, 2007
Multi-core architectures Jernej Barbic 15-213, Spring 2007 May 3, 2007 1 Single-core computer 2 Single-core CPU chip the single core 3 Multi-core architectures This lecture is about a new trend in computer
SmartArrays and Java Frequently Asked Questions
SmartArrays and Java Frequently Asked Questions What are SmartArrays? A SmartArray is an intelligent multidimensional array of data. Intelligent means that it has built-in knowledge of how to perform operations
FLIX: Fast Relief for Performance-Hungry Embedded Applications
FLIX: Fast Relief for Performance-Hungry Embedded Applications Tensilica Inc. February 25 25 Tensilica, Inc. 25 Tensilica, Inc. ii Contents FLIX: Fast Relief for Performance-Hungry Embedded Applications...
FAST INVERSE SQUARE ROOT
FAST INVERSE SQUARE ROOT CHRIS LOMONT Abstract. Computing reciprocal square roots is necessary in many applications, such as vector normalization in video games. Often, some loss of precision is acceptable
Attention: This material is copyright 1995-1997 Chris Hecker. All rights reserved.
Attention: This material is copyright 1995-1997 Chris Hecker. All rights reserved. You have permission to read this article for your own education. You do not have permission to put it on your website
Solution: start more than one instruction in the same clock cycle CPI < 1 (or IPC > 1, Instructions per Cycle) Two approaches:
Multiple-Issue Processors Pipelining can achieve CPI close to 1 Mechanisms for handling hazards Static or dynamic scheduling Static or dynamic branch handling Increase in transistor counts (Moore s Law):
Introduction to GPU Programming Languages
CSC 391/691: GPU Programming Fall 2011 Introduction to GPU Programming Languages Copyright 2011 Samuel S. Cho http://www.umiacs.umd.edu/ research/gpu/facilities.html Maryland CPU/GPU Cluster Infrastructure
on an system with an infinite number of processors. Calculate the speedup of
1. Amdahl s law Three enhancements with the following speedups are proposed for a new architecture: Speedup1 = 30 Speedup2 = 20 Speedup3 = 10 Only one enhancement is usable at a time. a) If enhancements
Computer Architecture Lecture 2: Instruction Set Principles (Appendix A) Chih Wei Liu 劉 志 尉 National Chiao Tung University [email protected].
Computer Architecture Lecture 2: Instruction Set Principles (Appendix A) Chih Wei Liu 劉 志 尉 National Chiao Tung University [email protected] Review Computers in mid 50 s Hardware was expensive
Computer Architecture Basics
Computer Architecture Basics CIS 450 Computer Organization and Architecture Copyright c 2002 Tim Bower The interface between a computer s hardware and its software is its architecture The architecture
Rethinking SIMD Vectorization for In-Memory Databases
SIGMOD 215, Melbourne, Victoria, Australia Rethinking SIMD Vectorization for In-Memory Databases Orestis Polychroniou Columbia University Arun Raghavan Oracle Labs Kenneth A. Ross Columbia University Latest
OpenACC 2.0 and the PGI Accelerator Compilers
OpenACC 2.0 and the PGI Accelerator Compilers Michael Wolfe The Portland Group [email protected] This presentation discusses the additions made to the OpenACC API in Version 2.0. I will also present
ASSEMBLY PROGRAMMING ON A VIRTUAL COMPUTER
ASSEMBLY PROGRAMMING ON A VIRTUAL COMPUTER Pierre A. von Kaenel Mathematics and Computer Science Department Skidmore College Saratoga Springs, NY 12866 (518) 580-5292 [email protected] ABSTRACT This paper
CHAPTER 5 Round-off errors
CHAPTER 5 Round-off errors In the two previous chapters we have seen how numbers can be represented in the binary numeral system and how this is the basis for representing numbers in computers. Since any
Lecture 3: Evaluating Computer Architectures. Software & Hardware: The Virtuous Cycle?
Lecture 3: Evaluating Computer Architectures Announcements - Reminder: Homework 1 due Thursday 2/2 Last Time technology back ground Computer elements Circuits and timing Virtuous cycle of the past and
Week 1 out-of-class notes, discussions and sample problems
Week 1 out-of-class notes, discussions and sample problems Although we will primarily concentrate on RISC processors as found in some desktop/laptop computers, here we take a look at the varying types
Logical Operations. Control Unit. Contents. Arithmetic Operations. Objectives. The Central Processing Unit: Arithmetic / Logic Unit.
Objectives The Central Processing Unit: What Goes on Inside the Computer Chapter 4 Identify the components of the central processing unit and how they work together and interact with memory Describe how
The Bus (PCI and PCI-Express)
4 Jan, 2008 The Bus (PCI and PCI-Express) The CPU, memory, disks, and all the other devices in a computer have to be able to communicate and exchange data. The technology that connects them is called the
PROBLEMS #20,R0,R1 #$3A,R2,R4
506 CHAPTER 8 PIPELINING (Corrisponde al cap. 11 - Introduzione al pipelining) PROBLEMS 8.1 Consider the following sequence of instructions Mul And #20,R0,R1 #3,R2,R3 #$3A,R2,R4 R0,R2,R5 In all instructions,
How To Write A Hexadecimal Program
The mathematics of RAID-6 H. Peter Anvin First version 20 January 2004 Last updated 20 December 2011 RAID-6 supports losing any two drives. syndromes, generally referred P and Q. The way
a storage location directly on the CPU, used for temporary storage of small amounts of data during processing.
CS143 Handout 18 Summer 2008 30 July, 2008 Processor Architectures Handout written by Maggie Johnson and revised by Julie Zelenski. Architecture Vocabulary Let s review a few relevant hardware definitions:
Chapter 2 Basic Structure of Computers. Jin-Fu Li Department of Electrical Engineering National Central University Jungli, Taiwan
Chapter 2 Basic Structure of Computers Jin-Fu Li Department of Electrical Engineering National Central University Jungli, Taiwan Outline Functional Units Basic Operational Concepts Bus Structures Software
Let s put together a Manual Processor
Lecture 14 Let s put together a Manual Processor Hardware Lecture 14 Slide 1 The processor Inside every computer there is at least one processor which can take an instruction, some operands and produce
Chapter 1 Computer System Overview
Operating Systems: Internals and Design Principles Chapter 1 Computer System Overview Eighth Edition By William Stallings Operating System Exploits the hardware resources of one or more processors Provides
Operation Count; Numerical Linear Algebra
10 Operation Count; Numerical Linear Algebra 10.1 Introduction Many computations are limited simply by the sheer number of required additions, multiplications, or function evaluations. If floating-point
A Lab Course on Computer Architecture
A Lab Course on Computer Architecture Pedro López José Duato Depto. de Informática de Sistemas y Computadores Facultad de Informática Universidad Politécnica de Valencia Camino de Vera s/n, 46071 - Valencia,
Embedded Systems. Review of ANSI C Topics. A Review of ANSI C and Considerations for Embedded C Programming. Basic features of C
Embedded Systems A Review of ANSI C and Considerations for Embedded C Programming Dr. Jeff Jackson Lecture 2-1 Review of ANSI C Topics Basic features of C C fundamentals Basic data types Expressions Selection
A numerically adaptive implementation of the simplex method
A numerically adaptive implementation of the simplex method József Smidla, Péter Tar, István Maros Department of Computer Science and Systems Technology University of Pannonia 17th of December 2014. 1
CS:APP Chapter 4 Computer Architecture. Wrap-Up. William J. Taffe Plymouth State University. using the slides of
CS:APP Chapter 4 Computer Architecture Wrap-Up William J. Taffe Plymouth State University using the slides of Randal E. Bryant Carnegie Mellon University Overview Wrap-Up of PIPE Design Performance analysis
OAMulator. Online One Address Machine emulator and OAMPL compiler. http://myspiders.biz.uiowa.edu/~fil/oam/
OAMulator Online One Address Machine emulator and OAMPL compiler http://myspiders.biz.uiowa.edu/~fil/oam/ OAMulator educational goals OAM emulator concepts Von Neumann architecture Registers, ALU, controller
Introduction to Microprocessors
Introduction to Microprocessors Yuri Baida [email protected] [email protected] October 2, 2010 Moscow Institute of Physics and Technology Agenda Background and History What is a microprocessor?
Intel 64 and IA-32 Architectures Software Developer s Manual
Intel 64 and IA-32 Architectures Software Developer s Manual Volume 1: Basic Architecture NOTE: The Intel 64 and IA-32 Architectures Software Developer's Manual consists of seven volumes: Basic Architecture,
Computer Organization and Architecture
Computer Organization and Architecture Chapter 11 Instruction Sets: Addressing Modes and Formats Instruction Set Design One goal of instruction set design is to minimize instruction length Another goal
Chapter 6. Inside the System Unit. What You Will Learn... Computers Are Your Future. What You Will Learn... Describing Hardware Performance
What You Will Learn... Computers Are Your Future Chapter 6 Understand how computers represent data Understand the measurements used to describe data transfer rates and data storage capacity List the components
64-Bit versus 32-Bit CPUs in Scientific Computing
64-Bit versus 32-Bit CPUs in Scientific Computing Axel Kohlmeyer Lehrstuhl für Theoretische Chemie Ruhr-Universität Bochum March 2004 1/25 Outline 64-Bit and 32-Bit CPU Examples
C# and Other Languages
C# and Other Languages Rob Miles Department of Computer Science Why do we have lots of Programming Languages? Different developer audiences Different application areas/target platforms Graphics, AI, List
Computer Performance. Topic 3. Contents. Prerequisite knowledge Before studying this topic you should be able to:
55 Topic 3 Computer Performance Contents 3.1 Introduction...................................... 56 3.2 Measuring performance............................... 56 3.2.1 Clock Speed.................................
Hardware/Software Co-Design of a Java Virtual Machine
Hardware/Software Co-Design of a Java Virtual Machine Kenneth B. Kent University of Victoria Dept. of Computer Science Victoria, British Columbia, Canada [email protected] Micaela Serra University of Victoria
A Survey on ARM Cortex A Processors. Wei Wang Tanima Dey
A Survey on ARM Cortex A Processors Wei Wang Tanima Dey 1 Overview of ARM Processors Focusing on Cortex A9 & Cortex A15 ARM ships no processors but only IP cores For SoC integration Targeting markets:
Optimizing matrix multiplication Amitabha Banerjee [email protected]
Optimizing matrix multiplication Amitabha Banerjee [email protected] Present compilers are incapable of fully harnessing the processor architecture complexity. There is a wide gap between the available
Introducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child
Introducing A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child Bio Tim Child 35 years experience of software development Formerly VP Oracle Corporation VP BEA Systems Inc.
18-447 Computer Architecture Lecture 3: ISA Tradeoffs. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 1/18/2013
18-447 Computer Architecture Lecture 3: ISA Tradeoffs Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 1/18/2013 Reminder: Homeworks for Next Two Weeks Homework 0 Due next Wednesday (Jan 23), right
Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) [email protected] http://www.mzahran.com
CSCI-GA.3033-012 Graphics Processing Units (GPUs): Architecture and Programming Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) [email protected] http://www.mzahran.com Modern GPU
Parrot in a Nutshell. Dan Sugalski [email protected]. Parrot in a nutshell 1
Parrot in a Nutshell Dan Sugalski [email protected] Parrot in a nutshell 1 What is Parrot The interpreter for perl 6 A multi-language virtual machine An April Fools joke gotten out of hand Parrot in a nutshell
This Unit: Putting It All Together. CIS 501 Computer Architecture. Sources. What is Computer Architecture?
This Unit: Putting It All Together CIS 501 Computer Architecture Unit 11: Putting It All Together: Anatomy of the XBox 360 Game Console Slides originally developed by Amir Roth with contributions by Milo
Pipelining Review and Its Limitations
Pipelining Review and Its Limitations Yuri Baida [email protected] [email protected] October 16, 2010 Moscow Institute of Physics and Technology Agenda Review Instruction set architecture Basic
Unit 4: Performance & Benchmarking. Performance Metrics. This Unit. CIS 501: Computer Architecture. Performance: Latency vs.
This Unit CIS 501: Computer Architecture Unit 4: Performance & Benchmarking Metrics Latency and throughput Speedup Averaging CPU Performance Performance Pitfalls Slides'developed'by'Milo'Mar0n'&'Amir'Roth'at'the'University'of'Pennsylvania'
The Evolution of CCD Clock Sequencers at MIT: Looking to the Future through History
The Evolution of CCD Clock Sequencers at MIT: Looking to the Future through History John P. Doty, Noqsi Aerospace, Ltd. This work is Copyright 2007 Noqsi Aerospace, Ltd. This work is licensed under the
