SIMD Instructions
Felix von Leitner, CCC Berlin
felix-simd@fefe.de
August 2003

Abstract: General purpose CPUs have become powerful enough to decode and even encode MPEG audio and video in real time. These tasks have previously been the domain of special purpose hardware, DSPs and FPGAs. These feats would in general not be possible without SIMD instructions.

SIMD Hacking
Agenda

1. How do CPUs work
2. Making a CPU that is good at number crunching
3. Typical vector instruction sets: MMX, SSE
4. Vectorizing code
How do CPUs work (VAX, 386)

1. read one instruction from memory (FETCH)
2. decode instruction (DECODE)
3. load prerequisite data (READ)
4. execute instruction (EXEC)
5. write back results (WRITE)
6. repeat from step 1
How do CPUs work (VAX, 386)

Each instruction runs through all five stages before the next one starts:

+===+===+===+===+===+
| F | D | R | E | W |
+===+===+===+===+===+
                    +===+===+===+===+===+
                    | F | D | R | E | W |
                    +===+===+===+===+===+

Problem: slow.
How do CPUs work (486, VIA C3, early RISC)

Pipelining: the stages of consecutive instructions overlap:

+===+===+===+===+===+
| F | D | R | E | W |
+===+===+===+===+===+
    +===+===+===+===+===+
    | F | D | R | E | W |
    +===+===+===+===+===+
        +===+===+===+===+===+
        | F | D | R | E | W |
        +===+===+===+===+===+

Problem: needs a good compiler; bad for architectures with few registers.
How do CPUs work (Pentium, later RISC)

Two pipelines: one general purpose pipeline, and one special purpose pipeline that only executes some instructions.

Problem: needs a good compiler; bad for architectures with few registers.
How do CPUs work (current x86 or RISC)

Fetch many instructions.
Calculate the dependency graph.
Send instructions to functional units; each functional unit has a wait queue.
If you need more floating point performance, add another FP unit.
Has shadow registers, renames registers internally.

Problem: expensive, complex. State of the art.
Register renaming

Register renaming makes it possible for a CPU to execute several iterations of the same loop in parallel. However, the iterations must be independent. If the architecture has few registers, it may not be possible to formulate the loop so that the iterations are independent. Adding registers is usually not an option (it breaks assemblers and compilers). Simulations have shown that adding more than 32 registers yields very little improvement, so most RISC architectures have 32 integer and 32 floating point registers. But x86 has only 7 general purpose registers!
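The effect of independent iterations shows up even in plain C. A minimal sketch (function names are mine, not from the talk): the second version keeps four independent accumulator chains, which a renaming, out-of-order CPU can execute in parallel, while the first is one long dependency chain.

```c
#include <stddef.h>

/* One accumulator: every add depends on the previous one. */
float sum_serial(const float *a, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four independent accumulators: register renaming can keep four
   add chains in flight at once (assumes n is divisible by 4). */
float sum_unrolled(const float *a, size_t n) {
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (size_t i = 0; i < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```

On x86, with its 7 general purpose registers, the compiler may have to spill some of these accumulators to the stack, which is exactly the problem described above.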
How to make a CPU good for number crunching

There are several ways to improve floating point performance:

Use low latency FP units: there is a lower bound, and it is not useful for rare instructions like sqrt.
Reduce the work of the FP units: the Alpha defers normalization and does no exception handling.
Add more FP units: useless on a stack architecture (387), only helps for parallel algorithms, and expensive.

Out-of-order execution gives excellent FP performance even without hand-optimized assembly core routines.
How to make a CPU good for number crunching

The number 1 performance killer in modern CPUs is branches. Because a CPU fetches instructions and then distributes them to functional units, it needs to know which instructions to fetch and distribute. Branches make that impossible. So CPUs either do elaborate guesswork to predict the outcome of a branch, or they execute both outcomes and then discard the wrong one (IA64). It turns out that floating point and DSP code usually has comparatively few branches, so it is actually easier to build a CPU that is a good number cruncher than one that is good at interpreting Perl or Java.
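One way around the branch problem is to compute both outcomes and select with a mask instead of jumping, which is essentially what predication and SIMD compares do. A hedged scalar sketch (names are mine; it assumes arithmetic right shift on signed integers, which virtually all compilers provide):

```c
#include <stdint.h>

/* Branchy max: the CPU has to predict the comparison. */
int32_t max_branchy(int32_t a, int32_t b) {
    return a > b ? a : b;
}

/* Branchless max: derive an all-0/all-1 mask from the sign of a - b
   (widened to 64 bits to avoid overflow) and select with it. */
int32_t max_branchless(int32_t a, int32_t b) {
    int64_t diff = (int64_t)a - b;
    int64_t mask = diff >> 63;          /* all-1 if a < b, else 0 */
    return (int32_t)(b + (diff & ~mask));
}
```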
Why add a vector extension at all?

Many operations can be done in constant time, no matter how wide the machine word is. Examples are: XOR, AND, OR, shifting, adding. Partitioning a large word (say, 128 bits) into 4 32-bit words is a no-op for XOR, AND, OR. Shifting needs very little additional hardware; adding and negating simply need new hardware that does not carry over the carry bit at the partitions. Subtracting can be done by adding the negated value, and negating is a XOR (with all-1) plus an ADD of 1. These instructions can all easily be done in 1 machine cycle.
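The point about not carrying across partitions can be sketched in plain C ("SWAR", SIMD within a register; the function name is mine): four byte lanes are added inside one 32-bit word by keeping the top bit of each byte out of the add and patching it back in with a carryless XOR.

```c
#include <stdint.h>

/* Add four packed bytes lane-wise: mask off the high bit of every
   byte so the 7-bit adds cannot carry into the neighbouring lane,
   then restore the high bits with XOR (an add without carry). */
uint32_t add_bytes(uint32_t a, uint32_t b) {
    uint32_t low = (a & 0x7f7f7f7fu) + (b & 0x7f7f7f7fu);
    return low ^ ((a ^ b) & 0x80808080u);
}
```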
Historical perspective

The first general purpose CPU to add word partitioning so that one could add 4 bytes in a 32-bit register was... does anyone know?
Historical perspective

No, it's not the Pentium MMX; that was announced in 1996 but shipped in 1997. Sun's UltraSPARC vector extension is called VIS (Visual Instruction Set) and shipped in 1995. The HP PA-RISC vector extension is called MAX (Multimedia Acceleration eXtensions) and also shipped in 1995. HP is widely credited as being the first desktop CPU vendor to ship vector extensions. However, the first CPU with partitioning came from Intel: the i860 (1989), a little known but very innovative CPU. The i860 was the first superscalar CPU, had the first multiply-add, manual pipelining... cool chip! Unfortunately, it failed in the market, because Intel's compiler sucked.
Why add a vector extension at all?

The application that benefits obviously and trivially from integer vector instructions is image operations like alpha channel calculations. These operations often also need saturation, i.e. "add 5, but don't wrap around" (adding 5 to 253 yields 255). Some vector units offer both normal add and add-with-saturation instructions.
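A scalar model of the saturating add (a vector unit applies this to 8 or 16 elements at once; the function name is mine, not a real intrinsic):

```c
#include <stdint.h>

/* Unsigned byte add with saturation: clamp at 255 instead of
   wrapping around, as the MMX PADDUSB instruction does per element. */
uint8_t add_sat_u8(uint8_t a, uint8_t b) {
    unsigned s = (unsigned)a + b;
    return (uint8_t)(s > 255 ? 255 : s);
}
```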
Number crunching with x86

When MMX was added, the worst bottleneck was graphics speed, in particular drawing shaded textures in VR apps and games. 3dfx didn't exist yet; Quake was not on the market yet. So MMX is an integer vector unit; it is useless for 3d calculations and floating point number crunching. Before modern RISC, floating point math was generally much slower than integer math. So an established optimization technique was to express an algorithm like mp3 decoding in integer math. On some embedded architectures this is still useful (ARM, for example).
Number crunching with x86

It turns out that MPEG2 video decoding can be done completely in integer math. And like most imaging algorithms it is highly parallelizable. These days non-vector floating point is faster than non-vector integer math on most architectures. Still, SSE2 has 128-bit registers for floating point and integer math. Floats have 32 bits; if you can write the algorithm with 16-bit integer math, you can put twice as much data in each vector.
The different x86 vector extensions

MMX does integer vector operations on 8 64-bit vector registers that are mapped over the 387 floating point registers (using 8-, 16- and 32-bit integers as elements).
3dnow! is MMX plus instructions to use the MMX registers as vectors of 2 32-bit floats.
SSE adds 8 separate 128-bit registers that can be used as vectors of 4 32-bit floats. SSE also adds a few more integer instructions on the MMX registers.
SSE2 adds vectors of 2 64-bit doubles and integer vector instructions on the 128-bit registers from SSE.
The different RISC vector extensions

Altivec operates on 32 128-bit vectors of floats, 32-, 16- or 8-bit integers. Altivec is very versatile and supports almost every operation.
Alpha MAX supports only integers in the normal 64-bit registers, and offers only compare, logical operations, absolute difference, min/max and pack/unpack.
HP MAX2 operates on 16-bit values in 64-bit registers and can do add, shift, saturating shift-and-add (to roll your own multiply-by-constant), logical operations, pack and shuffle.
SPARC VIS can do add (no saturation), compare, logical operations, absolute difference, pack and unpack on some subsets of 64-bit registers.
Which SIMD instructions are important?

About the only ones actually used are Altivec and the various x86 extensions. I will focus on these from now on. If you want to see how they are used in the real world, the best place to find code using them is ffmpeg, in particular the platform dependent dsputil.c implementations. ffmpeg has platform specific routines for Alpha, ARM, i386 (MMX, SSE), SPARC VIS (using mlib, Sun's abstraction library), Altivec, the Playstation 2 (I'm not kidding!) and SH-4. ARM and SH-4 do not actually have a vector extension, but this is a treasure trove for SIMD hackers.
What can you typically do with SIMD?

Add and subtract vectors of bytes, shorts or ints
Same, but with saturation ("don't go over 255 or below 0")
Logical bitwise operations (obviously): AND, OR, XOR, NOT
Compare (each element in the result vector is set to all-1 or all-0)
Shift (the bits inside each element are shifted, not the elements themselves)
Multiply-and-add
What can you typically do with SIMD?

Absolute difference, i.e. abs(a-b) (Altivec has this)
Min, max (more useful than you'd think!)
Packing (e.g. 2 words -> 2 bytes)
Unpacking (e.g. 2 bytes -> 2 words)
Shuffle (permute the elements in a vector)
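The compare-to-mask convention deserves a scalar sketch, because it is what makes branchless selects possible (illustrative C, names mine; the real instructions do this on every element at once):

```c
#include <stdint.h>

/* SIMD-style compare: the result is all-1 on true, all-0 on false. */
uint16_t cmpgt_u16(uint16_t a, uint16_t b) {
    return a > b ? 0xFFFF : 0x0000;
}

/* The mask then selects without a branch, e.g. an element-wise max. */
uint16_t max_u16(uint16_t a, uint16_t b) {
    uint16_t m = cmpgt_u16(a, b);
    return (uint16_t)((a & m) | (b & (uint16_t)~m));
}
```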
What can you typically NOT do with SIMD?

Look up values in a table using vector elements as index
Multiply the second word with the third word in the same vector
Get the sum of all 8 bytes in a vector

The first one is the most serious, because often you vectorize code that has already been optimized, and people often put statically computable values in tables and later look them up. With SIMD it's often faster to compute the value than to look it up.
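The missing horizontal sum has a workaround in the same mask-and-shift style used later in this talk. A sketch over a 64-bit word standing in for an MMX register (function name mine):

```c
#include <stdint.h>

/* Sum all eight bytes of a 64-bit word: fold byte lanes into 16-bit
   lanes, then into 32-bit lanes, then add the two halves. */
uint32_t hsum_bytes(uint64_t v) {
    uint64_t t = (v & 0x00ff00ff00ff00ffULL)
               + ((v >> 8) & 0x00ff00ff00ff00ffULL);
    t = (t & 0x0000ffff0000ffffULL)
      + ((t >> 16) & 0x0000ffff0000ffffULL);
    return (uint32_t)((t & 0xffffffffULL) + (t >> 32));
}
```

(SSE later added PSADBW; against an all-zero vector it gives exactly this byte sum in one instruction.)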
What does SIMD code look like?

This is actually used in MPEG encoding:

int sum(uint8_t* pix, int width)
{
  int s, i, j;
  for (s = 0, i = 0; i < 16; ++i) {
    for (j = 0; j < 16; ++j)
      s += pix[j];
    pix += width;
  }
  return s;
}
Poor man's SIMD

int sum(uint8_t* pix, int width)
{
  int i, j;
  uint32_t v = 0;
  for (i = 0; i < 16; ++i) {
    uint32_t x;
    for (j = 0; j < 16; j += 4) {
      x = *(uint32_t*)(pix + j);
      v += ((x & 0xff00ff00) >> 8) + (x & 0x00ff00ff);
    }
    pix += width;
  }
  return (v & 0xffff) + (v >> 16);
}
What does SIMD code look like?

Let's vectorize this!

#include <mmintrin.h>

int sum(uint8_t* pix, int width)
{
  int i;
  __m64 x = _mm_add_pi8(*(__m64*)pix, *(__m64*)(pix + 8));
  for (i = 1; i < 16; ++i) {
    pix += width;
    x = _mm_add_pi8(x, *(__m64*)pix);
    x = _mm_add_pi8(x, *(__m64*)(pix + 8));
  }
  /* there is no instruction to sum the elements */
}   /* problem: overflow! */
Uh... What now?

First we need to make sure there are no overflows. We read 8-bit values; we need to add them as 16-bit values. So first, we have to promote them:

1. copy the vector
2. shift right the 16-bit vector elements in the copy by 8
3. AND the original with 0x00ff00ff00ff00ff
4. 16-bit vector add both

Yuck!
Run that by me again

0. read           (1,2,3,4,5,6,7,8)
1. copy           (1,2,3,4,5,6,7,8) (1,2,3,4,5,6,7,8)
2. 16-bit shr 8   (0,1,0,3,0,5,0,7) (1,2,3,4,5,6,7,8)
3. and 00ff00ff   (0,1,0,3,0,5,0,7) (0,2,0,4,0,6,0,8)
4. 16-bit add     (1+2, 3+4, 5+6, 7+8)

Now we have 4 16-bit values.
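The same four steps, modeled in portable C over a 64-bit word standing in for the MMX register (the copy is implicit in the two expressions; the function name is mine):

```c
#include <stdint.h>

/* Promote eight packed bytes to 16-bit lanes and add: hi holds the
   odd bytes, lo the even bytes, and the final add cannot overflow
   a 16-bit lane (255 + 255 < 65536). */
uint64_t promote_add(uint64_t v) {
    uint64_t hi = (v >> 8) & 0x00ff00ff00ff00ffULL;
    uint64_t lo =  v       & 0x00ff00ff00ff00ffULL;
    return hi + lo;
}
```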
More problems

The next floating point operation after using MMX will generate an error (usually a NaN). That is because MMX shares the floating point registers and sets a bit that marks each register as used for MMX. You are required to run the instruction EMMS after the MMX code. This instruction takes 6 cycles on the Pentium 3 and 12 cycles on the Pentium 4 (2 on the Athlon). This alone can eat up all your MMX savings! 3dnow! defines a FEMMS ("fast EMMS") instruction that will not zero the registers but just clear the MMX bit. Still, this sucks, and it is a major source of ridicule in Apple's Altivec PR.
Shuffling

Permuting the elements of a vector is impossible in MMX. You would have to write the vector to memory, permute it manually, and then read the vector back. This is prohibitively slow; don't even think about it. SSE adds a few integer instructions as MMX extensions as well, most importantly PSHUFW. pshufw can only shuffle 16-bit elements. It takes three arguments: an immediate byte constant, a destination register, and a source register or memory location. Every two bits in the immediate byte value are taken as an index into the source vector. So, to turn (1,2,3,4) into (4,3,2,1), the immediate would be 00011011, or 0x1b. Use 0x00 to turn (1,2,3,4) into (1,1,1,1).
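The immediate encoding is easiest to see in a portable model of pshufw's semantics (illustrative C, not the real intrinsic; element 0 is the least significant word):

```c
#include <stdint.h>

/* Model of pshufw: every 2-bit field of imm selects the source
   16-bit element for one destination element. */
uint64_t pshufw_model(uint64_t src, uint8_t imm) {
    uint64_t dst = 0;
    for (int i = 0; i < 4; i++) {
        int sel = (imm >> (2 * i)) & 3;               /* source index */
        uint64_t elem = (src >> (16 * sel)) & 0xffff; /* fetch it */
        dst |= elem << (16 * i);                      /* place it */
    }
    return dst;
}
```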
Integer multiplication

Multiplying two n bit integers yields a 2n bit result. MMX handles this by having two multiply instructions: one for the upper n bits of the result, one for the lower n bits of the result. This is handy because typically integer multiplication happens in ported floating point code. DSP style floating point code works on values between 0 and 1, so you work on the most significant bits of the fractional part of the number. This means that after a multiplication you would normally shift the result right to get the most significant bits. The multiply-high instruction means you don't have to shift.
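A scalar model of the multiply-high idea (illustrative; the MMX instruction PMULHW does this on four 16-bit elements at once):

```c
#include <stdint.h>

/* Multiply two 16-bit values and keep the upper 16 bits of the
   32-bit product -- the most significant bits of the fraction,
   with no extra shift needed. */
int16_t mulhi16(int16_t a, int16_t b) {
    return (int16_t)(((int32_t)a * b) >> 16);
}
```

With 0.5 represented as 16384, mulhi16(16384, 16384) yields 4096, i.e. 0.25 shifted down one bit position; fixed point code accounts for that one bit.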
What now?

As you can see, there is major bit fiddling involved in porting even this trivial function to MMX. Even worse, many typical operations just are not there for vectors. I find it intellectually stimulating to find SIMD ways to do things. For example, to set an MMX register to 0, you XOR it with itself. This is a well-known trick from non-MMX code. To set an MMX register to all-1, you... well, there is no decrement for SIMD. So what you do is compare a vector with itself for equality. The result is always true, so the result vector is all-1.
Other interesting ways to do stuff

To exchange two vectors: XOR a,b; XOR b,a; XOR a,b.
To calculate abs() on a float vector: AND each element with 0x7fffffff.
To make a vector with 0x7fffffff: make an all-1 vector and shr by 1.
To make a vector with 0x80000000: make an all-1 vector and shl by 31.

Imaging people often need the absolute sum of differences of two images. There is no abs() for MMX. The solution is to calculate a-b and b-a... with saturation! So one is 0 and the other is non-negative. OR them and you are done.
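The saturating |a - b| trick, as a scalar model (function names are mine; MMX does the saturating subtract per element with PSUBUSB):

```c
#include <stdint.h>

/* Unsigned saturating subtract: negative results clamp to 0. */
uint8_t sub_sat_u8(uint8_t a, uint8_t b) {
    return (uint8_t)(a > b ? a - b : 0);
}

/* One direction is 0, the other holds |a - b|; OR merges them. */
uint8_t absdiff_u8(uint8_t a, uint8_t b) {
    return (uint8_t)(sub_sat_u8(a, b) | sub_sat_u8(b, a));
}
```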
What else?

SSE has 128-bit vectors, i.e. 4 floats. There are two load instructions: a fast one that only works if the memory location is 16-byte aligned, and a slow one for the general case. If you use gcc to write your SSE code and you declare an __m128 variable on the stack, gcc will not make sure it is 16-byte aligned, but will generate the aligned load and store instructions. The result is a core dump (I reported this bug a few months ago). The problem is: gcc without -O generates tons of temporaries on the stack. So you will notice that suddenly, when you start compiling with -g to find that bug, your SSE code will start segfaulting all around you. Don't be surprised, it's a known bug.
What about floating point vectors?

Partitioning floating point words is not as easy as partitioning an integer word. CPUs with floating point vector math usually add several math units, so normal math and vector math go to different units (on x86, the normal math unit can do stuff like sin() and log(); the vector units can only do basic arithmetic). If you interleave vector floating point code with scalar code, you can be even faster than with just the vector code!
Funky floating point math

3dnow! and SSE offer instructions to calculate an approximation of 1/x. The background is that multiplying is always faster than dividing, so multiplying by 1/x is often faster than straight division, in particular if you do several divisions with the same divisor. People often need square roots. Instead of a fast square root, 3dnow! and SSE offer instructions to approximate 1/sqrt(x). 3dnow! is even funkier than SSE in that it offers two instructions each: a quick one, and one that you call additionally if you need more precision.
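The "call one more instruction for extra precision" pattern amounts to a Newton-Raphson step. A hedged scalar sketch (the coarse estimate would come from rsqrtps or pfrsqrt; here it is simply passed in, and the function name is mine):

```c
/* Refine a coarse 1/sqrt(x) estimate with one Newton-Raphson step:
   y' = y * (1.5 - 0.5 * x * y * y). Each step roughly doubles the
   number of correct bits. */
float rsqrt_refine(float x, float y_est) {
    return y_est * (1.5f - 0.5f * x * y_est * y_est);
}
```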
Problem cases

In many real-life functions you end up shuffling your vectors around all the time. For example, many imaging algorithms do a zig-zag traversal of the image, i.e. first scan line from left to right, next scan line from right to left. One particularly nasty function is a part of MPEG 4 where you compare neighbouring pixels, but if your vector leaves the 16x16 tile to the left or right, you don't take the pixels from the neighbouring tile, but reflect off the border. This is a major switch statement with a lot of evil shuffling to get it done with vector instructions at all. Every if, switch or shuffle hurts performance.
Case study

I recently spent some time with libvorbis and an SSE manual. I used gcov to find lines of code that are executed particularly often. I found about 10 hot spots in the encoder code (psy.c). Three of them looked like using vectors might be beneficial. I converted those three. The speed-up was 25%. It took me about a week to learn SSE enough to do this.
Go forth and multiply!

If I can do it, so can you. Send questions to felix-simd@fefe.de!

BTW: If you want to learn more about computer architecture, go to a good book store and order the Hennessy-Patterson. If they are any good, they'll know.