CPU-specific optimization. Example of a target CPU core: ARM Cortex-M4F core inside LM4F120H5QR microcontroller in Stellaris LM4F120 Launchpad.


Example of a function that we want to optimize: adding 1000 integers mod 2^32.

Reference implementation:

    int sum(int *x)
    {
      int result = 0;
      int i;
      for (i = 0; i < 1000; ++i)
        result += x[i];
      return result;
    }

Counting cycles:

    static volatile unsigned int *const DWT_CYCCNT
      = (void *) 0xE0001004;
    ...
    int beforesum = *DWT_CYCCNT;
    int result = sum(x);
    int aftersum = *DWT_CYCCNT;
    UARTprintf("sum %d %d\n", result, aftersum - beforesum);

Output shows 8012 cycles. Change 1000 to 500: 4012 cycles.

Okay, 8 cycles per addition. Are microcontrollers really this slow at addition?

Bad approach: Apply random optimizations (and tweak compiler options) until you get bored/frustrated. Keep the fastest results.

Try -Os: 8012 cycles.
Try -O1: 8012 cycles.
Try -O2: 8012 cycles.
Try -O3: 8012 cycles.

Try moving the pointer:

    int sum(int *x)
    {
      int result = 0;
      int i;
      for (i = 0; i < 1000; ++i)
        result += *x++;
      return result;
    }

8010 cycles.

Try counting down:

    int sum(int *x)
    {
      int result = 0;
      int i;
      for (i = 1000; i > 0; --i)
        result += *x++;
      return result;
    }

8010 cycles.

Try using an end pointer:

    int sum(int *x)
    {
      int result = 0;
      int *y = x + 1000;
      while (x != y)
        result += *x++;
      return result;
    }

8010 cycles.

Back to original. Try unrolling:

    int sum(int *x)
    {
      int result = 0;
      int i;
      for (i = 0; i < 1000; i += 2) {
        result += x[i];
        result += x[i + 1];
      }
      return result;
    }

5016 cycles.

Unroll further:

    int sum(int *x)
    {
      int result = 0;
      int i;
      for (i = 0; i < 1000; i += 5) {
        result += x[i];
        result += x[i + 1];
        result += x[i + 2];
        result += x[i + 3];
        result += x[i + 4];
      }
      return result;
    }

4016 cycles. Are we done now?

Most random optimizations that we tried seem useless. Can spend time trying more. Does frustration level tell us that we're close to optimal?

Good approach: Figure out a lower bound for cycles spent on arithmetic etc. Understand the gap between the lower bound and the observed time. Let's try this approach.

Find the ARM Cortex-M4 Processor Technical Reference Manual. Rely on Wikipedia's comment that M4F = M4 + floating-point unit.

The manual says that the Cortex-M4 implements the ARMv7E-M architecture profile. It points to the ARMv7-M Architecture Reference Manual, which defines instructions: e.g., ADD for 32-bit addition. The first manual says that ADD takes just 1 cycle.

Inputs and output of ADD are integer registers. ARMv7-M has 16 integer registers, including the special-purpose stack pointer and program counter.

Each element of the x array needs to be loaded into a register. Basic load instruction: LDR. The manual says 2 cycles but adds a note about pipelining. Then more explanation: if the next instruction is also an LDR (with address not based on the first LDR) then it saves 1 cycle.

n consecutive LDRs take only n + 1 cycles ("more multiple LDRs can be pipelined together").

Can achieve this speed in other ways (LDRD, LDM) but nothing seems faster.

Lower bound for n LDR + n ADD: 2n + 1 cycles, including n cycles of arithmetic.

Why the observed time is higher: non-consecutive LDRs; costs of manipulating i.

2281 cycles using ldr.w:

    y = x + 4000
    p = x
    result = 0

    loop:
    xi9 = *(uint32 *) (p + 76)
    xi8 = *(uint32 *) (p + 72)
    xi7 = *(uint32 *) (p + 68)
    xi6 = *(uint32 *) (p + 64)
    xi5 = *(uint32 *) (p + 60)
    xi4 = *(uint32 *) (p + 56)
    xi3 = *(uint32 *) (p + 52)
    xi2 = *(uint32 *) (p + 48)
    xi1 = *(uint32 *) (p + 44)
    xi0 = *(uint32 *) (p + 40)
    result += xi9
    result += xi8
    result += xi7
    result += xi6
    result += xi5
    result += xi4
    result += xi3
    result += xi2
    result += xi1
    result += xi0
    xi9 = *(uint32 *) (p + 36)
    xi8 = *(uint32 *) (p + 32)
    xi7 = *(uint32 *) (p + 28)
    xi6 = *(uint32 *) (p + 24)
    xi5 = *(uint32 *) (p + 20)
    xi4 = *(uint32 *) (p + 16)
    xi3 = *(uint32 *) (p + 12)
    xi2 = *(uint32 *) (p + 8)
    xi1 = *(uint32 *) (p + 4)
    xi0 = *(uint32 *) p; p += 160
    result += xi9
    result += xi8
    result += xi7
    result += xi6
    result += xi5
    result += xi4
    result += xi3
    result += xi2
    result += xi1
    result += xi0
    xi9 = *(uint32 *) (p - 4)
    xi8 = *(uint32 *) (p - 8)
    xi7 = *(uint32 *) (p - 12)
    xi6 = *(uint32 *) (p - 16)
    xi5 = *(uint32 *) (p - 20)
    xi4 = *(uint32 *) (p - 24)
    xi3 = *(uint32 *) (p - 28)
    xi2 = *(uint32 *) (p - 32)
    xi1 = *(uint32 *) (p - 36)
    xi0 = *(uint32 *) (p - 40)
    result += xi9
    result += xi8
    result += xi7
    result += xi6
    result += xi5
    result += xi4
    result += xi3
    result += xi2
    result += xi1
    result += xi0
    xi9 = *(uint32 *) (p - 44)
    xi8 = *(uint32 *) (p - 48)
    xi7 = *(uint32 *) (p - 52)
    xi6 = *(uint32 *) (p - 56)
    xi5 = *(uint32 *) (p - 60)
    xi4 = *(uint32 *) (p - 64)
    xi3 = *(uint32 *) (p - 68)
    xi2 = *(uint32 *) (p - 72)
    xi1 = *(uint32 *) (p - 76)
    xi0 = *(uint32 *) (p - 80)
    result += xi9
    result += xi8
    result += xi7
    result += xi6
    result += xi5
    result += xi4
    result += xi3
    result += xi2
    result += xi1
    result += xi0
    =? p - y
    goto loop if !=

Wikipedia: "By the late 1990s for even performance sensitive code, optimizing compilers exceeded the performance of human experts."

Reality: The fastest software today relies on human experts understanding the CPU. Cannot trust the compiler to optimize instruction selection. Cannot trust the compiler to optimize instruction scheduling. Cannot trust the compiler to optimize register allocation.

The big picture: CPUs are evolving farther and farther away from naive models of CPUs.

Minor optimization challenges: pipelining; superscalar processing.

Major optimization challenges: vectorization; many threads and many cores; the memory hierarchy; the ring; the mesh; larger-scale parallelism; larger-scale networking.

CPU design in a nutshell.

[Circuit diagram omitted: NAND gates, a, b -> 1 - ab, computing the product h0 + 2h1 + 4h2 + 8h3 of the integers f0 + 2f1 and g0 + 2g1.]

Electricity takes time to percolate through wires and gates. If f0, f1, g0, g1 are stable then h0, h1, h2, h3 are stable a few moments later.

Build a circuit with more gates to multiply (e.g.) 32-bit integers. (Details omitted.)

Build a circuit to compute the 32-bit integer ri given a 4-bit integer i and 32-bit integers r0, r1, ..., r15: "register read".

Build a circuit for "register write": r0, ..., r15, s, i -> r0', ..., r15', where rj' = rj except ri' = s.

Build a circuit for addition. Etc.

Combine these into a circuit computing r0, ..., r15, i, j, k -> r0', ..., r15', where rl' = rl except ri' = rj rk: register read, register read, multiply, register write.

Add more flexibility.

More arithmetic: replace (i, j, k) with (*, i, j, k) and (+, i, j, k) and more options.

More (but slower) storage: "load" from and "store" to larger RAM arrays.

"Instruction fetch": p -> op, ip, jp, kp, p'.

"Instruction decode": decompression of the compressed format for op, ip, jp, kp, p'.

Build flip-flops storing (p, r0, ..., r15). Hook the (p, r0, ..., r15) flip-flops into the circuit inputs. Hook the outputs (p', r0', ..., r15') into the same flip-flops. At each clock tick, the flip-flops are overwritten with the outputs. The clock needs to be slow enough for electricity to percolate all the way through the circuit, from flip-flops to flip-flops.

Now have a semi-flexible CPU:

[Diagram omitted: flip-flops -> insn fetch -> insn decode -> register read -> arithmetic -> register write.]

Further flexibility is useful: e.g., logic instructions.

Pipelining allows a faster clock:

    stage 1: insn fetch
    stage 2: insn decode
    stage 3: register read
    stage 4: arithmetic
    stage 5: register write

with flip-flops between consecutive stages.

Goal: Stage n handles an instruction one tick after stage n - 1.

Instruction fetch reads the next instruction, feeds p' back, and sends the instruction onward. After the next clock tick, instruction decode uncompresses this instruction, while instruction fetch reads another instruction.

Some extra flip-flop area. Also extra area to preserve instruction semantics: e.g., stall on read-after-write.

Superscalar processing:

[Diagram omitted: duplicated insn-fetch and insn-decode units, extra register-read ports, arithmetic, and two register-write ports, so two instructions flow through per tick.]

Vector processing: Expand each 32-bit integer into an n-vector of 32-bit integers. ARM NEON has n = 4; Intel AVX2 has n = 8; Intel AVX-512 has n = 16; GPUs have larger n.

n-times speedup if n-times arithmetic circuits and n-times read/write circuits. Benefit: amortizes the insn fetch and decode circuits.

Huge effect on higher-level algorithms and data structures.

Network on chip: the mesh.

How expensive is sorting? Input: an array of n numbers. Each number in {1, 2, ..., n^2}, represented in binary. Output: array of n numbers, in increasing order, represented in binary; same multiset as input.

Metric: seconds used by a circuit of area n^(1+o(1)). For simplicity assume n = 4^k.

Spread the array across a square mesh of n small cells, each of area n^(o(1)), with near-neighbor wiring.

Sort a row of n^(0.5) cells in n^(0.5+o(1)) seconds:

Sort each pair in parallel:
    3 1 4 1 5 9 2 6  ->  1 3 1 4 5 9 2 6
Sort alternate pairs in parallel:
    1 3 1 4 5 9 2 6  ->  1 1 3 4 5 2 9 6
Repeat until the number of steps equals the row length.

Sort each row, in parallel, in a total of n^(0.5+o(1)) seconds.

Sort all n cells in n^(0.5+o(1)) seconds:

Recursively sort the quadrants in parallel, if n > 1. Sort each column in parallel. Sort each row in parallel. Sort each column in parallel. Sort each row in parallel.

With the proper choice of left-to-right/right-to-left for each row, one can prove that this sorts the whole array.

For example, assume that this 8 x 8 array is in cells:

    3 1 4 1 5 9 2 6
    5 3 5 8 9 7 9 3
    2 3 8 4 6 2 6 4
    3 3 8 3 2 7 9 5
    0 2 8 8 4 1 9 7
    1 6 9 3 9 9 3 7
    5 1 0 5 8 2 0 9
    7 4 9 4 4 5 9 2

Recursively sort the quadrants (left-to-right in the top half, right-to-left in the bottom half):

    1 1 2 3 2 2 2 3
    3 3 3 3 4 5 5 6
    3 4 4 5 6 6 7 7
    5 8 8 8 9 9 9 9
    1 1 0 0 2 2 1 0
    4 4 3 2 5 4 4 3
    7 6 5 5 9 8 7 7
    9 9 8 8 9 9 9 9

Sort each column in parallel:

    1 1 0 0 2 2 1 0
    1 1 2 2 2 2 2 3
    3 3 3 3 4 4 4 3
    3 4 3 3 5 5 5 6
    4 4 4 5 6 6 7 7
    5 6 5 5 9 8 7 7
    7 8 8 8 9 9 9 9
    9 9 8 8 9 9 9 9

Sort each row in parallel, alternating directions:

    0 0 0 1 1 1 2 2
    3 2 2 2 2 2 1 1
    3 3 3 3 3 4 4 4
    6 5 5 5 4 3 3 3
    4 4 4 5 6 6 7 7
    9 8 7 7 6 5 5 5
    7 8 8 8 9 9 9 9
    9 9 9 9 9 9 8 8

Sort each column in parallel:

    0 0 0 1 1 1 1 1
    3 2 2 2 2 2 2 2
    3 3 3 3 3 3 3 3
    4 4 4 5 4 4 4 4
    6 5 5 5 6 5 5 5
    7 8 7 7 6 6 7 7
    9 8 8 8 9 9 8 8
    9 9 9 9 9 9 9 9

Sort each row in parallel, left-to-right (or as desired):

    0 0 0 1 1 1 1 1
    2 2 2 2 2 2 2 3
    3 3 3 3 3 3 3 3
    4 4 4 4 4 4 4 5
    5 5 5 5 5 5 6 6
    6 6 7 7 7 7 7 8
    8 8 8 8 8 9 9 9
    9 9 9 9 9 9 9 9

Chips are in fact evolving towards having this much parallelism and communication. GPUs: parallel + global RAM. Old Xeon Phi: parallel + ring. New Xeon Phi: parallel + mesh.

Algorithm designers don't even get the right exponent without taking this into account.

Shock waves from subroutines into high-level algorithm design.