The Evolution of High Performance Computers


David Chisnall, February 8, 2011

Outline

Classic Computer Architecture: Fetch, Decode, Execute

Execute? Three flavours:
Load: 1. Read a value from memory 2. Store it in a register
Store: 1. Read a value from a register 2. Store it in memory
Calculate: 1. Read value(s) from register(s) 2. Perform the calculation 3. Store the result in a register
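
To make the cycle concrete, here is a minimal sketch of a fetch/decode/execute loop in C. The four-opcode instruction set, its encoding, and the register file are invented purely for illustration; real ISAs are far richer.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical encoding: opcode, destination, two sources. */
enum { OP_LOAD, OP_STORE, OP_ADD, OP_HALT };
typedef struct { uint8_t op, dst, src1, src2; } insn_t;

int main(void) {
    uint32_t reg[4] = {0};
    uint32_t mem[16] = {7, 35, 0};
    insn_t program[] = {
        {OP_LOAD,  0, 0, 0},   /* r0 = mem[0]      */
        {OP_LOAD,  1, 1, 0},   /* r1 = mem[1]      */
        {OP_ADD,   2, 0, 1},   /* r2 = r0 + r1     */
        {OP_STORE, 2, 2, 0},   /* mem[2] = r2      */
        {OP_HALT,  0, 0, 0},
    };
    for (int pc = 0;;) {
        insn_t i = program[pc++];                 /* fetch   */
        switch (i.op) {                           /* decode  */
        case OP_LOAD:  reg[i.dst] = mem[i.src1]; break;   /* execute */
        case OP_STORE: mem[i.dst] = reg[i.src1]; break;
        case OP_ADD:   reg[i.dst] = reg[i.src1] + reg[i.src2]; break;
        case OP_HALT:  printf("mem[2] = %u\n", mem[2]); return 0;
        }
    }
}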

Calculation? Add or subtract? Shift? Multiply? Divide? Evaluate a polynomial? A Discrete Cosine Transform? The exact set of operations depends on the computer.

A Simple Case [Diagram: registers r1 and r2 feed an adder, whose result is written to register r3]

Making it Faster: Vectors Each register stores multiple values; operations apply to each component.
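
The slides never show how a vector pixel type is defined. As a sketch, the GCC/Clang vector extension gives C types whose arithmetic operators apply component-wise, which is the behaviour the pixel examples below rely on (the values here are invented):

#include <stdio.h>

/* A 4 x 32-bit float vector using the GCC/Clang vector extension.
   Operators like *= apply to every component, ideally as one SIMD
   instruction; [] element access needs GCC >= 4.6 or Clang. */
typedef float float4 __attribute__((vector_size(16)));

int main(void) {
    float4 pixel = {0.5f, 0.25f, 0.125f, 1.0f};  /* R, G, B, A */
    float4 scale = {1.1f, 1.1f, 1.1f, 1.0f};
    pixel *= scale;                              /* one vector multiply */
    printf("%f %f %f %f\n", pixel[0], pixel[1], pixel[2], pixel[3]);
    return 0;
}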

Example Problem Pixel values: 32 bits per channel (red, green, blue, alpha). A lighten operation scales each value by a constant factor: the same operation on each of the colour channels.

Scalar Solution

for (int i = 0; i < imagesize; i++) {
    pixels[i].red   *= 1.1;
    pixels[i].green *= 1.1;
    pixels[i].blue  *= 1.1;
}

Three loads per pixel. Three multiplies per pixel. Three stores per pixel.

Vector Solution

pixel_t scale = {1.1, 1.1, 1.1, 1};
for (int i = 0; i < imagesize; i++) {
    pixels[i] *= scale;
}

One (vector) load per pixel. One (vector) multiply per pixel. One (vector) store per pixel. The redundant multiply (alpha × 1) costs nothing, but it makes the potential speedup only 3×, not 4×, for a 4-element vector unit.

Why is this faster? One fetch and one decode for four executes. Fewer instructions also means less space used for storing the program.

Memory Layout For vectors to be fast, you must be able to load vectors from memory. If you separate the channels, this would not be possible (see the sketch below). Modern compilers can auto-vectorise code with a usable memory layout.
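
A sketch of the two layouts (the array sizes and names are invented for illustration). The interleaved layout keeps one pixel's four channels contiguous, so a single 128-bit load fetches a whole pixel; the split layout scatters them across four arrays, so no single vector load can.

/* Interleaved ("array of structures"): one pixel's R, G, B, A are
   adjacent in memory, so one 16-byte vector load gets a whole pixel. */
struct pixel { float red, green, blue, alpha; };
struct pixel image_interleaved[1024];

/* Separated channels ("structure of arrays"): one pixel's values live
   in four different arrays, so no single vector load fetches a pixel.
   (This layout suits operating on one channel across many pixels.) */
struct image_planar {
    float red[1024];
    float green[1024];
    float blue[1024];
    float alpha[1024];
};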

Even Faster: Parallel Streams Each loop iteration is independent We can do them in parallel

Bigger Vectors?

pixel_pair_t scale = {1.1, 1.1, 1.1, 1, 1.1, 1.1, 1.1, 1};
pixel_pair_t *pairs = (pixel_pair_t *)pixels;
for (int i = 0; i < imagesize / 2; i++) {
    pairs[i] *= scale;
}

Now we can process two pixels at once. Some modern systems have 256-bit vectors, but it is really hard to map most problems to them; most vector units are only 128-bit.
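
For concreteness, here is a sketch of the two-pixels-at-once idea using x86 AVX intrinsics. It assumes an AVX-capable CPU and interleaved RGBA floats (compile with -mavx); the pixel values are invented.

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    /* Two RGBA pixels = 8 floats = one 256-bit AVX register. */
    float pixels[8] = {0.5f, 0.25f, 0.125f, 1.0f,
                       0.3f, 0.6f,  0.9f,   1.0f};
    /* _mm256_setr_ps lists the elements in memory order. */
    __m256 scale = _mm256_setr_ps(1.1f, 1.1f, 1.1f, 1.0f,
                                  1.1f, 1.1f, 1.1f, 1.0f);
    __m256 v = _mm256_loadu_ps(pixels);   /* one 256-bit load     */
    v = _mm256_mul_ps(v, scale);          /* one 256-bit multiply */
    _mm256_storeu_ps(pixels, v);          /* one 256-bit store    */
    for (int i = 0; i < 8; i++) printf("%f\n", pixels[i]);
    return 0;
}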

Independent Processors? Split the loop into two or more loops and run each on a different processor (see the sketch below).
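
A minimal sketch of that split using POSIX threads; the array size, the two-way split, and the scale_range helper are all invented for illustration.

#include <pthread.h>
#include <stdio.h>

#define IMAGESIZE 1024
static float pixels[IMAGESIZE];

typedef struct { int begin, end; } range_t;

/* Each thread scales its own half of the array: no shared writes. */
static void *scale_range(void *arg) {
    range_t *r = arg;
    for (int i = r->begin; i < r->end; i++)
        pixels[i] *= 1.1f;
    return NULL;
}

int main(void) {
    for (int i = 0; i < IMAGESIZE; i++) pixels[i] = 1.0f;
    range_t halves[2] = {{0, IMAGESIZE / 2}, {IMAGESIZE / 2, IMAGESIZE}};
    pthread_t t[2];
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, scale_range, &halves[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    printf("pixels[0] = %f\n", pixels[0]);
    return 0;
}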

Hybrid Solution: Semi-independent Cores Each core runs a group of threads. When the threads are all executing the same thing, they all run; when they diverge at a branch, only the threads on one side run at a time. This is the typical GPU architecture.

What the Programmer Sees One program (kernel) per input (e.g. per pixel). Conceptually, all of them run in parallel; in practice, somewhere between 1 and 128 run in parallel.
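
Here is the earlier lighten operation written in that style, as an OpenCL-flavoured sketch (the kernel name and buffer arguments are assumptions, not from the slides). The host enqueues one instance per pixel; the hardware decides how many run in lock-step.

kernel void lighten(global const float4 *input,
                    global float4 *output)
{
    /* Which pixel is this instance responsible for? */
    int i = get_global_id(0);
    /* Same scale as before: brighten RGB, leave alpha alone. */
    output[i] = input[i] * (float4)(1.1f, 1.1f, 1.1f, 1.0f);
}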

How it Works The processor is an n × m element vector processor; the programmer sees an n element vector processor. The processor starts m elements at once, so linear code Just Works™. Branches are the special case: if all threads take the same branch, all continue to run; if some take a different branch, those are paused and resumed later.

Bad Example

kernel void stripe(global const float4 *input,
                   global float4 *output)
{
    int i = get_global_id(0);
    // Lighten odd pixels, darken even pixels
    if (i % 2) {
        output[i] = input[i] * 1.1f;
    } else {
        output[i] = input[i] * 0.9f;
    }
}

Each pair of threads will take different branches, so only half will actually be running in parallel.
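
One way to remove the divergence (a sketch, not from the original slides): derive the scale factor arithmetically from the pixel's parity, so every thread runs exactly the same instructions.

kernel void stripe_branchless(global const float4 *input,
                              global float4 *output)
{
    int i = get_global_id(0);
    /* (i & 1) is 1 for odd pixels and 0 for even, so the factor is
       1.1 for odd and 0.9 for even: same result, no branch. */
    float factor = 0.9f + 0.2f * (i & 1);
    output[i] = input[i] * factor;
}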

Why it's Useful Good for code with highly independent elements. A higher ratio of execution units to other components than a general-purpose CPU.

When is it not Useful? Problems that don't have independent inputs. Algorithms that have lots of branches.

So, Branches are Fast on CPUs? Fetch, decode, execute is an oversimplification: modern pipelines are 7–20 stages, and you can't fetch the instruction after a branch until after executing the branch! The P4 can have 140 instructions executing at once, but it can also have none executing if it doesn't know what to execute...

Not So Much Then? Typical CPUs do branch prediction. A correct prediction is very fast; an incorrect prediction is very slow. Accuracy is about 95%, so 5% of branches cause a pipeline stall (bad!).
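
A back-of-the-envelope example (the numbers are assumptions: a one-cycle correctly-predicted branch and a 20-cycle misprediction penalty, the deep end of the pipelines above):

0.95 × 1 + 0.05 × 20 = 1.95 ≈ 2 cycles per branch on average

so even a 95%-accurate predictor roughly doubles the effective cost of a branch.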

Another Work-Around: Predicated Instructions Found in ARM and Itanium. The instruction is predicated on a condition register: it executes anyway, but the result is only stored if the condition was set. No pipeline stall, but some redundant work. Much faster for short branches.
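
The same idea is visible at the C level (a sketch; whether the compiler emits a predicated instruction or a conditional select is up to it): compute both candidate results, and let the condition only choose which one to keep.

#include <stdio.h>

/* Branch version: the CPU must predict which assignment runs. */
int clamp_branch(int x, int limit) {
    if (x > limit)
        x = limit;
    return x;
}

/* Select version: both candidates exist, the condition picks one.
   Compilers often turn this into a conditional move/select, with no
   branch to mispredict, at the cost of always doing the compare. */
int clamp_select(int x, int limit) {
    return (x > limit) ? limit : x;
}

int main(void) {
    printf("%d %d\n", clamp_branch(300, 255), clamp_select(100, 255));
    return 0;
}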

Avoiding Branches: Loop Unrolling What are funroll loops?

for (int i = 0; i < imagesize; i++) {
    pixels[i] *= scale;
}

One branch per pixel. It should be correctly predicted... but it still wastes some time.

Avoiding Branches: Loop Unrolling What are funroll loops?

for (int i = 0; i < imagesize; i++) {
    pixels[i] *= scale;
    pixels[++i] *= scale;
    pixels[++i] *= scale;
    pixels[++i] *= scale;
}

One branch per four pixels (assuming imagesize is a multiple of four). The code is bigger, but should be faster. Don't do this yourself! The compiler will do it for you!

Gotcha: Short-Circuit Evaluation

if (a() || b() || c()) dostuff();

What this really means:

if (a()) dostuff();
else if (b()) dostuff();
else if (c()) dostuff();

One if statement, three branches!

Avoiding Short-Circuiting

int condition = a();
condition = b() | condition;
condition = c() | condition;
if (condition) dostuff();

All of the subexpressions are evaluated.

Questions?