Chip Multiprocessors 1


Branch Prediction Preview 2

Computing the PC for non-branches (the wrong way) For a non-branch instruction, PC = PC + 4. When is the PC ready? 3

Computing the PC for non-branches (the right way) Pre-decode in the fetch unit. PC = PC + 4 The PC is ready for the next fetch cycle. 5

Computing the PC for Branches Branch instructions: bne $s1, $s2, offset if ($s1 != $s2) { PC = PC + offset; } else { PC = PC + 4; } When is the value ready? 6

Solution: Branch Prediction Can a processor tell the future? For non-taken branches, the new PC is ready immediately, so let's just assume the branch is not taken. This is called branch prediction or control speculation. What if we are wrong? Branch prediction vocabulary: Prediction -- a guess about whether a branch will be taken or not taken. Misprediction -- a prediction that turns out to be incorrect. Misprediction rate -- the fraction of predictions that are incorrect. 7

Predict Not-taken We start the add, and then, when we discover the branch outcome, we squash it, also called flushing the pipeline. All the work done for the squashed instructions is wasted! 8

Branch Prediction Preview 9

Pipelining Recap Break up the logic into pipeline stages with pipeline registers (latches). Each stage can act on a different instruction/data. State/control signals of instructions are held in the pipeline registers (latches). [Figure: a 10ns logic path split into five 2ns stages separated by latches.] 10

Pipelining Is Not Free 11

Pipelining Overhead Logic delay (LD) -- how long the logic takes (i.e., the useful part). Set-up time (ST) -- how long before the clock edge the inputs to a register need to be ready. Register delay (RD) -- delay through the internals of the register. CTbase -- cycle time before pipelining: CTbase = LD + ST + RD. CTpipe -- cycle time after pipelining N times: CTpipe = ST + RD + LD/N. Total time through all N stages = N*CTpipe = N*ST + N*RD + LD 12

Pipelining Power Pipelining is a triple threat to power: increased clock rate (P ~ f^3), pipeline register overheads, and the increased cost of control hazards. 13

Limits of OOO Superscalars To keep 14 branches in flight, every one of them must be predicted correctly: @ 80% accuracy, 0.8^14 = 4.4%; @ 90% accuracy, 0.9^14 = 23%; @ 95% accuracy, 0.95^14 = 49%; @ 99% accuracy, 0.99^14 = 87%. 14

Multiprocessors Specifically, shared-memory multiprocessors have been around for a long time. Originally: put several processors in a box and give them access to a single, shared memory. In the old days (before ~2004), MPs were expensive and mildly exotic: big servers, sophisticated users, data-center applications. 15

Chip Multiprocessors (CMPs) Multiple processors on one die. An easy way to spend transistors. Now commonplace: laptops, desktops, game consoles, etc. Less sophisticated users, all kinds of applications. 16


Big Core 18

Small Cores 19

Big Cores vs. Small Cores [Chart: IPC, roughly 0 to 2.5, comparing Small Core IPC and Big Core IPC.] 20

Big Cores vs. Small Cores [Chart: % misses per instruction (D-cache % MPCI), roughly 0 to 9, comparing Small Core and Big Core.] 21

Threads are Hard to Find To exploit CMP parallelism you need multiple processes or multiple threads. Processes: separate programs actually running (not sitting idle) on your computer at the same time. Common on servers, much less common on desktops/laptops. Threads: independent portions of your program that can run in parallel. Most programs are not multi-threaded. We will refer to these collectively as threads. A typical user system might have 1-8 actively running threads; servers can have more if needed (the sysadmins will hopefully configure it that way). 22

Parallel Programming is Hard Difficulties: correctly identifying independent portions of complex programs; sharing data between threads safely; using locks correctly; avoiding deadlock. There do not appear to be good solutions: we have been working on this for 30 years (remember, multiprocessors have been around for a long time), and it remains stubbornly hard. 23

Critical Sections and Locks A critical section is a piece of code that only one thread should be executing at a time. int shared_value = 0; void IncrementSharedVariable() { int t = shared_value + 1; // Line 1 shared_value = t; // Line 2 } If two threads each execute this code once, we would expect shared_value to go up by 2. However, they could both execute Line 1 and then both execute Line 2: both write back the same new value, and one increment is lost. Instructions in the two threads can be interleaved in any way. 24

Critical Sections and Locks By adding a lock, we can ensure that only one thread executes the critical section at a time. int shared_value = 0; lock shared_value_lock; void IncrementSharedVariable() { acquire(shared_value_lock); int t = shared_value + 1; // Line 1 shared_value = t; // Line 2 release(shared_value_lock); } In this case we say shared_value_lock protects shared_value. 25

Locks are Hard The relationship between locks and the data they protect is not explicit in the source code and not enforced by the compiler. In large systems, the programmers typically cannot tell you what the mapping is. As a result, there are many bugs. 26

Locking Bug Example void Swap(int * a, lock * a_lock, int * b, lock * b_lock) { lock(a_lock); lock(b_lock); int t = *a; *a = *b; *b = t; unlock(a_lock); unlock(b_lock); } Thread 1: ... Swap(foo, foo_lock, bar, bar_lock); ... Thread 2: ... Swap(bar, bar_lock, foo, foo_lock); ... Thread 1 locks foo_lock, thread 2 locks bar_lock, and both wait indefinitely for the other lock. Finding, preventing, and fixing this kind of bug are all hard. 27

End of CMPs 28