Thread-Level Parallelism




ILP exploits parallelism within straight-line code or loops. The latency of a cache miss to off-chip cache or main memory is too long to be hidden by ILP alone, so thread-level parallelism (TLP) is used instead.

A thread is a process with its own instructions and data. A thread may be one process of a parallel program made up of multiple processes, or it may be an independent program. Each thread has all the state (instructions, data, PC, register state, and so on) necessary to allow it to execute.

From: Tullsen, Eggers, and Levy, "Simultaneous Multithreading: Maximizing On-chip Parallelism," ISCA 1995.

Multithreading

Multithreading lets multiple threads share the functional units of one processor via overlapping. The processor must duplicate the independent state of each thread: a separate copy of the register file, a separate PC, and, for running independent programs, a separate page table. Memory is shared through the virtual memory mechanisms, which already support multiple processes. Hardware support makes a thread switch much faster than a full process switch, which takes 100s to 1000s of clock cycles.

When to switch?
- Alternate instructions per thread every cycle (fine-grained).
- When a thread is stalled, perhaps on a cache miss, execute another thread (coarse-grained).

Fine-Grained Multithreading

Switches between threads on each instruction, so the execution of multiple threads is interleaved. Switching is usually done in round-robin fashion, skipping any stalled threads, and the CPU must be able to switch threads every clock cycle.
- Advantage: it can hide both short and long stalls, since instructions from other threads execute when one thread stalls.
- Disadvantage: it slows down the execution of an individual thread, since a thread ready to execute without stalls is still delayed by instructions from other threads.
- Used on Sun's T1.
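The round-robin selection that skips stalled threads can be sketched as follows. This is a minimal illustrative model, not real hardware logic; the thread count and `stalled` flags are made up for the example.

```python
def next_thread(current, stalled):
    """Pick the next thread in round-robin order, skipping stalled ones.

    Returns None if every thread is stalled (the core idles this cycle).
    """
    n = len(stalled)
    for step in range(1, n + 1):
        candidate = (current + step) % n
        if not stalled[candidate]:
            return candidate
    return None

# 4 hardware threads; thread 1 is waiting on a cache miss.
print(next_thread(0, [False, True, False, False]))  # 2: thread 1 is skipped
print(next_thread(3, [True, True, True, True]))     # None: all threads stalled
```

The key property the sketch captures is that the selection takes a fixed, small number of steps, which is why the hardware can afford to make this decision every clock cycle.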

Coarse-Grained Multithreading

Switches threads only on costly stalls, such as L2 cache misses.

Advantages:
- No need for very fast thread switching.
- Doesn't slow down a thread, since instructions from other threads are issued only when the thread encounters a costly stall.

Disadvantage: it is hard to overcome throughput losses from shorter stalls, due to pipeline start-up costs. Since the CPU issues instructions from one thread at a time, when a stall occurs the pipeline must be emptied or frozen, and the new thread must fill the pipeline before its instructions can complete. Because of this start-up overhead, coarse-grained multithreading is better at reducing the penalty of high-cost stalls, where pipeline refill time << stall time.

Simultaneous Multithreading (SMT)

Fine-grained multithreading implemented on top of a multiple-issue, dynamically scheduled processor: in a single cycle, multiple instructions from different threads can be issued.
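The refill-versus-stall trade-off can be sketched with a toy cycle-accounting model. The 8-cycle refill cost is an assumed illustrative value (charging it twice, once on the switch away and once on the switch back, is also a modeling assumption); only the idea that switching pays off when refill << stall comes from the slide.

```python
def cycles_saved(stall, refill):
    """Cycles gained by switching threads during a stall of `stall` cycles,
    assuming draining/refilling the pipeline costs `refill` cycles, paid
    once when switching away and once when switching back."""
    return stall - 2 * refill

# A long memory-miss stall is worth switching on...
print(cycles_saved(stall=110, refill=8))  # 94 cycles saved
# ...but a short stall is not: the refill overhead exceeds it.
print(cycles_saved(stall=10, refill=8))   # -6: switching loses cycles
```

This is exactly why coarse-grained multithreading targets L2 misses and longer stalls, and why fine-grained multithreading and SMT are needed to hide short stalls.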

SMT Issue-Slot Diagrams

[Figure: issue slots of an 8-unit processor over 9 cycles, one thread vs. two threads; with two threads, slots left idle by one thread are filled by the other.]

[Figure: issue-slot usage over time (processor cycles) compared across superscalar, fine-grained, coarse-grained, multiprocessing, and simultaneous multithreading, for threads 1 through 5 with idle slots marked; only SMT mixes instructions from different threads within the same cycle.]
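The issue-slot diagrams can be mimicked with a small slot-counting model. The issue width, thread traces, and per-cycle ready-instruction counts below are invented for illustration; the point is how each scheme fills (or wastes) slots.

```python
WIDTH = 4  # issue slots per cycle (assumed width for this sketch)

# Ready instructions per cycle for two hypothetical threads (0 = stall cycle).
t0 = [4, 0, 0, 2, 4, 0, 1, 3]
t1 = [1, 3, 4, 0, 2, 4, 3, 0]

def superscalar(trace):
    # One thread owns the machine; issue up to WIDTH ready instructions.
    return sum(min(r, WIDTH) for r in trace)

def fine_grained(traces):
    # Round-robin: exactly one thread owns all the slots each cycle.
    total = 0
    for cycle in range(len(traces[0])):
        t = traces[cycle % len(traces)]
        total += min(t[cycle], WIDTH)
    return total

def smt(traces):
    # All threads share the slots within the same cycle.
    total = 0
    for cycle in range(len(traces[0])):
        ready = sum(t[cycle] for t in traces)
        total += min(ready, WIDTH)
    return total

print(superscalar(t0))         # 14 of 32 slots used: vertical + horizontal waste
print(fine_grained([t0, t1]))  # 16: some vertical waste hidden, horizontal remains
print(smt([t0, t1]))           # 28: both kinds of waste attacked
```

In the diagrams' terms, fine-grained multithreading removes vertical waste (whole idle cycles) but not horizontal waste (unused slots within a cycle); SMT attacks both.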

Sun T1

- Focused on TLP rather than ILP.
- Fine-grained multithreading.
- 8 cores, 4 threads per core, one shared FP unit.
- 6-stage pipeline (similar to MIPS, with one stage for thread switching).
- L1 caches: 16 KB instruction, 8 KB data, 64-byte block size; misses to L2 take 23 cycles with no contention.
- L2 caches: 4 separate L2 caches of 750 KB each; misses to main memory take 110 cycles assuming no contention.

[Figure: relative change in miss rate and miss latency (L1 I, L1 D, L2) when executing one thread per core vs. four threads per core on TPC-C; vertical axis from 0 to 1.8.]
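A rough back-of-the-envelope model shows why 4 threads per core help hide these latencies. The latencies come from the slide; the one-issue-slot-per-cycle round-robin model and the ceiling-division accounting are simplifying assumptions.

```python
THREADS = 4       # hardware threads per T1 core
L2_LATENCY = 23   # cycles: L1 miss serviced by L2, no contention
MEM_LATENCY = 110 # cycles: L2 miss to main memory, no contention

def slots_forfeited(stall_cycles, threads=THREADS):
    """Issue slots a stalled thread gives up during a miss.

    With round-robin issue, each thread gets one slot every `threads`
    cycles, so a stall spans roughly ceil(stall / threads) of its slots,
    all of which other ready threads can take instead.
    """
    return -(-stall_cycles // threads)  # ceiling division

print(slots_forfeited(L2_LATENCY))   # 6: an L1 miss costs the thread ~6 turns
print(slots_forfeited(MEM_LATENCY))  # 28: a memory miss costs ~28 turns
```

As long as the other three threads are ready, the core keeps issuing every cycle, which is why the T1 trades single-thread ILP for thread count.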

Sun T1: Thread Status

Breakdown of the status of an average thread: executing, ready (the thread is ready but another one is chosen), or not ready. The core stalls only if all 4 threads are not ready.

[Figure: stacked-bar chart of executing / ready / not-ready fractions for TPC-C, SPECJBB00, and SPECWeb99.]

Breakdown of the causes of a thread being not ready: L1 I miss, L1 D miss, L2 miss, pipeline delay, or other.

[Figure: stacked-bar chart of not-ready causes for TPC-C, SPECJBB00, and SPECWeb99.]

The ARM Cortex-A8

A dual-issue processor with a 13-stage pipeline and a five-stage instruction decode.

ARM Cortex-A8 Execution Pipeline

[Figure: the execution pipeline stages of the Cortex-A8.]