Computer Architecture: Dataflow (Part I) Prof. Onur Mutlu Carnegie Mellon University


A Note on This Lecture

These slides are from 18-742 Fall 2012, Parallel Computer Architecture, Lecture 22: Dataflow I
Video: http://www.youtube.com/watch?v=d2uue7izu2c&list=pl5phm2jkkxmh4cdkc3s1vbb7-njlgig5d&index=19

Some Required Dataflow Readings

Dataflow at the ISA level
- Dennis and Misunas, "A Preliminary Architecture for a Basic Data Flow Processor," ISCA 1974.
- Arvind and Nikhil, "Executing a Program on the MIT Tagged-Token Dataflow Architecture," IEEE TC 1990.

Restricted Dataflow
- Patt et al., "HPS, a new microarchitecture: rationale and introduction," MICRO 1985.
- Patt et al., "Critical issues regarding HPS, a high performance microarchitecture," MICRO 1985.

Other Related Recommended Readings

Dataflow
- Gurd et al., "The Manchester prototype dataflow computer," CACM 1985.
- Lee and Hurson, "Dataflow Architectures and Multithreading," IEEE Computer 1994.

Restricted Dataflow
- Sankaralingam et al., "Exploiting ILP, TLP and DLP with the Polymorphous TRIPS Architecture," ISCA 2003.
- Burger et al., "Scaling to the End of Silicon with EDGE Architectures," IEEE Computer 2004.

Today

- Start Dataflow

Data Flow

Readings: Data Flow (I)

- Dennis and Misunas, "A Preliminary Architecture for a Basic Data Flow Processor," ISCA 1974.
- Treleaven et al., "Data-Driven and Demand-Driven Computer Architecture," ACM Computing Surveys 1982.
- Veen, "Dataflow Machine Architecture," ACM Computing Surveys 1986.
- Gurd et al., "The Manchester prototype dataflow computer," CACM 1985.
- Arvind and Nikhil, "Executing a Program on the MIT Tagged-Token Dataflow Architecture," IEEE TC 1990.
- Patt et al., "HPS, a new microarchitecture: rationale and introduction," MICRO 1985.
- Lee and Hurson, "Dataflow Architectures and Multithreading," IEEE Computer 1994.

Readings: Data Flow (II)

- Sankaralingam et al., "Exploiting ILP, TLP and DLP with the Polymorphous TRIPS Architecture," ISCA 2003.
- Burger et al., "Scaling to the End of Silicon with EDGE Architectures," IEEE Computer 2004.

Data Flow

The models we have examined in 447/740 all assumed
- Instructions are fetched and retired in sequential, control-flow order

This is part of the von Neumann model of computation
- Single program counter
- Sequential execution
- Control flow determines fetch, execution, commit order

What about out-of-order execution?
- Architecture level: obeys the control-flow model
- Microarchitecture level: a window of instructions is executed in data-flow order → execute an instruction when its operands become available

Data Flow

In a data flow machine, a program consists of data flow nodes
A data flow node fires (is fetched and executed) when all its inputs are ready, i.e., when all inputs have tokens

Data flow node and its ISA representation
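The firing rule can be sketched in a few lines of Python. This is an illustrative model, not code from the lecture: a node keeps one slot per input arc and executes only when every slot holds a token.

```python
# Illustrative firing rule (the class and method names are mine):
# a node holds one slot per input arc and fires only when every slot
# has a token.
class Node:
    def __init__(self, op, num_inputs):
        self.op = op
        self.inputs = [None] * num_inputs   # one token slot per input arc

    def receive(self, port, token):
        """Place a token on an input arc; fire if every arc now holds one."""
        self.inputs[port] = token
        if all(t is not None for t in self.inputs):
            result = self.op(*self.inputs)
            self.inputs = [None] * len(self.inputs)  # tokens are consumed
            return result
        return None   # not all inputs ready: the node stays dormant

add = Node(lambda a, b: a + b, 2)
assert add.receive(0, 3) is None   # only one operand present: no firing
assert add.receive(1, 4) == 7      # second token arrives, node fires
```

Note that there is no program counter anywhere: the arrival of the second token is the only thing that triggers execution.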

Data Flow Nodes

Data Flow Nodes (II)

A small set of dataflow operators can be used to define a general programming language
- Fork
- Primitive ops (e.g., +)
- Switch
- Merge

(Figure: operator symbols; switch and merge have T/F control ports)
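As an illustration (not from the slides), switch and merge can be modeled as small functions that route tokens on T/F ports; conditionals and loops are built from exactly these operators.

```python
# Hypothetical models of the switch and merge operators (function names
# are mine). Switch routes a data token to its T or F output arc under a
# boolean control token; merge forwards whichever input token arrives.
def switch(control, value):
    """Route the data token to the T or F output arc selected by control."""
    return ('T', value) if control else ('F', value)

def merge(token_t=None, token_f=None):
    """Forward whichever input token is present."""
    return token_t if token_t is not None else token_f

# "if c then use a else use b" built from switch + merge:
port, v = switch(control=True, value=5)
assert (port, v) == ('T', 5)
assert merge(token_t=v) == 5
assert switch(control=False, value=9) == ('F', 9)
```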

Dataflow Graphs

{x = a + b; y = b * 7 in (x - y) * (x + y)}

Values in dataflow graphs are represented as tokens
- token = <ip, p, v>: instruction pointer, port (e.g., ip = 3, p = L), data value

An operator executes when all its input tokens are present; copies of the result token are distributed to the destination operators

No separate control flow

(Figure: dataflow graph with nodes 1: a + b, 2: b * 7, 3: x - y, 4: x + y, 5: *)
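The example graph can be simulated with a small token-driven interpreter. The interpreter below is a hypothetical sketch (node numbering follows the graph; the table format and `run` function are mine): tokens are <ip, port, value> triples, and an instruction fires when all its operand slots fill.

```python
import operator

# Instruction store for the graph above: ip -> (arity, operation, destinations).
# Destinations receive copies of the result token, as described on the slide.
program = {
    1: (2, operator.add, [(3, 'L'), (4, 'L')]),     # x = a + b
    2: (1, lambda b: b * 7, [(3, 'R'), (4, 'R')]),  # y = b * 7
    3: (2, operator.sub, [(5, 'L')]),               # x - y
    4: (2, operator.add, [(5, 'R')]),               # x + y
    5: (2, operator.mul, []),                       # (x - y) * (x + y)
}

def run(a, b):
    # Initial tokens <ip, port, value>; firing order is purely data-driven.
    tokens = [(1, 'L', a), (1, 'R', b), (2, 'L', b)]
    waiting = {}   # ip -> {port: value}: operand slots filled so far
    result = None
    while tokens:
        ip, port, value = tokens.pop()
        slots = waiting.setdefault(ip, {})
        slots[port] = value
        arity, op, dests = program[ip]
        if len(slots) == arity:                       # all inputs present: fire
            args = [slots[p] for p in sorted(slots)]  # 'L' before 'R'
            out = op(*args)
            del waiting[ip]
            if not dests:
                result = out                          # node 5 has no successors
            for dest_ip, dest_port in dests:
                tokens.append((dest_ip, dest_port, out))
    return result

assert run(1, 1) == (2 - 7) * (2 + 7)    # x = 2, y = 7
assert run(5, 3) == (8 - 21) * (8 + 21)  # x = 8, y = 21
```

Whatever order tokens are picked in, the answer is the same: only the data dependencies constrain execution.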

Example Data Flow Program

(Figure: example program graph with an OUT node)

Control Flow vs. Data Flow

Static Dataflow

Allows only one instance of a node to be enabled for firing
A dataflow node is fired only when all of the tokens are available on its input arcs and no tokens exist on any of its output arcs

Dennis and Misunas, "A Preliminary Architecture for a Basic Data Flow Processor," ISCA 1974.
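The static firing rule is easy to state as code. This is an illustrative sketch (classes and names are mine): fire only when every input arc holds a token AND every output arc is empty, since each arc has storage for exactly one token.

```python
# Illustrative static firing rule: a node may fire only when all input
# arcs hold tokens and no output arc still holds one (one token of
# storage per arc).
class Arc:
    def __init__(self):
        self.token = None   # static dataflow: storage for exactly one token

class StaticNode:
    def __init__(self, op, inputs, outputs):
        self.op, self.inputs, self.outputs = op, inputs, outputs

    def can_fire(self):
        return (all(arc.token is not None for arc in self.inputs)
                and all(arc.token is None for arc in self.outputs))

    def fire(self):
        result = self.op(*[arc.token for arc in self.inputs])
        for arc in self.inputs:
            arc.token = None       # consume the input tokens
        for arc in self.outputs:
            arc.token = result     # place the result token

a, b, out = Arc(), Arc(), Arc()
add = StaticNode(lambda x, y: x + y, [a, b], [out])
a.token = 2
assert not add.can_fire()   # input arc b is still empty
b.token = 3
out.token = 99              # a token left on the output arc blocks firing
assert not add.can_fire()
out.token = None
assert add.can_fire()
add.fire()
assert out.token == 5
```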

Static Dataflow Machine: Instruction Templates

(Figure: instruction templates for the example graph. 1: + with destinations 3L, 4L; 2: * with destinations 3R, 4R; 3: - with destination 5L; 4: + with destination 5R; 5: * with destination out)

Presence bits
- Each arc in the graph has an operand slot in the program

Static Dataflow Machine (Dennis+, ISCA 1974)

(Figure: instruction templates <op, dest1, dest2, p1, src1, p2, src2> feeding a pool of functional units; result tokens <s1, p1, v1>, <s2, p2, v2> are sent back to the templates)

Many such processors can be connected together
Programs can be statically divided among the processors

Static versus Dynamic Dataflow Machines

Static Data Flow Machines

Mismatch between the model and the implementation
- The model requires unbounded FIFO token queues per arc, but the architecture provides storage for one token per arc
- The architecture does not ensure FIFO order in the reuse of an operand slot

The static model does not support
- Reentrant code
- Function calls
- Loops
- Data structures

Problems with Re-entrancy

- Assume this was in a loop
- Or in a function
- And operations took variable time to execute

How do you ensure the tokens that match are of the same invocation?

Dynamic Dataflow Architectures

Allocate instruction templates, i.e., a frame, dynamically to support each loop iteration and procedure call
- Termination detection needed to deallocate frames

The code can be shared if we separate the code and the operand storage

a token: <fp, ip, port, data>
- fp: frame pointer; ip: instruction pointer

A Frame in Dynamic Dataflow

(Figure: shared program (1: +, destinations 3L, 4L; 2: *7, destinations 3R, 4R; 3: -, destination 5L; 4: +, destination 5R; 5: *, destination out) plus a per-invocation frame with one operand slot per operator; tokens are <fp, ip, p, v>)

Need to provide storage for only one operand/operator
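The effect of tagging tokens with a frame pointer can be sketched as follows (an illustrative model; the `code`/`frames` tables and `receive` function are mine): operands are matched on (fp, ip), so tokens from different invocations of the same shared code never match each other.

```python
# Illustrative tagged-token scheme: the code is shared by all invocations,
# while each invocation gets its own frame. A token carries
# <fp, ip, port, value>, so operands are matched on (fp, ip).
code = {  # ip -> (arity, operation); shared by every invocation
    1: (2, lambda x, y: x + y),
}

frames = {}  # (fp, ip) -> {port: value}: per-invocation operand storage

def receive(fp, ip, port, value):
    """Deliver a token; fire when the (fp, ip) operand slots are all full."""
    slots = frames.setdefault((fp, ip), {})
    slots[port] = value
    arity, op = code[ip]
    if len(slots) == arity:
        del frames[(fp, ip)]  # operand slot freed (termination detection elided)
        return op(*[slots[p] for p in sorted(slots)])
    return None

# Two invocations (frames 0 and 1) can be in flight at the same time:
assert receive(0, 1, 'L', 10) is None
assert receive(1, 1, 'L', 100) is None    # does not match frame 0's token
assert receive(1, 1, 'R', 1) == 101
assert receive(0, 1, 'R', 2) == 12
```

With only <ip, port> tags, the 'L' tokens of the two invocations would have been indistinguishable; the frame pointer is what makes reentrant code safe.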

Monsoon Processor (ISCA 1990)

(Figure: pipeline of instruction fetch (code at ip), operand fetch (frame at fp+r), token queue, ALU, form token, and network stages)

Concept of Tagging

Each invocation receives a separate tag

Procedure Linkage Operators

(Figure: at the call site, "get frame" and "extract tag" operators allocate a context, and "change tag" operators retag the argument tokens a1 ... an so they enter the graph for f in the new frame; result tokens are retagged back to the caller's frame)

Like standard call/return, but caller and callee can be active simultaneously

Function Calls

Need an extra mechanism to direct the output token of the function to the proper calling site
- Usually done by sending a special token containing the return node address

Loops and Function Calls Summary

Control of Parallelism

Problem: many loop iterations can be present in the machine at any given time
- 100K iterations on a 256-processor machine can swamp the machine (thrashing in token matching units)
- Not enough bits to represent frame id

Solution: throttle loops. Control how many loop iterations can be in the machine at the same time.
- Requires changes to the loop dataflow graph to inhibit token generation when the number of iterations is greater than N
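The throttling idea can be sketched abstractly. In the sketch below (the function, the bound N, the work function, and the completion order are stand-ins, not details from the slides), at most N iterations have tokens "in the machine" at once, and a new iteration is injected only when one completes.

```python
from collections import deque

# Illustrative throttled loop: at most N iterations are live at once;
# token generation for new iterations is inhibited until an older
# iteration finishes.
def bounded_loop(total_iters, N, work):
    pending = deque(range(total_iters))  # iterations whose tokens are inhibited
    in_flight = set()
    results = {}
    while pending or in_flight:
        # Inhibit token generation once N iterations are live.
        while pending and len(in_flight) < N:
            in_flight.add(pending.popleft())
        assert len(in_flight) <= N
        done = min(in_flight)            # stand-in for "an iteration finishes"
        in_flight.remove(done)
        results[done] = work(done)
    return results

assert bounded_loop(8, 3, lambda i: i * i) == {i: i * i for i in range(8)}
```

The result is unchanged; only the peak number of concurrent iterations (and hence the pressure on the matching store and frame-id space) is bounded.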

Data Structures

Dataflow by nature has write-once semantics
- Each arc (token) represents a data value
- An arc (token) gets transformed by a dataflow node into a new arc (token) → no persistent state

Data structures as we know them (in imperative languages) are structures with persistent state

Why do we want persistent state?
- More natural representation for some tasks (bank accounts, databases, ...)
- To exploit locality

Data Structures in Dataflow

Data structures reside in a structure store
- Tokens carry pointers

I-structures: write-once, read multiple times
- or: allocate, write, read, ..., read, deallocate
- No problem if a reader arrives before the writer at the memory location

(Figure: processors send I-fetch and I-store requests for location a to the structure memory)
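A single I-structure cell can be sketched as follows (an illustrative model; the class and method names are mine): a read that arrives before the write is deferred, with no busy waiting, and is serviced when the value is stored.

```python
# Illustrative I-structure cell: write-once storage where reads that
# arrive before the write are queued and serviced when the value appears.
class IStructureCell:
    def __init__(self):
        self.full = False
        self.value = None
        self.deferred = []   # readers queued while the cell is EMPTY

    def i_fetch(self, consumer):
        if self.full:
            consumer(self.value)            # value present: answer at once
        else:
            self.deferred.append(consumer)  # no busy waiting: queue the read

    def i_store(self, value):
        assert not self.full, "I-structures are write-once"
        self.full, self.value = True, value
        for consumer in self.deferred:      # discharge the pending reads
            consumer(value)
        self.deferred.clear()

got = []
cell = IStructureCell()
cell.i_fetch(got.append)   # reader arrives before the writer: deferred
assert got == []
cell.i_store(42)
assert got == [42]         # the deferred read is serviced by the store
cell.i_fetch(got.append)   # later reads see the value immediately
assert got == [42, 42]
```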

I-Structures

Dynamic Data Structures

Write-multiple-times data structures
- How can you support them in a dataflow machine?
- Can you implement a linked list?
- What are the ordering semantics for writes and reads?

Imperative vs. functional languages
- Side effects and mutable state vs. no side effects and no mutable state

MIT Tagged Token Data Flow Architecture

Resource Manager Nodes
- Responsible for function allocation (allocation of context/frame identifiers), heap allocation, etc.

MIT Tagged Token Data Flow Architecture

Wait-Match Unit: tries to match an incoming token against a waiting token with the same tag (context id) and instruction address
- Success: both tokens are forwarded
- Fail: incoming token → waiting token memory; a bubble (noop) is forwarded
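The match-or-wait step can be sketched in a few lines (an illustrative model; the dictionary-based store and function name are mine, standing in for the associative waiting-token memory):

```python
# Illustrative wait-match step: an incoming token either finds its partner
# (same tag and instruction address) already waiting, in which case both
# are forwarded to execution, or it is stored in the waiting-token memory.
waiting = {}  # (tag, ip) -> value of the token waiting for its partner

def wait_match(tag, ip, value):
    key = (tag, ip)
    if key in waiting:               # success: both tokens forwarded
        partner = waiting.pop(key)
        return (partner, value)
    waiting[key] = value             # fail: store the token; a bubble (noop)
    return None                      # proceeds down the pipeline

assert wait_match(tag=7, ip=3, value=10) is None   # first token waits
assert wait_match(tag=8, ip=3, value=99) is None   # different tag: no match
assert wait_match(tag=7, ip=3, value=20) == (10, 20)
```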

TTDA Data Flow Example

TTDA Synchronization

Heap memory locations have FULL/EMPTY bits
- When an "I-Fetch" request arrives (instead of a plain "Fetch") and the heap location is EMPTY, the heap memory module queues the request at that location
- Later, when the "I-Store" arrives, pending requests are discharged
- No busy waiting
- No extra messages

Manchester Data Flow Machine

Matching Store: pairs together tokens destined for the same instruction
- Large data set → overflow in overflow unit

Paired tokens fetch the appropriate instruction from the node store

Data Flow Summary

Availability of data determines order of execution
- A data flow node fires when its sources are ready
Programs are represented as data flow graphs (of nodes)

Data flow at the ISA level has not been (as) successful
Data flow implementations under the hood (while preserving sequential ISA semantics) have been successful
- Out-of-order execution
- Hwu and Patt, "HPSm, a high performance restricted data flow architecture having minimal functionality," ISCA 1986.

Data Flow Characteristics

Data-driven execution of instruction-level graphical code
- Nodes are operators
- Arcs are data (I/O)
- As opposed to control-driven execution

Only real dependencies constrain processing

No sequential I-stream
- No program counter

Operations execute asynchronously
- Execution triggered by the presence of data

Single-assignment languages and functional programming
- E.g., SISAL in the Manchester Data Flow Computer
- No mutable state

Data Flow Advantages/Disadvantages

Advantages
- Very good at exploiting irregular parallelism
- Only real dependencies constrain processing

Disadvantages
- Debugging is difficult (no precise state)
- Interrupt/exception handling is difficult (what is the precise state semantics?)
- Implementing dynamic data structures is difficult in pure data flow models
- Too much parallelism? (parallelism control needed)
- High bookkeeping overhead (tag matching, data storage)
- Instruction cycle is inefficient (delay between dependent instructions); memory locality is not exploited

Combining Data Flow and Control Flow

Can we get the best of both worlds?

Two possibilities
- Model 1: keep control flow at the ISA level, do dataflow underneath, preserving sequential semantics
- Model 2: keep the dataflow model, but incorporate control flow at the ISA level to improve efficiency, exploit locality, and ease resource management
  - Incorporate threads into dataflow: statically ordered instructions; when the first instruction is fired, the remaining instructions execute without interruption