TDT 4260 lecture 11 spring semester Interconnection network continued


1 TDT 4260 lecture 11, spring semester 2013
Lasse Natvig, The CARD group, Dept. of computer & information science, NTNU

2 Lecture overview
- Interconnection network continued
  - Routing
  - Switch microarchitecture
- Dataflow computing
  - Principles
  - MDM in detail
  - Think differently! Innovation
  - Research method
- Administrativia
  - Reading list is now in its final version
  - Next week: Mini project presentations, Room 454 in IT-Building
  - Last lecture 30/4, exam Saturday 25/5 at 0900
    - Wrap up
    - Short presentation of projects/master theses offered by EECS/CARD/Lasse
    - Repetition: send to Lasse before 24/4 to ask for special topics

3 ARM guest lectures Thursday morning
- Part of the course TDT Energieffektive datamaskinsystemer (Energy-efficient computer systems)
- Thursday 18 April, at 08:15 in auditorium F6:
  1) Low power HW design. Guest lecturer: Nir Leshem (Hardware Engineering Manager, ARM) (in English)
  2) Driver development for Linux, driver architecture and debugging. Guest lecturers: Ørjan Eide (Senior Engineer, ARM) and Mikael Valen-Sendstad (Staff Software Architect, ARM) (in Norwegian)

4 F.5: Routing, Arbitration, Switching
- Routing
  - Which of the possible paths are allowable for packets?
  - Set of operations needed to compute a valid path
- Arbitration
  - When are paths available for packets?
  - Resolves packets requesting the same resources at the same time
  - For every arbitration, there is a winner and possibly many losers
  - Losers are buffered (lossless) or dropped on overflow (lossy)
- Switching
  - How are paths allocated to packets?
  - The winning packet (from arbitration) proceeds towards the destination
  - Paths can be established one fragment at a time or in their entirety

5 Routing
- Shared media
  - Broadcast to everyone
- Switched media needs real routing. Options:
  - Source-based routing: message specifies path to the destination (changes of direction)
  - Virtual circuit: circuit established from source to destination, message picks the circuit to follow
  - Destination-based routing: message specifies destination, switch must pick the path
    - Deterministic: always follow the same path
    - Adaptive: pick different paths to avoid congestion, failures
    - Randomized routing: pick between several good paths to balance network load

6 Store & Forward vs Cut-Through Routing
- Figure: packet progress over time from source to destination, store & forward routing vs. cut-through routing
- Cut-through (on blocking)
  - Virtual cut-through (spools rest of packet into buffer)
  - Wormhole (buffers only a few flits, leaves tail along route)
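The difference between the two schemes on slide 6 can be made concrete with a simple latency model. The sketch below is not from the slides: it assumes no contention, a fixed per-hop routing delay, and equal bandwidth on every link, and all function and parameter names are made up for illustration.

```python
def store_and_forward_latency(links, header_bits, payload_bits, bandwidth, hop_delay=0.0):
    """Latency when every hop receives the whole packet before forwarding it.

    Assumed model (not from the slides): no contention, same bandwidth on
    all links, bandwidth given in bits per time unit.
    """
    packet_time = (header_bits + payload_bits) / bandwidth
    return links * (hop_delay + packet_time)


def cut_through_latency(links, header_bits, payload_bits, bandwidth, hop_delay=0.0):
    """Latency when a hop forwards the packet as soon as the header is decoded.

    Only the header pays the per-hop cost; the payload streams behind it
    and is pipelined across the links.
    """
    header_time = header_bits / bandwidth
    return links * (hop_delay + header_time) + payload_bits / bandwidth
```

With 3 links, a 2-bit header, a 10-bit payload and a bandwidth of 1 bit per time unit, the model gives 36 time units for store & forward and 16 for cut-through. The gap grows with the number of hops and with the packet size.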

7 Routing mechanism
- Need to select output port for each input packet
  - And fast
- Simple arithmetic in regular topologies
- Example: x, y routing in a grid with bi-directional links (first x, then y)
  - west (-x): x < 0
  - east (+x): x > 0
  - south (-y): x = 0, y < 0
  - north (+y): x = 0, y > 0
- Unidirectional links are sufficient for a torus (+x, +y)
- Dimension-order routing (DOR)
  - Reduce the relative address one dimension at a time, in a fixed order, to avoid deadlock

8 Deadlock
- How can it arise?
  - Necessary conditions:
    - shared resources
    - incrementally allocated
    - non-preemptible
- How do you handle it?
  - Constrain how channel resources are allocated (deadlock avoidance)
  - Add a mechanism that detects likely deadlocks and fixes them (deadlock recovery)
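The x, y routing rule on slide 7 is simple enough to write out. The following is a minimal sketch, assuming a grid where east is +x and north is +y, and where (x, y) is the relative address, that is destination minus current position. The port name "local" for a packet that has arrived, and the helper `xy_path`, are assumptions added for illustration.

```python
def xy_route_port(dx, dy):
    """Pick the output port from the relative address (dx, dy).

    The x offset is always reduced first; y is only considered once
    dx == 0, which is what makes this dimension-order routing.
    """
    if dx < 0:
        return "west"
    if dx > 0:
        return "east"
    if dy < 0:
        return "south"
    if dy > 0:
        return "north"
    return "local"  # assumed name: packet is at its destination


def xy_path(src, dst):
    """List the nodes visited from src to dst on a grid with bi-directional links."""
    step = {"west": (-1, 0), "east": (1, 0), "south": (0, -1), "north": (0, 1)}
    x, y = src
    path = [(x, y)]
    while (x, y) != tuple(dst):
        port = xy_route_port(dst[0] - x, dst[1] - y)
        x, y = x + step[port][0], y + step[port][1]
        path.append((x, y))
    return path
```

For example, `xy_path((0, 0), (2, 1))` visits (0, 0), (1, 0), (2, 0), (2, 1): the packet never turns from the y dimension back into the x dimension, which is the restriction that removes the cyclic channel dependencies in a grid.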

9 Deadlock example 1
- Red: S1 -> d1
- Green: S2 -> d2
- Blue: S3 -> d3
- Black: S4 -> d4

10 Deadlock example 1, avoided by DOR

11 Deadlock example 2
- Figure: 4 x 4 grid of nodes TRC (0,0) to TRC (3,3), with two X marks
- Deadlock can occur even with DOR if the links are uni-directional
- Can be solved by having two (virtual) channels

12 Arbitration (1/2)
- Several simultaneous requests to a shared resource
- Ideal: maximize usage of network resources
- Problem: starvation
  - Fairness needed
- Figure: two-phase arbitration
  - Request, Grant
  - Poor usage
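Slide 12 states that fairness is needed to prevent starvation but does not name a policy. One common choice is round-robin arbitration, sketched below as an assumption rather than as the scheme used in the lecture. The class and method names are hypothetical.

```python
class RoundRobinArbiter:
    """Grants one of several requesting inputs, rotating priority for fairness."""

    def __init__(self, n_inputs):
        self.n_inputs = n_inputs
        self.next_priority = 0  # input that is checked first in the next round

    def grant(self, requests):
        """Return the winning input index, or None if nobody requests.

        `requests` is a collection of input indices that want the resource.
        After a grant, the winner gets the lowest priority, so no
        requesting input can be starved.
        """
        requesting = set(requests)
        for offset in range(self.n_inputs):
            candidate = (self.next_priority + offset) % self.n_inputs
            if candidate in requesting:
                self.next_priority = (candidate + 1) % self.n_inputs
                return candidate
        return None
```

If inputs 0 and 2 of a four-input arbiter request in every cycle, the grants alternate 0, 2, 0, 2. The loser of each round is buffered and wins the next one.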

13 Arbitration (2/2)
- Three phases
- Multiple requests
- Better usage
- But: increased latency

14 Switching
- Allocating paths for packets
- Two techniques:
  - Circuit switching (connection oriented)
    - Communication channel
    - Allocated before first packet
    - Packet headers don't need routing info
    - Wastes bandwidth
  - Packet switching (connectionless)
    - Each packet handled independently
    - Can't guarantee response time
    - Two types (next slide)

15 Store & Forward vs. Cut-Through Routing
- Figure: packet progress over time from source to destination for store & forward routing and cut-through routing, with the labels Packet switching and Circuit switching
- Cut-through (on blocking)
  - Virtual cut-through (spools rest of packet into buffer)
  - Wormhole (buffers only a few flits, leaves tail along route; only one flit in the figure)

16 Switch micro architecture

17 Pipelined switch

18 SOMETHING ELSE

19 IDI Open, a challenge for you?

20 DATAFLOW COMPUTING AND MDM

21 Dataflow computing and computers
- Dataflow computing
  - suitable for highly parallel solutions
  - requires different HW and SW
- Dataflow computers
  - Principles
  - History
  - Static vs. dynamic
  - Typical architecture: pipelined ring with circulating packets
  - Manchester Dataflow Machine (MDM)

22 Dataflow programs
- Example program:
  - a = (b + 1) x (b - c)
  - d = c x e
  - f = a x d
- Represent computation as a graph
  - Node = operation = instruction
- Figure: dataflow graph with inputs b, c, e and results a, d, f
- Computation flows through the graph
- Inherently parallel, data driven, no program counter, asynchronous
- Logical processor at each node, activated by availability of operands, executed when a physical processor is available
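The example program on slide 22, a = (b + 1) x (b - c), d = c x e, f = a x d, can be run in a data-driven way with a few lines of Python. This is a minimal sketch, not how a dataflow machine is built: the node names `t1` and `t2` for the two intermediate results and the function `run_dataflow` are assumptions, and the "parallel" firing of ready nodes is simulated sequentially.

```python
import operator

# node -> (operation, operands); a string operand names another value,
# anything else is a literal constant. "t1" and "t2" are made-up names
# for the intermediate results b + 1 and b - c.
GRAPH = {
    "t1": (operator.add, ("b", 1)),
    "t2": (operator.sub, ("b", "c")),
    "a": (operator.mul, ("t1", "t2")),
    "d": (operator.mul, ("c", "e")),
    "f": (operator.mul, ("a", "d")),
}


def run_dataflow(graph, inputs):
    """Fire nodes as soon as all their operands are available.

    Returns the final values and the list of "waves": the nodes in one
    wave have no dependencies on each other and could fire in parallel.
    """
    values = dict(inputs)
    pending = dict(graph)
    waves = []
    while pending:
        ready = sorted(
            name for name, (_, operands) in pending.items()
            if all(not isinstance(o, str) or o in values for o in operands)
        )
        if not ready:
            raise ValueError("some operands never become available")
        results = {}
        for name in ready:
            op, operands = pending.pop(name)
            args = [values[o] if isinstance(o, str) else o for o in operands]
            results[name] = op(*args)
        values.update(results)  # results become visible to the next wave
        waves.append(ready)
    return values, waves
```

With b = 4, c = 2, e = 3 the nodes fire in three waves: first `d`, `t1` and `t2` together, then `a`, then `f`, which ends up as 60. No program counter decides the order; only the availability of operands does.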

23 Example data flow

24 Control flow and data flow
- (Traditional) control flow
  - Explicit control flow (manipulation of program counter (PC))
  - Data are communicated between instructions via shared memory locations
  - Data is referenced via memory address
  - One single control thread
  - Many parallel control threads: explicit parallelism
- Data flow computers
  - Data-driven computation, that is, the selection of instructions for execution is controlled by the availability of operands
  - Implicit parallelism
  - Programs represented as directed graphs
  - Results are sent directly as data packets between instructions
  - Normally/originally no shared memory that more than one instruction may refer to, i.e. no side effects

25 Data flow computers, history
- Relatively old topic
- Many research projects
  - Fundamentally different, hence interesting
  - Link to functional languages gave renewed interest
  - Some prototypes built, none with outstanding performance
- Status 1998
  - Few research projects
  - Data flow principle used in many places
    - In processors (reservation stations, Tomasulo, TDT 4255)
    - Chaining of DSP PEs for high performance

26 Dataflow computers related to other architectures (anno 1986)
- Source: Dataflow machine architecture, Arthur H. Veen, ACM Computing Surveys, December 1986, Volume 18, Issue 4

27 Dataflow machines, architecture and implementation (anno 1986)

28 Motivation

29 Static and dynamic data flow systems
- Static systems do not allow concurrent reactivation of code
  - A given part of a data flow graph can only exist in one instance at a time
  - At most one data packet exists on a line at a time
  - Data packets are communicated directly from instruction to instruction
  - Control packets are used as acknowledge signals from receiver to sender, so the sender knows when it can produce a new result
- Dynamic systems
  - Allow concurrent activation, e.g. the same code can be executed at the same time in different contexts
  - What opportunities does this give for program execution?
    - Loop unrolling, unfolding (iteration number)
    - Simultaneous procedure calls
    - Recursive procedures

30 Dynamic dataflow systems - implementation
- How can it be realized?
  - Figure: a + instruction receiving context-1: value = 10 and context-2: value = 33
  - 1) Tag operands with a context identifier: tagged token
  - 2) Copying of code
- Needs the ability to have more than one value "on its way" between two instructions at the same time
- In this case there is not enough storage space in the receiving instruction to store several operands
- Needs a unit/component where one operand can wait for its "fellow operand"
- Makes "logical buffering" possible on each line
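The "unit where one operand can wait for its fellow operand" from slide 30 can be sketched as a small table keyed on destination instruction and tag. This is an illustration under assumptions, not the hardware design: it only handles two-operand instructions, the port names "left" and "right" and the class name are made up, and the table is unbounded.

```python
class MatchingUnit:
    """Pairs up operands that carry the same tag (context identifier)."""

    def __init__(self):
        # (destination instruction, tag) -> (port, value) of the waiting operand
        self.waiting = {}

    def receive(self, dest, tag, port, value):
        """Accept one tagged token.

        Returns None if the token has to wait for its partner, otherwise
        (dest, tag, left_value, right_value), ready for execution.
        """
        key = (dest, tag)
        if key not in self.waiting:
            self.waiting[key] = (port, value)
            return None
        other_port, other_value = self.waiting.pop(key)
        if other_port == port:
            raise ValueError("two tokens for the same port and tag")
        operands = {port: value, other_port: other_value}
        return (dest, tag, operands["left"], operands["right"])
```

Using the values from the slide, the tokens 10 (context-1) and 33 (context-2) can both wait at the same + instruction without being mixed up; each is released only when the other operand with the same tag arrives. This is the "logical buffering" on each line that a static system lacks.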

31 Manchester Dataflow Machine (MDM)
- Source: The Manchester Prototype Dataflow Computer, Gurd, Kirkham, Watson, CACM, Jan. 85, pp
- Data flow machine based on dynamic tagging of (small) data packets (tokens)
- Approx 1-2 MIPS in

32 MDM: Data flow programs
- Three levels
  - SISAL (Fig. 2)
  - Assembler (Fig. 3)
    - variables from SISAL
    - operators from data flow instruction set
  - Machine code (Fig. 1)

33 MDM machine code
- Graphical presentation
- Some special instructions: CGR, DUP, BRR, ADL

34 SISAL (Fig. 2)

35 Template Assembly Language (TASS), fig

36 Execution sequence, fig. 4

37 Execution sequence, fig. 4, cont'd

38 Manchester Dataflow Machine
- Tagged data packet = token
- Tag:
  - Iteration level (for loops)
  - Activation name (for simultaneous procedure calls and recursion)
  - Index (when the same code operates on different parts of a data structure)
- Implementation: Fig. 7 and 8
- Figure labels: Input, Output, Switch, Token Queue, Matching Unit, Instruction Store, Processing Unit (P0...P19)

39 MDM Matching Unit (1/2)

40 MDM Matching Unit (2/2)

41 Instruction Store & Processing Unit

42 MDM: System evaluation
- Test method: load the program, load input tokens into the input queue, start the clock and release the tokens from the queue. Stop the clock when the first token is received at the host
- Execution time = f(#processors, program, input)
- Research goals; build knowledge about:
  - Hardware utilization and bottlenecks
  - Parallelism in software
  - Data flow MIPS vs. "normal MIPS"
- Reduce the number of "variables"
  - Different artificial situations to avoid testing too much at the same time
    - Micro benchmarks, e.g. program with Pby = 1.00
- Test classes
  - 1) small programs -> do not use the overflow unit
  - 2a) moderate degree of overflow
  - 2b) extreme degree of overflow
- Simulator of the computer
- AvePara = T(1)/T(inf)
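The measure AvePara = T(1)/T(inf) on slide 42 can be computed for a small dataflow graph. The sketch below reads T(1) as the execution time on one processor and T(inf) as the time with unlimited processors, and assumes that every instruction takes one time unit; both the unit-time assumption and the function name are additions for illustration. The example graph is the expression a = (b + 1) x (b - c), d = c x e, f = a x d from the dataflow programs slide, with `t1` and `t2` as made-up names for b + 1 and b - c.

```python
from functools import lru_cache


def average_parallelism(deps):
    """Return (T1, Tinf, AvePara) for a graph of unit-time instructions.

    `deps` maps each instruction to the instructions whose results it needs.
    T1 is the total number of instructions, Tinf the length of the longest
    dependency chain (the critical path).
    """
    @lru_cache(maxsize=None)
    def depth(node):
        # An instruction finishes one step after its latest predecessor.
        return 1 + max((depth(p) for p in deps[node]), default=0)

    t1 = len(deps)
    t_inf = max(depth(node) for node in deps)
    return t1, t_inf, t1 / t_inf


# Dependencies of the example expression; inputs b, c, e are not instructions.
EXAMPLE = {
    "t1": (),
    "t2": (),
    "d": (),
    "a": ("t1", "t2"),
    "f": ("a", "d"),
}
```

For the example graph this gives T(1) = 5 and T(inf) = 3, so AvePara = 5/3: on average fewer than two processors can be kept busy, no matter how many are available.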

43 Test programs (Table II)

44 Speedup (Performance)

45 MDM: Problems
- Low efficiency when handling data structures
  - special HW
- Arbitration to/from functional units
  - Easily becomes a bottleneck
- Experiments: processors starve when a large fraction of the tokens do not give a "match"
  - larger buffer on the output port of the matching unit
- Needs better compilers

46 MDM: Retrospective (1992)
- Source: Manchester Data-Flow: A Progress Report, Gurd & Snelling, ICS 92
- MDM history: started in 1976, stopped in 1989
- Included:
  - Structure Store Units
  - Throttle Unit
    - Large programs can generate too much parallel activity, which drowns the system in tokens that cannot be processed for a long time (a kind of thrashing (OS concept))
    - A unit in the ring that assigns unique activation names (a part of the tag field)
    - Using information about the load on different parts of the system, the throttle unit can slow down the assignment of these names so that the total load is reduced
- Experience
  - Microcoded instruction set was a good choice for research/experimentation

47 Lecture plan
- Administrativia
  - Reading list is now in its final version
    - Note that all slides are part of the curriculum
  - Next week: Mini project presentations, Room 454 in IT-Building
  - Last lecture 30/4, exam Saturday 25/5 at 0900
    - Wrap up
    - Short presentation of projects/master theses offered by EECS/CARD/Lasse
    - Short presentation of mini-course TDT1: TDT1 Energy Efficient Multicore Computing
    - Repetition: send to Lasse before 24/4 to ask for special topics