Why the Network Matters

1 Week 2, Lecture 2 Copyright 2009 by W. Feng. Based on material from Matthew Sottile.

2 So Far: Overview of Multicore Systems. Why Memory Matters: memory architectures; emerging chip multiprocessors (CMPs); an increasing number of cores on a chip; cache coherency (the shopping-list model); memory performance and network performance. Today: The Network. Moving data around to support the shopping-list model. How to connect processors to memory, and the impact of that choice on performance and applications. (Much of today's material is derived from Culler and Singh.)

3 Data Mobility It's actually all about the data, data, data. No matter how fast the functional units of a system are, the performance bottleneck has always been (and will continue to be) moving data around. Challenge: How to feed the functional units efficiently? How to lay out and track data, and get it quickly from A to B?

4 Granularity Increasing our focus in granularity: functional unit pipelines; single and multicore cache hierarchies; coherence to manage nondeterminism between tightly coupled cores. And now, interconnection networks: needed where bus snooping protocols cannot practically be sustained in hardware, and used to interconnect small multiprocessors to form arbitrarily large supercomputers.

5 Interconnection Networks The network that connects processing elements together. Broad applicability: infrastructure in shared- and distributed-memory systems that ties processors to memories and to each other. Examples: (1) A distributed-memory system with potentially large message sizes, e.g., SGI Altix. (2) Massively parallel collections of small processors that communicate in small amounts but frequently, e.g., GPGPU :-) Bus snooping protocols cannot practically be sustained in hardware, so interconnection networks are used to connect small multiprocessors into arbitrarily large supercomputers.

6 Interconnection Networks for Multicore? On-chip: how cores are linked together. Off-chip: how CMPs are connected to motherboard buses. Recall the bidirectional circular EIB interconnect on the Cell. Is it an on-chip or off-chip interconnect?

7 Design Factors Economic factors for the actual hardware. Performance: peak versus sustained (actual, practical). Other: routing and switching characteristics.

8 Design Dimensions Topology: the physical interconnection structure of the network. Routing algorithm: the method for choosing which route messages take through the network graph from source to destination. Switching strategy: how the data in a message traverse the route. Flow control: determination of when a message (or portions thereof) moves along its route.

9 Terminology Channel: a link between two nodes on the network, including buffers to hold data. Bandwidth: b = wf, where w is the channel width and f is the signaling rate (f = 1/T for a cycle time of T). Degree: connectivity of a node (# channels to/from a node). Route: a path through the network graph. Diameter: length of the maximum shortest path between any two nodes. Routing distance: number of links traversed en route between two nodes. Average distance: average routing distance over all pairs of nodes.

10 Bandwidth Raw bandwidth is b = wf, where w = width and f = frequency. Effective bandwidth is reduced by the overhead n_E for encapsulating a packet of size n. If a switch delays routing decisions by d, the bandwidth degrades further.
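The effect of overhead on bandwidth can be made concrete with a small calculation. The formula for effective bandwidth did not survive in the transcription, so the sketch below assumes a common textbook model: a packet with n payload bits and n_E envelope bits occupies the channel for n + n_E bit-times, and a switch delay of d cycles wastes a further w * d bit-times. The function and parameter names are illustrative, not taken from the slides.

```python
def raw_bandwidth(w, f):
    """Raw channel bandwidth b = w * f (bits/s for w bits per cycle at f Hz)."""
    return w * f


def effective_bandwidth(w, f, n, n_e, d=0):
    """Payload bandwidth under an assumed overhead model.

    n   : payload size in bits
    n_e : envelope (header/trailer) size in bits
    d   : switch routing delay in cycles (assumed unit)

    Assumption: each packet occupies the channel for n + n_e + w*d
    bit-times but delivers only n payload bits.
    """
    return raw_bandwidth(w, f) * n / (n + n_e + w * d)
```

With a 16-bit channel at 1 GHz, an envelope as large as the payload halves the usable bandwidth under this model, which is why small messages are disproportionately expensive.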

11 Bisection Bandwidth Multiple nodes on an interconnect send messages at the same time. How to measure? Bisection Bandwidth: Sum of the bandwidths of the minimum set of channels that, if removed, partition the network into two equal unconnected sets of nodes. Value? If all nodes communicate in a uniform pattern, half the messages will be expected to cross the bisection in each direction.

12 Routing A request between two processors must be routed in some way, preferably in an optimal manner that minimizes hops. Desirable properties: simplicity (low complexity, low overhead, ease of establishing correctness, e.g., deadlock freedom) and minimal latency in the presence of large message sizes.

13 Routing Strategies Store-and-forward: a method typically used in LANs and WANs. Data is sent in packets that are received in their entirety at each switch before being forwarded.

14 Routing Strategies Cut-through routing: A method that reduces latency for packets to traverse a path. Think of it as network pipelining.

15 Store-and-Forward vs. Cut-Through Routing Store-and-forward makes the routing decision only after all phits of a packet have been received. Cut-through routing makes the routing decision immediately upon receiving the first physical unit (phit) of the packet, and all subsequent phits cut through along this route. Train analogy: What train scenario looks like store-and-forward? What train scenario looks like cut-through?

16 Train Analogy Store-and-forward routing: a train at a station (as a connection to the next station). The entire train must stop before moving on. Cut-through routing: a train encountering a railroad switch. The first car makes the decision as to which direction to take, and all the others simply follow along.

17 Routing Strategies: Analysis What's the big deal? Latency. Let h be the routing distance, b the bandwidth, n the size of the message, and d the delay at each switch. How can store-and-forward be made to look more like cut-through, thus reaping some of the benefits of pipelining?
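The latency formulas themselves are missing from the transcription. A minimal sketch, assuming the usual first-order models and no contention: store-and-forward pays the full transfer time n/b plus the switch delay d at every one of the h hops, while cut-through pays n/b once and d per hop. The third function is one possible answer to the closing question: split the message into k equal packets so that store-and-forward hops overlap like pipeline stages (per-packet header overhead is ignored here, which is an assumption).

```python
def store_and_forward_latency(h, b, n, d):
    """Each of the h hops receives the whole message before forwarding it."""
    return h * (n / b + d)


def cut_through_latency(h, b, n, d):
    """The header pays d at each hop; the body streams behind it once."""
    return n / b + h * d


def packetized_store_and_forward_latency(h, b, n, d, k):
    """Store-and-forward with the message split into k equal packets.

    Assumed pipeline model: every stage takes n/(k*b) + d, the first
    packet needs h stages, and the other k - 1 follow one stage apart.
    """
    stage = n / (k * b) + d
    return (h + k - 1) * stage
```

For h = 4, b = 1, n = 100, d = 1 the three models give 404, 104, and (with k = 10) 143 time units: packetizing recovers most of the pipelining benefit without changing the switches.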

18 Anatomy of a Switch

19 The Crossbar Provides the internal switching structure for the switch. Non-blocking crossbar. Pro: guarantees a path between each distinct input and output simultaneously, in any permutation. Con: costs go up quadratically. What is the cost of a full NxN crossbar, where N = # inputs = # outputs? What is the anatomy of a fully connected NxN crossbar? A collection of multiplexers that forms a crossbar.

20 The Crossbar Provides the internal switching structure for the switch. Blocking crossbar: its pros and cons complement those of the non-blocking crossbar. The degenerate crossbar is a bus. Cost of a bus-based NxN crossbar? Multistage interconnection network (MIN): what does it look like? Cost of an NxN MIN? More on this coming up next in Topology.
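The cost questions on the two crossbar slides can be explored numerically. The sketch below uses standard component counts that are not stated on the slides and should be read as assumptions: a full NxN crossbar needs N * N crosspoints, and a multistage network built from 2x2 switches needs log_2 N stages of N/2 switches each. The function names are hypothetical.

```python
def crossbar_crosspoints(n):
    """Crosspoints in a full, non-blocking n x n crossbar."""
    return n * n


def min_2x2_switches(n):
    """2x2 switches in an n-input multistage network (n a power of two)."""
    stages = n.bit_length() - 1          # log2(n) when n is a power of two
    if n < 2 or (1 << stages) != n:
        raise ValueError("n must be a power of two and at least 2")
    return (n // 2) * stages
```

For N = 64 this gives 4096 crosspoints against 192 small switches, which illustrates the trade: the multistage network is far cheaper but can block on some permutations.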

21 Topology It is often infeasible to connect every processing element (PE) to every other. Examples: Macroscale, e.g., cluster supercomputers with PE counts of O(1,000) to O(10,000): functionally possible but very, very expensive (as much as half the price of a supercomputer). Microscale, e.g., emerging chip multiprocessors like Cell and GPGPU: a larger interconnect means more real estate is required for non-compute entities. Solution: be smart about how to connect PEs together. This connection pattern is the topology.

22 Simple Topology One-dimensional topologies. Chain: order all P processors in a line, numbered 1 ... P, and connect each processor i to processors i-1 and i+1. Sending a message from P1 to P4 must traverse 3 links. Best case? Average case? Worst case? Torus (or ring): instead of letting the ends dangle, connect the first to the last to form a ring. Best case? Average case? Worst case?

23 Simple Topology (continued) Which of these one-dimensional topologies does the Cell use?
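The best, average, and worst case questions for the chain and the ring can be checked by brute force. This is a small sketch with hypothetical helper names; processors are numbered 1 ... P as on the slide, and the average is taken over all unordered pairs of distinct processors (one reasonable reading of "average case").

```python
from itertools import combinations


def chain_distance(i, j):
    """Links traversed between processors i and j on a chain."""
    return abs(i - j)


def ring_distance(i, j, p):
    """Links traversed on a ring of p processors, going the shorter way."""
    d = abs(i - j) % p
    return min(d, p - d)


def worst_and_average(distance, p):
    """Return (diameter, average distance) over all pairs of 1..p."""
    dists = [distance(i, j) for i, j in combinations(range(1, p + 1), 2)]
    return max(dists), sum(dists) / len(dists)
```

For example, `worst_and_average(chain_distance, 8)` and `worst_and_average(lambda i, j: ring_distance(i, j, 8), 8)` show that closing the chain into a ring roughly halves the worst case.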

24 The Effect of Adding Dimensions Increase to two dimensions, i.e., go from a 1-D chain to a 2-D grid. Each side (or dimension) will have how many processors? What about a k-dimensional grid? For 2-D, connect each processor to its neighbors: up to 4 connections per processor. Boundaries can be wired to form a 2-D torus.

25 2-D Mesh and Torus

26 Higher-Dimensional Meshes and Tori Keep playing this trick of embedding processors into grids of increasing dimensionality. Key observation: each time the dimension is increased, the # of point-to-point connections for each processor increases. Generalization: the # of point-to-point connections per node within a k-dimensional grid is?
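One way to explore the generalization question is to enumerate neighbors directly. The sketch below assumes a k-dimensional torus with the same number of nodes (`side`) along every dimension and wraparound links; the function name and coordinate representation are illustrative choices, not from the slides.

```python
def torus_neighbors(coord, side):
    """Neighbors of `coord` in a k-dimensional torus with `side` nodes per axis.

    `coord` is a tuple of k integers in range(side). Each axis contributes
    one neighbor in the -1 direction and one in the +1 direction, with
    wraparound at the boundaries.
    """
    neighbors = set()
    for axis in range(len(coord)):
        for step in (-1, 1):
            nxt = list(coord)
            nxt[axis] = (nxt[axis] + step) % side
            neighbors.add(tuple(nxt))
    neighbors.discard(tuple(coord))   # only matters for side == 1
    return neighbors
```

With at least three nodes per side, a node in a 2-D torus has 4 neighbors and a node in a 3-D torus has 6, which suggests the pattern for general k. In a mesh without wraparound, boundary nodes have fewer.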

27 Hypercubes A d-dimensional hypercube has 2^d corners, each of which is an endpoint for d edges. Such interconnection networks were all the rage in the 1980s and 1990s. Pros and cons?

28 Trees Another topology for attacking the hop-count problem. Hop distance is logarithmic. Yay! Bisection bandwidth is O(1) due to the single critical node at the root. Boo! (See figure to right.)

29 Butterflies Extend the tree with butterflies. Takes the same logarithmic-depth approach but with multiple roots. Can be built out of basic 2x2 switches. For N = 2^d nodes, we have log_2 N levels of switches.

30 Butterflies Extend the tree with butterflies. Pro: natural correspondence to algorithmic structures, e.g., the Fast Fourier Transform (FFT) and sorting networks. Con: the cost of the short (logarithmic) diameter and the large bisection (N/2) is $$$$. Each node needs log_2 N switches!

31 Butterflies and Fat Trees Butterflies are related to another topology encountered in practice, fat trees, particularly in large cluster supercomputers.

32 Topology Properties [Table not recovered in transcription.] Notes: * d = dimension. ** Bisection can be 1 for some switches, N for a crossbar.

33 Topologies and Routing Topologies with regular structure have simple routing algorithms. Example: hypercube (2-D and 3-D). Simple labeling of nodes with the binary encoding of the numbers 0 ... 2^N - 1 yields a convenient routing pattern.

34 Connectivity and Routing: Hypercube Connectivity: edges exist between nodes whose labels differ by exactly one bit. Routing: a route from A to B must traverse the dimensions that have bits on in XOR(A, B). Shortest path length: the Hamming distance.
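The XOR rule translates almost directly into code. In the sketch below, nodes are integers whose binary digits are the hypercube labels. Correcting the differing bits from lowest to highest is an assumption made for illustration (the slide only says which dimensions must be traversed, not in what order), and the function names are hypothetical.

```python
def hypercube_neighbors(node, d):
    """Labels that differ from `node` in exactly one of d bits."""
    return [node ^ (1 << bit) for bit in range(d)]


def hypercube_route(src, dst):
    """Route from src to dst, fixing differing bits lowest dimension first."""
    path = [src]
    diff = src ^ dst          # bits that are on mark dimensions to traverse
    current = src
    bit = 0
    while diff:
        if diff & 1:
            current ^= 1 << bit
            path.append(current)
        diff >>= 1
        bit += 1
    return path
```

The number of hops, `len(path) - 1`, equals the number of one-bits in `src ^ dst`, i.e., the Hamming distance between the two labels.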


36 Routing Algorithms Key insight: build algorithms that take advantage of intrinsic properties of the topology. Other considerations: minimize hop counts; minimize data transmissions; what happens when the link to the root (in a tree) goes down due to heat? Consider a torus-based network where each processor holds a set of numbers. Goal: compute the sum of all numbers and store the result on each processor.

37 Global Sum on a Torus 1. Each processor computes the sum of its local data. 2. Each processor sends its sum to its left neighbor. The neighbor's sum is added to the local sum, and this new partial sum is passed to the left. 3. After sqrt(p) steps, the partial sum along one dimension (i.e., row) returns to each processor. 4. Repeat 1-3, but along the other dimension (i.e., column). Total time for a data set of size N split over P processors? Is there a faster way?
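The four steps can be simulated sequentially to check that every processor ends up with the global sum. This is a sketch under stated assumptions rather than the lecture's implementation: the torus is q x q with q = sqrt(p), step 1 is taken as already done (the input holds one local sum per processor), and each processor forwards the value it just received rather than its running total, which avoids double counting. With that variant a ring of q processors needs q - 1 shifts. All names are illustrative.

```python
def ring_all_sum(values):
    """Sum around a ring so that every position ends with the total."""
    q = len(values)
    acc = list(values)   # running sum held by each processor
    msg = list(values)   # value each processor will send next
    for _ in range(q - 1):
        # Everyone sends left, so position j receives from position j + 1.
        msg = [msg[(j + 1) % q] for j in range(q)]
        acc = [a + m for a, m in zip(acc, msg)]
    return acc


def torus_global_sum(local_sums):
    """Global sum on a q x q torus; local_sums[i][j] is processor (i, j)."""
    q = len(local_sums)
    # Row phase (steps 2-3): every processor learns its row's sum.
    rows = [ring_all_sum(row) for row in local_sums]
    # Column phase (step 4): sum the row sums along each column.
    cols = [ring_all_sum([rows[i][j] for i in range(q)]) for j in range(q)]
    return [[cols[j][i] for j in range(q)] for i in range(q)]
```

Calling `torus_global_sum([[1, 2, 3], [4, 5, 6], [7, 8, 9]])` leaves 45 at every position after two ring phases.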

38 Considerations Faster than sequential? Local sums are obviously faster: concurrently computing the partial sums of N/P elements is faster than any one processor computing the sum of all N elements. Problem? The interconnect overhead to execute the 2 * sqrt(p) transmissions may be quite high relative to the computing capability of each processor. Why is the above a problem?

39 Performance: Machine Balance The last example points to the need to balance a machine or algorithm. Quantity that we are tuning? The surface (communication) to volume (computation) ratio. Performance factors to consider, obtained by performance profiling: the time to compute a local sum over a local data set, and the time to send a single small message over the interconnect. Performance profiling will come into play when comparing the CPU vs. CPU+GPGPU, e.g., adding a grid of 16 numbers on a quad-core CPU vs. CPU+GPGPU.
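The balance argument can be illustrated with a toy cost model. Everything here is an assumption for illustration: `t_add` is the time for one addition, `t_msg` is the time to send one small message, both would come from profiling, and the parallel time is modeled as the local sum of N/P elements plus 2 * sqrt(P) communication steps that each cost one message and one addition.

```python
import math


def sequential_sum_time(n, t_add):
    """One processor adds all n elements."""
    return n * t_add


def torus_sum_time(n, p, t_add, t_msg):
    """Assumed model for the torus global sum with p processors.

    Computation: n / p local additions.
    Communication: 2 * sqrt(p) steps, each one message plus one addition.
    """
    steps = 2 * math.isqrt(p)
    return (n / p) * t_add + steps * (t_msg + t_add)
```

With t_add = 1 and t_msg = 100, summing 16 numbers on 4 processors costs 408 units against 16 sequentially, so communication dominates. At N = 1,000,000 the parallel version wins comfortably (250,404 against 1,000,000), because the communication term does not grow with N.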

40 Reflection: Architectural Aspects Currently, caches are still a key performance enhancement for multiprocessor systems (just as for single-CPU systems). Caches require some additional logic to continue to function and to provide determinism in the main memory of a compute node. Coherence protocols and any other form of data transport between cores require an interconnection network. At scale, all-to-all bus-like structures are infeasible. Solution: novel topologies that sacrifice peak performance (average latency, bandwidth, contention characteristics, etc.) for the economic (and physical) factors underlying their design and manufacturing.

41 Reflection: Multicore Considerations Interconnection networks are more constrained in the multicore context than in the large-scale SMP world. Why? Although the AMD Barcelona quad-core processor uses 11 Cu layers, relative to the number of transistors in the two planar dimensions of the processor, the CPU remains for all intents and purposes flat. Cramming a sophisticated, non-planar interconnection network into a limited number of layers is quite hard. (Caveat: proximity interconnect.) Thus, there is a limitation on the type of interconnect that can be used on-chip.

42 Reflection: Multicore at Scale Life is becoming more interesting as core counts continue to increase. Intel Terascale chip: 80 cores. Tilera: reconfigurable 64 cores based on a 2-D mesh topology. AMD/ATi HD 4870: 800 cores. NVIDIA GeForce GTX 280: 240 cores. Why not 256? :-) In the not-too-distant future, interconnect topology will be back in vogue for parallel computing.

43 Reflection: Multicore at Scale (continued) This concludes the architectural material. Now on to:

44 Parallel Software: Correctness & Performance

45 Correctness The hardest aspect of parallel algorithm design and parallel programming? Writing programs that are correct. What good is a program that generates wrong answers faster? What do we mean by correctness? Traditionally, proving that a given algorithm produces the desired output. Example: Prim's algorithm produces a minimum spanning tree. Correctness means that the tree produced by Prim's algorithm is indeed a minimum spanning tree.

46 Underlying Assumption Traditional algorithms take the following for granted: the machine is deterministic, and only one flow of control is active at any given time. Nondeterminism comes into play only in a purely theoretical sense, when talking about automata theory, NFAs vs. DFAs, and P vs. NP. That is not the sort of nondeterminism that we are talking about here. What are we talking about? When two uncoordinated flows of control interact with each other, there is no guarantee, without explicit guidance, that the relative effects and interactions of the multiple threads of control will happen in a predictable order.

47 Performance The holy grail of parallel computing. A parallel program should run at least as fast as the sequential equivalent for a fixed input size. One may instead use parallelism to increase the volume that can be computed, in which case comparisons of time are not as important (weak scaling).

48 Performance and Correctness (or Correctness and Performance?) Performance and correctness are often intimately coupled. Without protections in place, a program can run very quickly but suffer from severe correctness problems. Very conservative decisions can be made to ensure correctness, but at the cost of significant performance degradation. Example of this? Other performance factors are unrelated to the logic put in place to maintain determinism and correctness. Example: the granularity of computation and communication can be poorly chosen, resulting in abysmal performance.
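One possible answer to "Example of this?" is a shared counter guarded by a single lock. The sketch below is an illustration using Python's standard `threading` module, not an example from the lecture. Taking the lock around every increment makes the final count predictable regardless of how the threads interleave, but it also serializes the updates, so the threads gain little over a sequential loop. Without the lock, the read-modify-write in `counter += 1` is not guaranteed to be atomic, and updates can be lost.

```python
import threading


def locked_count(num_threads, increments):
    """Increment a shared counter from several threads under one lock."""
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(increments):
            with lock:           # conservative: serializes every update
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter
```

A coarser granularity, such as having each thread count privately and taking the lock once to add its subtotal, would keep the result correct while greatly reducing the synchronization cost.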


More information

Technology White Paper Capacity Constrained Smart Grid Design

Technology White Paper Capacity Constrained Smart Grid Design Capacity Constrained Smart Grid Design Smart Devices Smart Networks Smart Planning EDX Wireless Tel: +1-541-345-0019 I Fax: +1-541-345-8145 I info@edx.com I www.edx.com Mark Chapman and Greg Leon EDX Wireless

More information

Chapter 16: Distributed Operating Systems

Chapter 16: Distributed Operating Systems Module 16: Distributed ib System Structure, Silberschatz, Galvin and Gagne 2009 Chapter 16: Distributed Operating Systems Motivation Types of Network-Based Operating Systems Network Structure Network Topology

More information

Network Architecture and Topology

Network Architecture and Topology 1. Introduction 2. Fundamentals and design principles 3. Network architecture and topology 4. Network control and signalling 5. Network components 5.1 links 5.2 switches and routers 6. End systems 7. End-to-end

More information

From Hypercubes to Dragonflies a short history of interconnect

From Hypercubes to Dragonflies a short history of interconnect From Hypercubes to Dragonflies a short history of interconnect William J. Dally Computer Science Department Stanford University IAA Workshop July 21, 2008 IAA: # Outline The low-radix era High-radix routers

More information

Computer Systems Structure Input/Output

Computer Systems Structure Input/Output Computer Systems Structure Input/Output Peripherals Computer Central Processing Unit Main Memory Computer Systems Interconnection Communication lines Input Output Ward 1 Ward 2 Examples of I/O Devices

More information

Lizy Kurian John Electrical and Computer Engineering Department, The University of Texas as Austin

Lizy Kurian John Electrical and Computer Engineering Department, The University of Texas as Austin BUS ARCHITECTURES Lizy Kurian John Electrical and Computer Engineering Department, The University of Texas as Austin Keywords: Bus standards, PCI bus, ISA bus, Bus protocols, Serial Buses, USB, IEEE 1394

More information

Chapter 12: Multiprocessor Architectures. Lesson 04: Interconnect Networks

Chapter 12: Multiprocessor Architectures. Lesson 04: Interconnect Networks Chapter 12: Multiprocessor Architectures Lesson 04: Interconnect Networks Objective To understand different interconnect networks To learn crossbar switch, hypercube, multistage and combining networks

More information

Local-Area Network -LAN

Local-Area Network -LAN Computer Networks A group of two or more computer systems linked together. There are many [types] of computer networks: Peer To Peer (workgroups) The computers are connected by a network, however, there

More information

Multicast Group Management for Interactive Distributed Applications

Multicast Group Management for Interactive Distributed Applications Multicast Group Management for Interactive Distributed Applications Carsten Griwodz griff@simula.no September 25, 2008 based on the thesis work of Knut-Helge Vik, knuthelv@simula.no Group communication

More information

Behavior Analysis of Multilayer Multistage Interconnection Network With Extra Stages

Behavior Analysis of Multilayer Multistage Interconnection Network With Extra Stages Behavior Analysis of Multilayer Multistage Interconnection Network With Extra Stages Thesis submitted in partial fulfillment of the requirements for the award of degree of Master of Engineering in Computer

More information

Multi-core Systems What can we buy today?

Multi-core Systems What can we buy today? Multi-core Systems What can we buy today? Ian Watson & Mikel Lujan Advanced Processor Technologies Group COMP60012 Future Multi-core Computing 1 A Bit of History AMD Opteron introduced in 2003 Hypertransport

More information

InfiniBand Clustering

InfiniBand Clustering White Paper InfiniBand Clustering Delivering Better Price/Performance than Ethernet 1.0 Introduction High performance computing clusters typically utilize Clos networks, more commonly known as Fat Tree

More information

Computer Networks Vs. Distributed Systems

Computer Networks Vs. Distributed Systems Computer Networks Vs. Distributed Systems Computer Networks: A computer network is an interconnected collection of autonomous computers able to exchange information. A computer network usually require

More information

SOC architecture and design

SOC architecture and design SOC architecture and design system-on-chip (SOC) processors: become components in a system SOC covers many topics processor: pipelined, superscalar, VLIW, array, vector storage: cache, embedded and external

More information

A SIMULATOR FOR LOAD BALANCING ANALYSIS IN DISTRIBUTED SYSTEMS

A SIMULATOR FOR LOAD BALANCING ANALYSIS IN DISTRIBUTED SYSTEMS Mihai Horia Zaharia, Florin Leon, Dan Galea (3) A Simulator for Load Balancing Analysis in Distributed Systems in A. Valachi, D. Galea, A. M. Florea, M. Craus (eds.) - Tehnologii informationale, Editura

More information

Parallel Architectures and Interconnection

Parallel Architectures and Interconnection Chapter 2 Networks Parallel Architectures and Interconnection The interconnection network is the heart of parallel architecture. Feng [1] - Chuan-Lin and Tse-Yun 2.1 Introduction You cannot really design

More information

Low-Overhead Hard Real-time Aware Interconnect Network Router

Low-Overhead Hard Real-time Aware Interconnect Network Router Low-Overhead Hard Real-time Aware Interconnect Network Router Michel A. Kinsy! Department of Computer and Information Science University of Oregon Srinivas Devadas! Department of Electrical Engineering

More information

Architecture of distributed network processors: specifics of application in information security systems

Architecture of distributed network processors: specifics of application in information security systems Architecture of distributed network processors: specifics of application in information security systems V.Zaborovsky, Politechnical University, Sait-Petersburg, Russia vlad@neva.ru 1. Introduction Modern

More information

Analysis of GPU Parallel Computing based on Matlab

Analysis of GPU Parallel Computing based on Matlab Analysis of GPU Parallel Computing based on Matlab Mingzhe Wang, Bo Wang, Qiu He, Xiuxiu Liu, Kunshuai Zhu (School of Computer and Control Engineering, University of Chinese Academy of Sciences, Huairou,

More information

Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com

Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com CSCI-GA.3033-012 Graphics Processing Units (GPUs): Architecture and Programming Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Modern GPU

More information

Communication Networks. MAP-TELE 2011/12 José Ruela

Communication Networks. MAP-TELE 2011/12 José Ruela Communication Networks MAP-TELE 2011/12 José Ruela Network basic mechanisms Introduction to Communications Networks Communications networks Communications networks are used to transport information (data)

More information

Introduction to LAN/WAN. Network Layer

Introduction to LAN/WAN. Network Layer Introduction to LAN/WAN Network Layer Topics Introduction (5-5.1) Routing (5.2) (The core) Internetworking (5.5) Congestion Control (5.3) Network Layer Design Isues Store-and-Forward Packet Switching Services

More information

Distributed Operating Systems Introduction

Distributed Operating Systems Introduction Distributed Operating Systems Introduction Ewa Niewiadomska-Szynkiewicz and Adam Kozakiewicz ens@ia.pw.edu.pl, akozakie@ia.pw.edu.pl Institute of Control and Computation Engineering Warsaw University of

More information

The proliferation of the raw processing

The proliferation of the raw processing TECHNOLOGY CONNECTED Advances with System Area Network Speeds Data Transfer between Servers with A new network switch technology is targeted to answer the phenomenal demands on intercommunication transfer

More information

Introduction to Infiniband. Hussein N. Harake, Performance U! Winter School

Introduction to Infiniband. Hussein N. Harake, Performance U! Winter School Introduction to Infiniband Hussein N. Harake, Performance U! Winter School Agenda Definition of Infiniband Features Hardware Facts Layers OFED Stack OpenSM Tools and Utilities Topologies Infiniband Roadmap

More information

Parallel Computing. Benson Muite. benson.muite@ut.ee http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage

Parallel Computing. Benson Muite. benson.muite@ut.ee http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage Parallel Computing Benson Muite benson.muite@ut.ee http://math.ut.ee/ benson https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage 3 November 2014 Hadoop, Review Hadoop Hadoop History Hadoop Framework

More information

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip. Lecture 11: Multi-Core and GPU Multi-core computers Multithreading GPUs General Purpose GPUs Zebo Peng, IDA, LiTH 1 Multi-Core System Integration of multiple processor cores on a single chip. To provide

More information

Vorlesung Rechnerarchitektur 2 Seite 178 DASH

Vorlesung Rechnerarchitektur 2 Seite 178 DASH Vorlesung Rechnerarchitektur 2 Seite 178 Architecture for Shared () The -architecture is a cache coherent, NUMA multiprocessor system, developed at CSL-Stanford by John Hennessy, Daniel Lenoski, Monica

More information

524 Computer Networks

524 Computer Networks 524 Computer Networks Section 1: Introduction to Course Dr. E.C. Kulasekere Sri Lanka Institute of Information Technology - 2005 Course Outline The Aim The course is design to establish the terminology

More information

Multi-core and Linux* Kernel

Multi-core and Linux* Kernel Multi-core and Linux* Kernel Suresh Siddha Intel Open Source Technology Center Abstract Semiconductor technological advances in the recent years have led to the inclusion of multiple CPU execution cores

More information

Computer Systems Structure Main Memory Organization

Computer Systems Structure Main Memory Organization Computer Systems Structure Main Memory Organization Peripherals Computer Central Processing Unit Main Memory Computer Systems Interconnection Communication lines Input Output Ward 1 Ward 2 Storage/Memory

More information

MULTISTAGE INTERCONNECTION NETWORKS: A TRANSITION TO OPTICAL

MULTISTAGE INTERCONNECTION NETWORKS: A TRANSITION TO OPTICAL MULTISTAGE INTERCONNECTION NETWORKS: A TRANSITION TO OPTICAL Sandeep Kumar 1, Arpit Kumar 2 1 Sekhawati Engg. College, Dundlod, Dist. - Jhunjhunu (Raj.), 1987san@gmail.com, 2 KIIT, Gurgaon (HR.), Abstract

More information

TDT 4260 lecture 11 spring semester 2013. Interconnection network continued

TDT 4260 lecture 11 spring semester 2013. Interconnection network continued 1 TDT 4260 lecture 11 spring semester 2013 Lasse Natvig, The CARD group Dept. of computer & information science NTNU 2 Lecture overview Interconnection network continued Routing Switch microarchitecture

More information

Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011

Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011 Graphics Cards and Graphics Processing Units Ben Johnstone Russ Martin November 15, 2011 Contents Graphics Processing Units (GPUs) Graphics Pipeline Architectures 8800-GTX200 Fermi Cayman Performance Analysis

More information

Overview of High Performance Computing

Overview of High Performance Computing Overview of High Performance Computing Timothy H. Kaiser, PH.D. tkaiser@mines.edu http://geco.mines.edu/workshop 1 This tutorial will cover all three time slots. In the first session we will discuss the

More information

Chapter 13 Selected Storage Systems and Interface

Chapter 13 Selected Storage Systems and Interface Chapter 13 Selected Storage Systems and Interface Chapter 13 Objectives Appreciate the role of enterprise storage as a distinct architectural entity. Expand upon basic I/O concepts to include storage protocols.

More information

VMWARE WHITE PAPER 1

VMWARE WHITE PAPER 1 1 VMWARE WHITE PAPER Introduction This paper outlines the considerations that affect network throughput. The paper examines the applications deployed on top of a virtual infrastructure and discusses the

More information

CHAPTER 1 INTRODUCTION

CHAPTER 1 INTRODUCTION 1 CHAPTER 1 INTRODUCTION 1.1 MOTIVATION OF RESEARCH Multicore processors have two or more execution cores (processors) implemented on a single chip having their own set of execution and architectural recourses.

More information

Agenda. Enterprise Application Performance Factors. Current form of Enterprise Applications. Factors to Application Performance.

Agenda. Enterprise Application Performance Factors. Current form of Enterprise Applications. Factors to Application Performance. Agenda Enterprise Performance Factors Overall Enterprise Performance Factors Best Practice for generic Enterprise Best Practice for 3-tiers Enterprise Hardware Load Balancer Basic Unix Tuning Performance

More information

Interconnection Network of OTA-based FPAA

Interconnection Network of OTA-based FPAA Chapter S Interconnection Network of OTA-based FPAA 5.1 Introduction Aside from CAB components, a number of different interconnect structures have been proposed for FPAAs. The choice of an intercmmcclion

More information

Performance Evaluation of 2D-Mesh, Ring, and Crossbar Interconnects for Chip Multi- Processors. NoCArc 09

Performance Evaluation of 2D-Mesh, Ring, and Crossbar Interconnects for Chip Multi- Processors. NoCArc 09 Performance Evaluation of 2D-Mesh, Ring, and Crossbar Interconnects for Chip Multi- Processors NoCArc 09 Jesús Camacho Villanueva, José Flich, José Duato Universidad Politécnica de Valencia December 12,

More information

Binary search tree with SIMD bandwidth optimization using SSE

Binary search tree with SIMD bandwidth optimization using SSE Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous

More information

Cray: Enabling Real-Time Discovery in Big Data

Cray: Enabling Real-Time Discovery in Big Data Cray: Enabling Real-Time Discovery in Big Data Discovery is the process of gaining valuable insights into the world around us by recognizing previously unknown relationships between occurrences, objects

More information

Large Scale Clustering with Voltaire InfiniBand HyperScale Technology

Large Scale Clustering with Voltaire InfiniBand HyperScale Technology Large Scale Clustering with Voltaire InfiniBand HyperScale Technology Scalable Interconnect Topology Tradeoffs Since its inception, InfiniBand has been optimized for constructing clusters with very large

More information

GPUs for Scientific Computing

GPUs for Scientific Computing GPUs for Scientific Computing p. 1/16 GPUs for Scientific Computing Mike Giles mike.giles@maths.ox.ac.uk Oxford-Man Institute of Quantitative Finance Oxford University Mathematical Institute Oxford e-research

More information

Customer Specific Wireless Network Solutions Based on Standard IEEE 802.15.4

Customer Specific Wireless Network Solutions Based on Standard IEEE 802.15.4 Customer Specific Wireless Network Solutions Based on Standard IEEE 802.15.4 Michael Binhack, sentec Elektronik GmbH, Werner-von-Siemens-Str. 6, 98693 Ilmenau, Germany Gerald Kupris, Freescale Semiconductor

More information

Building Blocks. CPUs, Memory and Accelerators

Building Blocks. CPUs, Memory and Accelerators Building Blocks CPUs, Memory and Accelerators Outline Computer layout CPU and Memory What does performance depend on? Limits to performance Silicon-level parallelism Single Instruction Multiple Data (SIMD/Vector)

More information

Using High Availability Technologies Lesson 12

Using High Availability Technologies Lesson 12 Using High Availability Technologies Lesson 12 Skills Matrix Technology Skill Objective Domain Objective # Using Virtualization Configure Windows Server Hyper-V and virtual machines 1.3 What Is High Availability?

More information