Why the Network Matters


Week 2, Lecture 2. Copyright 2009 by W. Feng. Based on material from Matthew Sottile.

So Far. Overview of multicore systems; why memory matters; memory architectures; emerging chip multiprocessors (CMPs), with an increasing number of cores on a chip; cache coherency and the shopping-list model; memory performance and network performance. Today: The Network. Moving data around to support the shopping-list model; how to connect processors to memory, and the impact on performance and applications. (Much of the material today is derived from Culler and Singh.)

Data Mobility. It's actually all about the data, data, data. No matter how fast the functional units of a system are, the performance bottleneck has always been (and will continue to be) moving data around. Challenge: how do we efficiently feed the functional units? How do we lay out and track data and get it quickly from A to B?

Granularity. We keep widening our focus in granularity: functional-unit pipelines; single- and multicore cache hierarchies; coherence to manage nondeterminism between tightly coupled cores; and now interconnection networks. Bus-snooping protocols cannot practically be sustained in hardware at scale; instead, networks are used to interconnect small multiprocessors to form arbitrarily large supercomputers.

Interconnection Networks. The network that connects processing elements together. Broad applicability: it is the infrastructure in both shared- and distributed-memory systems that ties processors to memories and to each other. Examples: (1) a distributed-memory system with potentially large message sizes, such as the SGI Altix; (2) massively parallel collections of small processors that communicate in small amounts but frequently, such as GPGPUs. :-)

Interconnection Networks for Multicore? On-chip: how cores are linked together. Off-chip: how CMPs are connected to motherboard buses. Recall the bidirectional ring EIB interconnect on the Cell: is it an on-chip or an off-chip interconnect?

Design Factors. Economic factors for the actual hardware. Performance: peak versus sustained (actual/practical). Other: routing and switching characteristics.

Design Dimensions. Topology: the physical interconnection structure of the network. Routing algorithm: the method for choosing which route messages take through the network graph from source to destination. Switching strategy: how the data in a message traverse the route. Flow control: determining when a message (or a portion thereof) moves along its route.

Terminology. Channel: a link between two nodes on the network, including buffers to hold data. Bandwidth: b = wf, where w is the channel width and f is the signaling rate (f = 1/T, with T the cycle time). Degree: the connectivity of a node (the number of channels to/from the node). Route: a path through the network graph. Diameter: the length of the maximum shortest path between any two nodes. Routing distance: the number of links traversed en route between two nodes. Average distance: the average routing distance over all pairs of nodes.

Bandwidth. Raw bandwidth is b = wf, where w is the channel width and f is the frequency. Effective bandwidth is reduced by the overhead n_E of encapsulating a packet of size n. If a switch delays routing decisions by d, the bandwidth degrades further.
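The slide names the ingredients but not a formula. As a minimal sketch, assuming a simple model in which the n_E-bit envelope rides along with the n-bit payload and each switch adds d cycles of decision delay (the function names and the 16-bit / 500 MHz example numbers are invented for illustration):

```python
def raw_bandwidth(w_bits, f_hz):
    """Raw link bandwidth b = w * f, in bits per second."""
    return w_bits * f_hz

def effective_bandwidth(n_bits, n_e_bits, w_bits, f_hz, d_cycles=0):
    """Illustrative effective bandwidth for an n-bit payload wrapped in an
    n_E-bit envelope; a switch that delays the routing decision by d cycles
    adds d/f seconds per packet (assumed model, not from the slides)."""
    b = raw_bandwidth(w_bits, f_hz)
    time_on_wire = (n_bits + n_e_bits) / b   # seconds to push the whole packet
    time_in_switch = d_cycles / f_hz         # extra decision delay
    return n_bits / (time_on_wire + time_in_switch)

# Example: 16-bit-wide channel at 500 MHz, 1024-bit payload, 128-bit envelope
print(effective_bandwidth(1024, 128, 16, 500e6, d_cycles=4) / 1e9, "Gb/s of payload")
```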

Bisection Bandwidth. Multiple nodes on an interconnect send messages at the same time, so how do we measure aggregate capacity? Bisection bandwidth: the sum of the bandwidths of the minimum set of channels that, if removed, partitions the network into two equal, unconnected sets of nodes. Why is this valuable? If all nodes communicate in a uniform pattern, half the messages are expected to cross the bisection in each direction.
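As a small illustration (not from the slides), bisection bandwidth can be computed directly from the number of links a worst-case equal split must cut; the per-topology cut counts below are the usual textbook values:

```python
import math

def bisection_bandwidth(topology, n_nodes, link_bw):
    """Bisection bandwidth = (links cut by the worst-case equal split) * link_bw.
    The cut counts below are standard results, assumed here for illustration."""
    if topology == "chain":
        cut_links = 1                        # cutting the middle link splits the line
    elif topology == "ring":
        cut_links = 2                        # any equal split cuts two links
    elif topology == "2d-mesh":
        cut_links = int(math.sqrt(n_nodes))  # cut one full column of links
    elif topology == "hypercube":
        cut_links = n_nodes // 2             # half the nodes each lose one link
    else:
        raise ValueError(topology)
    return cut_links * link_bw

# 64 nodes with 10 Gb/s links: ring vs. 2-D mesh vs. hypercube
for t in ("ring", "2d-mesh", "hypercube"):
    print(t, bisection_bandwidth(t, 64, 10), "Gb/s")
```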

Routing. A request between two processors must be routed in some way, preferably in an optimal manner that minimizes hops. Desirable properties: simplicity (low complexity, low overhead, ease of establishing correctness, i.e., deadlock freedom) and minimal latency in the presence of large message sizes.

Routing Strategies. Store-and-forward: a method typically used in LAN or WAN networks. Data is sent in packets that are received in their entirety at each switch before being forwarded.

Routing Strategies. Cut-through routing: a method that reduces the latency for packets to traverse a path. Think of it as network pipelining.

Store-and-Forward vs. Cut-Through Routing. Store-and-forward makes the routing decision only after all phits of a packet have been received. Cut-through routing makes the routing decision immediately upon receiving the physical unit (phit) at the beginning of the packet, and all subsequent phits cut through along this route. Train analogy: what train scenario looks like store-and-forward? What train scenario looks like cut-through?

Train Analogy. A train at a station (as a connection to the next station) is like store-and-forward routing: the entire train must stop before moving on. A train encountering a railroad switch is like cut-through routing: the first car makes the decision as to which direction to take, and all the others simply follow along.

Routing Strategies: Analysis. What's the big deal? Latency. Let h be the routing distance (hops), b the bandwidth, n the size of the message, and d the delay at each switch. How can we make store-and-forward look more like cut-through, thus reaping some of the benefits of pipelining?
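The slide leaves the formulas as an exercise. The classic Culler-and-Singh-style answers suggested by the notation above are roughly T_sf = h * (n/b + d) for store-and-forward and T_ct = n/b + h*d for cut-through, and the usual way to make store-and-forward behave more like cut-through is to break the message into smaller packets so that transmission pipelines across hops. A sketch (the example numbers are invented):

```python
def store_and_forward_latency(n, b, h, d):
    """Classic model: each of the h hops must receive the whole n-bit packet
    (n/b seconds) and spend d seconds deciding, before forwarding."""
    return h * (n / b + d)

def cut_through_latency(n, b, h, d):
    """Classic model: only the header pays the per-hop decision delay; the
    payload streams behind it, so n/b is paid once."""
    return n / b + h * d

# 4 KB message, 10 Gb/s links, 8 hops, 100 ns per-switch delay
n, b, h, d = 4096 * 8, 10e9, 8, 100e-9
print("store-and-forward:", store_and_forward_latency(n, b, h, d) * 1e6, "us")
print("cut-through:      ", cut_through_latency(n, b, h, d) * 1e6, "us")
```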

Anatomy of a Switch

The Crossbar. Provides the internal switching structure for the switch. Non-blocking crossbar: guarantees a path between each distinct input and output simultaneously, in any permutation; but costs go up quadratically. What is the cost of a full NxN crossbar (N = # inputs = # outputs)? What does a fully connected NxN crossbar look like? It can be built as a collection of multiplexers.

The Crossbar (continued). Blocking crossbar: its pros and cons complement those above. The degenerate crossbar is a bus; what is the cost of a bus-based NxN "crossbar"? What about a multistage interconnection network (MIN): what does it look like, and what is the cost of a MIN-based NxN switch? More on this coming up next, in Topology. A rough cost comparison is sketched below.
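As a rough sketch of the cost comparison these questions are driving at, assuming one common accounting (crosspoints for the full crossbar, taps for the bus, and 2x2 switches for the MIN; exact constants depend on the design):

```python
import math

def crossbar_cost(n):
    """Full non-blocking N x N crossbar: one crosspoint per input/output pair."""
    return n * n

def bus_cost(n):
    """Degenerate 'crossbar' (a shared bus): roughly one tap per port."""
    return n

def min_cost(n):
    """Multistage interconnection network built from 2x2 switches:
    log2(N) stages of N/2 switches each (one common convention)."""
    return (n // 2) * int(math.log2(n))

for n in (8, 64, 1024):
    print(f"N={n:5d}: crossbar {crossbar_cost(n):8d}, bus {bus_cost(n):5d}, "
          f"MIN {min_cost(n):6d}")
```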

Topology. It is often infeasible to connect every processing element directly to every other. At macroscale (e.g., cluster supercomputers with PE counts of O(1,000) to O(10,000)), full connectivity is functionally possible but very, very expensive: the interconnect can account for as much as half the price of a supercomputer. At microscale (e.g., emerging chip multiprocessors like the Cell and GPGPUs), a larger interconnect means more real estate devoted to non-compute entities. Solution: be smart about how to connect PEs together. This connection pattern is the topology.

Simple Topologies: One-Dimensional. Chain: order all P processors in a line, numbered 1 ... P, and connect processor i with processors i-1 and i+1. Sending a message from P1 to P4 must traverse 3 links. Best case? Average case? Worst case? Torus (or ring): instead of letting the ends dangle, connect the first processor to the last to form a ring. Best case? Average case? Worst case? (Recall the ring interconnect on the Cell.) A small sketch answering these questions follows.

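A small sketch answering the best/average/worst-case questions for a chain versus a ring (the node count and numbering are illustrative; the comments give the familiar closed-form answers):

```python
def chain_hops(src, dst):
    """Hops between nodes src and dst on a 1-D chain of nodes 0..p-1."""
    return abs(src - dst)

def ring_hops(src, dst, p):
    """Hops on a p-node ring: go whichever way around is shorter."""
    d = abs(src - dst)
    return min(d, p - d)

p = 8
chain = [chain_hops(s, t) for s in range(p) for t in range(p) if s != t]
ring  = [ring_hops(s, t, p) for s in range(p) for t in range(p) if s != t]
print("chain: worst", max(chain), "average", sum(chain) / len(chain))  # P-1, ~P/3
print("ring:  worst", max(ring),  "average", sum(ring) / len(ring))    # P/2, ~P/4
```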

The Effect of Adding Dimensions. Increase to two dimensions, i.e., from a 1-D chain to a 2-D grid. Each side (or dimension) will have how many processors? What about a k-dimensional grid? For 2-D, connect each processor to its neighbors: up to 4 connections per processor. The boundaries can be wired around to form a 2-D torus.

2-D Mesh and Torus

Higher-Dimensional Meshes and Tori. Keep playing this trick of embedding processors into grids of increasing dimensionality. Key observation: each time the dimension is increased, the number of point-to-point connections per processor increases. Generalization: how many point-to-point connections per node does a k-dimensional grid require? (See the sketch below.)
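A minimal sketch of the generalization being asked for: an interior node of a k-dimensional mesh has up to 2k neighbors, one in each direction along each dimension (the side length of 5 is arbitrary):

```python
def mesh_neighbors(coord, side):
    """Neighbors of a node in a k-D mesh with `side` nodes per dimension.
    Interior nodes have 2k neighbors; boundary nodes have fewer."""
    k = len(coord)
    out = []
    for dim in range(k):
        for step in (-1, +1):
            c = list(coord)
            c[dim] += step
            if 0 <= c[dim] < side:
                out.append(tuple(c))
    return out

print(len(mesh_neighbors((2, 2, 2), 5)))  # interior node of a 3-D mesh -> 6 = 2*3
print(len(mesh_neighbors((0, 0, 0), 5)))  # corner node -> 3
```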

Hypercubes. A d-dimensional hypercube has 2^d corners, each of which is an endpoint for d edges. Such interconnection networks were all the rage in the 1980s and 1990s. Pros and cons?

Trees. Another topology for attacking the hop-count problem. Hop distance is logarithmic: yay! Bisection bandwidth is O(1) due to the single critical node at the root: boo!

Butterflies. Extend the tree idea with butterflies: the same logarithmic-depth approach, but with multiple roots. Can be built out of basic 2x2 switches. For N = 2^d nodes, we have log2(N) levels of switches.

Butterflies (continued). Pro: a natural correspondence to algorithmic structures, e.g., the Fast Fourier Transform (FFT) and sorting networks. Con: the cost of the short (logarithmic) diameter and large bisection (N/2) is $$$$; each node needs on the order of log2(N) switches! A rough switch count is sketched below.
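A rough switch count, assuming the common convention that a butterfly on N = 2^d terminals has log2(N) levels of N/2 two-by-two switches (other conventions shift the constants):

```python
import math

def butterfly_cost(n_nodes):
    """One common convention (assumed here): log2(N) switch levels,
    N/2 two-by-two switches per level."""
    levels = int(math.log2(n_nodes))
    switches = (n_nodes // 2) * levels
    return levels, switches

for n in (16, 1024, 65536):
    levels, switches = butterfly_cost(n)
    print(f"N={n}: {levels} levels, {switches} 2x2 switches "
          f"(~{switches / n:.1f} switches per node)")
```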

Butterflies and Fat Trees. Butterflies are related to another topology encountered in practice, the fat tree, particularly in large cluster supercomputers.

Topology Properties. (Table comparing degree, diameter, and bisection width across topologies; * d = dimension; ** bisection can be 1 for some switches, N for a crossbar.)
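The table itself did not survive transcription. As a stand-in (these are the standard textbook values for N nodes, not a reconstruction of the original slide), here is a small sketch of degree, diameter, and bisection width for a few common topologies:

```python
import math

def topology_properties(name, n):
    """Return (degree, diameter, bisection width in links) for N nodes.
    Standard textbook values, assumed here; not recovered from the slide."""
    if name == "ring":
        return 2, n // 2, 2
    if name == "2d-mesh":                      # sqrt(N) x sqrt(N), no wraparound
        s = int(math.sqrt(n))
        return 4, 2 * (s - 1), s
    if name == "2d-torus":                     # mesh with wraparound links
        s = int(math.sqrt(n))
        return 4, 2 * (s // 2), 2 * s
    if name == "hypercube":
        d = int(math.log2(n))
        return d, d, n // 2
    if name == "crossbar":                     # one link per node into the switch
        return 1, 1, n                         # ** bisection N, per the footnote
    raise ValueError(name)

for t in ("ring", "2d-mesh", "2d-torus", "hypercube", "crossbar"):
    print(t, "degree/diameter/bisection =", topology_properties(t, 64))
```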

Topologies and Routing. Topologies with regular structure have simple routing algorithms. Example: the hypercube (2-D and 3-D). Simply labeling the nodes with the binary encodings of the numbers 0 ... 2^d - 1 yields a convenient routing pattern.

Connectivity and Routing: Hypercube. Connectivity: edges connect nodes whose labels differ in exactly one bit. Routing: a message from A to B must traverse the dimensions whose bits are set in XOR(A, B). Shortest path length: the Hamming distance between A and B. A routing sketch follows.
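A minimal sketch of dimension-order (e-cube) routing implementing the rule just stated; the node labels and the 4-bit example are illustrative:

```python
def hypercube_route(src, dst):
    """Dimension-order routing on a hypercube: at each step, flip the lowest
    dimension in which the current node still differs from the destination."""
    path = [src]
    node = src
    diff = src ^ dst
    dim = 0
    while diff:
        if diff & 1:
            node ^= (1 << dim)      # traverse the edge in this dimension
            path.append(node)
        diff >>= 1
        dim += 1
    return path

def hamming_distance(a, b):
    return bin(a ^ b).count("1")    # shortest path length between a and b

src, dst = 0b0101, 0b0011
print([format(n, "04b") for n in hypercube_route(src, dst)])  # 0101 -> 0111 -> 0011
print("hops:", hamming_distance(src, dst))                    # 2
```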


Routing Algorithms. Key insight: build algorithms that take advantage of the intrinsic properties of the topology. Other considerations: minimize hop counts; minimize data transmissions; what happens when the link to the root (in a tree) goes down due to heat? Example: consider a torus-based network where each processor holds a set of numbers. Goal: compute the sum of all the numbers and store the result on each processor.

Global Sum on a Torus. 1. Each processor computes the sum of its local data. 2. Each processor sends its partial sum to its left neighbor; the value received from the right neighbor is added to the local partial sum, and this new partial sum is passed to the left. 3. After sqrt(P) steps, the partial sum along one dimension (i.e., the row) has returned to each processor. 4. Repeat steps 1-3 along the other dimension (i.e., the column). What is the total time for a data set of size N split over P processors? Is there a faster way? A small simulation follows.
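A small simulation of the row-then-column reduction described above on a sqrt(P) x sqrt(P) torus (pure Python, just to show the data movement; not actual machine code):

```python
def ring_allsum(values):
    """Shift-and-add on a ring: every processor launches a partial sum that
    travels left one hop per step, picking up each local value it visits.
    After len(values) - 1 steps, every processor holds the full sum."""
    p = len(values)
    tokens = list(values)                 # token i is currently held by processor i
    for _ in range(p - 1):
        # Everyone sends its token to the left neighbor simultaneously,
        # so processor i receives the token from processor i + 1.
        received = [tokens[(i + 1) % p] for i in range(p)]
        tokens = [received[i] + values[i] for i in range(p)]
    return tokens                         # every entry equals sum(values)

def torus_allsum(grid):
    """Row reduction, then column reduction, as in steps 2-4 above."""
    row_sums = [ring_allsum(row) for row in grid]            # phase 1: rows
    cols = list(zip(*row_sums))                              # phase 2: columns
    col_sums = [ring_allsum(list(col)) for col in cols]
    return [list(r) for r in zip(*col_sums)]                 # back to row-major

grid = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(torus_allsum(grid))   # every processor ends up holding 45
```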

Considerations. Is this faster than the sequential version? The local sums obviously are: P processors concurrently compute partial sums of N/P elements faster than any one processor can compute the sum of all N elements. The problem? The interconnect overhead to execute the 2 * sqrt(P) transmissions may be quite high relative to the computing capability of each processor. Why is that a problem?

Performance: Machine Balance. The last example points to the need to balance a machine (or an algorithm). The quantity we are tuning is the surface-to-volume ratio: communication (surface) to computation (volume). Performance factors to consider when profiling: the time to compute a local sum over a local data set, and the time to send a single small message over the interconnect. Performance profiling will come into play when choosing between CPU and CPU+GPGPU, e.g., adding a grid of 16 numbers on a quad-core CPU versus a CPU+GPGPU. A back-of-the-envelope comparison is sketched below.
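A back-of-the-envelope balance check for the torus global sum, assuming an N/P * t_add compute term and a 2*sqrt(P) * t_msg communication term (the constants t_add and t_msg are invented for illustration):

```python
import math

def global_sum_time(n, p, t_add, t_msg):
    """Rough surface-to-volume model: N/P local adds plus 2*sqrt(P) neighbor
    messages. Illustrative only; real machines need measured constants."""
    compute = (n / p) * t_add
    communicate = 2 * math.sqrt(p) * t_msg
    return compute, communicate

n = 1_000_000
for p in (16, 256, 4096):
    comp, comm = global_sum_time(n, p, t_add=1e-9, t_msg=1e-6)
    print(f"P={p:5d}: compute {comp*1e6:8.1f} us, communicate {comm*1e6:8.1f} us")
```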

Reflection: Architectural Aspects. Currently, caches are still a key performance enhancement for multiprocessor systems (just as for single-CPU systems). Caches require additional logic to keep functioning and to provide determinism with respect to the main memory of a compute node. Coherence protocols, and any other form of data transport between cores, require an interconnection network. At scale, all-to-all, bus-like structures are infeasible. Solution: novel topologies that sacrifice peak performance (average latency, bandwidth, contention characteristics, etc.) for the economic (and physical) factors underlying their design and manufacturing.

Reflection: Multicore Considerations. Interconnection networks are more constrained in the multicore context than in the large-scale SMP world. Why? The AMD Barcelona quad-core processor does use 11 Cu layers, but relative to the number of transistors in the two planar dimensions of the processor, the CPU remains, for all intents and purposes, flat. Cramming a sophisticated interconnection network that is not planar into a limited number of layers is quite hard. (Caveat: proximity interconnect.) Thus, there is a limitation on the type of interconnect that can be built on-chip.

Reflection: Multicore at Scale. Life is becoming more interesting as core counts continue to increase. Intel Terascale chip: 80 cores. Tilera: a reconfigurable 64 cores based on a 2-D mesh topology. AMD/ATi HD 4870: 800 cores. NVIDIA GeForce GTX 280: 240 cores (why not 256? :-) ). In the not-too-distant future, interconnect topology will be back in vogue for parallel computing.

This concludes the architectural stuff; now on to parallel software.

Parallel Software: Correctness & Performance

Correctness. The hardest aspect of parallel algorithm design and parallel programming? Writing programs that are correct. What good is a program that generates wrong answers faster? What do we mean by correctness? Traditionally, proving that a given algorithm produces the desired output. Example: Prim's algorithm produces a minimum spanning tree; correctness means that the tree produced by Prim's algorithm is indeed a minimum spanning tree.

Underlying Assumption. Traditional algorithms take the following for granted: the machine is deterministic, and only one flow of control is active at any given time. Nondeterminism only comes into play in a purely theoretical sense, when talking about automata theory, NFAs vs. DFAs, and P vs. NP; that is not the sort of nondeterminism we are talking about here. What are we talking about? When two uncoordinated flows of control interact with each other, there is no guarantee, without explicit guidance, that the relative effects and interactions of the multiple threads of control will happen in a predictable order.
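A minimal sketch of the kind of nondeterminism meant here: two uncoordinated threads performing a read-modify-write on a shared counter. Depending on how the scheduler interleaves them, updates can be lost, so the final value may differ from run to run (the counter and iteration count are invented for illustration):

```python
import threading

counter = 0

def bump(times):
    """Read-modify-write with no coordination; the interleaving of the two
    threads decides which updates survive."""
    global counter
    for _ in range(times):
        current = counter        # read
        counter = current + 1    # write (the other thread may have run in between)

t1 = threading.Thread(target=bump, args=(100_000,))
t2 = threading.Thread(target=bump, args=(100_000,))
t1.start(); t2.start()
t1.join(); t2.join()
print(counter)   # frequently less than 200000, and it varies from run to run
```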

Performance. The holy grail of parallel computing: a parallel program should run at least as fast as its sequential equivalent for a fixed input size. One may also use parallelism to increase the volume that can be computed, in which case comparisons of time are not as important (weak scaling).

Performance and Correctness (or Correctness and Performance?). Performance and correctness are often intimately coupled. Without protections in place, a program can run very quickly but suffer from severe correctness problems. Conversely, very conservative decisions can be made to ensure correctness, but at the cost of significant performance degradation. An example of this? (One is sketched below.) There are also other performance factors, unrelated to the protective logic put in place to maintain determinism and correctness. Example: the granularity of computation and communication can be chosen poorly, resulting in abysmal performance.
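One hedged example of the conservative end of the trade-off: guarding the counter from the previous sketch with a single coarse lock restores correctness, but every update from every thread now serializes through that one lock, so adding cores buys little speedup:

```python
import threading

counter = 0
lock = threading.Lock()   # one coarse lock guarding all updates

def bump_safely(times):
    """Correct by construction, but every increment from every thread is
    serialized through the same lock."""
    global counter
    for _ in range(times):
        with lock:
            counter += 1

threads = [threading.Thread(target=bump_safely, args=(100_000,)) for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)   # always 200000 now, at the cost of serialized updates
```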