Week 2, Lecture 2 Copyright 2009 by W. Feng. Based on material from Matthew Sottile.
So Far: Overview of multicore systems. Why memory matters. Memory architectures. Emerging chip multiprocessors (CMPs): increasing numbers of cores on a chip. Cache coherency and the shopping-list model. Memory performance and network performance. Today: The Network. Moving data around to support the shopping-list model. How to connect processors to memory, and the impact on performance and applications. (Much of the material today is derived from Culler and Singh.)
Data Mobility It's actually all about the data, data, data. No matter how fast the functional units of a system, the performance bottleneck has always been (and will continue to be) moving data around. Challenge: How to efficiently feed the functional units? How to lay out and track data and get it quickly from A to B?
Granularity Increasing the granularity of our focus: functional unit pipelines; single- and multicore cache hierarchies; coherence to manage nondeterminism between tightly coupled cores; and now interconnection networks. Bus snooping protocols cannot practically be sustained in hardware at scale; interconnection networks are used to connect small multiprocessors to form arbitrarily large supercomputers.
Interconnection Networks The network that connects processing elements together. Broad applicability: the infrastructure in shared and distributed memory systems that ties processors to memories and to each other. Examples: (1) A distributed memory system with potentially large message sizes, e.g., SGI Altix. (2) Massively parallel collections of small processors that communicate in small amounts but frequently, e.g., GPGPU :-)
Interconnection Networks for Multicore? On-chip: how cores are linked together. Off-chip: how CMPs are connected to motherboard buses. Recall the bi-directional circular EIB interconnect on the Cell. On-chip or off-chip interconnect?
Design Factors Economic factors for the actual hardware. Performance: peak vs. sustained/actual/practical. Other: routing and switching characteristics.
Design Dimensions Topology: the physical interconnection structure of the network. Routing algorithm: the method for choosing which route messages take through the network graph from source to destination. Switching strategy: how the data in a message traverse the route. Flow control: determination of when a message (or portions thereof) moves along its route.
Terminology Channel: a link between two nodes on the network, including buffers to hold data. Bandwidth: b = wf, where w is the channel width and f is the signaling rate (f = 1/T for cycle time T). Degree: connectivity of a node (# of channels to/from a node). Route: a path through the network graph. Diameter: length of the maximum shortest path between any two nodes. Routing distance: number of links traversed en route between two nodes. Average distance: average routing distance over all pairs of nodes.
Bandwidth Raw bandwidth is b = wf, where w = width and f = frequency. Effective bandwidth is reduced by the overhead of n_E envelope bytes needed to encapsulate a packet of payload size n. If a switch delays routing decisions by d, the bandwidth degrades further.
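As a rough model using these symbols (n payload bytes wrapped in an envelope of n_E bytes, each switch adding a routing delay d per packet):

    \[ b = w f, \qquad b_{\text{eff}} = \frac{n}{(n + n_E)/b + d} \;\le\; b \cdot \frac{n}{n + n_E} \;\le\; b. \]

Larger payloads amortize both the envelope overhead and the per-switch delay, which is why effective bandwidth approaches b only for big packets.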
Bisection Bandwidth Multiple nodes on an interconnect send messages at the same time; how do we measure the aggregate capacity? Bisection bandwidth: the sum of the bandwidths of the minimum set of channels that, if removed, partition the network into two equal, unconnected sets of nodes. Why is it useful? If all nodes communicate in a uniform pattern, half the messages are expected to cross the bisection in each direction.
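For intuition, standard values for some of the topologies that appear later in this lecture, assuming N nodes and per-link bandwidth b:

    \[ B_{\text{ring}} = 2b, \qquad B_{\sqrt{N}\times\sqrt{N}\ \text{mesh}} = \sqrt{N}\, b, \qquad B_{\text{hypercube}} = \tfrac{N}{2}\, b. \]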
Routing A request between two processors must be routed in some way, preferably in an optimal manner that minimizes hops. Desirable properties: simple (low complexity, low overhead, ease of establishing correctness, i.e., deadlock freedom); minimal latency, even in the presence of large message sizes.
Routing Strategies Store-and-forward: A method typically used in LAN or WAN networks. Data is sent in packets that are received in their entirety at switches before being forwarded.
Routing Strategies Cut-through routing: A method that reduces latency for packets to traverse a path. Think of it as network pipelining.
Store-and-Forward vs. Cut-Through Routing Store-and-forward makes the routing decision only after all phits of a packet have been received. Cut-through routing makes the routing decision immediately upon receiving the physical unit (phit) at the beginning of the packet, and all subsequent phits cut through along this route. Train analogy: What train scenario looks like store-and-forward? What train scenario looks like cut-through?
Train Analogy Train at a station (as a connection to the next station): store-and-forward routing means the entire train must stop before moving on. Train encountering a railroad switch: cut-through routing means the first car makes the decision as to which direction to take, and all the others simply follow along.
Routing Strategies: Analysis What's the big deal? Latency. Let h be the routing distance, b the bandwidth, n the size of the message, and d the delay at each switch. How can we make store-and-forward look more like cut-through, thus reaping some of the benefits of pipelining?
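With those symbols, the standard latency comparison (the switch delay d is charged once per hop, and the channel time n/b covers the whole message):

    \[ T_{\text{store-and-forward}} = h\left(\frac{n}{b} + d\right), \qquad T_{\text{cut-through}} = \frac{n}{b} + h\,d. \]

One answer to the last question: split a long message into many smaller packets. Once several packets are in flight on successive links, store-and-forward begins to approximate the pipelined cut-through behavior.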
Anatomy of a Switch
The Crossbar Provides the internal switching structure for the switch. Non-blocking crossbar: guarantees a path between every distinct input/output pair simultaneously, in any permutation; costs go up quadratically. Cost of a full NxN crossbar, where N = # inputs = # outputs? Anatomy of a fully connected NxN crossbar? A collection of multiplexers forms a crossbar.
The Crossbar Provides the internal switching structure for the switch. Blocking crossbar: pros and cons complement the above. The degenerate crossbar is a bus. Cost of a bus-based NxN crossbar? Multistage interconnection network (MIN)? What does it look like? Cost of a MIN NxN crossbar? More on this coming up next, in Topology.
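For reference, the usual cost figures (a sketch, counting crosspoints for the full crossbar and 2x2 switching elements for the MIN):

    \[ \text{full crossbar: } O(N^2) \text{ crosspoints}, \qquad \text{bus: } O(N) \text{ taps on one shared channel}, \qquad \text{MIN: } \tfrac{N}{2}\log_2 N = O(N \log N) \text{ switches}. \]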
Topology Oftentimes it is infeasible to connect every processing element to every other. Example: at the macroscale, e.g., cluster supercomputers with PE counts of O(1,000) to O(10,000), full connectivity is functionally possible but very, very expensive, as much as half the price of a supercomputer. At the microscale, e.g., emerging chip multiprocessors like the Cell and GPGPUs, a larger interconnect means more real estate devoted to non-compute entities. Solution: be smart about how to connect PEs together. This connection pattern is the topology.
Simple Topology One-Dimensional Topologies. Chain: order all P processors in a line, numbered 1...P, and connect processor p with processors p-1 and p+1. Sending a message from P1 to P4 must traverse 3 links. Best case? Average case? Worst case? Torus (or ring): instead of letting the ends dangle, connect the first processor to the last to form a ring. Best case? Average case? Worst case? Cell?
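A sketch of the standard answers for P processors (the best case is one hop in both topologies; the averages are approximate):

    \[ \text{chain: worst} = P - 1, \quad \text{average} \approx P/3; \qquad \text{ring: worst} = \lfloor P/2 \rfloor, \quad \text{average} \approx P/4. \]

The Cell's bi-directional EIB, mentioned earlier, is a ring-style interconnect.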
The Effect of Adding Dimensions Increase to two dimensions, i.e., go from a 1-D chain to a 2-D grid. Each side (or dimension) will have how many processors? What about a k-dimensional grid? For 2-D, connect each processor to its neighbors: up to 4 connections per processor. Boundaries can be wired around to form a 2-D torus.
2-D Mesh and Torus
Higher-Dimensional Meshes and Tori Keep playing this trick of embedding processors into grids of increasing dimensionality. Key observation: each time the dimension is increased, the # of point-to-point connections for each processor increases. Generalization: the # of point-to-point connections per node within a k-dimensional grid is?
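A sketch of the standard counts for N nodes arranged as a k-dimensional mesh with N^{1/k} processors per side:

    \[ \text{processors per side} = N^{1/k}, \qquad \text{degree} \le 2k, \qquad \text{diameter} = k\,(N^{1/k} - 1). \]

Wrapping each dimension around into a torus roughly halves the diameter.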
Hypercubes A d-dimensional hypercube has 2^d corners, each of which is an endpoint for d edges. Such interconnection networks were all the rage in the 1980s and 1990s. Pros and cons?
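The standard numbers behind the pros and cons, for N = 2^d nodes:

    \[ \text{degree} = d = \log_2 N, \qquad \text{diameter} = \log_2 N, \qquad \text{bisection} = N/2 \text{ links}, \]

so diameter and bisection bandwidth are excellent, but the per-node degree (and wiring) grows with log_2 N, which is the usual scaling objection.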
Trees Another topology for attacking the hop-count problem. Hop distance is logarithmic. Yay! Bisection bandwidth is O(1) due to the single critical node at the root. Boo!
Butterflies Extend the tree with butterflies. Takes the same logarithmic-depth approach but with multiple roots. Can be built out of basic 2x2 switches. For N = 2^d nodes, we have log_2 N levels of switches.
Butterflies Extend the tree with butterflies. Pro: natural correspondence to algorithmic structures, e.g., the Fast Fourier Transform (FFT) and sorting networks. Con: the cost of the short diameter (logarithmic) and large bisection (N/2) is $$$$. Each node needs log_2 N switches!
Butterflies Fat Trees Butterflies are related to another topology encountered in practice, the fat tree, particularly in large cluster supercomputers.
Topology Properties (summary table comparing the preceding topologies). * : d = dimension. ** : Bisection can be 1 for some switches, N for a crossbar.
Topologies and Routing Topologies with regular structure have simple routing algorithms. Example: hypercube (2-D and 3-D). Simply labeling the nodes with the binary encodings of the numbers 0 ... 2^d - 1 yields a convenient routing pattern.
Connectivity and Routing: Hypercube Connectivity: an edge joins two nodes whose labels differ in exactly one bit. Routing: going from A to B must traverse the dimensions whose bits are set in XOR(A, B). Shortest path length: the Hamming distance between A and B.
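A small sketch in C of this dimension-order (e-cube) routing idea; the function names here are illustrative, not from any particular system:

    #include <stdio.h>

    /* Hop count on a hypercube = Hamming distance = popcount(a ^ b). */
    static int hop_count(unsigned a, unsigned b)
    {
        unsigned diff = a ^ b;
        int hops = 0;
        while (diff) { hops += diff & 1u; diff >>= 1; }
        return hops;
    }

    /* Dimension-order routing: correct one differing bit (dimension) per hop. */
    static void print_route(unsigned src, unsigned dst, int dims)
    {
        unsigned cur = src;
        printf("%u", cur);
        for (int d = 0; d < dims; d++) {
            if (((cur ^ dst) >> d) & 1u) {   /* bit d differs: cross dimension d */
                cur ^= 1u << d;
                printf(" -> %u", cur);
            }
        }
        printf("\n");
    }

    int main(void)
    {
        /* 3-D hypercube (8 nodes): route from node 1 (001) to node 6 (110). */
        print_route(1, 6, 3);
        printf("hops = %d\n", hop_count(1, 6));  /* Hamming distance = 3 */
        return 0;
    }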
Routing Algorithms Key insight: build algorithms that take advantage of the intrinsic properties of the topology. Other considerations: minimize hop counts; minimize data transmissions; what happens when the link to the root (in a tree) goes down due to heat? Consider a torus-based network where each processor holds a set of numbers. Goal: compute the sum of all numbers and store the result on each processor.
Global Sum on a Torus 1. Each processor computes the sum of its local data. 2. Each processor sends its sum to its left neighbor. The neighbor's sum is added to the local sum, and this new partial sum is passed to the left. 3. After sqrt(P) steps, the partial sum along one dimension (i.e., row) returns to each processor. 4. Repeat 1-3 along the other dimension (i.e., column). Total time for a data set of size N split over P processors? Is there a faster way?
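A minimal sketch of steps 1-4 in C with MPI, assuming P is a perfect square and a rank layout of row = rank / sqrt(P), col = rank % sqrt(P); each step forwards the value just received and accumulates it locally (in practice one would simply call MPI_Allreduce):

    #include <math.h>
    #include <mpi.h>

    /* Global sum on a sqrt(P) x sqrt(P) torus: ring-reduce rows, then columns. */
    double torus_allsum(double local_sum, MPI_Comm comm)
    {
        int rank, nprocs;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &nprocs);

        int side = (int)(sqrt((double)nprocs) + 0.5);  /* assume P is a perfect square */
        int row = rank / side, col = rank % side;

        /* Steps 1-3: pass partial sums around the row ring (sqrt(P) - 1 exchanges). */
        double acc = local_sum, incoming, msg = local_sum;
        int left  = row * side + (col + side - 1) % side;
        int right = row * side + (col + 1) % side;
        for (int s = 0; s < side - 1; s++) {
            MPI_Sendrecv(&msg, 1, MPI_DOUBLE, left, 0,
                         &incoming, 1, MPI_DOUBLE, right, 0,
                         comm, MPI_STATUS_IGNORE);
            acc += incoming;      /* accumulate the neighbor's contribution */
            msg = incoming;       /* and pass it on around the ring */
        }

        /* Step 4: repeat along the column ring using the row totals. */
        double row_total = acc;
        int up   = ((row + side - 1) % side) * side + col;
        int down = ((row + 1) % side) * side + col;
        msg = row_total;
        for (int s = 0; s < side - 1; s++) {
            MPI_Sendrecv(&msg, 1, MPI_DOUBLE, up, 0,
                         &incoming, 1, MPI_DOUBLE, down, 0,
                         comm, MPI_STATUS_IGNORE);
            acc += incoming;
            msg = incoming;
        }
        return acc;   /* every rank now holds the global sum */
    }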
Considerations Faster than sequential? The local sums obviously are: the partial sums over N/P elements are computed concurrently, faster than any one processor can compute the sum of all N elements. Problem? The interconnect overhead to execute the 2 * sqrt(P) transmissions may be quite high relative to the computing capability of each processor. Why is that a problem?
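A back-of-the-envelope model, with t_c the (hypothetical) time per addition and t_m the time to pass one small message, both to be measured by profiling:

    \[ T_{\text{parallel}} \approx \frac{N}{P}\, t_c + 2\sqrt{P}\, t_m, \qquad T_{\text{sequential}} \approx N\, t_c, \]

so the parallel version wins only when the (N/P) t_c compute term dominates the 2 sqrt(P) t_m communication term; since t_m is typically orders of magnitude larger than t_c, small local data sets can easily lose to a single processor.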
Performance: Machine Balance The last example points to the need to balance a machine and algorithm. What quantity are we tuning? The surface (communication) to volume (computation) ratio. Performance factors to consider via profiling: the time to compute a local sum over a local data set, and the time to send a single small message over the interconnect. Performance profiling will come into play when choosing CPU vs. CPU+GPGPU, e.g., adding a grid of 16 numbers on a quad-core CPU vs. a CPU+GPGPU.
Reflection Architectural Aspects Currently, caches are still a key performance enhancement for multiprocessor systems (just as for single-CPU systems). Caches require additional logic to keep functioning correctly and to provide determinism in the main memory of a compute node. Coherence protocols, and any other form of data transport between cores, require an interconnection network. At scale, all-to-all, bus-like structures are infeasible. Solution: novel topologies that sacrifice peak performance (average latency, bandwidth, contention characteristics, etc.) for the economic (and physical) factors underlying their design and manufacturing.
Reflection Multicore Considerations Interconnection networks are constrained more in the multicore context than in the large-scale SMP world. Why? The AMD Barcelona quad-core processor does utilize 11 Cu metal layers, but relative to the # of transistors in the two planar dimensions of the processor, the chip remains, for all intents and purposes, flat. Cramming a sophisticated, non-planar interconnection network into a limited number of layers is quite hard. (Caveat: proximity interconnect.) Thus, there is a limitation on the type of on-chip interconnect.
Reflection Multicore at Scale Life becomes more interesting as core counts continue to increase. Intel Terascale chip: 80 cores. Tilera: 64 reconfigurable cores based on a 2-D mesh topology. AMD/ATI HD 4870: 800 cores. NVIDIA GeForce GTX 280: 240 cores. Why not 256? :-) In the not-too-distant future, interconnect topology will be back in vogue for parallel computing. This concludes the architectural material; now on to...
Parallel Software: Correctness & Performance
Correctness Hardest aspect of parallel algorithm design and parallel programming? Writing programs that are correct. What good is a program that generates wrong answers faster? What do we mean by correctness? Traditionally, proving that a given algorithm produces the output that is desired. Example: Prim's algorithm produces a minimum spanning tree. Correctness means that the tree produced by Prim's algorithm is indeed a minimum spanning tree.
Underlying Assumption Traditional algorithms take the following for granted: the machine is deterministic, and only one flow of control is active at any given time. Nondeterminism only comes into play in a purely theoretical sense when talking about automata theory, NFAs vs. DFAs, and P vs. NP; that is not the sort of determinism we are talking about here. What are we talking about? When two uncoordinated flows of control interact with each other, there is no guarantee, without explicit guidance, that the relative effects and interactions of the multiple threads of control will happen in a predictable order.
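A minimal sketch of this kind of nondeterminism using POSIX threads (illustrative only; the final value of counter typically differs from run to run because the two unsynchronized increment loops interleave unpredictably):

    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;              /* shared, unprotected state */

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 1000000; i++)
            counter++;                    /* read-modify-write races with the other thread */
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        /* Would be 2000000 if the threads were coordinated; typically prints less. */
        printf("counter = %ld\n", counter);
        return 0;
    }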
Performance The holy grail of parallel computing: a parallel program should run at least as fast as the sequential equivalent for a fixed input size. One may use parallelism to increase the volume that can be computed, in which case comparisons of time are not as important (weak scaling).
Performance and Correctness (or Correctness and Performance?) Performance and correctness are often intimately coupled. Without protections in place, a program can run very quickly but suffer from severe correctness problems. Conversely, very conservative decisions can be made to ensure correctness, but at the cost of significant performance degradation. Example of this? There are also performance factors unrelated to the logic put in place to maintain determinism and correctness. Example: the granularity of computation and communication can be poorly chosen, resulting in abysmal performance.
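One sketch of the conservative end of the spectrum, using POSIX threads and a hypothetical histogram update: a single global lock is obviously correct but serializes updates that are actually independent, while per-bin locks preserve correctness and recover the parallelism.

    #include <pthread.h>

    #define NBINS 1024

    static double bins[NBINS];
    static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Conservative: one big lock guards all bins, so updates to different,
       independent bins still take turns -- always correct, but effectively serial. */
    static void add_to_bin_conservative(int b, double v)
    {
        pthread_mutex_lock(&global_lock);
        bins[b] += v;
        pthread_mutex_unlock(&global_lock);
    }

    /* Finer-grained alternative: one lock per bin, so threads updating different
       bins proceed in parallel while correctness is preserved. */
    static pthread_mutex_t bin_locks[NBINS];

    static void init_bin_locks(void)
    {
        for (int i = 0; i < NBINS; i++)
            pthread_mutex_init(&bin_locks[i], NULL);
    }

    static void add_to_bin_finer(int b, double v)
    {
        pthread_mutex_lock(&bin_locks[b]);
        bins[b] += v;
        pthread_mutex_unlock(&bin_locks[b]);
    }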