Delivering Scale Out Data Center Networking with Optics: Why and How
Amin Vahdat
Google / UC San Diego
vahdat@google.com
Vignette 1: Explosion in Data Center Buildout
Motivation
Blueprints for a 200k sq. ft. data center in Oregon
San Antonio Data Center
Chicago Data Center
Dublin Data Center
All Filled with Commodity Computation and Storage
Why the Cloud?
- Compute/storage 100x cheaper to operate at scale
  - Equipment costs 2-20x cheaper at scale
  - Energy 10-20x cheaper at scale
- Unless ms-level latency is critical, likely more efficient to perform computation and store data in the cloud
- Inter-cluster bandwidth requirements >>100x wide-area Internet bandwidth
Network Design Goals
- Scalable interconnection bandwidth
  - Full bisection bandwidth between all pairs of hosts
  - Aggregate bandwidth = # hosts × host NIC capacity
- Economies of scale
  - Price/port constant with the number of hosts
  - Must leverage commodity merchant silicon
- Anything anywhere
  - Don't let the network limit the benefits of virtualization
- Management
  - Modular design
  - Avoid actively managing 100s-1000s of network elements
Scale Out Networking
- Advances toward scale out computing and storage
  - Aggregate computing and storage grows linearly with the number of commodity processors and disks
  - A "small matter of software" to enable functionality
  - The alternative is scale up, where weaker processors and smaller disks are replaced with more powerful parts
- Today, no technology for scale out networking
  - Modules to expand the number of ports or aggregate bandwidth
  - No management of individual switches, VLANs, subnets
The Future Internet
- Applications and data will be partitioned and replicated across multiple data centers
  - 99% of compute, storage, and communication will be inside the data center
  - Data center bandwidth exceeds that of the access network
- Data sizes will continue to explode
  - From click streams, to scientific data, to user audio, photo, and video collections
- Individual user requests and queries will run in parallel on thousands of machines
- Back-end analytics and data processing will dominate
Implications of Cloud Computing on Internet Architecture
- How do we move from delivering switches to delivering fabrics?
  - Cisco, Juniper, HP, Brocade, ...
- Should storage consist of SANs and centralized servers?
  - EMC, Network Appliance, ...
- What role will optics play in future data centers?
  - (your company name here)
Emerging Rack Architecture
[Diagram: multi-core processors with per-core caches and a shared L3 cache; DRAM (10s of GB) over DDR3-1600 (100 Gb/s, ~5 ns); PCIe 3.0 x16 (128 Gb/s, ~250 ns) to TBs of storage and 2x 10GigE NICs; 24-port 40 GigE switch with 150 ns latency]
Can we leverage emerging merchant switches and newly proposed optical transceivers and switches to treat the entire data center as a single logical computer?
Amdahl's (Lesser Known) Law
- Balanced systems for parallel computing (late 1960s): for every 1 MHz of processing power, must have
  - 1 MB of memory
  - 1 Mbit/sec of I/O
- Fast forward to 2012
  - 4x 2.5 GHz processors, 8 cores: 30-60 GHz of processing power (not that simple!)
  - 24-64 GB memory
  - But 1 Gb/sec of network bandwidth??
- Deliver 40 Gb/s of bandwidth to 100k servers?
  - 4 Pb/sec of bandwidth required today
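(Note: a quick back-of-the-envelope check of the balance argument above, in Python; the ~40 GHz figure is just the midpoint of the 30-60 GHz range quoted on the slide.)

```python
# Amdahl's balanced-system rule of thumb: ~1 MB of memory and
# ~1 Mbit/s of I/O for every 1 MHz of processing power.

def balanced_requirements(cpu_mhz):
    """Return (memory_MB, io_mbps) for a balanced system."""
    return cpu_mhz, cpu_mhz  # 1 MB and 1 Mbit/s per MHz

# 2012-era server from the slide: ~30-60 GHz of aggregate processing
# power but only ~1 Gb/s of provisioned network bandwidth.
cpu_mhz = 40_000                       # ~40 GHz, midpoint of the slide's range
mem_mb, io_mbps = balanced_requirements(cpu_mhz)
print(f"balanced memory: {mem_mb / 1000:.0f} GB")     # ~40 GB (slide: 24-64 GB)
print(f"balanced I/O:    {io_mbps / 1000:.0f} Gb/s")  # ~40 Gb/s vs ~1 Gb/s provisioned

# Fabric implication: 40 Gb/s to each of 100k servers.
servers, per_server_gbps = 100_000, 40
print(f"aggregate: {servers * per_server_gbps / 1e6:.0f} Pb/s")  # 4 Pb/s, as on the slide
```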
Targets
- 100k server data center
  - ~1M cores, 1 PB DRAM, 1 EB storage
- Require: 4 Pb/sec network switch to make the data center appear as a uniform computer
  - <500 ns latency to any host in the data center
- Can we effect a 40x bandwidth increase per host?
  - From each rack: 1.6 Tb/sec of traffic
  - From each row: 20 Tb/sec of traffic
- If $100 per SFP+ transceiver plus fiber, then >$3000/server just for the interconnect in a 7-hop network
  - Building-scale 40 Gb/s means $2000 transceivers?
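(Note: one hedged reading of the ">$3000/server" figure; the slide does not spell out the link accounting, so the 3-links-per-host fat-tree assumption below is an illustration only.)

```python
# Back-of-the-envelope transceiver cost per server (illustrative; the
# exact accounting behind ">$3000/server" is not given on the slide).

SFP_PLUS_COST = 100       # $ per 10G SFP+ transceiver (from the slide)
LINK_RATE_GBPS = 40       # target per-host bandwidth
LANES_PER_LINK = LINK_RATE_GBPS // 10   # 4 x 10G lanes per 40G link
ENDS_PER_LINK = 2         # one transceiver at each end of a fiber

# In a full-bisection 3-tier fat tree, the fabric carries roughly
# 3 links per host (host-edge, edge-aggregation, aggregation-core).
LINKS_PER_HOST = 3

transceivers_per_host = LINKS_PER_HOST * ENDS_PER_LINK * LANES_PER_LINK
cost_per_host = transceivers_per_host * SFP_PLUS_COST
print(f"{transceivers_per_host} transceivers/host -> ${cost_per_host}/server")
# ~24 transceivers -> ~$2400/server before fiber and connectors, which is
# consistent with the slide's ">$3000/server" once cabling is included.
```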
Balanced Systems Really Do Matter
- Balancing network and I/O results in huge efficiency improvements
- How much is a factor of 100 improvement worth in terms of cost?
- TritonSort: A Balanced Large-scale Sorting System, Rasmussen et al., NSDI 2011

  System              | Duration | Aggr. Rate | Servers | Rate/server
  --------------------|----------|------------|---------|------------
  Yahoo (100 TB)      | 173 min  | 9.6 GB/s   | 3452    | 2.8 MB/s
  TritonSort (100 TB) | 107 min  | 15.6 GB/s  | 52      | 300 MB/s
TritonSort Results
- Hardware
  - HP DL-380 2U servers: 8x 2.5 GHz cores, 24 GB RAM, 16x 500 GB disks, 2x 10 Gb/s Myricom NICs
  - 52-port Cisco Nexus 5020 switch
- Results 2010
  - GraySort: 100 TB in 123 min / 48 nodes, 2.3 Gb/s/server
  - MinuteSort: 1014 GB in 59 s / 52 nodes, 2.6 Gb/s/server
- Results 2011
  - GraySort: 100 TB in 107 min / 52 nodes, 2.4 Gb/s/server
  - MinuteSort: 1353 GB in 1 min / 52 nodes, 3.5 Gb/s/server
  - JouleSort: 9700 records/joule
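(Note: as a sanity check, the per-server rates above can be reproduced from the headline numbers, assuming decimal TB/GB and rate = dataset size / duration / node count.)

```python
# Reproduce the quoted per-server rates from dataset size, duration,
# and node count (decimal units).

def per_server_gbps(bytes_total, seconds, nodes):
    return bytes_total * 8 / seconds / nodes / 1e9

print(per_server_gbps(100e12, 123 * 60, 48))   # 2010 GraySort   -> ~2.3 Gb/s
print(per_server_gbps(1014e9, 59, 52))         # 2010 MinuteSort -> ~2.6 Gb/s
print(per_server_gbps(100e12, 107 * 60, 52))   # 2011 GraySort   -> ~2.4 Gb/s
print(per_server_gbps(1353e9, 60, 52))         # 2011 MinuteSort -> ~3.5 Gb/s
```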
Driver: Nonblocking Multistage Datacenter Topologies
M. Al-Fares, A. Loukissas, A. Vahdat. A Scalable, Commodity Data Center Network Architecture. In SIGCOMM '08.
[Figure: fat-tree topology built from k=4-port switches, n=3 levels]
Problem: 10 Tons of Cabling
- 55,296 Cat-6 cables
- 1,128 separate cable bundles
- If optics were used for transport, transceivers would be ~80% of the cost of the interconnect
- "The Yellow Wall"
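(Note: these counts are consistent with the k=48 fat tree from the SIGCOMM '08 paper; the sketch below reproduces them from the standard fat-tree formulas, which is an inference rather than something shown on the slide.)

```python
# Fat-tree sizing with k-port switches (Al-Fares et al., SIGCOMM '08).
# Treating the 55,296 figure as the switch-to-switch links of the k=48
# design is an inference, not stated on the slide.
from math import comb

def fat_tree(k):
    hosts   = k**3 // 4
    edge_sw = k**2 // 2
    agg_sw  = k**2 // 2
    core_sw = k**2 // 4
    # Each edge switch has k/2 uplinks, and likewise each aggregation switch.
    edge_agg = edge_sw * (k // 2)
    agg_core = agg_sw * (k // 2)
    return hosts, edge_sw + agg_sw + core_sw, edge_agg + agg_core

hosts, switches, interswitch_links = fat_tree(48)
print(hosts)               # 27,648 hosts
print(switches)            # 2,880 48-port switches
print(interswitch_links)   # 55,296 switch-to-switch cables (as on the slide)
print(comb(48, 2))         # 1,128 -- one bundle per pair of pods would
                           # give the slide's bundle count (an assumption)
```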
Data Center Networking Research
- Switch architecture [SIGCOMM '08]
- Cabling, merchant silicon [Hot Interconnects '09]
- Virtualization, Layer 2, management [SIGCOMM '09, SOCC '11a]
- Routing/forwarding [NSDI '10]
- Hybrid optical/electrical switch [SIGCOMM '10, SOCC '11b]
- Energy, applications, transport [ongoing]
- Take advantage of the regularity of the fat-tree structure to simplify protocol design and improve performance
- Generalize to arbitrary topologies
Vignette 2: Optics in the Data Center
Fiber Optic Communication Technologies: What's Needed for Datacenter Network Operations
Cedric F. Lam et al., Google Inc., IEEE Communications Magazine, July 2010
- Ideal: network connects every server to every other server
  - Application software not concerned with locality
  - Problem: prohibitively expensive
- Integration to improve port density and system throughput
  - Power of the modules should also scale so that the total power dissipation per unit space does not change
- Datacenter operators' request to optics: scale speed up by a factor of 4 with the same power and space
  - Oh, and reduce cost by a factor of 10
  - All that can be offered in return is a factor of 100x in volume?
A Modular Data Center
Question: How to use optics to switch between the boxes?
[Illustration: Bryan Christie Design. From: R. Katz, "Tech Titans Building Boom," IEEE Spectrum, Feb. 2009]
Hybrid Optical/Electrical Switching?

  Property  | Electrical Packet Switch | Optical Circuit Switch
  ----------|--------------------------|-----------------------
  Cost      | $500/port                | $500/port
  Power     | 12 W/port                | 240 mW/port
  Rate      | 10 Gb/s fixed            | Rate agnostic
  Switching | Per-packet               | 12 ms switching time

Mixing both types of switches in the same network allows one type of switch to compensate for the weaknesses of the other. The optimal ratio of packet switches to circuit switches depends on the data center workload (see the sketch below).

Helios: A Hybrid Electrical/Optical Switch Architecture for Modular Data Centers. Farrington, Porter, Radhakrishnan, Bazzaz, Subramanya, Fainman, Papen, Vahdat. Proceedings of ACM SIGCOMM 2010.
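(Note: a toy model of the cost/power tradeoff using only the per-port figures above; the port count is hypothetical and the model deliberately ignores transceivers, WDM gear, and workload effects.)

```python
# Toy model of a hybrid core: a fraction f of core bandwidth is served by
# optical circuit switch (OCS) ports and the rest by electrical packet
# switch (EPS) ports, using the per-port figures on this slide.
# Illustrative only -- it ignores transceivers, cabling, and the fact that
# circuit-switched bandwidth is only usable for stable, large flows.

EPS = {"cost": 500, "power_w": 12.0}      # $/port, W/port
OCS = {"cost": 500, "power_w": 0.240}     # $/port, W/port

def core_cost_power(total_ports, circuit_fraction):
    eps_ports = total_ports * (1 - circuit_fraction)
    ocs_ports = total_ports * circuit_fraction
    cost  = eps_ports * EPS["cost"]    + ocs_ports * OCS["cost"]
    power = eps_ports * EPS["power_w"] + ocs_ports * OCS["power_w"]
    return cost, power

for f in (0.0, 0.5, 0.9):
    cost, power = core_cost_power(total_ports=65_536, circuit_fraction=f)
    print(f"{int(f * 100):>3}% circuit-switched: ${cost / 1e6:.1f}M, {power / 1e3:.1f} kW")
# Per-port cost is identical here, but power drops sharply as more of the
# core moves to circuit switching; the workload determines how much of the
# traffic can actually tolerate 12 ms reconfiguration.
```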
Helios Example: Cost, Power, and Cabling
[Figure: N electrical k-port pod packet switches connected across the bisection to k traditional N-port electrical core packet switches; in the Helios case, most core packet switches are replaced by optical core circuit switches, requiring fewer core switches]
Example: N = 64 pods, k = 1024 hosts/pod = 64K hosts total; 8 wavelengths

  Metric | 10% Static (10:1 oversubscribed) | 100% Static | Helios: 10% static + 90% dynamic
  -------|----------------------------------|-------------|---------------------------------
  Cost   | $6.3 M                           | $62.2 M     | $22.1 M (2.8x less)
  Power  | 96.5 kW                          | 950.3 kW    | 157.2 kW (6.0x less)
  Cables | 6,656                            | 65,536      | 14,016 (4.7x less)
Superlinks
- Each 10 Gb/s transceiver in a LAG* uses a non-overlapping wavelength
- This LAG, called a superlink, can fit onto a single fiber pair
- Superlinks are transparent to the pod switches, which deal only with LAGs
[Figure: pod switch -> 8 wavelengths -> WDM MUX -> optical circuit switch -> WDM DEMUX]
* IEEE 802.1AX-2008 Link Aggregation Group
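(Note: a minimal sketch of why superlinks cut fiber and OCS port counts, with a hypothetical 64-uplink pod switch; the quantitative "4-8 wavelengths" conclusion on the next slide comes from the Helios cost/power analysis, not from this sketch.)

```python
# With w wavelengths per superlink, a bundle of `uplinks` 10G transceivers
# needs uplinks/w fiber pairs and OCS ports instead of `uplinks`.

uplinks = 64   # hypothetical 10G uplinks per pod switch

for w in (1, 2, 4, 8, 16):
    fibers = uplinks // w          # fiber pairs (and OCS ports) needed
    print(f"{w:>2} wavelengths/superlink -> {fibers:>2} fiber pairs per pod")
# 1 -> 64, 2 -> 32, 4 -> 16, 8 -> 8, 16 -> 4: the absolute savings shrink
# quickly past ~8 wavelengths, while denser WDM optics cost more, which is
# consistent with most of the benefit coming from 4-8 wavelengths.
```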
Simulation Study @ 10 Gb/s
- Conclusion: most of the benefit (cost, power, # ports) comes from aggregating 4-8 wavelengths
- Less expensive coarse WDM can be used
Helios Example: Aggregation in Action
[Figure: Pods 1-4 connected through WDM mux/demux to the core circuit switch]
Helios Example: Small HTTP Request
[Figure: a small HTTP request between Pod 1 and Pod 2 is forwarded over the core packet switch]
Helios Example: Large VM Migration
[Figure: a large VM migration between Pod 1 and Pod 2 is identified as an elephant flow, and a circuit is established between Pod 1 and Pod 2 over the core circuit switch]
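(Note: a simplified sketch of the control idea the last two examples illustrate. The real Helios topology manager estimates inter-pod traffic demand and computes a circuit assignment; the threshold-based classifier and 100 MB cutoff below are illustrative assumptions, not the Helios implementation.)

```python
# Small/short flows stay on the core packet switch; large, stable inter-pod
# flows ("elephants") trigger circuit setup on the optical circuit switch.

ELEPHANT_BYTES = 100 * 1024 * 1024   # hypothetical elephant threshold: 100 MB

class HybridCore:
    def __init__(self):
        self.circuits = set()        # established (pod_a, pod_b) circuits

    def route(self, src_pod, dst_pod, flow_bytes):
        pair = tuple(sorted((src_pod, dst_pod)))
        if flow_bytes >= ELEPHANT_BYTES:
            if pair not in self.circuits:
                self.circuits.add(pair)   # reconfigure the OCS (~ms timescale)
            return "optical circuit switch"
        return "core packet switch"

core = HybridCore()
print(core.route(1, 2, flow_bytes=8 * 1024))        # small HTTP request -> packet switch
print(core.route(1, 2, flow_bytes=4 * 1024**3))     # VM migration -> circuit switch
```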
Optical Requirements for the Data Center
Optical Circuit Switches
- Lower cost: lots of opportunities to insert OCS into the data center, but $500/port is simply not compelling
- Larger scale: 100-300 ports is typically inadequate for data center applications
- Lower insertion loss: 3-5 dB precludes many deployment scenarios
- Faster switching times: hardware and software; 20 ms is a starting point, but targeting 10-100 us could be transformative
WDM Transceivers
- VCSEL+MMF a nonstarter for Pb -> Xb networks
- Power consumption: electrical switch density limited by transceiver power
- Optical link budget: light will pass through multiple (passive) devices at distances approaching 1-2 km
- Bandwidth: LR-4 just now becoming available, at too high a cost
  - Need 10x10, 12x10, 16x10 for effective muxing of 40GE
- Spectral efficiency: within a building, trade spectral efficiency for cost/power
Conclusions
- The vast majority of computation, storage, and communication will take place in data centers in the years ahead
- Building balanced systems today requires Pb/sec networks; Exabit/sec networks are not far behind
- Tremendous opportunity for optics to define what it means to perform communication in the data center