Scalable Data Center Networking
Amin Vahdat
Computer Science and Engineering, UC San Diego
vahdat@cs.ucsd.edu
Center for Networked Systems
20 across CSE, ECE, and SDSC
CNS Project Formation
Member companies and center faculty bring research interests and project proposals, which become diverse research projects: multiple faculty, multiple students, multidisciplinary, organized under a CNS research theme.
An Extraordinarily Brief History of Communication
3500-2900 BC: various inventions of the alphabet
900 BC: first postal service in China
776 BC: first recorded use of homing pigeons to send messages
530 BC: first library
~500 BC: papyrus, a portable and light writing surface
37 AD: first optical network; Romans use mirrors
305 AD: first wooden printing press in China
1455: first printing press with metal movable type
1831: electric telegraph
1876: telephone invented
Source: Livinginternet.com
Vannevar Bush
Summary: Vannevar Bush established the U.S. military/university research partnership that later developed the ARPANET.
Quote: "Consider a future device for individual use, which is a sort of mechanized private file and library. It needs a name, and to coin one at random, 'memex' will do. A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility. It is an enlarged intimate supplement to his memory. It consists of a desk, and while it can presumably be operated from a distance, it is primarily the piece of furniture at which he works. On the top are slanting translucent screens, on which material can be projected for convenient reading. There is a keyboard, and sets of buttons and levers. Otherwise it looks like an ordinary desk."
Vannevar Bush, "As We May Think," Atlantic Monthly, July 1945
J. C. R. Licklider
Summary: Joseph Carl Robnett "Lick" Licklider developed the idea of a universal network, spread his vision throughout the IPTO, and inspired his successors to realize his dream by creating the ARPANET.
Quote: "It seems reasonable to envision, for a time 10 or 15 years hence, a 'thinking center' that will incorporate the functions of present-day libraries together with anticipated advances in information storage and retrieval. The picture readily enlarges itself into a network of such centers, connected to one another by wide-band communication lines and to individual users by leased-wire services. In such a system, the speed of the computers would be balanced, and the cost of the gigantic memories and the sophisticated programs would be divided by the number of users."
J. C. R. Licklider, "Man-Computer Symbiosis," 1960
Source: Livinginternet.com
1969 Internet Map
Back to the Future: Cloud Computing
The personal computing revolution of the 1980s put a PC on every desktop
Client/server computing to control distribution of data
Management, energy, security, and consistency costs quickly overwhelmed the cost of the hardware
Bursty resource requirements led to 1-10% utilization
Berkeley NOW: use idle cycles to build a supercomputer
Trends and enabling technologies: utility computing, Software as a Service (SaaS), ubiquitous wireless coverage, multi-gigabit optical pipes, virtualization, malware/botnets
Cloud Computing
Third-party companies provide storage and computing on demand
Statistical multiplexing and virtualization enable efficient utilization of the underlying resources
Companies and individuals pay only for what they consume
Applications and operating systems are centrally managed
Data and applications are available from a variety of devices and in a variety of places
Data is automatically backed up and made consistent
Cloud Computing @ UCSD
WebOS: Rent-A-Server [HPDC98]
Continuous consistency in support of replication: TACT [OSDI00, SOSP01, TOCS02, TOCS04]
Virtualization: virtual clusters [LISA07], memory management [OSDI08]
Large-scale testing: DieCast [NSDI06, NSDI08]
PlanetLab/GENI: resource peering [SOSP03], workload characterization [USENIX06], service discovery [HPDC05], Plush application management [LISA07]
Cloud Computing: Two Questions
Starting point: computing and storage are increasingly delivered by dense data centers
How do we program multi-data center applications?
Bottom line: applications are built on top of data structures
How do you partition and replicate data structures across and within data centers for target levels of performance, availability, and consistency?
How do we interconnect individual data centers?
100,000+ ports within a single data center, 10 Gb/s per port
How do you build a petabit/sec non-blocking switch?
Life of a Social Networking Request
120M+ users organized into a graph
Incoming request for user Alice
Cookie hashes to a handle for Alice's profile
Retrieve information from Alice's profile: picture, status, handles to friends, location, etc.
Retrieve information from friends' profiles
Recent information from queues
Retrieve recent information from news feeds (linked lists)
Each request maps to ~1,000 machines (a minimal sketch of this fan-out follows)
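A minimal sketch of the lookup path described above, assuming a simple hash-based placement of profiles across the serving tier. The server names, hashing scheme, and data layout are illustrative, not the deployed system.

```python
import hashlib

STORAGE_SERVERS = [f"cache-{i:04d}" for i in range(1000)]  # ~1,000 machines touched per request

def server_for(user_id: str) -> str:
    """Hash a user handle to the machine assumed to hold that user's profile shard."""
    digest = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return STORAGE_SERVERS[digest % len(STORAGE_SERVERS)]

def handle_request(user_id: str, get_profile) -> dict:
    """Fetch the user's profile, then fan out to every friend's profile."""
    profile = get_profile(server_for(user_id), user_id)  # picture, status, friend handles, ...
    friends = [get_profile(server_for(f), f) for f in profile["friends"]]
    return {"profile": profile, "friend_updates": friends}

# Toy usage: an in-memory dict stands in for the storage tier.
store = {"alice": {"status": "hi", "friends": ["bob"]},
         "bob":   {"status": "yo", "friends": []}}
print(handle_request("alice", lambda server, uid: store[uid]))
```

The point of the sketch is the fan-out: one front-end request turns into many small lookups spread across the cluster, which is why east-west bandwidth inside the data center matters.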
Life of a Social Networking Request: Backend
Petabytes of data generated in the form of click-streams
Significant amounts of user data to be indexed
Advertising placement based on user access patterns and user profiles
A large-scale data processing effort is needed to process this data appropriately
Emerging data processing model: MapReduce
All-to-all communication among tens of thousands of machines (a toy example follows)
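A toy MapReduce-style word count to make the communication pattern concrete. The shuffle step is the all-to-all exchange mentioned above: every map task may send data to every reduce task. Names and partitioning are illustrative, not any particular framework's API.

```python
from collections import defaultdict

def map_phase(doc: str):
    for word in doc.split():
        yield word, 1

def shuffle(mapped_pairs, num_reducers: int):
    # Each (key, value) pair is routed to a reducer by hashing the key; with
    # thousands of mappers and reducers this induces all-to-all traffic.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in mapped_pairs:
        partitions[hash(key) % num_reducers][key].append(value)
    return partitions

def reduce_phase(partition):
    return {key: sum(values) for key, values in partition.items()}

docs = ["the quick brown fox", "the lazy dog jumps over the fox"]
mapped = (pair for doc in docs for pair in map_phase(doc))
counts = {}
for part in shuffle(mapped, num_reducers=4):
    counts.update(reduce_phase(part))
print(counts)  # e.g. {'the': 3, 'fox': 2, ...}
```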
Scalable Data Center Networking
Motivation
Commoditization in the data center: inexpensive, commodity PCs and storage devices, but the network is still highly specialized
A data center is not a small Internet: one admin domain, not adversarial, limited policy routing, etc.
Bandwidth is often the bottleneck for cloud computing, service-oriented architectures, and data analysis (MapReduce)
Network Design Goals
Scalable interconnection bandwidth: full bisection bandwidth between all pairs of hosts
Aggregate bandwidth = number of hosts × per-host NIC capacity (worked example below)
Economies of scale: total price linear in the number of hosts, i.e., a flat price per port
Single network fabric: support Ethernet and IP without end-host modifications
Management: modular design; avoid actively managing 100s-1,000s of network elements
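A quick back-of-the-envelope check of the aggregate-bandwidth goal. The host count and NIC speed are illustrative, borrowed from the 48-port fat tree discussed later, not requirements.

```python
# Aggregate bandwidth = number of hosts x per-host NIC capacity.
hosts = 27_648          # k = 48 fat tree: k^3/4 hosts
nic_gbps = 10           # 10 GigE to the edge
aggregate_tbps = hosts * nic_gbps / 1_000
print(f"Full bisection bandwidth target: {aggregate_tbps:.1f} Tb/s")  # ~276.5 Tb/s
```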
Current Data Center Topologies
Edge hosts connect to 1G Top of Rack (ToR) switches
ToR switches connect to 10G End of Row (EoR) switches
In large clusters, EoR switches connect to 10G core switches
Oversubscription of 2.5:1 to 8:1 is typical in design guidelines (worked example below)
There is no story for what happens as we move to 10G at the edge
Key challenges: performance, cost, routing, energy, cabling
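Illustrative oversubscription arithmetic. The rack geometry (40 hosts with 1 G NICs behind a ToR with a single 10 G uplink) is an assumption chosen to land in the range quoted above.

```python
hosts_per_rack = 40
host_gbps = 1
uplinks = 1
uplink_gbps = 10

# Oversubscription ratio = total host-facing bandwidth / total uplink bandwidth.
ratio = (hosts_per_rack * host_gbps) / (uplinks * uplink_gbps)   # 4.0
per_host_mbps = 1000 * host_gbps / ratio                         # 250 Mb/s
print(f"Oversubscription {ratio:.1f}:1 -> ~{per_host_mbps:.0f} Mb/s per host leaving the rack")
```

Under full load, each host in this rack can sustain only about a quarter of its NIC capacity toward the rest of the cluster, which is exactly the gap full bisection bandwidth is meant to close.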
Data Center Network Economics
10x commodity edge switches: $100/end host, low margins
1x commodity core switches: $1,000-$4,000/end host, high margins
Force10 Study: Data Center Pricing
$4,000/port for switches in a 1,000-node data center!
Taken from the Force10 Networks TeraScale E-Series brochure
Cost of Data Center Networks
[Figure: network cost (USD millions, $0-$30) versus number of hosts (0-25,000) for a traditional hierarchy at 100% bisection bandwidth, a traditional hierarchy at 33% bisection bandwidth, and a fat tree at 100% bisection bandwidth]
Factor of 10+ price difference between the traditional approach and the proposed architecture
Scalability Using Identical Network Elements
[Figure: fat tree built from 4-port switches; a core layer connects Pods 0 through 3]
This example supports 16 hosts organized into 4 pods; each pod is a 2-ary 2-tree, with full bandwidth among hosts directly connected to the same pod.
There is full bisection bandwidth at each level of the fat tree, so the topology is rearrangeably nonblocking; the entire fat tree is a 2-ary 3-tree.
In general, 5k^2/4 k-port switches support k^3/4 hosts; with 48-port switches, that is 27,648 hosts using 2,880 switches (see the sanity check below). Critically, the approach scales to 10 GigE at the edge.
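A short sanity check of the sizing formulas above, evaluated for the 4-port example and the 48-port case from the slide.

```python
# A k-ary fat tree uses 5k^2/4 identical k-port switches
# (k^2/4 core + k^2/2 aggregation + k^2/2 edge)
# and supports k^3/4 hosts at full bisection bandwidth.
def fat_tree_size(k: int) -> tuple[int, int]:
    switches = 5 * k * k // 4
    hosts = k ** 3 // 4
    return switches, hosts

for k in (4, 48):
    switches, hosts = fat_tree_size(k)
    print(f"k = {k:2d}: {switches} switches, {hosts} hosts")
# k =  4: 20 switches, 16 hosts
# k = 48: 2880 switches, 27648 hosts
```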
The regular structure simplifies the design of network protocols and opens opportunities in performance, cost, energy, fault tolerance, incremental scalability, etc.
Why Hasn't This Been Done Before?
Needs to be backward compatible with IP/Ethernet
Existing routing protocols do not work for a fat tree
Cabling explosion at each level of the fat tree: tens of thousands of cables running across the data center?
Management: thousands of individual elements that must be programmed individually