Optimizing Data Center Networks for Cloud Computing

Data center networks have evolved over time as the nature of computing has changed: they adapted to computing models based on mainframes, client-server computing, and Internet computing. As the era of cloud computing progresses, the data center network model used for servicing clouds needs to change again, so that it supports the characteristics specific to cloud computing in the most optimal manner. The characteristics of a cloud that require a rethink of the data center network model include:

- Use of a large number of virtual machines (VMs)
- Dynamic movement of VMs from one physical machine to another
- Heavy inter-server traffic within the data center
- Use of commodity hardware for switches and routers
- The need to accommodate high-bandwidth, low-latency traffic

This paper discusses some improvements needed in data center networking to support the unique characteristics and requirements of cloud computing. The paper is an overview of what is needed; a comprehensive treatment of the subject is available through a consulting engagement with Pramak.

Characteristics of cloud computing

Cloud computing can be defined as providing a set of standardized IT services to one or more organizations (tenants or clients of the cloud), accessible from anywhere on the Internet. It provides scalability, elasticity, agility, reliability, cost savings, and other advantages to an organization. A cloud has a massive density of resources, providing massive scale and low latency to its tenants. It does this through the use of general-purpose commodity computing resources that can be aggregated or clustered as needed to provide higher computing capacity for individual clients. There is also heavy use of virtualization (virtual machines, virtual storage, and virtual networking) to use available capacity effectively and to provide reliability and resiliency while cutting costs. Virtualization makes it possible to maintain a pool of resources that can be moved around quickly and transparently, without the client knowing about it. Multiple tiers of the same customer's service could be co-located on the same rack of machines; likewise, multiple tenants could share the same rack or even the same blade.

Elasticity is achieved by adjusting the resources allotted to a client based on demand. Agility and reliability are achieved through dynamic adjustments: adding, reducing, or moving resources seamlessly and transparently in case of interruptions and terminations; performing background replication and backups to handle faults and disasters; applying updates when needed without any interruption to the service being provided; and so on.

Hosting multiple tenants, large and small, and providing the above capabilities results in much more server-to-server traffic in the data center than exists in the traditional siloed IT architecture of a single organization. There is a need to support high-bandwidth, low-latency networking between servers and the services hosted on them.
The traditional data center architecture, with its hierarchy of switches and routers, each adding latency to inter-server traffic, does not do particularly well in this kind of environment. Further, the use of commodity networking gear such as switches and routers, combined with the differing traffic requirements of the various tenants' services hosted on the same rack or set of racks, can lead to the traffic of one tenant adversely affecting that of another for an unduly long time. This can result in violations of the SLAs promised by the tenant to its customers.

In light of the above, the data center network model requires changes. Some of these changes are discussed below.

Optimized Data Center Network

The following sections discuss changes to the data center network that will optimize it for supporting cloud computing. In some cases these changes have to be supported by corresponding changes in the software using and managing the network. Some of the changes can be made expeditiously with standardized hardware and software, while others may require the use of specialized hardware and software. Such specialized hardware and software is very likely to become standardized over time.

Flattened and virtualized network

The traditional siloed IT data center network has multiple layers of switches and routers. This is because such networks carry traffic mostly up and down (north-south) rather than sideways (east-west); compared to cloud data centers, there is relatively little inter-server traffic within the data center. This is one of the big differences between the two types of data centers.

There are usually three layers, or tiers, in a traditional data center. The first is the access tier, sitting directly above the servers; it is comprised of the access switches to which the servers in the data center are connected. Next is the aggregation tier, comprised of switches that aggregate traffic from a set of access switches. Finally, the core tier is comprised of the routers/switches to which the aggregation switches are connected. While the access and aggregation switches do layer 2 switching to bridge traffic between servers on the same subnet, the core routers do layer 3 forwarding between servers on different subnets.

Because of the heavy inter-server traffic in cloud data centers and the need to move VMs transparently and quickly between physical machines/blades and racks, it is better to have a flattened network than a tall hierarchical one. A flattened network lets inter-server traffic travel through fewer switches and routers, each of which adds a per-hop delay. There is also a need for virtual switching, so that traffic between virtual machines on the same physical machine (for example, from a web server to an application server) or between two blades of a rack can be monitored for security purposes.

This pattern of activity, common in a cloud data center, argues for merging the core and aggregation tiers of the data center network into a single tier. Doing so reduces the number of physical tiers of switches/routers that traffic may have to traverse, whether going north-south or east-west, from three to two. We also need a virtual switch, using a virtual firewall and other security and flow-monitoring software, to handle traffic between the blades of a single rack.

In this flattened and virtualized model of networking, traffic between racks of blades will go through one or two tiers of switches/routers and so will have lower latency than it would with the traditional data center network model. Even better, traffic between blades on a single rack, which in many cases accommodates all tiers of the software architecture of a customer's service (since a tenant's services will likely be co-located on the same rack for performance reasons), will go through a virtual switch only. This will reduce latency to a significant degree for most tenants' traffic.
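To make the hop arithmetic concrete, the toy model below compares the number of forwarding devices traversed in a three-tier design against the flattened two-tier design described above. It is a minimal sketch; the hop counts assume worst-case paths, and the per-hop latency figure is an illustrative assumption, not a measurement.

```python
# Toy model: devices traversed by a flow in a traditional three-tier
# network versus a flattened two-tier design with virtual switching.
# The per-hop latency is an assumed, illustrative figure.

HOP_LATENCY_US = 10  # assumed per-device forwarding latency (microseconds)

def hops(design: str, scope: str) -> int:
    """Worst-case number of switching/routing devices traversed."""
    table = {
        ("three-tier", "intra-rack"): 1,  # virtual/access switch only
        ("three-tier", "inter-rack"): 5,  # access -> aggregation -> core -> aggregation -> access
        ("two-tier",   "intra-rack"): 1,  # virtual switch only
        ("two-tier",   "inter-rack"): 3,  # access -> merged core/aggregation -> access
    }
    return table[(design, scope)]

for design in ("three-tier", "two-tier"):
    for scope in ("intra-rack", "inter-rack"):
        h = hops(design, scope)
        print(f"{design:10s} {scope:10s}: {h} hops, ~{h * HOP_LATENCY_US} us added")
```

The win is largest for inter-rack traffic, while intra-rack traffic, already the common case when a tenant's tiers share a rack, stays at a single virtual-switch hop.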
Large layer 2 domains

A cloud provider should be able to move VMs between blades or racks transparently and quickly, without impacting applications or users. This is critical for business continuity, fault tolerance, and routine maintenance. It calls for large layer 2 domains, which offer high throughput and low latency. Moving VMs seamlessly and quickly between machines requires that the storage and compute resources be in close proximity and that the switching/routing be fast, with minimal per-hop delays. Virtual networks built from virtual and physical switches, discussed briefly in the previous section, can create large layer 2 domains in which VMs can be moved from one machine to another seamlessly, quickly, and transparently.

Higher bandwidth at the server edge

Because of the high machine density that results from many VMs running on a physical machine and many machines packed together in racks, there is a need to support high bandwidth at the server edge. This requires high-performance, high-port-density switches in the network. The use of 10 Gigabit Ethernet ports in access switches will be common very soon.

Explicit Congestion Notification

In a multi-tenant situation, the traffic patterns of different customers can differ: one tenant may use short, low-latency connections while another uses long-lived, high-bandwidth connections. It is important that congestion in switches and routers not impact tenants with low-latency, short-lived connections to the point where those tenants' connections are starved of bandwidth.

This requires that the switches and routers support the Explicit Congestion Notification (ECN) field in the IP header of the packets they switch and route, and that the TCP/IP stack in the VMs leverage this facility. That would be a good start, but one can do even better. At least one cloud provider is moving to a better scheme that can be leveraged in a data center network. It is encouraging network vendors to make slight changes to their switches/routers, while making corresponding changes to its own TCP/IP stack, so that transmissions on the network respond quickly to a hint of congestion rather than to congestion after it has already occurred. The modification required in the switches/routers is that they signal early, when their buffers reach a certain threshold of use, rather than signaling only when buffers are nearly exhausted, as is often the case in plain-vanilla ECN deployments. The transmitting machine's TCP/IP stack then responds by slowing its flow of packets appropriately. With this scheme, traffic flow on the network is much smoother, leading to overall improved throughput and better, more predictable service for all clients.
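The scheme described above is essentially what has been published as Data Center TCP (DCTCP): switches mark packets with the ECN Congestion Experienced codepoint once their queue crosses a threshold, and the sender cuts its congestion window in proportion to the fraction of marked acknowledgments rather than halving it. The sketch below illustrates that sender-side reaction; the gain constant and the simplified window update are assumptions for illustration, not a reproduction of any vendor's actual stack.

```python
# Sketch of a DCTCP-style sender reaction to ECN marks (illustrative only).
# Standard TCP halves its window on any congestion signal; this sender
# instead scales the cut by alpha, a moving estimate of the fraction of
# packets that were ECN-marked in the last window.

G = 1.0 / 16.0  # gain for the moving average (assumed value)

class DctcpStyleSender:
    def __init__(self, cwnd_packets: float = 100.0):
        self.cwnd = cwnd_packets
        self.alpha = 0.0  # estimated fraction of marked packets

    def on_window_acked(self, acked: int, ecn_marked: int) -> None:
        """Update state once per window of acknowledgments."""
        frac = ecn_marked / acked if acked else 0.0
        # Exponentially weighted moving average of the marked fraction.
        self.alpha = (1 - G) * self.alpha + G * frac
        if ecn_marked:
            # Gentle, proportional backoff: cwnd <- cwnd * (1 - alpha/2).
            self.cwnd = max(1.0, self.cwnd * (1 - self.alpha / 2))
        else:
            self.cwnd += 1.0  # normal additive increase

sender = DctcpStyleSender()
for marked in (0, 0, 5, 20, 2, 0):
    sender.on_window_acked(acked=100, ecn_marked=marked)
    print(f"marked={marked:3d}  alpha={sender.alpha:.3f}  cwnd={sender.cwnd:6.1f}")
```

A mild burst of marks produces a mild slowdown, while persistent marking approaches TCP's traditional halving; this proportional response is what keeps queues short and prevents latency-sensitive flows from being starved.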
Virtualization Aware Security

Since most customers of a cloud provider would prefer to have the various tiers of their service (front end, back end, storage end) co-located on the same rack for better performance, it is important to have a virtualization-aware security solution running on the rack. This is provided by certain network vendors. A virtual switch helps by enabling security services such as an Intrusion Prevention System, as well as services like NetFlow/J-Flow/sFlow that can monitor traffic between VMs.
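As an illustration of the inter-VM visibility a virtual switch enables, the sketch below samples packets sFlow-style at a fixed rate and keeps per-flow byte estimates keyed by the usual 5-tuple. It is a minimal sketch of the idea, not any vendor's implementation; the sampling rate and flow record format are assumptions, and a real agent would export records to an external collector rather than hold them in memory.

```python
# Minimal sketch of sFlow-style sampling on a virtual switch: sample one
# packet in N and estimate per-flow byte counts for inter-VM traffic,
# keyed by the classic 5-tuple. Illustrative only.

import random
from collections import defaultdict
from dataclasses import dataclass

SAMPLE_ONE_IN = 128  # assumed sampling rate

@dataclass(frozen=True)
class FlowKey:
    src_ip: str
    dst_ip: str
    proto: str
    src_port: int
    dst_port: int

class VSwitchSampler:
    def __init__(self):
        self.flow_bytes = defaultdict(int)

    def on_packet(self, key: FlowKey, length: int) -> None:
        if random.randrange(SAMPLE_ONE_IN) == 0:
            # Scale the sampled size back up to estimate total bytes.
            self.flow_bytes[key] += length * SAMPLE_ONE_IN

sampler = VSwitchSampler()
web_to_app = FlowKey("10.0.1.11", "10.0.1.22", "tcp", 52100, 8080)
for _ in range(100_000):  # simulate inter-VM packets on the same host
    sampler.on_packet(web_to_app, 1460)
print(sampler.flow_bytes)
```

Because this runs in the virtual switch, it sees flows between VMs on the same blade, traffic that never reaches a physical switch and would otherwise be invisible to rack-level monitoring.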
High network connectivity

In traditional data center architectures, the Spanning Tree Protocol ensures that whatever physical connectivity the switches have with each other, only one path is active between any two of them; this avoids loops. When that path goes down, a backup path becomes active. There is, however, a delay between the failure of the primary path and the activation of the backup path, which hurts the latency-sensitive applications hosted in a cloud data center. There are a number of efforts in the industry (such as TRILL and Shortest Path Bridging) to standardize schemes that provide a high degree of interconnection, i.e., multiple active paths between resources, and that recover quickly from the failure of one path by minimizing the delay before a backup path takes over. A cloud provider should consider moving to such models of connectivity and fault tolerance.

Improved network utilization

Multiple VMs on a physical machine help improve capacity utilization of the machine. They also improve network utilization, since the aggregate network bandwidth usage of the machine goes up. One can further improve the network bandwidth available to a VM through virtualization of multiple network interface cards (NICs), where the cards work as a team while appearing as a single card to the VM using them. This is called NIC teaming. NIC teaming not only provides higher network bandwidth to a VM, improving its network throughput, but also improves reliability, since the failure of one NIC does not disrupt the VM's traffic going through the team.

WAN acceleration

Because of the need to move data between data centers for disaster recovery as well as for better performance and high availability (to support clients in different regions), and to support large data downloads by the clients of some customers, it is important to use WAN acceleration for transmitting data over the WAN. This improves the performance and throughput of such transmissions appreciably. It is important to select the right vendor for this.

Programmable routers/switches

There is an effort underway in the industry (software-defined networking, with OpenFlow as one emerging standard) to move away from the current scheme of proprietary routers that allow little programmability and control. This may take some time to sort out before standards are settled, but in the meanwhile there are network gear vendors that support this model to varying degrees. The bottom line is this: if routers can be made very simple devices that forward based on tables and algorithms controlled programmatically, and if the heavyweight routing protocols and software not needed in the controlled environment of a data center can be removed or minimized, networking can be made much more dynamic and agile. Some cloud providers are making headway in this space by choosing vendors for their data center networks that provide some support for the above.
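To illustrate the "forwarding based on programmable tables" idea, the sketch below models a switch whose entire behavior is a match-action table populated by a controller program, in the spirit of OpenFlow. The rule fields, priorities, and actions here are simplified assumptions for illustration and do not follow the real OpenFlow wire format.

```python
# Sketch of table-driven forwarding in the spirit of OpenFlow: the switch
# only matches packets against rules installed by a controller and applies
# the stored action. Simplified for illustration; not the OpenFlow format.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Rule:
    dst_ip_prefix: str       # e.g. "10.0.1." matches 10.0.1.0/24
    out_port: Optional[int]  # None means drop
    priority: int

class SimpleSwitch:
    def __init__(self):
        self.table: list[Rule] = []

    def install(self, rule: Rule) -> None:
        """Controller-facing API: add a rule, highest priority first."""
        self.table.append(rule)
        self.table.sort(key=lambda r: -r.priority)

    def forward(self, dst_ip: str) -> Optional[int]:
        for rule in self.table:
            if dst_ip.startswith(rule.dst_ip_prefix):
                return rule.out_port
        return None  # table miss: drop (a real switch might ask the controller)

sw = SimpleSwitch()
sw.install(Rule("10.0.1.", out_port=1, priority=10))        # tenant A's subnet
sw.install(Rule("10.0.2.", out_port=2, priority=10))        # tenant B's subnet
sw.install(Rule("10.0.2.99", out_port=None, priority=20))   # block one host
print(sw.forward("10.0.1.5"), sw.forward("10.0.2.7"), sw.forward("10.0.2.99"))
```

With forwarding expressed this way, moving a tenant's VM becomes a matter of the controller rewriting a few table entries, rather than waiting for distributed routing protocols to reconverge.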
Conclusions

This paper provided an overview of some of the improvements cloud providers can make to their data center networks to improve the performance, throughput, agility, security, and other aspects of their clouds. These improvements will enable providers to increase their value proposition for their customers and differentiate themselves from the crowd. The techniques and methodologies discussed in the paper include new ways of structuring a data center network and the intelligent use of standardized, and in some cases specialized, hardware and software.