A Review of Flow Scheduling in Data Center Networks




Mala Yadav, School of Computer and Info Science, MVN University, Palwal, India, mala0122001@yahoo.com
Jay Shankar Prasad, School of Computer and Info Science, MVN University, Palwal, India, jayshankar.prasad@mvn.edu.in

Abstract: Today's data centres hosting cloud applications consume huge amounts of electrical energy, contributing to high operational costs and a large carbon footprint. Several data centre network architectures have been proposed in the past decade. These architectures employ richly-connected topologies and multi-path routing to provide high network capacity. Significant research has been done on energy-aware flow scheduling algorithms to address the problem of inefficient network energy usage. In this paper, we present a survey of data centre routing, considering how to aggregate traffic by flexibly choosing routing paths while flows fairly share the link bandwidths.

1. INTRODUCTION

The huge expansion in data centre sizes has increased power consumption and carbon footprint. A study from 2011 shows that roughly 1.5% of global electricity usage came from data centres [1], [2]. The increasing energy consumption of data centres has thus begun to restrict the growth of cloud services and has raised economic and environmental concerns. The contribution of the networking part to the energy consumed by the whole data centre is also a matter of concern nowadays. The network accounts for 10-20% of the overall power consumption in data centres [3]; however, this proportion can reach 50% if intelligent server-side power management techniques are deployed [4].

To achieve higher network capacity, several richly-connected data centre network architectures, such as Fat-Tree [5], BCube [6], FiConn [7] and uFix [8], have been proposed to replace the traditional tree architecture. These architectures use additional switches and links to achieve full bisection bandwidth at traffic peaks. They enhance network performance, but they have side effects on data centre energy consumption. First, they increase the energy consumption share of the networking part, because they use a larger number of switches. Second, they cause inefficient network energy usage in periods of light traffic, due to poor utilization of switches. The main reason for this network energy waste is that the energy consumption of present switches is not proportional to their traffic loads [9]: idle or under-utilized switches still consume much energy. To save network energy in richly-connected data centres, different energy-aware flow scheduling approaches have therefore been proposed [10], [11]. The common approach is to aggregate network flows onto a subset of switches/links, using as few switches/links as possible to carry the flows. The active switches/links then carry most of the network traffic, and the idle switches may be put into sleep mode to save energy. Traffic aggregation results in multiple flows sharing the same switches/links. Under the TCP congestion control algorithm, the flows obtain their throughputs by sharing the bandwidths of bottleneck links in a fair manner; this approach is known as Fair-Sharing Routing (FSR). However, FSR usually results in low utilization of some non-bottleneck switches, which wastes energy.
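To make the fair-sharing behaviour concrete, the following is a minimal sketch (our illustration, not taken from any of the surveyed papers) of max-min fair bandwidth allocation on a single bottleneck link, which is the allocation that TCP-style fair sharing approximates under FSR. The capacity and flow demands are made-up values.

```python
def max_min_fair_share(capacity, demands):
    """Max-min fair allocation of `capacity` among flows with `demands`.

    Flows that demand less than the current fair share keep their demand;
    the leftover capacity is redistributed among the remaining flows.
    """
    allocation = [0.0] * len(demands)
    unsatisfied = sorted(range(len(demands)), key=lambda i: demands[i])
    remaining = float(capacity)
    while unsatisfied:
        fair_share = remaining / len(unsatisfied)
        i = unsatisfied[0]
        if demands[i] <= fair_share:
            # This flow needs less than the fair share; free the rest.
            allocation[i] = demands[i]
            remaining -= demands[i]
            unsatisfied.pop(0)
        else:
            # Every remaining flow wants at least the fair share: split evenly.
            for j in unsatisfied:
                allocation[j] = fair_share
            unsatisfied = []
    return allocation

# Example: a 10 Gbps bottleneck shared by four aggregated flows.
print(max_min_fair_share(10.0, [1.0, 2.0, 8.0, 8.0]))
# -> [1.0, 2.0, 3.5, 3.5]
```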
2. RELATED WORKS

A. Green Networking

Green networking is the use of energy-efficient networking components and technologies to minimize overall energy consumption; it also encourages the use of as few resources as possible. Several methodologies have been proposed by different researchers and are in use, as surveyed below.

1) Device-level Techniques: These energy conservation schemes improve the energy efficiency of network devices through advanced, intelligent hardware techniques. Nedevschi et al. [12] suggested that the energy consumption of network devices could be reduced by applying sleep-on-idle and rate-adaptation methods. Gupta et al. [13] proposed that under-utilized Ethernet interfaces may at times be put into a low-power mode, based on the number of packets arriving within a certain time period; they also analyzed the resulting trade-off between energy conservation and packet loss and delay.
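As a rough illustration of the sleep-on-idle idea (a toy model of ours, not the actual scheme of [13]; the idle timeout and wake-up latency are made-up parameters), the sketch below puts an interface to sleep after an idle timeout and charges a wake-up delay to the first packet that arrives while it is asleep:

```python
def sleep_on_idle(arrivals_ms, idle_timeout_ms=10.0, wakeup_ms=1.0):
    """Simulate a sleep-on-idle interface over sorted packet arrival times.

    Returns (sleep_time_ms, delayed_packets): time spent asleep (a proxy
    for energy saved) and how many packets paid the wake-up latency.
    """
    sleep_time = 0.0
    delayed = 0
    last_activity = 0.0
    for t in arrivals_ms:
        gap = t - last_activity
        if gap > idle_timeout_ms:
            # Interface slept from (last_activity + timeout) until t.
            sleep_time += gap - idle_timeout_ms
            delayed += 1  # this packet waits for the interface to wake up
            last_activity = t + wakeup_ms
        else:
            last_activity = t
    return sleep_time, delayed

# Bursty traffic: long gaps allow sleep, at the cost of a few delayed packets.
print(sleep_on_idle([1, 2, 3, 50, 51, 120]))  # -> (96.0, 2)
```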

Similarly, Gunaratne et al. [14] proposed link rate adaptation policies that monitor link utilization and the queue length of the output buffer to save Ethernet link energy.

2) Network-level Techniques: Network-wide energy conservation methods reduce the network's energy consumption; here the topology design and the routing of the network play the major roles. Chabarek et al. [15] addressed the problem of minimizing network energy consumption by modelling it as a multi-commodity network flow problem, and investigated energy-efficient protocol design with mixed-integer programming techniques. Vasic et al. [16] proposed a two-step flow scheduling scheme that balances energy conservation against computational complexity: energy-critical paths are pre-computed, and a scalable traffic management mechanism then schedules network flows online. Abts et al. [4] proposed an energy-efficient flattened butterfly topology and periodically reconfigured link transfer rates to make data centre networks more energy efficient. Heller et al. [10] designed ElasticTree, a network energy optimizer that chooses a subset of links to carry the traffic flows subject to network performance and reliability constraints. Shang et al. [11] proposed a throughput-guaranteed power-aware routing algorithm based on a pruning method that iteratively eliminates switches from the original topology while maintaining the network throughput (a simplified sketch of this pruning idea follows at the end of this section).

B. Flow Scheduling in Data Center Networks: Different flow scheduling schemes have been proposed for data centre networks to meet different optimization objectives, such as maximizing network utilization, minimizing flow completion time, and meeting the deadline constraints of network flows. Al-Fares et al. [17] proposed Hedera, a dynamic flow scheduling scheme for the multi-rooted hierarchical tree topologies used in data centre networks; it utilizes link resources effectively without much control and computation overhead, and experiments show that Hedera achieves considerably better bisection bandwidth than ECMP routing. Wilson et al. [18] proposed D3, a deadline-aware control protocol that controls the transmission rates of network flows based on their deadline requirements; D3 improved the latency of small flows and burst tolerance, thereby increasing the transmission capacity of data centre networks. Hong et al. [19] designed PDQ, a preemptive distributed quick flow scheduling protocol that minimizes the average flow completion time while meeting flow deadlines; their results showed that PDQ outperformed TCP and D3 in terms of the average flow completion time and the number of flows meeting their deadlines. Zats et al. [20] proposed DeTail, a cross-layer network stack aimed at improving the completion times of delay-sensitive flows.
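The following is a minimal sketch of the pruning idea behind [10], [11] (our own simplification, not the published algorithms): repeatedly try to power off the switch carrying the least traffic, keeping it off only if the network still connects every host pair. A real implementation would verify a throughput guarantee rather than mere connectivity, which stands in for it here.

```python
from collections import deque

def connected(graph, active, src, dst):
    """BFS reachability from src to dst, traversing active nodes only."""
    seen, queue = {src}, deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            return True
        for v in graph[u]:
            if v in active and v not in seen:
                seen.add(v)
                queue.append(v)
    return False

def prune_switches(graph, hosts, switches, load):
    """Greedily power off lightly loaded switches while every host pair
    stays connected (connectivity stands in for throughput feasibility)."""
    active = set(hosts) | set(switches)
    for s in sorted(switches, key=lambda s: load[s]):
        active.discard(s)
        ok = all(connected(graph, active, a, b)
                 for a in hosts for b in hosts if a != b)
        if not ok:
            active.add(s)  # this switch is needed; keep it powered on
    return active

# Two hosts, two parallel switches: the lightly loaded one gets pruned.
graph = {'h1': ['s1', 's2'], 'h2': ['s1', 's2'],
         's1': ['h1', 'h2'], 's2': ['h1', 'h2']}
print(prune_switches(graph, ['h1', 'h2'], ['s1', 's2'],
                     {'s1': 5.0, 's2': 1.0}))  # -> {'h1', 'h2', 's1'}
```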
3. BACKGROUND

The recent progress of distributed computing frameworks such as Hadoop [21], MapReduce [25] and Dryad [26], and of web services such as search, e-commerce and social networking, has driven the construction of massive computing clusters built from commodity-class PCs. Likewise, the extraordinary growth in the size and complexity of datasets calls for storage of up to several petabytes, spread over tens of thousands of machines.

These cluster applications are often bottlenecked on the network rather than on local resources, so improving application performance leads directly to improving network performance. Data centre network topologies essentially use hierarchical trees, with small, cheap edge switches connected to the end hosts [24]. These trees are interconnected by two or more layers of switches, so as to make effective use of the port densities obtainable from commercial switches. For larger data centres with many thousands of machines, our study suggests horizontal rather than vertical expansion of the data centre network [5, 22, 23]: instead of expensive routers with higher speeds and port densities, use a larger number of parallel paths between the source and destination edge switches, an approach known as a multi-rooted tree topology. Network designs with multi-rooted topologies can provide full bisection bandwidth among all participating hosts [5]; they rely on an efficient protocol to forward data within the network and on a scheduler that allocates flows to paths so as to exploit the massive parallelism.
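For concreteness, a k-ary fat-tree [5], the canonical multi-rooted topology built from identical k-port switches, has k pods each containing k/2 edge and k/2 aggregation switches, plus (k/2)^2 core switches, and supports k^3/4 hosts at full bisection bandwidth. The short helper below computes these standard figures:

```python
def fat_tree_size(k):
    """Sizing of a k-ary fat-tree built from k-port switches (k even)."""
    assert k % 2 == 0, "k must be even"
    return {
        "pods": k,
        "edge_switches": k * (k // 2),         # k/2 per pod
        "aggregation_switches": k * (k // 2),  # k/2 per pod
        "core_switches": (k // 2) ** 2,
        "hosts": k ** 3 // 4,                  # (k/2)^2 hosts/pod * k pods
        "equal_cost_core_paths": (k // 2) ** 2,
    }

# A fat-tree of 48-port switches supports 27,648 hosts.
print(fat_tree_size(48))
```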

The performance of ECMP and of flow-based VLB depends mainly on the flow sizes and the number of flows per host. Hash-based forwarding performs well when hosts in the network perform all-to-all communication with one another simultaneously, or when individual flows last only a few RTTs. Non-uniform communication patterns, especially those involving transfers of large blocks of data, require more careful scheduling of flows to avoid network bottlenecks.

4. DYNAMIC FLOW SCHEDULING SCHEME: HEDERA

Existing network forwarding protocols are optimized to select a single path for each source/destination pair in the absence of failures. Such static single-path forwarding can underutilize multi-rooted trees. ECMP [24] uses a static mapping of flows to paths that takes no account of current network utilization or flow size, which results in collisions that degrade overall switch utilization. Hedera instead uses a central scheduler that collects flow information from the switches, computes collision-free paths for flows, and instructs the switches to re-route traffic accordingly. The goal of Hedera is to maximize aggregate network utilization (bisection bandwidth) with minimum scheduler overhead; the overhead is limited by restricting scheduling decisions to the large flows. Hedera's performance depends on the rates and durations of the flows in the network, and it is most beneficial when the network is stressed by many large data transfers within pods and across the network.
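The sketch below conveys the flavour of Hedera's Global First Fit heuristic [17], heavily simplified (the large-flow threshold, link capacities and candidate paths are illustrative, and Hedera's demand estimation step is omitted): each large flow is placed on the first candidate path that can still accommodate its rate.

```python
LARGE_FLOW_FRACTION = 0.1  # flows above ~10% of the NIC rate count as large

def global_first_fit(flows, paths, link_capacity=10.0):
    """Place each large flow on the first candidate path with room for it.

    flows: list of (flow_id, rate); paths: dict flow_id -> list of
    candidate paths, each path a tuple of link names.
    """
    residual = {}   # link -> remaining capacity, filled lazily
    placement = {}  # flow_id -> chosen path
    for flow_id, rate in flows:
        if rate < LARGE_FLOW_FRACTION * link_capacity:
            continue  # small flows stay on their default ECMP path
        for path in paths[flow_id]:
            if all(residual.get(l, link_capacity) >= rate for l in path):
                for l in path:
                    residual[l] = residual.get(l, link_capacity) - rate
                placement[flow_id] = path
                break
    return placement

flows = [("A", 6.0), ("B", 6.0), ("C", 0.2)]
paths = {"A": [("e1-c1", "c1-e2"), ("e1-c2", "c2-e2")],
         "B": [("e1-c1", "c1-e2"), ("e1-c2", "c2-e2")]}
print(global_first_fit(flows, paths))
# A takes the path through c1; B no longer fits there and is moved to c2.
```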
5. DEADLINE-AWARE CONTROL PROTOCOL: D3

Today's data centre networks, given their Internet origins, are largely agnostic to application deadlines. The congestion control (TCP) and flow scheduling mechanisms (FIFO queueing) used in data centres are unaware of flow deadlines and instead strive to optimize network-level metrics: maximizing network throughput while maintaining fairness. This leads to unfair sharing and flow quenching; D3 is designed to counter these effects. D3 is a deadline-driven delivery control protocol customized for the data centre environment, and it deals with the challenges of that environment: small RTTs and a bursty, diverse traffic mix with widely varying deadlines. The contributions of D3 are: it presents the case for utilizing flow deadline information to apportion bandwidth in data centres, and it presents the design, implementation and evaluation of a congestion control protocol that makes data centre networks deadline-aware. D3 can double the peak load that a data centre can support, and it also performs well as a congestion control protocol in its own right: even without any deadline information, D3 outperforms TCP in supporting the mix of short and long flows observed in data centre networks. In short, D3 uses application deadline information to achieve an informed allocation of network bandwidth; it is practical and provides significant benefits over even optimized versions of existing solutions. Recent trends indicate that data centre operators are willing to adopt new designs that address their problems.

6. PREEMPTIVE DISTRIBUTED QUICK FLOW SCHEDULING: PDQ

Preemptive Distributed Quick (PDQ) flow scheduling [19] is a protocol for completing flows quickly while meeting their deadlines. PDQ is based on traditional real-time scheduling techniques: for a queue of tasks, scheduling in Earliest Deadline First (EDF) order is known to minimize the number of late tasks, while Shortest Job First (SJF) minimizes mean flow completion time. However, applying these disciplines to data centre flows raises several challenges. First, EDF and SJF assume a centralized scheduler that knows the global state of the system, which is at odds with the goal of low latency in a large data centre. To perform dynamic decentralized scheduling, PDQ provides a distributed algorithm that allows a set of switches to gather information about flow workloads and converge to a stable agreement on allocation decisions. Second, EDF and SJF rely on the ability to preempt existing tasks, to ensure that a newly arriving task with an earlier deadline can complete before a currently scheduled one. PDQ provides the ability to perform distributed preemption of in-flight flows; this enables fast switchover and is guaranteed never to deadlock. PDQ thus provides a distributed flow scheduling layer that is lightweight, using only FIFO tail-drop queues, and flexible, in that it can approximate a range of scheduling disciplines based on the relative priority of flows. PDQ is most closely related to D3 [18], which also tries to meet flow deadlines; but unlike D3, which is a first-come first-reserve algorithm, PDQ preemptively gives network resources to the most critical flows. For deadline-constrained flows, PDQ supports three times more concurrent senders while satisfying their flow deadlines, minimizing both mean flow completion time and the number of deadline-missing flows.
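To illustrate the two classical disciplines that PDQ approximates (a toy single-link model with made-up flow sizes and deadlines, not PDQ itself), the sketch below serves flows one at a time and compares EDF and SJF on missed deadlines and mean completion time:

```python
def run_schedule(flows, key):
    """Serve flows sequentially in sorted order; flows = [(size, deadline)].

    Returns (missed_deadlines, mean_completion_time); one unit of size
    takes one time unit on the link.
    """
    t, missed, completions = 0.0, 0, []
    for size, deadline in sorted(flows, key=key):
        t += size
        completions.append(t)
        if t > deadline:
            missed += 1
    return missed, sum(completions) / len(completions)

flows = [(8.0, 9.0), (2.0, 12.0), (3.0, 14.0)]
print("EDF:", run_schedule(flows, key=lambda f: f[1]))  # order by deadline
print("SJF:", run_schedule(flows, key=lambda f: f[0]))  # order by size
# EDF misses no deadline; SJF misses one but has lower mean completion time.
```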

7. DETAIL

DeTail [20] is a cross-layer, in-network approach for reducing the long tail of flow completion times. At the link layer, DeTail uses port buffer occupancies to construct a lossless fabric. By responding quickly, the lossless fabric ensures that packets are never dropped due to flash congestion; they are dropped only due to hardware errors and/or failures. Preventing congestion-related losses reduces the number of flows that experience long completion times. At the network layer, DeTail performs per-packet adaptive load balancing of packet routes: at every hop, switches use the congestion information obtained from the port buffers to dynamically pick each packet's next hop. This smooths the network load across the available paths, and this adaptiveness lets DeTail perform well even in asymmetric topologies. Since packets are no longer lost due to congestion, the transport protocol relies on congestion notifications derived from port buffer occupancies; and since routes are load-balanced one packet at a time, out-of-order packet delivery can no longer be used as an early sign of congestion by the transport layer. DeTail also allows applications to specify flow priorities: applications typically know which flows are latency-sensitive foreground flows and which are latency-insensitive background flows. By allowing applications to set these priorities, and by responding to them at the link and network layers, DeTail ensures that high-priority packets do not get stuck behind low-priority ones; such prioritization is important because background flows will likely remain the dominant fraction of traffic. Reducing the tail of the completion times of the short, latency-sensitive flows critical for page creation is where DeTail is most beneficial. Without such mechanisms, buffers drain faster but also fill up more quickly, ultimately causing the packet losses and retransmissions that lead to long tails, and load imbalances due to topological asymmetries create hotspots. DeTail's cross-layer, in-network mechanisms reduce packet losses and retransmissions, prioritize latency-sensitive flows, and evenly balance traffic across multiple paths; by making flow completion statistics robust to congestion, DeTail can reduce 99.9th-percentile flow completion times by over 50% for many workloads, enabling websites to deliver richer content while still meeting interactivity deadlines.

8. CONCLUSION

Hedera utilizes link resources without significant compromise on control and computation overheads, and achieves higher bisection bandwidth than ECMP routing. D3 can control the transmission rates of network flows according to their deadline requirements; it effectively improves the latency of mice flows and burst tolerance, and increases the transmission capacity of data centre networks. PDQ minimizes the average completion time of network flows as well as meeting their deadlines, and performs better than TCP and D3 in terms of the average flow completion time and the number of flows meeting deadlines. DeTail improves the tail of completion times for delay-sensitive flows. In the future, more energy-efficient flow scheduling can further reduce the energy wasted at switches/links.

REFERENCES

[1] J. Koomey, "Growth in data center electricity use 2005 to 2010," Analytics Press, 2011.
[2] P. X. Gao, A. R. Curtis, B. Wong, and S. Keshav, "It's not easy being green," SIGCOMM Comput. Commun. Rev., 2012.
[3] A. Greenberg, J. Hamilton, D. A. Maltz, and P. Patel, "The cost of a cloud: research problems in data center networks," SIGCOMM Comput. Commun. Rev., January 2009.
[4] D. Abts, M. R. Marty, P. M. Wells, P. Klausler, and H. Liu, "Energy proportional datacenter networks," in Proceedings of ISCA, 2010.
[5] M. Al-Fares, A. Loukissas, and A. Vahdat, "A scalable, commodity data center network architecture," in Proceedings of ACM SIGCOMM, 2008.
[6] C. Guo, G. Lu, D. Li, H. Wu, X. Zhang, Y. Shi, C. Tian, Y. Zhang, and S. Lu, "BCube: a high performance, server-centric network architecture for modular data centers," in Proceedings of ACM SIGCOMM, 2009.
[7] D. Li, C. Guo, H. Wu, K. Tan, Y. Zhang, and S. Lu, "FiConn: Using backup port for server interconnection in data centers," in Proceedings of IEEE INFOCOM, 2009.
[8] D. Li, M. Xu, H. Zhao, and X. Fu, "Building mega data center from heterogeneous containers," in Proceedings of IEEE ICNP, 2011.
[9] P. Mahadevan, P. Sharma, S. Banerjee, and P. Ranganathan, "A power benchmarking framework for network devices," in Proceedings of IFIP NETWORKING, 2009.
[10] B. Heller, S. Seetharaman, P. Mahadevan, Y. Yiakoumis, P. Sharma, S. Banerjee, and N. McKeown, "ElasticTree: Saving energy in data center networks," in Proceedings of NSDI, April 2010.
[11] Y. Shang, D. Li, and M. Xu, "Energy-aware routing in data center network," in Proceedings of the ACM SIGCOMM Workshop on Green Networking, 2010.
[12] S. Nedevschi, L. Popa, G. Iannaccone, S. Ratnasamy, and D. Wetherall, "Reducing network energy consumption via sleeping and rate-adaptation," in Proceedings of NSDI, 2008.
[13] M. Gupta and S. Singh, "Using low-power modes for energy conservation in Ethernet LANs," in Proceedings of IEEE INFOCOM, 2007.
[14] C. Gunaratne, K. Christensen, B. Nordman, and S. Suen, "Reducing the energy consumption of Ethernet with Adaptive Link Rate (ALR)," IEEE Transactions on Computers, 2008.
[15] J. Chabarek, J. Sommers, P. Barford, C. Estan, D. Tsiang, and S. Wright, "Power awareness in network design and routing," in Proceedings of IEEE INFOCOM, 2008.
[16] N. Vasic, P. Bhurat, D. Novakovic, M. Canini, S. Shekhar, and D. Kostic, "Identifying and using energy-critical paths," in Proceedings of ACM CoNEXT, 2011.
[17] M. Al-Fares, S. Radhakrishnan, B. Raghavan, N. Huang, and A. Vahdat, "Hedera: dynamic flow scheduling for data center networks," in Proceedings of NSDI, 2010.
[18] C. Wilson, H. Ballani, T. Karagiannis, and A. Rowstron, "Better never than late: meeting deadlines in datacenter networks," in Proceedings of ACM SIGCOMM, 2011.
[19] C.-Y. Hong, M. Caesar, and P. B. Godfrey, "Finishing flows quickly with preemptive scheduling," in Proceedings of ACM SIGCOMM, 2012.
[20] D. Zats, T. Das, P. Mohan, D. Borthakur, and R. Katz, "DeTail: reducing the flow completion time tail in datacenter networks," in Proceedings of ACM SIGCOMM, 2012.
[21] Apache Hadoop Project. http://hadoop.apache.org/.
[22] A. Greenberg, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. Maltz, P. Patel, and S. Sengupta, "VL2: A scalable and flexible data center network," in Proceedings of ACM SIGCOMM, 2009.
[23] A. Greenberg, P. Lahiri, D. A. Maltz, P. Patel, and S. Sengupta, "Towards a next generation data center architecture: scalability and commoditization," in Proceedings of ACM PRESTO, 2008.
[24] Data center bridging. http://www.cisco.com/en/us/solutions/collateral/ns340/ns517/ns224/ns783/at_a_glance_c45-460907.pdf.
[25] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," in Proceedings of OSDI, 2004.
[26] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, "Dryad: Distributed data-parallel programs from sequential building blocks," in Proceedings of ACM EuroSys, 2007.