A Survey on Optical Interconnects for Data Centers
Christoforos Kachris and Ioannis Tomkos

Abstract - Data centers are experiencing an exponential increase in the amount of network traffic that they have to sustain due to cloud computing and several emerging web applications. To face this network load, large data centers are required with thousands of servers interconnected with high bandwidth switches. Current data center networks, based on electronic packet switches, consume excessive power to handle the increased communication bandwidth of emerging applications. Optical interconnects have gained attention recently as a promising solution offering high throughput, low latency and reduced energy consumption compared to current networks based on commodity switches. This paper presents a thorough survey on optical interconnects for next generation data center networks. Furthermore, the paper provides a qualitative categorization and comparison of the proposed schemes based on their main features such as connectivity and scalability. Finally, the paper discusses the cost and the power consumption of these schemes, which are of primary importance in future data center networks.

Index Terms - Optical interconnects, data center networks.

Manuscript received 5 May 2011; revised 30 August 2011, 25 November 2011, and 1 December 2011. C. Kachris is with Athens Information Technology, Athens, Greece (e-mail: kachris@ait.edu.gr). I. Tomkos is a professor at Athens Information Technology, Athens, Greece. Digital Object Identifier 10.1109/SURV.2011.122111.00069

I. INTRODUCTION

Over the last few years, the exponential increase of Internet traffic, driven mainly by emerging applications like streaming video, social networking and cloud computing, has created the need for more powerful data centers. The applications hosted on data center servers (e.g. cloud computing applications, search engines, etc.) are data-intensive and require high interaction between the servers in the data center [1]. This required interaction poses a significant challenge to data center networking, creating the need for more efficient interconnection schemes with high bandwidth and reduced latency. The servers must experience low latency communication with each other, even as the data center continues to increase in size, comprising thousands of servers. However, while the throughput and the latency of future data center networks must be improved significantly to sustain the increased network traffic, the total power consumption inside the racks must remain almost the same due to thermal constraints [2]. Furthermore, as more and more processing cores are integrated into a single chip, the communication requirements between racks in the data centers will keep increasing significantly [3].

Table I shows the projections for performance, bandwidth requirements and power consumption for future high performance systems [4],[5]. Note that while the peak performance will continue to increase rapidly, the budget for the total allowable power consumption that can be afforded by the data center is increasing at a much slower rate (2x every 4 years) due to several thermal dissipation issues.

TABLE I
PERFORMANCE, BANDWIDTH REQUIREMENTS AND POWER CONSUMPTION BOUND FOR FUTURE SYSTEMS [4],[5]

Year | Peak performance (10x / 4 yrs) | Bandwidth requirements (20x / 4 yrs) | Power consumption bound (2x / 4 yrs)
2012 | 10 PF                          | 1 PB/s                               | 5 MW
2016 | 100 PF                         | 20 PB/s                              | 10 MW
2020 | 1000 PF                        | 400 PB/s                             | 20 MW
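To make the tension in Table I concrete, the following back-of-the-envelope calculation (simple arithmetic on the table values, not taken from the paper) shows the energy per bit that would be available if the entire power budget were spent on moving data; the allowance shrinks by roughly an order of magnitude every four years.

# Rough energy-per-bit bound implied by Table I (illustrative arithmetic only).
projections = {
    2012: (1e15 * 8, 5e6),     # bandwidth in bit/s (1 PB/s), power bound in W
    2016: (20e15 * 8, 10e6),
    2020: (400e15 * 8, 20e6),
}

for year, (bits_per_s, power_w) in sorted(projections.items()):
    pj_per_bit = power_w / bits_per_s * 1e12   # joules/bit -> picojoules/bit
    print(f"{year}: at most {pj_per_bit:.1f} pJ/bit for the whole system")

# Output: 2012 ~625 pJ/bit, 2016 ~62.5 pJ/bit, 2020 ~6.25 pJ/bit,
# i.e. roughly a 10x reduction in energy per bit every four years.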
Therefore, one of the most challenging issues in the design and deployment of a data center is the power consumption. The energy consumption of the data center infrastructure and of the Internet network topology [6],[7] are the main contributors to the overall energy consumption of cloud computing. Greenpeace's Make IT Green report [8] estimates that the global demand for electricity from data centers was around 330 billion kWh in 2007, and this demand is projected to triple by 2020 (more than 1000 billion kWh). According to some estimates [9], the power consumption of the data centers in the US in 2006 was 1.5% of the total energy consumed, at a cost of more than $4.5B. The servers in the data centers consume around 40% of the total IT power, storage up to 37%, and the network devices around 23% of the total IT power [10]. And as the total power consumption of IT devices in the data centers continues to increase rapidly, so does the power consumption of the HVAC (Heating, Ventilation and Air-Conditioning) equipment needed to keep the temperature of the data center site steady. Therefore, reducing the power consumption of the network devices has a significant impact on the overall power consumption of the data center site.

The power consumption of data centers also has a major impact on the environment. In 2007, data centers accounted for 14% of the total ICT greenhouse gas (GHG) emissions (or 2% of the global GHG), and this share is expected to grow to 18% by 2020 [11]. The global data center footprint in greenhouse gas emissions was 116 metric tonnes of carbon dioxide equivalent (MtCO2e) in 2007 and is expected to more than double by 2020 to 257 MtCO2e, making it the fastest-growing contributor to the ICT sector's carbon footprint.

In order to face the increased communication bandwidth demand and the power consumption in the data centers, new interconnection schemes must be developed that can provide high throughput, reduced latency and low power consumption. Optical networks have been widely used in recent years in long-haul telecommunication networks, providing high throughput, low latency and low power consumption. Several schemes have been presented to exploit the high bandwidth of light, such as Time Division Multiplexing (TDM) and Wavelength Division Multiplexing (WDM).

In the case of WDM, the data are multiplexed onto separate wavelengths that traverse the fiber simultaneously, providing significantly higher bandwidth. Optical telecommunication networks have evolved from traditional opaque networks toward all-optical (i.e. transparent) networks. In opaque networks, the optical signal carrying traffic undergoes an optical-electronic-optical (OEO) conversion at every routing node. But as the size of opaque networks increased, network designers had to face several issues such as higher cost, heat dissipation, power consumption, and operation and maintenance cost. On the other hand, all-optical networks provide higher bandwidth, reduced power consumption and reduced operation cost using optical cross-connects and reconfigurable optical add/drop multiplexers (ROADMs) [12].

Currently, optical technology is utilized in data centers only for point-to-point links, in the same way that point-to-point optical links were used in older telecommunication networks (opaque networks). These links are based on low-cost multi-mode fibers (MMF) for short-reach communication. The MMF links are used to connect the switches using fiber-based Small Form-factor Pluggable transceivers (SFP for 1 Gbps and SFP+ for 10 Gbps), displacing copper-based cables [13]. In the near future, higher bandwidth transceivers are going to be adopted (for 40 Gbps and 100 Gbps Ethernet), such as QSFP modules with four 10 Gbps parallel optical channels and CXP modules with twelve parallel 10 Gbps channels. The main drawback in this case is that power-hungry electrical-to-optical (E/O) and optical-to-electrical (O/E) transceivers are required, since the switching is performed by electronic packet switches.

Current telecommunication networks use transparent optical networks, in which the switching is performed in the optical domain, to face the high communication bandwidth. Similarly, as the traffic requirements in data centers increase to Tbps, all-optical interconnects (in which the switching is performed in the optical domain) could provide a viable solution that meets the high traffic requirements while decreasing significantly the power consumption [14],[15],[16]. According to one report, all-optical networks could provide up to 75% energy savings in data centers [17]. Especially in the large data centers used by enterprises, the use of power efficient, high bandwidth and low latency interconnects is of paramount importance, and there is significant interest in the deployment of optical interconnects in these data centers [18].

A. Current DC with commodity switches

Figure 1 shows the high level block diagram of a typical data center. A data center consists of multiple racks hosting the servers (e.g. web, application or database servers) connected through the data center interconnection network. When a request is issued by a user, a packet is forwarded through the Internet to the front end of the data center. In the front end, content switches and load balancing devices are used to route the request to the appropriate server. A request may require the communication of this server with many other servers. For example, a simple web search request may require communication and synchronization between the web, application and database servers.
Most current data centers are based on commodity switches for the interconnection network. The network is usually a canonical fat-tree 2-Tier or 3-Tier architecture, as depicted in Figure 1 [19]. The servers (usually up to 48, in the form of blades) are accommodated in racks and are connected through a Top-of-Rack switch (ToR) using 1 Gbps links. These ToR switches are further interconnected through aggregate switches using 10 Gbps links in a tree topology. In 3-Tier topologies (shown in the figure), one more level is added in which the aggregate switches are connected in a fat-tree topology using core switches at either 10 Gbps or 100 Gbps links (using a bundle of 10 Gbps links) [20]. The main advantage of this architecture is that it can be scaled easily and that it is fault tolerant (e.g. a ToR switch is usually connected to 2 or more aggregate switches). However, the main drawbacks of these architectures are the high power consumption of the ToR, aggregate and core switches (due to the O/E and E/O transceivers and the electronic switch fabrics) and the high number of links that are required. Another problem of current data center networks is the latency introduced by multiple store-and-forward processing steps [21]. When a packet travels from one server to another through the ToR, aggregate and core switches, it experiences significant queuing and processing delay in each switch (a simple illustrative estimate is sketched at the end of this section). As data centers continue to grow to serve emerging web applications and cloud computing, more efficient interconnection schemes are required that can provide high throughput, low latency and reduced energy consumption. While there are several research efforts that try to increase the available bandwidth of data centers based on commodity switches (e.g. using modified TCP or Ethernet enhancements [22]), the overall improvements are constrained by the bottlenecks of the current technology.

B. Organization of the paper

This paper presents a thorough survey of the optical interconnect schemes for data centers that have been recently presented in the research literature. The paper presents both hybrid and all-optical schemes, based either on optical circuit switching or on packet/burst switching. Section II presents the network traffic characteristics of data center networks. Section III presents the optical technology and the components that are used in the design of optical interconnects. Section IV presents the architectures of the optical interconnects and discusses the major features of each scheme. Section V presents a qualitative comparison and categorization of these schemes. Finally, Section VI discusses the issues of cost and power consumption in these networks and Section VII presents the conclusions. In summary, the main contributions of this paper are the following: a survey of the optical networks targeting data centers and the insight behind these architectures; a categorization and qualitative comparison of the proposed schemes based on the technology, connection type, architecture, etc.; and an analysis of the benefits of optical interconnects in terms of power consumption and cost.
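As a rough illustration of the store-and-forward latency mentioned in Section I-A, the sketch below adds up per-hop serialization and switching delays for a packet crossing a 3-Tier tree; the hop count, packet size, link rate and per-switch delay are assumptions chosen for the example, not measurements from the paper.

# Illustrative store-and-forward latency across a 3-Tier electronic network.
# All numbers are assumptions made for the sake of the example.
PACKET_BITS = 1500 * 8          # a full-size Ethernet frame
LINK_RATE = 10e9                # 10 Gbps links above the ToR
PER_SWITCH_DELAY = 5e-6         # assumed queuing + processing delay per switch (5 us)
SWITCH_HOPS = 5                 # ToR -> aggregate -> core -> aggregate -> ToR

serialization = SWITCH_HOPS * PACKET_BITS / LINK_RATE   # store-and-forward at each hop
queuing = SWITCH_HOPS * PER_SWITCH_DELAY

print(f"serialization: {serialization * 1e6:.1f} us, queuing/processing: {queuing * 1e6:.1f} us")
# With these assumptions the end-to-end delay is already tens of microseconds,
# which is the overhead that the optical schemes surveyed below try to avoid.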

Fig. 1. Architecture of a current data center network.

II. NETWORK TRAFFIC CHARACTERISTICS

In order to design a high performance network for a data center, a clear understanding of the data center traffic characteristics is required. This section presents the main features of the network traffic in data centers and discusses how these features affect the design of optical networks. Several research papers have investigated data center traffic, such as the ones presented by Microsoft Research ([23],[24],[25]). Data centers can be categorized in three classes: university campus data centers, private enterprise data centers and cloud-computing data centers. In some cases there are common traffic characteristics (e.g. average packet size) in all data centers, while other characteristics (e.g. applications and traffic flow) differ considerably between the data center categories.

The results presented in these papers are based on measurements of real data centers. Until now there has been no study on a theoretical model of data center traffic. However, based on these studies we can extract some useful figures of merit for the modeling of data center traffic. For example, the packet inter-arrival distribution in private data centers follows a lognormal distribution, while in campus data centers it follows a Weibull distribution [25]. The main empirical findings of these studies are the following:

Applications: The applications that are running in the data centers depend on the data center category. In campus data centers the majority of the traffic is HTTP traffic. On the other hand, in private data centers and in data centers used for cloud computing the traffic is dominated by HTTP, HTTPS, LDAP and database (e.g. MapReduce) traffic.

Traffic flow locality: A traffic flow is specified as an established link (usually TCP) between two servers. The traffic flow locality describes whether the traffic generated by the servers in a rack is directed to the same rack (intra-rack traffic) or to other racks (inter-rack traffic). According to these studies, the ratio of inter-rack traffic fluctuates from 10% to 80% depending on the application. Specifically, in data centers used by educational organizations and private enterprises the ratio of intra-rack traffic ranges from 10% to 40%. On the other hand, in data centers that are used for cloud computing the majority of the traffic is intra-rack communication (up to 80%); the operators of these systems place the servers that usually exchange high traffic with each other in the same rack. The traffic flow locality affects significantly the design of the network topology. In cases of high inter-rack communication traffic, high speed networks are required between the racks, while low-cost commodity switches can be used inside the rack. Therefore, in these cases an efficient optical network could provide the required bandwidth between the racks, while low cost electronic switches can be utilized for intra-rack communication.
Traffic flow size and duration: A traffic flow is defined as an active connection between two or more servers. Most traffic flows in the data center are considerably small (i.e. less than 10KB), and a significant fraction of these flows last under a few hundred milliseconds. The duration of a traffic flow can significantly affect the design of the optical topology: if a traffic flow lasts several seconds, then an optical device with a high reconfiguration time can amortize the reconfiguration overhead and still provide higher bandwidth.

Concurrent traffic flows: The number of concurrent traffic flows per server is also very important to the design of the network topology. If the number of concurrent flows can be supported by the number of optical connections, then optical networks can provide a significant advantage over networks based on electronic switches. The average number of concurrent flows per server is around 10 in the majority of data centers.

Packet size: The packet size in data centers exhibits a bimodal pattern, with most packet sizes clustering around 200 and 1400 bytes.

This is due to the fact that the packets are either small control packets or parts of large files that are fragmented to the maximum packet size of the Ethernet networks (1550 bytes).

Link utilization: According to these studies, in all kinds of data centers the link utilization inside the rack and at the aggregate level is quite low, while the utilization at the core level is quite high. Inside the rack the preferred link data rate is 1 Gbps (in some cases each rack server hosts 2 or more 1 Gbps links), while in the aggregate and core network 10 Gbps links are usually deployed. The link utilization shows that higher bandwidth links are required especially in the core network, while the current 1 Gbps Ethernet networks inside the rack can sustain the future network demands.

III. OPTICAL TECHNOLOGY

The majority of the optical interconnection schemes presented in this paper are based on devices that are widely used in optical telecommunication networks (e.g. WDM networks and Passive Optical Networks (PONs)). This section describes the basic optical modules that are utilized for the implementation of optical interconnects targeting data centers [26].

Splitter and combiner: A fiber optic splitter is a passive device that can distribute the optical signal (power) from one fiber among two or more fibers. A combiner, on the other hand, is used to combine the optical signal from two or more fibers into a single fiber.

Coupler: A coupler is a passive device that is used to combine and split signals in an optical network, but it can have multiple inputs and outputs. For example, a 2x2 coupler takes a fraction of the power from the first input and places it on output 1 and the remaining fraction on output 2 (and similarly for the second input).

Arrayed Waveguide Grating (AWG): AWGs are passive, data-rate independent optical devices that route each wavelength of an input to a different output (wavelength w of input i is routed to output [(i + w - 2) mod N] + 1, for 1 ≤ i ≤ N and 1 ≤ w ≤ W, where N is the number of ports and W the total number of wavelengths). In WDM communication systems where multiple wavelengths are multiplexed, AWGs are used as demultiplexers to separate the individual wavelengths or as multiplexers to combine them.
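As a quick sanity check of the cyclic routing rule quoted above, the following few lines (illustrative only, not part of the survey) compute the output port for every (input, wavelength) pair of a small AWG and show that each wavelength yields a permutation of the outputs, so different inputs can reach the same output simultaneously as long as they use different wavelengths.

def awg_output_port(i, w, n_ports):
    """Cyclic AWG routing: wavelength w entering input i leaves on this output.
    Ports and wavelengths are numbered from 1, as in the formula above."""
    return ((i + w - 2) % n_ports) + 1

N = 4
for w in range(1, N + 1):
    outputs = [awg_output_port(i, w, N) for i in range(1, N + 1)]
    print(f"wavelength {w}: inputs 1..{N} -> outputs {outputs}")
# Each wavelength produces a permutation of the outputs, which is the property
# exploited for contention resolution in the AWGR-based switches of Section IV.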
Wavelength Selective Switch (WSS): A WSS is typically a 1xN optical component that can partition the incoming set of wavelengths to different ports (each wavelength can be assigned to be routed to a different port). In other words, a WSS can be considered reconfigurable, and the reconfiguration time is a few milliseconds [27].

Micro-Electro-Mechanical Systems switches (MEMS switches): MEMS optical switches are mechanical devices that physically rotate mirror arrays, redirecting the laser beam to establish a connection between the input and the output. Because they are based on mechanical systems, the reconfiguration time is in the order of a few milliseconds. Currently there are commercially available MEMS optical switches with 32 input/output ports.

Semiconductor Optical Amplifier (SOA): Semiconductor optical amplifiers are optical amplifiers based on semiconductor p-n junctions. Light is amplified through stimulated emission when it propagates through the active region [28]. SOAs are generally preferred over other amplifiers due to their fast switching time (in the order of ns) and their energy efficiency [29].

Tunable Wavelength Converter (TWC): A tunable wavelength converter generates a configurable wavelength for an incoming optical signal. The tunable wavelength converter includes a tunable laser, an SOA and a Mach-Zehnder Interferometer (MZI). The conversion is performed by the SOA, which receives as inputs the tunable laser wavelength and the data, and outputs the data on the selected wavelength. The SOA is followed by the MZI, which works as a filter to generate reshaped and clean pulses of the tuned wavelength [30],[31]. In [32] it is shown that the wavelength conversion can be achieved at 160 Gbps and that the reconfiguration time is in the order of nanoseconds.

IV. ARCHITECTURES

This section presents the optical interconnect schemes that have been recently proposed for data center networks and provides a general insight into each architecture.

A. c-through: Part-time optics in data centers

A hybrid electrical-optical network called c-through has been presented by G. Wang et al. [33],[34]. The architecture of c-through is depicted in Figure 2 and is presented as an enhancement to current data center networks. The ToR switches are connected both to an electrical packet-based network (i.e. Ethernet) and to an optical circuit-based network. The circuit-switched network can only provide a matching on the graph of racks. Thus the optical switch must be configured in such a way that pairs of racks with high bandwidth demands are connected through this optical switch. A traffic monitoring system is required, which is placed in the hosts and measures the bandwidth requirements towards the other hosts. An optical configuration manager collects these measurements and determines the configuration of the optical switch based on the traffic demands. The traffic demands and the connected links are formulated as a maximum weight perfect matching problem; in the c-through architecture, Edmonds' algorithm has been used for the solution of the perfect matching problem [35]. After the configuration of the optical circuit switch, the optical manager informs the ToR switches in order to route the packets accordingly. The traffic in the ToR switches is demultiplexed using VLAN-based routing: two different VLANs are used, one for the packet-based network and one for the optical circuit-based network. If the packets are destined to a ToR switch that is connected to the source ToR through the optical circuit, the packets are sent over the second VLAN.
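To illustrate the matching step described above, the sketch below runs networkx's blossom-based matcher (a descendant of Edmonds' algorithm) on a toy inter-rack traffic matrix; the rack count and demand values are invented for the example, and the real c-through controller is considerably more elaborate.

import networkx as nx

# Toy inter-rack traffic demands (in Gbps); keys are rack pairs.
demands = {(0, 1): 8, (0, 2): 1, (0, 3): 2,
           (1, 2): 3, (1, 3): 9, (2, 3): 7}

g = nx.Graph()
for (a, b), gbps in demands.items():
    g.add_edge(a, b, weight=gbps)

# One optical circuit per rack: pick the matching that maximizes carried demand.
matching = nx.max_weight_matching(g, maxcardinality=True)
print("optical circuits:", sorted(tuple(sorted(e)) for e in matching))
# With these demands, racks 0-1 and 2-3 receive the direct optical circuits
# (total 15 Gbps carried optically); the remaining traffic stays on Ethernet.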

Fig. 2. Architecture of the c-through network.

Fig. 3. Architecture of the Helios data center network.

The system evaluation was based on simulation of a packet-switched network. The proposed scheme was evaluated using both micro-benchmarks and real applications. In the evaluation it was shown that for applications in which the traffic demand between some hosts changes slowly, the proposed scheme can significantly reduce the completion time of the applications and offers significantly reduced latency between the nodes connected through the optical circuit.

B. Helios: A hybrid optical electrical switch

N. Farrington et al. from UCSD presented in 2010 a hybrid electrical/optical switch architecture for modular data centers called Helios [36], which is similar to the c-through architecture but is based on WDM links. The Helios scheme follows the architecture of typical 2-layer data center networks. It consists of ToR switches (called pod switches) and core switches. Pod switches are common electrical packet switches, while core switches can be either electrical packet switches or optical circuit switches. The electrical packet switches are used for all-to-all communication of the pod switches, while the optical circuit switches are used for high bandwidth, slowly changing (usually long lived) communication between the pod switches. Hence, the Helios architecture tries to combine the best of the optical and the electrical networks. The high level architecture of Helios is depicted in Figure 3.

Each pod switch has both colorless and WDM optical transceivers. The colorless optical transceivers (e.g. 10G SFP+ modules) are used to connect the pod switches with the core electrical packet switches. The WDM optical transceivers are multiplexed through a passive optical multiplexer (forming superlinks) and are connected to the optical circuit switches. These superlinks can carry up to w x 10 Gbps (where w is the number of wavelengths, from 1 to 32). Thus the proposed scheme can deliver full bisection bandwidth. If the number of colorless and WDM transceivers is the same, then 50% of the bandwidth is shared between pods while the remaining 50% is allocated to specific routes depending on the traffic requirements.

The Helios control scheme consists of three modules: the Topology Manager (TM), the Circuit Switch Manager (CSM) and the Pod Switch Manager (PSM). The Topology Manager monitors the traffic of the data center and uses the traffic requirements (e.g. number of active connections, traffic demand, etc.) to find the best configuration for the optical circuit switch. The traffic demand is based on the traffic that each server needs to transmit to the other servers, while the number of active flows is the maximum number of connections that can be simultaneously active. The Circuit Switch Manager receives the graph of the connections and configures the Glimmerglass MEMS switch. The Pod Switch Manager is hosted in the pod switches and interfaces with the TM. Based on the configuration decisions of the TM, the pod manager routes each packet either to the packet switch through the colorless transceivers or to the optical circuit switch through the WDM transceivers. The configuration of the circuit switch is based on a simple demand matrix from which a bipartite matching has to be calculated.
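A minimal stand-in for the demand-matrix step above treats circuit assignment as a linear assignment problem and solves it with SciPy. This is only a simplified sketch; the pod count and demand values are invented, and Helios' actual control loop estimates demand iteratively and is not reproduced here.

import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy pod-to-pod demand matrix (Gbps); entry [i, j] is traffic from pod i to pod j.
demand = np.array([[0, 40, 5, 10],
                   [35, 0, 15, 5],
                   [5, 20, 0, 45],
                   [10, 5, 50, 0]])

# Choose one outgoing circuit per source pod and one incoming circuit per
# destination pod so that the total demand carried optically is maximized.
rows, cols = linear_sum_assignment(demand, maximize=True)
for src, dst in zip(rows, cols):
    print(f"pod {src} -> pod {dst} over the optical circuit ({demand[src, dst]} Gbps)")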
The main advantage of the Helios architecture is that it is based on readily available optical modules and transceivers that are widely used in optical telecommunication networks. The optical circuit switch is based on the commercially available Glimmerglass switch, and the pod switches use WDM SFP+ optical transceivers. The WDM transceivers can be either Coarse WDM (CWDM) modules, which are less expensive but support a limited number of wavelengths, or Dense WDM (DWDM) modules, which are more expensive but can support a higher number of wavelengths (e.g. 40) since they use narrower channel spacing. For the performance evaluation the authors used the Glimmerglass crossbar optical circuit switch, which is readily available and can support up to 64 ports [37].

The main drawback of the proposed scheme is that it is based on MEMS switches, hence any reconfiguration of the circuit switch requires several milliseconds (for the Glimmerglass switch the reconfiguration time was 25 ms). Thus, this scheme is ideal for applications where the connections between some nodes last more than a couple of seconds, in order to compensate for the reconfiguration overhead. In the performance evaluation it was shown that when the stability parameter was varied from 0.5 to 16 seconds the throughput increased significantly.
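The trade-off between connection stability and the 25 ms reconfiguration time can be made concrete with a simple duty-cycle estimate; this is illustrative arithmetic only and is not taken from the Helios evaluation [36].

def effective_utilization(stability_s, reconfig_s=0.025):
    """Fraction of time a circuit carries traffic if it must be reconfigured
    once per stability interval (simplified model, ~25 ms MEMS switching)."""
    return stability_s / (stability_s + reconfig_s)

for stability in (0.1, 0.5, 2.0, 16.0):        # seconds between reconfigurations
    u = effective_utilization(stability)
    print(f"stability {stability:>5.1f} s -> {u:.1%} of the circuit time is useful")
# Sub-second traffic wastes a noticeable share of the circuit time, while
# multi-second stability makes the 25 ms reconfiguration overhead negligible.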

C. DOS: A scalable optical switch

X. Ye et al. from the University of California, Davis presented in 2010 the DOS scalable Datacenter Optical Switch [38]. The switching in the DOS architecture is based on an Arrayed Waveguide Grating Router (AWGR) that allows contention resolution in the wavelength domain. The cyclic wavelength routing characteristic of the AWGR (presented in Section III) is exploited, which allows different inputs to reach the same output simultaneously. Figure 4 depicts the high level block diagram of the DOS architecture. The optical switch fabric consists of an array of tunable wavelength converters (one for each node), an AWGR and a loopback shared buffer. Each node can access any other node through the AWGR by configuring the transmitting wavelength of its TWC. The switch fabric is configured by the control plane, which controls the TWCs and the label extractors (LEs). The control plane is used for contention resolution and TWC tuning.

Fig. 4. The DOS architecture.

When a node transmits a packet to the switch, the label extractors are used to separate the label from the optical payload. The optical label is converted to an electrical signal by an O/E converter inside the control plane module and is forwarded to the arbitration unit. The label includes both the destination address and the packet length. This label is stored in the label processor, and the processor sends a request to the arbitration unit for contention resolution. Based on the decision of the arbitration unit, the control plane configures the TWC. Hence, based on the AWGR routing characteristics, the transmitted packet arrives at the destined output. In case the number of output receivers is less than the number of nodes that want to transmit to this port, link contention occurs. In this case, a shared SDRAM buffer is used to temporarily store the transmitted packets. The wavelengths that face the contention are routed to the SDRAM through an optical-to-electrical (O/E) converter. The packets are then stored in the SDRAM and a shared buffer controller is used to handle these packets. This controller sends the requests of the buffered packets to the control plane and waits for a grant. When the grant is received, the packet is retrieved from the SDRAM, converted back to an optical signal through an electrical-to-optical converter, and forwarded to the switch fabric through a TWC.

The main challenge in the deployment of the DOS switch is the arbitration of the requests in the control plane. Since no virtual output queues (VOQs) are used, every input issues a request to the arbiter and waits for the grant. Thus, for the scheduling of the packets a 2-phase arbiter can be used [39]: the first phase is used for the arbitration of the requests and the second phase for the arbitration of the grants. Furthermore, since there are no virtual output queues, only one request per output is issued, so a single iteration can be used instead of multiple iterations. However, the arbitration must be completed in a very short time, since the DOS architecture is a packet-based switch.

The scalability of the DOS scheme depends on the scalability of the AWGR and the tuning range of the TWCs. Some research centers have presented AWGRs that can reach up to 400 ports [40]. Thus the DOS architecture can be used to connect up to 512 nodes (or 512 racks, assuming that each node is used as a ToR switch).
Although many data centers utilize more than 512 ToR switches, the DOS architecture could easily be scaled up in a fat-tree topology (i.e. the optical switch could be used for the aggregation layer). The main advantage of the DOS scheme is that the latency is almost independent of the number of input ports and remains low even at high input loads. This is due to the fact that the packets have to traverse only an optical switch and they avoid the delay of the electrical switch buffers. However, the main drawback of the DOS scheme is that it relies on electrical buffers for congestion management, using power-hungry electrical-to-optical and optical-to-electrical converters, thus increasing the overall power consumption and the packet latency. Furthermore, the DOS architecture uses tunable wavelength converters that are quite expensive compared to the commodity optical transceivers used in current switches. However, DOS remains an attractive candidate for data center networks where the traffic pattern is bursty with high temporary peaks. The tunable wavelength converters have a switching time in the order of a few ns; thus the proposed scheme can be reconfigured to follow the traffic fluctuations, in contrast to the slow switching time of MEMS optical switches.

A 40 Gbps 8x8 prototype of the DOS architecture has recently been presented by UCD and NPRC [41]. The prototype is based on an 8x8, 200 GHz spacing AWGR and also includes four wavelength converters (WC) based on cross-phase modulation (XPM) in a semiconductor optical amplifier Mach-Zehnder interferometer (SOA-MZI). The measured switching latency of the DOS prototype was only 118.2 ns, which is much lower than the latency of legacy data center networks (i.e. in the order of a few microseconds).
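For illustration only, the sketch below is a toy, single-iteration round-robin request/grant arbiter in the spirit of the scheme described above; it does not reproduce the actual hardware arbiter design of [38],[39].

def arbitrate(requests, grant_pointers):
    """One arbitration cycle.

    requests[i] is the output port requested by input i (or None if idle);
    grant_pointers[o] is the round-robin pointer of output o.
    Returns {input: output} for the granted inputs.
    """
    grants = {}
    for o in range(len(grant_pointers)):
        contenders = [i for i, r in enumerate(requests) if r == o]
        if not contenders:
            continue
        # Round-robin: pick the first contender at or after the pointer.
        start = grant_pointers[o]
        winner = min(contenders, key=lambda i: (i - start) % len(requests))
        grants[winner] = o
        grant_pointers[o] = (winner + 1) % len(requests)  # advance pointer
    return grants

pointers = [0, 0, 0, 0]
print(arbitrate([2, 2, 0, None], pointers))   # inputs 0 and 1 contend for output 2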

D. Proteus data center network

Proteus [42],[43] is an all-optical architecture that is based on WSS switch modules and an optical switching matrix based on MEMS. The high level block diagram of the Proteus architecture is depicted in Figure 5. Each ToR switch has several optical transceivers operating at different wavelengths. The optical wavelengths are combined using a multiplexer and are routed to a WSS. The WSS multiplexes the wavelengths into up to k different groups, and each group is connected to a port of the MEMS optical switch. Thus a point-to-point connection is established between the ToR switches. On the receiver's path, all of the wavelengths are demultiplexed and routed to the optical transceivers. The switching configuration of the MEMS determines which set of ToRs are connected directly. In case a ToR switch has to communicate with a ToR switch that is not directly connected, it uses hop-by-hop communication. Thus Proteus must ensure that the entire ToR graph remains connected when performing a MEMS reconfiguration. The main idea of the Proteus project is to use direct optical connections between ToR switches for high-volume connections, and multi-hop connections in the case of low-volume traffic.

Fig. 5. The Proteus architecture.

The main advantage of the Proteus project is that it can achieve coarse-grain flexible bandwidth. Each ToR has n optical transceivers. If for some reason the traffic between two switches increases, then additional connections can be set up (up to n, either directly or indirectly), thus increasing the optical bandwidth up to n times the bandwidth of one optical transceiver. The main challenge in the operation of the Proteus network is to find the optimum configuration of the MEMS switch for each traffic pattern. In [42] an Integer Linear Programming formulation is used to find the optimum configuration based on the traffic requirements. A further advantage of Proteus is that it is based on readily available off-the-shelf optical modules (WSSs such as the Finisar WSS [27], and optical multiplexers) that are widely used in optical telecommunication networks, thus reducing the overall cost compared with ad-hoc solutions.

The main disadvantage of the Proteus architecture is that the MEMS switch reconfiguration time is in the order of a few milliseconds. As was discussed in Section II, the majority of the traffic flows last only a few milliseconds. Thus, in applications where the traffic flow changes rapidly and each server establishes connections with other servers that last only a few milliseconds, the proposed scheme cannot follow the traffic fluctuations. Hence, in many cases by the time the MEMS switch is reconfigured, a new reconfiguration has to take place to meet the new traffic demands. However, although the traffic between the servers changes rapidly, the aggregated traffic between the ToR switches may change much more slowly; in these cases, the Proteus scheme can exhibit high performance and reduced latency.
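Because Proteus relies on hop-by-hop forwarding, any MEMS configuration must keep the ToR graph connected. The check below is a minimal sketch of that invariant using networkx; the example topology is invented, and the real Proteus controller derives configurations from an ILP rather than testing candidates one by one.

import networkx as nx

def mems_config_is_usable(num_tors, circuits):
    """Return True if the set of direct ToR-to-ToR circuits keeps every
    ToR reachable (possibly over multiple optical hops)."""
    g = nx.Graph()
    g.add_nodes_from(range(num_tors))
    g.add_edges_from(circuits)
    return nx.is_connected(g)

ring = [(i, (i + 1) % 6) for i in range(6)]          # 6 ToRs in a ring: connected
print(mems_config_is_usable(6, ring))                 # True
print(mems_config_is_usable(6, [(0, 1), (2, 3)]))     # False: ToRs 4 and 5 are isolated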
E. Petabit Optical Switch

Jonathan Chao from the Polytechnic Institute of New York has presented a scalable bufferless optical switch fabric, called the Petabit switch fabric, that is based on AWGRs and tunable wavelength converters [44],[45]. The proposed optical switch is combined efficiently with electronic buffering and scheduling. Figure 6 depicts the block diagram of the Petabit optical switch. The Petabit switch fabric adopts a three-stage Clos network, and each stage consists of an array of AWGRs that are used for the passive routing of packets. In the first stage, tunable lasers are used to route the packets through the AWGRs, while in the second and third stages tunable wavelength converters are used to convert the wavelength and route the packets to the destination port accordingly. The main difference compared to the DOS architecture is that the Petabit switch does not use any buffers inside the switch fabric (thus avoiding the power-hungry E/O and O/E conversions). Instead, congestion management is performed using electronic buffers in the line cards and an efficient scheduling algorithm.

Each line card that is connected to an input port of the Petabit switch hosts a buffer in which the packets are stored before transmission. The packets are classified into different virtual output queues (VOQs) based on the destination address. Given the high number of ports, a VOQ is maintained per OM (the last stage of the switch fabric) instead of one VOQ per output port. Using one VOQ per OM simplifies the scheduling algorithm and the buffer management, but on the other hand it introduces head-of-line (HOL) blocking. However, using an efficient scheduling algorithm and some speedup, the Petabit switch fabric can achieve 100% throughput. The scheduler is used to find a bipartite match from the input ports to the output ports and to assign a CM (the central stage of the switch fabric) to each match. Using the bipartite match scheduling, there is no congestion of packets in the switch fabric, thus the buffer that is used in other schemes (e.g. in the DOS architecture) is eliminated.

The performance evaluation of the Petabit switch is based on a cycle-accurate frame-based simulator. The Petabit switch was evaluated from 1024 to 10000 ports and it was shown that it can achieve up to 99.6% throughput even in the case of 10000 ports. The most important advantage of the proposed architecture is that the average latency is only about twice the frame duration (200 ns) even at 80% load, using three iterations of the scheduling algorithm. Hence, in contrast to current data center networks based on commodity switches, the latency is significantly reduced and almost independent of the switch size.
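As a small illustration of the VOQ-per-OM grouping described above (the port counts are arbitrary assumptions and this is not code from [44],[45]), mapping each destination port to its output module reduces the number of queues per line card from N to N divided by the ports per OM, at the cost of HOL blocking between ports that share an OM.

N_PORTS = 1024          # total switch ports (example value)
PORTS_PER_OM = 16       # ports handled by each output module (assumed)

def voq_index(dest_port):
    """Queue packets per output module instead of per output port."""
    return dest_port // PORTS_PER_OM

num_voqs = N_PORTS // PORTS_PER_OM
print(f"{num_voqs} VOQs per line card instead of {N_PORTS}")    # 64 instead of 1024
print(f"packet for port 300 goes to VOQ {voq_index(300)}")
# The HOL blocking introduced by this grouping is what the scheduler's
# iterations and speedup are meant to compensate for.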

Fig. 6. The Petabit architecture.

F. The OSMOSIS project

IBM and Corning jointly developed in 2004 the OSMOSIS project [46],[47], a low-latency optical broadcast-and-select (B&S) architecture based on wavelength- and space-division multiplexing. The broadcast-and-select architecture is composed of two different stages. In the first stage, multiple wavelengths are multiplexed onto a common WDM line and broadcast to all the modules of the second stage through a coupler. The second stage uses SOAs as fiber-selector gates to select the wavelength that will be forwarded to the output. In the framework of the OSMOSIS project, a 64-node interconnect scheme has been developed, combining eight wavelengths on eight fibers to achieve 64-way distribution. The switching is achieved with a fast 8:1 fiber-selection stage followed by a fast 8:1 wavelength-selection stage at each output port, as depicted in Figure 7. Rather than using tunable filters, this design features a demux-SOA-select-mux architecture. A programmable centralized arbitration unit reconfigures the optical switch via a separate optical central scheduler, synchronously with the arrival of fixed-length optical packets. The arbiter enables high-efficiency packet-level switching without aggregation or prior bandwidth reservation and achieves a high maximum throughput. The proposed scheme includes 64 input and output ports operating at 40 Gbps.

Fig. 7. The OSMOSIS architecture.

The main advantage of the proposed scheme is that the switch can be scaled efficiently by deploying several switches in a two-level (three-stage) fat tree topology. For example, it can be scaled up to 2048 nodes by deploying 96 64x64 switches (64 switches for the first level and 32 switches for the second level). The line cards of the OSMOSIS architecture (which could also be the interfaces of ToR switches) use a distributed-feedback (DFB) laser for the transmitter, which is coupled to a 40 Gbps electro-absorption modulator (EAM). In addition, two receivers per port have been included in the input path. The presence of two receivers can be exploited by changing the arbiter to match up to two inputs to one output, instead of just one, which requires modifications to the matching algorithm. The main drawback of the proposed scheme is that it is based on power-hungry devices, which increase significantly the overall power consumption.
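The 2048-node scaling example above can be reproduced with simple arithmetic, under the assumption (made here only for illustration) that each first-level 64x64 switch devotes half of its ports to nodes and half to uplinks.

PORTS = 64                       # port count of one OSMOSIS switch
DOWN = PORTS // 2                # assumed: half the ports face the nodes

level1 = 64                      # first-level switches
nodes = level1 * DOWN            # 64 * 32 = 2048 nodes
uplinks = level1 * (PORTS - DOWN)
level2 = uplinks // PORTS        # 2048 / 64 = 32 second-level switches

print(f"{nodes} nodes with {level1 + level2} switches")   # 2048 nodes with 96 switches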
G. Space-Wavelength architecture

Castoldi et al. have presented an interconnection scheme for data centers that is based on space-wavelength switching [48]. The block diagram of the proposed architecture is depicted in Figure 8. In wavelength-switched architectures the switching is achieved by transmitting the packets on different wavelengths (using an array of fixed lasers or one fast tunable laser) based on the destination ports. On the other hand, in space-switched architectures one fixed laser per port is required, and a non-blocking optical switch based on SOAs is used to establish the connections in each time slot.

The proposed scheme combines efficiently both wavelength and space switching. Each node (e.g. server or ToR switch) has up to N ports, and each port is connected through an intra-card scheduler to an array of fixed lasers, each one transmitting at a different wavelength in the C-band (1530-1570 nm). Each laser is connected to an electrical-to-optical transceiver, and these transceivers are connected to a 1xM space switch. For reception, each port is equipped with a fixed receiver tuned to a specific wavelength. To switch a packet from an input port to its output port, the destination card is selected by setting the 1xM space switch, while the destination port on the card is selected by choosing the wavelength to be modulated. The 1xM space switch fabric is composed of an array of SOAs forming a tree structure. The scheduling problem consists in selecting a packet from each input port and placing it into a matrix representing the card and port domains, so that no more than one packet coming from the same card is present in each row. Each card has an inter-card scheduler that is used for the scheduling of the packets and the control of the optical transceivers.

The proposed scheme can be scaled by adding more planes (wavelengths), thus increasing the aggregate bandwidth and reducing the communication latency. For example, up to 12 channels (planes) can be accommodated using the C-band for the fixed transceivers. The main drawback of the proposed scheme is that the switch fabric is based on SOA arrays that are expensive and increase the overall power consumption.

However, the performance evaluation shows that the proposed scheme can achieve low latency even at high network utilization using, e.g., 12 separate planes.

Fig. 8. The Space-wavelength architecture.

H. E-RAPID

A. Kodi and A. Louri, from Ohio University and the University of Arizona respectively, have jointly presented an energy-efficient reconfigurable optical interconnect called E-RAPID. This scheme targets high performance computing, but it could also be deployed in data center networks [49]. The high level block diagram of this scheme is depicted in Figure 9 (several modules, such as buffers between the nodes, have been omitted for simplicity). Each module (i.e. rack) hosts several nodes (i.e. blade servers) and several transmitters that are based on VCSEL lasers. A reconfigurable controller is used to control the crossbar switch and to allocate the nodes to a specific VCSEL laser. At any given time only one VCSEL laser is active on each wavelength. A coupler for each wavelength is used to select the VCSEL that will forward the packet to the Scalable Optical Remote Super Highway (SRS) ring. This SRS highway is composed of several optical rings, one for each rack. In the receive path, an AWG is used to de-multiplex the wavelengths, which are routed to an array of receivers. The crossbar switch then forwards the packets from each receiver to the appropriate node on the board. For example, if a server in Rack 0 needs to send a packet to Rack 3, the reconfigurable controller configures the crossbar switch to connect the server with one of the VCSEL lasers tuned to wavelength λ1. The VCSEL transmits the packet using the second coupler, which is connected to the inner SRS ring (λ1). The inner SRS ring multiplexes all the wavelengths that are destined to Rack 3. In Rack 3 the AWG is used to demultiplex the wavelengths and the packet is then routed to the server through the crossbar switch.

Fig. 9. The E-RAPID architecture.

E-RAPID can be dynamically reconfigured in the sense that the transmitter ports can be reconfigured to different wavelengths in order to reach different boards. In case of increased traffic load, more wavelengths can be used for node-to-node communication, thus increasing the aggregate bandwidth of the link. In the control plane, a static routing and wavelength allocation (RWA) manager is used for the control of the transmitters and the receivers. A Reconfigurable Controller (RC) is hosted in each module and controls the transmitters and the receivers of this module. The reconfigurable controller also controls the crossbar switch that connects the nodes with the appropriate optical transceiver. The main advantage of the E-RAPID architecture is that the power consumption can be adjusted based on the traffic load. E-RAPID is based on VCSEL transmitters in which the supply current is adjusted based on the traffic load: when the traffic load is reduced, the bit rate can be scaled down by reducing the supply voltage, thus resulting in power savings.
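A toy model of the load-proportional link scaling described above; the selectable rates and power figures are invented placeholders and are not taken from [49].

RATES_GBPS = [2.5, 5.0, 10.0]               # assumed selectable VCSEL bit rates
POWER_MW = {2.5: 40, 5.0: 90, 10.0: 200}    # assumed transmitter power at each rate

def pick_rate(offered_load_gbps):
    """Pick the lowest bit rate that still carries the offered load."""
    for rate in RATES_GBPS:
        if offered_load_gbps <= rate:
            return rate
    return RATES_GBPS[-1]                   # saturate at the maximum rate

for load in (1.0, 4.0, 9.0):
    rate = pick_rate(load)
    print(f"load {load} Gbps -> run link at {rate} Gbps ({POWER_MW[rate]} mW)")
# Scaling the link down at low load is where E-RAPID's power savings come from.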
A lock-step (LS) algorithm has been developed that can control the bit rate (and the power savings) based on the network traffic demands. Each node of the module sends a request to the reconfigurable controller. The controller aggregates the requests and, based on the traffic demand, controls the VCSELs, the crossbar switch, the transmitters and the receivers. The performance evaluation shows that, depending on the reconfiguration window (from 500 to 400 cycles), the latency of the packets ranges from less than 1 microsecond to 2 microseconds, thus providing significantly lower latencies than networks based on commodity switches.

I. The IRIS project

DARPA has also funded a research program called Data in the Optical Domain-Networking, in which Alcatel-Lucent participated. The result of this project is the IRIS project [50]. IRIS is also based on Wavelength Division Multiplexing and on the routing characteristics of Arrayed Waveguide Grating Routers (AWGRs) combined with all-optical wavelength converters. The IRIS architecture is based on a three-stage switch. The three-stage architecture is dynamically non-blocking even though the two space switches are partially blocking. Each node (i.e. ToR switch) is connected to a port of the first stage using N WDM wavelengths. The first stage consists of an array of wavelength switches (WS), and each wavelength switch is based on an array of all-optical SOA-based wavelength converters that is used for the wavelength routing.

Fig. 10. The IRIS architecture (WC: wavelength converter, WS: wavelength switch, HD: header detector, TB: time buffer).

The second stage is a time switch that consists of an array of optical time buffers. The time switch is composed of an array of WCs and two AWGs interconnected by a number of optical delay lines, each one with a different delay. Based on the delay that needs to be added, the WC converts the optical signal to a specific wavelength so that it is forwarded to the delay line with the required time delay. The delayed signals are multiplexed through the second AWG and are routed to the third stage (a second space switch). Based on the final destination port, the signal is converted to the required wavelength for the routing. Due to the periodic operation of the third space switch, the scheduling is local and deterministic for each time buffer, which greatly reduces control-plane complexity and removes the need for optical random access memory. Using 40 Gb/s data packets and 80x80 AWGs allows this architecture to scale to 80^2 x 40 Gb/s = 256 Tb/s.

The IRIS project has been prototyped using 4 XFP transceivers at 10 Gbps and has been implemented on an FPGA board. A 40 Gb/s wavelength converter is used that is based on a fully-integrated circuit with an SOA for the wavelength conversion [51]. The wavelength conversion takes less than 1 ns. The conversion wavelength is either supplied internally by an integrated fast-tunable multi-frequency laser (MFL) or externally by a sampled-grating distributed Bragg reflector (SG-DBR) laser [51]. The optical switch is based on a passive silica chip with dual 40x40 AWGs.

J. Bidirectional photonic network

Bergman from Columbia University has presented an optical interconnection network for data center networks based on bidirectional SOAs [52]. The proposed scheme is based on bidirectional SOA-based 2x2 switches that can be scaled efficiently in a tree-based topology, as shown in Figure 11. The nodes connected to this network can be either server blades or ToR switches. Each of the switching nodes is an SOA-based 2x2 switch that consists of six SOAs. Each port can establish a connection with any other port in nanoseconds. The switching nodes are connected as a Banyan network (k-ary, n-trees) supporting k^n processing nodes. The use of bidirectional switches can provide significant advantages in terms of component cost, power consumption and footprint compared to other SOA-based architectures like the broadcast-and-select architecture.
J. Bidirectional photonic network

Bergman from Columbia University has presented an optical interconnection network for data centers based on bidirectional SOAs [52]. The proposed scheme is based on bidirectional SOA-based 2x2 switches that can be scaled efficiently in a tree-based topology, as shown in Figure 11. The nodes connected to this network can be either server blades or ToR switches. Each switching node is an SOA-based 2x2 switch that consists of six SOAs, and each port can establish a connection with any other port in nanoseconds. The switching nodes are connected as a Banyan network (k-ary n-trees) supporting k^n processing nodes.

Fig. 11. The bidirectional switching node design (the attached nodes are servers or ToR switches).

The use of bidirectional switches can provide significant advantages in terms of component cost, power consumption, and footprint compared to other SOA-based architectures such as the broadcast-and-select architecture. A prototype has been developed that demonstrates the functionality of the proposed scheme using 4 nodes at 40 Gbps [53]. The optical switching nodes are organized in a three-stage Omega network with two nodes in each stage, and the bit error rate achieved using four wavelengths was less than 10^-12. The main advantage of this scheme is that it can be scaled efficiently to a large number of nodes with a reduced number of optical modules and thus reduced power consumption. The total number of nodes is constrained only by the congestion management and the total required latency.

K. Data vortex

Bergman from Columbia University has also presented a distributed interconnection network called Data Vortex [54],[55]. Data Vortex mainly targets high performance computing (HPC) systems, but it can also be applied to data center interconnects [56]. The network consists of nodes that can route both packet and circuit switched traffic simultaneously in a configurable manner, based on semiconductor optical amplifiers (SOAs). The SOAs, organized in a gate-array configuration, serve as the photonic switching elements, and the broadband capability of the SOA gates allows the transmitted data to be organized onto multiple optical channels. A 16-node system has been developed in which the SOA array is dissected into subsets of four, with each group corresponding to one of the four input ports [57]. Similarly, one gate in each subset corresponds to one of four output ports, enabling non-blocking operation of the switching node. Hence, the number of SOAs grows with the square of the number of nodes (e.g. for 32 nodes we would require 1024 SOAs).
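As a concrete illustration of this gate-array organization, the following toy model (an assumption-laden sketch, not the control logic of the demonstrated system) shows how one SOA gate out of an N x N array is selected to connect a given input port to a given output port, and how the gate count grows quadratically with the port count:

# Toy model of an SOA gate-array switching stage: N*N gates, one per
# (input, output) pair. Enabling one gate per input at a time yields
# non-blocking operation, at the cost of N**2 SOAs.

def gate_index(input_port: int, output_port: int, n_ports: int) -> int:
    """Index of the SOA gate that connects input_port to output_port."""
    return input_port * n_ports + output_port


def soa_count(n_ports: int) -> int:
    """Total number of SOA gates needed for an n_ports x n_ports gate array."""
    return n_ports * n_ports


if __name__ == "__main__":
    # A 4x4 gate array uses 16 SOAs, organized as four subsets of four
    # (one subset per input port), as described in the text.
    print(soa_count(4))         # 16
    print(soa_count(32))        # 1024, matching the scaling example in the text
    print(gate_index(2, 3, 4))  # gate 11 connects input 2 to output 3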

The data vortex topology is composed entirely of 2x2 switching elements arranged in a fully connected, directed graph with terminal symmetry. The single-packet routing nodes are wholly distributed and require no centralized arbitration. The topology is divided into hierarchies, or cylinders, which are analogous to the stages of a conventional banyan network, as depicted in Figure 12. The data vortex topology exhibits a modular architecture and can therefore be scaled efficiently to a large number of nodes.

Fig. 12. The data vortex architecture.

In all multistage interconnection networks, an important parameter is the number of routing nodes a packet traverses before reaching its destination. For the data vortex, which is based on 2x2 switches, the number of intermediate nodes M scales logarithmically with the number of ports N:

M ∝ log2 N

The main drawback of the data vortex architecture is that the banyan multi-stage scheme becomes extremely complex when it is scaled to large networks. As the number of nodes increases, packets have to traverse several nodes before reaching their destination, causing increased and non-deterministic latency.

L. Commercial optical interconnects

1) Polatis: While all of the above schemes have been proposed by universities or industrial research centers, there is also a commercially available optical interconnect for data centers provided by Polatis Inc. The Polatis optical switch is based on piezo-electric optical circuit switching and beam-steering technology; the scheme is therefore a centralized optical switch that can be reconfigured based on the network traffic demand. The most important features of the switch are its low power consumption (the reported Polatis power consumption is only 45 W, compared to more than 900 W for a legacy data center switch [58]) and the fact that it is data-rate agnostic, meaning that it can support 10 Gbps, 40 Gbps and 100 Gbps. The only drawback of this commercial scheme is that, like MEMS-based optical circuit switches, it has an increased reconfiguration time (according to the data sheets the maximum switching time is less than 20 ms).

2) Intune Networks: Intune Networks has developed the Optical Packet Switch and Transport (OPST) technology that is based on fast tunable optical transmitters [59]. Every node is attached to the OPST fabric (a ring) through a Fast Tunable Laser (FTL) and a burst-mode receiver (BMR), as depicted in Figure 13. Each node is assigned a unique receive wavelength, and any other node can transmit to it by tuning its transmitter to that wavelength in real time.

Fig. 13. The Intune OPST architecture (nodes, e.g. ToR switches, attached to a WDM ring through a fast tunable laser and a burst-mode receiver).

Although the Intune OPST technology is advertised mainly for transport networks, it could also be used to replace the core network of a data center. The Intune network can support 80 wavelengths per ring in the C-band, up to 16 nodes can be attached to the ring, and each node can transmit up to 80 Gbps. The power consumption of such a topology is 1.6 kW, which is much lower than the power consumption of an equivalent network based on commodity switches.
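A minimal sketch of the OPST addressing idea follows: each node owns one receive wavelength, and a sender simply tunes its fast tunable laser to the destination's channel before transmitting a burst. The channel numbering and class/method names are illustrative assumptions, not Intune's actual interfaces:

# Toy model of wavelength-addressed transmission on an OPST-style WDM ring:
# each node owns one receive channel; senders tune to the destination's
# channel. Limits follow the figures quoted in the text (80 C-band
# channels, up to 16 nodes per ring).

NUM_CHANNELS = 80
MAX_NODES = 16


class OpstRing:
    def __init__(self, node_ids):
        if len(node_ids) > MAX_NODES:
            raise ValueError("at most 16 nodes per ring")
        # Assign each node a unique receive channel (assumed 1:1 mapping).
        self.rx_channel = {node: idx for idx, node in enumerate(node_ids)}

    def send(self, src, dst, burst):
        """Tune src's laser to dst's receive channel and 'transmit' the burst."""
        channel = self.rx_channel[dst]
        assert channel < NUM_CHANNELS
        return {"from": src, "to": dst, "channel": channel, "payload": burst}


if __name__ == "__main__":
    ring = OpstRing([f"tor{i}" for i in range(4)])
    print(ring.send("tor0", "tor3", b"hello"))
    # {'from': 'tor0', 'to': 'tor3', 'channel': 3, 'payload': b'hello'}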
V. COMPARISON

This section categorizes the proposed schemes and provides a qualitative comparison of their features, such as connectivity, performance, scalability, and technology. As shown in Table II, the vast majority of the papers were presented in 2010. This shows that optical interconnects have gained significant attention recently, as they seem to provide a promising and viable solution for future data center networks compared to commodity switches. Another explanation for the impressive number of publications related to optical interconnects is the growth of warehouse-scale data centers that require high-bandwidth interconnects. Emerging web applications (such as social networks, streaming video, etc.) and cloud computing have created the need for more powerful warehouse-scale data centers with reduced latency and increased network bandwidth.

TABLE II
DATA CENTER OPTICAL INTERCONNECTS SUMMARY

Scheme             | Univ./Company        | Year | Cap. Lim. | Scalability
c-through [33]     | Rice U., CMU, Intel  | 2010 | Transc.   | low (a)
Helios [36]        | UC San Diego         | 2010 | Transc.   | low
DOS [38][41]       | UC Davis             | 2010 | WC        | medium
Proteus [42]       | Illinois, NEC        | 2010 | Transc.   | medium (b)
Petabit [44]       | Poly-NY              | 2010 | WC        | high
OSMOSIS [46]       | IBM                  | 2004 | SOA       | medium
Space-WL [48]      | SSSUP Pisa           | 2010 | SOA       | high
E-RAPID [49]       | Ohio U.              | 2010 | Transc.   | high
IRIS [50]          | Alcatel-Lucent       | 2010 | WC        | high
Bidirectional [52] | Columbia U.          | 2009 | SOA       | high
Data vortex [54]   | Columbia U.          | 2006 | SOA       | high
Polatis [58]       | Polatis Inc.         | 2010 | Transc.   | low (c)
OPST [59]          | Intune Inc.          | 2010 | -         | low (c)

(a) c-through: the prototype was based on emulated, modified commodity switches in which the optical links were emulated with virtual LANs.
(b) Proteus: the prototype was based on 4 links using SFP transceivers and a WSS switch.
(c) Commercially available.

A. Technology

As shown in Table II, the majority of the optical interconnects are all-optical, while only the c-through and the Helios schemes are hybrid. The hybrid schemes offer the advantage of an incremental upgrade of an operating data center with commodity switches, reducing the cost of the upgrade: the ToR switches can be expanded by adding the optical modules that increase the bandwidth and reduce the latency, while the currently deployed Ethernet network is used for all-to-all communication as before. Both of these systems are based on circuit switching for the optical network. Thus, if the traffic demand consists of bulky flows that last long enough to compensate for the reconfiguration overhead, the overall network bandwidth can be enhanced significantly. On the other hand, these hybrid schemes do not provide a viable long-term solution for future data center networks. In this case, all-optical schemes provide a long-term solution that can sustain the increased bandwidth with low latency and reduced power consumption. But since these schemes require the complete replacement of the commodity switches, they must offer significantly better characteristics in order to justify the increased cost of the replacement (CAPEX).

B. Connectivity

Another major distinction between the optical interconnects is whether they are based on circuit or packet switching. Circuit switches are usually based on optical MEMS switches that have an increased reconfiguration time (on the order of a few ms). Thus, these schemes mainly target data center networks in which long-lived, bulky data transfers are required, such as enterprise networks. Furthermore, the circuit-based optical networks target data centers where the average number of concurrent traffic flows per server can be covered by the number of circuit connections in the optical switches. On the other hand, packet-based optical switches are closer to the networks currently used in data centers. Packet-based switching assumes either an array of fixed lasers or fast tunable transmitters that select the destination port by selecting the appropriate wavelength. Packet-based optical switching fits better to data center networks in which the duration of a flow between two nodes is very small and all-to-all connectivity is usually required. An interesting exception is the Proteus architecture: although it is based on circuit switching, it provides all-to-all communication through the use of multiple hops when two nodes are not directly connected. The Petabit architecture seems to combine efficiently the best features of electronics and optics: electronic buffers and an efficient scheduler are used for congestion management in the nodes, while all-optical frame-based switching is used for the data plane.
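To illustrate the circuit-versus-packet trade-off quantitatively, the sketch below estimates the break-even transfer size above which routing a flow over a reconfigurable optical circuit pays off. The default numbers (a 20 ms circuit switching time and 10 Gb/s optical links) are borrowed from figures quoted elsewhere in this survey, while the 10% overhead threshold and the decision rule itself are simplified assumptions made here, not the policy of any specific scheme:

# Rough amortization model: a circuit is worthwhile only if the flow is
# large enough that the reconfiguration time stays a small fraction of
# the transfer time on the optical circuit.

def break_even_bytes(reconfig_s: float, circuit_gbps: float,
                     overhead_fraction: float = 0.1) -> float:
    """Smallest transfer (bytes) for which reconfiguration adds at most
    `overhead_fraction` of the transfer time on the circuit."""
    transfer_s = reconfig_s / overhead_fraction
    return transfer_s * circuit_gbps * 1e9 / 8


def use_optical_circuit(flow_bytes: float, reconfig_s: float = 20e-3,
                        circuit_gbps: float = 10.0) -> bool:
    """Simplified offload decision for a hybrid electrical/optical network."""
    return flow_bytes >= break_even_bytes(reconfig_s, circuit_gbps)


if __name__ == "__main__":
    print(break_even_bytes(20e-3, 10.0) / 1e6, "MB")  # 250.0 MB
    print(use_optical_circuit(1e9))                   # True  (a 1 GB transfer)
    print(use_optical_circuit(10e6))                  # False (a 10 MB flow)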
C. Scalability

A major requirement for the adoption of optical networks in data centers is scalability. Optical networks need to scale easily to a large number of nodes (e.g. ToR switches), especially in warehouse-scale data centers. The hybrid schemes that are based on circuit switches have limited scalability, since they are constrained by the number of optical ports of the switch (e.g. the Glimmerglass optical circuit switch used in Helios can support up to 64 ports). Many proposed schemes, such as OSMOSIS, Proteus or DOS, are implemented around a central switch that can accommodate a limited number of nodes (usually constrained by the number of wavelength channels). However, in many cases the proposed schemes can be scaled in the same way as current networks. For example, the E-RAPID scheme can be scaled efficiently by connecting the modules in clusters and then connecting the clusters through a high data rate optical ring. In the same sense, many of the proposed schemes can be used to connect several ToR switches efficiently (forming a cluster), and then a higher level of the same topology can be used to connect the clusters.
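As a back-of-envelope illustration of this two-level clustering approach, the small sketch below computes how many ToR switches a design can reach when first-level modules with p ports group ToRs into clusters and a second level of the same module size interconnects the clusters. The port counts are arbitrary examples, not parameters of any specific scheme:

# Two-level scaling estimate: level-1 modules with `ports` ports each form
# clusters of ToRs (one port per ToR plus `uplinks` ports toward level 2);
# a level-2 module of the same size interconnects the clusters.

def max_tors_two_level(ports: int, uplinks: int = 1) -> int:
    """Upper bound on ToR switches reachable with two levels of p-port modules."""
    tors_per_cluster = ports - uplinks   # ports left for ToRs in each cluster
    clusters = ports // uplinks          # clusters a level-2 module can host
    return tors_per_cluster * clusters


if __name__ == "__main__":
    # e.g. 64-port modules (the Glimmerglass switch size quoted in the text)
    print(max_tors_two_level(64, uplinks=1))   # 4032 ToRs
    print(max_tors_two_level(64, uplinks=4))   # 960 ToRs with 4 uplinks/cluster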

The bidirectional photonic switch can also be scaled to a large number of nodes in a banyan network using the 2x2 switching nodes. The data vortex is also a highly scalable system, since it is fully distributed, but the large number of hops that may be required between two nodes can affect the communication latency. The Petabit and IRIS architectures, although based on a central switch, can be scaled efficiently to a large number of nodes by adopting a Clos network. Finally, the commercially available schemes have low scalability, since they are based on modules with a limited number of ports.

D. Capacity

Besides scalability in terms of number of nodes, the proposed schemes must also be easy to upgrade to higher capacities per node. The circuit-switched architectures that are based on MEMS switches (c-through, Helios and Proteus) can easily be upgraded to 40 Gbps, 100 Gbps or higher bit rates, since the MEMS switches are transparent to the data rate. Therefore, in these architectures the maximum capacity per node is determined by the data rate of the optical transceivers. DOS, Petabit and the IRIS architecture are all based on tunable wavelength converters for the switching; therefore the maximum capacity per node is constrained by the maximum supported data rate of the wavelength converters (currently up to 160 Gbps). Finally, OSMOSIS, Space-WL, the Bidirectional scheme and the Data Vortex are all based on SOA devices for the optical switching, and therefore the maximum supported capacity per node is defined by the data rates that SOA technology can support. Table II shows the capacity-limiting technology (Cap. Lim.) of each architecture, which essentially defines the maximum supported data rate.

E. Routing

The routing of packets in data center networks is quite different from Internet routing (e.g. OSPF), in order to take advantage of the network capacity. The routing algorithms can significantly affect the performance of the network; therefore efficient routing schemes must be deployed in the optical networks. In the case of the hybrid schemes (c-through and Helios), the electrical network is based on a tree topology while the optical network is based on direct links between the nodes. Therefore, the routing is performed by a centralized scheduler that performs a bipartite graph allocation and assigns the high-bandwidth requests to the optical links. If a packet has to be transmitted to a server with an established optical link, it is forwarded directly to the optical network; otherwise it is routed through the electrical network. On the other hand, in the case of the DOS architecture, the packets are sent directly to the AWGR switch and a control plane routes the packets by controlling the tunable wavelength converters. The main drawback of this scheme is that the scheduler in the control plane must be fast enough to sustain the scheduling of the packets. In all the other schemes the routing is performed at the node level, where each packet is forwarded to a different port tuned to a specific wavelength based on the destination address. The IRIS and the PetaStar schemes can also provide higher reliability, since the network is based on a Clos topology; in the case of a (transient or permanent) failure the packets can be sent through different routes. However, during normal operation care must be taken to avoid out-of-order delivery of packets that belong to the same flow.
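The following sketch illustrates the kind of centralized circuit allocation described above for the hybrid schemes, using a simple greedy heuristic: the heaviest ToR-to-ToR demands are matched first, subject to each ToR holding at most one optical circuit. This is a deliberately simplified stand-in; c-through, for instance, is described as using a maximum-weight matching (Edmonds' algorithm [35]) rather than this greedy rule:

# Greedy stand-in for the centralized optical-circuit scheduler of the
# hybrid schemes: given a ToR-to-ToR traffic demand estimate, assign each
# optical circuit to the heaviest remaining demand, one circuit per ToR.

def schedule_circuits(demand: dict) -> list:
    """demand maps (src_tor, dst_tor) -> estimated bytes queued.
    Returns a list of (src, dst) pairs to connect through the optical switch."""
    used = set()
    circuits = []
    for (src, dst), load in sorted(demand.items(), key=lambda kv: -kv[1]):
        if load > 0 and src not in used and dst not in used:
            circuits.append((src, dst))
            used.update((src, dst))
    return circuits


def route(packet_dst_tor, src_tor, circuits) -> str:
    """Hybrid forwarding rule: use the circuit if one exists, else Ethernet."""
    return "optical" if (src_tor, packet_dst_tor) in circuits else "electrical"


if __name__ == "__main__":
    demand = {("t1", "t2"): 9e9, ("t1", "t3"): 2e9,
              ("t3", "t4"): 5e9, ("t2", "t4"): 1e9}
    circuits = schedule_circuits(demand)
    print(circuits)                      # [('t1', 't2'), ('t3', 't4')]
    print(route("t2", "t1", circuits))   # optical
    print(route("t3", "t1", circuits))   # electrical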
F. Prototypes

The high cost of optical components (e.g. a WSS can cost several hundred dollars) prohibits the implementation of fully operational prototypes. However, in some cases prototypes have been implemented that demonstrate either a proof of concept or a complete system. The Helios architecture has been fully implemented, since it is based on a commercially available optical circuit switch that is used in telecommunication networks. The Data Vortex has also been implemented at small scale, showing a proof of concept for a small number of nodes. In the case of the c-through scheme, although it has not been implemented due to the lack of optical components, an emulated system has been evaluated in which the optical links are established by configuring the commodity switches with virtual LANs.

VI. COST AND POWER CONSUMPTION

The cost of network devices is a significant issue in the design of a data center. However, many architectures presented in this study are based on optical components that are not commercially available, thus it is difficult to compare their cost. The c-through, Helios and Proteus schemes are based on readily available optical modules, so their cost is significantly lower than that of other schemes which require optical components designed especially for these networks. Other schemes, such as the Data Vortex or DOS, are based on SOA-based modules that can be implemented at relatively low cost. It is interesting to note, however, that in current and future data centers the operational cost (OPEX) may exceed the equipment cost (CAPEX), because a significant portion of the cost is allocated to the electricity bill. According to a study from IDC [60],[61], the total cost of the IT equipment remains roughly the same over the years, while the cost of power and cooling of the data centers increases significantly. Figure 14 depicts the increase in the cost of IT equipment and of power and cooling for data centers. As shown in this figure, during the period 2005-2010 the compound annual growth rate (CAGR) for IT spending was only 2.7%, while the CAGR for power and cooling was 11.2%. In 2005 the electricity bill was only about half of the total operational cost, while in the near future it will be almost the same as the IT cost. Therefore, even if the cost of the optical interconnects is much higher than that of commodity switches, the lower power consumption that they offer may reduce the operational cost significantly. Until now there has not been any comparative study on the benefits of optical interconnects. In this section we perform a new comparative study based on power consumption and cost analysis to evaluate the benefits of optical interconnects in data center networks.
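A quick worked check of the trend cited above: compounding the two growth rates from a 2005 baseline where power and cooling is taken as half of the IT spend (per the text) shows how the two cost lines converge. The 2005 dollar amounts below are placeholders chosen only to fix the ratio, not figures from the IDC report:

# Compound the two CAGRs quoted in the text (IT: 2.7%, power & cooling: 11.2%)
# from an arbitrary 2005 baseline where power/cooling = 0.5 * IT spend.

def project(base: float, cagr: float, years: int) -> float:
    """Value after `years` of compound growth at rate `cagr`."""
    return base * (1.0 + cagr) ** years


if __name__ == "__main__":
    it_2005, power_2005 = 100.0, 50.0       # placeholder units, ratio 2:1
    for year in range(2005, 2016):
        n = year - 2005
        it = project(it_2005, 0.027, n)
        power = project(power_2005, 0.112, n)
        print(year, round(power / it, 2))
        # ratio climbs from 0.50 past 1.0 within about a decade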

Fig. 14. Worldwide expense to power and cool data centers: new server spend versus power and cooling cost, 2005-2010. Source: IDC Inc. [60].

Fig. 15. Cash balance for the replacement of commodity switches with optical interconnects, for relative costs of 100%, 150% and 200% of the current DCN and relative power consumption between 0.2 and 1.

To estimate the reduction in cost of an optical interconnect, we study the replacement of the current switches with optical interconnects in the case of a data center with 1536 servers, such as the one presented in a data center simulator [62]. According to this simulator, a two-tier topology for this data center requires 512 ToR switches and 16 aggregate switches (with 32x10 Gbps ports), and the power consumption of the data center network is 77 kW. The cost of an aggregate switch with 32 ports at 10 Gbps is around $5k [16]. Due to the lack of cost figures for an integrated optical interconnect, this paper studies the return on investment (ROI) for three different relative costs compared to commodity switches: the same cost, one and a half times the cost, and twice the cost. Figure 15 shows the relative cost of the optical interconnects (CAPEX and OPEX) compared to the OPEX of the current data center network, for several values of the optical power consumption over a 5-year timeframe (the cost of electricity has been assumed to be $0.1/kWh). The cost balance has been evaluated in the following way:

Cost balance = OPEX_CDCN - (CAPEX_OI + OPEX_OI)

where CDCN stands for the current data center network and OI for the optical interconnects. As shown in the figure, if the cost of the optical interconnect is the same as that of the current switches, a positive ROI can be achieved even if the optical interconnects consume 0.8 of the power of the commodity switches. On the other hand, if the optical interconnect costs twice the price of the current switches, then it must consume less than 0.5 of the current power consumption to achieve ROI within the 5-year time frame. Therefore, given that optical interconnects are much more energy efficient than electrical switching [63], they can also be a cost-efficient alternative for future data center networks. Note that only the replacement of the current switches is considered here; in the case of a new data center design (green field), it is clear that energy-efficient optical interconnects can afford an even higher CAPEX and still achieve a shorter ROI time frame. Finally, in cases where data centers play a key role in financial markets, such as stock exchanges [64], the added value of high-bandwidth, low-latency optical interconnects is even more significant, as low-latency communication has a major impact on stock exchange transactions.
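The cost-balance relation above can be turned into a small calculation, sketched below. The network power (77 kW), electricity price ($0.1/kWh), 5-year horizon and switch price come from the text, while the baseline CAPEX used here (16 aggregate switches at $5k each) is a simplification made for illustration only, since the text does not price the ToR layer; the exact break-even points therefore need not match Figure 15:

# Cost-balance sketch for replacing the electrical network with an optical
# interconnect, following Cost = OPEX_CDCN - (CAPEX_OI + OPEX_OI).

HOURS_PER_YEAR = 24 * 365
NETWORK_POWER_KW = 77.0              # from the simulator-based example in the text
ELECTRICITY_USD_PER_KWH = 0.10
YEARS = 5
BASELINE_CAPEX_USD = 16 * 5_000      # assumption: aggregate switches only


def cost_balance(relative_capex: float, relative_power: float) -> float:
    """Savings (>0) or loss (<0) over YEARS of running the optical network
    instead of keeping the current data center network (CDCN)."""
    opex_cdcn = NETWORK_POWER_KW * HOURS_PER_YEAR * YEARS * ELECTRICITY_USD_PER_KWH
    capex_oi = relative_capex * BASELINE_CAPEX_USD
    opex_oi = relative_power * opex_cdcn
    return opex_cdcn - (capex_oi + opex_oi)


if __name__ == "__main__":
    for rel_capex in (1.0, 1.5, 2.0):
        for rel_power in (0.2, 0.5, 0.8):
            print(rel_capex, rel_power, round(cost_balance(rel_capex, rel_power)))
    # Break-even points shift with the assumed baseline CAPEX.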
VII. CONCLUSIONS

Optical interconnects appear to be a promising solution for data center networks, offering high bandwidth, low latency and reduced energy consumption. In this paper, a survey of the most recent schemes in the domain of optical interconnects for data centers has been presented, together with a qualitative categorization and comparison of the proposed schemes. Some schemes are hybrid and are proposed as an upgrade of current networks by adding optical circuits, while others propose a complete replacement of the current switches and target future data center networks. Some of the schemes are based on readily available optical components, while other schemes are based on advanced optical technologies that are expected to become cost efficient in the near future. The majority of the schemes are based on SOA technology for the switching, as SOAs provide faster reconfiguration times than MEMS switches and allow all-to-all communication, while the majority of the SOA-based network topologies also provide high scalability. However, novel schemes such as Proteus show that high-performance optical networks supporting all-to-all communication, with low latency and reduced power consumption, can be implemented even with readily available optical components. The use of readily available components can significantly affect the adoption of optical schemes in data centers. However, the schemes that are based on SOA and wavelength-converter technologies can provide higher capacities and better scalability, and can therefore sustain the requirements of future data center networks in a more efficient way. In any case, optical interconnects seem to provide a promising and viable solution that can efficiently face the demanding requirements of future data center networks in terms of power consumption, bandwidth and latency.

REFERENCES

[1] S. Sakr, A. Liu, D. Batista, and M. Alomari, A Survey of Large Scale Data Management Approaches in Cloud Environments, IEEE Communications Surveys & Tutorials, vol. 13, no. 3, pp. 311-336, Jul. 2011.
[2] G. Schulz, The Green and Virtual Data Center, 1st ed. Boston, MA, USA: Auerbach Publications, 2009.
[3] L. Schares, D. M. Kuchta, and A. F. Benner, Optics in future data center networks, in Symposium on High-Performance Interconnects, 2010, pp. 104-108.

[4] P. Pepeljugoski, J. Kash, F. Doany, D. Kuchta, L. Schares, C. Schow, M. Taubenblatt, B. J. Offrein, and A. Benner, Low Power and High Density Optical Interconnects for Future Supercomputers, in Optical Fiber Communication Conference and Exhibit, OThX2, 2010.
[5] M. A. Taubenblatt, J. A. Kash, and Y. Taira, Optical interconnects for high performance computing, in Communications and Photonics Conference and Exhibition (ACP), 2009, pp. 1-2.
[6] R. Bolla, R. Bruschi, F. Davoli, and F. Cucchietti, Energy Efficiency in the Future Internet: A Survey of Existing Approaches and Trends in Energy-Aware Fixed Network Infrastructures, IEEE Communications Surveys and Tutorials, vol. 13, no. 2, pp. 223-244, 2011.
[7] Y. Zhang, P. Chowdhury, M. Tornatore, and B. Mukherjee, Energy Efficiency in Telecom Optical Networks, IEEE Communications Surveys and Tutorials, vol. 12, no. 4, pp. 441-458, 2010.
[8] Make IT Green: Cloud Computing and its Contribution to Climate Change. Greenpeace International, 2010.
[9] Report to Congress on Server and Data Center Energy Efficiency. U.S. Environmental Protection Agency, ENERGY STAR Program, 2007.
[10] Where does power go? GreenDataProject, available online at: http://www.greendataproject.org, 2008.
[11] SMART 2020: Enabling the low carbon economy in the information age. A report by The Climate Group on behalf of the Global eSustainability Initiative (GeSI), 2008.
[12] R. Ramaswami, K. Sivarajan, and G. Sasaki, Optical Networks: A Practical Perspective, 3rd ed. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2009, pp. 447-452.
[13] C. F. Lam, H. Liu, B. Koley, X. Zhao, V. Kamalov, and V. Gill, Fiber optic communication technologies: what's needed for datacenter network operations, Communications Magazine, vol. 48, pp. 32-39, July 2010.
[14] M. Glick, Optical interconnects in next generation data centers: An end to end view, in Proceedings of the 2008 16th IEEE Symposium on High Performance Interconnects, 2008, pp. 178-181.
[15] C. Minkenberg, The rise of the interconnects, in HiPEAC Interconnects cluster meeting, Barcelona, 2010.
[16] A. Davis, Photonics and Future Datacenter Networks, in HOT Chips, A Symposium on High Performance Chips, Stanford, 2010.
[17] Vision and Roadmap: Routing Telecom and Data Centers Toward Efficient Energy Use. Vision and Roadmap Workshop on Routing Telecom and Data Centers, 2009.
[18] D. Lee, Scaling Networks in Large Data Centers. OFC/NFOEC, Invited Talk, 2011.
[19] U. Hoelzle and L. A. Barroso, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, 1st ed. Morgan and Claypool Publishers, 2009.
[20] Cisco Data Center Interconnect Design and Deployment Guide. Cisco Press, 2010.
[21] K. Chen, C. Hu, X. Zhang, K. Zheng, Y. Chen, and A. V. Vasilakos, Survey on routing in data centers: insights and future directions, IEEE Network, vol. 25, no. 4, pp. 6-10, Jul. 2011.
[22] A. Greenberg, J. R. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. A. Maltz, P. Patel, and S. Sengupta, VL2: a scalable and flexible data center network, in Proceedings of the ACM SIGCOMM 2009 conference on Data communication, ser. SIGCOMM '09, 2009, pp. 51-62.
[23] S. Kandula, S. Sengupta, A. Greenberg, P. Patel, and R. Chaiken, The nature of data center traffic: measurements & analysis, in Proceedings of the 9th ACM SIGCOMM conference on Internet measurement (IMC '09), 2009, pp. 202-208.
[24] T. Benson, A. Anand, A. Akella, and M. Zhang, Understanding data center traffic characteristics, in Proceedings of the 1st ACM workshop on Research on enterprise networking, 2009, pp. 65-72.
[25] T. Benson, A. Akella, and D. A. Maltz, Network traffic characteristics of data centers in the wild, in Proceedings of the 10th annual conference on Internet measurement (IMC), 2010, pp. 267-280.
[26] G. I. Papadimitriou, C. Papazoglou, and A. S. Pomportsis, Optical switching: Switch fabrics, techniques, and architectures, Journal of Lightwave Technology, vol. 21, no. 2, p. 384, Feb. 2003.
[27] Wavelength Selective Switches for ROADM applications. Datasheet, Finisar Inc., 2008.
[28] G. Agrawal, Fiber-optic communication systems, ser. Wiley series in microwave and optical engineering. Wiley-Interscience, 2002, pp. 232-241.
[29] V. Eramo and M. Listanti, Power Consumption in Bufferless Optical Packet Switches in SOA Technology, J. Opt. Commun. Netw., vol. 1, no. 3, pp. B15-B29, Aug. 2009.
[30] R. Ramaswami, K. Sivarajan, and G. Sasaki, Optical Networks: A Practical Perspective, 3rd ed. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2009, pp. 221-223.
[31] J. M. Yates, M. P. Rumsewicz, and J. P. R. Lacey, Wavelength converters in dynamically-reconfigurable WDM networks, IEEE Communications Surveys and Tutorials, vol. 2, no. 2, pp. 2-15, 1999.
[32] J. F. Pina, H. J. A. da Silva, P. N. Monteiro, J. Wang, W. Freude, and J. Leuthold, Performance Evaluation of Wavelength Conversion at 160 Gbit/s using XGM in Quantum-Dot Semiconductor Optical Amplifiers in MZI configuration, in Photonics in Switching, 2007.
[33] G. Wang, D. G. Andersen, M. Kaminsky, K. Papagiannaki, T. E. Ng, M. Kozuch, and M. Ryan, c-Through: Part-time Optics in Data Centers, in Proceedings of the ACM SIGCOMM 2010 conference, ser. SIGCOMM '10, 2010, pp. 327-338.
[34] G. Wang, D. G. Andersen, M. Kaminsky, M. Kozuch, T. E. Ng, K. Papagiannaki, M. Glick, and L. Mummert, Your data center is a router: The case for reconfigurable optical circuit switched paths, in Proceedings of ACM HotNets VIII, ser. HotNets '09, 2009.
[35] J. Edmonds, Paths, trees, and flowers, Canadian Journal of Mathematics, vol. 17, pp. 449-467, 1965.
[36] N. Farrington, G. Porter, S. Radhakrishnan, H. H. Bazzaz, V. Subramanya, Y. Fainman, G. Papen, and A. Vahdat, Helios: a hybrid electrical/optical switch architecture for modular data centers, in Proceedings of the ACM SIGCOMM 2010, 2010, pp. 339-350.
[37] Glimmerglass Intelligent Optical System, Datasheet, available online at www.glimmerglass.com.
[38] X. Ye, Y. Yin, S. J. B. Yoo, P. Mejia, R. Proietti, and V. Akella, DOS: A scalable optical switch for datacenters, in Proceedings of the 6th ACM/IEEE Symposium on Architectures for Networking and Communications Systems, ser. ANCS '10, 2010, pp. 24:1-24:12.
[39] N. McKeown, The iSLIP scheduling algorithm for input-queued switches, IEEE/ACM Trans. Netw., vol. 7, pp. 188-201, April 1999.
[40] Y. Hida, Y. Hibino, T. Kitoh, Y. Inoue, M. Itoh, T. Shibata, A. Sugita, and A. Himeno, 400-channel 25-GHz spacing arrayed-waveguide grating covering a full range of C- and L-bands, in Optical Fiber Communication Conference and Exhibit, 2001.
[41] R. Proietti, X. Ye, Y. Yin, A. Potter, R. Yu, J. Kurumida, V. Akella, and S. J. B. Yoo, 40 Gb/s 8x8 Low-latency Optical Switch for Data Centers, in Optical Fiber Communication Conference (OFC/NFOEC), 2011.
[42] A. Singla, A. Singh, K. Ramachandran, L. Xu, and Y. Zhang, Proteus: a topology malleable data center network, in Proceedings of the Ninth ACM SIGCOMM Workshop on Hot Topics in Networks, ser. HotNets '10, 2010, pp. 8:1-8:6.
[43] A. Singla, A. Singh, K. Ramachandran, L. Xu, and Y. Zhang, Feasibility Study on Topology Malleable Data Center Networks (DCN) Using Optical Switching Technologies, in Proceedings of the Optical Fiber Communication Conference and Exposition (OFC) and the National Fiber Optic Engineers Conference (NFOEC), 2011.
[44] K. Xia, Y.-H. Kao, M. Yang, and H. J. Chao, Petabit Optical Switch for Data Center Networks, Technical report, Polytechnic Institute of NYU, 2010.
[45] H. J. Chao, Z. Jing, and K. Deng, PetaStar: A petabit photonic packet switch, IEEE Journal on Selected Areas in Communications (JSAC), Special Issue on High-Performance Optical/Electronic Switches/Routers for High-Speed Internet, vol. 21, pp. 1096-1112, 2003.
[46] R. Luijten, W. E. Denzel, R. R. Grzybowski, and R. Hemenway, Optical interconnection networks: The OSMOSIS project, in The 17th Annual Meeting of the IEEE Lasers and Electro-Optics Society, 2004.
[47] R. Hemenway, R. Grzybowski, C. Minkenberg, and R. Luijten, Optical-packet-switched interconnect for supercomputer applications, J. Opt. Netw., vol. 3, no. 12, pp. 900-913, Dec. 2004.
[48] O. Liboiron-Ladouceur, I. Cerutti, P. Raponi, N. Andriolli, and P. Castoldi, Energy-efficient design of a scalable optical multiplane interconnection architecture, IEEE Journal of Selected Topics in Quantum Electronics, no. 99, pp. 1-7, 2010.
[49] A. K. Kodi and A. Louri, Energy-Efficient and Bandwidth-Reconfigurable Photonic Networks for High-Performance Computing (HPC) Systems, IEEE Journal of Selected Topics in Quantum Electronics, no. 99, pp. 1-12, 2010.
[50] J. Gripp, J. E. Simsarian, J. D. LeGrange, P. Bernasconi, and D. T. Neilson, Photonic Terabit Routers: The IRIS Project, in Optical Fiber Communication Conference. Optical Society of America, 2010, p. OThP3.
[51] J. Simsarian, M. Larson, H. Garrett, H. Hu, and T. Strand, Less than 5-ns wavelength switching with an SG-DBR laser, Photonics Technology Letters, vol. 18, pp. 565-567, 2006.
[52] A. Shacham and K. Bergman, An experimental validation of a wavelength-striped, packet switched, optical interconnection network, J. Lightwave Technol., vol. 27, no. 7, pp. 841-850, Apr. 2009.

[53] H. Wang and K. Bergman, A Bidirectional 2x2 Photonic Network Building-Block for High-Performance Data Centers, in Optical Fiber Communication Conference. Optical Society of America, 2011.
[54] O. Liboiron-Ladouceur, A. Shacham, B. A. Small, B. G. Lee, H. Wang, C. P. Lai, A. Biberman, and K. Bergman, The data vortex optical packet switched interconnection network, J. Lightwave Technol., vol. 26, no. 13, pp. 1777-1789, Jul. 2008.
[55] C. Hawkins, B. A. Small, D. S. Wills, and K. Bergman, The data vortex, an all optical path multicomputer interconnection network, IEEE Trans. Parallel Distrib. Syst., vol. 18, pp. 409-420, March 2007.
[56] K. Bergman, Optically interconnected high performance data centers, in 36th European Conference and Exhibition on Optical Communication (ECOC), 2010, pp. 1-3.
[57] H. Wang, A. S. Garg, K. Bergman, and M. Glick, Design and demonstration of an all-optical hybrid packet and circuit switched network platform for next generation data centers, in Optical Fiber Communication Conference. Optical Society of America, 2010, p. OTuP3.
[58] The New Optical Data Center, Polatis Data Sheet, Polatis Inc., 2009.
[59] iVX8000 Product Datasheet. InTune Networks, 2010.
[60] J. Scaramella, Worldwide Server Power and Cooling Expense: 2006-2010 Forecast, Market analysis, IDC Inc.
[61] The Impact of Power and Cooling on Data Center Infrastructure. Market Analysis, IDC Inc., 2007.
[62] D. Kliazovich, P. Bouvry, and S. Khan, GreenCloud: A Packet-level Simulator of Energy-aware Cloud Computing Data Centers, Journal of Supercomputing (to appear).
[63] R. Tucker, Green Optical Communications, Part II: Energy Limitations in Networks, IEEE Journal of Selected Topics in Quantum Electronics, vol. 17, no. 2, pp. 261-274, 2011.
[64] A. Bach, The Financial Industry's Race to Zero Latency and Terabit Networking. OFC/NFOEC Keynote presentation, 2011.

Christoforos Kachris is a senior researcher at Athens Information Technology (AIT), Greece. He received his Ph.D. in Computer Engineering from Delft University of Technology, The Netherlands, in 2007, and the diploma and M.Sc. in Computer Engineering from the Technical University of Crete, Greece, in 2001 and 2003 respectively. In 2006 he worked as a research intern at the Networks Group of Xilinx Research Labs (San Jose, CA). From February 2009 till August 2010 he was a visiting lecturer at the University of Crete, Greece, and a visiting researcher at the Institute of Computer Science of the Foundation for Research and Technology (FORTH), working on the HiPEAC NoE and the SARC European research projects. His research interests include reconfigurable computing (FPGAs), multi-core embedded systems, network processing, and data center interconnects.

Dr. Ioannis Tomkos is head of the High Speed Networks and Optical Communication (NOC) Research Group at AIT. NOC participates in many EU funded research projects, in several of which Dr. Tomkos has a consortium-wide leading role (he is/was the Project Leader of the EU ICT STREP project ACCORDANCE, Project Leader of the EU ICT STREP project DICONET, Technical Manager of the EU IST STREP project TRIUMPH, Chairman of the EU COST 291 project, and WP leader and Steering Board member in several other projects). Dr. Tomkos has received the prestigious title of Distinguished Lecturer of the IEEE Communications Society.
Together with his colleagues and students he has authored about 400 articles/presentations (about 220 IEEE-sponsored archival items), and his work has received over 1800 citations. Dr. Tomkos (B.Sc., M.Sc., Ph.D.) has been with the Athens Information Technology Center since 2002; in the past he was a senior scientist (1999-2002) at Corning Inc., USA, and before that (1995-1999) at the University of Athens, Greece. At AIT he founded the NOC Research Group, which participates in many EU funded research projects (including 7 running FP7 projects) as well as national projects, in which he represents AIT as Principal Investigator. He is the Chairman of the working group on next generation networks of the Digital Greece 2020 Forum, and the chairman of the Working Group on Networks and Transmission within the GreenTouch Initiative/Consortium, an initiative focusing on improving the energy efficiency of telecom networks.