Redundancy & the Netnod Internet Exchange Points

The extent to which businesses and consumers rely on the Internet for critical communication has been recognised for over a decade. Since the rise of the commercial Internet, the maturity of the processes and best practices used to operate IP networks has increased many times over. You'll often hear about redundancy and its importance in maintaining the reliability, packet delivery, and latency commitments made to your user base. In this paper, we cover redundancy and its importance when connecting to the Netnod Internet Exchange Points (IXPs) for intra-Sweden content.

Part 1: Explaining Network Redundancy

What is network redundancy?

Redundancy is the ability of a network to withstand a component failure without significantly affecting users, usually achieved through the use of secondary resources. For the purposes of this paper, we define a failure event as an instance of a network component failing. Failure events do not necessarily equate to a loss of service for the network's users: they may be invisible to users, or they may result in degraded service, of which packet loss is one example. In the worst case, a failure event results in a loss of service, or outage, for all users or for a subset of the user base. If the network is designed with redundancy, the network provider can minimise such outages.

Network redundancy must be automatic; recovery that depends on manual intervention by a network engineer does not constitute true redundancy. The availability requirements of modern IP networks make automated recovery mechanisms critical. For example, a Skype VoIP call would be disconnected long before the network operator's engineers received an alarm about a failure event. Depending on the business requirements that underlie the network's design, a network may need to survive a single failure event or some combination of failure events.

Let's examine some common failure events on an IP network in three categories: physical, power & environment, and logical.

Physical
- Route processor failure
- Router card failure
- Router port failure
- Cable failure/fibre cut

Power & Environment
- Power loss
- UPS failure
- Cooling failure

Logical
- Routing protocol process crash
- Router inter-process communication failure
- Incorrect router configuration
Network designers use various methods to implement redundancy in packet-switched networks. Preparing for physical failure events is not difficult, although carrying spare components increases cost. Designers can include multiple routers, links, and route processors, for example. These secondary resources ensure that the network can continue to offer a packet service to customers. Similar preparations apply to power and environmental failures: a diesel generator can provide power if the main power source fails, and backup cooling systems protect against failure of the primary cooling system.

Logical failure events are more challenging to withstand. These can take the form of router or switch software defects. Other than making vendor selections based on software reliability, operators have little ability to prevent software defects. What can be done in many cases is to configure the network so that a software failure in one router does not affect the overall functioning of the network. However, human error causes many more problems than software defects. Even with rigorous change-control processes and configuration automation, mistakes happen. Adjusting processes and training after each occurrence can reduce, though not eliminate, human error.

Failure events can affect the network's topology. Routing protocols such as Open Shortest Path First (OSPF) and the Routing Information Protocol (RIP) were designed to detect changes in the topology and communicate those changes to the rest of the network. Using this new topology information, the routers determine how to route packets around link and router failures. Figure 1 shows an end user communicating with a server across an IP network. The packets use the north path under normal operating conditions. When a link between routers along that data path fails, the routers that detect the failure send link-state packets (LSAs in OSPF) to their neighbours (Figure 2). These updates are flooded throughout the network so that every router learns the new topology. Figure 3 illustrates how traffic is then re-routed to the south path.

Figure 1- End User exchanging data with server
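To make the re-route shown in Figures 1-3 concrete, here is a minimal Python sketch of the shortest-path recomputation a link-state router performs once the updated topology has been flooded. The topology, router names, and link costs are illustrative only and are not taken from the figures.

# Minimal sketch: shortest-path recomputation after a link failure.
# Topology, router names, and costs are illustrative only.
import heapq

def shortest_path(graph, source, dest):
    """Dijkstra's algorithm over a dict of {node: {neighbour: cost}}."""
    queue = [(0, source, [source])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dest:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbour, link_cost in graph[node].items():
            if neighbour not in visited:
                heapq.heappush(queue, (cost + link_cost, neighbour, path + [neighbour]))
    return None, []

# Hypothetical topology: a "north" path (R1-R2-R3) and a "south" path (R1-R4-R5-R3).
topology = {
    "R1": {"R2": 1, "R4": 2},
    "R2": {"R1": 1, "R3": 1},
    "R3": {"R2": 1, "R5": 2},
    "R4": {"R1": 2, "R5": 2},
    "R5": {"R4": 2, "R3": 2},
}

print(shortest_path(topology, "R1", "R3"))   # normally via the north path R1-R2-R3

# A failure event: the R2-R3 link goes down and the flooded updates remove it
# from every router's copy of the topology.
del topology["R2"]["R3"]
del topology["R3"]["R2"]

print(shortest_path(topology, "R1", "R3"))   # traffic re-routes via the south path R1-R4-R5-R3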
Figure 2- OSPF LSPs flooded after link fails

Figure 3- Traffic re-routes to another path

Routing protocols such as the Border Gateway Protocol (BGP) use a simple keep-alive mechanism to detect failure. The routers are configured to send keep-alive messages at a specified interval and to mark a neighbour as unreachable if keep-alives are not received. This scheme functions as designed, although the time it takes to detect failures is significant. Let's use an example. Router A sends Router B a BGP keep-alive message every ten seconds and vice versa. If three successive keep-alives are not received by Router A, Router A tears down the BGP session at the 30-second mark (10 seconds multiplied by 3). Engineers can make the timers more aggressive; however, routing protocols are processed by the route processor, the central brain of the router, and trying to achieve sub-second failure detection this way would overload it.
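As a minimal sketch of the keep-alive arithmetic in this example (a 10-second keep-alive interval and a hold time of three intervals), the code below models the hold-timer check a router might run. It is illustrative only and not any vendor's BGP implementation.

# Minimal sketch of keep-alive based failure detection, using the example's
# numbers: a 10-second keep-alive interval and a hold time of three intervals.
# Illustrative arithmetic only, not a vendor's BGP implementation.
import time

KEEPALIVE_INTERVAL = 10.0            # seconds between keep-alives from the neighbour
HOLD_TIME = 3 * KEEPALIVE_INTERVAL   # session torn down after 30 seconds of silence

class NeighbourSession:
    def __init__(self):
        self.last_keepalive = time.monotonic()
        self.established = True

    def receive_keepalive(self):
        """Called whenever a keep-alive arrives from the neighbour."""
        self.last_keepalive = time.monotonic()

    def check_hold_timer(self):
        """Tear the session down if no keep-alive has arrived within the hold time."""
        silence = time.monotonic() - self.last_keepalive
        if silence >= HOLD_TIME and self.established:
            self.established = False
            print(f"Neighbour unreachable after {silence:.0f}s of silence; session torn down")
        return self.established

session = NeighbourSession()
session.receive_keepalive()
print(session.check_hold_timer())    # True: keep-alives are arriving, session stays up

The worst-case detection time is the full hold time, which is why keep-alive based detection alone cannot deliver sub-second failover.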
Network designers recognised that an improved means of detecting failure was needed. The routers that service providers use in today's networks distribute computing resources rather than having all processing occur in the route processor. Processing is localised in network processor engines on each router card. While some control-plane traffic must still be sent to the central route processor, many other functions, such as packet encapsulation, can be performed on the card. Placing a failure-detection scheme on the router cards therefore allows faster detection without overwhelming the router. The Bidirectional Forwarding Detection (BFD) protocol emerged as a simple method for detecting loss of connectivity over IP. The use of BFD is critical on IP networks that do not have built-in failure detection at Layer 2 of the OSI protocol stack.

Part 2: Redundancy at the Netnod Internet Exchange Point (IXP)

Before delving into redundancy at Netnod, we'll examine the IXP service that Netnod offers. Netnod operates a Layer 2 IXP service in Stockholm, Göteborg, Malmö, Sundsvall, and Luleå. An IXP gives local ISPs a common meet-me point at which to exchange Internet traffic. Keeping intra-country traffic local lowers the latency to reach content and reduces transit costs. Netnod is not involved in peering discussions between its tenants; all arrangements must be worked out between the two parties that want to interconnect. Once these agreements are in place, the two entities can exchange traffic. Figure 4 depicts, from a physical perspective, how ISPs can interconnect using IXPs. Each customer uses a local access provider to reach the IXP, and IP packets flow between peers on the IXP as agreed upon by the parties.
Figure 4- Physical Connectivity to an IXP

From a logical perspective, the customer routers function as though they are connected to a shared Ethernet segment. The customer-router interfaces that connect to the IXP are configured with addresses from the IXP's IPv4 and IPv6 prefixes. See Figure 5 for an illustration.
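As a small aside, the shared-segment addressing can be illustrated with Python's ipaddress module. The prefixes and host addresses below are documentation examples (RFC 5737 and RFC 3849), not Netnod's actual assignments.

# Sketch of the logical addressing on the peering LAN: every customer interface
# carries an address from the same shared IPv4 and IPv6 prefixes. The prefixes
# and host addresses are documentation examples, not Netnod assignments.
import ipaddress

customer_a_v4 = ipaddress.ip_interface("198.51.100.10/24")
customer_b_v4 = ipaddress.ip_interface("198.51.100.20/24")
customer_a_v6 = ipaddress.ip_interface("2001:db8::a/64")

# Both IPv4 interfaces sit on the same shared segment, so the routers can peer directly.
print(customer_a_v4.network == customer_b_v4.network)   # True
print(customer_a_v6.network)                            # 2001:db8::/64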
Figure 5- The logical view of IXP connectivity

The technical configuration of the Netnod IXPs is very simple; there are few differences between them and a Gigabit Ethernet LAN. Netnod has a single Ethernet switch in each of its secure bunker facilities. Connectivity relies upon virtual LANs (VLANs) as specified in the IEEE 802.1Q standard: the customer router and the IXP switch exchange 802.1Q-tagged frames with one another. In an enterprise or service provider environment, VLANs are typically used for traffic separation, but here their use is slightly different. The switch uses two common VLANs to accommodate two different maximum transmission unit (MTU) sizes. One VLAN tag is used for the common 1500-byte MTU; if customers want to send frames between 1501 and 4470 bytes (jumbo frames), they use the second VLAN tag. There is no connection between the two VLANs at the IXP.

Let's dispel a common misperception about VLANs and redundancy. When customers configure both VLAN tags on a single connection, the packets traverse a single fibre pair from the customer router to the Netnod switch (see Figure 6). A fibre cut or network element failure will cause an outage. The use of multiple VLANs alone is therefore not a redundant set-up; redundancy requires physically separate connections.
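To make the tagging itself concrete, the sketch below builds the Ethernet-plus-802.1Q header for a frame on either VLAN. The VLAN IDs (15 and 16, the identifiers used in the figures) and the MAC addresses are placeholders for illustration.

# Minimal sketch of 802.1Q tagging: the same physical port can carry frames
# tagged for either of the two IXP VLANs. VLAN IDs and MAC addresses are placeholders.
import struct

TPID_8021Q = 0x8100  # EtherType value that identifies an 802.1Q tag

def dot1q_header(dst_mac: bytes, src_mac: bytes, vlan_id: int, ethertype: int,
                 priority: int = 0) -> bytes:
    """Build the 18-byte Ethernet + 802.1Q header that precedes the payload."""
    # Tag Control Information: 3-bit priority, 1-bit DEI (0 here), 12-bit VLAN ID.
    tci = (priority << 13) | (vlan_id & 0x0FFF)
    return dst_mac + src_mac + struct.pack("!HHH", TPID_8021Q, tci, ethertype)

customer_mac = bytes.fromhex("020000000001")  # placeholder, locally administered
ixp_peer_mac = bytes.fromhex("020000000002")  # placeholder

standard_mtu_frame = dot1q_header(ixp_peer_mac, customer_mac, vlan_id=15, ethertype=0x0800)
jumbo_mtu_frame    = dot1q_header(ixp_peer_mac, customer_mac, vlan_id=16, ethertype=0x0800)

print(standard_mtu_frame.hex())
print(jumbo_mtu_frame.hex())

In the single-connection case, both tagged frames leave the router on the same physical port, which is exactly why one fibre cut takes out both VLANs.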
Figure 6- Dual VLANs on Single Connection (note: VLANs 15 and 16 are used in the figures; these identifiers vary based on location)

Netnod has allocated both IPv4 and IPv6 address space for connectivity at the IXP. Each VLAN is assigned a /24 of IPv4 space (254 usable addresses) and a /64 of IPv6 space (2^64 unique addresses). Each IXP location uses different assigned IPv4 and IPv6 prefixes. There is, however, no requirement for customers to use both IP protocols; customer needs in this area are driven by their connectivity arrangements with other tenants.

How do customers connect to the Netnod IXP switches? The connectivity method varies by city. In Göteborg, Malmö, Sundsvall, and Luleå, the access method is dark fibre only. New customers work with a local access provider listed on the Netnod web site to order dark fibre connectivity to the Netnod IXP; the access provider will have the necessary information, including the destination point, to enable the connection. In Stockholm, customers may connect to the IXP using dark fibre or Dense Wavelength Division Multiplexing (DWDM). Netnod orders the required dark fibre from Stokab. Alternatively, customers can connect via DWDM in the two Telecity data centres and InterXion's data centre; the customer is responsible for ordering the cross-connects within the data centre to Netnod's DWDM transmission equipment. These connections are backhauled to the IXP bunkers.

Having provided this information on the Netnod infrastructure and configuration, our focus for the following discussion will be on physical and logical failure events. Implementing power and environmental redundancy is independent of IXP connectivity, and information on avoiding such failure events is readily available. Netnod infrastructure, such as the IXP switches and DWDM equipment, uses dual power feeds and is backed up by an uninterruptible power supply (UPS). While Netnod provides a highly available service, the availability experienced by an individual customer depends on the network design decisions made by that customer. The remainder of this paper discusses Netnod's recommendations for implementing redundancy for IXP connectivity. These recommendations, based on industry best practices, stem from years of working with customers to minimise downtime for failure events external to the Netnod infrastructure.

To show how redundancy should be implemented, let's start with a simple set-up connecting to a Netnod IXP outside of Stockholm (note: Stockholm has a different architecture that is addressed later). In this example, a customer has a single dark fibre connection between its point of presence and the IXP in Göteborg. The customer uses a single router that connects to both the Netnod IXP and the customer's transit provider. See Figure 7 for a depiction.
Figure 7- Single connection to Göteborg site

Failure events are inevitable in any network, and this simple set-up connecting to the Göteborg site is not immune to them. The potential failure events and their outcomes should be documented for planning purposes. Since redundancy often involves back-up components or circuits, adding it carries a cost. By prioritising the list of failure events by the severity of each outcome, a provider can make an informed business decision on whether or not to invest in redundancy for a given failure event. The table below lists potential failure events, outcomes, and severities for the Göteborg example. Severities range from one (highest) to five (lowest).

Failure Event | Outcome | Severity (1 to 5)
ROUTER-1 router crashes | Customer A network cannot reach the Internet via the IXP or the transit provider | 1
ROUTER-1 router port to the IXP fails | Customer A network can still reach the Internet via the transit provider (likely more expensive per bit) | 3
ROUTER-1 router port to the transit provider fails | Customer A network will only have access to local content from the IXP | 1
Netnod switch's port to Customer A fails | Customer A network can still reach the Internet via the transit provider | 3
Netnod switch fails | Customer A network can still reach the Internet via the transit provider | 3

Table 1: Failure Events for Non-redundant Scenario

Redundancy can be added to Customer A's connectivity to the IXP by installing a second Internet-facing router and adding a second connection to Göteborg and to the transit provider. Now the customer's ability to reach the IXP switch is unaffected if one of the two tail circuits experiences a failure event such as a router crash, router card crash, or fibre cut. Any single one of these events would result in the routing protocol routing around the failure. Customer A would continue to reach the IXP after any one of them, though not necessarily after a combination of them. For this reason, customers should weigh the probability of multiple simultaneous failures and invest, or not, accordingly. A redundant connection to the Göteborg IXP is depicted in Figure 8.
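Before looking at the redundant design, here is a toy illustration of the severity-based prioritisation described above: it simply sorts the entries of Table 1 so that the most severe outcomes are reviewed first. The data comes from the table; the code is illustrative only and not a Netnod tool.

# Toy illustration: prioritise documented failure events by outcome severity
# (1 = most severe). The entries come from Table 1; the code is illustrative only.
failure_events = [
    ("ROUTER-1 router crashes", 1),
    ("ROUTER-1 port to IXP fails", 3),
    ("ROUTER-1 port to transit provider fails", 1),
    ("Netnod switch port to Customer A fails", 3),
    ("Netnod switch fails", 3),
]

# Review the most severe events first when deciding where to invest in redundancy.
for event, severity in sorted(failure_events, key=lambda item: item[1]):
    print(f"severity {severity}: {event}")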
Figure 8- Redundant connection to the Göteborg IXP

The table below lists the corresponding failure events for the redundant scenario.

Failure Event | Outcome | Severity (1 to 5)
ROUTER-1 router crashes | Traffic bound for the Internet uses the transit link or IXP link on ROUTER-2 | 5
ROUTER-2 router crashes | Traffic bound for the Internet uses the transit link or IXP link on ROUTER-1 | 5
ROUTER-1 router port to the IXP or to the transit provider fails | Traffic that used the failed port re-routes to ROUTER-2 | 5
Netnod switch's port to Customer A fails | Customer A network can still reach the Internet via the transit provider | 3
Netnod switch fails | Customer A network can still reach the Internet via the transit provider | 3

Table 2: Failure Events for Redundant Scenario

Netnod's infrastructure in Stockholm differs from the other locations in that there are two switches located at separate facilities. This adds resiliency to IXP connectivity for customers that connect to both switches. Each switch carries the two VLANs, like all Netnod switches, and there is no connection between the two Stockholm switches. A customer could connect to just one Stockholm switch in a basic, non-redundant set-up. This is depicted in Figure 9.
Figure 9- Non-redundant connection to one Stockholm switch

For redundant access to Netnod in Stockholm, customers should connect to both switches (preferably using different routers at the customer premises). This dual connectivity to peers in Stockholm will deliver higher availability for access to local content. Figure 10 illustrates redundant access in Stockholm.
Figure 10- Redundant Access to Both Stockholm Netnod Switches

Let's return to the two VLANs for different MTUs discussed earlier in the paper. While configuring both VLANs on a single connection is not redundant in itself, carrying both VLANs on each of two physical connections between the customer routers and the Netnod switches adds redundancy to IXP connectivity. See Figure 11 for a depiction of two VLANs on each of two physical connections to the IXP switch.
Figure 11- Dual VLANs per connection

Adding redundancy to an IXP connectivity design is not a "configure and forget" operation. Network environments are dynamic: configurations change, hardware is added and removed, and engineers change positions. For these reasons, verifying redundancy is crucial. Many ISPs schedule time during maintenance windows to manually force failure events; if redundancy is in place and working, no outage should result. This regular testing spares engineers from having to explain to stakeholders (for example, management and customers) why redundancy was claimed but did not prevent an outage.
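One simple way to run such a test is to monitor reachability of a peer's peering-LAN address while the primary connection is deliberately failed. The sketch below is purely illustrative: it assumes a Linux-style ping command with -c and -W flags, and the peer address is a placeholder.

# Illustrative failover check for a maintenance window: keep probing a peer's
# peering-LAN address while the primary connection is deliberately failed.
# Assumes a Linux-style `ping`; the peer address is a placeholder.
import subprocess
import time

PEER_ADDRESS = "198.51.100.20"   # placeholder peering-LAN address of a peer
PROBE_COUNT = 60                 # probe once per second for one minute
outages = 0

for _ in range(PROBE_COUNT):
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", PEER_ADDRESS],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    if result.returncode != 0:
        outages += 1
    time.sleep(1)

# With working redundancy, the forced failure of the primary link should cause
# at most a brief blip here rather than a sustained outage.
print(f"{outages} failed probes out of {PROBE_COUNT}")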
Conclusion

Redundancy is a fundamental component of sound network design. The use of secondary network resources, both physical and logical, prevents failure events from causing customer-affecting outages. For the Netnod IXPs, redundancy helps ensure the availability of the IXP service. A vital take-away from this paper is that the availability of IXP connectivity is predicated on the customer's own design decisions. Making those decisions with redundancy in mind will increase the availability of local content.