A Multihoming solution for medium sized enterprises




Praveen R, Prashant J, Hemant RG
International Institute of Information Technology, Hyderabad
{praveen_r, prashant_j}@students.iiit.net, hemant@iiit.net

Abstract

Multihoming solutions will see widespread use in small and medium-sized businesses with the increasing availability of low-cost broadband links. Replacing a single high-bandwidth link with links to multiple ISPs provides benefits such as fault tolerance, better QoS and lower costs. Though many commercial multihoming systems exist, little information is available on the design of multihomed load balancing systems. We describe our experiences in building a Linux-based multihoming system that performs incoming and outgoing load balancing. The design of the system and the tradeoffs involved are discussed. We also describe our tools for determining path characteristics.

1. Introduction

Multihoming is a technique to increase the reliability and QoS of an internet connection using multiple network links. A network is said to be multihomed if it has more than one path to the global internet via multiple ISPs. A multihomed network has multiple public IP addresses provided by different ISPs. The following are some of the reasons why an enterprise might want a multihomed network:

1. Multihoming helps to minimize downtime due to internet connection failure and ensures reliable internet services. When connectivity to the internet through one of the upstream ISPs is lost in the event of link failure, packets can be routed over the remaining links.

2. Policy-based routing can be used to choose the best link for a connection based on the link characteristics and the application protocol. For example, ssh and telnet traffic can be routed over the link with minimum delay to the destination, while ftp traffic can be routed over the link that provides maximum bandwidth on the path to the destination.

3. Having multiple broadband links to different ISPs is more economical than a single high-bandwidth line. As the reliability of inexpensive broadband links improves, multihoming solutions will see widespread adoption.

4. A multihomed solution helps to distribute the load over multiple links. This could enable geographically distributed enterprises to route packets through the nearest gateway.

Sometimes different ISPs share a common transmission line that could form a single point of failure, considerably reducing the reliability benefits of a multihomed network. Hence it is important to choose a set of ISPs that have minimal overlap in their networks. Multihoming systems also need to address scalability concerns, as connection tracking consumes a lot of memory.

There are a number of commercial BGP-based multihoming solutions, but these are only practical for large businesses. Even if a BGP router is installed by the customer, the ISP may not be ready to enable BGP peering. To our knowledge, the details of the design and implementation of multihoming systems that can be deployed by small and medium enterprises have not been published in the open literature. In this paper, we present a Linux-based multihoming solution that consists of two components:

Outgoing load balancer: The outgoing load balancer routes outgoing traffic based on the characteristics of the paths to the destination. The path characteristics are prioritized based on the application-level protocol of the packet while making the routing decision. The path characteristics we consider are available bandwidth, capacity and delay. If these path characteristics are not available, the characteristics of the first-hop links to the ISPs are considered instead. The outgoing load balancer also performs dead gateway detection. Network Address Translation (NAT) is used to dynamically bind internal hosts to public IP addresses provided by the ISPs. The status of connections is maintained by the ip_conntrack [5] module in the Linux kernel.

Incoming load balancer: The incoming load balancer distributes incoming requests from hosts on the internet to servers hosted on the local network over all the links to the ISP gateways. This is done by dynamically modifying the DNS entries of the nameserver so that connections to the servers are divided between the links based on the available bandwidths on the links. The servers are assumed to have multiple public IP addresses, one for each ISP.
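To make the incoming-balancing idea concrete, the following sketch (ours, not from the paper; the function name and the example addresses are illustrative) answers for a server that has one public address per ISP, weighted by the available bandwidth of the corresponding link. The real system rewrites zone records with nsupdate rather than answering queries directly, but the weighting principle is the same.

```python
import random

def pick_address(addresses, avail_bw):
    """Pick one of a server's per-ISP addresses, weighted by the
    available bandwidth of the corresponding link."""
    total = sum(avail_bw)
    if total <= 0:
        return random.choice(addresses)  # no usable estimate: fall back to uniform
    r = random.uniform(0, total)
    acc = 0.0
    for addr, bw in zip(addresses, avail_bw):
        acc += bw
        if r <= acc:
            return addr
    return addresses[-1]

# A server reachable via three ISPs; the second link has the most spare
# capacity, so its address should be handed out most often.
addrs = ["198.51.100.10", "203.0.113.10", "192.0.2.10"]
bw = [2.0, 6.0, 2.0]
counts = {a: 0 for a in addrs}
random.seed(1)
for _ in range(10000):
    counts[pick_address(addrs, bw)] += 1
```

Over many queries the answers converge to the 2:6:2 ratio of available bandwidths, which is exactly the effect the DNS-based balancer aims for.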

2. Related work

Guo et al. [1] describe load balancing and fail-over implementation strategies using performance measurements obtained on a commercial load balancing system. Their paper addresses outgoing load balancing in a NATed scenario, as well as balancing the load of externally initiated connections. Strategies for link assignment and fault tolerance are also examined. Akella et al. quantify the performance benefits of using a multihomed network and give pointers on choosing the right combination of ISPs to achieve optimal performance [6].

3. Overview

The architecture of the multihoming system is explained in Section 4. Section 5 presents the methods used to determine the application protocol, and Section 6 the criteria used for making routing decisions. Failure detection and handling are covered in Section 7. Section 8 discusses the methods used to determine the path characteristics, while Section 9 describes the determination of ISP link characteristics. Security issues are discussed in Section 10. Section 11 describes the details of the test system, and Section 12 concludes the paper and describes areas for future work.

4. Architecture

4.1 Outgoing load balancing

[Figure: Outgoing load balancer. For each outgoing request, a kernel module checks whether the destination IP address is in the cache, returning the per-destination policy on a hit and the global policy on a miss. A userspace daemon runs the capacity and available bandwidth measurement tools, performs dead gateway detection, and sends cache and policy updates to the kernel module.]

The outgoing load balancer routes outgoing packets over the link that best suits the type of the traffic. This policy can be set by the user. For example, if the system is to be optimized for downloads, http traffic could be directed over the link that has the maximum available bandwidth, while under normal circumstances the link with minimum latency to the destination could be used. The multihoming system also provides fault tolerance by ensuring that new connections are not routed over dead links.
The system consists of two components: a kernel module and a userspace daemon.

Kernel module: The kernel module registers a PREROUTING netfilter [2] hook that examines the packets passing through the host. For each packet coming in through the local interface, the module obtains the destination address from the packet and sends the address to the userspace daemon. If the packet starts a new session, a new routing decision is taken for the packet based on the application protocol and the user-specified policy. If the packet belongs to an existing connection, it inherits the routing decision made for the first packet of that connection. The IP connection tracking mechanism in the Linux kernel (ip_conntrack) is used to identify all packets that belong to a particular connection. Connection-oriented protocols like ssh and telnet that don't make any related connections are handled automatically by the connection tracking mechanism. On the other hand, protocols like FTP and IRC, which have separate data and control connections, require a separate conntrack helper.

For making the routing decision, the module maintains a cache of recently visited destinations and the characteristics of the paths to these destinations. In case of a cache miss, which can occur if a less frequently accessed destination is encountered, the characteristics of the links to the ISP gateways are taken into consideration instead. Cache misses can be minimized by making the destination cache large enough to hold all the popular destinations such as search engines, news and mail websites. The optimal size of this cache can be determined by analyzing the internet usage patterns in the organization. The destination cache is implemented using a red-black tree ordered on destination IP address. This provides an efficient mechanism for adding, deleting and searching for addresses.
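The cache-miss fallback described above can be sketched as follows. This is an illustrative model only (a dict stands in for the kernel's red-black tree, and the field names are our own), showing how a lookup returns per-destination path characteristics on a hit and the first-hop link characteristics on a miss.

```python
class DestinationCache:
    """Sketch of the kernel's destination cache: maps destination IP to
    measured path characteristics. A plain dict stands in for the
    red-black tree used in the real module."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}  # dst_ip -> path characteristics

    def lookup(self, dst, link_defaults):
        # Hit: use the measured per-destination path characteristics.
        # Miss: fall back to the first-hop link characteristics.
        return self.entries.get(dst, link_defaults)

    def update(self, dst, characteristics):
        # Only admit new entries while there is room (the real cache is
        # sized to hold the organization's popular destinations).
        if dst in self.entries or len(self.entries) < self.capacity:
            self.entries[dst] = characteristics

# First-hop link characteristics used when a destination is not cached.
link_defaults = {"avail_bw": 8.0, "capacity": 10.0, "rtt_ms": 5.0}

cache = DestinationCache(capacity=2)
cache.update("192.0.2.1", {"avail_bw": 1.5, "capacity": 2.0, "rtt_ms": 40.0})
hit = cache.lookup("192.0.2.1", link_defaults)      # measured path data
miss = cache.lookup("198.51.100.7", link_defaults)  # falls back to the link
```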
For connectionless protocols that are not supported by ip_conntrack, the decision taken for the first packet is applied to all subsequent packets between the same source-destination pair. This decision is stored in a hash table for fast lookup; an efficient hash function (Jenkins' hash [3]) keeps collisions to a minimum. Entries in the hash table are expired if there is no traffic between the corresponding source and destination for a certain amount of time. The disadvantage of this approach is that for connectionless protocols without a conntrack helper, all data between a source and destination are assumed to be part of the same connection, and the same routing decision is taken for all these packets. While this is not optimal, it allows connectionless protocols without a conntrack helper to work properly.

Routing rules are enforced in the kernel module by setting the netfilter nfmark on packets leaving the router. Separate routing rules can be set up for each nfmark value using iproute2 [5]. This ensures that all packets with the same nfmark are routed through the same interface. Packets from hosts on the internal network with private addresses are source NATed to the router's outgoing interface address using iptables. Packets from hosts with public addresses are not NATed. The multihoming system can be configured to ignore packets originating from public IP addresses on the internal network. This feature allows the user to route public IP addresses separately, without interference from the multihoming system. Generally, ISPs only forward packets that are coming from or going to addresses within their network. As packets from hosts with public IP addresses can be routed over any of the router's external links, the ISPs need to have routing table entries that forward packets from these IP addresses to the relevant ISPs.

Userspace daemon: The userspace daemon receives the destinations sent by the kernel and stores them in a two-level cache. The most frequently visited destinations are stored in the level 1 (L1) cache. The daemon calculates the path characteristics (available bandwidth, capacity and delay) of the destinations in the L1 cache and sends this data to the kernel. The L1 cache is kept in sync with the kernel's destination cache. If the path characteristics for a destination in the L1 cache are not available, the link characteristics are used instead.
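The per-pair decision table with expiry described in Section 4.1 can be sketched as below. This is a model in userspace terms, not the kernel hash table itself: the dict, the `choose_link` callback and the timestamps are our own illustrative devices.

```python
import time

class FlowDecisions:
    """Sketch of the per-(source, destination) routing-decision table used
    for connectionless protocols without a conntrack helper. Entries
    expire after `timeout` seconds without traffic."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.table = {}  # (src, dst) -> (chosen_link, last_seen)

    def decide(self, src, dst, choose_link, now=None):
        now = time.monotonic() if now is None else now
        key = (src, dst)
        entry = self.table.get(key)
        if entry is not None and now - entry[1] < self.timeout:
            link = entry[0]       # reuse the first packet's decision
        else:
            link = choose_link()  # new or expired pair: decide afresh
        self.table[key] = (link, now)
        return link

    def reap(self, now=None):
        """Periodically drop expired entries, as the real table does."""
        now = time.monotonic() if now is None else now
        self.table = {k: v for k, v in self.table.items()
                      if now - v[1] < self.timeout}

flows = FlowDecisions(timeout=30.0)
first = flows.decide("10.0.0.5", "192.0.2.1", choose_link=lambda: "isp1", now=0.0)
second = flows.decide("10.0.0.5", "192.0.2.1", choose_link=lambda: "isp2", now=10.0)
expired = flows.decide("10.0.0.5", "192.0.2.1", choose_link=lambda: "isp2", now=100.0)
```

The second packet inherits the first packet's link even though a different link would now be chosen; only after the idle timeout does a fresh decision take effect, which is the behaviour (and the drawback) described in the text.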
The less frequently accessed destinations are stored in a level 2 (L2) cache. The L1 and L2 caches are implemented as red-black trees ordered on the number of times the corresponding destination was visited. The size of the L1 cache is fixed, while the L2 cache expands as the number of destinations increases. When the visit count of an entry in the L2 cache becomes greater than that of an entry in the L1 cache, the entries are swapped. The cache entry for a destination is timed out if there are no packets to it for a long time. The caches are reaped periodically to remove expired entries. This ensures that the L2 cache does not become too large, and that infrequently accessed destinations with a high visit count don't monopolize the cache. All entries in the caches are also stored in a red-black tree ordered on address, to enable efficient searching of entries by destination address.

The kernel module and the daemon communicate via netlink sockets. Netlink sockets provide a full-duplex, asynchronous, flexible means of communication between the kernel and userspace via the standard socket API. This method is especially suited for the high data rate communication required by this application.

4.2 Incoming load balancing

Incoming load balancing is used to distribute the incoming traffic over the links to the ISPs using DNS redirection. The load balancer sets up a DNS server that answers DNS requests in such a way that connections are divided between the links. Each server within the network is assumed to have the same number of public IP addresses, one address from each ISP's prefix. The incoming load balancer calculates the ratios of the available bandwidths on the external interfaces of the router using the formula:

available bandwidth = capacity - bandwidth used

Servers are assigned to interfaces in the ratio of the available bandwidths of the interfaces. The load balancer dynamically updates the DNS entries using the nsupdate utility.
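The proportional assignment of servers to interfaces can be sketched as a small function (ours; the rounding scheme is an assumption, since the paper does not specify one):

```python
def assign_servers(num_servers, avail_bw):
    """Split `num_servers` across links in proportion to each link's
    available bandwidth, using largest-remainder rounding so the
    counts always sum to num_servers."""
    total = sum(avail_bw)
    shares = [num_servers * bw / total for bw in avail_bw]
    counts = [int(s) for s in shares]
    # Hand any leftover servers to the links with the largest
    # fractional share.
    leftover = num_servers - sum(counts)
    order = sorted(range(len(shares)),
                   key=lambda i: shares[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts

# 10 servers over three links with available bandwidths of 5, 3 and
# 2 Mbit/s: the servers split 5/3/2 across the links.
counts = assign_servers(10, [5.0, 3.0, 2.0])
```

When the measured available bandwidths change, recomputing these counts and diffing them against the current assignment tells the balancer how many servers to shift between links.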
[Figure: Incoming load balancer. The balancer finds the available bandwidth on the links, performs dead gateway detection, and updates the DNS entries that the DNS server uses to answer queries such as one for www.xyz.com.]

Nsupdate allows resource records to be changed without modifying the zone files. It also allows updating a remote server, so the incoming load balancer and the DNS server can run on different machines. A group of update commands can be sent as one dynamic update request, so changing a server's resource records can be done atomically.

When the ratios of available bandwidths change, the load balancer recalculates the number of servers that must be assigned to each link, and shifts servers from links that are overloaded to links that are underutilized. This is done by maintaining a list of interfaces to the external links, each interface having a list of servers assigned to it. The number of servers to assign to each interface is found by iterating over the interface list and shifting servers from (or to, depending on whether the available bandwidth on the interface decreased or increased) the server list of the current interface to (or from) the server list of the next interface on the list that is underutilized (or overloaded). The DNS entries are updated to reflect each change. This procedure is repeated until all the interfaces have been balanced.

The problem with this method is that there is no control over which servers are shifted between interfaces, because the load on individual servers is not known. This method would not work optimally if the servers don't have similar loads, or if the loads keep fluctuating, since the servers are distributed between the links based on their number, not their loads. But we expect that the system would eventually reach a stable state where the load is balanced between the links. Fluctuations in the DNS entries are minimized by updating the entries only if the available bandwidths change significantly.

The DNS server is authoritative for the domains of the servers in the local network. The TTL values of the DNS entries are kept low enough that clients querying upstream caching nameservers do not get stale entries. Incoming and outgoing load balancing can be done separately on the same host without interfering with each other.

5. Determination of application protocol

The kernel module determines the application protocol by looking at the destination ports, or with the help of a conntrack helper [2] in case one exists. This method was chosen over a pattern matching approach, like the one used in L7-filter [4], for reasons of efficiency and simplicity. While a pattern matching approach would be more flexible, it is much less efficient than our approach. Further, doing a source NAT of packets with local addresses to the external interface address of the router requires that the protocol be recognized before routing the first packet. This is not always possible when using a pattern matching approach. Moreover, there would need to be a fallback to port-based protocol detection for protocols where pattern matching fails. A drawback of our approach is that applications using non-standard ports will not be routed using the correct policy.
This, however, is not a significant issue, as most network administrators filter out non-standard ports; a firewall run on the router can do the same. Support for new protocols can easily be added to the multihoming system.

6. Criteria for making routing decisions

The routing decision is based on the following criteria:

1. Available bandwidth of the path to the destination.
2. Maximum capacity of the path to the destination.
3. Round trip time of the path to the destination.

The order in which these parameters are considered depends on the user-specified policy for the protocol. If the protocol of the incoming packet is unsupported, the default global policy is used. The following global parameters are considered when the per-destination path characteristics are not available:

1. Available bandwidth of the link to the upstream gateway.
2. Capacity of the link to the gateway.
3. Round trip time of the link.

These parameters are considered in the order specified by the protocol's policy, and the decision is based on the first available value. For TCP connections this decision is made only for the first packet of the connection; subsequent packets are routed using the same policy. For connectionless protocols this decision is made for every packet unless a conntrack helper exists for the protocol, in which case the same policy is applied for a single session, as identified by the conntrack helper.

The major exceptions to this rule are the routing of non-TCP packets of application protocols without a conntrack helper, and packets from hosts with public IP addresses within the internal network. For non-TCP packets of protocols not supported by conntrack, a decision is made when the first packet is seen between a source-destination pair. This decision applies to all subsequent non-TCP packets between the hosts. This makes sure that the multihoming router doesn't break these protocols by changing the source IP addresses of packets belonging to the same session.
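The ordered-criteria rule above can be sketched as follows; the function, the field names and the sample numbers are ours, and the sketch assumes a criterion is usable only when it is known for every link (the paper does not spell out how partial measurements are handled).

```python
def choose_link(links, policy):
    """Pick the best link for a protocol. `policy` is an ordered list of
    characteristics, e.g. ["avail_bw", "capacity", "rtt_ms"]; the first
    characteristic known for every link decides. Lower is better for
    RTT, higher is better for bandwidth and capacity."""
    for criterion in policy:
        values = [(name, c.get(criterion)) for name, c in links.items()]
        if all(v is not None for _, v in values):
            best = min if criterion == "rtt_ms" else max
            return best(values, key=lambda nv: nv[1])[0]
    return next(iter(links))  # no usable measurement: arbitrary fallback

links = {
    "isp1": {"avail_bw": None, "capacity": 10.0, "rtt_ms": 12.0},
    "isp2": {"avail_bw": None, "capacity": 8.0,  "rtt_ms": 4.0},
}
# A bulk-transfer policy falls through to capacity because available
# bandwidth has not been measured yet; an interactive policy goes
# straight to minimum delay.
bulk = choose_link(links, ["avail_bw", "capacity", "rtt_ms"])
interactive = choose_link(links, ["rtt_ms"])
```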
A drawback of this approach is that all the packets between these hosts inherit the decision made for the protocol of the first packet, even if they belong to a different protocol. These decisions time out if there is no traffic between the hosts for a specified time; for the next packet after the timeout, a new decision is taken. This timeout may break some protocols, but the problem can be avoided to some extent by making the timeout sufficiently large.

Packets from hosts with public IP addresses are routed differently. NAT is not performed on these packets, and a separate routing decision is taken for each packet belonging to a connectionless protocol originating from these hosts. The module can be prevented from touching these packets via a configuration option. These rules apply only to packets coming in on the local interface; all other packets are left untouched.

7. Failure detection and handling

7.1. Link failure detection

The multihoming router detects link and gateway failures by actively probing the gateways periodically. It sends ICMP echo requests to the gateways to determine their state. This also serves as a method for estimating the round trip time to each gateway.
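The active probing loop can be sketched as below. Sending real ICMP echo requests needs raw sockets, so the probe function is injected here; the retry count and the return convention (RTT in milliseconds, or None on timeout) are our assumptions, not details from the paper.

```python
def detect_dead_gateways(gateways, probe, retries=3):
    """Sketch of active dead-gateway detection. `probe(gw)` stands in
    for one ICMP echo request: it returns the round-trip time in ms,
    or None on timeout. A gateway is declared dead only after
    `retries` consecutive failed probes."""
    status = {}
    for gw in gateways:
        rtt = None
        for _ in range(retries):
            rtt = probe(gw)
            if rtt is not None:
                break  # gateway answered; rtt doubles as the RTT estimate
        status[gw] = {"alive": rtt is not None, "rtt_ms": rtt}
    return status

# Simulated probe results: one gateway answers, the other times out.
replies = {"10.1.0.1": 2.5, "10.2.0.1": None}
status = detect_dead_gateways(["10.1.0.1", "10.2.0.1"], probe=replies.get)
```

Run periodically, the `alive` flags feed the routing decision (dead links are skipped for new connections) and the `rtt_ms` values provide the per-link round trip times used in Section 9.3.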

Passive failure detection is another approach that could have been used. The advantage of passive probing is that it does not consume any extra bandwidth, and it gives a better estimate of the round trip time than active probing. However, active probing is simpler and faster. The multihoming application requires near-instantaneous detection of dead gateways, which is difficult to achieve with passive probing. Also, the exact values of the round trip times are not very important for this application, as long as their relative order is correct. Further, the bandwidth consumed by active probing is negligible, so the active probing approach was found to be more suitable for our purpose.

7.2. Handling link/gateway failure

Outgoing: If a link failure is detected, subsequent connections are routed over the remaining working links. Existing connections that use the failed link are terminated, unless they originate from a host with a public IP address, in which case the connection can be rerouted over a working link. When a previously dead link is reported to be working again, it is considered in subsequent routing decisions.

Incoming: When a link is reported to be dead, the incoming system changes any DNS entries that point to IP addresses on the dead link. This ensures that DNS keeps working as long as at least one link is operational.

8. Determination of path characteristics

8.1. Capacity

The capacity of a path is the maximum possible bandwidth that the path can deliver [7]. Path capacity is measured using the Variable Packet Size (VPS) probing technique. This method is based on the assumption that the serialization latency of each hop is equal to the packet size divided by the link capacity of the hop. The end-to-end capacity of the path is the minimum per-hop capacity. The VPS technique requires the measurement of latency for each hop on the path to the destination.
In the VPS method, packets with increasing TTL are sent to the destination, and the time interval between sending these packets and receiving their replies is measured. The RTT T_I to hop I for a packet of size L is given by:

T_I = sum_{i=1..I} ( L/C_i + q_i + p_i + L_REPLY/C_i + q'_i + p'_i )    (1)

Here C_i, i = 1..I, is the capacity of each hop, q_i and q'_i are the queuing delays of the sent packet and the reply respectively at hop i, while p_i and p'_i are the propagation delays. L_REPLY, the size of the ICMP reply packet, is a constant. From the above equation it can be seen that the serialization delay L/C_i is the only component that depends on the packet size L, so the equation can be written as:

T_I = sum_{i=1..I} ( L/C_i + D_i )    (2)

where D_i is independent of L, i.e.

T_I = alpha_I + beta_I * L    (3)

where beta_I = sum_{i=1..I} 1/C_i. For a packet of size L_1,

T'_I = alpha_I + beta_I * L_1    (4)

From (3) and (4),

beta_I = (T'_I - T_I) / (L_1 - L)    (5)

For the next hop,

T_{I+1} = alpha_{I+1} + beta_{I+1} * L    (6)

From (3) and (6),

C_{I+1} = 1 / (beta_{I+1} - beta_I)    (7)

where beta_{I+1} and beta_I are calculated using (5). In this method, a number of packets of increasing sizes are sent to each hop, and the value of beta_I is estimated using linear regression. For each packet size, the minimum RTT is taken, as this is assumed to have resulted from a packet and a corresponding ICMP reply that did not experience any queuing delay [8]. From this, the capacity of each hop can be determined. The capacity of the path is the bottleneck capacity:

C = min_{i=1..H} C_i    (8)

This method was found to be too slow for the multihoming application, as the capacity estimation tool could not keep pace with changes in the destination cache. Another problem with this approach is that layer 2 devices on the path cause underestimation of hop capacity [7]. So the tool was modified to calculate the end-to-end capacity rather than the per-hop capacities.
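The end-to-end variant can be sketched as a least-squares fit of minimum RTT against probe size. The function name, the sample layout and the probe sizes below are illustrative; the sketch assumes rtt = 2*L/C + const (symmetric forward and return paths, equal-sized echo replies), so the capacity is 2 divided by the fitted slope.

```python
def capacity_from_probes(samples):
    """End-to-end VPS sketch: fit min RTT against packet size and return
    the estimated path capacity. `samples` maps packet size (bytes) to a
    list of measured RTTs (seconds); the minimum RTT per size is used,
    since queuing delay can only add to the ideal RTT."""
    points = [(size, min(rtts)) for size, rtts in samples.items()]
    n = len(points)
    mean_l = sum(l for l, _ in points) / n
    mean_t = sum(t for _, t in points) / n
    slope = (sum((l - mean_l) * (t - mean_t) for l, t in points)
             / sum((l - mean_l) ** 2 for l, _ in points))
    return 2.0 / slope  # rtt = 2*L/C + const  =>  C = 2/slope

# Synthetic probes on a 1.25 MB/s (10 Mbit/s) path with 10 ms of fixed
# delay; the second RTT per size simulates queuing noise that min()
# filters out.
C, k = 1.25e6, 0.010
samples = {L: [2 * L / C + k, 2 * L / C + k + 0.004]
           for L in (200, 600, 1000, 1400)}
estimate = capacity_from_probes(samples)
```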
The advantage of not considering intermediate hops is that it is more efficient and consumes less bandwidth than the original method, which finds the capacity of all the intermediate links. Further, the accuracy of this method was found to be comparable to the traditional VPS method used in pchar [9]. In the end-to-end case, the size of the reply varies with the size of the request: an ICMP echo reply is generated whose size is generally equal to the size of the corresponding ICMP echo request packet. We assume that the forward and return journeys are symmetric. Hence

latency = rtt/2    (9)

where latency is the one-way delay from source to destination.

If rtt_1 and rtt_2 are the round trip times for packets of size L_1 and L_2 respectively, then

rtt_1/2 = L_1/C + k
rtt_2/2 = L_2/C + k

=> C = 2(L_2 - L_1) / (rtt_2 - rtt_1)    (10)

ICMP echo request packets of increasing size are sent to the destination and the RTTs are determined. The end-to-end capacity can then be determined by applying linear regression to the capacity estimates obtained using formula (10).

8.2. Round trip time

The round trip time is the time between the sent request and the received response. The round trip time of the path is calculated alongside the measurement of the capacity of the path. It is taken to be the minimum of the round trip times of the packets sent by the VPS method, as the minimum is considered to be the round trip time of the path under ideal circumstances. It is assumed that the ICMP echo request and response packets take the same path, so the round trip time is twice the latency of the path to the destination.

8.3 Available bandwidth

The available bandwidth of a path is the maximum unused bandwidth on the path [7]. It is the minimum of the available bandwidths of the links constituting the path. The available bandwidth calculation is based on the principle that the dispersion between back-to-back packets increases when they go from a higher bandwidth link to a lower bandwidth link, but does not change when they pass from a lower bandwidth link to a higher bandwidth one (Fig. 2). The dispersion between two packets is the time between the last bits of each packet. If a train of back-to-back ICMP echo request packets is sent to a destination, the dispersion between the replies will be the dispersion seen on the link with the minimum available bandwidth on the path. This is the basis of the packet-train technique [10].

[Fig. 2: Dispersion of back-to-back packets across the path bottleneck (input dispersion Δ_IN, output dispersion Δ_OUT).]
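The dispersion principle can be sketched numerically as below. The bandwidth follows from B = L/δ, where δ is the dispersion measured at the receiver; the median-based filtering of aberrant gaps is our simplification of the discard rules the text describes, not the paper's exact method.

```python
def available_bandwidth(packet_size, dispersions):
    """Packet-train sketch: the dispersion measured at the receiver is
    set by the tightest link, so the path's available bandwidth is
    estimated as B = L / delta. Negative gaps (out-of-order replies)
    are dropped, and a median filters out aberrant dispersions."""
    clean = sorted(d for d in dispersions if d > 0)
    if not clean:
        return None  # the whole train was aberrant: discard it
    delta = clean[len(clean) // 2]  # median inter-reply gap (seconds)
    return packet_size / delta

# Inter-reply gaps for a train of 1500-byte ICMP echo requests, with
# one out-of-order reply producing a negative gap.
gaps = [0.0012, 0.0012, 0.0013, 0.0012, -0.0004]
B = available_bandwidth(1500, gaps)  # ~1.25 MB/s bottleneck
```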
[Fig. 3: Measured output dispersion versus input dispersion of the packet train.]

If δ_i is the dispersion between packets (of size L) in the packet train on link i of capacity B_i, then the dispersion on the next link is

δ_{i+1} = max(δ_i, L/B_{i+1})

The dispersion measured by the receiver is

δ = max_i(L/B_i) = L/min_i(B_i) = L/B

so

B = L/δ    (11)

where B is the minimum available bandwidth on the path.

In our implementation, we send trains of ICMP echo requests with increasing input dispersion until the output dispersion starts increasing steadily with the input dispersion (Fig. 3). Packets within a train that have abnormal dispersion, and those that arrive out of order, are discarded. Trains with a large number of aberrant packets are also discarded. To find the point at which the graph starts increasing,

we find the largest decreasing subsequence such that

(length of subsequence) / (number of elements in subsequence) < threshold

This differentiates between the regions AP and PB in Fig. 3. In AP, the output dispersion does not vary consistently with the input dispersion, while in PB the output dispersion increases steadily with the input dispersion. The last point in this subsequence corresponds to the maximum available bandwidth on the link.

9. Determination of link characteristics

9.1 Capacity

The capacity of a link is taken to be the maximum bandwidth available on the link. This value must be specified by the user.

9.2 Available bandwidth

The bandwidth usage of a link is calculated by finding the number of bytes of data sent or received in an interval. This is done periodically, and the available bandwidth is calculated by subtracting the bandwidth consumed from the capacity of the link:

bandwidth used = (tx + rx - old_tx - old_rx) / interval
available bandwidth = capacity - bandwidth used

where tx and rx are the number of bytes of data sent and received on the link since it was initialized, old_tx and old_rx are the previous values of tx and rx, and interval is the time between measurements of tx and rx.

9.3 Round trip time

The round trip time of a link is the round trip time to the ISP's gateway. This is obtained from the gateway failure detection method (which assumes the gateway to be down if it does not respond within a specified time period).

10. Security

A firewall can be run on the router to filter out unwanted traffic. The firewall rules must be added after the iptables rules created by the outgoing load balancer. The connection tracking mechanism consumes a lot of memory, as it has to keep track of all existing connections, so the system may be vulnerable to a DoS attack that could affect internet connectivity. This problem can be mitigated by increasing the number of connections accepted by ip_conntrack and limiting the number of connections from a host using iptables. It is not a serious threat, as the DoS attack can only be mounted from within the local network.

As the router forms a single point of failure, it must be ensured that downtime is minimized. To keep the router secure, it can be configured not to accept connections from the external network. Further, all unnecessary services can be turned off and all patches and fixes applied. In addition, firewalls can be installed on the links between the router and the ISPs. As BIND is a frequent source of security vulnerabilities, the DNS server running on the router can be configured to allow updates only from the multihoming router.

11. Experimental Setup

Our test system is a router with an interface to a private intranet and links to three ISPs. The internal network has hosts with private and public IP addresses. The objective is to distribute the incoming and outgoing traffic over the links to the ISPs. The router runs Fedora Core 3 with Linux kernel 2.6.10. Existing connections are tracked using the ip_conntrack module. The router also acts as the DNS server for the local domain; BIND 9 is used for DNS.

12. Conclusion

Multihoming is becoming a viable option for many small enterprises with the increasing availability of inexpensive broadband internet connections. Most

existing solutions are proprietary and make use of BGP routing, which is not feasible for small organizations. This paper presents a multihoming system for Linux that can be deployed on a router. Our implementation does both incoming and outgoing load balancing. The outgoing load balancer does policy-based routing to choose the best link for each type of traffic. It calculates the characteristics of the paths via each of the ISPs to the most frequently accessed destinations. The best link is chosen based on the path characteristic that is relevant to the application protocol of the packet, as determined by the user-defined policy. For infrequently visited destinations, a global policy that considers the first-hop connectivity to the ISPs is used. The path characteristics considered are available bandwidth, capacity and delay. The algorithms used to calculate these parameters, and the tradeoffs involved, are discussed. The incoming load balancer distributes the load on the servers within the network among the links to the ISPs in the ratio of the available bandwidths on these links, by modifying the DNS entries appropriately. Both the incoming and outgoing load balancers do dead gateway detection to minimize downtime.

13. Availability

Our multihoming implementation is available at http://some.where.net.

Acknowledgements

We thank Habeeb Hassan for his technical inputs on the Linux kernel. The MSIT educational program kindly provided the infrastructure and test beds needed to build and test the prototype.

References

[1] F. Guo, J. Chen, W. Li, T. Chiueh, "Experiences in Building a Multihoming Load Balancing System", IEEE INFOCOM, 2004.
[2] Netfilter, http://www.netfilter.org
[3] Jenkins' Hash, http://burtleburtle.net/bob/hash/
[4] L7-filters, http://l7-filters.sourceforge.net
[5] IPRoute2, http://developer.osdl.org/dev/iproute2/
[6] A. Akella et al., "A Measurement-Based Analysis of Multihoming", SIGCOMM, 2003.
[7] R. S. Prasad et al., "Bandwidth estimation: metrics, measurement techniques, and tools".
[8] R. S. Prasad et al., "The effect of layer-2 store-and-forward devices on per-hop capacity estimation", IEEE INFOCOM, 2003.
[9] B. A. Mah, "pchar: a Tool for Measuring Internet Path Characteristics", http://www.employees.org/_bmah/software/pchar/, Feb. 1999.
[10] C. Dovrolis, P. Ramanathan, D. Moore, "Packet-Dispersion Techniques and a Capacity-Estimation Methodology", IEEE/ACM Transactions on Networking, Vol. 12, No. 6, December 2004.