High Availability Police-Fire Architecture Overview A SOLUTION WHITE PAPER
TABLE OF CONTENTS

Introduction
Key Attributes of Configuration
    Availability & Redundancy
        Virtual Chassis
        Link Aggregation (LAG)
    Resiliency
        G.8032 Rings & Subrings
        Link Failover/Restoral Timing
        Flap Timers
        Hello Protocols
        Y.1731 Interoperability
        Y.1731 Dual Fault Handling
Extreme Networks Summit X460 and X460-G2
    The Basics
    Hardware Configurability
    Software Configurability
    Warranty
Conclusion

Introduction

When police and fire departments discuss communications requirements for their missions, items like availability, reliability, performance, and value immediately rise to the top of the conversation. When a major US police and fire department needed all these ingredients combined into a cohesive solution, they turned to Extreme Networks along with Extreme Networks Global Alliance Partners. After a full analysis that included multiple switching and microwave suppliers, the customer selected Extreme Networks and two of Extreme's Global Alliance partners to design and build their high performance solution. Extreme Networks switching, paired with state of the art microwave radios, met or exceeded all the requirements with higher performance and lower cost than the alternate vendor's solution.

This paper discusses the architecture that fulfilled these needs and also discusses key attributes of the synergy achieved between the microwave vendor, the radio supplier, and Extreme Networks in public safety deployments.

In this example, the customer architecture required redundant microwave, Ethernet switching, and very high reliability. After layer 3 versus layer 2 topology, resiliency, and cost were evaluated, a layer 2 ring architecture was selected by the radio partners and the end customer. After consideration of several L2 ring protocols, the industry standard G.8032 (ERPS) protocol was selected.
In this example, three G.8032 subrings were nested into one G.8032 main ring as shown. Microwave links were implemented between each site, and in many cases LAN rings were designed extending out from the sites A-H shown.

Key Attributes of Configuration

AVAILABILITY & REDUNDANCY

Virtual Chassis: Each switching location in the network is part of the life-critical mission and so must be equipped with redundancy for all components. The architecture chosen by the integrator and the customer for the deployment used dual switch fabrics, dual management interfaces, and quadruple power supplies at every location to ensure that no equipment failure would cause
mission degradation. Since Extreme Networks can offer this redundancy using two different architectures, each method was evaluated. The two methods considered were device twins and a virtual chassis configuration. The twinned devices use G.8032 resiliency with 802.3ad Link Aggregation Groups (LAGs) for mission critical interfaces and use disparate man-machine interfaces, so they inherently protect against accidental configuration error. The virtual chassis, on the other hand, uses a single network entity with a high-speed connectivity option that creates a simple virtual chassis. In the virtual chassis, multi-switch LAG connections are simplified because the two processors function as one single network element. Another advantage of the virtual chassis architecture selected is that future expansion can be easily accommodated, up to eight switches, while still using the same single network element identity as the customer grows. Both options offer redundancy of management interfaces, in-band and out-of-band management, and primary/alternate IP addressing. After consideration of both options, the customer chose to use two Extreme Networks Summit X460 devices architected into a virtual chassis at each location.

Link Aggregation (LAG): The architecture of the microwave radios and the virtual chassis at each location is critical so that no single point of failure can impact the mission. Since the switches were architected using the virtual chassis capability, simple LAG groups work seamlessly with the microwave radios to handle any failure that might happen. Redundant ports in each radio Ethernet interface connect to two redundant switching elements on one side and then to the resilient Ethernet ring on the other.
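The idea of a LAG whose members sit on two different switching elements can be sketched in a few lines of Python. This is only an illustrative model, not switch code: the class name `StaticLag`, the member labels `1:1` and `2:1`, the MAC addresses, and the use of CRC32 as the hash are all assumptions made for the example, and real hardware uses its own hashing scheme.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class StaticLag:
    """Toy model of a static LAG whose members sit on different switch units."""

    # Member ports written as unit:port (hypothetical labels for this sketch).
    members: list[str]
    down: set[str] = field(default_factory=set)

    def live_members(self) -> list[str]:
        return [m for m in self.members if m not in self.down]

    def is_up(self) -> bool:
        # The logical port stays up while at least one member link is up.
        return bool(self.live_members())

    def pick_member(self, src_mac: str, dst_mac: str) -> str:
        """Hash a flow onto one live member (CRC32 is an illustrative choice)."""
        live = self.live_members()
        if not live:
            raise RuntimeError("LAG is down: no live members")
        key = f"{src_mac}-{dst_mac}".encode()
        return live[zlib.crc32(key) % len(live)]


lag = StaticLag(members=["1:1", "2:1"])
before = lag.pick_member("00:11:22:33:44:55", "66:77:88:99:aa:bb")
lag.down.add("1:1")  # simulate losing the member on the first unit
after = lag.pick_member("00:11:22:33:44:55", "66:77:88:99:aa:bb")
print(before, "->", after, "| LAG up:", lag.is_up())
```

In this model the loss of one unit or one radio port only shrinks the set of live members; the logical port itself never goes down unless every member fails.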
Traffic in the network sees the LAG as a single logical port, so resiliency becomes more transparent to the person doing the configuration, further reducing the possibility of a configuration error. There are multiple methods to implement LAG. In this design the main choices were to enable LAG both in the Ethernet switches and in the radio ports, or to implement LAG in the switches and have the radio ports provide two distinct pipes across the radio hop. Both the Extreme and the selected microwave gear can support both methods, so thought was given to each. If all network elements participate in the LAG protocol, then each hop's specific path is determined when the protocol chooses, or more technically hashes, which path a given frame will transit. If the radios provide two distinct pipes, then the hash takes place in the switches without the intermediate transport taking a role in the hash. LAG control can be automated by a protocol (LACP), or the LAG can be configured statically so that its operation is always the same. After testing both methods thoroughly, it was determined that the cleanest implementation would be to configure static LAG on the switching elements and allow the radio devices to present two redundant transport pipes.

RESILIENCY

G.8032 Rings & Subrings: In addition to the network element redundancy described above, the network itself is designed to be highly available. During the design of the network, traffic flows were considered to allow for the minimum interruption under any failure scenario. Since weather and other factors come into consideration across the microwave radios, and since those conditions might be highly variable, the industry standard and highly reliable G.8032 ring protocol was selected for the architecture. One of the implemented networks required a design containing four nested rings with other laterals and subrings extended from those.
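The basic loop-prevention idea behind a protected Ethernet ring can be illustrated with a small Python model. This is a deliberately simplified sketch of a single ring, not the nested topology deployed here and not the G.8032 state machine: one link (the ring protection link, or RPL) is held blocked in normal operation and is unblocked when another link fails. The node names, the choice of RPL, and the function names are assumptions made for the example.

```python
from collections import deque


def ring_links(nodes):
    """Return the links of a simple ring as frozensets of two neighbours."""
    return [frozenset((nodes[i], nodes[(i + 1) % len(nodes)]))
            for i in range(len(nodes))]


def forwarding_links(links, rpl, failed=None):
    """Links that carry traffic: the RPL is blocked unless another link failed."""
    if failed is None:
        return [link for link in links if link != rpl]
    # Protection state: the failed link is out of service, the RPL is unblocked.
    return [link for link in links if link != failed]


def reachable(active, start):
    """Breadth-first search over the active links, starting from one node."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for link in active:
            if node in link:
                (peer,) = link - {node}
                if peer not in seen:
                    seen.add(peer)
                    queue.append(peer)
    return seen


nodes = list("ABCDEFGH")        # hypothetical single ring of eight sites
links = ring_links(nodes)
rpl = frozenset(("H", "A"))     # assumed choice of ring protection link

normal = forwarding_links(links, rpl)
protected = forwarding_links(links, rpl, failed=frozenset(("C", "D")))

for name, active in (("normal", normal), ("after C-D failure", protected)):
    connected = reachable(active, "A") == set(nodes)
    loop_free = len(active) == len(nodes) - 1
    print(f"{name}: connected={connected}, loop_free={loop_free}")
```

In both states every site remains reachable over exactly one fewer link than there are nodes, which is the condition for a connected, loop-free layer 2 topology.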
Because the rings are nested, traffic shares common links between many of them. This, plus the requirement for high speed failover discussed below, is part of why other vendors' solutions fell short even at a much higher cost. Generally the alternative solutions were restricted to L3 designs because of the topology. But G.8032 phase II, which is implemented by both Extreme and the microwave gear vendor, provides for the creation of a sub-ring architecture that allows easy loop prevention with very fast layer two (L2) failover and recovery times.

Link Failover/Restoral Timing: Since link failover scenarios needed to be resolved in the order of 50ms (milliseconds), and since it was highly desirable that ring failures produce zero packet loss as frequently as possible, failure analysis was needed for the architecture. After the analysis and appropriate test validation,
failure tables were prepared for the architecture. Since the mission carries life-critical traffic, failure recovery times expressed at layer two needed to be in the order of 50ms. In the design it was determined that voice services restoral under 100ms would be undetectable, so recovery times certainly needed to be below that. An engineering exercise followed in which each and every link in the network was failed while the failure's impact was analyzed on every other service location in the network. In this manner it was proven that the selective flush capability of the Extreme Networks G.8032 implementation actually resulted in failure recovery times of 0ms in many cases when the failures were considered at the network level. A 0ms failure recovery time means that no packets at all would be lost and that the service BER retains a perfect score across the failure. This network dimensioning exercise is necessary to truly understand how a given failure impacts the network, as opposed to other more simplistic measures.

The table shown here is an example recovery/restoral table for the Fire side of the deployment (Figure 3), which consisted of a seven-node ring. As you can see in the L2 Recovery Table, for any given radio degradation or fiber cut in this network, a given service in the network has a 69% chance of completing the network failure/restoral cycle without losing a single packet. And those services that must have their traffic rerouted because of the failure will almost certainly yield an imperceptible switchover to the Emergency Services user of the service.

Flap Timers: The Extreme Networks implementation of G.8032 incorporates flap timers to mitigate the network impact of high-speed failure/recovery operations within the network. For example, if a crane were to spin around near a roof-top microwave transmitter, the link might fluctuate as the crane spun into and out of the microwave signal.
In a lesser network design the link would flap, meaning it would fail and restore constantly, and might cause an interruption of service each time it flapped. In implementations such as spanning tree, convergence might be slow enough that the network would remain out of service for as long as the link fluctuated. After some engineering discussion it was decided to set the programmable G.8032 wait-to-restore timer to five seconds. This means that the ring waits until the physical link has been stable for 5000ms before moving traffic back onto it. This is also why G.8032 restorals are so fast.

Hello Protocols: Since microwave link failures can find their origin in thunderstorms, fog, ice, intermediate transport domain failures, or even moving machinery, each link in the network must have a hello protocol implemented that enables the detection of logical faults. On a LAG port, these hello timers must operate on each link of the LAG. A logical fault is an interruption of the link path in which all the hardware ports remain in working order; in other words, the ports do not go down during these failures, it's just that no packets can get through the link. A hello protocol constantly sends hello messages to its link peer on the opposite side of the radio transmission and reports a hello timeout if the hello messages are not correctly received. The port or the ring can then act upon the reports from these detection mechanisms to move the traffic to another link.

In the Extreme Networks microwave transport implementation, there were several options for the L2 hello protocol that would satisfy the reliability requirements. The two primary options considered were 802.1ag Continuity Check Messages (CCM) and Extreme Link Status Monitoring (ELSM). There were advantages and disadvantages to each. For example, the CCM for both the Extreme Summit X460 and the microwave equipment is built into the hardware of the port.
Because of this, its hello messages can be sent every 3.3 milliseconds with imperceptible device CPU loading. ELSM, on the other hand, is simpler to configure and has the advantage of driving its port status to busy when a logical failure is encountered. Both the X460 and the selected microwave transport radios interwork with either CCM or ELSM. In this implementation ELSM has an advantage because it was desirable that, when a logical fault is detected on a single LAG member, the port status be driven to a busy state. Accordingly, ELSM was selected for this deployment, and the integration partner decided to run the hello protocol every 100ms.

Y.1731 Interoperability: While not implemented in this particular architecture, the Extreme Networks X460 and the microwave equipment ports both support Y.1731 CCM hello messages, as differentiated from the 802.1ag CCM discussed earlier. When this feature is implemented, both switching and radio network elements actively participate in the ring protocol. For example, consider a transport link having five radio hops across a sparse territory, or across a stretch where no customer services join or
leave the network. If the hello protocol of the ring includes only the Ethernet switching on each end of the five-hop segment and there is a failure in the middle, network managers would only know that the service was down somewhere in the middle. If the switching and radio elements both interoperate in the ring protocol, then the ring restoral can happen at precisely the spot in the link where the failure actually happened. The advantage of this is that network managers can instantly see the precise port/link impacted by a problem, which in turn speeds identification and recovery time. Since all elements participate in the ring recovery process, there is also additional protection against network isolation.

Y.1731 Dual Fault Handling: Another advantage of Y.1731 standard interoperability is the ability to use what is called an UP-MEP in subrings in order to provide protection against multiple ring failures. The Extreme X460, the X460-G2, and the selected microwave equipment have all implemented this protocol, which enables the Subring (see Figure 1) to detect when a segmentation of the Main ring happens and change its traffic flow accordingly. By doing this, the likelihood of isolation when two simultaneous failures occur is lowered or eliminated. This implementation is per Appendix X.3 of the G.8032 standard.

Extreme Networks Summit X460 and X460-G2

Extreme Networks is able to provide the public safety customer with a wide array of technology to choose from. All sorts of considerations become important when the specific technology for a given build is selected. Reliability, processing power, port speed, protocol support, configuration options, cost, and warranty are all very high priorities. Also at the top of the list is how easily the selected equipment might be upgraded as the network changes or expands. After consideration of all these factors and more, the Extreme Networks Summit X460 was selected for this particular Police and Fire Department build.

The Basics: The Extreme Networks Summit X460-G2 is a 1RU (a Rack-Unit is 1.75 inches) Ethernet switch that has many different form factors, less than 4 microseconds of latency at a 64 byte frame size, and the configurability needed for the public safety environment. [1] For this particular large build, two X460s were deployed into each location, with each of the two X460s also having redundant power. A 40 Gb/s link serves as a virtual backplane between the units in the assembly and presents the units to the customer as a single network element. Network management sees only one IP address that has dual switching fabrics, dual management ports, and quad power supplies. The operator, for example, configures port 1:1 for the first port in the top RU and port 2:1 for the first port of the second RU. The 802.3ad LAGs can cross between the units, allowing the configuration of services with no single point of failure.

Hardware Configurability: The X460-G2 comes in 24 or 48 port form factors, either copper or fiber, with or without PoE, and with or without 4 built-in 10Gig ports. In the rear of each device there are slots for expansion modules and power supplies. Shown here is an X460-G2 with one DC and one AC power supply, a 40Gig QSFP+ VIM module, a fan assembly, and a clocking module. Shown above the unit are other available modules. From left to right are the two port 40Gig QSFP+ module, the two port 10Gig SFP+ module, the two port 10Gig copper module, a stacking module, and a timing module. Fans can be selected to move air front-to-back or back-to-front. As discussed above, the X460 or the X460-G2 can be deployed with single switch fabric / single power, single fabric / dual power, dual fabric / dual power, or dual fabric / quad power. The virtual chassis allows up to eight units of any form factor to become a single network element. LAG groups can naturally and simply be split across multiple devices to increase reliability.
More than one device can be configured to take the active (called master) role, so there is a high degree of engineering flexibility to design very highly reliable interfaces. In multiple master environments, each master can have either the device primary management IP address or a secondary one assigned. If an active master fails, the management interface of the next master unit retains the same IP address.

Software Configurability: The X460 comes standard with a cost-effective Edge software license that enables a wide range of capability including VLANs, Stacked VLANs (VMAN), L2 Rings (both EAPS and G.8032), LAG, ELSM, CFM, ACLs, IP unicast and multicast L2 switching, IGMP, many IPv4 and IPv6 capabilities, mirroring, Y.1731, CLEAR-Flow, PIM, and more. The Advanced Edge software license adds multiple EAPS/G.8032 domains, 4 active OSPF interfaces, VRRP, two PIM-SM interfaces, and more.

[1] More on the X460-G2: http://learn.extremenetworks.com/rs/extreme/images/summit-x460-g2-ds.pdf
The Core software license then adds EAPS common links, PIM SSM, full OSPF, BGP IPv4 and IPv6, IS-IS, and more. MPLS, AVB, OpenFlow, and more are available as options. The take-away is that customers do not pay for features they do not need.

The operating system is called ExtremeXOS and is highly modular and easily extensible. It supports module level upgrades, process memory protection, scripting, XML, and integrated security. It is also rich in usability and troubleshooting features. An example is the Extreme Discovery Protocol, or EDP. This protocol is used to gather information about neighboring switches and then simply allows the customer to understand what each interface on the X460 is connected to, what MAC addresses are there, VLANs, and more.

Another simple example is how ACLs (Access Control Lists) are implemented in ExtremeXOS. ACLs are used to define packet filtering and forwarding rules for traffic traversing the switch. Each packet arriving on an ingress port and/or VLAN is compared to the access list applied to that interface and is either permitted or denied. In ExtremeXOS, ACLs can often be defined right from the command line using only a few short entries. Note in the figure that four simple commands are used to create a meter on VLAN 500 that will throttle traffic on VLAN 500 to 128 kb/s. This is just the smallest example of the power of ExtremeXOS.

Another powerful example of ExtremeXOS is the ability to use Python scripting or applications. [2] Python scripting allows users to write run-to-completion scripts that make repetitive and data collection tasks more convenient. Python applications allow users to add their own applications to ExtremeXOS to customize and extend switch behavior to solve specific business needs for their networks. For example, Extreme Networks and the radio provider in the deployment detailed here are developing Python based apps which will pair the microwave radio link speed to the switching element.
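As a rough illustration of what the core logic of such a link-speed pairing app might look like, the following Python sketch recomputes per-class meter rates whenever the radio reports a new capacity. It is not the app under development and calls no ExtremeXOS API; the class names, rates, weights, and the `plan_meters` function are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrafficClass:
    name: str
    guaranteed_kbps: int  # rate reserved before any sharing takes place
    weight: int           # share of whatever capacity is left over


def plan_meters(link_kbps: int, classes: list[TrafficClass]) -> dict[str, int]:
    """Split the reported radio capacity across traffic classes.

    Guarantees are granted in list order (highest priority first); any
    remaining capacity is divided in proportion to the weights.
    """
    rates: dict[str, int] = {}
    remaining = link_kbps
    for tc in classes:
        granted = min(tc.guaranteed_kbps, remaining)
        rates[tc.name] = granted
        remaining -= granted
    total_weight = sum(tc.weight for tc in classes)
    if remaining > 0 and total_weight > 0:
        for tc in classes:
            rates[tc.name] += remaining * tc.weight // total_weight
    return rates


# Hypothetical traffic classes, listed from highest to lowest priority.
classes = [
    TrafficClass("voice", guaranteed_kbps=20_000, weight=0),
    TrafficClass("dispatch-data", guaranteed_kbps=50_000, weight=1),
    TrafficClass("video", guaranteed_kbps=0, weight=3),
]

# Hypothetical reported capacities, from a clear link down to a deep fade.
for reported_kbps in (400_000, 100_000, 40_000):
    print(reported_kbps, plan_meters(reported_kbps, classes))
```

The design choice in this sketch is that voice keeps its guarantee for as long as the link can carry it, while lower-priority classes absorb the loss of capacity first.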
The radio will report the link speed in sub-ten millisecond messages to the Ethernet switching, which will then adjust QoS, metering, and/or the packet flow accordingly.

Warranty: Value was mentioned in the first sentence of this paper. Ultimately value is the product of all the attributes discussed here and more. But value also comes in the fiscal realm, and holding down operational costs is paramount to success. To help customers experience predictability across the life of the deployment, the X460 comes with the Extreme Networks Limited Lifetime Warranty, including Advanced Hardware Replacement. What this translates to is that, should a device or a port on a device fail, Extreme will make commercially reasonable efforts to ship a replacement device within one business day. When the customer receives the replacement, they send back the original unit. This hardware warranty is good from the shipment date until the end-of-support date, which is five years after the original end-of-sale date. [3] This level of hardware warranty support is a fundamental value when life-cycle costs are considered.

Conclusion

Public Safety networks are by nature mission critical. By nature they must be highly reliable. By design they must be highly resilient. When the mission is critical, high performance solutions must prevail. Extreme Networks and its Alliance partners are proud to be part of the very highest reliability networks in the world. Every component has been carefully selected, interoperability tested, and field deployed, and has been found to represent the best performance and value available anywhere. So as police departments, fire departments, public safety entities, and emergency services go about their mission, Extreme Networks stands ready to help.

[2] Python scripting is Generally Available in ExtremeXOS now; Python applications are planned for software release 15.7.1, which is scheduled for General Availability in 1Q 2015.
[3] More details at http://extrcdn.extremenetworks.com/wp-content/uploads/2014/01/EndofLifePolicy.pdf

http://www.extremenetworks.com/contact
Phone +1 408 579 2800

© 2014 Extreme Networks, Inc. All rights reserved. Extreme Networks and the Extreme Networks logo are trademarks or registered trademarks of Extreme Networks, Inc. in the United States and/or other countries. All other names are the property of their respective owners. For additional information on Extreme Networks Trademarks please see http://www.extremenetworks.com/company/legal/trademarks/. Specifications and product availability are subject to change without notice. 8772-1114