Redefining Software Scalability for the Network Infrastructure



By Paul N. Leroux, Technology Analyst, QNX Software Systems Ltd. (www.qnx.com)

In their efforts to build a comprehensive range of networking products, many equipment manufacturers have invested in an equally wide range of operating systems (OSs). The results are predictable: code can't be reused across products, engineers can't move quickly from one project to another, and the networking products themselves can't offer end-to-end consistency of software services and management tools, much to the customer's inconvenience. In this paper, we look at how a microkernel OS based on network-transparent IPC can address these issues by allowing applications to be coded once, then deployed across entire product lines. With this OS architecture, the same application can run on a single-processor device, be partitioned across a cluster of loosely coupled processors, or run on an SMP system, all without recoding or relinking. The net effect: less development effort, reduced testing, greater product consistency, and higher return on investment.

The High Cost of OS Ownership

For companies building carrier-class networking equipment, the ability to reuse software across an entire product line holds immense commercial advantages. For example, if applications and system software deployed in a core network element can, without modification, be reused in edge or aggregation devices, then the equipment vendor can achieve both higher return on software investment and faster time-to-market.

Nonetheless, in their efforts to build a wide range of network elements, many equipment manufacturers have had to invest in an equally wide range of operating systems (OSs). It's common for a manufacturer to use 5, 10, even 20 OSs, each with different tools, different APIs, and different maintenance problems. The consequences are predictable. More often than not, code can't be reused across projects, and engineers have to learn a new OS and new tools when moving from one project to another. Return on code investment is, to say the least, limited, as is the ability to deploy multiple products quickly.

The customer is also affected. Since software varies from product to product, so can interfaces and management tools. The skills that the customer has acquired for using one product don't always apply to other, similar, products. The end result: higher cost of ownership.

The Demand for Massive Scalability

But what if all those OSs weren't necessary? What if one OS could let you use the same code, tools, and APIs (and, by extension, the same developers) for everything from edge devices to carrier-class equipment? More to the point, what if the OS could let you reuse application binaries, not just source code, across complete product lines? A tall order. The OS would, in fact, have to be massively scalable.
For example, it would need to:

- address an enormous range of memory configurations, everything from a few hundred kilobytes to several gigabytes
- have the ability to coordinate hundreds, if not thousands, of simultaneous software processes [1]
- allow the same applications and drivers to run on a single processor, across a network of loosely coupled processors, or on a tightly coupled SMP system, all without recoding

Scalability Across Loosely Coupled Processors

Conventional OS architectures fall short on all these counts, particularly the last. Let's look at an example. In the distributed architecture of a modern high-end router, network growth is, in theory, easily accommodated: you simply add more line cards, each capable of making its own routing decisions. On one hand, this decentralization avoids the bottleneck of a single routing processor. On the other hand, many line cards attempting to communicate simultaneously with the main processor can quickly overload the system bus. The router, as a result, can't scale to handle increased traffic, even though it has the raw processing power to do so.

One solution: move software intelligence, such as the routing database, off the main processor and onto the line-card processors, thereby freeing up the system bus. Unfortunately, conventional OS architectures would make moving the database difficult, for several reasons.

First, most OSs don't provide network-transparent interprocess communication (IPC). So, if you split up an application's components across different CPUs, you must also add network-specific code so those components can continue talking to each other. In our case, you'd have to recode the database, along with any software modules it communicates with.

As an added complication, most or all software modules in conventional RTOSs are bound to the kernel. So, to move the database process from the main processor to the line card, you'd probably have to create, and test, two new kernel images: one for the line card and one for the main processor.

[1] To enable high system availability, the OS should allow virtually any of these processes, be it an application, driver, or OS module, to be upgraded or restarted dynamically, without interruption of service.

Of course, similar problems would occur if, say, you tried to move an application distributed across multiple processors to a lower-end, single-processor product. The application would, in effect, be locked in to the current design.

With network-transparent IPC, any process can be moved from one CPU to another without recoding the process itself or any other processes it communicates with. Likewise, the various processes that make up an application can either run on a single CPU or be distributed across multiple loosely coupled CPUs, again without recoding.

Unlocking the design: The QNX approach

The QNX realtime OS (RTOS) sidesteps these problems in two ways. First, it uses a true microkernel architecture that decouples applications, protocol stacks, drivers, and even high-level OS services (e.g. file systems) from the OS kernel. As a result, every software module can be an independent, MMU-protected process whose binary can be moved, without relinking, from one CPU to another. No kernel reconfiguration or retesting required.

[Figure: With a true microkernel OS architecture, every driver, application, and protocol stack, along with services such as the GUI manager, file system, and device I/O manager, runs as a separate, MMU-protected process above the microkernel and the hardware. This is also known as the universal process model (UPM) architecture.]

Second, the QNX RTOS provides a global IPC interface, message passing, that operates identically in both local and network-remote cases. As a result, any process or thread on a given CPU can transparently access any resource associated with any other CPU. No networking code required. From the application's perspective, there's simply no difference between a local and a remote resource. In fact, an application would need special code to tell whether a resource, be it a database, file, or I/O device, resides on the local CPU or on some other CPU on the network. [2]

Higher ROI and reliability

What does this mean? Instead of having islands of computing, where each processor is effectively isolated, you now have a "virtual supercomputer" model, where messages flow freely across processor boundaries. In the case of our router example, this network transparency neatly removes a limit on scalability: the database can be moved as is to another processor. Increased scalability aside, this approach provides:

- Better return on investment: Programmers can design an application just once; they don't have to recode (and recompile and relink and retest) if the application has to be moved or partitioned across different processors.
- Greater consistency across products: Many of the same programs, in fact the same binaries, performing administration and control functions in, say, a backbone router can be reused as is in SOHO devices. As a result, network administrators can work with the same interfaces and management tools across a wide spectrum of networking equipment.
- Significantly greater reliability and confidence: Since the same software can be used across both higher- and lower-end products, improvements derived from field-testing one device can be directly applied to the device's smaller (or larger) cousins. Reliability and product quality improve across the entire product line.
- Freedom to introduce new network architectures: Since applications don't need network-specific code, new network architectures, using various hardware and protocols, can be introduced without having to recode the applications. Simply put, applications don't "care" what protocol or physical medium they communicate over; it could be Ethernet today or a backplane bus tomorrow.
Network-wide I/O namespace

In addition to network-transparent message passing, the QNX RTOS shields applications from networking issues by allowing all resources (database services, network connections, I/O devices, and so on) to be viewed and handled as files. For example, if the database manager in the above example wishes to provide its database services to other processes, it can register a unique pathname in the network-wide I/O namespace. Any client application that wishes to use those services simply issues standard POSIX calls (open(), read(), write(), and so on) on that pathname. The database manager will then take appropriate action based on the call made by the application.

With this approach, it doesn't matter which CPU the client is running on; likewise, the client doesn't need to know where the database manager resides. The client simply writes to a pathname, and the OS automatically routes the request to the appropriate process.

[2] For an in-depth discussion of how QNX message passing enables both network transparency and a high level of realtime performance, see the QNX RTOS v6 System Architecture Guide at http://qdn.qnx.com/support/docs/neutrino_2.11_en/sys_arch/about.html
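As an illustration, here is a minimal sketch of such a client. The pathname "/net/cp0/dev/routedb" is hypothetical (in practice the database manager would register whatever name it chooses, and QNX's resource-manager interface would service the calls); everything the client does is plain POSIX.

    /* Sketch only: consume the database service purely through POSIX
     * calls on a (hypothetical) registered pathname. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char entry[128];

        /* Works identically whether the manager is local ("/dev/routedb")
         * or on another CPU, reached via the network-wide namespace. */
        int fd = open("/net/cp0/dev/routedb", O_RDWR);
        if (fd == -1) { perror("open"); return 1; }

        /* Ask for a route; the manager interprets the bytes it receives. */
        write(fd, "lookup 10.1.2.0/24\n", 19);
        ssize_t n = read(fd, entry, sizeof(entry) - 1);
        if (n > 0) { entry[n] = '\0'; printf("route: %s", entry); }

        close(fd);
        return 0;
    }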

An Alternate Solution: Create a Load-sharing Manager

With this facility in mind, let's return to our example and look at another approach to handling the traffic from multiple line cards. This time, instead of moving the database, we could implement two main processors, each mirroring the database. We could then create a load-sharing manager that would distribute requests coming from line cards across the two processors. Besides handling a larger number of line cards, this approach could also provide redundancy in case one main processor or one of the databases failed. In this case, the load-sharing manager could automatically shunt all requests to the remaining database until the failed database recovered.

The important thing here is that existing applications don't have to be recoded. For example, if a process on a line card makes a request of the database, it would use the same pathname, regardless of which processor might actually handle the request. The load-sharing manager would decide which processor the request goes to, without involving the application.

[Figure: Network-transparent IPC simplifies the design of 2N redundant systems, since applications don't have to be coded to know which, or how many, CPUs a service resides on. Requests to a mirrored database, for example, can be handled by a separate load-sharing manager, leading to a more scalable, cleanly partitioned design.]
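The decision logic at the heart of such a manager can be sketched in a few lines. This is a conceptual illustration under assumed names: the two mirror pathnames are hypothetical, and a production manager would be written as a QNX resource manager that relays each client's messages rather than opening file descriptors itself.

    /* Conceptual sketch: pick a mirror round-robin, shunt to the other
     * if the preferred one is down. Pathnames are hypothetical. */
    #include <fcntl.h>
    #include <unistd.h>

    static const char *mirrors[2] = {
        "/net/cp0/dev/routedb",   /* mirrored database on main CPU 0 */
        "/net/cp1/dev/routedb",   /* mirrored database on main CPU 1 */
    };
    static unsigned next;         /* round-robin cursor */

    /* Return a descriptor to a live mirror; if the preferred mirror
     * fails to open, fail over to the other one. */
    int open_mirror(void)
    {
        for (int attempt = 0; attempt < 2; attempt++) {
            int fd = open(mirrors[(next + attempt) % 2], O_RDWR);
            if (fd != -1) {
                next = (next + attempt + 1) % 2;  /* rotate for next request */
                return fd;
            }
        }
        return -1;                /* both mirrors down */
    }

Because the line cards address only the manager's pathname, this policy, including the failover, is invisible to them.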
Scalable Bandwidth through Multiple Network Links

So far, we've looked at a couple of ways in which network-transparent IPC can help us handle greater network traffic, and thereby increase scalability. Sometimes, however, the only solution is to increase actual bandwidth. For example, we could connect the processors via multiple links, whether those processors talk over a switch, system bus, LAN, serial link, or any combination thereof.

Unfortunately, conventional OS architectures don't offer seamless support for using multiple links over different types of media. In fact, since interprocess communication (IPC) is typically implemented "by hand" for each protocol, trying to make every application aware of multiple links, with each link potentially handled by a different protocol, is daunting at best.

To address this problem, the QNX RTOS provides inherent support for multiple links, again without any need for special application code. In fact, this capability provides not only higher throughput, but also network fault-tolerance. For example, you can choose from the following classes of service:

- Load-balance: Queue packets on the link that will deliver them the fastest, based on current load and link capacity. This policy uses the combined service of all links to maximize throughput and allows service to degrade gracefully if any link fails. If a link does fail, periodic maintenance packets are tried on that link to detect recovery. When the link recovers, it's placed back into the pool of available links.
- Redundant: Send every packet over all links simultaneously. If a packet on link A arrives before the same packet on link B, the packet on link A "wins." Redundant packets that arrive later are quietly dropped. With this policy, service can continue without a stutter even if one link fails. (A simplified model of this policy appears in the sketch below.)
- Sequential: Send out all packets over one link until it goes down, at which point use the second (or third, or fourth) link. This option doesn't provide higher throughput, but does offer fault-tolerance.
- Preferred: Same as sequential, but fall back to load-balancing if the specified link can't be used; that is, use all available links to reach the remote node.

[Figure, load-balancing (active/active): Packets travel on whichever link (fiber, Ethernet, or other) will deliver them the fastest.]
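To make the redundant class of service concrete, here is a toy model of its semantics: transmit on every link, deliver the first copy of each sequence number, and quietly drop the duplicates that arrive later on slower links. This illustrates the policy only; it is not QNX's Qnet implementation.

    /* Toy model of the "redundant" class of service. */
    #include <stdint.h>
    #include <stdio.h>

    #define NLINKS 2

    struct packet { uint32_t seq; char payload[64]; };

    /* Stand-in for a per-link driver transmit routine. */
    static void link_send(int link, const struct packet *p)
    {
        printf("link %d: seq %u\n", link, p->seq);
    }

    /* Send side: one packet, all links, simultaneously. */
    void redundant_send(const struct packet *p)
    {
        for (int l = 0; l < NLINKS; l++)
            link_send(l, p);
    }

    /* Receive side: the first copy of a sequence number "wins"; later
     * duplicates are quietly dropped. Returns 1 to deliver, 0 to drop. */
    int redundant_receive(const struct packet *p)
    {
        static uint32_t next_expected;   /* next sequence to deliver */
        if (p->seq < next_expected)
            return 0;                    /* duplicate: drop */
        next_expected = p->seq + 1;
        return 1;                        /* deliver to the application */
    }

If one link fails, the copies on the surviving links are already in flight, which is why service continues without a stutter.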

[Figure, redundant (active/active): Packets travel across all links (fiber, Ethernet, or other) simultaneously. If one link fails, service can continue without a stutter.]

[Figure, sequential (active/standby): Packets travel on the primary link. If that link fails, packets are automatically rerouted to the secondary link, and so on. The figure shows OS modules (device driver, file system, database, hot swap manager) running above the OS microkernel, connected by a maintenance bus.]

Importantly, Qnet, the QNX resource manager that provides these services, is abstracted from the actual transport layer. It doesn't know, or care, whether the connections are fiber, Ethernet, serial, and so on. Nor do user applications. The specifics of the physical transport are handled by a separate driver that talks directly to the hardware. This approach provides:

- Freedom to mix and match interfaces: The designer can mix and match network links according to the needs of the design. One link could be fiber, the second serial, and so on. No special application coding is needed. The same applies for links that connect processors residing in different machines: one link could be ATM, the second ISDN, the third 100 Mb/sec Ethernet, and so on.
- Better code reuse across products: Since applications distributed across processors don't have to know how many links, or what kind of links, exist between the processors, the same application binaries can be reused across any number of product configurations. For example, a database program could be talking to client programs running on another CPU on the same bus. In another installation, the exact same client programs could be running on an entirely different machine connected by multiple links. And in yet another (low-end) machine, the clients and the database could be on the same processor. Neither the database nor the client processes would know the difference.

Scalability Across Tightly Coupled Processors (SMP)

In many networking devices, the workload for the control-plane CPU has ballooned to the point where even the fastest CPU can't keep up. For instance, in a high-end router, the CPU must handle compute-intensive protocols such as OSPF, maintain a routing database of 500,000 or more entries, perform OA&M functions, process SNMP packets, and download a subset of the routing table to each line card, as well as handle any new network services coming down the pipe. With network bandwidth doubling at twice the speed of CPU performance, the problem shows no signs of letting up.

To meet these computational demands, more and more systems designers are distributing the workload across multiple CPUs, using symmetric multiprocessing (SMP). SMP is often called the "shared everything" approach to multiprocessing, since the multiple CPUs share the same board, memory, I/O, and operating system (OS). In fact, this shared approach contributes to one of SMP's key advantages: low cost. For instance, when scaling from one to two CPUs, you still only use one processor board, not two; you effectively double your processing power without paying for additional support chips and without taking up an additional slot in the chassis.

[Figure: CPUs connected by a high-bandwidth memory bus. The QNX RTOS conforms to the Intel MultiProcessor Specification and can support up to 8 Pentium or Pentium Pro processors.]
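One practical consequence of binary scalability is that the same executable can discover at runtime how many CPUs it has to work with. A small sketch, assuming the QNX Neutrino system-page interface (field names may differ across OS releases):

    /* Sketch: size work to the number of CPUs the board actually has,
     * so one binary serves single-CPU and SMP members of a product line. */
    #include <stdio.h>
    #include <sys/syspage.h>

    int main(void)
    {
        unsigned ncpu = _syspage_ptr->num_cpu;
        printf("running on %u CPU%s\n", ncpu, ncpu == 1 ? "" : "s");
        /* e.g. size a worker-thread pool to keep every CPU busy */
        return 0;
    }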
Nonetheless, before choosing an OS to implement SMP, the systems designer should ensure that the OS will allow the same software, ideally the same tested binaries, to be reused across both single-CPU and SMP members of a product family. This, in turn, will ensure higher return on investment, end-to-end software consistency across the product line, and, importantly, higher reliability.

There's another issue. While SMP can boost performance dramatically, the law of diminishing returns can come into play as multiple processors contend for the same memory subsystem. So it's critical that the OS used to implement SMP doesn't add any unnecessary overhead on top of these natural barriers.

That's a problem, since SMP is commonly associated with large, monolithic OSs used in enterprise server roles. Because the kernels in these OSs contain the bulk of OS services, adding SMP support typically requires large numbers of performance-robbing modifications and specialized spinlocks throughout the kernel code. Also, since all device drivers run in the kernel space, adding SMP support means modifying each driver as well. In fact, one reason SMP isn't used more frequently is the difficulty of implementing it in software. Consequently, designers must often deploy limited implementations, where only certain routines are allowed to run on the second processor, resulting in modest performance gains of just 10 to 30 per cent.

No recoding required: the microkernel approach

An OS with a microkernel architecture, such as the QNX RTOS, helps designers avoid the above problems. Compared to monolithic OS kernels, the QNX microkernel is extremely small, since most OS-level services (file systems, drivers, protocol stacks, and so on) exist as user programs that run outside the kernel space. Consequently, the kernel modifications required for SMP are equally small: just a few additional kilobytes of code. In fact, only the kernel has to be modified. All other multithreaded services (file systems, drivers, applications) can gain the performance advantages of SMP without the need for code changes.

Since so little code has to be added to the kernel, this approach to implementing SMP incurs negligible overhead. And compared to monolithic OS models, it's inherently more reliable, since there's simply less to go wrong.

[Figure: With a microkernel architecture, drivers, protocol stacks, and OS modules can migrate from a single-processor device to a single SMP card, or be distributed across a network of loosely coupled SMP devices, without recoding.]

Scalability on demand

Of course, for the networking equipment manufacturer, it's equally important that custom applications and drivers, not just off-the-shelf OS modules, can move unmodified between single-processor and SMP systems. And, in fact, with this microkernel approach, there's no need to recode or relink custom software, provided the software has been coded to be "SMP safe." [3]

Combining SMP with real time

To optimize cache performance, the QNX RTOS supports processor affinity: the kernel will always try to dispatch a thread to the CPU where the thread last ran. To further enhance performance, QNX provides an affinity mask, which can, for example, let you relegate all non-realtime threads to a particular CPU. The remaining CPUs would then always remain free to execute time-critical processes. In general, however, this approach isn't necessary, since QNX's realtime scheduler will always preempt a lower-priority thread immediately when a higher-priority thread becomes ready.

In fact, thanks to these preemptive capabilities, the QNX RTOS can help an SMP device handle an increase in system load without the cost (and complexity) of adding more CPUs. That's because time-critical tasks, such as routing table updates, are always executed in a predictable time frame, no matter how many other processes demand CPU time. Response times can remain constant, even as overall system load increases.
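As a sketch of the affinity mask just described, the following uses the QNX Neutrino ThreadCtl() runmask control. The constant and calling convention here follow later Neutrino documentation and may differ in 2002-era releases; bit n of the mask permits the thread to run on CPU n.

    /* Sketch: confine a non-realtime thread to CPU 0, leaving the
     * remaining CPUs free for time-critical work. */
    #include <stdio.h>
    #include <sys/neutrino.h>

    int main(void)
    {
        if (ThreadCtl(_NTO_TCTL_RUNMASK, (void *)0x1) == -1) {
            perror("ThreadCtl");
            return 1;
        }
        /* ... non-realtime housekeeping runs here, pinned to CPU 0 ... */
        return 0;
    }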
Also, as an RTOS, QNX can deliver context-switch speeds for threads and processes in the sub-microsecond range, orders of magnitude faster than OSs conventionally used in SMP server roles. As a result, CPUs waste much less time switching from one thread to another and have more time to execute compute-intensive applications.

True Scalability: The Commercial Advantage

Little or no recoding. That, in a nutshell, is the hallmark of true scalability. As we've seen, it's not enough for an OS to simply scale down (or up) in memory footprint, or to support SMP. Rather, it must also allow software applications to move seamlessly across networking products, from edge devices to gigabit routers, without redesign or recoding, and with minimal retesting.

This level of scalability isn't merely desirable. Given the massive range of products a networking equipment manufacturer may offer, it is, in fact, an immense commercial advantage, whether you consider development costs, time-to-market, or customer satisfaction. Code can be reused rather than redesigned. Engineers can move freely between projects, without retraining. And customers can enjoy the convenience and lower cost of ownership of using the same set of tools and interfaces across an entire product line.

[3] For the most part, this simply means that applications use standard POSIX primitives to control access to shared data structures. Of course, to reap the full benefits of SMP, the application should be designed with enough parallelism, achieved through multiple independent threads, to keep multiple CPUs busy.
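Footnote [3] mentions standard POSIX primitives; a minimal sketch of what "SMP safe" means in practice, using a hypothetical shared counter:

    /* Sketch: guard shared data with a POSIX mutex. The same code is
     * correct on one CPU or eight; the mutex serializes access no
     * matter where each thread is dispatched. */
    #include <pthread.h>

    static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
    static unsigned long route_updates;   /* shared between threads/CPUs */

    void record_update(void)
    {
        pthread_mutex_lock(&table_lock);
        route_updates++;
        pthread_mutex_unlock(&table_lock);
    }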