Network Management Framework: A Distributed Virtual NOC Architecture

Octavian Rusu
RoEduNet Iasi Branch, Iasi, Romania
octavian@roedu.net

Florin B. Manolache
Mellon College of Sciences, Carnegie Mellon University, Pittsburgh, PA, USA
florin@andrew.cmu.edu

Abstract

Today's networks superpose multiple sets of services belonging to different participants (universities, research networks, governmental organizations) on the same communication infrastructure (data backbones, operators' NOCs). Each participant must implement its own services and policies without deploying full-size staff at every node location. We propose a model that shows how a participant should organize and manage its network presence with minimum investment and maximum efficiency. The model is based on a structure named Distributed Virtual NOC, which contains a centralized component and allows delegation of tasks and services to remote locations, but keeps the global behavior coherent by implementing distributed control mechanisms in both the geographic and the service dimensions. An implementation of the model based on Open Source software with web management interfaces was developed successfully by RoEduNet Iasi. This paper presents the general structure of Distributed Virtual NOCs, together with concrete issues and solutions of the implementation.

1. Introduction

There are three classical strategies used in network management: centralized, distributed, and hierarchical. These strategies work well when the networks are clearly separated on physical criteria, and when each network observes a single set of rules and has a single management for its entire activity.

Network management is defined as the mechanism used for monitoring, controlling, and coordinating all managed objects within the Physical and Data Link Layers [1]. System management is active through Application Layer protocols and provides mechanisms for monitoring, controlling, and coordinating all managed objects within open systems. In this paper we include all the activities under system management in the generic term of network management.

Modern trends in network development, especially in the academic and governmental worlds, are to use a collective financial and personnel effort to build and maintain networks. In such cases, the notion of a standalone ISP tends to fade, being replaced by a group of specialists under multiple authorities who are supposed to implement different sets of rules and policies on different traffic and services.

To optimize network management for such cases, we present a model that has both centralized and distributed features. The model is based on the idea of Distributed Virtual Network Operation Centers (DVNOC). The structure of a DVNOC has roots in the distributed network management paradigm, but some of the distributed components are replaced by centralized ones plus a set of software packages that support the communication channels and the consistency of the different components.

Centralized models are less expensive to operate, but exhibit poor flexibility and need a long response time to provide consistent behavior. Distributed models have high operating costs. The DVNOC model tries to combine the advantages of both groups of strategies: it starts from a distributed structure and then moves as many components as possible to a centralized implementation, decreasing the operating costs as much as possible without significantly affecting the overall flexibility and efficiency.

According to OSI (FCAPS model [1]), there are five components involved in network management and three components used for service management.
We see the two classes of components as different dimensions of the DVNOC architecture implementation: network management covers the geographical dimension, and service management covers the services dimension. Every decision provided by our model is determined by two types of criteria: (a) local traffic and conditions observed by the NOC operators, and (b) type of service. This perspective, fundamental to the DVNOC structure, helps combine the consistency of the services offered to the network clients with the flexibility of adapting quickly to local traffic constraints.

The following components are used for network management [1]:
1. Configuration management - detects and controls the state of the network.
2. Performance management - controls and analyses throughput and error rate.
3. Fault management - is responsible for detecting, isolating and controlling abnormal behavior.
4. Accounting management - collects and processes data about resource consumption in the network.
5. Security management - deals with access control.

The components of the service management are:
1. Monitoring - involves gathering data about the network.
2. Control - manipulation of devices.
3. Reporting - abnormal events are reported.

Modern network management solutions must deal with all the components described above. The challenge consists in balancing the network management components between centralized and distributed approaches. As the DVNOC architecture and implementation are described in the next sections, we will keep track of these components and of their possible distributed or centralized character, or even the redundancy of some components, to balance between a clear view of the network status and the elements involved in network operation.

Section 2 describes the structure of the DVNOC architecture and the information flow within the management structure, and Section 3 proposes Open Source software that fits into the DVNOC framework.

2. The Distributed Virtual NOC Architecture

This Section studies the optimal architecture of a DVNOC, including the structural units, their responsibilities, and the relations between them.

Figure 1. The structural entities of a DVNOC (NMCU, NMEU, ESP, SSU, Help Desk, APMs, NOCs).

As shown in Figure 1, the DVNOC model implies a series of entities that work on top of the physical network infrastructure offered by the operators. These entities take care of the implementation of the network management components described in the previous Section.

The Network Management Coordinating Unit (NMCU) is the administrative management body that proposes and supervises the network policy, network development, and service implementation at the highest level.
Some of its main functions are:
- sets up the main network policies, including the network evolution and upgrades of the equipment and services;
- establishes relations and arranges services with External Service Providers (ESPs);
- performs the high level design of all services;
- decides about special solutions and services through the appropriate Special Solutions Units (SSUs);
- coordinates the activities of the Network Management Executive Unit (NMEU).

The Network Management Executive Unit (NMEU) is the supervising technical unit that implements the decisions and policies of NMCU. It has write access to the networking equipment and performs the following functions:
- is responsible for the technical integrity of the services provided on the network;
- implements new services using configuration solutions provided by SSUs;
- technically defines and modifies network policies;
- plans network development;
- operates a Help Desk which interacts with:
  o APMs;
  o ESPs, to provide fault isolation and management of the lines and/or services supervised by a different authority;
  o SSUs, during the testing period for new services.

The Special Solutions Units (SSUs) are specialized task teams distributed in the service dimension, i.e. one per service or class of related services (e.g. IPv6, VoIP, etc.). One advantage of this approach is that different solutions, plans, or service implementations can be outsourced. These teams have limited access to the networking equipment and have the following main functions:
- provide studies for services proposed by NMCU, specifying issues of interest for the network objectives and policies;
- provide configuration files for network equipment to implement the proposed services;
- interact with NMEU during service activation;
- report problems related to a service through the Help Desk;
- monitor service operation using network management tools during the implementation period.

The Access Port Managers (APMs) are geographically distributed teams (one for each NOC) responsible for the local NOC activities. Their main functions are to:
- monitor the network operation in their area of authority;
- configure the local communication equipment;
- monitor the implementation of the services within their NOCs;
- interact with NMEU to maintain the centralized management system;
- interact with the users at the NOC level.

Figure 1 shows the communication channels between the different units. The Network Management Coordinating Unit regularly communicates with the Network Management Executive Unit and the Special Solutions Units to guarantee efficient problem solving and network operation. NMCU is also responsible for high level interaction with the External Service Providers, dealing with issues such as ordering new communication capacity.

The Network Management Executive Unit is the technical core of the management team for the entire network. NMEU is the main node of communication between management entities, interacting directly with the APMs that support the network. NMEU operates the Help Desk and a Trouble Ticket System, which is the main communication channel to NMEU. If user level support must be offered, Help Desk representatives can be distributed to the NOCs and coordinated by APMs. The Help Desk, with centralized or distributed components, must be operated by qualified personnel who provide first level support and channel advanced requests to the appropriate authority via the trouble ticket system. The NMEU, through the Help Desk group, communicates with External Service Providers for fault management purposes and installation issues.

The Trouble Ticket System (TTS) must be unique in the entire management structure to provide a single, consistent image of faults and events.
At the same time, tickets related to different types of events should go to different queues, to separate activities and to route the right information to the right people.

The main advantage of the proposed framework is that all information flows through the NMEU, which gives a centralized character to the network operation. A distributed character is achieved in parallel through APMs and SSUs: APMs provide network management and user support within a geographical area of authority, while SSUs are responsible for the implementation of particular services on the entire network. Note that SSUs do not interact directly with APMs. Their interaction is handled by NMCU, which assures the consistency of all operations.

The next Section analyzes several implementation components of the DVNOC model.

3. Implementation. Open Source Software

The DVNOC model can be implemented for a wide range of cases where quasi-independent networks, offering different services and observing different local policies, must coexist and share hardware and human resources. Typical cases are:
- a national resource (e.g. a connection to an international research network) shared by joint regional networks;
- a campus network composed of departmental networks.

The general approach does not depend on the network topology and management structure, and even the implementation is mostly independent of the concrete conditions. In this Section we extract some common tasks, features, and tools that offer opportunities for centralized implementation of some network management components.

When distributed and centralized strategies are weighed against each other, we consider the amount and the type of traffic overhead produced by centralizing a management component to be an important issue.
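As a rough illustration of this overhead question, the sketch below estimates the average traffic generated by a single central poller. It is a back-of-the-envelope calculation, not a measurement: the function names, the device count, the number of polled values, the bytes per request/response exchange, and the link capacity are all hypothetical placeholders that would have to be replaced with figures observed on a real network.

```python
def polling_overhead_bps(devices, oids_per_device, bytes_per_exchange, interval_s):
    """Average management traffic, in bits per second, generated by one poller.

    Assumes one request/response exchange per polled value and that every
    device is polled once per interval (both are simplifying assumptions).
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    bytes_per_cycle = devices * oids_per_device * bytes_per_exchange
    return bytes_per_cycle * 8 / interval_s


def overhead_fraction(overhead_bps, link_capacity_bps):
    """Share of a link's capacity consumed by the management traffic."""
    if link_capacity_bps <= 0:
        raise ValueError("link_capacity_bps must be positive")
    return overhead_bps / link_capacity_bps


if __name__ == "__main__":
    # Hypothetical figures: 200 devices, 50 values each, 200 bytes per
    # exchange, polled every 5 minutes over a 2 Mbit/s access link.
    bps = polling_overhead_bps(200, 50, 200, 300)
    share = overhead_fraction(bps, 2_000_000)
    print(f"central poller: {bps:.0f} bit/s, {share:.1%} of the link")
```

An average of this kind hides bursts: if all devices are polled at the start of each interval, the instantaneous load on the link is much higher, which is one reason to measure rather than only estimate.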
Experimental determination of an upper limit for the ICMP traffic, and of the implications of the large amount of UDP traffic associated with SNMP, should be very useful for networks that are expected to operate close to maximum capacity most of the time. Other related issues are the operating environment of the network management software and the number of alarms generated when a section of the network is unreachable. The security of the transactions involved in the management of distributed network devices is also important: all traffic generated by management activities should be secured so that sensitive information cannot be spoofed or intercepted.

The first component of network management, configuration management, should be implemented in a manner that gives SSUs read-only access to the configuration files of the network equipment, and write access to NMCU and APMs. NMCU and SSUs should have access to all the equipment, and APMs should have access only to devices within their area of responsibility. To provide secure access to network devices, each NOC has to provide a secure channel for each managed device. This is done either through an encrypted connection (SSH access) directly to the device, or through a management UNIX workstation on the same secured LAN as the device. Access to the
management workstation is allowed only for NMEU staff and the local APM. Read-only access is used by SSUs and can provide a fast way to directly access devices for monitoring purposes.

Good tools for fast web based (read-only) access to the routers are fundamental for the efficiency of the SSUs. Such software should have the following features:
- user level access authorization;
- configuration file viewer;
- interface status and parameter viewer;
- IP routing table and/or single IP route viewer;
- routing protocol status viewer;
- simple debugging tools (ping and traceroute);
- router command line interface.

A good tool for this purpose is Looking Glass [4]. Looking Glass can be installed in a distributed manner on the network, the centralized element being the web server that provides a single interface for all managed devices. The transactions can be encrypted using the HTTPS protocol. Figure 2 shows an example of Looking Glass usage through the web interface.

Figure 2. Looking Glass.

The performance management component must be implemented hierarchically. This is necessary because the data needed to build reports on traffic values, error rates, CPU load, device temperature, etc., are usually obtained using SNMP from different devices. On a loaded network, a centralized implementation of the software used for performance management can lead to false alarms. In this respect, a distributed approach provides good results. The centralized component is achieved by using a single web based interface for the entire network.

Following the Open Source approach, there are many software tools that are used for traffic and error rate reporting. Useful software in this respect includes Cricket, based on MRTG/RRD [6], [7], and Weathermap [5]. Both of these solutions can be implemented centralized or distributed. By using a web interface, public and private access can be offered. Figure 3 shows an example of Cricket output.

Figure 3. Traffic monitoring with Cricket.
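Traffic graphs of this kind are typically derived from successive readings of SNMP octet counters (for example ifInOctets) taken at a fixed polling interval. The sketch below shows the arithmetic involved, assuming 32-bit counters and at most one counter wrap between two polls. The function names are hypothetical and the code is not taken from Cricket or MRTG; it only illustrates how a poller can turn two raw counter samples into a rate.

```python
COUNTER32_MODULUS = 2 ** 32  # a 32-bit SNMP counter wraps back to 0 at this value


def counter_delta(previous, current, modulus=COUNTER32_MODULUS):
    """Octets counted between two readings, allowing for a single wrap."""
    return (current - previous) % modulus


def rate_bps(previous_octets, current_octets, interval_s, modulus=COUNTER32_MODULUS):
    """Average throughput in bits per second over one polling interval."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return counter_delta(previous_octets, current_octets, modulus) * 8 / interval_s


if __name__ == "__main__":
    # Two hypothetical readings taken 300 seconds apart; the counter wrapped
    # between them, so the second reading is smaller than the first.
    print(rate_bps(4294967000, 704, 300))
```

On fast links a 32-bit counter can wrap more than once within a polling interval, in which case this calculation silently under-reports the rate; shorter intervals or 64-bit counters avoid the problem.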
Reports are available based on SNMP access to devices on the network. Transaction security for this component can be achieved using SNMPv3, a new SNMP protocol framework which is already available. The security component for SNMPv3 was proposed in RFC 2274 and is described by the User-based Security Model (USM). USM defines elements of procedure for providing SNMP message-level security, and is intended to protect against modification of information, masquerade, and disclosure. USM uses MD5 (Message Digest Algorithm) and the Secure Hash Algorithm (SHA) to provide data integrity, which directly protects against data modification attacks, indirectly provides data origin authentication, and defends against masquerade attacks. The Data Encryption Standard (DES) is used to protect against disclosure.

One of the most important components to be analyzed is fault management. It consists of three steps: identification of the problem, isolation, and correction. The first step is achieved by monitoring the network and looking for signatures of typical problems. If a signature is detected, a fault is reported (automatically or by the support personnel) to the Help Desk, which issues a trouble ticket. Depending on the importance of the problem, different entities could be required to take the appropriate decision and perform the isolation. Correction can be done either centralized or distributed
depending on the nature of the fault and the area of authority.

An important component that can be centralized is monitoring. Good monitoring is essential for fast fault isolation. Specialized tools are needed for:
- monitoring of hosts, routers, resources, and environment (SNMP);
- monitoring of network services (HTTP, SMTP, FTP).

Serious monitoring software should have as many as possible of the following features:
- contact notifications (email, pager, phone);
- ability to define event handlers for service and host events;
- ability to schedule downtime for suppressing host and service notifications;
- web interface for viewing current network status, notification and problem history, log file, etc.;
- support for user defined plug-ins to perform service checks;
- hierarchical user authorization for access to the web interface.

A good quality Open Source package that we tested and that offers the above features is Nagios [8]. An example of Nagios output is shown in Figure 4.

Figure 4. Tactical Overview screen of Nagios.

Accounting management is a component that, in most cases, uses significant network resources. A distributed approach is the best solution for this task. There are few options for accounting management using Open Source software, because of the strong dependence on the different types of equipment involved in the final accounting scheme. A reliable package, IPaccounting, is available from Istituto Nazionale di Fisica Nucleare, Italy. Other approaches based on traffic flow are available.

The implementation of network security management depends on the network structure and on the responsibility of each NOC to the local users for the offered services. There are two aspects involved in network security management: security of the network devices and security of the network services.
In consequence, network security management involves:
- a set of permissions that limits access to networking equipment by username/IP address;
- notification policies and action plans to counter security-related violations such as DoS attacks.

There is no generally valid solution. Network security management cannot be classified as centralized or distributed. A centralized view of the entire component can lead to better network policy enforcement, but a distributed implementation of the software that actually detects and blocks network attacks is more efficient. Accounting management and network security management typically use the same distributed/centralized scheme, and a common reliable solution is based on traffic flow analysis.

A very good tool for network security management is Snort, an Open Source network intrusion detection system. Snort is capable of real-time traffic analysis and packet logging on IP networks (www.snort.org). A web interface for Snort is available; it allows the results to be centralized at the top level of the network management while still using a distributed scheme. Snort uses a flexible rule-based language to describe traffic that should be collected or passed, as well as a detection engine based on a modular plug-in architecture. A real-time alerting capability is available. Other tools that deal directly with the network equipment (usually Cisco routers) are available. One such tool, available as Open Source, is under development by a RoEduNet team (http://zazu.iasi.roedu.net).

Finally, no centralized/distributed hybrid network management system can be implemented efficiently without a good trouble ticket system as the core of the Help Desk. The Help Desk is the main mechanism to efficiently centralize parts of the network management components. All the problems appearing on the network are gathered by the Help Desk, and trouble tickets are issued.
A trouble ticket should include the following information:
- the APM that reported the problem;
- the entity that should consider solving the problem (CNMSE and possibly some SIEs);
- a description of the problem.

The management entity charged with a trouble ticket will report to the Help Desk on the status of the ticket. A trouble ticket is considered to have OPEN status as long as the problem has not been solved.
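The ticket contents and status handling described in this section can be sketched as a small data structure. This is only an illustration under stated assumptions: the class, field, and method names are hypothetical, the example values are invented, and the sketch does not reflect the data model of any particular ticket system. A ticket starts as OPEN, collects status reports from the entity in charge, and is marked CLOSED once the problem is solved.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Status(Enum):
    OPEN = "OPEN"
    CLOSED = "CLOSED"


@dataclass
class TroubleTicket:
    reported_by: str        # the APM that reported the problem
    assigned_to: List[str]  # entities expected to solve the problem
    description: str        # description of the problem
    status: Status = Status.OPEN
    updates: List[str] = field(default_factory=list)

    def add_update(self, note: str) -> None:
        """Record a status report sent to the Help Desk."""
        self.updates.append(note)

    def close(self, note: str) -> None:
        """Record the final report and mark the problem as solved."""
        self.add_update(note)
        self.status = Status.CLOSED


if __name__ == "__main__":
    # Hypothetical example values.
    ticket = TroubleTicket("APM of NOC A", ["NMEU"], "Backbone link down")
    ticket.add_update("Fault isolated to the provider line")
    ticket.close("Line restored by the provider")
    print(ticket.status.value, len(ticket.updates))
```

Keeping the update history on the ticket itself is a design choice made for brevity here; it gives the Help Desk one place to read what has been done so far on every OPEN ticket.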
When the problem is solved, the trouble ticket becomes CLOSED. For all trouble tickets that are OPEN, the Help Desk will send regular updates describing the actions that have been performed, as well as what remains to be done. To obtain this information, the Help Desk will regularly communicate with all involved parties (NMEU, SSUs and APMs).

The most useful features of a good ticket system are:
- web-based interface with user level authentication;
- support for multiple queues (administrative, technical, etc.);
- interface for ticket submission and operation via e-mail;
- granular user access control (requestor, watcher, admin, owner, etc.);
- SQL database storage system;
- hierarchical ticket linking system (parent-child relationships);
- customizable templates for system messages.

We had a good experience with Request Tracker (http://www.bestpractical.com/rt/), which provides all the above features.

4. Conclusions

We have proposed the DVNOC framework, based on a combined centralized/distributed approach to the functions that a network management infrastructure must fulfill. This framework establishes the responsibilities of each unit involved in the management of a network structure with branches spread over a large geographical area and offering services to a number of different institutions. The DVNOC model for network management offers good opportunities to optimize both the performance and the operating costs of multiple networks using the same communication infrastructure. Due to the precise split of functions among different groups, and to the optimization of communication channels, a DVNOC architecture can be implemented using a mix of distributed and centralized strategies. To help realize such a mix, several free software packages were tested by the authors and are recommended. An important advantage of this approach should be emphasized: the operation of NOCs, and even the service implementation procedures, are distributed and can be outsourced.
We recommend the implementation of such a model for the management of fluid network structures, such as research and governmental networks, which have fluctuating operating budgets provided by different sources and offer an ever changing set of services to communities with heterogeneous resources.

References

[1] Udupa, Divakara K., Network Management System Essentials, McGraw-Hill, U.S.A., 1996.
[2] Udupa, Divakara K., TMN-Telecommunications Management Network, McGraw-Hill, U.S.A., 1999.
[3] Stallings W., SNMP and SNMPv2: The Infrastructure for Network Management, IEEE Communications Magazine, March 1998.
[4] http://www.version6.net/
[5] http://www.indiana.edu/
[6] http://cricket.sourceforge.net
[7] http://people.ee.ethz.ch/~oetiker/webtools
[8] http://www.nagios.org