Estelar. Chapter 5. Accounting & Performance Management




Chapter 5 Accounting & Performance Management

During the last decade, the Internet has changed the way we communicate more than anything else. The Internet is almost ubiquitous today, and we take connectivity for granted until, for some reason, we cannot connect. These days we expect Internet connectivity to be available anytime, anywhere. Most of us realize that this is impossible without intelligent systems managing the network. This leads us to the technologies, processes, and applications of Network Management Systems and Operations Support Systems (NMS-OSS). NMS was a set of niche applications for quite some time, until businesses realized that their performance depended on the network. Suddenly, network downtime became a business issue instead of just a minor problem, and notions such as service level agreements (SLAs) were imposed on the network to support specific business application requirements.

Accounting describes the process of gathering usage data records at network devices and exporting those records to a collection server, where processing takes place. The records are then presented to the user or provided to another application.

The FCAPS model is an international standard defined by the International Telecommunication Union (ITU) that describes the various network management areas. An advantage of the FCAPS model is that it clearly distinguishes between accounting and performance.

Management Functional Area (MFA)   Function Set Groups
Accounting                         Usage measurement, collection, aggregation, and mediation; tariff and pricing
Performance                        Performance monitoring and control; performance analysis and trending; quality assurance

Table 5.1 ITU-T FCAPS Model for Accounting & Performance Management

An example of accounting is the collection of usage records to identify security attacks based on specific traffic patterns, or to measure which applications consume the most bandwidth in the network. Historically, accounting was considered solely the task of collecting usage data, preprocessing it, and feeding it into a billing application. Service providers usually developed their own accounting and billing applications, and most enterprises were not interested in accounting information. With the introduction of data networks and the Internet Protocol (IP) becoming ubiquitous, the billing paradigm changed quickly. Internet access was billed only on access time, and services on the Internet were offered free of charge. Over time, accounting in the IP world was almost forgotten, even by network management experts. This was exacerbated by the roots of accounting, which was considered no more than a billing component, and this in turn increased the isolation of accounting.

Although collecting interface counters is quite simple, mediation and correlation of large volumes of accounting records for a billing application can be difficult. It requires detailed knowledge of the underlying network architecture and technology, because collecting usage records from a legacy voice network is a completely different task than collecting usage records in data networks. The trend toward IP as the single communication protocol will certainly reduce this complexity in the future. It is therefore important to understand the different accounting techniques and to identify the various sources in the network for generating usage data records. Network operators have realized that collected accounting data records are not limited to billing applications. They can also be used as input for other applications, such as performance monitoring, checking that a configuration change fixed a problem, or even security analysis.
This is in reality a paradigm change, because suddenly the "A" part of the FCAPS model can be used in conjunction with Fault, Performance, Security, and even Configuration. For example, if the administrator has configured the network so that business-critical data should go via one path and best-effort traffic should take another, accounting can verify that this policy is applied and otherwise notify the fault and configuration tools. The previous "stealth area" of accounting now becomes a major building block for network and application design and deployment. This is the reason for the increasing interest in accounting technologies. Performance applications combine active and passive monitoring techniques to provide information that is more accurate.

The described flexibility is probably the biggest advantage of collecting accounting information. If a network architect designs the framework correctly, accounting data can be collected once and used as input for various applications. Figure 5.1 illustrates a three-tier accounting architecture. This also relates to the FCAPS model that was chosen to structure the various network management areas. By identifying the possible usage scenarios, accounting becomes an integral part of the NMS.

Fig 5.1 Generic Three-tier Accounting Infrastructure

ITU-T definition (M.3400 and X.700, Definitions of the OSI Accounting Management Responsibilities): "Accounting management enables charges to be established for the use of resources in the OSIE [Open Systems Interconnection Environment], and for costs to be identified for the use of those resources. Accounting management includes functions to: inform users of costs incurred or resources consumed; enable accounting limits to be set and tariff schedules to be associated with the use of resources; enable costs to be combined where multiple resources are invoked to achieve a given communication objective."
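As a toy sketch of the generic three-tier accounting infrastructure (device-level metering, a collection server that aggregates and mediates, and a consuming application such as billing), the following Python fragment may help; the record fields, addresses, and tariff are all invented for illustration and do not correspond to any particular product:

```python
from collections import defaultdict

def meter_emit():
    """Tier 1 (device): raw usage records as a meter might export them.
    The fields and values here are invented for illustration."""
    return [
        {"src": "10.0.0.1", "app": "web",  "bytes": 1200},
        {"src": "10.0.0.2", "app": "web",  "bytes": 800},
        {"src": "10.0.0.1", "app": "voip", "bytes": 300},
    ]

def collect_and_aggregate(records):
    """Tier 2 (collection server): aggregate raw records per application."""
    totals = defaultdict(int)
    for rec in records:
        totals[rec["app"]] += rec["bytes"]
    return dict(totals)

def billing_app(aggregated, rate_per_kb=0.01):
    """Tier 3 (application): one possible consumer, a trivial billing step.
    The tariff is an assumption, not a real pricing scheme."""
    return {app: round(b / 1024 * rate_per_kb, 6) for app, b in aggregated.items()}

aggregated = collect_and_aggregate(meter_emit())
print(aggregated)              # {'web': 2000, 'voip': 300}
print(billing_app(aggregated))
```

The same aggregated output could equally feed a performance or security application, which is the "collect once, use many times" point made above.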

Fig 5.2 Accounting Management Architecture

The figure above illustrates the use of accounting management for multiple applications. Notice the functions of the different layers and the distinction between record generation and processing (such as data collection, exporting, and aggregation) and the applications that use the records.

Purposes of Accounting

As defined previously, the focus of accounting is to track the usage of network resources and traffic characteristics. The following sections identify various accounting scenarios:

a) Network Monitoring: This is a broad expression that covers multiple functions. Network monitoring applications enable a system administrator to monitor a network for the purposes of security, billing, and analysis (both live and offline). The accounting collection process at the device level gathers usage records of network resources. These records consist of information such as interface utilization, traffic details per application and user (e.g. percentage of web traffic), real-time traffic, and network management traffic. They may include details such as the originator and recipient of a communication. A network monitoring solution can provide the following details for performance monitoring:
- Device performance monitoring: interface and subinterface utilization, per-class-of-service utilization, traffic per application
- Network performance monitoring: communication patterns in the network, path utilization between devices in the network
- Service performance monitoring: traffic per server, traffic per service, traffic per application
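Device performance monitoring of the kind listed above often starts from plain interface counters. The following sketch computes percent utilization from two readings of an ifInOctets-style 32-bit counter; the function name, polling interval, and link speed are assumptions for illustration:

```python
def interface_utilization(octets_t0, octets_t1, interval_s, if_speed_bps,
                          counter_max=2**32):
    """Percent link utilization from two readings of an SNMP-style octet
    counter. The modulo tolerates one 32-bit counter wrap between polls,
    which a collector polling MIB-II counters has to handle."""
    delta = (octets_t1 - octets_t0) % counter_max
    bits = delta * 8
    return 100.0 * bits / (interval_s * if_speed_bps)

# 3,750,000 octets in 300 s on a 1 Mbit/s link -> 10% utilization
print(interface_utilization(0, 3_750_000, 300, 1_000_000))   # -> 10.0
```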

b) User Monitoring and Profiling: The trend of running mission-critical applications on the network is evident. Voice over IP (VoIP), virtual private networking (VPN), and videoconferencing are increasingly being run over the network. At the same time, people use the network to download movies, listen to music online, surf excessively, and so on. This information can be used to monitor and profile users; track network usage per user; document usage trends by user, group, and department; identify opportunities to sell additional value-added services to targeted customers; and build a traffic matrix per subdivision, group, or even user.

c) Application Monitoring and Profiling: With the growth of technologies such as VoIP/IP telephony, video, data warehousing, sales force automation, customer relationship management, call centers, procurement, and human resources management, network management systems are required that allow us to identify traffic per application. Because of these changes, we need a new methodology to collect application-specific details, and accounting is the chosen technology. The collected accounting information can help us to do the following:
- Monitor and profile applications in the entire network or over specific expensive links
- Monitor application usage per group or individual user
- Deploy QoS and assign applications to different classes of service
- Assemble a traffic matrix based on application usage

d) Capacity Planning: Internet traffic increases on a daily basis. Different studies produce different estimates of how long it takes traffic to double, but they let us predict that today's network designs will not be able to carry the traffic five years from now. Broadband adoption is one major driver, as is the Internet's almost ubiquitous availability. Capacity planning can be considered from the link point of view or from the network-wide point of view.
Each view requires a completely different set of collection parameters and mechanisms.
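From the link point of view, one simple capacity-planning technique is to fit a linear trend to historical utilization and extrapolate to the link's capacity. The sketch below shows the idea in pure Python; the sample history and capacity figure are invented:

```python
def months_until_capacity(samples, capacity):
    """Least-squares linear trend over (month, utilization) samples.
    Returns months from the last sample until the fitted line reaches
    `capacity`, or None if the trend is flat or shrinking."""
    n = len(samples)
    xs = [x for x, _ in samples]
    ys = [y for _, y in samples]
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in samples)
             / sum((x - mx) ** 2 for x in xs))
    if slope <= 0:
        return None
    intercept = my - slope * mx
    hit = (capacity - intercept) / slope   # month at which line hits capacity
    return hit - xs[-1]

# invented history: ~50 Mbit/s growth per month on a 400 Mbit/s link
history = [(0, 100), (1, 150), (2, 200), (3, 250)]
print(months_until_capacity(history, 400))   # -> 3.0
```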

e) Traffic Profiling: A good analogy for understanding traffic engineering is examining vehicle traffic patterns. There are several highways in the area where we live, so we have several options to get to the airport or the office. Based on the day of the week and the time of day, we choose a different road to our destination. The analogy carries over to network architects modeling data traffic on the network for traffic engineering: accounting data would be the traffic information, and an exceeded link-utilization threshold would be the equivalent of a radio traffic report about a traffic jam.

f) Security Analysis: Operators of enterprise networks as well as service providers are increasingly confronted with network disruptions due to a wide variety of security threats and malicious service abuse. Security attacks are becoming a scourge for companies as well as individuals. Fortunately, the same accounting technologies that are used to collect granular information on the packets traversing the network can also be used for security monitoring. While attacks are taking place, accounting technologies can be leveraged to detect unusual situations or suspicious flows and to alarm a network operations center (NOC) as soon as traffic patterns from security attacks such as smurf, fraggle, or SYN floods are detected. In a second step, the data records can be used for root-cause analysis to reduce the risk of future attacks.

Performance Management

ITU-T definition (M.3400 and X.700, Definitions of the OSI Performance Management Responsibilities): "Performance Management provides functions to evaluate and report upon the behavior of telecommunication equipment and the effectiveness of the network or network element.
Its role is to gather and analyze statistical data for the purpose of monitoring and correcting the behavior and effectiveness of the network, network elements, or other equipment and to aid in planning, provisioning, maintenance and the measurement of quality."

Performance management includes functions to:
a) gather statistical information
b) maintain and examine logs of system state histories
c) determine system performance under natural and artificial conditions
d) alter system modes of operation for the purpose of conducting performance management activities

Fig. 5.3 Performance Management Architecture

Purposes of Performance Management

This section is dedicated to performance management. It identifies various scenarios in which performance data can help manage the network more effectively. As mentioned before, the term "performance monitoring" can be interpreted widely. Our definition covers three aspects: the device, the network, and the service.

1. Device Performance Monitoring: The most obvious area of performance monitoring is directly related to the overall network and the individual devices within it. As the initial step, network monitoring is done for device and link availability. Availability is the measure of time for which the network is available to a user, so it represents the reliability of network components. Another description says availability is the probability that an item of the network is operational at any point in time.

A common formula is:

Availability = MTBF / (MTBF + MTTR)

where MTBF is the mean time between failures, describing the time between two consecutive failures, and MTTR is the mean time to repair after a failure has occurred. Availability is usually expressed as a percentage, such as 99 percent or 99.99999 percent. Most network operators try to achieve availability between 99.9 percent and 99.99 percent. From a technical perspective, it is possible to increase this even further, but the price is so high that a solid business case is required as justification. Trading floors and banking applications are examples of high-availability requirements; the average e-mail server is certainly not.

2. Network Element Performance Monitoring: From a device perspective, we are mainly interested in device "health" data, such as overall throughput, per-(sub)interface utilization, response time, CPU load, memory consumption, errors, and so forth. Details about network element performance, such as interface utilization and errors, are provided by the various MIBs, such as MIB-II (RFC 1213), the Interfaces Group MIB (RFC 2863), and the TCP-MIB (RFC 2012).

3. System and Server Performance Monitoring: The low-level and operating system functions need to be checked constantly to identify performance issues immediately. In addition, we should also monitor the specific services running on the server.

Relationship between Accounting and Performance Management

There is a strong relationship between performance and accounting, which is reflected in the standard definitions above. Both collect usage information that can afterward be applied to similar applications, such as monitoring, baselining, security analysis, and so on. Accounting and performance monitoring are important sources not only for performance management but also for security management; this is another common area.
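The availability formula can be exercised with a small script; the MTBF/MTTR figures below are invented for illustration:

```python
def availability(mtbf_hours, mttr_hours):
    """Availability = MTBF / (MTBF + MTTR), returned as a percentage."""
    return 100.0 * mtbf_hours / (mtbf_hours + mttr_hours)

def downtime_per_year(avail_percent):
    """Expected downtime in hours over a 365-day year."""
    return (1 - avail_percent / 100.0) * 365 * 24

# a device failing every 999 hours and repaired in 1 hour is "three nines"
a = availability(999, 1)
print(round(a, 3))                       # 99.9
print(round(downtime_per_year(a), 2))    # 8.76 hours of downtime per year
```

The downtime figure makes the business-case point above concrete: going from 99.9 to 99.99 percent cuts expected downtime roughly tenfold, which only some applications can justify paying for.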
A security management application can import the collected traffic information and analyze the different types of protocols, traffic patterns between source and destination, and so on.
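Such an analysis can be as simple as breaking the imported flow records down by protocol. A minimal sketch, with invented flow records:

```python
from collections import Counter

def protocol_mix(flow_records):
    """Percentage of traffic per protocol, as a security management
    application importing accounting data might compute it."""
    byte_totals = Counter()
    for rec in flow_records:
        byte_totals[rec["proto"]] += rec["bytes"]
    total = sum(byte_totals.values())
    return {p: round(100.0 * b / total, 1) for p, b in byte_totals.items()}

flows = [
    {"proto": "tcp",  "src": "10.0.0.1", "dst": "10.0.0.9", "bytes": 7000},
    {"proto": "udp",  "src": "10.0.0.2", "dst": "10.0.0.9", "bytes": 2000},
    {"proto": "icmp", "src": "10.0.0.3", "dst": "10.0.0.9", "bytes": 1000},
]
print(protocol_mix(flows))   # {'tcp': 70.0, 'udp': 20.0, 'icmp': 10.0}
```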

A comparison of current data sets against a defined baseline can identify abnormal situations or new traffic patterns. In addition, the same application can collect device performance monitoring details, such as high CPU utilization. The combination of the two areas can be a strong instrument for identifying security attacks almost in real time. The security example is perfectly suited to explaining the benefits of performance monitoring and accounting: each symptom by itself (abnormal traffic or high CPU utilization) might not be critical, but their combination could indicate a critical situation in the network.

From a network perspective, performance takes into account details such as network load, device load, throughput, link capacity, different traffic classes, dropped packets, and congestion, while accounting addresses usage data collection. The collection interval can be considered a separating factor between accounting and performance monitoring. A data collection process for performance analysis should notify the administrator immediately if thresholds are exceeded; therefore, we need (almost) real-time collection in this case. Accounting data collection for billing does not have real-time requirements, except for scenarios such as prepaid billing. History is certainly another differentiator: an accounting collection for billing purposes does not need to keep historical data sets, because the billing application does this, whereas performance management needs historical data to analyze deviations from normal as well as for trending functions. Monitoring device utilization is yet another difference between the two areas: device health monitoring is a crucial component of performance monitoring, whereas accounting management is interested in usage records.

A fundamental difference between monitoring approaches is active versus passive monitoring. Accounting management is always passive; performance monitoring can be passive or active.
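The combined check described above, alarming only when both the traffic pattern and the device health metric deviate from their baselines, can be sketched as follows; the baseline means and standard deviations are invented for illustration:

```python
def deviates(current, baseline_mean, baseline_std, k=3.0):
    """Flag a measurement more than k standard deviations from baseline."""
    return abs(current - baseline_mean) > k * baseline_std

def security_alarm(traffic_pps, cpu_pct,
                   traffic_base=(1000, 100), cpu_base=(30, 5)):
    """Alarm only when BOTH symptoms deviate, mirroring the text: each
    alone may be harmless, together they suggest an attack. The
    (mean, std) baselines are illustrative assumptions."""
    return (deviates(traffic_pps, *traffic_base)
            and deviates(cpu_pct, *cpu_base))

print(security_alarm(1200, 32))   # traffic slightly up, CPU normal -> False
print(security_alarm(5000, 95))   # flood-like traffic AND pegged CPU -> True
```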
Passive monitoring gathers performance data by implementing meters. Examples range from simple interface counters to dedicated appliances such as a Remote Monitoring (RMON) probe. Passive measurement needs to monitor some or all packets that are destined for a device; it is called sampling if only a subset of packets is inspected, as opposed to a full collection. In some scenarios, such as measuring response time for bidirectional communications, implementing passive measurement can become complex, because the request and response packets need to be correlated. The advantage of passive monitoring is that it does not interfere with the traffic in the network, so the measurement does not bias the results. This benefit can also be a limitation, because network activity is a prerequisite for passive measurement. The active monitoring approach injects synthetic traffic into the network to measure performance metrics such as availability, response time, network round-trip time, latency, jitter, reordering, packet loss, and so on. The simplicity of active measurement increases scalability, because only the generated traffic needs to be analyzed.

Accounting & Performance Management in Cluster Computing

Cluster computing has recently become an attractive paradigm for solving many computation-intensive applications. Clusters are formed by exploiting the existing computing resources on the network to work together as a single system. They thus eliminate the need for supercomputers by providing a better price-performance ratio and fault tolerance compared to traditional mainframes or supercomputers. The ubiquity and explosive growth of the World Wide Web have contributed immensely to the increased demand for high-performance clusters and distributed resources. Such distributed systems are expected to cater to an ever-increasing demand for rapid response time and high throughput for client requests occurring at any time. As information and its location increasingly determine the performance perceived by each individual user, the performance evaluation and quality of service of such systems have become a major focus of research. Distributing computing power in a cluster consisting of a network of heterogeneous computing devices is a very complex task, and it becomes even more complicated when heterogeneous devices with different applications take part in it.
The key issues in cluster computing that must be dealt with before defining a platform for building and deploying a parallel computing paradigm are the following:

1. Heterogeneity in architecture and operating systems: Although it is reasonable to assume that a new, stand-alone cluster system may be configured with a set of homogeneous nodes, there is a strong likelihood that upgraded clusters or networked clusters will have nodes with heterogeneous operating systems and architectures. Operating system heterogeneity can be handled through distributed operating systems. However, it is non-trivial to handle architectural heterogeneity, since the executable files are not compatible across architectures.

2. Difference in computing capability and memory availability: As each host may have different capabilities (such as memory) and different processing power, it is essential to allocate tasks to the nodes based on their capabilities and processing power.

3. Asymmetry in connectivity: Traditional cluster computing models do not face the problem of heterogeneity in network connections, because the entire set of workstations participating in the cluster is connected only by the wired network. With the advent of wireless technology, however, wireless networks also take part in cluster computing; they deliver much lower bandwidth than wired networks and have higher error rates. Moreover, mobile devices are characterized by high variation in network bandwidth, which can shift by one to four orders of magnitude depending on whether the host is static or mobile and on the type of connection used in its current cell. Thus, the cluster computing model must be able to distinguish among the types of connectivity and provide the flexibility to vary the grain size of tasks to account for variations in bandwidth. Such systems are suitable only for coarse-grained parallelism because of the communication overhead.

4. Variation of load on the participating nodes: When workstations are used to execute parallel applications, the concept of ownership is frequently present. Workstation owners do not want their machines to be overloaded by the execution of parallel applications, or they may want exclusive access to their machines when they are working. Reconfiguration mechanisms are thus required to balance the load among the nodes and to allow parallel computations to coexist with other applications. To overcome these problems, dynamic load balancing mechanisms are needed.
There are differences in load among the nodes because of the multiuser environment and because applications run on a heterogeneous cluster. In these cases, it is important to balance the load among the nodes to achieve sufficient performance. Since static load balancing techniques would be insufficient, dynamic load balancing techniques based on runtime load information are essential. It would be difficult for a programmer to perform load balancing explicitly for each environment and application, so automatic adaptation by the underlying runtime is indispensable. The problem is aggravated when mobile devices are part of the cluster.

5. Timeliness: Timeliness is particularly important in real-time systems and refers to the delay taken for a mobile device to regain its full state when it moves from one cell to another or re-enters a coverage area after disconnection. Whenever a mobile host moves from one cell to another, a handoff takes place to ensure that the data structures related to the mobile host are also moved to the new connecting point (MSS). This involves the exchange of several registration messages, which may cause some delay; the handoff should be fast enough to avoid loss of message delivery. In addition, a mobile host could move out of coverage after accepting a task for execution. These issues need to be addressed in the mobile cluster model.

6. Disconnectivity: Periods of disconnectivity of nodes in static networks are usually treated as faults. In the context of mobile nodes, however, disconnectivity may be due to roaming. Such faults should be eliminated as quickly as possible, because they have a negative impact on the efficiency of the cluster.

For the reasons above, accounting & performance management of systems in which heterogeneous servers and resources are distributed becomes a key factor in achieving rapid response time, efficient resource utilization, and consequently higher performance. The most difficult problem in a cluster environment is the performance degradation caused by high load disparity and the need to achieve minimum response time for client requests. Different approaches have therefore been adopted to solve these intricacies. One of them is mobile agent technology for the management of cluster computing.
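The static, capability-aware allocation (issue 2) and the threshold-driven dynamic rebalancing (issue 4) discussed above can be sketched in a few lines of Python; the node names, capacity figures, and migration policy are all illustrative assumptions:

```python
def allocate(tasks, capacities):
    """Static, capability-aware allocation: split `tasks` units of work
    across nodes in proportion to a relative capacity figure (a crude
    stand-in for CPU speed and available memory)."""
    total = sum(capacities.values())
    shares = {n: tasks * c // total for n, c in capacities.items()}
    leftover = tasks - sum(shares.values())
    # hand integer remainders to the most capable nodes
    for n in sorted(capacities, key=capacities.get, reverse=True)[:leftover]:
        shares[n] += 1
    return shares

def rebalance(loads, threshold=1.25):
    """One round of dynamic load balancing based on runtime load
    information: any node above threshold x average sheds half its
    excess to the currently least-loaded node."""
    loads = dict(loads)
    avg = sum(loads.values()) / len(loads)
    for node in sorted(loads, key=loads.get, reverse=True):
        if loads[node] > threshold * avg:
            target = min(loads, key=loads.get)
            moved = (loads[node] - avg) / 2
            loads[node] -= moved
            loads[target] += moved
    return loads

print(allocate(100, {"fast": 4, "mid": 2, "mobile": 1}))
# {'fast': 58, 'mid': 28, 'mobile': 14}
print(rebalance({"n1": 90, "n2": 10, "n3": 20}))
# {'n1': 65.0, 'n2': 35.0, 'n3': 20}
```

A real runtime would, of course, refresh the load figures continuously and account for migration cost; this sketch only shows the two policies side by side.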
Mobile Agent based Management in Cluster Computing

Mobile agent technology plays an important role in the management of cluster computing environments. Mobile agents are autonomous programs that travel from cluster to cluster under their own control; they are not tied to the system where they start their execution. After being created at one cluster, a mobile agent can carry its state and code to another cluster, where its execution can be restarted or continued. By interacting with a cluster after migrating to it, an agent can perform complex operations on data without transferring them, directly control the equipment of the visited cluster, and dynamically deploy software at the clusters, because the agent can bring the application logic to where it is needed and carry only the relevant data rather than the whole set of data observed in the clusters. Mobile agent-based management systems should therefore offer various agents specialized for particular tasks, rather than a few general-purpose agents for various tasks, and should select the most suitable agents to perform them. For the same reason, mobile agents should be statically optimized for their target networks, because both the cost of dynamically discovering an efficient itinerary and the resulting programs tend to be large.

The mobile agent approach is gaining momentum in the field of distributed computing and is useful for managing cluster computing systems, because mobile agents bring several advantages compared to traditional client-server solutions: they can reduce traffic in the network, provide more scalability, allow disconnected computing, and provide flexibility in the development and maintenance of distributed architectures. Several researchers have attempted to apply the technology to the management of cluster and Grid computing systems. The most complicated issue in a cluster environment is the performance degradation, and thus poor response time for client requests, caused by disproportionate load distribution among the servers in the cluster. Load balancing is therefore indispensable in a heterogeneous cluster to ensure a fair distribution of the workload across the servers.
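A toy model of such an agent, carrying its accumulated state from cluster to cluster and shipping only summaries rather than raw usage records, might look as follows; the clusters, records, and migration mechanics are invented for illustration:

```python
class UsageAgent:
    """Toy mobile agent: aggregates usage data locally at each cluster it
    visits, so only summaries (not raw records) cross the network."""
    def __init__(self, state=None):
        # the serializable state the agent carries as it migrates
        self.state = state or {"totals": {}}

    def visit(self, cluster, local_records):
        # runs AT the visited cluster: aggregate locally, keep the summary
        self.state["totals"][cluster] = sum(local_records)

    def migrate(self):
        """Return the bundle that would be shipped to the next cluster;
        a real agent system would also ship the agent's code."""
        return dict(self.state)

# invented per-cluster usage records
clusters = {"east": [10, 20], "west": [5, 5, 5]}
agent = UsageAgent()
for name, records in clusters.items():
    agent.visit(name, records)
    agent = UsageAgent(agent.migrate())   # "resume" at the next cluster
print(agent.state["totals"])   # {'east': 30, 'west': 15}
```

The point of the sketch is the traffic reduction argument from the text: what migrates is the small `state` bundle, never the raw record sets observed at each cluster.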
While static and dynamic load balancing for homogeneous parallel computing platforms has been well studied for more than a decade, load balancing for heterogeneous parallel systems is a relatively new and less explored subject of investigation. On a heterogeneous platform the goal is the same: to minimize idle processor time and to lower the wall-clock execution time. There are various approaches to implementing load balancing in a distributed heterogeneous server environment. The taxonomy in [2] classifies load-balancing approaches into four categories: client-based, DNS-based, dispatcher-based, and server-based. All of these approaches broadly implement load sharing algorithms that can be static or dynamic and can use either centralized or distributed control [3, 4, 5]. Reference [6] shows that a hybrid of static and dynamic strategies for server selection provides good performance. A client-based approach implements server selection on the client side [7]. The clients can select one of the servers at random (this strategy is also known as source initiated), but random selection cannot guarantee load balancing or server availability. The destination initiated strategy, on the other hand, requires a server to look for client requests from the overloaded servers [8]. In a DNS-based approach the DNS server becomes a bottleneck that limits throughput and restricts performance [9]. A dispatcher-based approach performs address mapping at the network address level; it may implement either packet rewriting [10], which incurs the overhead of address rewriting [11], or HTTP redirection, which introduces higher overhead than network-level load balancing and leads to deteriorated performance. SUNSCALAR [12] provides load balancing using a dispatcher. The server-based approach uses a decentralized load reallocation strategy in which all servers participate in the load balancing process. The mod_backhand module [13, 14] provides a server-based solution for the Apache web server that allows seamless reallocation of HTTP requests from heavily loaded servers to under-utilized servers on a per-request basis. The load balancing approaches for distributed servers mentioned thus far involve frequent message exchanges between servers and clients to share load information. Mobile agent enabled load balancing can overcome these problems [15] to a large extent by incorporating optimized migration strategies [16] and coordination mechanisms among agents for information exchange [17]. The agents act as coordinators on behalf of the servers and present themselves at remote sites to coordinate information synchronization [18]. Obeloer presented FLASH (Flexible Agent System for Heterogeneous Cluster), which uses a mobile agent based approach for load balancing [19].
Mobile agents have the capability to travel through a heterogeneous cluster and to execute jobs on the visited nodes, and they have been used to support load balancing in parallel and distributed computing [20]. For example, TRAVELER is a Java-based mobile agent infrastructure supporting wide-area parallel computing [21]; it allows users to dispatch jobs as mobile agents via a resource broker. MESSENGERS [22] is a system for general-purpose distributed computing based on mobile agents; it supports load balancing and dynamic resource utilization.

Load balancing using mobile agents holds appeal because in message passing based approaches the nodes have to exchange server load messages periodically in order to make load balancing decisions, e.g., mod_backhand for the Apache web server. The frequent message exchanges result in high communication latency and hence deteriorate the performance of the system. A mobile agent, by contrast, migrates to the remote site and interacts locally with the destination host, thereby significantly reducing remote message exchanges. Such agents also execute asynchronously and autonomously, so the agent's creator is free after the agent's creation and dispatch. Even if the original host is disconnected from the network for some time, the agent can execute the task given by the original host on some remote destination; the technology thus also supports disconnected computing. Mobile agents therefore become an effective alternative for most network service based applications, for many reasons, including improvements in the latency and bandwidth of cluster applications and reduced vulnerability to network disconnection. The integration of a genetic algorithm with the sender initiated approach for server selection using mobile agents is a promising avenue for efficient load balancing and effective resource utilization in distributed systems.
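The message-overhead argument above can be made concrete with a rough message-count model. The formulas and the numbers plugged in below are illustrative assumptions, not measurements from this chapter:

```python
def polling_messages(servers, polls_per_minute, minutes):
    """Periodic load exchange: every server polls every other server each round."""
    return servers * (servers - 1) * polls_per_minute * minutes

def agent_messages(migrations):
    """Mobile agent style: roughly one migration plus one result report per rebalancing."""
    return 2 * migrations

# Ten servers exchanging load twice a minute for an hour,
# versus an assumed ~30 agent migrations over the same hour.
print(polling_messages(10, 2, 60))   # prints 10800
print(agent_messages(30))            # prints 60
```

Even this crude model shows why moving the decision logic to the data (the agent) rather than the data to the decision logic (periodic polling) reduces remote message traffic by orders of magnitude.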

Evolutionary Mobile Agent System Architecture

The proposed framework for achieving dynamic load sharing in a cluster using mobile agents is shown in Fig. 5.4. The framework is implemented on a pool of servers, each running EMAS, that together form the server cluster. The system architecture comprises a set of intelligent agents, together with their associated policies, which coordinate with each other to create a robust and reliable framework for the development and execution of intelligent mobile agents for optimal load balancing.

Fig 5.4: System architecture of EMAS in Cluster Computing

Each agent executes a predefined policy and cooperates with the others to share valuable information and update the server load information on each visited node. The agent pool consists of various agents, each with its own role. The functionality of the various agents is briefly described as follows.

Only one Directory Agent is required per cluster. The main job of the Directory Agent is to administer the catalog of all computers that take part in the cluster. When a node needs to join or leave the cluster it has to report to the Directory Agent. The Threshold Agent can ask the Directory Agent for a list of the other nodes in the cluster, so as to decide to which node it should migrate when the load gets too high. The Directory Agent does not return the complete list of nodes; instead it sends the information of the node whose workload is the least in the list. This only happens when the total number of nodes is greater than a predefined threshold; otherwise the complete list is returned. The Directory Agent queries all the available nodes from the list to get the updated status of the nodes taking part in the cluster.

Each node has one Monitor Agent. This agent keeps track of the load status of the system resources, and if the node runs out of resources it informs the Threshold Agent. Currently the only resource monitored is the processor load, which is checked at regular intervals. The processor-load level at which the Monitor Agent informs the Threshold Agent can be set at start-up time. In this way the Threshold Agent can control the load of the node: it can decide to host Client Agents (the agents carrying the load to be executed) when the system is idle and to refuse them when the host is occupied by executable workload. The load state of a node is checked by the Threshold Agent and represented by a single value (0/1), where '0' means overloaded and '1' means underloaded. The Monitor Agent does not inform the Threshold Agent every time it checks the load of the system; it first takes several samples of the load and sends their average to the Threshold Agent. This way the system can handle fast and short fluctuations in the load.
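The Monitor Agent's sample-and-average behaviour can be sketched as follows. The class name, the sample count, and the callback are illustrative choices, not part of the original EMAS implementation:

```python
class MonitorAgent:
    """Averages several CPU-load samples before notifying the Threshold Agent,
    so that short load spikes do not trigger unnecessary migration."""

    def __init__(self, notify_threshold_agent, samples_per_report=5):
        self.notify = notify_threshold_agent     # callback into the Threshold Agent
        self.samples_per_report = samples_per_report
        self._samples = []

    def record_sample(self, cpu_load):
        self._samples.append(cpu_load)
        if len(self._samples) == self.samples_per_report:
            average = sum(self._samples) / len(self._samples)
            self._samples.clear()
            self.notify(average)                 # only the smoothed value is reported

reports = []
monitor = MonitorAgent(reports.append, samples_per_report=3)
for load in (0.2, 0.9, 0.4, 0.5, 0.5, 0.5):      # one spike of 0.9 among the samples
    monitor.record_sample(load)
print(len(reports))                              # prints 2 (two averaged reports)
```

Note that the 0.9 spike never reaches the Threshold Agent directly; it is absorbed into the first three-sample average.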
The Account Agent is responsible for keeping track of the resources used by a Client Agent. This is necessary for payment: if a client has paid for some resources, the system has to prevent the client from using more resources than were paid for. When a Client Agent enters a node it registers itself with the Threshold Agent, which then registers the client with the Account Agent. The Account Agent now monitors the Client Agent. If the Client Agent wants to migrate to another node it deregisters itself from the Threshold Agent. The Threshold Agent then deregisters the client from the Account Agent, asks the Account Agent for the resources used by the client, and writes this information to a log file. The Threshold Agent may also itself migrate the client to another node in the cluster because of low system resources at the current node. In this case the Threshold Agent deregisters the client from itself and from the Account Agent, after which it registers the Client Agent with the new node. The Threshold Agent also asks the Account Agent for the resources used by the client and informs the Account Agent at the new node about the resources the client has used until that time.

Each node in the cluster has one Threshold Agent. This agent governs the node: it decides if, when, and where agents are migrated, which agents are permitted on the node, and what their restrictions are. When a Threshold Agent registers itself with the Directory Agent, the host of the Threshold Agent is added to the list of nodes. After registering, the Threshold Agent starts its job. First it waits for a Monitor Agent and an Account Agent to register themselves with it; only then does it accept Client Agents. A Client Agent has to register with the Threshold Agent, otherwise the client is removed. If the Monitor Agent reports low resources, the Threshold Agent tries to move clients to hosts with more resources.

The agents that carry out the real work for the user are the Client Agents. All Client Agents have to register themselves with the Threshold Agent on the node they have just entered. When a Client Agent migrates by itself it has to deregister itself from the Threshold Agent and register itself with the Threshold Agent on the new node; without registration the client will be removed by the Threshold Agent of the new node. A useful design pattern for clients with homogeneous tasks is to build one main agent that holds all the program code and data. After it is started, the main agent clones itself several times, creating child agents that carry out subtasks.
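The register/deregister handoff between the Threshold Agent and the Account Agent during migration can be sketched as follows. Class and method names are illustrative; the actual system implements these agents in Java on IBM Aglets:

```python
class AccountAgent:
    """Tracks the resources consumed by each registered Client Agent."""
    def __init__(self):
        self.usage = {}
    def register(self, client):
        self.usage[client] = 0
    def charge(self, client, amount):
        self.usage[client] += amount
    def deregister(self, client):
        return self.usage.pop(client)        # report final usage on departure

class ThresholdAgent:
    """Per-node controller: admits clients and hands them off on migration."""
    def __init__(self, account_agent):
        self.account = account_agent
        self.clients = set()
        self.log = []
    def register_client(self, client):
        self.clients.add(client)
        self.account.register(client)        # Threshold Agent registers the client
    def migrate_client(self, client, new_node):
        # Deregister locally, log the usage, and carry it to the destination node.
        self.clients.remove(client)
        used = self.account.deregister(client)
        self.log.append((client, used))      # written to the log file
        new_node.register_client(client)
        new_node.account.charge(client, used)  # usage so far travels with the client

node_a = ThresholdAgent(AccountAgent())
node_b = ThresholdAgent(AccountAgent())
node_a.register_client("client-1")
node_a.account.charge("client-1", 7)         # resources consumed on node A
node_a.migrate_client("client-1", node_b)
print(node_b.account.usage["client-1"])      # prints 7
```

The key design point mirrored here is that accumulated usage follows the client, so a paid resource quota can be enforced across nodes.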
After the child agents have finished, they return their results to the main agent. Agents in the system communicate with each other and with users using the mobile group approach for coordination. The MAS acts as a collection of agents, locations and communication channels. A location represents a logical place in the distributed environment where agents execute; when a mobile agent migrates, it moves from one location to another. Agents communicate via the Message Agent by exchanging messages through reliable communication channels. The detailed functionality of the policies

executed by the various agents is described as under:

Job Management Policy: After the servers destined to evaluate the desired jobs have been selected using the server selection policy, the actual migration of jobs is initiated by the job management policy, which, with the aid of Client Agents, transfers the jobs/tasks to the selected servers. This policy enables Client Agents to carry jobs to different servers, makes sure that all Client Agents register themselves with the Directory Agent of each node as they enter and deregister on completion of the job, and finally directs the Account Agent to keep track of the resources used by each Client Agent.

Load Sharing Policy: This policy specifies the strategies on the basis of which jobs are provided to the service log for processing. It is initiated by a threshold based strategy; in our case the load sharing policy is Evolutionary Sender Initiated. The Job Allocation Agent (JAA) executes the server selection policy as well as the load sharing policy to select another server to receive the job in case of failure. The JAA carries the job to the selected server and negotiates with it for the acceptance of the job.

Information Collection Policy: This policy periodically evaluates the workload on each of the servers and disseminates this information. The algorithm implementing this policy is shown in Fig. 5.5:

1: Set t := 0
2: At t := t + Є, update WS(t), where WS(t) = {WS1, WS2, ..., WSMAX} and WSi represents the ith workstation
3: Initialize i ← 1
4: while i ≤ MAX do
5:   Evaluate WSi_SUS := w1 * cpu_load + w2 * no_connect / MAXCON + w3 * free_mem
     /* SUS = server utilization status; cpu_load = length of the job queue on the server;
        no_connect = number of active connections on the server; MAXCON = maximum number
        of connections allowed to a server; free_mem = percentage of free memory;
        w1, w2, w3 = weights of the parameters such that w1 + w2 + w3 = 1 */
6:   Evaluate WSi_RT := RM + JC + Interoperability + SA + JE
     /* RT = response time, i.e., the time taken from service request to service provision;
        RM = time to receive the message; JC = time for job classification;
        SA = time for selecting the agent; JE = time for job execution */
7:   Evaluate the turnaround cost WSi_TC := WSi_SUS + WSi_RT + average waiting time of the job
8:   Threshold_Agent(load_information) ← Provide(WSi_TC)
9:   Message_Agent ← Threshold_Agent(load_information)
10:  i ← i + 1
11: end while
12: The Message Agent propagates this information among the servers.

Fig 5.5: Information collection policy

Server Selection Policy: This is a two-phase policy. The first phase is the initiation process and the second phase is the selection process. The initiation process determines who initiates load balancing. Most initiation processes in the literature reviewed implement one of three approaches: the sender initiated (SI) approach, where the process is initiated by an overloaded server; the receiver initiated (RI) approach, where the process is initiated by an underloaded server; and the symmetric initiated (SyI) approach, where a switch-over to the SI or RI approach is made depending on the network conditions. Once the process is initiated, the appropriate servers are selected for load reallocation using either a find-best strategy, selecting the least loaded server among all the servers, or a find-first strategy, selecting the first server whose load is below the threshold.

GA_LOAD_BAL
1: Set t := 0
2: Initialize P(t) := {S1, ..., SM}, such that Si = {Jk, where 1 ≤ k ≤ n and Jk Є {0, 1}} /* n is the number of unique jobs */
3: Evaluate P(t) := {f(S1), ..., f(SM)}, where f(Si) = Σ fitness(Jk), 1 ≤ k ≤ n, only if Jk = 1
4: Find S* Є P(t) such that f(S*) ≤ f(S) for all S Є P(t) /* S* stores the most fit chromosome */
5: while t < tmax do
6:   select {Si, Sj} := Φ(P(t)) /* Φ = binary tournament operator */
7:   crossover C := Ωc(Si, Sj) /* Ωc = uniform crossover operator */
8:   mutate C ← Ωm(C) /* Ωm = mutation operator */
9:   if C = any S Є P(t) then
10:    discard C and go to step 6 /* C is a duplicate of a member of the population */
11:  end if
12:  evaluate f(C)
13:  find S' Є P(t) such that f(S') ≥ f(S) for all S Є P(t) and replace S' ← C /* steady state replacement of the least fit member */
14:  if f(C) < f(S*) then
15:    S* ← C /* update the best fit chromosome found */
16:  end if
17:  t ← t + 1
18: end while
19: return S*, f(S*)

Fig. 5.6: Genetic Algorithm for Server Selection

We propose a new strategy integrating the sender initiated server selection policy with the evolutionary approach, named the Evolutionary Sender Initiated (ESI) approach. In this approach, overloaded servers triggered by their Threshold Agents register themselves with the server log using the Message Agent. The cluster head deploys GA_LOAD_BAL as the server selection policy and executes it periodically. GA_LOAD_BAL, shown in Fig. 5.6, selects a set of unique jobs to be processed on multiple servers concurrently. The parallel execution of unique jobs ensures optimal utilization of cluster resources, enforcing an equitable load distribution at all times. Here P represents the population {P0, P1, P2, ..., Ptmax}, where each population comprises m entries {S1, S2, ..., Sm} and each entry Si = {Jk such that 1 ≤ k ≤ n and Jk Є {0, 1}}. Let J = {J1, J2, J3, ..., Jn} represent the set of unique services available across the network. For implementing the GA for load balancing we define an array of structures, res_joblog. For generating chromosomes in the initial population we select the jobs from the pool of

jobs J = {J1, J2, J3, ..., Jn}. The selected jobs are represented by the value 1 and the deselected ones by the value 0. For each selected job we store the system_id, to keep track of the local dispatcher requiring service; the resource_id, recording the id of the service provider; and AWT, ERT and SUS, for calculating the Average Waiting Time, Expected Response Time and Server Utilization Status of the selected resource. These three parameters help in evaluating the turnaround cost TC of a job Ji, and the sum of the TCs of all the selected jobs in a chromosome gives its fitness value. Table 5.2 illustrates three chromosomes S1, S2, S3 from a population P(i) with 7 jobs/tasks; an entry is stored corresponding to each job that has been selected.

      J1 J2 J3 J4 J5 J6 J7   System_id               Resource_id             Fitness value
S1 =   1  0  0  1  1  0  0   20, 0, 0, 3, 12, 0, 0   13, 0, 0, 1, 9, 0, 0    TC(J1) + TC(J4) + TC(J5)
S2 =   0  0  0  0  1  1  1   0, 0, 0, 0, 14, 5, 6    0, 0, 0, 0, 12, 2, 7    TC(J5) + TC(J6) + TC(J7)
S3 =   1  0  0  1  1  0  0   12, 0, 0, 10, 9, 0, 0   22, 0, 0, 3, 4, 0, 0    TC(J1) + TC(J4) + TC(J5)

TABLE 5.2: Calculation of Turnaround Cost of Jobs

In the first and third chromosomes the same set of jobs, J1, J4 and J5, is selected for processing, but their system_ids and resource_ids differ, making each set of jobs a unique resource that can provide the same service to concurrently running jobs. The sum of the turnaround costs of the selected jobs gives the fitness value of a chromosome. The best fit chromosome represents the set of unique jobs whose parallel execution will yield the most effective balancing of load in the cluster, since the expected response time and average waiting time of the jobs are least. Apart from response time and waiting time, the SUS (server utilization status) factor of TC ensures minimum redirection of requests/jobs on the fly, thereby considerably reducing the overhead incurred in job reallocation.
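To make the fitness computation and the search loop concrete, the following Python sketch evaluates per-job turnaround costs (TC = SUS + response time + average waiting time, as in Fig. 5.5) and runs a small steady-state GA over job-selection chromosomes in the style of Fig. 5.6. All weights, numeric values, and function names are illustrative assumptions, not data from the chapter, and the sketch mirrors only the search loop; in the actual policy the selection is constrained by the pending job requests.

```python
import random

def sus(cpu_load, no_connect, free_mem, maxcon=100, w=(0.5, 0.3, 0.2)):
    """Server utilization status: w1*cpu_load + w2*no_connect/MAXCON + w3*free_mem."""
    return w[0] * cpu_load + w[1] * no_connect / maxcon + w[2] * free_mem

def fitness(chrom, job_tc):
    """Fitness of a chromosome = sum of TC over its selected jobs (gene = 1)."""
    return sum(tc for gene, tc in zip(chrom, job_tc) if gene)

def ga_load_bal(job_tc, pop_size=8, t_max=200, p_mut=0.01, seed=1):
    """Steady-state GA: binary tournament, uniform crossover, duplicate rejection."""
    rng = random.Random(seed)
    n = len(job_tc)
    pop = [tuple(rng.randint(0, 1) for _ in range(n)) for _ in range(pop_size)]
    best = min(pop, key=lambda s: fitness(s, job_tc))
    for _ in range(t_max):
        p1, p2 = (min(rng.sample(pop, 2), key=lambda s: fitness(s, job_tc))
                  for _ in range(2))                        # binary tournament selection
        child = tuple(rng.choice(pair) for pair in zip(p1, p2))       # uniform crossover
        child = tuple(1 - g if rng.random() < p_mut else g for g in child)    # mutation
        if child in pop:
            continue                    # discard duplicates of existing members
        worst = max(pop, key=lambda s: fitness(s, job_tc))
        pop[pop.index(worst)] = child   # steady-state replacement of the least fit
        if fitness(child, job_tc) < fitness(best, job_tc):
            best = child                # track the best-fit chromosome found
    return best, fitness(best, job_tc)

# Hypothetical TC values for seven jobs: SUS + response time + average waiting time.
job_tc = [sus(3, 40, 0.2) + rt + wt
          for rt, wt in [(1.2, 0.3), (2.0, 0.5), (1.1, 0.2), (0.9, 0.4),
                         (1.5, 0.6), (2.2, 0.1), (1.0, 0.3)]]
s1 = (1, 0, 0, 1, 1, 0, 0)             # chromosome S1 of Table 5.2: jobs J1, J4, J5
print(round(fitness(s1, job_tc), 2))   # prints 9.88
```

Because lower turnaround cost means a fitter chromosome, the GA minimizes the fitness value, matching steps 4 and 14 of Fig. 5.6.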
A crossover rate of 0.7 and a mutation rate of 0.01 are used for generating the subsequent population. Each new chromosome obtained after crossover is checked against the existing population for a replica; if one exists the chromosome is rejected, and the uniform crossover operator is applied again to two new chromosomes selected by the binary tournament selection operator. Each generation yields a set of unique jobs that would be processed in minimum time, modifying the solution vector at each step and proceeding towards the best fit solution.

Implementation & Performance Evaluation

To study the performance of the proposed framework, we have implemented it on a 10/100/1000 Mbps switched LAN that connects 1000 workstations and personal computers and is used by about 700 researchers and students. The machines are grouped into six different networks, each with its own servers, and the servers of each network are connected to the main server of the Network Centre. The framework is implemented on a cluster of PCs using the IBM Aglets software, which provides a Java API for programming mobile agents and an environment for running them. The mobile agent enabled web server cluster is implemented on a cluster of PCs (P-4, 3 GHz machines) using the Agent Software Development Kit and j2sdk1.3.1. The nodes have 256 MB of main memory, while the web server host has 512 MB. Among them, five PCs are configured as web servers and the other PCs act as clients. The GA-based load-balancing scheme is evaluated on the cluster by comparing its performance with the sender initiated (SI), receiver initiated (RI) and symmetric initiated (SyI) approaches. Client requests are generated to measure the performance of the web server, which is assessed by criteria such as load distribution, system throughput and network traffic. The load distributions generated on the five servers at different moments by the ESI scheme and the other three approaches (SI, SyI, RI) are compared in Table 5.3.

TABLE 5.3: Load distributions on five servers using the SI, ESI, SyI and RI approaches
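Table 5.3 reports per-server load distributions; how evenly a policy spreads the load is summarized by the mean deviation of the loads, which can be sketched as follows. The request counts below are made-up illustrations, not the measured values:

```python
def mean_deviation(loads):
    """Mean absolute deviation of per-server loads from their average.
    A lower value means the requests are more evenly distributed."""
    mean = sum(loads) / len(loads)
    return sum(abs(x - mean) for x in loads) / len(loads)

# Hypothetical request counts on five servers under two policies.
even_loads = [98, 102, 100, 101, 99]    # nearly even distribution
skewed_loads = [60, 150, 90, 120, 80]   # skewed distribution
print(mean_deviation(even_loads))       # prints 1.2
print(mean_deviation(skewed_loads))     # prints 28.0
```

A policy producing the first profile would therefore score far better on the Table 5.3 metric than one producing the second.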

Table 5.3 compares the mean deviation of the load on the five servers and shows that the ESI scheme has a lower load deviation than the other three approaches, which means that with ESI the client requests are more evenly distributed across the web servers.

Fig. 5.7 EMAS System Throughput

Fig. 5.7 shows the system throughput of all four approaches and clearly depicts the superior performance of ESI in comparison with the others.

Fig. 5.8 Normalized average response time under varying load

Fig. 5.9 Normalized average waiting time under varying load

The job migration policies ESI, SI, SyI and RI are compared in Fig. 5.8 and Fig. 5.9 with respect to average response time and average waiting time respectively. The ESI policy improves the average response time (seconds) significantly under varying system load. Under light load all policies are almost identical in terms of response time, but the key difference appears under heavy load, where the ESI policy outperforms SI and RI. In Fig. 5.9 the ESI policy improves the average waiting time (seconds) compared to RI under varying system load. Moreover, as the number of nodes in the cluster increases, mobile agent based ESI becomes more effective even in the lightly loaded case.

Fig 5.10 Job Migration under varying load

Fig 5.11 System throughput of ESI approach and the case without load balancing

Job migration under varying load is compared in Fig. 5.10, which indicates that RI performance is lower than that of ESI. This is because the RI approach is more passive than ESI, waiting for a node to take the initiative and thus migrating only a few jobs. SyI is more flexible than RI, having the option of passively waiting for nodes to advertise their availability or of actively migrating jobs if no volunteers appear. SyI strikes a good balance, achieving better performance than RI while transferring fewer jobs than SI. The ESI approach performs slightly better than SyI under heavy load. The system throughput of the ESI-based load-balancing scheme using mobile agents and of the case without load balancing are compared in Fig. 5.11. The result shows that the ESI scheme clearly improves the system throughput as the number of servers increases, whereas in the latter case the processing capacities of the servers are wasted and no improvement is obtained.

REFERENCES

1. W. Tang and M. Mutka, Load distribution via static scheduling and client redirection for replicated web servers, Proceedings of the First International Workshop on Scalable Web Services (in conjunction with ICPP 2000), Toronto, Canada, pp. 127-133, 2000.
2. V. Cardellini and M. Colajanni, Dynamic load balancing on web-server systems, IEEE Internet Computing 3, pp. 28-39, 1999.
3. T.L. Casavant and J.G. Kuhl, A taxonomy of scheduling in general-purpose distributed computer systems, IEEE Trans. Software Eng. 14 (2), pp. 141-153, 1988.

4. J. Cao, G. Bennett and K. Zhang, Direct execution simulation of load balancing algorithms with real workload distribution, J. Systems Software 54, pp. 227-237, 2000.
5. Y. Wang and R. Morris, Load sharing in distributed systems, IEEE Trans. Computers C-34 (3), pp. 204-217, 1985.
6. M.J. Zaki, W. Li and S. Parthasarathy, Customized dynamic load balancing for a network of workstations, Journal of Parallel and Distributed Computing 43, pp. 156-162, 1997.
7. C. Yoshikawa, B. Chun, P. Eastham, A. Vahdat and T. Anderson, Using smart clients to build scalable services, Proceedings of USENIX, pp. 105-117, 1997.
8. D. Eager, E. Lazowska and J. Zahorjan, A comparison of receiver-initiated and sender-initiated dynamic load sharing, Perform. Eval. Vol. 1, pp. 53-68, 1986.
9. Cisco Distributed Director, http://www.cisco.com/warp/public/cc/pd/cxsr/dd/index.shtml
10. Bestavros, M. Crovella, J. Liu and D. Martin, Distributed packet rewriting and its applications to scalable web server architectures, Proceedings of the Sixth International Conference on Network Protocols, Austin, TX, pp. 290-297, 1998.
11. D. Dias, W. Kish, R. Mukherjee and R. Tewari, A scalable and highly available web server, Proceedings of the 41st International Computer Conference (COMPCON 96), IEEE Computer Society, San Jose, CA, pp. 85-92, 2001.
12. A. Singhai, S.B. Lim and S.R. Radia, The Sun SCALER framework for internet servers, IEEE Fault Tolerant Computing Systems, 1998.
13. mod_backhand, http://www.backhand.org/mod_backhand/
14. T. Schlossnagle, The backhand project: load balancing and monitoring Apache web clusters, Proceedings of ApacheCon, London, 2000.
15. J. Cao, Y. Sun, X. Wang and S.K. Das, Scalable load balancing on distributed web servers using mobile agents, Journal of Parallel and Distributed Computing, Vol. 63, pp. 996-1005, 2003.
16. J. Al-Jaroodi, N. Mohamed, J. Hong and D. Swanson, A middleware infrastructure for parallel and distributed programming models on heterogeneous systems, IEEE Trans. Parallel and Distributed Systems, Special Issue on Middleware, Vol. 14, pp. 1100-1111, 2003.
17. Raimundo J.A. Macêdo and F.M. Silva, The mobile groups approach for the