Design Considerations for Network Processor Operating Systems
Design Considerations for Network Processor Operating Systems

Tilman Wolf, Ning Weng, and Chia-Hui Tai
Dept. of Electrical and Computer Engineering, University of Massachusetts, Amherst, MA, USA

ABSTRACT
Network processors (NPs) promise a flexible, programmable packet processing infrastructure for network systems. To make full use of the capabilities of network processors, it is imperative to provide the ability to adapt dynamically to changing traffic patterns and to provide run-time support in the form of a network processor operating system. The differences from existing operating systems and the main challenges lie in the multiprocessor nature of NPs, their on-chip resource constraints, and the real-time processing requirements. In this paper, we explore the key design tradeoffs that need to be considered when designing a network processor operating system. In particular, we explore the performance impact of (1) application analysis for partitioning, (2) network traffic characterization, (3) workload mapping, and (4) run-time adaptation. We present and discuss qualitative and quantitative results in the context of a particular application analysis and mapping framework, but the observations and conclusions are generally applicable to any run-time environment for network processors.

Categories and Subject Descriptors: D.4.1 [Operating Systems]: Multiprocessing
General Terms: Design
Keywords: Network processors, application partitioning, application mapping

1. INTRODUCTION
The success of the Internet as a communication medium is driving research in the areas of sensor networks, overlay networks, ubiquitous computing, grid computing, and storage area networks. This trend expands the functionality of networks to include increasingly diverse and heterogeneous end-systems, protocols, and services. Even in today's Internet, routers perform a large amount of processing in the data path. Examples are firewalling, network address translation (NAT), web switching, IP traceback, TCP/IP offloading for high-performance storage servers, and encryption for virtual private networks (VPNs). Many of these functions are performed in access and edge networks, which exhibit the most diversity of systems and required network functions. With the broadening scope of networking, it can be expected that this trend will continue and that more complex processing of packets inside the network will become necessary [2, 4]. The processing infrastructure for these various packet processing tasks can be implemented in a number of ways. Well-defined, high-speed tasks are often implemented on application-specific integrated circuits (ASICs). Tasks that are not well-defined, or that may change over time, need to be implemented on a more flexible platform that can be reprogrammed. Network processors (NPs) have been developed for this purpose.
The performance demands of increasing link speeds and the need for flexibility require that these network processors be implemented as multiprocessor systems. This makes the programming of such devices difficult, as the overall performance depends on the fine-tuned interaction of different system components (processors, memory interfaces, shared data structures, etc.). The main problem is handling the complexity of the various, interacting NP system components. To achieve the processing performance necessary to support multi-gigabit links, network processors are implemented as system-on-a-chip multiprocessors. This involves multiple multithreaded processing engines, different types of on- and off-chip memory, and a number of special-purpose co-processors. On conventional workstation or server systems, these complexities are hidden by the operating system or do not express themselves as drastically due to the uniprocessor architecture. To simplify this process, a number of domain-specific programming languages and optimizing compilers are currently being developed [5, 17, 19]. These approaches aim at statically optimizing a single application (i.e., router functionality) for the underlying hardware. In current network processors, most performance-critical tasks are implemented and fine-tuned in assembly (e.g., to balance the processing times in each step of a software pipeline). As a result, slight changes in the functionality can have drastic performance impacts, which require re-tuning. Due to the necessary fine-tuning of individual applications, it is very difficult to integrate and dynamically change multiple packet processing functions on a single network processor. However, network processing is inherently a dynamic process. The main motivation for implementing packet processing functions on a network processor (rather than in a faster, more power-efficient custom logic device) is the need to change the functionality over time. Changing traffic patterns, new network services and protocols, new algorithms for flow classification, and changing defenses against denial-of-service attacks present the dynamic background that a programmable router needs to accommodate. This requires that the router (1) can implement multiple packet processing
applications at the same time, (2) can quickly add and remove processing functions from its workload, and (3) can ensure efficient operation under all circumstances. In particular, the management of various system resources is important to avoid performance degradation from resource bottlenecks. The complexity of NP multiprocessor architectures has limited the use of existing operating system concepts in this domain. The few existing approaches to providing run-time support are still aimed at single packet processing applications that are optimized offline [19, 12] or consider network processing applications as a single monolithic entity [9]. In this paper, we explore a variety of design issues for a run-time environment that supports several concurrent network processing applications and allows dynamic reconfiguration of the workload on a multiprocessor system. The key design considerations that are addressed fall into four broad categories: (1) application partitioning, (2) traffic characterization, (3) run-time mapping and adaptation, and (4) system constraints. Research efforts in the network processor domain have long been influenced by particular system designs. In order to separate the general design issues of operating systems from a particular system, we separate qualitative observations and quantitative results. First, we explore the above design issues and discuss the qualitative tradeoffs between available design choices. This analysis provides an important understanding of interactions between system tradeoffs. Second, we provide quantitative results that illustrate the tradeoffs discussed above in the context of actual NP applications. For this, we leverage NP application partitioning and mapping techniques and analytic performance modeling results that were developed in our prior work [15, 20]. In Section 2, we discuss related work. Section 3 discusses the key characteristics of network processor operating systems and how they differ from conventional workstation operating systems. Section 4 discusses qualitative design tradeoffs, and Section 5 quantifies these observations in the context of our particular experimental system. Section 6 combines the results and discusses their implications for network processor operating system design. Section 7 concludes this paper.

2. RELATED WORK
Commercial examples of network processors are numerous [3, 6, 7, 10]. A network processor is typically implemented as a single-chip multiprocessor with high-performance I/O components, which is optimized for packet processing. In particular, NPs provide a more suitable architecture to handle these workloads than conventional workstation or server processors. The need for a specialized architecture is due to the uniqueness of the workload of NPs, which is dominated by many small tasks and high-bandwidth I/O operations [21]. Another alternative for implementing packet processing functions is programmable logic devices, e.g., field-programmable gate arrays (FPGAs), which are more suitable for some of the dataflow-style processing functions [1, 18]. The design considerations for run-time support in FPGA-based systems are very similar to those of conventional NPs, and thus the results of our work can be expected to be equally applicable to this domain. In order to achieve the necessary performance for ever-increasing line speeds and increasingly complex packet processing functions, both NPs and FPGAs exploit the parallelism that is inherent in the workload of these systems.
In general, packets can be processed in parallel as long as they do not belong to the same flow. Processing functions within a packet can also be parallelized to decrease packet delay. This leads to NP systems with numerous parallel processing and co-processing engines. To program such a system, several domain-specific programming languages have been developed. Intel, jointly with the Shangri-La project at UT Austin, has developed Baker [5]. The MESCAL project at UC Berkeley has developed NP-Click [17], which is based on the Click modular router [8]. In the NEPAL project [12], Memik et al. propose a run-time system that controls the execution of applications on a network processor. Applications are separated into modules at the task level using the NEPAL API, and modules are mapped to execution cores dynamically. There is an extra level of translation during which the application code is converted to modules. The analysis of run-time techniques is limited to a couple of heuristic allocation algorithms. Teja [19] is a commercial programming environment for the Intel IXP family of network processors. While they provide a thin network processing operating system (NPOS), both Teja and NEPAL are designed to simplify the programming process for a single application and aim at code reuse across platforms. The ability to quickly adapt to multiple applications on the same network processor system is not supported. Even though there has not been much work on the mechanisms for dynamically managing multiple applications, there has been work on algorithms for adapting and scheduling network processors to save power. Kokku et al. have explored run-time environment design issues [9] similar to the ones we present here, but do not consider partitioned applications that are distributed over several processors of the network processing system. Instead, it is assumed that an entire application is mapped to a single processor core.

3. NETWORK PROCESSOR OPERATING SYSTEM
The term "operating system" is most commonly used in the context of workstation computers. The responsibilities of such an operating system are to manage hardware resources and to isolate users from each other. The optimization target is commonly to minimize the execution time of a single task (the one the user is currently working on). It is important to note that the goals of an operating system for network processors are very different. On a network processor, all applications are controlled by the same administrative entity, and optimization aims at maximizing overall system throughput.

3.1 Differences from Workstation Operating Systems
The following list details the differences between network processor operating systems and conventional operating systems:

Separation Between Control and Data Path. This separation refers to the processor context, not the networking context. To achieve high throughput on NPs, several studies have shown that it is more economical to implement a larger number of simpler processing engines than fewer, more powerful ones. Such simple processors do not have the capability to run complex control tasks on top of packet processing tasks. In today's NP designs, control is implemented on a separate control processor. Due to this separation between classes of processors, it is necessary to have a more explicit control structure than one would have in a conventional operating system.

Limited Interactivity. Users do not directly interact with an NP or its operating system. At most, applications are installed and removed occasionally.
This does not mean that a user could not change configurations on an NP (e.g., update rules for a firewall application), but the key variables in this
system are the traffic patterns that determine what processing needs to happen.

Regularity and Simplicity of Applications. One dominating aspect of network processing is that data path processing is performed on individual packets. This means that packet processing tasks are typically limited in complexity due to the real-time constraints imposed by continuously arriving packets. As a result, the processing demands are low (a few hundred to several thousand instructions [16]). Additionally, the execution path within an application is the same in a large number of cases and only slightly different in the remaining cases. Therefore it is feasible to analyze packet processing applications in detail to find good processor mappings.

Processing Dominates Resource Management. Conventional operating systems need to implement a number of different functions: processor scheduling, memory management, application isolation, abstraction of hardware resources, etc. In network processor systems, these challenges are dominated by the management of processing resources. The diversity of hardware resources is limited, and many are controlled directly by the application. Also, memory is usually allocated statically to ensure deterministic run-time behavior. This might change in the future as network applications become more complex and network processor operating systems become more similar to conventional operating systems. In this work, we focus on the processing aspects of operating system functionality.

Non-Existence of User-Space/Kernel-Space Separation. All functions on a network processor are controlled by the same administrative entity. There is no clear separation between user space and kernel space in the traditional sense. Instead, functionality is divided between control and data path. As a result, traditional protection mechanisms are typically not implemented in network processor operating systems.

3.2 Design Considerations
Due to these numerous and significant differences between what is conventionally thought of as an operating system and what is necessary for a network processor, we believe it is important to explore some of the fundamental design issues that one encounters in the context of network processor operating systems. The questions that arise from such an exploration are: How should applications be dynamically installed on and removed from the network processor? How can applications be partitioned and mapped to utilize the underlying multiprocessor hardware? How should the operating system adapt to changes in network traffic that require different application configurations? What should be the frequency and level of adaptation? In order to explore operating system design aspects concretely, we assume the general operational approach shown in Figure 1. There are four basic steps that are necessary for run-time support of network processor systems: application analysis, traffic characterization, workload mapping, and adaptation. There is a fundamental question as to what should be done offline (e.g., during application development) and what should (and realistically can) be done at run-time.

Figure 1: Processing and Traffic Analysis in a Network Processor Operating System.

Application analysis is necessary to analyze the processing requirements of the application and to be able to partition the application.
Partitioning allows the distribution of different subtasks onto different processing elements to fully utilize the resources of the NP. The simplicity and repetitiveness of network processing applications [21, 16] allow for a detailed analysis of the application. The profiling process is shown as an offline component: with the limited processing resources of current NP architectures, this analysis cannot be done online. Typically, such an analysis can be performed in the context of the application development environment for the NP applications. The partitioning can be performed in different ways, and the level of granularity at which it should be performed is discussed in more detail below. Network traffic characterization is another important aspect of run-time support for network processor systems. Depending on the requirements of current network traffic, different applications dominate the processing. This dynamically changing workload is the main reason why run-time support is necessary. In order to achieve a good allocation of processing resources, it is necessary to know what processing is necessary for the packets currently in the system (or those that will be processed in the near future). The result of traffic analysis is an application allocation, which describes the ratio of processing required by each application that is available on the system. Workload mapping is the process that assigns processing tasks to actual processing engines. This assignment is based on the application allocation and the application partitioning derived in the previous two steps. Mapping can be performed in a number of different ways and depends on the particular system architecture, application development environment, and operational principles of a system. The goals of mapping are to achieve high system throughput and efficient resource utilization. The adaptation step illustrates the need to reconfigure the network processor system to match the processing requirements that are dictated by the traffic workload. During adaptation, the application allocation is changed according to the new traffic requirements. Then, the mapping step modifies the allocation of tasks to processors to match the new allocation.
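Taken together, these four steps form a control loop. The following minimal sketch is our own illustration of how the pieces fit together; the function names and structure are hypothetical, not an actual NP-OS API:

```python
# Hypothetical sketch of the run-time cycle in Figure 1: offline
# profiling/partitioning feeds an online loop of traffic characterization,
# workload mapping, and periodic adaptation. Step functions are parameters.

from typing import Callable, Iterable, List

def run_time_cycle(
    batches: Iterable[List[str]],          # each batch: packets tagged with app name
    sample_size: int,
    estimate: Callable[[List[str]], dict], # sample -> predicted application allocation
    remap: Callable[[dict, dict], dict],   # (allocation, old mapping) -> new mapping
    process: Callable[[List[str], dict], None],
) -> None:
    mapping: dict = {}
    for batch in batches:
        sample = batch[:sample_size]           # observe a window of the batch
        allocation = estimate(sample)          # extrapolate processing demand
        mapping = remap(allocation, mapping)   # adapt task-to-PE assignment
        process(batch, mapping)                # run the batch under the new mapping

# Toy usage: the allocation is the fraction of sampled packets per application.
if __name__ == "__main__":
    batches = [["ip_lookup"] * 80 + ["flow_class"] * 20,
               ["ip_lookup"] * 30 + ["flow_class"] * 70]
    est = lambda s: {a: s.count(a) / len(s) for a in set(s)}
    remap = lambda alloc, old: alloc           # stand-in for a real mapping algorithm
    proc = lambda b, m: print(f"batch of {len(b)} packets, allocation {m}")
    run_time_cycle(batches, sample_size=10, estimate=est, remap=remap, process=proc)
```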
4. QUALITATIVE TRADEOFFS
For each of the four components of a network processor operating system discussed in the previous section, there are a number of qualitative design tradeoffs. We discuss the different design choices in this section and then provide quantitative results in Section 5. Since the quantitative results are highly dependent on a particular system, we have separated the discussion of tradeoffs to preserve its general applicability.

4.1 Application Partitioning
Network processor applications rarely consist of a single monolithic piece of code. More commonly, NP applications are split into several subtasks. For example, on the Intel IXP2400, input processing is separated from forwarding and from output processing. This partitioning makes application development somewhat easier and changes to the application easier to implement. It also allows for exploiting parallelism and pipelining to fully utilize the multiprocessor infrastructure. How can an operating system support this application partitioning?

4.1.1 Manual Partitioning
Manual application partitioning is the most common approach to determining a suitable separation of tasks and a mapping of tasks to processors. Using simulation environments, programmers can implement a certain partitioning and obtain performance results. By manually adapting the partitioning, bottlenecks can be removed and the performance can be fine-tuned. This approach is very time-consuming and requires a detailed understanding of the application and the NP hardware. From an operating system perspective, manually partitioned applications limit the amount of dynamic support; it is generally not possible to adapt to changing traffic conditions.

4.1.2 Automated Partitioning
More recently, several approaches to automated partitioning of applications have been investigated. (The automatic mapping aspect discussed below is related to this.) An auto-partitioning compiler has been developed by Intel [5]. Ramaswamy et al. have developed a profiling-based instruction clustering algorithm that derives a directed acyclic graph representation of NP applications [15]. The granularity of the partitioning can be adapted as needed. Plishker et al. take an approach where applications are described in a domain-specific language and then distributed onto processing elements [14].

4.1.3 Design Choices
One key question is what granularity of application partitioning is most suitable. The spectrum of choices ranges from monolithic applications to extremely fine-grained instruction (or basic block) allocations to processing resources. The design tradeoffs are:

Monolithic Applications. If the application is not partitioned, it can only be allocated to a single processing engine. This significantly limits how the application workload can be adapted to network traffic requirements. Also, it may cause performance bottlenecks in pipelined systems (e.g., multiple sequential applications per packet), where the pipeline speed is determined by the maximum stage time. Finally, as application sizes continue to grow, monolithic applications do not allow for a scalable distribution of processing and may conflict with instruction store limitations.

Very Fine-Grained Partitioning. The extreme opposite of a monolithic application is a partitioning where each instruction is mapped individually to a processing resource. This approach provides more flexibility, but also generates more overhead for managing the installation and control of the application. It also increases the complexity of the mapping problem, because a large number of nodes have to be mapped and the space of possible solutions grows significantly with the number of mapping choices.

Balanced Partitioning. Ideally, we would like to find a balanced partitioning that allows efficient distribution of processing tasks but keeps the complexity of the mapping problem at bay. In the quantitative section below, we show that an optimally balanced partitioning exists. This emphasizes the importance of considering application partitioning for network processors; the assumption that an application is a monolithic structure that needs to be mapped to a single processing engine is not sufficient. A small sketch of such a balanced clustering follows this list.
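The sketch below clusters a weighted task graph into n coarse nodes of roughly equal instruction weight while preserving dependency order. It is a deliberate simplification in the spirit of the clustering heuristic of [15], not that algorithm itself; all names and the graph are ours:

```python
# Hypothetical illustration of balanced partitioning: group the instruction
# blocks of a task DAG into n clusters of roughly equal weight, following a
# topological order so that dependencies are never split backwards.

from graphlib import TopologicalSorter

def balanced_partition(weights: dict, deps: dict, n: int) -> list:
    """weights: task -> instruction count; deps: task -> set of predecessors."""
    order = list(TopologicalSorter(deps).static_order())
    target = sum(weights[t] for t in order) / n   # ideal weight per cluster
    clusters, current, acc = [], [], 0.0
    for task in order:
        current.append(task)
        acc += weights[task]
        # Close the cluster once it reaches the target (keep n clusters total).
        if acc >= target and len(clusters) < n - 1:
            clusters.append(current)
            current, acc = [], 0.0
    clusters.append(current)
    return clusters

# Toy example: six instruction blocks with a diamond dependency structure.
w = {"a": 40, "b": 10, "c": 25, "d": 25, "e": 30, "f": 20}
d = {"a": set(), "b": {"a"}, "c": {"b"}, "d": {"b"}, "e": {"c", "d"}, "f": {"e"}}
print(balanced_partition(w, d, n=3))   # e.g. [['a', 'b'], ['c', 'd'], ['e', 'f']]
```

Choosing n trades the evenness of the resulting clusters against the size of the mapping problem, which is exactly the granularity tradeoff discussed above.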
4.2 Traffic Characterization
Traffic characterization is an important input to determining a suitable allocation of the different applications on the network processor. Heavily used applications typically need to be replicated multiple times to provide sufficient performance (e.g., multiple parallel IP forwarding applications). When considering run-time support for workloads, it is important to be able to analyze network traffic to estimate, and possibly predict, a suitable application allocation.

4.2.1 Static Traffic Model
The simplest case of traffic characterization is a static traffic model. This is the most commonly used model, since it does not require any online changes to the system. The assumption is that traffic requirements remain the same over the entire run time of the system. Short-term variations are compensated for by buffering (thus increasing the packet delay) or by overprovisioning, where additional resources are allocated to each type of application (thus increasing the total required hardware resources).

4.2.2 Batch Model
In the batch-processing approach, a certain number of packets are buffered and their processing requirements analyzed before a workload allocation decision is made. This buffering allows an accurate characterization of the requirements. Since the allocation or reallocation of tasks cannot be done arbitrarily fast, it is necessary to maintain a particular processor mapping for a certain amount of time. This period might be only a few milliseconds, but given high link rates, it is equivalent to at least several dozen or several hundred packets. If batch processing is used, then all the packets in a batch need to be buffered. Since this increases the overall delay experienced by a packet, predictive batch processing might be more suitable.

4.2.3 Predictive Batch Model
In many cases, network traffic exhibits a certain amount of temporal locality. This can be exploited to derive a traffic requirement estimate for a batch by observing a smaller window within the batch ("sampling"). Based on the observation, the total processing requirements for the batch can be extrapolated. For small samples, only a few packets need to be buffered to determine the allocation of a batch. With larger samples, the accuracy increases, but so does the processing delay.
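A minimal sketch of this extrapolation step, assuming each sampled packet can be tagged with the application it requires (the names and structure are ours, not from the paper):

```python
# Hedged sketch of the predictive batch model: observe a small sample window
# at the head of a batch and extrapolate per-application processing
# requirements for the whole batch by scaling with b / l.

from collections import Counter

def estimate_batch_requirements(sample: list, batch_size: int) -> dict:
    """sample: application tag of each sampled packet (size l << batch size b).
    Returns the estimated number of packets per application in the batch."""
    counts = Counter(sample)
    scale = batch_size / len(sample)        # extrapolation factor b / l
    return {app: c * scale for app, c in counts.items()}

# Example: a 100-packet sample predicting a 10,000-packet batch.
sample = ["ip_lookup"] * 70 + ["flow_class"] * 25 + ["encryption"] * 5
print(estimate_batch_requirements(sample, batch_size=10_000))
# {'ip_lookup': 7000.0, 'flow_class': 2500.0, 'encryption': 500.0}
```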
4.2.4 Design Choices
In order to determine the allocation of applications to the network processor system, traffic characterization can be performed according to the approaches described above. The tradeoffs are:

Choice of Batch Size. The batch size determines how many packets are processed under the same allocation of processing resources to applications. The smaller the batch size, the more frequently this allocation can be adapted to changes in traffic. However, with smaller batch sizes, less time is available between reconfigurations for the operating system to determine how to adapt and to find a good task mapping.

Choice of Sample Size. In order to minimize the delay for packets, the sample should be chosen to be small. However, a smaller sample reduces the accuracy of the estimate, and thus there is a limit to how small it can be.

Granularity of Application Allocation. From the traffic models, an application allocation is derived that assigns a fraction of the overall processing requirements to each application. For large batch and sample sizes, this allocation can be determined with high precision. In contrast, most network processor systems only support a very coarse allocation of applications (e.g., the number of times a certain application is replicated on the multiprocessor). Therefore, the granularity of the application allocation can be kept at the level of precision that the network processor can support. This permits the use of smaller samples.

4.3 Run-Time Mapping and Adaptation
The mapping of application tasks to processing elements is performed through a mapping algorithm. We explore the design tradeoffs without requiring that a particular algorithm be used. However, we assume two properties of the mapping algorithm. First, the mapping algorithm can yield incrementally better results as its run time increases. This implies that the operating system designer can choose the run time of the algorithm and thus the quality of the resulting mapping. Second, the mapping algorithm can be employed on a partially configured system. This means that some applications can be removed and others can be added without changing the mapping of applications that are not affected. The resulting design choices address the frequency and level of (partial) mapping.

4.3.1 Static Mapping
Static mapping goes hand-in-hand with static partitioning and a static traffic model. In this case, processing tasks are allocated to processors offline and no changes are made during run-time.

4.3.2 Complete Dynamic Mapping
Complete mapping refers to a mapping solution where the entire workload is mapped from scratch. The mapping algorithm can place processing tasks on the entire architecture without any initial constraint. This typically leads to a good solution that approaches the theoretical optimum as the processing time increases.

4.3.3 Partial Dynamic Mapping
Partial mapping assumes that some part of the workload is already mapped to the NP system. The mapping algorithm only needs to map a few applications to the remaining processing resources. This approach is more restrictive than complete dynamic mapping because much of the system resources are already assigned to applications. The incremental nature of this approach poses the risk that the mapping algorithm gets stuck in a local minimum. Nevertheless, the processing cost for partial mapping is less than that of mapping the entire workload.
4.3.4 Mapping Effort
Another interesting parameter for dynamic mapping is how often to reprogram the system. Ideally, we want to reconfigure the application allocation with every packet to guarantee the best system utilization and high performance while traffic is varying. However, there is a cost associated with mapping and remapping. Apart from the cost of uploading new instructions to each processor, determining the new mapping requires processing power and computation time. It is important to keep the reprogramming frequency at a rate low enough that sufficient processing time is available to find good mapping results.

4.3.5 Design Choices
Design choices for mapping determine how often, how much, and with how much effort to perform complete or partial mapping.

Choice of Dynamic Adaptation Frequency and Mapping Effort. In order to adapt to changing traffic conditions, the network processor operating system needs to change the application allocation and thus the mapping. The rate at which this happens determines the maximum time that is available for determining a new mapping. The lower the adaptation rate, the more time can be spent on finding a better mapping solution. As the adaptation rate increases, the quality of the derived mapping solution decreases.

Choice of Partial Mapping and Mapping Effort. Changes in traffic conditions may only affect a few applications. In order to be able to adapt quickly with low mapping cost, an operating system designer may choose to remap only a small part of the overall allocation. The benefit of this is the ability to adapt quickly, but the amount of traffic variation that can be supported is limited to the fraction of the NP that is remapped.

Balance of Partial and Complete Mapping. Repeated partial mapping causes mapping solutions to deteriorate. In order to avoid this, complete mapping steps should be performed periodically. The more frequently this happens, the less likely the system is to drift into an inefficient state. However, this also increases the overall mapping effort. The quantitative results show that partial mapping can provide quicker adaptation to changes in traffic, but that the deterioration of the mapping result can have a significant impact on performance.

4.4 Constraints
A network processor has a number of system constraints that are not considered in the above discussion. These constraints can play a major role when making design decisions.

4.4.1 Instruction Store Limitations
Most network processor systems are severely limited in the amount of instruction store that is available for each processing engine. This is due to the relatively high chip-area cost of memory compared to processing logic. This limitation is the reason that not all applications can be installed on all processing engines at all times. Therefore mapping changes are quite costly, as they require the uploading of new instructions to every processor.

4.4.2 Overprovisioning
A network processing system needs to be capable of processing any packet that is transmitted on the network. In the context of a network processor operating system, this means that processing resources should be available for all applications. If this is not the case, packets need to be delayed until the next adaptation cycle or processed in the slow path of the router. One way to avoid this delay is to overprovision the system and increase the application allocation to more than 100%. This way it can be ensured that even applications that are not expected to see any traffic in the upcoming batch are installed, just in case.
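As a small illustration, the following hedged sketch computes such an overprovisioned allocation; the 120% budget and 5% floor are arbitrary example parameters of ours, not values from the paper:

```python
# Hypothetical overprovisioning sketch: scale the predicted allocation to a
# budget above 100% and guarantee every installed application a minimum
# share, so unexpected packets can avoid the slow path.

def overprovision(predicted: dict, budget: float = 1.2, floor: float = 0.05) -> dict:
    """predicted: app -> fraction of processing (sums to 1.0).
    Returns shares that sum to `budget` (> 1.0), each at least `floor`."""
    raised = {app: max(share, floor) for app, share in predicted.items()}
    scale = budget / sum(raised.values())   # renormalize to the budget
    return {app: share * scale for app, share in raised.items()}

# Example: an application with no predicted traffic still receives resources.
alloc = overprovision({"ip_lookup": 0.7, "flow_class": 0.3, "encryption": 0.0})
print({a: round(s, 3) for a, s in alloc.items()})
# {'ip_lookup': 0.8, 'flow_class': 0.343, 'encryption': 0.057}
```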
5. QUANTITATIVE RESULTS
In this section, we support the qualitative observations of the previous sections with quantitative results. This helps illustrate which trends have a large impact and which have a small impact on system performance. In order to derive quantitative results, we use a particular system baseline. Of course, there are large differences between network processor systems, and results on another system would look somewhat different. It is therefore more important to consider the trends that can be observed in our results (e.g., does an optimum exist?) than individual data points (e.g., where exactly is the optimum?).

5.1 Baseline System
The metric that we are interested in is the throughput of a network processor system under a certain workload. In order to derive this information, we need to implement some of the functions that are necessary for network processor operating systems. In particular, we need to consider realistic network processing applications, their partitioning, and the mapping of processing tasks to processing engines. In order to explore the design space that we have described, it is not sufficient to consider only a handful of partitioning and mapping solutions. We therefore choose to use an analytic modeling approach instead of simulation. With the automated partitioning, mapping, and performance modeling environment provided in [15] and [20], we can evaluate a large number of possible application partitionings, mapping results, etc. This provides a first-order understanding of the quantitative tradeoffs. In the process, several ancillary metrics (e.g., the cost of deriving a mapping of a certain quality) can be obtained. The process of obtaining performance results is shown in Figure 2. We briefly describe the three key components (application partitioning and representation, the mapping algorithm, and the analytic performance model) to provide a basis for understanding the results below.

Figure 2: Application Analysis, Mapping, and Performance Evaluation Process. Typically, multiple ADAGs are mapped to the NP architecture to reflect the workload mix that can be processed by the network processor.

5.1.1 Application Representation
A network processing application needs to be represented in a way that allows it to be easily mapped to multiple parallel or pipelined processing elements. This requires a representation that exhibits the application's parallelism while also ensuring that data and control dependencies are considered. We use an annotated directed acyclic graph (ADAG) to represent the dynamic execution profile of an application. The ADAG is derived from dynamic profiling of the application by determining data and control dependencies between individual instructions. Using a clustering heuristic that minimizes the communication overhead, instructions are aggregated into larger nodes in the graph.
Each node is annotated with information on the total number of instructions and memory accesses that need to be executed when processing the node.

5.1.2 Mapping Algorithm
Once we have the application represented as an ADAG, the next step is to map the ADAG onto an NP topology. The goal of the mapping is to assign processing tasks (i.e., ADAG nodes) to processing elements and to generate a schedule that achieves the maximum system throughput. This assignment is not easy, because the mapping process needs to consider the dependencies within an ADAG and ensure that a correct processing of packets is possible. Further, Malloy et al. have shown that producing an optimal schedule for a system that includes both execution and communication cost is NP-complete, even if there are only two processing elements [11]. Therefore we need to develop a heuristic to find an approximate solution. Our heuristic solution to the mapping problem is based on randomized mapping. The key idea is to randomly choose a valid mapping and evaluate its performance. By repeating this process a large number of times and picking the best solution that has been found over all iterations, it is possible to achieve a good approximation to the global optimum. With the randomized approach, any possible solution is considered and chosen with a small, but nonzero, probability. This technique has been proposed and successfully used in different application domains [13]. The mapping is performed for multiple, possibly different, ADAGs. The mixture of ADAGs represents the allocation of applications to the NP architecture.

5.1.3 Analytic Performance Model
In order to evaluate the throughput performance of a given solution, we use an analytic performance model that considers processing, inter-processor communication, memory contention, and pipeline synchronization effects. After mapping the application ADAGs to the network processor topology, we know exactly the workload for each processing element. This information includes the total number of instructions executed, the number of memory accesses, and the amount of communication between stages. The model needs to take this into account, as well as contention for shared resources (memory channels and communication interconnects). We are particularly interested in the maximum latency of each pipelined processing stage, since that determines the overall system speed. The number of ADAGs that are mapped to an architecture determines how many packets are processed during any given stage time. After specifying the system parameters, the throughput of the system for a given mapping can be expressed.
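As a rough illustration of the form such an expression takes, consider the following simplification (ours, not the full model of [20]; memory contention, communication, and synchronization overheads are omitted):

$$
t(\mathit{mapping}) \;=\; \frac{N_{\mathrm{ADAG}}}{\max_{j,k}\big(I_{j,k} \,+\, M_{j,k}\cdot \tau_{\mathrm{mem}}\big)}
$$

where $I_{j,k}$ and $M_{j,k}$ are the total instructions and memory accesses of the ADAG nodes mapped to PE $k$ of stage $j$, $\tau_{\mathrm{mem}}$ is the memory access time, and $N_{\mathrm{ADAG}}$ is the number of ADAGs mapped (i.e., packets completed per stage time). The denominator is the maximum stage time referred to above.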
Table 1: Baseline Configuration for Quantitative Results.

  System parameter                 Baseline configuration
  NP pipeline stages               4
  PEs per stage                    4
  Total number of PEs              16
  Memory interfaces per stage      2
  Memory access time (in cycles)   10
  Number of ADAGs                  2
  Nodes per ADAG                   8

5.1.4 System Configuration and Workload
The system architecture considered in the above model can be configured to represent any regular NP architecture with a specified number of parallel processing engines (PEs) and pipeline stages. We have chosen a single baseline system with a fixed configuration to explore the operating system issues (see Table 1). A separate question is how the results change for different architecture configurations; this design space exploration is currently not addressed in our work. The applications considered for this system are radix-tree-based IP lookup and hash-based flow classification.

5.2 Processing Task Mapping
5.2.1 Metrics
Mapping takes a certain amount of processing, which is expressed as the cost c. The performance that is achieved by the mapping is expressed as t, the throughput of the system. Due to the NP-completeness of the mapping problem, finding the overall optimal solution is infeasible. What is really desirable in a system is to obtain a good-enough solution without having to run the approximation algorithm for a large amount of time.
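To make the relationship between cost c and throughput t concrete, the following self-contained sketch (ours, not the actual framework of [15, 20]) implements randomized mapping on a toy ADAG under the simplified stage-time model given above; the iteration count plays the role of the cost c:

```python
# Hedged sketch of the randomized mapping heuristic: repeatedly draw a random
# valid assignment of ADAG nodes to pipeline stages/PEs, evaluate the maximum
# stage time, and keep the best mapping found. Contention is ignored.

import random
from graphlib import TopologicalSorter

def random_valid_mapping(deps, n_stages, pes_per_stage):
    """Assign each node a (stage, pe); successors never map to earlier stages."""
    stage_of, mapping = {}, {}
    for v in TopologicalSorter(deps).static_order():
        lo = max((stage_of[u] for u in deps[v]), default=0)
        stage_of[v] = random.randint(lo, n_stages - 1)
        mapping[v] = (stage_of[v], random.randrange(pes_per_stage))
    return mapping

def max_stage_time(mapping, nodes, t_mem):
    """Stage time = busiest PE in a stage; system speed = slowest stage."""
    load = {}
    for v, (stage, pe) in mapping.items():
        instr, mem = nodes[v]
        load[(stage, pe)] = load.get((stage, pe), 0) + instr + mem * t_mem
    return max(load.values())

def randomized_map(nodes, deps, n_stages=4, pes_per_stage=4, t_mem=10, iters=1000):
    best, best_time = None, float("inf")
    for _ in range(iters):                       # iters plays the role of cost c
        m = random_valid_mapping(deps, n_stages, pes_per_stage)
        t = max_stage_time(m, nodes, t_mem)
        if t < best_time:
            best, best_time = m, t
    return best, best_time

# Toy ADAG: node -> (instructions, memory accesses); deps -> predecessors.
nodes = {"a": (50, 2), "b": (30, 1), "c": (40, 3), "d": (20, 1)}
deps = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
mapping, stage_time = randomized_map(nodes, deps)
print(stage_time, mapping)
```

Running with larger `iters` yields monotonically non-worse stage times, which is the cost/quality tradeoff measured in the next subsection.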
Figure 4 shows the throughput of the baseline system with different levels of application granularity relative to a monolithic implementation, where the entire application is executed on a single processor. The mapping effort is fixed and it can be observed that the best performance is achieved for n=5. The monolithic application performs worse because it does not permit an even distribution of processing tasks on the multiprocessor system. Larger values of n also degrade the performance because the mapping problem becomes more complex and finding a good mapping in a limited amount of time becomes more difficult. 5.4 Traffic Characterization To illustrate the need for traffic characterization and run-time adaptation, we have performed network measurement experiments Metrics As described above, the predictive batch model assumes that traffic is processed in batches (with batch size b) and the applica-
5.4.2 Results
To get realistic data on what traffic variation looks like in a network, we obtained measurement data from our institution's Gigabit Internet access link. We collected approximately 4.2 million packets and classified them by layer-7 application (using the classification rules of the Ethereal tool). There are a total of 175 categories, but over 98% of the traffic falls into the top five categories. Figure 5 shows the traffic variation over a sequence of packets. The parameters are b=10,000 and l=100, and the variation is computed over a sliding window. While this is only a small sample of the overall measurement that we have performed, it does reflect the overall trends correctly. In most cases, the variation is around v=0.04, with a few spikes of up to v=0.15.

Figure 5: Traffic Variation Over a Sequence of Packets.

Of course, the observed variation depends on the quality of the estimate (i.e., the size of l relative to b) and on the batch size b. To show this relationship, Figure 6 shows the average variation observed throughout the entire trace for different batch sizes and different sample percentages. With small samples and small batch sizes, the average variation is very high (v=0.4); the peak variation in this case can reach v=1.0. The larger the sample percentage, the better the estimates and thus the lower the traffic variation. As the batch size increases, the temporary variations within the network traffic are smoothed out and less average variation is observed. This does not mean that the static allocation approach (b=∞) is necessarily ideal: what the figure does not show is the delay that would be caused by smaller-scale traffic variation, nor the overprovisioning that would be necessary to support all possible traffic conditions. When the amount of sampling is fixed, larger batch sizes cause higher traffic variation, as shown in Figure 7. These results illustrate the importance of run-time adaptation, because traffic does change significantly on short time scales.

Figure 6: Traffic Variation Compared to Batch Size with Different Levels of Sampling. The sample size is set to a percentage of the batch size.

Figure 7: Traffic Variation Compared to Batch Size with Fixed Sample Size. The sample size is fixed to l=100.

5.5 Adaptation
Adaptation during run-time is guided by the variation of traffic. Complete and partial mapping is performed during each adaptation process.

5.5.1 Metrics
The metric of interest for adaptation is the system throughput in comparison to a baseline configuration. The key input parameters are the frequency of adaptation and the amount of partial mapping (i.e., the fraction of NP resources to which new applications can be allocated).
For our experiment, we consider the adaptation frequency to be the same as the batch size.

5.5.2 Results
Figure 8 shows the degradation effect of repeated partial mapping. The figure shows that repeated removals and additions of applications cause the system performance to drop quickly and then stabilize at a suboptimal state of about 80% of the peak performance.
There is little variation between the different levels of partial adaptation; the stabilizing value is driven more by the amount of effort that is dedicated to the partial mapping.

Figure 8: Performance Degradation Due to Repeated Partial Mapping (10%, 30%, and 50% partial mapping). The baseline case is a complete mapping.

This result clearly shows that partial mapping can very quickly lead to suboptimal configurations. When designing an NP operating system, it should be taken into account that complete mapping is occasionally necessary. The adaptation frequency (i.e., the batch size) poses a tradeoff between the adaptiveness of the system to changing traffic and the quality of the mapping result that can be derived. Figure 9 shows the relative performance of configurations with different batch sizes under traffic variation. With increasing batch size, the performance of the system increases; but as the traffic variation increases, the performance decreases. The line marked "operational variation" shows the maximum system performance with real traffic variation for a chosen batch size (with a fixed sample size of l=100, as shown in Figure 7). As the batch size increases, the traffic variation does so, too; the resulting deviation from the processing allocation causes the performance to drop. The trend of this curve shows that there clearly is an optimal batch size. This further supports the argument that careful design considerations are necessary when choosing such system parameters.

Figure 9: Performance for Different Batch Sizes under Traffic Variation (curves for traffic variation levels from v=0 to v=0.9 and for the operational variation). The total mapping effort is fixed, and the baseline case has infinite batch size with no traffic variation.

6. NP OPERATING SYSTEM DESIGN SCENARIOS
In the previous two sections, we have discussed a number of design considerations and explored their qualitative and quantitative dependencies. To summarize some of these observations and to identify concrete results, we present three example designs for network processor operating systems. For each scenario, we discuss the performance aspects based on our observations above.

6.1 Scenario 1: Static Configuration
This scenario assumes that all analysis, mapping, and allocation operations are performed offline. Once the network processor is configured, no adaptation is performed. This scenario represents many of today's network processor systems, which are programmed and optimized for a single application scenario. The design and performance considerations are:

Simplicity of System. Clearly, static handling of all mapping and allocation issues simplifies the implementation of the system. There is no need for run-time control.

Limited Run-Time Flexibility. The static approach does not allow for any changes during run-time. Any change in the workload configuration requires the use of a software development tool to reconfigure the entire system. As network processors become more integral components of networks, this approach will become less feasible.

Performance Degradation under Traffic Variation. From Figure 5, we can see that traffic requirements do change over time. This leads to performance degradation or the need for significant overprovisioning.

6.2 Scenario 2: Predetermined Configurations
In this scenario, we assume that the NP system can be configured to one of multiple predetermined workload configurations. Each configuration is statically mapped in an offline process. During run-time, the system can adapt to any one of these configurations.
The design and performance considerations are:

Offline Mapping. While the system needs to monitor traffic variation, it does not need to perform mapping computations. This limits the complexity of the operating system.

Limited Adaptability. The number of configurations that can be precomputed is limited. If traffic varies outside the estimated bounds, no further adaptation is possible. Even within the bounds of estimated traffic, it is unlikely that actual traffic completely matches a predetermined configuration. Thus, a certain level of performance degradation can be expected.

Better-Quality Mapping Results. Due to the availability of arbitrary amounts of computational power for offline mapping, the quality of the mapping results (see Figure 3) can be better than for online mapping.

6.3 Scenario 3: Fully Dynamic Configuration
In the fully dynamic scenario, mapping is performed online, and system adaptation is performed to match the traffic variation. This case utilizes sampling to estimate the processing requirements for each batch. The design and performance considerations are:

Complete Adaptability. A fully dynamic system can adapt to any traffic variation, even configurations that could not
be predicted when programming the system. This is clearly the most important functional benefit of this scenario.

Limited Mapping Quality. Due to the online nature of the mapping process, only limited amounts of processing time are available (very large batch sizes decrease performance; see Figure 9). Thus, the quality of the mapping results is lower than in the other two scenarios.

Lower Overprovisioning Overhead. Due to the ability to adapt to changing traffic conditions, a fully dynamic scenario can provide high throughput performance with fewer processing resources.

In summary, each of the three approaches has its benefits (simplicity vs. high-quality mapping vs. lower system requirements). Nevertheless, the ability to adapt completely at run-time is key for future network processor systems.

7. CONCLUSION
We have presented an extensive qualitative discussion of design issues related to operating system design for network processors. To illustrate the design considerations, we have provided quantitative results that highlight performance tradeoffs between various design parameters. Finally, we have explored three different operating system designs and their benefits and drawbacks. We believe that this study provides an important basis for the design and implementation of future operating systems for network processors. Understanding the presented tradeoffs will guide operating system designers to consider the relevant interactions between applications, network traffic, and the underlying hardware. This will bring us closer to realizing network processors as easy-to-use components of network systems.

8. REFERENCES
[1] BAKER, Z., AND PRASANNA, V. K. A methodology for synthesis of efficient intrusion detection systems on FPGAs. In Proc. of the IEEE Conference on Field-Programmable Custom Computing Machines (Napa, CA, Apr. 2004).
[2] ELSON, J., AND CERPA, A. Internet content adaptation protocol (ICAP). RFC 3507, Network Working Group, Apr. 2003.
[3] EZCHIP TECHNOLOGIES LTD. NP-1 10-Gigabit 7-Layer Network Processor. Yokneam, Israel, np-1.html.
[4] GAVRILOVSKA, A., SCHWAN, K., NORDSTROM, O., AND SEIFU, H. Network processors as building blocks in overlay networks. In Proc. of Hot Interconnects (Stanford, CA, Aug. 2003), ACM.
[5] GOGLIN, S. D., HOOPER, D., KUMAR, A., AND YAVATKAR, R. Advanced software framework, tools, and languages for the IXP family. Intel Technology Journal 7, 4 (Nov. 2004).
[6] IBM CORPORATION. IBM PowerNP Network Processors, 2000. processors.html.
[7] INTEL CORPORATION. Intel Second Generation Network Processor.
[8] KOHLER, E., MORRIS, R., CHEN, B., JANNOTTI, J., AND KAASHOEK, M. F. The Click modular router. ACM Transactions on Computer Systems 18, 3 (Aug. 2000), 263-297.
[9] KOKKU, R., RICHÉ, T., KUNZE, A., MUDIGONDA, J., JASON, J., AND VIN, H. A case for run-time adaptation in packet processing systems. In Proc. of the 2nd Workshop on Hot Topics in Networks (HotNets-II) (Cambridge, MA, Nov. 2003).
[10] LUCENT TECHNOLOGIES INC. PayloadPlus Fast Pattern Processor, Apr. 2000.
[11] MALLOY, B. A., LLOYD, E. L., AND SOFFA, M. L. Scheduling DAGs for asynchronous multiprocessor execution. IEEE Transactions on Parallel and Distributed Systems 5, 5 (May 1994).
[12] MEMIK, G., AND MANGIONE-SMITH, W. H. NEPAL: A framework for efficiently structuring applications for network processors. In Proc. of the Second Network Processor Workshop (NP-2), in conjunction with the Ninth International Symposium on High Performance Computer Architecture (HPCA-9) (Anaheim, CA, Feb. 2003).
[13] MOTWANI, R., AND RAGHAVAN, P. Randomized Algorithms.
Cambridge University Press, New York, NY, 1995.
[14] PLISHKER, W., RAVINDRAN, K., SHAH, N., AND KEUTZER, K. Automated task allocation for network processors. In Proc. of the Network System Design Conference (Oct. 2004).
[15] RAMASWAMY, R., WENG, N., AND WOLF, T. Application analysis and resource mapping for heterogeneous network processor architectures. In Proc. of the Third Workshop on Network Processors and Applications (NP-3), in conjunction with the Tenth International Symposium on High Performance Computer Architecture (HPCA-10) (Madrid, Spain, Feb. 2004).
[16] RAMASWAMY, R., WENG, N., AND WOLF, T. Analysis of network processing workloads. In Proc. of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) (Austin, TX, Mar. 2005).
[17] SHAH, N., PLISHKER, W., AND KEUTZER, K. NP-Click: A programming model for the Intel IXP1200. In Proc. of the Second Network Processor Workshop (NP-2), in conjunction with the Ninth International Symposium on High Performance Computer Architecture (HPCA-9) (Anaheim, CA, Feb. 2003).
[18] TAYLOR, D. E., TURNER, J. S., LOCKWOOD, J. W., AND HORTA, E. L. Dynamic hardware plugins: Exploiting reconfigurable hardware for high-performance programmable routers. Computer Networks 38, 3 (Feb. 2002).
[19] TEJA TECHNOLOGIES. TejaNP Datasheet.
[20] WENG, N., AND WOLF, T. Profiling and mapping of parallel workloads on network processors. In Proc. of the 20th Annual ACM Symposium on Applied Computing (SAC) (Santa Fe, NM, Mar. 2005).
[21] WOLF, T., AND FRANKLIN, M. A. CommBench - a telecommunications benchmark for network processors. In Proc. of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) (Austin, TX, Apr. 2000).
The Benefits of POWER7+ and PowerVM over Intel and an x86 Hypervisor Howard Anglin [email protected] IBM Competitive Project Office May 2013 Abstract...3 Virtualization and Why It Is Important...3 Resiliency
Control 2004, University of Bath, UK, September 2004
Control, University of Bath, UK, September ID- IMPACT OF DEPENDENCY AND LOAD BALANCING IN MULTITHREADING REAL-TIME CONTROL ALGORITHMS M A Hossain and M O Tokhi Department of Computing, The University of
Accelerating High-Speed Networking with Intel I/O Acceleration Technology
White Paper Intel I/O Acceleration Technology Accelerating High-Speed Networking with Intel I/O Acceleration Technology The emergence of multi-gigabit Ethernet allows data centers to adapt to the increasing
Performance Workload Design
Performance Workload Design The goal of this paper is to show the basic principles involved in designing a workload for performance and scalability testing. We will understand how to achieve these principles
SAN Conceptual and Design Basics
TECHNICAL NOTE VMware Infrastructure 3 SAN Conceptual and Design Basics VMware ESX Server can be used in conjunction with a SAN (storage area network), a specialized high speed network that connects computer
Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January 2015. Email: [email protected] Website: www.qburst.com
Lambda Architecture Near Real-Time Big Data Analytics Using Hadoop January 2015 Contents Overview... 3 Lambda Architecture: A Quick Introduction... 4 Batch Layer... 4 Serving Layer... 4 Speed Layer...
Relational Databases in the Cloud
Contact Information: February 2011 zimory scale White Paper Relational Databases in the Cloud Target audience CIO/CTOs/Architects with medium to large IT installations looking to reduce IT costs by creating
INTRUSION PREVENTION AND EXPERT SYSTEMS
INTRUSION PREVENTION AND EXPERT SYSTEMS By Avi Chesla [email protected] Introduction Over the past few years, the market has developed new expectations from the security industry, especially from the intrusion
Enhance Service Delivery and Accelerate Financial Applications with Consolidated Market Data
White Paper Enhance Service Delivery and Accelerate Financial Applications with Consolidated Market Data What You Will Learn Financial market technology is advancing at a rapid pace. The integration of
(Refer Slide Time: 01:52)
Software Engineering Prof. N. L. Sarda Computer Science & Engineering Indian Institute of Technology, Bombay Lecture - 2 Introduction to Software Engineering Challenges, Process Models etc (Part 2) This
STANDPOINT FOR QUALITY-OF-SERVICE MEASUREMENT
STANDPOINT FOR QUALITY-OF-SERVICE MEASUREMENT 1. TIMING ACCURACY The accurate multi-point measurements require accurate synchronization of clocks of the measurement devices. If for example time stamps
find model parameters, to validate models, and to develop inputs for models. c 1994 Raj Jain 7.1
Monitors Monitor: A tool used to observe the activities on a system. Usage: A system programmer may use a monitor to improve software performance. Find frequently used segments of the software. A systems
A Comparison of General Approaches to Multiprocessor Scheduling
A Comparison of General Approaches to Multiprocessor Scheduling Jing-Chiou Liou AT&T Laboratories Middletown, NJ 0778, USA [email protected] Michael A. Palis Department of Computer Science Rutgers University
Rapid Bottleneck Identification
Rapid Bottleneck Identification TM A Better Way to Load Test WHITEPAPER You re getting ready to launch or upgrade a critical Web application. Quality is crucial, but time is short. How can you make the
Everything you need to know about flash storage performance
Everything you need to know about flash storage performance The unique characteristics of flash make performance validation testing immensely challenging and critically important; follow these best practices
The Truth Behind IBM AIX LPAR Performance
The Truth Behind IBM AIX LPAR Performance Yann Guernion, VP Technology EMEA HEADQUARTERS AMERICAS HEADQUARTERS Tour Franklin 92042 Paris La Défense Cedex France +33 [0] 1 47 73 12 12 [email protected] www.orsyp.com
SiteCelerate white paper
SiteCelerate white paper Arahe Solutions SITECELERATE OVERVIEW As enterprises increases their investment in Web applications, Portal and websites and as usage of these applications increase, performance
How To Develop Software
Software Engineering Prof. N.L. Sarda Computer Science & Engineering Indian Institute of Technology, Bombay Lecture-4 Overview of Phases (Part - II) We studied the problem definition phase, with which
Chapter 2 TOPOLOGY SELECTION. SYS-ED/ Computer Education Techniques, Inc.
Chapter 2 TOPOLOGY SELECTION SYS-ED/ Computer Education Techniques, Inc. Objectives You will learn: Topology selection criteria. Perform a comparison of topology selection criteria. WebSphere component
A Framework for Performance Analysis and Tuning in Hadoop Based Clusters
A Framework for Performance Analysis and Tuning in Hadoop Based Clusters Garvit Bansal Anshul Gupta Utkarsh Pyne LNMIIT, Jaipur, India Email: [garvit.bansal anshul.gupta utkarsh.pyne] @lnmiit.ac.in Manish
Muse Server Sizing. 18 June 2012. Document Version 0.0.1.9 Muse 2.7.0.0
Muse Server Sizing 18 June 2012 Document Version 0.0.1.9 Muse 2.7.0.0 Notice No part of this publication may be reproduced stored in a retrieval system, or transmitted, in any form or by any means, without
An Empirical Study and Analysis of the Dynamic Load Balancing Techniques Used in Parallel Computing Systems
An Empirical Study and Analysis of the Dynamic Load Balancing Techniques Used in Parallel Computing Systems Ardhendu Mandal and Subhas Chandra Pal Department of Computer Science and Application, University
Application Performance Testing Basics
Application Performance Testing Basics ABSTRACT Todays the web is playing a critical role in all the business domains such as entertainment, finance, healthcare etc. It is much important to ensure hassle-free
Keywords: Dynamic Load Balancing, Process Migration, Load Indices, Threshold Level, Response Time, Process Age.
Volume 3, Issue 10, October 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Load Measurement
An Oracle White Paper July 2011. Oracle Primavera Contract Management, Business Intelligence Publisher Edition-Sizing Guide
Oracle Primavera Contract Management, Business Intelligence Publisher Edition-Sizing Guide An Oracle White Paper July 2011 1 Disclaimer The following is intended to outline our general product direction.
The Importance of Software License Server Monitoring
The Importance of Software License Server Monitoring NetworkComputer How Shorter Running Jobs Can Help In Optimizing Your Resource Utilization White Paper Introduction Semiconductor companies typically
Dynamic Resource Allocation in Software Defined and Virtual Networks: A Comparative Analysis
Dynamic Resource Allocation in Software Defined and Virtual Networks: A Comparative Analysis Felipe Augusto Nunes de Oliveira - GRR20112021 João Victor Tozatti Risso - GRR20120726 Abstract. The increasing
AN OVERVIEW OF QUALITY OF SERVICE COMPUTER NETWORK
Abstract AN OVERVIEW OF QUALITY OF SERVICE COMPUTER NETWORK Mrs. Amandeep Kaur, Assistant Professor, Department of Computer Application, Apeejay Institute of Management, Ramamandi, Jalandhar-144001, Punjab,
Cisco Integrated Services Routers Performance Overview
Integrated Services Routers Performance Overview What You Will Learn The Integrated Services Routers Generation 2 (ISR G2) provide a robust platform for delivering WAN services, unified communications,
How To Provide Qos Based Routing In The Internet
CHAPTER 2 QoS ROUTING AND ITS ROLE IN QOS PARADIGM 22 QoS ROUTING AND ITS ROLE IN QOS PARADIGM 2.1 INTRODUCTION As the main emphasis of the present research work is on achieving QoS in routing, hence this
Flash Memory Arrays Enabling the Virtualized Data Center. July 2010
Flash Memory Arrays Enabling the Virtualized Data Center July 2010 2 Flash Memory Arrays Enabling the Virtualized Data Center This White Paper describes a new product category, the flash Memory Array,
CHAPTER 4: SOFTWARE PART OF RTOS, THE SCHEDULER
CHAPTER 4: SOFTWARE PART OF RTOS, THE SCHEDULER To provide the transparency of the system the user space is implemented in software as Scheduler. Given the sketch of the architecture, a low overhead scheduler
This Unit: Multithreading (MT) CIS 501 Computer Architecture. Performance And Utilization. Readings
This Unit: Multithreading (MT) CIS 501 Computer Architecture Unit 10: Hardware Multithreading Application OS Compiler Firmware CU I/O Memory Digital Circuits Gates & Transistors Why multithreading (MT)?
Leveraging EMC Fully Automated Storage Tiering (FAST) and FAST Cache for SQL Server Enterprise Deployments
Leveraging EMC Fully Automated Storage Tiering (FAST) and FAST Cache for SQL Server Enterprise Deployments Applied Technology Abstract This white paper introduces EMC s latest groundbreaking technologies,
IT Service Management
IT Service Management Service Continuity Methods (Disaster Recovery Planning) White Paper Prepared by: Rick Leopoldi May 25, 2002 Copyright 2001. All rights reserved. Duplication of this document or extraction
Adaptive Tolerance Algorithm for Distributed Top-K Monitoring with Bandwidth Constraints
Adaptive Tolerance Algorithm for Distributed Top-K Monitoring with Bandwidth Constraints Michael Bauer, Srinivasan Ravichandran University of Wisconsin-Madison Department of Computer Sciences {bauer, srini}@cs.wisc.edu
Scaling Microsoft SQL Server
Recommendations and Techniques for Scaling Microsoft SQL To support many more users, a database must easily scale out as well as up. This article describes techniques and strategies for scaling out the
Load Balancing in Distributed Data Base and Distributed Computing System
Load Balancing in Distributed Data Base and Distributed Computing System Lovely Arya Research Scholar Dravidian University KUPPAM, ANDHRA PRADESH Abstract With a distributed system, data can be located
CHAPTER 5 FINITE STATE MACHINE FOR LOOKUP ENGINE
CHAPTER 5 71 FINITE STATE MACHINE FOR LOOKUP ENGINE 5.1 INTRODUCTION Finite State Machines (FSMs) are important components of digital systems. Therefore, techniques for area efficiency and fast implementation
18-742 Lecture 4. Parallel Programming II. Homework & Reading. Page 1. Projects handout On Friday Form teams, groups of two
age 1 18-742 Lecture 4 arallel rogramming II Spring 2005 rof. Babak Falsafi http://www.ece.cmu.edu/~ece742 write X Memory send X Memory read X Memory Slides developed in part by rofs. Adve, Falsafi, Hill,
Five Essential Components for Highly Reliable Data Centers
GE Intelligent Platforms Five Essential Components for Highly Reliable Data Centers Ensuring continuous operations with an integrated, holistic technology strategy that provides high availability, increased
Intel Ethernet Switch Load Balancing System Design Using Advanced Features in Intel Ethernet Switch Family
Intel Ethernet Switch Load Balancing System Design Using Advanced Features in Intel Ethernet Switch Family White Paper June, 2008 Legal INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL
Rapid Bottleneck Identification A Better Way to do Load Testing. An Oracle White Paper June 2009
Rapid Bottleneck Identification A Better Way to do Load Testing An Oracle White Paper June 2009 Rapid Bottleneck Identification A Better Way to do Load Testing. RBI combines a comprehensive understanding
Cost Model: Work, Span and Parallelism. 1 The RAM model for sequential computation:
CSE341T 08/31/2015 Lecture 3 Cost Model: Work, Span and Parallelism In this lecture, we will look at how one analyze a parallel program written using Cilk Plus. When we analyze the cost of an algorithm
2004 Networks UK Publishers. Reprinted with permission.
Riikka Susitaival and Samuli Aalto. Adaptive load balancing with OSPF. In Proceedings of the Second International Working Conference on Performance Modelling and Evaluation of Heterogeneous Networks (HET
A Dell Technical White Paper Dell Compellent
The Architectural Advantages of Dell Compellent Automated Tiered Storage A Dell Technical White Paper Dell Compellent THIS WHITE PAPER IS FOR INFORMATIONAL PURPOSES ONLY, AND MAY CONTAIN TYPOGRAPHICAL
Juniper Networks QFabric: Scaling for the Modern Data Center
Juniper Networks QFabric: Scaling for the Modern Data Center Executive Summary The modern data center has undergone a series of changes that have significantly impacted business operations. Applications
Establishing How Many VoIP Calls a Wireless LAN Can Support Without Performance Degradation
Establishing How Many VoIP Calls a Wireless LAN Can Support Without Performance Degradation ABSTRACT Ángel Cuevas Rumín Universidad Carlos III de Madrid Department of Telematic Engineering Ph.D Student
- An Essential Building Block for Stable and Reliable Compute Clusters
Ferdinand Geier ParTec Cluster Competence Center GmbH, V. 1.4, March 2005 Cluster Middleware - An Essential Building Block for Stable and Reliable Compute Clusters Contents: Compute Clusters a Real Alternative
Windows Server 2008 R2 Hyper-V Live Migration
Windows Server 2008 R2 Hyper-V Live Migration Table of Contents Overview of Windows Server 2008 R2 Hyper-V Features... 3 Dynamic VM storage... 3 Enhanced Processor Support... 3 Enhanced Networking Support...
Client/Server Computing Distributed Processing, Client/Server, and Clusters
Client/Server Computing Distributed Processing, Client/Server, and Clusters Chapter 13 Client machines are generally single-user PCs or workstations that provide a highly userfriendly interface to the
All-Flash Arrays Weren t Built for Dynamic Environments. Here s Why... This whitepaper is based on content originally posted at www.frankdenneman.
WHITE PAPER All-Flash Arrays Weren t Built for Dynamic Environments. Here s Why... This whitepaper is based on content originally posted at www.frankdenneman.nl 1 Monolithic shared storage architectures
Windows Server 2008 R2 Hyper-V Live Migration
Windows Server 2008 R2 Hyper-V Live Migration White Paper Published: August 09 This is a preliminary document and may be changed substantially prior to final commercial release of the software described
The Classical Architecture. Storage 1 / 36
1 / 36 The Problem Application Data? Filesystem Logical Drive Physical Drive 2 / 36 Requirements There are different classes of requirements: Data Independence application is shielded from physical storage
Technical White Paper. Symantec Backup Exec 10d System Sizing. Best Practices For Optimizing Performance of the Continuous Protection Server
Symantec Backup Exec 10d System Sizing Best Practices For Optimizing Performance of the Continuous Protection Server Table of Contents Table of Contents...2 Executive Summary...3 System Sizing and Performance
Performance Prediction, Sizing and Capacity Planning for Distributed E-Commerce Applications
Performance Prediction, Sizing and Capacity Planning for Distributed E-Commerce Applications by Samuel D. Kounev ([email protected]) Information Technology Transfer Office Abstract Modern e-commerce
How To Build A Clustered Storage Area Network (Csan) From Power All Networks
Power-All Networks Clustered Storage Area Network: A scalable, fault-tolerant, high-performance storage system. Power-All Networks Ltd Abstract: Today's network-oriented computing environments require
From Ethernet Ubiquity to Ethernet Convergence: The Emergence of the Converged Network Interface Controller
White Paper From Ethernet Ubiquity to Ethernet Convergence: The Emergence of the Converged Network Interface Controller The focus of this paper is on the emergence of the converged network interface controller
SQL Server 2012 Performance White Paper
Published: April 2012 Applies to: SQL Server 2012 Copyright The information contained in this document represents the current view of Microsoft Corporation on the issues discussed as of the date of publication.
Powerful Duo: MapR Big Data Analytics with Cisco ACI Network Switches
Powerful Duo: MapR Big Data Analytics with Cisco ACI Network Switches Introduction For companies that want to quickly gain insights into or opportunities from big data - the dramatic volume growth in corporate
Network Performance Monitoring at Small Time Scales
Network Performance Monitoring at Small Time Scales Konstantina Papagiannaki, Rene Cruz, Christophe Diot Sprint ATL Burlingame, CA [email protected] Electrical and Computer Engineering Department University
Parallel Firewalls on General-Purpose Graphics Processing Units
Parallel Firewalls on General-Purpose Graphics Processing Units Manoj Singh Gaur and Vijay Laxmi Kamal Chandra Reddy, Ankit Tharwani, Ch.Vamshi Krishna, Lakshminarayanan.V Department of Computer Engineering
A Study of Network Security Systems
A Study of Network Security Systems Ramy K. Khalil, Fayez W. Zaki, Mohamed M. Ashour, Mohamed A. Mohamed Department of Communication and Electronics Mansoura University El Gomhorya Street, Mansora,Dakahlya
Oracle Real Time Decisions
A Product Review James Taylor CEO CONTENTS Introducing Decision Management Systems Oracle Real Time Decisions Product Architecture Key Features Availability Conclusion Oracle Real Time Decisions (RTD)
Load Balancing on a Non-dedicated Heterogeneous Network of Workstations
Load Balancing on a Non-dedicated Heterogeneous Network of Workstations Dr. Maurice Eggen Nathan Franklin Department of Computer Science Trinity University San Antonio, Texas 78212 Dr. Roger Eggen Department
Energy Efficient MapReduce
Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing
Per-Flow Queuing Allot's Approach to Bandwidth Management
White Paper Per-Flow Queuing Allot's Approach to Bandwidth Management Allot Communications, July 2006. All Rights Reserved. Table of Contents Executive Overview... 3 Understanding TCP/IP... 4 What is Bandwidth
Using VMware VMotion with Oracle Database and EMC CLARiiON Storage Systems
Using VMware VMotion with Oracle Database and EMC CLARiiON Storage Systems Applied Technology Abstract By migrating VMware virtual machines from one physical environment to another, VMware VMotion can
2. Research and Development on the Autonomic Operation. Control Infrastructure Technologies in the Cloud Computing Environment
R&D supporting future cloud computing infrastructure technologies Research and Development on Autonomic Operation Control Infrastructure Technologies in the Cloud Computing Environment DEMPO Hiroshi, KAMI
INCREASE NETWORK VISIBILITY AND REDUCE SECURITY THREATS WITH IMC FLOW ANALYSIS TOOLS
WHITE PAPER INCREASE NETWORK VISIBILITY AND REDUCE SECURITY THREATS WITH IMC FLOW ANALYSIS TOOLS Network administrators and security teams can gain valuable insight into network health in real-time by
Load Balancing Mechanisms in Data Center Networks
Load Balancing Mechanisms in Data Center Networks Santosh Mahapatra Xin Yuan Department of Computer Science, Florida State University, Tallahassee, FL 33 {mahapatr,xyuan}@cs.fsu.edu Abstract We consider
How To Test For Elulla
EQUELLA Whitepaper Performance Testing Carl Hoffmann Senior Technical Consultant Contents 1 EQUELLA Performance Testing 3 1.1 Introduction 3 1.2 Overview of performance testing 3 2 Why do performance testing?
How To Build A Cloud Computer
Introducing the Singlechip Cloud Computer Exploring the Future of Many-core Processors White Paper Intel Labs Jim Held Intel Fellow, Intel Labs Director, Tera-scale Computing Research Sean Koehl Technology
TÓPICOS AVANÇADOS EM REDES ADVANCED TOPICS IN NETWORKS
Mestrado em Engenharia de Redes de Comunicações TÓPICOS AVANÇADOS EM REDES ADVANCED TOPICS IN NETWORKS 2009-2010 Projecto de Rede / Sistema - Network / System Design 1 Hierarchical Network Design 2 Hierarchical
4 Internet QoS Management
4 Internet QoS Management Rolf Stadler School of Electrical Engineering KTH Royal Institute of Technology [email protected] September 2008 Overview Network Management Performance Mgt QoS Mgt Resource Control
Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging
Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging In some markets and scenarios where competitive advantage is all about speed, speed is measured in micro- and even nano-seconds.
Architectures and Platforms
Hardware/Software Codesign Arch&Platf. - 1 Architectures and Platforms 1. Architecture Selection: The Basic Trade-Offs 2. General Purpose vs. Application-Specific Processors 3. Processor Specialisation
Chapter 1 - Web Server Management and Cluster Topology
Objectives At the end of this chapter, participants will be able to understand: Web server management options provided by Network Deployment Clustered Application Servers Cluster creation and management
An Oracle White Paper February 2010. Rapid Bottleneck Identification - A Better Way to do Load Testing
An Oracle White Paper February 2010 Rapid Bottleneck Identification - A Better Way to do Load Testing Introduction You re ready to launch a critical Web application. Ensuring good application performance
BROCADE PERFORMANCE MANAGEMENT SOLUTIONS
Data Sheet BROCADE PERFORMANCE MANAGEMENT SOLUTIONS SOLUTIONS Managing and Optimizing the Performance of Mainframe Storage Environments HIGHLIGHTs Manage and optimize mainframe storage performance, while
