Measuring MPI Send and Receive Overhead and Application Availability in High Performance Network Interfaces
Douglas Doerfler and Ron Brightwell
Center for Computation, Computers, Information and Math
Sandia National Laboratories¹, Albuquerque, NM
{dwdoerf, rbbrigh}@sandia.gov

Abstract. In evaluating new high-speed network interfaces, the usual metrics of latency and bandwidth are commonly measured and reported. There are numerous other message passing characteristics that can have a dramatic effect on application performance and that should be analyzed when evaluating a new interconnect. One such metric is overhead, which dictates the network's ability to allow the application to perform non-message-passing work while a transfer is taking place. A method for measuring overhead, and hence calculating application availability, is presented. Results for several next-generation network interfaces are also presented.

Keywords: MPI, Overhead, Availability, High Performance Computing, High Speed Networks.

1 Introduction

Scaling efficiency of parallel applications in many instances depends on the ability to overlap communication with computation. If there is sufficient computation to overlap with communication, the application becomes insensitive to the bandwidth provided by the network. Overlap is also beneficial for inherently communication-bound codes. In this instance the overhead of preparing the next messages can be overlapped with the transmission of the messages already in the send queue. In MPI application codes, the non-blocking send and receive calls are the primary means of achieving overlap. Unlike other MPI communication metrics, e.g. latency and bandwidth, there is a lack of readily available open-source micro-benchmarks that measure MPI overhead for non-blocking calls.
This paper presents a method for measuring overhead and application availability and then applies this method to several current state-of-the-art high-performance network interfaces. It is not within the scope of this paper to explain why some interconnects and protocols provide low overhead and high availability.

¹ Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.

2 Method

There are multiple methods an application can use to overlap computation and communication using MPI. The method assumed by this paper is the post-work-wait loop: the application uses the MPI non-blocking send and receive calls, MPI_Isend() and MPI_Irecv(), to initiate the respective transfer, performs some work, and then waits for the transfer to complete using MPI_Wait(). This method is typical of most applications and hence gives the most realistic measure for a microbenchmark. Periodic polling methods have also been analyzed [1], but that particular method only makes sense if the application knows that progress will not be made without periodic MPI calls during the transfer.

Overhead is defined to be [2]: "the overhead, defined as the length of time that a processor is engaged in the transmission or reception of each message; during this time, the processor cannot perform other operations."

Application availability is defined to be the fraction of total transfer time² that the application is free to perform non-MPI related work.

$$\text{Application Availability} = 1 - \frac{\text{overhead}}{\text{transfer time}} \qquad (1)$$

Figure 1 illustrates the method used for determining the overhead time and the message transfer time. For each iteration of the post-work-wait loop, the amount of work performed (work_t), which is overlapped in time with the message transfer, increases, and the total amount of time for the loop to complete (iter_t) is measured. If the work interval is small, it completes before the message transfer is complete. At some point the work interval is greater than the message transfer time and the message transfer completes first.
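The post-work-wait sweep can be sketched with a simulated timing model. This is a minimal sketch, not the authors' benchmark code: it makes no MPI calls, `simulated_iter_t` is an assumed idealized model in which the loop time is the larger of the transfer time and the work plus overhead, and the names `measure`, `base_factor`, and `stop_factor` are hypothetical. The two factors stand in for the base_t and loop-stop thresholds of this section, expressed as multiples of the loop time observed with no work.

```python
def application_availability(overhead_t, transfer_t):
    """Equation (1): fraction of the transfer time free for non-MPI work."""
    return 1.0 - overhead_t / transfer_t


def simulated_iter_t(work_t, transfer_t, overhead_t):
    """Idealized post-work-wait loop time (an assumed model, not a measurement).

    The loop cannot end before the transfer does, and the host always pays
    overhead_t inside the MPI calls in addition to its own work.
    """
    return max(transfer_t, work_t + overhead_t)


def measure(transfer_t, overhead_t, work_step, base_factor=1.05, stop_factor=2.0):
    """Sweep work_t upward and recover (transfer time, overhead) from loop times."""
    base_sum, base_n = 0.0, 0
    first_t = None  # loop time with no work: the initial transfer-time estimate
    work_t = 0.0
    while True:
        iter_t = simulated_iter_t(work_t, transfer_t, overhead_t)
        if first_t is None:
            first_t = iter_t
        # Average only the samples taken before the work starts to dominate.
        if iter_t <= base_factor * first_t:
            base_sum += iter_t
            base_n += 1
        # Stop well past the knee of the curve.
        if iter_t > stop_factor * first_t:
            break
        work_t += work_step
    est_transfer_t = base_sum / base_n
    # A real benchmark would time the same work again without a message and
    # subtract that; in this model the work-only time is simply work_t.
    est_overhead_t = iter_t - work_t
    return est_transfer_t, est_overhead_t


if __name__ == "__main__":
    # Hypothetical inputs in arbitrary time units.
    transfer, overhead = measure(transfer_t=100.0, overhead_t=7.0, work_step=10.0)
    print(transfer, overhead, application_availability(overhead, transfer))
```

With these hypothetical inputs the sketch recovers the transfer time and overhead it was given using only the loop times. A real measurement would replace `simulated_iter_t` with a timed MPI_Isend() or MPI_Irecv(), work, MPI_Wait() sequence.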
Once the work interval exceeds the message transfer time, the loop time becomes the amount of time required to perform the work plus the overhead time required by the host processor to complete the transfer. The overhead can then be calculated by measuring the amount of time used to perform the same amount of work without overlapping a message transfer and subtracting this value from the loop time.

The message transfer time is equal to the loop time before the work interval becomes the dominant factor. To get an accurate estimate of the transfer time, the loop time values are accumulated and averaged, but only those values measured before the work interval starts to contribute to the loop time. The values used in the average are selected by comparing the iteration time to a given threshold (base_t). This threshold must be set high enough to avoid a premature stop in the accumulation of the values used for the average calculation, but not so high as to include values measured after the work becomes a factor. The method does not automatically determine the threshold value. It is best to determine it empirically for a given system by trying different values and observing the results in verbose mode. A typical value is 1.02 to 1.05 times the message transfer time.

Figure 1 also shows an iteration loop stop threshold (iter_t). This threshold is not critical and can be of any value as long as the total loop time is significantly larger than the transfer time. A typical value is 1.5 to 2 times the transfer time. In theory, the method could stop when the base_t threshold is exceeded, but in practice this point can be too close to the knee of the curve to provide a reliable measurement. In addition, it is not necessary to calculate the work interval without messaging until the final sample has been taken.

² Per the MPI non-blocking call definitions, the MPI_Wait() call only signifies that for a send the buffer can be reused and for a receive the data can be accessed in the receive buffer [3].

Fig. 1. A conceptual illustration of the post-work-wait loop time (iter_t) of a given message size for each iteration of the algorithm, with the work performed (work_t) increasing for each iteration. The message transfer time calculation threshold (base_t) and the iteration stop threshold (iter_t) are also shown along with the point at which the overhead calculation is taken.

3 Platforms

Overhead and availability were measured on a variety of platforms, summarized in Table 1. All of the platforms except Red Storm are Linux clusters using the respective vendor's commercial software stacks. The Thunderbird cluster's MPI software stack has been modified and parameters have been set to reduce the memory required by the MPI stack at a scale of several hundred to a thousand processes. These modifications do affect the real-world application performance, but it is unknown how those modifications affect the MPI overhead microbenchmark used in this analysis. The Red Storm platform uses the Catamount lightweight kernel [4], with low-level communications implemented using the Portals API [5]. All of the platforms use MPICH 1.x for their implementation of MPI, although several of these implementations have been optimized for their respective network interface. In particular, many vendors have optimized the collective communication routines. The Quadrics software stack uses a patched kernel, which allows optimizations benefiting overhead and host availability performance.

Table 1. Overview of Test Platforms

                          Red Storm    Thunderbird      CBC-B       Odin         Red Squall
Interconnect              Seastar 1.2  InfiniBand       InfiniBand  Myrinet 10G  QsNetII
Manufacturer              Cray         Cisco/Topspin    PathScale   Myricom      Quadrics
Adaptor                   Custom       PCI-Express HCA  InfiniPath  Myri-10G     Elan4
Host Interface            HT 1.0       PCI-Express      HT 1.0      PCI-Express  PCI-X
Programmable coprocessor  Yes          No               No          Yes          Yes
MPI                       MPICH-1      MVAPICH          InfiniPath  MPICH-MX     MPICH QsNet

4 Results

From a practical perspective, application availability is usually not a concern for small message sizes, as there is little to be gained by trying to overlap computation with communication when transfer times are relatively small. Most applications will only try to overlap computation when they know the message size is sufficiently large. However, as an academic exercise, it may still be interesting to view availability for a small message, as it provides information on how an interface's characteristics change at a protocol boundary, such as the switch from a short message protocol to a large message protocol. An application writer who is trying to optimize for a given platform may want to know where the protocol boundaries are and modify the code to better suit the platform.
Overlap may also be beneficial to codes that need to send multiple small messages at a time. In this case, overlap allows preparation of the next message to be put in the queue while the messages already in the queue are being transmitted. However, this is the message throughput metric and is not within the scope of this study.

Figure 2 illustrates the MPI_Isend() overhead as a function of message size for the platforms tested³. Figure 3 shows application availability. The overhead for the Red Storm, Odin (Myri-10G) and Red Squall (Elan4) interconnects is relatively constant for all message sizes. As such, application availability increases with message size until it is nearly 100% for large message transfers. The Thunderbird (InfiniBand) and CBC (InfiniPath) interconnects show a high overhead for large message transfers, with a corresponding drop in application availability. It should be noted that the InfiniPath network has a relatively low overhead for small transfers, which allows that interconnect to achieve its high, advertised message throughput rate.

³ Note that this figure uses a logarithmic axis for overhead.

Fig. 2. Overhead as a function of message size for MPI_Isend().
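To see why a constant overhead drives availability toward 100% while an overhead that grows with message size keeps it low, Equation (1) can be evaluated under two assumed overhead models. Every number below (latency, bandwidth, overhead constants) is hypothetical and chosen only for illustration. None of it is measured data from the platforms in Table 1, and the function names are likewise made up for this sketch.

```python
def application_availability(overhead_t, transfer_t):
    """Equation (1): fraction of the transfer time free for non-MPI work."""
    return 1.0 - overhead_t / transfer_t


def transfer_time_us(size_bytes, latency_us=5.0, bytes_per_us=1000.0):
    """Hypothetical transfer-time model: fixed latency plus size / bandwidth."""
    return latency_us + size_bytes / bytes_per_us


def constant_overhead_us(size_bytes, fixed_us=2.0):
    """Hypothetical host overhead that does not depend on message size."""
    return fixed_us


def linear_overhead_us(size_bytes, fixed_us=2.0, per_byte_us=0.0008):
    """Hypothetical host overhead that grows linearly with message size."""
    return fixed_us + per_byte_us * size_bytes


def availability_table(sizes):
    """Return (size, availability with constant overhead, with linear overhead)."""
    rows = []
    for size in sizes:
        t = transfer_time_us(size)
        rows.append((
            size,
            application_availability(constant_overhead_us(size), t),
            application_availability(linear_overhead_us(size), t),
        ))
    return rows


if __name__ == "__main__":
    for size, const_avail, linear_avail in availability_table([1024, 32768, 1048576]):
        print(f"{size:>8} bytes  constant: {const_avail:.3f}  linear: {linear_avail:.3f}")
```

Under these assumptions the constant-overhead model approaches full availability as the message grows, because the fixed cost is amortized over a longer transfer. The linear model levels off at a low value, because overhead grows at nearly the same rate as the transfer time.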
Fig. 3. Application availability as a function of message size for MPI_Isend().

MPI receive performance is charted in Figures 4 and 5. In general, receive performance is similar to the send performance for all of the interconnects tested. The Odin (Myri-10G) cluster does exhibit a more noticeable drop in application availability until the 32K byte message size, which is presumably a protocol boundary. After this point availability increases to an asymptotic value of 100%.

Fig. 4. Overhead as a function of message size for MPI_Irecv().
Fig. 5. Application availability as a function of message size for MPI_Irecv().

5 Related Work

A significant amount of prior work has been done to measure and study the effect of overhead on application performance [1], [6], [7], [8] and [9]. Lawry [1] analyzes application availability, but the analysis and results are for a fixed message size and the results are a function of the polling interval. The other previous work does not quantify the overhead as a function of message size, but rather looks at its effect on application performance. An additional contribution of this paper is a comparison of overhead results for relatively new networking technologies, such as Red Storm's SeaStar, PathScale's InfiniPath, and Myricom's Myri-10G.

6 Conclusion

Simple ping-pong micro-benchmarks do not accurately capture all of the capabilities of a high-performance network. Host overhead and the ability to overlap computation with communication are important performance characteristics that can have a direct impact on an application's scalability. Two networks that have similar latency and bandwidth performance can vary significantly in their ability to provide overlap. This paper presented a method for measuring overhead and application availability for high-speed networks using MPI and then applied the method to five test platforms, each with a different network interface. Performance for MPI send and MPI receive operations was presented. In general, the send and receive characteristics for a given interconnect were similar. The Red Storm, Odin (Myri-10G) and Red Squall (Elan4) platforms demonstrated a relatively small overhead as a function of message size, and thus showed high application availability for all message sizes. The CBC (InfiniPath) platform demonstrated excellent small message overhead, but for large messages overhead increased linearly with message size and application availability was very low. The Thunderbird (InfiniBand) cluster demonstrated good small message overhead, but like the CBC cluster its large message overhead is high and application availability is low.

7 Future Work

It is the intent of the authors to make the source code used in this study generally available and downloadable from an open web site, with the hope that this will allow overhead and application availability to become a common microbenchmark used in the evaluation of interconnects. We also expect that this will encourage contributions from the community to make the code more robust and accurate.

8 References

1. W. Lawry, C. Wilson, A. Maccabe, R. Brightwell. COMB: A Portable Benchmark Suite for Assessing MPI Overlap. In Proceedings of the IEEE International Conference on Cluster Computing (CLUSTER 2002), p. 472.
2. D. Culler, R. Karp, D. Patterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian and T. von Eicken. LogP: Towards a Realistic Model of Parallel Computation. In Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.
3. M. Snir, S. W. Otto, S. Huss-Lederman, D. W. Walker, J. Dongarra. MPI: The Complete Reference. p. 52, The MIT Press, Cambridge, Massachusetts.
4. S. Kelly, R. Brightwell. Software Architecture of the Light Weight Kernel, Catamount. In Proceedings of the 47th Cray User Group (CUG 2005).
5. Portals API.
6. R. Martin, A. M. Vahdat, D. E. Culler, T. E. Anderson. The Effects of Communication Latency, Overhead, and Bandwidth in a Cluster Architecture. In Proceedings of the International Symposium on Computer Architecture.
7. D. Culler, L. T. Liu, R. P. Martin, C. O. Yoshikawa. Assessing Fast Network Interfaces. IEEE Micro, Feb.
8. C. Bell, D. Bonachea, Y. Cote, J. Duell, P. Hargrove, P. Husbands, C. Iancu, M. Welcome, K. Yelick. An Evaluation of Current High-Performance Networks. In Proceedings IEEE International Parallel & Distributed Processing Symposium (IPDPS '03).
9. R. Brightwell, D. Doerfler, K. D. Underwood. A Preliminary Analysis of the InfiniPath and XD1 Interfaces. In Proceedings IEEE International Parallel & Distributed Processing Symposium (IPDPS '06), 2006.
Cluster Implementation and Management; Scheduling
Cluster Implementation and Management; Scheduling CPS343 Parallel and High Performance Computing Spring 2013 CPS343 (Parallel and HPC) Cluster Implementation and Management; Scheduling Spring 2013 1 /
HPC Software Requirements to Support an HPC Cluster Supercomputer
HPC Software Requirements to Support an HPC Cluster Supercomputer Susan Kraus, Cray Cluster Solutions Software Product Manager Maria McLaughlin, Cray Cluster Solutions Product Marketing Cray Inc. WP-CCS-Software01-0417
Low Latency Test Report Ultra-Low Latency 10GbE Switch and Adapter Testing Bruce Tolley, PhD, Solarflare
Ultra-Low Latency 10GbE Switch and Adapter Testing Bruce Tolley, PhD, Solarflare Testing conducted by Solarflare and Fujitsu shows that ultra low latency application-level performance can be achieved with
4 High-speed Transmission and Interoperability
4 High-speed Transmission and Interoperability Technology 4-1 Transport Protocols for Fast Long-Distance Networks: Comparison of Their Performances in JGN KUMAZOE Kazumi, KOUYAMA Katsushi, HORI Yoshiaki,
Real Time Network Server Monitoring using Smartphone with Dynamic Load Balancing
www.ijcsi.org 227 Real Time Network Server Monitoring using Smartphone with Dynamic Load Balancing Dhuha Basheer Abdullah 1, Zeena Abdulgafar Thanoon 2, 1 Computer Science Department, Mosul University,
D5.6 Prototype demonstration of performance monitoring tools on a system with multiple ARM boards Version 1.0
D5.6 Prototype demonstration of performance monitoring tools on a system with multiple ARM boards Document Information Contract Number 288777 Project Website www.montblanc-project.eu Contractual Deadline
Architecting Low Latency Cloud Networks
Architecting Low Latency Cloud Networks Introduction: Application Response Time is Critical in Cloud Environments As data centers transition to next generation virtualized & elastic cloud architectures,
