Interconnect Efficiency of Tyan PSC T-630 with Microsoft Compute Cluster Server 2003
Josef Pelikán
Charles University in Prague, KSVI Department

Abstract¹

Interconnect quality is very important in distributed high-performance computing. We were interested in the efficiency of Microsoft's implementation of the Message-Passing Interface (MPI-2) included in Compute Cluster Server 2003. An experimental distributed application (master-slave model) was implemented and executed on a 20-core cluster equipped with a Gigabit Ethernet interconnect. Using different configurations and setups we were able to estimate some parameters of real-world network traffic: the overhead of a single MPI message, the maximum network load on the master node, and speedup factors for our well-parallelizable problem.

Keywords: parallel programming, Compute Cluster Server, cluster interconnect, MPI, SHA-1, VUCAKO Bench

Introduction

Efficient parallel programming is an important objective today, when physical limits are constraining the growth of single-core computing power. CPUs with more cores on a die are therefore being manufactured, and SIMD architectures or multi-node computer systems have been used in high-performance computing for a while. We will concentrate on multi-node (distributed) computing. The outline of the architecture is simple: there is a number of processors (nodes), each with its own memory. Nodes are connected together by an interconnect (IC) network; grid or ring topologies (to mention the most important ones) are commonly used. Nodes need not have similar computing power or even the same architecture; Google is one well-known user of massive heterogeneous distributed computing systems.

Throughput of the IC appears to be the most crucial attribute influencing the overall efficiency of the computation. Of course there are job classes more sensitive to IC efficiency (needing more inter-node communication); on the other hand, many problems are well-parallelizable without intensive communication needs. A software architect should know the technical parameters of the distributed system and its IC in both cases.

There are several software libraries that help developers write programs for distributed systems efficiently and comfortably [1]. PVM (Parallel Virtual Machine) and MPI (Message-Passing Interface) are the two main systems based on asynchronous message passing between computing nodes. We will focus on MPI [2], especially on Microsoft's implementation of MPI-2 included in Compute Cluster Server 2003 [3].

One model for parallel programming of suitable problems is called master-slave (for details see any parallel-programming textbook or [4]). One node (the master) controls the computation and distributes work (work units, WU) to identically operating slave nodes. There might be one type of WU, several WU types on the same level, or even a complicated system of WUs connected in a dependence graph; this is not important for our further discussion.

¹ Internal report describing technical research made in the summer of 2007 on the Tyan PSC T-630 cluster placed at Silicon Hill in Prague, 15 Nov.
A simple master-slave system must perform the following actions in one WU cycle ({m} stands for master, {s} for slave, {n} for network):

1. {m} assembly of input data (WU); the concrete slave is determined
2. {n} the WU is transferred to the slave
3. {s} the WU is executed/computed by the slave
4. {n} results are transmitted back to the master
5. {m} result processing/merging, if applicable

Note that in the case of asynchronous network transfer, the master only needs to initiate step 2; another part of the master's code is woken up at the end of step 4. Properly implemented master code should be busy in steps 1 and 5 only; slave code should be utilized in step 3 and then initiate transfer 4. Details of one concrete problem follow.

Distributed SHA-1 digest

We chose a simple and well-parallelizable problem: distributed computation of a SHA-1 digest (for hash functions see [5]). 32GB of randomly generated data were divided into 1 million work units, 32KB each. The input data volume actually transferred over the network is only 4MB, because the full input data are reconstructed on the slave's side. After decoding the input data, the slave computes the SHA-1 digest using the well-known algorithm [6]. The result, a 20-byte binary array, is transferred back to the master, where it is accumulated (using binary XOR) into a global result array. Thanks to the commutativity of XOR, the order of result accumulation does not matter.

The SHA-1 digest of a 32KB array takes only about 200μs to compute, which is comparable to the message-passing overhead or network latency. Sending a single WU in one MPI message is therefore not very effective; this one-unit-per-message method turned out to be the least effective one used in our measurements. We therefore introduced work batches containing 1, 2, 4, 10 or 100 WUs. The input data size for one batch is 4, 8, 16, 40 or 400 bytes, respectively. The output (returning) data packet is always 20 bytes long; a partial XOR is computed on the slave's side. Comparing the computation efficiency of different batch sizes allows us to estimate several IC parameters (single-message overhead, latency).

Schematic code of the used master-worker system follows.

Master pseudocode:

  while ( anything-to-compute )
  {
    MPI_Recv( slave, resultFromSlave );
    if ( resultFromSlave != GREETING_RESULT )
      mergeResult( resultFromSlave );
    prepareUnit( newUnit );
    MPI_Send( slave, newUnit );
  }
  for ( i = 1; i < numberOfNodes; i++ )
    MPI_Send( i, QUIT );
  MPI_Finalize();

Slave pseudocode:

  MPI_Send( master, GREETING_RESULT );
  while ( true )
  {
    MPI_Recv( master, unitToProcess );
    if ( unitToProcess == QUIT )
      break;
    result = computeUnit( unitToProcess );
    MPI_Send( master, result );
  }
  MPI_Finalize();
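To make the protocol concrete, the following is a minimal sketch of how the pseudocode above could map onto real MPI-2 calls. It is an illustration only, not the actual mpi64.cpp source: the message tags, the computeBatch() helper, the seed-based data reconstruction and the placeholder sha1() routine (standing in for the public-domain implementation [6]) are all assumptions, and error handling is omitted.

  // mpi_master_slave_sketch.cpp -- illustrative only, not the original mpi64.cpp
  #include <mpi.h>
  #include <cstdint>
  #include <cstring>
  #include <cstdio>

  const int TAG_WORK = 1, TAG_QUIT = 2, TAG_GREETING = 3, TAG_RESULT = 4;
  const int BATCH = 10;                    // work units per message (the batch size)
  const long TOTAL_UNITS = 1000000L;       // 1 million 32KB units = 32GB of input data

  // Placeholder standing in for a real SHA-1 implementation [6]; replace in real use.
  void sha1(const unsigned char* data, size_t len, unsigned char digest[20])
  {
      std::memset(digest, 0, 20);
      for (size_t i = 0; i < len; i++) digest[i % 20] ^= data[i];   // NOT real SHA-1
  }

  // Slave-side work: reconstruct each 32KB block from its 4-byte descriptor
  // (assumed here to be a PRNG seed), hash it and XOR the digests together.
  void computeBatch(const uint32_t* units, int count, unsigned char result[20])
  {
      static unsigned char block[32 * 1024];
      std::memset(result, 0, 20);
      for (int i = 0; i < count; i++) {
          uint32_t seed = units[i];
          for (size_t j = 0; j < sizeof(block); j++)       // toy data reconstruction
              block[j] = (unsigned char)(seed = seed * 1664525u + 1013904223u);
          unsigned char digest[20];
          sha1(block, sizeof(block), digest);
          for (int j = 0; j < 20; j++) result[j] ^= digest[j];
      }
  }

  int main(int argc, char** argv)
  {
      MPI_Init(&argc, &argv);
      int rank, size;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      if (rank == 0) {                                      // master
          unsigned char global[20] = {0}, partial[20];
          uint32_t batch[BATCH] = {0};
          long sent = 0, received = 0, batches = TOTAL_UNITS / BATCH;
          MPI_Status st;
          while (received < batches + (size - 1)) {         // + one greeting per slave
              MPI_Recv(partial, 20, MPI_BYTE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                       MPI_COMM_WORLD, &st);
              received++;
              if (st.MPI_TAG == TAG_RESULT)                 // ignore greetings
                  for (int j = 0; j < 20; j++) global[j] ^= partial[j];
              if (sent < batches) {                         // prepare and send next batch
                  for (int j = 0; j < BATCH; j++) batch[j] = (uint32_t)(sent * BATCH + j);
                  MPI_Send(batch, BATCH, MPI_UNSIGNED, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
                  sent++;
              } else {                                      // nothing left: dismiss this slave
                  MPI_Send(batch, 0, MPI_UNSIGNED, st.MPI_SOURCE, TAG_QUIT, MPI_COMM_WORLD);
              }
          }
          std::printf("final XOR of digests starts with %02x%02x\n", global[0], global[1]);
      } else {                                              // slave
          unsigned char result[20] = {0};
          uint32_t batch[BATCH];
          MPI_Status st;
          MPI_Send(result, 20, MPI_BYTE, 0, TAG_GREETING, MPI_COMM_WORLD);
          for (;;) {
              MPI_Recv(batch, BATCH, MPI_UNSIGNED, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
              if (st.MPI_TAG == TAG_QUIT) break;
              computeBatch(batch, BATCH, result);
              MPI_Send(result, 20, MPI_BYTE, 0, TAG_RESULT, MPI_COMM_WORLD);
          }
      }
      MPI_Finalize();
      return 0;
  }

Because the master replies to whichever slave has just reported (via status.MPI_SOURCE), work distribution is automatically balanced even across slaves of unequal speed, which is exactly the property the master-slave model is used for.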
The actual C++ code was included in the open-source VUCAKO Bench project [7]. Standalone test #64 can be compiled from the mpi64.cpp source; the binary executable is named mpi64.exe and is deployed on Windows Compute Cluster Server [3] by the command

  job submit /numprocessors:12 /workdir:\\mscluster\<usr>$\ /stdout:out.txt mpiexec mpi64.exe -b <b>

where <usr> is the actual login name and <b> the batch size (1, 2, 4, 10, 100, ..). After the job finishes, the output text files contain run-time statistics and timings. The code of test #64 would probably run on other systems/MPI implementations as well, but we have tested it only on Windows CCS so far.
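The run-time statistics reported below (overall times, message rates, per-message processing times on the master) can be collected inside the MPI program itself. The fragment below is a minimal sketch of one way to do it; it is not taken from mpi64.cpp, the variable names and the "-b" parsing convention are assumptions, and the average processing time is simply defined here as elapsed time divided by the number of received messages.

  // timing_sketch.cpp -- illustrative fragment, not from mpi64.cpp
  #include <mpi.h>
  #include <cstdio>
  #include <cstdlib>
  #include <cstring>

  int main(int argc, char** argv)
  {
      MPI_Init(&argc, &argv);
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      // Batch size from the "-b <b>" option, defaulting to 1 (assumed convention).
      int batchSize = 1;
      for (int i = 1; i + 1 < argc; i++)
          if (std::strcmp(argv[i], "-b") == 0) batchSize = std::atoi(argv[i + 1]);

      if (rank == 0) {
          long messagesReceived = 0;
          double t0 = MPI_Wtime();                 // wall-clock timer provided by MPI

          // ... master receive/merge/send loop as in the sketch above,
          //     incrementing messagesReceived for every MPI_Recv ...

          double elapsed = MPI_Wtime() - t0;
          std::printf("batch size          : %d\n", batchSize);
          std::printf("overall time        : %.2f s\n", elapsed);
          if (messagesReceived > 0 && elapsed > 0.0) {
              std::printf("messages per second : %.0f\n", messagesReceived / elapsed);
              std::printf("avg processing time : %.0f us/message\n",
                          1e6 * elapsed / (double)messagesReceived);
          }
      }
      MPI_Finalize();
      return 0;
  }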
Test setup

We used the Tyan PSC T-630 cluster [8] installed at Silicon Hill (other details can be found in [9]). Some basic technical facts:

- 5 nodes (boards) connected together by a dedicated Gigabit Ethernet interconnect
- each board has two Intel Xeon 5148 dual-core processors installed, clock: 2.33GHz
- each board has 2GB of DDR II/667MHz DRAM modules installed
- each board has an 80GB SATA-II hard drive attached (not used in our problem)
- operating system on the head node: Microsoft Windows Compute Cluster Server 2003 [3]; the other nodes have a regular Microsoft Windows Server Standard installation

Our experimental C++ code was compiled using Microsoft Visual Studio 2005 Professional, with the help of additional libraries from the CCS installation (MPI communication).

We also performed many benchmark tests to rank the single-core efficiency of the installed processors. The VUCAKO Bench suite was used here too; detailed results can be found on its WWW pages [7]. For comparison purposes we show a couple of benchmark results (only CPU-related and memory-related ones, for details see [7]); lower numbers are better.
[Table of single-core benchmark results not reproduced.]

Distributed computing

The distributed SHA-1 digest was used for measuring the practical efficiency of the interconnect. The mpi64 task from VUCAKO Bench was compiled using Microsoft Visual Studio 2005 Professional and deployed several times in configurations with batch sizes 1 to 100, utilizing 1 to 19 slave processes (the 100-batch configuration was considered the low-network-traffic baseline, as it uses only hundreds of messages per second on average). Overall times in seconds were measured and averaged for four configurations: 1 slave (single chip), 3 slaves (single board), 7 slaves (two boards) and 19 slaves (the whole cluster).

The low-traffic batch configurations achieved parallel speedups of 3.00 and 6.89 for 3 and 7 slaves, which is close to the theoretical values for 4‰ of non-parallelizable code (Amdahl's law in [10]). Communication overhead is more pronounced in configurations with more than 3 slaves (when the Ethernet interconnect has to be used).

Overview of network traffic for individual configurations (average number of MPI messages from master to all slaves per second : average incoming-message processing time on the master):

               batch 1          batch 2          batch 4          batch 10         batch 100
  1 slave      4744 : 211μs     2410 : 415μs     1220 : 819μs      490 : 2039μs      50 : 20ms
  3 slaves    14230 : 70μs      7233 : 138μs     3658 : 273μs     1473 : 679μs      149 : 7ms
  7 slaves    21522 : 46μs     12594 : 79μs      6923 : 144μs     2789 : 359μs      342 : 3ms
  19 slaves   36958 : 27μs     24775 : 40μs     14698 : 68μs      6563 : 152μs      864 : 1ms

The next table contains MPI-overhead estimates (total overhead in seconds : overhead per single MPI message in μs). Total overhead times were computed by comparing the total computation time of a configuration to the 100-batch standard:

               batch 1       batch 2        batch 4        batch 10      batch 100
  1 slave      9.1 : 5μs     5.7 : 6μs      3.2 : 6μs      2.2 : 11μs    0.0 : 0μs
  3 slaves     3.1 : 2μs     2.0 : 2μs      1.2 : 2μs      0.7 : 4μs     0.0 : 0μs
  7 slaves    17.3 : 9μs    10.5 : 10μs     6.9 : 14μs     6.6 : 33μs    0.0 : 0μs
  19 slaves   15.6 : 8μs     8.7 : 9μs      5.5 : 11μs     3.7 : 18μs    0.0 : 0μs

Comparing the 4-batch and 10-batch cases to the 100-batch case, we can see that the master is able to process thousands of MPI messages per second (Mps) effectively, regardless of the slaves' location (same chip / same board / different boards). The MPI overhead can be estimated to be less than 6μs per message for single-board configurations (up to 3 slaves); the overhead ratio in such cases is less than 4% of the total computation time. If the interconnect has to be used (more than 3 slaves), one must expect much bigger communication overheads: typical values measured in our experiments were between 8μs and 20μs per message, and overhead ratios can claim up to 50% of the total computation time. Nevertheless we observed a peak throughput of 37k Mps (27μs/message) in this (least favourable) case.
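The per-message overhead figures above are consistent with counting two MPI messages per batch (the work unit sent to the slave and the 20-byte result sent back): the total overhead relative to the 100-batch baseline, divided by twice the number of batches, reproduces the tabulated values. The short sketch below redoes this arithmetic for two table entries and, for comparison, evaluates the Amdahl bound S(N) = 1 / (s + (1 - s)/N) from [10]; the serial fraction s = 0.004 used here is an illustrative assumption, not a measured value.

  // overhead_check.cpp -- sanity check of the numbers in the tables above
  #include <cstdio>

  // Per-message overhead: total overhead vs. the 100-batch baseline, divided by
  // the number of MPI messages (two per batch: work unit out, result back).
  double overheadPerMessage(double totalOverheadSec, long totalUnits, int batchSize)
  {
      long messages = 2L * (totalUnits / batchSize);
      return 1e6 * totalOverheadSec / (double)messages;      // microseconds
  }

  // Amdahl's law [10]: speedup bound for serial fraction s on n processors.
  double amdahlSpeedup(double s, int n)
  {
      return 1.0 / (s + (1.0 - s) / n);
  }

  int main()
  {
      const long units = 1000000L;                           // 1 million work units
      // 19 slaves, batch size 1: 15.6 s total overhead -> about 8 us per message.
      std::printf("19 slaves, 1-batch : %.1f us/message\n", overheadPerMessage(15.6, units, 1));
      // 19 slaves, batch size 10: 3.7 s total overhead -> about 18 us per message.
      std::printf("19 slaves, 10-batch: %.1f us/message\n", overheadPerMessage(3.7, units, 10));

      // Amdahl bound for an assumed serial fraction of 0.4%,
      // to be compared with the measured speedups of 3.00 and 6.89.
      const double s = 0.004;
      std::printf("Amdahl bound, 3 slaves : %.2f\n", amdahlSpeedup(s, 3));
      std::printf("Amdahl bound, 7 slaves : %.2f\n", amdahlSpeedup(s, 7));
      std::printf("Amdahl bound, 19 slaves: %.2f\n", amdahlSpeedup(s, 19));
      return 0;
  }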
Conclusions

Everyone recommends keeping the number of MPI messages as low as possible, but the MPI implementation in CCS seems to be quite effective even in non-ideal conditions (thousands of messages per second from a single slave to the master). Intra-board communication seems to be quite effective as well, so there is no great need to optimize the application in a hybrid-parallelization way, except for special algorithms which can benefit greatly from sharing memory between worker threads. We can expect network overhead as low as 10μs per message even in cases with a fully loaded master process (thousands of MPI messages per second per slave).

Future work: equivalent tests will be performed on an 80-processor blade server equipped with an InfiniBand interconnect [11]. After that we will be able to give a more complete study of MPI efficiency and of the influence of interconnect technology on the overall efficiency of a master-slave solution.

Acknowledgements

This work was supported by Microsoft (see [9]) and Silicon Hill.

References

1. Peter Kacsuk, Ferenc Vajda. Network-based Distributed Computing (Metacomputing). ERCIM Hungary.
2. Message Passing Interface Forum. MPI-2: Extensions to the Message-Passing Interface. 15 Nov 2003; on-line version available.
3. Microsoft Windows Compute Cluster Server 2003, main WWW page.
4. Gary Shao, Francine Berman. Master/Slave Computing on the Grid. Heterogeneous Computing Workshop.
5. Bruce Schneier. Applied Cryptography: Protocols, Algorithms, and Source Code in C. Wiley, 2nd Edition, 1995.
6. Steve Reid's public domain implementation of the SHA-1 digest algorithm, 27 Sep 1996.
7. VUCAKO Bench, open-source benchmark suite for 32- and 64-bit computers, homepage.
8. Scalable Servers Corporation, main WWW page.
9. Windows Compute Cluster Server, Silicon Hill WiKi.
10. Amdahl, G.M. Validity of the single-processor approach to achieving large scale computing capabilities. In AFIPS Conference Proceedings vol. 30 (Atlantic City, N.J., Apr.). AFIPS Press, Reston, Va., 1967.
11. InfiniBand Trade Association, WWW pages.