Lattice QCD Performance on Multi-core Linux Servers
Yang Suli*
Department of Physics, Peking University, Beijing

Abstract

At the moment, lattice quantum chromodynamics (lattice QCD) is the most well-established non-perturbative approach to solving the theory of Quantum Chromodynamics. Getting the best performance out of lattice QCD computing requires careful attention to balancing memory bandwidth, floating-point throughput, and network performance. In this paper, I will discuss my investigation of the effect of different commodity processors on lattice QCD performance. I will also discuss the effect of different lattice sizes, computing precision, etc. Finally, I will give my suggestions on how to do lattice QCD computing effectively and economically.

Contents

1. Introduction to the benchmark suite
   1.1 DD-HMC algorithm
   1.2 DD-HMC lattice QCD simulation package
   1.3 The BenchMZ suite
2. Introduction to the platforms for test
   2.1 Platforms for test
   2.2 Compiler flags
3. Effect of different processor architectures on lattice QCD performance
   3.1 Introduction to the compared CPU architectures
   3.2 Benchmark results
4. Effect of lattice size and computing precision on lattice QCD performance
5. Effect of different ELF formats on lattice QCD performance
6. Conclusions

* [email protected]
1. Introduction to the benchmark suite

I use the BenchMZ suite, made by Martin Luescher and Hartmut Wittig, to evaluate the performance of lattice QCD computing. This suite is based on the DD-HMC lattice QCD simulation package, which implements the DD-HMC algorithm for lattice QCD with the Wilson plaquette action and a doublet of mass-degenerate O(a)-improved Wilson quarks [1].

1.1 DD-HMC algorithm

The DD-HMC algorithm combines domain-decomposition ideas with the Hybrid Monte Carlo algorithm and the Sexton-Weingarten multiple-time integration scheme. In the relevant range of quark masses and lattice sizes, it is much faster than any algorithm used before. Moreover, it is well suited to parallel processing, which is a critical requirement if large lattices are to be simulated [1]. The algorithm is described in detail in [2] and [3].

1.2 DD-HMC lattice QCD simulation package

The DD-HMC lattice QCD simulation package, which implements the DD-HMC algorithm, was written by Martin Luescher, the same person who introduced the algorithm. So far this implementation has been used extensively on PC clusters, and we are using it in production runs. The code in this package is highly optimized for machines with current Intel and AMD processors, but it runs correctly on any system that complies with the ISO C and MPI 1.2 standards [1]. Many modules in the package include SSE (Streaming SIMD Extensions) inline-assembly code that can be activated at compilation time, to profit from the ability of modern x86 processors to perform arithmetic operations on vectors of 4 single-precision or 2 double-precision floating-point numbers in just one or two machine cycles. In my investigation an older version of the DD-HMC package is used; however, it should be easy to move to the latest version.
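To give a flavor of what such SSE code does, the following is a minimal sketch using compiler intrinsics instead of inline assembly. The function and the 4-wide single-precision operation are illustrative and are not taken from the DD-HMC package.

  #include <xmmintrin.h>  /* SSE intrinsics: __m128 holds 4 packed floats */

  /* Illustrative 4-wide single-precision axpy: r[i] = a*x[i] + y[i].
   * n is assumed to be a multiple of 4 and the arrays 16-byte aligned,
   * as the spinor fields in SSE-enabled lattice codes typically are. */
  static void axpy_sse(int n, float a, const float *x, const float *y, float *r)
  {
      __m128 va = _mm_set1_ps(a);         /* broadcast a into all 4 lanes */
      int i;

      for (i = 0; i < n; i += 4) {
          __m128 vx = _mm_load_ps(x + i); /* load 4 floats at once */
          __m128 vy = _mm_load_ps(y + i);
          /* one multiply and one add process 4 values per instruction */
          _mm_store_ps(r + i, _mm_add_ps(_mm_mul_ps(va, vx), vy));
      }
  }

Each multiply and add here processes 4 single-precision (or, with the corresponding double-precision intrinsics, 2 double-precision) values per instruction, which is where the speedup over scalar code comes from.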
1.3 The BenchMZ suite

The BenchMZ suite benchmarks a given system by measuring the computing time spent in various functions. The timing functions and short descriptions of them are listed in Table 1. These functions were chosen for the benchmark suite because they contain most of the time-critical parts of actual lattice QCD applications. Typical production-run values are used for the parameter set. Two different lattice sizes are considered, 32x16x16x32 and 32x16x24x48; the corresponding block sizes are 8x8x8x8 and 8x8x12x12, respectively. The coupling parameter β = 0.0 and the hopping parameter κ = 0.1 are used.

Table 1. The functions used to benchmark the system, with short descriptions.

  Qhat / Qhat_dble: Part of the application of the even-odd preconditioned Wilson-Dirac operator; applies Qhat to the global single-/double-precision field and assigns the result.
  Qhat_blk / Qhat_blk_dble: Part of the application of the even-odd preconditioned Wilson-Dirac operator; applies Qhat to the single-/double-precision field on a block and assigns the result.
  spinor_prod / spinor_prod_dble: Computes the scalar product of two fields.
  spinor_prod_re / spinor_prod_re_dble: Computes the real part of the scalar product of two fields.
  normalize / normalize_dble: Replaces pk[] by pk[]/||pk|| and returns the norm ||pk||.
  mulc_spinor_add / mulc_spinor_add_dble: Replaces pk[] by pk[] + z*pl[].
  project / project_dble: Replaces pk[] by pk[] - (pl,pk)*pl[].
  rotate / rotate_dble: Replaces pk[] by sum_j pj[]*v[n*j+k], where 0 <= k, j < n and pk = ppk[k].
  norm_square / norm_square_dble: Computes the square of the norm of the field.
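To make the cost accounting of these kernels concrete, the following is a minimal, self-contained sketch of a mulc_spinor_add-style operation (pk[] = pk[] + z*pl[] with complex z), timed and reported in the Mflops units used throughout this paper. The complex type, the array length, and the flop count are illustrative assumptions, not the DD-HMC implementation.

  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  typedef struct { float re, im; } complex_f;  /* illustrative complex type */

  /* pk[] = pk[] + z*pl[]: one complex multiply-add (8 flops) per element */
  static void mulc_add(int n, complex_f z, complex_f *pk, const complex_f *pl)
  {
      int i;
      for (i = 0; i < n; i++) {
          pk[i].re += z.re * pl[i].re - z.im * pl[i].im;
          pk[i].im += z.re * pl[i].im + z.im * pl[i].re;
      }
  }

  int main(void)
  {
      int n = 1 << 22;  /* 4M elements: large enough to spill out of cache */
      int reps = 10, i;
      complex_f z = { 0.5f, -0.25f };
      complex_f *pk = malloc(n * sizeof *pk);
      complex_f *pl = malloc(n * sizeof *pl);
      clock_t t0;
      double secs;

      for (i = 0; i < n; i++) { pk[i].re = pk[i].im = 1.0f; pl[i] = pk[i]; }

      t0 = clock();
      for (i = 0; i < reps; i++)
          mulc_add(n, z, pk, pl);
      secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

      /* 8 floating-point operations per element per repetition */
      printf("%.1f Mflops\n", 8.0 * n * reps / secs / 1e6);
      free(pk); free(pl);
      return 0;
  }

A kernel of this kind reads two fields and writes one, so at large n its Mflops value is set by memory bandwidth rather than by the floating-point units; this is exactly the regime probed in the following sections.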
Table 2. Test platforms.

  hpbl1.ifh.de: 2 x Intel Xeon E5440 quad-core, 16 GB memory, Scientific Linux (kernel 2.6.9), gcc
  hpbl2.ifh.de: 2 x Intel Xeon E5440 quad-core, Scientific Linux, gcc
  hpbl3.ifh.de: 2 x AMD Opteron 2356 quad-core, Scientific Linux (kernel 2.6.9), gcc
  hpbl4.ifh.de: 2 x AMD Opteron 2356 quad-core, Scientific Linux, gcc

2. Introduction to the platforms for test

2.1 Platforms for test

The platforms used to perform the benchmarking are listed in Table 2. They are all x86_64 machines with typical modern configurations. Multi-core architecture is the main focus of this investigation, because we want to know how the data exchange between cores constrains the performance of lattice QCD. The Scientific Linux operating system is used because it is the most popular OS in high-energy physics computing. OpenMPI is used as the MPI implementation for both the 64-bit and the 32-bit executables.

2.2 Compiler flags

For 32-bit executables, the compiler flags used are -ansi -pedantic -Wno-long-long -Wall -mcpu=i586 -malign-double -fno-force-mem -O -DSSE3 -DPM. For 64-bit executables, the compiler flags used are -std=c89 -pedantic -Wno-long-long -Wall -fomit-frame-pointer -O -DSSE3 -DPM. These flags are the ones suggested by the author of the DD-HMC simulation package. Several other compiler options mentioned in the gcc man pages were tried, but their effects were not very significant. What really drives the performance is the prefetch distance (set by -DPM) and the activation of the SSE inline assembly (set by -DSSEx).
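To illustrate what the prefetch distance controls, here is a minimal sketch using the _mm_prefetch intrinsic rather than the package's inline assembly; the kernel, the PFDIST value, and the cache hint are illustrative assumptions.

  #include <xmmintrin.h>  /* _mm_prefetch and the _MM_HINT_* constants */

  /* Prefetch distance in array elements; tuning this value (as the
   * package's -DPM option does) trades prefetch timeliness against
   * cache pollution. 32 is an arbitrary example value. */
  #define PFDIST 32

  static double sum_with_prefetch(int n, const double *a)
  {
      double s = 0.0;
      int i;

      for (i = 0; i < n; i++) {
          /* request the data we will need PFDIST iterations ahead */
          if (i + PFDIST < n)
              _mm_prefetch((const char *)(a + i + PFDIST), _MM_HINT_T0);
          s += a[i];
      }
      return s;
  }

If the distance is too short, the data arrive after they are needed; if it is too long, prefetched lines are evicted before use. This is why the prefetch distance has such a visible effect on a bandwidth-bound code.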
3. Effect of different processor architectures on lattice QCD performance

High lattice QCD performance requires excellent single- and double-precision floating-point performance, high memory bandwidth*, and fast communication between nodes. The effect of network communication is not investigated in this paper; the other two factors both depend largely on the CPU architecture. Previous studies show that for older CPUs, memory bandwidth typically constrains the performance of single nodes [4].

*The term memory bandwidth as used here refers not only to access to main memory, but also to the data exchange between cores and processors.

3.1 Introduction to the compared CPU architectures

Two popular commodity processors, the Intel Xeon E5440 and the AMD Opteron 2356, which are believed to be representative of modern Intel and AMD multi-core CPUs, are investigated. Detailed technical information about these two processors can be found on the official Intel and AMD websites [5, 6]. I will concentrate only on the key aspects that affect floating-point performance and memory bandwidth. The clock frequency of the Intel E5440 is 2.83 GHz, while that of the AMD 2356 is 2.30 GHz, which basically means that the E5440 runs at a higher clock speed. Both CPUs have an integrated floating-point unit and support the SSE3 instruction set. The AMD Opteron 2356 additionally provides 128-bit floating-point pipeline enhancements to improve floating-point performance. Figure 1 compares the floating-point performance of the two processors. One can see from the figures that the AMD Opteron 2356 generally does better than the Intel Xeon E5440 in floating-point performance despite its lower clock frequency. However, this advantage is not obvious with the low-optimization code that is typically used in high-energy computing.

Figure 1: Floating-point throughput comparison of servers based on the Intel Xeon E5440 and the AMD Opteron 2356. Fig. 1.a shows SPECfp*_rate results with high optimization [7]; Fig. 1.b shows SPECfp*_multispeed results with the typical optimization used in high-energy computing [8].

Concerning memory bandwidth, AMD has made some important changes in the processor architecture of the Opteron series. Each Opteron has an integrated memory controller with a separate memory bus attached to it, and HyperTransport links connect the processors. A given processor can address both its local memory and the memory attached to the other processor. However, NUMA-aware kernels such as the Linux 2.6.x series must be used to get the best performance out of AMD processors [4]. In contrast, the Intel Xeon E5400 series still uses the traditional front-side bus architecture, the E5440 having a 1333 MHz FSB. Figure 2 compares the memory bandwidth of the two processors: the memory bandwidth of the Intel Xeon E5440 system reaches only 45%-55% of that of the AMD Opteron 2356 system.

Figure 2: Memory bandwidth comparison of servers based on the Intel Xeon E5440 and the AMD Opteron 2356. The results were obtained with the STREAM benchmark suite [7].
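The STREAM measurement behind Figure 2 boils down to timed vector loops over arrays much larger than the caches. Below is a simplified stand-in for one such loop (the triad); the array size is an illustrative assumption, and the real STREAM suite measures several kernels with more careful timing.

  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  /* Simplified STREAM-style triad: a[i] = b[i] + q*c[i].
   * N must be large enough that the arrays do not fit in any cache. */
  #define N (1 << 23)  /* 8M doubles = 64 MB per array */

  int main(void)
  {
      double *a = malloc(N * sizeof *a);
      double *b = malloc(N * sizeof *b);
      double *c = malloc(N * sizeof *c);
      double q = 3.0, secs;
      clock_t t0;
      long i;

      for (i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

      t0 = clock();
      for (i = 0; i < N; i++)
          a[i] = b[i] + q * c[i];
      secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

      /* 24 bytes of traffic per iteration: read b and c, write a */
      printf("triad bandwidth: %.2f GB/s\n", 24.0 * N / secs / 1e9);

      free(a); free(b); free(c);
      return 0;
  }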
3.2 Benchmark results

The benchmark results used to evaluate the performance of the different processors are shown in Figure 3; performance is given in Mflops (higher is better), a value inversely proportional to the time spent per lattice point. Lattice QCD applications with single and double precision, with the smaller and the bigger lattice, and in 32-bit and 64-bit executable format are run on hpbl1 and hpbl2 (the Intel-based machines) and on hpbl3 and hpbl4 (the AMD-based machines). Performance varies between the timing functions, but one finds that the AMD-based machines generally achieve higher Mflops values, in accordance with the fact that the AMD Opteron 2356 beats the Intel Xeon E5440 in both floating-point performance and memory bandwidth despite its lower clock frequency.
Figure 3 (a-h): Benchmark results for the processor architecture comparison.

Moreover, one observes that for the bigger lattice or higher (double) precision, the performance of the Intel-based machines tends to approach 40%-50% of that of the AMD-based machines. This tendency is seen most clearly in the Qhat_blk and norm_square timing functions. It can be explained by memory bandwidth still being the constraining factor in lattice QCD computing: once the lattice extends into main memory, lattice QCD performance essentially reflects the memory bandwidth of the particular machine, which for the Intel-based systems is 40%-50% of that of the AMD-based systems.

4. Effect of lattice size and computing precision on lattice QCD performance

The benchmark results used to evaluate the effect of lattice size and computing precision are shown in Figure 4. The lattice QCD benchmark suite is run on all four machines (hpbl1-4); on each machine it is run with single and double precision and with the 32x16x16x32 and the 32x16x24x48 lattice. Comparing runs with different lattice sizes, one finds that in single-precision computing there may be some performance decline for the larger lattice, but in double-precision runs the Mflops value tends to become the same for both lattice sizes, i.e. in high-precision computing the time spent per lattice point does not increase with the lattice size. Comparing runs with different precision reveals an analogous property: in small-lattice runs the double-precision performance may reach only 1/5, or even 1/7, of the single-precision performance, e.g. in the normalize or norm_square functions, whereas in large-lattice runs almost all of the benchmark functions report a performance ratio of roughly 2:1 between single- and double-precision runs.
Figure 4 (a-d): Benchmark results for the lattice size and computing precision comparison.

Both observations may result from the fact that memory bandwidth typically constrains lattice QCD performance. When a larger lattice or higher computing precision is specified, the lattice extends into main memory, so the nonlinear behavior of the algorithm is hidden by the linearly increasing memory access time; as a result, the time spent per lattice point stays roughly constant and the total computing time grows only linearly. However, more statistics are still needed to confirm this explanation.
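This explanation can be made quantitative with a back-of-the-envelope model: if a kernel performs F floating-point operations and moves Q bytes per lattice point on a machine that sustains B bytes/s of memory bandwidth, its rate is roughly F*B/Q flops/s once the lattice no longer fits in cache. The sketch below evaluates the model with illustrative numbers, not with values measured in this paper.

  #include <stdio.h>

  /* Bandwidth-limited performance estimate; all numbers are
   * illustrative assumptions. */
  int main(void)
  {
      double bandwidth = 8e9;        /* assumed sustained bandwidth, bytes/s */
      double flops_per_point = 8.0;  /* e.g. one complex multiply-add */
      double bytes_single = 24.0;    /* assumed traffic per point, single prec. */
      double bytes_double = 48.0;    /* double precision doubles the traffic */

      printf("single precision: %.0f Mflops\n",
             flops_per_point * bandwidth / bytes_single / 1e6);
      printf("double precision: %.0f Mflops\n",
             flops_per_point * bandwidth / bytes_double / 1e6);
      /* Doubling the data volume at a fixed flop count halves the Mflops,
       * reproducing the 2:1 single/double ratio seen for large lattices. */
      return 0;
  }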
5. Effect of 32-bit and 64-bit applications on lattice QCD performance

The benchmark results used to evaluate the performance of 32-bit and 64-bit applications are shown in Figure 5. The 32-bit and 64-bit lattice QCD benchmark suites are run on two machines with different architectures.

Figure 5: Benchmark results for the ELF format comparison. Fig. 5.a: results on the Intel-based machine hpbl2; Fig. 5.b: results on the AMD-based machine hpbl4.

One can see that there is no considerable performance difference between the 32-bit and 64-bit lattice QCD applications. The reason for this still needs some investigation, but the guess is that it has something to do with the fact that the prefetch distance is 128 bits for both the 32-bit and the 64-bit executables.

6. Conclusions

Memory bandwidth is still the constraining factor in lattice QCD computing, even on fairly new machines. Processors with special technology to speed up memory access therefore have an evident advantage in lattice QCD computing. As a representative case, the AMD Opteron 2356, with its integrated memory controller and HyperTransport links, scores about 120% higher than the Intel Xeon E5440, with its traditional front-side bus architecture, in the lattice QCD benchmarks. Besides, increasing the lattice size or the computing precision has a roughly linear effect on the computing time. So for people who want to perform effective and economical lattice QCD computing, memory bandwidth should be a key factor to focus on; AMD Opteron systems with an enhanced memory access design may well be a good choice. Also, since there is no dramatic performance decline for larger lattices or higher computing precision, using them is quite acceptable when the need arises.
Acknowledgments

This research was sponsored by the DESY summer student program. I want to thank my supervisor Peter Wegner for his guidance. I want to thank Mr. Martin Luescher and Mr. Hartmut Wittig for their kindness and patience in answering my rather naive questions. I would also like to thank Mr. Goetz Waschk and Mr. Stephan Wiesand for their kind help with my work.

References

[1] M. Luescher, README file of the DD-HMC package.
[2] M. Luescher, Comput. Phys. Commun. 165 (2005) 199.
[3] M. Luescher, Comput. Phys. Commun. 156 (2004) 209.
[4] D. Holmgren et al., FERMILAB-CONF CD.
[5] Intel official website, http://www.intel.com.
[6] AMD official website, http://www.amd.com.
[7] AMD performance comparison pages (SPECfp*_rate and STREAM results).
[8] SPECfp*_multispeed results with the typical optimization used in high-energy computing.