Application performance analysis on Pilatus
Abstract

The US group at CSCS performed a set of benchmarks on Pilatus using the three Programming Environments available (GNU, Intel, PGI); the results can be retrieved on the web-site (link at the bottom of the page). The purpose of these benchmarks is to indicate which of the available compilers builds the executable with the best performance: it turned out that the Intel and GNU compilers produce the fastest executables (with hyperthreading enabled). In some cases the codes have been compared against the equivalent counterparts installed on Monte Rosa; this additional information is not meant as a strict measurement of the difference between the two systems, but rather as a hint of the performance that users can expect when using the default applications. Below we report the procedure used to run the benchmarks, together with a summary of the results and the scalability plots (all in log-log scale) of a few selected libraries and applications widely used by the scientific community, as well as parallel benchmarks.

Information on the cluster and the processors

Vendor_id: GenuineIntel
Cores: 16 (32 with hyperthreading)
Packages (sockets): 2
Model name: Intel(R) Xeon(R) CPU E GHz
CPU MHz: (if powersave on)
Cache size: KB
Memory: 64 GB DDR3 at 1333 MHz per node
Interconnect: Mellanox ConnectX FDR InfiniBand, currently configured as QDR

Programming Environments and MPI library

PrgEnv-intel/
PrgEnv-gnu/4.7.1
PrgEnv-pgi/12.5
MPI library: mvapich2/1.8

Benchmarks

Gromacs-4.5.5, NPB, Echam6, SPECFEM3D, FFTW, ScaLAPACK, STREAM, SPH-flow

More details on the benchmarks, with the full data, are available at the following web-page:
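As an illustration, a typical session to select a Programming Environment and the MPI library before building any of the benchmarks might look as follows (a minimal sketch; the module names are the ones listed above, the exact versions installed on Pilatus may differ):

    module load PrgEnv-gnu/4.7.1      # or PrgEnv-pgi/12.5, or the PrgEnv-intel module
    module load mvapich2/1.8          # MPI library used for all the benchmarks
    mpicc -show                       # check which compiler the MPI wrapper invokes
    grep -c ^processor /proc/cpuinfo  # 32 logical CPUs per node with hyperthreading enabled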
2 Gromacs HDL benchmark

The module gromacs/4.5.5 loads the PATH for the Gromacs molecular dynamics engine. The source code is available for download at the web-page. An optimized FFTW3 build is necessary, since the pre-compiled libraries do not use AVX instructions; the MKL FFT with Intel does not perform better than FFTW, therefore only the latter is shown. The module fftw/3.3.2 sets the include and library paths. The optimization flags used are -O3 with -mavx (Intel), -march=corei7-avx (GNU) and -tp=sandybridge-64 -Mvect=simd:256 (PGI). On Rosa we used gcc with -O3.

Simulation setup and Results

Retrieve the input file hdl.tpr from /project/csstaff/lucamar/benchmark/gromacs/hdl. Run a performance and scalability test with 32, 64, 128, 256 and 512 MPI tasks using 32 tasks per node (from 1 up to 8 nodes). The runtimes (s) of the benchmark are plotted for each Programming Environment against the number of MPI tasks. Gromacs on the SandyBridge (with hyperthreading) performs better than on Monte Rosa. The relatively poor performance of the version compiled with the PGI compiler is due to the fact that FFTW3 cannot be compiled with AVX intrinsics using PGI. The difference in performance between Pilatus and Rosa is easily explained, since Gromacs has assembly loops written for Intel architectures: when SSE and AVX are used, the performance does not depend on the compiler.
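A minimal sketch of the scaling runs (the mdrun_mpi binary name and the plain mpirun launch are assumptions; the batch system syntax is omitted):

    module load gromacs/4.5.5 fftw/3.3.2
    cp /project/csstaff/lucamar/benchmark/gromacs/hdl/hdl.tpr .
    for n in 32 64 128 256 512; do            # 32 MPI tasks per node, from 1 up to 8 nodes
        mpirun -np $n mdrun_mpi -s hdl.tpr -deffnm hdl_$n
    done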
3 NAS Parallel Benchmarks (NPB)

The NAS Parallel Benchmarks (NPB) are a small set of programs designed to evaluate the performance of supercomputers: In the present benchmark we use CLASS=D of the hybrid MPI+OpenMP implementation of BT (Block Tri-diagonal solver), multi-zone version (BT-MZ). The compilation flags are:

Intel: -O3 -mavx -mcmodel=medium -shared-intel -openmp
GNU: -O3 -march=corei7-avx -fopenmp -mcmodel=medium
PGI: -O3 -tp=sandybridge-64 -Mvect=simd:256 -mcmodel=medium -mp

Build with make bt-mz CLASS=D NPROCS=n, where n=1,2,... is the number of MPI tasks (a build and launch sketch is given below).

Simulation Setup and Results

The CLASS=D executables were run on 1 to 16 nodes using up to 16 MPI tasks (1 task per node) and 32 OpenMP threads per task. The runtimes (s) of the benchmark are plotted for each Programming Environment against the total number of CPUs. The NAS Parallel Benchmarks show good scalability with all the compilers available on the Intel SandyBridge. The Intel compiler shows the best performance and scalability, while GNU and PGI perform relatively poorly.
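For example, the class D BT-MZ binary for 16 MPI tasks can be built and launched roughly as follows (the binary path under bin/ and the -ppn option of the launcher are assumptions):

    make bt-mz CLASS=D NPROCS=16              # in the NPB multi-zone source directory
    export OMP_NUM_THREADS=32                 # 32 OpenMP threads per MPI task
    mpiexec -np 16 -ppn 1 ./bin/bt-mz.D.16    # 1 MPI task per node on 16 nodes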
4 Echam6

Echam is a global climate model developed by the Max Planck Institute for Meteorology. Website:

Echam6 has been compiled on Pilatus with the optimization flag -O2 for the PGI compiler and -O3 with AVX instructions for the GNU compiler (the Intel compiler was used on Rosa).

Simulation Setup and Results

The Echam6 executables ran the T63L47GR15 model for a one-month prediction on 1 to 8 nodes using 32 MPI tasks per node. The runtimes (s) of the benchmark are plotted for the Programming Environments against the total number of MPI tasks. The best performance is obtained with the GNU compiler, which also outperforms the same version installed on Monte Rosa below 256 CPUs. Echam6 gives a segmentation fault at runtime when compiled with Intel on Pilatus.
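A sketch of the scaling runs, assuming an executable named echam6 and leaving out the model namelist setup (both the executable name and the plain mpirun launch are assumptions):

    for nodes in 1 2 4 8; do
        np=$((nodes * 32))                    # 32 MPI tasks per node
        mpirun -np $np ./echam6 > echam6_T63L47GR15_${np}tasks.log 2>&1
    done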
5 SPECFEM3D (courtesy of J. Poznanovic)

Specfem3d simulates seismic wave propagation in sedimentary basins or any other regional geological model. See:

Specfem3d was compiled with all compilers on Pilatus, using -O3, some floating-point optimization flags and the appropriate architecture flags.

Simulation Setup and Results

The performance of SPECFEM3D has been benchmarked on Pilatus and Rosa using all available compilers, from 1 up to 16 nodes. The runtimes (s) of the benchmark are plotted for the Programming Environments against the number of nodes. The performance spread between different compilers on the same architecture is significant. Note the wide performance gap of GNU on Pilatus vs. Rosa. The Intel compiler on Pilatus gives the best performance, comparable to Cray on Rosa.
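As an illustration of how the per-compiler flags enter the build, a configure line of the following form can be used (the FC/MPIFC/FLAGS_CHECK variable names are assumptions based on the usual SPECFEM3D build system; the exact lines used for these runs are not reported here):

    ./configure FC=ifort MPIFC=mpif90 FLAGS_CHECK="-O3 -mavx"   # Intel; GNU and PGI analogous
    make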
6 FFTW

FFTW is a C library for computing the discrete Fourier transform (DFT) in one or more dimensions, of arbitrary input size, with real and complex data: The optimization flags used to compile are -O3 with SandyBridge-specific options.

Simulation Setup and Results

The performance of FFTW 3.3.2 has been tested with all compilers available on Pilatus, using the FFTW_MEASURE planner flag on a two-dimensional mesh. The performance has been compared against Rosa (GNU version), from 32 to 320 CPUs. The runtimes (s), averaged over several calls, are plotted against the number of MPI tasks. The performance of the GNU and Intel compilers is similar, while PGI, with no AVX intrinsics available for FFTW, performs generally worse, except when the number of CPUs is not a divisor of the mesh size: in that case the performance of GNU and Intel is also poor. The default fftw/ module on Rosa (GNU 4.7) is outperformed by fftw/3.3.2 on Pilatus.
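A plausible configure sketch for an AVX-enabled FFTW 3.3.2 build (the exact configure lines used on Pilatus are not reported; --enable-avx and --enable-mpi are standard FFTW 3.3 options):

    # GNU environment; Intel analogous with CC=icc CFLAGS="-O3 -mavx".
    # PGI cannot use --enable-avx, hence the weaker PGI results above.
    ./configure CC=gcc CFLAGS="-O3 -march=corei7-avx" --enable-avx --enable-mpi
    make && make install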
7 ScaLAPACK

ScaLAPACK is a library of high-performance linear algebra routines for parallel distributed-memory machines. ScaLAPACK solves dense and banded linear systems, least-squares problems, eigenvalue problems and singular value problems. More information and the source code are available at:

Simulation Setup and Results

The performance of the ScaLAPACK 2.0.2 function PDSYEV has been tested with the GNU and Intel compilers on Pilatus on a fixed matrix size. The performance has been compared against the ScaLAPACK available on Rosa (Intel version of cray-libsci/ ), from 1 to 16 nodes (32 CPUs per node). The runtimes (s) are plotted against the number of MPI tasks. The best performance is obtained with the Intel compiler; however, the GNU build on Pilatus still outperforms the PDSYEV routine provided by Cray and available in the cray-libsci/ module on the Cray XE6 Monte Rosa (Intel version).
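A minimal sketch of building and running a PDSYEV test driver against ScaLAPACK 2.0.2 (the driver source pdsyev_test.f90, the SCALAPACK_ROOT variable and the plain mpirun launch are all hypothetical):

    module load PrgEnv-gnu/4.7.1 mvapich2/1.8
    mpif90 -O3 -march=corei7-avx pdsyev_test.f90 \
        -L$SCALAPACK_ROOT/lib -lscalapack -llapack -lblas -o pdsyev_test
    mpirun -np 512 ./pdsyev_test              # 16 nodes x 32 MPI tasks per node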
8 STREAM

STREAM is a synthetic benchmark program that measures sustained memory bandwidth for simple vector kernels: It is designed to work with data sets larger than the Last Level Cache (LLC) on the target system, so that the results are indicative of very large vector-oriented applications. It is available in Fortran and C, with OpenMP and MPI versions. (Reference: Performance Guide for HPC Applications on iDataPlex dx360 M4 systems.)

To avoid measuring cache bandwidth, a minimum array size is required. The array sizes must be chosen so that each of the three arrays is at least 4x larger than the sum of all the last-level caches used in the run: each SandyBridge chip has a shared L3 cache of 20 MB, i.e. the two chips of a compute node have 40 MB, or the equivalent of 5 million double precision words. The size of each array must therefore be at least four times 5 million double precision words (~160 MB).

Simulation setup and Results

The benchmark measures the 4 following loops:

COPY: a(i) = b(i), with memory access to 2 double precision words (16 bytes) and no floating point operation per iteration
SCALE: a(i) = q * b(i), with memory access to 2 double precision words (16 bytes) and one floating point operation per iteration
ADD: a(i) = b(i) + c(i), with memory access to 3 double precision words (24 bytes) and one floating point operation per iteration
TRIAD: a(i) = b(i) + q * c(i), with memory access to 3 double precision words (24 bytes) and two floating point operations per iteration

Intel SandyBridge, GNU 4.7.1: make CC=gcc CFLAGS="-DN= -O3"
Intel SandyBridge, Intel 12.1.2: make CC=icc CFLAGS="-DN= -O3"
The output tables report, for each of Copy, Scale, Add and Triad, the rate (MB/s) and the average, minimum and maximum times; the full figures are part of the data available on the web-page.

Data: On AMD Interlagos (Cray XE6) the measured bandwidth is ~6 GB/s for all loops and both compilers (gnu/47 and intel/12), while on the Intel SandyBridge the measured bandwidth is ~13 GB/s for all loops and both compilers (gnu/47 and intel/12). The two sets of SandyBridge results show that single-processor binaries produced by the Intel compiler are faster, if the compiler options recommended by Intel are used.

Known issue with the Intel compiler:
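Following the sizing rule above, a build and run sketch might look as follows (the array size 80000000 is an illustrative value that satisfies the minimum of 4 x 5 million doubles per array; the original runs used a value not recorded here, and the binary name depends on the STREAM Makefile used):

    make CC=gcc CFLAGS="-DN=80000000 -O3"     # GNU 4.7.1 build; use CC=icc for Intel 12.1.2
    ./stream_c.exe                            # single-processor run, as in the results above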
9 SPH-flow

SPH-flow is a hybrid MPI/OpenMP code using the HDF5 library for I/O. Get the source code from:

Choose the Programming Environment and load hdf5/. The optimization flags are the following:

GNU: -O3 -march=corei7-avx
Intel: -O3 -mavx -shared-intel -mcmodel=medium
PGI: -O3 -tp=sandybridge-64 -Mvect=simd:256 -mcmodel=medium

Simulation setup and Results

The walltimes (in s) shown here correspond to pure MPI jobs only, using 32 cores per node on 2 up to 20 nodes of the SandyBridge and 4 steps of dambreak.ini. The best performance is obtained with the Intel compiler, average performance with GNU and the worst performance with PGI. The difference between the fastest (Intel) and the slowest (PGI) executable is considerable (~50% on 20 nodes).
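A sketch of the pure MPI scaling runs (the executable name sph-flow and the plain mpirun launch are assumptions):

    module load PrgEnv-intel
    module load hdf5
    for nodes in 2 4 8 16 20; do
        mpirun -np $((nodes * 32)) ./sph-flow dambreak.ini    # 4 steps of the dambreak case
    done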
10 Conclusions

The purpose of the benchmarks reported here is to guide users on Pilatus towards the compiler that builds the executable with the best performance. Among the applications that we have tested, the best performance is in general achieved with the Intel and GNU compilers. The PGI compiler performs worse than the other two most of the time, with the rare exceptions reported above. The performance of the default versions of some applications installed on Monte Rosa is presented as well in the current report. These data are included to give the user an idea of the performance that might be obtained using the default modules available on the two systems, without trying to implement specific optimizations. However, the present comparison cannot strictly measure a performance difference: in some cases different versions were compared on the two machines, e.g. FFTW on Pilatus vs. FFTW on Rosa, and the ScaLAPACK library built from the downloaded source code vs. the cray-libsci version of the ScaLAPACK routines.
