HPC and Parallel efficiency
1 HPC and Parallel efficiency
Martin Hilgeman, EMEA product technologist HPC
2 What is parallel scaling?
3 Parallel scaling
Parallel scaling is the reduction in application execution time when more than one core is used.
[Chart: wall clock time (seconds) and speedup factor versus number of cores]
A 7x speedup on 8 cores (87.5% efficiency) does not seem to be that bad, but...
4 Amdahl's Law
5 Amdahl's Law
Gene Amdahl (1967): "Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities", AFIPS Conference Proceedings (30).
"The effort expended on achieving high parallel processing rates is wasted unless it is accompanied by achievements in sequential processing rates of very nearly the same magnitude."
a = n / (p + n(1 - p))
a: speedup, n: number of processors, p: parallel fraction
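As an illustration of the formula above (not part of the original slides), a minimal C sketch that evaluates Amdahl's speedup for a few parallel fractions and core counts:

    #include <stdio.h>

    /* Amdahl's Law: speedup on n processors for parallel fraction p */
    static double amdahl_speedup(double p, int n)
    {
        return n / (p + n * (1.0 - p));
    }

    int main(void)
    {
        const double fractions[] = { 0.80, 0.95, 0.99 };
        const int cores[] = { 8, 64, 512 };

        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 3; j++)
                printf("p = %.2f, n = %4d: speedup = %6.1fx\n",
                       fractions[i], cores[j],
                       amdahl_speedup(fractions[i], cores[j]));
        return 0;
    }

For example, p = 0.80 on 8 cores gives 8 / (0.8 + 8 * 0.2) = 3.3x, which is the behaviour modelled on the next slide.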
6 Amdahl's Law: model of an 80% parallel application
Parallel scaling is hindered by the domination of the serial parts in the application.
[Chart: wall clock time, split into parallel and serial portions, versus number of processors]
7 Amdahl's Law limits the maximal speedup
With an infinite number of processors the speedup converges to
a = 1 / (1 - p)
a: speedup, p: parallel fraction
For example, p = 99.5% limits the speedup to 200x, no matter how many processors are used.
[Chart: maximal parallel speedup versus Amdahl's Law percentage (97.0%, 99.0%, 99.5%, ...)]
8 Amdahl's Law curves
a = n / (p + n(1 - p))
a: speedup, n: number of processors, p: parallel fraction
[Chart: parallel speedup versus number of processor cores for parallel fractions of 95.0%, 97.0%, 99.0%, 99.5%, 99.9% and 100.0%]
Multiprocessing can exceed 99%; most parallel codes are between 95% and 99%.
9 Amdahl's Law and efficiency
Parallel efficiency is the speedup divided by the number of cores:
e = a / n = 1 / (p + n(1 - p))
Diminishing returns: there is a tension between the desire to use more processors and the associated cost in efficiency.
[Chart: efficiency as a function of the number of processor cores and the Amdahl's Law percentage]
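To make the diminishing returns concrete, here is the efficiency formula above evaluated for a 99% parallel code (a worked example added for illustration):

    e(8)   = 1 / (0.99 +   8 * 0.01) = 1 / 1.07 ≈ 93.5%
    e(64)  = 1 / (0.99 +  64 * 0.01) = 1 / 1.63 ≈ 61.3%
    e(512) = 1 / (0.99 + 512 * 0.01) = 1 / 6.11 ≈ 16.4%

Even a 99% parallel code drops below two-thirds efficiency at 64 cores; this is the tension between core count and cost referred to above.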
10 Best practices for efficient computing
11 Taking care of process placement
12 Beware of memory topology
In the SMP days, the mapping of processors to memory was straightforward: all processors accessed memory through a single shared memory controller.
[Diagram: processors connected to one memory controller and I/O hub]
13 NUMA for Intel Xeon 56xx
Intel Nehalem and AMD Opteron have their memory controller on the processor, so memory access becomes non-uniform (NUMA).
[Diagram: two-socket Intel Westmere-EP system, sockets connected to each other and to the I/O hub via QPI links]
14 NUMA for Intel Xeon E5
Intel Sandy Bridge also has the PCIe controller on the chip, so PCI device access becomes non-uniform as well (NUPA?).
[Diagram: two-socket Intel Sandy Bridge EN/EP system with a QPI link between the sockets and I/O attached to each socket]
15 NUMA for Intel Xeon E7 processors
4-socket Intel systems (PowerEdge R810, R910, M910) are connected with QPI links.
[Diagram: four Intel Westmere-EX sockets connected with QPI links]
16 NUMA for AMD
4-socket AMD systems (PowerEdge R815, M915, C6145) are connected with HyperTransport3 links.
[Diagram: four AMD Bulldozer sockets connected with HT3 links]
17 Use tools for memory placement
Programming tools:
- sched_setaffinity(): set the process CPU affinity mask
Operating system tools:
- numactl: control NUMA policy for processes or shared memory
- taskset: set a process' CPU affinity
Most MPI libraries have memory placement tools enabled by default:
- Intel MPI: I_MPI_PIN_* environment variables
- Open MPI: --bind-to-core, --bind-to-socket
- MVAPICH2: MV2_ENABLE_AFFINITY environment variable
Some set affinity by default and do the right thing, but be careful. A minimal affinity sketch follows below.
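As a hedged illustration (not part of the original slides), a minimal C sketch of what sched_setaffinity()-based pinning looks like; the core number is taken from the command line here, whereas real placement tools derive it from the machine topology:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char *argv[])
    {
        int core = (argc > 1) ? atoi(argv[1]) : 0;   /* core to pin to */
        cpu_set_t mask;

        CPU_ZERO(&mask);
        CPU_SET(core, &mask);

        /* Pin the calling process (pid 0 = self) to the chosen core */
        if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        printf("Pinned to core %d\n", core);
        /* ... the real application work would run here ... */
        return 0;
    }

The same effect can be had from the shell with taskset -c <core>, or with numactl --physcpubind=<core> --membind=<node>, which additionally binds memory allocation to a NUMA node.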
18 Case study: gysela5d plasma physics application
- Claimed 92% efficiency on 8192 cores
- Uses hybrid MPI/OpenMP parallelism
- Run on 64 cores: Dell PowerEdge R815 with AMD Opteron processors
- 8 MPI ranks x 8 OpenMP threads
- MVAPICH2 1.6
19 PowerEdge R815 CPU core layout
The PowerEdge R815 has a non-standard dual-plane core mapping, which makes the default placement fail.
20 No placement
- MPI ranks and OpenMP threads are scattered across the sockets
- Sleeping processes (idle OpenMP threads) can start to roam
- Wall clock time: 1296 seconds
21 Naïve (default) placement
- Starting from core 0 and counting onwards: MPI ranks on every 8th core, OpenMP threads in between
- Wall clock time: 249 seconds
22 Optimal placement
- Use a placement script which calculates the right mapping for numactl
- Wall clock time: 191 seconds
23 PowerEdge R910 CPU core layout PowerEdge R910 has the same non-linear core mapping as the R815!
24 No placement
- MPI ranks and OpenMP threads are scattered across the sockets
- Sleeping processes (idle OpenMP threads) can start to roam
- Wall clock time: 152 seconds
25 Naïve (default) placement
- Starting from core 0 and counting onwards: MPI ranks on every 8th core, OpenMP threads in between
- Wall clock time: 150 seconds
26 Optimal placement
- Use a placement script which calculates the right mapping for numactl
- Wall clock time: 131 seconds
27 Placement program
- Written in C99 (~2,000 lines of code)
- Works on all major distributions: RHEL 5.x and open variants, SLES 11 SP2
- Supports all major MPI libraries: MVAPICH, MVAPICH2, Open MPI, Platform MPI/HP MPI, Intel MPI
- Tested with > 5,000 core runs
- Supports hybrid MPI/OpenMP runs too!
28 Placement program
- Supports all machines and processor vendor models: Intel Nehalem-EP, Intel Nehalem-EX, Intel Westmere-EP, Intel Westmere-EX, Intel Sandy Bridge EP, AMD Magny Cours, AMD Interlagos
- Knows which cores are sharing L3 caches
- Understands the AMD Bulldozer module concept to make maximal use of the available resources where possible
A simplified sketch of the underlying idea is shown below.
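The dell_affinity source is not reproduced in these slides. As a hedged sketch of the general idea only: a wrapper of this kind reads the local rank exported by the MPI launcher (OMPI_COMM_WORLD_LOCAL_RANK for Open MPI, MV2_COMM_WORLD_LOCAL_RANK for MVAPICH2), derives a CPU list from it, pins itself, and then exec()s the real application. The linear rank-to-core mapping below is a placeholder; the real tool derives the mapping from the detected processor topology.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    static int local_rank(void)
    {
        /* Local rank as exported by the MPI launcher (library-dependent) */
        const char *vars[] = { "OMPI_COMM_WORLD_LOCAL_RANK",
                               "MV2_COMM_WORLD_LOCAL_RANK" };
        for (int i = 0; i < 2; i++) {
            const char *v = getenv(vars[i]);
            if (v) return atoi(v);
        }
        return 0;
    }

    int main(int argc, char *argv[])
    {
        int threads = 1;                       /* OpenMP threads per rank   */
        const char *t = getenv("OMP_NUM_THREADS");
        if (t) threads = atoi(t);

        int rank  = local_rank();
        int first = rank * threads;            /* naive linear mapping only */

        cpu_set_t mask;
        CPU_ZERO(&mask);
        for (int c = first; c < first + threads; c++)
            CPU_SET(c, &mask);
        sched_setaffinity(0, sizeof(mask), &mask);

        fprintf(stderr, "local rank %d -> cores %d-%d\n",
                rank, first, first + threads - 1);

        /* Replace this wrapper process with the real application;
           the CPU affinity mask is inherited across exec */
        if (argc > 1) execvp(argv[1], &argv[1]);
        return 0;
    }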
29 Works only on Dell systems!
Try to run on a whitebox system:
    $ mpirun -np 1 ./dell_affinity.exe hello_world.exe
    This is not a Dell system. Exiting.
30 Program examples: Usage
    open64_acml]$ mpirun -np 1 dell_affinity.exe -h
    Invalid option: -h
    Usage: dell_affinity.exe -n <# local MPI ranks> -t <# OpenMP threads per rank>
TACC single node, 6 MPI ranks:
    login2$ mpirun -np 6 dell_affinity.exe ./mpi.exe
    dell_affinity.exe: Using MVAPICH2.
    dell_affinity.exe: PPN: 6 OMP_NUM_THREADS: 1.
    dell_affinity.exe: Intel Westmere processor detected.
    dell_affinity.exe: Placing MPI rank 0 on host login2 local rank 0 cpulist 1 memlist 0
    dell_affinity.exe: Placing MPI rank 1 on host login2 local rank 1 cpulist 3 memlist 0
    dell_affinity.exe: Placing MPI rank 2 on host login2 local rank 2 cpulist 5 memlist 0
    dell_affinity.exe: Placing MPI rank 3 on host login2 local rank 3 cpulist 7 memlist 0
    dell_affinity.exe: Placing MPI rank 4 on host login2 local rank 4 cpulist 9 memlist 0
    dell_affinity.exe: Placing MPI rank 5 on host login2 local rank 5 cpulist 11 memlist 0
31 Program examples: TACC single node
2 MPI ranks, 6 OpenMP threads per rank:
    login1$ mpirun -np 4 ./dell_affinity.exe -n 2 -t 6 -v /bin/true
    ./dell_affinity.exe: Using Open MPI.
    ./dell_affinity.exe: PPN = 2
    ./dell_affinity.exe: OMP_NUM_THREADS = 6
    ./dell_affinity.exe: Intel Westmere EP processor detected.
    ./dell_affinity.exe: node 1: cpulist:
    ./dell_affinity.exe: node 0: cpulist:
    ./dell_affinity.exe: Placing MPI rank 0 on host login1.ls4.tacc.utexas.edu local rank 0 cpulist 0,2,4,6,8,10 memlist 0
    ./dell_affinity.exe: Placing MPI rank 3 on host login1.ls4.tacc.utexas.edu local rank 1 cpulist 1,3,5,7,9,11 memlist 1
    ./dell_affinity.exe: Placing MPI rank 1 on host login1.ls4.tacc.utexas.edu local rank 1 cpulist 1,3,5,7,9,11 memlist 1
    ./dell_affinity.exe: Placing MPI rank 2 on host login1.ls4.tacc.utexas.edu local rank 0 cpulist 0,2,4,6,8,10 memlist 0
32 Program examples: TACC
Two nodes, 4 MPI ranks per node, 3 OpenMP threads per rank:
    login2$ mpirun_rsh -ssh -hostfile ./hosts -np 8 dell_affinity.exe -n 4 -t 3 ./mpi.exe
    dell_affinity.exe: Using MVAPICH2.
    dell_affinity.exe: PPN: 4 OMP_NUM_THREADS: 3.
    dell_affinity.exe: Intel Westmere processor detected.
    dell_affinity.exe: Placing MPI rank 0 on host login1 local rank 0 cpulist 1,3,5 memlist 0
    dell_affinity.exe: Placing MPI rank 1 on host login1 local rank 0 cpulist 7,9,11 memlist 0
    dell_affinity.exe: Placing MPI rank 2 on host login1 local rank 2 cpulist 0,2,4 memlist 1
    dell_affinity.exe: Placing MPI rank 3 on host login1 local rank 3 cpulist 6,8,10 memlist 1
    dell_affinity.exe: Placing MPI rank 4 on host login2 local rank 0 cpulist 1,3,5 memlist 0
    dell_affinity.exe: Placing MPI rank 5 on host login2 local rank 1 cpulist 7,9,11 memlist 0
    dell_affinity.exe: Placing MPI rank 6 on host login2 local rank 2 cpulist 0,2,4 memlist 1
    dell_affinity.exe: Placing MPI rank 7 on host login2 local rank 3 cpulist 6,8,10 memlist 1
33 Program examples: Cambridge
Single C6145 with AMD Interlagos, 16 MPI ranks:
    open64_acml]$ mpirun -np 16 dell_affinity.exe ~/martinh/bin/mpi.exe
    dell_affinity.exe: Using MVAPICH2.
    dell_affinity.exe: PPN: 16 OMP_NUM_THREADS: 1.
    dell_affinity.exe: AMD Interlagos processor detected.
    dell_affinity.exe: Placing OMP threads on separate modules.
    dell_affinity.exe: Placing MPI rank 0 on host bench local rank 0 cpulist 0 memlist 0
    dell_affinity.exe: Placing MPI rank 1 on host bench local rank 1 cpulist 4 memlist 0
    dell_affinity.exe: Placing MPI rank 2 on host bench local rank 2 cpulist 8 memlist 1
    dell_affinity.exe: Placing MPI rank 3 on host bench local rank 3 cpulist 12 memlist 1
    dell_affinity.exe: Placing MPI rank 4 on host bench local rank 4 cpulist 16 memlist 2
    dell_affinity.exe: Placing MPI rank 5 on host bench local rank 5 cpulist 20 memlist 2
    dell_affinity.exe: Placing MPI rank 6 on host bench local rank 6 cpulist 24 memlist 3
    dell_affinity.exe: Placing MPI rank 7 on host bench local rank 7 cpulist 28 memlist 3
    dell_affinity.exe: Placing MPI rank 8 on host bench local rank 8 cpulist 32 memlist 4
    dell_affinity.exe: Placing MPI rank 9 on host bench local rank 9 cpulist 36 memlist 4
    dell_affinity.exe: Placing MPI rank 10 on host bench local rank 10 cpulist 40 memlist 5
    dell_affinity.exe: Placing MPI rank 11 on host bench local rank 11 cpulist 44 memlist 5
    dell_affinity.exe: Placing MPI rank 12 on host bench local rank 12 cpulist 48 memlist 6
    dell_affinity.exe: Placing MPI rank 13 on host bench local rank 13 cpulist 52 memlist 6
    dell_affinity.exe: Placing MPI rank 14 on host bench local rank 14 cpulist 56 memlist 7
    dell_affinity.exe: Placing MPI rank 15 on host bench local rank 15 cpulist 60 memlist 7
34 Program examples: Cambridge
Single C6145 with AMD Interlagos, 8 MPI ranks, 8 OpenMP threads per rank:
    [dell-guest@bench open64_acml]$ mpirun -np 8 dell_affinity.exe -t 8 ~/martinh/bin/mpi.exe
    dell_affinity.exe: Using MVAPICH2.
    dell_affinity.exe: PPN: 8 OMP_NUM_THREADS: 8.
    dell_affinity.exe: AMD Interlagos processor detected.
    dell_affinity.exe: Placing MPI rank 0 on host bench local rank 0 cpulist 0,1,2,3,4,5,6,7 memlist 0
    dell_affinity.exe: Placing MPI rank 1 on host bench local rank 1 cpulist 8,9,10,11,12,13,14,15 memlist 1
    dell_affinity.exe: Placing MPI rank 2 on host bench local rank 2 cpulist 16,17,18,19,20,21,22,23 memlist 2
    dell_affinity.exe: Placing MPI rank 3 on host bench local rank 3 cpulist 24,25,26,27,28,29,30,31 memlist 3
    dell_affinity.exe: Placing MPI rank 4 on host bench local rank 4 cpulist 32,33,34,35,36,37,38,39 memlist 4
    dell_affinity.exe: Placing MPI rank 5 on host bench local rank 5 cpulist 40,41,42,43,44,45,46,47 memlist 5
    dell_affinity.exe: Placing MPI rank 6 on host bench local rank 6 cpulist 48,49,50,51,52,53,54,55 memlist 6
    dell_affinity.exe: Placing MPI rank 7 on host bench local rank 7 cpulist 56,57,58,59,60,61,62,63 memlist 7
35 LS-DYNA benchmark: neon_refined
- LS-DYNA mpp971 v5.1.1, Platform MPI
- Ran on PE R
- Architecture knowledge is key!
[Tables: wall clock time (s) per number of MPI ranks for three modes: as-is, Platform MPI pinning, dell_affinity]
36 Parallel optimization
37 Parallel optimization
A lot of attention is being paid to:
- InfiniBand networking buzzwords: fat-tree, multi-rail, QDR/FDR/EDR, non-blocking
- MPI library features: shared memory optimization, collective offloading, single-sided messaging, message buffering
Better to start at the root of the parallel performance.
38 Do these programs run efficiently?
[Profiles: LS-DYNA explicit and PARATEC]
39 Case study: PARATEC load balancing
- PARAllel Total Energy Code, developed at NERSC for ab initio electronic structure calculations in materials science
- Uses Density Functional Theory (DFT) to describe the electronic structure of a material (solid, crystal, metal)
- Knowing the electronic structure of a material tells you everything about its properties
- The electronic structure is described by wave functions, which (unfortunately) cannot be solved analytically
- Approach: expand the wave functions in plane waves (in Fourier space) and describe the nucleus of an atom with a pseudopotential
- 3D parallel Fourier transformations are needed to convert to real (Cartesian) space, and these are *very* expensive!
40 Benchmark setup
- Si (silicon) in diamond structure: 686 atoms, 7x7x7 cell, 1372 electronic bands
- Jobs ran at the Texas Advanced Computing Center on the Dell Linux cluster Lonestar: 1,888 Dell PowerEdge M610 blades, 22,656 Intel Xeon cores, Mellanox QDR InfiniBand, 1 PB Lustre parallel storage
- Used 196 cores for the calculations
41 Default g vector distribution
[Chart: computational time and MPI time (wall clock, seconds) per MPI rank, showing an uneven load]
- Computation time: 648 seconds
- Communication time: 276 seconds
- Communication %: 29.9%
- Load imbalance: 21.2%
42 Optimized g vector distribution
[Chart: computational time and MPI time (wall clock, seconds) per MPI rank, showing an even load]
- Speedup: 14.3%
- Computation time: 638 seconds
- Communication time: 154 seconds
- Communication %: 19.5%
- Load imbalance: 5.8%
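For reference, a minimal MPI sketch (not from the slides) of how per-rank compute time and a load-imbalance figure can be measured; the metric used here, (max - avg) / max, is one common definition and not necessarily the exact one behind the numbers above:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Time the compute phase on each rank */
        double t0 = MPI_Wtime();
        /* ... computational work of this rank goes here ... */
        double t_local = MPI_Wtime() - t0;

        double t_max, t_sum;
        MPI_Reduce(&t_local, &t_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
        MPI_Reduce(&t_local, &t_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0) {
            double t_avg = t_sum / size;
            double imb   = (t_max > 0.0) ? 100.0 * (t_max - t_avg) / t_max : 0.0;
            printf("max %.2f s, avg %.2f s, imbalance %.1f %%\n",
                   t_max, t_avg, imb);
        }

        MPI_Finalize();
        return 0;
    }

Instrumenting the compute and communication phases separately in this way gives the communication percentage and load-imbalance figures that the case study is based on.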
43 Conclusion
44 Conclusion
- Architecture knowledge is key to obtaining good scalability
- People concentrate on MPI optimization work but often forget load balancing issues
- Use system tools and profilers as standard practice!
45 Questions?