64-Bit versus 32-Bit CPUs in Scientific Computing

Similar documents

Benchmarking Large Scale Cloud Computing in Asia Pacific

Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms

Lattice QCD Performance. on Multi core Linux Servers

Generations of the computer. processors.

Clusters: Mainstream Technology for CAE

Performance Characteristics of a Cost-Effective Medium-Sized Beowulf Cluster Supercomputer

High Performance. CAEA elearning Series. Jonathan G. Dudley, Ph.D. 06/09/ CAE Associates

Virtuoso and Database Scalability

1 Bull, 2011 Bull Extreme Computing

Current Trend of Supercomputer Architecture

Enabling Technologies for Distributed Computing

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi

x64 Servers: Do you want 64 or 32 bit apps with that server?

Recommended hardware system configurations for ANSYS users

Microsoft Windows Server 2003 with Internet Information Services (IIS) 6.0 vs. Linux Competitive Web Server Performance Comparison

CPU Benchmarks Over 600,000 CPUs Benchmarked

Multi-Threading Performance on Commodity Multi-Core Processors

PassMark - CPU Mark Multiple CPU Systems - Updated 17th of July 2012

Enabling Technologies for Distributed and Cloud Computing

Oracle Database Scalability in VMware ESX VMware ESX 3.5

Multicore Parallel Computing with OpenMP

Adaptec: Snap Server NAS Performance Study

Scaling in a Hypervisor Environment

A New, High-Performance, Low-Power, Floating-Point Embedded Processor for Scientific Computing and DSP Applications

LS DYNA Performance Benchmarks and Profiling. January 2009

Performance Metrics and Scalability Analysis. Performance Metrics and Scalability Analysis

Oracle Database Reliability, Performance and scalability on Intel Xeon platforms Mitch Shults, Intel Corporation October 2011

Improved LS-DYNA Performance on Sun Servers

For designers and engineers, Autodesk Product Design Suite Standard provides a foundational 3D design and drafting solution.

GPU System Architecture. Alan Gray EPCC The University of Edinburgh

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

How To Build A Supermicro Computer With A 32 Core Power Core (Powerpc) And A 32-Core (Powerpc) (Powerpowerpter) (I386) (Amd) (Microcore) (Supermicro) (

Optimization of Cluster Web Server Scheduling from Site Access Statistics

GPU Hardware and Programming Models. Jeremy Appleyard, September 2015

SUBJECT: SOLIDWORKS HARDWARE RECOMMENDATIONS UPDATE

Cluster Computing at HRI

Computer Architectures

Comparing the performance of the Landmark Nexus reservoir simulator on HP servers

Unit 4: Performance & Benchmarking. Performance Metrics. This Unit. CIS 501: Computer Architecture. Performance: Latency vs.

ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM

Understanding the Benefits of IBM SPSS Statistics Server

OpenPower: IBM s Strategy for Best of Breed 64-bit Linux

Week 1 out-of-class notes, discussions and sample problems

Quantifying Hardware Selection in an EnCase v7 Environment

Communicating with devices

IOS110. Virtualization 5/27/2014 1

Altix Usage and Application Programming. Welcome and Introduction

LSN 2 Computer Processors

WHITE PAPER FUJITSU PRIMERGY SERVERS PERFORMANCE REPORT PRIMERGY BX620 S6

Performance of the JMA NWP models on the PC cluster TSUBAME.

Amazon EC2 XenApp Scalability Analysis

The PHI solution. Fujitsu Industry Ready Intel XEON-PHI based solution. SC Denver

Logical Operations. Control Unit. Contents. Arithmetic Operations. Objectives. The Central Processing Unit: Arithmetic / Logic Unit.

Building Clusters for Gromacs and other HPC applications

Choosing a Computer for Running SLX, P3D, and P5

PARALLEL & CLUSTER COMPUTING CS 6260 PROFESSOR: ELISE DE DONCKER BY: LINA HUSSEIN

Dual-Core Intel Xeon Processor 5100 Series

GPUs for Scientific Computing

FLOW-3D Performance Benchmark and Profiling. September 2012

Seradex White Paper. Focus on these points for optimizing the performance of a Seradex ERP SQL database:

AMD Opteron Quad-Core

High Performance Computing

Legal Notices Introduction... 3

System Requirements Table of contents

and RISC Optimization Techniques for the Hitachi SR8000 Architecture

Parallel Programming Survey

HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK

The Central Processing Unit:

Introduction to Cloud Computing

Introduction 1 Performance on Hosted Server 1. Benchmarks 2. System Requirements 7 Load Balancing 7

¹ Autodesk Showcase 2016 and Autodesk ReCap 2016 are not supported in 32-Bit.

Building an Inexpensive Parallel Computer

Lecture 1: the anatomy of a supercomputer

Itanium 2 Platform and Technologies. Alexander Grudinski Business Solution Specialist Intel Corporation

Chapter Introduction. Storage and Other I/O Topics. p. 570( 頁 585) Fig I/O devices can be characterized by. I/O bus connections

System Requirements. Autodesk Building Design Suite Standard 2013

Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui

Performance evaluation

Liferay Portal Performance. Benchmark Study of Liferay Portal Enterprise Edition

Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o

A Brief Survery of Linux Performance Engineering. Philip J. Mucci University of Tennessee, Knoxville

Binary search tree with SIMD bandwidth optimization using SSE

Accelerating CFD using OpenFOAM with GPUs

Packet Capture in 10-Gigabit Ethernet Environments Using Contemporary Commodity Hardware

White Paper. Recording Server Virtualization

Performance Comparison of Fujitsu PRIMERGY and PRIMEPOWER Servers

How To Write A Parallel Computer Program

Fusionstor NAS Enterprise Server and Microsoft Windows Storage Server 2003 competitive performance comparison

StACC: St Andrews Cloud Computing Co laboratory. A Performance Comparison of Clouds. Amazon EC2 and Ubuntu Enterprise Cloud

Autodesk Building Design Suite 2012 Standard Edition System Requirements... 2

Kashif Iqbal - PhD Kashif.iqbal@ichec.ie

Transcription:

64-Bit versus 32-Bit CPUs in Scientific Computing Axel Kohlmeyer Lehrstuhl für Theoretische Chemie Ruhr-Universität Bochum <axel.kohlmeyer@rub.de> March 2004

1/25 Outline 64-Bit and 32-Bit CPU Examples 64-Bit and Scientific Computing What is the best Machine (for me)? Summary

2/25 What is a 64/32-Bit CPU? CPUs have (several, separate) sub-units: integer, floating-point, SIMD/vector, load/store, caches/buffers Size of integer registers usually defines bitness there are pure and hybrid 64/32-Bit CPUs. IEEE754 double precision floating-point is 64-bit.

3/25 Example: DEC/Compaq/HP Alpha oldest widely available 64-bit platform (RISC) medium clock rates, high floating-point performance up to 256-bit wide memory interface SMP via elaborate crossbar switches typical Workstation CPU

4/25 Example: Intel Itanium completely new 64-bit design (EPIC) low clockrate, very high floating-point speed 32-bit via emulation only architecture offloads optimizations to the compiler SMP via processor bus

5/25 Example: Intel Pentium 4 / Xeon most common 32-bit platform (CISC) instructions to access up to 64GB memory extremely high clock rates SMP via processor bus SSE unit for floating point vector operations AMD Athlon similar: lower clock, twin floating-point

6/25 Example: AMD Athlon64/Opteron evolutionary 64-bit design (extension to 32-bit) memory controller in CPU integrated SMP via point-to-point connect hybrid 32-/64-bit use in hardware possible

7/25 Summary: Architectures different architecture strategies each platform has different characteristics CPU only part of the platform 64-bitness only minor aspect simple comparison of CPU only not useful

8/25 Demands of Scientific Computing Ex.: Fluid Dynamics, Molecular Dynamics, Quantum Chemistry/Physics,... mainly floating-point intensive high memory bandwidth for large datasets Linear Algebra, Fast-Fourier-Transform Parallelization (MPI, OpenMP, Multi-threading)

9/25 Impact of change to 64-bit CPU o intrinsic floating-point performance bitness independent o programming language mostly bitness independent + larger address space compute larger problems + new architecture no legacy support required larger registers more cache contamination new compiler, new performance libraries needed, new tuning tricks need to be learned

10/25 Benchmarking absolutely essential needs to be done by people who know the application, the hardware, and the operating system needs several representative benchmarks careful with artificial benchmarks or marketing myths

11/25 Some Benchmark Results Car-Parrinello Molecular Dynamics Simulation linear algebra (matrix multiplication) + FFT high memory bandwidth medium to large memory MPI-parallel (MIMD) + OpenMP high network throughput for MPI-parallel runs uses BLAS/LAPACK a lot

12/25 CPMD Serial Runs (10 Ryd) Machine Wall Time / s AMD Athlon XP1600+, 1.4GHz, PC133 545 AMD Athlon MP1600+, 1.4GHz, PC266-ECC 443 Compaq Alpha EV6, 600MHz, XP1000 435 HP SuperDome 32000, HPPA 8700,750MHz 388 AMD Athlon XP2500+, 1.83GHz, PC333 361 Compaq Alpha EV67, 677MHz, ES40 284 AMD Opteron, 1.6GHz, 32-bit 287 AMD Opteron, 1.6GHz, 64-bit 254 Intel P4 Xeon, 2.4GHz, PC266 236 Compaq Alpha EV68AL, 833MHz, DS20 234 Intel Itanium2, 900MHz, HP zx6000 206 AMD Athlon64 3200+, 2.0GHz, PC333, 64-bit 173 IBM Power4+ 1.7 GHz, Regatta H+ 171

13/25 CPMD Serial Runs (30/50 Ryd) Machine Wall Time / s AMD Athlon XP1600+, 1.4GHz, PC133 2878 HP SuperDome 32000, HPPA 8700, 750MHz 2672 Compaq Alpha EV6, 600MHz, XP1000 2624 AMD Opteron, 1.6GHz, PC266 memory, 64-bit 1292 Intel Pentium 4 Xeon, 2.4GHz, 1275 AMD Opteron, 1.6GHz, PC266 memory, 32-bit 1157 IBM Power4+ 1.7 GHz, Regatta H+ 997 AMD Athlon XP1800+, 1.53GHz, PC266 5878 AMD Athlon XP2500+, 1.83GHz, PC333 5196 AMD Athlon XP2500+, 1.83GHz, dual-channel PC333 3848 Intel Itanium2, 900MHz, HP zx6000 3145 AMD Opteron, 1.6GHz, PC266 memory, 32-bit 3143 AMD Athlon64 3200+, 2.0GHz, PC333, 64-bit 3134 IBM Power4+ 1.7 GHz, Regatta H+, 2259

14/25 SMP Overhead one serial run per CPU simulataneously relative speed compared to a single cpu run does not account for multi-threading latencies Machine relative SMP speed Quad Pentium 4 Xeon, 2.4GHz 31% Dual Pentium 4 Xeon, 2.4GHz 61% Dual AMD Athlon MP1800+, 1.53GHz 69% Dual AMD Athlon MP1600+, 1.4GHz 73% Dual Compaq Alpha EV68AL, 833MHz, DS20 88% Dual Intel Itanium2, 900MHz, HP zx6000 89% Quad Compaq Alpha EV67, 667MHz, ES40 90% AMD Opteron, 1.6GHz, PC266, 32-bit, 98% AMD Opteron, 1.6GHz, PC266, 64-bit 98%

15/25 Program Differences: CPMD vs.cmb2 CPMD: Machine Wall Time / s AMD Athlon XP1600+, 1.4GHz, PC133 2878 Compaq Alpha EV6, 600MHz, XP1000 2624 AMD Athlon XP1800+, 1.53GHz, 2136 Compaq Alpha EV68AL, 833MHz, DS20 1519 AMD Opteron, 1.6GHz, 64-bit 1292 Intel Itanium2, 900MHz, HP zx6000 1158 AMD Opteron, 1.6GHz, 32-bit 1157 CMB2: Machine Wall Time / s AMD Athlon XP1600+, 1.4GHz, PC133 1914 AMD Athlon XP2500+, 1.83GHz, PC333 1782 Compaq Alpha EV6, 600MHz, XP1000 1740 Intel Itanium2, 900MHz, HP zx6000 1266 Compaq Alpha EV68AL, 833MHz, DS20 1254 AMD Opteron, 1.6GHz, 64-bit 1194 AMD Opteron, 1.6GHz, 32-bit 1080

16/25 Library Optimizations 100 steps CP-MD: 63 Si-Atoms, 10Ryd generic specific Machine BLAS ATLAS ATLAS Athon XP1800+ 950 s 428 s 378 s 251% 113% 100% Pentium IV 2GHz 765 s 493 s 441 s 173% 112% 100% P4 Xeon 2.4GHz 471 s 316 s 276 s 171% 118% 100% Pentium M 900MHz 716 s 430 s -

17/25 Networking Impact scalability limit of the application scalability limit of the network throughput / peak performance considerations price / performance ratio considerations I/O performance limits SMP performance limits

18/25 Si_63 bulk / PBC / 70 Ryd Wall Time per Timestep [s] 100 80 60 40 10 5 100Mbit / Single Athlon XP1600+ 1000Mbit / Single Athlon 1.3GHz 1000Mbit / Dual Athlon MP1800+ / MPI+OpenMP 1000Mbit / Dual Athlon MP1800+ / MPI 1000Mbit / Single Athlon MP1800+ / MPI SCI / Dual Athlon MP1600+ / MPI SCI / Dual Athlon MP1600+ / MPI+OpenMP SCI / Single Athlon MP1600+ / MPI 20 40 60 80 100 120 20 0 0 10 20 Number of CPUs Wed Aug 27 11:22:56 2003 axel.kohlmeyer@theochem.ruhr-uni-bochum.de http://www.theochem.ruhr-uni-bochum.de/go/cpmd-bench.html

19/25 Si_63 bulk / PBC / 70 Ryd Cumulative Wall Time per Timestep [s] 300 200 100 100Mbit / Single Athlon XP1600+ 1000Mbit / Single Athlon 1.3GHz 1000Mbit / Dual Athlon MP1800+ / MPI+OpenMP 1000Mbit / Single Athlon MP1800+ / MPI 1000Mbit / Dual Athlon MP1800+ / MPI SCI / Dual Athlon MP1600+ / MPI SCI / Single Athlon MP1600+ / MPI likewise, but CPUs counted twice 10 20 30 40 Number of CPUs Wed Aug 27 11:22:56 2003 axel.kohlmeyer@theochem.ruhr-uni-bochum.de http://www.theochem.ruhr-uni-bochum.de/go/cpmd-bench.html

20/25 I/O Performance Conventional SCF Quantum Chemistry Program creating 2 * 9 = 18 GByte Integral Files. Machine Disk Cputime Wall Time Athlon64 2.0GHz SCSI 10k rpm 74.0 min 82.5 min Athlon64 2.0GHz IDE 7.2k rpm 73.5 min 81.0 min Athlon64 2.0GHz IDE RAID-0 74.0 min 75.0 min AthlonXP 1.53GHz IDE RAID-0 (old) 105 min 123 min Athlon 650MHz IDE RAID-0 254 min 255 min

21/25 I/O Performance Conventional SCF Quantum Chemistry Program 16 SCF Iterations using the Integral Files. Machine Disk Cputime Wall Time Athlon64 2.0GHz SCSI 10k rpm 58.5 min 160.0 min Athlon64 2.0GHz IDE 7.2k rpm 58.5 min 128.0 min Athlon64 2.0GHz IDE RAID-0 60.5 min 77.0 min AthlonXP 1.53GHz IDE RAID-0 (old) 114 min 148 min Athlon 650MHz z IDE RAID-0 249 min 266 min

22/25 What else does matter? reliability: 2nd/3rd fastest components more reliable. availability of OS updates, compiler and optimized libraries optimize for the common case, not to fit all demands best (small) special machine for special uses only get (and pay for) real Hardware support or do-it-yourself and get a larger machine

23/25 Code Optimizations Fortran90 vs Fortran77 + BLAS (C++ vs C) ease of use vs. absolute speed prefer generic optimizations to specific search for better algorithms identify performance bottlenecks check accuracy (danger of overoptimizing)

24/25 Summary Bigger is not always better! 32-bit or 64-bit does not really matter, unless you need the address space determine requirements of the dominant application(s). representative benchmarks to find best architecture weakest link in chain determines total performance always consider the whole architecture (cpu, memory, i/o). factor in future software support requirements (compiler, optimized libraries, parallization support).

25/25 Thanks RRZE Erlangen FZ-Jülich Ruhr-Universität Bochum Prof. Domink Marx, Prof. Volker Staemmler, Dr. Bernd Meyer, Dipl.-Chem. Holger Langer http://www.theochem.rub.de/~axel.kohlmeyer/ axel.kohlmeyer@rub.de