Experiences with HPC on Windows



Experiences with HPC on Windows
Christian Terboven, terboven@rz.rwth-aachen.de
Center for Computing and Communication, RWTH Aachen University
Windows Server Computing Summit 2008, April 7-11, HPI/Potsdam

Agenda
o High Performance Computing (HPC): OpenMP & MPI & Hybrid Programming
o Windows HPC @ Aachen: Hardware & Software, Deployment & Configuration
o Top500 submission
o Case Studies: Dynamic Optimization (AVT), Bevel Gears (WZL), Application & Benchmark Codes
o Summary

High Performance Computing (HPC)
o HPC begins when performance (runtime) matters! HPC is about reducing latency; the Grid increases latency. Dominating programming languages: Fortran (77 + 95), C++, C
o The focus in Aachen is on Computational Engineering Science!
o We have been doing HPC on Unix for many years, so why try Windows?
  Attract new users: third-party cooperations sometimes depend on Windows
  Some users look for the familiar desktop-like experience
  Windows is a great development platform
  We rely on tools: combine the best of both worlds!
  Top500 on Windows: why not?

Parallel Computer Architectures: Shared Memory
o Dual-core processors are common now: a laptop/desktop is a shared-memory parallel computer!
o Multiple processors have access to the same main memory.
  [Diagram: Intel Woodcrest-based system with two cores, per-core L1 caches, a shared L2 cache and main memory]
o Different types:
  Uniform memory access (UMA)
  Non-uniform memory access (ccNUMA), still cache coherent
o Trend towards ccNUMA
Note: Instruction caches, address caches (TLB), write buffers and prefetch buffers are ignored here, as data caches are most important for applications.

Programming for Shared Memory: OpenMP (www.openmp.org)
o The master thread forks a team of slave threads for the parallel region (serial part, parallel region, serial part):

  const int N = 100000;
  double a[N], b[N], c[N];
  [...]
  #pragma omp parallel for
  for (int i = 0; i < N; i++) {
      c[i] = a[i] + b[i];
  }
  [...]

o ... and other threading-based paradigms
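A complete, compilable version of the loop above, for readers who want to try it (the file name, array contents and build commands are illustrative, not from the slides):

// vecadd_omp.cpp -- e.g. g++ -fopenmp vecadd_omp.cpp, or cl /openmp vecadd_omp.cpp on Windows
#include <cstdio>
#include <vector>
#include <omp.h>

int main() {
    const int N = 100000;
    std::vector<double> a(N, 1.0), b(N, 2.0), c(N, 0.0);

    // The master thread forks slave threads; loop iterations are divided among the team.
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
    }

    std::printf("c[0] = %.1f, threads available: %d\n", c[0], omp_get_max_threads());
    return 0;
}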

Parallel Computer Architectures: Distributed Memory
o Distributed memory: each processor only has access to its own main memory
o Programs have to use an external network for communication
o Cooperation is done via message exchange
  [Diagram: two nodes, each with two cores, L1/L2 caches and local memory, connected by an external network]

Programming for Distributed Memory: MPI (www.mpi-forum.org)

  int rank, size, value = 23;
  MPI_Init(...);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank == 0) {
      MPI_Send(&value, 1, MPI_INT, rank+1, ...);
  } else {
      MPI_Recv(&value, 1, MPI_INT, rank-1, ...);
  }
  MPI_Finalize();

[Diagram: MPI process 1 (rank 0) sends one int to MPI process 2 (rank 1) via MPI_Send / MPI_Recv]
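For reference, a self-contained version of this send/receive pattern with the elided arguments filled in (the message tag, output and build/run commands are illustrative assumptions):

/* sendrecv.c -- e.g. mpicc sendrecv.c -o sendrecv; run with: mpiexec -n 2 sendrecv */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size, value = 23;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Rank 0 sends one integer to rank 1 (message tag 0). */
        MPI_Send(&value, 1, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Rank 1 receives the integer from rank 0. */
        MPI_Recv(&value, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank %d received value %d\n", rank, value);
    }

    MPI_Finalize();
    return 0;
}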

Programming Clusters of SMP Nodes: Hybrid
o Multiple levels of parallelism can improve scalability! One or more MPI tasks per node, one or more OpenMP threads per MPI task (see the sketch below)
o Many of our applications are hybrid: the best fit for clusters of SMP nodes.
o The SMP node will grow in the future!
  [Diagram: two SMP nodes connected by an external network; MPI between the nodes, OpenMP within each node]
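A minimal hybrid sketch combining both models, assuming an MPI library with thread support; the loop and its reduction are illustrative, only the MPI/OpenMP calls themselves are standard:

/* hybrid.c -- e.g. mpicc -fopenmp hybrid.c -o hybrid; run with one MPI task per SMP node */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank;
    /* Request funneled threading: only the master thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = 0.0;
    /* OpenMP parallelism inside each MPI task. */
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < 1000000; i++) {
        local += 1.0;
    }

    /* MPI parallelism across nodes: combine the per-task partial sums. */
    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("global sum = %.0f, threads per task: %d\n", global, omp_get_max_threads());

    MPI_Finalize();
    return 0;
}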

Speedup and Efficiency
o The Speedup denotes how much faster a parallel program is than the serial program: S_p = T_1 / T_p, where p is the number of threads / processes, T_1 the execution time of the serial program and T_p the execution time of the parallel program.
o The Efficiency of a parallel program is defined as: E_p = S_p / p
o The Speedup is at most p; according to Amdahl's law, the serial fraction of the program limits it even further.
Note: Other (similar) definitions for other needs are available in the literature.
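A small worked example with illustrative numbers (not taken from the slides):

\[
T_1 = 80\,\mathrm{s},\quad p = 8,\quad T_8 = 12.5\,\mathrm{s}
\;\Rightarrow\;
S_8 = \frac{T_1}{T_8} = 6.4,\qquad
E_8 = \frac{S_8}{p} = 0.8 = 80\,\%.
\]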

Agenda (continued): Windows HPC @ Aachen, Hardware & Software, Deployment & Configuration

Harpertown-based InfiniBand Cluster
o Recently installed Windows cluster:
  Fujitsu Siemens Primergy RS 200 S4 servers
  2x Intel Xeon 5450 (quad-core, 3.0 GHz)
  16 / 32 GB memory per node
  4x DDR InfiniBand (latency: 2.7 us, bandwidth: 1250 MB/s)
  In total: about 25 TFLOP/s peak performance (260 nodes)
o Frontend: Windows Server 2008
  Better performance when the machine gets loaded
  Faster file access: lower access time, higher bandwidth
  Automatic load balancing: three machines are clustered to form one interactive machine: cluster-win.rz.rwth-aachen.de
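The quoted system peak is consistent with the node specification together with the 4 floating-point results per cycle assumed on the later Linpack slide:

\[
260 \;\text{nodes} \times 8 \;\text{cores} \times 3.0\,\mathrm{GHz} \times 4\,\tfrac{\text{flops}}{\text{cycle}}
= 24\,960\,\mathrm{GFlop/s} \approx 25\,\mathrm{TFlop/s}.
\]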

Harpertown-based InfiniBand Cluster
o Quite a lot of pizza boxes; 4 cables per box:
  InfiniBand: MPI
  Gigabit Ethernet: I/O
  Management
  Power
[Pictures taken during installation]

Software Environment
o Complete development environment:
  Visual Studio 2005 Pro + Microsoft Compute Cluster Pack
  Subversion client, X-Win32, Microsoft SDKs
  Intel tool suite: C/C++ compiler, Fortran compiler, VTune Performance Analyzer, Threading Tools, MKL library, Threading Building Blocks, Intel MPI + Intel Tracing Tools
  Java, Eclipse, ...
o Growing list of ISV software:
  ANSYS, HyperWorks, Fluent, Maple, Mathematica, Matlab, MS Office 2003, MSC.Marc, MSC.Adams, ...
  Hosting of user-licensed software, e.g. GTM-X

ISV codes in the batch system
o Several requirements:
  No usage of graphical elements allowed
  No user input allowed (except redirected standard input)
  Execution of a compiled set of commands, then termination
o Exemplary usage instructions for Matlab; the slide annotations mark the software share, the GUI suppression, the output log and the change-dir-and-execute part of the command line:
  \\cifs\cluster\software\matlab\bin\win64\matlab.exe /minimize /nosplash /logfile log.txt /r "cd('\\cifs\cluster\home\your_userid\your_path'), YOUR_M_FILE"
  The .M file should contain quit; as its last statement.

Deploying a 256-node cluster
o Setting up the head node: approx. 2 hours
  Installing and configuring Microsoft SQL Server 2005
  Installing and configuring the Microsoft HPC Pack
o Installing the compute nodes:
  Installing 1 compute node: 50 minutes
  Installing n compute nodes: 50 minutes? (Multicasting!)
  We installed 42 nodes at once in 50 minutes

Deploying a 256-node cluster
o The deployment performance is only limited by the head node: 42 multicasting nodes were served by WDS on the head node
  [Screenshots: network utilization at 100%, CPU utilization at 100%]

Configuring a 256-node cluster
o Updates? Deploy an updated image to the compute nodes!

Agenda (continued): Top500 submission

The Top500 List (www.top500.org)
[Chart: performance from June 1993 to June 2006, comparing RWTH peak performance (Rpeak) and RWTH Linpack performance with TOP500 ranks 1, 50, 200 and 500, PC technology and Moore's Law]

The Top500 List: Windows clusters
o Why add just another Linux cluster? Top500 on Windows!
o Watch out for the Windows/Unix ratio in the upcoming Top500 list in June

The LINPACK benchmark
o The ranking is determined by running a single benchmark: HPL (www.netlib.org/hpl)
  Solve a dense system of linear equations
  Gaussian-elimination-type algorithm: (2/3) n^3 + O(n^2) floating-point operations
  Very regular problem: high performance / efficiency
o The result is not of high significance for our applications,
o but it is a pretty good sanity check for the system! Low performance: something is wrong!
o Steps to success: setup & sanity check, then Linpack parameter tuning
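For reference, the reported Linpack rate follows directly from that operation count (HPL conventionally counts 2/3 n^3 + 2 n^2 flops) divided by the measured solve time:

\[
R_{\max} \approx \frac{\tfrac{2}{3} n^3 + 2 n^2}{t_{\text{solve}}}.
\]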

Results of the initial sanity check
o The cluster had been in normal operation since the end of January
o Bandwidth test: variations from 180 MB/s to 1180 MB/s
  [Charts: measured bandwidth between nodes 1 to 9 and nodes 1 to 21]
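The slides do not show the test itself; a minimal point-to-point ping-pong of the kind typically used for such a bandwidth check could look like the sketch below (message size, repetition count and file name are illustrative assumptions):

/* pingpong.c -- e.g. mpicc pingpong.c -o pingpong; run with: mpiexec -n 2 pingpong */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    const int bytes = 4 * 1024 * 1024;   /* 4 MB messages */
    const int reps  = 100;
    int rank;
    char *buf = (char *)malloc(bytes);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double rtt = (MPI_Wtime() - t0) / reps;  /* average round-trip time */
    if (rank == 0)
        printf("avg round trip: %.3f ms, aggregate bandwidth: %.1f MB/s\n",
               rtt * 1e3, (2.0 * bytes / rtt) / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}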

Parameter Tuning
o Approach: performance analysis, tuning, analysis, tuning, analysis, tuning, ...
  [Chart: performance got better over time through system and parameter tuning]

Parameter Tuning: Going the Windows way
o An Excel application on a laptop controlled the whole cluster!
o We easily reached 76.69 GFlop/s on one node. Peak performance is 8 cores * 3 GHz * 4 results per cycle = 96 GFlop/s; that is 80% efficiency!
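Restating the node-level arithmetic from the slide:

\[
R_{\text{peak}} = 8 \times 3.0\,\mathrm{GHz} \times 4\,\tfrac{\text{flops}}{\text{cycle}} = 96\,\mathrm{GFlop/s},
\qquad
\frac{76.69}{96} \approx 0.80 = 80\,\%.
\]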

Windows HPC Server 2008 in Action
o Job startup comparison with 2048 MPI processes:
  Our Linux configuration: order of minutes
  Our Windows configuration: order of seconds
  [Video: job startup on Windows HPC Server 2008]

Agenda (continued): Case Studies - Dynamic Optimization (AVT), Bevel Gears (WZL), Application & Benchmark Codes

Case Study: Dynamic Optimization (AVT)
o Dynamic optimization in the chemical industry
  Task: changing the product specification (from plastic A to plastic B) of a plastics plant during ongoing business
  The mixture of A and B is unusable material: minimize this waste
  Search for the economically and ecologically optimal operational mode!
o This task is solved by the DyOS (Dynamic Optimization Software) tool, developed at the chair for process systems engineering (AVT: Aachener Verfahrenstechnik) at RWTH Aachen University.

Case Study: Dynamic Optimization (AVT)
o What is dynamic optimization? An example:
  Drive 250 m in minimal time!
  Start with v = 0, stop exactly at x = 250 m!
  Maximal speed is v = 10 m/s, maximal acceleration is a = 1 m/s².
  Constraints: v(t) <= v_max, x(0) = 0, v(0) = 0, v(t_f) = 0, x(t_f) = x_f; minimize t_f.
  [Plots: acceleration a(t) switching between +1 (MAX) and -1 (MIN) over roughly 30 s; speed v and distance x over time, with x reaching 250 m]
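Written out as an optimal control problem (a restatement of the constraints on the slide):

\[
\min_{a(\cdot)} \; t_f
\quad \text{s.t.} \quad
\dot{x}(t) = v(t),\;\;
\dot{v}(t) = a(t),\;\;
|a(t)| \le 1\,\mathrm{m/s^2},\;\;
0 \le v(t) \le 10\,\mathrm{m/s},
\]
\[
x(0) = 0,\;\; v(0) = 0,\;\; x(t_f) = 250\,\mathrm{m},\;\; v(t_f) = 0 .
\]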

Case Study: Dynamic Optimization (AVT)
o One simulation typically takes up to two weeks and requires a significant amount of memory (>4 GB): hence MPI parallelization
o Challenge: a commercial software component depends on VS6 and only one instance per machine is allowed, so it was outsourced into a DLL
  [Diagram: the root task of the MPI communicator runs DyOS, stores data, sends and receives data, and controls the entire software; the physical realisation is accessed through the gPROMS interface]

Case Study: Dynamic Optimization (AVT)
o A scenario consists of several simulations
o Projected compute time on a desktop: 5 scenarios at 2 months each = 10 months
o Compute time on our cluster (faster machines, multiple concurrent jobs): 5 scenarios at 1.5 months each = 3 months
o MPI parallelization led to a factor of 4.5 on 8 cores: 5 scenarios at 0.3 months each = 0.7 months
o Arndt Hartwich: "The stability of the RZ's compute nodes is significantly higher than the stability of our desktops."

Case Study: KegelToleranzen (WZL)
o Simulation of bevel gears
  Written in Fortran, using the Intel Fortran 10.1 compiler
  Very cache friendly: runs at high MFlop/s rates
  [Images: bevel gear pair, differential gear]
o Laboratory for Machine Tools and Production Engineering (WZL), RWTH Aachen

Case Study: KegelToleranzen (WZL)
o Target: Pentium/Windows/Intel to Xeon/Windows/Intel; serial tuning + parallelization with OpenMP
o Procedure:
  Get the tools: porting to UltraSparc IV/Solaris/Sun Studio
  Simulog Foresys: conversion to Fortran 90 (77,000 Fortran 77 lines became 91,000 Fortran 90 lines)
  Sun Analyzer: runtime analysis with different datasets; deduce targets for serial tuning and OpenMP parallelization
  OpenMP parallelization: 5 parallel regions, 70 directives
  Get the tools: porting the new code to Xeon/Windows/Intel
  Intel Thread Checker: verification of the OpenMP parallelization
o Put the new code into production on Xeon/Windows/Intel

Case Study: KegelToleranzen (WZL)
o Comparing Linux and Windows Server 2003:
  [Chart: performance of KegelToleranzen in MFlop/s for 1, 2, 4 and 8 threads on 4x dual-core Opteron, 2.2 GHz, under Linux 2.6 and Windows Server 2003; a 24% difference is annotated, higher MFlop/s is better]

Case Study: KegelToleranzen (WZL)
o Comparing Linux and Windows Server 2008:
  [Chart: performance of KegelToleranzen in MFlop/s for 1, 2, 4 and 8 threads; series: Linux 2.6 and Windows Server 2003 on 4x dual-core Opteron 2.2 GHz, Linux 2.6 and Windows Server 2008 on 2x quad-core Harpertown 3.0 GHz; a 7% difference is annotated, higher MFlop/s is better]

Case Study: KegelToleranzen (WZL)
o Comparing Linux and Windows Server 2008: what about the 96 GFlop/s per node?!? In fact, this application is rather fast; most HPC applications are waiting for memory!
o Performance gain for the user: speedup of 5.6 on one node, and even more relative to the starting point (desktop: 220 MFlop/s). MPI parallelization is work in progress.
  [Chart: performance of KegelToleranzen in MFlop/s for 1, 2, 4 and 8 threads; series: Linux 2.6 and Windows Server 2003 on 4x dual-core Opteron 2.2 GHz, Linux 2.6 and Windows Server 2008 on 4x quad-core Clovertown 3.0 GHz; a 7% difference is annotated, higher MFlop/s is better]

Agenda (continued): Summary

Invitation: Windows HPC User Group Meeting
o Invitation to the 1st meeting of the Windows High Performance Computing user group in the German-speaking region
  April 21/22, 2008, Center for Computing and Communication (Rechen- und Kommunikationszentrum), RWTH Aachen
  http://www.rz.rwth-aachen.de/winhpcug

Invitation: Windows HPC User Group Meeting
o Goals:
  Exchange of information among the users themselves
  Exchange of information between users and Microsoft
o Monday, April 21
  17:00: Guided tour of the Aachen cathedral
  18:30: Joint dinner
o Tuesday, April 22
  Presentations by the companies Allinea, Cisco, Intel, The MathWorks and Microsoft
  Presentations by institutions from research and teaching
  Question-and-answer session with experts from Microsoft

Summary & Outlook
o The Windows HPC environment has been well accepted: growing interest in and need for compute power on Windows.
o Windows HPC Server 2008 (beta) is pretty impressive!!!
  Deploying and configuring 256 nodes.
  Job startup and job management.
  Performance improvements & Linpack efficiency.
o Interoperability in heterogeneous environments got easier, e.g. WDS can interoperate with Linux DHCP and PXE boot.
o Performance: need for Windows-specific tuning.

The End. Thank you for your attention!