Paul's Norwegian Vacation (or: Experiences with Cluster Computing)
Paul Sack
20 September 2002
sack@stud.ntnu.no
www.stud.ntnu.no/~sack/
Outline
- Background information
- Work on clusters
- Profiling tools
- Porting Fortran 90 code
- Performance analysis
- MPI broadcast w/ multicast
IDI Cluster: Clustis

Description of Clustis:
- 1.46 GHz AMD processors, 2 GB RAM, 40 GB drives
- Uses the OpenPBS batch system (a PBS implementation)

Classification of nodes:
- clustis: master node
- node01-node08: interactive use
- node09-node40: batch use

Several queues of different priorities:
- privileged users
- guest users
- sif80bi users
Documentation

qmpirun script:
- Usual MPI invocation: mpirun -np 4 program
- To use the batch system: qmpirun -np 4 program
- Output saved in program.out and program.err
- Creates and submits a shell script for the batch system (a sketch follows)
- Cleans up after itself when finished
- mpirun modified to support OpenPBS

Itanium cluster:
- Two IA-64 nodes and a master
- OpenPBS and the qmpirun script set up as for Clustis
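The generated script itself is not shown in the slides; as a rough illustration only, a qmpirun-style wrapper might emit and submit an OpenPBS job script along these lines (every directive and name here is my guess, not the actual script):

    #!/bin/sh
    # Hypothetical sketch of the job script a qmpirun-like wrapper
    # could generate for OpenPBS; the real script is not shown.
    #PBS -N program                  # job name
    #PBS -l nodes=4                  # one node per MPI process
    #PBS -o program.out              # stdout, as described above
    #PBS -e program.err              # stderr
    cd $PBS_O_WORKDIR                # run from the submission directory
    mpirun -np 4 -machinefile $PBS_NODEFILE ./program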
Profiling Tools: Vampir
- Produced by Pallas GmbH, in Cologne and Dortmund
- Easiest to use: wrapper scripts for mpicc and mpif90
- Simple, effective design: static libraries
- Shows individual MPI calls during a timeslice
- Shows aggregate statistics at varying levels of detail
- Expensive: 75 000 kroner for Clustis
Profiling Tools: Paradyn
- Written at the University of Wisconsin
- Difficult to use: non-intuitive GUI
- Complex design: run-time code patching
- Shows even more data than Vampir
- Does not work in batch environments
- Impossible to compile
- Inexpensive: free
Profiling Tools: Jumpshot
- Written at various universities and research centres in the USA
- Comes with MPICH
- Simple design: static libraries
- Has years of testing behind it and works well
- Limited data: no aggregate statistics
- Also free
Porting Fortran 90 Code

Compilers are not standard-compliant:
- Most compilers both fall short of the standard and extend beyond it
- This step took the longest in the project

SGI compiler:
- SGI supercomputers are used for scientific computing
- Scientists and mathematicians like Fortran
- Hence, SGI writes a Fortran 90 compiler that produces very optimized code
- The code was originally written on an SGI
NAGWare compiler (NTNU site-licensed):
- Produces faulty code for array-reshaping operations
- Debugger segfaults
- NAGWare acknowledges the bug, recommends upgrading (!)

Intel compiler:
- Intel also produces supercomputers
- Good compiler, but strictly standard-conformant, and my code has some obsolete Fortran 77 syntax
- Free!
Portland Group compiler (Bergen and Linköping):
- Mishandles 3D-array reshape calls
- Successfully rewrote these as a series of 2D-array reshape calls
- Sometimes segfaults

No GNU Fortran 90 compiler yet
Getting MPICH to work with Fortran 90
- In C there is a simple mapping between function/variable names and symbol names in object files (foo() becomes the symbol foo)
- Fortran 90 compilers sometimes add one underscore and sometimes two (foo() becomes foo_ or foo__)
- The MPICH configure script is supposed to handle this, but doesn't work
- Since Fortran 90 is mostly used for scientific computing, the Portland Group provides a custom configure script, which does work
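To make the mangling problem concrete: a C library that wants to be callable from Fortran must export a symbol matching whatever the Fortran compiler emits. A minimal sketch (foo is a hypothetical routine; a real configure script detects one convention rather than exporting all three):

    /* The real implementation, in C. Fortran passes arguments by
       reference, hence the pointer parameter. */
    void foo_impl(int *n) { *n = 42; }

    /* One alias per Fortran name-mangling convention. */
    void foo(int *n)   { foo_impl(n); }  /* no underscore           */
    void foo_(int *n)  { foo_impl(n); }  /* one trailing underscore */
    void foo__(int *n) { foo_impl(n); }  /* two trailing underscores */

If the library exports only foo_ but the compiler emits foo__, the link fails with an undefined symbol, which is exactly the class of failure the configure script is meant to prevent.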
Porting
- Used the Portland Group compiler
- Joakim Hove and Knut Petter assisted me in correcting many errors
- Most of the errors were in extraneous portions of the code and could be commented out
- By comparing the code with standard-compliant code, and with a bit of guesswork, I managed to get all of the necessary modules to compile without error
Profiling: Systems
- Cluster in Bergen: 32 dual 1.26 GHz Pentium IIIs, 100-megabit network
- Cluster in Linköping: 33 900-MHz Pentium IIIs, 100-megabit and SCI (full-duplex gigabit) networks
- Supercomputer (embla): 512 600-MHz MIPS processors
- Did not use Clustis
Description of Program
- Simulates the behaviour of subatomic particles at near-absolute-zero temperatures under an electromagnetic field
- A three-dimensional parallelepiped is divided between the processes
- One thousand iterations; each iteration has a separate communication phase and a computation phase (sketched below)

Initial hypothesis:
- Cluster better for a small number of processes (faster processors)
- Supercomputer better for a large number of processes (faster interconnect network)
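The slides only name the two phases; as a rough sketch of what such an iteration loop typically looks like, here is a 1-D slab decomposition with one ghost plane per side (the real program is 3-D and is not shown; `u`, `nlocal`, `plane`, `left`, and `right` are hypothetical names, with MPI_PROC_NULL as the neighbour at the domain ends):

    #include <mpi.h>

    /* Communicate-then-compute iteration over a local slab of
       (nlocal + 2) planes of `plane` points each. */
    static void iterate(double *u, int nlocal, int plane,
                        int left, int right, int iters)
    {
        for (int it = 0; it < iters; it++) {
            /* Communication phase: exchange boundary planes with
               both neighbours. */
            MPI_Sendrecv(u + plane, plane, MPI_DOUBLE, left, 0,
                         u + (nlocal + 1) * plane, plane, MPI_DOUBLE,
                         right, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Sendrecv(u + nlocal * plane, plane, MPI_DOUBLE, right, 1,
                         u, plane, MPI_DOUBLE, left, 1,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            /* Computation phase: update the interior points. */
            for (int i = plane; i < (nlocal + 1) * plane; i++)
                u[i] += 0.0; /* placeholder for the real update */
        }
    }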
Analysis of Problem Size
- The number of messages passed is proportional to the internal surface area
- The amount of computation is proportional to the volume
- I expected the program to perform better per unit volume with larger problem sizes

[Figure: time per unit volume, relative to the time for 8x8x8, vs. problem size (5-35), for Embla (ITEA), Clustis (IDI), and Fire (UiB)]
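Spelling out the surface-to-volume argument: for an N x N x N problem, with machine-dependent constants a (compute cost per point) and b (communication cost per boundary point) that are my notation rather than the slides',

    T(N) \approx a N^3 + b N^2
    \quad\Longrightarrow\quad
    \frac{T(N)}{N^3} \approx a + \frac{b}{N},

so the time per unit volume falls toward the pure computation cost a as N grows, which matches the downward trend in the figure.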
Profiling Results: Speedup

In this test, the problem size is held constant while the number of processes varies.

[Figure: run times (seconds) vs. number of processes (0-16), for the Linköping cluster, the Bergen cluster, and the supercomputer]
[Figure: speedup (time for one processor divided by time for n processors) vs. number of processes, for the Linköping cluster, the Bergen cluster, and the supercomputer]

[Figure: time relative to the supercomputer's vs. number of processes, for the Linköping and Bergen clusters]
Profiling Results: Speedup

Observations:
- Unexpectedly, the supercomputer outperforms the clusters for a small number of processes
  - The SGI compiler produces more optimized code
  - The MIPS CPUs have larger caches
- With a large number of processes, the two clusters' performance is approximately equal
Profiling Results: Scaleup

In this test, the problem size grows in proportion to the number of processes; in other words, the problem size per process is kept constant.

[Figure: run times (seconds) vs. number of processes (2-16), for the Linköping cluster, the Bergen cluster, and the supercomputer]
[Figure: scaled speedup (time for one processor, times problem size, divided by time for n processors) vs. number of processes, for the Linköping cluster, the Bergen cluster, and the supercomputer]

[Figure: time relative to the supercomputer's vs. number of processes, for the Linköping and Bergen clusters]
Profiling Results: Scaleup

Observations:
- The clusters fare much better in this test
  - The Bergen cluster's performance is within 20% of the supercomputer's
  - The Linköping cluster's performance is within 80% of the supercomputer's
- This is a more realistic test, since each process gets a reasonable amount of work
- All the systems exhibit better speedup than in the previous test
Profiling Results: Conclusions
- Clusters perform quite well, considering their cost
- Middle ground: clusters with proprietary gigabit networks
- Could not get the SCI (gigabit) network to work
MPI Broadcast w/ Multicast
- With unicast (TCP/IP), broadcast is O(log₂ n)
- With multicast, it is O(1)
- But multicast uses UDP, which is unreliable

[Diagram: unicast broadcast as a tree over steps t=0, t=1, t=2, versus multicast reaching every receiver at t=0]
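For concreteness, here is a minimal sketch of the O(log₂ n) pattern the slide refers to: a binomial-tree broadcast built from point-to-point MPI calls. This is my illustration, not MPICH's actual implementation, and the root is fixed at rank 0 for simplicity:

    #include <mpi.h>

    /* Binomial-tree broadcast from rank 0: each process first
       receives from its parent, then forwards to its children. */
    void tree_bcast(void *buf, int count, MPI_Datatype type, MPI_Comm comm)
    {
        int rank, size, mask = 1;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        /* Find my parent (the rank that differs in the lowest set bit)
           and receive from it; rank 0 has no parent. */
        while (mask < size) {
            if (rank & mask) {
                MPI_Recv(buf, count, type, rank - mask, 0, comm,
                         MPI_STATUS_IGNORE);
                break;
            }
            mask <<= 1;
        }
        /* Forward to my children in the remaining steps. */
        mask >>= 1;
        while (mask > 0) {
            if (rank + mask < size)
                MPI_Send(buf, count, type, rank + mask, 0, comm);
            mask >>= 1;
        }
    }

At each step, every process that already holds the data forwards it to one new process, so the number of reached processes doubles per step, giving ceil(log₂ n) steps in total.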
Reliability protocol

[Diagram: the sending process multicasts original messages to the receiving processes; receivers send acknowledgments to a reliability daemon, which handles retransmitted messages]
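The slides give only the diagram, so the following is one plausible shape for such a protocol rather than the actual one: the sender numbers each packet and buffers it until every receiver has acknowledged it, and the daemon retransmits anything that times out. All names here are hypothetical:

    #include <string.h>
    #include <sys/socket.h>

    #define MSG_SIZE 600   /* message size used in the experiments */
    #define WINDOW    64   /* in-flight packets kept for retransmission */

    struct packet {
        unsigned seq;                 /* sequence number */
        char payload[MSG_SIZE];
    };

    struct sender {
        struct packet window[WINDOW]; /* unacknowledged packets */
        unsigned acks[WINDOW];        /* ACK count per window slot */
        unsigned next_seq;            /* next sequence number */
        int nreceivers;               /* ACKs needed to retire a packet */
        int sock;                     /* UDP socket */
        struct sockaddr *group;       /* multicast group address */
        socklen_t grouplen;
    };

    /* Multicast one message and buffer it until fully acknowledged. */
    static void send_reliable(struct sender *s, const char *data)
    {
        struct packet *p = &s->window[s->next_seq % WINDOW];
        p->seq = s->next_seq++;
        memcpy(p->payload, data, MSG_SIZE);
        s->acks[p->seq % WINDOW] = 0;
        sendto(s->sock, p, sizeof *p, 0, s->group, s->grouplen);
    }

    /* Called for each ACK. A window slot may be reused once all
       receivers have acknowledged it; on timeout, the daemon would
       resend window[seq % WINDOW] to the lagging receiver. */
    static void handle_ack(struct sender *s, unsigned seq)
    {
        s->acks[seq % WINDOW]++;
    }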
Multicast: [figure]
Unicast: [figure]
Results

[Figure: time (seconds) vs. number of processes (0-20), sending 2000 600-byte messages from a single source, multicast vs. unicast]

[Figure: time (seconds) vs. number of processes (0-20), sending 2000 600-byte messages from all sources, multicast vs. unicast]