High Performance Compute Cluster
1 High Performance Compute Cluster Overview for Researchers and Users Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 1
2 Copyrights, Licenses and Acknowledgements Dell, PowerEdge, PowerVault and PowerConnect are recognised trademarks of Dell Corporation. Intel, Intel Xeon and QPI are recognised trademarks of Intel Corp. AMD Opteron is a recognised trademark of Advanced Micro Devices, Inc. The Whamcloud logo and some course content are reproduced with permission from Whamcloud Inc. Lustre and Lustre file system are recognised trademarks of Xyratex. Other product names, logos, brands and other trademarks referred to within this documentation, as well as other products and services, are the property of their respective trademark holders. All rights reserved. These trademark holders are not affiliated with Alces Software, our products, or our services, and may not sponsor or endorse our materials. This material is designed to support an instructor-led training schedule and is for reference purposes only. This documentation is not designed as a stand-alone training tool; example commands and syntax are intended to demonstrate functionality in a training workshop environment and may cause data loss if executed on live systems. Remember: always take backups of any valuable data. This documentation is provided as-is and without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose. The Alces Software HPC Cluster Toolkit is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. These packages are distributed in the hope that they will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose. See the GNU Affero General Public License for more details. A copy of the GNU Affero General Public License is distributed along with this product. For more information on Alces Software, please visit the Alces Software website. Please support software developers wherever they work: if you use Open Source software, please consider contributing to the maintaining organisation or project, and crediting their work as part of your research publications. This work is Copyright Alces Software Ltd. All Rights Reserved. Unauthorized re-distribution is prohibited except where express written permission is supplied by Alces Software. Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 2
3 Agenda HPC cluster technology overview Getting started Cluster access methods (CLI, graphical, web-portal) Data storage (access, retrieval, quotas, performance) Software development environment Using compilers, libraries and MPIs Software application services Job scheduler system Principles and benefits Creating job scripts, submitting jobs and getting results Special cases, tuning Documentation review Where to get help Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 3
4 HPC Cluster Technology Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 4
5 Beowulf Architecture Loosely-coupled Linux Supercomputer Efficient for a number of use-cases Embarrassingly parallel / single-threaded jobs SMP / multi-threaded, single-node jobs MPI / parallel multi-node jobs Very cost effective HPC solution Commodity X86_64 server hardware and storage Linux based operating system Specialist high-performance interconnect and software Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 5
6 Beowulf Architecture Scalable architecture Dedicated management and storage nodes Dedicated login node pair to provide user access Multiple compute node types for different jobs Standard memory compute nodes (4GB/core) High memory compute nodes (16GB/core) Very high memory SMP node (1TB RAM) Multiple storage tiers for user data 2TB local scratch disk on every node 55TB shared Lustre parallel filesystem 4TB home-directory / archive storage Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 6
7 Cluster facilities Dual headnodes Dual login nodes Resilient networks Dual Infiniband core Dual Lustre servers Compute 20 x standard nodes / 320 cores 16 x high-memory nodes / 256 cores 1 x very high memory node / 1TB RAM Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 7
8 User facilities Modern 64-bit Linux operating system Compatible with a wide range of software Pre-installed with tuned HPC applications Compiled for the latest CPU architecture Comprehensive software development environment C, C++, Fortran, Java, Ruby, Python, Perl, R Modules environment management Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 8
9 User facilities High throughput job-scheduler for increased utilisation Resource request limits to protect running jobs Fair-share for cluster load-balancing Resource reservation with job backfilling Multiple user-interfaces supporting all abilities Command-line access Graphical desktop access Web portal for job monitoring, file transfer and accounting Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 9
10 HPC Cluster Hardware Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 10
11 Service node configuration Head and storage nodes are Dell PowerEdge R520 Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 11
12 Service node configuration Login nodes are Dell PowerEdge R720 Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 12
13 Compute node configuration Dell C6220 compute nodes Dual 8-core CPUs, 64GB or 256GB RAM, QDR IB Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 13
14 Compute node configuration Intel server CPU generations across the 65nm, 45nm, 32nm and 22nm process nodes: Intel Core Microarchitecture - first high-volume quad-core server CPUs. Nehalem Microarchitecture - up to 6 cores and 12MB cache; dedicated high-speed bus per CPU, HW-assisted virtualization (VT-x), integrated memory controller with DDR3 support, Turbo Boost, Intel HT, AES-NI. Sandy Bridge Microarchitecture - up to 8 cores and 20MB cache; end-to-end HW-assisted virtualization (VT-x, -d, -c), integrated PCI Express, Turbo Boost 2.0, Intel Advanced Vector Extensions (AVX). Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 14
15 Compute node configuration Xeon 5500/5600 platform (Nehalem/Westmere): up to 6 cores per CPU, DDR3 1333MHz memory, 6.4 GT/s QPI between CPUs, up to 12 DIMMs per 2-socket platform, two-chip IOH/ICH design (Intel 5500 Series IOH plus Intel ICH10) providing up to 36 PCIe 2.0 lanes. Xeon E5 platform (Sandy Bridge/Ivy Bridge): up to 8 cores per CPU, DDR3 1600MHz memory, two 8.0 GT/s QPI links between CPUs, up to 16 DIMMs, up to 40 PCIe 3.0 lanes per socket (up to 80 lanes total), single-chip Platform Controller Hub (Intel C600 Series PCH) with 4-port 6Gb/s Serial Attached SCSI (SAS). Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 15
16 Very high memory node Dell PowerEdge R910 4 x 8-core Xeon CPUs 1TB RAM QDR IB and dual Gb Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 16
17 Infiniband fabric High-performance 40Gbps Infiniband fabric 32Gbps effective bandwidth per link 1.23us latency for small messages (measured) Scalable core/edge design Dual resilient subnet manager Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 17
18 Cluster facilities Dual headnodes Dual login nodes Resilient networks Dual Infiniband core Dual Lustre servers Compute 20 x standard nodes / 320 cores 16 x high-memory nodes / 256 cores 1 x very high memory node / 1TB RAM Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 18
19 Accessing the cluster Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 19
20 Command-line access Login via SSH Use SSH client on Linux, UNIX and Mac hosts Use PuTTY, Exceed, OpenSSH on Windows Connect to one of the login nodes maxwell1.abdn.ac.uk maxwell2.abdn.ac.uk Use your site AD username and password Terminal access also available via web portal Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 20
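For example, from a terminal on a Linux, UNIX or Mac host (a minimal sketch; the username shown is a placeholder for your own site AD username):
# Connect to the first login node using your AD credentials
ssh [email protected]
# The second login node can be used in the same way
ssh [email protected]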
21 Admin command-line access Compute node access provided via job scheduler Use the qrsh command to get an interactive shell Commands requiring special permissions require sudo access Only enabled for admin accounts Privileged access to shared filesystems Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 21
22 Graphical Access X-forwarding is supported for graphical apps Use the ssh -X command to enable Available via PuTTY Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 22
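A minimal sketch of X-forwarding from a Linux or Mac desktop; xclock is simply an illustrative X client, and an X server must be running on your local machine:
# Enable X11 forwarding for the SSH session
ssh -X [email protected]
# Any graphical application started on the login node now displays on your desktop
xclock &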
23 Remote Desktop FreeNX remote desktop to login nodes Use cluster private key file to connect /etc/nxserver/client.id_dsa.key Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 23
24 NX Remote Desktop Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 24
25 Web Portal access Browse to the cluster web portal address Designed to supplement command-line access User interface provides: Web-based overview of cluster status Access to user documentation Filesystem usage and quota statistics File browser for your cluster filesystems Job scheduler statistics Terminal interface Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 25
26 User documentation Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 26
27 User documentation Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 27
28 Storage system metrics Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 28
29 File browser Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 29
30 Job submission metrics Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 30
31 Job submission metrics Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 31
32 HPC Cluster Software Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 32
33 Operating system Unified 64-bit Linux installation Scientific Linux with a 64-bit kernel Automatic Transparent HugePage support Centralised software deployment system Headnodes manage cluster software repositories Compute and login nodes deployed from masters Personality-less software simplifies management Automatic software deployment for system regeneration Centralised software patching and update system Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 33
34 Application software Supporting software provided by SL distro Compilers, interpreters, drivers Specific software repositories also included Intel/Whamcloud: Lustre software Fedora/EPEL: RedHat compatible libraries and tools Open Grid Scheduler: Job scheduler software Alces Software: Symphony HPC Toolkit, web portal service Application software Commercial applications (various) Alces Software: Gridware application repository Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 34
35 System software Infiniband software Redhat Open Fabric Enterprise Distribution (OFED) Dual resilient fabric manager Supports native Infiniband verbs interface Accelerated Intel True Scale Performance Scaled Messaging (PSM) Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 35
36 System software (storage) Deployed as a number of cluster filesystems /opt/alces (100GB, read-only) Configuration information /opt/gridware (500GB, writable by admins) User applications, compilers, libraries /users (3.5TB, writable by owning user) User home directories /nobackup (55TB, writable by owning user) High performance Parallel Filesystem Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 36
37 Lustre parallel filesystem Open Source codebase (GPL v2) A Distributed Filesystem A Cluster Filesystem A POSIX Filesystem A large scale filesystem A High Performance Filesystem Hardware agnostic Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 37
38 Cluster Filesystem Early users Beowulf clusters MPI jobs Large shared file tasks Scientific data CGI rendering Large file-per-process tasks Cray XT series main filesystem Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 38
39 Initial goals From the first SOW Must do everything NFS does Shared namespace Robust, ubiquitous Must support full POSIX semantics Must support all network types of interest Must be scalable Cheaply attach storage Performance scales Must be easy to administer Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 39
40 POSIX Compliant All client filesystem operations are coherent No stale data or metadata No timeout based cache coherency like NFS Single client passes full POSIX suite Early Lustre requirement Very important to many customers Applications designed to work with POSIX compliant filesystems should work correctly with Lustre Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 40
41 Distributed Filesystem Lustre aggregates multiple servers into a single filesystem namespace The namespace is then exported to clients across a network Client/Server design Servers can have client processes Modular storage can be added on the fly WAN: many groups use Lustre over at least a campus distance, and some grid implementations exist Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 41
42 Data file components Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 42
43 Relative file sections Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 43
44 The Big Picture The Lustre client splits filesystem operations into metadata and object data Each goes to a dedicated server Metadata: file/directory names, permissions, etc Data: stored as objects of block data; files may be striped across multiple servers The filesystem needs a central registration point Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 44
45 Basic Lustre I/O Operation Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 45
46 Distributed Locking Longstanding goal of Lustre: protect data in use Provide POSIX semantics Clients never see stale data or metadata Fairly complex locking, reasonable scaling Each target provides a lock service for its objects (data/metadata) Predictive lock growth to reduce load on servers Contended resources are handled directly by the server instead of giving a lock to each client to do the operation Lock revocation flushes data from client cache Data always written to disk before it is sent to another client Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 46
47 Lustre overview Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 47
48 Lustre file striping If only one data object is associated with a Lustre MDS entry, we refer to that object as a singly-striped file Striping allows parts of a file to be stored on different OSTs: Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 48
49 Lustre file striping Striping large files can lead to massive performance increases Each OST can process data independently, allowing parallel I/O from clients More performance enhancing cache memory available Performance from 10 identical OSS may be greater than 10X a single OSS due to reduced contention Larger maximum filesize Better performance with smaller OSTs Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 49
50 Lustre file striping Use the lfs getstripe and lfs setstripe commands to influence policies: Forced stripe settings can have unexpected results File availability considerations Extra stripe information stored on MDS Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 50
51 Lustre file striping The lfs setstripe command takes the following parameters: lfs setstripe [-d] ([--size|-s stripe-size] [--count|-c stripe-cnt] [--offset|-o start-ost] [--pool|-p <pool>]) <filename|dirname> Size is stripe size in multiples of 64KB Enter zero to use the filesystem default (1024KB) Count is the number of OSTs to stripe over Enter zero to use the filesystem default Enter -1 to use all available OSTs Offset is the starting OST to write to Enter -1 to allow Lustre to decide best OST Pool is the OST pool name to use Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 51
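A short sketch of querying and setting striping on a directory using the parameters described above; the directory path is hypothetical and should be replaced with one of your own under /nobackup:
# Show the current stripe settings for a directory
lfs getstripe /nobackup/myuser/mydata
# Stripe new files in this directory across 4 OSTs, keeping the default stripe size
lfs setstripe -c 4 -s 0 /nobackup/myuser/mydata
# Delete the directory's stripe settings and revert to the filesystem default
lfs setstripe -d /nobackup/myuser/mydata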
52 Lustre best practice for performance Minimise unnecessary metadata traffic Avoid using ls -l; a plain ls command will sometimes suffice Remove the --color alias to the ls command Avoid repetitive stat operations on the filesystem Avoid directories with large numbers of files Invites lock contention for directory namespace Use stripe count of 1 for large numbers of small files Huge numbers of small files may not perform well on Lustre Tuning for multiple readers/writers Tune your compute jobs to match the number of OSS/OSTs e.g. with 6 x OSTs, a multi-writer job will perform best when run over 6, 12 or 18 nodes When reading, try and make best use of OSS cache Align reads of the same data to be simultaneous from clients Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 52
53 Cluster data storage Multiple filesystems provided for users Software repositories (/opt/gridware) 0.5TB Contains all shared, user-facing software Read-only for users Home-directories (/users) 3.5TB Default place that users login to Contains help files and links to available scratch areas Quota enforced to encourage users to remove/compress data Backed-up by University service Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 53
54 Cluster data storage Multiple filesystems provided for users Lustre data storage (/nobackup) 55TB High performance parallel filesystem Available to all nodes; 1.5GB/sec max throughput Quotas may be set if required Shared scratch (~/sharedscratch) 55TB Shared scratch location for parallel jobs Orphan data is automatically deleted Local scratch (~/localscratch) 2TB/node Local scratch disk on nodes = fastest available storage Orphan data is automatically deleted Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 54
55 Cluster data storage summary
Storage type | Location | Access | Persistence | Write performance
Application store | /opt/gridware | Read-only | Shared, permanent | N/A
Home directory | ~ (/users/<username>) | Quota enabled | Shared, permanent | Medium
Parallel filesystem | /nobackup | Quota enabled | Shared, permanent | Fast
Shared scratch | ~/sharedscratch | Quota enabled | Shared, auto-deletion | Fast
Local scratch | ~/localscratch | Full | Local to node, auto-deletion | Fastest
Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 55
56 Filesystem Quotas Use the quota command to show your filesystem quota usage Use the -s option for human-readable format [alces-cluster@login1(cluster) ~]$ quota -s Disk quotas for user alces-cluster (uid 501): Filesystem blocks quota limit grace files quota limit grace storage.ib:/users 9405M* 8813M 97657M 7days k 200k [alces-cluster@login1(cluster) ~]$ * means that you have exceeded a quota Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 56
57 Filesystem Quotas All users have a home-directory quota set Soft limit defaults to 50GB and 100K files per user Users can continue to write data when the soft limit is exceeded Default of 7-day time limit to reduce capacity used After the time limit, no further capacity may be used Hard limit defaults to 75GB and 150K files per user Users cannot exceed their hard limit Group limits may also be enforced as required Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 57
58 Efficient space usage Data stored in scratch areas may be removed Compute node disks are persistent during jobs only Old data automatically removed from scratch areas Data stored in /nobackup is not backed-up Consider summarising and saving to home-dir Contact your supervisor about data archive facilities Consider compressing data stored in home-dir Standard Linux utilities installed on login nodes gzip, bzip2, zip/unzip, TAR Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 58
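As an illustration of summarising and compressing results before they count against your home-directory quota (all paths and filenames here are examples only):
# Bundle a results directory into a single compressed archive in your home directory
tar -czf ~/results-feb2013.tar.gz ~/myjob/results
# List the archive contents to confirm it is complete before deleting the originals
tar -tzf ~/results-feb2013.tar.gz
# Compress an individual large file in place
bzip2 ~/myjob/bigoutput.dat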
59 File permissions and sharing All files and directories have default permissions Contents of your home directory are private [alces-cluster@login1(cluster) users]$ ls -ald $HOME drwx------ alces-cluster users Feb 18 17:57 /users/alces-cluster To give colleagues access to your files, use the chmod command: [alces-cluster@login1(cluster) users]$ chmod g+rx $HOME [alces-cluster@login1(cluster) users]$ ls -ald $HOME drwxr-x--- alces-cluster users Feb 18 17:57 /users/alces-cluster Scratch directories are more open by default Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 59
60 Software Development Environment Compilers Defaults to distribution version (GCC 4.4.6) GCC version also available via module Intel Cluster Suite Pro (v ) also available To use, load module then just call compiler ~]$ module load compilers/intel compilers/intel/ OK ~]$ which icc /opt/gridware/pkg/apps/intelcsxe/ /bin/bin/icc ~]$ which ifort /opt/gridware/pkg/apps/intelcsxe/ /bin/bin/ifort Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 60
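A minimal compile sketch once the Intel module is loaded; the source files and optimisation flags are illustrative rather than site defaults:
# Load the Intel compiler suite via modules
module load compilers/intel
# Compile a C source file with the Intel C compiler
icc -O2 -o myprog myprog.c
# Fortran sources are compiled with ifort in the same way
ifort -O2 -o mysim mysim.f90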
61 Software Development Environment Message passing environment (MPI) Used for parallel jobs spanning multiple nodes Multiple versions available OpenMPI version Intel MPI version To use, load module then compile and run ~]$ module load mpi/intelmpi mpi/intelmpi/ /bin OK ~]$ which mpirun /opt/gridware/pkg/apps/intelcsxe/ /bin/impi/ /bin64/mpirun ~]$ Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 61
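A sketch of building and running a small MPI program; the module name follows the module avail listing later in this document, and the source file and process count are illustrative:
# Load an MPI implementation
module load mpi/openmpi
# Compile an MPI C source with the wrapper compiler
mpicc -O2 -o hello_mpi hello_mpi.c
# Run 4 processes; within a scheduler job the machine file generated by the scheduler is used instead
mpirun -np 4 ./hello_mpi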
62 Software Development Environment modules environment management The primary method for users to access software Enables central, shared software library Provides separation between software packages Support multiple incompatible versions Automatic dependency analysis and module loading Available for all users to Use centralised application repository Build their own applications and modules in home dirs Ignore modules and setup user account manually Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 62
63 Modules environment management Loading a module does the following: Puts binaries in your $PATH for your current session Puts libraries in your $LD_LIBRARY_PATH Enables manual pages and help files Sets variables to allow you to find and use packages [alces-cluster@login1(cluster) ~]$ module load apps/breakdancer apps/breakdancer/ /gcc samtools boost OK [alces-cluster@login1(cluster) ~]$ echo $BREAKDANCERDIR /opt/gridware/pkg/apps/breakdancer/ /gcc samtools boost [alces-cluster@login1(cluster) ~]$ echo $BREAKDANCERBIN /opt/gridware/pkg/apps/breakdancer/ /gcc samtools boost /bin [alces-cluster@login1(cluster) ~]$ which breakdancer-max /opt/gridware/pkg/apps/breakdancer/ /gcc samtools boost /bin/breakdancer-max Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 63
64 Modules environment management Using the module avail command ~]$ module avail /users/alces-cluster/gridware/etc/modules apps/hpl/2.1/gcc openmpi atlas /opt/gridware/etc/modules apps/r/2.15.2/gcc lapack blas-1 apps/picard/1.85/noarch apps/abyss/1.2.6/gcc openmpi sparsehash-1.10 apps/plink/1.07/gcc apps/artemis/ /noarch apps/python/2.7.3/gcc apps/augustus/2.6.1/gcc apps/rapid/0.6/noarch apps/blat/35/gcc apps/recon/1.05/gcc apps/bowtie/0.12.8/gcc libs/blas/1/gcc libs/boost/1.45.0/gcc openmpi python mpi/intelmpi/ /bin mpi/openmpi/1.6.3/gcc Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 64
65 Modules environment management Other modules commands available # module unload <module> Removes the module from the current environment # module list Shows currently loaded modules # module display <module> # module whatis <module> Shows information about the software # module keyword <search term> Searches for the supplied term in the available modules whatis entries Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 65
66 Modules environment management By default, modules are loaded for your session Loading modules automatically (all sessions): # module initadd <module> Loads the named module every time a user logs in # module initlist Shows which modules are automatically loaded at login # module initrm <module> Stops a module being loaded on login # module use <new module dir> Allows users to supply their own modules branches Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 66
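For example, to have a compiler loaded at every login and to make your own modulefiles visible (the module name matches earlier examples; the personal modules directory follows the ~/gridware layout shown earlier in the module avail output):
# Load the Intel compilers automatically at each login
module initadd compilers/intel
# Confirm which modules are now loaded at login
module initlist
# Add a personal modules branch built in your home directory
module use ~/gridware/etc/modules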
67 Modules environment management Using modules Shared applications are installed with a module New requests are assessed for suitability to be a shared application Available from /opt/gridware NFS share No applications installed on compute nodes Same application set available throughout the cluster Modules always available to all users Can be loaded on the command line Can be included to load automatically at login Can be included in scheduler job scripts Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 67
68 HPC Applications Many popular applications available Centrally hosted to avoid replication Multiple versions supported via modules Interactive applications also supported Please run on compute nodes via qrsh Users should not run applications on login nodes Open-source and commercial apps supported Commercial application support provided by license provider Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 68
69 Requesting new applications Contact your site admin for information Application requests will be reviewed Most open-source applications can be incorporated Commercial applications require appropriate licenses Site license may be required License server available on cluster service nodes Users can install their own packages in home dirs Many software tools are available in Linux Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 69
70 Cluster job scheduler Open Grid Scheduler Developed from Sun Grid Engine codebase Virtually identical to original syntax Open-source with wide range of contributors Free to download and use Please consider acknowledging copyrights Very stable and established application Many command-line utilities and tools Admin GUI also available but rarely used Alces web-portal used to report accounting data Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 70
71 Cluster job scheduler Why do we need a job scheduler? Need to allocate compute resources to users Need to prevent user applications from overloading compute nodes Want to queue work to run overnight and out of hours Want to ensure that users each get a fair share of the available resources Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 71
72 Types of job Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 72
73 Types of job Serial jobs Use a single CPU core May be submitted as a task array (many single jobs) Multi-threaded jobs Use two or more CPU cores on the same node User must request the number of cores required Parallel jobs Use two or more CPU cores on one or more nodes User must request the number of cores required User may optionally request a number of nodes to use Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 73
74 Class of job Interactive jobs Started with the qrsh command Will start immediately if resources are available Will exit immediately if there are no resources available May be serial, multi-threaded or parallel type Users must request a maximum runtime Users should specify resources required to run Input data is the shell session or application started Output data is shown on screen Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 74
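A sketch of an interactive session that specifies its resources explicitly; the runtime, memory and core count values are illustrative and use scheduler options described later in this document:
# Request a 2-hour interactive shell with 4 cores and 8GB of memory per core
qrsh -l h_rt=2:0:0 -l h_vmem=8g -pe smp 4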
75 Interactive Jobs qrsh allows users to run interactive applications directly on a compute node Do not run demanding applications on login nodes Per-user CPU and memory limits are in place to discourage users from running on login nodes Login node CPU time is not included in scheduler accounting Users are prevented from logging in directly to compute nodes; use qrsh for access [alces-cluster@login1(cluster) ~]$ qrsh [alces-cluster@node12(cluster) ~]$ uptime 17:01:07 up 7 days, 5:12, 1 user, load average: 0.00, 0.00, 0.00 [alces-cluster@node12(cluster) ~]$ Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 75
76 Class of job Non-interactive (batch) jobs Submitted via qsub <job> command Submitted to the scheduler queue and left to run Will start immediately if resources are available Will queue for resources if job cannot run May be serial, multi-threaded or parallel type Users must request a maximum runtime Users should specify resources required to run Input data is usually stored on disk Output is stored to a file on disk Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 76
77 Non-interactive jobs Jobs are submitted via a job-script Contains the commands to run your job Can contain instructions for the job-scheduler Submission of a simple job-script Echo commands to stdout Single-core batch job Runs with default resource requests Assigned a unique job number (2573) Output sent to home directory by default Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 77
78 Simple batch job Job-script: ~/simple_jobscript.sh #!/bin/bash echo "Starting new job" sleep 120 echo "Finished my job" [alces-cluster@login1(cluster) myjob]$ qsub simple_jobscript.sh Your job 2573 ("simple_jobscript.sh") has been submitted [alces-cluster@login1(cluster) myjob]$ qstat job-id prior name user state queue slots ja-task-id 2573 simple_job alces-cluster r byslot.q@node21 1 [alces-cluster@login1(cluster) myjob]$ cat ~/simple_jobscript.sh.o2573 Starting new job Finished my job [alces-cluster@login1(cluster) myjob]$ Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 78
79 Viewing the queue status Use the qstat command to view queue status: [alces-cluster@login1(cluster) myjob]$ qstat job-id prior name user state submit/start at queue slots imb-any.sh alces-cluste r 02/14/ :15:38 byslot.q@<node> imb-any.sh alces-cluste r 02/14/ :15:43 byslot.q@<node> imb-any.sh alces-cluste r 02/14/ :15:43 byslot.q@<node> simple_job alces-cluste r 02/14/ :15:58 byslot.q@<node> simple_job alces-cluste qw 02/14/ :16:25 1 Running jobs (r) are shown with queues used Queuing jobs (qw) are waiting to run Use qstat -j <jobid> for more information on why queuing jobs have not started yet Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 79
80 Removing Jobs from the queue Use the qdel command to remove a job: [alces-cluster@login1(cluster) myjob]$ qstat job-id prior name user state submit/start at queue slots imb-any.sh alces-cluste r 02/14/ :15:43 byslot.q@<node> simple_job alces-cluste qw 02/14/ :16:25 1 [alces-cluster@login1(cluster) myjob]$ qdel 2577 alces-cluster has registered the job 2577 for deletion [alces-cluster@login1(cluster) myjob]$ qstat job-id prior name user state submit/start at queue slots simple_job alces-cluste qw 02/14/ :16:25 1 Nodes are automatically cleared up Users can delete their own jobs only Operator can delete any jobs Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 80
81 Job-scheduler instructions qsub and qrsh can receive instructions from users Common instructions include: Provide a name for your job Control how output files are written Request notification of job status Request additional resources More CPU cores More memory A longer run-time Access to special purpose compute nodes Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 81
82 Job-scheduler instructions Job-scheduler instructions may be given as parameters to your qsub or qrsh command Use -N <name> to set a job name Job name is visible via a qstat command [alces-cluster@login1(cluster) ~]$ qrsh -N hello [alces-cluster@node06(cluster) ~]$ [alces-cluster@node06(cluster) ~]$ qstat job-id prior name user state submit/start at queue slots hello alces-cluste r 02/24/ :29:41 byslot.q@node06 1 Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 82
83 Job-scheduler instructions Non-interactive jobs can provide instructions as part of their job-script Begin instructions with #$ identifier Multiple lines can be processed myjob]$ cat simple_jobscript.sh #!/bin/bash #$ -N my_job echo "Starting new job" sleep 120 echo "Finished my job" myjob]$ qsub simple_jobscript.sh Your job 2581 ("my_job") has been submitted myjob]$ qstat job-id prior name user state submit/start at queue slots my_job alces-cluste r 02/14/ :36:26 byslot.q@node03 1 Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 83
84 Setting output file location Provide a location to store the output of a job Use the -o <filename> option Supply full path and filename The variable $JOB_ID is set automatically to your job ID number [alces-cluster@login1(cluster) myjob]$ cat simple_jobscript.sh #!/bin/bash #$ -o ~/outputfiles/myjob.$JOB_ID echo "Starting new job" sleep 120 echo "Finished my job" Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 84
85 Requesting notification Provide an email address for job status notifications Use the -M <email address> option Specify conditions to report using -m <b|e|a> Send email when job Begins, Ends or Aborts [alces-cluster@login1(cluster) myjob]$ cat simple_jobscript.sh #!/bin/bash #$ -o ~/outputfiles/myjob.$JOB_ID #$ -M <email address> -m bea echo "Starting new job" sleep 120 echo "Finished my job" Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 85
86 Running task arrays Used to run an array of serial jobs Input and output files can be separated Individual tasks are independent; may run anywhere Use the -t <start>-<end> option [alces-cluster@login1(cluster) myjob]$ cat simple_taskarray.sh #!/bin/bash #$ -t <start>-<end> -cwd -o ~/results/taskjob.$JOB_ID module load apps/fastapp fastapp -i ~/inputdata/input.$SGE_TASK_ID -o \ ~/results/out.$SGE_TASK_ID Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 86
87 Requesting additional resources If no additional resources are requested, jobs are automatically assigned default resources One CPU core Up to 4GB RAM Maximum of 24-hour runtime (wall-clock) Execute on standard compute nodes These limits are automatically enforced Jobs must request different limits if required Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 87
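As an illustration, a single submission that overrides each of the defaults listed above; the values are examples only and the individual options are explained on the following slides:
# 8 cores on one node, 8GB per core, 48-hour maximum runtime
qsub -pe smp 8 -l h_vmem=8g -l h_rt=48:0:0 myjobscript.sh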
88 Requesting more CPU cores Multi-threaded and parallel jobs require users to request the number of CPU cores needed The scheduler uses a Parallel Environment (PE) Must be requested by name Number of slots (CPU cores) must be requested Enables scheduler to prepare nodes for the job Three PE available on cluster smp / smp-verbose mpislots / mpislots-verbose mpinodes / mpinodes-verbose Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 88
89 smp Parallel Environment Designed for SMP / multi-threaded jobs Job must be contained within a single node Requested with number of slots / CPU-cores Verbose variant shows setup information in job output Memory limit is requested per slot Request with -pe smp-verbose <slots> Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 89
90 smp Parallel Environment [alces-cluster@login1(cluster) smptest]$ cat runsmp.sh #!/bin/bash #$ -pe smp-verbose 7 -o ~/smptest/results/smptest.out.$JOB_ID ~/smptest/hello [alces-cluster@login1(cluster) smptest]$ cat ~/smptest/results/smptest.out.2583 ======================================================= SGE job submitted on Mon Feb 14 14:01:31 GMT 2013 JOB ID: 2583 JOB NAME: runsmp.sh PE: smp-verbose QUEUE: byslot.q MASTER node17.cluster.local ======================================================= 2: Hello World! 5: Hello World! 6: Hello World! 1: Hello World! 4: Hello World! 0: Hello World! 3: Hello World! ================================================== [alces-cluster@login1(cluster) smptest]$ Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 90
91 mpislots Parallel Environment Designed for multi-core MPI jobs Job may use slots on many nodes MPI hostfile is generated by scheduler Requested with number of slots / CPU-cores Verbose variant shows setup information in job output Memory limit is requested per slot Default is 4GB per slot (CPU core) Request with -pe mpislots <slots> Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 91
92 mpislots Parallel Environment [alces-cluster@login1(cluster) ~]$ cat mpijob.sh #!/bin/bash #$ -pe mpislots-verbose 48 -cwd -N imb-job -o ~/imb/results/imb-job.$JOB_ID -V module load apps/imb mpirun IMB-MPI1 [alces-cluster@login1(cluster) ~]$ cat ~/imb/results/imb-job.2584 ======================================================= SGE job submitted on Mon Feb 14 14:11:31 GMT 2013 3 hosts used JOB ID: 2584 JOB NAME: imb-job PE: mpislots-verbose QUEUE: byslot.q MASTER node07.cluster.local Nodes used: node07 node24 node02 ======================================================= ** A machine file has been written to /tmp/sge.machines.2584 on node07.cluster.local ** ======================================================= If an output file was specified on job submission, Job Output Follows: ======================================================= Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 92
93 mpinodes Parallel Environment Designed for MPI jobs that use whole nodes Job has a number of entire nodes dedicated to it MPI hostfile is generated by scheduler Requested with number of nodes (16-cores each) Verbose variant shows setup information in job output Memory limit is requested per slot Default is 64GB per slot (complete node) Request with -pe mpinodes <slots> Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 93
94 mpinodes Parallel Environment [alces-cluster@login1(cluster) ~]$ cat mpinodesjob.sh #!/bin/bash #$ -pe mpinodes-verbose 3 -cwd -N imb-job -o ~/imb/results/imb-job.$JOB_ID -V module load apps/imb mpirun IMB-MPI1 [alces-cluster@login1(cluster) ~]$ cat ~/imb/results/imb-job.2585 ======================================================= SGE job submitted on Mon Feb 14 15:20:41 GMT 2013 3 hosts used JOB ID: 2585 JOB NAME: imb-job PE: mpinodes-verbose QUEUE: bynode.q MASTER node05.cluster.local Nodes used: node05 node04 node22 ======================================================= ** A machine file has been written to /tmp/sge.machines.2585 on node05.cluster.local ** ======================================================= If an output file was specified on job submission, Job Output Follows: ======================================================= Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 94
95 Requesting more memory Default memory request is 4GB per core Request more using -l h_vmem=<amount> [alces@login1(cluster) ~]$ qsub -l h_vmem=8g jobscript.sh Jobs requesting more memory are likely to queue for longer Need to wait for resources to be available to run Jobs requesting less memory are likely to run sooner Scheduler will use all available nodes and memory Memory limits are enforced by the scheduler Jobs exceeding their limit will be automatically stopped Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 95
96 How much memory should I request? Run the job with a large memory request When complete, use qacct -j <jobid> [alces@login1(cluster) ~]$ qsub -l h_vmem=64g simple_jobscript.sh [alces@login1(cluster) ~]$ qacct -j 2586 ============================================================== qname byslot.q hostname node13.cluster.local group users owner alces-cluster project default.prj department alces.ul jobname my_job jobnumber 2586 cpu maxvmem M Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 96
97 Requesting a longer runtime Default runtime is 24-hours Time is measured for slot occupancy from when the job starts Request more using -l h_rt=<hours:mins:secs> [alces@login1(cluster) ~]$ qsub -l h_rt=04:30:00 jobscript.sh This is a maximum runtime Your job will be accounted for if you finish early Job time limits are enforced by the scheduler Jobs exceeding their limit will be automatically stopped Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 97
98 Why not always request large runtimes? The scheduler performs backfilling When resources are being reserved for a large job, the scheduler will automatically run short jobs on idle CPU cores The scheduler needs to know how long your job may run for to allow it to jump the queue Users queuing large jobs can request reservation Use the -R y option to request that resources are reserved for large parallel jobs Reserved resources are included in your usage Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 98
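A sketch of submitting a large parallel job with resource reservation enabled so that backfilled jobs cannot delay it indefinitely; the slot count and runtime are illustrative:
# Reserve resources for a 64-slot MPI job with a 12-hour limit
qsub -R y -pe mpislots 64 -l h_rt=12:0:0 bigjob.sh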
99 Requesting special purpose nodes The vhmem01 node has 1TB of memory Access is restricted to particular users and groups Request this system with -l vhmem=true Remember to request the amount of memory needed [alces-cluster@login1(cluster) myjob]$ cat simple_jobscript.sh #!/bin/bash #$ -N bigjob -l h_rt=72:0:0 #$ -l vhmem=true -l h_vmem=32g #$ -pe smp-verbose 32 module load apps/mysmpjob mysmpbin Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 99
100 Requesting node exclusivity Ensures that your job has a complete node Access is restricted to particular users and groups Request exclusivity with -l exclusive=true Your job may queue for significantly longer [alces-cluster@login1(cluster) myjob]$ cat simple_jobscript.sh #!/bin/bash #$ -N excjob -l h_rt=2:0:0 #$ -l exclusive=true -l h_vmem=2g module load apps/verybusyapp VeryBusyApp -all Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 100
101 Example job-script templates Example job-scripts available on the cluster in /opt/gridware/examples/templates/
simple.sh - single-slot serial job
simple-array.sh - array of serial jobs
smp.sh - single-node, multi-threaded job
multislot.sh - multi-node, multi-threaded MPI job (assigned by slot)
multinode.sh - multi-node, multi-threaded MPI job (assigned by whole nodes)
Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 101
102 Sharing resources Job priority affects queuing jobs only Functional shares Static definitions of how much resource a user/group can use Urgency The amount of time a job has been waiting to run Priority of queuing jobs automatically increases over time Fair-share policy Takes past usage of the cluster into account Half-life of 14 days for resource usage Ensures that cluster is not unfairly monopolised Use qstat to show the priority of your job Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 102
103 Sharing resources Resource quotas Static limits set by administrators Allows the maximum resource available to a user or group to be controlled Default quotas include: Limit access to the vhmem01 node Access to exclusivity flag Limit the maximum number of CPU slots Use qquota to show your resource quotas Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 103
105 Queue administration Administrator can perform queue management Delete jobs (qdel) Enable and disable queues (qmod) Disabled queues finish running jobs but do not start more [admin@headnode1(cluster) ~]$ qstat -u \* job-id prior name user state submit/start at queue slots imb-any.sh alces-cluste r 02/20/ :11:43 byslot.q@<node> imb-any.sh alces-cluste r 02/21/ :11:21 byslot.q@<node> imb-any.sh alces-cluste r 02/05/ :41:44 byslot.q@<node> imb-any.sh alces-cluste r 02/08/ :21:51 byslot.q@<node> my_job alces-cluste r 02/14/ :42:06 byslot.q@<node> 1 [admin@headnode1(cluster) ~]$ qdel 2587 admin has registered the job 2587 for deletion [admin@headnode1(cluster) ~]$ qmod -d byslot.q@node05 admin@headnode1 changed state of "byslot.q@node05" (disabled) [admin@headnode1(cluster) ~]$ Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 105
106 Execution host status Use the qhost command to view host status ~]$ qhost
HOSTNAME ARCH NCPU LOAD MEMTOT MEMUSE SWAPTO SWAPUS
global
node01 linux-x G 1.1G 16.0G 0.0
node02 linux-x G 1.2G 16.0G 0.0
node03 linux-x G 1.2G 16.0G 0.0
node04 linux-x G 4.2G 16.0G 0.0
node05 linux-x G 1.2G 16.0G 0.0
node06 linux-x G 1.2G 16.0G 0.0
node07 linux-x G 8.2G 16.0G 0.0
node08 linux-x G 5.2G 16.0G 0.0
node09 linux-x G 56.2G 16.0G 0.0
node10 linux-x G 56.2G 16.0G 0.0
node11 linux-x G 56.7G 16.0G 0.0
node12 linux-x G 32.2G 16.0G 4.0
node13 linux-x G 1.5G 16.0G 0.0
node14 linux-x G 1.5G 16.0G 0.0
node15 linux-x G 31.2G 16.0G 0.0
node16 linux-x G 250.1G 16.0G 2.0
Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 106
107 Requesting Support Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 107
108 Support services Site administrator On-site point of contact for all users Service delivery lead for site Organises, prioritises and referees user requests Determines usage policy, time sharing, etc. Responsible for user data management Ability to organise data files stored on the cluster Can set user quotas, change file ownerships and groups Requests vendor service via support tickets Changes to the service must be authorized by site admin Delegated responsibility during absence to another admin Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 108
109 Requesting assistance Open a service desk ticket Must contain a description of the problem Username or group with the problem Example job output files that show the issue Node number or system component fault Method for replicating the problem Request priority and business case justification Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 109
110 Getting started Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 110
111 User accounts Requesting a user account Contact your site administrator; please provide Your existing IT Services username (but not password) The name of your supervisor or PI Once your account is created, access the system using: Your existing IT Services username Your existing IT Services password Alces Software Limited 2013 COMMERCIAL IN CONFIDENCE 111
112 Alces Software Limited 2013
NetScaler VPX FAQ. Table of Contents
NetScaler VPX FAQ Table of Contents Feature and Functionality Frequently Asked Questions... 2 Pricing and Packaging Frequently Asked Questions... 4 NetScaler VPX Express Frequently Asked Questions... 5
High Performance Computing OpenStack Options. September 22, 2015
High Performance Computing OpenStack PRESENTATION TITLE GOES HERE Options September 22, 2015 Today s Presenters Glyn Bowden, SNIA Cloud Storage Initiative Board HP Helion Professional Services Alex McDonald,
Work Environment. David Tur HPC Expert. HPC Users Training September, 18th 2015
Work Environment David Tur HPC Expert HPC Users Training September, 18th 2015 1. Atlas Cluster: Accessing and using resources 2. Software Overview 3. Job Scheduler 1. Accessing Resources DIPC technicians
Commoditisation of the High-End Research Storage Market with the Dell MD3460 & Intel Enterprise Edition Lustre
Commoditisation of the High-End Research Storage Market with the Dell MD3460 & Intel Enterprise Edition Lustre University of Cambridge, UIS, HPC Service Authors: Wojciech Turek, Paul Calleja, John Taylor
www.thinkparq.com www.beegfs.com
www.thinkparq.com www.beegfs.com KEY ASPECTS Maximum Flexibility Maximum Scalability BeeGFS supports a wide range of Linux distributions such as RHEL/Fedora, SLES/OpenSuse or Debian/Ubuntu as well as a
SLURM: Resource Management and Job Scheduling Software. Advanced Computing Center for Research and Education www.accre.vanderbilt.
SLURM: Resource Management and Job Scheduling Software Advanced Computing Center for Research and Education www.accre.vanderbilt.edu Simple Linux Utility for Resource Management But it s also a job scheduler!
Installing and running COMSOL on a Linux cluster
Installing and running COMSOL on a Linux cluster Introduction This quick guide explains how to install and operate COMSOL Multiphysics 5.0 on a Linux cluster. It is a complement to the COMSOL Installation
Building Clusters for Gromacs and other HPC applications
Building Clusters for Gromacs and other HPC applications Erik Lindahl [email protected] CBR Outline: Clusters Clusters vs. small networks of machines Why do YOU need a cluster? Computer hardware Network
Agenda. HPC Software Stack. HPC Post-Processing Visualization. Case Study National Scientific Center. European HPC Benchmark Center Montpellier PSSC
HPC Architecture End to End Alexandre Chauvin Agenda HPC Software Stack Visualization National Scientific Center 2 Agenda HPC Software Stack Alexandre Chauvin Typical HPC Software Stack Externes LAN Typical
Stovepipes to Clouds. Rick Reid Principal Engineer SGI Federal. 2013 by SGI Federal. Published by The Aerospace Corporation with permission.
Stovepipes to Clouds Rick Reid Principal Engineer SGI Federal 2013 by SGI Federal. Published by The Aerospace Corporation with permission. Agenda Stovepipe Characteristics Why we Built Stovepipes Cluster
Cray DVS: Data Virtualization Service
Cray : Data Virtualization Service Stephen Sugiyama and David Wallace, Cray Inc. ABSTRACT: Cray, the Cray Data Virtualization Service, is a new capability being added to the XT software environment with
Developing High-Performance, Scalable, cost effective storage solutions with Intel Cloud Edition Lustre* and Amazon Web Services
Reference Architecture Developing Storage Solutions with Intel Cloud Edition for Lustre* and Amazon Web Services Developing High-Performance, Scalable, cost effective storage solutions with Intel Cloud
Quantum StorNext. Product Brief: Distributed LAN Client
Quantum StorNext Product Brief: Distributed LAN Client NOTICE This product brief may contain proprietary information protected by copyright. Information in this product brief is subject to change without
Introduction to Grid Engine
Introduction to Grid Engine Workbook Edition 8 January 2011 Document reference: 3609-2011 Introduction to Grid Engine for ECDF Users Workbook Introduction to Grid Engine for ECDF Users Author: Brian Fletcher,
Sun Constellation System: The Open Petascale Computing Architecture
CAS2K7 13 September, 2007 Sun Constellation System: The Open Petascale Computing Architecture John Fragalla Senior HPC Technical Specialist Global Systems Practice Sun Microsystems, Inc. 25 Years of Technical
Overview of HPC Resources at Vanderbilt
Overview of HPC Resources at Vanderbilt Will French Senior Application Developer and Research Computing Liaison Advanced Computing Center for Research and Education June 10, 2015 2 Computing Resources
PARALLELS SERVER 4 BARE METAL README
PARALLELS SERVER 4 BARE METAL README This document provides the first-priority information on Parallels Server 4 Bare Metal and supplements the included documentation. TABLE OF CONTENTS 1 About Parallels
Isilon OneFS. Version 7.2.1. OneFS Migration Tools Guide
Isilon OneFS Version 7.2.1 OneFS Migration Tools Guide Copyright 2015 EMC Corporation. All rights reserved. Published in USA. Published July, 2015 EMC believes the information in this publication is accurate
High Performance Computing with Sun Grid Engine on the HPSCC cluster. Fernando J. Pineda
High Performance Computing with Sun Grid Engine on the HPSCC cluster Fernando J. Pineda HPSCC High Performance Scientific Computing Center (HPSCC) " The Johns Hopkins Service Center in the Dept. of Biostatistics
Introduction to HPC Workshop. Center for e-research ([email protected])
Center for e-research ([email protected]) Outline 1 About Us About CER and NeSI The CS Team Our Facilities 2 Key Concepts What is a Cluster Parallel Programming Shared Memory Distributed Memory 3 Using
Parallels Cloud Storage
Parallels Cloud Storage White Paper Best Practices for Configuring a Parallels Cloud Storage Cluster www.parallels.com Table of Contents Introduction... 3 How Parallels Cloud Storage Works... 3 Deploying
ORACLE OPS CENTER: PROVISIONING AND PATCH AUTOMATION PACK
ORACLE OPS CENTER: PROVISIONING AND PATCH AUTOMATION PACK KEY FEATURES PROVISION FROM BARE- METAL TO PRODUCTION QUICKLY AND EFFICIENTLY Controlled discovery with active control of your hardware Automatically
Running a Workflow on a PowerCenter Grid
Running a Workflow on a PowerCenter Grid 2010-2014 Informatica Corporation. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording or otherwise)
Architecting a High Performance Storage System
WHITE PAPER Intel Enterprise Edition for Lustre* Software High Performance Data Division Architecting a High Performance Storage System January 2014 Contents Introduction... 1 A Systematic Approach to
Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems
Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems Rekha Singhal and Gabriele Pacciucci * Other names and brands may be claimed as the property of others. Lustre File
www.novell.com/documentation Server Installation ZENworks Mobile Management 2.7.x August 2013
www.novell.com/documentation Server Installation ZENworks Mobile Management 2.7.x August 2013 Legal Notices Novell, Inc., makes no representations or warranties with respect to the contents or use of this
Intel Xeon Processor E5-2600
Intel Xeon Processor E5-2600 Best combination of performance, power efficiency, and cost. Platform Microarchitecture Processor Socket Chipset Intel Xeon E5 Series Processors and the Intel C600 Chipset
How to Choose your Red Hat Enterprise Linux Filesystem
How to Choose your Red Hat Enterprise Linux Filesystem EXECUTIVE SUMMARY Choosing the Red Hat Enterprise Linux filesystem that is appropriate for your application is often a non-trivial decision due to
Assignment # 1 (Cloud Computing Security)
Assignment # 1 (Cloud Computing Security) Group Members: Abdullah Abid Zeeshan Qaiser M. Umar Hayat Table of Contents Windows Azure Introduction... 4 Windows Azure Services... 4 1. Compute... 4 a) Virtual
An Oracle White Paper May 2011. Exadata Smart Flash Cache and the Oracle Exadata Database Machine
An Oracle White Paper May 2011 Exadata Smart Flash Cache and the Oracle Exadata Database Machine Exadata Smart Flash Cache... 2 Oracle Database 11g: The First Flash Optimized Database... 2 Exadata Smart
NFS SERVER WITH 10 GIGABIT ETHERNET NETWORKS
NFS SERVER WITH 1 GIGABIT ETHERNET NETWORKS A Dell Technical White Paper DEPLOYING 1GigE NETWORK FOR HIGH PERFORMANCE CLUSTERS By Li Ou Massive Scale-Out Systems delltechcenter.com Deploying NFS Server
IBM WebSphere Application Server Version 7.0
IBM WebSphere Application Server Version 7.0 Centralized Installation Manager for IBM WebSphere Application Server Network Deployment Version 7.0 Note: Before using this information, be sure to read the
Dell Microsoft Business Intelligence and Data Warehousing Reference Configuration Performance Results Phase III
White Paper Dell Microsoft Business Intelligence and Data Warehousing Reference Configuration Performance Results Phase III Performance of Microsoft SQL Server 2008 BI and D/W Solutions on Dell PowerEdge
Lessons learned from parallel file system operation
Lessons learned from parallel file system operation Roland Laifer STEINBUCH CENTRE FOR COMPUTING - SCC KIT University of the State of Baden-Württemberg and National Laboratory of the Helmholtz Association
