SC09 Tutorial M06: Cluster Construction Tutorial
Paul Marshall, Michael Oberg, Theron Voran, Matthew Woitaszek
University of Colorado, Boulder / National Center for Atmospheric Research
Advanced topics
- Parallel file systems
- Grids and clouds
- Administrative scalability
- Security essentials
- Cluster scalability
- Automated clustering tools and kits
- Best practices
16 Nov 2009
I/O requirements of scientific workflows
- Mandatory input
- Mandatory output
- Scratch files
- Checkpointing
- Fault tolerance
- Continuation jobs
- Visualization
- Collaboration
Serial I/O: one reader/writer for each of many files
Parallel I/O: multiple simultaneous writers to a single file
(Figure: tasks 1-3 each writing their own file vs. tasks 1-3 writing one shared file)
Single file server, or a static mapping of servers
- Easy to use: any node can open files
- No special libraries or drivers required
(Figure: all nodes connect to a network switch; I/O goes to specific servers for /home, /proj1, /scratch)
Parallel file systems
- Store files on multiple systems with multiple disks
- IBM GPFS, Lustre, PVFS, and others
Applications
- Naive (POSIX): standard block layer
- Enhanced for parallel I/O: exploits the parallel nature of data storage (MPI-IO, Parallel HDF5)
See also Tutorial S05, Parallel I/O in Practice
External parallel storage: I/O on a network separate from the MPI interconnect
Embedded parallel storage: I/O on the same network as the MPI data traffic
(Figure: all nodes connect to a network switch, with /scratch attached either externally or embedded)
Speed: parallel file systems vs. NFS
Visibility: single system or datacenter-wide
Quotas: instantaneous? Over time?
Backup and purge policies: storing irreplaceable data vs. preprocessed results
Definitions and distinctions still open for debate
Grid software
- Facilitates cross-cluster operations
- Emphasis on data sharing and job/execution management
- International standards (e.g., committees)
Cloud resources
- Provides on-demand access to platforms and resources
- Emphasis on resources, platforms, and services
- Business model (e.g., billable hours)
Clouds on grids, and grids on clouds
A grid:
- Coordinates resources that are not subject to centralized control
- Uses standard and open protocols and interfaces
- Delivers nontrivial qualities of service
http://www-fp.mcs.anl.gov/~foster/articles/whatisthegrid.pdf
A grid:
- Coordinates resources that are not subject to centralized control
- Uses standard and open protocols and interfaces
The Grid uses the standard protocols:
- The Globus Toolkit (GT4)
- Global Grid Forum (GGF)
- Open Grid Services Architecture (OGSA): web services, with support from IBM, Microsoft, Sun
"[A] heterogeneous, multi-vendor, Grid world"
http://www-fp.mcs.anl.gov/~foster/articles/whatisthegrid.pdf
(Figure: Globus Toolkit components by category. Security: Authentication & Authorization, Community Authorization, Delegation, Credential Management. Data Management: GridFTP, Reliable File Transfer, Replica Location, Data Replication, Data Access & Integration. Execution Management: GRAM, Community Scheduling, Workspace Management, Grid Telecontrol. Information Services: Index, Trigger, WebMDS. Common Runtime: C, Java, and Python runtimes.)
High-performance data movement
- Striped transfers
- Multiple streams
- Third-party transfers
- TCP tuning
(Figure: user client issues GridFTP commands; data moves from striped source servers to destination servers across the site network backplane (trunked 1 Gbps / native 10 Gbps), site uplink (10-40 Gbps), and backbone WAN (10-40 Gbps))
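The features above map onto globus-url-copy command-line options. This invocation is illustrative only (hosts and paths are hypothetical, and it requires a Globus installation, so it is shown as a fragment rather than a runnable script):

```
# Third-party transfer between two GridFTP servers, striped,
# with 4 parallel TCP streams and a tuned TCP buffer.
# Hostnames and paths are hypothetical.
globus-url-copy -stripe -p 4 -tcp-bs 4194304 \
    gsiftp://source.example.org/scratch/data.nc \
    gsiftp://dest.example.org/scratch/data.nc
```

The client only issues control commands; the data flows directly between the source and destination servers.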
"Clouds provide rapid, metered access to a virtually unlimited set of resources." -- Doug Thain (University of Notre Dame)
Different kinds of clouds:
- SaaS: Software as a Service
- PaaS: Platform as a Service
- IaaS: Infrastructure as a Service
Public vs. private
We used virtualization for this tutorial
- Private VLAN-based network for each cluster
- (Figure: hosts pegasus001-pegasus005 with external connectivity; pegasus001 runs the cluster01 and cluster02 head nodes, and pegasus002-pegasus005 run cluster01/cluster02 node01-node04)
We could have used Amazon EC2 for $140-560
(About 10 hours on 140 cores at $0.10-0.40, with no mistakes or practice. Shipping, handling, taxes, data storage, and data transfer not included.)
Use cloud resources as nodes when necessary
- Establish an account and deploy a node VM image a priori or on demand
- Policy boots nodes on cloud resources when required
- Challenges: network topology, security, cost
(Figure: your cluster's head node and node001-node003, extended by node900-node901 booted from the node VM image at a cloud provider)
10% / 90% rule
- A little hardening, some protections, and a little bit of monitoring and routine practices go a long way
Important security components
- Firewall: host-based or centralized
- File system integrity scanning
- Centralized logging and monitoring
- Operating system and services hardening
Firewall as first defense
- Protects against service misconfiguration
- Usually easily modified
Option 1: Central firewalls
- Driven by convenience; centralized management
- Con: often expensive, limits bandwidth, usually placed far upstream
- Con: limits cross-system / datacenter configuration
(Figure: site network backplane with trunked 1 Gbps and native 10 Gbps links, site uplink and backbone WAN at 10-40 Gbps)
Option 2: Host-based firewalls (e.g., Linux iptables)
- Scalable performance
- Per-host configuration; wrappers can simplify administration (e.g., Shorewall)
- Easy to modify, but each host is managed independently
- Easy to specify cross-system and datacenter policies: full control of each system's exposure
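As a concrete sketch, a minimal per-host policy in iptables-restore format might look like the following (the SSH source network is an assumption; a wrapper like Shorewall generates rules of this shape from higher-level policy):

```
# Minimal compute-node policy (sketch) -- load with: iptables-restore < FILE
*filter
:INPUT DROP [0:0]
:FORWARD DROP [0:0]
:OUTPUT ACCEPT [0:0]
# Keep established sessions and loopback traffic working
-A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
-A INPUT -i lo -j ACCEPT
# Allow SSH only from the site network (10.0.0.0/8 is an assumed example)
-A INPUT -p tcp --dport 22 -s 10.0.0.0/8 -j ACCEPT
COMMIT
```

Because each host carries its own ruleset, exposure can be tuned per system rather than at one upstream choke point.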
Monitor file system modifications as a second defense
- Daemons run on main systems and alert when critical files change
Tripwire
- E-mail daily from all hosts; no DB support
- Tripwire Pro solves most of the scalability issues
Osiris
- Only emails if events happened
- Still emails from all hosts independently
Samhain
- Can run per host, or in a large installation with scanners on hosts logging to a central server that then handles alerts and DB logging
- Beltane management GUI
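The core idea behind all of these tools can be sketched as a checksum baseline plus a later comparison (paths and the file list are hypothetical; the real tools add signed databases, inode/permission checks, and central reporting):

```shell
# Minimal file-integrity scan, assuming GNU coreutils sha256sum.
# BASELINE and FILES are hypothetical defaults; override via environment.
BASELINE=${BASELINE:-/var/lib/integrity/baseline.sha256}
FILES=${FILES:-"/etc/passwd /etc/ssh/sshd_config"}

integrity_init() {
    # Record a baseline of SHA-256 checksums for the critical files
    mkdir -p "$(dirname "$BASELINE")"
    sha256sum $FILES > "$BASELINE"
}

integrity_check() {
    # Exit status is non-zero (and changed files are listed) if any
    # file differs from the recorded baseline
    sha256sum --check --quiet "$BASELINE"
}
```

Run `integrity_init` once after a known-good install, then `integrity_check` from cron and alert on a non-zero exit.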
Centralized, dedicated syslog server
- Only a few admin accounts, unique passwords
- All servers copy all syslog messages to this server
- Can run logwatch, etc., in a single location
- The `logger` command allows custom logging
syslog-ng can be configured to write each server's syslog information to a separate directory:
  /var/log/remote_hosts/host1/messages
  /var/log/remote_hosts/host2/messages
  /var/log/remote_hosts/host3/messages
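A per-host layout like the one above falls out of syslog-ng's $HOST macro. A minimal sketch of the server side (the listening port and log path are assumptions):

```
# syslog-ng server sketch: accept syslog from the cluster over UDP
source s_remote { udp(ip(0.0.0.0) port(514)); };

# One messages file per sending host, via the $HOST macro
destination d_per_host {
    file("/var/log/remote_hosts/$HOST/messages" create_dirs(yes));
};

log { source(s_remote); destination(d_per_host); };
```

Each client then simply forwards its syslog stream to this server.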
Root logins: restrict to SSH keys only
- PermitRootLogin without-password
- Optional: restrict to SSHv2
- Optional: hardcode ciphers
Example /etc/ssh/sshd_config:
  Protocol 2
  Ciphers 3des-cbc
  PermitRootLogin without-password
  RSAAuthentication yes
  PubkeyAuthentication yes
  IgnoreRhosts yes
General techniques
- Remove unnecessary packages
- Turn off unneeded services
- Restrict SUID / SGID binaries
- Follow operating system and agency hardening guidelines and templates
Rapid restore following a security incident
- Rebuild replacement systems on spare disk drives
- Retain compromised drives for analysis
- Requires preparation and procedures
NCAR TeraGrid resources use several tiers:
- Daily: rsync the entire system to admin-only storage
- Weekly: full backups to archival storage
- Sporadic: Blu-ray for an offline immutable copy
Recovery install and restore greatly simplified
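The daily tier can be sketched as a single rsync mirror job (the destination convention and exclude list are assumptions; a production setup would add logging and error handling):

```shell
# Daily tier: mirror a system tree to admin-only storage with rsync.
# The default destination path is a hypothetical convention.
backup_daily() {
    src=${1:-/}                              # tree to mirror
    dest=${2:-/admin/backups/$(hostname)}    # admin-only storage
    # -a: archive mode (permissions, ownership, times, symlinks)
    # --delete: keep the mirror exact by removing vanished files
    # --exclude: skip pseudo-filesystems that should not be mirrored
    rsync -a --delete \
        --exclude=/proc --exclude=/sys --exclude=/dev --exclude=/tmp \
        "$src" "$dest"
}
```

Because the mirror is a plain directory tree, post-incident recovery is a reinstall plus an rsync back from known-good storage.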
Jitter: the cumulative effect of small interrupts in code execution across large systems
- Can dramatically affect large-scale workflows
- Could be something as simple as a printer spool daemon checking the queue every few seconds!
- Could be administrators or their tools: in-band management / monitoring functions
- Verify only critical activities are running (process auditing on test nodes)
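Process auditing on a test node can be sketched as a baseline-and-diff over the set of running command names (a real audit would also watch scheduling statistics and interrupt counts):

```shell
# Snapshot the unique command names currently running on this node
audit_snapshot() {
    ps -eo comm= | sort -u
}

# Report commands running now that were not in the saved baseline file
audit_diff() {
    baseline=$1
    # comm -13 keeps only lines unique to the current snapshot (stdin)
    audit_snapshot | comm -13 "$baseline" -
}
```

Take a baseline on a freshly configured node (`audit_snapshot > baseline`), then any non-empty `audit_diff baseline` output flags a daemon that crept in and may be a jitter source.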
Unified fabrics
- Leverage investment in the high-performance interconnect for management and monitoring
- Example: IP over IB instead of a separate Ethernet
- Simplifies management and lowers cost
- Administrative traffic (node boot, monitoring, IPMI, etc.) competes with latency-sensitive user traffic
Configuration file management
- Replicate managed files across sets of nodes
- /etc/{passwd, shadow, hosts}, etc.
- Support libraries, admin scripts
Common packages
- Cfengine
- IBM CSM / xCAT
Node imaging
- SystemImager
- AspenUtils
Stateful node analysis
- Bcfg2: state-based analysis of required modifications
- Policy-based node management: file management, image cloning, etc.
Parallel file system deployment
- Deploy a parallel file system on two nodes
- Compare throughput to your /home NFS file system
Security exercise
- Rootkit your own cluster
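A quick-and-dirty way to compare throughput in the first exercise is a streaming dd write and read against each mount point (the probe size knob `IO_PROBE_MB` is an assumption here; a real benchmark such as IOR or iozone gives much more rigorous numbers):

```shell
# Time a streaming write and read in a given directory with GNU dd.
# Usage: io_probe /scratch   (then compare against io_probe /home)
io_probe() {
    dir=$1
    f="$dir/io_probe.$$"
    mb=${IO_PROBE_MB:-1024}   # probe size in MiB; 1 GiB by default
    # Write test: stream zeros; conv=fsync so the page cache
    # doesn't hide the real cost of the write
    dd if=/dev/zero of="$f" bs=1M count="$mb" conv=fsync 2>&1 | tail -n1
    # Read test: stream the file back to /dev/null
    dd if="$f" of=/dev/null bs=1M 2>&1 | tail -n1
    rm -f "$f"
}
```

Drop caches (or use a file larger than RAM) before the read test, or the numbers will reflect memory bandwidth rather than the file system.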