Building a diskless Linux Cluster for high performance computations
from a standard Linux distribution

Stefan Böhringer
Institut für Humangenetik, Universitätsklinikum Essen
April 7th, 2003

Abstract

This paper describes the steps involved in building a Linux cluster usable
for high performance computing. A diskless design for Linux clusters is
highly beneficial on account of reduced administration costs; therefore an
implementation of a diskless cluster is described here. Since Linux
development is still in flux, the focus is on an abstract treatment; the
actual steps involved at the time of writing are given in highlighted text
sections. The setting considered here involves Preboot Execution
Environment (PXE) capable hardware.

Keywords: Linux, cluster, diskless boot, PXE, tftp, NFS boot, load
balancing, high performance computations

1 Introduction

Linux is a highly stable operating system (OS) that is used for many high
availability tasks, such as web and database servers.

Linux clusters have also been built to serve as high performance
computation facilities [1]. These clusters are commonly referred to as
Beowulf clusters [2], although this term is not well defined. In general
it denotes any collection of Linux boxes in proximity (where the metric
defining proximity is again subject to variation). Using a web search one
can find numerous guides on building a Linux cluster, but in this instance
there did not seem to be a matching document. Therefore this document
focuses on the specific setting considered here; a different setting might
be served better elsewhere. The following aims are pursued:

- OS is Linux
- Diskless boot
- No hardware changes to nodes (no eproms)
- Homogeneous nodes

This guide assumes some UNIX proficiency. It is recommended to use a web
search if any terms seem ambiguous or are unknown to the reader [3]; any
computer topic is treated in depth on the web. In the instance of problems
during the build of a Linux cluster it is highly recommended to use
Google/Groups [4], since numerous problems have been discussed before and
their solutions can be retrieved from the cited service.

This document treats the assembly of the cluster in three sections:

- Choice of hardware
- Configuring the boot process (server configuration)
- Configuring the shared file system (node configuration)

In the following, a practical example based on a RedHat distribution is
reported. The cluster built is referred to as the Genecruncher I cluster
([5]). The steps involved to build that particular type of cluster are
shown in non-proportional font.

2 Choice of Hardware

According to the aims outlined above, there are some constraints with
respect to the hardware usable for a Linux cluster. You need a mainboard
that supports network boot via PXE, which is implemented through a Managed
Boot Agent (MBA). This should not be a problem for most mainboards these
days. At the time of selecting the hardware (autumn 2002) there was no
cheap mainboard available with an onboard network interface card (NIC)
supporting PXE. Therefore a PCI NIC with PXE support had to be added; 3Com
and Intel cards should be usable. The Genecruncher I uses the following
hardware (specific examples are shown in non-proportional font).

Genecruncher I hardware configuration
-------------------------------------
Mainboard: Elitegroup K7S5A
CPU:       AMD Duron 1.2 GHz
Memory:    256 MB
NIC:       3Com 3C905C-TX-M PCI
Graphics:  ATI Xpert 2000/32MB

Additions in the server node:
CDROM drive
80 GB hard disk
Netgear 16 port 100Mbit switch

3 Server configuration

The server is to be installed with a Linux operating system (OS). After
installation of the OS the following services have to be installed and
configured:

- DHCP
- tftp
- PXE
- NFS

A standard RedHat 7.3 distribution, downloaded from the internet, was
installed on the Genecruncher I server. During installation it was ensured
that the DHCP and TFTP software packages were installed.

3.1 Network configuration of the server node

3.2 Dynamic Host Configuration Protocol (DHCP)

When the boot process of a client node starts, that node is bereft of any
information. The first pieces of information a node needs are an IP
address and a host name. This configuration step deals with assigning
constant IP addresses to the nodes, which is a requirement for netbooting
the clients. This requires determining the Ethernet hardware addresses of
the client node NICs, which are reported on any attempted netboot from a
client.

The dhcpd demon is configured via the dhcpd.conf file, usually located at
/etc/dhcpd.conf; it is documented in the dhcpd.conf manual page. On a
RedHat 7.3 box the dhcpd demon is managed via SysV startup scripts. To
activate the service at boot time issue the command:

/sbin/chkconfig --level 345 dhcpd on
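The hardware addresses mentioned above can also be collected without
plugging a monitor into each node. The following sketch assumes that the
cluster-facing interface of the server is eth1 (as in the firewall rules
below); both commands are standard tools, not part of the original setup:

# watch netboot attempts arrive; -e prints the client MAC address
tcpdump -n -e -i eth1 udp port 67
# alternatively, dhcpd logs requests from unknown clients to syslog
grep DHCPDISCOVER /var/log/messages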

Here is an excerpt of the Genecruncher I dhcpd.conf:

option domain-name-servers 132.252.3.10, 132.252.1.7;
option routers 192.168.1.1;

subnet 192.168.0.0 netmask 255.255.255.0 {
    # range 192.168.0.90 192.168.0.90;
}

subnet 192.168.1.0 netmask 255.255.255.0 {
    range 192.168.1.100 192.168.1.254;
}

group {
    filename "pxelinux.0";
    use-host-decl-names on;

    host cn1 {
        fixed-address 192.168.1.100;
        hardware ethernet 00:04:75:9d:32:43;
        option root-path "/tftpboot/192.168.1.100";
    }
    # ... repeat for all diskless nodes
}

3.2.1 Punching holes into the firewall

While setting things up it is perhaps a good choice to turn off
firewalling in the first place. Port accesses for all required services
have to be permitted. On RedHat 7.3 the firewall rules are stored in the
file /etc/sysconfig/ipchains. The relevant ports are 67 for dhcp and 69
for tftp; however, tftp seems to require additional ports for the actual
data transfer.

The following configuration is very liberal and also lets through the nfs
ports (amongst others).

-------------- /etc/sysconfig/ipchains --------------
:input ACCEPT
:forward ACCEPT
:output ACCEPT
-A input -s 0/0 0:65535 -d 0/0 0:65535 -p udp -i eth0 -j ACCEPT
-A input -s 0/0 0:65535 -d 0/0 0:65535 -p tcp -i eth0 -j ACCEPT
-A input -s 0/0 0:65535 -d 0/0 0:65535 -p udp -i eth1 -j ACCEPT
-A input -s 0/0 0:65535 -d 0/0 0:65535 -p tcp -i eth1 -j ACCEPT
# allow for dhcp requests
-A input -s 0/0 67:68 -d 0/0 67:68 -p udp -i eth1 -j ACCEPT
# allow for tftp
-A input -s 192.168.0.0/16 60: -d 0/0 0: -p udp -i eth1 -j ACCEPT
-A input -s 0/0 -d 0/0 -i lo -j ACCEPT
-A input -p tcp -s 0/0 -d 0/0 0:1023 -y -j REJECT
-A input -p tcp -s 0/0 -d 0/0 2049 -y -j REJECT
-A input -p udp -s 0/0 -d 0/0 0:1023 -j REJECT
-A input -p udp -s 0/0 -d 0/0 2049 -j REJECT
-A input -p tcp -s 0/0 -d 0/0 6000:6009 -y -j REJECT
-A input -p tcp -s 0/0 -d 0/0 7100 -y -j REJECT
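After editing the rules they can be reloaded and inspected with the
standard SysV script and the ipchains tool (commands as on RedHat 7.3;
the exact listing format may vary):

/sbin/service ipchains restart
/sbin/ipchains -L -n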

3.3 Trivial File Transfer Protocol (TFTP)

TFTP allows the diskless nodes to load boot code over a NIC, as discussed
in section 3.4. The boot code will then load a Linux kernel, which in turn
will use NFS to perform all further file accesses over the network.

On RedHat 7.3 the TFTP demon is controlled by the xinetd meta demon. You
can enable the service by changing the disable option in the
/etc/xinetd.d/tftp file. The -s option specifies which directory will be
served; in the following /tftpboot is assumed.

-------------- /etc/xinetd.d/tftp --------------
service tftp
{
    socket_type = dgram
    protocol    = udp
    wait        = yes
    user        = root
    server      = /usr/sbin/in.tftpd
    server_args = -s /tftpboot
    disable     = no
    per_source  = 11
    cps         = 100 2
    only_from   = 0.0.0.0/0
}

3.4 Preboot Execution Environment (PXE)

PXE is a standard for network cards to load boot code over the network.
This is done via tftp, so you have to populate the tftp directory with the
appropriate files. To make use of the PXE booting process you need to
download PXELINUX ([6]), which is part of the SYSLINUX project ([7]).
pxelinux.0 is the binary from the PXELINUX distribution which allows ix86
machines to boot Linux.

A directory named pxelinux.cfg is to be created in the /tftpboot
directory. Within that directory a file named default is to be created; it
holds the name of the kernel and the kernel options:

label linux
    kernel bzimage
    append root=/dev/nfs ip=dhcp init=/sbin/init
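Once tftp is running and pxelinux.0 has been copied to /tftpboot, the
setup can be sanity-checked from any Linux box on the cluster network
using the standard tftp client (a hypothetical session; 192.168.1.1 is
the server address assumed throughout):

tftp 192.168.1.1
tftp> get pxelinux.0
tftp> quit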

3.5 Network File System (NFS)

The server has to expose parts of its filesystem to be mountable by the
clients and used by them as their root filesystems. The kernel is
instructed to use the /dev/nfs device to mount its file system (sec. 3.4).
This device has to be created in the /dev directory and serves as a mere
placeholder:

mknod /dev/nfs c 0 255

The kernel can be told to use that device as root file system by means of
the rdev command:

rdev bzimage /dev/nfs

To allow for file access via NFS later in the boot process you have to
configure the NFS server via the /etc/exports file. You have to add a line
like

/tftpboot 192.168.1.0/24(rw,no_root_squash)

where 192.168.1.0 is the network of the client nodes. In the Genecruncher
implementation, both the rdev setting and the PXE append options were
used. To enable nfs on RedHat 7.3 issue the commands:

chkconfig --level 345 nfs on
chkconfig --level 345 nfslock on

Again, the firewall rules must let nfs network accesses through.
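After editing /etc/exports the exports can be activated and verified with
the standard nfs-utils commands (192.168.1.1 again being the assumed
server address):

exportfs -ra
showmount -e 192.168.1.1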

This completes the server configuration. Move on to the clients.

4 Client configuration

Except for the BIOS configuration of the diskless nodes, the client
configuration takes place entirely on the server, on account of the
lacking disk space.

4.1 Hardware

Since the client nodes are diskless there is not much to configure. First,
you have to enable netboot in the BIOS. There may be several netboot
options in the BIOS, of which the MBA boot option is the correct one.
Second, you have to determine the hardware Ethernet address of the network
card in the client node. To get it, plug in the network cable and a
monitor and start the boot process. At some point the NIC tries to obtain
an IP address via DHCP and should display its hardware address at this
point in time. Since the NIC is waiting, you should have enough time to
write the address down. It turns out to be useful to tag each node with a
label bearing its hardware address, so that you will not have to plug a
monitor into the node again. The hardware address has to be supplied to
the dhcp daemon running on the server as described above (section 3.2).

4.2 Software

4.2.1 Building a client kernel

Building a new kernel requires you to install the kernel sources. Once
these are installed you can enter the top level source directory. Under
RedHat 7.3 the sources are supplied on the CD distribution; if you did not
install them in the first place you can install the kernel source RPM
(kernel-source-2.4.18-3). The source directory is placed at:

/usr/src/linux-2.4.18-3

Now issue make menuconfig to configure a new kernel capable of netbooting.
This kernel must understand the following things, which have to be
compiled statically into the kernel:

- Support for the NIC installed for netbooting
- The ability to configure itself via DHCP
- The ability to NFS mount its file system
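For orientation, the corresponding configuration symbols in a 2.4.18
.config look roughly as follows. This is a sketch: the option names are
taken from the 2.4 kernel series, and CONFIG_VORTEX is assumed because it
is the 3c59x driver covering the 3C905C NIC used here.

CONFIG_VORTEX=y        # 3Com 3c59x/3c9xx NIC driver, compiled in
CONFIG_IP_PNP=y        # IP: kernel level autoconfiguration
CONFIG_IP_PNP_DHCP=y   # IP configuration via DHCP
CONFIG_IP_PNP_BOOTP=y  # IP configuration via BOOTP
CONFIG_NFS_FS=y        # NFS file system support
CONFIG_ROOT_NFS=y      # root file system on NFS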

The NIC you have installed in the clients has to be chosen from the list
of supported network cards under the Network device support / Ethernet (10
or 100Mbit) menu entries. If you are not sure about your NIC, plug it into
another Linux box and let that box autoconfigure the card; then you can
inspect the file /etc/modules.conf (on RedHat) to learn which driver is
used. You have to press Y, and a * should appear before the NIC driver,
indicating that static support for that particular NIC is included in the
kernel.

Next, choose the Networking options from the main menu. Here you have to
enable IP: kernel level autoconfiguration. This enables the dhcp and bootp
options, both of which have to be included statically (in the TCP/IP
networking section). By some obscure quantum effect this makes another
option appear under File Systems / Network File Systems / NFS file system
support. Both NFS file system support and Root file system on NFS are to
be compiled statically into the kernel.

Now you are done. Save the configuration and issue make dep ; make
bzImage. On RedHat 7.3 the resulting kernel image is located at:

/usr/src/linux-2.4.18-3/arch/i386/boot/bzImage

This file is then to be copied to the tftpboot directory (section 3.4).
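The copy amounts to a single command; the target name bzimage (lowercase)
is an assumption made here so that it matches the kernel entry in the
pxelinux.cfg/default file shown in section 3.4:

cp /usr/src/linux-2.4.18-3/arch/i386/boot/bzImage /tftpboot/bzimage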

On problems do a Google search ("Linux Kernel HOWTO").

4.2.2 Putting together the client filesystem

By convention a root tree for each client is placed in /tftpboot/IP, where
IP is the IP address of the node as specified in the dhcpd.conf file (e.g.
/tftpboot/192.168.1.100). The following script was used to create a
template root tree from which the actual client root trees are created
(createclienttemplate.sh):

#!/bin/sh
CUSTOM_FILES=/home/pingu/clusterFiles
CLIENT_TEMPLATE=/tftpboot/client
DIRS_FROM_SERVER="bin lib usr boot etc sbin"

echo "Cleaning out destination dir..."
rm -rf $CLIENT_TEMPLATE
mkdir $CLIENT_TEMPLATE

echo "Copying directories: $DIRS_FROM_SERVER..."
( cd / ; cp -r $DIRS_FROM_SERVER $CLIENT_TEMPLATE )

echo "Creating devices..."
./MAKEDEV -d $CLIENT_TEMPLATE/dev generic
./MAKEDEV -d $CLIENT_TEMPLATE/dev console
./MAKEDEV -d $CLIENT_TEMPLATE/dev loop0
losetup $CLIENT_TEMPLATE/dev/loop0 $CLIENT_TEMPLATE/var/swap

echo "Copying customized files..."
cp $CUSTOM_FILES/rc.sysinit $CLIENT_TEMPLATE/etc/rc.d
cp $CUSTOM_FILES/fstab $CLIENT_TEMPLATE/etc
cp $CUSTOM_FILES/inittab $CLIENT_TEMPLATE/etc

This script copies essential directories (DIRS_FROM_SERVER) as they are on
the server machine. Second, the devices are created using the MAKEDEV
tool; essential is the console device used for kernel output, while the
loop0 device is used later to establish a swap file. Third, a few files
are replaced to account for netbooting.

The rc.sysinit replacement is necessary to allow for the setup of a proper
swap file on RedHat 7.3. It is not listed here on account of its size but
can be downloaded from the supplementary website (see below).

The fstab lists the nfs mounts of the client nodes. All these directories
are shared between nodes. The IP address 192.168.1.1 is the address of the
server and must be replaced with the correct address. Excerpt:

192.168.1.1:/tftpboot/IP             /        nfs    rw,bg,soft,intr 0 0
192.168.1.1:/tftpboot/client/usr/bin /usr/bin nfs    rw,bg,soft,intr 0 0
192.168.1.1:/tftpboot/client/usr/etc /usr/etc nfs    rw,bg,soft,intr 0 0
...
none                                 /dev/pts devpts gid=5,mode=620  0 0
none                                 /proc    proc   defaults        0 0
/dev/loop0                           swap     swap   defaults        0 0

The inittab configures the runlevels of the nodes:

...
# Default runlevel. The runlevels used by RHS are:
#   0 - halt (Do NOT set initdefault to this)
#   1 - Single user mode
#   2 - Multiuser, without NFS (The same as 3, if you do not have networking)
#   3 - Full multiuser mode
#   4 - unused
#   5 - X11
#   6 - reboot (Do NOT set initdefault to this)
#
id:3:initdefault:

# System initialization.
si::sysinit:/etc/rc.d/rc.sysinit
...
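The rc.sysinit invoked here contains the swap setup mentioned above. The
actual file is on the supplementary website; its essential addition
presumably amounts to something like:

# bind the node's swap file to the loop device, then activate it
# via the fstab swap entry shown above
losetup /dev/loop0 /var/swap
swapon -a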

For each node to be added for diskless boot, a unique file system has to
be created. Here is a script to build such a tree from the template file
structure. The script has to be supplied with the IP address of the node
to be added.

#!/bin/bash
BASE=/tftpboot
CLIENT_TEMPLATE=$BASE/client
NODE=$BASE/$1

if [ "$1" = "" ]; then
    echo "USAGE: $0 node-ip"
    exit 1
fi

echo Creating filesystem for node $1
mkdir $NODE

# the following dirs are on their own on each node:
#   boot etc dev [duplicated]
#   tmp root home [empty]
#   var [skeleton]
# all others are mounted via nfs:
#   bin lib sbin usr

echo Creating empty dirs and skeletons...
for d in tmp root home proc usr ; do
    mkdir $NODE/$d
done

echo Duplicating needed dirs...
cd $CLIENT_TEMPLATE
for d in dev boot etc sbin bin lib usr/sbin ; do
    echo Duplicating $d...
    tar cpf - $d | ( cd $NODE ; tar xpf - )
done

echo Creating var skeleton...
mkdir $NODE/var
for d in local log spool spool/anacron spool/at opt db lib lib/nfs lib/nfs/statd tmp lock ; do
    mkdir $NODE/var/$d
done
touch $NODE/var/lib/nfs/rmtab
touch $NODE/var/lib/nfs/xtab

echo Creating usr skeleton other than sbin...
for d in bin dict etc games GNUstep include info kerberos lib libexec local man share src ; do
    mkdir $NODE/usr/$d
done

echo Creating swap file...
dd if=/dev/zero of=$NODE/var/swap bs=1k count=50k
/sbin/mkswap $NODE/var/swap

The script does the following. First it creates a new root directory for
the node. Second it creates empty, private directories for the node. Third
it copies private directories from the template file tree. Fourth it
creates empty directories for the directories that are mounted via NFS and
shared amongst the nodes. Fifth, a swap file (50 MB) is created for the
node; the /dev/loop0 device created earlier serves as a loopback device
which allows mounting that file as a swap device.
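Assuming the script is saved as createclienttree.sh (a name chosen here
for illustration), adding the node configured in the dhcpd.conf excerpt
above amounts to:

./createclienttree.sh 192.168.1.100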

Files can be downloaded from [5].

TROUBLESHOOTING

On RedHat 7.3 the boot process used to start hanging during the
experimental phase of the configuration. It turned out that the
/dev/console file had become corrupted. Recreating /dev/pts and
/dev/console remedied this problem. Issue:

mkdir $CLIENT_TEMPLATE/dev/pts
./MAKEDEV -d $CLIENT_TEMPLATE/dev console

4.2.3 Finishing off

Well, there should not be any issues left. However, you should be ready to
accept some time investment to have your own cluster up and running. This
guide may contain errors, and new RedHat versions may vary from 7.3 in
many aspects.

The client nodes were established by first making a single node bootable.
It turned out to be useful to create the necessary users and do
application and service setup on that machine, and later to pull off
copies for all nodes.

5 Remaining topics

You are likely to want to install load balancing software on your cluster.
This topic is not covered here; you are referred to two references to
explore the options ([8, 9]).

You are encouraged to give feedback on this document, be it error reports
or comments.

References

[1] http://top500.org
[2] http://www.beowulf.org
[3] http://google.com
[4] http://www.google.com/grphp
[5] http://www.s-boehringer.de/genecruncher
[6] http://syslinux.zytor.com/pxe.php
[7] http://syslinux.zytor.com
[8] http://lcic.net
[9] http://zillion.sf.net