Cluster Computing at HRI

J. S. Bagla
Harish-Chandra Research Institute, Chhatnag Road, Jhunsi, Allahabad 211019.
E-mail: jasjeet@mri.ernet.in

1 Introduction and some local history

High performance computing is one of the key requirements in nearly all branches of science. Precise calculation of predictions in models of materials and physical processes often requires recourse to the numerical solution of equations. In many cases the system of equations is too large to be tractable on desktop computers, and in some cases it is impossible to solve the full system of equations even on the fastest computers. Until a few years ago, even approximate studies of such systems could be carried out only at a handful of institutions with free access to supercomputers. In recent years the idea of cluster computing has been adopted by the academic community as an affordable alternative to conventional supercomputers. This has allowed researchers to tackle problems that had so far been set aside for want of computing power. In the Indian context, CDAC and BARC were the pioneers, and they started setting up clusters about ten years ago. In recent years several other institutions have followed suit and have set up small clusters for their high performance computing requirements.

Harish-Chandra Research Institute made its first foray into this field with a modest 12 node cluster of Pentium-3 computers connected with fast Ethernet. This cluster was set up in the second half of 2000. Expertise to set up and manage the cluster was developed in house with the help of documentation available on the Internet. The first steps towards parallel programming and the development of parallel algorithms were taken on this cluster. This success encouraged us to propose setting up a major cluster computing lab at HRI. Experience with the 12 node cluster was very useful
and we decided to use this experience to set up a 16 node cluster and test some of the ideas we wanted to implement on the proposed large cluster.

Fig.1: The 16 node cluster.

Hardware for the 16 node cluster was designed to optimise performance for our applications while keeping the cost down. This cluster was set up in mid 2002. Each node of the 16 node cluster has a Pentium-4 processor (1.6 GHz), 1 GB RAM, a 20 GB hard disk and a fast Ethernet network interface card. Various components of the cluster were chosen specifically to enhance performance: for example, the NIC was chosen for its low latency, and we preferred an RDRAM solution as it offers very good memory bandwidth for 32-bit workstations. Initially we experimented with two NICs on each node in order to separate network services from compute traffic; however, we found that this increases latency and hence is not a good solution. The excellent performance of this cluster gave us confidence in our ability to tailor the configuration to optimise performance for our applications. CPU utilisation of this cluster during its first year was greater than 65%. Numerical computations done on this cluster have been used in many research projects and papers. Apart from the parallelisation of some simple problems, a highly optimised parallel algorithm for N-body simulations of galaxy formation was developed and tested on this cluster. This cluster is still in use and we expect to keep it operational for at least one more year.

2 Computing beyond 200 GigaFlops

This year a new cluster, named Kabir, has been set up at HRI. It was proposed as a part of the project titled Scientific Computing and Networking for the X five year plan (2002-2007) at the Harish-Chandra Research Institute (HRI), by J. S. Bagla, P. Majumdar and S. Naik.

The issues that we had to confront related mainly to the interconnect. It was clear that the usual network services like NIS and NFS would not scale well beyond a point, and that we would require specially designed servers if we crossed that limit. In addition, it was clear that the poor latency of Ethernet networks would inhibit scaling to a large number of processors for nearly all applications. Thus we decided to opt for a low latency, high bandwidth interconnect. Such solutions are expensive, costing more than a typical single processor workstation.
Thus it was decided to build a cluster of SMP nodes in order to reduce the cost as well as to address the issue of scaling of network services. We had two comparable options, Myrinet and the Scalable Coherent Interface (SCI); we chose SCI for the simple reason that it is more economical for clusters whose number of nodes is not a power of two. Amongst SMP solutions, dual Xeon servers seemed to be an ideal choice in terms of performance and price, as well as support at the operating system and application software level.
Fig.2: Front view of the Kabir cluster.

The Kabir cluster is the realisation of these concepts. Each node has 2 GB or more of main memory, a 40 GB hard disk and two 2.4 GHz Xeon CPUs. The nodes are connected using the SCI network. SCI is a very high performance, switchless interconnect that gives nearly the same performance for inter-node communications as for communications between processors on the same node. We have measured latencies as low as 4 µs at the application level for point to point messaging, and the point to point bandwidth for large messages is as high as 275 MB/s, higher than that of Gigabit Ethernet.

Fig.3: The Kabir cluster as seen from the back.

In addition to the SCI network, we use an Ethernet network for network services like NIS and NFS. KVM switches have been used to enable the administrator to access the consoles of all the machines from one keyboard, mouse and monitor.
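The measured latency and bandwidth figures above can be combined in the standard latency-bandwidth ("alpha-beta") model to estimate message transfer times. The sketch below uses the SCI numbers quoted in the text; the Gigabit Ethernet figures are illustrative assumptions for comparison, not measurements made on this cluster.

```python
# Latency-bandwidth model for message transfer time:
#   t(n) = alpha + n / beta,
# where alpha is the per-message latency and beta the bandwidth.
# SCI figures are the measurements quoted above; the Gigabit Ethernet
# figures are assumed typical values, not measured on Kabir.

def transfer_time(nbytes, latency_s, bandwidth_bps):
    """Estimated time (seconds) to send a message of nbytes."""
    return latency_s + nbytes / bandwidth_bps

SCI = dict(latency_s=4e-6, bandwidth_bps=275e6)    # measured on Kabir
GBE = dict(latency_s=50e-6, bandwidth_bps=100e6)   # assumed, for comparison

for size in (64, 4096, 1 << 20):                   # message sizes in bytes
    t_sci = transfer_time(size, **SCI)
    t_gbe = transfer_time(size, **GBE)
    print(f"{size:>8} B   SCI {t_sci * 1e6:9.1f} us   GbE {t_gbe * 1e6:9.1f} us")
```

For small messages the latency term dominates, which is why a low latency interconnect matters for tightly coupled parallel applications even when the bandwidth advantage is modest.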
We have used the SuSE 8.2 distribution with Linux kernel 2.4.20 on these servers. Several useful libraries of subroutines for numerical computations have been installed, e.g., BLAS, ATLAS, FFTW and PETSc. OpenPBS is used for job scheduling.

2.1 Monitoring

Monitoring of the systems is done using some home-grown scripts that use the /proc filesystem and lm_sensors to obtain information about the hardware, CPU usage, etc. The monitoring is done by bash scripts on each node. These scripts read information like the load average, up time, CPU temperature, etc., and record it in a set of files. A central processing script collates all this data and prepares content for three web pages displaying the status of the cluster nodes. A summary table gives an overview of the information collected about each node. Two graphic pages show the temperature and CPU load average for the last ten minutes in a colour coded fashion. For each node, information about the most active processes is also available. These data are available on the cluster web pages and are updated every ten minutes.

Fig.4: Variation of performance of the Kabir cluster with problem size, as measured using the HPL benchmark.
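The per-node monitoring step described above can be sketched as follows. The actual scripts on Kabir are written in bash; this Python version only illustrates the same idea, and the node name and record format here are hypothetical.

```python
# Sketch of the per-node monitoring described above: each node reads
# /proc/loadavg (and sensor data) and emits one status record; a central
# script collates such records into the summary table and status pages.
# The record format and node names here are illustrative assumptions.

def parse_loadavg(text):
    """Parse the contents of /proc/loadavg into 1-, 5- and 15-minute averages."""
    fields = text.split()
    return tuple(float(x) for x in fields[:3])

def node_record(name, loadavg_text, temp_c):
    """One line of status, as a central collating script might consume it."""
    one, five, fifteen = parse_loadavg(loadavg_text)
    return f"{name} load={one:.2f}/{five:.2f}/{fifteen:.2f} temp={temp_c:.0f}C"

# Example with a sample /proc/loadavg line (first three fields are the
# load averages; the remaining fields are ignored here):
print(node_record("node01", "0.42 0.35 0.30 1/123 4567", 38.0))
```

On the real cluster a record like this would be written to a file every few minutes, and the central script would read the files from all nodes to build the colour coded web pages.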
2.2 Performance Benchmarks

Each server has been clocked at 5.7 Gflops with the HPL benchmark using GNU gcc, the ATLAS libraries and ScaMPI. The cluster as a whole has been benchmarked at 223.2 Gflops, making it the fastest supercomputer in research institutes and universities in India at present. This performance was measured with the HPL implementation of the Linpack benchmark for a problem size of 100800. Performance drops to half this value when the problem size is 11400; thus the range of problem sizes over which we get high performance is significant. In terms of performance, then, Kabir can compete with cutting edge supercomputers. The only faster supercomputers in India are a research cluster at Intel, Bangalore, and the Param Padma developed by CDAC.

2.3 Associated Facilities

A large cluster like Kabir needs to be supported by associated facilities such as a data storage area, data backup facilities, servers, uninterrupted power supply and a cool and clean environment. We have a 1.1 TB central data storage facility. This is a RAID system with 16 IDE disks of 80 GB each. The RAID array connects to a host computer through the SCSI interface. There are several options for taking data backups on removable media, with backup capacities of up to 80 GB. These include CD-RW and DVD-RW drives for random access backup media. Tape drives (DAT 12/24, DAT 20/40 and DLT1 40/80) are available for larger data volumes. These facilities, along with network services, are hosted on a group of four servers. We also have five workstations with a large amount of RAM (8 GB each) to facilitate work on large problems with sequential codes.

The facilities were set up and managed without recourse to hiring additional manpower. The project investigators themselves have taken time out to carry out the computing related setup, all the way from choosing the configuration to the installation of the operating system and scientific software.
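The HPL problem sizes quoted in the benchmark section above can be related to the cluster's memory and to the work involved. HPL solves a dense N x N double-precision linear system, so memory grows as N^2 and the operation count as (2/3) N^3; the sketch below evaluates these standard formulae for the two problem sizes mentioned in the text.

```python
# Memory footprint and operation count for the HPL (Linpack) benchmark.
# These are the standard HPL formulae: an N x N matrix of 8-byte doubles,
# and a leading-order flop count of (2/3) N^3 + 2 N^2 for the LU solve.

def hpl_memory_bytes(n):
    """Memory for the N x N double-precision matrix (8 bytes per element)."""
    return 8 * n * n

def hpl_flops(n):
    """Leading-order floating-point operation count for the HPL solve."""
    return (2.0 / 3.0) * n**3 + 2.0 * n**2

for n in (11400, 100800):
    gb = hpl_memory_bytes(n) / 1e9
    print(f"N = {n:>6}: matrix needs {gb:7.2f} GB, about {hpl_flops(n):.2e} flops")
```

For N = 100800 the matrix alone needs roughly 81 GB, which only fits in the aggregate memory of the whole cluster, and at the measured 223.2 Gflops the solve takes of order an hour; for N = 11400 the matrix is only about 1 GB, small enough that communication costs start to dominate, which is consistent with the performance dropping to half at that size.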
Given the wide and fruitful use of the earlier, smaller cluster, we expect Kabir, the new cluster, to be a very valuable and useful resource. More information about cluster computing at HRI is available at the web page http://cluster.mri.ernet.in.

Acknowledgements

We received considerable help and support from everyone at HRI. Without this help, support and good wishes, it would not have been possible to set
up the cluster in such a short time. The administrative staff was always prompt in finding solutions whenever we faced difficulties in placing orders or dealing with vendors. Their positive approach meant that there were few procedural delays: the median time between order and supply was less than 30 days, with only a few orders taking longer. This help, and the support of Prof. Ravi Kulkarni, the Director, meant that there were hardly any delays between requisition and the placing of orders once we had obtained approval for our approach from the computer committee. A considerable amount of work was put in by the engineering section, especially on electrical wiring and distribution. Detailed initial planning, and then prompt improvisations as we encountered problems, meant that we could minimise downtime even in the testing phase. The computer centre staff provided timely help by loaning equipment in cases where our orders were delayed. In particular, the loan of a 20 kVA UPS allowed us to set up and test the cluster without serious delays. Students at HRI, especially Suryadeep Ray, Sanjeev Kumar and Jayanti Prasad, put in many patient hours helping with the setup and maintenance of the cluster. Manu Awasthi, a visiting student, helped considerably during the setting up of the cluster.