ClusterWorX®: A Framework to Manage Large Clusters Effectively

Dr. Thomas M. Warschko
Linux NetworX Inc., Sandy, Utah, USA (http://www.lnxi.com)
E-mail: twarschko@lnxi.com

Abstract

Linux clusters are poised to become the high-performance compute engine of choice for research labs as well as industry. Clusters are now well known for their flexibility, reliability, scalability, and price/performance ratio compared to traditional supercomputers, and Linux seems to be the operating system of choice to drive these clusters effectively. As cluster systems scale to thousands of processors, management becomes exponentially more complex and can be a daunting challenge for any organization. To alleviate this effort, Linux NetworX has developed ClusterWorX®, which integrates all aspects of cluster management and administration within a simple and user-friendly solution.

Keywords: Cluster Management, LinuxBIOS, ICE Box™, ClusterWorX®, High-Speed Interconnects, High-Performance Cluster Computing.

1 Introduction

Paying for expensive and proprietary software and hardware only to end up tied to an inflexible platform is a trend of the past. Today's rapidly changing IT industry is shifting towards open source software platforms using commercial off-the-shelf hardware components. By using Linux as the operating system and hardware based on standard x86 architectures, Linux clustering is the culmination of both of these concepts. It leverages the work of the open source community while harnessing the power of low-cost components to deliver a solution that is powerful, scalable, flexible, and very reliable.

Despite the cost savings, questions remain about the manageability of Linux clusters. A common myth is that Ph.D.-level knowledge is required to adopt the technology. At one time this was true. The earliest adopters of Linux clusters were in fact universities and national laboratories, generally because they possessed the knowledge base and resources to take on the challenge of setting up and maintaining a cluster system. Today, however, vendors provide services such as integration, installation, system optimization, and training. New cluster management tools also help empower administrators over these complex systems. Today, the barriers to adopting the technology have been significantly lowered.

Administrators have several issues and concerns with managing and maintaining a Linux cluster. Cluster administrators need to know not only where the nodes are, but also who they are with, what they are doing, how hard they are working, and even the locations of the network bottlenecks. They need to see all, know all, and be able to take action on the system remotely. The challenge for the administrator is finding the best available tools to do this job as painlessly as possible. Cluster administrators need empowering tools that essentially make them omniscient and omnipotent over their systems. Items to consider include cluster efficiency, hardware failures, software upgrades, remote access, cloning and storage management, and system consistency, integrated within a single tool to make an administrator's life easy. In fact, this was the motivation when designing the components of ClusterWorX.

This article is organized as follows: section 2 discusses the LinuxBIOS project and its features in detail, section 3 focuses on the integration of the ICE (Integrated Cluster Environment) Box hardware within ClusterWorX, section 4 explains the cloning and
image management capabilities, and section 5 the event handling and monitoring of ClusterWorX. A hint at further software developments is given in section 6, and a conclusion is drawn in section 7.

2 LinuxBIOS

The primary motivation behind LinuxBIOS [6] is the desire to have the operating system gain control of a cluster node from power on. It aims to replace the normal BIOS found on PCs with a Linux kernel that can boot Linux from a cold start. LinuxBIOS is primarily Linux with a few changes to the current Linux kernel. It initializes the hardware, activates serial console output, checks for valid memory, and starts loading the operating system - only it does so in about 3 seconds, whereas most commercial BIOS alternatives require about 30 to 60 seconds to boot.

Current PCs used as cluster nodes depend on a vendor-supplied BIOS for booting. The BIOS in turn relies on inherently unreliable devices such as video cards, floppy disks, CD-ROMs, and hard drives to boot the operating system. In addition, current BIOS software is unable to accommodate non-standard hardware, making it difficult to support experimental work. The BIOS is slow, often redundant, and, most importantly in a cluster environment, difficult to maintain. Imagine walking around with a keyboard and monitor to every one of the 1000 nodes in a large cluster to change one BIOS setting.

Using a real operating system to boot another operating system provides much greater flexibility than using a simple netboot program or the BIOS. Because Linux is the boot mechanism, it can boot over standard Ethernet or over other interconnects such as Myrinet [9], Quadrics [10], or SCI [11]. It can use SSH connections to load the kernel, or it can use the InterMezzo caching file system or traditional NFS. Cluster nodes can be as simple as they need to be - perhaps as simple as a CPU and memory, with no disk, no floppy, no graphics adapter, and no file system. The nodes will be much less autonomous, making them easier to maintain.

With a terminal server such as the ICE Box (see section 3), an administrator is able to trace the boot process from the very beginning and access the nodes using the serial console. LinuxBIOS reports all detected errors and hardware failures on the serial console. The output is captured and logged through the ICE Box to allow even post-mortem troubleshooting of nodes.

After initializing the hardware, LinuxBIOS is able to boot from the network or from the local hard disk. Booting options (see section 4) can easily be changed using ClusterWorX or network configuration options such as DHCP. Additional tools are provided to change BIOS settings or to flash new LinuxBIOS releases on demand. Because LinuxBIOS can be accessed and configured from within the Linux operating system, changes can be made remotely to a single node or to all nodes in a cluster system. These changes become active as soon as the nodes are rebooted.

3 ICE Box™

The ICE Box provides three essential cluster management capabilities: serial console, remote power management, and remote monitoring, accessible through a variety of protocols (NIMP, SIMP, serial, Telnet, SSH, SNMP, ClusterWorX). All the hardware of an ICE Box is controlled, and all services are provided, by an embedded computer running Linux. For detailed information on the ICE Box see [4, 3].

3.1 Remote Power Management

Controlling the power to the nodes and other devices is a basic cluster management task. However, this feature is one that is most often overlooked in the cluster system design.
A remotely managed power solution is superior to one that requires an on-site user. Each ICE Box provides power to ten compute nodes and two auxiliary devices. Two 15 A power inlets each provide power to five nodes and one auxiliary device. Whereas the node outlets can be power-cycled on demand, the auxiliary outlets are powered on and stay on as long as the ICE Box is receiving power. This ensures that host nodes, switches, and other devices are not powered off by mistake. During the power-up procedure, the ICE Box also automatically sequences power, reducing the risk of power spikes.

3.2 Remote Monitoring

The ICE Box hardware contains power and temperature probes and a reset switch for each node. The reset switch allows the user to remotely reset any standard motherboard, preventing a full power down. The power probe is used to detect failing power supplies, and the temperature probes are used in combination with the event handling capabilities of ClusterWorX (see section 5) to prevent overheating of the system.

3.3 Serial Terminal Access

Serial terminal access, also known as console port management or serial console, is generally used for managing remote systems in data centers. Though not a new technology, terminal servers have not been widely used for clusters because of the low scalability and legacy design of traditional console access solutions. The ICE Box overcomes this challenge by offering unprecedented scalability and high port density, making it a strong fit for cluster management.

Serial networks provide remote access to a machine by opening a UNIX console through the serial (COM) port on a machine. However, this type of access usually has two inherent problems: it requires a user to plug in a cable, and it is not scalable. To solve this problem, terminal servers are used to access many serial devices from a centralized location. Besides providing serial access to each connected device, the ICE Box also provides logging and buffering (up to 16 KB) of the output of each serial device. This capability allows even post-mortem analysis of what has happened to a specific node.

3.4 Accessing the ICE Box

The ICE Box itself provides serial as well as network (Ethernet) access. There are native command protocols which can be used by ClusterWorX or other software to control the ICE Box remotely: the serial ICE management protocol (SIMP) operates over the serial connection of an ICE Box, and the network ICE management protocol (NIMP) uses the onboard Ethernet of an ICE Box. Furthermore, the ICE Box provides access via telnet and ssh (v1 and v2), and native IP filtering can be used for higher security. Telnet and ssh connections can be established either with the ICE Box itself or with each individual device connected to the ICE Box using specific port numbers. Last but not least, the ICE Box is SNMP compliant, so ICE Boxes can be controlled through standard SNMP management software.
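As a rough illustration of this kind of programmatic access, the sketch below opens a raw TCP connection to a per-device console port and dumps whatever buffered output arrives. The host name and port number are hypothetical placeholders (consult the ICE Box documentation [4] for the actual conventions); in everyday use a standard telnet or ssh client serves the same purpose.

/* Minimal sketch: open a raw TCP connection to a terminal-server port and
 * print the console output that arrives.  The host name and the per-device
 * port number are hypothetical; real access would normally go through a
 * telnet or ssh client. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netdb.h>
#include <sys/socket.h>

int main(void)
{
    const char *host = "icebox1";   /* hypothetical ICE Box host name        */
    const char *port = "7001";      /* hypothetical console port for node 1  */
    struct addrinfo hints, *res;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family   = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;
    if (getaddrinfo(host, port, &hints, &res) != 0)
        return 1;

    int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (fd < 0 || connect(fd, res->ai_addr, res->ai_addrlen) < 0)
        return 1;

    char buf[4096];
    ssize_t n;
    while ((n = read(fd, buf, sizeof(buf))) > 0)   /* dump console output */
        fwrite(buf, 1, (size_t)n, stdout);

    close(fd);
    freeaddrinfo(res);
    return 0;
}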
4 Image Support and Cloning

Disk image consistency is accomplished using a technique called disk cloning - a process of quickly copying a system image from the ClusterWorX management host to individual nodes within the cluster. Disk cloning allows the administrator to load or update the operating system on single nodes, or on the entire cluster at one time, using reliable multicast technology. Using a multicast mechanism, even a single Fast Ethernet is sufficient to clone several hundred nodes simultaneously (it took about 12 minutes to clone and reboot over 400 nodes of the Lawrence Livermore cluster). On startup, all participating nodes listen to the multicast stream, buffering the received data locally. Once the multicast stream has been spread out, individual nodes acknowledge the reception of the new image in a round-robin fashion controlled by the cloning host. If an individual node is still lacking image data, the missing parts are transferred during the acknowledgement phase on a peer-to-peer basis with the master node. As soon as a node has received all the image data, it starts the cloning process locally and reboots itself into operational mode.
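As a simple illustration of the receive side of such a multicast transfer, the sketch below joins a multicast group and buffers incoming datagrams. The group address, port, and payload handling are hypothetical; the actual ClusterWorX cloning protocol additionally implements the round-robin acknowledgement and peer-to-peer repair described above.

/* Minimal sketch of a multicast image receiver, assuming a hypothetical
 * group address and port.  Reliability (acknowledgement and repair of
 * missing chunks) is omitted. */
#include <stdio.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

#define GROUP "239.0.0.1"   /* hypothetical multicast group */
#define PORT  9000          /* hypothetical port            */

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    struct sockaddr_in addr = { 0 };
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port        = htons(PORT);
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));

    /* Join the multicast group so the kernel delivers the image stream. */
    struct ip_mreq mreq;
    mreq.imr_multiaddr.s_addr = inet_addr(GROUP);
    mreq.imr_interface.s_addr = htonl(INADDR_ANY);
    setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq));

    char pkt[8192];
    ssize_t n;
    while ((n = recvfrom(fd, pkt, sizeof(pkt), 0, NULL, NULL)) > 0) {
        /* Buffer the received chunk locally; a real receiver would track
         * which image offsets are still missing and request them later. */
        printf("received %zd bytes\n", n);
    }
    close(fd);
    return 0;
}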
With ClusterWorX, cloning is done from the easy-to-use GUI. Administrators are able to load the OS and applications to build the required functionality into an image. Then ClusterWorX automatically clones the images to the selected nodes. Improvements to cloning add the ability to more easily update the kernel on all nodes, create new types of images, and update files or packages on the nodes in parallel. Disk cloning greatly reduces the time, effort, and cost of installing, upgrading, or updating a large cluster system. For convenience we offer prebuilt images for cloning, for hard-disk as well as NFS boot. Furthermore, customized images can be built with little effort (see [5]).

5 Monitoring and Event Handling

ClusterWorX is the main framework of our cluster management solution. Besides providing a graphical user interface (GUI) to the ICE Box and to the disk cloning facilities, ClusterWorX is responsible for monitoring and event handling within a cluster, which is the topic of this section. For a more detailed description of ClusterWorX see [1, 2].

5.1 Monitoring

ClusterWorX can monitor virtually any system function, including CPU usage, CPU type, network bandwidth, memory usage, disk I/O, and system uptime. It comes standard with over 40 monitors built in. The UDP echo port is used to ensure network connectivity. In addition, ClusterWorX offers plug-in support so administrators can include their own monitors. In combination with additional sensor packages (e.g. lm_sensors [7]) it is possible to monitor fans as well as CPU and board temperature, although temperature monitoring is usually accomplished using the ICE Box sensors. A plug-in itself can be any program, script (shell, Perl, etc.), or any combination thereof; as long as it resides in the ClusterWorX plug-in directory it will be recognized by the system automatically (see the sketch at the end of this section). This flexible plug-in concept allows ClusterWorX to fit the needs of any system, no matter how unique its functionality.

Through a secure connection, ClusterWorX allows administrators to remotely monitor and manage a cluster system from an on-site or off-site location with any Java-enhanced browser. If problems arise, administrators have full access to the cluster at home or on the road. ClusterWorX is written in Java for cross-platform, client-side independence. The Java-based GUI provides a platform for advanced, visually based cluster management. The three-tier design allows multiple clients to access the ClusterWorX server at the same time without conflict. The ClusterWorX main monitoring screen is easily customized to allow administrators to view system statistics relevant to their system in near real time. With ClusterWorX, cloning an image or adding a node to the cluster becomes as simple as a few mouse clicks.

Historical graphing allows the administrator to chart monitored values over time. The administrator can view cluster use and performance trends over a selected time interval, analyze the relationships between monitored values, or compare performance between nodes. Analyzing this data can help the administrator spot system bottlenecks, improve cluster efficiency, and predict future computing needs.
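As a minimal sketch of what such a custom monitor could look like, the stand-alone program below reads the one-minute load average from /proc/loadavg and prints it. The "name value" output line is an assumption made for illustration only; the exact format expected by ClusterWorX plug-ins is described in its documentation [1, 2].

/* Hypothetical custom monitor: print the one-minute load average.
 * The output convention ("name value") is assumed for illustration. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/loadavg", "r");
    if (!f) {
        perror("/proc/loadavg");
        return 1;
    }
    double load1;
    if (fscanf(f, "%lf", &load1) == 1)
        printf("loadavg1 %.2f\n", load1);   /* assumed "name value" output */
    fclose(f);
    return 0;
}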
5.2 Event Handling

Online monitoring is only one capability of ClusterWorX. More important, especially in case of problems or failures, is the event and notification engine. When cluster problems arise, administrators can customize ClusterWorX to automatically take action, e.g. power down, reboot, or halt any malfunctioning node. This is accomplished through an event engine that allows administrators to set thresholds on any monitored value. This allows corrective action to be taken before problems become critical (e.g. powering down a node on CPU fan failure to prevent the CPU from burning).

If an administrator-defined threshold is exceeded, ClusterWorX automatically triggers an action. Default actions include node power down and node reboot. For example, the event engine can report and take an administrator-defined action, such as powering down a node, when processors rise above a certain temperature or when the load is too high. Events are configured by administrators, who may choose to receive a notification when an event occurs. Events are also extensible in that they can monitor administrator-defined values and execute administrator-defined plug-ins. Customizable actions can be created using shell
scripts, Perl scripts, symbolic links, programs, and more.

Using a smart notification algorithm, ClusterWorX notifies administrators of problems without swamping them with unnecessary e-mails. The e-mail informs the administrator which cluster is malfunctioning, the name of the triggered event, the node(s) experiencing the problem, and the action (if any) that was taken. Only one e-mail is sent per triggered event, even if multiple nodes are involved. If a node is fixed by an administrator but fails again later, the event re-fires automatically, without administrative intervention. For those who desire it, e-mail can be directed to most wireless devices such as pagers and cell phones.
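The following sketch illustrates the general threshold-and-action idea; it is a conceptual example, not the ClusterWorX event engine itself, and the event name, threshold, and action command are hypothetical. In a real deployment the value would come from the monitoring subsystem and the action would typically drive the ICE Box.

/* Conceptual sketch of threshold-based event handling (not the actual
 * ClusterWorX event engine).  The event definition below is hypothetical. */
#include <stdio.h>
#include <stdlib.h>

struct event {
    const char *name;
    double      threshold;
    const char *action;      /* shell command to execute when triggered */
};

static void check_event(const struct event *ev, double value)
{
    if (value > ev->threshold) {
        fprintf(stderr, "event '%s' triggered (%.1f > %.1f), running action\n",
                ev->name, value, ev->threshold);
        system(ev->action);  /* e.g. a script that powers down the node */
    }
}

int main(void)
{
    /* Hypothetical event: act when the CPU temperature exceeds 70 degrees. */
    struct event cpu_temp = { "cpu-temperature", 70.0,
                              "/usr/local/sbin/node-shutdown" };
    check_event(&cpu_temp, 74.5);   /* the value would come from a monitor */
    return 0;
}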
5.3 Performance Issues

Monitoring is at the heart of cluster management. The data is used to schedule tasks, load-balance devices and services, notify administrators of software and hardware failures, and generally monitor the health and usage of a system. Unfortunately, the information used to perform these operations must be gathered from the cluster without impacting application performance. Cluster monitoring primarily consumes two important resources: CPU cycles and network bandwidth. The CPU usage problem is completely localized on a node and is addressed by creating efficient gathering and consolidation algorithms. The network bandwidth problem affects a shared resource and is addressed by finding ways to minimize the amount of data transmitted over the network. To address these two issues, we divide cluster monitoring into three stages: gathering, consolidation, and transmission.

5.3.1 Gathering

The gathering stage is responsible for loading the data from the operating system, parsing the values, and storing the results in memory. Standard tools for gathering system statistics, such as rstatd and SNMP tools, only provide limited information and tend to be slow and inefficient. Thus we focus on using the /proc virtual file system to gather all system statistics. An important note about the proc file system is that each time a proc file is read, a handler is called by the kernel, or the owning module, to generate the data. The data is generated on the fly, and the entire file is reconstructed whether a single character or a large block is read, which is a crucial point for efficiency.

The test system used was a 1 GHz Pentium III with 1 GB of memory, running version 2.4.18 of the Linux kernel. Our first implementation, loading and analyzing the memory statistics (/proc/meminfo), achieves only 85 samples per second at 100% CPU utilization. Loading /proc/meminfo at once into a separate buffer and parsing the data within that buffer increases the gathering rate to 4173 samples per second, a 4800% increase in performance. By taking advantage of the fact that /proc data uses standard ASCII output and by using a priori knowledge about the output format of /proc/meminfo, we were able to achieve another 236% increase in performance, resulting in a monitoring rate of 14031 samples per second. The last improvement was to avoid closing and reopening /proc/meminfo each time new memory statistics are needed. Instead we keep the file open all the time and just reset the file pointer to the beginning of the file between two consecutive reads. This optimization yields an additional 141% increase in performance. We now reach a gathering rate of 33855 samples per second, which translates to 29.5 µs of CPU time per call. In other words, the optimized gathering process takes approximately 5 seconds of CPU time per hour at a monitoring rate of 50 samples per second. Other statistics are taken from /proc/stat at 35 µs per call, from /proc/loadavg at 7.5 µs per call, from /proc/uptime at 6.2 µs per call, and from /proc/net/dev at 21.6 µs per call per network device. Furthermore, we have investigated the difference between implementing the gathering process in C and in Java and found that C is only slightly ahead of Java. Thus we decided to use the Java implementation, because ClusterWorX is also written in Java.
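A minimal sketch of this optimized gathering loop, shown here in C for brevity (the production implementation is in Java, as noted above) and assuming the familiar "MemFree: <value> kB" line layout of /proc/meminfo, looks as follows: the file is opened once, rewound before each sample, read into a single buffer in one call, and parsed directly from that buffer.

/* Sketch of the optimized gathering loop: open /proc/meminfo once, rewind
 * the file offset before every sample, read the whole file into one buffer,
 * and parse the fields using knowledge of the ASCII "Name: value" layout. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char buf[4096];
    int fd = open("/proc/meminfo", O_RDONLY);
    if (fd < 0) {
        perror("/proc/meminfo");
        return 1;
    }

    for (int sample = 0; sample < 5; sample++) {
        lseek(fd, 0, SEEK_SET);                      /* rewind instead of reopening */
        ssize_t n = read(fd, buf, sizeof(buf) - 1);  /* one read into one buffer    */
        if (n <= 0)
            break;
        buf[n] = '\0';

        char *p = strstr(buf, "MemFree:");           /* a priori format knowledge */
        if (p)
            printf("MemFree = %ld kB\n", strtol(p + 8, NULL, 10));
        sleep(1);
    }
    close(fd);
    return 0;
}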
5.3.2 Consolidation

The consolidation stage is responsible for bringing the data from multiple sources together, for determining whether values have changed, and for filtering. In the interest of efficiency this task is performed exclusively on the node, because the node is the gatherer and provider of the monitored data. The consolidation stage is used to combine data from multiple data sources sampled at independent gathering rates. The consolidation process distinguishes between static and dynamic monitoring data and transmits only data that has changed since the last transmission. This reduces the amount of transferred data substantially. Furthermore, monitored data is cached so that simultaneous requests can be served using the same set of data. This approach reduces the burden on the operating system and increases the responsiveness of the monitoring system.

5.3.3 Transmission

The transmission stage is responsible for compression and transmission of the data to a management node. Since we use the /proc file system, monitored data is stored in human-readable form. Although binary formats require less storage, we leave the data in text form because of platform independence and the human-readable nature of the data. Nevertheless, when transmitting the data, we use data compression techniques, which are known to be very effective on text input.
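As an illustration, the sketch below compresses a small buffer of text-form monitoring data with zlib; zlib is an assumption here, since the specific compression library used for transmission is not named above. Build with -lz.

/* Minimal compression sketch for text-form monitoring data, assuming zlib. */
#include <stdio.h>
#include <string.h>
#include <zlib.h>

int main(void)
{
    /* A small sample of text-form monitoring data (made-up values). */
    const char *sample =
        "MemTotal: 1048576 kB\nMemFree: 524288 kB\nload1: 0.42\nload5: 0.40\n";
    uLong src_len = (uLong)strlen(sample);

    Bytef  dest[1024];
    uLongf dest_len = sizeof(dest);

    if (compress2(dest, &dest_len, (const Bytef *)sample, src_len,
                  Z_BEST_SPEED) != Z_OK) {
        fprintf(stderr, "compression failed\n");
        return 1;
    }
    printf("compressed %lu bytes of text down to %lu bytes\n",
           (unsigned long)src_len, (unsigned long)dest_len);
    return 0;
}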
6 Future Work

The Lawrence Livermore National Laboratory (LLNL) and Linux NetworX are designing and developing SLURM (Simple Linux Utility for Resource Management). SLURM provides three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (typically a parallel job) on a set of allocated nodes. Finally, it arbitrates conflicting requests for resources by managing a queue of pending work. SLURM is not a sophisticated batch system, but it does provide an Application Programming Interface (API) for integration with external schedulers such as the Maui Scheduler [8]. While other resource managers do exist, SLURM is unique in several respects:

- Its source code is freely available under the GNU General Public License.
- It is designed to operate in a heterogeneous cluster with up to thousands of nodes.
- It is portable, written in C with a GNU autoconf configuration engine. While initially written for Linux, other UNIX-like operating systems should be easy porting targets.
- The interconnect initially supported is Quadrics Elan3, but support for other interconnects (e.g. Myrinet) is already planned.
- SLURM is highly tolerant of system failures, including failure of the node executing its control functions.
- It is simple enough for the motivated end user to understand its source and add functionality.

Further information about the design and the current state of SLURM is available on the SLURM homepage [12].

7 Conclusion

Linux clustering is a reasonable alternative to traditional supercomputing because it is a reliable, flexible, scalable, and cost-effective solution. However, many organizations are prevented from benefiting from Linux clusters because of limited technical resources. To help alleviate this problem, we developed ClusterWorX and the ICE Box, lowering the barriers to adopting this technology. On the software side, the cluster management solution ClusterWorX scales to meet the needs of any size of system and includes remote management capabilities, a customizable, easy-to-use graphical user interface, integrated disk cloning, sophisticated monitoring and event handling, and automatic administrator notification. On the hardware side, the ICE Box fully integrates with ClusterWorX to provide advanced power monitoring and power control as well as thermal probing and serial console access to all nodes of a cluster. Furthermore, we support and participate in open source projects such as LinuxBIOS and SLURM to provide future cluster management enhancements.

References

[1] Linux NetworX. ClusterWorX 2.1, April 2002. http://www.lnxi.com/news/pdf/whitepaper cwx.pdf.
[2] Linux NetworX. ClusterWorX User Guide, 2002.
[3] Linux NetworX. ICE Box, 2002. http://www.lnxi.com/news/pdf/whitepaper ice.pdf.
[4] Linux NetworX. ICE Box User Guide, 2002.
[5] Linux NetworX. Image Manager User Guide, 2002.
[6] The LinuxBIOS Homepage. http://www.linuxbios.org.
[7] Hardware Monitoring by lm_sensors. http://www2.lm-sensors.nu/~lm78.
[8] Maui Scheduler Open Cluster Software. http://mauischeduler.sourceforge.net.
[9] Myrinet. http://www.myri.com.
[10] Quadrics. http://www.quadrics.com.
[11] Dolphin. http://www.dolphinics.com.
[12] Simple Linux Utility for Resource Management. http://www.llnl.gov/linux/slurm.