Open Single System Image (openssi) Linux Cluster Project

Bruce J. Walker, Hewlett-Packard

Abstract

The openssi Cluster project is an ongoing open source project, started two years ago, to bring together some of the best Linux and Unix clustering technologies in a single integrated yet modular project. Linux is rich in cluster technology but is segmented into six different cluster areas - high performance, load-leveling, web-service, storage, database and high availability. The openssi project addresses all of these cluster environments by simultaneously addressing the three key cluster goals - availability, scalability and manageability. To accomplish this ambitious goal, the project was started with a Linux adaptation of the NonStop Clusters for Unixware code, contributed by Compaq/HP. That code included membership, internode communication, clusterwide process management, clusterwide devices, a cluster filesystem, clusterwide IPC (pipes, fifos, message queues, semaphores, etc.) and clusterwide tcp/ip networking. Other open source clustering code has been integrated into the modular architecture, including opengfs, opendlm, LVS, Lustre and a small component of Mosix. The architecture of the project allows for subsetting and substitution of components. A full-function initial release is available in both source and RPM form. Many enhancement opportunities still exist, both in integrating with other technologies and in improving scalability and availability.

Introduction

In this paper we explain Single System Image (SSI) clustering, relate SSI clustering to other clustering paradigms, and explain how the open source Linux project (openssi) can be the foundation for a wide range of cluster solutions. Technical detail on the openssi architecture and components is provided, to the extent that space permits.

The term clustering, with no adjective specifying the type of clustering, means different things to different groups.
Below is a short discussion of high performance (HP) clusters, load-leveling clusters, web-service clusters, storage clusters, database clusters and high-availability (HA) clusters. All of these have a few things in common that distinguish clusters from other computing platforms. Clusters are typically constructed from standard computers, or nodes, without any shared physical memory. That, combined with the characteristic of running an OS kernel on each node, distinguishes a cluster from a NUMA or SMP platform (of course a cluster could be constructed from NUMA or SMP nodes). The "purpose" of a cluster is for its nodes to work together to better provide high availability, high performance, parallel processing or parallel web servicing. How is a cluster different from a client-server distributed computing environment? Each of the cluster platform solutions is more peer-to-peer, typically with a sense of membership that each node has access to.
Solution Clusters

High performance (HP) clusters, typified by Beowulf clusters [Beowulf], are constructed to run parallel programs (weather simulations, data mining, etc.). While a master node typically drives them, the nodes work together and communicate to accomplish their task.

Load-leveling clusters, typified by Mosix [Mosix], are constructed to allow a user on one node to spread his workload transparently to other nodes in the cluster. This can be very useful for compute-intensive, long-running jobs that aren't massively parallel.

Web-service clusters, typified by the Linux Virtual Server (LVS) project [LVS] and Piranha, do a different form of load leveling. Incoming web service requests are load-leveled across a set of standard servers. Arguably this is more of a farm than a cluster, since the server nodes don't typically work together. Web-service clusters are included in this list because the service nodes will eventually work more closely together.

Storage clusters, typified by Sistina's GFS [GFS], the OpenGFS project [OpenGFS], and the Lustre project [Lustre], consist of nodes which supply parallel, coherent and highly available access to filesystem data.

Database clusters, typified by Oracle Parallel Server (OPS; now Oracle 9i RAC) [Oracle], consist of nodes which supply parallel, coherent and HA access to a database. The database cluster is the most visible form of application-specific cluster.

High availability (HA) clusters, typified by ServiceGuard [ServiceGuard], Lifekeeper [Lifekeeper], FailSafe [FailSafe] and Heartbeat [HALinux], are also often known as failover clusters. Resources, most importantly applications and nodes, are monitored, and scripts are run when a failure is detected. Scripts are used to fail over IP addresses, disks and filesystems, as well as to restart applications.

It should be obvious that many of the above solution clusters could benefit by combining and leveraging other cluster technology.
In fact, there is ongoing work in most of the cluster areas to add other cluster capabilities. After examining SSI clustering in general, we will look at how the Open SSI Cluster Project [OpenSSIC] is striving to bring the best technology in each of these solution areas together on an SSI platform.

What is an SSI Cluster?

Presenting the collection of machines that make up the cluster as a single machine is what SSI clustering is all about. SSI clustering is not new. The Locus Distributed System [Popek85] provided an SSI environment back in the 1980's. Sun Microsystems had an SSI project [Khalidi96], and Gregory Pfister dedicated a chapter of his book "In Search of Clusters" to single system image, claiming that general cluster adoption has been slow because of a "lack of single system image software" [Pfister98]. The SSI presentation comes in several forms and to varying degrees. Below we look at six current "SSI" cluster approaches.

The database cluster is SSI with respect to the database. Instances of the database run on each of the nodes, accessing the same data and coherently presenting the same view of the data from each instance.

The storage cluster is SSI with respect to a set of devices and, more importantly, a set of filesystems. Even more SSI is achieved if the root filesystem is shared between the nodes.

Web-service clustering (Linux Virtual Server (LVS) and other examples) presents the cluster to the outside world as a single name and single IP address, with different techniques to load level incoming connections and optionally fail over the routing/redirection service. This is one form of SSI clustering.
Scyld provides "SSI" for Beowulf clusters [Scyld] using the bproc process movement technology [bproc]. All processes are initiated on/from the master node and programmatically distributed to the compute nodes via rexec, rfork or move. From the master node all these processes are still visible and signalable, just as if they were all running on a single machine. The cluster, defined as the parallel program running on it, is presented to the user as a single system. The filesystem is not SSI, the kernel IPC objects are not SSI, the device space is not SSI, independent processes on compute nodes would not see an SSI process space, etc. Nonetheless, given the specialized manner in which the cluster is used (typically one large parallel application is all that is running, with one process per compute node), this "master node" SSI is enough SSI to provide a significant improvement over pure Beowulf clusters.

Mosix provides "home node" SSI. Home node SSI is similar to Scyld's "master node" SSI, except that there can be many home nodes. A process is created on a node (called its home node). It might be transparently migrated to another node in the cluster, but: a) it still sees the environment of its home node (processes, devices, filesystems, IPC, etc.); b) other processes started on that home node see the process as if it were still running on the home node (for the most part, the kernel portion of the process is still running on the home node; system calls made by the process are shipped back to the home node for execution). While the entire cluster is not presented to each process like one big machine, a given process does see the same single machine (the home node) view no matter where it is executing. Additionally, the user logged into a given home node sees all the processes created via that login session, no matter where they are running (and that user has full control of those processes).
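The home-node model just described can be made concrete with a small illustrative sketch (this is not Mosix code; the class and method names are invented for illustration). The key point is that only the execution site of a process changes on migration, while environment-dependent operations are shipped back to the fixed home node:

```python
# Toy illustration of the "home node" model: a migrated process keeps its
# environment on its home node, and system calls are shipped back there.

class Node:
    def __init__(self, name):
        self.name = name
        self.files = {}          # this node's private filesystem view

class HomeNodeProcess:
    def __init__(self, home):
        self.home = home         # node whose environment the process sees
        self.exec_node = home    # node actually running the user code

    def migrate(self, target):
        # Only the execution site changes; the home node is fixed for life.
        self.exec_node = target

    def read_file(self, path):
        # A "system call": executed against the HOME node's environment,
        # no matter where the process currently runs.
        return self.home.files[path]

home, remote = Node("home"), Node("remote")
home.files["/etc/motd"] = "hello from home"
remote.files["/etc/motd"] = "hello from remote"

p = HomeNodeProcess(home)
p.migrate(remote)
# Even while executing on "remote", the process sees its home node's file.
assert p.read_file("/etc/motd") == "hello from home"
```

The sketch shows why the model gives single-machine semantics per process (the home-node view never changes) without giving a clusterwide view across processes on different home nodes.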
The ideal SSI cluster is more complete in scope than any of the above SSI clusters for Linux - in terms of process management, in terms of SSI for other kernel resources and in terms of providing a high availability platform. The ideal SSI cluster unifies and extends the clustering described above to result in:

- a unified clusterwide process model with load balancing;
- unified clusterwide device, storage and filesystem models with parallel and served access (thereby being the ideal platform for parallel databases);
- a unified clusterwide networking model, providing both a highly available single access point to a parallel server and an SSI view from within the cluster;
- unified clusterwide inter-process communication (IPC) objects like pipes, fifos, semaphores, message queues, shared memory, sockets and signals, so programs, processes and users can work together no matter which node in the cluster they are executing on;
- unified clusterwide system management, where single system tools can see and manage all resources of all nodes from any node;
- simple yet powerful high availability (HA) capability for any resource, from networking to the inter-node interconnect to storage, filesystems, applications, processes and IPC objects.

The ideal cluster, as described above, is the topic of the Open SSI Cluster project described in the next section.

Open SSI Cluster Project

This section has four parts - goals of the project, component technologies, cluster architecture and status.

Project Goals

The ambitious goals of the project include: building a full SSI cluster using open source cluster components; extending the SSI cluster in areas of scalability and availability, again using open source cluster components; and maintaining a modular architecture so components can be substituted as newer versions are produced.
Project Components

The figure below shows the many open source components currently envisioned as contributing to the Open SSI Cluster project. Other technologies will be added as needed.

[Figure: open source components contributing to the Open SSI Cluster project - NSC, CI, Lustre, EVMS/CLVM, GFS, DLM, LVS, Mosix, Scyld/Beowulf, HA, UML and DRBD.]

Before presenting an overview of the project architecture and a review of the project status, a brief description of the contributed technologies is given.

NSC (NonStop Clusters for Unixware): This was a product from Compaq and SCO/Caldera that provided a high degree of single system image on a Unixware base [NSC]. It is based on earlier work [Popek85, Walker98, Walker99]. Much of the code has been adapted to Linux as part of the Open SSI Cluster project, including cluster membership, inter-process communication, process management, process migration, the cluster filesystem, clusterwide device access, and many different clusterwide IPC subsystems (including pipes, fifos, message queues, semaphores, shared memory and sockets). Application and node monitoring capability, along with application restart capability, has also been adapted to the Open SSI Cluster project.

CI (Cluster Infrastructure): The cluster membership and inter-node communication subsystem pieces of NSC were made available as the CI project [CI], on which the SSI project is layered. Other cluster technologies may also be layered on CI.

GFS (Global File System): GFS, developed by Sistina Software [GFS], is available from Sistina or through the OpenGFS project [OpenGFS]. It is being used in the Open SSI Cluster project as one option to provide a shared root or other filesystem for the cluster. It is not yet integrated with the DLM.

DLM (Distributed Lock Manager): IBM open sourced a DLM code base, available under the GPL license [OpenDLM]. It has been integrated with the SSI membership subsystem of CI to allow dynamic addition and deletion of participating nodes.
LVS (Linux Virtual Server): This open source technology [LVS] presents a single IP address and then load levels incoming connections to a set of backend Web servers. Several methods of forwarding the traffic are available and there are several different load-leveling algorithms to pick from. This technology works with the other parts of the SSI project.
Lustre: The Lustre open source filesystem project [Lustre] is designed to scale to thousands of client nodes. Unlike GFS, Lustre defines a new protocol between the clients and the storage.

Mosix: The Mosix project [Mosix] does transparent load leveling of unmodified processes. The Open SSI Cluster project has a different process management and process migration technology but is using the load leveling decision algorithms from the Mosix project.

Scyld/Beowulf: The Beowulf parallel programming environment [Beowulf] and communications libraries should be straightforward to port to the SSI cluster. The bproc library is also easily adapted to the APIs available in the SSI cluster. The concept of network boot and network install that Scyld utilizes has been developed for the SSI project as well. Thus only one node must do a full install, and nodes can optionally netboot into the cluster.

HA (High Availability): There are many HA products and projects on Linux, from Heartbeat [HALinux] to ServiceGuard [ServiceGuard], FailSafe [FailSafe] and Lifekeeper [Lifekeeper]. To date, all of these work on very loosely coupled clusters, without even a shared filesystem. Integrating one or more of these solutions on the Cluster Infrastructure (CI) platform or the Open SSI cluster platform will provide a simpler yet more powerful capability.

UML (User Mode Linux): UML has been integrated with SSI so that you can set up a multi-node SSI cluster on a single machine. This may not have significant commercial value but makes a powerful development environment for cluster subsystems.

Project Architecture

The SSI cluster has a few characteristics that influence the architecture and distinguish it from other cluster solutions. Running on a single shared root filesystem, the SSI cluster is made up of an OS kernel on each node communicating and co-ordinating with the others. The OS kernel per node is for availability and performance.
The shared root is for ease of management, and it should contain few or no node-specific files. The shared root can either be managed by a parallel physical filesystem like GFS or be served by one node at a time and shared via a layered cluster filesystem like CFS. In either case, nodes need to join the cluster early in the boot cycle, and there should be a strong sense of membership - all nodes should know the membership and agree on it. Using one or more kernel-to-kernel communication paths, the kernels should present a highly available single machine at the system call level. The co-operating kernels present a single namespace for nameable kernel objects and provide coherent distributed access to all objects. The diagram below portrays how unmodified applications interact with a standard kernel that has SSI hooks and SSI extensions. The extensions allow the kernels to co-ordinate access to objects like processes, files, message queues and devices. New interfaces are made available to allow cluster-aware applications to more fully take advantage of the cluster for added scalability, availability or manageability.
[Figure: two uniprocessor or SMP nodes, each running users, applications and systems management over standard OS kernel calls plus extensions, on a standard Linux 2.4 kernel with SSI hooks and modular kernel extensions, each with local devices, connected to each other and to other nodes by an IP-based interconnect.]

The SSI cluster enhancements generally fall into one of three categories - extensions outside the kernel for system management or high availability, kernel infrastructure pieces used to layer on added functions, and extensions to existing kernel subsystems to provide the SSI view of kernel objects. The figure below breaks the architecture components into those three categories. System management and application/resource monitoring and failover/restart are the two subsystem areas outside the kernel. The infrastructure pieces in the kernel are cluster membership, inter-node communication and the distributed lock manager (DLM). Extended base subsystems include filesystems, process management, cluster devfs, a cluster volume manager, cluster tcp/ip networking, and clusterwide IPC objects.
[Figure: architecture components - System Management and Application/Resource Monitoring and Failover/Restart above the kernel interface; Cluster Networking, Cluster Membership, cluster devfs, the Cluster Filesystem, DLM, CLVM, the Cluster Process Subsystem and Cluster IPC within the kernel, all layered on the Inter-node Communication Subsystem (ICS).]

Each of the subsystem areas is described below.

Cluster Membership: The Cluster Membership code is adapted from the NSC code base and comes to the Open SSI Cluster project from the CI project. It has the algorithms to form the cluster, to monitor membership, to invoke registered nodeup and nodedown routines, to notify processes of membership events and to support a set of membership APIs. The cluster membership service (CLMS) is code that exists on all nodes. Each node runs in one of two modes - master or slave. At any time there is one master for the cluster; takeover is done if the master fails. Via the master-run algorithms, nodes transition through a set of states which can include NeverUP, ComingUP, UP, Shutdown, Goingdown and Down. Each state transition is atomic and clusterwide, so all nodes always have identical information. The CLMS master makes membership decisions, enforces any majority voting algorithm and does STONITH (Shoot The Other Node In The Head, used to ensure a node excluded from the cluster won't do any additional disk I/O) enforcement. A largely independent node-monitoring subsystem informs the CLMS if a node is unresponsive. Nodes can also be managed through APIs. Most, if not all, of the other kernel subsystems involved in the presentation of SSI register nodedown routines. Each node runs these routines when a node fails, in order to clean up data structures associated with the lost node. After all kernels have done their cleanup, the "init" process may run a script to do additional cleanup or failover. Only after that is the node allowed to rejoin the cluster. CLMS works with and leverages the underlying Inter-node Communication Subsystem (ICS).
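The membership state machine and nodedown-cleanup registration described above can be sketched as follows (an illustrative sketch, not CLMS code; the transition set and a single shared dict standing in for the atomic clusterwide view are simplifying assumptions):

```python
# Hypothetical sketch of CLMS node-state transitions. The real CLMS makes
# each transition atomically and clusterwide; one dict stands in for that
# shared, agreed-upon membership view here.

ALLOWED = {
    "NeverUP":   {"ComingUP"},
    "ComingUP":  {"UP", "Goingdown"},
    "UP":        {"Shutdown", "Goingdown"},
    "Shutdown":  {"Down"},
    "Goingdown": {"Down"},
    "Down":      {"ComingUP"},   # a cleaned-up node may rejoin
}

class ClusterMembership:
    def __init__(self, nodes):
        # One shared view: every node sees identical state information.
        self.state = {n: "NeverUP" for n in nodes}
        self.nodedown_callbacks = []   # subsystems register cleanup here

    def register_nodedown(self, cb):
        self.nodedown_callbacks.append(cb)

    def transition(self, node, new_state):
        old = self.state[node]
        if new_state not in ALLOWED[old]:
            raise ValueError(f"illegal transition {old} -> {new_state}")
        self.state[node] = new_state
        if new_state == "Down":
            # Every registered subsystem cleans up structures for the lost
            # node before it is allowed to rejoin the cluster.
            for cb in self.nodedown_callbacks:
                cb(node)

cleaned = []
clms = ClusterMembership(["n1", "n2"])
clms.register_nodedown(lambda n: cleaned.append(n))
for s in ("ComingUP", "UP", "Goingdown", "Down"):
    clms.transition("n2", s)
assert cleaned == ["n2"]
assert clms.state == {"n1": "NeverUP", "n2": "Down"}
```

The callback list plays the role of the registered nodedown routines: filesystem, process and IPC subsystems would each register one, and "init"-level failover scripts run only after all of them complete.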
Inter-node Communication Subsystem (ICS): ICS is the kernel-to-kernel communication path used by almost all the kernel SSI components. Like CLMS, it came from NSC and is part of the CI project. ICS is architected in two pieces: a transport-independent upper layer and a transport-dependent lower layer. The upper layer interfaces to the other kernel subsystems. It presents three programming paradigms - RPC, request/response or async RPC, and message passing. Message size is variable, and there are data types to designate out-of-band data to be sent by DMA if the underlying transport supports it (either push or pull). The upper layer also has a dynamic pool of threads to service incoming requests, which are prioritized and possibly queued to avoid resource deadlock and server process thrashing. The lower layer interfaces to a reliable message service over some hardware. The current implementation runs over tcp connections, but other transports have been used and will be supported. The lower layer pieces also define how out-of-line data is transferred from node to node. ICS works with CLMS to set up communication when a node joins the cluster and to flush incoming and outgoing traffic on nodedown.

Distributed Lock Manager (DLM): The DLM is a kernel component that comes from an IBM open source project. It runs on each node and is integrated with the Cluster Membership subsystem. It may in the future be integrated with the ICS subsystem. The DLM is used primarily to maintain various forms of cache coherency across the nodes in the cluster. One form of cache is the kernel filesystem cache; both the OpenGFS and the Sistina GFS projects intend to use the DLM when it is integrated. The other key use of the DLM is for coherency between the caches of a cluster application. The Oracle Parallel Server (OPS) database was the first major application to use a DLM.
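To make the cache-coherency role of a DLM concrete, the following sketch shows the six lock modes of the classic VMS-style DLM lineage that open source DLMs follow (the mode names and compatibility matrix are from that tradition, not from this paper, and the resource class is invented for illustration):

```python
# Sketch of DLM-style cache coherency. Mode names: NL (null), CR
# (concurrent read), CW (concurrent write), PR (protected read),
# PW (protected write), EX (exclusive).

MODES = ["NL", "CR", "CW", "PR", "PW", "EX"]
# COMPAT[a][b] is true when a lock in mode a can coexist with one in mode b.
COMPAT = {
    "NL": {"NL": 1, "CR": 1, "CW": 1, "PR": 1, "PW": 1, "EX": 1},
    "CR": {"NL": 1, "CR": 1, "CW": 1, "PR": 1, "PW": 1, "EX": 0},
    "CW": {"NL": 1, "CR": 1, "CW": 1, "PR": 0, "PW": 0, "EX": 0},
    "PR": {"NL": 1, "CR": 1, "CW": 0, "PR": 1, "PW": 0, "EX": 0},
    "PW": {"NL": 1, "CR": 1, "CW": 0, "PR": 0, "PW": 0, "EX": 0},
    "EX": {"NL": 1, "CR": 0, "CW": 0, "PR": 0, "PW": 0, "EX": 0},
}

class LockResource:
    """One named resource, e.g. a filesystem block cached on several nodes."""
    def __init__(self):
        self.granted = {}   # node -> mode

    def request(self, node, mode):
        # Grant only if the new mode is compatible with every other holder.
        ok = all(COMPAT[mode][m] for n, m in self.granted.items() if n != node)
        if ok:
            self.granted[node] = mode
        return ok

blk = LockResource()
assert blk.request("n1", "PR")       # n1 caches the block for reading
assert blk.request("n2", "PR")       # shared readers are compatible
assert not blk.request("n3", "EX")   # a writer must wait for the readers
```

This is exactly the pattern a parallel physical filesystem (or a parallel database like OPS) relies on: each node holds a lock covering its cached data, and an incompatible request forces holders to flush and downgrade before the writer proceeds.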
Clusterwide Filesystem: The goal of the filesystem part of the project is to automatically, transparently and coherently present the same filesystem hierarchy on all nodes of the cluster at all times, in a highly available manner. This includes the root filesystem as well as access to filesystems of all types. There are three components to the architecture. First is support for various forms of cluster filesystem. Second is a mechanism to ensure that each node has the same mount hierarchy. Third is ensuring that open files are retained on process movement.

There are three approaches to cluster filesystems, and the project supports all three. The first approach can be characterized as a parallel physical filesystem. An instance of the filesystem runs on each node. Each instance goes directly to the storage and has its own log area. The instances co-ordinate their caches through a lock manager subsystem (ideally the DLM). Sistina GFS and OpenGFS are examples of this type of cluster filesystem. In the second approach, a client instance runs on every node, and every client learns the data layout from a metadata server and goes directly to each of many dedicated storage nodes over an interconnect. Lustre [Lustre] uses this parallel network approach. The final approach also has a client instance on each node. The clients communicate over the interconnect to a server node for each filesystem (much less parallelism than the parallel network approach). Each server node is part of the cluster and supplies one or more filesystems to the cluster. This approach is attractive for small clusters or clusters without shared media, or as a way to clusterize different physical filesystems like JFS, XFS, cdfs, ext3, etc. A key component of all the solutions is the ability to transparently continue operation in the face of failures.
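The third, served approach can be sketched in a few lines (an illustrative toy, not CFS code; the class names and the direct method call standing in for the interconnect are assumptions):

```python
# Toy sketch of a layered cluster filesystem: every node runs a client,
# and one node serves each filesystem. Coherency is simple because all
# operations funnel through that single server.

class FsServer:
    """The node currently serving a given filesystem."""
    def __init__(self):
        self.files = {}

    def write(self, path, data):
        self.files[path] = data

    def read(self, path):
        return self.files.get(path)

class FsClient:
    """Client instance run on every node; forwards operations over the
    interconnect (a direct call here) to the filesystem's server node."""
    def __init__(self, server):
        self.server = server

    def write(self, path, data):
        self.server.write(path, data)

    def read(self, path):
        return self.server.read(path)

root_server = FsServer()
node1, node2 = FsClient(root_server), FsClient(root_server)
node1.write("/etc/hosts", "127.0.0.1 localhost")
# Every node coherently sees the same filesystem contents.
assert node2.read("/etc/hosts") == "127.0.0.1 localhost"
```

Failover in this model means re-pointing each client's server reference at a surviving node that takes over the underlying storage, which is the CFS failover work described in the status section.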
Clusterwide Process Subsystem: Two goals of the process management system are to make all processes on all nodes transparently visible and accessible from all processes and to allow process families and process groups to be transparently distributable. Other goals include single site semantics in the face of arbitrary node failures, a clusterwide /proc, the ability to migrate a process from one node to another while it is running, and a flexible load leveling subsystem. To accomplish these ambitious goals:

- processes are assigned clusterwide unique process ids (pids), which they retain if they migrate to other nodes;
- a "vproc" structure (somewhat like a vnode structure) is added, and process relationship fields are moved from the task struct to the vproc struct. A given process has a single task struct, on the node where the process is executing. There may be several vproc structs for the process around the cluster, as needed to track and represent various process relationships;
- system calls are executed on the node where the process is currently running; if remote resources are needed to complete the system call, object-specific requests are sent over the cluster interconnect via ICS;
- enough information is kept redundantly in the cluster so that relationships survive arbitrary node failures and are successfully reconstructed;
- rexec(), rfork(), migrate() and kill3() calls are added so cluster-aware applications can manage process placement;
- a SIGMIGRATE signal is added to facilitate the movement of unmodified processes and process groups from node to node;
- an optional process load balancing subsystem is configurable to transparently move processes around the cluster to better utilize the processing power of each node.

Clusterwide Devices Subsystem: There are several goals with respect to device management in the SSI cluster. First, all devices on all nodes should be nameable, preferably in a node-independent manner, and transparent access to all devices from all nodes should be available. Devices with shared physical access should have a single name. Device major/minor numbers seen via the "stat" call must be clusterwide unique, and both names and major/minor numbers must be persistent across boots and independent of the order in which nodes boot. Accomplishing these goals is done via extensions to the kernel devfs and without changes to device drivers. First, each device is transparently assigned a clusterwide-unique major/minor in addition to the node-specific major/minor that the driver may want.
The clusterwide value is exposed in the "stat" interface, where clusterwide uniqueness is more important than matching the value the driver is used to. Next, the devfs code on each node has the complete device information for the cluster, so lookups and opens happen directly from the process execution node to the node where the device is attached. Next, there is a set of file operations that allow an open on one node to access a device on another node. The /dev directory is changed to be a context-dependent symlink, where the context is either default or a compatibility context. The default is a single dev directory with the union of all the devices in the cluster and subdirectories for the devices on each node. The compatibility context would just give the application the devices of the node it is executing on.

Cluster Logical Volume Management (CLVM): The logical volume manager [LVM] abstracts the physical boundaries of disks to present logical entities. Co-ordination is needed if multiple nodes are going to access, and possibly change, the structure of logical volumes at the same time. The architectural plan is to leverage the existing base LVM technology and support any CLVM project (e.g. EVMS).

Clusterwide Networking Subsystem: From a performance perspective, it is important that each node do as much network processing as is feasible without interacting with other cluster nodes. Some interaction is needed to attain an SSI view. An important goal of the Open SSI Cluster project is that the cluster look, from a networking standpoint, to the outside world like a single, scalable and highly available machine. The Linux Virtual Server (LVS) project, with failover, provides much of the needed capability. A single address is advertised, but connections are load leveled to designated service nodes through a variety of algorithms. The SSI project may augment LVS by making configuration more automatic and by adding additional availability features.
What LVS does not address is the view of the cluster from within the cluster and the view of the external world from within the cluster. To be strictly SSI, all network devices, and thus all IP addresses in the cluster, should look local to all nodes (LVS only makes the cluster IP look local to all nodes), so a wildcard listen on any node would effectively be listening on all the devices on all nodes. Also, strict SSI would allow two processes to communicate over the loopback device even if they were executing on different cluster nodes. Finally, the routing tables on each node should be co-ordinated, so that if a foreign host can be reached from one node, it can automatically be reached from any node in the cluster. For the Open SSI Cluster project, the first step is to have at least basic LVS capability with failover. More strict SSI capability will be provided as needed.
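The LVS director pattern described above (one advertised cluster IP, connections load-leveled across real servers) can be sketched with round-robin scheduling, one of several algorithms LVS offers (the class is an illustrative stand-in, not LVS code):

```python
# Minimal sketch of the LVS idea: clients connect to one cluster IP and a
# director forwards each connection to the next real server in rotation.

from itertools import cycle

class Director:
    def __init__(self, cluster_ip, real_servers):
        self.cluster_ip = cluster_ip          # the only address clients see
        self._next = cycle(real_servers)      # round-robin scheduler

    def accept(self, client):
        # Forward the connection to whichever real server is next; the
        # client never learns which node actually served it.
        return next(self._next)

lvs = Director("10.0.0.1", ["node1", "node2", "node3"])
picks = [lvs.accept(f"client{i}") for i in range(6)]
assert picks == ["node1", "node2", "node3", "node1", "node2", "node3"]
```

Weighted variants simply repeat heavier servers in the rotation; director failover (keepalived in the status section) means a backup node taking over the cluster IP and scheduling state.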
Clusterwide Inter-Process Communication (IPC): There are many forms of IPC in Linux, ranging from signals, pipes, fifos, semaphores, message queues, shared memory and internet sockets to unix-domain sockets. Signals are discussed under process management. Some of the other IPC objects are nameable and some are not. The architecture of the Open SSI Cluster project is straightforward. There is a clusterwide namespace for all nameable objects. Objects are always created locally, for performance. Process movement does not lose access to named or unnamed objects. Objects should be able to move from node to node. The namespace must adhere to standard practice for naming each object in Linux, it must be efficient on lookup and create, and it must be resilient to arbitrary node failures. To accomplish all these goals, the IPC nameservers are: centralized, so a lookup or create is no more than a single RPC (no polling); and rebuildable from the remaining cluster nodes. Nodes in the cluster may cache information about where a particular IPC object is currently managed, with the understanding that the nameserver can be consulted if the cache turns out to be stale.

Clusterwide System Management: The goal with respect to system management is to use slightly enhanced single machine management programs rather than a whole new set of cluster management tools. Such an approach should greatly reduce the management burden clusters often impose; it was the primary approach for NonStop Clusters for Unixware (NSC). Enhancements to the single machine tools include dealing with nodes that are currently unavailable and displaying which node a resource is affiliated with. System management also includes installation and booting. The installation architecture is to do a fairly standard install on a single node, including putting the root filesystem on shared media, and then to trivially add nodes, both at initial install time and anytime thereafter.
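The IPC name-service design described above, a central nameserver plus per-node caches that fall back to the nameserver when stale, can be sketched as follows (an illustrative sketch; the class names, the "msgq:42" key format and the probe callback are invented for illustration):

```python
# Sketch of the clusterwide IPC naming scheme: a central nameserver maps
# each named object to the node currently managing it, and other nodes
# keep a cache that may go stale when an object migrates.

class IpcNameserver:
    def __init__(self):
        self.where = {}   # object name -> managing node

    def create(self, name, node):
        self.where[name] = node

    def lookup(self, name):
        return self.where[name]     # one RPC, no polling

class NodeCache:
    def __init__(self, server):
        self.server = server
        self.cache = {}

    def find(self, name, probe):
        # probe(node) asks a node whether it still manages the object;
        # a stale cache entry costs only one extra trip to the nameserver.
        node = self.cache.get(name)
        if node is None or not probe(node):
            node = self.server.lookup(name)
            self.cache[name] = node
        return node

ns = IpcNameserver()
ns.create("msgq:42", "n1")
cache = NodeCache(ns)
manager = {"msgq:42": "n1"}
probe = lambda node: manager["msgq:42"] == node

assert cache.find("msgq:42", probe) == "n1"
ns.where["msgq:42"] = "n3"; manager["msgq:42"] = "n3"   # object migrates
assert cache.find("msgq:42", probe) == "n3"             # stale cache healed
```

The rebuild-on-failure property falls out of the same structure: if the nameserver's node dies, the `where` map can be reconstructed by polling surviving nodes once for the objects they manage.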
Booting should be as SSI as feasible, taking into account the availability requirement that nodes can join an existing cluster at any time. The architecture involves having a single "init" process that, on cold boot, takes the initial set of nodes to the specified run level in parallel. Nodes that join later are not generally visible until they reach the run level of the rest of the cluster. To manage cluster booting in this manner, some rc scripts must be modified, as well as init itself. A graceful shutdown of the entire cluster, as well as of individual nodes, is also architected through init.

Application/Resource Monitoring and Failover/Restart: This component is key to satisfying our high availability goal. Part of the functionality that standard HA products and projects provide is satisfied by the infrastructure and core OS components, and part is layered over the SSI cluster platform. Nodeup and nodedown detection is accomplished through the cluster membership subsystem, with the sigcluster signal and membership APIs available to applications and monitoring software. Simple filesystem availability is accomplished by the filesystem (GFS or CFS). Simple network IP address availability is accomplished via cluster network failover. Advanced monitoring capability should be integrated with the appropriate subsystem. Redundant interconnect support is part of the ICS subsystem. Split-brain detection and avoidance and STONITH capability are part of the cluster membership subsystem. The application monitoring and restart architecture has two conflicting goals to deal with - compatibility with solutions for non-SSI, loosely coupled clusters, and the simplicity of configuration and management made possible by the underlying SSI platform. It should be possible to adapt any of the existing launching/monitoring/restarting capabilities to the SSI platform, although reliance on independent namespaces (files, IPC objects) can complicate the adaptation.
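The monitoring-and-restart layer above can be sketched as a registry that reacts to membership events (an illustrative sketch, not project code; the class, the min() placement policy and the event delivery as a method call are all simplifying assumptions):

```python
# Hedged sketch of application monitoring and restart: applications are
# registered with the node they run on, and a nodedown event (delivered in
# the real system via the membership subsystem, e.g. the sigcluster
# signal) triggers restart on a surviving node.

class AppMonitor:
    def __init__(self, nodes):
        self.nodes = set(nodes)
        self.apps = {}   # app name -> node it currently runs on

    def launch(self, app, node):
        self.apps[app] = node

    def nodedown(self, node):
        # Fail affected applications over to a surviving node.
        self.nodes.discard(node)
        restarted = []
        for app, where in self.apps.items():
            if where == node:
                self.apps[app] = min(self.nodes)   # placement policy stub
                restarted.append(app)
        return restarted

mon = AppMonitor(["n1", "n2", "n3"])
mon.launch("httpd", "n2")
mon.launch("db", "n3")
assert mon.nodedown("n2") == ["httpd"]
assert mon.apps["httpd"] in {"n1", "n3"}
```

On an SSI platform this layer stays simple because the restarted application sees the same filesystem, devices and IPC namespace on the new node, which is exactly the configuration burden the non-SSI HA products must script around.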
Project Status

There have been several source releases from the Open SSI Cluster project since its debut. Until recently, the releases were development releases. However, starting with the most recent release (October 2002 for source, November for binary), the system is complete enough for use in a variety of environments. That release is based on the Linux 2.4 kernel. It can work with any one of several distributions and supports IA-32, Itanium-2 (IA-64) and Alpha processor types. The binary release is based on RH7.3.
On a subsystem-by-subsystem basis, the status is:

Cluster Membership (CLMS): CLMS is quite stable and functional at this point. It should work up to 50 nodes, but some work on initial booting is needed to scale to larger node counts. Membership policy is not very sophisticated yet, and integration with STONITH algorithms remains to be done. Nodedown detection frequency can be tuned, so sub-second detection is feasible.

Inter-node Communication Subsystem (ICS): ICS is quite functional over TCP. Outstanding enhancements would be to run it over native Myrinet or InfiniBand. Some forms of redundant NICs are transparently supported under ICS; other forms require some work in ICS to route around failed NICs or failed subnets.

Distributed Lock Manager (DLM): The OpenDLM project has much of the DLM working on Linux, and integration with CLMS has been done.

Clusterwide Filesystems: The project currently supports all three types of cluster filesystem: parallel physical via OpenGFS, stacked network via CFS, and parallel network via Lustre. Both CFS and OpenGFS have been used for the root filesystem. Failover capability for CFS is being actively worked on and is expected in January. An early version of the Lustre client has been tried on the system. Enforcing and automating a clusterwide mount model is largely complete, and open files are inherited as processes move.

Clusterwide Process Subsystem: Most of the process management code is up and running. Process relationships can be distributed; access to any process from any process clusterwide is supported; processes can rfork, rexec, migrate, or be signaled to migrate with kill() and SIGMIGRATE; and nodedown handling cleans up data structures and maintains the remaining relationships. Currently clones must be co-located (there is no cross-node shared address space capability). Clusterwide /proc for processes is in, as is transparent clusterwide ptrace.
The load-leveling algorithm adapted from the Mosix project is included in recent releases.

Clusterwide Devices Subsystem: Complete naming of and access to remote devices is supported. Each node uses devfs to name its devices, and CFS provides the clusterwide naming. Rather than a single unified /dev, /dev is currently a context symlink to node-specific device directories.

Cluster Logical Volume Management (CLVM): Work is ongoing at Sistina on a CLVM, but no work to integrate it with SSI has started. The EVMS subsystem is now released for single machines, and work to clusterize it should be starting soon.

Clusterwide Networking Subsystem: The LVS code has been integrated with SSI. Some enhancements to provide more SSI have been coded, but the exact nature of the complete offering is still under discussion. Failover capability using keepalived is provided.

Clusterwide Inter-Process Communication (IPC): Signals are clusterwide. Pipes and fifos are clusterwide, as are message queues. Internet sockets are clusterwide. Support for semaphores, shared memory and Unix-domain sockets is expected to follow.

Clusterwide System Management: Installation of the first node is fairly easy if you use the CFS option to distribute the root. It is still quite cumbersome if you choose GFS: because GFS is not in the base kernel, getting a GFS root filesystem is at least a two- or three-step process. Installation of other nodes is trivial thanks to netboot and PXE boot. After installation, nodes can either keep a ramdisk and kernel on their local disks or netboot into the cluster.
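The context-symlink arrangement for /dev described above can be mimicked in userspace. In openssi the symlink is resolved per-node by the kernel; the static symlink below is only a stand-in, and the path names are illustrative rather than the actual openssi layout.

```python
import os
import tempfile

root = tempfile.mkdtemp()

# Each node keeps its devfs-named devices in a node-specific directory...
for node in (1, 2):
    os.makedirs(os.path.join(root, f"node{node}", "dev"))

# ...and /dev is a symlink that resolves to the local node's directory.
local_node = 1                # pretend this node is node 1
dev = os.path.join(root, "dev")
os.symlink(os.path.join(root, f"node{local_node}", "dev"), dev)

resolved = os.path.realpath(dev)
print(resolved.endswith(os.path.join("node1", "dev")))   # True
```

A real context symlink resolves differently on each node, so every node sees its own devices at /dev while remote devices remain reachable through the node-specific paths.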
Init has been enhanced to support a cluster of nodes and to be statefully restartable. There is no run-level support yet, so only a crude method of getting all nodes to a reasonable run level exists. Little effort has yet been spent on making the other sysadmin tools cluster-aware and presenting an SSI environment (other than selected filesystem commands).

Application/Resource Monitoring and Failover/Restart: To date, there has been little integration with the Linux HA projects. As explained in the architecture section, some areas are dealt with by the relevant subsystem (ICS, GFS and LVS+ will do interconnect, filesystem and IP takeover, so the HA subsystem doesn't have to). An application monitoring and restart capability was ported from NSC and is part of the current release.

UML Integration: User-Mode Linux (UML) is a supported architecture for SSI kernels, so one can create a multi-node SSI cluster on a single physical machine for development and testing.

Project Enhancement Opportunities

While the current code base for the openssi project has a rich set of functionality, there are many opportunities to enhance it, either by integrating other functional components or by extending the current functions. Components that would be valuable to integrate include DRBD (Distributed Replicated Block Device [DRBD]), EVMS (Enterprise Volume Management System [EVMS]), OCFS (Oracle Cluster File System [OCFS]), and FailSafe [FailSafe]. Integrating MPICH and the Bproc libraries would enhance the high-performance capability. Possible availability enhancements include dual interconnects, process checkpointing, fault-tolerant processes and IPC object movement. Scalability enhancements include support for other interconnects (Quadrics, Myrinet, InfiniBand) and some algorithm changes to scale from hundreds to thousands of nodes. Manageability enhancements include streamlining installation and node-monitoring capabilities.
Summary

To date, clustering on Linux has been quite fragmented, as solutions to specific problems have been developed and deployed, often concentrating on only one of the three key cluster goals: availability, scalability and manageability. Now each solution is looking for the added functionality that the other solutions have, which could eventually result in some consolidation. The SSI platform, as pursued by the Open SSI Cluster project, attempts not only to bring the best of each cluster solution area together in a single offering, but to do so in a modular way that should allow component subsetting, substitution and expansion over time.

The term SSI cluster is not an absolute one. There are variants, such as "master node" SSI and "home node" SSI, that provide a single-machine view to a set of processes spread around the cluster. There are also degrees of SSI: one can do SSI system management without SSI process management, or SSI file management without an SSI device space. The Open SSI Cluster project is the most ambitious attempt to provide the greatest degree of SSI clustering.

SSI alone, however, is not enough to make a cluster solution valuable to a broad segment of the computing world. Added capability in the areas of scalability and availability is critical to a general cluster strategy. Even though SSI provides a sharable environment in which processes on different nodes can work together and scale, load balancing of both processes and incoming work requests enhances the SSI cluster with respect to scaling, and integrating parallel programming libraries is also important. A cluster that acted like a single machine and thus completely failed if any node failed would look like an SMP or NUMA system from an HA perspective and would miss the opportunity presented by having a copy of the OS kernel on each node. Instead, each node should be an independent fault zone, and all the
monitoring, failover and restart capability of HA/failover clusters should be layered on the SSI cluster offering. The SSI cluster presents a much simpler environment for providing HA, because the system can manage networking, device and filesystem monitoring and failover. Application HA is much simpler, since there is a common root filesystem to hold a single copy of the process availability configuration and because any program can run on any node at any time.

While the Open SSI Cluster project is quite young and ambitious, it is well along the way to providing a full SSI environment. Part of the reason is that it was seeded with a set of SSI code from Compaq's NonStop Clusters for Unixware product. The other reason is that completed code from several other open source projects, including GFS, Mosix, LVS, Lustre, DLM and Beowulf, is already integrated or being integrated.

References

[Beowulf]
[bproc]
[CI]
[DRBD]
[EVMS]
[FailSafe]
[GFS]
[HALinux]
[Khalidi96] Y. Khalidi, J. Bernabeu, V. Matena, K. Shirriff, M. Thadani, "Solaris MC: A Multi-Computer OS", USENIX 1996 Annual Technical Conference.
[LifeKeeper]
[Lustre]
[LVS]
[LVM]
[Mosix]
[NSC]
[OCFS]
[OpenDLM]
[OpenGFS]
[OpenSSIC]
[Pfister98] G. Pfister, "In Search of Clusters", 2nd edition, Prentice Hall PTR, 1998.
[Popek85] G. Popek, B. Walker, "The LOCUS Distributed System Architecture", MIT Press, 1985.
[ServiceGuard]
[Scyld]
[UML]
[Walker98] B. Walker, "Clusters: Beyond Failover", Performance Computing, August 1998.
[Walker99] B. Walker, D. Steel, "Implementing a Full Single System Image UnixWare Cluster: Middleware vs Underware", PDPTA '99.