Pivot3 Serverless Computing Technology Overview June 2009
Table of Contents

Introduction ............................................. 3
Serverless Computing Architecture ........................ 4
1. Server Applications ................................... 5
2. The RAIGE Operating System ............................ 6
3. Hosted Server Software ............................... 13
4. Cloudbank Appliances ................................. 15
Summary ................................................. 18
Introduction

Industry transitions are often marked by inflection points where elegant software makes it practical to use x86-based commodity hardware for tasks that previously relied on expensive proprietary hardware. These inflection points drive widespread deployment of new technology into more cost-sensitive markets. Examples of companies that used software to drive technology transitions include Microsoft, VMware and Google. Microsoft, for example, introduced x86 servers to business-critical computing once the Windows NT operating system became a viable application platform, an event that effectively ended the era of minicomputers. VMware similarly shook up the mainframe business by developing virtualization software for x86 servers. Meanwhile, Google developed the Google File System to build massive server farms from tens of thousands of inexpensive x86-based motherboards.

In each case, the disruptive software had to anticipate that commodity hardware components would fail and that the performance of general-purpose hardware would be more difficult to manage than that of application-specific hardware. Offsetting these challenges were the disruptive cost base of the x86 platforms and the relentless performance improvements of Moore's Law.

Pivot3 Serverless Computing software delivers an inflection point where low-cost x86-based server hardware is repurposed to deliver highly available, on-demand infrastructure for capacity-intensive applications. This document provides a technical overview of the Serverless Computing architecture and the innovations inherent in its design.
Serverless Computing Architecture

The Pivot3 Serverless Computing Architecture is the first and only storage area network (SAN) storage system that simultaneously hosts server applications on shared x86 hardware. The combined server/storage platform is both a high-availability SAN storage solution with no single point of failure and a high-performance, high-availability server solution that exploits commodity hardware components.

There are four major components in Serverless Computing Arrays:

1. Server Applications
2. The RAIGE Operating System
3. Hosted Server Software
4. Cloudbank Appliances

The following chapters describe each of these four architectural components in more detail. For specific product information such as capacity calculations, power specifications and performance by array size, please see the Pivot3 product specification sheets and the architectural and engineering specifications available on the Pivot3 web site at www.pivot3.com.
1. Server Applications

The Pivot3 Serverless Computing Array is an open-systems server and storage platform for server applications. The architecture is ideal for environments that are CPU-intensive and I/O-loaded and that require high storage capacity.

Supported Applications

The Serverless Computing Architecture supports server applications running on Microsoft Server products that support the iSCSI storage standard. Because of this standards-based approach, no application integration is required and there are no certification hurdles for new applications. Remote management and monitoring of server applications and operating systems is fully supported since the platform is based on standard Ethernet networks. Pivot3 platforms are also Microsoft Windows Hardware Quality Lab (WHQL) certified for compatibility with Microsoft operating systems. The Pivot3 application lab tests certain applications that are commonly deployed in the field to speed field support resolution. An active list of open-systems ISV partners can be found on the Pivot3 web site at http://pivot3.com/partners/solution.

Windows Storage Server NAS (Network-Attached Storage)

Users can also use part or all of a Serverless Computing Array as a NAS share by running Windows Storage Server on one or more Cloudbank Appliances in the Serverless Computing Array. All Windows Storage Server features are supported, including:

- Distributed File System (DFS) for simplified access and high availability across locations
- Centralized management with the File Server Resource Manager (FSRM) interface
- SIS file-level de-duplication for up to 128 volumes
- Integration with Microsoft ecosystem standards, including Active Directory
2. The RAIGE Operating System

The Pivot3 RAIGE Operating System (RAIGE OS) runs on each Cloudbank Appliance. The RAIGE OS provides logical volume management, distributed data protection and automatic load balancing across appliances for ease of management and high performance.

Logical Volume Management

The RAIGE OS virtualizes the physical disks and appliances in a Serverless Computing Array so that capacity can be managed logically, beyond the physical limits of each appliance. Reliability also improves because volume access is not disrupted by physical hardware component failures. Pivot3 appliances discovered by the RAIGE Director Software are presented as the RAIGE Domain. Physical appliances on the same local subnet can be selected and assigned to one or more Serverless Computing Arrays. Multiple Pivot3 Arrays can be managed with one instance of the RAIGE Director Software.

(Figure: Appliances are assigned to an array using the RAIGE Director Software.)
Capacity Management

The aggregate capacity of the underlying appliances can then be parceled into logical volumes using the RAIGE Director Software. Attributes for each logical volume, such as RAID protection, name, rebuild priority and access control, are set per volume without requiring knowledge of the underlying physical hardware. Capacity can be physically and logically added both to the array and to existing logical volumes. Capacity expansion is dynamic and does not interfere with data storage or retrieval. Technically, the aggregated capacity of all the appliances is recognized by application servers as a multi-ported iSCSI target.

(Figure: Logical volumes are created from the Serverless Computing Array.)

Initiator Management

Access to volumes is managed by a list of iSCSI initiators that are allowed to log in to specific volumes. iSCSI initiator logins can be either unauthenticated or authenticated via MD5-based CHAP and an initiator shared secret. For initiators that support mutual CHAP logins, the array can be configured to authenticate volumes to the initiator via MD5-based CHAP and an array-wide shared secret. Initiator names/logins can have an Access Control List (ACL) defined that allows Read-Write, Read-Only or no access to volumes.

Array Management

Since there can be many connections in a large Serverless Computing Array, Pivot3 provides a software utility called the RAIGE Connection Manager to automatically configure and maintain network connections between servers and volumes.

Distributed Data Protection

The key elements of Pivot3 distributed data protection are RAID across Gigabit Ethernet (RAIGE) algorithms, disk write cache, virtual global sparing, parallel rebuilding of failed drives, priority rebuilding of volumes and continuous background verification.
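The Initiator Management behavior described above can be sketched as a small access-control model. This is a minimal illustration under assumed names (`Volume`, `Access`); it is not a Pivot3 API, and real logins perform MD5-based CHAP over the iSCSI login protocol:

```python
# Minimal sketch of per-volume initiator access control. The Volume/Access
# names are illustrative, not a Pivot3 API.
from enum import Enum

class Access(Enum):
    NONE = 0
    READ_ONLY = 1
    READ_WRITE = 2

class Volume:
    def __init__(self, name, chap_secret=None):
        self.name = name
        self.chap_secret = chap_secret   # None => unauthenticated logins allowed
        self.acl = {}                    # initiator IQN -> Access level

    def grant(self, iqn, access):
        self.acl[iqn] = access

    def login(self, iqn, secret=None):
        """Resolve an initiator login to an access level (NONE on failure)."""
        if self.chap_secret is not None and secret != self.chap_secret:
            return Access.NONE           # CHAP shared-secret check failed
        return self.acl.get(iqn, Access.NONE)

vol = Volume("video01", chap_secret="s3cret")
vol.grant("iqn.1991-05.com.microsoft:nvr1", Access.READ_WRITE)
print(vol.login("iqn.1991-05.com.microsoft:nvr1", "s3cret"))   # Access.READ_WRITE
```

An initiator absent from the ACL, or one presenting the wrong shared secret, resolves to no access, mirroring the Read-Write/Read-Only/none levels described above.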
RAID Algorithms

The RAIGE OS distributes data and parity across Cloudbank Appliances so that data is efficiently protected against component failures. There is no need to create physical disk or RAID sets as you would with a traditional RAID system. Rather, disks are treated as raw capacity and the RAID function is implemented at the volume level. The normally burdensome management tasks of defining RAID groups and partitioning volumes, which are associated with traditional RAID devices, are not necessary with a Pivot3 Serverless Computing Array.

RAID Protection

Four RAID protection levels are provided to meet the data protection goals of each application:

RAID 0 - Striping with no parity. Data is not protected against failures. No protection capacity is required for RAID 0.

RAID 1e - Enhanced network mirroring. Data is protected by striping an exact copy of the primary data across drives in each of the other appliances in the array. RAID 1e protects against the failure of any drive in the array. The "e" indicates enhanced RAID 1, since data is also protected if an entire appliance with all twelve drives fails. The protection capacity required for RAID 1e is 100% of the primary data.

RAID 5e - Enhanced network parity. Data is striped across each appliance in the array and protected by network parity. Network parity is also striped across each appliance in the array so that data stored in each appliance is protected by parity in another appliance. RAID 5e protects against one failure, which can be either a drive failure or an appliance failure. The "e" indicates the enhanced RAID 5 protection for all twelve drives in the appliance. The capacity required for RAID 5e protection is calculated from the number of appliances in the Serverless Computing Array. For example, in an array with twelve Cloudbanks, a volume designated as RAID 5e effectively has an 11+1 data + parity scheme.
RAID 6e - Enhanced network and disk parity. Data is striped across each appliance in the array and protected by two levels of parity. The first level, network parity, is striped across each appliance in the array, much like the parity in RAID 5e. Network parity protects data against either a drive or an appliance failure. The second level, disk parity, is striped across the disks within an appliance. Disk parity protects against any disk failure within each appliance. RAID 6e protects against three simultaneous disk failures. The "e" indicates enhanced RAID 6, since data is also protected against the simultaneous failure of one drive and an entire appliance with all twelve drives. The capacity required for RAID 6e protection, roughly two appliances per array, is double that of RAID 5e since parity is distributed to two locations. Precise usable capacity figures are included in the product specification sheets.

Drive Groups

Drive Groups further minimize the effect of drive failures on the overall array. Drive Groups consist of one drive per appliance and are automatically created and maintained by the RAIGE OS. By organizing the placement of parity and mirror data within one Drive Group, the impact of a drive failure is limited to its Drive Group. Drive Groups effectively increase the number of simultaneous drive failures that each Serverless Computing Array can sustain without data loss, since drive failures in one Drive Group do not affect other Drive Groups.

Changing and Mixing RAID Protection in an Array

Since RAID protection is set by volume, different RAID levels may co-exist on an array, and RAID levels can be changed while the system is running, without disruption to data reads or writes. For example, a RAID 6e volume can be changed to a RAID 5e volume to free up space while access to the volume continues uninterrupted.
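The protection-capacity arithmetic for the four RAID levels can be summarized in a short sketch. The fractions follow the descriptions above (for example, RAID 5e on n appliances is effectively an (n-1)+1 data + parity scheme); these are rough figures only, and the precise usable capacities are in the Pivot3 product specification sheets:

```python
# Rough usable-capacity fractions for the four RAIGE protection levels,
# derived from the descriptions above. Illustrative only; see the Pivot3
# product specification sheets for precise figures.
def usable_fraction(raid_level, appliances):
    if appliances < 1:
        raise ValueError("need at least one appliance")
    if raid_level == "0":
        return 1.0                             # striping, no protection capacity
    if raid_level == "1e":
        return 0.5                             # full mirror copy of primary data
    if raid_level == "5e":
        return (appliances - 1) / appliances   # one appliance's worth of parity
    if raid_level == "6e":
        return (appliances - 2) / appliances   # roughly two appliances of parity
    raise ValueError("unknown RAID level")

# A twelve-Cloudbank array with RAID 5e: an 11+1 data + parity scheme,
# so 11/12 of the raw capacity is usable.
print(usable_fraction("5e", 12))
```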
Adding Physical Capacity to an Existing Array

Capacity can be physically added to an existing Serverless Computing Array by connecting new physical appliances to the SAN and then configuring them into the existing Serverless Computing Array using the RAIGE Director Software. The additional physical capacity is dynamically added to the Pivot3 Array, and data is restriped across the appliances so that capacity is automatically provisioned. The complexity normally associated with managing meta-LUNs for volumes larger than the domain of one conventional physical RAID controller is consequently eliminated.

Allocate-on-Write

The RAIGE OS uses an allocate-on-write method so that a configured volume can be written to immediately and does not require disk formatting time, which for large conventional arrays may take over 24 hours.

Disk Write Cache

The Pivot3 RAIGE OS uses a patented Disk Write Cache to protect in-flight data against power loss and to eliminate unreliable and expensive battery-backed RAM write cache memory. This patented approach takes advantage of the massive network bandwidth available in a Serverless Computing Array and delivers excellent performance by using parallel disks as caching elements. On each physical disk, cache zones spread across the sectors of the disk are used for the intermediate caching of write data. Depending on the position of the disk head on the platter when a write request occurs, the cached data is saved in the nearest cache zone, greatly reducing head seek latencies. Host acknowledgements allow the host server to move on to the next activity, and the cached data is moved to its final placement on the media as a background task.

Virtual Global Sparing

Virtual drive sparing is used to automate and speed drive rebuilding if a drive fails in a Serverless Computing Array. The capacity of one spare drive is reserved across all of the drives in the array and removed from usable capacity. In the event of a drive failure, the rebuild process begins immediately using the previously reserved capacity.
Unlike spare drives in conventional systems, which stand by idle during normal operation, the virtual global spare drives in a Serverless Computing Array contribute to the overall performance of the RAID system during normal operation. Virtual RAID controller sparing is effectively supported as well, since data remains protected in the case of an appliance failure.
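As a rough illustration of the reserved capacity involved, one drive's worth of space spread across every drive in the array works out to a thin slice per drive. The drive counts and sizes below are invented for the example:

```python
# Sketch of virtual global sparing: one drive's worth of capacity is
# reserved as thin slices across all drives, so every spindle contributes
# both to normal I/O and to rebuilds. Figures are illustrative.
def reserved_per_drive(drive_capacity_gb, total_drives):
    """Spare capacity (GB) each drive contributes to the virtual global spare."""
    return drive_capacity_gb / total_drives

# Three Cloudbanks of twelve 1000 GB drives = 36 drives in the array
print(round(reserved_per_drive(1000, 36), 1))   # 27.8 GB reserved per drive
```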
Parallel Rebuild Innovation

Conventional RAID systems are constrained by the physical relationship between RAID groups and their member disk drives. As a result, sparing and rebuilds are similarly constrained to physical drives. This becomes an important limitation as drive capacities grow beyond 1 TB and rebuild times for single drives increase. Serverless Computing Arrays provide extremely fast parallel rebuilds of failed drives because of the distributed nature of data allocation and sparing. Many drives contribute to the rebuild process, and the recovered data is written to all drives, resulting in a massively parallel activity. Only the sectors of a failed disk that actually have data allocated and written need to be rebuilt, which further speeds rebuild times in less-utilized arrays.

Priority Rebuilds by Volume

Rebuilds are performed by volume. All volumes in the array are allocated across all of the drives in the array. An added benefit is the ability to designate a priority level for each volume so that higher-priority volumes are rebuilt first. Rebuilding any specific volume may require rebuilding only a small portion of a drive. With conventional systems, rebuilding happens at the disk level, which generally means volumes are only protected once the entire disk is fully rebuilt.

Background Verification

The Serverless Computing Array continuously performs background disk verification. Each disk is completely scanned to identify disks that are beginning to fail and to detect and repair bad blocks on the media. This is another process that benefits from the massive available bandwidth of the array and the processing power available in the Cloudbank Appliances.
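The priority-rebuild ordering described above amounts to sorting the affected volumes by their configured priority before rebuilding. A minimal sketch, with invented volume names and priority values:

```python
# Sketch of priority rebuilds by volume: higher-priority volumes are
# rebuilt first, so they regain protection sooner. Names and priority
# values are illustrative.
def rebuild_order(volumes):
    """volumes: list of (name, priority) pairs; higher priority rebuilds first."""
    return [name for name, prio in sorted(volumes, key=lambda v: -v[1])]

vols = [("archive", 1), ("cameras", 5), ("exports", 3)]
print(rebuild_order(vols))   # ['cameras', 'exports', 'archive']
```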
Automatic Load Balancing

Load balancing of bandwidth and capacity across network ports, appliance controllers and disk drives is managed by the RAIGE OS with no administrative intervention. Since data is equally distributed across the Serverless Computing Array, changes to either the physical infrastructure or logical entities can be quickly accommodated to eliminate disk, controller and network hot spots. For write operations, the dedicated x86 processors in each appliance have ample processing power for both RAID operations and TCP offload processing. For read operations, Cloudbank Appliances return data to the application servers in parallel, providing load-balanced performance across all appliances and all drives.

(Figure: Four-Cloudbank array example showing physical Cloudbank connections: four server LAN ports and two iSCSI SAN ports per Cloudbank, managed by the RAIGE Director Software.)

The parallel architecture and load balancing of the RAIGE OS allow Serverless Computing Arrays to effectively aggregate many 1 Gbps Ethernet ports and quickly surpass the bandwidth of proprietary 4 Gbps Fibre Channel systems. Load balancing of capacity and performance extends to physical reconfiguration of each Serverless Computing Array. Following the addition or removal of physical appliances in an existing array, the RAIGE OS restripes data across the new physical appliance count and automatically optimizes the load across the new physical network connections.
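The way equal data distribution eliminates hot spots can be sketched with a simple round-robin stripe mapping: sequential logical blocks rotate across appliances, so sequential I/O spreads over every appliance and its network ports. The stripe-unit size and mapping function here are invented for illustration and are not the actual RAIGE layout:

```python
# Illustrative round-robin striping of logical blocks across appliances.
# The real RAIGE data layout is not documented here; this only shows why
# equal distribution avoids disk, controller and network hot spots.
def block_to_appliance(lba, stripe_blocks, appliances):
    """Map a logical block address to the appliance holding it."""
    return (lba // stripe_blocks) % appliances

# 4 appliances, 128-block stripe units: consecutive stripes rotate owners
owners = [block_to_appliance(lba, 128, 4) for lba in range(0, 1024, 128)]
print(owners)   # [0, 1, 2, 3, 0, 1, 2, 3]
```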
3. Hosted Server Software

Each Cloudbank Appliance runs a virtualization software layer that allows storage and server operating systems to run simultaneously on the same appliance.

Virtualization Software

The virtualization software in the Serverless Computing Array is open-source software provided by Xen.org. The RAIGE OS runs as the first guest operating system. One additional server operating system can then be added as a guest. Cloudbank Appliances are hardware-provisioned for storage and server functions as follows:

- Four x86 cores, four gigabytes of RAM and two Gigabit NIC ports for the RAIGE OS
- Four x86 cores, four gigabytes of RAM and two Gigabit NIC ports for the server operating system and applications

Virtual Machine I/O

iSCSI I/O in a Serverless Computing virtual machine (VM) begins with a single iSCSI session between the iSCSI initiator in the host and each Pivot3 logical volume accessible by the host. The iSCSI sessions utilize a purely virtual network interface card (NIC) that interfaces directly to the RAIGE OS running on the Pivot3 Cloudbank Appliance hosting the VM. This network path is immune to the failure points common in physical networks (cables, switches). A host I/O destined for a Pivot3 logical volume is sent by the iSCSI initiator to the iSCSI target for that logical volume through the virtual NIC. The RAIGE OS examines the logical block address associated with the command to determine which Cloudbank Appliance the data should be written to or read from. If the I/O can be processed on the local appliance, the request is serviced immediately and the data never traverses a physical network cable. If the data is associated with another Cloudbank in the array, the RAIGE OS on the local Cloudbank will read or write the data to the appropriate Cloudbank. This I/O takes place over the fault-tolerant and load-balanced storage networks forming the backbone of the Pivot3 Array.
Once the data transfer is complete, status for the I/O is returned to the host over the virtual NIC.

Managing Hosted Virtual Servers

Hosted virtual servers running on a Cloudbank Appliance have access to the entire shared capacity and bandwidth of the underlying iSCSI storage array. Server instances can be started, stopped and managed as would any remote server.
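The local-versus-remote routing decision in the Virtual Machine I/O discussion can be sketched as follows. The block-to-appliance mapping is a stand-in for the real RAIGE data layout, which is not documented here:

```python
# Sketch of the routing decision described above: the RAIGE OS maps an
# I/O's logical block address to an owning Cloudbank; if that is the local
# appliance, the request is serviced without touching a physical network.
# The stripe mapping below is an invented stand-in for the RAIGE layout.
def route_io(lba, local_appliance, stripe_blocks=128, appliances=4):
    owner = (lba // stripe_blocks) % appliances
    if owner == local_appliance:
        return ("local", owner)      # serviced on this appliance, no network hop
    return ("remote", owner)         # forwarded over the iSCSI SAN backbone

print(route_io(0, local_appliance=0))     # ('local', 0)
print(route_io(200, local_appliance=0))   # ('remote', 1)
```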
Server VM Recovery

Pivot3 provides added server application reliability with a software recovery feature that protects applications against server hardware failures. The VM recovery feature can be selected for each virtual machine using the RAIGE Director Software. In the case of a Cloudbank hardware failure, the recovery feature automatically reloads the virtual machine on an available Cloudbank Appliance in the Serverless Computing Array and dynamically re-establishes network, camera and storage connections. This speeds the restoration of the application and of access to the storage volumes, which are protected against appliance failures by the RAIGE OS. Unlike hardware-based failover techniques, Pivot3 VM recovery does not require dedicated physical server, storage or network hardware and is a standard feature included with Pivot3 Serverless Computing Arrays.
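The VM recovery behavior can be sketched as reassigning the failed appliance's virtual machines to surviving appliances. The least-loaded placement policy below is an assumption for illustration only; the source does not specify how the target appliance is chosen:

```python
# Sketch of server VM recovery: when a Cloudbank fails, each recovery-
# enabled VM it hosted is restarted on a surviving appliance. The
# least-loaded placement policy is assumed, not documented.
def recover_vms(placements, failed, survivors):
    """placements: vm name -> appliance. Move VMs off `failed` in place."""
    load = {a: sum(1 for h in placements.values() if h == a) for a in survivors}
    for vm, host in placements.items():
        if host == failed:
            target = min(load, key=load.get)   # pick least-loaded survivor
            placements[vm] = target
            load[target] += 1
    return placements

placements = {"nvr1": "cb1", "nvr2": "cb2"}
print(recover_vms(placements, failed="cb1", survivors=["cb2", "cb3"]))
# {'nvr1': 'cb3', 'nvr2': 'cb2'}
```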
4. Cloudbank Appliances

Cloudbank Appliances are the hardware building block of the Serverless Computing Architecture.

a. Dual-CPU enterprise server motherboard. Each appliance contains an enterprise server motherboard with dual quad-core Xeon x86 CPUs and 8 gigabytes of ECC DIMM RAM. Of the eight available cores, four are dedicated to the server operating system and applications and four are dedicated to storage operations and TCP offload processing.

b. Four 1 Gigabit Ethernet LAN ports. Four 1 Gigabit Ethernet Network Interface Card (NIC) ports are dedicated to server connectivity.

c. Two 1 Gigabit Ethernet SAN ports. Two 1 Gigabit Ethernet NIC ports are dedicated to the iSCSI SAN.

d. Redundant hot-swappable power supplies and fans. Redundant power supplies eliminate power circuits as a single point of failure and are easily replaced without interrupting appliance operation for fast field service. Three redundant fans are also hot-swap devices.

e. Audible alarms. An audible alarm is activated on physical component failures to alert support personnel. This alarm is a requirement in some regulated environments and is helpful where less-trained operators are managing the appliances.

f. Environmental monitoring and diagnostics. Each appliance self-monitors the key environmental conditions of its major components. A change in state of any environmental condition is displayed in the Pivot3 user interface and can be transmitted as an SNMP event.
g. State-indicating LEDs. Drive bay LEDs assist field support and maintenance. LEDs flash blue to indicate read and write operations to the drives under normal conditions. Failed drives are quickly identified by a corresponding red LED. A strobe function allows users to identify specific drives or appliances for diagnostic purposes.

h. Enterprise SATA drives. The 2U, twelve-drive form factor is the densest storage configuration that maintains front access to hot-swappable drives. Front accessibility is a key element in simplifying field support, so that failed drives can be quickly replaced by users in the field. Appliances are delivered with fully populated drive bays. SATA drives are delivered in the appliance, although the backplane and controller infrastructure supports both SAS and SATA interfaces.

The following environmental states are monitored and reported:

- CPU failure
- Disk failure
- Power supply failure
- Network failure
- iSCSI port failure
- Thermal temperature threshold exceeded
- Fan failure

Diagnostic logs are also kept for each appliance and can be remotely accessed and analyzed.
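The environmental monitoring above boils down to tracking component states and reporting any change. A minimal sketch, where the component names mirror the monitored conditions listed above and the polling/alert plumbing is invented for illustration:

```python
# Sketch of environmental monitoring: each appliance tracks the state of
# its major components and reports any change (shown in the UI and
# optionally sent as an SNMP event). Alert plumbing is illustrative.
MONITORED = ["cpu", "disk", "power_supply", "network", "iscsi_port", "thermal", "fan"]

def detect_changes(previous, current):
    """Return the components whose state changed since the last poll."""
    return [c for c in MONITORED if previous.get(c) != current.get(c)]

ok = {c: "ok" for c in MONITORED}
now = dict(ok, fan="failed")
print(detect_changes(ok, now))   # ['fan']
```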
SNMP Support

Pivot3 appliances can be monitored using the Simple Network Management Protocol (SNMP). Community strings for SNMP are configured through the RAIGE Director Software, and the SNMP MIB (management information base) is provided with the Pivot3 software. Because appliances cooperate within an array, SNMP agents can be configured once at the array level and do not need to be set up for each appliance. Cloudbank Appliances run the Pivot3 SNMP agent and send SNMP events to third-party software applications that receive, or trap, the SNMP notifications. Since SNMP traffic travels on the storage network, the server running the third-party application or SNMP trap receiver needs access to both the storage network and the server network. Many network-management applications use SMTP to forward traps as email: the network-management application pushes an SMTP message to the corporate email server, which forwards the message to a specified email address.
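The trap-to-email forwarding path can be sketched with the Python standard library. The host names, addresses and trap fields here are invented for the example; a real network-management application would supply its own:

```python
# Sketch of trap-to-email forwarding as described above: a network-
# management application wraps an SNMP trap in an email and pushes it to
# the corporate SMTP server. All names and addresses are illustrative.
import smtplib
from email.message import EmailMessage

def trap_to_email(trap, sender, recipient):
    """Build an email message describing an SNMP trap."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"SNMP trap from {trap['source']}: {trap['event']}"
    msg.set_content(f"Event: {trap['event']}\nSource: {trap['source']}\nOID: {trap['oid']}")
    return msg

def forward(msg, smtp_host="mail.example.com"):
    # Push the message to the corporate email server for delivery.
    with smtplib.SMTP(smtp_host) as s:
        s.send_message(msg)

trap = {"source": "cloudbank-03", "event": "fan failure", "oid": "1.3.6.1.4.1.0.1"}
msg = trap_to_email(trap, "pivot3@example.com", "ops@example.com")
print(msg["Subject"])   # SNMP trap from cloudbank-03: fan failure
```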
Summary

While intense scrutiny has been focused on server-centric and switch-centric virtualization approaches, Pivot3 Serverless Computing has quietly developed a third approach to on-demand infrastructure that, for the first time, integrates server virtualization into a storage platform. For application environments characterized by high capacity needs and I/O-intensive workloads, a storage-centric approach offers higher performance, lower cost and higher availability than either switch or server alternatives. By virtualizing RAID block storage and then collapsing the storage and server hardware into a single platform, Pivot3 is uniquely positioned to deliver Storage-Centric Computing.

Newer scale-out storage systems have surplus x86 resources and can easily integrate server applications to reduce cost, power, cooling and rack space while improving availability and simplifying management. By contrast, conventional storage arrays are a poor platform for integrating server virtualization technology for the simple reason that they do not have enough compute horsepower available. Customers should look carefully at the predominant workloads of their environments and select the virtualization platforms that best meet their requirements. For capacity-rich and I/O-intensive workloads, the benefits of storage-centric platforms based on newer scale-out architectures can be dramatic and should be central to data center consolidation planning.
Pivot3, Inc. 6605 Cypresswood Drive Spring, TX 77379 www.pivot3.com Tel: 877.574.8683 Fax: 281.516.6099 Copyright 2009 Pivot3, Inc. All rights reserved. Specifications are subject to change without notice. Pivot3, RAIGE, Pivot3 Serverless Computing, Cloudbank, Databank, NVR Recovery, and High-Definition Storage are trademarks or registered trademarks of Pivot3. All other trademarks are owned by their respective companies. Techover4.1 June 2009