Project Colossus
Table of Contents

- Executive Summary
- Introduction
- Nutanix Complete Cluster: A Scalable Architecture
- Project Colossus
  - The Setup
  - Benchmark Results
- Conclusion
Executive Summary

Predictable scalability is a major challenge for virtualized datacenters today. Organizations are designing their next-generation datacenters with virtualization at the core of the architecture, but most are unsure of how their needs will evolve as they virtualize more and more applications. As part of the design exercise, they need to make critical sizing decisions upfront: they must accurately predict the server and storage infrastructure they will need over three to five years and procure a right-sized system. This guesswork often leaves organizations either living with an under-designed system or doing a forklift upgrade to buy a bigger box.

Nutanix Complete Cluster solves this problem with its building-block approach to the virtualized datacenter. It can seamlessly scale from one Block to a large number of Blocks in an elastic manner. As more Blocks are added to the cluster, Nutanix SOCS (Scale-Out Converged Storage) continues to grow as a shared storage pool that is accessible to any VM running anywhere on the cluster. This takes the guesswork out of datacenter design and allows organizations to grow as their needs grow, with a predictable increase in performance and cost.

Project Colossus is a fifty-node benchmark cluster that demonstrated the linear scalability of Nutanix Complete Cluster from four nodes (one Complete Block) to fifty nodes (12.5 Complete Blocks). In this benchmark, Colossus delivered 375,000 random write IOPS and 28 gigabytes per second of sequential read bandwidth.
Introduction

Capacity planning is one of the biggest challenges for most datacenter architects. A typical datacenter design tries to address a three-to-five-year horizon, if not more. To do this, the architects have to know (or guess) in advance what they are designing for. Some of the most common questions an architect tries to answer are:

- How much storage capacity is needed?
- How many virtual machines will need to run?
- How much performance (IOPS, bandwidth) will the storage system need to deliver?

Apart from these technical requirements, they also often have to estimate how much budget is available to build and support this infrastructure over the planning horizon.

Most of this unnatural upfront guesswork is a result of the rigid architecture of traditional storage systems. Traditional vendors typically sell their systems in small, medium, and large configurations, but moving from one configuration to the next often requires a forklift upgrade. Most small systems offer one storage controller, medium configurations offer two to four controllers, and large systems typically offer four to eight storage controllers (a few might go higher). As the number of controllers increases, performance often does not increase linearly, while costs may grow exponentially. This makes it very hard for architects to right-size their systems without making huge investments upfront.

This problem is further accentuated by the rapid growth in the number of virtual machines. In such a dynamic environment, the ability to plan for the future is reduced and capacity management becomes an even bigger challenge. What is required is a system that can grow incrementally as new VMs are created. The system should provide predictable performance growth as it scales and should not require a forklift upgrade to get higher performance and capacity.
At the same time, the system should provide predictability in the investments needed to grow the datacenter over the planning horizon.
Nutanix Complete Cluster: A Scalable Architecture

Nutanix Complete Cluster is designed to be a highly scalable system with a small unit of growth (one node at a time). The architecture has its roots in the distributed computing model adopted by companies such as Google, Facebook, Yahoo, and Amazon to build their highly scalable datacenters. The following key aspects of its architecture make Nutanix Complete Cluster highly scalable:

No Single Master

In the Nutanix architecture, no single node plays the role of a master. Most common distributed systems have a single master that acts as a central point of control for key system operations; this master ends up becoming a bottleneck as the system scales. In traditional storage arrays, storage controllers play the role of such a central authority and are the limiting factor in scaling the system. Even if a storage array has multiple controllers, those controllers have to synchronize their caches, making it hard to scale them efficiently.

Medusa (Distributed Metadata Layer)

Nutanix Complete Cluster stores its metadata (information on what is stored where) in a distributed metadata layer named Medusa. As more nodes are added to the cluster, the metadata layer continues to scale with them. Medusa stores the metadata on PCIe-based iomemory from Fusion-io, making metadata accesses fast without limiting the metadata to RAM, as is the case in traditional solutions. This allows the cluster to keep rich metadata for advanced functionality without the metadata layer becoming a bottleneck.

Curator (Distributed Data Maintenance Service)

In large-scale storage systems, data maintenance is often centralized within a few controllers and can easily become a system bottleneck. As more data is modified and deleted, garbage collection adds significant overhead to the system.
Also, data consistency checks consume more and more controller resources as data sizes grow.
Nutanix has developed a distributed data maintenance framework called Curator to address these limitations. Curator is a MapReduce-based distributed service that enables Nutanix Complete Cluster to scale linearly. It executes background data management operations in a massively parallel manner, including data tiering, garbage collection, consistency checks, and data redistribution in case of failures.

HOTCache (Distributed High-Performance Cache)

Nutanix HOTCache is a distributed cache that provides high performance to virtual machines. It is a software layer backed by high-performance Fusion-io iomemory. As VMs write data, the data is immediately written to HOTCache first and then flushed to the backend storage layer in the background. HOTCache is persistent and highly available: data is persistently stored on iomemory and also replicated across the cluster to ensure that there is no data loss even if nodes fail.
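The write path described above can be sketched as a generic persistent write-back cache with replication. This is purely an illustration of the pattern, not Nutanix's implementation; all class and variable names here are assumptions.

```python
# Illustrative sketch of a persistent, replicated write-back cache.
# All names are hypothetical; this is not the actual HOTCache code.

class Node:
    """A cluster node with a local cache tier and a backend tier."""
    def __init__(self, name):
        self.name = name
        self.cache = {}     # stands in for persistent iomemory
        self.backend = {}   # stands in for the HDD/SSD storage layer

class WriteBackCache:
    def __init__(self, local, replica):
        self.local = local
        self.replica = replica

    def write(self, key, value):
        # 1. Persist the write to the local cache tier.
        self.local.cache[key] = value
        # 2. Replicate to a peer node so a node failure loses no data.
        self.replica.cache[key] = value
        # The write is acknowledged here; destaging happens later.

    def flush(self):
        # Background destaging from the cache tier to the backend tier.
        for key, value in list(self.local.cache.items()):
            self.local.backend[key] = value
            del self.local.cache[key]

a, b = Node("node-a"), Node("node-b")
cache = WriteBackCache(local=a, replica=b)
cache.write("vm1-block-42", b"data")
assert b.cache["vm1-block-42"] == b"data"   # survives failure of node-a
cache.flush()
assert a.backend["vm1-block-42"] == b"data"
```

The key property the sketch shows is that the write is acknowledged after hitting fast persistent media on two nodes, while the slower backend write happens asynchronously.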
Project Colossus

To demonstrate true scalability, Nutanix created a benchmark called Project Colossus. Project Colossus involved running performance tests on a single Block (a four-node cluster) and scaling it to a large number of nodes (fifty) to validate performance at large scale.

The Setup

Single Block Hardware

The baseline setup involved a standard Nutanix Complete Block (four VMware hosts) with the following aggregate hardware resources:

- 8 Intel Xeon processors (48 cores)
- 1.28 TB of Fusion-io iomemory
- 1.20 TB of SATA SSD
- 20 TB of SATA hard disks (7,200 RPM)
- 192 GB RAM
- 4 x 10 GbE and 8 x 1 GbE network ports

Colossus Hardware

Colossus was a cluster with fifty nodes (ESX hosts) and had the following aggregate hardware resources:

- 100 Intel Xeon processors (600 cores)
- 16 TB of Fusion-io iomemory
- 15 TB of SATA SSD
- 250 TB of SATA hard disks (7,200 RPM)
- 2.4 TB RAM
- 50 x 10 GbE and 100 x 1 GbE network ports

In both cases, the nodes were connected to each other over a 10 GbE network.

Software Setup

Each node of the Colossus cluster was running ESXi 4.1 and had a Nutanix Controller VM running on it. Thus, Project Colossus was running fifty controllers in a single cluster, making it one of the most scalable enterprise-class storage systems in the industry.
Performance Test Setup

On each node, six virtual machines were created, representing a high-performance server virtualization environment. Each virtual machine was connected to one Nutanix virtual disk. Thus, Colossus had a total of 300 virtual machines with 300 virtual disks. Inside each virtual machine, a test application ran the test workloads (random/sequential). A centralized test manager orchestrated these test applications across the 300 virtual machines. To cover a variety of workloads, the system was benchmarked for random write performance and sequential read performance.

Benchmark Results

Sequential Read Performance: Four Nodes (One Block)

The chart below shows the performance results delivered by a four-node cluster.

[Chart: Aggregate bandwidth (MB/s) vs. time (sec) for the four-node cluster]

This chart shows that the four-node cluster delivered an average of 2 gigabytes per second of sequential read throughput. The test virtual machines completed their tasks in about six minutes.
Sequential Read Performance: Fifty Nodes (12.5 Blocks)

To test how sequential read throughput increased with cluster size, the test was run on Colossus, the fifty-node cluster.

[Chart: Aggregate bandwidth (MB/s) vs. time (sec) for the fifty-node cluster]

This chart shows that the cluster delivered an average of about 28 gigabytes per second of sequential read throughput. Thus, as the cluster size increased 12.5 times, performance increased almost 14 times, demonstrating linear scalability (adjusting for statistical variations).
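The scaling arithmetic behind this claim can be checked directly, using only the figures quoted in the text:

```python
# Scaling check using the figures quoted in the text.
nodes_small, nodes_large = 4, 50
bw_small_gbps, bw_large_gbps = 2, 28   # sequential read throughput, GB/s

size_factor = nodes_large / nodes_small      # 50 / 4 = 12.5
perf_factor = bw_large_gbps / bw_small_gbps  # 28 / 2 = 14.0

# Performance grew at least as fast as the cluster did,
# which is what "linear scalability" requires here.
assert size_factor == 12.5
assert perf_factor == 14.0
assert perf_factor >= size_factor
```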
Random Write Performance: Four Nodes (One Block)

Random write workloads are very common in virtualized environments. When a large number of virtual machines are writing data, shared storage often sees a random data stream even when the individual VMs are writing sequentially, due to the IO blender effect.

[Chart: Aggregate IOPS vs. time (sec) for the four-node cluster]

The chart above shows the random write performance delivered by the four-node cluster. The average is about 25,000 IOPS, with 30,000 IOPS being reached frequently during the test.
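The IO blender effect mentioned above can be shown with a toy model: each VM issues perfectly sequential writes within its own (hypothetical) disk region, but the interleaved stream that shared storage actually sees jumps between address ranges and no longer looks sequential.

```python
from itertools import zip_longest

# Each VM writes sequentially within its own virtual disk region.
# The offsets are made up purely for illustration.
vm_streams = {
    "vm1": [1000 + i for i in range(4)],
    "vm2": [5000 + i for i in range(4)],
    "vm3": [9000 + i for i in range(4)],
}

# The hypervisor interleaves the streams on their way to shared storage.
blended = [off for group in zip_longest(*vm_streams.values())
           for off in group if off is not None]

# Every per-VM stream is sequential...
for stream in vm_streams.values():
    assert all(b - a == 1 for a, b in zip(stream, stream[1:]))

# ...but the blended stream the storage layer sees is not.
sequential = all(b - a == 1 for a, b in zip(blended, blended[1:]))
assert not sequential
```

This is why shared storage must handle random writes well even when every guest workload is nominally sequential.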
Random Write Performance: Fifty Nodes (12.5 Blocks)

The chart below shows the random write performance delivered by Colossus:

[Chart: Aggregate IOPS vs. time (sec) for the fifty-node cluster]

Here Colossus delivered 375,000 IOPS on average, touching almost 400,000 IOPS a number of times during the tests. This is roughly a 13X to 15X increase over single-Block performance (comparing peaks and averages respectively), demonstrating linear scalability as more nodes are added to the cluster.
Conclusion

Nutanix Complete Cluster solves the biggest capacity planning dilemma faced by datacenter architects today: overpaying for an oversized system from the beginning, or hoping they don't hit a performance wall before the system is ready for refresh. It allows them to start with the small footprint of a single Block and grow linearly as their needs grow. It can deliver high performance at small scale as well as large scale, making it a fit for a broad variety of use cases. By using Nutanix Complete Cluster, organizations can enjoy the benefits of high performance and low, predictable cost at the same time.