Open source object storage for unstructured data


Solution Reference Architecture

Ceph on HP ProLiant SL4540 Gen8 Servers

Table of contents

Executive summary
Introduction
Reference architecture guidance
Sample reference configuration summary
Overview
  Business problem
  Typical architectures vs. object storage
  Key solution technologies
  Solution diagrams
Solution components
  Component choices
  Sample reference configuration design
Workload testing
  Workload description
  Workload generator tools
  Workload results and analysis
Configuration guidance
  Building your own cluster
  Cluster installation
  Cluster tuning
Bill of materials
  HP ProLiant SL4540 Gen8 Server
  HP ProLiant DL360p Gen8 Server
  HP Networking Cables
  HP 1GbE Switch
  HP 10GbE Switches
  HP Rack and Power
Summary
Appendix A: Sample Reference Ceph Configuration File
Appendix B: Sample Reference Pool Configuration
Appendix C: Syntactical Conventions for Command Samples
Appendix D: Server Preparation
  Install HP Support Pack for ProLiant
  BIOS Configuration Settings
  Configuring a Mirrored Boot Device
  Upgrading Ubuntu
  Useful Additional Package Configuration
  Command Line Logical Drive Configuration
  Updating the Driver for the 10GbE Mellanox NIC
Appendix E: Cluster Installation
  Naming Conventions
  Ceph Deploy Setup
  Ceph Node Setup
  Create a Cluster
  Add Object Gateways
Appendix F: Newer Ceph Features
  Multi-Site
  Erasure Coding
  Cache Tiering
Appendix G: Helpful Commands
  Removing configured objects
  Checking Cluster State
  Configuring Pool Settings
  Listing Object Gateway Users
Appendix H: Workload Tool Detail
  Getput
  Fio
  Collectl
  HAProxy
Glossary
For more information

Executive summary

Explosive data growth, the expansion of Big Data and unstructured data, and the pervasiveness of mobile devices continually pressure traditional file and block storage architectures. Businesses are exploring emerging architectures such as object storage to provide cost-effective storage that keeps up with capacity growth while meeting the service level agreements their business and customers require.

Enterprise-class storage subsystems are designed to deliver low latencies for business-critical transactional data. That design point is not necessarily optimal for unstructured data and backup/archival storage. In these cases, enterprise-class reliability is still required, but massive scale-out capacity and lower solution investment are more important than minimal latency. Object storage software solutions are designed to run on industry-standard server platforms, offering lower infrastructure costs and scalability beyond the capacity points of typical file server storage subsystems.

Ceph running on HP ProLiant hardware is a comprehensive and cost-effective object storage solution for addressing scale-out storage needs. HP hardware combined with Ceph on Linux delivers an open source object storage solution that:

- Has software capable of scaling from dozens of terabytes to exabytes of data and billions of objects
- Lowers upfront solution investment and total cost of ownership (TCO) per gigabyte
- Provides enterprise-class infrastructure monitoring and management
- Does not require cluster software licensing as the cluster is scaled
- Provides a single software-defined storage (SDS) cluster with block, object, and file access
- Uses open source, minimizing concerns about vendor lock-in and increasing flexibility of hardware and software choice
- Can integrate into an OpenStack deployment

Target audience: CTOs and Solution Architects looking for a storage solution that handles the rapid growth of unstructured data, cloud, and archival storage while controlling licensing and infrastructure costs. This paper assumes knowledge of enterprise data center administration challenges and familiarity with data center configuration and deployment best practices, primarily with regard to storage systems. It also assumes the reader appreciates the challenges and benefits open source solutions can bring, especially for early adopters of object storage.

This reference architecture describes testing performed by HP in March 2014.

Introduction

This reference architecture describes a Ceph cluster deployed on HP hardware. It details why and how to build a Ceph cluster with HP hardware to solve unstructured, cloud, and backup/archival storage problems. The key reasons why the reader should care are:

- Object storage is a better solution for unstructured data than traditional storage alone
- The right solution needs the right platform; white box hardware doesn't meet enterprise needs at scale

Object storage is architected for the characteristics and use of Big Data to remove scaling limitations. As implemented by Ceph, object storage is an SDS layer that federates traditional file and block storage on industry-standard Linux servers. This provides a way to scale out massively for Big Data needs at lower cost than SAN/NAS business-critical storage targets.

HP hardware is the right platform for a large-scale object storage cluster because it provides better TCO for operating and maintaining the hardware than white box servers. HP provides:

- Platform management tools that scale across data centers
- Server components and form factors that are optimized for enterprise use cases
- Hardware platforms where component parts have been qualified together
- A proven support infrastructure

Clusters built with white box servers work for businesses at small scale, but as they grow, the complexity and cost make them less compelling than enterprise-focused hardware. With white box solutions, IT has to standardize and integrate platforms and supported components themselves. Support escalation becomes more complicated. Without standardized toolsets to manage the hardware at scale, IT must chart its own way with platform management and automation. Power consumption and space inefficiencies of generic platform design also limit scale and increase cost over time. The result is IT staff working harder and the business spending more to support the quantity and complexity of a white box hardware infrastructure. The lowest upfront cost does not deliver the lowest total cost or the easiest solution to maintain.

Reference architecture guidance

It's important to set expectations on what this reference architecture is attempting to accomplish, and what that means to the reader. This paper does provide a picture of how to implement a Ceph cluster on HP hardware, and why it's compelling in a business and technical sense. It does not show how to build an entire application solution stack using Ceph. The distinction is important because the whole solution picture that's present for classic server applications using traditional block and file storage is only beginning to appear for object storage. Placing an object storage cluster in the data center is an important step, but an object storage interface requires integration effort to use. It's a new storage interface. There are currently no standardized benchmarks, use cases, or typical object data patterns for object storage, nor is there clear guidance around recommending applications that can connect to object storage. This means additional work to figure out the pieces and players for connecting a Ceph cluster to enterprise applications.

This paper also doesn't provide an exhaustive picture of cluster configuration options. Scaling storage on industry-standard servers is different from standardizing on classic NAS and SAN targets within a single datacenter. Based on business requirements, a variety of servers and infrastructure options can be selected for cluster architecture across multiple sites. It all adds up to more variables than this paper can reasonably cover while staying focused.

Sample reference configuration summary

The Ceph cluster is summarized here at a high level to give the reader context; it's built around storage on the HP ProLiant SL4540 Server, which is purpose-built for Big Data. The single-rack sample cluster contains:

- Five 2x25 HP ProLiant SL4540 Gen8 Server chassis, with 3TB drives and SSD journals
- Three HP ProLiant DL360p Gen8 Server chassis
- Ubuntu LTS: Ubuntu is the OS best supported by Ceph software today, and the long-term support release is most appropriate for an enterprise environment
- Ceph running the Dumpling (v0.67) release, which is the most current and stable Ceph LTS release at the time of this testing
- 10GbE networking running on HP 5900AF switches, carrying object data traffic

- 1GbE networking running on an HP 2920 switch, carrying HP Integrated Lights-Out (iLO) and corporate management traffic
- Rack and power components

In this configuration, the HP ProLiant SL4540 Gen8 Servers are object storage nodes; these are the servers where the scale-out storage hard drives reside. The HP ProLiant DL360p Gen8 Servers are management nodes for the cluster: they provide the part of the solution that maintains cluster state and host the object gateways that give access to cluster storage through the S3/Swift REST APIs. Given a RESTful interface, traffic can come from all kinds of clients, but in this test a benchmark tool run on x86 Linux boxes provides the sample workload.

Overview

Business problem

Businesses are looking for better and more cost-effective ways to manage their exploding data storage requirements. In recent years, the amount of storage required by businesses has increased dramatically. Exploration data from oil and gas, patient medical records, user- and machine-generated content, and many other data types generate massive amounts of data per day. Simultaneously, businesses are dealing with a shift from tape- to disk-based backup. Cost per gigabyte and ease of retrieval are important factors for choosing a solution that can scale quickly and economically over many years of continually increasing capacities and data retention requirements.

Many organizations still need to manage much or all of that data in-house. Regulations and privacy considerations can make offsite storage impractical or impossible. Hosting on a public cloud may not meet cost or data control requirements in the long term; the performance and control of on-premises equipment still offers real business advantages. Organizations that have been trying to keep up with data growth using traditional file and block storage solutions are finding that the complexity of managing and operating them has grown significantly, as have the costs of storage infrastructure.

Typical architectures vs. object storage

Storage solutions designed for traditional IT tasks are not optimal for petabyte-scale unstructured data. Typical architectures often struggle to meet business service level agreements (SLAs) when applied to petabyte-scale unstructured and archival data. In addition, with traditional storage solutions it's possible to pay for features that aren't needed, and to achieve less flexibility, scale, and reliability than the SLA requires. Here are some ways traditional thinking falls short when architecting a solution to serve unstructured data at massive scale.

Architectural and cost mismatches

File and block storage methods that make sense for structured data impose unnecessary overhead for unstructured data, particularly at large scale. Traditionally, businesses buy block storage optimized for classic data access cases, like database workloads and file systems. These solutions have the ability to support high IOPS and heavy, concurrent write load. However, unstructured and archival data is often written just once. Bandwidth and storage capacity are much more important for unstructured and archival data than low latency. Traditional storage means paying for drive classes and features an unstructured use case may not need.

When trying to drive the lowest cost per GB, tape immediately comes to mind. For many Big Data use cases, the worst-case latency of tape-based storage falls outside the required latency behaviors for data access. Unstructured and archival data may sit dormant for a while but needs to be available quickly, with maximum latency measured in seconds instead of minutes. Where tape latencies are acceptable, many enterprises still don't want to manage tape storage for onsite data.

Gaps in reliability, manageability, scalability

Storage systems designed for smaller-scale, single-site deployments are often not capable of delivering the overall reliability and data durability necessary to support complex, multi-site scale-out configurations. Many existing storage solutions are a challenge to manage and control at massive scale. Management silos and user interface limitations make it harder to deploy new storage into business infrastructure. Unstructured deployments can accumulate billions of objects and petabytes of data. File system limits on the count and size of files, and block storage limits on the size of presented block devices, become significant connection management and deployment challenges.

Why object storage technology

Businesses need an architecture that's more scalable and provides an easier way to manage and access data. The enterprise also still requires availability and access control, even if the performance requirements are different than those of traditional storage architecture. Object storage is designed for the scale, characteristics, and requirements of unstructured data. By creating an interface that isn't encumbered by the design restrictions of file and block, but is optimized for unstructured data, it's possible to create a cluster architecture that breaks out of typical scale-out storage architectural drawbacks.

Object Storage Architecture Details

Object storage allows the storage of arbitrary-sized objects using a flat, wide namespace where each object can be tagged with its own metadata. This simple architecture makes it much easier for software to support massive numbers of objects across the object store. The APIs provided by the object storage gateway add an additional layer above objects, called containers (Swift) and buckets (S3), to hold groupings of objects.

To access the storage, a RESTful interface is used, which provides better client independence and removes state tracking load from the server. HTTP is typically used as the transport mechanism to connect applications to the data, so it's very easy to connect any device over the network to the object store.

The IO interface is designed for static data. There are no file handles, concerns for locking, or reservations on objects. An S3 or Swift API object IO translates to an HTTP PUT (write) for the entire object, an HTTP GET (read), or an HTTP DELETE. Along with the flat structure, it's much easier for the storage architecture to support client concurrency because write concurrency doesn't exist. If multiple clients attempt to write to the same object, one version will "win"; the entire resulting object will be coherent with a given client object PUT. Which version wins may not be easy to predict, so what simplifies the storage architecture could impact client software. Object storage commonly includes multi-tenancy with access keys and ACLs for storage.

With a metadata-rich focus, object storage is built around what is in the data rather than where it's located. That means the work to guarantee enterprise availability (sites, replica counts, etc.) stays in the cluster; client code stays focused on the data context. At the core of the object storage concept is the way clients leverage a (relatively) flat namespace, metadata tags on objects, and the RESTful interface. Various object storage interfaces may have more or less hierarchy in the namespace, allow partial writes to existing objects (RADOS does this), or might not require client features such as access keys or ACLs. Because this document covers object storage access through the APIs provided by the object storage gateway, HP has provided additional details specific to those interfaces.

Key solution technologies

Using industry-standard servers as cluster components gives enormous flexibility for customizing, configuring, and balancing cost for the use case (CPU per disk, storage density, network infrastructure, etc.). At massive scale, the costs of the cluster building blocks add up, so choosing the right components for the task makes a difference.

It's very important for enterprise adopters to develop a roadmap for understanding and implementing a maintainable object storage solution. As an early adopter of object storage in general, and an open source solution in particular, expect to realize cost and feature benefits that can make a real difference operating at scale. But also plan for an engineering load, both to support a Ceph cluster and to develop code to utilize object storage.

Cluster architecture

A Ceph cluster is an SDS architecture layered on top of traditional server storage. It provides a federated view of storage across multiple industry-standard servers using block storage and traditional file systems, and does this with an object storage architecture. This approach has the advantage of leveraging existing work and standard hardware where appropriate, while still providing the overall solution scale and performance needed. See Ceph Architecture for more details. The core of mapping a GET/PUT or block read/write to Ceph objects from any of the access methods is CRUSH (Controlled Replication Under Scalable Hashing), the algorithm Ceph uses to compute object storage locations. All access methods are converted into some number of Ceph native objects on the back end.
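To make the Swift-style PUT/GET/DELETE model described above under Object Storage Architecture Details concrete, here is a minimal sketch using python-swiftclient against a Ceph object gateway. The endpoint, user, and key are placeholder assumptions, not values from this reference configuration.

```python
# Minimal sketch of object PUT/GET/DELETE against a Ceph object gateway via the Swift API.
# The endpoint, user, and key below are placeholders for whatever gateway accounts exist
# in your environment; they are not values from this reference configuration.
import swiftclient

conn = swiftclient.Connection(
    authurl='http://rgw.example.com/auth/v1.0',  # hypothetical gateway auth endpoint
    user='testacct:swift',                       # hypothetical subuser created on the gateway
    key='SECRET_KEY',
    auth_version='1.0')

conn.put_container('demo')                                       # create a container
conn.put_object('demo', 'hello.txt', contents=b'hello object',   # HTTP PUT of the whole object
                headers={'X-Object-Meta-Color': 'blue'})         # per-object metadata tag
hdrs, body = conn.get_object('demo', 'hello.txt')                # HTTP GET returns headers + data
conn.delete_object('demo', 'hello.txt')                          # HTTP DELETE
```

Note that there is no open/lock/close cycle: each call is a complete, stateless HTTP operation on a whole object.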
Cluster Roles

There are three primary roles in the Ceph cluster covered by this sample reference configuration:

- OSD Host: The HP ProLiant SL4540 Gen8 Server is presented as the object storage host; this is how Ceph terms the role of the server storing object data. The Ceph OSD Daemon is the software that interacts with the OSD (Object Storage Disk); for production clusters there's a 1:1 mapping of OSD Daemon to logical volume. The default file system used on an OSD for this sample reference configuration is xfs, although btrfs and ext4 are also supported.
- Ceph Monitor (MON): A Ceph Monitor maintains maps of the cluster state, including the monitor map, the OSD map, the Placement Group map, and the CRUSH map. Ceph maintains a history (called an "epoch") of each state change in the Ceph Monitors, Ceph OSD Daemons, and PGs.
- Object Gateway (RGW): An object storage interface that provides applications with a RESTful gateway to Ceph Storage Clusters. The Ceph Object Storage Gateway supports two interfaces, S3 and Swift. These interfaces support a large subset of their respective APIs as implemented by Amazon and OpenStack Swift.

There is also a metadata server role for file system support, which is outside the scope of this reference architecture.

How RADOS IO Works

This is a view of IO at a relatively high level. Ceph architecture documentation and source code provide more detail, but understanding how IO functions helps show how the cluster provides unified control of all the storage and protects data. To start with, here's a model of how Ceph is accessed by clients and how that's layered on top of the object storage architecture.

Figure 1: Cluster Access Methods

The core of mapping an HTTP GET/PUT or block read/write to Ceph objects from any of the access methods is CRUSH (Controlled Replication Under Scalable Hashing), the algorithm Ceph uses to compute object storage locations. Per the picture above, all access methods are converted into some number of Ceph native objects on the back end.

The Ceph storage system supports the notion of pools, which are logical partitions for storing objects. Pools set the following parameters:

- Ownership/access of objects
- The number of object replicas
- The number of placement groups
- The CRUSH rule set to use

Ceph Clients retrieve a cluster map from a Ceph Monitor and write objects to pools. The pool's size (number of replicas), the CRUSH rule set, and the number of placement groups determine how Ceph will place the data.

Figure 2: Client IO to a Pool

Each pool has a number of placement groups (PGs). CRUSH maps PGs to OSDs dynamically. When a Ceph Client stores objects, CRUSH maps each object to a placement group. Mapping objects to placement groups creates a layer of indirection between the Ceph OSD Daemon and the Ceph Client. The Ceph Storage Cluster must be able to grow (or shrink) and rebalance where it stores objects dynamically. If the Ceph Client knew which Ceph OSD Daemon had which object, it would create a tight coupling between the Ceph Client and the Ceph OSD Daemon. Instead, the CRUSH algorithm maps each object to a placement group and then maps each placement group to one or more Ceph OSD Daemons. This layer of indirection allows Ceph to rebalance dynamically when new Ceph OSD Daemons and the underlying OSD devices come online. The following diagram depicts how CRUSH maps objects to placement groups, and placement groups to OSDs.

Figure 3: Mapping Objects to OSDs

The leverage of existing storage technology takes place under the OSD Daemon. When RADOS object data is written, it's currently written as a file within a directory on the OSD. There's more to it than that (the metadata must also be committed separately, and Ceph reserves some storage for journaling), but distribution across the file system is essentially how object data and placement groups are implemented.

Scaling/Consistency/Failure Handling

With an understanding of the roles of the cluster and how data is stored, it's also important to understand how the integrity of data is protected and maintained.

Replication: In addition to the benefit of data locality, replication provides the failure tolerance required at large scale. Like Ceph Clients, Ceph OSD Daemons use the CRUSH algorithm, but the Ceph OSD Daemon uses it to compute where replicas of objects should be stored (and for rebalancing). For the recommended configuration, there are three copies of any object data written: one on the Primary OSD for the placement group and two replicas. This replication level is user configurable; the default without modifying ceph.conf is 2 replicas.

In a typical write scenario, a client uses the CRUSH algorithm to compute where to store an object, maps the object to a pool and placement group, then looks at the CRUSH map to identify the primary OSD for the placement group. The client writes the object to the identified placement group in the primary OSD. Then the primary OSD, with its own copy of the CRUSH map, identifies the secondary and tertiary OSDs for replication purposes, replicates the object to the appropriate placement groups in the secondary and tertiary OSDs, and responds to the client once it has confirmed the object and its replicas were stored successfully.

Figure 4: Replication

This model offloads replication to the OSD hosts; the client only has to drive data for the primary write. The picture above shows the three copies of an object used in the sample reference configuration; the Primary OSD will drive as many replicas as are defined by the target pool.

Peering and Acting Sets: The Ceph Storage Cluster was designed to store at least two copies of an object, which is the minimum requirement for data safety. For high availability, a Ceph Storage Cluster should store more than two copies of an object (e.g., size = 3 and min size = 2) so that it can continue to run in a degraded state while maintaining data safety. Referring back to the replication diagram, the Ceph OSD Daemons are not specifically named (e.g., osd.0, osd.1, etc.), but rather referred to as Primary, Secondary, and so forth. When a series of OSDs are responsible for a placement group, they're referred to as an Acting Set. An Acting Set may be the Ceph OSD Daemons that are currently responsible for the placement group, or the Ceph OSD Daemons that were responsible for a particular placement group as of some epoch. By convention, the Primary is the first OSD in the Acting Set; it is responsible for coordinating the peering process for each placement group where it acts as the Primary, and is the only OSD that will accept client-initiated writes to objects for a given placement group where it acts as the Primary. This behavior also applies to removal/failure of an OSD: object copies get remapped from other elements of the Acting Sets onto free cluster storage.

Rebalancing: When a Ceph OSD Daemon is added to a Ceph Storage Cluster, the cluster map gets updated with the new OSD. Consequently, object placement also changes, because the new OSD changes an input to the CRUSH map calculations. The process of redistributing objects across the Ceph cluster is termed rebalancing.

Data Consistency: Ceph OSDs can compare object metadata in one placement group with its replicas in placement groups stored on other OSDs. Scrubbing (usually performed daily) catches OSD bugs or file system errors. OSDs can also perform deeper scrubbing by comparing data in objects bit-for-bit. Deep scrubbing (usually performed weekly) finds bad sectors on a disk that weren't apparent in a lighter scrub.

Failure Domain, CRUSH map: An important element of maintaining fault tolerance is determining where data should be placed for optimal resiliency. Rather than depend on expensive hardware redundancy for all interconnects, the scale-out design principle is to assume some components will fail and to keep the data available by properly partitioning failure domains. For a simpler configuration, it's OK to assume that distributing copies across servers will provide reliability. As clusters scale out, it becomes important to separate replicas across racks, power sources, network switches, or even data centers to reduce the likelihood that a failure event will have untenable impacts. The default CRUSH map generated by ceph-deploy cluster install will be functional, but for more complicated configurations customizing the CRUSH map is required. Tuning the map for the failure domains of the cluster helps optimize performance, improves reliability and availability, and aids manageability.

Value of a purpose-built enterprise hardware platform

An important part of planning Ceph cluster architecture is determining what kind of hardware it runs on. HP hardware brings value to the solution in these ways:

- Flexible compute/storage ratio: With one, two, and three compute node chassis available, you can choose the HP ProLiant SL4540 Gen8 Server model that delivers the optimal storage-to-compute ratio for your object storage access workloads.
- Converged design: The HP ProLiant SL4540 Gen8 Server delivers increased storage density at lower cost.
- Flexible IO bay: Storage controller and network interfaces are contained within each compute node's IO bay. 10GbE and 1GbE networking options ensure support for industry-standard network infrastructures.
- Power management: The SL Advanced Power Manager provides dynamic power capping and asset management features that are standard across the HP ProLiant SL line. The converged HP ProLiant SL4540 Gen8 Server chassis also yields power savings via shared cooling and power resources.
- Solution integration and data center acceptance: HP hardware described in this reference architecture has been qualified together. This means no work building, maintaining, and qualifying white box architectures for the cluster. HP hardware can be validated with confidence.
- Enterprise support: Get dedicated solution and support resources from HP, a trusted enterprise partner. At massive scale, system failures become part of the design even with the most reliable components. Therefore, it's critical to have good support infrastructure to keep system reliability and availability at acceptable levels.
- Enterprise-class storage components: HP Smart Array controllers provide a robust storage solution within the server. HP drive carriers allow easy drive swaps and collect triage information to help the RMA process not only replace failures but avoid future problems. Issues like bad drive batches and drive firmware problems are very significant for solutions that consume and scale with large quantities of drives. HP server solutions are built to manage storage failures; HP also qualifies and supports hard drives to minimize failure impacts.
- HP Integrated Lights-Out (iLO): HP iLO is an industry-leading embedded monitoring solution. Its agentless management, diagnostic tools, and remote support allow entire data centers to be managed with ease.
- Enterprise-class management: HP OneView provides infrastructure management for hardware at scale. Along with all the other platform management value it brings, HP OneView links failed cluster components to the location of hardware in the infrastructure.

Open source value

Businesses that use open source value the control and cost benefit it brings. Linux and Ceph provide the kind of robust and functional open source solutions these businesses want. The tradeoffs of engineering and support effort versus a typical closed-source enterprise solution make sense for them.

Control: With access to the source, a business can customize solutions as needed. They can also apply bug fixes or roll new features as needed, with the ability to see exactly what's changing. There's no concern about the provider of a solution going away and making the solution unsupportable.
As an open standard, a Ceph cluster is not tied to particular hardware. This means expansion or refresh of cluster hardware is not locked in to any vendor; choose the hardware that's the right fit for the business case and solution parameters. The HP ProLiant SL4540 Gen8 Server was designed with storage density and compute ratios that match unstructured data retention and processing needs, so it's a good match for Ceph cluster storage design.

Cost: People not familiar with open source may believe that it's free because it's freely available. There are, of course, engineering costs for configuring and maintaining open source, and there are many valuable commercial software and support solutions built on top of open source. However, closed source clusters can add significant cost per server for licensing and support, which adds up at massive scale. Because there isn't a paid license for open source software, adding nodes doesn't add upfront license costs. Also, building up expertise on the solution in-house can pay off by reducing the operating expense (OpEx) required to support each cluster node. A proper analysis of the costs and scope of supporting an open source solution can realize significant savings at scale.

Ceph value

Active community: It's important that the community supporting an open source solution and code base is active. Ceph fits that description; in 2013 alone it grew its author pool from 103 to 203 and accepted major source contributions from significant storage industry players. The community held the inaugural Ceph Developer Summit and organized Ceph Days for education and idea exchange.

Inktank is the company delivering Ceph, and its goal is to drive widespread adoption of SDS with Ceph and help customers scale storage to the exabyte level and beyond in a cost-effective way.

Enterprise solutions and support: While Ceph is in use for a variety of business cases, there's ongoing work to support the needs of enterprise deployments beyond just hardening. If the business requires it, Inktank provides professional solution support for the cluster and professional services, such as performance tuning to maximize use of cluster resources. Inktank is also creating robust enterprise management software for Ceph. The graphical manager, named Calamari, accelerates and simplifies cluster management by showing the performance and state data needed to operate a Ceph cluster. Calamari already includes cluster management and will add support for analytics. Additionally, Ceph will add support for SNMP and hypervisors like Microsoft Hyper-V and VMware to allow better integration of a Ceph cluster into the data center cloud environment.

Figure 5: Calamari v1.1 Screenshot

Use storage that matches the needs of the data: Ceph's cluster reliability allows utilizing non-enterprise-class drives for significant savings at scale. If faster storage is needed, Ceph can be configured to restrict a pool to a more performant tier, which is particularly useful for RADOS Block Devices (RBD). With replication, data consistency, and the cluster reliability of a properly tuned CRUSH map, Ceph provides the enterprise data availability and durability required at petabyte scales and beyond.

Flexible access methods: Ceph can provide many different methods of storage access within a single storage cluster; this whitepaper covers object gateway and block access, but file and native RADOS methods are also available. For any storage access, customers generally want methods that are supported across many storage systems. To this end, the object gateway converts S3- and Swift-compatible APIs to RADOS objects. This allows existing libraries or applications that use these APIs to be leveraged rather than rewritten, so use cases like hybrid public/private cloud setups, S3 repatriation, or heterogeneous object solution environments can share and reuse more code. Ceph can also present cluster storage with a block interface that is supported natively in the Linux kernel. Traditional block-focused applications and standard OS file systems can then leverage cluster storage. Within a cloud environment, RBD integrates with OpenStack Cinder/Glance and can be used directly by Linux VMs themselves. Although CephFS has had less development focus (current plans are to release updated support towards the latter half of 2014), it's still viable and in use for scale-out file system data cases today. It also provides HDFS offload, leveraging replication's data locality along with overall cluster management.
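As an illustration of leveraging an existing S3 library rather than rewriting application code, here is a minimal sketch using the boto library against the object gateway. The host, access key, and secret key are placeholders for a gateway user created in your own environment; they are assumptions, not values from this reference configuration.

```python
# Minimal sketch: reuse a standard S3 client library (boto) against the Ceph object gateway.
# Host, access key, and secret key are placeholder assumptions for a gateway user that
# already exists; they are not values from this reference configuration.
import boto
import boto.s3.connection

conn = boto.connect_s3(
    aws_access_key_id='ACCESS_KEY',
    aws_secret_access_key='SECRET_KEY',
    host='rgw.example.com',                 # hypothetical gateway endpoint
    is_secure=False,                        # plain HTTP; the gateway can also listen on HTTPS
    calling_format=boto.s3.connection.OrdinaryCallingFormat())

bucket = conn.create_bucket('demo-bucket')           # S3 bucket = grouping of objects
key = bucket.new_key('hello.txt')
key.set_contents_from_string('hello from S3 API')    # HTTP PUT through the gateway
print(key.get_contents_as_string())                  # HTTP GET
```

The same application code can point at a public S3 endpoint or at the on-premises gateway, which is what makes the S3 repatriation and hybrid cloud cases above practical.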

Releases that meet risk and feature requirements: As involvement in Ceph has increased, it has kept a rapid pace with both new feature introduction and bug fix/stability work. Multi-site federation had its first complete release in Emperor. There are many compelling features coming down the road with Ceph, such as Erasure Coding, Cache Tiering, and improvements to CephFS. At the same time, it's understood that not everyone running a cluster wants to introduce and integrate major features at a high rate. For that class of customer there are also long-term support releases where the primary focus is on bug fix and stability work.

Solution diagrams

The block diagram below has connections to infrastructure outside of the sample reference configuration. The 1GbE Management Network, labeled in blue, has an uplink to the larger management network, and the management network has a route to the internet. Internet access allows the rack management network infrastructure to reach Ubuntu and Ceph package repositories, while the uplink connects the cluster with other management servers and consoles. On the same diagram, the external link on the 10GbE Data Network, labeled in green, connects the cluster to client machines and load balancing. A rack diagram is also shown for a more hardware-platform-based view. While clients and load balancing are important components of a full solution, planning and scaling in this document doesn't cover traffic-generating hardware. Primarily, this paper focuses on reference architecture for Ceph storage clusters, not guidance on total datacenter configuration or client use case. Some details are given in the description of testing to help with benchmark reproduction.

Figure 6: Sample Reference Configuration Block Diagram

Figure 7: Sample Reference Configuration Rack Diagram

Solution components

Component choices

This section describes the more detailed reasoning behind some of the hardware and software components chosen for the sample reference configuration. Decisions made for component sizing in the cluster (compute, memory, storage, networking topology) are described under Configuration guidance.

Operating system

Ubuntu is the Linux distribution that has been tested the most with Ceph, and the Ubuntu LTS release is the most appropriate current release for enterprise use. The Long Term Support (LTS) version focuses on stability rather than on including the newest features. It is also the only cluster Ubuntu version supported by the Ceph Calamari management software at the time of writing. HP strongly recommends having internet access available for OS install and Ceph cluster setup, especially for package management. If installing Ubuntu completely from media, the kernel and OS packages should be updated from what initially shipped to current. It's also possible to install Ceph from source code and/or set up internal package repositories.

Other distributions: Ceph also supports certain later, non-LTS Ubuntu releases and will support the next Ubuntu LTS release in the future; not every Ubuntu release is officially supported as of this writing. Red Hat is better supported now than in the past; HP hasn't attempted to characterize a non-Ubuntu distribution at this time. More detail about which versions are supported by Ceph is available from the official Ceph documentation pages. OS configuration instructions and samples in this document come from the Ubuntu LTS release used in the sample reference configuration. If other releases or distributions are used, this document's content must be adapted accordingly.

Switches

Top of Rack (TOR) switches (HP 5900AF-48XG-4QSFP+) for data and replication traffic: The HP 5900AF-48XG-4QSFP+ 10GbE high-density, ultra-low-latency TOR switch provides IRF bonding and sFlow, which simplify the management, monitoring, and resiliency of the network. This model has 48 10-Gigabit/Gigabit SFP+ ports plus four QSFP+ 40-Gigabit ports for ultra-high-capacity connections. The high-performance 10GbE networking provides a cut-through and nonblocking architecture, delivering industry-leading low latency (~1 microsecond) for very demanding enterprise applications. The switch delivers 1.28 Tbps switching capacity and a correspondingly high packet forwarding rate, in addition to incorporating 9 MB of packet buffers.

Figure 8: HP 5900AF-48XG-4QSFP+ Top of Rack (TOR) switch

Top of Rack (TOR) switch (HP 2920) for HP iLO and management: The HP 2920 is an ideal TOR 1GbE switch for denser rack configurations, with up to four 10GbE uplinks and 48 1GbE ports. A dedicated management switch for HP iLO traffic is required for the HP ProLiant SL4540 Gen8 Server, and this also helps segment other non-cluster traffic (SSH connectivity, package updates).

Figure 9: HP 2920 Top of Rack (TOR) switch

Both of the switches referenced are rear-facing, in that the cables for the switch are connected on the same side of the rack as the cables that connect to the NICs at the back of the HP ProLiant SL4540 Gen8 Servers.

Server selection

Within this architecture, the cluster can be scaled effectively while using the same server hardware. This section briefly covers the sample reference configuration server choices.

Management nodes

The 1U HP ProLiant DL360p Gen8 Server is a dual-socket server with a choice of Intel Xeon processors (including v2 options), up to 768GB of memory, and two expansion slots. Network connectivity can be provided through FlexibleLOM in a 4x1GbE NIC configuration or a 2x10GbE configuration. For storage, various configurations are available with LFF or SFF drives on an HP Smart Array P420i controller. The HP ProLiant DL360p Gen8 Server was chosen to keep rack space requirements minimal for nodes where storage density is not the issue, while still providing good network bandwidth and compute power. An 8SFF drive configuration is used in the sample reference configuration, but the storage on the HP ProLiant DL360p Gen8 Server is not particularly important to Ceph functionality beyond providing a reliable mirrored OS boot drive.

Figure 10: HP ProLiant DL360p Gen8 Server

Object storage nodes

The two-node configuration of the HP ProLiant SL4540 Gen8 Server consists of up to two compute nodes and a total of 50 large form factor (LFF) 3.5" hard disk drives (HDD) in the chassis. The HP ProLiant SL4540 Gen8 Server is a dual-socket server, with a choice of five different Intel Xeon processors, up to 288GB of memory, and one PCIe expansion slot per node. Every compute node also has its own dedicated networking ports, with 1GbE and 10GbE choices available. The HP ProLiant SL4540 Gen8 Server was chosen as the chassis due to its focus on rack storage density, matching unstructured storage requirements. The 2x25 configuration specifically gives a good balance point between maximum CPU per node and colder-use-case storage density for tests that create a performance baseline. Testing was done with 4.86 array controller firmware on the SL4540's P420i; ensure Smart Array controller firmware is at the latest release before doing scale testing or deployment.

Figure 11: HP ProLiant SL4540 (2 x 25) Gen8 Server

Sample reference configuration design

The sample reference configuration could have represented anything from a minimal test configuration to multiple performance-optimized data centers (PODs). A bill of materials (BOM) of five HP ProLiant SL4540 Gen8 and three HP ProLiant DL360p Gen8 Servers was chosen because it's a size representative of enterprise data needs without being too large to be a reasonable initial deployment for many readers. For raw capacity, this configuration could reach 1PB (50 x 4TB drives x 5 chassis), which is both a good conceptual scale number and a point where enterprise platform architectures make TCO sense versus smaller white box configurations; a worked capacity sketch follows below. 1PB raw of 4TB drives, even in 12-drive 2U boxes, would still consume an entire 42U rack (with no space for TOR switching, etc.), and more standard scale-out platforms are even less rack space/port efficient. HP's drive choice was a lower cost/density but still performant midline 3TB device. Another important storage design choice in this reference configuration is the use of SSD journals. Due to the architecture of Ceph's object commits, significant PUT performance is gained by committing the journal and data parts of the object IO on different devices.
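As a quick illustration of the raw versus usable capacity math implied by the drive counts above and the three-replica pools used in this configuration, here is a small sketch. The two drive sizes are the options discussed in this section, and the usable figure deliberately ignores file system, journal, and OSD overhead.

```python
# Back-of-the-envelope capacity sketch for the sample reference configuration:
# 5 SL4540 chassis x 50 LFF drives each, with 3 object replicas per pool.
# Ignores file system, journal, and OSD overhead, so real usable capacity is lower.
CHASSIS = 5
DRIVES_PER_CHASSIS = 50
REPLICAS = 3

for drive_tb in (3, 4):                      # 3TB drives were used; 4TB shows the 1PB raw point
    raw_tb = CHASSIS * DRIVES_PER_CHASSIS * drive_tb
    usable_tb = raw_tb / float(REPLICAS)     # one logical copy per three raw copies
    print("%dTB drives: raw %d TB, ~%.0f TB usable at 3x replication"
          % (drive_tb, raw_tb, usable_tb))
```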

The sample reference configuration fits in a single rack but is scalable in some important ways. The rack reserves space for further HP ProLiant SL4540 Gen8 Server scaling or other datacenter equipment. It's relatively simple to extend this configuration to multiple racks by replicating elements of the BOM and distributing monitors and object gateways across the racks. Multi-site and multiple-pod deployments can get more complicated than the sample reference architecture shows, as object gateway and monitor configuration involve hardware selection and failure domains beyond the scope of a single data center. However, for a POC or initial cluster implementation, what's here is a good conceptual start.

Licensing and support

The BOM lists service, support, and licenses for iLO. These are important for a scale-out solution with industry-standard servers, as they provide the reliability and management required to operate petabyte-scale clusters and beyond. HP iLO provides the foundation for linking the hardware platform to cluster performance, along with remote hardware management. HP service and support provide expertise through setup, operation, and escalation for issues with HP-provided hardware. Inktank support on the software side is also recommended to protect your cluster investment. Inktank Ceph Enterprise is a subscription combining the most stable version of Ceph for object and block storage with the Calamari graphical manager, enhanced integration tools, and support services. Inktank also provides expertise and professional services for your Ceph cluster.

Workload testing

A guiding principle of testing for this paper is to produce performance results that are simple to replicate. Although object storage does not have the same SLA as traditional storage, performance benchmarking shows where bottlenecks are and where resources could be allocated for solution requirements. Object and block data are the chosen focus cases for unstructured data use on Ceph. Traffic to the Ceph object gateway uses the Swift API, which has the advantage of a similar API and traffic generation test tool for both OpenStack Swift and Ceph testing. These results help set expectations for performance at a given level of load and type of access, which helps with scale planning and matching cluster capabilities to the use case. Ceph provides a native object API with librados and a test tool in rados-bench; while rados-bench is a very useful tool for testing Ceph cluster performance, HP believes customer object storage use cases are more likely to focus on the APIs provided by the object gateway.

Workload description

Without a recognized standard for object storage benchmarking, baseline data comes from canned IO testing. The test takes objects of varying sizes and is run long enough to achieve a reliable average performance sample.

Test matrix configuration

The cluster is pre-seeded with 1000 accounts: 100 containers per account and 100 objects per container. This gives a representation of IO running on a system with used capacity; a pre-seeding sketch is shown after the object testing description below. Runs operate at a fixed number of processes/threads (30) using three traffic generators for all tests.

Test matrix terminology

- A Suite is all tests run for a given access method
- A Pass is all types of test at a given object/block size
- A Phase is a single type of test at an object size
- A Step is any subdivision of a Phase

Object testing

- Test passes are done at 1KB, 16KB, 64KB, 128KB, 512KB, 1MB, 4MB, 16MB, and 128MB object sizes
- The account, containers, and objects being accessed by the test are not pre-seeded
- Each tested pass has phases at 100% PUTs, 100% GETs, and then a 90%/10% GET-to-PUT mix
- Object count for a pass is chosen so the 100% PUT phase lasts at least 30 minutes
- On the MIX phase, PUTs are done as step #1. Step #2 is GETs to the step #1 objects and PUTs using a new object name prefix
- There are no object DELETEs between sizes

This suite consists of pure write and pure read load tests, and then a mixed test to represent an active cluster with mostly static data being read (but some ongoing writes). As load tests are a warmer use case, the MIX approximates a file hosting service load. GETs must be performed after PUTs, so file system cache impact occurs. Although object storage does not have the same SLA as traditional storage, performance benchmarking still reveals bottlenecks and resource restrictions.
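The pre-seeding described under Test matrix configuration (1000 accounts, 100 containers each, 100 objects per container) can be approximated with a small loop against the gateway's Swift API. This is a simplified sketch rather than the tooling used for the published results; the endpoint, credentials, account naming scheme, and 4KB seed object size are assumptions.

```python
# Simplified pre-seed sketch: populate containers with small objects before timed runs.
# This is not the tool used for the published results; the endpoint, credentials, naming
# scheme, and 4KB object size are illustrative assumptions.
import swiftclient

ACCOUNTS = 1000
CONTAINERS_PER_ACCOUNT = 100
OBJECTS_PER_CONTAINER = 100
payload = b'\0' * 4096          # assumed seed object size

for acct in range(ACCOUNTS):
    # One gateway user per seeded account (e.g. seed0:swift) is assumed to already exist.
    conn = swiftclient.Connection(
        authurl='http://rgw.example.com/auth/v1.0',
        user='seed%d:swift' % acct, key='SECRET_KEY', auth_version='1.0')
    for c in range(CONTAINERS_PER_ACCOUNT):
        container = 'cont%03d' % c
        conn.put_container(container)
        for o in range(OBJECTS_PER_CONTAINER):
            conn.put_object(container, 'obj%03d' % o, contents=payload)
    conn.close()
```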

Block testing

- Test phases for random IO are 8K read, 8K write, and a 70% read/30% write mix. Test phases for sequential IO are 256K read and 256K write
- All block IO is submitted to the same 4TB RADOS block device, mapped to all three traffic generators. The sequential tests are started at offsets of 0, 1, and 2TB on the block device
- The RBD pool was left at the default 4M striping
- Block IO test passes last 30 minutes each
- The test was set up with a level of load that is reasonably stressful, to characterize the cluster rather than chase maximum performance. The iodepth was set to 8, and the ioengine used was asynchronous

These mixes are a characterization of real-world small-block random and large-block sequential loads, respectively. Tests like these are a subset of common canned block benchmark loads and are representative of IO on VM boot/data drive images. Unlike object testing, there isn't as much opportunity for caching on the reads: they're done before a write phase or with random distribution across the pool. The biggest performance-limiting factor here is the amount of load from the test, not an element of the cluster. The block test suite represents load on a cluster with performance headroom.

Bounding principles and choices

With a large number of variables and a lot of data to present, the test matrix was chosen as a good representation of cluster behavior under load while limiting scaling and tuning variables. This type of benchmarking won't represent production traffic, but it does form a base for the reader to extrapolate from when configuring their own cluster. Important factors to consider about the tests chosen:

- Without a particular use case to simulate, the test standardizes on a single thread count to stress the system but not overly thrash it. There's no perfect thread count across all object sizes, so the number aims for a good fit.
- Traffic generators were pushed to utilize as much network bandwidth and CPU as possible. This means very few clients are required to saturate resources. A production environment usually has less bandwidth per client and a higher number of clients for application load, but that's a variable HP is not currently prepared to model in a way useful to the reader. High-bandwidth PUT tests are unlikely to be useful for colder object storage load planning.
- DELETEs are not benchmarked from a performance standpoint for a few reasons. They didn't seem as critical to system performance planning, as this class of data is purged infrequently. Under load, DELETE variance per object size was much less significant than the variances for GET and PUT. DELETEs are also time-consuming to gauge accurately at larger object sizes, since it takes so much more time to write the objects than it does to acquire a significant DELETE sample.

Workload generator tools

These are brief descriptions of the tools used to create and characterize the workload. See Appendix H: Workload Tool Detail for more information, including how to get the tools.

getput: To test REST API object interface performance through the object storage gateways, an HP-authored tool named getput is used. Getput is written in Python and uses the Swift client library to do Swift object IO. The getput program is the workhorse piece; the author of getput also has code for building test suites and synchronizing getput runs across multiple traffic generators.

fio: For block testing, fio is utilized. It's a publicly available tool that will spawn threads doing a user-specified IO mix. It's fairly common to use fio for both benchmarking and stress/hardware verification.
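As a sketch of how the block-test phases described under Block testing above might be reproduced, the following drives fio from Python with the documented parameters (8K random IO, 256K sequential IO, iodepth 8, asynchronous libaio engine, 30-minute time-based passes). The mapped RBD device path is an assumption; it depends on how the image was created and mapped on each traffic generator, and the per-generator sequential offsets (0, 1, and 2TB) are omitted for brevity.

```python
# Sketch: drive fio with the block-test parameters described in this section.
# /dev/rbd0 is an assumed device path for the kernel-mapped 4TB RADOS block device;
# per-generator sequential offsets (0/1/2TB) are omitted for brevity.
import subprocess

DEVICE = '/dev/rbd0'      # assumed path of the mapped RBD image
RUNTIME = 30 * 60         # 30-minute passes

phases = [
    ('seq-write',  'write',     '256k'),
    ('seq-read',   'read',      '256k'),
    ('rand-read',  'randread',  '8k'),
    ('rand-write', 'randwrite', '8k'),
    ('rand-mix',   'randrw',    '8k'),   # mixed phase; read share set below
]

for name, rw, bs in phases:
    cmd = ['fio', '--name=%s' % name, '--filename=%s' % DEVICE,
           '--rw=%s' % rw, '--bs=%s' % bs,
           '--ioengine=libaio', '--iodepth=8', '--direct=1',
           '--time_based', '--runtime=%d' % RUNTIME]
    if rw == 'randrw':
        cmd.append('--rwmixread=70')     # match the 70% read / 30% write mix
    subprocess.check_call(cmd)
```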
collectl: During testing, collectl is run to gather periodic samples of CPU, memory, and disk stats on an object gateway and on one of the OSD hosts. Collectl gathers performance data on a number of subsystems and allows later playback of the performance samples to filter information.

haproxy: An open source load balancer is used as a way of demonstrating a simple way to connect clients to a number of object gateways. The test configuration uses a single load balancer with one 10GbE port between the traffic generator clients and the object storage gateways. This does restrict overall bandwidth for object storage benchmarking, but keeps the configuration simple.

Workload results and analysis

These results cover bandwidth, IOPS, and latency data for the object and block IO tests. Object data also includes CPU usage graphs representing load on an OSD host and the object gateways. IO results are the sum of the three traffic generator client results.

General points

The analysis details will help with cluster planning decisions versus the target workload/use case, but a few general points can be derived from the data:

- Reads are significantly more performant than writes at the same size
- Writes mixed with reads have a noticeable impact on read performance
- Object IO maximum latency can be significant, although maximum-latency cases are atypical

Object testing

There are two IO sizes of note in the object matrix. One is 512K, which is typically the largest sequential IO issued at the kernel block layer. The other is 4M, the size of Ceph's RADOS objects in the target pools. Objects greater than 4M submitted using the Swift API must be split into multiple RADOS objects.

While an object server listening on HTTPS was configured and a test suite was run over SSL, the detailed results here are for unencrypted traffic. Expect additional processing load for using HTTPS at the object gateway and on the clients; the largest effect was at the highest object sizes (16M, 128M), where average client utilization increased by a bit over 10% and object gateway load was up 5-8% on average. Peak spikes were also up significantly for HTTPS with large objects; at 128M the PUT and MIX tests rose to the low 40% range, while GETs went from 9% to 17% peak CPU.

The object gateway test infrastructure bottlenecks bandwidth within a single 10GbE link; the results show ~900MB/sec as the roll-off point for average bandwidth. Quick samples show greater peak IO (~1100MB/sec).

Bandwidth & IOPS

[Charts: Bandwidth (MB/sec) and Operations Per Second versus object size (1K to 128M) for the 100% PUT, 100% GET, and 90% GET/10% PUT phases.]

GET ops/sec on the sample reference configuration is significantly higher than PUT ops/sec for object sizes up to the 4M native RADOS object size. From 1K through 512K, the difference is 10x or more. Operation speed is particularly helped by file system caching and the lower cluster load of GETs (reads touch only one object copy, while writes must commit all replicas). Consequently, GET bandwidth ramps a lot faster than PUT bandwidth as object sizes increase in the test matrix. This means 100% GET bottlenecks quickly on networking in the load balancer/object gateway part of this setup. Effectively this happens around 512K, although there's an efficiency dip at the 1M sample. The MIX test has its middle-of-the-graph (64K-1M) results pulled down significantly by the interleaved PUTs; none of those samples are close to 90% of pure GET IO. On the small side (1K, 16K) there's not much stress from GETs, and for larger objects the load balancer/object gateway bottleneck prevents an accurate picture of relative GET/PUT performance.

Latency

Object storage latencies are higher than typical SAN storage latencies. Some of that is expected with the architecture (HTTP server, networking), but those factors don't account for all of the performance impact. Minimum latency data for object IO is less interesting (it's still relatively long compared to block), so those graphs are not presented.

[Charts: average and maximum PUT/GET latency (seconds) versus object size (1K to 128M), and average and maximum latency for the MIX phase over the same sizes.]

Average PUT latencies are in the hundreds of milliseconds, with a significant uptick around 512K versus smaller object sizes. One-to-many Swift IOs that result in a number of RADOS objects being written (16M, 128M) are less latent than (4M IO latency) x (object size / 4M). Average GET latencies are generally faster than PUT latencies, which agrees with the bandwidth and IOPS results above. Only the 128M sample doesn't show much difference between PUT and GET latency.

Maximum latency for GETs is not much worse than average latency, and is under 20 seconds across the board. Maximum PUT latency is greater for all but the 128M case. The very large spike at 16M appears to be an aberration compared with the average latency, but it's important to note that maximum latency for object PUTs can be in the tens of seconds.

The MIX test is a more complicated picture. The PUT and GET average latencies track each other almost identically, and are similar to the 100% GET and 100% PUT test passes. Max latencies form an S-curve, spiking in the range between 64K and 1M. The pull effect of PUTs shows here in the similar maximum latencies for GETs and PUTs; those maximum latencies are greater than for 100% GET but significantly less than for 100% PUT.

CPU%
[Charts: object gateway average and peak CPU%, and OSD host average and peak CPU%, for PUT, GET, and MIX traffic versus object size (1K-128M).]
The results show the selected CPU doesn't go much above 50% even at peak, so there's plenty of CPU headroom.
GET traffic: The object gateway shows average CPU usage highest for small objects, ramping down to fairly minimal around 1M. Small objects are constrained by IOPS processing here. Peak CPU at the object gateway is much higher from 64K to 512K, settling down again in the larger object size ranges. On the OSD host, the average CPU follows the same curve as the object gateway but proportionally less so, since the IO through three object gateways is distributed across the ten nodes of the Ceph cluster. OSD host peak CPU is interestingly different; past the 512K mark the increased impact of IO missing the file system cache is visible.
PUT traffic: The utilization curve on the object gateway is reversed from GETs (most CPU is at the largest object sizes) and never reaches as high. Some of this is from the object gateway issuing the original IO and waiting for the primary OSD(s) to complete the object replication. The higher bandwidth and multiple slices for processing larger objects keep the gateway busier. On the OSD host, PUTs always go to disk and there's a proportionally larger amount of IO from replication. So there's higher load, and large objects can saturate drives depending on cluster object distribution. Note the valley around 512K with maximal block IO efficiency; 64K is also an improved CPU efficiency point versus very small objects.
MIX traffic: On the object gateway, average CPU roughly tracks the lowest common denominator between GETs and PUTs (although there's a noticeable spike at the 16K object size). At peak, the load matches closer to GET, the dominating portion of the 90/10 MIX. There are two points where the MIX test peak is significantly closer to the PUT line: 128K and 4M. OSD average CPU shows very similar tracking to GET IO, with a 16K spike. At peak, the impact of processing PUTs keeps the MIX load around or above what's measured for 100% GET traffic.

Block testing
HP presents less data around RBD traffic than object IO, partly because there's more public content around tuning and performance for RBD. One reason for that is that RBD testing is easier to set up: no object gateway or object storage access code is required, and block storage benchmarking tools are easy to get and well understood. It's recommended to search for some of this other content for more detail; Mark Nelson's performance blog posts at Inktank are a good place to start, as are Ceph community comments around RBD performance.
Bandwidth & IOPS
[Charts: bandwidth (KB/sec) and IOPS for 256K sequential writes, 256K sequential reads, 8K random reads, 8K random writes, and 8K mix test reads.]
The results show a fair amount of bandwidth on the sequential IO tests; the sequential read test is getting over 800 MB/sec with a queue depth of 8. Sequential writes perform at a bit less than half of that; this matches expectations from the way replica traffic behaves and the lighter load in an optimal case (4M object IO). Random IO pure read peaks at 8755 IOPS, with pure write significantly less at about 24% of that total (2084), more of an efficiency drop than just replication overhead. The MIX load is again interesting: a 70% read/30% write split results in a total ops level that's 70% of pure read. Most of this drop is again pull from the load incurred by writes on the read part of the test; read IO drops to 49% of pure read instead of 70%. On the other hand, writes only dropped to 88% of pure write IOPS.
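For readers who want to approximate the block tests, a minimal fio invocation along these lines reproduces the 8K random read pattern at queue depth 8 (a sketch only; the device path, runtime, and job options here are assumptions, not the exact job definitions used for the results above):

# Sketch of an 8K random read test against a hypothetical mapped RBD device.
sudo fio --name=rbd-8k-randread --filename=/dev/rbd0 \
    --ioengine=libaio --direct=1 --rw=randread --bs=8k \
    --iodepth=8 --runtime=300 --time_based

Changing --rw to randwrite, write, or read and --bs to 256k covers the other test types listed above.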

Latency
[Charts: minimum, maximum, and average latency (ms) for 256K sequential writes, 256K sequential reads, 8K random reads, 8K random writes, and 8K mix test reads.]
Average latency at these test loads is mostly sub-10ms (with the exception of sequential writes at 15ms). The top two most common latency categories in the sampling ranged from two-thirds to almost 95% of the entire latency sample, so the performance is fairly stable as well. Maximum latency ranged from about 1 second for sequential writes up to almost 3 seconds for random writes; long, but tolerable for block storage error handling.

Configuration guidance
This section covers how to create a Ceph cluster to fit your business needs. The basic strategy of building a cluster is this: with a desired capacity and workload in mind, understand where the performance bottlenecks are for the use case, and what failure domains the cluster configuration introduces.
Building your own cluster
General configuration recommendations
The slowest performer is the weakest link for performance in a pool. Typically, OSD hosts should be configured with the same quantity, type, and configuration of storage. There are reasons to violate this guidance (pools limited to specific drives/hosts, federation being more important than performance), but it's a good design principle.
A minimum size cluster has at least three compute nodes hosting OSDs to distribute the three replicas. A minimum recommended size cluster would have at least six compute nodes. The additional nodes provide more space for unstructured scale, help distribute per-node load for operations, and make each component less of a bottleneck.
If the minimum recommended cluster size sounds large, consider whether Ceph is the right solution. Smaller amounts of storage that don't grow at unstructured data scales could stay on traditional block and file, or leverage an object interface on a file-focused storage target. Smaller Ceph clusters do make sense if the use case requires the features of Swift/S3 RESTful interfaces. If the planned solution starts small but scales quickly past the minimum cluster size, then it will benefit from the features of Ceph on HP hardware.
Ceph clusters can scale to exabyte levels, and you can easily add storage as needed. But failure domain impacts must be considered as hardware is added. Even three-way replication may reach an unacceptable data durability level with enough OSDs. Also, what may have been a sufficient failure domain in the initial CRUSH map may not be a good representation as network and power elements are added. Design assuming elements will fail at scale.
Cluster sizing
Compute and memory
For the OSD hosts, the recommendation is to reserve 1GHz from a core of Intel Xeon processing per OSD daemon. If other tasks run on these cluster nodes, consider the sample data in the CPU results chart under the canned tests as a fairly optimal baseline, and select CPU resources accordingly. Balance the power of the CPU selected for the hardware against the failure domain consideration of losing that processing power. Even if there are enough free CPU cycles to run VMs or other Linux services on cluster components, more functionality will be lost if a box running multiple services goes down.
From the official Ceph recommendations, monitors should reserve about 1GB of RAM per daemon instance. The object gateway does not require much buffer for object size load either; in total, the sample reference configuration only needed to reserve a few GB on top of other OS and application requirements. The general memory recommendation is about 2GB of memory per OSD. Normal IO usage is rated at about 500MB of RAM per OSD daemon instance; observations haven't shown much of a memory load during normal operation. During recovery, however, OSDs may use significantly more memory. The canned tests show extra RAM has a noticeable positive impact with file system cache on smaller object IOs, so additional memory can benefit performance too.
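A rough per-node sizing sketch using the rules above (the per-OSD figures come from the guidance; the node layout is a hypothetical example, not a tested configuration):

#!/bin/bash
# OSD host sizing sketch: ~1GHz of Xeon core frequency and ~2GB RAM per OSD daemon.
OSDS_PER_NODE=15          # e.g., one 3x15 SL4540 node with one OSD per disk
GHZ_PER_OSD=1
GB_RAM_PER_OSD=2
echo "CPU needed:  $((OSDS_PER_NODE * GHZ_PER_OSD)) GHz of aggregate core frequency"
echo "RAM needed:  $((OSDS_PER_NODE * GB_RAM_PER_OSD)) GB, plus OS and recovery headroom"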
Choosing disks
Choose how many drives are needed to meet performance SLAs. That may be the number of drives needed to meet capacity requirements, but more spindles may be required for performance or cluster homogeneity reasons. Object storage requirements tend to be primarily driven by capacity, so consider required capacity first. Replica count is the biggest difference between raw and real capacities. There will be an additional configuration loss factor for things like journal capacity, file system format, and logical volume reserved sectors that factor into storage efficiency, but these have significantly less impact than replication. A good estimate ratio to use with the sample reference configuration is 1:3.2 for usable to raw storage. A three-way or greater replica count allows for more distribution of object copies to service reads, but also provides for a quorum on object coherency. Importantly, two disks failing can't cause data loss at these replica levels.
Choose the types of drives to meet requirements, balanced based on price and performance sensitivity and on whether SSDs will be used for journal data. Extrapolate from the performance results versus the business use case to help make this selection. HP drive qualification helps maintain homogeneity here, as drives of the same class and capacity are tuned to have similar performance characteristics regardless of vendor. Unstructured data may not require the performance and 24x7 nature of enterprise-class drives. If this is true for the use case, choose drives that trade performance and availability for cost/GB. As an example, HP midline drives are capable of about 550 TB/year of workload and have both SAS and SATA interfaces.
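A quick sketch of the capacity math, assuming the 1:3.2 usable-to-raw estimate above (the usable capacity target and drive size are hypothetical planning inputs):

#!/bin/bash
# Estimate raw cluster capacity and drive count from a usable-capacity target.
USABLE_TB=500                     # hypothetical usable capacity target
RAW_TB=$((USABLE_TB * 32 / 10))   # 1:3.2 usable-to-raw ratio from the guidance above
DRIVE_TB=3                        # 3TB midline drives, as in the sample configuration
echo "Raw capacity needed: ${RAW_TB} TB"
echo "Approximate drive count: $(( (RAW_TB + DRIVE_TB - 1) / DRIVE_TB ))"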

It's a good idea to buffer some performance in estimates. Complex application loads are not as easy to gauge as a simple canned test load, and production systems shouldn't run near the edge so they can better cope with failures and unexpected load spikes. Some other things to remember around disk performance:
Replica count means multiple media writes for each object PUT.
Peak write performance of spinning media without separate journals is around half, because writes to the journal and data partitions go to the same device.
With a single 10GbE port, the bandwidth bottleneck is at the port rather than the controller/drives on any fully disk-populated HP ProLiant SL4540 Gen8 Server node; the controller is optimally capable of about 3GB/sec, while the effective peak node bandwidth on a 10GbE link looks to be in the 900MB-1GB/sec range out of a theoretical 1.25GB/sec maximum.
At smaller object sizes, the bottleneck tends to be the object gateway's ops/sec capability before network or disk; in some cases, the bottleneck can be the client's ability to execute object operations.
Given the fairly randomly distributed IO load for object data, average per-disk performance from spinning media falls well below its sequential best case, and real-world object gateway performance per disk averages lower still; this is also impacted by object gateways not providing a particularly deep IO queue in observed tests. Peak disk performance can be higher, which is why a 4:1 SSD journal ratio is recommended.
Capacity versus object count
If the use case focuses on many small objects, it may be necessary to get involved in the details of the file systems mounted on each OSD. Because RADOS objects are represented as files, each requires an inode to be allocated. Depending on the file system used and the average object size, it may be necessary to change formatting options to maximize disk usage. As an example, we'll refer to limits for the sample reference configuration. The ceph-deploy program sets up xfs file systems with 5% of capacity as the maximum usable for inodes (xfs dynamically allocates inodes as needed). Using a 2KB xfs inode size on 3TB drives configured as RAID 0 results in about 73.2 million inodes available per drive. Clearly these settings would max out inode usage with 1K objects well before the drive was full of object data. If inode limitations are a concern, plan file system format parameters before installing Ceph on the cluster. Installation of the OSDs is more involved with custom file system settings; reference the official Ceph documentation for details.
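The inode arithmetic above can be sanity-checked with a quick calculation (a sketch; the 5% inode reservation and 2KB inode size are the ceph-deploy/xfs values discussed above):

#!/bin/bash
# Reproduce the ~73.2 million inode figure for a 3TB drive with a 5% inode
# space cap and a 2KB inode size.
DRIVE_BYTES=$((3 * 1000 * 1000 * 1000 * 1000))   # 3TB drive, decimal TB
INODE_BYTES=2048
MAX_INODE_SPACE=$((DRIVE_BYTES * 5 / 100))
echo "Inodes available: $((MAX_INODE_SPACE / INODE_BYTES))"   # ~73.2 million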
Allocating disks to OSD hosts
Choose the server that fits use case needs; for the OSD hosts we'll cover choices using the HP ProLiant SL4540 Gen8 Server. 3x15 units maximize per-node disk utilization on smaller network pipes, or offer a greater network bandwidth to disk ratio. Using 3x15 HP ProLiant SL4540 Gen8 Servers increases compute density in the rack, but would be the least dense choice for storage. The 2x25 and 1x60 configurations increasingly improve storage density in the rack at the expense of compute density, and are therefore good choices for progressively colder storage.
Take the drive pool from the first step and divide it into the desired HP ProLiant SL4540 Gen8 Server node configuration. If SSD journals have been chosen, they'll reduce capacity per node accordingly. As an example, an HP ProLiant SL4540 Gen8 Server with a 4:1 ratio of spinning disks to SSDs would have 12 spinning disks per node in a 3x15 and 20 spinning disks per node in a 2x25. SSD journals are not recommended on a 1x60 density-optimized configuration: replacing a spinning media slot with SSDs is counter to the focus on density, and the attempt to increase drive write performance runs into server architectural limitations, for example the ratio of disk to network bandwidth. As part of designing towards homogeneity, adjust drive counts to divide storage across compute evenly where possible. Once the number of disks is chosen, decide how storage will be configured in logical volumes (see Logical Drive Configuration under Cluster Tuning) and select system CPU and memory to match the number of OSDs.
Choosing a network infrastructure
Consider the desired bandwidth of the storage calculated above, the overhead of replication traffic, and the network configuration of the object gateway's data network (number of ports/total bandwidth). Details of traffic segmentation, load balancer configuration, VLAN setup, or other networking configuration/best practice are very use-case specific and outside the scope of this document.
Typical choices of configuration for data traffic will be one or two 1GbE or 10GbE networks. Cold object storage use cases may be satisfied with data access over lower bandwidth, but consider that 10GbE is also useful for faster rebuild and recovery between OSDs. Replicating 1TB of data across a 1GbE network takes about three hours; with 10GbE it would be about 20 minutes (see the sketch below). If more network ports are needed, an additional NIC card can be placed in the server's PCIe slot. Network redundancy (active/passive configurations, redundant switching) is not recommended, as scale-out configurations gain significant reliability from compute and disk node redundancy and proper failure domain configuration. Consider the network configuration (where the switches and rack interconnects are) in the CRUSH map to define how replicas are distributed.
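The rebuild-time arithmetic above is easy to reproduce for other capacities and link speeds (a sketch; the effective throughput figures are round-number assumptions, not measured values):

#!/bin/bash
# Estimate how long re-replicating a given amount of data takes over one link.
DATA_TB=1
MBPS_1GBE=100        # assumed effective throughput of a 1GbE link, MB/sec
MBPS_10GBE=900       # assumed effective throughput of a 10GbE link, MB/sec
SECS_1GBE=$((DATA_TB * 1000 * 1000 / MBPS_1GBE))
SECS_10GBE=$((DATA_TB * 1000 * 1000 / MBPS_10GBE))
echo "1GbE:  $((SECS_1GBE / 60)) minutes"    # roughly 3 hours
echo "10GbE: $((SECS_10GBE / 60)) minutes"   # roughly 20 minutes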

A cluster network offloads replication traffic from the data network and provides an isolated failure domain. With the tested replication settings, there are two writes for replication on the cluster network for every actual IO; that's a significant amount of traffic to isolate from the data network. It is recommended to reserve a separate 1GbE network for management, as it supports a different class and purpose of traffic than cluster IO.
Matching object gateways to traffic
Start by selecting the typical object size and IO pattern, then compare to the sample reference configuration results. The object gateway limits depend on the object traffic, so accurate scaling requires testing and characterization with a load representative of the use case. Here are some considerations when determining how many object gateways to select for the cluster:
Object gateway operation processing tends to limit small object transfer. File system caching for GETs tends to have the biggest performance impact at these small sizes.
For larger object and cluster sizes, gateway network bandwidth is the typical limiting factor for performance. HP observed the peak ops/sec per object gateway in testing across object sizes at the small object sizes. Maximum practical bandwidth limits seen were in the 900MB-1GB/sec range on a 10GbE link.
Load balancing does make sense at scale to improve latency, IOPS, and bandwidth. Consider at least three object gateways behind a load balancer architecture. Very cold storage or environments with limited clients may only ever need a single gateway.
With the monitor process having relatively lightweight resource requirements, a monitor can run on the same hardware used for an object gateway. Performance and failure domain requirements may dictate that not every monitor host is an object gateway, and vice versa. To maximize client traffic per object gateway or meet the strictest failure domain requirements, it is recommended the two roles be hosted on separate hardware.
Planning monitor count
Use a minimum of three monitors for a production setup. While it is possible to run with just one monitor, it's not recommended for an enterprise deployment, as larger counts are important for quorum and redundancy. With multiple sites it makes sense to extend the monitor count higher to maintain a quorum with a site down. Use physical boxes rather than VMs, to have separate hardware for failure cases. Do not run a monitor on the same box as OSDs; Ceph documentation recommends avoiding that due to the monitor's usage of fsync() impacting OSD performance. The sample reference configuration does not stress DL360p monitor resources in a 200-OSD cluster, so there are no scaling recommendations for monitors based on cluster size.
Cluster Installation
Hardware platform and OS preparation details are contained in Appendix D: Server Preparation. Most installation details are broken out in Appendix E: Cluster Installation. Even for more complicated clusters, the quick Ceph deployment flow using ceph-deploy is a good starting point for cluster installation. There is community work with more advanced configuration management tools to further automate cluster install (e.g., Chef, Juju, Puppet), but those details are outside of this document's scope. Whether Ceph's quick start instructions are used or not, it's recommended to use ceph-deploy where possible over manual configuration, as the steps tend to be simpler to execute and maintain.
Do expect to make manual configuration changes to ceph.conf regardless. Object gateway configuration is currently not supported under ceph-deploy, and cluster use and configuration may dictate that non-default parameters be added.
Cluster tuning
This section contains tuning guidance that HP considered important to general system configuration. This section is not an optimal performance guide; there are a lot of settings that modify operating behavior, and the goal was easily reproduced performance with a good baseline configuration.
Placement groups
The tested ratio (for the sum of all pools) recommended by the online documentation is:
<total_placement_group_count> = (#OSDs * 100) / replica count
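As a worked example against the sample reference configuration (200 OSDs, replica count 3), the formula gives roughly the PG budget actually used across the pools in Appendix B:

#!/bin/bash
# Apply the placement group formula to the sample reference configuration.
OSD_COUNT=200
REPLICA_COUNT=3
echo "Total PG target: $((OSD_COUNT * 100 / REPLICA_COUNT))"   # ~6666 PGs across all pools
# The sample configuration allocates 6336 PGs in total, with the largest shares
# going to .rgw.buckets (4096) and rbd (2048); see Appendix B.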

Some tuning heuristics:
When balancing PG usage for all pools, the proportion of PGs allocated should be based on which pool contains the most objects. So the data pool for the object gateway would typically get the lion's share of the placement groups. If there are multiple pools with high numbers of objects (e.g., a few RBD pools), tuning PG count becomes more complicated.
Right now pg_num and pgp_num must be the same. Remember to set both values when pools need tuning.
The *100 multiplier in the ratio can actually vary; lower counts may help with lower-powered systems. For the HP ProLiant SL4540 Gen8 Server under test, plenty of compute resources are available, so a higher number works.
Powers of two are documented as slightly more performant. It is not practical to jump heavily utilized pools a full power of two every time OSDs are added, but keep this type of growth in mind for planning.
PG allocations must keep a minimum PG count per OSD for the cluster. Running 'ceph -s' will warn if under threshold.
PG count in a pool can't be lowered; to lower a PG count, pools must be deleted and remade (if the data isn't important) or pool contents copied to another pool through RADOS before deletion. So increasing placement groups isn't directly reversible.
Recent Inktank installation documentation has recommended even higher counts of PGs per OSD; for large enterprise clusters with the CPU and high OSD counts, these higher PG counts may show benefits. HP has not tested with this tuning in mind. Higher PG counts take more CPU and rebalance time in exchange for better cluster distribution of objects. Changing PG count also incurs a rebalance.
Adding extra PGs for future expansion of OSDs on a critical pool can make sense, or PGs can be left available for RBD pool(s). Best practice depends on current and planned cluster use.
SSD journal usage
If data requires significant PUT performance, consider SSDs for data journaling.
Advantages
Separating the highly sequential journal data from object data (which is distributed across the data partition as RADOS objects land in their placement groups) means significantly less seeking to the front of the drive for a journal commit and then seeking elsewhere to write data. It also means that all bandwidth on the spinning media goes to data IO, approximately doubling the bandwidth of PUTs/writes. Using an SSD device for the journal keeps storage relatively dense, because multiple journals can go to the same higher-bandwidth device while not incurring rotating media seek penalties.
Disadvantages
Each SSD in this configuration is more expensive than a data drive that could otherwise be put in the slot, and journal SSDs reduce the maximum amount of object storage on the node. Tying a separate device to multiple OSDs as a journal and using xfs (the default file system with ceph-deploy) means that loss of the journal device is a loss of all dependent OSDs. With a high enough replica and OSD count this isn't a significant additional risk to data durability, but it does mean architecting with that expectation in mind. The btrfs file system avoids this limitation, but it is not mature enough for some enterprises. OSDs can't be hot swapped with separate data and journal devices.
Configuration recommendations
For bandwidth, four spinning disks to one SSD is the recommended performance ratio. It's possible to go with a higher ratio of spinning to solid state, but that increases the number of OSDs affected by an SSD failure.
Also, the SSD could become a bottleneck; larger ratios of disks to SSD journal should be balanced against peak spinning media performance.
Journals don't require a lot of capacity, but larger SSDs do provide extra wear leveling. The journaling space Ceph reserves should cover several seconds of writes; if each spinning disk peaks at ~150MB/sec, then 4GB of capacity in a given journal partition is more than a spinning disk will need to meet that buffer requirement.
A RAID 1 of SSDs is not recommended. Wear leveling makes it likely the SSDs will need replacement at similar times, and doubling the SSDs per node also reduces storage density and increases price per gigabyte. With massive storage scale, it's better to expect drive failure and plan so that failure is easily recoverable and tolerable.
Choose SSDs that match data usage. Consider the number of times the entire device will be written per day versus the capabilities of the device. If the required write bandwidth comes in occasional bursts, SLC flash doesn't make sense.
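One way to express the journal sizing guidance above in ceph.conf (a sketch, not a setting from the tested configuration; when the journal is a pre-created partition, as in Appendix E, the partition size governs, and parameter behavior should be verified against the Ceph release in use):

<cluster creation dir>/ceph.conf
[osd]
# Journal size in MB; 4096 (4GB) comfortably covers several seconds of writes
# for a ~150MB/sec spinning disk, per the guidance above.
osd journal size = 4096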

Logical drive configuration
For a 1x60, significant CPU cycles must be reserved for 60 OSDs on a single compute node. A fully loaded 1x60 HP ProLiant SL4540 Gen8 Server could reduce CPU usage by configuring RAID 0 volumes across two drives at a time, resulting in 30 OSDs. Configuring multiple drives in a RAID array can reduce CPU cost for colder storage, in exchange for reduced storage efficiency to provide reliability. It can also provide more CPU headroom for error handling, or additional resources if the cluster design dictates CPU resource usage outside of cluster-specific tasks.
In production, configuring logical drives is generally straightforward: if it's in the node, it's in use. If it's desirable to use only a subset of the drives present in the system, the recommendation is to configure logical volumes only for the drives to be used. Array accelerator cache is divided between configured logical drives, so unused logical volumes will take up caching resources; the result is a less accurate representation of peak performance.

Bill of materials
This BOM reproduces the sample reference configuration.
Note: HP ProLiant servers ship with an IEC-IEC power cord for rack mounting.
HP ProLiant SL4540 Gen8 Server
Qty  Part Number  Description
B22  HP ProLiant SL454x 2x Node Chassis
B22  HP 2xSL4540 Gen8 Tray Node Svr
L21  HP SL4540 Gen8 Intel Xeon E (2.3GHz/8-core/20MB/95W) FIO Processor Kit
B21  HP 8GB (1x8GB) Dual Rank x4 PC3L-10600R (DDR3-1333) Registered CAS-9 Low Voltage Memory Kit
B21  HP SL 10GbE IO Module Kit
B21  HP QSFP/SFP+ Adaptor Kit
B21  HP Smart Array P420i Mezz Ctrllr FIO Kit
B21  HP 1GB FBWC for P-Series Smart Array
B21  HP 12in Super Cap for Smart Array
B21  HP 500GB 6G SATA 7.2k 2.5in SC MDL HDD
B21  HP 3TB 6G SAS 7.2k 3.5in MDL SC HDD
B21  HP 200GB 6G SATA 2.5in SC Enterprise SSD
B21  HP 1200W CS Platinum Power Supply Kit
B21  HP 4.3U Rail Kit
B21  HP 0.66U Spacer Blank Kit
Available, but not used for this configuration
B21  HP iLO Adv 1-Svr incl 1yr TS&U SW
HP ProLiant DL360p Gen8 Server
Qty  Part Number  Description
B21  HP ProLiant DL360p Gen8 8 SFF Server
L21  HP DL360p Gen8 E FIO Kit
B21  HP DL360p Gen8 E Kit
B21  HP 8GB 2Rx4 PC R-11 Kit
B21  HP Ethernet 1GbE 4P 331FLR FIO Adptr
B21  HP Ethernet 10GbE 2P 560SFP+ Adptr
B21  HP 450GB 6G SAS 10K rpm SFF (2.5-inch) SC Enterprise 3yr Warranty Hard Drive
B21  HP 1GB FBWC for P-Series Smart Array
B21  HP Raid 1 Drive 1 FIO Setting
B21  HP 460W CS Gold Ht Plg Pwr Supply Kit
B21  HP 1U FIO Friction Rail Kit
B21  HP iLO Adv 1-Svr incl 1yr TS&U SW

HP Networking Cables
Qty  Part Number  Description
B23  HP IP CAT5 Qty-8 12ft/3.7m Cable
B22  HP IP CAT5 Qty-8 6ft/2m Cable
3  JD096C  HP X240 10G SFP+ to SFP+ 1.2m DAC Cable
20  JD097C  HP X240 10G SFP+ to SFP+ 3m DAC Cable
2  JG328A  HP X240 40G QSFP+ QSFP+ 5m DAC Cable
HP 1GbE Switch
Qty  Part Number  Description
1  J9728A  HP G Switch, 1 J9739A Power Supply Included
1  U6319E  3-year Support Plus, 4-hour onsite, 24x7 coverage
1  U4830E  HP Networks Stackable Legacy Switch Startup Service
1  U4826E  HP Networks Stackable Legacy Switch Installation Service
HP 10GbE Switches
Qty  Part Number  Description
2  JC772A  HP 5900AF-48XG-4QSFP+ Switch
4  JC680A  HP 58x0AF 650W AC Power Supply
4  JC682A  HP 58x0AF Bck(pwr)-Frt(ports) Fan Tray
2  U5Y06E  HP 3y SupportPlus swt Svc [for JC772A]
HP Rack and Power
Qty  Part Number  Description
1  BW908A  HP mm Shock Intelligent Rack
1  BW932A  HP 600mm Rack Stabilizer Kit
1  BW930A  HP Air Flow Optimization Kit
1  BW909A  HP 42U 1200mm Side Panel Kit
2  AF916A  HP 3PH 48A NA/JP Pwr Monitoring PDU
2  AF500A  HP 2, 7X C-13 Stk Intl Modular PDU
B21  HP 9000 Series Ballast Option Kit

Summary
With the rapid growth of unstructured data and backup/archival storage, traditional storage solutions are lacking in their ability to scale or efficiently serve this data. The cost per gigabyte for SAN and NAS at scale is undesirable, and those solutions provide performance features the data doesn't really require. Tape has better cost at scale, but doesn't always meet latency requirements for data access. Management of that quantity of storage and sites is complicated; guaranteeing enterprise reliability to clients becomes difficult or impossible.
HP hardware and Ceph on Linux use object storage and industry-standard servers to provide the cost, reliability, and centralized management businesses need for petabyte unstructured storage scale and beyond. Industry-standard server hardware from HP is a reliable, easy to manage, and supported hardware infrastructure for the cluster. Ceph and Inktank provide the same set of qualities on the software side. Together, they form a solution with a lower TCO than traditional storage that can be designed and scaled for current and future unstructured data needs.
Importantly, the solution brings the control and cost benefits of open source to those enterprises that can leverage it. Open source software doesn't require additional license costs. There's no inherent vendor lock-in from the cluster software. Source code is available to control and customize what's deployed in the datacenter. Ceph can also be a key backing component of OpenStack Cinder and Glance.
Software, storage, and network infrastructure can be scaled to solve your exploding data problems. Ceph cluster software and HP hardware are a compelling solution to a new scale of storage requirements, freeing your storage from traditional limitations.

Appendix A: Sample Reference Ceph Configuration File
The ceph.conf file used for the sample reference configuration.

mon_initial_members = hp-cephmon01, hp-cephmon02, hp-cephmon03
mon_host = , ,
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
osd_pool_default_size = 3
osd_pool_default_min_size = 2
public_network = /24
cluster_network = /24

[client.radosgw.gateway01]
host = hp-cephmon01
keyring = /etc/ceph/keyring.radosgw.gateway
rgw socket path = /tmp/radosgw.sock
log file = /var/log/radosgw/radosgw.log

[client.radosgw.gateway02]
host = hp-cephmon02
keyring = /etc/ceph/keyring.radosgw.gateway
rgw socket path = /tmp/radosgw.sock
log file = /var/log/radosgw/radosgw.log

[client.radosgw.gateway03]
host = hp-cephmon03
keyring = /etc/ceph/keyring.radosgw.gateway
rgw socket path = /tmp/radosgw.sock
log file = /var/log/radosgw/radosgw.log

Appendix B: Sample Reference Pool Configuration
Pool dump for the sample reference configuration.

pool 0 'data' rep size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 owner 0 crash_replay_interval 45
pool 1 'metadata' rep size 3 min_size 2 crush_ruleset 1 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 owner 0
pool 2 'rbd' rep size 3 min_size 2 crush_ruleset 2 object_hash rjenkins pg_num 2048 pgp_num 2048 last_change 797 owner 0
pool 3 '.rgw.buckets' rep size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 4096 pgp_num 4096 last_change 793 owner 0
pool 4 '.rgw.root' rep size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 799 owner 0
pool 5 '.rgw.control' rep size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 801 owner 0
pool 6 '.rgw' rep size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 803 owner 0
pool 7 '.rgw.gc' rep size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 804 owner 0
pool 8 '.users.uid' rep size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 806 owner 0
pool 9 '.users' rep size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 808 owner
pool 10 '.users.swift' rep size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 810 owner
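The non-default placement group counts shown above can be applied to an existing pool with the standard pool commands (a sketch; remember that pg_num can be raised but not lowered, and pgp_num should be kept equal to pg_num as noted under Cluster tuning):

ceph osd pool set .rgw.buckets pg_num 4096
ceph osd pool set .rgw.buckets pgp_num 4096
ceph osd pool set rbd pg_num 2048
ceph osd pool set rbd pgp_num 2048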

Appendix C: Syntactical Conventions for Command Samples
Angle-bracketed text indicates a substitution for a literal value. Example: ssh <host name> would indicate substituting the host name of the target Ceph node when executing ssh commands.
The use of single quotes with an OS command indicates shorthand for a command. Example: 'ceph -s'. This does not typically attempt to indicate permissions or context; it's used when that's assumed known, or when reference to the command is all the needed context (i.e., a command might require a preceding sudo).
A breakout of detailed example OS text (command line or configuration file) is specified in a box.
Sample Text
A callout for optional OS text (something that is important to know but may not apply to all users) is specified in a colored box.
Important Note

Appendix D: Server Preparation
This section describes a few steps that need to be performed prior to OS installation, as well as steps required to customize the OS after installation.
Install HP Support Pack for ProLiant
HP Service Pack for ProLiant (SPP) is a comprehensive systems software and firmware update solution, delivered as a single ISO image. This solution uses HP Smart Update Manager (HP SUM) as the deployment tool and is tested on all HP ProLiant Gen8, G7, and earlier servers as defined in the Service Pack for ProLiant Server Support Guide. The ISO image can be downloaded from HP's support site. For the HP SL4540 servers in this Reference Architecture, we used iLO Virtual Media to mount the ISO image and booted the server from it. We chose to use the Automatic Firmware Update option, which requires no further interaction to complete.
BIOS Configuration Settings
Prior to OS installation on the SL4540, you will need to change or verify a couple of BIOS configuration settings. The Mellanox NIC on the SL4540 10GbE IO module requires SR-IOV to be enabled with current firmware versions. Also, current versions of the Ubuntu distribution do not include in-box support for the HP Dynamic Smart Array, so you will need to set the controller to AHCI mode so that the system drives (the hot-plug drives on the front of the HP SL4540) may be used for the boot device.
During server boot, press F9 to enter the BIOS configuration menu when you see F9 Setup at the bottom of the screen. Select System Options, then scroll down and select SATA Controller Options. Then select Embedded SATA Configuration and Enable SATA AHCI Support. Assuming this is a fresh installation, you can ignore the warning message about enabling RAID resulting in data loss. Press escape twice to get back to the main menu, scroll down to Advanced Options, then down to SR-IOV. If you don't see a dark blue box indicating Enabled, press enter and select Enabled.

You should observe the dark blue box change to indicate Enabled. Escape twice and then press F10 to save your changes and exit the utility.
Configuring a Mirrored Boot Device
While not required, mirroring your boot device is a good practice. For this reference architecture we created two partitions on each drive, one for the root file system and one for swap, and then mirrored each pair of partitions.
During Ubuntu installation, when you arrive at the step to partition your disks, select the Manual partition method. Follow these steps to create the necessary partitions.
1) Select the first of your system drives and confirm Create a new empty partition table on this device?. Repeat for the second system drive.
2) Select Free Space on the first drive and then Create a new partition.
3) For this reference architecture, since we have 48GB of memory on the system, we chose to make our swap partition 96GB, following the rule of thumb of doubling the memory size. Enter 96GB as your size and then select Primary and Beginning.
4) Select Use as: and change to physical volume for RAID, then select Done setting up partition.
5) For the root file system, again choose Free Space on the same device and Create a new partition.
6) Use the remainder of the available capacity on the drive. Select Continue and then Primary.
7) Select Use as: and again change to physical volume for RAID. In this case, for the root file system, select the Bootable flag: and change the value to on. Then select Done setting up partition.
8) Now follow steps 2 through 7 again for the second drive.
The next step is to mirror the pairs of partitions just created for the root file system and swap partition. Go back to the main Partition Disks page for this.
1) Select Configure Software RAID and yes to write the changes to disk.
2) Select Create MD device and then select RAID1 for mirroring.
3) For the number of active devices, enter 2, and then 0 for the number of spare devices.
4) The next step is selecting which partitions to mirror. For the swap partition, select sda1 and sdb1, then Continue to go to the next step.
5) Repeat steps 2 through 4 to mirror the root file system partitions, this time selecting sda2 and sdb2.
6) Now select Finish.
You are now ready to format your mirrored partitions.
1) Under RAID1 device #0, select #1.
2) Select Use as:, change the value to swap area, and then Done setting up partition.
3) Under RAID1 device #1, select #1.
4) Select Use as: and change the value to Ext4 journaling file system.
5) Now select Mount Point and choose / - the root file system, and then Done setting up partition.
6) Select Finish partitioning and write changes to disk.
In our case we then confirmed that we wanted to boot in the event that the RAID partition was in a degraded state, and proceeded with the rest of the OS installation.
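After the first boot, it's worth confirming that both mirror members are active (a quick check; the md0/md1 device names are assumptions and may differ on your system):

cat /proc/mdstat                  # both md devices should show [UU]
sudo mdadm --detail /dev/md0      # state should be clean, with 2 active devices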

Upgrading Ubuntu
Update the kernel and packages to the latest available. If you need to upgrade the kernel before installing Ceph, use apt-get dist-upgrade to do an intelligent package upgrade. Despite the name, apt-get dist-upgrade upgrades the kernel packages, not the distribution.
sudo apt-get update && sudo apt-get dist-upgrade
Useful Additional Package Configuration
Install NTP. Since cluster coherency requires minimal time skew, at least one of the monitors should be an NTP reference. A real time reference is optimal, but a synchronizing time reference will work for testing.
For runtime, some additional useful packages (none are absolutely necessary for cluster setup):
lsscsi for initial OSD creation, supplementing ceph-deploy disk list to map unassigned drives to OSDs.
collectl for checking performance during runtime to tune/evaluate cluster load.
Command Line Logical Drive Configuration
Install the HP CLI package hpssacli to configure new logical volumes and other array parameters. This helps avoid having to reboot the system into the HP SSA GUI and manually create dozens of single-drive RAID 0 volumes per SL4540. The recommended configuration for the boxes tested is all spinning media as single-drive RAID 0; here's the CLI syntax:
sudo /usr/sbin/hpssacli controller slot=1 create type=arrayr0 drives=allunassigned
The hpssacli binary is not distributed in .deb format in sync with .rpm releases. To get the latest HP Smart Storage Administrator CLI for Linux, download the .rpm and extract the files. Download the 64-bit Linux RPM (at the time of this paper it was an hpssacli x86_64 .rpm) and copy it to a directory where the extraction can be run. If not already present, install rpm2cpio to convert and install the files from the extraction directory.
sudo apt-get install rpm2cpio
rpm2cpio hpssacli-<version>.x86_64.rpm | cpio -idmv
sudo cp -R opt/* /opt
sudo cp -R usr/* /usr
The CLI can now be run from /usr/sbin with sudo permissions.
Updating the Driver for the 10GbE Mellanox NIC
HP recommends upgrading to the v2.1 or later mlnx_en driver for use with the HP SL4540 10GbE IO module. This driver is available from Mellanox. For this reference architecture, we used the mlnx_en 2.1 .tgz package. Copy this file to a directory where the extraction can be run. These steps extract the driver source, build it, and install the driver so that it loads persistently. If you haven't already installed them, install make and gcc.
tar xzvf mlnx-en-<version>.tgz
cd mlnx-en-<version>/sources
gunzip mlnx-en_2.1.orig.tar.gz
tar xvf mlnx-en_2.1.orig.tar
cd mlnx-en-2.1
./scripts/mlnx_en_patch.sh
make
sudo modprobe -r mlx4_en
Edit /etc/depmod.d/ubuntu.conf and add "extra" to the search list, resulting in the line:
search extra updates ubuntu built-in
sudo make install
sudo modprobe mlx4_en
sudo vi /usr/share/initramfs-tools/modules.d/mlx4_en
The contents of the file should be these lines:
mlx4_en
mlx4_core
Save and quit back to the command line.
sudo update-initramfs -u -k all

sudo reboot
Use modinfo mlx4_en after rebooting and verify that the server is using the v2.1.xx driver. Remember that if the kernel is upgraded, you must also recompile the Mellanox driver against the new kernel and rebuild the initramfs. Don't forget to re-run the patch script to regenerate the config.mk file before building.

Appendix E: Cluster Installation
Because the Ceph documentation website can change over time, the installation flow here has been sourced from the Ceph documentation used when configuring the sample reference configuration. The sourced instructions have been modified to include the customizations made, and are fixed on the choices of Ubuntu and Ceph distribution. For additional details, the Ceph quick start instructions are also linked in the For more information section.
It's recommended to perform the cluster install where the cluster can get access to the internet. This allows the OS package manager to download Ceph and general OS packages from the normal repositories. The ceph-deploy program depends on using the package manager to install, and it is much more straightforward than maintaining internal repositories. If site security doesn't allow internet access, the installation instructions will have to be modified according to specific site requirements, whether the install uses a local repository or a source install. All samples are from installation on Ubuntu.
Naming Conventions
The monitor and object gateway systems are named hp-cephmon01 through hp-cephmon03. The OSD hosts are named hp-osdhost01 through hp-osdhost10. When an operation is generic to the type of system, it'll be referred to as <node01> through <nodexx>.
Ceph Deploy Setup
Initial cluster creation and staging is executed from the first monitor, hp-cephmon01.
Add the release key. You may need to edit wgetrc for proxy use; see Initial Configuration Modification below for syntax:
wget -q -O- '<Ceph release key URL>' | sudo apt-key add -
Add the Ceph packages to the repository:
echo deb <Ceph package repository URL> $(lsb_release -sc) main | sudo tee /etc/apt/sources.list.d/ceph.list
Update the repository and install ceph-deploy:
sudo apt-get update && sudo apt-get install ceph-deploy
Ceph Node Setup
Create a user on each Ceph node:
ssh <existing login user>@ceph-server
sudo useradd -d /home/ceph -m ceph -s /bin/bash
sudo passwd ceph
Add root privileges for the user on each Ceph node:
echo "ceph ALL = (root) NOPASSWD:ALL" | sudo tee /etc/sudoers.d/ceph
sudo chmod 0440 /etc/sudoers.d/ceph
Install the SSH server (if necessary) on each Ceph node:
sudo apt-get install openssh-server
Configure the ceph-deploy admin node with password-less SSH access to each Ceph node. When configuring SSH access, do not use sudo or the root user. Leave the passphrase empty:
ssh-keygen
Generating public/private key pair.
Enter file in which to save the key (/ceph-client/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /ceph-client/.ssh/id_rsa.

Your public key has been saved in /ceph-client/.ssh/id_rsa.pub.
Copy the key to each Ceph node:
ssh-copy-id ceph@<node01>
ssh-copy-id ceph@<nodexx>
Modify the ~/.ssh/config file of the ceph-deploy admin node so that it logs in to Ceph nodes as the user created (e.g., ceph).
Host <node01>
  Hostname <node01 fully qualified domain name>
  User ceph
Host <nodexx>
  Hostname <nodexx fully qualified domain name>
  User ceph
Ensure connectivity using ping with short hostnames (hostname -s).
Create a Cluster
Start cluster installation
Create the cluster staging directory, then set up the initial config file and monitor keyring.
mkdir cluster-stage; cd cluster-stage
ceph-deploy new <initial-monitor-node(s) fully qualified domain names>
Initial Configuration Modification
Making some configuration modifications at this step avoids restarting the affected services later during this install. It's recommended to make ceph.conf changes in this staging directory rather than /etc/ceph/ceph.conf, so new configuration updates can be pushed to all nodes with ceph-deploy --overwrite-conf config push <nodes> or ceph-deploy --overwrite-conf admin <nodes>.
Set the replica count to 3 and the minimum count for writes to 2 so pools are at enterprise reliability levels. Replication at this level consumes more disk and network bandwidth but allows repair without data loss risk from additional device failures. This also allows for a quorum on object coherency, since odd counts > 1 can agree on a majority.
<cluster creation dir>/ceph.conf
osd_pool_default_size = 3
osd_pool_default_min_size = 2
If the object gateway is installed per the Ceph default instructions, related pools will be created automatically on demand as the object gateway is utilized, which means starting with defaults. The default of 8 PGs is low, although it may be appropriate for object counts in very lightly utilized pools. To boost the defaults based on cluster size, here are the configuration parameters.
<cluster creation dir>/ceph.conf
[global]
osd_pool_default_pg_num = <default_pool_placement_group_count>
osd_pool_default_pgp_num = <default_pool_placement_group_count>
If you want to offload cluster network traffic like our sample reference configuration did, you'll need to specify both public (data) and cluster network settings in ceph.conf using network/netmask slash notation.
<cluster creation dir>/ceph.conf
[global]
public_network = <public network>/<netmask>
cluster_network = <cluster network>/<netmask>

Install Ceph Software
This step pulls down the Ceph distribution packages and installs them onto all cluster role servers. If using ceph-deploy to install Ceph packages through a proxy server to get to the internet, edit wgetrc's proxy configuration on all Ubuntu nodes; otherwise, ceph-deploy install will get stuck trying to get the release key with wget. Aptitude should be configured with proper proxy settings during OS installation.
Example from /etc/wgetrc
https_proxy = <proper proxy server url>
http_proxy = <proper proxy server url>
ceph-deploy install --release dumpling <node01>...<nodexx>
Create Monitors
Add the initial monitors and gather the keys.
ceph-deploy mon create-initial <monitor01> <monitorxx>
Add OSDs
The typical manual flow for adding an OSD to the cluster with SSD journals is below, with the SSD as /dev/sdt and the OSD on /dev/sda for the target host.
If there's no partition table on the target journal SSD, create one.
ssh hp-osdhost01 sudo parted -s /dev/sdt mklabel gpt
Create a journal partition on the target journal SSD.
ssh hp-osdhost01 sudo parted /dev/sdt -s mkpart cephjournal01 0G 4G
Create a new partition table on the OSD drive. ceph-deploy has failed trying to clear a partition table when repurposing a drive; an explicit redo of the table has proved more reliable.
ssh hp-osdhost01 sudo "parted -s /dev/sda mklabel gpt"
Prepare and activate the OSD (the create command does both in one step).
ceph-deploy --overwrite-conf osd create hp-osdhost01:sda:sdt1
The scripts below are simple examples for setting up all drives on a box in a batch. They are not robust (no real error handling, usage output, etc.) but can be a useful starting point for command syntax/function. All these scripts assume the install instructions above, such that no ssh password entry is necessary. A more robust creation mechanism would probably leverage orchestration software, but even with occasional hiccups the scripts below generally suffice if adding new OSDs is not a common task.
Sample script for creating SSD journal partitions, interleaved 4 per SSD.
#!/bin/bash
tgtsys=${1}
if [ -z "${tgtsys}" ]; then
  echo "No target system."
  exit 1
fi
tgtdrv=${2}
if [ -z "${tgtdrv}" ]; then
  echo "No target disk."
  exit 1
fi

ssh ${tgtsys} sudo parted -s ${tgtdrv} mklabel gpt
p_layout=( 0G 4G 8G 12G 16G )
start_idx=0
end_idx=1
while [ ${end_idx} -lt ${#p_layout[@]} ]; do
  ssh ${tgtsys} sudo parted ${tgtdrv} -s mkpart cephjournal${end_idx} ${p_layout[${start_idx}]} ${p_layout[${end_idx}]}
  (( start_idx=end_idx ))
  (( end_idx++ ))
done
Sample script for adding OSDs to the cluster.
#!/bin/bash
destbox=${1}
if [ -z "${destbox}" ]; then
  echo "No target system."
  exit 1
fi
partdev=$(echo sd{a..t})
journaldev=( $(echo sd{u..y}{1..4}) )
journal_idx=0
for devid in ${partdev}; do
  echo "working on ${devid}"
  ssh ${destbox} sudo "parted -s /dev/${devid} mklabel gpt"
  ceph-deploy --overwrite-conf osd create ${destbox}:${devid}:${journaldev[${journal_idx}]}
  (( journal_idx++ ))
done
Create Admin Node
The cluster is administered on the same box as the primary monitor/object gateway. Adding read permissions on the admin keyring and Ceph configuration allows cluster administrator operations without having to be root.
ceph-deploy admin hp-cephmon01
sudo chmod +r /etc/ceph/ceph.client.admin.keyring
sudo chmod +r /etc/ceph/ceph.conf
Verify Cluster Health
The cluster should now be complete. Check the health status of the cluster and the cluster state information to make sure the cluster looks like it should.
ceph health
ceph -s
An example of command output from a healthy cluster configuration:
cloudplay@hp-cephmon02:~$ ceph -s
cluster 8fd2af32-987c-48a7-9a7b-e932bd88024b
health HEALTH_OK
monmap e1: 3 mons at {hp-cephmon01= :6789/0,hp-cephmon02= :6789/0,hp-cephmon03= :6789/0}, election epoch 8, quorum 0,1,2 hp-cephmon01,hp-cephmon02,hp-cephmon03
osdmap e822: 200 osds: 200 up, 200 in
pgmap v106577: 6336 pgs: 6324 active+clean, 12 active+clean+scrubbing; GB data, GB used, 508 TB / 545 TB avail
mdsmap e1: 0/0/1 up
cloudplay@hp-cephmon02:~$ ceph health
HEALTH_OK
Default Object Storage Placement Group Count
The majority of placement groups should lie in the pools with the most RADOS objects. In an object storage focused cluster, this pool will default to .rgw.buckets. Using the cluster tuning guidelines for placement groups, this step is a good place to

create the default pool now, so the object gateway install doesn't create one with sub-optimal default placement group settings. Remember to balance object gateway usage with the amount of RBD storage required.
sudo ceph osd pool create .rgw.buckets <pg_count>
Add Object Gateways
The ceph-deploy package does not support object gateways, but changes to the configuration are driven from the staging directory created in the cluster installation step above. If testing involves redoing the cluster from scratch frequently, this is manual enough that it is worth scripting or otherwise orchestrating. For the sample reference configuration, object gateways are installed on all of the monitors and load balanced. All of the steps below are performed on the target system directly, except for the push of ceph.conf, which occurs on the staging system (in this case hp-cephmon01). Installation must be performed and individually tailored for each system performing the object gateway role.
Apache/FastCGI w/100-Continue
The Ceph community provides slightly optimized versions of the apache2 and fastcgi packages. The material difference is that the Ceph packages are optimized for the 100-continue HTTP response, where the server determines if it will accept the request by first evaluating the request header. If there are specific Apache requirements, it may be possible to run with the stock server.
Add a ceph-apache.list file to the APT sources.
echo deb <Ceph apache2 package repository URL>-$(lsb_release -sc)-x86_64-basic/ref/master $(lsb_release -sc) main | sudo tee /etc/apt/sources.list.d/ceph-apache.list
Add a ceph-fastcgi.list file to the APT sources.
echo deb <Ceph fastcgi package repository URL>-$(lsb_release -sc)-x86_64-basic/ref/master $(lsb_release -sc) main | sudo tee /etc/apt/sources.list.d/ceph-fastcgi.list
Update the repository and install Apache and FastCGI.
sudo apt-get update && sudo apt-get install apache2 libapache2-mod-fastcgi
Configure Apache/FastCGI
Open the apache2.conf file:
sudo vim /etc/apache2/apache2.conf
Add a line for the server name in the Apache configuration file. Provide the fully qualified domain name of the server machine.
Edit /etc/apache2/apache2.conf:
ServerName <fully qualified domain name>
Enable the URL rewrite modules for Apache and FastCGI:
sudo a2enmod rewrite
sudo a2enmod fastcgi
Restart Apache so that the foregoing changes take effect.
sudo service apache2 restart
Enable SSL
Because this sample configuration is targeted at enterprise customers, SSL is configured. Ensure the dependencies are installed.
sudo apt-get install openssl ssl-cert
Enable the SSL module.
sudo a2enmod ssl
Generate a Certificate
sudo mkdir /etc/apache2/ssl
sudo openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout /etc/apache2/ssl/apache.key -out /etc/apache2/ssl/apache.crt

Restart Apache
sudo service apache2 restart
Install Ceph Object Gateway
The Ceph packages don't pull down the object gateway software by default, so add that now.
sudo apt-get install radosgw
Add gateway configuration to Ceph
HP recommends this step and the configuration step be executed from the deployment directory used for cluster creation. Each object gateway host gets a separate section for its definition. When running the scripts to start the service, the host name field matches the proper configuration to start, and the gateway name is the piece of the <cluster>-<id> combination that identifies the cephx user authenticating for that gateway instance. For example, with hp-cephmon01 the matching gateway name is gateway01, so the section below would be [client.radosgw.gateway01].
[client.radosgw.<gateway name>]
host = <object gateway host-name>
keyring = /etc/ceph/keyring.radosgw.gateway
rgw socket path = /tmp/radosgw.sock
log file = /var/log/radosgw/radosgw.log
Redeploy Ceph Configuration
Strictly speaking only the new object gateway needs the update, but it's a best practice to keep the configuration files in sync. If not manually editing /etc/ceph/ceph.conf on the gateway machine, run this command to deploy the config file changes to the cluster.
ceph-deploy --overwrite-conf config push <nodes>
Create Data Directory
sudo mkdir -p /var/lib/ceph/radosgw/ceph-radosgw.<gateway name>
Create Gateway Configuration
Create an rgw.conf file under the /etc/apache2/sites-available directory on the host where the Ceph Object Gateway was installed. This configuration accomplishes a few things:
Configures FastCGI as an external server to Apache.
Sets a rewrite rule for the Amazon S3 compatible interface (not used under this test).
Configures the mod_fastcgi module.
Allows encoded slashes, provides log file paths, and turns off server signatures.
Enables standard HTTP and SSL configuration.
The below is a literal config for hp-cephmon01; replace ServerName and ServerAdmin with the appropriate values for the host where the object gateway is being installed.
FastCgiExternalServer /var/www/s3gw.fcgi -socket /tmp/radosgw.sock
<VirtualHost *:80>
  ServerName <hp-cephmon01>
  ServerAlias *.ldev.net
  ServerAdmin <admin email>
  DocumentRoot /var/www
  RewriteEngine On
  RewriteRule ^/([a-zA-Z0-9-_.]*)([/]?.*) /s3gw.fcgi?page=$1&params=$2&%{QUERY_STRING} [E=HTTP_AUTHORIZATION:%{HTTP:Authorization},L]
  <IfModule mod_fastcgi.c>
    <Directory /var/www>
      Options +ExecCGI
      AllowOverride All
      SetHandler fastcgi-script
      Order allow,deny

      Allow from all
      AuthBasicAuthoritative Off
    </Directory>
  </IfModule>
  AllowEncodedSlashes On
  ErrorLog /var/log/apache2/error.log
  CustomLog /var/log/apache2/access.log combined
  ServerSignature Off
</VirtualHost>
<VirtualHost *:443>
  ServerName <hp-cephmon01>
  ServerAlias *.ldev.net
  ServerAdmin <admin email>
  DocumentRoot /var/www
  RewriteEngine On
  RewriteRule ^/([a-zA-Z0-9-_.]*)([/]?.*) /s3gw.fcgi?page=$1&params=$2&%{QUERY_STRING} [E=HTTP_AUTHORIZATION:%{HTTP:Authorization},L]
  <IfModule mod_fastcgi.c>
    <Directory /var/www>
      Options +ExecCGI
      AllowOverride All
      SetHandler fastcgi-script
      Order allow,deny
      Allow from all
      AuthBasicAuthoritative Off
    </Directory>
  </IfModule>
  AllowEncodedSlashes On
  ErrorLog /var/log/apache2/error.log
  CustomLog /var/log/apache2/access.log combined
  ServerSignature Off
  SSLEngine on
  SSLCertificateFile /etc/apache2/ssl/apache.crt
  SSLCertificateKeyFile /etc/apache2/ssl/apache.key
  SetEnv SERVER_PORT_SECURE 443
</VirtualHost>
Enable the Configuration
Enable the site for rgw.conf and disable the default site.
sudo a2ensite rgw.conf
sudo a2dissite default
Add Ceph Object Gateway Script
Create the object gateway script in /var/www/s3gw.fcgi:
#!/bin/sh
exec /usr/bin/radosgw -c /etc/ceph/ceph.conf -n client.radosgw.<gateway name>
Make sure the script is executable.
sudo chmod +x /var/www/s3gw.fcgi

Generate Keyring and Key for the Gateway

Here a keyring is created on the system where the object gateway was installed. These steps also set up read access for administrative ease of use, and add the gateway user to the cluster and keyring file. For simplicity, this configuration does not bother merging gateway keyring files across object gateways.

sudo ceph-authtool --create-keyring /etc/ceph/keyring.radosgw.gateway
sudo chmod +r /etc/ceph/keyring.radosgw.gateway
sudo ceph-authtool /etc/ceph/keyring.radosgw.gateway -n client.radosgw.<gateway name> --gen-key
sudo ceph-authtool -n client.radosgw.<gateway name> --cap osd 'allow rwx' --cap mon 'allow rw' /etc/ceph/keyring.radosgw.gateway
sudo ceph -k /etc/ceph/ceph.client.admin.keyring auth add client.radosgw.<gateway name> -i /etc/ceph/keyring.radosgw.gateway

Restart Services and Start the Gateway

sudo service ceph restart
sudo service apache2 restart
sudo /etc/init.d/radosgw start

Create a Gateway User

To use the Swift and S3 APIs through the object gateway, a user account is required. For the seeding part of the test this was done extensively with an automatic script (a sketch of such a script follows this section). Since the tests used the Swift API with v1.0 authentication through the object gateway, each account involves creating a user, a Swift subuser, and a secret key for the subuser to authenticate with.

sudo radosgw-admin user create --uid=testusr --display-name="Test User"
sudo radosgw-admin subuser create --uid=testusr --subuser=testusr:swift --access=full
sudo radosgw-admin key create --subuser=testusr:swift --key-type=swift --gen-secret

You may want to modify the read permissions on /etc/ceph/ceph.client.admin.keyring to allow radosgw-admin usage without sudo.

To validate that the object gateway is working, you can use the swift client to list a user account that was created. Even without any objects written, the command should return without error if the object gateway is working. When using the subuser secret key, watch out for keys containing escaped slashes (\/ represents just /); you may need to delete the escape character depending on how you are using the key.

swift -U <user name>:swift -K "<swift subuser secret key>" -A http://<gateway IP>/auth/v1.0 list
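As a minimal sketch of that kind of seeding script, the loop below creates a batch of test accounts using the same radosgw-admin commands shown above. The user-name prefix and the count of ten users are illustrative assumptions, not the exact values used during testing.

#!/bin/bash
# Create a batch of test users, each with a Swift subuser and generated secret key
for i in $(seq 1 10); do
    uid="testusr${i}"
    sudo radosgw-admin user create --uid=${uid} --display-name="Test User ${i}"
    sudo radosgw-admin subuser create --uid=${uid} --subuser=${uid}:swift --access=full
    # The generated Swift secret key is printed in the JSON output of this command
    sudo radosgw-admin key create --subuser=${uid}:swift --key-type=swift --gen-secret
done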

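Beyond the list check above, a quick end-to-end test is to upload a small file and list it back with the same swift client. This is a sketch only; the container name and source file are illustrative, and it assumes the standard upload and list sub-commands of the python-swiftclient command-line tool.

# Upload a file into a test container, then list the container to confirm the object landed
swift -U testusr:swift -K "<swift subuser secret key>" -A http://<gateway IP>/auth/v1.0 upload testcont /etc/hosts
swift -U testusr:swift -K "<swift subuser secret key>" -A http://<gateway IP>/auth/v1.0 list testcont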
Appendix F: Newer Ceph Features

While the sample reference configuration here used the Dumpling release, Ceph continues to make feature releases of significant technologies. This section lists features already released in stable code bases, or coming soon. There are many features on the Inktank roadmap; the ones selected here are from the Emperor and Firefly releases.

Multi-Site

The Ceph Emperor release has fully functional support for multi-site clusters. Ceph object gateway regions and metadata synchronization agents maintain a global namespace across different geographies and even clusters. Zones can be defined within regions to synchronize and maintain further copies of the data. A typical configuration would be a Ceph cluster per region, with zones defined as needed within each region for failover, disaster recovery, and backup protection.

There are of course hardware impacts when deploying multi-site. Make sure the SL4540 compute node density works well for splitting failure domains across sites (clearly a single SL4540 chassis cannot be divided). The count of object gateways and monitors will increase above what the same cluster OSD host count would require on a single site, in order to match the region and zone configuration. It is also likely that object gateway distribution will dictate additional load balancers per site.

Erasure Coding

Replication has the performance advantage of data locality, as a full copy of the data is present on each device in the active set. It also provides sufficient protection for data at massive scale. It does, however, come with the drawback of being less storage efficient than traditional RAID 5/RAID 6 architectures. At larger scales, especially where cost per usable gigabyte is a primary driver of the storage architecture, this becomes a significant scaling drawback.

Erasure coding is a forward error correction code that translates a message of k symbols into a message of n symbols such that the original message can be recovered from a subset (any k) of the n symbols. Erasure codes use math to create extra data that allows the user to need only a subset of the total data to recreate the message. It is similar to RAID 6, but the SLA, latency, and scale characteristics of an object store require tolerating more than two drive failures. Therefore, erasure coding can be tuned for n and k based on the scale and failure tolerance of the cluster. The tradeoff is lower performance, but instead of the roughly 3.2:1 raw-to-usable ratio of three-way replication, the ratio falls much closer to 1:1, depending on the chosen values of n and k.

As implemented under the Ceph Firefly release, erasure coding can be set up as a storage tier behind more performant replicated pools. Objects that are colder will be migrated to the erasure coded storage; erasure coding thereby supports a layer of storage with price/performance appropriate to the temperature of the data (a sketch of the Firefly-era commands appears at the end of this appendix).

Cache Tiering

For pools that require more performance, Ceph implements a cache pool tier in Firefly. There are two defined use cases for the initial release:
Writeback cache: take an existing data pool and put a fast cache pool (such as SSDs) in front of it. Writes are acknowledged from the cache tier pool and flushed to the data pool based on the defined policy.
Read-only pool, weak consistency: take an existing data pool and add one or more read-only cache pools. Data is copied to the cache pool(s) on read, and writes are forwarded to the original data pool. Stale data is expired from the cache pools based on the defined policy.

These will be useful when combined with specific applications whose access patterns match these caching properties. The object gateway is an example, but this could also be used as a performant accelerator for a block layer that needs write performance, or for a cacheable read load (e.g., golden image VM boot volumes).
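Neither erasure coding nor cache tiering was part of the Dumpling-based configuration tested here, but as a rough sketch of the erasure coding feature described above, the Firefly-era commands below create an erasure-coded pool. The profile name, the k=8/m=4 values, the pool name, and the PG counts are illustrative assumptions; with k=8 and m=4 the raw-to-usable ratio is 1.5:1 and an object survives the loss of any four of its twelve chunks.

# Define an erasure code profile: 8 data chunks plus 4 coding chunks
ceph osd erasure-code-profile set ecprofile k=8 m=4
# Create an erasure-coded pool that uses the profile
ceph osd pool create ecpool 1024 1024 erasure ecprofile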

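Similarly, a writeback cache tier as described above could be layered on top of that erasure-coded pool with commands along these lines. Again this is a sketch assuming the Firefly syntax; the cache pool name, its PG counts, and the choice of writeback mode are illustrative.

# Create a replicated pool on fast media to act as the cache
ceph osd pool create cachepool 512 512
# Attach it to the erasure-coded pool and enable writeback caching
ceph osd tier add ecpool cachepool
ceph osd tier cache-mode cachepool writeback
ceph osd tier set-overlay ecpool cachepool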
Appendix G: Helpful Commands

These are commands for administering the Ceph cluster that were useful during testing.

Removing configured objects

For a POC/test configuration, there may be reasons to tear down the cluster to recreate it, change the OSD configuration, etc. An example is swapping out spinning media for SSDs.

Rebuilding Cluster

If resetting a running cluster, this can be a lot faster than rebuilding, provided the cluster data isn't important; the official instructions were generally not sufficient to clean up hosts. Instead, do this under the cluster staging directory.
1. ceph-deploy purge <all cluster hosts>
2. ceph-deploy purgedata <all cluster hosts>
3. ceph-deploy forgetkeys
4. You may also need to run sudo apt-get autoremove on the cluster hosts if you're changing releases, to clean up package dependencies.

If the state still appears to be stuck, the nuclear option is going to each node and manually removing /var/lib/ceph and /var/run/ceph. OSD hosts may require unmounting OSD data partitions before removing /var/lib. Make sure all Ceph packages are uninstalled and then reboot the hosts. The unmount syntax (run on the OSD hosts):

sudo umount /dev/sd{<start letter>..<end letter>}1

Removing OSDs

This simplifies the flow of the official Ceph instructions somewhat. With a number of OSDs to remove, these steps can be put into small scripts to avoid errors; automating the wait on cluster health is a bit more involved (a sketch appears at the end of this appendix). Just deleting OSDs one after another can result in data loss if not careful. The slow but safe approach is recommended to avoid the risk of rebuilding a pool or cluster.

ceph osd out <OSD #>
ssh <remote host> sudo stop ceph-osd id=<OSD #>

Wait here with ceph -w until the cluster is healthy.

ceph osd crush remove osd.<OSD #>
ceph auth del osd.<OSD #>
ceph osd rm <OSD #>

If removing more OSDs, again wait with ceph -w until the cluster is healthy.

Removing logical drives

If reducing the volume count for some reason (changing out drives in use, reducing the count for performance evaluation), here is sample CLI syntax. The logical drive numbers are 1-based.

for lnum in {<start #>..<end #>}; do sudo /usr/sbin/hpssacli controller slot=1 logicaldrive ${lnum} delete; done

Checking Cluster State

The default is to require root permissions to read the Ceph configuration. For simplicity, open everything up on the admin node(s) while debugging:

sudo chmod +r /etc/ceph/*

The command ceph -s is useful for validating cluster health. Use ceph -w to follow runtime task status for the cluster. Use ceph df or rados df for a bit more information on cluster usage. If an OSD is down, ceph osd tree is a good state check of the cluster OSDs. Here is a subset of the command output format grabbed from the sample reference configuration (the numeric id and weight columns are omitted from this excerpt); OSDs that are not healthy will not show as up.

cloudplay@hp-cephmon02:~$ ceph osd tree | head -n 23
# id  weight  type name        up/down  reweight
              root default
              host hp-osdhost
              osd.0            up       1
              osd.1            up       1
              osd.2            up       1
              osd.3            up       1
              osd.4            up       1
              osd.5            up       1
              osd.6            up       1
              osd.7            up       1
              osd.8            up       1
              osd.9            up       1
              osd.10           up       1
              osd.11           up       1
              osd.12           up       1
              osd.13           up       1
              osd.14           up       1
              osd.15           up       1
              osd.16           up       1
              osd.17           up       1
              osd.18           up       1
              osd.19           up       1

Configuring Pool Settings

This is often useful for dynamically changing the pg_num and pgp_num settings after changing the number of OSDs. There are a few other parameters that can be set if needed, such as the CRUSH rule set or the replica parameters. To get current information on pools in the system, execute this on an admin node:

sudo ceph osd dump | grep pool

Here is an example of setting the data pool on an object gateway for 60 OSDs with a 3-way replica. The total PG count should be around (60 * 100) / 3 == 2000, so choose a reasonable power of 2 under that number.

sudo ceph osd pool set .rgw.buckets pg_num 1024

Wait until the cluster settles down with ceph -w, then:

sudo ceph osd pool set .rgw.buckets pgp_num 1024

See Appendix B for an example of pool dump output.

Listing Object Gateway Users

If the administrator forgets a username or there is a need to list existing object gateway users, this command is helpful.

sudo radosgw-admin metadata list user
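As a sketch of the kind of small script mentioned under Removing OSDs above, the loop below removes a range of OSDs one at a time and polls cluster health between steps. The OSD number range, host variable, and the simple HEALTH_OK polling loop are illustrative assumptions rather than the exact script used during testing.

#!/bin/bash
# Remove OSDs <first OSD #> through <last OSD #> one at a time, waiting for recovery between each
OSD_HOST=<remote host>
for osd in $(seq <first OSD #> <last OSD #>); do
    ceph osd out ${osd}
    ssh ${OSD_HOST} sudo stop ceph-osd id=${osd}
    # Wait until the cluster reports healthy before removing the OSD from CRUSH
    until ceph health | grep -q HEALTH_OK; do sleep 30; done
    ceph osd crush remove osd.${osd}
    ceph auth del osd.${osd}
    ceph osd rm ${osd}
    until ceph health | grep -q HEALTH_OK; do sleep 30; done
done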

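The pg_num arithmetic under Configuring Pool Settings above can also be scripted. The snippet below is a simple sketch of that calculation: multiply the OSD count by 100, divide by the replica count, and round down to a power of 2 (1024 for the 60-OSD, 3-replica example).

# Rough placement group sizing: (OSDs * 100) / replicas, rounded down to a power of 2
OSDS=60
REPLICAS=3
TARGET=$(( OSDS * 100 / REPLICAS ))
PG=1
while [ $(( PG * 2 )) -le ${TARGET} ]; do PG=$(( PG * 2 )); done
echo "Suggested pg_num/pgp_num: ${PG}"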
Appendix H: Workload Tool Detail

Getput

Installing

Getput is written in Python and is available as source from the getput project on GitHub. Getput v0.0.7, the version available on GitHub at the time of this writing, requires python-swiftclient v1.6.0 or later. On Ubuntu 12.04, you should be able to use these instructions to install that package.

sudo apt-get install python-software-properties
sudo add-apt-repository cloud-archive:havana
sudo apt-get update && sudo apt-get install python-swiftclient

Alternatively, you can create /etc/apt/sources.list.d/cloudarchive-havana.list with these contents:

deb http://ubuntu-cloud.archive.canonical.com/ubuntu precise-updates/havana main
deb-src http://ubuntu-cloud.archive.canonical.com/ubuntu precise-updates/havana main

And then install the package using:

sudo apt-get update && sudo apt-get install python-swiftclient

Running the test against the Ceph cluster as configured required commenting out the x-trans-id references for responses:

#transid = response['headers']['x-trans-id']
transid = ''

Test Parameters

Using version 1 authentication on the object gateway, we set up a resource file to define the environment variables for access.

export ST_AUTH=http://<load balancer address>/auth/v1.0
export ST_USER=<main user name>:swift
export ST_KEY="<swift subuser secret key>"

Full utility parameter help is available through the man pages and by running with --help. For the test run, these parameters were used (a sample invocation appears at the end of this appendix):
-n <NOBJECTS>: container/object numbers as a value or range
-o <ONAME>: object name prefix
-c <CNAME>: name of container
-t <TESTS>: tests to run; these can be gpd (GET, PUT, DELETE)
-s <SIZESET>: object size(s)
--procs=<PROCSET>: number of processes to run

Fio

Installing

Under Ubuntu, fio can be fetched as a standard package.

sudo apt-get install fio

Test Parameters

--rw=<IO pattern>: Type of IO pattern for the test. Used read and write for the sequential tests; randread, randwrite and randrw for the various random and mix tests.
--ioengine=<IO execution method>: Defines how the job issues IO. The engine used was libaio, for Linux native asynchronous IO.
--runtime=<maximum time to run test>: Terminates processing after the specified time. 30 minutes was used.
--numjobs=<number of workload clones>: Defines the processes/threads performing the same workload for this job. Used the default of 1.
--direct=<boolean>: Selects use/non-use of buffered IO. Direct was set to 1 (true).
--bs=<size of IO>: Block size for IO units. 8k was used for random IO, 256k for sequential IO.
--time_based: Force the test to run for the full runtime even if there is already complete coverage of the file.
--size=<total IO size of job>: fio runs to cover this size, but the size of the device and the time limit were the controlling factors for this test.
--iodepth=<units in flight>: Number of IO units to keep in flight against the file. A depth of 8 was used for testing.
--name=<job name>: In this context, /dev/rbd1 was used to specify both the job name and the device file being targeted.
--rwmixwrite=<% mix writes>: Percentage of the mixed workload to make writes. The mix test used 30 (the complement of the read percentage below).
--rwmixread=<% mix reads>: Percentage of the mixed workload to make reads. The mix test used 70.

Collectl

Installing

Written by Mark Seger (the author of getput), collectl is available as a standard distribution package.

sudo apt-get install collectl

Test Parameters

Collectl has an extensive list of parameters to capture and play back information. CPU, disk and network data are captured by default, so the only parameter added during capture was -f, for redirecting the recorded output to a file. Playback of the results used these parameters, and was then post-processed to get average data:
-s <subsystem>: Controls which subsystem data is to be collected or played back. A subsystem of c specifies playing back CPU data.
-p <filename>: Read data from the specified playback file(s).

HAProxy

Installing

Under Ubuntu, haproxy can be fetched as a standard package.

sudo apt-get install haproxy

Configuration

This is a minimal configuration that forwards ports 80 and 443 to the monitor/gateway boxes and uses source balancing to keep a best-effort client/server affinity. The load balancer bind addresses and back-end server IP addresses are shown as placeholders.

# this config needs haproxy-1.1.28 or haproxy-1.2.1
global
    log 127.0.0.1 local0
    log 127.0.0.1 local1 notice
    #log loghost local0 info
    maxconn 4096
    #chroot /usr/share/haproxy
    user haproxy
    group haproxy
    daemon
    #debug
    #quiet

defaults
    log global
    mode http
    option dontlognull
    retries 3
    option redispatch
    maxconn 2000
    contimeout 5000
    clitimeout 50000
    srvtimeout 50000

listen http_proxy <load balancer IP>:80
    option httplog
    mode http
    balance source
    server hp-cephmon01 <hp-cephmon01 IP>:80 check
    server hp-cephmon02 <hp-cephmon02 IP>:80 check
    server hp-cephmon03 <hp-cephmon03 IP>:80 check

listen https_proxy <load balancer IP>:443
    option tcplog
    mode tcp
    option ssl-hello-chk
    balance source
    server hp-cephmon01 <hp-cephmon01 IP>:443 check
    server hp-cephmon02 <hp-cephmon02 IP>:443 check
    server hp-cephmon03 <hp-cephmon03 IP>:443 check
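For reference, the following sketches show how the getput and fio parameters described earlier in this appendix might be assembled into complete invocations, assuming the ST_AUTH/ST_USER/ST_KEY variables above are exported for getput. The container and object names, object size, object count, process count, and job size are illustrative assumptions rather than the exact values used for the published results.

# getput: run the GET, PUT and DELETE tests on 100 objects of 1 MB in one container using 8 processes
getput -c testcont -o testobj -n 100 -s 1m -t gpd --procs=8

# fio: 8 KB random writes to the RBD device with a queue depth of 8 for 30 minutes
fio --name=/dev/rbd1 --rw=randwrite --ioengine=libaio --direct=1 --bs=8k \
    --iodepth=8 --numjobs=1 --time_based --runtime=1800 --size=<total IO size>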

Glossary

Cold, warm and hot storage: Temperature in data management refers to the frequency and performance of data access in storage. Cold storage is rarely accessed and can be stored on the slowest tier of storage. As the storage heat increases, the bandwidth over time as well as the instantaneous (latency, IOPS) performance requirements increase.

CRUSH: Controlled Replication Under Scalable Hashing. The algorithm Ceph uses to compute object storage locations.

Epoch: Ceph maintains a history of each state change in the Ceph Monitors, Ceph OSD Daemons and PGs. Each version of cluster element state is called an epoch.

Failure domain: An area of the solution impacted when a key device or service experiences failure.

Federated storage: A collection of autonomous storage resources with centralized management that provides rules about how data is stored, managed and moved through the cluster. Multiple storage systems are combined and managed as a single storage pool.

Object storage: A storage model focusing on data objects instead of file systems or disk blocks; objects have key/value pairs of metadata associated with them to give the data context. Typically accessed by a REST API, designed for massive scale, and using a wide, flat namespace.

PGs: Placement Groups. A grouping of objects on an OSD; pools contain a number of PGs, and many PGs can map to an OSD.

Pools: Logical partitions for storing objects. Pools set ownership/access to objects, the number of object replicas, the number of placement groups, and the CRUSH rule set to use.

RADOS: A Reliable, Autonomic Distributed Object Store. This is the core set of storage software which stores the user's data in a Ceph cluster (MON + OSD).

REST: Representational State Transfer, a stateless, cacheable, layered client-server architecture with a uniform interface. In this cluster, the REST APIs are architected on top of HTTP.

For more information

With increased density, efficiency, serviceability, and flexibility, the HP ProLiant SL4540 Gen8 Server is the perfect solution for scale-out storage needs. To learn more, visit hp.com/servers/sl4540.

To support the management and access features of object storage, and seamlessly operate as part of HP Converged Infrastructure, the HP ProLiant DL360p Gen8 series brings the power, density and performance required.

HP OneView helps companies of all sizes unlock the value of converged infrastructure by bringing the best of consumer IT to the data center and allowing teams to work in a more natural and collaborative way. Visit hp.com/go/oneview.

HP Integrated Lights-Out simplifies server setup, promotes remote administration, engages health monitoring, and maintains power and thermal control. For more information, see hp.com/go/ilo.

HP simplifies, integrates, and automates networking so organizations can focus on what they do best. Visit hp.com/go/networking for more information. The HP switches used in this document are the HP 1GbE and HP 5900AF-48XG-4QSFP+ 10GbE switches listed in the bill of materials.

Documents for HP scale-out object storage solutions on industry-standard servers are at hp.com/go/objectstorage.

Ceph has excellent documentation available at its website, which this whitepaper has sourced extensively; start from the documentation master page on the Ceph site.

To help us improve our documents, please provide feedback at hp.com/solutions/feedback.

Sign up for updates: hp.com/go/getupdated

Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. The only warranties for HP products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be construed as constituting an additional warranty. HP shall not be liable for technical or editorial errors or omissions contained herein.

Microsoft and Windows are U.S. registered trademarks of Microsoft Corporation. AMD is a trademark of Advanced Micro Devices, Inc. Intel and Xeon are trademarks of Intel Corporation in the U.S. and other countries. Oracle and Java are registered trademarks of Oracle and/or its affiliates.

May 2014
