Build and operate a Ceph infrastructure: a University of Pisa case study. Simone Spinelli, simone.spinelli@unipi.it
Agenda: CEPH@unipi: an overview; Infrastructure bricks: network, OSD nodes, monitor nodes, racks, MGMT tools; Performances; Our experience; Conclusions
University of Pisa: a large Italian university: 70K students, 8K employees. Not a campus but spread all over the city: no big datacenter, but many small sites. Owns and manages an optical infrastructure with an MPLS-based MAN on top. Proud host of a GARR network PoP. Surrounded by other research/educational institutions (CNR, Sant'Anna, Scuola Normale).
How we use Ceph: currently in production as the backend for an OpenStack installation, it hosts: department tenants (web servers, etc.), tenants for research projects (DNA sequencing, etc.), tenants for us (multimedia content from e-learning platforms). Working on: an email system for students hosted on OpenStack/RBD, a sync&share platform on RadosGW.
Timeline. Spring 2014: we started to plan: capacity/replica planning, rack engineering (power/cooling), bare-metal management, configuration management. Dec 2014: first testbed. Feb 2015: 12-node cluster goes into production. Jul 2015: OpenStack goes into production. Oct 2015: start deploying new Ceph nodes (+12).
Overview: 3 sites (we started with 2), one replica per site: 2 sites active for computing and storage, 1 for storage and quorum. 2 different network infrastructures: services (1Gb and 10Gb), storage (10Gb and 40Gb).
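As a hedged illustration of the "one replica per site" policy (pool name and rule number are assumptions, not the real values), the pool-level settings on a pre-Luminous release look like:

    # 3 replicas, placed by a CRUSH rule that spreads copies across datacenters
    ceph osd pool set volumes size 3
    ceph osd pool set volumes crush_ruleset 3
    # keep serving I/O as long as at least 2 copies are available (a common choice)
    ceph osd pool set volumes min_size 2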
Network: Ceph client and cluster networks are realized as VLANs on the same switching infrastructure. Redundancy and load balancing are achieved with LACP. Switching platforms: Juniper EX4550 (32p SFP), Juniper EX4200 (24p copper).
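A minimal sketch of the per-node side of this design on Ubuntu 14.04, assuming hypothetical interface names, VLAN IDs and subnets (not the actual site values):

    # /etc/network/interfaces: LACP bond over the two 10Gb ports (ifenslave + vlan)
    auto bond0
    iface bond0 inet manual
        bond-mode 802.3ad
        bond-miimon 100
        bond-slaves p1p1 p1p2
        bond-xmit-hash-policy layer3+4

    # Ceph client (public) network as one VLAN on top of the bond
    auto bond0.101
    iface bond0.101 inet static
        address 10.10.1.11
        netmask 255.255.255.0
        vlan-raw-device bond0

    # Ceph cluster (replication) network as another VLAN
    auto bond0.102
    iface bond0.102 inet static
        address 10.10.2.11
        netmask 255.255.255.0
        vlan-raw-device bond0

    # matching ceph.conf fragment
    [global]
    public network  = 10.10.1.0/24
    cluster network = 10.10.2.0/24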
Storage ring: sites interconnected with a 2x40Gb ERP (Ethernet ring protection). For storage nodes, 1 Virtual Chassis per DC: maximizes the bandwidth (128Gb backplane inside the VC), easy to configure and manage (NSSU), no more than 8 nodes per VC. Computing nodes go on different VCs.
Hardware: OSD nodes. Dell R720XD (2U): 2x Xeon E5-2603 @ 1.8GHz (8 cores total), 64GB DDR3 RAM; Ubuntu 14.04, Linux 3.13.0-46-generic #77-Ubuntu; 2x10Gb Intel X520 network adapter with the Linux bonding driver; 12x 2TB SATA disks (6 disks/RU); 2x Samsung 850 256GB SSDs: mdadm RAID1 for the OS plus 6 partitions per SSD for XFS journals. No special functions, less complex, really easy to deploy with iDRAC. Intended to be the virtual machine pool (faster).
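However the deployment is wrapped (by hand or by Puppet), the data-disk-plus-SSD-journal layout boils down to something like the following ceph-disk calls (device names are assumptions):

    # one OSD: XFS data filesystem on a SATA disk, journal on an SSD partition
    ceph-disk prepare --fs-type xfs /dev/sdc /dev/sda5
    ceph-disk activate /dev/sdc1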
Hardware: OSD nodes. Supermicro SSG6047R-OSD120H: 2x Xeon E5-2630v2 @ 2.60GHz (24 cores total), 256GB DDR3 RAM; Ubuntu 14.04, Linux 3.13.0-46-generic #77-Ubuntu; 2 SSDs in RAID1 for the OS (dedicated); 4x10Gb Intel X520 network adapter with the Linux bonding driver; 30x 6TB SATA disks (7.5 disks/RU); 6x Intel S3700 SSDs for XFS journals (1 SSD serves 5 OSDs). No special functions, less complex. Intended to be the object storage pool (slow).
Hardware: monitor nodes. Sun SunFire X4150, physical hardware, not virtual (3 in production, going to be 5). Ubuntu 14.04, Linux 3.13.0-46-generic #77-Ubuntu. 2x Intel Xeon X5355 @ 2.66GHz, 16GB RAM. 2x1Gb Intel NICs for the Ceph client network (LACP). 5x120GB Intel S3500 SSDs in RAID10 + hot spare.
Rack plans. NOW: computing and storage are mixed: 24U OSD nodes, 4U computing nodes, 2U monitor/cache, 10U network. IN PROGRESS: computing and storage will be in dedicated racks. For storage: 32U OSD nodes, 2U monitor/cache, 8U network. For computing: 32U computing nodes, 10U network. The storage network fan-out is optimized.
Configuration essentials. The CRUSH tree spans the three datacenters:

    -1   262.1  root default
    -15   87.36     datacenter fibonacci
    -16   87.36         rack rack-c03-fib
    -14   87.36     datacenter serra
    -17   87.36         rack rack-02-ser
    -18   87.36         rack rack-03-ser
    -35   87.36     datacenter ingegneria
    -31    0            rack rack-01-ing
    -32    0            rack rack-02-ing
    -33    0            rack rack-03-ing
    -34    0            rack rack-04-ing

and a replicated rule places one copy per datacenter, on high-end hosts:

    rule serra_fibo_ing_high-end_ruleset {
        ruleset 3
        type replicated
        min_size 1
        max_size 10
        step take default
        step choose firstn 0 type datacenter
        step chooseleaf firstn 1 type host-highend
        step emit
    }
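For reference, the usual round-trip for inspecting and editing a CRUSH map like the one above (file names are arbitrary):

    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt
    # edit crushmap.txt: datacenter/rack buckets plus the replicated rule
    crushtool -c crushmap.txt -o crushmap.new
    ceph osd setcrushmap -i crushmap.new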
Tools. Just 3 people working on Ceph (not 100% of their time), and you need to grow quickly: automation is REALLY important. Configuration management: Puppet: most of the classes are already production-ready, and there is a lot of documentation (best practices, books, community). Bare-metal installation: The Foreman: complete lifecycle for hardware, DHCP, DNS, Puppet ENC.
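Purely as an illustration of the Puppet approach (a hypothetical wrapper class, not the production manifests, which rely on existing community Ceph classes), the shape of an OSD-node profile could be:

    # hypothetical profile::ceph_osd class, illustrative only
    class profile::ceph_osd (
      $fsid,
      $mon_hosts,
    ) {
      package { 'ceph':
        ensure => installed,
      }

      # ceph.conf rendered from an assumed ERB template
      file { '/etc/ceph/ceph.conf':
        ensure  => file,
        content => template('profile/ceph.conf.erb'),
        require => Package['ceph'],
      }
    }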
Tools. For monitoring/alarming: Nagios + Check_MK (alarms, graphing), Rsyslog; looking at collectd + Graphite for metrics correlation. Test environment (Vagrant and VirtualBox) to test what is hardware independent: new functionalities, Puppet classes, upgrade procedures.
OpenStack integration (OpenStack Juno, Ceph Giant). It works straightforwardly: Ceph as a backend for volumes, VMs and images; shared-storage live migration; multiple pools are supported; copy-on-write (a VM as a snapshot of its image). Current issues: massive volume deletion, evacuate.
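The Juno-era RBD wiring behind this is the standard one from the Ceph documentation; roughly as follows, where pool names, users and the secret UUID are assumptions:

    # cinder.conf
    volume_driver = cinder.volume.drivers.rbd.RBDDriver
    rbd_pool = volumes
    rbd_user = cinder
    rbd_secret_uuid = <libvirt secret UUID>

    # glance-api.conf (rbd options in [glance_store] on Juno,
    # show_image_direct_url in [DEFAULT] to enable copy-on-write clones)
    default_store = rbd
    rbd_store_pool = images
    rbd_store_user = glance
    show_image_direct_url = True

    # nova.conf, [libvirt] section
    images_type = rbd
    images_rbd_pool = vms
    rbd_user = cinder
    rbd_secret_uuid = <libvirt secret UUID>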
Performances: ceph bench writes (rados bench; 10s, 60s and 120s runs)

    Metric                   10s run     60s run     120s run
    Total time run           10.353915   60.308706   120.537838
    Total writes made        1330        5942        12593
    Write size               4194304     4194304     4194304
    Bandwidth (MB/sec)       513.815     394.106     417.894
    Stddev Bandwidth         161.337     103.204     84.4311
    Max bandwidth (MB/sec)   564         524         560
    Min bandwidth (MB/sec)   0           0           0
    Average Latency          0.123224    0.162265    0.153105
    Stddev Latency           0.0928879   0.211504    0.175394
    Max latency              0.955342    2.71961     2.05649
    Min latency              0.045272    0.041313    0.038814
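The write runs above are plain rados bench invocations of increasing length; presumably something like the following, where --no-cleanup keeps the benchmark objects around so the read tests on the next slide have data to read:

    rados bench -p BenchPool 120 write --no-cleanup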
Performances: ceph bench reads (rados bench -p BenchPool 10 rand / seq)

    Metric               rand        seq
    Total time run       10.065519   10.057527
    Total reads made     1561        1561
    Read size            4194304     4194304
    Bandwidth (MB/sec)   620.336     620.829
    Average Latency      0.102881    0.102826
    Max latency          0.294117    0.328899
    Min latency          0.04644     0.041481
Performances: adding VMs. What to measure: see how latency is influenced by IOPS, measuring it while we add VMs (fixed load generator); see how total bandwidth decreases while adding VMs. Setup: 40 VMs on OpenStack, each with two 10GB volumes (pre-allocated with dd): one with a bandwidth cap (100MB/s), one with an IOPS cap (200 total). We use fio as the benchmark tool and dsh to launch it from a master node. Reference: Measure Ceph RBD performance in a quantitative way: https://software.intel.com/en-us/blogs/2013/10/25/measure-ceph-rbd-performance-in-a-quantitative-way-part-i
Fio

    Random I/O (IOPS-capped):
    fio --size=1g \
        --runtime=60 \
        --ioengine=libaio \
        --direct=1 \
        --rw=randread [randwrite] \
        --name=fiojob \
        --blocksize=4k \
        --iodepth=2 \
        --rate_iops=200 \
        --output=randread.out

    Sequential I/O:
    fio --size=4g \
        --runtime=60 \
        --ioengine=libaio \
        --direct=1 \
        --rw=read [write] \
        --name=fiojob \
        --blocksize=128k [256k] \
        --iodepth=64 \
        --output=seqread.out
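To drive all 40 VMs at once, dsh just fans the same fio job out to a machine group; a sketch, assuming a hypothetical group file listing the VM hostnames:

    # ~/.dsh/group/fiovms contains one VM hostname per line
    dsh -M -c -g fiovms -- 'fio --size=1g --runtime=60 --ioengine=libaio \
        --direct=1 --rw=randread --name=fiojob --blocksize=4k --iodepth=2 \
        --rate_iops=200 --output=randread.out'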
Performances - write
Performances - write
Performances - read
Performances - read
Dealing with... Hardware: most of the problems came from hardware (disks, controllers, nodes), but maybe we are too small; more RAM = less pain (especially during recovery/rebalancing). Software: slow requests / blocked operations; scrub errors: fix them with pg repair and check the logs. Automation: when something is broken, Puppet can make it worse.
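For the scrub and slow-request cases, the commands involved are roughly these (PG and OSD ids are placeholders):

    ceph health detail                   # lists inconsistent PGs and slow requests
    ceph pg repair 3.1f                  # repair one inconsistent PG
    # inspect what an OSD is stuck on, via its admin socket
    ceph --admin-daemon /var/run/ceph/ceph-osd.12.asok dump_ops_in_flight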
...so what? Ceph is addressing our needs: it performs (well?), it's robust: in about 9 months, production and non-production, nothing really bad has happened. Now we are going to: work more on monitoring and performance graphing, run more benchmarks to understand what to improve, add an SSD cache, activate RadosGW (in production) and the slow pool.
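For the SSD cache step, a writeback cache tier in front of the slow pool would be set up roughly like this (pool names and the size target are placeholders):

    ceph osd tier add slowpool cachepool
    ceph osd tier cache-mode cachepool writeback
    ceph osd tier set-overlay slowpool cachepool
    ceph osd pool set cachepool hit_set_type bloom
    ceph osd pool set cachepool target_max_bytes 1099511627776   # ~1TB, arbitrary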
Questions? For you: VMware support? Xen/XenServer? SMB/NFS/iSCSI?
Coffee time!