Cooling Down Ceph: Exploration and Evaluation of Cold Storage Techniques
Cold Storage
Ceph
Cooling Down Ceph: Object Stubs, Striper Prefix Hashing
Facebook photos [1]: 82% of reads hit 8% of the stored data
Scientific data system [2]: 50% of reads hit 5% of the stored data
[1] T. P. Morgan. Facebook Rolls Out New Web and Database Server Designs. http://www.theregister.co.uk/2013/01/17/open_compute_facebook_servers/, 2013.
[2] M. Grawinkel, L. Nagel, M. Mäsker, F. Padua, A. Brinkmann, and L. Sorth. Analysis of the ECMWF Storage Landscape. Proc. of the 13th USENIX Conference on File and Storage Technologies (FAST), 2015.
Cold storage is an operational mode or a method of operation of a data storage device or system for inactive data, where an explicit trade-off is made, resulting in data retrieval response times beyond what may be considered normally acceptable to online or production applications, in order to achieve significant capital and operational savings.
IDC Technology Assessment: Cold Storage Is Hot Again - Finding the Frost Point (2013)
Distributed storage system
No single point of failure
Horizontal scaling
Runs on commodity hardware
Ceph architecture: Client, Monitors (MON), OSDs
Monitor (MON)
Keeps the Cluster Map
Distributed consensus
Not in the data path
Client
Computes placement based on the Cluster Map
Directly accesses OSDs and MONs
OSD
Stores objects
Manages replication
Placement
Data placement: Pools and Placement Groups (PGs)
Pool
Pool type: Replicated or Erasure Coded
Placement Groups
OSDs grouped into Buckets: Rack, Server, Disk (bucket type)
Placement Groups (PGs)
Abstraction for placement computation
~100 PGs per OSD
Object
Pool
Name (matters for placement)
Data: 4 MiB
Object Map
OSD (Object Storage Device)
~1 per Disk
Replication / Erasure Coding
Placement
Object name → Hash → PG → CRUSH (Cluster Map) → OSDs
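To make the pipeline concrete, here is a minimal sketch of the same steps. std::hash and the modulo-based crush_map are simplified stand-ins for Ceph's rjenkins hash and the real CRUSH algorithm, and the object name, pool ID, pg_num and OSD counts are only illustrative.

// Sketch of the placement path: object name -> hash -> PG -> CRUSH -> OSDs.
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

struct PlacementGroup { uint32_t pool_id; uint32_t ps; };   // ps: placement seed within the pool

// Object name -> PG: hash the name and fold it into the pool's PG count.
// std::hash stands in for rjenkins; real Ceph uses a stable bitmask, not plain modulo.
PlacementGroup object_to_pg(const std::string& name, uint32_t pool_id, uint32_t pg_num) {
    uint32_t h = static_cast<uint32_t>(std::hash<std::string>{}(name));
    return PlacementGroup{pool_id, h % pg_num};
}

// PG -> OSDs: a deterministic toy mapping standing in for CRUSH over the Cluster Map.
std::vector<int> crush_map(const PlacementGroup& pg, int num_osds, int replicas) {
    std::vector<int> osds;
    for (int r = 0; r < replicas; ++r)
        osds.push_back(static_cast<int>((pg.ps + 7919u * static_cast<uint32_t>(r))
                                        % static_cast<uint32_t>(num_osds)));
    return osds;
}

int main() {
    PlacementGroup pg = object_to_pg("10000000abc.00000003", /*pool*/ 1, /*pg_num*/ 128);
    std::cout << "pg " << pg.pool_id << "." << pg.ps << " ->";
    for (int osd : crush_map(pg, /*num_osds*/ 600, /*replicas*/ 3))
        std::cout << " osd." << osd;   // the first OSD is the primary the client talks to
    std::cout << "\n";
}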
Cooling Down Ceph
Ceph's Cold Storage Features
Cache Tiering
Erasure Coding
Metadata-aware clients: Semantic Pool Selection, Metadata for Later Placement
Striper Prefix Hashing: Extra Placement Information
Redirection and Stubbing: Object Redirects, Object Stubs
Object Store Backend to Archive System
Journal Cache
Object Stubs
Object data is moved to an Archive System; a small stub remains in Ceph
Implementation
As part of the OSD
Transparent to the clients
New RADOS operations: Stub, Unstub
Implicit unstub
New RADOS Ops: Stub, Unstub
Stub: data moves OUT to the stub server, SET xattr with its URL
Unstub: data moves back IN from the stub server, RM the URL xattr
Implementation: Stub, Unstub
STUB: Read object → HTTP PUT → TRUNCATE 0 → SETXATTR "stub_uri" = url
UNSTUB: GETXATTR "stub_uri" → HTTP GET url → WRITEFULL object → RMXATTR "stub_uri" → HTTP DELETE url
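The same two flows as a compilable sketch. The Object struct, the in-memory archive map and archive_put are illustrative stand-ins for the OSD object, the HTTP archive and its PUT/GET/DELETE calls; they are not Ceph code.

// Sketch of the stub / unstub flows against an "archive" keyed by URL.
#include <iostream>
#include <map>
#include <string>

struct Object {                                  // minimal stand-in for one RADOS object
    std::string data;
    std::map<std::string, std::string> xattrs;
};

std::map<std::string, std::string> archive;      // pretend HTTP archive: URL -> body
int next_id = 0;

std::string archive_put(const std::string& body) {   // stands in for HTTP PUT
    std::string url = "http://archive/obj/" + std::to_string(next_id++);
    archive[url] = body;
    return url;
}

void stub(Object& o) {
    std::string url = archive_put(o.data);       // copy data to the archive
    o.data.clear();                              // TRUNCATE 0: free local space
    o.xattrs["stub_uri"] = url;                  // SETXATTR "stub_uri" -> url
}

void unstub(Object& o) {
    auto it = o.xattrs.find("stub_uri");
    if (it == o.xattrs.end()) return;            // not stubbed, nothing to do
    std::string url = it->second;                // GETXATTR "stub_uri"
    o.data = archive[url];                       // HTTP GET: restore the full data (WRITEFULL)
    o.xattrs.erase(it);                          // RMXATTR "stub_uri"
    archive.erase(url);                          // HTTP DELETE: drop the archive copy
}

int main() {
    Object o{std::string(4 << 20, 'x'), {}};     // a 4 MiB object
    stub(o);
    std::cout << "stubbed, local bytes: " << o.data.size() << "\n";
    unstub(o);
    std::cout << "unstubbed, local bytes: " << o.data.size() << "\n";
}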
Implicit Unstub
Table B.1 (Appendix B) lists all Ceph operations and whether they need the object data. Operations that need the object data, e.g. read, sync_read, sparse-read, write, writefull, append, truncate, zero, clonerange, copy-from, copy-get and the cache-flush/evict operations, trigger an implicit unstub. Metadata-only operations, e.g. stat, getxattr/setxattr, the omap operations, watch/notify and the locking operations, run on the stub directly.
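A tiny sketch of the dispatch idea behind the table: the op set below is a hand-picked subset of Table B.1, and needs_object_data is an illustrative helper, not the OSD's actual check.

// Ops that need the object payload trigger an implicit unstub before execution.
#include <iostream>
#include <set>
#include <string>

bool needs_object_data(const std::string& op) {
    static const std::set<std::string> data_ops = {
        "read", "sync_read", "sparse-read", "write", "writefull",
        "append", "truncate", "zero", "clonerange", "copy-get", "copy-from",
    };
    return data_ops.count(op) > 0;
}

int main() {
    for (const std::string op : {"read", "stat", "getxattr", "writefull"})
        std::cout << op << (needs_object_data(op)
                            ? " -> implicit unstub first\n"
                            : " -> runs on the stub as-is\n");
}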
Benefits
Supports links to external storage systems
Stubbed snapshots => backup
Striper Prefix Hashing
Implementation
pg_pool_t::hash_key converts an object ID (object_t oid) and object locator (object_locator_t loc, which contains the pool ID) to a placement group (pg_t pg). The full source code of the implementation, including the simulator, is published on GitHub. Listing C.1 shows the original implementation; Listing C.2 shows the implementation with Striper Prefix Hashing.

Listing C.1: Original hash_key implementation

uint32_t pg_pool_t::hash_key(const string& key,
                             const string& ns) const
{
  string n = make_hash_str(key, ns);
  return ceph_str_hash(object_hash, n.c_str(), n.length());
}

The changed method cuts off the object's name inkey at the first dot found in the string, which happens to be the sequence number part for object names generated by the Ceph Striper.

Listing C.2: Changed hash_key implementation

uint32_t pg_pool_t::hash_key(const string& inkey,
                             const string& ns) const
{
  string key(inkey);
  if (flags & FLAG_HASHPSONLYPREFIX) {
    string::size_type n = inkey.find(".");
    if (n != string::npos) {
      key = inkey.substr(0, n);
    }
  }
  string n = make_hash_str(key, ns);
  return ceph_str_hash(object_hash, n.c_str(), n.length());
}
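For illustration, a standalone sketch of the effect on striper-generated names of the form <prefix>.<sequence number>. std::hash and the modulo fold stand in for ceph_str_hash/rjenkins and the real PG mask, and the object IDs and pg_num are made up.

// With the prefix cut, every stripe object of a file hashes to the same PG;
// the default hash scatters them across PGs.
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>

uint32_t hash_key(const std::string& inkey, bool prefix_only, uint32_t pg_num) {
    std::string key = inkey;
    if (prefix_only) {
        std::string::size_type n = inkey.find(".");
        if (n != std::string::npos)
            key = inkey.substr(0, n);            // keep only the part before the first dot
    }
    return static_cast<uint32_t>(std::hash<std::string>{}(key)) % pg_num;
}

int main() {
    const uint32_t pg_num = 128;
    for (const std::string oid : {"10000000abc.00000000",
                                  "10000000abc.00000001",
                                  "10000000abc.00000002"}) {
        std::cout << oid
                  << "  default pg=" << hash_key(oid, false, pg_num)
                  << "  prefix pg="  << hash_key(oid, true, pg_num) << "\n";
    }
}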
Methodology
ECMWF ECFS HPSS dump [1]: 137 million files, 14.8 PiB
10% random sample
Simulator
[1] M. Grawinkel, L. Nagel, M. Mäsker, F. Padua, A. Brinkmann, and L. Sorth. Analysis of the ECMWF Storage Landscape. Proc. of the 13th USENIX Conference on File and Storage Technologies (FAST), 2015.
Workload
Table 5.1: ECFS HPSS dump properties (10% random sample)
Total #files: 14,950,112
Total used capacity: 1.595 PiB
Max file size: 32 GiB
Size of objects ≤ 4 MiB: 5.495 TiB
Size of objects > 4 MiB: 1.590 PiB

Figure 5.2: Histogram of the file size distribution; each bucket is inclusive, e.g. 0 < x ≤ 4 MiB. Share of files per bucket: 0-4 MiB 47.05%, 4-8 MiB 7.98%, 8-16 MiB 10.53%, 16-32 MiB 6.97%, 32-64 MiB 6.91%, 64-128 MiB 6.25%, 128-256 MiB 6.00%, 256-512 MiB 3.82%, 512 MiB-1 GiB 2.22%, 1-2 GiB 1.41%, 2-4 GiB 0.47%, 4-8 GiB 0.29%, 8-16 GiB 0.09%, 16-32 GiB 0.01%.
Simulator
600 OSDs, 38,400 PGs
45 minutes on 32 cores

Table 5.2: Benchmark settings
Run  Hash Algorithm  Prefix Hash Enabled?
1    RJenkins        No
2    Linux           No
3    RJenkins        Yes
4    Linux           Yes

Simulating Ceph's placement decisions by running a whole Ceph cluster with CephFS was quickly found to be infeasible in terms of runtime. For this test, the OSD, MON and MDS components ran on a single server; the OSDs ran on top of a FUSE overlay filesystem called blackholefs, specially designed to intercept writes. A special simulator running only the bare minimum of Ceph code was used instead.
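A sketch of what the simulator measures: split each file into 4 MiB stripe objects, map each object to a primary OSD, and count distinct primary OSDs per file, with and without prefix hashing. std::hash and the modulo mapping are simplified stand-ins for rjenkins and CRUSH; the PG and OSD counts come from the benchmark settings above, while the file size and object prefix are made up.

#include <cstdint>
#include <functional>
#include <iostream>
#include <set>
#include <string>

const uint32_t kPgNum = 38400, kNumOsds = 600, kObjSize = 4u << 20;

uint32_t to_primary_osd(const std::string& oid, bool prefix_only) {
    std::string key = prefix_only ? oid.substr(0, oid.find(".")) : oid;
    uint32_t pg = static_cast<uint32_t>(std::hash<std::string>{}(key)) % kPgNum;
    return pg % kNumOsds;                        // stand-in for the CRUSH mapping
}

// Count how many distinct primary OSDs hold the stripe objects of one file.
size_t distinct_osds(const std::string& prefix, uint64_t file_size, bool prefix_only) {
    std::set<uint32_t> osds;
    uint64_t num_objs = (file_size + kObjSize - 1) / kObjSize;
    for (uint64_t i = 0; i < num_objs; ++i)
        osds.insert(to_primary_osd(prefix + "." + std::to_string(i), prefix_only));
    return osds.size();
}

int main() {
    uint64_t file_size = 512ull << 20;           // a 512 MiB file -> 128 stripe objects
    std::cout << "default: "      << distinct_osds("10000000abc", file_size, false)
              << " OSDs, prefix hash: " << distinct_osds("10000000abc", file_size, true)
              << " OSD(s)\n";
}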
Does it work? Distinct OSDs per file
Table 5.3 shows statistics for the number of distinct primary OSDs of all files in the workload that are larger than 4 MiB, because the number of OSDs per file is always 1 for files with only one object. Appendix A contains more detailed data and values for the plots shown here.

Table 5.3: Distinct primary OSDs per file for files larger than 4 MiB
Statistic  RJenkins  Linux dcache  RJenkins+Prefix  Linux+Prefix
Min.       1         1             1                1
Q1         3         3             1                1
Median     9         9             1                1
Q3         35        35            1                1
Max.       600       600           1                1

As evident from the table, the Striper Prefix Hash runs result in 1 OSD per file.
Balance
In an ideal, balanced Ceph cluster all OSDs store the same amount of data and objects. Figures 5.4 and 5.5 illustrate the number of objects per primary OSD and the size of data per primary OSD, respectively. Both measures are similar, but the size plot also takes objects not filled to the maximum into account.

Figure 5.4: Number of objects per primary OSD (RJenkins, Linux dcache, RJenkins+Prefix, Linux+Prefix)
Figure 5.5: Combined object size per primary OSD (RJenkins, Linux dcache, RJenkins+Prefix, Linux+Prefix)
Recap
Ceph
Cooling Down Ceph
Implementation and Evaluation: Object Stubs, Striper Prefix Hashing
Cooling Down Ceph: Exploration and Evaluation of Cold Storage Techniques
Bonus Slides
Striper
Splits a File, Block Device Image, etc. into Blocks
Blocks are grouped into Stripes
Stripes are written across the Objects of an Object Set
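As a sketch of the arithmetic behind the diagram: given stripe_unit, stripe_count and object_size (the layout parameters the Striper works with), this maps a file byte offset to the object and object offset it lands in. The concrete layout values and offsets are only examples, not a particular pool's settings.

#include <cstdint>
#include <iostream>

struct Layout { uint64_t stripe_unit, stripe_count, object_size; };
struct Location { uint64_t object_no, object_off; };

Location locate(uint64_t off, const Layout& l) {
    uint64_t block      = off / l.stripe_unit;            // which stripe unit (block)
    uint64_t stripe     = block / l.stripe_count;         // which stripe
    uint64_t stripe_pos = block % l.stripe_count;         // column within the stripe
    uint64_t per_object = l.object_size / l.stripe_unit;  // stripe units per object
    uint64_t object_set = stripe / per_object;            // which object set
    uint64_t object_no  = object_set * l.stripe_count + stripe_pos;
    uint64_t object_off = (stripe % per_object) * l.stripe_unit + off % l.stripe_unit;
    return {object_no, object_off};
}

int main() {
    Layout l{1u << 20 /*1 MiB unit*/, 2 /*objects per stripe*/, 4u << 20 /*4 MiB objects*/};
    for (uint64_t off : {0ull, 1ull << 20, 5ull << 20, 9ull << 20}) {
        Location loc = locate(off, l);
        std::cout << "byte " << off << " -> object " << loc.object_no
                  << " @ offset " << loc.object_off << "\n";
    }
}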