Storage strategy and cloud storage evaluations at CERN
Dirk Duellmann, CERN IT (with slides from Andreas Peters and Jan Iven)
5th International Conference "Distributed Computing and Grid-technologies in Science and Education", July 16-21, 2012, Dubna, Russia
Outline
- CERN storage strategy
  - Disk vs Archive decoupling
  - EOS deployment status at CERN
  - Recent developments for CASTOR and EOS
- Cloud storage evaluation
  - Why cloud storage and protocols?
  - Test plans for two S3 implementations: OpenStack/Swift and the Openlab collaboration with Huawei
  - Preliminary test results
Model change from HSM to more loosely coupled data tiers
(diagram: the CASTOR tape archive tier decoupled from the EOS disk tier)
EOS Design Targets
- Project start: April 2010
- Initial focus: user analysis at CERN
  - many individual users with chaotic work patterns
  - many small output files, large shared read-only input files
  - often only partial file access: many file seeks over uninteresting input events or branches
- Using xroot as client/server framework
  - in-memory namespace (no database)
  - availability via file-level replication (configurable)
  - reduced operational effort at large volume scale
Main Choices taken in EOS
EOS Deployment Status Today
(charts: growth of ATLAS EOS vs ATLAS CASTOR and CMS EOS vs CMS CASTOR, mid-2011 to early 2012)
Installation for LHCb is still under test.
Deployment & Support Evolution
- Operation effort: EOS 1.5 FTE, CASTOR 3.3 FTE
- Similar number of hardware exceptions
  - spike caused by double-counting of low-level errors in EOS, later correlated into one ticket
- Total number of operator tickets stayed stable
- Significant decline in user tickets
(charts: incidents per instance (EOSALICE, EOSCMS, EOSATLAS, C2PUBLIC, C2LHCB, C2CMS, C2ATLAS, C2ALICE), Jun 2011 to Feb 2012; IT CM operator tickets (automatic and user/GGUS), Apr 2011 to May 2012)
CASTOR developments
- Successful migration from LSF-based I/O scheduling to the new TransferManager
  - removes the previous access-rate limitation for scheduled access in CASTOR
  - reduces the deployment complexity of the CASTOR LSF setup at CERN
  - reduces license costs for all CASTOR sites
- Tested in close collaboration with ATLAS as the first user of the new system
- Deployed first at CERN and ASGC, recently also at the RAL Tier-1
- Relatively smooth migration, given the key role of this component in CASTOR
CASTOR Tape highlights
- 62 PB on tape, 52k tapes, 9 libraries, 80 production drives (+20 legacy)
- Beta-tested, validated and deployed IBM TS1140 (4 TB) and Oracle T10000C (5 TB) drives
- Boosted tape writing speed by developing and deploying buffered tape marks (avoiding head repositioning): factor 10x achieved in 1 year
- Introduced "traffic lights" and "bus lanes" for prioritising bulk read requests, reducing tape mounts by ~50%
- Investigating suitability of commodity equipment (aka LTO)
- Active verification of archive contents by re-reading tapes and comparing checksums
  - all newly filled tapes and "dusty" (not recently mounted) tapes
- Cf. CHEP2012 posters 415 (S. Murray) and 247 (G. Cancio)
EOS - Recent Developments
- Recently released EOS 0.2.2-1 (GPL license)
- Update to recent xroot 3.2
  - stress testing the new xroot client inside the EOS setup
- Monitoring: additional metrics and aggregates
  - traffic stats by application or domain
  - UDP fan-out in JSON format, e.g. for the popularity service
- Support for an electronic operator's log book to annotate admin commands for later reference
- Import/export handler: low-overhead import/export via xroot, {S3}
EOS Recent Developments - Additional Redundancy Options
- Goal: less expensive disk archive storage with EOS
- Distribute data chunks with checksums to many disks and add redundancy blocks
- Redundant Array of Inexpensive Nodes (RAIN)
(diagram: a file split into blocks B1-B8, each stored with a header and CRC32C block checksum on different nodes, plus parity blocks P1-P4)
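To make the idea of redundancy blocks concrete, the following is a minimal sketch, not the actual EOS layout: it stripes a buffer into data chunks and adds a single XOR parity chunk (RAID-5-like), which is enough to rebuild any one lost chunk. The EOS RAIN layouts use stronger Reed-Solomon style encodings (see the following slides); all names here are illustrative.

```python
# Minimal single-parity striping sketch (RAID-5-like); EOS RAIN layouts use
# stronger Reed-Solomon style encodings, this only illustrates the principle.

def split_chunks(data, n_data):
    """Pad and split a byte string into n_data equally sized chunks."""
    chunk_len = (len(data) + n_data - 1) // n_data
    data = data.ljust(chunk_len * n_data, b'\0')
    return [bytearray(data[i * chunk_len:(i + 1) * chunk_len])
            for i in range(n_data)]

def xor_parity(chunks):
    """Compute one parity chunk as the byte-wise XOR of all data chunks."""
    parity = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, byte in enumerate(chunk):
            parity[i] ^= byte
    return parity

if __name__ == '__main__':
    chunks = split_chunks(b'some file content' * 100, n_data=4)
    parity = xor_parity(chunks)
    # Simulate losing chunk 2: it can be rebuilt from the survivors + parity.
    lost = chunks[2]
    rebuilt = xor_parity([c for i, c in enumerate(chunks) if i != 2] + [parity])
    assert rebuilt == lost
```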
Checksumming Performance
- Compared different algorithms; CRC32C chosen
- Measured on an 8-core Xeon 2.27 GHz with 12x4 GB DDR3 1 GHz RAM
- Linear scaling with the number of cores
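As a rough illustration of such an algorithm comparison, the sketch below times a few common checksum routines over the same buffer. It assumes the third-party crcmod package for CRC32C; hardware-accelerated (SSE4.2) CRC32C implementations, as relevant for EOS, would be considerably faster than this pure-software version.

```python
# Rough single-core checksum comparison; assumes the third-party 'crcmod'
# package for CRC32C. Not the benchmark behind the slide, just a sketch.
import time
import zlib
import hashlib
import crcmod.predefined

crc32c = crcmod.predefined.mkPredefinedCrcFun('crc-32c')

def throughput(name, func, buf, repeat=20):
    """Print an approximate MB/s figure for one checksum routine."""
    start = time.time()
    for _ in range(repeat):
        func(buf)
    elapsed = time.time() - start
    print('%-8s %6.0f MB/s' % (name, repeat * len(buf) / elapsed / 1e6))

if __name__ == '__main__':
    buf = b'\xab' * (16 * 1024 * 1024)          # 16 MB test buffer
    throughput('adler32', zlib.adler32, buf)
    throughput('crc32',   zlib.crc32,   buf)
    throughput('crc32c',  crc32c,       buf)
    throughput('md5',     lambda b: hashlib.md5(b).digest(), buf)
```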
Integration of Tuneable Redundancy
- EOS is based on XRootD as framework
- Plug-ins for implementing the file and directory I/O interfaces already exist (OFS plug-in, file layout plug-in, 4k block checksums with CRC32C on top of XFS)
- A file layout plug-in on either the client or the server side provides access to the distributed file fragments
- Investigating the impact of both options together with new encoding schemes
(diagram: XRootD client with file layout plug-in talking to XRootD servers, each with OFS plug-in, file layout plug-in, 4k block CRC32C checksums, XFS, controller and HDDs)
Measured Upload Performance
- File upload benchmark using eoscp with the RAIN I/O driver and a localhost XRootD server, parallel threads, (n,k) Reed-Solomon encoding
- Result: encoding/decoding can run within the available CPU budget without significant impact on throughput
Cloud Storage Evaluation
What is cloud storage anyway?
- Storage used by jobs running in a computational cloud
  - network latency determines what is usable, depending on the type of application
- Storage built from (computational) cloud node resources
  - storage lifetime = lifetime of the node
- Storage services by commercial (or private) providers
  - exploiting similar scaling concepts as computational clouds
  - clustered storage with remote access via a cloud protocol and modified storage semantics
- The term may be valid for all of the above, but here it refers to the last group of storage solutions
Why Cloud Storage & the S3 Protocol?
- Cloud computing and storage are rapidly gaining popularity, both as private infrastructure and as commercial services
- Several investigations are taking place in the HEP and broader science community
- The price point of commercial offerings may not (yet?) be comparable with services run at CERN or WLCG sites, but changes in semantics, protocols and deployment model promise increased scalability with reduced deployment complexity (TCO)
- The market is growing rapidly and we need to understand whether these promises can be confirmed with HEP workloads
- Need to understand how cloud storage will integrate with (or change) current HEP computing models
S3 Semantics and Protocol
- Simple Storage Service (Amazon S3): just a storage service
  - in contrast to e.g. Hadoop, which comes with a distributed computation model exploiting data locality
- Uses a language-independent REST API over http(s)
- Provides additional scalability by focussing on a defined subset of POSIX functionality and by partitioning the namespace into independent buckets
- The S3 protocol alone cannot provide scalability, e.g. if added on top of a traditional storage system
- Scalability gains need to be proven for each S3 implementation
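As a minimal illustration of the REST/bucket model, the sketch below uses the Python boto library to create a bucket and put/get an object against a private S3-compatible endpoint; the host name, credentials and object names are placeholders, not a real CERN service.

```python
# Minimal S3 put/get sketch using boto (2.x); endpoint and credentials are
# placeholders for a private S3-compatible service, not a real CERN endpoint.
from boto.s3.connection import S3Connection, OrdinaryCallingFormat

conn = S3Connection(aws_access_key_id='ACCESS_KEY',
                    aws_secret_access_key='SECRET_KEY',
                    host='s3.example.org',        # placeholder private endpoint
                    is_secure=False,
                    calling_format=OrdinaryCallingFormat())

# Buckets partition the namespace into independent units.
bucket = conn.create_bucket('test-bucket')

# Objects are written and read as whole entities via HTTP PUT/GET.
key = bucket.new_key('dir/like/prefix/object1')
key.set_contents_from_string(b'hello cloud storage')
print(bucket.get_key('dir/like/prefix/object1').get_contents_as_string())
```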
CERN Interest in Cloud Storage
- Main interest: scalability and TCO
  - can we run cloud storage systems to complement or consolidate existing storage services?
- Focus: storage for physics and infrastructure
  - analysis disk pools, home directories, virtual machine image storage
  - which areas of the storage phase space can be covered well?
- First steps: set up and run a cloud storage service at PB scale; confirm scalability and/or deployment gains
Potential Interest for WLCG
- The S3 protocol could be a standard interface for access, placement or federation of physics data
- Allows providing (or buying) storage services without changes to user applications
  - large sites may provide private cloud storage on acquired hardware
  - smaller sites may buy S3 or rent capacity on demand
- First steps
  - successful deployment at one site (e.g. CERN)
  - demonstrate data distribution across sites (S3 implementations) according to experiment computing models
Component Layering in current SEs
- Protocol layer: local & WAN efficiency, federation support, identity & role mapping
  - random client I/O via rfio/xroot, sequential point-to-point put/get via gridftp
  - Grid-to-local user & role mapping and DM admin access via SRM / BeStMan
- Cluster layer: scaling for large numbers of concurrent clients
  - clustering & scheduling, {replicated} file cluster catalogue & metadata DB (CASTOR / DPM, EOS)
- Media layer: stable, manageable storage, scaling in volume per $ (including ops effort)
  - {reliable} media on (RAIDed) disk servers, or raw media on JBODs
Potential Future Scenarios
- Integrate the cloud storage market: buy, or build and integrate via a standard protocol
  - protocol layer: xroot, http, S3
  - cluster layer: EOS or cloud storage
  - media layer: cloud storage
- Need to prove the S3 TCO gains; is S3 alone functionally sufficient?
Common Work Items
Common items for the OpenStack/Swift and Huawei systems:
- Define the main I/O patterns of interest, based on measured I/O patterns in current storage services (e.g. archive, analysis pool, home dir, grid home?)
- Define, implement and test an S3 functionality test
  - define the S3 API areas of main importance
- Develop an S3 stress/scalability test (a minimal concurrency sketch follows below)
  - scale up to several hundred concurrent clients (planned for August)
  - copy-local and remote access scenarios
- Define key operational use cases and classify human effort and resulting service unavailability
  - add/remove disk servers (incl. draining)
  - h/w intervention by vendor (does not apply to Huawei)
  - s/w upgrade
  - power outage
- Compare performance and TCO metrics to existing services at CERN
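A minimal sketch of such a concurrency test is shown below, assuming boto and a placeholder endpoint: each client thread repeatedly uploads and downloads a fixed-size object and the aggregate rate is reported at the end. The actual test suite is more elaborate; this only outlines the structure.

```python
# Minimal concurrent S3 load sketch using boto and threads; endpoint,
# credentials and sizes are placeholders, not the actual CERN test suite.
import time
import threading
from boto.s3.connection import S3Connection, OrdinaryCallingFormat

ENDPOINT = 's3.example.org'     # placeholder private S3 endpoint
N_CLIENTS = 32                  # scaled towards several hundred on real hardware
OBJ_SIZE = 4 * 1024 * 1024      # 4 MB objects
OPS_PER_CLIENT = 10

def worker(client_id, results):
    conn = S3Connection('ACCESS_KEY', 'SECRET_KEY', host=ENDPOINT,
                        is_secure=False, calling_format=OrdinaryCallingFormat())
    bucket = conn.create_bucket('stress-%d' % client_id)   # independent buckets
    payload = b'x' * OBJ_SIZE
    start = time.time()
    for i in range(OPS_PER_CLIENT):
        key = bucket.new_key('obj-%d' % i)
        key.set_contents_from_string(payload)   # upload
        key.get_contents_as_string()            # download
    results[client_id] = time.time() - start

if __name__ == '__main__':
    results = {}
    threads = [threading.Thread(target=worker, args=(i, results))
               for i in range(N_CLIENTS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    total_bytes = 2.0 * N_CLIENTS * OPS_PER_CLIENT * OBJ_SIZE
    print('aggregate throughput: %.1f MB/s'
          % (total_bytes / max(results.values()) / 1e6))
```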
Work Items - Huawei Appliance
- Support commissioning of the Huawei storage system
  - 0.8 PB system in place at CERN
- Share the CERN test suite and results with Huawei
  - tests (including ROOT-based ones) are regularly run by the Huawei development team
- Perform continuous functional and performance tests to validate new Huawei s/w releases
  - maintain a list of unresolved operational or performance issues
- Schedule and execute large-scale stress tests
  - S3 test suite
  - HammerCloud with experiment applications
Work Items - OpenStack/Swift
- Actively participate in fixing CERN issues with the released software
  - build up internal knowledge about the Swift implementation and its defects
  - probe our ability to contribute to and influence the release content of the OpenStack foundation
- Run the same functionality and stress tests as for the Huawei system (see the Swift-native access sketch below)
- Visit larger-scale Swift deployments
  - get in-depth knowledge about the level of overlap between the open software and in-house additions/improvements
  - compare I/O patterns in typical deployments with CERN services
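Besides the S3 compatibility layer, Swift can also be exercised through its native API. The sketch below assumes the python-swiftclient library and a test proxy with the stock tempauth demo credentials; all values are placeholders for an actual deployment.

```python
# Minimal Swift-native put/get sketch assuming python-swiftclient; the proxy
# URL and tempauth credentials are the usual demo defaults, i.e. placeholders.
from swiftclient import client as swift

conn = swift.Connection(authurl='http://swift-proxy.example.org:8080/auth/v1.0',
                        user='test:tester', key='testing', auth_version='1')

conn.put_container('functests')
conn.put_object('functests', 'object1', contents=b'hello swift')

headers, body = conn.get_object('functests', 'object1')
print(headers.get('etag'), len(body))
```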
Ongoing tests
- Data upload/download with 4 kB to 100 MB files
  - extending to larger files (1 GB, 10 GB) to investigate the impact of data splitting on the storage side
- Multiple buckets: confirm the scale-out via independent metadata
- Number of clients: tested up to 200 concurrent clients (limited by client h/w availability)
- Multi byte-range reads (vector reads): extended the ROOT S3 plugin to support this (a byte-range sketch follows below)
- Fill a 2 x 10 Gb network connection
- Graceful behaviour in overload conditions?
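A vector read can be emulated over S3 by issuing one HTTP Range request per requested chunk; the sketch below does this with boto's per-request headers. The bucket/object names are placeholders, and this is not the actual ROOT S3 plugin code.

```python
# Emulate a ROOT-style vector read as a sequence of HTTP Range GETs via boto;
# names are placeholders and this is not the actual ROOT S3 plugin.
from boto.s3.connection import S3Connection, OrdinaryCallingFormat

conn = S3Connection('ACCESS_KEY', 'SECRET_KEY', host='s3.example.org',
                    is_secure=False, calling_format=OrdinaryCallingFormat())
key = conn.get_bucket('analysis-data').get_key('events.root')

def vector_read(key, chunks):
    """Read a list of (offset, length) chunks, one Range request each."""
    out = []
    for offset, length in chunks:
        rng = 'bytes=%d-%d' % (offset, offset + length - 1)
        out.append(key.get_contents_as_string(headers={'Range': rng}))
    return out

# e.g. three small, scattered reads typical of partial-file analysis access
pieces = vector_read(key, [(0, 1024), (1 << 20, 4096), (16 << 20, 4096)])
print([len(p) for p in pieces])
```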
Preliminary Results
- OpenStack/Swift and Huawei reach similar performance (10-20% less) to EOS for full-file access with a small to moderate number of clients (O(100))
- Analysis-type access using the ROOT S3 plugin
  - naive use (no TTreeCache) of both S3 implementations shows significant overhead
  - with the cache and vector reads enabled, this overhead is removed (see the sketch below)
- S3fs (= FUSE-mounted S3 storage) almost reaches the same performance for jobs accessing 10-100% of a file, assuming local cache space (/tmp) is available
- Authentication and authorisation are not yet mapped from the certificates used in WLCG
- Plan to publish a more quantitative comparison at the autumn HEPiX
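For reference, enabling the TTree cache in a ROOT analysis amounts to a couple of calls on the tree before the event loop; the sketch below shows this in PyROOT with placeholder file and tree names, and the same calls apply whether the file is opened locally or through the S3 plugin.

```python
# Enabling TTreeCache in PyROOT; file URL and tree/branch names are
# placeholders, the same calls apply when reading through the ROOT S3 plugin.
import ROOT

f = ROOT.TFile.Open('local_copy_or_s3_url.root')   # placeholder URL
tree = f.Get('Events')                              # placeholder tree name

tree.SetCacheSize(30 * 1024 * 1024)   # 30 MB read cache
tree.AddBranchToCache('*', True)      # cache all branches that are read
tree.SetCacheLearnEntries(10)         # let ROOT learn the access pattern

for i in range(tree.GetEntries()):
    tree.GetEntry(i)                  # reads are now grouped into vector reads
```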
Summary
- The strategy of decoupling the archive from disk storage has been implemented at CERN, reducing the total deployment effort and the interference impact for experiment users
- CASTOR: tape improvements and simplified scheduling have further consolidated the archive area
- EOS: automation of deployment tasks and additional redundancy options are being completed
- The cloud storage evaluation with two S3 implementations started at CERN this year
  - performance of local S3-based storage looks so far comparable to the current production system
  - WAN replication and O(1000) client tests are coming up
  - a realistic TCO estimate cannot yet be made on a small (1 PB) test system without real user access