Big Science and Big Data Dirk Duellmann, CERN Apache Big Data Europe 28 Sep 2015, Budapest, Hungary
The ATLAS experiment: 7000 tons, 150 million sensors generating data 40 million times per second, i.e. a petabyte/s
Data Collection and Archiving at CERN
Data flow to permanent storage: 4-6 GB/sec
- LHCb: 200-400 MB/sec
- ATLAS: 1-2 GB/sec
- ALICE: 4 GB/sec
- CMS: 1-2 GB/sec
The Worldwide LHC Computing Grid
An international collaboration to distribute and analyse LHC data. It integrates computer centres worldwide that provide computing and storage resources into a single infrastructure accessible by all LHC physicists.
- Tier-0 (CERN): data recording, reconstruction and distribution
- Tier-1: permanent storage, reprocessing, analysis
- Tier-2: simulation, end-user analysis
Nearly 170 sites in 40 countries, ~350 000 cores, 500 PB of storage, > 2 million jobs/day, 10-100 Gb links
LHC Big Data: a few PB of raw data become ~100 PB!
- Duplicate raw data
- Simulated data
- Derived data products
- Versions as software improves
- Replicas to allow access by more physicists
How do we store/retrieve LHC data? A short history
1st try: all data in a commercial object database (1995)
- A good match for the complex data model and OO language integration
- But the market predicted by many analysts did not materialise!
2nd try: all data in a relational DB with object-relational mapping (1999)
- PB-scale deployment was far from being proven
- Users code in C++ and rejected data model definition in SQL
Hybrid between RDBMS and structured files (from 2001 to today)
- Relational DBs for transactional management of metadata (only TB-scale): file/dataset metadata, conditions, calibration, provenance and workflow via a DB abstraction (plugins: Oracle, MySQL, SQLite, Frontier/SQUID)
- Open source persistency framework (ROOT): uses C++ introspection to store/retrieve networks of C++ objects, with a column store for efficient sparse reading
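As a rough illustration of the relational side of this hybrid model, the sketch below keeps per-file dataset metadata in SQLite, one of the plugin backends listed above. The table layout and column names are invented for this example and do not reflect the actual CERN schemas.

```python
# Minimal sketch: transactional file/dataset metadata in a relational backend.
# The schema and column names are hypothetical, chosen only to illustrate the
# kind of bookkeeping (dataset membership, provenance, software version) that
# stays in the TB-scale RDBMS layer while the bulk data lives in ROOT files.
import sqlite3

conn = sqlite3.connect("metadata.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS files (
        lfn        TEXT PRIMARY KEY,   -- logical file name
        dataset    TEXT NOT NULL,      -- dataset this file belongs to
        size_bytes INTEGER,
        sw_version TEXT,               -- software version that produced it
        parent_lfn TEXT                -- provenance: input file, if any
    )
""")

with conn:  # one transaction: either both rows are recorded or neither
    conn.executemany(
        "INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?, ?)",
        [
            ("raw/run001/evt_0001.root", "data15.physics_Main",
             2_500_000_000, "rel-20.1", None),
            ("derived/run001/daod_0001.root", "data15.DAOD",
             300_000_000, "rel-20.7", "raw/run001/evt_0001.root"),
        ],
    )

# Typical query: all derived files produced from a given raw file
for row in conn.execute("SELECT lfn, dataset FROM files WHERE parent_lfn = ?",
                        ("raw/run001/evt_0001.root",)):
    print(row)
```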
Processing a TTree
[Diagram: the TSelector workflow and the TTree branch/leaf structure]
- Begin(): create histograms, define the output list
- Process(): called in the loop over events; apply the preselection, then the analysis
- Terminate(): finalize the analysis (fitting, ...)
- A TTree is organised into branches and leaves, so only the needed parts are read
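A minimal PyROOT sketch of this column-store access pattern is given below; the file name "events.root", the tree name "Events" and the branch "pt" are placeholders, not names from the talk.

```python
# Minimal sketch of ROOT's sparse (per-branch) reading from Python.
# File, tree and branch names are placeholders.
import ROOT

f = ROOT.TFile.Open("events.root")
tree = f.Get("Events")

# Column-store access: disable all branches, then enable only the one we need,
# so only that column is read from disk during the event loop.
tree.SetBranchStatus("*", 0)
tree.SetBranchStatus("pt", 1)

h = ROOT.TH1F("h_pt", "Transverse momentum;p_{T} [GeV];events", 100, 0.0, 200.0)
for event in tree:          # loop over events; only enabled branches are read
    h.Fill(event.pt)

print("entries:", h.GetEntries())
f.Close()
```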
CERN Disk Storage Overview

                 AFS     CASTOR        EOS      Ceph        NFS     CERNBox
  Raw capacity   3 PB    20 PB         140 PB   4 PB        200 TB  1.1 PB
  Data stored    390 TB  86 PB (tape)  27 PB    170 TB      36 TB   35 TB
  Files stored   2.7 B   300 M         284 M    77 M (obj)  120 M   14 M

- AFS is CERN's Linux home directory service
- CASTOR & EOS are mainly used for the physics use case (data analysis and DAQ)
- Ceph is our storage backend for images and volumes in OpenStack
- NFS is mainly used by engineering applications
- CERNBox is our file synchronisation service based on ownCloud+EOS
Tape at CERN
- Data volume: 100 PB physics archive, 7 PB backup (TSM)
- Tape libraries: 3+2 x IBM TS3500, 4 x Oracle SL8500
- Tape drives: 100 for the physics archive, 50 for backup
- Capacity: 70k slots, 30k tapes
[Charts: archive write (27 PB) and archive read volumes]
Archive: large-scale media migration (repacking the LHC Run 1 data)
- Part 1: Oracle T10000D
- Part 2: IBM TS1150
- Deadline: LHC Run 2 start!
Smart vs. Simple Archive: HSM Issues
CASTOR had been designed as a Hierarchical Storage Management (HSM) system:
- Disk-only and multi-pool support were added later, painfully
- The required rates for namespace access and file-open exceeded earlier estimates
Around the LHC start, conceptual issues with the HSM model also became visible:
- A file is not a meaningful granule for managing data exchange; experiments use datasets
- Parts of datasets needed to be pinned on disk by users to avoid cache thrashing
- Users had to trick the HSM into doing the right thing :-(
DSS EOS Project: Goals & Choices
Server, media and file system failures need to be transparently absorbed:
- Key functionality: file-level replication and rebalancing
- Data stays available after a failure, with no human intervention
Fine-grained redundancy within one hardware setup:
- Choose & change the redundancy level for specific data: either a file replica count or erasure encoding (see the overhead sketch below)
Support bulk deployment operations, e.g. replacing hundreds of servers at the end of warranty
In-memory namespace (sparse hash per directory):
- File stat calls 1-2 orders of magnitude faster
- Write-ahead logging for durability
Later in addition: transparent multi-site clustering, e.g. between Geneva and Budapest
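To illustrate the trade-off between replica counts and erasure encoding mentioned above, the short sketch below compares the raw-capacity overhead of a few redundancy layouts; the specific layouts are illustrative and not EOS's actual defaults.

```python
# Illustrative comparison of storage overhead for file replication vs. erasure
# coding. The layouts listed are examples, not EOS's actual configuration.

def overhead(data_stripes: int, parity_stripes: int) -> float:
    """Raw bytes stored per byte of user data."""
    return (data_stripes + parity_stripes) / data_stripes

layouts = {
    "2 replicas":            overhead(1, 1),   # plain copy: survives 1 loss
    "3 replicas":            overhead(1, 2),   # survives 2 losses
    "erasure coding (4+2)":  overhead(4, 2),   # survives 2 losses
    "erasure coding (10+4)": overhead(10, 4),  # survives 4 losses
}

for name, factor in layouts.items():
    print(f"{name:22s} -> {factor:.2f}x raw capacity per byte stored")

# Example: 3 replicas cost 3.00x raw capacity, while a 10+4 erasure-coded
# layout tolerates more failures at only 1.40x, at the price of more complex
# reconstruction after a loss.
```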
Connectivity (100 Gbps): DANTE/GÉANT, T-Systems
EOS Raw Capacity Evolution
Why do we develop our own open source storage software?
A large science community trained to be effective with a set of products:
- The efficiency of this community is our main asset, not just the raw utilisation of CPUs and disks
- Integration and specific support do matter; community sharing via tools and formats even more
Long-term projects:
- A change of vendor/technology is not only likely but expected
- We carry old but valuable data through time (bit preservation)
- Loss of data ownership after the first active project period
Does Kryder's law still hold?
[Chart: areal density CAGR; source: "HDD Opportunities & Challenges, Now to 2020", Dave Anderson, Seagate]
Object Disks
Each disk talks an object storage protocol over TCP:
- Replication/failover with the other disks in a networked disk cluster
- Open access library for app development
Why now? Shingled media come with constrained (object) semantics, e.g. no in-place updates (see the sketch below)
Still at an early stage, with several open questions:
- Does the port price for networking each disk beat the price gain from reduced server/power cost?
- Will standardisation of protocol/semantics allow app development at low risk of vendor lock-in?
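As a rough illustration of what "constrained (object) semantics" means for applications, the toy sketch below mocks a put/get/delete-only key-value interface with no in-place updates; it is a stand-in written for this example, not the API of any real object-disk library.

```python
# Toy illustration of constrained object semantics: whole-object put/get/delete
# only, no in-place updates. This mock is NOT a real object-disk API; it only
# shows the programming model that shingled, network-attached disks push
# applications towards.

class ObjectDiskMock:
    def __init__(self):
        self._store = {}  # key -> immutable object payload

    def put(self, key: str, value: bytes) -> None:
        # Objects are written whole and never modified in place;
        # "updating" means deleting and re-putting a new object.
        if key in self._store:
            raise ValueError("no in-place update: delete and re-put instead")
        self._store[key] = bytes(value)

    def get(self, key: str) -> bytes:
        return self._store[key]

    def delete(self, key: str) -> None:
        del self._store[key]

disk = ObjectDiskMock()
disk.put("run1/evt_0001", b"...event payload...")
print(disk.get("run1/evt_0001"))
# disk.put("run1/evt_0001", b"patched")  # would raise: updates are not allowed
```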
Can we optimise our systems further?
Infrastructure analytics:
- Apply statistical analysis to the complete system: storage, CPU, network, user applications
- Measure/predict the quantitative impact of changes on the real job population
Easy! It looks like a physics analysis, with infrastructure metrics instead of physics data. Really?
Non-trivial, always
Technically:
- Needs consolidated service- and application-side metrics; usually the input is log data written for human consumption, without data design
Conceptually:
- Some established metrics turn out to be less useful for analysing today's hardware than expected: CPU efficiency = t_cpu / t_wall? Storage efficiency = GB/s?
- Correlation does not imply a causal relation
Sociologically:
- Better to observe the rule of local discovery: the people who quantitatively understand the infrastructure are busy running services
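A minimal sketch of the kind of metric computation involved is shown below, assuming a flat CSV of job records with hypothetical column names (t_cpu, t_wall, read_gb); it only illustrates why a single ratio can mislead, and is not based on any actual CERN dataset.

```python
# Minimal sketch: computing per-job "efficiency" metrics from batch logs.
# The file name and column names (t_cpu, t_wall, read_gb) are hypothetical.
import pandas as pd

jobs = pd.read_csv("job_records.csv")

# The classic metrics questioned above.
jobs["cpu_efficiency"] = jobs["t_cpu"] / jobs["t_wall"]   # fraction of wall time on CPU
jobs["read_rate_gbs"] = jobs["read_gb"] / jobs["t_wall"]  # GB/s as seen by the job

print(jobs[["cpu_efficiency", "read_rate_gbs"]].describe())

# Low CPU efficiency may correlate with high I/O volume, but correlation alone
# does not tell us whether storage is the cause: the same signature appears for
# jobs that are simply waiting on external services or on their own logic.
print(jobs["cpu_efficiency"].corr(jobs["read_rate_gbs"]))
```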
Data Collection and Analysis
[Diagram: monitoring JSON files from the eos, lsf and ai sources are periodically extracted, cleaned and loaded into HDFS on a Hadoop MapReduce cluster; users extract small, binary subsets, e.g. per-file EOS records (readbytes: number, filename: string, opentime: time)]
Ramping up: ~100 nodes, ~100 TB of raw logs
In production: Flume, HDFS, MapReduce, Pig, Spark, Sqoop, {Impala}
Current work items:
- Service: availability (e.g. isolation and rolling upgrades)
- Analytics: workbooks, support for popular analysis tools (R/Python/ROOT)
A sketch of this kind of extract-and-aggregate step follows below.
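The PySpark snippet below aggregates read volume per file from JSON monitoring records in HDFS; the HDFS path is a placeholder, the field names (readbytes, filename, opentime) follow the labels visible on the slide, and the rest is illustrative rather than the actual production job.

```python
# Sketch of an extract-and-aggregate step over EOS monitoring records in HDFS.
# The HDFS path is a placeholder; field names follow the slide (readbytes,
# filename, opentime). Not the actual production pipeline.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("eos-read-volume").getOrCreate()

records = spark.read.json("hdfs:///analytics/eos/monitoring/2015/*/*.json")

# Total bytes read and number of opens per file, to spot "popular" data.
per_file = (
    records
    .groupBy("filename")
    .agg(F.sum("readbytes").alias("total_read_bytes"),
         F.count("opentime").alias("n_opens"))
    .orderBy(F.desc("total_read_bytes"))
)

per_file.show(20, truncate=False)
spark.stop()
```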
Summary
- CERN has a long tradition of deploying large-scale storage systems used by a distributed, world-wide science community
- During the first LHC run period we passed the 100 PB mark at CERN and, more importantly, contributed to the rapid confirmation of the Higgs boson and many other LHC results
- For LHC Run 2 we have significantly upgraded and optimised the infrastructure, in close collaboration between service providers and users
- We are adding more quantitative infrastructure analytics to prepare for the High-Luminosity LHC
- CERN is already very active as a user and provider in the open source world, and the overlap with other Big Data communities is increasing
Thank you!