Online data handling with Lustre at the CMS experiment

Similar documents

Big Data and Storage Management at the Large Hadron Collider

NetApp High-Performance Computing Solution for Lustre: Solution Guide

CERN s Scientific Programme and the need for computing resources

13 th Economic Trends Survey of the Architects Council of Europe

1. Perception of the Bancruptcy System Perception of In-court Reorganisation... 4

The Data Quality Monitoring Software for the CMS experiment at the LHC

(Possible) HEP Use Case for NDN. Phil DeMar; Wenji Wu NDNComm (UCLA) Sept. 28, 2015

Netapp HPC Solution for Lustre. Rich Fenton UK Solutions Architect

Energy prices in the EU Household electricity prices in the EU rose by 2.9% in 2014 Gas prices up by 2.0% in the EU

How many students study abroad and where do they go?

187/ December EU28, euro area and United States GDP growth rates % change over the previous quarter

Building a Top500-class Supercomputing Cluster at LNS-BUAP

99/ June EU28, euro area and United States GDP growth rates % change over the previous quarter

ThinkServer PC DDR3 1333MHz UDIMM and RDIMM PC DDR3 1066MHz RDIMM options for the next generation of ThinkServer systems TS200 and RS210

A High-Performance Storage System for the LHCb Experiment Juan Manuel Caicedo Carvajal, Jean-Christophe Garnier, Niko Neufeld, and Rainer Schwemmer

Size and Development of the Shadow Economy of 31 European and 5 other OECD Countries from 2003 to 2015: Different Developments

Funding and network opportunities for cluster internationalization

New Storage System Solutions

41 T Korea, Rep T Netherlands T Japan E Bulgaria T Argentina T Czech Republic T Greece 50.

Current Status of FEFS for the K computer

Report on Government Information Requests

Beyond High Performance Computing: What Matters to CERN

Commoditisation of the High-End Research Storage Market with the Dell MD3460 & Intel Enterprise Edition Lustre

DSS. High performance storage pools for LHC. Data & Storage Services. Łukasz Janyst. on behalf of the CERN IT-DSS group

Waste. Copenhagen, 3 rd September Almut Reichel Project Manager Sustainable consumption and production & waste, European Environment Agency

About us. As our customer you will be able to take advantage of the following benefits: One Provider. Flexible Billing. Our Portal.

COMMUNICATION FROM THE COMMISSION

Reporting practices for domestic and total debt securities

Labour Force Survey 2014 Almost 10 million part-time workers in the EU would have preferred to work more Two-thirds were women

Cisco IT Data Center and Operations Control Center Tour

New Design and Layout Tips For Processing Multiple Tasks

ERASMUS+ MASTER LOANS

ERASMUS+ MASTER LOANS

NetFlow Feature Acceleration

Brochure More information from

CMS Tier-3 cluster at NISER. Dr. Tania Moulik

CISCO PIX SECURITY APPLIANCE LICENSING

First-year experience with the ATLAS online monitoring framework

Scala Storage Scale-Out Clustered Storage White Paper

The safer, easier way to help you pass any IT exams. Exam : Storage Sales V2. Title : Version : Demo 1 / 5

Pendulum Business Loan Brokers L.L.C. U.S. State Market Area

INTERNATIONAL TRACKED POSTAGE SERVICE

Report on Government Information Requests

Computing our Future Computer programming and coding in schools in Europe. Anja Balanskat, Senior Manager European Schoolnet

IAB Europe AdEx Benchmark Daniel Knapp, IHS Eleni Marouli, IHS

UEFA Futsal EURO 2013/14 Preliminary & Main Rounds Draw Procedure

International Call Services

First estimate for 2014 Euro area international trade in goods surplus bn 24.2 bn surplus for EU28

Is the Big Data Experience at CERN useful for other applications?

CISCO CONTENT SWITCHING MODULE SOFTWARE VERSION 4.1(1) FOR THE CISCO CATALYST 6500 SERIES SWITCH AND CISCO 7600 SERIES ROUTER

TOYOTA I_SITE More than fleet management

EMC SYMMETRIX VMAX 10K

TOWARDS PUBLIC PROCUREMENT KEY PERFORMANCE INDICATORS. Paulo Magina Public Sector Integrity Division

DCA QUESTIONNAIRE V0.1-1 INTRODUCTION AND IDENTIFICATION OF THE DATA CENTRE

Improvement Options for LHC Mass Storage and Data Management

Computer Specifications

BEST PRACTICES/ TRENDS/ TO-DOS

Scalable Multi-Node Event Logging System for Ba Bar

Il/network/italiano/ Risorse digitali e strumenti colaborativi per le Scienze del'antichità/ Venezia'3'o*obre'2014' Emiliano Degl Innocenti

Visa Information 2012

World Consumer Income and Expenditure Patterns

GPFS Storage Server. Concepts and Setup in Lemanicus BG/Q system" Christian Clémençon (EPFL-DIT)" " 4 April 2013"

BT Premium Event Call and Web Rate Card

by knürr CoolTherm Server cabinet technology with outstanding benefits up to 35kW cooling capacity Blade-Server server optimized!

Chase Online SM Wire Transfer Help Guide page 1 of 16. How to Send Wire Transfers on Chase Online SM

Genuine BMW Accessories. The Ultimate Driving Machine. BMW Trackstar. tracked. recovered. BMW TRACKSTAR.

Electricity, Gas and Water: The European Market Report 2014

European Research Council

Milan Zoric ETSI

Komatsu Satellite Monitoring System. Control your machine for total peace of mind

Foreign Taxes Paid and Foreign Source Income INTECH Global Income Managed Volatility Fund

Cisco CNS NetFlow Collection Engine Version 4.0

The IntelliMagic White Paper: Storage Performance Analysis for an IBM Storwize V7000

M2M Connectivity T: W:

An Alternative Storage Solution for MapReduce. Eric Lomascolo Director, Solutions Marketing

The STAR Level-3 Trigger System

Sonexion GridRAID Characteristics

Online Performance Monitoring of the Third ALICE Data Challenge (ADC III)

On What Resources and Services Is Education Funding Spent?

Cloud Storage. Parallels. Performance Benchmark Results. White Paper.

Annual report 2009: the state of the drugs problem in Europe

Scientific Storage at FNAL. Gerard Bernabeu Altayo Dmitry Litvintsev Gene Oleynik 14/10/2015

EUF STATISTICS. 31 December 2013

Investigation of storage options for scientific computing on Grid and Cloud facilities

CISCO MDS 9000 FAMILY PERFORMANCE MANAGEMENT

HP Technology Services HP NonStop Server Support

COSC 6374 Parallel Computation. Parallel I/O (I) I/O basics. Concept of a clusters

CMS Computing Model: Notes for a discussion with Super-B

EXECUTIVE SEARCH, INTERIM MANAGEMENT, NON-EXECUTIVE DIRECTORS & CONSULTING

Techniques for implementing & running robust and reliable DB-centric Grid Applications

International Hints and Tips

Business Mobile Plans

Trends in the European Investment Fund Industry. in the Second Quarter of Results for the first half of 2014

IT of SPIM Data Storage and Compression. EMBO Course - August 27th! Jeff Oegema, Peter Steinbach, Oscar Gonzalez

The CMS Tier0 goes Cloud and Grid for LHC Run 2. Dirk Hufnagel (FNAL) for CMS Computing

FTK the online Fast Tracker for the ATLAS upgrade

Supported Payment Methods

Higher education institutions as places to integrate individual lifelong learning strategies

Transcription:

Online data handling with Lustre at the CMS experiment Lavinia Darlea, on behalf of CMS DAQ Group MIT/DAQ CMS September 17, 2015 1 / 29

CERN 2 / 29

CERN CERN was founded 1954: 12 European States Science for Peace Today: 21 Member States ~ 2300 staff ~ 1050 other paid personnel > 11000 users Member States: Austria, Belgium, Bulgaria, the Czech Republic, Denmark, Finland, France, Germany, Greece, Hungary, Italy, Israel, the Netherlands, Norway, Poland, Portugal, Slovakia, Spain, Sweden, Switzerland and the United Kingdom Candidate for Accession: Romania Associate Members in the Pre-Stage to Membership: Serbia Applicant States: Cyprus, Slovenia, Turkey Observers to Council: India, Japan, the Russian Federation, the United States of America, Turkey, the European Commission and UNESCO 3 3 / 29

CERN 4 / 29

CERN Protons collide in the CMS detector Took ~2000 scientists and engineers more than 20 years to design and build Is about 15 metres wide and 21.5 metres long Weighs twice as much as the Eiffel Tower about 14000t Uses the largest, most powerful magnet of its kind ever made 5 / 29

CERN 6 / 29

Introduction CMS data flow, rates and computing model at LHC σ LHC s=14tev L=10 34 cm -2 s -1 barn mb mb nb σ inelastic bb jets W Z W ν Z ll tt gg H SM Trigger Levels Level 1 (~10 8 ) HLT (~10 5 ) On Tape (~10 2 ) SUSY qq ~~ +q ~ g ~ +g ~ g ~ Hz tanβ=2, µ=m~=m~/2 g q tanβ=2, µ =m~=m g Event rate GHz MHz khz 25 ns pipelined Level-1 Trigger 40 MHz massive parallel ON-Line Readout Event Builder Tbit/s Network Filter Farm (HLT) TFlop processor farms Mass Storage Petabyte Archive OFF-Line Reconstruction and Analysis pb qq qq H SM H SM γγ h γγ Z ARL 2 mhz fb H SM 2Z 0 4µ Z SM 3γ Z η 2 scalar LQ 100 500 2000 jet E T or particle mass (GeV) mhz Distributed data and computing centers µs ms sec hour year 10-9 10-6 10-3 1 10 3 10 6 sec Real Time Window Giga Tera Peta 7 / 29

CMS DAQ2 System 8 / 29

CMS DAQ2 System Storage Manager and Transfer System (SMTS) in the DAQ chain SMTS and DAQ input: end of the Data AcQuisition chain* last part of the data flow: ensure safe storage and transfer to Tier0 *E. Meschi, File-based data flow in the CMS Filter Farm, CHEP 2015 9 / 29

Storage Manager and Transfer System Role Data flow pattern to and from Lustre Storage and Transfer Management System (STMS) is the final component of the DAQ system Data flow status before the STMS: Builder Units (BU) nodes receive event fragments that they reconstruct They distribute these events to the Filter Units (FU) for selection The FUs send back the meaningful events to the BUs The Storage Manager merges the data into a smaller number of files The resulting files are temporarily stored in the Lustre File System (LFS) The Transfer System picks the files up from LFS and ships them for analysis processing and permanent storage into the CERN Tier0 (EOS/Castor) 10/29

Storage Manager and Transfer System Role Implementation Stages merge the filter units output as to obtain 1 data and 1 metadata file/ls/stream buffer the data until it is safely stored on tape in Tier0/Castor copy the final files according to their intended destination: Tier0 for the main data streams various sub detectors for online consumption: DQM, EventDisplay store locally for local calibration of various sub detectors ensure hand shake with Tier0 for proper accountability Simplified glossary LS: 23s stream: grouping of similar datasets 11/29

Storage and Transfer System Requirements Merger System merge data at the BU level such as to obtain 1 file/bu/ls/stream (mini merger) centralize and merge all the BU outputs such as to obtain 1 file/ls/stream (macro merger) latency: a maximum of 2LS (1LS = 23s) delay in the macro merger is considered acceptable provide input for the online monitoring system 1 additional metadata file per data file* not only concatenate, but deal with special files, such as histograms and jsn files *S. Morovic, A scalable monitoring for the CMS Filter Farm based on elasticsearch, CHEP 2015 12/29

Storage and Transfer System Requirements Storage and Transfer buffer a minimum of 3 days of continuous running (estimated 250TB) aggregated SM input from the 62 BUs is expected to reach a maximum of 2GB/s mini merger write to LFS rate of files can be as high as 20 streams x 3files/LS, 120 files/minute; additionally: bookkeeping/locking files, up to 60 BUs x 20 streams, 1200 files/minute the macro merger needs to consume this data online (2GB/s read the fragments, 2GB/s write the final merged file): 4GB/s(!) the transfer system is expected to transfer most of the data to Tier0 at 1GB/s* overall: LFS needs to guarantee a total of sustained 7GB/s parallel r/w *Recently this requirement increased to 3GB/s 13/29

Mergers 2 available options A dditive mini mergers write a file/bu/ls/stream, macro merger merges them and makes them available for the TS easy debugging, reliable, standard logic C opyless mini mergers write in parallel in the final file, macro merger checks for completion and makes it available for the transfer system reduce the required bandwidth with 4GB/s reduce the number of temporary files by a factor of 60 (number of BUs) fast due to parallel writing in the same file more sensitive to corruption 14/29

Lustre File System Implementation Lustre FS architecture current Intel Enterprise Edition for Lustre version: 2.2.0.2 servers: 6 DELL R720 2 MDS nodes, one active at a time 4 OSS nodes, each controls 6 OSTs Rack view MDT (low), 1 OST controller and 1 disk shelves expansion enclosure 15/29

Lustre File System Implementation Meta Data Configuration 16 drives of 1TB in 1 volume group, 8 hot spares only 10% of the disks capacity is used in order to increase performance partitions: 10GB for MGT, 1TB for MDT connection to servers: Mini-SAS HD to Mini-SAS redundancy: RAID6 MDT: NetApp E2724 front and rear view 16/29

Lustre File System Implementation Object Storage Configuration 2 OST controllers: NetApp E5560 each controller manages one disk expansion enclosure DE6600 each disk shelf enclosure contains 60 disks of 2TB each total raw disk space: 240 disks x 2TB = 480 TB physical installation: 2 racks, 1 controller and its expansion enclosure per rack connection to servers: Mini-SAS to Mini-SAS Front OST Disk shelves 17/29

Lustre File System Implementation OST: Volume Configuration each controller/expansion shelf is organized in 6 RAID6 volume groups the volume groups are physically allocated vertically to ensure resilience to single shelf damage total usable space: 249TB Volumes configuration 18/29

Lustre File System Implementation High Availability volumes repartition to provide full shelf failure redundancy all volumes are RAID6 all devices (controllers, shelves, servers) are dual powered (normal and UPS) all servers configured in active/passive failover mode via corosync/pacemaker: MDS in neighbouring racks, OSS within the same rack LFS nominal availability: 40GE and InfiniBand (56Gb) data networks* 19/29

Lustre File System Control and Monitoring IML: Lustre FS control and monitoring interface mostly used for control and base FS operations the dashboard provides useful information for debugging an overloaded system very demading installation requirements not fully reliable: fake BMC monitoring warnings, false status reports upon major FS failures 20/29

Lustre File System Control and Monitoring SANtricity mostly used for monitoring bandwidth usage per controller reports detailed text bandwidth usage per volume provides useful information and alerts on hardware status 21/29

Bandwidth Validation Commissioning Acceptance Proven steady 10GB/s rate in r/w mode Merger emulation Proven steady 7.5GB/s rate 22/29

Validation Read vs Write 23/29

Validation LFS bandwidth benchmarking Emulation tests using the production computing cluster tests performed using different fractions of the available computing farm obvious non linear behaviour with the number of BUs transfer system (read operations) were not considered during the tests saturation is expected around 8.5GB/s 24/29

Real Life Usage High trigger rate Stable beams runs 25/29

Conclusion Mergers monitoring sample SMTS Validation stable behaviour in 5 months of production running mode general latencies within the requirements proven reliability and availability a few glitches, have been followed up and mostly solved Mergers delays sample 26/29

Conclusion Event display of one of the first particle splashes seen in CMS during Run2 27/29

Conclusion Event display of one of the first particle splashes seen in CMS during Run2... only a few minutes before one of the OSS servers crashed... 27/29

Conclusion Event display of one of the first particle splashes seen in CMS during Run2... only a few minutes before one of the OSS servers crashed...... and the failover mechanism failed... 27/29

Conclusion SMTS team interaction with Lustre Lustre appeared to be very sensitive to network glitches IML can be misleading, but provides very intuitive ways of controlling the FS sub optimal application architecture artificially increased the load on the FS (fixed) a few FS issues have been identified, but they have been/are being fixed clients recover pretty fast and painlessly after FS unavailability Intel s Lustre and NetApp s E-Series seem to play nicely together and they deliver the required bandwidth performance 28/29

Questions? 29/29