LHC schedule: what does it imply for SRM deployment? Jamie.Shiers@cern.ch. CERN, July 2007



WLCG Service Schedule LHC schedule: what does it imply for SRM deployment? Jamie.Shiers@cern.ch WLCG Storage Workshop CERN, July 2007

Agenda: the machine, the experiments, the service

LHC Schedule
[Gantt chart covering March 2007 to December 2008, showing the phases: inner triplet repairs & interconnections, interconnection of the continuous cryostat, global pressure test & consolidation, warm-up, leak tests of the last sub-sectors, flushing, cool-down, powering tests, operation testing of available sectors, machine checkout, and beam commissioning to 7 TeV.]

2008 LHC Accelerator schedule
[Two slides showing the 2008 accelerator schedule graphically; no text beyond the title is recoverable.]

Machine Summary
No engineering run in 2007. Startup in May 2008; we aim to be seeing high-energy collisions by the summer. No long shutdown at the end of 2008. See also the DG's talk on http://cern.ch/

Experiments
Continue preparations for the Full Dress Rehearsals. The schedule from CMS is very clear: CSA07 runs from September 10 for 30 days; ready for a cosmics run in November; another such run in March. ALICE have stated an FDR from November. Expecting concurrent exports from ATLAS & CMS from the end of July: 1 GB/s from ATLAS, 300 MB/s from CMS. Bottom line: continuous activity; the period after CHEP is likely to be (very) busy.

ATLAS Event Sizes
We already needed more hardware in the T0 because in the TDR there was no full ESD copy to BNL included, transfers require more disk servers than expected, and there is 10% less disk space in the CAF. From the TDR: RAW = 1.6 MB, ESD = 0.5 MB, AOD = 0.1 MB; 5-day buffer at CERN = 127 TB; currently 50 disk servers / 300 TB. For Release 13: RAW = 1.6 MB, ESD = 1.5 MB, AOD = 0.23 MB (incl. trigger & truth), i.e. 2.2 MB -> 3.3 MB = 50% more at the T0. With 3 ESD and 10 AOD copies: 4.1 MB -> 8.4 MB = a factor 2 more for exports. Consequences: more disk servers needed for T0 internal traffic and exports; 40% less disk in the CAF; extra tapes and drives (25% cost increase) have to be taken away from the CAF again. There are also implications for T1/T2 sites, which can store 50% less data. Goal: run this summer for 2 weeks uninterrupted at nominal rates with all T1 sites. Event sizes from the cosmic run are ~8 MB (no zero suppression).
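As a cross-check of the figures above, a small worked sketch of the event-size arithmetic (the per-event sizes and the 3-ESD / 10-AOD replication counts are taken from the slide; everything else is illustrative):

```python
# Sketch reproducing the event-size arithmetic quoted on the slide.
# Per-event sizes in MB and the export replication counts (3 ESD copies,
# 10 AOD copies) are taken from the slide; the code itself is illustrative.

tdr = {"RAW": 1.6, "ESD": 0.5, "AOD": 0.1}       # ATLAS Computing TDR
rel13 = {"RAW": 1.6, "ESD": 1.5, "AOD": 0.23}    # Release 13 (incl. trigger & truth)

def t0_volume(sizes):
    """Per-event volume written at the Tier-0 (one copy of each format)."""
    return sum(sizes.values())

def export_volume(sizes, esd_copies=3, aod_copies=10):
    """Per-event volume exported to the Tier-1s."""
    return sizes["RAW"] + esd_copies * sizes["ESD"] + aod_copies * sizes["AOD"]

print(t0_volume(tdr), t0_volume(rel13))          # 2.2 MB -> ~3.3 MB (about 50% more at T0)
print(export_volume(tdr), export_volume(rel13))  # 4.1 MB -> ~8.4 MB (about a factor 2 for exports)
```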

ATLAS T0 -> T1 Exports, situation at May 28/29 2007

Tier-1 Site   Efficiency (%)   Average Thruput (MB/s)   Nominal Rate (MB/s)
ASGC          0                0                        60
BNL           95               145                      290
CNAF          71               11                       90
FZK           46               31                       90
Lyon          85               129                      110
NDGF          86               37                       50
PIC           45               37                       50
RAL           0                0                        100
SARA          99               149                      110
Triumf        94               34                       50

(The original slide also flagged, per site, whether 50%, 100%, 150% or 200% of the nominal rate had been achieved.)
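The achieved fraction of the nominal rate can be recomputed directly from the two throughput columns; a minimal sketch using the figures from the table above:

```python
# Sketch: recompute what fraction of the nominal export rate each Tier-1
# was achieving, from the average-throughput and nominal-rate columns above.

sites = {
    # site: (average throughput MB/s, nominal rate MB/s)
    "ASGC": (0, 60),    "BNL": (145, 290),  "CNAF": (11, 90),
    "FZK": (31, 90),    "Lyon": (129, 110), "NDGF": (37, 50),
    "PIC": (37, 50),    "RAL": (0, 100),    "SARA": (149, 110),
    "Triumf": (34, 50),
}

for site, (avg, nominal) in sorted(sites.items()):
    frac = avg / nominal
    print(f"{site:7s} {100 * frac:5.0f}% of nominal")
```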

Services Schedule
Q: What do you (CMS) need for CSA07? A: Nothing. We would like FTS 2.0 at the Tier-1s (and not too late), but it is not required for CSA07 to succeed. Trying to ensure that this is done at the CMS T1s.
The other major residual service is SRM v2.2. Windows of opportunity: post-CSA07, early 2008.
Q: How long will SRM 1.1 services be needed? 1 week? 1 month? 1 year?
The LHC annual schedule has a significant impact on larger service upgrades / migrations (cf. the COMPASS triple migration).

S.W.O.T. Analysis of WLCG Services
Strengths: We do have a service that is used, albeit with a small number of well-known and documented deficiencies (with work-arounds).
Weaknesses: Continued service instabilities; holes in operational tools & procedures; ramp-up will take at least several (many?) months more.
Threats: Hints of possible delays could re-ignite discussions on new features.
Opportunities: Maximise the time remaining until high-energy running to: 1) ensure all remaining residual services are deployed as rapidly as possible, but only when sufficiently tested & robust; 2) focus on smooth service delivery, with emphasis on improving all operation, service and support activities.
All services (including the residual ones) should be in place no later than Q1 2008, by which time a marked improvement in the measurable service level should also be achievable.

LCG: Steep ramp-up still needed before the first physics run
[Charts: CERN + Tier-1s installed and required CPU capacity (MSI2K) and disk capacity (PetaBytes), month by month from April 2006 to April 2008; increases of roughly 4x and 6x over the installed capacity are still required. Evolution of installed capacity is shown from April 06 to June 07; target capacity comes from the MoU pledges for 2007 (due July 07) and 2008 (due April 08).]

WLCG Service: S / M / L vision
Short-term: ready for the Full Dress Rehearsals, now expected to fully ramp up ~mid-September (after CHEP). The only thing I see as realistic on this time-frame is FTS 2.0 services at the WLCG Tier0 & Tier1s. Schedule: June 18th at CERN; available mid-July for the Tier1s.
Medium-term: what is needed & possible for 2008 LHC data taking & processing. The remaining residual services must be in full production mode early in Q1 2008 at all WLCG sites! Significant improvements in monitoring, reporting and logging, leading to more timely error response and service improvements.
Long-term: anything else. The famous sustainable e-infrastructure?


Types of Intervention
0. (Transparent) load-balanced servers / services
1. Infrastructure: power, cooling, network
2. Storage services: CASTOR, dCache
3. Interaction with the backend DB: LFC, FTS, VOMS, SAM etc.

Transparent Interventions - Definition
We have reached agreement with the LCG VOs that the combination of hardware / middleware / experiment-ware should be resilient to service glitches. A glitch is defined as a short interruption of (one component of) the service that can be hidden, at least to batch work, behind some retry mechanism(s). How long is a glitch? All central CERN services are covered for power glitches of up to 10 minutes; some are also covered for longer by diesel UPS, but any non-trivial service seen by the users is only covered for 10 minutes. Can we implement the services so that ~all interventions are transparent? YES, with some provisos.
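As an illustration of hiding a glitch behind a retry mechanism, a minimal sketch; the transfer call, time-outs and retry counts are hypothetical, not taken from any WLCG middleware:

```python
import random
import time

# Minimal sketch of the "hide a glitch behind a retry" idea: a short service
# interruption is invisible to the client if the operation is simply retried
# for longer than the glitch lasts. All names and numbers here are illustrative.

class TransientServiceError(Exception):
    """Raised when (one component of) the service is briefly unavailable."""

def transfer_file(src, dst):
    # Hypothetical stand-in for a real transfer call (e.g. an FTS submission).
    if random.random() < 0.3:                  # simulate a glitch ~30% of the time
        raise TransientServiceError("service briefly unavailable")
    return f"copied {src} -> {dst}"

def with_retries(fn, *args, attempts=5, backoff=60):
    """Retry a callable over a window longer than the expected glitch (~10 min)."""
    for i in range(attempts):
        try:
            return fn(*args)
        except TransientServiceError:
            if i == attempts - 1:
                raise                          # glitch outlasted the retry window
            time.sleep(backoff)                # wait, then try again

# Short backoff here just so the demo finishes quickly.
print(with_retries(transfer_file, "castor:/data/run1.raw",
                   "srm://t1.example/run1.raw", backoff=1))
```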

Targeted Interventions
Common interventions include: adding additional resources to an existing service; replacing hardware used by an existing service; operating system / middleware upgrades / patches; similar operations on the DB backend (where applicable).
Pathological cases include: massive machine-room reconfigurations, as performed at CERN (and elsewhere) to prepare for LHC; wide-spread power or cooling problems; major network problems, such as DNS / router / switch problems.
Pathological cases clearly need to be addressed too!

More Transparent Interventions
"I am preparing to restart our SRM server here at IN2P3-CC, so I have closed the IN2P3 channel on prod-fts-ws in order to drain the current transfer queues. I will open them in 1 hour or 2."
Is this a transparent intervention or an unscheduled one? A: technically unscheduled, since it is SRM downtime. An EGEE broadcast was made, but this is just an example. However, if the channel were first paused, which would mean that no files fail, it becomes transparent, at least to the FTS, which is explicitly listed as a separate service in the WLCG MoU for both T0 & T1! In other words, if we can trivially limit the impact of an intervention, we should (cf. the WLCG MoU services at Tier0/Tier1s/Tier2s).
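The "pause first, drain, then intervene" sequence described above could look roughly like the sketch below; the channel-control helpers are hypothetical placeholders, not the real FTS administrative commands:

```python
import time

# Sketch of the "pause first, then intervene" sequence discussed above.
# The helper functions are hypothetical placeholders standing in for whatever
# administrative interface a site actually uses to control an FTS channel;
# they do not represent the real FTS command set.

def pause_channel(channel):
    """Stop scheduling new transfers on the channel; queued files stay queued."""
    print(f"pausing {channel}: no new transfers will be started")

def active_transfers(channel):
    """Return the number of transfers still in flight on the channel."""
    return 0   # placeholder: in reality, query the transfer service

def resume_channel(channel):
    print(f"resuming {channel}")

def intervene(channel, do_maintenance):
    pause_channel(channel)                    # queued files are held, not failed
    while active_transfers(channel) > 0:      # let in-flight transfers drain
        time.sleep(30)
    do_maintenance()                          # e.g. restart the SRM server
    resume_channel(channel)                   # queued work continues afterwards

intervene("CERN-IN2P3", lambda: print("restarting SRM server..."))
```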

Service Review
For each service we need the current status of:
Power supply (redundant, including the power feed? Critical? Why?)
Servers (single or multiple? DNS load-balanced? HA Linux? RAC? Other?)
Network (are the servers connected to separate network switches?)
Middleware (can the middleware transparently handle the loss of one or more servers?)
Impact (what is the impact on other services and / or users of a loss / degradation of the service?)
Quiesce / recovery (can the service be cleanly paused? Is there built-in recovery, e.g. buffers? What length of interruption is tolerated?)
Tested (have interventions been made transparently using the above features?)
Documented (operations procedures, service information)
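One possible way to keep such a review in a machine-readable form is sketched below; the field names and the example FTS-like entry are purely illustrative, not an agreed WLCG format:

```python
# Sketch of a per-service review record capturing the checklist above.
# The field names and the example answers are illustrative only; they do not
# represent an agreed WLCG format or the real status of any service.

from dataclasses import dataclass

@dataclass
class ServiceReview:
    service: str
    power_supply: str      # redundant? critical? why?
    servers: str           # single/multiple, DNS load-balanced, HA Linux, RAC...
    network: str           # separate network switches?
    middleware: str        # survives loss of one or more servers?
    impact: str            # effect on other services / users if degraded
    quiesce_recovery: str  # clean pause? built-in recovery (buffers)? max interruption
    tested: bool           # transparent interventions actually exercised?
    documented: bool       # operations procedures & service information written up?

example = ServiceReview(
    service="fts (hypothetical example entry)",
    power_supply="redundant feed, critical service",
    servers="multiple, DNS load-balanced web-service nodes",
    network="nodes split across two switches",
    middleware="clients retry; channels can be paused",
    impact="T0 -> T1 exports stall if down",
    quiesce_recovery="channels paused, queues drain; resumes from queued state",
    tested=False,
    documented=True,
)
print(example)
```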


Why a Grid Solution?
The LCG Technical Design Report lists:
1. The significant costs of [providing,] maintaining and upgrading the necessary resources are more easily handled in a distributed environment, where individual institutes and organisations can fund local resources whilst contributing to the global goal.
2. No single points of failure. Multiple copies of the data and automatic reassigning of tasks to resources facilitate access to the data for all scientists, independent of location; round-the-clock monitoring and support.

Services - Summary
It's open season on SPOFs: Seek! Locate! Exterminate!
You are a SPOF! You are the enemy of the Grid! You will be exterminated!

Summary
2008 / 2009 LHC running will be at lower than design luminosity (but the same data rate?). Work has (re-)started with CMS to jointly address critical services. Realistically, it will take quite some effort and time to get the services up to design luminosity.

Questions for this workshop
1. Given the schedule of the experiments and the LHC machine, (when) can we realistically deploy SRM 2.2 in production?
2. What is the roll-out schedule? (WLCG sites by name & possibly VO)
3. How long is the validation period, including possible fixes to clients (FTS etc.)?
4. For how long do we need to continue to run SRM v1.1 services? Migration issues? Clients?

ATLAS Visit
For those who have registered, now is a good time to pay the 10 deposit. RDV 14:00 Geneva time, CERN reception, B33.

Backup Slides

Service Progress Summary (updates presented at the June GDB)

LFC: Bulk queries deployed in February; secondary groups deployed in April. ATLAS and LHCb are currently giving new specifications for other bulk operations that are scheduled for deployment this autumn, with matching GFAL and lcg-utils changes.

DPM: SRM 2.2 support released in November; secondary groups deployed in April. Support for ACLs on disk pools has just passed certification. SL4 32- and 64-bit versions are certified apart from the vdt (gridftp) dependencies.

FTS 2.0: Has been through integration and testing, including certificate delegation, SRM v2.2 support and service enhancements; now being validated in the PPS and pilot service (already completed by ATLAS and LHCb); will then be used in CERN production for 1 month (from June 18th) before release to the Tier-1s. Ongoing (less critical) developments to improve monitoring piece by piece continue.

3D: All Tier-1 sites are in production mode and validated with respect to the ATLAS conditions DB requirements. 3D monitoring is integrated into the GGUS problem reporting system. Testing to confirm Streams failover procedures takes place in the next few weeks, then coordinated DB recovery will be exercised with all sites. Also starting Tier-1 scalability tests with many ATLAS and LHCb clients to have the correct DB server resources in place by the autumn.

VOMS roles: Mapping to job scheduling priorities has been implemented at the Tier-0 and most Tier-1s, but the behaviour is not as expected (ATLAS report that production-role jobs map to both production and normal queues), so this is being re-discussed.

Service Progress Summary (updates presented at the June GDB)

gLite 3.1 WMS: The WMS passed certification and is now in integration. It is being used for validation work at CERN by ATLAS and CMS, with LHCb to follow. Developers at CNAF fix any bugs, then run 2 weeks of local testing before giving patches back to CERN.

gLite 3.1 CE: The CE is still under test, with no clear date for completion. The backup solution is to keep the existing 3.0 CE, which will require SLC3 systems. Alternative solutions are also being discussed.

SL4: The SL3-built, SL4 compatibility-mode UI and WN have been released, but the decision to deploy is left to the sites. The native SL4 32-bit WN is in the PPS now and the UI is ready to go in; it will not be released to production until after experiment testing is completed. The SL4 DPM (needs vdt) is important for sites that buy new hardware.

SRM 2.2: CASTOR2 work is coupled to the ongoing performance enhancements; dCache 1.8 beta has test installations at FNAL, DESY, BNL, FZK, Edinburgh, IN2P3 and NDGF, most of which are also in the PPS.

DAQ-Tier-0 Integration: Integration of ALICE with the Tier-0 has been tested with a throughput of 1 GByte/sec. LHCb testing is planned for June, then ATLAS and CMS from September.

Operations: Many improvements are under way for increasing the reliability of all services. See this workshop & also the WLCG Collaboration workshop @ CHEP. N.B. it's not all dials & dashboards!