WLCG Service Schedule
LHC schedule: what does it imply for SRM deployment?
Jamie.Shiers@cern.ch
WLCG Storage Workshop, CERN, July 2007
Agenda
- The machine
- The experiments
- The service
LHC Schedule
[Gantt chart, March 2007 - December 2008, showing per-sector activities: inner triplet repairs & interconnections, interconnection of the continuous cryostat, global pressure test & consolidation, leak tests of the last sub-sectors, flushing, cool-down, warm-up, powering tests, operational testing of available sectors, machine checkout, and beam commissioning to 7 TeV.]
2008 LHC Accelerator Schedule
[Two slides of accelerator schedule charts; no recoverable text content.]
Machine Summary
- No engineering run in 2007
- Startup in May 2008; we aim to be seeing high-energy collisions by the summer
- No long shutdown at the end of 2008
- See also the DG's talk on http://cern.ch/
Experiments
- Continue preparations for Full Dress Rehearsals
- Schedule from CMS is very clear:
  - CSA07 runs from September 10 for 30 days
  - Ready for cosmics run in November; another such run in March
- ALICE have stated FDR from November
- Expecting concurrent exports from ATLAS & CMS from end July: 1 GB/s from ATLAS, 300 MB/s from CMS
- Bottom line: continuous activity; post-CHEP likely to be (very) busy
ATLAS Event Sizes
- We already needed more hardware in the T0 because:
  - In the TDR there was no full ESD copy to BNL included
  - Transfers require more disk servers than expected
  - 10% less disk space in CAF
- From the TDR: RAW = 1.6 MB, ESD = 0.5 MB, AOD = 0.1 MB
  - 5-day buffer at CERN: 127 TB; currently 50 disk servers, 300 TB
- For Release 13: RAW = 1.6 MB, ESD = 1.5 MB, AOD = 0.23 MB (incl. trigger & truth)
  - 2.2 -> 3.3 MB = 50% more at T0
  - 3 ESD and 10 AOD copies: 4.1 -> 8.4 MB = factor 2 more for exports (see the sketch below)
- More disk servers needed for T0 internal and exports
  - 40% less disk in CAF
  - Extra tapes and drives: 25% cost increase; have to be taken away from CAF again
- Also implications for T1/T2 sites: can store 50% less data
- Goal: run this summer 2 weeks uninterrupted at nominal rates with all T1 sites
- Event sizes from cosmic run ~8 MB (no zero suppression)
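The "50% more at T0" and "factor 2 for exports" figures follow directly from the per-event sizes quoted above. The following is a minimal sketch of that arithmetic, assuming (my reading of the slide) one copy of each format at the Tier-0 and an export stream of 1 RAW + 3 ESD + 10 AOD copies:

```python
# Back-of-the-envelope check of the ATLAS per-event volumes quoted above.
tdr = {"RAW": 1.6, "ESD": 0.5, "AOD": 0.1}       # MB/event, Computing TDR
rel13 = {"RAW": 1.6, "ESD": 1.5, "AOD": 0.23}    # MB/event, Release 13

def t0_volume(sizes):
    """One copy of each format kept at the Tier-0."""
    return sizes["RAW"] + sizes["ESD"] + sizes["AOD"]

def export_volume(sizes, esd_copies=3, aod_copies=10):
    """RAW once plus replicated ESD/AOD copies shipped to the Tier-1s."""
    return sizes["RAW"] + esd_copies * sizes["ESD"] + aod_copies * sizes["AOD"]

print(t0_volume(tdr), t0_volume(rel13))          # 2.2 -> 3.33 MB  (~50% more at T0)
print(export_volume(tdr), export_volume(rel13))  # 4.1 -> 8.4 MB   (factor ~2 for exports)
```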
ATLAS T0 -> T1 Exports: situation at May 28/29 2007

Tier-1 site | Efficiency (%) | Average throughput (MB/s) | Nominal rate (MB/s)
ASGC        |   0            |   0                       |  60
BNL         |  95            | 145                       | 290
CNAF        |  71            |  11                       |  90
FZK         |  46            |  31                       |  90
Lyon        |  85            | 129                       | 110
NDGF        |  86            |  37                       |  50
PIC         |  45            |  37                       |  50
RAL         |   0            |   0                       | 100
SARA        |  99            | 149                       | 110
Triumf      |  94            |  34                       |  50

(The original slide also flagged, per site, whether 50% / 100% / 150% / 200% of nominal had been achieved.)
Services Schedule
- Q: What do you (CMS) need for CSA07?
  - A: Nothing; would like FTS 2.0 at Tier-1s (and not too late), but not required for CSA07 to succeed
  - Trying to ensure that this is done at CMS T1s
- Other major residual service: SRM v2.2
  - Windows of opportunity: post-CSA07, early 2008
- Q: How long will SRM 1.1 services be needed? 1 week? 1 month? 1 year?
- LHC annual schedule has significant impact on larger service upgrades / migrations (cf. COMPASS triple migration)
S.W.O.T. Analysis of WLCG Services
- Strengths: We do have a service that is used, albeit with a small number of well-known and documented deficiencies (with work-arounds).
- Weaknesses: Continued service instabilities; holes in operational tools & procedures; ramp-up will take at least several (many?) months more.
- Threats: Hints of possible delays could re-ignite discussions on new features.
- Opportunities: Maximise the time remaining until high-energy running to: 1) ensure all remaining "residual" services are deployed as rapidly as possible, but only when sufficiently tested & robust; 2) focus on smooth service delivery, with emphasis on improving all operation, service and support activities.
All services (including "residual" ones) should be in place no later than Q1 2008, by which time a marked improvement in the measurable service level should also be achievable.
LCG: Steep ramp-up still needed before first physics run
[Charts: CERN + Tier-1s installed vs. required CPU capacity (MSI2K) and disk capacity (PetaBytes), April 2006 - April 2008; roughly 4x more CPU and 6x more disk are still needed.]
- Evolution of installed capacity from April 06 to June 07
- Target capacity from MoU pledges for 2007 (due July 07) and 2008 (due April 08)
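To give a rough feel for how steep this ramp is, here is a small illustration under my own assumption that the ~4x CPU and ~6x disk growth indicated on the charts must happen between June 2007 and the April 2008 pledge date, i.e. over roughly 10 months:

```python
# Hypothetical compound-growth view of the remaining capacity ramp-up.
months = 10  # assumed: June 2007 -> April 2008
for resource, factor in (("CPU", 4.0), ("disk", 6.0)):
    monthly = factor ** (1.0 / months)          # required average monthly growth factor
    print(f"{resource}: x{factor:.0f} overall -> ~{(monthly - 1) * 100:.0f}% growth per month")
# CPU:  x4 overall -> ~15% growth per month
# disk: x6 overall -> ~20% growth per month
```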
WLCG Service: S / M / L Vision
- Short-term: ready for Full Dress Rehearsals, now expected to fully ramp up ~mid-September (> CHEP)
  - The only thing I see as realistic on this time-frame is FTS 2.0 services at the WLCG Tier-0 & Tier-1s
  - Schedule: June 18th at CERN; available mid-July for Tier-1s
- Medium-term: what is needed & possible for 2008 LHC data taking & processing
  - The remaining residual services must be in full production mode early Q1 2008 at all WLCG sites!
  - Significant improvements in monitoring, reporting, logging -> more timely error response -> service improvements
- Long-term: anything else
  - The famous sustainable e-infrastructure?
WLCG Service Deployment: Lessons Learnt
Types of Intervention
0. (Transparent) load-balanced servers / services
1. Infrastructure: power, cooling, network
2. Storage services: CASTOR, dCache
3. Interaction with backend DB: LFC, FTS, VOMS, SAM, etc.
Transparent Interventions - Definition
- Have reached agreement with the LCG VOs that the combination of hardware / middleware / experiment-ware should be resilient to service "glitches"
- A glitch is defined as a short interruption of (one component of) the service that can be hidden, at least to batch, behind some retry mechanism(s)
- How long is a glitch?
  - All central CERN services are covered for power glitches of up to 10 minutes
  - Some are also covered for longer by diesel UPS, but any non-trivial service seen by the users is only covered for 10 minutes
- Can we implement the services so that ~all interventions are transparent?
  - YES, with some provisos (see the retry sketch below)
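To make the "hidden behind some retry mechanism" idea concrete, here is a minimal client-side sketch; the function names, timings and the submit_transfer placeholder are my own illustrations, not part of any WLCG middleware API.

```python
import time

def with_retries(operation, attempts=5, backoff_s=60):
    """Retry a flaky remote operation so that a short service glitch
    (e.g. a server restart) stays invisible to the batch job calling it."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except OSError:                   # stand-in for "service temporarily unreachable"
            if attempt == attempts:
                raise                     # glitch lasted too long: surface the error
            time.sleep(backoff_s)         # wait out the glitch, then try again

# Hypothetical usage: submit_transfer is a placeholder for whatever client call
# the experiment framework makes against the storage or transfer service.
# result = with_retries(lambda: submit_transfer("srm://example-t1/atlas/raw/file1"))
```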
Targeted Interventions
Common interventions include:
- Adding additional resources to an existing service;
- Replacing hardware used by an existing service;
- Operating system / middleware upgrade / patch;
- Similar operations on the DB backend (where applicable).
Pathological cases include:
- Massive machine-room reconfigurations, as was performed at CERN (and elsewhere) to prepare for LHC;
- Wide-spread power or cooling problems;
- Major network problems, such as DNS / router / switch problems.
Pathological cases clearly need to be addressed too!
More Transparent Interventions
"I am preparing to restart our SRM server here at IN2P3-CC, so I have closed the IN2P3 channel on prod-fts-ws in order to drain current transfer queues. I will open them in 1 hour or 2."
- Is this a transparent intervention or an unscheduled one?
- A: technically unscheduled, since it's SRM downtime. An EGEE broadcast was made, but this is just an example.
- But if the channel was first paused, which would mean that no files will fail, it becomes instead transparent, at least to the FTS, which is explicitly listed as a separate service in the WLCG MoU, both for T0 & T1!
- i.e. if we can trivially limit the impact of an intervention, we should (cf. WLCG MoU services at Tier-0/Tier-1s/Tier-2s); see the sketch below
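A minimal sketch of the "pause rather than close" procedure described above; pause_channel, resume_channel and restart_srm are hypothetical placeholders for the real FTS channel-management and site operations, not actual glite/FTS commands.

```python
# Hypothetical procedure: pause the FTS channel so queued transfers simply wait
# (nothing fails), perform the SRM intervention, then resume. Contrast with
# closing the channel, where submitted transfers can fail outright.

def intervene_on_srm(channel, pause_channel, resume_channel, restart_srm):
    pause_channel(channel)        # queued jobs stay queued; no file failures
    try:
        restart_srm()             # the actual (short) SRM downtime
    finally:
        resume_channel(channel)   # transfers pick up where they left off
```

The point of the slide is exactly this design choice: with a pause, the intervention is invisible to the FTS as seen by users, so it need not count as unscheduled downtime.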
Service Review
For each service we need the current status of:
- Power supply (redundant, including power feed? Critical? Why?)
- Servers (single or multiple? DNS load-balanced? HA Linux? RAC? Other?)
- Network (are servers connected to separate network switches?)
- Middleware (can the middleware transparently handle loss of one or more servers?)
- Impact (what is the impact on other services and/or users of a loss / degradation of service?)
- Quiesce / recovery (can the service be cleanly paused? Is there built-in recovery, e.g. buffers? What length of interruption is tolerated?)
- Tested (have interventions been made transparently using the above features?)
- Documented (operations procedures, service information)
A sketch of how such a review record might be captured follows.
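As an illustration only (the field names and the example values are mine, not a WLCG standard), the checklist above could be captured per service in a simple structured record:

```python
from dataclasses import dataclass

@dataclass
class ServiceReview:
    """One record per service, mirroring the review checklist above."""
    service: str
    power_redundant: bool            # redundant power supply / feed?
    servers: str                     # "single", "DNS load-balanced", "HA Linux", "RAC", ...
    separate_switches: bool          # servers on separate network switches?
    middleware_tolerates_loss: bool  # survives loss of one or more servers?
    impact_of_loss: str              # effect on other services / users
    quiesce_recovery: str            # clean pause? built-in recovery (buffers)? max interruption
    tested_transparently: bool       # transparent interventions actually exercised?
    documented: bool                 # operations procedures & service information exist?

# Hypothetical example entry:
fts_review = ServiceReview(
    service="FTS", power_redundant=True, servers="DNS load-balanced",
    separate_switches=True, middleware_tolerates_loss=True,
    impact_of_loss="T0 -> T1 exports queue up", quiesce_recovery="channels can be paused",
    tested_transparently=False, documented=True,
)
```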
Why a Grid Solution?
The LCG Technical Design Report lists:
1. Significant costs of [providing,] maintaining and upgrading the necessary resources are more easily handled in a distributed environment, where individual institutes and organisations can fund local resources whilst contributing to the global goal.
2. No single points of failure: multiple copies of the data and automatic reassignment of tasks to resources facilitate access to data for all scientists independent of location; round-the-clock monitoring and support.
Services - Summary
It's open season on SPOFs: Seek! Locate! Exterminate!
- You are a SPOF!
- You are the enemy of the Grid!
- You will be exterminated!
Summary
- 2008 / 2009 LHC running will be at lower than design luminosity (but same data rate?)
- Work has (re-)started with CMS to jointly address critical services
- Realistically, it will take quite some effort and time to get services up to "design luminosity"
Questions for this Workshop
1. Given the schedule of the experiments and the LHC machine, (when) can we realistically deploy SRM 2.2 in production?
2. What is the roll-out schedule? (WLCG sites by name & possibly VO)
3. How long is the validation period, including possible fixes to clients (FTS etc.)?
4. For how long do we need to continue to run SRM v1.1 services? Migration issues? Clients?
ATLAS Visit
- For those who have registered, now is a good time to pay the 10 deposit
- RDV 14:00 Geneva time, CERN reception, B33
Backup Slides
Service Progress Summary (updates presented at June GDB)
- LFC: Bulk queries deployed in February; secondary groups deployed in April. ATLAS and LHCb are currently giving new specifications for other bulk operations that are scheduled for deployment this autumn, with matching GFAL and lcg-utils changes.
- DPM: SRM 2.2 support released in November; secondary groups deployed in April. Support for ACLs on disk pools has just passed certification. SL4 32- and 64-bit versions certified apart from vdt (gridftp) dependencies.
- FTS 2.0: Has been through integration and testing, including certificate delegation, SRM v2.2 support and service enhancements; now being validated in PPS and the pilot service (already completed by ATLAS and LHCb); will then be used in CERN production for 1 month (from June 18th) before release to Tier-1s. Ongoing (less critical) developments to improve monitoring piece by piece continue.
- 3D: All Tier-1 sites in production mode and validated with respect to ATLAS conditions DB requirements. 3D monitoring integrated into the GGUS problem reporting system. Testing to confirm Streams failover procedures in the next few weeks, then coordinated DB recovery will be exercised with all sites. Also starting Tier-1 scalability tests with many ATLAS and LHCb clients to have the correct DB server resources in place by the autumn.
- VOMS roles: Mapping to job scheduling priorities has been implemented at Tier-0 and most Tier-1s, but behaviour is not as expected (ATLAS report that production-role jobs map to both production and normal queues), so this is being re-discussed.
Service Progress Summary (updates presented at June GDB, continued)
- gLite 3.1 WMS: Passed certification and is now in integration. It is being used for validation work at CERN by ATLAS and CMS, with LHCb to follow. Developers at CNAF fix any bugs, then run 2 weeks of local testing before giving patches back to CERN.
- gLite 3.1 CE: Still under test, with no clear date for completion. Backup solution is to keep the existing 3.0 CE, which will require SLC3 systems. Also discussing alternative solutions.
- SL4: SL3-built SL4 compatibility-mode UI and WN released, but the decision to deploy is left to sites. Native SL4 32-bit WN in PPS now and UI ready to go in; will not be released to production until after experiment testing is completed. SL4 DPM (needs vdt) is important for sites that buy new hardware.
- SRM 2.2: CASTOR2 work is coupled to the ongoing performance enhancements; dCache 1.8 beta has test installations at FNAL, DESY, BNL, FZK, Edinburgh, IN2P3 and NDGF, most of which are also in the PPS.
- DAQ-Tier-0 integration: Integration of ALICE with the Tier-0 has been tested with a throughput of 1 GByte/s. LHCb testing planned for June, then ATLAS and CMS from September.
- Operations: Many improvements are under way to increase the reliability of all services. See this workshop & also the WLCG Collaboration workshop @ CHEP. N.B. it's not all dials & dashboards!