Service Challenge Tests of the LCG Grid



Service Challenge Tests of the LCG Grid
Andrzej Olszewski, Institute of Nuclear Physics PAN, Kraków, Poland
Cracow '05 Grid Workshop, 22nd Nov 2005
The materials used in this presentation come from many sources; in particular I have used slides from presentations of the LHC experiments and the LHC Computing Grid project.

CERN, the European Organisation for Nuclear Research, operates the Large Hadron Collider with its four main experiments: ALICE, ATLAS, CMS and LHCb.

The experiment: the most powerful microscope.

Data Reduction
Data preselection in real time: many different physics processes, several levels of filtering, high efficiency for events of interest, total reduction factor of about 10^7.
- ~1 GHz interaction rate
- Level 1: special hardware, ~2 µs target processing time, 75 kHz output (fully digitized)
- Level 2: embedded processors/farm, ~10 ms, 2 kHz output
- Level 3: farm of commodity CPUs, ~2 s, 100 Hz output
- Data recording & offline analysis
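
As a sanity check of the quoted factor, the short Python snippet below chains the per-level rates listed above (a rough illustration; the rates are the generic figures from the slide, not experiment-specific trigger menus).

```python
# Back-of-the-envelope check of the quoted trigger reduction factor,
# using the rates listed on the slide above.
interaction_rate_hz = 1e9      # ~1 GHz of interactions
level1_out_hz = 75e3           # Level 1: special hardware
level2_out_hz = 2e3            # Level 2: embedded processors/farm
level3_out_hz = 100            # Level 3: commodity CPU farm, written to storage

total_reduction = interaction_rate_hz / level3_out_hz
print(f"Level 1 rejection : {interaction_rate_hz / level1_out_hz:.0f}x")
print(f"Level 2 rejection : {level1_out_hz / level2_out_hz:.1f}x")
print(f"Level 3 rejection : {level2_out_hz / level3_out_hz:.0f}x")
print(f"Total reduction   : {total_reduction:.0e}  (~10^7, as quoted)")
```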

Data Rates

             Rate [Hz]   RAW [MB]   ESD/Reco [MB]   AOD [kB]   MC [MB/evt]   MC % of real
  ALICE HI       100       12.5         2.5            250         300            100
  ALICE pp       100        1           0.04             4           0.4          100
  ATLAS          200        1.6         0.5            100           2             20
  CMS            150        1.5         0.25            50           2            100
  LHCb          2000        0.025       0.025            0.5        20

Running assumptions: 50 days of running in 2007; 10^7 seconds/year of pp running from 2008 on, giving ~10^9 events per experiment; 10^6 seconds/year of heavy-ion running.
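
To make the scale concrete, the sketch below multiplies the trigger rates and RAW event sizes from the table by the ~10^7 live seconds per year assumed for pp running; it is rough arithmetic only, and ignores ESD, AOD, Monte Carlo and heavy-ion data, which push the totals well beyond these numbers.

```python
# Rough estimate of the yearly RAW data volume per experiment,
# using the table above and ~10^7 live seconds per year of pp running.
SECONDS_PER_YEAR = 1e7

experiments = {            # name: (trigger rate [Hz], RAW event size [MB])
    "ATLAS":    (200, 1.6),
    "CMS":      (150, 1.5),
    "LHCb":     (2000, 0.025),
    "ALICE pp": (100, 1.0),
}

for name, (rate_hz, raw_mb) in experiments.items():
    petabytes = rate_hz * raw_mb * SECONDS_PER_YEAR / 1e9   # MB -> PB
    print(f"{name:9s}: ~{petabytes:.2f} PB of RAW data per year")
```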

Mountains of CPU & Disks
For LHC computing, 100M SPECint2000, i.e. about 100K of today's 3 GHz Pentium 4 processors, is needed. For data storage, 20 petabytes, i.e. about 100K disks/tapes, per year is needed. At CERN currently: ~2,400 processors, ~2 petabytes of disk, ~12 PB of magnetic tape. Even with technology-driven improvements in performance and costs, CERN can provide nowhere near enough capacity for the LHC.

Large, distributed community
~5000 physicists around the world, working around the clock. Offline software effort: 1000 person-years per experiment. Major challenges: communication and collaboration at a distance, distributed computing resources, remote software development and physics analysis.

The LCG Project
Objectives: design, prototyping and implementation of a computing environment for the LHC experiments:
- Infrastructure (for HEP it is effective to use PC farms)
- Middleware (based on EDG, VDT, gLite, ...)
- Operations (experiment VOs, operation and support centres)
Approved by the CERN Council in September 2001.
Phase 1 (2001-2004): development of a distributed production prototype, operated as a platform for the data challenges.
Phase 2 (2005-2007): installation and operation of the full world-wide initial production Grid system, requiring continued manpower efforts and substantial material resources.

Grid Foundation Projects
LCG builds on and cooperates with other Grid projects: Globus, Condor, VDT, iVDGL, Open Science Grid, PPDG, GriPhyN, Grid3, DataTag, GridPP, NorduGrid, EU DataGrid, CrossGrid, INFN Grid, EGEE. Globus, Condor and VDT have provided key components of the middleware used. Key members participate in OSG and EGEE.

LCG Hierarchical Model
(Diagram of the LHC Computing Facility: the CERN Tier-0 at the centre, surrounded by national Tier-1 centres, e.g. Brookhaven and FermiLab in the USA plus centres in Italy, NL, UK, France and Germany; regional Tier-2 centres at labs and universities; and Tier-3 physics department desktops.)

LCG Hierarchical Model
Tier-0 at CERN: record RAW data (1.25 GB/s for ALICE); distribute a second copy to the Tier-1s; calibrate and do first-pass reconstruction.
Tier-1 centres (11 defined): manage permanent storage of RAW, simulated and processed data; capacity for reprocessing and bulk analysis.
Tier-2 centres (> 100 identified): Monte Carlo event simulation; end-user analysis.
Tier-3: facilities at universities and laboratories; access to data and processing in Tier-2s and Tier-1s.

Architecture: Tier-0
(Network diagram of the CERN Tier-0: a 2.4 Tb/s core connecting the campus network, the experimental areas and the WAN at 10 Gb/s split into 32 x 1 Gb/s, with a distribution layer of Gigabit, ten-Gigabit and double ten-Gigabit Ethernet links; ~6000 CPU servers of 8000 SPECint2000 each in 2008, and ~2000 tape and disk servers.)

Network Connectivity
National Research Networks (NRENs) at the Tier-1s: ASnet (ASCC), LHCnet/ESnet (BNL, FNAL), GARR (CNAF), RENATER (IN2P3), DFN (GridKa), SURFnet6 (SARA), NORDUnet (Nordic), RedIRIS (PIC), UKERNA (RAL), CANARIE (TRIUMF). Any Tier-2 may access data at any Tier-1; Tier-2s and Tier-1s are inter-connected by the general purpose research networks.

Service Challenges
LCG Service Challenges are about preparing, hardening and delivering the production LHC Computing Environment.
Data recording: CERN must be capable of accepting data from the experiments and recording it at a long-term sustained average rate of 1.6-1.8 GB/s.
Service Challenge 1-2: demonstrate reliable file transfer, disk to disk, between CERN and Tier-1 centres, sustaining for one week an aggregate throughput of 500 MB/s at CERN.
Service Challenge 3: operate a reliable base service including most of the Tier-1 centres and some Tier-2s; grid data throughput of 1 GB/s, including mass storage at 500 MB/s (150 MB/s and 60 MB/s at Tier-1s).
Service Challenge 4: demonstrate that all of the offline data processing requirements expressed in the experiments' Computing Models, from raw data taking through to analysis, can be handled by the Grid at the full nominal data rate of the LHC.
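
The following small Python helper translates these sustained-rate targets into accumulated volumes, just to give a feel for the scale implied by the challenges (simple arithmetic, not figures from the slides).

```python
# Translate sustained-throughput targets into accumulated data volumes.
def volume_tb(rate_mb_per_s: float, days: float) -> float:
    """Data volume in TB accumulated at a constant rate over a number of days."""
    return rate_mb_per_s * 86400 * days / 1e6   # MB -> TB

print(f"SC1-2: 500 MB/s for one week  -> ~{volume_tb(500, 7):.0f} TB")
print(f"SC3:   1 GB/s grid throughput -> ~{volume_tb(1000, 1):.0f} TB per day")
print(f"Data recording at 1.6 GB/s    -> ~{volume_tb(1600, 1):.0f} TB per day")
```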

Schedules
Milestones: Sep 05 - SC3 service phase; May 06 - SC4 service phase; Sep 06 - initial LHC service in stable operation; Apr 07 - LHC service commissioned. (Timeline 2005-2008: SC3, SC4, LHC service operation; cosmics, first beams, first physics, full physics run.)
SC3: currently finishing the throughput phase and testing basic experiment software chains.
SC4: all Tier-1s and major Tier-2s sustain the nominal final grid data throughput (~1.5 GB/s).
LHC Service in Operation from September 2006, ramping up to full operational capacity by April 2007.

SC2 - Throughput to Tier-1s from CERN
(Throughput plot.) Multiple TCP streams and multiple concurrent file transfers have to be used to fill up the network pipe.

SC3 Throughput Tests

  Site       MoU target (tape) [MB/s]   Average achieved (disk) [MB/s]
  ASGC                100                           10
  BNL                 200                          107
  FNAL                200                          185
  GridKa              200                           42
  CC-IN2P3            200                           40
  CNAF                200                           50
  NDGF                 50                          129
  PIC                 100                           54
  RAL                 150                           52
  NIKHEF              150                          111
  TRIUMF               50                           34

All Tier-1s and the Tier-0 participating, using the SRM interface. In July: low transfer rates and poor reliability of T0-T1 transfers, running at about half the target of 1 GB/s with poor stability; T1-T2 transfers, at much lower rates, were on target. Since then better performance has been seen, after resolving FTS and Castor problems at CERN.
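
For convenience, the snippet below recomputes each site's fraction of its MoU target and the aggregate achieved rate from the numbers in the table; it is only a restatement of the table, not additional measurement data.

```python
# Per-site fraction of the MoU target achieved during the SC3 throughput tests,
# using the (tape target, disk average) numbers from the table above.
sc3 = {            # site: (MoU target [MB/s], average achieved [MB/s])
    "ASGC": (100, 10),   "BNL": (200, 107),     "FNAL": (200, 185),
    "GridKa": (200, 42), "CC-IN2P3": (200, 40), "CNAF": (200, 50),
    "NDGF": (50, 129),   "PIC": (100, 54),      "RAL": (150, 52),
    "NIKHEF": (150, 111), "TRIUMF": (50, 34),
}

for site, (target, achieved) in sorted(sc3.items(), key=lambda kv: -kv[1][1] / kv[1][0]):
    print(f"{site:9s} {achieved:4d} / {target:3d} MB/s  ({100 * achieved / target:5.1f}% of target)")

total_target = sum(t for t, _ in sc3.values())
total_achieved = sum(a for _, a in sc3.values())
print(f"Aggregate achieved: {total_achieved} MB/s of a {total_target} MB/s combined target")
```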

Baseline Services
Priority A: high priority and mandatory.
Priority B: standard solutions are required, but experiments could select different implementations.
Priority C: desirable to have a common solution, but not essential.

LCG/EGEE Services in SC3
Basic services (CE, SE, ...) plus new LCG/gLite service components tested in SC3:
- SRM Storage Element at T0, T1 and T2: SRM 1.1 interface to provide managed storage.
- FTS server at T0 and T1: T0 and T1 to provide a (reliable) File Transfer Service.
- LFC catalogue at T0, T1 and T2: local catalogues to provide information about the location of experiment files and datasets.
- VOBOX at T0, T1 and T2: for running experiment-specific agents at a site.

SRM Service
SRM v1.1 is insufficient; the experiments require:
- Volatile and permanent space
- Global space reservation: reserve, release, update (mandatory for LHCb, useful for ATLAS and ALICE)
- Permissions on directories (mandatory), preferably based on roles and not on DNs (SRM integrated with VOMS is desirable)
- Directory functions (except mv)
- Pin/unpin capability
- Relative paths in SURLs (important for ATLAS and LHCb, not for ALICE)
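
To make the wish list more tangible, here is a purely hypothetical Python rendering of such a storage-management interface; the class and method names are invented for illustration and do not correspond to the real SRM API.

```python
# Illustrative sketch of the capabilities requested above; NOT the real SRM API,
# just a hypothetical Python rendering of the wish list.
from dataclasses import dataclass, field

@dataclass
class SpaceReservation:
    token: str
    size_gb: float
    lifetime: str = "permanent"     # "volatile" or "permanent" space

@dataclass
class ManagedStorage:
    reservations: dict = field(default_factory=dict)
    pinned: set = field(default_factory=set)

    def reserve_space(self, token: str, size_gb: float, lifetime: str = "permanent"):
        """Global space reservation (reserve / release / update in the wish list)."""
        self.reservations[token] = SpaceReservation(token, size_gb, lifetime)

    def release_space(self, token: str):
        self.reservations.pop(token, None)

    def set_permissions(self, directory: str, role: str, mode: str):
        """Role-based (VOMS) rather than DN-based permissions on directories."""
        print(f"set {mode} on {directory} for role {role}")

    def pin(self, surl: str):
        """Pin a file so it stays on disk while jobs are reading it."""
        self.pinned.add(surl)

    def unpin(self, surl: str):
        self.pinned.discard(surl)
```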

FTS Service
The base storage and transfer infrastructure (GridFTP, SRM) must first become available, at high priority, and demonstrate sufficient quality of service. A reliable transfer layer is valuable, and the gLite FTS seems to satisfy the current requirements. The experiments plan on integrating with FTS as an underlying service to their own file transfer and data placement services. Interaction with FTS (e.g. catalogue access) can be implemented either in the experiment layer or by integrating into the FTS workflow. Regardless of the transfer system deployed, experiment-specific components need to run at both Tier-1s and Tier-2s. Without a general service, inter-VO scheduling, bandwidth allocation, prioritisation, rapid handling of security issues etc. would be difficult.

Local File Catalogue
File catalogues provide the mapping of Logical File Names (LFNs) to GUIDs and storage locations (SURLs). The experiments need: a hierarchical name space (directories); some form of a collection notion (datasets, fileblocks, ...); role-based security (admin, production, etc.); support for bulk operations (e.g. dump the entire catalogue). Interfaces are required to POOL, a POSIX-like I/O service, and the WMS (e.g. the Data Location Interface / Storage Index interfaces). The LCG File Catalogue fixes the performance and scalability problems seen in the EDG catalogues and provides most of the required functionality. The experiments rely on grid catalogues for locating files and datasets; experiment-dependent information is kept in experiment catalogues.
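
A toy in-memory model of this LFN to GUID to SURL mapping is sketched below; the class, the example logical names and the srm:// endpoints are all invented for illustration and have nothing to do with the actual LFC implementation.

```python
# Toy model of the LFN -> GUID -> replica (SURL) mapping described above.
import uuid

class ToyFileCatalogue:
    def __init__(self):
        self.lfn_to_guid = {}        # hierarchical logical names, e.g. /grid/atlas/...
        self.guid_to_surls = {}      # one GUID may have replicas at several SEs

    def register_file(self, lfn: str, surl: str) -> str:
        guid = self.lfn_to_guid.setdefault(lfn, str(uuid.uuid4()))
        self.guid_to_surls.setdefault(guid, []).append(surl)
        return guid

    def replicas(self, lfn: str):
        """Return all storage locations (SURLs) of a logical file."""
        return self.guid_to_surls.get(self.lfn_to_guid.get(lfn), [])

# Example with hypothetical endpoints: the same logical file with two replicas.
cat = ToyFileCatalogue()
cat.register_file("/grid/atlas/raw/run001/file1", "srm://t0.example.org/atlas/file1")
cat.register_file("/grid/atlas/raw/run001/file1", "srm://t1.example.org/atlas/file1")
print(cat.replicas("/grid/atlas/raw/run001/file1"))
```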

VOBox
The VOBox is a machine where permanently-running processes (agents or services) can be deployed and where the required security, logging and monitoring can be incorporated. It provides a way to deploy VO-specific upper-layer middleware at a site, with the aim of filling the gap between the existing LCG middleware and the VO needs. The VOBox is not meant to bypass the current middleware deployment, but to strengthen and enhance it to meet the experiment-specific requirements.

Experiment Integration: LHCb Architecture for using FTS
LHCb uses a central data movement model based at CERN: FTS plus a Transfer Agent and a Request DB (the TransferAgent and RequestDB were developed for this purpose). The Transfer Agent runs on an LHCb-managed lxgate-class machine. (Diagram: the Transfer Agent consults the File (Replica) Catalogue and the Request DB and drives the File Transfer Service through a Transfer Manager Interface; the transfer network connects the Tier-0 SE with Tier-1 SEs A, B and C.)
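
The agent pattern described above (a request DB polled by a long-lived transfer agent that hands copies to a reliable transfer service) can be sketched roughly as follows; all names are hypothetical and this is not the LHCb code.

```python
# Schematic sketch of the request-DB / transfer-agent pattern; hypothetical names.
import time

class RequestDB:
    """Queue of pending replication requests (LFN, destination SE)."""
    def __init__(self): self.pending = []
    def add(self, lfn, dest_se): self.pending.append((lfn, dest_se))
    def next(self): return self.pending.pop(0) if self.pending else None

class TransferService:
    """Stand-in for the reliable transfer layer (e.g. an FTS channel)."""
    def submit(self, source_surl, dest_se):
        print(f"transfer submitted: {source_surl} -> {dest_se}")

def transfer_agent(requests, catalogue, fts, cycles=3):
    """Poll the request DB, pick a source replica from the catalogue, submit to FTS."""
    for _ in range(cycles):
        req = requests.next()
        if req is None:
            time.sleep(0.1)                 # real agents run as long-lived daemons
            continue
        lfn, dest_se = req
        replicas = catalogue.get(lfn, [])   # catalogue: {lfn: [source SURLs]}
        if replicas:
            fts.submit(replicas[0], dest_se)

# Example run with invented file names and endpoints.
db = RequestDB()
db.add("/lhcb/raw/run1/file1", "CNAF-SE")
catalogue = {"/lhcb/raw/run1/file1": ["srm://t0.example.org/lhcb/file1"]}
transfer_agent(db, catalogue, TransferService())
```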

ALICE Workload Management
(Diagram of the pull model.) The ALICE central services keep a Job Catalogue (the task queue, with jobs and their input LFNs) and the ALICE File Catalogue (LFN to GUID to SEs). An optimizer matchmakes jobs against data locations, since it knows the close SEs. A job agent is sent through the Resource Broker to a site CE and onto a worker node; there the agent checks the site environment via the VO-Box. If the environment is OK it asks the central services for a workload, retrieves and executes it, registers the output in the catalogues and sends back the job result; otherwise it dies gracefully. The task queue is updated accordingly.
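
The pull model at the heart of this scheme can be sketched in a few lines: an agent lands on a worker node, verifies the environment, and only then asks the central task queue for work. The names below are invented for illustration and this is not the AliEn implementation.

```python
# Minimal sketch of the pull model described above; hypothetical names only.
class TaskQueue:
    """Central job catalogue holding waiting workloads."""
    def __init__(self, jobs): self.jobs = list(jobs)
    def match(self, site):
        # a real optimizer would matchmake on data location (close SEs)
        return self.jobs.pop(0) if self.jobs else None

def job_agent(site: str, queue: TaskQueue, env_ok: bool) -> str:
    if not env_ok:
        return "die with grace"            # nothing is pulled, so no job is lost
    workload = queue.match(site)
    if workload is None:
        return "no work for this site"
    # ... execute the workload, register output in the file catalogue ...
    return f"executed {workload}, result sent back to central services"

tq = TaskQueue(["job-1.1", "job-1.2"])
print(job_agent("Site-A", tq, env_ok=True))
print(job_agent("Site-B", tq, env_ok=False))
```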

Service Level Definition

  Class   Description   Downtime   Reduced    Degraded   Availability
  C       Critical      1 hour     1 hour     4 hours    99%
  H       High          4 hours    6 hours    6 hours    99%
  M       Medium        6 hours    6 hours    12 hours   99%
  L       Low           12 hours   24 hours   48 hours   98%
  U       Unmanaged     None       None       None       None

Downtime: the time between the start of the problem and the restoration of service at minimal capacity (basic function, but capacity < 50%). Reduced: the time between the start of the problem and the restoration of a reduced-capacity service (> 50%). Degraded: the time between the start of the problem and the restoration of a degraded-capacity service (> 80%). Availability: the fraction of time the service is up out of the total time in the calendar period; site-wide failures are not counted in the availability calculations. 99% means a service can be down for up to 3.6 days a year in total; 98% means up to a week in total. "None" means the service runs unattended.
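
The quoted downtime budgets follow directly from the availability targets, as the short check below shows.

```python
# Allowed cumulative downtime per year implied by an availability target.
HOURS_PER_YEAR = 365 * 24

for availability in (0.99, 0.98):
    down_hours = (1 - availability) * HOURS_PER_YEAR
    print(f"{availability:.0%} availability -> up to {down_hours:.0f} h/year "
          f"(~{down_hours / 24:.1f} days) of downtime")
```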

Example Services & Service Levels

  Service            Service level   Runs where
  Resource Broker    Critical        Main sites
  Compute Element    High            All sites
  MyProxy            Critical
  BDII               Critical        Global
  R-GMA              High
  LFC                High            All sites (ATLAS, ALICE); CERN (LHCb)
  FTS                High            T0 and T1s (except FNAL)
  SRM                Critical        All sites

This list needs to be completed and verified; then timescales for achieving the necessary service levels need to be agreed.

Building the WLCG Service
All services required to handle production data flows are now deployed at all Tier-1s and the participating Tier-2s. Next steps: bring the remaining Tier-2 centres into the process; get the (stable) data rates up to the target; identify the additional use cases and functionality; bring the core services up to the robust 24x7 production standard required, using existing experience and technology (monitoring, alarms, operators, SMS to 2nd/3rd level support); (re-)implement the required services at sites so that they can meet the MoU targets, measured through the Site Functional Tests (delivered availability, maximum intervention time, etc.). The goal is to build a cohesive service out of a large distributed community.

Building the WLCG Service: resource ramp-up
(Charts of Tier-1 evolution and Tier-2 growth over 2005-2008: CPU in MSI2K and disk and tape in petabytes for the Tier-1s; CPU in MSI2K and disk in petabytes for the Tier-2s.)

Extra

VOBox Services: ALICE
AliEn and monitoring agents and services running on the VO node:
- AliEn Computing Element (CE): interface to the LCG RB
- Storage Element Service (SES): interface to local storage (via SRM or directly)
- File Transfer Daemon (FTD): scheduled file transfer agent (possibly using the FTS implementation)
- Cluster Monitor (CM): local queue monitoring
- MonALISA: general monitoring agent
- PackMan (PM): software distribution and management
- xrootd: application file access
- Agent Monitoring (AmOn)

Polish LCG/EGEE Centres
- Cracow: CYFRONET Academic Computer Centre, http://www.cyfronet.pl/
- Warsaw: ICM Interdisciplinary Centre for Mathematical and Computational Modelling, http://www.icm.edu.pl/
- Poznań: PCSS Poznań Supercomputing and Networking Centre, http://www.man.poznan.pl/

Polish Network
(Map of the PIONIER national network linking Gdańsk, Koszalin, Szczecin, Bydgoszcz, Toruń, Olsztyn, Białystok, Poznań, Zielona Góra, Łódź, Warszawa, Wrocław, Częstochowa, Opole, Radom, Kielce, Puławy, Lublin, Katowice, Kraków, Bielsko-Biała and Rzeszów, with 10 Gb/s and 1 Gb/s links plus local networks, a 34 Mb/s connection to BASNET, and links to GÉANT and to CESNET/SANET.)

Polish Tier-2
Poland is a federated Tier-2. The HEP LHC community numbers ~60 people. Each of the computing centres will naturally support mainly one experiment: Cracow - ATLAS, Warsaw - CMS, Poznań - ALICE. The centres are currently setting up for participation in SC3 and SC4. In the future each centre will probably provide about 1/3 of an average/small Tier-2 for its experiment.