GridKa: Roles and Status
Holger Marten
Forschungszentrum Karlsruhe GmbH, Institute for Scientific Computing, P.O. Box 3640, D-76021 Karlsruhe, Germany
http://www.gridka.de
History
- 10/2000: First ideas about a German Regional Centre for LHC Computing; planning and cost estimates
- 05/2001: Start of a BaBar Tier-B with the universities of Bochum, Dresden and Rostock
- 07/2001: German HEP communities send "Requirements for a Regional Data and Computing Centre in Germany" (RDCCG); more planning and cost estimates
- 12/2001: Launching committee establishes the RDCCG (later renamed Grid Computing Centre Karlsruhe, GridKa)
- 04/2002: First prototype
- 10/2002: GridKa inauguration meeting
High Energy Physics experiments served by GridKa
- Non-LHC experiments, committed today and already taking real data: BaBar (SLAC, USA), CDF and D0 (FNAL, USA), Compass (CERN)
- LHC experiments (CERN): Alice, Atlas, CMS, LHCb
- Other sciences later
GridKa Project Organization
[Organigram of the project bodies]
- Boards: Overview Board and Technical Advisory Board (TAB) with representatives of Alice, Atlas, CMS, LHCb, BaBar, CDF, D0 and Compass
- Stakeholders: BMBF, physics committees, HEP experiments, LCG, DESY, FZK management, head of the FZK Computing Centre, chairman of the TAB
- GridKa Project Leader: planning, development, technical realization, operation
German Users of GridKa: 22 institutions, 44 user groups, 350 scientists
User groups per location: Aachen (4), Bielefeld (2), Bochum (2), Bonn (3), Darmstadt (1), Dortmund (1), Dresden (2), Erlangen (1), Frankfurt (1), Freiburg (2), Hamburg (1), Heidelberg (1) (6), Karlsruhe (2), Mainz (3), Mannheim (1), München (1) (5), Münster (1), Rostock (1), Siegen (1), Wuppertal (2)
GridKa in the network of international Tier-1 centres
- France: IN2P3, Lyon
- Germany: GridKa, Forschungszentrum Karlsruhe
- Italy: CNAF, Bologna
- Japan: ICEPP, University of Tokyo
- Spain: PIC, Barcelona
- Switzerland: CERN, Geneva
- Taiwan: Academia Sinica, Taipei
- UK: Rutherford Laboratory, Chilton
- USA: Fermi Laboratory, Batavia, IL
- USA: BNL
Warning: list not fixed.
The fifth LHC subproject: the global LHC Computing Centre
[Diagram of the tier model and the virtual organizations (ATLAS, CMS, LHCb, ..., working groups)]
- Tier 0: CERN
- Tier 1: national centres, e.g. USA (Fermi, BNL), UK (RAL), France (IN2P3), Italy (CNAF), Germany (FZK), CERN
- Tier 2: university and laboratory computing centres (Uni-CCs, Lab-CCs)
- Tier 3: institute computers
- Tier 4: desktops
LHC Computing Model (simplified!!) [Les Robertson, GDB, May 2004]
[Map of the Tier-0, Tier-1 and Tier-2 sites, including the Forschungszentrum Karlsruhe Tier-1 (FZK), RAL, IN2P3, CNAF, FNAL, BNL, PIC, TRIUMF, Taipei and ICEPP, plus small Tier-2 centres, desktops and portables]
Tier-0, the accelerator centre:
- filter raw data
- reconstruction: produce summary data (ESD)
- record raw data and ESD
- distribute raw data and ESD to the Tier-1s
Tier-1:
- permanent storage and management of raw, ESD, calibration data, meta-data, analysis data and databases; grid-enabled data service
- data-heavy analysis
- re-processing raw data to ESD
- national, regional support
- online to the data acquisition process: high availability (24h x 7d)
- managed mass storage
- long-term commitment
- resources: 50% of average
Tier-2 [Les Robertson, GDB, May 2004]
- well-managed, grid-enabled disk storage
- simulation
- end-user analysis, batch and interactive
- high performance parallel analysis (PROOF?)
Each Tier-2 is associated with a Tier-1 that
- serves as the primary data source
- takes responsibility for long-term storage and management of all of the data generated at the Tier-2 (grid-enabled mass storage)
- may also provide other support services (grid expertise, software distribution, maintenance, ...)
CERN will not provide these services for Tier-2s, except by special arrangement.
GridKa planned resources (status Jan-2004)
[Chart: planned CPU capacity in kSI95 and disk/tape capacity in TByte, 2002-2009, spanning LCG Phase I, Phase II and Phase III]
Distribution of planned resources at GridKa (status Jan-2004)
[Charts: LHC vs. non-LHC shares of CPU, disk and tape capacity from 2002 to 2009; the non-LHC experiments (BaBar, CDF, D0) make a significant contribution to the regional centre]
GridKa Environment
[Photos: IWR buildings 441 and 442, the main building, and the tape storage]
Worker Nodes & Test Beds
Production environment:
- 97x dual PIII, 1.26 GHz, 1 GB RAM, 40 GB HD: 97 kSI2000
- 64x dual PIV, 2.2 GHz, 1 GB RAM, 40 GB HD: 102 kSI2000
- 72x dual PIV, 2.667 GHz, 1 GB RAM, 40 GB HD: 130 kSI2000
- 267x dual PIV, 3.06 GHz, 1 GB RAM, 40/80 GB HD: 534 kSI2000
- 36x dual Opteron 246, 2 GB RAM, 80 GB HD: 90 kSI2000
Total: 536 nodes, 1072 CPUs, 953 kSI2000; installed with RH 7.3 and LCG 2.2.0 (except for the Opterons)
Test environment: an additional 30 machines in several test beds
Next OS: Scientific Linux, if middleware and applications are ready
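The quoted cluster totals follow directly from the per-batch numbers above; a minimal Python check (batch names and figures taken from the table, nothing else assumed):

```python
# Sanity check of the production-cluster totals quoted above:
# (node count, kSI2000 rating) per hardware batch, as listed on the slide.
batches = {
    "dual PIII 1.26 GHz":  (97,  97),
    "dual PIV 2.2 GHz":    (64, 102),
    "dual PIV 2.667 GHz":  (72, 130),
    "dual PIV 3.06 GHz":   (267, 534),
    "dual Opteron 246":    (36,  90),
}

nodes = sum(n for n, _ in batches.values())
cpus = 2 * nodes                       # every node is a dual-CPU machine
ksi2000 = sum(k for _, k in batches.values())

print(nodes, cpus, ksi2000)            # 536 nodes, 1072 CPUs, 953 kSI2000
```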
PBSPro fair share according to requirements (1 Oct 2004)

experiment | kSI2000 | share  | percentage
Alice      |     143 | 14 300 | 13.2
Atlas      |     150 | 15 000 | 13.9
CMS        |     140 | 14 000 | 12.9
LHCb       |      56 |  5 600 |  5.2
BaBar      |     210 | 21 000 | 19.4
CDF        |      50 |  5 000 |  4.6
Dzero      |     283 | 28 300 | 26.2
Compass    |      50 |  5 000 |  4.6

45% LHC, 55% non-LHC
The default (test) queue is not handled by the fair share; these 20-30 CPUs are kept free for test jobs.
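The percentage column is simply each experiment's kSI2000 requirement divided by the total requirement; a short Python sketch reproducing the table (values from the slide, rounding assumed to one decimal place):

```python
# The fair-share percentages follow from each experiment's kSI2000
# requirement relative to the total requirement (values from the table).
requirements = {
    "Alice": 143, "Atlas": 150, "CMS": 140, "LHCb": 56,     # LHC
    "BaBar": 210, "CDF": 50, "Dzero": 283, "Compass": 50,   # non-LHC
}
total = sum(requirements.values())                          # 1082 kSI2000

for exp, ksi in requirements.items():
    print(f"{exp:8s} {ksi:4d} kSI2000 -> {100 * ksi / total:5.1f} %")

lhc = sum(requirements[e] for e in ("Alice", "Atlas", "CMS", "LHCb"))
print(f"LHC {100 * lhc / total:.0f} %, non-LHC {100 * (total - lhc) / total:.0f} %")
# prints ~45 % LHC and ~55 % non-LHC, as quoted on the slide
```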
Disk space available for HEP experiments: 202 TB (Oct 04)
[Bar chart: TByte per experiment for ALICE, ATLAS, CMS, LHCb, BaBar, CDF, D0, Compass]
29% LHC, 71% non-LHC
Online Storage I
- about 40 TB stored in NAS (better: DAS): dual CPU, 16 EIDE disks, 3Ware controller
Experience:
- hardware cheap, but not very reliable
- RAID software & management messages not always useful
- good throughput for a few simultaneous jobs, but doesn't scale to a few hundred simultaneous file accesses
Workarounds (see the sketch below):
- disk mirroring
- management software ("managed disks"): file copies on multiple boxes
- more reliable disks + parallel file system
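As a toy illustration of the "managed disks" workaround, a hedged Python sketch that keeps copies of a file on two independent NAS boxes and verifies them by checksum; the mount points and file names are invented for the example and do not reflect the actual GridKa management software:

```python
# Toy sketch: replicate a file onto several NAS boxes and verify the copies.
# Paths below are hypothetical; the real "managed disks" software is not shown.
import hashlib
import shutil
from pathlib import Path

def md5sum(path: Path) -> str:
    h = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def replicate(src: Path, mirrors: list[Path]) -> None:
    """Copy src into every mirror directory and compare checksums."""
    reference = md5sum(src)
    for box in mirrors:
        dst = box / src.name
        shutil.copy2(src, dst)
        assert md5sum(dst) == reference, f"corrupt copy on {box}"

replicate(Path("/data/alice/run001.root"),
          [Path("/nas01/alice"), Path("/nas02/alice")])
```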
Online Storage: I/O Design with NAS (DAS)
[Diagram: compute nodes access per-experiment NAS boxes (e.g. Alice, Atlas) via TCP/IP/NFS; expansion by adding boxes; bottlenecks at ~30 MB/s read/write per box and at the disk access]
Online Storage II
- about 160 TB stored in a SAN
- SCSI disks (10k rpm) with redundant controllers
- parallel file system on a file server cluster
- exported via NFS from the file server cluster to the WNs
Online Storage: Scalable I/O Design
[Diagram: compute nodes - TCP/IP/NFS - file server cluster - Fibre Channel SAN (SCSI) - RAID 5 storage; expansion by adding servers and storage]
Striping + parallel file system; 350-400 MB/s I/O measured
Online Storage II (continued)
Advantages:
- high availability through multiple redundant servers
- load balancing via an automounter program map (see the sketch below)
Experience:
- many teething problems (bugs, learning how to configure, ...)
- ratio of CPU time to wall-clock time close to 1 in some applications
- more expensive -> next try: cheaper S-ATA systems
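A minimal sketch of what such an automounter "program" map might look like: autofs calls the executable with the requested key and expects one map entry on stdout. The server names, export paths and mount options below are hypothetical, not the GridKa configuration:

```python
#!/usr/bin/env python
# Sketch of an autofs program map for load balancing across a file
# server cluster: autofs invokes this script with the requested key
# (e.g. an experiment name) and mounts whatever location it prints.
import random
import sys

FILE_SERVERS = ["fs01", "fs02", "fs03", "fs04"]   # hypothetical NFS servers

def main() -> None:
    key = sys.argv[1]                              # e.g. "alice", "atlas"
    server = random.choice(FILE_SERVERS)           # naive load balancing
    print(f"-fstype=nfs,rw,hard,intr {server}:/export/{key}")

if __name__ == "__main__":
    main()
```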
Why tell you all this? Because we need your experience and feedback as users!
Tape space available for HEP experiments: 374 TB (Oct 04)
[Bar chart: TByte per experiment for ALICE, ATLAS, CMS, LHCb, BaBar, CDF, D0, Compass]
27% LHC, 73% non-LHC
Tape Storage
- tape library IBM 3584 LTO Ultrium: 8 drives LTO-1, 4 drives LTO-2, 375 TB native (uncompressed)
- Tivoli Storage Manager (TSM) for backup and archive
- installation of dCache in progress: tape backend interfaced to TSM; installation with 1 head node and 3 pool nodes, currently tested by CMS & CDF
- other: SAM station caches for D0 and CDF; JIM (Job Information Management) station for D0; tape connection via scripts for D0 (see the sketch below); CORBA naming service (for CDF)
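As an illustration of a tape-connection script, a hedged Python sketch that wraps the TSM command-line client dsmc to archive and retrieve files. The actual D0 scripts are not shown on the slide, so the file paths, description text and option usage here are purely illustrative:

```python
# Hedged sketch of a tape-connection script around the Tivoli Storage
# Manager client (dsmc).  Not the real D0 scripts; paths are made up.
import subprocess

def archive_to_tape(path: str, description: str) -> None:
    """Archive one file with TSM; raises CalledProcessError on failure."""
    subprocess.run(
        ["dsmc", "archive", path, f"-description={description}"],
        check=True,
    )

def retrieve_from_tape(path: str) -> None:
    """Bring an archived file back from tape to its original location."""
    subprocess.run(["dsmc", "retrieve", path], check=True)

if __name__ == "__main__":
    archive_to_tape("/grid/d0/run42.raw", "D0 raw data, run 42")
```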
GridKa Plan for WAN Connectivity
[Chart of the planned bandwidth, 2001-2008: 34 Mbps -> 155 Mbps -> 2 Gbps -> start of 10 Gbps tests -> 10 Gbps -> 20 Gbps; "Start discussion with Dante!"]
Sept 2004: DFN upgraded the capacity from Karlsruhe to Géant to 10 Gbps; tests have been started!
Routing (full 10 Gbps): GridKa -> DFN (Karlsruhe) -> DFN (Frankfurt) -> Géant (Frankfurt) -> Géant (Milano) -> Géant (Geneva) -> CERN
Further services & sources of information
GGUS (Global Grid User Support) www.ggus.org
User information
www.gridka.de -> GridKa Info:
- user registration
- Globus installation
- batch system PBS (see the sketch below)
- backup & archive
- getting a certificate from the GermanGrid CA
- listserver / mailing lists
- monitoring status with Ganglia
www.gridka.de -> HEP experiments: experiment-specific information
www.ggus.org: FAQ, documentation, ...
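As a small illustration of using the PBS batch system mentioned above, a hedged Python sketch that pipes a minimal job script into qsub. The queue name, resource requests and payload are made up; the real site-specific settings are documented on the GridKa user pages:

```python
# Illustrative PBS submission: qsub reads the job script from stdin and
# prints the new job id.  Queue and resource limits below are invented.
import subprocess

JOB_SCRIPT = """#!/bin/sh
#PBS -N hello-gridka
#PBS -q test
#PBS -l nodes=1
#PBS -l walltime=00:10:00
echo "running on $(hostname)"
"""

def submit(script: str) -> str:
    """Pipe a job script into qsub and return the job id it prints."""
    result = subprocess.run(["qsub"], input=script, text=True,
                            capture_output=True, check=True)
    return result.stdout.strip()

if __name__ == "__main__":
    print("submitted job", submit(JOB_SCRIPT))
```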
Tools: gridmon.fzk.de/ganglia
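Ganglia's gmond daemon normally serves the full cluster state as XML on TCP port 8649; a hedged Python sketch that polls it and prints the one-minute load per host. The host name is the monitoring host from the slide, while the port and metric names assume a default Ganglia installation rather than the exact GridKa setup:

```python
# Hedged sketch of reading cluster status from Ganglia's gmond XML dump.
import socket
import xml.etree.ElementTree as ET

def read_gmond(host: str, port: int = 8649) -> ET.Element:
    """Fetch the XML that gmond writes to any connecting client."""
    chunks = []
    with socket.create_connection((host, port), timeout=10) as sock:
        while data := sock.recv(4096):
            chunks.append(data)
    return ET.fromstring(b"".join(chunks))

def print_load(root: ET.Element) -> None:
    for host in root.iter("HOST"):
        for metric in host.iter("METRIC"):
            if metric.get("NAME") == "load_one":    # standard Ganglia metric
                print(f"{host.get('NAME'):30s} load {metric.get('VAL')}")

if __name__ == "__main__":
    print_load(read_gmond("gridmon.fzk.de"))
```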
Final remarks
Europe on the way to e-science
EU project EGEE:
- April 2004 to March 2006
- 32 million euros for personnel
- 70 partner institutes in 27 countries, organized in 9 federations
- applications: LHC grid, Biomed, ...
"Provide distributed European research communities with a common market of computing, offering round-the-clock access to major computing resources, independent of geographic location, ..."
Status of LCG / EGEE http://goc.grid-support.ac.uk/lcg2
Last but not least
We want to help:
- our users on our systems
- support/discuss cluster installations at other institutes
- support/discuss middleware installations at other centres
- create a German Grid infrastructure
and ... we will continue the balancing act between
- testing & data challenges
- production with real data
No equipment without people. Thanks! We appreciate the continuous interest and support by the Federal Ministry of Education and Research, BMBF.