Report from SARA/NIKHEF T1 and associated T2s



Ron Trompert, SARA

About SARA and NIKHEF. NIKHEF is the Dutch High Energy Physics institute. SARA is a high-performance computing centre; it manages the SURFnet 6 network for the Dutch NREN SURFnet and manages HSM environments. SARA and NIKHEF are located in the same building but are separate organisations, collaborating in WLCG, EGEE and the Dutch e-science project VL-e.

About SARA and NIKHEF. The two sites run separate computational domains. There are weekly meetings to discuss operational matters, middleware development (EGEE, VL-e), etc., and bi-weekly meetings to discuss planning and other T1 matters.

Infrastructure

Deployed hardware, SARA:
Storage: 59 TB of disk storage for WLCG (dCache) and 1.8 TB of disk cache for the tape silos
  ALICE  T0D1: 2 TB   T1D0: 3 TB
  ATLAS  T0D1: 12 TB  T1D0: 4 TB
  LHCb   T0D1: 34 TB  T1D0: 4 TB
Tape: 50 TB of tapes, 6 dedicated 9940B drives (SL8500 and Powderhorn)
Computing: 37 ksi2k MATRIX (3 GHz Xeon, 2 CPUs/node), WLCG shared with other VOs; 2231 ksi2k LISA (ATLAS and LHCb), shared with other users (fairshare of 18%)

Deployed hardware, NIKHEF:
Storage: 14 TB of disk storage for WLCG (DPM)
  ALICE  T0D1: 1 TB
  ATLAS  T0D1: 11 TB
  LHCb   T0D1: 2 TB
Computing:
  305 ksi2k Luilak-1 (2.66 GHz Xeon 5150, 4 CPUs/node), dedicated to WLCG
  305 ksi2k Luilak-2 (2.66 GHz Xeon 5150, 4 CPUs/node), shared with other (non-LHC) VOs
  70 ksi2k Halloween (2.8 GHz Xeon, 2 CPUs/node), dedicated to WLCG
  106 ksi2k Bulldozer (3.2 GHz Xeon, 2 CPUs/node), 1/3 for WLCG

3D project: two dual-Xeon nodes (Oracle RAC) at SARA with SAN storage (300 GB for now). The infrastructure is being tested and we participate in the ATLAS replication tests. A DBA workshop will be held at SARA on March 20-21, 2007.

Metrics: normalised wallclock time over 2006 (chart slides)

Metrics: data transfers in 2006 for the SARA SRM

VO      Transfers in   Transfers out   Store to tape   Restore from tape
ALICE   158 TB         0 TB            0 TB            0 TB
ATLAS   12 TB          29 TB           11 TB           5 TB
LHCb    8 TB           9 TB            5 TB            0 TB

Metrics: data stored

       ALICE   ATLAS   LHCb
Disk   2 TB    14 TB   7 TB
Tape   0 TB    18 TB   7 TB

Activities during the past year, NIKHEF: the LDAP VO server grid-vo.nikhef.nl was retired; the latest versions of Torque and Maui were installed and work was done on the scheduling (fairshares); new worker nodes were installed; a DPM SE plus storage was installed.

Activities during the past year, SARA: installed 5 and later 10 storage servers with a total of 78 TB; installed 4 extra tape drives; separated the HSM environments, so the T1 now has its own dedicated HSM environment; rearranged the network; installed 5 new service nodes for the Oracle RAC (2 nodes), monitoring software (Nagios), the dCache admin node and an install server.

Activities during the past year: rerun of the SC3 disk-to-disk throughput test. We met the target rate of 150 MB/s, with peaks up to 272 MB/s.

Activities during the past year: SC4 throughput tests. Disk-to-disk we sustained 154 MB/s for 15 days, after a slow start-up due to some problems. Disk-to-tape reached only about 50 MB/s because of a suboptimal SAN configuration; we now write at 28+ MB/s per 9940B drive.
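As a rough cross-check (assuming all six dedicated 9940B drives stream concurrently), 6 drives at 28 MB/s each would give an aggregate disk-to-tape rate of about 6 x 28 = 168 MB/s, well above the roughly 50 MB/s achieved with the earlier SAN configuration.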

Funding: the WLCG resources are funded by BiG Grid, a proposal for the funding of a Dutch e-science infrastructure that started on January 1st, 2007. It amounts to about 28M over 4 years, to be shared by a number of participants in the Netherlands.

Planning, short term: run the grid service nodes (CE, UI, FTS, ...) on reliable hardware: double system disks, double power supplies, mirrored memory and high-availability Linux.

Planning: keeping up with the Harry tables

          CPU         Disk     Tape
Q1 2007   742 ksi2k   129 TB   187 TB
Q2 2007   958 ksi2k   156 TB   214 TB

Planning: 24x7 support. Initially this will not be available. We will try to achieve the required MoU service level through redundancy and see how it goes; if we do not achieve the required service level, we will reconsider.

Associated T2s:
Israel: Weizmann
Russia: IHEP, ITEP, SINP, JINR, RRC
Czech Republic: Prague (alternate data path: FZK)
UK: NorthGrid sites Lancaster, Liverpool, Manchester, Sheffield (alternate data path: RAL)

ITEP: 2.3 TB disk (dCache), a 1 TB classic SE with xrootd for ALICE and a 1 TB DPM SE; the plan is to unite all SEs into the dCache SE. 32 WNs (dual-CPU P4 2.4 GHz nodes), an LCG-CE and a gLite-CE, VOBOX, UI, MON, WMS/LB and BDII. External 1 Gbps network connection.

RRC: 2 disk servers (3x2 TB DPM storage plus 2 TB shared filesystem for experiment software, etc.). 50 computing nodes (dual P4 3.0 and 3.4 GHz) available soon. The external network connection is 100 Mbps at the moment; we hope to get 1 Gbps this year.

JINR: a 30 TB dCache SE, with plans for an extra 30 TB, and a 1.8 TB DPM SE. One farm for all VOs with 15 dual P3 1 GHz nodes, to be upgraded to dual-core dual Xeon in Q2; a second farm of 4 dual P4 nodes. 2 LCG-CEs, with 2-3 additional CEs planned plus a Torque server. LFC, BDII, PROXY, WMS/LB and RB.

SINP: a classic SE and a 5.4 TB dCache SE, with plans to install an extra 14 TB. An LCG-CE with 29 WNs (P4 2.8 GHz and dual Xeon 2.8 GHz) and a gLite-CE with 12 WNs (dual-core dual Opteron 2.4 GHz). 1 Gbps network connectivity.

Trouble ticket systems. Currently both NIKHEF and SARA have their own system with a person on duty watching over it. We are considering a single trouble ticket system for the SARA/NIKHEF T1. The EGEE Northern Federation has RT with a site on duty watching over it every other week.

Networking:
10G to CERN: up and running (layer 2) but not in production right now because of routing issues
10G to FZK: the aim is to use SARA-FZK-CERN as a backup when the SARA-CERN link fails
1G to Lancaster: in progress
2G SARA-NIKHEF: plans to upgrade this to 10G

Issues, network: the security policy dictates that only traffic related to T0-T1 and T1-T1 data transfers is allowed on the LHC OPN. Part of the SARA-CERN traffic should therefore go through the LHC OPN while the rest goes through Géant, which implies source-based routing. At the same time we would like to use BGP for redundancy (using the SARA-FZK-CERN link as backup), but source-based routing excludes that redundancy and BGP excludes the possibility to discriminate on source IP addresses. Other sites have solved this issue using dedicated hardware. This will be discussed further.
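As a toy illustration of the conflict (purely illustrative Python with made-up prefixes; real routers express this with policy routing and BGP configuration, not code): source-based routing selects the next hop from the (source, destination) pair, while BGP fail-over selects it from the destination alone, so one plain routing table cannot provide both behaviours at once.

#!/usr/bin/env python3
# Toy model of the routing conflict described above. All prefixes and next
# hops are invented for illustration; they are not SARA's real addresses.
from ipaddress import ip_address, ip_network

OPN_NEXT_HOP = "opn-router"        # LHC OPN path towards CERN
GEANT_NEXT_HOP = "geant-router"    # general-purpose path via Géant
BACKUP_NEXT_HOP = "fzk-router"     # SARA-FZK-CERN backup path

# Only traffic from the T1 data-transfer prefixes may use the OPN (assumed range).
OPN_ALLOWED_SOURCES = [ip_network("192.0.2.0/24")]
CERN_PREFIX = ip_network("198.51.100.0/24")          # assumed CERN destination range

def source_based_next_hop(src, dst, opn_up=True):
    """Source-based routing: the decision needs both source and destination."""
    src, dst = ip_address(src), ip_address(dst)
    if dst in CERN_PREFIX and any(src in net for net in OPN_ALLOWED_SOURCES):
        return OPN_NEXT_HOP if opn_up else BACKUP_NEXT_HOP
    return GEANT_NEXT_HOP

def bgp_next_hop(dst, opn_up=True):
    """Destination-based (BGP-style) routing: the source is not part of the
    decision, so it cannot keep non-T1 traffic off the OPN."""
    dst = ip_address(dst)
    if dst in CERN_PREFIX:
        return OPN_NEXT_HOP if opn_up else BACKUP_NEXT_HOP
    return GEANT_NEXT_HOP

if __name__ == "__main__":
    # A host outside the allowed source range talking to CERN:
    print(source_based_next_hop("203.0.113.5", "198.51.100.10"))  # geant-router
    print(bgp_next_hop("198.51.100.10"))                          # opn-router (policy violation)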

Issues, storage:
dCache: gridftp doors stopping due to hanging transfers; a watchdog script restarts the gridftp door (a sketch of this kind of check is given below). SRM timeouts due to the growth of the PostgreSQL database that underlies the SRM interface.
ROOT I/O for LHCb: LHCb jobs running at NIKHEF tried to open files at SARA using gsidcap-based ROOT I/O; (gsi)dcap connects back to the WN, which is on private NIKHEF IP space, so this failed. We installed dCache 1.7.0, which includes a passive version of dcap, as soon as it was released.
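A minimal sketch of this kind of watchdog, assuming a hypothetical check against the GridFTP control port and an illustrative restart command (the port number, path and restart command are assumptions, not SARA's actual script):

#!/usr/bin/env python
# Hypothetical watchdog sketch: restart the dCache gridftp door when its
# control port no longer accepts TCP connections. The port, path and restart
# command below are illustrative assumptions.
import socket
import subprocess
import sys

GRIDFTP_HOST = "localhost"
GRIDFTP_PORT = 2811                                        # common GridFTP control port
RESTART_CMD = ["/opt/d-cache/bin/dcache-core", "restart"]  # assumed restart command

def door_alive(host, port, timeout=10):
    """Return True if a TCP connection to the door succeeds within the timeout."""
    try:
        sock = socket.create_connection((host, port), timeout)
        sock.close()
        return True
    except (socket.error, socket.timeout):
        return False

if __name__ == "__main__":
    if not door_alive(GRIDFTP_HOST, GRIDFTP_PORT):
        # Door is hung or down: restart it and note the action on stderr.
        sys.stderr.write("gridftp door unresponsive, restarting\n")
        subprocess.call(RESTART_CMD)

Run from cron every few minutes, this is enough to recover from the hanging-transfer condition described above; the LFC check mentioned on the next slide follows the same pattern, polling every two minutes and restarting the lfcdaemon on failure.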

Issues:
LFC: sometimes the LFC loses its connection to the Oracle DB, and its attempts to reconnect fail for an unknown reason; a watchdog script checks this every 2 minutes and restarts the lfcdaemon.
Randomly failing SAM tests: the usual procedure is to check whether the service is still running and, if so, repeat the failed test by hand; if the test succeeds, the SAM test suite is invoked manually to clear the error status.