Experience of Data Transfer to the Tier-1 from a DiRAC Perspective
Lydia Heck, Institute for Computational Cosmology
Manager of the DiRAC-2 Data Centric Facility, COSMA
Talk layout
- Introduction to DiRAC
- The DiRAC computing systems
- What is DiRAC?
- What type of science is done on the DiRAC facility?
- Why do we need to copy data to RAL?
- Copying data to RAL: network requirements
- Collaboration between DiRAC and RAL to produce the archive
- Setting up the archiving tools
- Archiving
- Open issues
- Conclusions
Introduction to DiRAC
- DiRAC: Distributed Research utilising Advanced Computing, established in 2009 with DiRAC-1
- Supports research in theoretical astronomy, particle physics and nuclear physics
- Funded by STFC, with infrastructure money allocated by the Department for Business, Innovation and Skills (BIS)
- Running costs, such as staff and electricity, are funded by STFC
Introduction to DiRAC, cont'd
- 2009, DiRAC-1: 8 installations across the UK, of which COSMA-4 at the ICC in Durham is one; still a loose federation
- 2011/2012, DiRAC-2: major funding of 15M for e-infrastructure
  - in the bidding, 5 installations were identified to host systems, judged by peers
  - successful bidders faced scrutiny and interview by representatives of BIS to see if we could deliver by a tight deadline
Introduction to DiRAC, cont'd
- DiRAC has a full management structure
- Computing time on the DiRAC facility is allocated through a peer-reviewed procedure
- Current director: Dr Jeremy Yates, UCL
- Current technical director: Prof Peter Boyle, Edinburgh
The DiRAC computing systems
- Blue Gene: Edinburgh
- Cosmos: Cambridge
- Complexity: Leicester
- Data Centric: Durham
- Data Analytic: Cambridge
The Blue Gene @ DiRAC (Edinburgh)
- IBM Blue Gene, 98,304 cores
- 1 Pbyte of GPFS storage
- designed around (Lattice) QCD applications
COSMA @ DiRAC (Data Centric, Durham)
- Data Centric system: IBM iDataPlex
- 6,720 Intel Sandy Bridge cores
- 53.8 TB of RAM
- FDR10 InfiniBand, 2:1 blocking
- 2.5 Pbyte of GPFS storage (2.2 Pbyte used!)
Complexity @ DiRAC (Leicester)
- Complexity: HP system
- 4,352 Intel Sandy Bridge cores
- 30 Tbyte of RAM
- FDR InfiniBand, 1:1 non-blocking
- 0.8 Pbyte of Panasas storage
COSMOS @ DiRAC (SMP, Cambridge)
- COSMOS: SGI shared-memory system
- 1,856 Intel Sandy Bridge cores
- 31 Intel Xeon Phi coprocessors
- 14.8 Tbyte of RAM
- 146 Tbyte of storage
HPCS @ DiRAC (Data Analytic, Cambridge)
- Data Analytic: Dell system
- 4,800 Intel Sandy Bridge cores
- 19.2 TByte of RAM
- FDR InfiniBand, 1:1 non-blocking
- 0.75 PB of Lustre storage
What is DiRAC?
- A national service run, managed and allocated by the scientists who do the science, funded by BIS and STFC
- The systems are built around and for the applications with which the science is done
- We do not rival a facility like ARCHER, as we do not aspire to run a general national service
- DiRAC is classed as a major research facility by STFC, on a par with the big telescopes
What is DiRAC?, cont'd
- Long projects with a significant number of CPU hours, typically allocated for 3 years on a specific system
- Examples for 2012-2015:
  - Cosmos (dp002): ~20M cpu hours on Cambridge Cosmos
  - Virgo (dp004): 63M cpu hours on Durham DC
  - UK-MHD (dp010): 40.5M cpu hours on Durham DC
  - UK-QCD (dp008): ~700M cpu hours on Edinburgh BG
  - Exeter (dp005): ~15M cpu hours on Leicester Complexity
  - HPQCD (dp019): ~20M cpu hours on Cambridge Data Analytic
What type of science is done on DiRAC?
- For the highlights of science carried out on the DiRAC facility please see: http://www.dirac.ac.uk/science.html
- Specific example: large-scale structure calculations with the Eagle run
  - 4096 cores, ~8 GB RAM/core
  - 47 days = 4,620,288 cpu hours (worked out below)
  - 200 TB of data
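As a quick check on the quoted figure, the cpu-hour total is simply the core count multiplied by the wall-clock time:

\[ 4096 \ \text{cores} \times 47 \ \text{days} \times 24 \ \text{h/day} = 4\,620\,288 \ \text{cpu hours} \]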
Why do we need to copy data (to RAL)?
- Original plan: each research project should make provisions for storing its research data
  - requires additional storage resources at the researchers' home institutions
  - not enough provision; would require additional funds
  - data creation considerably above expectation
- If disaster struck, many cpu hours of calculations would be lost
Why do we need to copy data (to RAL)?, cont'd
- Research data must now be shared with/made available to interested parties
- Installing DiRAC's own archive requires funds, and currently there is no budget
- We needed to get started: Jeremy Yates negotiated access to the RAL archive system
  - acquire expertise
  - identify bottlenecks and technical challenges (submitting 2,000,000 files created an issue at the file servers)
  - learn how we can collaborate and make use of previous experience
  - AND: copy data!
Copying data to RAL: network requirements
- Network bandwidth situation for Durham
  - now: 300-400 Mbytes/sec currently possible
  - required investment and collaboration from DU CIS
  - upgrade to 6 Gbit/sec to JANET - Sep 2014
  - will be 10 Gbit/sec by end of 2015; infrastructure already installed
- Identified Durham-related bottleneck: the FIREWALL
Copying data to RAL: network requirements, cont'd
- Investment to by-pass the external campus firewall:
  - two new routers (~80k), configured for throughput with a minimal ACL, enough to safeguard the site
  - internal firewalls deployed as part of the new security infrastructure, essential for such a venture
- Security now relies on the front-end systems of Durham DiRAC and Durham GridPP
Copying data to RAL: network requirements, cont'd
- Result for COSMA and GridPP in Durham:
  - guaranteed 2-3 Gbit/sec, with bursts of up to 3-4 Gbit/sec (3 Gbit/sec outside of term time)
  - pushed the network performance for Durham GridPP from the bottom 3 in the country to the top 5 of the UK GridPP sites
  - achieves up to 300-400 Mbyte/sec throughput to RAL when archiving, depending on file sizes (3-4 Gbit/sec corresponds to roughly 375-500 Mbyte/sec, so archiving runs close to the available bandwidth)
Collaboration between DiRAC and GridPP/RAL
- The Durham Institute for Computational Cosmology (ICC) volunteered to be the prototype installation
- Huge thanks to Jens Jensen and Brian Davies: many emails exchanged, many questions asked and many answers given
- Resulting document: "Setting up a system for data archiving using FTS3" by Lydia Heck, Jens Jensen and Brian Davies
Setting up the archiving tools
- Identify appropriate hardware; could mean extra expense:
  - need the freedom to modify and experiment with it - cannot have HPC users logged in and working!
  - free to apply the very latest security updates
  - requires an optimal connection to storage - an InfiniBand card
Setting up the archiving tools, cont'd
- Create an interface to access the file/archiving service at RAL using the GridPP tools (a test-copy sketch follows below):
  - gridftp: Globus Toolkit, which also provides Globus Connect
  - trust anchors (egi-trustanchors)
  - voms tools (emi3-xxx)
  - fts3 (CERN)
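As a concrete illustration of this interface, a minimal sketch of a one-off test copy over GridFTP. The VO name, the gridftp host and the storage paths below are placeholders (assumptions, not the real Durham/RAL endpoints), and a valid grid certificate is assumed to be installed.

  # create a short-lived VOMS proxy (the VO name here is an assumption)
  voms-proxy-init --voms dirac.ac.uk --valid 12:00

  # copy a single test file from local disk to the archive endpoint;
  # -vb reports transfer performance, -p 4 opens 4 parallel streams
  globus-url-copy -vb -p 4 \
      file:///cosma5/data/test/largefile.dat \
      gsiftp://gridftp.example.rl.ac.uk/dirac/test/largefile.dat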
Archiving
- Long-lived voms proxy? myproxy-init; myproxy-logon; voms-proxy-init; fts-transfer-delegation
- How to create a proxy and delegation that lasts weeks, even months? Still an issue.
- grid-proxy-init; fts-transfer-delegation:
  - grid-proxy-init -valid HH:MM
  - fts-transfer-delegation -e time-in-seconds
  - creates a proxy that lasts up to the certificate lifetime (see the sketch below)
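A minimal sketch of the proxy and delegation step described above. The FTS3 service URL, and the use of -s on fts-transfer-delegation to name it, are assumptions (placeholders for the RAL FTS3 endpoint); the lifetimes are examples only.

  # plain grid proxy, here requesting one week (168 hours);
  # the proxy cannot outlive the certificate itself
  grid-proxy-init -valid 168:00

  # delegate credentials to the FTS3 server so it can run transfers on our behalf;
  # -e requests the delegation lifetime in seconds (~30 days here)
  fts-transfer-delegation -s https://fts3.example.ac.uk:8446 -e 2592000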
Archiving, cont'd
- Large files: optimal throughput, limited by network bandwidth
- Many small files: limited by latency; use the -r flag to fts-transfer-submit to re-use the connection (example below)
- Transferred since 20 August: ~40 Tbytes, ~2M files - a challenge to the FTS service at RAL
- User education needed on creating lots of small files
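A minimal sketch of submitting a single transfer through FTS3, using the -r (connection re-use) option mentioned above for batches of small files. The FTS3 endpoint, gridftp doors and paths are placeholders, not the real Durham/RAL names.

  # submit one source/destination pair; -s names the FTS3 service,
  # -r asks FTS to re-use the gridftp connection (helps with many small files)
  fts-transfer-submit -s https://fts3.example.ac.uk:8446 -r \
      gsiftp://cosma-grid.example.dur.ac.uk/cosma5/data/dp004/file_001.hdf5 \
      gsiftp://gridftp.example.rl.ac.uk/dirac/dp004/file_001.hdf5

  # the submit command prints a job ID; poll it until the transfer finishes
  fts-transfer-status -s https://fts3.example.ac.uk:8446 <job-id>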
Open issues
- Ownership and permissions are not preserved; archiving depends on a single admin to carry it out
- What happens when the content of directories changes? A complete new archive session?
  - it tries to archive all the files again, but then fails because the files already exist
  - it should behave more like rsync (a possible work-around is sketched below)
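A rough, rsync-like work-around one could imagine (not part of the current setup): check whether a file already exists in the archive before submitting it. This sketch assumes gfal2-util (gfal-ls) is available, which is not in the tool list above, and all endpoints and paths are placeholders.

  SRC_DIR=/cosma5/data/dp004
  DST_BASE=gsiftp://gridftp.example.rl.ac.uk/dirac/dp004
  FTS=https://fts3.example.ac.uk:8446

  for f in "$SRC_DIR"/*; do
      name=$(basename "$f")
      # only submit files that are not yet in the archive
      if ! gfal-ls "$DST_BASE/$name" > /dev/null 2>&1; then
          fts-transfer-submit -s "$FTS" \
              "gsiftp://cosma-grid.example.dur.ac.uk$f" \
              "$DST_BASE/$name"
      fi
  done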
Conclusions
- With the right network speed we can archive the DiRAC data to RAL
- The documentation has to be completed and shared with the system managers at the other DiRAC sites
- Each DiRAC site will have its own dirac0x account
- Start archiving, and keep on archiving
- Collaboration between DiRAC and GridPP/RAL DOES work!
- Can we aspire to more?