Grid Computing and EGEE
John Walsh and Brian Coghlan
Grid Ireland / e-INIS / EGEE
John.Walsh@cs.tcd.ie
www.eu-egee.org
EGEE III / Grid Ireland
EGEE and gLite are registered trademarks
Background
- Grid is an evolving and maturing architecture
- Based on several well-established computer science domains
  - Distributed Computing
  - Group and Role Management
  - Distributed Data Management
  - Public Key Encryption
What is a Grid Anyway?
- "Grid" is a highly overloaded and incorrectly used term
  - Peer-to-peer networks
  - Compute cycle donation systems (BLAST, SETI, etc.)
  - Distributed clusters over VPN
- Basic requirements
  - Facilitate virtual organisations across domain boundaries
  - Secure and trusted infrastructure
  - Single sign-on login
  - Resource sharing
    - Storage
    - Compute cycles
    - New media access (grid filesystems, GridSite Wiki)
  - Coordinated, long-term infrastructure
  - Policies for joining/leaving a Grid
Virtual Organisation Motivations
- Academic work often involves large and small collaborations
  - Some local or national
  - Others across international borders
- This gives rise to several challenges:
  - How do I share my data and resources with someone who doesn't work locally?
  - I don't personally know everyone in the collaboration: are they bona fide users?
  - How do I ensure that I can limit access to just a well-defined set of people and resources?
  - How do I do this at large scale, across national boundaries?
Non-Grid Solutions
- Basic solutions address some of these issues, e.g.:
  - SSH for login access to remote batch systems
  - Web with simple (but secure) password protection
  - FTP with password protection, scp...
- However, these are not scalable
- OpenID aims at single sign-on for users
Grid Solutions
Grids address these problems by providing:
- Internationally recognised identity credentials
  - Renewed annually
  - Must be revoked when the user leaves the institute
  - Ability to revoke a cert quickly (misuse, security, ...)
- Support for virtual organisations (Software + Service + Policies)
  - Software (middleware) to support controlled access to resources
  - Extra support for roles/groups within the VO
- Secure access to a larger pool of resources available to users
  - Subject to site policy
  - Storage + CPU (batch systems)
- Single sign-on support
  - Login once, use everywhere
  - Limited duration
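The "login once, use everywhere, limited duration" idea can be illustrated with a small sketch. This is not the grid middleware's credential code; the class name ShortLivedCredential, its fields, and the 12-hour lifetime are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ShortLivedCredential:
    """Hypothetical single sign-on credential with a fixed lifetime."""

    subject: str
    issued_at: datetime
    lifetime: timedelta

    def expires_at(self) -> datetime:
        return self.issued_at + self.lifetime

    def is_valid(self, now: datetime) -> bool:
        # Usable from the moment of login until the lifetime runs out.
        return self.issued_at <= now < self.expires_at()


# Assumed values: a user who logs on at 09:00 with a 12-hour credential.
login_time = datetime(2008, 5, 1, 9, 0)
cred = ShortLivedCredential("Example User", login_time, timedelta(hours=12))

still_ok = cred.is_valid(datetime(2008, 5, 1, 15, 0))   # mid-afternoon
expired = not cred.is_valid(datetime(2008, 5, 1, 21, 0))  # lifetime is over
```

The short lifetime limits the damage if the credential is copied, while the long-lived certificate itself stays protected.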
Certificate Authorities
- National credential services
  - Known as a CA (Certificate Authority)
  - See http://www.eugridpma.org
  - Often compared to a Passport Office
- User applies for a grid identity using a web browser
  - Certificate request
- User is then required to:
  - Present themselves to a local Regional Authority (RA)
  - Prove that they are who they say they are (formal ID of Institute)
- After approval, the user picks up their certificate with the same web browser
- Uses the established Public Key Infrastructure (PKI) X.509 standard
CA in Ireland
- Run by the Grid Ireland Operations Centre
  - Dr Brian Coghlan, Dr David O'Callaghan (TCD)
- Officially recognised by GridPMA
- Regional Authorities
  - Cork (Prof John Morrison, Brian Clayton, UCC)
  - Galway (Dr Andy Shearer, Dr Bruno Voisin, NUI Galway)
- To apply for a certificate, see: http://www.grid.ie/getting a cert.html
The International Dimension
- Many international grids
  - EGEE (EU 27, spans 50+ countries)
  - OSG, TeraGrid (USA)
  - NAREGI (Japan)
  - ARC (Nordic countries)
  - NGS (UK)
  - DEISA II (leading national supercomputing centres in EU)
  - PRACE (peta-scale computing; ICHEC is the Irish partner)
  - Int.EU.Grid, the interactive grid
    - TCD is a partner (Stuart Kenny, Active Security)
    - Finished in April
- Grid Interoperations/Interoperability demo at SC07
- These are mostly based around the Globus Toolkit
- Grid interoperability is a work in progress
EGEE III
- Well-established infrastructure
  - EDG (2001/04), EGEE I (2004/06), EGEE II (2006/08)
  - EGEE III (May 2008 to 2010), EGI from 2010
- EU-supported grid with a mandate to:
  - Run a European-wide production grid infrastructure
    - Provides compute and data storage for e-Science
    - More than 600 people involved in day-to-day management
    - Running 24/7/365, more than 150k jobs per day
    - 265+ sites, 55 countries, 11 federations
  - Build, test, certify and deploy quality grid middleware solutions
    - gLite middleware stack, developed by the project
  - Interface with other well-established grids
  - Encourage grid awareness and build new communities
  - Encourage new application domains on the grid
  - Establish and encourage grid technology in the business world
The EGEE Grid
Grid Ireland and EGEE
- Grid Ireland at TCD is the contractual partner with EGEE
- Responsible for a distributed Regional Operations Centre with UK partners (STFC, NGS + others)
  - Forms the UK/I Federation
- Operations team integrated into European Operations
  - Problem resolution system (GGUS, www.ggus.org)
  - Resolve problems at sites
  - Provide solutions and add to the knowledge base
  - Weekly reporting on operational issues
- 2 sites in Ireland (DIAS, TCD), 21+ in UK
- NGI for Ireland in EGI
EGEE Grid Architecture and Services
- Administrative
  - Global Grid User Support
  - Grid Operations Centre DB
  - Service Availability Monitoring
  - Core Infrastructure Centre
  - Certificate Authorities
  - Grid Training
- Access Services
  - User Interface
  - Grid logon
  - Grid Portals
- Core Services
  - Information Services: Top Level BDII, R-GMA Registry
  - Authentication and Authorisation: VOMS, MyProxy, CRL Service
  - Workload Management: WMS/LB
  - Data Management: LFC, FTS, AMGA
- Site Services
  - Job Management: GateKeeper/batch
  - Storage Management
  - Information Management
Abstract Architecture
Grid Site Internals
A Grid Site consists of:
- 1 or more Gatekeepers linked to the local batch system
- Batch system (including worker nodes)
- A Grid-enabled Storage Service
- A set of host information systems and a top-level site system
  - Each information system collects data about the status of the services running on its host
  - The site BDII collects and aggregates information about the available services
- R-GMA monitoring service
- Optional (Core) Services
  - Workload Management/Logging and Bookkeeping
  - MyProxy
  - File Transfer Service
  - ...
- UI (allowing a local user to sign on to the Grid)
Core Services
- Form the backbone of the Grid
- Provide the services needed for a user at Site A to send a job to, or access a data resource at, another site
- Maintained at one or more sites
- Most important services
  - VOMS (Authentication and Authorisation)
  - Workload Management Service (Job Management)
  - Information Management
  - Data Management
Information System Basics
- For a grid to function, the current status of resources needs to be known
  - e.g. job matchmaking uses this information when deciding where to send a job
- The amount of information gathered is very large
  - This is handled hierarchically
  - The Top Level BDII queries Site BDIIs from a list of known sites
  - This information is amalgamated from lower-level information systems
- Each data entry has a unique Distinguished Name
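The hierarchical amalgamation described above can be sketched in a few lines. This is only an illustration, not how a BDII is implemented: the function names, the attribute names, and the "service=...,site=..." naming format are all hypothetical.

```python
from typing import Dict

# Hypothetical record type: attribute name -> value for one published entry.
Entry = Dict[str, str]


def aggregate_site(site_name: str, services: Dict[str, Entry]) -> Dict[str, Entry]:
    """Site-level step: key every service entry by a name unique across the grid."""
    return {
        f"service={service},site={site_name}": dict(entry)
        for service, entry in services.items()
    }


def aggregate_top_level(sites: Dict[str, Dict[str, Entry]]) -> Dict[str, Entry]:
    """Top-level step: query each known site and merge the results into one view."""
    merged: Dict[str, Entry] = {}
    for site_name, services in sites.items():
        merged.update(aggregate_site(site_name, services))
    return merged


# Assumed example data for two sites.
known_sites = {
    "SiteA": {"ce": {"free_cpus": "12"}, "se": {"free_space_gb": "500"}},
    "SiteB": {"ce": {"free_cpus": "0"}},
}
top_level_view = aggregate_top_level(known_sites)
```

Because each key combines the service and the site, entries from different sites never collide when they are merged at the top level.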
Site BDII
- Gathers information from a list of known supported services at the site, e.g.
  - Compute Element
  - Storage Element
  - Top-level services supported by the site managers
    - Workload management system (WMS/LB, optional)
    - Credential renewal (MyProxy, optional)
    - File catalog (LFC, optional)
    - File transfer service (FTS, optional)
- The Top Level BDII pulls this status information and republishes it at regular intervals
  - Quasi-real-time status
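Why the published status is only "quasi-real-time" can be seen in a small sketch of pull-and-republish. The PeriodicCache class and the 120-second interval are assumptions for illustration; for simplicity the sketch refreshes on read rather than with a background timer.

```python
from typing import Callable, Dict, Optional


class PeriodicCache:
    """Hypothetical top-level cache that re-pulls site data at a fixed interval."""

    def __init__(self, pull: Callable[[], Dict[str, str]], interval_s: float) -> None:
        self._pull = pull
        self._interval_s = interval_s
        self._last_pull: Optional[float] = None
        self._snapshot: Dict[str, str] = {}

    def read(self, now_s: float) -> Dict[str, str]:
        """Return the published snapshot, refreshing it first if it is too old."""
        if self._last_pull is None or now_s - self._last_pull >= self._interval_s:
            self._snapshot = dict(self._pull())
            self._last_pull = now_s
        return self._snapshot


site_state = {"free_cpus": "10"}
cache = PeriodicCache(lambda: site_state, interval_s=120.0)

first = cache.read(now_s=0.0)     # first read triggers a pull
site_state["free_cpus"] = "3"     # the site changes between pulls
stale = cache.read(now_s=60.0)    # still the old snapshot
fresh = cache.read(now_s=120.0)   # interval elapsed, so the data is re-pulled
```

Consumers of the top-level view may therefore see information that is up to one interval old.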
Information Hierarchy
Managing Jobs
- Grid jobs are described using JDL (Job Description Language)
- A user can:
  - Match a job (as described by the JDL) against resources
    - 70+ primitives for matching (e.g. against available #CPUs, storage, site VO access restrictions, etc.)
  - Submit a job or jobs via WMS/LB
    - A submitted job returns a unique ID to the user
  - Query the status of a job
  - Retrieve data from finished jobs
  - Cancel submitted jobs
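Matching a job against resources amounts to filtering the published resource information by the job's requirements. The sketch below is a simplified stand-in for that step, not the WMS matchmaker or JDL syntax: the Resource and JobRequirements types, their fields, and the host names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Resource:
    """Hypothetical summary of what one compute resource publishes."""

    name: str
    free_cpus: int
    free_storage_gb: int
    allowed_vos: Set[str]


@dataclass
class JobRequirements:
    """Hypothetical subset of the constraints a job description could express."""

    vo: str
    min_cpus: int
    min_storage_gb: int


def match(job: JobRequirements, resources: List[Resource]) -> List[str]:
    """Return the names of resources that satisfy every requirement of the job."""
    return [
        r.name
        for r in resources
        if job.vo in r.allowed_vos
        and r.free_cpus >= job.min_cpus
        and r.free_storage_gb >= job.min_storage_gb
    ]


resources = [
    Resource("ce.site-a.example", 16, 200, {"cosmo", "solo"}),
    Resource("ce.site-b.example", 2, 1000, {"solo"}),
    Resource("ce.site-c.example", 64, 50, {"cosmo"}),
]
job = JobRequirements(vo="cosmo", min_cpus=8, min_storage_gb=100)
matches = match(job, resources)
```

Here only the first resource qualifies: the second does not admit the job's VO and the third has too little free storage.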
Job Workflow
What type of jobs can run on the Grid?
- Grid is not a replacement for HPC
- Grid is best suited to high-throughput computing
  - Short/medium lifetime jobs
    - Support for long-running jobs is available
  - Relatively independent tasks
    - Monte Carlo
    - Parameter sweeps
- Grid can handle
  - Large-volume batches of single jobs
  - Complex workflows
  - MPI jobs
    - MPI support led by Grid Ireland & Int.EU.Grid
    - However, cross-site MPI is not yet well supported, though it is improving
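A parameter sweep suits the grid because it expands into many independent tasks, each of which could be submitted as its own job. A minimal sketch of that expansion follows; the sweep function and the parameter names are illustrative assumptions, not part of any grid tool.

```python
from itertools import product
from typing import Dict, List, Sequence


def sweep(parameters: Dict[str, Sequence]) -> List[Dict[str, object]]:
    """Expand named parameter ranges into one independent task per combination."""
    names = list(parameters)
    value_lists = [parameters[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


# Assumed example: 3 temperatures x 2 random seeds = 6 independent tasks.
tasks = sweep({"temperature": [100, 200, 300], "seed": [1, 2]})
```

No task depends on the result of another, so the tasks can run at different sites and finish in any order.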
Infrastructure Statistics
- Number of sites steadily increasing (Jan 2009 = 265)
- CPU + storage count increasing (very dynamic)
- More than 49 million jobs have been run in the last year
  - Continuous increase observed
  - Doubling total for first year (18.9 million jobs for 12 months)
- Number of jobs run by non-LHC or Ops VOs also increasing, with almost 20,000 jobs each day around October 2007
- More and more VOs make significant use of the infrastructure
  - As of March 2008 there are 115 active VOs, more than 40 having used over 1 CPU-year per week over extended periods during the first nine months of 2007
EGEE Site Growth
- EGEE site growth (to October 2007)
- May 2008: production sites = 255
No. of Jobs on EGEE Infrastructure
- Job data is generated over a period of days
- Drop-off is due to lack of accounting data during that period
CPU/Cores
- May 2008: approx. 68,000 cores/CPUs
Grid Communities and Applications
What communities could I get involved with?
- LHC HEP community is well established
  - ALICE, ATLAS, CMS, LHCb
- Smaller HEP VOs (BaBar, PhenoGrid, ...)
- Astronomy and Astrophysics
  - CosmoGrid
  - Planck: process Planck satellite data
  - MAGIC: origin of VHE gamma rays
- Earth Science
  - DEGREE: seismic sensor network
  - EGEODE: seismic data processing
  - ESR: Earth observation, climate, solid Earth physics
- FUSION: nuclear fusion applications
- BIOMED: medical imaging, bioinformatics, drug discovery
- Finance, Infrastructure, Geophysics, Comp Chem...
Top 10 Accounting
- Top 10 VO accounting for the last 12 months
No such thing as a free lunch
- Virtual Organisations
  - Provide resources
  - Provide application developers to Gridify software
- EGEE & Grid Ireland do not have infinite manpower
- EGEE doesn't provide resources; the sites do
- Resource-sharing principle
  - VOs should contribute as much as they use
Getting Involved
- Web site: http://www.grid.ie
- All e-mail enquiries to grid ireland help@cs.tcd.ie
  - Request Grid Ireland training and grid porting help
- Check to see if an existing VO is appropriate to you
  - http://cic.gridops.org/index.php?section=home&page=volist
  - Contact the VO manager (see VO ID card)
- Is a new VO needed?
- Enable your site/batch system on the Grid
Future
- e-INIS awarded 12 million under PRTLI 4
  - DIAS (lead), UCC, HEAnet, NUI Galway, NUIM, ICHEC, TCD, Grid Ireland
  - Building an integrated e-Infrastructure in Ireland
  - HPC + DATA + GRID = e-Infrastructure
  - Building on the success of CosmoGrid
- Resource improvements (TCD)
  - Current WN pool increased to 96 Quad Cores (768 cores)
  - Made possible under a grant from SFI
- Grid Ireland Central Services will be upgraded
  - Facilitated by e-INIS
- User Controlled Light Paths (HEAnet)
  - Allowing sites to have dedicated connections over the fibre optic network
Questions?
Grid Ireland
- Facilitate virtual organisations in Ireland (and abroad)
- Provide a point of presence at institutes
  - Machines + middleware providing grid access = Gateways
- Manage grid infrastructure
- Encourage institutes and communities to provide/share resources
- Manage grid operations in Ireland
- Basic user training
- Participate in the international grid communities
  - Working groups and forums
  - Help define standards
- Develop innovative grid middleware solutions
- 18 sites + national services centrally managed
Grid Gateway
- Components: network switch, Gridfw (firewall), Gridinstall (Quattor), Gridgate (CE), Gridstore (SE), Gridmon (test WN), Gridui (UI), Gridnm (NM), UPS
- Grid Gateway:
  - All virtual machines
  - All run on 1 physical machine
  - Remotely managed by OpsCentre
- Cluster(s):
  - Managed by local admins
  - OpsCentre supports integration
  - Various config & install options
Grid Ireland Innovations
- Transactional Deployment with rollback
  - Distribute and configure all Grid Ireland gateways
- Grid Virtualisation
  - Sites are instances of VMs
  - VMs used in Grid Ireland testbeds
    - Testing and certification
  - VMs used in e-Learning testbed
  - Non-trivial networking in testbeds
    - Allows testing of real-world setup
- RemotePBS
  - JobManager allowing submission to non-grid batch systems
- GridFS (Grid FileSystem)
  - R/W access to remote filesystems using grid credentials
Current VOs in Ireland
- CosmoGrid: Grid-enabled Computational Physics of Natural Phenomena
- WebCom-G: middleware stack supporting complicated workflows; aims to hide the complexity of the Grid
- Gene: bioinformatics
- Solo VO: a catch-all VO for new users who would like to try the Grid
Grid Ireland Successes
- EGEE III SA3 Porting Coordinator (Eamonn Kenny)
  - Producing ports for CentOS 5, SUSE SLES 10 + others
- Secretary of OGF e-Learning WG (Kathryn Cassidy)
- Chair of EGEE MPI WG (Stephen Childs)
  - Co-author of EGEE technical report on EGEE MPI integration
  - Closer integration with Int.EU.Grid
- AuthZ integration into EGEE R-GMA (Stuart Kenny)
- VM/VNet work being followed by others
- Quattor Working Group: fabric management
- Deployment of 36TB of data storage (April 2008)
- Deployment of 768 cores in TCD (Dec 2007)
End User Support
- Once a user has a cert, they can be associated with an existing virtual organisation
- Facilitate establishment of international VOs
- Help VOs port applications to the grid
- User training
  - Adaptive e-Learning
    - Developed by Kathryn Cassidy
    - Will be used for training sessions tomorrow
  - Introductory and advanced courses
    - Grid introduction, site integration, MPI on grid
- The EGEE GILDA testbed
  - https://gilda.ct.infn.it/
  - Open to everyone to try the grid for themselves
Grid Ireland Infrastructure
- Central services
- Grid Ireland Sites
- Clusters