Centralized bookkeeping and monitoring in ALICE
Roberto Barbera
CHEP 2000, 10.02.2000
INFN GRID WP6, 24.07.2001
ALICE and the GRID
Phase I: AliRoot production
[Figure labels: The GRID; Powered by ROOT]
How did we get there?
- Automatic Linux installation tool with configurable post-installation. Tested on the new farm at INFN Catania: a node goes from out of the box to ready to run in 15 minutes.
- Local resource monitoring system with a web interface based on MRTG (http://alipc1.ct.infn.it/mrtg/monitoring.html), installed at all sites.
- Network latency (RTT) monitor with a web interface, installed at the production coordination site (http://alipc1.ct.infn.it/mrtg/netmon.html).
- Root/AliRoot automatic installation/upgrade toolkit (via both CVS and tarball), distributed to all sites. It automatically sets up the environment to run AliRoot with Globus.
- Web portal based on XML technology, developed in collaboration with NICE s.r.l., for user authentication (via Globus GSI) and job submission (via Globus GRAM): http://gridct1.ct.infn.it/globus.
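The network latency monitor mentioned on this slide is MRTG-based and its implementation is not shown. As a rough illustration of the quantity it tracks, the sketch below times TCP connection setup to a host as a stand-in for the round-trip time. The function names `tcp_rtt_ms` and `summarize` are hypothetical, and connection setup time is only an approximation of what a real latency monitor would record.

```python
import socket
import statistics
import time


def tcp_rtt_ms(host, port, samples=5, timeout=2.0):
    """Time TCP connection setup to (host, port); return a list of milliseconds.

    Assumption: connection setup time is used as a proxy for network
    round-trip time. This is not how the MRTG-based monitor works.
    """
    rtts = []
    for _ in range(samples):
        start = time.perf_counter()
        # Open and immediately close a connection; only the setup is timed.
        with socket.create_connection((host, port), timeout=timeout):
            pass
        rtts.append((time.perf_counter() - start) * 1000.0)
    return rtts


def summarize(rtts):
    """Reduce a list of samples to the values a monitoring page might plot."""
    return {
        "min": min(rtts),
        "median": statistics.median(rtts),
        "max": max(rtts),
    }
```

A coordination site could call `summarize(tcp_rtt_ms(host, port))` periodically for each production site and store the result for plotting; the host and port to probe are left to the caller.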
The ALICE testbed for Phase I
[Map of the test sites; recoverable labels: Lyon, OSU/C, Mexico City, Merida]
Dipartimento di Fisica dell'Università di Catania and INFN Catania - Italy
Production test layout for Phase I
[Diagram; recoverable labels:]
- Sites: Catania, test site
- Roles: "I'm the PPR production manager", "I'm the local surveyor"
- Components: Globus, EnginFrame, Linux farm + MRTG monitor, disk pool, batch surveyor
- Run: 1 week, ~200 events, 300+ GB
Utilities
[Screenshots; recoverable labels:]
- Bookkeeping system
- Web interface to log in to the Grid
- CPU load, disk space, network availability
- Web interface for job submission
- LDAP server for ALICE (only in Italy)
- Annotation: "Only at Lyon"
What did we learn? (1)
Pros:
- We are able to successfully manage certificates from different CAs (INFN, CNRS, Globus).
- The Root/AliRoot installation toolkit (Torino) works nicely at all sites (ALICE's and WP6's).
- Many different job managers have been tested: Condor, LSF, PBS and BQS (special interface to Globus realized at CCIN2P3 Lyon).
- The web interface EnginFrame is interfaced not only with the Globus GRAM and GSI but also with the local monitoring systems and with the presently available information service.
- A geographically distributed AliRoot production can be centrally managed.
- Produced data was actually used by physics analysis (TOF group in Bologna).
What did we learn? (2)
Cons:
- The output and error files do not fly back to the submitting machine if a job manager other than fork is used with the Globus commands.
- The absence of a centralized bookkeeping system that could also act as a job monitor was the most critical issue.
- There was no automatic resource broker and no wide-area workload management.
- There was no direct interaction between Root/AliRoot and the GRID services.
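To make the missing piece concrete, here is a minimal sketch of a centralized bookkeeping table that doubles as a job monitor. It is an illustration only: it uses Python's built-in `sqlite3` module rather than any system from the slides, and the table layout, the status values and the function names (`open_bookkeeping`, `report`, `summary`, `events_done`) are all assumptions.

```python
import sqlite3
import time

# Hypothetical schema: one row per job, updated in place as the job progresses.
SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    job_id  TEXT PRIMARY KEY,
    site    TEXT NOT NULL,
    status  TEXT NOT NULL,
    events  INTEGER NOT NULL DEFAULT 0,
    updated REAL NOT NULL
)
"""


def open_bookkeeping(path=":memory:"):
    """Open (or create) the central bookkeeping database."""
    db = sqlite3.connect(path)
    db.execute(SCHEMA)
    return db


def report(db, job_id, site, status, events=0):
    """Insert or overwrite the record for one job; called whenever a job changes state."""
    db.execute(
        "INSERT OR REPLACE INTO jobs (job_id, site, status, events, updated) "
        "VALUES (?, ?, ?, ?, ?)",
        (job_id, site, status, events, time.time()),
    )
    db.commit()


def summary(db):
    """Return {site: {status: job count}}, the view a monitoring page would show."""
    out = {}
    query = "SELECT site, status, COUNT(*) FROM jobs GROUP BY site, status"
    for site, status, count in db.execute(query):
        out.setdefault(site, {})[status] = count
    return out


def events_done(db):
    """Total number of events produced by jobs marked 'done'."""
    row = db.execute(
        "SELECT COALESCE(SUM(events), 0) FROM jobs WHERE status = 'done'"
    ).fetchone()
    return row[0]
```

Keeping a single row per job, overwritten on each state change, lets the same table answer both bookkeeping questions (how many events exist) and monitoring questions (what is running where).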
ALICE and the GRID
Phase II: Reconstruction and analysis
[Figure labels: The GRID]
The ALICE testbed for Phase II
Production test layout for Phase II
[Diagram; recoverable labels:]
- Sites: manager's site, production site, anywhere
- Roles: "I'm the PPR production manager", "I'm the local site manager", "I'm the impatient ALICE user checking the availability of events"
- Components: Globus, EnginFrame, Linux farm + MRTG monitor, disk pool, tape pool (CASTOR/HPSS), WWW/Carrot, MySQL, PHP, Bypass, run DB, cron, mirror DB
More integration: Grid services directly addressed from within Root
- TAuthorization (P. Malzacher): interface between Root and the Globus GSI service.
- TLDAP (P. Malzacher): interface between Root and the Globus GIS service, based on LDAP directories.
- TPServer, TPServerSocket, TPSocket, TFTP, rootd (F. Rademakers): parallel socket transfer using TCP (files and objects).
- RootFTP (F. Rademakers): file transfer utility which uses parallel sockets.
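The idea behind parallel socket transfer can be sketched in a few lines: split the payload into stripes, send each stripe over its own stream concurrently, and reassemble on the receiving side. The sketch below is not the Root implementation and does not reproduce its API; it uses local `socket.socketpair()` connections as a stand-in for TCP streams between two hosts, and all function names are hypothetical.

```python
import socket
import threading


def split_stripes(data, n):
    """Split data into n contiguous stripes of nearly equal size."""
    base, extra = divmod(len(data), n)
    stripes, pos = [], 0
    for i in range(n):
        size = base + (1 if i < extra else 0)
        stripes.append(data[pos:pos + size])
        pos += size
    return stripes


def _send(sock, stripe):
    # Closing the socket after sendall marks the end of this stripe.
    with sock:
        sock.sendall(stripe)


def _recv(sock, out, index):
    # Read until the sender closes its end, then store the stripe by index.
    parts = []
    with sock:
        while True:
            buf = sock.recv(65536)
            if not buf:
                break
            parts.append(buf)
    out[index] = b"".join(parts)


def parallel_transfer(data, n_streams=4):
    """Send data over n_streams concurrent streams and return the reassembled bytes.

    Assumption: socketpair() stands in for one TCP connection per stream.
    """
    stripes = split_stripes(data, n_streams)
    received = [b""] * n_streams
    threads = []
    for i, stripe in enumerate(stripes):
        tx, rx = socket.socketpair()
        threads.append(threading.Thread(target=_send, args=(tx, stripe)))
        threads.append(threading.Thread(target=_recv, args=(rx, received, i)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Stripes are contiguous, so concatenating by index restores the original.
    return b"".join(received)
```

Over a wide-area link, several streams can make better use of the available bandwidth than a single TCP connection whose window limits throughput; on a local socket pair the sketch only demonstrates the striping and reassembly logic.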
Next Root developments
- Interface between Root and the DataGRID WP2 middleware (GDMP APIs).
- PROOF can use the Grid File Catalogue and Replication Manager to map LFNs to a chain of RFNs.
- Interface between PROOF and the GRID Resource Broker to detect which nodes in a cluster can be used in the parallel session (use of TLDAP for resource discovery from the GRID information service(s)).
- Interface between TFTP/PROOF and the Globus GSI services via TAuthorization. Comparison with GridFTP.
- Interface between PROOF and the Grid Monitoring Services.
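The mapping of LFNs to a chain of RFNs can be illustrated with a toy replica catalogue. This is a sketch under stated assumptions, not the DataGRID or PROOF interface: the class `ReplicaCatalogue`, the function `build_chain`, the `lfn:` naming and the `example.org` hosts are all hypothetical, and the replica choice (prefer a given host, else take the first registered replica) is a placeholder policy.

```python
from collections import defaultdict


class ReplicaCatalogue:
    """Toy catalogue mapping one logical file name (LFN) to its replicas (RFNs)."""

    def __init__(self):
        self._replicas = defaultdict(list)

    def register(self, lfn, rfn):
        """Record that rfn is a replica of lfn; duplicates are ignored."""
        if rfn not in self._replicas[lfn]:
            self._replicas[lfn].append(rfn)

    def lookup(self, lfn):
        """Return a copy of the replica list for lfn (empty if unknown)."""
        return list(self._replicas.get(lfn, []))


def build_chain(catalogue, lfns, preferred_host=None):
    """Resolve each LFN to one RFN, returning (chain, missing).

    A replica whose name contains preferred_host wins; otherwise the first
    registered replica is used. LFNs without any replica are reported in
    missing rather than silently dropped.
    """
    chain, missing = [], []
    for lfn in lfns:
        replicas = catalogue.lookup(lfn)
        if not replicas:
            missing.append(lfn)
            continue
        local = [r for r in replicas if preferred_host and preferred_host in r]
        chain.append(local[0] if local else replicas[0])
    return chain, missing
```

The resulting list of RFNs is what an analysis session would then open in sequence; a real replication manager would choose replicas using cost or proximity information rather than a substring match.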
DataGrid & Root
[Workflow diagram; recoverable labels, in extraction order:]
selection parameters; TAG DB; selected events; Root RDB; LFN; #hits; Grid RB; output LFNs; Grid log & monitor; PROOF loop; Grid replica manager; Grid authenticate; spawn PROOF tasks; best places; Grid perf mon; Grid cost evaluator; Grid MDS; Grid perf log; Grid replica catalog; update Root RDB; send results back
Internal milestones for Phase II
- 6/2001: List of ALICE users (& certificates) distributed to the test sites (ftp://alipc1.ct.infn.it/pub/grid/test).
- 7/2001: Distributed production/reconstruction test with a centralized bookkeeping system and Bypass. As many sites involved as possible.
- 8/2001: Distributed analysis test with PROOF.
- 9/2001: Test with/on DataGRID WP6 resources (new version of the installation toolkit with the new scripts).
- 12/2001: First results of tests of the DataGRID PM9 middleware release.