Ralf Gerhards presented detailed plans to install and test the LSF batch system on the current H1 PC-Farm and make it available to H1 users this summer.


1 Summary of L45 upgrade meeting
Alan Campbell

Today a small group discussed ideas on the use of PC farms in H1 for batch processing and online filtering, in the light of a second PC-Farm for which we already have funding.

Ralf Gerhards presented detailed plans to install and test the LSF batch system on the current H1 PC-Farm and make it available to H1 users this summer.

Alan Campbell presented the attached slides. He started to formulate changes to the standard H1 program framework that would enable user jobs to use parallel processing. In addition, some technologies which may be useful in its implementation were mentioned.

It was felt that before starting coding we should invite all H1 colleagues (and friends) to give comments, ideas and suggestions, and invite participation.

Remember: PC-Farm is the future of computing in HEP
Overall Goal: Bring this future to H1
Requirements:
  Offline computing - datafile access, batch
  Online computing - high reliability, guaranteed performance

2 Alan Campbell
A new commodity PC farm for H1 (using Linux)

The purpose of this talk/document is to initiate discussion on project goals and to ask H1 (and others) for ideas.

Remember: PC-Farm is the future of computing in HEP
Overall Goal: Bring this future to H1
Requirements:
  Offline computing - datafile access, batch
  Online (L45) computing - high reliability, guaranteed performance

"Call for ideas"
"Request for Concepts"
"BoF Birds of Feather Session"

To start this discussion I will present some initial thoughts on the following:
  An extension to H1 application framework (BOS/FSEQR/module steering) to
  Network communication software and hardware topology

3 An extension to H1 application framework (BOS/FSEQR/module steering) to
Recall: Current H1 framework

BEGJOB
  open first input file
  open output file
  open ntuples
  book histograms
  printout
BEGRUN
  access database
  printout
FSEQR/W loop (Data File In, Data File Out)
  OTHDAT
  REVENT
    CALL MODULE1
    CALL MODULE2...
    printout
  open following input file
  open following output file
ENDRUN
  printout
ENDJOB
  close ntuple file
  print histograms
  printout
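The steering sequence recalled above can be illustrated with a minimal sketch. This is not the H1 code (which is built on BOS/FSEQR); the function `run_job`, the `(run, event)` tuples and the tracer module are hypothetical names chosen for illustration, and only the flag names BEGJOB, BEGRUN, REVENT, ENDRUN and ENDJOB come from the slide.

```python
def run_job(events, modules):
    """Call every module once per steering stage, in the order of the slide.

    events  : iterable of (run, event) tuples, assumed to be ordered by run
    modules : callables taking (flag, record); a stand-in for CALL MODULE1, ...
    """
    def steer(flag, record=None):
        for module in modules:
            module(flag, record)

    steer("BEGJOB")                      # open files, book histograms
    current_run = None
    for run, event in events:            # the FSEQR-style event loop
        if run != current_run:
            if current_run is not None:
                steer("ENDRUN", current_run)
            current_run = run
            steer("BEGRUN", run)         # access database for the new run
        steer("REVENT", (run, event))    # event processing
    if current_run is not None:
        steer("ENDRUN", current_run)
    steer("ENDJOB")                      # close ntuples, print histograms


def make_tracer(log):
    """Hypothetical module that only records the stages it is called for."""
    def module(flag, record):
        log.append((flag, record))
    return module


log = []
run_job([(1, 10), (1, 11), (2, 5)], [make_tracer(log)])
for entry in log:
    print(entry)
```

The point of the sketch is that user modules only ever see flags and records, which is what makes it possible to change where the records come from without touching the modules.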

4 An extension to H1 application framework (BOS/FSEQR/module steering) to
Recall: Current L5/PC Farm framework

Same as "current H1 framework" except that the main input/output is not a data file but a shared memory event pool.
Simple to run the "current H1 framework", as only event in/out is altered.

But:
  multiple database connections
  multiple printout files
  multiple histogram files

Extra infrastructure required:
  combine histogram files
  feed events to / from shared memory pool
  synchronise runs / sort event numbers

The single shared memory pool is currently a limitation: all input and output must go via a single machine.

5 An extension to H1 application framework (BOS/FSEQR/module steering) to
Recall: Current L4 Farm framework

FSEQR input stream is fed not only with events but also with database records, end-of-run marker records, runstart record subset, pre-event records.

FSEQR output stream contains not only RUNEVENT records but also:
  end-of-run marker records (synchronise at run change)
  RUNDATA records (-> TCL job)
  SPECIAL stream records
  histogram records

Dummy NDB routine... all database banks automatically present before event processing
ALL modules are called with BEGRUN true, REVENT false to book histograms
Required modules are called with REVENT true for processing

No multiple database connections
No multiple histogram files
Multiple parallel input streams
Output stream independent of input

Extra infrastructure required:
  database reading (mdbquefpack)
  combine histograms and write to database (TCL job)
  output file writing and dumping (datalog job)

6 An extension to H1 application framework (BOS/FSEQR/module steering) to
Proposal for new metacomputer-enabled H1 framework

Purpose: Allow all H1 jobs to efficiently use many computers (including multi-processor computers)

Goal: minimal changes to the current scheme, so that both h1rec modules and user analysis programs can be easily modified

Remove completely the "specialness" of the L4, L5, Reprocessing and H1 batch frameworks
  -> all jobs run in one scheme
  -> anyone can maintain
  -> developments benefit everyone

Possible/proposed implementation:

Single user application steered by logical flags as at present, ie BEGJOB, BEGRUN, REVENT... These flags are set by the new application framework according to the records on the input datastream.

This application will contain all user code and may be run as a single process as at present -OR- on a metacomputer. In the latter case the user process will run simultaneously on many computers, and a harness will provide for the data transfer between the processes.

New flags are introduced to cause the user application to perform additional tasks, ie

  DBBJOB   Fetch job dependent database banks
  DBBRUN   Fetch run dependent database banks
  HISJOB   Book job dependent histograms
  HISRUN   Book run dependent histograms
  TCLANA   Analyse RUNDATA record
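As a rough illustration of "flags set according to the records on the input datastream", the sketch below maps record types to the logical flags seen by the user application. The flag names are those of the slide; the record type names (`job_start`, `run_start`, ...), the mapping itself and the functions `flags_for` and `process_stream` are assumptions made for this example, not part of any existing H1 library. In particular, which of the new flags the harness would raise together is a design decision the slide leaves open.

```python
# Hypothetical mapping from record type on the input stream to active flags.
RECORD_TO_FLAGS = {
    "job_start": ("BEGJOB", "DBBJOB", "HISJOB"),
    "run_start": ("BEGRUN", "DBBRUN", "HISRUN"),
    "event":     ("REVENT",),
    "rundata":   ("TCLANA",),
    "run_end":   ("ENDRUN",),
    "job_end":   ("ENDJOB",),
}

ALL_FLAGS = sorted({flag for flags in RECORD_TO_FLAGS.values() for flag in flags})


def flags_for(record_type):
    """Return the full set of logical flags, True only for the active ones."""
    active = RECORD_TO_FLAGS[record_type]
    return {name: name in active for name in ALL_FLAGS}


def process_stream(records, user_application):
    """Drive the user application purely from the records it receives."""
    for record_type, payload in records:
        user_application(flags_for(record_type), payload)


def example_application(flags, payload):
    """Stand-in for user code: print which flags are set for each record."""
    active = [name for name, is_set in flags.items() if is_set]
    print(active, payload)


process_stream(
    [("job_start", None), ("run_start", 1), ("event", (1, 10)),
     ("rundata", "run 1 summary"), ("run_end", 1), ("job_end", None)],
    example_application,
)
```

Because the application reacts only to flags, the same executable could be run as a single process reading a file or as one of many copies fed by a harness.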

7 An extension to H1 application framework (BOS/FSEQR/module steering) to

Hence the functionality of the current "special tasks", eg mdbquefpack, TCL job, is implemented within the user application
  -> no need for these special tasks
  -> every user job has access to this functionality

  histogram collection
  ntuple collection
  printout collection
are to be implemented within the user application with the aid of new h1 library code.

histogramming: The harness will supply data on the input stream which causes the "event" loop histogram calls either to fill local histograms and send them on the output stream at ENDRUN (or ENDJOB) -OR- to send on the output stream the data from the histogram filling calls at the end of each event processing. The harness will care for the transfer of these data to a copy of the user process, so that this copy sums histograms -OR- makes the actual histogram filling, and is informed of end-run and end-job conditions, allowing histogram files to be written on a run and job basis.

ntuples: The harness will send on the output stream the n-tuple entry calls made during event processing to an incarnation of the user process, so that an ntuple file may be written.

printout: The new framework will collect all printout made at each stage of processing, ie BEGJOB, REVENT event processing etc, label it with run/event number and timestamp, and send this data on the output stream... to an incarnation of the user process which will write a single printout file.

database input: database input information (eg TCL job) may be written by the user process during event processing, and the harness will feed all such records... to an incarnation of the user process which will then have flag TCLANA set, instructing it to analyse this data.
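The two histogramming modes described above should give the same result, which a small sketch can make concrete. Everything here is illustrative: histograms are plain lists of bin counts, and `fill`, `merge_histograms` and `replay_fills` are hypothetical helpers, not calls from the H1 or any histogramming library.

```python
def fill(bins, lo, hi, x):
    """Increment the bin containing x; values outside [lo, hi) are ignored."""
    if lo <= x < hi:
        index = int((x - lo) / (hi - lo) * len(bins))
        bins[min(index, len(bins) - 1)] += 1


def merge_histograms(worker_histograms):
    """Mode 1: workers fill locally, the collecting copy sums bin by bin."""
    total = {}
    for histograms in worker_histograms:
        for name, bins in histograms.items():
            if name not in total:
                total[name] = list(bins)
            elif len(bins) != len(total[name]):
                raise ValueError(f"binning mismatch for histogram {name!r}")
            else:
                total[name] = [a + b for a, b in zip(total[name], bins)]
    return total


def replay_fills(fill_calls, nbins, lo, hi):
    """Mode 2: workers forward fill calls, the collecting copy does the filling."""
    total = {}
    for name, x in fill_calls:
        fill(total.setdefault(name, [0] * nbins), lo, hi, x)
    return total


# Two workers processing different events of the same run (made-up values).
worker_values = [[0.5, 1.5, 1.7], [2.5, 0.1]]

local = []
for values in worker_values:
    bins = [0, 0, 0]
    for x in values:
        fill(bins, 0.0, 3.0, x)
    local.append({"energy": bins})

summed = merge_histograms(local)
replayed = replay_fills(
    [("energy", x) for values in worker_values for x in values], 3, 0.0, 3.0
)
print(summed, replayed)
```

Summing at ENDRUN sends far less data, while forwarding fill calls keeps the collecting copy up to date after every event; the choice between them is a trade-off between network traffic and latency.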

8 An extension to H1 application framework (BOS/FSEQR/module steering) to
The harness

The current L4 system contains what we can consider a prototype harness:

  diski    - disk input
  neto     - transfer datastream to processing computers via network, adding end-of-run markers
  neti     - receive datastream from network
  nodi     - transfer datastream to user processes on this computer, adding database records and event steering records (ie random number seed, event duplicate)
  nodo     - collect output from user processes on this computer, synchronising on end-of-run
  logo     - output this stream to network
  receiver - collect the output streams from all computers
  logging  - disk output, dump job submission, tcl job submission
  master   - starts some of the tasks (receiver, logo)

These tasks use PVM for data transfer (using TCP) and shared memory for communication with the user application.

This harness runs on Lynx-OS/irix and, apart from the move to Linux, has several missing features which will be needed by a general user job harness, ie

Resource allocation and process management: on which machines should the job be run (currently hard coded in master and in configuration files for h1l4iox machines) - automatic choice according to data file location, network capabilities and system loading

Security: user authentication - we don't want to have to install all user accounts on all machines

Health and Status: monitoring of health and status of system
  PVM daemon is per user OR per system -> interference between different user jobs
  PVM master daemon runs on a single system -> may become a bottleneck

Data transfer speed/efficiency: TCP/IP flow control may not be appropriate and has high overhead, especially for just-round-the-corner high speed commodity networks, eg gigabit ethernet. PVM introduces further data buffering (and hence copying), reducing performance.

Executable management: construction, caching and location of executables - fast distribution of tasks to all machines in user job

Remote data access: fpack causes file staging - but no pre-staging, automatic temporary output file creation and migration on close.
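The "synchronising on end-of-run" step performed by a task like nodo can be sketched as follows. This is only an illustration of the idea under simple assumptions: records are `(kind, run, payload)` tuples, every user process reports an `end_of_run` record for each run, and the class name `EndOfRunSynchroniser` is hypothetical. It does not reflect the actual PVM or shared memory interfaces of the L4 harness.

```python
class EndOfRunSynchroniser:
    """Forward records from several user processes, but emit a single
    end-of-run marker only once every process has finished that run."""

    def __init__(self, n_processes):
        self.n_processes = n_processes
        self.finished = {}   # run number -> set of process ids that reported end-of-run
        self.output = []     # stands in for the outgoing datastream

    def receive(self, process_id, record):
        kind, run, _payload = record
        if kind != "end_of_run":
            self.output.append(record)           # events, histograms, printout...
            return
        done = self.finished.setdefault(run, set())
        done.add(process_id)
        if len(done) == self.n_processes:        # last process has reached run end
            del self.finished[run]
            self.output.append(("end_of_run", run, None))


sync = EndOfRunSynchroniser(n_processes=2)
sync.receive(0, ("event", 1, "a"))
sync.receive(0, ("end_of_run", 1, None))   # held back: process 1 still busy
sync.receive(1, ("event", 1, "b"))
sync.receive(1, ("end_of_run", 1, None))   # now the marker is released
for record in sync.output:
    print(record)
```

Downstream tasks can then rely on the marker meaning that all output for the run has been seen, which is what allows histogram and printout files to be closed per run.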

9 Tigran and I have started to look at what other people are doing, and we hope YOU will inform us of systems from which we can learn (campbell@desy.de). Here are some projects I think we can learn from (or make use of). Most are research projects and/or work in progress.

How should we implement the new harness for PC-Linux?

Data transfer software API (application programmers interface):

  PVM (Parallel Virtual Machine)

  MPI (Message Passing Interface): may be better than PVM in performance and longer-term support. Various implementations are available for Linux which also provide some of the features missing in MPI compared with PVM.

  Nexus library (part of the Globus computing grid project)

  VIA (Virtual Interface Architecture), specifically M-VIA for Linux and the upcoming MPI for M-VIA. Provides lower latency than TCP even for fast ethernet, and much higher bandwidth for faster networks. May be the best approach for our system.

  Sockets

  Shared memory: the current harness uses shared memory for intra-machine communication... should we use this approach on Linux or move to eg PVM with UNIX domain sockets or M-VIA?

10 How should we implement the new harness for PC-Linux?
Resource allocation / Security /...

Maybe we can make use of components of the Globus project - the GUSTO (Globus Ubiquitous Supercomputing Testbed) forms a grid of 3600 processors located at 17 sites worldwide (2 Tflop/s).

LSF: commercial local load sharing facility (maybe too expensive - RAL)

MOSIX for Linux: allows process migration, machine load balancing

For the L45 application alone, a static configuration file approach (as used by H1 Monte-Carlo or the L4 setup on Lynx-OS) may be sufficient initially.

Condor project: load balancing, checkpointing, process migration

Easy-LL (IBM LoadLeveler): but is it available for Linux?

11 How should we implement the hardware and network?

The nicest PC-Farm setup I found is Sarnoff Research Centre's Cyclone cluster :8000/docs/metacomputing.html Please look at the pictures. The configuration is:

  128 dual processor nodes (64MB, 3.2GB disk, CDROM, 2 SMC 100Mbit ethernet cards, NO floppy drives, NO graphics cards)
  Each node is connected to 2 3COM3c port fast ethernet switches, making a public network (for NFS..) and a private network (for application traffic only).
  The private network uplinks via fibre to a gigabit ethernet switch, forming a "fat tree" network.
  10 boot CDROMs enable complete reinstallation of all software in 80 minutes.
  FreeBSD is deployed in addition to Linux.

Anything faster than 100Mbit ethernet is currently still expensive, eg myrinet, gigabit ethernet, pvic.

For the link to North Hall we need more bandwidth than the single FDDI link currently available. Unfortunately it looks like a gigabit ethernet uplink between bridges in North Hall and the DESY computer centre is not feasible because of the large distance and the special cable which would have to be installed.

Perhaps we should consider replacing FDDI by Optical Fast Ethernet between the h1l4iox machines and a fast ethernet switch in the computer centre. It has to be investigated and tested whether the current FDDI cables can be used for this.

Otherwise we may have to at least install a second FDDI ring North Hall - computer centre and network hardware with both FDDI and Fast- or Gigabit ethernet (also expensive)
-OR- we have to maintain the current L4 system for event rejection and live with a single FDDI connection to the computer centre (but this partially destroys the mission of this project)
-OR- we have to find space for the whole installation in the North Hall and live with very limited bandwidth to the computer centre
-OR-?... we should discuss these matters with the computer centre soon.
Addendum: During the meeting Ralf reported that it is now clear that many fibres suitable for optical fast ethernet are available between Hall North and the DESY computer centre. The suggestion is to install CISCO fast ethernet switches at both ends of these fibres (say 3 or 4 links between the switches) and to use the rest of the ports on the North Hall side to connect the h1l4iox machines, on which we will replace the FDDI interfaces by fast ethernet.

12 How should we proceed now?

I propose to announce and publish these slides on the H1 WWW and hope for some comments/suggestions from our colleagues.

I propose that everyone who wishes to actively participate (ie write code, test code, investigate commercial and public domain software, discuss, test networks, care for PC and/or network hardware, care for Linux installation) should consider this now and inform me and Ralf Gerhards of the area on which they would like to work and how much time they may have available.

All people associated with H1 are welcome to participate. Ralf and I will then organise a meeting of these volunteers.