Resource management and runtime environments in the exascale computing era

1 Resource management and runtime environments in the exascale computing era
Guillaume Huard
MOAIS and MESCAL INRIA Projects, CNRS, LIG Laboratory, Grenoble University, France

2 Introduction
Large-scale platforms have become a reality: using grids, parallel applications can run on thousands of processor cores.
Two main models for grids:
- Structured: lightweight grids
- Unstructured: P2P overlay grids
This talk:
- High-performance computation on structured grids
- Anticipation of their evolution when growing to exaflop-range computing power

3 Structured approach
Computing centers interconnect their clusters: lightweight grids
- Hierarchical structure
  - Clusters of homogeneous resources
  - Network and CPU disparity only among distinct clusters
- Reasonable reliability
  - Unavailability usually limited to a few machines
  - Reliable backbone and services
The French academic grid, Grid5000, is built on this model.

4 Challenges in HPC on structured grids
- Scalability: required for both algorithms and runtime
- Adaptivity: computation and data must be balanced and placed to match
  - Computing resource capabilities
  - Communication link capacity
- Efficiency: computation on a grid is expensive (energy consumption cost), so efficient platform usage is mandatory

5 Outline
1 Computing on lightweight grids
2 Application safety and efficiency
  Middleware interactions and data management
  Green computing and platform administration
3

6 OAR: Managing resources
OAR is the batch scheduler used in Grid5000 clusters
- Classical batch/interactive submission of parallel jobs
- Elaborate resource query scheme (precise reservation of nodes/processors/cores, switch location, available memory, ...)
- Job dependencies enabling computation workflow support
OAR also features low-level node management
- Effective node cleaning using cpuset
- Interfaced with kadeploy for environment deployment

7 OAR scheduling snapshot
Support for backfilling and fair-sharing policies

8 Enabling the grid with OAR
Efficient platform use
- Best-effort jobs: opportunistic computation
- Dynamic nodes: appropriate management of volatile resources
Abstraction for large sets of tasks
- Array jobs
- CiGri system: life cycle management for bags of tasks
Large parallel applications setup
- Advance reservations: enable cluster coordination
- Checkpoint/resubmit: to test global gang scheduling or fault tolerance

10 TakTuk: Adaptive Deployment of Parallel Executions
Node administration: launch the same command on all nodes of a platform
- uptime to grab statistics about recent machine availability
- dig, ping, ifconfig... to diagnose network issues
- ...
Parallel application development: launch the same parallel program on all nodes (like mpirun)
- Slaves of a master/slave application
- All participants of a symmetric parallel application
- Self-organizing systems (P2P), daemons (monitoring)
and redirect I/O to/from the initiating node

11 Existing tools
Flat deployment tools: pdsh/dsh (IBM Cluster Tools suite)
- Similar to:
    foreach host in hosts do
      fork ssh $host command
- Naturally pipelined by the OS: deployment in linear time
Distributed deployment: gexec (Ganglia Cluster Suite)
- Remote gexec daemons take part in the deployment: deployment tree, logarithmic time
- Requires daemon installation
- Does not adapt to heterogeneity or node failures
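
The gap between linear and logarithmic deployment time can be illustrated with a small step-counting model. This is only a sketch under assumptions that are not in the slides: every connection takes exactly one time step, a node opens one connection per step, and the function names `flat_steps` and `tree_steps` are hypothetical. It models neither pdsh nor gexec precisely.

```cpp
#include <cstdint>

// Flat deployment (assumed model): the initiating node contacts every
// host itself, one connection per step, so time grows linearly.
std::uint64_t flat_steps(std::uint64_t hosts) {
    return hosts;
}

// Tree deployment (assumed model): the initiator and every host already
// reached each open one new connection per step, so the number of
// reached hosts goes 0, 1, 3, 7, ... and time grows logarithmically.
std::uint64_t tree_steps(std::uint64_t hosts) {
    std::uint64_t reached = 0;  // remote hosts reached so far
    std::uint64_t steps = 0;
    while (reached < hosts) {
        reached += reached + 1;  // initiator + reached hosts each add one host
        ++steps;
    }
    return steps;
}
```

Under these assumptions, 1000 hosts need 1000 steps with the flat scheme and 10 steps with the tree scheme.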

12 Optimal deployment
Theoretical optimal deployment on homogeneous machines mixes
- Concurrent connection processes
- Parallel connection initiation
- Distribution of remote connection tasks
[Figure: timeline of connections on node 1, node 2 and node 3; horizontal axis: time]

13 Dynamic deployment
The performance of nodes and network varies
- Heterogeneous architectures in different clusters
- Load due to the OS or hung processes (zombies, infinite loops)
- External contention (network, centralized services)
- Cache effects, swap, other users, ...
TakTuk algorithm: try to do things ASAP
- Distribute the engine (using remote executions)
- Nodes initiate several parallel connections
- Idle nodes get remaining deployment tasks by work stealing
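
The effect of letting idle nodes pick up the remaining deployment tasks can be approximated with a small simulation. This sketch is not TakTuk's actual algorithm: it uses greedy list scheduling as a stand-in for work stealing, assumes a single shared pool of hosts with known connection costs, and the names `simulate_deployment` and `window` (number of parallel connections per node) are hypothetical.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <stdexcept>
#include <vector>

// Returns the time at which the last host is deployed.
// connect_cost[i] is the assumed time needed to connect to host i;
// window is the assumed number of parallel connections per node.
double simulate_deployment(const std::vector<double>& connect_cost,
                           std::size_t window) {
    if (window == 0) throw std::invalid_argument("window must be at least 1");

    // Min-heap of the times at which a connection slot becomes idle.
    std::priority_queue<double, std::vector<double>, std::greater<double>> idle_slots;
    for (std::size_t i = 0; i < window; ++i) idle_slots.push(0.0);  // initiator

    double makespan = 0.0;
    for (double cost : connect_cost) {
        // The earliest idle slot takes the next remaining host.
        const double start = idle_slots.top();
        idle_slots.pop();
        const double done = start + cost;

        idle_slots.push(done);  // the connecting slot is idle again
        for (std::size_t i = 0; i < window; ++i) {
            idle_slots.push(done);  // the new node joins with its own slots
        }
        if (done > makespan) makespan = done;
    }
    return makespan;
}
```

With costs `{4, 1, 1}`, a window of 1 finishes at time 5 because the slow first connection blocks everything, while a window of 2 finishes at time 4 because the second slot deploys the fast hosts in the meantime.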

14 TakTuk deployment compared to other tools
Performance versus pdsh and gexec
[Figure: two plots of execution time (s) against number of nodes. Legend entries: pdsh, window 64; taktuk, window; gexec, arity 2; taktuk, ssh, window; taktuk, rsh, window]
Advantages
- No installation required on remote nodes (can self-propagate)
- Adapts to node load, insensitive to node failures

15 TakTuk unique features for the grid
Heterogeneity and hierarchy
- Any part of the deployment can be statically specified (e.g. partial topology enforced by cluster front nodes)
- Logical numbering of deployed nodes
- Distinct machines can execute different commands
Application support using deployment connections
- Provides a control communication layer
- File transfer (send/receive/multicast/gather) capabilities

17 KAAPI: Parallel programming library
Middleware for adaptive computation on
- Multi-core architectures
- Clusters and grids
High-level API for resource abstraction: Athapascan
- fork keyword to create parallel tasks
- shared keyword to declare shared data
Objectives
- Write once, run anywhere
- Guaranteed performance

18 KAAPI example: C++, fork and shared

    // Task that adds two read-only shared values into a write-only result.
    // Defined first so that Fibonacci can fork it.
    struct Sum {
      void operator()( a1::Shared_w<int> result,
                       a1::Shared_r<int> sr1,
                       a1::Shared_r<int> sr2 ) {
        result.write( sr1.read() + sr2.read() );
      }
    };

    struct Fibonacci {
      void operator()( int n, a1::Shared_w<int> result ) {
        if (n < 2) result.write(n);
        else {
          a1::Shared<int> subresult1;
          a1::Shared<int> subresult2;
          // Each fork creates a task that may run in parallel
          a1::Fork<Fibonacci>()(n-1, subresult1);
          a1::Fork<Fibonacci>()(n-2, subresult2);
          // Sum reads both subresults, so it runs after they are written
          a1::Fork<Sum>()(result, subresult1, subresult2);
        }
      }
    };

19 KAAPI Workflow
- KAAPI application constructs a data flow graph
- KAAPI maps tasks onto resources:
  - work stealing (dynamic load balancing)
  - static placement
- KAAPI manages communications (shared memory or network communication)
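
The idea of a data flow graph, where a task becomes ready once all the data it reads has been produced, can be sketched in plain C++. This is an illustrative, sequential toy and not KAAPI's implementation: the `Task` and `DataFlowGraph` names are hypothetical, and where this sketch runs ready tasks one after another, a runtime such as KAAPI would instead hand them to idle processors.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

struct Task {
    std::function<void()> body;           // work to perform
    std::vector<std::size_t> successors;  // tasks reading what this one writes
    std::size_t pending_inputs = 0;       // inputs not yet produced
};

class DataFlowGraph {
public:
    // Registers a task and returns its identifier.
    std::size_t add_task(std::function<void()> body) {
        Task task;
        task.body = std::move(body);
        tasks_.push_back(std::move(task));
        return tasks_.size() - 1;
    }

    // Declares that `consumer` reads data written by `producer`.
    void add_dependency(std::size_t producer, std::size_t consumer) {
        tasks_.at(producer).successors.push_back(consumer);
        ++tasks_.at(consumer).pending_inputs;
    }

    // Runs each task once all its inputs exist; returns the execution order.
    std::vector<std::size_t> run() {
        std::vector<std::size_t> pending;
        std::queue<std::size_t> ready;
        for (std::size_t id = 0; id < tasks_.size(); ++id) {
            pending.push_back(tasks_[id].pending_inputs);
            if (pending[id] == 0) ready.push(id);
        }
        std::vector<std::size_t> order;
        while (!ready.empty()) {
            const std::size_t id = ready.front();
            ready.pop();
            tasks_[id].body();
            order.push_back(id);
            // Completing a task may release the tasks that depend on it.
            for (std::size_t next : tasks_[id].successors) {
                if (--pending[next] == 0) ready.push(next);
            }
        }
        return order;
    }

private:
    std::vector<Task> tasks_;
};
```

Registering a sum task before its two producers and then declaring both dependencies still runs the producers first, since the order is driven by data availability and not by creation order.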

20 TRIVA: Application Execution Visualization
Collaboration with UFRGS (Brazil)
3D (2D resources / time) visualization outlines
- Application topology
- Network topology
TRIVA is generic and extensible
- Based on the Pajé generic trace description language
- Treemap views of synthetic data available
- Scales to thousands of processes

21 Large scale experiments
KAAPI/TakTuk winner of the Plugtest (ETSI event) for three consecutive years ( )
- N-Queens and financial applications on nearly 4000 cores
- 2008 edition used G5K + Intrigger: mixed communications
  - TakTuk communications between different grids
  - TCP/IP within each grid
IDHAL experiments: coupling highly heterogeneous machines
- G5K grid
- Brazilian grids
- Luxembourg clusters
- Individual volunteer machines linked via DSL modem
- Machines from PlanetLab

22 Outline
1 Computing on lightweight grids
2 Application safety and efficiency
  Middleware interactions and data management
  Green computing and platform administration
3

23 Evolution forecast for structured grids
Next step: interconnect several structured grids into a larger one
Several new issues
- Hierarchical network
  - Nodes communicate with their neighbors only
  - Front node forwarding for inter-grid communications
- More node failures (even during short executions)
This meets unstructured grid issues (as in P2P grids, PlanetLab)
Of course, former lightweight grid issues worsen: scale, heterogeneity and energy consumption
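
Front node forwarding can be pictured as a routing rule: a message between hosts of different grids goes through the front node of each grid. The sketch below is an illustration under assumptions that are not in the slides: one front node per grid, direct communication inside a grid, and a made-up naming scheme (`front.<grid>`). The names `Host`, `front_node` and `route` are hypothetical.

```cpp
#include <string>
#include <vector>

// A host and the grid it belongs to.
struct Host {
    std::string name;
    std::string grid;
};

// Hypothetical naming convention: one front node per grid.
std::string front_node(const std::string& grid) {
    return "front." + grid;
}

// Appends a hop unless it repeats the previous one.
void add_hop(std::vector<std::string>& path, const std::string& hop) {
    if (path.empty() || path.back() != hop) path.push_back(hop);
}

// Lists the hosts a message traverses from src to dst.
std::vector<std::string> route(const Host& src, const Host& dst) {
    std::vector<std::string> path;
    add_hop(path, src.name);
    if (src.grid != dst.grid) {
        // Inter-grid traffic is forwarded by both front nodes.
        add_hop(path, front_node(src.grid));
        add_hop(path, front_node(dst.grid));
    }
    add_hop(path, dst.name);
    return path;
}
```

In this model an inter-grid message takes up to three hops where an intra-grid message takes one, which is one reason latency-sensitive mechanisms need to be aware of the hierarchy.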

24 KAAPI ongoing works
Deepen the run anywhere concept
- Node dynamicity
  - Fault tolerance: checkpoint/restart application
    - CCK: coordinated checkpoint protocol
    - TIC: theft-induced protocol (distributed)
  - Interaction with the deployment tool: add/remove resources during computation
- Heterogeneity handling
  - Hierarchical work stealing (sensitive to high-latency networks)
  - NUMA-aware scheduling
Complete implementation of an adaptive parallel STL

25 TRIVA ongoing works
Improve scalability
- User navigation in the large volume of information
- Well-chosen data aggregation for relevant overviews
Aggregation example: treemap
- Transform a data summary (e.g. number of steals) into a visually relevant square
- Can be applied at each level: core, processor, node, cluster, grid
Behavior pattern identification
- Manipulation of object classes, correlated events, ...
- Detection of common patterns
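
The treemap aggregation can be sketched as two passes over the resource hierarchy: sum a metric bottom-up, then split a drawing area among children in proportion to their totals. This is a minimal illustration, not TRIVA's code: it uses the simple slice-and-dice layout (which yields rectangles, not squares), and the `Node`, `Rect`, `aggregate` and `layout` names are hypothetical.

```cpp
#include <string>
#include <utility>
#include <vector>

struct Node {
    std::string name;
    double value = 0.0;          // leaf metric, e.g. number of steals
    std::vector<Node> children;  // empty for a leaf (e.g. a core)
};

struct Rect {
    double x, y, w, h;
};

// Bottom-up pass: an inner node's value becomes the sum of its children.
double aggregate(Node& node) {
    if (node.children.empty()) return node.value;
    node.value = 0.0;
    for (Node& child : node.children) node.value += aggregate(child);
    return node.value;
}

// Top-down pass (call aggregate first): each child receives a slice of
// its parent's area proportional to its value; the split direction
// alternates at every level of the hierarchy.
void layout(const Node& node, const Rect& area, bool split_horizontally,
            std::vector<std::pair<std::string, Rect>>& out) {
    out.emplace_back(node.name, area);
    if (node.children.empty() || node.value <= 0.0) return;
    double offset = 0.0;  // fraction of the area already handed out
    for (const Node& child : node.children) {
        const double share = child.value / node.value;
        Rect sub = area;
        if (split_horizontally) {
            sub.x = area.x + offset * area.w;
            sub.w = share * area.w;
        } else {
            sub.y = area.y + offset * area.h;
            sub.h = share * area.h;
        }
        layout(child, sub, !split_horizontally, out);
        offset += share;
    }
}
```

For a node with two cores reporting 30 and 10 steals, the first core gets three quarters of the node's width, so an imbalance is visible at a glance at whichever level of the hierarchy is displayed.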

26 TakTuk ongoing works
Improve distributed application support
- Application management extensions
  - Deployment network union support
  - Interface between batch scheduler and application
- Data management
  - Efficient broadcast of large data files
    - using direct connections rather than the deployment network
    - based on Santos et al. algorithms for k-item broadcast

27 OAR ongoing works
Flexibility and application support
- Green OAR
  - Dynamic machine power state changes (history and models)
  - Energy-sensitive scheduling (consumption/speed tradeoff)
- OAR API for interactions with applications
  - Dynamic addition/removal of a job's resources
Cluster administration
- OAR live CD
- Support for virtualized clusters

28 Thanks for your attention, any questions?
OAR: N. Capit, G. Da-Costa, Y. Georgiou, G. Huard, C. Martin, G. Mounier, P. Neyron, and O. Richard. A batch scheduler with high level components. In CCGrid 2005.
KAAPI: T. Gautier, X. Besseron, and L. Pigeon. KAAPI: A thread scheduling runtime system for data flow computations on cluster of multi-processors. In PASCO 2007.
TakTuk: B. Claudel, G. Huard, and O. Richard. TakTuk, adaptive deployment of remote executions. In HPDC 2009 (to appear).
TRIVA: L. M. Schnorr, G. Huard, and P. O. A. Navaux. 3D approach to the visualization of parallel applications and grid monitoring information. In Grid 2008.