One target application: BLAST using large databases over the grid

Transcription

1 DIET_BLAST: Architecture logicielle et petits problèmes de recherche Frédéric Desprez LIP ENS Lyon/INRIA Grenoble Rhône- Alpes EPI GRAAL/Avalon 14/06/11! 1! Agenda Introduction One target application: BLAST using large databases over the grid One problem: join scheduling and replication Thèse d Antoine Vernois (avec C. Blanchet et P. Vicat-Blanc Primet) Thèse de Gaël Le Mahec (avec V. Breton) One middleware : DIET/DAGDA Conclusion and future work Décrypthon 2!

2 Introduction Several applications ready (and not only number crunching ones!) Huge data sets start to be available around the world Data management is one of the most important issue of today s applications Replication has to be used to improve platform throughput Services for resource management and data management/replication available in most grid middleware but usually they work separately Our approach Put the data management into the resource management sphere Perform the request scheduling with the data replication based on the information provided by the application Application to a large scale bioinformatics application 3! One target application: BLAST over the grid Basic Local Alignment Search Tool (BLAST) To find homologies between nucleotids or amino acids sequences. Objectives To find clues about the function of a protein/gene comparing a newly discovered sequence to well-known ones. Biological databases BLAST uses biological databases containing large sets of annotated sequences as entry parameter. Usage Generally, biologists perform BLAST searches on large sets of sequences. A T C A A G T C A C C A - G T C 4!

3 One target application: BLAST over the grid Each sequence of the entry set can be treated independently A simple parallelization: to perform the searches for each sequence on different nodes But efficient A sequence weights at most some kilobytes and a search on it takes at most some minutes Requirements to gridify BLAST applications A way to submit and distribute the requests A way to replicate databases! Then a software to perform! the sequences sets division before the computation! the results merge on exit 5! One target application: BLAST over the grid Grid-BLAST execution The user submits a large set of sequences and a database or a database identifier Entry set is divided. The database is replicated on the platform using a data manager The resource broker returns a set of computing resources Each sequence is treated on a different server Results merged into a single result file 6!

4 Parallelization and distribution of bioinformatics requests over the Grid Context Large sets of requests are submitted by many users of the grid Large databases are used as parameters of the requests The computing nodes of the grid have different computing and storage capacities Several different BLAST applications depending upon users needs The jobs have to be scheduled on-line (the scheduling is made job by job) We want to optimize the resources usage: No useless replication Platform throughput optimization Where to replicate the databases? How to distribute the requests? 7! Related work: Job scheduling & data replication In «Decoupling Computation and Data Scheduling in Distributed Data- Intensive Applications», Foster et al. analyzed the performance of various combinations of job scheduling and data replication algorithms. They evaluated four job scheduling strategies: JobRandom: The job is sent to a random site. JobLeastLoaded: The job is sent to the node with the least number of waiting jobs. JobDataPresent: The job is sent on the least loaded site on which the data is present. JobLocal: The job is run locally. And three replication strategies: DataDoNothing: The data are not replicated. Data may be downloaded from a remote site in a cache using LRU replacement algorithm. DataRandom: The most «popular» data are replicated on a random site. DataLeastLoaded: The most «popular» data are replicated on the least loaded site in the neighborhood of the node. Decoupling Computation and Data Scheduling in Distributed Data-Intensive Applications, Ranganathan, K., Foster, I., HPDC '02, Washington, DC, USA, IEEE !

5 Related work: Job scheduling & data replication For the different combinations they measured: Best average response time when: The job is scheduled where the data is With data replication (even with random replication) Best average data transferred when: The job is scheduled where the data is With data replication Best average idle time: The job is scheduled where the data is With data replication! Scheduling should take into account the data distribution.! It is not always necessary to couple data movement & computation. 9! Related work: Integration of Scheduling & Replication In «Integration of Scheduling and Replication in Data Grids», Chakrabati et al. present the Integrated Replication and Scheduling Strategy (IRS) which couple scheduling and replication strategies. Improve the performance by working alternatively on the data mapping and the task mapping. Algorithm principle The job scheduling can use two approaches: One based on the data availability: The job is sent on the node which have the most of the needed data. One based on scheduling cost (time to transfer the data, job waiting time and job execution time) : The job is sent on the node which have the smallest cost. After each job submission, a data usage matrix is updated. The replication algorithm have two phases: Estimation of the maximum number of replications for a data. The data replication itself using the data usage matrix Integration of Scheduling and Replication in Data Grids, Chakrabarti A., Dheepak, R.A., Sengupta, S., HiPC 2004, Lecture Notes in Computer Science, !

6 Parallelization and distribution of bioinformatics requests over the Grid To obtain dynamic information about the nodes status is a difficult task involving complex grid services The information obtained can be several minutes old For long execution time jobs: not really a problem For jobs that take some minutes: the information are outdated In this context without more information, there are few things to do Using the classical Minimum Completion Time (MCT) scheduling strategy is a good way to schedule the jobs but With few more information, we can do better The analysis of the execution traces of bioinformatics clusters showed that the way the biologists are submitting jobs is homogeneous if the observed time interval is long enough. We will use this information to optimize the Grid resources usage. 11! Parallelization and distribution of bioinformatics requests over the Grid Problem modeling Scheduling and Replication Algorithm (SRA) Relies on the integer approximation of a linear program Gives the data distribution and the jobs scheduling Takes into account the specificities of the biologists submissions A. Vernois 12!

7 Scheduling and Replication Algorithm SRA gives good results when the requests frequencies are constant The data distribution and the scheduling are efficient It does not need dynamic information about the Grid A. Vernois 13! Scheduling and Replication Algorithm Simulation experiments Grid 5000 platform 2540 heterogeneous CPUs on 9 sites 5 databases of different sizes 5 algorithms of different complexities 1 request per 0,3 s. The data are randomly distributed before the start For small sets of requests SRA does not have enough time to replicate the data For large sets of requests The replications and jobs scheduling of SRA avoid the computing nodes to saturate A. Vernois 14!

8 Using MCT A. Vernois 15! Using SRA A. Vernois 16!

9 SRA with frequency variation detection On the grid, the number and variety of users can make the time needed to have constant frequencies very long. Some «events» can temporary modify the frequencies Data challenges Important conference submission deadline Holidays and week-ends By detecting such a frequency variation, we can correct the data distribution and jobs scheduling 17! SRA with frequency variation detection Algorithm principle SRA gives an initial data distribution The Resource Broker records the submissions and update the frequencies If a frequency varies beyond a! threshold New data distribution computation using SRA Start of the transfers asynchronously if possible. Otherwise, the job scheduling will cause the data transfers Task scheduling using the job distribution given by SRA 18!

10 SRA with frequency variation detection The frequencies vary - We do not proceed to data redistribution The scheduling is no more efficient for large sets of jobs. The average execution time of the jobs increases with the number of jobs submitted. 19! SRA with frequency variation detection The frequencies vary - We detect the variations and proceed to the data redistribution The scheduling efficiency is saved. The average execution time of the jobs does not increase too much. 20!

11 Implementation within a middleware - DIET GridRPC API Network Enabled Servers paradigm Some features Distributed (hierarchical) scheduling Plugin schedulers Data management Workflow management Sequential and parallel task (through batch schedulers) Clusters, Grids, Clouds Open source 21! Data/replica management Two needs Keep the data in place to reduce the overhead of communications between clients and servers Replicate data whenever possible Three approaches for DIET DTM (LIFC, Besançon) Hierarchy similar to the DIET s one Distributed data manager Redistribution between servers JuxMem (Paris, Rennes) P2P data cache DAGDA (IN2P3, Clermont-Ferrand and now Univ. of Picardie) Replication Joining task scheduling and data management Work done within the GridRPC Working Group (OGF) Relations with workflow management 22!

12 DAGDA Data Arrangement for Grid and Distributed Applications A data manager for the DIET middleware providing Explicit data replication: using the API Implicit data replication: data items are replicated on the selected servers Direct data get/put through the API Automatic data management: using a selected data replacement algorithm when necessary LRU: The Least Recently Used data is deleted LFU: The Least Frequently Used data is deleted FIFO: The «oldest» data is deleted Transfer optimization by selecting the best source Using statistics on previous transfers Storage resources usage management The space reserved for the data can be configured by the user Data status backup/restoration Allowing to stop and restart DIET, saving the data status on each node 23! DAGDA Transfer model Uses the pull model. The data are sent independently of the service call. The data can be sent in several parts. 1: The client send a request for a service 2: DIET selects some SeDs according using a scheduling heuristic 3: The client sends its request to the SeD 4: The SeD downloads the data from the client and/or from other DIET servers 5: The SeD performs the call. 6: The persistent data are updated 24!

13 DAGDA DAGDA architecture Each data is associated to one unique identifier DAGDA control the disk and memory space limits. If necessary, it uses a data replacement algorithm The CORBA interface is used to communicate between the DAGDA nodes Users can access data and perform replications using the API 25! Putting everything together 26!

14 Putting everything together Experiments over Grid 5000 using DIET and DAGDA 300 nodes (4 sites), sequences, 4 BLAST algorithms, 2 databases 2 different sequences splits Small grain: 1 sequence after an other Coarse grain: splitting as a function of the number of servers available 4 scheduling algorithms random, MCT, round-robin, dynamic SRA 27! Putting everything together Maximum request partition Partition in n sub-requests. (n the number of available nodes) For MCT, Random & Round-Robin: The better requests partitioning is using the number of available nodes (coarse grain). For Dynamic-SRA: One sequence per request (small grain) 28!

15 Putting everything together Comparison of MCT and dynamic SRA over 1000 nodes (8 sites) 4 databases between 175 Mo and 6.5 Go 4 algorithms sequences Frequencies variation 29! Putting everything together Using MCT: Using Dynamic-SRA: Requests 1 to 20000: BLASTP Vs DB 1 30% BLASTN Vs DB 2 30% BLASTX Vs DB 3 20% TBLASTX Vs DB 4 20% Requests to 40000: BLASTP Vs DB 1 10% BLASTP Vs DB 3 30% BLASTN Vs DB 2 10% BLASTN Vs DB 4 10% BLASTX Vs DB 1 10% BLASTX Vs DB 3 10% TBLASTX Vs DB 2 20% 30!

16 Conclusion and future work about DIET_BLAST Data management is one of the most important issue of large scale applications over the grid It has to be linked with request scheduling to get the best performance Static algorithms perform well even on a dynamic platform Future work Provide a set of replication algorithms tuned for specific application classes within DAGDA Increase the exchange between resource and data management systems Use other performance metrics (fairness) Manage replication for workflows scheduling 31! Une collaboration autour de trois partenaires fondateurs AFM - coordination de l appel à projets auprès de la communauté scientifique - financement des projets de recherche stratégie scientifique de l'afm : guérir les maladies neuromusculaires et les maladies rares pour la plupart d'origine génétique IBM - expertise du Grid Computing + Sciences de la Vie - dotation de 6 universités de supercalculateurs (programme Shared University Research) - accès WCG CNRS - pilotage scientifique du programme - expertise scientifique et technologique du portage des applications - Gestion des travaux et des résultats du WCG (HCMD1 et 2 d A. Carbone) 32!

17 Avec la participation d'autres partenaires Installation de supercalculateurs (à base de power G5) puissance de 500 Gflops / 473 Gflops déjà présents dans les universités Réseau National de Télécommunications pour l Enseignement et la Recherche (RENATER) connecte l ensemble des ressources ENS Lyon équipe GRAAL/Avalon Pilotage des ressources de la grille en optimisant : planification et exécution - utilisation du logiciel DIET développé par l ENS (succède à United Devices) - suivi des programmes informatiques de chaque projet 33! Deployment example with Universities Sed = Server Daemon, installed on any server running Loadleveler. Note that we can define rescue SeD. MA = master agent, coordinates Jobs. We can define rescue or multiple Master Agent. WN = worker node DIET Décrypthon Web Interface Orsay Decrypthon2 Orsay Decrypthon1 Master Agent LILLE Project Users SeD LoadLeveler SeD LoadLeveler BORDEAUX JUSSIEU ORSAY SeD LoadLeveler SeD LoadLeveler IBM WII CRIHAN DB2 Lyon Data manager! Interface! BD AFM Cliniques 34!

18 Philosophy of the Décrypthon grid Transparency An interface as close as the one users are used to, to submit, to monitor and to download the results of jobs. Reactivity new algorithms are ported as quickly as possible on the grid to be close to the practical use case. 35! Data management Décrypthon Grid - Grid Resources Dedicated to Neuromuscular Disorders, Bard, N., Bolze, R., Caron, E., Desprez, F., Heymann, M., Friedrich, A., Moulinier, L., Nguyen, N.-H., Poch, O. and Toursel, T., 8th HealthGrid conference, Paris, France, June, Credits: H. N Guyen, O. Poch, IGBMC 36!

19 Data management, cont 37! SM2PH a pilot project and a success story Infrastructure: - BIRD database (sequences, genomical data, alignments, 3D structures or models, pathology descriptions, analysis results ) - Genotype/Phenotype description (UMD, LOVD) - WEB server, interactive analysis interface Software and algorithms: scoring function for prediction of mutation consequences in structural and variability contexts, correlation between mutation and phenotype severity, Credits: H. N Guyen, O. Poch, IGBMC 38!

20 SM2PH-db! SM2PH-db ( " entry: proteins involved in human monogenic diseases " for each entry: wide range of information involved in the genotype phenotype relationship " Evolutionary view : Multiple Alignment of Complete Sequences Eukaryote Filter Structural sampling " Structural view : 3D models: 1596 wild type proteins mutant proteins " Informational view : structural and functional annotations (UniProt, Pfam, Prosite, Interpro, GO) mutation and phenotypic data ( missense mutations) Automatic update every 2 months, current version : 9 39! Complex interconnected programs On a single data, up to 25 programs are applied, in cascade or in parallel. Gain de 15 à 1 jours sur la plate-forme du Décrypthon. Credits: H. N Guyen, O. Poch, IGBMC 40!

21 What s next? New platform based on IBM BlueGene/P Mixed Grid/Cloud approach using SysFera-DS LoadLeveler on existing platforms IBM Cloud software on new platforms Extension to public Clouds Seamless access to resources Tight integration of recent developments from IGBMC around data management Work around the user interfaces 41! SysFera SysFera-DS : une pile logicielle complète pour le HPC... et un accès simple et transparent aux infrastructures de Cloud Inside the Cloud + DIET platform is virtualized inside the cloud. (as Xen image for example) + Very flexible and scalable as DIET nodes can be launched + Dynamic adaptation % charge Cloud manager + EC2 interface + EC2 is treated as a new Batch System + Automatic deployment of VMs with associated services 42!

22 Des questions? 43!