Advances in the LIBI Project, a Virtual Laboratory for Bioinformatics
Maria Mirto, on behalf of Giovanni ALOISIO
Giovanni ALOISIO (SPACI Consortium & University of Salento, Lecce)
Giorgio MAGGI (INFN Sezione di Bari)
Pietro LEO (IBM GBS Innovation Lab, Bari)
Elda ROSSI (CINECA, Bologna)
Rita CASADIO (Biocomputing Group & CIRB/CIG-Department of Biology, University of Bologna)
Graziano PESOLE, Cecilia SACCONE (Project Coordinator) (ITB-CNR, Bari & Dipartimento di Biochimica e Biologia Molecolare E. Quagliariello, Università di Bari & Dipartimento di Scienze Biomolecolari e Biotecnologie, Università di Milano)
4th EGEE User Forum/OGF 25 - Grid Projects and Collaborations, March 5, 2009
Outline: The LIBI Project; Architecture; Grid Infrastructure; Services and basic frameworks; Applications Porting; Conclusions and Future Work
LIBI Project
FIRB LIBI: International Laboratory of BioInformatics (www.libi.it)
Funded by the MIUR (Italian Ministry for Education, University and Research), 2005-2010
Goal: setting up an advanced Bioinformatics and Computational Biology Laboratory, focusing on the central activities of basic and applied research in modern Biology and Biotechnologies
Biological Activities:
- the construction and maintenance of general genomic, proteomic and transcriptomic databases (e.g. ENSEMBL) as well as specialized databases developed by LIBI partners (e.g. MitoRes, UTRdb, UTRsite, ASPicDB, CSTdb, etc.);
- the design and implementation of new algorithms and software for the analysis of genomes and their expression products and for the prediction of the structure of proteins.
Technological Activities:
- building a technological framework (Grid PSE) that, by guaranteeing interoperability between different grid middleware (gLite, Unicore, Globus), provides an environment for composing, executing and monitoring complex experiments in Bioinformatics;
- optimizing Bioinformatics applications and porting them to the Grid.
LIBI Partners
Two kinds of actors have equal responsibility in LIBI: technological and bioinformatics partners.
Technological RUs:
- University of Salento & SPACI Consortium, Lecce (Prof. Giovanni Aloisio)
- INFN Sections of Bari, Catania and Padova and CNAF-Bologna (Dr. Mirco Mazzucato)
- IBM Italia S.p.A - IBM Innovation Lab, Bari (Dr. Luigi Di Pace)
- CINECA, Bologna (Dr. Elda Rossi)
Bioinformatics RUs:
- CNR Institute for Biomedical Technologies, Bari (Prof. Cecilia Saccone, project coordinator)
- University of Bologna, Biocomputing Group, Bologna (Prof. Rita Casadio)
- University of Milano (Prof. Graziano Pesole)
- Centro di Biomedicina Molecolare, Trieste (Prof. Claudio Schneider)
Associate RUs:
- University of Milano-Bicocca (Prof. Giancarlo Mauri)
- DIBIT-HSR, Milan (Dr. Giovanni Lavorgna)
- TIGEM, Naples (Dr. Sandro Banfi, Dr. Elia Stupka)
- University of Rome - Tor Vergata (Prof. Manuela Helmer-Citterich)
- University of Rome (Prof. Anna Tramontano)
- CASPUR, Rome (Dr. Tiziana Castrignanò)
- Dipartimento Interateneo di Fisica, Bari (Prof. Giorgio Maggi)
- University of Bari (Dip. Informatica, Prof. Donato Malerba; Dip. Biochimica e Biologia Molecolare, Prof. Marcella Attimonelli)
- Exhicon srl, Unità Bioinformatica, Bari (Dr. Graziano Pappadà)
LIBI Partners [Figure: map of Italy with the partner sites: Torino (UNITO), Milano (UNIMI), Bologna (CINECA, UNIBO), Trieste (CBMTS), Roma (CASPUR), Bari (INFN, ITB, IBM), Lecce (SPACI). Legend: Scientific Unit, Technological Unit.]
The LIBI Architecture [Figure: layered architecture, from top to bottom: Researchers (team work); Web-based Portal; Bioinformatic Research Applications; Bioinformatic and Text Analytics Tools and Services; Bioinformatic Workflows Tools and Services; Virtualization (knowledge & computing warehouse); gLite, Unicore, Globus; Physical Databases & Computing Resources.]
The LIBI e-infrastructure
- Based on IGI (the Italian Grid infrastructure) and on EGEE (the European Grid infrastructure)
- Italian sites available for LIBI activity: ~10 INFN-Grid sites enabled for job submission from the LIBI/BIO VOs (~10% of available resources => up to 1500 CPU cores reachable)
- Up to 17000 CPU cores reachable using the biomed VO all over the EGEE Grid
- 6 RBs at different sites: Catania, CNAF (2), Ferrara, Padova, Bari
- DB services: GRelC, AMGA, OGSA-DAI and GDSE servers installed at INFN-Bari; high-availability servers that could provide up to 1 TB of data
- LFC (Grid file catalogue): installed at CNAF and at INFN-Bari
The LIBI Services and basic frameworks
- GRB Grid Portal;
- Workflow Management System: Editor, Engine (Meta Scheduler);
- Job Submission Tool (JST);
- Bioinformatic Data Federation Service for managing and accessing the LIBI Federated DBs: LIBI federator server, GRelC DAS plug-in for DB2;
- Text Analytics 2.0 Framework.
Several bioinformatics applications have been deployed and optimized on the LIBI Grid platform, such as BLAST, PSI-BLAST, PatSearch, DNAFan, FT-COMAR, Antihunter, Gromacs and NAMD, MrBayes, Gene Analogous Finder, CSTMiner/Genominer, WeederWeb, RNAProfile, Exalign, ASPIC, PAML, R-www (submission of R jobs by using a web form). http://www.libi.it/biotools
GRB Grid Portal It is a Grid Portal with the following services: Grid Configuration Profile Credential VO Management Resource Management Database Management Software Management View Configuration Resource Status Applications Transfer
Job Submission & Monitoring
Meta Scheduler Scenario. Grid Middleware: Interoperability - the Big Issue [Figure: jobs labelled A, B, C and D spread across different grids.]
Meta Scheduler Features
- We have developed a Web Service component that supports the submission and monitoring of workflow, batch, MPI and parameter sweep jobs distributed on gLite-, Globus- and Unicore-based grids;
- Several libraries, plugged into the meta scheduler, provide core functions for job submission/monitoring and for data transfer using the GridFTP protocol;
- Jobs are described with the OGF-compliant JSDL specification;
- Automatic converters translate JSDL into the specific grid languages for submission;
- Support for applications wrapped as Web Services (work in progress).
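The idea of converting one JSDL description into middleware-specific languages can be illustrated with a minimal sketch. This is not the LIBI converter: it assumes a JSDL document that uses only the POSIXApplication extension, handles a small subset of elements, and renders only gLite JDL and Globus RSL (a Unicore AJO is a serialized Java object and is left out). The example job, the function names and the output layout are hypothetical.

```python
import xml.etree.ElementTree as ET

# XML namespaces defined by the OGF JSDL 1.0 specification.
NS = {
    "jsdl": "http://schemas.ggf.org/jsdl/2005/11/jsdl",
    "posix": "http://schemas.ggf.org/jsdl/2005/11/jsdl-posix",
}

# Hypothetical job description, not taken from the LIBI platform.
EXAMPLE_JSDL = """\
<jsdl:JobDefinition xmlns:jsdl="http://schemas.ggf.org/jsdl/2005/11/jsdl"
                    xmlns:posix="http://schemas.ggf.org/jsdl/2005/11/jsdl-posix">
  <jsdl:JobDescription>
    <jsdl:Application>
      <posix:POSIXApplication>
        <posix:Executable>/usr/bin/blastall</posix:Executable>
        <posix:Argument>-p</posix:Argument>
        <posix:Argument>blastp</posix:Argument>
        <posix:Output>blast.out</posix:Output>
        <posix:Error>blast.err</posix:Error>
      </posix:POSIXApplication>
    </jsdl:Application>
  </jsdl:JobDescription>
</jsdl:JobDefinition>
"""


def parse_jsdl(text):
    """Extract executable, arguments and stdout/stderr names from a JSDL document."""
    root = ET.fromstring(text)
    app = root.find(".//posix:POSIXApplication", NS)
    if app is None:
        raise ValueError("no POSIXApplication element found")

    def text_of(tag):
        node = app.find("posix:" + tag, NS)
        return node.text.strip() if node is not None and node.text else None

    return {
        "executable": text_of("Executable"),
        "arguments": [a.text.strip() for a in app.findall("posix:Argument", NS) if a.text],
        "stdout": text_of("Output"),
        "stderr": text_of("Error"),
    }


def to_jdl(job):
    """Render the job as a gLite JDL description (illustrative subset of attributes)."""
    lines = ['Executable = "%s";' % job["executable"]]
    if job["arguments"]:
        lines.append('Arguments = "%s";' % " ".join(job["arguments"]))
    if job["stdout"]:
        lines.append('StdOutput = "%s";' % job["stdout"])
    if job["stderr"]:
        lines.append('StdError = "%s";' % job["stderr"])
    return "[\n" + "\n".join("  " + line for line in lines) + "\n]"


def to_rsl(job):
    """Render the job as a Globus RSL string (illustrative subset of attributes)."""
    parts = ['(executable="%s")' % job["executable"]]
    if job["arguments"]:
        parts.append("(arguments=%s)" % " ".join('"%s"' % a for a in job["arguments"]))
    if job["stdout"]:
        parts.append('(stdout="%s")' % job["stdout"])
    if job["stderr"]:
        parts.append('(stderr="%s")' % job["stderr"])
    return "&" + "".join(parts)


if __name__ == "__main__":
    job = parse_jsdl(EXAMPLE_JSDL)
    print(to_jdl(job))
    print(to_rsl(job))
```

Keeping the parsed job in a neutral dictionary means each target language needs only its own small renderer, which is one plausible way to structure such converters.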
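WFMS in action [Figure: the Workflow Editor, downloaded from the LIBI Portal, sends a JSDL submission request for a workflow of jobs A, B and C to the Metascheduler. The Metascheduler translates job A into JDL, job B into RSL and job C into AJO, submits them to the corresponding back ends (WMS/LB/WMProxy, GRB, and the NJS, Network Job Supervisor) and collects the status of each job until it is DONE.]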
The Job Submission Tool
- JST is a tool developed (and initially used) by a few expert operators for interactive job submission in bioinformatics grid challenges
- Recently, JST has been upgraded to provide grid job submission services to already existing (non-grid) bioportals
- The portal does not need to implement any gLite submission procedure: it is only required to perform an SQL insert into a DB server
- The JST daemons take care of submitting and controlling jobs, resubmitting failed ones and collecting the final output.
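The portal/daemon split described above can be sketched as a small job queue kept in a database. This is only an illustration of the pattern, not the JST implementation: the table layout, the status values, the retry limit and the `submit` callback are all assumptions, and SQLite stands in for the DB server.

```python
import sqlite3

# Hypothetical queue table; the real JST schema is not described in the slides.
SCHEMA = """
CREATE TABLE jobs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    executable TEXT NOT NULL,
    arguments TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'NEW',
    attempts INTEGER NOT NULL DEFAULT 0
)
"""


def portal_request(conn, executable, arguments):
    """Portal side: a single SQL insert, with no grid-specific code."""
    cur = conn.execute(
        "INSERT INTO jobs (executable, arguments) VALUES (?, ?)",
        (executable, arguments),
    )
    conn.commit()
    return cur.lastrowid


def daemon_pass(conn, submit, max_attempts=3):
    """Daemon side: one polling pass that submits new jobs and resubmits failed ones.

    `submit` is a caller-supplied function returning True on success; in a real
    deployment it would wrap the grid submission and output collection.
    """
    rows = conn.execute(
        "SELECT id, executable, arguments, attempts FROM jobs "
        "WHERE status IN ('NEW', 'FAILED') AND attempts < ? ORDER BY id",
        (max_attempts,),
    ).fetchall()
    for job_id, executable, arguments, attempts in rows:
        ok = submit(executable, arguments)
        conn.execute(
            "UPDATE jobs SET status = ?, attempts = ? WHERE id = ?",
            ("DONE" if ok else "FAILED", attempts + 1, job_id),
        )
    conn.commit()
    return len(rows)


def make_flaky_submit():
    """Test double: each distinct job fails on its first attempt, then succeeds."""
    seen = set()

    def submit(executable, arguments):
        key = (executable, arguments)
        if key not in seen:
            seen.add(key)
            return False
        return True

    return submit


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    portal_request(conn, "blastall", "-p blastp -i query.fasta")
    submit = make_flaky_submit()
    while daemon_pass(conn, submit):
        pass
    print(conn.execute("SELECT id, status, attempts FROM jobs").fetchall())
```

Because the portal only writes rows, all retry and monitoring policy stays in the daemon and can change without touching the portal.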
PORTALs-JST Interaction [Figure: a portal can either trigger a classical execution on its own limited resources or communicate with JST to use the Grid; JST then performs the submission on the Grid.]
Data Federation
The LIBI Data Federation Service provides a Federated Schema Assimilation Model that can be accessed through a standard and transparent SQL interface and is exposed to the Grid through GRelC.
It wraps, in real time, a number of local and remote biological databases that store information in their original formats, including relational data, Web Services, flat files, EMBL/FASTA formats, etc.
Example of the benefits provided by the federation layer for access to EMBL/FASTA formats:
- A synoptic view of EMBL and FASTA DBs, federated together with other, heterogeneous DBs
- A fast and efficient retrieval system for entries in EMBL/FASTA format from large DBs (sub-second response times)
- Customizable indexing of both textual and non-textual fields of the EMBL entries (with or without tokenization, etc.), which also enables mining features
- Support for multi-platform and distributed deployment
Wrappers:
- Relational Wrapper: MitoRes, UTRSite, UTREF, HmtDB
- Entrez Wrapper: PubMed, GenBank, OMIM
- EMBL Wrapper, shown with an abbreviated example entry:
ID   AB000263 standard; RNA; PRI; 368 BP.
XX
AC   AB000263;
XX
DE   Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.
XX
SQ   Sequence 368 BP;
     acaagatgcc attgtccccc ggcctcctgc tgctgctgct ctccggggcc acggccaccg 60
     ctgccctgcc cctggagggt ggccccaccg gccgagacag cgagcatatg caggaagcgg 120
     agaccttctc ctcctgcaaa taaaacctca cccatgaatg ctcacgcaag tttaattaca 360
     gacctgaa 368
//
[Figure: improved response times for UNIPROT, EMBL_CDS and EMBL in EMBL/FASTA formats, with the old index type versus the new index type.]
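Fast retrieval of single entries from a large EMBL flat file is commonly achieved by indexing entry positions once and then seeking directly to them. The sketch below illustrates that general idea only; it is not the index used by the LIBI federation layer, and the toy entries, accession numbers and function names are invented for the example.

```python
import io

# Toy flat file with two made-up entries in EMBL-like layout (not real data).
TOY_FLATFILE = (
    b"ID   TOY00001 standard; RNA; PRI; 8 BP.\n"
    b"XX\n"
    b"AC   TOY00001;\n"
    b"XX\n"
    b"SQ   Sequence 8 BP;\n"
    b"     acgtacgt 8\n"
    b"//\n"
    b"ID   TOY00002 standard; RNA; PRI; 4 BP.\n"
    b"XX\n"
    b"AC   TOY00002; TOY00003;\n"
    b"XX\n"
    b"SQ   Sequence 4 BP;\n"
    b"     ttga 4\n"
    b"//\n"
)


def build_index(handle):
    """Map every accession on an AC line to the byte offset of its entry's ID line."""
    index = {}
    entry_start = None
    while True:
        pos = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b"ID "):
            entry_start = pos
        elif line.startswith(b"AC ") and entry_start is not None:
            for acc in line[2:].decode("ascii").split(";"):
                acc = acc.strip()
                if acc:
                    # Keep the first offset seen for an accession.
                    index.setdefault(acc, entry_start)
    return index


def fetch_entry(handle, index, accession):
    """Seek to the indexed offset and read up to and including the '//' terminator."""
    handle.seek(index[accession])
    lines = []
    for line in iter(handle.readline, b""):
        lines.append(line.decode("ascii"))
        if line.startswith(b"//"):
            break
    return "".join(lines)


if __name__ == "__main__":
    handle = io.BytesIO(TOY_FLATFILE)
    index = build_index(handle)
    print(fetch_entry(handle, index, "TOY00003"))
```

The index is built in one sequential pass; each later lookup costs a dictionary access plus one seek, independent of the file size.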
Applications porting
- GAF (Gene Analogous Finder): 6000 jobs submitted on the Grid; more than 200 WNs used
- ASPic (Alternative Splicing Prediction): one complete genome (mouse) analyzed in 3 days
- CSTminer (conserved tracts identification): 2 months for the human-mouse genomes (FULL comparison), ~900 WNs used; 1 day for one genome comparison using the optimized algorithm, developed using the lessons learned in the first run; 1 week for a FULL comparison of the Vitis genome
- BLAST (large-scale genome comparison using the BLAST program): dataset of 599 complete genomes (2,624,555 protein sequences in FASTA format); the complete all-against-all comparison of the genomes, carried out on the Grid, took about 1 week
- PAML (maximum likelihood analysis using approximate derivatives): 6690 cases evaluated in 36 hours
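An all-against-all comparison lends itself to the Grid because every genome pair can be processed independently. The slides do not say how the LIBI BLAST run was actually partitioned; the sketch below simply assumes one comparison per ordered genome pair, grouped into fixed-size batches, with hypothetical genome names and batch size.

```python
from itertools import islice, product


def all_vs_all_jobs(genomes, pairs_per_job):
    """Yield batches of (query, subject) genome pairs, one batch per grid job."""
    pairs = product(genomes, repeat=2)  # ordered pairs, self-comparisons included
    while True:
        batch = list(islice(pairs, pairs_per_job))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    # 599 placeholder names standing in for the 599 genomes of the dataset.
    genomes = ["genome_%03d" % i for i in range(599)]
    jobs = list(all_vs_all_jobs(genomes, pairs_per_job=1000))
    print(len(jobs), "jobs covering", sum(len(j) for j in jobs), "genome pairs")
```

Under these assumptions 599 genomes give 599 x 599 = 358,801 ordered pairs, so the batch size directly trades the number of jobs against the run time of each job.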
Applications porting
- FT-COMAR (protein tertiary structure prediction): 21,582,680 jobs executed in 5 days
- MrBayes (Bayesian inference of phylogeny): over 7200 different runs; ~5 CPU-years; ~20 days of running on the EGEE infrastructure
- EMBOSS vrnalfold (applications for molecular sequence analysis): over 0.5 million sequences analyzed
- MAFFT (multiple sequence alignment): more than 5000 sequences aligned
- Solexa Illumina (sequence clustering): 2.5 million sequences clustered.
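The figures above can be turned into rough average rates with a few lines of arithmetic. This is a back-of-the-envelope sketch: it assumes the quoted durations are wall-clock times and a 365-day year, and it treats the approximate values from the slide ("~5 CPU-years", "~20 days") as exact.

```python
SECONDS_PER_DAY = 86_400


def jobs_per_second(jobs, wall_days):
    """Average job completion rate over the whole run."""
    return jobs / (wall_days * SECONDS_PER_DAY)


def average_busy_cpus(cpu_years, wall_days, days_per_year=365.0):
    """Average number of CPUs kept busy to spend `cpu_years` within `wall_days`."""
    return cpu_years * days_per_year / wall_days


# FT-COMAR: 21,582,680 jobs in 5 days.
ft_comar_rate = jobs_per_second(21_582_680, 5)
# MrBayes: ~5 CPU-years in ~20 days.
mrbayes_cpus = average_busy_cpus(5, 20)

if __name__ == "__main__":
    print("FT-COMAR: about %.0f jobs per second" % ft_comar_rate)
    print("MrBayes: about %.0f CPUs busy on average" % mrbayes_cpus)
```

Read this way, the FT-COMAR run averaged about 50 job completions per second, and the MrBayes run kept roughly 90 CPUs busy on average.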
Applications porting
- PatSearch (pattern retrieval in sequences): identifies and annotates the presence of regulatory elements in mRNA untranslated regions, collected in the UTRsite and UTRdb databases, respectively; 150,000 jobs executed in 1 day.
- PSI-BLAST (multiple sequence alignment): 70,000 jobs executed in 65 hours using 128 1.4 GHz Itanium 2 processors; 96 days required on a single processor.
- Gromacs (protein dynamics simulation): short simulations of a small protein, bovine β-lactoglobulin, in a solvent (water/urea + NaCl) at 300 K and at constant pressure (1 bar); run times of 4 s and 26 s using 12 and 2 CPUs, respectively, on Itanium 2 processors.
- Antihunter: identification of expressed sequence tag (EST) antisense transcripts from BLAST output
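The PSI-BLAST and Gromacs timings can be read as speedup figures. The sketch below uses the standard definitions (speedup = reference time / parallel time, efficiency = speedup / processors) and assumes that the quoted times refer to the same workload on the same processor type, which the slide does not state explicitly.

```python
def speedup(reference_time, parallel_time):
    """Ratio of the reference run time to the parallel run time (same units)."""
    return reference_time / parallel_time


def efficiency(speedup_value, processors):
    """Fraction of ideal linear speedup achieved on `processors` CPUs."""
    return speedup_value / processors


# PSI-BLAST: 96 days on one processor versus 65 hours on 128 processors.
psi_blast_speedup = speedup(96 * 24, 65)
psi_blast_efficiency = efficiency(psi_blast_speedup, 128)

# Gromacs: 26 s on 2 CPUs versus 4 s on 12 CPUs (relative to the 2-CPU run).
gromacs_relative_speedup = speedup(26, 4)

if __name__ == "__main__":
    print("PSI-BLAST speedup: %.1f" % psi_blast_speedup)
    print("PSI-BLAST efficiency: %.2f" % psi_blast_efficiency)
    print("Gromacs, 2 -> 12 CPUs: %.1fx" % gromacs_relative_speedup)
```

Under these assumptions PSI-BLAST shows a speedup of about 35 on 128 processors, while the Gromacs run becomes 6.5 times faster with six times as many CPUs.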
Conclusions and Future Work
- The LIBI environment is a virtual laboratory for bioinformatics based on a high-performance, distributed infrastructure that supports access to large datasets and the execution of single or complex jobs;
- The LIBI platform involves a large set of resources managed by three different grid middleware stacks: gLite, Unicore and Globus. These toolkits provide basic services for managing the resources;
- Built on top of these services, a set of enhanced and novel services for resource and data management has been implemented;
- Several case studies with related results have been presented.
Future work:
- Extend the testbed to more bioinformatics applications;
- Make the system fully operational and open it to external users.
LIBI Team
- M. Mirto, I. Epicoco, S. Fiore, M. Cafaro, A. Negro, D. Tartarini, M. Passante, O. Marra, A. Ferramosca, V. Zara, G. Aloisio - SPACI & University of Salento, Lecce
- G. Cuscela, G. Donvito, G. La Rocca, S. My, G. Selvaggi, G. Maggi - INFN, Padova and Catania Sections, & Dipartimento Interateneo di Fisica di Bari
- G. Scioscia, P. Leo, L. Di Pace - IBM Italy, Bari
- G. Pappada', V. Quinto, M. Berardi - Exhicon srl, Bari
- F. Falciano, A. Emerson, G. Lavorgna, A. Vanni, E. Rossi - CINECA, Bologna
- L. Bartoli, P. Di Lena, P. Fariselli, R. Fronza, L. Margara, L. Montanucci, P. L. Martelli, I. Rossi, M. Vassura, and R. Casadio - Biocomputing Group, CIRB/CIG-Department of Biology, University of Bologna and Bioinformatics Group-Department of Computer Science
- T. Castrignanò - CASPUR, Roma
- D. D'Elia, G. Grillo, F. Licciulli, S. Liuni, A. Gisel, M. Santamaria, S. Vicario, C. Saccone (Coordinator) - ITB-CNR, Bari, Dipartimento di Biochimica e Biologia Molecolare E. Quagliariello, Università di Bari
- A. Anselmo, D. Horner, F. Mignone, G. Pavesi, E. Picardi, V. Piccolo, M. Re, F. Zambelli, G. Pesole - Dipartimento di Chimica Strutturale e Stereochimica Inorganica, Università di Milano & Dipartimento di Scienze Biomolecolari e Biotecnologie, Università di Milano
Reference: The LIBI Grid Platform Developers - M. Mirto et al., The LIBI Grid Platform for Bioinformatics, in Mario Cannataro (Ed.), Handbook of Research on Computational Grid Technologies for Life Sciences, Biomedicine and Healthcare, IGI Global (to appear).