Advances in the LIBI Project, a Virtual Laboratory for Bioinformatics

Maria Mirto, on behalf of Giovanni ALOISIO

- Giovanni ALOISIO (SPACI Consortium & University of Salento, Lecce)
- Giorgio MAGGI (INFN Sezione di Bari)
- Pietro LEO (IBM GBS Innovation Lab, Bari)
- Elda ROSSI (CINECA, Bologna)
- Rita CASADIO (Biocomputing Group & CIRB/CIG-Department of Biology, University of Bologna)
- Graziano PESOLE, Cecilia SACCONE (Project Coordinator) (ITB-CNR, Bari & Dipartimento di Biochimica e Biologia Molecolare E. Quagliariello, Università di Bari & Dipartimento di Scienze Biomolecolari e Biotecnologie, Università di Milano)

4th EGEE User Forum/OGF 25 - Grid Projects and Collaborations, March 5, 2009

Outline
- The LIBI Project
- Architecture
- Grid Infrastructure
- Services and basic frameworks
- Applications Porting
- Conclusions and Future Work

LIBI Project
FIRB LIBI: International Laboratory of BioInformatics (www.libi.it)
Funded by the MIUR (Italian Ministry for Education, University and Research), 2005-2010

Goal: setting up an advanced Bioinformatics and Computational Biology Laboratory, focusing on the central activities of basic and applied research in modern Biology and Biotechnologies.

Biological Activities:
- the construction and maintenance of general genomic, proteomic and transcriptomic databases (e.g. ENSEMBL) as well as specialized databases developed by LIBI partners (e.g. MitoRes, UTRdb, UTRsite, ASPicDB, CSTdb, etc.);
- the design and implementation of new algorithms and software for the analysis of genomes and their expression products and for the prediction of the structure of proteins.

Technological Activities:
- Building a technological framework (Grid PSE) that, by guaranteeing interoperability between different grid middleware (gLite, Unicore, Globus), provides an environment for composing, executing and monitoring complex experiments in Bioinformatics;
- Optimization and porting to the Grid of Bioinformatics applications.

LIBI Partner
Two kinds of actors have equal responsibility in LIBI: Technological and Bioinformatics partners.

Technological RUs:
- University of Salento, SPACI Consortium, Lecce (Prof. Giovanni Aloisio)
- INFN Sections of Bari, Catania and Padova and CNAF-Bologna (Dr. Mirco Mazzucato)
- IBM Italia S.p.A - IBM Innovation Lab, Bari (Dr. Luigi Di Pace)
- CINECA, Bologna (Dr. Elda Rossi)

Bioinformatics RUs:
- CNR Institute for Biomedical Technologies, Bari (Prof. Cecilia Saccone, project coordinator)
- University of Bologna, Biocomputing Group, Bologna (Prof. Rita Casadio)
- University of Milano (Prof. Graziano Pesole)
- Centro di Biomedicina Molecolare, Trieste (Prof. Claudio Schneider)

Associate RUs:
- University of Milano-Bicocca (Prof. Giancarlo Mauri)
- DIBIT-HSR, Milan (Dr. Giovanni Lavorgna)
- TIGEM, Naples (Dr. Sandro Banfi, Dr. Elia Stupka)
- University of Rome - Tor Vergata (Prof. Manuela Helmer-Citterich)
- University of Rome (Prof. Anna Tramontano)
- CASPUR, Rome (Dr. Tiziana Castrignanò)
- Dipartimento Interateneo di Fisica, Bari (Prof. Giorgio Maggi)
- University of Bari (Dip. Informatica, Prof. Donato Malerba; Dip. Biochimica e Biologia Molecolare, Prof. Marcella Attimonelli)
- Exhicon srl, Unità Bioinformatica, Bari (Dr. Graziano Pappadà)

LIBI Partner (map of partner sites): Torino (UNITO), Milano (UNIMI), Bologna (CINECA, UNIBO), Trieste (CBMTS), Roma (CASPUR), Bari (INFN, ITB, IBM), Lecce (SPACI). Legend: Scientific Unit, Technological Unit.

The LIBI Architecture (layered diagram, top to bottom)
- Researchers (team work)
- Web-based Portal
- Bioinformatic Research Applications
- Bioinformatic and Text Analytics Tools and Services; Bioinformatic Workflows Tools and Services
- Virtualization (knowledge & computing warehouse)
- gLite, Unicore, Globus
- Physical Databases & Computing Resources

The LIBI e-infrastructure
- Based on IGI (the Italian Grid Infrastructure) and on EGEE (the European Grid infrastructure)
- Italian sites available for LIBI activity: ~10 INFN-Grid sites enabled for job submission from the LIBI/BIO VOs (~10% of available resources => up to 1500 CPU cores reachable)
- Up to 17000 CPU cores reachable using the biomed VO across the whole EGEE Grid
- 6 RBs at different sites: Catania, 2 at CNAF, Ferrara, Padova, Bari
- DB services:
  - GRelC, AMGA, OGSA-DAI, GDSE servers installed at INFN-Bari
  - High-availability servers that could provide up to 1 TB of data
- LFC (Grid file catalogue): installed at CNAF and at INFN-Bari

The LIBI Services and basic frameworks
- GRB Grid Portal
- Workflow Management System: Editor, Engine (Meta Scheduler)
- Job Submission Tool (JST)
- Bioinformatic Data Federation Service for managing and accessing the LIBI Federated DBs: LIBI federator server, GRelC DAS plug-in for DB2
- Text Analytics 2.0 Framework

Several bioinformatics applications have been deployed and optimized on the LIBI Grid platform, such as BLAST, PSI-BLAST, PatSearch, DNAFan, FT-COMAR, Antihunter, Gromacs and NAMD, MrBayes, Gene Analogous Finder, CSTMiner/Genominer, WeederWeb, RNAProfile, Exalign, ASPIC, PAML and R-www (submission of R jobs through a web form).
http://www.libi.it/biotools

GRB Grid Portal
It is a Grid Portal with the following services:
- Grid Configuration
- Profile
- Credential
- VO Management
- Resource Management
- Database Management
- Software Management
- View Configuration
- Resource Status
- Applications
- Transfer

Job Submission & Monitoring

Meta Scheduler Scenario
Grid Middleware: Interoperability - the Big Issue
(Diagram: jobs labelled A, B, C and D.)

Meta Scheduler Features
- We have developed a Web Service component that supports the submission and monitoring of workflows and of batch, MPI and parameter sweep jobs distributed on gLite-, Globus- and Unicore-based grids;
- Several libraries, plugged into the meta scheduler, provide core functions for job submission/monitoring and for data transfer using the GridFTP protocol;
- Job descriptions use the OGF-compliant JSDL specification;
- Automatic converters translate JSDL into the specific grid languages for submission;
- Support for applications wrapped as Web Services (work in progress).
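The JSDL-to-grid-language conversion mentioned above can be illustrated with a minimal Python sketch. It is not the LIBI converter: the sample job, the function names and the small subset of attributes are assumptions made for illustration. It reads a JSDL POSIXApplication element and renders simplified gLite JDL and Globus RSL strings. The Unicore AJO format is omitted because it is not a plain-text job language.

```python
import xml.etree.ElementTree as ET

# JSDL 1.0 namespaces as published by the OGF.
NS = {
    "jsdl": "http://schemas.ggf.org/jsdl/2005/11/jsdl",
    "posix": "http://schemas.ggf.org/jsdl/2005/11/jsdl-posix",
}

# Hypothetical job description, used only for this illustration.
EXAMPLE_JSDL = """<jsdl:JobDefinition
    xmlns:jsdl="http://schemas.ggf.org/jsdl/2005/11/jsdl"
    xmlns:posix="http://schemas.ggf.org/jsdl/2005/11/jsdl-posix">
  <jsdl:JobDescription>
    <jsdl:Application>
      <posix:POSIXApplication>
        <posix:Executable>/usr/bin/blastall</posix:Executable>
        <posix:Argument>-p</posix:Argument>
        <posix:Argument>blastp</posix:Argument>
        <posix:Output>blast.out</posix:Output>
        <posix:Error>blast.err</posix:Error>
      </posix:POSIXApplication>
    </jsdl:Application>
  </jsdl:JobDescription>
</jsdl:JobDefinition>
"""


def parse_jsdl(text):
    """Extract executable, arguments and output files from a JSDL document."""
    root = ET.fromstring(text)
    app = root.find(".//posix:POSIXApplication", NS)
    return {
        "executable": app.findtext("posix:Executable", namespaces=NS),
        "arguments": [a.text for a in app.findall("posix:Argument", NS)],
        "stdout": app.findtext("posix:Output", namespaces=NS),
        "stderr": app.findtext("posix:Error", namespaces=NS),
    }


def to_jdl(job):
    """Render a simplified gLite JDL description (subset of attributes)."""
    args = " ".join(job["arguments"])
    lines = [
        f'Executable = "{job["executable"]}";',
        f'Arguments = "{args}";',
        f'StdOutput = "{job["stdout"]}";',
        f'StdError = "{job["stderr"]}";',
        f'OutputSandbox = {{"{job["stdout"]}", "{job["stderr"]}"}};',
    ]
    return "[\n  " + "\n  ".join(lines) + "\n]"


def to_rsl(job):
    """Render a simplified Globus RSL string (subset of attributes)."""
    args = " ".join(job["arguments"])
    return (
        f"&(executable={job['executable']})"
        f"(arguments={args})"
        f"(stdout={job['stdout']})"
        f"(stderr={job['stderr']})"
    )


job = parse_jsdl(EXAMPLE_JSDL)
print(to_jdl(job))
print(to_rsl(job))
```

Keeping one neutral job dictionary between the parser and the renderers means a further target language only needs one more rendering function.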

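WFMS in action
(Diagram: a workflow of jobs A, B and C is composed in the Workflow Editor, which is downloaded from the LIBI Portal, and is sent as JSDL with a submission request to the Metascheduler. The diagram labels the translated job descriptions JDL, RSL and AJO, the returned job status messages (DONE for jobs A, B and C), and the components GRB, WMS, LB, WMProxy and NJS (Network Job Supervisor).)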

The Job Submission Tool
- JST is a tool developed (and initially used) by a few expert operators for interactive job submission in bioinformatics grid challenges.
- Recently, JST has been upgraded to provide grid job submission services to existing (non-grid) bioportals.
- The portal does not need to implement any gLite submission procedure: it only has to perform an SQL insert into a DB server.
- The JST daemons take care of submitting and controlling jobs, resubmitting failed ones and collecting the final output.
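The portal/daemon split described above can be sketched in Python as follows. This is an illustration only, not the JST code: the table layout, the status values, the retry limit and the submit callback are all assumptions, and sqlite3 stands in for the DB server.

```python
import sqlite3

MAX_ATTEMPTS = 3  # assumed retry limit, not taken from the slides


def create_queue(conn):
    """Create the hypothetical job table shared by portal and daemon."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS jobs (
               id INTEGER PRIMARY KEY,
               command TEXT NOT NULL,
               status TEXT NOT NULL DEFAULT 'PENDING',
               attempts INTEGER NOT NULL DEFAULT 0,
               output TEXT)"""
    )


def portal_enqueue(conn, command):
    """All the portal does: a single SQL INSERT describing the job."""
    cur = conn.execute("INSERT INTO jobs (command) VALUES (?)", (command,))
    conn.commit()
    return cur.lastrowid


def daemon_pass(conn, submit):
    """One polling pass: submit pending jobs, retry failures, store output.

    `submit` is a placeholder for the real grid submission; it returns the
    job output or raises RuntimeError when the job fails.
    """
    rows = conn.execute(
        "SELECT id, command, attempts FROM jobs WHERE status = 'PENDING'"
    ).fetchall()
    for job_id, command, attempts in rows:
        attempts += 1
        try:
            output = submit(command)
        except RuntimeError:
            # Leave the job pending for resubmission until the limit is hit.
            status = "FAILED" if attempts >= MAX_ATTEMPTS else "PENDING"
            conn.execute(
                "UPDATE jobs SET status = ?, attempts = ? WHERE id = ?",
                (status, attempts, job_id),
            )
        else:
            conn.execute(
                "UPDATE jobs SET status = 'DONE', attempts = ?, output = ? "
                "WHERE id = ?",
                (attempts, output, job_id),
            )
    conn.commit()
```

With this split the portal never touches grid middleware; it only needs write access to the job table, while the daemon owns submission and retries.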

PORTALs-JST Interaction
(Diagram: in the classical case, execution is triggered on limited resources, with few resources available for execution. Alternatively, the portal communicates with JST to use the Grid, and JST performs the submission on the Grid.)

Data Federation
The LIBI Data Federation Service provides a Federated Schema Assimilation Model that can be accessed through a standard and transparent SQL interface and is exposed to the Grid through GRelC.

It wraps, in real time, a number of local and remote biological databases that store information in their original formats, including relational data, Web Services, flat files, EMBL/FASTA formats, etc.

Examples of benefits provided by the federation layer for access to EMBL/FASTA formats:
- A synoptic view of EMBL and FASTA DBs, federated together with other, heterogeneous DBs
- A fast and efficient retrieval system for entries in EMBL/FASTA format from large DBs (sub-second response times)
- Customizable indexing of both textual and non-textual fields of the EMBL entries (with or without tokenization, etc.), which also enables mining features
- Support for multi-platform and geographically distributed deployment

Wrappers (diagram):
- Relational Wrapper: MitoRes, UTRSite, UTREF, HmtDB
- Entrez Wrapper: PubMed, GenBank, OMIM
- EMBL Wrapper, shown with a sample entry (abridged on the slide):

  ID AB000263 standard; RNA; PRI; 368 BP.
  XX
  AC AB000263;
  XX
  DE Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.
  XX
  SQ Sequence 368 BP;
  acaagatgcc attgtccccc ggcctcctgc tgctgctgct ctccggggcc acggccaccg 60
  ctgccctgcc cctggagggt ggccccaccg gccgagacag cgagcatatg caggaagcgg 120
  agaccttctc ctcctgcaaa taaaacctca cccatgaatg ctcacgcaag tttaattaca 360
  gacctgaa 368
  //

Improved response times (chart): UNIPROT, EMBL_CDS and EMBL in EMBL/FASTA formats, with the old index type versus the new index type.
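To make the idea of indexing EMBL flat-file entries concrete, here is a minimal Python sketch. It is not the LIBI EMBL wrapper: the function names and the in-memory dictionary index are assumptions, and the sample is cut down from the entry shown on the slide (only the first sequence line is kept, so the parsed sequence is shorter than the 368 BP declared in the SQ line).

```python
# Abridged sample, based on the entry shown on the slide.
SAMPLE = """ID   AB000263 standard; RNA; PRI; 368 BP.
XX
AC   AB000263;
XX
DE   Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.
XX
SQ   Sequence 368 BP;
     acaagatgcc attgtccccc ggcctcctgc tgctgctgct ctccggggcc acggccaccg        60
//
"""


def parse_embl(text):
    """Parse one EMBL flat-file entry into a small dictionary."""
    entry = {"id": None, "accessions": [], "description": "", "sequence": ""}
    description, sequence = [], []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-entry marker
            break
        if in_sequence:
            # Sequence lines hold groups of bases plus a running base count.
            sequence.extend(tok for tok in line.split() if tok.isalpha())
            continue
        code, _, rest = line.partition(" ")
        rest = rest.strip()
        if code == "ID":
            entry["id"] = rest.split()[0].rstrip(";")
        elif code == "AC":
            entry["accessions"].extend(
                acc.strip() for acc in rest.split(";") if acc.strip()
            )
        elif code == "DE":
            description.append(rest)
        elif code == "SQ":
            in_sequence = True
    entry["description"] = " ".join(description)
    entry["sequence"] = "".join(sequence)
    return entry


def build_index(entries):
    """Map every accession number to its parsed entry for direct lookup."""
    index = {}
    for entry in entries:
        for accession in entry["accessions"]:
            index[accession] = entry
    return index


index = build_index([parse_embl(SAMPLE)])
print(index["AB000263"]["description"])
```

A production index would live in a database rather than in memory, but the lookup-by-accession pattern is the same.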

Applications porting
- GAF: Gene Analogous Finder. 6000 jobs submitted on the Grid; more than 200 WNs used.
- ASPic: Alternative Splicing Prediction. One complete genome (mouse) analyzed in 3 days.
- CSTminer: conserved tracts identification. 2 months for the human-mouse genomes (FULL comparison), with ~900 WNs used; 1 day for one genome comparison using the optimized algorithm, developed using the lessons learned in the first run; 1 week for a FULL comparison of the Vitis genome.
- BLAST: large-scale genome comparison using the BLAST program. Dataset: 599 complete genomes (2,624,555 protein sequences in FASTA format). The complete comparison of all the genomes against all, carried out on the Grid, took about 1 week.
- PAML: maximum likelihood analysis using approximate derivatives. 6690 cases evaluated in 36 hours.
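The slides do not say how the all-against-all BLAST comparison was split into grid jobs. As a hedged illustration of one possible partitioning, the Python sketch below enumerates every ordered genome pair and groups the pairs into fixed-size jobs. The genome names, the job size and the function name are assumptions.

```python
from itertools import islice, product


def all_vs_all_jobs(genomes, pairs_per_job):
    """Yield lists of (query, subject) genome pairs, one list per grid job."""
    pairs = product(genomes, repeat=2)  # every genome against every genome
    while True:
        chunk = list(islice(pairs, pairs_per_job))
        if not chunk:
            return
        yield chunk


# Hypothetical genome identifiers standing in for the 599 complete genomes.
genomes = [f"genome_{i:03d}" for i in range(599)]
jobs = list(all_vs_all_jobs(genomes, pairs_per_job=1000))
print(len(jobs), sum(len(job) for job in jobs))
```

With 599 genomes there are 599 x 599 = 358,801 ordered pairs, so a job size of 1000 pairs gives 359 jobs, the last one holding 801 pairs.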

Applications porting
- FT-COMAR: Protein Tertiary Structure Prediction. 21,582,680 jobs executed in 5 days.
- MrBayes: Bayesian inference of phylogeny. Over 7200 different runs; ~5 CPU-years; ~20 days of run time on the EGEE infrastructure.
- Emboss vrnalfold: applications for molecular sequence analysis. Over 0.5 M sequences analyzed.
- MAFFT: multiple sequence alignment. More than 5000 sequences aligned.
- Solexa Illumina: sequence clustering. 2.5 M sequences clustered.

Applications porting
- PatSearch: retrieving patterns in a sequence. Used to identify and annotate the presence of regulatory elements in mRNA untranslated regions, collected in the UTRsite and UTRdb databases, respectively; 150,000 jobs executed in 1 day.
- PSI-BLAST: multiple sequence alignment. 70,000 jobs executed in 65 hours using 128 1.4 GHz Itanium 2 processors; 96 days required on a single processor.
- Gromacs: protein dynamics simulation. Short simulations of a small protein, bovine β-lactoglobulin, in a solvent (water/urea + NaCl) at 300 K and constant pressure (1 bar); run times of 4 and 26 s using 12 and 2 CPUs, respectively, on Itanium 2 processors.
- Antihunter: identification of expressed sequence tag (EST) antisense transcripts from BLAST output.
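The PSI-BLAST figures above can be turned into a rough speedup estimate. The short Python sketch below assumes that the 96-day single-processor time and the 65-hour, 128-processor time refer to the same workload, which the slide implies but does not state explicitly. The function name is illustrative.

```python
def speedup_and_efficiency(serial_hours, parallel_hours, processors):
    """Return (speedup, parallel efficiency) for one workload."""
    speedup = serial_hours / parallel_hours
    return speedup, speedup / processors


# 96 days on one processor versus 65 hours on 128 processors.
speedup, efficiency = speedup_and_efficiency(96 * 24, 65, 128)
print(f"speedup ~ {speedup:.1f}x, efficiency ~ {efficiency:.0%}")
```

Under that assumption the run is about 35 times faster than the serial execution, which corresponds to a parallel efficiency of roughly 28% on 128 processors.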

Conclusions and Future Work
- The LIBI environment is a virtual laboratory for bioinformatics based on a high-performance, distributed infrastructure supporting access to large datasets and the execution of single or complex jobs;
- The LIBI platform involves a large set of resources belonging to three different grid middleware stacks: gLite, Unicore and Globus. These toolkits provide basic services for managing the resources;
- Built on top of these services, a set of enhanced and novel services for resource and data management has been implemented;
- Several case studies with related results have been presented.

Future work:
- Test more bioinformatics applications on the testbed;
- Make the system fully operational and open it to external users.

LIBI Team
- M. Mirto, I. Epicoco, S. Fiore, M. Cafaro, A. Negro, D. Tartarini, M. Passante, O. Marra, A. Ferramosca, V. Zara, G. Aloisio: SPACI & University of Salento, Lecce
- G. Cuscela, G. Donvito, G. La Rocca, S. My, G. Selvaggi, G. Maggi: INFN, Padova and Catania Sections, & Dipartimento Interateneo di Fisica di Bari
- G. Scioscia, P. Leo, L. Di Pace: IBM Italy, Bari
- G. Pappadà, V. Quinto, M. Berardi: Exhicon srl, Bari
- F. Falciano, A. Emerson, G. Lavorgna, A. Vanni, E. Rossi: CINECA, Bologna
- L. Bartoli, P. Di Lena, P. Fariselli, R. Fronza, L. Margara, L. Montanucci, P. L. Martelli, I. Rossi, M. Vassura and R. Casadio: Biocomputing Group, CIRB/CIG-Department of Biology, University of Bologna and Bioinformatics Group-Department of Computer Science
- T. Castrignanò: CASPUR, Roma
- D. D'Elia, G. Grillo, F. Licciulli, S. Liuni, A. Gisel, M. Santamaria, S. Vicario, C. Saccone (Coordinator): ITB-CNR, Bari, Dipartimento di Biochimica e Biologia Molecolare E. Quagliariello, Università di Bari
- A. Anselmo, D. Horner, F. Mignone, G. Pavesi, E. Picardi, V. Piccolo, M. Re, F. Zambelli, G. Pesole: Dipartimento di Chimica Strutturale e Stereochimica Inorganica, Università di Milano & Dipartimento di Scienze Biomolecolari e Biotecnologie, Università di Milano

Reference: The LIBI Grid Platform Developers - M. Mirto et al., The LIBI Grid Platform for Bioinformatics, in Mario Cannataro (Ed.), Handbook of Research on Computational Grid Technologies for Life Sciences, Biomedicine and Healthcare, IGI Global (to appear).