IFB's e-infrastructure
Christophe Blanchet
Institut Français de Bioinformatique (IFB) - French Institute of Bioinformatics - ELIXIR-FR
CNRS UMS3601 - Gif-sur-Yvette - France
Life Sciences Platforms in France
National platforms (GIS IBISA), number per domain:
- Cellular imaging: 19
- Genomics, transcriptomics: 16
- Proteomics: 13
- Structural biology, biophysics: 11
[Map: French NGS platforms. Legend: NGS / IMG / PRO = biological platform (genomics, imaging, proteomics...); BI = bioinformatics center; C = cloud resources; scientists. Source: omicsmaps.com]
Regional centers distribute the computing and storage load and provide closer interaction with scientists.
National and European Infrastructures FR / EU
IFB e-infrastructure Team
Staff
- Christophe Blanchet (CNRS)
- Marie Grosjean - Data and tools integration (fixed-term contract, 2014-Oct/2016-Mar)
- Mohamed Bedri - Cloud technology (fixed-term contract, 2014-Dec/2016-May)
- Fedi Ben Ali - Information technology (fixed-term contract, 2015-Feb/2016-Jul)
- Xxx Xxx - Data and tools integration (fixed-term contract, to be hired)
Services
- Provide scientists with bioinformatics resources, data and tools, as cloud appliances
  - Provide user support
  - Provide developer support to integrate their tools/dbs
- Deploy and operate IFB's national IT infrastructure as a cloud, in collaboration with the CNRS IDRIS SC center teams
  - Evaluate and deploy cloud technologies
- Interact with the communities (academic and industrial)
  - Collaborate with IFB's partners, national and European infrastructures, scientific and technological projects
  - Animate and train the national community: GRISBI workgroup, tutorials, thematic school CumuloNumBIO 2015
http://www.france-bioinformatique.fr/?q=fr/core/cellule-infrastructure
Deploy and operate IFB's national IT infrastructure as a cloud
Available IT resources
Distributed infrastructure
- a national IFB-core resource (see table), hosted at the CNRS IDRIS SC center (Paris)
- plus regional resources: 6 regional bioinformatics centers (IFB-GO, IFB-SO, APLIBIO, IFB-GS, IFB-NE, PRABI)
- 11,000 cores and 6 PB in total, but spread over more than 20 platforms
Goal: create a federation of clouds for life sciences

IFB-core   # Compute cores   # TB storage   # TB RAM   Max VM size   Technology   Location
Pilot      200               50             2          40c 256GB     StratusLab   CNRS-IDRIS, Paris
2015       3,000             500            -          96c 1TB       StratusLab   CNRS-IDRIS, Paris
2016       10,000            2,000          -          96c 2TB       StratusLab   CNRS-IDRIS, Paris
Cloud?
Essential characteristics
- On-demand self-service: no human intervention
- Broad network access: fast, reliable remote access
- Rapid elasticity: scale based on app. needs
- Resource pooling: multi-tenant sharing
- Measured service: direct or indirect economic model with measured use
Deployment models
- Private: single administrative domain, limited number of users
- Community: different administrative domains with common interests & proc.
- Public: people outside of the institute's administrative domain
- Hybrid: federation via combination of other deployment models
Service models
- Software as a Service (SaaS): direct (scalable) hosting of end-user applications
- Platform as a Service (PaaS): framework and infrastructure for creating web applications
- Infrastructure as a Service (IaaS): access to remote virtual machines with root access
http://csrc.nist.gov/publications/nistpubs/800-145/sp800-145.pdf
IFB-core's cloud
[Architecture diagram: scientists reach the frontend web portal over RENATER (10 Gb); service layers are SaaS, PaaS (NGS, imaging, statistics, ...) and IaaS; master and worker VMs with a shared FS launch jobs on the virtualization layer; pdisk storage is served over iSCSI on 10 Gb Ethernet.]
Cloud hypervisors
- std nodes: 32c 128GB
- bigmem nodes: 40c 256GB
Hosted at the IDRIS CNRS SC center
Storage for biological data
- Upload your data: sftp/http/s3, with CLI (scp, sftp) or GUI (Cyberduck, Transmit, FileZilla, ...)
- Public data sources: genomes, EMBL, PDB, UniProt, PROSITE, shared read-only over NFS with tools such as BLAST, Clustal, etc.
- PaaS / IaaS: launch jobs over ssh on a master & storage VM and worker VMs with a shared FS (examples shown: ARIA, CNS)
- Identity management: each user (j. doe, e. martin, you, ...) has his/her own virtual disks for user data
- Get your results: sftp/http/s3, with CLI (scp, sftp) or GUI (Cyberduck, Transmit, FileZilla, ...)
[Diagram: portal and bioinformatics cloud, from data upload to result download]
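As an illustration of the command-line route (scp) for uploading data and getting results back, here is a minimal sketch that only builds the scp command lines. The host name, the user name and the /data paths are hypothetical placeholders, not values from the IFB cloud; substitute those of your own virtual machine.

```python
import shlex

# Hypothetical values: replace with the address and account of your own VM.
VM_HOST = "vm.example.org"
VM_USER = "jdoe"


def scp_upload(local_path, remote_dir, host=VM_HOST, user=VM_USER):
    """Return the scp argument list that copies a local file to the VM."""
    return ["scp", local_path, f"{user}@{host}:{remote_dir}"]


def scp_download(remote_path, local_dir, host=VM_HOST, user=VM_USER):
    """Return the scp argument list that copies a file from the VM."""
    return ["scp", f"{user}@{host}:{remote_path}", local_dir]


if __name__ == "__main__":
    # Print the commands instead of running them, so nothing is transferred.
    print(shlex.join(scp_upload("reads.fastq.gz", "/data/")))
    print(shlex.join(scp_download("/data/results.tar.gz", ".")))
```

The argument lists can be passed to subprocess.run once the placeholders point to a real machine.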
A cloud driven through a web dashboard http://cloud.france-bioinformatique.fr/cloud
Moving VMs vs Data
[Diagram: biological platforms (NGS, IMG, PRO: genomics, imaging, proteomics...), bioinformatics centers (BI), cloud resources (C) and scientists; NGS, PRO and IMG data sit at the platforms, while VMs come from the IFB life sciences marketplace & VMs repository.]
Provide scientists with bioinformatics resources, data and tools, as cloud appliances
Make an inventory of national resources
Make an inventory of the resources provided by IFB's platforms
- Data: through the federation of existing BioMAJ servers
- Tools: with a service registry to be set up in IFB-core; current developments are based on a graph-db model (Neo4J) and an ontology (EDAM)
Current state
- Information stored in text files, Web & wiki, DBMS, ...
- Large numbers (10s-100s)
- From 21 platforms and more labs
Goal: provide the most-used resources in IFB's cloud
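To make the graph idea behind the service registry concrete, here is a minimal in-memory sketch in plain Python. It is not the Neo4J-based registry itself: the class, the relation names, the platform names and the annotation labels are all hypothetical, and the labels are free-text stand-ins rather than verified EDAM terms.

```python
class ResourceGraph:
    """Tiny labelled graph: (source, relation) -> set of targets."""

    def __init__(self):
        self._out = {}
        self._in = {}

    def add(self, source, relation, target):
        """Record the edge source -[relation]-> target."""
        self._out.setdefault((source, relation), set()).add(target)
        self._in.setdefault((target, relation), set()).add(source)

    def targets(self, source, relation):
        """Nodes reached from `source` through `relation`, sorted."""
        return sorted(self._out.get((source, relation), ()))

    def sources(self, target, relation):
        """Nodes that point to `target` through `relation`, sorted."""
        return sorted(self._in.get((target, relation), ()))


def rank_by_providers(graph, tools):
    """Rank tools by how many platforms provide them (most provided first)."""
    counts = [(tool, len(graph.targets(tool, "PROVIDED_BY"))) for tool in tools]
    return sorted(counts, key=lambda pair: (-pair[1], pair[0]))


# Illustrative content only: which platform provides which tool is made up.
registry = ResourceGraph()
registry.add("BLAST", "PROVIDED_BY", "platform-A")
registry.add("BLAST", "PROVIDED_BY", "platform-B")
registry.add("ClustalW", "PROVIDED_BY", "platform-A")
registry.add("BLAST", "HAS_OPERATION", "sequence similarity search")
registry.add("ClustalW", "HAS_OPERATION", "multiple sequence alignment")

if __name__ == "__main__":
    print(rank_by_providers(registry, ["BLAST", "ClustalW"]))
    print(registry.sources("multiple sequence alignment", "HAS_OPERATION"))
```

Ranking tools by the number of platforms that provide them is one simple way to approximate "most-used resources"; real usage statistics would be a better criterion where they exist.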
Cloud reference databases repository
[Diagram: in the IFB cloud, a BioMAJ VM manages the reference databases on shared storage, which is made available to all virtual machines.]
Create bioinformatics cloud appliances
Integration of bioinformatics tools
- Bioinformatics appliances are pre-defined virtual machines
  - small: a few GB, easy to convert to most virtualization formats
  - installed and pre-configured with bioinformatics tools, e.g. BLAST, ClustalW, ARIA, MEME, HMMer, TopHat, BWA, Samtools, RSAT, etc.
- Referenced in the IFB marketplace, a catalog of VM templates devoted to bioinformatics tools
Create new cloud services
[Diagram: tools (BLAST, FastA, SSearch, ClustalW2, Clustal Omega, Muscle, HMMer, BWA, TopHat, samtools, fastqc, OMSSA, X!Tandem, PeptideShaker, ARIA, Galaxy, R) are installed on a Linux system and published in the Bioinformatics Marketplace as virtual machines: Structures, Sequences, Proteomics, Galaxy...]
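As an illustration of converting an appliance image between virtualization formats, the sketch below builds a qemu-img convert command line. The slides do not say which tool or image format IFB uses: qemu-img, the qcow2 input format and the file name are assumptions made for this example.

```python
from pathlib import Path

# A few output formats accepted by `qemu-img convert -O`, with a file suffix.
SUFFIX = {"qcow2": ".qcow2", "vmdk": ".vmdk", "vdi": ".vdi", "raw": ".img"}


def convert_command(image, out_format, in_format="qcow2"):
    """Return the qemu-img argument list converting `image` to `out_format`."""
    if out_format not in SUFFIX:
        raise ValueError(f"unsupported output format: {out_format}")
    if out_format == in_format:
        raise ValueError("input and output formats are identical")
    # Write the converted image next to the original, with the new suffix.
    target = Path(image).with_suffix(SUFFIX[out_format])
    return ["qemu-img", "convert", "-f", in_format, "-O", out_format,
            str(image), str(target)]


if __name__ == "__main__":
    # Hypothetical appliance file name; the command is printed, not executed.
    print(" ".join(convert_command("blast-appliance.qcow2", "vmdk")))
```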
Current bioinformatics appliances
- Scientific apps (CLI, virtual desktop, Web): Galaxy MODAL, Proteomics, Galaxy Imaging, Galaxy AVIESAN 2013, RSAT, PhyML, RSAT mini, R statistics, Aria
- Utilities and data mgmt: biocompute Node, biodata BioMaj, BlobSeer, biodata NFS, Cassandra, biohadoop
- Base OS: CentOS, Ubuntu
Browse the appliances and run yours!
[Screenshot: IFB Marketplace, with categories Proteomics, Sequences, Galaxy, Structures, ...]
Use case: cloud Galaxy portal
The Galaxy portal is widely used in the community
- to analyse NGS data (mainly, but not only)
- connected to community knowledge: data and indexes, tools, workflows
Cloud advantages
- The user is administrator of his/her own Galaxy instance and can install data and tools
- Workflows and results are preserved in cloud storage
- Eases the integration of monthly updates and new tools
The cloud allows different appliances to be built from the same base
- a base one with common tools for NGS
- specific ones for a domain or a set of tools, e.g. Galaxy-MODAL (MOdels for Data Analysis and Learning)
- for training: a special appliance with dedicated datasets, tools or workflows, e.g. for the French AVIESAN school 2013
Use case: RSAT
A specialized software suite for the analysis of noncoding sequences
- motif discovery in promoters of co-expressed genes
- ChIP-seq analysis
- evolutionarily conserved motifs (phylogenetic footprints)
Contact: J. van Helden (TAGC)
Used for the ECCB'14 tutorial T01
RSAT offers a series of tools dedicated to the detection of regulatory signals in noncoding sequences. Starting from a list of genes of interest, you can
- retrieve the upstream sequences over a desired distance,
- discover putative regulatory signals,
- search the matching positions for these signals in your original dataset or in whole genomes,
- display the results graphically in the form of a feature map.
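To illustrate the idea of searching the matching positions of a signal in a set of upstream sequences, here is a toy sketch in plain Python. It is not RSAT code and does not use any RSAT interface: it only reports exact occurrences of a motif on both strands, whereas real tools rely on richer models. The function names, gene names and sequences are made up for the example.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def match_positions(upstream, motif):
    """List (gene, strand, start) for each exact occurrence of `motif`.

    `upstream` maps a gene name to its upstream sequence; `start` is the
    0-based offset of the match on the given (forward-oriented) sequence.
    """
    motif = motif.upper()
    patterns = [("+", motif)]
    rc = reverse_complement(motif)
    if rc != motif:  # a palindromic motif would otherwise be counted twice
        patterns.append(("-", rc))
    hits = []
    for gene, seq in upstream.items():
        seq = seq.upper()
        for strand, pattern in patterns:
            start = seq.find(pattern)
            while start != -1:
                hits.append((gene, strand, start))
                start = seq.find(pattern, start + 1)  # allow overlapping hits
    return hits


if __name__ == "__main__":
    # Made-up sequences, far shorter than real upstream regions.
    sequences = {"gene1": "CCGATAAGG", "gene2": "AATTATCAA"}
    for hit in match_positions(sequences, "GATAA"):
        print(hit)
```

The resulting list of (gene, strand, start) tuples is the kind of data a feature map would display along each sequence.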
Use case: proteomics virtual desktop
Motivation
- Collaboration with a mass spectrometry platform
- Running out of space on their local resources
Protein identification tools
- Experimental mass data
- Reference databases: nr, Swiss-Prot
- Reference screening tools: OMSSA, X!Tandem
User interface
- Remote virtual desktop (NX)
- Reference GUI: PeptideShaker
Interacting with the communities: animate the community, train people and participate in projects
GRISBI
http://www.france-bioinformatique.fr/?q=fr/groupe-de-travail/grisbi
Groupe de réflexion sur les InfraStructures BioInformatiques (working group on bioinformatics infrastructures)
Technological coordination
- between IFB-core, regional centers and platforms
- identification of needs, choice of directions, follow-up of developments
- transfer of skills in cloud technology
Links with technology partners
- IDRIS, IDGC, mésocentres (regional computing centers), StratusLab, ELIXIR, ...
[Chart: participants per GRISBI meeting, broken down by IFB-core, APLIBIO, PRABI, IFB-GO, IFB-SO, IFB-GS, IFB-NE and partners; total participants in parentheses: Grisbi-22 (15), 2012-06; Grisbi-23 (15), 2012-12; Grisbi-24 (22), 2013-11; Grisbi-25 (22), 2014-06; Grisbi-26 (28)]
Tutorials "Cloud pour la Biologie" (Cloud for Biology)
2 sessions in 2014
- 19 June 2014: 23 participants, IFB-core, Gif-sur-Yvette
- 6 November 2014: 20 participants, GenOuest, IFB-GO, Rennes
Introductory training on using cloud computing to analyse biological data with the usual bioinformatics tools. It covers the general concepts and principles of the cloud, and describes its tools, its uses and its value for bioinformatics. A hands-on part uses the cloud concerned by the session (IFB-core or Genocloud).
Trainers
- Christophe BLANCHET (IFB-core)
- Olivier COLLIN (GenOuest)
- Jean-François GIBRAT (IFB-core)
- Marie GROSJEAN (IFB-core)
- Charles LOOMIS (LAL)
- Cyril MONJEAUD (GenOuest)
- Yvan LE BRAS (GenOuest)
- Olivier SALLOU (GenOuest)
Programme
09h00-09h30   Welcome of participants
09h30-10h00   Presentation of IFB (Salle Markov)
10h00-11h00   Cloud computing (Salle Markov)
11h00-11h30   Coffee break (Salle Markov)
11h30-13h00   Cloud for biology (Salle Markov)
13h00-14h00   Lunch (Hall Amphi G)
14h00-15h30   Examples of cloud applications (Salle Markov)
15h30-16h00   Coffee break
School Cumulo NumBIO 2015
Objectives
Bring together, as broadly as possible, scientists and engineers from the life sciences, with their needs for large-scale analysis of heterogeneous biological data, and scientists and engineers from computer science, with their research developments and the existing cloud solutions for integrating software and data.
- 1 to 5 June 2015; venue: to be confirmed
- Expected attendance: 100 people
- Supported by CNRS and two of its institutes: INSB and INS2I
Sessions and chairs
- Needs of biology, with examples from genomics and proteomics: C. Bruley, C. Gaspin, T. Grange, C. Médigue, C. Thermes
- State of the art of bioinformatics infrastructures: J.F. Gibrat, H. Touzet
- Integration of data and tools, workflows and provenance: S. Cohen-Boulakia, C. Froidevaux
- Cloud: academic production infrastructures: V. Breton, C. Loomis
- Cloud: research developments: M. Daydé, F. Desprez
- Data management: G. Antoniu
- The IFB national infrastructure: C. Blanchet, O. Collin
- Presentations by participants (prior selection): -
Partners
ELIXIR
- Workstreams: Tools interoperability and service registry; ELIXIR technical services
- Technical Core Group and task forces (cloud, services registry, AAI)
- Excelerate project, work packages: Tools Interoperability and Service Registry; Technical Services
PIA BIODATACLOUD
- WP-2.3: IT needs
EU H2020 Cyclone
- Bioinformatics applications
National infrastructures
- ProFi, France Génomique, MetaboHub
Conclusion - Current usage of IFB's cloud
Scientific production - 100 users
- open to members of IFB (at least with standard user level)
- open to partners, academic and industrial, infrastructures and projects, e.g. BioDataCloud, ProFi, MetaboHub
- resource allocation according to scientific and financial criteria
Training - 60 users
- short- and medium-term accounts
- two tutorials "Cloud pour la Biologie" in June and Nov. 2014
- tutorial at ECCB'14 about "Analysis of Cis-Regulatory Motifs from High-Throughput Sequence Sets", in Sept. 2014
- Master 2 hands-on "Méthodes bioinformatiques pour la cisrégulation" (AMU SBBCU16L), Oct.-Dec. 2014
Perspectives
Create more bioinformatics appliances
- built by the experts of the different life sciences domains
- to make them available to the community
- appliances currently in progress: BioDataCloud-RNAseq, ProFi, SymBioWatch, Clinical NGS for cancerology, REPET, TriAnnot, BWA-mpi
IFB supports different domain-specific developments
- First round: Microbial Bioinformatics, Evolutionary Bioinformatics, Plant Bioinformatics, Structural Biology, NGS data processing
- Future call for projects
Technical pilots
- Interoperability of appliances on different cloud infrastructures
- Registry of distributed multi-cloud datasets
- Live remote cloud processing of sequencing data
Questions?
http://www.france-bioinformatique.fr
Acknowledgments
- Clément Gauthey (CNRS IDRIS, formerly IDB-IBCP)
- Developers who integrated their tools as IFB appliances: Samuel Blanck (Inria Lille), Jacques van Helden (TAGC), You?
- IFB members
- StratusLab members
- IFB is funded by the French program PIA INBS 2012