Une e-infrastructure nationale en bioinformatique



Similar documents
Le cloud IFB et son instance Galaxy

Le cloud IFB et son instance Galaxy

Cloud pour la Bioinformatique

Sequencing data. And other experimental data. EMBL-EBI data resources growth

Bioinformatique sur Cloud Cas d usage avec le portail Galaxy

Cloud Ready for Bioinformatics?

IFB s e-infrastructure

A curated Domain centric shared Docker registry linked to the Galaxy toolshed

StratusLab project. Standards, Interoperability and Asset Exploitation. Vangelis Floros, GRNET

SURFsara HPC Cloud Workshop

SURFsara HPC Cloud Workshop

Open Cloud System. (Integration of Eucalyptus, Hadoop and AppScale into deployment of University Private Cloud)

UGENE Quick Start Guide

Towards a galaxy.prabi.fr

Alternative Deployment Models for Cloud Computing in HPC Applications. Society of HPC Professionals November 9, 2011 Steve Hebert, Nimbix

icer Bioinformatics Support Fall 2011

DATA MANAGEMENT PLAN IN THE REAL LIFE SCIENCES

OpenNebula Open Souce Solution for DC Virtualization. C12G Labs. Online Webinar

Getting Started Hacking on OpenNebula

Euro-BioImaging European Research Infrastructure for Imaging Technologies in Biological and Biomedical Sciences

Introduction to Cloud Computing

Big Data in BioMedical Sciences. Steven Newhouse, Head of Technical Services, EMBL-EBI

Eoulsan Analyse du séquençage à haut débit dans le cloud et sur la grille

Hadoopizer : a cloud environment for bioinformatics data analysis

OpenNebula Open Souce Solution for DC Virtualization

Three data delivery cases for EMBL- EBI s Embassy. Guy Cochrane

OpenNebula Open Souce Solution for DC Virtualization

Big Data and Cloud Computing for GHRSST

The OpenNebula Cloud Platform for Data Center Virtualization

Cloud Computing Architecture with OpenNebula HPC Cloud Use Cases

Building a BI Solution in the Cloud

Steven Newhouse, Head of Technical Services

Application-Centric WLAN. Rob Mellencamp

Virtualization & Cloud Computing (2W-VnCC)

Data Centers and Cloud Computing. Data Centers

Bioinformatics Grid - Enabled Tools For Biologists.

Intel IT Cloud Extending OpenStack* IaaS with Cloud Foundry* PaaS

Deploying Business Virtual Appliances on Open Source Cloud Computing

Assignment # 1 (Cloud Computing Security)

SOFTWARE DEFINED SOLUTIONS JEUDI 19 NOVEMBRE Nicolas EHRMAN Sr Presales SDS

How To Run A Cloud Server On A Server Farm (Cloud)

Cloud Computing Solutions for Genomics Across Geographic, Institutional and Economic Barriers

New solutions for Big Data Analysis and Visualization

Maquette DB2 PureScale

HPC Cloud. Focus on your research. Floris Sluiter Project leader SARA

Cloud BioLinux: Pre-configured and On-demand Bioinformatics Computing for the Genomics Community

-> Integration of MAPHiTS in Galaxy

Boas Betzler. Planet. Globally Distributed IaaS Platform Examples AWS and SoftLayer. November 9, IBM Corporation

Solution for private cloud computing

Scientific and Technical Applications as a Service in the Cloud

e-biogenouest : The Tools

Writing & Running Pipelines on the Open Grid Engine using QMake. Wibowo Arindrarto DTLS Focus Meeting

Options in Open Source Virtualization and Cloud Computing. Andrew Hadinyoto Republic Polytechnic

Building Storage Service in a Private Cloud

Building Bioinformatics Capacity in Africa. Nicky Mulder CBIO Group, UCT

Open Source Technologies on Microsoft Azure

Final Report on StratusLab Adoption

Cloud BioLinux: Pre-configured and On-demand Bioinformatics Computing for the Genomics Community

Data Centers and Cloud Computing. Data Centers. MGHPCC Data Center. Inside a Data Center

Genomic Applications on Cray supercomputers: Next Generation Sequencing Workflow. Barry Bolding. Cray Inc Seattle, WA

Red Hat enterprise virtualization 3.0 feature comparison

ENABLING DATA TRANSFER MANAGEMENT AND SHARING IN THE ERA OF GENOMIC MEDICINE. October 2013

OpenNebula Leading Innovation in Cloud Computing Management

System Management Tool for OpenPOWER

Build Your Own Private Cloud with Ezilla and Haduzilla

vnebula Cloud. Made Easy. Introducing vnebula from Stream Networks. A simple, self-service cloud portal for our partner community.

SGI. High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems. January, Abstract. Haruna Cofer*, PhD

NASA's Strategy and Activities in Server Side Analytics

<Insert Picture Here> Private Cloud with Fusion Middleware

Mining Large Datasets: Case of Mining Graph Data in the Cloud

Nebula Cloud Computing Project: Background, Technology, Operations, Challenges, and Status

Introducing. Markus Erlacher Technical Solution Professional Microsoft Switzerland

OpenNebula Cloud Platform for Data Center Virtualization

Automating Big Data Benchmarking for Different Architectures with ALOJA

Open Source Cloud Computing Management with OpenNebula

How To Set Up Egnyte For Netapp Sync For Netapp

1 Copyright 2011, Oracle and/or its affiliates. All rights reserved. Insert Information Protection Policy Classification from Slide 7

Managing and Conducting Biomedical Research on the Cloud Prasad Patil

Google

File S1: Supplementary Information of CloudDOE

ArcGIS for Server: In the Cloud

Experiences and challenges in the development of the JASMIN cloud service for the environmental science community

Transcription:

Une e-infrastructure nationale en bioinformatique Christophe BLANCHET Institut Français de Bioinformatique - IFB French Institute of Bioinformatics - ELIXIR-FR CNRS UMS3601 - Gif-sur-Yvette - FRANCE JDEV 2015 2 juillet 2015, Bordeaux

History Since 2004, ReNaBi is the National Network of Bioinformatics platforms with an IBiSA label (Infrastructures in Biology, Health and Agronomy) In 2010, call of proposals Infrastructures in Biology and Health from the Investments for the Future initiative. Project ReNaBi-IFB accepted in 2012 and endowed with 20m Other national infrastructures (NIs) France Génomique : sequencing and genotyping NI Profi : proteomics NI Frisbi : structural biology NI etc. (17 NIs all together) + 5 IHUs (Instituts Hospitaliers Universitaires) + 1 IRT (Institut de Recherche Technologique) 2

IFB - Ins0tut Français de Bioinforma0que IFB, the French distributed infrastructure for life-science information Mission : to make available core bioinformatics resources to the national/international life science research community. To provide support for national biology programs To provide an IT infrastructure devoted to management and analysis of biological data To act as a middleman between the life science community and the bioinformatics/ computer science research community http://www.france-bioinformatique.fr CNRS UMS3601. Avenue de la Terrasse, Bât 21. 91190 Gif-sur-Yvette ELIXIR French Node The European distributed infrastructure for lifescience information To optimize the interactions and coordination between the national level and ELIXIR and other ESFRI infrastructures in biomedical and environmental field, To promote consistency and complementarities between the components offered by the ELIXIR French node and those of other European nodes 3

Experimental data in life sciences (FR) French national platforms (GIS IBISA) Nb Cellular imaging 19 Genomic, Transcriptomic 16 Proteomic 13 French NGS platforms Structural biology, biophysic 11 NGS C BI PRO Source: omicsmaps.com NGS IMG PRO NGS BI C Biological platform (Genomics, IMaGing, PROteomics...) BI C Bioinformatics center Cloud resources Scientists BI IMG PRO NGS BI C PRO IMG C C Un déluge de donnée. Blanchet C. et Collin O., 2011, Biofutur, 323: 64-67 PRO Regional centers distribute the load in terms of computing and storage, and provide better interactions with scientists Des sites intermédiaires permettent de répartir la charge en terme de stockage et de puissance de calcul tout en assurant une meilleure proximité avec les scientifiques 4

A lot of public reference databases source: EMBL-EBI Annual Report 2013 5

A lot of bioinforma0cs tools tools BLAST FastA OMSSA ClustalW2 SSearch PeptideShaker ARIA BWA X!tandem HMMer TopHat samtools Galaxy Clustal Muscle fastqc Omega R ABYSS 1.3.4 ARIA 2.3 Bioconductor 2.11 biomaj BLAST+ 2.2.27 Blat 35 Bowtie 0.12.8 Bowtie2 2.0.0- beta7 BWA 0.6.2 BWA 0.7.10 CAP3 CD-HIT 4.6.1 Clustal Omega 1.0.3 CLUSTALW 2.1 Cufflinks 2.0.2 Cutadapt 1.2.1 E-SURGE 1.9.0 Exonerate 2.2.0 express 1.5.1 FastA 3.6 FastQC 0.10.1 Galaxy portal GATK 2.3.4 HMMer 3.0 ImageJ 1.48 khmer 1.1 M-SURGE 1.8.5 MEME 4.7 MMSEQ 0.11.2a Mobyle MODAL MultAlin 5.4.1 MUSCLE 3.8.31 neo4j Oases 0.2.08 OMSSA 2.1.9 PeptideShaker 0.18.3 phyml 3.1 PREDATOR 2.1.2 proline python 2.7 R 2.13 R 3.1.1 R 3.1.2 R-studio Ray 1.3 RSAT samtools 0.1.18 Samtools 1.1 SearchGUI 1.10.4 SeqClean Shiny Stacks STAR 2.4.0f1 SuMo v1 TGICL TopHat 2.0.6 trim_galore 0.3.7 Trinity 2.0.4 U-CARE 2.3.2 VCFtools 0.1.11 Velvet 1.2.10 X!tandem 12-10-01-1 XPLOR-NIH 2.30 6

Many interfaces 7

Use case : deploy a simple bioinforma0cs applica0on Use case AuthN/Z Cloud web dashboard VMs lifecycle log in + data(+) deploy Pipeline web interface isolated network VM user s data User AuthN/Z +encrypt. refdata Cloud SW IFB web dashboard StrastusLab Slispstream TRESOR Docker SAML/ XCAML Cluster OpenNaaS Hadoop ( ) HW Secure User s LAN WAN academic + public DC A refdata 8

Use case: deploy a complex bioinforma0cs applica0on Use case AuthN/Z User Cloud web dashboard VMs/disks lifecycle AuthN/Z deploy data (scp/sftp) log in I/O viz viz isolated network shared data worker worker worker worker worker worker worker worker worker worker Bioinformatics web interface Cloud SW IFB web dashboard StrastusLab Slispstream SAML/ XCAML TRESOR OpenNaaS ( ) refdata(+) HW User s LAN WAN academic + public Isolated network DC A DC B geno mes Sync geno mes refdata(+) 9

Deploy and operate IFB s e- infrastructure as a cloud 10

IFB e- Infrastructure Mission : to provide core bioinformatics resources to the life science research community. To set up a French IT infrastructure devoted to management and analysis of biological data To collaborate with international infrastructure (ELIXIR) Goal : to help scientists and engineers to deploy and use their tools e-infrastructure: provide hardware, data collections and bioinformatics tools Current resources A national hub: IFB-core hosted at CNRS IDRIS SC center A network of regional centers: 32 bioinformatics platforms - 15,000 cores - 5 PB Create a federation of clouds for life sciences 11

IFB s CloudS? IFB-GO APLIBIO IFB-NE Running clouds IFB-core GenOuest IFB-core PRABI PoC & experiments URGI BiLille/Univ.Lille BISTRO/IPHC IFB-SO IFB-GS Next Lyon?? 12

IFB- core s cloud IFB-core # Compute Cores # TB Storage # TB RAM Max VM size Technology Location Pilot 200 50 2 40c 256GB StratusLab CNRS-IDRIS, Paris 2016-S1 3,000 500 -?144c 3TB? StratusLab 2017 10,000 2,000 -?? StratusLab CNRS-IDRIS, Paris CNRS-IDRIS, Paris NGS, imaging, statistics, PaaS IaaS launch jobs Scientists RENATER 10giga SaaS Frontend Master Virtualization Layer Shared FS Workers Web portal Pdisk storage iscsi 10giga eth Hosted @ IDRIS CNRS SC-center 10giga eth Cloud Hypervisors - std nodes: 32c 128GB - bigmem nodes: 40c 256GB 13

Create bioinforma0cs appliances tools BLAST FastA OMSSA ClustalW2 SSearch PeptideShaker ARIA BWA X!tandem HMMer TopHat samtools Galaxy Clustal Muscle fastqc Omega Create new cloud services Virtual Machines R + Linux system Bioinformatics Marketplace Appliance? predefined virtual machine Ready to run Description Title Contact (and maintainer!) Description (w. controlled voc.) + Structures Sequences Proteomics Galaxy... 14

Docking bioinforma0cs tools IFB docker hub (@GenOuest) http://docker-ui.genouest.org Registry of images Bioinformatics images Docker Image size (MB) abyss pull push IFB s Cloud blast blast+ bowtie bwa clustalw2 cufflinks docker virtual machine docker virtual machine fasta36 fastqc gor4 Developer User hmm meme mmseq cufflinks 0 275 550 15

Move VMs rather than data NGS C BI NGS data PRO VM BI VM C PRO VMs IFB s marketplace & VMs repository for life sciences NGS data VM PRO BI C IMG PRO data IMG BI NGS BI IMG PRO Biological platform (Genomics, IMaGing, PROteomics...) Bioinformatics centre C C C Cloud resources Researchers 16

IFB s bioinforma0cs cloud services 17

A cloud driven through a web dashboard http://cloud.france-bioinformatique.fr/cloud 18

Storage for biological data CLI (scp, sftp), GUI (Cyberduck, Transmit, Filezilla, ) sftp/http Upload your data Public Data sources Genomes EMBL PDB UNIPROT PROSITE shared (NFS ro) BLAST, Clustal, etc. PaaS IaaS launch jobs ssh Shared FS Master & Storage VM ARIA Workers VM CNS Identity Mgmt j. doe e. martin you chb virtual disks Portal Bioinformatics Cloud cg User data sftp/http Get your results CLI (scp, sftp), GUI (Cyberduck, Transmit, Filezilla, ) 19

IFB s bioinforma0cs appliances Remote desktop Web Proteomics Galaxy Scientific apps Galaxy Galaxy MODAL AVIESAN 2013 RSAT Galaxy RADseq PhyML R statistics CLI biocompute Aria Node Imaging Eco Pop Utilities biodata BioMaj BlobSeer biodata NFS Cassandra Neo4j Data mgmt biohadoop Docker CentOS Ubuntu Base OS 20

Browse the marketplace and run an App! Proteomics Sequences Galaxy Structures?... IFB s bioinformatics marketplace! 21

RAINBio : Registry of bioinforma0cs tools and VMs Prototype Query: topic? tool? VM? Cloud Marketplace e.g. IFB VMs Services Registry e.g. ELIXIR Tools Life science researcher RAINBio Graph DB (Neo4J) 22

App Biocompute Standard bioinformatics node With pre-installed standard bioinformatics tools BLAST, FastA, SSearch,HMM,... Bowtie(2), BWA, samtools,... MEME, R, etc. ClustalW2, Clustal-Omega, Muscle,.. Connected to public reference datasets Uniprot, EMBL, genomes, PDB, etc. Automatically shared with the VMs Cluster mode turn several instances in a single virtual cluster shared file system batch scheduling 23

App R Sta0s0cal Compu0ng R software environment for statistical computing and graphics include common bioinformatics module RStudio IDE Biobase, BiocGenerics, BiocInstaller, GenomeInfoDb integrated development environment (IDE) for R features: console, syntax-highlighting editor Shiny web framework powerful web framework for building web applications using R. without requiring HTML, CSS, or JavaScript knowledge. Contact: Stéphane Delmotte (PRABI-LBBE) 24

Motivation App Proteomics desktop Collaboration with a mass spectroscopy platform Running out of space on their local resources Protein identification tools Mass experimental data Reference databases : nr, Swiss-Prot Reference screening tools: OMSSA, X! Tandem User interface Remote Virtual Desktop (NX) Reference GUI PeptidShaker 25

And others RSAT Galaxy Databank FastAMR splits the databank into subsets and puts them in the DFS along with the sequences file FastAMR subset #01 FastA #01 subset #02 Mappers FastA #02...... Each mapper send the score and sequences to reducers Reducers Results score sequence score sequence... Users run the FastAMR script with its sequences and the databank User's Sequences Each mapper runs a FastA program on a part of the databank Reducers copy the best scores of the whole experiment in the DFS Ecology of populations Hadoop 26

Conclusion 20 bioinformatics appliances already available + 10 in progress IFB supports different domain-specific developments Microbial Bioinformatics, Evolutionary bioinformatics, Plant bioinformatics, Structural Biology, NGS data processing Scientific production - 180 users (July 2015) opened to members of IFB (standard allocated resources) opened to partners, academic and industry, infrastructures and projects: e.g. BioDataCloud, ProFi, MetaboHub, extra resources allocation according to scientific and financial criteria Training - 70 users French tutorials Cloud pour la Biologie tutorial at ECCB 14 about Analysis of Cis-Regulatory Motifs from High- Throughput Sequence Sets Bioinformatics Master in Marseille (2014), in Rouen (2015) 27

Perspec0ves Create more bioinformatics appliances available to the community By the experts of the different life sciences domains Appliances in progress: BioDataCloud-RNAseq, ProFi, SymBioWatch, Clinical NGS for cancerology (2x), REPET, TriAnnot, Galaxy RAD-seq, Bacterial genomics, imetamos, BioDataCloud-Genomes-Browser Call in 2015-Q1 for proposals to support technological developments and integration of data and tools Pilots Multi-cloud deployment of complex applications Secure cloud for human biomedical data analysis Live remote cloud processing of sequencing data Data registry of multi-cloud datasets Docker hub for life sciences 28

Ques0ons? http://www.france-bioinformatique.fr Acknowledgments IFB members IFB hub: Patricia, Marie, Jean-François, Mohamed, Quentin we are hiring! Groupe de réflexion sur les InfraStructures BioInformatiques IFB-GRISBI (co-chair with Olivier Collin) Appliances developers: Samuel Blanck (Inria Lille), Jacques van Helden (TAGC), Stéphane Delmotte (PRABI-Doua), Bruno Spataro (PRABI-Doua) CNRS IDRIS: R. Medeiros, C. Gauthey and staff StratusLab members IFB is funded by French program PIA INBS 2012 EU H2020 EGI-Engage (654142) and CYCLONE (644925) projects 29