Cloud pour la Bioinformatique

Similar documents
Bioinformatique sur Cloud Cas d usage avec le portail Galaxy

Cloud Ready for Bioinformatics?

Sequencing data. And other experimental data. EMBL-EBI data resources growth

Une e-infrastructure nationale en bioinformatique

IFB s e-infrastructure

Le cloud IFB et son instance Galaxy

Le cloud IFB et son instance Galaxy

StratusLab project. Standards, Interoperability and Asset Exploitation. Vangelis Floros, GRNET

Building Storage Service in a Private Cloud

OpenNebula Open Souce Solution for DC Virtualization. C12G Labs. Online Webinar

Open Cloud System. (Integration of Eucalyptus, Hadoop and AppScale into deployment of University Private Cloud)

Alternative Deployment Models for Cloud Computing in HPC Applications. Society of HPC Professionals November 9, 2011 Steve Hebert, Nimbix

SURFsara HPC Cloud Workshop

Cloud Computing Solutions for Genomics Across Geographic, Institutional and Economic Barriers

Getting Started Hacking on OpenNebula

Assignment # 1 (Cloud Computing Security)

Eoulsan Analyse du séquençage à haut débit dans le cloud et sur la grille

Private Cloud Database Consolidation with Exadata. Nitin Vengurlekar Technical Director/Cloud Evangelist

OpenNebula Open Souce Solution for DC Virtualization

Boas Betzler. Planet. Globally Distributed IaaS Platform Examples AWS and SoftLayer. November 9, IBM Corporation

OpenNebula Open Souce Solution for DC Virtualization

Cloud BioLinux: Pre-configured and On-demand Bioinformatics Computing for the Genomics Community

Cloud BioLinux: Pre-configured and On-demand Bioinformatics Computing for the Genomics Community

Big Data in BioMedical Sciences. Steven Newhouse, Head of Technical Services, EMBL-EBI

SUSE Cloud 2.0. Pete Chadwick. Douglas Jarvis. Senior Product Manager Product Marketing Manager

Towards a galaxy.prabi.fr

OpenNebula Leading Innovation in Cloud Computing Management

The OpenNebula Cloud Platform for Data Center Virtualization

Hadoopizer : a cloud environment for bioinformatics data analysis

SURFsara HPC Cloud Workshop

Course 20533: Implementing Microsoft Azure Infrastructure Solutions

Using SUSE Cloud to Orchestrate Multiple Hypervisors and Storage at ADP

Three data delivery cases for EMBL- EBI s Embassy. Guy Cochrane

Cloud Computing Architecture with OpenNebula HPC Cloud Use Cases

Solution for private cloud computing

Implementing Microsoft Azure Infrastructure Solutions

A curated Domain centric shared Docker registry linked to the Galaxy toolshed

Virtualizing Apache Hadoop. June, 2012

SOFTWARE DEFINED SOLUTIONS JEUDI 19 NOVEMBRE Nicolas EHRMAN Sr Presales SDS

Cloud OS. Philip Meyer Partner Technology Specialist - Hosting

CloudStack and Big Data. Sebastien May 22nd 2013 LinuxTag, Berlin

Managing and Conducting Biomedical Research on the Cloud Prasad Patil

Linux/Open Source and Cloud computing Wim Coekaerts Senior Vice President, Linux and Virtualization Engineering

Integrated Rule-based Data Management System for Genome Sequencing Data

How To Run A Cloud Server On A Server Farm (Cloud)

Implementing Microsoft Azure Infrastructure Solutions 20533B; 5 Days, Instructor-led

Planning, Provisioning and Deploying Enterprise Clouds with Oracle Enterprise Manager 12c Kevin Patterson, Principal Sales Consultant, Enterprise

Course 20533B: Implementing Microsoft Azure Infrastructure Solutions

<Insert Picture Here> Private Cloud with Fusion Middleware

Scientific and Technical Applications as a Service in the Cloud

Cloud-Based Big Data Analytics in Bioinformatics

Steven Newhouse, Head of Technical Services

<Insert Picture Here> Cloud Computing Strategy

wu.cloud: Insights Gained from Operating a Private Cloud System

Microsoft Research Windows Azure for Research Training

Open source Google-style large scale data analysis with Hadoop

With Red Hat Enterprise Virtualization, you can: Take advantage of existing people skills and investments

Microsoft Research Microsoft Azure for Research Training

Building Storage as a Service with OpenStack. Greg Elkinbard Senior Technical Director

OpenNebula The Open Source Solution for Data Center Virtualization

Introducing ScienceCloud

Leveraging BlobSeer to boost up the deployment and execution of Hadoop applications in Nimbus cloud environments on Grid 5000

Euro-BioImaging European Research Infrastructure for Imaging Technologies in Biological and Biomedical Sciences

DATA MANAGEMENT PLAN IN THE REAL LIFE SCIENCES

New solutions for Big Data Analysis and Visualization

BlobSeer: Towards efficient data storage management on large-scale, distributed systems

A Service for Data-Intensive Computations on Virtual Clusters

OpenCB a next generation big data analytics and visualisation platform for the Omics revolution

Experiences and challenges in the development of the JASMIN cloud service for the environmental science community

<Insert Picture Here> Infrastructure as a Service (IaaS) Cloud Computing for Enterprises

ENABLING DATA TRANSFER MANAGEMENT AND SHARING IN THE ERA OF GENOMIC MEDICINE. October 2013

Open Source Cloud Computing Management with OpenNebula

Scalable Cloud Computing Solutions for Next Generation Sequencing Data

Building a BI Solution in the Cloud

Traditional v/s CONVRGD

Big Data and Cloud Computing for GHRSST

Development of Bio-Cloud Service for Genomic Analysis Based on Virtual

Hadoop IST 734 SS CHUNG

Large-scale Research Data Management and Analysis Using Globus Services. Ravi Madduri Argonne National Lab University of

Deploying Business Virtual Appliances on Open Source Cloud Computing

Building Bioinformatics Capacity in Africa. Nicky Mulder CBIO Group, UCT

UGENE Quick Start Guide

Building an Internal Cloud that is ready for the external Cloud

1 Copyright 2011, Oracle and/or its affiliates. All rights reserved. Insert Information Protection Policy Classification from Slide 7

Transcription:

Cloud pour la Bioinformatique Christophe Blanchet Institut Français de Bioinformatique - IFB French Institute of Bioinformatics - ELIXIR French Node CNRS UMS3601 - Gif-sur-Yvette - FRANCE

Sequencing data source: www.genomesonline.org source: www.politigenomics.com/next-generation- Complete genome sequencing become a lab commodity with NGS (cheap and efficient) Ecole IN2P3 Cloud 2014, 1er juillet let 2014, Lyon

And other experimental data FR - EU

EMBL-EBI data resources growth source: EMBL-EBI Annual Report 2013

Plateformes Expérimentales en Biologie Plateformes nationales (GIS IBISA) Nb Imagerie cellulaire 19 Génomique, Transcriptomique 16 Protéomique 13 Biologie structurale, biophysique 11 NGS BI C IMG Biological platform (Genomics, IMaGing, PROteomics...) Bioinformatics center Cloud resources Scientists PRO C NGS BI NGS PRO BI C Localisation des plateformes NGS PRO BI IMG PRO NGS PRO C IMG BI C C Source: omicsmaps.com Des sites intermédiaires permettent de répartir la charge en terme de stockage et de puissance de calcul tout en assurant une meilleure proximité avec les scientifiques

Infrastructures in Biology Lot of bioinformatics tools and services to treat and vizualize the biological data Ecole IN2P3 Cloud 2014, 1er juillet 2014, Lyon

Bioinformatics Today Biological data are big data 1552 online databases (NAR Database Issue 2014) Institut Sanger, UK, 5 PB - Beijing Genome Institute, China, 7 sites, 20.6 PB Big data in many places Analysing such data became difficult Scale-up of the analyses : gene/protein to complete genome/proteome,... Lot of different daily-used tools that need to be combined in workflows Usual interfaces: portals, Web services, Datacenters with ease of access/use C NGS NGS BI NGS BI PRO BI C IMG PRO IMG Distributed resources Experimental platforms: NGS, imaging,... Bioinformatics platforms Federation of datacenters C PRO C n

Cloud? Essential characteristics On-demand self-service No human intervention Broad network access Fast, reliable remote access Rapid elasticity Scale based on app. needs Resource pooling Multi-tenant sharing Measured service Direct or indirect economic model with measured use Deployment models Private Single administrative domain, limited Community Public number of users Different administrative domains with common interests & proc. People outside of institute s administrative domain Hybrid Federation via combination of other deployment models Service models Software as a Service (SaaS) Direct (scalable) hosting of end user applications Platform as a Service (PaaS) Framework and infrastructure for creating web applications Infrastructure as a Service (IaaS) Access to remote virtual Machines with root access http://csrc.nist.gov/publications/ nistpubs/800-145/sp800-145.pdf

Cloud IDB Cloud workbench for Biology Infrastructure Distributed for Biology https://idee-b.ibcp.fr/cloud.html Running since Sept. 2011 IBCP FR3302 CNRS-Univ. lyon1, Lyon, France opened to Biology community 14 bioinformatics appliances: Galaxy portal, standard compute nodes, proteomics, virtual desktop, structural biology,... +70 users from all IFB regional centers PRABI 16, APLIBIO 28, RENABI-NE 13, -GO 7, -SO 2, - GS 5 VMs up to 32cores-768GB RAM Infrastructure Compute +900cores +4TB ram Standard nodes (32c-128GB) Bigmen nodes (64c 768GB) Storage +250TB Virtual disks, large-scale object storage (S3) Powered by StratusLab and CEPH RENATER Scientists 1giga Frontend 10giga eth IDB's Cloud Web portal Pdisk storage iscsi 10giga eth Object storage Cloud Hypervisors - std nodes: 32c 128GB - bigmem nodes: 64c 768 GB

French Institute of Bioinformatics - IFB Mission : to make available core bioinformatics resources to the national/ international life science research community. To provide support for national biology programs To provide an IT infrastructure devoted to management and analysis of biological data To act as a middleman between the life science community and the bioinformatics/computer science research community ELIXIR French Node optimizing the interactions and coordination between the national level and ELIXIR and other ESFRI infrastructures in biomedical and environmental field, promoting consistency and complementarities between the components offered by the ELIXIR French node and those of other European nodes

IFB e-infrastructure Support : help members to deploy and use their tools e-infrastructure: hardware, biology data collections, bioinformatics tools Academic cloud for life science a core ressource IFB-core hosted at CNRS IDRIS SC center (Paris) + regional resources 6 regional bioinformatics centers with 2 clouds 11,000 cores - 6 PB but +20 bioinformatics platforms Create a federation of clouds for life science Technical organization GRISBI: a national technical group (all national platforms) Participation to ELIXIR task forces RENABI-GO RENABI-SO APLIBIO RENABI-NE IFB-core PRABI RENABI-GS Cloud Ressources Location # Compute Cores # TB Storage # TB RAM Max VM size Technology IFB-core CNRS-IDRIS, Paris 100 50 1 40c 256GB StratusLab IFB-core 2014 CNRS-IDRIS, Paris 4,000 500-96c 1TB StratusLab IFB-core 2015 CNRS-IDRIS, Paris 10,000 2,000-96c 2TB StratusLab idee-b PRABI-IBCP, Lyon 1,000 380 4 64c 768GB StratusLab Genocloud IFB-GO, Rennes 240 8 1 - ONE

B B I Extended cloud functionalities for bioinformatics Bioinformatics appliances integrate bioinformatics tools and workflows Bioinformatics marketplace focus on bioinformatics appliances satisfy visibility contraints for some bioinformatics appliances (confidentiality) Bioinformatics metadata bio:tool annotate appliances with attributes related to bioinformatics tools help to select suitable bioinformatics appliances containing the required tools Integrated Web interface Native cloud services Authentication Virtual machine management Persistent disk service Client CLI VM & virtual disks management filter bioinformatics appliances with bio:tool etc. IDB CEPH storage backend large scale and distributed storage reliable by replication high-througput IO single unified storage cluster for all interfaces: block, object and file system

Bioinformatics cloud appliances Bioinformatics appliances are usual virtual machines tools R FastA OMSSA X!tandem Galaxy Muscle BLAST ClustalW2 SSearch ARIA PeptideShaker TopHat BWA HMMer samtools Clustal Omega fastqc small : few GB, easy to convert in most virtualization formats Installed and preconfigured with bioinformatics tools Create new cloud services Linux system e.g. BLAST, Clustalw, ARIA, MEME, HMMer, TopHat, BWA, Samtools, etc. Virtual Machines Bioinformatics Marketplace Recorded in a marketplace Structures Sequences Proteomics + Galaxy... devoted to bioinformatics

Run bioinformatics appliances Bioinformatics marketplace Store life science VMs and a catalogue both a virtual machines repository Help users to select the appropriate VM for their analysis Bioinformatics Marketplace BI Structures Galaxy Sequences Proteomics B A data public data UNIPROT EMBL Genomes PDB PROSITE Move cloud virtual machines Analyze data tools VM: BLAST, ClustalW2, etc.... (2) IDB Cloud (3) Select tools Scientists can filter (1) the appliances through a Web interface to identify and launch (2) the appropriate ones. (1) Use tools (3) Scientists have access to their own cloud resources through web portal, remote virtual desktop or SSH. Filter images with metadata related to bioinformatics attribute <bio:tool> in VM manifests scientists can select the appropriate appliance according to the tools required for their analyses e.g. the BLAST tool Deploy on several clouds

Appliances page List of existing appliances Appliance description and doc Direct launch Power button

Filter appliances with tools description

A cloud driven through a simple web interface

Connection to VMs ssh web NX (opennx)

Cloud Storage for Biological Data Upload your data Public Data sources Genomes EMBL PDB UNIPROT PROSITE shared (NFS ro) BLAST, Clustal, etc. PaaS sftp/http/s3 IaaS launch jobs ssh Shared FS j. doe e. martin you chb v. disk sambaa Portal Master & Storage VM ARIA Workers VM CNS Bioinformatics Cloud Identity Mgmt cg User data S3 sftp/http/s3 Get your results

Exchanging data with VMs CLI scp/sftp GUI: Cyberduck, Transmit Integrated: Galaxy Ecole IN2P3 Cloud 2014, 1er juillet let 2014, Lyon

Moving VMs vs Data NGS IMG PRO NGS Biological platform (Genomics, IMaGing, PROteomics...) BI C Bioinformatics center Cloud resources Scientists C BI NGS data PRO VM BI VM C VMs PRO IFB Bioinfor- matics marketplace & VMs repository data VM BI IMG PRO NGS PRO C IMG data BI C C

Case 1: Standard Bioinformatics node appliance Biocompute Use your own instance(s) With pre-installed standard bioinformatics tools BLAST, FastA, SSearch,HMM,... ClustalW2, Clustal-Omega, Muscle,.. Bowtie(2), BWA, samtools,... MEME, R, etc. Connected to public reference data Uniprot, EMBL, genomes, PDB, etc. Automaticaly shared to the VMs Cluster mode turn several instances in a single virtual cluster shared file system batch scheduling Ecole IN2P3 Cloud 2014, 1er juillet 2014, Lyon

Case 2: Cloud Galaxy portal for NGS analyses Analyse NGS data portal Galaxy is widely used in the community connected to large public data: sequences and indexes large user data (GBs) Preserve workflows and results (cloud virtual disk) Different domain-specific instances (RNAseq, CHIPseq, etc.) For training: create a special instance derived from the main instance but with dedicated datasets Help the integration of monthly updates

w Tools s hares nt text bases s.ib bcp.fr"\ Run your Galaxy Portal on Cloud Bioinformatics Marketplace Sequence Structure NGS Galaxy ARIA ( ) Stay Connected to Standard Data & Tools ay Connected to Stan Launch Instances User data: upload datafiles or attache pdisk Reference databases: mount biodata server shares Tools: used pre-installed ones or or install install yours yours User data: upload datafiles Reference databases: moun Tools: used pre-installed on Run instances with bioinformatics context stratus-run-instance\ with bi i nformatics ormatics co co --context="bio_db_server=idb-databases.ibcp.fr"\ MNb7A8XGnNwwJAaZUBj5fwYVoFf Shared FS BLAST, Clustal, etc. launch jobs ssh IaaS Master & Storage VM ARIA Workers VM CNS Portal IBCP's Cloud Resources PaaS

Connect to your Galaxy Portal

for development & operations (DevOps): different In collaboration with the French Institute of Bioinformatics Advantages of Cloud for Galaxy Added value of cloud for Galaxy, for scientific analyses: user-specific resources, isolated, different domain-specific instances (RNAseq, CHIPseq, Variants,...) for training: create a special instance derived from the main but with dedicated datasets Examples of training with Galaxy: Mai 2013 Galaxy Lille, Nov 2013 Aviesan Bioinformatics School For integration of monthly updates versions at the same time Bioinformatics cloud (e.g. IDB) Tightly connected to existing bioinformatics resources Linked to public biological databases

Case 3: Proteomics virtual desktop Motivation Collaboration with a mass spectroscopy platform Running out of space on their local resources Protein identification tools Mass experimental data Reference databases : nr, Swiss-Prot Reference screening tools: OMSSA, X!Tandem User interface Remote Virtual Desktop (NX) Reference GUIs SearchGUI PeptidShaker source: PeptideShaker esha site Ecole IN2P3 Cloud 2014, 1er juillet let 2014, Lyon

Case 4: Hadoop for Life Science Provide turnkey virtual machine with preconfigured mapreduce framework Accelerate biological bigdata analysis Hadoop MapReduce 1.0.4 Appliances (2) HDFS with integrated bioinformatics tools Example of sequence similarity searching FastA & SSearch deploy database of sequences in HDFS compare each structure to others provide standard hadoop: including mapreduce and Databank Developed in the context of the French project MapReduce, ANR ARPEGE FastAMR splits the databank into subsets and puts them in the DFS along with the sequences file FastAMR subset #01 FastA #01 subset #02 Mappers FastA #02...... Each mapper send the score and sequences to reducers Reducers Results score sequence score sequence... Users run the FastAMR script with its sequences and the databank User's Sequences Each mapper runs a FastA program on a part of the databank Reducers copy the best scores of the whole experiment in the DFS

Cloud it be done? IFB s cloud for life science simplify access to biological data and tools integrate tools and pipelines in turnkey cloud appliances is tightly connected to existing bioinformatics resources, e.g. public reference data sources 14 bioinformatics appliances: standard compute nodes, proteomics virtual desktop, Galaxy portal, structural biology +70 users from all IFB regional centers PRABI 16, APLIBIO 28, RENABI-NE 13, -GO 7, -SO 2, -GS 5 Bioinformatics marketplace store images related to life science help users to select the appropriate VM for their analysis BI tools BLAST FastA OMSSA ClustalW2 SSearch PeptideShaker ARIA BWA X!tandem HMMer TopHat samtools Galaxy Clustal Muscle fastqc Omega Create new cloud services Virtual Machines R + Linux system Bioinformatics Marketplace Bioinformatics Marketplace Structures Galaxy Galaxy Sequences... Proteomics Proteomics (2) B A data public data UNIPROT EMBL Genomes PDB PROSITE Move cloud virtual machines Analyze data tools VM: BLAST, ClustalW2, etc. IDB Cloud Use Scientists have access cloud resources throug Ecole IN2P3 Cloud 2014, 1er remote juillet virtual 2014, desktop Lyon o... (3) Selec Scientists can the appliances Web interface t and launch appropriate ones (1)

Perspectives Create bioinformatics appliances by the experts of the domains make them available to the scientists IFB established priorities: 5 scientific domains Microbial Bioinformatics Evolutionary bioinformatics Plant bioinformatics Structural Biology NGS data processing and 3 technical pilots Appliances interoperability between different cloud infrastructures Distributing biological data with distributed nosql engine Live remote cloud processing of sequencing data

Questions? Clément Gauthey (IDB-IBCP) StratusLab members IDB s co-funding by European Community's Seventh Framework Programme (INFSO-RI-261552) French National Research Agency's Arpege Programme (ANR-10-SEGI-001). IFB s funding by French program PIA INBS 2012 Acknowledgments