Bioinformatics on the Cloud: Use Case with the Galaxy Portal




Christophe Blanchet, Institute of Biology and Chemistry of Proteins
Head of Service, Infrastructure for Biology (IDB)
CNRS-IBCP FR3302, Lyon, France
http://idee-b.ibcp.fr

IDB acknowledges co-funding by the European Community's Seventh Framework Programme (INFSO-RI-261552), the French National Research Agency's Arpege Programme (ANR-10-SEGI-001) and by the French Institute for Bioinformatics (IFB-RENABI).

Bioinformatics Today
- Biological data are big data
  - 1512 online databases (NAR Database Issue 2013)
  - Sanger Institute, UK: 5 PB
  - Beijing Genome Institute, China: 5 sites, 12.6 PB
- Big data in many places
- Analysing such data has become difficult
  - Scale-up of the analyses: from gene/protein to complete genome/proteome, ...
  - Many different tools in daily use
  - That need to be combined in workflows
- Usual interfaces: portals, Web services, federation, ...
- Datacenters with ease of access/use
- Distributed resources
  - Experimental platforms: NGS, imaging, ...
  - Bioinformatics platforms
- Federation of datacenters

IDB Cloud and Bioinformatics Appliances
- Cloud workbench for biology
  - Running since Sept. 2011 at CNRS-IBCP FR3302, Lyon, France
  - Open to the biology community
  - 14 bioinformatics appliances: Galaxy portal, standard compute nodes, proteomics, virtual desktop, structural biology, ...
  - 40+ users from all IFB regional centers: PRABI 15, APLIBIO 14, RENABI-NE 8, RENABI-SO 2, RENABI-GS 1, RENABI-GO 1
  - VMs up to 32 cores and 768 GB RAM
- Infrastructure
  - Compute: 900+ cores, 4+ TB RAM
    - Standard nodes (32 cores, 128 GB)
    - Bigmem nodes (64 cores, 768 GB)
  - Powered by StratusLab
  - Storage: 250+ TB
    - Virtual disks, object storage (S3)
- [Diagram: IDB Cloud. Labels: tools (BLAST, FastA, OMSSA, ClustalW2, SSearch, PeptideShaker, ARIA, BWA, X!Tandem, HMMer, TopHat, samtools, Galaxy, Clustal Omega, Muscle, fastqc); R + Linux system; virtual machines; Bioinformatics Marketplace; create new cloud services; user data (structures, sequences, proteomics, Galaxy); public data (UNIPROT, EMBL, Genomes, PDB, PROSITE).]

Cloud extended services
- Native cloud services
  - Authentication
  - Virtual machine management
  - Persistent disk service
  - Client CLI, etc.
- IDB Bioinformatics Marketplace
  - Find appropriate appliances more easily
  - Reduce noise in the central Marketplace
  - Respect visibility constraints for the bioinformatics appliances, such as confidentiality
- Bioinformatics metadata: bio:tool
  - Additional elements related to bioinformatics tools, used to annotate appliances
  - Help users search for the tools themselves or for a type of analysis
  - Select suitable bioinformatics appliances containing the required tools
- Integrated Web interface
  - VM and virtual disk management
  - Browse bioinformatics appliances using the bio:tool metadata
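The idea of selecting appliances by their bio:tool annotations can be illustrated with a small sketch. It is not the Marketplace's actual API or metadata schema: the `Appliance` class, the `find_appliances` function, the `public` flag and the example catalog are all hypothetical, and only the tool names come from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Appliance:
    """Hypothetical stand-in for a Marketplace entry annotated with bio:tool."""
    name: str
    tools: set = field(default_factory=set)  # values of the bio:tool metadata
    public: bool = True                      # visibility constraint (assumed flag)


def find_appliances(catalog, required_tools, include_private=False):
    """Return names of appliances that contain every required tool.

    Tool names are compared case-insensitively. Private appliances are
    skipped unless include_private is set, which mimics the visibility
    constraint mentioned on the slide.
    """
    required = {tool.lower() for tool in required_tools}
    matches = []
    for appliance in catalog:
        if not (appliance.public or include_private):
            continue
        available = {tool.lower() for tool in appliance.tools}
        if required <= available:  # all required tools are present
            matches.append(appliance.name)
    return matches


# Illustrative catalog; appliance names are invented for this example.
EXAMPLE_CATALOG = [
    Appliance("biocompute", {"BLAST", "ClustalW2", "BWA", "samtools"}),
    Appliance("proteomics-desktop", {"OMSSA", "X!Tandem", "PeptideShaker"}),
    Appliance("private-aria", {"ARIA", "CNS"}, public=False),
]

if __name__ == "__main__":
    print(find_appliances(EXAMPLE_CATALOG, ["blast", "bwa"]))
```

Matching on a set of tool names rather than on appliance titles is what lets a user search by the analysis they need instead of by the name of an image.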

Driven through a simple web interface

Run your Bioinformatics Cloud Instances
[Diagram: IBCP's cloud resources. Labels: portal; Bioinformatics Marketplace (Sequence, Structure, NGS, Galaxy, ARIA); launch instances; PaaS (BLAST, Clustal, etc.); IaaS (launch jobs over ssh, shared FS, master and storage VM, worker VMs running ARIA and CNS).]

Biological Data in Cloud
[Diagram: bioinformatics cloud. Labels: upload your data (sftp/http/s3); public data sources (Genomes, EMBL, PDB, UNIPROT, PROSITE) shared over NFS; user persistent data on pdisk (iSCSI); portal; PaaS (BLAST, Clustal, etc.); IaaS (launch jobs over ssh, shared FS, master and storage VM, worker VMs running ARIA and CNS); get your results (sftp/http/s3).]

Examples of Cloud Bioinformatics Appliances

Standard Bioinformatics node: Biocompute appliance
- Use your own instance(s)
- With pre-installed standard bioinformatics tools
  - BLAST, FastA, SSearch, HMM, ...
  - ClustalW2, Clustal-Omega, Muscle, ...
  - Bowtie(2), BWA, samtools, ...
  - MEME, R, etc.
- Connected to public reference data
  - Uniprot, EMBL, genomes, PDB, etc.
  - Automatically shared with the VMs

Structural Biology: TOwards StruCtural AssignmeNt Improvement
- Goal: improve the determination of protein structures based on Nuclear Magnetic Resonance (NMR) information with the ARIA software
- Large computational needs: an NMR laboratory will not necessarily invest in building a cluster of about 100 nodes just to run such NMR structure calculations
- The flexibility of the cloud to deploy the different required bioinformatics tools can accelerate such a procedure
- Commercial interest in providing such tools to structural biologists on a pay-as-you-go basis
- Endorsers: Institut Pasteur Paris and CNRS IBCP

Proteomics desktop
- Motivation
  - Collaboration with a mass spectrometry platform
  - Running out of space on their local resources
- Protein identification
  - Experimental mass data
  - Reference databases: nr, Swiss-Prot
  - Reference screening tools: OMSSA, X!Tandem
- User interface
  - Remote display: NX
  - Reference GUIs: SearchGUI, PeptideShaker
- [Screenshot source: PeptideShaker site]

MapReduce Biology
- Provide turnkey virtual machines with a preconfigured MapReduce framework
- Accelerate big data analysis with the two-step map and reduce paradigm
- Hadoop MapReduce 1.0.4
- Appliances (2)
  - Standard Hadoop MapReduce
  - Bioinformatics software integrated in Hadoop
- Sequence similarity with the MapReduce paradigm
  - FastA and SSearch
  - Deploy the database of sequences in HDFS
  - Compare each sequence to the others
- Developed in the context of the French project MapReduce, ANR ARPEGE
- FastAMR workflow (from the diagram)
  1. Users run the FastAMR script with their sequences and the databank
  2. FastAMR splits the databank into subsets (#01, #02, ...) and puts them in the DFS along with the sequences file
  3. Each mapper runs a FastA program on one part of the databank
  4. Each mapper sends the scores and sequences to the reducers
  5. Reducers copy the best scores of the whole experiment to the DFS
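The split, map and reduce steps described for FastAMR can be sketched in plain Python. This is only an in-memory illustration of the paradigm, not the FastAMR code: it does not use Hadoop or HDFS, `toy_score` is a made-up stand-in for a real FastA run, and all function names are assumptions introduced for the example.

```python
from collections import defaultdict
from heapq import nlargest


def split_databank(databank, n_subsets):
    """Split the databank, a list of (id, sequence), into n_subsets parts."""
    return [databank[i::n_subsets] for i in range(n_subsets)]


def toy_score(query, target):
    """Stand-in for FastA: count aligned positions where residues match."""
    return sum(a == b for a, b in zip(query, target))


def mapper(queries, subset):
    """Score every user sequence against one databank subset.

    Emits (query_id, (score, target_id)) pairs, as a mapper would.
    """
    for query_id, query_seq in queries:
        for target_id, target_seq in subset:
            yield query_id, (toy_score(query_seq, target_seq), target_id)


def reducer(pairs, top_n):
    """Group hits by query and keep the top_n best-scoring ones."""
    hits_by_query = defaultdict(list)
    for query_id, hit in pairs:
        hits_by_query[query_id].append(hit)
    return {qid: nlargest(top_n, hits) for qid, hits in hits_by_query.items()}


def fasta_mr(queries, databank, n_subsets=2, top_n=2):
    """Run the whole split / map / reduce sequence in memory."""
    subsets = split_databank(databank, n_subsets)
    emitted = [pair for subset in subsets for pair in mapper(queries, subset)]
    return reducer(emitted, top_n)


if __name__ == "__main__":
    user_sequences = [("q1", "ACGT")]
    databank = [("d1", "ACGT"), ("d2", "ACGA"), ("d3", "TTTT"), ("d4", "ACTT")]
    print(fasta_mr(user_sequences, databank))
```

In a real Hadoop deployment each call to `mapper` would run on a different worker against its own subset in the DFS, which is where the speed-up comes from; the reducer logic stays the same.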

Use case with Galaxy

IDB Cloud account
- Log in
- Fill in the fields
  - use your institutional e-mail address
- Submitting the request implies acceptance of the terms of use!
- https://idee-b.ibcp.fr/cloud.html

Available appliances
- List of existing appliances
- Appliance-specific documentation
- Direct creation with the Power button

Create my Galaxy portal
- Also called an instance
- Fill in the parameters
  - give it a name
  - number of CPUs
  - memory size
  - attach a virtual disk
- Cluster of VMs
  - enter the number of VMs
  - choose a unique name

Connecting to my Galaxy instance

Virtual hard disks
- A virtual disk lets you
  - keep your data independently of the execution of the VMs
  - find your data again from one VM to the next
- Actions
  - create a vdisk
  - manage your vdisks
- Using a vdisk
  - at VM creation
  - hot-plug mounting

Exchanging data with my Galaxy portal
- sftp / scp
  - graphical clients: Cyberduck, Transmit, FileZilla, ...
- Web: Galaxy, Get Data
- Download link
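For scripted transfers, the scp option listed above can be wrapped in a few lines of Python. This is a minimal sketch that only builds the command lines: the host name, user name and paths are placeholders, not values from the IDB cloud, and the helper names are invented for the example.

```python
import shlex


def build_scp_upload(local_path, user, host, remote_dir, port=22):
    """Build the scp argument list that copies a local file to the VM."""
    return ["scp", "-P", str(port), local_path, f"{user}@{host}:{remote_dir}"]


def build_scp_download(remote_path, user, host, local_dir, port=22):
    """Build the scp argument list that copies a result file from the VM."""
    return ["scp", "-P", str(port), f"{user}@{host}:{remote_path}", local_dir]


if __name__ == "__main__":
    # Placeholder values: replace them with the address of your own instance.
    upload = build_scp_upload("reads.fastq", "galaxy", "my-galaxy-vm.example.org", "/data/")
    print(shlex.join(upload))
    # To actually run the transfer, pass the list to subprocess.run(upload, check=True).
```

Building the command as a list rather than a single string avoids shell quoting problems with file names that contain spaces.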

Conclusion
- Added value of cloud, e.g. NGS with Galaxy
  - for scientific analyses: user-specific, isolated resources; different instances side by side
  - for training: Oct. 2012 Bordeaux, May 2013 Galaxy Lille, (next) 2014 Galaxy Jouy
  - for tools integration: semantic annotation, solving software dependencies
  - for development and operations (DevOps): different versions at the same time
- Provide turnkey bioinformatics appliances
  - Standard tools and pipelines
  - New developments
  - Ready to run on clouds
- Public bioinformatics cloud (e.g. IDB)
  - Tightly connected to existing bioinformatics resources
  - Linked to public biological databases
  - In collaboration with the French Institute of Bioinformatics

IFB - French Institute of Bioinformatics (Institut Français de Bioinformatique)
- Mission: to make core bioinformatics resources available to the national and international life science research community
- To provide support for biology programs
  - supporting projects
  - training users
- To provide an IT infrastructure devoted to the management and analysis of biological data
  - material resources: CPUs, disks, etc.
  - availability of biology data collections
  - deployment of bioinformatics tools
- To act as a middleman between the life science community and the bioinformatics/computer science research community

IFB - Infrastructure
- IFB-Core resources
  - Academic cloud for life science
  - Will be hosted at the CNRS IDRIS supercomputing center (Paris)
  - A pilot infrastructure (2014-Q1)
  - Production infrastructure: 5,000+ cores, 1 PB (2014-S2)
- Regional resources
  - 6 regional bioinformatics centers
  - 6,000+ cores, ~1 PB
  - 2 existing clouds: PRABI-IBCP IDB cloud (Lyon) and Genouest genocloud (Rennes)
- Deploy a federation of clouds
- [Map: IFB-core IT at CNRS-IDRIS, Paris, and the regional centers RENABI-GO, APLIBIO, RENABI-NE, PRABI, RENABI-SO, RENABI-GS.]

Questions?

Acknowledgment
- Clément Gauthey (IDB)
- StratusLab members
- Co-funding by the European Community's Seventh Framework Programme (INFSO-RI-261552) and by the French National Research Agency's Arpege Programme (ANR-10-SEGI-001)

http://idee-b.ibcp.fr