Bioinformatics on the Cloud
Use case with the Galaxy portal

Christophe Blanchet
Institute of Biology and Chemistry of Proteins
Head of Service, Infrastructure for Biology (IDB)
CNRS-IBCP FR3302 - LYON - FRANCE - http://idee-b.ibcp.fr

IDB acknowledges co-funding by the European Community's Seventh Framework Programme (INFSO-RI-261552), the French National Research Agency's Arpege Programme (ANR-10-SEGI-001) and by the French Institute for Bioinformatics (IFB-RENABI)
Bioinformatics Today
- Biological data are big data
  - 1512 online databases (NAR Database Issue 2013)
  - Sanger Institute, UK: 5 PB
  - Beijing Genome Institute, China: 5 sites, 12.6 PB
  - Big data in many places
- Analysing such data has become difficult
  - Scale-up of the analyses: from gene/protein to complete genome/proteome, ...
  - Many different tools in daily use
  - That need to be combined in workflows
  - Usual interfaces: portals, Web services, federation, ...
- Datacenters with ease of access/use
- Distributed resources
  - Experimental platforms: NGS, imaging, ...
  - Bioinformatics platforms
  - Federation of datacenters
IDB Cloud and Bioinformatics Appliances
- Cloud workbench for Biology
  - Running since Sept. 2011 at CNRS-IBCP FR3302, Lyon, France
  - Open to the biology community
- 14 bioinformatics appliances: Galaxy portal, standard compute nodes, proteomics, virtual desktop, structural biology, ...
- 40+ users from all IFB regional centers: PRABI 15, APLIBIO 14, RENABI-NE 8, RENABI-SO 2, RENABI-GS 1, RENABI-GO 1
- VMs up to 32 cores and 768 GB RAM
- Infrastructure
  - Compute: 900+ cores, 4+ TB RAM
    - Standard nodes (32 cores, 128 GB)
    - Bigmem nodes (64 cores, 768 GB)
  - Powered by StratusLab
  - Storage: 250+ TB, as virtual disks and object storage (S3)
[Figure: IDB Cloud overview. Labels: tools (BLAST, FastA, SSearch, ClustalW2, Clustal Omega, Muscle, HMMer, BWA, TopHat, samtools, fastqc, OMSSA, X!Tandem, PeptideShaker, ARIA, Galaxy, R) + Linux system; virtual machines; Bioinformatics Marketplace; create new cloud services; user data (structures, Galaxy, sequences, proteomics); public data (UNIPROT, EMBL, Genomes, PDB, PROSITE).]
Cloud extended services
- Native cloud services
  - Authentication
  - Virtual machine management
  - Persistent disk service
  - Client CLI, etc.
- IDB Bioinformatics Marketplace
  - Find appropriate appliances more easily
  - Reduce noise in the central Marketplace
  - Respect visibility constraints for the bioinformatics appliances, such as confidentiality
- Bioinformatics metadata: bio:tool
  - Additional elements related to bioinformatics tools, used to annotate appliances
  - Help users search for the tools themselves or for the type of analysis
  - Select suitable bioinformatics appliances containing the required tools
- Integrated Web interface
  - VM and virtual disk management
  - Browse bioinformatics appliances with bio:tool
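The idea of selecting appliances through their bio:tool annotations can be illustrated with a small sketch. This is not the Marketplace implementation: the `Appliance` class, the `find_appliances` function, the appliance names and the tool sets attached to them are all hypothetical, chosen only to show how a search on required tools might filter a catalog.

```python
from dataclasses import dataclass, field


@dataclass
class Appliance:
    """Hypothetical catalog entry: an appliance annotated with its tools."""
    name: str
    tools: set = field(default_factory=set)


def find_appliances(catalog, required_tools):
    """Return the names of appliances that contain every required tool.

    Tool names are compared case-insensitively.
    """
    wanted = {tool.lower() for tool in required_tools}
    return [
        appliance.name
        for appliance in catalog
        if wanted <= {tool.lower() for tool in appliance.tools}
    ]


# Illustrative catalog; the tool assignments are assumptions, not real metadata.
catalog = [
    Appliance("biocompute", {"BLAST", "ClustalW2", "BWA", "samtools"}),
    Appliance("proteomics-desktop", {"OMSSA", "X!Tandem", "PeptideShaker"}),
    Appliance("galaxy-portal", {"Galaxy", "BWA", "samtools", "TopHat"}),
]

matches = find_appliances(catalog, ["bwa", "samtools"])
print(matches)
```

A subset test keeps the search simple: an appliance matches only if it carries all the requested tools, which mirrors the "containing the required tools" criterion above.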
Driven through a simple web interface
Run your Bioinformatics Cloud Instances
[Figure: launching instances from the Bioinformatics Marketplace. Labels: Bioinformatics Marketplace (Sequence, Structure, NGS, Galaxy, ARIA); Launch Instances; PaaS: Portal (BLAST, Clustal, etc.); IaaS: launch jobs over ssh; Master & Storage VM; Shared FS; Workers VM (ARIA, CNS); IBCP's Cloud Resources.]
Biological Data in Cloud
- Upload your data: sftp/http/s3
- Public data sources (Genomes, EMBL, PDB, UNIPROT, PROSITE): shared (NFS)
- User persistent data: pdisk (iSCSI)
- Get your results: sftp/http/s3
[Figure: data flows in the Bioinformatics Cloud. Labels: PaaS: Portal (BLAST, Clustal, etc.); IaaS: launch jobs over ssh; Master & Storage VM; Shared FS; Workers VM (ARIA, CNS).]
Examples of Cloud Bioinformatics Appliances
Standard Bioinformatics node
- Biocompute appliance
- Use your own instance(s)
- With pre-installed standard bioinformatics tools
  - BLAST, FastA, SSearch, HMM, ...
  - ClustalW2, Clustal-Omega, Muscle, ...
  - Bowtie(2), BWA, samtools, ...
  - MEME, R, etc.
- Connected to public reference data
  - Uniprot, EMBL, genomes, PDB, etc.
  - Automatically shared with the VMs
Structural Biology
TOwards StruCtural AssignmeNt Improvement
- Goal: improve the determination of protein structures based on Nuclear Magnetic Resonance (NMR) information with the ARIA software
- Large computational needs: an NMR laboratory will not necessarily invest in building a cluster of about 100 nodes just to run such NMR structure calculations
- The flexibility of the cloud in deploying the different required bioinformatics tools can accelerate such a procedure
- Commercial interest in providing such tools to structural biologists on a pay-as-you-go basis
- Endorsers: Institut Pasteur Paris and CNRS IBCP
Proteomics desktop
- Motivation
  - Collaboration with a mass spectrometry platform
  - Running out of space on their local resources
- Protein identification
  - Experimental mass data
  - Reference databases: nr, Swiss-Prot
  - Reference screening tools: OMSSA, X!Tandem
- User interface
  - Remote display: NX
  - Reference GUIs: SearchGUI, PeptideShaker
[Screenshot source: PeptideShaker site]
MapReduce Biology
- Provide a turnkey virtual machine with a preconfigured MapReduce framework
- Accelerate big data analysis with the two-step map & reduce paradigm
- Hadoop MapReduce 1.0.4
- Appliances (2)
  - standard Hadoop MapReduce
  - bioinformatics software integrated in Hadoop
- Sequence similarity with the MapReduce paradigm
  - FastA & SSearch
  - deploy the database of sequences in HDFS
  - compare each structure to others
- Developed in the context of the French project MapReduce, ANR ARPEGE

FastAMR workflow
1. Users run the FastAMR script with their sequences and the databank
2. FastAMR splits the databank into subsets and puts them in the DFS along with the sequences file
3. Each mapper runs a FastA program on a part of the databank (subset #01 with FastA #01, subset #02 with FastA #02, ...)
4. Each mapper sends the scores and sequences to the reducers
5. Reducers copy the best scores of the whole experiment into the DFS
[Figure: FastAMR data flow. Labels: User's Sequences; Databank; subsets; Mappers; Reducers; Results (score, sequence).]
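The split / map / reduce steps described above can be sketched in a few lines of plain Python. This is only an illustration of the paradigm, not FastAMR itself: it runs in a single process without Hadoop, and `toy_score` (a count of shared 3-mers) is a stand-in assumed here in place of a real FastA run. All function names and the sample sequences are hypothetical.

```python
from heapq import nlargest


def split_databank(databank, n_subsets):
    """Split the databank (a list of (id, sequence) pairs) into n_subsets parts."""
    return [databank[i::n_subsets] for i in range(n_subsets)]


def kmers(sequence, k):
    """Return the set of distinct k-mers of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def toy_score(query, subject, k=3):
    """Stand-in for FastA (assumption): number of distinct k-mers shared."""
    return len(kmers(query, k) & kmers(subject, k))


def mapper(query, subset):
    """Map step: score the query against every sequence of one subset."""
    return [(toy_score(query, seq), seq_id) for seq_id, seq in subset]


def reducer(mapped_parts, top_n):
    """Reduce step: keep the best scores over all mapper outputs."""
    all_hits = [hit for part in mapped_parts for hit in part]
    return nlargest(top_n, all_hits)


# Illustrative data only.
databank = [
    ("s1", "ACGTACGT"),
    ("s2", "TTTTGGGG"),
    ("s3", "ACGTTTTT"),
    ("s4", "GGGGCCCC"),
]
query = "ACGTACGA"

subsets = split_databank(databank, 2)          # step 2: split the databank
mapped = [mapper(query, s) for s in subsets]   # steps 3-4: one mapper per subset
best = reducer(mapped, top_n=2)                # step 5: best scores overall
print(best)
```

In a real Hadoop deployment each mapper would run on a different node against a subset stored in HDFS; here the list comprehension over `subsets` plays that role sequentially.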
Use case with Galaxy
IDB Cloud account
- Log in
- Fill in the fields
  - institutional email address
- Submitting the request implies acceptance of the terms of use!
https://idee-b.ibcp.fr/cloud.html
Available appliances
- List of existing appliances
- Appliance-specific documentation
- Direct creation: Power button
Create my Galaxy portal
- Also called an instance
- Fill in the parameters
  - assign it a name
  - number of CPUs
  - memory size
  - attach a virtual disk
- Cluster of VMs
  - fill in the number of VMs
  - choose a unique name
Connecting to my Galaxy instance
Virtual hard disks
- A virtual disk lets you
  - keep your data independently of the execution of the VMs
  - find your data again from one VM to the next
- Actions
  - create a vdisk
  - manage your vdisks
- Using a vdisk
  - at VM creation
  - hot mounting
Exchanging data with my Galaxy portal
- sftp / scp
- Graphical client: Cyberduck, Transmit, FileZilla, ...
- Web: Galaxy - Get Data - download link
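For scripted transfers, the scp option could look like the sketch below. It only builds the command lines and prints them; the user name, host name and paths are placeholders assumed for illustration, and the two helper functions are hypothetical. The standard `scp source destination` syntax is the only part taken as given.

```python
import shlex


def scp_upload_command(local_path, user, host, remote_dir):
    """Build the scp argument list to copy a local file to the instance."""
    return ["scp", local_path, f"{user}@{host}:{remote_dir}"]


def scp_download_command(user, host, remote_path, local_dir):
    """Build the scp argument list to copy a result file back from the instance."""
    return ["scp", f"{user}@{host}:{remote_path}", local_dir]


# Placeholder values: replace with your own account and instance address.
upload = scp_upload_command("reads.fastq", "alice", "galaxy-vm.example.org", "/data/")
download = scp_download_command("alice", "galaxy-vm.example.org", "/data/results.tsv", ".")

# Print the commands as they would be typed in a shell.
print(shlex.join(upload))
print(shlex.join(download))
```

Passing one of these lists to `subprocess.run(upload, check=True)` would perform the actual transfer, provided scp is installed and the instance accepts your ssh credentials.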
Conclusion
- Added value of cloud, e.g. NGS with Galaxy
  - for scientific analyses: user-specific resources, isolated, different instances together
  - for training: Oct 2012 Bordeaux, May 2013 Galaxy Lille, (next) 2014 Galaxy Jouy
  - for tools integration: semantic annotation, solving software dependencies
  - for development & operations (DevOps): different versions at the same time
- Provide turnkey bioinformatics appliances
  - Standard tools and pipelines
  - New developments
  - Ready to run on clouds
- Public bioinformatics cloud (e.g. IDB)
  - Tightly connected to existing bioinformatics resources
  - Linked to public biological databases
  - In collaboration with the French Institute of Bioinformatics
IFB - French Institute of Bioinformatics (Institut Français de Bioinformatique)
Mission: to make available core bioinformatics resources to the national/international life science research community.
- To provide support for biology programs
  - supporting projects
  - training users
- To provide an IT infrastructure devoted to the management and analysis of biological data
  - material resources: CPUs, disks, etc.
  - availability of biology data collections
  - deployment of bioinformatics tools
- To act as a middleman between the life science community and the bioinformatics/computer science research community
IFB - Infrastructure
- IFB-Core resources
  - Academic cloud for life science
  - Will be hosted at the CNRS IDRIS supercomputing center (Paris)
  - A pilot infrastructure (2014-Q1)
  - Production infrastructure: 5,000+ cores, 1 PB (2014-S2)
- Regional resources
  - 6 regional bioinformatics centers
  - 6,000+ cores, ~1 PB
  - 2 existing clouds: PRABI-IBCP IDB cloud (Lyon) and Genouest genocloud (Rennes)
- Deploy a federation of clouds
[Figure: map of the IFB centers. Labels: IFB-core IT (CNRS-IDRIS, Paris); PRABI; APLIBIO; RENABI-NE; RENABI-GO; RENABI-SO; RENABI-GS.]
Questions?

Acknowledgment
- Clément Gauthey (IDB)
- StratusLab members
- Co-funding by the European Community's Seventh Framework Programme (INFSO-RI-261552) and by the French National Research Agency's Arpege Programme (ANR-10-SEGI-001)

http://idee-b.ibcp.fr