Interfaculty efforts to approach High Performance Computing (HPC) needs. A decade history at Uniandes




Alejandro Reyes, Ph.D., Assistant Professor, Department of Biological Sciences, Universidad de los Andes

Authors: Harold Castro, Mario Villamizar, Andrés Holguin, Michael Perez, Diego M Riaño, Silvia Restrepo, Alejandro Reyes

Biology is computer science's next physics. Physics gave us the Web: Tim Berners-Lee, a British scientist at CERN, invented the World Wide Web (WWW) in 1989.

Physics was the birthplace of HPC at Uniandes. Before 2005 there were two small computer clusters, one in Computer Engineering (School of Engineering) and one in Physics (School of Sciences), plus university-wide resources from the IT department (DSIT).

Physics had a growing need for more computational power. How could the computational power within the university be better used by scientists inside and outside it? Candidate solutions: grid computing, cloud computing, cluster computing. Dedicated infrastructure usually requires large financial investments.

EELA: E-Infrastructure shared between Europe and Latin America (2006)

[Diagram: users reach the grid over the Internet through virtual organizations (VOs): VO EDTEAM, VO DTEAM, VO EELA, VO UNIANDES, VO OPT, VO CMS, and other VOs.]

EELA eventually led to the Grid Colombia project to interconnect different universities and regions (2009).

[Diagram: Uniandes users and VO UNIANDES spanning the Physics site, the Biology site, the Systems Engineering site (MOX), and the DTI site, with a DTI administration infrastructure.]

HPC and Biological Sciences? We all know Moore's law.

[Figure: log-scale plot comparing sequencing platforms: Sanger (ABI 3730xl), Roche/454 GS FLX, 454 GS Junior, Illumina GA II, Illumina HiSeq 2000/2500, HiSeq 2500 RR, HiSeq X, NextSeq 500, Illumina MiSeq, SOLiD, Ion Torrent PGM, Ion Proton, PacBio RS. Source: Lex Nederbragt (2014), http://dx.doi.org/10.6084/m9.figshare.100940]

Biology's equivalent to Moore's law: just better!

A new generation of computer scientists is now focused on solving biological problems.

Projects (application: description; user)
BLAST: matches a DNA or protein sequence against a set of known databases (DNA and protein). User: Biology
HMMER: suite of programs for building and analyzing statistical models of a multiple alignment. User: Biology
InterProScan: workflow (pipeline) for the functional characterization of biological sequences. User: Biology
CMS SW: software of the CMS project. User: Physics
PovRay: creates images and animations from code.
MPICH: implementation of MPI for parallelizing processes.
Phedex: management of distributed information. User: Physics

Administrative applications (application: description; user)
Ganglia: monitors the state of the servers. User: DTI
Genius: graphical user interface for using the grid. User: All

What was Biology's HPC capability (2010)?

[Diagram: within VO UNIANDES (Physics, Biology, Systems Engineering/MOX, and DTI sites), the Biology site comprises a web server entrance (Mobyle, Galaxy), a workstation, NAS storage (NFS, 2 x 0.5 TB), and the Biosge computing nodes: 3 nodes with 24 cores and 32 GB RAM, and 1 node with 24 cores and 128 GB RAM.]

Application: metagenomic analysis of extreme environments (GEBIX)

Metagenomes: what is common? What is different?

Participants: Corporación Corpogen; Universidad Nacional; Universidad del Cauca; Universidad del Valle; Universidad Javeriana; Universidad de Caldas; Uniandes

Bioinformatics for GEBIX

[Diagram: servers gebix 2 through gebix 8 and shared storage, reachable over the Internet at www.gebix.org.co, linked to hosts at PUJ and Uniandes (biown, bioinfmac, bioinf).]

But most bioinformatics users don't like command-line software! LONI Pipeline (courtesy of Javier Tabima).

Building pipelines and running them from a web server allows different infrastructures to be used. The computing cluster was heavily used, while other computational resources, such as teaching classrooms and laboratories, remained idle most of the time.

Solution: Bio-UnaGrid

A virtual machine runs on each computer of a lab and works as a worker (slave) node of a cluster. A dedicated node is needed as the cluster master.

Virtual clusters can be defined per research group, per custom application environment, etc. A grid solution (several virtual clusters) can be deployed to fulfill different needs. The grid allows heterogeneous resources, which is not a good idea in cluster computing.

Applying LONI Pipeline on the grid infrastructure.
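The master/worker arrangement described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not Bio-UnaGrid's actual scheduler: the function name `schedule`, the worker names, and the relative speeds are all hypothetical. It shows a master handing each job to whichever worker becomes free first, and how heterogeneous worker speeds affect the total run time.

```python
import heapq


def schedule(jobs, worker_speeds):
    """Greedy master: give each job to the worker that becomes free first.

    jobs: list of job costs in arbitrary work units, dispatched in order.
    worker_speeds: dict of worker name -> relative speed (hypothetical values).
    Returns (assignment, makespan), where assignment maps each worker to the
    indices of the jobs it ran and makespan is the finish time of the last job.
    """
    # Min-heap of (time at which the worker becomes free, worker name).
    free_at = [(0.0, name) for name in sorted(worker_speeds)]
    heapq.heapify(free_at)
    assignment = {name: [] for name in worker_speeds}
    makespan = 0.0
    for job_id, cost in enumerate(jobs):
        start, name = heapq.heappop(free_at)
        finish = start + cost / worker_speeds[name]
        assignment[name].append(job_id)
        makespan = max(makespan, finish)
        heapq.heappush(free_at, (finish, name))
    return assignment, makespan


if __name__ == "__main__":
    # Two hypothetical lab machines, one twice as fast as the other.
    print(schedule([4, 4, 4, 4, 4, 4], {"lab-pc-1": 1.0, "lab-pc-2": 2.0}))
```

In this toy run the faster machine ends up with twice as many jobs as the slower one, which is the kind of imbalance a scheduler has to absorb when lab machines are heterogeneous.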

What if we want to select what type of machines to use? We want to deploy on-demand computing services: cloud computing. And if we still want to use idle computing machines at the university? UnaCloud.

UnaCloud (The Second International Conference on Cloud Computing, GRIDs, and Virtualization, CLOUD COMPUTING 2011)

The problem: more than 2000 CPU cores.

The desired solution: Debian with PBS.

UnaCloud

UnaCloud validates the convergence of cloud computing and virtual clusters, offering promising opportunities to meet customized computational requirements through an open source, low cost, extensible, interoperable, efficient, scalable, secure, and opportunistic IaaS model.

UnaCloud provides a multipurpose cloud computing experimental platform to deploy Customizable Virtual Clusters that support new, specific computational requirements of academic and research projects.

UnaCloud represents an economically attractive solution for building and deploying large-scale computing infrastructures.

UnaCloud's cloud computing features promise to shorten the development cycle and the generation of results through agile and flexible provisioning and sharing of low-cost computing resources.

Mario Villamizar, Harold Castro
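The idea of choosing what type of machine to use while still drawing on idle university computers can be sketched as a simple filter over an inventory. Everything below is a hypothetical illustration: the `Machine` record, the `select_machines` function, and the sample values are assumptions and do not reflect UnaCloud's real data model or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Machine:
    """Hypothetical inventory record for one desktop or lab computer."""
    name: str
    cores: int
    ram_gb: int
    idle: bool


def select_machines(inventory, count, min_cores, min_ram_gb):
    """Return `count` idle machines meeting the minimum spec, or [] if too few.

    Smaller machines that still satisfy the request are preferred, so larger
    ones stay available for more demanding virtual clusters.
    """
    candidates = [
        m for m in inventory
        if m.idle and m.cores >= min_cores and m.ram_gb >= min_ram_gb
    ]
    candidates.sort(key=lambda m: (m.cores, m.ram_gb, m.name))
    return candidates[:count] if len(candidates) >= count else []


if __name__ == "__main__":
    labs = [
        Machine("lab-a-01", 4, 8, True),
        Machine("lab-a-02", 4, 8, False),   # in use by a student
        Machine("lab-b-01", 8, 16, True),
        Machine("lab-b-02", 2, 4, True),    # too small for the request below
    ]
    print([m.name for m in select_machines(labs, 2, min_cores=4, min_ram_gb=8)])
```

Returning an empty list when the request cannot be met in full is one possible design choice; a real opportunistic system might instead queue the request until enough machines become idle.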

Current status of HPC in Uniandes

Joint effort: Sciences, DSIT, Engineering.

Current infrastructure
10 servers, 4 cores and 8 GB RAM
2 servers, 8 cores and 16 GB RAM
6 servers, 16 cores and 32 GB RAM
Total cores: 150
3 servers, 24 cores and 32 GB RAM
1 server, 24 cores and 128 GB RAM
Total cores: 96
6 servers, 64 cores and 128 GB RAM
Total cores: 384
DSIT - Biology: storage 15.5 TB (/Users, /Applications, /Scratch); total current users: 30
Engineering (MOX): storage 10 TB (/Users, /Applications, /Scratch); total current users: 68

Installed infrastructure
17 servers, 24 cores and 192 GB RAM
2 servers, 24 cores and 512 GB RAM
Total cores: 456
Storage: 90 TB
6 servers, 64 cores and 128 GB RAM
Total cores: 384
Total cores: 840
Centralized administration
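The core totals on this slide can be rechecked with a few lines of arithmetic. The sketch below uses only the server counts and cores per server listed above; the group labels are hypothetical names chosen for illustration. As listed, the first group sums to 152 cores, while the slide states 150; the source does not explain the difference. The other totals (96, 384, 456, and 840) match.

```python
# (number of servers, cores per server), copied from the slide.
# The dictionary keys are illustrative labels, not names used in the source.
GROUPS = {
    "current_small": [(10, 4), (2, 8), (6, 16)],
    "current_24_core": [(3, 24), (1, 24)],
    "current_64_core": [(6, 64)],
    "installed_24_core": [(17, 24), (2, 24)],
}


def total_cores(servers):
    """Sum servers x cores-per-server over one group."""
    return sum(count * cores for count, cores in servers)


TOTALS = {name: total_cores(servers) for name, servers in GROUPS.items()}

# Installed infrastructure: the new 24-core servers plus the 64-core servers.
INSTALLED_TOTAL = TOTALS["installed_24_core"] + TOTALS["current_64_core"]

if __name__ == "__main__":
    for name, cores in TOTALS.items():
        print(f"{name}: {cores} cores")
    print(f"installed total: {INSTALLED_TOTAL} cores")
```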

Study of viruses (phages) in the human gut

Why study phages?

Phages are the most abundant biological group on the planet.
Phages are more diverse than their bacterial prey and outnumber them by an estimated ratio of 10 phages per microbe.
They play important roles in marine microbial communities.
They are important drivers of energy balance and nutrient recycling.
They shape microbial communities and generate diversity at the strain level.

Rodriguez-Brito B, et al. (2010) Viral and microbial community dynamics in four aquatic environments. ISME J 4(6):739-751.

Why study human gut phages?

Ecological importance: community dynamics, lytic cycles.
Evolutionary importance: predator-prey dynamics, role in shaping community adaptations and diversity.
Lysogenic cycle: horizontal gene transfer, response to environmental signals.
Virome diversity: a fingerprint that can provide more resolution than bacterial species (health versus disease).

Phage life cycle

Reyes, A., et al. (2012). Going viral: next-generation sequencing applied to phage populations in the human gut. Nature Reviews Microbiology, 10(9), 607-617.

Initial characterization of the human virome

Initial MOAFTS samples: 4 families, MZ twin pairs + mother, 3 time points (0, 2, 12 months).
Virus purification: VLPs from frozen fecal samples.
454 sequencing (MDA-amplified VLP DNA).
Comparison against reference DB: NR_Viral_DB (tblastx).

Sample nomenclature: F: family; T1, T2: twins; M: mother; (R): technical replicate.

[Figure: percent assignable reads (0 to 100) per sample, with sample labels such as F1T1.1, F1T2.1(R), F1M.1, F2M.1(R), and F4M.3(R).]

Average unknown: 81 ± 6%.

Reyes, A., et al. (2010). Viruses in the faecal microbiota of monozygotic twins and their mothers. Nature, 466(7304), 334-338.
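The sample nomenclature above is regular enough to parse mechanically. The following is a minimal sketch, assuming that the digit after the dot is the index of the time point (the slide defines F, T1/T2, M, and (R) but does not spell out that suffix). The `Sample` type and `parse_sample` function are hypothetical names introduced for illustration.

```python
import re
from typing import NamedTuple


class Sample(NamedTuple):
    family: int       # number after "F"
    member: str       # "T1" or "T2" for a twin, "M" for the mother
    time_point: int   # assumed meaning of the digit after the dot
    replicate: bool   # True when the label ends in "(R)"


# Example labels from the slide: F1T1.1, F1T2.1(R), F2M.3
_LABEL = re.compile(r"^F(\d+)(T[12]|M)\.(\d+)(\(R\))?$")


def parse_sample(label: str) -> Sample:
    """Split a label such as 'F1T2.1(R)' into its components."""
    match = _LABEL.match(label)
    if match is None:
        raise ValueError(f"unrecognized sample label: {label!r}")
    family, member, time_point, replicate = match.groups()
    return Sample(int(family), member, int(time_point), replicate is not None)


if __name__ == "__main__":
    for label in ["F1T1.1", "F1T2.1(R)", "F2M.3"]:
        print(label, parse_sample(label))
```

Parsed this way, labels can be grouped by family or by individual, for example to pair each technical replicate with its original sample.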

The Malawi viromes and malnutrition

[Figure: sampling timelines by age (0 to 30 months) for twin pairs. Legend: Dz = dizygotic, Mz = monozygotic; sibling, mother; RUTF; kwashiorkor, marasmus, moderate malnutrition, healthy. Families shown: F93, F229, F112, F194, F284, F23, F56, F268, F196, F57, F138, F26, F10, F95, F37, F301, F47, F121, F209, F259.]

Total: 231 samples. Average: 56,000 454 reads per sample.

Reyes, A., et al. (2015). Manuscript in preparation.

New assembly strategies

[Figure: number of samples by percent of data used (50% to 100%); contig length (bp) versus median coverage of contig, for linear and circular contigs.]

On average, 90% of the data was assembled. Contigs range from 500 nt to 200 kb in length. There is a large number of circular contigs (potential full genomes).

Reyes, A., et al. (2015). Manuscript in preparation.
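The slide does not say how circular contigs were recognized. One widely used heuristic, shown here purely as an assumption and not as the authors' method, is to look for a contig whose start repeats at its end, which is what an assembler tends to produce when it reads around a circular genome. The function names and the `min_len` threshold are hypothetical.

```python
def terminal_overlap(seq: str, min_len: int = 20) -> int:
    """Length of the longest prefix of `seq` that is also its suffix.

    Only overlaps of at least `min_len` and at most half the contig length
    are considered. Returns 0 when no such overlap exists.
    """
    max_len = len(seq) // 2
    for k in range(max_len, min_len - 1, -1):
        if seq[:k] == seq[-k:]:
            return k
    return 0


def is_putatively_circular(seq: str, min_len: int = 20) -> bool:
    """Flag a contig whose two ends share an exact overlap of >= min_len bases."""
    return terminal_overlap(seq, min_len) > 0


if __name__ == "__main__":
    end = "ATGCCGTAGC"
    circular_like = end + "T" * 30 + end      # same 10 bases at both ends
    linear_like = end + "T" * 30 + "G" * 10   # ends do not match
    print(terminal_overlap(circular_like, min_len=10))
    print(is_putatively_circular(linear_like, min_len=10))
```

An exact-match test like this is deliberately strict; real data with sequencing errors would usually call for an alignment-based comparison of the contig ends.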

Contigs specific for twin pairs

Some contigs are present at high abundance in a twin pair over time. Twin pairs that developed malnutrition show a higher number of discriminatory contigs.

[Figure: VLP-derived contigs, log(RPMM), for mothers and siblings, grouped as healthy, kwashiorkor, and marasmus. Families shown: F301, F10, F37, F47, F121, F209, F259, F95, F56, F196, F268, F26, F57, F138, F93, F112, F194, F229, F284, F23.]

Reyes, A., et al. (2015). Manuscript in preparation.

High diversity of new eukaryotic viruses. Reyes, A., et al. (2015). Manuscript in preparation.

Different lifestyles provide different advantages in the human gut. Reyes A, Semenkovich NP, Whiteson K, Rohwer F, & Gordon JI (2012) Going viral: next-generation sequencing applied to phage populations in the human gut. Nat Rev Microbiol 10(9):607-617.

Evaluate viral-bacterial interactions in a controlled environment

First, introduce 15 prominent, sequenced members of the human gut microbiota into groups of 5 adult germ-free C57BL/6 mice.
Then, after the model microbiota assembles in the mice, add a pool of previously characterized, purified VLPs from 5 healthy adults.

Treatment groups: microbes + VLPs; microbes + heat-killed VLPs; no microbes + VLPs.

Community changes observed during assembly

[Figure: community composition over time, with the points at which live VLPs and heat-killed VLPs were added.]

Temporal changes in the microbial community: staged VLP attack

[Figure: Bacteroides caccae relative abundance (0.00 to 0.15) over time (days 3 to 45), with the time of addition of live VLPs marked. Series: B. caccae with live VLPs; B. caccae with heat-killed VLPs; ϕhsc01 absolute abundance with live VLPs (square root of viral genome equivalents per mg fecal pellet, 0 to 20,000).]

The change is not seen with heat-killed VLPs.

ϕhsc01 (37,323 bp): assembled 37 kb circular phage. It contains:
- Phage genes
- Terminase
- Helicase
- DNA polymerase
- Bacteroides-associated carbohydrate-binding protein
- Anaerobic bacterial stress response transcription factor

Four other novel viral genomes identified

ϕhsc02: 6,209 bp
ϕhsc03: 153,451 bp
ϕhsc04: 104,253 bp
ϕhsc05: 95,864 bp

No obvious associations with a particular bacterial host. No evidence of integration into bacterial genomes.

[Figure: circular genome maps of ϕhsc02 to ϕhsc05, and viral abundance (square root of genome equivalents per mg fecal pellet weight) over time (days 15 to 45) for the model community + live VLPs and the model community + heat-killed VLPs.]

New sequencing and computing technologies have helped us see the clear picture

[Image sequence: early Sanger sequencing; current metagenomics; hopefully in the near future.]

However, this doesn't allow us to learn enough about the biology!

Thank you!

Harold Castro, Mario Villamizar, Andrés Holguin, Michael Perez, Diego M Riaño, Silvia Restrepo

bcem.uniandes.edu.co