Interfaculty efforts to approach High Performance Computing (HPC) needs: a decade of history at Uniandes
Alejandro Reyes, Ph.D., Assistant Professor, Department of Biological Sciences, Universidad de los Andes
Harold Castro, Mario Villamizar, Andrés Holguin, Michael Perez, Diego M Riaño, Silvia Restrepo, Alejandro Reyes
Biology is Computer Science's next Physics. Physics gave us the Web: Tim Berners-Lee, a British scientist at CERN, invented the World Wide Web (WWW) in 1989.
Physics was the birth of HPC at Uniandes. Before 2005: two small computer clusters, one in Computer Engineering and one in Physics, plus university-wide resources from the IT department (DSIT). [Diagram labels: DSIT, School of Sciences, School of Engineering]
Growing need in physics for more computational power. How can the computational power within the university be better used by scientists inside and outside it? Solution? Grid computing, cloud computing, cluster computing. Dedicated infrastructure usually requires large financial investments.
EELA: E-Infrastructure shared between Europe and Latin America (2006)
[Diagram: users from several virtual organizations (VO UNIANDES, VO EDTEAM, VO DTEAM, VO EELA, VO OPT, VO CMS and other VOs) reach the grid over the Internet.]
This eventually led to the grid-colombia project to interconnect different universities and regions (2009). [Diagram: Uniandes users in VO UNIANDES reach the Physics site, the Biology site, the Systems Engineering site (MOX) and the DTI site through the DTI administration infrastructure.]
HPC and Biological Sciences? We all know Moore's law.
[Figure: log-log plot comparing sequencing platforms (Sanger ABI 3730xl, Roche/454 GS FLX, 454 GS Junior, Illumina GA II, Illumina MiSeq, Illumina HiSeq 2000/2500, Illumina HiSeq 2500 RR, Illumina HiSeq X, Illumina NextSeq 500, SOLiD, Ion Torrent PGM, Ion Proton, PacBio RS). Source: Lex Nederbragt (2014), http://dx.doi.org/10.6084/m9.figshare.100940]
Biology's equivalent to Moore's law: just better!
New generation of computer scientists now focused on solving biological problems
Projects
Applications (application: description; user)
- BLAST: matches a DNA or protein sequence against a set of known databases (DNA and protein).
- HMMER: suite of programs for building and analyzing statistical models of a multiple alignment. User: Biology
- InterproScan: workflow (pipeline) for the functional characterization of biological sequences.
- CMS SW: software of the CMS project. User: Physics
- PovRay: creates images and animations from code.
- MPICH: MPI implementation for parallelizing processes.
- Phedex: distributed information management. User: Physics
Administrative applications (application: description; user)
- Ganglia: monitors the status of the servers. User: DTI
- Genius: graphical user interface for using the grid. User: all
What was Biology's HPC capability (2010)?
[Diagram: Uniandes users in VO UNIANDES reach the Physics site, the Biology site, the Systems Engineering site (MOX) and the DTI site through the DTI administration infrastructure. The Biology site has a web server entrance (Mobyle, Galaxy) and an entrance workstation.]
- Biosge computing nodes: 3x 24 cores, 32 GB RAM; 1x 24 cores, 128 GB RAM
- NAS storage over NFS: 2x 0.5 TB
Application: metagenomic analysis of extreme environments (GEBIX). Metagenomes: what is common? What is different? Corporación Corpogen; Universidad Nacional; Universidad del Cauca; Universidad del Valle; Universidad Javeriana; Universidad de Caldas; Uniandes
Bioinformatics for GEBIX [Diagram: nodes gebix2 to gebix8 and shared storage, reachable over the Internet through www.gebix.org.co, linked to bioinformatics machines at PUJ and UNIANDES.]
But most bioinformatics users don't like command-line software! LONI Pipeline (courtesy of Javier Tabima)
Building pipelines and running them from a web server allows different infrastructures to be used. The computing cluster was heavily used, while other computational resources such as teaching classrooms and laboratories remained idle most of the time.
Solution: Bio-UnaGrid
A virtual machine runs on each computer of a lab and works as a slave node of a cluster. A dedicated node is needed as the cluster master.
Virtual clusters can be defined per research group, per custom application environment, etc. A grid solution (several virtual clusters) can be deployed to fulfill different needs. It allows heterogeneous resources, which is usually not a good idea in conventional cluster computing.
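The virtual-cluster idea above can be illustrated with a minimal sketch, assuming a simple greedy policy: idle lab machines are claimed as worker VMs for a research group until a requested core count is covered. The class names, machine names and the greedy policy are all hypothetical and are not taken from the Bio-UnaGrid implementation.

```python
from dataclasses import dataclass, field


@dataclass
class LabMachine:
    """A desktop in a teaching lab (hypothetical model)."""
    name: str
    cores: int
    idle: bool = True


@dataclass
class VirtualCluster:
    """A virtual cluster owned by one research group."""
    group: str
    master: str  # name of the dedicated master node
    workers: list = field(default_factory=list)

    @property
    def cores(self) -> int:
        return sum(machine.cores for machine in self.workers)


def build_virtual_cluster(group, master, machines, cores_needed):
    """Greedily claim idle machines until cores_needed is covered."""
    cluster = VirtualCluster(group=group, master=master)
    for machine in machines:
        if cluster.cores >= cores_needed:
            break
        if machine.idle:
            machine.idle = False  # the machine now hosts a worker VM
            cluster.workers.append(machine)
    return cluster


# Heterogeneous lab machines; one of them is busy with a class.
lab = [
    LabMachine("lab-pc-01", 4),
    LabMachine("lab-pc-02", 2, idle=False),
    LabMachine("lab-pc-03", 8),
    LabMachine("lab-pc-04", 4),
]
bio = build_virtual_cluster("biology", "master-bio", lab, cores_needed=10)
print([machine.name for machine in bio.workers], bio.cores)
```

Machines left idle after one request stay available, so a second research group could build its own virtual cluster from the remainder.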
Applying LONI Pipeline on the grid infrastructure.
What if we want to select what type of machines to use? We want to deploy on-demand computing services: cloud computing. And if we still want to use idle computing machines at the university? UnaCloud.
The Second International Conference on Cloud Computing, GRIDs, and Virtualization (CLOUD COMPUTING 2011). UnaCloud, the problem: more than 2000 CPU cores.
UnaCloud, the desired solution: Debian with PBS.
UnaCloud
- UnaCloud validates the convergence of cloud computing and virtual clusters, offering promising opportunities to meet customized computational requirements through the use of an open source, low cost, extensible, interoperable, efficient, scalable, secure and opportunistic IaaS model.
- UnaCloud provides a multipurpose cloud computing experimental platform to deploy Customizable Virtual Clusters that support new specific computational requirements of academic and research projects.
- UnaCloud represents an economically attractive solution for building and deploying large scale computing infrastructures.
- UnaCloud's cloud computing features promise to shorten the development cycle and the generation of results through agile and flexible provisioning and sharing of low cost computing resources.
Mario Villamizar, Harold Castro
Current status of HPC in Uniandes
Joint effort: Sciences, DSIT, Engineering
Current infrastructure
- 10 servers with 4 cores and 8 GB RAM; 2 servers with 8 cores and 16 GB RAM; 6 servers with 16 cores and 32 GB RAM. Total cores: 150
- 3 servers with 24 cores and 32 GB RAM; 1 server with 24 cores and 128 GB RAM. Total cores: 96
- 6 servers with 64 cores and 128 GB RAM. Total cores: 384
- DSIT - Biology: storage 15.5 TB (/Users, /Applications, /Scratch). Total current users: 30
- Engineering (MOX): storage 10 TB (/Users, /Applications, /Scratch). Total current users: 68
Installed infrastructure
- 17 servers with 24 cores and 192 GB RAM; 2 servers with 24 cores and 512 GB RAM. Total cores: 456. Storage: 90 TB
- 6 servers with 64 cores and 128 GB RAM. Total cores: 384
- Total cores: 840
- Centralized administration
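The installed-infrastructure totals can be checked with a short tally. This is a minimal sketch: the server counts and cores per server are copied from the slide, while the variable and function names are hypothetical.

```python
# (number of servers, cores per server), as listed under "Installed infrastructure"
new_servers = [(17, 24), (2, 24)]
existing_servers = [(6, 64)]


def total_cores(groups):
    """Sum servers * cores-per-server over a list of server groups."""
    return sum(count * cores for count, cores in groups)


new_total = total_cores(new_servers)
existing_total = total_cores(existing_servers)
print(new_total, existing_total, new_total + existing_total)
```

The three printed values correspond to the 456, 384 and 840 total cores reported on the slide.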
Study of viruses (phages) in the human gut
Why study phages?
- Phages are the most abundant biological group on the planet.
- Phages are more diverse than their bacterial prey, by an estimated ratio of 10 phages per microbe.
- They play important roles in marine microbial communities.
- Important drivers of energy balance and nutrient recycling.
- Shape microbial communities and generate diversity at strain level.
Rodriguez-Brito B, et al. (2010) Viral and microbial community dynamics in four aquatic environments. ISME J 4(6):739-751.
Why study human gut phages?
- Ecological importance: community dynamics, lytic cycles.
- Evolutionary importance: predator-prey dynamics, role in shaping community adaptations/diversity.
- Lysogenic cycle: horizontal gene transfer, response to environmental signals.
- Virome diversity: a fingerprint that can provide more resolution than bacterial species, health versus disease.
Phage lifecycle [Figure] Reyes, A., et al (2012). Going viral: next-generation sequencing applied to phage populations in the human gut. Nature Reviews Microbiology. 10(9) 607-17
Initial characterization of human virome
- Initial MOAFTs samples: 4 families, MZ twin pairs + mother, 3 time points (0, 2, 12 months)
- Virus purification: VLPs from frozen fecal samples
- 454 sequencing (MDA-amplified VLP DNA)
- Comparison against reference DB: NR_Viral_DB (tblastx)
Sample nomenclature: F: family; T1, T2: twins; M: mother; (R): technical replicate
[Figure: percent assignable reads (0 to 100) for each sample, labeled with the nomenclature above, for example F1T1.1, F1T2.1(R), F2M.1.] Average unknown: 81±6%
Reyes, A., et al (2010). Viruses in the faecal microbiota of monozygotic twins and their mothers. Nature, 466(7304), 334-338.
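The sample nomenclature lends itself to a small parser. The following is a minimal sketch: the regular expression and field names are hypothetical, and reading the number after the dot as a time-point index is an assumption, since the slide does not define it explicitly.

```python
import re
from typing import NamedTuple

# F<family><T1|T2|M>.<index>, optionally followed by (R) for a technical replicate.
SAMPLE_RE = re.compile(
    r"^F(?P<family>\d+)(?P<member>T[12]|M)\.(?P<index>\d+)(?P<rep>\(R\))?$"
)


class Sample(NamedTuple):
    family: int
    member: str      # "T1", "T2" (twins) or "M" (mother)
    index: int       # assumed to be the time-point index
    replicate: bool  # True when the label ends in "(R)"


def parse_sample(label: str) -> Sample:
    """Split a label such as 'F1T2.1(R)' into its components."""
    match = SAMPLE_RE.match(label)
    if match is None:
        raise ValueError(f"unrecognized sample label: {label!r}")
    return Sample(
        family=int(match["family"]),
        member=match["member"],
        index=int(match["index"]),
        replicate=match["rep"] is not None,
    )


print(parse_sample("F1T2.1(R)"))
print(parse_sample("F4M.3"))
```

Parsing the labels this way makes it straightforward to group samples by family or by individual before comparing viromes.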
The Malawi viromes and malnutrition
[Figure: sampling timeline by age (0 to 30 months) for twin pairs (Dz = dizygotic, Mz = monozygotic), siblings and mothers, with families grouped as healthy, moderate malnutrition, marasmus and kwashiorkor, and RUTF treatment marked.]
Total: 231 samples. Average 56,000 454 reads/sample.
Reyes, A., et al (2015). Manuscript in preparation
New assembly strategies
[Figure: (left) number of samples by percent of data used (50% to 100%); (right) contig length (bp) versus median coverage of contig on log scales, for linear and circular contigs.]
- Average 90% of data assembled.
- Contigs 500 nt to 200 kb in length.
- Large number of circular contigs (potential full genomes).
Reyes, A., et al (2015). Manuscript in preparation
Contigs specific for twin pairs
[Figure: abundance of VLP-derived contigs, Log (RPMM), per family for mothers and siblings, grouped as healthy, kwashiorkor and marasmus.]
- Contigs present at high abundance in a twin pair over time.
- Higher number of discriminatory contigs in twin pairs that developed malnutrition.
Reyes, A., et al (2015). Manuscript in preparation
High diversity of new Eukaryotic viruses Reyes, A., et al (2015). Manuscript in preparation
Different lifestyles provide different advantages in the human gut. Reyes A, Semenkovich NP, Whiteson K, Rohwer F, & Gordon JI (2012) Going viral: next-generation sequencing applied to phage populations in the human gut. Nat Rev Microbiol:1-11.
Evaluate viral:bacterial interactions in a controlled environment. First, introduce 15 prominent, sequenced members of the human gut microbiota into groups of 5 adult germ-free C57BL/6 mice. Then, after the model microbiota assembles in the mice, add a pool of previously characterized, purified VLPs from 5 healthy adults. Groups: microbes + VLPs; microbes + heat-killed VLPs; no microbes + VLPs.
Community changes observed during assembly [Figure panels: add live VLPs; add heat-killed VLPs]
Temporal changes in the microbial community -> staged VLP attack [Figure: Bacteroides caccae relative abundance (0.00 to 0.15) over time (days 3 to 45) in mice given live VLPs, with the time of addition of live VLPs marked.]
Change not seen with heat-killed VLPs. [Figure: B. caccae relative abundance over time (days 3 to 45), live VLP versus heat-killed VLP groups.]
[Figure: same plot with ϕhsc01 absolute abundance in the live VLP group added on a second axis (square root of viral genome equivalents per mg fecal pellet, 0 to 20000).]
Assembled 37 kb circular phage (ϕhsc01, 37,323 bp), contains:
- Phage genes
- Terminase
- Helicase
- DNA polymerase
- Bacteroides-associated carbohydrate binding protein
- Anaerobic bacterial stress response transcription factor
[Figure: circular genome map of ϕhsc01 next to the abundance plot described above.]
Four other novel viral genomes identified: ϕhsc02 (6,209 bp), ϕhsc03 (153,451 bp), ϕhsc04 (104,253 bp), ϕhsc05 (95,864 bp). No obvious associations with a particular bacterial host. No evidence of integration into bacterial genomes. [Figure: circular genome maps, and viral abundance (square root of genome equivalents per mg fecal pellet weight) over time (days 15 to 45) for model community + live VLP versus model community + heat-killed VLP.]
New sequencing/computing technologies have helped us see the clear picture: early Sanger sequencing
Current metagenomics
Hopefully in the near future. However, this doesn't allow us to learn enough about the biology!
Harold Castro, Mario Villamizar, Andrés Holguin, Michael Perez, Diego M Riaño, Silvia Restrepo
Thank you!
bcem.uniandes.edu.co