Bioinformatics for Informatics. Vinicius Vielmo Cogo. Presented at UFSM, Santa Maria, RS, BR, April 08, 2014.
Context: the presenter is not a bioinformatician; is a computer scientist from Inf2006-UFSM; is a PhD student at ULisboa; is a researcher at LaSIGE (Navigators group); has as main research areas distributed systems, dependability, fault tolerance, and security. But why are you going to present about bioinformatics?
Context: interesting area; many open problems; opportunities to acquire and apply knowledge; opportunities from projects. Degrees: Bachelor in Computer Science (UFSM), Master in Informatics (ULisboa), PhD in Informatics (ULisboa). Projects: PET-CC, FTH-GRID, CloudFIT, BiobankCloud, LSC, TClouds.
Context: this presentation is not an introduction to bioinformatics; is not a study of; is a different way of looking at bioinformatics, from an informatics perspective.
What is bioinformatics?
What is bioinformatics? Biology is the study of life and living organisms.
What is bioinformatics? Mathematics, astronomy, physics, biology, ethics, chemistry, informatics, statistics.
What is bioinformatics? Biology, bioinformatics (?), informatics.
What is bioinformatics? Between biology and informatics: bioinformatics; computational biology (modeling virtual biological systems and organs); biological computing (searching for the optimal path in a graph based on the behavior of ants).
What is bioinformatics? Bioinformatics is an informatics discipline for storing, retrieving, organizing, and analyzing biological data.
Motivation: but what makes biological data so special? Data complexity (size, quantity, meaning); incomplete and uncertain data.
Outline: 1. Size; 2. Quantity; 3. Meaning
1. Size
DNA: TCTCATGCATACGCAGCAGGTCAGGCAT. DNA is a very large string, composed only of the characters A, C, G, and T.
Human DNA: from human beings to DNA. Figure: a human being, cells, nucleus, chromosomes, DNA. Labels: 10-50 trillion cells (50 000 000 000 000); 1 nucleus; 23 pairs of chromosomes (46); 2 molecules; 210 types of cells; 300 million cells die every minute.
Human DNA: TCTCATGCATACGCAGCAGGTCAGGCAT. Human DNA has 3.2 B characters (bp, base pairs).
Genome size
Organism | Genome size (bp)
Porcine circovirus type 1 | 1.8 K
Drosophila melanogaster | 130 M
Mus musculus | 2.7 B
Homo sapiens | 3.2 B
Zea mays | 5 B
Marbled lungfish | 130 B
Printing a human genome: in Times New Roman 12 pt, 2622 bp fit per page, which gives 1.2 M single-sided pages, or 2415 reams (500 sheets each), a stack 129 meters high (the figure compares 129 m with heights of 93 m and 30 m).
Printing a human genome (Leighton Pritchard): 4.5 pt font, double-sided pages, 120 book volumes, 1 bookshelf.
Loading a human genome to memory: does it work? char[] humandna = char[3 200 000 000]; In most programming languages: no. Most data structures are indexed by a signed int, which limits them to about 2.1 B chars (2^31-1 = 2 147 483 647). Solutions?
Loading a human genome to memory: char[] humandna = char[3 200 000 000]; Solutions: unsigned int indexes, up to 4 294 967 295 (2^32-1); 64-bit indexes, up to about 9 000 000 000 000 000 000 (2^63-1); matrices, humandna = char[100][32 000 000]; split into chromosomes with a Map<Key, Value>, i.e., Map<chrName, chrSequence>.
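The index limit and the per-chromosome split can be sketched in Python, one of the languages this talk names later. This is only an illustration: the helper names and the two toy sequences are made up, and a real genome would hold one long sequence per chromosome (the longest human chromosome has roughly 250 million bases).

```python
# Largest index reachable with a signed 32-bit integer.
INT32_MAX = 2**31 - 1
HUMAN_GENOME_BP = 3_200_000_000


def fits_in_signed_32bit_index(length):
    """True if a sequence of this length can be indexed by a signed 32-bit int."""
    return length <= INT32_MAX


# Map<chrName, chrSequence>: one byte per base, keyed by chromosome name.
# The names and sequences below are toy placeholders, not real data.
genome = {
    "chr1": bytearray(b"ATGCATCGTACG"),
    "chr2": bytearray(b"TCGTACGCTCAT"),
}


def base_at(genome, chrom, position):
    """Return the base at a 0-based position of the given chromosome."""
    return chr(genome[chrom][position])
```

The whole genome does not fit under the signed 32-bit limit, but each chromosome on its own does, which is why the split is a practical workaround.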
Human chromosomes: the human genome is not a single contiguous DNA molecule; it is split across chromosomes.
Storing a human genome in a file: the entire human genome has 3.2 B chars, which take 3 GB in UTF-8 or ISO 8859-1 (1 char = 1 byte = 8 bits).
Storing a human genome in a file: FASTA format. Broadly used; description lines start with >; also stores incomplete or small sequences. Example:
> HG00339 chr1
ATGCATCGTACGCTCATCATGCAGC
AGACTGCTAGCTGATGGTACGTCAT
GCAGTCGTCAGTCAGTCAGTACATC
> HG00339 chr2
TCGTACGCTCATCATGCAGCAGGTC
AGTACCATGGTCAGACTCATTGCAA
CTGCTAGCTGATGATCGGCATAGT
> HG00339 chr22
CAGTACGTCAGTGCTAGCTAGCTAC
GATCGATCGACTGATACGATC
> HG00339 chry
GGGTACTCATGCAGTCAGTCAGTTG
CAGTCAGTCAGTCGCAGTCA
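Reading such a file back into the chromosome map is straightforward. The parser below is a minimal sketch, assuming well-formed input where every sequence follows a line starting with >; the function name parse_fasta is illustrative, and real projects usually rely on an established library instead.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping each description to its sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {name: "".join(parts) for name, parts in records.items()}


example = """> HG00339 chr1
ATGCATCGTACGCTCATCATGCAGC
AGACTGCTAGCTGATGGTACGTCAT
> HG00339 chr22
CAGTACGTCAGTGCTAGCTAGCTAC
GATCGATCGACTGATACGATC
"""
genome = parse_fasta(example)
```

Here genome["HG00339 chr1"] is the 50-character concatenation of the two sequence lines under the first description.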
Storing a human genome in a file: with 1 char = 2 bits (A = 00, C = 01, G = 10, T = 11), the entire human genome of 3.2 B chars takes 770 MB.
Storing a human genome in a file: 2-bit format. Not human readable (for normal humans), e.g., 001010111101010111000101. There are characters other than A, C, G, and T, e.g., N for any base (IUPAC nomenclature for incompletely specified bases in nucleic acid sequences). How to reduce even more?
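The 2-bit encoding (A = 00, C = 01, G = 10, T = 11) can be sketched as follows. The function names are illustrative, and the sketch deliberately rejects N and other IUPAC codes, which is exactly the limitation noted above.

```python
CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"


def pack_2bit(sequence):
    """Pack an A/C/G/T string into bytes, four bases per byte."""
    packed = bytearray((len(sequence) + 3) // 4)
    for i, base in enumerate(sequence):
        code = CODES[base]  # raises KeyError for N or any other IUPAC code
        packed[i // 4] |= code << (6 - 2 * (i % 4))
    return bytes(packed)


def unpack_2bit(packed, length):
    """Recover the string; the length is needed because the last byte is padded."""
    return "".join(
        BASES[(packed[i // 4] >> (6 - 2 * (i % 4))) & 0b11] for i in range(length)
    )
```

For the 28-base example from the DNA slide, pack_2bit returns 7 bytes instead of 28, and unpack_2bit(packed, 28) restores the original string.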
Compressing a human genome file: ZIP compression, 830 MB (in 10 min).
Compressing a human genome file: referential compression. Humans are 99.9% genetically identical, so a genome can be stored as its differences from a reference: 7 MB (in 30 s).
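The idea behind referential compression can be illustrated with a toy sketch that stores only the positions where a sample differs from a reference. It assumes equal-length sequences and substitutions only; real referential compressors also handle insertions and deletions, and this is not the specific tool behind the 7 MB figure.

```python
def diff_against_reference(reference, sample):
    """Return (position, base) pairs where the sample differs from the reference.

    Simplifying assumption: both strings have the same length, so only
    substitutions are represented (no insertions or deletions).
    """
    if len(reference) != len(sample):
        raise ValueError("this sketch only supports equal-length sequences")
    return [(i, s) for i, (r, s) in enumerate(zip(reference, sample)) if r != s]


def restore_from_reference(reference, differences):
    """Rebuild the sample by applying the stored differences to the reference."""
    bases = list(reference)
    for position, base in differences:
        bases[position] = base
    return "".join(bases)
```

The more similar the sample is to the reference, the shorter the list of differences, which is why near-identical human genomes compress so well this way.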
Sequencing a human genome: genomes are not sequenced in one contiguous line but as reads of 30-1000 bp each. Figure: reads are aligned against the human genome reference, resulting in the aligned genome.
Sequencing a human genome: a sequenced genome takes 180 GB in FASTQ format, which stores each read together with its quality data. Example:
@HG00339 read123 300bp
ATGCATCGTACGCTCATCATGCAGC
+
!''*((((***+))%%%++)(%%%%).1***
@HG00339 read456 300bp
TCGTACGCTCATCATGCAGCAGGTC
+
%%%).1****''))**55CCF>>>>>>CC
@HG00339 read999 300bp
CAGTACGTCAGTGCTAGCTAGCTAC
+
)(%%%%).1***''))**5CCCC?CCC65
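In standard FASTQ, each record has four lines: a header starting with @, the bases, a separator line starting with +, and one quality symbol per base. The sketch below parses such records and decodes the quality symbols. It assumes the common Phred+33 encoding and unwrapped four-line records; the function names and the tiny record in the usage note are illustrative.

```python
def parse_fastq(text):
    """Yield (header, sequence, quality) tuples from four-line FASTQ records."""
    lines = [line for line in text.splitlines() if line.strip()]
    for i in range(0, len(lines) - 3, 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record")
        yield header[1:], sequence, quality


def phred_scores(quality, offset=33):
    """Decode quality symbols into Phred scores (assumes Phred+33 by default)."""
    return [ord(symbol) - offset for symbol in quality]
```

For a made-up record "@read1", "ACGT", "+", "!I5+", the decoded scores are 0, 40, 20, and 10: the higher the score, the more reliable the base call.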
Summary: the human genome is a string of 3.2 B chars; the size of a creature does not determine its genome size; in memory, a genome is normally split into its chromosomes (23 pairs in humans). In a file: 3 GB in FASTA; 770 MB in 2-bit format; 830 MB when compressed with ZIP; 7 MB with referential compression; 180 GB in FASTQ.
2. Quantity
Problem statement: big data, huge data.
Quantity: the number of sequenced genomes is increasing exponentially, driven by decreases in sequencing time and cost.
Metric | 1st human genome | Currently
Year | 1988 | 2014
Cost | US$ 3 billion | US$ 1 000
Time | 13 years (2001) | less than 1 hour
From first to next-generation sequencing: first-generation platforms use Sanger-based chemistries and capillary-based instruments; they are followed by next-generation sequencing (NGS) platforms.
Cost per human genome
Cost x quantity
US$1000 (2014): HiSeq X Ten
NGS around the world
US$100 (?)
US$30 (?)
Projects gathering genomes
Physical storage in biobanks
Digital storage in e-biobanks: biobanks also store health and behavior data; there is pressure to store sequenced DNA as well; researchers need a lot of (shared) data.
Digital storage in e-biobanks: biobanks will soon start using cloud computing (e.g., Illumina BaseSpace). BiobankCloud is creating a platform for biobanks to use cloud facilities efficiently and securely.
Neonatal heel prick: newborn screening; detects congenital diseases (hypothyroidism, cystic fibrosis); implemented all over the world; employed in demographics for birth rates. What about newborn genome sequencing?
Genomes of a population: how much disk space do we need to store all genomes of an entire population?
Population | People | Compressed | 2-bit | 8-bit | WGS
1 person | 1 | 7 MB | 770 MB | 3 GB | 180 GB
Santa Maria, RS | 270 K | 1.8 TB | 200 TB | 790 TB | 46 PB
Porto Alegre, RS | 1.5 M | 10 TB | 1.1 PB | 4.3 PB | 260 PB
Rio Grande do Sul | 10.7 M | 72 TB | 7.7 PB | 30 PB | 1.8 EB
São Paulo, SP | 11.3 M | 75 TB | 8.1 PB | 32 PB | 1.9 EB
Brazil | 201 M | 1.3 PB | 144 PB | 575 PB | 34 EB
Latin America | 589 M | 3.8 PB | 422 PB | 1.6 EB | 100 EB
World | 7 B | 45 PB | 5 EB | 20 EB | 1.1 ZB
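The table values follow from multiplying the population by the per-genome size. The sketch below reproduces that arithmetic, assuming binary units (1 KB = 1024 B), which is what matches the slide's rounding; the names are illustrative and only two populations are included.

```python
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]
MB = 1024**2
GB = 1024**3

# Per-genome sizes and population figures taken from the slides.
PER_GENOME = {"compressed": 7 * MB, "2-bit": 770 * MB, "8-bit": 3 * GB, "WGS": 180 * GB}
POPULATIONS = {"Santa Maria, RS": 270_000, "World": 7_000_000_000}


def storage_needed(population, fmt):
    """Bytes needed to store one genome per person in the given format."""
    return POPULATIONS[population] * PER_GENOME[fmt]


def human_readable(size_bytes):
    """Format a byte count using the B..YB scale with binary (1024) steps."""
    value = float(size_bytes)
    for unit in UNITS:
        if value < 1024 or unit == UNITS[-1]:
            return f"{value:.1f} {unit}"
        value /= 1024
```

For example, human_readable(storage_needed("Santa Maria, RS", "compressed")) gives "1.8 TB" and the world in WGS format gives "1.1 ZB", matching the first and last rows of the table.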
Genomes of a population: current storage capacity of a hard disk? 6 TB, WD Ultrastar He6 (US$750).
Format | Compressed | 2-bit | 8-bit | WGS
1 person | 7 MB | 770 MB | 3 GB | 180 GB
Population that fits | 898 K | 8170 | 2048 | 34
Slide annotation: 3x SM (Santa Maria).
Genomes of a population: current storage capacity of a data center? 1 PB, built from 350 Seagate disks of 3 TB (US$150 each, about US$50,000 in total).
Format | Compressed | 2-bit | 8-bit | WGS
1 person | 7 MB | 770 MB | 3 GB | 180 GB
Population that fits | 153 M | 1.3 M | 350 K | 5825
Slide annotation: SM (Santa Maria).
Genomes of a population: current storage capacity of a data center? 500 PB, Backblaze, Inc.
Format | Compressed | 2-bit | 8-bit | WGS
1 person | 7 MB | 770 MB | 3 GB | 180 GB
Population that fits | 76 B | 697 M | 174 M | 2.9 M
Slide annotations: 9x World; Lat. Am. (Latin America); POA (Porto Alegre).
Genomes of a population: current storage capacity of the entire world? 8.2 ZB (projected from 295 EB in 2007).
Format | Compressed | 2-bit | 8-bit | WGS
1 person | 7 MB | 770 MB | 3 GB | 180 GB
Population that fits | - | - | 3 T | 50 B
Slide annotations: 400x World; 7x World.
Genomes of a population: near-future storage capacity of a data center? 50 ZB (or 1 YB?), NSA data center in Bluffdale, UT, US.
Format | Compressed | 2-bit | 8-bit | WGS
1 person | 7 MB | 770 MB | 3 GB | 180 GB
Population that fits | - | - | - | 305 B
Slide annotation: 40x World.
Summary: sequencing cost and time will continue to fall; huge data = size * quantity; storage capacity is increasing, but not at the same pace as sequencing.
3. Meaning
Meaning: GATTCGTACTGACTACGCAGCAGGT. What does it do? What does it produce? What does it look like? Is it related to a specific disease? Does it interact with other components? Does it interact with some medicine?
Meaning: CATTCATCATGCATACGCAGCAGGT. What does it mean for the individual (personalized medicine), for the population (public-health genomics), and for humankind (science)?
DNA: lowest structure present in all living organisms; high expectations for medicine. DNA cannot tell everything about your future: it is not the only variable causing diseases, since behavior and environment interfere. But DNA still plays an important role.
From DNA to meaning: DNA is 98% noncoding and 2% coding. Figure labels: amino acids, genes, proteins.
3.1. Comparing sequences: do you see any difference between these sequences?
CATTCATCATGCATACGGACTGCAGCAGGT
CATTCATTATGCATACGGACTGCAGCAGGT
3.1. Comparing sequences: comparison of two or more sequences. Similarity, similarity, and similarity: similar sequences may have similar genes, proteins, structures, and functions. Percent identity is used as a metric, considering small variations (SNPs, insertions, deletions). Search terms: Hamming distance, edit distance (Levenshtein), sequence alignment, Needleman-Wunsch algorithm, Smith-Waterman algorithm, substitution matrices, BLAST algorithm, seed-and-extend, suffix tree, homologous sequences.
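The simplest of these metrics can be sketched directly. The code below computes the Hamming distance and percent identity for two sequences of equal length; it assumes they are already aligned and handles substitutions only, so insertions and deletions would require an alignment algorithm such as Needleman-Wunsch instead.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def percent_identity(a, b):
    """Percentage of positions at which two equal-length sequences agree."""
    return 100.0 * (len(a) - hamming_distance(a, b)) / len(a)


first = "CATTCATCATGCATACGGACTGCAGCAGGT"
second = "CATTCATTATGCATACGGACTGCAGCAGGT"
```

The two 30-base sequences from the slide differ at a single position (C versus T at the eighth base), so their Hamming distance is 1 and their percent identity is about 96.7.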
3.2. Assembling DNA fragments: in which chromosome can we find this sequence? In which position? CATTCATCATGCATACGGACTGCAGCAGGT
3.2. Assembling DNA fragments. Figure: NGS reads aligned against the human genome reference, resulting in the aligned genome.
3.2. Assembling DNA fragments: assembling DNA fragments to create a contiguous genome. It may be tricky: DNA has many repetitive regions, and orientation, repeats, and sequencing errors must be handled. Search terms: alignment algorithms, de novo assembly, sequence mapping, MAQ, SHRiMP, Bowtie, GNUMAP, MUMmer, BWA, Slider, SOAP3, SNAP.
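A naive version of read mapping makes the difficulties concrete. The sketch below looks for exact matches of a read, in both orientations, across a toy reference; the reference sequences and function names are made up for illustration, and it ignores sequencing errors, which real mappers such as those listed above are designed to tolerate.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Return the sequence as read from the opposite DNA strand."""
    return sequence.translate(COMPLEMENT)[::-1]


def map_read(read, reference):
    """Return (chromosome, position, strand) for every exact match of the read."""
    hits = []
    for chrom, sequence in reference.items():
        for strand, query in (("+", read), ("-", reverse_complement(read))):
            start = sequence.find(query)
            while start != -1:
                hits.append((chrom, start, strand))
                start = sequence.find(query, start + 1)
    return hits


# Toy reference: two short placeholder "chromosomes" sharing a repeated region.
reference = {
    "chr1": "ATGCATCGTACGCTCATCATGCAGC",
    "chr2": "TCGTACGCTCATCATGCAGCAGGTC",
}
```

Mapping the read CATCATGCAG returns two hits, one on each toy chromosome, which shows why repetitive regions make the placement of a read ambiguous.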
3.3. Constructing evolutionary trees: is the 1st sequence more similar to the 2nd or to the 3rd?
CATTACGGACGCATCATCATGCAGCAGGTG
CATTACGGACGTATCATCATCCAGCAGGTG
CATTACGGACGGATCATCATACAGCAGGTG
3.3. Constructing evolutionary trees: all about evolution and, again, similarity. Sequences from different organisms are compared and the organisms are grouped by similarity; the trees show how organisms evolve over time. Multiple alignments may generate phylogenetic trees: sequences are compared recursively until a common ancestor is found. Search terms: evolutionary similarity, multiple alignments, tree of life, phylogenetic tree, Clustal, T-Coffee, Muscle, ProbCons, COBALT, UniProt database, JalView, Logos, PSI-BLAST.
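Grouping by similarity usually starts from pairwise distances between the sequences. The sketch below builds such a distance table with a plain mismatch count, assuming equal-length, already aligned sequences; the labels seq1 to seq3 are illustrative, and building the actual tree from the distances is left out.

```python
from itertools import combinations


def pairwise_distances(sequences):
    """Map each pair of sequence names to the number of mismatching positions."""
    return {
        (a, b): sum(x != y for x, y in zip(sequences[a], sequences[b]))
        for a, b in combinations(sequences, 2)
    }


sequences = {
    "seq1": "CATTACGGACGCATCATCATGCAGCAGGTG",
    "seq2": "CATTACGGACGTATCATCATCCAGCAGGTG",
    "seq3": "CATTACGGACGGATCATCATACAGCAGGTG",
}
distances = pairwise_distances(sequences)
```

For the three sequences from the slide, every pair differs at exactly two positions, so by this simple measure the first sequence is equally similar to the second and the third.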
3.4. Detecting patterns in sequences: do you see any pattern in this sequence? CGTTACGGACGCATCATCATGCAGCAGGTG
3.4. Detecting patterns in sequences: delimit the parts of sequences that have a biological meaning; detect the beginning and end of a gene; detect the beginning and end of shapes. Search terms: conserved domains, motifs, regular expressions, fuzzy regular expressions, Hidden Markov Models (HMMs), InterPro database, PROSITE, multiple alignments, grammars, PCR analysis.
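Regular expressions are the most accessible of these tools. As a hedged illustration, the sketch below uses a backreference to find tandem repeats, i.e., a short unit repeated back to back; the function name and the default unit length of 3 are assumptions made for this example, and biologically meaningful motifs generally need richer models such as HMMs.

```python
import re


def tandem_repeats(sequence, unit_length=3):
    """Return (start, unit, copies) for each run of a unit repeated back to back."""
    pattern = re.compile(r"([ACGT]{%d})\1+" % unit_length)
    return [
        (match.start(), match.group(1), len(match.group(0)) // unit_length)
        for match in pattern.finditer(sequence)
    ]


hits = tandem_repeats("CGTTACGGACGCATCATCATGCAGCAGGTG")
```

On the sequence from the slide, this finds CAT repeated three times starting at position 11 (0-based) and GCA repeated twice starting at position 20.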
3.5. Determining structure from sequence: what is the structure of this sequence in real life? CATTACGGACGCATCATCATGCAGCAGGTG. Figure: primary, secondary, and tertiary structure.
3.5. Determining structure from sequence: structure is related to function. Structure levels: primary, secondary, tertiary, quaternary. Search terms: conserved domains, motifs, folding, PDB, 3D transformations, protein alignment, intermolecular distances, intramolecular distances, RMSD, DALI, SSAP, graph theory, ab initio methods, fold recognition, threading, neural networks, machine learning, support vector machines, random forests, lowest energy algorithms, ROSETTA, CASP challenge.
3.6. Inferring cell regulation: does the 1st sequence interfere with the 2nd? CATTACGGACGCATCATCATGCAGCAGGTG
3.6. Inferring cell regulation: genes interact with each other; non-coding regions of DNA are important for regulation; genes are turned on or off depending on the position of a cell and its neighbors. Search terms: cell regulation, junk DNA, turning genes on/off, pathways, Bayesian networks, support vector machines, clustering algorithms, dimensionality reduction, simulation and modeling, discrete and continuous models, epigenomics.
3.7. Determining protein function: what does this sequence do? Does it define my hair color? CATTACGGACGCATCATCATGCAGCAGGTG
3.7. Determining protein function. Figure examples: body measurement; cardiovascular disorder.
3.7. Determining protein function: the most challenging topic; depends on all previous topics; relies on human annotations of protein functions. Search terms: annotating DNA, ontologies, natural language processing (NLP), motifs, conserved domains, Gene Ontology, OBO, BioOntologies, UMLS, term-for-term analysis, data mining, directed acyclic graph, text mining, semantic web.
3.8. Languages: workloads everywhere; step-by-step to achieve the final goal; optimization and parallel programming. Languages: Python, Perl, and Bash; Java, C, and C++; R. Search terms: workload, programming languages, script languages, dynamic programming, design patterns, Gang of Four, regular expressions, performance evaluation.
3.9. Other important topics: database design, development, and management; natural language processing (NLP); graphics and graphical interfaces; distributed systems; security.
3.10. Privacy: the individual right to keep private information undisclosed. Genomes contain private information.
3.10. Our approach: does it contain any private information? CATTCATCATGCATACGGACTGCAGCAGGT
3.10. Disclosure filter for genetic data: 7 M bp/s (throughput with a single core). Scales up to approximately 300 M bp/s in some cases.
3.11. Other problems: ELSI (ethical, legal, and social implications); genism; genetic discrimination; targeted attacks; loss of reputation; data leakage; privacy problems.
Summary: obtaining meaning from DNA is difficult, complex, time consuming, and painful. Similarity is important: new data is compared with what is already known.
Final remarks. Bioinformatics: interesting area; many open problems; opportunities to acquire and apply knowledge; opportunities from projects; reachable for students from all semesters. Informatics: improves the biology area; is improved by working on biological data.
CC and SI disciplines (semester in each program; DCG marks the entries listed as DCG in the curriculum)
Discipline | CC | SI | Related bioinformatics topics
Cálculo A | 1 | 1
Circuitos Digitais | 1 | 1
Lógica e Algoritmo | 1 | 1 | *
Laboratório de Programação I | 1 | 1 | *
Introdução à CC e SI | 1 | 1
Geometria Analítica | 1 | - | Vectors, 3D representation, torsion angles
Teoria Geral da Administração | - | 1
Estruturas de Dados | 2 | 2 | Trees, graphs, probabilistic data structures
Organização de Computadores | 2 | 2
Laboratório de Programação II | 2 | 2 | *
Matemática Discreta | 2 | 2
Álgebra Linear | 2 | -
Língua Inglesa Instrumental I | 2 | -
Teoria Econômica | - | 2
Introdução à Ciência da Informação | - | 2 | *
Arquitetura de Computadores | 3 | 3
Estatística | 3 | 3 | GWAS, microarray, HMMs
Pesquisa e Ordenação de Dados | 3 | 3 | DAG, pairwise alignment
Paradigmas de Programação | 3 | 3 | *
Cálculo B | 3 | -
Língua Inglesa Instrumental II | 3 | -
Gestão de Pessoas A | - | 3
Fundamentos de Bancos de Dados | 4 | 4 | Designing DB for biological data
Engenharia de Software | 4 | 3 | Gang of Four, design patterns
Sistemas Operacionais | 4 | 4 | Compression, running workloads, scheduling
Métodos Numéricos e Computacionais | 4 | -
Eletricidade e Magnetismo | 4 | -
Projeto e Análise de Algoritmos | 4 | - | Optimization
Marketing | - | 4
Engenharia Econômica | - | 4
Gerência de Projetos de Software | - | 4
Qualidade de Software | 5 | 5
Comunicação de Dados | 5 | - | Optimizing data transfer
Computação Gráfica | 5 | - | 3D structure representation, torsion angles
Linguagens Formais | 5 | -
Lógica de Predicados | 5 | -
Implementação de Bancos de Dados | 5 | - | Database for biological data
Projeto e Gerência de Banco de Dados | - | 5 | Database for biological data
Custo A | - | 5
Redes de Computadores | 6 | 6
Computadores e Sociedade | 6 | 4 | ELSI of genomics, privacy
Empreendedorismo | 6 | 6
Implementação de Linguagens de Programação | 6 | -
Teoria da Computação | 6 | -
Projeto de Software I | - | 6
Sistemas Distribuídos | 7 | 7 | Distributed services and data access
Metodologia de Pesquisa em Informática | 7 | 7 | Defining experiments and writing articles
Inteligência Artificial | 7 | - | Gene finding, secondary structure
Compiladores | 7 | -
Projeto de Software II | - | 7
Sistemas de Computação Móvel | DCG | DCG | Mobile solutions for storage and processing
Middleware para Sistemas Distribuídos | DCG | DCG | Distributed protocols
Interface Humano-Computador | DCG | 5 | ELSI of genomics, interfaces for biologists
Processamento Estruturado de Documentos | DCG | -
Métodos Computacionais Aplicados à Educação | DCG | -
Processamento de Imagens | DCG | - | Microarrays, diagnosis annotation
Empreendedorismo | DCG | -
Fundamentos de Tolerância a Falhas | DCG | -
Programação Paralela | DCG | - | Parallelize bioinformatics algorithms
Visão Computacional | DCG | -
Programação de Jogos 3D | DCG | -
Computação Gráfica Avançada | DCG | -
Projeto de Sistemas Digitais | DCG | -
Concepção de Circuitos Integrados | DCG | -
Bases da Gestão Eletrônica de Documento | - | DCG | LIMS, management of donor consents
Desenvolvimento de Software Para A Web | - | DCG | Web services and data access
Linguagens de Marcação Extensíveis | - | DCG | Ontologies
Modelagens de Processos de Negócios | - | DCG
Preparação para Maratona de Programação I | - | DCG
Sistemas Colaborativos | - | DCG
Mineração de Dados | - | DCG | Gene finding, secondary structure, clustering
Sistemas Inteligentes | - | DCG
Análise e Projeto de Sistemas Orientados a Objetos | - | DCG | *
Final remarks: "... exciting problems to work on" (Donald Knuth, 1993).
Further reading
Jacques Cohen. Bioinformatics: An Introduction for Computer Scientists. DOI 10.1145/1031120.1031122.
Christine Orengo, David Jones, Janet Thornton. Bioinformatics: Genes, Proteins & Computers. ISBN 978-1859960547.
Arthur M. Lesk. Introduction to Bioinformatics. ISBN 978-0199208043.
Jean-Michel Claverie, Cedric Notredame. Bioinformatics for Dummies. ISBN 978-0470089859.
Thank you! Vinicius Vielmo Cogo, vielmo@lasige.di.fc.ul.pt, http://lasige.di.fc.ul.pt/~vielmo