Bioinformatics for Informatics Vinicius Vielmo Cogo. Presented in UFSM, Santa Maria, RS, BR. April 08, 2014.

Similar documents
Agreement on Dual Degree Bachelor Program in Computer Science. Universidade Federal do Rio Grande do Sul. Technische Universität Berlin

MSc on Applied Computer Science

Core Bioinformatics. Degree Type Year Semester Bioinformàtica/Bioinformatics OB 0 1

Analysis of NGS Data

BIO 3350: ELEMENTS OF BIOINFORMATICS PARTIALLY ONLINE SYLLABUS

Bio-Informatics Lectures. A Short Introduction

Version 5.0 Release Notes

Bioinformatics Grid - Enabled Tools For Biologists.

Guidelines for Establishment of Contract Areas Computer Science Department

Core Bioinformatics. Titulació Tipus Curs Semestre Bioinformàtica/Bioinformatics OB 0 1

Doctor of Philosophy in Computer Science

BIOINF 525 Winter 2016 Foundations of Bioinformatics and Systems Biology

Guide for Bioinformatics Project Module 3

SGI. High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems. January, Abstract. Haruna Cofer*, PhD

REGULATIONS FOR THE DEGREE OF BACHELOR OF SCIENCE IN BIOINFORMATICS (BSc[BioInf])

Overview. Swarms in nature. Fish, birds, ants, termites, Introduction to swarm intelligence principles Particle Swarm Optimization (PSO)

UGENE Quick Start Guide

Lecture/Recitation Topic SMA 5303 L1 Sampling and statistical distributions

Pairwise Sequence Alignment

Curriculum Reform in Computing in Spain

Master's projects at ITMO University. Daniil Chivilikhin PhD ITMO University

PROC. CAIRO INTERNATIONAL BIOMEDICAL ENGINEERING CONFERENCE

Human Genome and Human Genome Project. Louxin Zhang

Protein & DNA Sequence Analysis. Bobbie-Jo Webb-Robertson May 3, 2004

Next Generation Sequencing

Core Bioinformatics. Degree Type Year Semester

Professional Organization Checklist for the Computer Science Curriculum Updates. Association of Computing Machinery Computing Curricula 2008

Bachelor of Games and Virtual Worlds (Programming) Subject and Course Summaries

Software Development Training Camp 1 (0-3) Prerequisite : Program development skill enhancement camp, at least 48 person-hours.

Genome Explorer For Comparative Genome Analysis

Cancer Genomics: What Does It Mean for You?

Databases and mapping BWA. Samtools

Next generation sequencing (NGS)

Protein Protein Interaction Networks

Focusing on results not data comprehensive data analysis for targeted next generation sequencing

A Primer of Genome Science THIRD

Doctor of Philosophy in Informatics

Eoulsan Analyse du séquençage à haut débit dans le cloud et sur la grille

Introduction to Bioinformatics 3. DNA editing and contig assembly

Just the Facts: A Basic Introduction to the Science Underlying NCBI Resources

Next Generation Sequencing: Technology, Mapping, and Analysis

Introduction to Data Mining

RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison

Bachelor Curriculum in cooperation with

When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want

Bachelor Curriculum in cooperation with

Sanjeev Kumar. contribute

An Interdepartmental Ph.D. Program in Computational Biology and Bioinformatics:

Syllabus of B.Sc. (Bioinformatics) Subject- Bioinformatics (as one subject) B.Sc. I Year Semester I Paper I: Basic of Bioinformatics 85 marks

Introduction to NGS data analysis

Efficient Parallel Execution of Sequence Similarity Analysis Via Dynamic Load Balancing

Hadoopizer : a cloud environment for bioinformatics data analysis

T cell Epitope Prediction

Bioinformatics: course introduction

REGULATIONS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE (MSc[CompSc])

CD-HIT User s Guide. Last updated: April 5,

New solutions for Big Data Analysis and Visualization

Tutorial for Windows and Macintosh. Preparing Your Data for NGS Alignment

UF EDGE brings the classroom to you with online, worldwide course delivery!

Similarity Searches on Sequence Databases: BLAST, FASTA. Lorenza Bordoli Swiss Institute of Bioinformatics EMBnet Course, Basel, October 2003

How To Get A Masters Degree In Biomedical Informatics

Biomedical Big Data and Precision Medicine

Network Protocol Analysis using Bioinformatics Algorithms

REGULATIONS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE (MSc[CompSc])

FlipFlop: Fast Lasso-based Isoform Prediction as a Flow Problem

Data, Measurements, Features

REGULATIONS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE (MSc[CompSc])

An example of bioinformatics application on plant breeding projects in Rijk Zwaan

CCR Biology - Chapter 9 Practice Test - Summer 2012

Usability in bioinformatics mobile applications

Learning outcomes. Knowledge and understanding. Competence and skills

Data Processing of Nextera Mate Pair Reads on Illumina Sequencing Platforms

Lecture Outline. Introduction to Databases. Introduction. Data Formats Sample databases How to text search databases. Shifra Ben-Dor Irit Orr

Current Motif Discovery Tools and their Limitations

Algorithms in Computational Biology (236522) spring 2007 Lecture #1

High Performance Compu2ng Facility

Intro to Bioinformatics

Nazneen Aziz, PhD. Director, Molecular Medicine Transformation Program Office

An Introduction to Data Mining

FBIO - Fundations of Bioinformatics

IC05 Introduction on Networks &Visualization Nov

Introduction to the Mathematics of Big Data. Philippe B. Laval

Shouguo Gao Ph. D Department of Physics and Comprehensive Diabetes Center

Becker Muscular Dystrophy

Pipeline Pilot Enterprise Server. Flexible Integration of Disparate Data and Applications. Capture and Deployment of Best Practices

Master s Program in Information Systems

Final Project Report

Searching Nucleotide Databases

Sequence Formats and Sequence Database Searches. Gloria Rendon SC11 Education June, 2011

Statistics Graduate Courses

Machine Learning Introduction

Removing Sequential Bottlenecks in Analysis of Next-Generation Sequencing Data

Delivering the power of the world s most successful genomics platform

GenBank: A Database of Genetic Sequence Data

Using MATLAB: Bioinformatics Toolbox for Life Sciences

Overview. Overarching observations

Organization and analysis of NGS variations. Alireza Hadj Khodabakhshi Research Investigator

Learning is a very general term denoting the way in which agents:

Importance of Statistics in creating high dimensional data

EFFICIENT DATA PRE-PROCESSING FOR DATA MINING

Transcription:

Bioinformatics for Informatics Vinicius Vielmo Cogo Presented in UFSM, Santa Maria, RS, BR. April 08, 2014.

Context The presenter: is not a bioinformatician is a computer scientist from Inf2006-UFSM is a PhD. student in the ULisboa is a researcher at the LaSIGE (Navigators group ) has as main research areas: Distributed systems Dependability Fault Tolerance Security But why are you going to present about bioinformatics? 2

Context Interesting area Many open problems Opportunities to acquire and apply knowledge Opportunities from projects Bachelor in Computer Science (UFSM) Master in Informatics (ULisboa) PhD. in Informatics (ULisboa) PET-CC FTH-GRID CloudFIT BiobankCloud PET-CC LSC Tclouds 3

Context is not an introduction to bioinformatics is not a study of is a different way to look to bioinformatics, from an informatics perspective 4

What is bioinformatics? 5

What is bioinformatics? 6

What is bioinformatics? is the study of life and living organisms 7

What is bioinformatics? 8

What is bioinformatics? Mathematics Astronomy Physics Biology Ethics Chemistry Informatics Statistics 9

What is bioinformatics? Mathematics Astronomy Physics Biology Ethics Chemistry Informatics Statistics 10

What is bioinformatics? Biology Informatics 11

What is bioinformatics? Biology Bioinformatics? Informatics 12

What is bioinformatics? Biology Bioinformatics Informatics Computational Biology Biological Computing Modeling biological virtual systems and organs. Searching the optimal path in a graph based on ants behavior 13

What is bioinformatics? Biology Bioinformatics Informatics Bioinformatics is an informatics discipline for storing, retrieving, organizing and analyzing biological data. 14

Motivation But what makes the biological data so special? 15

Motivation But what makes the biological data so special? Data complexity Size Quantity Meaning Incomplete and uncertain data 16

Outline 1. Size 2. Quantity 3. Meaning 17

1. Size

DNA TCTCATGCATACGCAGCAGGTCAGGCAT DNA is: A very large string Composed only by A, C, G, and T characters 19

Human DNA From human beings to DNA 10-50 trillion 50 000 000 000 000 1 23 pairs 46 2 molecules A human being Cells Nucleus Chromosomes DNA 210 types of cells 300 million die every minute 20

Human DNA TCTCATGCATACGCAGCAGGTCAGGCAT Human DNA: 3.2 B characters (bp, base-pairs) 21

Genome size Organism Genome size Picture Porcine circovirus type 1 1.8 K Drosophila melanogaster 130 M Mus musculus 2.7 B Homo sapiens 3.2 B Zea mays 5 B Marbled lungfish 130 B 22

Printing a human genome Times New Roman 12 pt 2622 bp per page 1.2 M single-sided pages 2415 reams (500 sheets) 129 meters 129 m 93 m 30 m 23

Printing a human genome Leighton Pritchard 4.5 pt Double-sided pages 120 book volumes 1 bookshelf 24

Loading a human genome to memory Does it work? char[] humandna = char[3 200 000 000]; 25

Loading a human genome to memory Does it work? char[] humandna = char[3 200 000 000]; In most programming languages: No Most data structures are indexed by signed int 2.1 B chars = (2^31-1 = 2 147 483 647) Solutions? 26

Loading a human genome to memory char[] humandna = char[3 200 000 000]; Solutions: Unsigned int: 4 200 000 000 (2^32-1) 64-bit indexes: 9 000 000 000 000 000 000 (2^63-1) Matrices: humandna = char[100][32 000 000] 27

Loading a human genome to memory char[] humandna = char[3 200 000 000]; Solutions: Unsigned int: 4 200 000 000 (2^32-1) 64-bit indexes: 9 000 000 000 000 000 000 (2^63-1) Matrices: humandna = char[100][32 000 000] Split in chromosomes: Map<Key, Value> Map<chrName, chrsequence> 28

Human chromosomes The human genome is not a contiguous DNA 29

Storing a human genome in a file Entire human genome 3.2 B chars 3 GB UTF-8 or ISO 8859-1 1 char = 1-byte (8-bits) 30

Storing a human genome in a file FASTA format Broadly used Accepts comments (>) Store also incomplete or small sequences > HG00339 chr1 ATGCATCGTACGCTCATCATGCAGC AGACTGCTAGCTGATGGTACGTCAT GCAGTCGTCAGTCAGTCAGTACATC > HG00339 chr2 TCGTACGCTCATCATGCAGCAGGTC AGTACCATGGTCAGACTCATTGCAA CTGCTAGCTGATGATCGGCATAGT > HG00339 chr22 CAGTACGTCAGTGCTAGCTAGCTAC GATCGATCGACTGATACGATC > HG00339 chry GGGTACTCATGCAGTCAGTCAGTTG CAGTCAGTCAGTCGCAGTCA 31

Storing a human genome in a file 1 char = 2-bits 770 MB Entire human genome 3.2 B chars A = 00 C = 01 G = 10 T = 11 32

Storing a human genome in a file 2-bit format Not human readable (normal humans) There are other characters than A, C, G and T E.g.: N for any. Nomenclature for incompletely specified bases in nucleic acid sequences IUPAC How to reduce even more? 001010111101010111000101 010110101010101010110101 010101111101010101010101 010101010101010111101010 101010101010101010010000 000101010101010101010100 000101010111101010000000 010110101010101010101000 000000101010101010101010 101010101010101101001101 010000000101010101101111 101010101010101010101010 111111111111101010101110 101110101010010001100001 1010011010001110101101 000100010101010110101 33

Compressing a human genome file ZIP Compression 830 MB (in 10min) 34

Compressing a human genome file Referential Compression Humans are 99.9% genetically equals 7 MB (in 30s) 35

Sequencing a human genome Reads Genomes are not sequenced in one contiguous line Reads contains 30-1000bp each Human genome reference Resulting aligned genome 36

Sequencing a human genome Sequenced Genome 180 GB > HG00339 read123 300bp ATGCATCGTACGCTCATCATGCAGC +!''*((((***+))%%%++)(%%%%).1*** > HG00339 read456 300bp TCGTACGCTCATCATGCAGCAGGTC + %%%).1****''))**55CCF>>>>>>CC data FASTQ format > HG00339 read999 300bp CAGTACGTCAGTGCTAGCTAGCTAC + )(%%%%).1***''))**5CCCC?CCC65 37

Resuming Human genome = string of 3.2B chars Size of creatures does not determine genome size Normally in memory, a genome is divided into 23 chromosomes In a file: 3 GB in FASTA 770 MB in 2-bits 830 MB when compressed with ZIP 7 MB when using referential compression 180 GB in FASTQ 38

2. Quantity

Problem statement Huge Big data data 40

Quantity The number of sequenced genomes is exponentially increasing Caused by the decreasing on: time and cost 1 st Human Genome Currently Year 1988 2014 Cost US$ 3 billion US$ 1 000 Time 13 years (2001) < 1 hour 41

From first to next-generation sequencing Sanger-based chemistries and capillary-based instruments sequencing platforms Next-Generation sequencing platforms (NGS) 42

Cost per human genome 43

Cost x quantity 44

US$1000 (2014) HiSeq X Ten 45

NGS around the world 46

US$100 (?) 47

US$30 (?) 48

Projects gathering genomes 49

Physical storage in biobanks 50

Physical storage in biobanks 51

Digital storage in e-biobanks Biobanks store also health and behavior data Pressure to store also sequenced DNA Researchers need a lot of data (shared) 52

Digital storage in e-biobanks Biobanks will soon start using cloud computing Illumina Basespace BiobankCloud is creating a platform for biobanks efficiently and securely use cloud facilities 53

Neonatal heel prick Newborn screening Detect congenital diseases Hypotiroidism Cystic fibrosis Implemented all over the world Employed in demographics for birth rates What about newborn genome sequencing? 54

Genomes of a population How much disk space do we need to store all genomes of an entire population? 1-person 1 Population Santa Maria, RS Porto Alegre, RS Rio Grande do Sul São Paulo, SP Brazil Latin America World 270 K 1.5 M 10.7 M 11.3 M 201 M 589 M 7 B 55

Genomes of a population How much disk space do we need to store all genomes of an entire population? Population Compressed Size with 2-bits Size with 8-bits WGS 1-person 1 7 MB 770 MB 3 GB 180 GB Santa Maria, RS Porto Alegre, RS Rio Grande do Sul São Paulo, SP Brazil Latin America World 270 K 1.5 M 10.7 M 11.3 M 201 M 589 M 7 B B KB MB GB TB PB EB ZB YB 56

Genomes of a population How much disk space do we need to store all genomes of an entire population? Population Compressed Size with 2-bits Size with 8-bits WGS 1-person 1 7 MB 770 MB 3 GB 180 GB Santa Maria, RS 270 K 1.8 TB 200 TB 790 TB 46 PB Porto Alegre, RS 1.5 M 10 TB 1.1 PB 4.3 PB 260 PB Rio Grande do Sul 10.7 M 72 TB 7.7 PB 30 PB 1.8 EB São Paulo, SP 11.3 M 75 TB 8.1 PB 32 PB 1.9 EB Brazil 201 M 1.3 PB 144 PB 575 PB 34 EB Latin America 589 M 3.8 PB 422 PB 1.6 EB 100 EB World 7 B 45 PB 5 EB 20 EB 1.1 ZB B KB MB GB TB PB EB ZB YB 57

Genomes of a population Current storage capacity of a hard disk? 58

Genomes of a population Current storage capacity of a hard disk? 6 TB WD Ultrastar He6 (US$750) Compressed Size with 2-bits Size with 8-bits WGS 1-person 7 MB 770 MB 3 GB 180 GB Population 898 K 8170 2048 34 3x SM B KB MB GB TB PB EB ZB YB 59

Genomes of a population Current storage capacity of a data center? 60

Genomes of a population Current storage capacity of a data center? 1 PB 350 Seagate of 3TB (US$150 each) (US$50.000) Compressed Size with 2-bits Size with 8-bits WGS 1-person 7 MB 770 MB 3 GB 180 GB Population 153 M 1.3 M 350 K 5825 SM B KB MB GB TB PB EB ZB YB 61

Genomes of a population Current storage capacity of a data center? 500 PB BackBlaze, Inc. Compressed Size with 2-bits Size with 8-bits WGS 1-person 7 MB 770 MB 3 GB 180 GB Population 76 B 697 M 174 M 2.9 M 9x World Lat. Am. POA B KB MB GB TB PB EB ZB YB 62

Genomes of a population Current storage capacity of the entire world? 63

Genomes of a population Current storage capacity of the entire world? 8.2 ZB 295 EB in 2007 law Compressed Size with 2-bits Size with 8-bits WGS 1-person 7 MB 770 MB 3 GB 180 GB Population - - 3 T 50 B 400x World 7x World B KB MB GB TB PB EB ZB YB 64

Genomes of a population Near future storage capacity of a data center? 50 ZB (or 1 YB?) NSA data center in Bluffdale, UT, US Compressed Size with 2-bits Size with 8-bits WGS 1-person 7 MB 770 MB 3 GB 180 GB Population - - - 305 B 40x World B KB MB GB TB PB EB ZB YB 65

Resuming Sequencing cost and time will continue to fall Huge data: size*quantity Storage capacity is increasing (not at the same pace as sequencing) 66

3. Meaning

Meaning GATTCGTACTGACTACGCAGCAGGT What does it do? What does it produces? How does it looks like? Is it related with a specific disease? Does it interact with other components? Does it interact with some medicine? 68

Meaning CATTCATCATGCATACGCAGCAGGT What does it mean individual? (Personalized medicine) population? (Public-health genomics) human kind? (Science) 69

DNA Lowest structure present in all living organisms High expectation for medicine DNA cannot tell everything about your future DNA is not the only variable causing diseases Behavior and environment interfere But, DNA still plays an important role 70

From DNA to meaning DNA 98% Noncoding 2% Coding Amino acids Genes Proteins 71

3.1. Comparing sequences Do you see any difference in these sequences? CATTCATCATGCATACGGACTGCAGCAGGT CATTCATTATGCATACGGACTGCAGCAGGT 72

3.1. Comparing sequences Do you see any difference in these sequences? CATTCATCATGCATACGGACTGCAGCAGGT CATTCATTATGCATACGGACTGCAGCAGGT 73

3.1. Comparing sequences Comparison of two or more sequences Similarity, similarity and similarity Similar sequences may have similar genes, proteins, structures and functions Percent identity as metric Considering small variations (SNPs, insertions, deletions) Search terms: Hamming distance, Edit distance (Levenshtein), sequence alignment, Needleman-Wunsch algorithm, Smith and Waterman algorithm, substitution matrices, BLAST algorithm, seedand-extend, suffix tree, homologue sequences 74

3.2. Assembling DNA fragments In which chromosome can we find this sequence? In which position? CATTCATCATGCATACGGACTGCAGCAGGT 75

NGS Reads 3.2. Assembling DNA fragments Human genome reference Resulting aligned genome 76

3.2. Assembling DNA fragments 77

3.2. Assembling DNA fragments Assembling DNA to create a contiguous genome It may be tricky DNA has many repetitive regions Orientation, repeats and sequencing errors Search terms: Alignment algorithms, de-novo assembly, sequence mapping, MAQ, SHRiMP, Bowtie, GNUMAP, MUMmer, BWA, Slider, SOAP3, SNAP 78

3.3. Constructing evolutionary trees The 1 st sequence is more similar to the 2 nd or the 3 rd? CATTACGGACGCATCATCATGCAGCAGGTG CATTACGGACGTATCATCATCCAGCAGGTG CATTACGGACGGATCATCATACAGCAGGTG 79

3.3. Constructing evolutionary trees 80

3.3. Constructing evolutionary trees All about evolution and again similarity Comparing sequences from different organisms Groups organisms by similarity They show how organisms evolve in time Multiple alignments may generate phylogenic trees Recursively compare sequences until find a common ancestor Search terms: Evolutionary similarity, multiple alignments, tree of life, phylogenetic tree, Clustal, T- coffee, Muscle, ProbCons, COBALT, UniProt database, JalView, Logos, PSI-BLAST 81

3.4. Detecting patterns in sequences Do you see any pattern in this sequence? CGTTACGGACGCATCATCATGCAGCAGGTG 82

3.4. Detecting patterns in sequences Do you see any pattern in this sequence? CGTTACGGACGCATCATCATGCAGCAGGTG 83

3.4. Detecting patterns in sequences Delimit parts of sequences that have a biological meaning Detect the begin and end of a gene Detect the begin and end of shapes Search terms: Conserved domains, motifs, regular expressions, Fuzzy regular expressions, Hidden Markov Models (HMMs), InterPro database, PROSITE, multiple alignments, grammars, PCR analysis 84

3.5. Determining structure from sequence Which is the structure of this sequence in real life? CATTACGGACGCATCATCATGCAGCAGGTG 85

3.5. Determining structure from sequence Which is the structure of this sequence in real life? Primary Secondary Tertiary 86

3.5. Determining structure from sequence Structure is related with function Structure: Primary Secondary Tertiary Quaternary Search terms: Conserved domains, motifs, folding, PDB, 3D transformations, protein alignment, intermolecular distances, intramolecular distance, RMSD, DALI, SSAP, graph theory, abinitio methods, fold recognition, threading, neural networks, machine learning, support vector machines, random forests, lowest energy algorithms, ROSETTA, CASP challenge 87

3.6. Inferring cell regulation Does the 1 st sequence interfere in the 2 nd? CATTACGGACGCATCATCATGCAGCAGGTG 88

3.6. Inferring cell regulation Genes interact with each other Non-coding regions of DNA are important for regulation Depending on geographic position of a cell and neighbors to turn on/off genes Search terms: Cell regulation, junk DNA, turn on/off genes, pathways, Bayesian networks, Support vector machines, clustering algorithms, reducing dimensionality, simulation and modeling, discrete and continuous models, epigenomics 89

3.7. Determining protein function What does this sequence? Does it define my hair color? CATTACGGACGCATCATCATGCAGCAGGTG 90

3.7. Determining protein function 91

3.7. Determining protein function Body measurement 92

3.7. Determining protein function Cardiovascular disorder 93

3.7. Determining protein function Most challenging topic Depend on all previous topics Human annotations for protein functions Search terms: Annotating DNA, ontologies, natural language processing (NLP), motifs, conserved domains, gene ontology, OBO, BioOntologies, UMLS, term-for-term analysis, data mining, directed acyclic graph, text mining, semantic web 94

3.8. Languages Workloads everywhere Step-by-step to achieve the final goal Optimization, parallel programming Python, Perl and Bash Java, C, C++ R Search terms: Workload, programming languages, script languages, dynamic programming, design pattersn, gang of four, regular expressions, performance evaluation 95

3.9. Other important topics Database design, development, management Natural language processing (NLP) Graphics and graphical interfaces Distributed systems Security 96

3.10. Privacy The individual right of maintaining private undisclosed information Genomes contain private information 97

3.10. Our approach Does it contains any private information? CATTCATCATGCATACGGACTGCAGCAGGT 98

3.10. Disclosure filter for genetic data 7M bp/s * Throughput with a single core. Scales until approximately 300 M bp/s in some cases. 99

3.11. Other problems ELSI Ethical Legal Social implications Genism Genetic discrimination Targeted attacks Loss of reputation Data leakage Privacy problems 100

Resuming Obtaining meaning from DNA is Difficult Complex Time consuming Painful Similarity is important Comparison with what is already known 101

Final remarks Bioinformatics: interesting area many open problems opportunities to acquire and apply knowledge opportunities from projects reachable for students from all semesters Informatics: improve biology area improved by working on biological data 102

CC and SI disciplines Disciplinas CC SI Cálculo A 1 1 Circuitos Digitais 1 1 Lógica e Algoritmo 1 1 * Laboratório de Programação I 1 1 * Introdução à CC e SI 1 1 Geometria Analítica 1 - Vectors, 3D representation, torsion angles Teoria Geral da Administração - 1 Estruturas de Dados 2 2 Trees, graphs, probabilistic data structures Organização de Computadores 2 2 Laboratório de Programação II 2 2 * Matemática Discreta 2 2 Álgebra Linear 2 - Língua Inglesa Instrumental I 2 - Teoria Econômica - 2 Introdução a Ciencia da Informação - 2 * 103

CC and SI disciplines Disciplinas CC SI Arquitetura de Computadores 3 3 Estatística 3 3 GWAS, microarray, HMMs Pesquisa e Ordenação de Dados 3 3 DAG, pairwise alignment Paradigmas de Programação 3 3 * Cálculo B 3 - Língua Inglesa Intrumental II 3 - Gestão de Pessoas A - 3 Fundamentos de Bancos de Dados 4 4 Designing DB for biological data Engenharia de Software 4 3 Gang of Four, Design patterns Sistemas Operacionais 4 4 Compression, running workloads, scheduling Métodos Numéricos e Computacionais 4 - Eletricidade e Magnetismo 4 - Projeto e Análise de Algoritmos 4 - Optimization Marketing - 4 Engenharia Econômica - 4 Gerência de Projetos de Software - 4 104

CC and SI disciplines Disciplinas CC SI Qualidade de Software 5 5 Comunicação de Dados 5 - Optimizing data transfer Computação Gráfica 5-3D structure representation, torsion angles Linguagens Formais 5 - Lógica de Predicados 5 - Implementação de Bancos de Dados 5 - Database for biological data Projeto e Gerência de Banco de Dados - 5 Database for biological data Custo A - 5 Redes de Computadores 6 6 Computadores e Sociedade 6 4 ELSI of genomics, privacy Empreendedorismo 6 6 Implementação de Linguagens de Programação 6 - Teoria da Computação 6 - Projeto de Software I - 6 105

CC and SI disciplines Disciplinas CC SI Sistemas Distribuídos 7 7 Distributed services and data access Metodologia de Pesquisa em Informática 7 7 Defining experiments and writing articles Inteligência Artificial 7 - Gene finding, secondary structure Compiladores 7 - Projeto de Software II - 7 Sistemas de Computação Móvel DCG DCG Mobile solutions for storage and processing Middleware para Sistemas Distribuídos DCG DCG Distributed protocols Interface Humano-Computador DCG 5 ELSI of genomics, interfaces for biologists Processamento Estruturado de Documentos Métodos Computacionais Aplicados à Educação DCG - DCG - Processamento de Imagens DCG - Microarrays, diagnosis annotation Empreendedorismo DCG - Fundamentos de Tolerância a Falhas DCG - 106

CC and SI disciplines Disciplinas CC SI Programação Paralela DCG - Parallelize bioinformatics algorithms Visão Computacional DCG - Programação de Jogos 3D DCG - Computação Gráfica Avançada DCG - Projeto de Sistemas Digitais DCG - Concepção de Circuitos Integrados DCG - Bases da Gestão Eletrônica de Documento - DCG LIMS, management of donor consents Desenvolvimento de Software Para A Web - DCG Web services and data access Linguagens de Marcação Extensíveis - DCG Ontologies Modelagens de Processos de Negócios - DCG Preparação para Maratona de Programação I - DCG Sistemas Colaborativos - DCG Mineração de Dados - DCG Gene finding, secondary struct., clustering Sistemas Inteligentes - DCG Análise e Projeto de Sistemas Orientados a Objetos - DCG * 107

Final remarks exciting problems to work on Donald Knuth, 1993. 108

Further reading Bioinformatics An introduction for Computer Scientists. Jacques Cohen 10.1145/1031120.1031122 Bioinformatics: Genes, Proteins & Computers. Christine Orengo David Jones Janet Thornton 978-1859960547 Introduction to Bioinformatics. Arthur M. Lesk 978-0199208043 Bioinformatics for Dummies. Jean-Michel Claverie Cedric Notredame 978-0470089859 Thank you! Vinicius Vielmo Cogo vielmo@lasige.di.fc.ul.pt http://lasige.di.fc.ul.pt/~vielmo 109