Bioinformatics
www.geocities.com/mirkozimic/bioinfo
Introduction, Biological Databases
Prof. Mirko Zimic

What is Bioinformatics?
Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.

What is Computational Biology?
The development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.

(Working Definition of Bioinformatics and Computational Biology, July 17, 2000). http://www.grants2.nih.gov/grants/bistic/compubiodef.pdf

The Ideal Syllabus
Molecular Biology: Basic concepts, Genomic and Proteomic structure
Core Bioinformatics: Biological Databases, Sequence Analysis, Functional Genomics
Advanced Bioinformatics: Molecular Evolution and Phylogeny, Protein Structure Prediction, The Transcriptome, The Proteome
Informatics: Information Theory, Basic Statistics, Database Technologies, Knowledge Representation

Biocomputing
Konrad Zuse with the reconstructed Z1. Zurich

During World War II the British built the Colossus in response to the Enigma encoder.
[Figure: Enigma]

In 1944 IBM and Harvard University unveiled the Mark I, the first computer to meet the modern definition. It was about 15 meters long and 2.40 m high, and weighed 10 tons. It used electromechanical relays.

[Figure: one of the relays used in the Mark I]
It added in less than one second, multiplied in about six, and divided in about twelve.

Cost-Effectiveness!
Bioinformatics turns out to be a very favorable discipline in terms of cost-effectiveness.

On Life...
Living things are composed of lifeless molecules (Albert Lehninger)
Can Biology be reduced to the fundamental laws of Physics?

Bioinformatics began with the development of biological databases, followed by the development of tools for fast information retrieval.
Today, bioinformatics seeks to develop prediction algorithms based on the information stored in biological databases.

Historical Perspective
Key developments:
Dayhoff, Atlas of Protein Sequence and Structure (1965-1978)
GenBank/EMBL nucleic-acid sequence databases (1979-1992)
Entrez (early 1990s to date)
Sequence alignment algorithms: Needleman/Wunsch (1970), Smith/Waterman (1981), FASTA (Pearson/Lipman, 1988), BLAST (Altschul, 1990)
Genomes (1995 to date)

Collecting Sequence Data
Genome (DNA-level): Genomic sequencing
  Complete picture of genome
  Generates physical map
  Includes regulatory and other silent regions
Transcriptome (RNA-level): Expression-library sequencing
  Expressed genes only
  Splicing / variant forms
  Can correlate with levels of expression
Proteome (protein-level): Protein sequencing
  Insight into biological function
  Gives information on protein-protein interactions
  Post-translational modifications detected

DNA Sequencing
DNA Sequencing (Cont'd)

Fragment Assembly
Genomic DNA -> Random shearing -> Sequence overlapping fragments -> Sequences assembled
CCAGATTACGAAATCC... GGCTTATACCGGCAT

Sequencing from Expression Libraries
Gene (Exon 1 to Exon 5, separated by introns) -> Transcription / splicing / processing -> mRNA (poly-A tail, AAA) -> Reverse transcriptase (poly-T primed, TTT) -> Sequence -> Transcriptome
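
The fragment-assembly idea (shear, sequence overlapping pieces, then merge them) can be illustrated with a small greedy sketch. This is only a toy under stated assumptions: the fragments are invented for the example, reads are error-free and all on the same strand, and the function names `overlap` and `greedy_assemble` are hypothetical. Real assemblers handle sequencing errors, repeats, and both strands.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Returns 0 if the best overlap is shorter than `min_len`.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments, min_len: int = 3):
    """Repeatedly merge the pair of fragments with the longest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing left to merge; contigs stay separate
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags


# Invented toy reads covering a 20-base sequence (not data from the slides).
reads = ["CGTTAGCCTAGG", "ATGGCGTAC", "GCGTACGTTAG"]
contigs = greedy_assemble(reads)
print(contigs)
```

Greedy merging is the simplest strategy to explain; it can go wrong when the genome contains repeats longer than the overlaps.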

Protein Sequencing

Digital Storage of Sequence Data
Bit: A binary digit represented in a digital circuit; only two states recognized, 0 and 1 (usually 0 V and +5 V, respectively).
Byte: Grouping of 8 bits into a larger unit. Bits are usually numbered 0-7 (not 1-8!).
ASCII: Acronym for American Standard Code for Information Interchange. Representation of alphanumeric and some special characters as 1-byte (8-bit) unsigned integers {0 ... 255} (the set {2^0 - 1 ... 2^8 - 1}); standard ASCII itself defines only the 7-bit codes 0 to 127. The ASCII character set also includes nonprinting control characters such as carriage return (CR) or line feed (LF).
Minimum storage requirement for human genome data represented as ASCII characters: 3 × 10^9 bytes (3000 Mbytes), or about 5 CD-ROMs, exclusive of annotations or other data.
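
The storage estimate above can be checked with a few lines of arithmetic. The sketch below assumes a CD-ROM capacity of 650 × 10^6 bytes (a figure not given in the slides) and also shows, as an aside, that packing each base into 2 bits instead of one ASCII byte would cut the requirement by a factor of four. The names `CODE` and `pack` are illustrative only.

```python
import math

GENOME_BASES = 3 * 10**9      # figure used in the slide
CD_ROM_BYTES = 650 * 10**6    # assumption: capacity of one CD-ROM

ascii_bytes = GENOME_BASES            # one byte per base as ASCII text
packed_bytes = GENOME_BASES * 2 // 8  # two bits per base (A, C, G, T only)

print(ascii_bytes / 10**6, "Mbytes as ASCII")
print(math.ceil(ascii_bytes / CD_ROM_BYTES), "CD-ROMs as ASCII")
print(math.ceil(packed_bytes / CD_ROM_BYTES), "CD-ROMs at 2 bits per base")

# Hypothetical 2-bit encoding; ambiguity codes such as N are not supported.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def pack(seq: str) -> bytes:
    """Pack a DNA string into bytes, four bases per byte, left-aligned."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(chunk))  # pad a short final chunk with zero bits
        out.append(byte)
    return bytes(out)


print(pack("ACGT").hex())
```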

Number Systems

Dec   Bin     Octal   Hex
0     0       0       0
1     1       1       1
2     10      2       2
3     11      3       3
4     100     4       4
5     101     5       5
6     110     6       6
7     111     7       7
8     1000    10      8
9     1001    11      9
10    1010    12      A
11    1011    13      B
12    1100    14      C
13    1101    15      D
14    1110    16      E
15    1111    17      F
16    10000   20      10
17    10001   21      11
18    10010   22      12
19    10011   23      13

The ASCII Table
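
The number-systems table can be regenerated, and extended to any range, with Python's built-in format specifiers. This is a small illustrative sketch; the helper name `row` is made up for the example.

```python
def row(n: int):
    """Return the decimal, binary, octal and hexadecimal spellings of n."""
    return (str(n), format(n, "b"), format(n, "o"), format(n, "X"))


print(f"{'Dec':<5}{'Bin':<8}{'Octal':<7}{'Hex'}")
for n in range(20):
    dec, binary, octal, hexa = row(n)
    print(f"{dec:<5}{binary:<8}{octal:<7}{hexa}")

# Going the other way: int() parses a string in a given base.
print(int("10011", 2), int("23", 8), int("13", 16))
```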

Extended ASCII Characters

Nucleic-acid Base Codes

Symbol  Meaning            Symbol  Meaning
A       A                  S       G or C
G       G                  W       A or T
C       C                  H       A, C, or T (~G)
T       T                  B       C, G, or T (~A)
R       A or G             V       A, C, or G (~T)
Y       C or T             D       A, G, or T (~C)
M       A or C             N       A, C, G, or T
K       G or T

Adapted from Mount, Bioinformatics, Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY (2001)
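
One way to see what the ambiguity codes mean in practice is to expand a short sequence containing them into every concrete sequence it could stand for. The sketch below encodes the table above as a dictionary; the names `IUPAC` and `expand` are illustrative, and the approach is only sensible for short sequences because the number of expansions grows multiplicatively.

```python
from itertools import product

# Ambiguity codes from the table above, each mapped to its possible bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT",
    "S": "CG", "W": "AT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT",
    "N": "ACGT",
}


def expand(seq: str):
    """List every unambiguous sequence matched by an ambiguous one."""
    choices = (IUPAC[symbol] for symbol in seq.upper())
    return ["".join(bases) for bases in product(*choices)]


print(expand("AR"))         # R is A or G
print(len(expand("GAYN")))  # 1 * 1 * 2 * 4 combinations
```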

Amino-acid Codes

1-letter  3-letter  Amino Acid       1-letter  3-letter  Amino Acid
A         Ala       alanine          N         Asn       asparagine
C         Cys       cysteine         P         Pro       proline
D         Asp       aspartic acid    Q         Gln       glutamine
E         Glu       glutamic acid    R         Arg       arginine
F         Phe       phenylalanine    S         Ser       serine
G         Gly       glycine          T         Thr       threonine
H         His       histidine        V         Val       valine
I         Ile       isoleucine       W         Trp       tryptophan
K         Lys       lysine           X         Xxx       undetermined
L         Leu       leucine          Y         Tyr       tyrosine
M         Met       methionine       Z         Glx       Glu or Gln

Adapted from Mount, Bioinformatics, Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY (2001)

The exponential growth of molecular sequence databases & CPU power

Year   Base Pairs     Sequences
1982   680338         606
1983   2274029        2427
1984   3368765        4175
1985   5204420        5700
1986   9615371        9978
1987   15514776       14584
1988   23800000       20579
1989   34762585       28791
1990   49179285       39533
1991   71947426       55627
1992   101008486      78608
1993   157152442      143492
1994   217102462      215273
1995   384939485      555694
1996   651972984      1021211
1997   1160300687     1765847
1998   2008761784     2837897
1999   3841163011     4864570
2000   11101066288    10106023
2001   14396883064    13602262

Doubling time ~ one year
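
The "doubling time ~ one year" remark can be checked against the growth table. The sketch below assumes simple exponential growth between two table rows, so that the doubling time is elapsed time × ln 2 / ln(final / initial); the function name `doubling_time` is made up for the example.

```python
import math


def doubling_time(year0: int, count0: int, year1: int, count1: int) -> float:
    """Average doubling time in years, assuming exponential growth."""
    return (year1 - year0) * math.log(2) / math.log(count1 / count0)


# First and last rows of the base-pair column in the table above.
overall = doubling_time(1982, 680338, 2001, 14396883064)
print(f"1982-2001 average doubling time: {overall:.2f} years")

# A single fast year for comparison (1999 to 2000).
burst = doubling_time(1999, 3841163011, 2000, 11101066288)
print(f"1999-2000 doubling time: {burst:.2f} years")
```

Averaged over the whole table the doubling time comes out a little above one year, which is consistent with the slide's rounded statement.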

What are sequence databases?
These databases are an organized way to store the tremendous amount of sequence information accumulating worldwide. Most have their own specific format.
North America: the National Center for Biotechnology Information (NCBI), a division of the National Library of Medicine (NLM), at the National Institutes of Health (NIH), has GenBank & GenPept.
Europe: the European Molecular Biology Laboratory (EMBL), the European Bioinformatics Institute (EBI), and the Swiss Institute of Bioinformatics (SIB) Expert Protein Analysis System (ExPASy) all help maintain the EMBL Nucleotide Sequence Database and the SWISS-PROT & TrEMBL amino acid sequence databases.
Asia: the National Institute of Genetics (NIG) supports the Center for Information Biology's (CIB) DNA Data Bank of Japan (DDBJ).

More organization stuff
Nucleic acid sequence databases (and TrEMBL) are split into subdivisions based on taxonomy (historical rankings: the Fungi warning!).
PIR is split into subdivisions based on level of annotation.
TrEMBL sequences are merged into SWISS-PROT as they receive increased levels of annotation.
Nucleic Acid DBs: GenBank/EMBL/DDBJ (all taxonomic categories, plus Tags: ESTs, GSSs)
Amino Acid DBs: SWISS-PROT, TrEMBL, PIR (PIR1, PIR2, PIR3, PIR4), NRL_3D, GenPept

TrEMBL contains the translations of all coding sequences (CDS) present in the EMBL Nucleotide Sequence Database that are not yet integrated into SwissProt.
PIR (Protein Information Resource) produces the Protein Sequence Database (PSD) of functionally annotated protein sequences, which grew out of the Atlas of Protein Sequence and Structure (1965-1978) edited by Margaret Dayhoff.

[Diagram labels: TrEMBL (protein translated from EMBL), EMBL (DNA), SwissProt (curated sequenced proteins), PIR, PROSITE, GenBank, DDBJ]

What about other types of biological databases?
Three-dimensional structure databases: the Protein Data Bank and the Rutgers Nucleic Acid Database.
These databases contain all of the 3D atomic coordinate data necessary to define the tertiary shape of a particular biological molecule. The data is usually experimentally derived, either by X-ray crystallography or with NMR, but sometimes it is a hypothetical model. In all cases the source of the structure and its resolution is clearly indicated.
Secondary structure boundaries, sequence data, and reference information are often associated with the coordinate data, but it is the 3D data that really matters, not the annotation.

Other types of Biological DBs
Still more; these can be considered "non-molecular":
Genomic linkage mapping databases for most large genome projects (with pointers to sequences): H. sapiens, Mus, Drosophila, C. elegans, Saccharomyces, Arabidopsis, E. coli, ....
Reference Databases (also with pointers to sequences): e.g. OMIM (Online Mendelian Inheritance in Man); PubMed/MedLine (over 11 million citations from more than 4 thousand bio/medical scientific journals).
Phylogenetic Tree Databases: e.g. the Tree of Life.
Metabolic Pathway Databases: e.g. WIT (What Is There) and Japan's GenomeNet KEGG (the Kyoto Encyclopedia of Genes and Genomes).
Population studies data: which strains, where, etc.
And then databases that many biocomputing people don't even usually consider: e.g. GIS/GPS/remote sensing data, medical records, census counts, mortality and birth rates ....

Large Databases
Once upon a time, GenBank sent out sequence updates on CD-ROM disks a few times per year.
Now GenBank is over 40 Gigabytes (11 billion bases).
Most biocomputing sites update their copy of GenBank every day over the internet.
Scientists access GenBank directly over the Web.

Finding Genes in GenBank
These billions of G, A, T, and C letters would be almost useless without descriptions of what genes they contain, the organisms they come from, etc.
All of this information is contained in the "annotation" part of each sequence record.

Entrez is a Tool for Finding Sequences
GenBank is managed by the NCBI (National Center for Biotechnology Information), which is a part of the US National Library of Medicine.
NCBI has created a Web-based tool called Entrez for finding sequences in GenBank. http://www.ncbi.nlm.nih.gov
Each sequence in GenBank has a unique accession number.
Entrez can also search for keywords such as gene names, protein names, and the names of organisms or biological functions.
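
Entrez keyword searches can also be scripted. The sketch below only builds a query URL and does not contact NCBI. It assumes NCBI's E-utilities `esearch.fcgi` endpoint and its `db`, `term` and `retmax` parameters, none of which are mentioned in these slides; the helper name `entrez_search_url` and the example search term are invented for illustration.

```python
from urllib.parse import urlencode

# Assumption: NCBI E-utilities ESearch endpoint (not part of the slides).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def entrez_search_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch URL for a keyword query against one Entrez database."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{ESEARCH}?{query}"


# Hypothetical query: organism name plus a protein keyword.
url = entrez_search_url("nucleotide", "Fasciola hepatica[Organism] AND cathepsin", 5)
print(url)
```

Fetching the URL (for example with `urllib.request.urlopen`) would return a list of matching record identifiers; NCBI's usage guidelines on request rates apply to any real use.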

Entrez is Internally Cross-linked
DNA and protein sequences are linked to other similar sequences.
Medline citations are linked to other citations that contain similar keywords.
3-D structures are linked to similar structures.

Databases contain more than just DNA & protein sequences

The "omics" Series
Genomics: Gene identification & characterisation
Transcriptomics: Expression profiles of mRNA
Proteomics: Functions & interactions of proteins
Structural Genomics: Large scale structure determination
Cellinomics: Metabolic pathways, cell-cell interactions
Pharmacogenomics: Genome-based drug design

Structural Genomics
What is structural genomics?
Genomes and folds:
  Finding folds in genomes
  Structural properties of entire proteomes
  Comparing genomes in terms of structure
Selection of targets for structural genomics:
  Covering the sequence space with structures
  Using structure to understand function
  Systematic structure determination for complete genomes
Special targets:
  Predicting success of structure determination
  Adaptation of proteins to extreme environments
Structural genomics resources on the internet

Functional Genomics
Development and application of global (genome-wide or system-wide) experimental approaches to assess gene function by making use of the information provided by structural genomics.

Commercial Structural Genomics Initiatives
IBM (Blue Gene project: 2000): Computational protein folding
Geneformatics (1999): Modeling for identifying active sites
Prospect Genomics (1999): Homology modeling
Protein Pathways (1999): Phylogenetic profiling, domain analysis, expression profiling
Structural Bioinformatics Inc (1996): Homology modeling, docking

Human Genome Project
The genome sequence is almost complete! Approximately 3.5 billion base pairs.

Raw Genome Data

Implications for Biomedicine
Physicians will use genetic information to diagnose and treat disease. Virtually all medical conditions (other than trauma) have a genetic component.
Faster drug development research
  Individualized drugs
  Gene therapy
All biologists will use gene sequence information in their daily work.

Bioinformatics Challenges
The huge dataset
Lots of new sequences being added:
  - automated sequencers
  - Human Genome Project
  - EST sequencing
GenBank has over 10 billion bases and is doubling every year!! (the problem of exponential growth...)
How can computers keep up?

Genome comparisons
Designed for looking at complete bacterial genomes.
Gene finding
AT content
Forward translations
Reverse translations
DNA and amino acids
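
Several of the operations listed above (AT content, forward and reverse translation) are short enough to sketch directly. The code below uses the standard genetic code and reads "reverse translation" as translating the reverse-complement strand, which is an assumption about what the slide means. All function names and the example sequence are illustrative.

```python
# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def at_content(seq: str) -> float:
    """Fraction of A and T bases in a DNA sequence."""
    seq = seq.upper()
    return (seq.count("A") + seq.count("T")) / len(seq)


def reverse_complement(seq: str) -> str:
    """Sequence of the opposite strand, read 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq: str, frame: int = 0) -> str:
    """Translate one reading frame; '*' marks a stop codon."""
    seq = seq.upper()[frame:]
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)


dna = "ATGGCCTAA"  # invented example sequence
print(at_content(dna))
print(translate(dna))                      # forward strand, frame 0
print(translate(reverse_complement(dna)))  # reverse strand, frame 0
```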

Gene finding

Bringing a New Drug to Market
[Timeline figure, horizontal axis: Years, 0 to 16]
Discovery and preclinical testing: Compounds are identified and evaluated in laboratory and animal studies for safety, biological activity, and formulation. (5,000 compounds evaluated)
Phase I: Evaluates safety and dosage in 20 to 100 healthy human volunteers. (5 compounds enter clinical trials)
Phase II: Assesses effectiveness and looks for side effects in 100 to 500 patient volunteers.
Phase III: Confirms effectiveness and monitors adverse reactions from long-term use in 1,000 to 5,000 patient volunteers.
Review and approval by the Food & Drug Administration. (1 compound approved)

Impact of Structural Genomics on Drug Discovery

Epitopes
B-cell epitopes
Th-cell epitopes

Vaccine development
In the post-genomic era: the Reverse Vaccinology approach.

How a molecule changes during MD (molecular dynamics)

In Silico Analysis
[Diagram labels: Peptide Multitope vaccines, VACCINOME, Candidate Epitope DB, Epitope prediction, Disease related protein DB, Gene/Protein Sequence Database]

The job of the biologist is changing
As more biological information becomes available:
The biologist will spend more time using computers.
The biologist will spend more time on data analysis (and less doing lab biochemistry).
Biology will become a more quantitative science (think how the periodic table and atomic theory affected chemistry).

Biological Research in 21st Century
"The new paradigm, now emerging, is that all the 'genes' will be known (in the sense of being resident in databases available electronically), and that the starting point of a biological investigation will be theoretical." - Walter Gilbert

Data Quality Issues

Bioinformatics Databases
Usually organised in flat files
Huge collection of data
Include alphanumeric and pictorial data
Latest databases have gene/protein expression data (images)
Demand:
  High quality curated data
  Interconnectivity between data sets
  Fast and accurate data retrieval tools (queries using fuzzy logic)
  Excellent data mining tools for sequence and structural patterns

Errors in DNA sequence and Data Annotation
Current technology should reduce error rates to as low as 1 base in 10,000, as every base is sequenced 6 to 10 times, with at least one reading per strand.
Therefore, in a prokaryote, an error rate of 1 isolated wrong base in 10,000 would result in one amino acid error in ~10-15 proteins.
In the human genome, gene-dense regions contain about 1 gene per 10,000 bases, with the average estimated at 1 gene per 30,000 bases. Therefore, the corresponding error rate would be roughly one amino acid substitution in 100 proteins.
But large scale errors in sequence assembly can also occur. Missing a nucleotide can cause a frameshift error.

DNA data
The DNA databases (EMBL/GenBank/DDBJ) carry out quality checks on every sequence submitted.
No general quality control algorithm is yet in widespread use.
Some annotations are hypothetical because they are inferences derived from the sequences, e.g. identification of coding regions. These inferences have error rates of their own.
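
The prokaryote figure of one amino acid error in roughly 10 to 15 proteins can be reproduced with back-of-the-envelope arithmetic. The slide does not state its inputs, so the sketch below makes two explicit assumptions: an average coding sequence of about 1,000 bases, and about three quarters of random single-base substitutions changing the encoded amino acid. Both numbers, and the function name, are illustrative rather than taken from the source.

```python
def proteins_per_aa_error(error_rate: float,
                          coding_length_bp: int,
                          nonsynonymous_fraction: float) -> float:
    """Number of proteins sequenced, on average, per amino acid error."""
    errors_per_protein = error_rate * coding_length_bp * nonsynonymous_fraction
    return 1.0 / errors_per_protein


ERROR_RATE = 1 / 10_000        # from the slide: 1 wrong base in 10,000
CODING_LENGTH = 1_000          # assumption: typical prokaryotic gene length
NONSYNONYMOUS_FRACTION = 0.75  # assumption: share of substitutions that change the residue

estimate = proteins_per_aa_error(ERROR_RATE, CODING_LENGTH, NONSYNONYMOUS_FRACTION)
print(f"about one amino acid error per {estimate:.0f} proteins")
```

Under these assumptions the estimate lands inside the slide's 10 to 15 range; different gene lengths or substitution assumptions shift it proportionally.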

Quality Control Issues related to 3-D structure data determined using X-rays
The reported parameter called the "B-factor" of each atom describes its effective size, and for proteins it should be treated as an empirical value.
Because every atom contributes to every observation, it is difficult to estimate errors in individual atomic positions.

Resolution of structures in PDB (from low resolution to high resolution)

Resolution (Å)                           4.0   3.5   3.0   2.5   2.0   1.5
Ratio of observations to parameters      0.3   0.4   0.6   1.1   2.2   3.8

The median resolution of structures in the Protein Data Bank is about 2.0 Å.

An example
Cysteine protease of Fasciola hepatica: in search of an immunogenic peptide

Alignment: mammalian cysteine proteases vs. the cysteine protease of Fasciola hepatica.
Legend: identical AA, divergent AA. In each pair the first row is the mammalian sequence and the second is F. hepatica; color names mark the segments highlighted on the slide.

 VPKSVDWREKGYVTPVKNQGQCGSCWAFSATGALEGQMFRKTGR ISLSEQNLVDCSRPQGN
AVPDKIDWRESGYVTEVKDQGNCGSCWAFSTTGTMEGQYM KNERTSISFSEQQLVDCSRPWGN   (red)

QGCNGGLMDNAFQYIKENGGLDSEESYPYEATDTSCNY KPEYSVANDTGFVDIPQREKA LMK
NGCGGGLMENAYQYLKQF GLETESSYPYTAVGGQCRYNKQLG VAKVTGYYTV QSGSEVEL KN   (violet, yellow)

AVATVGPISVAIDAGHSFQFYKSGIYYEPDCSSKDLDHGVLVVGYGFEG TDSNNNKYW IVKNSW
LIGSEGPSAVAVDVESDFMMYRSGIYQSQTCSPLRVNHAVLAVGYGTQGGTD YW IVKNSW   (green)

GPEWGM-GYVKMAKDRNNH CGIATAASYPTV
GLSWGERGYIRMV RNRGNMCGIASLASLPMVARFP
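
Marking identical and divergent residues in an alignment like the one above comes down to a column-by-column comparison. The sketch below uses a gap-free 16-residue excerpt around the conserved CGSCWAFS motif, copied from the first alignment block; the function name `compare_aligned` is made up for the example and the code assumes both rows are already aligned to equal length.

```python
def compare_aligned(row_a: str, row_b: str):
    """Return (percent identity, list of divergent column indices)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    divergent = [i for i, (x, y) in enumerate(zip(row_a, row_b)) if x != y]
    identity = 100.0 * (len(row_a) - len(divergent)) / len(row_a)
    return identity, divergent


# Excerpt from the alignment above (first row mammalian, second F. hepatica).
mammal = "CGSCWAFSATGALEGQ"
fasciola = "CGSCWAFSTTGTMEGQ"

identity, divergent = compare_aligned(mammal, fasciola)
print(f"{identity:.2f}% identical; divergent columns: {divergent}")
```

Runs of consecutive divergent columns are the natural starting point when looking for peptides that differ from the host protein.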

Discontinuous epitope: formed by distant portions of the sequence.
Denaturation: the epitope is lost upon denaturation.

Continuous epitope: formed by a single portion of the sequence.
Denaturation: the epitope is preserved as such.

Three-dimensional modeling by homology.
56% sequence identity with chymopapain (1YAL)

Surface analysis: atoms shown by van der Waals radius
Legend: identical AA, divergent AA

Selection of sequences that are (1) divergent, (2) solvent accessible, and (3) continuous.
TMEGQYMKNERTSISFS
YYTVQSGSEVELK
NLIGSE
QSQTCSPLRVN
RYNKQLGVAKV

Another example
Sensitivity of the HIV-1 aspartyl protease to the most common inhibitors

Cartoon representation of the HIV-1 protease enzyme

HIV-1 protease enzyme showing the secondary structure elements, flaps, and active site

HIV-1 protease enzyme indicating the consensus residues for inhibitor-enzyme binding

INDINAVIR

RITONAVIR

COMPARISON BETWEEN AN ENZYME SENSITIVE TO RITONAVIR AND ONE RESISTANT TO IT