CS691K Bioinformatics Kulp Lecture Notes #0 Molecular & Cell Biology Fall 2005 dkulp@cs.umass.edu
Syllabus distributed
Logistics
  Class taught in 3 stages by faculty in CS, math/stats, and microbiology.
  Grades will be based on up to six homework assignments.
  Office hours are on the syllabus. All faculty are readily available by email. We are happy to discuss the class with you personally.
  Not all notes will be available online - you should attend all lectures and take good notes.
  Diverse group of students.
  Emphasis will be on understanding methods and practical use of existing bioinformatics tools.
  Why are you here? What is your background? What are you hoping to get out of this class?
  Please sign the email sheet!
  Homework will involve the use of the Unix ED-LAB computers. There will be a special meeting on WEDNESDAY, SEPTEMBER 14 for novice Unix users.
What is Bioinformatics?
  Computational Biology: The use of algorithmic, mathematical, and statistical methods to analyze genome sequences (i.e. DNA, RNA, protein) and derived data (e.g. expression, NMR, etc.)
  Informatics: The software and data management methodologies for storing, retrieving, and integrating such data
  Data Mining / In-silico Biology: Hypothesis generation and testing from genome data sets
Topics
  Detecting similar sequences (homology)
  Pairwise and multiple sequence alignment
  Protein function/structure prediction
  Sequence pattern modeling and recognition
  Motif discovery
  Gene finding
  Analyzing high-dimensional data
  Function prediction, target discovery, etc. from gene expression
  Constructing trees
  Phylogenetics
  Informatics and integration
  Genome biology
The Cell
  Prokaryotes are unicellular with minimal compartments - bacteria, archaea.
  Eukaryotes have many organelles, including the nucleus, and typically can reproduce sexually. They are often multicellular with differentiated cell types, though some (e.g. yeast) are unicellular. They include all higher organisms: mammals, birds, fish, invertebrates, mushrooms, plants, and yeast.
  Roughly 10^13 to 10^14 cells in a human.
The Cell
  The cell is composed of and makes thousands of proteins.
    E.g. the cell membrane is made of a layer of proteins and lipids. There are special proteins embedded in the membrane as channels and pumps.
    And the cell makes (synthesizes) proteins.
  "DNA makes RNA, RNA makes proteins, and proteins make us!" - F. Crick
  The cell is a chemical catalytic machine.
  Networks:
    One type of network is the metabolic network, describing catalytic reactions for the consumption or synthesis of products necessary for life. Many of these are fairly well understood (e.g. photosynthesis).
    Another type is the signaling network, where information is conveyed about the environment. These are partially understood (e.g. protein kinases are involved in cell differentiation and cell death).
From KEGG (http://www.genome.ad.jp/kegg/pathway.html)
The Cell - Genetic Information
  There is a third major type of network: genetic information processing. We will focus on these networks.
  To understand this, we:
    Describe the nature of DNA
    Tangentially mention homology and conservation
    Then discuss the process of translation
DNA Structure - Eukaryotic Chromosome
  DNA - a string of nucleotides (with bases Adenine, Guanine, Cytosine, and Thymine)
  Regular, long, stable, oriented, double-stranded, helical structure
  Humans: 23 pairs of chromosomes. Total ~3B bases (x2)
  DNA resides in the nucleus in eukaryotes
DNA Structure
  Always: chemical pairing of A-T and C-G. Thus, strands are complementary.
  Two chains run in opposite directions: one strand 5' to 3', the other 3' to 5'.
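Because the two strands are complementary and antiparallel, either strand determines the other. A minimal Python sketch of that idea follows; the function name `reverse_complement` and the example sequences are illustrative choices, not part of the lecture.

```python
# Watson-Crick pairing from the slide: A pairs with T, C pairs with G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(strand: str) -> str:
    """Given one strand written 5'->3', return the opposite strand, also 5'->3'."""
    # Complement each base, then reverse, because the strands run antiparallel.
    return strand.upper().translate(COMPLEMENT)[::-1]

if __name__ == "__main__":
    print(reverse_complement("ATGCGT"))
```

Applying the function twice returns the original strand, which is a quick sanity check on the pairing table.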
Prokaryotic Chromosomes
  Prokaryotes (and mitochondria) have one circular chromosome.
  Figure: the E. coli genome, with orange and yellow bars indicating the positions of the genes on the two strands.
RNA
  RNA is a similar molecule composed of 4 nucleotides (A, C, G, and U)
  Single-stranded
  Can base-pair with DNA (synthesis)
  Can self-base-pair and fold
DNA Replication
  We won't be discussing the details of DNA replication.
  There are 2 processes:
    Mitosis for normal cell duplication
    Meiosis for gametes for sexual reproduction - single, recombined chromosomes
  In both processes, DNA is copied by breaking double-stranded DNA (dsDNA) into single strands (ssDNA) at origins of replication and synthesizing a complementary copy from the template.
  50 bp/sec * 15K origins = 750K bp/sec, so ~3B bp takes ~4,000 sec, i.e. ~1 hr to replicate the human genome.
  Problem: How does DNA polymerase find the origins? Are there sequence patterns?
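The back-of-the-envelope replication estimate can be reproduced in a few lines of Python. The figures are the ones on the slide (~3B bases, 50 bp/sec, 15K origins). The function name is made up for illustration, and the model is a simplifying assumption: every origin is active at once and copies at the stated rate.

```python
def replication_time_hours(genome_bp: float, rate_bp_per_sec: float, origins: int) -> float:
    """Estimate replication time, assuming all origins copy in parallel at the same rate."""
    seconds = genome_bp / (rate_bp_per_sec * origins)
    return seconds / 3600.0

if __name__ == "__main__":
    hours = replication_time_hours(3e9, 50, 15_000)
    print(f"{hours:.2f} hours")
```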
The Tree of Life Single common ancestral genome!
DNA Conservation and Variation
  Mutations occur in DNA due to environmental effects (e.g. radiation) and random mistakes during synthesis.
  Usually just single nucleotides are changed; sometimes there are large rearrangements.
  Changes occurring in somatic (non-sex) cells cause local damage, usually cell death, but can cause cancer. (Search for the common mutations that cause different types of cancers.)
  Changes occurring in gametes can be inherited and, if favorable, can become fixed.
  Variation in non-functional (junk) DNA tends to drift, whereas functional DNA (e.g. containing genes) tends to remain conserved.
  Problems:
    Given a set of sequences from different organisms:
      Identify and align sequences from a common ancestor (homologous)
      What are the important (conserved) parts?
      What was the evolutionary history? (Reconstruct the tree)
    Given a model organism (e.g. mouse, yeast, fruitfly, etc.), find the orthologous locus in human
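To make the "align sequences" problem concrete, here is a hedged sketch of pairwise global alignment by dynamic programming (the Needleman-Wunsch approach). The lecture does not specify a method at this point. The scoring values (match +1, mismatch -1, gap -1) and the function name `global_align` are assumptions chosen only for illustration.

```python
def global_align(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    n, m = len(a), len(b)
    # score[i][j] = best score aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i: int, j: int) -> int:
        # Substitution score for aligning a[i-1] with b[j-1]
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))

if __name__ == "__main__":
    s, x, y = global_align("ACGT", "AGT")
    print(s)
    print(x)
    print(y)
```

Real tools use biologically motivated substitution matrices and gap penalties; the uniform scores here only show the mechanics.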
Examples of Sequence Conservation
  Figure: a segment from the RNA needed for protein synthesis - a fundamental process in all life forms. It is conserved across all 3 major branches of the tree of life.
  Figure: a multiple alignment of homologous protein sequences. Colors indicate different classes of amino acids. Dots are insertions/deletions.
DNA contains GENES
  Genes are hereditary units of DNA.
  We now know that, for the most part, genes are regions that code for proteins.
  Proteins are derived from DNA according to the "central dogma": DNA => RNA => Protein
    As in DNA replication, the DNA is opened into two single strands.
    Using one ssDNA strand as a template, a complementary copy of RNA is synthesized for a small region of the genome (1,000-100,000 nt).
    The RNA is processed and transported (more about that in later lectures).
    Each triplet of RNA bases (codon) is translated to one of 20 amino acids, creating a polypeptide chain, which folds into a protein.
  Problems:
    How does the cell know where to find a gene? (Sequence patterns?)
    How does RNA transcription know when to stop? (Patterns?)
    How is RNA edited?
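The transcription step (template ssDNA to complementary RNA) can be sketched as follows. This assumes the template strand is written 5'->3', so the RNA is its reverse complement with U in place of T. The function name and example sequence are hypothetical.

```python
# DNA template base -> complementary RNA base (U replaces T in RNA).
DNA_TO_RNA = str.maketrans("ACGT", "UGCA")

def transcribe(template_strand: str) -> str:
    """Return the RNA (5'->3') complementary to a DNA template given 5'->3'."""
    # Complement each base, then reverse so the RNA also reads 5'->3'.
    return template_strand.upper().translate(DNA_TO_RNA)[::-1]

if __name__ == "__main__":
    # Template strand for the coding sequence ATGGCC
    print(transcribe("GGCCAT"))
```

The resulting RNA matches the non-template (coding) strand with T replaced by U.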
Central Dogma - DNA - RNA - Protein
  Figure © 1998 by Alberts, Bray, Johnson, Lewis, Raff, Roberts, Walter
Codon Translation
  Each triplet translates to exactly one amino acid (or a stop signal). For example, CUU is Leucine.
  There are 4*4*4=64 possible codons: 61 translate into the 20 amino acids and 3 are stop signals, so most amino acids are encoded by more than one codon.
  This translation table is fixed for almost all life.
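As an illustration, the standard genetic code can be built as a lookup table and used to translate an RNA string. This is a sketch, not course code: the names `CODON_TABLE` and `translate` are made up here, amino acids are given as one-letter codes, and `*` marks a stop codon.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, listed in UCAG order for the 1st, 2nd, and 3rd base.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each of the 64 codons to a one-letter amino acid code ('*' = stop).
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(rna: str) -> str:
    """Translate RNA codon by codon from position 0, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        aa = CODON_TABLE[rna[i:i + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

if __name__ == "__main__":
    print(CODON_TABLE["CUU"])      # Leucine, one-letter code L
    print(translate("AUGCUUUAA"))  # Met, Leu, then stop
```

This sketch reads from the first base; finding the correct reading frame and start codon is a separate problem.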
Cell Differentiation
  Eukaryotes have many different cell types (skin, muscle, neurons, etc.) that each play a different role.
  To accomplish the cell's role, different genes must be activated.
  Problems:
    How are genes activated? What regulatory patterns are in the DNA?
    What genes control other genes? What network associations among genes can be found?
    What genes are differentially expressed?
Cell Differentiation
Differential Expression Interleukin 1 alpha expressed in different cell types
Protein Sequence, Structure, Function
  Lastly, given a protein sequence, what is the 3-D structure and function?
  The most common approach is to exploit conservation (see earlier).
  Problem: Find proteins similar to my query protein. If structure or function is already known for a homologous protein, I may be able to assign it to my new query protein. (Sequence similarity searching, protein family modeling)
Protein Structure
Further Reading
  Many online intros to genome biology
    E.g. http://www.ncbi.nlm.nih.gov/about/primer/
  Any molecular biology text
    E.g. Molecular Biology of the Cell by Alberts et al. or Genomes by Brown.