Molecular Biology of the Genome. Christine Queitsch Department of Genome Sciences


1 Molecular Biology of the Genome Christine Queitsch Department of Genome Sciences 1

2 Outline
- Information Flow in Genomics
- Gene Structure
- Genetic Linkage
- Chromatin Structure
- Genome Sequencing

3 DNA and the Flow of Information
- The genetic material: DNA, built from four kinds of subunits (bases A, C, G, T)
- Activities within the cell are performed by proteins, built from twenty kinds of subunits (amino acids)
- A coding problem: how does a four-letter alphabet specify a twenty-letter one?
[Figure: the four DNA bases alongside an example amino acid chain running from the N-terminus (H3N+) to the C-terminus (COO-)]

4 The Central Dogma of Molecular Biology
- DNA (heredity) -> RNA -> Protein (phenotype), via replication, transcription, and translation
- Information flows one way into protein
- A universal code: 3 nucleotides = 1 amino acid

5 DNA Structure
- Information content is in the sequence of bases along a DNA molecule
- Rules of base pairing: each strand of the double helix has all the information needed to recreate the other strand
- Redundancy in the code: multiple ways that DNA can specify a single amino acid
- Genetic variation: differences in the base sequence between individuals; why individuals vary in their phenotypes

6 Central Dogma: DNA Replication
- DNA structure: polarity and base pairing (Watson strand 5' to 3', Crick strand 3' to 5'; A pairs with T, G pairs with C)
- DNA replication: what's the point? To duplicate the entire genome prior to cell division
- New subunits can only be added to the 3'-OH of the growing chain
[Figure: replication fork with leading strand and lagging strand]
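The base-pairing rules on this slide are enough to rebuild one strand from the other. A minimal sketch, assuming strands are plain strings written 5' to 3'; the function name and the example sequence are illustrative and not from the slides:

```python
# Watson-Crick pairing rules from the slide: A pairs with T, G pairs with C.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(strand: str) -> str:
    """Return the partner strand, also written 5' to 3'.

    The two strands are antiparallel, so the partner is read in the
    opposite direction: reverse the input, then pair each base.
    """
    return "".join(PAIR[base] for base in reversed(strand.upper()))


watson = "ATGACT"                      # hypothetical Watson strand, 5' to 3'
crick = reverse_complement(watson)     # partner (Crick) strand, 5' to 3'
print(watson, crick)
```

Applying the function twice returns the original strand, which is the sense in which each strand carries all the information needed to recreate the other.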

7 Central Dogma: Transcription
- Genes: specific segments along the chromosomal DNA that code for some function
- Transcription: copy a gene into RNA (to make a specific protein)
[Figure: genes along a chromosome, each with a promoter and a terminator, and the mRNA transcribed from each]

8 Transcription
- Transcription: copy a gene into RNA to make a specific protein
- RNA (ribonucleic acid) uses uracil (U) in place of thymine (T)
[Figure: a gene with the coding (sense) strand W drawn 5' to 3', the template strand C drawn 3' to 5', RNA polymerase, and the mRNA]
- Where's the 5' end of the gene? Of the mRNA? Which way is RNA polymerase moving?
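The relationship between the two DNA strands and the mRNA can be sketched in a few lines. This is an illustration only: the function names and the example strands are assumptions, and the template strand is deliberately written 3' to 5' so that it lines up base-for-base with the coding strand.

```python
def transcribe_from_coding(coding_5to3: str) -> str:
    """The mRNA matches the coding (sense) strand, with U in place of T."""
    return coding_5to3.upper().replace("T", "U")


def transcribe_from_template(template_3to5: str) -> str:
    """Pair each template base (read 3' to 5') to build the mRNA 5' to 3'."""
    pair = {"A": "U", "T": "A", "G": "C", "C": "G"}
    return "".join(pair[base] for base in template_3to5.upper())


coding = "ATGACTTCA"      # hypothetical coding strand, 5' to 3'
template = "TACTGAAGT"    # its partner strand, written 3' to 5'

# Both routes give the same mRNA.
print(transcribe_from_coding(coding))
print(transcribe_from_template(template))
```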

9 Transcription in vivo
[Figure: a gene's DNA with RNA polymerases and nascent RNA transcripts]

10 Practice Question
[Figure: a gene with strand W drawn 5' to 3' and strand C drawn 3' to 5']
1. Which way (to the right or left) are RNA polymerases moving?
2. Which strand (W or C) is the template strand?

11 Processing of pre-mRNA
Eukaryotic genes are interrupted by introns (non-coding information). Introns must be removed from the RNA before translation in a process called splicing.
[Figure: gene -> pre-mRNA (exons and introns) -> mature mRNA with the ORF flanked by UTRs (untranslated regions); introns discarded, exons spliced together]

12 Review of the Central Dogma: Translation
Translating the nucleic acid code to a peptide code. Possible coding systems:
- 1 base per amino acid: could only code for 4 amino acids!
- 2 bases per amino acid: could only code for 16 amino acids
- 3 bases per amino acid: 64 possible combinations, that's plenty!
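The counts on this slide are just 4 raised to the codeword length. A small sketch that enumerates the codewords rather than computing the power directly; the variable names are illustrative:

```python
from itertools import product

BASES = "ACGU"          # the four RNA bases
AMINO_ACIDS = 20        # number of amino acids to be encoded

# Enumerate every possible codeword of length 1, 2, and 3 and count them.
counts = {n: len(list(product(BASES, repeat=n))) for n in (1, 2, 3)}

for n, count in counts.items():
    verdict = "enough" if count >= AMINO_ACIDS else "too few"
    print(f"{n} base(s) per amino acid: {count} codewords ({verdict})")
```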

13 The triplet code
- 3 bases = 1 amino acid; each triplet is a codon
- More than 1 triplet can code for the same amino acid
- Translation: reads the information in RNA to order the amino acids in a protein
[Figure: mRNA 5' AUGACUUCAGUAACAUUAACC 3' and the protein NH3+ Met Thr Ser Val Thr Phe COO-]

14 Punctuation
- Start: AUG = methionine, the first amino acid in (almost) all proteins
- Stop: UAA, UAG, and UGA. A stop codon is NOT an amino acid!
[Figure: mRNA 5' AUGACUUCAGUAACAUUAACC 3' and the protein NH3+ Met Thr Ser Val Thr Phe STOP COO-]
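The start and stop rules can be combined into a small virtual translation. This is a sketch under stated assumptions: it uses the standard genetic code, one-letter amino acid symbols (M = Met, T = Thr, S = Ser, V = Val, F = Phe), and a hypothetical mRNA chosen here so that it encodes the peptide named on the slide; none of the names below come from the lecture.

```python
from itertools import product

# Standard genetic code, listed in U, C, A, G order for each codon position.
# "*" marks the three stop codons.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate(mrna: str) -> str:
    """Translate from the first AUG up to, but not including, the first in-frame stop."""
    start = mrna.find("AUG")
    if start == -1:
        return ""                      # no start codon, nothing to translate
    peptide = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":          # stop codon: not an amino acid
            break
        peptide.append(amino_acid)
    return "".join(peptide)


example = "AUGACUUCAGUAACAUUUUAA"      # hypothetical mRNA, 5' to 3'
print(translate(example))
```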

15 The Genetic Code: Who is the interpreter? Where's the dictionary? What are the rules of grammar?
- tRNA = transfer RNA
- An aminoacyl-tRNA synthetase attaches the amino acid (here Met) to the 3' end of its tRNA, producing a charged tRNA
- The tRNA anticodon (3' UAC 5') recognizes the codon (5' AUG 3') in the mRNA

16 The ribosome: mediates translation
- Locates the 1st AUG, sets the reading frame for codon-anticodon base-pairing
[Figure: ribosome on mRNA 5' AUAUGACUUCAGUAACCAUCUAACA 3' after the 1st two tRNAs have bound: Met-tRNA (anticodon UAC) and Thr-tRNA (anticodon UGA)]

17 ...the ribosome breaks the Met-tRNA bond; Met is instead joined to the second amino acid
[Figure: ribosome with the P-site and A-site marked; tRNAs with anticodons UAC and UGA on mRNA 5' AUAUGACUUCAGUAACCAUCUAACA 3']

18 ...the ribosome breaks the Met-tRNA bond; Met is instead joined to the second amino acid, and the Met-tRNA is released. Then the ribosome moves over by 1 codon in the 3' direction.
[Figure: ribosome with Met-Thr; tRNAs with anticodons UAC and UGA on mRNA 5' AUAUGACUUCAGUAACCAUCUAACA 3']

19 [Figure: growing chain Met Thr Ser; tRNAs with anticodons UGA and AGU on mRNA 5' AUAUGACUUCAGUAACCAUCUAACA 3']

20 When the ribosome reaches the stop codon: termination
[Figure: chain Met Thr Ser Val Thr Phe STOP; last tRNA with anticodon UAG on mRNA 5' AUAUGACUUCAGUAACCAUCUAACA 3']

21 The finished peptide!
N-terminus NH3+ Met Thr Ser Val Thr Phe COO- C-terminus
mRNA: 5' AUAUGACUUCAGUAACCAUCUAACA 3'

22 Practice Question
Which strand of the DNA sequence is the coding (sense) strand? How can you tell?

23 Finding Sense in Nonsense cbdryloiaucahjdhtheflybitthedogbutnotthecatjhhajctipheq GGGTATAGAAAATGAATATAAACTCATAGACAAGATCGGTGAGGGAACATTTTCGTCAGTGTATAAAGCCAAAGATATCACTGGGAAAATAA CAAAAAAATTTGCATCACATTTTTGGAATTATGGTTCGAACTATGTTGCTTTGAAGAAAATATACGTTACCTCGTCACCGCAAAGAATTTATAA TGAGCTCAACCTGCTGTACATAATGACGGGATCTTCGAGAGTAGCCCCTCTATGTGATGCAAAAAGGGTGCGAGATCAAGTCATTGCTGTTT TACCGTACTATCCCCACGAGGAGTTCCGAACTTTCTACAGGGATCTACCAATCAAGGGAATCAAGAAGTACATTTGGGAGCTACTAAGAGCA TTGAAGTTTGTTCATTCGAAGGGAATTATTCATAGAGACATCAAACCGACAAATTTTTTATTTAATTTGGAATTGGGGCGTGGAGTGCTTGTT GATTTTGGTCTAGCCGAGGCTCAAATGGATTATAAAAGCATGATATCTAGTCAAAACGATTACGACAATTATGCAAATACAAACCATGATGGT GGATATTCAATGAGGAATCACGAACAATTTTGTCCATGCATTATGCGTAATCAATATTCTCCTAACTCACATAACCAAACACCTCCTATGGTCAC CATACAAAATGGCAAGGTCGTCCACTTAAACAATGTAAATGGGGTGGATCTGACAAAGGGTTATCCTAAAAATGAAACGCGTAGAATTAAAA GGGCTAATAGAGCAGGGACTCGTGGATTTCGGGCACCAGAAGTGTTAATGAAGTGTGGGGCTCAAAGCACAAAGATTGATATATGGTCCGT AGGTGTTATTCTTTTAAGTCTTTTGGGCAGAAGATTTCCAATGTTCCAAAGTTTAGATGATGCGGATTCTTTGCTAGAGTTATGTACTATTTTT GGTTGGAAAGAATTAAGAAAATGCGCAGCGTTGCATGGATTGGGTTTCGAAGCTAGTGGGCTCATTTGGGATAAACCAAACGGATATTCTA ATGGATTGAAGGAATTTGTTTATGATTTGCTTAATAAAGAATGTACCATAGGTACGTTCCCTGAGTACAGTGTTGCTTTTGAAACATTCGGATT TCTACAACAAGAATTACATGACAGGATGTCCATTGAACCTCAATTACCTGACCCCAAGACAAATATGGATGCTGTTGATGCCTATGAGTTGAA AAAGTATCAAGAAGAAATTTGGTCCGATCATTATTGGTGCTTCCAGGTTTTGGAACAATGCTTCGAAATGGATCCTCAAAAGCGTAGTTCAG CAGAAGATTTACTGAAAACCCCGTTTTTCAATGAATTGAATGAAAACACATATTTACTGGATGGCGAGAGTACTGACGAAGATGACGTTGTC AGCTCAAGCGAGGCAGATTTGCTCGATAAGGATGTTCT How do you find out if sequence contains a gene? How do you identify the gene? 23

24 Reading Frame
- The ribosome establishes the grouping of nucleotides that correspond to codons by the first AUG encountered, and starts counting triplets from this base: 5' AUAUGACUUCAGUAACCAUCUAACA 3'
- ORF: open reading frame, from the first AUG to the first in-frame stop. The ORF encodes the information for the protein.
- More generally: a reading frame with a stretch of codons not interrupted by a stop
- Non-coding RNAs!

25 Looking for ORFs
- Read the sequence 5' to 3', looking for stop codons
- Try each reading frame
- Since we know the genetic code, we can do a virtual translation if necessary
How to identify genes experimentally?
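The ORF search described on these slides can be sketched as a scan for stop codons in each forward reading frame, plus a helper that applies the "first AUG to first in-frame stop" definition. The function names are illustrative assumptions; the mRNA is the one used on the ribosome slides, and only the given strand is scanned (a real search would also cover the reverse complement).

```python
STOPS = {"UAA", "UAG", "UGA"}


def stops_by_frame(seq: str) -> dict:
    """For each of the three forward reading frames, list the start positions of stop codons."""
    return {
        frame: [i for i in range(frame, len(seq) - 2, 3) if seq[i:i + 3] in STOPS]
        for frame in range(3)
    }


def first_orf(mrna: str):
    """Return the ORF from the first AUG through the first in-frame stop, or None."""
    start = mrna.find("AUG")
    if start == -1:
        return None
    for i in range(start, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOPS:
            return mrna[start:i + 3]
    return None                        # ran off the end without meeting a stop


mrna = "AUAUGACUUCAGUAACCAUCUAACA"     # sequence from the ribosome slides, 5' to 3'
print(stops_by_frame(mrna))
print(first_orf(mrna))
```

In a long genomic sequence, a frame that runs for many codons without a stop is the usual hint that a protein-coding gene may be present.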

26 Outline
- Information Flow in Genomics
- Gene Structure
- Genetic Linkage
- Chromatin Structure
- Genome Sequencing

27 Gene Structure: The Parts List
Genomic DNA for a protein-coding eukaryotic gene comprises regulatory and coding sequences.
[Figure: Enhancer (distal regulatory element), Promoter (proximal regulatory element), 5' UTR, Exon, Intron, Exon, Intron, 3' UTR]
- CRM (cis-regulatory motif): can be upstream or downstream of the promoter, proximal or distal

28 Promoters
- Promoters are specific sites on DNA that RNA polymerase first binds to initiate the transcription of a gene
- Composed of a variety of different cis-sequence elements, which recruit trans-acting factors through DNA-protein interactions

29 Core Promoter Elements
[Figure: gene diagram (Enhancer, Promoter, 5' UTR, Exon, Intron, Exon, Intron, 3' UTR) with the core promoter expanded to show the BRE, TATA, and Inr elements and their consensus sequences; positions ~-50, ~-30, and +1 are marked]
- Not all elements are required
- Many promoters lack a TATA box, using instead the functionally analogous initiator (Inr) element

30 Combinatorial Gene Regulation
- Most eukaryotic genes have multiple cis-regulatory motifs located outside of the core promoter region
- These can be located in promoter-proximal regions, 3' downstream regions, and many kb away from the target gene
- Allows for combinatorial control of gene expression

31 Distal regulatory elements: Enhancers
Enhancers:
- Can function in either orientation
- Can occur far (>50 kb) from the gene
- Can be upstream or downstream
- Range in size between ~ bp
- Contain multiple TF binding sites
[Figure label: Enhanceosome]

32 Untranslated Regions (UTRs)
- Most eukaryotic mRNAs contain untranslated regions at their 5' and 3' ends
- The 5' UTR is the region between the start of transcription and the start of translation
- The 3' UTR is the region between the stop codon and the poly-A tail
- Both the 5' and 3' UTRs can contain cis-regulatory sequences that bind TFs, influence transport to the cytoplasm, mediate transcript stability, and control translation
[Figure: 5' UTR, Exon, Exon, 3' UTR]

33 Alternative Splicing
- Pre-mRNA from some genes can be spliced into two or more distinct transcripts
- Creates protein diversity (isoforms)
[Figure: alternative use of 5' splice sites and 3' splice sites]

34 Outline
- Information Flow in Genomics
- Gene Structure
- Genetic Linkage
- Chromatin Structure
- Genome Sequencing

35 Transmission of Genetic Information
[Figure: a diploid (2N) cell giving rise to 2N cells and to haploid (1N) cells; chromosomes shown decondensed and condensed]
Elements of cell division:
- Cell growth
- Chromosome duplication
- Chromosome segregation

36 Meiosis
- Interphase: chromosomes replicate
- Meiosis I: reductive division, homologous chromosomes separate
- Meiosis II: sister chromatids separate

37 Recombination 37

38 How Does Distance Between Loci Affect Transmission?
- Independent assortment: loci are unlinked, or far enough apart, that they are transmitted independently of one another
- Genetic linkage: loci are close enough together on a chromosome to be transmitted together

39 Genetic Mapping The frequency of recombination between loci is based on the distance between them 39

40 Recombination Is A Measure of Distance
- Recombination fraction, θ = the probability that a recombinant gamete is transmitted
- If two loci are on different chromosomes, they will segregate independently => recombination fraction = 0.5
- If two loci are right next to each other, they will segregate together during meiosis => recombination fraction = 0
- Jargon: θ < 0.5, the loci are close (they are linked); θ = 0.5, the loci are far apart (they are not linked)

41 Recombination Is A Measure of Distance
Map distance = (number of recombinant gametes / total number of gametes) x 100
Centimorgan (cM): a unit of chromosome length, equal to the length of chromosome over which crossing-over occurs with 1% frequency
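The map-distance formula translates directly into code. A minimal sketch with hypothetical gamete counts; the function name is illustrative. Note that this simple ratio tracks physical order well only over short distances, because double crossovers between distant loci go uncounted.

```python
def map_distance_cm(recombinant_gametes: int, total_gametes: int) -> float:
    """Map distance in centimorgans: percent of gametes that are recombinant."""
    if total_gametes <= 0:
        raise ValueError("total_gametes must be positive")
    if not 0 <= recombinant_gametes <= total_gametes:
        raise ValueError("recombinant_gametes must lie between 0 and total_gametes")
    return 100 * recombinant_gametes / total_gametes


# Hypothetical test cross: 15 recombinant gametes among 100 scored.
print(map_distance_cm(15, 100))
```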

42 Practice Question
In maize, consider three recessive phenotypes: lazy growth (ll), glossy leaves (gg), and sugary endosperm (ss). The following cross was made: Ll Gg Ss x ll gg ss. Neither gene order nor linkage phase is known. The observed progeny distribution was:
Phenotype               Number
wildtype                286
lazy                    33
glossy                  59
sugary                  4
lazy, glossy            2
lazy, sugary            44
glossy, sugary          40
lazy, glossy, sugary    272
Total                   740
Determine the order of, and distances among, the three genes.

43 Where to begin?
Cross: L G S / l g s (wild-type for all three traits) x l g s / l g s (lazy, glossy, sugary)
Parental types will constitute at least 50% of all progeny, so
Rule 1: The two most frequent gamete types are the parental types

44 L G S // l g s x l g s // l g s
Progeny Phenotype       Progeny Genotype     Number
wildtype                L G S // l g s       286
lazy                    l G S // l g s       33
glossy                  L g S // l g s       59
sugary                  L G s // l g s       4
lazy, glossy            l g S // l g s       2
lazy, sugary            l G s // l g s       44
glossy, sugary          L g s // l g s       40
lazy, glossy, sugary    l g s // l g s       272
Total                                        740

45 Linkage phase in heterozygous parent?
L G S / l g s, or L g S / l G s, or l g S / L G s, or L g s / l G S

46 Rule 2: The double-recombinant gametes will be the two least frequent types.
[Figure: chromosomes A B C and a b c]
(Progeny table as on slide 44; the two least frequent classes are sugary, L G s, with 4 and lazy, glossy, l g S, with 2.)

47 Rule 3: The effect of double crossovers is to interchange the members of the middle pair of alleles between the chromosomes.
A B C / a b c gives A b C and a B c

48 Parental types: L G S and l g s
Double-crossover types: L G s and l g S
Which gene is in the middle? With S in the middle, parental L S G / l s g gives double-crossover L s G and l S g.
Now you know the linkage phase of the heterozygous parent and the gene order. How far apart are these genes?

49 Count the cross-overs between adjacent genes: L S G / l s g
- In the parents, the L allele is on the same homolog as S, and l is on the same homolog as s. If these get broken up, there was a cross-over between the L and S loci.
- In the parents, S is on the same homolog as G, and s is on the same homolog as g. If these get broken up, there was a cross-over between the S and G loci.

50 Rule 4: Reciprocal products are expected to occur in approximately equal numbers
- L G S and l g s (286, 272)
- L g S and l G s (59, 44)
- L g s and l G S (40, 33)
- L G s and l g S (4, 2)
(Progeny table as on slide 44.)

51 Crossover or Non-Crossover?
Progeny Phenotype       Progeny Genotype    Number   Class
wildtype                L G S / l g s       286      Parental (NCO)
lazy                    l G S / l g s       33       single CO between L and S
glossy                  L g S / l g s       59       single CO between S and G
sugary                  L G s / l g s       4        double CO
lazy, glossy            l g S / l g s       2        double CO
lazy, sugary            l G s / l g s       44       single CO between S and G
glossy, sugary          L g s / l g s       40       single CO between L and S
lazy, glossy, sugary    l g s / l g s       272      Parental (NCO)
Total                                       740
Rec Freq L-S: l G S 33 + L g s 40 + L G s 4 + l g S 2 = 79
Rec Freq S-G: L g S 59 + l G s 44 + L G s 4 + l g S 2 = 109

52 Rec Freq L-S: l G S 33 + L g s 40 + L G s 4 + l g S 2 = 79
Rec Freq S-G: L g S 59 + l G s 44 + L G s 4 + l g S 2 = 109
- 79/740, or 10.7% of gametes, are recombinant between L and S: distance between L and S = 10.7 map units
- 109/740, or 14.7% of gametes, are recombinant between S and G: distance between S and G = 14.7 map units
Map: L --10.7 mu-- S --14.7 mu-- G
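The four rules of the worked example can be followed mechanically. The sketch below is one way to do so and not necessarily how the lecture intends it; the dictionary layout and helper names are assumptions. Gametes from the heterozygous parent are written as three letters in the arbitrary order L, G, S, with upper case for the wild-type allele, and the counts are those of the progeny table.

```python
# Gamete from the heterozygous parent -> number of progeny (from the table).
counts = {
    "LGS": 286, "lGS": 33, "LgS": 59, "LGs": 4,
    "lgS": 2, "lGs": 44, "Lgs": 40, "lgs": 272,
}
total = sum(counts.values())

# Rules 1 and 2: most frequent class is parental, least frequent is a double crossover.
ranked = sorted(counts, key=counts.get, reverse=True)
parental, double_co = ranked[0], ranked[-1]

# Rule 3: a double crossover swaps only the middle locus, so relative to the
# parental type the middle locus is the odd one out.
differs = [p != d for p, d in zip(parental, double_co)]
middle = next(i for i in range(3) if differs.count(differs[i]) == 1)
middle_gene = "LGS"[middle]


def recombinants_between(i: int, j: int) -> int:
    """Count gametes in which loci i and j no longer share the parental phase."""
    return sum(
        n for gamete, n in counts.items()
        if (gamete[i] == parental[i]) != (gamete[j] == parental[j])
    )


# Map distance between the middle gene and each flanking gene, in map units.
distances = {}
for flank in (i for i in range(3) if i != middle):
    percent = 100 * recombinants_between(flank, middle) / total
    distances[f"{'LGS'[flank]}-{middle_gene}"] = round(percent, 1)

print(total, middle_gene, distances)
```

Because double-crossover gametes are recombinant in both intervals, they are counted once in each distance, matching the tallies on the slide.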

53 Outline
- Information Flow in Genomics
- Gene Structure
- Genetic Linkage
- Chromatin Structure
- Genome Sequencing

54 Chromosome Structure: Coils of Coils of Coils
[Figure: levels of packing, from the nucleosome to the chromosome at mitosis]
Local unpacking of chromatin allows gene expression and replication

55 Nucleosomes
- ~146 bp of DNA wrapped around a histone octamer form the nucleosome
- ~80 bp linker

56 Histone Modification and Chromatin Activity
- Modifications change interaction with DNA and trans-factors
- Can activate or repress transcription
- Reinforce regulatory patterns set up by TFs

57 What Do These Modifications Do? A Histone Code?
Distinct histone modifications, on one or more tails, act sequentially or in combination to form a "histone code" that is read by other proteins to bring about distinct downstream events (Strahl and Allis, 2000, Nature, 403:41)
Carey et al. Cell (2007) 128:707

58 Outline
- Information Flow in Genomics
- Gene Structure
- Genetic Linkage
- Chromatin Structure
- Genome Sequencing

59 DNA Sequencing Technology
- Sanger sequencing
- Next-Generation
- 3rd and 4th Generation

60 Genome Sequencing: Hierarchical Shotgun Sequencing
- Shear genomic DNA into smaller pieces and subclone into a library (such as BACs, cosmids, etc.)
- Create a physical map
- Shotgun sequence each BAC from the minimal tiling path (shearing of a ~150 kb BAC clone into ~2 kb fragments)
- Data from linkage and physical maps are used to assemble sequence maps of chromosomes

61 Genome Sequencing: Whole Genome Shotgun Sequencing
- Whole genome randomly sheared three times: plasmid library with ~2 kb inserts, plasmid library with ~10 kb inserts, BAC library with ~200 kb inserts
- Computer program assembles sequences into chromosomes
- No physical map construction
- Only one BAC library
- Overcomes problems of repeat sequences (only not really)


63 Next-Generation Sequencing Technology
- Illumina HiSeq: 4 billion reads per flow cell x 100 bases, paired = 400 Gbp
- 8 samples per flow cell = 50 Gbp each (one human genome = 3 Gbp)
- Reagent cost: ~$8K per run
- Updated: HiSeq 3000/4000 SBS Kits enable up to 1500 Gb (1.5 Tb) of output per dual flow cell run
- ABI SOLiD: similar yield
- Roche 454: 1 million reads x 500 bases = 0.5 Gbp
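The yield figures on this slide are simple products, reproduced below as a sanity check. The variable names are illustrative, and the fold-coverage figure at the end is a derived quantity that the slide does not state; it assumes each sample is one 3 Gbp human genome.

```python
GBP = 1e9  # bases per gigabase pair

# Illumina HiSeq figures from the slide.
reads_per_flow_cell = 4_000_000_000
read_length_bases = 100
samples_per_flow_cell = 8
human_genome_gbp = 3

flow_cell_gbp = reads_per_flow_cell * read_length_bases / GBP
per_sample_gbp = flow_cell_gbp / samples_per_flow_cell

# Derived here, not on the slide: average fold coverage of one human genome per sample.
fold_coverage = per_sample_gbp / human_genome_gbp

# Roche 454 figures from the slide.
roche_454_gbp = 1_000_000 * 500 / GBP

print(f"HiSeq flow cell: {flow_cell_gbp:.0f} Gbp")
print(f"Per sample:      {per_sample_gbp:.0f} Gbp (~{fold_coverage:.1f}x of a human genome)")
print(f"Roche 454 run:   {roche_454_gbp} Gbp")
```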

64 Illumina sequencing Mardis, ER, 2008, ARGHG 64

65 Illumina sequencing: clusters 65

66 Illumina sequencing: sequence reaction 66

67 Illumina sequencing: sequence reaction Sequence clusters are imaged after each cycle of synthesis 67

68 What is missed?
- Plenty: repetitive DNA and structural variation
- Example: short tandem repeats
[Figure: example sequence CCCAAAGCAGGGCAGCAG]

69 3rd Generation Sequencing Technology
- Single Molecule Real Time (SMRT) sequencing technology (PacBio RS)
- Based on circular DNA molecules read by a polymerase
- Long reads, up to 10 kb
- Error-prone

70 4th Generation Sequencing Technology
- Protein nanopore sequencing (Oxford Nanopore)
- Ultra-long reads, up to 1 Mb, limited by the integrity of the DNA
- High error rate, low throughput

71 Next-Gen Sequencing - What's All the Fuss About?

72 The Era of Personal Genomics?
- James D. Watson (5/31/2007)
- J. Craig Venter (8/4/2007)
It is here. The challenge is interpretation.

73 Censoring of Watson's ApoE gene
[Figure: ApoE gene region, 3.6 kb]
Important ethical issues confront personal genomics.

74 Interpreting Genome Sequences
The ENCODE Project: a comprehensive parts list of the functional elements in the human genome
- Pilot Project Description: ENCODE Project Consortium et al. The ENCODE (ENCyclopedia Of DNA Elements) Project. Science (2004) vol. 306 (5696)
- Pilot Project Results: ENCODE Project Consortium et al. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature (2007) vol. 447 (7146)

75 Let's Play Gene or No Gene
A gene is often a segment of DNA that encodes a protein. How about DNA that encodes:
- a microRNA that binds to an mRNA to inhibit translation?
- an RNA spliced out of an intron and used for another function?
- an antisense transcript?
- a long non-coding RNA of unknown function?
- a pseudogene?