Gene Models & Bed format: What they represent.




Gene models are hypotheses about the structure of transcripts produced by a gene. Like all models, they may be correct, partly correct, or entirely wrong. Typically, we use information from EST (expressed sequence tag) projects to evaluate or create gene models. It's important to remember at all times that a gene model is only that: a model.

In this class, you will spend a lot of time working with gene models and with file formats used to represent them, mostly the bed (Browser Extensible Data) format, which was developed by the UCSC Genome Bioinformatics group for displaying gene models in their on-line browser.

To understand what a gene model represents, you need to refresh your memory about how transcription, RNA splicing, and polyadenylation operate.

Most protein-coding genes in eukaryotic organisms (like humans, the research plant Arabidopsis thaliana, fruit flies, etc.) are transcribed into RNA by an enzyme complex called RNA polymerase II, which binds to the five prime end of a gene in its so-called promoter region. The promoter region typically contains binding sites for transcription factors that help the RNA polymerase complex recognize the position in the genomic DNA where it should begin transcription. Many genes have multiple places in the genomic DNA where transcription can begin, and so transcripts arising from the same gene may have different five prime ends. Transcripts arising from the same gene that have different transcription start sites are said to come from alternative promoters.

Once the RNA polymerase complex binds to the five prime end of a gene, it can begin building an RNA copy of the DNA sense strand via the process known as transcription. The ultimate product of transcription is thus called a transcript. During and after transcription, another large complex of proteins and non-coding RNAs called the spliceosome attaches to the growing RNA molecule, cuts out segments of RNA called introns, and joins together (splices) the flanking sequences, which are called exons.
Not every newly synthesized transcript is processed in this way; sometimes no introns are removed at all. Genes whose products do not undergo splicing are often called single-exon genes. Also, splicing may remove different segments from transcripts arising from the same gene. This variability in splicing patterns is called alternative splicing.

In addition to splicing, RNA transcripts undergo another processing reaction called polyadenylation. In polyadenylation, a segment of sequence at the 3 prime end of the RNA transcript is cut off, and a polymer consisting of adenosine residues called a polyA tail is attached to the 3 prime end of the transcript. The length of the polyA tail may vary a lot from transcript to transcript, and the position where it is added may also differ. Genes whose transcripts can receive a polyA tail at more than one location are said to be subject to alternative polyadenylation or alternative 3 prime end processing.

These processing reactions are believed to take place in the nucleus. Ultimately, the mature or maturing RNA transcript is exported from the nucleus into the cytoplasm, where it will be translated by ribosomes into proteins, chains of amino acids that perform work in the cell (such as enzymes) or that provide form and structure (like actin in the cytoskeleton).

The continuous sequence of bases in an RNA that encodes a protein is called a coding region, and the coding region typically starts with an AUG codon and terminates with one of three possible stop codons. The segments of sequence that comprise a coding region are called CDSs, and they generally occupy the same sequences as the exons, apart from the regions five and three prime of the start and stop codons, respectively.
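The idea of a coding region that starts at an AUG and ends at a stop codon can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function name find_coding_region and the toy sequence are invented here, and the code simply takes the first AUG it finds, which is not how a real gene finder (or a ribosome) chooses a start site.

```python
# The three stop codons, written in RNA letters.
STOP_CODONS = {"UAA", "UAG", "UGA"}


def find_coding_region(mrna):
    """Return (start, end) offsets of the first AUG-initiated reading frame.

    Offsets are 0-based with an exclusive end, so mrna[start:end] runs from
    the AUG through the stop codon. Returns None if there is no AUG or no
    in-frame stop codon.
    """
    start = mrna.find("AUG")
    if start == -1:
        return None
    # Step through the sequence one codon (three bases) at a time.
    for i in range(start, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            return start, i + 3
    return None


# Toy mRNA (hypothetical): 5' UTR "GGC", then AUG GCU UAU UGA, then 3' UTR "CCA".
toy = "GGCAUGGCUUAUUGACCA"
region = find_coding_region(toy)
print(region, toy[region[0]:region[1]])
```

The bases before the returned start and after the returned end correspond to the untranslated regions five and three prime of the start and stop codons.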

Most RNAs code for one protein sequence, but there are some interesting exceptions in which one mature mRNA may contain more than one translated open reading frame. The three bases where the ribosome initiates translation are called a start codon, and the triplet of bases immediately following the last translated codon is called the stop codon. The start codon encodes the amino acid methionine, typically, and the stop codon doesn't code for any amino acid.

A gene model thus consists of a collection of introns and exons and their locations in the genomic sequence, as well as the location of the translated region or regions. Thus, a gene model implies a theory about where the RNA polymerase started transcription, as well as the location of the polyadenylation site and the starts and stops of translation.

Usually, we draw gene models as showing the location of introns and exons relative to the genomic sequence, as if we are mapping the RNA copy back onto the genomic DNA itself.

Consider the following gene model diagram, which comes from a genome visualization program called the Integrated Genome Browser. (To run it yourself, visit igb.bioviz.org.)

[Figure: gene model diagram from the Integrated Genome Browser]

This diagram can also be represented in simple text using the bed format, a format developed at UCSC for displaying gene models in the UCSC Genome Browser. Scientists who want to use the UCSC Genome Browser to display gene models or other annotations can create bed format files and upload these to the genome browser site, which will then display them alongside the other annotations that are part of the UCSC database system. This is why fields in bed format files refer to things like scores and color schemes, which the UCSC Genome Browser can display.

The data file you'll work with for your second assignment on using Unix tools in bioinformatics uses the bed format to represent gene models from Arabidopsis thaliana, including the gene model shown above. To find the line of text representing the gene model shown above, after downloading the file for Assignment Two, use the Unix command grep like so:

    $ zcat TAIR9_mRNA.bed.gz | grep AT1G04250.1

The result is the following line of text:

    chr1 1136257 1138613 AT1G04250.1 0 + 1136381 1138340 0 5 363,212,139,62,311, 0,676,1202,1482,2045,

This line of text contains several fields of data that are separated by tab characters. The first three fields give the genomic location for the gene model: the name of the sequence where it resides, and its start (chromStart) and end (chromEnd) positions on that chromosome. The next field gives the gene model's name. The next field indicates a score associated with the gene model; it can be any value between 0 and 1000, inclusive. In this case, the gene model has no score and so the value in this field is 0.

The next field defines the gene model's strand. In this case, the gene model is located on the plus (also called top) strand of the DNA sequence named chr1. The strand of the gene model indicates which strand of chr1 is the sense strand for the gene. The sense strand is what ultimately is copied into mRNA during transcription. The sense strand of the gene is identical in sequence to the processed mRNA transcript, except that all T residues are replaced with U's in mRNA and the mRNA lacks introns.

Fields seven and eight (thickStart and thickEnd) give the start and end of the translated region, here 1136381 and 1138340. The next field (field 9) should contain an RGB value (e.g., 255,0,0) or 0. When non-zero, it indicates a color a browser should use to represent the model when presented visually. No color is specified in the example above, and so here the ninth field contains the value 0.

The next field indicates the number of exon blocks (5 in this case), and the two fields after that contain comma-separated lists of exon sizes and start positions, respectively. Note that the exon start positions are relative to the position given in the second field. To get the start position of an exon, you would have to add the corresponding number in the last field to the number in the second field.

In most situations, you can safely ignore the score and color fields (fields five and nine). Also, the tenth field is redundant; if you know the start positions and sizes of all the exons, you can also count them.

To review, examine the following screen capture from the page at: http://genome.ucsc.edu/faq/faqformat#format1

[Figure: screen capture of the bed field descriptions from the UCSC FAQ page]
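The field layout described above can be made concrete with a short Python sketch that splits the example line into its twelve fields and adds each relative exon start to chromStart. The names GeneModel, parse_bed12, and exon_coordinates are made up for this illustration and are not part of any UCSC tool; the attribute names loosely follow the UCSC field names.

```python
from typing import List, NamedTuple


class GeneModel(NamedTuple):
    """One 12-field bed line (hypothetical helper type for this example)."""
    chrom: str
    chrom_start: int
    chrom_end: int
    name: str
    score: int
    strand: str
    thick_start: int
    thick_end: int
    item_rgb: str
    block_count: int
    block_sizes: List[int]
    block_starts: List[int]


def _int_list(field):
    # "363,212," -> [363, 212]; the trailing comma yields an empty item we skip.
    return [int(x) for x in field.split(",") if x]


def parse_bed12(line):
    """Split a tab-separated 12-field bed line into a GeneModel."""
    f = line.rstrip("\n").split("\t")
    if len(f) != 12:
        raise ValueError("expected 12 tab-separated fields, got %d" % len(f))
    return GeneModel(f[0], int(f[1]), int(f[2]), f[3], int(f[4]), f[5],
                     int(f[6]), int(f[7]), f[8], int(f[9]),
                     _int_list(f[10]), _int_list(f[11]))


def exon_coordinates(model):
    """Return (start, end) genomic positions for each exon block."""
    return [(model.chrom_start + rel, model.chrom_start + rel + size)
            for rel, size in zip(model.block_starts, model.block_sizes)]


# The example line for AT1G04250.1, rebuilt with explicit tab separators.
line = "\t".join(["chr1", "1136257", "1138613", "AT1G04250.1", "0", "+",
                  "1136381", "1138340", "0", "5",
                  "363,212,139,62,311,", "0,676,1202,1482,2045,"])
model = parse_bed12(line)
exons = exon_coordinates(model)
for start, end in exons:
    print(model.chrom, start, end)
```

Note that the last exon ends exactly at chromEnd (1138302 + 311 = 1138613), and that len(exons) gives the same number as the block count field, which is why that field carries no extra information.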

The bed format indicates positions using interbase notation, which is different from how programs like BLAST and many other sequence analysis programs indicate locations within a sequence. Interbase is a way of numbering base positions, as illustrated in the figure below. Note that in interbase, we count the boundaries between bases, not the bases themselves.
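A small Python sketch may help show how interbase positions relate to the base-counting positions many other programs report. It assumes the common convention that those programs number bases starting from 1 and include both endpoints; the function names are invented for this example.

```python
def interbase_to_one_based(start, end):
    """Convert an interbase interval to 1-based, inclusive base numbers.

    Interbase boundary 0 lies before the first base, so the interval
    (start, end) covers bases start+1 through end.
    """
    if end <= start:
        raise ValueError("interval must cover at least one base")
    return start + 1, end


def one_based_to_interbase(first, last):
    """Convert 1-based, inclusive base numbers back to interbase."""
    return first - 1, last


def feature_length(start, end):
    """In interbase, length is simply end minus start."""
    return end - start


# First exon of the example gene model: chromStart 1136257, size 363,
# so it spans interbase positions 1136257 to 1136620.
print(interbase_to_one_based(1136257, 1136620))
print(feature_length(1136257, 1136620))
```

One practical consequence is that interbase lengths need no "+1" correction, whereas with 1-based inclusive numbering the length is last minus first plus one.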