Yale Pseudogene Analysis as part of GENCODE Project



Mark B Gerstein, Yale
Sanger Center, 2009.01.20, 11:20-11:40
Illustration from Gerstein & Zheng (2006), Sci Am.
(c) Mark Gerstein, 2002, Yale, bioinfo.mbb.yale.edu
Lectures.GersteinLab.org (c) 2007

Overview
Outline
- Flow
- Additional Pseudogene Annotation
- Pseudogene Sets
Regular conference calls
- Sanger / HAVANA: Jenn, Adam
- UCSC: Rachel, Mark
- Discuss difficult pseudogene cases, pipeline improvements and additional pseudogene annotation
Questions for you
- Consistent labels?
- Pfam/Ensembl usage?
- Transcribed set?
- Consensus set?

Overall Flow: Pipeline Runs, Coherent Sets, Annotation, Transfer to Sanger
Overall Approach
1. Overall pipeline runs at Yale and UCSC, yielding raw pseudogenes
2. Extraction of coherent subsets for further analysis and annotation
3. Passing to Sanger for detailed manual analysis and curation
4. Incorporation into final GENCODE annotation
5. Pipeline modification
Chronology of Sets
1. Unitary pseudogenes
2. Ribosomal Protein pseudogenes (hard then easy)
3. Transcribed pseudogenes annotated earlier
4. Strong overlap consensus pseudogenes
5. Pseudogenes associated with SDs
6. GAPDH pseudogenes

Additional Annotation on Pseudogenes: Consistent Labeling (Ontology)

[Figure: pseudogene ontology diagram with an Upper Ontology and a Domain Ontology, from Lam et al., NAR DB Issue (in press, '09). Upper Ontology nodes: Sequence Region, Syntenic Region, Segmental Duplication, Gene, Pseudogene, mRNA, Protein, Protein family, Pseudogene family. Relations shown: is_a, contained_in, derives_from, mutated_from, duplicated_from, retrotranscribed_from, transcribed_into, translated_into, has_parent_protein, has_pseudogene_family, type_of. Domain Ontology pseudogene types: Unitary, Processed, Duplicated, Transcribed, Regulatory, etc.]

Domain Ontology
[Figure: domain ontology diagram for Pseudogene, from Lam et al., NAR DB Issue (in press, '09). Relations shown: has a, is a. Node labels: Classified Type; Proposed HAVANA; Unitary; Processed; Duplicated; *Unprocessed; Semi Processed; Duplicated Processed; Feature; Origin; Polymorphic; Transcribed; Regulatory; Evidence; Mitochondria; Nucleus; Intra-Genome Homology; Sequence Homology; Cross-Genome Homology; Regulatory Element Lost; Premature Stop Codon; Disablement; Frameshift; Recognition Feature; Pseudo-PolyA Tail.]

Collections to provide further annotation
[Figure: collections within Human Pseudogenes: HAVANA, Yale, UCSC, Ribosomal Protein, Transcribed]

Additional Annotation on Pseudogenes: Families & Histories

tables.pseudogene.org [Karro et al., NAR ('07)]
Access
- SVN
- Flat files
- Table browser
- Single pgene query
- DAS
- UCSC Genome Browser
Contents
- 12 eukaryotic species: 100,052 pseudogenes
  Human, mouse, rat, chimp
- 64 prokaryotic species: 6,412 pseudogenes
- 28,237 human pseudogenes
- 13+ unique human sets

Pseudofam Database
Data Sources
- Ensembl, Pfam, BioMart
- PseudoPipe
Highlighted Features
- Browse families: statistics, enrichment, P-value, etc.
- Search families: Pfam ID/Acc, Ensembl ID, Pseudogene ID
- Correlate families: genes, pseudogenes, parents
[Lam et al., NAR DB Issue (in press, '09)]

Pseudofam Construction
Data Generation
1. Identify pseudogenes by proteins
2. Map parent proteins to protein families
3. Assign pseudogenes to their parent families
4. Align the pseudogenes in pseudogene families
5. Calculate the key statistics and organize the data into a database
Alignment
1. Align pseudogene to parent
2. Transfer alignment from Pfam
3. Combine and adjust the alignments to build the Pseudofam alignment
[Lam et al., NAR DB Issue (in press, '09)]
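
The data-generation steps on this slide amount to a join: each pseudogene has a parent protein, each parent protein maps to a protein family, and pseudogenes are grouped under their parent's family. The following is a minimal sketch under that reading, using plain dictionaries and made-up identifiers. The names `assign_pseudogenes_to_families`, `parent_of` and `family_of` are hypothetical and are not part of Pseudofam or PseudoPipe.

```python
from collections import defaultdict


def assign_pseudogenes_to_families(parent_of, family_of):
    """Group pseudogene IDs by the protein family of their parent protein.

    parent_of: dict mapping pseudogene ID -> parent protein ID
    family_of: dict mapping protein ID -> protein family ID
    Pseudogenes whose parent has no known family are grouped under None.
    """
    families = defaultdict(list)
    # Sort for a deterministic member order within each family.
    for pgene_id, protein_id in sorted(parent_of.items()):
        families[family_of.get(protein_id)].append(pgene_id)
    return dict(families)


# Toy identifiers for illustration only (not real database entries).
parent_of = {"pg1": "protA", "pg2": "protA", "pg3": "protB", "pg4": "protC"}
family_of = {"protA": "fam1", "protB": "fam1", "protC": "fam2"}

pseudofamilies = assign_pseudogenes_to_families(parent_of, family_of)
print(pseudofamilies)
```

Per-family counts of pseudogenes and parent genes, which the "key statistics" step would need, can then be read directly from the grouped result.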

Pseudofam Statistics: Enrichment of pseudogenes within a family ("Living vs Dead")
[Figure: two panels, Total (10 Eukaryotes) and Human]
[Lam et al., NAR DB Issue (in press, '09)]

Pseudogene Set #2: RP pseudogenes

Human Ribosomal Proteins (RP)
79 functional RP genes
Nakao A, Yoshihama M, Kenmochi N: RPG: the Ribosomal Protein Gene database. Nucleic Acids Res 2004, 32:D168-170.
[Zhang et al., Genome Research, 2002]

Human Ribosomal Proteins (RP)
79 functional RP genes
~2000 RP Ψgenes
[Zhang et al., Genome Research, 2002]

Number of RP pseudogenes (identified by pipeline)

Organism   Processed   Fragments   Low confidence
Human      1822        218         212
Chimp      1462        219         160
Mouse      2092        326         413
Rat        2848        343         450

RP pseudogenes constitute the largest family of pseudogenes.
Almost all are processed: there are ~90 clearly duplicated ones in the human genome.
[Balasubramanian et al., Genome Biol. ('09)]

The number of processed pseudogenes for each human ribosomal protein appears unrelated to its expression level or to the corresponding number in mouse
[Balasubramanian et al., Genome Biol. ('09)]

Using Synteny to Improve Annotation of RP pgenes
Synteny is derived from local gene orthology
[Balasubramanian et al., Genome Biol. ('09)]
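
The slide does not spell out the exact criterion, so the following is only one plausible reading of "local gene orthology": two pseudogenes in different species are called syntenic when their parent genes are orthologous and the genes flanking them are orthologous as well. The functions `flanks_match` and `is_syntenic`, the record layout and all gene identifiers are hypothetical.

```python
def flanks_match(pg_a, pg_b, orthologs):
    """True if the genes flanking pg_a are orthologous to those flanking pg_b.

    Both orientations are accepted so that a local inversion does not
    hide an otherwise matching neighbourhood.
    """
    same = ((pg_a["left"], pg_b["left"]) in orthologs
            and (pg_a["right"], pg_b["right"]) in orthologs)
    flipped = ((pg_a["left"], pg_b["right"]) in orthologs
               and (pg_a["right"], pg_b["left"]) in orthologs)
    return same or flipped


def is_syntenic(pg_a, pg_b, orthologs):
    """Require orthologous parent genes and an orthologous local neighbourhood."""
    parents_match = (pg_a["parent"], pg_b["parent"]) in orthologs
    return parents_match and flanks_match(pg_a, pg_b, orthologs)


# Made-up ortholog pairs (species 1 gene, species 2 gene) for illustration.
orthologs = {("hRP", "mRp"), ("hGENE1", "mGene1"), ("hGENE2", "mGene2")}

human_pg = {"parent": "hRP", "left": "hGENE1", "right": "hGENE2"}
mouse_pg = {"parent": "mRp", "left": "mGene2", "right": "mGene1"}     # inverted flanks
mouse_other = {"parent": "mRp", "left": "mGene1", "right": "mGene9"}  # one flank differs

print(is_syntenic(human_pg, mouse_pg, orthologs))
print(is_syntenic(human_pg, mouse_other, orthologs))
```

A real pipeline would likely look at more than one neighbour on each side and tolerate missing annotations; this sketch keeps a single flanking gene per side for clarity.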

Syntenic processed pseudogenes

Species 1 - Species 2   Number of syntenic pgenes
Human-chimp             1282
Human-mouse             6
Human-rat               11
Rat-mouse               394

[Balasubramanian et al., Genome Biol. ('09)]

A human-mouse syntenic pgene, likely to be coding
[Balasubramanian et al., Genome Biol. ('09)]
Nakao A, Yoshihama M, Kenmochi N: RPG: the Ribosomal Protein Gene database. Nucleic Acids Res 2004, 32:D168-170.

Set #3: Transcribed Pseudogenes Annotated Earlier

Harrison '05 set of Transcribed Pseudogenes
233 transcribed from ~8000 processed pseudogenes
Evidence for transcription
- 8% RefSeq mRNAs
- 32% UniGene consensus sequences
- 72% dbEST expressed sequence tags
- 32% oligonucleotide microarray data (extra support)
Highly decayed
- Fraction with Ka/Ks 0.5 is 54%
Harrison et al. (2005) NAR

Set #4: Strong Consensus Pseudogenes

Consensus Pseudogenes
- Yale & UCSC agreement
- Yale, UCSC & HAVANA agreement
- HAVANA disagreement

Overlap
- Initially, 50 nucleotide overlap
- Why so low?
- What about percentage overlap?
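
To make the contrast between a fixed 50-nucleotide threshold and a percentage threshold concrete, here is a minimal sketch of both measures for two pseudogene calls. It assumes half-open `(chrom, start, end)` intervals and takes the percentage relative to the shorter call; the slide does not say which denominator was considered, and the function names and coordinates are hypothetical.

```python
def overlap_nt(a, b):
    """Number of overlapping nucleotides between two (chrom, start, end) intervals.

    Coordinates are treated as half-open: start is included, end is excluded.
    """
    if a[0] != b[0]:
        return 0
    return max(0, min(a[2], b[2]) - max(a[1], b[1]))


def overlap_fraction(a, b):
    """Overlap as a fraction of the shorter of the two intervals."""
    shorter = min(a[2] - a[1], b[2] - b[1])
    return overlap_nt(a, b) / shorter if shorter > 0 else 0.0


# Made-up coordinates: two 1 kb or longer calls that only touch at their ends.
yale_call = ("chr1", 1000, 2000)
ucsc_call = ("chr1", 1950, 3000)

print(overlap_nt(yale_call, ucsc_call))        # passes a 50-nucleotide cutoff
print(overlap_fraction(yale_call, ucsc_call))  # but covers only 5% of the shorter call
```

In this example the pair meets the 50-nucleotide criterion while sharing only a twentieth of the shorter call, which is the kind of case a percentage-based cutoff would reject.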

High Quality Overlaps?

gencode.gersteinlab.org/consensus
- Django-based
- Filter by genomic coordinates, source
- Yale & UCSC consensus track
- Flagged pseudogenes
- Verification status

Set #5: SD pseudogenes

Segmental Duplications (SDs)
- SDs are regions of the genome with ≥90% sequence identity and ≥1 kb in length
- Pseudogenes located in SDs give clues about pseudogene formation, for example the annotation of duplicated-processed pseudogenes
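
The SD definition on this slide can be applied as a simple filter. The sketch below assumes the two thresholds are lower bounds (at least 90% identity, at least 1 kb), takes alignment records as plain dictionaries, and then reports pseudogenes that lie entirely inside a retained SD. The record layout, function names and coordinates are hypothetical.

```python
def is_segmental_duplication(aln, min_identity=0.90, min_length_bp=1000):
    """Apply the identity and length thresholds to one alignment record."""
    length_bp = aln["end"] - aln["start"]
    return aln["identity"] >= min_identity and length_bp >= min_length_bp


def pgenes_inside(pgenes, sds):
    """Names of pseudogenes fully contained in at least one SD interval."""
    return sorted(
        name
        for name, (chrom, start, end) in pgenes.items()
        if any(chrom == sd["chrom"] and start >= sd["start"] and end <= sd["end"]
               for sd in sds)
    )


# Made-up alignments and pseudogene coordinates for illustration.
alignments = [
    {"chrom": "chr1", "start": 10000, "end": 12500, "identity": 0.95},  # kept
    {"chrom": "chr1", "start": 50000, "end": 50800, "identity": 0.99},  # too short
    {"chrom": "chr2", "start": 2000, "end": 9000, "identity": 0.85},    # identity too low
]
pgenes = {
    "pgA": ("chr1", 10500, 11200),
    "pgB": ("chr1", 50100, 50700),
    "pgC": ("chr2", 3000, 3500),
}

sds = [aln for aln in alignments if is_segmental_duplication(aln)]
sd_pgenes = pgenes_inside(pgenes, sds)
print(len(sds), sd_pgenes)
```

Requiring full containment is a design choice for the sketch; a partial-overlap rule would be an equally reasonable alternative.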

Pseudogene families and Segmental Duplications (SDs)
- SDs comprise ~5-6% of the human genome but contain ~17.8% of genes, 45.7% of duplicated pgenes and 21.6% of processed pgenes
- Relative values of the correlation coefficients in the plots are consistent with the observation that SDs contain more pgenes than parent genes
[Figure: correlation plots, not reproduced here]
[Lam et al., NAR DB Issue (in press, '09)]
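
The percentages on this slide can be turned into rough fold-enrichment values by dividing each share by the fraction of the genome that SDs occupy. The sketch below assumes 5.5% (the midpoint of the quoted ~5-6%) and a uniform expectation by sequence length; both are simplifying assumptions, and the variable names are hypothetical.

```python
# Share of each feature class found inside SDs, in percent (from the slide).
share_in_sds = {
    "genes": 17.8,
    "duplicated pgenes": 45.7,
    "processed pgenes": 21.6,
}

# Assumed SD share of the genome: midpoint of the quoted ~5-6%.
sd_genome_share = 5.5

# Fold enrichment relative to a uniform spread along the genome.
fold_enrichment = {
    feature: share / sd_genome_share for feature, share in share_in_sds.items()
}

for feature, fold in fold_enrichment.items():
    print(f"{feature}: {fold:.1f}x")
```

Under these assumptions duplicated pseudogenes are roughly eight times more frequent in SDs than a uniform distribution would give, compared with about three to four times for genes and processed pseudogenes.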

Summary
- Sets: RP, transcribed '05, consensus, SD...
- Additional annotation: families & labels

Credits
Yale: S Balasubramanian, P Cayting, H Lam, Y Liu, G Fang, N Carriero, R Robilotto, E Khurana
Alums: D Zheng, P Harrison, Z Zhang
Sanger: A Frankish, J Harrow
UCSC: R Harte, M Diekhans

Extra

Collections
- Combination of multiple sets: Human50, UCSC, GAPDH, Transcribed, Unitary, etc.
- Also can have attributes
- View by set or as a whole

Pseudofam & Pseudogene.org Integration
PseudoPipe svn annotations via shared ID

gencode.gersteinlab.org/consensus