Improving MAKER Gene Annotations in Grasses through the Use of GC Specific Hidden Markov Models

1 Improving MAKER Gene Annotations in Grasses through the Use of GC Specific Hidden Markov Models
Megan Bowman, Childs Lab
Bioinformatics Seminar, 22 April 2015

2 Outline
- GC content in plant genomes
- Codon usage
- Bioinformatics strategy
- MAKER
- Creation of GC specific HMM training datasets
- Results
- Ongoing work

3 GC content of plant genomes
- GC content varies among species, and phylogeny impacts the variation
- Recombination, expression level and replication may correlate with GC content
- Positive selection should result in high levels of codon bias and low rates of synonymous substitution
- Bimodal GC distribution in grasses, with a 5' to 3' GC gradient
- Plants are much more complicated: monocots have higher GC content than dicots, and they show variation in codon usage
Serres-Giardi et al., Plant Cell 2012

4 Why look at codon usage?
Codon usage is linked to:
- Transcriptional selection
- Translation efficiency
- Gene expression
- GC content
- AA conservation
- Protein hydropathicity
- RNA stability
- Adaptation to habitat
Jane Pulman

5 CodonW
- Written by John Peden as part of his thesis
- Command line and GUI menu based
- Several issues
- Creates codon tables, determines optimal codons, and reports COA analysis, CAI, Nc, CBI, Fop, GRAVY, GC and GC3
Jane Pulman

6 Codon table Jane Pulman

7 GC3 Wobble position in codons is a marker for GC richness
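GC3 can be illustrated with a short calculation. The sketch below is not CodonW's implementation; the function names `gc_fraction` and `gc3_fraction` are hypothetical, and it assumes the input is an in-frame coding sequence starting at the first codon position.

```python
def gc_fraction(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T) in seq."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def gc3_fraction(cds):
    """GC fraction at third (wobble) codon positions of an in-frame CDS."""
    third_positions = cds.upper()[2::3]  # bases 3, 6, 9, ... of the CDS
    return gc_fraction(third_positions)


if __name__ == "__main__":
    cds = "ATGGCCAAA"  # codons ATG, GCC, AAA
    print("GC :", gc_fraction(cds))
    print("GC3:", gc3_fraction(cds))
```

Comparing GC3 with overall GC for each gene model is one simple way to see whether GC richness is concentrated at the wobble position.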

8 How does GC impact genome annotation?
- Very likely that we miss gene models
- Prediction is based on ab initio programs, which are HMMs
- Ab initio programs require training data sets for accurate gene prediction
- Nucleotide content of the training set influences prediction
- We may be missing gene models with varying GC content that may have important functions in plants

9 Rapid divergence of codon usage patterns
Wang and Hickey, BMC Evolutionary Biology 2007, 7(Suppl 1):S6
Jane Pulman

10 Hidden Markov Models (HMMs)
"HMMs are the Legos of computational sequence analysis"
Sean Eddy, Nature Biotechnology 2004
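To make the link between HMMs and nucleotide content concrete, here is a toy two-state HMM decoded with the Viterbi algorithm. It is only a sketch: the states, the transition and emission probabilities, and the function name `viterbi` are all invented for illustration, and real gene finders such as SNAP and AUGUSTUS use far richer models with parameters learned from training genes.

```python
import math

# Hypothetical toy model: two hidden states that differ only in base composition.
STATES = ("low_gc", "high_gc")
START = {"low_gc": 0.5, "high_gc": 0.5}
TRANS = {
    "low_gc": {"low_gc": 0.9, "high_gc": 0.1},
    "high_gc": {"low_gc": 0.1, "high_gc": 0.9},
}
EMIT = {
    "low_gc": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
    "high_gc": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
}


def viterbi(seq):
    """Return the most probable state path for seq (A, C, G, T only)."""
    seq = seq.upper()
    if not seq:
        return []
    # Log probabilities avoid numerical underflow on long sequences.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[best_prev] + math.log(TRANS[best_prev][s])
                            + math.log(EMIT[s][base]))
            pointers[s] = best_prev
        backpointers.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    print(viterbi("ATATATATATGCGCGCGCGC"))
```

Because the emission probabilities encode base composition, a model whose parameters come from genes of typical GC content will score sequences with very different GC content poorly.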

11 AUGUSTUS HMM

12 Bioinformatics strategy
- Reannotation of the rice genome (Oryza sativa L. ssp. japonica cv. Nipponbare), MSU v7 assembly, using MAKER
- Calculate the distribution of GC content across all evidence-supported predicted gene models
- Develop code to create new HMM training sets with extreme high and low GC content
- Retrain HMMs for ab initio prediction programs with the high and low GC sets
- Run MAKER annotations using the high and low GC HMMs
- Identify newly predicted gene models and improve existing gene models
- Combine the original, high GC and low GC annotations into a new annotation

13 MAKER
What are the advantages of using MAKER for structural genome annotation?
- Masks repeats
- Aligns ESTs/RNA-seq assemblies and proteins to the genome
- Guides ab initio gene predictors (e.g. SNAP and AUGUSTUS)
- Combines these into a final annotation
- Provides quality values for predicted gene models (AED score)

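14 Annotation Edit Distance (AED)
[Figure: gene models compared against collapsed evidence, with sensitivity (SN), specificity (SP) and AED shown for four cases: perfect accuracy, poor specificity, poor sensitivity, and poor specificity and sensitivity]
Kevin Childs
Eilbeck et al BMC Bioinformatics :67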
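AED combines sensitivity and specificity as AED = 1 - (SN + SP) / 2, where SN is the fraction of evidence bases covered by the gene model and SP is the fraction of gene model bases covered by evidence. The sketch below computes this at the nucleotide level from interval lists. It is not MAKER's code; the function names are hypothetical and the coordinates are assumed to be 1-based and inclusive, as in GFF3.

```python
def covered_positions(intervals):
    """Set of base positions covered by 1-based inclusive (start, end) intervals."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end + 1))
    return positions


def annotation_edit_distance(prediction, evidence):
    """Nucleotide-level AED between a predicted model and its collapsed evidence."""
    pred = covered_positions(prediction)
    evid = covered_positions(evidence)
    if not pred or not evid:
        return 1.0  # convention used here: no support at all scores AED = 1
    shared = len(pred & evid)
    sensitivity = shared / len(evid)   # SN: evidence bases recovered by the model
    specificity = shared / len(pred)   # SP: model bases supported by evidence
    return 1.0 - (sensitivity + specificity) / 2.0


if __name__ == "__main__":
    # Model exons vs. evidence that supports only half of the model.
    print(annotation_edit_distance([(1, 100)], [(1, 50)]))
```

An AED of 0 means the model and the evidence agree exactly; an AED of 1 means there is no overlap with evidence.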

15 MAKER Default, MAKER Standard and MAKER Max
[Figure: the MAKER Default, MAKER Standard and MAKER Max gene sets, defined by whether a gene model has evidence support (+/- Evidence) and a Pfam domain (+/- Pfam)]
Michael Campbell

16 Training HMMs and GC content
- Does GC content impact the training of the SNAP and AUGUSTUS HMMs?
- Would we predict new gene models if the average GC content of the training set used for HMM training was either higher or lower than normal?
- Can we add additional evidence to gene models using these new HMMs, or identify new gene models not previously found?

17 MAKER Annotation
- Reannotation of the MSU Rice v7 genome assembly
- Transcript evidence
  - Fifty SRA datasets
  - Read cleaning and QC with FastQC and Cutadapt
  - Transcript assembly with Trinity
- Trained SNAP and AUGUSTUS HMMs
- MAKER with default parameters
- iprscan and hmmscan against Pfam for MAKER Standard

18 Improving MAKER models based on GC distribution
[Figure: GC distribution of MAKER Standard models; x-axis: GC content, y-axis: gene frequency]

19 Creation of GC Specific HMM Training Datasets
Perl scripts:
- Input: MAKER Standard GFF3 and genome FASTA file
- Determine the GC percentage of each gene and bin GC to integers, over a user-defined window on each side
- Create FASTA files and GFF3s of high and low GC MAKER Standard genes for HMM training
- The GFF3 is the input for the SNAP ab initio gene prediction program (via maker2zff)
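The GC binning step can be sketched in a few lines. This is a Python illustration rather than the Perl scripts from the talk: the function names `gc_percent_bin` and `bin_genes` are hypothetical, gene coordinates are assumed to be 1-based and inclusive as in GFF3, and binning is assumed to mean truncating the percentage to an integer.

```python
from collections import Counter


def gc_percent_bin(chrom_seq, start, end, window=0):
    """Integer GC percentage of a gene plus `window` flanking bases on each side.

    start and end are 1-based inclusive coordinates; the region is clamped
    to the ends of the chromosome. Returns None if no A/C/G/T bases are found.
    """
    lo = max(0, start - 1 - window)
    hi = min(len(chrom_seq), end + window)
    region = chrom_seq[lo:hi].upper()
    acgt = sum(region.count(base) for base in "ACGT")
    if acgt == 0:
        return None
    gc = region.count("G") + region.count("C")
    return 100 * gc // acgt  # assumption: truncate to an integer bin


def bin_genes(chrom_seq, genes, window=0):
    """Histogram of integer GC bins for a list of (start, end) gene coordinates."""
    bins = Counter()
    for start, end in genes:
        gc_bin = gc_percent_bin(chrom_seq, start, end, window)
        if gc_bin is not None:
            bins[gc_bin] += 1
    return bins


if __name__ == "__main__":
    chrom = "AT" * 10 + "GC" * 10 + "AT" * 10
    print(gc_percent_bin(chrom, 21, 40))             # gene only
    print(gc_percent_bin(chrom, 21, 40, window=10))  # gene plus 10 bp flanks
```

In a real pipeline the gene coordinates would be read from the MAKER Standard GFF3 and the sequences from the genome FASTA file.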

20 Workflow
1. Align transcripts to genome with MAKER
2. Split the aligned transcripts into three tracks:
   - Lower GC: calculate lower threshold GC content of aligned transcripts
   - Regular GC: use transcripts with regular GC distribution
   - Upper GC: calculate upper threshold GC content of aligned transcripts
3. Train SNAP/run MAKER on each track (lower GC content, regular, upper GC content)
4. Use the SNAP gene predictions from each track (lower GC content, regular GC distribution, upper GC content)
5. Train Augustus with the lower, regular and upper GC gene predictions
6. MAKER annotation with the lower, regular and upper GC HMMs
7. Rerun MAKER annotation with regular GC HMMs and predictions from lower and upper GC HMMs; MAKER retains the best annotations
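The three-way split at the top of the workflow can be sketched as follows. The talk does not state how the lower and upper thresholds were chosen, so the percentile cut-offs used here (10th and 90th) are purely an assumption, as is the function name `split_by_gc`.

```python
def split_by_gc(gc_by_gene, lower_pct=10, upper_pct=90):
    """Partition genes into low, regular and high GC sets.

    gc_by_gene maps a gene ID to its GC percentage. The thresholds are taken
    at the given percentiles of the observed distribution (hypothetical choice).
    """
    values = sorted(gc_by_gene.values())
    if not values:
        raise ValueError("no genes supplied")
    lower = values[(len(values) - 1) * lower_pct // 100]
    upper = values[(len(values) - 1) * upper_pct // 100]
    sets = {"low": [], "regular": [], "high": []}
    for gene, gc in gc_by_gene.items():
        if gc <= lower:
            sets["low"].append(gene)
        elif gc >= upper:
            sets["high"].append(gene)
        else:
            sets["regular"].append(gene)
    return lower, upper, sets


if __name__ == "__main__":
    demo = {"geneA": 38, "geneB": 45, "geneC": 52, "geneD": 61, "geneE": 72}
    lower, upper, sets = split_by_gc(demo)
    print(lower, upper, sets)
```

Each of the three gene lists would then feed its own SNAP and Augustus training run, as in the workflow above.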

21 MAKER Annotation Comparative AED Curves
[Figure: cumulative AED curves; x-axis: AED, y-axis: total number of transcripts; series: combo, high, low, original]
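A cumulative AED curve of this kind counts, for each AED cut-off, how many transcripts score at or below it. The sketch below shows one way to compute the points of such a curve; the function name `cumulative_aed_counts` and the number of steps are hypothetical choices, not taken from the talk.

```python
from bisect import bisect_right


def cumulative_aed_counts(aed_scores, steps=20):
    """Return (threshold, count) pairs: transcripts with AED <= threshold.

    Thresholds run from 0 to 1 in `steps` equal increments.
    """
    ordered = sorted(aed_scores)
    curve = []
    for i in range(steps + 1):
        threshold = i / steps
        curve.append((threshold, bisect_right(ordered, threshold)))
    return curve


if __name__ == "__main__":
    scores = [0.0, 0.1, 0.25, 0.5, 1.0]
    for threshold, count in cumulative_aed_counts(scores, steps=4):
        print(f"AED <= {threshold:.2f}: {count}")
```

An annotation whose curve rises faster at low AED values has more transcripts that agree closely with the evidence.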

22 Unique gene models
[Figures: AED scores of new high GC gene models and AED scores of new low GC gene models; x-axis: binned AED scores, from AED <= .25 to AED = 1; y-axis: count]
1598 new gene models, 25,168 improved models (decrease in AED score)

23 GC Distribution
[Figure: GC distribution; x-axis: GC content, y-axis: gene frequency; series: high, high only, low, low only, original]

24 Codon usage original Jane Pulman

25 Codon usage low new models Jane Pulman

26 Codon usage high new models Jane Pulman

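27 The effective number of codons
$N_c = 2 + \frac{9}{\bar{F}_2} + \frac{1}{\bar{F}_3} + \frac{5}{\bar{F}_4} + \frac{3}{\bar{F}_6}$
- $\bar{F}_i$ is the average homozygosity estimated for synonymous family (SF) type $i$.
- 2 is the $N_c$ contribution of Met and Trp, as each always has an $N_c$ of 1.
- 9, 1, 5 and 3 are the numbers of amino acids of each SF type; for example, there are 9 amino acids that have 2 synonymous codons.
Jane Pulman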
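The formula can be turned into a short calculation. The sketch below is not CodonW's implementation: it assumes the standard genetic code, estimates the homozygosity of each amino acid as F = (n * sum(p_i^2) - 1) / (n - 1) from its n observed codons, and simply raises an error when a synonymous family class has no usable data. The function name `effective_number_of_codons` is hypothetical.

```python
from collections import Counter, defaultdict

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def effective_number_of_codons(cds_list):
    """Nc = 2 + 9/F2 + 1/F3 + 5/F4 + 3/F6 for a list of in-frame CDS strings."""
    counts = Counter()
    for cds in cds_list:
        seq = cds.upper()
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in CODON_TABLE:  # skips codons with ambiguous bases
                counts[codon] += 1

    families = defaultdict(list)  # amino acid -> its synonymous codons
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)

    f_by_class = defaultdict(list)  # family size -> homozygosity values
    for aa, codons in families.items():
        n = sum(counts[c] for c in codons)
        if len(codons) == 1 or n < 2:
            continue  # Met/Trp, or too few observations to estimate F
        sum_p2 = sum((counts[c] / n) ** 2 for c in codons)
        f_by_class[len(codons)].append((n * sum_p2 - 1) / (n - 1))

    nc = 2.0  # Met and Trp contribute 1 each
    for size, n_amino_acids in ((2, 9), (3, 1), (4, 5), (6, 3)):
        values = f_by_class.get(size)
        if not values:
            raise ValueError(f"no usable amino acids with {size} synonymous codons")
        mean_f = sum(values) / len(values)
        if mean_f <= 0:
            raise ValueError(f"non-positive homozygosity for family size {size}")
        nc += n_amino_acids / mean_f
    return nc
```

With a single codon used for every amino acid the function returns 20, the lower bound of Nc; with even usage of all synonymous codons it approaches 61.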

28 Nc Plot Original Jane Pulman

29 Nc Plot Low New Models Jane Pulman

30 Nc Plot High New Models Jane Pulman

31 Position on first and second axis for High and Low GC new models Jane Pulman

32 Ongoing work
- Functional annotation of new genes
- Orthologs of new genes in other species
- Paralogs of new genes
- Comparison of MAKER annotations with full-length cDNAs
- Comparison of results to randomized training sets
- Expression analysis of high and low GC genes
- Tissue-specific GC content and codon usage analyses

33 Conclusions
- High and low GC content gene models may be missed by current HMM training methods
- Developed a pipeline for creating high and low GC content training data sets for existing gene prediction programs
- Reannotated the MSU rice genome with MAKER using high and low GC HMMs
- Identified over 1500 new gene models specific to the high and low GC annotations
- Codon usage bias in high and low GC gene models


More information

Office hours: by appointment. Practical Computing for Biologists. Steven H. D. Haddock & Casey Dunn (2011).

Office hours: by appointment. Practical Computing for Biologists. Steven H. D. Haddock & Casey Dunn (2011). MCB 5429: Theory and Practice of High Throughput Sequence Analysis Bioinformatics analysis of data from next generation sequencing Mon, Wed. 1-2:15pm Beach Hall room 202 Spring 2016 Syllabus Instructor

More information

Genome-scale technologies 2/ Algorithmic and statistical aspects of DNA sequencing What to sequence next? Exciting achievements of the -seq.

Genome-scale technologies 2/ Algorithmic and statistical aspects of DNA sequencing What to sequence next? Exciting achievements of the -seq. Genome-scale technologies 2/ Algorithmic and statistical aspects of DNA sequencing What to sequence next? Exciting achievements of the -seq. Ewa Szczurek University of Warsaw, MIMUW szczurek@mimuw.edu.pl

More information

BIOINFORMATICS TUTORIAL

BIOINFORMATICS TUTORIAL Bio 242 BIOINFORMATICS TUTORIAL Bio 242 α Amylase Lab Sequence Sequence Searches: BLAST Sequence Alignment: Clustal Omega 3d Structure & 3d Alignments DO NOT REMOVE FROM LAB. DO NOT WRITE IN THIS DOCUMENT.

More information

BIO 3350: ELEMENTS OF BIOINFORMATICS PARTIALLY ONLINE SYLLABUS

BIO 3350: ELEMENTS OF BIOINFORMATICS PARTIALLY ONLINE SYLLABUS BIO 3350: ELEMENTS OF BIOINFORMATICS PARTIALLY ONLINE SYLLABUS NEW YORK CITY COLLEGE OF TECHNOLOGY The City University Of New York School of Arts and Sciences Biological Sciences Department Course title:

More information

Genetic Mutations. What mistakes can occur when DNA is replicated? T A C G T A G T C C C T A A T G G A T C

Genetic Mutations. What mistakes can occur when DNA is replicated? T A C G T A G T C C C T A A T G G A T C Why? Genetic Mutations What mistakes can occur when DNA is replicated? The genes encoded in your DNA result in the production of proteins that perform specific functions within your cells. Various environmental

More information

Bio-Informatics Lectures. A Short Introduction

Bio-Informatics Lectures. A Short Introduction Bio-Informatics Lectures A Short Introduction The History of Bioinformatics Sanger Sequencing PCR in presence of fluorescent, chain-terminating dideoxynucleotides Massively Parallel Sequencing Massively

More information

SeattleSNPs Interactive Tutorial: Web Tools for Site Selection, Linkage Disequilibrium and Haplotype Analysis

SeattleSNPs Interactive Tutorial: Web Tools for Site Selection, Linkage Disequilibrium and Haplotype Analysis SeattleSNPs Interactive Tutorial: Web Tools for Site Selection, Linkage Disequilibrium and Haplotype Analysis Goal: This tutorial introduces several websites and tools useful for determining linkage disequilibrium

More information

Figure During transcription, RNA nucleotides base-pair one by one with DNA

Figure During transcription, RNA nucleotides base-pair one by one with DNA Objectives Describe the process of DNA transcription. Explain how an RNA message is edited. Describe how RNA is translated to a protein. Summarize protein synthesis. Key Terms messenger RNA (mrna) RNA

More information

Single Nucleotide Polymorphism (SNP) Calling from Next-Gen Sequencing (NGS) data for Bacterial Phylogenetics

Single Nucleotide Polymorphism (SNP) Calling from Next-Gen Sequencing (NGS) data for Bacterial Phylogenetics Single Nucleotide Polymorphism (SNP) Calling from Next-Gen Sequencing (NGS) data for Bacterial Phylogenetics Taj Azarian, MPH Doctoral Student Department of Epidemiology College of Medicine and College

More information

CD-HIT User s Guide. Last updated: April 5, 2010. http://cd-hit.org http://bioinformatics.org/cd-hit/

CD-HIT User s Guide. Last updated: April 5, 2010. http://cd-hit.org http://bioinformatics.org/cd-hit/ CD-HIT User s Guide Last updated: April 5, 2010 http://cd-hit.org http://bioinformatics.org/cd-hit/ Program developed by Weizhong Li s lab at UCSD http://weizhong-lab.ucsd.edu liwz@sdsc.edu 1. Introduction

More information

Multiple Sequence Alignment. Hot Topic 5/24/06 Kim Walker

Multiple Sequence Alignment. Hot Topic 5/24/06 Kim Walker Multiple Sequence Alignment Hot Topic 5/24/06 Kim Walker Outline Why are Multiple Sequence Alignments useful? What Tools are Available? Brief Introduction to ClustalX Tools to Edit and Add Features to

More information

Guide for Bioinformatics Project Module 3

Guide for Bioinformatics Project Module 3 Structure- Based Evidence and Multiple Sequence Alignment In this module we will revisit some topics we started to look at while performing our BLAST search and looking at the CDD database in the first

More information

When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want

When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want 1 When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want to search other databases as well. There are very

More information

An introduction to bioinformatic tools for population genomic and metagenetic data analysis, 2.5 higher education credits Third Cycle

An introduction to bioinformatic tools for population genomic and metagenetic data analysis, 2.5 higher education credits Third Cycle An introduction to bioinformatic tools for population genomic and metagenetic data analysis, 2.5 higher education credits Third Cycle Faculty of Science; Department of Marine Sciences The Swedish Royal

More information

Chapter 21 Active Reading Guide The Evolution of Populations

Chapter 21 Active Reading Guide The Evolution of Populations Name: Roksana Korbi AP Biology Chapter 21 Active Reading Guide The Evolution of Populations This chapter begins with the idea that we focused on as we closed Chapter 19: Individuals do not evolve! Populations

More information

A new type of Hidden Markov Models to predict complex domain architecture in protein sequences

A new type of Hidden Markov Models to predict complex domain architecture in protein sequences A new type of Hidden Markov Models to predict complex domain architecture in protein sequences Raluca Uricaru, Laurent Bréhélin and Eric Rivals LIRMM, CNRS Université de Montpellier 2 14 Juin 2007 Raluca

More information

Gene Finding CMSC 423

Gene Finding CMSC 423 Gene Finding CMSC 423 Finding Signals in DNA We just have a long string of A, C, G, Ts. How can we find the signals encoded in it? Suppose you encountered a language you didn t know. How would you decipher

More information

Chapter 2. imapper: A web server for the automated analysis and mapping of insertional mutagenesis sequence data against Ensembl genomes

Chapter 2. imapper: A web server for the automated analysis and mapping of insertional mutagenesis sequence data against Ensembl genomes Chapter 2. imapper: A web server for the automated analysis and mapping of insertional mutagenesis sequence data against Ensembl genomes 2.1 Introduction Large-scale insertional mutagenesis screening in

More information

MANTRA 2.0 TUTORIAL. mantra.tigem.it

MANTRA 2.0 TUTORIAL. mantra.tigem.it MANTRA 2.0 TUTORIAL mantra.tigem.it OUTLINE 1. MANTRA Web Tool 2. Analysis a) New Experiment b) New Node c) GSEA 3. Network a) View b) Button Panel 4. Search 5. In Summary 6. Conclusion OUTLINE 1. MANTRA

More information

Algorithms in Computational Biology (236522) spring 2007 Lecture #1

Algorithms in Computational Biology (236522) spring 2007 Lecture #1 Algorithms in Computational Biology (236522) spring 2007 Lecture #1 Lecturer: Shlomo Moran, Taub 639, tel 4363 Office hours: Tuesday 11:00-12:00/by appointment TA: Ilan Gronau, Taub 700, tel 4894 Office

More information

A Tutorial in Genetic Sequence Classification Tools and Techniques

A Tutorial in Genetic Sequence Classification Tools and Techniques A Tutorial in Genetic Sequence Classification Tools and Techniques Jake Drew Data Mining CSE 8331 Southern Methodist University jakemdrew@gmail.com www.jakemdrew.com Sequence Characters IUPAC nucleotide

More information

MORPHEUS. http://biodev.cea.fr/morpheus/ Prediction of Transcription Factors Binding Sites based on Position Weight Matrix.

MORPHEUS. http://biodev.cea.fr/morpheus/ Prediction of Transcription Factors Binding Sites based on Position Weight Matrix. MORPHEUS http://biodev.cea.fr/morpheus/ Prediction of Transcription Factors Binding Sites based on Position Weight Matrix. Reference: MORPHEUS, a Webtool for Transcripton Factor Binding Analysis Using

More information

Cloud-Based Big Data Analytics in Bioinformatics

Cloud-Based Big Data Analytics in Bioinformatics Cloud-Based Big Data Analytics in Bioinformatics Presented By Cephas Mawere Harare Institute of Technology, Zimbabwe 1 Introduction 2 Big Data Analytics Big Data are a collection of data sets so large

More information

In silico comparison of nucleotide composition and codon usage bias between the essential and non- essential genes of Staphylococcus aureus NCTC 8325

In silico comparison of nucleotide composition and codon usage bias between the essential and non- essential genes of Staphylococcus aureus NCTC 8325 International Journal of Current Microbiology and Applied Sciences ISSN: 2319-7706 Volume 3 Number 12 (2014) pp. 8-15 http://www.ijcmas.com Original Research Article In silico comparison of nucleotide

More information

What s the Point? --- Point, Frameshift, Inversion, & Deletion Mutations

What s the Point? --- Point, Frameshift, Inversion, & Deletion Mutations What s the Point? --- Point, Frameshift, Inversion, & Deletion Mutations http://members.cox.net/amgough/mutation_chromosome_translocation.gif Introduction: In biology, mutations are changes to the base

More information

AP Biology Essential Knowledge Student Diagnostic

AP Biology Essential Knowledge Student Diagnostic AP Biology Essential Knowledge Student Diagnostic Background The Essential Knowledge statements provided in the AP Biology Curriculum Framework are scientific claims describing phenomenon occurring in

More information

SGI. High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems. January, 2012. Abstract. Haruna Cofer*, PhD

SGI. High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems. January, 2012. Abstract. Haruna Cofer*, PhD White Paper SGI High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems Haruna Cofer*, PhD January, 2012 Abstract The SGI High Throughput Computing (HTC) Wrapper

More information