Efficient codon optimization with motif engineering

Similar documents
agucacaaacgcu agugcuaguuua uaugcagucuua

Gene Models & Bed format: What they represent.

Gene Finding CMSC 423

Module 3 Questions. 7. Chemotaxis is an example of signal transduction. Explain, with the use of diagrams.

Specific problems. The genetic code. The genetic code. Adaptor molecules match amino acids to mrna codons

RNA & Protein Synthesis

A Web Based Software for Synonymous Codon Usage Indices

Molecular Genetics. RNA, Transcription, & Protein Synthesis

13.2 Ribosomes & Protein Synthesis

PRACTICE TEST QUESTIONS

Gene and Chromosome Mutation Worksheet (reference pgs in Modern Biology textbook)

AP BIOLOGY 2010 SCORING GUIDELINES (Form B)

Name Class Date. Figure Which nucleotide in Figure 13 1 indicates the nucleic acid above is RNA? a. uracil c. cytosine b. guanine d.

FINDING RELATION BETWEEN AGING AND

The Steps. 1. Transcription. 2. Transferal. 3. Translation

RNA Structure and folding

LabGenius. Technical design notes. The world s most advanced synthetic DNA libraries. hi@labgeni.us V1.5 NOV 15

Scaling from 1 PC to a super computer using Mascot

Basic Concepts of DNA, Proteins, Genes and Genomes

Algorithms in Computational Biology (236522) spring 2007 Lecture #1

The sequence of bases on the mrna is a code that determines the sequence of amino acids in the polypeptide being synthesized:

SIGMOD RWE Review Towards Proximity Pattern Mining in Large Graphs

Coding sequence the sequence of nucleotide bases on the DNA that are transcribed into RNA which are in turn translated into protein

Overview of Eukaryotic Gene Prediction

Bioinformatics Grid - Enabled Tools For Biologists.

Ms. Campbell Protein Synthesis Practice Questions Regents L.E.

Translation Study Guide

Molecular Facts and Figures

From DNA to Protein. Proteins. Chapter 13. Prokaryotes and Eukaryotes. The Path From Genes to Proteins. All proteins consist of polypeptide chains

AP BIOLOGY 2009 SCORING GUIDELINES

Protein Synthesis How Genes Become Constituent Molecules

Name: Date: Period: DNA Unit: DNA Webquest

Thymine = orange Adenine = dark green Guanine = purple Cytosine = yellow Uracil = brown

Lecture 6: Single nucleotide polymorphisms (SNPs) and Restriction Fragment Length Polymorphisms (RFLPs)

2. The number of different kinds of nucleotides present in any DNA molecule is A) four B) six C) two D) three

Graph theoretic approach to analyze amino acid network

When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want

Guide for Bioinformatics Project Module 3

Extensions of the Partial Least Squares approach for the analysis of biomolecular interactions. Nina Kirschbaum

Integrating DNA Motif Discovery and Genome-Wide Expression Analysis. Erin M. Conlon

Integrated Protein Services

Introduction to Genome Annotation

Transcription: RNA Synthesis, Processing & Modification

Multiclass Classification Class 06, 25 Feb 2008 Ryan Rifkin

Current Motif Discovery Tools and their Limitations

Module 10: Bioinformatics

Hidden Markov Models in Bioinformatics. By Máthé Zoltán Kőrösi Zoltán 2006

T cell Epitope Prediction

Lecture 18: Applications of Dynamic Programming Steven Skiena. Department of Computer Science State University of New York Stony Brook, NY

Bio-Informatics Lectures. A Short Introduction

Vector NTI Advance 11 Quick Start Guide

Regents Biology REGENTS REVIEW: PROTEIN SYNTHESIS

Shouguo Gao Ph. D Department of Physics and Comprehensive Diabetes Center

INTERNATIONAL CONFERENCE ON HARMONISATION OF TECHNICAL REQUIREMENTS FOR REGISTRATION OF PHARMACEUTICALS FOR HUMAN USE Q5B

Recombinant DNA Unit Exam

BCOR101 Midterm II Wednesday, October 26, 2005

Computer based Scheduling Tool for Multi-product Scheduling Problems

Final Project Report

Lecture Data Warehouse Systems

( TUTORIAL. (July 2006)

THREE DIMENSIONAL REPRESENTATION OF AMINO ACID CHARAC- TERISTICS

SICKLE CELL ANEMIA & THE HEMOGLOBIN GENE TEACHER S GUIDE

BLAST. Anders Gorm Pedersen & Rasmus Wernersson

Problem Set 3 KEY

A Mathematical Programming Solution to the Mars Express Memory Dumping Problem

Amino Acids and Their Properties

Dynamics of Biological Systems

Analysis of Algorithms I: Binary Search Trees

Basic attributes of genetic processes (replication, transcription, translation)

SR2000 FREQUENCY MONITOR

FREE computing using Amazon EC2

MUTATION, DNA REPAIR AND CANCER

160 CHAPTER 4. VECTOR SPACES

Data Integration via Constrained Clustering: An Application to Enzyme Clustering

Introduction to Principal Components and FactorAnalysis

Pairwise Sequence Alignment

European Medicines Agency

Biological Databases and Protein Sequence Analysis

Bioinformatics Resources at a Glance

Lecture 2, Introduction to Python. Python Programming Language

Chapter 4. SDP - difficulties. 4.1 Curse of dimensionality

Hands on Simulation of Mutation

The EcoCyc Curation Process

HYBRID GENETIC ALGORITHMS FOR SCHEDULING ADVERTISEMENTS ON A WEB PAGE

Figure 1: RotemNet Main Screen

Integrating Apache Spark with an Enterprise Data Warehouse

Concluding lesson. Student manual. What kind of protein are you? (Basic)

MS SQL Installation Guide

Protein Prospector and Ways of Calculating Expectation Values

Heuristics for the Sorting by Length-Weighted Inversions Problem on Signed Permutations

APPM4720/5720: Fast algorithms for big data. Gunnar Martinsson The University of Colorado at Boulder

Integrated Protein Services

LECTURE 6 Gene Mutation (Chapter )

Molecular Computing Athabasca Hall Sept. 30, 2013

Step by Step Guide to Importing Genetic Data into JMP Genomics

Interactive Level-Set Deformation On the GPU

Time series experiments

Data Integration. Lectures 16 & 17. ECS289A, WQ03, Filkov

Chapter 3. if 2 a i then location: = i. Page 40

Transcription:

Anne Condon and Chris Thachuk University of British Columbia IWOCA 2011 June 20, 2011

Why synthesize proteins? Image credit: Protein Data Bank

Central dogma Image credit: Daniel Horspool

How to synthesize DNA/RNA

Genetic code is redundant

Genetic code is redundant

Genetic code is a many-to-one mapping

Genetic code is a many-to-one mapping Definition λ(i) denotes the i th amino acid λ j (i) denotes its j th codon λ(i) denotes its codon count

Genetic code is a many-to-one mapping Definition λ(i) denotes the i th amino acid λ j (i) denotes its j th codon λ(i) denotes its codon count Example λ(2) = Leucine λ(2) = 6 λ 1 (2) = CUA

Codon frequency

Codon frequency Definition Let ρ j (i) denotes the relative frequency of j th codon of i th amino acid.

Codon frequency Definition Let ρ j (i) denotes the relative frequency of j th codon of i th amino acid. Example λ(3) = Tyrosine ρ 1 (3) = 9 9 + 1 ρ 2 (3) = 1 9 + 1 = 0.9 = 0.1

Codon frequency Definition Let ρ j (i) denotes the relative frequency of j th codon of i th amino acid. Example λ(3) = Tyrosine ρ 1 (3) = 9 9 + 1 ρ 2 (3) = 1 9 + 1 = 0.9 = 0.1 λ 1 (3) is a most frequent codon

Codon fitness Definition Let τ j (i) denote the codon fitness of j th codon of i th amino acid. A codon s fitness is its frequency relative to the most frequent codon of the same amino acid.

Codon fitness Definition Let τ j (i) denote the codon fitness of j th codon of i th amino acid. A codon s fitness is its frequency relative to the most frequent codon of the same amino acid. Example λ(3) = Tyrosine τ 2 (3) = ρ 2(3) ρ 1 (3) = 0.1 0.9 0.11

The codon adaption index Given an amino acid sequence A = α 1, α 2,..., α A, and a corresponding codon design S = c 1, c 2,..., c A : Example A CAI(S, A) = τ ci (α i ) i=1 1 A A Tyrosine Tyrosine Tyrosine Tyrosine S UAC UAU UAU UAC 1.0 0.11 0.11 1.0 CAI (S, A) = (1.0 0.11 0.11 1.0) 1 4 0.33

Optimizing CAI The CAI codon optimization problem Instance: Amino acid sequence A = α 1, α 2,..., α A. Problem: Find a codon design S corresponding to A such that: CAI (S, A) = max{cai (S, A) S S(A)} where S(A) is the set of all codon designs for A.

Optimizing CAI The CAI codon optimization problem Instance: Amino acid sequence A = α 1, α 2,..., α A. Problem: Find a codon design S corresponding to A such that: CAI (S, A) = max{cai (S, A) S S(A)} where S(A) is the set of all codon designs for A. Note: trivially solvable in linear time

Forbidden motifs and desirable motifs The catch Certain motifs (substrings) should not appear in the codon design.

Forbidden motifs and desirable motifs The catch Certain motifs (substrings) should not appear in the codon design. restriction enzyme motifs (e.g., MlyI restriction enzyme motif: GAGTC)

Forbidden motifs and desirable motifs The catch Certain motifs (substrings) should not appear in the codon design. restriction enzyme motifs (e.g., MlyI restriction enzyme motif: GAGTC) polyhomomeric regions (e.g., CCCCCC)

Forbidden motifs and desirable motifs The catch Certain motifs (substrings) should not appear in the codon design. restriction enzyme motifs (e.g., MlyI restriction enzyme motif: GAGTC) polyhomomeric regions (e.g., CCCCCC) Certain motifs (substrings) are desirable in the codon design. Immuno stimulatory motifs

Forbidden motifs and desirable motifs Example F = {CCUU, AGC, UGGC}

Forbidden motifs and desirable motifs Example F = {CCUU, AGC, UGGC}

Forbidden motifs and desirable motifs Example F = {CCUU, AGC, UGGC}

Forbidden motifs and desirable motifs Example F = {CCUU, AGC, UGGC}

Forbidden motifs and desirable motifs Example F = {CCUU, AGC, UGGC}

Forbidden motifs and desirable motifs Example Detecting motifs Equivalent to the dictionary matching problem. Classic dictionary (Aho & Corasick 1975) Succinct dictionary (Belazzougui 2010, Thachuk 2011) F = {CCUU, AGC, UGGC} Text of length h scanned for patterns in O(h) time.

Motifs are of constant length Observation If the largest forbidden or desired motif is of length g, then any forbidden or desired motif can span at most k + 1 consecutive codons, where k = g/3.

Optimizing CAI and considering motifs The CAI codon optimization problem with motif engineering Instance: Amino acid sequence A = α 1, α 2,..., α A, a set of forbidden motifs F, and a set of desired motifs D.

Optimizing CAI and considering motifs The CAI codon optimization problem with motif engineering Instance: Amino acid sequence A = α 1, α 2,..., α A, a set of forbidden motifs F, and a set of desired motifs D. Problem: Find a codon design S corresponding to A such that: S is valid, with respect to F and D, CAI (S, A) = max{cai (S, A) S S(A)}

Optimizing CAI and considering motifs The CAI codon optimization problem with motif engineering Instance: Amino acid sequence A = α 1, α 2,..., α A, a set of forbidden motifs F, and a set of desired motifs D. Problem: Find a codon design S corresponding to A such that: S is valid, with respect to F and D, CAI (S, A) = max{cai (S, A) S S(A)} where S(A) is the set of all valid codon designs for A.

The algorithm Overview and Notation We will develop a dynamic programming algorithm. (We ignore desirable motifs.) Notation λ j (i) j th codon of i th amino acid τ j (i) j th codon s fitness w.r.t i th amino acid

The algorithm Overview and Notation We will develop a dynamic programming algorithm. (We ignore desirable motifs.) Notation λ j (i) j th codon of i th amino acid τ j (i) j th codon s fitness w.r.t i th amino acid M F (λ ci (α j )... λ ci+q (α j+q )) count of forbidden motifs M F (λ c i (α j )... λ ci+q (α j+q )) count of forbidden motifs ending in last codon

The algorithm A tale of two matrices We make use of two k-dimensional DP matrices. The Forbidden motif DP matrix F i c i k+1,...,c i 1,c i Denotes the minimum possible number of forbidden motifs in a DNA sequence which codes for an amino acid sequence A = α 1, α 2,..., α i, given that the last k codons (of i total codons) have indices denoted as c i k+1,..., c i 1, c i.

The algorithm A tale of two matrices We make use of two k-dimensional DP matrices. The Forbidden motif DP matrix F i c i k+1,...,c i 1,c i Denotes the minimum possible number of forbidden motifs in a DNA sequence which codes for an amino acid sequence A = α 1, α 2,..., α i, given that the last k codons (of i total codons) have indices denoted as c i k+1,..., c i 1, c i. The CAI value DP matrix P i c i k+1,...,c i 1,c i Denotes the maximum possible CAI score among all valid sequences.

The algorithm The base case Simply evaluate all assignments to first k codons. Fc k ( 1,c 2,...,c k 1,c k = M F λc1 (α 1 )λ c2 (α 2 )... λ ck 1 (α k 1 )λ ck (α k ) ) k Pc k 1,c 2,...,c k 1,c k = (τ ci (α i )) i=1

The algorithm The base case Simply evaluate all assignments to first k codons. Fc k ( 1,c 2,...,c k 1,c k = M F λc1 (α 1 )λ c2 (α 2 )... λ ck 1 (α k 1 )λ ck (α k ) ) k Pc k 1,c 2,...,c k 1,c k = (τ ci (α i )) i=1 Complexity analysis at most 6 codons for each of the k positions O(6 k ) assignments evaluating one assignment for motifs O(k) time O(6 k k) time and O(6 k ) words of space

The algorithm The recursive case case (i > k) For a fixed assignment to last k codons: Fc i i k+1,...,c i 1,c i = min 1 ci k λ(α i k ) { Fc i 1 i k,...,c i 2,c i 1 + M F ( λci k (α i k )... λ ci 1 (α i 1 )λ ci (α i ) )} Example

The algorithm The recursive case case (i > k) For a fixed assignment to last k codons: Fc i i k+1,...,c i 1,c i = min 1 ci k λ(α i k ) { Fc i 1 i k,...,c i 2,c i 1 + M F ( λci k (α i k )... λ ci 1 (α i 1 )λ ci (α i ) )} Example

The algorithm The recursive case case (i > k) For a fixed assignment to last k codons: Fc i i k+1,...,c i 1,c i = min 1 ci k λ(α i k ) { Fc i 1 i k,...,c i 2,c i 1 + M F ( λci k (α i k )... λ ci 1 (α i 1 )λ ci (α i ) )} Example

The algorithm The recursive case case (i > k) For a fixed assignment to last k codons: Fc i i k+1,...,c i 1,c i = min 1 ci k λ(α i k ) { Fc i 1 i k,...,c i 2,c i 1 + M F ( λci k (α i k )... λ ci 1 (α i 1 )λ ci (α i ) )} Example

The algorithm The recursive case case (i > k) For a fixed assignment to last k codons: Pc i i k+1,...,c i 1,c i = max 1 ci k λ(α i k ) Fc i 1, if i k (,...,c i 2,c i 1 + M F λci k (α i k )... λ ci (α i ) ) Fc i i k+1,...,c i 1,c i τ ci (α i ) Pc i 1 i k,...,c i 2,c i 1, otherwise Example

The algorithm The recursive case case (i > k) P i k = F i k = max 1 c i λ(α i ) 1 c i 1 λ(α i 1 ). 1 c i k+1 λ(α i k+1 ) min 1 c i λ(α i ) 1 c i 1 λ(α i 1 ). 1 c i k+1 λ(α i k+1 ) { {F i ci k+1,...,ci 1,ci } Pc i i k+1,...,c i 1,c i, if Fc i i k+1,...,c i 1,c i = F k i, otherwise }

The algorithm The recursive case case (i > k) P i k = F i k = max 1 c i λ(α i ) 1 c i 1 λ(α i 1 ). 1 c i k+1 λ(α i k+1 ) min 1 c i λ(α i ) 1 c i 1 λ(α i 1 ). 1 c i k+1 λ(α i k+1 ) { {F i ci k+1,...,ci 1,ci } Pc i i k+1,...,c i 1,c i, if Fc i i k+1,...,c i 1,c i = F k i, otherwise } Complexity analysis O(6 k ) codon assignments evaluated at n positions O(k) time to evaluate assignment O(6 k kn) = O(n) time and space overall previous state-of-the-art θ(n 2 ) time/space (Satya et al., 2003)

Experimental setup Data set 3,157 sequences from GENCODE subset of ENCODE dataset comprises 1% of human genome representative of genome in various characteristics range in length from 75 to 8186 bases mean length of 173 bases (267 bases standard deviation)

Experimental setup Data set 3,157 sequences from GENCODE subset of ENCODE dataset comprises 1% of human genome representative of genome in various characteristics range in length from 75 to 8186 bases mean length of 173 bases (267 bases standard deviation) using codon frequencies of Escherichia coli

Experimental setup Data set 3,157 sequences from GENCODE subset of ENCODE dataset comprises 1% of human genome representative of genome in various characteristics range in length from 75 to 8186 bases mean length of 173 bases (267 bases standard deviation) using codon frequencies of Escherichia coli F = 10 D = 33 k = 3 (motifs of length 9 or less)

Experimental setup Data set 3,157 sequences from GENCODE subset of ENCODE dataset comprises 1% of human genome representative of genome in various characteristics range in length from 75 to 8186 bases mean length of 173 bases (267 bases standard deviation) using codon frequencies of Escherichia coli F = 10 D = 33 k = 3 (motifs of length 9 or less) Implementation & Hardware Implemented in C++ Pentium IV 2.4 GhZ 1 GB RAM

motif sets CAI value # forbidden # desired none (wild-types) 0.65 (0.06) 9.24 (16.24) 0.49 (1.06)

motif sets CAI value # forbidden # desired none (wild-types) 0.65 (0.06) 9.24 (16.24) 0.49 (1.06) forbidden 0.92 (0.04) 0.14 (0.45) 0 (0.00)

motif sets CAI value # forbidden # desired none (wild-types) 0.65 (0.06) 9.24 (16.24) 0.49 (1.06) forbidden 0.92 (0.04) 0.14 (0.45) 0 (0.00) forbidden & desired 0.83 (0.05) 0.14 (0.45) 10.13 (14.84)

runtime [CPU seconds] 0.0 0.1 0.2 0.3 0.4 all motifs only forbidden motifs 0 2000 4000 6000 8000 sequence length

Conclusions & Future Work Conclusions CAI of gene can be optimized effectively motifs can be removed/added effectively O(n) time/space algorithm for constant length motifs algorithm is fast in practice

Conclusions & Future Work Conclusions CAI of gene can be optimized effectively motifs can be removed/added effectively O(n) time/space algorithm for constant length motifs algorithm is fast in practice Open Problem Design codon sequence free of secondary structure