Efficient codon optimization with motif engineering

Save this PDF as:
 WORD  PNG  TXT  JPG

Size: px
Start display at page:

Download "Efficient codon optimization with motif engineering"

Transcription

1 Anne Condon and Chris Thachuk University of British Columbia IWOCA 2011 June 20, 2011

2 Why synthesize proteins? Image credit: Protein Data Bank

3 Central dogma Image credit: Daniel Horspool

4 How to synthesize DNA/RNA

5

6

7

8

9

10

11 Genetic code is redundant

12 Genetic code is redundant

13 Genetic code is a many-to-one mapping

14 Genetic code is a many-to-one mapping Definition λ(i) denotes the i th amino acid λ j (i) denotes its j th codon λ(i) denotes its codon count

15 Genetic code is a many-to-one mapping Definition λ(i) denotes the i th amino acid λ j (i) denotes its j th codon λ(i) denotes its codon count Example λ(2) = Leucine λ(2) = 6 λ 1 (2) = CUA

16 Codon frequency

17 Codon frequency Definition Let ρ j (i) denotes the relative frequency of j th codon of i th amino acid.

18 Codon frequency Definition Let ρ j (i) denotes the relative frequency of j th codon of i th amino acid. Example λ(3) = Tyrosine ρ 1 (3) = ρ 2 (3) = = 0.9 = 0.1

19 Codon frequency Definition Let ρ j (i) denotes the relative frequency of j th codon of i th amino acid. Example λ(3) = Tyrosine ρ 1 (3) = ρ 2 (3) = = 0.9 = 0.1 λ 1 (3) is a most frequent codon

20 Codon fitness Definition Let τ j (i) denote the codon fitness of j th codon of i th amino acid. A codon s fitness is its frequency relative to the most frequent codon of the same amino acid.

21 Codon fitness Definition Let τ j (i) denote the codon fitness of j th codon of i th amino acid. A codon s fitness is its frequency relative to the most frequent codon of the same amino acid. Example λ(3) = Tyrosine τ 2 (3) = ρ 2(3) ρ 1 (3) =

22 The codon adaption index Given an amino acid sequence A = α 1, α 2,..., α A, and a corresponding codon design S = c 1, c 2,..., c A : Example A CAI(S, A) = τ ci (α i ) i=1 1 A A Tyrosine Tyrosine Tyrosine Tyrosine S UAC UAU UAU UAC CAI (S, A) = ( )

23 Optimizing CAI The CAI codon optimization problem Instance: Amino acid sequence A = α 1, α 2,..., α A. Problem: Find a codon design S corresponding to A such that: CAI (S, A) = max{cai (S, A) S S(A)} where S(A) is the set of all codon designs for A.

24 Optimizing CAI The CAI codon optimization problem Instance: Amino acid sequence A = α 1, α 2,..., α A. Problem: Find a codon design S corresponding to A such that: CAI (S, A) = max{cai (S, A) S S(A)} where S(A) is the set of all codon designs for A. Note: trivially solvable in linear time

25 Forbidden motifs and desirable motifs The catch Certain motifs (substrings) should not appear in the codon design.

26 Forbidden motifs and desirable motifs The catch Certain motifs (substrings) should not appear in the codon design. restriction enzyme motifs (e.g., MlyI restriction enzyme motif: GAGTC)

27 Forbidden motifs and desirable motifs The catch Certain motifs (substrings) should not appear in the codon design. restriction enzyme motifs (e.g., MlyI restriction enzyme motif: GAGTC) polyhomomeric regions (e.g., CCCCCC)

28 Forbidden motifs and desirable motifs The catch Certain motifs (substrings) should not appear in the codon design. restriction enzyme motifs (e.g., MlyI restriction enzyme motif: GAGTC) polyhomomeric regions (e.g., CCCCCC) Certain motifs (substrings) are desirable in the codon design. Immuno stimulatory motifs

29 Forbidden motifs and desirable motifs Example F = {CCUU, AGC, UGGC}

30 Forbidden motifs and desirable motifs Example F = {CCUU, AGC, UGGC}

31 Forbidden motifs and desirable motifs Example F = {CCUU, AGC, UGGC}

32 Forbidden motifs and desirable motifs Example F = {CCUU, AGC, UGGC}

33 Forbidden motifs and desirable motifs Example F = {CCUU, AGC, UGGC}

34 Forbidden motifs and desirable motifs Example Detecting motifs Equivalent to the dictionary matching problem. Classic dictionary (Aho & Corasick 1975) Succinct dictionary (Belazzougui 2010, Thachuk 2011) F = {CCUU, AGC, UGGC} Text of length h scanned for patterns in O(h) time.

35 Motifs are of constant length Observation If the largest forbidden or desired motif is of length g, then any forbidden or desired motif can span at most k + 1 consecutive codons, where k = g/3.

36 Optimizing CAI and considering motifs The CAI codon optimization problem with motif engineering Instance: Amino acid sequence A = α 1, α 2,..., α A, a set of forbidden motifs F, and a set of desired motifs D.

37 Optimizing CAI and considering motifs The CAI codon optimization problem with motif engineering Instance: Amino acid sequence A = α 1, α 2,..., α A, a set of forbidden motifs F, and a set of desired motifs D. Problem: Find a codon design S corresponding to A such that: S is valid, with respect to F and D, CAI (S, A) = max{cai (S, A) S S(A)}

38 Optimizing CAI and considering motifs The CAI codon optimization problem with motif engineering Instance: Amino acid sequence A = α 1, α 2,..., α A, a set of forbidden motifs F, and a set of desired motifs D. Problem: Find a codon design S corresponding to A such that: S is valid, with respect to F and D, CAI (S, A) = max{cai (S, A) S S(A)} where S(A) is the set of all valid codon designs for A.

39 The algorithm Overview and Notation We will develop a dynamic programming algorithm. (We ignore desirable motifs.) Notation λ j (i) j th codon of i th amino acid τ j (i) j th codon s fitness w.r.t i th amino acid

40 The algorithm Overview and Notation We will develop a dynamic programming algorithm. (We ignore desirable motifs.) Notation λ j (i) j th codon of i th amino acid τ j (i) j th codon s fitness w.r.t i th amino acid M F (λ ci (α j )... λ ci+q (α j+q )) count of forbidden motifs M F (λ c i (α j )... λ ci+q (α j+q )) count of forbidden motifs ending in last codon

41 The algorithm A tale of two matrices We make use of two k-dimensional DP matrices. The Forbidden motif DP matrix F i c i k+1,...,c i 1,c i Denotes the minimum possible number of forbidden motifs in a DNA sequence which codes for an amino acid sequence A = α 1, α 2,..., α i, given that the last k codons (of i total codons) have indices denoted as c i k+1,..., c i 1, c i.

42 The algorithm A tale of two matrices We make use of two k-dimensional DP matrices. The Forbidden motif DP matrix F i c i k+1,...,c i 1,c i Denotes the minimum possible number of forbidden motifs in a DNA sequence which codes for an amino acid sequence A = α 1, α 2,..., α i, given that the last k codons (of i total codons) have indices denoted as c i k+1,..., c i 1, c i. The CAI value DP matrix P i c i k+1,...,c i 1,c i Denotes the maximum possible CAI score among all valid sequences.

43 The algorithm The base case Simply evaluate all assignments to first k codons. Fc k ( 1,c 2,...,c k 1,c k = M F λc1 (α 1 )λ c2 (α 2 )... λ ck 1 (α k 1 )λ ck (α k ) ) k Pc k 1,c 2,...,c k 1,c k = (τ ci (α i )) i=1

44 The algorithm The base case Simply evaluate all assignments to first k codons. Fc k ( 1,c 2,...,c k 1,c k = M F λc1 (α 1 )λ c2 (α 2 )... λ ck 1 (α k 1 )λ ck (α k ) ) k Pc k 1,c 2,...,c k 1,c k = (τ ci (α i )) i=1 Complexity analysis at most 6 codons for each of the k positions O(6 k ) assignments evaluating one assignment for motifs O(k) time O(6 k k) time and O(6 k ) words of space

45 The algorithm The recursive case case (i > k) For a fixed assignment to last k codons: Fc i i k+1,...,c i 1,c i = min 1 ci k λ(α i k ) { Fc i 1 i k,...,c i 2,c i 1 + M F ( λci k (α i k )... λ ci 1 (α i 1 )λ ci (α i ) )} Example

46 The algorithm The recursive case case (i > k) For a fixed assignment to last k codons: Fc i i k+1,...,c i 1,c i = min 1 ci k λ(α i k ) { Fc i 1 i k,...,c i 2,c i 1 + M F ( λci k (α i k )... λ ci 1 (α i 1 )λ ci (α i ) )} Example

47 The algorithm The recursive case case (i > k) For a fixed assignment to last k codons: Fc i i k+1,...,c i 1,c i = min 1 ci k λ(α i k ) { Fc i 1 i k,...,c i 2,c i 1 + M F ( λci k (α i k )... λ ci 1 (α i 1 )λ ci (α i ) )} Example

48 The algorithm The recursive case case (i > k) For a fixed assignment to last k codons: Fc i i k+1,...,c i 1,c i = min 1 ci k λ(α i k ) { Fc i 1 i k,...,c i 2,c i 1 + M F ( λci k (α i k )... λ ci 1 (α i 1 )λ ci (α i ) )} Example

49 The algorithm The recursive case case (i > k) For a fixed assignment to last k codons: Pc i i k+1,...,c i 1,c i = max 1 ci k λ(α i k ) Fc i 1, if i k (,...,c i 2,c i 1 + M F λci k (α i k )... λ ci (α i ) ) Fc i i k+1,...,c i 1,c i τ ci (α i ) Pc i 1 i k,...,c i 2,c i 1, otherwise Example

50 The algorithm The recursive case case (i > k) P i k = F i k = max 1 c i λ(α i ) 1 c i 1 λ(α i 1 ). 1 c i k+1 λ(α i k+1 ) min 1 c i λ(α i ) 1 c i 1 λ(α i 1 ). 1 c i k+1 λ(α i k+1 ) { {F i ci k+1,...,ci 1,ci } Pc i i k+1,...,c i 1,c i, if Fc i i k+1,...,c i 1,c i = F k i, otherwise }

51 The algorithm The recursive case case (i > k) P i k = F i k = max 1 c i λ(α i ) 1 c i 1 λ(α i 1 ). 1 c i k+1 λ(α i k+1 ) min 1 c i λ(α i ) 1 c i 1 λ(α i 1 ). 1 c i k+1 λ(α i k+1 ) { {F i ci k+1,...,ci 1,ci } Pc i i k+1,...,c i 1,c i, if Fc i i k+1,...,c i 1,c i = F k i, otherwise } Complexity analysis O(6 k ) codon assignments evaluated at n positions O(k) time to evaluate assignment O(6 k kn) = O(n) time and space overall previous state-of-the-art θ(n 2 ) time/space (Satya et al., 2003)

52 Experimental setup Data set 3,157 sequences from GENCODE subset of ENCODE dataset comprises 1% of human genome representative of genome in various characteristics range in length from 75 to 8186 bases mean length of 173 bases (267 bases standard deviation)

53 Experimental setup Data set 3,157 sequences from GENCODE subset of ENCODE dataset comprises 1% of human genome representative of genome in various characteristics range in length from 75 to 8186 bases mean length of 173 bases (267 bases standard deviation) using codon frequencies of Escherichia coli

54 Experimental setup Data set 3,157 sequences from GENCODE subset of ENCODE dataset comprises 1% of human genome representative of genome in various characteristics range in length from 75 to 8186 bases mean length of 173 bases (267 bases standard deviation) using codon frequencies of Escherichia coli F = 10 D = 33 k = 3 (motifs of length 9 or less)

55 Experimental setup Data set 3,157 sequences from GENCODE subset of ENCODE dataset comprises 1% of human genome representative of genome in various characteristics range in length from 75 to 8186 bases mean length of 173 bases (267 bases standard deviation) using codon frequencies of Escherichia coli F = 10 D = 33 k = 3 (motifs of length 9 or less) Implementation & Hardware Implemented in C++ Pentium IV 2.4 GhZ 1 GB RAM

56 motif sets CAI value # forbidden # desired none (wild-types) 0.65 (0.06) 9.24 (16.24) 0.49 (1.06)

57 motif sets CAI value # forbidden # desired none (wild-types) 0.65 (0.06) 9.24 (16.24) 0.49 (1.06) forbidden 0.92 (0.04) 0.14 (0.45) 0 (0.00)

58 motif sets CAI value # forbidden # desired none (wild-types) 0.65 (0.06) 9.24 (16.24) 0.49 (1.06) forbidden 0.92 (0.04) 0.14 (0.45) 0 (0.00) forbidden & desired 0.83 (0.05) 0.14 (0.45) (14.84)

59

60

61 runtime [CPU seconds] all motifs only forbidden motifs sequence length

62 Conclusions & Future Work Conclusions CAI of gene can be optimized effectively motifs can be removed/added effectively O(n) time/space algorithm for constant length motifs algorithm is fast in practice

63 Conclusions & Future Work Conclusions CAI of gene can be optimized effectively motifs can be removed/added effectively O(n) time/space algorithm for constant length motifs algorithm is fast in practice Open Problem Design codon sequence free of secondary structure