Computational Biology Hendrik Jan Hoogeboom

LAPP-Top Computer Science: Working Smarter! Computational Biology, Hendrik Jan Hoogeboom, jan/feb 2013, http://www.liacs.nl/~hoogeboo/praatjes/lapp-top/

fundamental; programming track; fi1 discrete mathematics; fi2 formal languages, theory of concurrency; fi3 computability; service; Programming Methods; Algorithms; Data Structures; Software Engineering; Databases; challenges; project; android

COMPMOLBIOL

Computational Biology A.P. Gultyaev (Sacha) Application oriented view application H.J. Hoogeboom (Hendrik Jan) Theory oriented view www.liacs.nl/home/hoogeboo/mcb/

sequence alignment TCAGACGATTG TCGGAGCTG are sequences similar? what does that mean how do we compute it

feb 01 - human genome A scientific milestone of enormous proportions, the sequencing of the human genome will impact all of us in diverse ways from our views of ourselves as human beings to new paradigms in medicine.

deoxyribonucleic acid (DNA): chromosomes, biochemistry, double helix
[figure: structural formula of a base pair]

DNA: sugar (S), phosphate (P), nucleotides, base pairs (Watson & Crick)
[figure: two antiparallel strands, one running 5' to 3' and the other 3' to 5', joined by base pairs]
A=T adenine - thymine
C≡G guanine - cytosine

statistics human genome
23 pairs of chromosomes
3.1 × 10^9 nucleotide bases
30,000-40,000 genes
average 3000 bases / gene; dystrophin 2.4 million
protein variants: 1 million (splicing)
99.9% exactly the same in humans

statistics human genome: 30,000-40,000 genes?

wikipedia http://en.wikipedia.org/wiki/human_genome

feb 2011: "The haploid human genome contains ca. 23,000 protein-coding genes, far fewer than had been expected before its sequencing. In fact, only about 1.5% of the genome codes for proteins, while the rest consists of non-coding RNA genes, regulatory sequences, introns, and noncoding DNA (once known as "junk DNA")."

jan 2013: "... contains approximately 20,000 protein-coding genes, significantly fewer than had been anticipated. Protein-coding sequences account for only a very small fraction of the genome (approximately 1.5%), and the rest is associated with non-coding RNA molecules, regulatory DNA sequences, introns, and sequences to which no function has yet been assigned."

themes: problem; model (e.g. graph); known algorithms; characterization; imprecise data; complexity; heuristics; what is the right answer?

two alphabets
DNA bases, 4 symbols: a c t g (RNA: a c u g)
proteins, amino acids, 20 symbols: A R D N C E Q G H I L K M F P S T W Y V

central dogma (gene expression)
DNA -> RNA: transcription & splicing
RNA -> protein: translation

translation: tRNA, protein chain, ribosome, amino acid, mRNA, codon
[figure: ribosome reading an mRNA; bases shown: C G U G A A A C C A C G A U G U G G U A U G C A C U U U G G U G C]

20 amino acids U G G

genetic code (rows: first base; columns: second base; rightmost column: third base)

        U     C     A     G
  U    Phe   Ser   Tyr   Cys    U
       Phe   Ser   Tyr   Cys    C
       Leu   Ser   Stop  Stop   A
       Leu   Ser   Stop  Trp    G
  C    Leu   Pro   His   Arg    U
       Leu   Pro   His   Arg    C
       Leu   Pro   Gln   Arg    A
       Leu   Pro   Gln   Arg    G
  A    Ile   Thr   Asn   Ser    U
       Ile   Thr   Asn   Ser    C
       Ile   Thr   Lys   Arg    A
       Met   Thr   Lys   Arg    G
  G    Val   Ala   Asp   Gly    U
       Val   Ala   Asp   Gly    C
       Val   Ala   Glu   Gly    A
       Val   Ala   Glu   Gly    G

example: UGG -> Trp
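The genetic code is a 64-entry lookup table, so a codon can be translated with a single array index. The sketch below is one possible way to do this in C; the function names `base_index` and `translate_codon` are illustrative and not from the slides. It uses the one-letter amino acid codes of the "two alphabets" slide (for example Trp = W, Met = M), writes `*` for Stop, and assumes upper-case RNA bases in the order U, C, A, G.

```c
#include <assert.h>
#include <string.h>

/* Position of an RNA base in the table order U, C, A, G; -1 if invalid. */
static int base_index(char b)
{
    const char *order = "UCAG";
    const char *p = (b != '\0') ? strchr(order, b) : NULL;
    return p ? (int)(p - order) : -1;
}

/* Translate one codon (three bases, e.g. "UGG") into a one-letter
 * amino acid code; '*' marks a stop codon, '?' an invalid codon. */
char translate_codon(const char *codon)
{
    /* 16 entries per first base; within a group, 4 per second base. */
    static const char table[] =
        "FFLLSSSSYY**CC*W"    /* first base U */
        "LLLLPPPPHHQQRRRR"    /* first base C */
        "IIIMTTTTNNKKSSRR"    /* first base A */
        "VVVVAAAADDEEGGGG";   /* first base G */

    assert(codon != NULL && strlen(codon) >= 3);
    int a = base_index(codon[0]);
    int b = base_index(codon[1]);
    int c = base_index(codon[2]);
    if (a < 0 || b < 0 || c < 0)
        return '?';
    return table[16 * a + 4 * b + c];
}
```

Calling `translate_codon("UGG")` gives `'W'` (Trp), matching the example on the slide.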

questions

Lookup
- Is the gene known for my protein (or vice versa)?
- On which chromosome is the gene located?
- What sequence patterns are present in my protein?
- Are the mutations known which cause this disease?
- To what class or family does my protein belong?
- What is known?

Compare
- Are there sequences in the database resembling my protein?
- How can I optimally align the members of this protein family?
- Are these two sequences similar?

Predict
- Can I predict the active site residues of this enzyme?
- Why are these patients ill?
- Can I make a 3D model for my protein?
- Can I predict a (better) drug for this target?
- How can I improve the thermostability? (protein engineering)
- How can I predict the genes located on this genome?

BLAST
The following DNA sequence fragment, containing some mutation, was isolated from a patient:
tttgctccccgcgcgctgtttttctcagtgactttcagcgggcggaaaag
(a) In which gene is the mutation located? On which chromosome? How many nucleotides are changed?
(b) Could you indicate a possible disease caused by this mutation?
http://www.ncbi.nlm.nih.gov/ -> nucleotide blast

2D & 3D Structures of Yeast Phenylalanyl-Transfer RNA 2D Structure 3D Structure

ALIGNMENT

sequence alignment are sequences similar? what does that mean how do we compute it

sequence alignment
TCAGACGATTG
TCGGAGCTG
insertion, deletion, substitution; match, mismatch, gap

TCAG-ACG-ATTG
TC-GGA-GC-T-G

TCAGACGATTG
TCGGAGC--TG

TCAG-ACGATTG
TC-GGA-GCT-G

sequence alignment
TCAGACGATTG
TCGGAGCTG
scoring: match +2, mismatch -1, gap -1

TCAG-ACG-ATTG
TC-GGA-GC-T-G     14-6 = 8

TCAGACGATTG
TCGGAGC--TG       12-3-2 = 7

TCAG-ACGATTG
TC-GGA-GCT-G      14-1-4 = 9

sequence alignment
TCAGACGATTG
TCGGAGCTG

TCAG-ACG-ATTG
TC-GGA-GC-T-G

TCAGACGATTG
TCGGAGC--TG

TCAG-ACGATTG
TC-GGA-GCT-G

TCAGACGATTG
TCGGA-GCT-G       14-2-2 = 10
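Scoring a given alignment is a column-by-column sum. A minimal sketch in C, assuming the two rows are passed as equal-length strings with `-` for a gap; the function name `alignment_score` is made up for this illustration.

```c
#include <assert.h>
#include <string.h>

/* Score two gapped rows of equal length, column by column:
 * 'gap' if either row has '-', 'match' if the letters are equal,
 * 'mismatch' otherwise. */
int alignment_score(const char *top, const char *bottom,
                    int match, int mismatch, int gap)
{
    size_t n = strlen(top);
    assert(strlen(bottom) == n);   /* rows must have the same length */

    int score = 0;
    for (size_t i = 0; i < n; i++) {
        if (top[i] == '-' || bottom[i] == '-')
            score += gap;
        else if (top[i] == bottom[i])
            score += match;
        else
            score += mismatch;
    }
    return score;
}
```

With match +2, mismatch -1 and gap -1, `alignment_score("TCAGACGATTG", "TCGGA-GCT-G", 2, -1, -1)` returns 10, the value shown for the last alignment.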

basic algorithm alignment recursive principle dynamic programming

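alignment: recursion
reward and punishment: score(-,x) = -1, score(x,-) = -1, score(x,y) = -1 (x ≠ y), score(x,x) = 2

aligning acgctg with catgt: the last column is one of three possibilities
  last column g over -   (-1), remaining problem: acgct  and catgt
  last column g over t   (-1), remaining problem: acgct  and catg
  last column - over t   (-1), remaining problem: acgctg and catg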

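alignment: recursion
(acgctg, catgt) depends on (acgct, catgt), (acgct, catg), (acgctg, catg)
(acgct, catg) depends on (acgc, catg), (acgc, cat), (acgct, cat)
[figure: recursion tree; the same subproblems occur in several branches]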

alignment: recursion
all subproblems: every prefix of acgctg (acgctg, acgct, acgc, acg, ac, a, .) paired with every prefix of catgt (catgt, catg, cat, ca, c, .), where "." is the empty prefix: 7 × 6 = 42 pairs

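dynamic programming: acgctg vs. catgt (match +2, mismatch -1, gap -1)

        -   c   a   t   g   t
    -   0  -1  -2  -3  -4  -5
    a  -1  -1   1   0  -1  -2
    c  -2   1   0   0  -1  -2
    g  -3   0   0  -1   2   1
    c  -4  -1  -1  -1   1   1
    t  -5  -2  -2   1   0   3
    g  -6  -3  -3   0   3   2

the entry in row t, column t (third column) is the score for acgct vs. cat: 1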

dynamic programming

        -   c   a   t   g   t
    -
    a
    c
    g
    c
    t               *
    g

* = score of acgct vs. cat; the bottom-right cell is acgctg vs. catgt

initialization

        -   c   a   t   g   t
    -   0  -1  -2  -3  -4  -5
    a  -1
    c  -2
    g  -3
    c  -4
    t  -5
    g  -6

example: --- against cat (gaps only) scores -3

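dynamic programming

a[i,j] = max { a[i,j-1] + g,
               a[i-1,j-1] + score(s[i],t[j]),
               a[i-1,j] + g }

example: the cell for acgct vs. cat
  from (acgc, ca),  last column t over t:  -1 + 2 =  1
  from (acgc, cat), last column t over -:  -1 - 1 = -2
  from (acgct, ca), last column - over t:  -2 - 1 = -3
  maximum: 1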

alignment: acgctg vs. catgt, completed matrix; the optimal score 2 is the entry for the two full strings

traceback: acgctg vs. catgt

        -   c   a   t   g   t
    -   0  -1  -2  -3  -4  -5
    a  -1  -1   1   0  -1  -2
    c  -2   1   0   0  -1  -2
    g  -3   0   0  -1   2   1
    c  -4  -1  -1  -1   1   1
    t  -5  -2  -2   1   0   3
    g  -6  -3  -3   0   3   2

optimal alignments (score 2), found by walking back from the last cell:

-acgctg     acgctg-     acgctg-
catg-t-     -ca-tgt     -c-atgt

/* Recursion: the heart of the DP algorithm. */          Eddy S.R. (2004a)

/* Sequences are x[1..M] and y[1..N]; S is the (M+1) x (N+1) score matrix. */

/* Initialization. */
S[0][0] = 0;
for (i = 1; i <= M; i++) S[i][0] = i * INDEL;
for (j = 1; j <= N; j++) S[0][j] = j * INDEL;

/* Dynamic programming, global alignment (Needleman/Wunsch) recursion. */
for (i = 1; i <= M; i++)
  for (j = 1; j <= N; j++)
    {
      /* case #1: i,j are aligned */
      if (x[i] == y[j]) S[i][j] = S[i-1][j-1] + MATCH;
      else              S[i][j] = S[i-1][j-1] + MISMATCH;

      sc = S[i-1][j] + INDEL;    /* case #2: i aligned to - */
      if (sc > S[i][j]) S[i][j] = sc;

      sc = S[i][j-1] + INDEL;    /* case #3: j aligned to - */
      if (sc > S[i][j]) S[i][j] = sc;
    }
/* The result (optimal alignment score) is now in S[M][N]. */

program Eddy

[hoogeboo@tin 22:01 ~/colleges/mcb/programs] >./align
Sequence X: TTCATA
Sequence Y: TGCTCGTA
Scoring system: 5 for match; -2 for mismatch; -6 for gap

Dynamic programming matrix:
             T    G    C    T    C    G    T    A
        0   -6  -12  -18  -24  -30  -36  -42  -48
   T   -6    5   -1   -7  -13  -19  -25  -31  -37
   T  -12   -1    3   -3   -2   -8  -14  -20  -26
   C  -18   -7   -3    8    2    3   -3   -9  -15
   A  -24  -13   -9    2    6    0    1   -5   -4
   T  -30  -19  -15   -4    7    4   -2    6    0
   A  -36  -25  -21  -10    1    5    2    0   11

Alignment:
X: T--TCATA
Y: TGCTCGTA
Optimum alignment score: 11
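The score in this run can be reproduced with a compact, self-contained version of the global alignment recursion. This is a sketch only, not Eddy's program: the function name `nw_score`, the helper `max3` and the bound `MAXLEN` are assumptions made for the example, and the strings are ordinary 0-indexed C strings.

```c
#include <assert.h>
#include <string.h>

#define MAXLEN 100   /* assumed upper bound on sequence length */

static int max3(int a, int b, int c)
{
    int m = a > b ? a : b;
    return m > c ? m : c;
}

/* Optimal global alignment score of x and y (score only, no traceback). */
int nw_score(const char *x, const char *y, int match, int mismatch, int indel)
{
    static int S[MAXLEN + 1][MAXLEN + 1];
    int M = (int)strlen(x), N = (int)strlen(y);
    assert(M <= MAXLEN && N <= MAXLEN);

    /* Initialization: a prefix aligned against nothing costs only gaps. */
    S[0][0] = 0;
    for (int i = 1; i <= M; i++) S[i][0] = i * indel;
    for (int j = 1; j <= N; j++) S[0][j] = j * indel;

    /* Each cell takes the best of: align x[i-1] with y[j-1],
     * align x[i-1] with a gap, align y[j-1] with a gap. */
    for (int i = 1; i <= M; i++)
        for (int j = 1; j <= N; j++) {
            int sub = (x[i - 1] == y[j - 1]) ? match : mismatch;
            S[i][j] = max3(S[i - 1][j - 1] + sub,
                           S[i - 1][j] + indel,
                           S[i][j - 1] + indel);
        }
    return S[M][N];
}
```

`nw_score("TTCATA", "TGCTCGTA", 5, -2, -6)` returns 11, as in the program output, and `nw_score("acgctg", "catgt", 2, -1, -1)` returns 2, the score of the worked example.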

scoring and parameter choice

which alignment should score better?

  A      A          A      A-          AT      -AT
  A  vs. C          C  vs. -C          TA  vs. TA-

parameters      match M   mismatch m   gap g
  choice 1        +2         -1         -1
  choice 2        +2         -2         -1

substitution matrix (symmetric)
         A     C     G     T
   A    91  -114   -31  -123
   C         100  -125   -31
   G               100  -114
   T                      91
gap: 400 + 30k

ask your favourite molecular biologist!

PAM250 Matrix mutation prob (evolution) biochemical properties

alignment

CAGCACTTGGATTCTCGG
CAGC-----G-T----GG

CAGCA-CTTGGATTCTCGG
---CAGCGTGG--------

variants of alignment
- global: Needleman & Wunsch
- local: Smith & Waterman
- semi-global: end gaps

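local alignment

a[i,j] = max { a[i,j-1] + g,
               a[i-1,j-1] + score(s[i],t[j]),
               a[i-1,j] + g,
               0 }

initialization: first row and first column are 0
solution: the maximum entry in the matrix
traceback: from the maximum back to an entry 0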
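The only changes with respect to the global recursion are the extra option 0, the zero initialization, and taking the maximum over the whole matrix. A minimal sketch in C under those assumptions; `sw_score` and `MAXLEN` are names chosen for this example, and only the score is computed, not the traceback.

```c
#include <assert.h>
#include <string.h>

#define MAXLEN 100   /* assumed upper bound on sequence length */

/* Best local alignment score of s and t (Smith & Waterman recursion). */
int sw_score(const char *s, const char *t, int match, int mismatch, int gap)
{
    static int a[MAXLEN + 1][MAXLEN + 1];
    int m = (int)strlen(s), n = (int)strlen(t);
    assert(m <= MAXLEN && n <= MAXLEN);

    /* First row and column are 0: a local alignment may start anywhere. */
    for (int i = 0; i <= m; i++) a[i][0] = 0;
    for (int j = 0; j <= n; j++) a[0][j] = 0;

    int best = 0;
    for (int i = 1; i <= m; i++)
        for (int j = 1; j <= n; j++) {
            int v = 0;                                   /* option 0: start fresh */
            int d = a[i - 1][j - 1] + (s[i - 1] == t[j - 1] ? match : mismatch);
            if (d > v) v = d;
            if (a[i - 1][j] + gap > v) v = a[i - 1][j] + gap;
            if (a[i][j - 1] + gap > v) v = a[i][j - 1] + gap;
            a[i][j] = v;
            if (v > best) best = v;                      /* solution: overall maximum */
        }
    return best;
}
```

For acgctg and catgt with +2/-1/-1 this gives 5, from the local alignment c-tg over catg, whereas the global score of the same pair is 2.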

problem solved!?
- much too slow: long strings, huge databases
- heuristics: FASTA (along diagonal), BLAST (minimal close match)
- multiple alignment (several strings): NP-complete, exponential

Basic Local Alignment Search Tool (BLAST)

query: GSVEDTTGSQSLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNL
query word (W=3): PQG
neighbourhood words with scores (cut off at the threshold): PQG 18, PEG 15, PRG 14, PMG 13, PQA 12, PQN 12

high-scoring segment pair (HSP)
query:   SLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNL
subject: TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQT

parameters: W word size, T threshold, S score, E expected

something to think about at home

genetic variation thanks google

BAVARIANBEER

computational biology: a little more patience

1 + 2 * 3 - 4 (sum)

BAPC 2006 A. Geodes B. Bowling C. High spies D. Digital Friends E. The Bavarian Beer Party F. Evacuation G. Decompression H. Rummikub I. Make it Manhattan website BAPC

programming contest for teams

E. The Bavarian Beer Party

Description
The professors of the Bayerische Mathematiker Verein have their annual party in the local Biergarten. They are sitting at a round table, each with his own pint of beer. As a ceremony each professor raises his pint and toasts one of the other guests in such a way that no arms cross.

Figure 2: Toasting across a table with eight persons: no arms crossing (left), arms crossing (right)

We know that the professors like to toast with someone that is drinking the same brand of beer, and we like to maximize the number of pairs of professors toasting with the same brand, again without crossing arms. Write an algorithm to do this, keeping in mind that every professor should take part in the toasting.

E. The Bavarian Beer Party

Input
The first line of the input contains a single number: the number of test cases to follow. Each test case has the following format:
- One line with an even number p, satisfying 2 ≤ p ≤ 1000: the number of participants.
- One line with p integers (separated by single spaces) indicating the beer brands for the consecutive professors (in clockwise order, starting at an arbitrary position). Each value is between 1 and 100 (boundaries included).

Output
For every test case in the input, the output should contain a single number on a single line: the maximum number of non-intersecting toasts of the same beer brand for this test case.

Sample Input
2
6
1 2 2 1 3 3
22
1 7 1 2 4 2 4 9 1 1 9 4 5 9 4 5 6 9 2 1 2 9

Sample Output
3
6

Sample Input
2
6
1 2 2 1 3 3
22
1 7 1 2 4 2 4 9 1 1 9 4 5 9 4 5 6 9 2 1 2 9

Sample Output
3
6

for example: 1 2 2 1 3 3 gives 3 [figure: six professors around the table, three toasts]
1 7 1 2 4 2 4 9 1 1 9 4 5 9 4 5 6 9 2 1 2 9: uhh?

cutting the circle open: 1 2 2 1 3 3
[figure: the round table cut open into a row of professors, with the toasts drawn as arcs]

all possibilities?
[figure: the different ways to pair up the professors without crossing arms]

all possibilities? handshaking problem :-) Catalan numbers
exercise: draw the corresponding smiling faces

all possibilities? handshaking problem :-) Catalan numbers
http://www.maths.usyd.edu.au/u/kooc/catalan.html
click on "smiling faces" and "handshaking problem"

how many possibilities?
Sloane, On-Line Encyclopedia of Integer Sequences, http://oeis.org/search?q=1+2+5+14+42

A000108 Catalan numbers: C(n) = binomial(2n,n)/(n+1) = (2n)!/(n!(n+1)!). Also called Segner numbers.
- The solution to Schroeder's first problem. A very large number of combinatorial interpretations are known - see references, esp. Stanley, Enumerative Combinatorics.
- Number of ways to insert n pairs of parentheses in a word of n+1 letters. E.g. for n=3 there are 5 ways: ((ab)(cd)), (((ab)c)d), ((a(bc))d), (a((bc)d)), (a(b(cd))).
- Shifts one place left when convolved with itself.
- Ways of joining 2n points on a circle to form n nonintersecting chords.
- Arises in Schubert calculus - see Sottile reference.
- Inverse Euler transform of sequence is A022553.
- With interpolated zeros, the inverse binomial transform of the Motzkin numbers A001006.
[ etc ]

Eugène Charles Catalan (Bruges, 30 May 1814 - Liège, 14 February 1894) was a Belgian mathematician. wiki:catalan

how many?
Sloane: Catalan numbers, http://www.research.att.com/~njas/sequences/a000108

exact:
   n   C(n)
   1   1
   2   2
   3   5
   4   14
   5   42
   6   132
   7   429
   8   1430
   9   4862
  10   16796
  11   58786
  12   208012
  13   742900
  14   2674440
  15   9694845
  16   35357670

approximately: a(n) ~ 4^n / ( sqrt(pi*n) * (n+1) )

how many?

length   0   1   2   3   4   5    6
count    1   1   2   5  14  42  132

( ) x x x x x x x x      (0)4    1 x 14
( x x ) x x x x x x      (1)3    1 x  5
( x x x x ) x x x x      (2)2    2 x  2
( x x x x x x ) x x      (3)1    5 x  1
( x x x x x x x x )      (4)0   14 x  1
                                 ------ +
                                     42

Shifts one place left when convolved with itself
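The sum above is the convolution recurrence C(m) = C(0)·C(m-1) + C(1)·C(m-2) + ... + C(m-1)·C(0) with C(0) = 1: the first person pairs with someone, leaving k pairs inside the arc and m-1-k pairs outside. A small sketch in C; the name `catalan` and the bound `MAXN` are choices made for this example (values up to C(30) fit comfortably in 64 bits).

```c
#include <assert.h>

#define MAXN 30   /* assumed bound; C(30) still fits in 64 bits */

/* n-th Catalan number via the convolution recurrence. */
unsigned long long catalan(int n)
{
    unsigned long long c[MAXN + 1];
    assert(n >= 0 && n <= MAXN);

    c[0] = 1;
    for (int m = 1; m <= n; m++) {
        c[m] = 0;
        /* k pairs inside the first arc, m-1-k pairs outside it */
        for (int k = 0; k < m; k++)
            c[m] += c[k] * c[m - 1 - k];
    }
    return c[n];
}
```

`catalan(5)` evaluates 1·14 + 1·5 + 2·2 + 5·1 + 14·1 = 42, as on the slide.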

from "how many" to an algorithm: dynamic programming

bav(1,10) = maximum of
( ) x x x x x x x x      match(1,2)  + bav(3,10)
( x x ) x x x x x x      match(1,4)  + bav(2,3) + bav(5,10)
( x x x x ) x x x x      match(1,6)  + bav(2,5) + bav(7,10)
( x x x x x x ) x x      match(1,8)  + bav(2,7) + bav(9,10)
( x x x x x x x x )      match(1,10) + bav(2,9)

from "how many" to an algorithm: dynamic programming

bav(1,10) = maximum of
( ) x x x x x x x x      bav(1,2) + bav(3,10)
( x x ) x x x x x x      bav(1,4) + bav(5,10)
( x x x x ) x x x x      bav(1,6) + bav(7,10)
( x x x x x x ) x x      bav(1,8) + bav(9,10)
( x x x x x x x x )      bav(2,9) + match(1,10)

bav(1,10) = maximum of
[figure: table of intervals, "from" 1 ... 10 against "up to" 1 ... 10]

bav(1,10) = maximum of

Bavarian Beer Party: dynamic programming

bav[k,l] = optimal score for the interval [k,l] (segments of even length only)

bav[k,l] = max of
  - bav[k+1,l-1] + match(k,l)         (k toasts with l)
  - bav[k,j] + bav[j+1,l], over all j  (split into k ... j and j+1 ... l)

[figure: example with brands 1 2 2 1 3 3]

Bavarian Beer Party

/* bav[][] and brand[] are global arrays, indexed from 1 (declarations not shown) */
int proost ( int totaal ) {
  int i, eerst, laatst, aantal, best, som;
  // initialization: pairs
  for (i=1; i<totaal; i++) {
    bav[i][i+1] = ( brand[i] == brand[i+1] );
  }
  if (totaal == 2) {
    return bav[1][2];
  }
  for (aantal=4; aantal<=totaal; aantal+=2) {
    // dynamic loop: 'aantal' toasting professors
    // aantal == laatst - eerst + 1
    // eerst runs as long as laatst does not exceed totaal
    for (eerst=1; eerst<=totaal-aantal+1; eerst++) {
      laatst = eerst + aantal - 1;
      // first professor toasts with the last one
      best = (brand[eerst] == brand[laatst]);
      best += bav[eerst+1][laatst-1];
      for (i=eerst+1; i<=laatst-2; i=i+2) {
        // only even matches!
        som = bav[eerst][i] + bav[i+1][laatst];
        if (som > best) best = som;
      }
      bav[eerst][laatst] = best;
    }
  }// loop
  return bav[1][totaal];
}// proost
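For experimenting outside the contest setting, the same recurrence can be wrapped in a self-contained function. The sketch below is one way to do that, not the slide's program: it takes the brands as a 0-indexed array, and the names `beer_party` and `MAXP` are assumptions of this example (the bound 1000 comes from the problem statement).

```c
#include <assert.h>

#define MAXP 1000   /* the problem statement bounds p by 1000 */

static short bav[MAXP][MAXP];   /* bav[k][l]: best score for interval k..l */

/* Maximum number of same-brand toasts among p professors (p even),
 * with everyone toasting and no arms crossing. brand[] is 0-indexed. */
int beer_party(const int *brand, int p)
{
    assert(p >= 2 && p <= MAXP && p % 2 == 0);

    /* intervals of length 2: neighbours toast with each other */
    for (int k = 0; k + 1 < p; k++)
        bav[k][k + 1] = (brand[k] == brand[k + 1]);

    /* longer intervals, even lengths only */
    for (int len = 4; len <= p; len += 2)
        for (int k = 0; k + len - 1 < p; k++) {
            int l = k + len - 1;
            /* option 1: k toasts with l */
            int best = bav[k + 1][l - 1] + (brand[k] == brand[l]);
            /* option 2: split into two even-length parts k..j and j+1..l */
            for (int j = k + 1; j < l; j += 2) {
                int sum = bav[k][j] + bav[j + 1][l];
                if (sum > best) best = sum;
            }
            bav[k][l] = (short)best;
        }
    return bav[0][p - 1];
}
```

For the first sample, `int b[] = {1, 2, 2, 1, 3, 3};` followed by `beer_party(b, 6)` returns 3.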

as if that matters

RNA secondary structure
A Adenine, C Cytosine, G Guanine, U Uracil
base pairs: A=U, G≡C

PRAKTIKUM

lab session http://www.liacs.nl/home/hoogeboo/praatjes/lapp-top/

laboratory: Excel; Pascal's triangle; cellular automaton; alignment

Excel: data, addresses, formulas

Excel
- data: numbers, strings, date, currency
- addresses: A1 (relative), $A$1 (absolute)
- formulas: =A2+B1, IF(A2=B1,1,2), SUM(B9:B12)

Excel addresses: A1 relative, $A$1 absolute
[grid: rows 1-3, columns A-F]
=A2+B1 copied three columns to the right and one row down becomes =D3+E2

Excel addresses: A1 relative, $A$1 absolute
[grid: rows 1-3, columns A-F]
=$A2+B$1 copied three columns to the right and one row down becomes =$A3+E$1

Pascal's triangle
[grid: rows 1-3, columns A-C; cell B2 holds =A2+B1]

      1    1    1    1    1    1    1    1    1
 1    2    3    4    5    6    7    8    9
 1    3    6   10   15   21   28   36
 1    4   10   20   35   56   84
 1    5   15   35   70  126
 1    6   21   56  126
 1    7   28   84
 1    8   36
 1    9
 1

which numbers are coloured?

which numbers are coloured? even / odd

    1  1  1  1  1  1  1  1  1
 1  0  1  0  1  0  1  0  1
 1  1  0  0  1  1  0  0
 1  0  0  0  1  0  0
 1  1  1  1  0  0
 1  0  1  0  0
 1  1  0  0
 1  0  0
 1  1
 1

exclusive or: x + y - 2xy

 x  y  x+y-2xy
 0  0  0
 0  1  1
 1  0  1
 1  1  0
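The even/odd pattern can be generated directly, without computing the large numbers, by applying the exclusive-or rule to the parities. A sketch in C, assuming a full square grid whose first row and first column are 1 and where every other cell combines the cell above and the cell to the left; the names `pascal_parity` and `SIZE` are made up for this example.

```c
#include <assert.h>

#define SIZE 16   /* assumed grid size for this illustration */

/* Fill parity[i][j] with (entry of Pascal's triangle) mod 2, where the
 * triangle is laid out as on the spreadsheet: borders are 1 and every
 * other cell is the sum of the cell above and the cell to the left. */
void pascal_parity(int parity[SIZE][SIZE])
{
    for (int i = 0; i < SIZE; i++)
        for (int j = 0; j < SIZE; j++) {
            if (i == 0 || j == 0) {
                parity[i][j] = 1;
            } else {
                int x = parity[i - 1][j];
                int y = parity[i][j - 1];
                parity[i][j] = x + y - 2 * x * y;    /* arithmetic form of xor */
                assert(parity[i][j] == (x ^ y));     /* same as bitwise xor on 0/1 */
            }
        }
}
```

Printing a `#` for every 1 in the grid makes the repeating triangular pattern visible.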

Pascal's Δ: why not simply upright? (that works with gaps between the values)

 0  0  0  0  0  0  1  0  0  0  0  0  0
 0  0  0  0  0  1  0  1  0  0  0  0  0
 0  0  0  0  1  0  2  0  1  0  0  0  0
 0  0  0  1  0  3  0  3  0  1  0  0  0
 0  0  1  0  4  0  6  0  4  0  1  0  0
 0  1  0  5  0 10  0 10  0  5  0  1  0
 1  0  6  0 15  0 20  0 15  0  6  0  1

each cell is the sum of its upper-left and upper-right neighbours: below A1 and C1 goes =A1+C1

alignment

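Excel - alignment
[grid: columns A-I, rows 1-5; the letters c g a t t a along row 1, the letters a c t t down the first column]
- MAX: of the three neighbours ± gap / (mis)match
- IF (Dutch Excel: ALS): do the letters match or mismatch?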

excel - alignment

cellular automaton

cellular automaton
http://mathworld.wolfram.com/cellularautomaton.html
http://nl.wikipedia.org/wiki/elementaire_cellulaire_automaat

rule 30 (time runs downward) http://mathworld.wolfram.com/rule30.html

Conus textile http://en.wikipedia.org/wiki/conus_textile

rule 90 http://mathworld.wolfram.com/rule90.html

rule 90

 xyz   90
 000   0
 001   1
 010   0
 011   1
 100   1
 101   0
 110   1
 111   0

x, y, z are the cells A1, B1, C1
90 as a function of x, y, z: formula??

rule 30

 xyz   30
 000   0
 001   1
 010   1
 011   1
 100   1
 101   0
 110   0
 111   0

(1-x) (y+z-y z) + x (1-y) (1-z)
in logical terminology: (¬x ∧ (y ∨ z)) ∨ (x ∧ ¬(y ∨ z)) = x xor (y ∨ z)

lookup instead of a formula

       P     Q   R
 1    000    0   0
 2    001    1   1
 3    010    2   0
 4    011    3   1
 5    100    4   1
 6    101    5   0
 7    110    6   1
 8    111    7   0
           input output

=Lookup( b, Q1:Q8, R1:R8)
b = binary value of the neighbours A1 B1 C1: =4*A1+2*B1+1*C1
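The lookup idea carries over directly to a program: the rule number itself, written in binary, is the output column of the table, so bit b of the rule is the new state for neighbourhood value b. A sketch in C; `ca_next` and `ca_step` are hypothetical names, and treating cells beyond the ends of the row as 0 is an assumption of this example.

```c
#include <assert.h>

/* New state of a cell under an elementary rule (0..255), given the
 * left neighbour x, the cell itself y and the right neighbour z. */
int ca_next(int rule, int x, int y, int z)
{
    assert(rule >= 0 && rule <= 255);
    assert((x == 0 || x == 1) && (y == 0 || y == 1) && (z == 0 || z == 1));
    int b = 4 * x + 2 * y + 1 * z;   /* binary value of the neighbourhood */
    return (rule >> b) & 1;          /* bit b of the rule number */
}

/* One time step for a row of n cells; cells outside the row count as 0. */
void ca_step(int rule, const int *row, int *next, int n)
{
    for (int i = 0; i < n; i++) {
        int x = (i > 0) ? row[i - 1] : 0;
        int z = (i < n - 1) ? row[i + 1] : 0;
        next[i] = ca_next(rule, x, row[i], z);
    }
}
```

Starting from a row with a single 1 in the middle and repeating `ca_step` prints the pattern of the chosen rule line by line; `next` must be a different array from `row`.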

2-dim cellular automaton: game of life, puffer train http://www.collidoscope.com/modernca/

2-dim cellular automaton http://www.ux.uis.no/~ruoff/bz_phenomenology.html