LAPP-Top Computer Science: Working Smarter! Computational Biology Hendrik Jan Hoogeboom jan/feb 2013 http://www.liacs.nl/~hoogeboo/praatjes/lapp-top/»
fundamenteel programmeerlijn fi1 discrete wiskunde fi2 formele talen theory van concurrency fi3 berekenbaarheid service Programmeermethoden Algoritmiek Datastructuren Software engineering Databases challenges project android
COMPMOLBIOL
Computational Biology A.P. Gultyaev (Sacha) Application oriented view application H.J. Hoogeboom (Hendrik Jan) Theory oriented view www.liacs.nl/home/hoogeboo/mcb/
sequence alignment TCAGACGATTG TCGGAGCTG are sequences similar? what does that mean how do we compute it
feb 01 - human genome A scientific milestone of enormous proportions, the sequencing of the human genome will impact all of us in diverse ways from our views of ourselves as human beings to new paradigms in medicine.
deoxy-ribonucleic acid DNA chromosomes H H O H N N C N H H O H O H C H 2 N N G N N H N H O O H 2 C O H O H bio-chemistry double helix O H H
5 P S A T S 3 DNA P P S S G C C G S S P P sugar phosphor nucleotides basepairs Watson & Crick 3 P S A T S P P 5 A=T adenine - thymine CG guanine - cytosine
statistics human genome 23 pairs of chromosomes 3.1 10 9 nucleotide bases 30,000-40,000 genes avarage 3000 bases / gene dystrophin 2.4 million protein variants 1 million (splicing) 99.9% exactly the same in humans
30,000-40,000 genes statistics human genome The haploid human genome contains ca. 23,000 proteincoding genes, far fewer than had been expected before its sequencing. In fact, only about 1.5% of the genome codes for proteins, while the rest consists of non-coding RNA genes, regulatory sequences, introns, and noncoding DNA (once known as "junk DNA"). wikipedia http://en.wikipedia.org/wiki/human_genome feb 2011 contains approximately 20,000 protein-coding genes, significantly fewer than had been anticipated. Proteincoding sequences account for only a very small fraction of the genome (approximately 1.5%), and the rest is associated with non-coding RNA molecules, regulatory DNA sequences, introns, and sequences to which no function has yet been assigned. jan 2013
themes problem model (eg. graph) known algorithms characterization unprecise data complexity heuristics what is the right answer?
two alphabets DNA bases 4 symbols a c t g RNA a c u g eiwitten proteins amino acids 20 symbols A R D N C E Q G H I L K M F P S T W Y V
central dogma DNA RNA transcription & splicing protein eiwit translation gene expression
translation trna protein chain eiwitketen ribosome aminoacid C G U G A A A C C A C G A U G U G G U A U G C A C U U U G G U G C mrna codon
20 amino acids U G G
U C A G U Phe Ser Tyr Cys U Phe Ser Tyr Cys C Leu Ser Stop Stop A Leu Ser Stop Trp G C Leu Pro His Arg U Leu Pro His Arg C Leu Pro Gln Arg A Leu Pro Gln Arg G A Ile Thr Asn Ser U Ile Thr Asn Ser C Ile Thr Lys Arg A Met Thr Lys Arg G G Val Ala Asp Gly U Val Ala Asp Gly C Val Ala Glu Gly A Val Ala Glu Gly G genetic code UGG Trp U G G
Lookup Is the gene known for my protein (or vice versa)? On which chromosome is the gene located? What sequence patterns are present in my protein? Are the mutations known which cause this disease? To what class or family does my protein belong? What is known? Compare Are there sequences in the database resembling my protein? How can I optimally align the members of this protein family? Are these two sequences similar? Predict Can I predict the active site residues of this enzyme? Why are these patients ill? Can I make a 3D model for my protein? Can I predict a (better) drug for this target? How can I improve the thermostability? (protein engineering) How can I predict the genes located on this genome? questions
BLAST The following DNA sequence fragment, containing some mutation, was isolated from a patient: tttgctccccgcgcgctgtttttctcagtgactttcagcgggcggaaaag (a) In what gene the mutation is located? On which chromosome? How many nucleotides are changed? (b) Could you indicate a possible disease determined by this mutation? http://www.ncbi.nlm.nih.gov/» nucleotide blast
2D & 3D Structures of Yeast Phenylalanyl-Transfer RNA 2D Structure 3D Structure
ALIGNMENT
sequence alignment are sequences similar? what does that mean how do we compute it
sequence alignment TCAGACGATTG TCGGAGCTG insertion deletion substitution match mismatch gap TCAG-ACG-ATTG TC-GGA-GC-T-G TCAGACGATTG TCGGAGC--TG TCAG-ACGATTG TC-GGA-GCT-G
sequence alignment TCAGACGATTG TCGGAGCTG TCAG-ACG-ATTG TC-GGA-GC-T-G TCAGACGATTG TCGGAGC--TG 14-6=8 12-3-2=7 match mismatch gap +2-1 -1 TCAG-ACGATTG TC-GGA-GCT-G 14-1-4=9
TCAGACGATTG TCGGAGCTG sequence alignment TCAG-ACG-ATTG TC-GGA-GC-T-G TCAGACGATTG TCGGAGC--TG TCAG-ACGATTG TC-GGA-GCT-G TCAGACGATTG TCGGA-GCT-G 14-2-2=10
basic algorithm alignment recursive principle dynamic programming
alignment: recursion (-,x) = -1 (x,-) = -1 (x,y) = -1 (x,x) = 2 acgctg catgt reward and punishment acgct catgt acgct catg acgctg catg g - g t - t -1-1 -1
acgctg catgt acgct catgt acgct catg alignment: recursion acgc catg acgct catg acgc cat acgctg catg acgct catg acgct cat
alignment: recursion acgctg catgt acgct catgt acgct catg acgctg catg acgct cat acgctg cat acgct ca acgctg ca acgct c acgctg c acgc catgt acgc catg acgc cat acgc ca acgc c acg catgt acg catg acg cat acg ca acg c ac catgt ac catg ac cat ac ca ac c a catgt a catg a cat a ca a c. catgt. catg. cat. ca. c acgct. acgctg. acgc. acg. ac. a...
dynamic programming 2 3 0-3 -3-6 g 3 0 1-2 -2-5 t 1 1-1 -1-1 -4 c 1 2-1 0 0-3 g -2-1 0 0 1-2 c -2-1 0 1-1 -1 a -5-4 -3-2 -1 0 - t g t a c - acgctg catgt acgct cat
dynamic programming - c a t g t acgct cat - a c g c t * g acgctg catgt
initialization - c a t g t --- cat - a c 0-1 -2-1 -2-3 -4-5 g -3 c -4 t -5 g -6
dynamic programming i a[i,j] - = max a c j g c t g - c a t g t a[i,j-1] + g a[i-1,j-1] + ( s[i],t[j] ) a[i-1,j] + g acgc ca acgct ca t t -1+2-1-1-2-1 - t acgc cat 1 t - acgct cat
alignment 2 3 0-3 -3-6 g 3 0 1-2 -2-5 t 1 1-1 -1-1 -4 c 1 2-1 0 0-3 g -2-1 0 0 1-2 c -2-1 0 1-1 -1 a -5-4 -3-2 -1 0 - t g t a c - acgctg catgt
traceback 2 3 0-3 -3-6 g 3 0 1-2 -2-5 t 1 1-1 -1-1 -4 c 1 2-1 0 0-3 g -2-1 0 0 1-2 c -2-1 0 1-1 -1 a -5-4 -3-2 -1 0 - t g t a c - acgctg catgt -acgctg catg-t- acgctg- -ca-tgt acgctg- -c-atgt - t g -
/* Recursion: the heart of the DP algorithm.*/ /* Initialization. */ S[0][0] = 0; for (i = 1; i <= M; i++) S[i][0] = i * INDEL; for (j = 1; j <= N; j++) S[0][j] = j * INDEL; Eddy S.R. (2004a) /* Dynamic programming, global alignment (Needleman/Wunsch) recursion. */ for (i = 1; i <= M; i++) for (j = 1; j <= N; j++) { /* case #1: i,j are aligned */ if (x[i] == y[j]) S[i][j] = S[i-1][j-1] + MATCH; else S[i][j] = S[i-1][j-1] + MISMATCH; sc = S[i-1][j] + INDEL; /* case #2: i aligned to - */ if (sc > S[i][j]) S[i][j] = sc; } sc = S[i][j-1] + INDEL; /* case #3: j aligned to - */ if (sc > S[i][j]) S[i][j] = sc; /* The result (optimal alignment score) is now in S[M][N]. */
program Eddy [hoogeboo@tin 22:01 ~/colleges/mcb/programs] >./align Sequence X: TTCATA Sequence Y: TGCTCGTA Scoring system: 5 for match; -2 for mismatch; -6 for gap Dynamic programming matrix: T G C T C G T A 0-6 -12-18 -24-30 -36-42 -48 T -6 5-1 -7-13 -19-25 -31-37 T -12-1 3-3 -2-8 -14-20 -26 C -18-7 -3 8 2 3-3 -9-15 A -24-13 -9 2 6 0 1-5 -4 T -30-19 -15-4 7 4-2 6 0 A -36-25 -21-10 1 5 2 0 11 Alignment: X: T--TCATA Y: TGCTCGTA Optimum alignment score: 11
scoring and parameter choice A A A C AT TA vs. vs. vs. A C A- -C -AT TA- match mismatch gap M m g +2-1 -1 +2-2 -1 A C G T A 91-114 -31-123 C 100-125 -31 G 100-114 T 91 gap 400 + 30k ask your favourite molecular biologist!
PAM250 Matrix mutation prob (evolution) biochemical properties
alignment CAGCACTTGGATTCTCGG CAGC-----G-T----GG CAGCA-CTTGGATTCTCGG ---CAGCGTGG--------
variants of alignment > global Needleman & Wunsch local Smith & Waterman semi-global end
local alignment a[i,j] = max a[i,j-1] + g a[i,j] + ( s[i],t[j] ) a[i-1,j] + g 0 0 max solution (traceback) from max to 0
probleem solved!? much too slow long strings huge databases heuristics FASTA along diagonal BLAST minimal close match multiple alignment (several strings) NP complete exponential
Basic Local Alignment Search Tools query query word (W=3) GSVEDTTGSQSLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNL neighbourhood words threshold PQG 18 PEG 15 PRG 14 PMG 13 PQA 12 PQN 12 SLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNL TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQT subject high-scoring segment pair (HSP) parameters: W word size T threshold S score E expected
om thuis over na te denken
genetic variation thanks google
BAVARIANBEER
computational biology nog even geduld
1 + 2 * 3-4 som
BAPC 2006 A. Geodes B. Bowling C. High spies D. Digital Friends E. The Bavarian Beer Party F. Evacuation G. Decompression H. Rummikub I. Make it Manhattan website BAPC
programmeerwedstrijd voor teams
E. The Bavarian Beer Party Description The professors of the Bayerische Mathematiker Verein have their annual party in the local Biergarten. They are sitting at a round table each with his own pint of beer. As a ceremony each professor raises his pint and toasts one of the other guests in such a way that no arms cross. Figure 2: Toasting across a table with eight persons: no arms crossing(left), arms crossing(right) We know that the professors like to toast with someone that is drinking the same brand of beer, and we like to maximize the number of pairs of professors toasting with the same brand, again without crossing arms. Write an algorithm to do this, keeping in mind that every professor should take part in the toasting.
E. The Bavarian Beer Party Input The first line of the input contains a single number: the number of test cases to follow. Each test case has the following format: One line with an even number p, satisfying 2 p 1000: the number of participants, One line with p integers (separated by single spaces) indicating the beer brands for the consecutive professors (in clockwise order, starting at an arbitrary position). Each value is between 1 and 100 (boundaries included). Output For every test case in the input, the output should contain a single number on a single line: the maximum number of non-intersecting toasts of the same beer brand for this test case. Sample Input 2 6 1 2 2 1 3 3 22 1 7 1 2 4 2 4 9 1 1 9 4 5 9 4 5 6 9 2 1 2 9 Sample Output 3 6
Sample Input 2 6 1 2 2 1 3 3 22 1 7 1 2 4 2 4 9 1 1 9 4 5 9 4 5 6 9 2 1 2 9 Sample Output 3 6 3 bijvoorbeeld 1 2 1 2 2 1 3 3 3 1 2 1 7 1 2 4 2 4 9 1 1 9 4 5 9 4 5 6 9 2 1 2 9 eeuh?
open knippen 1 2 2 1 3 3 3 1 2 3 1 2 3 1 2 2 1 3 3 1 2 3 1 2
alle mogelijkheden? 1 2 3 2 1 3 3 1 2 1 2 3
alle mogelijkheden? handshaking problem :- Catalan numbers opgave: teken bijbehorende smiling faces
alle mogelijkheden? handshaking problem :- Catalan numbers http://www.maths.usyd.edu.au/u/kooc/catalan.html» klik op smiling faces en handshaking problem
hoeveel mogelijkheden? Sloane On-Line Encyclopedia of Integer Sequences http://oeis.org/search?q=1+2+5+14+42 A000108 Catalan numbers: C(n) = binomial(2n,n)/(n+1) = (2n)!/(n!(n+1)!). Also called Segner numbers. The solution to Schroeder's first problem. A very large number of combinatorial interpretations are known - see references, esp. Stanley, Enumerative Combinatorics. Number of ways to insert n pairs of parentheses in a word of n+1 letters. E.g. for n=3 there are 5 ways: ((ab)(cd)), (((ab)c)d), ((a(bc))d), (a((bc)d)), (a(b(cd))). Shifts one place left when convolved with itself. Ways of joining 2n points on a circle to form n nonintersecting chords. Arises in Schubert calculus - see Sottile reference. Inverse Euler transform of sequence is A022553. With interpolated zeros, the inverse binomial transform of the Motzkin numbers A001006. [ etc ]
Eugène Charles Catalan Eugène Charles Catalan (Brugge, 30 mei 1814 Luik, 14 februari 1894) was een Belgisch wiskundige wiki:catalan
hoeveel? 1 1 Sloane: Catalan numbers 2 2 http://www.research.att.com/~njas/sequences/a000108» 3 5 4 14 5 42 6 132 7 429 (precies) 8 1430 9 4862 10 16796 11 58786 12 208012 13 742900 (ongeveer) 14 2674440 15 9694845 16 35357670 a(n) ~ 4^n / ( sqrt(pi*n) * (n+1) )
0 1 2 3 4 5 6 1 1 2 5 14 42 132 lengte hoeveel? aantal ( ) x x x x x x x x (0)4 1 x14 ( x x ) x x x x x x (1)3 1 x 5 ( x x x x ) x x x x (2)2 2 x 2 ( x x x x x x ) x x (3)1 5 x 1 ( x x x x x x x x ) (4)0 14 x 1 42 + Shifts one place left when convolved with itself
bav(1,10) = maximum van hoeveel algoritme ( ) x x x x x x x x bav(3,10) +match(1,2) ( x x ) x x x x x x bav(2,3) +bav(5,10) +match(1,4) ( x x x x ) x x x x bav(2,5) +bav(7,10) +match(1,6) ( x x x x x x ) x x bav(2,7) +bav(9,10) +match(1,8) ( x x x x x x x x ) bav(2,9) +match(1,10) dynamisch programmeren
bav(1,10) = maximum van hoeveel algoritme ( ) x x x x x x x x bav(1,2)+bav(3,10) ( x x ) x x x x x x bav(1,4)+bav(5,10) ( x x x x ) x x x x bav(1,6)+bav(7,10) ( x x x x x x ) x x bav(1,8)+bav(9,10) ( x x x x x x x x ) bav(2,9)+match(1,10) dynamisch programmeren
van bav(1,10) = maximum van 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 tm.
bav(1,10) = maximum van
Bavarian Beer Party optimale score interval [k,l] 2 1 bav[k,l] 2 2 k k+1 l-1 l k j j+1 l 3 2 3 1 1 2 2 1 3 3 2 2 bav[k,l] = max - bav[k+1,l-1] + match(k,l) - (j) bav[k,j]+bav[j+1,l] dynamic programming even length segments
int proost ( int totaal ) { int i, eerst, laatst, aantal, best, som; // initialisatie: paren for (i=1; i<totaal; i++) { bav[i][i+1] = ( brand[i] == brand[i+1] ); } if (totaal == 2) { return bav[1][2] ; } Bavarian Beer Party for (aantal=4; aantal<=totaal; aantal+=2) { // dynamische loop: 'aantal' toastende professoren // aantal == laatst - eerst + 1 for (eerst=1; eerst<totaal-aantal; eerst++) { laatst = eerst + aantal -1; best = (brand[eerst] == brand[laatst]); best += bav[eerst+1][laatst-1]; for (i=eerst+1; i<=laatst-2; i=i+2) { // alleen even matches! som = bav[eerst][i] + bav[i+1][laatst]; if (som > best) best = som; } bav[eerst][laatst] = best; } }// loop return bav[1][totaal] ; }// proost
lekker belangrijk
RNA secondary structure A Adenine C Cytosine G Guanine U Uracil A=U G C
PRAKTIKUM
practicum http://www.liacs.nl/home/hoogeboo/praatjes/lapp-top/»
laboratorium excel driehoek van Pascal cellulaire automaat alignment
gegevens adressen formules excel
excel gegevens adressen formules getallen, strings, datum, geld A1 $A$1 relatief absoluut =A2+B1 IF(A2=B1,1,2) SUM(B9:B12)
excel adressen A1 $A$1 relatief absoluut 1 2 3 A B C D E F =A2+B1 =D3+E2
excel adressen A1 $A$1 relatief absoluut 1 2 3 A B C D E F =$A2+B$1 =$A3+E$1
1 2 3 A B C =A2+B1 driehoek van Pascal 1 1 1 1 1 1 1 1 1 1 2 3 4 5 6 7 8 9 1 3 6 10 15 21 28 36 1 4 10 20 35 56 84 1 5 15 35 70 126 1 6 21 56 126 1 7 28 84 1 8 36 1 9 1
welke getallen gekleurd?
welke getallen gekleurd? 1 1 1 1 1 1 1 1 1 1 0 1 0 1 0 1 0 1 1 1 0 0 1 1 0 0 1 0 0 0 1 0 0 1 1 1 1 0 0 1 0 1 0 0 1 1 0 0 1 0 0 1 1 1 x y X+y- 2xy exclusive or even / oneven 0 0 0 0 1 1 1 0 1 1 1 0
Δ van Pascal waarom eigenlijk niet gewoon rechtop? (dat lukt met gaten tussen de waardes) 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 1 0 2 0 1 0 0 0 0 0 0 0 1 0 3 0 3 0 1 0 0 0 0 0 1 0 4 0 6 0 4 0 1 0 0 0 1 0 5 0 10 0 10 0 5 0 1 0 1 0 6 0 15 0 20 0 15 0 6 0 1 A1 C1 =A1+C1
alignment
excel - alignment A B C D E F G H I 1 c g a t t a 2 a 3 c 4 t 5 t MAX drie buren ± gap / (mis)match IF/ALS letters match/mismatch
excel - alignment
cellulaire automaat
cellulaire automaat http://mathworld.wolfram.com/cellularautomaton.html http://nl.wikipedia.org/wiki/elementaire_cellulaire_automaat»
rule 30 tijd http://mathworld.wolfram.com/rule30.html
Conus textile http://en.wikipedia.org/wiki/conus_textile
rule 90 http://mathworld.wolfram.com/rule90.html
rule 90 xyz 90 000 0 001 1 010 0 011 1 100 1 101 0 110 1 111 0 A1 B1 C1 x y z 90 als functie van xyz formule??
rule 30 xyz 30 000 0 001 1 010 1 011 1 100 1 101 0 110 0 111 0 (1-x) (y+z-y z) + x (1-y) (1-z) in logische terminologie: x(yz) x(yz) xor x(yz)
lookup ipv. formule P Q R 1 000 0 0 2 001 1 1 3 010 2 0 4 011 3 1 5 100 4 1 6 101 5 0 7 110 6 1 8 111 7 0 invoer uitvoer =Lookup( b, Q1:Q8, R1:R8) b = binaire waarde buren: A1 B1 C1 =4*A1+2*B1+1*C1
2-dim cellulaire automaat game of life puffer train http://www.collidoscope.com/modernca/
2-dim cellulaire automaat http://www.ux.uis.no/~ruoff/bz_phenomenology.html