Gene finding and gene structure prediction
|
|
|
- Oswald Abner Burke
- 9 years ago
- Views:
Transcription
1 Gene finding and gene structure prediction Lorenzo Cerutti Swiss Institute of Bioinformatics EMBNet course, March 2003
2 Introduction Homology methods Outline Principles Examples of programs using homology methods Ab initio methods Signal detection Coding statistics Methods to integrate signal detection and coding statistics Ab initio gene predictors on the web Evaluating performances of gene predictors Gene prediction limitations Color code: Keywords, Databases, Software 1
3 Introduction Gene finding is about detecting coding regions and infer gene structure Gene finding is difficult DNA sequence signals have low information content (degenerated and highly unspecific) It is difficult to discriminate real signals Sequencing errors Prokaryotes High gene density and simple gene structure Short genes have little information Overlapping genes Eukaryotes Low gene density and complex gene structure Alternative splicing Pseudo-genes 2
4 Homology method Gene finding strategies Gene structure can be deduced by homology Requires a not too distant homologous sequence Ab initio method Requires two types of information compositional information signal information 3
5 Gene finding: Homology method 4
6 Homology method Principles of the homology method. Coding regions evolve slower than non-coding regions, i.e. local sequence similarity can be used as a gene finder. Homologous sequences reflect a common evolutionary origin and possibly a common gene structure, i.e. gene structure can be solved by homology (mrnas, ESTs, proteins, protein domains). Standard pairwise comparison methods can be used (BLAST, Smith-Waterman,...). Include gene syntax information (start/stop codons,...). Homology methods are also useful to confirm predictions inferred by other methods 5
7 Homology method: a simple view Gene of unknown structure Gene of known structure Exon 1 Exon 2 Exon 3 Pairwise comparison Exon 1 Exon 2 Exon 3 Find DNA signals ATG {TAA,TGA,TAG} GT AG 6
8 Procrustes Procrustes is a software to predict gene structure from homology found in proteins (Gelfand et al., 1996) Principle of the algorithm Find all possible blocks (exons) in the query sequence (based on the acceptor/donor sites) Find optimal alignments between blocks and model sequences Find the best alignment between concatenation of the blocks and the target sequence Find all possible true exons Find homologous regions to a template protein Find the best path 7
9 Advantages of the homology method Procrustes Successfully recognizes short exons and exons with unusual codon usage Assembles correctly complex genes (> 10 exons) Available on the web Problems of the homology method Genes without homologous in the databases are missed Requires close homologous to deduce gene structure Very sensitive to frame shift errors Protocol to find gene structure using protein homology Do a BLASTX of your query sequence against a protein database (SWISS- PROT/TrEMBL) Retrieve sequences giving the best results Find gene structure using the retrieved sequences from the BLASTX search (Procrustes) BLAST the predicted protein against a protein database to verify the predicted gene structure 8
10 Genewise Genewise uses HMMs to compare DNA sequences to protein sequences at the level of its conceptual translation, regardless of sequencing errors and introns. Principle The gene model used in genewise is a HMM with 3 base states (match, insert, delete) with the addition of more transition between states to consider frame-shifts. Intron states have been added to the base model. Genewise directly compare HMM-profiles of proteins or domains to the gene structure HMM model. Genewise can be used with the whole Pfam protein domain databases (find protein domain signatures in the DNA sequence). Genewise is a powerful tool, but time consuming. Genewise is part of the Wise2 package: 9
11 Gene finding: Ab initio method 10
12 Ab initio method Principles of the ab initio methods Integration of signal detection and coding statistics Signal detection and coding statistics are deduced from a training set Probabilistic frameworks are used to infer a probable gene structure A solid scoring system can be used to evaluate the predictions Algorithms used for the ab initio methods derive from machine learning theory. 11
13 Ab initio method Gene of unknown structure Find signals and probable coding regions AAAAA Coding region probability ATG {TAA,TGA,TAG} GT AG AAAAA Promoter signal PolyA signal 12
14 Signal detection Detect short DNA motifs (promoters, start/stop codons, splice sites,...). A number of methods are used for signal detection: Consensus string: based on most frequently observed residues at a given position. Pattern recognition: flexible consensus strings. Weight matrices: based on observed frequencies of residues at a given position. standard alignment algorithms. This method returns a score. Weight array matrices: weight matrices based on dinucleotides frequencies. Takes into account the non-independence of adjacent positions in the sites. Maximal dependence decomposition (MDD): MDD generates a model which captures significant dependencies between non-adjacent as well as adjacent positions, starting from an aligned set of signals. Uses 13
15 Signal detection Methods for signal detection (continuation) Hidden Markov Models (HMM) HMM uses a probabilistic framework to infer the probability that a sequence correspond to a real signal Neural Networks (NN) NN are trained with positive and negatives example and discover the features that distinguish the two sets. Example: NN for acceptor sites, the perceptron, (Horton and Kanehisa, 1992): T A C A G G [0100] [1000] [0010] [1000] [0001] [0001] w1 weights w2 w3 w4 w5 w6 { ~ 1=> true ~0 => false C C [0010] [0010] w7 w8 14
16 Signal detection Signal detection problem DNA sequence signals have low information content Signals are highly unspecific and degenerated Difficult to distinguish between true and false positive How improve signal detection Take context into consideration (ex. acceptor site must be flanked by an intron and an exon) Combine with coding statistics (compositional bias) 15
17 Coding statistics Inter-genic regions, introns, exons,... have different nucleotides contents This compositional differences can be used to infer gene structure Examples of coding region finding methods: ORF length Assuming an uniform random distribution, stop codons are present every 64/3 codons ( 21 codons) in average In coding regions stop codon average decrease Method sensitive to frame shift errors Can t detect short coding regions Bias in nucleotide content in coding regions Generally coding regions are G+C rich There are exceptions. For example coding regions of P. falciparum are A+T rich 16
18 Coding statistics Examples of coding region finding methods (continuation): Periodicity Plot of the number of residues separating a pair of nucleotides is periodic in coding regions, but not in non-coding regions. From Guigó, Genetic Databases, Academic Press,
19 Codon frequencies Codon frequencies Synonym codon usage is biased in a species dependent way 3 rd codon position: 90% are A/T; 10% are G/C How to calculate codon frequencies Assume S = a 1 b 1 c 1, a 2 b 2 c 2,..., a n+1 b n+1 c n+1 is a coding sequence with unknown reading frame. Let f abc denote the appearance frequency of codon abc in a coding sequence. The probabilities p 1, p 2, p 3 of observing the sequence of n codons in the 1 st, 2 nd and 3 rd frame respectively are: p 1 = f a1 b 1 c 1 f a2 b 2 c 2... f a nbncn p 2 = f b1 c 1 a 2 f b2 c 2 a 3... f b ncna n+1 p 3 = f c1 a 2 b 2 f c2 a 3 b 3... f c na n+1 b n+1 The probability P i of the ith reading frame for being the coding region is: P i = p i p 1 +p 2 +p 3 where i {1, 2, 3}. 18
20 Codon frequencies In practice we use these computations in a search algorithm as follows: Select a window of size n (for example n = 30) Slide the window along the sequence and calculate P i window for each start position of the A variation of the codon frequency method is to use 6-tuple frequencies instead of 3-tuple (codon) frequencies. This method was found to be the best single property to predict whether a window of vertebrate genomic sequence was coding or non-coding (Claverie and Bougueleret, 1986). The usage of hexamers frequencies has been integrated in a number of gene predictors. 19
21 Integrating signal information and compositional information for gene structure prediction A number of methods exists for gene structure prediction which integrate different techniques to detect signals (splicing sites, promoters, etc.) and coding statistics. All these methods are classifiers based on machine learning theory. Training sets are required to train the algorithms. The following slides present a non-exhaustive list of these methods. 20
22 Linear and quadratic discrimination analysis Linear discrimination analysis is a standard technique in multivariate analysis. Linear discrimination analysis is used to linearly combine several measures (e.g. signals and coding statistics) in order to perform the best discrimination between coding and non-coding sequences. Quadratic discriminant analysis. Similar to linear discrimination analysis, but uses a quadratic discriminant function Dynamic programming is used in to combine the inferred exons 21
23 Linear and quadratic discrimination analysis
24 Decision trees Decision trees can be automatically build using training algorithms. Internal nodes of a decision tree are property values tested for each subsequence passed to the tree. Properties can be various coding statistics (e.g. hexamers frequencies), signal strengths Bottom nodes (leaves) of the tree contains class labels to be associated with the subsequences. Dynamic programming can be used to deduce the complete gene structure. 23
25 Decision trees Example: from MORGAN (a decision tree system for finding genes in vertebrate DNA) (Salzberg et al. 1998). YES d + a < 3.4? NO d + a < 1.3? d + a < 5.3? YES NO YES NO (6,560) hex < 10.3? hex < 0.1? hex < 5.6? YES NO YES NO YES NO (18,160) donor < 0.09? (9,49) (142,13) asym < 4.6? (151,50) YES (5,21) NO (23,16) YES NO (24,13) (1,5) d: donnor site score a: acceptor site score hex: in frame hexamer frequency asym: Fickett s position asymmetry statistic donor: donor site score leaf nodes: exon, pseudo exon distribution in the training set 24
26 Neural network A neural network is trained with a set of true positives and true negatives examples (set of true exons/false exons,...). For each training example, the neurons are tuned to return the right answer. Dynamic programming can be used to deduce the complete gene structure. score of hexamere in candiate region score of hexamere in flanking regions Markov model score candidate region GC composition flanking region GC composition score for splicing acceptor site score for splicing donnor site length of region... Exon score Input layer Hidden layer Outout layer (Uberbacher et al., 1996) 25
27 Markov derived models Principle of Markov Models Given a DNA string S, find the most probable path M in the model that generates S. This will be the most probable gene structure. Markov derived models have many desirable properties Modeling: theoretically well-founded models. Efficient: O( M S ) where M is the number of states in the model and S is the length of the string. Scoring: theoretically well-founded scoring system. 26
28 Markov Model (MM) Markov derived models Biological sequences can be modeled as the output of a stochastic process in which the probability for a given nucleotide to occur at position p depends on the k previous positions. This representation is called k-order Markov Model. P (x i x 1, x 2,..., x i 1 ) = P (x i x i k, x i (k 1),..., x i 1 ) T C G T A G A G Hidden Markov Model (HMM) In a HMM the biological sequences are modeled as the output of a stochastic process that progresses through a series of discrete states. Each state model correspond to a Markov Model. x xxxxxx ATG ccc ccc ccc TAA xxxxxxx Inter-genic Region around start codon Coding region Region around stop codon (Krog, 1998) 27
29 Markov derived models Generalized Hidden Markov Model (GHMM) GHMMs are HMMs where states are arbitrary sub-models (e.g., neural networks, position weight matrices, etc.). The duration of a particular state depends on some probability distributions. Example of the GHMM in GENSCAN (Burge and Karlin, 1997) Each state can correspond to a position weight matrix, a neural network,... Each state has an associated length distribution (time) there are no transitions from the state to the state itself. 28
30 Markov derived models The underlying (hidden) model of GENSCAN: Forward strand Reverse strand 3 UTR + 3 UTR - E2 + I2 + E term + PolyA + PolyA - E term - I2 - E2 - E1 + I1 + E single + Intragenic E single - I1 - E1 - E0 + I0 + E init + Prom + Prom - E init I0 - - E0-5 UTR + 5 UTR - 29
31 Description of some gene predictors available on the web 30
32 GRAIL 1 GRAIL Neural network recognizing coding potential within a fixed-size window (100 bases) Evaluates coding potential without looking for additional features (e.g., splice junctions, start/stop codons) Exon quality evaluated by a score GRAIL 1a Look at regions immediately adjacent to regions with coding potential Determine the best boundaries for the coding region Performs better than GRAIL 1 in finding true exons and eliminating false positives Returns a score 31
33 GRAIL 2 Uses variable window length GRAIL Incorporates genomic context information Splice junctions Start and stop codons PolyA signals Requires regions next to an exon Not appropriate for sequences without genomic context Much better at estimating the true extent of an exon as compared to GRAIL 1 Returns scores WEB server Accept multiple sequence Length 100 bases to 100 kilobases Available for Human, Mouse, Drosophila, Arabidopsis, and E. coli 32
34 GRAIL 2 output GRAIL [grail2exons -> Exons] St Fr Start End ORFstart ORFend Score Quality 1- f good 2- f excellent 3- f excellent 4- f good 5- f good 6- f excellent 7- f excellent 8- f excellent 9- f good 10- f good [grail2exons -> Exon Translations] 11- MLRGTDASNNSEVFKKAKIMFLEVRKSLTCGQGPTGSSCNGAGQRESGHA AFGIKHTQSVDR 12- AQIPNQQELKETTMCRAISLRRLLLLLLQLCKFSDLGT 13- AQLLAVTQGKTLVLGKEGESAELPCESSQKKITVFTWKFSDQRKILGQHG KGVLIR 33
35 FGENES FGENES uses linear discriminant analysis. Combine several measures of pattern recognition using a linear discriminant analysis Donor and acceptor splice sites Putative coding regions 5 and 3 intronic regions of the putative exon Pass the previous results to a dynamic programming algorithm to find a coherent gene model Each predicted exon has an associated score, but not directly useful (most of the scores < 10, while probable good prediction scores > 10) WEB server Can combine homology method with ab initio results Available organisms Human, Drosophila, Worm, Yeasts, Plants 34
36 FGENES FGENES output (CDSf = first exon; CDSi = internal exon; CDSl = last exon; CDSo = only one exon; PolA = PolyA signal) Length of sequence: GC content: 0.48 Zone: 2 Number of predicted genes: 2 In +chain: 2 In -chain: 0 Number of predicted exons: 12 In +chain: 12 In -chain: 0 Predicted genes and exons in var: 2 Max var= 15 GENE WEIGHT: 27.3 G Str Feature Start End Weight ORF-start ORF-end CDSf CDSl PolA CDSf CDSi CDSi CDSi CDSi CDSi CDSi CDSi CDSi CDSl Predicted proteins: >FGENES-M 1.5 >MySeq 1 Multiexon gene a Ch+ MSSAFSDPFKEQNPVISLITRTNLNSSSLPVRIYCQPPNMFLYIAPCAVLVLSTSSTPRR TENGPLRMALNSRFPASFYLLCRDYQYTPPQLGPLHGRCS >FGENES-M 1.5 >MySeq 2 Multiexon gene a Ch+ MCRAISLRRLLLLLLQLSQLLAVTQGKTLVLGKEGESAELPCESSQKKITVFTWKFSDQR 35
37 MZEF Designed to predict only internal coding exons To discriminate between coding and non-coding regions, MZEF uses quadratic discriminant analysis of different measures Exon length Intron-exon transition/exon-intron transition Branch-site scores 5 and 3 splice sites scores Exon score Strand score Possible to set a prior probability depending on the gene density and G+C content Scores over-estimate of the accuracy of the prediction WEB server: thanaraj/mzef- SPC.html Length up to 200 kilobases Available organisms: Human, Mouse, Arabidopsis, Fission yeast 36
38 MZEF MZEF output Internal coding exons predicted by MZEF Sequence_length: G+C_content: Coordinates P Fr1 Fr2 Fr3 Orf 3ss Cds 5ss Description of the symbols P: Posterior probability (between.5 to 1.) Fri: Frame preference score for the ith frame of the genomic sequence Orf: ORF indicator, 011 (or 211 ) means 2nd and 3rd frames are open 3ss: Acceptor score Cds: Coding preference score 5ss: Donor score 37
39 HMMgene HMMgene has been designed to predict complete gene structures. Uses HMMs with a criterion called Conditional Maximum Likelihood which maximize the probability of correct predictions Can return sub-optimal prediction to help identifying alternative spicing Regions of the sequence can be locked as coding and non-coding by the user Probability score for each predicted exon WEB server Available organisms: Human and worm 38
40 HMMgene HMMgene output # SEQ: Sequence (-) A:5406 C:4748 G:4754 T:5092 Sequence HMMgene1.1a firstex bestparse:cds_1 Sequence HMMgene1.1a exon_ bestparse:cds_1 Sequence HMMgene1.1a exon_ bestparse:cds_1 Sequence HMMgene1.1a exon_ bestparse:cds_1 Sequence HMMgene1.1a exon_ bestparse:cds_1 Sequence HMMgene1.1a lastex bestparse:cds_1 Sequence HMMgene1.1a CDS bestparse:cds_1 Sequence HMMgene1.1a DON Sequence HMMgene1.1a START Sequence HMMgene1.1a ACC Sequence HMMgene1.1a DON Sequence HMMgene1.1a DON position prob strand and frame Symbols: firstex = first exon; exon n = internal exon; lastex = last exon; singleex = single exon gene; CDS = coding region 39
41 GENSCAN GENSCAN has been designed to predict complete gene structures. Uses generalized hidden Markov models (GHMM): structure of genomic sequence is modeled by explicit state duration HMMs. Signals are modeled by weight matrices, weight arrays, and maximal dependence decomposition Probability score for each predicted exon. WEB server Sequence length up to 200 kilobases Graphical output available Available organisms Vertebrate, Arabidopsis, Maize 40
42 GENSCAN output GENSCAN Gn.Ex Type S.Begin...End.Len Fr Ph I/Ac Do/T CodRg P... Tscr Prom Init Intr Intr Intr Intr Intr Intr Intr >02:36:44 GENSCAN_predicted_peptide_1 448_aa MCRAISLRRLLLLLLQLSQLLAVTQGKTLVLGKEGESAELPCESSQKKITVFTWKFSDQR KILGQHGKGVLIRGGSPSQFDRFDSKKGAWEKGSFPLIINKLKMEDSQTYICELENRKEE... Gn.Ex : gene number, exon number (for reference) Type : Init = Initial exon (ATG to 5 splice site) Intr = Internal exon (3 splice site to 5 splice site) Term = Terminal exon (3 splice site to stop codon) Sngl = Single-exon gene (ATG to stop) Prom = Promoter (TATA box / initation site) PlyA = poly-a signal (consensus: AATAAA) S : DNA strand (+ = input strand; - = opposite strand) Begin : beginning of exon or signal (numbered on input strand) End : end point of exon or signal (numbered on input strand) Len : length of exon or signal (bp) 41
43 Gene finding EMBNet 2003 Fr : reading frame (a forward strand codon ending at x has frame x mod 3) Ph : net phase of exon (exon length modulo 3) I/Ac : initiation signal or 3 splice site score (tenth bit units) Do/T : 5 splice site or termination signal score (tenth bit units) CodRg : coding region score (tenth bit units) P : probability of exon (sum over all parses containing exon) Tscr : exon score (depends on length, I/Ac, Do/T and CodRg scores) Graphical output GENSCAN predicted genes in sequence 02:36:44 kb kb kb kbkb Key: Initial exon Internal exon Terminal exon Single-exon gene Optimal exon Suboptimal exon 42
44 GeneMark.hmm GeneMark.hmm was initially developed for bacterial gene finding, then modified to predict gene structure in eukaryotes. Uses explicit state duration HMMs. Optimal gene candidates, selected by HMM and dynamic programming, are further processed by a ribosomal binding site recognition algorithm. No scores! WEB server Access to both the original GeneMark and GeneMark.hmm 43
45 Geneid Geneid is one of the oldest gene structure prediction programs, recently updated to a new and faster version. Uses a hierarchical search structure (signal exon gene): 1 st : finds signals (splice sites, start and stop codons); 2 nd : from the found signals start to score regions for exon-defining signals and proteincoding potential; 3 rd : a dynamic programming algorithm is used to search the space of predicted exons to assemble the gene structure. Very fast and scale linearly with the length of the sequence (both in time and memory) adapted to analyze large sequences. Trained with Drosophila and Human. Available at also source code! 44
46 Geneid Geneid output ## date Mon Aug 26 19:14: ## source-version: geneid v [email protected] ## Sequence emb U47924 HSU Length = bps # Optimal Gene Structure. 2 genes. Score = # Gene 1 (Forward). 4 exons. 266 aa. Score = Internal AA 1: 86 gene_1 Internal AA 86:129 gene_1 Internal AA 130:175 gene_1 Terminal AA 175:266 gene_1 >emb U47924 HSU47924 geneid_v1.1_predicted_protein_1 266_AA gdltdiyllrsyihlryvdisenhltdlsplnylthllwlkadgnrlrsaqmnelpylqi ASFAYNQITDTEGISHPRLETLNLKGNSIHMVTGLDPEKLISLHTVELRGNQLESTLGIN LPKLKNLYLAQNMLKKVEGLEDLSNLTTLHLRDNQIDTLSGFSREMKSLQYLNLRGNMVA NLGELAKLRDLPKLRALVLLDNPCTDETSYRQEALVQMPYLERLDKEFYEEEERAEADVI RQRLKEEKEQEPEPQRDLEPEQSLI* # Gene 2 (Forward). 7 exons. 395 aa. Score = First AA 1: 29 gene_2 Internal AA 29: 61 gene_2 Internal AA 61:105 gene_2 Internal AA 106:180 gene_2 Internal AA 180:277 gene_2 Internal AA 277:344 gene_2 Internal AA 344:395 gene_2 45
47 Evaluating performances Real TP FP TN FN TP FN TN Predicted Measures (Burset and Guigo, 1996; Snyder and Stormo, 1997): Sensitivity S n is the proportion of coding nucleotides that are correctly predicted as coding: S n = T P T P +F N Specificity S p is the proportion of nucleotides predicted as coding that are actually coding: S p = T P T P +F P 46
48 Measures (contd.): Evaluating performances Correlation coefficient CC is a single measure that captures both specificity and sensitivity: CC = (T P T N) (F N F P ) (T P +F N) (T N+F P ) (T P +F P ) (T N+F N) Approximate correlation AC is similar to CC, but defined under any circumstances: where ACP = 1 4 ( AC = (ACP 0.5) 2 T P T P +F N + T P T P +F P + T N T N+F P + ) T N T N+F N 47
49 Accuracy of the different methods Evaluation of the different programs (Rogic et al., 2001) Programs No. of sequences S n S p AC CC FGENES ± GeneMark.hmm ± GENSCAN ± HMMgene ± MZEF ± Overall performances are the best for HMMgene and GENSCAN. Some program s accuracy depends on the G+C content, except for HMMgene and GENSCAN, which use different parameters sets for different G+C contents. For almost all the tested programs, medium exons ( nucleotides long), are most accurately predicted. Accuracy decrease for shorter and longer exons, except for HMMgene. Internal exons are much more likely to be correctly predicted (weakness of the start/stop codon detection). Initial and terminal exons are most likely to be missed completely. Only HMMgene and GENSCAN have reliable scores for exon prediction. 48
50 Accuracy of the different methods Recently a new benchmark has been published by Makarov (2002), with similar results, but other predictors have been included. Programs Sn a Sp a Sn b Sp b HMMgene GenScan Geneid Genie FGENES a Adh region of Drosophila. b 195 high-quality mammalian sequences (human, mouse, and rat). 49
51 Gene prediction limits Existing predictors are for protein coding regions Non-coding areas are not detected (5 and 3 UTR) Non-coding RNA genes are missed Predictions are for typical genes Partial genes are often missed Training sets may be biased Atypical genes use other grammars 50
52 The end 51
Introduction to Genome Annotation
Introduction to Genome Annotation AGCGTGGTAGCGCGAGTTTGCGAGCTAGCTAGGCTCCGGATGCGA CCAGCTTTGATAGATGAATATAGTGTGCGCGACTAGCTGTGTGTT GAATATATAGTGTGTCTCTCGATATGTAGTCTGGATCTAGTGTTG GTGTAGATGGAGATCGCGTAGCGTGGTAGCGCGAGTTTGCGAGCT
Overview of Eukaryotic Gene Prediction
Overview of Eukaryotic Gene Prediction CBB 231 / COMPSCI 261 W.H. Majoros What is DNA? Nucleus Chromosome Telomere Centromere Cell Telomere base pairs histones DNA (double helix) DNA is a Double Helix
Gene Finding and HMMs
6.096 Algorithms for Computational Biology Lecture 7 Gene Finding and HMMs Lecture 1 Lecture 2 Lecture 3 Lecture 4 Lecture 5 Lecture 6 Lecture 7 - Introduction - Hashing and BLAST - Combinatorial Motif
Hidden Markov Models in Bioinformatics. By Máthé Zoltán Kőrösi Zoltán 2006
Hidden Markov Models in Bioinformatics By Máthé Zoltán Kőrösi Zoltán 2006 Outline Markov Chain HMM (Hidden Markov Model) Hidden Markov Models in Bioinformatics Gene Finding Gene Finding Model Viterbi algorithm
Gene Finding CMSC 423
Gene Finding CMSC 423 Finding Signals in DNA We just have a long string of A, C, G, Ts. How can we find the signals encoded in it? Suppose you encountered a language you didn t know. How would you decipher
Final Project Report
CPSC545 by Introduction to Data Mining Prof. Martin Schultz & Prof. Mark Gerstein Student Name: Yu Kor Hugo Lam Student ID : 904907866 Due Date : May 7, 2007 Introduction Final Project Report Pseudogenes
GenBank, Entrez, & FASTA
GenBank, Entrez, & FASTA Nucleotide Sequence Databases First generation GenBank is a representative example started as sort of a museum to preserve knowledge of a sequence from first discovery great repositories,
RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison
RETRIEVING SEQUENCE INFORMATION Nucleotide sequence databases Database search Sequence alignment and comparison Biological sequence databases Originally just a storage place for sequences. Currently the
Gene Models & Bed format: What they represent.
GeneModels&Bedformat:Whattheyrepresent. Gene models are hypotheses about the structure of transcripts produced by a gene. Like all models, they may be correct, partly correct, or entirely wrong. Typically,
Similarity Searches on Sequence Databases: BLAST, FASTA. Lorenza Bordoli Swiss Institute of Bioinformatics EMBnet Course, Basel, October 2003
Similarity Searches on Sequence Databases: BLAST, FASTA Lorenza Bordoli Swiss Institute of Bioinformatics EMBnet Course, Basel, October 2003 Outline Importance of Similarity Heuristic Sequence Alignment:
Bioinformatics Resources at a Glance
Bioinformatics Resources at a Glance A Note about FASTA Format There are MANY free bioinformatics tools available online. Bioinformaticists have developed a standard format for nucleotide and protein sequences
Pairwise Sequence Alignment
Pairwise Sequence Alignment [email protected] SS 2013 Outline Pairwise sequence alignment global - Needleman Wunsch Gotoh algorithm local - Smith Waterman algorithm BLAST - heuristics What
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Gene Prediction
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Gene Prediction Introduction Gene: A sequence of nucleotides coding for protein Gene Prediction Problem: Determine the beginning and
Systematic discovery of regulatory motifs in human promoters and 30 UTRs by comparison of several mammals
Systematic discovery of regulatory motifs in human promoters and 30 UTRs by comparison of several mammals Xiaohui Xie 1, Jun Lu 1, E. J. Kulbokas 1, Todd R. Golub 1, Vamsi Mootha 1, Kerstin Lindblad-Toh
PROC. CAIRO INTERNATIONAL BIOMEDICAL ENGINEERING CONFERENCE 2006 1. E-mail: [email protected]
BIOINFTool: Bioinformatics and sequence data analysis in molecular biology using Matlab Mai S. Mabrouk 1, Marwa Hamdy 2, Marwa Mamdouh 2, Marwa Aboelfotoh 2,Yasser M. Kadah 2 1 Biomedical Engineering Department,
Genetic information (DNA) determines structure of proteins DNA RNA proteins cell structure 3.11 3.15 enzymes control cell chemistry ( metabolism )
Biology 1406 Exam 3 Notes Structure of DNA Ch. 10 Genetic information (DNA) determines structure of proteins DNA RNA proteins cell structure 3.11 3.15 enzymes control cell chemistry ( metabolism ) Proteins
The world of non-coding RNA. Espen Enerly
The world of non-coding RNA Espen Enerly ncrna in general Different groups Small RNAs Outline mirnas and sirnas Speculations Common for all ncrna Per def.: never translated Not spurious transcripts Always/often
Current Motif Discovery Tools and their Limitations
Current Motif Discovery Tools and their Limitations Philipp Bucher SIB / CIG Workshop 3 October 2006 Trendy Concepts and Hypotheses Transcription regulatory elements act in a context-dependent manner.
DNA Replication & Protein Synthesis. This isn t a baaaaaaaddd chapter!!!
DNA Replication & Protein Synthesis This isn t a baaaaaaaddd chapter!!! The Discovery of DNA s Structure Watson and Crick s discovery of DNA s structure was based on almost fifty years of research by other
Using Illumina BaseSpace Apps to Analyze RNA Sequencing Data
Using Illumina BaseSpace Apps to Analyze RNA Sequencing Data The Illumina TopHat Alignment and Cufflinks Assembly and Differential Expression apps make RNA data analysis accessible to any user, regardless
Introduction to transcriptome analysis using High Throughput Sequencing technologies (HTS)
Introduction to transcriptome analysis using High Throughput Sequencing technologies (HTS) A typical RNA Seq experiment Library construction Protocol variations Fragmentation methods RNA: nebulization,
Protein Synthesis How Genes Become Constituent Molecules
Protein Synthesis Protein Synthesis How Genes Become Constituent Molecules Mendel and The Idea of Gene What is a Chromosome? A chromosome is a molecule of DNA 50% 50% 1. True 2. False True False Protein
Just the Facts: A Basic Introduction to the Science Underlying NCBI Resources
1 of 8 11/7/2004 11:00 AM National Center for Biotechnology Information About NCBI NCBI at a Glance A Science Primer Human Genome Resources Model Organisms Guide Outreach and Education Databases and Tools
Gene Prediction with a Hidden Markov Model
Gene Prediction with a Hidden Markov Model Dissertation zur Erlangung des Doktorgrades der Mathematisch-Naturwissenschaftlichen Fakultäten der Georg-August-Universität zu Göttingen vorgelegt von Mario
Algorithms in Computational Biology (236522) spring 2007 Lecture #1
Algorithms in Computational Biology (236522) spring 2007 Lecture #1 Lecturer: Shlomo Moran, Taub 639, tel 4363 Office hours: Tuesday 11:00-12:00/by appointment TA: Ilan Gronau, Taub 700, tel 4894 Office
A Tutorial in Genetic Sequence Classification Tools and Techniques
A Tutorial in Genetic Sequence Classification Tools and Techniques Jake Drew Data Mining CSE 8331 Southern Methodist University [email protected] www.jakemdrew.com Sequence Characters IUPAC nucleotide
Bioinformatics Grid - Enabled Tools For Biologists.
Bioinformatics Grid - Enabled Tools For Biologists. What is Grid-Enabled Tools (GET)? As number of data from the genomics and proteomics experiment increases. Problems arise for the current sequence analysis
T cell Epitope Prediction
Institute for Immunology and Informatics T cell Epitope Prediction EpiMatrix Eric Gustafson January 6, 2011 Overview Gathering raw data Popular sources Data Management Conservation Analysis Multiple Alignments
Frequently Asked Questions Next Generation Sequencing
Frequently Asked Questions Next Generation Sequencing Import These Frequently Asked Questions for Next Generation Sequencing are some of the more common questions our customers ask. Questions are divided
Biological Sciences Initiative. Human Genome
Biological Sciences Initiative HHMI Human Genome Introduction In 2000, researchers from around the world published a draft sequence of the entire genome. 20 labs from 6 countries worked on the sequence.
1 Mutation and Genetic Change
CHAPTER 14 1 Mutation and Genetic Change SECTION Genes in Action KEY IDEAS As you read this section, keep these questions in mind: What is the origin of genetic differences among organisms? What kinds
Introduction to Bioinformatics 3. DNA editing and contig assembly
Introduction to Bioinformatics 3. DNA editing and contig assembly Benjamin F. Matthews United States Department of Agriculture Soybean Genomics and Improvement Laboratory Beltsville, MD 20708 [email protected]
Data, Measurements, Features
Data, Measurements, Features Middle East Technical University Dep. of Computer Engineering 2009 compiled by V. Atalay What do you think of when someone says Data? We might abstract the idea that data are
Human Genome Organization: An Update. Genome Organization: An Update
Human Genome Organization: An Update Genome Organization: An Update Highlights of Human Genome Project Timetable Proposed in 1990 as 3 billion dollar joint venture between DOE and NIH with 15 year completion
BLAST. Anders Gorm Pedersen & Rasmus Wernersson
BLAST Anders Gorm Pedersen & Rasmus Wernersson Database searching Using pairwise alignments to search databases for similar sequences Query sequence Database Database searching Most common use of pairwise
Hidden Markov models in gene finding. Bioinformatics research group David R. Cheriton School of Computer Science University of Waterloo
Hidden Markov models in gene finding Broňa Brejová Bioinformatics research group David R. Cheriton School of Computer Science University of Waterloo 1 Topics for today What is gene finding (biological
Searching Nucleotide Databases
Searching Nucleotide Databases 1 When we search a nucleic acid databases, Mascot always performs a 6 frame translation on the fly. That is, 3 reading frames from the forward strand and 3 reading frames
New Technologies for Sensitive, Low-Input RNA-Seq. Clontech Laboratories, Inc.
New Technologies for Sensitive, Low-Input RNA-Seq Clontech Laboratories, Inc. Outline Introduction Single-Cell-Capable mrna-seq Using SMART Technology SMARTer Ultra Low RNA Kit for the Fluidigm C 1 System
When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want
1 When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want to search other databases as well. There are very
Introduction. Keywords: Hidden Markov model, gene finding, imum likelihood, statistical sequence analysis.
From: ISMB-97 Proceedings. Copyright 1997, AAAI (www.aaai.org). All rights reserved. Two methods for improving performance of an HMMand their application for gene finding Anders Krogh" Center for Biological
MATCH Commun. Math. Comput. Chem. 61 (2009) 781-788
MATCH Communications in Mathematical and in Computer Chemistry MATCH Commun. Math. Comput. Chem. 61 (2009) 781-788 ISSN 0340-6253 Three distances for rapid similarity analysis of DNA sequences Wei Chen,
Next Generation Sequencing
Next Generation Sequencing Technology and applications 10/1/2015 Jeroen Van Houdt - Genomics Core - KU Leuven - UZ Leuven 1 Landmarks in DNA sequencing 1953 Discovery of DNA double helix structure 1977
Hidden Markov models in biological sequence analysis
Hidden Markov models in biological sequence analysis by E. Birney The vast increase of data in biology has meant that many aspects of computational science have been drawn into the field. Two areas of
The sequence of bases on the mrna is a code that determines the sequence of amino acids in the polypeptide being synthesized:
Module 3F Protein Synthesis So far in this unit, we have examined: How genes are transmitted from one generation to the next Where genes are located What genes are made of How genes are replicated How
UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS
UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS Dwijesh C. Mishra I.A.S.R.I., Library Avenue, New Delhi-110 012 [email protected] What is Learning? "Learning denotes changes in a system that enable
1. Classification problems
Neural and Evolutionary Computing. Lab 1: Classification problems Machine Learning test data repository Weka data mining platform Introduction Scilab 1. Classification problems The main aim of a classification
A Primer of Genome Science THIRD
A Primer of Genome Science THIRD EDITION GREG GIBSON-SPENCER V. MUSE North Carolina State University Sinauer Associates, Inc. Publishers Sunderland, Massachusetts USA Contents Preface xi 1 Genome Projects:
Transcription and Translation of DNA
Transcription and Translation of DNA Genotype our genetic constitution ( makeup) is determined (controlled) by the sequence of bases in its genes Phenotype determined by the proteins synthesised when genes
Web Data Extraction: 1 o Semestre 2007/2008
Web Data : Given Slides baseados nos slides oficiais do livro Web Data Mining c Bing Liu, Springer, December, 2006. Departamento de Engenharia Informática Instituto Superior Técnico 1 o Semestre 2007/2008
Protein Protein Interaction Networks
Functional Pattern Mining from Genome Scale Protein Protein Interaction Networks Young-Rae Cho, Ph.D. Assistant Professor Department of Computer Science Baylor University it My Definition of Bioinformatics
2. The number of different kinds of nucleotides present in any DNA molecule is A) four B) six C) two D) three
Chem 121 Chapter 22. Nucleic Acids 1. Any given nucleotide in a nucleic acid contains A) two bases and a sugar. B) one sugar, two bases and one phosphate. C) two sugars and one phosphate. D) one sugar,
Module 1. Sequence Formats and Retrieval. Charles Steward
The Open Door Workshop Module 1 Sequence Formats and Retrieval Charles Steward 1 Aims Acquaint you with different file formats and associated annotations. Introduce different nucleotide and protein databases.
Protein & DNA Sequence Analysis. Bobbie-Jo Webb-Robertson May 3, 2004
Protein & DNA Sequence Analysis Bobbie-Jo Webb-Robertson May 3, 2004 Sequence Analysis Anything connected to identifying higher biological meaning out of raw sequence data. 2 Genomic & Proteomic Data Sequence
Gene Expression Assays
APPLICATION NOTE TaqMan Gene Expression Assays A mpl i fic ationef ficienc yof TaqMan Gene Expression Assays Assays tested extensively for qpcr efficiency Key factors that affect efficiency Efficiency
Core Bioinformatics. Degree Type Year Semester. 4313473 Bioinformàtica/Bioinformatics OB 0 1
Core Bioinformatics 2014/2015 Code: 42397 ECTS Credits: 12 Degree Type Year Semester 4313473 Bioinformàtica/Bioinformatics OB 0 1 Contact Name: Sònia Casillas Viladerrams Email: [email protected]
MUTATION, DNA REPAIR AND CANCER
MUTATION, DNA REPAIR AND CANCER 1 Mutation A heritable change in the genetic material Essential to the continuity of life Source of variation for natural selection New mutations are more likely to be harmful
Homologous Gene Finding with a Hidden Markov Model
Homologous Gene Finding with a Hidden Markov Model by Xuefeng Cui A thesis presented to the University of Waterloo in fulfilment of the thesis requirement for the degree of Master of Mathematics in Computer
CRAC: An integrated approach to analyse RNA-seq reads Additional File 3 Results on simulated RNA-seq data.
: An integrated approach to analyse RNA-seq reads Additional File 3 Results on simulated RNA-seq data. Nicolas Philippe and Mikael Salson and Thérèse Commes and Eric Rivals February 13, 2013 1 Results
Master's projects at ITMO University. Daniil Chivilikhin PhD Student @ ITMO University
Master's projects at ITMO University Daniil Chivilikhin PhD Student @ ITMO University General information Guidance from our lab's researchers Publishable results 2 Research areas Research at ITMO Evolutionary
Structure and Function of DNA
Structure and Function of DNA DNA and RNA Structure DNA and RNA are nucleic acids. They consist of chemical units called nucleotides. The nucleotides are joined by a sugar-phosphate backbone. The four
Data Mining Algorithms Part 1. Dejan Sarka
Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka ([email protected]) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses
Outline. MicroRNA Bioinformatics. microrna biogenesis. short non-coding RNAs not considered in this lecture. ! Introduction
Outline MicroRNA Bioinformatics Rickard Sandberg Dept. of Cell and Molecular Biology (CMB) Karolinska Institutet! Introduction! microrna target site prediction! Useful resources 2 short non-coding RNAs
Supervised Learning (Big Data Analytics)
Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used
FlipFlop: Fast Lasso-based Isoform Prediction as a Flow Problem
FlipFlop: Fast Lasso-based Isoform Prediction as a Flow Problem Elsa Bernard Laurent Jacob Julien Mairal Jean-Philippe Vert September 24, 2013 Abstract FlipFlop implements a fast method for de novo transcript
MORPHEUS. http://biodev.cea.fr/morpheus/ Prediction of Transcription Factors Binding Sites based on Position Weight Matrix.
MORPHEUS http://biodev.cea.fr/morpheus/ Prediction of Transcription Factors Binding Sites based on Position Weight Matrix. Reference: MORPHEUS, a Webtool for Transcripton Factor Binding Analysis Using
Yale Pseudogene Analysis as part of GENCODE Project
Sanger Center 2009.01.20, 11:20-11:40 Mark B Gerstein Yale Illustra(on from Gerstein & Zheng (2006). Sci Am. (c) Mark Gerstein, 2002, (c) Yale, 1 1Lectures.GersteinLab.org 2007bioinfo.mbb.yale.edu Yale
Expression Quantification (I)
Expression Quantification (I) Mario Fasold, LIFE, IZBI Sequencing Technology One Illumina HiSeq 2000 run produces 2 times (paired-end) ca. 1,2 Billion reads ca. 120 GB FASTQ file RNA-seq protocol Task
org.rn.eg.db December 16, 2015 org.rn.egaccnum is an R object that contains mappings between Entrez Gene identifiers and GenBank accession numbers.
org.rn.eg.db December 16, 2015 org.rn.egaccnum Map Entrez Gene identifiers to GenBank Accession Numbers org.rn.egaccnum is an R object that contains mappings between Entrez Gene identifiers and GenBank
DNA Sequencing Overview
DNA Sequencing Overview DNA sequencing involves the determination of the sequence of nucleotides in a sample of DNA. It is presently conducted using a modified PCR reaction where both normal and labeled
Module 3 Questions. 7. Chemotaxis is an example of signal transduction. Explain, with the use of diagrams.
Module 3 Questions Section 1. Essay and Short Answers. Use diagrams wherever possible 1. With the use of a diagram, provide an overview of the general regulation strategies available to a bacterial cell.
Genomes and SNPs in Malaria and Sickle Cell Anemia
Genomes and SNPs in Malaria and Sickle Cell Anemia Introduction to Genome Browsing with Ensembl Ensembl The vast amount of information in biological databases today demands a way of organising and accessing
From DNA to Protein. Proteins. Chapter 13. Prokaryotes and Eukaryotes. The Path From Genes to Proteins. All proteins consist of polypeptide chains
Proteins From DNA to Protein Chapter 13 All proteins consist of polypeptide chains A linear sequence of amino acids Each chain corresponds to the nucleotide base sequence of a gene The Path From Genes
Analyzing A DNA Sequence Chromatogram
LESSON 9 HANDOUT Analyzing A DNA Sequence Chromatogram Student Researcher Background: DNA Analysis and FinchTV DNA sequence data can be used to answer many types of questions. Because DNA sequences differ
Sample Questions for Exam 3
Sample Questions for Exam 3 1. All of the following occur during prometaphase of mitosis in animal cells except a. the centrioles move toward opposite poles. b. the nucleolus can no longer be seen. c.
Lecture/Recitation Topic SMA 5303 L1 Sampling and statistical distributions
SMA 50: Statistical Learning and Data Mining in Bioinformatics (also listed as 5.077: Statistical Learning and Data Mining ()) Spring Term (Feb May 200) Faculty: Professor Roy Welsch Wed 0 Feb 7:00-8:0
A Multiple DNA Sequence Translation Tool Incorporating Web Robot and Intelligent Recommendation Techniques
Proceedings of the 2007 WSEAS International Conference on Computer Engineering and Applications, Gold Coast, Australia, January 17-19, 2007 402 A Multiple DNA Sequence Translation Tool Incorporating Web
A Property and Casualty Insurance Predictive Modeling Process in SAS
Paper 11422-2016 A Property and Casualty Insurance Predictive Modeling Process in SAS Mei Najim, Sedgwick Claim Management Services ABSTRACT Predictive analytics is an area that has been developing rapidly
4. DNA replication Pages: 979-984 Difficulty: 2 Ans: C Which one of the following statements about enzymes that interact with DNA is true?
Chapter 25 DNA Metabolism Multiple Choice Questions 1. DNA replication Page: 977 Difficulty: 2 Ans: C The Meselson-Stahl experiment established that: A) DNA polymerase has a crucial role in DNA synthesis.
Protein Synthesis. Page 41 Page 44 Page 47 Page 42 Page 45 Page 48 Page 43 Page 46 Page 49. Page 41. DNA RNA Protein. Vocabulary
Protein Synthesis Vocabulary Transcription Translation Translocation Chromosomal mutation Deoxyribonucleic acid Frame shift mutation Gene expression Mutation Point mutation Page 41 Page 41 Page 44 Page
How To Cluster
Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main
This unit will lay the groundwork for later units where the students will extend this knowledge to quadratic and exponential functions.
Algebra I Overview View unit yearlong overview here Many of the concepts presented in Algebra I are progressions of concepts that were introduced in grades 6 through 8. The content presented in this course
Welcome to the Plant Breeding and Genomics Webinar Series
Welcome to the Plant Breeding and Genomics Webinar Series Today s Presenter: Dr. Candice Hansey Presentation: http://www.extension.org/pages/ 60428 Host: Heather Merk Technical Production: John McQueen
Bio-Informatics Lectures. A Short Introduction
Bio-Informatics Lectures A Short Introduction The History of Bioinformatics Sanger Sequencing PCR in presence of fluorescent, chain-terminating dideoxynucleotides Massively Parallel Sequencing Massively
Focusing on results not data comprehensive data analysis for targeted next generation sequencing
Focusing on results not data comprehensive data analysis for targeted next generation sequencing Daniel Swan, Jolyon Holdstock, Angela Matchan, Richard Stark, John Shovelton, Duarte Mohla and Simon Hughes
2) Write in detail the issues in the design of code generator.
COMPUTER SCIENCE AND ENGINEERING VI SEM CSE Principles of Compiler Design Unit-IV Question and answers UNIT IV CODE GENERATION 9 Issues in the design of code generator The target machine Runtime Storage
BIO 3350: ELEMENTS OF BIOINFORMATICS PARTIALLY ONLINE SYLLABUS
BIO 3350: ELEMENTS OF BIOINFORMATICS PARTIALLY ONLINE SYLLABUS NEW YORK CITY COLLEGE OF TECHNOLOGY The City University Of New York School of Arts and Sciences Biological Sciences Department Course title:
10-601. Machine Learning. http://www.cs.cmu.edu/afs/cs/academic/class/10601-f10/index.html
10-601 Machine Learning http://www.cs.cmu.edu/afs/cs/academic/class/10601-f10/index.html Course data All up-to-date info is on the course web page: http://www.cs.cmu.edu/afs/cs/academic/class/10601-f10/index.html
Module 10: Bioinformatics
Module 10: Bioinformatics 1.) Goal: To understand the general approaches for basic in silico (computer) analysis of DNA- and protein sequences. We are going to discuss sequence formatting required prior
2006 7.012 Problem Set 3 KEY
2006 7.012 Problem Set 3 KEY Due before 5 PM on FRIDAY, October 13, 2006. Turn answers in to the box outside of 68-120. PLEASE WRITE YOUR ANSWERS ON THIS PRINTOUT. 1. Which reaction is catalyzed by each
