Sequence Alignment (Inexact Matching) Overview
|
|
- Aron Russell
- 7 years ago
- Views:
Transcription
1 CSI/BINF 5330 Sequence Alignment (Inexact Matching) Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Backgrounds Longest Common Subsequence Problem Global Sequence Alignment Local Sequence Alignment Alignment with Gap Penalty Homolog Search Multiple Sequence Alignment 1
2 Sequence Similarity Homologs similar sequence + common ancestor (divergent evolution) Orthlogs: homologs in different species by species divergence Paralogs: homologs in the same species by gene duplication Analogs similar sequence + no common ancestor (convergent evolution) How to measure sequence similarity? Sequence Alignment Definition Arranging two or more DNA or protein sequences by inserting gaps to maximize their sequence similarity e.g., DNA1 = AGA, DNA2 = CGAC Applications Given gene sequences, infer their evolutionary history (phylogenetics) Given gene sequences of known functions, infer the functions of newly sequenced genes Given genes of known functions in one organism, infer the functions of genes in another organism 2
3 Overview Backgrounds Longest Common Subsequence Problem Global Sequence Alignment Local Sequence Alignment Alignment with Gap Penalty Homolog Search Multiple Sequence Alignment Longest Common Subsequences (1) Subsequence of x An ordered sequence of letters from x Not necessarily consecutive e.g., x= AGCA, AGCA?, CG?, AC?, GA? Common Subsequence of x and y e.g., x= ACGA and y= GCAA, CA?, GA?, AA? Longest Common Subsequence (LCS) of x and y? 3
4 Longest Common Subsequences (2) Definition of LCS Given two sequences, v = v 1 v 2 v m and w = w 1 w 2 w n, LCS of v and w is a sequence of positions in v: 1 i 1 < i 2 < < i t m and a sequence of positions in w: 1 j 1 < j 2 < < j t n such that i t -th letter of v equals to j t -letter of w, and t is maximal LCS in 2-Row Representation (1) Example x= ACGAG (m=8), y= GCAAC (n=7) x y A C G A G G C A A C Position in x: Position in y: LCS: CA 2 < 3 < 4 < 6 1 < 3 < 5 < 6 4
5 LCS in 2-Row Representation (2) Example - Continued x= ACGAG (m=8), y= GCAAC (n=7) x y A C G A G G C A A C x y A C G A G G C A A C LCS in 2-D Grid Representation Edit Graph 2-D grid structure having diagonals on the position of the same letter Example x= AGA (m=7) y= ACGAC (n=7) Strongest path in edit graph source A A C G A C (0,0) (1,1) (2,2) G (2,3) (3,4) (4,5) (5,5) (6,6) (7,6) (7,7) A sink 5
6 Formulation of LCS Problem Goal Finding the longest common subsequence (LCS) of two sequences (length-m, length-n) Finding the strongest path from a source to a sink in a weighted edit graph he path strength is measured by summing the weights on the path Input A weighted edit graph G with source (0,0) and sink (m,n) Output A strongest path in G from the source to the sink Solving by Exhaustive Search Algorithm (1) Enumerate all possible paths from the source to the sink (2) Compute the path strength for all possible paths (3) Find the strongest path Problems? 6
7 Solving by (Iterative) Greedy Algorithm Algorithm (1) Start from the source (2) Select the edge having the highest weight (i.e., if there is a diagonal edge, select it. Otherwise, select one of the other edges.) (3) Repeat (2) until it reaches the sink Problems? Runtime? Solving by Dynamic Programming Formula Algorithm 7
8 Example of LCS Example x= AGA (m=7), y= ACGAC (n=7) source A C G A C A G A sink Finding LCS Storing Directions Backtracking 8
9 Overview Backgrounds Longest Common Subsequence Problem Global Sequence Alignment Local Sequence Alignment Alignment with Gap Penalty Homolog Search Multiple Sequence Alignment from LCS to Sequence Alignment LCS problem Allows only insertions and deletions - no substitutions Scores 1 for a match and 0 for an insertion or deletion Sequence Alignment Problem Allows gaps (insertions and deletions) and mismatches (substitutions) Uses any scoring schemes 9
10 Formulation of Global Alignment Problem Goal Finding the best alignment of two sequences under a given scoring schema Input wo sequences x (length-m) and y (length-n), and a scoring schema Output An alignment of x and y with the maximal score Basic Scoring Scheme Simplest Scoring Scheme Match premium: + Mismatch penalty: -μ Insertion and deletion (gap) penalty: -σ Score = #matches μ #mismatches σ (#insertions + #deletions) Formula 10
11 Example of Sequence Alignment Example x= AGA (m=7), y= ACGAC (n=7) A C G A C A G A + -μ -μ -μ -μ + -μ -μ + -μ -μ + -μ -μ -μ -μ -μ + -μ -μ -μ -μ + -μ -μ + -μ -μ -μ + -μ -μ + -μ -μ + -μ -μ -μ -μ + -μ -μ + -μ -μ + -μ -μ All horizontal and vertical edges: -σ Solving by Dynamic Programming Algorithm 11
12 Advanced Scoring Scheme Percent Identity Percentage of identical matches Percent Similarity Percentage of nucleotide pairs with the same type Percentage of similar amino acid pairs in biochemical structure Advanced Scoring Scheme Varying scores in similarity Varying, strong penalties for mismatches Relative likelihood of evolutionary relationship Probability of mutations Define scoring matrix for DNA or protein sequences Scoring Matrices (1) Scoring Matrix δ Also called substitution matrix 4 4 array representation for DNA sequences or (4) (4) array array representation for protein sequences or (20) (20) array Entry of δ (i,j) has the score between i and j, i.e., the rate at which i is substituted with j over time Formula 12
13 Scoring Matrices (2) PAM (Point Accepted Mutations) For protein sequence alignment Amino acid substitution frequency in mutations Logarithmic matrix of mutation probabilities PAM120: Results from 120 mutations per 100 residues PAM120 vs. PAM240 BLOSUM (Block Substitution Matrix) For protein sequence alignment Applied for local sequence alignments Substitution frequencies between clustered groups BLOSUM-62: Results with a threshold (cut-off) of 62% identity BLOSUM-62 vs. BLOSUM-50 Scoring Matrices (3) Substitution Matrix Examples BLOSUM-62 PAM120 13
14 Overview Backgrounds Longest Common Subsequence Problem Global Sequence Alignment Local Sequence Alignment Alignment with Gap Penalty Homolog Search Multiple Sequence Alignment Local Sequence Alignment (1) Example x = CAGGCGAAGA y = AGGCAGCAGGG Global Alignment C A G G C G A A G A A G G C A G C A G G G Local Alignment C A G G C G A A G A A G G C A G C A G G G 14
15 Local Sequence Alignment (2) Example - Continued C A G G C G A A G A A G G C A G C A G G G scoring in this range for local alignment Global vs. Local Alignment Global Alignment Problem Finds the path having the largest weight between vertices (0,0) and (m,n) in the edit graph Local Alignment Problem Finds the path having the largest weight between two arbitrary vertices, (i,j) and (i, j ), in the edit graph Score Comparison he score of local alignment must be greater than (or equal to) the score of global alignment 15
16 Formulation of Local Alignment Problem Goal Finding the best local alignment between two sequences Input wo sequences x and y, and a scoring matrix δ Output An alignment of substrings of x and y with the maximal score among all possible substrings of them Implementation of Local Alignment Strategy C A G G C G A A G A A G G C A G C A G G A Compute mini global alignment for local alignment 16
17 Solving by Exhaustive Search Process (1) Enumeration of all possible pairs of substrings (2) Global alignment for each pair of substrings (1) Enumeration of all substrings starting from (i,j) (2) Global alignment from each start position Runtime Suppose two sequences have the same length n Global alignment : otal runtime : Solution Free ride! Solving by Dynamic Programming Free Ride Assigns 0 weights from (0,0) to any other nodes (i,j) Yeah, a free ride! (0,0) Formula 0 Runtime? 17
18 Overview Backgrounds Longest Common Subsequence Problem Global Sequence Alignment Local Sequence Alignment Alignment with Gap Penalty Homolog Search Multiple Sequence Alignment Scoring Insertions/Deletions Naïve Approach -σ for 1 insertion/deletion, -2σ for 2 consecutive insertions/deletions -3σ for 3 consecutive insertions/deletions, etc. too severe penalty for a series of 100 consecutive insertions/deletions Example x= AAGC, y= AAGC single event x= AAGGC, y= AGGC 18
19 Scoring Gaps of Insertions/Deletions Gap Contiguous sequence of spaces in one of the rows Contiguous sequence of insertions or deletions in 2-row representation Linear Gap Penalty Score for a gap of length x : -σ x ( Naïve approach ) Constant Gap Penalty Score for a gap of length x : -ρ Affine Gap Penalty Score for a gap of length x : - (ρ + σ x) -ρ : gap opening penalty / -σ : gap extension penalty ( ρ σ ) Solving Constant/Affine Gap Penalty Edit Graph Update Add long horizontal or vertical edges with ρ weights to the edit graph Runtime? 19
20 Improved Solution for Constant/Affine Gap Penalty 3-Layer Grid Structure Middle layer ( Main layer ) for diagonal edges Extends matches and mismatches Upper layer for horizontal edges Creates/extends gaps in a sequence x Lower layer for vertical edges Creates/extends gaps in a sequence y Gap opening penalty (-ρ) for jumping from middle layer to upper/lower layer Gap extension penalty (-σ) for extending on upper/lower layer Example of 3-Layer Grid upper layer ( gaps in x ) main layer (matches/mismatches) lower layer ( gaps in y ) 20
21 Solving by Dynamic Programming Formula Dynamic programming with recursion match / mismatch continuing gap in y starting gap in y continuing gap in x starting gap in x Runtime? Overview Backgrounds Longest Common Subsequence Problem Global Sequence Alignment Local Sequence Alignment Alignment with Gap Penalty Homolog Search Multiple Sequence Alignment 21
22 Searching Databases Homolog Search Searching similar sequences to a query sequence in a database (e.g. GenBank) Computational issues Dynamic programming are rigorous But inefficient in searching a huge database Need heuristic approaches Homolog Search ools FASA BLAS FASA (1) FASA DNA / protein sequence alignment tool (local alignment) Applies dynamic programming in scoring selected sequences Heuristic method in candidate sequence search Process Finding all pairwise k-tuples (at least k contiguous matching residues) Scoring the k-tuples by a substitution matrix Selecting sequences with high scores for alignment 22
23 FASA (2) FASA package query sequence database fasta protein protein fasta DNA DNA fastx / fasty DNA (all reading frames) protein tfastx / tfasty protein DNA (all reading frames) fastf mixed peptide protein tfastf mixed peptide DNA ssearch : applies dynamic programming BLAS (1) BLAS (Basic Local Alignment Search ool) DNA / protein sequence alignment tool Finds local alignments Heuristic method in sequence search Runs faster than FASA Process Makes a list of words (word pairs) from the query sequence Chooses high-scoring words Searches database for matches (hits) with the high-scoring words Extends the matches in both directions to find high-scoring segments (HSP) 23
24 BLAS (2) BLAS programs query sequence database blastp protein protein blastn DNA DNA blastx DNA (all reading frames) protein tblastn protein DNA (all reading frames) tblastx DNA (all reading frames) DNA (all reading frames) Search Results BLAS Search Results FASA Search Results 24
25 E-value E-value Average number of alignments with a score of at least S that would be expected by chance alone in searching a database of n sequences High alignment score S Low E-value Low alignment score S High E-value Ranges of E-value: 0 ~ n Factors Alignment score he number of sequences in the database Sequence length Default E-value threshold: 0.01 ~ Overview Backgrounds Longest Common Subsequence Problem Global Sequence Alignment Local Sequence Alignment Alignment with Gap Penalty Homolog Search Multiple Sequence Alignment 25
26 Pairwise Alignment vs. Multiple Alignment Pairwise Alignment Alignment of two sequences Sometimes two sequences are functionally similar or have common ancestor although they have weak sequence similarity Multiple Alignment Alignment of more than two sequences Finds invisible similarity in pairwise alignment Alignment of 3 Sequences Alignment of 2 Sequences Described in a 2-row representation Best alignment is found in a 2-D matrix by dynamic programming Alignment of 3 Sequences Described in a 3-row representation x= AGG, y= ACGA, z= ACG x : A - G G - y : A - C G - A z : A C - G - Best alignment is found in a 3-D matrix by dynamic programming 26
27 Alignment in 3-D Grid 3-D Edit Graph 3-D grid structure (cube) with diagonals in each cell Example source x : y : z : A - G G A - C G - A A C - G - Path in 3-D grid : sink (0,0,0) (1,1,1) (2,1,2) (2,2,3) (3,3,3) (4,4,4) (5,4,5) (5,5,5) 3-D Grid Unit 2-D Grid Unit Maximum 3 edges in each unit of 2-D grid 3-D Grid Cell Maximum 7 edges in each unit of 3-D grid ( i-1, j-1, k-1 ) ( i-1, j-1, k ) ( i-1, j, k-1 ) ( i-1, j, k ) ( i, j-1, k-1 ) ( i, j, k-1 ) ( i, j-1, k ) ( i, j, k ) 27
28 Solving by Dynamic Programming Formula δ (x, y, z) is the entry of 3-D scoring matrix Runtime? from 3-D Alignment to Multiple Alignment Alignment of k Sequences Able to be solved by dynamic programming in k-d grid Runtime? Conclusion Dynamic programming for pairwise alignment can be extended to multiple alignment However, computationally impractical How can we solve this problem? 28
29 Heuristics of Multiple Alignment Intuition Implementing pairwise alignment (2-D alignment) many times is better than implementing k-d multiple alignment once Heuristic Process (1) Implementing all possible pairwise alignments (2) Combining the most similar pair iteratively Multiple Alignment to Pairwise Alignment Multiple Alignment Pairwise Alignments Multiple alignment induces pairwise alignments x: A C G C G G C y: A C G C G A G x: A C G C G G C y: A C G C G A G z: G C C G C G A G x: A C G C G G C z: G C C G C G A G y: A C G C G A G z: G C C G C G A G 29
30 Multiple Alignment Projection Multiple Alignment Pairwise Alignments x: A C G C G G C y: A C G C G A G z: G C C G C G A G x: A C G C G G C y: A C G C G A G x: A C G C G G C z: G C C G C G A G y: A C G C G A G z: G C C G C G A G Projection Pairwise Alignment to Multiple Alignment (1) Pairwise Alignments Multiple Alignment x: A C G C G G C y: A C G C G A G x: A C G C G G C z: G C C G C G A G x: A C G C G G C y: A C G C G A G z: G C C G C G A G y: A C G C G A G z: G C C G C G A G Can we construct a multiple alignment that induces pairwise alignments? 30
31 Pairwise Alignment to Multiple Alignment (2) Examples x= AAA, y= GGG, z= AAAGGG x= AAA, y= GGG, z= GGGAAA Conclusion Difficult to infer optimal multiple alignment from all possible optimal pairwise alignments Greedy Approach (1) Process (1) Implement pairwise alignment for all possible pairs of sequences (2) Choose the most similar pair of sequences (3) Merge them into a new sequence (4) Choose the most similar sequence to the new sequence (5) Repeat (3) and (4) until choosing all sequences Example Step 1 s1: GACA s2: GCGA s3: GAA s4: GCAGC s2 GCGA s4 GCAGC s1 GA-CA s2 G-CGA s1 GA-CA s3 GAA- s1 GACA-- s4 G -CAGC s2 G-CGA s3 GAA- s3 GA-A s4 G-CAGC 31
32 Greedy Approach (2) Example - continued Step 2 s2 GCGA s4 GCAGC Step 3 s 1 s 3 s 2,4 GACA GAA GCt/aGa/c s 2,4 GCt/aGa/c ( called profile or consensus sequence ) Features k-way alignment (alignment of k sequences) Runtime? Greedy algorithm Not optimal multiple alignment Progressive Alignment (1) Features A variation of greedy algorithm (more intelligent strategy on each step) Also called a hierarchical method Uses profiles to compare sequences Gaps are permanent ( once a gap, always a gap ) Works well for close sequences Process Stage 1 Computes similarity of all possible pairs of sequences ( similarity = sequence identity = # exact match / sequence length ) Makes a similarity matrix v 1 v 2 v 3 v 4 v 1 v 2 v 3 v
33 Progressive Alignment (2) Process - continued Stage 2 Creates a guide tree using the similarity matrix Stage 3 Applies a series of pairwise alignment following the guide tree Application of Progressive Alignment ClustalW Popular multiple alignment tool Adopts the progressive multiple alignment FOS_RA FOS_MOUSE FOS_CHICK FOSB_MOUSE FOSB_HUMAN PEEMSVS-LDLGGLPEAPESEEAFLPLLNDPEPK-PSLEPVKNISNMELKAEPFD PEEMSVAS-LDLGGLPEASPESEEAFLPLLNDPEPK-PSLEPVKSISNVELKAEPFD SEELAAAALDLG----APSPAAAEEAFALPLMEAPPAVPPKEPSG--SGLELKAEPFD PGPGPLAEVRDLPG-----SSAKEDGFGWLLPPPPPPP LPFQ PGPGPLAEVRDLPG-----SAPAKEDGFSWLLPPPPPPP LPFQ.. : **. :.. *:.* *. * **: Dots and stars show how well-conserved a column is 33
34 Scoring Schemes Number of Matches Multiple longest common subsequence score A column is a match if all the letters in the column are the same AAA AAG AA AC Only good for very similar sequences Entropy-Based Scoring Sum-of-Pair Scoring Entropy-Based Scoring (1) Entropy in Information heory A measure of the uncertainty associated with a random variable Entropy-Based Scoring in Multiple Alignment (1) Define frequencies for the occurrence of each letter on each column (2) Compute entropy of each column (3) Sum all entropies over all columns 34
35 Entropy-Based Scoring (2) Example AAA AAG AA AC Frequency 1 st column: p(a) = 1, p() = p(g) = p(c) = 0 2 nd column: p(a) = 0.75, p() = 0.25, p(g) = p(c) = 0 3 rd column: p(a) = 0.25, p() = 0.25, p(c) = 0.25, p(g) = 0.25 Entropy A A A A A G 1 1 H 0 H log log A H A log A C Entropy-based score in multiple alignment: Sum-of-Pair Scoring Sum-of-Pairs Scoring in Multiple Alignment Consider pairwise alignment of sequences, a i and a j, imposed by a multiple alignment of k sequences Denote the score of the pairwise alignment as S*(a i, a j ) Sum up the pairwise scores for a multiple alignment: Example Aligning 4 sequences, a 1, a 2, a 3, and a 4, by 35
36 Advanced Multiple Alignment Background Progressive sequence alignment has loss of information not optimal Multi-domain proteins evolve not only through point mutations but also through domain duplications and domain re-combinations Rearrangement might be meaningful for aligning multi-domain protein sequences Examples Partial Order Multiple Sequence Alignment (PO-MSA) A-Bruijn Alignment (ABA) Alignment as a Graph Conventional pairwise alignment A path of a sequence Combining two paths Combined directed acyclic graph (partial order) 36
37 PO-MSA Algorithm Partial Order Multiple Sequence Alignment (PO-MSA) Considers a set of sequences S as a directed acyclic graph G such that each sequence in S is a path in G ( Each sequence is mapped into the graph.) Focuses on ordering rather than positions Algorithm (1) Construct a guide tree (2) Apply progressive alignment following the guide tree (3) Align two directed acyclic graphs (Partial Order Alignment) using dynamic programming algorithm at each step Partial Order Alignment (1) Schematic View Conventional pairwise alignment PO-MSA 37
38 Partial Order Alignment (2) Advantages Scalability Linear increase of time complexity as the increment of predecessors Accuracy Homologous recombination for multi-domain protein sequences ( he graph represents domain structure.) Questions? Lecture Slides are found on the Course Website, web.ecs.baylor.edu/faculty/cho/
BLAST. Anders Gorm Pedersen & Rasmus Wernersson
BLAST Anders Gorm Pedersen & Rasmus Wernersson Database searching Using pairwise alignments to search databases for similar sequences Query sequence Database Database searching Most common use of pairwise
More informationPairwise Sequence Alignment
Pairwise Sequence Alignment carolin.kosiol@vetmeduni.ac.at SS 2013 Outline Pairwise sequence alignment global - Needleman Wunsch Gotoh algorithm local - Smith Waterman algorithm BLAST - heuristics What
More informationSimilarity Searches on Sequence Databases: BLAST, FASTA. Lorenza Bordoli Swiss Institute of Bioinformatics EMBnet Course, Basel, October 2003
Similarity Searches on Sequence Databases: BLAST, FASTA Lorenza Bordoli Swiss Institute of Bioinformatics EMBnet Course, Basel, October 2003 Outline Importance of Similarity Heuristic Sequence Alignment:
More informationRETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison
RETRIEVING SEQUENCE INFORMATION Nucleotide sequence databases Database search Sequence alignment and comparison Biological sequence databases Originally just a storage place for sequences. Currently the
More informationBio-Informatics Lectures. A Short Introduction
Bio-Informatics Lectures A Short Introduction The History of Bioinformatics Sanger Sequencing PCR in presence of fluorescent, chain-terminating dideoxynucleotides Massively Parallel Sequencing Massively
More informationAmino Acids and Their Properties
Amino Acids and Their Properties Recap: ss-rrna and mutations Ribosomal RNA (rrna) evolves very slowly Much slower than proteins ss-rrna is typically used So by aligning ss-rrna of one organism with that
More informationProtein Protein Interaction Networks
Functional Pattern Mining from Genome Scale Protein Protein Interaction Networks Young-Rae Cho, Ph.D. Assistant Professor Department of Computer Science Baylor University it My Definition of Bioinformatics
More informationProtein & DNA Sequence Analysis. Bobbie-Jo Webb-Robertson May 3, 2004
Protein & DNA Sequence Analysis Bobbie-Jo Webb-Robertson May 3, 2004 Sequence Analysis Anything connected to identifying higher biological meaning out of raw sequence data. 2 Genomic & Proteomic Data Sequence
More informationRapid alignment methods: FASTA and BLAST. p The biological problem p Search strategies p FASTA p BLAST
Rapid alignment methods: FASTA and BLAST p The biological problem p Search strategies p FASTA p BLAST 257 BLAST: Basic Local Alignment Search Tool p BLAST (Altschul et al., 1990) and its variants are some
More informationSequence Analysis 15: lecture 5. Substitution matrices Multiple sequence alignment
Sequence Analysis 15: lecture 5 Substitution matrices Multiple sequence alignment A teacher's dilemma To understand... Multiple sequence alignment Substitution matrices Phylogenetic trees You first need
More informationBioinformatics Resources at a Glance
Bioinformatics Resources at a Glance A Note about FASTA Format There are MANY free bioinformatics tools available online. Bioinformaticists have developed a standard format for nucleotide and protein sequences
More informationNetwork Protocol Analysis using Bioinformatics Algorithms
Network Protocol Analysis using Bioinformatics Algorithms Marshall A. Beddoe Marshall_Beddoe@McAfee.com ABSTRACT Network protocol analysis is currently performed by hand using only intuition and a protocol
More informationIntroduction to Bioinformatics 3. DNA editing and contig assembly
Introduction to Bioinformatics 3. DNA editing and contig assembly Benjamin F. Matthews United States Department of Agriculture Soybean Genomics and Improvement Laboratory Beltsville, MD 20708 matthewb@ba.ars.usda.gov
More informationBIOINFORMATICS TUTORIAL
Bio 242 BIOINFORMATICS TUTORIAL Bio 242 α Amylase Lab Sequence Sequence Searches: BLAST Sequence Alignment: Clustal Omega 3d Structure & 3d Alignments DO NOT REMOVE FROM LAB. DO NOT WRITE IN THIS DOCUMENT.
More informationBioinformatics Grid - Enabled Tools For Biologists.
Bioinformatics Grid - Enabled Tools For Biologists. What is Grid-Enabled Tools (GET)? As number of data from the genomics and proteomics experiment increases. Problems arise for the current sequence analysis
More informationHow To Cluster Of Complex Systems
Entropy based Graph Clustering: Application to Biological and Social Networks Edward C Kenley Young-Rae Cho Department of Computer Science Baylor University Complex Systems Definition Dynamically evolving
More informationDynamic Programming. Lecture 11. 11.1 Overview. 11.2 Introduction
Lecture 11 Dynamic Programming 11.1 Overview Dynamic Programming is a powerful technique that allows one to solve many different types of problems in time O(n 2 ) or O(n 3 ) for which a naive approach
More informationIntroduction to Bioinformatics AS 250.265 Laboratory Assignment 6
Introduction to Bioinformatics AS 250.265 Laboratory Assignment 6 In the last lab, you learned how to perform basic multiple sequence alignments. While useful in themselves for determining conserved residues
More informationCD-HIT User s Guide. Last updated: April 5, 2010. http://cd-hit.org http://bioinformatics.org/cd-hit/
CD-HIT User s Guide Last updated: April 5, 2010 http://cd-hit.org http://bioinformatics.org/cd-hit/ Program developed by Weizhong Li s lab at UCSD http://weizhong-lab.ucsd.edu liwz@sdsc.edu 1. Introduction
More informationA Tutorial in Genetic Sequence Classification Tools and Techniques
A Tutorial in Genetic Sequence Classification Tools and Techniques Jake Drew Data Mining CSE 8331 Southern Methodist University jakemdrew@gmail.com www.jakemdrew.com Sequence Characters IUPAC nucleotide
More informationChoices, choices, choices... Which sequence database? Which modifications? What mass tolerance?
Optimization 1 Choices, choices, choices... Which sequence database? Which modifications? What mass tolerance? Where to begin? 2 Sequence Databases Swiss-prot MSDB, NCBI nr dbest Species specific ORFS
More informationWeb Data Extraction: 1 o Semestre 2007/2008
Web Data : Given Slides baseados nos slides oficiais do livro Web Data Mining c Bing Liu, Springer, December, 2006. Departamento de Engenharia Informática Instituto Superior Técnico 1 o Semestre 2007/2008
More informationHeuristics for the Sorting by Length-Weighted Inversions Problem on Signed Permutations
Heuristics for the Sorting by Length-Weighted Inversions Problem on Signed Permutations AlCoB 2014 First International Conference on Algorithms for Computational Biology Thiago da Silva Arruda Institute
More informationT cell Epitope Prediction
Institute for Immunology and Informatics T cell Epitope Prediction EpiMatrix Eric Gustafson January 6, 2011 Overview Gathering raw data Popular sources Data Management Conservation Analysis Multiple Alignments
More informationGenetic Algorithms and Sudoku
Genetic Algorithms and Sudoku Dr. John M. Weiss Department of Mathematics and Computer Science South Dakota School of Mines and Technology (SDSM&T) Rapid City, SD 57701-3995 john.weiss@sdsmt.edu MICS 2009
More informationDatabase searching with DNA and protein sequences: An introduction Clare Sansom Date received (in revised form): 12th November 1999
Dr Clare Sansom works part time at Birkbeck College, London, and part time as a freelance computer consultant and science writer At Birkbeck she coordinates an innovative graduate-level Advanced Certificate
More informationPractical Graph Mining with R. 5. Link Analysis
Practical Graph Mining with R 5. Link Analysis Outline Link Analysis Concepts Metrics for Analyzing Networks PageRank HITS Link Prediction 2 Link Analysis Concepts Link A relationship between two entities
More informationSGI. High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems. January, 2012. Abstract. Haruna Cofer*, PhD
White Paper SGI High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems Haruna Cofer*, PhD January, 2012 Abstract The SGI High Throughput Computing (HTC) Wrapper
More informationPhylogenetic Trees Made Easy
Phylogenetic Trees Made Easy A How-To Manual Fourth Edition Barry G. Hall University of Rochester, Emeritus and Bellingham Research Institute Sinauer Associates, Inc. Publishers Sunderland, Massachusetts
More informationTHREE DIMENSIONAL REPRESENTATION OF AMINO ACID CHARAC- TERISTICS
THREE DIMENSIONAL REPRESENTATION OF AMINO ACID CHARAC- TERISTICS O.U. Sezerman 1, R. Islamaj 2, E. Alpaydin 2 1 Laborotory of Computational Biology, Sabancı University, Istanbul, Turkey. 2 Computer Engineering
More informationAlgorithms in Bioinformatics I, WS06/07, C.Dieterich 47. This lecture is based on the following, which are all recommended reading:
Algorithms in Bioinformatics I, WS06/07, C.Dieterich 47 5 BLAST and FASTA This lecture is based on the following, which are all recommended reading: D.J. Lipman and W.R. Pearson, Rapid and Sensitive Protein
More informationMORPHEUS. http://biodev.cea.fr/morpheus/ Prediction of Transcription Factors Binding Sites based on Position Weight Matrix.
MORPHEUS http://biodev.cea.fr/morpheus/ Prediction of Transcription Factors Binding Sites based on Position Weight Matrix. Reference: MORPHEUS, a Webtool for Transcripton Factor Binding Analysis Using
More informationMultiple Sequence Alignment. Hot Topic 5/24/06 Kim Walker
Multiple Sequence Alignment Hot Topic 5/24/06 Kim Walker Outline Why are Multiple Sequence Alignments useful? What Tools are Available? Brief Introduction to ClustalX Tools to Edit and Add Features to
More informationClone Manager. Getting Started
Clone Manager for Windows Professional Edition Volume 2 Alignment, Primer Operations Version 9.5 Getting Started Copyright 1994-2015 Scientific & Educational Software. All rights reserved. The software
More informationPROC. CAIRO INTERNATIONAL BIOMEDICAL ENGINEERING CONFERENCE 2006 1. E-mail: msm_eng@k-space.org
BIOINFTool: Bioinformatics and sequence data analysis in molecular biology using Matlab Mai S. Mabrouk 1, Marwa Hamdy 2, Marwa Mamdouh 2, Marwa Aboelfotoh 2,Yasser M. Kadah 2 1 Biomedical Engineering Department,
More informationDNA Insertions and Deletions in the Human Genome. Philipp W. Messer
DNA Insertions and Deletions in the Human Genome Philipp W. Messer Genetic Variation CGACAATAGCGCTCTTACTACGTGTATCG : : CGACAATGGCGCT---ACTACGTGCATCG 1. Nucleotide mutations 2. Genomic rearrangements 3.
More informationGenome Explorer For Comparative Genome Analysis
Genome Explorer For Comparative Genome Analysis Jenn Conn 1, Jo L. Dicks 1 and Ian N. Roberts 2 Abstract Genome Explorer brings together the tools required to build and compare phylogenies from both sequence
More informationThe PageRank Citation Ranking: Bring Order to the Web
The PageRank Citation Ranking: Bring Order to the Web presented by: Xiaoxi Pang 25.Nov 2010 1 / 20 Outline Introduction A ranking for every page on the Web Implementation Convergence Properties Personalized
More informationA linear combination is a sum of scalars times quantities. Such expressions arise quite frequently and have the form
Section 1.3 Matrix Products A linear combination is a sum of scalars times quantities. Such expressions arise quite frequently and have the form (scalar #1)(quantity #1) + (scalar #2)(quantity #2) +...
More information2) Write in detail the issues in the design of code generator.
COMPUTER SCIENCE AND ENGINEERING VI SEM CSE Principles of Compiler Design Unit-IV Question and answers UNIT IV CODE GENERATION 9 Issues in the design of code generator The target machine Runtime Storage
More informationA permutation can also be represented by describing its cycles. What do you suppose is meant by this?
Shuffling, Cycles, and Matrices Warm up problem. Eight people stand in a line. From left to right their positions are numbered,,,... 8. The eight people then change places according to THE RULE which directs
More informationHierarchical Bayesian Modeling of the HIV Response to Therapy
Hierarchical Bayesian Modeling of the HIV Response to Therapy Shane T. Jensen Department of Statistics, The Wharton School, University of Pennsylvania March 23, 2010 Joint Work with Alex Braunstein and
More informationData Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining
Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/8/2004 Hierarchical
More informationHidden Markov Models
8.47 Introduction to omputational Molecular Biology Lecture 7: November 4, 2004 Scribe: Han-Pang hiu Lecturer: Ross Lippert Editor: Russ ox Hidden Markov Models The G island phenomenon The nucleotide frequencies
More informationBIO 3350: ELEMENTS OF BIOINFORMATICS PARTIALLY ONLINE SYLLABUS
BIO 3350: ELEMENTS OF BIOINFORMATICS PARTIALLY ONLINE SYLLABUS NEW YORK CITY COLLEGE OF TECHNOLOGY The City University Of New York School of Arts and Sciences Biological Sciences Department Course title:
More informationLecture 4: Exact string searching algorithms. Exact string search algorithms. Definitions. Exact string searching or matching
COSC 348: Computing for Bioinformatics Definitions A pattern (keyword) is an ordered sequence of symbols. Lecture 4: Exact string searching algorithms Lubica Benuskova http://www.cs.otago.ac.nz/cosc348/
More informationGraph Mining and Social Network Analysis
Graph Mining and Social Network Analysis Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) References Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", The Morgan Kaufmann
More informationGold (Genetic Optimization for Ligand Docking) G. Jones et al. 1996
Gold (Genetic Optimization for Ligand Docking) G. Jones et al. 1996 LMU Institut für Informatik, LFE Bioinformatik, Cheminformatics, Structure based methods J. Apostolakis 1 Genetic algorithms Inspired
More informationDATA ANALYSIS II. Matrix Algorithms
DATA ANALYSIS II Matrix Algorithms Similarity Matrix Given a dataset D = {x i }, i=1,..,n consisting of n points in R d, let A denote the n n symmetric similarity matrix between the points, given as where
More information14.10.2014. Overview. Swarms in nature. Fish, birds, ants, termites, Introduction to swarm intelligence principles Particle Swarm Optimization (PSO)
Overview Kyrre Glette kyrrehg@ifi INF3490 Swarm Intelligence Particle Swarm Optimization Introduction to swarm intelligence principles Particle Swarm Optimization (PSO) 3 Swarms in nature Fish, birds,
More informationA greedy algorithm for the DNA sequencing by hybridization with positive and negative errors and information about repetitions
BULLETIN OF THE POLISH ACADEMY OF SCIENCES TECHNICAL SCIENCES, Vol. 59, No. 1, 2011 DOI: 10.2478/v10175-011-0015-0 Varia A greedy algorithm for the DNA sequencing by hybridization with positive and negative
More informationFocusing on results not data comprehensive data analysis for targeted next generation sequencing
Focusing on results not data comprehensive data analysis for targeted next generation sequencing Daniel Swan, Jolyon Holdstock, Angela Matchan, Richard Stark, John Shovelton, Duarte Mohla and Simon Hughes
More informationClustering & Visualization
Chapter 5 Clustering & Visualization Clustering in high-dimensional databases is an important problem and there are a number of different clustering paradigms which are applicable to high-dimensional data.
More informationD A T A M I N I N G C L A S S I F I C A T I O N
D A T A M I N I N G C L A S S I F I C A T I O N FABRICIO VOZNIKA LEO NARDO VIA NA INTRODUCTION Nowadays there is huge amount of data being collected and stored in databases everywhere across the globe.
More informationIE 680 Special Topics in Production Systems: Networks, Routing and Logistics*
IE 680 Special Topics in Production Systems: Networks, Routing and Logistics* Rakesh Nagi Department of Industrial Engineering University at Buffalo (SUNY) *Lecture notes from Network Flows by Ahuja, Magnanti
More informationScheduling Shop Scheduling. Tim Nieberg
Scheduling Shop Scheduling Tim Nieberg Shop models: General Introduction Remark: Consider non preemptive problems with regular objectives Notation Shop Problems: m machines, n jobs 1,..., n operations
More informationCOMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction
COMP3420: Advanced Databases and Data Mining Classification and prediction: Introduction and Decision Tree Induction Lecture outline Classification versus prediction Classification A two step process Supervised
More informationLabGenius. Technical design notes. The world s most advanced synthetic DNA libraries. hi@labgeni.us V1.5 NOV 15
LabGenius The world s most advanced synthetic DNA libraries Technical design notes hi@labgeni.us V1.5 NOV 15 Introduction OUR APPROACH LabGenius is a gene synthesis company focussed on the design and manufacture
More informationLecture 1: Systems of Linear Equations
MTH Elementary Matrix Algebra Professor Chao Huang Department of Mathematics and Statistics Wright State University Lecture 1 Systems of Linear Equations ² Systems of two linear equations with two variables
More informationIntroduction to Phylogenetic Analysis
Subjects of this lecture Introduction to Phylogenetic nalysis Irit Orr 1 Introducing some of the terminology of phylogenetics. 2 Introducing some of the most commonly used methods for phylogenetic analysis.
More informationAnalyzing A DNA Sequence Chromatogram
LESSON 9 HANDOUT Analyzing A DNA Sequence Chromatogram Student Researcher Background: DNA Analysis and FinchTV DNA sequence data can be used to answer many types of questions. Because DNA sequences differ
More informationWelcome to the Plant Breeding and Genomics Webinar Series
Welcome to the Plant Breeding and Genomics Webinar Series Today s Presenter: Dr. Candice Hansey Presentation: http://www.extension.org/pages/ 60428 Host: Heather Merk Technical Production: John McQueen
More informationClustering and scheduling maintenance tasks over time
Clustering and scheduling maintenance tasks over time Per Kreuger 2008-04-29 SICS Technical Report T2008:09 Abstract We report results on a maintenance scheduling problem. The problem consists of allocating
More information131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10
1/10 131-1 Adding New Level in KDD to Make the Web Usage Mining More Efficient Mohammad Ala a AL_Hamami PHD Student, Lecturer m_ah_1@yahoocom Soukaena Hassan Hashem PHD Student, Lecturer soukaena_hassan@yahoocom
More informationLearning from Diversity
Learning from Diversity Epitope Prediction with Sequence and Structure Features using an Ensemble of Support Vector Machines Rob Patro and Carl Kingsford Center for Bioinformatics and Computational Biology
More informationSo today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02)
Internet Technology Prof. Indranil Sengupta Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture No #39 Search Engines and Web Crawler :: Part 2 So today we
More information3. Interpolation. Closing the Gaps of Discretization... Beyond Polynomials
3. Interpolation Closing the Gaps of Discretization... Beyond Polynomials Closing the Gaps of Discretization... Beyond Polynomials, December 19, 2012 1 3.3. Polynomial Splines Idea of Polynomial Splines
More informationA Review And Evaluations Of Shortest Path Algorithms
A Review And Evaluations Of Shortest Path Algorithms Kairanbay Magzhan, Hajar Mat Jani Abstract: Nowadays, in computer networks, the routing is based on the shortest path problem. This will help in minimizing
More informationMemory Allocation Technique for Segregated Free List Based on Genetic Algorithm
Journal of Al-Nahrain University Vol.15 (2), June, 2012, pp.161-168 Science Memory Allocation Technique for Segregated Free List Based on Genetic Algorithm Manal F. Younis Computer Department, College
More informationData Preprocessing. Week 2
Data Preprocessing Week 2 Topics Data Types Data Repositories Data Preprocessing Present homework assignment #1 Team Homework Assignment #2 Read pp. 227 240, pp. 250 250, and pp. 259 263 the text book.
More informationFrom DNA to Protein. Proteins. Chapter 13. Prokaryotes and Eukaryotes. The Path From Genes to Proteins. All proteins consist of polypeptide chains
Proteins From DNA to Protein Chapter 13 All proteins consist of polypeptide chains A linear sequence of amino acids Each chain corresponds to the nucleotide base sequence of a gene The Path From Genes
More informationMASCOT Search Results Interpretation
The Mascot protein identification program (Matrix Science, Ltd.) uses statistical methods to assess the validity of a match. MS/MS data is not ideal. That is, there are unassignable peaks (noise) and usually
More informationTechnology, Kolkata, INDIA, pal.sanjaykumar@gmail.com. sssarma2001@yahoo.com
Sanjay Kumar Pal 1 and Samar Sen Sarma 2 1 Department of Computer Science & Applications, NSHM College of Management & Technology, Kolkata, INDIA, pal.sanjaykumar@gmail.com 2 Department of Computer Science
More informationMolecular Databases and Tools
NWeHealth, The University of Manchester Molecular Databases and Tools Afternoon Session: NCBI/EBI resources, pairwise alignment, BLAST, multiple sequence alignment and primer finding. Dr. Georgina Moulton
More informationEmpirically Identifying the Best Genetic Algorithm for Covering Array Generation
Empirically Identifying the Best Genetic Algorithm for Covering Array Generation Liang Yalan 1, Changhai Nie 1, Jonathan M. Kauffman 2, Gregory M. Kapfhammer 2, Hareton Leung 3 1 Department of Computer
More informationGene Finding CMSC 423
Gene Finding CMSC 423 Finding Signals in DNA We just have a long string of A, C, G, Ts. How can we find the signals encoded in it? Suppose you encountered a language you didn t know. How would you decipher
More informationSYSM 6304: Risk and Decision Analysis Lecture 5: Methods of Risk Analysis
SYSM 6304: Risk and Decision Analysis Lecture 5: Methods of Risk Analysis M. Vidyasagar Cecil & Ida Green Chair The University of Texas at Dallas Email: M.Vidyasagar@utdallas.edu October 17, 2015 Outline
More informationThree Effective Top-Down Clustering Algorithms for Location Database Systems
Three Effective Top-Down Clustering Algorithms for Location Database Systems Kwang-Jo Lee and Sung-Bong Yang Department of Computer Science, Yonsei University, Seoul, Republic of Korea {kjlee5435, yang}@cs.yonsei.ac.kr
More informationHierarchical Data Visualization
Hierarchical Data Visualization 1 Hierarchical Data Hierarchical data emphasize the subordinate or membership relations between data items. Organizational Chart Classifications / Taxonomies (Species and
More informationChapter 19. General Matrices. An n m matrix is an array. a 11 a 12 a 1m a 21 a 22 a 2m A = a n1 a n2 a nm. The matrix A has n row vectors
Chapter 9. General Matrices An n m matrix is an array a a a m a a a m... = [a ij]. a n a n a nm The matrix A has n row vectors and m column vectors row i (A) = [a i, a i,..., a im ] R m a j a j a nj col
More informationData Mining. Cluster Analysis: Advanced Concepts and Algorithms
Data Mining Cluster Analysis: Advanced Concepts and Algorithms Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 More Clustering Methods Prototype-based clustering Density-based clustering Graph-based
More informationCompact Representations and Approximations for Compuation in Games
Compact Representations and Approximations for Compuation in Games Kevin Swersky April 23, 2008 Abstract Compact representations have recently been developed as a way of both encoding the strategic interactions
More informationPart 2: Community Detection
Chapter 8: Graph Data Part 2: Community Detection Based on Leskovec, Rajaraman, Ullman 2014: Mining of Massive Datasets Big Data Management and Analytics Outline Community Detection - Social networks -
More informationEffect of Using Neural Networks in GA-Based School Timetabling
Effect of Using Neural Networks in GA-Based School Timetabling JANIS ZUTERS Department of Computer Science University of Latvia Raina bulv. 19, Riga, LV-1050 LATVIA janis.zuters@lu.lv Abstract: - The school
More informationA Step-by-Step Tutorial: Divergence Time Estimation with Approximate Likelihood Calculation Using MCMCTREE in PAML
9 June 2011 A Step-by-Step Tutorial: Divergence Time Estimation with Approximate Likelihood Calculation Using MCMCTREE in PAML by Jun Inoue, Mario dos Reis, and Ziheng Yang In this tutorial we will analyze
More informationAn Introduction to the Use of Bayesian Network to Analyze Gene Expression Data
n Introduction to the Use of ayesian Network to nalyze Gene Expression Data Cristina Manfredotti Dipartimento di Informatica, Sistemistica e Comunicazione (D.I.S.Co. Università degli Studi Milano-icocca
More informationArbres formels et Arbre(s) de la Vie
Arbres formels et Arbre(s) de la Vie A bit of history and biology Definitions Numbers Topological distances Consensus Random models Algorithms to build trees Basic principles DATA sequence alignment distance
More informationComputational searches of biological sequences
UNAM, México, Enero 78 Computational searches of biological sequences Special thanks to all the scientis that made public available their presentations throughout the web from where many slides were taken
More informationLecture 3. Linear Programming. 3B1B Optimization Michaelmas 2015 A. Zisserman. Extreme solutions. Simplex method. Interior point method
Lecture 3 3B1B Optimization Michaelmas 2015 A. Zisserman Linear Programming Extreme solutions Simplex method Interior point method Integer programming and relaxation The Optimization Tree Linear Programming
More informationAnalysis of Algorithms, I
Analysis of Algorithms, I CSOR W4231.002 Eleni Drinea Computer Science Department Columbia University Thursday, February 26, 2015 Outline 1 Recap 2 Representing graphs 3 Breadth-first search (BFS) 4 Applications
More informationUCHIME in practice Single-region sequencing Reference database mode
UCHIME in practice Single-region sequencing UCHIME is designed for experiments that perform community sequencing of a single region such as the 16S rrna gene or fungal ITS region. While UCHIME may prove
More information! Solve problem to optimality. ! Solve problem in poly-time. ! Solve arbitrary instances of the problem. #-approximation algorithm.
Approximation Algorithms 11 Approximation Algorithms Q Suppose I need to solve an NP-hard problem What should I do? A Theory says you're unlikely to find a poly-time algorithm Must sacrifice one of three
More information6. Cholesky factorization
6. Cholesky factorization EE103 (Fall 2011-12) triangular matrices forward and backward substitution the Cholesky factorization solving Ax = b with A positive definite inverse of a positive definite matrix
More informationVector storage and access; algorithms in GIS. This is lecture 6
Vector storage and access; algorithms in GIS This is lecture 6 Vector data storage and access Vectors are built from points, line and areas. (x,y) Surface: (x,y,z) Vector data access Access to vector
More informationClustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca
Clustering Adrian Groza Department of Computer Science Technical University of Cluj-Napoca Outline 1 Cluster Analysis What is Datamining? Cluster Analysis 2 K-means 3 Hierarchical Clustering What is Datamining?
More informationMining Social-Network Graphs
342 Chapter 10 Mining Social-Network Graphs There is much information to be gained by analyzing the large-scale data that is derived from social networks. The best-known example of a social network is
More informationLecture 1: Course overview, circuits, and formulas
Lecture 1: Course overview, circuits, and formulas Topics in Complexity Theory and Pseudorandomness (Spring 2013) Rutgers University Swastik Kopparty Scribes: John Kim, Ben Lund 1 Course Information Swastik
More informationPrinciples of Evolution - Origin of Species
Theories of Organic Evolution X Multiple Centers of Creation (de Buffon) developed the concept of "centers of creation throughout the world organisms had arisen, which other species had evolved from X
More informationA Brief Study of the Nurse Scheduling Problem (NSP)
A Brief Study of the Nurse Scheduling Problem (NSP) Lizzy Augustine, Morgan Faer, Andreas Kavountzis, Reema Patel Submitted Tuesday December 15, 2009 0. Introduction and Background Our interest in the
More information