MATCH Commun. Math. Comput. Chem. 61 (2009)
MATCH Communications in Mathematical and in Computer Chemistry

Three distances for rapid similarity analysis of DNA sequences

Wei Chen, Yusen Zhang
School of Mathematics and Statistics, Shandong University at Weihai, Weihai, China
(Received July 14, 2008)

Abstract. Three distances for assessing genomic similarity based on dinucleotide frequencies in large DNA sequences are introduced. The method requires neither homologous sequences nor prior sequence alignments. The analysis centers on symmetrized dinucleotide frequencies, which reflect DNA structures related to dinucleotide stacking energies and constraints of DNA curvature. To show the utility of the method, we use these distances to examine the similarities among the first exon of the β-globin gene for 11 different species.

1 Introduction

The traditional algorithms for similarity analysis and phylogenetic inference are based mostly on multiple alignment [1], and such approaches have been widely used. For large genomic sequences, however, alignment is generally not feasible. To overcome this problem, a growing number of researchers have turned to alignment-free methods for DNA sequence comparison and analysis [2, 3]. More recently, alternative routes to quantitative measures of DNA sequences have been considered [4, 5]. This methodology starts with a graphical representation of DNA, such as those proposed in [6-10], which allows visual inspection of lengthy sequences. It was shown that the graphical representation can be characterized numerically to quantify the degree of similarity/dissimilarity of different DNA sequences. This is accomplished by associating with the graphical representation of DNA a corresponding mathematical object, such as a matrix, and then using various properties of that object, like matrix invariants, as sequence descriptors.
In this way one arrives at an alternative approach for comparative studies of DNA that is less computer-intensive: the original DNA sequence is replaced by an ordered set of sequence invariants, which can be viewed as components of a vector, so that the comparison of sequences is transformed into a simpler comparison of vectors rather than a direct comparison of the sequences themselves [11-16]. An important advantage of characterizing structures by invariants, as opposed to codes, is the simplicity of the comparison based on invariants. Although other invariants have also been used [17, 18], the calculation of invariants, especially eigenvalues, becomes increasingly difficult as the order of the matrix grows. In [19], a new method based on the double-stranded nature of DNA was proposed to construct similarity matrices. Such a matrix allows one to characterize DNA sequences mathematically and to make quantitative comparisons between DNA sequences from the same or from different species. In this paper we consider the properties of neighboring nucleotide pairs (dinucleotides) in a DNA sequence and propose three distances based on symmetrized dinucleotide frequencies, which reflect DNA structures related to dinucleotide stacking energies and constraints of DNA curvature. The distances are applicable to both short and long DNA sequences. As an application, we compare the first exon of the β-globin gene for 11 different species.

Corresponding author: [email protected]

2 Symmetrized dinucleotide frequency

Consider a DNA sequence with $n$ bases, read from the 5'-end to the 3'-end. The number of occurrences of the nucleotide X (A, C, G, or T) is denoted by the positive integer $X_n$. By considering pairs of neighboring bases, we obtain sixteen dinucleotides XY: AG, GA, CT, TC, AC, CA, GT, TG, AT, TA, CG, GC, AA, CC, GG and TT. The number of occurrences of the dinucleotide XY is denoted by the positive integer $XY_n$. Let $f_X$ denote the frequency of the nucleotide X and $f_{XY}$ the frequency of the dinucleotide XY. Then $f_X = X_n/n$ and $f_{XY} = XY_n/(n-1)$, since a sequence of $n$ bases contains $n-1$ overlapping dinucleotides. Because DNA structures are influenced by the oligonucleotide compositions of both strands (e.g., stacking energies), the frequencies are modified to accommodate the double-stranded nature of DNA by combining the given sequence with its inverted complement sequence. In this context, the frequency $f_A$ is symmetrized to $f^*_A = f^*_T = (f_A + f_T)/2$, and $f_C$ to $f^*_C = f^*_G = (f_C + f_G)/2$.
Similarly, $f^*_{GT} = f^*_{AC} = (f_{GT} + f_{AC})/2$ is the symmetrized double-stranded frequency of GT/AC, and so on for the other dinucleotides.

3 Proposed distance

Before presenting our main result, we define three distances between two DNA sequences. Let $f$ be a DNA sequence, and let $f^*_{XY}$ denote the symmetrized frequency of the dinucleotide XY. The dinucleotide frequency matrix is defined by

\[
F(f) =
\begin{pmatrix}
f^*_{AT} & f^*_{AA} & f^*_{AC} & f^*_{AG} \\
f^*_{TT} & f^*_{TA} & f^*_{TC} & f^*_{TG} \\
f^*_{GT} & f^*_{GA} & f^*_{GC} & f^*_{GG} \\
f^*_{CT} & f^*_{CA} & f^*_{CC} & f^*_{CG}
\end{pmatrix}
\tag{1}
\]

In this way we obtain a correspondence between the DNA sequence and the dinucleotide frequency matrix $F(f)$, so a DNA sequence can be analyzed by studying the corresponding matrix. The matrix $F(f)$ is real and symmetric: each entry $(i,j)$ and its mirror entry $(j,i)$ hold a dinucleotide and its inverted complement (for example AC and GT), whose symmetrized frequencies are equal. Given two sequences $f$ and $g$ (e.g., sequences from different organisms or from different regions of a genomic sequence), we obtain the dinucleotide frequency matrices $F(f)$ and $F(g)$, and the comparison between DNA sequences becomes a comparison between these matrices. Based on this idea, the first dinucleotide frequency distance $d_1(f,g)$ is defined as

\[
d_1(f,g) = \sum_{i,j} \left| F_{ij}(f) - F_{ij}(g) \right| ,
\]

where the sum extends over all dinucleotides.

Another distance measure follows from the idea of building the variance-covariance matrix corresponding to the dinucleotide frequency matrix $F(f)$. The variance-covariance matrix $CF(f)$ consists of the variances of the variables along the main diagonal and the covariances between each pair of variables in the other positions. If the row vectors of the dinucleotide frequency matrix $F(f)$ are denoted by $X_i$ ($i = 1, 2, 3, 4$), with components $X_{ik}$, then the covariance of $X_i$ and $X_j$ is computed as

\[
CF(f)_{ij} = \sum_{k=1}^{4} \left( X_{ik} - X_i^0 \right) \left( X_{jk} - X_j^0 \right) ,
\]

where $X_i^0$ and $X_j^0$ denote the means of $X_i$ and $X_j$, respectively. Given two sequences $f$ and $g$, the second distance measure is defined as

\[
d_2(f,g) = \sum_{i,j} w_{ij} \left| CF(f)_{ij} - CF(g)_{ij} \right| ,
\]

where the sum extends over all dinucleotides and $w_{ij} = 16$ or some other natural weights.
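The definitions of $F(f)$, $d_1$ and $d_2$ above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names are hypothetical, the covariance is taken as the unnormalized sum of products of deviations, the weight is a single constant `w` (16 by default), and sequences are assumed to contain only A, C, G, T and to have at least two bases.

```python
from itertools import product

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
ROWS = "ATGC"  # first base of each row of F(f) in Eq. (1)
COLS = "TACG"  # second base of each column of F(f) in Eq. (1)


def reverse_complement(word):
    """Return the inverted complement of a DNA word, e.g. GT -> AC."""
    return "".join(COMPLEMENT[b] for b in reversed(word))


def symmetrized_freqs(seq):
    """Symmetrized dinucleotide frequencies (f_XY + f_rc(XY)) / 2."""
    seq = seq.upper()
    n = len(seq)
    counts = {a + b: 0 for a, b in product("ACGT", repeat=2)}
    for i in range(n - 1):
        counts[seq[i:i + 2]] += 1  # assumption: only A, C, G, T occur
    f = {xy: c / (n - 1) for xy, c in counts.items()}
    return {xy: (f[xy] + f[reverse_complement(xy)]) / 2 for xy in f}


def frequency_matrix(seq):
    """4x4 dinucleotide frequency matrix laid out as in Eq. (1)."""
    fs = symmetrized_freqs(seq)
    return [[fs[r + c] for c in COLS] for r in ROWS]


def d1(seq_f, seq_g):
    """Sum of absolute entrywise differences of the two matrices."""
    F, G = frequency_matrix(seq_f), frequency_matrix(seq_g)
    return sum(abs(F[i][j] - G[i][j]) for i in range(4) for j in range(4))


def covariance_matrix(M):
    """Products of deviations of the rows of M, summed over columns.

    Assumption: no 1/4 or 1/3 normalization is applied.
    """
    means = [sum(row) / len(row) for row in M]
    size = len(M)
    return [[sum((M[i][k] - means[i]) * (M[j][k] - means[j])
                 for k in range(len(M[i])))
             for j in range(size)] for i in range(size)]


def d2(seq_f, seq_g, w=16.0):
    """Weighted sum of absolute differences of the covariance matrices."""
    CF = covariance_matrix(frequency_matrix(seq_f))
    CG = covariance_matrix(frequency_matrix(seq_g))
    return w * sum(abs(CF[i][j] - CG[i][j])
                   for i in range(4) for j in range(4))


if __name__ == "__main__":
    # Goat and bovine sequences as listed in Table 1.
    goat = ("ATGCTGACTGCTGAGGAGAAGGCTGCCGTCACCGGCTTCTGGGGCAAGGTGAAAGT"
            "GGATGAAGTTGGTGCTGAGGCCCTGGGCAG")
    bovine = ("ATGCTGACTGCTGAGGAGAAGGCTGCCGTCACCGCCTTTTGGGGCAAGGTGAAAGT"
              "GGATGAAGTTGGTGGTGAGGCCCTGGGCAG")
    print("d1 =", d1(goat, bovine))
    print("d2 =", d2(goat, bovine))
```

Because the frequencies are symmetrized, a sequence and its inverted complement give the same matrix, so both distances between them are zero. Values produced by this sketch need not match the published tables exactly, since the normalization and weights are assumptions.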
Table 1: The coding sequences of the first exon of the β-globin gene of eleven different species

Species      Coding sequence
Human        ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATTAAGTTGGTGGTGAGGCCCTGGGCAG
Goat         ATGCTGACTGCTGAGGAGAAGGCTGCCGTCACCGGCTTCTGGGGCAAGGTGAAAGTGGATGAAGTTGGTGCTGAGGCCCTGGGCAG
Opossum      ATGGTGCACTTGACTTCTGAGGAGAAGAACTGCATCACTACCATCTGGTCTAAGGTGCAGGTTGACCAGACTGGTGGTGAGGCCCTTGGCAG
Gallus       ATGGTGCACTGGACTGCTGAGGAGAAGCAGCTCATCACCGGCCTCTGGGGCAAGGTCAATGTGGCCGAATGTGGGGCCGAAGCCCTGGCCAG
Lemur        ATGACTTTGCTGAGTGCTGAGGAGAATGCTCATGTCACCTCTCTGTGGGGCAAGGTGGATGTAGAGAAAGTTGGTGGCGAGGCCTTGGGCAG
Mouse        ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGTCTCTTGCCTGTGGGGAAAGGTGAACTCCGATGAAGTTGGTGGTGAGGCCCTGGGCAG
Rabbit       ATGGTGCATCTGTCCAGTGAGGAGAAGTCTGCGGTCACTGCCCTGTGGGGCAAGGTGATTGTGGAAGAAGTTGGTGGTGAGGCCCTGGGCAG
Rat          ATGGTGCACCTAACTGATGCTGAGAAGGCTACTGTTAGTGGCCTGTGGGGAAAGGTGAACCCTGATAATGTTGGCGCTGAGGCCCTGGGCAG
Gorilla      ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGG
Bovine       ATGCTGACTGCTGAGGAGAAGGCTGCCGTCACCGCCTTTTGGGGCAAGGTGAAAGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAG
Chimpanzee   ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGTTGGTATCAAGG

Before presenting the third distance, we need to consider the ratio measure $P_{XY}$, that is, the ratio of the frequency $f_{XY}$ of a dinucleotide XY to the frequency $f_X$ of the nucleotide X. This ratio is suitable for a single strand. To compare sequences from different organisms (or from different chromosomes), the formula has to be modified to account for the complementary antiparallel structure of double-stranded DNA. Accordingly, $P_{XY}$ is defined from the symmetrized frequencies as $P_{XY} = f^*_{XY}/f^*_X$. For example,

\[
P_{GT} = \frac{f^*_{GT}}{f^*_G} = \frac{f_{GT} + f_{AC}}{f_C + f_G} = \frac{n\,(GT_n + AC_n)}{(n-1)\,(C_n + G_n)} ,
\]

and similarly for the other dinucleotides. For any DNA sequence $f$, we construct a 16-component vector

\[
V(f) = (P_{AA}, P_{AT}, P_{AG}, P_{AC}, P_{AT}, P_{TT}, P_{TG}, P_{TC}, P_{AG}, P_{TG}, P_{GG}, P_{GC}, P_{AC}, P_{TC}, P_{GC}, P_{CC}) ,
\]

which gives a correspondence between the DNA sequence and the vector $V(f)$. Given two sequences $f$ and $g$, the third distance measure is defined as

\[
d_3(f,g) = \sqrt{\sum_{i,j} \left( P_{ij}(f) - P_{ij}(g) \right)^2} ,
\]

where the sum extends over all dinucleotides. A pair of DNA sequences can be compared to judge their similarities and dissimilarities by calculating the distance $d_3(f,g)$. The analysis of similarity among these DNA sequences is based on the assumption that the smaller the Euclidean distance, the more similar the two DNA sequences.

Table 2: The upper triangular part of the similarity/dissimilarity matrix based on the distance measure $d_1(f,g)$ of the 11 coding sequences of Table 1. Rows and columns: Human, Goat, Opossum, Gallus, Lemur, Mouse, Rabbit, Rat, Gorilla, Bovine, Chimpanzee (numerical entries not recoverable).

4 Results and discussion

The computation of the proposed distances is simple and alignment-free. Unlike most existing methods for analyzing the similarity of DNA sequences, the proposed method requires neither gene identification nor prior biological knowledge such as an accurate alignment score matrix. To show the utility of the method, we use these distances to examine the similarities among the first exon of the β-globin gene for 11 different species. Table 1 lists the first exon of the β-globin gene for the 11 species, as reported by Randić et al. [7].

Table 2 presents the similarity/dissimilarity matrix based on the distance measure $d_1(f,g)$ of the 11 coding sequences of Table 1. The smallest entries are associated with the pairs human and chimpanzee [$d_1 = 0.0249$], mouse and rabbit [$d_1 = 0.0283$], human and gorilla [$d_1 = 0.0329$] and gorilla and chimpanzee [$d_1 = 0.0350$]. The greatest distance among the 11 coding sequences, $d_1 = 0.1547$, is observed between gallus (the only non-mammalian representative) and opossum (the most remote species from the remaining mammals), and the larger entries in the similarity matrix appear in the rows belonging to opossum and gallus.

Table 3: The upper triangular part of the similarity/dissimilarity matrix based on the distance measure $d_2(f,g)$ of the 11 coding sequences of Table 1. Rows and columns: Human, Goat, Opossum, Gallus, Lemur, Mouse, Rabbit, Rat, Gorilla, Bovine, Chimpanzee (numerical entries not recoverable).

In Table 3, the similarity/dissimilarity matrix is based on the distance measure $d_2(f,g)$. Table 3 shows that gallus is very dissimilar to the other species, because its row has larger entries. This is consistent with the fact that gallus is the only non-mammal among the 11 species. On the other hand, the closest pairs are human and chimpanzee [$d_2$(human, chimpanzee) = ], human and gorilla [$d_2$(human, gorilla) = 0.0545], and gorilla and chimpanzee [$d_2$(chimpanzee, gorilla) = ]. Unlike in Table 2, the distance between mouse and rabbit [$d_2 = 0.0865$] is larger than the distances in these three cases. Comparing Tables 2 and 3, we find an overall qualitative agreement among the similarities, although there are small differences.
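The third distance $d_3$, defined in Section 3 from the symmetrized ratios $P_{XY}$, can be sketched in the same way. This is a minimal illustration under stated assumptions rather than the authors' code: the function names are hypothetical, the sum runs over all 16 ordered dinucleotides, each ratio is computed from counts as $n(XY_n + X'Y'_n)/((n-1)(X_n + \bar{X}_n))$ with $X'Y'$ the inverted complement of XY and $\bar{X}$ the complement of X, and a ratio whose denominator is zero is set to 0.

```python
from itertools import product
from math import sqrt

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
DINUCLEOTIDES = [a + b for a, b in product("ACGT", repeat=2)]


def reverse_complement(word):
    """Return the inverted complement of a DNA word, e.g. GT -> AC."""
    return "".join(COMPLEMENT[b] for b in reversed(word))


def ratio_vector(seq):
    """Symmetrized ratios P_XY = f*_XY / f*_X for all 16 dinucleotides."""
    seq = seq.upper()
    n = len(seq)
    base = {x: seq.count(x) for x in "ACGT"}
    pair = {xy: 0 for xy in DINUCLEOTIDES}
    for i in range(n - 1):
        pair[seq[i:i + 2]] += 1  # assumption: only A, C, G, T occur
    vector = []
    for xy in DINUCLEOTIDES:
        x = xy[0]
        numerator = n * (pair[xy] + pair[reverse_complement(xy)])
        denominator = (n - 1) * (base[x] + base[COMPLEMENT[x]])
        # Assumption: an undefined ratio (zero denominator) is set to 0.
        vector.append(numerator / denominator if denominator else 0.0)
    return vector


def d3(seq_f, seq_g):
    """Euclidean distance between the two ratio vectors."""
    u, v = ratio_vector(seq_f), ratio_vector(seq_g)
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


if __name__ == "__main__":
    # Goat and bovine sequences as listed in Table 1.
    goat = ("ATGCTGACTGCTGAGGAGAAGGCTGCCGTCACCGGCTTCTGGGGCAAGGTGAAAGT"
            "GGATGAAGTTGGTGCTGAGGCCCTGGGCAG")
    bovine = ("ATGCTGACTGCTGAGGAGAAGGCTGCCGTCACCGCCTTTTGGGGCAAGGTGAAAGT"
              "GGATGAAGTTGGTGGTGAGGCCCTGGGCAG")
    print("d3 =", d3(goat, bovine))
```

Dividing by the symmetrized base frequency makes each component a relative abundance of the dinucleotide given its first base, so $d_3$ is less sensitive to differences in overall base composition than a distance on raw frequencies.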
The results presented in Table 2 and Table 3 are in accord with previously reported examinations of the similarity/dissimilarity of the coding sequences of the first exon of the β-globin gene of several species by means of approaches using matrix invariants [12][14][15][17][18].

We present the similarity/dissimilarity matrix obtained using the distance $d_3(f,g)$ in Table 4. The similarities are in agreement with the results in Table 2 and Table 3, except that the greatest distance is associated with gallus and chimpanzee [$d_3$(chimpanzee, gallus) = ], not gallus and opossum. The closest pairs are (human, chimpanzee) [$d_3$ = ], (human, gorilla) [$d_3 = 0.0835$] and (gorilla, chimpanzee) [$d_3 = 0.0962$]. Gallus is again clearly dissimilar to the others, and Table 4 confirms that both gallus and opossum are dissimilar to the remaining species. Besides gallus and opossum, lemur appears relatively remote from the other species.

Table 4: The upper triangular part of the similarity/dissimilarity matrix based on the distance measure $d_3(f,g)$ of the 11 coding sequences of Table 1. Rows and columns: Human, Goat, Opossum, Gallus, Lemur, Mouse, Rabbit, Rat, Gorilla, Bovine, Chimpanzee (numerical entries not recoverable).

5 Conclusions

Sequence comparison is a fundamental task in computational biology that aims to discover similarity relationships between molecular sequences. In this paper, three distances based on the symmetrized dinucleotide frequencies of a DNA sequence have been proposed to characterize DNA sequences mathematically. Their application to the similarity/dissimilarity analysis of the coding sequences of the first exon of the β-globin gene of 11 species illustrates their validity. The similarity results largely agree with the known relationships among the species. Moreover, our approach does not require complicated calculation. The method is simple, convenient and fast, and is therefore applicable to both short and long DNA sequences.

Acknowledgements

This work was supported in part by the Shandong Natural Science Foundation (Y2006A14).

References

[1] A. Godzik, The structural alignment between two proteins: is there a unique answer? Protein Sci. 5 (1996)
[2] C. Burge, A. M. Campbell, S. Karlin, Over- and under-representation of short oligonucleotides in DNA sequences, Proc. Natl. Acad. Sci. 89 (1992)
[3] S. Karlin, I. Ladunga, Comparisons of eukaryotic genomic sequences, Proc. Natl. Acad. Sci. 91 (1994)
[4] A. Nandy, M. Harle, S. C. Basak, Mathematical descriptors of DNA sequences: development and applications, ARKIVOC 9 (2006)
[5] G. Jaklic, T. Pisanski, M. Randić, Characterization of complex biological systems by matrix invariants, J. Comput. Biol. 13 (2006)
[6] M. Randić, M. Vračko, N. Lerš, D. Plavšić, Novel 2-D graphical representation of DNA sequences and their numerical characterization, Chem. Phys. Lett. 368 (2003) 1-6.
[7] M. Randić, X. F. Guo, S. C. Basak, On the characterization of DNA primary sequence by triplet of nucleic acid bases, J. Chem. Inf. Comput. Sci. 41 (2001)
[8] Y. S. Zhang, B. Liao, K. Ding, On 2D graphical representation of DNA sequence of nondegeneracy, Chem. Phys. Lett. 411 (2005)
[9] Y. S. Zhang, B. Liao, K. Ding, On 3DD-Curves of DNA sequences, Mol. Simul. 32 (2006)
[10] B. Liao, A 2D graphical representation of DNA sequence, Chem. Phys. Lett. 401 (2005)
[11] Y. S. Zhang, M. Tan, Visualization of DNA sequences based on 3DD-Curves, J. Math. Chem. 44 (2008)
[12] M. Randić, M. Vračko, N. Lerš, D. Plavšić, Analysis of similarity/dissimilarity of DNA sequences based on novel 2-D graphical representation, Chem. Phys. Lett. 371 (2003)
[13] D. Bielinska-Waz, P. Waz, T. Clark, Similarity studies of DNA sequences using genetic methods, Chem. Phys. Lett. 445 (2007)
[14] B. Liao, Y. S. Zhang, K. Ding, T. Wang, Analysis of similarity/dissimilarity of DNA sequences based on a condensed curve representation, J. Mol. Struct. (Theochem) 717 (2005)
[15] B. Liao, K. Ding, Analysis of similarity/dissimilarity of DNA sequences based on nonoverlapping triplets of nucleotide bases, J. Chem. Inf. Comput. Sci. 44 (2004)
[16] B. Liao, K. Ding, A graphical approach to analyzing DNA sequences, J. Comput. Chem. 26 (2005)
[17] Y. S. Zhang, W. Chen, Invariants of DNA sequences based on 2DD-curves, J. Theor. Biol. 242 (2006)
[18] Y. S. Zhang, W. Chen, New invariant of DNA sequences, MATCH Commun. Math. Comput. Chem. 58 (2007)
[19] Y. S. Zhang, A simple method to construct the similarity matrices of DNA sequences, MATCH Commun. Math. Comput. Chem. 60 (2008)
ColorSquare: A Colorful Square Visualization of DNA Sequences
MATCH Communications in Mathematical and in Computer Chemistry MATCH Commun. Math. Comput. Chem. 68 (2012) 621-637 ISSN 0340-6253 ColorSquare: A Colorful Square Visualization of DNA Sequences Zhujin Zhang
MORPHEUS. http://biodev.cea.fr/morpheus/ Prediction of Transcription Factors Binding Sites based on Position Weight Matrix.
MORPHEUS http://biodev.cea.fr/morpheus/ Prediction of Transcription Factors Binding Sites based on Position Weight Matrix. Reference: MORPHEUS, a Webtool for Transcripton Factor Binding Analysis Using
15.062 Data Mining: Algorithms and Applications Matrix Math Review
.6 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop
HIGH DENSITY DATA STORAGE IN DNA USING AN EFFICIENT MESSAGE ENCODING SCHEME Rahul Vishwakarma 1 and Newsha Amiri 2
HIGH DENSITY DATA STORAGE IN DNA USING AN EFFICIENT MESSAGE ENCODING SCHEME Rahul Vishwakarma 1 and Newsha Amiri 2 1 Tata Consultancy Services, India [email protected] 2 Bangalore University, India ABSTRACT
Activity IT S ALL RELATIVES The Role of DNA Evidence in Forensic Investigations
Activity IT S ALL RELATIVES The Role of DNA Evidence in Forensic Investigations SCENARIO You have responded, as a result of a call from the police to the Coroner s Office, to the scene of the death of
PROC. CAIRO INTERNATIONAL BIOMEDICAL ENGINEERING CONFERENCE 2006 1. E-mail: [email protected]
BIOINFTool: Bioinformatics and sequence data analysis in molecular biology using Matlab Mai S. Mabrouk 1, Marwa Hamdy 2, Marwa Mamdouh 2, Marwa Aboelfotoh 2,Yasser M. Kadah 2 1 Biomedical Engineering Department,
International Language Character Code
, pp.161-166 http://dx.doi.org/10.14257/astl.2015.81.33 International Language Character Code with DNA Molecules Wei Wang, Zhengxu Zhao, Qian Xu School of Information Science and Technology, Shijiazhuang
DNA Insertions and Deletions in the Human Genome. Philipp W. Messer
DNA Insertions and Deletions in the Human Genome Philipp W. Messer Genetic Variation CGACAATAGCGCTCTTACTACGTGTATCG : : CGACAATGGCGCT---ACTACGTGCATCG 1. Nucleotide mutations 2. Genomic rearrangements 3.
Extremal Wiener Index of Trees with All Degrees Odd
MATCH Communications in Mathematical and in Computer Chemistry MATCH Commun. Math. Comput. Chem. 70 (2013) 287-292 ISSN 0340-6253 Extremal Wiener Index of Trees with All Degrees Odd Hong Lin School of
SHARP BOUNDS FOR THE SUM OF THE SQUARES OF THE DEGREES OF A GRAPH
31 Kragujevac J. Math. 25 (2003) 31 49. SHARP BOUNDS FOR THE SUM OF THE SQUARES OF THE DEGREES OF A GRAPH Kinkar Ch. Das Department of Mathematics, Indian Institute of Technology, Kharagpur 721302, W.B.,
Cacti with minimum, second-minimum, and third-minimum Kirchhoff indices
MATHEMATICAL COMMUNICATIONS 47 Math. Commun., Vol. 15, No. 2, pp. 47-58 (2010) Cacti with minimum, second-minimum, and third-minimum Kirchhoff indices Hongzhuan Wang 1, Hongbo Hua 1, and Dongdong Wang
Rapid alignment methods: FASTA and BLAST. p The biological problem p Search strategies p FASTA p BLAST
Rapid alignment methods: FASTA and BLAST p The biological problem p Search strategies p FASTA p BLAST 257 BLAST: Basic Local Alignment Search Tool p BLAST (Altschul et al., 1990) and its variants are some
Just the Facts: A Basic Introduction to the Science Underlying NCBI Resources
1 of 8 11/7/2004 11:00 AM National Center for Biotechnology Information About NCBI NCBI at a Glance A Science Primer Human Genome Resources Model Organisms Guide Outreach and Education Databases and Tools
LabGenius. Technical design notes. The world s most advanced synthetic DNA libraries. [email protected] V1.5 NOV 15
LabGenius The world s most advanced synthetic DNA libraries Technical design notes [email protected] V1.5 NOV 15 Introduction OUR APPROACH LabGenius is a gene synthesis company focussed on the design and manufacture
BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS
BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS SEEMA JAGGI Indian Agricultural Statistics Research Institute Library Avenue, New Delhi-110 012 [email protected] Genomics A genome is an organism s
RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison
RETRIEVING SEQUENCE INFORMATION Nucleotide sequence databases Database search Sequence alignment and comparison Biological sequence databases Originally just a storage place for sequences. Currently the
Introduction to Bioinformatics 3. DNA editing and contig assembly
Introduction to Bioinformatics 3. DNA editing and contig assembly Benjamin F. Matthews United States Department of Agriculture Soybean Genomics and Improvement Laboratory Beltsville, MD 20708 [email protected]
The Characteristic Polynomial
Physics 116A Winter 2011 The Characteristic Polynomial 1 Coefficients of the characteristic polynomial Consider the eigenvalue problem for an n n matrix A, A v = λ v, v 0 (1) The solution to this problem
2. The number of different kinds of nucleotides present in any DNA molecule is A) four B) six C) two D) three
Chem 121 Chapter 22. Nucleic Acids 1. Any given nucleotide in a nucleic acid contains A) two bases and a sugar. B) one sugar, two bases and one phosphate. C) two sugars and one phosphate. D) one sugar,
Real-time PCR: Understanding C t
APPLICATION NOTE Real-Time PCR Real-time PCR: Understanding C t Real-time PCR, also called quantitative PCR or qpcr, can provide a simple and elegant method for determining the amount of a target sequence
FlipFlop: Fast Lasso-based Isoform Prediction as a Flow Problem
FlipFlop: Fast Lasso-based Isoform Prediction as a Flow Problem Elsa Bernard Laurent Jacob Julien Mairal Jean-Philippe Vert September 24, 2013 Abstract FlipFlop implements a fast method for de novo transcript
Human Genome Organization: An Update. Genome Organization: An Update
Human Genome Organization: An Update Genome Organization: An Update Highlights of Human Genome Project Timetable Proposed in 1990 as 3 billion dollar joint venture between DOE and NIH with 15 year completion
Interaktionen von Nukleinsäuren und Proteinen
Sonja Prohaska Computational EvoDevo Universitaet Leipzig June 9, 2015 DNA is never naked in a cell DNA is usually in association with proteins. In all domains of life there are small, basic chromosomal
Manifold Learning Examples PCA, LLE and ISOMAP
Manifold Learning Examples PCA, LLE and ISOMAP Dan Ventura October 14, 28 Abstract We try to give a helpful concrete example that demonstrates how to use PCA, LLE and Isomap, attempts to provide some intuition
Current Motif Discovery Tools and their Limitations
Current Motif Discovery Tools and their Limitations Philipp Bucher SIB / CIG Workshop 3 October 2006 Trendy Concepts and Hypotheses Transcription regulatory elements act in a context-dependent manner.
Multivariate Analysis of Variance (MANOVA): I. Theory
Gregory Carey, 1998 MANOVA: I - 1 Multivariate Analysis of Variance (MANOVA): I. Theory Introduction The purpose of a t test is to assess the likelihood that the means for two groups are sampled from the
December 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B. KITCHENS
December 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B KITCHENS The equation 1 Lines in two-dimensional space (1) 2x y = 3 describes a line in two-dimensional space The coefficients of x and y in the equation
Biological Sequence Data Formats
Biological Sequence Data Formats Here we present three standard formats in which biological sequence data (DNA, RNA and protein) can be stored and presented. Raw Sequence: Data without description. FASTA
DATA ANALYSIS II. Matrix Algorithms
DATA ANALYSIS II Matrix Algorithms Similarity Matrix Given a dataset D = {x i }, i=1,..,n consisting of n points in R d, let A denote the n n symmetric similarity matrix between the points, given as where
PHYSICAL REVIEW LETTERS
PHYSICAL REVIEW LETTERS VOLUME 86 28 MAY 21 NUMBER 22 Mathematical Analysis of Coupled Parallel Simulations Michael R. Shirts and Vijay S. Pande Department of Chemistry, Stanford University, Stanford,
A greedy algorithm for the DNA sequencing by hybridization with positive and negative errors and information about repetitions
BULLETIN OF THE POLISH ACADEMY OF SCIENCES TECHNICAL SCIENCES, Vol. 59, No. 1, 2011 DOI: 10.2478/v10175-011-0015-0 Varia A greedy algorithm for the DNA sequencing by hybridization with positive and negative
Name Date Period. 2. When a molecule of double-stranded DNA undergoes replication, it results in
DNA, RNA, Protein Synthesis Keystone 1. During the process shown above, the two strands of one DNA molecule are unwound. Then, DNA polymerases add complementary nucleotides to each strand which results
Algorithms in Computational Biology (236522) spring 2007 Lecture #1
Algorithms in Computational Biology (236522) spring 2007 Lecture #1 Lecturer: Shlomo Moran, Taub 639, tel 4363 Office hours: Tuesday 11:00-12:00/by appointment TA: Ilan Gronau, Taub 700, tel 4894 Office
14.10.2014. Overview. Swarms in nature. Fish, birds, ants, termites, Introduction to swarm intelligence principles Particle Swarm Optimization (PSO)
Overview Kyrre Glette kyrrehg@ifi INF3490 Swarm Intelligence Particle Swarm Optimization Introduction to swarm intelligence principles Particle Swarm Optimization (PSO) 3 Swarms in nature Fish, birds,
STATISTICS AND DATA ANALYSIS IN GEOLOGY, 3rd ed. Clarificationof zonationprocedure described onpp. 238-239
STATISTICS AND DATA ANALYSIS IN GEOLOGY, 3rd ed. by John C. Davis Clarificationof zonationprocedure described onpp. 38-39 Because the notation used in this section (Eqs. 4.8 through 4.84) is inconsistent
Webserver: bioinfo.bio.wzw.tum.de Mail: [email protected]
Webserver: bioinfo.bio.wzw.tum.de Mail: [email protected] About me H. Werner Mewes, Lehrstuhl f. Bioinformatik, WZW C.V.: Studium der Chemie in Marburg Uni Heidelberg (Med. Fakultät, Bioenergetik)
For additional information on the program, see the current university catalog.
For information call: Tel: (818) 77-81 Fax: (818) 77-08 E-mail: [email protected] Website: http://www.csun.edu/chemistry Or write: Department of Chemistry and Biochemistry California State University,
MAKING AN EVOLUTIONARY TREE
Student manual MAKING AN EVOLUTIONARY TREE THEORY The relationship between different species can be derived from different information sources. The connection between species may turn out by similarities
Metric Multidimensional Scaling (MDS): Analyzing Distance Matrices
Metric Multidimensional Scaling (MDS): Analyzing Distance Matrices Hervé Abdi 1 1 Overview Metric multidimensional scaling (MDS) transforms a distance matrix into a set of coordinates such that the (Euclidean)
Pairwise Sequence Alignment
Pairwise Sequence Alignment [email protected] SS 2013 Outline Pairwise sequence alignment global - Needleman Wunsch Gotoh algorithm local - Smith Waterman algorithm BLAST - heuristics What
Basic Scientific Principles that All Students Should Know Upon Entering Medical and Dental School at McGill
Fundamentals of Medicine and Dentistry Basic Scientific Principles that All Students Should Know Upon Entering Medical and Dental School at McGill Students entering medical and dental training come from
Visualization of General Defined Space Data
International Journal of Computer Graphics & Animation (IJCGA) Vol.3, No.4, October 013 Visualization of General Defined Space Data John R Rankin La Trobe University, Australia Abstract A new algorithm
Hidden Markov Models
8.47 Introduction to omputational Molecular Biology Lecture 7: November 4, 2004 Scribe: Han-Pang hiu Lecturer: Ross Lippert Editor: Russ ox Hidden Markov Models The G island phenomenon The nucleotide frequencies
Chemistry INDIVIDUAL PROGRAM INFORMATION 2015 2016. 866.Macomb1 (866.622.6621) www.macomb.edu
Chemistry INDIVIDUAL PROGRAM INFORMATION 2015 2016 866.Macomb1 (866.622.6621) www.macomb.edu Chemistry PROGRAM OPTIONS CREDENTIAL TITLE CREDIT HOURS REQUIRED NOTES Associate of Science Chemistry 64 CONTACT
Maximum-Likelihood Estimation of Phylogeny from DNA Sequences When Substitution Rates Differ over Sites1
Maximum-Likelihood Estimation of Phylogeny from DNA Sequences When Substitution Rates Differ over Sites1 Ziheng Yang Department of Animal Science, Beijing Agricultural University Felsenstein s maximum-likelihood
Data Analysis Tools. Tools for Summarizing Data
Data Analysis Tools This section of the notes is meant to introduce you to many of the tools that are provided by Excel under the Tools/Data Analysis menu item. If your computer does not have that tool
SOLiD System accuracy with the Exact Call Chemistry module
WHITE PPER 55 Series SOLiD System SOLiD System accuracy with the Exact all hemistry module ONTENTS Principles of Exact all hemistry Introduction Encoding of base sequences with Exact all hemistry Demonstration
Bio-Informatics Lectures. A Short Introduction
Bio-Informatics Lectures A Short Introduction The History of Bioinformatics Sanger Sequencing PCR in presence of fluorescent, chain-terminating dideoxynucleotides Massively Parallel Sequencing Massively
How to do AHP analysis in Excel
How to do AHP analysis in Excel Khwanruthai BUNRUAMKAEW (D) Division of Spatial Information Science Graduate School of Life and Environmental Sciences University of Tsukuba ( March 1 st, 01) The Analytical
Vector and Matrix Norms
Chapter 1 Vector and Matrix Norms 11 Vector Spaces Let F be a field (such as the real numbers, R, or complex numbers, C) with elements called scalars A Vector Space, V, over the field F is a non-empty
THREE DIMENSIONAL REPRESENTATION OF AMINO ACID CHARAC- TERISTICS
THREE DIMENSIONAL REPRESENTATION OF AMINO ACID CHARAC- TERISTICS O.U. Sezerman 1, R. Islamaj 2, E. Alpaydin 2 1 Laborotory of Computational Biology, Sabancı University, Istanbul, Turkey. 2 Computer Engineering
Introduction to Matrix Algebra
Psychology 7291: Multivariate Statistics (Carey) 8/27/98 Matrix Algebra - 1 Introduction to Matrix Algebra Definitions: A matrix is a collection of numbers ordered by rows and columns. It is customary
Mathematics INDIVIDUAL PROGRAM INFORMATION 2014 2015. 866.Macomb1 (866.622.6621) www.macomb.edu
Mathematics INDIVIDUAL PROGRAM INFORMATION 2014 2015 866.Macomb1 (866.622.6621) www.macomb.edu Mathematics PROGRAM OPTIONS CREDENTIAL TITLE CREDIT HOURS REQUIRED NOTES Associate of Arts Mathematics 62
An Introduction to the Use of Bayesian Network to Analyze Gene Expression Data
n Introduction to the Use of ayesian Network to nalyze Gene Expression Data Cristina Manfredotti Dipartimento di Informatica, Sistemistica e Comunicazione (D.I.S.Co. Università degli Studi Milano-icocca
13 MATH FACTS 101. 2 a = 1. 7. The elements of a vector have a graphical interpretation, which is particularly easy to see in two or three dimensions.
3 MATH FACTS 0 3 MATH FACTS 3. Vectors 3.. Definition We use the overhead arrow to denote a column vector, i.e., a linear segment with a direction. For example, in three-space, we write a vector in terms
Chapter 6. Orthogonality
6.3 Orthogonal Matrices 1 Chapter 6. Orthogonality 6.3 Orthogonal Matrices Definition 6.4. An n n matrix A is orthogonal if A T A = I. Note. We will see that the columns of an orthogonal matrix must be
Custom Antibody Services
prosci-inc.com Custom Antibody Services High Performance Antibodies and More Broad Antibody Catalog Extensive Antibody Services CUSTOM ANTIBODY SERVICES Established in 1998, ProSci Incorporated is a leading
= 2 + 1 2 2 = 3 4, Now assume that P (k) is true for some fixed k 2. This means that
Instructions. Answer each of the questions on your own paper, and be sure to show your work so that partial credit can be adequately assessed. Credit will not be given for answers (even correct ones) without
Introduction to Principal Components and Factor Analysis
Introduction to Principal Components and Factor Analysis Multivariate Analysis often starts out with data involving a substantial number of correlated variables. Principal Component Analysis (PCA) is a
T cell Epitope Prediction
Institute for Immunology and Informatics T cell Epitope Prediction EpiMatrix Eric Gustafson January 6, 2011 Overview Gathering raw data Popular sources Data Management Conservation Analysis Multiple Alignments
Searching Nucleotide Databases
Searching Nucleotide Databases When we search a nucleic acid database, Mascot always performs a 6 frame translation on the fly. That is, 3 reading frames from the forward strand and 3 reading frames
Sequence Analysis 15: lecture 5. Substitution matrices Multiple sequence alignment
Sequence Analysis 15: lecture 5 Substitution matrices Multiple sequence alignment A teacher's dilemma To understand... Multiple sequence alignment Substitution matrices Phylogenetic trees You first need
A Survey on Outlier Detection Techniques for Credit Card Fraud Detection
IOSR Journal of Computer Engineering (IOSR-JCE) e-ISSN: 2278-0661, p-ISSN: 2278-8727, Volume 16, Issue 2, Ver. VI (Mar-Apr. 2014), PP 44-48 A Survey on Outlier Detection Techniques for Credit Card Fraud
Amino Acids and Their Properties
Amino Acids and Their Properties Recap: ss-rRNA and mutations Ribosomal RNA (rRNA) evolves very slowly Much slower than proteins ss-rRNA is typically used So by aligning ss-rRNA of one organism with that
Basic Concepts of DNA, Proteins, Genes and Genomes
Basic Concepts of DNA, Proteins, Genes and Genomes Kun-Mao Chao 1,2,3 1 Graduate Institute of Biomedical Electronics and Bioinformatics 2 Department of Computer Science and Information Engineering 3 Graduate
Expression Quantification (I)
Expression Quantification (I) Mario Fasold, LIFE, IZBI Sequencing Technology One Illumina HiSeq 2000 run produces 2 times (paired-end) ca. 1,2 Billion reads ca. 120 GB FASTQ file RNA-seq protocol Task
Bioinformatics Resources at a Glance
Bioinformatics Resources at a Glance A Note about FASTA Format There are MANY free bioinformatics tools available online. Bioinformaticists have developed a standard format for nucleotide and protein sequences
Lab 2/Phylogenetics/September 16, 2002 1 PHYLOGENETICS
Lab 2/Phylogenetics/September 16, 2002 1 Read: Tudge Chapter 2 PHYLOGENETICS Objective of the Lab: To understand how DNA and protein sequence information can be used to make comparisons and assess evolutionary
Exploratory data analysis for microarray data
Exploratory data analysis for microarray data Anja von Heydebreck Max Planck Institute for Molecular Genetics, Dept. Computational Molecular Biology, Berlin, Germany [email protected] Visualization
Section 6.1 - Inner Products and Norms
Section 6.1 - Inner Products and Norms Definition. Let V be a vector space over F ∈ {R, C}. An inner product on V is a function that assigns, to every ordered pair of vectors x and y in V, a scalar in F,
Name Class Date. Figure 13-1. 2. Which nucleotide in Figure 13-1 indicates the nucleic acid above is RNA? a. uracil c. cytosine b. guanine d.
13 Multiple Choice RNA and Protein Synthesis Chapter Test A Write the letter that best answers the question or completes the statement on the line provided. 1. Which of the following are found in both
Network (Tree) Topology Inference Based on Prüfer Sequence
Network (Tree) Topology Inference Based on Prüfer Sequence C. Vanniarajan and Kamala Krithivasan Department of Computer Science and Engineering Indian Institute of Technology Madras Chennai 600036 [email protected],
Standardization and Its Effects on K-Means Clustering Algorithm
Research Journal of Applied Sciences, Engineering and Technology 6(17): 3299-3303, 2013 ISSN: 2040-7459; e-ISSN: 2040-7467 Maxwell Scientific Organization, 2013 Submitted: January 3, 2013 Accepted: February 5, 2013
On Some Vertex Degree Based Graph Invariants
MATCH Communications in Mathematical and in Computer Chemistry MATCH Commun. Math. Comput. Chem. 65 (2011) 723-730 ISSN 0340-6253 On Some Vertex Degree Based Graph Invariants Batmend Horoldagva a and Ivan
MATH 304 Linear Algebra Lecture 9: Subspaces of vector spaces (continued). Span. Spanning set.
MATH 304 Linear Algebra Lecture 9: Subspaces of vector spaces (continued). Span. Spanning set. Vector space A vector space is a set V equipped with two operations, addition V × V ∋ (x, y) ↦ x + y ∈ V and scalar
A Brief Study of the Nurse Scheduling Problem (NSP)
A Brief Study of the Nurse Scheduling Problem (NSP) Lizzy Augustine, Morgan Faer, Andreas Kavountzis, Reema Patel Submitted Tuesday December 15, 2009 0. Introduction and Background Our interest in the
agucacaaacgcu agugcuaguuua uaugcagucuua
RNA Secondary Structure Prediction: The Co-transcriptional effect on RNA folding agucacaaacgcu agugcuaguuua uaugcagucuua By Conrad Godfrey Abstract RNA secondary structure prediction is an area of bioinformatics
Notes on Determinant
ENGG2012B Advanced Engineering Mathematics Notes on Determinant Lecturer: Kenneth Shum Lecture 9-18/02/2013 The determinant of a system of linear equations determines whether the solution is unique, without
Statistics Graduate Courses
Statistics Graduate Courses STAT 7002--Topics in Statistics-Biological/Physical/Mathematics (cr. arr.). Organized study of selected topics. Subjects and earnable credit may vary from semester to semester.
TOWARD BIG DATA ANALYSIS WORKSHOP
TOWARD BIG DATA ANALYSIS WORKSHOP Abstract Collection 2015.06.05-06 Matrix Visualization of Big Data Chun-Houh Chen (陳君厚), Institute of Statistical Science, Academia Sinica Abstract Visualization and Exploratory Data Analysis (EDA)
Graph theoretic approach to analyze amino acid network
Int. J. Adv. Appl. Math. and Mech. 2(3) (2015) 31-37 (ISSN: 2347-2529) Journal homepage: www.ijaamm.com International Journal of Advances in Applied Mathematics and Mechanics Graph theoretic approach to
General Framework for an Iterative Solution of Ax = b. Jacobi's Method
2.6 Iterative Solutions of Linear Systems Consistent linear systems in real life are solved in one of two ways: by direct calculation (using a matrix factorization,
Integrating DNA Motif Discovery and Genome-Wide Expression Analysis. Erin M. Conlon
Integrating DNA Motif Discovery and Genome-Wide Expression Analysis Department of Mathematics and Statistics University of Massachusetts Amherst Statistics in Functional Genomics Workshop Ascona, Switzerland
FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM
International Journal of Innovative Computing, Information and Control ICIC International ©2012 ISSN 1349-4198 Volume 8, Number 8, August 2012 pp. 4 FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT
USE OF EIGENVALUES AND EIGENVECTORS TO ANALYZE BIPARTIVITY OF NETWORK GRAPHS
USE OF EIGENVALUES AND EIGENVECTORS TO ANALYZE BIPARTIVITY OF NETWORK GRAPHS Natarajan Meghanathan Jackson State University, 1400 Lynch St, Jackson, MS, USA [email protected] ABSTRACT This
MiSeq: Imaging and Base Calling
MiSeq: Imaging and Base Calling Page Welcome Navigation Presenter Introduction MiSeq Sequencing Workflow Narration Welcome to MiSeq: Imaging and Base Calling. This course takes 35 minutes to complete. Click Next to continue. Please
Becker Muscular Dystrophy
Muscular Dystrophy A Case Study of Positional Cloning Described by Benjamin Duchenne (1868) X-linked recessive disease causing severe muscular degeneration. 100 % penetrance X d Y affected male Frequency
DNA and the Cell. Version 2.3. English version. ELLS European Learning Laboratory for the Life Sciences
DNA and the Cell Anastasios Koutsos Alexandra Manaia Julia Willingale-Theune Version 2.3 English version ELLS European Learning Laboratory for the Life Sciences Anastasios Koutsos, Alexandra Manaia and
Modified Genetic Algorithm for DNA Sequence Assembly by Shotgun and Hybridization Sequencing Techniques
International Journal of Electronics and Computer Science Engineering 2000 Available Online at www.ijecse.org ISSN- 2277-1956 Modified Genetic Algorithm for DNA Sequence Assembly by Shotgun and Hybridization
CLASSIFYING SERVICES USING A BINARY VECTOR CLUSTERING ALGORITHM: PRELIMINARY RESULTS
CLASSIFYING SERVICES USING A BINARY VECTOR CLUSTERING ALGORITHM: PRELIMINARY RESULTS Venkat Venkateswaran Department of Engineering and Science Rensselaer Polytechnic Institute 275 Windsor Street Hartford,
