A Tutorial in Genetic Sequence Classification Tools and Techniques
|
|
|
- Annice Dawson
- 10 years ago
- Views:
Transcription
1 A Tutorial in Genetic Sequence Classification Tools and Techniques Jake Drew Data Mining CSE 8331 Southern Methodist University
2 Sequence Characters IUPAC nucleotide code Base A - Adenine C - Cytosine G - Guanine T - Thymine A Adenine C Cytosine G Guanine T (or U) Thymine (or Uracil) R A or G Y C or T S G or C W A or T K G or T M A or C B C or G or T D A or G or T H A or C or T V A or C or G N any base. or - gap
3 Sequence File Formats Plain sequence format A sequence in plain format may contain only IUPAC characters and spaces (no numbers!). Note: A file in plain sequence format may only contain one sequence, while most other formats accept several sequences in one file. An example sequence in plain format is: ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCACCGCTGCCCTGCC CCTGGAGGGTGGCCCCACCGGCCGAGACAGCGAGCATATGCAGGAAGCGGCAGGAATAAGGAAAAGCAGC CTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGCCCCTCATAGGAGAGG AAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCGCACCCCCCCAGCAATCCGCGCGCCGGGACAGAATGCC CTGCAGGAACTTCTTCTGGAAGACCTTCTCCTCCTGCAAATAAAACCTCACCCATGAATGCTCACGCAAG TTTAATTACAGACCTGAA FASTA format A sequence file in FASTA format can contain several sequences. Each sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line must begin with a greater-than (">") symbol in the first column. An example sequence in FASTA format is: >AB acc=ab descr=homo sapiens mrna for prepro cortistatin like peptide, complete cds. len=368 ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCACCGCTGCCCTGCC CCTGGAGGGTGGCCCCACCGGCCGAGACAGCGAGCATATGCAGGAAGCGGCAGGAATAAGGAAAAGCAGC CTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGCCCCTCATAGGAGAGG AAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCGCACCCCCCCAGCAATCCGCGCGCCGGGACAGAATGCC CTGCAGGAACTTCTTCTGGAAGACCTTCTCCTCCTGCAAATAAAACCTCACCCATGAATGCTCACGCAAG TTTAATTACAGACCTGAA
4 Sequence File Formats GenBank format A sequence file in GenBank format can contain several sequences. One sequence in GenBank format starts with a line containing the word LOCUS and a number of annotation lines. The start of the sequence is marked by a line containing "ORIGIN" and the end of the sequence is marked by two slashes ("//"). An example sequence in GenBank format is: LOCUS AB bp mrna linear PRI 05-FEB-1999 DEFINITION Homo sapiens mrna for prepro cortistatin like peptide, complete cds. ACCESSION AB ORIGIN 1 acaagatgcc attgtccccc ggcctcctgc tgctgctgct ctccggggcc acggccaccg 61 ctgccctgcc cctggagggt ggccccaccg gccgagacag cgagcatatg caggaagcgg 121 caggaataag gaaaagcagc ctcctgactt tcctcgcttg gtggtttgag tggacctccc 181 aggccagtgc cgggcccctc ataggagagg aagctcggga ggtggccagg cggcaggaag 241 gcgcaccccc ccagcaatcc gcgcgccggg acagaatgcc ctgcaggaac ttcttctgga 301 agaccttctc ctcctgcaaa taaaacctca cccatgaatg ctcacgcaag tttaattaca 361 gacctgaa //
5 SAM / BAM File Formats SAM (Sequence Alignment/Map) format is a generic format for storing large nucleotide sequence alignments. SAM aims to be a format that: Is flexible enough to store all the alignment information generated by various alignment programs; Is simple enough to be easily generated by alignment programs or converted from existing alignment formats; Is compact in file size; Allows most of operations on the alignment to work on a stream without loading the whole alignment into memory; Allows the file to be indexed by genomic position to efficiently retrieve all reads aligning to a locus. SAM Tools provide various utilities for manipulating alignments in the SAM format, including sorting, merging, indexing and generating alignments in a per-position format. BAM is the Binary Equivalent Format for SAM
6 Sequence Tokenization Typically, sequences are compared for similarity using words of length k: Token 1 Token 2 Token 3 Token 4 Token 5 Token 6 Token 47 TTTGATCCTGGCTCAGGACGAACGCTGGCGGCGTGCCTAATGCATGCAAGTCGA TTTGATCCTGGCTCAGGACGAACGCTGGCGGCGTGCCTAATGCATGCAAGTCGA TTTGATCCTGGCTCAGGACGAACGCTGGCGGCGTGCCTAATGCATGCAAGTCGA TTTGATCCTGGCTCAGGACGAACGCTGGCGGCGTGCCTAATGCATGCAAGTCGA TTTGATCCTGGCTCAGGACGAACGCTGGCGGCGTGCCTAATGCATGCAAGTCGA TTTGATCCTGGCTCAGGACGAACGCTGGCGGCGTGCCTAATGCATGCAAGTCGA TTTGATCCTGGCTCAGGACGAACGCTGGCGGCGTGCCTAATGCATGCAAGTCGA
7 Simple Sequence Comparison Two sequences S and T can be compared for similarity using Jaccard Similarity: Sequence S Sequence T 1 ACCTGTAA TTGGCCAA 2 GGGTAACA CACACACA 3 AACCGGTT ACCTGTAA 4 ACACACAC ATGATGTG 5 GTGTAGTA GGGTAACA 6 TTGGTGAG GAGTGGTT J S, T = S T S T = 2 =.2 = 20% 10 This could be painfully slow when dealing with millions of sequence words.
8 Feature Space Size The word length k chosen during tokenization can have impacts on classification performance and accuracy: When k = 3, there are 4 3 = 64 possible unique words. When k = 8, there are 4 8 = 65,536 possible unique words. When k = 31, there are 4 31 = 4.61 x possible unique words. When k = 60, there are 4 60 = 1.33 x possible unique words. When k = 3, the features of two sequences can be represented in a 64 x 2 matrix: Sequence 1 Sequence 2 AAA 1 1 AAC 0 1 AAT 0 1 AAG 1 0 ACA 1 0 ACC 0 1 ACG 0 0 ACT 1 0 Very small matrix All words fit in memory of any machine Performance optimal Poor accuracy as many sequences will share words. Complicated statistics must be used to calculate similarity.
9 Feature Space Size When k = 60, the features of two sequences can be represented in a matrix containing x possible rows. Performance challenges due to very large matrix. Matrix typically must be compressed using Locality Sensitive Hashing, random sampling, or some other technique. Highly accurate when collisions occur since the random probability of such a long match is very small. It is possible to use highly performance optimal Boolean similarity calculations since collisions are such strong indicators of similarity. Unique words are not included in the matrix unless they are identified within a sequence during training.
10 Classification Accuracy Optimal Word Length Accuracy impacted by small feature space. 100% 98% 96% 94% 92% 90% 88% 86% 84% 82% 80% 78% 76% 74% 72% 70% RDP Training Dataset Genus Overall Word Length at Signature Size = 300 Accuracy impacted by lack of collisions. Strand uses a word length of 60 characters with 4 60 possible unique word values.
11 Sequence Alignment In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Sequence 1 Sequence 2 GGGGGCTAGCGTTATTCGGAATTACTGGGCGTAAAGCGCAC GTAGGCGGATCGGAAAGTCAGAGGTGAAATCCCAGGGCT GGGGGCTAGCGTTATTCGGAATTACTGTGCGTAAAGCGCAC GTAGGCGGAACGGAAAGTCAGAGGTGAAATCCCAGGGCT
12 Sequence Alignment Tools The following are some of the tools available online from NCBI: BLAST Basic Local Alignment Search Tool. Megablast Optimized for highly similar sequences. BLASTN - Search nucleotide subjects using a nucleotide query. BLASTP - Search protein databases using a protein query. BLASTX - search protein databases using a translated nucleotide query. TBLASTN - search translated nucleotide databases using a protein query. TBLASTX - search translated nucleotide databases using a translated nucleotide query. Sequence alignment is very accurate but slow compared to alignment-free methods.
13 JELLYFISH Parallel K-mer Counter JELLYFISH is a tool for fast, memory-efficient counting of k-mers in DNA. Tokenizes and counts words at a length specified by the user. Reads FASTA and multi-fasta files containing DNA sequences. Outputs its k-mer counts in an binary format. Offers the "jellyfish dump" command for human readable outputs. Used by Kraken to tokenize and count sequence words from input data. Open source and freely available. Quickly implement custom classifiers without implementing tokenization and counting.
14 Alignment-Free Classifiers Alignment-free use sequence tokenization and k-mer / word counting to estimate similarity between sequences. The following are state of the art alignment-free classifiers : RDP - Naive Bayesian Classifier Kraken - Abundance Estimation Classifier Strand - MapReduce Style Classifier using Locality Sensitive Hashing
15 RDP The Ribosomal Database Project (RDP) Classifier, a naıve Bayesian classifier, can rapidly and accurately classify bacterial 16S rrna sequences: Naıve Bayesian Classifier Predicts the taxonomy of gene sequences Uses a word length of 8 (k = 8) Average RDP training dataset record length = 1500 bases Tested shorter word lengths of 6 and 7 bases with low accuracy results RDP keeps a matrix of its training data with frequencies of the words it encounters by each training class RDP uses sampling to further reduce its feature space Due to sampling and a shorter word length RDP uses 100 bootstrap trials to make a class prediction (hurts performance) Must use more complex statistical methods to make predictions: P-value Estimate Membership Estimate
16 BioTools R This package aims at using R and the Biostrings package as the common interface for several important tools for multiple sequence alignment as database driven sequence management for 16S rrna: Use R to execute RDP Blast Clustalw kalign Available using Library(BioTools) package in R Must install RDP and BLAST executables to use
17 BioTools R Example The following example shows using BioTools and RDP to classify sequences in R: Reference Package Read Data Train Classifier Read Data Make Predictions
18 Kraken Classifier Kraken is an ultrafast and highly accurate program for assigning taxonomic labels to metagenomic DNA sequences: Kraken is an Abundance Estimation classifier Classifies sequence reads between bases The goal is very rapid shotgun classification Uses a word length of 31 (k = 31) The full Kraken database is in memory and requires 70GB High precision (classification accuracy) >95% Low sensitivity (the ability to make a prediction) Only classified between 72% to 91% of its test files Saves words during training with the lowest common ancestry Classification speeds are measured in reads per minute (rpm) Rpm values vary between test datasets from 890,000 to 1.5 million Kraken is open source and available for download from the Kraken website.
19 Kraken Classifier
20 Kraken Classifier
21 Strand Classifier Performs Fast Sequence Comparison using MapReduce and Locality Sensitive Hashing: General purpose classifier Uses a MapReduce style pipeline for highly parallel processing performance Uses Locality Sensitive Hashing to compress data by > 98% 20 times faster than RDP Uses only around 4GB to store the same training data used in the Kraken database. More accurate than RDP. Comparable accuracy when compared to Kraken.
22 Strand Multicore MapReduce
23 MinHash Signatures Strand uses n randomly seeded hash functions to select n minimum hash values which represent a random permutation of all words in a sequence or taxonomy: Set S Hash 1 Hash 2 Hash 3 Hash 4 Hash n -529, ,328, ,598 Set T Hash 1 Hash 2 Hash 3 Hash 4 Hash n 9, ,328,398 89,754 1, Strand Intersects n MinHash values between sequences and taxonomies to provide an accurate approximation of the Jaccard Index. J S, T = S T S T
24 Questions?
25 Additional References 1. BLAST - Altschul, Stephen F., et al. "Basic local alignment search tool." Journal of molecular biology (1990): JELLYFISH - Guillaume Marcais and Carl Kingsford, A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics (2011) 27(6): RDP Wang, Qiong, et al. "Naive Bayesian classifier for rapid assignment of rrna sequences into the new bacterial taxonomy." Applied and environmental microbiology (2007): Kraken Wood, Derrick E., and Steven L. Salzberg. "Kraken: ultrafast metagenomic sequence classification using exact alignments." Genome Biol 15.3 (2014): R Strand - Drew, Jake, and Michael Hahsler. "Strand: fast sequence comparison using mapreduce and locality sensitive hashing." Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics. ACM, R BioTools - Hahsler, Michael, and Anurag Nagar. "BioTools: Tools based on Biostrings (alignment, classification, database)." (2013):
26 Thank You! Jake Drew
DNA Sequence formats
DNA Sequence formats [Plain] [EMBL] [FASTA] [GCG] [GenBank] [IG] [IUPAC] [How Genomatix represents sequence annotation] Plain sequence format A sequence in plain format may contain only IUPAC characters
Bioinformatics Resources at a Glance
Bioinformatics Resources at a Glance A Note about FASTA Format There are MANY free bioinformatics tools available online. Bioinformaticists have developed a standard format for nucleotide and protein sequences
GenBank, Entrez, & FASTA
GenBank, Entrez, & FASTA Nucleotide Sequence Databases First generation GenBank is a representative example started as sort of a museum to preserve knowledge of a sequence from first discovery great repositories,
RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison
RETRIEVING SEQUENCE INFORMATION Nucleotide sequence databases Database search Sequence alignment and comparison Biological sequence databases Originally just a storage place for sequences. Currently the
Biological Sequence Data Formats
Biological Sequence Data Formats Here we present three standard formats in which biological sequence data (DNA, RNA and protein) can be stored and presented. Raw Sequence: Data without description. FASTA
BLAST. Anders Gorm Pedersen & Rasmus Wernersson
BLAST Anders Gorm Pedersen & Rasmus Wernersson Database searching Using pairwise alignments to search databases for similar sequences Query sequence Database Database searching Most common use of pairwise
Similarity Searches on Sequence Databases: BLAST, FASTA. Lorenza Bordoli Swiss Institute of Bioinformatics EMBnet Course, Basel, October 2003
Similarity Searches on Sequence Databases: BLAST, FASTA Lorenza Bordoli Swiss Institute of Bioinformatics EMBnet Course, Basel, October 2003 Outline Importance of Similarity Heuristic Sequence Alignment:
Biological Databases and Protein Sequence Analysis
Biological Databases and Protein Sequence Analysis Introduction M. Madan Babu, Center for Biotechnology, Anna University, Chennai 25, India Bioinformatics is the application of Information technology to
Sequence Formats and Sequence Database Searches. Gloria Rendon SC11 Education June, 2011
Sequence Formats and Sequence Database Searches Gloria Rendon SC11 Education June, 2011 Sequence A is the primary structure of a biological molecule. It is a chain of residues that form a precise linear
Pairwise Sequence Alignment
Pairwise Sequence Alignment [email protected] SS 2013 Outline Pairwise sequence alignment global - Needleman Wunsch Gotoh algorithm local - Smith Waterman algorithm BLAST - heuristics What
Module 1. Sequence Formats and Retrieval. Charles Steward
The Open Door Workshop Module 1 Sequence Formats and Retrieval Charles Steward 1 Aims Acquaint you with different file formats and associated annotations. Introduce different nucleotide and protein databases.
A Multiple DNA Sequence Translation Tool Incorporating Web Robot and Intelligent Recommendation Techniques
Proceedings of the 2007 WSEAS International Conference on Computer Engineering and Applications, Gold Coast, Australia, January 17-19, 2007 402 A Multiple DNA Sequence Translation Tool Incorporating Web
Analyzing A DNA Sequence Chromatogram
LESSON 9 HANDOUT Analyzing A DNA Sequence Chromatogram Student Researcher Background: DNA Analysis and FinchTV DNA sequence data can be used to answer many types of questions. Because DNA sequences differ
13.2 Ribosomes & Protein Synthesis
13.2 Ribosomes & Protein Synthesis Introduction: *A specific sequence of bases in DNA carries the directions for forming a polypeptide, a chain of amino acids (there are 20 different types of amino acid).
Removing Sequential Bottlenecks in Analysis of Next-Generation Sequencing Data
Removing Sequential Bottlenecks in Analysis of Next-Generation Sequencing Data Yi Wang, Gagan Agrawal, Gulcin Ozer and Kun Huang The Ohio State University HiCOMB 2014 May 19 th, Phoenix, Arizona 1 Outline
Tutorial. Reference Genome Tracks. Sample to Insight. November 27, 2015
Reference Genome Tracks November 27, 2015 Sample to Insight CLC bio, a QIAGEN Company Silkeborgvej 2 Prismet 8000 Aarhus C Denmark Telephone: +45 70 22 32 44 www.clcbio.com [email protected] Reference
From DNA to Protein. Proteins. Chapter 13. Prokaryotes and Eukaryotes. The Path From Genes to Proteins. All proteins consist of polypeptide chains
Proteins From DNA to Protein Chapter 13 All proteins consist of polypeptide chains A linear sequence of amino acids Each chain corresponds to the nucleotide base sequence of a gene The Path From Genes
org.rn.eg.db December 16, 2015 org.rn.egaccnum is an R object that contains mappings between Entrez Gene identifiers and GenBank accession numbers.
org.rn.eg.db December 16, 2015 org.rn.egaccnum Map Entrez Gene identifiers to GenBank Accession Numbers org.rn.egaccnum is an R object that contains mappings between Entrez Gene identifiers and GenBank
Lab 2/Phylogenetics/September 16, 2002 1 PHYLOGENETICS
Lab 2/Phylogenetics/September 16, 2002 1 Read: Tudge Chapter 2 PHYLOGENETICS Objective of the Lab: To understand how DNA and protein sequence information can be used to make comparisons and assess evolutionary
UGENE Quick Start Guide
Quick Start Guide This document contains a quick introduction to UGENE. For more detailed information, you can find the UGENE User Manual and other special manuals in project website: http://ugene.unipro.ru.
BIOINFORMATICS TUTORIAL
Bio 242 BIOINFORMATICS TUTORIAL Bio 242 α Amylase Lab Sequence Sequence Searches: BLAST Sequence Alignment: Clustal Omega 3d Structure & 3d Alignments DO NOT REMOVE FROM LAB. DO NOT WRITE IN THIS DOCUMENT.
Databases and mapping BWA. Samtools
Databases and mapping BWA Samtools FASTQ, SFF, bax.h5 ACE, FASTG FASTA BAM/SAM GFF, BED GenBank/Embl/DDJB many more File formats FASTQ Output format from Illumina and IonTorrent sequencers. Quality scores:
To be able to describe polypeptide synthesis including transcription and splicing
Thursday 8th March COPY LO: To be able to describe polypeptide synthesis including transcription and splicing Starter Explain the difference between transcription and translation BATS Describe and explain
Molecular Genetics. RNA, Transcription, & Protein Synthesis
Molecular Genetics RNA, Transcription, & Protein Synthesis Section 1 RNA AND TRANSCRIPTION Objectives Describe the primary functions of RNA Identify how RNA differs from DNA Describe the structure and
PRACTICE TEST QUESTIONS
PART A: MULTIPLE CHOICE QUESTIONS PRACTICE TEST QUESTIONS DNA & PROTEIN SYNTHESIS B 1. One of the functions of DNA is to A. secrete vacuoles. B. make copies of itself. C. join amino acids to each other.
Comparing Methods for Identifying Transcription Factor Target Genes
Comparing Methods for Identifying Transcription Factor Target Genes Alena van Bömmel (R 3.3.73) Matthew Huska (R 3.3.18) Max Planck Institute for Molecular Genetics Folie 1 Transcriptional Regulation TF
Basic Concepts of DNA, Proteins, Genes and Genomes
Basic Concepts of DNA, Proteins, Genes and Genomes Kun-Mao Chao 1,2,3 1 Graduate Institute of Biomedical Electronics and Bioinformatics 2 Department of Computer Science and Information Engineering 3 Graduate
DNA Sequencing Overview
DNA Sequencing Overview DNA sequencing involves the determination of the sequence of nucleotides in a sample of DNA. It is presently conducted using a modified PCR reaction where both normal and labeled
Name Date Period. 2. When a molecule of double-stranded DNA undergoes replication, it results in
DNA, RNA, Protein Synthesis Keystone 1. During the process shown above, the two strands of one DNA molecule are unwound. Then, DNA polymerases add complementary nucleotides to each strand which results
Phylogenetic Analysis using MapReduce Programming Model
2015 IEEE International Parallel and Distributed Processing Symposium Workshops Phylogenetic Analysis using MapReduce Programming Model Siddesh G M, K G Srinivasa*, Ishank Mishra, Abhinav Anurag, Eklavya
Bio-Informatics Lectures. A Short Introduction
Bio-Informatics Lectures A Short Introduction The History of Bioinformatics Sanger Sequencing PCR in presence of fluorescent, chain-terminating dideoxynucleotides Massively Parallel Sequencing Massively
Modeling DNA Replication and Protein Synthesis
Skills Practice Lab Modeling DNA Replication and Protein Synthesis OBJECTIVES Construct and analyze a model of DNA. Use a model to simulate the process of replication. Use a model to simulate the process
Lecture Outline. Introduction to Databases. Introduction. Data Formats Sample databases How to text search databases. Shifra Ben-Dor Irit Orr
Introduction to Databases Shifra Ben-Dor Irit Orr Lecture Outline Introduction Data and Database types Database components Data Formats Sample databases How to text search databases What units of information
Bioinformatics Grid - Enabled Tools For Biologists.
Bioinformatics Grid - Enabled Tools For Biologists. What is Grid-Enabled Tools (GET)? As number of data from the genomics and proteomics experiment increases. Problems arise for the current sequence analysis
Name Class Date. Figure 13 1. 2. Which nucleotide in Figure 13 1 indicates the nucleic acid above is RNA? a. uracil c. cytosine b. guanine d.
13 Multiple Choice RNA and Protein Synthesis Chapter Test A Write the letter that best answers the question or completes the statement on the line provided. 1. Which of the following are found in both
2. The number of different kinds of nucleotides present in any DNA molecule is A) four B) six C) two D) three
Chem 121 Chapter 22. Nucleic Acids 1. Any given nucleotide in a nucleic acid contains A) two bases and a sugar. B) one sugar, two bases and one phosphate. C) two sugars and one phosphate. D) one sugar,
Protein & DNA Sequence Analysis. Bobbie-Jo Webb-Robertson May 3, 2004
Protein & DNA Sequence Analysis Bobbie-Jo Webb-Robertson May 3, 2004 Sequence Analysis Anything connected to identifying higher biological meaning out of raw sequence data. 2 Genomic & Proteomic Data Sequence
When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want
1 When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want to search other databases as well. There are very
Searching Nucleotide Databases
Searching Nucleotide Databases 1 When we search a nucleic acid databases, Mascot always performs a 6 frame translation on the fly. That is, 3 reading frames from the forward strand and 3 reading frames
Sequencing the Human Genome
Revised and Updated Edvo-Kit #339 Sequencing the Human Genome 339 Experiment Objective: In this experiment, students will read DNA sequences obtained from automated DNA sequencing techniques. The data
GenBank: A Database of Genetic Sequence Data
GenBank: A Database of Genetic Sequence Data Computer Science 105 Boston University David G. Sullivan, Ph.D. An Explosion of Scientific Data Scientists are generating ever increasing amounts of data. Relevant
Apply PERL to BioInformatics (II)
Apply PERL to BioInformatics (II) Lecture Note for Computational Biology 1 (LSM 5191) Jiren Wang http://www.bii.a-star.edu.sg/~jiren BioInformatics Institute Singapore Outline Some examples for manipulating
MORPHEUS. http://biodev.cea.fr/morpheus/ Prediction of Transcription Factors Binding Sites based on Position Weight Matrix.
MORPHEUS http://biodev.cea.fr/morpheus/ Prediction of Transcription Factors Binding Sites based on Position Weight Matrix. Reference: MORPHEUS, a Webtool for Transcripton Factor Binding Analysis Using
DNA Sequence Analysis Software
DNA Sequence Analysis Software Group: Xin Xiong, Yuan Zhang, HongboLiu Supervisor: Henrik Bulskov Table of contents Introduction...2 1 Backgrounds and Motivation...2 1.1 Molecular Biology...2 1.2 Computer
Structure and Function of DNA
Structure and Function of DNA DNA and RNA Structure DNA and RNA are nucleic acids. They consist of chemical units called nucleotides. The nucleotides are joined by a sugar-phosphate backbone. The four
Algorithms in Computational Biology (236522) spring 2007 Lecture #1
Algorithms in Computational Biology (236522) spring 2007 Lecture #1 Lecturer: Shlomo Moran, Taub 639, tel 4363 Office hours: Tuesday 11:00-12:00/by appointment TA: Ilan Gronau, Taub 700, tel 4894 Office
Welcome to the Plant Breeding and Genomics Webinar Series
Welcome to the Plant Breeding and Genomics Webinar Series Today s Presenter: Dr. Candice Hansey Presentation: http://www.extension.org/pages/ 60428 Host: Heather Merk Technical Production: John McQueen
DNA Replication & Protein Synthesis. This isn t a baaaaaaaddd chapter!!!
DNA Replication & Protein Synthesis This isn t a baaaaaaaddd chapter!!! The Discovery of DNA s Structure Watson and Crick s discovery of DNA s structure was based on almost fifty years of research by other
Efficient Parallel Execution of Sequence Similarity Analysis Via Dynamic Load Balancing
Efficient Parallel Execution of Sequence Similarity Analysis Via Dynamic Load Balancing James D. Jackson Philip J. Hatcher Department of Computer Science Kingsbury Hall University of New Hampshire Durham,
Final Project Report
CPSC545 by Introduction to Data Mining Prof. Martin Schultz & Prof. Mark Gerstein Student Name: Yu Kor Hugo Lam Student ID : 904907866 Due Date : May 7, 2007 Introduction Final Project Report Pseudogenes
Genomes and SNPs in Malaria and Sickle Cell Anemia
Genomes and SNPs in Malaria and Sickle Cell Anemia Introduction to Genome Browsing with Ensembl Ensembl The vast amount of information in biological databases today demands a way of organising and accessing
Design Style of BLAST and FASTA and Their Importance in Human Genome.
Design Style of BLAST and FASTA and Their Importance in Human Genome. Saba Khalid 1 and Najam-ul-haq 2 SZABIST Karachi, Pakistan Abstract: This subjected study will discuss the concept of BLAST and FASTA.BLAST
DNA Mapping/Alignment. Team: I Thought You GNU? Lars Olsen, Venkata Aditya Kovuri, Nick Merowsky
DNA Mapping/Alignment Team: I Thought You GNU? Lars Olsen, Venkata Aditya Kovuri, Nick Merowsky Overview Summary Research Paper 1 Research Paper 2 Research Paper 3 Current Progress Software Designs to
Keeping up with DNA technologies
Keeping up with DNA technologies Mihai Pop Department of Computer Science Center for Bioinformatics and Computational Biology University of Maryland, College Park The evolution of DNA sequencing Since
BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS
BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS SEEMA JAGGI Indian Agricultural Statistics Research Institute Library Avenue, New Delhi-110 012 [email protected] Genomics A genome is an organism s
Genome Explorer For Comparative Genome Analysis
Genome Explorer For Comparative Genome Analysis Jenn Conn 1, Jo L. Dicks 1 and Ian N. Roberts 2 Abstract Genome Explorer brings together the tools required to build and compare phylogenies from both sequence
Core Bioinformatics. Degree Type Year Semester. 4313473 Bioinformàtica/Bioinformatics OB 0 1
Core Bioinformatics 2014/2015 Code: 42397 ECTS Credits: 12 Degree Type Year Semester 4313473 Bioinformàtica/Bioinformatics OB 0 1 Contact Name: Sònia Casillas Viladerrams Email: [email protected]
Genetic information (DNA) determines structure of proteins DNA RNA proteins cell structure 3.11 3.15 enzymes control cell chemistry ( metabolism )
Biology 1406 Exam 3 Notes Structure of DNA Ch. 10 Genetic information (DNA) determines structure of proteins DNA RNA proteins cell structure 3.11 3.15 enzymes control cell chemistry ( metabolism ) Proteins
2.3 Identify rrna sequences in DNA
2.3 Identify rrna sequences in DNA For identifying rrna sequences in DNA we will use rnammer, a program that implements an algorithm designed to find rrna sequences in DNA [5]. The program was made by
SGI. High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems. January, 2012. Abstract. Haruna Cofer*, PhD
White Paper SGI High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems Haruna Cofer*, PhD January, 2012 Abstract The SGI High Throughput Computing (HTC) Wrapper
Introduction to Bioinformatics 2. DNA Sequence Retrieval and comparison
Introduction to Bioinformatics 2. DNA Sequence Retrieval and comparison Benjamin F. Matthews United States Department of Agriculture Soybean Genomics and Improvement Laboratory Beltsville, MD 20708 [email protected]
CRAC: An integrated approach to analyse RNA-seq reads Additional File 3 Results on simulated RNA-seq data.
: An integrated approach to analyse RNA-seq reads Additional File 3 Results on simulated RNA-seq data. Nicolas Philippe and Mikael Salson and Thérèse Commes and Eric Rivals February 13, 2013 1 Results
2. True or False? The sequence of nucleotides in the human genome is 90.9% identical from one person to the next. False (it s 99.
1. True or False? A typical chromosome can contain several hundred to several thousand genes, arranged in linear order along the DNA molecule present in the chromosome. True 2. True or False? The sequence
A Primer of Genome Science THIRD
A Primer of Genome Science THIRD EDITION GREG GIBSON-SPENCER V. MUSE North Carolina State University Sinauer Associates, Inc. Publishers Sunderland, Massachusetts USA Contents Preface xi 1 Genome Projects:
DNA and the Cell. Version 2.3. English version. ELLS European Learning Laboratory for the Life Sciences
DNA and the Cell Anastasios Koutsos Alexandra Manaia Julia Willingale-Theune Version 2.3 English version ELLS European Learning Laboratory for the Life Sciences Anastasios Koutsos, Alexandra Manaia and
LESSON 9. Analyzing DNA Sequences and DNA Barcoding. Introduction. Learning Objectives
9 Analyzing DNA Sequences and DNA Barcoding Introduction DNA sequencing is performed by scientists in many different fields of biology. Many bioinformatics programs are used during the process of analyzing
Version 5.0 Release Notes
Version 5.0 Release Notes 2011 Gene Codes Corporation Gene Codes Corporation 775 Technology Drive, Ann Arbor, MI 48108 USA 1.800.497.4939 (USA) +1.734.769.7249 (elsewhere) +1.734.769.7074 (fax) www.genecodes.com
Integration of data management and analysis for genome research
Integration of data management and analysis for genome research Volker Brendel Deparment of Zoology & Genetics and Department of Statistics Iowa State University 2112 Molecular Biology Building Ames, Iowa
Module 10: Bioinformatics
Module 10: Bioinformatics 1.) Goal: To understand the general approaches for basic in silico (computer) analysis of DNA- and protein sequences. We are going to discuss sequence formatting required prior
Towards Integrating the Detection of Genetic Variants into an In-Memory Database
Towards Integrating the Detection of Genetic Variants into an 2nd International Workshop on Big Data in Bioinformatics and Healthcare Oct 27, 2014 Motivation Genome Data Analysis Process DNA Sample Base
Learning from Diversity
Learning from Diversity Epitope Prediction with Sequence and Structure Features using an Ensemble of Support Vector Machines Rob Patro and Carl Kingsford Center for Bioinformatics and Computational Biology
Data formats and file conversions
Building Excellence in Genomics and Computational Bioscience s Richard Leggett (TGAC) John Walshaw (IFR) Common file formats FASTQ FASTA BAM SAM Raw sequence Alignments MSF EMBL UniProt BED WIG Databases
Basic processing of next-generation sequencing (NGS) data
Basic processing of next-generation sequencing (NGS) data Getting from raw sequence data to expression analysis! 1 Reminder: we are measuring expression of protein coding genes by transcript abundance
Genetics Test Biology I
Genetics Test Biology I Multiple Choice Identify the choice that best completes the statement or answers the question. 1. Avery s experiments showed that bacteria are transformed by a. RNA. c. proteins.
BIOINF 525 Winter 2016 Foundations of Bioinformatics and Systems Biology http://tinyurl.com/bioinf525-w16
Course Director: Dr. Barry Grant (DCM&B, [email protected]) Description: This is a three module course covering (1) Foundations of Bioinformatics, (2) Statistics in Bioinformatics, and (3) Systems
Supervised DNA barcodes species classification: analysis, comparisons and results. Tutorial. Citations
Supervised DNA barcodes species classification: analysis, comparisons and results Emanuel Weitschek, Giulia Fiscon, and Giovanni Felici Citations If you use this procedure please cite: Weitschek E, Fiscon
Syllabus of B.Sc. (Bioinformatics) Subject- Bioinformatics (as one subject) B.Sc. I Year Semester I Paper I: Basic of Bioinformatics 85 marks
Syllabus of B.Sc. (Bioinformatics) Subject- Bioinformatics (as one subject) B.Sc. I Year Semester I Paper I: Basic of Bioinformatics 85 marks Semester II Paper II: Mathematics I 85 marks B.Sc. II Year
Molecular Databases and Tools
NWeHealth, The University of Manchester Molecular Databases and Tools Afternoon Session: NCBI/EBI resources, pairwise alignment, BLAST, multiple sequence alignment and primer finding. Dr. Georgina Moulton
(http://genomes.urv.es/caical) TUTORIAL. (July 2006)
(http://genomes.urv.es/caical) TUTORIAL (July 2006) CAIcal manual 2 Table of contents Introduction... 3 Required inputs... 5 SECTION A Calculation of parameters... 8 SECTION B CAI calculation for FASTA
CD-HIT User s Guide. Last updated: April 5, 2010. http://cd-hit.org http://bioinformatics.org/cd-hit/
CD-HIT User s Guide Last updated: April 5, 2010 http://cd-hit.org http://bioinformatics.org/cd-hit/ Program developed by Weizhong Li s lab at UCSD http://weizhong-lab.ucsd.edu [email protected] 1. Introduction
HPC-MAQ : A PARALLEL SHORT-READ REFERENCE ASSEMBLER
HPC-MAQ : A PARALLEL SHORT-READ REFERENCE ASSEMBLER Veeram Venkata Siva Prasad 1 and Gunisetti Loshma 2 1 M.Tech(CSE), Sri Vasavi Engineering college, Tadepalligudem, West Godavari District, Andhra Pradesh,
MAKING AN EVOLUTIONARY TREE
Student manual MAKING AN EVOLUTIONARY TREE THEORY The relationship between different species can be derived from different information sources. The connection between species may turn out by similarities
Next generation sequencing (NGS)
Next generation sequencing (NGS) Vijayachitra Modhukur BIIT [email protected] 1 Bioinformatics course 11/13/12 Sequencing 2 Bioinformatics course 11/13/12 Microarrays vs NGS Sequences do not need to be known
Cloud-Based Big Data Analytics in Bioinformatics
Cloud-Based Big Data Analytics in Bioinformatics Presented By Cephas Mawere Harare Institute of Technology, Zimbabwe 1 Introduction 2 Big Data Analytics Big Data are a collection of data sets so large
ISTEP+: Biology I End-of-Course Assessment Released Items and Scoring Notes
ISTEP+: Biology I End-of-Course Assessment Released Items and Scoring Notes Page 1 of 22 Introduction Indiana students enrolled in Biology I participated in the ISTEP+: Biology I Graduation Examination
PROC. CAIRO INTERNATIONAL BIOMEDICAL ENGINEERING CONFERENCE 2006 1. E-mail: [email protected]
BIOINFTool: Bioinformatics and sequence data analysis in molecular biology using Matlab Mai S. Mabrouk 1, Marwa Hamdy 2, Marwa Mamdouh 2, Marwa Aboelfotoh 2,Yasser M. Kadah 2 1 Biomedical Engineering Department,
Metagenomic and metatranscriptomic analysis
Metagenomic and metatranscriptomic analysis Marcelo Falsarella Carazzolle [email protected] Laboratório de Genômica e Expressão (LGE) Unicamp METAGENOMIC Jo Handelsman (1998) University of Wisconsin-EUA)
