Data formats and file conversions

Building Excellence in Genomics and Computational Biosciences
Richard Leggett (TGAC), John Walshaw (IFR)

Common file formats
- Raw sequence: FASTQ, FASTA
- Alignments: BAM, SAM, MSF
- Databases: EMBL, UniProt, GenBank
- Annotation: BED, WIG, VCF, GFF

FASTQ files
- e.g. Illumina read files
- 4 lines per read
- Stores sequence and quality information
- Each record holds a read ID line, the sequence, a "+" separator line and the quality string:
    @HWI-ST790:234:D0W8BACXX:1:1101:1792:2000 1:N:0:GCCAA
    ACNATTAACAACCTTGGTGTTCAGCATGAGAACTTATCTGCAGCTGAGTCTCGTATCCGTGACG
    +
    CC#4ADDFHHHHHIIIIEGHIIIIIIIIIIGIIIIIIIIIIIIIIIIIDGHHIDHHIII6@FGI
    @HWI-ST790:234:D0W8BACXX:1:1101:2592:1999 1:N:0:GCCAA
    CTNGAATGCAGGTAGAATACATCTCCCGGATAAGCCTCGCGGCCCCCGGGGCGGGGGGGGAGAG
    +
    :=#44AA?:<DFFE>FED?3A<EHH>FIF?ADGCGBA?D#########################
    @HWI-ST790:234:D0W8BACXX:1:1101:4221:1999 1:N:0:GCCAA
    GGNAAATACGAAAGATAAGCTACGCAAGAAACGAAGGATTACTGCGAAAGGCTGCGATGCGGCA
    +
    @@#4=BDDFDFHDIIBGIHHHIGGIIIBHHIF=ABB@?B<DE@BF<FHH@@EHACD<B3=8@:B

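FASTQ files: quality encoding
- Sanger format: quality scores 0-93
- Encoded with ASCII characters 33-126 (score + 33)
- Older versions of Illumina software use a slightly different encoding (for example, an ASCII offset of 64 instead of 33)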

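FASTQ files: Q scores
- The Q score relates to the probability, p, that a base is incorrect:
    Q = -10 log10(p), equivalently p = 10^(-Q/10)
- What this means:
  - Q10: p = 0.1 (1 base in 10 incorrect)
  - Q20: p = 0.01 (1 in 100)
  - Q30: p = 0.001 (1 in 1,000)
  - Q40: p = 0.0001 (1 in 10,000)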
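
The relationship between quality characters, Q scores and error probabilities can be sketched in a few lines of Python. This is only an illustration, assuming Sanger (Phred+33) encoding and the standard Phred definition p = 10^(-Q/10); the function names are made up for this example.

```python
from typing import List


def decode_sanger_quality(quality: str, offset: int = 33) -> List[int]:
    """Decode a quality string into integer Q scores (ASCII code minus offset)."""
    return [ord(ch) - offset for ch in quality]


def phred_to_error_prob(q: int) -> float:
    """Convert a Phred score Q into the probability that the base is wrong."""
    return 10 ** (-q / 10)


# First ten quality characters of the first read in the FASTQ example slide.
scores = decode_sanger_quality("CC#4ADDFHH")
print(scores)

for q in (10, 20, 30, 40):
    print(f"Q{q}: p = {phred_to_error_prob(q):g}")
```

Note that the "#" character decodes to Q2, which is why runs of "#" at the end of a read signal very low-confidence bases.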

FASTA files
- e.g. assembler contigs
- Stores ID and sequence data only
- Sequence data can cover multiple lines
- Each record holds a sequence ID line (starting with ">") followed by the sequence:
    >contig1
    ACNATTAACAACCTTGGTGTTCAGCATGAGAACTTATCTGCAGCTGAGTCTCGTATCCGTGACG
    CTGAGTCTCGTATCCGTGACGGTTAGGGCGATTAGCATAGA
    >contig2
    TGACTAGCGGATTCGATTCGGAGGCTTATGGGCATTCCAGATGCAGCTAGCAGATGACATAGAT
    GGGCATT
    >contig3
    CCCCCCTGACTAGCGGATTCGGTTCAGCATGAGTACGAATTCGGAGGCTTATGGGCATTCCAGA
    AGCGTGCAGCTAGCAGATGAAGCGCATAGATGGGCTATTGTTCAGCATGAGCTGATCAACTACG
    TACGGGACTGAGATGCCATGCAGTTGG
    >contig4
    TGACTAGCTAGTGGATTGACGAC

Manipulating FASTA and FASTQ files
- Numerous options:
  - FASTX toolkit: conversion, quality statistics, clipping, renaming, trimming, reverse complement, formatting and more.
  - NGSUtils: suite of utilities for working with NGS datasets.
  - EMBOSS sequence analysis package: a mature package which can do a lot.
- Many other programs, scripts or collections of scripts are available for common tasks. Google can help find them!
- Simple manipulations are possible even with one-line commands in UNIX/Linux shells (see the Introduction to Linux session).

FASTQ to FASTA conversion
Using FASTX Toolkit:
    $ fastq_to_fasta -h
    usage: fastq_to_fasta [-h] [-r] [-n] [-v] [-z] [-i INFILE] [-o OUTFILE]
    version 0.0.6
       [-h]         = This helpful help screen.
       [-r]         = Rename sequence identifiers to numbers.
       [-n]         = keep sequences with unknown (N) nucleotides.
                      Default is to discard such sequences.
       [-v]         = Verbose - report number of sequences.
                      If [-o] is specified, report will be printed to STDOUT.
                      If [-o] is not specified (and output goes to STDOUT),
                      report will be printed to STDERR.
       [-z]         = Compress output with GZIP.
       [-i INFILE]  = FASTA/Q input file. default is STDIN.
       [-o OUTFILE] = FASTA output file. default is STDOUT.

    $ fastq_to_fasta -Q 33 -i file.fastq -o file.fasta
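
To show what such a conversion involves, here is a minimal Python sketch. It is not the FASTX Toolkit implementation; it assumes four-line FASTQ records, and the function names and the `keep_n` parameter are chosen for this example (the latter mirrors the toolkit's default of discarding reads that contain N).

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) from four-line FASTQ records."""
    while True:
        header = handle.readline()
        if not header:
            return                      # end of file
        if not header.strip():
            continue                    # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header, got: {header!r}")
        seq = handle.readline().rstrip("\n")
        handle.readline()               # the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header[1:].rstrip("\n"), seq, qual


def fastq_to_fasta(src: TextIO, dst: TextIO, keep_n: bool = False) -> int:
    """Write FASTA records for each FASTQ read; return the number written."""
    written = 0
    for read_id, seq, _qual in read_fastq(src):
        if not keep_n and "N" in seq.upper():
            continue                    # discard reads with unknown bases
        dst.write(f">{read_id}\n{seq}\n")
        written += 1
    return written


# Small in-memory demonstration with two made-up reads.
demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nANGT\n+\nI#II\n")
out = io.StringIO()
print(fastq_to_fasta(demo, out), "read(s) written")
print(out.getvalue(), end="")
```

For real files, the same function can be called with handles from `open("file.fastq")` and `open("file.fasta", "w")`.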

Interleaving FASTQ files
- No one killer app:
  - shufflesequences_fastq.pl comes with Velvet in the contrib directory.
  - Interleave_fastq.py
- Example with shufflesequences:
    shufflesequences_fastq.pl file_r1.fastq file_r2.fastq file_r1r2.fastq
- You don't often need to go back from an interleaved file to separate files, but popgentools has a script called split-interleaved-fastq.pl.
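
Interleaving simply alternates records from the two paired files: read 1 of the pair, then read 2, and so on. The following Python sketch illustrates the idea under the assumption of four-line records in matching order; it is not the code of the scripts named above, and the function names are hypothetical.

```python
import io
from itertools import islice
from typing import Iterator, List, TextIO


def fastq_records(handle: TextIO) -> Iterator[List[str]]:
    """Yield each FASTQ record as a list of its four newline-terminated lines."""
    while True:
        record = list(islice(handle, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        # Normalise line endings so a missing final newline cannot merge records.
        yield [line.rstrip("\n") + "\n" for line in record]


def interleave_fastq(r1: TextIO, r2: TextIO, out: TextIO) -> int:
    """Write records alternately from r1 and r2; return the number of pairs."""
    second = fastq_records(r2)
    pairs = 0
    for rec1 in fastq_records(r1):
        rec2 = next(second, None)
        if rec2 is None:
            raise ValueError("second file has fewer reads than the first")
        out.writelines(rec1)
        out.writelines(rec2)
        pairs += 1
    if next(second, None) is not None:
        raise ValueError("second file has more reads than the first")
    return pairs


# In-memory demonstration with two made-up read pairs.
reads_1 = io.StringIO("@a/1\nAC\n+\nII\n@b/1\nGG\n+\nII\n")
reads_2 = io.StringIO("@a/2\nGT\n+\nII\n@b/2\nCC\n+\nII")
merged = io.StringIO()
print(interleave_fastq(reads_1, reads_2, merged), "pairs written")
print(merged.getvalue(), end="")
```

The sketch only checks that both files contain the same number of reads; it does not verify that the read IDs of each pair actually match.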

Splitting FASTA/Q files into chunks
- For example, to spread alignment load.
- For FASTA files, using fastasplit (Exonerate):
    fastasplit -f in.fasta -o outdir -c 100
- For FASTQ files, as long as they are not multi-line FASTQ, the Linux split command can be used (the line count must be a multiple of 4 so that records stay intact):
    split -l 1000 in.fastq outprefix_
- Using NGSUtils:
    fastqutils split in.fastq outprefix_ 100
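
When the total number of reads is not known in advance, one simple way to get roughly equal chunks is to deal records out in turn, like cards. The sketch below shows that round-robin approach in Python; it is an illustrative alternative, not a description of how fastasplit, split or fastqutils work internally, and it assumes four-line FASTQ records.

```python
import io
from itertools import islice
from typing import Iterator, List, TextIO


def fastq_records(handle: TextIO) -> Iterator[str]:
    """Yield each four-line FASTQ record as a single string."""
    while True:
        record = list(islice(handle, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield "".join(line.rstrip("\n") + "\n" for line in record)


def split_round_robin(handle: TextIO, n_chunks: int) -> List[List[str]]:
    """Distribute records over n_chunks lists; sizes differ by at most one."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    chunks: List[List[str]] = [[] for _ in range(n_chunks)]
    for i, record in enumerate(fastq_records(handle)):
        chunks[i % n_chunks].append(record)
    return chunks


# Five made-up reads split into three chunks.
demo = io.StringIO("".join(f"@read{i}\nACGT\n+\nIIII\n" for i in range(5)))
for number, chunk in enumerate(split_round_robin(demo, 3)):
    print(f"chunk {number}: {len(chunk)} read(s)")
```

This version keeps all chunks in memory, which is fine for a demonstration; for large files each chunk would be written to its own output file as records arrive.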

Exercise: FASTQ/FASTA
1. Convert the file example.fastq in the Documents directory into a FASTA file.
2. Interleave the two LIB6574 files inside Documents/reads to make a single FASTQ file.
3. Split the file exreads.fastq in the Documents directory into 5 (approximately) chunks.
4. Split the file example.fastq in the Documents directory into 3 (approximately) chunks.

Sequence databases
- Primary nucleotide DBs have their own native formats:
  - ENA db: EMBL format
  - NCBI Nucleotide db ("GenBank"): GenBank format
  - DDBJ: DDBJ format, very similar to GenBank
- Primary protein DBs likewise:
  - UniProt Knowledgebase: Swiss-Prot format, essentially the same as EMBL format
  - NCBI Protein db: GenBank format ("GenPept")
- Most sequence DBs will also provide the data in FASTA format
- Other DBs (e.g. for a particular genome-sequencing project) might use their own or standard formats

Exercise: Sequence databases (1)
- We will query ENA for some entries representing (partial) gene sequences of Purple Osier Willow
  - Obtain an entry in native ENA ("EMBL") format
  - And in FASTA format
  - And repeat the query in the NCBI Nucleotide DB to obtain the equivalent record in GenBank format
- In a different search, we will query the Sequence Read Archive (SRA) to obtain FASTA- and FASTQ-format data from the genome-sequencing project of the same Willow
  - We will use the NCBI implementation of SRA (the ENA or DRA versions could be used for the same search)
  - This sequencing project used 454 sequencing, which keeps the data sets (relatively) small
  - This kind of data is made available in compressed files, so we will uncompress and examine the files

Exercise: Sequence databases (2)
- Go to http://www.ebi.ac.uk/ena/
- Search ENA for: Salix purpurea
- Examine the hit-list of coding sequences
- Choose an entry representing a whole (not partial) gene
- Obtain native (EMBL) format and FASTA-format files of this entry
- Make a note of the Accession number of the record
- Examine the EMBL-format record:
  - Can you see cross-references to other databases?
  - Any to the UniProt KnowledgeBase?
  - Make a note of any cross-reference to UniProtKB which you see.
- Extra exercise if you have time: find, examine and download in Swiss-Prot format this UniProtKB entry

Exercise: Sequence databases (3)
- Go to http://www.ncbi.nlm.nih.gov/
- Change "All Databases" to "Nucleotide" and search for Salix purpurea
- To narrow down the hit list, click "Advanced" (under the search box)
- Restrict the search to:
  - Organism = Salix purpurea
  - Entries which do NOT have "partial cds" in any field
- How many of the hits appear to be protein-coding sequences?
- The entry equivalent to the one found in the ENA search should be in the list. What is its Accession number?
- Examine the record
- Click on "Send to" to download the entry in GenBank format

Exercise: Sequence databases (4)
Obtaining read data sets (FASTA and/or FASTQ) from SRA
- Go to http://www.ncbi.nlm.nih.gov/, change the DB to search to SRA, and search for Salix purpurea
- The hit list is a list of sequencing experiments
  - The Accession of an SRA experiment begins with SRX
- Among the hit list, look for those annotated as "random whole genome shotgun library"
- Note that these are 454 (GS FLX) sequence reads; each set is much smaller than the other (Illumina, GA II) sets
- Pick the smallest experiment (read set); this should take you here: http://www.ncbi.nlm.nih.gov/sra/srx029333

Exercise: Sequence databases (5)
- Each experiment is associated with one or more sequencing runs. This experiment has only one run.
- Click on the link (SRR070318)
- Click the "Reads" tab. Individual reads can be examined, but here we will download the set in bulk.
- Click on the "Filtered Download" button
- Select "clipped" and "FASTA"; click "Download"
- This will deliver the whole set of reads (auto quality-clipped) in a single compressed (gzipped) file
- The Linux (Ubuntu) archive manager should automatically provide access to the contents of this compressed file
- It can be examined e.g. in a text editor
- Then repeat, but this time obtain the FASTQ-format file
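
Gzipped read files do not have to be unpacked before they can be inspected. As a rough sketch, Python's standard `gzip` module can read them directly as text. The helper names below are invented for this example, and the demonstration writes its own tiny file rather than assuming the name of the file downloaded from SRA.

```python
import gzip
import os
import tempfile
from itertools import islice
from typing import List


def head_gz(path: str, n_lines: int = 4) -> List[str]:
    """Return the first n_lines of a gzipped text file, without line endings."""
    with gzip.open(path, "rt") as handle:
        return [line.rstrip("\n") for line in islice(handle, n_lines)]


def count_fastq_reads(path: str) -> int:
    """Count reads in a gzipped FASTQ file, assuming four lines per read."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for _ in handle) // 4


# Create a small gzipped FASTQ file (made-up reads) to demonstrate on.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo_reads.fastq.gz")
with gzip.open(demo_path, "wt") as handle:
    handle.write("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\nIIII\n")

print(head_gz(demo_path))
print(count_fastq_reads(demo_path), "reads")
```

The same `head_gz` call works for a gzipped FASTA file; only the read count relies on the four-line FASTQ layout.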

Alignments
- SAM format: Sequence Alignment/Map
- BAM format: binary version of SAM (compressed, more efficient)
- Use SAMtools to process.
- Example of reads aligned to a reference (the gap "--" in the reference marks where a read carries an insertion):
    Reference  TGCTTAGTCCTTAGTCTA--CTAGT
    Reads        CTTAGTCC
                        CTTGGTCT            <- mismatch (G instead of A): SNP? Error?
                              CTAAGCTA      <- insertion (AG)

The SAM file
The three aligned reads are stored as follows, one read per line (in the file itself the fields are tab-separated):
    Read ID  Flags  Ref ID  Pos  MAPQ  CIGAR   Mate    Read      Qualities  Optional fields
    Read1    0      TheRef  3    178   8M      * 0 0   CTTAGTCC  EEDDEEDE   AS:i:8 XS:i:0
    Read2    16     TheRef  10   150   8M      * 0 0   CTTGGTCT  FFEEDDEE   AS:i:7 XS:i:0
    Read3    0      TheRef  16   120   3M2I3M  * 0 0   CTAAGCTA  GGGHHHHH   AS:i:5 XS:i:0
- Read2 differs from the reference at one base (SNP? Error?)
- Read3 carries a two-base insertion, recorded in its CIGAR string as 3M2I3M
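
To make the column layout concrete, the following Python sketch splits one SAM alignment line into its eleven mandatory fields and interprets the flag and CIGAR string. It is a teaching aid rather than a replacement for SAMtools; the helper names are assumptions for this example, while the field order and the meaning of flag bit 16 (read mapped to the reverse strand) follow the SAM format.

```python
import re
from typing import Dict, List, Tuple

SAM_COLUMNS = ["qname", "flag", "rname", "pos", "mapq", "cigar",
               "rnext", "pnext", "tlen", "seq", "qual"]
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_sam_line(line: str) -> Dict[str, object]:
    """Split a tab-separated SAM alignment line into named fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    record: Dict[str, object] = dict(zip(SAM_COLUMNS, fields[:11]))
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        record[key] = int(fields[SAM_COLUMNS.index(key)])
    record["tags"] = fields[11:]          # optional fields such as AS:i:5
    return record


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Turn a CIGAR string such as '3M2I3M' into (length, operation) pairs."""
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]


def reference_span(cigar: str) -> int:
    """Number of reference bases covered (M, D, N, = and X consume reference)."""
    return sum(length for length, op in parse_cigar(cigar) if op in "MDN=X")


def is_reverse_strand(flag: int) -> bool:
    """Flag bit 16 (0x10) marks a read aligned to the reverse strand."""
    return bool(flag & 16)


read3 = parse_sam_line(
    "Read3\t0\tTheRef\t16\t120\t3M2I3M\t*\t0\t0\tCTAAGCTA\tGGGHHHHH\tAS:i:5\tXS:i:0"
)
print(parse_cigar(read3["cigar"]))
print("reference bases covered:", reference_span(read3["cigar"]))
print("reverse strand:", is_reverse_strand(read3["flag"]))
```

For Read3 the eight read bases cover only six reference positions (16 to 21), because the two inserted bases have no counterpart in the reference.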

SAMtools
- SAMtools tools:
  - view: filter SAM or BAM
  - sort: sort according to position on reference
  - index: create fast look-up of BAM or SAM
  - tview: text viewer for alignments
  - mpileup: generate pileup (BCF) file, e.g. for SNP calling
  - merge: merge sorted alignments
  - rmdup: remove potential PCR duplicates
  - and more
- For more info: http://samtools.sourceforge.net/samtools.shtml

Multiple Sequence Alignments
Related to read alignment, but with different aims, meanings and file formats
- Sequence read alignment ("assembly"):
  - Each nucleotide position (column) represents multiple copies of the same base of an original sequence (e.g. genome sequence)
- Multiple protein or nucleotide sequence alignment:
  - Each position (column) represents a homologous nucleotide (or amino acid)
  - Sequences are evolutionarily related (homologous) sequences, typically from different organisms, and/or multiple members of a gene family
  - Gaps represent insertions/deletions

Multiple Sequence Alignments
- Various file formats for MSA
- A multiple alignment can be represented in FASTA format
- MSA-dedicated formats are more richly annotated and more flexible for some purposes:
  - MSF
  - Stockholm
  - Selex
  - and others
- Each nucleotide or amino acid, and indel, is represented explicitly (c.f. SAM/BAM)
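
Because every residue and gap is written out explicitly, an alignment stored in FASTA format can be read column by column with very little code. The Python sketch below illustrates this; the three short sequences are invented for the example and "-" is assumed to be the gap character.

```python
from typing import Dict, List


def read_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into {sequence_id: sequence}, joining wrapped lines."""
    records: Dict[str, str] = {}
    name = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is None:
            raise ValueError("sequence data found before the first '>' line")
        else:
            records[name] += line
    return records


def alignment_columns(records: Dict[str, str]) -> List[str]:
    """Return the alignment columns; all aligned sequences must be equally long."""
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must all have the same length")
    return ["".join(column) for column in zip(*records.values())]


aligned = """>seq1
ACG-TA
>seq2
ACGGTA
>seq3
AC--TA
"""
columns = alignment_columns(read_fasta(aligned))
for position, column in enumerate(columns, start=1):
    note = "  <- contains a gap" if "-" in column else ""
    print(position, column, note)
```

The equal-length check is what distinguishes an aligned FASTA file from an ordinary one: unaligned sequences of different lengths are rejected.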

Multiple Sequence Alignments
[Figure: example alignment files in MSF and Stockholm formats]

Automation saves effort and prevents errors
- Many (but not all) sequence formats are flatfiles: they consist of plain-text characters
- It may be convenient to:
  - Examine a file's contents, e.g. with UNIX/Linux less or a text editor such as gedit; this can be useful as a quick sanity check
  - Perform a single operation on a single sequence manually
- But if even a simple manual operation is to be repeated many times, errors are likely
- Manual operations are likely to be infeasible for large sequence sets, or possible but very time-wasting
- If you find yourself doing something repetitive using interactive tools, ask yourself if there might be an easier way
- Often the answer is: there must be an easier way

Automation saves effort and prevents errors
- Repetitive chains of operations:
  - Data set A, in file A1
  - Reformat fileA1 -> fileA2
  - Input fileA2 into tool X -> (output) fileA3
  - Reformat fileA3 -> fileA4
  - Input fileA4 into tool Y -> (output) fileA5
  - Next week, repeat on Data set B
- Use automated pipelines
  - Re-usability of analysis steps/tools, in different combinations for different purposes
  - Ideally, a pipeline records each input/output process
  - E.g. Galaxy
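
The idea of a reusable, self-recording chain of steps can be sketched in plain Python. This is a toy illustration and has nothing to do with how Galaxy is implemented; the step functions are hypothetical stand-ins for real reformatting and analysis tools.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[str], str]]


def run_pipeline(data: str, steps: List[Step], log: List[str]) -> str:
    """Apply each named step in order and record what went in and came out."""
    for name, func in steps:
        result = func(data)
        log.append(f"{name}: {data!r} -> {result!r}")
        data = result
    return data


# Hypothetical stand-ins for "reformat" and "tool X" in the chain above.
def to_upper(seq: str) -> str:
    return seq.upper()


def reverse_complement(seq: str) -> str:
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


steps: List[Step] = [("uppercase", to_upper),
                     ("reverse complement", reverse_complement)]

# The same chain is reused unchanged for data set A and data set B.
for label, sequence in (("A", "acgtt"), ("B", "ggcat")):
    history: List[str] = []
    final = run_pipeline(sequence, steps, history)
    print(f"Data set {label}: {final}")
    for entry in history:
        print("   ", entry)
```

Defining the chain once and running it on each data set removes the manual repetition, and the log provides the record of each input/output step that the slide recommends.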

The (t)errors of cut-and-paste
A real-world example (but not with this actual sequence)
- A plant scientist working on a particular gene/protein asked a bioinformatician colleague to do some analyses on the protein sequence, along with those from the same family in related plants.
- The sequences were emailed to the bioinformatician.
- Unsurprisingly, the family of proteins exhibited numerous amino acid substitutions, and insertions/deletions.
- It was noticed that one sequence alone had two instances of an inserted dipeptide, Phenylalanine-Threonine (FT).
- These were 59 amino acids apart, and appeared to be absent from all related proteins in the databases.

The (t)errors of cut-and-paste
The sequence as received:
    >WillowMatK
    FSDSAIIDRFVRICRNLSHYYSGSSRKKSLYRIKYILRLSCVKTLFTARKHKSTVRIFLK
    RLGSELLDEFFTEEEQILFLTFPRVSSISQKLYRGRVWYLDIICINFTELSNHE
The EMBL-format record it was copied from:
    ID   AJ849584; SV 1; linear; genomic DNA; STD; PLN; 622 BP.
    DE   Salix purpurea chloroplast partial trna-lys gene intron and partial matk
    DE   gene for maturase K, clone A
    XX
    KW   matk gene; maturase K; trna-lys.
    XX
    FT   /gene="matk"
    FT   /product="maturase K"
    FT   /db_xref="goa:a0zvw3"
    FT   /db_xref="interpro:ipr024937"
    FT   /db_xref="uniprotkb/trembl:a0zvw3"
    FT   /protein_id="cah74183.1"
    FT   /translation="FSDSAIIDRFVRICRNLSHYYSGSSRKKSLYRIKYILRLSCVKTL
    FT   ARKHKSTVRIFLKRLGSELLDEFFTEEEQILFLTFPRVSSISQKLYRGRVWYLDIICIN
    FT   ELSNHE"
    XX
    SQ   Sequence 622 BP; 205 A; 123 C; 97 G; 197 T; 0 other;
         gggttgcccg ggactcgaac ccggactagt cggatggagt agagaatttc tttgttaaaa 60
The two "inserted" FT dipeptides are not amino acids at all: they are the "FT" (feature table) line codes at the start of the second and third translation lines, pasted in along with the sequence.
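
A short script avoids this kind of mistake by stripping the line codes before joining the translation lines. The Python sketch below shows one way to do it, assuming the layout of the record on this slide, where each feature line starts with "FT" and the translation ends at a closing double quote. The function name is made up, and for real work an established parser for EMBL files would be the safer choice.

```python
from typing import List


def extract_translation(embl_lines: List[str]) -> str:
    """Join the /translation qualifier of an EMBL record, without 'FT' codes."""
    parts: List[str] = []
    collecting = False
    for line in embl_lines:
        if not line.startswith("FT"):
            if collecting:
                break                     # left the feature table
            continue
        body = line[2:].strip()           # drop the 'FT' line code and padding
        if body.startswith('/translation="'):
            collecting = True
            body = body[len('/translation="'):]
        elif not collecting:
            continue                      # some other qualifier
        if body.endswith('"'):            # closing quote ends the qualifier
            parts.append(body[:-1])
            break
        parts.append(body)
    return "".join(parts)


record = [
    'FT   /product="maturase K"',
    'FT   /translation="FSDSAIIDRFVRICRNLSHYYSGSSRKKSLYRIKYILRLSCVKTL',
    'FT   ARKHKSTVRIFLKRLGSELLDEFFTEEEQILFLTFPRVSSISQKLYRGRVWYLDIICIN',
    'FT   ELSNHE"',
    'XX',
]
clean = extract_translation(record)

# The sequence as it was pasted by hand, with the two stray 'FT' codes.
pasted = ("FSDSAIIDRFVRICRNLSHYYSGSSRKKSLYRIKYILRLSCVKTLFTARKHKSTVRIFLK"
          "RLGSELLDEFFTEEEQILFLTFPRVSSISQKLYRGRVWYLDIICINFTELSNHE")

print(clean)
print("extra characters in the pasted version:", len(pasted) - len(clean))
```

Simply deleting every "FT" from the pasted sequence would not be a safe repair, because the genuine protein also contains the residues F and T next to each other (in "DEFFTEEE").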

Where to get software
- FASTX Toolkit: http://hannonlab.cshl.edu/fastx_toolkit/
- NGSUtils: http://ngsutils.org/
- EMBOSS: http://emboss.sourceforge.net
- Exonerate: http://www.ebi.ac.uk/~guy/exonerate/
- Velvet: https://www.ebi.ac.uk/~zerbino/velvet/
- Interleave_fastq.py: https://gist.github.com/ngcrawford/2232505
- popgentools: http://code.google.com/p/popgentools/
- SAMtools: http://samtools.sourceforge.net

Thank you. Any questions?