Tutorial for Windows and Macintosh. Preparing Your Data for NGS Alignment


2015 Gene Codes Corporation
Gene Codes Corporation
775 Technology Drive, Ann Arbor, MI 48108 USA
1.800.497.4939 (USA)
1.734.769.7249 (elsewhere)
1.734.769.7074 (fax)
www.genecodes.com
gcinfo@genecodes.com

Preparing Your Data for NGS Alignment
Data File Requirements for Multiplex ID (MID) Files
Data File Requirements for GSNAP
Preparing a Known SNPs Data File for GSNAP
Building Your GSNAP Reference Sequence Database
Data File Requirements for BWA-MEM
Building Your BWA-MEM Reference Sequence Index
Data File Requirements for Velvet
Data File Requirements for Maq

Preparing Your Data for NGS Alignment

Sequencher accepts many sequence formats. For Next-Generation sequencing, your reads should be in FastA or FastQ format. Most sequencers have their own native formats, but as long as the data can be converted into FastA or FastQ format, Sequencher will be able to align the reads.

When aligning sequences using BWA-MEM, Maq, or GSNAP, you will be using a reference sequence that has been imported into Sequencher. The advantage of using a GenBank sequence is that it will usually carry annotations. If you are performing a de novo assembly with the Velvet algorithms, a reference sequence is not required.

All the algorithms for Next-Generation sequence analysis will work with single-end or paired-end data. Each algorithm has its own requirements for data input files, so you may need to modify your files before performing the alignment.

Although it is possible to get FastQ format files from both Illumina and 454, the formats differ in how they encode confidence scores. Both represent each confidence score as a single ASCII character, but the key to decode the characters back into scores is different. 454 data conforms to what is popularly known as Sanger standard format. Illumina represents scores using a different range of ASCII characters, unless your pipeline is Casava 1.8 or later, in which case it is equivalent to Sanger standard format. When aligning with GSNAP, you will need to specify which FastQ variant you are using in the FastQ Encoding drop-down menu.

DATA FILE REQUIREMENTS FOR MULTIPLEX ID (MID) FILES

Valid entries for a barcodes file consist of a barcode name, followed by a tab character, followed by the barcode sequence itself. Barcode sequences must all be of the same length. For example:

MID1	TCAGATATCGCGAG

DATA FILE REQUIREMENTS FOR GSNAP

GSNAP supports both FastA and FastQ file formats (both Sanger standard and Illumina variants). Read lengths may vary but must fall within the range of 14 bp to 1500 bp, although GSNAP may be configured for longer reads. Two FastQ files will be treated as paired-end reads, and one FastQ file will be treated as single-end reads. Paired-end data in FastQ format must list the reads in the same order in both files. Here is an example:

File 1:
@NC_014230.1 _1831264_1831446_0/1
TCTCCATAAGTTGAGATAAGTTAGAAACCAAGTGTT
&IIIIIIIIIIIIIIBBII)$&IIII>.&%E*I0
@NC_014230.1 _1066261_1066432_1/1
GCTGAACTTGCATAATAGTGGACCAATCATAAGAAT
D!"IIIIIII$!!*(&%.$IIIIIIIIIIII3&III

File 2:
@NC_014230.1 _1831264_1831446_0/2
ATAGGATTCAAGGCAGATTTAAAATTGACGGCGCGC
III<IIIIIIIIIIIIIIIIII'%IICII3/II>=
@NC_014230.1 _1066261_1066432_1/2
AATCCTGGTAACAAAATGTTTTTACATTATAGCCTA
IIIIICIIIIIIIIADIIIIIIIIIIIIIIIII0II

FastA files may also represent both single-end and paired-end reads. GSNAP has specific requirements for modifying the basic FastA format for alignment.
o The entire read must be on a single line, with no line breaks in the DNA sections.
o If you have paired-end data, the second read must be on the next line.

For example:

>name sequence header information
ATGAACAGGCGCGATCTTCTTTTACAAGAAATGGGCATTTCCCAGTGGGA
GAATGTAAGCAGCCTATTCGTTATTGGTTACTATCAGAAAATAGCGACCA
>next-sequence
CACTTTGCCATTTTGCAAGCAGGCTGAGCAGGTTTATCGC
TATCGCCCCGAGGTACTGCAAGGTTCAGTAGGAATTAGTG

In the above example, there are 4 DNA sequences (2 pairs), not 2 sequences.

PREPARING A KNOWN SNPS DATA FILE FOR GSNAP

To perform a SNP-tolerant alignment using GSNAP, you must provide a file containing the list of known SNPs. GSNAP performs SNP hunting in a different way than Maq.

This text file has to list each SNP in a specific format, one SNP per line. Each line must begin with the > character followed by a SNP identifier, such as an rs number. The next pieces of data are reference and positional information in the format RefName:#..# followed by a major and minor allele.

In real data, the reference name to use before the colon is the name of the sequence you select in the Project Window. If there are spaces in the name, these should be replaced with underscores. For example, if a reference named My Hflu Ref is selected for a SNP-tolerant GSNAP alignment, the reference identifier used in the known SNPs file is My_Hflu_Ref.

The position information in this file always assumes that the first base of the reference sequence in Sequencher is 1, no matter what its actual numbering is relative to its chromosomal or contig position.
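The naming and numbering rules for the known SNPs file can be sketched in Python. This is only an illustration under stated assumptions: the function name, the single-space field separators, writing a single position as pos..pos for the #..# range, writing the two alleles together, and the sample rs number, position, and alleles are all hypothetical and do not come from the tutorial. Compare the output with a known-good SNPs file before relying on it.

```python
def snp_line(snp_id, reference_name, position, major, minor):
    """Format one known-SNP entry as '>id RefName:pos..pos MajorMinor'.

    The layout details (single spaces, pos..pos, alleles written together)
    are assumptions made for this sketch.
    """
    # Positions count from 1 at the first base of the reference in Sequencher.
    if position < 1:
        raise ValueError("position must be 1 or greater")
    for allele in (major, minor):
        if len(allele) != 1 or allele.upper() not in "ACGT":
            raise ValueError(f"unexpected allele: {allele!r}")
    # Spaces in the reference name are replaced with underscores.
    ref = reference_name.replace(" ", "_")
    return f">{snp_id} {ref}:{position}..{position} {major.upper()}{minor.upper()}"


# Hypothetical SNP on the reference named "My Hflu Ref".
print(snp_line("rs0000001", "My Hflu Ref", 1234, "A", "G"))
# prints: >rs0000001 My_Hflu_Ref:1234..1234 AG
```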

BUILDING YOUR GSNAP REFERENCE SEQUENCE DATABASE

You can use a single sequence in FastA format or a series of sequences in concatenated FastA format (for example, a series of chromosomes). If you plan to perform an in silico experiment such as comparing the BWA-MEM algorithm to the GSNAP algorithm, you will need to build a database or index for each aligner.

The files for the database are placed in the External Data Home (the default location is your Documents folder, but you may change this through a User Preference). You will be able to give the database a name of your own choosing. In addition, information about the database run will be visible in the External Data Browser. You can launch this by choosing Window>Open External Data Browser. Do this before the next step so that you can monitor the progress of the build.

Choose Assemble>Build Reference Database or Index>GSNAP Database to use the reference sequence with the GSNAP algorithm. The Build GSNAP Database dialog appears.
Click the Reference FASTA button and browse to the FastA file containing your reference sequence(s).
Click the Open button.
Give your database a name and click the Build button.

If you are using the External Data Browser to monitor the progress of the build (especially useful with large genomes), click the Refresh button from time to time and note the progress in the log file.

Once Sequencher has finished creating the database, you will see a green SUCCESS status in the Final Run Status column. If there was a problem with the creation of the database, you may see a red FAILED status. Consulting the log file will often give you a clue as to the reason for the failure. Very rarely, there will be no status message at all. This also indicates a problem with the build run.

The Notes field allows you to add pertinent information about your reference sequence. This is a good place to record information such as the build version for complete or partial genomes. Sequencher will automatically add the name of the sequence file used to create the database to the Notes field.

Once the database is built, you will see it listed in a drop-down menu the next time you use GSNAP.

DATA FILE REQUIREMENTS FOR BWA-MEM

BWA-MEM supports both FastA and FastQ file formats (both Sanger standard and Illumina variants). Two FastQ files will be treated as paired-end reads, and one FastQ file will be treated as single-end reads. Paired-end data in FastQ format must list the reads in the same order in both files. Here is an example:

File 1:
@NC_014230.1 _1831264_1831446_0/1
TCTCCATAAGTTGAGATAAGTTAGAAACCAAGTGTT
&IIIIIIIIIIIIIIBBII)$&IIII>.&%E*I0
@NC_014230.1 _1066261_1066432_1/1
GCTGAACTTGCATAATAGTGGACCAATCATAAGAAT
D!"IIIIIII$!!*(&%.$IIIIIIIIIIII3&III

File 2:
@NC_014230.1 _1831264_1831446_0/2
ATAGGATTCAAGGCAGATTTAAAATTGACGGCGCGC
III<IIIIIIIIIIIIIIIIII'%IICII3/II>=
@NC_014230.1 _1066261_1066432_1/2
AATCCTGGTAACAAAATGTTTTTACATTATAGCCTA
IIIIICIIIIIIIIADIIIIIIIIIIIIIIIII0II

Paired-end data in FastA format must list the reads in the same order in both files. BWA-MEM has specific requirements for modifying the basic FastA format for alignment: the entire read must be on a single line, with no line breaks in the DNA sections.

BUILDING YOUR BWA-MEM REFERENCE SEQUENCE INDEX

You can use a single sequence in FastA format or a series of sequences in concatenated FastA format (for example, a series of chromosomes). If you plan to perform an in silico experiment such as comparing the BWA-MEM algorithm to the GSNAP algorithm, you will need to build a database or index for each aligner.

The files for the index are placed in the External Data Home (the default location is your Documents folder, but you may change this through a User Preference). You will be able to give the index a name of your own choosing. In addition, information about the index run will be visible in the External Data Browser. You can launch this by choosing Window>Open External Data Browser. Do this before the next step so that you can monitor the progress of the build.

Choose Assemble>Build Reference Database or Index>BWA Index to use the reference sequence with the BWA-MEM algorithm. The Build BWA Index dialog appears.
Click the Reference FASTA button and browse to the FastA file containing your reference sequence(s).
Click the Open button.
Give your index a name and click the Build button.

If you are using the External Data Browser to monitor the progress of the build (especially useful with large genomes), click the Refresh button from time to time and note the progress in the log file.

Once Sequencher has finished creating the index, you will see a green SUCCESS status in the Final Run Status column. If there was a problem with the creation of the index, you may see a red FAILED status. Consulting the log file will often give you a clue as to the reason for the failure. Very rarely, there will be no status message at all. This also indicates a problem with the build run.

The Notes field allows you to add pertinent information about your reference sequence. This is a good place to record information such as the build version for complete or partial genomes. Sequencher will automatically add the name of the sequence file used to create the index to the Notes field.

Once the index is built, you will see it listed in a drop-down menu the next time you use BWA-MEM.

DATA FILE REQUIREMENTS FOR VELVET

Velvet supports both FastA and FastQ file formats (both Sanger standard and Illumina variants). Velvet may be configured for longer k-mers. Two FastQ files will be treated as paired-end reads, and one FastQ file will be treated as single-end reads. Paired-end data in FastQ format must list the reads in the same order in both files. Here is an example:

File 1:
@NC_014230.1 _1831264_1831446_0/1
TCTCCATAAGTTGAGATAAGTTAGAAACCAAGTGTT
&IIIIIIIIIIIIIIBBII)$&IIII>.&%E*I0
@NC_014230.1 _1066261_1066432_1/1
GCTGAACTTGCATAATAGTGGACCAATCATAAGAAT
D!"IIIIIII$!!*(&%.$IIIIIIIIIIII3&III

File 2:
@NC_014230.1 _1831264_1831446_0/2
ATAGGATTCAAGGCAGATTTAAAATTGACGGCGCGC
III<IIIIIIIIIIIIIIIIII'%IICII3/II>=
@NC_014230.1 _1066261_1066432_1/2
AATCCTGGTAACAAAATGTTTTTACATTATAGCCTA
IIIIICIIIIIIIIADIIIIIIIIIIIIIIIII0II

Paired-end data in FastA format must list the reads in the same order in both files. Velvet has specific requirements for modifying the basic FastA format for assembly.
o The entire read must be on a single line, with no line breaks in the DNA sections.
o If you have paired-end data, the second read must be on the next line.

For example:

>name sequence header information
ATGAACAGGCGCGATCTTCTTTTACAAGAAATGGGCATTTCCCAGTGGGA
GAATGTAAGCAGCCTATTCGTTATTGGTTACTATCAGAAAATAGCGACCA
>next-sequence
CACTTTGCCATTTTGCAAGCAGGCTGAGCAGGTTTATCGC
TATCGCCCCGAGGTACTGCAAGGTTCAGTAGGAATTAGTG

In the above example, there are 4 DNA sequences (2 pairs), not 2 sequences.
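To make the pairing rule concrete, the following Python sketch reads a FastA file laid out as described above (one header, then one or two unwrapped sequence lines) and counts entries and reads. The function name read_fasta_entries is hypothetical, and the sketch assumes that a third sequence line under one header indicates a wrapped read; it is not part of Sequencher or Velvet.

```python
def read_fasta_entries(lines):
    """Yield (header, reads) for FastA with one or two single-line reads per header."""
    header, reads = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, reads
            header, reads = line[1:], []
        else:
            if header is None:
                raise ValueError("sequence line found before any header")
            reads.append(line)
            # Assumption: more than two lines means a read was wrapped.
            if len(reads) > 2:
                raise ValueError(f"more than two sequence lines under {header!r}")
    if header is not None:
        yield header, reads


example = """\
>name sequence header information
ATGAACAGGCGCGATCTTCTTTTACAAGAAATGGGCATTTCCCAGTGGGA
GAATGTAAGCAGCCTATTCGTTATTGGTTACTATCAGAAAATAGCGACCA
>next-sequence
CACTTTGCCATTTTGCAAGCAGGCTGAGCAGGTTTATCGC
TATCGCCCCGAGGTACTGCAAGGTTCAGTAGGAATTAGTG
""".splitlines()

entries = list(read_fasta_entries(example))
total_reads = sum(len(reads) for _, reads in entries)
print(len(entries), "entries,", total_reads, "reads")
# prints: 2 entries, 4 reads
```

Each entry with two sequence lines is one pair, which is why the example file holds 4 reads in 2 pairs.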

DATA FILE REQUIREMENTS FOR MAQ

Data must be in FastQ format. Maq expects Sanger standard encoded quality scores. Read lengths can be no greater than 127 bases and must be the same for every read. Paired-end data (2 FastQ files) and single-end data (1 FastQ file) are supported.
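A quick pre-flight check of these requirements might look like the Python sketch below. It assumes the standard four-line FastQ record layout (header starting with @, sequence, separator line starting with +, quality string); the function name check_maq_reads and the sample reads are hypothetical. The quality check only confirms that characters lie in the printable range used by Sanger encoding (ASCII 33 to 126); it cannot prove that a file is Sanger encoded.

```python
import io

MAQ_MAX_READ_LENGTH = 127  # limit stated in the tutorial


def check_maq_reads(handle):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    lines = [line.rstrip("\r\n") for line in handle if line.strip()]
    if len(lines) % 4 != 0:
        problems.append("file does not contain a whole number of 4-line records")
    expected_length = None
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = lines[i:i + 4]
        name = header[1:] or f"record {i // 4 + 1}"
        if not header.startswith("@") or not plus.startswith("+"):
            problems.append(f"{name}: not a well-formed FastQ record")
            continue
        if len(seq) != len(qual):
            problems.append(f"{name}: sequence and quality lengths differ")
        if len(seq) > MAQ_MAX_READ_LENGTH:
            problems.append(f"{name}: read longer than {MAQ_MAX_READ_LENGTH} bases")
        if expected_length is None:
            expected_length = len(seq)
        elif len(seq) != expected_length:
            problems.append(
                f"{name}: length {len(seq)} differs from first read ({expected_length})"
            )
        # Sanger-encoded quality characters lie between '!' (33) and '~' (126).
        if any(not 33 <= ord(c) <= 126 for c in qual):
            problems.append(f"{name}: quality character outside ASCII 33-126")
    return problems


# Hypothetical data: the second read in bad_reads is shorter than the first.
good_reads = "@read1\nACGTACGT\n+\nIIIIIIII\n@read2\nACGTACGA\n+\nIIIIIIII\n"
bad_reads = "@read1\nACGTACGT\n+\nIIIIIIII\n@read2\nACGT\n+\nIIII\n"

print(check_maq_reads(io.StringIO(good_reads)))
print(check_maq_reads(io.StringIO(bad_reads)))
```

For paired-end data, the same check would be run on each of the two FastQ files.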