Frequently Asked Questions Next Generation Sequencing



These Frequently Asked Questions for Next Generation Sequencing are some of the more common questions our customers ask. Questions are divided into several categories for ease of use. If you do not find the answer you are looking for, contact our Technical Support Team by phone at +1-314-878-2329 (between 9 am and 5 pm U.S. Central Time), or by email (support@partek.com) at any time. We strive to answer all support requests within 24 business hours.

Select a category from the list below:
Import
ChIP-Seq
RNA-Seq

Import

Question: What aligners does Partek support?
Answer: Partek can import any files in .bam or .sam format, irrespective of the aligner.

Question: I have my aligned reads in .eland format. Can I import those files in Partek?
Answer: Please first convert the .eland files to .bam files. The converter can be found in Partek, under the Tools menu (Tools > Convert Eland to BAM).

Question: Can I import and analyze my raw sequencing data in Partek?
Answer: No, Partek imports sequencing reads that have already been aligned. Alignment itself can be performed in Partek Flow.

Question: Do I need to set any filter to the aligned reads if I only want to analyze uniquely mapped reads?
Answer: No filtering is necessary, because Partek imports all the sequencing reads that have been aligned, counting each read once even if it has multiple alignments. As reads can be aligned to more than one location, the number of alignments may be greater than the number of reads. Conversely, since unaligned reads may be present in the BAM file, the number of alignments may be less than the number of reads. For a discussion on exonic, intronic, and intergenic reads, please refer to the white paper Understanding Reads (Help > On-line Tutorials, tab White Papers).

Question: How should I analyze technical replicates in Partek?
Answer: There are two ways to analyze technical replicates in Partek. The first is to import the .bam/.sam files of the replicates separately and treat the technical replicates as biological replicates. In other words, use a categorical attribute to label them and then summarize the replicates during the statistical analysis (e.g. by ANOVA). Alternatively, technical replicates can be merged during the import stage. In the BAM Sample Manager, please use the Manage samples option to assign the technical replicates to the same Sample ID.

ChIP-Seq

Question: How does Partek detect peaks in the ChIP-seq workflow?
Answer: Partek traverses the reads in order and locates coverage that is above the user-defined threshold (defined as the fraction of false positive peaks allowed, in the Peak Detection dialog). It then finds the endpoints of the regions by taking the median of the forward reads (left endpoint) and the median of the reverse reads (right endpoint).

Question: In the ChIP-seq workflow, I set the peak detection threshold to 10. Why do I sometimes see numbers smaller than 10 in column 8 (number of reads that begin in the region for each sample)?
Answer: This is because you chose to directionally extend tags in the Peak Detection dialog. The peaks come from the extended reads, whereas the read counts come from the original reads.

Question: Will the peak cut-off FDR setting affect the result? How should I set up a proper threshold?
Answer: You can always set a conservative (low) threshold and filter later based on p-value. An often-used threshold is 0.001.

Question: In the Detect Motifs dialog, should the number of motifs always be set to one?
Answer: Because any given DNA binding protein has only one core binding motif, one is probably a good choice. If you are looking for two half-sites, then you would choose two.

Question: In the Detect Motifs dialog, changes to the minimum and maximum motif length have a great impact on the motif prediction. Are there any recommendations for these two settings?
Answer: Most binding sites are between 6 and 16 bases long; therefore, the Partek default setting is 6 to 16. We don't recommend a setting of less than 6, but more than 16 should be fine.
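The peak-detection answer earlier in this section (locate coverage above a threshold, then take medians of forward and reverse reads as endpoints) can be illustrated with a small sketch. This is not Partek's implementation. It assumes that reads are (start, end, strand) tuples with half-open coordinates, that the threshold is a raw coverage depth rather than the false-positive fraction set in the Peak Detection dialog, and that the left endpoint is the median start of forward-strand reads overlapping the region while the right endpoint is the median end of reverse-strand reads. The function name detect_peaks is hypothetical.

```python
from collections import defaultdict
from statistics import median


def detect_peaks(reads, threshold):
    """Return (left, right) endpoints for regions with coverage >= threshold.

    reads: iterable of (start, end, strand), half-open [start, end),
    strand is '+' or '-'. All conventions here are assumptions of the sketch.
    """
    reads = list(reads)

    # Record where coverage rises (+1 at a read start) and falls (-1 at its end).
    deltas = defaultdict(int)
    for start, end, _ in reads:
        deltas[start] += 1
        deltas[end] -= 1

    # Sweep positions in order and collect regions at or above the threshold.
    regions = []
    coverage = 0
    region_start = None
    for pos in sorted(deltas):
        coverage += deltas[pos]
        if coverage >= threshold and region_start is None:
            region_start = pos
        elif coverage < threshold and region_start is not None:
            regions.append((region_start, pos))
            region_start = None

    # Refine each region: median forward start on the left,
    # median reverse end on the right.
    peaks = []
    for lo, hi in regions:
        overlapping = [r for r in reads if r[0] < hi and r[1] > lo]
        fwd_starts = [s for s, _, strand in overlapping if strand == "+"]
        rev_ends = [e for _, e, strand in overlapping if strand == "-"]
        left = median(fwd_starts) if fwd_starts else lo
        right = median(rev_ends) if rev_ends else hi
        peaks.append((left, right))
    return peaks


example_reads = [
    (100, 150, "+"), (110, 160, "+"), (120, 170, "+"),
    (130, 180, "-"), (140, 190, "-"), (150, 200, "-"),
]
example_peaks = detect_peaks(example_reads, threshold=3)
```

In this toy input the forward reads pile up on the left and the reverse reads on the right, so the refined endpoints can extend beyond the stretch where coverage itself exceeds the threshold.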
Question: Where do I specify the .2bit file if I am using a known database?
Answer: If you use a standard organism, such as human or mouse, Partek should automatically download the .2bit file. However, if that does not work, you can also manually download the hg19.2bit file from http://hgdownload.cse.ucsc.edu/goldenpath/hg19/bigzips/. In Partek, go to Tools > File Manager. Select hg19 (assuming you are running an experiment on human samples, using the hg19 build of the human genome) for the Please Specify Files For entry. In the Genome Sequence .2bit box, use Browse to specify the location of your .2bit file.

RNA-Seq

Question: Does Partek assign reads to different isoforms of a gene?
Answer: Partek uses an expectation-maximization (EM) algorithm to probabilistically assign reads to known isoforms of a gene. Similar methods have been used for identifying isoforms in the paper by Xing Y, et al., An expectation-maximization algorithm for probabilistic reconstructions of full-length isoforms from splice graphs, Nucleic Acids Research 2006;34(10):3150-60. (http://nar.oxfordjournals.org/cgi/content/full/34/10/3150)

EM algorithm:
Input:
1. Set of isoforms
2. Counts of the number of reads on each exon
3. Length of isoforms
Output: Proportion of each isoform, where the sum of the proportions is 1.
Algorithm: The EM algorithm is a way of solving a chicken-and-egg problem: if you knew the relative proportions of the isoforms, you could assign the reads to each isoform accordingly; if you knew the assignment of reads to isoforms, you could estimate the isoform proportions. The algorithm works by first guessing the isoform proportions (say 1/n, where n is the number of isoforms). Then, reads are assigned to each isoform based on the proportions. The reads mapped to the isoforms are then used to re-estimate the isoform proportions. The two steps are repeated until the proportions stop changing.

Question: What is the RPKM value?
Answer: The RPKM value is reads per kilobase of exon model per million mapped reads. It is defined in the paper Mapping and quantifying mammalian transcriptomes by RNA-Seq, Mortazavi A., et al., Nature Methods 2008;5(7):621-8. (http://www.nature.com/nmeth/journal/v5/n7/abs/nmeth.1226.html)

Question: How does Partek calculate differentially expressed transcripts?
Answer: Partek uses a log-likelihood ratio test to identify genes with different relative abundances of isoforms across samples (in-house method), similar to the one discussed in the paper RNA-seq: An assessment of technical reproducibility and comparison with gene expression arrays, Marioni JC, et al., Genome Research, 2008;18(9):1509-17. (http://genome.cshlp.org/content/18/9/1509.full)
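The EM description and the RPKM definition in the answers above can be sketched together in a few lines. This is an illustration under stated assumptions, not Partek's in-house code: each isoform is given as a set of exon indices, a read from an isoform is assumed to fall uniformly along its length, the returned proportions are the fractions of reads attributed to each isoform, and the function names rpkm and em_isoform_proportions are hypothetical.

```python
def rpkm(read_count, length_bp, total_mapped_reads):
    """Reads per kilobase of exon model per million mapped reads."""
    return read_count * 1e9 / (length_bp * total_mapped_reads)


def em_isoform_proportions(isoform_exons, exon_counts, isoform_lengths,
                           max_iter=1000, tol=1e-9):
    """Estimate the fraction of reads coming from each isoform.

    isoform_exons: list of sets of exon indices, one set per isoform.
    exon_counts: number of reads on each exon.
    isoform_lengths: length of each isoform in bases.
    Assumes reads are spread uniformly along an isoform (sketch assumption).
    """
    n = len(isoform_exons)
    if n == 0:
        raise ValueError("at least one isoform is required")
    theta = [1.0 / n] * n  # initial guess: 1/n for every isoform

    for _ in range(max_iter):
        # E-step: split the reads on each exon among the isoforms containing
        # it, in proportion to (current proportion / isoform length).
        assigned = [0.0] * n
        for exon, count in enumerate(exon_counts):
            weights = [
                theta[i] / isoform_lengths[i] if exon in isoform_exons[i] else 0.0
                for i in range(n)
            ]
            norm = sum(weights)
            if norm == 0.0:
                continue  # exon not explained by any isoform; skip its reads
            for i in range(n):
                assigned[i] += count * weights[i] / norm

        total = sum(assigned)
        if total == 0.0:
            raise ValueError("no reads fall on any isoform")

        # M-step: re-estimate the proportions from the assigned reads.
        new_theta = [a / total for a in assigned]
        converged = max(abs(a - b) for a, b in zip(new_theta, theta)) < tol
        theta = new_theta
        if converged:
            break
    return theta


# Toy gene: isoform 0 uses exons 0 and 1, isoform 1 uses exon 0 only.
isoform_exons = [{0, 1}, {0}]
isoform_lengths = [200, 100]
exon_counts = [170, 30]
proportions = em_isoform_proportions(isoform_exons, exon_counts, isoform_lengths)

# Turn the proportions into per-isoform read counts, then into RPKM,
# assuming (for illustration only) one million mapped reads in the sample.
isoform_reads = [p * sum(exon_counts) for p in proportions]
isoform_rpkm = [rpkm(r, l, 1_000_000) for r, l in zip(isoform_reads, isoform_lengths)]
```

For this toy gene the proportions converge to about 0.3 and 0.7: the 30 reads on exon 1 can only come from isoform 0, which therefore also accounts for roughly 30 of the reads on the equally long exon 0.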

Question: How does Partek calculate the fold change of a transcript between two contrast groups?
Answer: The fold change is calculated as the ratio of RPKM values between the two contrast groups.

Question: Does Partek support paired-end reads? If so, where do I tell the software that these samples used paired-end reads?
Answer: Partek supports paired-end reads. The software will automatically recognize the read type in the .bam/.sam files and import them.

Question: What source does Partek use for transcript annotation?
Answer: Partek provides full flexibility in selecting the source for transcript annotation. Users can choose between RefSeq, AceView, and Ensembl, which are downloaded automatically, or can decide to use a custom annotation. For custom annotation (e.g. to be used with non-standard organisms), please go to Tools > Annotation Manager and select Create Annotation on the My Annotations tab. A Partek annotation (.pannot format) can be created by importing an annotation file in one of the following formats: .gtf, .gff, .bed, .bam, UCSC dbSNP file, UCSC RefFlat file (suggested), or .txt/.csv.

Question: Can I use Partek to analyze non-standard organisms?
Answer: Partek automatically downloads annotation files and reference genomes for standard organisms, e.g. human or mouse. For a non-standard organism, you need to manually provide a transcript annotation file (for details, please see the respective question in this document) and a reference genome in .2bit format. The .2bit file can be either downloaded (UCSC and Ensembl provide the reference genomes for many species) or created in Partek from the FASTA file containing the reference genome of the organism. The functionality is available at Tools > 2 Bit Creator.

Question: During the step for finding differentially expressed transcripts in the RNA-seq dialog, I can choose whether the assay discriminates between the sense and the antisense strand. What outcomes can I expect if I choose yes or no in this dialog?
Answer: It depends on the sample preparation. Some preparations preserve the strand information of the original transcript, such as Illumina's directional RNA-seq or the SOLiD Whole Transcriptome Analysis Kit. In these protocols, only first-strand cDNA is synthesized from the RNA sample. Other preparations reverse transcribe the mRNA into double-stranded cDNA. In that case, reads come from both the sense and the antisense strand and cannot be distinguished. The biologists who prepared the sample should know this from the kit they used. If you select yes, the genome browser shows all the reads from a single transcript on either the sense or the antisense strand (as indicated by two different colors). Selecting no will show mixed reads from both strands on the same transcript.

Question: I've noticed that Partek can display positive strand reads on one track and negative strand reads on another track. How do I enable this in my analysis?
Answer: The display of positive and negative strand reads on separate tracks is enabled by answering yes to the Can assay discriminate between sense and antisense strand? question in the RNA-seq dialog. Partek can automatically detect which reads come from the positive strand and which from the negative strand, and can display and analyze the strand-specific sequencing result.

Question: What are unexplained regions?
Answer: Unexplained regions contain reads that map to the genome but not to the transcriptome, i.e. not to the known transcripts. Please note that mapping depends on the chosen database. For instance, as RefSeq is more conservative than AceView, reads that do not map to one of the RefSeq transcripts (and are, hence, labeled unexplained) might map to an AceView transcript. There are at least two ways to check for that. First, mapping can be performed again with a different database selected. Second, unexplained regions can be overlaid with a transcript annotation different from the one used for mapping. To do the latter, please select the spreadsheet with unexplained regions, go to Tools > Find Overlapping Genes, and in the Output Overlapping Features dialog select an appropriate annotation source.

Question: Can Partek detect fusion genes?
Answer: Currently the system is designed to count reads into a predefined transcript space, so Partek does not have an obvious mechanism to count the enormous number of possible fusion gene combinations.

Last revision: October 2011