Frequently Asked Questions Next Generation Sequencing

These Frequently Asked Questions for Next Generation Sequencing are some of the more common questions our customers ask. Questions are divided into several categories for ease of use. If you do not find the answer you are looking for, contact our Technical Support Team by phone (between 9 am and 5 pm U.S. Central Time) or by e-mail at any time. We strive to answer all support requests within 24 business hours.

Categories: Import, ChIP-Seq, RNA-Seq

Import

Question: What aligners does Partek support?
Answer: Partek can import any files in .bam or .sam format, irrespective of the aligner.

Question: I have my aligned reads in .eland format. Can I import those files in Partek?
Answer: Please first convert the .eland files to .bam files. The converter can be found in Partek under the Tools menu (Tools > Convert Eland to BAM).

Question: Can I import and analyze my raw sequencing data in Partek?
Answer: No, Partek imports sequencing reads that have already been aligned. Alignment itself can be performed in Partek Flow.

Question: Do I need to set any filter on the aligned reads if I only want to analyze uniquely mapped reads?
Answer: No filtering is necessary, because Partek imports all the sequencing reads that have been aligned and counts each read once even if it has multiple alignments. As reads can be aligned to more than one location, the number of alignments may be greater than the number of reads. Conversely, since unaligned reads may be present in the BAM file, the number of alignments may be less than the number of reads. For a discussion of exonic, intronic, and intergenic reads, please refer to the white paper Understanding Reads (Help > On-line Tutorials, tab White Papers).

Question: How should I analyze technical replicates in Partek?
Answer: There are two ways to analyze technical replicates in Partek.
The first is to import the .bam/.sam files of the replicates separately and treat the technical replicates as biological replicates. In other words, use a categorical attribute to label them and then summarize the replicates during the statistical analysis (i.e. by ANOVA). Alternatively, technical replicates can be merged during the import stage: in the BAM Sample Manager, use the Manage samples option to assign the technical replicates to the same Sample ID.

ChIP-Seq

Question: How does Partek detect peaks in the ChIP-Seq workflow?
Answer: Partek traverses the reads in order and locates coverage that is above the user-defined threshold (defined as the fraction of false positive peaks allowed, in the Peak Detection dialog). It then finds the endpoints of the regions by taking the median of the forward reads (left endpoint) and the median of the reverse reads (right endpoint).

Question: In the ChIP-Seq workflow, I set the peak detection threshold to 10. Why do I sometimes see numbers smaller than 10 in column 8 (number of reads that begin in the region) for each sample?
Answer: This is because you selected directionally extend tags in the Peak Detection dialog. The peaks come from the extended reads, whereas the read counts come from the original reads.

Question: Will the peak cut-off FDR setting affect the result? How should I set up a proper threshold?
Answer: You can always set a conservative (low) threshold and filter later based on p-value. An often used threshold is

Question: In the Detect Motifs dialog, should the number of motifs always be set to one?
Answer: Because any given DNA-binding protein has only one core binding motif, one is probably a good choice. If you are looking for two half-sites, choose two.

Question: In the Detect Motifs dialog, changes to the minimum and maximum motif length have a great impact on the motif prediction. Are there any recommendations for these two settings?
Answer: Most binding sites are between 6 and 16 bases long; therefore, the Partek default setting is 6 to 16. We don't recommend a minimum of less than 6, but a maximum of more than 16 should be fine.
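The peak detection answer in this section describes two steps: locating coverage above a threshold, then taking medians of forward and reverse reads as the region endpoints. The Python sketch below illustrates that idea only; it is not Partek's implementation. The function name `detect_peaks`, the read tuple layout, the use of a raw coverage threshold (rather than one derived from a false positive fraction), and the choice of read start for forward reads and read end for reverse reads are all assumptions made for illustration.

```python
from statistics import median


def detect_peaks(reads, threshold):
    """Return (left, right) endpoints for regions with coverage >= threshold.

    reads: list of (start, end, strand) with half-open coordinates and
    strand '+' or '-'. Hypothetical helper, not part of any Partek API.
    """
    # Net change in coverage at each position where a read starts or ends.
    deltas = {}
    for start, end, _strand in reads:
        deltas[start] = deltas.get(start, 0) + 1
        deltas[end] = deltas.get(end, 0) - 1

    # Traverse positions in order and record spans at or above the threshold.
    regions = []
    coverage = 0
    region_start = None
    for pos in sorted(deltas):
        coverage += deltas[pos]
        if coverage >= threshold and region_start is None:
            region_start = pos
        elif coverage < threshold and region_start is not None:
            regions.append((region_start, pos))
            region_start = None

    peaks = []
    for region_start, region_end in regions:
        overlapping = [r for r in reads
                       if r[0] < region_end and r[1] > region_start]
        # Assumption: forward reads contribute their start position,
        # reverse reads contribute their end position.
        forward = [s for s, _e, strand in overlapping if strand == "+"]
        reverse = [e for _s, e, strand in overlapping if strand == "-"]
        left = median(forward) if forward else region_start
        right = median(reverse) if reverse else region_end
        peaks.append((left, right))
    return peaks


# Toy data: three forward reads followed by three reverse reads.
reads = [
    (100, 150, "+"), (110, 160, "+"), (120, 170, "+"),
    (140, 190, "-"), (150, 200, "-"), (160, 210, "-"),
]
peaks = detect_peaks(reads, threshold=2)
print(peaks)
```

With the toy reads, the forward starts and reverse ends each form a small cluster, so the medians pick the middle read of each cluster as the peak boundary.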
Question: Where do I specify the .2bit file if I am using a known database?
Answer: If you use a standard organism, such as human or mouse, Partek should download the .2bit file automatically. If that does not work, you can also download the hg19.2bit file manually. In Partek, go to Tools > File Manager. For the Please Specify Files For entry, select hg19 (assuming you are running an experiment on human samples, using the hg19 build of the human genome). In the Genome Sequence .2bit box, use Browse to specify the location of your .2bit file.

RNA-Seq

Question: Does Partek assign reads to different isoforms of a gene?
Answer: We use an expectation-maximization (EM) algorithm to probabilistically assign reads to known isoforms of a gene. Similar methods have been used for identifying isoforms in Xing Y, et al., An expectation-maximization algorithm for probabilistic reconstructions of full-length isoforms from splice graphs, Nucleic Acids Research 2006;34(10).

EM algorithm

Input:
1. Set of isoforms
2. Counts of the number of reads on each exon
3. Length of isoforms

Output: Proportion of each isoform, where the sum of the proportions is 1.

Algorithm: The EM algorithm is a way of solving a chicken-and-egg problem. If you knew the relative proportions of the isoforms, you could assign the reads to each isoform accordingly. If you knew the assignment of reads to isoforms, you could estimate the isoform proportions. The algorithm works by first guessing the isoform proportions (say 1/n, where n is the number of isoforms). Then, reads are assigned to each isoform based on the proportions. The reads mapped to the isoforms are then used to re-estimate the isoform proportions, and the two steps are repeated until the proportions stop changing.

Question: What is the RPKM value?
Answer: The RPKM value is reads per kilobase of exon model per million mapped reads. It is defined in Mortazavi A, et al., Mapping and quantifying mammalian transcriptomes by RNA-Seq, Nature Methods 2008;5(7).

Question: How does Partek calculate differentially expressed transcripts?
Answer: Partek uses a log-likelihood ratio test (an in-house method) to identify genes with different relative abundances of isoforms across samples. It is similar to the test discussed in Marioni JC, et al., RNA-seq: An assessment of technical reproducibility and comparison with gene expression arrays, Genome Research 2008;18(9).
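The EM procedure described in the isoform question of this section (inputs: isoforms, per-exon read counts, isoform lengths; output: proportions summing to 1) can be illustrated with a short Python sketch. This is a simplified illustration, not Partek's implementation. It assumes that reads fall uniformly along an isoform, so a read on an exon is shared among the isoforms containing that exon in proportion to proportion divided by isoform length, and that the output proportions are fractions of reads. The function name `em_isoform_proportions` and all data names are hypothetical.

```python
def em_isoform_proportions(isoform_exons, isoform_lengths, exon_counts,
                           n_iter=200, tol=1e-9):
    """Estimate the fraction of reads coming from each isoform.

    isoform_exons:   dict isoform -> set of exon ids it contains
    isoform_lengths: dict isoform -> isoform length in bases
    exon_counts:     dict exon id -> number of reads on that exon
    """
    names = list(isoform_exons)
    # Initial guess: every isoform gets proportion 1/n.
    theta = {name: 1.0 / len(names) for name in names}

    for _ in range(n_iter):
        # E-step: split each exon's reads among the compatible isoforms.
        assigned = {name: 0.0 for name in names}
        for exon, count in exon_counts.items():
            weights = {name: theta[name] / isoform_lengths[name]
                       for name in names if exon in isoform_exons[name]}
            norm = sum(weights.values())
            if norm == 0:
                continue  # exon not explained by any isoform
            for name, weight in weights.items():
                assigned[name] += count * weight / norm

        total = sum(assigned.values())
        if total == 0:
            break  # no reads could be assigned; keep the current guess

        # M-step: re-estimate proportions from the assigned reads.
        new_theta = {name: assigned[name] / total for name in names}
        change = max(abs(new_theta[name] - theta[name]) for name in names)
        theta = new_theta
        if change < tol:
            break
    return theta


# Toy gene: isoform A has exons e1 and e2, isoform B has only e1.
isoform_exons = {"A": {"e1", "e2"}, "B": {"e1"}}
isoform_lengths = {"A": 200, "B": 100}
exon_counts = {"e1": 170, "e2": 30}
theta = em_isoform_proportions(isoform_exons, isoform_lengths, exon_counts)
print(theta)
```

In the toy gene, only isoform A can explain reads on e2, so the 30 reads there pin down A's share and the remaining reads on e1 are divided between A and B accordingly.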

Question: How does Partek calculate the fold change of a transcript between two contrast groups?
Answer: The fold change is calculated as the ratio of RPKM values between the two contrast groups.

Question: Does Partek support paired-end reads? If so, where do I tell the software that these samples used paired-end reads?
Answer: Partek supports paired-end reads. The software automatically recognizes the read type in the .bam/.sam files and imports them.

Question: What source does Partek use for transcript annotation?
Answer: The source of transcript annotation can be chosen freely. Users can choose between RefSeq, AceView, and Ensembl, which are downloaded automatically, or can decide to use a custom annotation. For custom annotation (e.g. to be used with non-standard organisms), please go to Tools > Annotation Manager and select Create Annotation on the My Annotations tab. A Partek annotation (.pannot format) can be created by importing an annotation file in one of the following formats: .gtf, .gff, .bed, .bam, UCSC dbSNP file, UCSC RefFlat file (suggested), or .txt/.csv.

Question: Can I use Partek to analyze non-standard organisms?
Answer: Partek automatically downloads annotation files and reference genomes for standard organisms, e.g. human or mouse. For a non-standard organism, you need to manually provide a transcript annotation file (for details, please see the question on transcript annotation in this document) and a reference genome in .2bit format. The .2bit file can either be downloaded (UCSC and Ensembl provide reference genomes for many species) or created in Partek from the FASTA file containing the reference genome of the organism. The functionality is available at Tools > 2 Bit Creator.

Question: In the step for finding differentially expressed transcripts in the RNA-Seq dialog, I can choose whether the assay can discriminate between the sense strand and the antisense strand. What outcomes can I expect if I choose yes or no in this dialog?
Answer: It depends on the sample preparation. Some preparations preserve the strand information of the original transcript, such as Illumina's directional RNA-seq or the SOLiD Whole Transcriptome Analysis Kit. In these preparations, only the first-strand cDNA is synthesized from the RNA sample. Other preparations reverse transcribe the mRNA into double-stranded cDNA. In that case, the sequence is read from both the sense and the antisense strand, and the two cannot be discriminated. The biologists who prepared the sample should know this from the kit they used. If you select yes, you can see in the genome browser whether all the reads from a single transcript are on the sense or on the antisense strand (as indicated by two different colors). Selecting no will show mixed reads from both strands on the same transcript.

Question: I've noticed that Partek can display positive strand reads on one track and negative strand reads on another track. How do I enable this in my analysis?
Answer: The display of positive and negative strand reads on separate tracks is enabled by answering yes to the Can assay discriminate between sense and antisense strand? question in the RNA-Seq dialog. Partek can automatically detect whether reads come from the positive strand or the negative strand, and can display and analyze the strand-specific sequencing result.

Question: What are unexplained regions?
Answer: Unexplained regions contain reads that map to the genome but not to the transcriptome, i.e. not to the known transcripts. Please note that mapping depends on the chosen database. For instance, as RefSeq is more conservative than AceView, reads that do not map to one of the RefSeq transcripts (and are, hence, labeled unexplained) might map to an AceView transcript. There are at least two ways to check for that. First, mapping can be performed again with a different database selected. Second, unexplained regions can be overlaid with a transcript annotation different from the one used for mapping. To do the latter, please select the spreadsheet with unexplained regions, go to Tools > Find Overlapping Genes, and in the Output Overlapping Features dialog select an appropriate annotation source.

Question: Can Partek detect fusion genes?
Answer: Currently the system is designed to count reads into a predefined transcript space, so Partek does not have an obvious mechanism to count the enormous number of possible fusion gene combinations.

Last revision: October 2011