Introduction to NGS data analysis



Similar documents
Next Generation Sequencing

Introduction to next-generation sequencing data

Writing & Running Pipelines on the Open Grid Engine using QMake. Wibowo Arindrarto DTLS Focus Meeting

Tutorial for Windows and Macintosh. Preparing Your Data for NGS Alignment

NGS data analysis. Bernardo J. Clavijo

Next Generation Sequencing: Technology, Mapping, and Analysis

Next generation DNA sequencing technologies. theory & prac-ce

Next Generation Sequencing; Technologies, applications and data analysis

Copy Number Variation: available tools

Next Generation Sequencing; Technologies, applications and data analysis

E. coli plasmid and gene profiling using Next Generation Sequencing

Data Analysis for Ion Torrent Sequencing

Focusing on results not data comprehensive data analysis for targeted next generation sequencing

MiSeq: Imaging and Base Calling

Data Analysis & Management of High-throughput Sequencing Data. Quoclinh Nguyen Research Informatics Genomics Core / Medical Research Institute

Next Generation Sequencing; Technologies, applications and data analysis

Analysis of NGS Data

An example of bioinformatics application on plant breeding projects in Rijk Zwaan

Nazneen Aziz, PhD. Director, Molecular Medicine Transformation Program Office

Go where the biology takes you. Genome Analyzer IIx Genome Analyzer IIe

Text file One header line meta information lines One line : variant/position

Shouguo Gao Ph. D Department of Physics and Comprehensive Diabetes Center

8/7/2012. Experimental Design & Intro to NGS Data Analysis. Examples. Agenda. Shoe Example. Breast Cancer Example. Rat Example (Experimental Design)

G E N OM I C S S E RV I C ES

BIOL 3200 Spring 2015 DNA Subway and RNA-Seq Data Analysis

NGS Data Analysis: An Intro to RNA-Seq

Assuring the Quality of Next-Generation Sequencing in Clinical Laboratory Practice. Supplementary Guidelines

Lectures 1 and February 7, Genomics 2012: Repetitorium. Peter N Robinson. VL1: Next- Generation Sequencing. VL8 9: Variant Calling

Data Processing of Nextera Mate Pair Reads on Illumina Sequencing Platforms

Using Illumina BaseSpace Apps to Analyze RNA Sequencing Data

Challenges associated with analysis and storage of NGS data

LifeScope Genomic Analysis Software 2.5

Computational Genomics. Next generation sequencing (NGS)

Deep Sequencing Data Analysis

How-To: SNP and INDEL detection

Introduction to transcriptome analysis using High Throughput Sequencing technologies (HTS)

Comparing Methods for Identifying Transcription Factor Target Genes

Next Generation Sequencing; Technologies, applications and data analysis

BioHPC Web Computing Resources at CBSU

Delivering the power of the world s most successful genomics platform

Analysis of ChIP-seq data in Galaxy

Single-Cell DNA Sequencing with the C 1. Single-Cell Auto Prep System. Reveal hidden populations and genetic diversity within complex samples

SeqScape Software Version 2.5 Comprehensive Analysis Solution for Resequencing Applications

RNA Express. Introduction 3 Run RNA Express 4 RNA Express App Output 6 RNA Express Workflow 12 Technical Assistance

SMRT Analysis v2.2.0 Overview. 1. SMRT Analysis v SMRT Analysis v2.2.0 Overview. Notes:

Practical Guideline for Whole Genome Sequencing

Version 5.0 Release Notes

Genotyping by sequencing and data analysis. Ross Whetten North Carolina State University

Data formats and file conversions

Next generation sequencing (NGS)

New generation sequencing: current limits and future perspectives. Giorgio Valle CRIBI - Università di Padova

Welcome to the Plant Breeding and Genomics Webinar Series

-> Integration of MAPHiTS in Galaxy

SEQUENCING. From Sample to Sequence-Ready

Genomic Applications on Cray supercomputers: Next Generation Sequencing Workflow. Barry Bolding. Cray Inc Seattle, WA

Basic processing of next-generation sequencing (NGS) data

NGS Technologies for Genomics and Transcriptomics

RNA-Seq Tutorial 1. John Garbe Research Informatics Support Systems, MSI March 19, 2012

Single-Cell Whole Genome Sequencing on the C1 System: a Performance Evaluation

RNAseq / ChipSeq / Methylseq and personalized genomics

How To Find Rare Variants In The Human Genome

July 7th 2009 DNA sequencing

NECC History. Karl V. Steiner 2011 Annual NECC Meeting, Orono, Maine March 15, 2011

How Sequencing Experiments Fail

Bioinformatics Unit Department of Biological Services. Get to know us

Practical Solutions for Big Data Analytics

Databases and mapping BWA. Samtools

About the Princess Margaret Computational Biology Resource Centre (PMCBRC) cluster

Towards Integrating the Detection of Genetic Variants into an In-Memory Database

Bioruptor NGS: Unbiased DNA shearing for Next-Generation Sequencing

Genomic Testing: Actionability, Validation, and Standard of Lab Reports

The Power of Next-Generation Sequencing in Your Hands On the Path towards Diagnostics

NEXT GENERATION SEQUENCING

HiSeq Analysis Software v0.9 User Guide

Module 1. Sequence Formats and Retrieval. Charles Steward

UCLA Team Sequences Cell Line, Puts Open Source Software Framework into Production

New solutions for Big Data Analysis and Visualization

Targeted. sequencing solutions. Accurate, scalable, fast TARGETED

Next Generation Sequence Analysis and Computational Genomics Using Graphical Pipeline Workflows

A Complete Example of Next- Gen DNA Sequencing Read Alignment. Presentation Title Goes Here

Software Getting Started Guide

Core Facility Genomics

CHALLENGES IN NEXT-GENERATION SEQUENCING

Disease gene identification with exome sequencing

PreciseTM Whitepaper

Illumina Sequencing Technology

SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop

Leading Genomics. Diagnostic. Discove. Collab. harma. Shanghai Cambridge, MA Reykjavik

Overview sequence projects

DNA Sequencing & The Human Genome Project

When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want

Globus Genomics Tutorial GlobusWorld 2014

AS Replaces Page 1 of 50 ATF. Software for. DNA Sequencing. Operators Manual. Assign-ATF is intended for Research Use Only (RUO):

Automated DNA sequencing 20/12/2009. Next Generation Sequencing

Genetic Analysis. Phenotype analysis: biological-biochemical analysis. Genotype analysis: molecular and physical analysis

Transcription:

Introduction to NGS data analysis Jeroen F. J. Laros Leiden Genome Technology Center Department of Human Genetics Center for Human and Clinical Genetics

Sequencing Illumina platforms Characteristics: High throughput (3 genomes). Paired end. High accuracy. Read length 2 125bp. Relatively long run time (6 days). Relatively expensive. Figure 1: HiSeq 2500. Workshop NGS, Hogeschool Leiden 1/50 Thursday, May 22, 2014

Sequencing Illumina platforms Characteristics: Moderate throughput (3 exomes). Paired end. High accuracy. Read length 2 300bp. Relatively short run time (3 days). Relatively expensive. Figure 2: MiSeq. Workshop NGS, Hogeschool Leiden 2/50 Thursday, May 22, 2014

Sequencing Illumina platforms Figure 3: Flowcell. Figure 4: Amplification. Workshop NGS, Hogeschool Leiden 3/50 Thursday, May 22, 2014

Sequencing Illumina platforms Figure 3: Flowcell. Optical system. DNA fragments are loaded on a flowcell. Fragments are amplified. Clusters of fragments give the same signal. Figure 4: Amplification. Workshop NGS, Hogeschool Leiden 3/50 Thursday, May 22, 2014

Sequencing Illumina platforms Figure 5: Image of tile (part of the flowcell). Workshop NGS, Hogeschool Leiden 4/50 Thursday, May 22, 2014

Sequencing Illumina platforms Figure 6: Base calling on Illumina systems. Workshop NGS, Hogeschool Leiden 5/50 Thursday, May 22, 2014

Sequencing Illumina platforms Figure 7: Paired end. Workshop NGS, Hogeschool Leiden 6/50 Thursday, May 22, 2014

Sequencing Illumina platforms Advantages: Mapping to non-unique regions. Structural variation. De novo assembly (scaffolding). Figure 7: Paired end. Workshop NGS, Hogeschool Leiden 6/50 Thursday, May 22, 2014

Life-Tech Sequencing Characteristics: Low throughput. Single end (for now). High accuracy. Read length ±400bp. Short run time (1 day). Cheap runs. Figure 8: Ion torrent. Workshop NGS, Hogeschool Leiden 7/50 Thursday, May 22, 2014

Life-Tech Sequencing Figure 9: Ion proton. Characteristics: Moderate throughput (1 exome). Single end (for now). High accuracy. Read length ±175bp. Short run time (1 day). Cheap runs. Workshop NGS, Hogeschool Leiden 8/50 Thursday, May 22, 2014

Life-Tech Sequencing Figure 10: Ion Torrent chip. Figure 11: Ion Proton chip. Workshop NGS, Hogeschool Leiden 9/50 Thursday, May 22, 2014

Life-Tech Sequencing Figure 12: Close up of an Ion chip. Workshop NGS, Hogeschool Leiden 10/50 Thursday, May 22, 2014

Life-Tech Sequencing Figure 14: Base calling. Figure 13: Loading a sample. Problems with mono nucleotide stretches. Workshop NGS, Hogeschool Leiden 11/50 Thursday, May 22, 2014

Sequencing Pacific Biosciences Figure 15: PacBio RS. Workshop NGS, Hogeschool Leiden 12/50 Thursday, May 22, 2014

Sequencing Pacific Biosciences Characteristics: Long reads (50% is longer than 10,000 bp). High error rate (15-20%). Low throughput (comparable with the Roche 454). Workshop NGS, Hogeschool Leiden 13/50 Thursday, May 22, 2014

Sequencing Pacific Biosciences Characteristics: Long reads (50% is longer than 10,000 bp). High error rate (15-20%). Low throughput (comparable with the Roche 454). Circular consensus sequencing. Sequence the same molecule several times. Extremely high accuracy. Acceptable read length (±250bp). Workshop NGS, Hogeschool Leiden 13/50 Thursday, May 22, 2014

Sequencing Pacific Biosciences Figure 16: Real time polymerase monitoring. Workshop NGS, Hogeschool Leiden 14/50 Thursday, May 22, 2014

Sequencing Pacific Biosciences Figure 17: Base calling on the PacBio. Workshop NGS, Hogeschool Leiden 15/50 Thursday, May 22, 2014

Data analysis Next generation sequencing data 1 @SGGPP:4:101 2 TTCGGGGGCTGGCAAATCCACTTCCGTGACACGCTACCATTCGCTGGTGGT 3 + 4 +4589,53330 0&07+03:54/2362 +.488587>@/25440++0(+ 5 @SGGPP:4:102 6 CGGTAAACCACCCTGCTGACGGAACCCTAATGCGCCTGAAAGACAGCGTTC 7 + 8 34/ 0 +.000(.55:;:99(0(+2(22(0316;185;;0;:<<>=AA59 9 @SGGPP:4:106 10 TCGTTAACGACTTTGTTCGCCACCGCAACCGCCTGTTTCGGGTCACAGGCA 11 + 12 09875;5? <;?@A4?B:BBB<AA>CCC>C>BB0. >=0488+3444:@5@< 13 @SGGPP:4:112 14 TTGATGAATATATTATTTCAGGGAATAATTATGACACCTTTAGAACGCATT 15 + 16 70<<@::5:<;==7;>>/79<:.:494.8(,,8:753/5@5??C>B???B7 Listing 1: A FastQ file. Workshop NGS, Hogeschool Leiden 16/50 Thursday, May 22, 2014

Data analysis Common data analysis techniques Reference based techniques: Resequencing. Exome sequencing. Transcriptome sequencing (RNA). Full genome sequencing. Structural variation. Copy number variation. Not reference based: De novo assembly. k-mer profiling. Workshop NGS, Hogeschool Leiden 17/50 Thursday, May 22, 2014

Resequencing Data analysis Resequencing pipelines can roughly be divided in five steps. Workshop NGS, Hogeschool Leiden 18/50 Thursday, May 22, 2014

Resequencing Data analysis Resequencing pipelines can roughly be divided in five steps. 1. Pre-alignment. Quality control. Data cleaning. Workshop NGS, Hogeschool Leiden 18/50 Thursday, May 22, 2014

Resequencing Data analysis Resequencing pipelines can roughly be divided in five steps. 1. Pre-alignment. Quality control. Data cleaning. 2. Alignment. Post-alignment quality control. Workshop NGS, Hogeschool Leiden 18/50 Thursday, May 22, 2014

Resequencing Data analysis Resequencing pipelines can roughly be divided in five steps. 1. Pre-alignment. Quality control. Data cleaning. 2. Alignment. Post-alignment quality control. 3. Variant calling. Workshop NGS, Hogeschool Leiden 18/50 Thursday, May 22, 2014

Resequencing Data analysis Resequencing pipelines can roughly be divided in five steps. 1. Pre-alignment. Quality control. Data cleaning. 2. Alignment. Post-alignment quality control. 3. Variant calling. 4. Filtering. Post-variant calling quality control. Workshop NGS, Hogeschool Leiden 18/50 Thursday, May 22, 2014

Resequencing Data analysis Resequencing pipelines can roughly be divided in five steps. 1. Pre-alignment. Quality control. Data cleaning. 2. Alignment. Post-alignment quality control. 3. Variant calling. 4. Filtering. Post-variant calling quality control. 5. Annotation. Workshop NGS, Hogeschool Leiden 18/50 Thursday, May 22, 2014

Pre-alignment Data cleaning Depending on the sequencing platform, parts of the reads need to be removed. Remove linker sequences (Cutadapt, FASTX toolkit). Clip low quality reads at the end of the read (Sickle, Trimmomatic, FASTX toolkit). Length filtering (Fastools). http://code.google.com/p/cutadapt/ http://hannonlab.cshl.edu/fastx toolkit/ https://github.com/najoshi/sickle http://www.usadellab.org/cms/index.php?page=trimmomatic https://pypi.python.org/pypi/fastools Workshop NGS, Hogeschool Leiden 19/50 Thursday, May 22, 2014

Trimming Pre-alignment Figure 18: Quality score per position. Workshop NGS, Hogeschool Leiden 20/50 Thursday, May 22, 2014

Clipping Pre-alignment Figure 19: Sequencing linkers. Workshop NGS, Hogeschool Leiden 21/50 Thursday, May 22, 2014

Pre-alignment Quality control The FastQC toolkit can be used for quality control (both before and after the data cleaning step). GC content. GC distribution. Quality scores distribution.... http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ Workshop NGS, Hogeschool Leiden 22/50 Thursday, May 22, 2014

Figure 21: Per sequence quality. Workshop NGS, Hogeschool Leiden 23/50 Thursday, May 22, 2014 Pre-alignment Example QC output Figure 20: Per base sequence content.

Alignment The best match to the reference genome Very efficient. The reference genome needs to be indexed. Finding an alignment is as easy as looking up a word in a dictionary. Figure 22: Visualisation of an alignment. Workshop NGS, Hogeschool Leiden 24/50 Thursday, May 22, 2014

Alignment Choose an aligner Not all aligners can deal with indels. Only a couple of years ago, only SNPs were considered. Bowtie. http://bowtie-bio.sourceforge.net/index.shtml http://research-pub.gene.com/gmap/ http://tophat.cbcb.umd.edu/ http://bio-bwa.sourceforge.net/ Workshop NGS, Hogeschool Leiden 25/50 Thursday, May 22, 2014

Alignment Choose an aligner Not all aligners can deal with indels. Only a couple of years ago, only SNPs were considered. Bowtie. Few aligners can work with large deletions. Spliced RNA. GMAP / GSNAP. Tophat. BWA-MEM. http://bowtie-bio.sourceforge.net/index.shtml http://research-pub.gene.com/gmap/ http://tophat.cbcb.umd.edu/ http://bio-bwa.sourceforge.net/ Workshop NGS, Hogeschool Leiden 25/50 Thursday, May 22, 2014

Alignment Choose an aligner The choice of aligner may be restricted by the sequencer. For the Ion Torrent: Tmap. Combination of three different aligners. Deals with errors in homopolymer stretches. For the PacBio: BLASR. https://github.com/iontorrent/ts/tree/master/analysis/tmap https://github.com/pacificbiosciences/blasr Workshop NGS, Hogeschool Leiden 26/50 Thursday, May 22, 2014

Variant calling Consistent deviations from the reference Figure 23: Result of an alignment. Workshop NGS, Hogeschool Leiden 27/50 Thursday, May 22, 2014

Variant calling Some considerations Things a variant caller might take into account: Strand balance. Base quality. Mapping quality. Distribution within the reads. Ploidity of the organism in question. Workshop NGS, Hogeschool Leiden 28/50 Thursday, May 22, 2014

Variant calling Some considerations Things a variant caller might take into account: Strand balance. Base quality. Mapping quality. Distribution within the reads. Ploidity of the organism in question. Complicating factors: Pooled samples. Workshop NGS, Hogeschool Leiden 28/50 Thursday, May 22, 2014

Variant calling Some considerations Things a variant caller might take into account: Strand balance. Base quality. Mapping quality. Distribution within the reads. Ploidity of the organism in question. Complicating factors: Pooled samples. RNA. Allele specific expression. RNA editing. Workshop NGS, Hogeschool Leiden 28/50 Thursday, May 22, 2014

Variant calling Some considerations Things a variant caller might take into account: Strand balance. Base quality. Mapping quality. Distribution within the reads. Ploidity of the organism in question. Complicating factors: Pooled samples. RNA. Allele specific expression. RNA editing. Strand specific sampleprep. Workshop NGS, Hogeschool Leiden 28/50 Thursday, May 22, 2014

Variant calling Choice of variant caller Rules of thumb: Well known organism and experiment: Statistical model. Use a simpler variant caller otherwise. http://samtools.sourceforge.net/ https://www.broadinstitute.org/gatk/ http://varscan.sourceforge.net/ Workshop NGS, Hogeschool Leiden 29/50 Thursday, May 22, 2014

Variant calling Choice of variant caller Rules of thumb: Well known organism and experiment: Statistical model. Use a simpler variant caller otherwise. Popular variant callers: Samtools. GATK. VarScan. http://samtools.sourceforge.net/ https://www.broadinstitute.org/gatk/ http://varscan.sourceforge.net/ Workshop NGS, Hogeschool Leiden 29/50 Thursday, May 22, 2014

Variant filtering Filtering on coverage We can set some thresholds: Minimum. Maximum. Workshop NGS, Hogeschool Leiden 30/50 Thursday, May 22, 2014

Variant filtering Filtering on coverage We can set some thresholds: Minimum. Maximum. We filter for a maximum coverage because of copy number variation. Workshop NGS, Hogeschool Leiden 30/50 Thursday, May 22, 2014

Variant filtering Filtering on coverage We can set some thresholds: Minimum. Maximum. We filter for a maximum coverage because of copy number variation. A good way to calculate the maximum: Calculate the mean coverage. Only of the covered (targeted) regions. Multiply this number with a reasonable factor e.g., 2.5. Workshop NGS, Hogeschool Leiden 30/50 Thursday, May 22, 2014

Annotation What is already known about a variant A selection of SeattleSeq annotation: Is the variant known? Does it hit a gene? Workshop NGS, Hogeschool Leiden 31/50 Thursday, May 22, 2014

Annotation What is already known about a variant A selection of SeattleSeq annotation: Is the variant known? Does it hit a gene? Is it in an intron? Does it hit a splice site? Workshop NGS, Hogeschool Leiden 31/50 Thursday, May 22, 2014

Annotation What is already known about a variant A selection of SeattleSeq annotation: Is the variant known? Does it hit a gene? Is it in an intron? Does it hit a splice site? Is it in the coding region? Is there a gain/loss of a stop codon? Does the variant result in a frameshift?... Workshop NGS, Hogeschool Leiden 31/50 Thursday, May 22, 2014

Annotation What is already known about a variant A selection of SeattleSeq annotation: Is the variant known? Does it hit a gene? Is it in an intron? Does it hit a splice site? Is it in the coding region? Is there a gain/loss of a stop codon? Does the variant result in a frameshift?... Is it in the 5 /3 UTR of a gene?... Workshop NGS, Hogeschool Leiden 31/50 Thursday, May 22, 2014

Annotation What is already known about a variant A selection of SeattleSeq annotation: Is the variant known? Does it hit a gene? Is it in an intron? Does it hit a splice site? Is it in the coding region? Is there a gain/loss of a stop codon? Does the variant result in a frameshift?... Is it in the 5 /3 UTR of a gene?... Is it in a regulatory region?... Workshop NGS, Hogeschool Leiden 31/50 Thursday, May 22, 2014

Full genome sequencing Copy number variation Figure 24: Coverage patterns over a whole chromosome. Workshop NGS, Hogeschool Leiden 32/50 Thursday, May 22, 2014

Full genome sequencing Copy number variation Per sample: The reference needs to be very good. Sequencability biases. Mapping biases. Workshop NGS, Hogeschool Leiden 33/50 Thursday, May 22, 2014

Full genome sequencing Copy number variation Per sample: The reference needs to be very good. Sequencability biases. Mapping biases. Within a population: Mixture of distributions. Not sensitive to aforementioned biases. Needs a lot of controls. Workshop NGS, Hogeschool Leiden 33/50 Thursday, May 22, 2014

Full genome sequencing Structural variation Figure 25: Multiple issues while mapping. Workshop NGS, Hogeschool Leiden 34/50 Thursday, May 22, 2014

Full genome sequencing Structural variation Figure 26: Discordant and split reads. http://breakdancer.sourceforge.net/ http://sourceforge.net/projects/pindel/ Workshop NGS, Hogeschool Leiden 35/50 Thursday, May 22, 2014

Full genome sequencing Structural variation Figure 27: Different types of structural variation. Workshop NGS, Hogeschool Leiden 36/50 Thursday, May 22, 2014

Assesmbly De Novo assembly Figure 28: General overview of de novo assembly. Workshop NGS, Hogeschool Leiden 37/50 Thursday, May 22, 2014

Assesmbly De Novo assembly Figure 29: Overlaps are used to reconstruct a genome. Workshop NGS, Hogeschool Leiden 38/50 Thursday, May 22, 2014

Scaffolding De Novo assembly Figure 30: Paired end or mate pair reads can be used. Workshop NGS, Hogeschool Leiden 39/50 Thursday, May 22, 2014

De Novo assembly Easier assembly with PacBio Figure 31: Correcting PacBio reads. Workshop NGS, Hogeschool Leiden 40/50 Thursday, May 22, 2014

Pipelines Pipelines Figure 32: A real-life pipeline. Workshop NGS, Hogeschool Leiden 41/50 Thursday, May 22, 2014

Pipelines Pipelines Figure 33: Scene from Modern times. Workshop NGS, Hogeschool Leiden 42/50 Thursday, May 22, 2014

Pipelines Pipelines Combining tools: The output of one tool can serve as the input for another. Not necessarily linear.... Workshop NGS, Hogeschool Leiden 43/50 Thursday, May 22, 2014

Pipelines Pipelines Combining tools: The output of one tool can serve as the input for another. Not necessarily linear.... Running various different tools: Two or three different aligners. A couple of variant callers.... Workshop NGS, Hogeschool Leiden 43/50 Thursday, May 22, 2014

Pipelines Combining tools 1 bwa aln t 8 $reference $i > $i. sai 2 bwa samse $reference $i. sai $i > $i.sam 3 samtools view bt $reference o $i.bam $i.sam Listing 2: Shell script Workshop NGS, Hogeschool Leiden 44/50 Thursday, May 22, 2014

Pipelines Combining tools 1 bwa aln t 8 $reference $i > $i. sai 2 bwa samse $reference $i. sai $i > $i.sam 3 samtools view bt $reference o $i.bam $i.sam Listing 2: Shell script 1 %. sai : %.fq 2 $ (BWA) aln t $ (THREADS) $ ( c a l l MKREF, $@) $< > $@ 3 4 %.sam: %. sai %.fq 5 $ (BWA) samse $ ( c a l l MKREF, $@) $ ˆ > $@ 6 7 %.bam: %.sam 8 $ (SAMTOOLS) view bt $ ( c a l l MKREF, $@) o $@ $< Listing 3: Makefile Workshop NGS, Hogeschool Leiden 44/50 Thursday, May 22, 2014

Graphical interfaces Galaxy Galaxy: a graphical user interface: Wrapper for command line utilities. User friendly. Point and click. http://galaxy.psu.edu/ Workshop NGS, Hogeschool Leiden 45/50 Thursday, May 22, 2014

Galaxy Graphical interfaces Galaxy: a graphical user interface: Wrapper for command line utilities. User friendly. Point and click. Workflows. Save all the steps you did in your analysis. Rerun the entire analysis on a new dataset. Share your workflow with other people.... http://galaxy.psu.edu/ Workshop NGS, Hogeschool Leiden 45/50 Thursday, May 22, 2014

Galaxy Graphical interfaces Figure 34: Galaxy main user interface Workshop NGS, Hogeschool Leiden 46/50 Thursday, May 22, 2014

Galaxy Graphical interfaces Figure 35: User friendly interface with Galaxy Workshop NGS, Hogeschool Leiden 47/50 Thursday, May 22, 2014

Graphical interfaces Workflow of a parallel pipeline Makefile blueprint all create_report.secondary Makefile.SUFFIXES.DEFAULT recursive clean MC0184ACXX_s_1.h_cov MC0184ACXX_s_1.tar.gz MC0184ACXX_s_1.final.snponly.vcf.annotation MC0184ACXX_s_1.final.indelonly.vcf.annotation MC0184ACXX_s_1.covPerTarget MC0184ACXX_s_1.hsMetrics MC0184ACXX_s_1.mtdna_mincov.bed MC0184ACXX_s_1.asomes_avgcov.bed MC0184ACXX_s_1.xysomes_avgcov.bed MC0184ACXX_s_1.final.vcf.ti-tv MC0184ACXX_s_1.final.snponly.vcf MC0184ACXX_s_1.final.indelonly.vcf MC0184ACXX_s_1.final.vcf.vdb_annotation /sureselect_whole_exome_v2.hg19.cbr.bed.interval MC0184ACXX_s_1.asomes_mincov.bed MC0184ACXX_s_1.xysomes_mincov.bed MC0184ACXX_s_1.asomes_cnvhigh.bed MC0184ACXX_s_1.xysomes_cnvhigh.bed MC0184ACXX_s_1.final.vcf MC0184ACXX_s_1.insertSizesHist.pdf MC0184ACXX_s_1.pileup MC0184ACXX_s_1.insertSizes MC0184ACXX_s_1.mincov.bed MC0184ACXX_s_1.asomes_cutoff_steepcov.bed MC0184ACXX_s_1.xysomes_cutoff_steepcov.bed MC0184ACXX_s_1.asome.vcf MC0184ACXX_s_1.xysome.vcf MC0184ACXX_s_1.mtdna.vcf MC0184ACXX_s_1.bases.aligned MC0184ACXX_s_1.wig MC0184ACXX_s_1.asome_cutoff.vcf MC0184ACXX_s_1.xysome_cutoff.vcf MC0184ACXX_s_1.mtdna_cutoff.vcf MC0184ACXX_s_1.sorted.dedup.bam.flagstat MC0184ACXX_s_1.readsPerChr MC0184ACXX_s_1.sorted.dedup.bam.bai MC0184ACXX_s_1.sorted.dedup.bam.pileup MC0184ACXX_s_1.asome.targeted.avg-cov MC0184ACXX_s_1.xysome.targeted.avg-cov MC0184ACXX_s_1.bases.loose-ontarget MC0184ACXX_s_1.loose-ontarget.bam.flagstat MC0184ACXX_s_1.bases.strong-ontarget MC0184ACXX_s_1.strong-ontarget.bam.flagstat MC0184ACXX_s_1.bcf MC0184ACXX_s_1.sorted.dedup.bam.sorted.dedup.bam H.Sapiens/hg19/reference.fa /sureselect_whole_exome_v2.hg19.cbr.bed MC0184ACXX_s_1.sorted.dedup.bam MC0184ACXX_s_1.sorted.dedup.bam.sorted.bam MC0184ACXX_s_1.sorted.bam MC0184ACXX_s_1_1_sequence.filtered_fastqc MC0184ACXX_s_1_1_sequence_fastqc MC0184ACXX_s_1.sorted.dedup.bam.parts MC0184ACXX_s_1.parts 000.MC0184ACXX_s_1_1_sequence.part 000.MC0184ACXX_s_1_2_sequence.part MC0184ACXX_s_1_2_sequence.filtered_fastqc MC0184ACXX_s_1_1_sequence.part000 MC0184ACXX_s_1_2_sequence.part000 MC0184ACXX_s_1_2_sequence_fastqc MC0184ACXX_s_1_1_sequence.sync MC0184ACXX_s_1_2_sequence.sync MC0184ACXX_s_1_1_sequence.filtered MC0184ACXX_s_1_2_sequence.filtered MC0184ACXX_s_1_1_sequence.trimmed MC0184ACXX_s_1_2_sequence.txt.recaldata MC0184ACXX_s_1_1_sequence.txt.recaldata MC0184ACXX_s_1_2_sequence.trimmed MC0184ACXX_s_1_1_sequence.txt MC0184ACXX_s_1_2_sequence.txt Figure 36: Dependency diagram. Workshop NGS, Hogeschool Leiden 48/50 Thursday, May 22, 2014

Graphical interfaces Workflow of a parallel pipeline MC0184ACXX_s_1.sorted.dedup.bam MC0184ACXX_s_1.sorted.bam MC0184ACXX_s_1_1_sequenc MC0184ACXX_s_1.sorted.dedup.bam.parts MC0184ACXX_s_1.parts 000.MC0184ACXX_s_1_1_sequence.part 000.MC0184ACXX_s_1_2_sequence.part MC0184ACXX_s_1_1_sequence.part000 MC0184ACXX_s_1_2_sequence.part000 MC0184ACXX_s_1_1_sequence.sync MC0184ACXX_s_1_2_sequence.sync MC0184ACXX_s_1_1_sequence.filtered MC0184ACXX_s_1_2_sequence.filtered MC0184ACXX_s_1_1_sequence.trimmed MC0184ACXX_s_1_2_sequence.txt.recaldata MC0184ACXX_s_1 MC0184ACXX_s_1_1_sequence.txt Figure 37: Zoomed in. Workshop NGS, Hogeschool Leiden 49/50 Thursday, May 22, 2014

Questions? Acknowledgements: Michiel van Galen Martijn Vermaat Johan den Dunnen Workshop NGS, Hogeschool Leiden 50/50 Thursday, May 22, 2014