Data Analysis & Management of High-throughput Sequencing Data. Quoclinh Nguyen Research Informatics Genomics Core / Medical Research Institute

1 Data Analysis & Management of High-throughput Sequencing Data Quoclinh Nguyen Research Informatics Genomics Core / Medical Research Institute

2 Current Issues

3 Current Issues

4 The QSEQ file
Number of files per run: 120 x 8 x (1–2) = 960–1,920 files
Number of reads per run: ~40M x 8 x (1–2) = ~320M–640M
Total nucleotides (bp): 320M x 150 x (1–2) = ~48–96 billion
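
The per-run arithmetic on this slide can be reproduced with a short Python sketch. The parameter names are assumptions made for illustration: the slide gives only the bare numbers, so reading 120 as units per lane, 8 as lanes, and the 1–2 factor as single-read versus paired-end is a guess rather than something the slide states.

```python
def run_totals(units_per_lane=120, lanes=8, reads_per_lane=40_000_000,
               read_length=150, reads_per_fragment=1):
    """Return (files, reads, bases) for one run.

    All parameter names are hypothetical labels for the slide's numbers;
    reads_per_fragment is 1 or 2 (the slide's "(1-2)" factor).
    """
    files = units_per_lane * lanes * reads_per_fragment
    reads = reads_per_lane * lanes * reads_per_fragment
    bases = reads * read_length
    return files, reads, bases


if __name__ == "__main__":
    for factor in (1, 2):
        files, reads, bases = run_totals(reads_per_fragment=factor)
        print(f"factor {factor}: {files} files, {reads:,} reads, {bases:,} bases")
```

With the default values the lower bound is 960 files, 320M reads and 48 billion bases; doubling the factor doubles all three.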

5 What am I going to do with my sequencing data?

6 Overview
Infrastructure
Software: Novoalign, Picard, GATK, SAMtools, Bowtie, Cufflinks, MACS
RTA, RTA: Gigabit / drag-and-drop network
Prod: 400 nodes (64GB and 8GB)
Dev: 4 nodes/64GB
Isilon: 100TB very fast NAS, /rcgenomics (20 TB)
NAS/SAN: 400TB archives
Pipelines: Exome-Seq, RNA-Seq, ChIP-Seq
SSH, FTP, SMB, HTTP: http://lims.csms.edu
LIMS (NGS/Microarray): VM server (Linux); Linux (VM & physical); MySQL, Oracle, LAMP
Analysis / Mining

7 Overview
Infrastructure
Software: Novoalign, Picard, GATK, SAMtools, Bowtie, Cufflinks, MACS
Prod: 400 nodes (64GB and 8GB)
Isilon: 100TB
Dev: 4 nodes/64GB
(3 nodes x 75GB) + (4 nodes x 65GB) = 485GB
(3 nodes x 24 CPUs) + (4 nodes x 16 CPUs) = 120 CPUs
Pipelines: Exome-Seq, RNA-Seq, ChIP-Seq
NAS/SAN: 400TB archives
>300 nodes x 2 CPUs = >600 CPUs
>300 nodes x 4GB = >1,200GB
Fast NAS storage capacity: ~100TB
Other storage devices: ~400TB

8 RNA-Seq Analysis Pipeline
Phase 1: NGS data processing. Input: binary data (BCL files) -> raw data (QSEQ files) -> mapped reads (SAM -> BAM) -> QC by samtools -> output: clean BAM files
Phase 2: Basic analysis. Sample 1 BAM ... Sample N BAM -> BAM -> FASTQ -> TopHat -> mapped reads (BAM file) -> Cufflinks -> genes/expression files
Phase 3: Downstream analysis. genes.expr, hits.bam, transcripts.expr, transcripts.gtf -> Cuffcompare, Cuffdiff, UCSC, CLC BIO -> deliver results

9 Data Processing
Phase 1: NGS data processing. Binary data (BCL files) -> raw data (QSEQ files) -> mapped reads (SAM -> BAM) -> QC by samtools -> clean BAM files
QC / data filtering:
- Bin reads and remove indexes
- Remove adapters (if any)
- Remove duplicates
- Remove non-mapped sequences
- Filter out reads mapping to ribosomal RNA (percentage of ribosomal reads?)
QSEQ files -> aligned files (BAM)
Parallel processing (MPI)
What is in the alignment files?
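
Two of the filtering steps above (removing non-mapped sequences and removing duplicates) can be sketched on SAM text using only the FLAG column. This is a minimal illustration, not the pipeline used on the slides: it assumes duplicates were already flagged by an upstream tool, and the function names are hypothetical.

```python
UNMAPPED = 0x4     # SAM FLAG bit: segment unmapped
DUPLICATE = 0x400  # SAM FLAG bit: PCR or optical duplicate


def keep_record(sam_line):
    """Return True if the alignment line is mapped and not flagged as a duplicate."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])  # FLAG is the second SAM column
    return not (flag & (UNMAPPED | DUPLICATE))


def filter_sam(lines):
    """Yield header lines unchanged and only the alignment lines that pass keep_record."""
    for line in lines:
        if line.startswith("@") or keep_record(line):
            yield line


if __name__ == "__main__":
    example = [
        "@HD\tVN:1.0\n",
        "r1\t0\tchr1\t1\t60\t4M\t*\t0\t0\tACGT\tIIII\n",     # mapped, kept
        "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n",          # unmapped, dropped
        "r3\t1024\tchr1\t1\t60\t4M\t*\t0\t0\tACGT\tIIII\n",  # duplicate, dropped
    ]
    for kept in filter_sam(example):
        print(kept, end="")
```

Filtering on the FLAG bits keeps the sketch independent of any particular aligner; adapter and ribosomal-RNA filtering would need additional inputs and are not covered here.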

10 The BAM/SAM file
BAM (Binary Alignment/Map), SAM (Sequence Alignment/Map)
Standardized output for alignment
Contains all required information for downstream analysis:
- Query name
- Segment sequence
- Quality score (Phred-based)
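
To make the fields listed above concrete, here is a small Python sketch that splits one SAM alignment line into its eleven mandatory columns and decodes the quality string. It assumes the common Phred+33 ASCII encoding; the names SamRecord, parse_sam_line and phred_scores are illustrative, not part of any tool named on the slides.

```python
from collections import namedtuple

# The eleven mandatory SAM columns, in order.
SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar rnext pnext tlen seq qual",
)


def parse_sam_line(line):
    """Parse the mandatory columns of one SAM alignment line (optional tags are ignored)."""
    f = line.rstrip("\n").split("\t")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]),
                     f[5], f[6], int(f[7]), int(f[8]), f[9], f[10])


def phred_scores(qual):
    """Decode a quality string into integer Phred scores, assuming Phred+33 encoding."""
    return [ord(ch) - 33 for ch in qual]


if __name__ == "__main__":
    rec = parse_sam_line("read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII")
    print(rec.qname, rec.seq, phred_scores(rec.qual))
```

The query name, segment sequence and quality score from the slide correspond to the qname, seq and qual fields of the record.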

12 RNA-Seq Data Analysis
Phase 2: Basic analysis. Sample 1 BAM ... Sample N BAM -> BAM -> FASTQ -> TopHat -> mapped reads (BAM file) -> Cufflinks -> genes/expression files
One BAM file per sample
TopHat: aligns short reads with Bowtie and identifies splice junctions
Cufflinks: uses an annotation file to count transcripts and isoforms
Output files:
- Gene expression: genes.expr
- Transcript expression: transcripts.expr
- Transcripts in GTF format: transcripts.gtf

13 RNA-Seq Analysis
Phase 2: Basic analysis. Sample 1 BAM ... Sample N BAM -> BAM -> FASTQ -> TopHat -> mapped reads (BAM file) -> Cufflinks -> genes/expression files
Gene expression: genes.expr
Transcript expression: transcripts.expr

14 RNA-Seq Analysis
Phase 2: Basic analysis. Sample 1 BAM ... Sample N BAM -> BAM -> FASTQ -> TopHat -> mapped reads (BAM file) -> Cufflinks -> genes/expression files
Unit of measurement (FPKM/RPKM)
FPKM: fragments per kilobase of transcript per million mapped reads
Example: a 1 kb transcript with 1,000 alignments in a sample of 10M reads, of which 8M can be mapped, has FPKM = 1000 / (1 x 8) = 125
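
The FPKM definition and the worked example on this slide translate directly into a few lines of Python. This is a sketch of the formula only, with a hypothetical function name; it is not how Cufflinks estimates abundances internally.

```python
def fpkm(fragments, transcript_length_bp, mapped_reads):
    """Fragments per kilobase of transcript per million mapped reads."""
    if transcript_length_bp <= 0 or mapped_reads <= 0:
        raise ValueError("transcript length and mapped read count must be positive")
    kilobases = transcript_length_bp / 1_000
    millions_mapped = mapped_reads / 1_000_000
    return fragments / (kilobases * millions_mapped)


if __name__ == "__main__":
    # Slide example: 1 kb transcript, 1,000 alignments, 8M mapped reads out of 10M.
    print(fpkm(fragments=1_000, transcript_length_bp=1_000, mapped_reads=8_000_000))
```

Note that the denominator uses the 8M mapped reads, not the 10M total reads, which is why the example gives 125 rather than 100.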

16 RNA-Seq Analysis
Phase 3: Downstream analysis. genes.expr, hits.bam, transcripts.expr, transcripts.gtf -> Cuffcompare, Cuffdiff, UCSC, CLC BIO -> deliver results
Cuffcompare:
- Compares your assembled transcripts to a reference annotation
- Tracks Cufflinks transcripts across multiple experiments (e.g. across a time course)
- cuffcompare -<options>
Cuffdiff:
- Part of the Cufflinks package
- Finds significant changes in transcript expression, splicing, and promoter use
Viewer & annotation

17 Cuffdiff Sample Differential expression at the transcript isoform level and at the gene level

19 ChIP-Seq Analysis Pipeline
Phase 1: NGS data processing. Input: binary data (BCL files) -> raw data (QSEQ files) -> mapped reads (SAM -> BAM) -> QC by samtools -> output: clean BAM files
Phase 2: Basic analysis / peak calling. Sample 1 BAM ... Sample N BAM -> Novoalign -> mapped reads (BAM file) -> MACS -> peaks (raw results)
Phase 3: Downstream analysis. peaks.bed, peaks.xls, model.r, summits.bed -> CEAS, UCSC, CLC BIO
- Motif discovery
- Relationship to gene structure
- Gene set analysis
- Differential profile analysis
- Other advanced analysis
MACS: Model-based Analysis of ChIP-Seq / CEAS: Cis-regulatory Element Annotation System

20 I can start my research now.

21 Overview
Infrastructure
Software: Novoalign, Picard, GATK, SAMtools, Bowtie, Cufflinks, MACS
RTA / drag-and-drop
Prod: 400 nodes (64GB and 8GB)
Dev: 4 nodes/64GB
Isilon: 100TB very fast NAS, /rcgenomics (20 TB)
NAS/SAN: 900TB archives
Pipelines: Exome-Seq, RNA-Seq, ChIP-Seq
SSH, FTP, SMB, HTTP: http://lims.csms.edu
LIMS (NGS/Microarray): VM server (Linux); Linux (VM & physical); MySQL, Oracle, LAMP
Analysis / Mining

22 Data Management
Data storage:
- Understand NGS data
- Not all data are equally important
- How does data get stored?
User access:
- LIMS (Laboratory Informatics Management System)
- Service request
- Access data

23 Data Storage
Raw data: ~2–5 TB per run (NAS)
QSEQ/FASTQ, Intensities, BaseCalls
Results: SNV, indel, expr
LIMS, Fast NAS
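
As a rough planning aid, the raw-data figure on this slide can be combined with the ~100TB fast NAS capacity quoted on the infrastructure slides. The sketch below is an illustration only: it assumes the whole volume is available for raw run data, which the slides do not state, and the function name is hypothetical.

```python
def runs_that_fit(capacity_tb, tb_per_run):
    """Number of complete runs whose raw data fits in the given capacity."""
    if tb_per_run <= 0:
        raise ValueError("tb_per_run must be positive")
    return int(capacity_tb // tb_per_run)


if __name__ == "__main__":
    capacity_tb = 100  # fast NAS capacity quoted on the infrastructure slides
    for tb_per_run in (2, 5):  # the slide's lower and upper raw-data estimates
        print(f"{tb_per_run} TB/run -> {runs_that_fit(capacity_tb, tb_per_run)} runs")
```

Under these assumptions the fast NAS holds the raw data of roughly 20 to 50 runs, which is one way to see why not all data can be treated as equally important.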

24 We are

25 Access the NGS LIMS

26 NGS LIMS

27 NGS LIMS Add Sample

28 NGS LIMS Add Sample

29 NGS LIMS Add Library

30 NGS LIMS Create a FlowCell

31 NGS LIMS Get Results

32 NGS LIMS Get Results

33 Questions?


More information

High Performance Compu2ng Facility

High Performance Compu2ng Facility High Performance Compu2ng Facility Center for Health Informa2cs and Bioinforma2cs Accelera2ng Scien2fic Discovery and Innova2on in Biomedical Research at NYULMC through Advanced Compu2ng Efstra'os Efstathiadis,

More information

Welcome to the Plant Breeding and Genomics Webinar Series

Welcome to the Plant Breeding and Genomics Webinar Series Welcome to the Plant Breeding and Genomics Webinar Series Today s Presenter: Dr. Candice Hansey Presentation: http://www.extension.org/pages/ 60428 Host: Heather Merk Technical Production: John McQueen

More information

Genomic Testing: Actionability, Validation, and Standard of Lab Reports

Genomic Testing: Actionability, Validation, and Standard of Lab Reports Genomic Testing: Actionability, Validation, and Standard of Lab Reports emerge: Laura Rasmussen-Torvik Reaction: Heidi Rehm Summary: Dick Weinshilboum Panel: Murray Brilliant, David Carey, John Carpten,

More information

MiSeq: Imaging and Base Calling

MiSeq: Imaging and Base Calling MiSeq: Imaging and Page Welcome Navigation Presenter Introduction MiSeq Sequencing Workflow Narration Welcome to MiSeq: Imaging and. This course takes 35 minutes to complete. Click Next to continue. Please

More information

Next Generation Sequencing: Adjusting to Big Data. Daniel Nicorici, Dr.Tech. Statistikot Suomen Lääketeollisuudessa 29.10.2013

Next Generation Sequencing: Adjusting to Big Data. Daniel Nicorici, Dr.Tech. Statistikot Suomen Lääketeollisuudessa 29.10.2013 Next Generation Sequencing: Adjusting to Big Data Daniel Nicorici, Dr.Tech. Statistikot Suomen Lääketeollisuudessa 29.10.2013 Outline Human Genome Project Next-Generation Sequencing Personalized Medicine

More information

Databases and mapping BWA. Samtools

Databases and mapping BWA. Samtools Databases and mapping BWA Samtools FASTQ, SFF, bax.h5 ACE, FASTG FASTA BAM/SAM GFF, BED GenBank/Embl/DDJB many more File formats FASTQ Output format from Illumina and IonTorrent sequencers. Quality scores:

More information

RNA Express. Introduction 3 Run RNA Express 4 RNA Express App Output 6 RNA Express Workflow 12 Technical Assistance

RNA Express. Introduction 3 Run RNA Express 4 RNA Express App Output 6 RNA Express Workflow 12 Technical Assistance RNA Express Introduction 3 Run RNA Express 4 RNA Express App Output 6 RNA Express Workflow 12 Technical Assistance ILLUMINA PROPRIETARY 15052918 Rev. A February 2014 This document and its contents are

More information

Nebula A web-server for advanced ChIP-seq data analysis. Tutorial. by Valentina BOEVA

Nebula A web-server for advanced ChIP-seq data analysis. Tutorial. by Valentina BOEVA Nebula A web-server for advanced ChIP-seq data analysis Tutorial by Valentina BOEVA Content Upload data to the history pp. 5-6 Check read number and sequencing quality pp. 7-9 Visualize.BAM files in UCSC

More information

Data Processing of Nextera Mate Pair Reads on Illumina Sequencing Platforms

Data Processing of Nextera Mate Pair Reads on Illumina Sequencing Platforms Data Processing of Nextera Mate Pair Reads on Illumina Sequencing Platforms Introduction Mate pair sequencing enables the generation of libraries with insert sizes in the range of several kilobases (Kb).

More information

Data Management & Storage for NGS

Data Management & Storage for NGS Data Management & Storage for NGS 2009 Pre-Conference Workshop Chris Dagdigian BioTeam Inc. Independent Consulting Shop: Vendor/technology agnostic Staffed by: Scientists forced to learn High Performance

More information

SMRT Analysis v2.2.0 Overview. 1. SMRT Analysis v2.2.0. 1.1 SMRT Analysis v2.2.0 Overview. Notes:

SMRT Analysis v2.2.0 Overview. 1. SMRT Analysis v2.2.0. 1.1 SMRT Analysis v2.2.0 Overview. Notes: SMRT Analysis v2.2.0 Overview 100 338 400 01 1. SMRT Analysis v2.2.0 1.1 SMRT Analysis v2.2.0 Overview Welcome to Pacific Biosciences' SMRT Analysis v2.2.0 Overview 1.2 Contents This module will introduce

More information

CRAC: An integrated approach to analyse RNA-seq reads Additional File 3 Results on simulated RNA-seq data.

CRAC: An integrated approach to analyse RNA-seq reads Additional File 3 Results on simulated RNA-seq data. : An integrated approach to analyse RNA-seq reads Additional File 3 Results on simulated RNA-seq data. Nicolas Philippe and Mikael Salson and Thérèse Commes and Eric Rivals February 13, 2013 1 Results

More information

Analysis and Integration of Big Data from Next-Generation Genomics, Epigenomics, and Transcriptomics

Analysis and Integration of Big Data from Next-Generation Genomics, Epigenomics, and Transcriptomics Analysis and Integration of Big Data from Next-Generation Genomics, Epigenomics, and Transcriptomics Christopher Benner, PhD Director, Integrative Genomics and Bioinformatics Core (IGC) idash Webinar,

More information

Development of Bio-Cloud Service for Genomic Analysis Based on Virtual

Development of Bio-Cloud Service for Genomic Analysis Based on Virtual Development of Bio-Cloud Service for Genomic Analysis Based on Virtual Infrastructure 1 Jung-Ho Um, 2 Sang Bae Park, 3 Hoon Choi, 4 Hanmin Jung 1, First Author Korea Institute of Science and Technology

More information

Next Generation Sequencing

Next Generation Sequencing Next Generation Sequencing Technology and applications 10/1/2015 Jeroen Van Houdt - Genomics Core - KU Leuven - UZ Leuven 1 Landmarks in DNA sequencing 1953 Discovery of DNA double helix structure 1977

More information

Geospiza s Finch-Server: A Complete Data Management System for DNA Sequencing

Geospiza s Finch-Server: A Complete Data Management System for DNA Sequencing KOO10 5/31/04 12:17 PM Page 131 10 Geospiza s Finch-Server: A Complete Data Management System for DNA Sequencing Sandra Porter, Joe Slagel, and Todd Smith Geospiza, Inc., Seattle, WA Introduction The increased

More information

Cloud BioLinux: Pre-configured and On-demand Bioinformatics Computing for the Genomics Community

Cloud BioLinux: Pre-configured and On-demand Bioinformatics Computing for the Genomics Community Cloud BioLinux: Pre-configured and On-demand Bioinformatics Computing for the Genomics Community Ntinos Krampis Asst. Professor J. Craig Venter Institute kkrampis@jcvi.org http://www.jcvi.org/cms/about/bios/kkrampis/

More information

Automating installation, testing and development of bcbio-nextgen pipeline

Automating installation, testing and development of bcbio-nextgen pipeline Automating installation, testing and development of bcbio-nextgen pipeline GUILLERMO CARRASCO HERNÁNDEZ guillermo.carrasco@scilifelab.se June 2013 Final project at Barcelona School of Informatics (FIB)

More information

Issues in Data Storage and Data Management in Large- Scale Next-Gen Sequencing

Issues in Data Storage and Data Management in Large- Scale Next-Gen Sequencing Issues in Data Storage and Data Management in Large- Scale Next-Gen Sequencing Matthew Trunnell Manager, Research Computing Broad Institute Overview The Broad Institute Major challenges Current data workflow

More information

The Galaxy workflow. George Magklaras PhD RHCE

The Galaxy workflow. George Magklaras PhD RHCE The Galaxy workflow George Magklaras PhD RHCE Biotechnology Center of Oslo & The Norwegian Center of Molecular Medicine University of Oslo, Norway http://www.biotek.uio.no http://www.ncmm.uio.no http://www.no.embnet.org

More information

Managing and Conducting Biomedical Research on the Cloud Prasad Patil

Managing and Conducting Biomedical Research on the Cloud Prasad Patil Managing and Conducting Biomedical Research on the Cloud Prasad Patil Laboratory for Personalized Medicine Center for Biomedical Informatics Harvard Medical School SaaS & PaaS gmail google docs app engine

More information

CloudMap: A Cloud-based Pipeline for Analysis of Mutant Genome Sequences

CloudMap: A Cloud-based Pipeline for Analysis of Mutant Genome Sequences Genetics: Advance Online Publication, published on October 10, 2012 as 10.1534/genetics.112.144204 CloudMap: A Cloud-based Pipeline for Analysis of Mutant Genome Sequences Gregory Minevich 1,, Danny S.

More information

Understanding West Nile Virus Infection

Understanding West Nile Virus Infection Understanding West Nile Virus Infection The QIAGEN Bioinformatics Solution: Biomedical Genomics Workbench (BXWB) + Ingenuity Pathway Analysis (IPA) Functional Genomics & Predictive Medicine, May 21-22,

More information

Managing Biinformatics Workflows in Cloud Computing

Managing Biinformatics Workflows in Cloud Computing J Grid Computing DOI 10.1007/s10723-013-9260-9 Managing and Optimizing Bioinformatics Workflows for Data Analysis in Clouds Vincent C. Emeakaroha Michael Maurer Patrick Stern Paweł P. Łabaj Ivona Brandic

More information