PAGANTEC: OPENMP PARALLEL ERROR CORRECTION FOR NEXT-GENERATION SEQUENCING DATA

Similar documents

Shouguo Gao Ph. D Department of Physics and Comprehensive Diabetes Center

Using the Intel Inspector XE

The NGS IT notes. George Magklaras PhD RHCE

Computational Genomics. Next generation sequencing (NGS)

Introduction to next-generation sequencing data

How Sequencing Experiments Fail

Tutorial for Windows and Macintosh. Preparing Your Data for NGS Alignment

Analysis of ChIP-seq data in Galaxy

Visualization of Phylogenetic Trees and Metadata

Spring 2011 Prof. Hyesoon Kim

Hadoopizer : a cloud environment for bioinformatics data analysis

Introduction to NGS data analysis

Basic processing of next-generation sequencing (NGS) data

Data Processing of Nextera Mate Pair Reads on Illumina Sequencing Platforms

DNA Mapping/Alignment. Team: I Thought You GNU? Lars Olsen, Venkata Aditya Kovuri, Nick Merowsky

Big Data Challenges. technology basics for data scientists. Spring Jordi Torres, UPC - BSC

MiSeq: Imaging and Base Calling

When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want

Bioinformatics Grid - Enabled Tools For Biologists.

July 7th 2009 DNA sequencing

SGI. High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems. January, Abstract. Haruna Cofer*, PhD

MPI and Hybrid Programming Models. William Gropp

NGS data analysis. Bernardo J. Clavijo

Master's projects at ITMO University. Daniil Chivilikhin PhD ITMO University

Delivering the power of the world s most successful genomics platform

Next generation sequencing (NGS)

Frequently Asked Questions Next Generation Sequencing

Next Generation Sequencing: Technology, Mapping, and Analysis

FlipFlop: Fast Lasso-based Isoform Prediction as a Flow Problem

De Novo Assembly Using Illumina Reads

Reading DNA Sequences:

NGS Technologies for Genomics and Transcriptomics

Big Data & Scripting Part II Streaming Algorithms

Prepare the environment Practical Part 1.1

Biological Sequence Data Formats

Next Generation Sequencing

CHALLENGES IN NEXT-GENERATION SEQUENCING

Integrated Rule-based Data Management System for Genome Sequencing Data

CD-HIT User s Guide. Last updated: April 5,

Analysis of NGS Data

Managing and Conducting Biomedical Research on the Cloud Prasad Patil

Cancer Genomics: What Does It Mean for You?

Module 10: Bioinformatics

Next generation DNA sequencing technologies. theory & prac-ce

Non-invasive prenatal detection of chromosome aneuploidies using next generation sequencing: First steps towards clinical application

Big Data Visualization on the MIC

Automated and Scalable Data Management System for Genome Sequencing Data

Next Generation Sequencing: Adjusting to Big Data. Daniel Nicorici, Dr.Tech. Statistikot Suomen Lääketeollisuudessa

The International Journal Of Science & Technoledge (ISSN X)

Deep Sequencing Data Analysis

GC3 Use cases for the Cloud

G E N OM I C S S E RV I C ES

Genetic Analysis. Phenotype analysis: biological-biochemical analysis. Genotype analysis: molecular and physical analysis

Eoulsan Analyse du séquençage à haut débit dans le cloud et sur la grille

Focusing on results not data comprehensive data analysis for targeted next generation sequencing

Parallel Compression and Decompression of DNA Sequence Reads in FASTQ Format

Original-page small file oriented EXT3 file storage system

DATA MINING TOOL FOR INTEGRATED COMPLAINT MANAGEMENT SYSTEM WEKA 3.6.7

Efficient Parallel Processing on Public Cloud Servers Using Load Balancing

Storage Solutions for Bioinformatics

How To Cluster Of Complex Systems

Next Generation Sequencing; Technologies, applications and data analysis

Web-Based Genomic Information Integration with Gene Ontology

Performance Characteristics of Large SMP Machines

BIOL 3200 Spring 2015 DNA Subway and RNA-Seq Data Analysis

Computational infrastructure for NGS data analysis. José Carbonell Caballero Pablo Escobar

Windows Server Performance Monitoring

NEXT GENERATION SEQUENCING

Genomic Applications on Cray supercomputers: Next Generation Sequencing Workflow. Barry Bolding. Cray Inc Seattle, WA

WinBioinfTools: Bioinformatics Tools for Windows Cluster. Done By: Hisham Adel Mohamed

CRAC: An integrated approach to analyse RNA-seq reads Additional File 3 Results on simulated RNA-seq data.

Estimate Performance and Capacity Requirements for Workflow in SharePoint Server 2010

<Insert Picture Here> An Experimental Model to Analyze OpenMP Applications for System Utilization

Bioruptor NGS: Unbiased DNA shearing for Next-Generation Sequencing

Distributed Image Processing using Hadoop MapReduce framework. Binoy A Fernandez ( ) Sameer Kumar ( )

Just the Facts: A Basic Introduction to the Science Underlying NCBI Resources

Local Alignment Tool Based on Hadoop Framework and GPU Architecture

FQbin: a compatible and optimized format for storing and managing sequence data

Searching Nucleotide Databases

OpenMP Programming on ScaleMP

Installation Guide for Windows

RNA Structure and folding

Hybrid and Custom Data Structures: Evolution of the Data Structures Course

Automated Library Preparation for Next-Generation Sequencing

Scaling up to Production

Parallel Algorithm Engineering

Efficient Parallel Execution of Sequence Similarity Analysis Via Dynamic Load Balancing

8/7/2012. Experimental Design & Intro to NGS Data Analysis. Examples. Agenda. Shoe Example. Breast Cancer Example. Rat Example (Experimental Design)

Intro to Bioinformatics

Data-Flow Awareness in Parallel Data Processing

PARALLELIZED SUDOKU SOLVING ALGORITHM USING OpenMP

Illumina Sequencing Technology

LabGenius. Technical design notes. The world s most advanced synthetic DNA libraries. hi@labgeni.us V1.5 NOV 15

New solutions for Big Data Analysis and Visualization

Medical Image Processing on the GPU. Past, Present and Future. Anders Eklund, PhD Virginia Tech Carilion Research Institute

Transcription:

PAGANTEC: OPENMP PARALLEL ERROR CORRECTION FOR NEXT-GENERATION SEQUENCING DATA Markus Joppich, Tony Bolger, Dirk Schmidl Björn Usadel and Torsten Kuhlen Markus Joppich Lehr- und Forschungseinheit Bioinformatik Institut für Informatik LMU München

PAGANtec PAGANtec Probabilistic Algorithm for Genome Assembly of Next-Generation Sequencing Data transcriptome error correction 1. Biological Background 2. PAGANtec 3. OpenMP Parallelism in PAGANtec 4. Conclusion Markus Joppich, LFE Bioinformatik, Institut für Informatik, LMU München 2

Biology / Sequencing DNA consists of words w = Σ = A, T, G, C from a 4 letter alphabet bases or base-pairs (bp) Sequencing tries to translate from molecule to textual representation Many technologies Sanger (<1000bp) NGS ( ~100 300 bp) Markus Joppich, LFE Bioinformatik, Institut für Informatik, LMU München 3

Raw Read Data >SequenceID1 ATGACAACAGAGATATCAGA @ 1231241212312AABASDA >SequenceID2 ATGACAACATAGATATCAGA @ 1231241212312AABASDA Fragments/Sequences are called reads Sequencing is error-prone Up to 10% for early Illumina, 30% Nanopore Huge amount of data: several 100GB Not easy to spot mistakes by eye Markus Joppich, LFE Bioinformatik, Institut für Informatik, LMU München 4

Assembly Assembly: put reads together to get original sequence Assembly Problem: Maximize the likelihood that the assembly is the source of the set of reads Gene/Transcript Reads Aligned Reads Genome Several Methods: Overlap / Layout / Consenus or De Bruijn Graph Problem complexity ~ Graph complexity Markus Joppich, LFE Bioinformatik, Institut für Informatik, LMU München 5

DBG: Graph Structure 1. Take every read 2. Split read into k long subwords k-mers s-mers 3. Connect unique k-mers if consecutive in read Markus Joppich, LFE Bioinformatik, Institut für Informatik, LMU München 6

Error Correction Several Structures can be observed in k-mer graph: Tangles, Spurs and Bubbles due to errors in reads, make assembly harder Essentially error correction removes such artifacts Markus Joppich, LFE Bioinformatik, Institut für Informatik, LMU München 7

Error Correction There are several error correctors available Mostly for genomes Different characteristics SEECER error corrector for transcriptomes [Le 2013] Results not satisfying enough Markus Joppich, LFE Bioinformatik, Institut für Informatik, LMU München 8

PAGAN Graph Structure Allows lossless storage of reads in a k-mer-graph Single reads can be extracted Markus Joppich, LFE Bioinformatik, Institut für Informatik, LMU München 10

PAGANtec: algorithmic Correction strategey depends on the location of error Error at an end Or forming a bubble or long spur Markus Joppich, LFE Bioinformatik, Institut für Informatik, LMU München 11

PAGANtec: algorithmic Short tip correction Detect erroneous sequence Replace by correct sequence Attention: tails can be used by other reads! Markus Joppich, LFE Bioinformatik, Institut für Informatik, LMU München 12

PAGANtec: algorithmic Long tip/bubble correction Detect erroneous sequence Store correction and corrected sequence Not depending on other reads! Markus Joppich, LFE Bioinformatik, Institut für Informatik, LMU München 13

PAGANtec: results SEECER PAGANtec Good correction in general and in detail! No structures forming However slower. Easy parallelization Markus Joppich, LFE Bioinformatik, Institut für Informatik, LMU München 14

PAGANtec: filter Each algorithm for Short tips Long tips Bubbles Essentially same steps Detect Errors Find correction Store/Apply correction Markus Joppich, LFE Bioinformatik, Institut für Informatik, LMU München 15

PAGANtec: parallelism Apply #pragma omp parallel for ~10 0 10 1 ~ < 10 1 > 10 6 Width of smer FilterTips: race conditions / data race! Correction Storage: Threadsafe Hash-Map Markus Joppich, LFE Bioinformatik, Institut für Informatik, LMU München 16

PAGANtec: load imbalance Several stages identifiable Problems detected: Tips have a low CPU usage! One thread takes essentially longer than others! Markus Joppich, LFE Bioinformatik, Institut für Informatik, LMU München 17

PAGANtec: load imbalance Markus Joppich, LFE Bioinformatik, Institut für Informatik, LMU München 18

PAGANtec: characteristics Low CPU usage? Locking! Load Imbalance? Very coarse grained! First reason: omp parallel for Markus Joppich, LFE Bioinformatik, Institut für Informatik, LMU München 19

PAGANtec: characteristics Second reason: busy k-mers Low average route width for whole dataset O 10 1 Accumulated Width explains behaviour Very few s-mers account major part of width! Markus Joppich, LFE Bioinformatik, Institut für Informatik, LMU München 20

Lessons Learned OpenMP parallel task construct Create batches of s-mers with 2000 route entries Create extra tasks for routes with > 500 entries Markus Joppich, LFE Bioinformatik, Institut für Informatik, LMU München 21

PAGANtec: parallelism Make use of parallel tasks Markus Joppich, LFE Bioinformatik, Institut für Informatik, LMU München 22

PAGANtec: revised Why is there still a load imbalance? How many s-mers are there per task? How many route entries (width) are per task? Red Ticks: > 1 inner task Markus Joppich, LFE Bioinformatik, Institut für Informatik, LMU München 23

Conclusion Convoluted graph structure Not easily splittable into same chunks Locking poses a problem #pragma omp task greatly improves performance Not possible with #pragma omp parallel for Manual creation of chunks Still difficulties from coarse grained structure Prioritize chunk to avoid imbalance Instead of leaving it at end of queue Markus Joppich, LFE Bioinformatik, Institut für Informatik, LMU München 24

Thank you for your attention! Dirk Schmidl & Torsten Kuhlen Tony Bolger & Björn Usadel Markus Joppich, LFE Bioinformatik, Institut für Informatik, LMU München 25

Literature [Le 2013] Le, H.S., Schulz, M.H., McCauley, B.M., Hinman, V.F., Bar-Joseph, Z.: Probabilistic error correction for RNA sequencing. Nucleic acids research 41(10), e109 (May 2013) Markus Joppich, LFE Bioinformatik, Institut für Informatik, LMU München 26

Chunk Creation Manual effort to create chunks Pattern observed frequently with tasks! Often leads to some (minor) code duplication Integrate into OpenMP-clause? Markus Joppich, LFE Bioinformatik, Institut für Informatik, LMU München 27

Frequency Flows per s-mer G = (V, E) V = n n k mers E = i j suff j, k 1 == pref i, k 1 Markus Joppich, LFE Bioinformatik, Institut für Informatik, LMU München 28