Genomic Applications on Cray supercomputers: Next Generation Sequencing Workflow. Barry Bolding. Cray Inc Seattle, WA

Similar documents

Shouguo Gao Ph. D Department of Physics and Comprehensive Diabetes Center

Data management challenges in todays Healthcare and Life Sciences ecosystems

Alternative Deployment Models for Cloud Computing in HPC Applications. Society of HPC Professionals November 9, 2011 Steve Hebert, Nimbix

Computational infrastructure for NGS data analysis. José Carbonell Caballero Pablo Escobar

Introduction to NGS data analysis

Practical Solutions for Big Data Analytics

New solutions for Big Data Analysis and Visualization

CHALLENGES IN NEXT-GENERATION SEQUENCING

Delivering the power of the world s most successful genomics platform

OpenCB a next generation big data analytics and visualisation platform for the Omics revolution

HPC and Big Data. EPCC The University of Edinburgh. Adrian Jackson Technical Architect

Overview of Next Generation Sequencing platform technologies

Scalable Cloud Computing Solutions for Next Generation Sequencing Data

Personalized Medicine and IT

ENABLING DATA TRANSFER MANAGEMENT AND SHARING IN THE ERA OF GENOMIC MEDICINE. October 2013

Integrated Rule-based Data Management System for Genome Sequencing Data

Deep Sequencing Data Analysis

Automated and Scalable Data Management System for Genome Sequencing Data

Globus Genomics Tutorial GlobusWorld 2014

Large-scale Research Data Management and Analysis Using Globus Services. Ravi Madduri Argonne National Lab University of

Go where the biology takes you. Genome Analyzer IIx Genome Analyzer IIe

School of Nursing. Presented by Yvette Conley, PhD

Eoulsan Analyse du séquençage à haut débit dans le cloud et sur la grille

Data Analysis & Management of High-throughput Sequencing Data. Quoclinh Nguyen Research Informatics Genomics Core / Medical Research Institute

Preparing the scenario for the use of patient s genome sequences in clinic. Joaquín Dopazo

The NGS IT notes. George Magklaras PhD RHCE

G E N OM I C S S E RV I C ES

Cray s Storage History and Outlook Lustre+ Jason Goodman, Cray LUG Denver

Analysis of NGS Data

Twister4Azure: Data Analytics in the Cloud

NGS data analysis. Bernardo J. Clavijo

Big Data Challenges. technology basics for data scientists. Spring Jordi Torres, UPC - BSC

icer Bioinformatics Support Fall 2011

An example of bioinformatics application on plant breeding projects in Rijk Zwaan

Leading Genomics. Diagnostic. Discove. Collab. harma. Shanghai Cambridge, MA Reykjavik

NECC History. Karl V. Steiner 2011 Annual NECC Meeting, Orono, Maine March 15, 2011

Copy Number Variation: available tools

Building Bioinformatics Capacity in Africa. Nicky Mulder CBIO Group, UCT

Core Facility Genomics

Data Storage At the Heart of any Information System. Ken Claffey, VP/GM - June 2015

Evoluzione dell Infrastruttura di Calcolo e Data Analytics per la ricerca

Performance, Reliability, and Operational Issues for High Performance NAS Storage on Cray Platforms. Cray User Group Meeting June 2007

Clusters: Mainstream Technology for CAE

Lustre failover experience

Hadoopizer : a cloud environment for bioinformatics data analysis

Analysis of ChIP-seq data in Galaxy

Development of Bio-Cloud Service for Genomic Analysis Based on Virtual

Open source analytics for Big Data in Big Pharma

A Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures

GC3 Use cases for the Cloud

SEQUENCING. From Sample to Sequence-Ready

Big data in cancer research : DNA sequencing and personalised medicine

GenomeStudio Data Analysis Software

Bioruptor NGS: Unbiased DNA shearing for Next-Generation Sequencing

GenomeStudio Data Analysis Software

Bioinformatics Unit Department of Biological Services. Get to know us

White Paper The Numascale Solution: Extreme BIG DATA Computing

Next generation sequencing (NGS)

HPC Cloud. Focus on your research. Floris Sluiter Project leader SARA

Installation Guide for Windows

numascale White Paper The Numascale Solution: Extreme BIG DATA Computing Hardware Accellerated Data Intensive Computing By: Einar Rustad ABSTRACT

BIOINF 525 Winter 2016 Foundations of Bioinformatics and Systems Biology

MiSeq: Imaging and Base Calling

IT of SPIM Data Storage and Compression. EMBO Course - August 27th! Jeff Oegema, Peter Steinbach, Oscar Gonzalez

How To Build A Supermicro Computer With A 32 Core Power Core (Powerpc) And A 32-Core (Powerpc) (Powerpowerpter) (I386) (Amd) (Microcore) (Supermicro) (

High Performance Computing in CST STUDIO SUITE

High-Performance Computing and Big Data Challenge

A leader in the development and application of information technology to prevent and treat disease.

SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop

Big Data Challenges in Bioinformatics

Cluster Scalability of ANSYS FLUENT 12 for a Large Aerodynamics Case on the Darwin Supercomputer

The University is comprised of seven colleges and offers 19. including more than 5000 graduate students.

Big Data and Graph Analytics in a Health Care Setting

Intro to Bioinformatics

Accelerate genomic breakthroughs in microbiology. Gain deeper insights with powerful bioinformatic tools.

Nazneen Aziz, PhD. Director, Molecular Medicine Transformation Program Office

-> Integration of MAPHiTS in Galaxy

Building Clusters for Gromacs and other HPC applications

Overview sequence projects

PARALLEL & CLUSTER COMPUTING CS 6260 PROFESSOR: ELISE DE DONCKER BY: LINA HUSSEIN

Getting Started with HPC

Managing and Conducting Biomedical Research on the Cloud Prasad Patil

Analysis and Integration of Big Data from Next-Generation Genomics, Epigenomics, and Transcriptomics

Agenda. HPC Software Stack. HPC Post-Processing Visualization. Case Study National Scientific Center. European HPC Benchmark Center Montpellier PSSC

Seeking Opportunities for Hardware Acceleration in Big Data Analytics

Introduction to High Performance Cluster Computing. Cluster Training for UCL Part 1

Focusing on results not data comprehensive data analysis for targeted next generation sequencing

Bioinformatics Grid - Enabled Tools For Biologists.

DNA Sequencing and Personalised Medicine

De Novo Assembly Using Illumina Reads

Data Processing of Nextera Mate Pair Reads on Illumina Sequencing Platforms

University of Glasgow - Programme Structure Summary C1G MSc Bioinformatics, Polyomics and Systems Biology

Transcription:

Genomic Applications on Cray supercomputers: Next Generation Sequencing Workflow Barry Bolding Cray Inc Seattle, WA 1

CUG 2013 Paper Genomic Applications on Cray supercomputers: Next Generation Sequencing Workflow Mikhail Kandel Department of Computer Science University of Illinois Urbana-Champagne, IL USA kandel3@illinois.edu Steve Behling, Nathan Schumann and Bill Long Cray Inc Saint Paul, MN USA sbehling@cray.com, nds@cray.com, longb@cray.com Carlos P. Sosa Cray Inc and University of Minnesota Rochester Saint Paul, MN USA cpsosa@cray.com Sébastien Boisvert and Jacques Corbeil Département de Médecine Moléculaire, Université Laval, Québec, Canada sebastien.boisvert.3@ulaval.ca, jacques.corbeil@crchul.ulaval.ca Lorenzo Pesce Computation Institute, University of Chicago, Chicago, IL USA lpesce@uchicago.edu 5/8/2013 CUG 2013 2

Agenda Big Data Problem in Genomics Next-Generation Sequencing Work Flow Genomics Applications: Ray Summary 5/8/2013 CUG 2013 3

A New Dimension for HPC in Life Sciences Personalized Therapeutics Selected Milestones in HPC In 2012 U of Illinois and NCSA completed a 100 Million atoms Chromatophore NAMD Simulation on a Cray named Blue Waters In 1998 Prof Kollman study folding of Villin Headpiece Sub-Domain Observed for a Full Microsecond on a Cray T3E Omics Era Structural Biology Era Data Analytics Personalized Medicine Era Big Data Era Simulations Selected Milestones in Life Sciences 4

The Era of Big Data 5/8/2013 CUG 2013 5

Next Generation Sequencing (NGS) Cost per sequenced human genome 5/8/2013 Source: genome.gov/sequencingcosts/ CUG 2013 6

Big Data Generation Evolution and Explosion MR Stratton et al. Nature 458, 719-724 (2009) doi:10.1038/nature07943 7

Current Processing and Analytics Tools Cannot Cope with the Amount of Data Generated in Genomics Think about this: you need to collect the 1.5 GB for each person and likely extract out the genetic markers. Then you need to analyze the cancer and the treatment data. According to The American Cancer Society, 12,549,000 people in the U.S. have cancer. So at 1.5 GB per person, that comes out to about 18.8 PB of data and this does not include the genetics of the cancer. Henry Newman - Cancer, Big Data and Storage 5/8/2013 CUG 2013 8

Data Explosion in the Quest for Personalized Medicine Ericka C. Hayden, NATURE, VOL 482, 16 FEBRUARY 2012 9

What is Sequencing? Analysis Lab Best Practices for Variant Calling with the GATK, Broad Institute, MIT, Cambridge, MA Image source: http://www.broadinstitute.org/gatk//events/2038/gatkwh0-bp-0a-intro_to_ngs.pdf 10

WorkFlow Based on Illumina Sequencer Next Generation Sequencers TeraBytesof images (tiff) PC with large file systems Storage Transferred Storage Approx. 1 TeraByte Formatted into BCl file Translate to fastqfiles Approx. 100 GB Selected Applications Abyss AFABC ALLPATH AMOS Amplicon Noise BamTools BEAST BEDTools BEERS Bioconductor BLAST+ Blat BlueGnome Bowtie BreakDancer BWA CAP3 CASAVA ClustalW/X Cufflinks Decypher EMBOSS FastQC GATK GMAP Hmmer Illumina GenomeStudio MIRA https://www.msi.umn.edu/sw/cat/genetics 11

Detailed Sample Workflow: Illumina Sequencer Analysis HiSeq 2500 http://www.illumina.com Whole Genome Resequencing Targeted Resequencing DeNovo Assembly Chromatin Immunoprescription Sequencing Methylation Analysis Whole Transcriptome Analysis Small RNA Analysis Gene Expression RNA Structure Metagenomics Exomics Data Analysis 12

Workflow Based on IT Resources Required Sequencer Datacenter Desktop 13

Cray Genomics Work Flow Diagram Next Generation Sequencers User Analysis Workstations Analysis Cluster (Cray System or other Linux Cluster) Image Filter External NFS/CIFS Server CRAY Sonexion Data Storage System InfiniBand Ethernet External Archive External Data Mover 14

Analysis: Leveraging Cray Solutions Analysis Whole Genome Resequencing Targeted Resequencing De Novo Assembly Chromatin Immunoprescription Sequencing Methylation Analysis Whole Transcriptome Analysis Small RNA Analysis Gene Expression RNA Structure Metagenomics Exomics Cray started a collaborative effort with Université Laval: To achieve the goal of assembling a human genome in less than 1 hour Ray can achieve it in 10 hours and other assemblers in days This tool will help toward the goal of personalized medicine 5/8/2013 CUG 2013 15

Bioinformatics and Computational Biology Project Collaborating to enable rapid genomics analysis RayPlatform 5/8/2013 CUG 2013 16

Ray: Hybrid De Novo Assembler Accelerating Genomic Applications is a Collaborative effort The goal in De Novo assembly is to correctly assemble short reads into longer sequences Representing contiguous genomic regions Current Next Generation Sequencing technologies offer increase in throughput and decrease in cost and time Most software is available to assemble reads from specific NGS system Ray has been developed to assemble reads obtained from a combination of sequencing platforms S. Boisvert, F. Laviolette, and J. Corbeil, J. Comp. Biol. 17, 1519-1533(2010) 5/8/2013 CUG 2013 17

Ray Parallel Performance Human gut gene catalog Metagenomics 124 Individuals, 577 GB generated Beijing Genomic Institute Application Tuning Total Elapsed Time in Sec. 18000 16000 14000 12000 10000 8000 6000 4000 2000 Improving MPI scalability > 20% Improvement Improving MPI placement The Ray benchmarks shown in this study were run on a Cray XE6 system with AMD Opteron Interlagos IL-16 Clock frequency of the core of 2.1 GHz 0 128 256 512 1024 Number of Cores 18

Future Work Ray represents a major step forward in overcoming some of the major challenges facing genome assembling today This is particularly true for large datasets that otherwise are intractable To achieve the goal of sequencing even large data sets additional optimization work will be required We ll put particular attention to the bidirectional extension of seeds Continue to explore the role of Lustre, large scale data storage and storage optimization in detail with the parallel, high-throughput genomics workflow. Extend the scalability of Ray in collaboration with ORNL on Titan 5/8/2013 CUG 2013 19

Summary NGS machines technology have provided critical tools for deciphering DNA The cost of one Mb of DNA sequence has gone down from about $5,000 in 2001 to approximately $0.78 in 2009 Assembler programs have been created to assist in the process of assembling genomic data Data coming from sequencers outpaces Moore s law, it is critical to develop tools and procedures that can accurately and efficiently keep pace with the data production Ray provides next generation of assemblers Ray represents a major step forward in overcoming some of the major challenges facing genome assembling today 20

Acknowledgements Biomedical Informatics and Computational Biology (BICB) Prof Claudia Neuhauser, UMR Michael Olesen, UMR Université Laval Prof Jacques Corbeil Sébastien Boisvert Cray Inc Per Nyberg, Segment Team Sonexion Team Cray Inc resources ( Cray XE6 ) 21

http://www.cray.com Thank You! 22