Globus Genomics Tutorial GlobusWorld 2014



Similar documents
Practical Solutions for Big Data Analytics

Large-scale Research Data Management and Analysis Using Globus Services. Ravi Madduri Argonne National Lab University of

Delivering a Campus Research Data Service with Globus. GlobusWorld 2014 Keynote

Enhanced Research Data Management and Publication with Globus

ENABLING DATA TRANSFER MANAGEMENT AND SHARING IN THE ERA OF GENOMIC MEDICINE. October 2013

Globus Research Data Management: Introduction and Service Overview

Delivering the power of the world s most successful genomics platform

DTI Image Processing Pipeline and Cloud Computing Environment

GeneProf and the new GeneProf Web Services

Globus Research Data Management: Introduction and Service Overview. Steve Tuecke Vas Vasiliadis

Workflow Tools at NERSC. Debbie Bard NERSC Data and Analytics Services

Cloud-Based Big Data Analytics in Bioinformatics

The Galaxy workflow. George Magklaras PhD RHCE

Introduction to Arvados. A Curoverse White Paper

Scalable Cloud Computing Solutions for Next Generation Sequencing Data

Building Bioinformatics Capacity in Africa. Nicky Mulder CBIO Group, UCT

Cloud Computing Solutions for Genomics Across Geographic, Institutional and Economic Barriers

Analysis of ChIP-seq data in Galaxy

SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop

NIH Commons Overview, Framework & Pilots - Version 1. The NIH Commons

High Performance Compu2ng Facility

Data Analysis & Management of High-throughput Sequencing Data. Quoclinh Nguyen Research Informatics Genomics Core / Medical Research Institute

Introduction to NGS data analysis

San Diego Supercomputer Center, UCSD. Institute for Digital Research and Education, UCLA

CSE-E5430 Scalable Cloud Computing. Lecture 4

Development of Bio-Cloud Service for Genomic Analysis Based on Virtual

Leading Genomics. Diagnostic. Discove. Collab. harma. Shanghai Cambridge, MA Reykjavik

Hadoopizer : a cloud environment for bioinformatics data analysis

CLOSHA MANUAL ver1.1. KOBIC (Korean Bioinformation Center) Bioinformatics Workflow management System in Bio-Express

globus online Cloud-based services for (reproducible) science Ian Foster Computation Institute University of Chicago and Argonne National Laboratory

Cloud Computing for Scientific Research

BioHPC Web Computing Resources at CBSU

Eoulsan Analyse du séquençage à haut débit dans le cloud et sur la grille

Integrated Rule-based Data Management System for Genome Sequencing Data

Running and Enhancing your own Galaxy. Daniel Blankenberg The Galaxy Team

Writing & Running Pipelines on the Open Grid Engine using QMake. Wibowo Arindrarto DTLS Focus Meeting

Managing and Conducting Biomedical Research on the Cloud Prasad Patil

#jenkinsconf. Jenkins as a Scientific Data and Image Processing Platform. Jenkins User Conference Boston #jenkinsconf

New solutions for Big Data Analysis and Visualization

The Globus Galaxies Platform: Delivering Science Gateways as a Service

Cloud BioLinux: Pre-configured and On-demand Bioinformatics Computing for the Genomics Community

A Service for Data-Intensive Computations on Virtual Clusters

Automated Data Ingestion. Bernhard Disselhoff Enterprise Sales Engineer

High Throughput Sequencing Data Analysis using Cloud Computing

Alternative Deployment Models for Cloud Computing in HPC Applications. Society of HPC Professionals November 9, 2011 Steve Hebert, Nimbix

Data Governance in the Hadoop Data Lake. Michael Lang May 2015

GenomeSpace Architecture

How To Write A Blog Post On Globus

Genomic Applications on Cray supercomputers: Next Generation Sequencing Workflow. Barry Bolding. Cray Inc Seattle, WA

Cloud-based Analytics and Map Reduce

About the Princess Margaret Computational Biology Resource Centre (PMCBRC) cluster

SAP Data Services 4.X. An Enterprise Information management Solution

A Brief Introduction to Apache Tez

Scaling up to Production

End-to-end Big Data Discovery with Platfora Date: May 2015 Author: Mike Leone, ESG Lab Analyst

Early Cloud Experiences with the Kepler Scientific Workflow System

LifeScope Genomic Analysis Software 2.5

Consumption of OData Services of Open Items Analytics Dashboard using SAP Predictive Analysis

Real Time Big Data Processing

-> Integration of MAPHiTS in Galaxy

Actian Vortex Express 3.0

Working up and building the foundation for Data and Software Carpentry within ELIXIR

Benchmark Report: Univa Grid Engine, Nextflow, and Docker for running Genomic Analysis Workflows

Hadoop-BAM and SeqPig

An Oracle White Paper November Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics

Hybrid Development and Test USE CASE

Federal Open Data Compliance

Automate Your BI Administration to Save Millions with Command Manager and System Manager

STREAM ANALYTIX. Industry s only Multi-Engine Streaming Analytics Platform

Hunk & Elas=c MapReduce: Big Data Analy=cs on AWS

Hortonworks & SAS. Analytics everywhere. Page 1. Hortonworks Inc All Rights Reserved

Pipeline Pilot Enterprise Server. Flexible Integration of Disparate Data and Applications. Capture and Deployment of Best Practices

UCLA Team Sequences Cell Line, Puts Open Source Software Framework into Production

Migration Scenario: Migrating Batch Processes to the AWS Cloud

Tableau Online. Understanding Data Updates

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

Richmond, VA. Richmond, VA. 2 Department of Microbiology and Immunology, Virginia Commonwealth University,

The Evolution of Media Workflows

Using the Grid for the interactive workflow management in biomedicine. Andrea Schenone BIOLAB DIST University of Genova

Real-Time Analytics on Large Datasets: Predictive Models for Online Targeted Advertising

#mstrworld. No Data Left behind: 20+ new data sources with new data preparation in MicroStrategy 10

Processing NGS Data with Hadoop-BAM and SeqPig

G E N OM I C S S E RV I C ES

DataSafe Solutions. Protect your valuable genomic data

IndustrialIT System 800xA Engineering

Cloud BioLinux: Pre-configured and On-demand Bioinformatics Computing for the Genomics Community

Placing Your Applications in the Best Cloud Model

ORACLE BUSINESS INTELLIGENCE WORKSHOP

Software Life-Cycle Management

IBM Platform Computing : infrastructure management for HPC solutions on OpenPOWER Jing Li, Software Development Manager IBM

European Genome-phenome Archive database of human data consented for use in biomedical research at the European Bioinformatics Institute

DITA Adoption Process: Roles, Responsibilities, and Skills

Integrating computational data analysis capabilities into analytics applications

How To Extend An Enterprise Bio Solution

1 What is Cloud Computing? Cloud Infrastructures OpenStack Amazon EC CAMF Cloud Application Management

GEOFLUENT TRANSLATION MANAGEMENT SYSTEM

WORKFLOW ENGINE FOR CLOUDS

How To Create A Grid On A Microsoft Web Server On A Pc Or Macode (For Free) On A Macode Or Ipad (For A Limited Time) On An Ipad Or Ipa (For Cheap) On Pc Or Micro

EMA Radar for Workload Automation (WLA): Q2 2012

Enabling multi-cloud resources at CERN within the Helix Nebula project. D. Giordano (CERN IT-SDC) HEPiX Spring 2014 Workshop 23 May 2014

Transcription:

Globus Genomics Tutorial GlobusWorld 2014

Agenda Overview of Globus Genomics Example Collaborations Demonstration Globus Genomics interface Globus Online integration Scenario 1: Using Globus Genomics for Bioinformatics Core Scenario 2: Using Globus Genomics for Individual Research labs Hands-On Experience

What Is Globus Genomics? Flexible, powerful SaaS-based genomics analysis platform Workflows can be easily defined and automated with integrated Galaxy capabilities Data movement is streamlined with integrated Globus file-transfer functionality Resources can be provisioned ondemand with Amazon Web Services cloud based infrastructure

Challenges in Sequencing Analysis Data Movement and Access Challenges Sequencing Centers Public Data Research Lab Storage Manually move the data to the Compute node Install all the tools required for the Analysis BWA, Picard, GATK, Filtering Scripts, etc. Shell scripts to sequen?ally execute the tools Manually modify the scripts for any change Error Prone, difficult to keep track, messy.. Difficult to maintain and transfer the knowledge Fastq Ref Genome Seq Center Local Cluster/ Cloud Modify Picard Data is distributed in different loca?ons Research labs need access to the data for analysis Be able to Share data with other researchers/collaborators Inefficient ways of data movement Data needs to be available on the local and Distributed Compute Resources Local Clusters, Cloud, Grid Once we have the Sequence Data Install Alignment GATK (Re)Run Script Variant Calling How do we analyze this Sequence Data Manual Data Analysis

Globus Genomics Globus Genomics Galaxy Based Workflow Management System Sequencing Centers Seq Center Public Data Globus Provides a High- performance Fault- tolerant Research Lab Secure file transfer Service between all data- endpoints Storage Local Cluster/ Cloud Galaxy Data Libraries Fastq Picard Alignment GATK Globus Integrated within Ref Genome Galaxy Web- based UI Drag- Drop workflow crea?ons Easily modify Workflows with new tools Analy?cal tools are automa?cally run on the scalable compute resources when possible Data Management Variant Calling Globus Genomics on Amazon EC2 Data Analysis

Globus integrated with Galaxy A flexible, scalable, simplified analysis platform Accessibility Unified Web- interface for obtaining genomic data and applying computa?onal tools to analyze the data Easily integrate your own tools and scripts for analysis Collec?on of tools (Tools Panel) that reflect good prac?ces and community insights Access every step of analysis and intermediate results: View, Download, Visualize, Reuse (History Panel) Data and Tools Reproducibility Track provenance and ensure repeatability of each analysis step: input datasets, tools used, parameter values, and output datasets Intui?ve Workflow Editor to create or modify complex workflows and use them as templates Reusable and Reproducible Templates Transparency Publish and share metadata, histories, and workflows at mul?ple levels Store public and generated datasets as Data Libraries e.g: hg19 Ref Genome Shared datasets and workflows can be imported by other users for reuse Globus Integra?on Publish Access Globus Endpoints and transfer data from within Galaxy UI and into Galaxy workspace Leverage local cluster or cloud based scalable computa?onal resources for parallelizing the tools

Additional Capabilities Professionally managed and supported platform Best practice pipelines Whole Genome, Exome, RNA-Seq, ChIP-Seq, Enhanced workbench with breadth of analytic tools Technical support and bioinformatics consulting Access to pre-integrated end-points for reliable and highperformance data transfer (e.g. Broad Institute, Perkin Elmer, university sequencing centers, etc.)

Example Collaborations Dobyns Lab Backround: Investigate the nature and causes of a wide range of human developmental brain disorders Approach: Replaced manual analysis with Globus Genomics Results: Achieved greater than 20X speed-up in analysis of exome data Future Plans: Leverage scale-out capability of Globus Genomics on 150 exome data set and seek to achieve 50X speed-up in analysis

Example Collaborations Georgetown Medical Center Backround: Innovation Center for Biomedical Informatics is an academic hub for innovative research in the field of biomedical informatics. Approach: Augment current team and tools with a NGS analysis platform to support standard and best-practice pipelines while leveraging elastic cloud-based resources. Results: Pilot effort is complete significantly improved performance results on whole genome, exome and RNA-Seq pipelines utilizing Globus Genomics Future Plans: Provide Globus Genomics as a well-managed platform-as-aservice for ICBI collaborators and users

Diversity of Collaborations Dobyns Lab Seattle Children s Hospital Cox Lab University of Chicago ICBI / Georgetown University Kansas University Medical Center Volchenboum Lab University of Chicago Olopade Lab University of Chicago Inova Translational Medicine Institute Becton Dickinson Perkin Elmer Nagarajan Lab Washington University St. Louis Genome Sciences Institute Boston University Cedars-Sinai Medical Center Los Angeles University of California Irvine University of California San Francisco University of Pittsburgh Medical Center Poroyko Lab University of Chicago The Ohio State University Wexner Medical Center Broad Institute Many others

Globus Genomics Platform Overview DEMO Overview of the Globus Genomics interface Interface (Tools, Histories) Sharing Histories and Workflows Globus Integration in Galaxy Globus interface Globus transfers within Galaxy View/Track Transfers

Scenario 1 Bioinformatics Core Workflows Use Case: Running workflows with all the tools and parameters predefined. Introduction to Exome seq pipeline Import the best practices workflow Scientific pipeline details Running a pre-defined exome seq pipeline with Globus transfers with one Sample Submit a workflow Batch Submission with multiple-samples

Globus Genomics Demonstra?on Scenario 1

Scenario 2 Individual Researchers Use Case: Running individual tools and creating/modifying workflows and the parameters Running individual tools E.g: FastQC and Flagstat Importing a workflow Modifying the tools in the workflow E.g: Change the aligner, Add/Remove Data transfer Modify the parameters of the tools

Globus Genomics Demonstra?on Scenario 2

Questions?

Hands-On Exercise 1. Register with www.globus.org 2. Join the Globus Genomics Workshop group at https://www.globus.org/groups 3. Login to http://demo.globusgenomics.org 4. Browse and Get Data from SequencingCenter endpoint Endpoint Name: sulakhe#sequencingcenter Username/Passwd: genomics/globus Input files: Exome-Sample_Forward_1.fastq.gz Exome-Sample_Reverse_2.fastq.gz 5. Change datatype of the input files to fastqsanger (click on the pencil sign) 6. Import a workflow from Shared Data Name: ExomeSeq-Analysis-no-transfer_short_version 7. Run the workflow